Anyone used fgets() to read from text files in UTF-8 format?

TeaRDoWN

Member #8,518

April 2007

To allow more language-specific characters, I converted my text files from ASCII to UTF-8, but now fgets() returns odd garbage characters at the beginning of the string:

Instead of: "Hello world." I get: "ï»¿Hello world."

Viewed in a text editor (Notepad, for example), the file looks exactly the same as the old ASCII version.

Are these three "garbage characters" at the beginning of the file something I need to step past before I reach the "real characters" that I want to read?

LennyLen

Member #5,313

December 2004

TeaRDoWN said:

Are these three "garbage characters" at the beginning of the file something I need to step past before I reach the "real characters" that I want to read?

I don't know the answer to that, but it doesn't matter. Don't use any of the stdio functions with Unicode. Use one of the Unicode-based functions instead.

Even if you ignore the first three bytes, you're bound to get more garbage the moment your program encounters any other character that fgets() can't handle.

Mika Halttunen

Member #760

November 2000

TeaRDoWN said:

Are these three "garbage characters" at the beginning of the file something I need to step past before I reach the "real characters" that I want to read?

Those "garbage characters" are actually the UTF-8 byte-order mark (BOM), which signals that the text is encoded in UTF-8. See UTF-8 BOM for more info. You can ignore them, but as LennyLen said, you shouldn't be using the stdio functions here.
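
For anyone who does want to ignore the mark, here is a minimal sketch of one way to do it after fgets() has read the first line. The names skip_utf8_bom and read_first_line are made up for this example, and it assumes the BOM can only appear at the very start of the file.

```c
#include <stdio.h>
#include <string.h>

/* Return a pointer just past a leading UTF-8 BOM (bytes EF BB BF),
   or the line itself if it does not start with one.
   Hypothetical helper; only meaningful for the first line of a file. */
char *skip_utf8_bom(char *line)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    /* strncmp stops at the line's terminating NUL, so short lines are safe. */
    if (strncmp(line, (const char *)bom, 3) == 0)
        return line + 3;
    return line;
}

/* Read the first line of an already opened file into buf and return a
   pointer to its text with any BOM skipped, or NULL on EOF or error. */
char *read_first_line(FILE *fp, char *buf, int size)
{
    if (fgets(buf, size, fp) == NULL)
        return NULL;
    return skip_utf8_bom(buf);
}
```

The returned pointer still points into buf, so buf has to stay alive for as long as the text is used.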

---------------------------------------------
.:MHGames | @mhgames_ :.

CGamesPlay

Member #2,559

July 2002

I'd like to point out that fgets() works fine with UTF-8; it's the function you're using to display the string that can't handle it. That Unicode byte-order mark is a non-printable character, but it does belong in the string.

--
Tomasu: Every time you read this: hugging!

Ryan Patterson - <http://cgamesplay.com/>

Peter Wang

Member #23

April 2000

Exactly. There's nothing wrong with using stdio functions for UTF-8 -- that's the point!

Matthew Leverton

Supreme Loser

January 1999

Those three junk bytes are a zero-width non-breaking space character (U+FEFF, encoded in UTF-8 as EF BB BF). It was probably inserted by some text editor that uses it for UTF-8 detection. You could safely remove it from the text file (but your editor may put it back in), but you shouldn't skip over it if it is present.

As the others are saying, the standard input functions work fine with UTF-8. You just need to use UTF-8 aware output functions to display the high characters (greater than 0x7f).
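
To illustrate what "UTF-8 aware" means in practice, here is a rough sketch that counts characters (code points) rather than bytes in a string read by fgets(). The function name utf8_count_codepoints is invented for this example, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string.
   Every byte of the form 10xxxxxx is a continuation byte that belongs to
   the preceding lead byte, so only the other bytes are counted.
   Assumes valid UTF-8; malformed input is not detected. */
size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For plain ASCII text this gives the same result as strlen(); the two only differ once bytes above 0x7F show up.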

--
RTFM | Follow Me on Google+ | I know 10 people

CGamesPlay

Member #2,559

July 2002

To clarify, UTF-16 and UTF-32 both use multibyte sequences for all characters. For the first 256 code points (U+0000 to U+00FF), a UTF-16 character ends up being stored as '\0', 'a' (where the character is 'a'; the byte order depends on endianness). Because of that, you can't use the standard string input functions such as fgets() to read UTF-16- or UTF-32-encoded data: C strings end at the first NUL byte, so the text appears cut off almost immediately.

UTF-8 was devised specifically to never have NUL bytes in the byte stream except to represent the NUL character itself. It uses a variable-length encoding that represents the ASCII characters as themselves ('a' is stored as 'a').
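
As a rough illustration of that variable-length encoding, here is a sketch of an encoder for a single code point. The name utf8_encode is made up for this example; it is not part of the C standard library.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable
   (a surrogate or a value above U+10FFFF). Hypothetical helper. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: stored as itself */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every byte of a multibyte sequence has its top bit set, so a zero byte can only come from encoding code point 0. Feeding in U+FEFF produces the three bytes EF BB BF discussed earlier in the thread.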
