Writing/Reading UTF-8 files in C++
AMCerasoli

Well, for my first project it's actually giving me way too much trouble.

Once upon a time, I was happily using the Allegro UTF-8 API, when I realized that I had to save/load my strings to/from files in UTF-8 encoding.

Soooo... I started reading and reading... and ended up understanding that fstream doesn't let me write UTF-8 strings... because current C++ thinks that the whole world speaks just English (which would be great (fuck off, languages)). I don't care if it's English, Chinese, or Taiwanese, I just want one language. I hate Unicode :'( :'(

OK, stop crying, concentrate...

So I typed "Writing/Reading UTF-8 files in C++" into Google and immediately started downloading ICU... I thought it was a virus, so I cancelled and did the search again, and exactly the same thing happened. I forgot that Google now has the "mind-reading script": since I have my webcam connected, Google read my mind and the ICU download started... You know...

Anyway... after the download I realized the files were meant to be used with MSVC, and since I'm a newbie I was wondering: is it preferable to start using MSVC and forget about C::B, or should I be a man and compile the ICU source code myself?

Has anyone done this before? Is it too difficult?

Oh, BTW, if someone is actually looking to write/read UTF-8 files in C++ with ICU, check this out.

EDIT: Wait a minute, can I use the MSVC compiler with C::B?

Matthew Leverton

You can write Allegro UTF-8 strings to file just fine.

file << al_cstr(my_allegro_utf8_string);

bamccaig

Soooo... I started reading and reading... and ended up understanding that fstream doesn't let me write UTF-8 strings... because current C++ thinks that the whole world speaks just English (which would be great (fuck off, languages)). I don't care if it's English, Chinese, or Taiwanese, I just want one language. I hate Unicode :'( :'(

If you're actually satisfied just using English (American English where accents are optional at best :P) in the game then just write using ASCII and ignore everything above decimal 127. Supporting Unicode is very good for many reasons, but it's also somewhat of an advanced subject. I've personally never written a Unicode-aware application and I have indeed tried to learn how on a few occasions. It's just really hard.

If Allegro somehow magically can just make it work then go ahead and use Allegro's APIs, but generally speaking you are faced with these problems: you need to know what character encoding your input is, you need to know what character encoding you use internally in your application, and you need to know what character encoding your output should be; and you need to be able to convert between all three (though I imagine the input and expected output will often be the same). The problem is that I don't think there's really a standard way to know these things. At least, nothing completely portable. I'm no expert though, so I could be wrong. You certainly wouldn't want to try to hard-code the detection or conversions, so you basically need a portable library that can do it... I'm not really sure what the state of affairs is on that front.

So I typed "Writing/Reading UTF-8 files in C++" into Google and immediately started downloading ICU... I thought it was a virus, so I cancelled and did the search again, and exactly the same thing happened. I forgot that Google now has the "mind-reading script": since I have my webcam connected, Google read my mind and the ICU download started... You know...

No, I don't know. :-/ If you were trying to be funny I'm afraid your English skills fail you. :P If you were serious, you might want to look into that... :-X That does indeed sound like your system has been compromised somehow.

Anyway... after the download I realized the files were meant to be used with MSVC, and since I'm a newbie I was wondering: is it preferable to start using MSVC and forget about C::B, or should I be a man and compile the ICU source code myself?

Has anyone done this before? Is it too difficult?

Whether or not it's worth trying to compile the source code yourself depends on the package and how portable it is. That is, whether or not it already supports MinGW or whatever toolchain you intend to use with Code::Blocks, and whether there are extra dependencies that you'd have to install first. In general, I would say that a beginner shouldn't attempt it because it will probably require knowledge and experience that you haven't acquired yet. Don't let that stop you; try if you feel like trying, but don't be surprised if you get stuck.

If you can get it working with MSVC then go ahead and use MSVC. If not, have you considered installing Linux? :-X OK, to be serious: if you've never used Linux before then it would take a lot of time to figure out. Then again, I think it takes a long time to really get a handle on the freak show that is Windows development, so it might not be that bad switching to Linux now after all. :P Without knowing much about ICU (I think this might actually be the first time I've heard that name), I can seemingly install it on Linux with a simple command:

su -c 'yum install icu'

That's the kind of thing that I'm referring to when I mention Windows vs. Linux. :P

tobing

As long as your application (game) does not try to edit Unicode strings, it's all quite easy. Just assume that everything is UTF-8 and treat everything internally as char*. Any TTF rendering based on FreeType, like allegro5-ttf or glyphkeeper with Allegro 4, will render your char* strings just fine, provided the font has all the characters. Read your strings from separate input files, which are in UTF-8. If you're using an editor like Notepad++ (or probably any other decent text editor), you can set the encoding you want somewhere in the menus.

If you want to take the Unicode text from the source itself, it's similar: you need to change the encoding of your source file to UTF-8, which a normal IDE doesn't care about. I'm fairly sure such an approach would work, but it's not defined by the standard, so it's better to use those separate input files.

Things only become nasty when you want to edit Unicode strings from within your application, because only then do you really need to know what a single character is and how many characters your string really has, which is different from the number of bytes if there's anything non-ASCII in the string.
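[Editor's note: Allegro's ustr API makes this bytes-versus-characters distinction concrete. A minimal sketch, assuming Allegro 5; the ñ is spelled as explicit bytes so the source file's encoding doesn't matter:]

// Bytes vs. characters: al_ustr_size counts bytes, al_ustr_length counts code points.
#include <cstdio>
#include <allegro5/allegro.h>

int main()
{
    // "año" is 3 characters, but the 'ñ' takes two bytes in UTF-8.
    ALLEGRO_USTR *us = al_ustr_new("a\xC3\xB1o");

    printf("bytes: %zu\n", al_ustr_size(us));    // prints 4 (a + 2-byte ñ + o)
    printf("chars: %zu\n", al_ustr_length(us));  // prints 3

    al_ustr_free(us);
    return 0;
}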

AMCerasoli
bamccaig said:

If you're actually satisfied just using English (American English where accents are optional at best :P) in the game then just write using ASCII and ignore everything above decimal 127.

I don't understand, you first said that, and then

bamccaig said:

If you were trying to be funny I'm afraid your English skills fail you.

I was being funny...

But if you know that my native language isn't English, why are you telling me to support just ASCII? It's like "yes, I will make your software for Spain, but it's going to be in English".

Quote:

I've personally never written a Unicode-aware application and I have indeed tried to learn how on a few occasions. It's just really hard.

It isn't that hard... not with the Allegro API. With C++0x things might change, and I think it will be even easier.

Quote:

you need to know what character encoding your input is

I'm still learning, but I think the input is always Unicode: my keyboard sends Unicode and my program interprets it.

Quote:

you need to know what character encoding you use internally in your application

I think this is solved by just saving your source file with UTF-8 encoding, which can be set in any IDE.

Quote:

and you need to know what character encoding your output should be; and you need to be able to convert between all three (though I imagine the input and expected output will often be the same).

That's right; once you're handling a given encoding there is no problem saving it, as Matthew shows above...

Quote:

Whether or not it's worth trying to compile the source code yourself depends on the package and how portable it is. That is, whether or not it already supports MinGW or whatever toolchain you intend to use with Code::Blocks, and whether there are extra dependencies that you'd have to install first.

You're right: ICU can't be compiled with MinGW; there are some guys trying to port it.

Quote:

have you considered installing Linux?

Say that one more time and I'm going to delete it!

I told you that I'm also using Linux!!! TRISQUEL!! And I installed GNU GCC by myself. What do you think about that, eh, eh? ;D

And tobing, that is actually what I'm doing:

file << al_cstr(my_allegro_utf8_string);

I was about to start using a new IDE and a new library, and I almost installed a new OS, just to save a file... What is happening to my BRAIN!!!

Writing/Reading (what I'm doing)

#include <iostream>
#include <fstream>
#include <string>

#include "allegro5/allegro.h"
using namespace std;

ALLEGRO_USTR *name;

int main()
{
    // The literal is UTF-8 because the source file is saved as UTF-8.
    name = al_ustr_new("Hola vive la vida loca coño");

    // Writing: al_cstr exposes the raw UTF-8 bytes, so fstream writes them as-is.
    ofstream file_out("file.txt");
    file_out << al_cstr(name);
    file_out.close();  // close before reading the same file back

    // Reading: getline reads the whole line (>> would stop at the first space,
    // and a fixed char buffer could overflow).
    string temp_string;
    ifstream file_in("file.txt");
    getline(file_in, temp_string);
    al_ustr_insert_cstr(name, 0, temp_string.c_str());

    al_ustr_free(name);
    return 0;
}

Michał Cichoń

If you want good support for UTF-8 in your C++ app, I can recommend you this library:
UTF8 for C++
It contains only three headers and is very simple to use.

bamccaig

But if you know that my native language isn't English, why are you telling me to support just ASCII? It's like "yes, I will make your software for Spain, but it's going to be in English".

Well, it depends on what you're trying to do. If your goal is to write a Unicode-aware game then obviously you will need to learn how to program with Unicode first. If you're just trying to make a game then you will need to learn how to make a game, and the Unicode parts don't matter. I think I misread what you wrote to mean that you don't care what language you use, so long as it just works, whereas what you really meant is that you wouldn't care which language if the whole world did speak one language... but they don't. Disregard that advice then and keep pushing for Unicode awareness. :)

I'm still learning, but I think the input is always Unicode: my keyboard sends Unicode and my program interprets it.

I don't think this is true. Each application is technically in control of what encoding it uses. It's true that Windows is designed somewhat to use the same encoding throughout (especially if applications are built using Windows APIs), but you can't be sure that it's Unicode. IIRC, when I installed Windows 7 a week ago it was by default using the Latin-1 encoding (ISO 8859-1). I manually changed it to UTF-8 because UTF-8 is just way better.

Besides, even if everybody was using Unicode, that doesn't tell you what the encoding is. Unicode is just a character set: a table of character-to-number mappings. It doesn't describe how those numbers are represented in computer memory. That is where UTF-7, UTF-8, UTF-16, and UTF-32 come in. These are all Unicode-based character encodings, which represent the same characters very differently in memory. It's possible that Allegro's APIs automatically take this into account for you (I don't know), but if not then you might need to do so yourself. That is, if you want it to just work(tm) on all systems and configurations.
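[Editor's note: a tiny sketch using the C++11 (then "C++0x") literal prefixes makes the difference visible; the character is written as the escape \u00F1 ('ñ') so the source file's encoding doesn't matter:]

// The same single character, stored under three different Unicode encodings.
#include <cstdio>

int main()
{
    const char     u8s[]  = u8"\u00F1";  // UTF-8:  0xC3 0xB1 (two bytes)
    const char16_t u16s[] = u"\u00F1";   // UTF-16: 0x00F1 (one 2-byte unit)
    const char32_t u32s[] = U"\u00F1";   // UTF-32: 0x000000F1 (one 4-byte unit)

    // Subtract the terminator to get the payload size in bytes.
    printf("UTF-8:  %zu bytes\n", sizeof u8s - sizeof u8s[0]);    // 2
    printf("UTF-16: %zu bytes\n", sizeof u16s - sizeof u16s[0]);  // 2
    printf("UTF-32: %zu bytes\n", sizeof u32s - sizeof u32s[0]);  // 4
    return 0;
}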

I think this is solved by just saving your source file with UTF-8 encoding, which can be set in any IDE.

I'm afraid that has no effect on what you use internally. The compiler most likely knows absolutely nothing of your source file's encoding (in fact, it should probably be ASCII-compatible if you want the compiler to read it properly). The internal encoding is something that you have to decide upon and maintain throughout the application. Again, it's possible that Allegro effectively does this for you, but I don't know.

That's right, once you are handling any type of encoding there is no problem by saving it, as Matthew shows above...

That might depend on what the expected output is. The function call that Matthew demonstrated above doesn't specify the output encoding. I assume then that the function returns either a fixed encoding or the system default encoding. This is probably OK in most situations and probably what you want, but that isn't necessarily true for all applications.

Matthew Leverton

If you want good support for UTF-8 in your C++ app, I can recommend you this library:

Please, don't confuse the guy.

bamccaig said:

That might depend on what the expected output is. The function call that Matthew demonstrated above doesn't specify the output encoding

It is UTF-8. That's what Allegro works with because it's essentially compatible with C strings. The only hard part is manipulating single characters, but Allegro provides routines for that.
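[Editor's note: iterating or editing by character rather than by byte might look like this sketch, assuming Allegro 5; it is not from Matthew's post:]

// Per-character manipulation with Allegro's ustr routines, which step by
// whole code points rather than bytes.
#include <cstdio>
#include <allegro5/allegro.h>

int main()
{
    ALLEGRO_USTR *us = al_ustr_new("co\xC3\xB1o");  // "coño" as explicit UTF-8 bytes

    int pos = 0;
    int32_t c;
    while ((c = al_ustr_get_next(us, &pos)) >= 0)    // pos advances 1-4 bytes per character
        printf("U+%04X\n", (unsigned) c);            // U+0063 U+006F U+00F1 U+006F

    al_ustr_insert_chr(us, 0, 0x00D1);  // insert 'Ñ' (U+00D1) at byte offset 0
    puts(al_cstr(us));                  // prints "Ñcoño" on a UTF-8 terminal

    al_ustr_free(us);
    return 0;
}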

Writing/Reading (what I'm doing)

Do you need help? There are typos and mistakes in your code.

Also, if you expect an embedded UTF-8 string to work, you must make sure that your IDE is saving your source file with UTF-8 encoding.

AMCerasoli

Wait a minute man, I'm getting confused again...

I think everyone is using Unicode. If you change the encoding type of your OS, that doesn't have anything to do with it; that just tells your OS how to read and interpret byte strings.

bamccaig said:

Each application is technically in control of what encoding it uses.

Of course, and that is possible since the standard is Unicode.

bamccaig said:

It's true that Windows is designed somewhat to use the same encoding throughout (especially if applications are built using Windows APIs), but you can't be sure that it's Unicode.

I can't understand you. To me, Unicode is a standard that is also used by the Windows OS. That is the whole point of Unicode.

OK, this is what I have learned so far; maybe I'm very confused.

For example: if I'm in Russia and I'm writing a text, my software (along with the OS) is interpreting this text using, for example, UTF-8, so the program reads Unicode code points in real time, saves them in RAM using UTF-8, and shows me that text correctly.

The same thing it's doing right now while I'm writing this text.

When I save the text, it's automatically saved as UTF-8 (editors don't ask you how to encode the text when you save the file; since you were using UTF-8, the software assumes it must be saved as UTF-8), so if someone else wants to read that text, they must set their text editor to read UTF-8.

But if someone wants to transform UTF-8 to UTF-16, for example, they must have Unicode, otherwise there is no way to do the job. For that reason I can't understand you when you say: "Besides, even if everybody was using Unicode, that doesn't tell you what the encoding is." If you weren't using Unicode then you couldn't be using UTF-8 or any other encoding... right?

Quote:

It's possible that Allegro's APIs automatically take this into account for you (I don't know),

Of course the Allegro API takes care of it; that is its job.

Quote:

The compiler most likely knows absolutely nothing of your source file's encoding (in fact, it should probably be ASCII-compatible if you want the compiler to read it properly).

The compiler doesn't need to know about the encoding format of the source code file. I think that when you save your file as UTF-8, what you're doing is telling the IDE to save the ""-surrounded strings as UTF-8 strings...

If I don't set my IDE to save UTF-8 strings, then my application is going to write another type of encoding. My compiler isn't aware of what kind of encoding the IDE is using, but if the IDE doesn't write the correct UTF-8 string, my program isn't going to be able to understand it.

For example, if my IDE saves this with ANSI encoding:

al_draw_text(titulo,al_map_rgb(255, 255, 255),290 ,50 , 0, "ñ");

It wouldn't know what charter it is actually sending, since ANSI doesn't know what the "ñ" letter is...

Instead, if I set my IDE to save UTF-8 strings, my IDE would write the two bytes that are needed to represent that letter in UTF-8, which can be used with the Allegro API...

Quote:

That might depend on what the expected output is. The function call that Matthew demonstrated above doesn't specify the output encoding. I assume then that the function returns either a fixed encoding or the system default encoding. This is probably OK in most situations and probably what you want, but that isn't necessarily true for all applications.

If I'm using UTF-8 then I would be saving UTF-8; if I have to load strings, they'd better be encoded using UTF-8, otherwise it wouldn't work.

I also was reading this:

So the question now becomes, if Windows NT supports both ASCII and Unicode, why do Unicode programs run faster. To answer this you have to understand Windows NT itself. All operating systems have what is called a "kernel." The kernel is the heart of the OS; it is the lowest level, the innards or guts of the OS. In Windows NT the kernel is written in Unicode, and therefore only understands Unicode. When an ANSI program runs on Windows NT, the OS must convert the strings from ASCII to Unicode. This takes both time to convert everything, and memory to store both copies (ASCII and Unicode). Whereas a Unicode program has straight access to the kernel and is faster. Now on modern computers running at gigahertz speeds and having hundreds of megs of RAM this speed difference is minimal, but it does exist. The simple fact remains, the same program running as either ANSI or Unicode, the Unicode version will always run faster.

Do you need help? There are typos and mistakes in your code.


No, actually that was just an example of what I'm doing... I wrote it right away and haven't compiled it... But it works for me; let me see what I wrote wrong... [Fixed]

Matthew Leverton

It wouldn't know what charter it is actually sending, since ANSI doesn't know what the "ñ" letter is...

First, it's character, not charter.

Your statement is not quite correct. In Windows-1252, character 241 is ñ. So if you are saving in that encoding, your string will look like:

0xF1 0x00

So it has a valid representation. However, the UTF-8 routines will see 241 and assume it's the start of a multibyte character, because it's > 127. But there is no valid second byte, and if there were, it wouldn't be what you expect to see.

So it doesn't work because it's an invalid UTF-8 string. It should be:

0xC3 0xB1 0x00
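[Editor's note: those bytes can be verified with Allegro itself; a quick sketch, assuming Allegro 5:]

// Append code point U+00F1 ('ñ') to an empty ustr and dump the UTF-8 bytes.
#include <cstdio>
#include <allegro5/allegro.h>

int main()
{
    ALLEGRO_USTR *us = al_ustr_new("");
    al_ustr_append_chr(us, 0x00F1);  // Allegro encodes the code point as UTF-8

    const unsigned char *bytes = (const unsigned char *) al_cstr(us);
    for (size_t i = 0; i < al_ustr_size(us); i++)
        printf("0x%02X ", bytes[i]);  // prints: 0xC3 0xB1
    printf("\n");

    al_ustr_free(us);
    return 0;
}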

bamccaig

Of course, and that is possible since the standard is Unicode.

That has nothing to do with Unicode. And there are many standards, which is why text programming is such a mess. :P

I can't understand you. To me, Unicode is a standard that is also used by the Windows OS. That is the whole point of Unicode.

For example: if I'm in Russia and I'm writing a text, my software (along with the OS) is interpreting this text using, for example, UTF-8, so the program reads Unicode code points in real time, saves them in RAM using UTF-8, and shows me that text correctly.

The same thing it's doing right now while I'm writing this text.

When I save the text, it's automatically saved as UTF-8 (editors don't ask you how to encode the text when you save the file; since you were using UTF-8, the software assumes it must be saved as UTF-8), so if someone else wants to read that text, they must set their text editor to read UTF-8.

But if someone wants to transform UTF-8 to UTF-16, for example, they must have Unicode, otherwise there is no way to do the job. For that reason I can't understand you when you say: "Besides, even if everybody was using Unicode, that doesn't tell you what the encoding is." If you weren't using Unicode then you couldn't be using UTF-8 or any other encoding... right?

The whole point of Unicode is to define a single mapping standard that supports basically every language known to humans. That doesn't mean that everybody necessarily uses it. Many people still don't use it yet and most people are still oblivious that it even exists.

Unicode is just a standard mapping of characters to numbers. These numbers are referred to as code points. The entire mapping is referred to as a character set. For example:

Character     Decimal       Hexadecimal       Code Point
=========================================================
A                  65                41           U+0041
B                  66                42           U+0042
C                  67                43           U+0043

Effectively, that is all Unicode defines. The problem is that it doesn't specify how to store that in memory. Sure, these numbers are small (65-67) so they fit in a single byte, but Unicode includes hundreds of thousands of characters[1]. Some code points require up to 4 bytes to represent them. The problem is that if we assume that every character is 4 bytes then some text (like English text) is going to waste 4x as much memory as it needs (for every character, 3 of the 4 bytes would always be zero) because most English characters fit in a single byte.

To standardize how Unicode code points are represented in memory, character encodings were invented. A Unicode encoding describes exactly how to lay Unicode code points out in memory. For example, UTF-8 is one such encoding. The leading bits of the first byte indicate how many bytes a particular character uses. To know what the code point is, you need to extract the payload bits from each byte and string them all together to form the real code point.
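[Editor's note: a toy decoder shows that bit extraction; it is only a sketch, and it trusts its input instead of validating the 10xxxxxx continuation bytes as a real decoder must:]

// Toy UTF-8 decoder for one character: the leading bits of the first byte
// give the sequence length, and the payload bits of all bytes are
// concatenated to form the code point.
#include <cstdio>

int decode_utf8(const unsigned char *s, unsigned int *cp)
{
    if (s[0] < 0x80) {                  // 0xxxxxxx: 1 byte (plain ASCII)
        *cp = s[0];
        return 1;
    }
    if (s[0] < 0xE0) {                  // 110xxxxx 10xxxxxx: 2 bytes
        *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (s[0] < 0xF0) {                  // 1110xxxx 10xxxxxx 10xxxxxx: 3 bytes
        *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 bytes
    *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12)
        | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

int main()
{
    const unsigned char ntilde[] = { 0xC3, 0xB1, 0x00 };  // "ñ" in UTF-8
    unsigned int cp;
    int bytes = decode_utf8(ntilde, &cp);
    printf("%d bytes, code point U+%04X\n", bytes, cp);  // 2 bytes, U+00F1
    return 0;
}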

UTF-32 is another possible encoding. There is no real magic here (AFAIK): every character is simply 4 bytes wide. This works great for languages whose characters need several bytes anyway (I imagine Chinese dialects would) because there's no complex counting of variable-width characters.

So even if everybody is using Unicode, there are binary incompatibilities between the character encodings used to represent Unicode. If I send my text to you as UTF-16 and you try to interpret it as UTF-8 you're going to get the wrong text.

When you save text it's entirely up to your text editor to determine which character encoding to write the file in. It could write it as UTF-7, UTF-8, UTF-16, UTF-32, or some completely non-Unicode encoding if it so chooses. Most editors allow you to change the encoding used.

Of course the Allegro API takes care of it; that is its job.

There are some things that Allegro can't possibly take care of. For example, if you open a text file, Allegro has no way to know what encoding that text file is written in. It can try to guess, but it can't be sure. That's why most applications that handle text allow you to change the encoding at run-time. If you see a bunch of missing characters, or the characters appear to be gibberish (i.e., an English document appears in Chinese characters), then it could be that the editor is using the wrong encoding.

Web browsers are probably the most common application that we use daily that encounter all sorts of character encodings.

The compiler doesn't need to know about the encoding format of the source code file.

Yes, it does. For the compiler to understand the text, it needs to know how the text is encoded. Older compilers, like C and C++ compilers, will generally assume ASCII (or something ASCII-compatible, like UTF-8). IIRC, the Visual C# compiler expects UTF-16, but I'm not too sure (maybe it can guess).

I think that when you save your file as UTF-8, what you're doing is telling the IDE to save the ""-surrounded strings as UTF-8 strings...

No. You're telling the editor to save the entire file as UTF-8. It just so happens that UTF-8 is ASCII-compatible, so C and C++ compilers shouldn't even notice.

Instead, if I set my IDE to save UTF-8 strings, my IDE would write the two bytes that are needed to represent that letter in UTF-8, which can be used with the Allegro API...

Yes, but remember that it's the Allegro API that understands those characters, not the compiler. The compiler sees them as completely different characters. However, because they're in a string literal (i.e., "") it doesn't care. It just adds those bytes to the executable program as they are and expects the program to know what they mean.
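[Editor's note: one way to convince yourself of this is a sketch like the following: spell the bytes out with \x escapes and compare against the plain literal. If the source file really is saved as UTF-8, the two are byte-for-byte identical:]

// The compiler just copies the bytes of a narrow string literal into the
// program; the \x escapes sidestep the source encoding entirely.
#include <cstdio>
#include <cstring>

int main()
{
    const char *literal = "ñ";          // 0xC3 0xB1 only if the file is saved as UTF-8
    const char *escaped = "\xC3\xB1";   // always 0xC3 0xB1, regardless of source encoding

    printf("%s\n", strcmp(literal, escaped) == 0 ? "same bytes" : "different bytes");
    return 0;
}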

If I'm using UTF-8 then I would be saving UTF-8, If I have to load strings better to be encoded using UTF-8 otherwise wouln'd work.

If you are in control of all inputs and outputs then you can choose whatever you want. You can require your files to be UTF-8, require network communication to be UTF-8, etc. This will work fine. It's the times when you don't know that you have to care. That probably won't happen with a game, but it might happen if you write other programs that need to deal with variable inputs and outputs. For example, unless Allegro has an API to determine the encoding used by the terminal (or Command Prompt), you would need some other method of determining the expected output encoding before you could write non-English characters reliably to stdout or stderr.
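[Editor's note: on POSIX systems, one conventional way to ask is the locale machinery; a sketch, not portable to Windows, where GetConsoleOutputCP would be the rough equivalent:]

// Discover the character encoding the user's environment expects.
#include <locale.h>
#include <stdio.h>
#include <langinfo.h>

int main()
{
    setlocale(LC_ALL, "");                 // adopt the user's locale settings
    printf("%s\n", nl_langinfo(CODESET));  // e.g. "UTF-8" or "ISO-8859-1"
    return 0;
}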

So the question now becomes, if Windows NT supports both ASCII and Unicode, why do Unicode programs run faster. To answer this you have to understand Windows NT itself. All operating systems have what is called a "kernel." The kernel is the heart of the OS; it is the lowest level, the innards or guts of the OS. In Windows NT the kernel is written in Unicode, and therefore only understands Unicode. When an ANSI program runs on Windows NT, the OS must convert the strings from ASCII to Unicode. This takes both time to convert everything, and memory to store both copies (ASCII and Unicode). Whereas a Unicode program has straight access to the kernel and is faster. Now on modern computers running at gigahertz speeds and having hundreds of megs of RAM this speed difference is minimal, but it does exist. The simple fact remains, the same program running as either ANSI or Unicode, the Unicode version will always run faster.

Notice that the author says "Unicode". They don't say "UTF-8" or "UTF-16". Either the kernel supports all Unicode encodings (possible, but unlikely) or it supports one (I would guess UTF-16). That means that if you use any other encoding (e.g., UTF-8) then it would be just as slow. Besides, the kernel only needs to understand text if you're giving it text to process. If you're reading from a device or file, the kernel will give it to you exactly as it got it (it's binary data that could be text or could be an image; it doesn't know).
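[Editor's note: for what it's worth, the usual way to cross that boundary on Windows is the Win32 conversion API; a sketch, Windows only:]

// Convert UTF-8 to the UTF-16 that the Windows "wide" APIs expect.
#include <stdio.h>
#include <windows.h>

int main()
{
    const char *utf8 = "\xC3\xB1";  // "ñ" in UTF-8

    // First call with a NULL buffer asks for the required size in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    wchar_t *utf16 = new wchar_t[len];
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, len);

    printf("U+%04X\n", (unsigned) utf16[0]);  // prints U+00F1
    delete[] utf16;
    return 0;
}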

If You Just Want To Support UTF-8 In Your Game

Then you control all of the inputs and outputs by using Allegro's APIs. That is fine and should work flawlessly (writing to the screen; I don't know about the terminal). Just know that Allegro won't necessarily know what to do if you need to process files or streams from uncontrolled sources, for example an existing network protocol.

References

  1. I'm guessing because I don't know the exact number.

Matthew Leverton
bamccaig said:

If You Just Want To Support UTF-8 In Your Game
Then you control all of the inputs and outputs by using Allegro's APIs.

That's not really true, or at least the statement is misleading. For instance, you can read from a UTF-8 encoded text file using standard C/C++ I/O routines.

If you're not manipulating UTF-8 strings, then the only things you have to do in your game are:

  1. use UTF-8 encoded files

  2. use Allegro's ustr routines when drawing text.

If you are manipulating UTF-8 strings, then you also need to use Allegro's UTF-8 routines to properly insert or delete characters, etc.
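[Editor's note: putting those two points together, a sketch assuming Allegro 5 with the font and ttf addons; "DejaVuSans.ttf" and "file.txt" are placeholders:]

// Matthew's recipe: read a UTF-8 file with plain C++ I/O, then hand the
// bytes to Allegro's text routines for drawing.
#include <fstream>
#include <iterator>
#include <string>
#include <allegro5/allegro.h>
#include <allegro5/allegro_font.h>
#include <allegro5/allegro_ttf.h>

int main()
{
    // 1. Standard C++ I/O is fine for a UTF-8 file; it's just bytes.
    std::ifstream in("file.txt", std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    al_init();
    al_init_font_addon();
    al_init_ttf_addon();
    ALLEGRO_DISPLAY *display = al_create_display(640, 480);
    ALLEGRO_FONT *font = al_load_ttf_font("DejaVuSans.ttf", 24, 0);

    // 2. Allegro's text routines interpret the char* as UTF-8.
    al_draw_text(font, al_map_rgb(255, 255, 255), 10, 10, 0, text.c_str());
    al_flip_display();
    al_rest(3.0);

    al_destroy_font(font);
    al_destroy_display(display);
    return 0;
}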

AMCerasoli

Oh, come on man... You must be kidding me!! WTH is that... I have read enough today; I'm going to print it and read it later... Cheeses...

P.S.: You could have just redirected me to the Joel on Software web page again; you didn't have to explain it all yourself... See you in a week, I'm tired... The same goes for Matthew...

EDIT:

I was able to load and save files using UTF-8. Thanks very much for all that documentation.

In the end, it turns out that Unicode isn't hard at all. I'm going to try to run some examples on different platforms with different languages to see if I finally understand Unicode. I encourage everyone to understand Unicode and start building their applications with UTF-8 support; this is going to open your mind, and not only allow your applications to run around the world, but also deepen your knowledge about bytes, bits, and other important stuff. And remember, whether you're using Allegro or not, while C++0x isn't here yet, use APIs to help you with the strings ;).
