|
Unicode routines and std::string |
Biznaga
Member #3,180
January 2003
|
To make some things easier I'm using my own String class, derived from std::string. I added methods to trim and convert (to int, float, etc., via string streams), among other things. Since my project is 100% based on Allegro, should I add methods that use Allegro's Unicode routines, OR forget about std::string and make my String class a wrapper around the Unicode routines? I would like to use Unicode strings because I may need to translate my game into other languages. I'm not sure whether std::strings and c-strings (Allegro's Unicode strings) can coexist in the same class, or whether it's even a good idea (I think it isn't). The same goes for the file routines (make_absolute_filename, replace_extension, etc.). What do you suggest? |
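For readers who want a concrete picture of the helpers described above, here is a minimal sketch of trimming and stream-based conversion. The names `trim` and `from_string` are hypothetical and not taken from the poster's class; they are written as free functions, which is an assumption of this sketch rather than the design discussed in the thread.

```cpp
#include <sstream>
#include <string>

// Hypothetical helpers for illustration; not part of Allegro or the standard library.

// Returns a copy of s without leading and trailing spaces, tabs, or line breaks.
std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) return std::string();  // nothing but whitespace
    const std::string::size_type last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Parses s into out through a string stream. Returns false if the text does not
// start with a valid T or if anything other than whitespace follows the value.
template <typename T>
bool from_string(const std::string& s, T& out) {
    std::istringstream in(s);
    in >> out;
    if (in.fail()) return false;
    in >> std::ws;    // skip trailing whitespace, if any
    return in.eof();  // true only when the whole input was consumed
}
```

Both helpers work on bytes, so they behave the same for plain ASCII and for UTF-8 text held in a std::string, as long as the whitespace being trimmed is ASCII whitespace.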
axilmar
Member #1,204
April 2001
|
I do not think std::string and C strings can be mixed with Allegro's strings. Even though Allegro strings use 'char *' as their data type, the underlying characters may not match ASCII characters. If you are going for Unicode, then you will certainly have a problem with std::string; for example, in Unicode the character '\0' is two bytes long, whereas in std::string/C strings it is one. My suggestion is to forget std::string and wrap Allegro's Unicode routines inside your String class. But you can use std::vector<char> for the character buffer, so you avoid the allocation/deallocation issues. You could also make your String class reference-counted, so you can pass it by value. |
Tobias Dammers
Member #2,604
August 2002
|
Quote: in Unicode the character '\0' is two bytes long
Not in UTF-8, which Allegro uses by default. Since \0 is the only special character for std::string, everything else can be used as you please. String length, though, will return the number of bytes in the string, not necessarily the number of characters. Similarly, the index operators ([] and at()) will return the nth byte, not the nth character. As long as you just load and compare strings, concatenate them, and pass them around, std::string will be fine (provided you use UTF-8, not UTF-16, and use UTF-8 output routines only). |
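To illustrate the difference between byte count and character count, here is a minimal sketch that counts code points in a UTF-8 std::string. The function name `utf8_length` is hypothetical, and the sketch assumes the input is already valid UTF-8. Allegro 4 has its own ustrlen() for the current text format, which would be the natural choice inside an Allegro program.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string.
// Every code point has exactly one byte that is NOT a continuation byte
// (continuation bytes look like 10xxxxxx), so counting those is enough.
// Assumes the input is valid UTF-8; it does no validation.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the Spanish word "mañana" stored as UTF-8, size() reports 7 bytes while this function reports 6 characters, because ñ takes two bytes.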
Fladimir da Gorf
Member #1,565
October 2001
|
Wouldn't UTF-16 be easy to use with std::string? After all you can specify the data type of a character, so you can pass a short for that instead of a char. |
gillius
Member #119
April 2000
|
You would need to use wstring or basic_string<your_type_here>. I am not an expert in this, but I believe it is facets and/or locales that allow you to do multi-byte character strings. I wonder if it is possible to write an Allegro string facet to allow Allegro UTF strings in standard library strings. |
Biznaga
Member #3,180
January 2003
|
Allegro seems to provide a wide range of string routines, so I think I could make a very functional String class with them. I forgot to mention something important: my std::string based class also uses regular expressions (I'm using the rx library). Will this new string class work with rx too? And what about Lua scripts? I just read this, but has anybody worked with Lua scripts and Unicode? I'm starting to believe I should stick with U_ASCII if I don't really need UTF-8. At least, U_ASCII is good enough for most Western languages, right? |
Kitty Cat
Member #2,815
October 2002
|
UTF-8 is fully compatible with 7-bit ASCII. Any character in the 0-127 (inclusive) range is the same in both UTF-8 and ASCII. |
Tobias Dammers
Member #2,604
August 2002
|
...including the special characters 0 through 31. Which is why a std::string can easily hold UTF-8 data; the only thing that is not reliable is accessing a single character by index, since operator[] and at() count bytes, not characters. |
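One way around the indexing problem is to translate a character index into a byte offset first. The sketch below does that for valid UTF-8; `utf8_offset` is a hypothetical name, and the linear scan is an assumption of this example, not something the thread prescribes. Allegro 4 offers uoffset() and ugetat() for similar purposes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which the n-th code point
// (0-based) starts, or std::string::npos if the string is too short.
// Assumes valid UTF-8. The scan is linear, so random access costs O(length).
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) == 0x80) continue;  // continuation byte, not a start
        if (n == 0) return i;
        --n;
    }
    return std::string::npos;
}
```

With "niño" stored as UTF-8, the fourth character (index 3) starts at byte 4, whereas s[3] is the second byte of ñ.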
X-G
Member #856
December 2000
|
std::basic_string<wchar_t> |
Tobias Dammers
Member #2,604
August 2002
|
This doesn't handle variable-width characters though; Allegro does. |
X-G
Member #856
December 2000
|
That's right, but they work smashingly for Unicode. |
Thomas Fjellstrom
Member #476
June 2000
|
And don't forget that UTF-8, which is a Unicode encoding, can and will encode less commonly used characters with up to 6 bytes. Edit: Wikipedia seems to say it's 4 bytes. Is it 4? |
Dennis
Member #1,090
July 2003
|
It is 4. The standard document on which I modeled my own routines for encoding to and decoding from UTF-8 says so as well. Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a C-style string, which you can then convert to whatever Unicode format you have set for Allegro. |
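The four-byte limit is easy to see in an encoder. The following is a minimal sketch of one, not the routines mentioned in the post; `append_utf8` is a hypothetical name, and rejecting surrogates and values above U+10FFFF is a choice made for this example.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point cp to out.
// Returns false, appending nothing, for surrogates (U+D800..U+DFFF) and
// for values above U+10FFFF, which have no valid UTF-8 form.
bool append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

Since U+10FFFF is the highest code point, the four-byte form is the longest one this encoder ever produces.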
Biznaga
Member #3,180
January 2003
|
Tobias Dammers said: I'd use std::string as a base and only implement unicode-specific functionality through allegro. This'll save you from re-coding the memory allocation code yourself.
Isn't Allegro already doing this with functions like uinsert, uremove, ustrcat?
Dennis Busch said: Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string
Really? Wouldn't c_str() return a const wchar_t* instead of a const char*? Can I pass a const wchar_t* as if it were a const char* as a function parameter? I'm concerned about whether rx will work with Unicode strings. It takes const char* parameters. |
Kitty Cat
Member #2,815
October 2002
|
Quote: Can I pass a const wchar_t* as if it was a const char* as a function parameter?
No, you'd have to convert it. wchar_t is 2 bytes per character on Windows (4 on most Unix-like systems), whereas a char is just 1 (a UTF-8 character may span several chars, but the string is still compatible with char-based functions, for the most part). |
Tobias Dammers
Member #2,604
August 2002
|
16-bit encoding is not much of a problem; you just use wchar_t. You do need to take special precautions for output, though; either convert to an 8-bit code table, or use 16-bit output routines. |
Dennis
Member #1,090
July 2003
|
Quote:
Quote: Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string
Really? Wouldn't c_str return a const wchar_t* instead of a const char*?
Unfortunately you cut off what I said before the all-important part: "[..], which you can then convert to whatever unicode format you have set for Allegro."
This will work as long as Allegro's U_UNICODE hasn't changed since 4.20rc2, because back in that version Allegro's U_UNICODE used a fixed two bytes per character. |
Biznaga
Member #3,180
January 2003
|
You are right. I understand the conversion part. I was thinking of the opposite case: converting Allegro's strings to c-strings that rx could use. Maybe I can convert to U_ASCII, but that could lose or corrupt data, and it makes no sense at all if what I'm trying to do is use Unicode. |
gillius
Member #119
April 2000
|
UTF-16 or "wide character" strings used to be convenient until there were more than 65535 characters in Unicode, so now in UTF-16 a character can be as large as 32 bits (<sarcasm>yeah). However, if you assume that all of your strings are going to stay on the BMP (basic multilingual plane), then you'll probably be fine for almost any application to assume that a character will not have a multi-unit representation. If you want to be safe, check that no characters are in the range D800-DBFF when you get strings from external sources. I am going to assume that Allegro's Unicode functions probably don't implement this portion of UTF-16. |
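The safety check suggested above can be sketched in a few lines. `is_bmp_only` is a hypothetical name, and std::u16string (C++11) is used here as an assumption in place of wchar_t strings. The sketch tests the whole surrogate range D800-DFFF rather than only D800-DBFF, so that stray low surrogates are caught as well.

```cpp
#include <string>

// Hypothetical helper: returns true if the UTF-16 string contains no surrogate
// code units (0xD800..0xDFFF), meaning every code unit is exactly one
// character from the basic multilingual plane.
bool is_bmp_only(const std::u16string& s) {
    for (char16_t unit : s) {
        if (unit >= 0xD800 && unit <= 0xDFFF) return false;
    }
    return true;
}
```

If the check passes, size() and operator[] on the string count and index whole characters; if it fails, the string needs surrogate-aware handling or should be rejected.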
Shawn Hargreaves
The Progenitor
April 2000
|
The 16 bit range pretty much covers all current languages except Chinese. You can do one type of Chinese (as spoken in Taiwan etc) with just 16 bit characters, but for full Chinese Chinese, you need some code points above 65535. That's the only interesting thing beyond 16 bit, though: all the other extensions are for dead or theoretical languages that nobody actually speaks today. |
gillius
Member #119
April 2000
|
I didn't know there were full Chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages... |
Biznaga
Member #3,180
January 2003
|
Well, CJK is simply too much for me. At most, I'd like to support English and Spanish in my game (BTW, an RPG), and maybe a horrendous French translation. Western languages. So UTF-16 is OK for me; even U_ASCII, I think. I'll try both std::basic_string<wchar_t> and a wrapper around Allegro's Unicode routines. Thank you, everybody! I have another related question: ASCII has 7-bit characters, Allegro's U_ASCII has 8-bit ones. Are the last 128 characters in U_ASCII the same as in ISO-Latin-1, or does it depend on the locale? Talking about translation, does anybody know of a good tutorial for gettext? |
Jakub Wasilewski
Member #3,653
June 2003
|
Quote: I didn't know there were full chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...
Plane 1 (the supplementary multilingual plane) is reserved for dead and ancient scripts and some other stuff. Plane 2, however (I believe it's called the supplementary ideographic plane), is filled with about 40000 older, traditional and rarely used Chinese symbols that simply didn't fit in the basic multilingual plane. And AFAIK hieroglyphs are not in Unicode... yet. But we have these in Unicode, so...
|