Unicode routines and std::string

Biznaga

Member #3,180

January 2003

To make some things easier I'm using my own String class, derived from std::string. I added methods to trim and to convert (to int, float, etc., via string streams), among other things. Since my project is 100% based on Allegro, should I add methods that use Allegro's Unicode routines, OR forget about std::string and make my String class a wrapper around the Unicode routines? I would like to use Unicode strings because I may need to translate my game into other languages.

I'm not sure whether std::strings and C strings (Allegro's Unicode strings) can coexist in the same class, or whether it's even a good idea (I think it isn't).

The same goes for file routines (make_absolute_filename, replace_extension, etc.).

What do you suggest?

axilmar

Member #1,204

April 2001

I do not think std::string and C strings can be mixed with Allegro's strings. Even though Allegro strings use 'char *' as data type, the underlying characters may not match ASCII characters. If you are going for Unicode, then you will certainly have a problem with std::string; for example, in Unicode the character '\0' is two bytes long, whereas in std::string/C strings it is one.

My suggestion is to forget std::string and wrap Allegro's Unicode routines inside your String class. But you can use std::vector<char> for the character buffer, so you avoid the allocation/deallocation issues. You could also make your String class reference-counted, so you can use it by value.

ALGUI: c++11 A5 GUI library.

Tobias Dammers

Member #2,604

August 2002

Quote:

in Unicode the character '\0' is two bytes long

Not in UTF-8, which Allegro uses by default. Since \0 is the only special character for std::string, everything else can be used as you please. String length, though, will return the number of bytes in the string, not necessarily the number of characters. Similarly, the index operators ([] and at()) will return the nth byte, not the nth character. As long as you just load and compare strings, concatenate them, and pass them around, std::string will be fine (provided you do use UTF-8, not UTF-16, and use UTF-8 output routines only).
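To illustrate the bytes-versus-characters point, here is a minimal sketch in plain C++ (no Allegro needed). The helper name utf8_length is made up for this example, and it assumes the string holds well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts characters (code points) in a UTF-8 string by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Hypothetical helper; assumes well-formed UTF-8 input.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count; // a lead byte or a plain ASCII byte starts a character
        }
    }
    return count;
}
```

For "niño" stored as UTF-8, size() reports 5 bytes while utf8_length reports 4 characters. If the string is in Allegro's current format, Allegro's own ustrlen() should give the character count for its C-string form.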

---
Me make music: Triofobie
---
"We need Tobias and his awesome trombone, too." - Johan Halmén

Fladimir da Gorf

Member #1,565

October 2001

Wouldn't UTF-16 be easy to use with std::string? After all you can specify the data type of a character, so you can pass a short for that instead of a char.

OpenLayer has reached a random SVN version number ;) | Online manual | Installation video!| MSVC projects now possible with cmake | Now alvailable as a Dev-C++ Devpack! (Thanks to Kotori)

gillius

Member #119

April 2000

You would need to use wstring or basic_string<your_type_here>. I am not an expert in this, but I believe it is facets and/or locales that allow you to do multi-byte character strings. I wonder if it is possible to write an Allegro string facet to allow Allegro UTF strings in standard library strings.

Gillius
Gillius's Programming -- https://gillius.org/

Biznaga

Member #3,180

January 2003

Allegro seems to provide a wide range of string routines, so I think I could make a very functional String class with them.

I forgot to mention something important: my std::string based class also uses regular expressions (I'm using the rx library). Will this new string class work along with rx too? And, what about Lua scripts? I just read this, but has anybody worked with Lua scripts and Unicode?

I'm starting to believe I should stick with U_ASCII if I really don't need UTF-8. At least, U_ASCII is good enough for most occidental languages, right? ::)

Kitty Cat

Member #2,815

October 2002

UTF-8 is fully compatible with 7-bit ASCII. Any character in the 0-127 (inclusive) range is the same in both UTF-8 and ASCII.

--
"Do not meddle in the affairs of cats, for they are subtle and will pee on your computer." -- Bruce Graham

Tobias Dammers

Member #2,604

August 2002

...including the special characters 0 through 31. Which is why a std::string can easily hold UTF-8 data; the only thing that is not reliable is accessing a single character by index, since operator[] and at() count bytes, not characters.
I'm curious about the allegro-string class, though. BTW, I'd use std::string as a base and only implement unicode-specific functionality through allegro. This'll save you from re-coding the memory allocation code yourself.

X-G

Member #856

December 2000

std::basic_string<wchar_t>

--
Since 2008-Jun-18, democracy in Sweden is dead. | 悪霊退散！悪霊退散！怨霊、物の怪、困った時は　ドーマン！セーマン！ドーマン！セーマン！　直ぐに呼びましょう陰陽師レッツゴー！

Tobias Dammers

Member #2,604

August 2002

This doesn't handle variable-width characters though; Allegro does.

X-G

Member #856

December 2000

That's right, but they work smashingly for Unicode.

Thomas Fjellstrom

Member #476

June 2000

And don't forget that UTF-8, which is a Unicode encoding, can and will encode less commonly used characters with up to 6 bytes.

Edit: Wikipedia seems to say it's 4 bytes. Is it 4?

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Dennis

Member #1,090

July 2003

It is 4. The standard document on which I modeled my own routines for encoding to and decoding from UTF-8 says so as well.
Sequences longer than 4 bytes are invalid.
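As a rough sketch of why 4 bytes is the limit: the original UTF-8 design allowed sequences of up to 6 bytes, but the current definition (RFC 3629) stops at U+10FFFF, which fits in 4. The function below is an illustration in plain C++, not Dennis's actual routines; the name utf8_encode is made up here.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for values that cannot be encoded:
// surrogates (D800-DFFF) and anything above 10FFFF.
// Hypothetical helper for illustration only.
std::string utf8_encode(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out; // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx + three continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

A Spanish ñ (U+00F1) comes out as 2 bytes, the euro sign (U+20AC) as 3, and anything outside the basic multilingual plane as 4.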

Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string, which you can then convert to whatever unicode format you have set for Allegro.
(In 4.2rc2 the format of the c_str() retrieved from a standard wchar_t string seemed to be equal to Allegro's U_UNICODE format.)

--- 0xDB | @dennisbusch_de ---

Biznaga

Member #3,180

January 2003

Tobias Dammers said:

I'd use std::string as a base and only implement unicode-specific functionality through allegro. This'll save you from re-coding the memory allocation code yourself.

Isn't Allegro already doing this with functions like uinsert, uremove, ustrcat?

Dennis Busch said:

Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string

Really? Wouldn't c_str() return a const wchar_t* instead of a const char*? Can I pass a const wchar_t* as if it was a const char* as a function parameter? I'm concerned about whether rx will work with Unicode strings. It takes const char* parameters.

Kitty Cat

Member #2,815

October 2002

Quote:

Can I pass a const wchar_t* as if it was a const char* as a function parameter?

No, you'd have to convert it. wchar_t is 2 bytes per character on Windows (and typically 4 on other platforms), whereas a char is just 1 (UTF-8 may use several chars per character, but it's still compatible with char-based code, for the most part).

Tobias Dammers

Member #2,604

August 2002

16-bit encoding is not much of a problem; you just use wchar_t (which is 16 bits wide on Windows). You do need to take special precautions for output, though: either convert to an 8-bit code table, or use 16-bit output routines.
UTF-8 has variable character widths, which is a bit of a problem. Allegro can output these, and you can store them in a std::string, but this will not give you correct lengths and indices. Even worse, an index may point somewhere halfway through a multi-byte character. If you can live with these issues, then use std::string with UTF-8. If you can use UTF-16, use std::basic_string<wchar_t> and UTF-16 output routines. If you have to use UTF-8, and need a string class that handles indices and everything correctly, then I guess you're stuck with writing your own string.
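To make the "index halfway through a character" problem concrete, here is a small sketch in plain C++ that maps a character index to a byte offset. Both function names are invented for this example, and the code assumes well-formed UTF-8. Allegro's uoffset() plays a similar role for strings in Allegro's current format.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (10xxxxxx), meaning it
// sits in the middle of a multi-byte character. Hypothetical helper.
bool is_utf8_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Returns the byte offset at which character number char_index starts,
// or std::string::npos if the string has fewer characters than that.
// Hypothetical helper; assumes well-formed UTF-8.
std::size_t utf8_offset(const std::string& s, std::size_t char_index)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_utf8_continuation(static_cast<unsigned char>(s[i]))) {
            continue; // still inside the previous character
        }
        if (seen == char_index) {
            return i;
        }
        ++seen;
    }
    return std::string::npos;
}
```

Note that this lookup is linear in the string length, which is the price of keeping UTF-8 inside a byte-oriented std::string.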

Dennis

Member #1,090

July 2003

Quote:

Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string

Really? Wouldn't c_str() return a const wchar_t* instead of a const char*?

Unfortunately you cut off what I said before the all-important part: "[..], which you can then convert to whatever unicode format you have set for Allegro."
So what you have to do is, of course, interpret the result of c_str() as a char* so that you can use Allegro's conversion function to make an Allegro-usable Unicode string out of it.
Example:

// Assume temp is a non-empty std::wstring and that wchar_t is 2 bytes wide,
// so that its buffer matches Allegro's U_UNICODE layout.
// Needs <new> for std::nothrow and <cstring> for memset.

// Worst-case size in bytes of the converted string, terminator included
int tmp_size = (int)(temp.length() + 1) * uwidth_max(U_CURRENT);

// Allocate memory for the Allegro string; plain new would throw instead of returning NULL
char *outstring = new (std::nothrow) char[tmp_size];
if (!outstring) // not enough memory for the converted string
{
  // do error handling
}
else // now convert to Allegro's current format
{
  memset(outstring, 0, tmp_size);
  do_uconvert((const char *)temp.c_str(), U_UNICODE, outstring, U_CURRENT, tmp_size);
}

This will work as long as Allegro's U_UNICODE hasn't changed since 4.2rc2, because back in that version U_UNICODE was a fixed two bytes per character, and as long as wchar_t is also two bytes wide on your platform.

Biznaga

Member #3,180

January 2003

You are right. I understand the conversion part. I was thinking of the opposite case: converting Allegro's strings to C strings that rx could use. Maybe I can convert to U_ASCII, but that could lose or corrupt data, and it makes no sense at all if what I'm trying to do is use Unicode.

gillius

Member #119

April 2000

UTF-16 or "wide character" strings used to be convenient until there were more than 65535 characters in Unicode, so now in UTF-16 a character can be as large as 32 bits (<sarcasm>yeah). However, if you assume that all of your strings are going to stay on the BMP (basic multilingual plane), then you'll probably be fine for most any application to assume that a character will not have a multi-byte representation, but if you want to be safe, check that no characters are in the range D800-DBFF when you get strings from external sources. I am going to assume that Allegro's unicode functions probably don't implement this portion of UTF-16.
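A minimal sketch of that safety check, in plain C++11 using std::u16string. The function names are made up for this example. It tests the whole surrogate range D800-DFFF (leading units are D800-DBFF, trailing units DC00-DFFF), so it also catches stray trailing units in malformed input.

```cpp
#include <cstddef>
#include <string>

// True if any 16-bit unit is a surrogate, i.e. the string contains (or
// tries to contain) a character outside the basic multilingual plane.
// Hypothetical helper for illustration.
bool has_surrogates(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF) {
            return true;
        }
    }
    return false;
}

// Combines a leading (D800-DBFF) and trailing (DC00-DFFF) surrogate into
// the code point they encode. The caller must pass a valid pair.
unsigned long combine_surrogates(char16_t high, char16_t low)
{
    return 0x10000UL
         + ((static_cast<unsigned long>(high) - 0xD800UL) << 10)
         + (static_cast<unsigned long>(low) - 0xDC00UL);
}
```

If has_surrogates() returns false for every string you load, treating each 16-bit unit as one character is safe for that data.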

Shawn Hargreaves

The Progenitor

April 2000

The 16 bit range pretty much covers all current languages except Chinese. You can do one type of Chinese (as spoken in Taiwan etc) with just 16 bit characters, but for full Chinese Chinese, you need some code points above 65535. That's the only interesting thing beyond 16 bit, though: all the other extensions are for dead or theoretical languages that nobody actually speaks today.

gillius

Member #119

April 2000

I didn't know there were full chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...

Biznaga

Member #3,180

January 2003

Well, CJK is simply too much for me.

At most, I'd like to support English and Spanish in my game (BTW, an RPG), and maybe a horrendous French translation. Occidental languages. So UTF-16 is OK for me; even U_ASCII, I think.

I'll try both std::basic_string<wchar_t> and a wrapper around Allegro's Unicode routines. Thank you, everybody!

I have another related question: ASCII has 7-bit characters, Allegro's U_ASCII has 8-bit. Are the last 128 characters in U_ASCII the same as in ISO-Latin-1, or does it depend on the locale?

Talking about translation, does anybody know of a good tutorial for gettext?

Jakub Wasilewski

Member #3,653

June 2003

Quote:

I didn't know there were full chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...

Plane 1 (supplementary multilingual plane) is reserved for dead and ancient scripts and some other stuff. Plane 2 however (I believe it's called supplementary ideographic plane) is filled with about 40000 older, traditional and rarely used Chinese symbols that simply didn't fit in the basic multilingual plane.
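For reference, a plane is just a block of 65,536 code points, so the plane number is the code point shifted right by 16 bits. A tiny sketch (the function name is invented for this example):

```cpp
// Returns the Unicode plane (0 to 16) that a code point belongs to,
// or -1 if the value is above 10FFFF and therefore not a code point.
// Plane 0 is the basic multilingual plane.
int unicode_plane(unsigned long cp)
{
    if (cp > 0x10FFFFUL) {
        return -1;
    }
    return static_cast<int>(cp >> 16);
}
```

Anything for which this returns a value other than 0 needs a surrogate pair in UTF-16 and 4 bytes in UTF-8.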

And AFAIK hieroglyphs are not in Unicode... yet. But we have these in Unicode, so...

---------------------------
[ ChristmasHack! | My games ] :::: One CSS to style them all, One Javascript to script them, / One HTML to bring them all and in the browser bind them / In the Land of Fantasy where Standards mean something.