Unicode routines and std::string

no-reply@allegro.cc (Biznaga) — Mon, 17 Apr 2006 03:51:23 +0000

To make some things easier I'm using my own String class, derived from std::string. I added methods to trim and convert (to int, float, etc., via string streams), among other things. Since my project is 100% based on Allegro, should I add methods that use Allegro's Unicode routines, or forget about std::string and make my String class a wrapper around the Unicode routines? I would like to use Unicode strings because I may need to translate my game into other languages.

I'm not sure if std::strings and C strings (Allegro's Unicode strings) can coexist in the same class, or if it's even a good idea (I think it isn't).

The same goes for file routines (make_absolute_filename, replace_extension, etc.).

What do you suggest?

no-reply@allegro.cc (axilmar) — Mon, 17 Apr 2006 19:21:39 +0000

I do not think std::string and C strings can be mixed with Allegro's strings. Even though Allegro strings use 'char *' as data type, the underlying characters may not match ASCII characters. If you are going for Unicode, then you will certainly have a problem with std::string; for example, in Unicode the character '\0' is two bytes long, whereas in std::string/C strings it is one.

My suggestion is to forget std::string and wrap Allegro's Unicode routines inside your String class. But you can use std::vector for the character buffer, so you avoid the allocation/deallocation issues. You could also make your String class reference-counted, so you can use it by value.

no-reply@allegro.cc (Tobias Dammers) — Mon, 17 Apr 2006 20:16:41 +0000

Quote:

in Unicode the character '\0' is two bytes long

Not in UTF-8, which allegro uses by default. Since \0 is the only special character for std::string, everything else can be used as you please. String length, though, will return the number of bytes in the string, not necessarily the number of characters. Similarly, index operators ( [] and at() ) will return the nth byte, not the nth character. As long as you just load and compare strings, concatenate them together, and pass them around, std::string will be fine (provided you do use UTF8, not UTF16, and use UTF-8 output routines only).
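To illustrate the point about length: a minimal sketch, assuming the std::string holds well-formed UTF-8. The helper name utf8_length is hypothetical and not part of Allegro or the standard library (Allegro's own ustrlen plays a similar role for Allegro strings). It counts characters by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // this byte starts a new character
        }
    }
    return count;
}
```

For the string "ni\xC3\xB1o" (the Spanish word niño encoded as UTF-8), size() reports 5 bytes while utf8_length reports 4 characters.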

no-reply@allegro.cc (Fladimir da Gorf) — Mon, 17 Apr 2006 21:05:32 +0000

Wouldn't UTF-16 be easy to use with std::string? After all you can specify the data type of a character, so you can pass a short for that instead of a char.

no-reply@allegro.cc (gillius) — Tue, 18 Apr 2006 00:19:47 +0000

You would need to use wstring or basic_string. I am not an expert in this, but I believe it is facets and/or locales that allow you to do multi-byte character strings. I wonder if it is possible to write an Allegro string facet to allow Allegro UTF strings in standard library strings.

no-reply@allegro.cc (Biznaga) — Tue, 18 Apr 2006 11:26:49 +0000

Allegro seems to provide a wide range of string routines, so I think I could make a very functional String class with them.

I forgot to mention something important: my std::string based class also uses regular expressions (I'm using the rx library). Will this new string class work along with rx too? And, what about Lua scripts? I just read this, but has anybody worked with Lua scripts and Unicode?

I'm starting to believe I should stick with U_ASCII if I don't really need UTF-8. At least, U_ASCII is good enough for most occidental languages, right?

no-reply@allegro.cc (Kitty Cat) — Tue, 18 Apr 2006 11:43:32 +0000

UTF-8 is fully compatible with 7-bit ASCII. Any character in the 0-127 (inclusive) range is the same in both UTF-8 and ASCII.

no-reply@allegro.cc (Tobias Dammers) — Tue, 18 Apr 2006 12:03:43 +0000

...including the special characters 0 through 31. Which is why a std::string can easily hold UTF-8 data; the only thing that is not reliable is accessing a single character by index, since operator[] and at() count bytes, not characters.
I'm curious about the allegro-string class, though. BTW, I'd use std::string as a base and only implement unicode-specific functionality through allegro. This'll save you from re-coding the memory allocation code yourself.

no-reply@allegro.cc (X-G) — Tue, 18 Apr 2006 16:32:44 +0000

std::basic_string

no-reply@allegro.cc (Tobias Dammers) — Tue, 18 Apr 2006 16:42:28 +0000

This doesn't handle variable-width characters though; allegro does.

no-reply@allegro.cc (X-G) — Tue, 18 Apr 2006 16:54:53 +0000

That's right, but they work smashingly for Unicode.

no-reply@allegro.cc (Thomas Fjellstrom) — Tue, 18 Apr 2006 16:55:34 +0000

And don't forget that UTF-8, which is also a Unicode encoding, can and will encode less commonly used characters with up to 6 bytes.

Edit: Wikipedia seems to say it's 4 bytes. Is it 4?

no-reply@allegro.cc (Dennis) — Tue, 18 Apr 2006 19:55:17 +0000

It is 4. The standard document after which I modeled my own routines for encoding to and decoding from UTF8, says so as well.
Sequences longer than 4 bytes are invalid.
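As a rough illustration of the four-byte limit, here is a minimal encoder sketch. It is not Dennis's routine and not Allegro's code; the function name utf8_encode is hypothetical. It assumes the usual UTF-8 rules: code points up to U+10FFFF, surrogates excluded.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for values that cannot be encoded
// (surrogates U+D800..U+DFFF or anything above U+10FFFF).
std::string utf8_encode(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {          // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Since the highest valid code point is U+10FFFF, the four-byte branch is the longest one that can ever be taken.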

Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a C-style string, which you can then convert to whatever Unicode format you have set for Allegro.
(In 4.2rc2 the format of the c_str() retrieved from a standard wchar_t string seemed to be equal to Allegro's U_UNICODE format.)

no-reply@allegro.cc (Biznaga) — Wed, 19 Apr 2006 08:51:27 +0000

Tobias Dammers said:

I'd use std::string as a base and only implement unicode-specific functionality through allegro. This'll save you from re-coding the memory allocation code yourself.

Isn't Allegro already doing this with functions like uinsert, uremove, ustrcat?

Dennis Busch said:

Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a C-style string

Really? Wouldn't c_str return a const wchar_t* instead of a const char*? Can I pass a const wchar_t* as if it were a const char* as a function parameter? I'm concerned about whether rx will work with Unicode strings. It takes const char* parameters.

no-reply@allegro.cc (Kitty Cat) — Wed, 19 Apr 2006 09:08:50 +0000

Quote:

Can I pass a const wchar_t* as if it were a const char* as a function parameter?

No, you'd have to convert it. wchar_t is 2 bytes per character (on Windows; it is typically 4 on Unix-like systems), whereas a char is just 1 (unless it's UTF-8, where a character may take several chars, but it's still compatible for the most part).

no-reply@allegro.cc (Tobias Dammers) — Wed, 19 Apr 2006 10:57:12 +0000

16-bit encoding is not much of a problem; you just use wchar_t. You do need to take special precautions for output, though: either convert to an 8-bit code table, or use 16-bit output routines.
UTF-8 has variable character widths, which is a bit of a problem. Allegro can output these, and you can store them in a std::string, but this will not give you correct lengths and indices. Even worse, an index may point into the middle of a multi-byte character. If you can live with these issues, then use std::string with UTF-8. If you can use UTF-16, use a wide string and UTF-16 output routines. If you have to use UTF-8, and need a string class that handles indices and everything correctly, then I guess you're stuck with writing your own string.
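The indexing problem can be sketched as follows, assuming well-formed UTF-8 stored in a std::string. Both helper names are hypothetical; for Allegro strings, I believe uoffset serves a similar purpose. The first function tells whether a byte index lands on the start of a character, the second maps a character index to a byte offset.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if byte position `pos` is the start of a
// character (or the end of the string), false if it falls inside a
// multi-byte sequence. Continuation bytes look like 10xxxxxx.
bool is_char_boundary(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return pos == s.size();
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Hypothetical helper: byte offset of the n-th character (0-based),
// or std::string::npos if the string has fewer characters than that.
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_char_boundary(s, i)) {
            if (n == 0) return i;
            --n;
        }
    }
    return std::string::npos;
}
```

With "ni\xC3\xB1o", byte index 3 is not a character boundary, and the fourth character ('o') sits at byte offset 4 rather than 3. Note that utf8_offset walks the string from the start, so character indexing is linear in the string length, not constant time.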

no-reply@allegro.cc (Dennis) — Wed, 19 Apr 2006 11:13:16 +0000

Quote:

Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string

Really? Wouldn't c_str return a const wchar_t* instead of a const char*?

Unfortunately you cut off what I said before the all-important part: "[..], which you can then convert to whatever unicode format you have set for Allegro."
So what you have to do is, of course, interpret the result of c_str() as a char* so that you can use Allegro's conversion function to make an Allegro-usable Unicode string out of it.
Example:

// (Assume that temp is a non-empty std::wstring and that wchar_t is 2 bytes wide.)
char *outstring = NULL;

// Allocate memory for the Allegro Unicode string.
int tmp_size = (int)((temp.length() + 1) * uwidth_max(U_CURRENT));
// Plain new throws std::bad_alloc instead of returning NULL, so use the
// nothrow form (declared in <new>) to make the check below meaningful.
outstring = new (std::nothrow) char[tmp_size];
if (!outstring) // error: not enough memory for current string
{
  // do error handling
}
else // now convert to Allegro's format
{
  memset(outstring, 0, tmp_size);
  do_uconvert(reinterpret_cast<const char *>(temp.c_str()), U_UNICODE, outstring, U_CURRENT, tmp_size);
  // ... use outstring, then release it with delete[] outstring;
}

This will work as long as Allegro's U_UNICODE hasn't changed since 4.2rc2, because back in that version Allegro's U_UNICODE used a fixed two bytes per character.

no-reply@allegro.cc (Biznaga) — Wed, 19 Apr 2006 12:27:17 +0000

You are right. I understand the conversion part. I was thinking of the opposite case: converting Allegro's strings to C strings that rx could use. Maybe I can convert to U_ASCII, but that could result in data loss or corruption, and it makes no sense at all if what I'm trying to do is use Unicode.

no-reply@allegro.cc (gillius) — Wed, 19 Apr 2006 17:40:37 +0000

UTF-16 or "wide character" strings used to be convenient until there were more than 65535 characters in Unicode, so now in UTF-16 a character can be as large as 32 bits (yeah). However, if you assume that all of your strings are going to stay on the BMP (basic multilingual plane), then you'll probably be fine for almost any application to assume that a character will not have a multi-unit representation. If you want to be safe, check that no characters are in the range D800-DBFF when you get strings from external sources. I am going to assume that Allegro's Unicode functions probably don't implement this portion of UTF-16.
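The safety check suggested above could look roughly like this. It is a sketch under assumptions: it uses C++11's std::u16string rather than anything from Allegro, the function name is hypothetical, and it rejects the whole surrogate range D800-DFFF (not just the lead surrogates D800-DBFF) so that stray trail surrogates are caught too.

```cpp
#include <string>

// Hypothetical helper: true if every 16-bit unit is a complete character,
// i.e. the string contains no surrogates (0xD800..0xDFFF) and therefore
// stays entirely within the basic multilingual plane.
bool is_bmp_only(const std::u16string& s) {
    for (char16_t unit : s) {
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            return false;  // part of a surrogate pair: a character above U+FFFF
        }
    }
    return true;
}
```

If the check passes, length and indexing on the 16-bit string match character counts, which is the assumption fixed-width handling relies on.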

no-reply@allegro.cc (Shawn Hargreaves) — Wed, 19 Apr 2006 23:42:57 +0000

The 16 bit range pretty much covers all current languages except Chinese. You can do one type of Chinese (as spoken in Taiwan etc) with just 16 bit characters, but for full Chinese Chinese, you need some code points above 65535. That's the only interesting thing beyond 16 bit, though: all the other extensions are for dead or theoretical languages that nobody actually speaks today.

no-reply@allegro.cc (gillius) — Thu, 20 Apr 2006 07:16:34 +0000

I didn't know there were full Chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...

no-reply@allegro.cc (Biznaga) — Thu, 20 Apr 2006 09:43:08 +0000

Well, CJK is simply too much for me.

At most, I'd like to support English and Spanish in my game (BTW, an RPG), and maybe a horrendous French translation. Occidental languages. So UTF-16 is OK for me; even U_ASCII, I think.

I'll try both with std::string and a wrapper around Allegro's Unicode routines. Thank you, everybody!

I have another related question: ASCII has 7-bit characters, Allegro's U_ASCII has 8-bit. Are the last 128 characters in U_ASCII the same as in ISO-Latin-1, or does it depend on the locale?

Talking about translation, does anybody know of a good tutorial for gettext?

no-reply@allegro.cc (Jakub Wasilewski) — Thu, 20 Apr 2006 15:15:33 +0000

Quote:

I didn't know there were full chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...

Plane 1 (supplementary multilingual plane) is reserved for dead and ancient scripts and some other stuff. Plane 2 however (I believe it's called supplementary ideographic plane) is filled with about 40000 older, traditional and rarely used Chinese symbols that simply didn't fit in the basic multilingual plane.
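For reference, the plane of a code point is simply its value divided by 65536. A tiny sketch follows; the function name is hypothetical.

```cpp
// Hypothetical helper: plane number of a Unicode code point.
// 0 = basic multilingual plane, 1 = supplementary multilingual plane,
// 2 = supplementary ideographic plane, and so on up to 16.
unsigned plane_of(unsigned long cp) {
    return static_cast<unsigned>(cp >> 16);
}
```

Anything with a non-zero plane needs a surrogate pair in UTF-16 and four bytes in UTF-8.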

And AFAIK hieroglyphs are not in Unicode... yet. But we have these in Unicode, so...
