Should std::wstring be avoided, for portability?
Peter Wang
Member #23
April 2000

The font addon doesn't handle decomposed characters, nor right-to-left scripts, etc. It's pretty basic.

To find the beginning of a "character" you'd need to classify each codepoint into its category, which is essentially done with a big if-then-else over different ranges. I think you'd generate it from this file, but I haven't done it myself. More likely you'd find some existing library.
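
Just to show the shape of what I mean, here's a tiny, very incomplete sketch (the ranges below are only a few of the combining-mark blocks; a real table would be generated from the Unicode data files):

#include <cstdint>

// Incomplete illustration: classify a codepoint as a combining mark by range.
// Only a handful of ranges are listed; a complete table is much larger.
bool is_combining_mark(std::uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F)   // Combining Diacritical Marks
        || (cp >= 0x1AB0 && cp <= 0x1AFF)   // Combining Diacritical Marks Extended
        || (cp >= 0x20D0 && cp <= 0x20FF)   // Combining Diacritical Marks for Symbols
        || (cp >= 0xFE20 && cp <= 0xFE2F);  // Combining Half Marks
}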

Karadoc ~~
Member #2,749
September 2002
avatar

Hmm.

Well if Allegro isn't going to be able to print those kinds of characters anyway, then I guess there's no hurry for me to support them in my text box. Ultimately I'd like for all this stuff to work properly, but I don't want to spend too long on this given that I don't intend to use non-ascii characters anyway.

For the time being, I think I'll just make placeholder functions for 'next character' and 'previous character', and have those functions simply look for the first unit of the code-point. (I don't yet know how to do that either, but I assume there's some particular set of bits which signal which unit is the first unit; and I'm pretty sure al_draw_text will have that!)

Thanks for the info.

-----------

Raidho36
Member #14,628
October 2012
avatar

In UTF-8, the leading byte of a multi-byte sequence always starts with the bits 11, and the trailing (continuation) bytes always start with 10; a plain ASCII character is a single byte starting with 0. The number of 1-s in a row at the top of the leading byte tells you exactly how many bytes the sequence occupies, so you can read the first byte and jump to the next character right away. But if you need to find the previous character, you're stuck walking backwards byte by byte, skipping the continuation bytes until you hit a leading byte.
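
A quick, untested sketch of those bit tests (this steps over code points, not composed characters):

#include <cstddef>
#include <string>

// Index of the first byte of the next code point after pos.
std::size_t utf8_next(const std::string &s, std::size_t pos)
{
    if (pos >= s.size()) return s.size();
    ++pos;
    // Continuation bytes look like 10xxxxxx; skip them.
    while (pos < s.size() && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        ++pos;
    return pos;
}

// Index of the first byte of the code point before pos.
std::size_t utf8_prev(const std::string &s, std::size_t pos)
{
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        --pos;
    return pos;
}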

I'm not exactly sure, but IIRC, Allegro's string handling functions do that.

Thomas Fjellstrom
Member #476
June 2000
avatar

I think even with UTF-8, two multi-byte sequences can be combined into one composed character.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Raidho36
Member #14,628
October 2012
avatar

It could. Existing precomposed characters are kept in Unicode for compatibility reasons, but these days Unicode tends to represent characters in decomposed form: a base character plus combining marks that modify it.

Thomas Fjellstrom
Member #476
June 2000
avatar

That is what I thought I said. Two Unicode "characters" can actually render as a single visible character, so you can't really just jump around like you suggested.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Arthur Kalliokoski
Second in Command
February 2005
avatar

Are you guys talking about the things like the "ae" that's smooshed together?

[image: 230.png, showing the æ glyph]

Unicode hexadecimal: 0xe6

They all watch too much MSNBC... they get ideas.

Thomas Fjellstrom
Member #476
June 2000
avatar

No, Unicode has separate base characters and modifiers (like umlauts): they appear as separate code points in the string, but are composed together when drawing.
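
For example (illustrative and untested; the byte values are just the standard UTF-8 encodings): "ä" can be stored either as the single precomposed code point U+00E4, or as 'a' followed by U+0308 COMBINING DIAERESIS. Both should draw the same, but the strings differ:

#include <iostream>
#include <string>

int main()
{
    // Precomposed: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS (2 UTF-8 bytes).
    std::string precomposed = "\xC3\xA4";
    // Decomposed: 'a' plus U+0308 COMBINING DIAERESIS (1 + 2 UTF-8 bytes).
    std::string decomposed = "a\xCC\x88";

    // Same rendered character, different lengths, unequal byte-wise.
    std::cout << precomposed.size() << " vs " << decomposed.size() << "\n"; // 2 vs 3
    std::cout << (precomposed == decomposed) << "\n";                       // 0
}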

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Raidho36
Member #14,628
October 2012
avatar

I didn't exactly mean character, I meant code point. You can indeed jump to the next code point right away after reading just the first byte, because it tells you how many bytes the current code point occupies.

----

Also, AFAIK there can be several code points per character rather than just one or two. So to count your characters you would also need to tell real characters apart from modifiers. Unicode just couldn't keep it simple, as if 4 billion slots weren't enough to store all possible combinations. Bluh. >:(

billyquith
Member #13,534
September 2011

Karadoc ~~ said:

Suppose I'm writing a Linux program, and I choose to support unicode by using UTF-32. On Linux, wchar_t is 32 bits, so I might choose to use std::wstring to store my UTF-32 encoded text. However, if I do this, then try to port the program to Windows, the program won't work anymore because wchar_t on Windows is only 16 bits and so my UTF-32 encoding won't fit anymore. That's why I'm saying wchar_t and std::wstring should be avoided for portability.

Not if you implement it properly. As soon as you encode the Unicode it becomes implementation specific. But you can convert between different encodings. E.g. if you are using UTF-16 on Windows and you want to use Allegro, at some point you'll have to convert between encodings. It is still all Unicode, just stored in a different format.

If you use all of the standard functions for wide strings your std::wstring code will be portable. If you save wide string data on Linux and try to load it on Windows it will fail because the encodings are different. If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it.

If all of your code uses std::wstring and the w-string wide string functions, it will all be portable. If you say it "won't fit", this implies that you are setting hard limits on the size of data items (e.g. in bytes). You shouldn't do this in any encoding, you should calculate that using the coding API, in order to make the code and data portable.

Elias
Member #358
May 2000

The problem I have with std::wstring is that the standard doesn't explicitly say it's UTF-16 or any other encoding. But whenever you read/write a string from/to a file (or pass it to an Allegro function) you need to know the encoding.

From what I understand you would have to use something like this: http://www.cplusplus.com/reference/locale/codecvt/

But the existence of appropriate locales (e.g. a UTF-8 one, to convert to that) at runtime is not guaranteed. Or at least I can't figure out from the documentation how to convert a std::wstring from/to UTF-8 (or any other encoding).
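
My best guess from the reference is something like the following (untested sketch using C++11's std::wstring_convert), but whether it's guaranteed to work everywhere is exactly the part I'm unsure about:

#include <codecvt>
#include <locale>
#include <string>

// Untested guess: C++11 std::wstring_convert with std::codecvt_utf8<wchar_t>.
// codecvt_utf8 treats wchar_t as UCS-2/UCS-4, so on Windows (16-bit wchar_t
// holding UTF-16) std::codecvt_utf8_utf16<wchar_t> would be the safer choice.
std::string wide_to_utf8(const std::wstring &ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

std::wstring utf8_to_wide(const std::string &s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}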

--
"Either help out or stop whining" - Evert

Raidho36
Member #14,628
October 2012
avatar

Elias said:

standard doesn't explicitly say it's UTF-16 or any other encoding

It's the plain Unicode value of a character's code point. It's not encoded. If you want it encoded, you should probably use a byte string and special handling functions.

Vanneto
Member #8,643
May 2007

I wonder if Karadoc ~~ actually knows more about this than before or less. :P

In capitalist America bank robs you.

Arthur Kalliokoski
Second in Command
February 2005
avatar

They all watch too much MSNBC... they get ideas.

Elias
Member #358
May 2000

Vanneto said:

I wonder if Karadoc ~~ actually knows more about this than before or less.

If he ignores all posts by Raidho36, maybe :P

--
"Either help out or stop whining" - Evert

Karadoc ~~
Member #2,749
September 2002
avatar

billyquith said:

Not if you implement it properly. As soon as you encode the Unicode it becomes implementation specific. But you can convert between different encodings. E.g. if you are using UTF-16 on Windows and you want to use Allegro, at some point you'll have to convert between encodings. It is still all Unicode, just stored in a different format.

If you use all of the standard functions for wide strings your std::wstring code will be portable. If you save wide string data on Linux and try to load it on Windows it will fail because the encodings are different. If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it.

If all of your code uses std::wstring and the w-string wide string functions, it will all be portable. If you say it "won't fit", this implies that you are setting hard limits on the size of data items (e.g. in bytes). You shouldn't do this in any encoding, you should calculate that using the coding API, in order to make the code and data portable.

The way I see it, Unicode characters must be encoded in order to exist in computer memory at all. The programmer must choose what kind of encoding to use, and what kind of data structure to store the encoded string in. When I said "it won't fit", what I meant was that a 32-bit code unit won't fit into a 16-bit wchar_t, and so if I used UTF-32 encoding and stored it in a std::wstring, that would work on Linux but not on Windows.

Quote:

If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it.

My point is that I can read UTF-32 straight into a std::wstring iff wchar_t is 32 bits wide. The 32-bit units of UTF-32 won't fit inside a 16-bit wchar_t.

Quote:

If all of your code uses std::wstring and the w-string wide string functions, it will all be portable.

I think this is the key to understanding what you are talking about. As far as I understand, everything else you've said is stuff that I already knew, but here I'm not sure what you mean by "the w-string wide string functions". What functions are you referring to? (Perhaps it's related to codecvt, which Elias mentioned?)

Vanneto said:

I wonder if Karadoc ~~ actually knows more about this than before or less. :P

Well, in this thread there is some misinformation and a significant amount of talking at cross purposes, but I feel like I've learned a fair bit. For example, I've learned that any Unicode code-point takes at most 4 bytes in UTF-8, and that some printable characters are actually composed of multiple Unicode code-points. And most recently, Elias mentioned something called codecvt, which I hadn't heard of before. I'm reading about it now in the C++11 Standard document. A lot of the information in the document is pretty dense, but from what I can tell codecvt sounds useful, and very relevant to what we're talking about. Check this out:

C++11 Standard said:

The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.

So, based on that, here's my understanding of the situation: std::wstring wide_string = L"Hello"; will store the string in an implementation-defined encoding which uses wchar_t. My concern throughout this thread has been that since we don't necessarily know what the implementation-defined encoding is, it's difficult to work with it in a platform-independent way. For example, I can't use al_ustr_new_from_utf16, because wchar_t may or may not be uint16_t; and since I don't know what the encoding is, I don't know how to traverse the wchar_t array to find the starts of characters and the other things I need for my text box widget that I mentioned. For these reasons, I claimed that it's better to simply avoid using wchar_t and instead use something like char16_t (or whatever) so that one could be sure of which type of Unicode encoding is being used.

However, now that I've read that codecvt<wchar_t,char,mbstate_t> will convert between the mystery wchar_t encoding and UTF-8, I see that it is at least possible to write portable code that uses wchar_t. It still seems to me that it's better to pick one of the more explicit character sizes, but at least wchar_t is not fatally flawed.
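
For what it's worth, here's roughly what I have in mind for the explicit-character-size route (an untested sketch: it assumes C++11's <codecvt>, and relies on Allegro's al_ustr_new taking UTF-8):

#include <codecvt>
#include <locale>
#include <string>
#include <allegro5/allegro.h>

// Keep the text as UTF-32 in a char32_t string, and convert to UTF-8 only at
// the boundary where Allegro (or a file, or a socket) needs it.
ALLEGRO_USTR *make_ustr(const std::u32string &text)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(text);  // UTF-32 -> UTF-8
    return al_ustr_new(utf8.c_str());        // al_ustr_new expects UTF-8
}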

-----------

m c
Member #5,337
December 2004
avatar

Karadoc ~~ said:

By the way, keeping in mind that characters aren't always single code-points, does anyone happen to know an easy way to work out where the end of a character is in UTF-8? I'm making a text-box UI widget, and I need to be able to determine where the starts of the characters are so that I can implement the functionality for left and right arrow keys, and mouse selection and stuff like that.

I think the Allegro 4.4 source code had a routine that could do that? Something that counted the character length (not byte length) of a UTF-8 string?

I built my own Unicode lib a few years ago by pillaging the Allegro 4.4 source code, because I didn't like any of the proper Unicode libraries.
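
The core of it is only a few lines anyway; something like this counts code points (not bytes, and not composed characters) by skipping the 10xxxxxx continuation bytes:

#include <cstddef>
#include <string>

// Count code points (not bytes) in a UTF-8 string by skipping the
// continuation bytes, i.e. those of the form 10xxxxxx.
std::size_t utf8_length(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}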

(\ /)
(O.o)
(> <)
