<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>Should std::wstring be avoided, for portability?</title>
		<link>http://www.allegro.cc/forums/view/612635</link>
		<description>Allegro.cc Forum Thread</description>
		<webMaster>matthew@allegro.cc (Matthew Leverton)</webMaster>
		<lastBuildDate>Thu, 30 May 2013 08:12:23 +0000</lastBuildDate>
	</channel>
	<item>
<description><![CDATA[<div class="mockup v2"><p>As I understand it, <span class="source-code">std::wstring</span> is a <span class="source-code">basic_string</span> of <span class="source-code"><span class="k1">wchar_t</span></span> (whereas <span class="source-code">std::string</span> is <span class="source-code">basic_string&lt;char&gt;</span>). But I don&#39;t feel like I have a good understanding of what <span class="source-code"><span class="k1">wchar_t</span></span> &amp; <span class="source-code">std::wstring</span> are actually good for. I don&#39;t understand why they are used, or why they are part of the standard.</p><p>The C++11 standard says this:
</p><div class="quote_container"><div class="title">C++11 said:</div><div class="quote"><p>Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in &lt;stdint.h&gt;, called the underlying types</p></div></div><p>
My interpretation of that is that wchar_t is big enough to fit all characters in the &#39;supported locales&#39;, whatever those might be, with some unspecified encoding (maybe UTF16; or maybe something else). That sounds pretty vague to me; and I usually don&#39;t like vague stuff in programming.</p><p>So why would I want a string of wchar_t? Wouldn&#39;t it be better to use either <span class="source-code">std::string</span>, or <span class="source-code">std::basic_string<span class="k3">&lt;</span>char16_t&gt;</span> (for UTF8 and UTF16 respectively)? What advantage does one get by using <span class="source-code">std::wstring</span>?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 22 May 2013 13:17:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I imagine you&#39;d use it because your <span class="source-code">std::wstring</span> will continue working when a larger extended character set becomes available, whereas <span class="source-code">std::u16string</span> may break. <span class="source-code">std::u32string</span> would probably still work, but may be wider than you need it to be.</p><p>I think <span class="source-code">std::wstring</span> is going to be equivalent to <span class="source-code">std::u32string</span> on most machines, and then equivalent to <span class="source-code">std::u16string</span> or <span class="source-code">std::string</span> on machines that don&#39;t support, like, Chinese. It&#39;s probably safer to use <span class="source-code">std::wstring</span> when you want to use a wide string.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983291#target">Karadoc ~~</a> said:</div><div class="quote"><p> Wouldn&#39;t it be better to use either <span class="source-code">std::string</span>, or <span class="source-code">std::basic_string<span class="k3">&lt;</span>char16_t&gt;</span> (for UTF8 and UTF16 respectively)?</p></div></div><p>I don&#39;t think <span class="source-code">std::u16string</span> is wide enough for UTF16, actually. UTF16 can have 4 byte characters.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Jeff Bernard)</author>
		<pubDate>Wed, 22 May 2013 13:45:46 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>Maybe I&#39;ve misunderstood how these things work, but I&#39;m under the impression that UTF16 works using 16 bit atoms<span class="ref"><sup>[<a href="#">1</a>]</sup></span> of data such that the first 15 bits are used to identify the character, but if those 15 bits aren&#39;t enough, then the final bit is used to signal that the character requires another 16 bit atom. This kind of chain can go on indefinitely, and so there is no size limit to the character set, and no size limit to how many bytes might be &#39;needed&#39; for a single character. (UTF8 is the same, but with 8 bits instead of 16.)</p><p>Maybe I&#39;ve just got the whole thing wrong, but if that&#39;s actually how it works, then I think it would be natural to use a string of 16 bit chunks when encoding in UTF16.
</p><div class="ref-block"><h2>References</h2><ol><li>I don&#39;t know the correct technical name</li></ol></div></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 22 May 2013 18:59:12 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983319#target">Karadoc ~~</a> said:</div><div class="quote"><p> atoms
</p></div></div><p>Variables. Simply variables. Your case is specifically 16 bit int variables.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> UTF16 works using 16 bit
</p></div></div><p>Wrong. UTF-16 works using 2 byte base encoding, but for certain characters it uses up to 6 bytes, as you said. It&#39;s a multibyte encoding, just like UTF-8, where characters may have variable width. The opposite of multibyte is single byte, but of course you can&#39;t say that 4 bytes in UTF-32 is a single byte. But that&#39;s how it is - UTF-32 uses 32 bit ints to hold any possible character, and the character is always exactly 4 bytes long. As for the upper limit, a 32 bit int can hold orders of magnitude more characters than all of mankind has come up with so far, and I don&#39;t think there&#39;ll <i>ever</i> be a wider UTF character than 32 bits. In practice with UTF-8 encoding, the upper 2 bytes (that form the last 12 bits) aren&#39;t even reserved, not to mention actually used.</p><p>You should only use constant-width character strings for unicode if you use 32 bit variables to store them, and the encoding must be UTF-32 (the only constant-width unicode). That ensures that you&#39;ll have no problem with processing your strings and otherwise dealing with them.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 22 May 2013 21:38:07 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>@Raidho36, you said I was &#39;wrong&#39; in my description, but then went on to describe something completely consistent with what I said.</p><p>Also, I&#39;m sure &#39;variables&#39; is not the right word for what I was talking about. It would be pretty weird to call the individual segments of each character a &#39;variable&#39;. — Inspired by that weird suggestion, I checked the Wikipedia article for UTF16. What I referred to as <i>atoms</i>, Wikipedia calls <i>units</i>.</p><p>I only skimmed over that Wikipedia article, but I found that my original description wasn&#39;t quite right. Wikipedia implies that 32 bits is the maximum size allowed by UTF16, and that those full 32 bits can only store up to 1,112,064 different characters!? As I said, I only skimmed the article, but to me that sounds wrong.</p><p>In any case, if UTF16 can not have indefinitely many units for a single character then I suppose that means using <span class="source-code"><span class="k1">wchar_t</span></span> is fine as long as it is the same size as the maximum size of a UTF16 character. It still seems a bit fishy to me - but maybe my original concern was just due to my (erroneous) belief that UTF16 / 8 could have as many units as they wanted per character.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 23 May 2013 04:15:38 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I think char16_t and char32_t are new in C++11.  std::wstring and wchar_t are older, and were probably used for 16 bit character sets like Chinese and the original Unicode.  Today, wstring and wchar_t default to 32 bits on some platforms (Linux), but are still 16 bits on Windows (where the OS uses UTF-16 internally). I hope this clears it up a bit.</p><p>As for what is the correct way to deal with this in C++11, I don&#39;t know. For an Allegro project, I suppose you would use std::string or a custom class that wraps A5&#39;s UTF-8 support.  It depends on your needs and what libraries you are interfacing with.</p><p>There&#39;s lots of good Unicode information easily available on the web, like this: <a href="http://www.unicode.org/faq/utf_bom.html">http://www.unicode.org/faq/utf_bom.html</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Thu, 23 May 2013 05:39:52 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983319#target">Karadoc ~~</a> said:</div><div class="quote"><p> Wikipedia implies that 32 bits is the maximum size allowed by UTF16, and that those full 32 bits can only store up to 1,112,064 different characters!?
</p></div></div><p>You probably skimmed through too fast. Unicode is designed to be used with 32 bit characters, so all UTFs are designed to hold up to 32 bits of data, therefore the maximum number of characters possible for all UTFs is 4294967296.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> Also, I&#39;m sure &#39;variables&#39; is not the right word for what I was talking about.
</p></div></div><p>Memory chunks? Array cells? There&#39;s no standard way of calling <i>a block of memory</i> something. So they&#39;re just notional (non-physically-existing, that is) &quot;units&quot; - variables, in a sense. Since it&#39;s simply a bunch of bytes, there&#39;s no real distinguishing between them other than the compiler&#39;s understanding of their meaning. Try recording an <span class="source-code"><span class="k1">int</span></span> into a void-pointed area in memory and then reading it as a <span class="source-code"><span class="k1">float</span></span>, or vice versa, or try to read it with a small offset, like 1 byte or so. You&#39;ll have garbage, but you&#39;ll have your <i>bytes</i> exactly as you would expect them (minus endianness) and it&#39;ll work fine and the compiler won&#39;t complain. That explanation is rough, but should give you a good idea of what&#39;s going on.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Thu, 23 May 2013 08:43:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983385#target">torhu</a> said:</div><div class="quote"><p>For an Allegro project, I suppose you would use std::string or a custom class that wraps A5&#39;s UTF-8 support.  It depends on your needs and what libraries you are interfacing with.</p></div></div><p>
This sounds like good advice to me. For my own (current) purposes, I don&#39;t actually need any non ASCII characters anyway. I just like to get into the habit of doing this the &#39;right way&#39; so that if I ever need this stuff in the future then I will know what to do — or if I want to reuse my code for something else, it won&#39;t be hard to adapt.</p><p>Before starting this thread I saw something on Stack Overflow which basically said to always use wstring for everything when programming on Windows. The S.O. post had a bazillion upvotes, but I wasn&#39;t comfortable about changing all my <span class="source-code">string</span>s into these <span class="source-code">wstring</span> things which might mess things up if I attempt to port to *nix.</p><p>I think that if I just stick with <span class="source-code">std::string</span> and try to remember that counting letters will not necessarily give me the offset in the <span class="source-code">string</span>, then UTF8 should work ok when I need it.</p><p>(And I still think using <span class="source-code"><span class="k1">wchar_t</span></span> is probably a bad idea if one is trying to write portable code.)</p><p>--<br />[edit]
</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983392#target">Raidho36</a> said:</div><div class="quote"><p>You probably did skimmed through too fast. Unicode is designed to be used with 32 bit characters, so all UTFs are designed to hold up to 32 bits of data, therefore for all UTFs maximum amount of characters possible is 4294967296.</p></div></div><p>Even if the UTF encoding had no redundancy or waste of any kind, it couldn&#39;t encode the full <img class="math" src="http://www.allegro.cc/images/tex/c/8/c8f13b7336e222fe27618051005ad0bb-96.png" alt="&lt;math&gt;2^32&lt;/math&gt;" /> different characters that you are suggesting. If nothing else, UTF8 needs to use some of its bits to indicate whether or not the full four bytes will be there. The value stated by wikipedia (1,112,064) sounds too low to me, but it is certainly possible if there are simply a lot of disallowed combinations of bits. The value you stated (4294967296) is too large to be possible with a maximum of only 4 bytes in UTF8 or UTF16.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>Memory chunks? Array cells? There&#39;s no standard way of calling a block of memory something. So it&#39;s simply virtual existed (non-real existent that is) &quot;units&quot;, a variables. Provided it&#39;s simply a bunch of bytes, there&#39;s no real distingquishing between them other than compiler&#39;s understanding of their meaning.</p></div></div><p>I wasn&#39;t talking about an arbitrary bunch of bytes though. And &#39;variable&#39; has some connotations which I don&#39;t think are appropriate for what I was talking about. For example, I might use UTF8 to encode a unicode character like <span class="source-code">char32_t unicode_char</span>. In that case, the &#39;variable&#39; is called <span class="source-code">unicode_char</span>. 
It could be confusing if each of the bytes inside that 32 bit variable were also referred to as variables...
</p><div class="source-code snippet"><div class="inner"><pre>char8_t first_variable <span class="k3">=</span> unicode_char <span class="k3">&amp;</span> <span class="n">0xFF0000</span><span class="k2">;</span>
<span class="c">// now we have a variable called 'first_variable'. This is not nice.</span>
</pre></div></div><p>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 23 May 2013 08:48:01 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983336#target">Raidho36</a> said:</div><div class="quote"><p>
Variables. Simply variables. Your case is specifically 16 bit int variables
</p></div></div><p>

Wrong. They are called &#39;code units&#39;. Variables are a programming language construct and have nothing to do with Unicode encodings.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
but for certain characters it uses up to 6 bytes
</p></div></div><p>

Wrong. Code points are encoded to one or two UTF-16 code units, i.e. two bytes or four bytes.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983392#target">Raidho36</a> said:</div><div class="quote"><p>all UTFs are designed to hold up to 32 bits of data</p></div></div><p>

Wrong. UTF-16 can only encode code points up to 0x10ffff, minus those values in the range U+D800..U+DFFF which are used to encode surrogate pairs. That is where the 1,112,064 comes from.</p><p>UTF-8 as originally designed can encode all 2^32 code points, but since Unicode only goes up to 0x10ffff (limited by UTF-16) UTF-8 also only encodes up to 0x10ffff.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Peter Wang)</author>
		<pubDate>Thu, 23 May 2013 09:21:01 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983397#target">Peter Wang</a> said:</div><div class="quote"><p>Wrong. They are called &#39;code units&#39;. <b>Variables</b> are a programming language construct and have <b>nothing to do with Unicode</b> encodings.</p></div></div><p>
WRONG PETER! My unicode xhtml transitional documents want to have a word with you.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (m c)</author>
		<pubDate>Thu, 23 May 2013 09:51:47 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>I&#39;m not familiar with UTF-16, sorry.</p><p>Karadoc, UTF-8 is designed to use up to 6 bytes of data because it occupies some bits for system purposes, so it could form 32 bits of data with 6 bytes rather than 4.</p><p>As for wchar_t, it&#39;s a part of the standard, so no problem with portability whatsoever could possibly arise unless you&#39;re using a non-standard-compliant C version on the target platform.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Thu, 23 May 2013 09:53:57 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983402#target">Raidho36</a> said:</div><div class="quote"><p>As for wchar_t, it&#39;s a part of the standard, so no problem with portability whatsoever could possibly arise unless you&#39;re using non standard-compliant C version on target platform.</p></div></div><p>
I don&#39;t think you&#39;ve understood what I&#39;m trying to say. I already know that <span class="source-code"><span class="k1">wchar_t</span></span> is a standard part of C++. I even quoted the description of it from the official C++11 document in my first post. My point is that the size of <span class="source-code"><span class="k1">wchar_t</span></span> might vary depending on how the code is compiled, and thus it could create problems whenever one changes how their code is compiled. It seems to me that it would be better to use a more predictable type such as <span class="source-code">char32_t</span>.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>Karadoc, UTF-8 is designed to use up to 6 bytes of data because it does occupies some bits for system purposes, so it could form 32 bits of data with 6 bytes rather than 4.</p></div></div><p> From what I can tell, UTF-8 only uses up to 4 bytes. Peter Wang said something that implies an older version of UTF-8 might have used more, but that doesn&#39;t seem to be the case any longer based on what I&#39;ve read.<br />Wikipedia says this:
</p><div class="quote_container"><div class="title"><a href="https://en.wikipedia.org/wiki/UTF-8">wikipedia</a> said:</div><div class="quote"><p>UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes</p></div></div><p>
I also checked the &#39;Unicode Standard Version 6.2 – Core Specification&#39;
</p><div class="quote_container"><div class="title"><a href="http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf">Unicode standard</a> said:</div><div class="quote"><p>In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex [than UTF-32].</p></div></div><p>
But all that stuff is beside the point of the original question anyway.</p><p>Actually, the fact that UTF-32 covers all of unicode is relevant - because that suggests that <span class="source-code"><span class="k1">wchar_t</span></span> should be 32 bits wide in order to meet the definition in the C++11 standard. That sounds fair and reasonable and reliable... except that I&#39;ve seen a few different people claim that <span class="source-code"><span class="k1">wchar_t</span></span> is 16 bits on Windows. -- I&#39;m not really sure what &quot;on Windows&quot; means, though, given that it would be determined by the compiler rather than by the operating system. I guess I could just test it on VC++ and MinGW, but the fact that people are saying that it is sometimes 16 bits and sometimes 32 bits is discouraging enough for me to conclude that it&#39;s best to use <span class="source-code">char32_t</span> to avoid confusion.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 23 May 2013 11:39:47 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983408#target">Karadoc ~~</a> said:</div><div class="quote"><p>  It seems to me that it would be better to use a more predictable type such as char32_t
</p></div></div><p>That would be more consistent, although it may not match your wide character implementation, so you may run into certain problems.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> From what I can tell, UTF-8 only uses up to 4 bytes. 
</p></div></div><p>Oh well, I didn&#39;t dig in too deep; it was enough for me that I use UTF-8 encoding when writing to file (what I wrote was programmed for 6-byte sequences) and instantly decode it into UTF-32 on read to use internally, so I didn&#39;t bother much with spec changes. 1,112,064 is way more than enough anyway, even if suboptimal. The real difference between UTF-32 and the other two is that the former is constant-length whereas the others are variable-length. This property allows random access, which is a great deal in terms of performance. If you simply take input, display and discard your strings, then using UTF-8 internally is fine. But if you bend and twist them around a lot - you&#39;re <span class="cuss"><span>fuck</span></span>ed.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> I&#39;ve seen a few different people claim that wchar_t is 16 bits on Windows.
</p></div></div><p>It is. Check for yourself. The upshot is that without enabling specific obscure settings, your wide characters will be two bytes long.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Thu, 23 May 2013 22:56:52 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983408#target">Karadoc ~~</a> said:</div><div class="quote"><p>except that I&#39;ve seen a few different people claim that wchar_t is 16 bits on Windows.</p></div></div><p>

That is probably for commonality with the win32 subsystem&#39;s native Unicode support, which is UTF-16.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (m c)</author>
		<pubDate>Fri, 24 May 2013 05:51:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Right, I seem to remember making that exact point recently.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Fri, 24 May 2013 06:13:56 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>My interpretation of the standard is that a single <span class="source-code"><span class="k1">wchar_t</span></span> should be able to encode every possible unicode character; and so it needs to be 32 bit.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).</p></div></div><p>
One could argue that 16 bits are enough to meet the requirements because the standard doesn&#39;t explicitly say that a <i>single</i> <span class="source-code"><span class="k1">wchar_t</span></span> should be able to encode every possible character - but if one accepts that argument, they must also accept that a single bit would be enough, and so the requirement would be effectively meaningless.</p><p>From my point of view, there&#39;s nothing wrong with utf-16, and I&#39;m sure it&#39;s convenient for <span class="source-code"><span class="k1">wchar_t</span></span> to be 16 bits when dealing exclusively with utf-16; but I just don&#39;t think that&#39;s what the standard asks for, and I expect it would be a pest for portability. I suspect it&#39;s probably not really about different interpretations of the standard, but rather about supporting legacy code.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983428#target">Raidho36</a> said:</div><div class="quote"><p>That would be more consistent, although may not match your wide character implementation, so you may run into certain problems.</p></div></div><p>Obviously if the programmer is explicitly specifying the number of bits, they would choose the number of bits that match the encoding they want to use. On the other hand, <span class="source-code"><span class="k1">wchar_t</span></span> might not be the right size, because the size of <span class="source-code"><span class="k1">wchar_t</span></span> is not chosen by the programmer. That&#39;s the point I&#39;m trying to make.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Fri, 24 May 2013 07:00:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983468#target">Karadoc ~~</a> said:</div><div class="quote"><p> My interpretation of the standard is that a single wchar_t should be able to encode every possible unicode character; and so it needs to be 32 bit.</p></div></div><p>Nope. I&#39;ll give you a hint: C++ wchar_t could very well be older than Unicode. And another one: the internet is full of actual, real information about C++. Although that&#39;s not true about a lot of other subjects.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Fri, 24 May 2013 07:09:10 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>@torhu, come on man. I quoted the C++11 standard here. That&#39;s the most solid piece of &quot;real information about C++&quot; I can imagine. I fully understand that wchar_t could be older than unicode, but that doesn&#39;t mean unicode isn&#39;t &#39;specified among the supported locales&#39;. If you know something which invalidates my interpretation of the standard, can you just say it? I don&#39;t know what the benefit is of saying &#39;<i>I know the answer but I don&#39;t want to tell you</i>&#39;.</p><p>In any case, I&#39;m pretty satisfied that I have the answer to my original question. Although no one here seems to be saying it, I think the answer is &#39;<i>yes. wchar_t should be avoided when portability is important</i>&#39;.</p><p>I found this quote on wikipedia:
</p><div class="quote_container"><div class="title"><a href="https://en.wikipedia.org/wiki/Wide_character">wikipedia quoting 2003 Unicode standard</a> said:</div><div class="quote"><p>
The ISO/IEC 10646:2003 Unicode standard 4.0 says that:</p><p>    &quot;The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.&quot; </p></div></div><p>
So there it is: an unambiguous recommendation to not use wchar_t in programs that need to be portable.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Fri, 24 May 2013 07:39:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>A Unicode character can be 32 bits, but wchar_t is only 16 bits on Windows. That&#39;s all there is to it. The problem with wchar_t for cross platform applications could be just that it sucks to use 32 bits for each character.  I don&#39;t know if there are issues with Unicode string literals or whatever.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Fri, 24 May 2013 08:14:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983468#target">Karadoc ~~</a> said:</div><div class="quote"><p> the standard doesn&#39;t explicitly say that a single wchar_t should be able to encode every possible character
</p></div></div><p>It does. But note it says &quot;supported locales&quot;. That means if the total number of characters across all supported locales fits in 16 bits, it&#39;ll be 16 bits.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> it&#39;s convenient for wchar_t to be 16 bits when dealing exclusively with utf-16
</p></div></div><p>Nope. <span class="source-code"><span class="k1">wchar_t</span></span> is specifically a <i>wide character</i>, as opposed to the <span class="source-code"><span class="k1">char</span></span> <i>regular character</i>, and it is to be used with wide character strings. These all assume your characters are constant length with no special encoding, since they are processed with wide character string functions that differ from the regular ones only by using <span class="source-code"><span class="k1">wchar_t</span></span> instead of <span class="source-code"><span class="k1">char</span></span>. That&#39;s what I was talking about when I spoke of &quot;certain problems.&quot; Although you may miraculously have everything work fine on its own.
</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983472#target">torhu</a> said:</div><div class="quote"><p> I don&#39;t know if there are issues with Unicode string literals or whatever.
</p></div></div><p>The logic behind this is &quot;what&#39;s the point of giving wide characters 32 bits if the damn device can only display ASCII?&quot; The target platform may not necessarily support unicode fully. What do you do? You adjust your text input function to work with characters <i>this</i> wide. But if you have the balls, you may hook up custom libraries that add support for 32 bit unicode input, processing and display.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Fri, 24 May 2013 11:08:41 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>Don&#39;t think about wstring, think about Unicode. Unicode has a large set of &quot;code points&quot;, i.e. imagine every language and all the characters available (a LOT!). You cannot fit all of these into the de facto unit of character storage: the byte (i.e. range 0-255). Back when computers were simpler there was only ASCII (range 0-127).</p><p>If you want to have a large set of &quot;codes&quot; you can store these in different ways. Ignore Unicode for a minute and think of all the ways that you could do it. E.g. if you know ASCII has a spare bit, you can use this to specify that the code spills over into the next byte, i.e. multi-byte. Or maybe you decide not to use a byte, but to use 2 bytes, or 4 bytes, and chain these together. You might make this decision based on the architecture of the processors you are targeting, or how many languages you want to support.</p><p>std::string can hold UTF-8 encoding (i.e. multi-byte 8 bit chars). This is possible because std::string is not null terminated, it is 8 bit pure. I.e. you can store &#39;\0&#39;, and any char in a string. So std::string is backwards compatible with null terminated 8 bit strings and ASCII. std::wstring works like std::string but holds &quot;wide characters&quot; (i.e. &gt;8 bit) and is not backwards compatible with 8 bit strings.</p><p>std::string and 8 bit ASCII strings are &quot;narrow&quot; and std::wstring is &quot;wide&quot;. Note, when we say wide, the size of a character is not specified as it is platform/compiler specific.</p><p>So your decision is: <b>how to support Unicode</b>, given the above information.</p><p>You might look at the APIs you are going to use, e.g. if you are only using Allegro, you might use std::string and UTF-8, because that is what Allegro uses internally. 
Otherwise you have to convert any wide strings to UTF-8 (narrow) for Allegro to use.</p><p>If you are using a library that only supports wide strings then you might use wide strings exclusively. Some APIs support both with a define.</p><p>If you are writing a library to be made public you might want to bear all this in mind, that some people might want to use narrow, and others wide, chars. Most libraries tend to assume ASCII, or narrow encoding. If you are including Windows, then I think you really have to support wide encoding because it only really supports localisation properly using wide encoding (UTF-16 in this case). All of the newer .net stuff uses this internally. It&#39;s a PITA!</p><p>Soooo... if your question is related to Allegro, I&#39;d say use narrow encoding. You can have the simplicity of ASCII strings, and use UTF-8 Unicode to localise, which Allegro also uses. If you use wide strings you&#39;ll just have to convert them to narrow ones every time you call a text rendering function Allegro.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (billyquith)</author>
		<pubDate>Mon, 27 May 2013 12:35:35 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983700#target">billyquith</a> said:</div><div class="quote"><p>Don&#39;t think about wstring, think about Unicode.<br />[...]<br />So your decision is: how to support Unicode, given the above information.</p></div></div><p>
Suppose I&#39;m writing a Linux program, and I choose to support unicode by using UTF-32. On Linux, <span class="source-code"><span class="k1">wchar_t</span></span> is 32 bits, so I might choose to use <span class="source-code">std::wstring</span> to store my UTF-32 encoded text. However, if I do this, then try to port the program to Windows, the program won&#39;t work anymore because <span class="source-code"><span class="k1">wchar_t</span></span> on Windows is only 16 bits and so my UTF-32 encoding won&#39;t fit anymore. — That&#39;s why I&#39;m saying <span class="source-code"><span class="k1">wchar_t</span></span> and <span class="source-code">std::wstring</span> should be avoided for portability.</p><p>--</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983428#target">Raidho36</a> said:</div><div class="quote"><p>The real difference between UTF-32 and other two is that former is constant-length whereas others are variable-length. This property of it allows random access, which is a great deal in terms of performance.</p></div></div><p>
I just read something in the Unicode standard which seems to invalidate what you were saying here. Check this out:
</p><div class="quote_container"><div class="title">Unicode standard v6.2 said:</div><div class="quote"><p>Characters Versus Code Points. In any event, Unicode code points do not necessarily match user expectations for “characters.” For example, the following are not represented by a single code point: a combining character sequence such as &lt;g, acute&gt;; a conjoining jamo sequence for Korean; or the Devanagari conjunct “ksha.” Because some Unicode text processing must be aware of and handle such sequences of characters as text elements, the fixed-width encoding form advantage of UTF-32 is somewhat offset by the inherently variable-width nature of processing text elements. See Unicode Technical Standard #18, “Unicode Regular Expressions,” for an example where commonly implemented processes deal with inherently variable-width text elements owing to user expectations of the identity of a “character.”</p></div></div><p>
If I understand this correctly, they are saying that even with UTF-32, a single 32 bit unit does not necessarily correspond to a character, i.e. some characters might take more than 32 bits to encode. So even with UTF-32, you can&#39;t directly relate the character number to the byte number. (And presumably that&#39;s what you meant by &#39;random access&#39;.)
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Tue, 28 May 2013 12:33:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><a href="http://en.wikipedia.org/wiki/UTF-32">This Wikipedia article</a> says that all UTF-32 <i>code points</i> are exactly 32 bits wide, but a particular character may take more than one code point.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Tue, 28 May 2013 12:50:17 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Oh okay, I&#39;m sorry, there&#39;s special symbols that ain&#39;t real characters, so random access isn&#39;t worth jack <span class="cuss"><span><span class="cuss"><span>shit</span></span></span></span>, I forgot. I just never ran into those, so everything worked fine like that.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 02:14:29 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>No need to be sorry about it. I just mentioned it because I thought you&#39;d like to know. I didn&#39;t know either.</p><p>By the way, keeping in mind that characters aren&#39;t always single code-points, does anyone happen to know an easy way to work out where the end of a character is in UTF-8? I&#39;m making a text-box UI widget, and I need to be able to determine where the starts of the characters are so that I can implement the functionality for left and right arrow keys, mouse selection, and stuff like that.</p><p>(I&#39;m going to leave it alone for the time being and finish it later, and if I don&#39;t find any other source of info I&#39;ll check the source code for <span class="source-code"><a href="http://www.allegro.cc/manual/al_draw_text"><span class="a">al_draw_text</span></a></span> to see how it is done there.)
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 29 May 2013 04:33:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>The font addon doesn&#39;t handle decomposed characters, nor right-to-left scripts, etc. It&#39;s pretty basic.</p><p>To find the beginning of a &quot;character&quot; you&#39;d need to classify each codepoint into its category, which is essentially done with a big if-then-else over different ranges. I think you&#39;d generate it from <a href="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt">this file</a> but I haven&#39;t done it myself. More likely you&#39;d find some existing library.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Peter Wang)</author>
		<pubDate>Wed, 29 May 2013 05:59:57 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Hmm.</p><p>Well if Allegro isn&#39;t going to be able to print those kinds of characters anyway, then I guess there&#39;s no hurry for me to support them in my text box. Ultimately I&#39;d like for all this stuff to work properly, but I don&#39;t want to spend too long on this given that I don&#39;t intend to use non-ascii characters anyway.</p><p>For the time being, I think I&#39;ll just make placeholder functions for &#39;next character&#39; and &#39;previous character&#39;, and have those functions simply look for the first unit of the code-point. (I don&#39;t yet know how to do that either, but I assume there&#39;s some particular set of bits which signal which unit is the first unit; and I&#39;m pretty sure <span class="source-code"><a href="http://www.allegro.cc/manual/al_draw_text"><span class="a">al_draw_text</span></a></span> will have that!)</p><p>Thanks for the info.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 29 May 2013 06:26:43 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>In UTF-8, the leading byte of a multi-byte sequence always starts with the bits 11, trailing bytes always start with 10, and single-byte ASCII characters start with 0. The number of 1-s in a row at the top of the leading byte tells you exactly how many bytes the sequence occupies, so you can read the first byte and jump to the next code point right away. To find the previous code point, you scan backwards, skipping bytes that start with 10, until you hit a leading byte. </p><p>I&#39;m not exactly sure, but IIRC, Allegro&#39;s string handling functions do that.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 09:06:16 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I think even with UTF-8, two multi-byte sequences can be combined into one composed character.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Wed, 29 May 2013 09:14:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>It could. Existing precomposed characters are left in Unicode for compatibility reasons, but as they do it now, Unicode tends to decompose a character into a basic character plus special glyphs altering it.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 10:20:48 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>That is what I thought I said. Two unicode &quot;characters&quot; can actually render as a single visible character. So you can&#39;t really just jump around like you suggested.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Wed, 29 May 2013 10:22:35 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Are you guys talking about the things like the &quot;ae&quot; that&#39;s smooshed together?</p><p><span class="remote-thumbnail"><span class="json">{"name":"230.png","src":"\/\/djungxnpq2nug.cloudfront.net\/image\/cache\/2\/d\/2d09d84db8ed2a79358e287f3795f8c4.png","w":289,"h":104,"tn":"\/\/djungxnpq2nug.cloudfront.net\/image\/cache\/2\/d\/2d09d84db8ed2a79358e287f3795f8c4"}</span><img src="http://www.allegro.cc//djungxnpq2nug.cloudfront.net/image/cache/2/d/2d09d84db8ed2a79358e287f3795f8c4-240.jpg" alt="230.png" width="240" height="86" /></span></p><p>Unicode hexadecimal: 0xe6
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Wed, 29 May 2013 10:26:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>No, unicode specifies separate characters and modifiers like umlauts, where they are specified separately in the unicode string, but are composed together when drawing.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Wed, 29 May 2013 10:57:58 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I didn&#39;t exactly mean <i>character</i>, I meant <i>code point</i>. You can indeed jump to the next code point right away after simply reading the first byte, because it says how many bytes the current code point occupies.</p><p>----</p><p>Also, AFAIK there could be several code points per character rather than just one or two. So to count your <i>characters</i> you would also need to tell apart a real character from a modifier. Unicode just couldn&#39;t keep it simple, as if 4 billion slots weren&#39;t enough to store all possible combinations. Bluh. <img src="http://www.allegro.cc/forums/smileys/angry.gif" alt="&gt;:(" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 12:28:35 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983775#target">Karadoc ~~</a> said:</div><div class="quote"><p> Suppose I&#39;m writing a Linux program, and I choose to support unicode by using UTF-32. On Linux, wchar_t is 32 bits, so I might choose to use std::wstring to store my UTF-32 encoded text. However, if I do this, then try to port the program to Windows, the program won&#39;t work anymore because wchar_t on Windows is only 16 bits and so my UTF-32 encoding won&#39;t fit anymore. — That&#39;s why I&#39;m saying wchar_t and std::wstring should be avoided for portability.</p></div></div><p>Not if you implement it properly. As soon as you <b>encode</b> the Unicode it becomes implementation specific. But you can convert between different encodings. E.g. if you are using UTF-16 on Windows and you want to use Allegro, at some point you&#39;ll have to convert between encodings. It is still all Unicode, just stored in a different format.</p><p>If you use all of the standard functions for wide strings your std::wstring code <b>will</b> be portable. If you save wide string data on Linux and try to load it on Windows it will fail because the encodings are different. If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it. </p><p>If all of your code uses std::wstring and the w-string wide string functions, it will all be portable. If you say it &quot;won&#39;t fit&quot;, this implies that you are setting hard limits on the size of data items (e.g. in bytes). You shouldn&#39;t do this in any encoding, you should calculate that using the coding API, in order to make the code and data portable.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (billyquith)</author>
		<pubDate>Wed, 29 May 2013 15:25:55 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>The problem I have with std::wstring is that the standard doesn&#39;t explicitly say it&#39;s UTF-16 or any other encoding. But whenever you read/write a string from/to a file (or pass it to an Allegro function) you need to know the encoding.</p><p>From what I understand you would have to use something like this: <a href="http://www.cplusplus.com/reference/locale/codecvt/">http://www.cplusplus.com/reference/locale/codecvt/</a></p><p>But the existence of appropriate locales (e.g. a UTF-8 one, to convert to that) at runtime is not guaranteed. Or at least I can&#39;t figure out from the documentation how to convert a std::wstring from/to UTF-8 (or any other encoding).
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Elias)</author>
		<pubDate>Wed, 29 May 2013 17:11:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983830#target">Elias</a> said:</div><div class="quote"><p> standard doesn&#39;t explicitly say it&#39;s UTF16 or any other encoding
</p></div></div><p>It&#39;s the plain Unicode value of a <s>character</s> code point. It&#39;s not encoded. If you want it encoded you should probably use a byte string and special handling functions.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 17:56:50 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I wonder if Karadoc ~~ actually knows more about this than before or less. <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Vanneto)</author>
		<pubDate>Wed, 29 May 2013 18:21:43 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><a href="http://www.joelonsoftware.com/articles/Unicode.html">http://www.joelonsoftware.com/articles/Unicode.html</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Wed, 29 May 2013 18:27:10 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983834#target">Vanneto</a> said:</div><div class="quote"><p>I wonder if Karadoc ~~ actually knows more about this than before or less.</p></div></div><p>

If he ignores all posts by Raidho36, maybe <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Elias)</author>
		<pubDate>Wed, 29 May 2013 19:24:50 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983828#target">billyquith</a> said:</div><div class="quote"><p>Not if you implement it properly. As soon as you <b>encode</b> the Unicode it becomes implementation specific. But you can convert between different encodings. E.g. if you are using UTF-16 on Windows and you want to use Allegro, at some point you&#39;ll have to convert between encodings. It is still all Unicode, just stored in a different format.If you use all of the standard functions for wide strings your std::wstring code <b>will</b> be portable. If you save wide string data on Linux and try to load it on Windows it will fail because the encodings are different. If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it. If all of your code uses std::wstring and the w-string wide string functions, it will all be portable. If you say it &quot;won&#39;t fit&quot;, this implies that you are setting hard limits on the size of data items (e.g. in bytes). You shouldn&#39;t do this in any encoding, you should calculate that using the coding API, in order to make the code and data portable.</p></div></div><p>
The way I see it, the unicode characters must be encoded in order to exist in computer memory at all. The programmer must choose what kind of encoding to use, and what kind of data structure to store the encoded string in. When I said &quot;it won&#39;t fit&quot;, what I meant was that a 32-bit unit won&#39;t fit into a <span class="source-code">char16_t</span>, and so if I used UTF-32 encoding and stored it in a <span class="source-code">std::wstring</span>, that would work on Linux but not on Windows.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it.</p></div></div><p>My point is that I can read UTF-32 straight into a <span class="source-code">std::wstring</span> iff <span class="source-code"><span class="k1">wchar_t</span></span> is 32-bits. The 32-bit units of UTF-32 <i>won&#39;t fit</i> inside a 16-bit <span class="source-code"><span class="k1">wchar_t</span></span>.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>If all of your code uses std::wstring and the w-string wide string functions, it will all be portable.</p></div></div><p>I think this is the key to understanding what you are talking about. As far as I understand, everything else you&#39;ve said is stuff that I already knew, but here I&#39;m not sure what you mean by &quot;the w-string wide string functions&quot;. What functions are you referring to? (Perhaps it&#39;s related to <span class="source-code">codecvt</span> which Elias mentioned?)</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983834#target">Vanneto</a> said:</div><div class="quote"><p>I wonder if Karadoc ~~ actually knows more about this than before or less. 
<img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" /></p></div></div><p>Well, in this thread there is some misinformation and a significant amount of talking at cross purposes, but I feel like I&#39;ve learned a fair bit. For example, I&#39;ve learned that the maximum number of bytes per unicode code-point is 4, and that some printable characters are actually composed of multiple unicode code-points. And most recently, Elias mentioned something called <span class="source-code">codecvt</span>, which I hadn&#39;t heard of before. I&#39;m reading about it now in the C++11 Standard document. A lot of the information in the document is pretty dense, but from what I can tell <span class="source-code">codecvt</span> sounds useful, and very relevant to what we&#39;re talking about. Check this out:</p><div class="quote_container"><div class="title">C++11 Standard said:</div><div class="quote"><p>The specialization codecvt&lt;char16_t, char, mbstate_t&gt; converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt &lt;char32_t, char, mbstate_t&gt; converts between the UTF-32 and UTF-8 encoding schemes. codecvt&lt;wchar_t,char,mbstate_t&gt; converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.</p></div></div><p>

So, based on that here&#39;s my understanding of the situation: <span class="source-code">std::wstring wide_string <span class="k3">=</span> L<span class="s">"Hello"</span><span class="k2">;</span></span> will store the string in an implementation defined encoding which uses <span class="source-code"><span class="k1">wchar_t</span></span>. My concern throughout this thread has been that since we don&#39;t necessarily know what the implementation defined encoding is, it&#39;s difficult to work with it in a platform independent way. For example I can&#39;t use <span class="source-code"><a href="http://www.allegro.cc/manual/al_ustr_new_from_utf16"><span class="a">al_ustr_new_from_utf16</span></a></span>, because <span class="source-code"><span class="k1">wchar_t</span></span> may or may not be <span class="source-code"><span class="k1">uint16_t</span></span>; and since I don&#39;t know what the encoding is, I don&#39;t know how to traverse the <span class="source-code"><span class="k1">wchar_t</span></span> array to find the starts of characters and things like that which I need for my text box widget that I mentioned. For these reasons, I claimed that it&#39;s better to simply avoid using <span class="source-code"><span class="k1">wchar_t</span></span> and instead use something like <span class="source-code">char16_t</span> (or whatever) so that one could be sure of which type of unicode encoding is being used.</p><p>However, now that I&#39;ve read that <span class="source-code">codecvt<span class="k3">&lt;</span><span class="k1">wchar_t</span>,<span class="k1">char</span>,mbstate_t&gt;</span> will convert between the mystery <span class="source-code"><span class="k1">wchar_t</span></span> encoding and UTF-8, I see that it is at least possible to write portable code that uses <span class="source-code"><span class="k1">wchar_t</span></span>. 
It still seems to me that it&#39;s better to pick one of the more explicit character sizes, but at least <span class="source-code"><span class="k1">wchar_t</span></span> is not fatally flawed.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 30 May 2013 06:19:46 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983803#target">Karadoc ~~</a> said:</div><div class="quote"><p>By the way, keeping in mind that characters aren&#39;t always single code-points, does anyone happen to know an easy to work out where the end of a character is in UTF-8? I&#39;m making a text-box UI widget, and I need to be able to determine where the starts of the characters are so that I can implement the functionality for left and right arrow keys, and mouse selection and stuff like that.</p></div></div><p>

I think allegro 4.4 source code had a routine that could do that? Something that counted the character length (not byte length) of a utf-8 string?</p><p>I built my own unicode lib a few years ago from pillaging the allegro 4.4 source code because I didn&#39;t like any of the proper unicode libraries.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (m c)</author>
		<pubDate>Thu, 30 May 2013 08:12:23 +0000</pubDate>
	</item>
</rss>
