Should std::wstring be avoided, for portability?
Karadoc ~~
Member #2,749
September 2002
avatar

As I understand it, std::wstring is a basic_string of wchar_t (whereas std::string is `basic_string<char>`). But I don't feel like I have a good understanding of what wchar_t & std::wstring are actually good for. I don't understand why they are used, or why they are part of the standard.

The C++11 standard says this:

C++11 said:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <stdint.h>, called the underlying types

My interpretation of that is that wchar_t is big enough to fit all characters in the 'supported locales', whatever those might be, with some unspecified encoding (maybe UTF16; or maybe something else). That sounds pretty vague to me; and I usually don't like vague stuff in programming.

So why would I want a string of wchar_t? Wouldn't it be better to use either std::string, or std::basic_string<char16_t> (for UTF8 and UTF16 respectively)? What advantage does one get by using std::wstring?
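
(For reference, a minimal C++11 sketch of how these string types relate; the note about typical widths describes common platforms, not a guarantee from the standard:)

#include <string>
#include <type_traits>

// std::string, std::wstring, std::u16string and std::u32string are all
// specializations of std::basic_string; only the character type differs.
static_assert(std::is_same<std::wstring,   std::basic_string<wchar_t>>::value,  "");
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value, "");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value, "");
// char16_t and char32_t have fixed minimum widths; the width of wchar_t is
// implementation-defined (commonly 16 bits on Windows, 32 bits on Linux).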

-----------

Jeff Bernard
Member #6,698
December 2005
avatar

I imagine you'd use it because your std::wstring will continue working when a larger extended character set becomes available, whereas std::u16string may break. std::u32string would probably still work, but may be wider than you need it to be.

I think std::wstring is going to be equivalent to std::u32string on most machines, and then equivalent to std::u16string or std::string on machines that don't support, like, Chinese. It's probably safer to use std::wstring when you want to use a wide string.

Karadoc ~~ said:

Wouldn't it be better to use either std::string, or std::basic_string<char16_t> (for UTF8 and UTF16 respectively)?

I don't think std::u16string is wide enough for UTF16, actually. UTF16 can have 4 byte characters.

--
I thought I was wrong once, but I was mistaken.

Karadoc ~~
Member #2,749
September 2002
avatar

Maybe I've misunderstood how these things work, but I'm under the impression that UTF16 works using 16 bit atoms[1] of data such that the first 15 bits are used to identify the character, but if those 15 bits aren't enough, then the final bit is used to signal that the character requires another 16 bit atom. This kind of chain can go on indefinitely, and so there is no size limit to the character set, and no size limit to how many bytes might be 'needed' for a single character. (UTF8 is the same, but with 8 bits instead of 16.)

Maybe I've just got the whole thing wrong, but if that's actually how it works, then I think it would be natural to use a string of 16 bit chunks when encoding in UTF16.

References

  1. I don't know the correct technical name

-----------

Raidho36
Member #14,628
October 2012
avatar

Karadoc ~~ said:

atoms

Variables. Simply variables. Your case is specifically 16 bit int variables.

Quote:

UTF16 works using 16 bit

Wrong. UTF-16 works using a 2 byte base encoding, but for certain characters it uses up to 6 bytes, as you said. It's a multibyte encoding, just like UTF-8, where characters may have variable width. The opposite of multibyte is single byte, but of course you can't say that 4 bytes in UTF-32 is a single byte. But that's how it is - UTF-32 uses 32 bit ints to hold any possible character, and a character is always exactly 4 bytes long. As for the upper limit, a 32 bit int can hold orders of magnitude more characters than mankind has come up with so far, and I don't think there'll ever be a UTF character wider than 32 bits. In practice with the UTF-8 encoding, the upper 2 bytes (which form the last 12 bits) aren't even reserved, let alone actually used.

You should only use constant-width character strings for unicode if you use 32 bit variables to store them, and the encoding must be UTF-32 (the only constant-width unicode encoding). That ensures that you'll have no problem processing your strings and otherwise dealing with them.

Karadoc ~~
Member #2,749
September 2002
avatar

@Raidho36, you said I was 'wrong' in my description, but then went on to describe something completely consistent with what I said.

Also, I'm sure 'variables' is not the right word for what I was talking about. It would be pretty weird to call the individual segments of each character a 'variable'. — Inspired by that weird suggestion, I checked the Wikipedia article for UTF16. What I referred to as atoms, Wikipedia calls units.

I only skimmed over that Wikipedia article, but I found that my original description wasn't quite right. Wikipedia implies that 32 bits is the maximum size allowed by UTF16, and that those full 32 bits can only store up to 1,112,064 different characters!? As I said, I only skimmed the article, but to me that sounds wrong.

In any case, if UTF16 cannot have indefinitely many units for a single character, then I suppose that means using wchar_t is fine as long as it is the same size as the maximum size of a UTF16 character. It still seems a bit fishy to me - but maybe my original concern was just due to my (erroneous) belief that UTF16 / UTF8 could have as many units as they wanted per character.

-----------

torhu
Member #2,727
September 2002
avatar

I think char16_t and char32_t are new in C++11. std::wstring and wchar_t are older, and were probably used for 16 bit character sets like Chinese and the original Unicode. Today, wstring and wchar_t default to 32 bits on some platforms (Linux), but are still 16 bits on Windows (where the OS uses UTF-16 internally). I hope this clears it up a bit.

As for what is the correct way to deal with this in C++11, I don't know. For an Allegro project, I suppose you would use std::string or a custom class that wraps A5's UTF-8 support. It depends on your needs and what libraries you are interfacing with.

There's lots of good Unicode information easily available on the web, like this: http://www.unicode.org/faq/utf_bom.html
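
(As a rough illustration of the 'wrap A5's UTF-8 support' route, a sketch assuming Allegro 5's UTF-8 string routines al_ustr_new, al_ustr_append_chr, al_ustr_length, al_ustr_size and al_ustr_free:)

#include <allegro5/allegro.h>
#include <cstdio>

int main()
{
    al_init();
    // ALLEGRO_USTR stores UTF-8 internally; no wchar_t anywhere.
    ALLEGRO_USTR *us = al_ustr_new("hello");
    al_ustr_append_chr(us, 0x00E9);   // append 'é' as a code point
    std::printf("code points: %d, bytes: %d\n",
                (int)al_ustr_length(us), (int)al_ustr_size(us));
    al_ustr_free(us);
    return 0;
}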

Raidho36
Member #14,628
October 2012
avatar

Karadoc ~~ said:

Wikipedia implies that 32 bits is the maximum size allowed by UTF16, and that those full 32 bits can only store up to 1,112,064 different characters!?

You probably skimmed through too fast. Unicode is designed to be used with 32 bit characters, so all UTFs are designed to hold up to 32 bits of data, therefore the maximum amount of characters possible for all UTFs is 4294967296.

Quote:

Also, I'm sure 'variables' is not the right word for what I was talking about.

Memory chunks? Array cells? There's no standard name for a block of memory like that, so they're just notional "units" - variables, if you like. Since it's simply a bunch of bytes, there's no real distinction between them other than the compiler's understanding of their meaning. Try recording an int into a void-pointed area of memory and then reading it as a float, or vice versa, or try reading it with a small offset, like 1 byte or so. You'll get garbage, but you'll have your bytes exactly as you would expect them (minus endianness), it'll work fine and the compiler won't complain. That explanation is rough, but it should give you a good idea of what's going on.

Karadoc ~~
Member #2,749
September 2002
avatar

torhu said:

For an Allegro project, I suppose you would use std::string or a custom class that wraps A5's UTF-8 support. It depends on your needs and what libraries you are interfacing with.

This sounds like good advice to me. For my own (current) purposes, I don't actually need any non ASCII characters anyway. I just like to get into the habit of doing this the 'right way' so that if I ever need this stuff in the future then I will know what to do — or if I want to reuse my code for something else, it won't be hard to adapt.

Before starting this thread I saw something on Stack Overflow which basically said to always use wstring for everything when programming on Windows. The S.O. post had a bazillion upvotes, but I wasn't comfortable about changing all my `string`s into these wstring things, which might mess things up if I attempt to port to *nix.

I think that if I just stick with std::string and try to remember that counting letters will not necessarily give me the offset in the string, then UTF8 should work ok when I need it.
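
(For what it's worth, a minimal sketch of the 'counting letters vs byte offsets' point, assuming the std::string holds valid UTF-8; the helper name is just for illustration:)

#include <string>
#include <cstddef>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the form 10xxxxxx. Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)   // not a continuation byte => new code point
            ++n;
    return n;
}

// For s = u8"abcé": s.size() == 5 (bytes), but utf8_code_points(s) == 4.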

(And I still think using wchar_t is probably a bad idea if one is trying to write portable code.)

--
[edit]

Raidho36 said:

You probably skimmed through too fast. Unicode is designed to be used with 32 bit characters, so all UTFs are designed to hold up to 32 bits of data, therefore the maximum amount of characters possible for all UTFs is 4294967296.

Even if the UTF encoding had no redundancy or waste of any kind, it couldn't encode the full 2^32 different characters that you are suggesting. If nothing else, UTF8 needs to use some of its bits to indicate whether or not the full four bytes will be there. The value stated by Wikipedia (1,112,064) sounds too low to me, but it is certainly possible if there are simply a lot of disallowed combinations of bits. The value you stated (4294967296) is too large to be possible with a maximum of only 4 bytes in UTF8 or UTF16.

Quote:

Memory chunks? Array cells? There's no standard name for a block of memory like that, so they're just notional "units" - variables, if you like. Since it's simply a bunch of bytes, there's no real distinction between them other than the compiler's understanding of their meaning.

I wasn't talking about an arbitrary bunch of bytes though. And 'variable' has some connotations which I don't think are appropriate for what I was talking about. For example, I might use UTF8 to encode a unicode character like char32_t unicode_char. In that case, the 'variable' is called unicode_char. It could be confusing if each of the bytes inside that 32 bit variable were also referred to as variables...

unsigned char first_variable = (unicode_char >> 16) & 0xFF;
// now we have a variable called 'first_variable'. This is not nice.

-----------

Peter Wang
Member #23
April 2000

Raidho36 said:

Variables. Simply variables. Your case is specifically 16 bit int variables

Wrong. They are called 'code units'. Variables are a programming language construct and have nothing to do with Unicode encodings.

Quote:

but for certain characters it uses up to 6 bytes

Wrong. Code points are encoded to one or two UTF-16 code units, i.e. two bytes or four bytes.

Raidho36 said:

all UTFs are designed to hold up to 32 bits of data

Wrong. UTF-16 can only encode code points up to 0x10ffff, minus those values in the range U+D800..U+DFFF which are used to encode surrogate pairs. That is where the 1,112,064 comes from.

UTF-8 as originally designed (with sequences up to 6 bytes) could encode code points up to 2^31, but since Unicode only goes up to 0x10ffff (limited by UTF-16) UTF-8 also only encodes up to 0x10ffff.
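
(To make the one-or-two-code-unit rule concrete, a small sketch; the surrogate ranges and the 0x10000 offset come from the Unicode spec, the function name is just for illustration:)

#include <string>

// Append one Unicode code point (<= 0x10FFFF, not itself a surrogate)
// to a UTF-16 string as one or two code units.
void append_utf16(std::u16string &out, char32_t cp)
{
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));                      // one code unit
    } else {
        cp -= 0x10000;                                                 // 20 bits left
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));     // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));   // low surrogate
    }
}
// U+1F600, for example, becomes the two code units 0xD83D 0xDE00.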

m c
Member #5,337
December 2004
avatar

Peter Wang said:

Wrong. They are called 'code units'. Variables are a programming language construct and have nothing to do with Unicode encodings.

WRONG PETER! My unicode xhtml transitional documents want to have a word with you.

(\ /)
(O.o)
(> <)

Raidho36
Member #14,628
October 2012
avatar

I'm not familiar with UTF-16, sorry.

Karadoc, UTF-8 is designed to use up to 6 bytes of data because it occupies some bits for system purposes, so it could form 32 bits of data with 6 bytes rather than 4.

As for wchar_t, it's a part of the standard, so no problem with portability whatsoever could possibly arise unless you're using a non-standard-compliant compiler on the target platform.

Karadoc ~~
Member #2,749
September 2002
avatar

Raidho36 said:

As for wchar_t, it's a part of the standard, so no problem with portability whatsoever could possibly arise unless you're using a non-standard-compliant compiler on the target platform.

I don't think you've understood what I'm trying to say. I already know that wchar_t is a standard part of C++. I even quoted the description of it from the official C++11 document in my first post. My point is that the size of wchar_t might vary depending on how the code is compiled, and thus it could create problems whenever one changes how their code is compiled. It seems to me that it would be better to use a more predictable type such as char32_t.

Quote:

Karadoc, UTF-8 is designed to use up to 6 bytes of data because it occupies some bits for system purposes, so it could form 32 bits of data with 6 bytes rather than 4.

From what I can tell, UTF-8 only uses up to 4 bytes. Peter Wang said something that implies an older version of UTF-8 might have used more, but that doesn't seem to be the case any longer based on what I've read.
Wikipedia says this:

wikipedia said:

UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes

I also checked the 'Unicode Standard Version 6.2 – Core Specification'

In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex [than UTF-32].

But all that stuff is beside the point of the original question anyway.

Actually, the fact that UTF-32 covers all of unicode is relevant - because that suggests that wchar_t should be 32 bits wide in order to meet the definition in the C++11 standard. That sounds fair and reasonable and reliable... except that I've seen a few different people claim that wchar_t is 16 bits on Windows. -- I'm not really sure what "on Windows" really means though given that it would be determined by the compiler rather than by the operating system. I guess I could just test it on VC++ and minGW, but the fact that people are saying that it is sometimes 16 bits and sometimes 32 bits is discouraging enough for me to conclude that it's best to use char32_t to avoid confusion.
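
(That test is a one-liner, for what it's worth:)

#include <climits>
#include <cstdio>

int main()
{
    // Typically prints 16 on Windows toolchains (VC++, MinGW) and 32 on Linux.
    std::printf("wchar_t: %d bits, char32_t: %d bits\n",
                (int)(sizeof(wchar_t) * CHAR_BIT),
                (int)(sizeof(char32_t) * CHAR_BIT));
}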

-----------

Raidho36
Member #14,628
October 2012
avatar

Karadoc ~~ said:

It seems to me that it would be better to use a more predictable type such as char32_t

That would be more consistent, although it may not match your wide character implementation, so you may run into certain problems.

Quote:

From what I can tell, UTF-8 only uses up to 4 bytes.

Oh well, I didn't dig in too deep; it was enough for me that I write UTF-8 when saving to file (the code I wrote handled the 6-byte form) and decode it into UTF-32 immediately on reading, to use internally, so I didn't bother much with changes to the spec. 1112064 is way more than enough anyway, even if it's suboptimal. The real difference between UTF-32 and the other two is that the former is constant-length whereas the others are variable-length. This property allows random access, which is a great deal in terms of performance. If you simply take input, display and discard your strings, then using UTF-8 internally is fine. But if you bend and twist them around a lot - you're fucked.
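
(A sketch of what that random-access difference amounts to; the helper name is illustrative:)

#include <string>
#include <cstddef>

// With UTF-32 every code point is exactly one element, so the nth code
// point of a std::u32string is just element n: constant-time random access.
// Doing the same lookup on a UTF-8 std::string means scanning the bytes.
char32_t nth_code_point(const std::u32string &s, std::size_t n)
{
    return s.at(n);
}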

Quote:

I've seen a few different people claim that wchar_t is 16 bits on Windows.

It is. Check for yourself. It means that without enabling specific obscure settings, your wide characters will be two bytes long.

m c
Member #5,337
December 2004
avatar

Karadoc ~~ said:

except that I've seen a few different people claim that wchar_t is 16 bits on Windows.

That is probably for commonality with the win32 subsystem's native Unicode support, which is UTF-16.

(\ /)
(O.o)
(> <)

torhu
Member #2,727
September 2002
avatar

Right, I seem to remember making that exact point recently.

Karadoc ~~
Member #2,749
September 2002
avatar

My interpretation of the standard is that a single wchar_t should be able to encode every possible unicode character; and so it needs to be 32 bit.

Quote:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

One could argue that 16 bits are enough to meet the requirements because the standard doesn't explicitly say that a single wchar_t should be able to encode every possible character - but if one accepts that argument, they must also accept that a single bit would be enough, and so the requirement would be effectively meaningless.

From my point of view, there's nothing wrong with utf-16, and I'm sure it's convenient for wchar_t to be 16 bits when dealing exclusively with utf-16; but I just don't think that's what the standard asks for, and I expect it would be a pest for portability. I suspect it's probably not really about different interpretations of the standard, but rather about supporting legacy code.

Raidho36 said:

That would be more consistent, although it may not match your wide character implementation, so you may run into certain problems.

Obviously if the programmer is explicitly specifying the number of bits, they would choose the number of bits that match the encoding they want to use. On the other hand, wchar_t might not be the right size, because the size of wchar_t is not chosen by the programmer. That's the point I'm trying to make.

-----------

torhu
Member #2,727
September 2002
avatar

Karadoc ~~ said:

My interpretation of the standard is that a single wchar_t should be able to encode every possible unicode character; and so it needs to be 32 bit.

Nope. I'll give you a hint: C++ wchar_t could very well be older than Unicode. And another one: the internet is full of actual, real information about C++. Although that's not true about a lot of other subjects.

Karadoc ~~
Member #2,749
September 2002
avatar

@torhu, come on man. I quoted the C++11 standard here. That's the most solid piece of "real information about C++" I can imagine. I fully understand that wchar_t could be older than unicode, but that doesn't mean unicode isn't 'specified among the supported locales'. If you know something which invalidates my interpretation of the standard, can you just say it? I don't know what the benefit is of saying 'I know the answer but I don't want to tell you'.

In any case, I'm pretty satisfied that I have the answer to my original question. Although no one here seems to be saying it, I think the answer is 'yes. wchar_t should be avoided when portability is important'.

I found this quote on wikipedia:

The ISO/IEC 10646:2003 Unicode standard 4.0 says that:

"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."

So there it is: an unambiguous recommendation to not use wchar_t in programs that need to be portable.

-----------

torhu
Member #2,727
September 2002
avatar

A Unicode character can be 32 bits, but wchar_t is only 16 bits on Windows. That's all there is to it. The problem with wchar_t for cross platform applications could be just that it sucks to use 32 bits for each character. I don't know if there are issues with Unicode string literals or whatever.

Raidho36
Member #14,628
October 2012
avatar

Karadoc ~~ said:

the standard doesn't explicitly say that a single wchar_t should be able to encode every possible character

It does. But note it says "supported locales". That means if the total number of characters in all supported locales fits in 16 bits, it'll be 16 bits.

Quote:

it's convenient for wchar_t to be 16 bits when dealing exclusively with utf-16

Nope. wchar_t is specifically a wide character, as opposed to char, a regular character, and it is to be used with wide character strings. These all assume your characters are constant length and have no special encoding, since they're processed with the wide character string functions, which only differ from the regular ones by using wchar_t instead of char. That's what I was talking about when I said "certain problems." Although you may miraculously have everything work fine on its own.

torhu said:

I don't know if there are issues with Unicode string literals or whatever.

The logic behind this is "what's the point of giving wide characters 32 bits if the damn device can only display ASCII?" Because the target platform may not necessarily support unicode fully. What do you do? You adjust your text input function to work with these wide characters. But if you have the balls, you may hook up custom libraries that would give support to 32 bit unicode input, processing and display.

billyquith
Member #13,534
September 2011

Don't think about wstring, think about Unicode. Unicode has a large set of "code points", i.e. imagine every language and all the characters available (a LOT!). You cannot fit all of these into the defacto unit of character storage: the byte (i.e. range 0-255). Back when computers were simpler there was only ASCII (range 0-127).

If you want to have a large set of "codes" you can store these in different ways. Ignore Unicode for a minute and think of all the ways that you could do it. E.g. if you know ASCII has a spare bit, you can use this to specify that the code spills over into the next byte, i.e. multi-byte. Or maybe you decide not to use a byte, but to use 2 bytes, or 4 bytes, and chain these together. You might make this decision based on the architecture of the processors you are targeting, or how many languages you want to support.

std::string can hold UTF-8 encoding (i.e. multi-byte 8 bit chars). This is possible because std::string is not null terminated; it is 8 bit pure. I.e. you can store '\0' and any char in a string. So std::string is backwards compatible with null terminated 8 bit strings and ASCII. std::wstring works like std::string but holds "wide characters" (i.e. >8 bit) and is not backwards compatible with 8 bit strings.
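
(A tiny sketch of that '8 bit pure' point; the u8 literal assumes a C++11 compiler:)

#include <cassert>
#include <string>

int main()
{
    // std::string stores its length explicitly, so embedded '\0' bytes and
    // arbitrary UTF-8 bytes survive; only c_str() consumers that stop at the
    // first NUL would see a shorter string.
    std::string s("ab\0cd", 5);
    assert(s.size() == 5);
    s += u8"\u00e9";          // append 'é' as two UTF-8 bytes
    assert(s.size() == 7);
}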

std::string and 8 bit ASCII strings are "narrow" and std::wstring are "wide". Note, when we say wide, the size of a character is not specified as it is platform/compiler specific.

So your decision is: how to support Unicode, given the above information.

You might look at the APIs you are going to use, e.g. if you are only using Allegro, you might use std::string and UTF-8, because that is what Allegro uses internally. Otherwise you have to convert any wide strings to UTF-8 (narrow) for Allegro to use.

If you are using a library that only supports wide strings then you might use wide strings exclusively. Some APIs support both with a define.

If you are writing a library to be made public you might want to bear all this in mind: some people might want to use narrow chars, and others wide. Most libraries tend to assume ASCII, or narrow encoding. If you are targeting Windows, then I think you really have to support wide encoding, because Windows only really supports localisation properly using wide encoding (UTF-16 in this case). All of the newer .NET stuff uses this internally. It's a PITA!

Soooo... if your question is related to Allegro, I'd say use narrow encoding. You can have the simplicity of ASCII strings, and use UTF-8 Unicode to localise, which Allegro also uses. If you use wide strings you'll just have to convert them to narrow ones every time you call a text rendering function in Allegro.

Karadoc ~~
Member #2,749
September 2002
avatar

billyquith said:

Don't think about wstring, think about Unicode.
[...]
So your decision is: how to support Unicode, given the above information.

Suppose I'm writing a Linux program, and I choose to support unicode by using UTF-32. On Linux, wchar_t is 32 bits, so I might choose to use std::wstring to store my UTF-32 encoded text. However, if I do this, then try to port the program to Windows, the program won't work anymore because wchar_t on Windows is only 16 bits and so my UTF-32 encoding won't fit anymore. — That's why I'm saying wchar_t and std::wstring should be avoided for portability.
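
(If one did go down that road anyway, the mismatch can at least be caught at compile time rather than at runtime; a sketch:)

// Fails to compile where wchar_t is 16 bits (e.g. typical Windows
// compilers) instead of silently mangling UTF-32 data.
static_assert(sizeof(wchar_t) >= sizeof(char32_t),
              "wchar_t is too narrow for UTF-32; use std::u32string instead");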

--

Raidho36 said:

The real difference between UTF-32 and the other two is that the former is constant-length whereas the others are variable-length. This property allows random access, which is a great deal in terms of performance.

I just read something in the Unicode standard which seems to invalidate what you were saying here. Check this out:

Unicode standard v6.2 said:

Characters Versus Code Points. In any event, Unicode code points do not necessarily match user expectations for “characters.” For example, the following are not represented by a single code point: a combining character sequence such as <g, acute>; a conjoining jamo sequence for Korean; or the Devanagari conjunct “ksha.” Because some Unicode text processing must be aware of and handle such sequences of characters as text elements, the fixed-width encoding form advantage of UTF-32 is somewhat offset by the inherently variable-width nature of processing text elements. See Unicode Technical Standard #18, “Unicode Regular Expressions,” for an example where commonly implemented processes deal with inherently variable-width text elements owing to user expectations of the identity of a “character.”

If I understand this correctly, they are saying that even with UTF-32, a single 32 bit unit does not necessarily correspond to a character, i.e. some characters might take more than 32 bits to encode. So even with UTF-32, you can't directly relate the character number to the byte number. (And presumably that's what you meant by 'random access'.)
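
(A concrete case, as a C++11 sketch:)

#include <cassert>
#include <string>

int main()
{
    // <g, combining acute accent>: one user-perceived character,
    // but two UTF-32 code points.
    std::u32string s = U"g\u0301";
    assert(s.size() == 2);
}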

-----------

Arthur Kalliokoski
Second in Command
February 2005
avatar

This Wikipedia article says that all UTF-32 code units are exactly 32 bits wide, but a particular character may take more than one code point.

They all watch too much MSNBC... they get ideas.

Raidho36
Member #14,628
October 2012
avatar

Oh okay, I'm sorry, there are special symbols that ain't real characters, so random access isn't worth jack shit, I forgot. I just never ran into those, so everything worked fine like that.

Karadoc ~~
Member #2,749
September 2002
avatar

No need to be sorry about it. I just mentioned it because I thought you'd like to know. I didn't know either.

By the way, keeping in mind that characters aren't always single code-points, does anyone happen to know an easy way to work out where the end of a character is in UTF-8? I'm making a text-box UI widget, and I need to be able to determine where the starts of the characters are so that I can implement the functionality for left and right arrow keys, and mouse selection and stuff like that.

(I'm going to leave it alone for the time being and finish it later, and if I don't find any other source of info I'll check the source code for al_draw_text to see how it is done there.)
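
(In case it's useful to anyone finding this later: a sketch of stepping between code point boundaries in a UTF-8 std::string, relying on the fact that continuation bytes always look like 10xxxxxx. Proper cursor movement over combining sequences would need more than this, and if I remember right Allegro 5's al_ustr_next/al_ustr_prev do the same kind of stepping on an ALLEGRO_USTR.)

#include <string>
#include <cstddef>

// Byte index of the next code point boundary after pos (at most s.size()).
std::size_t next_boundary(const std::string &s, std::size_t pos)
{
    if (pos >= s.size()) return s.size();
    ++pos;
    while (pos < s.size() && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        ++pos;
    return pos;
}

// Byte index of the previous code point boundary before pos (at least 0).
std::size_t prev_boundary(const std::string &s, std::size_t pos)
{
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        --pos;
    return pos;
}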

-----------
