<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>Should std::wstring be avoided, for portability?</title>
		<link>http://www.allegro.cc/forums/view/612635</link>
		<description>Allegro.cc Forum Thread</description>
		<webMaster>matthew@allegro.cc (Matthew Leverton)</webMaster>
		<lastBuildDate>Thu, 30 May 2013 08:12:23 +0000</lastBuildDate>
	</channel>
	<item>
<description><![CDATA[<div class="mockup v2"><p>As I understand it, <span class="source-code">std::wstring</span> is a <span class="source-code">basic_string</span> of <span class="source-code"><span class="k1">wchar_t</span></span> (whereas <span class="source-code">std::string</span> is <span class="source-code">basic_string&lt;char&gt;</span>). But I don&#39;t feel like I have a good understanding of what <span class="source-code"><span class="k1">wchar_t</span></span> &amp; <span class="source-code">std::wstring</span> are actually good for. I don&#39;t understand why they are used, or why they are part of the standard.</p><p>The C++11 standard says this:
</p><div class="quote_container"><div class="title">C++11 said:</div><div class="quote"><p>Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in &lt;stdint.h&gt;, called the underlying types</p></div></div><p>
My interpretation of that is that wchar_t is big enough to fit all characters in the &#39;supported locales&#39;, whatever those might be, with some unspecified encoding (maybe UTF16; or maybe something else). That sounds pretty vague to me; and I usually don&#39;t like vague stuff in programming.</p><p>So why would I want a string of wchar_t? Wouldn&#39;t it be better to use either <span class="source-code">std::string</span>, or <span class="source-code">std::basic_string<span class="k3">&lt;</span>char16_t&gt;</span> (for UTF8 and UTF16 respectively)? What advantage does one get by using <span class="source-code">std::wstring</span>?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 22 May 2013 13:17:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I imagine you&#39;d use it because your <span class="source-code">std::wstring</span> will continue working when a larger extended character set becomes available, whereas <span class="source-code">std::u16string</span> may break. <span class="source-code">std::u32string</span> would probably still work, but may be wider than you need it to be.</p><p>I think <span class="source-code">std::wstring</span> is going to be equivalent to <span class="source-code">std::u32string</span> on most machines, and then equivalent to <span class="source-code">std::u16string</span> or <span class="source-code">std::string</span> on machines that don&#39;t support, like, Chinese. It&#39;s probably safer to use <span class="source-code">std::wstring</span> when you want to use a wide string.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983291#target">Karadoc ~~</a> said:</div><div class="quote"><p> Wouldn&#39;t it be better to use either <span class="source-code">std::string</span>, or <span class="source-code">std::basic_string<span class="k3">&lt;</span>char16_t&gt;</span> (for UTF8 and UTF16 respectively)?</p></div></div><p>I don&#39;t think <span class="source-code">std::u16string</span> is wide enough for UTF16, actually. UTF16 can have 4 byte characters.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Jeff Bernard)</author>
		<pubDate>Wed, 22 May 2013 13:45:46 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>Maybe I&#39;ve misunderstood how these things work, but I&#39;m under the impression that UTF16 works using 16 bit atoms<span class="ref"><sup>[<a href="#">1</a>]</sup></span> of data such that the first 15 bits are used to identify the character, but if those 15 bits aren&#39;t enough, then the final bit is used to signal that the character requires another 16 bit atom. This kind of chain can go on indefinitely, and so there is no size limit to the character set, and no size limit to how many bytes might be &#39;needed&#39; for a single character. (UTF8 is the same, but with 8 bits instead of 16.)</p><p>Maybe I&#39;ve just got the whole thing wrong, but if that&#39;s actually how it works, then I think it would be natural to use a string of 16 bit chunks when encoding in UTF16.
</p><div class="ref-block"><h2>References</h2><ol><li>I don&#39;t know the correct technical name</li></ol></div></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 22 May 2013 18:59:12 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983319#target">Karadoc ~~</a> said:</div><div class="quote"><p> atoms
</p></div></div><p>Variables. Simply variables. Your case is specifically 16 bit int variables.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> UTF16 works using 16 bit
</p></div></div><p>Wrong. UTF-16 works using 2 byte base encoding, but for certain characters it uses up to 6 bytes, as you said. It&#39;s a multibyte encoding, just like UTF-8, where characters may have variable width. The opposite of multibyte is single byte, but of course you can&#39;t say that 4 bytes in UTF-32 is a single byte. But that&#39;s how it is - UTF-32 uses 32 bit ints to hold any possible character, and the character is always exactly 4 bytes long. As for the upper limit, a 32 bit int can hold orders of magnitude more characters than all of mankind has come up with so far, and I don&#39;t think there&#39;ll <i>ever</i> be a wider UTF character than 32 bits. In practice with UTF-8 encoding, the upper 2 bytes (that form the last 12 bits) aren&#39;t even reserved, not to mention actually used.</p><p>You should only use constant-width character strings for unicode if you use 32 bit variables to store them, and the encoding must be UTF-32 (the only constant-width unicode). That ensures that you&#39;ll have no problem with processing your strings and otherwise dealing with them.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 22 May 2013 21:38:07 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>@Raidho36, you said I was &#39;wrong&#39; in my description, but then went on to describe something completely consistent with what I said.</p><p>Also, I&#39;m sure &#39;variables&#39; is not the right word for what I was talking about. It would be pretty weird to call the individual segments of each character a &#39;variable&#39;. — Inspired by that weird suggestion, I checked the Wikipedia article for UTF16. What I referred to as <i>atoms</i>, Wikipedia calls <i>units</i>.</p><p>I only skimmed over that Wikipedia article, but I found that my original description wasn&#39;t quite right. Wikipedia implies that 32 bits is the maximum size allowed by UTF16, and that those full 32 bits can only store up to 1,112,064 different characters!? As I said, I only skimmed the article, but to me that sounds wrong.</p><p>In any case, if UTF16 can not have indefinitely many units for a single character then I suppose that means using <span class="source-code"><span class="k1">wchar_t</span></span> is fine as long as it is the same size as the maximum size of a UTF16 character. It still seems a bit fishy to me - but maybe my original concern was just due to my (erroneous) belief that UTF16 / 8 could have as many units as they wanted per character.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 23 May 2013 04:15:38 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I think char16_t and char32_t are new in C++11.  std::wstring and wchar_t are older, and were probably used for 16 bit character sets like Chinese and the original Unicode.  Today, wstring and wchar_t default to 32 bits on some platforms (Linux), but are still 16 bits on Windows (where the OS uses UTF-16 internally). I hope this clears it up a bit.</p><p>As for what is the correct way to deal with this in C++11, I don&#39;t know. For an Allegro project, I suppose you would use std::string or a custom class that wraps A5&#39;s UTF-8 support.  It depends on your needs and what libraries you are interfacing with.</p><p>There&#39;s lots of good Unicode information easily available on the web, like this: <a href="http://www.unicode.org/faq/utf_bom.html">http://www.unicode.org/faq/utf_bom.html</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Thu, 23 May 2013 05:39:52 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983319#target">Karadoc ~~</a> said:</div><div class="quote"><p> Wikipedia implies that 32 bits is the maximum size allowed by UTF16, and that those full 32 bits can only store up to 1,112,064 different characters!?
</p></div></div><p>You probably skimmed through too fast. Unicode is designed to be used with 32 bit characters, so all UTFs are designed to hold up to 32 bits of data, therefore the maximum number of characters possible for all UTFs is 4294967296.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> Also, I&#39;m sure &#39;variables&#39; is not the right word for what I was talking about.
</p></div></div><p>Memory chunks? Array cells? There&#39;s no standard way of calling <i>a block of memory</i> something. So they&#39;re just notional (non-physically-existing, that is) &quot;units&quot; - variables, in a sense. Since it&#39;s simply a bunch of bytes, there&#39;s no real distinguishing between them other than the compiler&#39;s understanding of their meaning. Try recording an <span class="source-code"><span class="k1">int</span></span> into a void-pointed area in memory and then reading it as a <span class="source-code"><span class="k1">float</span></span>, or vice versa, or try to read it with a small offset, like 1 byte or so. You&#39;ll have garbage, but you&#39;ll have your <i>bytes</i> exactly as you would expect them (minus endianness) and it&#39;ll work fine and the compiler won&#39;t complain. That explanation is rough, but should give you a good idea of what&#39;s going on.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Thu, 23 May 2013 08:43:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983385#target">torhu</a> said:</div><div class="quote"><p>For an Allegro project, I suppose you would use std::string or a custom class that wraps A5&#39;s UTF-8 support.  It depends on your needs and what libraries you are interfacing with.</p></div></div><p>
This sounds like good advice to me. For my own (current) purposes, I don&#39;t actually need any non ASCII characters anyway. I just like to get into the habit of doing this the &#39;right way&#39; so that if I ever need this stuff in the future then I will know what to do — or if I want to reuse my code for something else, it won&#39;t be hard to adapt.</p><p>Before starting this thread I saw something on Stack Overflow which basically said to always use wstring for everything when programming on Windows. The S.O. post had a bazillion upvotes, but I wasn&#39;t comfortable about changing all my <span class="source-code">string</span>s into these <span class="source-code">wstring</span> things which might mess things up if I attempt to port to *nix.</p><p>I think that if I just stick with <span class="source-code">std::string</span> and try to remember that counting letters will not necessarily give me the offset in the <span class="source-code">string</span>, then UTF8 should work ok when I need it.</p><p>(And I still think using <span class="source-code"><span class="k1">wchar_t</span></span> is probably a bad idea if one is trying to write portable code.)</p><p>--<br />[edit]
</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983392#target">Raidho36</a> said:</div><div class="quote"><p>You probably did skimmed through too fast. Unicode is designed to be used with 32 bit characters, so all UTFs are designed to hold up to 32 bits of data, therefore for all UTFs maximum amount of characters possible is 4294967296.</p></div></div><p>Even if the UTF encoding had no redundancy or waste of any kind, it couldn&#39;t encode the full <img class="math" src="http://www.allegro.cc/images/tex/c/8/c8f13b7336e222fe27618051005ad0bb-96.png" alt="&lt;math&gt;2^32&lt;/math&gt;" /> different characters that you are suggesting. If nothing else, UTF8 needs to use some of its bits to indicate whether or not the full four bytes will be there. The value stated by wikipedia (1,112,064) sounds too low to me, but it is certainly possible if there are simply a lot of disallowed combinations of bits. The value you stated (4294967296) is too large to be possible with a maximum of only 4 bytes in UTF8 or UTF16.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>Memory chunks? Array cells? There&#39;s no standard way of calling a block of memory something. So it&#39;s simply virtual existed (non-real existent that is) &quot;units&quot;, a variables. Provided it&#39;s simply a bunch of bytes, there&#39;s no real distingquishing between them other than compiler&#39;s understanding of their meaning.</p></div></div><p>I wasn&#39;t talking about an arbitrary bunch of bytes though. And &#39;variable&#39; has some connotations which I don&#39;t think are appropriate for what I was talking about. For example, I might use UTF8 to encode a unicode character like <span class="source-code">char32_t unicode_char</span>. In that case, the &#39;variable&#39; is called <span class="source-code">unicode_char</span>. 
It could be confusing if each of the bytes inside that 32 bit variable were also referred to as variables...
</p><div class="source-code snippet"><div class="inner"><pre>char8_t first_variable <span class="k3">=</span> unicode_char <span class="k3">&amp;</span> <span class="n">0xFF0000</span><span class="k2">;</span>
<span class="c">// now we have a variable called 'first_variable'. This is not nice.</span>
</pre></div></div><p>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 23 May 2013 08:48:01 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983336#target">Raidho36</a> said:</div><div class="quote"><p>
Variables. Simply variables. Your case is specifically 16 bit int variables
</p></div></div><p>

Wrong. They are called &#39;code units&#39;. Variables are a programming language construct and have nothing to do with Unicode encodings.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
but for certain characters it uses up to 6 bytes
</p></div></div><p>

Wrong. Code points are encoded to one or two UTF-16 code units, i.e. two bytes or four bytes.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983392#target">Raidho36</a> said:</div><div class="quote"><p>all UTFs are designed to hold up to 32 bits of data</p></div></div><p>

Wrong. UTF-16 can only encode code points up to 0x10ffff, minus those values in the range U+D800..U+DFFF which are used to encode surrogate pairs. That is where the 1,112,064 comes from.</p><p>UTF-8 as originally designed can encode all 2^32 code points, but since Unicode only goes up to 0x10ffff (limited by UTF-16) UTF-8 also only encodes up to 0x10ffff.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Peter Wang)</author>
		<pubDate>Thu, 23 May 2013 09:21:01 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983397#target">Peter Wang</a> said:</div><div class="quote"><p>Wrong. They are called &#39;code units&#39;. <b>Variables</b> are a programming language construct and have <b>nothing to do with Unicode</b> encodings.</p></div></div><p>
WRONG PETER! My unicode xhtml transitional documents want to have a word with you.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (m c)</author>
		<pubDate>Thu, 23 May 2013 09:51:47 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>I&#39;m not familiar with UTF-16, sorry.</p><p>Karadoc, UTF-8 is designed to use up to 6 bytes of data because it occupies some bits for system purposes, so it could form 32 bits of data with 6 bytes rather than 4.</p><p>As for wchar_t, it&#39;s a part of the standard, so no problem with portability whatsoever could possibly arise unless you&#39;re using a non-standard-compliant C version on the target platform.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Thu, 23 May 2013 09:53:57 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983402#target">Raidho36</a> said:</div><div class="quote"><p>As for wchar_t, it&#39;s a part of the standard, so no problem with portability whatsoever could possibly arise unless you&#39;re using non standard-compliant C version on target platform.</p></div></div><p>
I don&#39;t think you&#39;ve understood what I&#39;m trying to say. I already know that <span class="source-code"><span class="k1">wchar_t</span></span> is a standard part of C++. I even quoted the description of it from the official C++11 document in my first post. My point is that the size of <span class="source-code"><span class="k1">wchar_t</span></span> might vary depending on how the code is compiled, and thus it could create problems whenever one changes how their code is compiled. It seems to me that it would be better to use a more predictable type such as <span class="source-code">char32_t</span>.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>Karadoc, UTF-8 is designed to use up to 6 bytes of data because it does occupies some bits for system purposes, so it could form 32 bits of data with 6 bytes rather than 4.</p></div></div><p> From what I can tell, UTF-8 only uses up to 4 bytes. Peter Wang said something that implies an older version of UTF-8 might have used more, but that doesn&#39;t seem to be the case any longer based on what I&#39;ve read.<br />Wikipedia says this:
</p><div class="quote_container"><div class="title"><a href="https://en.wikipedia.org/wiki/UTF-8">wikipedia</a> said:</div><div class="quote"><p>UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes</p></div></div><p>
I also checked the &#39;Unicode Standard Version 6.2 – Core Specification&#39;
</p><div class="quote_container"><div class="title"><a href="http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf">Unicode standard</a> said:</div><div class="quote"><p>In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex [than UTF-32].</p></div></div><p>
But all that stuff is beside the point of the original question anyway.</p><p>Actually, the fact that UTF-32 covers all of unicode is relevant - because that suggests that <span class="source-code"><span class="k1">wchar_t</span></span> should be 32 bits wide in order to meet the definition in the C++11 standard. That sounds fair and reasonable and reliable... except that I&#39;ve seen a few different people claim that <span class="source-code"><span class="k1">wchar_t</span></span> is 16 bits on Windows. -- I&#39;m not really sure what &quot;on Windows&quot; means, though, given that it would be determined by the compiler rather than by the operating system. I guess I could just test it on VC++ and MinGW, but the fact that people are saying that it is sometimes 16 bits and sometimes 32 bits is discouraging enough for me to conclude that it&#39;s best to use <span class="source-code">char32_t</span> to avoid confusion.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 23 May 2013 11:39:47 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983408#target">Karadoc ~~</a> said:</div><div class="quote"><p>  It seems to me that it would be better to use a more predictable type such as char32_t
</p></div></div><p>That would be more consistent, although it may not match your wide character implementation, so you may run into certain problems.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> From what I can tell, UTF-8 only uses up to 4 bytes. 
</p></div></div><p>Oh well, I didn&#39;t dig in too deep; it was enough for me that I use UTF-8 encoding when writing to file (what I wrote was programmed for 6-byte sequences) and instantly decode it into UTF-32 on read to use internally, so I didn&#39;t bother much with spec changes. 1,112,064 is way more than enough anyway, even if suboptimal. The real difference between UTF-32 and the other two is that the former is constant-length whereas the others are variable-length. This property allows random access, which is a great deal in terms of performance. If you simply take input, display and discard your strings, then using UTF-8 internally is fine. But if you bend and twist them around a lot - you&#39;re <span class="cuss"><span>fuck</span></span>ed.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> I&#39;ve seen a few different people claim that wchar_t is 16 bits on Windows.
</p></div></div><p>It is. Check for yourself. The upshot is that without enabling specific obscure settings, your wide characters will be two bytes long.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Thu, 23 May 2013 22:56:52 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983408#target">Karadoc ~~</a> said:</div><div class="quote"><p>except that I&#39;ve seen a few different people claim that wchar_t is 16 bits on Windows.</p></div></div><p>

That is probably for commonality with the win32 subsystem&#39;s native Unicode support, which is UTF-16.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (m c)</author>
		<pubDate>Fri, 24 May 2013 05:51:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Right, I seem to remember making that exact point recently.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Fri, 24 May 2013 06:13:56 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>My interpretation of the standard is that a single <span class="source-code"><span class="k1">wchar_t</span></span> should be able to encode every possible unicode character; and so it needs to be 32 bit.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).</p></div></div><p>
One could argue that 16 bits are enough to meet the requirements because the standard doesn&#39;t explicitly say that a <i>single</i> <span class="source-code"><span class="k1">wchar_t</span></span> should be able to encode every possible character - but if one accepts that argument, they must also accept that a single bit would be enough, and so the requirement would be effectively meaningless.</p><p>From my point of view, there&#39;s nothing wrong with utf-16, and I&#39;m sure it&#39;s convenient for <span class="source-code"><span class="k1">wchar_t</span></span> to be 16 bits when dealing exclusively with utf-16; but I just don&#39;t think that&#39;s what the standard asks for, and I expect it would be a pest for portability. I suspect it&#39;s probably not really about different interpretations of the standard, but rather about supporting legacy code.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983428#target">Raidho36</a> said:</div><div class="quote"><p>That would be more consistent, although may not match your wide character implementation, so you may run into certain problems.</p></div></div><p>Obviously if the programmer is explicitly specifying the number of bits, they would choose the number of bits that match the encoding they want to use. On the other hand, <span class="source-code"><span class="k1">wchar_t</span></span> might not be the right size, because the size of <span class="source-code"><span class="k1">wchar_t</span></span> is not chosen by the programmer. That&#39;s the point I&#39;m trying to make.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Fri, 24 May 2013 07:00:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983468#target">Karadoc ~~</a> said:</div><div class="quote"><p> My interpretation of the standard is that a single wchar_t should be able to encode every possible unicode character; and so it needs to be 32 bit.</p></div></div><p>Nope. I&#39;ll give you a hint: C++ wchar_t could very well be older than Unicode. And another one: the internet is full of actual, real information about C++. Although that&#39;s not true about a lot of other subjects.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Fri, 24 May 2013 07:09:10 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>@torhu, come on man. I quoted the C++11 standard here. That&#39;s the most solid piece of &quot;real information about C++&quot; I can imagine. I fully understand that wchar_t could be older than unicode, but that doesn&#39;t mean unicode isn&#39;t &#39;specified among the supported locales&#39;. If you know something which invalidates my interpretation of the standard, can you just say it? I don&#39;t know what the benefit is of saying &#39;<i>I know the answer but I don&#39;t want to tell you</i>&#39;.</p><p>In any case, I&#39;m pretty satisfied that I have the answer to my original question. Although no one here seems to be saying it, I think the answer is &#39;<i>yes. wchar_t should be avoided when portability is important</i>&#39;.</p><p>I found this quote on wikipedia:
</p><div class="quote_container"><div class="title"><a href="https://en.wikipedia.org/wiki/Wide_character">wikipedia quoting 2003 Unicode standard</a> said:</div><div class="quote"><p>
The ISO/IEC 10646:2003 Unicode standard 4.0 says that:</p><p>    &quot;The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.&quot; </p></div></div><p>
So there it is: an unambiguous recommendation to not use wchar_t in programs that need to be portable.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Fri, 24 May 2013 07:39:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>A Unicode character can be 32 bits, but wchar_t is only 16 bits on Windows. That&#39;s all there is to it. The problem with wchar_t for cross platform applications could be just that it sucks to use 32 bits for each character.  I don&#39;t know if there are issues with Unicode string literals or whatever.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (torhu)</author>
		<pubDate>Fri, 24 May 2013 08:14:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983468#target">Karadoc ~~</a> said:</div><div class="quote"><p> the standard doesn&#39;t explicitly say that a single wchar_t should be able to encode every possible character
</p></div></div><p>It does. But note it says &quot;supported locales&quot;. That means if the total number of characters across all supported locales fits in 16 bits, it&#39;ll be 16 bits.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p> it&#39;s convenient for wchar_t to be 16 bits when dealing exclusively with utf-16
</p></div></div><p>Nope. <span class="source-code"><span class="k1">wchar_t</span></span> is specifically a <i>wide character</i>, as opposed to the <span class="source-code"><span class="k1">char</span></span> <i>regular character</i>, and it is to be used with wide character strings. These all assume your characters are constant length with no special encoding, since they are processed with wide character string functions that differ from the regular ones only by using <span class="source-code"><span class="k1">wchar_t</span></span> instead of <span class="source-code"><span class="k1">char</span></span>. That&#39;s what I was talking about when I spoke of &quot;certain problems.&quot; Although you may miraculously have everything work fine on its own.
</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983472#target">torhu</a> said:</div><div class="quote"><p> I don&#39;t know if there are issues with Unicode string literals or whatever.
</p></div></div><p>The logic behind this is &quot;what&#39;s the point of giving wide characters 32 bits if the damn device can only display ASCII?&quot; The target platform may not necessarily support unicode fully. What do you do? You adjust your text input function to work with characters <i>this</i> wide. But if you have the balls, you may hook up custom libraries that add support for 32 bit unicode input, processing and display.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Fri, 24 May 2013 11:08:41 +0000</pubDate>
	</item>
	<item>
<description><![CDATA[<div class="mockup v2"><p>Don&#39;t think about wstring, think about Unicode. Unicode has a large set of &quot;code points&quot;, i.e. imagine every language and all the characters available (a LOT!). You cannot fit all of these into the de facto unit of character storage: the byte (i.e. range 0-255). Back when computers were simpler there was only ASCII (range 0-127).</p><p>If you want to have a large set of &quot;codes&quot; you can store these in different ways. Ignore Unicode for a minute and think of all the ways that you could do it. E.g. if you know ASCII has a spare bit, you can use this to specify that the code spills over into the next byte, i.e. multi-byte. Or maybe you decide not to use a byte, but to use 2 bytes, or 4 bytes, and chain these together. You might make this decision based on the architecture of the processors you are targeting, or how many languages you want to support.</p><p>std::string can hold UTF-8 encoding (i.e. multi-byte 8 bit chars). This is possible because std::string is not null terminated, it is 8 bit pure. I.e. you can store &#39;\0&#39;, and any char in a string. So std::string is backwards compatible with null terminated 8 bit strings and ASCII. std::wstring works like std::string but holds &quot;wide characters&quot; (i.e. &gt;8 bit) and is not backwards compatible with 8 bit strings.</p><p>std::string and 8 bit ASCII strings are &quot;narrow&quot; and std::wstring is &quot;wide&quot;. Note, when we say wide, the size of a character is not specified as it is platform/compiler specific.</p><p>So your decision is: <b>how to support Unicode</b>, given the above information.</p><p>You might look at the APIs you are going to use, e.g. if you are only using Allegro, you might use std::string and UTF-8, because that is what Allegro uses internally. 
Otherwise you have to convert any wide strings to UTF-8 (narrow) for Allegro to use.</p><p>If you are using a library that only supports wide strings then you might use wide strings exclusively. Some APIs support both with a define.</p><p>If you are writing a library to be made public you might want to bear all this in mind, that some people might want to use narrow, and others wide, chars. Most libraries tend to assume ASCII, or narrow encoding. If you are including Windows, then I think you really have to support wide encoding because it only really supports localisation properly using wide encoding (UTF-16 in this case). All of the newer .net stuff uses this internally. It&#39;s a PITA!</p><p>Soooo... if your question is related to Allegro, I&#39;d say use narrow encoding. You can have the simplicity of ASCII strings, and use UTF-8 Unicode to localise, which Allegro also uses. If you use wide strings you&#39;ll just have to convert them to narrow ones every time you call a text rendering function Allegro.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (billyquith)</author>
		<pubDate>Mon, 27 May 2013 12:35:35 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983700#target">billyquith</a> said:</div><div class="quote"><p>Don&#39;t think about wstring, think about Unicode.<br />[...]<br />So your decision is: how to support Unicode, given the above information.</p></div></div><p>
Suppose I&#39;m writing a Linux program, and I choose to support unicode by using UTF-32. On Linux, <span class="source-code"><span class="k1">wchar_t</span></span> is 32 bits, so I might choose to use <span class="source-code">std::wstring</span> to store my UTF-32 encoded text. However, if I do this, then try to port the program to Windows, the program won&#39;t work anymore because <span class="source-code"><span class="k1">wchar_t</span></span> on Windows is only 16 bits and so my UTF-32 encoding won&#39;t fit anymore. — That&#39;s why I&#39;m saying <span class="source-code"><span class="k1">wchar_t</span></span> and <span class="source-code">std::wstring</span> should be avoided for portability.</p><p>--</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983428#target">Raidho36</a> said:</div><div class="quote"><p>The real difference between UTF-32 and other two is that former is constant-length whereas others are variable-length. This property of it allows random access, which is a great deal in terms of performance.</p></div></div><p>
I just read something in the Unicode standard which seems to invalidate what you were saying here. Check this out:
</p><div class="quote_container"><div class="title">Unicode standard v6.2 said:</div><div class="quote"><p>Characters Versus Code Points. In any event, Unicode code points do not necessarily match user expectations for “characters.” For example, the following are not represented by a single code point: a combining character sequence such as &lt;g, acute&gt;; a conjoining jamo sequence for Korean; or the Devanagari conjunct “ksha.” Because some Unicode text processing must be aware of and handle such sequences of characters as text elements, the fixed-width encoding form advantage of UTF-32 is somewhat offset by the inherently variable-width nature of processing text elements. See Unicode Technical Standard #18, “Unicode Regular Expressions,” for an example where commonly implemented processes deal with inherently variable-width text elements owing to user expectations of the identity of a “character.”</p></div></div><p>
If I understand this correctly, they are saying that even with UTF-32, a single 32 bit unit does not necessarily correspond to a character, i.e. some characters might take more than 32 bits to encode. So even with UTF-32, you can&#39;t directly relate the character number to the byte number. (And presumably that&#39;s what you meant by &#39;random access&#39;.)
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Tue, 28 May 2013 12:33:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><a href="http://en.wikipedia.org/wiki/UTF-32">This Wikipedia article</a> says that all UTF-32 <i>code points</i> are exactly 32 bits wide, but a particular character may take more than one code point.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Tue, 28 May 2013 12:50:17 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Oh okay, I&#39;m sorry, there&#39;s special symbols that ain&#39;t real characters, so random access isn&#39;t worth jack <span class="cuss"><span><span class="cuss"><span>shit</span></span></span></span>, I forgot. I just never ran into those, so everything worked fine like that.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 02:14:29 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>No need to be sorry about it. I just mentioned it because I thought you&#39;d like to know. I didn&#39;t know either.</p><p>By the way, keeping in mind that characters aren&#39;t always single code-points, does anyone happen to know an easy way to work out where the end of a character is in UTF-8? I&#39;m making a text-box UI widget, and I need to be able to determine where the starts of the characters are so that I can implement the functionality for left and right arrow keys, mouse selection, and stuff like that.</p><p>(I&#39;m going to leave it alone for the time being and finish it later, and if I don&#39;t find any other source of info I&#39;ll check the source code for <span class="source-code"><a href="http://www.allegro.cc/manual/al_draw_text"><span class="a">al_draw_text</span></a></span> to see how it is done there.)
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 29 May 2013 04:33:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>The font addon doesn&#39;t handle decomposed characters, nor right-to-left scripts, etc. It&#39;s pretty basic.</p><p>To find the beginning of a &quot;character&quot; you&#39;d need to classify each codepoint into its category, which is essentially done with a big if-then-else over different ranges. I think you&#39;d generate it from <a href="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt">this file</a> but I haven&#39;t done it myself. More likely you&#39;d find some existing library.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Peter Wang)</author>
		<pubDate>Wed, 29 May 2013 05:59:57 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Hmm.</p><p>Well if Allegro isn&#39;t going to be able to print those kinds of characters anyway, then I guess there&#39;s no hurry for me to support them in my text box. Ultimately I&#39;d like for all this stuff to work properly, but I don&#39;t want to spend too long on this given that I don&#39;t intend to use non-ascii characters anyway.</p><p>For the time being, I think I&#39;ll just make placeholder functions for &#39;next character&#39; and &#39;previous character&#39;, and have those functions simply look for the first unit of the code-point. (I don&#39;t yet know how to do that either, but I assume there&#39;s some particular set of bits which signal which unit is the first unit; and I&#39;m pretty sure <span class="source-code"><a href="http://www.allegro.cc/manual/al_draw_text"><span class="a">al_draw_text</span></a></span> will have that!)</p><p>Thanks for the info.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Wed, 29 May 2013 06:26:43 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>In UTF-8, the leading byte of a multi-byte sequence always starts with the bits 11, trailing bytes always start with 10, and single-byte ASCII characters start with 0. The number of 1-s in a row at the top of the leading byte tells you exactly how many bytes the sequence occupies, so you can read the first byte and jump to the next code point right away. To find the previous code point, you scan backwards, skipping bytes that start with 10, until you hit a leading byte. </p><p>I&#39;m not exactly sure, but IIRC, Allegro&#39;s string handling functions do that.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 09:06:16 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I think even with UTF-8, two multi-byte sequences can be combined into one composed character.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Wed, 29 May 2013 09:14:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>It could. Existing precomposed characters are left in Unicode for compatibility reasons, but as they do it now, Unicode tends to decompose a character into a basic character plus special glyphs altering it.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 10:20:48 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>That is what I thought I said. Two unicode &quot;characters&quot; can actually render as a single visible character. So you can&#39;t really just jump around like you suggested.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Wed, 29 May 2013 10:22:35 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Are you guys talking about the things like the &quot;ae&quot; that&#39;s smooshed together?</p><p><span class="remote-thumbnail"><span class="json">{"name":"230.png","src":"\/\/djungxnpq2nug.cloudfront.net\/image\/cache\/2\/d\/2d09d84db8ed2a79358e287f3795f8c4.png","w":289,"h":104,"tn":"\/\/djungxnpq2nug.cloudfront.net\/image\/cache\/2\/d\/2d09d84db8ed2a79358e287f3795f8c4"}</span><img src="http://www.allegro.cc//djungxnpq2nug.cloudfront.net/image/cache/2/d/2d09d84db8ed2a79358e287f3795f8c4-240.jpg" alt="230.png" width="240" height="86" /></span></p><p>Unicode hexadecimal: 0xe6
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Wed, 29 May 2013 10:26:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>No, unicode specifies separate characters and modifiers like umlauts, where they are specified separately in the unicode string, but are composed together when drawing.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Wed, 29 May 2013 10:57:58 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I didn&#39;t exactly mean <i>character</i>, I meant <i>code point</i>. You can indeed jump to the next code point right away after simply reading the first byte, because it says how many bytes the current code point occupies.</p><p>----</p><p>Also, AFAIK there could be several code points per character rather than just one or two. So to count your <i>characters</i> you would also need to tell apart a real character from a modifier. Unicode just couldn&#39;t keep it simple, as if 4 billion slots weren&#39;t enough to store all possible combinations. Bluh. <img src="http://www.allegro.cc/forums/smileys/angry.gif" alt="&gt;:(" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 12:28:35 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983775#target">Karadoc ~~</a> said:</div><div class="quote"><p> Suppose I&#39;m writing a Linux program, and I choose to support unicode by using UTF-32. On Linux, wchar_t is 32 bits, so I might choose to use std::wstring to store my UTF-32 encoded text. However, if I do this, then try to port the program to Windows, the program won&#39;t work anymore because wchar_t on Windows is only 16 bits and so my UTF-32 encoding won&#39;t fit anymore. — That&#39;s why I&#39;m saying wchar_t and std::wstring should be avoided for portability.</p></div></div><p>Not if you implement it properly. As soon as you <b>encode</b> the Unicode it becomes implementation specific. But you can convert between different encodings. E.g. if you are using UTF-16 on Windows and you want to use Allegro, at some point you&#39;ll have to convert between encodings. It is still all Unicode, just stored in a different format.</p><p>If you use all of the standard functions for wide strings your std::wstring code <b>will</b> be portable. If you save wide string data on Linux and try to load it on Windows it will fail because the encodings are different. If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it. </p><p>If all of your code uses std::wstring and the w-string wide string functions, it will all be portable. If you say it &quot;won&#39;t fit&quot;, this implies that you are setting hard limits on the size of data items (e.g. in bytes). You shouldn&#39;t do this in any encoding, you should calculate that using the coding API, in order to make the code and data portable.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (billyquith)</author>
		<pubDate>Wed, 29 May 2013 15:25:55 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>The problem I have with std::wstring is that the standard doesn&#39;t explicitly say it&#39;s UTF-16 or any other encoding. But whenever you read/write a string from/to a file (or pass it to an Allegro function) you need to know the encoding.</p><p>From what I understand you would have to use something like this: <a href="http://www.cplusplus.com/reference/locale/codecvt/">http://www.cplusplus.com/reference/locale/codecvt/</a></p><p>But the existence of appropriate locales (e.g. a UTF-8 one, to convert to that) at runtime is not guaranteed. Or at least I can&#39;t figure out from the documentation how to convert a std::wstring from/to UTF-8 (or any other encoding).
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Elias)</author>
		<pubDate>Wed, 29 May 2013 17:11:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983830#target">Elias</a> said:</div><div class="quote"><p> standard doesn&#39;t explicitly say it&#39;s UTF16 or any other encoding
</p></div></div><p>It&#39;s the plain Unicode value of a <s>character</s> code point. It&#39;s not encoded. If you want it encoded you should probably use a byte string and special handling functions.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Raidho36)</author>
		<pubDate>Wed, 29 May 2013 17:56:50 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I wonder if Karadoc ~~ actually knows more about this than before or less. <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Vanneto)</author>
		<pubDate>Wed, 29 May 2013 18:21:43 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><a href="http://www.joelonsoftware.com/articles/Unicode.html">http://www.joelonsoftware.com/articles/Unicode.html</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Wed, 29 May 2013 18:27:10 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983834#target">Vanneto</a> said:</div><div class="quote"><p>I wonder if Karadoc ~~ actually knows more about this than before or less.</p></div></div><p>

If he ignores all posts by Raidho36, maybe <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Elias)</author>
		<pubDate>Wed, 29 May 2013 19:24:50 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983828#target">billyquith</a> said:</div><div class="quote"><p>Not if you implement it properly. As soon as you <b>encode</b> the Unicode it becomes implementation specific. But you can convert between different encodings. E.g. if you are using UTF-16 on Windows and you want to use Allegro, at some point you&#39;ll have to convert between encodings. It is still all Unicode, just stored in a different format.If you use all of the standard functions for wide strings your std::wstring code <b>will</b> be portable. If you save wide string data on Linux and try to load it on Windows it will fail because the encodings are different. If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it. If all of your code uses std::wstring and the w-string wide string functions, it will all be portable. If you say it &quot;won&#39;t fit&quot;, this implies that you are setting hard limits on the size of data items (e.g. in bytes). You shouldn&#39;t do this in any encoding, you should calculate that using the coding API, in order to make the code and data portable.</p></div></div><p>
The way I see it, the unicode characters must be encoded in order to exist in computer memory at all. The programmer must choose what kind of encoding to use, and what kind of data structure to store the encoded string in. When I said &quot;it won&#39;t fit&quot;, what I meant was that a 32-bit unit won&#39;t fit into a <span class="source-code">char16_t</span>, and so if I used UTF-32 encoding and stored it in a <span class="source-code">std::wstring</span>, that would work on Linux but not on Windows.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>If you load the data as UTF-32 (Linux) and convert to UTF-16 (Windows) you will still be able to use it.</p></div></div><p>My point is that I can read UTF-32 straight into a <span class="source-code">std::wstring</span> iff <span class="source-code"><span class="k1">wchar_t</span></span> is 32-bits. The 32-bit units of UTF-32 <i>won&#39;t fit</i> inside a 16-bit <span class="source-code"><span class="k1">wchar_t</span></span>.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>If all of your code uses std::wstring and the w-string wide string functions, it will all be portable.</p></div></div><p>I think this is the key to understanding what you are talking about. As far as I understand, everything else you&#39;ve said is stuff that I already knew, but here I&#39;m not sure what you mean by &quot;the w-string wide string functions&quot;. What functions are you referring to? (Perhaps it&#39;s related to <span class="source-code">codecvt</span> which Elias mentioned?)</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983834#target">Vanneto</a> said:</div><div class="quote"><p>I wonder if Karadoc ~~ actually knows more about this than before or less. 
<img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" /></p></div></div><p>Well, in this thread there is some misinformation and a significant amount of talking at cross purposes, but I feel like I&#39;ve learned a fair bit. For example, I&#39;ve learned that the maximum number of bytes per unicode code-point is 4, and that some printable characters are actually composed of multiple unicode code-points. And most recently, Elias mentioned something called <span class="source-code">codecvt</span>, which I hadn&#39;t heard of before. I&#39;m reading about it now in the C++11 Standard document. A lot of the information in the document is pretty dense, but from what I can tell <span class="source-code">codecvt</span> sounds useful, and very relevant to what we&#39;re talking about. Check this out:</p><div class="quote_container"><div class="title">C++11 Standard said:</div><div class="quote"><p>The specialization codecvt&lt;char16_t, char, mbstate_t&gt; converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt &lt;char32_t, char, mbstate_t&gt; converts between the UTF-32 and UTF-8 encoding schemes. codecvt&lt;wchar_t,char,mbstate_t&gt; converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.</p></div></div><p>

So, based on that here&#39;s my understanding of the situation: <span class="source-code">std::wstring wide_string <span class="k3">=</span> L<span class="s">"Hello"</span><span class="k2">;</span></span> will store the string in an implementation defined encoding which uses <span class="source-code"><span class="k1">wchar_t</span></span>. My concern throughout this thread has been that since we don&#39;t necessarily know what the implementation defined encoding is, it&#39;s difficult to work with it in a platform independent way. For example I can&#39;t use <span class="source-code"><a href="http://www.allegro.cc/manual/al_ustr_new_from_utf16"><span class="a">al_ustr_new_from_utf16</span></a></span>, because <span class="source-code"><span class="k1">wchar_t</span></span> may or may not be <span class="source-code"><span class="k1">uint16_t</span></span>; and since I don&#39;t know what the encoding is, I don&#39;t know how to traverse the <span class="source-code"><span class="k1">wchar_t</span></span> array to find the starts of characters and things like that which I need for my text box widget that I mentioned. For these reasons, I claimed that it&#39;s better to simply avoid using <span class="source-code"><span class="k1">wchar_t</span></span> and instead use something like <span class="source-code">char16_t</span> (or whatever) so that one could be sure of which type of unicode encoding is being used.</p><p>However, now that I&#39;ve read that <span class="source-code">codecvt<span class="k3">&lt;</span><span class="k1">wchar_t</span>,<span class="k1">char</span>,mbstate_t&gt;</span> will convert between the mystery <span class="source-code"><span class="k1">wchar_t</span></span> encoding and UTF-8, I see that it is at least possible to write portable code that uses <span class="source-code"><span class="k1">wchar_t</span></span>. 
It still seems to me that it&#39;s better to pick one of the more explicit character sizes, but at least <span class="source-code"><span class="k1">wchar_t</span></span> is not fatally flawed.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Karadoc ~~)</author>
		<pubDate>Thu, 30 May 2013 06:19:46 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/612635/983803#target">Karadoc ~~</a> said:</div><div class="quote"><p>By the way, keeping in mind that characters aren&#39;t always single code-points, does anyone happen to know an easy to work out where the end of a character is in UTF-8? I&#39;m making a text-box UI widget, and I need to be able to determine where the starts of the characters are so that I can implement the functionality for left and right arrow keys, and mouse selection and stuff like that.</p></div></div><p>

I think allegro 4.4 source code had a routine that could do that? Something that counted the character length (not byte length) of a utf-8 string?</p><p>I built my own unicode lib a few years ago from pillaging the allegro 4.4 source code because I didn&#39;t like any of the proper unicode libraries.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (m c)</author>
		<pubDate>Thu, 30 May 2013 08:12:23 +0000</pubDate>
	</item>
</rss>
