<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>Unicode routines and std::string</title>
		<link>http://www.allegro.cc/forums/view/579643</link>
		<description>Allegro.cc Forum Thread</description>
		<webMaster>matthew@allegro.cc (Matthew Leverton)</webMaster>
		<lastBuildDate>Thu, 20 Apr 2006 15:15:33 +0000</lastBuildDate>
	</channel>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>To make some things easier, I&#39;m using my own String class, derived from std::string. I added methods to trim, convert (to int, float, etc., via string streams) among other things. Since my project is 100% based on Allegro, should I add methods to use Allegro&#39;s Unicode routines OR forget about std::string and make my String class a wrapper around Unicode routines? I would like to use Unicode strings because it&#39;s possible I&#39;ll need to translate my game into other languages.</p><p>I&#39;m not sure if std::strings and C strings (Allegro&#39;s Unicode strings) can coexist in the same class or if it&#39;s even a good idea (I think it isn&#39;t).</p><p>The same goes for file routines (make_absolute_filename, replace_extension, etc.).</p><p>What do you suggest?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Biznaga)</author>
		<pubDate>Mon, 17 Apr 2006 03:51:23 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I do not think std::string and C strings can be mixed with Allegro&#39;s strings. Even though Allegro strings use &#39;char *&#39; as their data type, the underlying characters may not match ASCII characters. If you are going for Unicode, then you will certainly have a problem with std::string; for example, in Unicode the character &#39;\0&#39; is two bytes long, whereas in std::string/C strings it is one. </p><p>My suggestion is to forget std::string and wrap Allegro&#39;s Unicode routines inside your String class. But you can use std::vector&lt;char&gt; for the character buffer, so you avoid the allocation/deallocation issues. You could also make your String class reference-counted, so you can use it by value.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (axilmar)</author>
		<pubDate>Mon, 17 Apr 2006 19:21:39 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
in Unicode the character &#39;\0&#39; is two bytes long
</p></div></div><p>
Not in UTF-8, which Allegro uses by default. Since \0 is the only special character for std::string, everything else can be used as you please. String length, though, will return the number of bytes in the string, not necessarily the number of characters. Similarly, the index operators ( [] and at() ) will return the nth byte, not the nth character. As long as you just load and compare strings, concatenate them, and pass them around, std::string will be fine (provided you do use UTF-8, not UTF-16, and use UTF-8 output routines only).
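</p><p>A minimal sketch of the byte-versus-character difference, assuming the std::string holds valid UTF-8. The name utf8_length is hypothetical and is not part of Allegro or the standard library; for strings in Allegro&#39;s current encoding, ustrlen should give the same kind of answer.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters (code points) by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes `s` contains valid UTF-8; malformed input is not detected.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII: a new character starts here
        }
    }
    return count;
}
```

<p>For the five-byte string &quot;ni\xC3\xB1o&quot; (niño), size() reports 5 while utf8_length reports 4.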
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Mon, 17 Apr 2006 20:16:41 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Wouldn&#39;t UTF-16 be easy to use with std::string? After all you can specify the data type of a character, so you can pass a short for that instead of a char.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Fladimir da Gorf)</author>
		<pubDate>Mon, 17 Apr 2006 21:05:32 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>You would need to use wstring or basic_string&lt;your_type_here&gt;.  I am not an expert in this, but I believe it is facets and/or locales that allow you to do multi-byte character strings.  I wonder if it is possible to write an Allegro string facet to allow Allegro UTF strings in standard library strings.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (gillius)</author>
		<pubDate>Tue, 18 Apr 2006 00:19:47 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Allegro seems to provide a wide range of string routines, so I think I could make a very functional String class with them.</p><p>I forgot to mention something important: my std::string-based class also uses regular expressions (I&#39;m using the rx library). Will this new string class work along with rx too? And what about Lua scripts? I just read <a href="http://lua-users.org/wiki/LuaUnicode">this</a>, but has anybody worked with Lua scripts and Unicode?</p><p>I&#39;m starting to believe I should stick with U_ASCII if I really don&#39;t need UTF-8. At least U_ASCII is good enough for most Western languages, right? <img src="http://www.allegro.cc/forums/smileys/rolleyes.gif" alt="::)" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Biznaga)</author>
		<pubDate>Tue, 18 Apr 2006 11:26:49 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>UTF-8 is fully compatible with 7-bit ASCII. Any character in the 0-127 (inclusive) range is the same in both UTF-8 and ASCII.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kitty Cat)</author>
		<pubDate>Tue, 18 Apr 2006 11:43:32 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>...including the special characters 0 through 31. Which is why a std::string can easily hold UTF-8 data; the only thing that is not reliable is accessing a single character by index, since operator[] and at() count bytes, not characters.<br />I&#39;m curious about the Allegro string class, though. BTW, I&#39;d use std::string as a base and only implement Unicode-specific functionality through Allegro. This&#39;ll save you from re-coding the memory allocation yourself.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Tue, 18 Apr 2006 12:03:43 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>
std::basic_string&lt;wchar_t&gt;
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (X-G)</author>
		<pubDate>Tue, 18 Apr 2006 16:32:44 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>This doesn&#39;t handle variable-width characters, though; Allegro does.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Tue, 18 Apr 2006 16:42:28 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>
That&#39;s right, but they work smashingly for Unicode.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (X-G)</author>
		<pubDate>Tue, 18 Apr 2006 16:54:53 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>And don&#39;t forget that UTF-8 can and will encode less commonly used chars with up to 6 bytes. Which is a Unicode encoding <img src="http://www.allegro.cc/forums/smileys/wink.gif" alt=";)" /></p><p>Edit: Wikipedia seems to say it&#39;s 4 bytes. Is it 4?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Thomas Fjellstrom)</author>
		<pubDate>Tue, 18 Apr 2006 16:55:34 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>It is 4. The standard document after which I modeled <a href="http://www.allegro.cc/forums/thread/529059">my own routines</a> for encoding to and decoding from UTF-8 says so as well.<br />Sequences longer than 4 bytes are invalid.</p><p>Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a C-style string, which you can then convert to whatever Unicode format you have set for Allegro.<br />(In 4.2rc2 the format of the c_str() retrieved from a standard wchar_t string seemed to be equal to Allegro&#39;s U_UNICODE format.)
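</p><p>A minimal sketch of why four bytes are enough, assuming code points are limited to 0 through 0x10FFFF with the surrogate range excluded. The name utf8_encode is hypothetical; this is not the routine from the linked thread.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Valid input is 0..0x10FFFF excluding surrogates (0xD800..0xDFFF),
// so the result is always 1 to 4 bytes long.
std::string utf8_encode(unsigned long cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 7 bits: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 11 bits: 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 16 bits: 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 21 bits: 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

<p>The largest valid code point, 0x10FFFF, needs 21 bits, which is exactly what the four-byte form carries.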
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Dennis)</author>
		<pubDate>Tue, 18 Apr 2006 19:55:17 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Tobias Dammers said:</div><div class="quote"><p>
I&#39;d use std::string as a base and only implement unicode-specific functionality through allegro. This&#39;ll save you from re-coding the memory allocation code yourself.
</p></div></div><p>

Isn&#39;t Allegro already doing this with functions like uinsert, uremove, ustrcat?</p><div class="quote_container"><div class="title">Dennis Busch said:</div><div class="quote"><p>
Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string
</p></div></div><p>

Really? Wouldn&#39;t c_str return a const wchar_t* instead of a const char*? Can I pass a const wchar_t* as if it were a const char* as a function parameter? I&#39;m concerned about whether rx will work with Unicode strings. It receives const char* as parameters.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Biznaga)</author>
		<pubDate>Wed, 19 Apr 2006 08:51:27 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
Can I pass a const wchar_t* as if it were a const char* as a function parameter?
</p></div></div><p>
No, you&#39;d have to convert it. wchar_t is 2 bytes per character (on Windows, at least; it is 4 on most Unix-like systems), where a char is just 1 (unless it&#39;s UTF-8, but that&#39;s still compatible, for the most part).
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kitty Cat)</author>
		<pubDate>Wed, 19 Apr 2006 09:08:50 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>16-bit encoding is not much of a problem; you just use wchar_t. You do need to take special precautions for output, though: either convert to an 8-bit code table, or use 16-bit output routines.<br />UTF-8 has variable character widths, which is a bit of a problem. Allegro can output these, and you can store them in a std::string, but this will not give you correct lengths and indices. Worse, an index may point into the middle of a multi-byte character. If you can live with these issues, then use std::string with UTF-8. If you can use UTF-16, use std::basic_string&lt;wchar_t&gt; and UTF-16 output routines. If you have to use UTF-8, and need a string class that handles indices and everything correctly, then I guess you&#39;re stuck with writing your own string.
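</p><p>A minimal sketch of mapping a character index to a byte index, so that an index never lands inside a multi-byte character. It assumes valid UTF-8 in a std::string; utf8_offset is a hypothetical name, and Allegro&#39;s uoffset plays a similar role for strings in its current encoding.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which the n-th character
// (0-based) starts, or std::string::npos if the string has fewer than
// n + 1 characters. Assumes `s` contains valid UTF-8.
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte
            if (seen == n) {
                return i;             // the n-th character starts here
            }
            ++seen;
        }
    }
    return std::string::npos;
}
```

<p>A wrapper class could use such a lookup behind its own operator[], at the cost of a linear scan per access.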
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Wed, 19 Apr 2006 10:57:12 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
Using standard strings with wchar_t is nice, because you can use the c_str() method of the standard string to retrieve a c style string
</p></div></div><p>
Really? Wouldn&#39;t c_str return a const wchar_t* instead of a const char*?
</p></div></div><p>
Unfortunately, you cut off what I said just before the all-important part: <i>&quot;[..], which you can then convert to whatever Unicode format you have set for Allegro.&quot;</i><br />So what you have to do is, of course, reinterpret the result of c_str() as a char* so that you can use Allegro&#39;s conversion function to make an Allegro-usable Unicode string out of it.<br />Example:
</p><div class="source-code"><div class="toolbar"></div><div class="inner"><table width="100%"><tbody><tr><td class="number">1</td><td><span class="c">// (assume that temp is a wstring with actual content and not empty.)</span></td></tr><tr><td class="number">2</td><td><span class="k1">int</span> tmp_size <span class="k3">=</span> <span class="n">0</span><span class="k2">;</span></td></tr><tr><td class="number">3</td><td><span class="k1">char</span> <span class="k3">*</span>outstring <span class="k3">=</span> NULL<span class="k2">;</span></td></tr><tr><td class="number">4</td><td>&#160;</td></tr><tr><td class="number">5</td><td><span class="c">// Allocate memory for Allegro Unicode string</span></td></tr><tr><td class="number">6</td><td>tmp_size <span class="k3">=</span> <span class="k2">(</span>temp.length<span class="k2">(</span><span class="k2">)</span><span class="k3">+</span><span class="n">1</span><span class="k2">)</span><span class="k3">*</span><a href="http://www.allegro.cc/manual/uwidth_max" target="_blank"><span class="a">uwidth_max</span></a><span class="k2">(</span>U_CURRENT<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">7</td><td>outstring <span class="k3">=</span> <span class="k1">new</span> <span class="k1">char</span><span class="k2">[</span>tmp_size<span class="k2">]</span><span class="k2">;</span></td></tr><tr><td class="number">8</td><td><span class="k1">if</span><span class="k2">(</span><span class="k3">!</span><span class="k2">(</span>outstring<span class="k2">)</span><span class="k2">)</span> <span class="c">// error not enough memory for current string</span></td></tr><tr><td class="number">9</td><td><span class="k2">{</span></td></tr><tr><td class="number">10</td><td>  <span class="c">// do error handling</span></td></tr><tr><td class="number">11</td><td><span class="k2">}</span></td></tr><tr><td class="number">12</td><td><span class="k1">else</span> <span class="c">// now convert to Allegro's 
format</span></td></tr><tr><td class="number">13</td><td><span class="k2">{</span></td></tr><tr><td class="number">14</td><td>  <a href="http://www.delorie.com/djgpp/doc/libc/libc_569.html" target="_blank">memset</a><span class="k2">(</span>outstring,<span class="n">0</span>,tmp_size<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">15</td><td>  <a href="http://www.allegro.cc/manual/do_uconvert" target="_blank"><span class="a">do_uconvert</span></a><span class="k2">(</span><span class="k2">(</span><span class="k1">char</span> <span class="k3">*</span><span class="k2">)</span>temp.c_str<span class="k2">(</span><span class="k2">)</span>,U_UNICODE,outstring,U_CURRENT,tmp_size<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">16</td><td><span class="k2">}</span></td></tr></tbody></table></div></div><p>

This will work as long as Allegro&#39;s U_UNICODE hasn&#39;t changed since 4.2rc2, because in that version U_UNICODE used a fixed two bytes per character. It also assumes that wchar_t is 16 bits wide on your platform.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Dennis)</author>
		<pubDate>Wed, 19 Apr 2006 11:13:16 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>You are right. I understand the conversion part. I was thinking of the opposite case: converting Allegro&#39;s strings to C strings that rx could use. Maybe I can convert to U_ASCII, but that could lose or corrupt data, and it makes no sense at all if what I&#39;m trying to do is use Unicode.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Biznaga)</author>
		<pubDate>Wed, 19 Apr 2006 12:27:17 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>UTF-16 or &quot;wide character&quot; strings used to be convenient until there were more than 65535 characters in Unicode, so now in UTF-16 a character can be as large as 32 bits (&lt;sarcasm&gt;yeah).  However, if you assume that all of your strings are going to stay on the BMP (basic multilingual plane), then you&#39;ll probably be fine in almost any application assuming that every character fits in a single 16-bit unit. If you want to be safe, check that no characters are in the range D800-DBFF when you get strings from external sources.  I am going to assume that Allegro&#39;s Unicode functions probably don&#39;t implement this portion of UTF-16.
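</p><p>A minimal sketch of that safety check for text received from an external source. The name is_bmp_only is hypothetical. Note one deliberate difference: D800-DBFF covers only the leading (high) surrogates, and this sketch rejects the whole surrogate range D800-DFFF so that stray trailing surrogates are caught too.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true if every 16-bit unit is a complete BMP
// character, i.e. the text contains no surrogate code units at all.
// Checks 0xD800..0xDFFF (high and low surrogates), which is wider than the
// D800-DBFF range mentioned in the post.
bool is_bmp_only(const std::vector<std::uint16_t>& text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] >= 0xD800 && text[i] <= 0xDFFF) {
            return false;  // part of a surrogate pair (or an unpaired one)
        }
    }
    return true;
}
```

<p>Text that passes this check can be indexed one 16-bit unit per character without splitting anything.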
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (gillius)</author>
		<pubDate>Wed, 19 Apr 2006 17:40:37 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>The 16 bit range pretty much covers all current languages except Chinese. You can do one type of Chinese (as spoken in Taiwan etc) with just 16 bit characters, but for full Chinese Chinese, you need some code points above 65535. That&#39;s the only interesting thing beyond 16 bit, though: all the other extensions are for dead or theoretical languages that nobody actually speaks today.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Shawn Hargreaves)</author>
		<pubDate>Wed, 19 Apr 2006 23:42:57 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I didn&#39;t know there were full Chinese characters in the upper code points.  I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (gillius)</author>
		<pubDate>Thu, 20 Apr 2006 07:16:34 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Well, CJK is simply too much for me. <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" /></p><p>At most, I&#39;d like to support English and Spanish in my game (BTW, an RPG), and maybe a horrendous French translation. Western languages. So UTF-16 is OK for me; even U_ASCII, I think.</p><p>I&#39;ll try both std::basic_string&lt;wchar_t&gt; and a wrapper around Allegro&#39;s Unicode routines. Thank you, everybody!</p><p>I have another related question: ASCII has 7-bit characters, Allegro&#39;s U_ASCII has 8-bit. Are the last 128 characters in U_ASCII the same as in ISO-Latin-1, or does it depend on the locale?</p><p>Talking about translation, does anybody know of a good tutorial for gettext?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Biznaga)</author>
		<pubDate>Thu, 20 Apr 2006 09:43:08 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
I didn&#39;t know there were full Chinese characters in the upper code points. I thought it was just dead languages like Egyptian hieroglyphs and other ancient languages...
</p></div></div><p>

Plane 1 (supplementary multilingual plane) is reserved for dead and ancient scripts and some other stuff. Plane 2 however (I believe it&#39;s called supplementary ideographic plane) is filled with about 40000 older, traditional and rarely used Chinese symbols that simply didn&#39;t fit in the basic multilingual plane.</p><p>And AFAIK hieroglyphs are not in Unicode... yet <img src="http://www.allegro.cc/forums/smileys/smiley.gif" alt=":)" />. But we have <a href="http://www.alanwood.net/unicode/dingbats.html">these</a> in Unicode, so...
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Jakub Wasilewski)</author>
		<pubDate>Thu, 20 Apr 2006 15:15:33 +0000</pubDate>
	</item>
</rss>
