<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>Encoder/Decoder (UTF-8 Unicode4.0) FILE Functions</title>
		<link>http://www.allegro.cc/forums/view/529059</link>
		<description>Allegro.cc Forum Thread</description>
		<webMaster>matthew@allegro.cc (Matthew Leverton)</webMaster>
		<lastBuildDate>Sat, 17 Sep 2005 02:44:03 +0000</lastBuildDate>
	</channel>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>I needed a way to store Unicode strings for my current project, so I&#39;ve written some functions that do so using UTF-8.<br />(Thanks again to Evert and Kirr for suggesting UTF-8 instead of UTF-16.)<br />Since these functions do not depend directly on that project, I decided to separate them out. They might be generally useful to others as well, so I present them here, free for anyone who wants to use them. :)<br />A C++ hpp/cpp pair is attached.</p><p>The header is shown here, so you don&#39;t need to download it to see what it can and cannot do:
</p><div class="source-code"><div class="toolbar"></div><div class="inner"><table width="100%"><tbody><tr><td class="number">1</td><td><span class="c">/*</span></td></tr><tr><td class="number">2</td><td><span class="c">  --- Projectname: Encoder/Decoder (UTF-8 Unicode4.0) FILE Functions ---</span></td></tr><tr><td class="number">3</td><td><span class="c">  (Targetsystem: Crossplatform)</span></td></tr><tr><td class="number">4</td><td><span class="c">  </span></td></tr><tr><td class="number">5</td><td><span class="c">  Author: Dennis Busch (http://www.dennisbusch.de)</span></td></tr><tr><td class="number">6</td><td><span class="c"></span></td></tr><tr><td class="number">7</td><td><span class="c">  Content:</span></td></tr><tr><td class="number">8</td><td><span class="c">  Functions for storing and restoring full lines of</span></td></tr><tr><td class="number">9</td><td><span class="c">  UTF-8(as described in "Unicode4.0.0 Chapter 3.9") encoded UNICODE</span></td></tr><tr><td class="number">10</td><td><span class="c">  'string's , 'wstring's and 'uistring's to and from 'FILE's.</span></td></tr><tr><td class="number">11</td><td><span class="c"></span></td></tr><tr><td class="number">12</td><td><span class="c">  Details:</span></td></tr><tr><td class="number">13</td><td><span class="c">    Unicode characters that can not be held in a single</span></td></tr><tr><td class="number">14</td><td><span class="c">    'memory cell' of the desired target string type will be read but ignored.</span></td></tr><tr><td class="number">15</td><td><span class="c"></span></td></tr><tr><td class="number">16</td><td><span class="c">    Footnote1:</span></td></tr><tr><td class="number">17</td><td><span class="c">    "The definition of UTF-8 prohibits encoding character numbers between</span></td></tr><tr><td class="number">18</td><td><span class="c">     U+D800 and U+DFFF"</span></td></tr><tr><td class="number">19</td><td><span class="c">    So characters decoded into that range will be 
ignored as well and, more</span></td></tr><tr><td class="number">20</td><td><span class="c">    importantly, characters from that range will not be encoded on storing.</span></td></tr><tr><td class="number">21</td><td><span class="c"></span></td></tr><tr><td class="number">22</td><td><span class="c">    Footnote2:</span></td></tr><tr><td class="number">23</td><td><span class="c">    On storing, the byte order mark (BOM) U+FEFF</span></td></tr><tr><td class="number">24</td><td><span class="c">    will be stored like any other character encoded in UTF-8,</span></td></tr><tr><td class="number">25</td><td><span class="c">    always yielding the octet sequence EF BB BF.</span></td></tr><tr><td class="number">26</td><td><span class="c">    On restoring, any sequence decoded to U+FEFF will be ignored, that is,</span></td></tr><tr><td class="number">27</td><td><span class="c">    it is not appended to the resulting string.</span></td></tr><tr><td class="number">28</td><td><span class="c"></span></td></tr><tr><td class="number">29</td><td><span class="c">    Footnote3:</span></td></tr><tr><td class="number">30</td><td><span class="c">    The standard allows only code points of at most 21</span></td></tr><tr><td class="number">31</td><td><span class="c">    bits to be encoded or decoded. 
Even though characters</span></td></tr><tr><td class="number">32</td><td><span class="c">    with the full range of 32bits *could* be stored in UTF-8 </span></td></tr><tr><td class="number">33</td><td><span class="c">    using up to 6 octets, this implementation does *NOT* do so.</span></td></tr><tr><td class="number">34</td><td><span class="c">    It only correctly encodes, decodes up to 4 octets, as defined by said</span></td></tr><tr><td class="number">35</td><td><span class="c">    standard("Unicode4.0.0 Chapter 3.9" UTF-8).</span></td></tr><tr><td class="number">36</td><td><span class="c">    Quote: "Unicode4.0.0 Appendix C.3"</span></td></tr><tr><td class="number">37</td><td><span class="c">      "those five- and six-byte sequences are illegal for the use </span></td></tr><tr><td class="number">38</td><td><span class="c">       of UTF-8 as an encoding form of Unicode characters."</span></td></tr><tr><td class="number">39</td><td><span class="c"></span></td></tr><tr><td class="number">40</td><td><span class="c">    Footnote4:</span></td></tr><tr><td class="number">41</td><td><span class="c">    This implementation also tries to protect against decoding of invalid</span></td></tr><tr><td class="number">42</td><td><span class="c">    octet sequences.</span></td></tr><tr><td class="number">43</td><td><span class="c">    It does so by silently reading any invalid sequences, but every such</span></td></tr><tr><td class="number">44</td><td><span class="c">    invalid sequence(as it should not be allowed in the first place) will</span></td></tr><tr><td class="number">45</td><td><span class="c">    be skipped and an ASCII '?' 
is given out as result character.</span></td></tr><tr><td class="number">46</td><td><span class="c">*/</span></td></tr><tr><td class="number">47</td><td>&#160;</td></tr><tr><td class="number">48</td><td><span class="p">#include &lt;cstdio&gt;</span></td></tr><tr><td class="number">49</td><td><span class="p">#include &lt;cstdlib&gt;</span></td></tr><tr><td class="number">50</td><td><span class="p">#include &lt;string&gt;</span></td></tr><tr><td class="number">51</td><td><span class="k1">using</span> <span class="k1">namespace</span> std<span class="k2">;</span></td></tr><tr><td class="number">52</td><td>&#160;</td></tr><tr><td class="number">53</td><td><span class="p">#if !defined(__DB_utf_eight_HEAD_INCLUDED)</span></td></tr><tr><td class="number">54</td><td><span class="p">#define __DB_utf_eight_HEAD_INCLUDED</span></td></tr><tr><td class="number">55</td><td>&#160;</td></tr><tr><td class="number">56</td><td><span class="c">/* - All functions return a negative value on failure or if the result</span></td></tr><tr><td class="number">57</td><td><span class="c">     would be empty.</span></td></tr><tr><td class="number">58</td><td><span class="c">   - The FILE parameter is always expected to be a valid </span></td></tr><tr><td class="number">59</td><td><span class="c">     file pointer and it must be opened in binary mode. 
*/</span></td></tr><tr><td class="number">60</td><td>&#160;</td></tr><tr><td class="number">61</td><td>&#160;</td></tr><tr><td class="number">62</td><td><span class="c">/* Explicitly writing the UTF-8 byte order mark</span></td></tr><tr><td class="number">63</td><td><span class="c">   (use only once, at the beginning of the file) as a signature for external editors */</span></td></tr><tr><td class="number">64</td><td><span class="k1">int</span> write_bom_utf8<span class="k2">(</span>FILE <span class="k3">*</span>out<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">65</td><td>&#160;</td></tr><tr><td class="number">66</td><td><span class="c">// Encoding and decoding a single Unicode character to and from file</span></td></tr><tr><td class="number">67</td><td><span class="k1">int</span> encode_utf8<span class="k2">(</span>FILE <span class="k3">*</span>out, <span class="k1">unsigned</span> <span class="k1">int</span> code<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">68</td><td><span class="k1">int</span> decode_utf8<span class="k2">(</span>FILE <span class="k3">*</span>in, <span class="k1">unsigned</span> <span class="k1">int</span> <span class="k3">*</span>result<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">69</td><td>&#160;</td></tr><tr><td class="number">70</td><td><span class="c">// Reading string and wstring from file </span></td></tr><tr><td class="number">71</td><td><span class="c">/* Starts reading actual character data after ignoring any preceding newline</span></td></tr><tr><td class="number">72</td><td><span class="c">   or carriage return codes or byte order marks,</span></td></tr><tr><td class="number">73</td><td><span class="c">   then reads characters until the next newline or carriage return appears; </span></td></tr><tr><td class="number">74</td><td><span class="c">   the nl or cr character itself is not appended to result. */</span></td></tr><tr><td 
class="number">75</td><td><span class="c">/* (basically means two things: Empty lines will be skipped and windows' CR+NL</span></td></tr><tr><td class="number">76</td><td><span class="c">    will not lead to reading extra empty lines.) */</span></td></tr><tr><td class="number">77</td><td><span class="c">/* (result will be replaced with the read line if return value is 0, otherwise</span></td></tr><tr><td class="number">78</td><td><span class="c">    result will not be altered in any way) */</span></td></tr><tr><td class="number">79</td><td><span class="k1">int</span> readline_utf8<span class="k2">(</span>FILE <span class="k3">*</span>in,  string <span class="k3">*</span>result<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">80</td><td><span class="k1">int</span> readline_utf8<span class="k2">(</span>FILE <span class="k3">*</span>in, wstring <span class="k3">*</span>result<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">81</td><td>&#160;</td></tr><tr><td class="number">82</td><td><span class="c">// Writing string and wstring to file (also writes CR+NL after the string)</span></td></tr><tr><td class="number">83</td><td><span class="c">/* (if FILE is not in binary mode, some platforms(like windows) will do</span></td></tr><tr><td class="number">84</td><td><span class="c">    a conversion from NL to CR+NL, so then actually CR+CR+NL will be written,</span></td></tr><tr><td class="number">85</td><td><span class="c">    which is not really a problem for the readline functions above, this info</span></td></tr><tr><td class="number">86</td><td><span class="c">    is just here, so you do not have to get confused about that, if you should</span></td></tr><tr><td class="number">87</td><td><span class="c">    ever take a closer peek at the output file) */</span></td></tr><tr><td class="number">88</td><td><span class="k1">int</span> writeline_utf8<span class="k2">(</span>FILE <span class="k3">*</span>out,  
string line<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">89</td><td><span class="k1">int</span> writeline_utf8<span class="k2">(</span>FILE <span class="k3">*</span>out, wstring line<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">90</td><td>&#160;</td></tr><tr><td class="number">91</td><td><span class="c">// Functions for strings that can store the full range of characters</span></td></tr><tr><td class="number">92</td><td><span class="k1">typedef</span> basic_string<span class="k3">&lt;</span><span class="k1">unsigned</span> int&gt; uistring<span class="k2">;</span></td></tr><tr><td class="number">93</td><td><span class="k1">int</span> readline_utf8<span class="k2">(</span>FILE <span class="k3">*</span>in, uistring <span class="k3">*</span>result<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">94</td><td><span class="k1">int</span> writeline_utf8<span class="k2">(</span>FILE <span class="k3">*</span>out, uistring line<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">95</td><td>&#160;</td></tr><tr><td class="number">96</td><td><span class="p">#endif // #if !defined(__DB_utf_eight_HEAD_INCLUDED)</span></td></tr></tbody></table></div></div><p>
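For anyone wondering what the encoding rules in Footnote1 and Footnote3 amount to in practice, here is a minimal sketch. It is not the code from the attached files: the function name utf8_encode_to_buffer and the idea of writing into a caller-supplied buffer instead of a FILE are assumptions made only for illustration. It rejects surrogates and anything above U+10FFFF, and otherwise emits 1 to 4 octets.</p>
<pre>
```cpp
// Hypothetical helper (an illustration, not part of the attached hpp/cpp pair).
// Encodes one code point into buf, which must have room for at least 4 octets.
// Returns the number of octets written, or -1 if the code point is rejected.
int utf8_encode_to_buffer(unsigned int code, unsigned char *buf)
{
    // Footnote1: code points U+D800..U+DFFF must never be encoded.
    if (code >= 0xD800 && code <= 0xDFFF) return -1;
    // Footnote3: nothing above U+10FFFF (21 bits) is encoded.
    if (code > 0x10FFFF) return -1;

    if (code <= 0x7F) {                       // 1 octet:  0xxxxxxx
        buf[0] = (unsigned char)code;
        return 1;
    }
    if (code <= 0x7FF) {                      // 2 octets: 110xxxxx 10xxxxxx
        buf[0] = (unsigned char)(0xC0 | (code >> 6));
        buf[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code <= 0xFFFF) {                     // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = (unsigned char)(0xE0 | (code >> 12));
        buf[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    buf[0] = (unsigned char)(0xF0 | (code >> 18));
    buf[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (code & 0x3F));
    return 4;
}
```
</pre>
<p>Passing U+FEFF to this sketch yields the three octets EF BB BF, which matches the sequence described in Footnote2. Returning a negative value for rejected input follows the convention stated in the header comment, but how the attached implementation handles these cases internally may differ.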
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Dennis)</author>
		<pubDate>Sat, 17 Sep 2005 02:44:03 +0000</pubDate>
	</item>
</rss>
