no-reply@allegro.cc (Dennis) — Sat, 17 Sep 2005 02:44:03 +0000

I needed a way to store Unicode strings for my current project, so I've written some functions that do so using UTF-8.
(Thanks again to Evert and Kirr for suggesting UTF-8 instead of UTF-16.)
Since these functions do not depend directly on that project, I decided to separate them out. I then thought they might be useful to others as well, so I present them here, free to everyone who wants to use them. :)
A C++ .hpp/.cpp pair is attached.

Here is the header, so you don't need to download anything to see what it can and cannot do:

/*
  --- Projectname: Encoder/Decoder (UTF-8 Unicode4.0) FILE Functions ---
  (Targetsystem: Crossplatform)

  Author: Dennis Busch (http://www.dennisbusch.de)

  Content:
  Functions for storing and restoring full lines of
  UTF-8 (as described in "Unicode4.0.0 Chapter 3.9") encoded UNICODE
  'string's, 'wstring's and 'uistring's to and from 'FILE's.

  Details:
    Unicode characters that cannot be held in a single
    'memory cell' of the desired target string type will be read but ignored.

    Footnote1:
    "The definition of UTF-8 prohibits encoding character numbers between
     U+D800 and U+DFFF"
    So characters decoded into that range will be ignored as well and, more
    importantly, characters from that range will not be encoded on storing.

    Footnote2:
    On storing, the ByteOrderMark (BOM) U+FEFF
    will be stored like any other character encoded in UTF-8,
    always leading to the octet sequence EF BB BF.
    On restoring, any sequence decoded to U+FEFF will be ignored, that is,
    it is not appended to the resulting string.

    Footnote3:
    The standard allows only code points of at most 21
    bits to be encoded or decoded. Even though characters
    of up to 31 bits *could* be stored in UTF-8
    using up to 6 octets, this implementation does *NOT* do so.
    It only encodes and decodes sequences of up to 4 octets, as defined by
    said standard ("Unicode4.0.0 Chapter 3.9" UTF-8).
    Quote: "Unicode4.0.0 Appendix C.3"
      "those five- and six-byte sequences are illegal for the use
       of UTF-8 as an encoding form of Unicode characters."

    Footnote4:
    This implementation also tries to protect against decoding of invalid
    octet sequences.
    It does so by silently reading any invalid sequence; every such
    invalid sequence (as it should not be allowed in the first place) will
    be skipped and an ASCII '?' is returned as the result character.
*/

#include <cstdio>  // FILE
#include <string>  // string, wstring, basic_string
// (the third #include of the original listing could not be recovered)
using namespace std;

#if !defined(__DB_utf_eight_HEAD_INCLUDED)
#define __DB_utf_eight_HEAD_INCLUDED

/* - All functions return a negative value on failure or if the result
     would be empty.
   - The FILE parameter is always expected to be a valid
     file pointer and it must be opened in binary mode. */


/* Explicitly writing the UTF-8 ByteOrderMark
   (use only once, at the beginning of the file) as a signature for
   external editors */
int write_bom_utf8(FILE *out);

// Encoding and decoding a single UNICODE character to and from file
int encode_utf8(FILE *out, unsigned int code);
int decode_utf8(FILE *in, unsigned int *result);

// Reading string and wstring from file
/* Starts reading actual character data after ignoring any preceding newline
   or carriage return codes or byte order marks,
   then reads characters until the next newline or carriage return appears
   (the NL or CR character itself is not appended to result). */
/* (This basically means two things: empty lines will be skipped, and
    Windows' CR+NL will not lead to reading extra empty lines.) */
/* (result will be replaced with the read line if the return value is 0;
    otherwise result will not be altered in any way) */
int readline_utf8(FILE *in,  string *result);
int readline_utf8(FILE *in, wstring *result);

// Writing string and wstring to file (also writes CR+NL after the string)
/* (If FILE is not in binary mode, some platforms (like Windows) will
    convert NL to CR+NL, so CR+CR+NL will actually be written.
    That is not really a problem for the readline functions above; this
    note is only here so you do not get confused if you ever take a
    closer look at the output file.) */
int writeline_utf8(FILE *out,  string line);
int writeline_utf8(FILE *out, wstring line);

// Functions for strings that can store the full range of characters
typedef basic_string<unsigned int> uistring;
int readline_utf8(FILE *in, uistring *result);
int writeline_utf8(FILE *out, uistring line);

#endif // #if !defined(__DB_utf_eight_HEAD_INCLUDED)
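
To make Footnote1 and Footnote3 a bit more concrete, here is a minimal sketch of the encoding side. It is not the code from the attached .cpp: the helper name encode_utf8_to_string is made up for illustration, it appends to a std::string instead of writing to a FILE, and it assumes U+10FFFF as the upper limit (the highest code point the Unicode standard defines within those 21 bits).

```cpp
#include <string>

// Hypothetical helper, NOT part of the attached files: appends the UTF-8
// encoding of `code` to `out`. Returns the number of octets appended, or -1
// (leaving `out` untouched) if `code` must not be encoded.
int encode_utf8_to_string(std::string &out, unsigned int code)
{
    if (code >= 0xD800 && code <= 0xDFFF) return -1; // Footnote1: surrogate range
    if (code > 0x10FFFF) return -1;                  // Footnote3: no 5/6-octet forms (assumed limit)

    if (code < 0x80) {                               // 1 octet: 0xxxxxxx
        out += static_cast<char>(code);
        return 1;
    }
    if (code < 0x800) {                              // 2 octets: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (code >> 6));
        out += static_cast<char>(0x80 | (code & 0x3F));
        return 2;
    }
    if (code < 0x10000) {                            // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (code >> 12));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
        return 3;
    }
    // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (code >> 18));
    out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
    return 4;
}
```

Feeding U+FEFF through this sketch appends the three octets EF BB BF, which is the sequence Footnote2 mentions for the byte order mark.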

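The decoding rules from Footnote1, Footnote2 and Footnote4 could look roughly like the sketch below. Again, this is only an illustration under assumptions and not the attached implementation: the names decode_one_utf8 and decode_utf8_string are hypothetical, the input is a std::string of raw octets rather than a FILE, and how many octets of a broken sequence get consumed before the '?' is emitted is my own choice, since the header does not say.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes one code point from `in` starting at `pos`
// (which must be < in.size()) and advances `pos`. Invalid input yields '?'.
unsigned int decode_one_utf8(const std::string &in, std::size_t &pos)
{
    unsigned char lead = static_cast<unsigned char>(in[pos++]);
    if (lead < 0x80) return lead;                     // plain ASCII

    int extra;             // continuation octets expected after the lead
    unsigned int code;     // code point accumulated so far
    unsigned int minimum;  // smallest value that really needs this length
    if      ((lead & 0xE0) == 0xC0) { extra = 1; code = lead & 0x1F; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; code = lead & 0x0F; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; code = lead & 0x07; minimum = 0x10000; }
    else return '?';       // stray continuation octet or a 5/6-octet lead

    for (int i = 0; i < extra; ++i) {
        if (pos >= in.size()) return '?';             // input ends mid-sequence
        unsigned char c = static_cast<unsigned char>(in[pos]);
        if ((c & 0xC0) != 0x80) return '?';           // not 10xxxxxx; leave it unread
        code = (code << 6) | (c & 0x3F);
        ++pos;
    }
    if (code < minimum || code > 0x10FFFF) return '?'; // overlong or out of range
    return code;
}

// Hypothetical helper: decodes a whole octet string, dropping U+FEFF
// (Footnote2) and surrogate code points (Footnote1).
std::vector<unsigned int> decode_utf8_string(const std::string &in)
{
    std::vector<unsigned int> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        unsigned int code = decode_one_utf8(in, pos);
        bool surrogate = (code >= 0xD800 && code <= 0xDFFF);
        if (code != 0xFEFF && !surrogate) out.push_back(code);
    }
    return out;
}
```

With this sketch, a leading EF BB BF simply disappears from the result, an encoded surrogate is dropped, and a stray octet such as FF shows up as a single '?'.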