|
|
| Guide To Unicode Programming |
|
bamccaig
Member #7,536
July 2006
|
I'm having trouble finding an up-to-date guide to Unicode programming. From what I've read, UTF-8 seems like the best overall format to use, but I'm open to suggestions. I'm basically interested in learning how to develop applications that fully support Unicode standards. If anybody has advice, I'd appreciate it. Also, tutorials, APIs, further reading, etc., are welcomed. I'm most interested in Unicode programming in C and/or C++, but feel free to provide resources for other languages as well. I'd prefer guides that make use of the standard library rather than using third party libraries, though guides making use of them are also welcome. I'd also prefer cross-platform guides, if applicable. UNIX/Linux guides are preferred over Windows-only guides, but of course Windows-only guides are also welcome. -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
|
Peter Wang
Member #23
April 2000
|
You could do worse than starting with ICU (http://icu-project.org/). The user guide has some good documentation about Unicode in general, as I recall. You weren't too specific about what you wanted to do. Personally, I haven't had to get too deep into Unicode (thankfully) but I'm usually happy with just being able to pass UTF-8 strings unmangled through whatever function. Any characters that I need to interpret (e.g. '/') are usually 7-bit ASCII. Of course, everyone should get it through their thick skulls the difference between characters, bytes and codepoints, at the least.
|
|
bamccaig
Member #7,536
July 2006
|
Essentially, I'm trying to understand how to write a program that correctly handles all Unicode characters... In the end, I'm hoping to write a custom mail server/client that enforces the use of Unicode for mail messages (and it's looking like I'll want to enforce the use of UTF-8 specifically). I need to know how to handle locale settings and how to convert input into UTF-8, regardless of what locale the user is using, and then convert back to their locale for output... I'm hoping there are guides to Unicode programming out there somewhere, similar to how there are guides to network programming and threaded programming, etc. At the very least, the European and Asian members must have some experience dealing with Unicode... |
|
Matthew Leverton
Supreme Loser
January 1999
|
All of this has been done before. Are you trying to reinvent the wheel on purpose? But UTF-8 is simple to parse. I think even the Wikipedia article is sufficient, if I recall correctly. If the first bit is 0, it's a regular ASCII character from 0 to 127. But if it is set, then you have a character that spans at least two bytes. Then you check the next byte (in a similar, but different way) to see if the character extends to a third byte, etc. You add up the bits in the proper way to get the final character. You can try to detect if it's UTF-8 by looking for invalid byte sequences. That won't guarantee that it is UTF-8, but it should work well enough for distinguishing between ASCII and UTF-8. |
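The detection idea above (look for invalid byte sequences) can be sketched in C roughly as follows. This is only an illustration: the name `looks_like_utf8` is made up for this note, and the rules it applies (no overlong forms, no surrogates, nothing above U+10FFFF) follow the current four-byte definition of UTF-8. As the post says, passing the check does not prove the data is UTF-8; it only means nothing in it rules UTF-8 out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are well-formed UTF-8,
   0 otherwise. Rejects stray continuation bytes, truncated sequences,
   overlong forms, surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        size_t extra, k;        /* continuation bytes expected after b */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) {                       /* 0xxxxxxx: plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {      /* 110xxxxx */
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {      /* 1110xxxx */
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {      /* 11110xxx */
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;   /* continuation byte used as lead, or 0xF8..0xFF */
        }

        if (n - i <= extra)
            return 0;                         /* sequence is truncated */
        for (k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;                     /* not a 10xxxxxx byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                         /* overlong or out of range */
        i += extra + 1;
    }
    return 1;
}
```

Pure ASCII input always passes, which matches the point that ASCII data can simply be treated as UTF-8. A lone Latin-1 byte such as 0xE9 fails, because it looks like a lead byte with no continuation bytes after it.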
|
bamccaig
Member #7,536
July 2006
|
Matthew Leverton said: All of this has been done before. Are you trying to reinvent the wheel on purpose?
I'm not trying to reinvent the wheel at all... I'm trying to find the wheel and learn how to use it.
Matthew Leverton said: But UTF-8 is simple to parse. I think even the Wikipedia article is sufficient, if I recall correctly.
Already been there.
Matthew Leverton said: If the first bit is 0, it's a regular ASCII character from 0 to 127. But if it is set, then you have a character that spans at least two bytes. Then you check the next byte (in a similar, but different way) to see if the character extends to a third byte, etc. You add up the bits in the proper way to get the final character.
IIRC, the first bits of the first byte tell you how many bytes are in the character.
0xxxxxxx # Single-byte character "sequence"/identical to ASCII.
110xxxxx # Two-byte character sequence.
1110xxxx # Three-byte character sequence.
11110xxx # Four-byte character sequence.
111110xx # Five-byte character sequence.
1111110x # Six-byte character sequence.
Each continuation byte will match 10xxxxxx. The actual character code value is the resulting x bits streamed together. I'd rather not have to implement this low-level stuff myself. There's little chance I'll get it 100% right and zero chance it will perform well.
Matthew Leverton said: You can try to detect if it's UTF-8 by looking for invalid byte sequences. That won't guarantee that it is UTF-8, but it should work well enough for distinguishing between ASCII and UTF-8.
Well ASCII really isn't a problem... AFAIK, I can just consider ASCII data as being UTF-8 data and it should work fine. I'm more or less interested in writing software that will work on my machine and also work properly on a Russian or Japanese machine, for example. A program's input data isn't necessarily going to match the internally used encoding (UTF-8). Data input from the operating system (stdin) should be in the locale specific encoding, I think. File data and network data, however, can be in any encoding... In order to display correctly, output data should be in the locale specific encoding, I think. I'm trying to figure out how to interface the internal encoding with the external encoding(s).
** EDIT ** I'm hoping this will help a lot: Using Unicode In C/C++. |
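For reference, the lead-byte table in this post translates into a fairly small decoder. The sketch below is an illustration only; `utf8_decode_one` is a name invented for this note, not a standard library function. It handles the one- to four-byte forms and treats the five- and six-byte forms as errors, since RFC 3629 restricted UTF-8 to four bytes (code points up to U+10FFFF). It does not reject overlong encodings, so a production decoder would need extra checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from s (n bytes available)
   into *cp. Returns the number of bytes consumed (1..4), or 0 if the
   sequence is malformed or truncated. Overlong forms are NOT rejected. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, k;
    uint32_t v;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* The lead byte gives the sequence length and the first payload bits. */
    if (s[0] < 0x80)                { len = 1; v = s[0]; }          /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }   /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }   /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }   /* 11110xxx */
    else return 0;              /* continuation byte or obsolete 5/6-byte lead */

    if (n < len)
        return 0;

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (k = 1; k < len; k++) {
        if ((s[k] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (uint32_t)(s[k] & 0x3F);
    }
    *cp = v;
    return len;
}
```

Calling it in a loop and advancing by the return value walks a buffer one code point at a time. For example, the bytes E2 82 AC decode to U+20AC (the euro sign) and consume three bytes.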
|
axilmar
Member #1,204
April 2001
|
Quote: Essentially, I'm trying to understand how to write a program that correctly handles all Unicode characters You have to know which values are plain characters and which values are special prefixes. The "geniuses" that invented Unicode formats decided that characters of a unicode string should not be indexed at exact multiples of a byte (16, 32 etc), and thus you can't do 'unicode_array[character_index]'. Just another stupidity in the line of stupidities for computers... |
|
Thomas Harte
Member #33
April 2000
|
Quote: The "geniuses" that invented Unicode formats decided that characters of a unicode string should not be indexed at exact multiples of a byte (16, 32 etc), and thus you can't do 'unicode_array[character_index]'. Just another stupidity in the line of stupidities for computers... Unicode includes UTF-8, UTF-16 and UTF-32 encodings. UTF-32 uses a fixed-width encoding and is the type used by most UNIXes for wchar_t. UTF-16 is variable width like UTF-8, but obviously easier to handle since there are only two possible character lengths. UTF-16 is used by Windows for wchar_t. I think the best advice for dealing with unicode is to use wchar_t instead of char, and check out the functions in <wchar.h> and <wctype.h>. [My site] [Tetrominoes] |
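To illustrate the "only two possible character lengths" point about UTF-16: a code point is stored either as one 16-bit unit or as a surrogate pair (a unit in 0xD800..0xDBFF followed by one in 0xDC00..0xDFFF). A minimal sketch, assuming the units are already in host byte order; the function name `utf16_decode_one` is invented for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from n UTF-16 code units at s.
   Returns the number of units consumed (1 or 2), or 0 for an unpaired
   surrogate or empty input. */
size_t utf16_decode_one(const uint16_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* Anything outside the surrogate range is a complete code point. */
    if (s[0] < 0xD800 || s[0] > 0xDFFF) {
        *cp = s[0];
        return 1;
    }

    /* A high surrogate must be followed by a low surrogate. */
    if (s[0] <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        uint32_t hi = (uint32_t)(s[0] - 0xD800);   /* top 10 bits    */
        uint32_t lo = (uint32_t)(s[1] - 0xDC00);   /* bottom 10 bits */
        *cp = 0x10000 + ((hi << 10) | lo);
        return 2;
    }
    return 0;   /* lone low surrogate, or high surrogate without a partner */
}
```

So even with 16-bit units, `array[character_index]` only works for text that stays inside the first 65,536 code points; U+1F600, for example, takes the two units D83D DE00.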
|
axilmar
Member #1,204
April 2001
|
Personally I prefer UCS2: each character is 16-bit, and there are no surrogate pairs. It's silly to spend all the memory in 32-bit characters, just because there are hundreds of Chinese dialects. |
|
Thomas Harte
Member #33
April 2000
|
Personally, I'm not that bothered while I can hide behind the wchar functions. I think you and Microsoft are right though, UTF-16/UCS2 is probably the compromise I would pick for writing brand new software; though I'll wager that UTF-8 is more efficient for network transmission. |
|
bamccaig
Member #7,536
July 2006
|
Yeah, I think I'm planning to go with UTF-8. I've read that it's more UNIX friendly, and it was apparently even invented by Ken Thompson, one of the original developers of UNIX! Everything I read seems to suggest that it is most prominent in UNIX and Linux. I just need to figure out how to work with it now... Another clarification that really helped is that UCS and Unicode basically just map characters to integers. They don't describe how to store them in memory. UCS-2, UCS-4, UTF-8, UTF-16, and UTF-32 apparently describe how those characters are stored as bytes. I think. It's all very confusing. In any case, I'm really confused as to what's what and what to use... |
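The split described here (Unicode assigns each character an integer; an encoding such as UTF-8 decides how that integer becomes bytes) is perhaps easiest to see from the encoding direction. Below is a minimal sketch of a UTF-8 encoder; `utf8_encode` is a name made up for this note, and the limits used (at most four bytes, no surrogates, nothing above U+10FFFF) follow the current definition of UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 form of code point cp into out
   (room for 4 bytes). Returns the number of bytes written, or 0 if cp is
   a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three continuations */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The same integer, U+20AC, is the three bytes E2 82 AC in UTF-8, the single 16-bit unit 0x20AC in UTF-16, and the single 32-bit unit 0x000020AC in UTF-32.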
|
Thomas Harte
Member #33
April 2000
|
I don't know the origin of most of these things, but I do know that UTF-x are now defined in the Unicode spec, and that is currently the definitive authority for their interpretation. But yeah, Unicode is a registry of symbols with code numbers. UTF-x are schemes for mapping from collections of bytes to those code numbers. UTF-8 is designed so that the first 128 characters are mostly compatible with ASCII. UTF-16 is based on thoughts close to the original design that no more than 65536 symbols would be discovered; since they're on more than 100,000 now they needed UTF-32. wchar_t is meant to be an integer type that is wide enough to hold any symbol that your OS understands. So strictly speaking Microsoft have implemented it incorrectly. The various functions in wchar.h/etc duplicate all the functions of string.h and stdio.h but for your OS's wchar_t type. So you have wide versions of printf, getc, etc. That means you can mostly treat wide strings as an opaque type and not worry about the size of each element and/or whether each element contains a whole code on its own (which it should per the language spec, but I guess doesn't in Windows). The terminating null character is still always 0. I have to admit to not currently knowing exactly whether sizeof(wchar_t) == 1 means things in memory are definitely UTF-8, etc, or whether there are standard functions for converting from UTF-x to UTF-y for file IO. I've really only gone as far as receiving wchar_t strings from the OS, processing them as I need and using them correctly. |
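On the conversion question: standard C does have functions for going between the current locale's multibyte encoding and wchar_t, such as `mbstowcs` and `wcstombs` in <stdlib.h>. They convert to and from whatever the locale uses, not between named UTF-x encodings, and the encoding of wchar_t itself is implementation-defined. A minimal sketch of wrapping `mbstowcs`; the wrapper name `to_wide` is invented for this note.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: converts the NUL-terminated string src, encoded in
   the current locale's multibyte encoding, into dst (room for cap wide
   characters including the terminator). Returns the number of wide
   characters written, not counting the terminator, or (size_t)-1 if the
   input is invalid for the locale or does not fit. */
size_t to_wide(const char *src, wchar_t *dst, size_t cap)
{
    size_t n;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;

    n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1)
        return n;               /* byte sequence not valid in this locale */
    if (n == cap)
        return (size_t)-1;      /* filled the buffer: no room for the terminator */
    return n;
}
```

A program normally calls `setlocale(LC_ALL, "")` once at startup so that these functions follow the user's environment; without that call it runs in the "C" locale, where only plain ASCII is certain to convert. The result can then be passed to the wide functions mentioned above, such as `wcslen`.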
|
Tobias Dammers
Member #2,604
August 2002
|
I prefer just using a language that has Unicode support built-in. --- |
|
Thomas Harte
Member #33
April 2000
|
Quote: I prefer just using a language that has Unicode support built-in. If having a datatype for storing individual codes, using an array of them for strings and having a whole bunch of strlen/printf-type functions for manipulating the arrays doesn't count as having support for that encoding built in then C doesn't have ASCII support built in either. |
|
Tobias Dammers
Member #2,604
August 2002
|
Well, no, C doesn't really have any string support built into the language at all, except for string literals (which are just syntactic sugar around plain byte arrays). |
|
bamccaig
Member #7,536
July 2006
|
Tobias, I think you need to stop beating around the bush and say what you're trying to say. |
|
Thomas Harte
Member #33
April 2000
|
Quote: Well, no, C doesn't really have any string support built into the language at all, except for string literals (which are just syntactic sugar around plain byte arrays).
I'm not disagreeing (you know, when comparing C to languages where stuff like "stringa + stringb" does concatenation), but I have suddenly remembered to add that you declare static wide strings like this:
char String[] = "Stringula";
wchar_t wString_t[] = L"Stringula";
Just one more potentially helpful tip. |
|
Simon Parzer
Member #3,330
March 2003
|
In C++ you can use wstring instead of string. |
|
Tobias Dammers
Member #2,604
August 2002
|
Quote: What language are you thinking of and what support does it have built-in that C lacks?
Both Java and C# have unicode built into the language, on all levels - you can use unicode characters directly in string literals, and the built-in string types behave just like you would expect strings to behave. In these languages, one can basically work with unicode strings without knowing that such a thing as unicode exists. |
|
bamccaig
Member #7,536
July 2006
|
Tobias Dammers said: Both Java and C# have unicode built into the language, on all levels - you can use unicode characters directly in string literals, and the built-in string types behave just like you would expect strings to behave. In these languages, one can basically work with unicode strings without knowing that such a thing as unicode exists.
This I did not know... ** EDIT ** According to Wikipedia at the time of writing, D's Unicode support is incomplete. |
|
Tobias Dammers
Member #2,604
August 2002
|
Yes and no: a .NET string is by definition a unicode one, but most of the various text writers .NET has to offer can be set to any encoding you like. Using unicode internally is not a bad thing at all, since everything that can be expressed in any one 8-bit encoding can be expressed unambiguously in unicode, and from there, it can be converted to any other encoding (save unsupported characters in the target encoding). Unicode really is the only choice for an in-between format that makes any sense currently. |
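As a small illustration of the "8-bit encoding in, Unicode in between" idea in C: ISO-8859-1 (Latin-1) is the simplest case, because each byte value is already the Unicode code point of the character it stands for, so converting to UTF-8 needs no lookup table. The sketch below assumes Latin-1 input specifically; other 8-bit encodings would need a mapping table per encoding, and the function name `latin1_to_utf8` is invented for this note.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts n ISO-8859-1 bytes at in to UTF-8 in out.
   out must have room for 2 * n bytes (each input byte becomes at most two).
   Returns the number of bytes written. No terminator is added. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, o = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            /* ASCII range: identical in Latin-1 and UTF-8. */
            out[o++] = in[i];
        } else {
            /* U+0080..U+00FF: two-byte form 110xxxxx 10xxxxxx. */
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}
```

For example, the four Latin-1 bytes of "café" (63 61 66 E9) become the five UTF-8 bytes 63 61 66 C3 A9. Going the other way is where the caveat about unsupported characters applies: any code point above U+00FF has no Latin-1 byte.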
|
|