I'm having trouble finding an up-to-date guide to Unicode programming. From what I've read, UTF-8 seems like the best overall format to use, but I'm open to suggestions. I'm basically interested in learning how to develop applications that fully support Unicode standards. If anybody has advice, I'd appreciate it. Also, tutorials, APIs, further reading, etc., are welcomed. I'm most interested in Unicode programming in C and/or C++, but feel free to provide resources for other languages as well. I'd prefer guides that make use of the standard library rather than using third party libraries, though guides making use of them are also welcome. I'd also prefer cross-platform guides, if applicable. UNIX/Linux guides are preferred over Windows-only guides, but of course Windows-only guides are also welcome.
You could do worse than starting with ICU (http://icu-project.org/). The user guide has some good documentation about Unicode in general, as I recall.
You weren't too specific about what you wanted to do. Personally, I haven't had to get too deep into Unicode (thankfully) but I'm usually happy with just being able to pass UTF-8 strings unmangled through whatever function. Any characters that I need to interpret (e.g. '/') are usually 7-bit ASCII. Of course, everyone should get the difference between characters, bytes and code points through their thick skulls, at the least.
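To illustrate why passing UTF-8 through untouched usually works: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as '/' can never appear in the middle of another character. Here is a minimal sketch, assuming the path is valid UTF-8; the function name last_component is made up for this example.

```c
#include <string.h>

/* Returns a pointer to the last component of a UTF-8 encoded path.
 * Plain strrchr is safe here: the byte 0x2F ('/') never occurs inside
 * a multi-byte UTF-8 sequence, because all of those bytes are >= 0x80.
 * (last_component is a hypothetical helper, not a standard function.) */
const char *last_component(const char *utf8_path)
{
    const char *slash = strrchr(utf8_path, '/');
    return slash ? slash + 1 : utf8_path;
}
```

The same reasoning does not hold for encodings such as Shift_JIS or UTF-16, where an ASCII-looking byte can be part of a longer character.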
Essentially, I'm trying to understand how to write a program that correctly handles all Unicode characters... In the end, I'm hoping to write a custom mail server/client that enforces the use of Unicode for mail messages (and it's looking like I'll want to enforce the use of UTF-8 specifically).
I need to know how to handle locale settings and how to convert input into UTF-8, regardless of what locale the user is using, and then convert back to their locale for output...
I'm also interested in how to identify encoding of a text file. Is it acceptable to just insist that the user tells you what the file is encoded as (when quite often they don't even know), or is there a reliable way to guess, etc.?
I'm basically looking for best practices in dealing with Unicode and character encodings in general, and trying to figure out what needs to change in a program to make it Unicode aware... 
I'm hoping there are guides to Unicode programming out there somewhere, similar to how there are guides to network programming and threaded programming, etc. At the very least, the European and Asian members must have some experience dealing with Unicode...
All of this has been done before. Are you trying to reinvent the wheel on purpose? But UTF-8 is simple to parse. I think even the Wikipedia article is sufficient, if I recall correctly.
If the first bit is 0, it's a regular ASCII character from 0 to 127. But if it is set, then you have a character that spans at least two bytes. Then you check the next byte (in a similar, but different way) to see if the character extends to a third byte, etc. You add up the bits in the proper way to get the final character.
You can try to detect if it's UTF-8 by looking for invalid byte sequences. That won't guarantee that it is UTF-8, but it should work well enough for distinguishing between ASCII and UTF-8.
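A rough sketch of that detection idea in C, in case it helps. It only answers "could this buffer be UTF-8?"; it cannot tell you what the encoding is if the answer is no. The name utf8_is_valid is an assumption for this example, and the checks follow the current four-byte definition of UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n bytes at s form well-formed UTF-8.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * encodings, surrogates (U+D800..U+DFFF) and values above U+10FFFF.
 * (utf8_is_valid is a hypothetical helper name.) */
bool utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long cp, min;

        if (b < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;            /* continuation byte or invalid lead */

        if (n - i < len) return false;                   /* truncated */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false; /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;             /* overlong, out of range, surrogate */
        i += len;
    }
    return true;
}
```

Text in a legacy 8-bit encoding with any accented characters will almost always fail this test, which is what makes the heuristic useful in practice.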
All of this has been done before. Are you trying to reinvent the wheel on purpose?
I'm not trying to reinvent the wheel at all... I'm trying to find the wheel and learn how to use it.
I'm looking for Unicode APIs, if you will, not the technical low-level details of Unicode. I'm hoping there are APIs built into the standard libraries or well known third party APIs (the kind of thing that would be installed on most UNIX-like systems, for example)... The tutorials I've been able to find are 5+ years old and making use of their own APIs. I'd rather use well known APIs and best practices that are up-to-date with today...
But UTF-8 is simple to parse. I think even the Wikipedia article is sufficient, if I recall correctly.
Already been there.
If the first bit is 0, it's a regular ASCII character from 0 to 127. But if it is set, then you have a character that spans at least two bytes. Then you check the next byte (in a similar, but different way) to see if the character extends to a third byte, etc. You add up the bits in the proper way to get the final character.
IIRC, the first bits of the first byte tell you how many bytes are in the character.
0xxxxxxx # Single-byte character "sequence"/identical to ASCII.
110xxxxx # Two-byte character sequence.
1110xxxx # Three-byte character sequence.
11110xxx # Four-byte character sequence.
111110xx # Five-byte character sequence (original design only; no longer valid since RFC 3629 capped UTF-8 at four bytes).
1111110x # Six-byte character sequence (original design only; no longer valid).
Each continuation byte will match 10xxxxxx. The actual character code value is the resulting x bits streamed together. I'd rather not have to implement this low-level stuff myself. There's little chance I'll get it 100% right and zero chance it will perform well.
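For what it's worth, the bit-streaming part is short. Below is a sketch of decoding one code point, assuming the modern limit of four bytes per sequence (RFC 3629). The names utf8_seq_len and utf8_decode are invented for this example, and it deliberately skips the overlong and surrogate checks a production decoder needs, which is a fair argument for using a library instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the sequence introduced by lead byte b, or 0 if b cannot
 * start a sequence (continuation byte or obsolete 5/6-byte lead). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Decodes one code point from the n bytes at s into *cp.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Sketch only: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Mask selecting the payload ("x") bits of the lead byte. */
    static const unsigned char lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };

    if (n == 0) return 0;
    size_t len = utf8_seq_len(s[0]);
    if (len == 0 || len > n) return 0;

    uint32_t value = s[0] & lead_mask[len];
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must match 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3Fu); /* append six payload bits */
    }
    *cp = value;
    return len;
}
```

For example, the three bytes E2 82 AC decode to U+20AC, the euro sign.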
You can try to detect if it's UTF-8 by looking for invalid byte sequences. That won't guarantee that it is UTF-8, but it should work well enough for distinguishing between ASCII and UTF-8.
Well ASCII really isn't a problem... AFAIK, I can just consider ASCII data as being UTF-8 data and it should work fine. I'm more or less interested in writing software that will work on my machine and also work properly on a Russian or Japanese machine, for example.
A program's input data isn't necessarily going to match the internally used encoding (UTF-8). Data input from the operating system (stdin) should be in the locale specific encoding, I think. File data and network data, however, can be in any encoding... In order to display correctly, output data should be in the locale specific encoding, I think. I'm trying to figure out how to interface the internal encoding with the external encoding(s).
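On POSIX systems, one commonly available way to bridge the external and internal encodings is iconv. A hedged sketch follows: convert_encoding is a made-up wrapper name, it converts a whole buffer in one call, and it simply fails if the output buffer is too small or the input contains a sequence that is invalid in the source encoding.

```c
#include <iconv.h>
#include <stddef.h>

/* Converts inlen bytes at `in` from encoding `from` to encoding `to`.
 * Writes at most outcap bytes to `out` (no terminator is added).
 * Returns the number of bytes written, or (size_t)-1 on any error.
 * (convert_encoding is a hypothetical wrapper around POSIX iconv.) */
size_t convert_encoding(const char *from, const char *to,
                        const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);      /* note: target comes first */
    if (cd == (iconv_t)-1) return (size_t)-1;

    char *inp = (char *)in;                 /* iconv does not modify input */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft); /* flush shift state */
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

To convert from the user's locale, one approach is to call setlocale(LC_ALL, "") at startup and pass nl_langinfo(CODESET) from <langinfo.h> as the `from` argument, with "UTF-8" as `to`; swap the two for output.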
** EDIT **
I'm hoping this will help a lot: Using Unicode In C/C++.
Essentially, I'm trying to understand how to write a program that correctly handles all Unicode characters
You have to know which values are plain characters and which values are special prefixes.
The "geniuses" that invented Unicode formats decided that characters of a unicode string should not be indexed at exact multiples of a byte (16, 32 etc), and thus you can't do 'unicode_array[character_index]'. Just another stupidity in the line of stupidities for computers...
Unicode includes UTF-8, UTF-16 and UTF-32 encodings. UTF-32 uses a fixed-width encoding and is the type used by most UNIXes for wchar_t. UTF-16 is variable width like UTF-8, but obviously easier to handle since there are only two possible character lengths. UTF-16 is used by Windows for wchar_t.
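To make the indexing complaint concrete for UTF-8: you can still reach the n-th code point, it just takes a scan instead of an array subscript. A small sketch, assuming the input is valid, NUL-terminated UTF-8; utf8_count and utf8_at are names invented here.

```c
#include <stddef.h>

/* Counts code points by counting every byte that is not a
 * continuation byte (10xxxxxx). Assumes valid UTF-8 input. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Returns a pointer to the first byte of the index-th code point
 * (0-based), or NULL if the string has fewer code points. O(n). */
const char *utf8_at(const char *s, size_t index)
{
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (index == 0) return s;
            index--;
        }
    }
    return NULL;
}
```

Note that a code point is still not necessarily what a user would call a character (combining accents are separate code points), so even UTF-32 indexing does not fully solve that problem.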
I think the best advice for dealing with unicode is to use wchar_t instead of char, and check out the functions in <wchar.h> and <wctype.h>.
Personally I prefer UCS2: each character is 16-bit, and there are no surrogate pairs. It's silly to spend all the memory in 32-bit characters, just because there are hundreds of Chinese dialects.
Personally, I'm not that bothered while I can hide behind the wchar functions. I think you and Microsoft are right though, UTF-16/UCS2 is probably the compromise I would pick for writing brand new software; though I'll wager that UTF-8 is more efficient for network transmission.
Yeah, I think I'm planning to go with UTF-8. I've read that it's more UNIX friendly, and it was apparently even invented by Ken Thompson, one of the original developers of UNIX! Everything I read seems to suggest that it is most prominent in UNIX and Linux.
I just need to figure out how to work with it now...
Apparently UTF-8, UTF-16, and UTF-32 are actually not Unicode. They are UCS Transformation Formats (don't hold me to that because Wikipedia is saying the U stands for Unicode), and UCS just happens to be compatible with Unicode (because ISO and the Unicode Consortium realized that developing conflicting standards was counter-productive and instead began developing in synchronization with each other). As it goes, UCS is 100% compatible with Unicode, but Unicode is not necessarily compatible with UCS (UCS just defines characters, whereas Unicode defines further rules for the characters; ISO has separate standards for this apparently).
Another clarification that really helped is that UCS and Unicode basically just map characters to integers. They don't describe how to store them in memory. UCS-2, UCS-4, UTF-8, UTF-16, and UTF-32 apparently describe how those characters are stored as bytes. I think. It's all very confusing.
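A short sketch of that split between "character as integer" and "character as bytes": the function below takes a code point (the integer) and produces its UTF-8 storage form. The name utf8_encode is an assumption for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-8 into out (up to 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not encodable
 * (a surrogate, or above U+10FFFF). utf8_encode is a hypothetical name. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                    /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx */
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));   /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The same integer, U+20AC, is the three bytes E2 82 AC in UTF-8, the single 16-bit unit 0x20AC in UTF-16, and the single 32-bit unit 0x000020AC in UTF-32.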
In any case, I'm really confused as to what's what and what to use...
So I'm just sticking with this mysterious thing called "UTF-8" which seems to be the most recommended format.
I don't know the origin of most of these things, but I do know that UTF-x are now defined in the Unicode spec, and that is currently the definitive authority for their interpretation.
But yeah, Unicode is a registry of symbols with code numbers. UTF-x are schemes for mapping from collections of bytes to those code numbers. UTF-8 is designed so that the first 128 characters are byte-for-byte identical to ASCII. UTF-16 grew out of the original design assumption that 65,536 code values would be enough; now that there are more than 100,000 assigned characters, it needs surrogate pairs for everything beyond the first 65,536, and UTF-32 exists as the fixed-width alternative.
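Here is a sketch of how UTF-16 reaches past the first 65,536 code points using a pair of 16-bit units (a surrogate pair). The function name utf16_encode is an assumption for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-16 into out (1 or 2 units).
 * Returns the number of 16-bit units written, or 0 if cp is a
 * surrogate value or above U+10FFFF. (Hypothetical helper name.) */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* reserved for pairs */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* fits in one unit */
        return 1;
    }
    if (cp > 0x10FFFF) return 0;

    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

This is also why U+10FFFF is the upper limit of Unicode: it is the largest value a surrogate pair can express.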
wchar_t is meant to be an integer type that is wide enough to hold any symbol that your OS understands. So strictly speaking Microsoft have implemented it incorrectly. The various functions in wchar.h/etc duplicate all the functions of string.h and stdio.h but for your OS's wchar_t type. So you have wide versions of printf, getc, etc. That means you can mostly treat wide strings as an opaque type and not worry about the size of each element and/or whether each element contains a whole code on its own (which it should per the language spec, but I guess doesn't in Windows).
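A tiny sketch of what "wide versions of printf, etc." looks like in practice, assuming a C99-conforming <wchar.h>. The function describe is invented for this example; swprintf and wcslen are the standard wide counterparts of snprintf and strlen.

```c
#include <stddef.h>
#include <wchar.h>

/* Formats "<name>: <length>" into a wide buffer of cap elements.
 * Returns the number of wide characters written, or a negative value
 * on error. (describe is a hypothetical helper.) */
int describe(wchar_t *buf, size_t cap, const wchar_t *name)
{
    /* wcslen counts wchar_t elements, not bytes and not necessarily
     * characters: on Windows, one code point outside the first 65,536
     * occupies two elements. */
    return swprintf(buf, cap, L"%ls: %zu", name, wcslen(name));
}
```

Printing wide strings to a terminal or file (wprintf, fputws) additionally depends on the locale, so a call to setlocale(LC_ALL, "") early in main is usually needed before non-ASCII output works.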
The terminating null character is still always 0 (L'\0' for wide strings).
I have to admit to not currently knowing exactly whether sizeof(wchar_t) == 1 means things in memory are definitely UTF-8, etc, or whether there are standard functions for converting from UTF-x to UTF-y for file IO. I've really only gone as far as receiving wchar_t strings from the OS, processing them as I need and using them correctly.
I prefer just using a language that has Unicode support built-in.
If having a datatype for storing individual codes, using an array of them for strings and having a whole bunch of strlen/printf-type functions for manipulating the arrays doesn't count as having support for that encoding built in then C doesn't have ASCII support built in either.
Well, no, C doesn't really have any string support built into the language at all, except for string literals (which are just syntactic sugar around plain byte arrays).
Tobias, I think you need to stop beating around the bush and say what you're trying to say.
What language are you thinking of and what support does it have built-in that C lacks?
Well, no, C doesn't really have any string support built into the language at all, except for string literals (which are just syntactic sugar around plain byte arrays).
I'm not disagreeing (you know, when comparing C to languages where stuff like "stringa + stringb" does concatenation), but I have suddenly remembered to add that you declare static wide strings like this:
char String[] = "Stringula";
wchar_t wString_t[] = L"Stringula";
Just one more potentially helpful tip.
In C++ you can use wstring instead of string.
I would only use unicode if I really need it, though, because I would definitely run into some problems. I had a project where we used "wide" strings for everything, just as a matter of principle. It wasn't fun.
What language are you thinking of and what support does it have built-in that C lacks?
Both Java and C# have Unicode built into the language, on all levels - you can use Unicode characters directly in string literals, and the built-in string types behave just like you would expect strings to behave. In these languages, one can basically work with Unicode strings without knowing that such a thing as Unicode exists.
A string, a pointer and an array are totally different concepts, but C treats them as the same thing.
I'm not trying to diss C here: it is after all the language that made unix, and it's a great tool for a great number of things - just not for string-related things (except when the project, for whatever reasons, rigorously favours runtime performance over readability and maintainability).
Both Java and C# have Unicode built into the language, on all levels - you can use Unicode characters directly in string literals, and the built-in string types behave just like you would expect strings to behave. In these languages, one can basically work with Unicode strings without knowing that such a thing as Unicode exists.
This I did not know...
Actually, .NET's System.String class appears to only represent Unicode strings (essentially insisting that developers use Unicode internally; or go to a lot of trouble to make their lives more difficult). I'm guessing though, that for them to work as expected the source files need to be written in Unicode as well (which Visual Studio probably does for you and ASCII editors probably wouldn't conflict with, assuming UTF-8; however, users using other encodings will probably need to make the conscious effort to set their editors to Unicode when writing C# code; this is just an assumption at this point). While that's great for a new language, you really can't expect it from a 40 year old language.
I see what you're saying though. I'd still like to learn to do it in C/C++ though. Largely because C# support in *nix is limited and Java is evil. There is probably some alternative that I haven't used that would also fill that void, but until I've used it I won't know.
I wonder if D has any Unicode support... Anyway, there may be times where performance does matter and internationalization is also important, so it wouldn't hurt to understand Unicode programming in C/C++. Like I said earlier, I'm hoping to attempt to write a mail server/client system, and performance, while not as important as reliability, security, and maintainability, is still important in a server.
** EDIT **
According to Wikipedia at the time of writing, D's Unicode support is incomplete.
Yes and no: a .NET string is by definition a Unicode one, but most of the various text writers .NET has to offer can be set to any encoding you like. Using Unicode internally is not a bad thing at all, since everything that can be expressed in any one 8-bit encoding can be expressed unambiguously in Unicode, and from there, it can be converted to any other encoding (save unsupported characters in the target encoding). Unicode really is the only choice for an in-between format that makes any sense currently.
I have no idea what C# (or VB for that matter) does when fed 8-bit encoded source files, but I suppose it'll handle them semi-nicely at least (working as expected for ASCII characters, and guessing the encoding for anything above 127).
The rare occasions where using 8-bit strings rather than unicode ones makes any difference in performance that really matters, probably deserve tailor-made code anyway, and C# does support plain byte arrays - you just need to write your own conversions etc. But I suppose the overhead of using the .NET runtime as opposed to running native code would be much larger anyway, and the best solution in such a scenario would be to use C++ or something like that.
In C++, std::wstring is pretty much what you'll be working with, unless you plug in an extra unicode lib.