UTF-8 string routines

These functions are declared in the main Allegro header file:

#include <allegro5/allegro.h>

About UTF-8 string routines

Some parts of the Allegro API, such as the font routines, expect Unicode strings encoded in UTF-8. These basic routines are provided to help you work with UTF-8 strings. You should use another library (e.g. ICU) if you require more functionality.

You should look elsewhere for a full introduction to Unicode. Extremely briefly, Unicode is a standard consisting of a large character set (of over 100,000 characters) and rules, such as how to sort strings. A code point is the integer value of a character, but not all code points are characters, as some code points have other uses. Clearly it is impossible to represent each code point with an 8-bit byte or even a 16-bit integer, so there exist different Unicode Transformation Formats. UTF-8 has many nice properties, but the main advantages are that it is backwards compatible with C strings, and ASCII characters (code points <= 127) are encoded in UTF-8 exactly as they would be in ASCII.

Here is a diagram of the representation of the word "ål", with a NUL terminator.

                   ---------------- ---------------- --------------
           String         å                l              NUL
                   ---------------- ---------------- --------------
      Code points    U+00E5 (229)     U+006C (108)     U+0000 (0)
                   ---------------- ---------------- --------------
   UTF-8 encoding     0xC3, 0xA5          0x6C            0x00
                   ---------------- ---------------- --------------
UTF-16LE encoding     0xE5, 0x00       0x6C, 0x00      0x00, 0x00
                   ---------------- ---------------- --------------

U+00E5 is greater than 127, so it requires two bytes to represent in UTF-8. U+006C and U+0000 both lie in the ASCII range, so they take one byte each, exactly as in an ASCII string. UTF-16 is a different encoding, in which each code point is represented by two or four bytes. In UTF-8 a zero byte is only present when it represents the NUL character, but this is not true for UTF-16.
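The byte values in the diagram can be reproduced with a small encoder. The following is only a minimal sketch in standard C; `utf8_encode` is a hypothetical helper written for illustration and is not part of the Allegro API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an Allegro function): encode one code point
 * as UTF-8. Writes up to 4 bytes to out and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {
        /* ASCII range: one byte, identical to the ASCII value. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        /* Two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid in UTF-8 */
        /* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0xE5, buf)` fills `buf` with 0xC3, 0xA5 and returns 2, while `utf8_encode(0x6C, buf)` stores the single byte 0x6C, matching the UTF-8 row of the diagram.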

In the Allegro API, be careful whether a function takes byte offsets or code point indices. In general, all position parameters are in byte offsets, not code point indices. This may be surprising, but if you think about it, it is required for good performance. It also means many functions will work even if the strings do not contain UTF-8, so you may actually store arbitrary data in the strings.

For actual text processing, where you want to specify positions with code point indices, you should use al_ustr_offset to find the byte position of a code point.
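To make the difference between code point indices and byte offsets concrete, here is a rough sketch in standard C of the kind of scan such a lookup involves. `utf8_offset` is a hypothetical helper, not an Allegro function. It assumes a valid, NUL-terminated UTF-8 string, and its behaviour for out-of-range indices is a choice made for this sketch that may differ from al_ustr_offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an Allegro function): return the byte offset
 * of the code point with 0-based index `index` in the valid,
 * NUL-terminated UTF-8 string s. If index is past the end, the byte
 * length of the string is returned (a choice made for this sketch). */
size_t utf8_offset(const char *s, size_t index)
{
    size_t pos = 0;

    assert(s != NULL);
    while (s[pos] != '\0' && index > 0) {
        /* Step over the lead byte of the current code point... */
        pos++;
        /* ...and over any continuation bytes (bit pattern 10xxxxxx). */
        while (((unsigned char)s[pos] & 0xC0) == 0x80)
            pos++;
        index--;
    }
    return pos;
}
```

For the string "ål" from the diagram, written in C as "\xC3\xA5l", code point 1 (the "l") is found at byte offset 2, because "å" occupies two bytes. The linear scan also shows why a position given as a code point index is more expensive to resolve than a byte offset.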

UTF-8 string types

Creating and destroying strings

Predefined strings

Creating strings by referencing other data

Sizes and offsets

Getting code points

Inserting into strings

Appending to strings

Removing parts of strings

Assigning one string to another

Replacing parts of string

Searching

Comparing

UTF-16 conversion

Low-level UTF-8 routines

Low-level UTF-16 routines