by Markus Kuhn
This text is a very comprehensive one-stop information resource on how you can use Unicode/UTF-8 on POSIX systems (Linux, Unix). You will find here both introductory information for every user as well as detailed references for the experienced developer.
Unicode is well on the way to replacing ASCII and Latin-1 at all levels within a few years. It not only allows you to handle text in practically any script and language used on this planet, it also provides you with a comprehensive set of mathematical and technical symbols that will simplify scientific information exchange.
The UTF-8 encoding allows Unicode to be used in a convenient and backwards compatible way in environments that, like Unix, were designed entirely around ASCII. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. It is now time to make sure that you are well familiar with it and that your software supports UTF-8 smoothly.
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. If you convert any text string to UCS and then back to the original encoding, then no information will be lost.
UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only Cuneiform, Hieroglyphs and various Indo-European languages, but even some selected artistic scripts such as Tolkien's Tengwar and Cirth. UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems, and more are being added.
ISO 10646 formally defines a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP all belong to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part ISO 10646-2, which defines characters encoded outside the BMP, is under preparation, but it might take a few years until it is finished. New characters are still being added to the BMP on a continuous basis, but the existing characters will not be changed any more and are stable.
UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+" as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use.
The full name of the UCS standard is
International Standard ISO/IEC 10646-1, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Second edition, International Organization for Standardization, Geneva, 2000-09-15.
It can be ordered online from ISO as a set of PDF files on CD-ROM for 80 CHF (~53 EUR, ~45 USD, ~32 GBP).
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. Accented characters that have their own code position, but could also be represented as a pair of another character followed by a combining character, are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings such as ISO 8859 that had no combining characters. The combining character mechanism makes it possible to add accents and other diacritical marks to any character, which is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.
Combining characters follow the character which they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. For example with the Thai script, up to two combining characters are needed on a single base character.
Not all systems are expected to support all the advanced mechanisms of UCS such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:
Yes, a number of countries have published national adoptions of ISO 10646-1:1993, sometimes after adding additional annexes with cross-references to older national standards and specifications of various national implementation subsets:
Historically, there have been two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. Unicode 1.1 corresponded to ISO 10646-1:1993 and Unicode 3.0 corresponds to ISO 10646-1:2000. All Unicode versions since 2.0 are compatible, only new characters will be added, no existing characters will be removed or renamed in the future.
The Unicode Standard can be ordered like any normal book, for instance via amazon.com for around 50 USD:
The Unicode Consortium: The Unicode Standard, Version 3.0,
Reading, MA, Addison-Wesley Developers Press, 2000,
ISBN 0-201-61633-5.
If you work frequently with text processing and character sets, you definitely should get a copy. It is also available online now.
The Unicode Standard published by the Unicode Consortium contains exactly the ISO 10646-1 Basic Multilingual Plane at implementation level 3. All characters are at the same positions and have the same names in both standards.
The Unicode Standard defines in addition much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-directional texts that mix for instance Latin and Hebrew, algorithms for sorting and string comparison, and much more.
The ISO 10646 standard on the other hand is not much more than a simple character set table, comparable to the well-known ISO 8859 standard. It specifies some terminology related to the standard, defines some encoding alternatives, and it contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the Unicode standard shows the CJK ideographs only in a Chinese variant.
UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes per character. The official terms for these encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first in these (Bigendian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes before every ASCII byte instead.
Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like '\0' or '/' which have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expects ASCII files and can't read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.
The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and also described in RFC 2279 as well as section 3.8 of the Unicode 3.0 standard does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.
UTF-8 has the following properties:
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
| U-00000000 - U-0000007F: | 0xxxxxxx | |
| U-00000080 - U-000007FF: | 110xxxxx 10xxxxxx | |
| U-00000800 - U-0000FFFF: | 1110xxxx 10xxxxxx 10xxxxxx | |
| U-00010000 - U-001FFFFF: | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | |
| U-00200000 - U-03FFFFFF: | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | |
| U-04000000 - U-7FFFFFFF: | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | |
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless of course you refer to a variable name and not the encoding itself.
An important note for developers of UTF-8 decoding routines: For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A
Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:
| 1100000x (10xxxxxx) |
| 11100000 100xxxxx (10xxxxxx) |
| 11110000 1000xxxx (10xxxxxx 10xxxxxx) |
| 11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx) |
| 11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx) |
Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences for safety reasons.
Markus Kuhn's UTF-8 decoder stress test file contains a systematic collection of malformed and overlong UTF-8 sequences and will help you to verify the robustness of your decoder.
A few interesting UTF-8 example files for tests and demonstrations are:
Both the UCS and Unicode standards are first of all large tables that assign to every character an integer number. If you use the term "UCS", "ISO 10646", or "Unicode", this just refers to a mapping between characters and integers. This does not yet specify how to store these integers as a sequence of bytes in memory.
ISO 10646-1 defines the UCS-2 and UCS-4 encodings. These are sequences of 2 bytes and 4 bytes per character, respectively. ISO 10646 was from the beginning designed as a 31-bit character set (with possible code positions ranging from U-00000000 to U-7FFFFFFF), however only very recently characters have been assigned beyond the Basic Multilingual Plane (BMP), that is beyond the first 2^16 character positions (see ISO 10646-2 and Unicode 3.1). UCS-4 can represent all UCS and Unicode characters, UCS-2 can represent only those from the BMP (U+0000 to U+FFFF).
"Unicode" originally implied that the encoding was UCS-2 and it initially didn't make any provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bia sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended "21-bit" Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to mean a 4-byte encoding of the extended "21-bit" Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 231 code positions up to U-7FFFFFFF.
In addition to all that, UTF-8 was introduced to provide an ASCII backwards compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode differ actually slightly, because in UCS, up to 6-byte long UTF-8 sequences are possible to represent characters up to U-7FFFFFFF, while in Unicode only up to 4-byte long UTF-8 sequences are defined to represent characters up to U-0010FFFF. The difference is in essence the same as between UCS-4 and UTF-32, except that no two different names have been introduced for UTF-8 covering the UCS and Unicode ranges.
No endianness is implied by UCS-2, UCS-4, UTF-16, and UTF-32, though ISO 10646-1 says that Bigendian should be preferred unless otherwise agreed. It has become customary to append the letters "BE" (Bigendian, high-byte first) and "LE" (Littleendian, low-byte first) to the encoding names in order to explicitly specify a byte order.
In order to allow the automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent U+FFFE is not a valid Unicode character, therefore it helps to unambiguously distinguish the Bigendian and Littleendian variants of UTF-16 and UTF-32.
A full-featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS:
UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
Where no byte order is explicitly specified, use the byte order of the CPU on which the conversion takes place and in an input stream swap the byte order whenever U+FFFE is encountered. The difference between outputting UCS-4 versus UTF-32 and UTF-16 versus UCS-2 lies in handling out-of-range characters. The fallback mechanism for non-representable characters has to be activated in UTF-32 (for characters > U-0010FFFF) or UCS-2 (for characters > U+FFFF) even where UCS-4 or UTF-16 respectively would offer a representation.
Really just of historic interest are UTF-1, UTF-7, SCSU and a dozen other less widely publicised UCS encoding proposals with various properties, none of which ever enjoyed any significant use. Their use should be avoided.
A good encoding converter will also offer options for adding or removing the BOM:
It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8 file. This practice should definitely not be used on POSIX systems for several reasons:
A full-featured character encoding converter should also offer conversion between normalization forms. Care should be used with mapping to NFKD or NFKC, as semantic information might be lost (for instance U+00B2 (SUPERSCRIPT TWO) maps to 2) and extra mark-up information might have to be added to preserve it (e.g., <SUP>2</SUP> in HTML).
More recent programming languages that were developed after around 1993 already have special data types for Unicode/ISO 10646-1 characters. This is the case with Ada95, Java, TCL, Perl, Python, C# and others.
ISO C 90 specifies mechanisms to handle multi-byte encoding and wide characters. These facilities were improved with Amendment 1 to ISO C 90 in 1994 and even further improvements were made in the new ISO C 99 standard. These facilities were designed originally with various East-Asian encodings in mind. They are on one side slightly more sophisticated than what would be necessary to handle UCS (handling of "shift sequences"), but also lack support for more advanced aspects of UCS (combining characters, etc.). UTF-8 is an example encoding for what the ISO C standard calls a multi-byte encoding and the type wchar_t, which is in modern environments usually a signed 32-bit integer, can be used to hold Unicode characters.
Unfortunately, wchar_t was already widely used for various Asian 16-bit encodings throughout the 1990s, therefore the ISO C 99 standard could for backwards compatibility not be changed any more to require wchar_t to be used with UCS, like Java and Ada95 managed to do. However, the C compiler can at least signal to an application that wchar_t is guaranteed to hold UCS values in all locales by defining the macro __STDC_ISO_10646__ to be an integer constant of the form yyyymmL (for example, 200009L for ISO/IEC 10646-1:2000; the year and month refer to the version of ISO/IEC 10646 and its amendments that have been implemented).
Before UTF-8 emerged, Linux users all over the world had to use various different language-specific extensions of ASCII. Most popular were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 / ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, etc. This made the exchange of files difficult and application software had to worry about various small differences between these encodings. Support for these encodings was usually incomplete, untested, and unsatisfactory, because the application developers rarely used all these encodings themselves.
Because of these difficulties, the major Linux distributors and application developers now foresee and hope that Unicode will eventually replace all these older legacy encodings, primarily in the UTF-8 form. UTF-8 will be used in
In UTF-8 mode, terminal emulators such as xterm or the Linux console driver transform every keystroke into the corresponding UTF-8 sequence and send it to the stdin of the foreground process. Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed with a UTF-8 decoder and then displayed using a 16-bit font.
Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux will use on a broad basis to replace ASCII and the other 8-bit character sets is far simpler. Linux terminal emulators and command line tools will in the first step only switch to UTF-8. This means that only a Level 1 implementation of ISO 10646-1 is used (no combining characters), and only scripts that need no further processing support, such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols, are supported. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we now have thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width).
Combining characters might also be supported under Linux eventually (there is even some experimental terminal emulator support available today), but even then the precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.
One influential non-POSIX PC operating system vendor (whom we shall leave unnamed here) suggested that all Unicode files should start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which is in this role also referred to as the "signature" or "byte-order mark (BOM)", in order to identify the encoding and byte-order used in a file. Linux/Unix does not use any BOMs and signatures. They would break far too many existing ASCII-file syntax conventions. On POSIX systems, the selected locale already identifies the encoding expected in all input and output files of a process. It has also been suggested to call UTF-8 files without a signature "UTF-8N" files, but this non-standard term is usually not used in the POSIX world.
Before you start using UTF-8 under Linux, update your installation to use glibc 2.2 and XFree86 4.0.3 or newer. This is the case for example starting with the SuSE 7.1 and Red Hat 7.1 distributions. Earlier Linux distributions lack UTF-8 locale support and ISO10646-1 X11 fonts.
If you are a developer, there are two approaches to add UTF-8 support, which I will call soft and hard conversion. In soft conversion, data is kept in its UTF-8 form everywhere and only very few software changes are necessary. In hard conversion, UTF-8 data that the program reads will be converted into wide-character arrays using standard C library functions and will be handled as such everywhere inside the application. Strings will only be converted back to UTF-8 at output time.
Most applications can do just fine with soft conversion. This is what makes the introduction of UTF-8 on Unix feasible at all. For example, programs such as cat and echo do not have to be modified at all. They can remain completely ignorant as to whether their input and output is ISO 8859-2 or UTF-8, because they handle just byte streams without processing them. They only recognize ASCII characters and control codes such as '\n', which do not change in any way under UTF-8. Therefore, the UTF-8 encoding and decoding is done for these applications completely in the terminal emulator.
A small modification will be necessary for all programs that determine the number of characters in a string by counting the bytes. In UTF-8 mode, they must not count any bytes in the range 0x80 - 0xBF, because these are just continuation bytes and not characters of their own. C's strlen(s) counts the number of bytes, but not necessarily the number of characters in a string correctly. Instead, mbstowcs(NULL,s,0) can be used to count characters if a UTF-8 locale has been selected.
The strlen function does not have to be replaced where the result is used as a byte count, for example to allocate a suitably sized buffer for a string. The second most common use of strlen is to predict how many columns the cursor of the terminal will advance if a string is printed out. With UTF-8, a character count will also not be satisfactory to predict column width, because ideographic characters (Chinese, Japanese, Korean) will occupy two column positions. To determine the width of a string on the terminal screen, it is necessary to decode the UTF-8 sequence and then use the wcwidth function to test the display width of each character.
For instance, the ls program had to be modified, because it has to know the column widths of filenames to format the table layout in which the directories are presented to the user. Similarly, all programs that assume somehow that the output is presented in a fixed-width font and format it accordingly have to learn how to count columns in UTF-8 text. Editor functions such as deleting a single character have to be slightly modified to delete all bytes that might belong to one character. Affected are for instance editors (vi, emacs, readline, etc.) as well as programs that use the ncurses library.
Any Unix-style kernel can do fine with soft conversion and needs only very minor modifications to fully support UTF-8. Most kernel functions that handle strings (e.g. file names, environment variables, etc.) are not affected at all by the encoding. Modifications might be necessary in the following places:
Starting with GNU glibc 2.2, the type wchar_t is officially intended to be used only for 32-bit ISO 10646 values, independent of the currently used locale. This is signalled to applications by the definition of the __STDC_ISO_10646__ macro as required by ISO C99. The ISO C multi-byte conversion functions (mbsrtowcs(), wcsrtombs(), etc.) are fully implemented in glibc 2.2 or higher and can be used to convert between wchar_t and any locale-dependent multibyte encoding, including UTF-8, ISO 8859-1, etc.
For example, you can write
#include <stdio.h>
#include <locale.h>

int main()
{
  if (!setlocale(LC_CTYPE, "")) {
    fprintf(stderr, "Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  printf("%ls\n", L"Schöne Grüße");
  return 0;
}
Call this program with the locale setting LANG=de_DE and the output will be in ISO 8859-1. Call it with LANG=de_DE.UTF-8 and the output will be in UTF-8. The %ls format specifier in printf calls wcsrtombs in order to convert the wide character argument string into the locale-dependent multi-byte encoding.
If your application is soft converted and does not use the standard locale-dependent C multibyte routines (mbsrtowcs(), wcsrtombs(), etc.) to convert everything into wchar_t for processing, then it might have to find out in some way whether it is supposed to assume that the text data it handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1 character) or UTF-8. Hopefully, in a few years everyone will only be using UTF-8 and you can just make it the default, but until then both the classical 8-bit sets and UTF-8 will have to be supported.
The first wave of applications with UTF-8 support used a whole lot of different command line switches to activate their respective UTF-8 modes, for instance the famous xterm -u8. That turned out to be a very bad idea. Having to remember a special command line option or other configuration mechanism for every application is very tedious, which is why command line options are not the proper way of activating a UTF-8 mode.
The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that contains information about culture-specific conventions of software behaviour, including the character encoding, the date/time notation, alphabetic sorting rules, the measurement system and common office paper size, etc. The names of locales usually consist of ISO 639-1 language and ISO 3166-1 country codes, sometimes with additional encoding names or other qualifiers.
You can get a list of all locales installed on your system (usually
in /usr/lib/locale/) with the command locale -a. Set the environment
variable LANG to the name of your preferred locale. You can query the
name of the character encoding in your current locale with the command
locale charmap. This should say UTF-8 if you successfully picked a
UTF-8 locale in the LC_CTYPE category. The command locale -m provides
a list with the names of all installed character encodings.
If you use exclusively C library multibyte functions to do all the
conversion between the external character encoding and the
wchar_t encoding that you use internally, then the C
library will take care of using the right encoding according to
LC_CTYPE for you and your program does not even have to
know explicitly what the current multibyte encoding is.
However, if you prefer not to do everything using the libc
multi-byte functions (e.g., because you think this would require too
many changes in your software or is not efficient enough), then your
application has to find out for itself when to activate the UTF-8
mode. To do this, on any X/Open compliant system, where
<langinfo.h> is available, you can use a line such as

  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
in order to detect whether the current locale uses the UTF-8
encoding. You have of course to add a setlocale(LC_CTYPE,
"") at the beginning of your application to set the locale
according to the environment variables first. The standard function
call nl_langinfo(CODESET) is also what locale
charmap calls to find the name of the encoding specified by the
current locale for you. It is available on pretty much every modern
Unix, except for FreeBSD, which unfortunately still has quite abysmal
locale support. If you need an autoconf test for the availability of
nl_langinfo(CODESET), here is the one Bruno Haible
suggested:
[You could also try to query the locale environment variables
yourself without using setlocale(). In the sequence
LC_ALL, LC_CTYPE, LANG, look
for the first of these environment variables that has a value. Make
the UTF-8 mode the default (still overridable by command line
switches) when this value contains the substring UTF-8,
as this indicates reasonably reliably that the C library has been
asked to use a UTF-8 locale. An example code fragment that does this
is
This relies of course on all UTF-8 locales having the name of the
encoding in their name, which is not always the case, therefore the
nl_langinfo() query is clearly the better method. If you
are concerned about that calling nl_langinfo() might not
be portable enough (e.g., FreeBSD still doesn't have it), then use libcharset,
which is a portable library for determining the current locale's
character encoding. That's also what several of the GNU packages use.]
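Putting both methods together, such a detection routine might look as follows (a sketch; detect_utf8() is an illustrative name, and HAVE_LANGINFO_CODESET is the macro defined by Bruno Haible's autoconf test):

```c
#include <locale.h>
#include <string.h>
#include <stdlib.h>
#ifdef HAVE_LANGINFO_CODESET
#include <langinfo.h>
#endif

/* Return 1 if the current locale uses the UTF-8 encoding, 0
 * otherwise.  Call setlocale(LC_CTYPE, "") first. */
int detect_utf8(void)
{
#ifdef HAVE_LANGINFO_CODESET
    /* The reliable method on X/Open compliant systems. */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
#else
    /* Fallback: inspect the locale environment variables in their
     * order of precedence. */
    char *s;
    if ((s = getenv("LC_ALL")) ||
        (s = getenv("LC_CTYPE")) ||
        (s = getenv("LANG")))
        return strstr(s, "UTF-8") != NULL;
    return 0;
#endif
}
```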
The xterm
version that comes with XFree86
4.0 or higher (maintained by Thomas
Dickey) already includes UTF-8 support. To activate it, start
xterm in a UTF-8 locale and use a font with iso10646-1
encoding, for instance with
and then cat some example file, such as UTF-8-demo.txt
in the newly started xterm and enjoy what you see.
If you are not using XFree86 4.0 or newer, then you can
alternatively download the latest xterm
development version separately and compile it yourself with
"./configure --enable-wide-chars ; make" or alternatively
with "xmkmf; make Makefiles; make; make install; make
install.man".
If you do not have UTF-8 locale support available, use command line
option -u8 when you invoke xterm to switch input and
output to UTF-8.
Xterm in XFree86 4.0.1 only supported Level 1 (no combining
characters) of ISO 10646-1 with a fixed character width and
left-to-right writing direction. In other words, the terminal
semantics were basically the same as for ISO 8859-1, except that it
could now decode UTF-8 and access 16-bit characters.
With XFree86 4.0.3, two important functions were added:
The following fonts coming with XFree86 4.x are suitable for
display of Japanese and Korean Unicode text with terminal emulators
and editors:
Some simple support for nonspacing or enclosing combining
characters (i.e., those with general category code Mn or Me in the Unicode
database) is now also available, which is implemented by just
overstriking (logical OR-ing) a base-character glyph with up to two
combining-character glyphs. This produces acceptable results for
accents below the base line and accents on top of small characters. It
also works well for example for Thai fonts that were specifically
designed for use with overstriking. However, the results might not be
fully satisfactory for combining accents on top of tall characters in
some fonts, especially those of the "fixed" family; therefore,
precomposed characters will continue to be preferable where available.
The following fonts coming with XFree86 4.x are suitable for
display of Latin etc. combining characters (extra head-space), other
fonts will only look nice with combining accents on small x-high
characters:
The following fonts coming with XFree86 4.x are suitable for
display of Thai combining characters:
A note for programmers of text mode applications:
With support for CJK ideographs and combining characters, the
output of xterm behaves a little more like that of a proportional
font, because a Latin/Greek/Cyrillic/etc. character requires one
column position, a CJK ideograph two, and a combining character zero.
The Open Group's Single UNIX
Specification defines the two C functions wcwidth() and wcswidth() that allow an application to
test how many column positions a character will occupy:
Markus Kuhn's free wcwidth()
implementation can be used by applications on platforms where the C
library does not yet provide a suitable function.
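For instance, a screen editor can use wcwidth() to compute how many columns a string needs before drawing it. A sketch (safe_width() is an illustrative helper that skips nonprintable characters instead of failing, as wcswidth() does):

```c
#define _XOPEN_SOURCE 600   /* for wcwidth() */
#include <wchar.h>
#include <locale.h>

/* Compute the column width of a wide string: Latin/Greek/Cyrillic
 * etc. characters count 1, CJK ideographs 2, combining characters 0.
 * Unlike wcswidth(), nonprintable characters are skipped instead of
 * causing a -1 result. */
int safe_width(const wchar_t *s)
{
    int total = 0;
    for (; *s; s++) {
        int w = wcwidth(*s);
        if (w > 0)
            total += w;     /* combining characters add nothing */
    }
    return total;
}
```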
Xterm will for the foreseeable future probably not support the
following functionality, which you might expect from a more
sophisticated full Unicode rendering engine:
Hebrew and Arabic users will therefore have to use application
programs that reverse and left-pad Hebrew and Arabic strings before
sending them to the terminal. In other words, the bidirectional
processing has to be done by the application and not by xterm. The
situation for Hebrew and Arabic at least improves over ISO 8859
through the availability of precomposed glyphs and presentation forms.
It is far from clear at the moment, whether bidirectional support
should really go into xterm and how precisely this should work. Both
ISO 6429 =
ECMA-48 and the Unicode bidi
algorithm provide alternative starting points. See also ECMA Technical
Report TR/53.
If you plan to support bidirectional text output in your
application, have a look at either Dov Grobgeld's FriBidi or Mark Leisher's Pretty Good Bidi
Algorithm, two free implementations of the Unicode bidi algorithm.
Xterm currently does not support the Arabic, Syriac, Hangul Jamo,
or Indic text formatting algorithms, although Robert Brady has
published some experimental patches
towards bidi support. It is still unclear whether it is feasible or
preferable to do this in a VT100 emulator at all. Applications can
apply the Arabic and Hangul formatting algorithms themselves easily,
because xterm allows them to output all the necessary presentation
forms. For Indic scripts, the X font mechanism does not at the moment
even support the encoding of the necessary ligature variants, so there
is little xterm could offer anyway. Applications requiring Indic or
Syriac output are better off using a proper Unicode X11 rendering
library such as Pango instead of a VT100 emulator like xterm.
Unicode X11 font names end with -ISO10646-1. This is
now the officially registered value for the X Logical Font Descriptor (XLFD) fields
CHARSET_REGISTRY and CHARSET_ENCODING for
all Unicode and ISO 10646-1 16-bit fonts. The
*-ISO10646-1 fonts contain some unspecified subset of the
entire Unicode character set, and users have to make sure that
whatever font they select covers the subset of characters needed by
them.
The *-ISO10646-1 fonts usually also specify a
DEFAULT_CHAR value that points to a special non-Unicode
glyph for representing any character that is not available in the font
(usually a dashed box, the size of an H, located at 0x00). This
ensures that users at least see clearly that there is an unsupported
character. The smaller fixed-width fonts such as 6x13 etc. for xterm
will never be able to cover all of Unicode, because many scripts such
as Kanji can only be represented in considerably larger pixel sizes
than those widely used by European users. Typical Unicode fonts for
European usage will contain only subsets of between 1000 and 3000
characters, such as the CEN MES-3
repertoire.
You might notice that in the *-ISO10646-1 fonts the shapes of the ASCII quotation marks have
slightly changed to bring them in line with the standards and practice
on other platforms.
VT100 terminal emulators accept ISO
2022 (=ECMA-35) ESC
sequences in order to switch between different character sets.
UTF-8 is in the sense of ISO 2022 an "other coding system" (see
section 15.4 of ECMA 35). UTF-8 is outside the ISO 2022
SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8,
all SS2/SS3/G0/G1/G2/G3 state becomes meaningless until you leave
UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding, i.e.
a self-terminating short byte sequence determines completely which
character is meant, independent of any switching state. G0 and G1 in
ISO 10646-1 are those of ISO 8859-1, and G2/G3 do not exist in ISO
10646, because every character has a fixed position and no switching
takes place. With UTF-8, it is not possible that your terminal remains
switched to strange graphics-character mode after you accidentally
dumped a binary file to it. This makes a terminal in UTF-8 mode much
more robust than with ISO 2022 and it is therefore useful to have a
way of locking a terminal into UTF-8 mode such that it can't
accidentally go back to the ISO 2022 world.
The ISO 2022 standard specifies a range of ESC % sequences for
leaving the ISO 2022 world (designation of other coding system, DOCS),
and a number of such sequences have been registered for UTF-8
in section 2.8 of the ISO 2375 International
Register of Coded Character Sets:
While a terminal emulator is in UTF-8 mode, any ISO 2022 escape
sequences such as for switching G2/G3 etc. are ignored. The only ISO
2022 sequence on which a terminal emulator might act in UTF-8 mode is
ESC %@ for returning from UTF-8 back to the ISO 2022
scheme.
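An application that wants to switch a terminal between the two modes can simply emit the registered sequences: ESC %G to enter UTF-8 and ESC %@ to return to the ISO 2022 scheme. A sketch (terminal_utf8() is an illustrative helper):

```c
#include <stdio.h>

/* Switch the terminal into UTF-8 mode (ESC % G) or back to the
 * ISO 2022 scheme (ESC % @), using the DOCS sequences registered
 * for UTF-8 in the ISO 2375 register. */
void terminal_utf8(FILE *tty, int on)
{
    fputs(on ? "\033%G" : "\033%@", tty);
    fflush(tty);
}
```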
UTF-8 still allows you to use C1 control characters such as CSI,
even though UTF-8 also uses bytes in the range 0x80-0x9F. It is
important to understand that a terminal emulator in UTF-8 mode must
apply the UTF-8 decoder to the incoming byte stream
before interpreting any control characters. C1
characters are UTF-8 decoded just like any other character above
U+007F.
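The order of operations can be sketched as: decode first, then classify. The fragment below handles only the one- and two-byte UTF-8 forms, which is all that is needed to illustrate C1 handling; a real terminal emulator needs a complete decoder with error recovery:

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, storing the code point
 * in *ucs and returning the number of bytes consumed, or 0 for
 * forms not handled in this sketch. */
size_t utf8_decode(const unsigned char *s, unsigned int *ucs)
{
    if (s[0] < 0x80) {                              /* ASCII */
        *ucs = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        *ucs = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f); /* 2-byte form */
        return 2;
    }
    return 0;
}

/* C1 control characters arrive on the wire as two bytes
 * (0xC2 0x80-0x9F) and must be recognized only after decoding. */
int is_c1_control(unsigned int ucs)
{
    return ucs >= 0x80 && ucs <= 0x9f;
}
```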
Many text-mode applications available today expect to speak to the
terminal using a legacy encoding or to use ISO 2022 sequences for
switching terminal fonts. In order to use such applications within a
UTF-8 terminal emulator, it is possible to use a conversion layer that
will translate between ISO 2022 and UTF-8 on the fly. One such utility
is Juliusz Chroboczek's luit. If all
you need is ISO 8859 support in a UTF-8 terminal, you can also use
Michael Schroeder's screen
(version 3.9.9 or newer). As implementing ISO 2022 is a complex and
error-prone task, it is better not to implement ISO 2022 yourself;
implement only UTF-8 and point users who need ISO 2022 at luit (or
screen).
Starting with Solaris 2.8, UTF-8 is at least partially supported.
To use it, just set one of the UTF-8 locales, for instance by typing
Now the dtterm terminal emulator can be used to input
and output UTF-8 text and the mp print filter will print
UTF-8 files on PostScript printers. The en_US.UTF-8
locale is at the moment supported by Motif and CDE desktop
applications and libraries, but not by OpenWindows, XView, and
OPENLOOK DeskSet applications and libraries.
For more information, read Sun's Overview of en_US.UTF-8 Locale Support web page.
See Adobe's Unicode
and Glyph Names guide.
With over 40,000 characters, a complete Unicode
implementation is an enormous project. However, it is often sufficient
(especially for the European market) to implement only a few hundred
or thousand characters as before and still enjoy the simplicity of
reaching all required characters in just one single simple encoding
via Unicode. A number of different UCS subsets have already been
established:
Markus Kuhn's uniset Perl script
allows convenient set arithmetic over UCS subsets for anyone who wants
to define a new one or wants to check coverage of an implementation.
The Unicode Consortium maintains a collection of mapping
tables between Unicode and various older encoding standards. It is
important to understand that these tables alone are only suitable for
converting text from the older encodings to Unicode. Conversion in the
opposite direction from Unicode to a legacy character set requires
non-injective (= many-to-one) extensions of these mapping tables.
Several Unicode characters have to be mapped to a single code point in
a legacy encoding. This is necessary, because some legacy encodings
distinguished characters that others unified. The Unicode Consortium
does not currently maintain standard many-to-one tables for this
purpose, but such tables can easily be generated from available
normalization information.
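Such a converter can be sketched as an ordinary lookup with a few extra many-to-one entries on top of the inverted standard table (ucs_to_latin1() is an illustrative helper covering only a couple of example mappings):

```c
/* Map a UCS code point to ISO 8859-1, applying extra many-to-one
 * entries that the inverted standard (injective) mapping table does
 * not contain.  Returns -1 for unmappable characters. */
int ucs_to_latin1(unsigned int ucs)
{
    switch (ucs) {
    case 0x03bc:                /* GREEK SMALL LETTER MU */
        return 0xb5;            /* -> same byte as MICRO SIGN */
    case 0x212b:                /* ANGSTROM SIGN */
        return 0xc5;            /* -> A WITH RING ABOVE */
    default:
        return ucs < 0x100 ? (int) ucs : -1;
    }
}
```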
Here are some examples for the many-to-one mappings that have to be
handled when converting from Unicode into something else:
The Unicode
database contains in field 5 the Character Decomposition
Mapping that can be used to generate the above example mappings
automatically. As a rule, the output of a Unicode-to-Something
converter should not depend on whether the Unicode input has first
been converted into Normalization Form
C or not. For equivalence information on Chinese, Japanese, and
Korean Han/Kanji/Hanja characters, use the Unihan
database (20 MB).
The Unicode mapping tables sometimes also have to be slightly
modified to preserve information in combination encodings. For
example, the standard mappings provide round-trip compatibility for
conversion chains ASCII to Unicode to ASCII as well as for JIS X 0208
to Unicode to JIS X 0208. However, the EUC-JP encoding covers the
union of ASCII and JIS X 0208, and the UCS repertoire covered by the
ASCII and JIS X 0208 mapping tables overlaps for one character, namely
U+005C REVERSE SOLIDUS. EUC-JP converters therefore have to use a
slightly modified JIS X 0208 mapping table, such that the JIS X 0208
code 0x2140 (0xA1 0xC0 in EUC-JP) gets mapped to U+FF3C FULLWIDTH
REVERSE SOLIDUS. This way, round-trip compatibility from EUC-JP to
Unicode to EUC-JP can be guaranteed without any loss of information.
Unicode
Standard Annex #11: East Asian Width provides further guidance on
this issue.
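In code, this modification amounts to a single special case in front of the standard table lookup. A sketch (jis0208_std_to_ucs() is a hypothetical stand-in for the full standard JIS X 0208 mapping table):

```c
/* Hypothetical stand-in for the standard JIS X 0208 -> UCS mapping
 * table; only the one relevant entry is shown here, 0 means "not in
 * this sketch". */
static unsigned int jis0208_std_to_ucs(unsigned int jis)
{
    return jis == 0x2140 ? 0x005c : 0;
}

/* JIS X 0208 -> UCS lookup for an EUC-JP converter.  The single
 * modified entry preserves round-trip compatibility, because the
 * ASCII part of EUC-JP already claims U+005C. */
unsigned int jis0208_to_ucs(unsigned int jis)
{
    if (jis == 0x2140)
        return 0xff3c;  /* FULLWIDTH REVERSE SOLIDUS, not U+005C */
    return jis0208_std_to_ucs(jis);
}
```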
In addition to just using standard normalization mappings,
developers of code converters can also offer transliteration support.
Transliteration is the conversion of a Unicode character into a
graphically and/or semantically similar character in the target code,
even if the two are distinct characters in Unicode after
normalization. Examples of transliteration:
The Unicode Consortium does not provide or maintain any standard
transliteration tables. Which transliterations are appropriate or not
can in some cases depend on language, application field, and even
personal preference. Available Unicode transliteration tables include
for example those found in Bruno Haible's libiconv,
the glibc 2.2 locales,
and Markus Kuhn's transtab
package.
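A transliteration step can be sketched as one more fallback lookup; the fragment below folds the typographic double quotation marks onto 0x22, as real tables such as transtab do on a much larger scale (translit_ascii() is an illustrative helper):

```c
/* Transliterate a few typographic characters to plain ASCII.
 * Returns the substitute character, or -1 if no transliteration
 * is known in this sketch. */
int translit_ascii(unsigned int ucs)
{
    switch (ucs) {
    case 0x201c: case 0x201d:   /* LEFT/RIGHT DOUBLE QUOTATION MARK */
    case 0x201e: case 0x201f:   /* DOUBLE LOW-9 / HIGH-REVERSED-9 */
        return 0x22;            /* " QUOTATION MARK */
    default:
        return ucs < 0x80 ? (int) ucs : -1;
    }
}
```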
The X11 R6.6 release (2001)
is the latest version of the X Consortium's sample implementation of
the X11 Window System standards. The bulk of the current X11
standards and the sample implementation pre-date widespread
interest in Unicode under Unix. There are a number of problems and
inconveniences for Unicode users in both that really should be fixed
in the next X11 release:
UTF-8 cut and paste: The ICCCM
standard does not specify how to transfer UCS strings in selections.
Some vendors have added UTF-8 as yet another encoding to the existing
COMPOUND_TEXT mechanism (CTEXT). This is not a good solution for
at least the following reasons: Juliusz Chroboczek
has written an Inter-Client Exchange of Unicode Text draft proposal for an
extension of the ICCCM to handle UTF-8 selections with a new
UTF8_STRING atom that can be used as a property type and selection
target. This clean approach fixes all of the above problems.
UTF8_STRING is just as state-less and easy to use as the existing
STRING atom (which is reserved exclusively for ISO 8859-1 strings and
therefore not usable for UTF-8), and adding a new selection target
allows applications to offer selections in both the old CTEXT and the
new UTF8_STRING format simultaneously, which maximizes
interoperability. The use of UTF8_STRING can be negotiated between the
selection holder and requestor, leading to no compatibility issues
whatsoever. Markus Kuhn has prepared an ICCCM
patch that adds the necessary definition to the standard. Current
status: The UTF8_STRING atom has now been officially registered with X.Org,
and an update of the ICCCM is expected for the next release.
A few workarounds have been used so far: These workarounds do not solve the underlying problem that
XFontStruct is unsuitable for sparsely populated fonts, but they do
provide a significant efficiency improvement without requiring any
changes in the API or client source code. One real solution would be
to extend or substitute XFontStruct with something slightly more
flexible that contains a sorted list or hash table of characters as
opposed to an array. This redesign of XFontStruct would at the same
time allow adding the urgently needed provisions for combining
characters and ligatures.
Several XFree86 team members are trying to work on these issues
with X.Org, which is the official
successor of the X Consortium and the Opengroup as the custodian of
the X11 standards and the sample implementation. But things are moving
rather slowly. Support for UTF8_STRING, UCS keysyms, and ISO10646-1
extensions of the core fonts will hopefully make it into R6.6.1 in
2001-Q4. With regard to the other font related problems, the solution
will probably be to dump the old server-side font mechanisms entirely
and use instead Keith
Packard's new X
Render Extension.
You should certainly be on the linux-utf8@nl.linux.org
mailing list. That's the place to meet for everyone interested in
working towards better UTF-8 support for GNU/Linux or Unix systems and
applications. To subscribe, send to majordomo@nl.linux.org a
message with the line "subscribe linux-utf8" in the body. You can also
browse the linux-utf8
archive.
There is also the unicode@unicode.org mailing list, which is the best
way of finding out what the authors of the Unicode standard and a lot
of other gurus have to say. To subscribe, send to unicode-request@unicode.org
a message with the subject line "subscribe" and the text "subscribe
YOUR@EMAIL.ADDRESS unicode".
The relevant mailing lists for discussions about Unicode support in
Xlib and the X server are the fonts@xfree86.org
and i18n@xfree86.org
mailing lists.
I add new material to this document very frequently, so please
check it regularly or ask Netminder to
notify you of any changes. Suggestions for
improvement, as well as advertisement in the freeware community for
better UTF-8 support, are very welcome. UTF-8 use under Linux is quite
new, so expect a lot of progress in the next few months here.
Special thanks to Ulrich Drepper, Bruno Haible, Robert Brady,
Shuhei Amakawa and many others for valuable comments, and to SuSE
GmbH, Nürnberg, for their support.
Markus Kuhn
<Markus.Kuhn@cl.cam.ac.uk>
utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
======================== m4/codeset.m4 ================================
#serial AM1
dnl From Bruno Haible.
AC_DEFUN([AM_LANGINFO_CODESET],
[
  AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
    [AC_TRY_LINK([#include <langinfo.h>],
      [char* cs = nl_langinfo(CODESET);],
      am_cv_langinfo_codeset=yes,
      am_cv_langinfo_codeset=no)
    ])
  if test $am_cv_langinfo_codeset = yes; then
    AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
      [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
  fi
])
=======================================================================
char *s;
int utf8_mode = 0;

if ((s = getenv("LC_ALL")) ||
    (s = getenv("LC_CTYPE")) ||
    (s = getenv("LANG"))) {
  if (strstr(s, "UTF-8"))
    utf8_mode = 1;
}
How do I get a UTF-8 version of xterm?
LANG=en_GB.UTF-8 xterm \
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
How much of Unicode does xterm support?
If the selected normal font is X×Y pixels
large, then xterm will now attempt to load in addition a
2X×Y pixels large font (same XLFD, except
for a doubled value of the AVERAGE_WIDTH property). It
will use this font to represent all Unicode characters that have been
assigned the East Asian Wide (W) or East Asian FullWidth
(F) property in Unicode Technical
Report #11.
6x13 -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
6x13B -Misc-Fixed-Bold-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
6x13O -Misc-Fixed-Medium-O-SemiCondensed--13-120-75-75-C-60-ISO10646-1
12x13ja -Misc-Fixed-Medium-R-Normal-ja-13-120-75-75-C-120-ISO10646-1
9x18 -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
9x18B -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
18x18ja -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
18x18ko -Misc-Fixed-Medium-R-Normal-ko-18-120-100-100-C-180-ISO10646-1
6x12 -Misc-Fixed-Medium-R-SemiCondensed--12-110-75-75-C-60-ISO10646-1
9x18 -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
9x18B -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
6x13 -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
9x15 -Misc-Fixed-Medium-R-Normal--15-140-75-75-C-90-ISO10646-1
9x15B -Misc-Fixed-Bold-R-Normal--15-140-75-75-C-90-ISO10646-1
10x20 -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1
9x18 -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
#include <wchar.h>
int wcwidth(wchar_t wc);
int wcswidth(const wchar_t *pwcs, size_t n);
Where do I find ISO 10646-1 X11 fonts?
Quite a number of Unicode fonts have become available for X11 over
the past few months, and the list is growing quickly:
What are the issues related to UTF-8 terminal emulators?
What UTF-8 enabled applications are already available?
What patches to improve UTF-8 support are available?
Are there free libraries for dealing with Unicode available?
What is the status of Unicode support for various X widget libraries?
What packages with UTF-8 support are currently under development?
How does UTF-8 support work under Solaris?
setenv LANG en_US.UTF-8
in a C shell.
How are Postscript glyph names related to UCS codes?
Are there any well-defined UCS subsets?
that can be used to define and document implemented
subsets. Unicode defines similar, but not quite identical, blocks of
characters, which correspond to sections in the Unicode standard.
What issues are there to consider when converting encodings?
UCS characters                                  equivalent character in target code

U+00B5 MICRO SIGN,
U+03BC GREEK SMALL LETTER MU                    0xB5 ISO 8859-1

U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE,
U+212B ANGSTROM SIGN                            0xC5 ISO 8859-1

U+03A9 GREEK CAPITAL LETTER OMEGA,
U+2126 OHM SIGN                                 0xEA CP437

U+005C REVERSE SOLIDUS,
U+FF3C FULLWIDTH REVERSE SOLIDUS                0x2140 JIS X 0208
UCS characters                                  equivalent character in target code

U+0022 QUOTATION MARK,
U+201C LEFT DOUBLE QUOTATION MARK,
U+201D RIGHT DOUBLE QUOTATION MARK,
U+201E DOUBLE LOW-9 QUOTATION MARK,
U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK    0x22 ISO 8859-1
Is X11 ready for Unicode?
Are there any good mailing lists on these issues?
Further References
created 1999-06-04 -- last
modified 2001-08-28 --
http://www.cl.cam.ac.uk/~mgk25/unicode.html