by Markus Kuhn
This text is a very comprehensive one-stop information resource on how you can use Unicode/UTF-8 on POSIX systems (Linux, Unix). You will find here both introductory information for every user as well as detailed references for the experienced developer.
Unicode is well on the way to replacing ASCII and Latin-1 at all levels within a few years. It not only allows you to handle text in practically any script and language used on this planet, it also provides you with a comprehensive set of mathematical and technical symbols that will simplify scientific information exchange.
The UTF-8 encoding allows Unicode to be used in a convenient and backwards compatible way in environments that, like Unix, were designed entirely around ASCII. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. It is now time to make sure that you are well familiar with it and that your software supports UTF-8 smoothly.
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. If you convert any text string to UCS and then back to the original encoding, then no information will be lost.
UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only Cuneiform, Hieroglyphs and various Indo-European languages, but even some selected artistic scripts such as Tolkien's Tengwar and Cirth. UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems, and more are being added.
ISO 10646 formally defines a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP all belong to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part ISO 10646-2, which defines characters encoded outside the BMP, is under preparation, but it might take a few years until it is finished. New characters are still being added to the BMP on a continuous basis, but the existing characters will not be changed any more and are stable.
UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+" as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use.
The full name of the UCS standard is
International Standard ISO/IEC 10646-1, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Second edition, International Organization for Standardization, Geneva, 2000-09-15.
It can be ordered online from ISO as a set of PDF files on CD-ROM for 80 CHF (~53 EUR, ~45 USD, ~32 GBP).
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. Accented characters that have their own code position, but could also be represented as a pair of another character followed by a combining character, are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings such as ISO 8859 that had no combining characters. The combining character mechanism makes it possible to add accents and other diacritical marks to any character, which is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.
Combining characters follow the character which they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. For example with the Thai script, up to two combining characters are needed on a single base character.
Not all systems are expected to support all the advanced mechanisms of UCS such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:
Yes, a number of countries have published national adoptions of ISO 10646-1:1993, sometimes after adding additional annexes with cross-references to older national standards and specifications of various national implementation subsets:
Historically, there have been two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. Unicode 1.1 corresponded to ISO 10646-1:1993 and Unicode 3.0 corresponds to ISO 10646-1:2000. All Unicode versions since 2.0 are compatible, only new characters will be added, no existing characters will be removed or renamed in the future.
The Unicode Standard can be ordered like any normal book, for instance via amazon.com for around 50 USD:
The Unicode Consortium: The Unicode Standard, Version 3.0,
Reading, MA, Addison-Wesley Developers Press, 2000,
ISBN 0-201-61633-5.
If you work frequently with text processing and character sets, you definitely should get a copy. It is also available online now.
The Unicode Standard published by the Unicode Consortium contains exactly the ISO 10646-1 Basic Multilingual Plane at implementation level 3. All characters are at the same positions and have the same names in both standards.
The Unicode Standard defines in addition much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-directional texts that mix for instance Latin and Hebrew, algorithms for sorting and string comparison, and much more.
The ISO 10646 standard on the other hand is not much more than a simple character set table, comparable to the well-known ISO 8859 standard. It specifies some terminology related to the standard, defines some encoding alternatives, and it contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the Unicode standard shows the CJK ideographs only in a Chinese variant.
UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes per character. The official terms for these encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first in these (Bigendian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes before every ASCII byte instead.
Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like '\0' or '/' which have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expects ASCII files and can't read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.
The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and also described in RFC 2279 as well as section 3.8 of the Unicode 3.0 standard does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.
UTF-8 has the following properties:
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
| U-00000000 - U-0000007F: | 0xxxxxxx | |
| U-00000080 - U-000007FF: | 110xxxxx 10xxxxxx | |
| U-00000800 - U-0000FFFF: | 1110xxxx 10xxxxxx 10xxxxxx | |
| U-00010000 - U-001FFFFF: | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | |
| U-00200000 - U-03FFFFFF: | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | |
| U-04000000 - U-7FFFFFFF: | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | |
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless of course you refer to a variable name and not the encoding itself.
An important note for developers of UTF-8 decoding routines: For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A
Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:
| 1100000x (10xxxxxx) |
| 11100000 100xxxxx (10xxxxxx) |
| 11110000 1000xxxx (10xxxxxx 10xxxxxx) |
| 11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx) |
| 11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx) |
Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences for safety reasons.
Markus Kuhn's UTF-8 decoder stress test file contains a systematic collection of malformed and overlong UTF-8 sequences and will help you to verify the robustness of your decoder.
A few interesting UTF-8 example files for tests and demonstrations are:
Both the UCS and Unicode standards are first of all large tables that assign to every character an integer number. If you use the term "UCS", "ISO 10646", or "Unicode", this just refers to a mapping between characters and integers. This does not yet specify how to store these integers as a sequence of bytes in memory.
ISO 10646-1 defines the UCS-2 and UCS-4 encodings. These are sequences of 2 bytes and 4 bytes per character, respectively. ISO 10646 was from the beginning designed as a 31-bit character set (with possible code positions ranging from U-00000000 to U-7FFFFFFF), however only very recently characters have been assigned beyond the Basic Multilingual Plane (BMP), that is beyond the first 2^16 character positions (see ISO 10646-2 and Unicode 3.1). UCS-4 can represent all UCS and Unicode characters, UCS-2 can represent only those from the BMP (U+0000 to U+FFFF).
"Unicode" originally implied that the encoding was UCS-2 and it initially didn't make any provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bia sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended "21-bit" Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to mean a 4-byte encoding of the extended "21-bit" Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 231 code positions up to U-7FFFFFFF.
In addition to all that, UTF-8 was introduced to provide an ASCII backwards compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode differ actually slightly, because in UCS, up to 6-byte long UTF-8 sequences are possible to represent characters up to U-7FFFFFFF, while in Unicode only up to 4-byte long UTF-8 sequences are defined to represent characters up to U-0010FFFF. The difference is in essence the same as between UCS-4 and UTF-32, except that no two different names have been introduced for UTF-8 covering the UCS and Unicode ranges.
No endianness is implied by UCS-2, UCS-4, UTF-16, and UTF-32, though ISO 10646-1 says that Bigendian should be preferred unless otherwise agreed. It has become customary to append the letters "BE" (Bigendian, high-byte first) and "LE" (Littleendian, low-byte first) to the encoding names in order to explicitly specify a byte order.
In order to allow the automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent U+FFFE is not a valid Unicode character, therefore it helps to unambiguously distinguish the Bigendian and Littleendian variants of UTF-16 and UTF-32.
A full-featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS:
UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
Where no byte order is explicitly specified, use the byte order of the CPU on which the conversion takes place and in an input stream swap the byte order whenever U+FFFE is encountered. The difference between outputting UCS-4 versus UTF-32 and UTF-16 versus UCS-2 lies in handling out-of-range characters. The fallback mechanism for non-representable characters has to be activated in UTF-32 (for characters > U-0010FFFF) or UCS-2 (for characters > U+FFFF) even where UCS-4 or UTF-16 respectively would offer a representation.
Really just of historic interest are UTF-1, UTF-7, SCSU and a dozen other less widely publicised UCS encoding proposals with various properties, none of which ever enjoyed any significant use. Their use should be avoided.
A good encoding converter will also offer options for adding or removing the BOM:
It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8 file. This practice should definitely not be used on POSIX systems for several reasons:
A full-featured character encoding converter should also offer conversion between normalization forms. Care should be used with mapping to NFKD or NFKC, as semantic information might be lost (for instance U+00B2 (SUPERSCRIPT TWO) maps to 2) and extra mark-up information might have to be added to preserve it (e.g., <SUP>2</SUP> in HTML).
More recent programming languages that were developed after around 1993 already have special data types for Unicode/ISO 10646-1 characters. This is the case with Ada95, Java, TCL, Perl, Python, C# and others.
ISO C 90 specifies mechanisms to handle multi-byte encoding and wide characters. These facilities were improved with Amendment 1 to ISO C 90 in 1994 and even further improvements were made in the new ISO C 99 standard. These facilities were designed originally with various East-Asian encodings in mind. They are on one side slightly more sophisticated than what would be necessary to handle UCS (handling of "shift sequences"), but also lack support for more advanced aspects of UCS (combining characters, etc.). UTF-8 is an example encoding for what the ISO C standard calls a multi-byte encoding and the type wchar_t, which is in modern environments usually a signed 32-bit integer, can be used to hold Unicode characters.
Unfortunately, wchar_t was already widely used for various Asian 16-bit encodings throughout the 1990s, therefore the ISO C 99 standard could for backwards compatibility not be changed any more to require wchar_t to be used with UCS, like Java and Ada95 managed to do. However, the C compiler can at least signal to an application that wchar_t is guaranteed to hold UCS values in all locales by defining the macro __STDC_ISO_10646__ to be an integer constant of the form yyyymmL (for example, 200009L for ISO/IEC 10646-1:2000; the year and month refer to the version of ISO/IEC 10646 and its amendments that have been implemented).
Before UTF-8 emerged, Linux users all over the world had to use various different language-specific extensions of ASCII. Most popular were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 / ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, etc. This made the exchange of files difficult and application software had to worry about various small differences between these encodings. Support for these encodings was usually incomplete, untested, and unsatisfactory, because the application developers rarely used all these encodings themselves.
Because of these difficulties, the major Linux distributors and application developers now foresee and hope that Unicode will eventually replace all these older legacy encodings, primarily in the UTF-8 form. UTF-8 will be used in
In UTF-8 mode, terminal emulators such as xterm or the Linux console driver transform every keystroke into the corresponding UTF-8 sequence and send it to the stdin of the foreground process. Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed with a UTF-8 decoder and then displayed using a 16-bit font.
Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux will use on a broad basis to replace ASCII and the other 8-bit character sets is far simpler. Linux terminal emulators and command line tools will in the first step only switch to UTF-8. This means that only a Level 1 implementation of ISO 10646-1 is used (no combining characters), and only scripts that need no further processing support, such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols, are supported. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we now have thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width).
Combining characters might also be supported under Linux eventually (there is even some experimental terminal emulator support available today), but even then the precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.
One influential non-POSIX PC operating system vendor (whom we shall leave unnamed here) suggested that all Unicode files should start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which is in this role also referred to as the "signature" or "byte-order mark (BOM)", in order to identify the encoding and byte-order used in a file. Linux/Unix does not use any BOMs and signatures. They would break far too many existing ASCII-file syntax conventions. On POSIX systems, the selected locale already identifies the encoding expected in all input and output files of a process. It has also been suggested to call UTF-8 files without a signature "UTF-8N" files, but this non-standard term is usually not used in the POSIX world.
Before you start using UTF-8 under Linux, update your installation to use glibc 2.2 and XFree86 4.0.3 or newer. This is the case for example starting with the SuSE 7.1 and Red Hat 7.1 distributions. Earlier Linux distributions lack UTF-8 locale support and ISO10646-1 X11 fonts.
If you are a developer, there are two approaches to add UTF-8 support, which I will call soft and hard conversion. In soft conversion, data is kept in its UTF-8 form everywhere and only very few software changes are necessary. In hard conversion, UTF-8 data that the program reads will be converted into wide-character arrays using standard C library functions and will be handled as such everywhere inside the application. Strings will only be converted back to UTF-8 at output time.
Most applications can do just fine with soft conversion. This is what makes the introduction of UTF-8 on Unix feasible at all. For example, programs such as cat and echo do not have to be modified at all. They can remain completely ignorant as to whether their input and output is ISO 8859-2 or UTF-8, because they handle just byte streams without processing them. They only recognize ASCII characters and control codes such as '\n', which do not change in any way under UTF-8. Therefore, the UTF-8 encoding and decoding is done for these applications completely in the terminal emulator.
A small modification will be necessary for all programs that determine the number of characters in a string by counting the bytes. In UTF-8 mode, they must not count any bytes in the range 0x80 - 0xBF, because these are just continuation bytes and not characters of their own. C's strlen(s) counts the number of bytes, but not necessarily the number of characters in a string correctly. Instead, mbstowcs(NULL,s,0) can be used to count characters if a UTF-8 locale has been selected.
The strlen function does not have to be replaced where the result is used as a byte count, for example to allocate a suitably sized buffer for a string. The second most common use of strlen is to predict how many columns the cursor of the terminal will advance if a string is printed out. With UTF-8, a character count will also not be satisfactory to predict column width, because ideographic characters (Chinese, Japanese, Korean) will occupy two column positions. To determine the width of a string on the terminal screen, it is necessary to decode the UTF-8 sequence and then use the wcwidth function to test the display width of each character.
For instance, the ls program had to be modified, because it has to know the column widths of filenames to format the table layout in which the directories are presented to the user. Similarly, all programs that assume somehow that the output is presented in a fixed-width font and format it accordingly have to learn how to count columns in UTF-8 text. Editor functions such as deleting a single character have to be slightly modified to delete all bytes that might belong to one character. Affected are for instance editors (vi, emacs, readline, etc.) as well as programs that use the ncurses library.
Any Unix-style kernel can do fine with soft conversion and needs only very minor modifications to fully support UTF-8. Most kernel functions that handle strings (e.g. file names, environment variables, etc.) are not affected at all by the encoding. Modifications might be necessary in the following places:
Starting with GNU glibc 2.2, the type wchar_t is officially intended to be used only for 32-bit ISO 10646 values, independent of the currently used locale. This is signalled to applications by the definition of the __STDC_ISO_10646__ macro as required by ISO C99. The ISO C multi-byte conversion functions (mbsrtowcs(), wcsrtombs(), etc.) are fully implemented in glibc 2.2 or higher and can be used to convert between wchar_t and any locale-dependent multibyte encoding, including UTF-8, ISO 8859-1, etc.
For example, you can write
#include <stdio.h>
#include <locale.h>

int main()
{
  if (!setlocale(LC_CTYPE, "")) {
    fprintf(stderr, "Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  printf("%ls\n", L"Schöne Grüße");
  return 0;
}
Call this program with the locale setting LANG=de_DE and the output will be in ISO 8859-1. Call it with LANG=de_DE.UTF-8 and the output will be in UTF-8. The %ls format specifier in printf calls wcsrtombs in order to convert the wide character argument string into the locale-dependent multi-byte encoding.
If your application is soft converted and does not use the standard locale-dependent C multibyte routines (mbsrtowcs(), wcsrtombs(), etc.) to convert everything into wchar_t for processing, then it might have to find out in some way whether it is supposed to assume that the text data it handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1 character) or UTF-8. Hopefully, in a few years everyone will only be using UTF-8 and you can just make it the default, but until then both the classical 8-bit sets and UTF-8 will have to be supported.
The first wave of applications with UTF-8 support used a whole lot of different command line switches to activate their respective UTF-8 modes, for instance the famous xterm -u8. That turned out to be a very bad idea. Having to remember a special command line option or other configuration mechanism for every application is very tedious, which is why command line options are not the proper way of activating a UTF-8 mode.
The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that contains information about culture-specific conventions of software behaviour, including the character encoding, the date/time notation, alphabetic sorting rules, the measurement system and common office paper size, etc. The names of locales usually consist of ISO 639-1 language and ISO 3166-1 country codes, sometimes with additional encoding names or other qualifiers.
You can get a list of all locales installed on your system (usually
in /usr/lib/locale/) with the command locale -a. Set the environment
variable LANG to the name of your preferred locale. You can query the
name of the character encoding in your current locale with the command
locale charmap. This should say UTF-8 if you successfully picked a
UTF-8 locale in the LC_CTYPE category. The command locale -m provides
a list with the names of all installed character encodings.
If you use exclusively C library multibyte functions to do all the
conversion between the external character encoding and the
wchar_t encoding that you use internally, then the C
library will take care of using the right encoding according to
LC_CTYPE for you and your program does not even have to
know explicitly what the current multibyte encoding is.
However, if you prefer not to do everything using the libc
multi-byte functions (e.g., because you think this would require too
many changes in your software or is not efficient enough), then your
application has to find out for itself when to activate the UTF-8
mode. To do this, on any X/Open compliant system, where
<langinfo.h> is available, you can use a line such as

  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
in order to detect whether the current locale uses the UTF-8
encoding. You have of course to add a setlocale(LC_CTYPE,
"") at the beginning of your application to set the locale
according to the environment variables first. The standard function
call nl_langinfo(CODESET) is also what locale
charmap calls to find the name of the encoding specified by the
current locale for you. It is available on pretty much every modern
Unix, except for FreeBSD, which unfortunately still has quite abysmal
locale support. If you need an autoconf test for the availability of
nl_langinfo(CODESET), here is the one Bruno Haible
suggested:
[You could also try to query the locale environment variables
yourself without using setlocale(). In the sequence
LC_ALL, LC_CTYPE, LANG, look
for the first of these environment variables that has a value. Make
the UTF-8 mode the default (still overridable by command line
switches) when this value contains the substring UTF-8,
as this indicates reasonably reliably that the C library has been
asked to use a UTF-8 locale. An example code fragment that does this
is
This relies of course on all UTF-8 locales having the name of the
encoding in their name, which is not always the case, therefore the
nl_langinfo() query is clearly the better method. If you
are concerned about that calling nl_langinfo() might not
be portable enough (e.g., FreeBSD still doesn't have it), then use libcharset,
which is a portable library for determining the current locale's
character encoding. That's also what several of the GNU packages use.]
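Putting both methods together, such a detection routine might look as follows (a sketch; detect_utf8() is an illustrative name, and HAVE_LANGINFO_CODESET is the macro defined by Bruno Haible's autoconf test):

```c
#include <locale.h>
#include <string.h>
#include <stdlib.h>
#ifdef HAVE_LANGINFO_CODESET
#include <langinfo.h>
#endif

/* Return 1 if the current locale uses the UTF-8 encoding, 0
 * otherwise.  Call setlocale(LC_CTYPE, "") first. */
int detect_utf8(void)
{
#ifdef HAVE_LANGINFO_CODESET
    /* The reliable method on X/Open compliant systems. */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
#else
    /* Fallback: inspect the locale environment variables in their
     * order of precedence. */
    char *s;
    if ((s = getenv("LC_ALL")) ||
        (s = getenv("LC_CTYPE")) ||
        (s = getenv("LANG")))
        return strstr(s, "UTF-8") != NULL;
    return 0;
#endif
}
```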
The xterm
version that comes with XFree86
4.0 or higher (maintained by Thomas
Dickey) already includes UTF-8 support. To activate it, start
xterm in a UTF-8 locale and use a font with iso10646-1
encoding, for instance with
and then cat some example file, such as UTF-8-demo.txt
in the newly started xterm and enjoy what you see.
If you are not using XFree86 4.0 or newer, then you can
alternatively download the latest xterm
development version separately and compile it yourself with
"./configure --enable-wide-chars ; make" or alternatively
with "xmkmf; make Makefiles; make; make install; make
install.man".
If you do not have UTF-8 locale support available, use command line
option -u8 when you invoke xterm to switch input and
output to UTF-8.
Xterm in XFree86 4.0.1 only supported Level 1 (no combining
characters) of ISO 10646-1 with a fixed character width and
left-to-right writing direction. In other words, the terminal
semantics were basically the same as for ISO 8859-1, except that it
could now decode UTF-8 and access 16-bit characters.
With XFree86 4.0.3, two important functions were added:
The following fonts coming with XFree86 4.x are suitable for
display of Japanese and Korean Unicode text with terminal emulators
and editors:
Some simple support for nonspacing or enclosing combining
characters (i.e., those with general category code Mn or Me in the Unicode
database) is now also available, which is implemented by just
overstriking (logical OR-ing) a base-character glyph with up to two
combining-character glyphs. This produces acceptable results for
accents below the base line and accents on top of small characters. It
also works well for example for Thai fonts that were specifically
designed for use with overstriking. However, the results might not be
fully satisfactory for combining accents on top of tall characters in
some fonts, especially those of the "fixed" family; therefore,
precomposed characters will continue to be preferable where available.
The following fonts coming with XFree86 4.x are suitable for
display of Latin etc. combining characters (extra head-space), other
fonts will only look nice with combining accents on small x-high
characters:
The following fonts coming with XFree86 4.x are suitable for
display of Thai combining characters:
A note for programmers of text mode applications:
With support for CJK ideographs and combining characters, the
output of xterm behaves a little more like that of a proportional
font, because a Latin/Greek/Cyrillic/etc. character requires one
column position, a CJK ideograph two, and a combining character zero.
The Open Group's Single UNIX
Specification defines the two C functions wcwidth() and wcswidth() that allow an application to
test how many column positions a character will occupy:
Markus Kuhn's free wcwidth()
implementation can be used by applications on platforms where the C
library does not yet provide a suitable function.
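For instance, a screen editor can use wcwidth() to compute how many columns a string needs before drawing it. A sketch (safe_width() is an illustrative helper that skips nonprintable characters instead of failing, as wcswidth() does):

```c
#define _XOPEN_SOURCE 600   /* for wcwidth() */
#include <wchar.h>
#include <locale.h>

/* Compute the column width of a wide string: Latin/Greek/Cyrillic
 * etc. characters count 1, CJK ideographs 2, combining characters 0.
 * Unlike wcswidth(), nonprintable characters are skipped instead of
 * causing a -1 result. */
int safe_width(const wchar_t *s)
{
    int total = 0;
    for (; *s; s++) {
        int w = wcwidth(*s);
        if (w > 0)
            total += w;     /* combining characters add nothing */
    }
    return total;
}
```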
Xterm will for the foreseeable future probably not support the
following functionality, which you might expect from a more
sophisticated full Unicode rendering engine:
Hebrew and Arabic users will therefore have to use application
programs that reverse and left-pad Hebrew and Arabic strings before
sending them to the terminal. In other words, the bidirectional
processing has to be done by the application and not by xterm. The
situation for Hebrew and Arabic at least improves over ISO 8859
through the availability of precomposed glyphs and presentation forms.
It is far from clear at the moment, whether bidirectional support
should really go into xterm and how precisely this should work. Both
ISO 6429 =
ECMA-48 and the Unicode bidi
algorithm provide alternative starting points. See also ECMA Technical
Report TR/53.
If you plan to support bidirectional text output in your
application, have a look at either Dov Grobgeld's FriBidi or Mark Leisher's Pretty Good Bidi
Algorithm, two free implementations of the Unicode bidi algorithm.
Xterm currently does not support the Arabic, Syriac, Hangul Jamo,
or Indic text formatting algorithms, although Robert Brady has
published some experimental patches
towards bidi support. It is still unclear whether it is feasible or
preferable to do this in a VT100 emulator at all. Applications can
apply the Arabic and Hangul formatting algorithms themselves easily,
because xterm allows them to output all the necessary presentation
forms. For Indic scripts, the X font mechanism does not at the moment
even support the encoding of the necessary ligature variants, so there
is little xterm could offer anyway. Applications requiring Indic or
Syriac output are better off using a proper Unicode X11 rendering
library such as Pango instead of a VT100 emulator like xterm.
Unicode X11 font names end with -ISO10646-1. This is
now the officially registered value for the X Logical Font Descriptor (XLFD) fields
CHARSET_REGISTRY and CHARSET_ENCODING for
all Unicode and ISO 10646-1 16-bit fonts. The
*-ISO10646-1 fonts contain some unspecified subset of the
entire Unicode character set, and users have to make sure that
whatever font they select covers the subset of characters needed by
them.
The *-ISO10646-1 fonts usually also specify a
DEFAULT_CHAR value that points to a special non-Unicode
glyph for representing any character that is not available in the font
(usually a dashed box, the size of an H, located at 0x00). This
ensures that users at least see clearly that there is an unsupported
character. The smaller fixed-width fonts such as 6x13 etc. for xterm
will never be able to cover all of Unicode, because many scripts such
as Kanji can only be represented in considerably larger pixel sizes
than those widely used by European users. Typical Unicode fonts for
European usage will contain only subsets of between 1000 and 3000
characters, such as the CEN MES-3
repertoire.
You might notice that in the *-ISO10646-1 fonts the shapes of the ASCII quotation marks have
slightly changed to bring them in line with the standards and practice
on other platforms.
VT100 terminal emulators accept ISO
2022 (=ECMA-35) ESC
sequences in order to switch between different character sets.
UTF-8 is in the sense of ISO 2022 an "other coding system" (see
section 15.4 of ECMA 35). UTF-8 is outside the ISO 2022
SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8,
all SS2/SS3/G0/G1/G2/G3 state becomes meaningless until you leave
UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding, i.e.
a self-terminating short byte sequence determines completely which
character is meant, independent of any switching state. G0 and G1 in
ISO 10646-1 are those of ISO 8859-1, and G2/G3 do not exist in ISO
10646, because every character has a fixed position and no switching
takes place. With UTF-8, it is not possible that your terminal remains
switched to strange graphics-character mode after you accidentally
dumped a binary file to it. This makes a terminal in UTF-8 mode much
more robust than with ISO 2022 and it is therefore useful to have a
way of locking a terminal into UTF-8 mode such that it can't
accidentally go back to the ISO 2022 world.
The ISO 2022 standard specifies a range of ESC % sequences for
leaving the ISO 2022 world (designation of other coding system, DOCS),
and a number of such sequences have been registered for UTF-8
in section 2.8 of the ISO 2375 International
Register of Coded Character Sets:
While a terminal emulator is in UTF-8 mode, any ISO 2022 escape
sequences such as for switching G2/G3 etc. are ignored. The only ISO
2022 sequence on which a terminal emulator might act in UTF-8 mode is
ESC %@ for returning from UTF-8 back to the ISO 2022
scheme.
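An application that wants to switch a terminal between the two modes can simply emit the registered sequences: ESC %G to enter UTF-8 and ESC %@ to return to the ISO 2022 scheme. A sketch (terminal_utf8() is an illustrative helper):

```c
#include <stdio.h>

/* Switch the terminal into UTF-8 mode (ESC % G) or back to the
 * ISO 2022 scheme (ESC % @), using the DOCS sequences registered
 * for UTF-8 in the ISO 2375 register. */
void terminal_utf8(FILE *tty, int on)
{
    fputs(on ? "\033%G" : "\033%@", tty);
    fflush(tty);
}
```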
UTF-8 still allows you to use C1 control characters such as CSI,
even though UTF-8 also uses bytes in the range 0x80-0x9F. It is
important to understand that a terminal emulator in UTF-8 mode must
apply the UTF-8 decoder to the incoming byte stream
before interpreting any control characters. C1
characters are UTF-8 decoded just like any other character above
U+007F.
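The order of operations can be sketched as: decode first, then classify. The fragment below handles only the one- and two-byte UTF-8 forms, which is all that is needed to illustrate C1 handling; a real terminal emulator needs a complete decoder with error recovery:

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, storing the code point
 * in *ucs and returning the number of bytes consumed, or 0 for
 * forms not handled in this sketch. */
size_t utf8_decode(const unsigned char *s, unsigned int *ucs)
{
    if (s[0] < 0x80) {                              /* ASCII */
        *ucs = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        *ucs = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f); /* 2-byte form */
        return 2;
    }
    return 0;
}

/* C1 control characters arrive on the wire as two bytes
 * (0xC2 0x80-0x9F) and must be recognized only after decoding. */
int is_c1_control(unsigned int ucs)
{
    return ucs >= 0x80 && ucs <= 0x9f;
}
```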
Many text-mode applications available today expect to speak to the
terminal using a legacy encoding or to use ISO 2022 sequences for
switching terminal fonts. In order to use such applications within a
UTF-8 terminal emulator, it is possible to use a conversion layer that
will translate between ISO 2022 and UTF-8 on the fly. One such utility
is Juliusz Chroboczek's luit. If all
you need is ISO 8859 support in a UTF-8 terminal, you can also use
Michael Schroeder's screen
(version 3.9.9 or newer). As implementing ISO 2022 is a complex and
error-prone task, it is better not to implement ISO 2022 yourself;
implement only UTF-8 and point users who need ISO 2022 at luit (or
screen).
Starting with Solaris 2.8, UTF-8 is at least partially supported.
To use it, just set one of the UTF-8 locales, for instance by typing
Now the dtterm terminal emulator can be used to input
and output UTF-8 text and the mp print filter will print
UTF-8 files on PostScript printers. The en_US.UTF-8
locale is at the moment supported by Motif and CDE desktop
applications and libraries, but not by OpenWindows, XView, and
OPENLOOK DeskSet applications and libraries.
For more information, read Sun's Overview of en_US.UTF-8 Locale Support web page.
See Adobe's Unicode
and Glyph Names guide.
With over 40,000 characters, a complete Unicode
implementation is an enormous project. However, it is often sufficient
(especially for the European market) to implement only a few hundred
or thousand characters as before and still enjoy the simplicity of
reaching all required characters in just one single simple encoding
via Unicode. A number of different UCS subsets have already been
established:
Markus Kuhn's uniset Perl script
allows convenient set arithmetic over UCS subsets for anyone who wants
to define a new one or wants to check coverage of an implementation.
The Unicode Consortium maintains a collection of mapping
tables between Unicode and various older encoding standards. It is
important to understand that these tables alone are only suitable for
converting text from the older encodings to Unicode. Conversion in the
opposite direction from Unicode to a legacy character set requires
non-injective (= many-to-one) extensions of these mapping tables.
Several Unicode characters have to be mapped to a single code point in
a legacy encoding. This is necessary, because some legacy encodings
distinguished characters that others unified. The Unicode Consortium
does not currently maintain standard many-to-one tables for this
purpose, but such tables can easily be generated from available
normalization information.
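Such a converter can be sketched as an ordinary lookup with a few extra many-to-one entries on top of the inverted standard table (ucs_to_latin1() is an illustrative helper covering only a couple of example mappings):

```c
/* Map a UCS code point to ISO 8859-1, applying extra many-to-one
 * entries that the inverted standard (injective) mapping table does
 * not contain.  Returns -1 for unmappable characters. */
int ucs_to_latin1(unsigned int ucs)
{
    switch (ucs) {
    case 0x03bc:                /* GREEK SMALL LETTER MU */
        return 0xb5;            /* -> same byte as MICRO SIGN */
    case 0x212b:                /* ANGSTROM SIGN */
        return 0xc5;            /* -> A WITH RING ABOVE */
    default:
        return ucs < 0x100 ? (int) ucs : -1;
    }
}
```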
Here are some examples for the many-to-one mappings that have to be
handled when converting from Unicode into something else:
The Unicode
database contains in field 5 the Character Decomposition
Mapping that can be used to generate the above example mappings
automatically. As a rule, the output of a Unicode-to-Something
converter should not depend on whether the Unicode input has first
been converted into Normalization Form
C or not. For equivalence information on Chinese, Japanese, and
Korean Han/Kanji/Hanja characters, use the Unihan
database (20 MB).
The Unicode mapping tables sometimes also have to be slightly
modified to preserve information in combination encodings. For
example, the standard mappings provide round-trip compatibility for
conversion chains ASCII to Unicode to ASCII as well as for JIS X 0208
to Unicode to JIS X 0208. However, the EUC-JP encoding covers the
union of ASCII and JIS X 0208, and the UCS repertoire covered by the
ASCII and JIS X 0208 mapping tables overlaps for one character, namely
U+005C REVERSE SOLIDUS. EUC-JP converters therefore have to use a
slightly modified JIS X 0208 mapping table, such that the JIS X 0208
code 0x2140 (0xA1 0xC0 in EUC-JP) gets mapped to U+FF3C FULLWIDTH
REVERSE SOLIDUS. This way, round-trip compatibility from EUC-JP to
Unicode to EUC-JP can be guaranteed without any loss of information.
Unicode
Standard Annex #11: East Asian Width provides further guidance on
this issue.
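In code, this modification amounts to a single special case in front of the standard table lookup. A sketch (jis0208_std_to_ucs() is a hypothetical stand-in for the full standard JIS X 0208 mapping table):

```c
/* Hypothetical stand-in for the standard JIS X 0208 -> UCS mapping
 * table; only the one relevant entry is shown here, 0 means "not in
 * this sketch". */
static unsigned int jis0208_std_to_ucs(unsigned int jis)
{
    return jis == 0x2140 ? 0x005c : 0;
}

/* JIS X 0208 -> UCS lookup for an EUC-JP converter.  The single
 * modified entry preserves round-trip compatibility, because the
 * ASCII part of EUC-JP already claims U+005C. */
unsigned int jis0208_to_ucs(unsigned int jis)
{
    if (jis == 0x2140)
        return 0xff3c;  /* FULLWIDTH REVERSE SOLIDUS, not U+005C */
    return jis0208_std_to_ucs(jis);
}
```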
In addition to just using standard normalization mappings,
developers of code converters can also offer transliteration support.
Transliteration is the conversion of a Unicode character into a
graphically and/or semantically similar character in the target code,
even if the two are distinct characters in Unicode after
normalization. Examples of transliteration:
The Unicode Consortium does not provide or maintain any standard
transliteration tables. Which transliterations are appropriate or not
can in some cases depend on language, application field, and even
personal preference. Available Unicode transliteration tables include
for example those found in Bruno Haible's libiconv,
the glibc 2.2 locales,
and Markus Kuhn's transtab
package.
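A transliteration step can be sketched as one more fallback lookup; the fragment below folds the typographic double quotation marks onto 0x22, as real tables such as transtab do on a much larger scale (translit_ascii() is an illustrative helper):

```c
/* Transliterate a few typographic characters to plain ASCII.
 * Returns the substitute character, or -1 if no transliteration
 * is known in this sketch. */
int translit_ascii(unsigned int ucs)
{
    switch (ucs) {
    case 0x201c: case 0x201d:   /* LEFT/RIGHT DOUBLE QUOTATION MARK */
    case 0x201e: case 0x201f:   /* DOUBLE LOW-9 / HIGH-REVERSED-9 */
        return 0x22;            /* " QUOTATION MARK */
    default:
        return ucs < 0x80 ? (int) ucs : -1;
    }
}
```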
The X11 R6.6 release (2001)
is the latest version of the X Consortium's sample implementation of
the X11 Window System standards. The bulk of the current X11
standards and the sample implementation pre-date widespread
interest in Unicode under Unix. There are a number of problems and
inconveniences for Unicode users in both that really should be fixed
in the next X11 release:
UTF-8 cut and paste: The ICCCM
standard does not specify how to transfer UCS strings in selections.
Some vendors have added UTF-8 as yet another encoding to the existing
COMPOUND_TEXT mechanism (CTEXT). This is not a good solution for
at least the following reasons: Juliusz Chroboczek
has written an Inter-Client Exchange of Unicode Text draft proposal for an
extension of the ICCCM to handle UTF-8 selections with a new
UTF8_STRING atom that can be used as a property type and selection
target. This clean approach fixes all of the above problems.
UTF8_STRING is just as state-less and easy to use as the existing
STRING atom (which is reserved exclusively for ISO 8859-1 strings and
therefore not usable for UTF-8), and adding a new selection target
allows applications to offer selections in both the old CTEXT and the
new UTF8_STRING format simultaneously, which maximizes
interoperability. The use of UTF8_STRING can be negotiated between the
selection holder and requestor, leading to no compatibility issues
whatsoever. Markus Kuhn has prepared an ICCCM
patch that adds the necessary definition to the standard. Current
status: The UTF8_STRING atom has now been officially registered with X.Org,
and an update of the ICCCM is expected for the next release.
A few workarounds have been used so far: These workarounds do not solve the underlying problem that
XFontStruct is unsuitable for sparsely populated fonts, but they do
provide a significant efficiency improvement without requiring any
changes in the API or client source code. One real solution would be
to extend or substitute XFontStruct with something slightly more
flexible that contains a sorted list or hash table of characters as
opposed to an array. This redesign of XFontStruct would at the same
time allow adding the urgently needed provisions for combining
characters and ligatures.
Several XFree86 team members are trying to work on these issues
with X.Org, which is the official
successor of the X Consortium and the Opengroup as the custodian of
the X11 standards and the sample implementation. But things are moving
rather slowly. Support for UTF8_STRING, UCS keysyms, and ISO10646-1
extensions of the core fonts will hopefully make it into R6.6.1 in
2001-Q4. With regard to the other font related problems, the solution
will probably be to dump the old server-side font mechanisms entirely
and use instead Keith
Packard's new X
Render Extension.
You should certainly be on the linux-utf8@nl.linux.org
mailing list. That's the place to meet for everyone interested in
working towards better UTF-8 support for GNU/Linux or Unix systems and
applications. To subscribe, send to majordomo@nl.linux.org a
message with the line "subscribe linux-utf8" in the body. You can also
browse the linux-utf8
archive.
There is also the unicode@unicode.org mailing list, which is the best
way of finding out what the authors of the Unicode standard and a lot
of other gurus have to say. To subscribe, send to unicode-request@unicode.org
a message with the subject line "subscribe" and the text "subscribe
YOUR@EMAIL.ADDRESS unicode".
The relevant mailing lists for discussions about Unicode support in
Xlib and the X server are the fonts@xfree86.org
and i18n@xfree86.org
mailing lists.
I add new material to this document very frequently, so please
check it regularly or ask Netminder to
notify you of any changes. Suggestions for
improvement, as well as advertisement in the freeware community for
better UTF-8 support, are very welcome. UTF-8 use under Linux is quite
new, so expect a lot of progress in the next few months here.
Special thanks to Ulrich Drepper, Bruno Haible, Robert Brady,
Shuhei Amakawa and many others for valuable comments, and to SuSE
GmbH, Nürnberg, for their support.
Markus Kuhn
<Markus.Kuhn@cl.cam.ac.uk>
utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
======================== m4/codeset.m4 ================================
#serial AM1
dnl From Bruno Haible.
AC_DEFUN([AM_LANGINFO_CODESET],
[
  AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
    [AC_TRY_LINK([#include <langinfo.h>],
      [char* cs = nl_langinfo(CODESET);],
      am_cv_langinfo_codeset=yes,
      am_cv_langinfo_codeset=no)
    ])
  if test $am_cv_langinfo_codeset = yes; then
    AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
      [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
  fi
])
=======================================================================
char *s;
int utf8_mode = 0;

if ((s = getenv("LC_ALL")) ||
    (s = getenv("LC_CTYPE")) ||
    (s = getenv("LANG"))) {
  if (strstr(s, "UTF-8"))
    utf8_mode = 1;
}
How do I get a UTF-8 version of xterm?
LANG=en_GB.UTF-8 xterm \
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
How much of Unicode does xterm support?
If the selected normal font is X×Y pixels
large, then xterm will now attempt to load in addition a
2X×Y pixels large font (same XLFD, except
for a doubled value of the AVERAGE_WIDTH property). It
will use this font to represent all Unicode characters that have been
assigned the East Asian Wide (W) or East Asian FullWidth
(F) property in Unicode Technical
Report #11.
6x13 -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
6x13B -Misc-Fixed-Bold-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
6x13O -Misc-Fixed-Medium-O-SemiCondensed--13-120-75-75-C-60-ISO10646-1
12x13ja -Misc-Fixed-Medium-R-Normal-ja-13-120-75-75-C-120-ISO10646-1
9x18 -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
9x18B -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
18x18ja -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
18x18ko -Misc-Fixed-Medium-R-Normal-ko-18-120-100-100-C-180-ISO10646-1
6x12 -Misc-Fixed-Medium-R-SemiCondensed--12-110-75-75-C-60-ISO10646-1
9x18 -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
9x18B -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
6x13 -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
9x15 -Misc-Fixed-Medium-R-Normal--15-140-75-75-C-90-ISO10646-1
9x15B -Misc-Fixed-Bold-R-Normal--15-140-75-75-C-90-ISO10646-1
10x20 -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1
9x18 -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
#include <wchar.h>
int wcwidth(wchar_t wc);
int wcswidth(const wchar_t *pwcs, size_t n);
Where do I find ISO 10646-1 X11 fonts?
Quite a number of Unicode fonts have become available for X11 over
the past few months, and the list is growing quickly:
What are the issues related to UTF-8 terminal emulators?
What UTF-8 enabled applications are already available?
What patches to improve UTF-8 support are available?
Are there free libraries for dealing with Unicode available?
What is the status of Unicode support for various X widget libraries?
What packages with UTF-8 support are currently under development?
How does UTF-8 support work under Solaris?
setenv LANG en_US.UTF-8
in a C shell.
How are Postscript glyph names related to UCS codes?
Are there any well-defined UCS subsets?
that can be used to define and document implemented
subsets. Unicode defines similar, but not quite identical, blocks of
characters, which correspond to sections in the Unicode standard.
What issues are there to consider when converting encodings?
UCS characters                                  equivalent character in target code

U+00B5 MICRO SIGN,
U+03BC GREEK SMALL LETTER MU                    0xB5 ISO 8859-1

U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE,
U+212B ANGSTROM SIGN                            0xC5 ISO 8859-1

U+03A9 GREEK CAPITAL LETTER OMEGA,
U+2126 OHM SIGN                                 0xEA CP437

U+005C REVERSE SOLIDUS,
U+FF3C FULLWIDTH REVERSE SOLIDUS                0x2140 JIS X 0208
UCS characters                                  equivalent character in target code

U+0022 QUOTATION MARK,
U+201C LEFT DOUBLE QUOTATION MARK,
U+201D RIGHT DOUBLE QUOTATION MARK,
U+201E DOUBLE LOW-9 QUOTATION MARK,
U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK    0x22 ISO 8859-1
Is X11 ready for Unicode?
Are there any good mailing lists on these issues?
Further References
created 1999-06-04 -- last
modified 2001-08-28 --
http://www.cl.cam.ac.uk/~mgk25/unicode.html