• XML (like Java) uses Unicode to encode characters.
• Unicode comes in several encodings (UTF-8, UTF-16, UTF-32). The most common one used in the West is UTF-8.
• UTF-8 is a variable-length code. Each character is encoded in 1, 2, 3, or 4 bytes.
• The first 128 characters in Unicode are ASCII.
• The Unicode code points between 128 and 255 cover some of the more common characters used in western Europe, such as ã, á, å, or ç. In UTF-8, each of these takes two bytes.
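• Two-byte codes cover code points 128 through 2047 (U+0080 to U+07FF), which include accented Latin letters as well as Greek, Cyrillic, Hebrew, and Arabic.
• Three-byte codes cover the rest of the Basic Multilingual Plane (up to U+FFFF), including most Asian ideographs.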
• Four-byte codes handle everything beyond U+FFFF, including the rarer ideographs that are left.
• Those using non-western languages should investigate other Unicode encodings, such as UTF-16.
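The variable-length behavior of UTF-8 can be checked directly in Java. The following is a minimal sketch, not part of the original slides: the class name `Utf8Lengths`, the helper `utf8Length`, and the sample code points are chosen here purely for illustration. It encodes one code point at a time with the standard `String.getBytes(StandardCharsets.UTF_8)` call and reports how many bytes result.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): measures UTF-8 byte lengths.
class Utf8Lengths {

    // Returns the number of bytes UTF-8 needs for a single Unicode code point.
    public static int utf8Length(int codePoint) {
        // Character.toChars handles code points above U+FFFF (surrogate pairs).
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Sample code points, assumed here for illustration:
        // U+0041 'A' (ASCII), U+00E7 c-cedilla (Latin-1 range),
        // U+4E2D (a common CJK ideograph), U+1F600 (beyond U+FFFF).
        int[] samples = {0x41, 0xE7, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Running `main` should show that the ASCII letter takes one byte, the accented Latin letter two, the ideograph three, and the code point beyond U+FFFF four.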