Introduction to XML

Encoding

•

XML (like Java) uses Unicode to encode characters.

•

Unicode can be stored in several encodings, such as UTF-8, UTF-16, and UTF-32. The most common one used in the West is UTF-8.

•

UTF-8 is a variable-length encoding. Each character is encoded in 1, 2, 3, or 4 bytes.

•

The first 128 characters in Unicode are ASCII.

•

In Unicode, the code points between 128 and 255 cover some of the more common characters used in western Europe, such as ã, á, å, or ç. UTF-8 encodes each of them in two bytes.

•

Two-byte codes cover code points 128 through 2047, which include Greek, Cyrillic, Hebrew, and Arabic letters. Three-byte codes cover most of the rest, including the common Asian ideographs.

•

Four-byte codes handle the characters outside the first 65,536 code points, such as the rarer ideographs.

•

Those working mainly with non-Western languages may want to investigate other Unicode encodings, such as UTF-16, which can be more compact for those scripts.
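The byte counts discussed above can be checked with a short script. This is only a minimal sketch in Python (the slides themselves show no code); the helper names utf8_length and utf16_length and the choice of sample characters are assumptions made for illustration.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
# The selection is illustrative only.
SAMPLES = [
    ("A", "ASCII letter"),
    ("\u00e7", "c with cedilla, in the 128-255 range"),
    ("\u03a9", "Greek capital omega"),
    ("\u4e2d", "common CJK ideograph"),
    ("\U00020000", "rarer CJK ideograph beyond U+FFFF"),
]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


def utf16_length(ch: str) -> int:
    """Return the number of bytes UTF-16 uses for one character.

    The big-endian codec is used so that no byte-order mark is added.
    """
    return len(ch.encode("utf-16-be"))


if __name__ == "__main__":
    for ch, description in SAMPLES:
        print(
            f"U+{ord(ch):04X} ({description}): "
            f"UTF-8 = {utf8_length(ch)} bytes, "
            f"UTF-16 = {utf16_length(ch)} bytes"
        )
```

Running it prints one line per sample. The common ideograph needs three bytes in UTF-8 but two in UTF-16, which is one reason an encoding other than UTF-8 may suit text that is mostly in such scripts.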