Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).
Updated October, 2009 (changes since 2007 are only stylistic).
Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.
The name of this specification is "UCS-G".
UCS-G enables extension of the Universal Character Set to support more than 2 × 10^9 (two billion, or two thousand million) characters, with code points up to U+7FFFFFFF. It is a member of the UCS-X family of specifications.
UCS-G provides three encoding forms, UTF-G-8, UTF-G-16, and UTF-G-32, which are compatible extensions of UTF-8, UTF-16, and UTF-32, respectively. They are compatible extensions of their subsets in much the same way that UTF-8 is a compatible extension of ASCII. In particular, UTF-G-16 is a non-trivial extension of UTF-16 enabling full interoperability of UTF-G-8, UTF-G-16, and UTF-G-32. (As a historical note, UTF-G-8 and UTF-G-32 are relatively trivial extensions since they are equivalent to the original UTF-8 and UCS-4, respectively, before the U+10FFFF limit was imposed for compatibility with UTF-16.)
UCS-G can stand on its own as a specification. It does not depend on the supersets UCS-E or UCS-∞ (the other two members of the UCS-X family of specifications). Nevertheless it anticipates the possible need for extensions beyond U+7FFFFFFF (to be defined in their own specifications).
UCS-G preserves and extends useful properties of UTF-8, UTF-16, and UTF-32. For code points less than U+110000, UCS-G encodings are identical to the original encodings. The extended encodings employ sequences of one or more code units. Code units are eight, sixteen, or thirty-two bits for UTF-G-8, UTF-G-16, and UTF-G-32, respectively. Leading code units are distinguished from trailing code units by simple rules for each encoding.
One valuable consequence of the distinction between leading and trailing units is that there is no risk of a "false match" when searching. (Compare, for example, the variable-length encoding GB18030, which encodes the sequence of two characters 问题 as four bytes CE CA CC E2, in which a simple search algorithm could mistakenly find a match for the character 侍 whose code is CA CC.)
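The false match described above can be reproduced with Python's standard `gb18030` codec. This is only a minimal sketch: the byte-level `in` test stands in for the "simple search algorithm", and UTF-8 is used for the contrast because UTF-G-8 is identical to UTF-8 for these code points.

```python
# Naive byte-level substring search, as a simple search routine might perform.
text = "问题"
needle = "侍"

gb_text = text.encode("gb18030")      # CE CA CC E2
gb_needle = needle.encode("gb18030")  # CA CC

# In GB18030 the needle's bytes straddle the boundary between the two
# characters of the text, so the byte search reports a match that is not there.
gb_false_match = gb_needle in gb_text

# In UTF-8 a trailing byte can never be mistaken for a leading byte,
# so the same search correctly finds nothing.
utf8_false_match = needle.encode("utf-8") in text.encode("utf-8")

print(gb_text.hex(" ").upper(), "|", gb_needle.hex(" ").upper())
print("GB18030 false match:", gb_false_match)
print("UTF-8 false match:  ", utf8_false_match)
```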
A useful property of UTF-8 and UTF-32 is that a simple binary comparison of strings (with the C function strcmp(), for example) yields the same sort-order as a numerical comparison of code points. This property is preserved and extended in UTF-G-8 and UTF-G-32. (Since UTF-16 lacks this property, UTF-G-16 necessarily also lacks it.)
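The sort-order property can be illustrated with one pair of code points that lies on either side of the surrogate range. The sketch below uses Python's standard codecs for the original encodings (which UCS-G leaves unchanged below U+110000) and assumes big-endian serialization, so that a byte-wise comparison of UTF-16 and UTF-32 data corresponds to a comparison of code units.

```python
a = "\uFFFD"      # U+FFFD, a single UTF-16 code unit
b = "\U00010000"  # U+10000, a UTF-16 surrogate pair (D800 DC00)

# Numerical order of the code points: U+FFFD < U+10000.
code_point_order = ord(a) < ord(b)

# Python compares bytes objects lexicographically by unsigned byte value.
utf8_order = a.encode("utf-8") < b.encode("utf-8")            # EF BF BD vs F0 90 80 80
utf32_order = a.encode("utf-32-be") < b.encode("utf-32-be")   # 0000FFFD vs 00010000
utf16_order = a.encode("utf-16-be") < b.encode("utf-16-be")   # FFFD vs D800 DC00

print("code points:", code_point_order)
print("UTF-8:      ", utf8_order)
print("UTF-32:     ", utf32_order)
print("UTF-16:     ", utf16_order)  # differs from code point order
```

UTF-16 sorts U+10000 before U+FFFD here because the leading surrogate D800 is numerically smaller than FFFD.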
A useful property of UTF-8 and UTF-16 is that the leading code unit indicates the length of a code. This property is preserved and extended in UTF-G-8 and UTF-G-16. (In both UTF-32 and UTF-G-32, a code always consists of a single unit.)
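As an illustration for the 8-bit form, the length of a code can be read from its leading byte. The sketch below assumes the leading-byte patterns of the original (pre-U+10FFFF) UTF-8, which this document says UTF-G-8 is equivalent to; the helper name `utf_g_8_length` is hypothetical, and the UTF-G-8 specification remains the authority on which leading bytes are actually valid.

```python
def utf_g_8_length(lead: int) -> int:
    """Return the number of bytes in a code, given its leading byte.

    Hypothetical helper based on the original UTF-8 bit patterns.
    """
    if 0x00 <= lead <= 0x7F:   # 0xxxxxxx
        return 1
    if 0xC0 <= lead <= 0xDF:   # 110xxxxx
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= lead <= 0xF7:   # 11110xxx
        return 4
    if 0xF8 <= lead <= 0xFB:   # 111110xx
        return 5
    if 0xFC <= lead <= 0xFD:   # 1111110x
        return 6
    # 0x80..0xBF are trailing bytes; 0xFE and 0xFF are not used.
    raise ValueError(f"not a leading byte: 0x{lead:02X}")


for lead in (0x41, 0xE5, 0xF4, 0xFD):
    print(f"0x{lead:02X} -> {utf_g_8_length(lead)} byte(s)")
```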
The set of UCS-G code points is the range U+0000..U+7FFFFFFF. The number of code points is 2^31 = 2,147,483,648. As in Unicode, U+D800..U+DFFF (2^11 = 2,048 "surrogate code points") are excluded from the set of USV (UCS scalar values). Hence, the number of UCS-G scalar values is 2^31 - 2^11 = 2,147,481,600.
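The counts above can be checked with a few lines of arithmetic. The constant names and the `is_usv` helper are hypothetical, introduced only for this sketch.

```python
MAX_CODE_POINT = 0x7FFFFFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

CODE_POINTS = MAX_CODE_POINT + 1                      # 2^31
SURROGATES = SURROGATE_LAST - SURROGATE_FIRST + 1     # 2^11
SCALAR_VALUES = CODE_POINTS - SURROGATES


def is_usv(cp: int) -> bool:
    """True if cp is a UCS-G scalar value: in range and not a surrogate."""
    in_range = 0 <= cp <= MAX_CODE_POINT
    is_surrogate = SURROGATE_FIRST <= cp <= SURROGATE_LAST
    return in_range and not is_surrogate


print(f"{CODE_POINTS:,} code points, {SCALAR_VALUES:,} scalar values")
```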
UCS-G specifies three encoding forms (using 8-bit, 16-bit, and 32-bit units) that each associate a unique code with each USV. Detailed specifications are provided for the three encoding forms; please see the individual specifications for UTF-G-8, UTF-G-16, and UTF-G-32.
USV | Description | UTF-G-8 | UTF-G-16 | UTF-G-32 |
---|---|---|---|---|
U+0041 | the letter 'A' | 41 | 0041 | 00000041 |
U+5B57 | the Chinese/Japanese/Korean character '字' | E5 AD 97 | 5B57 | 00005B57 |
U+10FFFF | the last USV in the Unicode Standard | F4 8F BF BF | DBFF DFFF | 0010FFFF |
U+110000 | the first USV beyond the U+10FFFF limit | F4 90 80 80 | DC04 DE80 DE00 | 00110000 |
U+7FFFFFFF | the last USV supported by UCS-G | FD BF BF BF BF BF | DD0F DFFF DFFF DFFF | 7FFFFFFF |
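The UTF-G-8 and UTF-G-32 columns of these examples can be reproduced with a short sketch. It assumes the bit layout of the original six-byte UTF-8, which this document describes UTF-G-8 as equivalent to; the function name `encode_utf_g_8` is hypothetical, and the sketch is not a substitute for the UTF-G-8 specification. UTF-G-16 is left out because its rules are defined only in its own specification.

```python
def encode_utf_g_8(cp: int) -> bytes:
    """Encode one scalar value using the original UTF-8 bit layout (sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a UCS-G scalar value: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper bound, total bytes, leading-byte marker bits)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, marker in layouts:
        if cp < limit:
            break
    # Peel off six payload bits per trailing byte, lowest bits first.
    trailing = []
    for _ in range(length - 1):
        trailing.append(0x80 | (cp & 0x3F))
        cp >>= 6
    return bytes([marker | cp] + trailing[::-1])


for usv in (0x41, 0x5B57, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    utf_g_8 = encode_utf_g_8(usv).hex(" ").upper()
    utf_g_32 = f"{usv:08X}"  # a single 32-bit unit holding the value itself
    print(f"U+{usv:04X}: UTF-G-8 = {utf_g_8}, UTF-G-32 = {utf_g_32}")
```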
More examples are shown in the individual specifications for UTF-G-8, UTF-G-16, and UTF-G-32.
The following table compares code length (in bytes) for UTF-G-8, UTF-G-16, and UTF-G-32:
USV | UTF-G-8 | UTF-G-16 | UTF-G-32 |
---|---|---|---|
U+0000..U+007F | 1 | 2 | 4 |
U+0080..U+07FF | 2 | 2 | 4 |
U+0800..U+FFFF | 3 | 2 | 4 |
U+10000..U+10FFFF | 4 | 4 | 4 |
U+110000..U+1FFFFF | 4 | 6 | 4 |
U+200000..U+3FFFFFF | 5 | 6 | 4 |
U+4000000..U+7FFFFFFF | 6 | 8 | 4 |
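For quick size estimates, the table can be turned into a lookup. The sketch below simply transcribes the rows above; the names `RANGES` and `code_lengths` are hypothetical.

```python
import bisect

# (last USV in range, UTF-G-8 bytes, UTF-G-16 bytes, UTF-G-32 bytes),
# transcribed row by row from the table.
RANGES = [
    (0x7F, 1, 2, 4),
    (0x7FF, 2, 2, 4),
    (0xFFFF, 3, 2, 4),
    (0x10FFFF, 4, 4, 4),
    (0x1FFFFF, 4, 6, 4),
    (0x3FFFFFF, 5, 6, 4),
    (0x7FFFFFFF, 6, 8, 4),
]
UPPER_BOUNDS = [row[0] for row in RANGES]


def code_lengths(usv: int) -> tuple:
    """Return (UTF-G-8, UTF-G-16, UTF-G-32) code lengths in bytes for a USV."""
    if not 0 <= usv <= 0x7FFFFFFF or 0xD800 <= usv <= 0xDFFF:
        raise ValueError(f"not a UCS-G scalar value: {usv:#x}")
    # Index of the first range whose upper bound is >= usv.
    row = RANGES[bisect.bisect_left(UPPER_BOUNDS, usv)]
    return row[1:]


for usv in (0x41, 0x5B57, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    print(f"U+{usv:04X}: {code_lengths(usv)}")
```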
For implementation, applications, references, etc., please see the UCS-X page.