Authors: Tom Bishop (email@example.com) and Richard Cook (firstname.lastname@example.org).
Updated December 1, 2007
Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.
The name of this UTF (UCS Transformation Format) is "UTF-G-8". UTF-G-8 extends (or restores) UTF-8 to support over two billion characters, with code points up to U+7FFFFFFF.
UTF-G-8 is one of the encodings defined as part of UCS-G, which also includes similar extensions for UTF-16 and UTF-32. For general information about UCS-G, please see the UCS-G Specification.
UTF-G-8 is identical to the original UTF-8 encoding, invented by Ken Thompson. We do not claim any credit for it, except for giving it the new name "UTF-G-8". The reason for giving it a new name is that the original name is now (sometimes) used for the subset limited to maximum code point U+10FFFF.
UTF-G-8 preserves and extends useful properties of UTF-8. Because it is identical to the original one-to-six-byte UTF-8 encoding, UTF-G-8 is identical to UTF-8 for code points up to U+10FFFF, and identical to ASCII for ASCII text.
The rule for distinguishing leading (initial) and trailing (non-initial) bytes still applies. All trailing bytes have the bit pattern 10xxxxxx (binary) and are in the range 80..BF. Leading bytes of multi-byte codes conform to the bit pattern 11xxxxxx (binary) and are in the range C2..FD; single-byte (ASCII) codes have the bit pattern 0xxxxxxx (binary) and are in the range 00..7F. (Hexadecimal notation is used here and below except where binary or decimal are explicitly specified.)
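One valuable consequence of the distinction between leading and trailing bytes is that there is no risk of a "false match" when searching: the encoded form of a character always begins with a leading (or single) byte, so a byte-wise search can only match at the start of a code, never in the middle of another character's code.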
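As an illustration of the leading/trailing distinction, here is a minimal Python sketch. The function names byte_kind and code_start are hypothetical (they do not come from this specification), and treating C0, C1, FE and FF as invalid is an assumption derived from the C2..FD range stated above.

```python
def byte_kind(b: int) -> str:
    """Classify one byte value (0..255) by its high bits."""
    if b <= 0x7F:
        return "single"    # 0xxxxxxx: a complete one-byte (ASCII) code
    if b <= 0xBF:
        return "trailing"  # 10xxxxxx: non-initial byte of a multi-byte code
    if 0xC2 <= b <= 0xFD:
        return "leading"   # 11xxxxxx: first byte of a multi-byte code
    return "invalid"       # C0, C1, FE, FF (assumed never to occur)


def code_start(data: bytes, i: int) -> int:
    """Back up from offset i to the first byte of the code containing it."""
    while i > 0 and byte_kind(data[i]) == "trailing":
        i -= 1
    return i


# 'A', then the six-byte code for U+4000000, then 'B'
data = bytes.fromhex("41 FC 84 80 80 80 80 42")
# A byte-wise search for 'B' lands on a code boundary, not inside the six-byte code.
hit = data.find(b"\x42")
```

Because no trailing byte can be mistaken for a leading or single byte, a program that lands at an arbitrary offset can resynchronize by stepping back at most five bytes.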
Another great property of UTF-8, which is still true for UTF-G-8, is that a simple binary comparison of strings (with the C function strcmp(), for example, which compares bytes as unsigned values) yields the same sort order as a numerical comparison of code points, or a binary comparison of the same strings in a fixed-width encoding (such as big-endian UTF-32).
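The sort-order property can be checked with a short Python sketch. The byte sequences are copied from the examples given later in this document; the variable names are illustrative only. Python compares bytes objects lexicographically as unsigned values, which stands in here for a strcmp()-style comparison.

```python
# (code point, encoded bytes) pairs, deliberately listed out of order
samples = [
    (0x7FFFFFFF, bytes.fromhex("FD BF BF BF BF BF")),
    (0x0041, bytes.fromhex("41")),
    (0x110000, bytes.fromhex("F4 90 80 80")),
    (0x0080, bytes.fromhex("C2 80")),
    (0x200000, bytes.fromhex("F8 88 80 80 80")),
    (0xFFFF, bytes.fromhex("EF BF BF")),
    (0x10000, bytes.fromhex("F0 90 80 80")),
    (0x4000000, bytes.fromhex("FC 84 80 80 80 80")),
]

# Sort once by the raw bytes and once by the numeric code point.
by_bytes = sorted(samples, key=lambda s: s[1])
by_code_point = sorted(samples, key=lambda s: s[0])

# Both orderings agree for these samples.
same_order = by_bytes == by_code_point
```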
Yet another useful property of one-to-six-byte UTF-8 is that the leading byte indicates the length of a code. UTF-G-8 preserves this property.
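A minimal Python sketch of reading the code length from the leading byte follows. The names code_length and count_codes are hypothetical, and the lead-byte ranges are derived from the bit patterns of the one-to-six-byte encoding rather than quoted from this document.

```python
def code_length(lead: int) -> int:
    """Return the number of bytes in the code that starts with `lead`."""
    if lead <= 0x7F:
        return 1  # 0xxxxxxx
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (C0 and C1 excluded, per the C2..FD range)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4  # 11110xxx
    if 0xF8 <= lead <= 0xFB:
        return 5  # 111110xx
    if 0xFC <= lead <= 0xFD:
        return 6  # 1111110x
    raise ValueError(f"not a leading byte: {lead:#04x}")


def count_codes(data: bytes) -> int:
    """Count codes by hopping from one leading byte to the next."""
    i = n = 0
    while i < len(data):
        i += code_length(data[i])
        n += 1
    return n
```

This sketch assumes well-formed input; it does not verify that the bytes after each leading byte are in fact trailing bytes.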
UTF-G-8 is identical to the one-to-six-byte encoding defined by the original UTF-8 specification.
NOTE: see, for example, RFC 2279.
U+0041 = 41 (the one-byte ASCII code for the letter 'A')
U+007F = 7F (the last ASCII code)
U+0080 = C2 80 (the first two-byte code)
U+07FF = DF BF (the last two-byte code)
U+0800 = E0 A0 80 (the first three-byte code)
U+FFFF = EF BF BF (the last three-byte code)
U+10000 = F0 90 80 80 (the first four-byte code)
U+10FFFF = F4 8F BF BF (the maximum for UCS-M)
U+110000 = F4 90 80 80 (the first code beyond UCS-M)
U+1FFFFF = F7 BF BF BF (the last four-byte code)
U+200000 = F8 88 80 80 80 (the first five-byte code)
U+3FFFFFF = FB BF BF BF BF (the last five-byte code)
U+4000000 = FC 84 80 80 80 80 (the first six-byte code)
U+7FFFFFFF = FD BF BF BF BF BF (six bytes; the last UTF-G-8 code)
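The examples above can be reproduced with a small encoder. The following Python sketch is one possible implementation of the one-to-six-byte scheme, not a reference implementation; encode_code_point and the two helper tables are hypothetical names. It always emits the shortest form and, as a simplifying assumption, applies no special treatment to any sub-range of code points.

```python
# Largest code point representable in 1, 2, 3, 4, 5 and 6 bytes
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# High-bit marker of the leading byte for each length
_LEAD_MARKS = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode_code_point(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) in the shortest form."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    # Number of bytes: the first length whose limit can hold cp
    n = next(i for i, limit in enumerate(_LIMITS) if cp <= limit) + 1
    if n == 1:
        return bytes([cp])
    out = bytearray(n)
    # Fill trailing bytes from the right, six payload bits each
    for i in range(n - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # The remaining high bits go into the leading byte
    out[0] = _LEAD_MARKS[n - 1] | cp
    return bytes(out)


print(encode_code_point(0x4000000).hex(" ").upper())  # FC 84 80 80 80 80
```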