Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated January 31, 2009

Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.

The name of this UTF (UCS Transformation Format) is "UTF-∞-32", to be read in English as "UTF-Infinity-Thirty-Two". UTF-∞-32 extends UTF-32 to support an infinite number of characters.

UTF-∞-32 is one of the encodings defined as part of UCS-∞, which also includes similar extensions for UTF-8 and UTF-16. For general information about UCS-∞, please see the UCS-∞ Specification.

UTF-∞-32 preserves and extends useful properties of UTF-32 and UCS-4. For code points less than or equal to U+10FFFF, it is identical to UTF-32. For code points less than U+80000000, it is identical to the original UCS-4 encoding.

UTF-∞-32 employs thirty-two-bit code units. For codes that consist of two or more code units, the leading (initial) code unit always matches the pattern Fxxxxxxx (hexadecimal), and all trailing (non-initial) code units match the pattern Exxxxxxx (hexadecimal). (Hexadecimal notation is used for code points and code units except where binary or decimal are explicitly specified.) The distinction between leading and trailing code units has the consequence that there is no risk of a "false match" when searching.
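The leading/trailing distinction can be tested by looking only at the top nybble of a unit. The following is a minimal sketch of that test; the function names `unit_kind` and `code_starts` are hypothetical and not part of the specification.

```python
def unit_kind(unit):
    """Classify a 32-bit code unit by its top nybble (hypothetical helper)."""
    if not 0 <= unit <= 0xFFFFFFFF:
        raise ValueError("not a thirty-two-bit code unit")
    top = unit >> 28
    if top == 0xF:
        return "leading"    # Fxxxxxxx: first unit of a multi-unit code
    if top == 0xE:
        return "trailing"   # Exxxxxxx: non-initial unit
    return "single"         # 00000000..DFFFFFFF: a complete one-unit code


def code_starts(units):
    """Return the indices at which a code begins in a sequence of units."""
    return [i for i, u in enumerate(units) if unit_kind(u) != "trailing"]
```

Because a trailing unit can never be mistaken for the start of a code, a search that lands in the middle of a code can resynchronize by skipping forward or backward past the Exxxxxxx units.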

A simple binary comparison of UTF-∞-32 codes yields the same sort-order as a numerical comparison of code points.
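As a small illustration of this property, the sketch below sorts a handful of codes (taken from the examples in this document) once by code point and once by their unit sequences. Python tuples compare unit by unit, which stands in here for a binary comparison of thirty-two-bit units; the variable names are illustrative only.

```python
# (code point, UTF-∞-32 units) pairs copied from the examples in this document.
examples = [
    (16 ** 19, (0xFFA00000, 0xE0100000, 0xE0000000, 0xE0000000)),
    (0xE * 16 ** 13, (0xFF000000, 0xEE000000, 0xE0000000)),
    (0xE * 16 ** 13 - 1, (0xFDFFFFFF, 0xEFFFFFFF)),
    (0x123456789ABCD, (0xF0123456, 0xE789ABCD)),
    (0xE0000000, (0xF000000E, 0xE0000000)),
    (0xDFFFFFFF, (0xDFFFFFFF,)),
    (0x41, (0x00000041,)),
]

by_code_point = sorted(examples, key=lambda pair: pair[0])
by_units = sorted(examples, key=lambda pair: pair[1])  # unit-by-unit comparison
same_order = by_code_point == by_units
```

For these examples `same_order` is true: the F and FF prefixes, and the length-storage nybbles described later, make every longer code compare greater than every shorter one.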

For codes longer than one code unit, UTF-∞-32 specifies the code length in the first "few" code units, so that it is not necessary to scan all the way to the end of a code to determine its length. Except for extremely long codes, the length is specified in the first code unit.

Since UTF-∞-32 code units are much larger than the units in UTF-∞-8 and UTF-∞-16, high efficiency of storage is possible without manipulating individual bits. Each thirty-two-bit unit is treated as a sequence of eight nybbles, and each hexadecimal digit of a USV (udigit) is treated as indivisible (rather than being broken down into bits that may be stored in separate code units, as in UTF-∞-8 and UTF-∞-16). The procedure for conversion between UTF-∞-32 and USV is therefore somewhat shorter and simpler than the corresponding procedures for UTF-∞-8 and UTF-∞-16, and hexadecimal representation of UTF-∞-32 is human-readable in the sense that it is relatively easy to recognize how the USV is stored.

(Note: some readers may prefer to skip down to view the examples first, then return here to read the details.)

- *Code point:* Any nonnegative integer used for character encoding.
- *USV:* A UCS-∞ scalar value, that is, any code point except U+D800..U+DFFF.
- *Udigit:* A USV hexadecimal digit used in the conventional "U+" notation.
- *NUD:* Number of udigits in a particular USV, not counting leading zeros. For example, U+123456789 has NUD = 9, and U+0050 has NUD = 2.
- *NMT:* NUD minus twenty.
- *Nybble:* Half a byte; four bits; one hexadecimal digit.
- *Unit:* A thirty-two-bit (eight-nybble) code unit that forms part or all of a UTF-∞-32 code.
- *Code:* A whole sequence of UTF-∞-32 units corresponding to a single code point.
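The definitions of USV, NUD, and NMT translate directly into a few lines of code. This is a minimal sketch; the function names `is_usv`, `nud`, and `nmt` are hypothetical, and Python integers are used because they have no upper bound.

```python
def is_usv(code_point):
    """True for any nonnegative code point outside U+D800..U+DFFF."""
    return code_point >= 0 and not 0xD800 <= code_point <= 0xDFFF


def nud(usv):
    """Number of udigits (hexadecimal digits), not counting leading zeros."""
    return (usv.bit_length() + 3) // 4   # four bits per udigit, rounded up


def nmt(usv):
    """NUD minus twenty; only meaningful when NUD is at least twenty."""
    return nud(usv) - 20
```

With these definitions, `nud(0x123456789)` is 9 and `nud(0x0050)` is 2, matching the examples given for NUD.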

For a USV in the range U+0000..U+DFFFFFFF, a code is a single unit which simply contains the USV.

For a USV in the range U+E0000000..U+DFFFFFFFFFFFFF, a code consists of two units. The USV is padded on the left with zeros as needed to make fourteen nybbles. The leading unit has the form Fxxxxxxx and stores the first seven nybbles; it is in the range F000000E..FDFFFFFF. The trailing unit has the form Exxxxxxx and stores the last seven nybbles.

For a USV in the range U+E0000000000000..U+FFFFFFFFFFFFFFFFFFF, a code consists of three units. The leading unit has the form FF0xxxxx, and is in the range FF000000..FF0FFFFF. Both trailing units have the form Exxxxxxx, range E0000000..EFFFFFFF. The USV is padded on the left with zeros as needed to make nineteen nybbles, of which five are stored in the first unit, seven in the second unit, and seven in the third unit.
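The one-, two-, and three-unit forms can be sketched with shifts and masks, since each trailing unit carries exactly seven nybbles (twenty-eight bits). The function name `encode_short` is hypothetical, and the range boundaries are taken from the tables and examples in this document.

```python
def encode_short(usv):
    """Encode a USV that fits in one, two, or three units (hypothetical helper)."""
    if usv < 0 or 0xD800 <= usv <= 0xDFFF:
        raise ValueError("not a USV")
    low7 = 0xFFFFFFF                      # mask for seven nybbles
    if usv <= 0xDFFFFFFF:                 # one unit: the USV itself
        return [usv]
    if usv < 0xE * 16 ** 13:              # two units: up to U+DFFFFFFFFFFFFF
        return [0xF0000000 | (usv >> 28),
                0xE0000000 | (usv & low7)]
    if usv < 16 ** 19:                    # three units: NUD up to nineteen
        return [0xFF000000 | (usv >> 56),
                0xE0000000 | ((usv >> 28) & low7),
                0xE0000000 | (usv & low7)]
    raise ValueError("needs four or more units")
```

For example, `encode_short(0x123456789ABCD)` yields the units F0123456 and E789ABCD.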

3.5.1 For a USV with NUD of twenty or more, a code consists of four or more units. The leading unit has the form FFAxxxxx or FFBxxxxx, and is in the range FFA00000..FFBBBBBB. All trailing units have the form Exxxxxxx, range E0000000..EFFFFFFF.

3.5.2 Starting with the third nybble of the leading unit, NUD (the number of udigits) is indicated by a variable-length sequence of two or more "length-storage nybbles".

3.5.3 Instead of NUD, the length value actually stored is NMT = NUD minus twenty (since NUD is at least twenty).

3.5.4 If NMT is less than sixteen, the sequence matches the simple pattern Ax. That is, the sequence of length-storage nybbles is one nybble whose value is A, followed by one nybble whose value is NMT. Examples:

- FFA1... (NMT = one; NUD = twenty-one)
- FFA3... (NMT = three; NUD = twenty-three)
- FFAF... (NMT = fifteen; NUD = thirty-five)

3.5.5 If NMT is sixteen or more, the sequence matches one of the patterns BAxx, BBAxxx, BBBAxxxx, BBBBAxxxxx, ... . To be precise, let N = the number of nybbles needed to store NMT. The sequence of length-storage nybbles is:

- (N - 1) nybbles whose value is B (eleven; mnemonic for "before")
- One nybble whose value is A
- N nybbles storing NMT, starting with the highest-order hexadecimal digit.

The sequence length is therefore 2 * N. The B and A nybbles indicate how many length-storage nybbles follow, and also ensure that shorter codes are sorted *before* longer codes, thus enabling the useful binary comparison property described above. Examples:

- FFBA10... (NMT = 10 hexadecimal = sixteen; NUD = thirty-six)
- FFBA45... (NMT = 45 hexadecimal = sixty-nine; NUD = eighty-nine)
- FFBBA123 ... (NMT = 123 hexadecimal = 291 decimal; NUD = 311 decimal)
- FFBBAFFF ... (NMT = FFF hexadecimal = 4095 decimal; NUD = 4115 decimal)
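The rules in 3.5.3 through 3.5.5 can be sketched as a single function that builds the length-storage nybbles as a hexadecimal string. The name `length_storage_nybbles` is hypothetical; the Ax case falls out of the general rule with N = 1, since it then has zero B nybbles.

```python
def length_storage_nybbles(nud):
    """Return the length-storage nybbles for a given NUD as a hex string."""
    if nud < 20:
        raise ValueError("length storage is only used when NUD is at least twenty")
    nmt_digits = format(nud - 20, "X")   # NMT, highest-order digit first
    n = len(nmt_digits)                  # N = nybbles needed to store NMT
    return "B" * (n - 1) + "A" + nmt_digits
```

The result always has 2 * N nybbles. Comparing two such strings character by character orders them by NUD, because a longer sequence has a B where a shorter one has its A.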

3.5.6 Length-storage nybbles can spill over into trailing units, but only in extremely long codes. Examples:

- FFBBBA43 E21... (NMT = 4321 hexadecimal)
- FFBBBBBB EBBA9876 E54321... (NMT = 987654321 hexadecimal)
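Reading the length back is the reverse process: skip the FF and E markers, count B nybbles up to the A, then read that many nybbles plus one as NMT. The sketch below does this lazily, so it only looks at as many units as the length-storage nybbles occupy. The names `payload_nybbles` and `read_nud` are hypothetical, and the function assumes it is given the start of a code of four or more units.

```python
def payload_nybbles(units):
    """Yield the nybbles after the FF marker (leading) or E marker (trailing)."""
    for index, unit in enumerate(units):
        count = 6 if index == 0 else 7
        for shift in range(4 * (count - 1), -1, -4):
            yield (unit >> shift) & 0xF


def read_nud(units):
    """Return NUD as stored in the length-storage nybbles of a long code."""
    if units[0] >> 24 != 0xFF:
        raise ValueError("leading unit does not start with FF")
    stream = payload_nybbles(units)
    b_count = 0
    for nybble in stream:
        if nybble == 0xB:
            b_count += 1
        elif nybble == 0xA:
            break
        else:
            raise ValueError("not a code of four or more units")
    else:
        raise ValueError("ran out of units before finding the A nybble")
    nmt = 0
    for _ in range(b_count + 1):         # N = (number of B nybbles) + 1
        digit = next(stream, None)
        if digit is None:
            raise ValueError("ran out of units while reading NMT")
        nmt = nmt * 16 + digit
    return nmt + 20
```

For the second spill-over example, `read_nud([0xFFBBBBBB, 0xEBBA9876, 0xE5432100])` needs the first three units; the final two nybbles of the third unit are not examined, so their values do not matter here.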

3.5.7 The remaining nybbles in a code (that is, other than initial FF in the leading unit, initial E in each trailing unit, and length-storage nybbles) are referred to as "USV-storage nybbles". Each USV-storage nybble can hold one udigit (four bits of the USV).

3.5.8 A USV is stored in as few units as possible. If NUD is less than the number of available USV-storage nybbles, then the udigits are effectively padded on the left with nybbles whose value is 0 (zero). In other words, as many zero nybbles as necessary are inserted between the length-storage nybbles and the udigits, so that the last udigit fills the last nybble in the last unit.
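Putting the rules together, a complete encoder can be sketched by working with hexadecimal strings, which keeps each udigit indivisible as described earlier. This is an illustration under the rules stated above, not a reference implementation; the function name `encode` is hypothetical.

```python
def encode(usv):
    """Encode a USV as a list of 32-bit UTF-∞-32 units (illustrative sketch)."""
    if usv < 0 or 0xD800 <= usv <= 0xDFFF:
        raise ValueError("not a USV")
    if usv <= 0xDFFFFFFF:
        return [usv]                               # one unit
    digits = format(usv, "X")                      # udigits, no leading zeros
    nud = len(digits)
    if usv < 0xE * 16 ** 13:                       # two units
        marker, length, total = "F", "", 14
    elif nud <= 19:                                # three units
        marker, length, total = "FF0", "", 19
    else:                                          # four or more units
        nmt_digits = format(nud - 20, "X")
        marker = "FF"
        length = "B" * (len(nmt_digits) - 1) + "A" + nmt_digits
        needed = len(length) + nud
        trailing = -(-(needed - 6) // 7)           # ceiling division
        total = 6 + 7 * trailing                   # as few units as possible
    # Zero nybbles go between the length-storage nybbles and the udigits.
    payload = length + digits.rjust(total - len(length), "0")
    lead_room = 8 - len(marker)
    units = [int(marker + payload[:lead_room], 16)]
    for i in range(lead_room, total, 7):
        units.append(int("E" + payload[i:i + 7], 16))
    return units
```

As a usage example, `encode(16 ** 19)` (the USV U+10000000000000000000) returns the four units FFA00000, E0100000, E0000000, E0000000.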

Units | NUD | USV | Leading | Trailing |
---|---|---|---|---|
1 | 0..8 | U+0000..U+DFFFFFFF | 00000000..DFFFFFFF | (none) |
2 | 8..14 | U+E0000000..U+DFFFFFFFFFFFFF | F000000E..FDFFFFFF | E0000000..EFFFFFFF |
3 | 14..19 | U+E0000000000000..U+FFFFFFFFFFFFFFFFFFF | FF000000..FF0FFFFF | E0000000..EFFFFFFF |

The table above describes codes with one to three units. The table below describes codes with four or more units, more concisely. (USV ranges are from U+1000... to U+FFF..., where the lengths are given by NUD. Trailing units are all E0000000..EFFFFFFF.)

Units | NUD | Leading |
---|---|---|
4 | 20..25 | FFA00000..FFA5FFFF |
5 | 26..32 | FFA60000..FFACFFFF |
6 | 33..37 | FFAD0000..FFBA11FF |
7 | 38..44 | FFBA1200..FFBA18FF |
8 | 45..51 | FFBA1900..FFBA1FFF |
9 | 52..58 | FFBA2000..FFBA26FF |
10 | 59..65 | FFBA2700..FFBA2DFF |
11..∞ | 66..∞ | FFBA2E00..FFBBBBBB |

For codes up to 589 units long (NUD up to 4115, leading unit up to FFBBAFFF), the length is completely indicated by the leading unit. Beyond that, length-storage sequences start with BBB... and the second unit must be examined to determine the length.

When the number of units increases by one, the NUD range usually increases by seven, but there are occasional exceptions corresponding to the transitions where the number of length-storage nybbles increases. (This happens with 6 units, where the length-storage format changes from Ax to BAxx; with 41 units, where it changes from BAxx to BBAxxx; and again around 589 units, where it changes from BBAxxx to BBBAxxxx.)
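The relationship between NUD and the number of units follows from counting nybbles: a long code needs 2 * N length-storage nybbles plus NUD USV-storage nybbles, of which six fit in the leading unit and seven in each trailing unit. The sketch below recomputes the table from that count; the names `units_for_nud`, `nud_ranges`, and `narrow_ranges` are hypothetical.

```python
from itertools import groupby


def units_for_nud(nud):
    """Number of units in a code whose USV has `nud` udigits (nud >= 20)."""
    if nud < 20:
        raise ValueError("this sketch only covers codes of four or more units")
    n = len(format(nud - 20, "X"))       # nybbles needed to store NMT
    needed = 2 * n + nud                 # length-storage plus USV-storage nybbles
    trailing = -(-(needed - 6) // 7)     # ceiling division; six fit in the leading unit
    return 1 + trailing


def nud_ranges(max_nud):
    """Return [(units, first_nud, last_nud), ...] covering NUD 20..max_nud."""
    rows = []
    for units, group in groupby(range(20, max_nud + 1), key=units_for_nud):
        nuds = list(group)
        rows.append((units, nuds[0], nuds[-1]))
    return rows


def narrow_ranges(max_nud):
    """Map unit counts to NUD-range widths other than seven.

    The last row is dropped because it may be cut off at max_nud.
    """
    return {units: last - first + 1
            for units, first, last in nud_ranges(max_nud)[:-1]
            if last - first + 1 != 7}
```

`nud_ranges(65)` gives the rows for four through ten units in the table above, and `units_for_nud(4115)` gives 589. `narrow_ranges` lists the unit counts whose NUD range is narrower than seven, which is a quick way to locate the transitions in the length-storage format.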

- U+0041 = 00000041 (one unit; the code for the letter 'A')
- U+10FFFF = 0010FFFF (one unit; the last UTF-32 code)
- U+110000 = 00110000 (one unit)
- U+7FFFFFFF = 7FFFFFFF (one unit; the last original UCS-4 code)
- U+80000000 = 80000000 (one unit)
- U+DFFFFFFF = DFFFFFFF (the last one-unit code)
- U+E0000000 = F000000E E0000000 (the first two-unit code)
- U+123456789ABCD = F0123456 E789ABCD (two units)
- U+DFFFFFFFFFFFFF = FDFFFFFF EFFFFFFF (the last two-unit code; NUD = fourteen)
- U+E0000000000000 = FF000000 EE000000 E0000000 (the first three-unit code; NUD = fourteen)
- U+FFFFFFFFFFFFFFFFFFF = FF0FFFFF EFFFFFFF EFFFFFFF (the last three-unit code; NUD = nineteen)
- U+10000000000000000000 = FFA00000 E0100000 E0000000 E0000000 (the first four-unit code; NUD = twenty, NMT = 0)
- U+FFFFFFFFFFFFFFFFFFFFFFFFF = FFA5FFFF EFFFFFFF EFFFFFFF EFFFFFFF (the last four-unit code; NUD = twenty-five, NMT = 5)
- U+10000000000000000000000000 = FFA60000 E0010000 E0000000 E0000000 E0000000 (the first five-unit code; NUD = twenty-six, NMT = 6)
- U+FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF = FFACFFFF EFFFFFFF EFFFFFFF EFFFFFFF EFFFFFFF (the last five-unit code; NUD = thirty-two, NMT = twelve = C hexadecimal)
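The examples above can also be read in the other direction. The following is a minimal sketch of a decoder for one complete code, again working with hexadecimal strings. The function name `decode` is hypothetical, and validation is deliberately light: it checks the unit markers and the stored length, but it does not reject every possible malformed or non-shortest sequence.

```python
def decode(units):
    """Convert one complete code (a sequence of 32-bit units) to its USV."""
    first = units[0]
    if first >> 28 < 0xE:                          # 00000000..DFFFFFFF
        if len(units) != 1:
            raise ValueError("a one-unit code cannot have trailing units")
        return first
    if first >> 28 != 0xF or any(u >> 28 != 0xE for u in units[1:]):
        raise ValueError("malformed code")
    lead = format(first, "08X")
    tail = "".join(format(u, "08X")[1:] for u in units[1:])  # drop each E marker
    if not lead.startswith("FF"):                  # Fxxxxxxx: two units
        expected, payload = 2, lead[1:] + tail
    elif lead[2] == "0":                           # FF0xxxxx: three units
        expected, payload = 3, lead[3:] + tail
    else:                                          # FFA... or FFB...: four or more
        stream = lead[2:] + tail
        n = stream.index("A") + 1                  # (N - 1) B nybbles, then A
        if stream[:n - 1] != "B" * (n - 1):
            raise ValueError("malformed length-storage nybbles")
        nud = int(stream[n:2 * n], 16) + 20        # stored value is NMT
        value = int(stream[2 * n:], 16)
        if len(format(value, "X")) != nud:
            raise ValueError("stored length does not match the USV")
        return value
    if len(units) != expected:
        raise ValueError("wrong number of units")
    return int(payload, 16)
```

For instance, `decode([0xF0123456, 0xE789ABCD])` returns the USV U+123456789ABCD.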

Color-coded examples of UTF-∞-32 (in separate window)