Hypothetical Applications of UCS-X (Science Fiction)

The UCS-X specification is based on the assumption that the availability of more than a million UCS code points will be useful, but it does not assume or depend on any particular future application. Please see the rationale for UCS-X for our essential justifications of UCS-X, and for examples of a few applications (such as private-use and variation selectors) that seem likely to run into trouble with the U+10FFFF limit in the relatively near future.

This "Science Fiction" page is much more speculative, and not at all essential to our justification of UCS-X. It is easy to imagine many applications for very large fonts or character sets. The point is that the future is very likely to exceed our expectations in one way or another.

"Heavier-than-air flying machines are impossible" (1895); "Radio has no future." (1897) --Lord Kelvin

"The energy produced by the breaking down of the atom is a very poor kind of thing. Anyone who expects a source of power from the transformation of these atoms is talking moonshine." (1933) --Ernest Rutherford

"When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong." --Arthur C. Clarke

"One might not believe that such a large number [a googolplex] would ever really have any application; but one who felt that way would not be a mathematician." --Edward Kasner and James R. Norman, Mathematics and the Imagination

"Most actual digital computers have only a finite store. There is no theoretical difficulty in the idea of a computer with an unlimited store. Of course only a finite part can have been used at any one time. Likewise only a finite amount can have been constructed, but we can imagine more and more being added as required." --Alan Turing, Can a Machine Think?

Astronomical alphabets

Images of astronomical objects have been used as fonts. Astronomers and artists may find it useful to store astronomical images in font files, indexed by character code points. In 1999, observations made with the Hubble Space Telescope led to an estimate of 125 billion (1.25×10¹¹) galaxies in the universe. Our own galaxy contains about 200 billion (2×10¹¹) stars. By one estimate, the number of stars in the visible universe is 30 billion trillion (3×10²²).

Molecular characters

For scientific, medical, and engineering work, it may be useful to store molecular images or diagrams in font files. Special code points serving as standardized molecular identifiers may also be included in text streams for other purposes. Future display devices may be embedded in sheets of paper, using nanotechnology, or projected as holograms, etc. The notion of a "character" of text may come to include various kinds of tiny images, or molecular blueprints, conveniently stored in enormous, yet microscopic, futuristic font files.

"There's plenty of room at the bottom." --Richard Feynman

Emoticons/Mediaglyphics

Writing in the future may increasingly employ a combination of static and animated images, such as "emoticons" and "mediaglyphics". An efficient method of implementation might use a font in which each "glyph" could be a movie (or a frame or portion of a movie), mapped to a unique code point.

RFID

The rising usage of RFID (radio frequency identification) suggests applications such as the ability to find missing things. If every human being is likely to have a collection of items with unique RFID codes, then many billions of codes might be used, and a mapping to character code points could be useful for transmitting each RFID as a single "character". The glyph for such a character might be a photograph of the identified book, shoe, writing implement, wallet, etc.

Internet Protocol (IPv6)

To show that even quintillions of codes (as in UCS-E) are not necessarily "enough", here is a hypothetical example of an application of 33-hex-digit UCS-∞ code points. The internet is based on the Internet Protocol (IP). Originally IP supported about four billion 32-bit addresses, but that is no longer enough, so IP has been extended to 128-bit addresses. For example, 2001:0db8:85a3:08d3:1319:8a2e:0370:7334 is a valid IPv6 address. Displayed in this way as a text string, an IP address takes 39 characters. As an ASCII or UTF-8 string, the example would be the following thirty-nine bytes (in hexadecimal):

    32 30 30 31 3a 30 64 62 38 3a 38 35 61 33 3a 30 38 64 33 3a 31 33 31 39
    3a 38 61 32 65 3a 30 33 37 30 3a 37 33 33 34

(UTF-16 would be twice as long, seventy-eight bytes; and UTF-32 would be twice as long again, one hundred fifty-six bytes.)
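The byte counts above can be checked with a few lines of Python. This is only a minimal sketch using the standard `str.encode` method; the variable names are illustrative, and the little-endian codecs are chosen so that no byte-order mark is added to the counts.

```python
# The example IPv6 address in its full textual form.
address = "2001:0db8:85a3:08d3:1319:8a2e:0370:7334"

# Every character is ASCII, so UTF-8 uses one byte per character.
utf8 = address.encode("utf-8")
# The "-le" codecs omit the byte-order mark, so only the text is counted.
utf16 = address.encode("utf-16-le")
utf32 = address.encode("utf-32-le")

print(len(address), len(utf8), len(utf16), len(utf32))  # 39 39 78 156
print(utf8.hex(" "))  # the same bytes as the hexadecimal dump above
```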

In some contexts it could be useful instead to store each IP address as a single "character". The USV might be thirty-three hex digits, with the first digit (say an "A") serving as a prefix to avoid a leading zero. The IP address just shown would map to the "character" U+A20010DB885A308D313198A2E03707334. Its UTF-∞-8 code would be only twenty-four bytes (much shorter than the ASCII/UTF-8 string):

    FF AF A8 A0 80 90 B6 B8 A1 9A 8C 88 B4 B1 8C 99 A2 A2 B8 83 9C 87 8C B4

Similarly, the UTF-∞-16 would be:

    DDFF DE0A DE28 DF00 DE10 DFB7 DE21 DED1 DF08 DFA6 DE4C DECC DEA2
    DFC0 DEDC DE39 DF34

And UTF-∞-32 would be:

    FFAD0000 E00A2001 E0DB885A E308D313 E198A2E0 E3707334

The UTF-∞-8 and UTF-∞-32 codes are each only 24 bytes, while UTF-∞-16 is 34 bytes. All these codes take less memory than the string form (regardless of whether code units are 8, 16, or 32 bits). Furthermore, the string form would require additional mark-up of some kind to show unambiguously that it represented an IP address, whereas the single-character form would be unambiguous if this encoding and assignment were actually standardized. Thus the "single-character" form would be more efficient. Yet it would be transmissible as part of a text stream (not only as a separate piece of binary data). Such a "character" could be displayed either as a generic IP icon, or as an icon corresponding to the particular address, perhaps dynamically fetched, similar to the favicon.ico supported by many web browsers. Clicking on the icon could access the website (without additional overhead for mark-up such as an HTML anchor tag with an href attribute).

The number of available IP addresses is truly astronomical. Nobody expects that they will all be used. Nevertheless, the internet is empowered by their availability. The hypothetical mapping between IP addresses and code points just described would be relatively trivial to implement, supposing that UCS-∞ were already implemented. Compare an alternative, which would be to assign such mappings only when requested for each address, perhaps using code points no greater than U+10FFFF. Not only would the availability of code points be severely restricted, but a mapping table would need to be maintained, perhaps at great expense and complexity. This example illustrates a general principle, which is that very large code points might sometimes be assigned long before all or most of the smaller ones are used up. If we want a simple and complete mapping between IP addresses and code points, it is most practical to go straight up to 33 hex digits.
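To suggest how little machinery the mapping needs, here is a minimal Python sketch of it. It uses the standard `ipaddress` module; the function names `ip_to_usv` and `usv_to_ip` and the constant `PREFIX_DIGIT` are hypothetical, chosen only for this example. The sketch handles USVs written out as text (such as "U+A2001..."); it does not attempt the UTF-∞ byte encodings.

```python
import ipaddress

PREFIX_DIGIT = "A"  # hypothetical prefix digit, as in the example above

def ip_to_usv(text):
    """Map an IPv6 address to a 33-hex-digit USV written as text."""
    value = int(ipaddress.IPv6Address(text))
    # A 128-bit value always prints as exactly 32 hex digits when zero-padded.
    return f"U+{PREFIX_DIGIT}{value:032X}"

def usv_to_ip(usv):
    """Recover the IPv6 address from a USV produced by ip_to_usv."""
    digits = usv[2:] if usv.startswith("U+") else usv
    if len(digits) != 33 or digits[0] != PREFIX_DIGIT:
        raise ValueError("not an IP-address code point")
    return ipaddress.IPv6Address(int(digits[1:], 16))

usv = ip_to_usv("2001:0db8:85a3:08d3:1319:8a2e:0370:7334")
print(usv)                      # U+A20010DB885A308D313198A2E03707334
print(usv_to_ip(usv).exploded)  # 2001:0db8:85a3:08d3:1319:8a2e:0370:7334
```

No table is consulted in either direction: the address and the code point are the same number, apart from the prefix digit.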

Every string has a code point

Although the set of all strings might seem larger than the set of all code points, mathematically they have the same size. (They both have cardinality ℵ₀; in other words they are both countably infinite.) Therefore every string, regardless of length, can be mapped to a unique code point (and back again). No mapping table is needed; a simple algorithm is sufficient. For example, the string "I❤UCS-∞" is normally encoded as UTF-8:

    49 E2 9D A4 55 43 53 2D E2 88 9E

Now simply (and strangely) write the code without spaces, and prefix it with "U+F00" to make a private-use USV:

    U+F0049E29DA45543532DE2889E

The USV may then be reencoded as UTF-∞-8:

    FF A7 80 8F 80 84 A7 A2 A7 9A 91 95 90 B5 8C AD B8 A8 A2 9E

Measured in bytes, the new encoding is about twice as long as the original, but measured in characters, it is only one character while the original string is seven characters. (The reverse mapping is straightforward.)
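The forward and reverse mappings between a string and its USV can be sketched in Python as follows. The function names `string_to_usv` and `usv_to_string` and the constant `PREFIX` are hypothetical, and the USV is handled as text rather than as a UTF-∞ byte sequence.

```python
PREFIX = "F00"  # the private-use prefix from the example above

def string_to_usv(text):
    """Write the UTF-8 bytes of text as hex digits after the prefix."""
    return "U+" + PREFIX + text.encode("utf-8").hex().upper()

def usv_to_string(usv):
    """Reverse mapping: strip the prefix and decode the remaining bytes."""
    digits = usv[2:] if usv.startswith("U+") else usv
    if not digits.startswith(PREFIX):
        raise ValueError("not a string-valued code point")
    return bytes.fromhex(digits[len(PREFIX):]).decode("utf-8")

original = "I\u2764UCS-\u221e"  # the seven-character example string
usv = string_to_usv(original)
print(usv)                             # U+F0049E29DA45543532DE2889E
print(usv_to_string(usv) == original)  # True
```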

This method of mapping could be applied to tokenizing. For example, programming languages are often processed in stages, one stage being accomplished by a lexical analyzer which converts lexemes (strings of characters) into tokens. If each token is represented as a single character, subsequent stages, such as parsing, may conveniently use regular expressions with character classes. Since the text after being tokenized (the "token stream") is still a UCS-∞ string, the parser can take advantage of general-purpose UCS-∞ software rather than requiring customization for a limited-purpose token stream format.
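As a rough illustration of this idea, the Python sketch below maps each lexeme to a single character and then checks the token stream with an ordinary regular expression. Python strings cannot hold code points above U+10FFFF, so the sketch is only a stand-in: it hands out code points from the existing private-use area (starting at U+E000) through a small table, rather than using the table-free mapping described above. The toy grammar, the lexer pattern, and all names are hypothetical.

```python
import re

# Stand-in for UCS-∞: Python's str stops at U+10FFFF, so each distinct
# token gets the next code point from the private-use area (U+E000 onward).
PUA_START = 0xE000
table = {}

def token_char(name):
    """Return the single character assigned to this token name."""
    if name not in table:
        table[name] = chr(PUA_START + len(table))
    return table[name]

# One character per operator, plus one generic operand token.
PLUS, TIMES = token_char("+"), token_char("*")
OPERAND = token_char("<operand>")

def tokenize(source):
    """Convert source text into a token stream, one character per token."""
    # Identifiers, integers, and the two operators; other text is skipped.
    lexemes = re.findall(r"[A-Za-z_]\w*|\d+|[+*]", source)
    return "".join(
        token_char(lx) if lx in ("+", "*") else OPERAND for lx in lexemes
    )

# Parser stage: operand (operator operand)*, using a character class.
EXPRESSION = re.compile(f"{OPERAND}(?:[{PLUS}{TIMES}]{OPERAND})*")

print(bool(EXPRESSION.fullmatch(tokenize("width * 2 + offset"))))  # True
print(bool(EXPRESSION.fullmatch(tokenize("width * + 2"))))         # False
```

Because the token stream is an ordinary string, the "parser" here is nothing more than a general-purpose regular expression engine.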

In mathematics, the idea of mapping strings to integers (called "the arithmetization of syntax") is key to the proof of Gödel's incompleteness theorems and was also applied by Alonzo Church and Alan Turing to the Entscheidungsproblem.
