Authors: Tom Bishop (firstname.lastname@example.org) and Richard Cook (email@example.com).
This page supplements the UCS-X specification with explanations of its rationale and design.
Disclaimer: quotations here are to convey general ideas and do not imply that any of the people quoted would approve of UCS-X.
"No one involved in computers would ever say that a certain amount of memory is enough for all time." --Bill Gates
"It's not anything new as a principle of software engineering that you shouldn't have arbitrary limits." --Richard Stallman
"Stay an order of magnitude more general than you think you need, because you will end up needing it in the long term." --John Warnock (Programmers at Work, p.51)
"Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected." --Eric S. Raymond (The Cathedral and the Bazaar)
"... to give you an idea of what this magic of symbolic construction is ... I must begin with the simplest, and in a certain sense most profound, example: the natural numbers or integers by which we count objects. ... We do not leave it to chance which numbers we shall meet by counting actual objects, but we generate the open sequence of all possible numbers which starts with 1 (or 0) and proceeds by adding [one] to any number symbol n already reached ... This intuition of the 'ever one more,' of the open countable infinity, is basic for all mathematics." --Hermann Weyl, The Mathematical Way of Thinking
"As opposed to other kinds of engineering, where the constraints on what you can build are the constraints of physical systems (the constraints of physics, of noise, and approximation), the constraints imposed in building large software systems are the limitations of our own minds." --Hal Abelson
"The basis for my speculation [that] ninety-nine percent of [internet] applications haven't been invented [yet], is to look at the rate at which new ideas are coming along on the net (new things that are happening either within the web context or elsewhere), and recognizing that there are an increasing number of people with capability and interests in building applications on the net, that you can predict, even now, with only one billion users on the net, as we move towards the next decade of the 21st century maybe we'll have five billion users on the net. Well, that's a factor of five right there, and some of these things are not linear in terms of the rate at which inventions happen. Every time somebody invents something that's successful, or comes up with a new standard, it creates another platform on top of which invention can happen; and so this thing is a positive feedback loop." --Vinton Cerf
"In enabling mechanism to combine together general symbols in successions of unlimited variety and extent, a uniting link is established between the operations of matter and the abstract mental processes of the most abstract branch of mathematical science. A new, a vast, and a powerful language is developed for the future use of analysis, in which to wield its truths so that these may become of more speedy and accurate practical application for the purposes of mankind than the means hitherto in our possession have rendered possible. Thus not only the mental and the material, but the theoretical and the practical in the mathematical world, are brought into more intimate and effective connexion with each other. . . . In the case of the Analytical Engine we have undoubtedly to lay out a certain capital of analytical labour in one particular line; but this is in order that the engine may bring us in a much larger return in another line. It should be remembered also that the cards, when once made out for any formula, have all the generality of algebra, and include an infinite number of particular cases." --Ada Lovelace
Font-related technologies are mostly used for characters or glyphs in conventional written languages, for which about a million code points may (or may not) be sufficient. Such technologies are also used for special symbols, copies of handwritten signatures, etc. They could be used for diagrams of molecules, photographs of galaxies, and other applications we can't even imagine yet. Billions of human beings have diverse needs for assigning numbers to many kinds of things, and solving new problems by adapting available technologies. UCS-X could help to solve some of these needs. (Of course, UCS-X is not the solution to every problem. It may be appropriate to unify some, but not all, code spaces with UCS.)
The number of private-use code points allowed by UCS-M is 137,468 (decimal), less than a seventh of a million. The authors (Bishop and Cook) are already using over twenty percent of the allowance for work on ancient Chinese seal script, rare CJK characters, and Tangut. The inhabitants of one small city could exceed the allowance simply by each person creating a single character. Nearly two hundred dingbats (e.g., ❁ ✍ ❧) by the typographer Hermann Zapf are assigned in UCS-M. What about the rest of us? For any font or electronic text employing "too many" private-use code points to be considered permanently illegal or incompatible with UCS would be an unreasonable and unfair obstruction to the creative use of technology.
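The figure of 137,468 can be checked by adding up the three private-use ranges of the current UCS. The following is only a small sketch of that arithmetic; the names `PRIVATE_USE_RANGES` and `range_size` are chosen here for illustration.

```python
# Private-use ranges in the current UCS (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]


def range_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


private_use_total = sum(range_size(a, b) for a, b in PRIVATE_USE_RANGES)
print(private_use_total)  # 137468
```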
In general, there are many ways to write the "same" character or symbol, when differences of handwriting, typography, etc., are taken into account. UCS naturally unifies character variants, but it does include a small set of "variation selectors" (VS) for use as suffixes to distinguish unified variants for special purposes. Unfortunately, UCS currently provides only 256 variation selectors (U+FE00..U+FE0F, and U+E0100..U+E01EF), which might be enough for some purposes, but not for others.
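As a rough illustration of how small the current allowance is, the two variation-selector ranges quoted above can be counted and tested directly. The function name `is_variation_selector` is an assumption made for this sketch, not part of any specification.

```python
# The two variation-selector ranges quoted in the text (inclusive bounds).
VARIATION_SELECTOR_RANGES = [
    (0xFE00, 0xFE0F),    # 16 selectors
    (0xE0100, 0xE01EF),  # 240 selectors
]


def is_variation_selector(cp):
    """True if the code point lies in one of the ranges above."""
    return any(first <= cp <= last for first, last in VARIATION_SELECTOR_RANGES)


vs_count = sum(last - first + 1 for first, last in VARIATION_SELECTOR_RANGES)
print(vs_count)                       # 256
print(is_variation_selector(0xFE0F))  # True
print(is_variation_selector(0x5B57))  # False
```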
For example, an optical character recognition (OCR) algorithm might employ a database of thousands or even millions of samples per character, and it might employ conventional font and text-encoding formats in the construction of that database -- but it could not use Unicode VS due to the limitation of only 256 available VS. Private-use VS may therefore need to be used for such applications, and even so, the number of available private-use code points is currently limited to 137,468, which can't be considered "enough for all time". Millions of sample images could fairly easily be obtained, for example, for any letter of the alphabet. (See image of web search for "letter z".) Therefore, even the limit of 137,468 VS (supposing the entire private-use area were devoted to this purpose) for a character image database should not be considered acceptable. Either the set of code points available for use as VS (at least for private use) needs to be enlarged, or else (for any application in which scalability is a concern) VS need to be abandoned in favor of some other mechanism.
VS are potentially very valuable, if made scalable through extension to a less limited code space. OCR would not be the only application for a large set of VS. A character-image database, mapping millions of character images (and associated descriptions) to UCS code points and VS, could be made available on the web as a font server, enabling people to include very particular forms of characters, even in "plain text" documents. Such a method might not be dependent on a single server or authority; a large VS address space would enable the peaceful coexistence of many different innovative character-image databases.
Skeptics may argue that higher-level protocols or mark-up languages, such as XML, could be employed instead of UCS-X. Sometimes this will be true, but if it were entirely true, there would have been no need for UCS; ASCII plus XML would suffice. Many powerful technologies assume that the glyphs in a font are indexed by integers (not by strings), or that text used in certain contexts must be "plain text" (rather than XML). In fact, XML itself is such a technology, since the element and attribute names in an XML application need to be plain text. In general, simple "characters" that can be stored in ordinary fonts and can be used in plain text will always enjoy certain advantages over complex data structures (such as strings with mark-up). Of course, complex structures also have advantages over simple characters. The wise approach to problem-solving is to consider the pros and cons of various possible solutions, to use the best tool (or combination of tools) for a particular task. Consequently, tools should be designed to be flexible rather than arbitrarily limited in their scope of application. ASCII is useful. XML plus ASCII is more useful. XML plus UCS-M is even more useful. XML plus UCS-G/E/∞ could be even more useful still.
Single characters have an important privileged status in the technology of regular expressions. A simple regex to match a pattern of single characters might become complex if it were necessary to treat a particular kind of marked-up string as equivalent to a single character. For example, suppose that numeric character references such as &#x5B57; (for 字) needed to be matched as single characters. Then, to match a sequence of two characters, instead of .. (two dots), a more complex regex (left as an exercise for the reader) would be required. An alternative solution is to convert all numeric character references to single characters, and then apply the simple .. (two dots) regex. This idea could be carried further by tokenizing an input text and replacing certain strings (maybe temporarily) with uniquely corresponding single character code points (maybe assigned by mapping to a private use area). The technology of "character classes" could then be applied to sets of any type or size (names of people/places, URLs, etc.).
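Both ideas in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `tokenize`, the phrase list, and the choice of U+E000 as the base of the mapping are hypothetical, and a real tokenizer would need to handle overlapping phrases and text that already contains private-use characters.

```python
import html
import re

# Two numeric character references, both denoting U+5B57.
raw = "&#x5B57;&#23383;"

# Convert the references to single characters, then apply the simple regex.
text = html.unescape(raw)
two_chars = re.fullmatch(r"..", text) is not None
print(len(text), two_chars)  # 2 True


def tokenize(source, phrases, base=0xE000):
    """Replace each phrase with one private-use character (hypothetical helper)."""
    mapping = {phrase: chr(base + i) for i, phrase in enumerate(phrases)}
    # Replace longer phrases first so that a phrase contained in another survives.
    for phrase in sorted(mapping, key=len, reverse=True):
        source = source.replace(phrase, mapping[phrase])
    return source, mapping


encoded, mapping = tokenize("From Paris to New York", ["New York", "Paris"])

# A character class over the assigned code points now matches whole place names.
place_class = "[" + "".join(mapping.values()) + "]"
hits = re.findall(place_class, encoded)
print(len(hits))  # 2
```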
It is easy to imagine many applications for very large fonts or character sets. However, such speculations by their nature are somewhat like science fiction, just as many of today's technologies were science fiction a few decades ago. What seems plausible now may seem foolish after a few years, or vice versa. The point is that the future is very likely to exceed our expectations in one way or another. Based on experience and intuition, we believe there will almost certainly be important applications that need to break the one-million-code-point limit eventually, just as another asteroid is bound to come on a collision course with the Earth, sooner or later. Fortunately, character set extension is easier than asteroid deflection, if we don't just ignore the problem until it is too late. UCS-X deserves attention even though all our speculations about particular applications might well turn out to be wrong. With this disclaimer, we provide some speculations on a separate page: UCS-X-Science-Fiction.
Theoretically, if everyone could support UCS-∞, then there would be no need for UCS-G or UCS-E specifications, since they are merely subsets. In reality, implementing UCS-∞ is challenging for reasons of efficiency, complexity, and security. Given only the choice between UCS-M and UCS-∞, nearly all implementers would probably stick with UCS-M for a long time. In contrast, UCS-G is easy to implement right now and already provides a huge increase. In fact, many implementations of UCS-G already exist in the form of applications that support the original UTF-8 and/or UCS-4 standards. UCS-G breathes new life into those existing implementations, and UTF-G-16 enables them to be made fully interoperable with new versions of applications originally based on UTF-16.
While UCS-G addresses limitations of UCS-M that are clearly of practical concern for the near future (such as the inability to use a font with more than 137,468 private-use characters), UCS-E opens up a new world of creative potential. For billions of human beings, innovative applications are likely to need more than two billion code points. (The human population of Earth is projected to reach seven billion by early 2012. Imagine a font in which each glyph is a photograph of a human being, and how it might be used for artistic or governmental purposes.)
UCS-E occupies a natural middle ground between UCS-G and UCS-∞. UCS-E does not require a 64-bit processor for implementation, but it will be more simply and efficiently implemented using 64-bit processors. Its widespread usage will become more practical as 64-bit computers become the norm and 32-bit computers become antiquated. Widespread usage of UCS-E may therefore be expected to occur some years or even decades later than for UCS-G. One justification for consideration of UCS-E at the present time is that UCS-G has been carefully designed to leave room for eventual further extension, and UCS-E serves as an example of such an extension.
The finite sets UCS-G and UCS-E will be more practical and secure than UCS-∞ until more work has been done (development of protocols, etc.), and until there is a clear use for code points beyond U+7FFFFFFF and U+7FFFFFFFFFFFFFFF. It is hard to predict how soon UCS-∞ might be used for non-experimental purposes. Nevertheless, such purposes for UCS-∞ can be imagined, and serious security problems could result from a failure to begin experimental research and testing well before the need becomes urgent. Another justification for consideration of UCS-∞ at the present time is that UCS-E has been carefully designed to leave room for eventual further extension, and UCS-∞ serves as an example of such an extension, that is, a proof of concept that UCS-E is in fact easily (and infinitely) extensible. Furthermore, UCS-∞ can serve as a basis for defining larger finite sets than UCS-E, as outlined in the conformance section of the UCS-X specification.
Theoretically, applications that use code points beyond U+10FFFF could all use the same encoding form (UTF). It could be mandated that one UTF would enjoy infinite extension but the other two would be dead-ends. However, that would be unfair and uneconomical, and people might never agree which size code unit to prefer. It is more practical to make the improvement of existing software as easy as possible. It will generally be easier to extend an existing UTF-16 application to UTF-G-16, for example, than to extend it to UTF-G-8. Another reason is that 8-, 16-, and 32-bit encodings all have advantages in particular contexts, even where backwards compatibility is not a concern. For example, in terms of memory requirements, 8-bit encoding is optimal for text that mostly contains ASCII characters; 16-bit encoding is optimal for text that mostly contains CJK (Chinese/Japanese/Korean) characters; and 32-bit encoding will be optimal for text that mostly contains code points beyond U+10FFFF.
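The memory trade-off for the first two cases can be observed with Python's built-in codecs. The sample strings are arbitrary choices for this sketch. The third case (text beyond U+10FFFF) cannot be shown this way, because the standard codecs stop at U+10FFFF.

```python
samples = {
    "ASCII": "Hello, world",  # 12 characters
    "CJK": "漢字文化圈",        # 5 characters, all in the Basic Multilingual Plane
}

sizes = {}
for label, text in samples.items():
    sizes[label] = {
        enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(label, sizes[label])

# ASCII {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
# CJK {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}
```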
Some implementations of UCS-E are likely to use fixed-length 64-bit integers to store UCS scalar values at least for internal processing. That trivial encoding form might be named "UTF-64" or "UTF-E-64". However, for internal processing, an official name or specification might not be necessary. The potential benefits of UTF-E-64 for long-term file storage or interapplication communication are doubtful, and must be weighed against potential costs (including the cost of having too many different encoding specifications, as well as the cost of wasting 57 bits for each ASCII character). While UCS-X doesn't specify encodings with 64-bit (or larger) code units, it does allow for the possibility of their future specification. In particular, the fact that UCS-E stops at U+7FFFFFFFFFFFFFFF rather than U+FFFFFFFFFFFFFFFF ensures that compatible extension from UTF-E-64 to UTF-∞-64 would be easy. (It could be very similar to the extension from UTF-32 to UTF-∞-32.) Similarly, the conformance requirements for maxNUD (limits of U+7FFF... rather than U+FFFF...) ensure the analogous property for any future encodings using 128-bit or larger power-of-two code unit sizes.
UCS-X is a natural next step in the evolution of character encoding. The essential idea can be formulated (and over-simplified) as a slogan: Fixed length bad; variable length good; unlimited variable length best.
The term "bad" and other pejoratives are used here only in relative and evolutionary senses, as motivation for current and future development, not to criticize any of the historical development up to the present time.
Fixed-length encodings (such as ASCII, Latin1, UCS-2, and UCS-4) are inherently problematic. They have to be stingy or wasteful (or both). Stingy Latin1 squeezes characters through a one-byte bottleneck, imposing a limit of only 256 possible character codes. Wasteful UCS-4 encodes all characters using four bytes each, wasting three bytes per character in the case of common Latin characters. In other words, a fixed-length encoding is a Procrustean bed: whoever lies in it gets either amputated or stretched.
Under the constraints of the mostly 8-bit and 16-bit technologies available in the 1980s, and on the assumption that a fixed-length encoding would be simpler to implement than a variable-length encoding, Unicode was originally designed to use the fixed-length two-byte (16-bit) encoding form called UCS-2. A file was converted from ASCII to UCS-2 simply by inserting a zero byte next to each ASCII byte. Doubling the file size (not to mention ASCII-incompatibility) was the price that had to be paid, even to include a single non-ASCII character in a file that was otherwise all ASCII. Viewed in this way, UCS-2 is wasteful (at least for mostly ASCII text). UCS-2 is also stingy, because it can support only 65,536 code points (up to U+FFFF), which within a few years turned out to be insufficient (especially for Han 漢 characters; see also the section concerning CDL below).
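The conversion described above is simple enough to write out. The sketch below assumes big-endian byte order, so the zero byte goes in front of each ASCII byte; for such text the result coincides with what Python's `utf-16-be` codec produces.

```python
ascii_bytes = b"Unicode"

# ASCII to big-endian UCS-2: put a zero byte in front of every ASCII byte.
ucs2_be = b"".join(b"\x00" + bytes([b]) for b in ascii_bytes)

print(ucs2_be == "Unicode".encode("utf-16-be"))  # True
print(len(ascii_bytes), len(ucs2_be))            # 7 14

# A single 16-bit unit can distinguish only this many code points.
ucs2_capacity = 2 ** 16
print(ucs2_capacity)                             # 65536
```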
The obvious fixed-length solution might have been UCS-4, which had been defined (along with UCS-2) by ISO 10646. However, UCS-4 has been avoided for two reasons: it is even more wasteful than UCS-2, and it isn't backward-compatible with the already-implemented UCS-2 (just as UCS-2 isn't backward-compatible with ASCII). Increasing the length of a fixed-length encoding necessarily introduces incompatibility and wastefulness.
Two different variable-length solutions arose in the 1990s: UTF-8 and UTF-16.
UTF-8 provides backward-compatibility with ASCII, and solves the wastefulness problem, by storing ASCII characters as single ASCII bytes, and storing non-ASCII characters as sequences of two or more non-ASCII bytes. The original UTF-8 also alleviated the stinginess problem: by defining codes up to six bytes long, it supported two billion code points (up to U+7FFFFFFF, the same as UCS-4).
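A minimal sketch of the original six-byte scheme follows. It is an illustration of the byte layout only, not an implementation of any UCS-X encoding form: the function name `encode_original_utf8` is made up here, and the sketch performs no special handling of surrogate code points or other restricted values.

```python
# (largest code point, lead-byte marker) for sequences of 1 to 6 bytes.
_LIMITS = [
    (0x7F, 0x00),
    (0x7FF, 0xC0),
    (0xFFFF, 0xE0),
    (0x1FFFFF, 0xF0),
    (0x3FFFFFF, 0xF8),
    (0x7FFFFFFF, 0xFC),
]


def encode_original_utf8(cp):
    """Encode one code point (0..0x7FFFFFFF) in the original UTF-8 layout."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    # The index of the first sufficient limit is the number of trailing bytes.
    for n_trailing, (limit, lead) in enumerate(_LIMITS):
        if cp <= limit:
            break
    trailing = []
    for _ in range(n_trailing):
        trailing.append(0x80 | (cp & 0x3F))  # low six bits go in a 10xxxxxx byte
        cp >>= 6
    return bytes([lead | cp] + trailing[::-1])


print(encode_original_utf8(0x41).hex())             # 41
print(encode_original_utf8(0x5B57).hex())           # e5ad97
print(encode_original_utf8(0x10FFFF).hex())         # f48fbfbf
print(len(encode_original_utf8(0x7FFFFFFF)))        # 6
```

Up to U+10FFFF the output agrees with today's UTF-8; the five- and six-byte forms only come into play above U+1FFFFF.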
The other variable-length solution, UTF-16, provides backward-compatibility with UCS-2. The extension from UCS-2 to UTF-16 is somewhat analogous to the extension from ASCII to UTF-8. However, technical constraints, the desire to avoid complexity, and (presumably) disbelief in the usefulness of more than one million code points all combined to produce a UTF-16 that is still relatively stingy. UTF-16 supports about one million code points, which is only about one twentieth of one percent of the two billion code points supported by the original UTF-8.
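|Old encoding||Old limit||Compatible extension||Extended limit (approx.)|
|ASCII||U+007F||UTF-8 (original)||two billion code points (U+7FFFFFFF)|
|UCS-2||U+FFFF||UTF-16||one million code points (U+10FFFF)|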
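The "one million" and "two billion" figures, and the ratio between them, follow from simple arithmetic. The sketch below counts code points in the range U+0000..U+10FFFF as the 65,536 single-unit values plus every combination of a high and a low surrogate; the variable names are chosen for illustration.

```python
single_units = 0x10000            # code points expressible in one 16-bit unit
surrogate_pairs = 0x400 * 0x400   # 1,024 high surrogates times 1,024 low surrogates
utf16_code_points = single_units + surrogate_pairs

original_utf8_code_points = 2 ** 31   # U+0000 through U+7FFFFFFF

ratio = utf16_code_points / original_utf8_code_points
print(utf16_code_points)   # 1114112
print(f"{ratio:.4%}")      # 0.0519%
```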
UTF-16 is still wasteful like UCS-2 in doubling the size of an ASCII file. On the positive side, UTF-16 stores most commonly used Han characters in one unit (two bytes), where UTF-8 requires three bytes; so sometimes UTF-16 is less wasteful than UTF-8, depending on the particular text. UTF-16 may in some situations be faster to process than UTF-8. In any case, the choice between UTF-8 and UTF-16 is now mostly only of concern to software engineers, and can be decided on a case-by-case basis to maximize convenience or efficiency.
Conversion between UTF-8 and UTF-16 is generally a trivial operation. There is one big exception: as already mentioned, UTF-16 only supports a million code points, while UTF-8 supports (or originally supported) two billion. This lack of interoperability between UTF-16 and UTF-8 poses a problem. The solution chosen so far by the Unicode Consortium has been to cut UTF-8 down to size with the limit of one million code points (up to U+10FFFF). All six- and five-byte UTF-8 codes, and almost half of the four-byte UTF-8 codes, are banished to the Pointless Forest. The same reduction has been performed on UCS-4, given the new name UTF-32. (UTF-32 is in a sense even more wasteful than UCS-4, since at least eleven out of thirty-two bits are required to be wasted.) Amputation is a pragmatic temporary strategy. (UCS-X recognizes the U+10FFFF limit for UTF-8 and UTF-32 as a fait accompli, and therefore assigns the new names UTF-G-8 and UTF-G-32, for the original 8-bit and 32-bit encodings up to U+7FFFFFFF.) For the long term, imposition of such a low and arbitrary limit on the evolution of communication technology would be another Procrustean bed, and would indicate a failure to recognize the inherent potential of variable-length encoding.
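Two of the figures in this paragraph can be checked with a few lines of arithmetic: the eleven unused bits in UTF-32, and the share of four-byte UTF-8 codes that fall above U+10FFFF. The variable names are illustrative only.

```python
# Bits needed for the largest legal code point, and bits left over in 32.
bits_needed = (0x10FFFF).bit_length()
bits_wasted = 32 - bits_needed
print(bits_needed, bits_wasted)   # 21 11

# Four-byte codes in the original UTF-8 cover U+10000 through U+1FFFFF.
four_byte_total = 0x1FFFFF - 0x10000 + 1
four_byte_kept = 0x10FFFF - 0x10000 + 1
four_byte_banished = four_byte_total - four_byte_kept

print(four_byte_banished)                             # 983040
print(f"{four_byte_banished / four_byte_total:.1%}")  # 48.4%
```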
The question naturally arises: What is the upper limit to the number of code points that can be supported by a variable-length character encoding form? It turns out that there is no theoretical limit. There will always be practical limits, imposed by hardware, etc.; but ordinary modern 32-bit hardware can easily support two billion code points, and tomorrow's hardware will easily support far more. (Already, 64-bit computers are becoming common.) UCS-X takes variable-length encoding to its logical conclusion.
Variable-length encodings solve the conflict between economy and capacity. Extension of UTF-8 and UTF-16 to support UCS-X doesn't cost anything in terms of their efficiency for code points that were already supported prior to the extension. In other words, you can have your one-byte-per-character ASCII text (UTF-X-8); or your two-byte-per-character Chinese/Japanese/Korean text (UTF-X-16). Then you can add UCS-X characters of any size to the text, without causing the one- or two-byte characters to become larger (with zero-padding, as would be necessary in a fixed-length encoding). Thus, the UCS-X extensions do not suffer from some essential problems (such as wastefulness and incompatibility) that characterized the conversion from ASCII to UCS-2. They also appear to solve the stinginess problem for all time.
Of course, it's an over-simplification to say fixed length bad. Some kinds of programming are more convenient using data items of fixed sizes. Nevertheless, a permanent, universal standard need not be crippled for all of eternity on the basis of an obsolete value for the largest convenient, efficient, or useful integer. Instead, an upper limit to anything so important and powerful as character encoding should be treated as a parameter which can be changed without having to invent a new encoding system or convert existing documents.
Currently, code points above U+10FFFF (including 99.948% of the original ISO 10646 code points) are illegal, even for private use. Although this is treated as a permanent limitation, it is no more permanent than the U+FFFF limitation was a couple of decades ago, or the U+007F limitation was a couple of decades before that. We must learn from history and plan ahead to support broad, innovative application of text-processing technologies without unnecessary limitations.
Standards are extremely important and valuable. UCS serves as a basis for numerous technologies, such as OpenType, PDF, HTML, XML, database formats, programming languages, search engines, operating systems, etc. The claim of a standard to be "universal" implies a commitment and responsibility to avoid imposing arbitrary limitations on compatible technologies. On first consideration, it may seem that any organization is entitled to impose limitations on itself. After all, bigger is not always better, and the standards organization people should not have to waste their time on evaluating and registering billions of doodles or dingbats. However, the U+10FFFF limitation is effectively imposed not only on the standards organizations themselves, but on every UCS application, meaning almost the entire infrastructure of information technology on which humanity increasingly relies. In the long run, if the need for more code points makes an application incompatible with UCS, then the benefits of unification may be lost. Developers might have to choose whether to comply with an unreasonable, outdated standard, or to use new encodings without clear standards. That dilemma can be avoided if UCS-X eventually becomes either a part of the Unicode Standard and ISO 10646, or a widely recognized, standard compatible extension.
Possibly the standards organizations should not concern themselves to a great extent with assignment of code points beyond U+10FFFF (other than the private-use code points assigned by UCS-X). However, there is some urgency that the essential technical infrastructure (encoding forms such as UTF-G-16 and UTF-E-8) should be designed and tested sooner rather than later. Code points beyond U+10FFFF should become legal and reliably supported under appropriate conditions, even if their assignment or interpretation may be postponed or left to other agencies (public or private). In other words, a standards organization might reasonably reject (or postpone) the role of managing the assignment of billions of individual code points, which would be an infinitely complex and never-ending administrative task. At the same time it might reasonably accept the role of defining encoding forms and protocols enabling the use of large code points, even if the only immediate application is for private use. This is a challenging but finite and realistically achievable engineering task, which we hope has already been accomplished to a considerable extent by UCS-X.
It has long been well-known that the set of Han 漢 or CJKV (Chinese/Japanese/Korean/Vietnamese) characters is open-ended. Only an infinite set of codes can encode all Han characters for all time. However, a real solution for handling rare/obscure/novel/variant Han characters needs to do more than simply assign numbers to them. CDL (Character Description Language) is a way to specify Han characters by their graphical analysis into strokes and components. If it were necessary to choose between CDL and UCS-X as a solution to Han encoding problems, CDL would be the better choice. Nevertheless, the two technologies are complementary; each solves problems that are not solved by the other. The authors began working on UCS-X in the context of their work on CDL. One use of UCS-X in the context of CDL would be to support more variation selectors (see 2.2 above).
The radio spectrum isn't just for AM/FM radios anymore. It's used for cell phones, televisions, RFID, communications satellites, and microwave ovens. Similarly, technologies that store, process, and transmit character code points, through encoding forms, have more potential applications than conventional text. The UCS code space is analogous to the radio spectrum: applications shouldn't be arbitrarily limited to a tiny fraction of the code space or the spectrum.
Telephone lines aren't just for voice anymore, even though that was originally their only purpose. By means of DSL (Digital Subscriber Line) technology, old telephone lines are used for web-surfing and videoconferencing. Fortunately, their built-in limitations did not prevent their extended uses. Unfortunately, extended use of UCS-based technologies may be prevented by the unnecessary U+10FFFF limitation.
The internet is based on the Internet Protocol (IP). Originally IP (IPv4) supported about four billion 32-bit addresses, but that is no longer enough, so IP has been extended to 128-bit addresses. Unfortunately, although the new specification (IPv6) has existed since 1998, it still hasn't been widely adopted in 2011. There are many problems with making the transition, which is urgent since IPv4 allocations are already exhausted in some parts of the world. Similarly, plans for UCS extension should be made more than twelve years before their need will become urgent, to allow time for implementation. Of course, the analogy may not hold to the extent that "twelve years" is an accurate estimate; the sooner we start planning, the better.
Many years ago a neighbor planted a sapling in his yard, several feet back from the property line, on the corner of his driveway about 15 feet back from the house. It wasn't clear at the time what the tiny tree's future might be: the neighbor wasn't sure what kind of tree it was. Over the years the tree got very big indeed, and today it is some seven feet in circumference. The once little redwood has grown into the driveway and well beyond the property line, tearing up the concrete sidewalk all the way to the street. A car can't fit in the driveway, and pedestrians have to step into the street to go past. The homeowner and city are now between a rock and a hard place: tree lovers want to protect the tree at all costs, but the city is pushing the homeowner to cut it down so that the sidewalk can be repaired. All of which goes to show: when planting a mighty redwood, it's best to recognize it as such. And when planting any tree with unknown potential, best plan for a redwood.