Computer Standards & Interfaces 14 (1992) 327-332 North-Holland
Commentary

Communicating in a multilingual world *

J.J. Andersen
IBM Corp., Systems Communications Div., P.O. Box 12195, C71/673, Research Triangle Park, NC 27709, USA
1. Introduction

Of all the concepts and technologies used in computer systems, few would seem to be as simple, stable and inherently reliable as the code used in computers to represent character data. Assign a number to each of the letters, digits, punctuation marks, and graphic symbols, and the problem should be solved! Unfortunately there are numerous such codes, so different computers often use different codes. As a result, data sent electronically from one computer to another often ends up being misprinted or misinterpreted. The errors thus caused are of widely varying complexity and severity, and often hinder full exploitation of interconnected computer systems. This paper describes the development of a family of character code standards which now makes it possible to correctly transmit text from one country to another, provided that the countries' languages share the same base alphabet. It also discusses a follow-on project which is creating a multilingual code standard that will allow error-free transmission of character-coded data anywhere on earth, using any language or languages, in any alphabet.

Correspondence to: J.J. Andersen, IBM Corp., Systems Communications Div., P.O. Box 12195, C71/673, Research Triangle Park, NC 27709, USA.

* The US standardization community conducts an annual National Standards Week Paper Contest and announces the winner during the Standards Week activities. Mr. Andersen won the 1991 award and received a cash award, commemorative plaque, and national recognition. His paper was published in the October 1991 ANSI Reporter and is reprinted here because of its direct application to computer standards and the timeliness of the subject to present needs and activities. Readers interested in submitting papers for future contests should be aware that the submission deadline is usually mid-June for the October Standards Week. For more information, contest rules, and application form, contact: National Standards Week Paper Contest, Standards Engineering Society, P.O. Box 2307, Dayton, OH 45401-2307, USA. Fax: 513/223-6307. The Editors
2. Background

Graphic characters, such as 'A', '$', and '9', are stored and processed by computers in coded form. Each character is assigned a number, which is actually made up of several binary digits, or 'bits'. Figure 1 shows a small segment of such a code, with several characters, their binary codes, and the decimal equivalents. Early in the development of the computer, it was felt that seven bits was a reasonable size for a character code, as it allowed as many as 128 characters to be coded. Thirty-three positions in the code were used for controls, which left 95 available for the coding of graphic characters.

A = 1000001 (65)
B = 1000010 (66)
C = 1000011 (67)
D = 1000100 (68)
(etc.)

Fig. 1. Portion of a 7-bit code, showing binary and decimal numbers assigned to each character.

0920-5489/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved

This was an extremely convenient number, as it coincided neatly with the number of characters that could easily be engraved on a keyboard, as well as the number of positions on the 'daisy-wheels' used by many printers. It was also marginally adequate to encode the letters needed to write English text, along with the digits, punctuation marks, and other common symbols required. The American Standard Code for Information Interchange (ASCII), for example, contains the 26 letters of the Latin alphabet in both small and capital form, the digits 0 through 9, the space, and 32 symbols such as period, comma, and question mark. IBM's Extended Binary Coded Decimal Interchange Code (EBCDIC), although an 8-bit code which differs from ASCII in several respects, originally contained essentially the same set of characters.

Text written in languages other than English normally requires additional letters. Spanish, French, German and Italian, for example, although they are also written with the Latin alphabet, use various accented letters. Since the total number of accented and unaccented Latin letters, plus the needed symbols and digits, is more than can be accommodated by a code that can only contain 95 characters, these other languages each require their own character codes. As a result, there are different code standards for the US, France, Germany, Italy, and Spain, even though their various languages are all written with the familiar Latin alphabet. Languages with different alphabets obviously cannot use the same code, so there are also separate code standards for Arabic, Cyrillic, Greek, Hebrew and so on. Different applications, devices, and manufacturers, like different countries, may also have different character requirements. The factors that go to make up a 'good' character code are diverse and complex, and change over time, as technologies change.
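The arithmetic behind the 7-bit layout described above is easy to check. The short sketch below (in Python, purely illustrative — the article itself names no programming language) counts the 33 control positions and 95 graphic positions of ASCII and reproduces the binary/decimal pairs of Fig. 1:

```python
# A 7-bit code has 2**7 = 128 positions. In ASCII, positions 0-31 and
# 127 are controls (33 in all), leaving 95 graphic positions: the
# space plus 94 printable characters.
controls = [n for n in range(128) if n < 32 or n == 127]
graphics = [n for n in range(128) if 32 <= n <= 126]
print(len(controls), len(graphics))  # 33 95

# The binary/decimal assignments shown in Fig. 1:
for ch in "ABCD":
    print(f"{ch} = {ord(ch):07b} ({ord(ch)})")
```

The 95 graphic positions account exactly for the repertoire listed in the text: 26 + 26 letters, 10 digits, the space, and 32 symbols.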
At one time or another, for example, codes have been designed that attempted to optimize the time needed to sort records, the number of transistors needed to perform a given operation, compatibility with previous codes, the ability to determine whether a given character was a letter or a digit, the number of keys needed to be depressed on a keyboard in order to generate a given character, the type of arithmetic instructions performed by a computer, the location of capital letters with respect to their lower-case equivalents, and a host of other parameters which often seem irrelevant today.

Fig. 2. Portions of three different codes.

These changing requirements, along with the ever increasing demands of users for more symbols, more letters, and more languages, have resulted in the creation of an ever increasing number of character code standards over the past 35 years or so; they now number in the hundreds. The IBM Corporation alone has documented over 250 different character codes. Although it is true that the American National Standards Institute (ANSI), the European Computer Manufacturers Association (ECMA), the International Organization for Standardization (ISO) and various other national and international standards-making organizations have created character code standards in an attempt to control the proliferation, their efforts were inevitably hindered by the limitations inherent in a 7-bit code, as well as the steady progress of technology and user requirements. As a result, there are often two or more such standards covering a given required alphabet or application, and there are dozens of alphabets and applications that require their own unique codes. Figure 2 shows the graphic characters assigned to code values 65 through 72 in three different codes, including one for APL (A Programming Language), and illustrates the wide variety of characters that can be found in existing 7-bit codes.

The existence of hundreds of different character code standards would not be a problem, were it not for the need to send data from one computer to another. As long as data stays within a single computer system the choice of character code is more or less immaterial, provided that all the components of that system understand the same code.
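The divergence that Fig. 2 illustrates is still visible in the code tables that survive today. As a hedged illustration (using codecs bundled with Python, not the exact codes of Fig. 2), the same letter 'A' receives the value 65 in ASCII but 193 in an EBCDIC code page:

```python
# The same letter is assigned different numeric values by different
# codes; cp037 is one EBCDIC code page shipped with Python.
for codec in ("ascii", "cp037"):
    value = "A".encode(codec)[0]
    print(codec, value)
# prints: ascii 65, then cp037 193
```

A byte stream sent between two systems that disagree in this way will be misread unless it is translated.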
Transmission of data between systems, however, increasingly involves data being sent from one country to another, or being received by a system manufactured by a different vendor, or in some other way exposing the fact that the code used by the sender differs from the code used by the receiver. The result is data that is displayed, printed, or processed incorrectly.
This has drastically inhibited the data processing industry's ability to provide error-free transmission of data across national borders, as well as the user's ability to interconnect equipment manufactured by different vendors. The conventional response to character code mismatch problems is to convert, or translate, the data to the correct code. This is often easier said than done, however. It is only possible if the sending code is known to the receiver, which may not be the case. Also, it is possible only if character data can be distinguished from other types of data. The most fundamental problem with conversion, however, is that it is not possible to convert a character in the incoming data if that character is not included in the code of the receiving system. This is particularly significant when sending data from one country to another, for the alphabet and code of one country may include a different set of letters from the alphabet and code of another.

Misprinting of characters is not the least of the problems caused by mismatched or misconverted character codes, although it is the most common and most obvious. While many errors of this type are obvious on close inspection, others are more subtle, such as the now well-known substitution of one currency symbol for another. Printing $100,000 rather than ¥100,000, for example, would misrepresent the original amount by a factor of more than one hundred. The result would be embarrassing at best, and extremely costly at worst. More complex problems are also possible, such as the inability of the recipient of a file to retrieve that data because one or more characters in the file's name cannot be generated at a keyboard. Misinterpretation of a character can also result in unexpected and incorrect operation of a program, resulting in errors of almost any imaginable type.
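The fundamental conversion failure described above — a character with no slot in the receiving code — can be sketched as follows (Python used for illustration; ASCII stands in for any 95-character receiving code):

```python
# Converting 'España' to a code that has no position for 'ñ' cannot
# succeed; the conversion routine must fail, drop the character, or
# substitute something else.
try:
    "España".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot convert:", err.object[err.start])
# prints: cannot convert: ñ
```

Whichever choice the converter makes, the receiver no longer holds the text the sender wrote.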
These errors may also result from incorrect code translation, and may be dependent on various otherwise unimportant factors, such as the location of the program, the location of the user, or even the time of day! Problems stemming from the need to deal with a multiplicity of supported character codes have become an increasing concern to IBM and its customers. In a 52-page report published in 1985, an IBM user group in Europe reported on problems resulting from the use of a variety of character codes and keyboards. Although limited to problems encountered in non-English-speaking countries which use the Latin alphabet, the study still found that users often experienced inconvenience and frustration, and that products suffered in function, user friendliness, and/or performance.
3. Resolution
One of the first comprehensive attempts within IBM to resolve the character code problems of its customers was the establishment in 1978 of an EBCDIC multilingual character code which contained a set of 192 graphic characters. The code was termed multilingual because it included not only the 26 letters used in writing English, but also the accented Latin letters used in writing many common Latin-based languages. For the first time, an IBM EBCDIC code broke the 95-character barrier, and all available positions were filled with graphic characters. This code was implemented in several IBM products used for dedicated word processing, but was deemed unacceptable for data processing products because of incompatibilities with the codes being supported at the time. It did lay the groundwork for the future, however, as it established a specific set of 192 characters as being sufficient to meet the common needs of a large number of languages and countries.

During late 1979 and early 1980, efforts began in IBM to resolve the incompatibility problems that had prevented the EBCDIC multilingual code from becoming a more universally implemented code. A task force was formed in April 1980, and it was proposed that the existing 95-character codes should be extended, using whichever characters from the multilingual code might be necessary in each case. That is, each should be extended so as to include all the characters of the original EBCDIC multilingual code. Extension of the older codes in this way would not only add many needed characters, but would make code translation possible between any of the codes. The suggestion was eventually adopted as the prime recommendation of the task force, and efforts were begun within IBM to design and standardize the new Country Extended Code Pages, or CECPs.
During the same period several other IBM task forces and studies began looking at character code, keyboard, and national language problems, at least partially as a response to the user group paper. An internal IBM Coded Character Set Strategy was published in September 1983, which outlined the known problems and established a strategy for dealing with them. A critical component of that strategy was the recommendation to establish common sets of characters to be supported across all EBCDIC, ASCII and personal computer codes used within given geographical or cultural areas. The CECP multilingual character set was recommended as the common set to be supported in those countries which require the Latin alphabet. Shortly after, IBM established an organization in Toronto which was given the mission of establishing an architecture to guide the development of products in a way consistent with the overall character code strategy. This group subsequently clarified and expanded the strategy, and documented specific requirements for conforming implementations. The CECP recommendations of the IBM task force were studied and debated throughout the company, and even became the subject of a second task force, but were ultimately agreed upon as the most practical solution to the character code problems. The codes were approved within the company, and became internal IBM standards in 1984. Each contained the original 95 characters of its predecessor, in their original positions, so compatibility with earlier products and data bases was assured. As each had been extended to include the full multilingual character set, translation from one to another was always possible. Since the character repertoire was drawn from the original multilingual word processing code, translation between word processing and data processing products and data was also possible. 
The new CECP codes verified the utility of the original multilingual set of characters, and established the principle that there is value in standardizing a common set of characters across several different codes, even if the actual code values may differ.

Multilingual standardization efforts in ISO, the International Organization for Standardization, followed a somewhat similar course, in that they began with an 8-bit, western European code standard, ISO 6937-2, approved in 1983, which provided a similar but larger set of characters, and also ran into compatibility problems. The total repertoire of characters supported by this standard is approximately 350, which is possible because of a somewhat controversial technique used to code accented letters. A set of unique diacritical marks, or accents, was provided, each of which can be used in conjunction with various letters so that the combination of accent plus letter represents an accented letter. The two-character combination of the codes for 'n' plus '˜' therefore represents the code for an 'ñ', for example. While this use of 'non-spacing' diacritics was quite convenient and efficient for the transmission of data, it did not gain wide favor with programmers and engineers, who argued on technical grounds that the code for a letter should always be one character, regardless of whether the letter was accented. The standard was for that reason less widely accepted in the data processing industry than it was with the public text communication services for whom it was originally intended.

By 1983, then, the data processing industry had made several attempts to resolve some of the problems associated with sending text electronically from one country to another, and between computers which supported different character codes. Dissatisfaction with the non-spacing diacritic aspect of ISO 6937-2 led ANSI, ECMA and ISO to independently explore the possible development of a second 8-bit multilingual code standard. The new standard would include accented letters only as complete characters, as had been the practice in earlier national code standards. IBM's experience with its EBCDIC multilingual codes appeared directly relevant, and the company participated actively throughout the development of the new standard.
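Unicode later adopted a similar 'combining mark' mechanism (with the mark following the letter rather than preceding it, as in ISO 6937-2). The sketch below uses Python's `unicodedata` module as an analogy to the technique, not an implementation of ISO 6937-2 itself: a two-character letter-plus-tilde sequence composes into the single character 'ñ'.

```python
import unicodedata

# 'n' followed by a combining tilde is two coded characters...
decomposed = "n\u0303"
print(len(decomposed))  # prints 2

# ...but normalization composes the pair into the one-character 'ñ'
# that programmers and engineers preferred.
composed = unicodedata.normalize("NFC", decomposed)
print(composed, len(composed))  # prints ñ 1
```

The objection recorded above — that a letter should always be one character — is exactly the distinction between the two strings in this sketch.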
Although the desired code would be able to contain twice as many characters as one of the older 7-bit codes, the choice of which 192 to include was hotly debated over the next year or so in meetings of the various organizations. The national standards bodies of many countries were involved, each with its own set of favored characters. Numerous companies were represented on the committees and many, like IBM, had their own candidates. Is the 'Œ' ligature used in France really a letter? Should the 'ij' used in the Netherlands be considered one letter or two? What about the Catalan 'l·l', and the 'LL' required to write Spanish? Are the typographically correct opening and closing quotation marks necessary? Do we really need to include the Icelandic Thorn and Eth letters? Has anyone ever seen a capital Y with dieresis? These and countless other technical and philosophical questions occupied the committee members until consensus began to emerge.

By early 1984, three competing versions of a draft multilingual code standard had been prepared by the American committee, the ECMA committee, and the ISO committee. While all were similar, each included a few characters that the others did not. A plenary meeting of the key ISO committee, TC97/SC2, was held in Kyoto that spring, and many of those in attendance were also members of one or more of the other committees working on the problem. During informal discussions, agreement was finally reached on a compromise proposal which resolved the differences, and the resulting code was eventually supported formally by each group.

The new code, which has come to be known as Latin-1, was formally adopted under the designations ISO 8859-1 and ECMA-94. Several national standards organizations have also approved the code as an 8-bit code standard for their countries, and it is extensively endorsed and supported. The same code, or its EBCDIC or personal computer equivalent, has also been adopted as a standard by many companies in addition to IBM. The Latin-1 code family can thus be seen to represent a unique and broad-based agreement between national and international standards bodies, as well as industry, covering the requirements for multilingual character codes throughout the world. Somewhat ironically, the code was approved by the U.S. committee, X3L2, but has not yet become a formal U.S. standard, due to staffing problems within the committee and various other procedural problems.
That fact has not hindered support for the standard, however, as the international influence of the ISO standard has been sufficient to ensure its acceptance in the United States. IBM's contribution to the creation of the Latin-1 code can be seen by comparing the characters in it with the set of characters contained in IBM's CECP codes. Of the 192 total, all but four of IBM's CECP character set were included.
These four characters (double underscore, dotless i, numeric space, and florin symbol) were considered of marginal utility by the various committees, and were replaced by others felt to be more often needed. IBM subsequently modified its CECP codes by making the same substitutions, so that today's EBCDIC CECP code pages contain precisely the same set of characters as the ECMA and ISO standards. An IBM personal computer code also supports the Latin-1 set of characters, so conversion of transmitted data is always possible, whether products support IBM EBCDIC or personal computer codes or the Latin-1 ISO code of another manufacturer. Using products supporting a Latin-1 code, it is now possible to send documents electronically from one country to another with the assurance that every character will be correctly processed and printed upon receipt. German or Spanish text, or mixtures of the two, for example, will be printed out using the correct accented letters, whether the document is received in Germany, the U.S., Peru, Italy, or any other country that uses the same Latin-1 set of characters.

The principle has since been extended to an entire family of standards, so that Eastern European countries can now use a Latin-2 set of characters, which is tailored for the somewhat different set of accented letters used in writing those languages. An Arabic version of ISO 8859 offers the same capability to all countries which use the Arabic script, and Hebrew, Greek and Cyrillic versions of the family have also been standardized. The 26 unaccented letters of the Latin alphabet are common to each member of the family, so that English-language text can be transmitted no matter which code is selected. The Latin-1 codes now make practical the establishment of world-wide multilingual networks, connecting the equipment of different vendors, and tying together the various components of an international business.
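Because the EBCDIC CECP pages and ISO 8859-1 share one character repertoire, translation between them loses nothing, even though the code values differ. A small sketch of this property, with Python's `cp037` codec standing in for an IBM CECP EBCDIC page (an assumption made for illustration):

```python
# The same text coded under ISO 8859-1 (Latin-1) and under an EBCDIC
# code page: the byte values differ, but both decode back to the
# identical sequence of characters, so conversion is lossless.
text = "mañana"
latin1 = text.encode("latin-1")
ebcdic = text.encode("cp037")
print(latin1 != ebcdic)  # prints True: different code values...
print(latin1.decode("latin-1") == ebcdic.decode("cp037") == text)
# ...prints True: same characters on both sides
```

This is precisely the guarantee that a shared repertoire buys: every character the sender can code, the receiver can represent.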
As an example, documents and messages from Switzerland, whether written in English, French, German, or Italian, can be sent to offices in Europe, Central America, or Australia. Names can be spelled correctly, even if they contain foreign characters not ordinarily used in the originating country. A name like José Peña can be processed and printed correctly even in the United States, where data processing products formerly did not support the é or the ñ. While this advantage may seem only cosmetic, it is critical in countries which by law require that names be spelled with the correct accents if a document is to be considered legal.

Of course it is impossible to estimate specific cost savings that have been derived from the adoption of the Latin-1 code standards, but it is clear that both manufacturers and users have benefited immeasurably. The ability to correctly transmit documents and messages from one country to another, using the equipment of several different vendors, and in any language or languages, is becoming the norm rather than the exception. The Latin-1 codes help make that possible, and ensure that every character transmitted is the one that the sender intended.

4. The future
While the new multilingual character code standards have solved many of the problems caused by the multiplicity of country-, company- and application-unique codes, they are by no means a panacea. Communication between countries using different scripts remains difficult to impossible, and the thousands of ideographic characters used in East Asia present additional difficulties. In fact, it is impossible for a single 8-bit code to contain dozens of alphabets and thousands of characters, and the inevitable result is the continued existence of many different code standards. The real panacea, if one is possible, must be a single code standard that can contain all characters used anywhere in the world. To be able to contain all the tens of thousands of characters known to exist, a code must be composed of at least 16 bits, or 2 octets. The capacity of such a code would be 65,536 characters, which it is believed would be more than sufficient for all but the most demanding of applications. In recognition of the need for such a global standard, the International Organization for Standardization has been working for several years on a multiple-octet character code standard that would contain all the world's characters. It is a 32-bit (4-octet) code, although a 16-bit subset will be provided which contains the most commonly used characters. The additional code space will be more than sufficient for coding rare and obsolete characters, as well as providing for virtually unlimited future expansion.
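The code-space arithmetic above can be verified directly; the worked example below (Python, purely illustrative) also packs one 16-bit code value into its two octets:

```python
# A 16-bit (2-octet) code offers 2**16 positions; a 32-bit (4-octet)
# code offers 2**32.
print(2 ** 16)  # prints 65536
print(2 ** 32)  # prints 4294967296

# A single 16-bit code value occupies exactly two octets on the wire,
# e.g. the value 0x00C5:
print((0x00C5).to_bytes(2, "big"))  # prints b'\x00\xc5'
```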
Although the progress of the new standard (ISO DIS 10646) has been substantially more rocky and contentious than that of ISO 8859-1, the work is nearing completion. The discussions have been even more esoteric, with frequent debates over whether characters are significant because of their shape, meaning or both, whether various characters are unique or merely shape variations of other characters, and whether a character is or is not a duplicate of another with similar shape but different name. The technical details of the code structure have also been the subject of considerable scrutiny, as have the requirements for conformance. Competition has also arisen from an industry-sponsored group which is preparing a similar standard, but many participants in both groups concede that one standard must be the goal rather than two. Recent attempts to merge the two efforts have been encouraging, and it is now at least moderately probable that the two groups will be able to resolve their differences. No matter what the outcome, it is significant that both codes currently use the ISO Latin-1 code as a base, with the same set of characters, coded in the same order. The Latin-1 family of character code standards has therefore not only brought a significant improvement to data communication in the single-octet environment of today, but is also providing the base for the multiple-octet computer environment of the future.

Jerome Andersen is a Senior Programmer with the IBM Corporation, having worldwide standards responsibility for Coded Character Sets. He joined IBM in June 1957 as a SAGE System Field Engineer, and has held a variety of technical and management positions since that time. He has been in his present standards position since April 1982, and was the recipient of a Corporate Standards award in 1985. Mr. Andersen has been IBM's principal member of X3L2, the technical committee responsible for US character code standards, for over nine years, and is currently Chairman of that committee. He also represents IBM in ISO/IEC JTC1/SC2/WG2, the international standards committee responsible for producing a global multi-octet character code, and served as Convenor of that committee from 1987 to 1990. He has presented papers on that work in Copenhagen and at the 1989 International Symposium on Standardization for Chinese Information Processing in Beijing. Mr. Andersen is married, and the father of four children. He and his wife live in Raleigh, North Carolina.