Transitions to New Standards
Networks and Coded Character Sets
Johan W. VAN WINGEN
Independent Consultant, P.O. Box 486, 2300 AL Leiden, Netherlands (Tel.: +31 71 276900; Fax: +31 71 143739)
Abstract. The paper presents a tutorial on standards for coded character sets, developed by ISO and related organisations, and IBM. It also reviews recent developments, to which network standards have paid little attention up to now, X.400 (ISO 10021) and ASN.1 (ISO 8824) being based on character concepts already outdated 5 years ago.
Keywords. Character sets, ISO, X.400, ASCII, SHARE, IBM.
Johan W. Van Wingen, educated as a mathematician, started activities in computing in 1965, working for TNO, and later, Leiden University. He is now an independent consultant. In 1973, he became a member of the Standards Committee for Programming Languages in the Netherlands. From 1979, he was Convener of the ISO Working Group on ALGOL, and Project Editor for the Standard ISO 1538 for ALGOL 60, issued in 1984. He has represented the Netherlands at meetings of ISO/IEC JTC1/SC22, "Languages", from 1977, and, more recently, of SC2, "Characters and Information Coding", and of SC18, "Text and Office Systems". He is now Liaison Representative of JTC1/SC2 to SC22, and Chairman of the Netherlands SC2. He is also the author of several fundamental contributions to the work of JTC1/SC2. He has been active in SHARE Europe (SEAS) since 1968 and was a contributor to the Report of the SHARE ASCII/EBCDIC Character Set Task Force. He also takes part in the National Language Architecture Group of SEAS.
North-Holland Computer Networks and ISDN Systems 19 (1990) 275-284
1. Introduction
1.1. The Problem
Communication between human beings, or computers in general, uses language as a vehicle. Statements, when written, consist of characters. When these are to be interchanged by an electronic medium, they need to be coded. These considerations form the starting point for the discussion between the networks specialists and the standardisers of coded character sets, the category to which the author belongs. This paper is intended to present a contribution from his side of the table.

Both networks and character codes serve a multilingual community, a good reason to include language aspects in general in the discussion. In recent years, there has been an increasing demand for computer facilities that do not need the English language for their expression. In the field of International Standards, this affects, in the first place, the work of ISO/IEC JTC1/SC2, Characters and Information Coding, because this committee develops the elementary tools for expressing everything dependent on language. There is an increasing awareness amongst standards developers in other fields, such as programming languages, databases, and networks (which are important users of these tools), that they are becoming a target for requirements from non-English speakers.
1.2. Standards Producing Bodies
Standards development in the field of coded character sets is not concentrated in one single group; several are active:
- ISO/IEC JTC1/SC2 Characters and Information Coding:
  - WG2, Multiple Octet Coding,
  - WG3, 7 and 8-bit Coding,
- ISO TC46/SC4 Computer Applications in Information and Documentation:
  - WG1, coded character sets for bibliographic use,
- CCITT, Study Group VIII,
- CEN/CENELEC, the Joint European Standards Institution,
- IBM.

0169-7552/90/$03.50 © 1990 - Elsevier Science Publishers B.V. (North-Holland)
J. W. Van Wingen / Networks and coded character sets
1.3. Terminology, Notations and Conventions
The terminology used in this paper is that of the ISO standards in the field. The terms "bit pattern", "bit combination" and "byte" are used almost as synonyms; "bit string" is not used. "Byte" is not restricted to 8-bit combinations. For those, "octet" is used instead. Instead of CCITT terms, like IA5 or X.400, the corresponding ISO names are used (ISO 646 IRV, MOTIS).

2. Coded Character Sets

2.1. The Birth of ASCII
The idea of coding data is rather old. For several purposes, it appeared necessary to represent texts or numbers in a form other than that in which they are spoken or written. The Morse code was an important step in a long development, as was the Hollerith punched card. The idea of having holes as a unit of information, the bit, was very fruitful, and could be generalized for use on electronic media. As early as 1931 the 5-bit TELEX code (CCITT #2) was adopted, introducing the concept of bit pattern or bit combination.

Three main areas of application of representing data with bit patterns emerged in the course of time: (1) storage of data, (2) transmission of data, and (3) processing of data by a computer. The increasing use of electronic methods necessitated the adoption of standards, which had to serve the areas of application where data interchange was of primary importance. Thus ASCII, a 7-bit code (characters mapped on 7-bit patterns), saw the light in 1963.

ASCII provided codes (assigned bit combinations) for 94 graphic characters (26 letters, 52 after 1968, 10 digits and 32 specials), the SPACE, and 33 control characters for control functions. The code table is in Fig. 1. The control characters are in columns 0 and 1, the capital letters in 4 and 5, small letters (after 1968) in 6 and 7, digits in 3, SPACE at position 2/0, DELETE at 7/15, and specials in the positions left over.

[Fig. 1. Code tables: ASCII (ANSI X3.4-1968); ISO 646 for Norway (NS 4551); EBCDIC (according to IBM GX20-1850).]

ASCII was designed by its structure to serve the first two application areas well:
- By assigning to letters bit patterns in ascending order without gaps, a contiguous "collating sequence" could be defined, easily implementable on an electronic device. (The old telex code did not possess this property.)
- By providing codes for control functions and making them easily recognizable by putting them together in two columns of the code table, ASCII was well suited for transmission of data, text in particular.

For internal processing by a computer, ASCII was not very well adapted. A 7-bit machine word is hardly usable. For internal representation of codes, 6-bit or 8-bit "bytes" were much better.
6-bit bytes could be contained in a 24-, 36-, 48- or 60-bit machine word 4, 6, 8 or 10 times, and 8-bit bytes in a 32- or 64-bit word. Only DEC succeeded in putting 5 ASCII characters into a 36-bit word. It is no surprise that many computer manufacturers defined their own 6- or 8-bit coded character sets for their specific machine use. EBCDIC from IBM (Fig. 1) became particularly influential.

ASCII has another important property (not present in the old TELEX code). Every character of the set has a unique code, and every bit combination has a unique meaning. The presence of 8-bit bytes in a computer poses a new problem. If we want to transfer collections of these outside the computer, ASCII does not provide facilities. We may define certain 8-bit combinations as being equivalent to ASCII codes, but even then we are faced with the fact that there are 128 left without a clear meaning.
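The layout of the ASCII code table described in this section can be checked mechanically. The following is a minimal Python sketch, not part of the original paper; the helper names `position` and `region` are hypothetical and introduced only for illustration. It derives the column/row notation used here (for example 2/0 for SPACE) from a 7-bit combination, counts the three classes of positions, and confirms the gap-free ordering of the letters.

```python
from collections import Counter
import string


def position(code: int) -> str:
    """Return the column/row notation (e.g. '2/0') of a 7-bit combination."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit combination")
    return f"{code >> 4}/{code & 0x0F}"


def region(code: int) -> str:
    """Classify a 7-bit combination as control, space or graphic."""
    if code >> 4 <= 1 or code == 0x7F:   # columns 0 and 1, plus DELETE at 7/15
        return "control"
    if code == 0x20:                     # SPACE at 2/0
        return "space"
    return "graphic"


counts = Counter(region(code) for code in range(128))
print(counts["graphic"], counts["space"], counts["control"])  # 94 1 33
print(position(0x20), position(0x7F), position(ord("A")))     # 2/0 7/15 4/1

# Letters occupy ascending positions without gaps (the collating sequence).
capitals = [ord(ch) for ch in string.ascii_uppercase]
print(capitals == list(range(0x41, 0x5B)))                    # True

# An 8-bit byte has 256 combinations; only 128 can be equated with ASCII codes.
print(2 ** 8 - 2 ** 7)                                        # 128
```

The counts agree with the figures given in the text: 94 graphic characters, the SPACE, and 33 control characters, leaving 128 of the 256 octet values without an ASCII meaning.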
2.2. Extension of the Character Set
It was clear from the start that ASCII deserved an international status such as could be achieved under the responsibility of ISO. Because countries other than the US have different requirements for the contents of the coded character set, the approved document ISO R 646 contains options for a number of positions in the code table. Once exercised, the result is a National Version, ASCII being the US National Version. Unfortunately this implied that the principle of unique code-character correspondence was abandoned.

With the rules of ISO R 646-1968 (revised in 1973 and 1983 as ISO 646), it became possible to code texts in Danish or Swedish, which carry a 29-letter alphabet, at the price of losing 6 specials. Further needs such as accented letters (for French) or additional specials could not be satisfied. For this purpose, an extension scheme was devised and standardized as ISO 2022. The idea is that different characters may be coded with the same bit combination. To indicate which character is meant, a control function SHIFT is inserted (several are defined) or an ESCAPE sequence with analogous effects. On reading (or receiving), each time a SHIFT or ESCAPE sequence is detected, the "state" of the reader changes, and a different code table is accessed. (Possible code tables may be a national version of ISO 646 or one registered in the International Register for Coded Character Sets to be used with Escape Sequences.)

ISO 2022 provides the means for coding an almost unlimited number of characters by a single, but not unique, bit combination. It was not restricted to 7 bits, but was later extended to include 8-bit coded character sets as soon as the structure of these was defined in ISO 4873. Because reading data encoded according to ISO 2022 requires a finite state machine with very many states, practical use has never been extensive. With the advent of hardware with 8-bit facilities, partial solutions for the more urgent problems became feasible. Nevertheless, ISO 2022 supplies the general method in all cases where switching of code tables is unavoidable. Even for multiple-byte coded sets, rules are defined. Sets are identified with a set designation C0, C1, G0, G1. Control characters occupy columns 0-1 (C0) and 8-9 (C1). Columns 2-7 (G0) and 10-15 (G1) are reserved for graphic characters.

ISO 4873 specifies the structure of 8-bit coded character sets, but does not define a single code table. It fixes the content of some areas, but for the rest only options are given. It does not specify the control characters in columns 0-1 (C0) and 8-9 (C1). Columns 2-7 (G0) are identical (in the revised standard) with those of ISO 646 (the 1990 edition) and ASCII. For 10-15 (G1), options for 94 or 96 characters are specified. Thus ISO 4873 is only a generic standard. It specifies three levels. In Level 1, no shifts are allowed (single octet coding), providing for 188 or 190 graphic characters. Here an octet always has a context-independent meaning. In Levels 2 and 3, shifts may occur to a fixed G1, G2, G3, providing for at most 383 different graphic characters. With Level 2, single shifts are allowed, with Level 3 also locking shifts (three-state machine). The meaning of an octet depends on context, and with Level 3, on serial reading in one direction. ISO 10367 specifies available sets.

2.3. Composite Characters
In order to restrict the complexities of coding by the ISO 2022 method, especially where hardware does not allow midstream code table switching, other approaches for extending the available number of graphic characters were recommended. Some characters can be represented by combinations of several other characters. Following the practice of overprinting, ISO 646 allows creation of composite graphic characters by the use of BACKSPACE and/or CARRIAGE RETURN. But it warns (on p. 7): "According to clause 5, it is permitted to use composite graphic characters, and there is no limit to their number. Because of this freedom, their processing and imaging may cause difficulties at the receiving end. Therefore, agreement between sender and recipient is recommended if composite characters are used."

To avoid this pitfall, ISO 6937 follows a different approach. There are simple and composite graphic characters. Several characters are coded with a single bit combination (digits, specials, letters of the Latin alphabet, and some additional ones). Others are coded by a double one: the first representing a diacritical mark (non-spacing), the second a Latin letter (spacing). Arbitrary composite graphic characters are not allowed. The number of graphic characters defined by ISO 6937 is restricted to those occurring in a "repertoire". Equally, not all "duples" are permitted, only those included in the repertoire. It is assumed that these duples can be displayed by hardware (the "character imaging device") as one single graphic symbol. ISO 6937 defines a "primary" (G0) and a "supplementary" (G1) set, which can be combined to form the graphic character part of an 8-bit code (popularly, the left-hand and the right-hand side of the table). In this way, a unique, but mixed single/double octet representation of characters is created. All European languages and several others can thus be represented.

ISO 8859 was developed to provide a unique single octet representation of graphic characters corresponding with Level 1 of ISO 4873. Because not all characters that are desired can be accommodated in a 94 + 96 code table, ISO 8859 is in several parts, each defined for a particular region of the world, and serving the needs of groups of languages (Europe: West, East, North, South; Cyrillic, Greek, Arabic, Hebrew). Parts 1-4 contain Latin Alphabets 1-4, but no. 5 is Part 9. This one has Turkish letters where no. 1 has Icelandic. Thus both languages are mutually exclusive, as are most of those from the West to those of the East, establishing an Iron Curtain in standards.
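The exclusion of Icelandic and Turkish can be made visible with a small Python sketch. It is not part of the original paper and relies on an assumption: that the codecs `iso-8859-1` and `iso-8859-9` shipped with Python's standard library faithfully reproduce Parts 1 and 9 of ISO 8859. The helper names `differing_positions` and `encodable` are hypothetical and introduced for illustration only.

```python
def differing_positions(codec_a: str, codec_b: str) -> list:
    """Return the octets in columns 10-15 that the two code tables assign differently."""
    return [
        octet
        for octet in range(0xA0, 0x100)
        if bytes([octet]).decode(codec_a) != bytes([octet]).decode(codec_b)
    ]


def encodable(text: str, codec: str) -> bool:
    """Tell whether every character of text has a code in the given table."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


octets = bytes(differing_positions("iso-8859-1", "iso-8859-9"))
print([hex(octet) for octet in octets])  # ['0xd0', '0xdd', '0xde', '0xf0', '0xfd', '0xfe']
print(octets.decode("iso-8859-1"))       # ÐÝÞðýþ  (Icelandic letters, Part 1)
print(octets.decode("iso-8859-9"))       # ĞİŞğış  (Turkish letters, Part 9)

# Neither table can hold the other language's letters.
print(encodable("þ", "iso-8859-9"), encodable("ş", "iso-8859-1"))  # False False
```

The same six octets thus carry different letters depending on which part is in force, and a text needing both sets cannot be coded in either part alone.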
Each part contains the 94 (G0) from ASCII as a subset, supplemented by a varying 96 (G1) set (Fig. 2). Thus, the code of ISO 8859 is only unique in a restricted sense. Where graphics from different regions are to be combined in a text, switching techniques from ISO 2022 are required.

2.4. Standards from Related Bodies
The bodies mentioned in the introduction aim at harmonizing their standards as much as possible with those of ISO's JTC1/SC2. ISO/TC46 restricts itself to languages and their features not yet covered elsewhere, but used in bibliography, extending ASCII. CCITT Study Group VIII works in close contact with SC2, producing Recommendations numbered with a "T". CEN/CENELEC aims at reducing options in ISO standards by creating "functional" standards, relevant to Europe.

[Fig. 2. Code tables of ISO 8859-1 and ISO 8859-2.]

2.5. IBM
IBM specified EBCDIC as a universal code for its new 360 range in 1965. It is an 8-bit code, but graphic characters were specified originally only for 94 positions (Fig. 1). Columns 1-4 were reserved for control characters. The structure is completely different from ASCII and ISO 4873, and clearly shows the influence of coding for Hollerith punched cards. Later on, positions for national characters were defined, but were assigned differently for each country. Recently, complete 192-graphic-character sets were introduced, but, in order to preserve continuity with current practice, these contain the same characters (taken from ISO 8859-1 for the Western World) differently distributed over the code table for each country. Thus, a large number of Country
[...] National Language Architecture. Response from IBM indicates that these reports are taken very seriously.
2.6. Control Functions
From the beginning, certain actions to be performed on receiving data were indicated by certain codes. There were 32 positions reserved in ASCII, and 64 with the 8-bit codes. After some time, people realised that a function is not a character and that more than one byte would be required to accommodate a growing number of control functions, having various properties and even carrying parameters. Thus, control sequences were introduced. The standard for these is ISO 6429, supplemented by ISO 10538. For continuity, the old 64 positions were kept for those functions to be represented by a single octet, but it became obvious that these were never implemented in their entirety, reason enough to remove them from the conformance requirements of revised standards (like ISO 646-1990).

[Fig. 4. Code tables for version LATIN12 (ISO 8859-1 included, as is) (K1 + B + L1 + M1) and for version LATIN21 (ISO 8859-2 included, as is) (K2 + B + L2 + M2).]
2.7. All-graphic Code Tables
The problems met with several parts of ISO 8859, causing languages to be mutually exclusive within one document, together with the developments in the PC field (other manufacturers produced solutions of their own), made the Netherlands National Body (NNI) of ISO realise that an initiative was needed in order to standardize PC code tables. A Proposal is now out for letter ballot at ISO/IEC JTC1, closing August 6, 1990. If it succeeds (5 countries are required to participate in the effort), the restriction of graphic characters to 192 with a single octet code can be removed by reducing the number of control functions coded by a single character. Examples of possible code tables are given in Fig. 4.

2.8. Multiple-byte Coded Character Sets
Multiple-byte character sets have attracted a lot of attention in recent times. From this it might seem that it is a clearly defined concept. But it is not. It is not even a new one. Four schemes have emerged so far:
- That of ISO 646. Characters may be represented by 1, 3, 5 or more bytes, by use of sequences such as "char" BACKSPACE "char", and so on.
- That of ISO 6937-2. Graphic characters may be formed from diacritic plus letter, giving a mixed single/double byte representation.
- A standard in development by SC2/WG2, to which the number DP 10646 has now been assigned. All imaginable characters of the whole world (except cuneiform and hieroglyphs) are uniquely represented with up to 4 bytes per character (but uniformly, not mixed). This is the price to be paid for doing without ISO 2022. The national standards of China, Japan, and Korea are included "as is". Thus, a 3 or 4 octet representation may be required when all three of these CJK languages are to occur in one single document, where otherwise 2 would do. This makes 10646 rather complex, and at the moment serious disagreements exist, with the US, Canada, and China on one side, and Japan and Europe on the other.
- A scheme proposed by a group of manufacturers on the US West Coast, called UNICODE. This is a consistent two-octet approach. No positions for control characters are left, except those from ISO 4873. For each of the East-Asian ideographs included, only one code is defined, disregarding its use in national language or culture. Unfortunately for its creators, this idea is not liked at all by the Japanese, while the Chinese sympathise with it.

Besides these four, several schemes have been invented and used in Japan and China that mix single and double byte character representations, in bewildering variety. Multiple octets are able to solve many problems arising when many languages are to be combined into a single document. But there is a price. The memory required at least doubles, and searching databases may take more resources.
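The memory cost of a uniform multiple-octet code can be illustrated with a short Python sketch. It is not part of the original paper and rests on a loudly stated assumption: Python's `utf-16-be` and `utf-32-be` codecs, which belong to later versions of these standards, are used here only as stand-ins for a uniform two-octet and a uniform four-octet code, and the sample text is restricted to characters that each take exactly one unit in those codecs. The function name `octets_per_scheme` is hypothetical.

```python
def octets_per_scheme(text: str) -> dict:
    """Return the storage needed for text under three fixed-width codings."""
    return {
        "single octet (ISO 8859-1)": len(text.encode("iso-8859-1")),
        # Stand-in for a uniform two-octet code; valid for this sample text only.
        "two octets (UTF-16 stand-in)": len(text.encode("utf-16-be")),
        # Stand-in for a uniform four-octet code.
        "four octets (UTF-32 stand-in)": len(text.encode("utf-32-be")),
    }


sample = "Leiden, août 1990"
sizes = octets_per_scheme(sample)
for scheme, size in sizes.items():
    print(f"{scheme}: {size} octets for {len(sample)} characters")
```

For a text that would fit in a single-octet code, the storage doubles with two octets per character and quadruples with four, which is the price referred to above.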
3. Networks
3.1. Ways of Communication
The interchange of data may follow one of three ways:
- serial communication,
- communication with protocol,
- media under OS control.

The first is the conventional way with which most communications people were brought up. There is a Sender and a Recipient, and a continuous stream of bytes between them in one direction. Controls may change the state of the reader. This is the world of ISO 2022. The second is that of Networking, where things are no longer that simple. Transfer is directed by protocols. The third is employed where data is coded on a magnetic medium under the control of an Operating or Data Management System, and then shipped physically to a Recipient who is able to use the same system for reading it. Storage may be Direct Access, without a prescribed order of interpretation. It is clear that in this case the model of ISO 2022 does not apply.

3.2. Character Issues in Networks
One would expect that the variety of character codes and their possibilities for non-English languages would have been reflected in the ISO standards for Networks. But if one looks into the most obvious ones, ISO 8824 (1987), Abstract Syntax Notation Number 1 (ASN.1) and ISO 10021-7 MOTIS, Part 7 (X.420), one is inclined to be disappointed. ISO 8824 contains a list of "printable" characters that is so restricted that every vendor of modern printers would be ashamed if he could not offer anything more. Not even the complete ASCII repertoire is available there. Unfortunately, some networks or, perhaps, gateways follow the same policy. It seems impossible to transfer Postscript files or programs written in "C" to or from JANET nodes, because "curly brackets" are turned into capital letters. Obviously, these are not "printable". Another nice concept is that of "visible string (ISO 646 string)", suggesting that any accented letter is inherently invisible.

Both standards discuss IA5, Teletex and Videotex (which are specified by no ISO standards at all, other than in an informative annex), but call the rest "general". These are identified by reference to the C (control) and G (graphic) sets they include. This convention automatically excludes PC data encoded with CP437, CP850 or the rest from transfer by MOTIS, because these contain more than 191 graphics, and do not know C or G. As I stated in my 1988 paper on Coded Character Sets and Programming Languages (SC2 N 1961): "Traditionally, the Information Processing world is English speaking only. Now that the access to this world is no longer reserved for an intellectual elite, this practice has become an untenable barrier to large groups of people." But both standards give the impression that they have been designed for the exclusive use of English. It could be defended that the terms of the syntax itself are English, because some Lingua Franca is required for international communication, but that does not pertain to strings.

Here another problem arises. ISO 6937, Teletex and Videotex allow characters to be coded by single and double bytes mixed. This implies that the length of a string is not equal to the number of bytes with which it is coded. There is a universal consensus in the data processing world that such coding schemes are unusable for their applications.
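The mismatch between string length and byte count can be shown with a toy Python sketch. It is not part of the original paper and is not a conforming ISO 6937 decoder. It assumes that the non-spacing diacritical marks occupy the octets 0xC1 to 0xCF and that the acute accent is coded as 0xC2; both values should be checked against the standard before being relied upon. The names `NON_SPACING` and `character_count` are hypothetical.

```python
# Assumption: lead octets that represent non-spacing diacritical marks.
NON_SPACING = range(0xC1, 0xD0)


def character_count(octets: bytes) -> int:
    """Count graphic characters in a mixed single/double octet string."""
    count = 0
    index = 0
    while index < len(octets):
        if octets[index] in NON_SPACING:
            if index + 1 >= len(octets):
                raise ValueError("diacritical mark without a following letter")
            index += 2   # mark plus letter form one composite character
        else:
            index += 1   # simple character, one octet
        count += 1
    return count


# "café" coded as c, a, f, then ACUTE (assumed 0xC2) followed by e.
cafe = b"caf" + bytes([0xC2]) + b"e"
print(len(cafe), character_count(cafe))  # 5 4
```

The string has to be scanned from the start to find its length or its n-th character, which is one reason why such schemes are awkward for data processing.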
4. Conclusions and Controversies
4.1. Conclusions
The consequences of recent developments in coded character set standards for networking could only be sketched very summarily. A closer cooperation between the bodies working in both fields deserves strong recommendation. The issues should also be placed into a broader context. Several groups are working on what is called "Internationalization", which in fact means "deanglification". A need for a National Language Architecture has been identified among IBM users. The extent of the problems has been excellently summarized in the SEAS White Paper on National Language support. For further developments in networking, the following conclusions should be kept in mind:
- 7-bit codes will have to go. ISO will not spend any effort on further development.
- Text and Data will be the same thing. Any octet may have a meaning. Gateways shall not damage bytes. Change or removal of bytes will obstruct information interchange.
- Coding schemes based on mixed single/double byte representation are awkward for data processing and have to be avoided (Teletex, Videotex).

4.2. Controversial Issues
The following issues are controversial:
- Control functions not to be coded by single bytes, but by control sequences (except a few).
- Conflict between 4873-type codes and actual practice in PCs. HELP ISO TO SOLVE IT! Support ISO 10XXX project!
- Deadlock on multiple octet code. Beware of unofficial (non-ISO) proposals as a solution.
- IBM is now reconsidering the Convert Always Everything approach. Support SHARE and SEAS proposals!
References
[1] J.W. Van Wingen, Coded character sets and programming languages, ISO/IEC JTC1/SC2 N 1961R and SC22 N 578R, September 1988, rev. April 1989.
[2] J.W. Van Wingen, A reference model for characters and documents, ISO/IEC JTC1/SC2 N 2133, December 1989.
[3] P. Gardner, Ed., SEAS National Character Task Force, White Paper on National Character, Language and Keyboard Problems, Geneva, SHARE European Association, September 1985.
[4] E. Hart, Ed., SHARE ASCII and EBCDIC Character Set Task Force, White Paper on ASCII and EBCDIC Character Set and Code Issues in Systems Application Architecture, Chicago, SHARE Inc., SSD #366, May 1989.
[5] K. Daube, Ed., SHARE Europe (SEAS), White Paper on National Language Architecture, Geneva, SHARE European Association (in preparation).
[6] B. Jerman-Blazic, Multilingual communication in an open system architecture (to be published).