Plain base32 ASCII-compatible encoding and 8-bit dual-mode transformation format of ISO 10646


Computer Standards & Interfaces 23 (2001) 457–466
www.elsevier.com/locate/csi

Pei-Chi Wu *

Department of Computer Science and Information Engineering, National Penghu Institute of Technology, 300 Liu-Ho Road, Makung, Penghu 880, Taiwan, ROC

Received 2 May 2001; received in revised form 6 June 2001; accepted 18 June 2001

Abstract

A variety of formats have been proposed for ASCII-compatible encoding (ACE) in internationalized domain names (IDN). The issue of supporting UTF-8 in IDN has also been raised during the discussion. In this paper, we address alternatives to these encoding methods. We first propose a plain base32 ACE, which achieves a space efficiency of 3.2 bytes per 16-bit character. Secondly, we propose an 8-bit dual-mode transformation format of the ISO 10646 Universal Character Set. The 128 base characters (80)₁₆–(FF)₁₆ are used to represent non-ASCII characters. This results in a space efficiency of 2.29 bytes per 16-bit character. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Internationalized domain name; Universal character set; Base128; Simplicity; Space efficiency

* Tel.: +886-6-9264115x1215; fax: +886-6-9277361. E-mail address: [email protected] (P.-C. Wu).

0920-5489/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0920-5489(01)00086-1

1. Introduction

ISO 10646 Universal Character Set (UCS) [10] is a 31-bit coding architecture that covers symbols in most of the world's written languages. UCS defines two alternative encoding forms: UCS-4 (4-byte) and UCS-2 (2-byte) (or Unicode [17]). UCS-4 and UCS-2 cannot be used directly in applications and protocols that assume 8-bit or even 7-bit characters. This has led to the development of various UCS transformation formats (UTF), such as UTF-7 [6], UTF-8 [21], and UTF-16 [7]. UTF-8 is the encoding format that Internet Engineering Task Force (IETF) protocols must support [1]. However, the space efficiency of UTF-8 is not very good for Asian characters: a character in (800)₁₆–(FFFF)₁₆ of UCS is represented in 3 bytes under UTF-8.

The domain name system (DNS) [11,12] provides a mechanism for naming resources on the Internet. The domain name space is structured as a tree, where each node has a label of 0–63 bytes. DNS assumes the use of ASCII [2], which mainly represents English words. The IETF has set up a working group on internationalized domain names (IDN) [8,20]. The use of UCS in IDN [13] promises to simplify the implementation problems of handling the standard character sets of various countries. The issue of supporting UTF-8 in IDN has also been raised during the discussion.

A variety of formats [3,4,9,14,16,19] have been proposed for ASCII-compatible encoding (ACE), which uses only characters within the alphanumeric range [A–Z][0–9] and the hyphen (-). Most proposals adopt UCS. Due to the length limit of 63 bytes in domain name labels, many proposals have emphasized the compression of internationalized domain names. It is difficult to judge which one is best, even when considering only space efficiency. In addition, none of them can encode 63 UCS characters in a label, the original length limit of a DNS label. Since a clean design is essential to avoid misinterpretation of the encoding scheme used in IDN, the design of ACE should place more emphasis on simplicity and functionality.

In this paper, we address alternatives to UTF-8 and to the ACE proposals, respectively. We first propose a plain base32 ACE, which preserves the lexicographic sorting order of UCS. The format applies no further compression and achieves a space efficiency of 16/5 = 3.2 bytes per 16-bit character. Secondly, we propose an 8-bit dual-mode transformation format of ISO 10646, called UTF-8D. ASCII characters are represented in their original 7-bit values, and the 128 base characters (80)₁₆–(FF)₁₆ are used to represent non-ASCII characters. This results in a space efficiency of 16/7 ≈ 2.29 bytes per 16-bit character.
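As a quick check of the space-efficiency figures quoted above, the following Python sketch computes the bytes needed per 16-bit character for an encoding that carries a given number of payload bits per output byte. The function names are illustrative only, and the label-capacity calculation assumes that no prefix or header bytes are spent, which actual ACE proposals would add.

```python
def bytes_per_char(bits_per_byte: int, char_bits: int = 16) -> float:
    """Average output bytes needed per character of char_bits bits."""
    return char_bits / bits_per_byte


def chars_per_label(bits_per_byte: int, label_bytes: int = 63,
                    char_bits: int = 16) -> int:
    """Whole characters that fit in one label, ignoring any prefix (assumption)."""
    return (label_bytes * bits_per_byte) // char_bits


if __name__ == "__main__":
    print("base32 :", round(bytes_per_char(5), 2), "bytes per 16-bit character")
    print("base128:", round(bytes_per_char(7), 2), "bytes per 16-bit character")
    print("16-bit characters per 63-byte base32 label:", chars_per_label(5))
```

Under these assumptions a 63-byte label holds far fewer than 63 characters in plain base32, which is consistent with the observation that no ACE format reaches the original 63-character limit.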

2. Related work

UTF-7 [6] classifies ASCII characters into safe, base, optional, white space, and unsafe. There are two encoding modes: direct and base64. UTF-7 directly represents safe, base, and white space characters. The other characters in the Basic Multilingual Plane (BMP) are represented by a prefix ‘+’ followed by a sequence of base64 encoded characters. This results in a space efficiency of about 16/6 ≈ 2.67 bytes per 16-bit character.

UTF-8 [21] encodes all characters in UCS-4 as a varying number of bytes. ASCII characters are encoded in one byte having the usual ASCII value. Table 1 shows the mapping from UCS-4 to byte sequences in UTF-8. Each byte can be recognized as belonging to one of the following categories: ASCII, the first byte of a multi-byte sequence, or a continuing byte of a multi-byte sequence. A program can therefore easily find character boundaries in a byte stream. UTF-8 also preserves the lexicographic sorting order of UCS.

Table 1. The mapping from UCS-4 to byte sequences in UTF-8

UCS-4 range (hexadecimal)   UTF-8 byte sequence (binary)                            Free bits
0000 0000–0000 007F         0xxxxxxx                                                7
0000 0080–0000 07FF         110xxxxx 10xxxxxx                                       11
0000 0800–0000 FFFF         1110xxxx 10xxxxxx 10xxxxxx                              16
0001 0000–001F FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                     21
0020 0000–03FF FFFF         111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx            26
0400 0000–7FFF FFFF         1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx   31

UTF-5 [16] was proposed for compatibility with the current domain name system [11,12]. Each character is encoded using a sequence of 1–8 bytes. The transformation is in two steps: (1) map each 4 bits in a UCS-4 code to a quintet (5 bits); (2) translate each quintet to an alphanumeric character. As shown in Table 2, the first quintet has the most significant bit (MSB) set to 1 and the following quintets have the MSB set to 0. The remaining 4 bits of a quintet contain bits from the character code. The alphanumeric characters used in Step 2 are the digits 0–9 and the (uppercase) letters A–V, 32 characters in total.

Table 2. The mapping from UCS-4 to quintets in UTF-5

UCS-4 range (hexadecimal)   UTF-5 quintets (binary)                           Free bits
0000 0000–0000 000F         1xxxx                                             4
0000 0010–0000 00FF         1xxxx 0xxxx                                       8
0000 0100–0000 0FFF         1xxxx 0xxxx 0xxxx                                 12
0000 1000–0000 FFFF         1xxxx 0xxxx 0xxxx 0xxxx                           16
...                         ...                                               ...
1000 0000–7FFF FFFF         1xxxx 0xxxx 0xxxx 0xxxx 0xxxx 0xxxx 0xxxx 0xxxx   32

Multipurpose Internet Mail Extensions (MIME) [5] defines a standard mechanism for encoding binary data into a 7-bit short-line format. The encoding used is denoted by the Content-Transfer-Encoding header field. Currently there are two content transfer encoding methods [5,15]: quoted-printable and base64. The base64 encoding represents a sequence of bytes in a 65-character subset of ASCII. This enables 6 bits to be represented per printable ASCII character.

A variety of formats [3,4,9,14,16,19] have been proposed for ACE. Many proposals have emphasized the compression of internationalized domain names. For example, RACE indicates a row number in the header byte when all non-ASCII characters of a label come from only one row, a block of 256 characters. Each character in this row is then represented in one byte and encoded in base32. Otherwise, the header byte is (D8)₁₆, followed by 16-bit characters encoded in base32. For the input string “網絡” (net) [U+7DB2, U+7D61], all characters are in the same row, (7D)₁₆. Thus, the output is [7D, B2, 61] (hex) → [01111101, 10110010, 01100001] (8-bit) → [01111, 10110, 11001, 00110, 00010] (5-bit) → “PWZGC”, encoded in the 32 base characters [A–Z2–7].

Table 3 summarizes various ACE proposals. None of the current row-based approaches preserves the lexicographic sorting order of UCS. These formats also need a complete scan of a label to select an encoding style. When encoding a random UCS string, most current ACE proposals degrade to a plain base32 encoding with a header that indicates the encoding style used.

Oscarsson [13] defines the length limit of IDN as follows: a label is limited to a maximum of 63 characters in UCS; the domain name is limited to a maximum of 255 characters in UCS. This is similar to the 63 and 255 limits defined in the DNS specification [12].
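The UTF-5 mapping of Table 2 and the single-row case of the RACE example can be sketched in Python as follows. This is an illustrative sketch, not an implementation of either Internet Draft: the function names are hypothetical, and race_single_row_body covers only the case described above in which all characters share one row, without any label prefix or the other encoding styles of the full proposal.

```python
UTF5_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"   # [0-9A-V]
RACE_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"   # [A-Z2-7]


def utf5_encode(codes):
    """Encode each code as quintets: MSB 1 on the first quintet, 0 on the rest."""
    out = []
    for c in codes:
        nibbles = []
        while True:                        # split into 4-bit groups, low group first
            nibbles.append(c & 0xF)
            c >>= 4
            if c == 0:
                break
        nibbles.reverse()
        out.append(UTF5_ALPHABET[0x10 | nibbles[0]])
        out.extend(UTF5_ALPHABET[n] for n in nibbles[1:])
    return "".join(out)


def base32_bytes(data, alphabet):
    """Pack bytes into 5-bit groups, zero-padding the final group."""
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 5)
    return "".join(alphabet[int(bits[i:i + 5], 2)]
                   for i in range(0, len(bits), 5))


def race_single_row_body(codes):
    """Row byte followed by one byte per character (single-row case only)."""
    row = codes[0] >> 8
    if any(c >> 8 != row for c in codes):
        raise ValueError("characters span more than one row")
    return base32_bytes(bytes([row] + [c & 0xFF for c in codes]), RACE_ALPHABET)


if __name__ == "__main__":
    net = [0x7DB2, 0x7D61]
    print(utf5_encode(net))
    print(race_single_row_body(net))
```

For the input [U+7DB2, U+7D61] the second function reproduces the string PWZGC derived step by step in the text.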

3. Plain base32 ACE

3.1. Design rationales

Our design is based on the following rationales.

(1) IDN should have a length limitation similar to that of ASCII domain names. We follow Oscarsson [13]: a label is limited to 63 UCS characters, and the domain name is limited to 255 UCS characters.

(2) Since no ACE format can achieve the above length limitation, there should be a more elegant way of representing IDN, such as the use of the extended labels of DNS [18]. ACE is used only for compatibility with current systems.

(3) ACE should be as simple as possible, and its space efficiency should be adequate for this compatibility role.

(4) ACE should preserve the lexicographic sorting order of UCS. Domain labels are strings, which can be used in various application contexts, such as a database of registered domain labels.

Table 3. A summary of various ACE proposals

Format  Compression                                          Encoding
UTF-5   None                                                 base32: [0–9A–V]
RACE    A prefix for row number. An escape code for row 0.   base32: [A–Z2–7]
LACE    A prefix for length and row number.                  base32: [A–Z2–7]
SACE    Switching between 3 modes: Latin, 10bit, base36.     base32, and base36: [A–Z1234790856]
        A base32 prefix for 10bit. A base36 prefix for
        base36.
DUDE    Bit difference represented by variable-length        base32: [0–9A–V]
        integer.
BRACE   Use a 2-bit tag for four encoding styles: half-row,  base32: digits 0 and 1 and letters O
        full-row, mixed style, no-row.                       and L (“l”) are unused.

3.2. The ACE format

In this section, we propose an ACE format for IDN based on the rationales in Section 3.1. There are two typical designs for preserving the lexicographic sorting order of UCS: fixed-width encoding and variable-length encoding. The former is adopted in UTF-7; the latter is adopted in UTF-8. We adopt the fixed-width encoding approach and use plain base32 encoding. Base32 encoding represents 5 bits per character by using 32 base characters; it is a scaled-down version of base64. The format applies no further compression and achieves a space efficiency of 16/5 = 3.2 bytes per 16-bit character. This is close to the space efficiency of UTF-8 [21], which encodes UCS characters (800)₁₆–(FFFF)₁₆ in 3 bytes. To preserve the lexicographic sorting order of UCS, the 32 base characters are listed in ASCII order: [0–9][A–V].

Table 4 shows three example strings, “台北” (Taipei), “國際化域名” (internationalized domain name), and “網絡” (net), encoded in UTF-16, UTF-8, UTF-8D, UTF-5, RACE, LACE, and plain base32 for comparison. The byte sequences for UTF-8 and UTF-8D are shown in hexadecimal. The others are shown as alphanumeric strings. The strings in RACE and LACE are the results generated by the program at http://www.imc.org/nameprep/. RACE and LACE apply row-based compression, and they generate shorter strings for “台北” (Taipei) and “網絡” (net). The characters in “台北” (Taipei) [U+53F0, U+5317] are in the same row, (53)₁₆; the characters in “網絡” (net) [U+7DB2, U+7D61] are in the same row, (7D)₁₆. When the compression does not work, as in the case of “國際化域名” [U+570B, U+969B, U+5316, U+57DF, U+540D], the resulting strings are longer than that of plain base32.

Plain base32 preserves the lexicographic sorting order of UCS. For example,

“台北” < “國際化域名” < “網絡”.

The resulting strings also follow this sorting order: AFO565O < AS5PD6QJ2PBTUL0D < FMP7QO8. Note that UTF-5 preserves the lexicographic sorting order of UCS only when all characters of a string are encoded in the same number of quintets. RACE and LACE do not preserve the lexicographic sorting order of UCS, as shown in Table 4.

Table 4. Example strings encoded in various formats for comparison

(a) “台北” (Taipei)
Format         Encoded string             #bytes
UTF-16         U+53F0 U+5317              4
UTF-8          E5, 8F, B0, E5, 8C, 97     6
UTF-8D         A9, FC, 8A, B1, B8         5
UTF-5          L3F0L317                   8
RACE           KPYBO                      5
LACE           AJJ7AFY                    7
Plain base32   AFO565O                    7

(b) “國際化域名” (internationalized domain name)
Format         Encoded string                                                #bytes
UTF-16         U+570B U+969B U+5316 U+57DF U+540D                            10
UTF-8          E5, 9C, 8B, E9, 9A, 9B, E5, 8C, 96, E5, 9F, 9F, E5, 90, 8D    15
UTF-8D         AB, C2, F2, E9, DA, CC, AC, D7, EF, D5, 81, D0                12
UTF-5          L70BP69BL316L7DFL40D                                          20
RACE           3BLQXFU3KMLFPX2UBU                                            18
LACE           75LQXFU3KMLFPX2UBU                                            18
Plain base32   AS5PD6QJ2PBTUL0D                                              16

(c) “網絡” (net)
Format         Encoded string             #bytes
UTF-16         U+7DB2 U+7D61              4
UTF-8          E7, B6, B2, E7, B5, A1     6
UTF-8D         BE, EC, CF, D6, 88         5
UTF-5          NDB2ND61                   8
RACE           PWZGC                      5
LACE           AJ63EYI                    7
Plain base32   FMP7QO8                    7

3.3. The implementation

Fig. 1 shows the C code for the encode and decode arrays and the function init_tables. The encode array contains the 32 base characters. Function init_tables sets up the mapping between integers and base characters. All elements of the decode array are first initialized with -1, the invalid character code. The decode array maps the base characters back to integers.

Fig. 2 shows the function int_base32, which maps an array of UCS codes (int code[], int count) to a plain base32 string (unsigned char seq[]). The function also returns the number of bytes in the encoded representation. The encoding uses a bit buffer (bit_buffer), and nbits is the number of bits in the buffer. When encoding each UCS character c, the bit buffer is shifted 16 bits left and the code c is appended. While nbits ≥ 5, the function takes 5 bits from the bit buffer and generates an encoded alphanumeric character. At the end of processing, if fewer than 5 bits remain, the function pads them with zero bits and generates one more encoded alphanumeric character.

Fig. 3 shows the function base32_int, which converts a plain base32 string (unsigned char seq0[], int count) to an array of UCS codes (int code[]). The function decodes each input character and appends the 5 decoded bits to the bit buffer. When nbits ≥ 16, it generates a UCS code.

Fig. 1. The encode and decode arrays, and function init_tables.

Fig. 2. Mapping an array of codes to a plain base32 string: function int_base32.

Fig. 3. Converting a plain base32 string to an array of codes: function base32_int.
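The bit-buffer procedure described in Section 3.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the C code of Figs. 1–3: it reuses the names int_base32 and base32_int from the text, handles 16-bit codes only, and replaces the decode array (with -1 for invalid characters) by a dictionary lookup that raises an error.

```python
ENCODE = "0123456789ABCDEFGHIJKLMNOPQRSTUV"        # 32 base characters in ASCII order
DECODE = {ch: i for i, ch in enumerate(ENCODE)}     # base character -> 5-bit value


def int_base32(codes):
    """Encode a sequence of 16-bit UCS codes as a plain base32 string."""
    out = []
    bit_buffer, nbits = 0, 0
    for c in codes:
        bit_buffer = (bit_buffer << 16) | (c & 0xFFFF)
        nbits += 16
        while nbits >= 5:                           # emit every complete 5-bit group
            nbits -= 5
            out.append(ENCODE[(bit_buffer >> nbits) & 0x1F])
        bit_buffer &= (1 << nbits) - 1              # keep only the bits not yet emitted
    if nbits > 0:                                   # pad the remaining bits with zeros
        out.append(ENCODE[(bit_buffer << (5 - nbits)) & 0x1F])
    return "".join(out)


def base32_int(seq):
    """Decode a plain base32 string back to a list of 16-bit UCS codes."""
    codes = []
    bit_buffer, nbits = 0, 0
    for ch in seq:
        if ch not in DECODE:
            raise ValueError(f"invalid base32 character: {ch!r}")
        bit_buffer = (bit_buffer << 5) | DECODE[ch]
        nbits += 5
        if nbits >= 16:                             # a full 16-bit code is available
            nbits -= 16
            codes.append((bit_buffer >> nbits) & 0xFFFF)
            bit_buffer &= (1 << nbits) - 1
    return codes                                    # leftover bits are padding


if __name__ == "__main__":
    for codes in ([0x53F0, 0x5317],
                  [0x570B, 0x969B, 0x5316, 0x57DF, 0x540D],
                  [0x7DB2, 0x7D61]):
        encoded = int_base32(codes)
        print(encoded, base32_int(encoded) == codes)
```

Run on the three example strings, the sketch reproduces the plain base32 row of Table 4, and the encoded strings sort in the same order as the original code sequences.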

4. An 8-bit dual-mode transformation format

In this section, we propose an 8-bit dual-mode transformation format of ISO 10646, called UTF-8D.

4.1. The transformation format

We adopt the dual-mode design of UTF-7. There are two encoding modes in UTF-8D: direct and base128. UTF-8D directly represents all 7-bit ASCII characters. The remaining 8-bit characters (80)₁₆–(FF)₁₆ (with the MSB set to 1) are used as base characters in the base128 encoding: non-ASCII characters in the BMP are represented by a sequence of base128 encoded characters, each of which represents 7 bits. This results in a space efficiency of 16/7 ≈ 2.29 bytes per 16-bit character. When the number of bits needed to represent a non-ASCII substring is not a multiple of 7, the bits are padded with zero bits. A non-ASCII substring ends implicitly when the following character is in ASCII.

Table 5 shows an example string in various encodings. The string used here is a markup of a Chinese paragraph (from the Dao Te Ching). There are 17 characters in total: 7 ASCII characters and 10 Chinese characters and punctuation marks. The string takes 34 bytes in UTF-16, 37 bytes in UTF-8, and 30 bytes in UTF-8D. The original size in Big5 is 27 bytes. The size in UTF-8D can be computed as 7 + ⌈10 · 16/7⌉ = 30. The shaded region in each row of the table indicates the non-ASCII characters in the paragraph.

Fig. 4 shows the last three bytes [E9, C0, 84] in the above UTF-8D byte sequence that encode U+3002. The last byte (84)₁₆ encodes (10 · 16 mod 7) = 6 bits (000010)₂ and is padded with one zero bit. Byte (C0)₁₆ encodes 7 bits (1000000)₂; byte (E9)₁₆ encodes the first 3 bits (001)₂ that belong to U+3002. The concatenated bit sequence is then (0011000000000010)₂ = (3002)₁₆.

UTF-8D preserves the lexicographic sorting order of UCS. As shown in Table 4, the three example strings keep their UCS sorting order in UTF-8D: [A9,FC,8A,B1,B8] < [AB,C2,F2,E9,DA,CC,AC,D7,EF,D5,81,D0] < [BE,EC,CF,D6,88].

Table 5. Encoding a markup of Chinese paragraph in various encoding methods

Fig. 4. The last three bytes in the sequence that encodes U+3002.

Table 6. Number of bytes in encoded sequences for characters of various UCS-4 ranges using several UTF for comparison

UCS-4 range (hexadecimal)   UTF-16   UTF-8   UTF-8D
0000 0000–0000 007F         2        1       1
0000 0080–0000 07FF         2        2       2.29
0000 0800–0000 FFFF         2        3       2.29
0001 0000–0010 FFFF         4        4       4.57
0011 0000–001F FFFF         n.a.     4       n.a.
0020 0000–03FF FFFF         n.a.     5       n.a.
0400 0000–7FFF FFFF         n.a.     6       n.a.

n.a. = not applicable

Fig. 5. Definitions of bit_buffer, nbits, and flush_buffer.

Fig. 6. Encoding an array of codes into UTF-8D byte sequence.

Fig. 7. Decoding UTF-8D byte sequence into an array of codes.
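The dual-mode format of Section 4.1 can be sketched in Python as follows. This is a sketch under stated assumptions, not the C code of Figs. 5–7: the function names utf8d_encode and utf8d_decode are illustrative, only BMP codes are handled, and the sample paragraph is a hypothetical stand-in (built from code points that appear in Table 4 and ending in U+3002) because the original string is not reproduced here.

```python
def utf8d_encode(codes):
    """Encode BMP codes: ASCII directly, other codes as base128 with the MSB set."""
    out = bytearray()
    bit_buffer, nbits, mode = 0, 0, 0               # mode 0: direct, mode 1: base128

    def flush():
        nonlocal bit_buffer, nbits
        if nbits > 0:                               # pad with (7 - nbits) zero bits
            out.append(0x80 | ((bit_buffer << (7 - nbits)) & 0x7F))
        bit_buffer, nbits = 0, 0

    for c in codes:
        if not 0 <= c <= 0xFFFF:
            raise ValueError("this sketch handles BMP codes only")
        if c < 0x80:
            if mode == 1:                           # leaving a non-ASCII run
                flush()
                mode = 0
            out.append(c)
        else:
            mode = 1
            bit_buffer = (bit_buffer << 16) | c
            nbits += 16
            while nbits >= 7:                       # emit every complete 7-bit group
                nbits -= 7
                out.append(0x80 | ((bit_buffer >> nbits) & 0x7F))
            bit_buffer &= (1 << nbits) - 1
    if mode == 1:
        flush()
    return bytes(out)


def utf8d_decode(seq):
    """Decode a UTF-8D byte sequence into a list of codes."""
    codes = []
    bit_buffer, nbits = 0, 0
    for b in seq:
        if b < 0x80:
            bit_buffer, nbits = 0, 0                # drop padding of a finished run
            codes.append(b)
        else:
            bit_buffer = (bit_buffer << 7) | (b & 0x7F)
            nbits += 7
            if nbits >= 16:
                nbits -= 16
                codes.append((bit_buffer >> nbits) & 0xFFFF)
                bit_buffer &= (1 << nbits) - 1
    return codes


if __name__ == "__main__":
    # Hypothetical 17-character markup: 7 ASCII characters and 10 non-ASCII ones.
    run = [0x570B, 0x969B, 0x5316, 0x57DF, 0x540D,
           0x53F0, 0x5317, 0x7DB2, 0x540D, 0x3002]
    sample = [ord(ch) for ch in "<p>"] + run + [ord(ch) for ch in "</p>"]
    encoded = utf8d_encode(sample)
    print(len(encoded), encoded[23:26].hex())
```

With 7 ASCII and 10 non-ASCII characters the encoded length matches the 7 + ⌈10 · 16/7⌉ = 30 bytes computed above, and the two bytes that follow the one shared with the preceding character are C0 and 84, as in Fig. 4.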

4.2. The implementation

Fig. 5 shows the C function flush_buffer. The encoding uses a bit buffer (bit_buffer), and nbits is the number of bits in the buffer. Function flush_buffer pads the buffer with (7 - nbits) zero bits and generates a 7-bit value d, which is output with the MSB set to 1.

Fig. 6 shows function int_utf8d, which encodes an array of codes (int code[], int count) into a UTF-8D encoded byte sequence (unsigned char seq[]). The parameter mode is used for passing the initial mode and for returning the final mode. The encodings used in the two modes are direct (mode 0) and base128 (mode 1), respectively. The function also returns the number of bytes in the encoded representation. Before encoding a character, the function first checks whether the context is in the needed mode. If it is not, the function first changes mode. There are two cases: (1) from mode 0 to mode 1: reset the bit buffer; and (2) from mode 1 to mode 0: if there are remaining bits in the bit buffer, flush the bit buffer. In direct encoding, each character c is represented by itself. In base128 encoding of each character c, the bit buffer is shifted 16 bits left and the code c is appended. While nbits ≥ 7, the function takes 7 bits d and generates a base128 encoding of d with the MSB set to 1. At the end of processing, the function calls flush_buffer again if mode = 1.

Fig. 7 shows the function utf8d_int, which decodes a UTF-8D byte sequence (unsigned char seq[], int count) into an array of codes (int code[]). In decoding each character seq[p], the 7 bits decoded from the character are appended to the bit buffer. When nbits ≥ 16, the function takes 16 bits to generate a 16-bit code.

4.3. Space efficiency

The advantage of UTF-8D is its compactness compared with UTF-8 and UTF-16. Table 6 compares the number of bytes in encoded sequences for characters in various UCS-4 ranges using UTF-16, UTF-8, and UTF-8D. UTF-8D takes less space than UTF-8 in the range (800)₁₆–(FFFF)₁₆; UTF-8D takes less space than UTF-16 for ASCII characters but more space for non-ASCII characters.

Table 7 lists the files used in the comparison. There are two text files and five HTML files. All use the Big5 character set. This is only a rough sampling of text files. Table 8 shows the number of characters in the two modes. The byte order mark of Unicode is not counted as a character here. The statistics show that the test files include both files dominated by ASCII text (File 3) and files dominated by Chinese text (File 6).

Table 9 shows the sizes of the test files encoded using UTF-16, UTF-8, and UTF-8D, and their original sizes in Big5. Taking Big5 as space rating 1.00, the maximal space ratings of UTF-16, UTF-8, and UTF-8D over these test files are 1.95, 1.44, and 1.13, respectively. In each encoding, the shaded cell is the file that results in the maximal space rating. The average space ratings are 1.51, 1.25, and 1.08, respectively. The result shows that UTF-8D is, on average, more space efficient than UTF-16 and UTF-8.

Table 7. The text files used in the comparison

File     File name       Description
File 1   general.txt     System document
File 2   hardware.txt    System document
File 3   iejit.htm       System document
File 4   tip.htm         System document
File 5   main.htm        Personal web page
File 6   labor.htm       Chinese document
File 7   seafarer.htm    Chinese document

Table 8. Number of characters in two modes

Mode            File 1   File 2   File 3   File 4   File 5   File 6   File 7
0 (ASCII)       11,198   12,311   23,013   6251     48,471   3200     2931
1 (non-ASCII)   7534     8233     637      2135     2560     11,021   9358

Table 9. The sizes of test files using various UTF and Big5
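The space ratings discussed in Section 4.3 can be approximated from the character counts of Table 8 alone. The Python sketch below is an estimate under stated assumptions, not a reproduction of the measured file sizes: it assumes that every non-ASCII character takes 2 bytes in Big5 and UTF-16 and 3 bytes in UTF-8, that no byte order mark is counted, and that all non-ASCII characters of a file form a single run in UTF-8D, which ignores per-run padding and therefore gives only a lower bound for UTF-8D.

```python
# (ASCII count, non-ASCII count) for Files 1-7, taken from Table 8.
COUNTS = [(11198, 7534), (12311, 8233), (23013, 637), (6251, 2135),
          (48471, 2560), (3200, 11021), (2931, 9358)]


def estimated_ratings(n_ascii, n_other):
    """Return estimated (UTF-16, UTF-8, UTF-8D lower bound) ratings, Big5 = 1.00."""
    big5 = n_ascii + 2 * n_other                    # assumption: 2 bytes per non-ASCII
    utf16 = 2 * (n_ascii + n_other)
    utf8 = n_ascii + 3 * n_other                    # assumption: all in 0800-FFFF
    utf8d_min = n_ascii + (16 * n_other + 6) // 7   # ceil(16 * n_other / 7), one run
    return utf16 / big5, utf8 / big5, utf8d_min / big5


RATINGS = [estimated_ratings(a, o) for a, o in COUNTS]

if __name__ == "__main__":
    names = ("UTF-16", "UTF-8", "UTF-8D (lower bound)")
    for name, column in zip(names, zip(*RATINGS)):
        print(f"{name}: max {max(column):.2f}, mean {sum(column) / len(column):.2f}")
```

Under these assumptions the UTF-16 and UTF-8 estimates round to the maximal and average ratings reported above, while the UTF-8D estimates stay slightly below the reported values, as expected for a lower bound.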

5. Conclusions

We have addressed alternatives to UTF-8 and to the ACE proposals, respectively. The ACE format adopts base32 encoding and preserves the lexicographic sorting order of UCS. The format applies no further compression and achieves a space efficiency of 3.2 bytes per 16-bit character. We have also proposed an 8-bit dual-mode transformation format of ISO 10646, called UTF-8D. The format remains compatible with ASCII and preserves the lexicographic sorting order of UCS. It achieves a space efficiency of 16/7 ≈ 2.29 bytes per 16-bit character. We have tested several text files. The result shows that the space efficiency of UTF-8D is, on average, better than that of UTF-16 and UTF-8.

Acknowledgements

The author would like to thank the referees, whose comments helped to improve the overall presentation. This research was partly supported by National Science Council, Taiwan, Republic of China, under Contract No. NSC 89-2213-E-346-002.

References

[1] H. Alvestrand, IETF Policy on Character Sets and Languages, RFC 2277, 1998.
[2] American National Standards Institute, Coded character set - 7-bit American national standard code for information interchange, ANSI X3.4-1986, New York, 1986.
[3] A.M. Costello, BRACE: Bi-mode Row-based ASCII-Compatible Encoding for IDN, version 0.1.2, Internet Draft, draft-ietf-idn-brace-00.txt, 2000.
[4] M. Davis, P. Hoffman, LACE: Length-based ASCII Compatible Encoding for IDN, Internet Draft, draft-ietf-idn-lace-01.txt, 2001.
[5] N. Freed, N. Borenstein, Multipurpose Internet Mail Extensions (MIME): Part 1. Format of Internet Message Bodies, RFC 2045, 1996.
[6] D. Goldsmith, M. Davis, UTF-7: A Mail-Safe Transformation Format of Unicode, RFC 2152, 1997.
[7] P. Hoffman, F. Yergeau, UTF-16, an encoding of ISO 10646, RFC 2781, 2000.
[8] P. Hoffman, Comparison of Internationalized Domain Name Proposals, Internet Draft, draft-ietf-idn-compare-01, 2000.
[9] P. Hoffman, RACE: Row-based ASCII Compatible Encoding for IDN, Internet Draft, draft-ietf-idn-race-03.txt, 2000.
[10] ISO, ISO/IEC 10646-1:2000 (E) Information technology - Universal Multiple-Octet Coded Character Set (UCS): Part 1. Architecture and Basic Multilingual Plane, International Organization for Standardization, Geneva, 2000.
[11] P. Mockapetris, Domain Names - Concepts and Facilities, RFC 1034, 1987.
[12] P. Mockapetris, Domain Names - Implementation and Specification, RFC 1035, 1987.
[13] D. Oscarsson, Using the Universal Character Set in the Domain Name System (UDNS), Internet Draft, draft-ietf-idn-udns-02.txt, 2001.
[14] D. Oscarsson, Simple ASCII Compatible Encoding (SACE), Internet Draft, draft-ietf-idn-sace-00.txt, 2000.
[15] J. Reynolds, J. Postel, Assigned Numbers, RFC 1700, 1994.
[16] J. Seng, M. Dürst, T.W. Tan, UTF-5, a transformation format of Unicode and ISO 10646, Internet Draft, draft-jseng-utf5-01, 2000.
[17] The Unicode Consortium, The Unicode Standard, Version 3.0, Addison-Wesley, Reading, MA, 2000.
[18] P. Vixie, Extension Mechanisms for DNS (EDNS0), RFC 2671, 1999.
[19] M. Welter, B.W. Spolarich, A.M. Costello, DUDE: Differential Unicode Domain Encoding, Internet Draft, draft-ietf-idn-dude-02.txt, 2001.
[20] Z. Wenzel, J. Seng, Requirements of Internationalized Domain Names, Internet Draft, draft-ietf-idn-requirements-07.txt, 2001.
[21] F. Yergeau, UTF-8, A transformation format of ISO 10646, RFC 2279, 1998.

Pei-Chi Wu was born on March 11, 1967, in Hsinchu, Taiwan, the Republic of China. He received BS, MS, and PhD degrees from National Chiao Tung University in 1989, 1991, and 1995, respectively, all in Computer Science and Information Engineering. He has been a member of the ACM and the IEEE Computer Society since 1996. He has been with the Department of Computer Science and Information Engineering, National Penghu Institute of Technology, Taiwan, since August 1998. His research interests include multilingual systems, extensible markup language, object-oriented programming, compiler design, random number generators, and high-performance distributed computing.