Computers and Electrical Engineering 27 (2001) 265–272
www.elsevier.com/locate/compeleceng
An efficient bitwise Huffman coding technique based on source mapping

Abdel-Rahman Elabdalla, Mansour I. Irshid *

Department of Electrical Engineering, Jordan University of Science and Technology, Irbid, Jordan 22110

Received 19 May 1999; accepted 24 April 2000

* Corresponding author. Tel.: +962-2-7095111, ext. 22550; fax: +962-2-7095123. E-mail address: [email protected] (M.I. Irshid).
Abstract

In this paper, we propose an efficient source encoding technique based on mapping a non-binary information source with a large alphabet onto an equivalent binary source using weighted fixed-length code assignments. The weighted codes are chosen such that the entropy of the resulting binary source multiplied by the code length is made as close as possible to that of the original non-binary source. It is found that a large saving in complexity, execution time, and memory size is achieved when the commonly used source encoding algorithms are applied to the nth-order extension of the resulting binary source. This saving is due to the large reduction in the number of symbols in the alphabet of the new extended binary source. As an example to validate the effectiveness of this approach, text compression using a Huffman encoder applied to the nth-order extended binary source is studied. It is found that the bitwise Huffman encoder of the fourth-order extended binary source (16 symbols) achieves a compression efficiency close to that of the conventional Huffman encoder (256 symbols). © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Data compression; Source mapping; Huffman coding
1. Introduction

The need for more efficient data compression algorithms has increased in recent years despite the availability of large-capacity storage devices and broadband transmission channels. This is because of the need to store and transfer the extremely large amounts of data generated by a very large number of information sources (text, sound, image, etc.) distributed all over the globe and linked together by huge information networks. One can appreciate this fact when one starts
downloading a large file via the Internet, where a saving of one-half in money and time is possible when the file is in its compressed form. Although the approaches used to compress data are few and well established, a lot of research has been done on developing new data compression algorithms. It is not only the compression efficiency that decides how efficient a certain compression technique is; the implementation complexity, the processing time (speed), and the memory size needed to accomplish the compression/decompression processes also play a major role in choosing the proper compression algorithm [1]. The implementation complexity, the large execution time, and the large memory size needed in many compression/decompression algorithms are mainly due to the large number of symbols in the alphabet of the original information sources (256 symbols for text, sound, and image) [2]. On the contrary, information sources with a binary alphabet, such as black–white images, have simple source encoding techniques, which makes them very suitable for hardware implementation [3,4]. In text compression, as an example, the 256 characters in the alphabet of the input text file are transformed into an arbitrary fixed-length binary code such as the standard 8-bit ASCII code. The chosen compression algorithm then manipulates the resulting binary text file on a bytewise (characterwise) basis. To the best of the authors' knowledge, limited research has been done on compressing text on a bitwise basis instead of the conventional bytewise basis [1]. One method based on a bitwise approach has been proposed to reassign the character codes of English text so that it would be more effective for bitwise run-length coding [5].

This paper deals with a new approach for source encoding of information sources with a large alphabet. The approach is based on transforming the alphabet of the source to be compressed into a new weighted fixed-length binary code. The code words are chosen such that the first-order entropy of the resulting binary source multiplied by the code length is made as close as possible to that of the original source. The commonly used compression techniques then manipulate the resulting binary file on a bitwise basis rather than the conventional bytewise basis. In Section 2, the new source mapping method applied to English text is discussed. In Section 3, a bitwise Huffman encoding algorithm applied to the resulting nth-order extended binary source is studied. Conclusions are discussed in Section 4.

2. Source mapping

Source mapping is a process by which the original information source is mapped (transformed) onto a new binary source such that the entropy of the binary source is made as close as possible to the entropy per bit of the original source. In this paper, the source mapper is a fixed-length binary encoder having weighted codeword assignments governed by the following mapping rule: binary code words with large Hamming weights are assigned to symbols having large probabilities of occurrence, taken in descending order {p_0, p_1, p_2, ...}. The N-bit codeword of all ones, i.e., the one whose Hamming weight is N, is assigned to the symbol with the largest probability of occurrence, code words having a Hamming weight of
N − 1 are assigned to the next N symbols, code words having a Hamming weight of N − 2 are assigned to the next N(N − 1)/2 symbols, or, in general, code words having a Hamming weight of N − m are assigned to the next N!/(m!(N − m)!) symbols. Such an assignment process results in a binary source having symbols (0 and 1) with probabilities which reflect the degree of randomness in the original source. On the contrary, if the assignment process is done in an arbitrary way, such as the ASCII code, the resulting probabilities
of zero and one will not fully reflect the statistics of the original source. Let the original information source S have an alphabet of M = 2^N symbols with probabilities {p_0, p_1, p_2, ..., p_{M-1}}, p_i ≥ p_j for all j > i, and let the equivalent binary source B have a probability of q for symbol "1" and a probability of 1 − q for symbol "0". The entropy of the original source is given by [2]

H(S) = -\sum_{i=0}^{M-1} p_i \log_2 p_i.    (1)
When the mapping process is done according to the above-mentioned mapping rule, the probability of symbol "1" of the resulting binary source can be calculated as follows:

q = p_0 + \sum_{i=1}^{N} \frac{N - i}{N} \sum_{k=C(i-1)}^{C(i)-1} p_k,    (2)

where

C(i) = \sum_{j=0}^{i} \binom{N}{j}  and  \binom{N}{j} = \frac{N!}{j!(N - j)!}.
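To make the mapping rule concrete, the sketch below (illustrative Python, not taken from the paper; the helper names weighted_code_assignment and prob_of_one are ours) assigns N-bit code words in order of decreasing Hamming weight to symbols sorted by decreasing probability, and then computes q as the probability-weighted fraction of ones in the assigned code words, which for this assignment is the quantity given by Eq. (2).

```python
from itertools import combinations

def weighted_code_assignment(probs, N):
    """Assign N-bit code words, taken in order of decreasing Hamming weight,
    to symbols listed in order of decreasing probability."""
    assert len(probs) <= 2 ** N
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    codewords = []
    for weight in range(N, -1, -1):          # heaviest code words first
        for ones in combinations(range(N), weight):
            codewords.append(''.join('1' if k in ones else '0' for k in range(N)))
    return {order[rank]: codewords[rank] for rank in range(len(probs))}

def prob_of_one(probs, codes, N):
    """Probability that a code bit of the mapped stream is '1' (the q of Eq. (2))."""
    return sum(p * codes[i].count('1') for i, p in enumerate(probs)) / N

# Eight-symbol source used in the example of Table 1 below (N = 3).
probs = [0.35, 0.25, 0.15, 0.10, 0.05, 0.05, 0.03, 0.02]
codes = weighted_code_assignment(probs, 3)
q = prob_of_one(probs, codes, 3)
print(codes)
print(round(q, 3))
```

For the eight-symbol source of the example below, this assignment reproduces the code words of Mapper #1 in Table 1 and gives q ≈ 0.73.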
Accordingly, the entropy of the binary source is given by

H(B) = -q \log_2 q - (1 - q) \log_2 (1 - q).    (3)
An ideal source mapper is one that results in a binary source whose entropy is equal to the entropy per bit of the original information source, and it satisfies the following condition:

H(S) = N H(B).    (4)
This ideal mapping is only possible if the original information source generates symbols having equal probabilities. For other, non-uniform information sources, the entropy of the resulting binary source is higher than the entropy per bit of the original source and the following inequality holds:

N H(B) > H(S).    (5)
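As a quick numerical illustration of Eqs. (4) and (5) (a sketch of ours, using an arbitrarily chosen four-symbol source rather than data from the paper), the fragment below shows that N·H(B) equals H(S) for a uniform source and exceeds it once the probabilities are skewed.

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 2
weights = [2, 1, 1, 0]                        # Hamming weights of the proposed 2-bit assignment

for probs in ([0.25, 0.25, 0.25, 0.25],       # uniform source: Eq. (4), N*H(B) = H(S)
              [0.50, 0.25, 0.15, 0.10]):      # skewed source: Eq. (5), N*H(B) > H(S)
    q = sum(w * p for w, p in zip(weights, sorted(probs, reverse=True))) / N
    print(round(entropy(probs), 3), round(N * entropy([q, 1 - q]), 3))
```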
The difference between the entropy of the Nth-order extended binary source and the entropy of the original source, {N H(B) − H(S)}, is minimum when the code words are assigned according to the above-mentioned mapping rule. This is because the entropy of the binary source H(B) decreases as q increases, and the highest value of q is achieved only if the mapping process is done according to the above-mentioned mapping rule. The following example illustrates the mapping procedure discussed above: Table 1 shows an information source which generates eight symbols with unequal probabilities. Three different source mappers are chosen to represent the original source; the first one is the mapper proposed in this paper, and the other two are chosen arbitrarily. The entropy of the original source is found to be 2.47 bits per symbol, and the entropies of the third-order extension of the three mappers are found to be 2.54, 2.61, and 2.91 bits per symbol, respectively. It is obvious that the proposed source mapper has the closest entropy to that of the original source. A source mapper for English text (256 symbols) is designed according to the proposed mapping rule. Table 2 shows the probabilities and the code words assigned to the different characters of the English text (not all of them are shown in the table).
Table 1
Example – code words of different source mappers

Symbol   Probability   Mapper #1   Mapper #2   Mapper #3
S0       0.35          111         000         001
S1       0.25          110         001         010
S2       0.15          101         010         101
S3       0.10          011         011         100
S4       0.05          100         100         011
S5       0.05          010         101         110
S6       0.03          001         110         111
S7       0.02          000         111         000
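The comparison summarized above can be checked with a short script (again an illustrative sketch of ours, not the authors' code): it computes the probability of symbol "1" produced by each mapper in Table 1 and the corresponding third-order extension entropy 3H(B); the proposed mapper (#1) yields the value closest to the source entropy of about 2.47 bits per symbol.

```python
import math

probs = [0.35, 0.25, 0.15, 0.10, 0.05, 0.05, 0.03, 0.02]

mappers = {
    'Mapper #1 (proposed)': ['111', '110', '101', '011', '100', '010', '001', '000'],
    'Mapper #2':            ['000', '001', '010', '011', '100', '101', '110', '111'],
    'Mapper #3':            ['001', '010', '101', '100', '011', '110', '111', '000'],
}

H_S = -sum(p * math.log2(p) for p in probs)               # entropy of the original source
print('H(S) =', round(H_S, 2), 'bits per symbol')         # about 2.47

for name, codes in mappers.items():
    # q is the probability-weighted fraction of ones over the 3-bit code words.
    q = sum(p * c.count('1') / 3 for p, c in zip(probs, codes))
    H_B = -q * math.log2(q) - (1 - q) * math.log2(1 - q)  # Eq. (3)
    print(name, round(3 * H_B, 2))                        # third-order extension entropy
```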
Table 2
Probabilities and the new code assignments of the English text alphabet

Character   Probability   Assigned code
Space       0.174         11111111
e           0.098         11111110
t           0.070         11111101
a           0.062         11111011
o           0.059         11110111
i           0.055         11101111
n           0.055         11011111
s           0.050         10111111
r           0.048         01111111
h           0.042         11111100
...         ...           ...
The entropy of the original source is calculated using Eq. (1) and is found to be 4.47 bits per symbol. The probabilities of symbol 1 and symbol 0 of the resulting binary source are found to be 0.853 and 0.147, respectively, and the entropy of the resulting first-order binary source is found to be 0.6 bits per symbol. The entropy of the nth-order extended binary source is n times that of the first-order binary source, i.e., 0.6n bits per symbol [2]. The eighth-order extended binary source has the same number of symbols as the original source; its entropy is 4.8 bits per symbol, compared to 4.47 bits per symbol for the original source. If an ASCII code is used as the source mapper, the probabilities of symbol 1 and symbol 0 of the resulting binary source are found to be 0.44 and 0.56, respectively. The entropy of the resulting first-order binary source is 0.99 bits per symbol, and the entropy of the corresponding eighth-order extended binary source is found to be 7.9 bits per symbol, which is very far from that of the original source.

3. Bitwise Huffman source encoding algorithm

Huffman encoding is the oldest and the most widespread source encoding technique [6,7]. It is used in many commercial compression algorithms for different applications, such as text and image compression [9]. It is based on giving short codes to symbols with large
probabilities and longer codes to symbols with small probabilities, such that the resulting average codeword length is made as close as possible to the entropy of the information source. When applied to an original information source with a large number of symbols, the resulting algorithm suffers from many drawbacks, such as the very large code book of variable-length code words, the large time delay in the compression process, and the large overhead to be sent with the compressed file for the decompression process. All these problems can be overcome when Huffman encoding is applied to the proposed nth-order extended binary source on a bitwise basis, rather than on a bytewise basis as is the case in the standard Huffman algorithm. When the Huffman algorithm is applied to the nth-order extended binary source, the probabilities of the resulting 2^n symbols are found using the following rule:

Probability of a given n-bit symbol = q^m (1 - q)^{n - m},    (6)
where q is the probability of symbol 1 of the first-order binary source, and m is the Hamming weight of the given n-bit symbol. Since the probability of symbol 1 is higher than that of symbol 0 under the proposed weighted mapping rule, the n-bit symbols with the higher Hamming weights have the higher probabilities of occurrence. The 2^n symbols are listed in descending order according to their probability of occurrence, or equivalently according to their Hamming weights. The remaining steps of the Huffman procedure are then performed on this list of symbols. The variable-length code words are found, and the average code length is calculated. The coding efficiency of the nth-order extended source is given by

Coding efficiency = n H(B) / L,    (7)

where H(B) is the entropy of the first-order binary source, and L is the average codeword length per n-bit symbol of the nth-order extended binary source in bits per symbol. The effective coding efficiency of the compression process using a Huffman encoder of the nth-order extended source can be found as follows: effective coding efficiency = H(S) / (average code length per original character), where H(S) is the entropy of the original source, and the average codeword length per original character is NL/n bits per character. Therefore, the effective coding efficiency is given by

Effective coding efficiency = n H(S) / (N L).    (8)
An upper limit on the effective coding efficiency can be easily found from Eqs. (7) and (8), and it is given as

Effective coding efficiency < H(S) / (N H(B)).    (9)
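The following sketch ties Eqs. (6)–(8) together. It is an illustrative implementation of ours (a standard heap-based Huffman construction, with helper names such as huffman_code_lengths chosen here), not code from the paper; for q = 0.853, H(S) = 4.47 bits, and N = 8 it yields effective coding efficiencies close to those listed in Table 5 below.

```python
import heapq
from itertools import product

def extended_probs(q, n):
    """Probability of each n-bit symbol of the extended binary source, Eq. (6)."""
    probs = {}
    for bits in product('01', repeat=n):
        s = ''.join(bits)
        m = s.count('1')                     # Hamming weight of the n-bit symbol
        probs[s] = q ** m * (1 - q) ** (n - m)
    return probs

def huffman_code_lengths(probs):
    """Codeword length (in bits) of every symbol under a binary Huffman code."""
    heap = [(p, [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, group1 = heapq.heappop(heap)
        p2, group2 = heapq.heappop(heap)
        for s in group1 + group2:            # every symbol in a merged group gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, group1 + group2))
    return lengths

def effective_efficiency(q, n, H_S, N=8):
    """Effective coding efficiency of Eq. (8) for the nth-order extended source."""
    probs = extended_probs(q, n)
    lengths = huffman_code_lengths(probs)
    L = sum(probs[s] * lengths[s] for s in probs)   # average bits per n-bit symbol
    return n * H_S / (N * L)

for n in (1, 2, 3, 4, 5):
    print(n, round(effective_efficiency(0.853, n, 4.47), 2))
```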
The Huffman algorithm is applied to the nth-order binary source resulting from converting an English text file using the proposed mapping technique discussed in Section 2. The resulting first-order binary source has probabilities of 0.853 for symbol 1 and 0.147 for symbol 0. Tables 3 and 4 show the probabilities and the Huffman code words for the symbols of the second- and third-order extended binary sources.
Table 3
Probabilities and code words of the second-order extended binary source for English text

Symbol   Probability   Code word
11       0.728         0
01       0.125         11
10       0.125         100
00       0.022         101
Table 4
Probabilities and code words of the third-order extended binary source for English text

Symbol   Probability   Code word
111      0.621         0
110      0.107         100
101      0.107         101
011      0.107         110
100      0.018         11100
010      0.018         11101
001      0.018         11110
000      0.004         11111
Table 5
Effective coding efficiencies, average and maximum code lengths for different values of the extension order (n) of the English text

Extension order   Effective coding efficiency   Average code length (bits per character)   Maximum code length (bits per symbol)
1                 0.56                          8.00                                        1
2                 0.79                          5.68                                        3
3                 0.89                          5.00                                        5
4                 0.91                          4.91                                        9
5                 0.92                          4.86                                        12
The effective coding efficiencies for the second- and third-order extended sources are calculated using Eq. (8) and are found to be 0.79 and 0.89, respectively. The effective coding efficiency, the average code length in bits per character, and the maximum code length in bits per symbol for the nth-order extended binary source with n = 1, 2, 3, 4, and 5 are shown in Table 5. It is obvious from Table 5 that there is little to be gained by going beyond the fifth-order extended source, since the effective coding efficiency approaches its upper limit given by Eq. (9). Reducing the number of symbols from 256 in the original source to 16 in the fourth-order extended binary source reduces the complexity and the compression/decompression time of the Huffman encoder dramatically. The major advantage of the proposed technique is that the encoder deals with 2^n symbols for the nth-order extended binary source instead of 2^N symbols in the original source, which results in a substantial saving in processing time. Moreover, this technique gives the designer great flexibility in choosing the right number of symbols (4, 8, 16, etc.) for a given application depending on the
available resources, while this is not possible when dealing with the original source on a bytewise basis. Because of the large reduction in the number of symbols in the source alphabet, the proposed technique is very suitable for adaptive statistical source encoding algorithms. In such adaptive schemes, the probability of each symbol in the source's alphabet has to be updated as new symbols are read [8]. In the proposed method, all that is needed is to update the probability of symbol "one", instead of updating the probabilities of 2^N symbols as in conventional schemes. Other known source encoding techniques can be applied to the proposed nth-order binary source, and similar results are expected. Lempel–Ziv encoders applied to the proposed mapped source are under investigation, and encouraging results have been obtained. The major disadvantage of this technique is the need to transmit the new code assignments of the original symbols with the compressed data. This disadvantage is offset, however, by the large reduction in the length of the probability table that has to be sent with the compressed data: the probability table of the proposed technique contains a single probability, compared to 256 probabilities in conventional methods.
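The single-statistic update mentioned above can be sketched as follows (an illustrative fragment of ours, not the authors' code): an adaptive bitwise coder only needs a running count of ones to re-estimate q, whereas a conventional bytewise scheme must maintain counts for all 2^N symbols.

```python
def update_q(ones, total, bit):
    """Re-estimate q = P(symbol '1') after reading one more bit of the mapped stream."""
    ones += bit
    total += 1
    return ones, total, ones / total

ones = total = 0
for bit in (1, 1, 0, 1, 1, 1, 0, 1):     # a few bits of an illustrative mapped stream
    ones, total, q = update_q(ones, total, bit)
print(round(q, 3))                        # running estimate of the single statistic needed
```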
4. Conclusions

In this paper, we propose a powerful technique by which simple and more efficient text compression algorithms can be implemented. The technique is based on transforming the original non-binary information source into an equivalent binary source having an entropy per bit as close as possible to that of the original source. Commonly used compression algorithms can then be applied to the nth-order extension of the resulting binary source. This approach results in a great reduction in the alphabet size needed to implement the compression algorithm. The proposed technique is tested using the standard Huffman encoder on English text. Using 16 symbols results in compression efficiencies comparable to those of the conventional approach, where 256 symbols are used. Other compression algorithms based on the proposed technique are under investigation, and promising results have been obtained.
References

[1] Bell TC, Cleary JG, Witten IH. Text compression. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[2] Held G, Marshall TR. Data compression. New York: Wiley, 1991.
[3] Langdon GG, Rissanen JJ. Compression of black–white images with arithmetic coding. IEEE Trans Commun 1981;COM-29(6):858–67.
[4] Langdon GG, Rissanen JJ. A simple general binary source code. IEEE Trans Inform Theory 1982;IT-28(9):800–3.
[5] Lynch MF. Compression of bibliographic files using an adaptation of run-length coding. Inform Stor Retriev 1973;9:207–14.
[6] Huffman DA. A method for the construction of minimum-redundancy codes. Proc IRE 1952;40:1098–101.
[7] Gallager RG. Information theory and reliable communication. New York: Wiley, 1968.
[8] Gallager RG. Variations on a theme by Huffman. IEEE Trans Inform Theory 1978;IT-24(6):668–74.
[9] Hashemian R. Memory efficient and high-speed search Huffman coding. IEEE Trans Commun 1995;COM-43(10):2576–81.
Abdel-Rahman Elabdalla was born in Irbid, Jordan. He received his B.Sc. degree from Alexandria University, Egypt, his M.Sc. from Columbia University, New York, USA, and his Ph.D. from Stanford University, CA, USA, all in the field of Electrical Engineering. He is associated with both the Electrical Engineering and Computer Engineering Departments, Jordan University of Science and Technology, Irbid, Jordan. His research interests are in signal processing, data compression, and security in computer networks.
Mansour I. Irshid was born in Amman, Jordan, on January 1, 1952. He received his B.Sc. degree in Electrical Engineering from King Saud University, Saudi Arabia, in 1974, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of Wisconsin, Madison, USA, in 1978 and 1982, respectively. Presently, he is a Professor in the Department of Electrical Engineering at Jordan University of Science and Technology, Jordan. His research interests are in optical communications and digital techniques.