INFORMATION AND CONTROL 6, 297-300 (1963)

Entropy of Tamil Prose

GIFT SIROMONEY
Madras Christian College, Tambaram, India
The proportions of the different letters of the alphabet in Tamil prose are estimated from a large sample and an optimum code is constructed. The prose is compared with the Tamil poetry of different periods, and the one-gram entropy of prose is found to be significantly different from the entropies of the poetical works considered. An estimate is made, experimentally, of the entropy of Tamil prose.

Let $p_1, p_2, \ldots, p_n$ be the proportions of the different letters of the alphabet. The one-gram entropy (Shannon, 1948) is given by

$$H_1 = -\sum_{r=1}^{n} p_r \,\mathrm{ld}\, p_r,$$
where "ld" stands for the logarithm to the base two. If all the $p_r$ are equal, the value of $H_1$ is denoted by $H_0 = \mathrm{ld}\, n$.

In modern Tamil prose, 12 vowels, 18 consonants, 216 vowel-consonants and one auxiliary, "Aitham", are used. In our discussion, each vowel-consonant will be considered as two letters: a consonant followed by a vowel. "Aitham" very seldom occurs in modern prose and will not be considered; the vowels and consonants alone will be considered to make up the Tamil alphabet of 30 letters. A large sample of over 20,000 letters was taken from the prose works published in Madras State during 1946-57, using random sampling methods. For this sample,

$H_1 = 4.34$ bits and $H_0 = 4.91$ bits.
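The definitions of $H_1$ and $H_0$ can be checked numerically. The following Python sketch is illustrative only: it assumes the rounded per-thousand frequencies printed in the "Modern prose" column of Table I rather than the original counts, so it reproduces the reported value of $H_1$ only approximately. The names `entropy_bits` and `PROSE_PER_MILLE` are introduced here for the example.

```python
from math import log2

# Per-thousand letter frequencies, "Modern prose" column of Table I
# (rounded values, so the result is approximate).
PROSE_PER_MILLE = [
    150, 47, 78, 4, 77, 4, 17, 13, 27, 10, 7, 0,   # the 12 vowels
    79, 6, 22, 1, 36, 11, 71, 21, 44, 42, 29, 45,  # first 12 consonants
    31, 35, 7, 21, 27, 38,                         # remaining 6 consonants
]


def entropy_bits(freqs):
    """One-gram entropy -sum(p_r * ld p_r); zero-frequency letters contribute 0."""
    total = sum(freqs)
    return -sum((f / total) * log2(f / total) for f in freqs if f > 0)


h1_prose = entropy_bits(PROSE_PER_MILLE)  # estimate of H1 for Tamil prose
h0_tamil = log2(len(PROSE_PER_MILLE))     # H0 = ld 30
h0_english = log2(26)                     # H0 = ld 26

print(f"H1 (Tamil prose) ~ {h1_prose:.2f} bits")
print(f"H0 (30 letters)  = {h0_tamil:.2f} bits")
print(f"H0 (26 letters)  = {h0_english:.2f} bits")
```

Letters with zero frequency are skipped in the sum, following the usual convention that $0 \,\mathrm{ld}\, 0 = 0$.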
For English prose (26-letter alphabet) the corresponding values are 4.14 and 4.70 bits respectively (Shannon, 1951).

Samples were taken from Tamil poetry from six different works belonging to periods ranging from the beginning of the Christian era to the modern period. As the dates of these works cannot be fixed with any certainty, the (accepted) order in which they were written is followed. Tholkappiam (Porul athikaram) is the oldest and Bharathi's is the most recent. The values of the unbiased estimates of $H_1$ and the size of the sample used are tabulated (Table I) together with the corresponding standard deviations of the estimate of $H_1$. The estimate of $H_1$ is asymptotically normally distributed, and the unbiased estimate of $H_1$ and the standard deviation are obtained from the following formulas (Basharin, 1959).
$$E(\hat{H}_1) = H_1 - \frac{n-1}{2N}\,\mathrm{ld}\, e + O\!\left(\frac{1}{N^2}\right)$$
where $N$ is the size of the sample.

$H_1$ is not significantly different among Thiru Kural and Silappathikaram of the early period and Bharathi's works of the modern period. Tholkappiam and Yuthakandam have practically the same value. The values of $H_1$ for Yuthakandam and Utharakandam are not significantly different, although these two Kandams of Kamba Ramayanam are commonly accepted as written by two different authors, the author of the first being Kamban. A $\chi^2$-test (Herdan, 1956) shows that the proportions of letters from these works are significantly different. Therefore, for the purposes of testing passages of disputed authorship, the $\chi^2$-test is a more useful tool than the characteristic $H_1$. The value of $H_1$ for Tamil prose is significantly different from those of all the poetic works considered.

The letter "l-", which has a sound that is difficult to pronounce and is peculiar to Tamil, has its highest relative frequency in Tholkappiam and its lowest in modern prose. Another peculiar sound, represented by "t", shows the same type of downward trend towards the modern period.

$H_1$ is the average number of binary digits required, per letter of Tamil, if the language is encoded with 100% efficiency, on the first assumption that the occurrence of a letter in Tamil is independent of the preceding letters. Following Huffman (1952), an optimum binary code is constructed and tabulated (Table I). The average number of binary digits per symbol is 4.44, compared to 4.34, the value of $H_1$. Therefore the efficiency (Reza, 1961) of coding is 98%. Other codes, which are not optimum, may be constructed using the given values of the $p$'s.
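The construction of an optimum code and its efficiency can be sketched in Python as follows. This is a minimal illustration, not the author's tabulated code: it assumes the rounded per-thousand frequencies of the "Modern prose" column of Table I, uses letter labels as they can best be read from that table, and breaks ties arbitrarily, so the individual codewords and the average length will differ somewhat from the printed values. The function name `huffman_code` is introduced here for the example.

```python
import heapq
from math import log2

# Rounded per-thousand frequencies, "Modern prose" column of Table I.
# Letter labels are transliterations as read from the table (an assumption).
PROSE_PER_MILLE = {
    "a": 150, "a:": 47, "i": 78, "i:": 4, "u": 77, "u:": 4,
    "e": 17, "e:": 13, "ai": 27, "o": 10, "o:": 7, "au": 0,
    "k": 79, "ng": 6, "c": 22, "nj": 1, "t:": 36, "n:": 11,
    "th": 71, "nh": 21, "p": 44, "m": 42, "y": 29, "r": 45,
    "l": 31, "v": 35, "l-": 7, "l:": 21, "t": 27, "n": 38,
}


def huffman_code(weights):
    """Return {symbol: bit string} built by repeatedly merging the two lightest nodes."""
    # Each heap entry is (weight, unique id, partial code table); the unique id
    # breaks ties so that the dictionaries are never compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, codes0 = heapq.heappop(heap)
        w1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


codes = huffman_code(PROSE_PER_MILLE)
total = sum(PROSE_PER_MILLE.values())

# Average codeword length in binary digits per letter.
avg_len = sum(w * len(codes[s]) for s, w in PROSE_PER_MILLE.items()) / total
# One-gram entropy of the same (rounded) distribution.
h1 = -sum((w / total) * log2(w / total) for w in PROSE_PER_MILLE.values() if w > 0)
# Coding efficiency: entropy divided by average codeword length.
efficiency = h1 / avg_len

print(f"average length = {avg_len:.3f} binary digits per letter")
print(f"H1             = {h1:.3f} bits")
print(f"efficiency     = {100 * efficiency:.1f}%")
```

Any Huffman code for a given distribution attains the same minimum average length, even though different tie-breaking rules yield different codewords.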
TABLE I
RELATIVE FREQUENCIES OF LETTERS (per 1000 letters)

Letter          Tholkappiam  Thiru Kural  Silappathikaram  Kamban's Yutha Kandam  Kamban's Uthara Kandam  Bharathi  Modern prose  Huffman code
a                       139          126              142                    145                     156       129           150  101
a:                       29           60               35                     43                      47        43            47  0100
i                        83           64               68                     69                      71        75            78  1101
i:                        3            7                7                      5                       6         7             4  01111010
u                        87           74               73                     72                      58        70            77  1100
u:                        5            5                6                      6                       6         3             4  011110111
e                        19           24               22                     23                      20        22            17  011111
e:                       14           11               11                     11                      12        20            13  010101
ai                       29           30               29                     23                      28        32            27  01100
o                        14           13               13                     12                       9        13            10  000100
o:                        9            9               12                     11                      11        12             7  0111100
au                        0            0                0                      0                       0         0             0  0111101100
k                        56           71               67                     58                      63        49            79  1110
ng                        9            7               11                      9                       9         5             6  0101000
c                        16           21               18                     17                      19        24            22  00011
nj                        5            4                3                      2                       4         3             1  0111101101
t:                       25           25               32                     27                      25        25            36  10001
n:                       15           20               15                     16                      16        16            11  000101
th                       61           57               63                     68                      65        73            71  1001
nh                       20           26               24                     22                      22        27            21  000000
p                        49           42               39                     33                      33        40            44  0010
m                        47           45               44                     44                      48        57            42  11111
y                        38           34               38                     36                      35        36            29  01101
r                        39           39               46                     47                      40        38            45  0011
l                        33           35               33                     32                      34        36            31  01110
v                        44           35               38                     39                      41        42            35  10000
l-                       16            8               12                      9                       9        10             7  010100
l:                       12           12               12                     16                      12        16            21  000001
t                        41           43               36                     36                      32        24            27  01011
n                        44           53               51                     68                      68        53            38  11110
Total                  1001         1000             1000                    999                     999      1000          1000
Size of sample        12430         6355             4165                  11855                   16283      5056         22855
UBE of H1            4.4122       4.4621           4.4506                 4.4150                  4.3981    4.4661        4.3435
S.D. of H1           0.0094       0.0121           0.0158                 0.0096                  0.0079    0.0137        0.0073

UBE = unbiased estimate; S.D. = standard deviation. Both are in bits.
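The last two rows of Table I can be approximated from the tabulated frequencies. The Python sketch below is illustrative and rests on several assumptions: it uses the rounded per-thousand "Modern prose" frequencies; it corrects the plug-in estimate by adding back the bias term $\frac{n-1}{2N}\,\mathrm{ld}\, e$; it takes the standard large-sample variance $\frac{1}{N}\left(\sum_r p_r \,\mathrm{ld}^2 p_r - H_1^2\right)$ for the standard deviation, which is assumed here rather than taken from the text; and it compares two estimates with a simple normal (z) statistic, which may not be the exact test the author used. All function and variable names are introduced for the example.

```python
from math import e, log2, sqrt

# Rounded per-thousand frequencies, "Modern prose" column of Table I.
PROSE_PER_MILLE = [
    150, 47, 78, 4, 77, 4, 17, 13, 27, 10, 7, 0,
    79, 6, 22, 1, 36, 11, 71, 21, 44, 42, 29, 45,
    31, 35, 7, 21, 27, 38,
]
N_PROSE = 22855  # size of the prose sample reported in Table I


def plug_in_entropy(freqs):
    """Entropy computed directly from the observed proportions."""
    total = sum(freqs)
    return -sum((f / total) * log2(f / total) for f in freqs if f > 0)


def information_variance(freqs):
    """sum(p * (ld p)^2) - H^2, the per-letter variance of -ld p (assumed formula)."""
    total = sum(freqs)
    h = plug_in_entropy(freqs)
    second_moment = sum((f / total) * log2(f / total) ** 2 for f in freqs if f > 0)
    return second_moment - h * h


n_letters = len(PROSE_PER_MILLE)
h_plug = plug_in_entropy(PROSE_PER_MILLE)
# Add back the leading bias term (n - 1) / (2N) * ld e.
h_corrected = h_plug + (n_letters - 1) / (2 * N_PROSE) * log2(e)
sd = sqrt(information_variance(PROSE_PER_MILLE) / N_PROSE)


def z_score(h_a, sd_a, h_b, sd_b):
    """Normal test statistic for the difference of two independent estimates."""
    return abs(h_a - h_b) / sqrt(sd_a ** 2 + sd_b ** 2)


# Comparisons using the values reported in Table I.
z_kandams = z_score(4.4150, 0.0096, 4.3981, 0.0079)          # Yutha vs Uthara Kandam
z_prose_vs_uthara = z_score(4.3435, 0.0073, 4.3981, 0.0079)  # prose vs Uthara Kandam

print(f"corrected H1 ~ {h_corrected:.4f} bits, S.D. ~ {sd:.4f}")
print(f"z (Yutha vs Uthara) = {z_kandams:.2f}")
print(f"z (prose vs Uthara) = {z_prose_vs_uthara:.2f}")
```

Taking 1.96 as the two-sided 5% threshold (an assumed convention), the first comparison falls below the threshold and the second lies well above it, in line with the statements in the text about the two Kandams and about prose.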
Our first assumption that the occurrences of the letters are independent can be improved by assuming that the occurrence of a letter is dependent on the $(n - 1)$ preceding letters. The corresponding n-gram entropy is denoted by $H_n$, and its limiting value, when $n$ is large, is denoted by $H$. An estimate of $H$ is made, using Shannon's (1951) experimental methods. Passages about 30 letters long were chosen from normal prose and the subject was asked to guess each letter. If the guess was correct, the subject was told so, and if the guess was wrong, the correct letter was given. The same passages were tried separately on two subjects, S1 and S2, both second-year university students. S1 guessed 320 letters and S2 guessed 317 out of 583 letters, giving 55% as the proportion of letters guessed correctly. Following Brillouin (1956), the estimate of $H$ is 2.51 bits and this should be considered as an upper limit. Therefore, it is possible to reduce drastically the length of a given message in Tamil, if proper encoding is used.

ACKNOWLEDGMENT

The author wishes to thank Dr. W. F. Kibble of Heriot-Watt College, Edinburgh and the late Dr. R. P. Sethu Pillai, Professor of Tamil, University of Madras, for their valuable help at the initial stages of the work.

REFERENCES
BRILLOUIN, L. (1956), "Science and Information Theory," p. 25. Academic Press, New York.
BASHARIN, G. P. (1959), Teoriya Veroyatnostei i ee Primeneniya 4, 361.
HERDAN, G. (1956), "Language as Choice and Chance," p. 88. Noordhoff, Groningen.
HUFFMAN, D. A. (1952), Proc. Inst. Radio Engrs. 40, 1098.
REZA, F. M. (1961), "An Introduction to Information Theory," p. 133. McGraw-Hill, New York.
SHANNON, C. E. (1948), Bell System Tech. J. 27, 379.
SHANNON, C. E. (1951), Bell System Tech. J. 30, 50.