Entropy of Tamil prose


INFORMATION AND CONTROL 6, 297-300 (1963)

GIFT SIROMONEY

Madras Christian College, Tambaram, India

The proportions of the different letters of the alphabet in Tamil prose are estimated from a large sample and an optimum code is constructed. The prose is compared with the Tamil poetry of different periods and the one-gram entropy of prose is significantly different from the entropies of the poetical works considered. An estimate is made, experimentally, of the entropy of Tamil prose.

Let p_1, p_2, ..., p_n be the proportions of the different letters of the alphabet. The one-gram entropy (Shannon, 1948) is given by

$$H_1 = -\sum_{r=1}^{n} p_r \,\mathrm{ld}\, p_r ,$$

where "ld" stands for the logarithm to the base two. If all the p_r are equal, then the value of H1 is denoted by H0 = ld n.

In modern Tamil prose, 12 vowels, 18 consonants, 216 vowel-consonants and one auxiliary "Aitham" are used. In our discussion, each vowel-consonant will be considered as two letters: a consonant followed by a vowel. "Aitham" very seldom occurs in modern prose and it will not be considered; the vowels and consonants alone will be considered to make up the Tamil alphabet of 30 letters. A large sample of over 20,000 letters was taken from the prose works published in Madras State during 1946-57, using random sampling methods. For this sample,

H1 = 4.34 bits and H0 = 4.91 bits.
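The two quantities can be checked numerically. The sketch below is a minimal illustration, not the author's procedure: it assumes the modern-prose column of Table I (relative frequencies per 1000 letters, already rounded), and the function names are hypothetical. Because the inputs are rounded, the result should land close to, but need not exactly equal, the reported 4.34 bits.

```python
import math

# Modern-prose relative frequencies per 1000 letters, copied from Table I
# (12 vowels followed by 18 consonants). These are rounded values.
MODERN_PROSE = [150, 47, 78, 4, 77, 4, 17, 13, 27, 10, 7, 0,
                79, 6, 22, 1, 36, 11, 71, 21, 44, 42, 29, 45,
                31, 35, 7, 21, 27, 38]


def one_gram_entropy(freqs):
    """H1 = -sum p_r ld p_r, with p_r = freq / total; zero terms contribute 0."""
    total = sum(freqs)
    h = 0.0
    for f in freqs:
        if f > 0:
            p = f / total
            h -= p * math.log2(p)
    return h


def max_entropy(n):
    """H0 = ld n, the entropy when all n letters are equally likely."""
    return math.log2(n)


h1 = one_gram_entropy(MODERN_PROSE)
h0 = max_entropy(len(MODERN_PROSE))
print(f"H1 = {h1:.2f} bits, H0 = {h0:.2f} bits")
```

The same function applied to any other column of Table I gives the plug-in estimate for that work.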

For English prose (26-letter alphabet) the corresponding values are 4.14 and 4.70 bits respectively (Shannon, 1951).

Samples were taken from Tamil poetry from six different works belonging to periods ranging from the beginning of the Christian era to the modern period. As the dates of these works cannot be fixed with any certainty, the (accepted) order in which they were written is followed. Tholkappiam (Porul athikaram) is the oldest and Bharathi's is the most recent. The values of the unbiased estimates of H1 and the size of the sample used are tabulated (Table I) together with the corresponding standard deviations of the estimate of H1. The estimate of H1 is asymptotically normally distributed, and the unbiased estimate of H1 and the standard deviation are obtained from the following formulas (Basharin, 1959).

$$E(\hat{H}_1) = H_1 - \frac{n-1}{2N}\,\mathrm{ld}\, e + O\!\left(\frac{1}{N^2}\right)$$

where N is the size of the sample. The values of H1 for Thiru Kural and Silappathikaram of the early period and for Bharathi's works of the modern period are not significantly different. Tholkappiam and Yuthakandam have practically the same value. The values of H1 for Yuthakandam and Utharakandam are not significantly different, although these two Kandams of Kamba Ramayanam are commonly accepted as written by two different authors, the author of the first being Kamban. A χ²-test (Herdan, 1956) shows that the proportions of letters from these works are significantly different. Therefore, for the purposes of testing passages of disputed authorship, the χ²-test is a more useful tool than the characteristic H1. The value of H1 for Tamil prose is significantly different from those of all the poetic works considered.

The letter "ḻ", which has a sound that is difficult to pronounce and peculiar to Tamil, has its highest value for Tholkappiam and its lowest for modern prose. Another peculiar sound, represented by "t", shows the same type of downward trend towards the modern period.

H1 is the average number of binary digits required, per letter of Tamil, if the language is encoded with 100% efficiency, on the first assumption that the occurrence of a letter in Tamil is independent of the preceding letters. Following Huffman (1952), an optimum binary code is constructed and tabulated. The average number of binary digits per symbol is 4.44 compared to 4.34, the value of H1. Therefore the efficiency (Reza, 1961) of coding is 98%. Other codes, which are not optimum, may be constructed using the given values of the p's.
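The coding step can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the tabulated code: it uses the rounded modern-prose frequencies from Table I, breaks ties in its own way, and takes efficiency to be H1 divided by the average code length. Individual codeword lengths and the resulting average may therefore differ slightly from the code in Table I and from the 4.44 bits quoted above. All function names are hypothetical.

```python
import heapq
import math

# Modern-prose relative frequencies per 1000 letters, copied from Table I.
MODERN_PROSE = [150, 47, 78, 4, 77, 4, 17, 13, 27, 10, 7, 0,
                79, 6, 22, 1, 36, 11, 71, 21, 44, 42, 29, 45,
                31, 35, 7, 21, 27, 38]


def huffman_code_lengths(weights):
    """Return the codeword length Huffman's procedure assigns to each symbol."""
    # Heap entries: (subtree weight, unique tie-breaker, symbol indices in subtree).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    counter = len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)  # two lightest subtrees
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


def average_length(weights, lengths):
    """Average number of binary digits per symbol."""
    return sum(w * l for w, l in zip(weights, lengths)) / sum(weights)


def entropy(weights):
    """One-gram entropy in bits of the distribution given by the weights."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


lengths = huffman_code_lengths(MODERN_PROSE)
avg = average_length(MODERN_PROSE, lengths)
h1 = entropy(MODERN_PROSE)
print(f"average length = {avg:.2f} bits, H1 = {h1:.2f} bits, "
      f"efficiency = {h1 / avg:.1%}")
```

Only codeword lengths are tracked here, since the average length and the efficiency depend on the lengths alone; actual codewords could be recovered by recording the 0/1 branch taken at each merge.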


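TABLE I. RELATIVE FREQUENCIES OF LETTERS (per 1000 letters)

| Letter | Tholkappiam | Thiru Kural | Silappathikaram | Kamban's Yutha Kandam | Kamban's Uthara Kandam | Bharathi | Modern prose | Huffman code |
|---|---|---|---|---|---|---|---|---|
| a | 139 | 126 | 142 | 145 | 156 | 129 | 150 | 101 |
| a: | 29 | 60 | 35 | 43 | 47 | 43 | 47 | 0100 |
| i | 83 | 64 | 68 | 69 | 71 | 75 | 78 | 1101 |
| i: | 3 | 7 | 7 | 5 | 6 | 7 | 4 | 01111010 |
| u | 87 | 74 | 73 | 72 | 58 | 70 | 77 | 1100 |
| u: | 5 | 5 | 6 | 6 | 6 | 3 | 4 | 011110111 |
| e | 19 | 24 | 22 | 23 | 20 | 22 | 17 | 011111 |
| e: | 14 | 11 | 11 | 11 | 12 | 20 | 13 | 010101 |
| ai | 29 | 30 | 29 | 23 | 28 | 32 | 27 | 01100 |
| o | 14 | 13 | 13 | 12 | 9 | 13 | 10 | 000100 |
| o: | 9 | 9 | 12 | 11 | 11 | 12 | 7 | 0111100 |
| au | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0111101100 |
| k | 56 | 71 | 67 | 58 | 63 | 49 | 79 | 1110 |
| ng | 9 | 7 | 11 | 9 | 9 | 5 | 6 | 0101000 |
| c | 16 | 21 | 18 | 17 | 19 | 24 | 22 | 00011 |
| nj | 5 | 4 | 3 | 2 | 4 | 3 | 1 | 0111101101 |
| t: | 25 | 25 | 32 | 27 | 25 | 25 | 36 | 10001 |
| n: | 15 | 20 | 15 | 16 | 16 | 16 | 11 | 000101 |
| th | 61 | 57 | 63 | 68 | 65 | 73 | 71 | 1001 |
| nh | 20 | 26 | 24 | 22 | 22 | 27 | 21 | 000000 |
| p | 49 | 42 | 39 | 33 | 33 | 40 | 44 | 0010 |
| m | 47 | 45 | 44 | 44 | 48 | 57 | 42 | 11111 |
| y | 38 | 34 | 38 | 36 | 35 | 36 | 29 | 01101 |
| r | 39 | 39 | 46 | 47 | 40 | 38 | 45 | 0011 |
| l | 33 | 35 | 33 | 32 | 34 | 36 | 31 | 01110 |
| v | 44 | 35 | 38 | 39 | 41 | 42 | 35 | 10000 |
| ḻ | 16 | 8 | 12 | 9 | 9 | 10 | 7 | 010100 |
| l: | 12 | 12 | 12 | 16 | 12 | 16 | 21 | 000001 |
| t | 41 | 43 | 36 | 36 | 32 | 24 | 27 | 01011 |
| n | 44 | 53 | 51 | 68 | 68 | 53 | 38 | 11110 |
| Total | 1001 | 1000 | 1000 | 999 | 999 | 1000 | 1000 | |
| Size of sample | 12430 | 6355 | 4165 | 11855 | 16283 | 5056 | 22855 | |
| Unbiased estimate of H1 | 4.4122 | 4.4621 | 4.4506 | 4.4150 | 4.3981 | 4.4661 | 4.3435 | |
| S.D. of H1 | 0.0094 | 0.0121 | 0.0158 | 0.0096 | 0.0079 | 0.0137 | 0.0073 | |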
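One way to read the significance statements against Table I is a normal test on the difference between two estimates, which the asymptotic normality of the estimate of H1 makes plausible. The sketch below is an assumption-laden illustration, not the author's stated procedure: it treats the samples as independent and uses a two-sided 5% critical value of 1.96, neither of which is specified in the text. The helper `bias_term` evaluates the first-order correction (n - 1)/(2N) ld e from Basharin's formula; all names are hypothetical.

```python
import math

# (unbiased estimate of H1, standard deviation, sample size), copied from Table I.
TABLE_I = {
    "Tholkappiam": (4.4122, 0.0094, 12430),
    "Thiru Kural": (4.4621, 0.0121, 6355),
    "Silappathikaram": (4.4506, 0.0158, 4165),
    "Yutha Kandam": (4.4150, 0.0096, 11855),
    "Uthara Kandam": (4.3981, 0.0079, 16283),
    "Bharathi": (4.4661, 0.0137, 5056),
    "Modern prose": (4.3435, 0.0073, 22855),
}


def z_statistic(a, b):
    """Standardised difference of two H1 estimates (independence assumed)."""
    h_a, sd_a, _ = TABLE_I[a]
    h_b, sd_b, _ = TABLE_I[b]
    return (h_a - h_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)


def significantly_different(a, b, critical=1.96):
    """Two-sided test; the 5% level (1.96) is an assumption, not from the paper."""
    return abs(z_statistic(a, b)) > critical


def bias_term(n_letters, sample_size):
    """First-order bias (n - 1) / (2N) * ld e of the plug-in entropy estimate."""
    return (n_letters - 1) / (2 * sample_size) * math.log2(math.e)


for work in TABLE_I:
    if work != "Modern prose":
        z = z_statistic(work, "Modern prose")
        print(f"{work:16s} vs Modern prose: z = {z:5.2f}")
print(f"bias term for modern prose: {bias_term(30, TABLE_I['Modern prose'][2]):.4f} bits")
```

Under these assumptions the pairs discussed in the text can be checked directly, for example `significantly_different("Yutha Kandam", "Uthara Kandam")`.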


Our first assumption that the occurrences of the letters are independent can be improved by assuming that the occurrence of a letter is dependent on the (n - 1) preceding letters. The corresponding n-gram entropy is denoted by Hn, and its limiting value, when n is large, is denoted by H. An estimate of H is made, using Shannon's (1951) experimental methods. Passages about 30 letters in length were chosen from normal prose and the subject was asked to guess each letter. If the guess was correct, the subject was told so, and if the guess was wrong, the correct letter was given. The same passages were tried separately on two subjects, S1 and S2; S1 guessed 320 letters and S2 guessed 317 out of 583 letters, giving 55% as the proportion of letters guessed correctly. S1 and S2 were second-year university students. Following Brillouin (1956), the estimate of H is 2.51 bits and this should be considered as an upper limit. Therefore, it is possible to reduce drastically the length of a given message in Tamil, if proper encoding is used.

ACKNOWLEDGMENT

The author wishes to thank Dr. W. F. Kibble of Heriot-Watt College, Edinburgh and the late Dr. R. P. Sethu Pillai, Professor of Tamil, University of Madras, for their valuable help at the initial stages of the work.

REFERENCES

BRILLOUIN, L. (1956), "Science and Information Theory," p. 25. Academic Press, New York.

BASHARIN, G. P. (1959), Teoriya Veroyatnostei i ee Primeneniya 4, 361.

HERDAN, G. (1956), "Language as Choice and Chance," p. 88. Noordhoff, Groningen.

HUFFMAN, D. A. (1952), Proc. Inst. Radio Engrs. 40, 1098.

REZA, F. M. (1961), "An Introduction to Information Theory," p. 133. McGraw-Hill, New York.

SHANNON, C. E. (1948), Bell System Tech. J. 27, 379.

SHANNON, C. E. (1951), Bell System Tech. J. 30, 50.