Computer recognition of printed Tamil characters


Pattern Recognition, Vol. 10, 243-247. © Pergamon Press, Ltd. 1978. Printed in Great Britain

0031-3205/78/0801-0243 $02.00/0

COMPUTER RECOGNITION OF PRINTED TAMIL CHARACTERS

G. SIROMONEY, R. CHANDRASEKARAN and M. CHANDRASEKARAN
Madras Christian College, Tambaram 600059, India

(Received 3 October 1977; received for publication 3 January 1978)

Abstract - Computer recognition of machine-printed letters of the Tamil alphabet is described. Each character is represented as a binary matrix and encoded into a string using two different methods. The encoded strings form a dictionary. A given text is presented symbol by symbol and information from each symbol is extracted in the form of a string and compared with the strings in the dictionary. When there is agreement the letters are recognized and printed out in Roman letters following a special method of transliteration. The lengthening of vowels and hardening of consonants are indicated by numerals printed above each letter.

Character recognition procedure, Template matching, Feature encoding, Tamil alphabet, Condensing

INTRODUCTION

The development of Tamil script, like all other Indian scripts, can be traced back to the Brahmi script of king Asoka in the third century B.C. (1) Tamil writing is a combination of alphabetical and syllabic systems. The Tamil word eluthu represents not only the forms of pure consonants and vowels but also syllables such as pi, ra, mi and ku which are combinations of consonants and vowels. Compared to other Indian languages Tamil has a relatively small number of pure consonants and vowels. (2)

In the Brahmi script syllables were denoted by vowel-markers tagged on to the consonant symbols. The vowel-markers were simple vertical and horizontal strokes. During the first few centuries of development the vowel-marker for the medial ā sign got separated from the syllable sign and was represented as an auxiliary symbol placed to the right of a consonant. For certain syllables with the medial ā there was a separate symbol. The medial e sign also got separated from the syllable sign and was represented as a separate symbol placed to the left of the consonant. The medial o is represented by two symbols, one of which is to the right and the other to the left of the consonant.

The Tamil alphabet has thirty basic letters of which eighteen are consonants and twelve are vowels. The short and the long vowels are reckoned as distinct letters. In addition there are 216 combinations of consonants and vowels called uyirmei, which are compound letters or syllables represented by individual symbols or by a combination of consonants and auxiliary vowel-markers. There is also a special symbol called āitham, and this brings the total to 247 combinations. In practice some of these combinations are not used and the printers have types for only about 125 symbols.

For the recognition of printed characters one may use a template matching method (3) but it cannot be extended to include recognition of handprinted characters. Narasimhan (4) has proposed a linguistic method for the recognition of handprinted English letters. Work has also been done on the recognition of Indian scripts such as Telugu, (5) Devanagari (6) and Bengali. (7)

There are different fonts used by printers in India and we have chosen for the study the Bharathi Antique type (Fig. 1), which is less stylized than many other types. Each character is converted manually into a rectangular binary array in which a "0" represents a blank and a "1" a nonblank. We use the word character to denote a letter of the alphabet as well as any auxiliary symbol which forms part of a compound letter.

There are a number of ways of extracting information from the matrices for purposes of forming a dictionary. Dutta (8) considers the frequency of change of runs in each column and derives "events", "half events" etc. We make use of the frequency of runs of 1's in both columns and rows. We are able to recognize all the 125 different characters used by the printers of Tamil texts.

Fig. 1. A sample of Tamil characters.

CONDENSED RUN METHOD

Each Tamil character is represented as an M × N binary matrix called PATN(I, J). The features present in the Ith row of the pattern matrix PATN(I, J) are extracted and represented in the form of a single numeral called RRUN(I), which is equal to the number of distinct runs of 1's in the Ith row of PATN(I, J). Similarly CRUN(J) is the number of distinct runs of 1's in the Jth column of PATN(I, J) and represents the features present in the Jth column.

We now define RRUN of PATN(I, J) as a string of length M such that

$$\mathrm{RRUN} = r(1)\,r(2)\cdots r(M), \qquad r(I) = \mathrm{RRUN}(I), \quad I = 1, 2, \ldots, M.$$

Thus RRUN is a string of RRUN(I)'s where I ranges over all the M rows, and it contains all the necessary information extracted from all the rows. Similarly CRUN is defined as a string of CRUN(J)'s, J = 1, 2, ..., N.

The strings RRUN and CRUN are further condensed as follows to the strings RCON and CCON respectively. Let RRUN be composed of k substrings such that

$$\mathrm{RRUN} = s(i_1)\,s(i_2)\cdots s(i_k)$$

where $s(i_r)$ is a substring composed entirely of all consecutive occurrences of the same number $i_r$. Such a representation is unique. Then

$$\mathrm{RCON} = i_1\,i_2\cdots i_k$$

where each $i_r$ represents a single numeral. For example if RRUN is the string 11333322211 then the corresponding RCON is 1321. RCON is called the condensed row string. Similarly the condensed column string CCON is formed by retaining only one digit in each run of that digit in the corresponding CRUN. This is a kind of thinning process and the condensed strings will be independent of the size and thickness of the letters as well as the proportion of the matrices.

In practice we omit solitary occurrences of any numeral from RRUN and CRUN before forming RCON and CCON, since such occurrences are often, though not always, due to noise generated in the digitization of lines which are three or four units thick in our samples. This often prevents spurious features from being retained in RRUN and CRUN and transferred to RCON and CCON.

Figure 2 is a binary matrix which represents the syllable na. In the first row there is only a single run of 1's and RRUN(1) = 1. We note that the thickness of the horizontal line of the character is 4 and we get RRUN(2) = RRUN(3) = RRUN(4) = 1. In the fifth row there are two distinct runs of 1's. Therefore RRUN(5) = 2. This is given in the row run string RRUN. In the fifteenth row there are three runs of 1's and RRUN(15) = 3. In the condensed row run RCON we find that the first run of 1's is replaced by "1", the run of 2's is replaced by "2", the run of 3's is replaced by "3" and the last run of 1's is replaced by "1". Hence RCON is the string 1231. Similarly column runs are formed and then reduced to CCON.

We note that the length of the string RCON is much less than that of the corresponding string RRUN. Let RCON = p(1) p(2) ... p(RN) where RN is the length of the string RCON. Then p(I) ≠ p(I + 1) and each p(I) is equal to some RRUN(J), J = 1, 2, ..., M. We define RCON(I) = p(I). We need this definition for comparing RCON strings digit by digit for purposes of recognition. The string lengths of RCON and CCON are stored in RN and CN respectively. Condensed strings RCON and CCON are obtained for each of the 125 pattern matrices and stored in the memory of the computer, forming a dictionary.

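The condensed run method can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function names `count_runs`, `run_strings` and `condense` are hypothetical, the pattern is taken to be a list of rows of 0/1 integers, and equal numerals that become neighbours after a solitary numeral is dropped are assumed to merge (which keeps p(I) ≠ p(I + 1) in the condensed string).

```python
from itertools import groupby


def count_runs(bits):
    """Number of distinct runs of 1's in a sequence of 0/1 values."""
    return sum(1 for value, _ in groupby(bits) if value == 1)


def run_strings(patn):
    """Return (RRUN, CRUN) as lists of run counts for a binary matrix."""
    rrun = [count_runs(row) for row in patn]
    crun = [count_runs(col) for col in zip(*patn)]
    return rrun, crun


def condense(runs, drop_solitary=True):
    """Keep one numeral per block of equal consecutive numerals.

    With drop_solitary=True, blocks of length 1 are discarded first,
    mirroring the omission of solitary occurrences described in the text.
    """
    blocks = [(value, len(list(group))) for value, group in groupby(runs)]
    if drop_solitary:
        blocks = [(value, n) for value, n in blocks if n > 1]
    # Assumption: neighbours that became equal after dropping are merged.
    return [value for value, _ in groupby(value for value, _ in blocks)]


# The example from the text: RRUN = 11333322211 condenses to 1321.
print(condense([int(c) for c in "11333322211"]))  # [1, 3, 2, 1]

# A small hypothetical 4 x 4 pattern (a bar on top of two legs).
patn = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
rrun, crun = run_strings(patn)
print(rrun, condense(rrun))  # [1, 1, 2, 2] [1, 2]
print(crun, condense(crun))  # [1, 1, 1, 1] [1]
```

Because only the order of distinct run counts survives, doubling the thickness of every stroke in `patn` would leave both condensed strings unchanged, which is the size independence claimed for the method.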
Condensed run recognition procedure

When a test character is taken up for recognition it is converted into a binary matrix and the values for condensed row runs, condensed column runs and their corresponding lengths are extracted and represented as RCONT, CCONT, RTN and CTN respectively. First RN and RTN are taken up for comparison. If they agree then RCON(K) is compared to RCONT(K) for K = 1, 2, ..., RTN; otherwise RN of the next pattern is taken up for consideration. If RCON(K) equals RCONT(K) for K = 1, 2, ..., RTN then it is checked whether CN equals CTN for the same pattern. If so, CCON(L) is compared with CCONT(L) for L = 1, 2, ..., CTN. If CN ≠ CTN the next pattern is taken up for consideration. Only when RCON(K) = RCONT(K) for K = 1, 2, ..., RTN and CCON(L) = CCONT(L) for L = 1, 2, ..., CTN for a test pattern do we reckon that the pattern is recognized.

    00001111111111111111100
    00001111111111111111100
    00001111111111111111100
    00001111111111111111100
    00001111000111100000000
    00001111000111100000000
    00001111000111100000000
    00001111000111111111000
    00001111000111111111100
    00001111000111111111110
    00001111000111111111111
    00001111000111110011111
    00001111000111100001111
    00001111000111100001111
    00001111000111100001111
    00001111000111100001111
    00001111000111100011111
    00000000000000000011111
    00000000000000000111111
    00001111111111111111110
    00111111111111111111100
    00111111111111111111000
    01111110000000000000000
    01111000000000000000000
    11111000000000000000000
    11110000000000000000000
    11110000000000000000000
    11110000000000000000000

    Row run                 1111222222233333311111111111
    Condensed row run       1231
    Symbolic row run        S1M2M3L1

    Column run              11112222222222233332211
    Condensed column run    12321
    Symbolic column run     S1L2S3S2S1

Fig. 2. Representation of the character na.
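The matching procedure can be sketched as follows, using the binary matrix of Fig. 2 as the test pattern. This is a minimal sketch under assumptions, not the authors' program: the names `encode`, `lookup`, `recognize`, `FIG2` and `DICTIONARY` are hypothetical, the one garbled cell in row 10 of the figure is read as a 1 (the run counts are the same either way), run counts are assumed to be single digits so that they can be joined into strings, and the small dictionary combines the Fig. 2 strings with three entries from Table 1 rather than the full set of 125 patterns. The ordering of the dictionary by character frequency is not modelled.

```python
from itertools import groupby


def runs_of_ones(line):
    """Number of distinct runs of '1' in a row or column."""
    return sum(1 for value, _ in groupby(line) if value == "1")


def condensed(runs):
    """Drop solitary numerals, then keep one numeral per block of repeats."""
    kept = [value for value, block in groupby(runs) if len(list(block)) > 1]
    return "".join(str(value) for value, _ in groupby(kept))


def encode(rows):
    """Return the condensed (row, column) strings of a pattern."""
    rrun = [runs_of_ones(row) for row in rows]
    crun = [runs_of_ones(col) for col in zip(*rows)]
    return condensed(rrun), condensed(crun)


def lookup(rcont, ccont, dictionary):
    """Compare lengths first, then the strings; rows before columns."""
    matches = []
    for name, rcon, ccon in dictionary:
        if len(rcon) != len(rcont) or rcon != rcont:
            continue  # RN != RTN, or some RCON(K) != RCONT(K)
        if len(ccon) != len(ccont) or ccon != ccont:
            continue  # CN != CTN, or some CCON(L) != CCONT(L)
        matches.append(name)
    # Tied patterns are reported together, separated by a slash.
    return "/".join(matches) if matches else None


def recognize(rows, dictionary):
    return lookup(*encode(rows), dictionary)


FIG2 = (
    ["00001111111111111111100"] * 4
    + ["00001111000111100000000"] * 3
    + [
        "00001111000111111111000",
        "00001111000111111111100",
        "00001111000111111111110",  # garbled cell in the source, read as 1
        "00001111000111111111111",
        "00001111000111110011111",
    ]
    + ["00001111000111100001111"] * 4
    + [
        "00001111000111100011111",
        "00000000000000000011111",
        "00000000000000000111111",
        "00001111111111111111110",
        "00111111111111111111100",
        "00111111111111111111000",
        "01111110000000000000000",
        "01111000000000000000000",
        "11111000000000000000000",
    ]
    + ["11110000000000000000000"] * 3
)

DICTIONARY = [
    ("NA", "1231", "12321"),        # strings shown in Fig. 2
    ("PI", "12432", "12321"),       # Table 1
    ("MI", "12342", "12321"),       # Table 1
    ("E medial", "12342", "12321"), # Table 1, same strings as MI
]

print(encode(FIG2))
print(recognize(FIG2, DICTIONARY))
print(lookup("12342", "12321", DICTIONARY))
```

The last call shows why a tie arises in the condensed run method: mi and the medial e symbol share both condensed strings in Table 1, so both names are returned.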

The flow diagram for the recognition procedure is given in Fig. 3. SUB 1 is the subroutine RCRUN which obtains the row and column runs. SUB 2 is the subroutine RCCOND which obtains the condensed row and column runs.

All the 125 characters except four have unique representations. Ties are observed between the characters for pa and pu, and also between mi and the symbol for medial e. In the printout the alternatives are given with a slash between them. It is possible to guess the correct symbol from the textual context. It is also possible to distinguish between pa and pu by studying the proportion between row and column lengths of the binary matrices. If pa is represented by a p × q binary matrix then pu will be represented by a (p + k) × q matrix where k > p/2, p and q being nearly equal. Another method would be to form the complements of the binary matrices by interchanging 0's and 1's and compare the new matrices. We shall also discuss a third possibility called the method of symbolic runs.

It is observed that for each pattern matrix of M rows and N columns a maximum of only M + N locations are used for storing the condensed runs. We note that the condensed row and column runs are independent of the size of the pattern matrix.

SYMBOLIC RUN METHOD

In the condensed run method, if a numeral occurs in consecutive positions in RRUN or CRUN such repeated occurrences are ignored. In the symbolic run method such repeated occurrences are classified into three types, viz. small run, medium run and long run. This method is therefore more sensitive than the condensed run method.

Let RRUN be composed entirely of substrings of the form s(I) where s(I) is made up entirely of all consecutive occurrences of the numeral I. Such a representation is unique. Let $|s(I)|$ represent the length of the substring s(I). If $|s(I)| \le [H/4]$ then s(I) = SI, where H is the common height of certain basic characters such as ka, ca and pa and [H/4] is the integral part of H/4. SI is called the small run of the numeral I. For instance if H = 19 then H/4 equals 4.75 and [H/4] is 4. Hence the runs 11, 111 and 1111 will be denoted by S1; the runs 22, 222 and 2222 will be denoted by S2 and so on. We recall that in the condensed run method any string of 1's will be denoted by 1 irrespective of whether it is a short run of 1's or a long run of 1's. We define the long run LI and the medium run MI of I's as follows. If $|s(I)| \ge [H/2] + 1$ then s(I) = LI. If $[H/4] < |s(I)| \le [H/2]$ then s(I) = MI. The same procedure is followed for compressing the column runs.

RRUN and CRUN are condensed to symbolic row runs and symbolic column runs. If RRUN is the string 33311 then the corresponding symbolic row run will be S3S1. This is illustrated in Table 2. To minimize the recognition time the primitives are arranged in the order of relative frequency of the characters in Tamil prose.

TRANSLITERATION

Strings of characters are recognized and printed in Roman letters. There are some compound letters that are represented by single characters and they pose no difficulty. There are certain other compound letters such as kā which are represented by two symbols, the medial ā sign following the consonant symbol. There are compound letters such as ke which are represented by the medial e symbol followed by a consonant symbol. The compound letter ko is represented by the consonant symbol k preceded by one auxiliary symbol and followed by another. The program takes care of all these contingencies by making use of context-sensitive rules.

We use a new scheme of transliteration here (Table 1). A long vowel is distinguished from a short vowel by placing the numeral 2 above the vowel. Similarly the hardening of a consonant is indicated by a numeral placed above the consonant. In Tamil there are three kinds of l's and two kinds of r's.

Fig. 3. Flow diagram for recognizing test patterns using the condensed run method.

CONCLUSION

The methods can be extended to the recognition of handwritten Tamil characters that follow certain constraints such as maintenance of connectivity properties. Instead of transliterating into Roman letters it may be possible to transliterate into any other Indian script. The symbolic run method also works well for printed numerals. It may be possible to use these methods and their variations to recognize other scripts.

Table 1. Condensed row and column runs for Tamil characters of Fig. 1

    Name of      Computer form          Condensed    Condensed
    character    (numerals are printed  row run      column run
                 above the letters)
    va           VA                     2342         1321
    la           3 LA                   23121        21321
    la           2 LA                   2343         1321
    ra           2 RA                   2131         121
    na           2 NA                   13453        123212121
    pi           PI                     12432        12321
    mi           MI                     12342        12321
    yi           YI                     12543        12321
    ri           RI                     123          12121
    li           LI                     123453       13212321
    nā           22 NA                  2456421      12432321
    e medial     -                      12342        12321
    ā medial     -                      1321231      1242
    ai medial    -                      13453        1212121
    kī           2 KI                   13212132     1232532
    yī           2 YI                   12432        1231
    rī           2 RI                   1212         1231
    vī           2 VI                   12342        132121231
    lī           32 LI                  1323121      232452
    lī           22 LI                  132343       1321231
    nū           32 NU                  2689621      235434343432121
    thū          H2 TU                  24354321     12345321
    nū           2 NU                   245321       1321
    pū           2 PU                   21321        2321

SUMMARY

A new method of automatic recognition of letters of the Tamil alphabet is proposed. Bharathi Antique type was chosen for the experiment. Each character was converted into a rectangular binary array in which a "0" represented a blank and a "1" a nonblank. Digitization was done by hand. Information was extracted from each array, condensed into a string and stored in memory. Letters of the Tamil alphabet were given for recognition in the form of binary matrices. Each matrix was converted into a string and compared with the stored string patterns. When there was agreement, individual symbols were recognized and printed out in Roman letters. The lengthening of vowels and hardening of consonants were indicated by numerals printed above each letter.

Two methods of converting a letter matrix into a short string were tried. The first method is called the condensed run method. Each binary matrix is examined by the computer column by column and the number of runs of 1's is noted for each column. This gives a string of numbers. This string is condensed by deleting the repeated consecutive occurrences of the same number. This condensing procedure shortens the string considerably and is equivalent to a sort of thinning procedure. Similarly the matrix is examined row by row and another condensed string is formed. These condensed strings are short and contain all the necessary information for recognizing the vast majority of symbols. However there are two pairs of characters which do not have unique representations. In the printout both alternatives are printed with a slash between the symbols.

The second method is called the method of symbolic runs. As in the case of the first method each matrix is examined column-wise and the number of runs of 1's is noted. In the new string of numerals that is formed, any one numeral may occur in consecutive positions forming a short, medium or long run. This information is noted and a new string called the symbolic run is formed. Similarly another symbolic run string is formed after row-wise examination of the matrix. These two strings are stored for purposes of recognition. In this method, unlike in the first method, all the different symbols used in printed Tamil have unique representations. To minimize recognition time, statistical information on the relative frequency of the different characters of the Tamil alphabet is used.

The method described here can be extended to recognize handwritten Tamil characters that follow certain constraints.

REFERENCES

1. G. Siromoney and M. Lockwood, The invention of the Brahmi script, MCC Mag. 46, 31-33 (1977).
2. G. Siromoney, Entropy of Tamil prose, Inf. Control 6, 297-300 (1963).
3. V. A. Kovalevsky, Character Readers and Pattern Recognition. Spartan, New York (1968).
4. R. Narasimhan and V. S. N. Reddy, A syntax-aided recognition scheme for handprinted English letters, Pattern Recognition 3, 345-361 (1971).
5. S. N. S. Rajasekaran, Computer generation and recognition of printed Telugu characters, Ph.D. Thesis, Indian Institute of Science, Bangalore (1976).
6. I. K. Sethi and B. Chatterjee, Syntactic methods in character recognition, ISI Symposium on Digital Techniques and Pattern Recognition, Calcutta (1977).
7. A. Sore and A. K. Nath, On some methods of sequential pattern recognition, ISI Symposium on Digital Techniques and Pattern Recognition, Calcutta (1977).
8. A. K. Dutta, An experimental procedure for handwritten character recognition, IEEE Trans. Comput. 23, 536-545 (1974).

Table 2. Symbolic row and column runs for Tamil characters of Fig. 1

    Name of      Symbolic row run     Symbolic column run
    character
    va           S2M3S4S2             S1S3M2L1
    la           S2L3S1S2S1           M2S1S3S2S1
    la           S2M3S4S3             S1S3S2L1
    ra           S2S1L3M1             S1L2S1
    na           S1S3S4S5S3           S1S2S3S2S1S2S1S2L1
    pi           S1M2S4L3S2           M1S2S3S2M1
    mi           S1M2S3L4S2           L1S2S3S2S1
    yi           S1M2S5M4S3           L1S2S3S2M1
    ri           S1M2L3               M1S2S1S2M1
    li           S1M2S3S4S5S3         S1S3S2M1S2S3S2S1
    nā           S2S4S5M6S4M2S1       S1S2S4S3S2S3M2S1
    e medial     S1M2S3S4S2           S1S2M3S2S1
    ā medial     S1S3S2S1S2S3S1       S1M2S4S2
    ai medial    S1S3S4S5S3           S1M2S1S2S1M2S1
    kī           S1S3S2S1S2S1S3S2     S1S2S3S2S5S3S2
    yī           S1S2S4M3S2           L1S2S3S1
    rī           S1S2M1L2             M1S2S3S1
    vī           S1S2M3S4S2           S1S3S2S1S2S1S2M3S1
    lī           S1S3M2L3M1S2S1       M2S3S2S4S5S2
    lī           S1S3S2M3S4S3         S1S3S2M1S2S3S1
    nū           S2S6S8M9S6M2S1       S2S3S5M4S3S4S3S4S3S4S3S2S1S2L1
    thū          S2S4S3S5S4S3S2S1     S1S2S3M4S5M3S2L1
    nū           S2M4M5S3S2S1         S1L3S2L1
    pū           L2S1S3S2S1           S2M3L2S1
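The symbolic run encoding used in Table 2 can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' program: the function name `symbolic` is hypothetical, run strings are passed as strings of single digits, solitary numerals are assumed to be omitted as in the condensed run method (the text lists small runs starting from length 2), and H = 19 is taken from the worked example in the symbolic run method section and assumed to apply to the run strings of Fig. 2 as well.

```python
from itertools import groupby


def symbolic(runs, h):
    """Encode a run string as a symbolic run string.

    Each block of a repeated numeral I becomes SI, MI or LI:
      S (small)   if length <= [H/4]
      M (medium)  if [H/4] < length <= [H/2]
      L (long)    if length >= [H/2] + 1
    """
    small = h // 4   # [H/4], the integral part of H/4
    medium = h // 2  # [H/2]
    symbols = []
    for numeral, block in groupby(runs):
        length = len(list(block))
        if length == 1:
            continue  # assumption: solitary occurrences are dropped
        if length <= small:
            size = "S"
        elif length <= medium:
            size = "M"
        else:
            size = "L"
        symbols.append(f"{size}{numeral}")
    return "".join(symbols)


H = 19  # height used in the worked example: [H/4] = 4, [H/2] = 9

# Example from the text: 33311 becomes S3S1.
print(symbolic("33311", H))  # S3S1

# Row and column run strings of Fig. 2, written by block length.
row_run = "1" * 4 + "2" * 7 + "3" * 6 + "1" * 11
column_run = "1" * 4 + "2" * 11 + "3" * 4 + "2" * 2 + "1" * 2
print(symbolic(row_run, H))     # S1M2M3L1
print(symbolic(column_run, H))  # S1L2S3S2S1
```

Stripping the letters S, M and L from a symbolic run string gives back the condensed run string, so the symbolic strings carry strictly more information; this is what separates mi from the medial e symbol in Table 2 although their condensed strings agree.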

About the Author - GIFT SIROMONEY received the B.A.(Hons) degree in mathematics in 1953 and the M.Sc. degree in 1959 from the Madras Christian College, which is affiliated to the University of Madras. He received the Ph.D. degree in 1964 from Madras University for his work on information theory carried out at the Madras Christian College. He became Lecturer in Mathematics in 1953 at the American College, Madurai. He joined the staff of the Madras Christian College in 1954 as Lecturer in Mathematics. He was an Ecumenical Fellow at the Union Theological Seminary, New York during 1958-59. He was the recipient of the Homi Bhabha Fellowship during 1974-75 and was a consultant at the Computer Science Center of the University of Maryland at College Park. He is presently the Head of the Department of Statistics at the Madras Christian College. Dr Siromoney has coauthored "Mahabalipuram Studies" and "Mathematics for the Social Sciences". He has published many articles on information theory, formal languages, statistics, natural history, art and archaeology. Dr Siromoney is a Fellow of the Royal Statistical Society and a member of the Epigraphical Society of India and the Archaeological Society of South India.

About the Author - RAMAKRISHNAN CHANDRASEKARAN was born in New Delhi, India on 7 March 1954. He joined the Madras Christian College in 1970 and received the B.Sc. degree in Statistics in 1974 and the M.Sc. degree in 1976. He became Assistant Professor in Statistics at the Madras Christian College in 1976. He has been working in the area of pattern recognition. He is a member of the Epigraphical Society of India.

About the Author - MUTHUSWAMY CHANDRASEKARAN was born in Madras, India on 17 June 1952. He attended the Madras Christian College from 1967 to 1973 and received the B.Sc. and the M.Sc. degrees in Statistics in 1971 and 1973 respectively. He joined the staff of the Madras Christian College in 1973 as Assistant Professor in Statistics. He has been working on pattern recognition problems. He is a member of the Epigraphical Society of India.