Signal Processing, Vol. 14, No. 2, March 1988, pp. 177-184, North-Holland
AUTOMATIC RECOGNITION OF ISOLATED ARABIC CHARACTERS
Talaat S. EL-SHEIKH and Ramez M. GUINDI
Department of Communications and Electronics, Faculty of Engineering, Cairo University, Giza, Egypt
Received 5 January 1987
Revised 1 May 1987
Abstract. A system has been developed for the recognition of isolated typewritten Arabic characters. The outer contour of each character is traced using a contour-following algorithm that leads to 8-connected contours. A set of Fourier descriptors is obtained from the coordinate sequences of the contour points. This set of features has proven to be sufficient for the classification of isolated Arabic characters. Three different classification techniques have been developed using this set of features. The first technique is a minimum distance multicategory classifier which utilizes twelve dimensions. The second technique is a pairwise classifier that uses the best discriminating feature for each pair. The third technique is a hierarchical classifier which has the smallest character recognition time. The pairwise classifier, however, performs better than the other two classifiers as it shows the highest recognition rate.

Keywords. Arabic character, classification, contour, Fourier descriptor, hierarchical, isolated, multiclass, pairwise, recognition, topological, typewritten.
1. Introduction

The principal motivation for the development of automatic character recognizers is the need to cope with an enormous flood of paper generated by an expanding technological world. These recognizers can be used efficiently to process bank checks, government records and pieces of mail and, ideally, to replicate the functions of humans. In this paper we consider the problem of recognizing isolated typewritten characters for the Arabic language. The importance of recognizing
0165-1684/88/$3.50 © 1988, Elsevier Science Publishers B.V. (North-Holland)
T.S. El-Sheikh, R.M. Guindi / Automatic recognition of isolated Arabic characters
isolated Arabic characters is twofold. In many applications, such as automatic mail sorting and the recognition of mathematical equations, Arabic characters appear in an isolated form. In text-processing applications, on the other hand, Arabic characters may be cursive, and isolated characters appear only after the segmentation phase has been completed. In both cases, it is essential to design independent systems for the recognition of isolated characters with no regard to the segmentation process. Thus, it is assumed in this paper that the characters are either isolated or result from a perfect segmentation stage.

Each character can generally be represented by a closed-curve contour of line segments. Thus, tracing the boundary of the character can yield useful features for discriminating between different characters. Such features may be chosen so that they are invariant with respect to translation, rotation, shift and size of similar shapes. In this paper we utilize a set of Fourier descriptors [2, 5, 6, 7, 9, 10, 12] that represents the shape of the outer contour of each character.

Given the features (Fourier descriptors) of each character, three different classifiers have been designed in order to discriminate between the different Arabic characters. The first is a minimum distance multicategory classifier which employs twelve features. The second is a pairwise classifier where, for each pair, only one feature is used. Finally, the third is a hierarchical classifier. The three classifiers have been tested and they all perform exceptionally well. The pairwise classifier, however, performs better than the other two classifiers since it utilizes the best feature for each pair of characters.

The remainder of the paper is organized as follows. Section 2 describes the Arabic alphabet and explains the different parts of the recognition system. The next two sections present the contour-following algorithm and the Fourier descriptors used in the Arabic character recognition system. In Section 5 we design three different classification techniques and a topological classifier. Test results of some experiments performed using the recognition system and conclusions are given in Sections 6 and 7, respectively.
2. Properties of Arabic characters

The Arabic alphabet consists of 29 characters. An Arabic character may appear in an isolated form or as part of a cursive word. The first situation occurs in applications such as automatic mail sorting, where characters include letters and digits, or the recognition of mathematical equations, where characters include letters, digits and mathematical symbols. In this case, each character has a single shape. In other applications, where text is cursive, the shape of each character depends on its position within a word (beginning, middle, or end). A character at the beginning or in the middle of a word may have a different shape from the one it has at the end of a word or in isolation. It is therefore convenient to divide the character shapes into two main groups. The first group contains 46 shapes (shown in Fig. 1), which appear at the end of a word or for isolated characters. The other group (shown in Fig. 2) contains all the other shapes (28), i.e., those that may appear at the beginning or in the middle of a word. Only two characters have the same shape regardless of their position in a word and therefore appear in both groups.
Fig. 1. Character shapes appearing at the end of a word or in an isolated form.

Fig. 2. Character shapes in the beginning or in the middle of a word.
As shown in Figs. 1 and 2, some characters contain more than one connected component: the main character body and one of the five different stress marks shown in Fig. 3. A stress mark may be over or under the main body of the character. It is a characteristic of the Arabic language that the main body of some characters is the same. Therefore, ignoring the stress marks, the number of different shapes in the two groups reduces to 26 and 13, respectively. For example, three different characters may have the same main body but different stress marks. Such characters are included in the same main class. Consequently, they can be recognized by classifying both their main body shapes and their stress marks.
Fig. 3. Different stress mark shapes, labelled (1) to (5).
Fig. 4 shows the block diagram of the proposed recognition system used for isolated Arabic characters. In the first two stages, the outer contour of the character main body is traced and the Fourier descriptors of the resulting contour are calculated, thus obtaining a feature vector representing the character with no regard to the stress marks. In the third stage, we assign the unknown feature vector to its closest class using a suitable classification rule. In the last stage, we use topological information to classify any stress mark associated with the main body and, consequently, the character is completely recognized.
3. Contour following

After a character has been isolated within the binary image, the outer contour of the character is obtained using a contour-following algorithm [4]. The direction of travel out of each pixel is given relative to the direction of entry into that pixel by the following left-most-looking (LML) rule. Starting from any point on the outer contour of the main connected component of the character, look at the left-most pixel (relative to the direction of entry). If the level of this pixel is high, move to it; if not, examine the next left-most pixel and apply the same test, and so on, until the next point on the contour is found. This step is repeated for each new pixel until the contour is closed.

The algorithm yields 8-connected contours, which are smooth because travel from one contour pixel to the next can occur in any of eight directions. A 4-connected contour, by contrast, maintains a uniform step between consecutive points, but it contains a larger number of boundary points, which increases the computational complexity.

The output of the contour-following algorithm is the sequence of X-Y coordinates of the outer contour of the character. Since the contours are closed curves, each of the Cartesian coordinate sequences x(m) and y(m), m = 0, 1, ..., N-1, of the boundary elements represents one period of a periodic function with period N:

\[ x(m + nN) = x(m), \qquad y(m + nN) = y(m), \qquad \text{for } 0 \le m \le N-1. \]
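The LML rule can be illustrated with a short Python sketch. It is one plausible reading of the rule rather than the authors' implementation: the raster-scan choice of starting pixel, the virtual eastward entry direction, the stopping test, and the names DIRS and trace_outer_contour are all assumptions introduced here. Around each pixel the search starts at a neighbour already known to be background and sweeps clockwise on screen, so that the background stays on the left of the direction of travel.

```python
# Eight neighbour offsets (dx, dy), listed clockwise as seen on screen
# (x grows to the right, y grows downward): E, SE, S, SW, W, NW, N, NE.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def trace_outer_contour(img):
    """Return the 8-connected outer contour of the component that contains
    the first high pixel in raster order, as a list of (x, y) points.
    img is a list of rows holding 0 (low) or 1 (high)."""
    h, w = len(img), len(img[0])

    def high(x, y):
        # Pixels outside the image are treated as background.
        return 0 <= x < w and 0 <= y < h and img[y][x] == 1

    start = next(((x, y) for y in range(h) for x in range(w) if img[y][x] == 1), None)
    if start is None:
        return []

    def step(p, d):
        # Begin at the neighbour last seen as background (assumption), then
        # sweep clockwise; d is the direction index used to enter p.
        first = (d + 6) % 8 if d % 2 == 0 else (d + 5) % 8
        for k in range(8):
            nd = (first + k) % 8
            q = (p[0] + DIRS[nd][0], p[1] + DIRS[nd][1])
            if high(*q):
                return q, nd
        return None, d  # isolated pixel: no high neighbour

    contour, p, d, first_move = [], start, 0, None  # d = 0: pretend entry heading east
    for _ in range(8 * w * h + 1):  # safety bound on the number of moves
        q, nd = step(p, d)
        if q is None:
            return [p]
        if first_move is None:
            first_move = (p, q)
        elif (p, q) == first_move:
            break  # the first move is about to repeat: the contour is closed
        contour.append(p)
        p, d = q, nd
    return contour
```

For a filled 3 x 3 square the sketch returns the eight border pixels in clockwise order, and consecutive points (including the last and the first) are always 8-adjacent. The two coordinate lists obtained from the result play the role of x(m) and y(m).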
Fig. 5 shows the outer contour of a sample character and the corresponding contour coordinate sequences. Owing to the limited resolution of the scanner, similar characters may yield completely different output contours and, consequently, different coordinate sequences. An example of such a case is given in Fig. 6, which shows the output contours obtained for two patterns of the same class. In the first pattern, the contour follower accidentally entered the interior of the figure. This phenomenon occurs frequently for other characters as well. We have overcome this problem by assigning more than one class to each of these characters.

Fig. 4. The proposed recognition system (block diagram: input character, contour follower, feature extractor, classifier, topological classifier, decision).

Fig. 5. (a) Outer contour of a sample character class. (b) The X projection of the contour. (c) The Y projection of the contour.

Fig. 6. Effect of the scanner resolution on the character outer contour; panels (a) and (b).

4. Fourier descriptors

Since each character can generally be represented by a closed-curve contour [7], tracing the boundary of the character can yield useful features for distinguishing between different characters. We therefore evaluate the Fourier series coefficients of the two coordinate sequences x(m) and y(m). The Fourier series expansions of x(m) and y(m) can be written as [8]

\[ x(m) = \sum_{n=0}^{N-1} a(n) \exp\{j 2\pi n m / N\}, \tag{1} \]

\[ y(m) = \sum_{n=0}^{N-1} b(n) \exp\{j 2\pi n m / N\}, \tag{2} \]

where N is the number of points on the contour. The sequences a(n) and b(n) are the complex Fourier coefficients of x(m) and y(m), respectively, and can be written as

\[ a(n) = \frac{1}{N} \sum_{m=0}^{N-1} x(m) \exp\{-j 2\pi n m / N\}, \tag{3} \]

\[ b(n) = \frac{1}{N} \sum_{m=0}^{N-1} y(m) \exp\{-j 2\pi n m / N\}. \tag{4} \]

We can also obtain the Fourier series coefficients of the complex sequence

\[ z(m) = x(m) + j y(m), \tag{5} \]

which can be written as

\[ c(n) = a(n) + j b(n). \tag{6} \]

Since x(m) and y(m) are real, it follows from equations (3) and (4) that a(n) and b(n) satisfy the following equalities:

\[ a(N - n) = a^*(n), \tag{7} \]

\[ b(N - n) = b^*(n). \tag{8} \]

Therefore, when only the magnitudes of these coefficients are used, only N/2 of them need to be calculated. Regarding c(n), it is readily shown that

\[ c(N - n) = a(N - n) + j b(N - n) = a^*(n) + j b^*(n), \tag{9} \]

which is not equal to c*(n). By computing the exponential terms

\[ \exp\{-j 2\pi n m / N\}, \qquad m = 0, 1, \ldots, N-1, \]

for a given contour and a single value of n, we can simultaneously calculate four elements of the feature vector, namely the descriptors |a(n)|, |b(n)|, |c(n)| and |c(N-n)|, using equations (3), (4), (6) and (9). This significantly reduces the time required to calculate the feature vector. The procedure can be repeated for different values of n. Note that |c(N-n)| cannot be determined from the other three descriptors. Fig. 7 shows the three Fourier spectra (|a(n)|, |b(n)|, |c(n)|) for two different characters.

Fig. 7. Fourier spectra |a(n)|, |b(n)| and |c(n)| plotted against n for two different characters: panels (a), (c) and (e) show the descriptors of the first character; panels (b), (d) and (f) show the descriptors of the second character.

Below we prove that the above features are invariant with respect to the translation of a character. If we assume that a character has been shifted over a horizontal distance x_0, then the new coefficients a'(n) can be calculated as follows:

\[
\begin{aligned}
a'(n) &= \frac{1}{N} \sum_{m=0}^{N-1} \bigl(x(m) + x_0\bigr) \exp\{-j 2\pi n m / N\} \\
      &= \frac{1}{N} \sum_{m=0}^{N-1} x(m) \exp\{-j 2\pi n m / N\} + \frac{x_0}{N} \sum_{m=0}^{N-1} \exp\{-j 2\pi n m / N\} \\
      &= a(n) \quad \text{for } n \neq 0.
\end{aligned}
\tag{10}
\]

Since the second term is equal to zero except for n = 0, and since a(0) is not used in our case, the descriptor |a(n)| is invariant with respect to translation. In other words, no normalization with respect to position is required. Similarly, the other descriptors can be shown to be invariant with respect to character translation. Furthermore, these descriptors can easily be shown to be invariant with respect to the contour starting point using a similar proof [5].

It has been found that calculating the above coefficients for n = 1, 2, 3 is sufficient to correctly classify the character main contour; the resulting twelve-dimensional feature vector leads to almost 100% probability of correct classification. Table 1 gives the values of the Fourier descriptors for some characters.

Table 1. Fourier descriptors of some characters

Class (in printed order)   Dimension
                           1     2     3     4     5     6     7     8     9     10    11    12
1                          0.63  4.05  4.68  3.42  0.02  0.00  0.02  0.02  0.20  0.45  0.64  0.27
2                          1.62  4.60  6.21  3.01  2.88  0.68  3.55  2.22  1.21  0.63  0.74  1.78
3                          4.86  2.19  6.52  3.79  1.25  1.44  2.66  0.42  0.42  1.42  1.18  1.73
4                          4.90  2.43  6.72  3.82  1.18  2.09  3.27  0.94  0.71  1.17  1.15  1.56
5                          4.59  3.49  6.91  4.33  1.66  2.32  2.87  2.84  0.13  0.52  0.64  0.40
6                          3.60  1.48  5.04  2.21  0.63  0.46  1.08  0.22  0.14  0.58  0.62  0.58
7                          2.06  2.01  4.07  0.25  2.31  0.51  2.81  1.83  0.57  0.13  0.58  0.58
8                          1.89  2.59  4.47  0.79  1.73  0.45  2.00  1.54  0.63  0.28  0.62  0.75

5. Classification stage
Character classification is performed in two different stages. In the first stage, the character main body is classified using the described feature vector. Three different classification techniques have been developed: a multicategory classifier, a pairwise classifier, and a hierarchical classifier. In the second stage, simple topological features extracted from the geometry of the stress marks are used by the topological classifier to completely recognize the characters.

Each classifier has two phases of operation. In the first phase (the training phase), the classifier is redesigned according to new characters having a different font. In the second phase (the classification phase), an unknown character is assigned to a certain class according to a selected classification rule.
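Both phases operate on the twelve-dimensional feature vector of Section 4. As a point of reference, that vector can be sketched in Python directly from equations (3), (4), (6) and (9). This is a minimal illustration, not the authors' code: the function name fourier_descriptors, the ordering of the four magnitudes within the vector, and the use of plain Python lists are assumptions made here.

```python
import cmath


def fourier_descriptors(xs, ys, orders=(1, 2, 3)):
    """Return [|a(n)|, |b(n)|, |c(n)|, |c(N-n)|] for each n in orders,
    given the contour coordinate sequences xs and ys of equal length N."""
    N = len(xs)
    feats = []
    for n in orders:
        # Exponential terms exp{-j 2 pi n m / N}, computed once per n and
        # shared by all four descriptors.
        w = [cmath.exp(-2j * cmath.pi * n * m / N) for m in range(N)]
        a = sum(x * wm for x, wm in zip(xs, w)) / N   # equation (3)
        b = sum(y * wm for y, wm in zip(ys, w)) / N   # equation (4)
        c = a + 1j * b                                # equation (6)
        c_mirror = a.conjugate() + 1j * b.conjugate() # equation (9): c(N-n)
        feats += [abs(a), abs(b), abs(c), abs(c_mirror)]
    return feats
```

For any closed contour, adding a constant to every x(m) or y(m), or starting the contour at a different point, should leave the twelve values unchanged up to rounding, in line with the invariance argument of Section 4.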
5.1. Multicategory classifier [3]

The training (design) samples are used to calculate the average feature vector for each class using twelve dimensions. To classify an unknown shape, its feature vector is calculated and the Euclidean distance between this vector and each of the average vectors is evaluated. The unknown pattern is assigned to the class corresponding to the closest average vector in the feature space. Thus, this classifier is a minimum distance classifier which ignores the effect of covariance matrices in the Fourier space. When the Mahalanobis distance was used instead, the performance of the classifier degraded because the number of training samples was not sufficient for the estimation of the covariance matrices.

5.2. Pairwise classifier

In the classifier training phase, the best discriminating feature is selected for each pair of classes. The rule used to select this feature is to maximize the one-dimensional nearest neighbour distance between the pair of classes. Furthermore, a threshold is calculated in the space of the best feature to divide the total space into two regions, each corresponding to one of the two classes. For some pairs of classes it was not possible to find a single feature that makes the samples of the two classes linearly separable. For any of these pairs, the implemented algorithm splits one of the two classes into subclasses. This training method results in a zero probability of error if the system is tested on its design samples.

Classification of an unknown pattern is achieved through several stages where, in each stage, one class is rejected. Thus, a fixed number of two-class decisions is required to classify each character, where for each pair only the single best feature is calculated if it has not already been evaluated for previous pairs. This classifier performs better than the other two classifiers but needs more time than the hierarchical classifier to classify unknown characters.

5.3. Hierarchical classifier [1]

This classifier involves partitioning the patterns in each stage into a number of clusters. The discrimination in each stage is obtained through the use of a multiclass nearest neighbour classifier, where the characters of each cluster are represented by their average patterns calculated from the training samples. Partitioning within each cluster is continued until a decision is reached. The maximum number of stages required to reach a decision is three where, in each stage, one, two, or four features are used. For some characters, however, a decision is obtained immediately after the first stage. This technique is faster than the pairwise technique but slightly less reliable.

5.4. Topological classifier [11]

In this classification stage, the algorithm searches for the stress mark, which may be over or under the main character body. Knowing both the main group of the character body and the type and position of the stress mark (if any), we can completely identify the character. The topological features used to classify each of the five types of stress marks shown in Fig. 3 are the maximum width, the maximum height, and the number of black pixels of the stress mark. Different topological features were used to discriminate between different pairs of stress marks. For example, class 1 in Fig. 3 can easily be detected by counting the number of black pixels, while classes 2 and 3 can easily be distinguished using the stress mark height.

Although we sometimes used the classification of the character body to classify the accompanying stress mark, this topological classification stage may lead to a reject option when the stress mark classification contradicts the main body classification. Reject rates for the different experiments are given in Table 2.

Table 2. Error and reject rates for the different systems

Recognition system    Error rate    Reject rate
Multiclass            0.33%         0.16%
Pairwise              0.16%         0.05%
Hierarchical          0.27%         0.16%
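The decision rules of Sections 5.1 and 5.2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, class labels are assumed to be sortable strings, the threshold is placed midway between the two closest training values, and the splitting of non-separable classes into subclasses is omitted. In this sketch the pairwise classifier eliminates one class per two-class decision, so K classes need K - 1 decisions.

```python
import math
from itertools import combinations


def class_means(samples):
    """samples maps a class label to a list of feature vectors;
    returns the average feature vector of each class."""
    return {c: [sum(col) / len(vs) for col in zip(*vs)] for c, vs in samples.items()}


def nearest_mean(means, v):
    """Section 5.1 rule: pick the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda c: math.dist(means[c], v))


def train_pairwise(samples):
    """For every pair of classes keep (feature index, threshold, class below,
    class above) for the feature with the largest one-dimensional gap."""
    rules = {}
    for ci, cj in combinations(sorted(samples), 2):
        best = None
        for k in range(len(samples[ci][0])):
            lo_i, hi_i = min(v[k] for v in samples[ci]), max(v[k] for v in samples[ci])
            lo_j, hi_j = min(v[k] for v in samples[cj]), max(v[k] for v in samples[cj])
            # Gap between the two classes along feature k (negative if they overlap).
            gap = max(lo_i - hi_j, lo_j - hi_i)
            if best is None or gap > best[0]:
                if lo_i - hi_j >= lo_j - hi_i:  # class ci lies above class cj
                    best = (gap, k, (lo_i + hi_j) / 2, cj, ci)
                else:                           # class cj lies above class ci
                    best = (gap, k, (lo_j + hi_i) / 2, ci, cj)
        rules[(ci, cj)] = best[1:]
    return rules


def classify_pairwise(rules, classes, v):
    """Reject one class per two-class decision until a single class remains."""
    winner = classes[0]
    for c in classes[1:]:
        k, thr, below, above = rules[tuple(sorted((winner, c)))]
        winner = below if v[k] < thr else above
    return winner
```

A typical use would train on labelled feature vectors with `rules = train_pairwise(samples)` and then call `classify_pairwise(rules, sorted(samples), v)` for an unknown vector v. Only the features named in the visited rules need to be computed for v, which is the source of the time saving discussed in Section 6.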
6. Test results
To the best of our knowledge, Fourier descriptors have not been used before to recognize Arabic characters. In order to investigate the effectiveness of Fourier descriptors in recognizing Arabic characters, a series of tests was performed using the three classification techniques and three different sample sets where each set contained ten patterns for each character. Data was obtained using an IBM scanner with a maximum resolution of 200 pixels/inch and the tests were performed using an IBM personal computer. The average recognition rate (speed) is approximately five characters/sec. This rate can drastically be increased if the Fourier descriptors are calculated using special hardware and if parallel processing is used.
Test results are shown in Tables 2 and 3. Table 2 shows the error and the reject rates for the three recognition systems. Table 3 lists the true and decided classes for the misclassified and rejected characters. The pairwise classifier has the highest recognition rate among the three recognition systems. This is due to the use of the best feature in the classification process for each pair. Moreover, the time required to classify an unknown character with this classifier is smaller than that required by the multicategory classifier because, unlike the multicategory technique, it rarely needs to calculate all twelve dimensions for any character. In Table 3, it can be noticed that characters having similar outer contours have rarely been misclassified. This indicates that Arabic characters can perfectly be classified by use of the Fourier coefficients of their outer contours with the help of some topological information.
Table 3. Misclassified or rejected characters for the classification systems. Columns: true character; recognized character (or "rejected"); percentage of occurrence for the multiclass, pairwise and hierarchical classifiers. The listed percentages of occurrence range from 3% to 10%.

7. Conclusions

This paper considers the recognition of isolated typewritten Arabic characters. This recognition process is essential in some applications, or after completing the segmentation phase in text-processing applications. The set of Arabic characters has been divided into two main groups: the characters that are at the tail of a word or isolated, and those that are at the head or in the middle of a word.

A contour-following algorithm has been used to trace the outer contour. A set of Fourier descriptors has been derived using the coordinate sequences of the contour points. These features have been used efficiently to discriminate between the different characters. The training samples have been used to design three different classification techniques: the multicategory classifier, the pairwise classifier, and the hierarchical (tree) classifier. These classifiers have been tested using an independent set of characters and very good results have been obtained. We conclude that the pairwise technique, where each pair uses the best discriminating feature, achieves a higher recognition rate than the hierarchical technique but is slightly slower. Topological features such as the height, width and number of pixels of the stress marks have also been used to enhance the recognition process. We also conclude that the small set of features used was sufficient to classify the different characters.
References

[1] R. Chin, P. Beaudet and P. Argentiero, "An automated approach to the design of decision tree classifier", Proc. 5th Internat. Conf. on Pattern Recognition, 1980, pp. 660-665.
[2] F.M. Dekking and P.J.V. Otterloo, "Fourier coding and reconstruction of complicated contours", IEEE Trans. Syst., Man, Cybernet., Vol. SMC-16, No. 3, May/June 1986, pp. 395-404.
[3] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973, Chap. 2, pp. 20-31.
[4] R.C. Gonzalez and P. Wintz, Digital Image Processing, Addison-Wesley, 1977, Chap. 6, pp. 253-265.
[5] G.H. Granlund, "Fourier processing for hand print character recognition", IEEE Trans. Comput., Vol. C-21, February 1972, pp. 195-201.
[6] S. Impedovo, B. Marangelli and A.M. Fanelli, "A Fourier descriptor set for recognizing nonstylized numerals", IEEE Trans. Syst., Man, Cybernet., Vol. SMC-8, No. 8, August 1978, pp. 640-645.
[7] K.P. Lam, "Contour map registration using Fourier descriptors of gradient codes", IEEE Trans. Pattern Anal. & Machine Intelligence, Vol. PAMI-7, No. 3, May 1985, pp. 332-338.
[8] A.V. Oppenheim and R. Schafer, Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1975, Chap. 3, pp. 87-110.
[9] E. Persoon and K.S. Fu, "Shape discrimination using Fourier descriptors", IEEE Trans. Syst., Man, Cybernet., Vol. SMC-7, No. 3, March 1977, pp. 170-179.
[10] M. Shridhar and A. Badreldin, "High accuracy character recognition algorithm using Fourier and topological descriptors", Pattern Recognition, Vol. 17, No. 5, 1984, pp. 515-524.
[11] J.T. Tou and R.C. Gonzalez, "Recognition of handwritten characters by topological feature extraction and multilevel categorization", IEEE Trans. Comput., Vol. C-21, July 1972, pp. 776-785.
[12] C.T. Zahn and R.Z. Roskies, "Fourier descriptors for plane closed curves", IEEE Trans. Comput., Vol. C-21, No. 3, March 1972, pp. 269-281.