Pattern Recognition, Vol. 28, No. 3, pp. 293-301, 1995
Elsevier Science Ltd. Copyright © 1995 Pattern Recognition Society. Printed in Great Britain. All rights reserved.
0031-3203/95 $9.50+.00
Pergamon
0031-3203(94)00099-9
MASSIVELY-PARALLEL HANDWRITTEN CHARACTER RECOGNITION BASED ON THE DISTANCE TRANSFORM

ZS. M. KOVÁCS-V. and R. GUERRIERI

Dipartimento di Elettronica, Informatica e Sistemistica, University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

(Received 10 December 1992; in revised form 17 June 1994; received for publication 17 August 1994)
Abstract--A new statistical classifier for handwritten character recognition is presented. After a standard preprocessing phase for image binarization and normalization, a distance transform is applied to the normalized image, converting a black-and-white (B/W) picture into a gray-scale one. The latter is used as the feature space for a k-Nearest-Neighbor classifier, based on a dissimilarity measure which generalizes the use of the distance transform itself. The classifier has been implemented on a massively-parallel processor, the Connection Machine CM-2. Classification results for digits extracted from the U.S. Post Office ZIP code database and for the upper-case letters of the NIST Test Data 1 are provided. The system has an accuracy of 96.73% on the digits and 94.51% on the upper-case letters when no rejection is allowed, and an accuracy of 98.96% on the digits and 98.72% on the upper-case letters at a 1% error rate.

Keywords: Pattern recognition; OCR; Template matching; Distance transform; Nearest neighbors; Classifiers; Digits; Upper case letters; Massively-parallel processing; Connection Machine
1. INTRODUCTION

The recognition of handwritten characters is an essential capability in the area of Office Automation. Statistical algorithms(1) applied to this task can have a very significant impact, since they offer the possibility of generalizing the information contained in a training set of examples without requiring an extensive programming effort tailored to the specific handwriting style. Classifiers based on the k-Nearest Neighbors (k-NN) approach have recently received increasing attention thanks to their simple implementation and the absence of training. In this technique, the dissimilarity measure used to compute the distance between the stored patterns and the test element is one of the most crucial parts of the method. Different approaches have been used to classify handwritten characters: a fully structural approach has been pursued by Lam and Suen,(2) where the salient strokes of the characters are extracted and further analyzed by a rule-based approach. A combination of rule-based processing and learning techniques has been described by Gader et al.,(3) where the beneficial effects of the cooperation of different techniques are quantified. A neural network technique based on a perceptron has been proposed by Le Cun et al.,(4) obtaining results comparable with those achieved by techniques requiring a more extensive coding of the classification rules. Takahashi(5) proposes a method based on a neural network and geometrical features combined with zonal properties. Finally, Blackwell et al.(6) consider a novel biologically-motivated approach to the OCR problem, based on a Dystal neural network.
In this paper we investigate a learning methodology applied to handwritten OCR. Our contribution is a new dissimilarity measure for k-NN classifiers, which is suitable for a massively parallel implementation of OCR on a Connection Machine CM-2. The proposed dissimilarity measure is based on the distance transform(7,8) and is not restricted to OCR but could be applied to any problem requiring the evaluation of distances between two-dimensional curves. The problem of computing the similarity of pairs of lines has also been addressed by other authors [e.g. Tubbs(9)], but those simple metrics measure the overlapping of the curves rather than their proximity, thus discarding some important geometric information which is maintained by the method proposed in this work. Additionally, the measure of overlapping is not a continuous function of the curve distances, while the proposed measure is. The paper is organized as follows: in Section 2 the recognition methodology is described; Section 3 details the new dissimilarity measure and illustrates its geometric meaning; in Section 4 the algorithmic properties of the classifier implementation on a massively-parallel computer, the Connection Machine CM-2, are introduced; Section 5 shows the results obtained from the classification of handwritten digits extracted from a ZIP-code database(10) and upper case letters of the NIST Test Data 1.(11) Finally, in Section 6 some conclusions are drawn.

2. THE RECOGNITION METHODOLOGY

The recognition process starts once the image acquisition has been completed and consists of the following
steps: image normalization, feature extraction and classification, which are described in the following sections.

2.1. Image normalization

Image normalization is useful to reduce the variability of human writing. This step moves the points belonging to a class closer to each other in the feature space, thus helping the subsequent classification procedure. Of course, a loss of information is associated with each normalization step. Therefore, each step has to be chosen very carefully to avoid undesired effects and misleading deformations of the original image. If the original image is gray scale, normalization is carried out on an adaptively thresholded version of it. The first step consists of noise removal by area, Gaussian or median filtering, according to the specific noise of the data. Since, experimentally, the skew of handwriting introduces a strong variance in the feature space, the deskew(12) process is one of the most important steps to be performed. If the writing tool has no limitations due to constraints on letter size, thinning(13) of the strokes provides a further helpful normalization phase. The final process is size normalization. There are basically two possibilities to force a size. The first is normalizing both the vertical and horizontal directions. By so doing, the symbol touches the four boundaries of the final box. The second method consists of normalizing one of the two directions only, maintaining the original aspect ratio of the image. In this way, the character touches only two out of four boundaries. The first method has the advantage of better class separation, because it normalizes both the vertical and horizontal directions, but seriously deforms one-dimensional symbols, such as '1' and 'I'. The second does not change any of the crucial properties of the images but produces a smaller reduction of class variance in the feature space. In this work the first method was employed, neglecting the above-mentioned inconvenience, because the possible symbol deformations do not interfere too heavily with the classification, thanks to the uniform deformation of these images. In fact, an almost completely black image is the result of the normalization performed on a symbol '1', which for this reason cannot be confused with other digits. The result of the normalization steps is a square black-and-white picture.
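As an illustration of the first size-normalization method, the following sketch rescales the bounding box of a binary character so that it fills a square output window. The function name and the 32 × 32 target size are our choices (matching the experiments reported in Section 5), and nearest-neighbor sampling is just one reasonable option, not necessarily the one used by the authors.

```python
import numpy as np

def normalize_size(img, out_size=32):
    """Rescale the bounding box of a B/W character (nonzero = ink) to fill
    an out_size x out_size square, normalizing both directions."""
    rows, cols = np.nonzero(img)
    box = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    h, w = box.shape
    # Map every output pixel back to its source pixel in the bounding box.
    r_idx = (np.arange(out_size) * h) // out_size
    c_idx = (np.arange(out_size) * w) // out_size
    return box[np.ix_(r_idx, c_idx)]
```

Applied to a thin '1', this indeed produces the almost completely black image mentioned above.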
Fig. 1. Distance transform vs standard gray scale: (a) original one-dimensional gray-scale image; (b) gray-scale curve of image (a); (c) distance transform of image (a).
2.2. Feature extraction

The novel feature extraction stage is based on a distance transform(7) of the normalized image. Let $\Omega$ be the two-dimensional region where pattern $\xi$ lies. The distance transform associates with each point $(x, y) \in \Omega$ the distance to the nearest point $(\hat{x}, \hat{y}) \in \xi$, according to a defined metric. Many different methods for the evaluation of the distance have been proposed. For example, in Borgefors(8) the Euclidean distance transform and its approximations are discussed, while in Paglieroni(14) the chessboard, L1, optimal chamfer and Euclidean distance transforms are described. We have found the L1 norm to be the best for the OCR problem. The advantage of the gray-scale image obtained by the distance transform over a normal gray-scale image is that the information is propagated throughout $\Omega$ instead of the neighboring area of $\xi$ only, as shown in Fig. 1. Image (a) shows the original gray-scale image of a one-dimensional case. The pattern is black, the background is white and the shaded region is the transition between the two. In (b), the related gray-scale curve is equal to zero where the pattern lies, equal to one in the background and reaches intermediate values in between. Note that only the area surrounding the pattern is influenced by it, while most of the background is unchanged. Since the distance transform has to be applied to B/W pictures, the original gray-scale image is thresholded, using the horizontal dashed line as the binarization threshold. In (c) the distance transform applied to the binary image is reported, associating a distance equal to zero with the region where the pattern is defined. The increasing values of the function DT(x) in the background region are the distances of the x coordinate from the nearest point of the pattern. Thanks to the function DT(x), every point of the image is correlated with the pattern. In Fig. 2 the final distance-transformed image of a '5' is shown using a gray scale and contour plot. In this figure, darker regions are closer to the curve forming the pattern. Considering the meaning of the distance transform and the curve in Fig. 2, it is worthwhile to note that this gray-scale conversion is not a simple image blurring, because there is no loss of information coupled with this process. The final gray-scale representation of the image is formed by $w \times h$ pixels of $g$ bits each, where $w$ and $h$ are the width and height of the picture, respectively, and $g$ depends on the norm employed in the distance transformation process. Using the L1 norm for the distance transform, $g$ is bounded by $\lceil \log_2 (w + h - 2) \rceil$. Using each pixel as a coordinate in a multidimensional space, this procedure maps a binary pattern into a point in an N-dimensional space, where $N = w \times h$.
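The L1 distance transform can be computed with the classic two-pass sequential algorithm of Rosenfeld and Pfaltz.(7) The sketch below is a minimal NumPy version; the function name and the use of a large finite constant in place of infinity are our choices.

```python
import numpy as np

def l1_distance_transform(img):
    """Two-pass L1 (city-block) distance transform of a B/W image.
    img: 2-D array, nonzero where the pattern lies."""
    h, w = img.shape
    big = h + w  # exceeds any possible L1 distance in the image
    dt = np.where(img != 0, 0, big)
    # Forward pass: propagate distances from the top-left corner.
    for i in range(h):
        for j in range(w):
            if i > 0:
                dt[i, j] = min(dt[i, j], dt[i - 1, j] + 1)
            if j > 0:
                dt[i, j] = min(dt[i, j], dt[i, j - 1] + 1)
    # Backward pass: propagate distances from the bottom-right corner.
    for i in range(h - 1, -1, -1):
        for j in range(w - 1, -1, -1):
            if i < h - 1:
                dt[i, j] = min(dt[i, j], dt[i + 1, j] + 1)
            if j < w - 1:
                dt[i, j] = min(dt[i, j], dt[i, j + 1] + 1)
    return dt
```

On a 32 × 32 image the largest possible value is w + h − 2 = 62, so each pixel fits in ⌈log₂ 62⌉ = 6 bits, consistent with the bound on g given above.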
Fig. 2. Contour plot of a distance-transformed '5'.

Thanks to the distributed information, it is possible to subsample the distance-transformed image, obtaining a lower-dimensional feature space. Such a possibility comes from the strong correlation between the values of neighboring pixels. In fact, the value of each pixel is within ±1 of its neighbors (assuming 1 as the distance between two neighboring pixels), because the distance transform associates to each pixel its distance from the pattern. A very simple subsampling operation can be performed by dividing the image into regions and extracting only one pixel from each. The number of such regions defines the final dimension of the input space of the classifier, reducing at the same time the correlation between adjacent pixels. The effect of this operation will be shown in Section 5.
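For the 32 × 32 images used later, this region-based subsampling reduces to picking the top-left pixel of every 4 × 4 block, as in the following sketch (ours, not the authors' code):

```python
def subsample(dt, block=4):
    """Keep the top-left pixel of each block x block region,
    e.g. 32 x 32 -> 8 x 8 (a factor-16 dimensionality reduction)."""
    return dt[::block, ::block]
```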
2.3. Classification

The classification of the points in the feature space is carried out using a nearest-neighbor technique. Nearest-neighbor(s) classifiers are among the oldest and best-established techniques for pattern classification, and several excellent descriptions of this technique are available in the literature.(1,15) For this reason, we shall simply state some definitions which are used throughout the paper. A k-NN classifier is defined in terms of a triplet $(D, k, \mathcal{T})$, where $D$ is a dissimilarity measure which associates to each pair of samples in a suitable N-dimensional space a (real or integer) non-negative number, $k$ is the number of nearest samples which are used to perform the classification and $\mathcal{T}$ is the database of $M$ training samples used by the classifier to actually perform the classification. Assuming $k$ even, a classification is thus correctly performed when the number $k_c$ of the $k$ elements closest to the new sample which vote for the correct category satisfies $k_c \ge k/2 + 1$. Obviously, when $k$ is odd, $k_c \ge (k + 1)/2$. If the majority of the neighbors is not able to define a single classification, the element is rejected(16) by the process.
Of course, other rules can be defined, obtaining different trade-offs in terms of rejection and error ratio. Focusing now on $D$, several properties are desirable, among them robustness against noise and imperfections of the pattern to be classified, as well as independence of the presence (or absence) of specific structures (strokes). From a geometric point of view, important semi-metric requisites are that $D(a, a) = 0$ and $D(a, b) = D(b, a)$, where $a$ and $b$ are a pair of elements to be classified. The first assumption is important when the sample to be classified has a twin in the database, allowing direct classification when $k = 1$. The second requirement ensures equal influence between a couple of samples, whether the first classifies the second or the opposite. Finally, from a computational point of view, the evaluation should be recast into a simple, uniform set of operations defined in a suitable N-dimensional space, so that efficient hardware can be designed to support the computation. Besides, $D(a, a) = 0$ states that the minimum dissimilarity occurs when $D = 0$, associating the idea of a distance computation with $D$.
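A minimal sketch of the k-NN decision rule with rejection described above follows; the dissimilarity function is passed in as a parameter, and the tie handling shown (reject unless a strict majority agrees) is one reading of the paper's majority rule.

```python
from collections import Counter

def knn_classify(dissim, train, labels, sample, k=3):
    """k-NN with reject option: return the majority label of the k
    nearest training samples, or None when no strict majority exists."""
    dists = sorted((dissim(t, sample), lbl) for t, lbl in zip(train, labels))
    votes = Counter(lbl for _, lbl in dists[:k])
    label, count = votes.most_common(1)[0]
    needed = k // 2 + 1  # strict majority for both even and odd k
    return label if count >= needed else None
```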
3. DISSIMILARITY MEASURE
The definition of a dissimilarity measure is not unique and embodies several aspects of the problem under consideration. Dealing with handwritten characters, the goal is to find a way to associate a distance with a couple of samples consisting of binarized images representing a curve in a bidimensional space. In this section the use of the previously defined distance-transformed image is described. The following dissimilarity measure is used in the classifier to obtain the relationship between points in the feature space, defined by the normalization and distance transform procedures of Section 2. Let $\Omega$ be the bidimensional region where the two curves lie, let $\xi$ and $\eta$ be the two curves, and let $DT_\xi$ and $DT_\eta$ be the distance transforms associated with each curve; we define the distance $\tilde{D}$ between curves $\xi$ and $\eta$ as
$$\tilde{D} = \int_{\eta} DT_{\xi}\, dl + \int_{\xi} DT_{\eta}\, dl \qquad (1)$$
where the line integrals are computed over the two curves $\eta$ and $\xi$, respectively. The reason for the use of two integrals instead of one is that the symmetry property of the metric must be preserved. As a special case, let us consider the distance of two curves when one is part of the other. In this case, if we computed the integral over the smaller domain only, then $\tilde{D}$ would be zero and would not therefore be able to express the difference between the two curves. A similar technique, based on one integral only, is used by Liu and Srinath(17) to find the similarity between known two-dimensional shapes and their distorted counterparts. Denoting by $DT_\xi|_\eta$ the restriction of function $DT_\xi$ to line $\eta$ and by $DT_\eta|_\xi$ the restriction of function $DT_\eta$ to line $\xi$, it can easily be proved that when $\xi = \eta$, $\tilde{D} = 0$,
because $DT_\xi|_\eta = 0$ and $DT_\eta|_\xi = 0$, while a direct consequence of equation (1) is that $\tilde{D}(\eta, \xi) = \tilde{D}(\xi, \eta)$. One of the main inconveniences of this method is the non-uniformity of the operations to be carried out, which requires different treatment for the pixels depending on whether they belong to the character or to the background. To generalize the definition of the distance, note that expression (1) of $\tilde{D}$ can be rewritten as:
$$\tilde{D} = \int_{\eta} (DT_{\xi} - DT_{\eta})\, dl + \int_{\xi} (DT_{\eta} - DT_{\xi})\, dl = \int_{\eta} |DT_{\xi} - DT_{\eta}|\, dl + \int_{\xi} |DT_{\eta} - DT_{\xi}|\, dl \qquad (2)$$
because $DT_\eta|_\eta = 0$ and $DT_\xi|_\xi = 0$ and the value of the distance transform is always non-negative. This expression, which takes into account only the line integrals on $\eta$ and $\xi$, can then be modified to cover the whole domain $\Omega$:
$$D^{*} = \int_{\Omega} |DT_{\xi} - DT_{\eta}|\, d\Omega + \int_{\Omega} |DT_{\eta} - DT_{\xi}|\, d\Omega \qquad (3)$$
Proving the semi-metric properties of distance $D^*$ in the space of two-dimensional curves is immediate. In fact, $D^*(\eta, \xi) = D^*(\xi, \eta)$ and $D^*(\xi, \xi) = 0$ for any pair of curves $\eta$, $\xi$. From a geometric point of view, $D^*$ is the volume between the surfaces representing $DT_\xi$ and $DT_\eta$. The distance between curves in equation (3) has two advantages over the one described in equation (1): (i) its calculation on a raster image does not need to check the difference between character and background once the distance transform has been obtained, and (ii) our extensive experiments on handwritten characters show that it better underlines the difference between the two images, which is essential for a k-NN classifier. In the discrete case, neglecting constant multiplicative factors, $D^*$ has the following expression:
$$D^{*} = \sum_{j=1}^{n} \left| DT_{\xi}^{(j)} - DT_{\eta}^{(j)} \right| \qquad (4)$$

where $n$ is the number of pixels of the image. Another way to interpret equation (4) is to consider the pixel values as coordinates of a point in an N-dimensional space. In this case, equation (4) gives the $l_1$ norm in the N-dimensional space. A further generalization of equation (4) leads to the following formula, defining the dissimilarity measure $D$:

$$D = \sum_{j=1}^{n} \frac{\left| DT_{\xi}^{(j)} - DT_{\eta}^{(j)} \right|^{x}}{\left( DT_{\xi}^{(j)} + DT_{\eta}^{(j)} \right)^{y}} \qquad (5)$$

where the symbols have the previous meaning and $x$ and $y$ are parameters. For $x = 1$ and $y = 0$, equation (5) gives the L1 distance of vectors $DT_\xi$ and $DT_\eta$. Additionally, the $x$th power of norm $l_x$ is obtained by assigning 2, 3, etc. to $x$, while $y = 0$. The denominator of the formula introduces the possibility of taking into account the relative error between two samples in the multidimensional space. Since $x$ and $y$ are now free parameters, their choice is problem-dependent and their values should be chosen by a suitable optimization procedure. We have determined the best values of $x$ and $y$ using the response surface method.(18) This method requires a polynomial approximation of a suitable objective function, which can be computed by evaluating the performances of the classifier for different values of $x$ and $y$ and then interpolating the values obtained. Using the interpolation, it is then possible to choose the best values of the parameters. In this case, the objective function is the recognition rate of the classifier for a chosen error rate.

In Figs 3 and 4 the geometrical meaning of the distance computation is shown for the cases $x = 1$, $y = 0$ and $x = 2$, $y = 1$, respectively. The one-dimensional case induced by points A and B is reported. The horizontal axis X shows the one-dimensional interval between 0 and the upper bound W, where points A and B lie, while the vertical axis shows the value of the distance transform DT(X). More specifically, curve $DT_A(X)$ is the distance transform generated by A, while $DT_B(X)$ is that generated by B. The numerical value of the area of the shaded region labelled D(A, B) is the distance between the two points, i.e. it is their dissimilarity. The L1 norm shown in Fig. 3 gives the same weight to each point, leading to a simple area computation between the two functions $DT_A(X)$ and $DT_B(X)$, while in Fig. 4 the choice of $y = 1$ gives greater emphasis to the region surrounding the pattern.

Fig. 3. Dissimilarity measure based on the L1 norm.

Fig. 4. Dissimilarity measure based on the L1 norm and relative error.

The dissimilarity measure already outlined can also be used to face problems other than OCR. It is able to associate a distance measure, which is a continuous, smooth function of the distance, with a couple of bidimensional, binarized images in a general pattern matching framework. Examples of applications can be the recognition of mechanical parts on a surface, the identification of airplanes once the edges have been extracted, and so on; however, only the OCR problem was investigated by the authors.
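A direct NumPy transcription of equation (5) follows; the guard for pixels where both distance transforms are zero (where the numerator also vanishes) is our addition, since the paper does not discuss the 0/0 case.

```python
import numpy as np

def dissimilarity(dt_a, dt_b, x=2, y=1):
    """Dissimilarity D of equation (5) between two distance-transformed
    images (arrays of the same shape), with parameters x and y."""
    a, b = dt_a.astype(float), dt_b.astype(float)
    num = np.abs(a - b) ** x
    den = (a + b) ** y
    # Where both transforms vanish the numerator vanishes too: treat 0/0 as 0.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(terms.sum())
```

With x = 1, y = 0 this reduces to the L1 distance of the two feature vectors, matching the special case noted above.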
4. A MASSIVELY-PARALLEL IMPLEMENTATION

Several researchers have applied statistical algorithms to recognition tasks using massively parallel hardware.(19) For example, Wang and Iyengar(20) used the Connection Machine for binary pattern matching. We implemented the classifier on a Connection Machine CM-2.(21) The Connection Machine CM-2 is a massively-parallel computer with up to 65,536 processors, having a conventional computer as a front end. Each processor has a 1-bit CPU and up to 1 Mbit of local memory. A floating point accelerator is optionally shared by each cluster of 32 processors. The machine has a Single Instruction Multiple Data (SIMD) architecture. Communication between processors is either by a regular pattern on an N-dimensional grid or by an arbitrary pattern on a hypercube. Our mapping of the algorithm onto the Connection Machine has exploited the regular structure of the problem by allocating each element of the training database to a processor. This programming method is referred to as the data parallel paradigm. The Connection Machine has a software mechanism for emulating more processors than actually exist in the machine, referred to as virtual processors. By using this, it is possible to emulate a machine with a factor M more processors, where each processor has a factor M less memory and runs a factor M more slowly. M is called the virtual processor ratio (VP ratio), and is restricted to a power of 2. In practice, for small VP ratios, the slow-down is less than a factor M, because the overhead of the tasks on the front end is overlapped with the computation on the Connection Machine. For large VP ratios the slow-down becomes very close to a factor M. The CM-2 used in this work has 8K processors with 256 Kbits of memory each, no floating point accelerator, and a SUN4 as front end. For the sake of efficiency the classification routines were written in C/Paris, a low-level protocol for data parallel programs. The C code controls front-end (serial) operations, while Paris calls direct only the handling of data by Connection Machine processors and data communication between the front end and the Connection Machine.
The classifier reads the preprocessed images stored in the training database and sends each feature vector to a virtual processor. The cost of this initial procedure grows linearly with the product of the size of the database and the number of dimensions of the feature space. However, even though this step is demanding, it is performed only once, during the classifier set-up phase, thus giving a negligible contribution to the overall CPU time when many classifications have to be performed. When the classification of a new image is required, its feature vector is copied in parallel into a dedicated area of each virtual processor. In the local memory of each virtual processor, besides the two feature vectors, the known and the unknown, an accumulator is defined to hold the result of the distance computation. A loop performed over the N dimensions of the feature space computes the running sum of equation (5). Thanks to the above operations, the distance between the test pattern and each element of the training set is computed in parallel, sequentially scanning the dimensions of the feature space. Thus the CPU time grows linearly with the feature space dimensionality, but it is independent of the number of training samples, provided there are 'enough' processors. At this point, the minimum of the distances is computed using a binary tree embedded in the hypercube network, and the index of the corresponding processor gives all the information about the nearest neighbor. Hence the computational complexity of this process grows logarithmically with the number of elements in the training set. The extraction of the subsequent neighbors is easily obtained by disabling the virtual processor holding the already-found nearest neighbor(s) and computing the new minimum distance. The data distribution over the virtual processors is a crucial part of the implementation of algorithms on the Connection Machine, because the total classification time is a function of the operations done on the front end, of those done on the Connection Machine and of the communication between the two. Therefore, it is important to reduce the time wasted in data transmission from and to the virtual processors. Due to the communication model implemented on the machine, based on the hypercube architecture, care must be taken to reduce message traffic between processors without a direct connection. This is why the test data is spread over the processors at the beginning of the computation and the distance is calculated locally, instead of sending each feature coordinate alone to the virtual processors and increasing the communication time. Additionally, due to the finite message length, it is not necessary to send all the feature coordinates together: the best choice is the greatest number of coordinates which can fit into one single message. In this way the local memory used by the single processors can also be kept under control, without losing performance, by splitting the feature vector of the test element into pieces. In this way the computation and communication times are exactly the same as when sending all the coordinates together, while the local memory needed by each processor is independent of the feature vector size.
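The original C/Paris code is not reproduced in the paper; the following NumPy sketch only mimics the data-parallel pattern described above, with one array row per (virtual) processor holding one training vector, a vectorized distance computation standing in for the SIMD loop, and repeated masked minima standing in for the tree reduction and processor disabling.

```python
import numpy as np

def parallel_knn_indices(train_vecs, test_vec, k=3, x=2, y=1):
    """Indices of the k nearest training vectors under equation (5).
    Each row of train_vecs plays the role of one virtual processor."""
    a = train_vecs.astype(float)          # one training vector per 'processor'
    b = test_vec.astype(float)            # broadcast to every 'processor'
    num = np.abs(a - b) ** x
    den = (a + b) ** y
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    dists = terms.sum(axis=1)             # per-processor accumulator
    active = np.ones(len(dists), bool)
    nearest = []
    for _ in range(k):                    # k global min-reductions
        i = np.argmin(np.where(active, dists, np.inf))
        nearest.append(int(i))
        active[i] = False                 # disable the processor just found
    return nearest
```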
Fig. 5. Recognition time vs training set size: parallel (continuous) and serial (dashed) implementation.

Using 8192 known patterns for the classification and a feature space of 1024 dimensions, 15 classifications per second are possible. However, if the dimensionality is reduced by subsampling the distance-transformed image by a factor of 16, the classification speed is above 200 patterns/s. In this case the VP ratio is 1 and each processor works on one known image only. Using twice as many images, for example, the VP ratio becomes 2, while the partitioning of the images, two for each processor, is transparent to the user, because it is carried out by the low-level system routines. It may be useful to compare the speed of the recognition process on the CM-2 with a serial implementation running on a SUN SPARCstation 2 with 64 MB of central memory. Obviously the platform and the programming paradigm are completely different. On the SPARCstation the implementation of the classifier is serial, giving rise to a linear increase of the classification time with the number of known elements. The time needed by the parallel version, on the other hand, is constant as long as the number of known elements is less than or equal to the number of physical processors, then doubles until it reaches twice the number of physical processors, and so on. In Fig. 5 a comparison of the recognition times is reported. The continuous line refers to the CM-2, while the dashed one refers to the SPARCstation 2. As can be seen, in the case of the already-described 8192 known elements which form the training set, the recognition time on the SPARCstation 2 is about 3 s/character. Thus, at this point, the parallel implementation is about 500 times faster than the serial implementation on a SUN SPARCstation 2.

5. RESULTS
The evaluation of a pattern classifier using supervised learning requires the selection of a training set, made of known data, and a test set, which is classified using the first set. Sometimes the two sets are distinct, as in the NIST Special Database 3 and the NIST Test Data 1, which we used for the upper-case letter recognition. However, the problem arises of how to
define the training set and the test set when only one database is available. Dealing with only one database as, for example, in the case of the ZIP code database which we used for digit recognition, the cross-validation method in the form of the leave-one-out method(22) has been adopted to evaluate the performances of the classifier, since it provides an unbiased estimate of the true error rate of the classifier if the training set is composed of statistically independent samples. This technique evaluates the classifier by removing one element at a time from the initial database and using the remaining $M - 1$ elements of $\mathcal{T}$ to classify the extracted sample. This procedure is then repeated for each element in $\mathcal{T}$. The ZIP code database consists of 1985 unsegmented ZIP codes in gray tones, obtained from real pieces of U.S. Mail passing through the Buffalo and New York post offices, thus featuring digits written by people operating in real conditions. The ZIP codes of the database have been binarized, the location of the digits found and a careful segmentation of the ZIP codes carried out. At the end of the above operations, approximately 9000 single digits were obtained. The preprocessing described in Section 2 converts the images into 32 × 32 square pictures.
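A sketch of the leave-one-out evaluation, reusing the knn_classify sketch from Section 2.3; the counting conventions (rejections tracked separately from errors) follow the rate definitions given in Section 5.1 below.

```python
def leave_one_out(dissim, samples, labels, k=3):
    """Leave-one-out: classify each sample against all the others and
    return (correct, wrong, rejected) counts."""
    correct = wrong = rejected = 0
    for i, (s, lbl) in enumerate(zip(samples, labels)):
        rest = samples[:i] + samples[i + 1:]
        rest_lbls = labels[:i] + labels[i + 1:]
        out = knn_classify(dissim, rest, rest_lbls, s, k)
        if out is None:
            rejected += 1
        elif out == lbl:
            correct += 1
        else:
            wrong += 1
    return correct, wrong, rejected
```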
5.1. Optimization of the dissimilarity measure

Several geometric transforms have been tried in order to find the best classification performance. The distance transform(7) can be based on different norms, such as L1 and Euclidean (L2). In addition, several local approximations are possible, such as the chessboard. We have found the L1 norm to be the best for the OCR problem. The computational burden of the distance transformation can be overcome using algorithms based on local computation, defined by templates, instead of a true L1 computation. The chessboard distance transform gives a reasonably good approximation to the true L1 norm. The isovist transform,(23) which bears some similarity to the distance transform, has also been tested on this problem, but the result was significantly worse than that obtained using the previous technique. We then investigated the effects of the parameters x and y of equation (5) on the recognition and error rates of the classifier, using a fixed-size database of digits and a neighborhood of size three with majority voting for the classification. For this computation the chessboard(8) distance transform was employed. In Fig. 6 the plot of the recognition rate for different values of x and y is shown when the database contains 5000 randomly-selected elements. The recognition rate is defined as the number of correctly classified samples divided by the total number of elements in the database, multiplied by 100. In Fig. 7 the error rate is reported. The error rate is equal to the number of misclassified elements divided by the dimension of the database, multiplied by 100. The elements which do not appear in the preceding two categories have been rejected by the classifier, because the vote expressed by the neighbors was ambiguous.
Fig. 6. Recognition rate vs x and y.

Fig. 7. Error rate vs x and y.
We have found the best results when y = x - 1 (for x > 1) and for increasing values of x. If x = 2 the distance can be interpreted as the product of the relative error and the L1 norm along each coordinate. We have obtained similar results for databases of size greater than 2000, for upper- and lower-case letters, as well as for different distance transforms, such as L1, Euclidean and L∞, thus showing a considerable independence of the optimal values from the size of the database and the underlying type of distance transform.
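The response-surface optimization of x and y can be approximated by a plain grid evaluation followed by a quadratic fit; the sketch below is our construction, with a placeholder evaluate_classifier function standing in for a full leave-one-out run at each (x, y).

```python
import numpy as np

def fit_response_surface(evaluate_classifier, xs=(1, 2, 3), ys=(0, 1, 2)):
    """Fit a quadratic response surface r(x, y) to measured recognition
    rates; return the fitted coefficients and the best grid point."""
    pts, rates = [], []
    for x in xs:
        for y in ys:
            pts.append((x, y))
            rates.append(evaluate_classifier(x, y))  # e.g. leave-one-out rate
    # Design matrix for r = c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2.
    A = np.array([[1, x, y, x * x, x * y, y * y] for x, y in pts], float)
    coeffs, *_ = np.linalg.lstsq(A, np.array(rates, float), rcond=None)
    best = pts[int(np.argmax(rates))]
    return coeffs, best
```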
5.2. Digit and upper case letter recognition capacity

To measure the recognition capabilities of the classifier, two experiments have been performed. The first was based on the ZIP code database for digit recognition
and the second one on the NIST databases for upper-case letter recognition. Parameters x and y have been set to the values 2 and 1, respectively, and the true L1 distance transform has been applied. For the digit recognition the leave-one-out method was employed. The system has shown a recognition rate of 96.73% when forced to classify, and a recognition rate of 93.59% at a 0.98% error rate, corresponding to an accuracy of 98.96%. In this case, the feature space dimensionality is 1024, since the number of pixels in the window is 32 × 32. In the second experiment, the 44,951 images of the NIST Special Database 3 were used as a training set to recognize the 11,941 upper-case letters of the NIST Test Data 1. The images in the above databases are black and white and their size is 128 × 128. The normalization stage is the same as for the previous experiment, except for the area filter used for noise removal instead of the median one, and the missing skeletonization. Using the final 32 × 32 images, the recognition rate is 94.51% by forcing the classification and 77.00% at a 1% error rate, corresponding to 98.72% accuracy.

5.3. Subsampling in digit and upper case letter recognition

To investigate the robustness of the distributed representation of the information provided by the distance transform, we have reduced the 1024 dimensions obtained from the original 32 × 32 image. Both the horizontal and vertical 32 pixels were clustered into 8 sets of 4 adjacent pixels, giving rise to 64 regular regions defined on the original image. At this point, the new subsampled image was formed by considering the top leftmost pixel of each region. The correlation of the neighboring pixels is much weaker in this way compared to that of the pixels of the original image. Several other dimension-reduction procedures have been tried, modifying the number of regions, their shape and the operations to be performed on the pixels belonging to the same region. However, the most useful method we found experimentally is this simple pixel extraction without any average calculation between neighboring pixels, probably because in this way the correlation of the neighboring elements is kept as small as possible. When the classification procedure is applied to the digits under the same conditions as reported in the previous section, the recognition rate is 96.48% by forcing the classification, and 92.52% at a 0.98% error rate. Thus, reducing the feature space dimensionality by a factor of 16, the difference in the recognition rate when no rejection is allowed is 0.25%, while at a 0.98% error rate the recognition difference is 1.07%. Note that a straightforward reduction of the image size to an 8 × 8 format before the feature extraction is carried out by the distance transform reduces the recognition rate of the system to 94.24% at no rejection and to 81.78% at a 0.98% error rate. As far as the upper-case letter classification is concerned, if the 32 × 32 image is reduced to an 8 × 8
one by subsampling, the recognition rate becomes 93.76% when no rejection is allowed, while it becomes 74.11% when a 1% error is allowed. The latter result is part of the NIST report(24) that describes the systems presented at the First Census OCR Systems Conference, and shows the industrial and academic state of the art in handwritten character recognition.

6. CONCLUSIONS

In this paper a novel handwritten character recognition methodology has been introduced. It is based on the distance transform used as the feature extraction step and on a novel metric suitable for two-dimensional pattern recognition. An optimization technique has been used to determine the best values of the parameters of this measure. Experimental data show that the optimal values are independent of the specific database, if its size is reasonably large. The parallel algorithm used for the implementation of the classifier on the Connection Machine CM-2 has been described. The performances obtained by the k-NN classifier are comparable with the current state of the art in optical character recognition, featuring a recognition rate of 94.51% for the handwritten upper-case letters of the NIST Test Data 1 and of 96.73% for the digits of the USPS ZIP code database. Finally, work in progress shows that the proposed technique is most useful when coupled to other existing classifiers, in a cooperative multiple-classifier environment.

Acknowledgements--The authors wish to express their appreciation to Prof. G. Baccarani, Prof. S. Graffi and Prof. G. Masetti for their help and encouragement on this work. The first author acknowledges the support provided by a grant from SGS-Thomson. The authors gratefully acknowledge the National Institute of Standards and Technology, the CENSUS Bureau and the Office of Advanced Technology of the United States Postal Service for providing the training and testing databases. The Istituto di Scienze per l'Ingegneria of Parma (Italy) is gratefully acknowledged for the use of the Connection Machine CM-2.
REFERENCES
1. K. Fukunaga, Introduction to Statistical Pattern Recognition, second edition. Academic Press, New York (1990).
2. L. Lam and C. Y. Suen, Structural classification and relaxation matching of totally unconstrained handwritten zip-code numbers, Pattern Recognition 21, 19-31 (1988).
3. P. Gader, B. Forester, M. Ganzberger, A. Gillies, B. Mitchell, M. Whalen and T. Yocum, Recognition of handwritten digits using template and model matching, Pattern Recognition 24, 421-431 (1991).
4. Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten digit recognition with a back-propagation network, in Neural Information Processing Systems 2 (edited by D. Touretzky), Morgan Kaufmann (1990).
5. H. Takahashi, A neural net OCR using geometrical and zonal pattern features, Proc. Int. Conf. on Document Analysis and Recognition, France, pp. 821-828 (1991).
6. K. T. Blackwell, T. P. Vogl, S. D. Hyman, G. S. Barbour and D. L. Alkon, A new approach to handwritten character recognition, Pattern Recognition 25, 655-666 (1992).
7. A. Rosenfeld and J. L. Pfaltz, Sequential operations in digital picture processing, J. Assoc. Comput. Mach. 13, 471-494 (1966).
8. G. Borgefors, Distance transformations in digital images, Computer Vision, Graphics and Image Processing 34, 344-371 (1986).
9. J. D. Tubbs, A note on binary template matching, Pattern Recognition 22, 359-365 (1989).
10. United States Postal Service Office of Advanced Technology, Handwritten ZIP Code Database (1987).
11. M. D. Garris and R. A. Wilkinson, NIST Special Database 3 and Test Data 1, NIST Advanced Systems Division, Image Recognition Group (1992).
12. R. G. Casey, Moment normalization of handprinted characters, IBM J. Res. Dev., 548 (1970).
13. Z. Guo and R. W. Hall, Parallel thinning with two-subiteration algorithms, Commun. ACM 32, 359-373 (1989).
14. D. W. Paglieroni, Distance transforms: properties and machine vision applications, Computer Vision, Graphics and Image Processing 54, 56-74 (1992).
15. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. John Wiley, New York (1973).
16. M. E. Hellman, The nearest neighbor classification rule with a reject option, IEEE Trans. Systems Science and Cybernetics SSC-6, 179-185 (1970).
17. H.-C. Liu and M. D. Srinath, Partial shape classification using contour matching in distance transformation, IEEE Trans. Pattern Analysis and Machine Intelligence 12, 1072-1079 (1990).
18. G. E. P. Box and W. G. Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York (1978).
19. D. L. Waltz, Massively parallel AI, Proc. AAAI Conf., Boston (August 1990).
20. W. Wang and S. S. Iyengar, Memory-based reasoning approach for pattern recognition of binary images, Pattern Recognition 22, 505-518 (1989).
21. W. D. Hillis, The Connection Machine. MIT Press, Cambridge, Massachusetts (1985).
22. S. J. Raudys and A. K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Analysis and Machine Intelligence 13, 252-264 (1991).
23. A. Rosenfeld, A note on 'geometric transforms' of digital sets, Pattern Recognition Letters 1, 223-225 (1983).
24. Zs. M. Kovács-V., The First Census Optical Character Recognition Systems Conference, NISTIR 4912, pp. 313-318 (September 1992).
About the Author--ZSOLT M. KOVÁCS-V. received the Dr. Eng. degree from the University of Bologna, Italy, in 1988. Since 1989 he has been with the Department of Electrical Engineering of the same University, where he is working toward the Ph.D. degree in electrical engineering and computer sciences. His research interests include various aspects of optical handwritten character recognition, neural networks and circuit simulation techniques. He is a member of the Institute of Electrical and Electronics Engineers (IEEE), of the International Association for Pattern Recognition (IAPR-IC) and of the International Neural Network Society (INNS).
About the Author--ROBERTO GUERRIERI received the Dr. Eng. degree from the University of Bologna, Italy, in 1980. From 1980 to 1986 he was with the Department of Electrical Engineering of the same University, where he received the Ph.D. degree in electrical engineering for his research on the numerical simulation of semiconductor devices. From 1986 to 1988 he was with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, as a visiting researcher. In 1987 he spent the winter semester at MIT, Cambridge, as a visiting scientist. In 1989 he joined the University of Bologna, where he is currently Associate Professor in charge of the Laboratory for VLSI Design. His research interests are in various aspects of applied pattern recognition, integrated circuit design and parallel processing. In 1986 he received a NATO fellowship and in 1989 a fellowship for young researchers provided by the Consiglio Nazionale delle Ricerche, Italy.