Improving OCR performance using character degradation models and boosting algorithm




Pattern Recognition Letters 18 (1997) 1415–1419

Jianchang Mao *, K.M. Mohiuddin
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, USA
* Corresponding author. E-mail: [email protected].

Abstract

We introduce three character degradation models in a boosting algorithm for training an ensemble of character classifiers, and compare the boosting ensemble with the standard ensemble of networks trained independently with the character degradation models. An interesting finding in our comparison is that although the boosting ensemble is slightly more accurate than the standard ensemble at zero reject rate, the advantage of boosting training over independent training quickly disappears as more patterns are rejected; eventually the standard ensemble outperforms the boosting ensemble at high reject rates. An explanation of this phenomenon is provided in the paper. © 1997 Elsevier Science B.V.

Keywords: Combining multiple classifiers; Boosting algorithm; Neural network ensemble; OCR; Character degradation

1. Introduction

In this paper, we study the effectiveness of a boosting algorithm (Drucker et al., 1993) in improving the performance of OCR. The original theoretical work on the boosting algorithm was done by Schapire (1990), who showed that it is in principle possible for a combination of weak classifiers (whose performances are only a little better than random guessing) to achieve an arbitrarily low error on the training data set. Drucker et al. (1993) applied the boosting algorithm to character recognition. They produced a large number of training patterns by deforming the original character images by various degrees, and showed that character recognition performance was dramatically improved over that of the single network used as the first network in the boosting hierarchy. However, it remains unanswered whether the boosting ensemble outperforms the standard ensemble of independently trained networks. In this paper, we provide a comparative study of the boosting ensemble and the standard ensemble. We also introduce three character degradation models in the boosting algorithm.

2. Boosting algorithm

In the boosting algorithm, the weak classifiers are trained hierarchically to learn harder and harder parts of a classification problem. The algorithm requires an oracle that can produce a large number of independent training patterns. The basic boosting algorithm works as follows (a sketch of the data-generation steps is given at the end of this section).
1. Generate a set of training data and train the first classifier.
2. Generate a set of training data for the second classifier in the following manner. Flip a coin. If it comes up heads, let the oracle generate patterns and pass them to the first classifier until one is misclassified, and add that pattern to the training set. If it comes up tails, perform the same procedure except that the first correctly classified pattern is added to the training set. Repeat the whole process until enough patterns have been generated, then train the second classifier on this training set.
3. Generate the third training data set by collecting enough patterns on whose classification the first two trained classifiers disagree. Train the third classifier using this data set.

Once the training is done, the three classifiers can be combined by either a hierarchical voting scheme or an averaging scheme. Drucker et al. (1993) applied the boosting algorithm to character recognition and found that the averaging method outperforms hierarchical voting. If we consider the ensemble of three classifiers as a single classifier, the basic boosting procedure can be applied recursively to produce more complex classifiers (with nine classifiers, 27 classifiers, etc.).
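To make the data-generation steps concrete, the following is a minimal sketch in Python. The `oracle` callable (returning one degraded character and its label per call) and the classifiers' `predict` method are illustrative assumptions, not the authors' implementation; the sketch only mirrors the filtering logic described in steps 2 and 3.

```python
import random

def boost_second_set(oracle, clf1, n_patterns):
    """Training set for the second classifier: a fair coin decides whether
    the next kept pattern is one the first classifier gets wrong (heads)
    or right (tails), so roughly half of the set is 'hard' for it."""
    data = []
    while len(data) < n_patterns:
        want_error = random.random() < 0.5          # flip a coin
        while True:
            x, y = oracle()                         # degraded character + label
            if (clf1.predict(x) != y) == want_error:
                data.append((x, y))
                break
    return data

def boost_third_set(oracle, clf1, clf2, n_patterns):
    """Training set for the third classifier: patterns on which the first
    two trained classifiers disagree."""
    data = []
    while len(data) < n_patterns:
        x, y = oracle()
        if clf1.predict(x) != clf2.predict(x):
            data.append((x, y))
    return data
```

At recognition time, the outputs of the three trained networks are simply averaged, which is the combination scheme used in the experiments reported below.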

3. Character degradation models

We introduce three document degradation models in the boosting algorithm: (i) an affine transformation, (ii) an image deformation model used in (Jain et al., 1996), and (iii) a probabilistic model for document degradation (Kanungo, 1996).

The affine model is a linear transformation of the coordinate system which takes into consideration the following operations: (i) translation, (ii) scaling, (iii) rotation, and (iv) shearing. In our OCR system, character features are invariant to translation and scaling due to the size normalization. However, slant correction is not performed before character feature extraction because of the difficulty of estimating the slant angle from a single character. The most dominant source of linear deformation is therefore character slant, which can be modeled as a horizontal shearing operation. We assume that the slant angle follows a zero-mean Gaussian distribution whose standard deviation is determined experimentally (10 degrees in this paper).

The nonlinear deformation model used in our experiments for generating additional training data was proposed by Jain et al. (1996). This model differs from the one used in (Drucker et al., 1993), which works on gray-scale images. In this model, the coordinates of each pixel on the original bitmap are subject to a displacement which is continuous and is zero on the edges of the bitmap. In our experiments, we first pad the size-normalized character bitmap (24 × 16) with a 2-pixel-wide boundary on each side; the deformed bitmap is then renormalized to 24 × 16. This allows non-zero displacements for character strokes near the bitmap boundaries.

Finally, we use the noise model proposed by Kanungo (1996). This model assumes that pixel inversions (black–white flips) occur independently at each pixel position due to light intensity fluctuations, different threshold levels in image binarization, and blurring due to the point-spread function of the scanner's optical system. The probability of a pixel inversion is an exponential decay function of the squared distance between the pixel and its nearest point on the character boundary.
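As an illustration of models (i) and (iii), here is a minimal NumPy/SciPy sketch operating on a binary bitmap (foreground = 1). The slant standard deviation follows the 10-degree value above, while the flip parameters p0 and alpha are illustrative placeholders, and the single exponential term is a simplification of Kanungo's full model (which uses separate foreground/background decay parameters plus a constant flip probability).

```python
import numpy as np
from scipy.ndimage import affine_transform, distance_transform_edt

def slant(bitmap, sigma_deg=10.0, rng=None):
    """Model (i): horizontal shear with a zero-mean Gaussian slant angle."""
    rng = rng or np.random.default_rng()
    theta = np.deg2rad(rng.normal(0.0, sigma_deg))
    shear = np.array([[1.0, 0.0],               # row index unchanged
                      [np.tan(theta), 1.0]])    # column shifts with the row
    return affine_transform(bitmap.astype(float), shear, order=0) > 0.5

def kanungo_noise(bitmap, p0=0.3, alpha=1.0, rng=None):
    """Model (iii): flip each pixel independently with probability
    p0 * exp(-alpha * d**2), d being the distance to the nearest point
    on the character boundary (p0 and alpha are illustrative values)."""
    rng = rng or np.random.default_rng()
    fg = bitmap.astype(bool)
    # distance of every pixel to the stroke boundary
    d = np.where(fg, distance_transform_edt(fg), distance_transform_edt(~fg))
    flips = rng.random(bitmap.shape) < p0 * np.exp(-alpha * d ** 2)
    return fg ^ flips
```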
4. Experiments and discussions

We used the lower-case alphabets of the NIST (National Institute of Standards and Technology) Special Database 3 (SD3: 39,636 segmented characters) and Test Data 1 (TD1: 12,000 segmented characters); these databases consist of pre-segmented characters used in the 1992 comparative study (Wilkinson et al., 1992). In our experiments, the SD3 data set was further partitioned into a training set (SD3-train) with 27,636 characters and a validation set (SD3-valid) with 12,000 characters. The quality of the SD3 and TD1 data sets is quite different (Wilkinson et al., 1992), and a classifier trained on SD3 does not generalize very well to TD1. We therefore also created three new data sets for training (MIX-train, 27,636 characters), validation (MIX-valid, 12,000 characters) and testing (MIX-test, 12,000 characters) in the following manner: first mix the SD3 and TD1 data sets, then randomly partition the mixed data into three sets of the given sizes.
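The MIX partition amounts to a simple shuffle-and-slice over the pooled samples; a minimal sketch, assuming the SD3 and TD1 samples are available as lists of (image, label) pairs:

```python
import random

def make_mix_splits(sd3, td1, sizes=(27636, 12000, 12000), seed=0):
    """Pool SD3 and TD1, shuffle, and slice into MIX-train/valid/test."""
    pool = list(sd3) + list(td1)
    random.Random(seed).shuffle(pool)
    n_train, n_valid, n_test = sizes
    return (pool[:n_train],
            pool[n_train:n_train + n_valid],
            pool[n_train + n_valid:n_train + n_valid + n_test])
```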


Character images are normalized to 24 (height) × 16 (width) pixels in order to reduce character size variation. A total of 184 features, as proposed in (Takahashi, 1991), are extracted from each character image; these features capture information about local contour directions and bending points.

Feedforward networks are used as the individual classifiers in our experiments. The number of input units is 184, the number of features; the number of output units is 26, corresponding to the 26 alphabet classes. The number of hidden units was determined experimentally after numerous trials: we found that a network with 50 hidden units achieves relatively good performance given our feature set and the NIST training data. The standard backpropagation algorithm is used to train all the individual networks, and each epoch goes through the entire training set once in a random order. When the character degradation models are used in training, for each pattern presentation in an epoch, either the original pattern or one of its degraded versions is used according to the following probabilities: 0.5 for the original and 1/6 for each of the three degradation models. At the end of every five epochs, the network is evaluated on the validation data set, and the weights corresponding to the best validation accuracy are finally selected. This whole process is repeated ten times (trials) for each classifier.
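The per-presentation choice between the original pattern and a degraded one can be written compactly; the sketch below assumes hypothetical `affine`, `deform` and `kanungo_noise` functions implementing the three degradation models of Section 3.

```python
import random

def present(pattern, degradations, p_original=0.5):
    """Return what to feed the network for one pattern presentation:
    the original with probability 0.5, otherwise one of the three
    degradation models chosen with equal probability (1/6 each)."""
    if random.random() < p_original:
        return pattern
    return random.choice(degradations)(pattern)

# usage: degradations = [affine, deform, kanungo_noise]
#        x_shown = present(x, degradations)
```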


The performance comparison of a single network trained with and without the character degradation models is shown in the second and third columns of Table 1. The three character degradation models help the network generalize about 0.5–1.0% better on the two test sets. Substantial differences remain between the recognition performance on the two test data sets: the character degradation models capture only the document degradation introduced by the scanning process and the slant angles in writing, and the variations of other writing styles present in TD1 are not captured by the three models.

We also compare the performance of two ensembles of three networks: the boosting ensemble, which is trained using the boosting algorithm, and the standard ensemble, in which the three networks are trained independently on the same data with different initial weights and different random seeds for invoking the character degradation models. In both ensembles, the averaging scheme is used to combine the outputs of the three networks. The average recognition accuracies and standard deviations of the two ensembles over the ten trials are given in the last two columns of Table 1. While both the boosting and the standard ensembles outperform the single network, the boosting ensemble is slightly better than the standard ensemble of three independently trained networks. This is not surprising: in the boosting ensemble, the error characteristics of the three individual networks are less correlated than those of the networks in the standard ensemble. The second network in the boosting hierarchy specializes in the errors made by the first network, while the third network attempts to resolve disagreements between the first two.

The curves of recognition accuracy versus reject rate for the three classifiers on the TD1 data set are plotted in Fig. 1. A pattern is rejected if the difference between the confidence values of the top two candidates is lower than a threshold (a minimal version of this rule is sketched below). Both the boosting and the standard ensembles show a consistent improvement over the single network at reject rates from 0% to 50%. It is interesting to observe that although the boosting ensemble has a slightly higher accuracy than the standard ensemble at zero reject rate, the gap quickly narrows as the reject rate increases, and eventually the standard ensemble takes the lead in recognition accuracy.
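For reference, the rejection rule used for the curves in Fig. 1, written out under the assumption that the combined classifier outputs one confidence value per class:

```python
import numpy as np

def reject(confidences, threshold):
    """Reject a pattern when the top two class confidences are too close."""
    top2 = np.sort(confidences)[-2:]
    return (top2[1] - top2[0]) < threshold
```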

Table 1
Average recognition accuracies and standard deviations (in parentheses) of a single network and two ensembles of networks at the zero rejection level.

Data set       Single, w/o degradation   Single, w/ degradation   Boosting ensemble   Standard ensemble
TD1 (%)        85.77 (0.30)              86.78 (0.28)             88.03 (0.19)        87.65 (0.12)
MIX-test (%)   92.65 (0.09)              93.17 (0.12)             93.86 (0.14)        93.73 (0.06)



Fig. 1. Recognition accuracy versus reject rate for the three classifiers (single network, boosting ensemble and standard ensemble) on the TD1 data set.

The same phenomenon was observed in every trial and also on the MIX-test data. One explanation is that the patterns which the first network misclassifies but the boosting ensemble manages to classify correctly at zero reject rate are, for the most part, quickly rejected at low reject rates. The advantage of boosting training over independent training therefore disappears once more patterns have been rejected. At high reject rates, the second and third networks in the boosting ensemble no longer play an important role in helping the first network improve its recognition accuracy: the remaining patterns constitute the "easy" part of the recognition problem, on which the second and third networks have no particular expertise. This follows from the fact that the distributions of the second and third training data sets are over-represented by the "hard" part of the recognition problem, which has by then largely been rejected. The standard ensemble of three independently trained networks does not have this problem.

5. Conclusions

We have introduced three character degradation models into the boosting training. We compared the boosting ensemble with the standard ensemble of networks trained independently with the character degradation models. Both ensembles outperform the single network trained using the character degradation models. An interesting finding in our comparison is that although the boosting ensemble has a slightly higher accuracy than the standard ensemble at zero reject rate, the advantage of boosting training over independent training quickly disappears as more patterns are rejected. At high reject rates, the second and third networks in the boosting ensemble no longer play an important role in helping the first network improve its recognition accuracy. The ensemble of three independently trained networks does not have this problem.

Discussion

Nagy: I have a comment regarding your first conclusion about training with deformed patterns. About 35 years ago, it was discovered that neural networks had better generalization properties when they were trained on characters of different sizes. It really helps to train on different sizes in order to recognize different sizes. And then somebody had the clever idea of normalizing the sizes in the system.


It seems to me that somebody will eventually discover that training on distorted characters is no better than building invariance into the features of the classifier. That's a comment. In other words, if you know what distortions you will encounter, build them into the classifier; don't make distorted samples. The second thing is: you mentioned that a lack of correlation improves the performance. I don't think that is necessarily always true. It is easy to produce examples where increased correlation improves the classification. The point is that the correlation has to be different for the different classes, not that there shouldn't be correlations. It is possible to construct examples where the only thing that differs between classes is the correlation, and that is sufficient to classify accurately. So, on correlations, you have to go one step beyond that.

Mao: Let me comment on the first point you made: there are a number of ways to achieve invariance properties. One way is to use invariant features, and normalization is one way of achieving a certain degree of invariance. However, it is very difficult to develop features which are totally invariant to the various operators. The other way, very often used, is to collect a lot of training data that cover the variations and then train the classifier; at the final level you achieve the invariance. That is the approach we take with the various degradation models. On your second comment: I was talking about error correlation, and you are right; less correlated errors do not guarantee a better improvement. There is one more level of analysis needed to establish the relationship between error correlation and the final improvement.


Yamany: What about the time requirements, which are an important consideration in OCR? I did not see any time comparisons.

Mao: We do not care much about training time. The recognition speed is three times slower than with a single network, because we use three networks with the same architecture and the same size.

References

Drucker, H., Schapire, R., Simard, P., 1993. Improving performance in neural networks using a boosting algorithm. In: Hanson, S.J., Cowan, J.D., Giles, C.L. (Eds.), Advances in Neural Information Processing Systems, vol. 5. Morgan Kaufmann, Los Altos, CA, pp. 42–49.
Jain, A., Zhong, Y., Lakshmanan, S., 1996. Object matching using deformable templates. IEEE Trans. Pattern Anal. Machine Intell. 18 (3).
Kanungo, T., 1996. Document degradation models and a methodology for degradation model validation. Ph.D. Thesis, University of Washington, Seattle, WA.
Schapire, R.E., 1990. The strength of weak learnability. Machine Learning 5, 197–227.
Takahashi, H., 1991. A neural net OCR using geometrical and zonal-pattern features. In: Proc. 1st Internat. Conf. on Document Analysis and Recognition, pp. 821–828.
Wilkinson, R.A., Geist, J., et al. (Eds.), 1992. The first census optical character recognition system conference. Tech. Rept. NISTIR 4912, U.S. Department of Commerce, NIST, Gaithersburg, MD 20899.