Copyright © IFAC Intelligent Components and Instruments for Control Applications, Malaga, Spain, 1992

PATTERN RECOGNITION APPLIED TO FORMATTED INPUT OF HANDWRITTEN DIGITS

F.A. Vazquez and R. Marin

Department of Systems Engineering and Information Languages and Systems, University of Vigo, Spain

Abstract: The system presented in this paper uses Optical Character Recognition (OCR) techniques to allow massive capture of handwritten data from the product demand information contained in forms filled out by vendors during their visits to clients. OCR is approached through feature extraction followed by classification using nearest neighbor clustering and/or exhaustive search (Q-analysis [7]). Details are given on the features, and both classification methods are compared with respect to error rate, speed and memory requirements.

Keywords: Optical Character Recognition, Pattern Recognition, Image Processing, Learning Systems, Artificial Intelligence, Cybernetics, Textile Industry, Computer Interfaces, Handwritten Digits, Data Acquisition.

INTRODUCTION

Formatted input consists of reading information that is spatially distributed in a predictable manner. Usually, a form will consist of a set of delimiters (possibly boxes) that contain characters belonging to a certain set.

In order to develop a system capable of transferring the data contained in a form to a database, the image data contained in each box must be extracted and later recognized as corresponding to a certain character.

Recognition is especially difficult when characters are handwritten; even the best OCR systems, humans, present an error rate of 4% in the absence of context [6].

Many digitizing-table based systems have been developed for formatted data introduction. The main problem with these systems is that digitizing tables are uncomfortable and expensive compared to paper. Paper-based systems have been developed by AT&T [2,3], Hewlett-Packard and a number of Japanese companies, but a need exists for off-the-shelf products that easily adjust to any company's needs.

Neural network based systems like AT&T's ZIP-code reader [2,3] and the Neocognitron [4] seem to work very well but are not oriented toward formatted input.

NEW CONTRIBUTION

This paper is a summary of a project developed by the Department of Systems Engineering and Information Languages and Systems of the University of Vigo as part of a CIM project for a textile company called Pili Carrera S.A. Its mission is to provide massive capture of data concerning production needs and trends.

Before developing this system, data capturing was the great bottleneck of the company's production cycle. Without this data, production needs and trends must be guessed. Accurate guessing is extremely difficult in textile markets and usually produces situations of excess stock (possibly non-reusable).

The project consists of a system that reads formatted forms which contain handwritten digits in a collection of small boxes. The system includes a user interface for the correction of possible errors and queries a database to guarantee data consistency.

This system has the following advantages over existing systems:
• It is paper based, that is, no expensive and uncomfortable digitizing tables are needed for input.
• Its user interface is totally integrated with the company's needs and extremely intuitive.
• It uses database query capabilities to assure data consistency and to correct errors produced both during recognition and during data introduction.
• It can doubt a decision and, instead of making an immediate hard decision for a digit, keeps a record of possible alternative candidates. This information can be used during various stages of the system. This state of doubt is reflected by the user interface.
• A tool has been developed that allows the design of any particular form using a graphical builder. That makes the system sufficiently off the shelf and available for any user without need for special installation or maintenance.

SYSTEM OVERVIEW

The system can be separated into the following subsystems:
• Form reader
• OCR (Optical Character Recognition)
• Consistency check and contextual error correction
• Error correction

Since the company's original forms only contained numerical information, character recognition has been reduced to digit recognition. This limitation is not inherent to the system and any set of characters can be used. It is true, though, that the features have been chosen to enhance digit recognition and that it is much easier to obtain a low error rate if the set of characters from which to decide is small.

FORM READER

The mission of the form reader is to extract the image data inside the boxes that constitute the form. These boxes should contain either a digit or a blank space.

Fig. 1 Typical Form

The main difficulty involved in developing this subsystem is providing immunity to rotations, translations and spatial deformations produced by the scanning system (a scanner or a fax), and dealing with large amounts of image data (300 d.p.i. for an A4 page).
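The paper does not detail how this immunity is achieved. Purely as a generic illustration, one common approach is to estimate the page skew from projection profiles and then crop each box at its nominal coordinates; in the sketch below the box positions are assumed to come from the form-design tool mentioned earlier, and all names and thresholds are illustrative rather than the original implementation.

    import numpy as np
    from scipy import ndimage

    def estimate_skew(binary, angles=np.arange(-5.0, 5.25, 0.25)):
        # Projection-profile deskew: the rotation that makes the horizontal
        # rulings of the form line up gives the sharpest (peakiest) profile.
        best_angle, best_score = 0.0, -1.0
        for a in angles:
            rotated = ndimage.rotate(binary, a, reshape=False, order=0)
            profile = rotated.sum(axis=1)        # ink per scan line
            score = float(profile.var())
            if score > best_score:
                best_angle, best_score = a, score
        return best_angle

    def extract_boxes(page, boxes):
        # page: grayscale image; boxes: nominal (row, col, height, width) tuples.
        binary = (page < 128).astype(np.uint8)   # assume dark ink on light paper
        angle = estimate_skew(binary)
        deskewed = ndimage.rotate(page, angle, reshape=False)
        return [deskewed[r:r + h, c:c + w] for (r, c, h, w) in boxes]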

The image data extracted by this subsystem is fed to the OCR subsystem. Once all of the boxes have been processed, it provides a table containing the data in the form.

This table will be used by the subsystems that follow it. The forms used were for a company in the textile sector which receives data from its vendors; they consist of a header containing the client and form numbers and the date, and a series of lines that each contain a product number, a color number and the number of units per size to be ordered (Fig. 1).

OCR

The OCR subsystem processes the image data provided by the form reader to determine the character which the data most resembles. The OCR used in this case is feature based; that is, it compresses the image data that describes the character into a feature vector which is later used to classify the character. The feature vector is a 54-dimensional binary vector which includes closed contours, straight segments, curves, numbers of horizontal and vertical crossings, number of strokes, etc.

Two types of classification methods have been developed: the first one is a classical theoretical decision method which makes decisions based on the statistics of feature occurrences for each character [1]; the other method reviews all the history of a learning base searching for occurrences of the feature vector.

Both methods construct a merit vector in which each component corresponds to a character of the set and whose value is greatest for the component corresponding to the character that the sample most resembles.

Syntactical information [1] is also used in detecting some geometrical and topological features, and the decision space is generated by learning from experience.

CONSISTENCY CHECK AND CONTEXTUAL ERROR CORRECTION

The subsystem devoted to consistency check and contextual error correction makes sure that non-existent client or product numbers, whether introduced or interpreted, are detected and corrected, and that the sizes and colors interpreted for a certain model exist. This avoids errors due both to wrong introduction and to wrong interpretation.
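A minimal sketch of such a check, assuming the valid client and product numbers are available as in-memory sets and that the OCR stage supplies the alternative candidates it records for doubtful digits (one of the advantages listed earlier); every name here is illustrative:

    def check_field(value, alternatives, valid_values):
        # Returns (value, ok): the value as recognised, a consistent alternative
        # taken from the recorded candidates, or the original value flagged for
        # manual correction in the error correction subsystem.
        if value in valid_values:
            return value, True
        for candidate in alternatives:           # try the recorded doubts first
            if candidate in valid_values:
                return candidate, True
        return value, False                      # leave it to the user interface

For instance, a misread product number would be replaced by the first recorded candidate that actually exists in the product table; if none exists, the field is left for the error correction subsystem.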

ERROR CORRECTION

The error correction subsystem provides a friendly, mouse-driven user interface which allows the correction of introduction or interpretation errors not detected previously. This subsystem is intended to be used during an initial stage, until the system is considered very reliable; after that it should tend to be used only for visualizing results or forms.

FEATURE VECTOR


The feature vector is formed by 54 binary components indicating the absence or presence of a feature. The following types of features are considered (see Fig. 2 for examples):
• Closed contours. Can be small, large or huge, and the first two can be centered in the upper, middle or lower parts.
• Concavities. Can be oriented towards the right, left, top or bottom, or can be straight lines, and can be located in the upper, middle or lower parts.
• Horizontal crossings. Mean number of times a horizontal line crosses the character. The number of crossings can be less than one or more than one. For example, a four or a nine has a mean of more than one horizontal crossing in the top and of one in the bottom.
• Vertical crossings.
• Loose extreme locations. Can be left or right and in the upper, middle or lower parts.
• Number of strokes. Can be one or more than one.
• Stroke crossing locations. Can be left or right and in the upper, middle or lower parts.
• Horizontal lines. Can be in the upper, middle or lower parts.
• Zig-zags.

Fig. 2 Some Features
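As a small illustration of how the crossing-count features can be derived, the sketch below computes the two horizontal-crossing features for a binarized character image; the split into top and bottom halves and the exact thresholds are assumptions, since the precise procedure is not given here.

    import numpy as np

    def horizontal_crossings(img):
        # img: 2-D binary array (1 = ink). A crossing is a 0 -> 1 transition
        # along a row; the mean is taken separately over the top and bottom halves.
        rows = np.asarray(img, dtype=np.uint8)
        transitions = np.sum((rows[:, 1:] == 1) & (rows[:, :-1] == 0), axis=1)
        h = rows.shape[0]
        return transitions[: h // 2].mean(), transitions[h // 2:].mean()

    def crossing_features(img):
        # Two binary features: more than one mean crossing in the top / bottom part.
        top, bottom = horizontal_crossings(img)
        return [top > 1.0, bottom > 1.0]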

Some other properties have been used that are inherent to the way segmentation was done. More details can be found in [5] or by contacting the authors.


THEORETICAL DECISION

The learning base is a 10 by 54 array where each component represents the probability that a certain character presents a certain feature. This is equivalent to 10 vectors or points in a 54-dimensional space. Each sample vector is compared to the 10 vectors or points in the decision space to find the nearest neighbor. The merit vector components are a measure of how close the sample vector is to each candidate character. This is done by calculating the normalized inner product of the sample vector with each of the 10 learning base vectors.
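A minimal sketch of this classifier, assuming the learning base is a 10 × 54 NumPy array of feature probabilities (one row per digit); function names and the exact normalization are illustrative, since the text only states that a normalized inner product is used:

    import numpy as np

    def learn_prototypes(samples, labels, n_classes=10, n_features=54):
        # Estimate, for each digit, the probability of observing each feature.
        base = np.zeros((n_classes, n_features))
        counts = np.zeros(n_classes)
        for x, y in zip(samples, labels):
            base[y] += x                         # x is a 54-component binary vector
            counts[y] += 1
        return base / np.maximum(counts, 1)[:, None]

    def theoretical_merits(sample, base):
        # Normalized inner product of the sample with each of the 10 prototypes.
        norms = np.maximum(np.linalg.norm(base, axis=1), 1e-9) * max(np.linalg.norm(sample), 1e-9)
        return base @ sample / norms

    def classify_theoretical(sample, base):
        merits = theoretical_merits(sample, base)
        return int(np.argmax(merits)), merits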

EXHAUSTIVE SEARCH

The learning base is formed by 54-bit vectors where each bit represents the existence or absence of a feature and each vector corresponds to a sample vector created during the learning process.

This method creates a merit vector where the value of the component for character i is

(1 / ns_i) · Σ_{j=0}^{d_i} w_j · n_ij

where:
• ns_i : number of samples in the learning base for character i.
• d_i : maximum Hamming distance from the sample vector to any vector in the learning base that corresponds to character i.
• w_j : weight for a Hamming distance of j.
• n_ij : number of vectors in the learning base that are at distance j from the sample vector and correspond to character i.


The weights should decrease with distance so that small distances are favored over large ones; making them increase would generate a cost vector instead of a merit vector.


If the weights decrease steeply enough that w_{j-1} > ns_i · w_j for all j, then the method effectively compares the number of distance-0 matches first and, in case of a tie, the number of distance-1 matches, and so on.
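A minimal sketch of this exhaustive-search classifier, with a weight scheme chosen to satisfy the condition above (each weight larger than ns_i times the next one); the particular weights are an assumption, not the ones used in the original system:

    import numpy as np

    def exhaustive_merits(sample, base_vectors, base_labels, n_classes=10):
        # base_vectors: stored 54-bit sample vectors; base_labels: their digits.
        n = len(base_vectors)
        # Steeply decreasing weights: w_{j-1} = (n + 1) * w_j, so near matches dominate.
        weights = float(n + 1) ** -np.arange(len(sample) + 1)
        merits = np.zeros(n_classes)
        counts = np.zeros(n_classes)
        for v, label in zip(base_vectors, base_labels):
            d = int(np.sum(np.asarray(v) != np.asarray(sample)))  # Hamming distance
            merits[label] += weights[d]                           # adds w_d once per stored vector
            counts[label] += 1
        return merits / np.maximum(counts, 1)                     # divide by ns_i as in the formula

    def classify_exhaustive(sample, base_vectors, base_labels):
        merits = exhaustive_merits(sample, base_vectors, base_labels)
        return int(np.argmax(merits)), merits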


THEORETICAL DECISION vs EXHAUSTIVE SEARCH

Theoretical decision is faster and its database is of a fixed size and usually smaller than the other method's database. These favorable conditions are achieved by compressing all the experience into ten vectors. The price to pay for this accumulative procedure is information loss and thus an increase in error rate. A 12% error rate was obtained with theoretical decision, while it dropped to 0.25% with exhaustive search. These rates were achieved with a 50 Kb database for exhaustive search and a 4 Kb database for theoretical decision. This project was developed in a UNIX environment, where disk spaces under 1 Mb are far from being considered large. Thus, the only factor that could compensate for the difference in error rate is speed. This brings us to a range of solutions:
• Classification can be a batch process, and the user interacts with the error correction subsystem after classification, consistency check and contextual error correction. This is totally feasible in the case we have worked on and is the solution that has been implemented. Thus exhaustive search is the primary classification method, and theoretical decision and some syntactical properties are taken into account in case of doubt.
• For interactive recognition, exhaustive search seems slow, so theoretical decision would be used, with or without exhaustive search in case of doubt.
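As a rough sketch of the implemented arrangement (exhaustive search as the primary classifier, with theoretical decision consulted only in case of doubt), reusing classify_exhaustive and classify_theoretical from the sketches above; the doubt margin is a hypothetical parameter, since the text does not say how doubt is quantified:

    import numpy as np

    def classify_with_doubt(sample, base_vectors, base_labels, prototypes, margin=0.05):
        # Primary decision by exhaustive search.
        best, merits = classify_exhaustive(sample, base_vectors, base_labels)
        ranked = np.argsort(merits)[::-1]
        if merits[ranked[0]] - merits[ranked[1]] >= margin:
            return best, []                                  # confident: no alternatives kept
        # Doubtful: keep alternative candidates and consult the second method too.
        theo_best, _ = classify_theoretical(sample, prototypes)
        candidates = sorted({int(ranked[0]), int(ranked[1]), int(theo_best)})
        return best, candidates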
References

[1] King-Sun Fu, "Syntactic Pattern Recognition and Applications", Prentice-Hall.
[2] Y. LeCun, B. Boser, J.S. Denker, et al., "Backpropagation applied to handwritten zip code recognition", Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.
[3] Y. LeCun, B. Boser, J.S. Denker, et al., "Handwritten digit recognition with a back-propagation network", Advances in Neural Information Processing Systems, vol. 2, pp. 396-404, 1990.
[4] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition", Neural Networks, vol. 1, pp. 119-130, 1988.
[5] F. Vazquez, "P.F.C.: Sistema de Reconocimiento Automático de Pedidos Manuscritos", E.T.S.I. Telecomunicación, Vigo, 1991.
[6] C.Y. Suen, M. Berthod and S. Mori, "Automatic Recognition of Handprinted Characters: The State of the Art", Proc. IEEE, vol. 68, no. 4, pp. 469-487, April 1980.
[7] A. Ollero, "Contribuciones al Análisis y Optimización Multicriterio de Sistemas Complejos", Tesis Doctoral, Universidad de Sevilla, 1980.
