Recognition of Printed Chinese Characters by Automatic Pattern Analysis


COMPUTER GRAPHICS AND IMAGE PROCESSING (1972) 1, (47-65)

Recognition of Printed Chinese Characters by Automatic Pattern Analysis*

WILLIAM STALLINGS

Honeywell Information Systems, Inc., Waltham, Massachusetts 02154

Communicated by T. S. Huang

Received August 13, 1971

An approach to pattern recognition by computer, using analysis of pattern structure, is explored. The approach is programmed and tested on a set of Chinese characters. The input to the program is a black-white matrix of points depicting a single character. The program produces a description of the character on two levels: (i) the internal structure of each connected part of the character, and (ii) the arrangement in two dimensions of the connected parts. The description is achieved by producing a structural representation of the character. The structure corresponding to level (i) is a graph, the edges of which correspond to parts of strokes. The structure corresponding to level (ii) is a tree, whose terminal elements correspond to connected parts of the character and whose nodes correspond to geometric relations. A method has been devised for producing a numeric code for each character. The code is generated from the structural representation of a character, and is used for recognition.

1. INTRODUCTION

An Approach to Pattern Recognition

This paper reports on a study of an approach to pattern recognition based on the description and analysis of pattern structure. The author feels that recognition should imply more than just the classification of patterns according to the features they possess, but should mean the structuring of the pattern, i.e., the determination of the relationship among the elements of the pattern.¹ With this point of view, a scheme for automatic pattern recognition has been developed which includes the following tasks:

(i) Description. A systematic scheme for the description of the pictorial structure of the patterns to be recognized is developed.
(ii) Analysis. An algorithm is designed which analyzes the structure of the patterns, producing a representation of the structure conforming to the descriptive scheme.
(iii) Encoding. From the structural representation of a pattern, a code is generated which uniquely identifies the pattern.

This method has been applied to the recognition of Chinese characters. A program has been written which analyzes Chinese characters; the program produces a data structure which describes a character in terms of basic picture elements and the relationship among them. A procedure has been developed for generating a numeric code from the structural representation. Recognition is achieved by building up a dictionary matching characters with their codes; the code for any new instance of a character can then be looked up in the dictionary.

* This paper is based on a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical Engineering at the Massachusetts Institute of Technology, 1971.
¹ This point of view is propounded by Sayre [11] and Grenander [4].

Copyright © 1972 by Academic Press, Inc.

Chinese Characters

Chinese is a pictorial and symbolic language which differs markedly from written Western languages. The characters are of uniform dimension; they are generally square; they are not alphabetic but are composed of strokes. Chinese characters possess a great deal of structure and hence are well suited to the method of recognition outlined above. Many regularities of stroke configuration occur. Quite frequently, a character is simply a two-dimensional arrangement of two or more simpler characters. Nevertheless, the system is rich; strokes and collections of strokes are combined in many different ways to produce thousands of different character patterns.

The author feels that the method of pattern recognition by analysis is suited for application to a class of patterns which display a rich structure developed from a small number of simple basic elements. Hence Chinese characters were chosen. In addition, the development of a successful Chinese character recognition device is desirable in itself.² It is hoped that the method presented here can be the basis of a feasible Chinese character recognition device, which would increase the access of the West to the vast quantity of Chinese writing.

² The author is aware of one previous investigation, by Casey and Nagy [1]. The method used was template matching.

2. THE STRUCTURE OF CHINESE CHARACTERS

Chinese characters consist of strokes, which are drawn roughly along a straight line.³ Nearly all strokes appear as horizontal, vertical, or in a direction along one of the main diagonals. Strokes are combined to form connected units hereafter referred to as components. Each character consists of a two-dimensional arrangement of one or more disjoint components. Figure 1 shows a character having three components. The structure of a Chinese character may therefore be specified on two levels:

(i) a description of the internal structure of each component, and
(ii) a description of the arrangement of components in two dimensions.

³ This is not quite correct. A native Chinese would draw [glyph] with a single stroke, not lifting his pen. For the sake of this discussion, however, it is simpler to say that [glyph] is composed of two "strokes", namely [glyph] and [glyph].

FIG. 1. Character with Three Components.

Components

Two questions are involved in the decision of how to describe the internal structure of a component:

(i) What class of objects shall be considered as the basic picture element?
(ii) What sort of structure shall be used to indicate the relationship between elements?

Three criteria were used in answering these questions:

(i) The structure mentioned in question (ii) should be relatively easy to generate from the original pattern.
(ii) It should be relatively easy to generate a unique numeric code from the structure.
(iii) The structure should represent the pattern in a natural manner.

A quite natural method of representing the internal structure of a component would be in terms of strokes. This indeed is the approach taken by several previous recognition schemes [5,7]. These schemes make use of on-line input, in which strokes are drawn one at a time. The difficulty with taking this approach for printed characters is that strokes do overlap and are not easily isolated. Further, the description of the relationship between strokes is not straightforward.⁴

A much more promising approach is to describe components in terms of stroke segments. This can best be understood with reference to Fig. 2. As can be seen, a component can be depicted as a graph. The branches of the graph correspond to segments of strokes. These segments are bounded by stroke intersections and ends of strokes.⁵ It will be shown in later sections that this representation satisfies criteria (i) and (ii). That it satisfies criterion (iii) is fairly clear. To the human observer, the graph of a component is readily apparent.

FIG. 2. Component (a) and Graph (b).
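As an illustration only, the graph of a component might be held in a structure such as the following Python sketch. It is not the program described in this paper; the names `Node`, `link`, and `branch_count`, and the coordinates of the T-shaped example, are hypothetical. Each branch is stored as a pair of mutual references, in the spirit of the paired pointers described later for BUILD.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A stroke intersection or stroke end (hypothetical structure)."""
    x: float
    y: float
    # References to adjacent nodes; each reference is one end of a branch.
    neighbors: list = field(default_factory=list, repr=False)


def link(a, b):
    """Record one branch (stroke segment) as a pair of mutual references."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def branch_count(nodes):
    """Each branch is stored at both of its ends, so halve the total."""
    return sum(len(n.neighbors) for n in nodes) // 2


# Illustrative T-shaped component: a horizontal stroke with a vertical
# stroke hanging from its midpoint. The intersection splits the horizontal
# stroke into two segments, giving four nodes and three branches.
left, right, foot = Node(0, 0), Node(10, 0), Node(5, 8)
junction = Node(5, 0)
for end in (left, right, foot):
    link(junction, end)
t_graph = [left, right, foot, junction]
```

The example shows why segments rather than whole strokes are the natural unit: the two-stroke T yields three branches, each bounded by an intersection or a stroke end.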

⁴ For a discussion of a system for the description of Chinese characters in terms of strokes, see Fujimura and Kagaya [3]. The authors are primarily interested in computer generation of Chinese characters.
⁵ The numbers on the nodes and branches are for the sake of discussion later in the text.

Characters

The arrangement of components in two dimensions to form characters can be described using the concept of frame. Each character is viewed as occupying a hypothetical square. The segmentation of a character into components segments its square accordingly. The square, or frame, may be segmented in one of three ways: (a) East-West, (b) North-South, (c) Border-Interior. Each of these segmentations corresponds to a two-component character. For example, [glyph] would be represented by (a), which decomposes the character into [glyph] and [glyph]. [glyph] would be represented by (b). Finally, either partial or complete enclosure, such as [glyph] and [glyph], would be represented by (c).

Frames for characters composed of more than two components are obtained by embedding (a), (b), or (c) in one of the subframes of (a), (b), or (c). The process of embedding is recursive, in that any subframe of a derived frame may be used for further embedding. For example, the four-component character of Fig. 3 can be described by the frame arrangement of Fig. 4a. The frame description can be conveniently represented by a tree, as indicated in Fig. 4b.

This description of the arrangement of components is based on the work of Rankin [10], who introduced the concept of frame-embedding. The definition of component used here is slightly different from that of Rankin. Despite this, Rankin's claim that the three relations used in his scheme are sufficient to describe accurately nearly all characters seems to apply.

3. INPUT

The program operates on a representation of one character at a time. The representation is in the form of a matrix whose entries have value 0 or 1, corresponding to white or black in the original picture. The matrix is obtained by means of a flying-spot scanner. The printed characters used were taken from a number of different sources; the characters were all of roughly the same style but varied considerably in size.

Certain functions of the program depend on the fact that there are no gaps or holes in any of the strokes. This is not always the case, owing to the quality of the printed input. Accordingly, a smoothing operation is performed to fill in the gaps. The resulting matrix is used as the data base for the program.

The digitized form of a character can be displayed on a CRT. Figures 1, 2a, 3 and 11 are photographs of such displays.

FIG. 3. Chinese Character.

FIG. 4. Frame Description (a) and Tree Representation (b).

4. ANALYSIS OF COMPONENTS

A program has been written to perform the analysis of components. For a given component, the output of the program is a graph in which branches correspond to stroke segments and nodes correspond to the endpoints of stroke segments. To construct the graph of a component, one principal procedure, BUILD, is used. In addition, use is made of some auxiliary routines. It will be helpful to describe these first.

Contour Tracing

Contour tracing is the process of finding a series of black points on the boundary of a black region in a white field. Two routines are used: one which keeps the black region on the left as the tracing proceeds, and one which keeps the black region on the right. To keep the black region on the left, the tracing proceeds from point to point, turning right after encountering a black point and left after encountering a white point. An additional rule is used to increase the speed of the algorithm: if three points of the same color are encountered in succession, the next point is assumed to be of the opposite color. Thus, two steps may be taken at once. The operation of the algorithm is depicted in Fig. 5. The last step shown is diagonal, indicating the effect of the 2-move rule. The algorithm for keeping the black region on the right is similar. Both algorithms were developed by Prerau [9].

FIG. 5. Contour Tracing.

Search

The task of the SEARCH routine is to find some stroke segment to be used as a starting point. It is unimportant which particular segment of a component is found. The output of the SEARCH routine is the coordinates of the endpoints of a strip of black points straddling a stroke segment.

SEARCH proceeds by scanning alternately from left to right and from top to bottom along various rows and columns of the pattern. This continues until a series or strip of black points is encountered. If the strip is too long (more than [fraction illegible] of the width of the pattern), it is assumed that the strip is lying along the length of a stroke. This is rejected and the scanning continues. Similarly, if the strip is too small (1 or 2 points), it is rejected as being a speck. Otherwise, it is assumed that the strip is straddling a stroke segment and the endpoints of the strip are returned. Figure 6 shows examples of all possible outcomes of scanning a single row.
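The four outcomes of scanning a single row can be sketched as follows. This is a simplified Python illustration, not the SEARCH routine itself: it classifies only the first run of black points in a row, the function name `classify_row` is hypothetical, and the threshold `max_fraction=0.25` is an assumed placeholder rather than a value taken from the paper.

```python
def classify_row(row, max_fraction=0.25, speck_max=2):
    """Classify the first run of black points (1s) in one row of the matrix.

    Returns (outcome, endpoints). Outcome is "none", "speck", "along" or
    "straddling"; endpoints are the inclusive column indices of the strip
    and are given only for a straddling strip.
    max_fraction is an assumed threshold, not a value from the paper.
    """
    width = len(row)
    start = None
    # A trailing white point is appended so a run touching the edge closes.
    for i, value in enumerate(list(row) + [0]):
        if value and start is None:
            start = i                      # a black run begins here
        elif not value and start is not None:
            length = i - start             # the run covers start..i-1
            if length <= speck_max:
                return "speck", None       # 1 or 2 points: rejected
            if length > max_fraction * width:
                return "along", None       # lies along a stroke: rejected
            return "straddling", (start, i - 1)
    return "none", None


blank = [0] * 20
speck = [0] * 5 + [1] + [0] * 14
along = [0] * 2 + [1] * 15 + [0] * 3
across = [0] * 8 + [1] * 4 + [0] * 8
outcomes = [classify_row(r)[0] for r in (blank, speck, along, across)]
```

In a fuller version the two rejected outcomes would cause the scan to continue to further runs, rows, and columns, as the text describes.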

FIG. 6. Outcomes of Scan by SEARCH Algorithm. (a) No black points found. (b) Speck found. (c) Line along stroke found. (d) Line straddling stroke found.

Crawl

The CRAWL routine is used for "crawling along" a stroke segment. The routine proceeds along a stroke segment in a given direction, halting when a node is encountered, i.e., when an intersection or the tip of a stroke is reached. The input to CRAWL is (i) a location on a segment, in the form of the two endpoints of a horizontal or vertical strip of points straddling the segment, and (ii) one of four directions (left, right, up, down) in which the crawl is to proceed. The output is the location on the segment where the crawl halted, again in the form of two endpoints of a strip.

The crawl is accomplished by moving from each of the input points along the contour of the segment. Tracing from the left-hand input point (with respect to the direction of the crawl) is done keeping the black region on the right, and conversely for the right-hand input point. The crawling proceeds by advancing both "tracers" one unit in the specified direction at a time. This is depicted in Fig. 7. For each move from one line to the next, each tracer goes through one or more contour points.

FIG. 7. Crawling along a Stroke. Circled points indicate position of tracers at end of each move. Both tracers are always on the same line. Lines are numbered for sake of discussion. See text.

Figure 8 shows the four conditions under which a crawl will be halted. All four cases correspond to a node being encountered:

(i) If the two tracers, instead of advancing, meet each other, then the tip of a stroke has been encountered.
(ii) If the two tracers do advance, but not all of the points between them are black, then a fork has been encountered.
(iii) If the new strip of black points on which the two tracers sit is significantly longer than the previous strip, then an intersection has been encountered.
(iv) If one of the two tracers reverses direction, then again a fork has been encountered, but this time by coming up one of the two arms rather than the main road.

FIG. 8. Conditions for halting CRAWL procedure. (a) Tip. (b) Fork. (c) Intersection. (d) Turn-around. Circled points indicate location of two tracers just before crawl is halted. Direction of crawl same as in Figure 7.

Although only horizontal and vertical directions of crawl are specified, the routine works on diagonally-oriented segments. Notice that in Fig. 7 both tracers move diagonally from line 1 to line 2. This could continue along the entire length of a diagonal segment.
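The four halting conditions can be expressed as a single test applied after each move. The Python sketch below is illustrative only: the function name `halting_condition`, the two boolean flags (assumed to be reported by the contour tracers), and the factor `growth=1.5` standing in for "significantly longer" are all assumptions, not details given in the paper.

```python
def halting_condition(prev_strip, new_strip, line,
                      tracers_met=False, tracer_reversed=False, growth=1.5):
    """Return the reason a crawl should halt, or None to keep crawling.

    prev_strip, new_strip: inclusive (lo, hi) positions of the two tracers
        on the previous line and on the current line.
    line: the 0/1 values of the current line of the matrix.
    tracers_met, tracer_reversed: events assumed to be reported by the
        contour tracers during the move.
    growth: assumed factor for "significantly longer"; not from the paper.
    """
    if tracers_met:
        return "tip"                       # condition (i)
    if tracer_reversed:
        return "turn-around"               # condition (iv)
    lo, hi = new_strip
    if not all(line[lo:hi + 1]):
        return "fork"                      # condition (ii): a white gap
    prev_length = prev_strip[1] - prev_strip[0] + 1
    if (hi - lo + 1) > growth * prev_length:
        return "intersection"              # condition (iii)
    return None


same = halting_condition((2, 4), (2, 4), [0, 0, 1, 1, 1, 0, 0])
fork = halting_condition((2, 4), (1, 5), [0, 1, 1, 0, 1, 1, 0])
cross = halting_condition((2, 4), (0, 6), [1, 1, 1, 1, 1, 1, 1])
```

The tracer events are checked first because, once the tracers meet or one of them reverses, the strip between them is no longer a meaningful cross-section of the segment.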

Node

After CRAWL has encountered a node, NODE is called to investigate it. The input to NODE is the output of CRAWL: the endpoints of a strip of points which marks the termination of a segment at an intersection. The task of the NODE routine is to find all other stroke segments radiating from this intersection. For each segment found, NODE returns the endpoints of a strip straddling that segment at the intersection. Also, the direction of the segment away from the node is indicated.

The operation of NODE is shown in Fig. 9. The routine starts at one of the input points and proceeds along contour points around the intersection. This continues until a contour point is found which is the endpoint of a horizontal or vertical strip straddling a segment (i.e., the endpoint of a small black strip). This strip and the direction perpendicular to it away from the node are noted. The routine then continues from the other endpoint of the strip. This process of going a few contour points, finding a segment, crossing it, going a few contour points, etc., continues until the other input point is encountered.

In addition to locating the segments leading from a node, the routine assigns a position to the node. This is done by averaging the X and Y coordinates (with respect to an origin in the upper left-hand corner of the matrix) of the endpoints of all the strips found, including the input points.
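The assignment of a position to a node reduces to a plain average of strip endpoints. A minimal Python sketch follows; the function name `node_position` and the sample coordinates are hypothetical.

```python
def node_position(strips):
    """Average the endpoint coordinates of all strips found at a node.

    strips: list of ((x1, y1), (x2, y2)) endpoint pairs, including the
    input strip, measured from the upper left-hand corner of the matrix.
    """
    points = [point for strip in strips for point in strip]
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    return mean_x, mean_y


# Hypothetical four-way intersection: two horizontal strips (segments
# leaving upward and downward) and two vertical strips (left and right).
strips = [
    ((4, 10), (8, 10)),
    ((4, 5), (8, 5)),
    ((9, 6), (9, 9)),
    ((3, 6), (3, 9)),
]
position = node_position(strips)
```

Because every strip contributes both of its endpoints, the estimate falls near the middle of the intersection rather than on any one of its edges.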

FIG. 9. The NODE Algorithm.

Build

The construction of a graph can now be described. As a graph is a collection of interconnected nodes, it is represented in the computer as a collection of interconnected blocks of data. For each node in a graph, a block of contiguous memory words is allocated. The length of a block depends on how many branches there are at the corresponding node. If two nodes are adjacent in a graph, their data blocks will contain pointers to each other. Each of these pairs of pointers represents a branch.

To begin construction of a graph for a particular component, SEARCH is called to find some initial stroke segment. SEARCH returns a position somewhere along the length of a segment. From this position, CRAWL is used to crawl along the segment in both directions to its two endpoints. Thus two initial nodes are found. NODE is called once for each endpoint to determine the segments leading from them. Storage blocks are allocated for each node. Pointers are placed in each block linking the two together.

From this start, the graph is completed using BUILD. BUILD is called once for each segment leading from each of the two initial nodes. The arguments to BUILD are (i) a pointer to a block of data corresponding to a node (the input node), and (ii) the starting point of some segment (the input segment) leading from the input node. BUILD performs the following operations:

1. The input segment is crawled along to reach its endpoint, using CRAWL.
2. NODE is called to examine this endpoint, or node. The coordinates of the node and the segments leading from it are determined.
3. a. The coordinates of this node are compared to those of all previously encountered nodes (those for which data blocks already exist). If a match is found, then a pointer to the existing block for this node is placed in the block of the input node, and the routine stops.
   b. If the encountered node is new, then a block is allocated for it, and it is linked back to the block of the input node. BUILD is then called once for each segment leading from the new node. Then the routine stops.

It can be seen that BUILD is a recursive routine. BUILD is described more formally in Fig. 10.

As an example, the analysis of the component of Fig. 2 will be described. The two nodes initially found are marked 1 and 2. The branch between them corresponds to the initial segment found by SEARCH. Blocks of data are allocated for 1 and 2. Then, all the segments leading from 1 are examined, clockwise, by BUILD. Crawling along the first segment, node 3 is found. This is linked back to 1. The segment leading from 3 is examined next, finding node 4. The procedure unwinds back to node 1 and examines its next segment. As a result, 5 and 6 are found. From 6, node 2 is encountered. Node 6 is linked to node 2 and the procedure again returns to node 1, which is seen to be completed. Next BUILD is applied to node 2, which finds first 6 and then 7. At this point, 2 is complete and the analysis terminates.

5. ANALYSIS OF CHARACTERS

The algorithm for analyzing a character is in two parts:

1. A collection of graphs is produced, one for each component.
2. The relationship between components is determined.

    procedure build (block, stroke) ;
    begin
        node := find node at end of stroke ;
        n := number of other branches at node ;
        branch := n-vector of other branches at node ;
        if node = oldblock* then
            place pointer to oldblock in block
        else
        begin
            newblock := create block of length n+5 ;
            place pointer to newblock in block ;
            place pointer to block in newblock ;
            place number, x, y in newblock ;
            for i := 1 step 1 until n do
                build (newblock, branch(i))
        end
    end

* i.e., node is compared to all nodes previously encountered. The value is true if node is the same as another node represented by the data block "oldblock".

FIG. 10. BUILD Procedure.

Finding All Components

The first part of the algorithm involves a few modifications to the program discussed in the previous section. The objective is to keep track of which components in a pattern have already been analyzed. To do this, the following procedure is employed. As a component is being analyzed, its outline is drawn on a separate pattern. That is, the contour points of a component are filled in on a new pattern as they are encountered. The new pattern contains, at any time, the outline of all components of a character which have been processed. The SEARCH routine is modified to test the endpoints of any strip of black points it finds against the new pattern. If the corresponding points are black in the new pattern, then the strip is rejected and SEARCH continues to scan. If no new strip is found after scanning a sufficiently large number of rows and columns, it is assumed that no new components remain to be found.

After each component is analyzed, SEARCH is called to locate a stroke segment on a new component. The process of analyzing components continues until no new components can be found. The result is to produce a collection of connected graphs. Figure 11 shows the result of applying the algorithm to the character of Fig. 3.

FIG. 11. Outline of a Character. (a) Step one; (b) step two; (c) step three; (d) step four.
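The bookkeeping that prevents a component from being analyzed twice can be sketched in a few lines of Python. The names `new_pattern`, `mark_outline`, and `strip_is_new` are hypothetical, and the sketch assumes that a strip is rejected when both of its endpoints are already black in the outline pattern.

```python
def new_pattern(height, width):
    """An all-white pattern of the same size as the character matrix."""
    return [[0] * width for _ in range(height)]


def mark_outline(done, contour_points):
    """Fill in contour points (x, y) as they are met during analysis."""
    for x, y in contour_points:
        done[y][x] = 1


def strip_is_new(done, strip):
    """Accept a strip unless both endpoints lie on a recorded outline."""
    (x1, y1), (x2, y2) = strip
    return not (done[y1][x1] and done[y2][x2])


done = new_pattern(5, 8)
first_strip = ((2, 1), (4, 1))
accepted_before = strip_is_new(done, first_strip)

# Analyzing the component traces its contour, which includes the two
# endpoints of the strip that SEARCH originally returned.
mark_outline(done, [(2, 1), (4, 1), (2, 2), (4, 2)])
accepted_after = strip_is_new(done, first_strip)
other_component = strip_is_new(done, ((6, 1), (7, 1)))
```

Since the endpoints of a straddling strip are themselves contour points, any later strip across an already-analyzed component is found on the outline and rejected.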


Constructing the Frame

Representation of the frame description of a character is done conveniently by means of a tree. The root node of the tree has as its value one of the three relations indicating how the overall frame is broken into two subframes. The two sons represent the structure of the two subframes. Terminal elements correspond to components (see Fig. 4). The method of obtaining such a tree will be briefly described.

First, each component in the character is inscribed in a rectangle. This is easy to do since the coordinates of each node are known. The relationship between all possible pairs of components is determined by determining the relationship between their rectangles. The one of the three permitted relationships (East-West, North-South, Border-Interior) which most nearly approximates the true relationship is chosen. Then it is determined if one of the components has the same relation to all other components. This will usually be the case. If so, that component becomes one son of the root node of the tree; the value of the node is the appropriate relation; the other son is a tree representation developed for the remaining components. This subtree is determined in the same way. If no single component is found, a more complicated procedure is used to determine if any two components have the same relation to all others, and so on. A procedure for constructing the tree representation of a frame description is described formally in Fig. 12.
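The single-component case of this method can be illustrated in Python. The sketch below is not the program used in the study. It assumes rectangles given as (left, top, right, bottom) with y increasing downward, and it assumes that "most nearly approximates" can be decided by containment first and otherwise by the larger displacement between rectangle centers. The names `relation` and `frame` and the sample rectangles are hypothetical, and the more complicated procedure for groups of components is omitted.

```python
def relation(a, b):
    """Classify two rectangles as "EW", "NS" or "BI".

    Returns (name, first, second): the western rectangle comes first for
    "EW", the northern one for "NS", and the enclosing one for "BI".
    """
    def contains(p, q):
        return p[0] <= q[0] and p[1] <= q[1] and p[2] >= q[2] and p[3] >= q[3]

    if contains(a, b):
        return "BI", a, b
    if contains(b, a):
        return "BI", b, a
    # Displacement between centers (doubled, to stay in integers);
    # the larger axis decides the relation (an assumed criterion).
    dx = (b[0] + b[2]) - (a[0] + a[2])
    dy = (b[1] + b[3]) - (a[1] + a[3])
    if abs(dx) >= abs(dy):
        return ("EW", a, b) if dx >= 0 else ("EW", b, a)
    return ("NS", a, b) if dy >= 0 else ("NS", b, a)


def frame(rects):
    """Build a nested (left son, relation, right son) triple."""
    if len(rects) == 1:
        return rects[0]
    for i, r in enumerate(rects):
        others = rects[:i] + rects[i + 1:]
        kinds = set()
        for other in others:
            name, first, _ = relation(r, other)
            kinds.add((name, first is r))
        if len(kinds) == 1:                # same relation to all others
            name, r_is_first = kinds.pop()
            rest = frame(others)
            return (r, name, rest) if r_is_first else (rest, name, r)
    raise ValueError("no single component relates uniformly to the rest")


west = (0, 0, 3, 10)
top = (5, 0, 10, 4)
bottom = (5, 6, 10, 10)
tree = frame([west, top, bottom])
```

For these three rectangles the western component stands in the East-West relation to both others, so it becomes one son of the root and the remaining two are resolved recursively as a North-South pair.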
    procedure frame (list, tree) ;
    begin
        list1 := first group of components ;
        list2 := second group of components ;
        node := relation between two groups ;
        if list1 is a list then frame (list1, tree1) else tree1 := list1 ;
        if list2 is a list then frame (list2, tree2) else tree2 := list2 ;
        tree := tree1, node, tree2
    end

Notes:
1. The input to frame is the argument list, which is a list of combinations of two or more components taken two at a time.
2. The output of frame is the argument tree, which is a triple corresponding to the left son, node, and right son of a tree.
3. list1 and list2 represent disjoint groups of components such that the two groups have one of the three allowed relations between them. If either group contains only one component, the corresponding variable (list1 or list2) is simply an identifier of that component and not a list.

FIG. 12. FRAME Procedure.

6. ENCODING OF COMPONENTS

For recognition purposes, a procedure has been developed for generating a numeric code for each character. The first step in this procedure is the generation of a code for each component in a character.


The code for a component is generated from its graph. To this end, the branches of a graph are labeled at each end. The label on a branch at a node indicates the direction or slope of that branch quantized into eight directions. All the branch labels at a node are stored in the data block of that node. An algorithm can then be specified for starting at a particular node of a graph and traversing all of its branches. The sequence of branch numbers encountered is the code produced. An example appears in Fig. 13. The algorithm obeys the following rules:

1. Start at the node in the upper left-hand corner of the graph. Exit by the branch with the lowest-valued label. Mark the exiting branch to indicate its having been taken, and write down the branch label.
2. Upon entering a node, check to see if it is being visited for the first time. If so, mark the entering branch to indicate this.
3. Upon leaving a node, if there are available unused directions other than along the first entering branch, choose the one among these with the lowest-valued label. Leave by the first entering branch only as a last resort. Mark the exiting branch to indicate its having been taken and write down the label on the branch.

Since at each node there are just as many exiting branches as entering branches, the procedure can only halt at the starting node. At the starting node, all exiting branches have been used (otherwise the procedure could have been continued), hence all entering branches have been used since there are just as many of these. The same reasoning can be applied to the second node that is visited. The first entering branch is from the starting node and this branch has been covered both ways. But this branch would only have been used for exit from the second node if all other exits had been exhausted. Therefore all branches at the second node have been covered both ways. In this manner, we find that the branches of all nodes visited have been traversed both ways. Since the graph is connected, this means that the whole graph has been covered.

FIG. 13. Encoding a Graph. (Code generated for the graph shown: 00246206734426.)

All branches are traversed exactly once in each direction by this procedure, so all labels are picked up. The code consists of the branch labels in the graph written down in the order in which they are encountered. This algorithm is based on a procedure for traversing graphs described in Ore [8].

While this scheme will always generate the same code for a given component, the goal of generating a unique code for each component is not achieved. For example, 土 and 士 are represented by the same graph, hence the same code. Fortunately, this type of situation is rare. Characters with this property could be treated as special cases without seriously impairing the efficiency of the algorithm.

7. ENCODING OF CHARACTERS
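The character code built in this section is assembled from the component codes of Section 6, so it may help to see the three traversal rules of that section in executable form first. The following Python sketch is an illustration under stated assumptions, not the original FORTRAN program: the label convention (0 = east, counted counter-clockwise in 45-degree steps), the choice of the "upper left-hand" node as the one with the smallest x + y, the restriction to simple graphs (at most one branch between two nodes), and the names `direction` and `encode` are all hypothetical.

```python
import math

def direction(p, q):
    """Quantize the direction from point p to point q into eight labels.

    Assumed convention (not given in the paper): 0 = east, counted
    counter-clockwise in 45-degree steps, with y growing downward.
    """
    angle = math.atan2(p[1] - q[1], q[0] - p[0])  # flip y so north is positive
    return round(angle / (math.pi / 4)) % 8

def encode(nodes, edges):
    """Return the branch labels of a connected simple graph in traversal order.

    nodes: {name: (x, y)}; edges: iterable of (u, v) name pairs.
    """
    exits = {n: [] for n in nodes}  # per node: (label, neighbour) pairs
    for u, v in edges:
        exits[u].append((direction(nodes[u], nodes[v]), v))
        exits[v].append((direction(nodes[v], nodes[u]), u))
    for n in exits:
        exits[n].sort()  # lowest-valued label first

    # Rule 1: start at the upper left-hand node (assumed: smallest x + y).
    start = min(nodes, key=lambda n: (nodes[n][0] + nodes[n][1], nodes[n][1]))
    used = set()                 # branches already taken, as (from, to)
    first_entry = {start: None}  # node -> node it was first entered from
    code, here = [], start
    while True:
        free = [e for e in exits[here] if (here, e[1]) not in used]
        if not free:
            break  # can only happen back at the starting node
        # Rule 3: leave by the first entering branch only as a last resort.
        others = [e for e in free if e[1] != first_entry[here]]
        label, nxt = (others or free)[0]
        used.add((here, nxt))
        code.append(str(label))
        first_entry.setdefault(nxt, here)  # Rule 2: mark the first entering branch
        here = nxt
    return "".join(code)

# An L-shaped component: A at top left, B at top right, C below B.
print(encode({"A": (0, 0), "B": (2, 0), "C": (2, 2)},
             [("A", "B"), ("B", "C")]))  # 0624
```

Each branch contributes two labels, one per direction of travel, so the code for a graph with k branches has 2k digits.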

The representation of a character is in the form of a tree. The nodes of the tree are binary relations; the terminal elements correspond to components. Considering the relations as binary operators, the tree can easily be flattened to prefix form. This is done by walking around the tree counter-clockwise, starting from the root node, and picking up nodes and terminals the first time they are encountered. As is well-known, the string generated in such a fashion is unique; the tree can readily be reconstructed from it. To generate a numeric string, the following code can be used:

    0    terminals (components)
    1    left node
    2    above node
    3    surround node

Figure 14 shows the generation of code from the tree of Fig. 4. We can consider that the code so generated defines a class of Chinese characters all of which have the same frame description. Therefore, a Chinese character may be specified by first giving its frame description code and then giving the code for each of the components that fits into one of the subframes. A character having $n$ components will have a code consisting of the concatenation of $n + 1$ numbers:

$$N_0, N_1, \ldots, N_n$$

where $N_0$ is the code generated from the tree and $N_1$ through $N_n$ are the codes of the components listed according to the order in which the components were encountered in the tree flattening.
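The flattening and the assembly of the full character code can be sketched in a few lines of Python. This is a minimal illustration, assuming a tree stored as nested `(left, node, right)` triples with component names as strings; the names `flatten`, `unflatten`, `character_code`, and the node labels `EW`, `NS`, `BI` are hypothetical. The example tree is simply one whose prefix code is 1012000, the digit string shown with Fig. 14.

```python
# Digits for the three relations; terminals (components) are written as 0.
NODE_DIGIT = {"EW": "1", "NS": "2", "BI": "3"}

def flatten(tree):
    """Return (prefix code, component names in the order encountered)."""
    if not isinstance(tree, tuple):  # a terminal element
        return "0", [tree]
    left, node, right = tree
    left_code, left_names = flatten(left)
    right_code, right_names = flatten(right)
    return NODE_DIGIT[node] + left_code + right_code, left_names + right_names

def unflatten(code, names):
    """Rebuild the tree from its prefix code and ordered component names."""
    digits, names = iter(code), iter(names)
    node_name = {digit: name for name, digit in NODE_DIGIT.items()}

    def build():
        digit = next(digits)
        if digit == "0":
            return next(names)
        left = build()
        right = build()
        return (left, node_name[digit], right)

    return build()

def character_code(tree, component_codes):
    """Return [N0, N1, ..., Nn] for a character with n components."""
    frame_code, order = flatten(tree)
    return [frame_code] + [component_codes[name] for name in order]

tree = ("a", "EW", (("b", "NS", "c"), "EW", "d"))
print(flatten(tree)[0])  # 1012000
```

Because every relation is binary, the prefix string determines the tree shape without any brackets, which is why `unflatten` needs only the digits and the ordered component list.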


FIG. 14. Flattening a Tree. (Code generated from the tree: 1012000.)

8. RESULTS

The algorithms discussed in this paper have been implemented as a computer program. The program is written in FORTRAN augmented by a package of assembly language routines to permit structured data and recursive procedures. The program runs on a PDP-9 computer.

The program has been tested with a number of characters from several different sources. The tests were designed to consider four questions:

1. How successful is the program in analyzing the structure of Chinese characters?
2. Does the program generate consistent codes for characters of the same font? That is, will two instances of the same character from the same source yield the same code?
3. Does the program work for characters from different sources?
4. Do factors such as character size and character complexity affect program performance?

Initial results were obtained from a set of characters taken from a Taiwan printer. A sample of this set appears in Fig. 15. To start, 225 different characters were processed. This was to provide a dictionary for later tests, and to test the pattern analysis capabilities of the program. The results show that a reasonable structural representation was produced for about 94% of the characters. The failures were all due to a particular component not being analyzed; for all characters the relationship among components was correctly determined. The problems all occurred in the NODE routine, which is supposed to isolate a node and locate all segments leading from it. The NODE routine would sometimes make mistakes if, for example, two nodes were very close together or one node covered a large area. The characters involved were typically quite complex.

FIG. 15. Example of Character Set.

From the characters that were successfully analyzed, 25 were chosen for additional testing. Four additional instances of each character from the same source were processed, for a total of 100 new characters. All new instances of the 25 characters produced reasonable structural representations. For 5 of the characters, one of the new instances produced a slightly different representation, hence a different code. No character generated more than two codes. In all cases, the discrepancy was caused by the fact that two strokes which were very close in one instance touched in another instance of the same character.

Additional testing was done using two other sources. Characters from issues of a Chinese magazine were used. These were approximately half the size of the characters in the original set. Also, some computer-generated characters [6] were used. These were about double the size of the originals. Both were of about the same style. Fifty instances were taken from each source. The percentage of instances generating the same code as the corresponding character from the original set was 89% for the magazine source and 95% for the computer source. Discrepancies mostly had to do with stroke segments appearing at somewhat different angles and with strokes touching in one case but not the other.

9. CONCLUSIONS

Pattern Analysis

A descriptive scheme for the structure of Chinese characters has been proposed and a program for computer analysis conforming to the scheme has been written. The description is on two levels: the internal structure of components, and the relationship among components.

The first level of description is straightforward: a connected part of a character is represented by a graph. This representation is adequate for the description of components; it is reasonable for the human percipient to think of components as graphs. Analysis on this level works fairly well; difficulty is encountered with some complex characters. Some work has been done on modifying the described approach. The modification consists of "shrinking" a component to a skeleton and obtaining the graph from the skeleton. This procedure is sensitive to contour noise, and it seems that use of this method would result in many components generating several different graphs from different instances.

The second level of description is based on the work of Rankin. With the exception of a very few characters whose components do not fit neatly into the frame description, it is an effective means of describing the structure of Chinese characters in terms of components. The analysis program for this level has been successful for all characters tested.

Character Recognition

Chinese character recognition is made difficult by the size of the character set and the complexity of the individual characters. Test results indicate that use of the approach described here would necessitate a dictionary in which some characters are associated with several codes. Several possibilities exist which could improve the chances of constructing a practical character recognition device:

1. High standards of print quality. A device restricted to use only with very high quality print should be more consistent in code generation, thus reducing the size of the required dictionary.
2. Stylized font. A specially-designed font tailored to the recognition algorithm would improve the algorithm's performance.
3. Language simplification. A particularly hopeful development in this regard is the Communist program to reduce the number of characters in general use and the complexity of individual characters [2].


The results reported here lead the author to believe that pattern analysis can be a fruitful approach to Chinese character recognition.

ACKNOWLEDGMENTS

The author would like to thank Professor Francis Lee of M.I.T., whose guidance and advice were invaluable to this project. The author is also grateful to Professor Thomas S. Huang of M.I.T. and Professor Herbert Teager of Boston University for their help.

REFERENCES

1. R. CASEY AND G. NAGY, Recognition of printed Chinese characters, IEEE Trans. Electronic Computers, 1966, Vol. EC-15, 91-101.
2. Y. CHU, A comparative study of language reforms in China and Japan, Skidmore College Faculty Research Lecture, Skidmore College Bulletin, Saratoga Springs, N. Y., 1969.
3. FUJIMURA AND KAGAYA, Structural Patterns of Chinese Characters, Research Institute of Logopedics and Phoniatrics, University of Tokyo, Annual Bulletin No. 3, April 1968-July 1969, pp. 131-148.
4. U. GRENANDER, A unified approach to pattern analysis, Advances in Computers, Academic Press, 1970, Vol. 10, pp. 175-216.
5. GRONER, HEAFNER, AND ROBINSON, On-line computer classification of handprinted Chinese characters as a translation aid, IEEE Trans. Electronic Computers, 1967, Vol. EC-16, 856-860.
6. A. V. HERSHEY, Calligraphy for Computers, U. S. Naval Weapons Lab., Dahlgren, Virginia, AD 662 398, 1967.
7. J. H. LIU, Real Time Chinese Handwriting Recognition Machine, Thesis, M.I.T., Cambridge, Mass., 1966.
8. O. ORE, Theory of Graphs, American Mathematical Society, Providence, R. I., 1962.
9. D. S. PRERAU, Computer Pattern Recognition of Standard Engraved Music Notation, Ph. D. Thesis, M.I.T., Cambridge, Mass., 1970.
10. RANKIN AND TAN, Component combination and frame-embedding in Chinese character grammars, NBS Tech. Note 492, National Bureau of Standards, Washington, D.C., 1970.
11. K. M. SAYRE, Recognition: a study in the philosophy of artificial intelligence, University of Notre Dame Press, Notre Dame, Indiana, 1965.
12. W. W. STALLINGS, Computer Analysis of Printed Chinese Characters, Ph. D. thesis, M.I.T., Cambridge, Mass., 1971.