Pattern Recognition, Vol. 27, No. 1, pp. 41-52, 1994
Elsevier Science Ltd. Copyright © 1994 Pattern Recognition Society
Printed in Great Britain. All rights reserved. 0031-3203/94 $6.00+.00
EXTERNAL WORD SEGMENTATION OF OFF-LINE HANDWRITTEN TEXT LINES

GIOVANNI SENI† and EDWARD COHEN‡

† Center of Excellence for Document Analysis and Recognition (CEDAR), 226 Bell Hall, Buffalo, NY 14260, U.S.A.
‡ Tritek Corporation, 125 Sandy Drive, Newark, DE 19713, U.S.A.

(Received 6 January 1993; received for publication 6 August 1993)
Abstract--Techniques are described to separate a line of unconstrained (written in a natural manner) handwritten text into words. When the writing style is unconstrained, recognition of individual components may be unreliable, so they must be grouped together into word hypotheses before recognition algorithms (which may require dictionaries) can be used. The system uses original algorithms to determine distances between components in a text line and to detect punctuation. The algorithms are tested on nearly 3000 handwritten text lines extracted from postal address blocks. A detailed performance analysis is given of the complete system and its components.

Text understanding   Pattern recognition   Handwritten text recognition
1. INTRODUCTION

This research focuses on separating a line of handwritten text into words by determining the location of inter-word gaps (gaps between words). This paper describes and evaluates distance measurement algorithms, punctuation detection algorithms, and an algorithm that combines the two to produce word hypotheses. These techniques were developed for English text, but should be applicable to any Latin-based language (and the general approach should be applicable to other languages). This research is put forward as a step towards developing high-performance, computationally efficient techniques for creating word hypotheses that are useful for text recognition algorithms.

Text consists of one or more words, spatially arranged so that the reader can isolate them and determine their ordering. We assume the words are ordered left-to-right in a series of horizontal lines (text lines). To interpret the text, a reader must normally locate a text line and separate it into words. In this paper, we address the problem of separating a located text line into words.

Separating handwritten text into words is challenging because handwritten text lacks the uniform spacing normally found in machine-printed text. Machine-printed text typically has inter-word gaps that are much larger than inter-character gaps (gaps between characters). In addition, the inter-word gaps normally contain no text and extend the height of the text line. If we consider the text line to be a series of connected components, gap distances for machine-printed text can be determined by computing the horizontal distance between components (Fig. 1). The gaps between words and characters in handwritten text vary much more. The examples in Fig. 2 illustrate the difficulty of finding a distance algorithm that can easily distinguish between inter-word gaps and inter-character gaps.
We assume that the process of dividing the text into text lines has been completed (this process is performed automatically(1)), and our goal is to divide those lines into words. In handwriting, a text line is not strictly horizontal or linear (Fig. 3), but we can consider it to be a left-to-right ordered set of connected components. Distances between pairs of connected components, combined with punctuation detection information, can often indicate the inter-word gap locations. These inter-word gaps can divide the ordered set of connected components into words.

This paper describes three separate algorithms that are combined to separate text lines into words. The first algorithm (a technique for determining distances between adjacent connected components) is described fully in reference (2) and only summarized here. The second algorithm detects punctuation marks that are useful for word separation (periods and commas) using a set of fuzzy features and a discriminant classification function. We give details for each of the ten features and the discriminant function. The third algorithm combines the first two to rank the gaps from largest to smallest and then determines which gaps are inter-character gaps and which are inter-word gaps.

Most previous work in recognizing text assumed that the words are already isolated or that isolation is trivial (e.g. words are separated by large amounts of white space(3-5) or words are written in boxes whose location is known(6)). Only a few published papers describe methods of locating word gaps in unconstrained handwriting.(1,7-9) However, these papers only mention word gap location as part of a larger text interpretation system and do not thoroughly describe their word isolation algorithms or compare the strengths and weaknesses of different word isolation algorithms.
Fig. 1. Machine-printed word gaps and their distances. The inter-character gaps are much smaller than the inter-word gaps.
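The machine-printed case in Fig. 1 reduces to simple arithmetic on bounding boxes. The following sketch is our own illustration (the Box type and the example coordinates are hypothetical, not taken from the paper):

    # Python sketch: horizontal gaps between left-to-right ordered boxes.
    from typing import List, NamedTuple

    class Box(NamedTuple):
        min_x: int
        max_x: int

    def horizontal_gaps(boxes: List[Box]) -> List[int]:
        # Gap between each pair of horizontally adjacent boxes;
        # boxes that overlap horizontally get a gap of zero.
        boxes = sorted(boxes, key=lambda b: b.min_x)
        return [max(0, right.min_x - left.max_x)
                for left, right in zip(boxes, boxes[1:])]

    # Two boxes separated by a small inter-character gap, then a large
    # inter-word gap, as in Fig. 1:
    print(horizontal_gaps([Box(0, 10), Box(16, 30), Box(66, 80)]))  # [6, 36]

For clean machine print, thresholding these gap values is usually enough to separate words; the rest of the paper deals with the handwritten case, where it is not.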
Fig. 2. Examples of handwritten text lines with unusual spatial gaps: (a) shows words that overlap horizontally, (b) shows an inter-character gap (between the digits 2 and 7) that is larger than an inter-word gap (between the character A and the digit 5), and (c) shows a text line where many inter-character gaps and inter-word gaps have similar size.
Fig. 3. Examples of handwritten text lines that are not linear or horizontal. Some slant correction algorithms can make the lines more linear and horizontal, but in our experience no slant correction algorithm works very well on unconstrained handwritten text. These algorithms may improve the quality of some images, but they reduce the quality of others.

Brady(10) showed how certain filters can highlight the word gaps in grey-level machine-printed text images. His focus was on determining a preprocessing step in reading and on offering an explanation for certain psychological text-interpretation studies. His method was not tested on handwritten text.

The remainder of this paper is divided into four sections and describes our techniques and results in developing algorithms to separate lines of text. Section 2
summarizes the work on distance measures between connected components. Section 3 discusses how ten features were developed for punctuation detection and how these features are combined to determine the location of commas and periods. Section 4 describes how the distance measures and punctuation detection are combined to classify gaps. The final section discusses future work.
2. DESCRIPTION OF DISTANCE MEASURES
We explored eight algorithms that computed the distances between pairs of connected components. In these experiments, we wanted to isolate the spatial aspect of inter-word gaps and used only that information to compute the distances, realizing that using recognition information (e.g. detecting punctuation) can improve performance.

The input to our system is a binarized text line image. The connected components are ordered left-to-right (the ordering technique is discussed later). We assume that each connected component consists of either a word, a part of a word (e.g. a character, character fragment, or cluster of characters), a punctuation mark, or noise. This means that no connected component belongs to more than one word (an assumption supported by the test data). All gaps between pairs of adjacent connected components (adjacent refers to their ordering in the list) are ordered from largest to smallest using distance measures, and we present quantitative measures to determine how good a particular gap ordering is. These measures are then used to compare the performance of eight different distance algorithms. We also briefly examine the computational efficiency of the algorithms. This part of our research focuses specifically on comparing the performance and speed of different spatial distance computation algorithms for locating inter-word gaps in binary text images.

Three types of distance measures are shown in Fig. 4, and these types are used as the basis for our eight distance algorithms.

(1) The bounding box method computes the minimum horizontal distance between the bounding boxes of the connected components. The distance between bounding boxes that overlap horizontally† is considered to be zero.

(2) The Euclidean method computes the minimum Euclidean distance between connected components.

(3) The minimum run-length method computes the horizontal run-lengths‡ between portions of the connected components that overlap vertically. The minimum of the run-lengths is used as the distance measure. If the connected components do not overlap vertically, the bounding box distance method is used.

(4) The average run-length method is the same as the minimum run-length method except that the average of the run-lengths is used as the distance measure when components overlap vertically.

(5) The run-length with bounding box heuristic method (RLBB) uses the minimum run-length method when components overlap vertically and uses heuristics (with bounding box distances) otherwise. If the connected components do not overlap horizontally or vertically, the horizontal bounding box distance is used. If one connected component is within the horizontal extent of the other, the distance is set to zero. Finally, if the connected components overlap horizontally but not vertically, the vertical distance between the bounding boxes is used.

(6) The run-length with Euclidean heuristic method (RLE) uses the minimum run-length method if connected components overlap vertically by more than a threshold (we use 0.133 in.); otherwise, the minimum Euclidean distance is used.

(7) The RLE with 1 heuristic method (RLE(H1)) is the same as the RLE method with one heuristic added. If one connected component completely overlaps the other horizontally, the distance between the two is set to zero. A typical situation where this occurs is when a capital T is written as two separate strokes (a horizontal stroke and a vertical stroke). The RLE(H1) method ensures that these components will not be separated by a gap.

(8) The RLE with 2 heuristics method (RLE(H2)) is the same as the RLE(H1) method with an additional heuristic. Sometimes connected components are close and overlap horizontally, but the run-length method computes a non-representative distance due to the way the components are positioned. With the new heuristic, when the run-length method is used (i.e. the vertical overlap threshold is exceeded) and the bounding boxes of the two components horizontally overlap more than a fixed amount (we use 0.133 in.), the computed distance between the components is reduced by 60%. Essentially, this heuristic says that if two components are close (according to the run-length measurement) and overlap horizontally, make them closer.

† Bounding boxes (or connected components) overlap horizontally if the right-most edge of the box (or component) on the left extends to (or past) the left-most edge of the box (or component) on the right. Vertical overlapping is defined similarly along the vertical axis.
‡ In this paper, a run-length is the distance along a straight line between two connected components. We consider only horizontal run-lengths because they are useful and are often explicitly available from the image format (e.g. fax images).

In our testing (summarized in Table 2), the RLE(H2) method has the best performance at 90.3%. The RLE(H1) and RLE methods were next best, but did not differ significantly from the Euclidean and the RLBB algorithms (which were fourth and fifth best). However, on average, the Euclidean method takes significantly longer to compute. The bounding box method takes the least time to compute and performs better than the minimum and average run-length methods. The RLBB method has a computational cost less than the RLE and Euclidean algorithms (but more than the bounding box algorithm) and performs as well as all but the RLE(H2) algorithm. Additional performance results are given in Section 2.2.
Fig. 4. Examples of types of gap measures between a pair of connected components. The original image is shown in (a). The bounding boxes are shown in (b), and the horizontal bounding box distance is the horizontal distance between the boxes. The Euclidean distance is shown in (c), where the arrow indicates the distance between the closest points in each component. The run-length distance is represented in (d), whose arrows show some run-lengths between the components. Different distance algorithms could use the average or minimum of the run-lengths.
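To make the three measure types of Fig. 4 concrete, the sketch below computes the bounding box, Euclidean, and minimum run-length distances for components given as sets of (x, y) pixel coordinates, plus the RLE combination rule. This is our own reconstruction from the definitions above (the 40-pixel threshold assumes 300 dpi images, where 0.133 in. is about 40 pixels); it is not the authors' code.

    import itertools, math

    def bbox(comp):
        # (min_x, max_x, min_y, max_y) of a set of (x, y) pixels.
        xs = [x for x, _ in comp]; ys = [y for _, y in comp]
        return min(xs), max(xs), min(ys), max(ys)

    def bounding_box_distance(a, b):
        # Horizontal distance between bounding boxes; zero if they overlap.
        a0, a1, _, _ = bbox(a); b0, b1, _, _ = bbox(b)
        return max(0, max(a0, b0) - min(a1, b1))

    def euclidean_distance(a, b):
        # Minimum Euclidean distance over all pixel pairs (O(|a|*|b|)).
        return min(math.hypot(ax - bx, ay - by)
                   for (ax, ay), (bx, by) in itertools.product(a, b))

    def min_run_length(a, b):
        # Shortest horizontal run between the components on any shared row;
        # falls back to the bounding box distance if no row is shared.
        rows_a, rows_b = {}, {}
        for x, y in a: rows_a.setdefault(y, []).append(x)
        for x, y in b: rows_b.setdefault(y, []).append(x)
        shared = rows_a.keys() & rows_b.keys()
        if not shared:
            return bounding_box_distance(a, b)
        return min(abs(xa - xb) for y in shared
                   for xa in rows_a[y] for xb in rows_b[y])

    def rle_distance(a, b, overlap_threshold=40):
        # RLE: run-lengths when the vertical overlap exceeds the threshold,
        # the Euclidean distance otherwise.
        _, _, ay0, ay1 = bbox(a); _, _, by0, by1 = bbox(b)
        if min(ay1, by1) - max(ay0, by0) > overlap_threshold:
            return min_run_length(a, b)
        return euclidean_distance(a, b)

The RLE(H1) and RLE(H2) variants add the two heuristics described above on top of rle_distance (zeroing the distance on complete horizontal overlap, and reducing run-length distances by 60% on large partial overlap).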
2.1. Distance measures testing methods

The gap distance algorithms (and the algorithms described in Sections 3 and 4) were tested on two sets of text lines collected from postal addresses (1453 text lines used for training and testing, and 1084 text lines used only for testing). The locations of inter-word gaps in each text line were manually determined, and each distance algorithm was run on all pairs of adjacent connected components in the text lines. Four different success measures were used to determine the effectiveness of each gap distance algorithm.
Test images. The text lines were extracted from 1000 address images. These images are 300 pixels-per-inch, 8-bit grey-level images collected by scanning original mail pieces with an Eikonix CCD camera and storing them in HIPS image format.(11) The images were manually cropped (with a bounding rectangle) in an effort to have only the destination address text appear in the image; however, other text or marks (e.g. postmarks, return address text) also frequently appeared in the cropped image. All images were collected from live mail at the Buffalo post office (the sorting center for western New York State) and were selected to represent a sample across the United States. Each of the 1000 images was binarized, passed through a guide line removal program (that removes pre-printed underlining in the image), and separated into text lines. The details of these three preprocessing steps can be found in reference (1). Due to performance degradation of the line separation algorithm in the upper lines of the addresses, at most the bottom four lines were separated. Automatic line separation was used since it quickly produced many text line images. The resulting 3634 text line images were manually examined, and images not properly separated into text lines were discarded because we did not want improper line separation affecting the results. We also removed images with one or fewer words (we wanted only images that contained inter-word gaps); the remaining 2537 images formed our testing set (1453 images for training and testing of the distance metrics and 1084 images for testing of the word separation algorithm).
Truthing of test images. The connected components in each text line were automatically sorted left-to-right based on their mean-x value.† Each gap between components in the text lines was then manually classified as one of four types.

(1) Primary gaps are gaps between semantic units (e.g. city, state, ZIP Code, street number, street name, apartment number) with no commas, periods, or dashes on either side of the gap.

(2) Secondary gaps are gaps between words within a semantic unit (e.g. the gap between the words New and York is a secondary gap if the words form one semantic unit) with no commas, periods, or dashes on either side of the gap.

(3) Punctuation gaps are the gaps between words on either side of commas, periods, or dashes.

(4) Inter-character gaps are all other gaps. These are gaps that separate components within a word.

Table 1 shows the quantity of gaps found in the 1453 text images. The inter-word gaps consist of primary, secondary, and punctuation gaps. The punctuation gaps are ignored in our distance measures testing (they count towards neither the correct nor the failure rate) because punctuation placed between words substantially reduces the inter-word gaps (at least between connected component pairs). Locating these inter-word gaps is best done with the help of a punctuation detector (described in Section 3). The primary and secondary gaps are separated because people often leave wider gaps between semantic units than between words of the same semantic unit. Tests showed primary gaps were, on average, 1.5 times larger than secondary gaps.

† The mean-x value was based on each connected component's bounding box and was calculated as mean-x = (min-x value + max-x value)/2.
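As a concrete note on the ordering, the mean-x computation in the footnote translates directly into a sort key (a minimal sketch; the pixel-set representation of components is our assumption):

    def mean_x(comp):
        # Center of the bounding box along x, per the footnote above.
        xs = [x for x, _ in comp]
        return (min(xs) + max(xs)) / 2.0

    def order_left_to_right(components):
        return sorted(components, key=mean_x)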
Table 1. Quantity of each gap type

    Gap type      Total number    Minimum occurrences      Maximum occurrences
                                  in a single text line    in a single text line
    Primary       2150            0                        6
    Secondary     275             0                        4
    Punctuation   1490            0                        6
    Character     12,400          3                        39
We give results that consider inter-word gaps to be (a) primary gaps alone and (b) primary and secondary gaps together.

2.2. Results of distance measures

The performance differences among the eight distance algorithms are shown in the test results in this section. These two tests show how successfully the distances from each algorithm ordered the gaps, where success is defined as having inter-word gaps larger than inter-character gaps. Tests 1 and 2 (Tables 2 and 3, respectively) show the number of text lines and inter-character gaps that are correctly ordered using each distance algorithm.
Table 2. Test 1 results showing word gap detections where only primary gaps are considered to be inter-word gaps

                               Correct ordering of           Correct ordering of
                               primary gaps in text lines    inter-character gaps
    Rank   Method              Qty       %                   Qty       %
    1      RLE(H2)             1312      90.3                12,135    97.9
    2      RLE(H1)             1301      89.5                12,120    97.7
    3      RLE                 1297      89.3                12,112    97.7
    4      Euclidean           1298      89.3                12,105    97.6
    5      RLBB                1298      89.3                12,081    97.4
    6      Bounding box        1283      88.3                12,006    96.8
    7      Minimum RL          1261      86.8                12,004    96.8
    8      Average RL          1183      81.4                11,941    96.3
Table 3. Test 2 results showing word gap detections where primary and secondary gaps are considered to be inter-word gaps

                               Correct ordering of           Correct ordering of
                               primary gaps in text lines    inter-character gaps
    Rank   Method              Qty       %                   Qty       %
    1      RLE(H2)             1270      87.4                12,036    97.1
    2      RLE(H1)             1258      86.6                12,010    96.9
    3      RLE                 1252      86.2                11,992    96.7
    4      RLBB                1257      86.5                11,971    96.5
    5      Euclidean           1252      86.2                11,991    96.7
    6      Bounding box        1231      84.7                11,870    95.7
    7      Minimum RL          1216      83.7                11,888    95.9
    8      Average RL          1140      78.5                11,823    95.3
The tables show the ranking of each algorithm, the number and percentage (out of 1453) of correctly ordered text lines (i.e. lines in which all inter-word gaps are larger than all inter-character gaps), and the number and percentage (out of 12,400) of correctly ordered inter-character gaps (i.e. inter-character gaps that are smaller than all inter-word gaps in their respective text lines). Test 1 assumes only primary gaps are inter-word gaps (secondary and punctuation gaps are ignored). Test 2 assumes primary and secondary gaps are inter-word gaps (punctuation gaps are ignored).

2.3. Discussion of distance measures

The goal of the tests was to indicate which distance algorithms would be most useful in determining the inter-word gaps in a text line. The main conclusions from the tests are:

(1) Spatial information from the distance algorithms provides an indication of where some inter-word gaps are.

(2) Different distance algorithms have different correct performance levels.

(3) The RLE(H2) method performed best, with a 90.3% rate of correctly ordering gaps in a text line.

(4) The other RLE algorithms, the Euclidean algorithm, and the RLBB algorithm performed next best, but did not differ statistically significantly from each other (at a 95% confidence level for a sample size of 1453).

The tests showed that the computed spatial distance measures did indicate how gaps in a text line should be ordered. The performance of the algorithms varied from 81.4% to 90.3%. Determining the best distance algorithm should be based on correct performance and computational complexity. The RLE(H2) algorithm has the highest correct performance for ordering the text lines correctly. The other RLE algorithms, the Euclidean algorithm, and the RLBB algorithm were slightly worse, and this difference is statistically significant. All RLE algorithms perform faster than the Euclidean method on average, although the worst-case time complexities are the same; in the worst case (when the vertical overlap of all connected components is below the threshold), all RLE methods perform the Euclidean calculation for all pairs. However, in our tests, the RLE method performed the run-length calculation 44% of the time, the bounding box distance calculation 5% of the time, and the Euclidean distance calculation 51% of the time. The run-length methods without heuristics (minimum RL and average RL) are the worst of all methods.
While it is initially surprising that the run-length methods performed worse than the bounding box method, further analysis shows the reason. There are a number of instances where two connected components overlap vertically (allowing us to compute the run-lengths), but the run-lengths do not capture the intuitive measure of gaps that we are seeking (see Fig. 5). Average RL is especially prone to this behavior.

Fig. 5. Example of run-length distance that does not match our intuitive notion of gap distance. In this fraction of a text line, the run-lengths (shown by the arrows) do not match up well with the two connected components. This produces a run-length distance that is longer than the gap we would find intuitively.

The difference between primary and secondary gaps can be determined by comparing Tables 2 and 3. The most obvious difference is that the correct rate drops in test 2. This is expected because test 2 is more stringent than test 1, and secondary gaps are smaller and harder to distinguish from inter-character gaps. Test 2 is more stringent because, in addition to the test 1 restrictions, it also checks whether secondary gaps are ordered correctly. The ordering of the distance algorithms in test 2 is similar to test 1 (the order of the Euclidean and RLBB methods is switched, but this difference is not statistically significant at a 95% confidence level).

3. DESCRIPTION OF PUNCTUATION DETECTION

Distance measures alone are insufficient to locate inter-word gaps, since some connected components that reduce gap size are supposed to indicate gap locations (see Fig. 6). The role of some punctuation marks (e.g. hyphens, dashes, apostrophes) as word separators is not always clear (e.g. sometimes a dash separates words and sometimes it joins words). Due to the uncertain role of some punctuation marks, and because of the limited types of punctuation marks in our test sets, we only developed detection systems for commas and periods. Commas always indicate a word break, and periods usually do (however, consider the phrase an I.O.U.).

Fig. 6. Example of connected components that reduce gap size but indicate word gaps.

Punctuation marks are usually written as short lines or simple curves whose interpretation is dependent on their location in the text line. We developed a set of
fuzzy features where each feature describes a shape or location aspect of a connected component. Each feature algorithm returns a confidence value between 0.0 and 1.0 (inclusive), where 0.0 means the feature is unlikely to be present and increasing values indicate an increasing likelihood that the feature is present. Combinations of these features can indicate which connected components have the shape and location characteristics of commas and periods (we believe these same features can also be used for many other types of punctuation).

3.1. Features for punctuation detection
Shape features. The shape features we use are narrow, small, short, full-height, and perimeter-to-area ratio. Since the punctuation marks we were trying to detect are simple (i.e. no complicated curves or loops), we feel that these features can distinguish between different punctuation marks and other marks in the text (e.g. characters, word-fragments, digits). The narrow feature compares the height with the maximum width of each connected component. If the connected component has a height smaller than two times the width, it is assigned a value of 0.0. Otherwise, the value is the ratio of height to maximum width (with a maximum of 1.0). The small feature compares the size (i.e. number of pixels) of the connected component with the size of other connected components in the text line. If the current connected component is as big or bigger than the average component in its line, it is assigned a value of 0.0. If the current connected component is as small or smaller than the average comma encountered in the training set, it is assigned a value of 1.0. Otherwise a value between 0.0 and 1.0 is assigned depending on whether the size of the given component is closer to the average component size or to the average comma
size, respectively: ((comp_size - avg_comp_size)/(avg_comma_size - avg_comp_size)).

The short feature compares the height of the connected component with the height of comma components encountered in the training set. If the current connected component is as short or shorter than the average comma in the training set, it is assigned a value of 1.0. If it is taller than the average component height, it is assigned a value of 0.0. Otherwise a value between 0.0 and 1.0 is assigned depending on whether the height of the given component is closer to the average component height or to the average comma height, respectively: ((comp_height - avg_comp_height)/(avg_comma_height - avg_comp_height)).

The full-height feature determines if a connected component has the height of an "average" character by comparing the height of the connected component with the height of other connected components in the text line. If the current connected component is as tall or taller than the average component in its line, it is assigned a value of 1.0. Otherwise, the ratio of comp_height to avg_comp_height is used.

The perimeter-to-area ratio feature complements the narrow feature and gives another indication of how thick a connected component is. The perimeter is considered to be the number of pixels present on the external border of the connected component, and the area is the space contained within the exterior contour (including area covered by "holes" in the connected component). A circle has a low ratio and a one-pixel-wide line segment has a very high ratio. This feature compensates for "comma-shaped" components that are mostly horizontally oriented (e.g. a comma that has the shape of a dash). These components have a low value for their narrow feature, since they are much
wider than they are tall. However, their perimeter-to-area ratio is typically high. Values greater than 1.0 are set to 1.0.
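The shape features translate into small scoring functions over simple measurements. Below is a sketch of three of them (narrow, small, short) as we read the definitions above; the line and training-set statistics (avg_comp_size, avg_comma_size, and their height analogues) are assumed to be computed elsewhere:

    def clamp01(v):
        return max(0.0, min(1.0, v))

    def narrow(height, width):
        # 0.0 unless the component is at least twice as tall as it is wide;
        # otherwise the height-to-width ratio, capped at 1.0.
        if width == 0:
            return 1.0          # degenerate one-pixel-wide component
        if height < 2 * width:
            return 0.0
        return clamp01(height / float(width))

    def small(comp_size, avg_comp_size, avg_comma_size):
        # 1.0 at (or below) the training-set comma size, 0.0 at (or above)
        # the line's average component size, linear in between.
        if comp_size >= avg_comp_size:
            return 0.0
        if comp_size <= avg_comma_size:
            return 1.0
        return clamp01((comp_size - avg_comp_size) /
                       float(avg_comma_size - avg_comp_size))

    def short(comp_height, avg_comp_height, avg_comma_height):
        # The same interpolation applied to heights.
        if comp_height >= avg_comp_height:
            return 0.0
        if comp_height <= avg_comma_height:
            return 1.0
        return clamp01((comp_height - avg_comp_height) /
                       float(avg_comma_height - avg_comp_height))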
Spatial features. The spatial location is necessary since the same mark in different locations can mean different things, e.g. consider a comma (,) and an apostrophe ('). Our spatial features determine if a connected component is located low on the line, in the middle of the line, or in the upper portion of the line. These features indicate a connected component's vertical location with respect to the reference lines (lines that show where the ascenders and descenders are separated from the main body of the text). In our tests, many of the text lines were curved or slanted, so that horizontal lines could usually not be used as reference lines. In some cases, the text was written horizontally with a sudden shift in vertical position midway through the text line. Due to these shifts, any global reference line would be inaccurate for some portion of the text (see Fig. 7). To overcome this difficulty, we relied on local information (from one or two neighboring connected components) to determine the spatial features. The projection of the two neighboring components on the vertical axis renders a histogram that gives the position of the upper-half and lower-half reference lines. Such a method is sensitive to character skew and may be deceived by T-crossings. Therefore, we have included two spatial features that do not rely on these reference lines for their computation.

The spatial features are low-on-line, near-midline, near-baseline, extend-beneath-left-neighbor, and extend-beneath-right-neighbor; they are illustrated in Fig. 8.
Fig. 7. Difficulties of assigning reference lines are illustrated: (a) a good fit, (b) a poor fit due to a sudden shift, and (c) a poor fit due to a slanting line.
Fig. 8. Examples of spatial features shown with reference lines: (a1) shows a comma located low on line, (a2) shows a dot not located low on line, (b1) shows a period located near midline, (b2) shows a comma not located near midline, (c1) shows a comma located near baseline, (c2) shows a comma not located near baseline, (d1) shows a period that extends beneath its left neighbor, (d2) shows a component (letter Y) that does not extend beneath its left neighbor. Extend-beneath-right-neighbor is assigned similarly.
The low-on-line feature gives an indication of how low in a line a connected component is. If the given component is located below the lower-half reference line (i.e. it falls totally in the lower portion of the image made up by its two neighbors), it is assigned a value of 1.0. If it extends above the upper-half line, it is assigned a value of 0.0. Otherwise, as the minimum-y coordinate (upper edge) of the current component moves closer to the upper-half line and further from the lower-half line, the value decreases as computed by the following formula: ((comp_miny - upper_half)/(lower_half - upper_half)).
The near-midline feature gives an indication of how much of the component falls within the middle region, or how much of the middle region is covered by the current component. If the component falls totally above the upper-half line or below the lower-half line, the assigned value is 0.0. Otherwise, the maximum of the percentage of the middle region's height that is covered by the component and the percentage of the component's height that falls inside the middle region is assigned to the near-midline feature. These two cases are necessary to differentiate between short components that fall totally within a wide middle region and tall components that extend beyond the two boundaries of a narrow middle zone.
The near-baseline feature indicates what portion of the current component falls below the lower-half reference line. This feature value is the ratio between the height of the component's portion that extends below the lower-half line and the component's height.

The extend-beneath-left-neighbor and extend-beneath-right-neighbor features indicate where the top of the current component is located relative to its left and right neighbor components, respectively. If such components are tall enough to be used as references (small noisy components are eliminated), the ratio of the distance between the component's top and the neighbor's bottom, and the neighbor's height is assigned: ((comp_miny - neighbor_miny)/(neighbor_maxy - neighbor_miny)).
Checking the height of the neighbors is necessary to prevent all components that come after or before components located high on the line (e.g. the horizontal stroke of a disconnected "T", or the dot on an "i") from being assigned a high value. In such cases, we move one component to the left or to the right looking for a neighbor with an appropriate height. In our current implementation a given component is considered to have an appropriate height if it is greater than 35% of the average component height in the line.
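In image coordinates (y increasing downward), the reference-line features above reduce to a few comparisons. A sketch under that assumption, where upper_half and lower_half are the local reference line positions estimated from the neighbors:

    def low_on_line(comp_min_y, upper_half, lower_half):
        # 1.0 if the component starts below the lower-half line, 0.0 if it
        # extends above the upper-half line, linear in between.
        if comp_min_y >= lower_half:
            return 1.0
        if comp_min_y <= upper_half:
            return 0.0
        return (comp_min_y - upper_half) / float(lower_half - upper_half)

    def near_baseline(comp_min_y, comp_max_y, lower_half):
        # Fraction of the component's height falling below the lower-half line.
        height = comp_max_y - comp_min_y
        if height <= 0:
            return 0.0
        below = max(0, comp_max_y - max(comp_min_y, lower_half))
        return below / float(height)

    def extend_beneath_neighbor(comp_min_y, nb_min_y, nb_max_y):
        # Position of the component's top within the neighbor's vertical
        # extent: (comp_min_y - nb_min_y) / (nb_max_y - nb_min_y).
        span = nb_max_y - nb_min_y
        if span <= 0:
            return 0.0
        return max(0.0, min(1.0, (comp_min_y - nb_min_y) / float(span)))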
Table 4. 4-NN results for training set

    From TRUTH    Comma            Other             Period           Reject          Total
                  #       %        #        %        #       %        #      %        #         %
    Comma         696     90.86    27       3.53     40      5.22     3      0.39     766       100.0
    Other         37      0.37     9905     98.76    83      0.83     4      0.04     10,029    100.0
    Period        22      3.66     33       5.49     545     90.68    1      0.17     601       100.0
    Total         755     6.63     9965     87.44    668     5.86     8      0.07     11,396    100.0
Table 5. Majority rule results for training set

    From TRUTH    Comma            Other             Period           Reject          Total
                  #       %        #        %        #       %        #      %        #         %
    Comma         678     88.51    47       6.14     36      4.70     5      0.65     766       100.0
    Other         37      0.37     9903     98.74    84      0.84     5      0.05     10,029    100.0
    Period        24      3.99     51       8.49     525     87.35    1      0.17     601       100.0
    Total         739     6.48     10,001   87.76    645     5.66     11     0.10     11,396    100.0
Table 6. Maximum rule results for training set

    From TRUTH    Comma            Other             Period           Total
                  #       %        #        %        #       %        #         %
    Comma         682     89.03    45       5.87     39      5.09     766       100.0
    Other         40      0.40     9898     98.69    91      0.91     10,029    100.0
    Period        23      3.83     61       10.15    517     86.02    601       100.0
    Total         745     6.54     10,004   87.79    647     5.68     11,396    100.0

3.2. Combining features for punctuation detection

While the previous sub-section describes the features used, this section describes how those features are combined to indicate whether a particular component is a comma or a period. Originally, we were trying to develop fuzzy features and hoped to use a fuzzy classifier to combine them. However, simple fuzzy classification strategies yielded poor results, and we switched to more classical strategies: the logistic regression model and the non-parametric K-nearest-neighbor (KNN) method. Parametric methods were avoided based on an intuitive non-Gaussianness of the features (i.e. there are characteristic rather than random variations within each class).

Since our word separation algorithm needs to distinguish between commas and periods (the presence of periods does not always indicate a word gap), we began our tests assuming we had a 3-class problem; that is, discriminating between commas, periods, and others (i.e. non-commas/periods). Our training set consisted of 11,396 connected components extracted from 1885 handwritten lines (432 lines, each containing punctuation, were added to the original training set of 1453 lines). In this set, 766 components were commas and 601 components were periods. The remaining 10,029 components consisted of digits, characters, entire and partial cursive words, and other types of punctuation (e.g. dashes). A separate testing set of 5567 connected components, extracted from 1084 handwritten lines, was also collected. In this set, 351 components were commas,
351 were periods, and 4865 were non-commas/periods.

We tried three different formulations of our classification problem: a pure 3-class approach, a combination approach using the majority vote rule, and a combination approach using the winner-take-all rule. In the 3-class case, we used a 4-KNN algorithm to discriminate between commas, periods, and others (we do not detail the logistic regression results since logistic regression is primarily used for 2-class discrimination and our experiments showed it performed poorly). In the majority rule case, we developed three discriminant functions using the logistic regression model (commas vs. periods, commas vs. others, and periods vs. others).† The results of these three discriminant functions were combined using the majority vote rule; that is, when at least 2 of the 3 functions agree on their classification decision, an answer is returned. Otherwise, a reject is returned.‡ Similarly, in the winner-take-all case we developed three discriminant functions using the logistic regression model (commas vs. others ∪ periods, others vs. commas ∪ periods, and periods vs. others ∪ commas). The results of these three discriminant functions are combined using the winner-take-all rule; that is, the best decision is given by the function with the maximum a posteriori probability.

† In this case, the set others contained all types of connected components except periods and commas.
‡ Since every component needs to be classified as punctuation or non-punctuation, a reject is treated as an others classification.
The best performance for our training set was with the four nearest neighbors using the Mahalanobis distance based on the pooled covariance matrix. The nearest-neighbor technique was tested on the training set with the leave-one-out method and gave a 97.84% correct rate. The majority rule gave a 97.50% correct rate, and the maximum rule gave a 97.38% correct rate (Tables 4-6). Although the three approaches do not significantly differ from each other in the correct rate, they do differ in the false-positive, false-negative, and sensitivity rates. On our separate testing set, the 4-NN method performed slightly worse than the logistic regression models, with a 97.14% correct rate. The majority rule gave a 97.88% correct rate, and the maximum rule gave a 97.84% correct rate (Tables 7-9).

Table 7. Majority rule results for testing set

    From TRUTH    Comma            Other             Period           Reject          Total
                  #       %        #        %        #       %        #      %        #        %
    Comma         311     88.6     23       4.5      15      2.94     2      0.39     351      100.0
    Other         27      0.55     4817     99.01    21      0.43     0      0.0      4865     100.0
    Period        17      4.84     13       3.7      321     91.45    0      0.0      351      100.0
    Total         355     6.38     4853     87.17    357     6.41     2      0.036    5567     100.0

Table 8. Maximum rule results for testing set

    From TRUTH    Comma            Other             Period           Total
                  #       %        #        %        #       %        #        %
    Comma         311     88.6     28       7.98     12      3.42     351      100.0
    Other         25      0.51     4817     99.01    23      0.47     4865     100.0
    Period        19      5.41     13       3.7      319     90.88    351      100.0
    Total         355     6.38     4858     87.26    354     6.36     5567     100.0

Table 9. 4-NN results for testing set

    From TRUTH    Comma            Other             Period           Reject          Total
                  #       %        #        %        #       %        #      %        #        %
    Comma         306     87.18    9        2.56     15      4.27     21     5.98     351      100.0
    Other         14      0.29     4800     98.66    24      0.49     27     0.55     4865     100.0
    Period        17      4.84     14       3.99     275     78.35    45     12.82    351      100.0
    Total         337     6.05     4823     86.64    314     5.64     93     1.67     5567     100.0

3.3. Discussion of punctuation detection

The high performance levels indicate that the features and combination methods offer significant discrimination capabilities. We found no other published figures for this kind of discrimination, so qualitative comparisons to other techniques are difficult. Certainly additional features and combination methods can be explored, but we have already reached an area of diminishing returns. In addition to its high correct performance, the logistic regression model has significant run-time computation
advantages. For these reasons, the majority rule was the clear choice for use in our combined system. We believe our punctuation detection system will work for non-address punctuation since the feature set can describe many forms of punctuation. For instance, a dash can be described as not(narrow) ∧ small ∧ short ∧ not(full_height) ∧ not(low_on_line) ∧ near_midline.

4. COMBINING SPATIAL AND PUNCTUATION DETECTION INFORMATION
This section describes how gap distance and punctuation detection can be combined to develop word segmentation hypotheses. While context can give valuable word-break clues (e.g. the number of words in the line, the role of punctuation, the average size of inter-character and inter-word gaps), we designed our algorithm to be mostly context independent. Towards this goal, we based our algorithm on only two assumptions. First, we assumed that, aside from punctuation gaps, the inter-word gaps in a given text line should be larger than the inter-character gaps and that there would be a significant size difference between the two types of gaps. Second, we assumed that the presence of punctuation (periods and commas) in a location increases the likelihood that the given location is an inter-word gap.

Punctuation marks boost the confidence of a gap being a word break. That is, when a component is classified as a comma, the gap to the right of this component becomes a very likely inter-word break. To increase the chances that such a gap will be selected as a word gap, we increase its size. Ideally, such an increment should be proportional to the confidence of the punctuation recognition result. A more straightforward approach is to add the sizes of the gaps on each side of the comma. In our implementation we used the formula d[i] ← d[i] + d[i-1] to increase the size of the gap to the right. The size of the gap to the left of the comma (i.e. d[i-1]) is reduced by 60% to reduce the possibility of selecting that gap as a word gap.

Periods must be treated more carefully since they are not always word-break indicators. Usually, this duality is present when they appear inside an abbreviation like "P.O.", which stands for Post Office. Here, the first period is not intended to indicate a word break, but the second one is. This example suggests that additional heuristic rules are needed to decide when periods indicate word breaks. One such rule could be that every time a period is found, we check two positions ahead for the presence of another period; if another period is found, the current one is ignored in the word hypothesis process. Otherwise, we modify the size of the gap to the right of the period using the formula d[i] ← d[i] + d[i-1]. The size of the gap to the left of the period (i.e. d[i-1]) is also reduced by 60% for the reason given above.

Our word segmentation algorithm starts by using the RLE(H2) method to measure the length of all inter-component gaps. It then uses the period and comma recognition results to adjust the size of some of the gaps (as described above). The resulting list of gaps is then ordered from biggest to smallest. After that, the goal is to find a dividing line that splits this list into two sets, the set of inter-word gaps and the set of inter-character gaps. This task is performed as follows (a sketch in code follows the steps):
Step I. Assume the dividing line between inter-word and inter-character gaps is the first (largest) gap.

Step II. Compute the average distance of all inter-character gaps (i.e. gaps smaller than the current dividing gap).

Step III. Compute the average distance of all inter-word gaps (i.e. gaps greater than or equal to the current dividing gap).

Step IV. Compute the ratio avg_inter_word_gaps/avg_inter_character_gaps. If this ratio increases (over the last measurement in this line), try the next gap as the dividing gap, and go back to Step II. Otherwise use this gap as the dividing gap, and stop.

Essentially, the algorithm finds the dividing line where the big gaps are the "most bigger" than the smaller gaps. It then considers the big gaps to be word gaps and the smaller gaps to be character gaps.
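A compact sketch of the whole procedure (the comma gap adjustment plus the dividing-gap search of Steps I-IV). The 0.4 factor encodes the 60% reduction described above; the list representation is our own:

    def adjust_for_commas(gaps, comma_indices):
        # gaps[i] is the distance between components i and i+1.
        # For each component classified as a comma, enlarge the gap to its
        # right (d[i] <- d[i] + d[i-1]) and shrink the one to its left by 60%.
        d = list(gaps)
        for j in comma_indices:
            if 1 <= j < len(d):
                d[j] += d[j - 1]
                d[j - 1] *= 0.4
        return d

    def dividing_gap(gaps):
        # Steps I-IV: walk the gaps from largest to smallest, keeping the
        # split that maximizes avg(inter-word) / avg(inter-character).
        ordered = sorted(gaps, reverse=True)
        best_ratio, best_k = 0.0, 1
        for k in range(1, len(ordered)):
            word_avg = sum(ordered[:k]) / k
            char_avg = sum(ordered[k:]) / (len(ordered) - k)
            ratio = word_avg / char_avg if char_avg > 0 else float("inf")
            if ratio <= best_ratio:
                break                 # ratio stopped increasing (Step IV)
            best_ratio, best_k = ratio, k
        return ordered[best_k - 1]    # smallest gap still called inter-word

Gaps at least as large as the returned value are taken as word breaks. Period handling follows the same pattern, with the look-ahead rule above deciding whether a given period's gap is adjusted at all.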
4.1. Combining information results

The word segmentation algorithm was tested on an independent set of 1084 text lines (this set was not used in the training of the distance and punctuation algorithms). Table 10 shows the quantity of gaps found in these text images. Tables 11 and 12 show the results of four different tests that compare the performance of the proposed method against the traditional bounding box (BB) approach.

Table 10. Quantity of each gap type

    Gap type     Total number
    Primary      1672
    Secondary    135
    Comma        351
    Character    8612
    Total        10,770
Table 11. Performance of the word segmentation algorithm using the bounding box and RLE(H2) distance methods without punctuation recognition

                False            Missed           Missed           Missed           Correctly
                positives        primary          comma            secondary        parsed lines
    Method      #       %        #       %        #       %        #       %        #       %
    BB          1004    9.32     584     5.42     125     1.16     73      0.68     239     22.05
    RLE(H2)     616     5.72     440     4.09     135     1.25     77      0.71     343     31.64
Table 12. Performance of the word segmentation algorithm using the bounding box and RLE(H2) distance methods with punctuation adjustment

                False            Missed           Missed           Missed           Correctly
                positives        primary          comma            secondary        parsed lines
    Method      #       %        #       %        #       %        #       %        #       %
    BB          1066    9.90     596     5.53     79      0.73     73      0.68     261     24.08
    RLE(H2)     569     5.28     456     4.23     30      0.28     79      0.73     425     39.21

4.2. Combining information discussion

Our results clearly show that our system can correctly determine the gap types of over 90% of the gaps in our
testing data. This system would allow nearly 40% of the text lines to be properly separated into words. With some trivial adjustments, we could use this system to generate word hypotheses for a text understanding system. The RLE(H2) method showed significantly better performance than the bounding box method (which is suitable for most machine-printed text). The punctuation detection reduced the false positives and the number of missed comma gaps significantly while causing minimal increases in the number of missed primary and secondary gaps. Further testing could show how generating multiple hypotheses (instead of one answer per text line) could increase the likelihood of generating the correct parse. These results show that a significant number of words in unconstrained handwritten text lines can be correctly segmented using only spatial information and punctuation detection.

5. SUMMARY AND FUTURE WORK
We have presented a set of algorithms for separating words in a text line and have shown different ways of measuring their performance. The algorithms were tested on a large number of images and have been shown to be useful in this domain. We believe this in-depth analysis is needed to develop a robust text recognition system. A complete word separation system must incorporate context (i.e. text interpretation) to determine all word groupings. However, given the computational complexity and accuracy of current handwritten word recognition algorithms, it is likely that most full handwritten text processing systems will use some preprocessing word separation algorithms such as those described in this paper.

Acknowledgements--The authors wish to thank several people at CEDAR: Dr Sargur Srihari for his support and encouragement; Professor Peter D. Scott, Evelyn Kleinberg and Dar-Shyang Lee, who offered many suggestions and assistance; Ronald Curtis, who provided the initial code for computing the fuzzy attributes used in punctuation detection; and Keith Bettinger, who wrote an X utility that greatly improved the speed of the truthing process. This work was supported by the United States Postal Service Office of Advanced Technology under Task Order 104230-91-O-5329.
REFERENCES

1. E. Cohen, J. J. Hull and S. N. Srihari, Understanding handwritten text in a structured environment: determining ZIP Codes from addresses, Int. J. Pattern Recognition Artif. Intell. 5(1&2), 221-264 (1991).
2. G. Seni and E. Cohen, Segmenting handwritten text lines into words using distance algorithms, SPIE-IS&T Conf. Proc., pp. 1000-1110 (1992).
3. H. S. Baird and K. Thompson, Reading chess, IEEE Trans. Pattern Analysis Mach. Intell. 12, 552-559 (1990).
4. J. J. Hull, A computational theory of visual word recognition, Technical Report 88-07, State University of New York at Buffalo, Department of Computer Science, February (1988).
5. K. M. Sayre, Machine recognition of handwritten words: a project report, Pattern Recognition 5, 213-228 (1973).
6. R. O. Duda and P. E. Hart, Experiments in the recognition of hand-printed text: Part II, context analysis, AFIPS Conf. Proc., pp. 1139-1149 (1968).
7. A. Gardin Du Boisdulier, Z. Bichri and F. Tourand, Post office box detection on handwritten addresses, USPS Advanced Technology Conf., pp. 585-603, November (1990).
8. F. Kimura, A. Z. Chen and M. Shridhar, An integrated character recognition algorithm for locating and recognizing ZIP Codes, USPS Advanced Technology Conf., pp. 605-619, November (1990).
9. A. C. Downton, R. W. S. Tregidgo, C. G. Leedham and Hendrawan, Recognition of handwritten British postal addresses, From Pixels to Features III: Frontiers in Handwriting Recognition, pp. 129-144 (1992).
10. M. Brady, Toward a computational theory of early visual processing in reading, Visible Language 15(2), 183-215 (1981).
11. M. S. Landy, Y. Cohen and G. Sperling, HIPS: a UNIX-based image processing system, Comput. Vision Graphics Image Process. 25, 331-347 (1984).
About the Author--GIOVANNI SENI received his B.S. in computer engineering from Universidad de Los Andes (Bogota, Colombia) in 1989, and was awarded a Fulbright scholarship in 1990. He obtained his M.S. in computer science from the State University of New York at Buffalo in 1992 and is currently preparing his doctoral dissertation proposal. As a research assistant at CEDAR, he is working on automated reading problems.
About the Author--EDWARD COHEN is head of the research department at Tritek Corporation, exploring a variety of real-time image processing and text understanding issues. He also teaches part-time at the University of Delaware. Cohen received his Ph.D. from the State University of New York at Buffalo in 1992, where he worked as lead researcher and project director for a system to read handwritten addresses sponsored by the U.S. Postal Service. He received his B.S. in electrical engineering from Cornell University in 1982. He is a member of IEEE and ACM.