Engineering Applications of Artificial Intelligence 14 (2001) 35–41
A methodology for document processing: separating text from images
Nikolaos G. Bourbakis*
Department of Electrical Engineering, Technical University of Crete, 73100 Chania, Crete, Greece
Received 11 March 1997; accepted 1 May 1998
Abstract
This paper presents a methodology for document processing that separates text paragraphs from images. The methodology is based on the recognition of text characters and words for the efficient separation of text paragraphs from images, keeping their relationships for a possible reconstruction of the original page. The text separation and extraction is based on a hierarchical framing process. The process starts with the framing of a single character, after its recognition, continues with the recognition and framing of a word, and ends with the framing of all text lines. The text lines form a natural language text which requires analysis. © 2001 Published by Elsevier Science Ltd.
Keywords: OCR & document processing; Natural language processing; Stochastic petri-nets; Understanding human communication; Multimedia
*Present address: Department of Electrical Engineering, Binghamton University, P.O. Box 6000, Binghamton, NY 13902-6000, USA. Tel.: +1-607-777-4856; fax: +1-607-777-4464. E-mail address: [email protected] (N.G. Bourbakis).
0952-1976/01/$ - see front matter © 2001 Published by Elsevier Science Ltd. PII: S0952-1976(00)00055-5
1. Introduction
The recognition of printed and handwritten characters and words is an important research field (single modality) with many applications (Wang, 1991; Mantas, 1986; Srihari et al., 1992; O’Gorman and Kasturi, 1992; Bourbakis et al., 1999; Bourbakis and Goldman, 1999; Frontiers for Handwritten Recognition, 1993–96): in post offices, for identifying the postal code from the addresses on envelopes and sorting the mail; in banks, for check processing; in libraries, for computerizing the storage of books and texts; and as reading devices for blind people (Proceedings of IEEE, 1992; Wang, 1991; Frontiers for Handwritten Recognition, 1993–96). Although many methodologies and systems have been developed for optical character recognition (OCR), OCR remains a challenging area (Wang, 1991; Mantas, 1986; Kahan et al., 1987; Frontiers for Handwritten Recognition, 1993–96). In particular, a good OCR system (Srihari et al., 1992) spends on average about 2–3 s on the recognition of a handwritten character from a handwritten word. An extreme case is the OCR system (Loral) based on a very expensive parallel multiprocessor system of 1024 Intel 386 microprocessors, where each 386 CPU processes only one character at a time. There are also many OCR methods based on neural networks, such as the AT&T Bell Labs OCR chip and the multiple neural networks OCR approach (Bourbakis et al., 1999; Bourbakis and Goldman, 1999). Other OCR methods are based on human-like recognition. One of them uses a fuzzy graph-based OCR approach with adaptive learning capabilities, which reduces the character dimensions to speed up the recognition process. It scans the text page, detects a character, extracts and recognizes it, produces the appropriate ASCII code, and sends it to the host computer in a few milliseconds (simulated average test time) (Bourbakis et al., 1996; Bourbakis, 1997).
Image processing and pattern recognition (IPPR) are two older research fields with many significant contributions. The recognition and extraction of objects from images is a small subfield of IPPR. There are many successful methods based on neural nets or graphs that recognize different kinds of objects (faces, cars, chairs, tables, buildings, etc.) under very noisy conditions.
The document processing field has lately been getting extra attention due to multimedia applications (Narasimhalu and Christodoulakis, 1991; Gudivada and Raghavan, 1995). Although document processing is an interesting research field, it introduces many difficult problems associated with the recognition of text characters from images. For instance, there are cases where a document can be considered either as text or as image, such as images generated by text characters, or the artistic letters in very old and valuable books, where the starting letter of each paragraph looks like a complex image. Moreover, in some cases the text is handwritten, and the problem becomes more difficult.
Several methods have been developed for document processing (O’Gorman and Kasturi, 1995; Proceedings of IEEE, 1992; Pavlidis and Zhou, 1992; Fletcher and Kasturi, 1988; O’Gorman, 1993; Wahl et al., 1989). Most of these methods deal with the segmentation of a page and the separation of text from images. In particular, the method presented in Wahl et al. (1989) is a “top-down” approach and produces good results under the condition that the examined page can be separated into blocks. The algorithmic approach presented in Fletcher and Kasturi (1988) is a “bottom-up” process with good performance in several categories of pages with good spacing features and “non-overlapping” blocks. The method proposed in O’Gorman (1993) is also a “bottom-up” process, with very good performance especially on long, uniform text strings. Another method, presented in Bourbakis (1996), separates images from text (typed or handwritten) while maintaining their relationships.
The methodology presented here facilitates document processing by efficiently separating and interrelating single modalities, such as text, handwriting, and images. In particular, the methodology starts with the recognition of text characters and words for the efficient separation of text paragraphs from images, maintaining their relationships for a possible reconstruction of the original page. The text separation and extraction is based on a hierarchical framing process. The process starts with the framing of a single character, after its recognition, continues with the recognition and framing of a word, and ends with the framing of all text lines.
The methodology used here can process different types of documents, such as typed, handwritten, skewed, and mixed ones, but not half-tone ones.
2. Separation of text from images
2.1. Character recognition and framing
This section discusses the recognition of a text character and its framing, which plays a very important role in the process that separates text from images.
2.1.1. Text binarization & character detection
Initially, the entire text page is binarized and its pyramidal form is generated. Then the page area is scanned for the isolation and interrelation of informative regions $R_k(i, j)$, $k \in Z$, with text or images (Bourbakis and Klinger, 1989) (see Fig. 1). When the first top “informative” region $R_1$ (a region with text or image) is detected, the methodology defines that particular region at the first pyramidal level (the original page) and focuses on the upper left corner of the region to detect a text character, if possible.
Fig. 1. Page hierarchical representation, framing of text paragraphs and images, association of frames.
2.1.2. Character recognition
The character recognition process starts with the creation of a temporary window $W_{n \times m}$ of $n \times m$ pixels. This window covers the upper left area of the actual (in size) region $R_1$. Within this window $W_{n \times m}$, a scanning process takes place to detect the edges of a possible character or the shape of an unknown object. When an edge is detected, a chain code (CC) method is used to extract the shape of the unknown character or object (see Fig. 2). The “unknown” shape extracted by the CC method is represented as a string S,
$$S = c\, n_{k1}(dj_{k1})\, n_{k2}(dj_{k2}) \ldots c\, n_{l1}(dj_{l1}) \ldots n_{lm}(dj_{lm})\, cc,$$
where $n_{km} \in Z$, $dj_{km} \in \{1, 2, 3, 4, 5, 6, 7, 8\}$, $c = 0$, $cc = 9$ and $i, j, k, l, m \in Z$.
Fig. 2. Detection and extraction of text characters.
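The chain-code string S of Section 2.1.2 can be illustrated with a minimal Python sketch. It assumes the contour is already available as an ordered list of 8-connected pixel coordinates; the direction numbering in `DIRECTIONS`, the function names `chain_code_runs` and `encode_string`, and the spacing of the output string are illustrative assumptions, since the paper only states that directions take values 1 to 8, that c = 0 opens a string, and that cc = 9 closes it.

```python
# Assumed direction numbering (not specified in the paper): 1 = east,
# then counter-clockwise in 45-degree steps. Image coordinates: y grows downward.
DIRECTIONS = {
    (1, 0): 1, (1, -1): 2, (0, -1): 3, (-1, -1): 4,
    (-1, 0): 5, (-1, 1): 6, (0, 1): 7, (1, 1): 8,
}
START_MARK = 0  # c in the paper
END_MARK = 9    # cc in the paper


def chain_code_runs(contour):
    """Return [(n, d), ...]: n consecutive unit steps in direction d."""
    runs = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        d = DIRECTIONS[(x1 - x0, y1 - y0)]
        if runs and runs[-1][1] == d:
            # Same direction as the previous step: extend the current run.
            runs[-1] = (runs[-1][0] + 1, d)
        else:
            runs.append((1, d))
    return runs


def encode_string(runs):
    """Format the runs as 'c n1(d1) n2(d2) ... cc'."""
    body = " ".join(f"{n}({d})" for n, d in runs)
    return f"{START_MARK} {body} {END_MARK}"


contour = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2)]
print(encode_string(chain_code_runs(contour)))
```

Grouping equal consecutive directions into runs n(d) is what lets the later line-generation step treat a long run as a candidate straight segment.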
Fig. 3. Graph representation of a text character.
Fig. 4. A graph record of a character with k nodes in the OCR DB.
A line generation and recognition process is applied to the string S, and its segments are recognized either as straight lines (SL) or as curved lines (CL) (Fig. 3) (Bourbakis et al., 1996). At this point the methodology converts a string S into a graph G:
$$f : S \to G = N_1\, a^{r}_{12}\, N_2\, a^{r}_{23}\, N_3 \ldots a^{r}_{(k-1)k}\, N_k,$$
where a line segment (SL or CL) corresponds to a graph node,
$$f : SL_i \to N_i \quad \text{or} \quad CL_j \to N_j,$$
where each graph node $N_i$ represents the properties of the corresponding segment,
$$N_i = [\text{relative starting point (SP)},\ \text{length (L)},\ \text{direction (D)},\ \text{curvature (K)}],$$
and each arc $a^{r}_{ij}$ represents the relationships between segments,
$$a^{r}_{ij} = \{\text{connectivity (co)},\ \text{parallelism (p)},\ \text{symmetry (sy)},\ \text{relative size (rs)},\ \text{relative distance (rd)},\ \text{relative orientation (ro)},\ \text{similarity (si)},\ \ldots\}, \quad r \in \{co, p, sy, rs, rd, si\}.$$
For the actual matching process, each node $N_i$ has only one property, namely the curvature (K). When a text character is extracted and represented in graph form, a fuzzy graph matching process takes place within a graph database to classify the character (see Fig. 4). The classification of a character is associated with attributes such as orientation, size, and font (Fig. 5). However, if the extracted pattern is not recognizable, the method considers it as a possible piece of an image or drawing. Thus, the method saves the unknown pattern’s coordinates and continues with the extraction of the next pattern. If the newly extracted pattern is also unrecognizable as a text character, the method repeats its attempts until it covers that particular informative region, generating a block (or blocks) of non-text patterns and saving each block’s coordinates for future reconstruction.
Fig. 5. Skewed character with orientation.
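As a rough illustration of the graph representation and the curvature-only matching described in Section 2.1.2, the following Python sketch stores each segment as a node with the four properties (SP, L, D, K) and classifies an unknown character against a small database. The paper does not specify its fuzzy matching function, so the linear membership score in `similarity`, the `tolerance` and `threshold` values, and all names are assumptions made for illustration only; arcs are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    sp: tuple         # relative starting point (SP)
    length: float     # length (L)
    direction: int    # direction (D)
    curvature: float  # curvature (K); 0.0 for a straight line


def similarity(a, b, tolerance=0.5):
    """Assumed fuzzy score in [0, 1] comparing node curvatures pairwise."""
    if not a or len(a) != len(b):
        return 0.0
    scores = [
        max(0.0, 1.0 - abs(p.curvature - q.curvature) / tolerance)
        for p, q in zip(a, b)
    ]
    return sum(scores) / len(scores)


def classify(unknown, database, threshold=0.6):
    """Return (label, score); label is None when nothing matches well enough."""
    best_label, best_score = None, 0.0
    for label, nodes in database.items():
        score = similarity(unknown, nodes)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score  # treated as a possible non-text pattern
    return best_label, best_score


database = {
    "I": [Node((0, 0), 10.0, 7, 0.0)],
    "C": [Node((0, 0), 12.0, 5, 0.8)],
}
unknown = [Node((0, 0), 9.0, 7, 0.1)]
print(classify(unknown, database))
```

Returning `None` for a weak match mirrors the text's fallback, in which an unrecognizable pattern is kept as a candidate piece of an image and only its coordinates are saved.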
2.1.3. Character framing
When a particular character is recognized by the method, its attributes are used for the generation of its frame. This is a flexible process since it provides the ability to frame characters with different skews and sizes. Fig. 6 shows the framing of a character by using the maximum points sp (for top) and cp (for left side), and the frames of characters of different sizes. In the case where a particular character has overlapping parts with neighboring characters, a voting recognition process is used to appropriately recognize it (Bourbakis et al., 1996; Bourbakis, 1997).
2.2. Words framing
2.2.1. Connecting character frames
When the framing of the first character is completed, the character extraction and recognition process is repeated with the next neighboring character, which may have the same orientation (v) as the previous one or a different one. Thus, after the framing of the next character, the method connects these two frames into one, under the condition that they belong to the same word. The connection (or synthesis) of two frames (see Fig. 7) starts with the use of the frames’ orientations ($v_i$, $v_j$; $i, j \in Z^{+}$) to match one of the eight possible connection patterns shown in Fig. 8. The methodology assumes that two consecutive characters belong to the same word if the distance between them is equal to or smaller than dc, where dc is a predefined parameter. The connection block (cb) is generated by the projection of $h_i$ onto the other frame’s height $h_j$. Thus, the shape of cb varies according to the orientations of these two frames.
2.2.2. Word framing
The method repeats the same character-framing procedure until the distance between the last two consecutive characters is greater than dc. Thus, at the end of this process, the method creates a multi-frame block for the extracted word by using the character frames and their projections onto each other. Fig. 9 shows graphically the synthesis of frames and connection blocks by using the patterns of Fig. 8. In particular, frame (W) is connected to frame (h) by using pattern-e, frame (h) is connected to frame (e) using pattern-e, frames (e) and (r) use pattern-b, and frames (r) and (e) use pattern-a.
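The word-framing rule of Section 2.2 (consecutive characters belong to the same word while their distance does not exceed dc) can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes unskewed frames reduced to horizontal extents `(char, x_left, x_right)` already sorted left to right, and it ignores the connection patterns of Fig. 8 and the connection blocks. The function name `group_into_words` and the sample coordinates are hypothetical.

```python
def group_into_words(frames, dc):
    """Split a left-to-right list of character frames into words.

    frames: list of (char, x_left, x_right) tuples.
    dc: maximum gap between consecutive frames of the same word.
    """
    words, current = [], []
    for frame in frames:
        if current:
            gap = frame[1] - current[-1][2]  # left edge minus previous right edge
            if gap > dc:
                # Gap larger than dc: the previous word is complete.
                words.append(current)
                current = []
        current.append(frame)
    if current:
        words.append(current)
    return words


frames = [
    ("W", 0, 8), ("h", 10, 15), ("e", 17, 21), ("r", 23, 26), ("e", 28, 32),
    ("w", 45, 52), ("e", 54, 58),
]
for word in group_into_words(frames, dc=5):
    print("".join(char for char, _, _ in word))
```

A single threshold dc keeps the rule simple, but it presumes roughly uniform character spacing; for skewed frames the gap would have to be measured along the frame orientation rather than along the x axis.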
2.3. Word recognition
Fig. 6. Framing of characters.
The method presented here has the ability to recognize handwritten words by appropriately segmenting them and temporarily saving their recognizable ASCII codes. Then it composes these codes into a text word and compares it with the contents of the lexicon database. An illustrative example is given to present the word-recognition process of this method. More specifically, the segmentation of the handwritten word “animation” (Table 1) is presented. If a character is not isolated from the neighboring characters, due to overlapping and/or underlining, then the method moves the window $W_{n \times m}$ into three positions (left, center, right; see Table 1) around the character, and each time a character recognition is performed. This means that three character-recognition attempts are made for a possible single character, in an effort to optimize the correct recognition of each character and the “best” segmentation of the examined word. The three recognized character outputs are compared in a voting section, and the character with more than two appearances is selected as the one with the highest probability. At the end of this process, the selected character is saved in memory. The same process is repeated until the “last” character of the examined word is recognized and saved in memory. At this point the method extracts the length of that particular word, defines the starting character (if possible) and attempts a fuzzy matching process with the known words in the lexicon database. As a result of this matching process, a number of words associated with their matching probabilities are retrieved from the lexicon database. Thus, the word with the highest probability (if any) is selected as the correct one. For the word given in Table 1, the fuzzy matching against the lexicon database provides the word “animation” (55%) as first choice and the word “animosity” (11%) as second choice. The word-recognition process has an 89% success rate on different styles of handwritten words (Bourbakis et al., 1996).
Fig. 7. Connecting character frames.
Fig. 8. The connection rules.
Fig. 9. Framing a word.
Table 1
2.4. Text line framing
The connection of word frames follows a procedure similar to the connection of character frames, connecting the last frame of each word with the first frame of the next one. Thus, a text line frame may be a multi-skew block, which covers only the character frames and the space of the connection blocks (Fig. 10).
Fig. 10. Framing different words.
2.5. Text framing and extraction
2.5.1. Connecting and extracting text line frames
In order to extract text lines from a document, it is necessary to save the coordinates (x,y) and the relative orientation (rv) of the first character frame of each text line relative to the borders of the document page, as shown in Fig. 11. Thus, the framing and the extraction of paragraphs, or of the entire text of a document page, is obtained by interrelating the extracted text line blocks with numbers (#N) according to their relative positions on the document page.
Fig. 11. The record of a text line block.
2.5.2. Extracting images
The extraction of the images is based on a sequential scanning of the image region, saving the coordinates (X,Y) of the upper left top pattern and its relative orientation (RV) with respect to the borders of the document page.
2.6. An illustrative example
In this section, an illustrative example is given to show the separation of text and images from a document and the flexibility of the proposed method on handwritten documents. In particular, Fig. 12 shows an original synthetic document; Fig. 13 presents a reduced binarized version of the original document. Fig. 14 shows the frames of the characters extracted from the document. Fig. 15 presents the result of the word-framing process and the sequence of connection patterns used to obtain it. Fig. 16 shows the final result of the text line framing and the sequence of patterns used for the connections. Fig. 17 presents the text line blocks with their reconstruction pointers.
Fig. 12. The original picture with handwritten text.
Fig. 13. A reduced version of the original.
Fig. 14. Frames for handwritten characters.
Fig. 15. Connection of character frames.
Fig. 16. Connection of word frames.
Fig. 17. Records for extracted text line blocks.
3. Conclusion
A methodology for separating text paragraphs from images has been presented in this paper. The main advantages of this method are: (i) accurate extraction and recognition of text paragraphs from documents; (ii) extraction and recognition of handwritten unstructured text from documents; (iii) accurate reconstruction of the original page (document). The weak point of this method is that it is slower than the other document-processing methods because of the word-recognition effort, which the other methods do not use.
Acknowledgements
This work is partially supported by an ECC Grant 1995–96 and an AFRL Grant 1998–99.
References
Bourbakis, N., 1996. A method of separating text from images. IEEE Symposium I&S, Maryland, November, pp. 311–317.
Bourbakis, N., 1997. Recognition of handwritten characters and words. AIIS-TR-1997, p. 11.
Bourbakis, N.G., Goldman, N., 1999. Recognition of line segments with unevenness used in OCR and fingerprints. Engineering Applications of Artificial Intelligence 12, 273–279.
Bourbakis, N., Klinger, A., 1989. Hierarchical picture coding. PR Journal on Pattern Recognition 22, 239–254.
Bourbakis, N.G., Koutsougeras, C., Jameel, A., 1999. Handwritten character recognition using low resolutions. Engineering Applications of Artificial Intelligence 12, 139–147.
Bourbakis, N., Periera, N., Mertoguno, S., 1996. Hardware design of a letter-driven OCR and document processing system. International Journal of Networks & Computer Applications 19, 275–294.
Fletcher, L.A., Kasturi, R., 1988. A robust algorithm for text string separation from text graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence 10, 910–918.
Frontiers for Handwritten Recognition, 1993–96. Proceedings of the International Workshops.
Gudivada, V., Raghavan, V., 1995. Content based image retrieval systems. IEEE Computer 28, 9.
Kahan, S., Pavlidis, T., Baird, H., 1987. On the recognition of printed characters of any fonts and size. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Mantas, J., 1986. An overview of character recognition methodologies. PR Society Pattern Recognition Journal 19 (6).
Narasimhalu, D., Christodoulakis, S., 1991. Multimedia information systems: the unfolding of a reality. IEEE Computer 24, 10.
O’Gorman, L., 1993. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 11.
O’Gorman, L., Kasturi, R., 1992. Document Image Analysis System. IEEE Computer, 25.
O’Gorman, L., Kasturi, R., 1995. Document Image Analysis. IEEE Computer Society Press, Silver Spring, MD.
Pavlidis, T., Zhou, J., 1992. Page Segmentation and Classification. CVGIP 54, 6.
Proceedings of IEEE, 1992. Selected paper on document processing.
Srihari, S.N., Palumbo, P., Sridhar, R., Soh, J., Demjanenko, J., 1992. Postal address block location in real time. IEEE Computer 34–42.
Wahl, F., Wong, W., Casey, R., 1989. Block segmentation and text extraction in mixed text image documents. CVGIP 20.
Wang, P., 1991. Characters and Handwritten Expanding Frontiers. WSPub.