Text extraction in complex color documents


Pattern Recognition 35 (2002) 1743–1758

www.elsevier.com/locate/patcog

C. Strouthopoulos, N. Papamarkos∗, A.E. Atsalakis

Electric Circuits Analysis Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, Xanthi 67100, Greece

Received 5 July 2001

Abstract

Text extraction in mixed-type documents is a necessary pre-processing stage for many document applications. In mixed-type color documents, text, drawings and graphics appear with millions of different colors. In many cases, text regions are overlaid onto drawings or graphics. In this paper, a new method to automatically detect and extract text in mixed-type color documents is presented. The proposed method is based on a combination of an adaptive color reduction (ACR) technique and a page layout analysis (PLA) approach. The ACR technique is used to obtain the optimal number of colors and to convert the document into the principal ones. Then, using the principal colors, the document image is split into separate color planes. Thus, binary images are obtained, each one corresponding to a principal color. The PLA technique is applied independently to each of the color planes and identifies the text regions. A merging procedure is applied in the final stage to merge the text regions derived from the color planes and to produce the final document. Several experimental and comparative results, exhibiting the performance of the proposed technique, are also presented. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Color segmentation; Color documents; Color quantization; Neural networks; Text extraction; Page layout analysis

1. Introduction

Document processing is an active research area with significant applications. Page layout analysis (PLA) is a necessary pre-processing stage for many document-processing systems such as optical character recognition, document retrieval and document compression. A key phase in such systems is the segmentation of a mixed-type document into text and non-text regions. Until recently, text extraction techniques had been developed only for monochrome documents [1–9]. These techniques can be classified as bottom-up, top-down and hybrid. A good description of PLA techniques for gray-scale mixed-type documents is given in Refs. [8,10].

∗ Corresponding author. Tel.: +30-541-79585; fax: +30-541-79569. http://ipml.ee.duth.gr/~papamark/Index.html. E-mail address: [email protected] (N. Papamarkos).

In recent years, there has been an increasing need for systems which are able to read and convert hardcopy color documents into digital format automatically. This process is important because much information is stored and processed as mixed-type color documents. In comparison to monochrome documents, the PLA of color documents faces additional difficulties associated with the large number of unique colors (more than 16 million), the distribution of colors, the many color schemes used and the overlapping of text with drawings and graphics. Moreover, in mixed-type color documents the colors of the characters are often not solid. Even the characters of a single word may have different colors, and the colors within a single character may be gradually distributed. Therefore, it is important, as a pre-processing stage, for the colors of the document image to be reduced to a suitably small number. This requires an intelligent technique that can estimate the optimal small number of the image principal colors and a color quantization

0031-3203/02/$22.00 © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(01)00167-4


C. Strouthopoulos et al. / Pattern Recognition 35 (2002) 1743–1758

algorithm that can determine the best values for the principal colors.

In the last few years, several PLA techniques for color documents have been proposed. Zhong et al. [11] proposed a hybrid technique for text extraction in complex color images, especially CD cover pages, that combines two methods. The first method segments the color image into connected components with uniform colors, and uses heuristics to classify the components as text or non-text regions. In this method, the color quantization process is based on a histogram smoothing approach that can result in 5–500 prototype colors. This color quantization method cannot be used to estimate an optimal number of document principal colors. The second method extracts text regions by examining the spatial variation in the gray-scale image of the document. For superior text identification, the results of the two methods are combined. The method fails when text regions are cropped around pictures [10]. The method of Suen and Wang [12] uses an edge-based color quantization algorithm that leads to good text extraction results if the color document has a uniform background. Chen and Chen have proposed an adaptive page segmentation technique for color documents [13]. This technique uses a color quantization algorithm that can reduce the colors of the documents to 42 or fewer. However, the color quantization algorithm used is heuristic and cannot determine the optimal color number and the principal colors of the documents. After color quantization, an edge-based block extraction is applied and coherent blocks are obtained. Then, in the block classification stage, geometrical features are used to decide whether a block is a text block or not. Finally, in a post-processing stage, the extracted blocks are merged according to their relative spatial positions. The method proposed by Sobottka et al. [14] uses two techniques for the establishment of hypotheses about text regions.
The first is a top-down analysis technique that first extracts rectangular blocks and then rejects the blocks with homogeneous colors. The second is a bottom-up region growing technique that searches for homogeneous regions of arbitrary shape. The final text regions are obtained by combining the results of the two methods. The color quantization scheme used is a histogram-based clustering algorithm proposed by Matas and Kittler [15]. According to this approach, a peak-clustering procedure is applied to the three-dimensional (3D) histogram of the image, which determines the principal peaks and then classifies the colors of the neighboring peaks. This technique seems similar to the hill-clustering approach used in gray-scale images [16], and its effectiveness depends on the form of the 3D histogram and the definition of the peaks.

In this paper, a new method for text extraction in complex color documents is proposed. The method includes several stages. In the first stage, an unsupervised adaptive color reduction (ACR) technique is used to

determine the optimal number of unique colors and to convert the image to the principal colors obtained. This technique is based on an unsupervised neural network classifier and on a tree-search procedure. Merging and split conditions are used during the adaptive process in order to decide whether color classes must be split or merged. An important advantage of the ACR technique is that the classifier can be fed not only with the image color values, but also with suitable spatial features that improve the separability of the text regions. In the second stage, the document is split into the color planes obtained by the ACR technique. These color planes are next considered as binary documents. In each one of the color planes, a PLA technique is applied based on a run length smoothing algorithm (RLSA) [17] and a neural network block classifier (NNBC). To improve the PLA results, the NNBC is fed by suitable texture spatial features. This technique is an improved version of the PLA technique used for gray-scale documents [8]. After the PLA of the color planes, a merging procedure is applied which merges the text blocks extracted in the color planes. Finally, the colors of the text regions are obtained again and reduced to two (one for the characters and one for the background) by the application of a self-organized feature map (SOFM) neural network [18]. Thus, in the segmented document, the characters appear with solid colors close to the colors of the original document. This is an important result for some applications such as compression, transmission and presentation of digital color documents.

The paper is organized as follows. Section 2 provides a brief discussion of the ACR technique. Section 3 describes the text identification procedure. Section 4 analyzes the text merging procedure. Finally, Section 5 presents experimental results confirming the effectiveness of the proposed method.

2. Optimal color reduction

In complex color documents, text and graphics may be distributed in millions of colors. This introduces difficulties when text extraction and recognition are needed. For the extraction of text regions, it is preferable that the colors of the characters be solid, i.e. that they appear with a small number of unique colors. In order to achieve this, a technique for optimal estimation of the principal colors of the documents is needed. To solve this problem, the ACR technique [19] can be used with suitable split and merging conditions. The ACR technique is a color quantization method that achieves color reduction using an adaptive tree clustering procedure. In each node of the tree, a self-organized neural network classifier (NNC) is used which is fed by image color values


Fig. 1. A tree scheme of the adaptive color reduction algorithm.

and additional local spatial features. The NNC consists of a principal component analyzer (PCA) and a Kohonen self-organized feature map (SOFM) neural network [18]. The output neurons of the NNC define the color classes for each node. The final image not only has the dominant image colors, but its texture also approaches the image local spatial features used. Split and merging conditions are used in order to define whether color classes must be split or merged. Giving suitable values to the parameters of these conditions, the ACR process leads to a small number of dominant color classes. In the following, a short description of the ACR technique is given.

2.1. The ACR technique

A color image can be considered as a set of n × m pixels, where the color of each pixel is a point in the color space. There are many color spaces used for color images. The ACR technique can be applied to any type of color space. A color space can be considered as a 3D vector space where each pixel (i, j) is associated with an ordered triple of color components (c1(i, j), c2(i, j), c3(i, j)). Therefore, a general color image function can be defined by the relation

\[
I_k(i,j) =
\begin{cases}
c_1(i,j) & \text{if } k = 1,\\
c_2(i,j) & \text{if } k = 2,\\
c_3(i,j) & \text{if } k = 3.
\end{cases}
\tag{1}
\]

Each primary color component represents an intensity, which varies from zero to a maximum value. Let N(i, j) denote the local neighboring region of pixel (i, j). Usually, N(i, j) is considered to be a 3 × 3 or a 5 × 5 mask, where the pixel (i, j) is the center pixel of the mask. Usually, the color of each pixel is associated with the colors of the neighboring pixels and the local texture of the image. Therefore, the color of each pixel (i, j) can be associated with local image characteristics extracted from the region N(i, j). These characteristics are considered as local spatial features of the image and can be helpful for the color reduction process. That is, using the values of the colors of N(i, j), local spatial features f_k, k = 4, ..., K + 3, can be defined that are considered next as image spatial features. This approach transforms the 3D color feature space of the classical color quantization techniques to a more advantageous one of K + 3 dimensions. No restrictions are implied about the type of the local features. However, in the case of color documents, the features must represent


Fig. 2. The structure of the neural network classifier.

spatial characteristics that are useful for converting the character colors to solid colors. Such spatial features are those extracted by using, for example, the mean, max and emboss filters.

According to the above analysis, the color reduction problem can be considered as the problem of best transforming the original color image to a new one, with only J colors, so that the final image approximates not only the principal color values, but also the local characteristics used. An effective approach to solve this problem is to consider it as a clustering problem and achieve its solution using suitable classifiers. Every time additional features are used, the dimensionality of the feature space increases and color classes that could not be split before can now be separated. This is useful in the case of color documents where text and background colors are similar. To solve this problem, the color reduction procedure is applied in an adaptive mode. Specifically, the ACR technique follows a tree structure (Fig. 1) with levels and nodes. In each level, an additional and proper set of features can be used so that new classes become visible. In each tree node, the color reduction technique is performed only on the pixels of the initial image that correspond to the color class obtained in the previous level. The entire procedure is terminated when all the nodes of the tree have been examined. In the final stage, the extracted color image components are merged by a simple ADD procedure.

The classifier used in each tree node is a powerful self-organized neural network classifier with the structure shown in Fig. 2. As can be observed, its form is similar to the NNBC and consists of a PCA and an SOFM neural network. The use of PCA is essential due to the multi-dimensionality of the feature space. The classifier optimally reduces the input feature space to a smaller one. The resultant feature space can be viewed as representative of the original feature space and therefore, it approximates statistical characteristics of the input space. The PCA has K + 3 input and N output neurons. Usually, N is taken equal to K + 3. It is used to increase the discrimination of the feature space. The SOFM has N input and J output neurons. The entire neural network is fed by the color values and the extracted additional features. After training, the neurons in the output competition layer of the SOFM define the J classes. Using the neural network, each image pixel is classified into one of the J classes and takes the color defined by the class.

2.2. Split and merging conditions

To avoid the split of compact classes, split conditions can be used during the adaptive clustering process. The split conditions define when a class must be split further. These conditions are applied locally to each tree node. In detail, the split conditions used are as follows:

(a) The variance of each class. A class is not split further if its variance is less than a threshold value.
(b) The variance of the class centers. The classes produced in each node of the tree are not split further if the variance of their centers is less than a threshold value.
(c) The increase of the variance. In each level of the tree, each class is accepted if its appearance increases the variance of the centers of the classes. Otherwise, the specific class will not be split further.
(d) The minimum number of pixels in each class. The number of pixels classified in each class cannot be less than a threshold value.
(e) The minimum number of samples. In each node of the tree, the number of training points, i.e. the number of pixels belonging to the node class, must exceed a threshold value.
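The split conditions can be read as a simple gate evaluated at each tree node. The following is a minimal sketch of conditions (a), (b) and (d)/(e) only; condition (c) needs the state of the whole tree level and is left out. The function names and all threshold values are illustrative assumptions, since the paper does not list the values it used.

```python
from statistics import pvariance


def color_variance(points):
    """Total variance of a list of (R, G, B) tuples: sum of the per-channel variances."""
    return sum(pvariance(channel) for channel in zip(*points))


def may_split(class_pixels, child_centers, *,
              var_min=100.0, center_var_min=50.0, min_pixels=30):
    """Return True if a node class may be split further.

    class_pixels  : colors of the pixels that belong to the class of this node
    child_centers : class centers proposed by the classifier for this node
    The three thresholds are hypothetical values chosen for illustration.
    """
    # (d)/(e): too few pixels or training samples, so the class is kept as is
    if len(class_pixels) < min_pixels:
        return False
    # (a): the class is already compact
    if color_variance(class_pixels) < var_min:
        return False
    # (b): the proposed child centers are too close to each other
    if color_variance(child_centers) < center_var_min:
        return False
    return True
```

A node holding only near-identical pixels is therefore left alone, while a node mixing dark text and light background passes all three checks and is split.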


Fig. 3. The original color document.

Additionally, in order to increase the clustering capabilities, merging conditions are used in each level of the tree, defining when classes are close enough and therefore must be merged. Specifically, in each level of the tree, the Euclidean distances between the classes are determined and then the classes that have distances less than a threshold value, expressed as a percentage of the mean distance, are merged. A typical threshold value for


Fig. 4. Binary documents corresponding to the four color planes.


Fig. 5. (a) Text regions identified in the binary document of Fig. 4(a). (b) Text regions identified in the binary document of Fig. 4(c).

Table 1. RGB values of the four principal colors

R     G     B
91    88    87
194   183   176
193   227   243
248   247   250

the merging conditions is 50%. This procedure is important because it leads to good color segmentation results by merging color classes that come from different paths of the tree. By suitably selecting the split and merging conditions, the ACR technique can be used to identify the optimal number and the dominant colors of a document. This procedure is important because it not only merges color-cohesive regions but is also used for the decomposition of the document into a small number of separate color planes. However, the experimental results show that the proposed text extraction technique performs well if the number of reduced colors lies between 4 and 8.
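The merging condition can be sketched as follows, under two assumptions that the text does not settle: classes are merged transitively (if A is close to B and B to C, all three end up together), and the merged center is the plain average of the member centers rather than a pixel-weighted one. The function name `merge_close_classes` is hypothetical.

```python
from itertools import combinations
from math import dist


def merge_close_classes(centers, ratio=0.5):
    """Merge class centers closer than ratio * (mean pairwise Euclidean distance).

    centers : list of (R, G, B) tuples, one per color class
    ratio   : threshold as a fraction of the mean distance (0.5 = the 50% quoted in the text)
    """
    n = len(centers)
    pairs = list(combinations(range(n), 2))
    if not pairs:
        return [tuple(c) for c in centers]
    mean_d = sum(dist(centers[i], centers[j]) for i, j in pairs) / len(pairs)

    # Union-find over class indices, so that merging is transitive (an assumption).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs:
        if dist(centers[i], centers[j]) < ratio * mean_d:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(centers[i])
    # Unweighted average of the members as the merged center (an assumption).
    return [tuple(sum(ch) / len(g) for ch in zip(*g)) for g in groups.values()]
```

For instance, two nearly identical dark classes reached through different tree paths collapse into one, while the remaining classes are left untouched.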

2.3. Image sub-sampling

The ACR technique can be applied to color documents without any sub-sampling. However, for large documents, and in order to reduce the computational time and memory requirements, it is preferable to work with a sub-sampled version of the original document. To do this, a fractal scanning process, based on the well-known Hilbert space-filling curve [20], is used. To improve the sub-sampling process further, samples are taken, using a random process, not only on the peaks of the fractal curve but also in the neighborhood of the peaks. Thus, we can better adjust the number of samples and capture the local image characteristics. Also, for better training results, in each epoch of the training procedure the samples are different and are taken using a clockwise procedure from the random samples of the fractal peaks.
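As a rough illustration of sampling along a Hilbert curve, the sketch below uses the standard index-to-coordinate conversion for a square grid whose side is a power of two, then jitters each sample position randomly. Treating every `step`-th curve point as a sampling position is an assumption made here for simplicity; the paper's exact definition of the curve "peaks" and its clockwise selection rule are not reproduced. Both function names are hypothetical.

```python
import random


def hilbert_d2xy(n, d):
    """Map index d (0 <= d < n*n) along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:            # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_samples(n, step, jitter=1, seed=0):
    """Pick every step-th point of the curve, displaced randomly by up to `jitter` pixels."""
    rng = random.Random(seed)
    samples = []
    for d in range(0, n * n, step):
        x, y = hilbert_d2xy(n, d)
        x = min(n - 1, max(0, x + rng.randint(-jitter, jitter)))
        y = min(n - 1, max(0, y + rng.randint(-jitter, jitter)))
        samples.append((x, y))
    return samples
```

Changing `seed` from one training epoch to the next yields a different sample set per epoch, which is in the spirit of the procedure described above.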


Table 2. RGB values of the Cb, C1 and C2 color classes

      Cb                  C1                  C2
R     G     B       R     G     B       R     G     B
91    88    87      96    92    112     227   174   197
91    88    87      98    80    101     220   223   213
91    88    87      81    78    72      234   238   238
91    88    87      94    92    85      242   245   246
91    88    87      81    79    75      203   207   206

Fig. 6. Text regions before and after the local estimation of the character colors.

2.4. The stages of the method

Summarizing, the stages of the ACR technique are:

Stage 1. Define the desired maximum number J of the final colors.
Stage 2. Define the type of the spatial features that will be used in each level of the tree.
Stage 3. Define the split and merging conditions.
Stage 4. Construct the PCA and SOFM neural networks.
Stage 5. Define the sub-sampling parameters.
Stage 6. In each tree node, train the neural network with the pre-defined features and then, using the neural network, transform the colors of the original image to those obtained.
Stage 7. At the end of the adaptive process, construct the final image by merging the separated color regions obtained.

3. Identification of text areas

Each one of the color planes, obtained after the application of the ACR procedure, is considered as a new binary document. For example, the application of the ACR algorithm, using suitable split and merging conditions, on the color document shown in Fig. 3 results in four unique colors that decompose the document into the four binary documents shown in Fig. 4. Table 1 shows the RGB values of each color. Due to the color reduction, some pixels of the original color document appear as noise in the obtained binary documents. This happens mainly in the regions of the image edges, where the colors of the pixels are graded between the object and the background colors. For example, the binary document shown in Fig. 4(c) contains many pixels lying on the character borders of the binary document shown in Fig. 4(a). Also, in the binary document shown in Fig. 4(b), such undesirable pixels appear in the graphics regions. As can be observed in Fig. 4(d), a second special characteristic of the obtained binary documents is the appearance of the character regions as white shapes in background color planes. Due to this, undesirable classification results may appear during the final classification stage of the PLA procedure.

In order to identify the text regions in each one of the color planes, a modified version of the PLA technique proposed by Strouthopoulos et al. [8] is applied. Originally, this technique was used for the decomposition of monochrome documents into text, drawing and graphic regions. It is based on the use of an RLSA and an NNBC. In this approach, for block extraction, the RLSA is applied globally and locally to every color plane. An important advantage of the RLSA used is its ability to estimate, according to the structure of the documents, appropriate smoothing values. Texture spatial features extracted from the blocks feed the NNBC, which classifies the extracted blocks as text, drawing or graphic blocks.
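To make the decomposition and smoothing steps concrete, here is a minimal sketch that splits a quantized image into one binary plane per principal color and applies a horizontal run-length smoothing pass to a single row. It assumes the image is already given as a 2D list of principal-color indices, uses a fixed smoothing threshold instead of the adaptive values estimated by the RLSA of Ref. [8], and only fills background runs bounded by object pixels on both sides. The function names are hypothetical.

```python
def color_planes(labels, n_colors):
    """Split a label image into binary planes.

    labels   : 2D list where labels[y][x] is the principal-color index of the pixel
    n_colors : number of principal colors
    Returns a list of n_colors binary images (1 = pixel has that color).
    """
    return [[[1 if v == k else 0 for v in row] for row in labels]
            for k in range(n_colors)]


def rlsa_horizontal(row, threshold):
    """Horizontal run-length smoothing of one binary row.

    Runs of 0s no longer than `threshold` that lie between two 1s are set to 1,
    which links neighboring characters into blocks.
    """
    out = list(row)
    last_one = None
    for i, v in enumerate(row):
        if v == 1:
            gap = i - last_one - 1 if last_one is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out
```

Applying `rlsa_horizontal` to every row of a plane (and the same routine to columns for a vertical pass) gives the smeared image from which connected blocks are extracted.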
The complete analysis of the NNBC is given by Strouthopoulos and Papamarkos [7,8]. In the modified version used in this work, the final block classification is based not only on the NNBC, but also on additional classification criteria. These criteria are applied to every block, before the classification by the NNBC, in order to (a) reject blocks constructed of noise pixels, and (b) classify the white character blocks that appear in the background color planes as non-text blocks. Considering that “1” represents object pixels and “0” background pixels, the classification criteria are based on the following block characteristics:

• The block width W.
• The block height H.
• The block area A = WH.
• The number Op of object pixels in the block.
• The number C of “0” to “1” and “1” to “0” transitions in the block.
• The number VL of complete top-to-bottom vertical background lines in the block.

Using these characteristics, a block is classified as a non-text block if at least one of the following criteria is satisfied.

(a) The percentage of complete top-to-bottom vertical background lines must be less than a threshold value T (usually T is taken equal to 0.1):

\[
\frac{V_L}{W} < T. \tag{2}
\]

The density of foreground pixels must satisfy the conditions:

\[
\frac{O_p}{A} < 0.07 \quad \text{or} \quad \frac{O_p}{A} > 0.3. \tag{3}
\]

Fig. 7. The final text regions extracted for the color document of Fig. 3.

Fig. 8. (a) The original document. (b) Document with only five colors.
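The block characteristics and rejection tests can be sketched as below. The sketch covers the vertical-line test of Eq. (2), the density test of Eq. (3) and the transition test that the paper gives as Eq. (4) (C/Op below 0.25 or above 2). Two points are assumptions: transitions are counted along rows only, since the scan direction is not stated, and an empty block is rejected outright to avoid dividing by zero. The function names are hypothetical.

```python
def block_features(block):
    """Compute (W, H, A, Op, C, VL) for a binary block given as a 2D list of 0/1."""
    H = len(block)
    W = len(block[0])
    Op = sum(sum(row) for row in block)                      # object pixels
    # 0->1 and 1->0 transitions, counted along each row (assumed direction)
    C = sum(1 for row in block for a, b in zip(row, row[1:]) if a != b)
    # columns that contain background only, from top to bottom
    VL = sum(1 for x in range(W) if all(row[x] == 0 for row in block))
    return W, H, W * H, Op, C, VL


def is_non_text(block, T=0.1):
    """Return True if at least one rejection criterion marks the block as non-text."""
    W, H, A, Op, C, VL = block_features(block)
    if Op == 0:                      # nothing to classify (assumption)
        return True
    return (VL / W < T                                   # Eq. (2)
            or Op / A < 0.07 or Op / A > 0.3             # Eq. (3)
            or C / Op < 0.25 or C / Op > 2)              # Eq. (4)
```

Blocks that survive this filter would then be passed to the NNBC for the final text or non-text decision.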


Fig. 9. Final text extraction results.

Fig. 10. (a) The original document. (b) Document with only five colors.


(b) The percentage of transitions must be in the ranges:

\[
\frac{C}{O_p} < 0.25 \quad \text{or} \quad \frac{C}{O_p} > 2. \tag{4}
\]

After the rejection of the non-text blocks by the above classification criteria, the rest of the blocks are fed to the NNBC. The NNBC, using the 34 spatial texture features defined in Ref. [8], classifies the blocks as text or non-text blocks. Briefly, the PLA technique applied on every color plane consists of the following stages:

• Separation of the document into 8×8 regions, identification of text regions using the NNBC and, for each region, estimation of the local values of the horizontal mean character distance.
• Global application of the RLSA using as smoothing values the smallest of the mean character distance values.
• Identification of high marks (groups of connected components).
• Classification of high marks by using the NNBC.
• Local application of the RLSA on the remaining marks by using local smoothing values.
• Filtering the extracted blocks using the three rejection criteria defined above.
• Classification of the remaining blocks as text or non-text regions by using the spatial features and the NNBC.

Fig. 5 shows the text regions obtained by the application of the above PLA technique on the color planes of the initial color document shown in Fig. 4.

Fig. 11. Final text extraction results.

4. Merging text areas

After the application of the PLA technique on every color plane, each of the identified text blocks is considered as a text region of the original color document. The character pixels of these regions have been defined by the global application of the color reduction procedure. As an optional stage, and in order to improve the character shapes, the text and background colors in each text region are re-defined locally. In order to achieve this, the two principal colors (the character color and the local background color) are obtained using an SOFM neural network. In this case, the neural network has only two output neurons, one for each color. The three input neurons of the SOFM are fed by the RGB values of the block pixels. This results in the local estimation of the two color classes representing the colors of the character pixels and the colors of the local background pixels, respectively. Then, it is necessary to determine which of the two classes corresponds to the character color. To solve this problem, the following procedure is applied:

• Let b denote one of the color planes, corresponding to the Cb(Rb, Gb, Bb) color class obtained by the application of the ACR technique.
• Let tb denote a text region in b.
• Let C1(R1, G1, B1) and C2(R2, G2, B2) be the color classes obtained by the application of the SOFM neural network classifier on the color document pixels included in region tb.

Among the classes C1 and C2, the class corresponding to the character color is the one whose RGB color is nearest (according to the Euclidean distance) to the Cb class. Therefore, a pixel p of a text region tb is considered as a character pixel if its RGB color, taken from the initial color document, is nearest to the character color class. Table 2 shows the values of the C1, C2 and Cb color classes for the text regions of Fig. 5(a). Fig. 6 shows the character pixels of the text regions, before and after the local estimation of the character color in each region. Fig. 7 shows the final text regions extracted.
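The local two-color estimation and the choice of the character class can be sketched as follows. The training loop is a simplified stand-in for the SOFM: with only two output neurons it reduces to winner-take-all competitive learning, and no neighborhood function is used. The initialization from the darkest and brightest pixel, the learning-rate schedule and all function names are assumptions made for illustration.

```python
import random
from math import dist


def two_color_sofm(pixels, epochs=10, lr0=0.5, seed=0):
    """Estimate two color classes from the RGB pixels of one text region.

    Simplified competitive learning with two units (a stand-in for a 2-neuron SOFM).
    """
    rng = random.Random(seed)
    ordered = sorted(pixels, key=sum)
    # Initial weights: darkest and brightest pixel of the region (assumption)
    weights = [list(ordered[0]), list(ordered[-1])]
    data = list(pixels)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)        # linearly decaying learning rate
        rng.shuffle(data)
        for p in data:
            w = min(weights, key=lambda u: dist(u, p))   # winning neuron
            for k in range(3):
                w[k] += lr * (p[k] - w[k])               # move winner toward the pixel
    return [tuple(w) for w in weights]


def character_class(c1, c2, cb):
    """Pick the class nearest (Euclidean distance) to the plane color Cb."""
    return c1 if dist(c1, cb) <= dist(c2, cb) else c2


def character_mask(pixels, c_char, c_back):
    """Mark each pixel as character (True) if it is nearer to the character class."""
    return [dist(p, c_char) < dist(p, c_back) for p in pixels]
```

With the first row of Table 2, `character_class((96, 92, 112), (227, 174, 197), (91, 88, 87))` selects C1, which agrees with the rule stated above.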


Fig. 12. (a) Original color document and (b) Document after color quantization.

5. Examples

The proposed technique has been tested with a variety of color documents. Due to space limitations, only three additional examples are presented. The documents of these examples are all of 200 dpi resolution. The experimental results were obtained using a Pentium III 500 MHz computer.

Example 1. This example demonstrates the application of the proposed text extraction technique on the complex color document of Fig. 8(a). The size of the document is 1433 × 1246 pixels and the number of unique colors is 332,627. The background of this document has two principal color tones. In addition, this document has white text and background regions. Using suitable values for the split and merging conditions, only five principal colors have been obtained by the ACR technique. Therefore, the RGB colors of the document are reduced to (61, 69, 114), (126, 134, 157), (190, 191, 199), (219, 78, 61) and (254, 254, 249), and the document takes the form shown in Fig. 8(b). The proposed technique, after 65 s, gives the text extraction results shown in Fig. 9. It can be observed that the majority of the text areas are correctly obtained.


Fig. 13. The four color planes of the document.

Fig. 14. The final text extraction results.


Example 2. In the second example, the size of the color document, shown in Fig. 10(a), is 1053 × 1646 pixels and it has 321,997 unique colors. The use of the ACR technique leads to the document of Fig. 10(b), with the five principal RGB colors: (79, 75, 104), (145, 122, 112), (231, 145, 82), (244, 210, 87) and (247, 244, 235). The interesting point of this example is the gradual change of the background colors. The application of the proposed text extraction technique, after 43 s, gives the text extraction results shown in Fig. 11. The text areas that are not well defined are those that have both small characters and non-solid colors.

Example 3. This example demonstrates the application of the proposed text extraction technique to the complex color document shown in Fig. 12(a). The size of the document is 1104 × 616 pixels and it has 179,473 unique colors. The RGB values of the four color planes obtained are (237, 233, 238), (230, 81, 18), (138, 136, 152) and (9, 22, 100), and the corresponding document with the reduced colors is depicted in Fig. 12(b). Fig. 13 shows the binary documents that correspond to the color planes obtained. In this example, the white color of the characters varies strongly. For this reason, as can be observed in Fig. 13(a), the shapes of the characters included in the white color plane (237, 233, 238) are of poor quality. However, as depicted in Fig. 14, the local re-estimation of the character and background colors, applied in the final merging stage, improves the character shapes of the extracted text. The computation time required for this example is about 48 s. The reasons why some text areas are not extracted correctly are the low resolution, the small size of the characters and their non-solid colors.

6. Conclusions

In this paper, a text extraction method for complex color documents is proposed. The color quantization technique used in the first stage of the method is the ACR algorithm.
Using suitable split and merging conditions, the ACR technique can be used to determine the optimal reduced number of the document principal colors. In the next stage, the document with the reduced colors is decomposed into binary documents that correspond to the color planes. In each color plane, a PLA technique is applied and text regions are identified. To improve the shapes of the characters, the two color tones of each text region can be re-defined using an SOFM neural network. Finally, using a merging procedure, the extracted text regions are merged and the final text regions are obtained. The proposed text extraction technique is relatively robust to variations in font, color, or size of the text. Experimental results also reveal the feasibility and the effectiveness of the proposed approach.


References

[1] K.C. Fan, C.H. Liu, Y.K. Wang, Segmentation and classification of mixed text/graphics/image documents, Pattern Recognition Lett. 15 (1994) 1201–1209.
[2] A. Jain, S. Bhattacharjee, Text segmentation using Gabor filters for automatic document processing, Mach. Vision Appl. 5 (1992) 169–184.
[3] A. Jain, Y. Zhong, Page segmentation using texture analysis, Pattern Recognition 29 (1996) 743–770.
[4] L. O’Gorman, The document spectrum for page layout analysis, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 1162–1173.
[5] L. O’Gorman, R. Kasturi, Document image analysis, IEEE Computer Society Press, Silver Spring, MD, 1995.
[6] C. Strouthopoulos, N. Papamarkos, C. Chamzas, Identification of text-only areas in mixed type documents, Eng. Appl. Artif. Intell. 10 (1997) 387–401.
[7] C. Strouthopoulos, N. Papamarkos, Text identification for document image analysis using a neural network, Image Vision Comput. (special issue on Document Image Process. Multimedia Environ.) 16 (1998) 879–896.
[8] C. Strouthopoulos, N. Papamarkos, C. Chamzas, PLA using RLSA and a neural network, Eng. Appl. Artif. Intell. 12 (1999) 119–138.
[9] V. Wu, R. Manmatha, E.M. Riseman, TextFinder: an automatic system to detect and recognize text in images, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1999) 1224–1229.
[10] P. Parodi, R. Fontana, Efficient and flexible text extraction from document pages, Int. J. Document Anal. Recognition 2 (1999) 67–79.
[11] Y. Zhong, K. Karu, A.K. Jain, Locating text in complex color image, Pattern Recognition 28 (1995) 1523–1535.
[12] M. Suen, J.F. Wang, Text string extraction from images of color printed documents, Proceedings of the IPPR Conference Computer Vision, Graphics Image Processing, Taiwan, 1995, pp. 534–541.
[13] W.Y. Chen, S.Y. Chen, Adaptive page segmentation for color technical journals cover images, Image Vision Comput. 16 (1998) 855–877.
[14] K. Sobottka, H. Kronenberg, T. Perroud, H. Bunke, Text extraction from colored book and journal covers, Int. J. Document Anal. Recognition 2 (2000) 163–176.
[15] J. Matas, J. Kittler, Spatial and feature space clustering: application in image analysis, Proceedings of the International Conference on Computer Analysis of Images and Patterns, 1995, pp. 162–173.
[16] N. Papamarkos, B. Gatos, A new approach for multithreshold selection, Comput. Vision Graph. Image Process. Graph. Models Image Process. 56 (1994) 357–370.
[17] F.M. Wahl, K.Y. Wong, R.G. Casey, Block segmentation and text extraction in mixed text/image documents, Comput. Graph. Image Process. 20 (1989) 375–390.
[18] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, 1994.
[19] N. Papamarkos, A. Atsalakis, Gray-level reduction using local spatial features, Comput. Vision Image Understanding 78 (2000) 336–350.
[20] H. Sagan, Space-Filling Curves, Springer, New York, 1994.


About the Author: NIKOS PAPAMARKOS was born in Alexandroupoli, Greece, in 1956. He received his Diploma Degree in Electrical and Mechanical Engineering from the University of Thessaloniki, Thessaloniki, Greece, in 1979 and the Ph.D. Degree in Electrical Engineering from the Democritus University of Thrace, Greece, in 1986. From 1987 to 1990, Dr. Papamarkos was a Lecturer, and from 1990 to 1996 an Assistant Professor, at the Democritus University of Thrace, where he has been an Associate Professor since 1996. During 1987 and 1992, he also served as a Visiting Research Associate at the Georgia Institute of Technology, USA. His current research interests are in digital image processing, document image analysis, computer vision, pattern recognition, neural networks, digital signal processing and optimization algorithms. Dr. Nikos Papamarkos is a member of IEEE and of the Technical Chamber of Greece.

About the Author: ATSALAKIS E. ANTONIS received the Diploma in Electrical and Computer Engineering from Democritus University of Thrace, Greece, in 1999. He is currently a research and teaching assistant and is studying towards the Ph.D. degree at the Department of Electrical and Computer Engineering, Democritus University of Thrace. His research interests include color document image processing and analysis, neural networks and pattern recognition. He is a member of the Technical Chamber of Greece.

About the Author: CHARALAMPOS P. STROUTHOPOULOS was born in Drama, Greece, in 1962. He received his Diploma Degree in Electrical Engineering from the University of Patras in 1985 and his Ph.D. degree in 1999 from the Electrical and Computer Engineering Department of Democritus University of Thrace, Xanthi, Greece. His Ph.D. thesis is on Document Image Analysis Techniques. His main research interests are in digital image processing, document image analysis, color quantization and pattern recognition. Dr. Strouthopoulos is a member of the Technical Chamber of Greece.
