Pattern Recognition, Vol. 31, No. 12, pp. 2055–2076, 1998. © 1998 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Printed in Great Britain. 0031-3203/98 $19.00+0.00
PII: S0031-3203(98)00067-3
AUTOMATIC TEXT LOCATION IN IMAGES AND VIDEO FRAMES

ANIL K. JAIN and BIN YU

Department of Computer Science, Michigan State University, East Lansing, MI 48824-1027, U.S.A.

(Received 22 January 1998; in revised form 23 April 1998)

Abstract—Textual data is very important in a number of applications such as image database indexing and document understanding. The goal of automatic text location without character recognition capabilities is to extract image regions that contain only text. These regions can then be either fed to an optical character recognition module or highlighted for a user. Text location is a very difficult problem because the characters in text can vary in font, size, spacing, alignment, orientation, color and texture. Further, characters are often embedded in a complex background in the image. We propose a new text location algorithm that is suitable for a number of applications, including conversion of newspaper advertisements from paper documents to their electronic versions, World Wide Web search, color image indexing and video indexing. In many of these applications it is not necessary to extract all the text, so we emphasize extracting important text with large size and high contrast. Our algorithm is very fast and has been shown to be successful in extracting important text in a large number of test images. © 1998 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Automatic text location    Web search    Image database    Video indexing    Multivalued image decomposition    Connected component analysis

1. INTRODUCTION
Textual data carry useful and important information. People routinely read text in paper-based documents, on television screens, and over the Internet. At the same time, optical character recognition (OCR) techniques(1) have advanced to a point where they can be used to automatically read text in a wide range of environments. Compared with the general task of object recognition, text is composed of a set of symbols, which are arranged with some placement rules. Therefore, it is easier for a machine to represent, understand, and reproduce (model) textual data. Generally, we have two goals in automatic text processing: (i) convert text from paper documents to their electronic versions (e.g. technical document conversion(2)); (ii) understand the document (e.g. image, video, paper document) using the text contained in it. It is this second goal which plays an important role in Web search, color image indexing, image database organization, automatic annotation and video indexing, where only important text is desired to be located (e.g. book titles, captions, labels and some key words). Automatic text location (the first step in automatic or semi-automatic text reading) is to locate regions that just contain text from various text carriers without recognizing characters contained in the text. The expected variations of text in terms of character font, size and style, orientation, alignment, texture and color embedded in low contrast and complex background images make the problem of automatic text location very difficult. Furthermore, a high speed of text location is desired in most applications.
We define a text as coded text if it is represented by some code from which its image can be reproduced with a predefined font library. Examples of coded text can be found in PostScript-formatted files and in the files used by many word processing software packages, where characters are represented in ASCII code or Unicode. On the other hand, a text is defined as pixel text if it is represented by image pixels. In other words, pixel text is contained in image files. Sometimes, both these types of text appear in the same document. For instance, a web page often consists of both coded text and pixel text. Figure 1 depicts a part of a web page which consists of two images and a line of coded text "Department of Computer Science", as indicated by its source code shown in Fig. 2. The ASCII code for the coded text can be read directly from the source code, while the pixel text "Michigan State University" is contained in the image named "msurev1.gif". The problem of automatic text location is mainly concerned with pixel text. Several approaches to text location have been proposed for specific applications such as page segmentation,(2) address block location,(3,4) form dropout(5) and graphics image processing.(6) In these applications, images generally have a high resolution and the requirement is that all the text regions be located. Two primary methods for text location have been proposed in the literature. The first method regards regions of text as textured objects and uses well-known methods of texture analysis,(7) such as Gabor filtering(8) and spatial variance,(9) to automatically locate text regions.
Fig. 1. A part of a web page.
<a href="http://www.msu.edu/"><IMG align=left src="/img/misc/msurev1.gif" WIDTH=229 HEIGHT=77></a>
<a href="http://www.egr.msu.edu"><IMG align=right src="/img/misc/engpic3.gif" WIDTH=227 HEIGHT=77></a>
<br><br><br><br><br><br>
<center><h1>Department of Computer Science</h1></center>

Fig. 2. Source code of the web page in Fig. 1.
This use of texture for text location is sensitive to character font size and style. Further, this method is generally time-consuming and cannot always give an accurate text location, which may reduce the performance of OCR when it is applied to the extracted characters. Figure 3(b) shows the horizontal spatial variance proposed by Zhong et al.(9) for the image in Fig. 3(a). The text location results are shown in Fig. 3(c), where there is some unpredictable offset. The second method of text location uses connected component analysis.(2,3,5,10,11) This method has a higher processing speed and localization accuracy; however, it is applicable only to binary images. Most black and white documents can be regarded as two-valued images. On the other hand, color documents, video frames and pictures of natural scenes are multivalued images. To handle various types of documents, we localize text through multivalued image decomposition. In this paper we will introduce: (i) multivalued image decomposition, (ii) foreground image generation and selection, (iii) color space reduction, and (iv) text location using statistical features. The proposed method has been applied to the problem of locating text in a number of different domains, including classified advertisements, embedded text in synthetic web images, color images and video frames. The significance of
automatic text location in these problems is summarized below.

1.1. Conversion of newspaper advertisements

The World Wide Web (WWW) is now recognized as an excellent medium for information exchange. As a result, the number of applications which require converting paper-based documents to hypertext is growing rapidly. Most newspaper and advertisement agencies would like to put a customer's advertisements onto their web sites at the same time as they appear in the newspaper. Figure 4(a) shows an example of a typical newspaper advertisement. Since the advertisements sent to these agencies are not always in the form of coded text, there is a need to automatically convert them to electronic versions which can then be used in automatically generating Web pages. Although these images are mostly binary, both black and white objects can be regarded as foreground due to text reversal. The text in advertisements varies in terms of font, size, style and spacing. In addition to text, the advertisements also contain some graphics, logos and symbolized rulers. We use a relatively high scan resolution (150 dpi) for these images because (i) they are all binary, so the storage requirements are not severe, and (ii) all the text in an advertisement, irrespective of its font, size and style, must be located for this application.
Fig. 3. Text location by texture analysis: (a) original image; (b) horizontal spatial variance; (c) text location (shown in rectangular blocks).
1.2. Web search

Since 1993, the number of web servers has been doubling nearly every three months(12) and now exceeds 476,000.(13) Text, which can be either coded text or pixel text, is one of the most important components of a web page. Through the information superhighway, users can access any authorized site to obtain information of interest. This has created the problem of automatically and efficiently finding useful pages on the web. To obtain the desired information from this humongous source, a coded text-based search engine (e.g. Yahoo, Infoseek, Lycos and AltaVista) is commonly used. For instance, the AltaVista search engine processes more than 29 million requests each day.(13) Because of the massive increase in network bandwidths and disk capacities, more and more web pages now contain images for better visual appearance and richer information content. These images, especially the pixel text embedded in them, provide search engines with additional cues to accurately retrieve the desired information. Figure 4(c) shows one such example. Therefore, a multimedia search engine which can use the information in coded text as well as in pixel text, images, video and audio is desired for the information superhighway. Most web images are computer created and are called synthetic images. Text in web page images varies in font, color, size and style, even within the same page. Furthermore, the color and texture of the text and its background may also vary from one part of the page to another. For these reasons, it is very difficult to locate text in Web images automatically without utilizing character recognition capabilities. Only a few simple approaches have been published for text location in Web images.(14)

1.3. Color image databases

A color image can be captured by a scanner or a camera. Figure 4(b) shows a color image scanned
from a magazine cover. Automatically locating text in color images has many applications, including image database search, automatic annotation and image database organization. Some related work can be found in vehicle license plate recognition.(15)

1.4. Video indexing

The goal of video indexing is to retrieve a small number of video frames based on user queries. A number of approaches have been proposed which retrieve video frames using the texture,(16) shape(17) and color(18) information contained in the query. At the same time, word spotting(19) and speech recognition(20) techniques have been used in searching for dialogue and narration for video indexing. Both caption text and non-caption text on objects contained in a video can be used in interactive and automatic indexing, which is the major objective of text location for video. Figure 4(d) shows a video frame which contains text. Some related work has been done for image and video retrieval where the search cues use visual properties of specific objects and captions in video databases.(9,21–23) Lienhart and Stuber(23) assume that text is monochromatic and is generated by video title machines.

1.5. Summary

There are essentially two different classes of applications involved in our work on automatic text location: (i) document conversion and (ii) web searching and image and video indexing. The first class of applications, which mostly involves binary images, requires that all the text in the input image be located. This necessitates a higher image resolution. On the other hand, the most important requirements for the second class of applications are (i) a high speed of text location, and (ii) extraction of only the important text in the input image. Usually, the larger the font size of the text, the more important it is. Text which is very small in size cannot be recognized easily by OCR engines anyway.(24) Since the important text in an image appears mainly in the horizontal direction, our method tries to extract only horizontal text of relatively large size.
Fig. 4. Examples of input images for automatic text location applications: (a) classified advertisement in a newspaper; (b) color scanned image; (c) web image; (d) video frame.
Because some non-text objects can be subsequently rejected by an OCR module, we minimize the probability of missing text (false dismissal) at the cost of increasing the probability of detecting spurious regions (false alarm). Figure 5 gives an overview of the proposed system. The input can be a binary image, a synthetic web image, a color image or a video frame. After color reduction (bit dropping and color clustering) and multivalued image decomposition, the input image is decomposed into multiple foreground images.
Individual foreground images go through the same processing steps, so the connected component analysis and text identification modules can be implemented in parallel on a multiprocessor system to speed up the algorithm. Finally, the outputs from all the channels are composed together to locate the text in the input image. The location of each piece of text is represented in terms of the coordinates of its bounding box. In Section 2 we describe the decomposition method for multivalued images, including color space reduction. The connected component analysis applied to the foreground images is explained in Section 3.
Fig. 5. Automatic text location system.
Section 4 introduces textual features, text identification and text composition. Finally, we report the results of experiments in a number of applications and discuss the performance of the proposed system in Section 5.
2. MULTIVALUED IMAGE DECOMPOSITION

An image $I$ is multivalued if its pixel values $u \in U = \{0, 1, \ldots, U-1\}$, where $U$ is an integer, $U > 1$. Let pixels with value $u_0 \in U$ be object pixels and all pixels with value $u \in U$, $u \neq u_0$, be non-object pixels. A $U$-valued image can be decomposed into a set of $U$ element images $I = \{I_i\}$, where
$$\bigcup_{i=0}^{U-1} I_i = I, \qquad I_i \cap I_j = \emptyset \ \ (i \neq j).$$
Figure 6(b) depicts the nine element images of the multivalued image shown in Fig. 6(a), which consists of $U = 9$ different pixel values. Furthermore, all object pixels are set to 1 and non-object pixels are set to 0. We assume that text represented with a nearly uniform color can be composed of one or several color values; such text is regarded as real foreground text. An example of real foreground text is shown in Fig. 7(a). On the other hand, text consisting of various colors and textures is assumed to be located on a background with a nearly uniform color, and can be regarded as background-complementary foreground text. An example of background-complementary foreground text is shown in Fig. 7(b). Therefore, an image $I$ can always be completely separated into a foreground image $I_F$ and a background image $I_B$, where $I_F \cup I_B = I$ and $I_F \cap I_B = \emptyset$. Theoretically, a $U$-valued image can generate up to $(2^U - 2)$ different foreground images. A foreground image is called a real foreground image if it is produced such that
$$I_{RF} = \bigcup_{I_m \in \Omega_{RF}} I_m, \qquad \Omega_{RF} \subset I,$$
where $\Omega$ denotes a set of element images of $I$. So, we can construct a real foreground image by combining element images, which are easily extracted. A foreground image is a background-complementary foreground image if it is produced such that
$$I_{BCF} = I - I_B, \qquad I_B = \bigcup_{I_m \in \Omega_B} I_m, \qquad \Omega_B \subset I,$$
where $I_B$ is the background of $I_{BCF}$. In this case, a background image is easier to extract. Note that the union operation in constructing the real foreground images and the background images of background-complementary foreground images is simplified by the color space reduction discussed in Section 2.3. For the image in Fig. 6(a), the union of the four element images with pixel values 1, 2, 3 and 4 generates the real foreground image shown in Fig. 8(a). Let the element image with value 9 be the background; then the corresponding background-complementary foreground image is shown in Fig. 8(b).
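As a concrete illustration of these definitions, the following sketch (our own illustration, not the original implementation; the toy image and all function names are hypothetical) decomposes a label image into its element images and constructs a real foreground image and a background-complementary foreground image from chosen value sets.

import numpy as np

def element_images(img):
    # Decompose a U-valued image into binary element images {I_i}, one per
    # occupied pixel value, with object pixels set to 1 and the rest to 0.
    return {u: (img == u).astype(np.uint8) for u in np.unique(img)}

def real_foreground(img, values):
    # Real foreground image: union of the element images whose pixel values
    # are in `values`.
    return np.isin(img, list(values)).astype(np.uint8)

def background_complementary_foreground(img, background_values):
    # Background-complementary foreground image I_BCF = I - I_B, where the
    # background I_B is the union of the listed element images.
    return (~np.isin(img, list(background_values))).astype(np.uint8)

if __name__ == "__main__":
    # Toy 9-valued image, loosely analogous to Fig. 6(a).
    rng = np.random.default_rng(0)
    img = rng.integers(1, 10, size=(8, 8))
    elems = element_images(img)                               # up to U element images
    fg_real = real_foreground(img, {1, 2, 3, 4})              # cf. Fig. 8(a)
    fg_bcf = background_complementary_foreground(img, {9})    # cf. Fig. 8(b)
    print(len(elems), int(fg_real.sum()), int(fg_bcf.sum()))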
Fig. 6. A multivalued image and its element images: (a) color image; (b) nine element images.
Fig. 7. Examples of text: (a) a real foreground text; (b) a background-complementary foreground text.
In our system, each element image can be selected as a real foreground image if it contains a sufficient number of object pixels. On the other hand, we generate at most one background-complementary foreground image for each multivalued image: the background image $I_B$ is set to the element image with the largest number of object pixels, or to the union of this element image and the element image with the second largest number of object pixels if the latter count is larger than a threshold.

2.1. Binary images

The advertisement images of interest to us are binary images (see Fig. 4(a)), for which $U = 2$.
A binary image has only two element images, the given image and its inverse, each being a real foreground image or a background-complementary foreground image with respect to the other.

2.2. Pseudo-color images

For web images, GIF and JPEG are the two most popular image formats because they both have high compression rates and simple decoding methods. The latter is commonly used for images or videos of natural scenes.(25) Most of the web images containing meaningful text are created synthetically and are stored in GIF format.
Fig. 8. Foreground images of the multivalued image in Fig. 6(a): (a) a real foreground image; (b) a background-complementary foreground image.
Fig. 9. Histogram of the multivalued image shown in Fig. 4(c).
A GIF image is an 8-bit pseudo-color image whose pixel values are bounded between 0 and 255. A local color map and/or a global color map is attached to each GIF file to map the 8-bit image to a full color space. The GIF format has two versions, GIF87a and GIF89a. The latter can encode an image by interlacing, in order to display it in a coarse-to-fine manner during transmission, and can designate one color as a transparent background. As far as the data structure is concerned, an 8-bit pseudo-color image is no different from an 8-bit gray-scale image. However, they are completely different in terms of visual perception. The pixel values in a gray-scale image have a physical interpretation in terms of light reflectance, so the difference between two gray values is meaningful. However, a pixel value in a pseudo-color image is an index into a full color map. Therefore, two pixels with similar pseudo-color values may have distinct colors.

We extract text in pseudo-color images by combining two methods. One is based on foreground information and the other is based on background information. Although the pixel values in a GIF image can range from 0 to 255, most images contain values only in a small interval, i.e. $U \ll 256$. Figure 9 is the histogram of the pseudo-color image in Fig. 4(c), which shows that a large number of bins are empty. First, we regard each element image as a real foreground image. Furthermore, the number of distinct values shared by a large number of pixels is small due to the nature of synthetic images. We assume that the characters in a text are of reasonable size and therefore occupy a sufficiently large number of pixels. Therefore, we retain those real foreground images in which the number of foreground pixels is larger than a threshold $T_{np}$ (= 400). Further, we empirically choose $N = 8$ as the number of real foreground images.
Fig. 10. Decomposition of the web image of Fig. 4(c): (a)–(f) real foreground images; (g) background-complementary foreground image.
Fig. 11. Foreground extraction from a full color video frame: (a) original frame; (b) bit dropping; (c) color quantization reduces the number of distinct colors to four; (d)—(g) real foreground images; (h) background-complementary foreground image.
For text without a unique color value, we assume that its background has a unique color value. The area of the background should be large enough, so we regard the color with the largest number of pixels as the background. We also regard the color value with the second largest number of pixels as background if this number is larger than a threshold $T_{bg}$ (= 10,000). Thus, a background-complementary foreground image can be generated. In total, we consider at most nine foreground images (eight real foreground images plus one background-complementary foreground image). Each foreground is tagged with a foreground identification (FI). The image in Fig. 4(c) has 117 element images (see the histogram in Fig. 9) and only six of them are selected as real foreground images; these are shown in Fig. 10(a)–(f). The background-complementary foreground image is shown in Fig. 10(g).
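The selection rules described above can be summarized by the following sketch for a palette-indexed image. The thresholds T_np = 400, N = 8 and T_bg = 10,000 are the values quoted in the text; the function name and the choice of keeping the most populated qualifying colors are our own illustrative assumptions.

import numpy as np

T_NP = 400       # minimum object pixels for a real foreground image (T_np in the text)
N_REAL = 8       # at most this many real foreground images are retained
T_BG = 10_000    # second most frequent color also treated as background above T_bg

def select_foregrounds(indexed_img):
    # Sketch of the foreground selection of Section 2.2 for an 8-bit
    # pseudo-color (palette-indexed) image; not the authors' code.
    values, counts = np.unique(indexed_img, return_counts=True)
    order = np.argsort(counts)[::-1]               # colors by decreasing frequency

    # Real foreground images: element images with enough object pixels,
    # keeping at most N_REAL of them (here, the most populated ones).
    real_values = [values[i] for i in order if counts[i] > T_NP][:N_REAL]
    real_fgs = [(indexed_img == v).astype(np.uint8) for v in real_values]

    # Background: the most frequent color, plus the second most frequent one
    # if its pixel count exceeds T_BG; the complement is the BCF image.
    bg_values = [values[order[0]]]
    if len(order) > 1 and counts[order[1]] > T_BG:
        bg_values.append(values[order[1]])
    bcf = (~np.isin(indexed_img, bg_values)).astype(np.uint8)

    # Tag each foreground image with a foreground identification (FI).
    return {fi: fg for fi, fg in enumerate(real_fgs + [bcf])}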
2.3. Color images and video frames

A color image or a video frame is a 24-bit image, so the value of $U$ can be very large. To extract only a small number of foregrounds from a full color image, under the presumption that the color of text is distinct from the color of its background, we implement (i) bit dropping for the RGB color bands and (ii) color quantization. A 24-bit color image consists of three 8-bit red, green and blue images. For our task of text location, we simply use the highest two bits of each band image, which has the same effect as color re-scaling. Therefore, a 24-bit color image is reduced to a 6-bit color image and the value of $U$ is reduced to 64. Figure 11(b) shows the bit dropping result for the input color image shown in Fig. 11(a), where only the highest two bits have been retained from each color band. The retained color prototypes are illustrated in Fig. 12(a).
Fig. 12. Color prototypes: (a) after bit-dropping; (b) after color quantization.
In a bit-dropped image, text may be present in several colors which are assumed to be close in the color space. So, a color quantization scheme or clustering algorithm is used to generate a small number of meaningful color prototypes. Since we perform color quantization in the 6-bit color space, the computational cost is greatly reduced. We employ the well-known single-link clustering method(26) for quantizing the color space. The dissimilarity between two colors $C_i = (R_i, G_i, B_i)$ and $C_j = (R_j, G_j, B_j)$ is defined as
$$d(C_i, C_j) = (R_i - R_j)^2 + (G_i - G_j)^2 + (B_i - B_j)^2.$$
We construct a 64×64 proximity matrix and, at each stage of the clustering algorithm, the two colors with the minimum proximity value are merged together. The two merged colors are replaced by a single color, the one with the higher value in the histogram. The color quantization/clustering algorithm terminates when the number of colors either reaches a predetermined value of 2 or the minimum value in the proximity matrix is larger than 1. The color quantization result for the image after bit dropping (Fig. 11(b)) is depicted in Fig. 11(c); the four color prototypes are illustrated in Fig. 12(b). Using the same method as for pseudo-color images, we produce real foreground and background-complementary foreground images for the color quantized images. The image in Fig. 11(c) is decomposed into the five foreground images shown in Fig. 11(d)–(h).
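The color reduction step can be sketched as follows: bit dropping keeps the two most significant bits of each RGB band, and the surviving 6-bit colors are merged by the clustering rule just described (the closest pair is merged and replaced by the member with the larger histogram count). The stopping values follow those quoted in the text; the code is our illustration under these assumptions, not the authors' implementation.

import numpy as np

def bit_drop(rgb_img):
    # Keep the two most significant bits of each 8-bit band (24-bit -> 6-bit color).
    return rgb_img >> 6                      # each channel now takes values 0..3

def quantize_colors(rgb6_img, max_colors=2, max_merge_dist=1):
    # Merge the occupied 6-bit colors pairwise, always joining the closest pair,
    # until `max_colors` remain or the smallest squared distance exceeds
    # `max_merge_dist` (stopping values as quoted in the text).
    pixels = rgb6_img.reshape(-1, 3)
    uniq, freq = np.unique(pixels, axis=0, return_counts=True)
    colors = [tuple(c) for c in uniq]
    counts = {c: n for c, n in zip(colors, freq)}
    mapping = {c: c for c in colors}         # original color -> prototype

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(colors) > max_colors:
        # Closest pair of surviving prototypes.
        i, j = min(((i, j) for i in range(len(colors))
                           for j in range(i + 1, len(colors))),
                   key=lambda p: dist(colors[p[0]], colors[p[1]]))
        ci, cj = colors[i], colors[j]
        if dist(ci, cj) > max_merge_dist:
            break
        # The merged pair is replaced by the color with the larger count.
        keep, drop = (ci, cj) if counts[ci] >= counts[cj] else (cj, ci)
        counts[keep] += counts.pop(drop)
        for c, proto in mapping.items():
            if proto == drop:
                mapping[c] = keep
        colors.remove(drop)

    quantized = np.array([mapping[tuple(p)] for p in pixels], dtype=rgb6_img.dtype)
    return quantized.reshape(rgb6_img.shape), colors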
3. CONNECTED COMPONENTS IN MULTIVALUED IMAGES
After decomposition of a multivalued image, we obtain a look-up table of foreground identifications (FIs) for pixel values according to the foreground images. A pixel in the original image has one or more FI values and can contribute to one or more foreground images specified by this table. This information will be used for finding connected components in gray-level images, as described below.
Fig. 13. A binary image and its BAG.
The block adjacency graph (BAG) has been used for efficient computation of connected components, since it can be created by a one-pass procedure.(5) The BAG of an image is defined as $B = (N, E)$, where $N = \{n_i\}$ is a set of block nodes and $E = \{e(n_i, n_j) \mid n_i, n_j \in N\}$ is the set of edges indicating the connections between nodes $n_i$ and $n_j$. For a binary image, one of the two gray values can be regarded as foreground and the other as background. The pixels in the foreground are clustered into blocks which are adjacently linked as nodes in a graph. Figure 13 gives an example of a BAG, where a block, characterized by its upper-left $(X_u, Y_u)$ and lower-right $(X_l, Y_l)$ rectangular boundary coordinates, is the bounding box of a group of closely aligned run lengths. Note that links exist between adjacent blocks. We have extended the traditional algorithm for creating the BAG to multivalued images, where a BAG is created individually for each of the foreground images using run lengths. A run length in a multivalued image consists of as many contiguous pixels on a row as carry the same FI tag. A high-level algorithm for creating the BAG of a multivalued image is presented in Fig. 14. Note that the BAG nodes for different foregrounds do not connect to each other. The following process is implemented in parallel for all the foreground images.
Each run length in the first row of the input image is regarded as a block with the corresponding FI.
For each successive row in the image {
    For each run length r_c in the current row {
        If r_c is 8-connected to a run length in the preceding row and they have the same FI {
            If r_c is 8-connected to only one such run length r_l and the differences of the horizontal positions of their beginning and end pixels are, respectively, within a given tolerance T_a, then r_c is merged into the block node n_i containing r_l.
            Else, r_c is regarded as a new block node n_{i+1} with the corresponding FI, initialized with edges e(n_{i+1}, n_j) to those block nodes {n_j} which are 8-connected to r_c.
        }
        Else, r_c is regarded as a new block node n_{i+1} with the corresponding FI.
    }
}

Fig. 14. One-pass BAG generation algorithm for multivalued images.
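A Python rendering of this one-pass procedure for a single foreground image (a single FI) is sketched below; in the full system the same procedure is run independently for every foreground image, so no extra FI bookkeeping is shown. The Block class, helper names and the tolerance value are our own illustrative choices (the paper does not state a value for T_a).

from dataclasses import dataclass, field

@dataclass
class Block:
    # A BAG node: the bounding box of a group of closely aligned run lengths.
    x_u: int                                   # upper-left column
    y_u: int                                   # upper-left row
    x_l: int                                   # lower-right column
    y_l: int                                   # lower-right row
    edges: set = field(default_factory=set)    # indices of adjacent block nodes

def run_lengths(row):
    # Yield (start, end) column intervals of consecutive 1-pixels in a row.
    start = None
    for x, v in enumerate(list(row) + [0]):
        if v and start is None:
            start = x
        elif not v and start is not None:
            yield start, x - 1
            start = None

def build_bag(fg, tol=2):
    # One-pass BAG construction for one binary foreground image (a sketch of
    # Fig. 14 for a single FI); `tol` plays the role of the tolerance T_a.
    blocks, prev = [], []          # prev: (start, end, block index) of the previous row
    for y, row in enumerate(fg):
        cur = []
        for s, e in run_lengths(row):
            # Run lengths of the previous row that are 8-connected to (s, e).
            touching = [(ps, pe, bi) for ps, pe, bi in prev
                        if ps <= e + 1 and pe >= s - 1]
            if (len(touching) == 1
                    and abs(touching[0][0] - s) <= tol
                    and abs(touching[0][1] - e) <= tol):
                # Merge the run length into the block of its single neighbour.
                bi = touching[0][2]
                blk = blocks[bi]
                blk.x_u, blk.x_l, blk.y_l = min(blk.x_u, s), max(blk.x_l, e), y
            else:
                # Otherwise start a new block node, linked to every touching block.
                bi = len(blocks)
                blocks.append(Block(s, y, e, y))
                for _, _, tb in touching:
                    blocks[bi].edges.add(tb)
                    blocks[tb].edges.add(bi)
            cur.append((s, e, bi))
        prev = cur
    return blocks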
Fig. 15. Connected component analysis for the foreground image in Fig. 11(f ): (a) connected components; (b) connected component thresholding; (c) candidate text lines.
Given a BAG representation, a connected component $c_i = \{n_j\}$ is a set of connected BAG nodes which satisfies the following conditions: (i) $c_i \subset B$; (ii) $\forall n_j, n_k \in c_i$, there is a path $(n_j, n_{j_1}, n_{j_2}, \ldots, n_{j_p}, n_k)$ such that $n_{j_l} \in c_i$ for $l = 1, 2, \ldots, p$ and $e(n_j, n_{j_1}), e(n_{j_1}, n_{j_2}), \ldots, e(n_{j_{p-1}}, n_{j_p}), e(n_{j_p}, n_k) \in E$; and (iii) if $\exists\, e(n_j, n_k) \in E$ and $n_j \in c_i$, then $n_k \in c_i$. The upper-left and lower-right coordinates of a connected component $c_i = \{n_j\}$ are
$$X_u(c_i) = \min_{n_j \in c_i} \{X_u(n_j)\}, \qquad X_l(c_i) = \max_{n_j \in c_i} \{X_l(n_j)\},$$
$$Y_u(c_i) = \min_{n_j \in c_i} \{Y_u(n_j)\}, \qquad Y_l(c_i) = \max_{n_j \in c_i} \{Y_l(n_j)\}.$$
The connected components extracted for the foreground image shown in Fig. 11(f) are depicted in Fig. 15(a). Very small connected components are deleted, as shown in Fig. 15(b). Assuming that we are looking for horizontal text, we cluster the connected components in the horizontal direction; the resulting components are called candidate text lines and are shown in Fig. 15(c).
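Given the block bounding boxes and the edge list of a BAG, the connected components and their bounding boxes follow from a simple graph traversal, as in the self-contained sketch below (our illustration; the input format is an assumption).

def connected_components(boxes, edges):
    # Group BAG nodes into connected components and return one bounding box
    # per component. `boxes` is a list of (x_u, y_u, x_l, y_l) block boxes and
    # `edges` a list of (i, j) index pairs; this is a sketch, not the authors' code.
    adj = {i: set() for i in range(len(boxes))}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    seen, components = set(), []
    for start in range(len(boxes)):
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from `start`,
        # i.e. a component satisfying conditions (i)-(iii) above.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            n = stack.pop()
            comp.append(n)
            for m in adj[n] - seen:
                seen.add(m)
                stack.append(m)
        # Component bounding box: element-wise min/max over its blocks.
        xs_u, ys_u, xs_l, ys_l = zip(*(boxes[n] for n in comp))
        components.append((min(xs_u), min(ys_u), max(xs_l), max(ys_l)))
    return components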
4. TEXT IDENTIFICATION

Without character recognition capabilities, it is not easy to distinguish characters from non-characters simply based on the size of connected components. A line of text consisting of several characters can provide additional information for this classification. The text identification module in our system determines whether a candidate text line contains text or non-text based on statistical features of its connected components. A candidate text line containing a number of characters will usually consist of several connected components. The number of such connected components may not be the same as the number of characters in the text line because some of the characters may be touching each other. Figure 16(b) illustrates the text lines and connected components for the text in Fig. 16(a), where the characters are well separated. On the other hand, many characters shown in Fig. 16(c) are touching each other, and a connected component shown in Fig. 16(d) may include more than one character. We have designed two different recognition strategies for touching and non-touching characters. A candidate line is recognized as a text line if it is accepted by either one of the strategies.

4.1. Inter-component features

For separated characters, the corresponding connected components should be well aligned. Therefore, we preserve those text lines in which the top and bottom edges of the contained connected components are respectively aligned, or in which both the width and the height values of these connected components are close to each other. In addition, the number of connected components should be in proportion to the length of the text line.
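One possible reading of this test is sketched below. The paper does not give numerical tolerances for the alignment, size-similarity and component-count checks, so the parameter values here are purely illustrative assumptions.

def accept_by_intercomponent_features(boxes, line_width,
                                      align_tol=3, size_tol=0.2, density=0.05):
    # Sketch of the inter-component test of Section 4.1 for a candidate text
    # line whose connected components have bounding boxes (x_u, y_u, x_l, y_l).
    if not boxes:
        return False
    tops = [b[1] for b in boxes]
    bottoms = [b[3] for b in boxes]
    widths = [b[2] - b[0] + 1 for b in boxes]
    heights = [b[3] - b[1] + 1 for b in boxes]

    # The top and bottom edges of the components should be respectively aligned ...
    aligned = (max(tops) - min(tops) <= align_tol and
               max(bottoms) - min(bottoms) <= align_tol)

    # ... or both their widths and heights should be close to each other.
    def close(values):
        return max(values) - min(values) <= size_tol * max(values)
    similar_size = close(widths) and close(heights)

    # The number of components should be in proportion to the line length.
    enough_components = len(boxes) >= density * line_width

    return (aligned or similar_size) and enough_components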
Fig. 16. Characters in a text line: (a) well separated characters; (b) connected components and text lines for (a); (c) characters touching each other; (d) connected components and text line for (c); (e) X-axis projection profile and signature of the text in (c); (f) Y-axis projection profile and signature of the text in (c).
Fig. 17. Text composition: (a) text lines extracted from the foreground image in Fig. 10(b); (b) text line extracted from the foreground image in Fig. 10(g); (c) composed result.
Table 1. Image size and processing time for text location

Text carrier      No. of test images   Typical size   Accuracy (%)   Avg. CPU time (s)
Advertisement             26             548×769          99.2             0.15
Web image                 54             385×234          97.6             0.11
Color image               30             769×537          72.0             0.40
Video frame             6952             160×120          94.7             0.09
4.2. Projection profile features

For characters touching each other, features are extracted based on the projection profiles of the text line in both the horizontal and vertical directions. The basic idea is that if there are characters in a candidate text line, then there will be a certain number of humps in its X-axis projection profile and one significant hump in its Y-axis projection profile. Figure 16(e) and (f) depict the X-axis and Y-axis projection profiles of the text shown in Fig. 16(c). The signatures of the projection profiles in both directions are generated by thresholding, and they are also shown in Fig. 16(e) and (f). The threshold for the X profile is its mean value and the threshold for the Y profile is chosen as one third of its highest value. The signatures can be viewed as run lengths of 1s and 0s, where a 1 represents a profile value larger than the threshold and a 0 represents a profile value below the threshold. Therefore, we consider the following features to characterize text: (i) because text should have many humps in the X profile, but only a few humps in the Y profile, the number of 1-run lengths in the X signature is required to be larger than 5 and the number of 1-run lengths in the Y signature should be less than 3; (ii) since a very wide hump in the X profile of text is not expected, the maximum length of the 1-run lengths in the X signature should be less than 1.4 times the height of the text line; and (iii) the humps in the X profile should be regular in width, i.e. the standard deviation of the lengths of the 1-run lengths should be less than 1.2 times their mean, and the mean should be less than 0.11 times the height of the text line.
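The sketch below applies this test to a binary candidate text line, using the thresholds quoted above; it is our illustration of the described features, not the original code.

import numpy as np

def runs_of_ones(signature):
    # Lengths of the 1-run lengths in a 0/1 signature.
    lengths, count = [], 0
    for v in list(signature) + [0]:
        if v:
            count += 1
        elif count:
            lengths.append(count)
            count = 0
    return lengths

def accept_by_projection_profiles(line_img):
    # Sketch of the projection-profile test of Section 4.2 for a binary
    # candidate text line (rows x columns); thresholds follow the text.
    height = line_img.shape[0]
    x_profile = line_img.sum(axis=0)                 # one value per column
    y_profile = line_img.sum(axis=1)                 # one value per row

    # Signatures: X profile thresholded at its mean, Y profile at one third of its maximum.
    x_sig = (x_profile > x_profile.mean()).astype(int)
    y_sig = (y_profile > y_profile.max() / 3).astype(int)

    x_runs = runs_of_ones(x_sig)
    y_runs = runs_of_ones(y_sig)
    if not x_runs:
        return False

    many_x_few_y_humps = len(x_runs) > 5 and len(y_runs) < 3      # feature (i)
    no_wide_hump = max(x_runs) < 1.4 * height                     # feature (ii)
    regular_humps = (np.std(x_runs) < 1.2 * np.mean(x_runs)       # feature (iii)
                     and np.mean(x_runs) < 0.11 * height)
    return many_x_few_y_humps and no_wide_hump and regular_humps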
Fig. 18. Located text lines for the advertisement images.
4.3. Text composition

The connected component analysis and text identification modules are applied to individual foreground images. Ideally, the union of the outputs from the individual foreground images should provide the location of the text. However, the text lines extracted from different foreground images may overlap and, therefore, they need to be merged. Two text lines are merged and replaced by a new text line if their horizontal distance is small and their vertical overlap is large. Figure 17(c) shows the final text location results for the image in Fig. 4(c). Figure 17(a) and (b) show the text lines extracted from the two foreground images shown in Fig. 10(b) and (g); Fig. 17(c) is the union of Fig. 17(a) and (b).
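A minimal sketch of this merging step is given below; the two thresholds deciding what counts as a small horizontal distance and a large vertical overlap are not specified in the paper, so the values used here are illustrative assumptions.

def merge_text_lines(lines, max_gap=10, min_overlap=0.7):
    # Sketch of the text composition step of Section 4.3: repeatedly merge
    # pairs of text-line boxes (x_u, y_u, x_l, y_l) whose horizontal gap is
    # small and whose vertical overlap is large.
    lines = list(lines)
    merged = True
    while merged:
        merged = False
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                a, b = lines[i], lines[j]
                h_gap = max(a[0], b[0]) - min(a[2], b[2])   # negative if they overlap horizontally
                v_overlap = min(a[3], b[3]) - max(a[1], b[1]) + 1
                min_height = min(a[3] - a[1], b[3] - b[1]) + 1
                if h_gap <= max_gap and v_overlap >= min_overlap * min_height:
                    # Replace the pair by their common bounding box.
                    lines[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del lines[j]
                    merged = True
                    break
            if merged:
                break
    return lines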
5. EXPERIMENTAL RESULTS
The proposed system for automatic text location has been tested on a number of binary images, pseudo-color images, color images and video frames. Since different applications need different heuristics, the modules and parameters used in the algorithm shown in Fig. 5 change accordingly. Table 1 lists the performance of our system. We compute the accuracy for advertisement images by manually counting the number of correctly located characters. The accuracies for the other images are subjectively computed based on the number of correctly located important text regions in the image. The false alarm rate is relatively high for color images and is lowest for advertisement images. At the same time, the accuracy for color images is the lowest because of the high complexity of the background. The processing times are reported for a Sun UltraSPARC I system (167 MHz) with 64 MB of memory.
Fig. 19. Web images and located text regions.
More details of our experiments for the different text carriers are presented in the following sub-sections.

5.1. Advertisement images

The test images were scanned from a newspaper at 150 dpi. Some of the text location results are shown in Fig. 18, where both normal text and reversed text are located and marked by red bounding boxes. The line of white blocks in the upper part of Fig. 18(b) is detected as text because the blocks are regularly arranged in terms of size and alignment. However, this region should be easily rejected by an OCR module.
The text along a semicircle at the top of Fig. 18(e) cannot be detected by our algorithm. More complicated heuristics are needed to locate such text. Some punctuation and dashed lines are missed, as expected, because of their small size.

5.2. Web images

The 22 representative web images shown in Fig. 19 were downloaded from the Internet. The corresponding text location results are shown in gray scale in Fig. 19. The text in Fig. 19(a) is not completely aligned along a straight line. The data for the image in Fig. 19(h) could not be completely downloaded because of an accidental interruption of the transmission.
Fig. 20. Text location in color images.
Fig. 21. False alarm in complex images.
Fig. 22. Video frames with low resolution.
Fig. 23. Video frames containing both caption and non-caption text.
Fig. 24. Video frames with text in sub-window.
Fig. 25. Video frames with high resolution.
Fig. 26. Locating text on containers.
Even so, the word "Welcome" was successfully located. Figure 19(k) contains a title with Chinese characters which has been correctly located. Figures 19(b), (j) and (n) contain text logos. The vertical text in Fig. 19(h) and most small-sized text are ignored. The "smoke" in Fig. 19(p) is regarded as text because of its good regularity. The IBM logo in Fig. 19(r) is missed, since broken rendered text is not regarded as important text in our system and is also rather difficult to locate.

5.3. Scanned color images

The experimental color images were scanned at 50 dpi from magazine and book covers. Some of the results are shown in Fig. 20. Most important text with a sufficiently large font size is successfully located by our system. Some small-sized text is missed, but it is probably not important for image indexing. Our system can also locate handwritten text, as in Fig. 20(d) and (f). Most of the false alarms occur in images with very complex backgrounds, as shown in Fig. 21.

5.4. Video frames

A large number of video frames were selected from eight different videos covering news, sports, advertisements, movies, weather reports and camera monitoring events. The resolution of these videos ranges from 160×120 to 720×486. The results in Fig. 22 show the performance of our algorithm on video frames with a resolution as low as 160×120, where text font, size, color and contrast vary over a large range. Our algorithm was applied to video frames which contained a significant amount of text. The entire text in Fig. 22(g) could not be located. Note that it is not easy even for humans to locate all the text in this image due to the low resolution. The color and texture of the text in Fig. 23(b) vary and are fairly similar to those of the background. In Fig. 23(c), our system located the non-caption text on the wall.
Fig. 27. A video frame with vertical text.
Non-caption text is more difficult to locate because of its arbitrary orientation, alignment and illumination. Lienhart and Stuber's algorithm(23) works on gray-level video frames under the assumption that the text is monochromatic and generated artificially by title machines; however, no information about its processing speed was provided. Figure 24(a) shows text in a sub-window, which is commonly used in news broadcasting. Our system is not very sensitive to the image resolution. The video frames in Fig. 25 have a resolution of up to 720×486; the text shown on the weather forecast map has not been located. We are currently working to augment our heuristics to locate the missed text in weather maps. One of the potential applications of text location is container identification, and our algorithm can be applied to such images, as shown in Fig. 26. By a simple extension of our method, we can also locate vertical text, as shown in Fig. 27, although it is not commonly encountered in practice.
6. CONCLUSIONS
The problem of text location in images and video frames has been addressed in this paper. Text
conversion and database indexing are two major applications of the proposed text location algorithm. A method for text location based on multivalued image processing is proposed. A multivalued image, including a binary image, gray-scale image, pseudo-color image or full color image, can be decomposed into multiple real foreground and background-complementary foreground images. For full color images, a color reduction method is presented, consisting of bit dropping and color clustering. In this way, the connected component analysis developed for binary images can be used in multivalued image processing to find text lines. We have also proposed an approach to text identification which is applicable to both separated and touching characters. The text location algorithm has been applied to advertisement images, Web images, color images and video frames. The application to classified advertisement conversion demands a higher accuracy; therefore, we use a higher scan resolution of 150 dpi. For the other applications, the goal is to find all the important text for searching or indexing. Compared to the texture-based method(8) and motion-based approaches for video,(22,23) our method has a higher speed and accuracy in terms of finding a bounding box around important text regions. Because of the diversity of colors, the text location accuracy for color images is not as good as that for the other input sources. Our method does not work well when the three-dimensional color histogram is sparse and there are no dominant prototypes.
REFERENCES
1. S. Mori, C. Y. Suen and K. Yamamoto, Historical review of OCR research and development, Proc. IEEE 80, 1029–1058 (1992).
2. A. Jain and B. Yu, Document representation and its application to page decomposition, IEEE Trans. Pattern Anal. Machine Intell. 20, 294–308 (1998).
3. B. Yu, A. Jain and M. Mohiuddin, Address block location on complex mail pieces, Proc. 4th Int. Conf. on Document Analysis and Recognition, Ulm, pp. 897–901 (1997).
4. S. N. Srihari, C. H. Wang, P. W. Palumbo and J. J. Hull, Recognizing address blocks on mail pieces: specialized tools and problem-solving architectures, Artificial Intelligence 8, 25–35, 38–40 (1987).
5. B. Yu and A. Jain, A generic system for form dropout, IEEE Trans. Pattern Anal. Machine Intell. 18, 1127–1134 (1996).
6. L. A. Fletcher and R. Kasturi, A robust algorithm for text string separation from mixed text/graphics images, IEEE Trans. Pattern Anal. Machine Intell. 10, 910–918 (1988).
7. I. Pitas and C. Kotropoulos, A texture-based approach to the segmentation of seismic images, Pattern Recognition 25, 929–945 (1992).
8. A. Jain and S. Bhattacharjee, Text segmentation using Gabor filters for automatic document processing, Mach. Vision Applic. 5, 169–184 (1992).
9. Y. Zhong, K. Karu and A. Jain, Locating text in complex color images, Pattern Recognition 28, 1523–1535 (1995).
10. B. Yu and A. Jain, A robust and fast skew detection algorithm for generic documents, Pattern Recognition 29, 1599–1629 (1996).
11. Y. Tang, S. Lee and C. Suen, Automatic document processing: a survey, Pattern Recognition 29, 1931–1952 (1996).
12. M. Gray, Internet statistics: growth and usage of the Web and the Internet, at http://www.mit.edu/people/mkgray/net/.
13. AltaVista Web page, at http://altavista.digital.com/.
14. D. Lopresti and J. Zhou, Document analysis and the World Wide Web, Proc. Workshop on Document Analysis Systems, Malvern, pp. 417–424 (1996).
15. E. R. Lee, P. K. Kim and H. J. Kim, Automatic recognition of a car license plate using color image processing, Proc. 1st IEEE Conf. on Image Processing, Austin, pp. 301–305 (1994).
16. R. W. Picard and T. P. Minka, Vision texture for annotation, Multimedia Systems 3, 3–14 (1995).
17. S. Sclaroff and A. Pentland, Modal matching for correspondence and recognition, IEEE Trans. Pattern Anal. Machine Intell. 17, 545–561 (1995).
18. H. Sakamoto, H. Suzuki and A. Uemori, Flexible montage retrieval for image data, Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases II, Vol. SPIE 2185, San Jose, pp. 25–33 (1994).
19. A. S. Gordon and E. A. Domeshek, Conceptual indexing for video retrieval, Proc. Int. Joint Conf. on Artificial Intelligence, Montreal, pp. 23–38 (1995).
20. P. Schauble and M. Wechsler, First experiences with a system for content based retrieval of information from speech, Proc. Int. Joint Conf. on Artificial Intelligence, Montreal, pp. 59–70 (1995).
21. A. Jain and A. Vailaya, Image retrieval using color and shape, Pattern Recognition 29, 1233–1244 (1996).
22. B. Shahraray and D. Gibbon, Automatic generation of pictorial transcripts of video programs, Proc. SPIE Conf. on Multimedia Computing and Networking, Vol. SPIE 2417, San Jose, pp. 2417–2447 (1995).
23. R. Lienhart and F. Stuber, Automatic text recognition in digital videos, Proc. SPIE 2666, San Jose, pp. 180–188 (1996).
24. J. Zhou, D. Lopresti and Z. Lei, OCR for World Wide Web images, Proc. IS&T/SPIE Electronic Imaging: Document Recognition IV, San Jose (1997).
25. W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Compression Standard, Van Nostrand Reinhold, New York, NY (1993).
26. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ (1988).
About the Author—ANIL JAIN is a university distinguished Professor and Chair of the Department of Computer Science at Michigan State University. His research interests include statistical pattern recognition, Markov random fields, texture analysis, neural networks, document image analysis, fingerprint matching and 3D object recognition. He received the best paper awards in 1987 and 1991 and certificates for outstanding contributions in 1976, 1979, 1992, and 1997 from the Pattern Recognition Society. He also received the 1996 IEEE Trans. Neural Networks Outstanding Paper Award. He was the Editor-in-Chief of the IEEE Trans. on Pattern Analysis and Machine Intelligence (1990—94). He is the co-author of Algorithms for Clustering Data, Prentice-Hall, 1988, has edited the book Real-Time Object Measurement and Classification, Springer-Verlag, 1988, and co-edited the books, Analysis and Interpretation of Range Images, Springer-Verlag, 1989, Markov Random Fields, Academic Press, 1992, Artificial Neural Networks and Pattern Recognition, Elsevier, 1993, 3D Object Recognition, Elsevier, 1993, and BIOMETRICS:
Personal Identification in Networked Society, to be published by Kluwer in 1998. He is a Fellow of the IEEE and IAPR, and has received a Fulbright research award.

About the Author—BIN YU received his Ph.D. degree in Electronic Engineering from Tsinghua University in 1990, M.S. degree in Electrical Engineering from Tianjin University in 1986 and B.S. degree in Mechanical Engineering from Hefei Polytechnic University in 1983. Dr. Yu was a visiting scientist in the Pattern Recognition and Image Processing Laboratory of the Department of Computer Science at Michigan State University from 1995 to 1997. Since 1992, he has been an Associate Professor in the Institute of Information Science at Northern Jiaotong University, where he worked as a Postdoctoral Fellow from 1990 to 1992. He is now working as a Senior Staff Vision Engineer at Electroglas, Inc., Santa Clara. His research interests include image processing, pattern recognition and computer vision. Dr. Yu has authored more than 50 journal and conference papers. He is a Member of the IEEE, a Member of the Youth Board of the Chinese Institute of Electronics, and a Senior Member of the Chinese Institute of Electronics.