Accepted Manuscript
An Adaptive Over-Split and Merge Algorithm for Page Segmentation
Ha Dai-Ton, Nguyen Duc-Dung, Le Duc-Hieu

PII: S0167-8655(16)30135-0
DOI: 10.1016/j.patrec.2016.06.011
Reference: PATREC 6569
To appear in: Pattern Recognition Letters

Received date: 28 September 2015
Accepted date: 10 June 2016
Please cite this article as: Ha Dai-Ton, Nguyen Duc-Dung, Le Duc-Hieu, An Adaptive Over-Split and Merge Algorithm for Page Segmentation, Pattern Recognition Letters (2016), doi: 10.1016/j.patrec.2016.06.011
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Research Highlights
• A new hybrid over-split and merge algorithm that simultaneously reduces split and merge errors in document layout analysis.
• An adaptive thresholding method for grouping text lines of variable font size in diversified and complicated document structures.

• A new approach to context analysis that overcomes the common failure in separating close text regions of similar font size.
• Decomposing text regions of any shape into paragraphs.
• Achieving the highest scores on the UW-III and ICDAR2009 datasets under different measures.
Pattern Recognition Letters journal homepage: www.elsevier.com
An Adaptive Over-Split and Merge Algorithm for Page Segmentation

Ha Dai-Ton a,**, Nguyen Duc-Dung b, Le Duc-Hieu b

a Ha Long High School for Gifted Students, Ha Long City, Vietnam
b Institute of Information Technology, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet, Hanoi, Vietnam
ABSTRACT
Page segmentation is a key step in building a document recognition system. Variation in character font sizes, narrow spacing between text blocks, and complicated structure are the main causes of the most common over-segmentation and under-segmentation errors. We propose an adaptive over-split and merge algorithm to simultaneously reduce both types of error. The document image is first over-split into text blocks, or even text lines. These text blocks are then merged into text regions using a new adaptive thresholding method. Local context analysis uses a set of text-line separators to split homogeneous text regions of similar font size and close text blocks into paragraphs. Experiments on the ICDAR2009 and UW-III benchmark datasets show the effectiveness of the proposed algorithm in reducing both under- and over-segmentation errors, boosting performance significantly compared with popular page segmentation algorithms.

© 2016 Elsevier Ltd. All rights reserved.
1. Introduction
Document layout analysis is one of the main components of an OCR system. The task of document layout analysis includes automatically detecting zones on a document image (physical layout analysis, or page segmentation) and classifying them into different regions such as text, images, tables, headers, footers, etc. (logical layout analysis). The results of page segmentation are used as input to the recognition and automatic data entry processes of image processing systems in general. Compared with logical layout analysis, physical layout analysis has attracted more attention from researchers because of the diverse and complex layouts of different document types. Not only the specific type of document (books, newspapers, magazines, reports, etc.) but also other factors of a page, such as editing, font size, layout, and alignment constraints, affect the detection and segmentation accuracy of an algorithm.

Based on the order of processing, page segmentation algorithms are primarily divided into three categories: bottom-up, top-down, and hybrid. Bottom-up algorithms include both the oldest, e.g. (Wahl et al., 1982), and more recently published, e.g. (Chen and Ding, 2003; Chowdhury et al., 2007; O'Gorman, 1993), algorithms. They classify small parts of the image (pixels, groups of pixels, or connected components) and gather those of the same type together to form regions. The key advantage of bottom-up algorithms is that they can handle arbitrarily shaped regions with ease (rectangular or non-rectangular). However, the sensitivity of the measure used to form higher-level entities is the main disadvantage of this approach; it often leads to over-segmentation errors on pages with many changes in font sizes and styles, especially in titles. Top-down algorithms, e.g. (Breuel, 2002; Nagy et al., 1992), cut the image recursively in vertical and horizontal directions along white-spaces that are expected to be column or paragraph boundaries. Although top-down algorithms have low computational complexity and give good separation results on images with rectangular layouts, they are not really able to handle the variety of formats that occur in many magazine pages, such as non-rectangular regions and cross-column headings that blend seamlessly into the columns below. This leads to under-segmentation errors. The third type of algorithm, e.g. (Smith, 2009), uses a bottom-up method to find delimiters, such as rectangular white-spaces, tab-stops, etc. These delimiters are then used to infer a top-down layout of the document image. After that, the algorithm uses the bottom-up method together with the inferred top-down layout to detect text regions. Hybrid algorithms can therefore overcome the over-segmentation errors caused by bottom-up algorithms.

** Corresponding author: Tel.: +8-423-498-1188; e-mail: [email protected] (Ha Dai-Ton)
Fig. 1. Illustration of under-segmentation and over-segmentation errors.

Fig. 2. The process chain of the AOSM algorithm. Phase 1 (over-segmentation): connected component filter, delimiter detection, candidate text region detection. Phase 2 (paragraph detection): text-line detection, text-line grouping, paragraph separation.
Hybrid algorithms also perform better than top-down algorithms when dealing with non-rectangular text regions. However, it is not trivial to detect delimiters exactly, for many reasons: text regions may be very close to each other, text regions may not be left- or right-aligned, and gaps between connected components may be large, leading to misidentified or missed delimiters. As a result, the output of these algorithms often contains both over-segmentation and under-segmentation errors, as illustrated in Figure 1. In short, over-segmentation and under-segmentation errors are the most frequent types, and they are, in fact, not easy to overcome: fixing over-segmentation usually introduces under-segmentation, and vice versa.

In this paper, we present an Adaptive Over-Split and Merge (AOSM) algorithm for overcoming both over- and under-segmentation errors in the page segmentation problem. AOSM first over-segments the page image using a set of white-spaces covering the whole document background. It then groups over-segmented text regions using adaptive parameters. Finally, local context analysis sub-divides (under-segmented) text regions into paragraphs. Experimental results on the ICDAR2009 page segmentation competition dataset and the UW-III dataset show that AOSM significantly reduces both over-segmentation and under-segmentation errors, thus boosting page segmentation performance compared with state-of-the-art algorithms.

The rest of this paper is organized as follows. In section 2 we describe the AOSM algorithm in detail. Section 3 gives experimental results and analysis on the two benchmark datasets, ICDAR2009 and UW-III. Finally, conclusions and discussion are given in section 4.
2. Adaptive over-split and merge page segmentation

Figure 2 outlines the two stages and main processing steps of the proposed AOSM algorithm. The first stage aims at quickly dividing page images into regions and sub-regions. The second stage groups text regions using adaptive thresholds and then separates text blocks into paragraphs using local context analysis. Details of the main steps are presented in the following sub-sections.
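As an orientation, the composition of the two stages in Figure 2 can be expressed as a thin driver. Every function below is a hypothetical stand-in (the real steps are described in the following sub-sections); only the wiring of the phases is illustrated, on a toy page model where a page is a list of blocks interleaved with white-space markers.

```python
# Hypothetical driver for the two AOSM phases; each stage is a
# toy placeholder, wired together in the order of Fig. 2.

def over_segment(page):
    """Phase 1 stand-in: cut the page at every white-space delimiter."""
    regions, current = [], []
    for item in page:
        if item == "ws":              # white-space delimiter marker
            if current:
                regions.append(current)
            current = []
        else:
            current.append(item)
    if current:
        regions.append(current)
    return regions

def merge_and_split(regions):
    """Phase 2 stand-in: merge neighboring single-line regions
    (a crude proxy for the adaptive grouping of section 2.2.1)."""
    merged = []
    for region in regions:
        if merged and len(merged[-1]) == 1 and len(region) == 1:
            merged[-1].extend(region)  # adaptive-merge placeholder
        else:
            merged.append(list(region))
    return merged

def aosm_sketch(page):
    return merge_and_split(over_segment(page))
```

The driver only conveys that Phase 1 deliberately over-splits and Phase 2 repairs the splits; the actual delimiter detection and merging criteria follow below.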
2.1. Phase 1: Over-segmentation

Instead of using tab-stops as delimiters, e.g. (Smith, 2009), AOSM uses all available rectangular white-spaces covering the document background as delimiters, in order to eliminate the under-segmentation error of conventional top-down methods.

2.1.1. Connected component filtering

Morphological processing (Bloomberg, 1991) first detects vertical lines, horizontal lines, and image regions. The detected elements are subtracted from the image before connected component analysis. The connected components are then filtered by size into small (likely noise), large (likely halftone images), and medium (likely text; denoted CCs below).

2.1.2. Delimiter detection

When a document is written and laid out by a word processor or a professional publishing system, text regions are usually bounded, and thus differentiated from each other, by delimiters. The delimiters can be long horizontal/vertical line segments (dubbed solid separators), physical delimiters (distances between CCs), large elongated empty areas (dubbed white-spaces), or chains of aligned connected components (tab-stops). Among the different approaches to delimiter detection, analyzing the structure of the white background is one of the most common, e.g. the WhiteSpace algorithm (Breuel, 2002): white space is a generic layout delimiter, and the background structure is simpler than that of the foreground. However, the segmentation result of the WhiteSpace algorithm is rather sensitive to its stopping rule (based on the number of delimiters); early stopping results in more under-segmentation errors, and late stopping results in more over-segmentation errors (Shafait et al., 2008). To overcome this limitation, AOSM uses the set of all white spaces covering the document background as delimiters (Figure 3). This policy solves not only the delimiter detection but also the under-segmentation problem.

Fig. 3. The red overlapping rectangles are white-spaces covering the document background. They are all used as delimiters in the AOSM algorithm.

2.1.3. Candidate text region detection

The white spaces discovered in the previous step are removed from the page image, and the remaining candidate text regions are detected easily (Figure 4). In the next phase, candidate text-lines within each separated region will be detected efficiently, and the potential of joining text-lines in two neighboring regions is mostly eliminated.

Fig. 4. Illustration of over-segmented text regions: a) text regions in blue; b) over-segmented text regions with red bounding boxes.

2.2. Phase 2: Paragraph detection

The output of Phase 1 intentionally consists of over-segmented text regions, and the task of Phase 2 is to detect and fix these errors.

2.2.1. Text-line extraction and grouping

CCs in each detected candidate text region are scanned from left to right and from top to bottom to form text-lines. After that, two text-lines line_i and line_j (belonging to two neighboring text regions) are considered for merging into one region if they simultaneously satisfy the following conditions:

  (i)  DistHoriz(line_i, line_j) ≤ x-height_ij,
  (ii) |y_i − y_j| ≤ (1 + θ) · x-height_ij,

where y_i and y_j are the y-coordinates of the centers of text-lines line_i and line_j, and x-height_ij is the minimum estimated x-height of the two lines. The parameter θ determines the allowed local perpendicular distance between two lines in the same text region. These conditions mean that two text-lines are placed in the same region if they are close enough horizontally (i) and vertically (ii). It is noteworthy that condition (ii) favors text-lines of the same font size and becomes stricter when their sizes differ: in the latter case, the distance between the two line centers on the left-hand side reflects the larger font size, whereas the right-hand side counts only the smaller one. We will show experimentally that the performance of AOSM is very insensitive to the value of θ (Figure 12). Our preliminary experiments indicate that suitable values of θ lie between 1.4 and 1.6, so the default value of 1.5 is used in all of our experiments.

Figure 5 shows an example of grouping (over-segmented) text-lines into homogeneous text regions. Titles are usually over-segmented due to the large gaps between their lines; AOSM can still group them into the same region using their similar heights and their relative distance to each other. The title and the text bodies are prohibited from merging because of the large relative gap between the centers of text-lines in different regions.

Fig. 5. Illustration of grouping text lines into homogeneous text regions: a) over-segmented text lines; b) homogeneous text regions.

2.2.2. Paragraph separation

The difficulty of page segmentation lies not only in the complicated structure of document images and the variation in font sizes and styles, but also in the narrow spaces between text blocks, as illustrated in Figure 6. Such a space can be even smaller than the distance between words in the same text-line. This challenge defeats most page segmentation methods relying on either separation objects or connected component analysis. To overcome this difficulty, AOSM uses a set of separation text-lines to sub-divide the detected homogeneous text regions into paragraphs.
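To make the merging test of section 2.2.1 concrete, a minimal sketch follows. The data model (a text-line as a bounding box with an estimated x-height) and the reading of DistHoriz as the horizontal gap between the two boxes are our own assumptions for illustration, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class TextLine:
    x0: float; y0: float; x1: float; y1: float  # bounding box
    x_height: float                             # estimated x-height

def horiz_gap(a: TextLine, b: TextLine) -> float:
    """Horizontal distance between two line boxes (0 if they overlap)."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))

def center_y(line: TextLine) -> float:
    return (line.y0 + line.y1) / 2.0

def should_merge(a: TextLine, b: TextLine, theta: float = 1.5) -> bool:
    """Conditions (i) and (ii) of section 2.2.1, required simultaneously."""
    xh = min(a.x_height, b.x_height)   # minimum estimated x-height
    cond_i = horiz_gap(a, b) <= xh                                 # (i)
    cond_ii = abs(center_y(a) - center_y(b)) <= (1 + theta) * xh   # (ii)
    return cond_i and cond_ii
```

With the paper's default θ = 1.5, two lines of x-height 10 merge only when their centers lie within 25 pixels vertically, which is what keeps a large-font title from absorbing the small-font body below it.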
We define five types of text-line separators, as shown in Figure 7. AOSM scans each text region from top to bottom and from bottom to top for across-separation text-lines (Figure 7.a or 7.b), and splits the region into sub-regions (step 1 in Figure 8.d). After that, text-lines in those sub-regions are sorted in vertical and horizontal order (step 2 in Figure 8.d). Finally, paragraphs are detected using the separation text-lines of Figure 7.c, 7.d, or 7.e (step 3 in Figure 8.d). As illustrated in Figure 8, the defined text-line separators are effective in separating text regions that are of similar font size, very close to each other, and very complicated. Conventional top-down or bottom-up methods, even with the intended over-segmentation of the first phase, mostly fail in this situation.

Fig. 6. The red text-line lying across two columns is very close to two text-lines in the text blocks below. Conventional methods mostly fail to separate them, causing under-segmentation errors.

Fig. 7. Definition of separation text-lines: a) and b) are laid across two columns; c), d), and e) are at the start of a paragraph.

In summary, the AOSM algorithm uses global separation objects (white spaces) to over-split text blocks and local adaptive thresholds to merge the over-split text blocks into homogeneous text regions. Local context analysis then finds separation text-lines in each text region, and paragraphs are detected using these separators. In the next section, we evaluate and compare AOSM with other well-known methods in document layout analysis.
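The scan-and-split step of section 2.2.2 can be illustrated with a small sketch. It is not the authors' implementation: the line representation and the two predicates (a near-full-width line as an across-separator, an indented line as a paragraph start) are deliberately naive placeholders for the separator definitions of Figure 7.

```python
from typing import Callable, List

def split_at_separators(lines: List[dict],
                        is_separator: Callable[[dict], bool]) -> List[List[dict]]:
    """Cut a region's line list (in reading order) at every flagged
    separator line; the separator starts a new sub-region (step 1)."""
    regions, current = [], []
    for line in lines:
        if is_separator(line) and current:
            regions.append(current)
            current = []
        current.append(line)
    if current:
        regions.append(current)
    return regions

# Naive placeholder predicates, for illustration only.
def spans_columns(line: dict, region_width: float = 100.0) -> bool:
    return line["x1"] - line["x0"] > 0.9 * region_width   # near full width

def is_paragraph_start(line: dict, left_margin: float = 0.0) -> bool:
    return line["x0"] - left_margin > 5.0                 # indented first line
```

The same splitting routine can then be reused on each sub-region with a paragraph-start predicate to produce the final paragraphs (step 3).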
3. Experiments

3.1. Data
We use the UW-III dataset (Guyon et al., 1997) and the PRImA dataset (Antonacopoulos et al., 2009b) for performance evaluation and comparison of page segmentation algorithms. Both datasets provide text-line- and paragraph-level ground-truth represented by non-overlapping polygons for each document image.
Fig. 8. Segmentation of a homogeneous text region into paragraphs: a) original image; b) segmentation result without using separation text-lines; c) detected separation text-lines; d) sub-dividing text regions into paragraphs; e) final page segmentation result.
The UW-III dataset has 1,600 unskewed binary document images scanned at 300 DPI resolution. It provides a good basis for comparative evaluation of page segmentation algorithms because the majority of documents available today (books, journals, magazines, letters, etc.) contain Manhattan layouts, and many of its document images carry heavy noise (speckles, margin noise, undesired text parts from the neighboring page, etc.). The PRImA dataset has 305 document images scanned at 300 DPI resolution. It contains a wide variety of document types, reflecting the various challenges in layout analysis. The layouts of these pages mix simple and complex structures, including many instances of text wrapping tightly around images, varying font sizes, and other characteristics useful for evaluating layout analysis methods. The dataset is divided into two subsets: a training set of 250 images that has been used for optimizing the parameters of different algorithms, and the remaining 55 images, which were used for performance testing in the ICDAR2009 page segmentation competition. In our experiments, the training images are used for assessing the sensitivity of the adaptive merging parameter θ described in section 2.2.1. The performance comparisons of the different algorithms are based on the independent testing images.

Fig. 9. Performance comparison using the PSET-measure on the UW-III and ICDAR2009 datasets.

Success rate (%)   UW-III   ICDAR2009
Docstrum            92.87     70.77
Voronoi             83.53     62.43
WhiteSpace          89.67     67.64
Tab-Stop            90.42     76.68
AOSM                93.12     86.43

3.2. Performance metrics

The evaluation of document layout analysis is always complicated because the result depends strongly on the data, the ground-truth, and the evaluation methodology. In this work, we use two commonly and recently used performance metrics: the PRImA-measure and the PSET-measure. These measures have been used in many recent works and competitions in document analysis and recognition. The PRImA-measure (Clausner et al., 2011) has been used in several competitions, e.g. (Antonacopoulos et al., 2009a, 2013). It provides an evaluation methodology that combines different types of segmentation errors (split, merge, miss, false detection, misclassification, and reading order) with adjustable weights accounting for different application scenarios. The PSET-measure (Mao and Kanungo, 2002) uses text-line ground truth for evaluating the performance of page segmentation algorithms. This method is particularly useful because it does not make any assumption about the overall layout of the document. Let G be the set of all ground-truth text-lines in a document image. Then, three subsets of text-lines are defined as follows.
a) The set of ground-truth text-lines that are missed (C), i.e. they are not part of any detected text region.
b) The set of ground-truth text-lines whose bounding boxes are split (S), i.e. the bounding box of a text-line is not completely within one detected segment.
c) The set of ground-truth text-lines that are horizontally merged (M), i.e. two horizontally overlapping ground-truth text-lines are part of one detected segment.
The overall performance rate is measured as the percentage of ground-truth text-lines that are identified correctly:
PSET-measure = (|G| − |C ∪ S ∪ M|) / |G|.
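Given the three error subsets, the score is direct to compute. A minimal sketch over text-line identifiers (any hashable IDs), assuming C, S, and M are supplied by an evaluation front-end:

```python
def pset_measure(ground_truth, missed, split, merged):
    """PSET-measure: fraction of ground-truth text-lines (G) that are
    neither missed (C), split (S), nor horizontally merged (M)."""
    g = set(ground_truth)
    # A line may appear in several error sets; the union counts it once.
    errors = (set(missed) | set(split) | set(merged)) & g
    return (len(g) - len(errors)) / len(g)
```

For example, with 10 ground-truth lines of which line 0 is missed, lines 1-2 are split, and lines 2-3 are merged, the error union contains four lines and the score is 0.6.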
3.3. Algorithms
Representative algorithms for the top-down, bottom-up, and hybrid methods have been selected for analysis and comparison: Docstrum (O'Gorman, 1993), Voronoi (Kise et al., 1998), WhiteSpace (Breuel, 2002, 2003), and Tab-Stop (Smith, 2009). These algorithms not only achieve high performance but have also been studied recently, e.g. in (Shafait et al., 2008). Additionally, AOSM is compared with the methods that achieved top performance in the ICDAR2009 competition: DICE, Fraunhofer, REGIM-ENIS, and OCRopus (Tesseract, an open-source OCR system) (Antonacopoulos et al., 2009a).
Fig. 10. PRImA measures of AOSM and the top algorithms in the ICDAR2009 competition.

              Image   Separator   Text
DICE          36.46     27.08    40.02
Fraunhofer    61.83     84.51    82.37
REGIM-ENIS    54.42     74.53    15.44
Tesseract     52.95     69.42    73.24
AOSM          92.65     88.75    92.63
3.4. Results and discussion

The performance of Docstrum, Voronoi, WhiteSpace, Tab-Stop, and AOSM on the two datasets is reported in Figure 9. As document images in the UW-III dataset have a relatively simple (rectangular) structure, most algorithms achieve quite high performance, e.g. 92.87% for Docstrum and 90.42% for Tab-Stop. The most common error of these algorithms is over-segmentation of titles with large font sizes. With its local adaptive grouping strategy, AOSM mostly fixes this type of error and improves the overall performance, e.g. 93.12% for AOSM compared to 92.87% for Docstrum (Figure 9). The small margin of improvement is due to the small number of UW-III document images that have titles with large font sizes. The ICDAR2009 dataset contains a wide selection of contemporary documents with complex structure and changing font sizes. On this more difficult dataset, AOSM shows its advantage over the other algorithms: 86.43% for AOSM compared to 76.68% for the second-placed Tab-Stop (Figure 9). Evaluation with the PRImA-measure also shows a clear difference between AOSM and the others, especially in the text scenario: 92.63% for AOSM compared to 82.37% for the second-placed Fraunhofer (Figure 10).

Figure 11 shows the different types of errors made by the representative algorithms. The complexity of the ICDAR2009 dataset makes it difficult to estimate (global) thresholds as well as to detect separation objects. Most of the methods fail to reduce both the under-segmentation (merge) and over-segmentation (split) errors: e.g. Docstrum has the lowest split error of 3.16% but the highest merge error of 26.02%; the corresponding numbers for Tab-Stop are 6.11% split and 17.07% merge. AOSM reduces the merge error to 9.17% through its first phase of over-segmentation and the detection of text-line separators in homogeneous text regions in the second phase. The local adaptive thresholding method in phase 2 also helps AOSM effectively reduce the over-segmentation errors caused by irregular font sizes in document images: the split error of AOSM is only 4.28%, compared to 6.11% for Tab-Stop.

Fig. 11. Comparison of different types of errors on the ICDAR2009 dataset.

             Split (%)   Merge (%)   Miss (%)
Docstrum        3.16       26.02       0.05
Voronoi        11.02       26.50       0.05
WhiteSpace     12.84       19.13       0.39
Tab-Stop        6.11       17.07       0.14
AOSM            4.28        9.17       0.12

For the adaptive threshold θ used in merging two text-lines (section 2.2.1), we conduct experiments with values of θ ranging from 1.0 to 2.0 on the ICDAR2009 dataset. As we can see in Figure 12, the result of AOSM is very insensitive to the value of θ: the PSET scores vary from 86.24 with θ = 1.0 to 86.43 with θ = 1.4, 1.5, 1.6. This low sensitivity comes from the fact that the difference in font sizes is partly included in the calculation of the distance between line centers, and the distance threshold is based on the minimum estimated x-height of the two lines. In other words, AOSM permits merging text lines of similar height and restricts other cases even if the lines are very close to each other.

Fig. 12. PSET-measure of AOSM with different values of θ on the ICDAR2009 dataset.

Figure 13 shows the average running time per page of Docstrum, Voronoi, WhiteSpace, Tab-Stop, and AOSM on the ICDAR2009 dataset. The experiment was conducted on an Intel Core i5 3.2 GHz machine. AOSM takes about one second to analyze a document image, almost the same running time as WhiteSpace, faster than Voronoi, and slower than Docstrum.

We show in Figures 14 and 15 some typical examples illustrating the capability of AOSM in difficult cases. Traditional and conventional methods usually fail when their assumptions about clear distances between text blocks or about separation objects do not hold.
Fig. 13. Average running time for one-page segmentation of the representative algorithms.

4. Conclusions

We have presented AOSM, an Adaptive Over-Split and Merge algorithm, for the page segmentation problem. The primary aim of AOSM is to simultaneously reduce two common types of error, under- and over-segmentation, caused by variety in font size, narrow spacing between text regions, and the complexity of document structure. AOSM first uses white spaces covering the document background as delimiters; this makes an interesting and useful alternative to conventional delimiters like white rectangles or tab-stops for finding the column structure of a page. This tactic solves not only the delimiter detection but also the under-segmentation problem: it overcomes the under-segmentation error caused by irregular text alignment and close text zones (Figures 14, 15). The over-segmentation error is usually caused by large variation in font size and large spacing between text lines. The proposed adaptive thresholding strategy of AOSM reduces both the false positives of merging text lines of irregular font size from different text blocks and the false negatives resulting from over-segmenting text lines of similar font size within the same text region (Figures 5b, 15). Finally, homogeneous text regions are identified and separated into paragraphs using a pre-defined set of separation text-lines. The combination of over-segmentation with adaptive thresholding and context rules makes a distinctive and effective page segmentation algorithm. The proposed AOSM performs well on benchmark datasets and shows excellent results on very hard cases (Figures 8e, 15).
Fig. 14. Segmentation results on the PRImA-00000197 image. As the text blocks are very close to each other, Docstrum fails to separate these regions. The skew of the document leaves the text zones neither left- nor right-aligned, so delimiters cannot be determined, and the Tab-Stop algorithm merges all text-lines of these zones together. AOSM overcomes this difficulty through its over-segmentation process: the text regions are then adaptively merged and split into paragraphs.
Acknowledgment
We would like to thank the anonymous reviewers for their valuable comments and suggestions on the submitted version of this paper. This work was supported in part by the Vietnam Academy of Science and Technology under the grant VAST01.08/15-16, "Development of document analysis and recognition methods for automatic data entry".

References
Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S., 2009b. A realistic dataset for performance evaluation of document layout analysis, in: Proc. 10th Intl. Conf. on Document Analysis and Recognition, pp. 296–300.
Antonacopoulos, A., Clausner, C., Papadopoulos, C., Pletschacher, S., 2013. ICDAR2013 competition on historical newspaper layout analysis, in: Proc. 12th Intl. Conf. on Document Analysis and Recognition, pp. 1454–1458.
Antonacopoulos, A., Pletschacher, S., Bridson, D., Papadopoulos, C., 2009a. ICDAR2009 page segmentation competition, in: Proc. 10th Intl. Conf. on Document Analysis and Recognition, pp. 1370–1374.
Bloomberg, D.S., 1991. Multiresolution morphological approach to document image analysis, in: Proc. Intl. Conf. on Document Analysis and Recognition, Saint-Malo, France, pp. 963–971.
Breuel, T.M., 2002. Two geometric algorithms for layout analysis, in: Document Analysis Systems, LNCS 2423, pp. 188–199.
Breuel, T.M., 2003. High performance document layout analysis, in: Proc. Symp. Document Image Understanding Technology.
Fig. 15. Segmentation results on the PRImA-00000781 image. The large font size and the large inter-word spacing in the title cause Docstrum to over-split text-lines. Moreover, the very close overlying text block of similar font size makes it difficult to separate the two columns from the top block; most page segmentation methods fail with this under-segmentation type of error. AOSM overcomes this challenging case by using the local text-line delimiters to separate the homogeneous text region into paragraphs.
Chen, M., Ding, X., 2003. Unified HMM-based layout analysis framework and algorithm. Science China Information Sciences 46(6), 401–408.
Chowdhury, S., Mandal, S., Das, A., Chanda, B., 2007. Segmentation of text and graphics from document images, in: Proc. 9th Intl. Conf. on Document Analysis and Recognition, vol. 2, pp. 619–623.
Clausner, C., Pletschacher, S., Antonacopoulos, A., 2011. Scenario driven in-depth performance evaluation of document layout analysis methods, in: Proc. 11th Intl. Conf. on Document Analysis and Recognition, pp. 1516–1520.
Guyon, I., Haralick, R.M., Hull, J.J., Phillips, I.T., 1997. Data sets for OCR and document image understanding research, in: Handbook of Character Recognition and Document Image Analysis, pp. 779–799.
Kise, K., Sato, A., Iwata, M., 1998. Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding 70, 370–382.
Mao, S., Kanungo, T., 2002. Software architecture of PSET: a page segmentation evaluation toolkit. International Journal on Document Analysis and Recognition 4, 205–217.
Nagy, G., Seth, S., Viswanathan, M., 1992. A prototype document image analysis system for technical journals. IEEE Computer 25, 10–22.
O'Gorman, L., 1993. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 1162–1173.
Shafait, F., Keysers, D., Breuel, T., 2008. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 941–954.
Smith, R., 2009. Hybrid page layout analysis via tab-stop detection, in: Proc. 10th Intl. Conf. on Document Analysis and Recognition, pp. 241–245.
Wahl, F., Wong, K., Casey, R., 1982. Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing 20, 375–390.