Pattern Recognition 38 (2005) 2333 – 2350 www.elsevier.com/locate/patcog
Fiducial line based skew estimation Bo Yuan∗ , Chew Lim Tan Department of Computer Science, School of Computing, National University of Singapore, Singapore 117543, Singapore Received 5 October 2004; received in revised form 28 January 2005; accepted 7 March 2005
Abstract Skew estimation for textual document images is a well-researched topic and numerous methods have been reported in the literature. One of the major challenges is the presence of interfering non-textual objects of various types and quantities in the document images. Many existing methods require proper separation of the textual objects, which are well aligned, from the non-textual objects, which are mostly nonaligned. Some comparative evaluation work on the existing methods chooses only the text zones of the test image database. Therefore, the object filtering or zoning stage is crucial to the skew detection stage. However, it is difficult if not impossible to design general-purpose filters that are able to discriminate noises from textual components. This paper presents a robust, general-purpose skew estimation method that does not need any filtering or zoning preprocessing. In fact, this method does apply filtering, not on the input components at the beginning of the detection process but on the output spectrum at the end of it. Therefore, the problem of finding a textual component filter has been transformed into that of finding a convolution filter on the output accumulator array. This method consists of three steps: (1) the calculation of the slopes of the virtual lines that pass through the centroids of all the unique pairs of connected components in an image, and the quantization of the arctangents of the slopes into a 1-D accumulator array that covers the range from −90° to +90°; (2) a special convolution on the resultant histogram, after which there remain only the prominent peaks that possibly correspond to the skew angles of the image; (3) the verification of the detection result. Its computational complexity and detection precision are uncoupled, unlike those of projection-profile-based or Hough-transform-based methods, whose speeds drop when higher precision is demanded.
Speedup measures on the baseline implementation are also presented. The University of Washington English Document Image Database I (UWDB-I) contains a large number of scanned document images with a significant amount of non-textual objects. Therefore, it is a good image database for evaluating the proposed method. © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. Keywords: Skew estimation; Centroids; Component pairs; Fiducial lines; Noise immunity; UWDB-I
∗ Corresponding author: Tel.: +65 6874 6396; fax: +65 6775 7717. E-mail address: [email protected] (B. Yuan).
0031-3203/$30.00 © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2005.03.023
1. Introduction Text lines in documents are horizontally or vertically oriented by human reading customs. Due to misalignment in the digitization process, the text lines in the resultant images usually deviate from their original orientation by a certain amount, which is referred to as the skew angle. Skew angle detection is one of the important processing steps in document image understanding. It has drawn extensive studies and a large array of techniques has been developed [1–10]. There are also in-depth reviews [11,12] and comparative evaluations [13] available. The general principle of skew detection for textual document images is to find a proper representation of text lines, and develop a method to draw the correspondence
between a certain property of the chosen representation and the actual orientation of the images. Popular classes of skew detection methods in the literature include those based on projection profiles [1–4], Hough transforms [5,6], nearest-neighborhood [7,8], and morphological operations [9,10]. Different approaches compete on the grounds of detection accuracy, time and space efficiency, the ability to detect multiple skews in the same image, and robustness to noisy environments (such as graphical objects or their residues after filtering) and to scan-introduced distortions (such as object-touching due to insufficient resolution, or warping due to non-flat scanned surfaces). The skew detection process of the projection-profile-based methods starts by marking a set of points (fiducial points) to represent the objects in an image. Then, it makes a parallel projection of these points onto an accumulator array, and a chosen premium function is evaluated on the accumulator array. If the accumulator array is successively rotated by a fixed interval over a range, a series of premiums is obtained. The premium function reaches its extremes when the projection is along the text line orientations. Usually, detection is carried out in multiple rounds to achieve good speed and accuracy, either from coarse to fine accumulator rotation interval or from sub-sampled to full-resolution image. The idea of using the Hough transform to detect skew angles of document images is that if the objects of a text line can be represented by a set of points (fiducial points) $\{x, y\}$, each of these points can be mapped with the normal parameterization $\rho = x\cos\theta + y\sin\theta$ to a sinusoidal curve in the quantized $\rho$–$\theta$ parameter space by scanning the whole range of the parameter $\theta$ in certain intervals. Therefore, if the mapped curves are accumulated in the parameter space, the global maxima $\{\rho_{\max}, \theta_{\max}\}$ correspond to the prominent text line orientations of the image.
All the skew detection methods deploying Hough transforms must make aggressive computation reduction in order to achieve acceptable performance for real world documents, usually with compromises in accuracy and other properties. Nearest-neighborhood (k-NN) provides a spatial clue for local grouping of objects that belong to a text line. The positions of grouped objects are then used to approximate the orientation of the text line. Although the nearest-neighbor searching process is global, the skew angle detection process is local. Therefore, its skew detection precision is usually not as high as that of the other methods that detect skew on global scale. Noises, the residual components from photographic objects or scan artifacts after binarization, in textual document images affect all skew detection methods in different ways. Projection-profile-based methods in principle are able to deal with noises only when the noises are uniformly distributed on the whole page (rare in real-world documents) so that the peak evaluation on the projection-profiles is not compromised. Hough-transform-based methods are also able to deal with moderate amount of noises of irregular
distribution. However, the presence of excessive noises may produce false maxima, which are the results of noise points that happen to be collinear. This severely undermines peak-searching for the already fuzzy and weak signals in the 2-D accumulator array, and drastically slows down the detection. The ability of the nearest-neighborhood-based methods to deal with excessive noises is generally weak, as noises may easily corrupt the buildup of neighborhoods. The same can be said about the methods using morphological operations, which can be thought of as alternative ways of object grouping at the pixel level. Noise filtering is commonly used as a preprocessing measure in all classes of the skew detection methods. This is more for the efficiency than for the robustness of the detection algorithms. The filters used are based either on prior knowledge of the samples to be processed or on common sense, so as not to be too aggressive in component elimination. It is difficult if not impossible to design general-purpose filters that are able to discriminate noises from textual components. This paper proposes a skew estimation method that robustly processes document images containing a large amount of non-textual components or noises. Its noise immunity comes from its use of component pairs rather than individual components, as most other methods use. This exploitation of the inter-component correlation enables better identification of linear alignment, which leads to an easier design of a filter for identifying the peaks that correspond to prominent skew angles in the resultant 1-D accumulator array. It is a general-purpose, full-range (±90°) skew estimation method that is straightforward in principle, simple in implementation, parameter-free, and highly competitive in detection accuracy and execution speed.
2. The proposed method This method works on the components extracted from a binary image. It uses the centroids of the components as their representation (fiducial points), as centroids are rotation-invariant and thus a proper choice for skew detection in the range of ±90°. It traverses all the unique pairs of components to calculate the slopes of the virtual lines passing through their centroids (fiducial lines), as illustrated in Fig. 1. The arctangent values of the slopes of the fiducial lines are computed and quantized into an accumulator array to form an angle histogram, as shown in Fig. 3. The prominent peaks in the histogram are the candidates for the detected skew angles. Fig. 2 superposes the fiducial lines drawn along the angle at the peak position of the histogram in Fig. 3 on the input image in Fig. 1. As Fig. 4 shows, the fiducial lines concentrate highly along the direction of the text lines and spread widely over other angles, so the contribution from the components in the same text line overwhelms that from different lines or regions. This is the reason why this method is robust in the presence
• Component filtering (optional): The proposed method does not require the exclusion of non-textual components. However, filtering can improve, to a certain extent, the accuracy, reliability and efficiency of the method;
• Histogram generation: The slope histogram h[i] has a total of Nbin = 9000 bins representing a range of −90° inclusive to +90° exclusive, yielding an angle resolution of 0.02°/bin. All unique pairs of components are traversed and the slopes of their fiducial lines are computed and quantized to one of the bins in the histogram.
2.2. Peaks searching
Fig. 1. An enlarged portion of an image superposed on the fiducial lines drawn among the centroids of components. This is only for illustrating the working principle of the proposed skew estimation method.
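The histogram generation step of Section 2.1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the component centroids are assumed to be already available as (x, y) tuples, and the sign convention of the image y-axis is ignored.

```python
import math
from itertools import combinations

N_BIN = 9000  # bins covering [-90, +90) degrees, i.e. 0.02 degrees per bin


def angle_to_bin(angle_deg):
    """Map an angle in [-90, 90) degrees to a bin index in [0, N_BIN)."""
    index = int((angle_deg + 90.0) * N_BIN / 180.0)
    return min(index, N_BIN - 1)  # guard against rounding at the upper edge


def slope_histogram(centroids):
    """Accumulate fiducial-line angles over all unique centroid pairs."""
    hist = [0] * N_BIN
    for (x1, y1), (x2, y2) in combinations(centroids, 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # coincident centroids define no line
        angle = math.degrees(math.atan2(dy, dx))  # in (-180, 180]
        # A line has no direction, so fold the angle into [-90, 90).
        if angle >= 90.0:
            angle -= 180.0
        elif angle < -90.0:
            angle += 180.0
        hist[angle_to_bin(angle)] += 1
    return hist


# Four centroids on a line of slope 0.03 (about 1.72 degrees):
# all six pairs fall into the same bin.
hist = slope_histogram([(0, 0), (100, 3), (200, 6), (300, 9)])
peak_bin = hist.index(max(hist))
```

The loop visits n(n − 1)/2 pairs, so its cost grows quadratically with the number of components, independently of the number of bins.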
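The obtained slope histogram $h[i]$, where $i \in [0, N_{\mathrm{bin}})$, is convolved with a finite-sized, symmetric kernel generated from the second derivative of an inverted, un-normalized normal distribution with variance $\sigma_{\mathrm{bin}}^2$:

$$k[j] = \begin{cases} \dfrac{\sigma_{\mathrm{bin}}^2 - (j-\mu)^2}{\sigma_{\mathrm{bin}}^4}\, e^{-\frac{(j-\mu)^2}{2\sigma_{\mathrm{bin}}^2}}, & 0 \le j \le 2\mu, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

where $\mu$ is a positive integer that represents the half-size of the kernel, resulting in a new histogram from which the dominant peaks are searched:

$$h_{\mathrm{convol}}[i] = \sum_{j=-\mu}^{\mu} h[(i - j - \mu + N_{\mathrm{bin}}) \bmod N_{\mathrm{bin}}]\, k[j+\mu]. \tag{2}$$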
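The kernel and the wrapped convolution can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, the kernel half-size is taken as four standard deviations, and the kernel is centred on the output bin, which is an indexing choice assumed here so that peak positions stay where they are in the raw histogram.

```python
import math


def make_kernel(sigma_deg=0.5, n_bin=9000):
    """Second derivative of an inverted, un-normalized Gaussian."""
    sigma_bin = sigma_deg * n_bin / 180.0  # standard deviation in bins
    mu = int(round(4 * sigma_bin))         # assumed half-size: 4 sigma
    kernel = []
    for j in range(2 * mu + 1):
        d2 = (j - mu) ** 2
        value = (sigma_bin ** 2 - d2) / sigma_bin ** 4
        kernel.append(value * math.exp(-d2 / (2.0 * sigma_bin ** 2)))
    return kernel, mu


def convolve_wrapped(hist, kernel, mu):
    """Circular convolution: indices wrap at both ends of the histogram."""
    n = len(hist)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(-mu, mu + 1):
            acc += hist[(i - j) % n] * kernel[j + mu]
        out[i] = acc
    return out


# Small demonstration: 360 one-degree... bins of 0.5 degrees each
# (sigma_bin = 1, mu = 4), flat background of 5 plus a peak at bin 123.
kernel, mu = make_kernel(sigma_deg=0.5, n_bin=360)
hist = [5] * 360
hist[123] += 50
convolved = convolve_wrapped(hist, kernel, mu)
peak_bin = convolved.index(max(convolved))
```

Because the kernel sums to almost zero, the flat background nearly vanishes after convolution while the narrow peak survives. A pure-Python double loop over 9000 bins and 201 kernel taps is slow; it is used here only to keep the sketch dependency-free.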
Fig. 2. Fiducial lines are drawn on the image in Fig. 1 along the angles of 1.72 ± 0.02◦ .
of heavy noises. This method works globally and it does not require any page segmentation in order to work properly.
2.1. Histogram generation
Given a document image, the slope histogram is obtained in the following steps:
• Image binarization: This method works at the component level, so color documents are first converted to grayscale, then to bi-level by an appropriate thresholding method (global or moving window);
• Connected-component analysis: The 8-neighbor connectivity analysis is done on the pixels of the bi-level image to extract components and store them in a data structure, together with the calculated centroids and other important properties;
The modular operation in Eq. (2) indicates the wrapping of values at the two endpoints of the histogram. In Eq. (1), the parameter $\sigma_{\mathrm{bin}}$ (in bins) is converted from the parameter $\sigma$ (in degrees) by $\sigma_{\mathrm{bin}} = \sigma N_{\mathrm{bin}}/180$. The half-size of the kernel is set to $\mu = 4\sigma_{\mathrm{bin}}$ to make the total area of the kernel sufficiently close to 0. If $\sigma = 0.5°$ then $\sigma_{\mathrm{bin}} = 25$ bins and $\mu = 100$ bins. Therefore, the full size of the kernel is 201 bins. The reason to choose the second derivative of the normal distribution as the convolution kernel is that it combines the functionalities of center-weighted smoothing and background removal in one operation. The smoothing functionality filters out statistical fluctuation or singular points in the histogram. The background removal functionality comes from the fact that the total area of the kernel in Eq. (1) is zero. By using such a kernel, the slow-varying, broad shapes from the histogram are subtracted, leaving only the sharp peaks whose positions are easy to detect. The histogram may contain spikes whose widths are 1 bin, at a series of fixed locations of −90°, 0°, ±45°, ±63.44°, ±26.56°, ..., some of which can be seen in Fig. 3. These spikes originate from the two quantization processes. The first quantization takes place when the images are captured on the square imaging grid of photo sensors. The pixels in the
Fig. 3. The slope histogram of the fiducial lines among all the components of the image in Fig. 1 (horizontal axis: angle in degrees, from −90° to 90°). The components on the same text lines form a well-defined, narrow peak at about 1.7°. Note the prominent spikes at 0°, −90°, 45°, etc. They are the results of the quantization effects and can be removed by the convolution proposed in this paper.
Fig. 4. The slope histograms for separated inter-/intra-line components of the image in Fig. 1 (horizontal axis: angle in degrees, from −90° to 90°). The contributions from inter-line components form a broad background, while the contributions from intra-line components form an easily recognizable sharp peak at about 1.7°.
images can only have integer coordinates; therefore, along the angles of 0° (tg−1(0)), ±90° (tg−1(±∞)), ±45° (tg−1(±1)), ±63.44° (tg−1(±2)) and ±26.56° (tg−1(±1/2)) there are more pairs of grid points than along nearby angles, as shown in Fig. 5. The second quantization takes place when the calculated slopes are accumulated in the finite-sized accumulator array to form the histograms. The chosen kernel
in Eq. (1) can minimize the influence of these spikes after convolution. An alternative histogram processing method is proposed in Ref. [14], which combines spike removal, smoothing, median curve extraction and background subtraction. In comparison, the chosen one-pass convolution method with the kernel in Eq. (1) is superior in terms of both accuracy and efficiency.
Fig. 5. Distance versus angle plot for the grid points in a squared imaging grid. The quantization effects are obvious in short distances, especially along ±90◦ or tg−1 (±∞), 0◦ or tg−1 (0), ±45◦ or tg−1 (±1), ±63.44◦ or tg−1 (±2), ±26.56◦ or tg−1 (±1/2).
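The grid effect can be reproduced with a small enumeration. The sketch below is an illustration only, with a hypothetical function name and an arbitrarily chosen radius of 5 pixels: it lists every integer displacement vector within that radius, counting each undirected direction once, and tallies the resulting angles rounded to two decimals.

```python
import math
from collections import Counter


def lattice_angle_counts(max_dist=5):
    """Count directions of integer displacement vectors up to max_dist."""
    counts = Counter()
    for dx in range(0, max_dist + 1):
        for dy in range(-max_dist, max_dist + 1):
            if dx == 0 and dy <= 0:
                continue  # skip the null vector and duplicate verticals
            if dx * dx + dy * dy > max_dist * max_dist:
                continue  # outside the chosen radius
            angle = math.degrees(math.atan2(dy, dx))
            counts[round(angle, 2)] += 1
    return counts


counts = lattice_angle_counts(5)
# Directions shared by several short vectors stand out from the rest.
most_common = counts.most_common(8)
```

Within this radius the horizontal and vertical directions are each hit five times and the diagonals three times, the directions with slope 2 or 1/2 twice, while generic directions such as slope 1/3 occur only once. This concentration is what produces the 1-bin spikes for short-range pairs.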
Fig. 6. The histogram in Fig. 3 after convolution, with the kernel shown as inset ($\sigma = 0.5°$, not to scale). Horizontal axis: angle in degrees, from −90° to 90°; the labeled peak is at 1.72°.
After applying the convolution to the histogram, the most prominent peak, as shown in Fig. 6, is the candidate for determining the dominant text line orientation. The existence of multiple prominent peaks may indicate multiple skew angles or other features such as the use of mono-spaced fonts. For the purposes of this paper, only the highest peak is taken into account. In practice, $\sigma$ can be set to a fixed value of 0.5° for all processing, including the samples from the UWDB-I. Since the signal peaks are distinct in shape from the background, the detection is not sensitive to the choice of $\sigma$. It is chosen to be close to the shape of the signal. Based on observation, both the synthesized and the real images in UWDB-I show similar peak shapes. This makes the proposed method virtually parameter-free. To illustrate the origins of the broad and the sharp peaks in Fig. 3, the histogram is divided into two contributions: one from inter-line components (the two components from
Fig. 7. The convolved histograms for the individual lines (L01 to L09) of the image in Fig. 1 (horizontal axis: angle in degrees, from −4° to 7°). The histogram in Fig. 6 is the sum of the histograms of all nine lines.
different text lines), and the other from intra-line components (the two components from the same text line). The histogram in Fig. 4 shows that the well-shaped sharp peak comes from the intra-line components, while the broad background comes from inter-line components. A generalization can be made that the out-of-text-line components in an image, be they textual or graphical in nature, contribute mainly to the slow-varying background of the histogram and thus can be effectively eliminated by the proposed convolution. Minimizing the contributions from out-of-text-line components can certainly improve the detection performance, but the proposed method does not require it. This analysis explains why the proposed skew detection method works robustly for document images with excessive noises, even in situations where textual components are in extreme minority, such as the sample in Fig. 15. The proposed method is statistics-based. Problems arise when text components are sparse or the text columns are narrow. For example, the sample text in Fig. 1 has around 30 characters per line; daily newspapers have around 40 and normal two-column journal pages have around 50. The convolved histograms of individual lines are shown in Fig. 7. The peak at 1.7° is the accumulation of the spectra of all the text lines. The larger the number of text lines or the longer the text lines, the more reliable the detection. Even in cases of short single text lines, the proposed method can give more accurate values than least-square line-fitting or Hough-transform-based detection.
Fig. 8. Working on the component holes along the angles of 1.78 ± 0.02° for the image in Fig. 1. The holes are extracted with 4-connectedness on the background. The major white spaces have no effect on the skew detection (removed here for cleaner presentation).
2.3. Results verification In order to evaluate the reliability of the highest peak found on the convolved histogram, a signal strength measure, as shown in Eq. (3), is evaluated, where $A_i$ is the value in bin $i$ and $A_0$ is the detected peak height.

$$S(\mathrm{dB}) = 10 \times \log_{10} \frac{A_0^2}{\dfrac{1}{N_{\mathrm{bin}}} \sum_{i=0}^{N_{\mathrm{bin}}-1} A_i^2}. \tag{3}$$
An accept/reject threshold value can be set to, say 20 dB. Values below this threshold are considered unreliable and should be rejected or subjected to further examination.
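A minimal sketch of the signal strength measure and the accept/reject test follows. The function names are hypothetical, the peak is assumed to be the maximum of the convolved histogram, and the 20 dB default is simply the example threshold mentioned in the text.

```python
import math


def signal_strength_db(convolved):
    """Peak power over mean power of the convolved histogram, in dB."""
    n = len(convolved)
    peak = max(convolved)
    mean_square = sum(a * a for a in convolved) / n
    if peak <= 0 or mean_square == 0:
        return float("-inf")  # no usable peak
    return 10.0 * math.log10(peak * peak / mean_square)


def is_reliable(convolved, threshold_db=20.0):
    """Accept the detected peak only if it is strong enough."""
    return signal_strength_db(convolved) >= threshold_db


# A single spike of height 10 among 100 bins: mean square is 1,
# peak power is 100, so the measure is 20 dB.
spike = [0.0] * 99 + [10.0]
strength = signal_strength_db(spike)
```

A completely flat histogram gives 0 dB, so the measure expresses how far the peak rises above the average energy of the whole spectrum.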
Fig. 9. The convolved histogram in the background mode for the image in Fig. 1 ($\sigma = 0.5°$; horizontal axis: angle in degrees, from −90° to 90°; labeled peak at 1.78°). The S/N ratio and peak accuracy are both lower than in the foreground mode, but still usable, especially for the skew detection of low-resolution or down-sampled images.
Fig. 10. Design of distance filter for component pairs using the image in Fig. 1. The dense, stripe-like central pattern is formed by the intra-line component pairs, while the other hyperbola-shaped patterns are from inter-line pairs.
2.4. Working on component holes—the background mode Common text images have black text on white background. For many languages, their characters have closed regions or holes, such as the holes in Latin a, b, d, e, g, o, p, q, A, B, D, O, P, Q, R, 4, 6, 8, 9, 0. If characters are aligned to lines, so are their holes. To extract these holes, a 4-neighbor connected-component analysis is done on image
background, followed by a low-pass size filter to remove major white spaces that separate columns, paragraphs, sentences, words and characters. The remaining background components are used to go through the same processing steps mentioned above for the foreground components. Fig. 8 shows the detection in the background mode on the image in Fig. 1. It is obvious in Fig. 9 that the spectrum obtained in the background mode is not as clean as that in the
Fig. 11. Design of size-difference filter for component pairs using the image in Fig. 1. The major contributions to the central peak are from component pairs whose size differences are less than 100 pixels.
Fig. 12. The skew detection error rates on the 168 synthetic document images in UWDB-I (horizontal axis: absolute deviation from ground-truth in degrees; vertical axis: percentage of samples; curves: Original, Rotated). Each of the images in the database is randomly rotated three times in the range of ±90°, resulting in 504 rotated images.
foreground mode, but there is still a prominent peak that is usable for skew detection. The results from both modes are very close. Working on character holes can be faster because not all characters have holes, although this may also decrease detection accuracy. Another advantage of using character holes is that they are, to a large extent, immune to the character-touching problem that arises when the scan resolution is low, when the original documents have printing problems in the first place, or when touching is a characteristic of the character design.
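The hole extraction of the background mode can be sketched as a 4-connected flood fill over background pixels followed by a size filter. This is an illustration under assumptions, not the authors' code: the function name and the size limit are hypothetical, and the image is assumed to be a list of rows with 1 for ink and 0 for background.

```python
from collections import deque


def hole_centroids(image, max_size=200):
    """Centroids of small 4-connected background components."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 0 or seen[y0][x0]:
                continue
            # Breadth-first flood fill of one background component.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            size = sum_x = sum_y = 0
            while queue:
                x, y = queue.popleft()
                size += 1
                sum_x += x
                sum_y += y
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # Low-pass size filter: large white spaces are discarded.
            if size <= max_size:
                centroids.append((sum_x / size, sum_y / size))
    return centroids


# A 3x3 ring of ink with a one-pixel hole in the middle.
ring = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
holes = hole_centroids(ring, max_size=4)
```

The returned centroids can then be fed to the same pair-wise histogram procedure that is used for the foreground components.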
3. Speedup measures The time complexity of the proposed method is O(n²): n(n − 1)/2 component pairs are evaluated, where n is the total number of components in an image. If this number is reduced without sacrificing the detection accuracy, the computation can be sped up significantly. Here are some feasible measures that reduce computation time; some may even improve the detection accuracy to some extent.
Fig. 13. The skew detection error rates on the 979 real document images in UWDB-I (horizontal axis: absolute deviation from ground-truth in degrees; vertical axis: percentage of samples).
Fig. 14. Linear correlation between the ground truth and the detected skew angles for the real document images in UWDB-I (horizontal axis: ground-truth in degrees; vertical axis: detected skew angle in degrees; both from −3° to 3°). The correlation coefficient is 92.2%.
3.1. Alternative histogram configuration The baseline, full-range (within ±90◦ ) configuration quantizes the arctangent values into 9000 bins. One of the advantages of this configuration is that it can detect vertical text lines. If the detection range is limited to half-range
Fig. 15. Image H04IBIN from UWDB-I (skew ground-truth: −0.1◦ ). Fiducial lines are drawn along the angles of 89.82 ± 0.02◦ .
(within ±45◦ ), there is an alternative configuration that runs much faster. In the alternative configuration the slopes (not their angles) of the fiducial-lines are quantized into 3000 bins. When peaks are detected after applying the same convolution, only the slopes of the peaks are converted to angles. This virtual elimination of the arctangent computation in this alternate configuration reduces the detection time to about 38% of the baseline configuration. In fact, the majority of document images are suitable for using the alternative configuration.
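The alternative configuration can be sketched as follows. The text does not specify the bin layout, so this sketch assumes 3000 bins of equal width in slope covering [−1, +1), which corresponds to angles within ±45°; the function names are hypothetical, and pairs outside the half-range are simply skipped.

```python
import math

N_SLOPE_BIN = 3000  # assumed: equal-width slope bins over [-1, +1)


def slope_to_bin(dx, dy):
    """Bin index for the slope dy/dx, or None outside the half-range."""
    if dx == 0:
        return None  # vertical line, outside +/-45 degrees
    slope = dy / dx
    if not -1.0 <= slope < 1.0:
        return None
    return int((slope + 1.0) * N_SLOPE_BIN / 2.0)


def bin_to_angle(index):
    """Convert a (peak) bin back to an angle in degrees, at bin centre."""
    slope = (index + 0.5) * 2.0 / N_SLOPE_BIN - 1.0
    return math.degrees(math.atan(slope))


# Only the detected peak needs the arctangent, not every pair.
peak_bin = slope_to_bin(1000, 31)   # slope 0.031
peak_angle = bin_to_angle(peak_bin)
```

The per-pair work reduces to one division and one multiplication, and the arctangent is evaluated once per detected peak.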
3.2. Filters for individual components There are some component-based properties that can be used to design filters to reduce the number of components to process for the histogram construction. Commonly used filters include a size filter that admits only components of appropriate size (number of pixels), a density filter that admits only components whose ratio of size to bounding-box area is within a predefined range, and an aspect-ratio filter that admits only components whose bounding-box width-to-height ratio is close to 1.
Fig. 16. The raw (top) and convolved (bottom) histograms of image H04IBIN (skew ground-truth: −0.1°; horizontal axis: angle in degrees, from −90° to 90°). The detected skew angle is −0.18° (89.82° − 90°). The peak at 0.1° is from the horizontal text lines at the bottom of the page.
3.3. Filters for component pairs Applying filtering on the component pairs has impacts on both the accuracy and the efficiency of skew detection. There are some component-pair-related properties that can be used to design filters to reduce the number of component pairs to process for the histogram construction. The distance filter is a high-pass filter that takes into account only the contributions from the long-range component pairs. Fig. 10 plots the distances among the centroids of the component pairs versus the angles of their fiducial lines for the image in Fig. 1. The high-density stripes manifest the existence of parallel text lines. The central stripe is formed by pairs from the same text line, while the other
stripes are formed by pairs from different text lines. If the image is rotated, the plot only shifts horizontally. The histogram in Fig. 3 is actually the vertical projection of this plot, and the central stripe forms the highest peak in the histogram. It can be seen from Fig. 10 that the stripes become narrower when distances among components increase. It is because the farther the component pairs are apart the closer their fiducial-lines approximate the text lines, thus the less deviation among them. At distances below 100 pixels, there are some fine features that reveal the effect of square imaging grid. The short-ranged pairs mostly blur the peak that is used for skew detection, and there are more shortrange component pairs than long-ranged ones. Therefore, a distance-filter with a threshold of 100–200 pixels can improve the detection accuracy. Yet, it is the efficiency that benefits most from the reduction of the number of component pairs.
Fig. 17. Image D03EBIN (skew ground-truth: −0.21◦ ) from UWDB-I. Fiducial lines are drawn along the angles of 0.16 ± 0.02◦ .
The size-difference filter is a low-pass filter that only allows the contributions from similar-sized components. It can be seen from Fig. 11 that for the sample image in Fig. 1 the major contributions are from those component pairs with size differences of less than 100 pixels. This threshold value is text dependent, and a distribution analysis of the component sizes is needed if no a priori knowledge of the image is available. The main purpose of applying this filter is speed gain.
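The two pair filters of this section can be combined in one generator, sketched below. The function name, the tuple layout (centroid x, centroid y, size in pixels) and the default thresholds are assumptions for illustration; the thresholds of 100 pixels are taken from the discussion of the sample image and are text dependent.

```python
import math
from itertools import combinations


def filtered_pairs(components, min_distance=100.0, max_size_diff=100):
    """Yield component pairs that pass both pair filters.

    components: list of (cx, cy, size) tuples.
    """
    for a, b in combinations(components, 2):
        xa, ya, size_a = a
        xb, yb, size_b = b
        # Distance filter (high-pass): drop short-range pairs.
        if math.hypot(xb - xa, yb - ya) < min_distance:
            continue
        # Size-difference filter (low-pass): keep similar-sized pairs.
        if abs(size_a - size_b) >= max_size_diff:
            continue
        yield a, b


components = [(0, 0, 50), (50, 0, 60), (300, 0, 55), (600, 0, 500)]
kept = list(filtered_pairs(components))
```

In this example only two of the six pairs survive: the nearest pair is rejected by the distance filter and every pair involving the large component is rejected by the size-difference filter.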
3.4. Faster slope-to-angle calculation The most time-consuming computation is the computation of arctangents. Time can be saved by pre-calculating the arctangents and storing the data in a persistent object (one arctangent table for all detection runs). Subsequent arctangent computations become indexing and interpolation operations, which speeds up the method at the cost of some extra memory space.
Fig. 18. The raw (top) and convolved (bottom) histograms of image D03EBIN (skew ground-truth: −0.21°; horizontal axis: angle in degrees, from −90° to 90°). The detected skew angle is 0.16°.
3.5. Skew-independent segmentation One of the motivations of this paper is to design a skew estimation method which is robust enough to be able to work without the need to separate textual and graphical components or to segment text columns. However, some form of skew-independent segmentation can improve both efficiency and accuracy of skew detection. Even if the components are separated only into two equal halves, the detection time will be reduced to less than 50% of the original time. As a result, component segments of different natures no longer interfere in the skew detection process, thus the reliability of the detection method may improve. Of course, the segmentation algorithm should be chosen as
such that its time-saving far outweighs the cost of deploying it. 4. Experimental results Note that the experimental data are obtained using the original configuration without applying any of the filters or segmentation mentioned in the previous section, except a 10-pixel size filter to remove pepper noises. 4.1. Synthetic images (total 168 from UWDB-I) There is a group of 168 images in UWDB-I that are synthesized from LaTeX documents, and their skew angles
Fig. 19. Image E01EBIN (skew ground-truth: −0.04◦ ) from UWDB-I. Fiducial lines are drawn along the angles of −0.04 ± 0.02◦ .
are exactly 0◦ . The purpose of using these synthetic images is to verify that the detection results of the proposed method do correspond to the text line orientations of the document images as intended. In the first test, the original synthetic images are directly used and the results are shown in Fig. 12. The standard deviation, which is a measure of data dispersion from mean, is 0.04◦ (about 2 bins). The larger errors are from those images that have only sparse text lines.
In the second test, each of the 168 synthetic images is rotated randomly in the range from 0° to 90° three times, producing a total of 504 new images. The purpose of this experiment is to confirm that the detection method is valid in the whole range of ±90° as designed. The detected skew angles are compared with the randomly generated rotation angles. The results are also shown in Fig. 12. It can be seen that detection on the rotated images is more accurate on average than on the originals. This is because the text orientations in the original images are close to horizontal, along which the quantization error of the imaging grid is the highest. It can be expected from Fig. 5 that the detection errors are also high if the text orientations in the images are along some specific angles, such as ±45°, ±63.44° or ±26.56°.
Fig. 20. The raw (top) and convolved (bottom) histograms of image E01EBIN (skew ground-truth: −0.04°; horizontal axis: angle in degrees, from −90° to 90°). The detected skew angle is −0.04°.
4.2. Scanned images (total 979 from UWDB-I) There is another group of 979 images in UWDB-I that are scanned from real printed materials. Some samples and their histograms are shown in Figs. 15–20. Many images contain large areas of disjoint, non-textual components that are the results of binarization on
photographic objects, or the artifacts of the scanning process. There are no general-purpose filters to remove these noises from textual components. Therefore, these images are also good candidates for evaluating the robustness and processing efficiency of any skew detection method. Skew detection is run on the original images and the deviations from ground truth are shown in Figs. 13 and 14. The standard deviation is 0.17° (about 9 bins). According to the documentation of the UWDB-I, these images were scanned by human operators, and the misalignment is very small, with a normal distribution centered at about 2°. Therefore, not only the absolute errors (Fig. 13) but also the linear correlation between the detected skew angle and the ground truth (Fig. 14) should be evaluated.
Fig. 21. A scanned Chinese newspaper clip. Fiducial lines are drawn on the original image (top-left) along the angles of 0.04 ± 0.02° (top-right), 50.44 ± 0.02° (bottom-left) and 89.86 ± 0.02° (bottom-right).
Fig. 22. The raw (top) and convolved (bottom) histograms of the Chinese newspaper clip. There are multiple prominent peaks in the convolved histogram due to the special style of Chinese text. [Histogram plots omitted; horizontal axis: angle in degrees, from −90 to 90; labeled peaks at 0.04°, 89.88°, 50.44° and −50.56°.]
The linear correlation coefficient of this experiment is 92.2%, which means that it is very unlikely that the small detected angles are irrelevant, random values.

4.3. Scanned images from Chinese newspaper clips

A few scanned images from Chinese newspaper clips are used to investigate how the proposed method works on non-Latin text, in the same experimental setup as for the English document images. Fig. 21 is one of the more complex samples tested. Chinese characters are mono-spaced; therefore, multiple peaks can be detected (see Fig. 22). The same effect can also be observed in English document images that use mono-spaced fonts.
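Why a mono-spaced layout yields several peaks can be illustrated with a small sketch. It follows only the first step of the method as summarized in the abstract (the angles of the lines through the centroids of all unique component pairs, quantized into a 1-D accumulator covering −90° to +90°) and inspects the raw histogram; the convolution and verification steps are omitted. The grid size, the row-to-column pitch ratio of 1.21 (chosen only so that the diagonal falls near 50°), the 0.02° bin width and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def make_grid(rows, cols, pitch_x, pitch_y):
    """Centroids of a hypothetical mono-spaced block of characters."""
    return [(c * pitch_x, r * pitch_y) for r in range(rows) for c in range(cols)]


def pair_angle_histogram(centroids, bin_width=0.02):
    """Accumulate the angle of the line through every unique centroid pair.

    Bin k is centered at k * bin_width - 90 degrees, so the array covers
    -90 to +90 degrees inclusive. Vertical pairs are counted at +90.
    """
    n_bins = int(round(180.0 / bin_width)) + 1
    hist = [0] * n_bins
    for (x1, y1), (x2, y2) in combinations(centroids, 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # coincident centroids define no line
        angle = 90.0 if dx == 0 else math.degrees(math.atan(dy / dx))
        hist[int(round((angle + 90.0) / bin_width))] += 1
    return hist


def top_peaks(hist, k, bin_width=0.02):
    """Angles (bin centers, in degrees) of the k most populated bins."""
    order = sorted(range(len(hist)), key=lambda i: hist[i], reverse=True)
    return [i * bin_width - 90.0 for i in order[:k]]


if __name__ == "__main__":
    # Assumed layout: 10 x 10 characters, row pitch 1.21 times column pitch.
    grid = make_grid(10, 10, 1.0, 1.21)
    hist = pair_angle_histogram(grid)
    for angle in sorted(top_peaks(hist, 4)):
        print(f"peak near {angle:.2f} deg")
```

On this toy grid the strongest bins should be those of the rows (0°) and columns (90°), followed by the two diagonals at roughly ±50.4°, which mirrors the pattern of multiple peaks described for the newspaper clip. Proportional fonts break this regular spacing, so the non-skew directions would be expected to accumulate far fewer aligned pairs.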
5. Conclusion

The motivation of this paper is to develop a general-purpose skew estimation method that is robust enough to process noisy textual document images. Since the proposed method is independent of the page segmentation stage, it can be plugged in either upstream or downstream of the segmentation module in the processing pipeline. A direct comparison with the method [9] from the creators of UWDB-I, using the same image database, shows that the proposed method has detection precision similar to that of their algorithm in automatic mode. In terms of efficiency, the proposed method is highly competitive in speed even in its baseline implementation, without the speedup measures mentioned in this paper. For the UWDB-I scanned images, which measure 2592 × 3300 pixels, it takes about 2 s on average to process one sample (excluding I/O operations) on the Java 2 platform on a 2.3 GHz Pentium IV computer. It is possible to improve its performance by up to an order of magnitude by using some of the speedup measures proposed in this paper.
References

[1] W. Postl, Detection of linear oblique structures and skew scan in digitized documents, Proceedings of 8th International Conference on Pattern Recognition, Paris, October 1986, pp. 687–689.
[2] H.S. Baird, The skew angle of printed documents, Proceedings of SPSE 40th Annual Conference and Symposium on Hybrid Imaging Systems, Rochester, NY, May 1987, pp. 21–24.
[3] Y. Nakano, Y. Shima, H. Fujisawa, J. Higashino, M. Fujinawa, An algorithm for skew normalization of document images, Proceedings of 10th International Conference on Pattern Recognition, Atlantic City, New Jersey, 1990, pp. 8–13.
[4] A.L. Spitz, Skew determination in CCITT Group 4 compressed images, Proceedings of 1st Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, 16–18 March 1992, pp. 11–25.
[5] S.N. Srihari, V. Govindaraju, Analysis of textual images using the Hough transform, Mach. Vision Appl. 2 (3) (1989) 141–153.
[6] B. Yu, A.K. Jain, A robust and fast detection algorithm for generic documents, Pattern Recogn. 29 (10) (1996) 1599–1629.
[7] A. Hashizume, P.S. Yeh, A. Rosenfeld, A method of detecting the orientation of aligned components, Pattern Recogn. Lett. 4 (1986) 125–132.
[8] L. O'Gorman, The document spectrum for page layout analysis, IEEE Trans. Pattern Anal. Mach. Intell. 15 (11) (1993) 1162–1173.
[9] S. Chen, R.M. Haralick, An automatic algorithm for text skew estimation in document images using recursive morphological transforms, Proceedings of IEEE Conference on Image Processing, Austin, Texas, November 1994, pp. 139–143.
[10] C. Sun, D. Si, Slant correction for document images using gradient direction, Proceedings of 4th International Conference on Pattern Recognition, Germany, August 1997, pp. 142–146.
[11] G. Nagy, Twenty years of document image analysis in PAMI, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 38–62.
[12] R. Cattoni, T. Coianiz, S. Messelodi, C.M. Modena, Geometric layout analysis techniques for document image understanding: a review, ITC-IRST Technical Report #970309, 1998.
[13] A.D. Bagdanov, Evaluation of document image skew estimation techniques, Proc. SPIE 2660 (1996) 343–353.
[14] B. Yuan, C.L. Tan, Skewscope: the document image skew detector, 7th International Conference on Document Analysis and Recognition, Edinburgh, Scotland, 3–6 August 2003, pp. 49–53.
About the Author—B. YUAN received the B.Sc. and M.Sc. degrees in Nuclear Physics in 1985 and 1988 from Peking University, China. He started his Ph.D. research under the guidance of Professor C.L. Tan after he received his M.Sc. in Computer Science in 2000 from National University of Singapore. His current research interests include scanning defects analysis and correction, historical documents restoration, and the development of document imaging systems on the Java platform. He is currently a Research Scientist in the Centre for Remote Imaging, Sensing and Processing, National University of Singapore.
About the Author—C.L. TAN received the B.Sc. (Hons.) degree in Physics in 1971 from University of Singapore, the M.Sc. degree in Radiation Studies in 1973 from University of Surrey, UK, and the Ph.D. degree in Computer Science in 1986 from University of Virginia, USA. His research interests include document image and text processing, neural networks and genetic programming. He has published more than 200 research publications in these areas. He is an associate editor of Pattern Recognition. He has served on the program committees of many international conferences and workshops, including the International Conference on Document Analysis and Recognition (ICDAR) 2005, International Workshop on Graphics Recognition (GREC) 2005, and the International Conference on Pattern Recognition (ICPR) 2006. He is currently an Associate Professor in the Department of Computer Science, School of Computing, National University of Singapore.