Optik - International Journal for Light and Electron Optics 203 (2020) 164025

Original research article

Binarization of degraded document images with global-local U-Nets


Xiao Huang*, Lin Li, Rong Liu, Chengshen Xu, Mingdeng Ye
Autohome Inc., 10th Floor Tower B, No. 3 Dan Ling Street, Haidian District, Beijing, China

*Corresponding author. E-mail address: [email protected] (X. Huang).

https://doi.org/10.1016/j.ijleo.2019.164025
Received 23 October 2019; Received in revised form 4 December 2019; Accepted 7 December 2019
0030-4026/© 2019 Elsevier GmbH. All rights reserved.

Abstract

Document binarization plays a significant role in document analysis: it extracts the foreground text from the background. Traditional convolutional neural networks (CNNs) focus only on local textual features and ignore the global context, yet both are important for pixel classification in segmentation-based document binarization. In this paper, we propose a combined local-global approach for document binarization. The model is composed of a global branch and a local branch, which take global patches from a downsampled image and local patches cropped from the source image as their respective inputs. The final binary prediction is obtained by combining the results of these two branches. Experimental results on several DIBCO datasets show that our method outperforms many traditional and state-of-the-art document binarization algorithms.

Keywords: Document binarization; Convolutional neural networks; Document analysis

1. Introduction

Document binarization is one of the most important pre-processing steps in document analysis and recognition. It aims to separate the foreground text from the degraded background of a document image, and its performance directly affects the accuracy of subsequent tasks. Document images suffer from various types of degradation, such as page stains, uneven illumination, bleed-through, contrast variation, and deteriorated environments [1–6], which makes accurate binarization difficult [7–9], although a number of studies have been conducted to tackle this issue.

Generally, binarization algorithms can be categorized into two types: traditional non-machine-learning methods and image-segmentation-based deep learning approaches. The classical Otsu algorithm [10] is a traditional binarization method using global thresholding. Local adaptive thresholding methods such as Niblack's [11], Sauvola's [12], and Wolf's [13] are more robust for binarizing images in general. However, traditional thresholding methods have difficulty with documents whose backgrounds are severely degraded.

Recently, deep learning strategies, which treat binarization as an image segmentation task [14–17], have made remarkable progress in document binarization. A convolutional auto-encoder-decoder model was proposed by Calvo-Zaragoza and Gallego [18] for document image binarization. Tensmeyer and Martinez [19] proposed a fully convolutional neural network with combined F-measure and pseudo F-measure loss functions to binarize document images and palm leaf manuscripts. Vo et al. [20] established a hierarchical deep supervised network to predict text pixels at different feature levels. In these methods, the original document images are cropped into small local patches, a convolutional neural network (CNN) based method [21–23] translates each colored or gray-scale patch into a binarized one, and the binary patches are finally assembled into a large image with the same size as the source image.





Fig. 1. Our proposed local-global architecture. Two local U-Nets are stacked and then combined with one global U-Net.

Since document images usually have ultra-high resolution (up to 3000 pixels in height or width), small local image patches instead of the whole image are used as the input of the neural network because of the limited GPU memory. However, this patch-cropping strategy loses the spatial contextual information of the whole document image, sometimes causing misclassification when it is difficult to distinguish foreground text from a degraded background.

This paper presents a new approach that overcomes these shortcomings by combining local-level and global-level networks. The local network accepts patches cropped from the original document image as inputs and produces precise character boundaries, while the global network performs segmentation from a holistic perspective by converting the whole input image to a low-resolution map. Both networks use U-Net [24] as the basic component, since U-Net introduces skip connections that concatenate encoder and decoder layers and thereby achieves better performance in image segmentation.

The remainder of the paper is structured as follows: Section 2 describes the proposed local-global network. Section 3 presents quantitative and qualitative document binarization experiments. Finally, Section 4 summarizes the main conclusions drawn from this study.

2. Proposed method

An overview of the proposed architecture is shown in Fig. 1. Three U-Net models are fused to convert the original colored image into the final binarized image: Local U-Nets with input sizes 128 × 128 and 256 × 256, and a Global U-Net with input size 512 × 512. The two Local U-Nets are stacked and then combined with the Global U-Net.

2.1. Local U-Net

The U-Net model employed in the Local U-Net is composed of an encoder and a decoder. The encoder consists of several repetitions of two convolution layers and one max-pooling layer, with kernel sizes of 3 × 3 and 2 × 2, respectively. Each convolution is followed by batch normalization and a ReLU layer. Along the downsampling path of the encoder, the height and width of the feature map are halved while the number of channels doubles. The decoder mirrors the encoder: the spatial size of the feature map doubles while the number of channels halves. The Local U-Net takes cropped image patches as input and produces segmentation maps at their original resolution. Since the network extracts hierarchical features from patches of different sizes, we construct two Local U-Net models with input sizes 128 × 128 and 256 × 256 and stack them to achieve better performance. A minimal code sketch of this building block follows.
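The paper does not publish code, so the following PyTorch snippet only illustrates the encoder/decoder wiring described above. The two-level depth and the channel counts are our assumptions; the actual model repeats the halving/doubling stages more times in the same pattern.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch normalization and ReLU,
    # as described for the Local U-Net encoder.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net sketch: spatial size halves (2x2 max-pooling) while
    channels double along the encoder; the decoder mirrors this and
    concatenates the skip connection from the matching encoder level."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)          # skip concat doubles channels
        self.head = nn.Conv2d(base, 2, kernel_size=1)   # two classes: text / background

    def forward(self, x):
        s1 = self.enc1(x)                 # full resolution, `base` channels
        s2 = self.enc2(self.pool(s1))     # half resolution, `2*base` channels
        d1 = self.up(s2)                  # back to full resolution
        d1 = self.dec1(torch.cat([d1, s1], dim=1))
        return self.head(d1)              # per-pixel class logits
```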
2.2. Global U-Net

The Global U-Net has the same architecture as the Local U-Net, except that its input size is 512 × 512. We feed it global patches cropped from a downsampled version of the source image instead of resizing the image directly to the desired scale. The reason is twofold. First, the Document Image Binarization Contest (DIBCO) datasets contain only about one hundred images, and training on whole rescaled images would greatly reduce the number of training samples, yielding a poorly trained model. Second, since the document images in the DIBCO datasets have varied styles with different height/width ratios, resizing the whole image would introduce unwanted geometric distortion, which can affect the accuracy of the model. For these reasons, we crop image patches from a rescaled version of the original document image by scanning a window of fixed size 512 × 512 across the entire image region.

The rescaling and cropping strategy is based on the ratio of the original image height h to the image width w, as follows (a code sketch of the rule appears after the list):


(1) h/w ≥ 2: the original image is resized to (height, width) = (512h/w, 512), and cropping is performed along the height direction to obtain several image patches.
(2) 1 < h/w < 2: the rescaled size is (1024h/w, 1024), and cropping is performed along both the height and width directions.
(3) 0.5 ≤ h/w ≤ 1: the original image is resized to (1024, 1024w/h), and cropping is performed along both the height and width directions.
(4) h/w < 0.5: the rescaled size is (512, 512w/h), and cropping is performed along the width direction.
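A minimal sketch of this rule, assuming OpenCV and NumPy arrays; the sliding-window stride is our assumption, since the paper does not state the overlap between adjacent windows.

```python
import cv2

WIN = 512  # fixed window size of the Global U-Net input

def rescale_for_global(img):
    """Resize according to the aspect ratio h/w (the four cases above).
    cv2.resize expects the target size as (width, height)."""
    h, w = img.shape[:2]
    r = h / w
    if r >= 2:
        size = (WIN, round(WIN * r))
    elif r > 1:
        size = (2 * WIN, round(2 * WIN * r))
    elif r >= 0.5:
        size = (round(2 * WIN / r), 2 * WIN)
    else:
        size = (round(WIN / r), WIN)
    return cv2.resize(img, size)

def global_patches(img, stride=WIN):
    """Scan a 512x512 window over the rescaled image. A real implementation
    would pad the image so the trailing window fits exactly."""
    img = rescale_for_global(img)
    h, w = img.shape[:2]
    for y in range(0, max(h - WIN, 0) + 1, stride):
        for x in range(0, max(w - WIN, 0) + 1, stride):
            yield (y, x), img[y:y + WIN, x:x + WIN]
```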
2.3. Combination

The proposed method first stacks the two Local U-Nets with input sizes 128 × 128 and 256 × 256 by averaging the softmax activations of their final probability maps, as shown in the lower branch of Fig. 1. The upper branch takes the designed global patches as inputs and converts them with the Global U-Net of input size 512 × 512. The binarized predictions of the stacked local model and the global model are then combined into the final representation via a logical AND operator (see the sketch below). Note that the outputs of the U-Net models are cropped image patches; the stacking and combining operations are both conducted over the re-assembled whole image.
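A sketch of this fusion step, assuming the branch outputs have already been re-assembled into whole-image probability maps; the names `p_local1`, `p_local2`, and `p_global` are hypothetical.

```python
import numpy as np

def combine(p_local1, p_local2, p_global, thr=0.5):
    """Fuse the two local maps by averaging their softmax probabilities,
    binarize each branch, then AND the local and global predictions.
    All inputs are HxW arrays of per-pixel text probabilities."""
    p_local = (p_local1 + p_local2) / 2.0        # stack the two Local U-Nets
    local_bin = p_local > thr                    # True where a pixel is predicted as text
    global_bin = p_global > thr
    return np.logical_and(local_bin, global_bin) # text only where both branches agree
```

The AND fusion is conservative by design: a pixel is kept as foreground only when both the local and global branches classify it as text, which suppresses each branch's false positives.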

3. Experiments

In this section, the experimental results of the proposed methods are presented. The training sets are constructed from the public benchmark datasets of the Document Image Binarization Contest (DIBCO). Evaluation metrics are introduced, and the results of our method are compared with state-of-the-art binarization algorithms from the DIBCO competitions.

3.1. Dataset and metrics

All competition images from the years 2009–2018 are used for training, validation, or testing. When testing on a particular DIBCO year, the DIBCO datasets before that year are used as training data; for example, when testing on DIBCO 2017, the (H-)DIBCO 2009–2016 datasets compose the training set. Four evaluation metrics adopted in the (H-)DIBCO contests are used for quantitative evaluation and comparison: F-measure (Fm), pseudo F-measure (Fps), the distance reciprocal distortion metric (DRD), and the peak signal-to-noise ratio (PSNR). For Fm, Fps, and PSNR, larger values indicate better performance, while for DRD, smaller is better. A sketch of the two simpler metrics follows.

3.2. Hyper-parameter selection

Our architecture uses U-Net to binarize each image patch. We keep the original hyper-parameters of U-Net and focus on other aspects: the effect of data augmentation, the choice of loss function, and the threshold used to convert a probability map into a binary output. Throughout the hyper-parameter discussion, we take the F-measure Fm as the evaluation metric. Image crops containing only background are meaningless when calculating Fm, so we conduct the evaluation on the original whole document image. The Local U-Net with input size 128 × 128 is used for this discussion, trained on the (H-)DIBCO 2009–2016 datasets and tested on the whole document images of H-DIBCO 2018.

3.2.1. Data augmentation

Data augmentation has been demonstrated to be an effective way to improve model performance, especially when the amount of training data is limited [25,26]. During training we apply random distortion, shear, and skew, and add random Gaussian noise and random brightness changes, each with a certain probability for every image patch [25,27] (sketched below). The effect of data augmentation on the testing F-measure over training epochs is depicted in Fig. 2, where the loss function is binary cross-entropy and the threshold is fixed at 0.5. The F-measure increases from 89.2% to 91.5%, a relative improvement of around 2.6%, when data augmentation is applied. This is because data augmentation increases the diversity of the training samples and makes it easier for the model to capture the foreground characters in the document images, thus improving accuracy.

Fig. 2. Comparison of testing F-measures with and without data augmentation.
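The paper does not give augmentation magnitudes or probabilities, so every numeric value in the sketch below is our assumption; torchvision stands in for the augmentation tooling cited in [25,27], and random distortion is approximated by the affine transform. In real training, the geometric operations must be applied identically to the image and its ground-truth mask; the sketch shows the image side only.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(t, std=0.05):
    # Additive Gaussian noise on a [0, 1] float tensor; std is assumed.
    return (t + torch.randn_like(t) * std).clamp(0.0, 1.0)

# Applied per patch during training; magnitudes and probabilities are assumed.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, shear=10),   # approximates skew, shear, mild distortion
    transforms.ColorJitter(brightness=0.3),         # random brightness
    transforms.ToTensor(),                          # PIL image -> [0, 1] float tensor
    transforms.RandomApply([transforms.Lambda(add_gaussian_noise)], p=0.5),
])
```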

3.2.2. Loss function

Traditionally, binary cross-entropy loss has been used for binary image segmentation tasks. However, class imbalance has been shown to limit performance, and several additional loss functions have been investigated to tackle it. We compare four commonly used loss functions for training our document binarization model. During training, we apply a pixel-wise softmax activation in the final layer of the model to obtain the predicted probability map. The four loss functions are binary cross-entropy loss (BCE loss), focal loss (γ = 2.0, α = 0.25), dice loss, and BCE + dice loss [28,29]; the comparison is shown in Fig. 3, where data augmentation is employed and the threshold is fixed at 0.5. The performance differences between BCE loss, dice loss, and BCE + dice loss are negligible, but all three significantly outperform the focal loss. This indicates that class imbalance has little effect on the results of document binarization tasks. Since dice loss also benefits class-imbalanced data, we use BCE + dice loss as the loss function in the subsequent experiments; a minimal sketch of the combined loss follows.

Fig. 3. Comparison of testing F-measures with different training loss functions.
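A minimal sketch of the combined loss, assuming the model output `p` is a per-pixel text probability in [0, 1] and `y` is the binary ground truth as a float tensor of the same shape.

```python
import torch
import torch.nn.functional as F

def dice_loss(p, y, eps=1.0):
    # Soft dice over the batch; eps smooths the ratio for empty masks.
    inter = (p * y).sum()
    return 1 - (2 * inter + eps) / (p.sum() + y.sum() + eps)

def bce_dice_loss(p, y):
    # BCE drives per-pixel accuracy; dice is insensitive to class imbalance,
    # which is why the paper adopts the sum of the two.
    return F.binary_cross_entropy(p, y) + dice_loss(p, y)
```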


3.2.3. Threshold

The U-Net model converts the original RGB or gray image into a probability map via a softmax activation; a threshold is then applied to binarize the probability map. The influence of this threshold is also studied. The variation of the F-measure with respect to the threshold at different training epochs is illustrated in Fig. 4. The highest F-measure is achieved when the threshold is around 0.5, and the performance varies negligibly in the vicinity of 0.5 at each particular epoch. This indicates that the activations are most robust when the threshold is set to 0.5, as sketched below.
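Binarization at a fixed threshold, plus the sweep behind a Fig. 4-style analysis, can be sketched as follows; the grid of candidate thresholds is our choice, and `f_measure` is the helper sketched in Section 3.1.

```python
import numpy as np

def binarize(prob_map, thr=0.5):
    """Threshold the softmax text-probability map into a binary mask."""
    return prob_map > thr

def sweep_thresholds(prob_map, gt, thrs=np.arange(0.1, 0.95, 0.05)):
    # Evaluate the F-measure at each candidate threshold (f_measure as
    # sketched in Section 3.1); per Fig. 4 the curve is nearly flat near 0.5.
    return {round(float(t), 2): f_measure(binarize(prob_map, t), gt) for t in thrs}
```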

Fig. 4. Variation of testing F-measures with respect to the threshold at different training epochs.


Fig. 5. Segmentation results for a machine-printed image in the DIBCO 2017 dataset: (a) source image; (b) ground truth; predictions with the (c) Local1 model, (d) Global model, and (e) Local + Global model. Colors: black, true positives; white, true negatives; red, false negatives; blue, false positives.

3.3. Binarization results of the proposed method

In this section, the performance of the approach is analyzed by comparing the binarization results of the Local U-Net with input size 128 × 128 (Local1), the Global U-Net with input size 512 × 512 (Global), and the combined model with two Local U-Nets and the Global U-Net (Local + Global). Fig. 5 displays the binarization results for a machine-printed document image in the DIBCO 2017 dataset. The image suffers from bleed-through degradation, which renders background pixels very similar to the text of the document. With the Local U-Net, false-positive regions are more likely to appear in the background area, while with the Global U-Net they tend to appear around the text area. This indicates that the local model is prone to misclassify the degraded background as foreground text because of the bleed-through degradation, while the global model is good at grasping the spatial information but poor at precisely delineating the outlines of words against the background. As shown in Fig. 5(e), the accuracy of the system is enhanced by combining the local and global information.

Fig. 6. Segmentation results for a handwritten image in the H-DIBCO 2018 dataset: (a) source image; (b) ground truth; predictions with the (c) Local1 model, (d) Global model, and (e) Local + Global model. Colors: black, true positives; white, true negatives; red, false negatives; blue, false positives.


Table 1. Evaluation results for the H-DIBCO 2016 dataset.

Methods                                  Fm      Fps     PSNR    DRD
Otsu [10]                                86.61   88.67   17.80   5.56
Sauvola [12]                             82.52   86.85   16.42   7.49
Winner of H-DIBCO 2016 contest [31]      87.61   91.28   18.11   5.21
Vo et al. [20]                           90.10   93.57   19.01   3.58
Jia et al. [30]                          90.48   93.27   19.30   3.97
Tensmeyer et al. [19]                    89.52   93.76   18.67   3.76
Proposed Local1                          89.71   93.00   18.87   3.80
Proposed Local1 + Local2                 90.53   93.90   19.18   3.27
Proposed Global                          89.79   92.51   18.36   4.04
Proposed Local + Global                  90.77   94.21   19.33   3.11

For a handwritten document image, Fig. 6 shows the binarization results of the different models. This ancient handwritten document suffers from deterioration around the edge of the paper, which the Local U-Net may confuse with the foreground text. Combining the Local U-Nets with the Global U-Net is again the better choice for binarization in this scenario.

3.4. Comparative results

In this section, we compare the proposed methods, namely the Local U-Net with input size 128 × 128 (Local1), the stacked model with Local U-Nets of input sizes 128 × 128 and 256 × 256 (Local1 + Local2), and the combined model with two Local U-Nets and the Global U-Net (Local + Global), with document image binarization algorithms from previous research. Tables 1–3 present the quantitative results for the four evaluation metrics (Fm, Fps, PSNR, and DRD) on H-DIBCO 2016, DIBCO 2017, and H-DIBCO 2018, respectively. For the H-DIBCO 2016 dataset, Table 1 shows that although the single local model (Local1) obtains slightly worse results than the models of Vo et al. [20] and Jia et al. [30], the stacked local model (Local1 + Local2) performs better, and the proposed combined Local + Global model is best on all four evaluation metrics. From Table 2 we can see that the proposed models without global information (Local1, Local1 + Local2) perform similarly to the deep learning approach of the winner of DIBCO 2017, while the Local + Global model achieves clearly better results. This implies that aggregating local and global features indeed improves the ability to distinguish foreground text from a degraded background. Moreover, the results attained by the proposed model gradually improve from H-DIBCO 2016 to H-DIBCO 2018, which may be ascribed to the increasing number of training samples. We also measured the average inference time of the Local1 model and the combined Local + Global model on a GPU (RTX 2080): the Local + Global model increases binarization precision over the Local1 model at the cost of only 210 ms of additional latency, from 450 ms to 660 ms.

Table 2. Evaluation results for the DIBCO 2017 dataset.

Methods                                  Fm      Fps     PSNR    DRD
Otsu [10]                                77.73   77.89   13.85   15.54
Sauvola [12]                             77.11   84.10   14.25   8.85
Winner of DIBCO 2017 contest [32]        91.04   92.86   18.28   3.40
Proposed Local1                          90.74   91.38   17.91   3.68
Proposed Local1 + Local2                 91.72   92.46   18.42   3.16
Proposed Global                          89.69   90.40   17.35   3.92
Proposed Local + Global                  92.14   93.02   18.71   2.81

Table 3. Evaluation results for the H-DIBCO 2018 dataset.

Methods                                  Fm      Fps     PSNR    DRD
Otsu [10]                                51.45   53.05   9.74    59.07
Sauvola [12]                             67.81   74.08   13.78   17.69
Winner of H-DIBCO 2018 contest [33]      88.34   90.24   19.11   4.92
Proposed Local1                          91.82   94.16   19.76   3.01
Proposed Local1 + Local2                 91.75   94.35   20.15   2.61
Proposed Global                          80.99   83.35   16.36   12.27
Proposed Local + Global                  92.10   94.88   20.41   2.36

4. Conclusion

In this paper, we propose a document binarization method consisting of global and local segmentation branches, both of which use U-Net as the basic architecture. The Global U-Net takes patches cropped from a downsampled image as inputs, while the local branch uses stacked U-Nets fed with patches cropped from the source image at two input scales. Finally, the stacked local model and the global model are fused through a logical AND operator to obtain the final result. The experimental results demonstrate the efficacy of the proposed combination: the global or local method alone loses either fine foreground details or global context, while their combination outperforms existing state-of-the-art methods on the recent (H-)DIBCO competitions.



Conflicts of interest

The authors declare no conflicts of interest.

Acknowledgements

Support from the National Natural Science Foundation of China (11602027) is acknowledged.

References

[1] A. Sulaiman, K. Omar, M.F. Nasrudin, Degraded historical document binarization: a review on issues, challenges, techniques, and future directions, J. Imaging 5 (4) (2019) 48.
[2] Z. Huang, Y. Zhang, Q. Li, et al., Spatially adaptive denoising for X-ray cardiovascular angiogram images, Biomed. Signal Process. Control 40 (2018) 131–139.
[3] W. Xiong, J. Xu, Z. Xiong, et al., Degraded historical document image binarization using local features and support vector machine (SVM), Optik 164 (2018) 218–223.
[4] Z. Huang, Q. Li, T. Zhang, et al., Iterative weighted sparse representation for X-ray cardiovascular angiogram image denoising over learned dictionary, IET Image Process. 12 (2) (2017) 254–261.
[5] N. Kligler, S. Katz, A. Tal, Document enhancement using visibility detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 2374–2382.
[6] Z. Hadjadj, M. Cheriet, A. Meziane, Y. Cherfa, A new efficient binarization method: application to degraded historical document images, Signal Image Video Process. 11 (6) (2017) 1155–1162.
[7] S.A. Oliveira, B. Seguin, F. Kaplan, dhSegment: a generic deep-learning approach for document segmentation, 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2018, pp. 7–12.
[8] Z. Huang, H. Fang, Q. Li, et al., Optical remote sensing image enhancement with weak structure preservation via spatially adaptive gamma correction, Infrared Phys. Technol. 94 (2018) 38–47.
[9] Y. Chen, L. Wang, Broken and degraded document images binarization, Neurocomputing 237 (2017) 272–280.
[10] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst. Man Cybern. 9 (1) (1979) 62–66.
[11] W. Niblack, An Introduction to Digital Image Processing, Strandberg Publishing Company, 1985.
[12] J. Sauvola, M. Pietikäinen, Adaptive document image binarization, Pattern Recognit. 33 (2) (2000) 225–236.
[13] C. Wolf, J.-M. Jolion, Extraction and recognition of artificial text in multimedia documents, Formal Pattern Anal. Appl. 6 (4) (2004) 309–326.
[14] M.M. Dyla, F. Morain-Nicolier, Text line segmentation and binarization of handwritten historical documents using the fast and adaptive bidimensional empirical mode decomposition, Optik 188 (2019) 52–63.
[15] Z. Huang, L. Chen, Y. Zhang, et al., Robust contact-point detection from pantograph-catenary infrared images by employing horizontal-vertical enhancement operator, Infrared Phys. Technol. 101 (2019) 146–155.
[16] X. Peng, H. Cao, P. Natarajan, Using convolutional encoder-decoder for document image binarization, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, IEEE, 2017, pp. 708–713.
[17] Z. Huang, L. Huang, Q. Li, et al., Framelet regularization for uneven intensity correction of color images with illumination and reflectance estimation, Neurocomputing 314 (2018) 154–168.
[18] J. Calvo-Zaragoza, A.-J. Gallego, A selectional auto-encoder approach for document image binarization, Pattern Recognit. 86 (2019) 37–47.
[19] C. Tensmeyer, T. Martinez, Document image binarization with fully convolutional neural networks, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, IEEE, 2017, pp. 99–104.
[20] Q.N. Vo, S.H. Kim, H.J. Yang, G. Lee, Binarization of degraded document images based on hierarchical deep supervised network, Pattern Recognit. 74 (2018) 568–586.
[21] S. Lian, L. Li, G. Lian, et al., A global and local enhanced residual U-Net for accurate retinal vessel segmentation, IEEE/ACM Trans. Comput. Biol. Bioinform. (2019).
[22] Z. Huang, Y. Zhang, Q. Li, et al., Unidirectional variation and deep CNN denoiser priors for simultaneously destriping and denoising optical remote sensing images, Int. J. Remote Sens. 40 (15) (2019) 5737–5748.
[23] K.R. Ayyalasomayajula, F. Malmberg, A. Brun, PDNet: semantic segmentation integrated with a primal-dual network for document binarization, Pattern Recognit. Lett. 121 (2019) 52–60.
[24] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241.
[25] J. Wang, L. Perez, The effectiveness of data augmentation in image classification using deep learning, Convolutional Neural Netw. Vis. Recognit. (2017).
[26] Y. Xu, R. Jia, L. Mou, et al., Improved relation classification by deep recurrent neural networks with data augmentation, arXiv preprint arXiv:1601.03651 (2016).
[27] Z. Huang, Y. Zhang, Q. Li, et al., Progressive dual-domain filter for enhancing and denoising optical remote-sensing images, IEEE Geosci. Remote Sens. Lett. 15 (5) (2018) 759–763.
[28] F. Milletari, N. Navab, S.-A. Ahmadi, V-Net: fully convolutional neural networks for volumetric medical image segmentation, 2016 Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 565–571.
[29] J. Patravali, S. Jain, S. Chilamkurthy, 2D-3D fully convolutional neural networks for cardiac MR segmentation, International Workshop on Statistical Atlases and Computational Models of the Heart (2017) 130–139.
[30] F. Jia, C. Shi, K. He, et al., Degraded document image binarization using structural symmetry of strokes, Pattern Recognit. 74 (2018) 225–240.
[31] I. Pratikakis, K. Zagoris, G. Barlas, B. Gatos, ICFHR2016 handwritten document image binarization contest (H-DIBCO 2016), 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2016, pp. 619–623.
[32] I. Pratikakis, K. Zagoris, G. Barlas, B. Gatos, ICDAR2017 competition on document image binarization (DIBCO 2017), 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, IEEE, 2017, pp. 1395–1403.
[33] I. Pratikakis, K. Zagoris, P. Kaddas, B. Gatos, ICFHR 2018 competition on handwritten document image binarization (H-DIBCO 2018), 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018) 489–493.
