Binarization of degraded document images based on hierarchical deep supervised network


Accepted Manuscript

Authors: Quang Nhat Vo, Soo Hyung Kim, Hyung Jeong Yang, Gueesang Lee

PII: S0031-3203(17)30339-4
DOI: 10.1016/j.patcog.2017.08.025
Reference: PR 6261

To appear in: Pattern Recognition

Received date: 17 April 2017
Revised date: 17 August 2017
Accepted date: 23 August 2017

Please cite this article as: Quang Nhat Vo, Soo Hyung Kim, Hyung Jeong Yang, Gueesang Lee, Binarization of Degraded Document Images based on Hierarchical Deep Supervised Network, Pattern Recognition (2017), doi: 10.1016/j.patcog.2017.08.025

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• We propose a supervised binarization method based on Deep Supervised Networks.
• Multi-scale Deep Supervised Networks have not previously been reported for binarization.
• A hierarchical architecture is designed to distinguish text from background noise.
• Different feature levels are handled by the multi-scale architecture.
• The results are considerably better than those of state-of-the-art methods.


Binarization of Degraded Document Images based on Hierarchical Deep Supervised Network

Quang Nhat Vo, Soo Hyung Kim, Hyung Jeong Yang, Gueesang Lee*

Department of Electrical and Computer Engineering, Chonnam National University, 300 Yongbong-dong, Bukgu, Gwangju, 500-757, Korea

Abstract: The binarization of degraded document images is a challenging problem in document analysis. Binarization is a classification process in which each pixel of an image is assigned to one of two classes: foreground text or background. Most existing algorithms are built on low-level features in an unsupervised manner, and their inability to fully exploit input-domain knowledge considerably limits how well they distinguish background noise from the foreground. In this paper, a novel supervised binarization method is proposed, in which a hierarchical Deep Supervised Network (DSN) architecture is learned to predict text pixels at different feature levels. With higher-level features, the network can differentiate text pixels from background noise, so that the severe degradations that occur in document images can be handled. In contrast, foreground maps predicted from lower-level features have higher visual quality in the boundary areas. Compared with those of traditional algorithms, the binary images generated by our architecture have cleaner backgrounds and better-preserved strokes. The proposed approach achieves state-of-the-art results on the widely used DIBCO datasets, demonstrating the robustness of the presented method.

Keywords: document image binarization, convolutional neural network, document analysis

1. Introduction

Historical documents are a valuable part of cultural heritage that needs to be protected and preserved; accordingly, digital archiving is emerging as a practical solution for storing, retrieving, and studying them [1, 2]. Given the massive volume of historical documents and the need for rapid retrieval and understanding, the extraction of document image content from data-storage systems should be automatic. Automatic analysis of historical document images involves the following steps: layout analysis [3], text-line and word segmentation [4], and optical character recognition (OCR). In most document analysis approaches, the preferred image format for these steps is the binary image representation, in which each pixel is labeled as either “text” (1) or “background” (0). The process that converts color or gray-level images into binary images is called binarization. While binarization is quite simple for uniform images, it is rather complicated for degraded document images. Owing to aging, poor storage methods, and inadequate maintenance conditions, historical records suffer severe degradations, including non-uniform intensity, complex backgrounds, and bleed-through. Some difficult cases for document image binarization are shown in Fig. 1. Although many solutions have been presented in the literature, most of them are based on unsupervised approaches and low-level features; therefore, it remains difficult to distinguish between text and non-text components.

* Corresponding authors. Email address: [email protected]

The persistent difficulties in document image binarization motivated the authors to design a new learning model based on the Deep Supervised Network (DSN) [19, 20]. According to previous reports, the DSN structure outperforms regular Convolutional Neural Network (CNN) structures in terms of convergence and test performance [19, 20]. However, the prediction of a single DSN [20] suffers from the loss of low-level information, such as character edges and contours, which makes it insufficient for the binarization problem. We propose a new hierarchical DSN architecture that copes with different levels of information simultaneously. The design of this network architecture follows the coarse-to-fine principle that is commonly applied to extract structures in images and videos. The aim is to locate the foreground components and eliminate the background noise at the coarse level, while the detail of the foreground text is extracted and preserved at the fine level. The hierarchical DSN structure stems from the observation that the maps predicted at the last layers tend to contain less noise, whereas the features learned at the early layers tend to deliver clearer foreground maps. Even though the idea of exploiting both high-level and low-level CNN features has been established for image segmentation in previous works [20, 21], their network structures are not suitable for binarization. These approaches [20, 21] try to predict the output by incorporating both low-level and high-level features in a single-stream network, but low-level features such as edges or contrast are sacrificed in favor of higher-level features such as object area prediction. Therefore, they do not simultaneously satisfy the two primary targets of robust binarization, i.e., a clean background and high visual quality of the foreground components. The design of our network model allows each DSN stream to focus on a separate binarization criterion, and hence the multi-stream structure delivers better results.

Different from traditional binarization approaches, we train DSNs directly from document image regions, using raw pixels as input and the binarization ground truth as the label. Benefiting from the purely supervised training and the hierarchical architecture, the proposed model can learn different levels of text-related features to highlight text regions against a noisy background. The trained networks construct a map that gives, for every pixel, the probability that it belongs to the foreground text. The probability maps predicted at different feature levels are then integrated into one to obtain the final binary image. Compared with other deep-learning-based methods [38, 39], the proposed approach is more efficient in predicting pixel labels. With patch-wise prediction, all pixels in an image patch are classified at the same time, so there is no need to scan a local window pixel by pixel. Lastly, the proposed method surpasses state-of-the-art unsupervised methods as well as recent deep-learning-based methods [21][38-40] on the well-known DIBCO datasets.

The contributions of this study may be summarized as follows. (1) The main contribution is a hierarchical DSN architecture that learns different feature levels from the image data itself to separate foreground from background in degraded document images. The designed hierarchical structure is better suited to the binarization problem than existing deep network models. (2) The proposed method is evaluated on the widely used DIBCO datasets and achieves better results than state-of-the-art models. To the best of our knowledge, this is the first time in the literature that a deep supervised network model has been successfully applied to document image binarization. (3) A set of image and ground-truth patches is collected for training the DSNs to binarize document images. The created dataset may also be used to train other network architectures such as the FCN [21] and the deconvolutional network [22].

The rest of this paper is organized as follows: Section 2 summarizes selected related works; Section 3 describes the proposed DSN-based model; Section 4 reports quantitative and qualitative experimental results on the benchmarks; and, lastly, conclusions are presented in Section 5.


Figure 1: Demonstration of document image binarization and its challenges. The first row shows some of the images of the DIBCO 2011 dataset [14]; the second row shows the corresponding results of Howe’s method [18], which achieved the highest performance on the DIBCO 2011 dataset; and the third row shows the corresponding results of the proposed method.

2. Related works

Numerous document image binarization methods have been proposed in the past. The standard approaches for assigning either the text label or the background label to every pixel in a document image can be categorized as either global or local methods. In global methods, the information used for labeling is extracted from and applied to the entire document image. The best known is Otsu’s method [5], which computes a threshold that minimizes the within-class variance and at the same time maximizes the between-class variance of two pre-assumed classes. The global threshold can also be calculated using other methods such as histogram entropy [6] and the moment-preserving principle [7]; alternatively, clustering-based approaches [8] separate the text from the background by learning unsupervised models from the pixel features. Although these methods can generate acceptable results for document images with simple backgrounds and uniform intensity, they still have problems with degraded document images.

Local methods were proposed to overcome this issue. Their aim is to predict the label of a pixel using information from its neighborhood. Several state-of-the-art local binarization algorithms have been presented. The Niblack algorithm [9] computes a pixel-wise threshold using the mean and the variance of the gray values in a window that is centered at each pixel and covers at least one or two characters of the document image; however, Niblack’s method still produces considerable noise in lightly textured background areas. In Sauvola’s algorithm [10], a hypothesis on the gray levels of the text and the background is employed to design a new threshold that overcomes this problem of Niblack’s method. Gatos et al. [11] developed another approach to improve the quality of the binarization result for degraded document images; here, Sauvola’s method is used to produce a rough binarization result, followed by several post-processing steps to remove noise in the background. Su et al. [31] employed a Markov random field framework to classify uncertain pixels as either document background or foreground text, using the already known foreground and background pixels as the basis. In most local binarization methods, the local window size, an important parameter, is set manually. Pai et al. [12] proposed an adaptive window-size selection method based on the image characteristics. A threshold surface is then constructed in each window region based on the pixel-intensity distribution of that region. In a different approach, Moghaddam and Cheriet [13] introduced a multi-scale binarization framework that can incorporate any local threshold-based binarization method. This method performs binarization at different scales by proposing fast grid-based models to restore weak connections and strokes.
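To make the idea of a pixel-wise local threshold concrete, the following is a minimal sketch of a Niblack-style rule in plain Python. It assumes the commonly quoted form T = m + k·s (local mean m, local standard deviation s); the function name `niblack_mask`, the window size, and the value k = -0.2 are illustrative assumptions and are not taken from this paper.

```python
import math

def niblack_mask(img, window=3, k=-0.2):
    """Return a boolean mask that is True where a pixel is darker than its
    local Niblack-style threshold (assumed to mark text).

    img    : list of rows of gray values (0..255)
    window : odd side length of the square neighborhood (assumed value)
    k      : weight of the local standard deviation (assumed value)
    """
    h, w = len(img), len(img[0])
    r = window // 2
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gather the neighborhood, clipped at the image border.
            vals = [img[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            threshold = mean + k * math.sqrt(var)
            mask[y][x] = img[y][x] < threshold
    return mask

# A bright 5 x 5 page with a single dark pixel in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 50
mask = niblack_mask(page)
```

Because the threshold follows the local statistics, even faint texture in an otherwise empty window can fall below it, which is consistent with the background noise attributed to Niblack’s method in the text.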

Notably, while image binarization is regarded as a classification problem, most of the available solutions treat it as an unsupervised classification problem. A number of existing supervised learning-based approaches attempt to derive useful information from training document images. Chou et al. [32] proposed a learning process for determining the binarization threshold. For each image region, a three-dimensional feature vector is formed from the distribution of the gray values in that region. The support vector machine (SVM) [33] is applied to classify the feature space into four classes corresponding to four threshold values. In another approach [34], a decision function is learned to directly map an n-dimensional feature vector extracted around a pixel into a binary space; here, different feature types, including existing and self-developed ones, are prepared. The key advantages of the supervised learning-based methods are their parameter-free nature and the absence of any need for pre- or post-processing. The training data should, however, be selected carefully.

Recently, deep learning frameworks have been applied to the document image binarization problem. Given an amount of training data, the main approach is to train a deep neural network for the binary classification task in which each pixel is assigned either the foreground or the background label. Different types of deep network structures have been used. Pastor-Pellicer et al. [38] proposed a CNN structure that contains two groups of convolutional layers and a fully connected layer. Using the CNN, each pixel is classified as foreground or background from a sliding window centered at that pixel. This approach has also been applied to the binarization of musical documents [39]. Afzal et al. [40] formulated the binarization of document images as a sequence learning problem. A 2D Long Short-Term Memory (LSTM) network is employed to process a 2D sequence of pixels and classify each pixel as text or background. Although deep neural networks have been successfully applied to different problems [35][41-44], the recent attempts at document image binarization could not surpass the results of state-of-the-art methods on public datasets. Therefore, additional research is required to find a network architecture that is more suitable for the binarization problem.

To identify the current advances in document image binarization, document image binarization contests (DIBCOs) have been held as events of the International Conference on Document Analysis and Recognition (ICDAR) [14, 15]. The degraded document images of the DIBCOs range from grayscale to color and from machine-printed to handwritten, and the ground truth of each image and the evaluation measures are also provided. At DIBCO 2011 [14], Su et al. achieved the second-best result by applying a local threshold on a constructed edge map and estimating the stroke width. Lelore and Bouchara [16] obtained the strongest binarization results by partitioning the foreground pixels of the rough binary image into the following three groups: text, background, and unknown. Based on the models of the text and background clusters, the noisy pixels in the unknown group were detected and removed. Some of the submitted algorithms were improved after the competition. Su et al. [17] extended their method to estimate the parameters adaptively and to address the problems in the evaluation of the local image contrast. Howe [18] added a heuristic criterion for selecting suitable parameter values for each sample. With this parameter-tuning strategy, Howe’s method outperformed the other systems on the DIBCO 2011 test images and attained the second-best position at DIBCO 2013 [15]. The first position at DIBCO 2013 was achieved by Su’s team, who presented a new submission [15].

Importantly, although the methods submitted to the DIBCOs have achieved superior performance compared with the traditional algorithms, significant room for improvement remains. Figure 1, for instance, shows some of the test images that were used at DIBCO 2011. With Howe’s method [18], the binarization results for several test images of the DIBCO 2011 dataset still contain noise and disconnected strokes.


3. Proposed method

In this section, the proposed binarization approach for degraded document images is discussed. The model architecture is mainly based on the DSN. First, the problem is stated and formulated, followed by the details of the proposed DSN structure and a demonstration of how it is used to predict the foreground pixels in degraded document images.

3.1. Problem statement

The target of our work is to create binary images from gray-value document images. Each pixel of a binary image generated by the proposed system is either 0 (black) or 1 (white), where 0 represents the foreground text and 1 represents a background pixel. In other types of document images, the value 0 may represent foreground objects such as figures, stamps, or tables. In this paper, we focus on historical document images that contain only printed or handwritten text. The main problem is the separation of foreground and background, which is a challenging task for historical document images. Document images can suffer from a variety of physical degradations and image-capture conditions, such as bleed-through, weak strokes, foreground-like background clutter, and non-uniform backgrounds. A binarization method should be able to separate the true foreground pixels from the background noise. The authors, however, observed that the low-level, handcrafted features designed in previous methods are insufficiently robust for distinguishing the foreground (in this case, the text components) from the background, especially for background noise that resembles the foreground; furthermore, designing new features or modifying the current features for a new dataset takes time. In addition, most of the algorithms are built on simple assumptions about the test documents and lack knowledge of the content. Different from the previous algorithms, the binarization problem is addressed in this study through a novel supervised framework based on the convolutional neural network (CNN) [35]. The CNN mechanism facilitates the extraction of different feature levels at its layers for predicting the foreground locations.

3.2. Problem formulation

In this work, the binarization of degraded document images is formulated as a dense prediction problem. Suppose we have a training image dataset $S = \{(X_n, Y_n),\ n = 1, \ldots, N\}$, where $X_n = \{x_j^{(n)},\ j = 1, \ldots, |X_n|\}$ represents the nth input document image and $Y_n = \{y_j^{(n)},\ j = 1, \ldots, |X_n|\}$, $y_j^{(n)} \in \{0, 1\}$, represents the corresponding ground-truth binary map for document image $X_n$. The target is to build a prediction model $\Phi^*$ that maximizes the probability function:

$$ \Phi^* = \arg\max \prod_{j=1}^{|X|} P(y_j \mid X; \Phi) \tag{1} $$

The construction of our prediction model $\Phi^*$ is based on the “Holistically-Nested Edge Detection” (HED) work of Xie [20], in which a DSN [19] model is trained for the boundary-detection problem. The DSN is a formulation of deep networks that aims to improve the convergence and test performance of CNN architectures; accordingly, a side layer and its associated classifier are appended to each hidden layer, thereby enhancing the discriminative power of the features learned in the first layers and making the hidden layers transparent to the overall classification. The flowchart of the DSN model is shown in Fig. 2. We have a deep CNN that contains M groups of convolutional layers, whose standard network layer parameters are denoted as $W = \{W^{(1)}, W^{(2)}, \ldots, W^{(M)}\}$. Each group of convolutional layers is followed by a side-output layer that is associated with a classifier, whose weights are denoted as $w^{(m)}$, $m = 1, \ldots, M$. The DSN objective function is defined as follows:

$$ L_{side}(W, w) = \sum_{m=1}^{M} \alpha_m\, l_m\left(W, w^{(m)}\right), \tag{2} $$

where $l_m$ is the image-level loss function on the side output of the mth layer and $\alpha_m$ is the weight of the side-output loss. The loss function $l_m$ is a class-balanced cross-entropy loss, as follows:

$$ l_m\left(W, w^{(m)}\right) = -\beta \sum_{j \in Y_+} \log P\left(y_j = 1 \mid X; W, w^{(m)}\right) - (1 - \beta) \sum_{j \in Y_-} \log P\left(y_j = 0 \mid X; W, w^{(m)}\right), \tag{3} $$

where $Y_+$ and $Y_-$ are the foreground and background label sets, respectively, and $\beta$ is a parameter that balances the loss between the foreground and background classes, with $\beta = |Y_-|/|Y|$ and $1 - \beta = |Y_+|/|Y|$. $P(y_j = 1 \mid X; W, w^{(m)}) = \sigma(a_j^{(m)})$ is computed using the sigmoid function $\sigma$ on the activation value $a_j^{(m)}$ at pixel j. The side outputs are then fused in the weighted-fusion layer F with the loss function, as follows:

$$ L_{fuse}(W, w, h) = Dist(Y, Y_{fuse}), \tag{4} $$

where $Y_{fuse} = \sigma\left(\sum_{m=1}^{M} h_m A_{side}^{(m)}\right)$, $h_m = 1/M$ is the fusion weight, and $A_{side}^{(m)} = \{a_j^{(m)},\ j = 1, \ldots, |Y|\}$ is the activation of the side-output layer m. Dist is the distance between the fused predictions and the binary label map. In the DSN model, Dist is formulated as the cross-entropy loss. The final objective function, which is minimized by standard stochastic gradient descent with back-propagation, is written as follows:

$$ L(W, w, h) = L_{fuse}(W, w, h) + L_{side}(W, w). \tag{5} $$
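The objective in Eqs. (2) to (5) can be illustrated with a small sketch in plain Python that operates on flattened per-pixel activations. This is only an illustration under stated assumptions: the function names are hypothetical, the losses are summed rather than averaged over pixels, Dist is taken as an unweighted cross-entropy, and the clamping constant `EPS` is added purely for numerical safety. It is not the authors' training code.

```python
import math

EPS = 1e-12  # assumed clamp that keeps log() finite for saturated outputs

def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def _log_prob(a, y):
    # log P(y_j = y | ...) for activation a, with P(y_j = 1) = sigmoid(a).
    p = min(max(sigmoid(a), EPS), 1.0 - EPS)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def class_balanced_loss(activations, labels):
    """Eq. (3): labels are 1 for foreground (Y+) and 0 for background (Y-)."""
    beta = (len(labels) - sum(labels)) / len(labels)  # beta = |Y-| / |Y|
    loss = 0.0
    for a, y in zip(activations, labels):
        weight = beta if y == 1 else 1.0 - beta
        loss -= weight * _log_prob(a, y)
    return loss

def fused_activations(side_activations, h):
    """Weighted sum of the side-output activations at every pixel."""
    n_pixels = len(side_activations[0])
    return [sum(h_m * a_m[j] for h_m, a_m in zip(h, side_activations))
            for j in range(n_pixels)]

def fusion_loss(side_activations, labels, h):
    """Eq. (4) with Dist taken as a plain cross-entropy (assumption)."""
    fused = fused_activations(side_activations, h)
    return -sum(_log_prob(a, y) for a, y in zip(fused, labels))

def dsn_objective(side_activations, labels, alpha, h):
    """Eq. (5): fusion loss plus the alpha-weighted side losses of Eq. (2)."""
    l_side = sum(alpha_m * class_balanced_loss(a_m, labels)
                 for alpha_m, a_m in zip(alpha, side_activations))
    return fusion_loss(side_activations, labels, h) + l_side

# Two side outputs over four pixels, one of which is foreground.
labels = [1, 0, 0, 0]
side = [[2.0, -1.0, -3.0, -2.0], [1.5, -2.0, -2.5, 0.5]]
total = dsn_objective(side, labels, alpha=[1.0, 1.0], h=[0.5, 0.5])
```

With β defined as the background fraction, the rare foreground pixels receive the larger weight, which is the purpose of the class balancing in Eq. (3).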

Figure 2: Demonstration of the DSN model for dense prediction. X is the training image and Y is the corresponding ground-truth map. The activation of the side-output layer $A_{side}^{(m)}$ is fed to the fuse layer F and to the loss function at each side layer. The ground-truth map Y is used for computing the loss at each side layer. The weights $W^{(m)}$ and $w^{(m)}$ are updated by the back-propagation algorithm.

Lastly, the proposed prediction model is formulated as follows:

$$ P(y_j = 1 \mid X; \Phi) = \sigma\left(\sum_{m=1}^{M} h_m a_j^{(m)}\right), \tag{6} $$

$$ \Phi = (h, W, w). \tag{7} $$


3.3. Hierarchical DSN architecture for document image binarization

3.3.1 Network architecture

In this research, the design of the proposed DSN architecture is based on two main criteria of a robust binarization method. The first criterion is the ability to distinguish noisy background pixels from the foreground, which aims to produce a clean background region. The second criterion is the ability to preserve the high visual quality and detail of the foreground. The first criterion requires high-level features for separating the background from the foreground. For the second criterion, the target is to improve the discriminative power in the character boundary areas. Since the detailed information of the input image (such as edges and boundaries) is usually lost at the higher feature levels, lower-level features are useful for preserving the fine detail of the foreground. Therefore, integrating different feature levels may lead to better performance.

The developed DSN model for document image binarization comprises a hierarchical structure for learning different levels of text-like features from the document image itself, whereby the text and the background are separated in degraded document images. Specifically, three DSN architectures, containing three, four, and five convolutional-layer groups, respectively, were considered for this study: DSN_C3, DSN_C4, and DSN_C5. The authors observed that the backgrounds of the maps predicted at the last layers contain less noise because of the high-level features constructed there; however, textual detail is lost through the pooling layers. Conversely, the text strokes in the maps predicted at the first layers are clearer, but more background noise is present. A more effective result can therefore be achieved if the outputs of the three DSNs are integrated. Each DSN structure is trained independently using document image patches as input and binary maps as ground truth. The target of the proposed design is to predict foreground maps at three different feature levels. The proposed architecture could use DSN_C2 and DSN_C6, which contain two and six groups of convolutional layers, instead of DSN_C3 and DSN_C5. However, the DSN_C2 structure is too shallow to predict the foreground text correctly, and the DSN_C6 model is too heavy while generating the same result as DSN_C5. The predicted result of each network structure is discussed further in the experiment section. Another possible solution is to train three DSN_C5 networks at larger image-patch scales; however, this increases the time and memory costs of training.

A deep network structure can be considered as a stack of multiple layers of feature extractors. In the network structure, early layers compute basic, low-level features, while higher layers compute more global, more invariant features. Therefore, most CNN structures in the literature include more convolutional layers toward the end. As shown in Fig. 3, each DSN stream is composed of several convolutional-layer groups (represented by C) and side-output layers (represented by S). Each of the two groups C1 and C2 consists of two convolutional layers, two ReLU layers, and one pooling layer. Each of the three groups C3, C4, and C5 consists of three convolutional layers, three ReLU layers, and one pooling layer. The numbers of filters in the C1, C2, C3, C4, and C5 groups are 64 × 2, 128 × 2, 256 × 3, 512 × 3, and 512 × 3, respectively. A filter size of 3 × 3 is used for all convolutional layers. There are two reasons for selecting this filter size:

• The 3 × 3 filter is the conventional kernel size recommended for many deep network structures. According to the Stanford Computer Science course [45], a common rule of thumb for choosing these hyperparameters is that the filter size should be small (3 × 3 or at most 5 × 5), used with a stride of S = 1.

• With a larger filter size, memory consumption builds up very quickly. Since GPU memory is limited, a filter size of 3 × 3 is a better choice.

For this study, a 2 × 2 receptive field, which is the most common setting, is chosen for all pooling layers. Each side-output layer is composed of a convolutional layer with a filter size of 1 × 1, a deconvolutional layer, and a sigmoid layer. The training data for the DSNs are image patches and the corresponding ground-truth binary maps sampled from the collected historical document images. All image patches were converted to grayscale. The three DSNs were trained independently to predict the foreground maps at the different feature levels. For back-propagation, the cross-entropy loss was computed at the side-output layers.
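The layer groups C1 to C5 described above can be tabulated with a short sketch that counts the convolutional weights in each group and tracks how the pooling layers shrink a patch. The helper name `describe_dsn` is hypothetical, and several details are assumptions rather than statements from the paper: a single-channel (grayscale) input, one bias per filter, pooling with stride 2, and an example patch size of 256.

```python
# (number of 3 x 3 conv layers, filters per layer) for groups C1..C5,
# as listed in the text.
GROUPS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def describe_dsn(num_groups, patch_size, in_channels=1):
    """Summarize a DSN stream with `num_groups` groups (3, 4, or 5 here).

    Returns one dict per group with the number of convolution parameters
    and the spatial size after that group's 2 x 2 pooling layer.
    """
    rows = []
    channels, size = in_channels, patch_size
    for idx, (n_conv, n_filters) in enumerate(GROUPS[:num_groups], start=1):
        params = 0
        for _ in range(n_conv):
            # 3 x 3 weights for every input/output channel pair, plus biases.
            params += 3 * 3 * channels * n_filters + n_filters
            channels = n_filters
        size //= 2  # 2 x 2 pooling, stride 2 assumed
        rows.append({"group": "C%d" % idx,
                     "conv_params": params,
                     "out_size": size})
    return rows

for name, depth in (("DSN_C3", 3), ("DSN_C4", 4), ("DSN_C5", 5)):
    summary = describe_dsn(depth, patch_size=256)
    total = sum(row["conv_params"] for row in summary)
    print(name, "conv parameters:", total,
          "coarsest map size:", summary[-1]["out_size"])
```

Each additional group halves the spatial resolution again, which illustrates why the deeper streams lose stroke detail while the shallower ones retain it.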

Theoretically, the DSNs can process a whole image instead of image patches. In practice, however, working with arbitrary-sized inputs is almost impossible because of the limited GPU memory. The input images used for training in previous works [20, 21] are much smaller than the document images in the public binarization datasets [14][15][36]. Training the network with large images requires more GPU memory, since the size of the network layers increases in proportion to the size of the input images. Besides, resizing the original document images to the allowed input size would damage the text in the document image.


Figure 3: Diagram of the proposed DSN-based document binarization model. During training, the image patches and the ground-truth binary maps from the training dataset are input to the three DSNs. During testing, local image patches from the document image are used as the inputs. The foreground maps predicted for each image patch by the three DSNs are then integrated into one by selecting the minimum foreground-map value at each pixel location. Lastly, the binary document image is obtained by thresholding the full foreground map.

3.3.2 Local prediction

Given a grayscale document image, we scan a local window across the entire image region to obtain input image patches, so that testing is compatible with the training phase. The window of size d × d is defined from the ratio of the image width to the image height, r = width/height, as shown in Eq. (8):

$$ d = \begin{cases} \min(\text{width}, \text{height})/2 & \text{if } 0.5 \le r \le 2 \\ \min(\text{width}, \text{height}) & \text{if } r < 0.5 \text{ or } r > 2 \end{cases} \tag{8} $$

After the window size is determined, the slide distance is fixed at d/2 in both directions. In the proposed system, because of the memory limitation, the size of each image window is normalized to the range from 190 to 420.
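The window rule of Eq. (8) and the d/2 sliding step can be sketched as follows. The function names are hypothetical, integer division is assumed for d and for the step, and the extra window placed flush with the right and bottom borders is an assumption about how the image edge is covered; the subsequent resizing of each window to the 190 to 420 range is not shown.

```python
def window_size(width, height):
    """Eq. (8): side length d of the square scanning window."""
    r = width / height
    short_side = min(width, height)
    if 0.5 <= r <= 2:
        return short_side // 2  # integer division is an assumption
    return short_side

def window_origins(width, height):
    """Top-left corners of the d x d windows scanned with a step of d/2."""
    d = window_size(width, height)
    step = max(1, d // 2)

    def starts(length):
        last = length - d
        positions = list(range(0, last + 1, step))
        if positions[-1] != last:
            positions.append(last)  # assumed: keep the border covered
        return positions

    return d, [(x, y) for y in starts(height) for x in starts(width)]

# A roughly square page and a strongly elongated one.
d_a, origins_a = window_origins(1000, 800)
d_b, origins_b = window_origins(3000, 1000)
```

With a step of d/2, every interior pixel is covered by more than one window, which is why the overlapping predictions have to be merged when the full map is reassembled.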

The three foreground maps $Y_{fuse}^{DSN\_C3}$, $Y_{fuse}^{DSN\_C4}$, and $Y_{fuse}^{DSN\_C5}$ are predicted for each image patch using the proposed DSN architecture. Each map carries specific information about the foreground and background. The prediction map of a shallower DSN structure lacks the power to distinguish the foreground from the background but keeps the detail of the foreground text (i.e., smooth and clear character edges). The prediction map of a deeper DSN structure can identify the foreground/background regions by using more global, more invariant features; however, the detail of the foreground is lost through the pooling layers. To integrate the information in the three prediction maps, we apply the minimum function (min). This function suppresses the high probability values of noisy background pixels using the prediction map of the larger-scale networks and, at the same time, suppresses the high probability values of blurred pixels near the character contours using the prediction map of the smaller-scale networks. The final predicted map is composed as follows:

$$ Y^{p} = \min\left(Y_{fuse}^{DSN\_C3},\ Y_{fuse}^{DSN\_C4},\ Y_{fuse}^{DSN\_C5}\right) \tag{9} $$

PT
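The pixel-wise minimum of Eq. (9) can be illustrated with a small NumPy sketch. The three toy maps below are invented values chosen only to show the behaviour described in the text (a clean stroke pixel, a noisy background pixel, and a blurred contour pixel); the function name `fuse_min` is hypothetical.

```python
import numpy as np


def fuse_min(y_c3, y_c4, y_c5):
    """Pixel-wise minimum of the three foreground-probability maps, Eq. (9)."""
    return np.minimum(np.minimum(y_c3, y_c4), y_c5)


# Toy 1x3 maps: [clean stroke pixel, background-noise pixel, blurred contour pixel]
y_c3 = np.array([[0.95, 0.90, 0.30]])  # shallow net: sharp edges, fooled by noise
y_c4 = np.array([[0.93, 0.40, 0.70]])
y_c5 = np.array([[0.92, 0.10, 0.88]])  # deep net: rejects noise, blurs contours

y_p = fuse_min(y_c3, y_c4, y_c5)
```

Only the pixel on which all three networks agree keeps a high foreground probability; the noise pixel is suppressed by the deeper map and the blurred contour pixel by the shallower one.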

Figure 4: F-measure distribution according to the estimations of predicted maps of randomly selected training images with different global-threshold T values.

The complete foreground map of the tested document image is recovered from the processed image patches. The maximum value at each overlap position is selected to overcome the weak responses at the boundary regions. The final step is the generation of the binary image $BW = \{bw_j,\; j = 1, \ldots, |Y^{p}|\}$ from the predicted foreground map via the application of the global threshold T, as follows:

$$
bw_j = \begin{cases} 0 & \text{if } y_j^{p} \ge T \\ 1 & \text{otherwise} \end{cases} \qquad (10)
$$
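The recovery of the full foreground map and the global thresholding can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the patches are assumed to be already resized back to their window size, and treating a value exactly equal to T as foreground is an assumption.

```python
import numpy as np


def stitch_max(patches, origins, height, width):
    """Rebuild the full map; overlapping pixels keep the maximum response."""
    full = np.zeros((height, width), dtype=float)
    for patch, (x, y) in zip(patches, origins):
        h, w = patch.shape
        region = full[y:y + h, x:x + w]          # view into the full map
        np.maximum(region, patch, out=region)    # in-place update of the view
    return full


def binarize(y_p, t=0.84):
    """Global thresholding: 0 marks foreground (text), 1 marks background."""
    return np.where(y_p >= t, 0, 1).astype(np.uint8)


# Two 2x2 patches that overlap in the middle column of a 2x3 image.
patches = [np.full((2, 2), 0.5), np.full((2, 2), 0.9)]
origins = [(0, 0), (1, 0)]
full_map = stitch_max(patches, origins, height=2, width=3)
bw = binarize(full_map)
```

In the overlap column the stronger response (0.9) wins, so the weak border response of the first patch does not erase text detected by the second.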

The optimal T value is analyzed on the predicted maps of the training images. The process is as follows. We randomly select 100 image patches from the training data, and the selection is repeated 100 times. Each time, different T values are applied to generate different versions of the binary images (BWs). The quality of the BWs at each T value is then estimated using the well-known F-measure [14]. Finally, the average F-measure at each T value is computed. As shown in Fig. 4, selecting T = 0.84 gives the highest F-measure value.

3.3.3 Global prediction

The binarization is mainly done by the local prediction. The proposed DSN model can separate text from background noise. However, if a document image contains a large background area, the local windows that are derived for the prediction may not contain any foreground text; in this case, the local prediction sometimes pulls out text-like components from the background. To handle this problem, we perform a global prediction that roughly spots the foreground text region by inputting the full document image to the DSN_C5 network. The DSN_C5 structure is selected because of its higher capability in the classification of text and noise. Since the size of a document image may be large, the image is resized using the ratio 420/max(width, height). After the global predicted map is obtained, the same threshold T = 0.84 is applied to obtain the binary map of the textual regions. To ensure that the map covers the textual components, the candidate map is expanded by morphological dilation. The size of the structuring element is selected as the average size of the foreground components in a local predicted map. Lastly, the global binary map is incorporated into the local prediction to remove the noise in the background. The global-prediction process is demonstrated in Fig. 5 (a), while Fig. 5 (b) shows the effect of combining the global prediction with the local prediction, which reduces the text-like noise in the background.

Figure 5. Global prediction: (a) Global-prediction-generated binary map that depicts the textual-focus locations, and (b) final-binarization result for which the incorporation of the global binary map is compared with the usage of only the local prediction.
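The global-prediction steps (resize ratio, dilation of the thresholded global map, and combination with the local result) can be sketched as follows. The text does not state how the global binary map is incorporated; the logical AND used here is an assumption, as are the function names, the square structuring element of odd size, and the premise that both masks have already been brought to the same resolution. In this sketch `True` marks foreground.

```python
import numpy as np


def resize_ratio(width, height, target=420):
    """Scale factor applied to the full page before the global pass."""
    return target / max(width, height)


def dilate(mask, size):
    """Binary dilation with a size x size square structuring element (odd size)."""
    pad = size // 2
    padded = np.pad(mask.astype(bool), pad, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted copy of the mask covered by the element.
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def apply_global_mask(local_fg, global_fg, size):
    """Keep local foreground only inside the dilated global text region."""
    return local_fg.astype(bool) & dilate(global_fg, size)


global_fg = np.zeros((5, 5), dtype=bool)
global_fg[2, 2] = True                       # rough text location from the global pass
local_fg = np.zeros((5, 5), dtype=bool)
local_fg[2, 2] = local_fg[1, 1] = True       # text pixels
local_fg[0, 4] = True                        # isolated text-like noise
cleaned = apply_global_mask(local_fg, global_fg, size=3)
```

The isolated pixel far from any globally detected text is removed, while the pixels inside the dilated region survive.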

4. Experiments

In this section, the experimental results regarding the evaluation of the proposed document image binarization approach are presented. First, the training data and the test data that were constructed from public benchmark datasets are introduced. Then, the DIBCO evaluation metrics are introduced, followed by the implementation details of the proposed model. Lastly, the results of our approach and comparisons with state-of-the-art binarization algorithms are shown. Trained models, datasets, and source code are publicly available at: https://github.com/vqnhat/DSN-Binarization/.

4.1. Datasets and evaluation metrics

The test images and training data were harvested from public document binarization datasets. A total of ten datasets were used for this work.

• DIBCO datasets: Three competition datasets (DIBCO 2009 [23], DIBCO 2011 [14], and DIBCO 2013 [15]) were utilized. A total of 21 degraded handwritten documents and 21 deteriorated printed documents were used.

• Handwritten-Document-Image Binarization Contest (H-DIBCO) datasets: From the H-DIBCO 2010 [24], H-DIBCO 2012 [25], H-DIBCO 2014 [26], and H-DIBCO 2016 [36], a total of 44 images were collected. These datasets were solely used for the evaluation of the binarization methods regarding handwritten-document images.

• Bickley-diary dataset [27]: This dataset contains documents that were written 100 years ago. Among the 94 images, six of the labeled images were selected.

• Persian Heritage Image Binarization dataset (PHIBD) [28]: This dataset provides 15 historical-manuscript images that were collected from the historical records of Mirza Mohammad Kazemaini.

• Synchromedia Multispectral dataset (S-MS) [29]: The document images in this dataset were captured through the simultaneous use of ultraviolet, infrared, and visible light in different spectral bands. From both the training set and the test set, 50 images were gathered.

To evaluate the performance of the proposed method, four well-known datasets, DIBCO 2011, DIBCO 2013, H-DIBCO 2014, and H-DIBCO 2016, were tested. The results were quantitatively compared with those of the state-of-the-art algorithms regarding the F-measure, the pseudo F-measure (Fps), the distance reciprocal distortion metric (DRD), and the peak signal-to-noise ratio (PSNR), which were adopted from the contests [14, 15][23–26].
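Two of these metrics are simple enough to sketch directly. The snippet below is a minimal illustration, not the official contest evaluation tool: it computes the F-measure as the harmonic mean of precision and recall over foreground pixels, and the PSNR for binary images under the assumption that the two gray levels differ by C = 1. The function names are hypothetical, and Fps and DRD are not covered.

```python
import math

import numpy as np


def f_measure(pred_fg, gt_fg):
    """F-measure (%) from foreground masks (True or 1 = text pixel)."""
    pred_fg = np.asarray(pred_fg, dtype=bool)
    gt_fg = np.asarray(gt_fg, dtype=bool)
    tp = np.logical_and(pred_fg, gt_fg).sum()
    fp = np.logical_and(pred_fg, ~gt_fg).sum()
    fn = np.logical_and(~pred_fg, gt_fg).sum()
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(100.0 * 2 * precision * recall / (precision + recall))


def psnr(pred_fg, gt_fg):
    """PSNR (dB) for binary images whose two levels differ by C = 1."""
    mse = float(np.mean(np.asarray(pred_fg, dtype=bool) != np.asarray(gt_fg, dtype=bool)))
    return math.inf if mse == 0 else 10.0 * math.log10(1.0 / mse)


gt = [[1, 1, 0, 0]]
pred = [[1, 0, 1, 0]]          # one hit, one miss, one false alarm
score_f = f_measure(pred, gt)
score_psnr = psnr(pred, gt)
```

The same `f_measure` helper could be evaluated over a range of T values on training patches to reproduce the kind of threshold analysis summarized in Fig. 4.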

4.2. Implementation details

Data preparation. Four public datasets, DIBCO 2011, DIBCO 2013, H-DIBCO 2014, and H-DIBCO 2016, were selected for the evaluation. For the testing of each dataset, the remaining eight datasets were used for the creation of the training image patches. The patch sizes are decided by the width/height ratio of the processed document image, as shown in Eq. (8). An augmentation was also performed, in which the patches and the binary maps were rotated by 90, 180, or 270 degrees and resized with the scale factors 0.75 and 1.25. The sizes of all of the image patches were normalized to the range from 190 to 420. Overall, approximately 84,000 training image patches were created in each testing case. Figure 6 displays image-patch samples from the training data.


Figure 6: Samples of the generated training image patches and ground-truth binary maps
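The augmentation described in the data-preparation paragraph can be sketched with NumPy. This is a hedged illustration: the interpolation method is not specified in the text, so a nearest-neighbour resize is used as a stand-in, and applying rotation and rescaling as separate (not combined) transforms is an assumption. The function names are hypothetical.

```python
import numpy as np


def resize_nearest(img, scale):
    """Nearest-neighbour resize (stand-in for the unspecified interpolation)."""
    h, w = img.shape[:2]
    new_h = max(int(round(h * scale)), 1)
    new_w = max(int(round(w * scale)), 1)
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]


def augment(patch, gt):
    """Yield the original pair plus rotated and rescaled copies."""
    yield patch, gt
    for k in (1, 2, 3):                 # 90, 180, and 270 degrees
        yield np.rot90(patch, k), np.rot90(gt, k)
    for scale in (0.75, 1.25):
        yield resize_nearest(patch, scale), resize_nearest(gt, scale)


patch = np.arange(16).reshape(4, 4)
gt = patch > 7                          # toy ground-truth map
pairs = list(augment(patch, gt))
```

Each image patch and its ground-truth map go through the same transform, so the pixel-level correspondence between them is preserved.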

DSN parameters and setting. The proposed DSNs were trained on the created image patches. As stated previously, the three DSNs were learned independently. During the DSN-learning process, the number of training steps was set to 320,000, and one mini-batch was trained per step. The base learning rate was set to $10^{-7}$ with the "step" learning-rate policy (stepsize = 80000 and gamma = 0.1). A momentum of 0.9 and a weight decay of 0.0002 were used. The convolutional layers at the side-output layers were updated with a smaller learning rate through the application of learning-rate multipliers of 0.01 for the filter weights and 0.02 for the bias terms. Our network is trained and tested with the modification of the Caffe library for the DSN architecture that was implemented by Xie and Tu [20]. The system runs on a PC platform with a 3.2 GHz 2-core i5 CPU, 12 GB of memory, and a single NVIDIA GTX 1070.

To demonstrate the effectiveness of the DSNs over other CNN architectures, the created training data was also used to train a fully convolutional network (FCN_8s) [21]. To produce the predicted maps, the softmax layer of FCN_8s was replaced with a logistic-regression layer. The base learning rate here is $10^{-12}$, while the iteration number and the learning-rate policy are the same as those of the proposed system.

Transfer learning. Similar to Xie and Tu [20], the parameters of the proposed DSNs were initialized from a pre-trained, trimmed VGGNet, which consists of five convolutional-layer groups, to overcome the problem of a lack of training images. For DSN_C3 and DSN_C4, only the parameters of the first three and four groups of the VGGNet layers, respectively, are copied.
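The "step" learning-rate policy mentioned in the DSN parameter settings multiplies the base rate by gamma once every stepsize iterations. A minimal sketch with the stated values follows; the function name is hypothetical, and reading the base learning rate as 1e-7 is an interpretation of the manuscript.

```python
def step_lr(iteration, base_lr=1e-7, gamma=0.1, stepsize=80000):
    """Learning rate under a "step" policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


# Rate at the start of each 80,000-step stage of the 320,000-step schedule.
schedule = [(it, step_lr(it)) for it in (0, 80000, 160000, 240000)]
```

With these settings the rate drops by a factor of ten three times during training, ending near 1e-10 for the final 80,000 steps.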

4.3. Results

In this section, the proposed method is quantitatively and qualitatively evaluated on the DIBCO 2011, DIBCO 2013, H-DIBCO 2014, and H-DIBCO 2016 datasets. The proposed method is compared with state-of-the-art algorithms, including Bernsen’s method (BERN) [30], Niblack’s method [9], Sauvola’s method [10], Gatos et al.’s method [11], Otsu’s method [12], Su’s method [17], Howe’s method [18], a CNN-based method (CNN) [38], an LSTM-based method [40], and the top-three methods from the four contests. The results both with and without the global prediction are shown. In addition to the result of the proposed hierarchical architecture (hierarchical DSN), the performances of the individual DSNs (DSN_C3, DSN_C4, and DSN_C5) are presented along with the FCN_8s results. The same threshold of T = 0.84 is applied to the predicted maps of these networks for the final binarization. Figures 7 to 14 display the visual quality of the binary results, and Tables 1 to 4 report the quantitative evaluation of the proposed method and the compared algorithms. The results show that the proposed deep-network structure differentiates the text from the background more effectively than the other methods and that the hierarchical structure performs better than the single DSNs. In general, the foreground maps produced by FCN_8s are unsatisfactory, as they still contain considerable noise and corrupted text components.

Table 6 presents the average processing time of our system in the binarization of a document image. By using the GPU, the running time required for the binarization can be reduced significantly. The DSN_C5 architecture takes almost the same time as Su’s method [17]. Although the hierarchical DSN architecture spends more time in the processing, it is still faster than Howe’s method [18], which employs the graph-cut algorithm. Regarding the training time, around 96 hours are required to train a DSN_C5 architecture on our system. However, the three DSN architectures can be trained independently on different GPUs, and the training time can be shortened with a stronger computer system.

4.3.1. DIBCO 2011 dataset

The quantitative results on this dataset are shown in Table 1. The proposed method performs the best regarding all four of the measurements. The Fps and DRD of the proposed method are significantly better than those of Howe’s method [18], which indicates the robustness of the proposed model in terms of the preservation of the text-stroke contours and the elimination of the background noise. Figures 7 and 8 further demonstrate the binary results of two example images (HW1 and PR6). As can be seen, the other algorithms fail to produce clean binary images. The proposed model, however, can efficiently remove the shadow and background clutter through the use of high-level features to achieve binary results with a higher visual quality. In the case of the sample image PR6, the background result of Su’s method is clean; however, the stroke maintenance of the proposed method is more effective.

Figure 7. Binarization results of the sample document image (HW1) on the DIBCO 2011 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Figure 8: Binarization results of the sample document image (PR6) on the DIBCO 2011 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Table 1. Comparison with the binarization algorithms, followed by the scores, on the DIBCO 2011 image set.

| Methods | F-measure | Fps | PSNR | DRD |
|---|---|---|---|---|
| Entry 11 [14] | 88.7 | - | 17.8 | 8.67 |
| Entry 10 [14] | 80.9 | - | 16.1 | 64.42 |
| Entry 8 [14] | 85.2 | - | 17.2 | 9.07 |
| BERN [30] | 38.5 | 37.7 | 6.4 | 114.5 |
| Niblack [9] | 72.0 | 71.4 | 12.6 | 21.9 |
| Sauvola [10] | 71.3 | 69.6 | 12.5 | 24.1 |
| Gatos [11] | 84.3 | 84.3 | 16.3 | 6.08 |
| Otsu [5] | 82.1 | 84.8 | 15.7 | 9.0 |
| Su [17] | 87.8 | 90.0 | 17.7 | 4.7 |
| Howe [18] | 91.7 | 92.0 | 19.3 | 3.48 |
| FCN_8s | 71.0 | 67.7 | 12.6 | 13.5 |
| DSN_C3 | 91.3 | 93.9 | 18.9 | 3.3 |
| DSN_C4 | 92.5 | 94.2 | 19.0 | 2.6 |
| DSN_C5 | 92.7 | 93.4 | 19.3 | 2.3 |
| Hierarchical DSN | 93.2 | 96.2 | 20.0 | 2.1 |
| Hierarchical DSN with the global prediction | 93.3 | 96.4 | 20.1 | 2.0 |

4.3.2. DIBCO 2013 dataset

This dataset consists of challenging images. Figures 10 and 11 present the binary results of two example images (HW8 and PR6) that contain text-like background noise. In the sample image HW8, Su’s method and Howe’s method separated the text from the background successfully; however, they fail on the sample image PR6 (Fig. 11), since only a slight difference is present between the textual intensity and the text-like noise. The remaining methods also fail to clean the background. Figure 9 presents an image of a word with a thin stroke (HW7). The other methods either corrupt the strokes or introduce noise into the binary image. The proposed approach delivers the best visual quality on all three of these samples. The quantitative results of all of the algorithms are shown in Table 2. The results of the proposed method are better than those of the top-three methods from DIBCO 2013. The application of the global prediction significantly reduces the background noise and improves the F-measure and Fps scores. As shown by the low DRD score, the proposed method is also superior to the others in terms of the visual distortion. It should be noted that Entry 3 [15], which was submitted to DIBCO 2013, produces a slightly higher performance than Howe’s method [18] even though they are based on the same algorithm. Compared with the deep-learning-based approaches (CNN [38], LSTM [40], and FCN), the hierarchical DSN shows the highest F-measure value. Even the single-scale DSN structures (DSN_C3, DSN_C4, and DSN_C5) show a better F-measure than CNN [38], LSTM [40], and FCN.

Figure 9. Binarization results of the sample document image (HW7) on the DIBCO 2013 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Figure 10. Binarization results of the sample document image (HW5) on the DIBCO 2013 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Figure 11. Binarization results of the sample document image (PR6) on the DIBCO 2013 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Table 2. Comparison with the binarization algorithms, followed by the scores, on the DIBCO 2013 image set.

| Methods | F-measure | Fps | PSNR | DRD |
|---|---|---|---|---|
| Entry 15b [15] | 92.1 | 94.2 | 20.7 | 3.1 |
| Entry 3 [15] | 92.7 | 93.2 | 21.3 | 3.2 |
| Entry 5 [15] | 91.8 | 92.7 | 20.7 | 4.0 |
| BERN [30] | 52.6 | 52.8 | 10.1 | 62.2 |
| Niblack [9] | 72.8 | 72.2 | 13.6 | 24.9 |
| Sauvola [10] | 85.0 | 89.8 | 16.9 | 7.6 |
| Gatos [11] | 83.4 | 87.0 | 17.1 | 9.5 |
| Otsu [5] | 83.9 | 86.5 | 16.6 | 11.0 |
| Su [17] | 87.7 | 88.3 | 19.6 | 4.2 |
| Howe [18] | 91.3 | 91.7 | 21.3 | 3.2 |
| CNN [38] | 87.74 | - | 18.91 | - |
| LSTM [40] | 87.9 | - | - | - |
| FCN_8s | 68.0 | 63.7 | 12.5 | 21.0 |
| DSN_C3 | 91.3 | 92.9 | 20.0 | 4.2 |
| DSN_C4 | 92.3 | 92.9 | 20.3 | 3.3 |
| DSN_C5 | 92.3 | 92.0 | 19.9 | 2.8 |
| Hierarchical DSN | 93.7 | 95.7 | 20.9 | 2.1 |
| Hierarchical DSN with the global prediction | 94.4 | 96.0 | 21.4 | 1.8 |

4.3.3. H-DIBCO 2014 dataset

The test images of this dataset are historical handwritten documents. Compared with DIBCO 2011 and DIBCO 2013, this dataset is easier to binarize because the background noise is weaker than the foreground text. Table 3 shows that the result of the proposed hierarchical DSN architecture with the global prediction is as sound as that of the winner of the competition. Among the four evaluation metrics, the proposed method surpasses Entry 6 [26] in terms of the PSNR and the DRD, implying a less-distorted visual quality. The three individual DSNs and Howe’s method also deliver higher scores than the third-place method. Regarding the qualitative evaluation, as demonstrated in Figs. 12 and 13, Su’s method and Howe’s method are sound for the removal of noise; however, the binary images of Su’s method are slightly over-binarized at the weak textual components. The other threshold-based algorithms still leave some small noise in the background. Similar to the DIBCO 2011 results, the difference between the use and the non-use of the global prediction in the proposed method is not major, because the document text covers most of the image area and the test images have less noise than the ones in the DIBCO 2013 dataset.

Figure 12. Binarization results of the sample document image (H06) on the H-DIBCO 2014 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Figure 13. Binarization results of the sample document image (H09) on the H-DIBCO 2014 dataset produced by different methods: (a) Original image, (b) the ground truth, (c) BERN [30], (d) Niblack [9], (e) Sauvola [10], (f) Gatos [11], (g) Otsu [5], (h) Su [17], (i) Howe [18], (j) FCN_8s [21], (k) proposed method (hierarchical DSN), and (l) proposed method with the global prediction.

Table 3. Comparison with the binarization algorithms, followed by the scores, on the H-DIBCO 2014 image set.

| Methods | F-measure | Fps | PSNR | DRD |
|---|---|---|---|---|
| Entry 6 [26] | 96.88 | 97.65 | 22.66 | 0.902 |
| Entry 2 [26] | 96.63 | 97.46 | 22.40 | 1.001 |
| Entry 5 [26] | 93.35 | 96.05 | 19.45 | 2.194 |
| DSN_C3 | 96.55 | 97.83 | 22.02 | 1.00 |
| DSN_C4 | 96.17 | 96.71 | 21.54 | 1.07 |
| DSN_C5 | 95.78 | 95.87 | 21.12 | 1.17 |
| BERN [30] | 67.09 | 66.09 | 12.68 | 16.91 |
| Niblack [9] | 87.19 | 87.05 | 16.55 | 5.77 |
| Sauvola [10] | 86.83 | 91.80 | 17.63 | 4.896 |
| Gatos [11] | 91.48 | 95.48 | 18.53 | 2.73 |
| Otsu [5] | 91.78 | 95.74 | 18.72 | 2.647 |
| Su [17] | 94.38 | 95.94 | 20.31 | 1.95 |
| Howe [18] | 96.49 | 97.30 | 22.24 | 1.08 |
| FCN_8s | 68.47 | 63.19 | 12.20 | 14.92 |
| Hierarchical DSN | 96.61 | 98.09 | 22.12 | 0.95 |
| Hierarchical DSN with the global prediction | 96.66 | 97.59 | 23.23 | 0.79 |

4.3.4. H-DIBCO 2016 dataset

The next competition on handwritten document image binarization is H-DIBCO 2016. Table 4 shows the evaluation scores of the participants’ algorithms. Because the image samples used in this contest are more challenging than the images in the H-DIBCO 2014 dataset, the results are not as good as those on H-DIBCO 2014, and there is still room for improvement. The binarized images generated by our system are displayed in Fig. 14. Compared with the top three winners, our method delivers better results in terms of all evaluation metrics. Both the hierarchical DSNs and DSN_C5 generate better results than the participants’ methods. The results in F-measure and Fps demonstrate the effectiveness of the hierarchical DSN architecture and the global prediction in foreground extraction and background noise removal.

Table 4. Comparison with binarization algorithms followed by scores on the H-DIBCO 2016 image set.

| Methods | F-measure | Fps | PSNR | DRD |
|---|---|---|---|---|
| Entry 2 [36] | 87.61 | 91.28 | 18.11 | 5.21 |
| Entry 3-3 [36] | 88.72 | 91.84 | 18.45 | 3.86 |
| Entry 3-2 [36] | 88.47 | 91.71 | 18.29 | 3.93 |
| Sauvola [10] | 82.52 | 86.85 | 16.42 | 7.49 |
| Otsu [5] | 86.61 | 88.67 | 17.80 | 5.56 |
| DSN_C3 | 88.84 | 92.06 | 18.31 | 4.48 |
| DSN_C4 | 89.56 | 91.65 | 18.41 | 4.24 |
| DSN_C5 | 90.01 | 92.29 | 19.00 | 3.34 |
| Hierarchical DSN | 90.08 | 93.54 | 18.99 | 3.60 |
| Hierarchical DSN with global prediction | 90.10 | 93.57 | 19.01 | 3.58 |

Figure 14: Binarization results of the sample document images on the H-DIBCO 2016 dataset

4.3.5. Network-structure analysis

Hierarchical DSN. The motivation underlying the design of the proposed network structure is further discussed here. The influences of the DSN structures on the final performance are analyzed as well. The recall and the precision, as shown in Table 5, are computed on the DIBCO 2011 and DIBCO 2013 datasets for the single DSNs. As shown in Fig. 15 (b), because the characters become more "blurry" in the predicted maps of DSN_C4 and DSN_C5, the text strokes in the binary images become dilated; consequently, DSN_C5 has a higher recall value and a lower precision value. The text-stroke detail decreases because of the reduction of the feature maps in the pooling layers; however, as demonstrated in Fig. 15 (a), the structures of DSN_C5 and DSN_C4 are more powerful than that of DSN_C3 in terms of noise removal because higher feature levels are extracted for the classification of the text and the background. The higher precision value of DSN_C4 compared with DSN_C3 is also explained by this observation. Lastly, the output of the combined network structures can preserve the text strokes and suppress the background noise. According to Table 5, the precision value of the hierarchical architecture is higher, and the balance between the precision and the recall is superior. Another observation is that DSN_C3 delivers the best performance among the three network structures on the H-DIBCO 2014 dataset, as shown in Table 3. This result is due to the test images containing less background clutter or only weak noise; when the background noise does not appear in the binary images, DSN_C3 attains a high precision score.

Figure 15. Sample images for the network-structure analysis: (a) Binary images of the sample document image (HW7) on the DIBCO 2011 dataset that were generated by the proposed DSN, where DSN_C4 and DSN_C5 are superior to DSN_C3 in terms of noise removal; and (b) predicted maps of the proposed DSN and the extracted binary maps in an image region, where the character strokes become thicker with DSN_C4 and DSN_C5.

Table 5. Performances of the proposed DSN on the DIBCO 2011 and DIBCO 2013 datasets in terms of recall and precision.

| Methods | Recall | Precision |
|---|---|---|
| DSN_C3 | 92.79 | 90.48 |
| DSN_C4 | 94.89 | 90.77 |
| DSN_C5 | 95.86 | 89.73 |
| Hierarchical DSN | 92.49 | 94.83 |

Table 6. The average processing time per image on the DIBCO 2011 dataset.

| Methods | Running time (seconds) |
|---|---|
| Howe [18] | 9.14 |
| Su [17] | 1.94 |
| DSN_C5 | 1.90 |
| Hierarchical DSN | 4.04 |

Architecture alternatives. To justify the selection of the three DSN structures (DSN_C3, DSN_C4, and DSN_C5), we investigate the performance of different settings of DSN structures and their effects in the hierarchical architecture. Two DSN structures, DSN_C2 and DSN_C6, which contain two and six groups of convolutional layers, respectively, are tested on the DIBCO 2013 dataset. The effect of these two network structures in the hierarchical architecture is shown by two hierarchical DSN architectures, DSN_C2_C4_C5 and DSN_C3_C4_C6. Table 7 summarizes the evaluation results of each network architecture. The performance of DSN_C2 on DIBCO 2013 is worse than that of DSN_C3. The explanation is that DSN_C2 is too shallow to predict the foreground text correctly. As presented in Fig. 15, the binarization result of DSN_C2 misses some character strokes. The application of DSN_C2 to the hierarchical architecture reduces the performance, as seen for the DSN_C2_C4_C5 architecture. In the case of DSN_C6, its effect in the hierarchical architecture is the same as that of DSN_C5. Therefore, DSN_C5 is the better choice in terms of saving memory and processing time.

Another setting for the hierarchical DSN architecture is to use two streams of DSNs. Three combinations, DSN_C3_C4, DSN_C4_C5, and DSN_C3_C5, are tested. In general, these implementations deliver a lower performance than DSN_C3_C4_C5. The results of DSN_C3_C5 and DSN_C4_C5 also show that DSN_C5 is the most important component in the hierarchical architecture, since its main role is to remove the noise in the background areas.

In other CNN-based applications, the number of convolutional layers in each convolutional-layer group usually increases in later groups. The target of this design strategy is to extract more global, more invariant features in higher layers for the classification. To investigate the effect of the number of convolutional layers on the prediction, we train another DSN architecture that contains five groups of convolutional layers. The difference from DSN_C5 is that the last three groups consist of only two convolutional layers each. We name this structure DSN_C5_2. The evaluation result of DSN_C5_2 and its contribution to the hierarchical architecture (DSN_C3_C4_C5_2) are presented in Table 7. The comparison between DSN_C5 and DSN_C5_2 shows that DSN_C5 is more robust in the preservation of the foreground information and the removal of background noise.

Figure 15. Result of DSN_C2 and DSN_C3 on an image region of the DIBCO 2013 dataset.

Table 7. Performances of each network architecture on the DIBCO 2013 dataset.

| Methods | F-measure | Fps | PSNR | DRD |
|---|---|---|---|---|
| DSN_C2 | 87.5 | 88.9 | 18.6 | 8.0 |
| DSN_C3 | 91.3 | 92.9 | 20.0 | 4.2 |
| DSN_C4 | 92.3 | 92.9 | 20.3 | 3.3 |
| DSN_C5 | 92.3 | 92.0 | 19.9 | 2.8 |
| DSN_C5_2 | 90.1 | 90.0 | 19.2 | 4.8 |
| DSN_C6 | 91.0 | 90.7 | 19.4 | 3.9 |
| DSN_C3_C4 | 93.1 | 94.8 | 20.9 | 2.9 |
| DSN_C4_C5 | 93.6 | 94.3 | 20.9 | 2.4 |
| DSN_C3_C5 | 93.5 | 95.1 | 20.9 | 2.4 |
| DSN_C3_C4_C5_2 | 93.5 | 94.3 | 21.0 | 2.6 |
| DSN_C2_C4_C5 | 93.3 | 95.5 | 20.8 | 2.4 |
| DSN_C3_C4_C6 | 93.7 | 95.7 | 20.9 | 2.1 |
| DSN_C3_C4_C5 | 93.7 | 95.7 | 20.9 | 2.1 |

Min function and global threshold. In the hierarchical DSN architecture, the prediction maps are integrated by the minimum function. Another choice that could be selected is a linear function, as presented in Eq. (11). The evaluation of the linear function with different weight settings is shown in Table 8. We can see that different F-measure values are obtained at different thresholds T. Since the aim of the minimum function is to suppress the high probability values of noise, its better results are achieved at the lower T values. The linear function generates better results at higher T values. However, according to Table 8, the highest F-measure is obtained by the minimum function at T = 0.80. Another observation is that T = 0.84 is not the optimal threshold value of our system for DIBCO 2011 and DIBCO 2013, because T is estimated on the training data, not on the test data.

$$
Y^{p} = \alpha\, Y_{fuse}^{DSN\_C3} + \beta\, Y_{fuse}^{DSN\_C4} + \gamma\, Y_{fuse}^{DSN\_C5}, \qquad \alpha + \beta + \gamma = 1 \qquad (11)
$$

Table 8. Performances of each integration function on the DIBCO 2011 and DIBCO 2013 datasets.

=1/3 =0.5 =0.2 =0.5 =0.2 =0.3 =0.3 =0.4 =0.4 =0.2

F-measure T=0.84 T=0.86 93.23 93.54 93.53 93.49 93.49 93.42 93.41 93.46 93.51 93.44 93.48 93.50 93.49 93.48 93.44 93.46 93.50 93.45 93.50 93.47 93.47 93.49

T=0.82 93.70 93.41 93.41 93.23 93.43 93.34 93.37 93.29 93.42 93.40 93.32


=1/3 =0.2 =0.5 =0.3 =0.3 =0.2 =0.5 =0.2 =0.4 =0.4

Min =1/3 =0.3 =0.3 =0.2 =0.5 =0.5 =0.2 =0.4 =0.2 =0.

T=0.80 93.75 93.21 93.23 92.99 93.25 93.11 93.16 93.04 93.23 93.19 93.08

T=0.90 92.24 93.17 92.76 93.07 92.80 93.27 93.14 93.25 93.03 93.13 93.27


T=0.88 92.75 93.32 93.20 93.37 93.24 93.37 93.31 93.33 93.25 93.29 93.37


Functions
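The difference between the two integration functions can be illustrated on a single pixel. The sketch below contrasts a weighted linear combination whose weights sum to one with the pixel-wise minimum; the probability values are invented for illustration, and the function and parameter names are hypothetical.

```python
import numpy as np


def fuse_linear(y_c3, y_c4, y_c5, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three maps; the weights must sum to 1."""
    w3, w4, w5 = weights
    if abs(w3 + w4 + w5 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w3 * y_c3 + w4 * y_c4 + w5 * y_c5


def fuse_min(y_c3, y_c4, y_c5):
    """Pixel-wise minimum of the three maps."""
    return np.minimum(np.minimum(y_c3, y_c4), y_c5)


# A text-like noise pixel: the two shallower nets are fooled, the deepest is not.
y_c3, y_c4, y_c5 = np.array([0.99]), np.array([0.95]), np.array([0.60])
T = 0.84

linear_value = fuse_linear(y_c3, y_c4, y_c5)[0]   # about 0.847: passes the threshold
min_value = fuse_min(y_c3, y_c4, y_c5)[0]         # 0.60: rejected as background
```

Because the minimum is never larger than any weighted average of the same values, it tends to favour lower thresholds, which is in line with the trend discussed for Table 8.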

Figure 16. Binarization results of the proposed method compared with other methods on a Korean historical document image: (a) input image, (b) Sauvola method, (c) Howe method, (d) proposed method without data augmentation, and (e) proposed method with data augmentation.

Influence of the training data and ground truth. To study this influence, we create another training dataset that does not contain the rotated image patches. The new dataset consists of 20,000 training image patches, and the proposed DSN structures are trained again with the same training settings. The evaluation is performed on DIBCO 2013, and the results are shown in Table 9. Without the data augmentation, the performance becomes lower. In another experiment, the proposed DSN architecture is tested on other types of documents, such as Korean and Chinese historical documents. Figures 17 and 18 demonstrate the qualitative results of our method on two historical document images. The results of the Sauvola and Howe methods still contain false-positive errors in some regions. Our method performs better in the removal of background noise. As we can see, learning with the data augmentation tends to extract the foreground information better than learning without it. The reason is that the reduced variation in the training samples leads to overfitting of the learned network models.

Table 9. Performances of the proposed DSN architecture on DIBCO 2013 dataset with and without using the data augmentation. F-measure 92.1 93.7

Fps 95.1 95.7

PSNR 20.1 20.9

DRD 2.7 2.1

(b)

(c)

CE

PT

(a)

ED

M

AN US

Methods Without data augmentation With data augmentation

AC

Figure 17. Binarization results of the proposed method compared with other methods on a Chinese historical document image. (a) Input image, (b) Sauvola method, (c) Howe method, (d) proposed method without data augmentation, and (e) proposed method with data augmentation.

Failure analysis. There is still room for improvement in the performance of our model. The first weakness is that a few dominant background noises, especially those far from the foreground areas, remain even after global prediction is applied. This happens because such noise resembles the foreground characters and because no foreground is present in the local area, which together cause the network to predict wrongly. Another weakness is that some thin or weak strokes are missing from the output. This issue appears mainly in handwritten documents where the ink has faded over time. During feature extraction, weak strokes are lost when they are processed by the ReLU and pooling layers. For such cases, the discriminative power of the model should be enhanced. For thin or blurred strokes, another network topology might be considered to limit the loss of weak information in the final prediction.
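The loss of weak strokes in ReLU and pooling layers can be illustrated with a toy example. The sketch below is not the network described in this paper; the response values are hypothetical and chosen only to show the mechanism. A slightly negative response from a faded stroke is zeroed by ReLU, and a weak positive response is overridden by a stronger neighbouring noise response during 2x2 max pooling.

```python
def relu(feature_map):
    """Zero out negative responses, keep positive ones unchanged."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (height and width assumed even)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical filter responses on a 4x4 patch:
#   column 0: strong stroke (0.9)
#   column 2: faded stroke (0.2, or slightly negative at -0.1)
#   column 3: background stain in the top two rows (0.6)
responses = [
    [0.9, 0.0, -0.1, 0.6],
    [0.9, 0.0, 0.2, 0.6],
    [0.9, 0.0, 0.2, 0.0],
    [0.9, 0.0, -0.1, 0.0],
]

activated = relu(responses)       # the -0.1 stroke responses become 0.0
pooled = max_pool_2x2(activated)  # each 2x2 window keeps only its maximum
print(pooled)
```

In the top-right window the stain response replaces the faded stroke entirely, while in the bottom-right window the stroke survives only as a single weak value at half the resolution.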

Figure 18. Failure cases. (a) Remaining background noise. (b) Missing thin or blurred strokes.

5. Conclusions and future works

In this paper, a novel supervised binarization framework based on a hierarchical DSN architecture is proposed. The strong performance of our method comes mainly from the hierarchical architecture of the Deep Supervised Network, which incorporates side layers to improve training convergence. In general, training a deeper network structure requires more training data, whereas a DSN-type network can be trained much faster. The results of the hierarchical DSN model learned from the created training samples show that it differentiates text from background noise efficiently. The evaluation with different measures on the three public datasets shows that the proposed method outperforms the state-of-the-art binarization algorithms. The hierarchical architecture helps the proposed method preserve text strokes more effectively and provide excellent visual quality.

For future work, the model can be extended to address the remaining issues of dominant background noise and missing thin or weak strokes. A new network structure could be developed to handle the weak information, or training could be reinforced with a new set of training data that includes thin characters or empty patches containing foreground-like noise. Although this paper focuses only on historical documents, adaptation to other types of document images, such as music scores and paychecks, is also possible. A further improvement concerns the memory requirement and processing time, which currently limit the application of our method in recent document analysis systems. Model compression [46, 47], which aims to reduce the number of convolutional layers, can be a solution to this problem and might lead to a lighter CNN model specialized for the binarization task.
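The role of the side layers mentioned in this section can be sketched as a training objective in which every side output is compared with the ground truth in addition to the fused output, in the spirit of deeply-supervised networks [19, 20]. The sketch below is an assumption-laden illustration and not necessarily the exact loss used in this work: the function names, the equal side weights, and the plain (unbalanced) cross-entropy are all hypothetical choices.

```python
import math


def cross_entropy(probs, labels, eps=1e-7):
    """Mean per-pixel binary cross-entropy; probs and labels are flat lists."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def deeply_supervised_loss(side_outputs, fused_output, labels, side_weights=None):
    """Weighted sum of the side-output losses plus the fused-output loss."""
    if side_weights is None:
        side_weights = [1.0] * len(side_outputs)  # assumption: equal weights
    side_loss = sum(w * cross_entropy(s, labels)
                    for w, s in zip(side_weights, side_outputs))
    return side_loss + cross_entropy(fused_output, labels)


# Hypothetical text probabilities for four pixels (1 = text in the labels).
labels = [1, 0, 1, 0]
side_outputs = [[0.6, 0.4, 0.6, 0.4],   # shallow side layer, less confident
                [0.8, 0.2, 0.8, 0.2]]   # deeper side layer
fused_output = [0.9, 0.1, 0.9, 0.1]
loss = deeply_supervised_loss(side_outputs, fused_output, labels)
print(round(loss, 4))
```

Because each side output contributes its own error term, intermediate layers receive a direct training signal rather than one that arrives only through the final layer, which is the usual argument for faster convergence in this family of networks.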

6. Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by MEST (NRF-2015R1D1A1A01060172 and NRF-2017R1A4A1015559).

References

[1] A. Antonacopoulos and A. Downton, Special issue on the analysis of historical documents, International Journal on Document Analysis and Recognition 9 (2007) 75–77.

[2] M. Manso and M.L. Carvalho, Application of spectroscopic techniques for the study of paper documents: a survey, Spectrochimica Acta Part B: Atomic Spectroscopy 64 (6) (2009) 482–490.

[3] A. Antonacopoulos, C. Clausner, C. Papadopoulos and S. Pletschacher, Historical Document Layout Analysis Competition, in: ICDAR, 2011, pp. 1516–1520.

[4] N. Stamatopoulos, B. Gatos, G. Louloudis, U. Pal and A. Alaei, ICDAR 2013 Handwriting Segmentation Contest, in: ICDAR, 2013, pp. 1402–1406.

[5] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 62–66.


[6] J.N. Kapur, P.K. Sahoo, and A.K.C. Wong, A new method for gray-level picture thresholding using the entropy of the histogram, Computer Vision, Graphics, and Image Processing 29 (1985) 273–285.

[7] W. Tsai, Moment-preserving thresholding: a new approach, Computer Vision, Graphics, and Image Processing 29 (1985) 377–393.

[8] N. Papamarkos, A technique for fuzzy document binarization, in: Proceedings of the 2001 ACM Symposium on Document Engineering, 2001, pp. 152–156.

[9] W. Niblack, An Introduction to Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 115–116.

[10] J. Sauvola, M. Pietikäinen, Adaptive document image binarization, Pattern Recognition 33 (2) (2000) 225–236.

[11] B. Gatos, I. Pratikakis, and S.J. Perantonis, An adaptive binarization technique for low quality historical documents, Lecture Notes in Computer Science: Document Analysis Systems VI 3163 (2004) 102–113.

[12] Y.T. Pai, Y.F. Chang, and S.J. Ruan, Adaptive thresholding algorithm: Efficient computation technique based on intelligent block detection for degraded document images, Pattern Recognition 43 (2010) 3177–3187.

[13] R.F. Moghaddam and M. Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition 43 (2010) 2186–2198.

[14] I. Pratikakis, B. Gatos, and K. Ntirogiannis, ICDAR 2011 Document Image Binarization Contest (DIBCO 2011), in: ICDAR, 2011, pp. 1506–1510.

[15] I. Pratikakis, B. Gatos, and K. Ntirogiannis, ICDAR 2013 Document Image Binarization Contest (DIBCO 2013), in: ICDAR, 2013, pp. 1471–1476.

[16] T. Lelore and F. Bouchara, Super-resolved binarization of text based on FAIR algorithm, in: ICDAR, 2011, pp. 839–843.

[17] B. Su, S. Lu, and C.L. Tan, Robust Document Image Binarization for Degraded Document Images, IEEE Transactions on Image Processing 22 (4) (2013) 1408–1417.

[18] N.R. Howe, Document binarization with automatic parameter tuning, International Journal on Document Analysis and Recognition 16 (3) (2012) 247–258.

[19] C.Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, Deeply-supervised nets, in: AISTATS, 2015, pp. 562–570.

[20] S. Xie and Z. Tu, Holistically-Nested Edge Detection, in: ICCV, 2015, pp. 1395–1403.


[21] J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.

[22] H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, in: ICCV, 2015, pp. 1520–1528.

[23] B. Gatos, K. Ntirogiannis and I. Pratikakis, ICDAR 2009 Document Image Binarization Contest (DIBCO 2009), in: ICDAR, 2009, pp. 1375–1382.

[24] I. Pratikakis, B. Gatos, and K. Ntirogiannis, H-DIBCO 2010 - Handwritten Document Image Binarization Competition, in: ICFHR, 2010, pp. 727–732.

[25] I. Pratikakis, B. Gatos, and K. Ntirogiannis, ICFHR 2012 Competition on Handwritten Document Image Binarization (H-DIBCO 2012), in: ICFHR, 2012, pp. 817–822.

[26] I. Pratikakis, B. Gatos, and K. Ntirogiannis, ICFHR2014 Competition on Handwritten Document Image Binarization (H-DIBCO 2014), in: ICFHR, 2014, pp. 809–813.

[27] F. Deng, Z. Wu, Z. Lu, and M.S. Brown, Binarizationshop: A user-assisted software suite for converting old documents to black-and-white, in: Proc. Annu. Joint Conf. Digit. Libraries, 2010, pp. 255–258.

[28] H.Z. Nafchi, S.M. Ayatollahi, R.F. Moghaddam, and M. Cheriet, An efficient ground truthing tool for binarization of historical manuscripts, in: ICDAR, 2013, pp. 807–811.

[29] R. Hedjam, H.Z. Nafchi, R.F. Moghaddam, M. Kalacska and M. Cheriet, ICDAR 2015 MultiSpectral Text Extraction Contest (MS-TEx 2015), in: ICDAR, 2015, pp. 1181–1185.

[30] J. Bernsen, Dynamic thresholding of gray-level images, in: ICPR, 1986, pp. 1251–1255.

[31] B. Su, S. Lu, and C.L. Tan, A Learning Framework for Degraded Document Image Binarization using Markov Random Field, in: ICPR, 2012, pp. 3200–3203.

[32] C.H. Chou, W.H. Lin, and F. Chang, A binarization method with learning-built rules for document images produced by cameras, Pattern Recognition 43 (2010) 1518–1530.

[33] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.

[34] Y. Wu, P. Natarajan, S. Rawls, and W. AbdAlmageed, Learning Document image binarization from data, in: ICIP, 2016, pp. 3763–3767.

[35] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.


[36] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos, ICFHR2016 Handwritten Document Image Binarization Contest (H-DIBCO 2016), in: ICFHR, 2016, pp. 619–623.

[37] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos, ICFHR2016 Handwritten Document Image Binarization Contest (H-DIBCO 2016), in: ICFHR, 2016, pp. 619–623.

[38] J. Pastor-Pellicer, S. España-Boquera, F. Zamora-Martínez, M.Z. Afzal, and M.J. Castro-Bleda, Insights on the Use of Convolutional Neural Networks for Document Image Binarization, in: 13th International Work-Conference on Artificial Neural Networks, 2015, pp. 115–126.

[39] J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga, Pixel-wise Binarization of Musical Documents with Convolutional Neural Networks, in: IAPR Conference on Machine Vision Applications, 2017.

[40] M.Z. Afzal, J. Pastor-Pellicer, F. Shafait, T.M. Breuel, A. Dengel, M. Liwicki, Document Image Binarization using LSTM: A Sequence Learning Approach, in: Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, 2015, pp. 79–84.

[41] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proc. Neural Information Processing Systems, 2012, pp. 1106–1114.

[42] P. Fischer, A. Dosovitskiy, T. Brox, Descriptor matching with convolutional neural networks: a comparison to SIFT, 2014, CoRR abs/1405.5769.

[43] J. Long, N. Zhang, T. Darrell, Do convnets learn correspondence?, in: Proc. Neural Information Processing Systems, 2014, pp. 1601–1609.

[44] A. Dosovitskiy, J.T. Springenberg, M. Tatarchenko, T. Brox, Learning to Generate Chairs, Tables and Cars with Convolutional Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4) (2016) 692–705.

[45] CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/convolutional-networks/

[46] S. Han, H. Mao, W.J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, in: International Conference on Learning Representations, 2016.

[47] G. Hinton, O. Vinyals, J. Dean, Distilling the Knowledge in a Neural Network, arXiv preprint arXiv:1503.02531, 2015.


About the Author – VO QUANG NHAT received his B.S. degree in Information Technology from the University of Science, Vietnam, in 2010, and his M.S. degree in Electronics and Computer Engineering from Chonnam National University, Republic of Korea, in 2013, where he is currently a Ph.D. student. His research interests are multimedia and image processing, vision tracking, and pattern recognition.

About the Author – SOO-HYUNG KIM received his B.S. degree in Computer Engineering from Seoul National University in 1986, and his M.S. and Ph.D. degrees in Computer Science from Korea Advanced Institute of Science and Technology in 1988 and 1993, respectively. From 1990 to 1996, he was a senior research staff member of the Multimedia Research Center at Samsung Electronics Co., Republic of Korea. Since 1997, he has been a professor of the Department of Computer Science, Chonnam National University, Republic of Korea. His research interests are pattern recognition, document-image processing, medical-image processing, and ubiquitous computing.

About the Author – HYUNG JEONG YANG received her B.S., M.S., and Ph.D. from Chonbuk National University, Republic of Korea. She is currently an associate professor of the Department of Electronics and Computer Engineering at Chonnam National University, Republic of Korea. Her main research interests are multimedia data mining, pattern recognition, artificial intelligence, eLearning, and eDesign.

About the Author – GUEESANG LEE received his B.S. degree in Electrical Engineering and his M.S. degree in Computer Engineering from Seoul National University, Republic of Korea, in 1980 and 1982, respectively. He received his Ph.D. degree in Computer Science from Pennsylvania State University, U.S., in 1991. He is currently a professor of the Department of Electronics and Computer Engineering at Chonnam National University, Republic of Korea. His primary research interests are image processing, computer vision, and video technology.