Expert Systems With Applications 128 (2019) 187–200
DCWI: Distribution descriptive curve and Cellular automata based Writer Identification

Parveen Kumar a,b,∗, Ambalika Sharma a

a Department of EE, Indian Institute of Technology Roorkee, India
b Department of CSE, National Institute of Technology Uttarakhand, India
Article info

Article history: Received 6 October 2018; Revised 20 March 2019; Accepted 21 March 2019; Available online 21 March 2019.

Keywords: Writer identification; Feature extraction; Cellular automata; Support vector machine
Abstract

Writer identification is an active area of research owing to its applications in a wide variety of fields, ranging from ancient document analysis to modern forensic document analysis. It deals with the writing style of documents and the learning of the discriminating features of different writers. In the domain of pattern recognition, the extraction of discriminative features of different writers has become very challenging. In order to address this concern, this work presents a distribution descriptive curve (DDC-) and cellular automata (CA-) based model. The DDC utilizes the idea of the pixel distribution of handwritten text images to generate a unique curve as a feature vector. The generated feature vector is then fed to a support vector machine (SVM) as an input to identify the writer. Simultaneously, in a parallel mode, the initial handwritten text images are processed repeatedly with CA to generate another set of feature vectors. This new set of feature vectors is fed to a similarity-based classifier (SBC) as an input, and the writer is predicted on the basis of the similarity of the features. The results from both approaches (DDC + SVM and CA + SBC) are merged to improve the performance of the model; the final writer identification is accomplished using a ranking-based score scheme. The proposed model, DCWI, is evaluated on different datasets, e.g., IAM for English, IFN/ENIT for Arabic, and Kannada and Devanagari (Hindi) for Indic scripts. The results show that the proposed model performs better than existing state-of-the-art techniques.

© 2019 Elsevier Ltd. All rights reserved.
1. Introduction

Writer identification is used to identify the writer of a given handwritten text sample with high confidence. Over the last few decades, the applications of writer identification have varied widely, from ancient document analysis to modern forensic document analysis. Ancient documents are rarely examined to recognize the handwriting (Ogier, 2008); however, they are still used for the identification and characterization of different writers (Bensefia, Paquet, & Heutte, 2005). Forensic document analysis is utilized for a variety of applications, such as the authentication of an addressed record, signature verification, and the detection of document forgeries and alterations (Siddiqi & Vincent, 2010). The writer identification process involves the prediction of the writer of a given handwritten text sample. Identification systems are divided into two major categories: offline writer identification (He, You, & Tang, 2008a) and online writer identification (Schlapbach, Liwicki, & Bunke, 2008). Offline writer identification takes images as inputs, whereas online writer identification may take more than one file as input, containing the speed, the pressure applied in various regions, the angle of writing, etc. Online writer identification (Schomaker, 2007) involves the collection of data by means of an electronic device that enables the user to write on the screen using an electronic pen connected to the writer's device. The data contain more information about the writing style of the writer, such as the speed of writing, the angle of the pen, and the pressure applied in various regions. In offline writer identification, only a single image with handwritten characters is available, which makes identification comparatively more challenging owing to the unavailability of additional information such as pressure and angle. The proposed work has three main contributions:
∗ Corresponding author.
E-mail addresses: [email protected] (P. Kumar), [email protected] (A. Sharma).
https://doi.org/10.1016/j.eswa.2019.03.037
0957-4174/© 2019 Elsevier Ltd. All rights reserved.
• A novel approach for feature extraction based on a distribution descriptive curve (DDC) has been introduced. These features are utilized in our proposed model to obtain high accuracy compared with state-of-the-art techniques.
• A cellular automata (CA-) based model for feature extraction has been introduced. These features are utilized to enhance the performance of the proposed model.
• An efficient model, DCWI, for writer identification has been presented based on DDC and CA. The results of these techniques are combined to obtain better performance. To the best of our knowledge, writer identification using DDC- and CA-based models has not yet been applied in the field of pattern recognition.
The rest of the paper is organized as follows: Section 2 briefly describes previous studies carried out on offline writer identification, explaining their key features along with their advantages and disadvantages. Section 3 illustrates the proposed model, DCWI. Then, Section 4 presents the experimental results and discussion. Finally, Section 5 concludes the paper.

2. Related work

During the last few decades, several writer identification techniques have been developed. A technique based on morphological waveform coding was proposed by Zois and Anastassopoulos (2000). A two-dimensional (2D) normalized projection function is calculated from the text image, and a morphological transformation is then applied to the normalized projection function. This method is based on single words and uses a Bayesian classifier as well as a neural network (NN) classifier for writer identification. The error rates are 7% and 3.5% for the Bayesian and NN classifiers, respectively. Bensefia et al. (2005) proposed a writer identification and verification system based on local features such as graphemes extracted from cursive handwriting. A text-based information retrieval system is used for writer identification. This work achieved accuracies of 86% and 95% on the IAM and PSI datasets, respectively. Pervouchine and Leedham (2007) proposed a method that extracts features from the letters 'd', 'y', 'f' and the grapheme 'th'. The features are extracted using genetic algorithms, and an NN is used for classification. The results are not good when the training samples are small in size. He, You, and Tang (2008b) proposed a model that uses global wavelet-based features characterized by the generalized Gaussian model (GGM) in the wavelet domain. The method obtained high accuracy on Chinese handwriting only. An approach that integrates segmentation at the character level was presented by Tan, Viard-Gaudin, and Kot (2009).
It achieved an accuracy of 99.2% on a dataset of 120 writers, but with the constraint of a minimum character requirement: the minimum number of characters required in the given text is around 160. Helli and Moghaddam (2010) presented a text-independent writer identification method in which pattern-based features are extracted from text using Gabor and XGabor filters. A feature relation graph was constructed by applying a fuzzy method to the extracted features. The method obtained an accuracy of 100% on a dataset of 100 writers, but only for Persian handwriting, which made it script dependent, as it focused on properties specific to Persian handwriting. Siddiqi and Vincent (2010) utilized the redundancy of some special characters in the text and visual attributes to identify the writer of the documents. Contour-based orientation and curvature-based features are used to compute a set of features from writing samples at different levels of observation. The method produced good results on the IAM and RIMES datasets, while results on other standard datasets are not appreciable. Wen, Fang, Chen, Tang, and Chen (2012) proposed a text-independent Chinese writer identification scheme based on the edge structure code (ESC) distribution feature and nonparametric discrimination of the sample. Considerably good results are obtained on the HIT-MW dataset, which is widely used for performance evaluation; this method obtained a top match rate of 95.4%.

Fig. 1. Grapheme codebook: (a) English handwriting, (b) Farsi handwriting.

Ghiasi and Safabakhsh (2013) proposed a writer identification method based on a grapheme codebook. For a given handwritten text sample, the method develops a feature vector based on the occurrence of shapes in the grapheme codebook. Grapheme codebook samples are shown in Fig. 1. Djeddi, Siddiqi, Souici-Meslati, and Ennaji (2013) proposed a writer recognition system using multi-script handwritten text, where the writer is predicted using k-nearest neighbor and SVM. This system is significant because of its applicability even for short text lengths, which is the main requirement of many forensic-lab scenarios. However, it is not feasible to implement this system for large datasets because of its time complexity. Bertolini, Oliveira, Justino, and Sabourin (2013) proposed texture-based identification and dissimilarity-representation-based classification to identify the writer. Accuracies of 99.2% and 96.7% were obtained on the IAM and BFL datasets, respectively. Another grapheme-based sparse model is presented by Kumar, Chanda, and Sharma (2014), in which extracted documents are represented in the form of Fourier and wavelet descriptors. The wavelet descriptors produce a multiresolution representation of the shape. This method obtains good results, even for a smaller codebook. Hannad, Siddiqi, and El Kettani (2016) proposed a writer identification method that uses texture-based descriptors for handwritten fragments. This method divides the dataset into smaller fragments, from which histograms of local binary patterns (LBPs), local ternary patterns (LTPs) and local phase quantization (LPQ) are computed. The method is evaluated on the IAM and IFN/ENIT datasets, and considerably better results are obtained. Chahi, Ruichek, Touahni et al. (2018) presented a simple yet efficient and effective similarity-based method known as block-wise local binary count (BWLBC) for writer identification.
The method extracts connected components from the text, and then calculates the BWLBC for these extracted connected components. The BWLBC histogram is matched with the BWLBC histograms of existing known datasets. Marti and Bunke (2002) introduced a handwritten English sentence dataset (the IAM dataset) for offline handwriting recognition. The dataset is based on the Lancaster-Oslo/Bergen (LOB) corpus and consists of handwritten English sentences, which are used to evaluate the proposed model. Pechwitz et al. (2002) presented a handwritten Arabic dataset (the IFN/ENIT dataset) that consists of more than 210,000 characters. Alaei, Nagabhushan, and Pal (2011) and Alaei, Pal, and Nagabhushan (2012) presented a Kannada handwritten text dataset (KHTD) that consists of 204 handwritten text documents of four different categories, written by 51 native speakers of Kannada. Alaei and Roy (2014) proposed a model based on histogram symbolic representation (HSR) for writer identification. The authors extracted 92 features from each text line. The experiments were performed on English and on Kannada as an Indic script. Olszewska (2015) proposed an active contour-based approach for optical character recognition (OCR), which allows automatic extraction and digit recognition from real-world visual scenes and videos. Moreover, the template matching technique is also employed for character recognition. The experiments are performed on a sports dataset and outperform state-of-the-art methods. Al-Maadeed, Hassaine, Bouridane, and Tahir (2016) proposed a novel feature extraction technique based on geometric features, including direction, curvature, and tortuosity, for writer identification. Moreover, the authors also presented an improved version of edge-based directional and chain-code-based features, evaluated on Arabic and English handwriting datasets. An accurate and efficient curve detection technique was presented by Walsh and Raftery (2002). The Hough transform is used to detect parametric curves in images. The authors combined the Hough transform and the probabilistic Hough transform in a single framework, and the combined results are shown for simulated and real data.

Fig. 2. Preprocessing steps.
3. Proposed model: DCWI

In the proposed model, two parallel processes are carried out simultaneously to achieve writer identification. The first is a probabilistic model based on DDC, and the second is a similarity-based model for writer identification. The similarity-based model involves the use of CA (Hadeler & Müller, 2017). First, text lines are extracted from the input images, and words are segmented from these extracted text lines. The process is illustrated in Fig. 2. After this preprocessing step, binary images of every word of the documents are obtained and fed to the model. The DDC- and CA-based features are extracted from the text lines and words, and the proposed model is applied for writer identification. In DCWI, 23 features are extracted from the DDC and 28 features are extracted from the CA-based model for each word of the text sample. These features form a feature vector that is fed into the respective classifier for training (the feature vector from the DDC-based model is fed into the SVM, and the feature vector from the CA-based model is fed into the SBC). These extracted features are treated homogeneously: no specific weight is assigned to any feature, and each feature takes a real value. The classifiers treat the features in the feature vector equally (i.e., with equal weightage) and draw n-dimensional hyperplanes that act as boundaries between different classes (i.e., writers). We select the top five matching classes from the results obtained by each classifier individually. On the basis of their class selection, the ranking-based score is calculated, as expressed by Eq. (12), and the highest score leads to the final selected writer. Several experiments were conducted using the DDC- and CA-based models separately for writer identification. It is observed that for some samples the first model performs better, while for other samples the second model performs better. The results of the two parallel models are combined, and it is found that the accuracy of DCWI improves with the combined effect. The SVM as well as the SBC are used to classify the appropriate writer, and the results obtained from these two parallel processes are combined to obtain efficient results. The complete procedure of the writer identification model is depicted in Fig. 3. The proposed model consists of three phases:

• Preprocessing: This phase prepares the data for the model. The methods used for preprocessing are binarization and segmentation.
• Feature extraction: In the proposed model, DDC and CA are used to extract the features from the images.
• Classification/Identification: Finally, the system classifies the samples using the features provided by the feature-extraction phase. Two classification techniques are used in the proposed model: SVM and SBC. The results provided by these two classifiers are combined to obtain better performance.
3.1. Distribution Descriptive Curve (DDC-) based model

The DDC is a novel approach for extracting features from handwritten text images. The DDC is generated by the combination of a central curve and upper and lower lines based on disorder and density. The first step of the model is to generate the DDC for every word in the document. The generation of the DDC involves several steps, as illustrated in Fig. 4. The steps required to generate the DDC are as follows:

• Disorder-based upper and lower lines: The disorder of a line is defined as the number of neighboring pixel pairs having different values. We traverse the line, and wherever the value of the pixel changes, the disorder count is incremented. Suppose L is a row vector of the image data and (x, x + 1) denotes a pair of adjacent indices. The disorder D(L) can be expressed as in Eq. (1).
D(L) = |{ (x, x + 1) : L(x) ≠ L(x + 1) }|    (1)
After calculating the disorder for every row vector of the image, we extract the maximum disorder md from the disorder vector D using Eq. (2).
md = max(D)    (2)
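As an illustrative sketch (not the authors' code), assuming a binarized word image stored as a list of rows with 0 = black and 1 = white, the disorder of Eq. (1) and the maximum disorder of Eq. (2) could be computed as:

```python
def disorder(row):
    """Eq. (1): number of adjacent pixel pairs whose values differ."""
    return sum(1 for x in range(len(row) - 1) if row[x] != row[x + 1])

def max_disorder(image):
    """Eq. (2): maximum disorder over all row vectors of the image."""
    return max(disorder(row) for row in image)

# Tiny example image (0 = black, 1 = white); the second row alternates,
# so it carries the maximum disorder.
image = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],  # 4 value changes
    [1, 0, 0, 0, 1],  # 2 value changes
]
```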
Fig. 3. Proposed model for writer identification.
Fig. 4. Generation of DDC.
Once the maximum disorder md is obtained, we find the upper and lower lines that will be used to generate the DDC. While traversing D, the first line with disorder greater than or equal to md is considered the upper line. On traversing D in reverse order, i.e., from the last element towards the first, the first line with disorder greater than or equal to md is the lower line. The upper and lower lines obtained are shown in Fig. 4 in blue and green, respectively.

• Central curve (CC): The central curve represents the midpoint of the region in which part of the word lies, i.e., it consists of the midpoints of the sections of all the columns in which the word lies. To obtain the CC, we traverse all the columns of the text image, and the positions of the first and last black pixels are marked in each column. Let these positions be l and u, respectively. The mean of the two values is computed to obtain the value of the CC for that particular column. Fig. 4 shows the extracted CC in red.

• Density-based upper and lower lines: The density of a line is defined as the number of black pixels in it. We traverse the line, and wherever a black pixel is encountered, the density count is incremented. Suppose L is a row vector of the image data and x is a pixel index. The density of the row vector L can be expressed as defined in Eq. (3).
Density(L) = |{ x : L(x) = 0 }|    (3)
After calculating the density of every row vector of the image, the maximum density is extracted from the density vector (expressed in Eq. (3)) as given in Eq. (4).
mds = max(Density)    (4)
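Under the same assumptions (a list of 0/1 rows, 0 = black), the density of Eq. (3), its maximum from Eq. (4), and the central-curve step can be sketched as follows; the handling of columns that contain no black pixel is our assumption, since the paper does not specify it:

```python
def density(row):
    """Eq. (3): number of black pixels (value 0) in a row."""
    return row.count(0)

def max_density(image):
    """Eq. (4): maximum density over all row vectors."""
    return max(density(row) for row in image)

def central_curve(image):
    """Central curve (CC): per column, the midpoint of the first and last
    black pixel positions (l and u in the text). Columns with no black
    pixel yield None (an assumption; the paper does not cover this case)."""
    rows, cols = len(image), len(image[0])
    cc = []
    for j in range(cols):
        black = [i for i in range(rows) if image[i][j] == 0]
        cc.append((black[0] + black[-1]) / 2 if black else None)
    return cc
```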
Once the maximum density mds is obtained, we determine the upper and lower lines that are used to generate the DDC. While traversing the density vector, the first line with a density greater than or equal to mds is the upper line. On traversing the density vector in reverse order, i.e., from the last element towards the first, the first line with a density greater than or equal to mds is the lower line. Devanagari (Hindi) handwriting contains a horizontal line in every word that raises the maximum density of the word to a very high value; in this case, the detected upper and lower lines would not be correct. To overcome this problem, we first determine whether the maximum density is the result of such a horizontal line or of a word with high density. On observing a large number of Devanagari (Hindi) words, we found that, on average, the thickness of the line is 8% of the thickness of the word. Therefore, if the lines with maximum density span more than or equal to 0.08 times the thickness of the word, the maximum density is attributed to the horizontal line of the Hindi word, and we reduce the maximum density by half. The upper and lower lines obtained are shown in Fig. 4 in blue and green, respectively.

• Distribution descriptive curve generation: To obtain the DDC, the weighted mean of the calculated parameters, namely the disorder-based upper and lower lines, the CC, and the density-based upper and lower lines, is computed using the weights shown in Table 1. The CC combined with the disorder-based and density-based upper and lower lines is shown in Fig. 4. Every word is considered to consist of three regions: the upper, middle, and lower regions. The upper lines based on density and disorder describe the upper region, the lower lines based on density and disorder describe the lower region, and the CC describes the middle region. We desire an equal contribution from all three regions and therefore assign an equal weight of 0.33 to every region: the upper region has two upper lines with weights of 0.165 each, the middle region has the CC with a weight of 0.33, and the lower region has two lower lines with weights of 0.165 each.

Table 1. The weight matrix for different parameters to obtain DDC.
Disorder-Based Upper Line: 0.165
Disorder-Based Lower Line: 0.165
Central Curve: 0.33
Density-Based Upper Line: 0.165
Density-Based Lower Line: 0.165

• Deadlines: A combination of two or more adjacent points on the DDC having the same value is known as a deadline segment; it can also be defined as a line segment in the DDC with zero slope. The DDC depends on several features of the word, such as the disorder-based upper and lower lines, the density-based upper and lower lines, and the CC. If the values of contiguous points are the same, all of these features have the same value or are changing in such a way as to keep the final value of the DDC unchanged. It is unlikely that all features would change similarly and simultaneously for all writers, which makes the deadlines an important characteristic for defining the uniqueness of a word. The deadlines in the DDC of various words are shown in red in Fig. 5. The two samples on the right-hand side are acquired from the Hindi dataset from the same writer, and the rest are acquired from the Kannada and English handwritten datasets from different writers. We observe that the occurrence and length of the deadlines are very similar in samples from the same writer, and dissimilar in samples from different writers. The various methods applied provided the best features for classification. In DCWI, we employ the DDC, which is unique for every single
word as well as unique to every style of writing of the same word. The DDC is based on the distribution of the text of the word; a subtle change in handwriting style changes the DDC of the word. After conducting several experiments on different features of the DDC, we determined the 23 strongest features that best describe it. After obtaining the DDC in the form of a vector of length equal to the width of the word, 23 features are extracted using the DDC and its deadlines. These features are extracted with respect to a whole line instead of a single word: we first obtain the DDC of each word, and then append the DDCs of the words in order of their occurrence to obtain the DDC of the whole line. We consider the following characteristics of the deadlines as features for writer identification:

– Average length of deadlines: The length of deadlines is an important characteristic of the handwritten document; therefore, we consider how the length varies from writer to writer. We calculate the mean of the lengths of the deadlines present in every line of a document written by the writer.
– Standard deviation (SD) of the length of deadlines: To increase the effectiveness of the length features of the deadlines, we consider the SD of the lengths of the deadlines present in the lines of the handwritten document.
– Number of deadlines per unit length: The number of deadlines is also considered as a feature. We take the number of deadlines per unit length to normalize the effect of different word lengths. The length is measured in pixels; in other words, the feature is the number of deadlines per pixel. We calculate the mean of the points on the DDC and draw a horizontal line through it, known as the mean DDC. The deadlines are further classified into three types:
  ∗ Upper deadlines: those lying above the mean DDC.
  ∗ Middle deadlines: those lying on the mean DDC.
  ∗ Lower deadlines: those lying below the mean DDC.
– Number of upper deadlines per unit length
– Number of middle deadlines per unit length
– Number of lower deadlines per unit length
– Mean distance between deadlines
– SD of distance between deadlines
– Number of peaks per pixel: A peak in the DDC is defined as a point for which no neighbor has a higher value. The DDC is represented as a vector of length equal to the width of the word. In Eq. (5), x refers to an index in the DDC, and DDC(x) refers to the value of the DDC at index x; that is, DDC(x) is the distance of the point at position x in the DDC from the x-axis, considering the lower-left point of the image as the origin in the Cartesian plane. Mathematically, a peak is defined as in Eq. (5).

Peak = {x : DDC(x − 1) ≤ DDC(x) ∧ DDC(x + 1) ≤ DDC(x)}    (5)

The number of peaks in the DDC, divided by the length of the DDC, provides the required feature.
– Number of troughs per pixel: A trough in the DDC is a point for which no neighbor has a smaller value. Mathematically, a trough is defined as in Eq. (6).

Trough = {x : DDC(x − 1) ≥ DDC(x) ∧ DDC(x + 1) ≥ DDC(x)}    (6)

Fig. 5. Deadline for sample words.
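A sketch of the deadline extraction and of the peak and trough definitions of Eqs. (5) and (6), assuming the DDC is given as a plain list of values (the function and variable names here are ours, not the paper's):

```python
def deadlines(ddc):
    """Maximal runs of two or more adjacent equal DDC values.
    Returns (start_index, length) per deadline segment."""
    segs, start = [], 0
    for x in range(1, len(ddc) + 1):
        if x == len(ddc) or ddc[x] != ddc[start]:
            if x - start >= 2:
                segs.append((start, x - start))
            start = x
    return segs

def peaks(ddc):
    """Eq. (5): interior points no lower than both neighbors."""
    return [x for x in range(1, len(ddc) - 1)
            if ddc[x - 1] <= ddc[x] and ddc[x + 1] <= ddc[x]]

def troughs(ddc):
    """Eq. (6): interior points no higher than both neighbors."""
    return [x for x in range(1, len(ddc) - 1)
            if ddc[x - 1] >= ddc[x] and ddc[x + 1] >= ddc[x]]
```

Dividing `len(peaks(ddc))` and `len(troughs(ddc))` by `len(ddc)` yields the per-pixel features described in the text.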
The number of troughs in the DDC, divided by the length of the DDC, provides the required feature.
– Average length of rises in DDC: A rise in the DDC is a non-decreasing subsequence of the DDC. The lengths of all rises in the DDC are extracted, and their average is considered as a feature.
– SD of length of rises in DDC
– Average length of falls in DDC: A fall in the DDC is a non-increasing subsequence of the DDC. The lengths of all falls in the DDC are extracted, and their average is considered as a feature.
– SD of length of falls in DDC
– Average value of local maxima in DDC: The maximum value of the DDC of each word is calculated, and the mean of these values is considered as a feature.
– Average value of local minima in DDC: The minimum value of the DDC of each word is calculated, and the mean of these values is considered as a feature.
– Mean value of absolute differences between minima and maxima: The absolute difference between the maximum and minimum values of the DDC of each word is calculated, and its mean is considered as a feature.
– Mean of the block-wise binary scores: The DDC is divided into contiguous blocks Bk, each of length 32. A binary code is generated for each block, as expressed in Eq. (7).

Binary_Code_k(i) = 1 if Bk(i − 1) ≤ Bk(i), and 0 otherwise    (7)
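Eq. (7) can be turned into a numeric score as sketched below; treating the bits as an MSB-first binary number is our assumption, since the paper only states that the decimal value of the code is the binary score:

```python
def binary_code(block):
    """Eq. (7): bit i is 1 when Bk(i-1) <= Bk(i), else 0 (i starting at 1)."""
    return [1 if block[i - 1] <= block[i] else 0 for i in range(1, len(block))]

def binary_score(block):
    """Decimal value of the binary code, read MSB-first (our assumption)."""
    return int("".join(map(str, binary_code(block))), 2)
```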
For each block Bk, the binary code Binary_Code_k is generated as above. The decimal value corresponding to the binary code is known as the binary score, and the mean of the block-wise binary scores is considered as a feature.
– SD of the block-wise binary scores
– Mean of the rise in block-wise binary scores: A rise in the binary scores is a continuously increasing subsequence of the scores. We consider the mean of the lengths of these rises as a feature.
– SD of the rise in block-wise binary scores: We consider the SD of the lengths of the rises in the binary scores as a feature.
– Mean of the fall in block-wise binary scores: A fall in the binary scores is a continuously decreasing subsequence of the scores. We consider the mean of the lengths of these falls as a feature.
– SD of the fall in block-wise binary scores: We consider the SD of the lengths of the falls in the binary scores as a feature.

3.2. Cellular Automata (CA-) based model

There is a construct similar to the DDC that is unique to every word as well as to different writings of the same word; we call this construct the skyline. The manner in which the skyline changes under repeated processing with CA can be observed in Fig. 6. This model is based on processing the input repeatedly with CA, and the extracted features are used for the SBC. The procedure of feature extraction using CA is depicted in Fig. 6. Initially, the sample text is processed with CA to obtain an intermediate text image. Next, the skyline of the obtained text image is extracted to give the outer cover of the text image. After extracting the skyline of the text image, features are extracted from the skyline, and these features are used to classify the writer. This process is repeated n times, taking the output image of the CA as the input image for the next step. In the proposed model, the value of n is four; this is the number of times CA is applied to the sample image. Each time CA is applied, the processed image becomes darker and the relative difference between the extracted skylines decreases. The relative difference between consecutive skylines refers to the relative difference between the values of the corresponding features extracted from the two skylines obtained in subsequent iterations of applying CA. A large set of random samples was chosen from the dataset, and it was observed that for higher values of n, the difference between subsequent skylines is negligible; for values of n greater than four, the skylines obtained after applying CA are less useful. Therefore, the value of n is chosen to be four. The skyline of a text image is implemented in Algorithm 1.

Algorithm 1 Skyline(I : image).
1: I1 ← zeros(m, n)
2: for i ← 1 to m do
3:   for j ← 1 to n do
4:     result ← 0
5:     for i1 ← max(i − 1, 1) to min(i + 1, m) do
6:       for j1 ← max(j − 1, 1) to min(j + 1, n) do
7:         result ← result ∨ I(i1, j1)
8:       end for
9:     end for
10:    if result = 0 then
11:      I1(i, j) ← 1
12:    else
13:      I1(i, j) ← I(i, j)
14:    end if
15:  end for
16: end for
17: return I1
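A direct 0-indexed Python transcription of Algorithm 1 (with the loop bounds clamped to the actual image dimensions) might read:

```python
def skyline(img):
    """Skyline of a binary word image (0 = black, 1 = white).

    A pixel whose entire 3x3 neighborhood (including itself) is black is
    turned white; every other pixel keeps its value, so only the outer
    cover of the word survives.
    """
    m, n = len(img), len(img[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            result = 0
            for i1 in range(max(i - 1, 0), min(i + 1, m - 1) + 1):
                for j1 in range(max(j - 1, 0), min(j + 1, n - 1) + 1):
                    result = result or img[i1][j1]  # logical OR over the neighborhood
            out[i][j] = 1 if result == 0 else img[i][j]
    return out
```

For a solid 3x3 black block padded with white, only the single interior pixel is erased, leaving the contour.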
The algorithm Skyline takes as input an image I of size m × n and computes the skyline I1 as output. In the algorithm, we traverse each pixel of the image and check all of its neighbors. If all of the neighboring pixels of the current pixel are black, we set the pixel value to white; the remaining pixels are unchanged. In a binary image, black is represented by zero and white by one. Performing the logical OR of all the neighboring pixels yields zero only if the value of every neighboring pixel is zero. If this happens, the algorithm changes the color of the pixel to white, i.e., it sets its value to one. The resulting image I1 gives the outer cover of the word image, which is used for feature extraction for the SBC. The basic CA model is depicted in Fig. 7. A CA (Hadeler & Müller, 2017) is defined as an n-dimensional grid-like structure consisting of cells, each in one of a predefined set of states S. The states of the cells change with time according to a set of rules defined on the neighborhood of each cell. For binary images, the CA is defined on a 2D grid, where the grid is the image and each cell is a pixel within the image. The state of a cell is the value of the pixel at a particular instant of time; in the case of a binary image, it is either zero or one. The CA model can be defined by three tuples, namely the set of states S, the neighborhood Nr, and the transition function δ. In our case, S is {0, 1}, and Nr can be defined in several ways; the most popular are the Von-Neumann Neighborhood (Weisstein, 2013) and the Moore Neighborhood (Weisstein, 2005). The Von-Neumann Neighborhood defines Nr using the 1-norm (Popovici & Popovici, 2002), as expressed in Eq. (8).
Nr1(xc) = {x : |x − xc|1 ≤ r}; x = <xi, yi>    (8)
Fig. 6. Feature extraction using CA.
The Moore neighborhood defines Nr using the ∞-norm (Gray, 2003), as defined in Eq. (9). In Eqs. (8) and (9), xc and x refer to the locations of the current and neighboring pixels, respectively.

Nr∞(xc) = {x : |x − xc|∞ ≤ r}; x = <xi, yi>    (9)

Fig. 7. Basic CA model: (a) Moore Neighborhood, (b) Von-Neumann Neighborhood.

In the proposed CA model, we use the Moore neighborhood with r = 2, which gives the eight neighboring cells shown in red and the current cell shown in blue in Fig. 7a. The number of neighbors selected in the Moore neighborhood depends on the value of r and is given by the formula (2r − 1)² − 1. Therefore, for r = 2 the number of neighbors is 8, for r = 3 it is 24, and for r = 4 it is 48. The number of neighbors thus grows rapidly with each unit increase in r, and because the complexity of the CA depends on the number of neighbors, larger values of r are not computationally feasible. We therefore chose r = 2. The superscript 1 in Eq. (8) indicates that neighboring cells are taken only along a single line in four directions, as shown in Fig. 7b, while the superscript ∞ in Eq. (9) indicates that we take all neighboring cells along all lines directed outwards from the current cell.

Let s1–s8 be the states of the eight neighboring cells and s0 the state of the current cell. The next state of the cell is evaluated using the transition function δ, as expressed in Eq. (10). Since black is represented by zero, this rule makes a pixel black if the pixel itself or any of its neighbors is black. In Fig. 8, the effect of the CA is shown on the handwriting of different writers; the three samples are acquired from Bangla, English, and Hindi handwriting, respectively, and five iterations of the CA model are shown.

δ(s0, s1, s2, s3, s4, s5, s6, s7, s8) = s0 ∧ s1 ∧ s2 ∧ s3 ∧ s4 ∧ s5 ∧ s6 ∧ s7 ∧ s8    (10)
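One iteration of the transition rule of Eq. (10) can be sketched as follows (a minimal NumPy sketch; since black = 0, the AND over the Moore neighborhood dilates the ink by one pixel per step):

```python
import numpy as np

def ca_step(img):
    """One CA step: next state = AND of a cell and its 8 Moore neighbours."""
    m, n = img.shape
    padded = np.pad(img, 1, constant_values=1)  # white border outside the image
    out = np.ones_like(img)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # AND the shifted copies together: a cell stays/turns black (0)
            # if it or any Moore neighbour is black
            out &= padded[1 + di:1 + di + m, 1 + dj:1 + dj + n]
    return out
```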
In the proposed model, the CA is applied to the sample image to extract features. In each step, seven features are extracted from the resulting image: each application of the CA generates a newly processed image whose skyline has changed, and seven new features are extracted from this modified skyline. The process is repeated four times sequentially, deriving a total of 28 features. The features are as follows:

- Global binary count per unit area (GBCUA): The total number of black pixels in the skyline of the text image is the global binary count (GBC). The GBCUA describes the writing style of the writer: as the CA and skyline are applied repeatedly, the GBCUA may increase or decrease, and this variation differs from writer to writer.
- Number of connected components per unit length: The total number of connected components. When the CA is applied to text images, the change in the number of connected components varies from one writing style to another.
- GBCUA of the critical region: The critical region is the rectangular window outside which no black pixel is present. The GBCUA of the critical region is taken as a feature.
- Mean size of connected components per unit length: The sizes of all connected components are extracted, and their mean per unit length is taken as a feature.
- SD of the size of connected components per unit length: The sizes of all connected components are extracted, and the SD of their size per unit length is taken as a feature.
- Row block-wise binary score: The rows are divided into blocks of size 16, and their binary score is calculated as shown in Eq. (7). The mean of the scores of the blocks is taken as a feature.
- Column block-wise binary score: The columns are divided into blocks of size 16, and their binary score is calculated as presented in Eq. (7). The mean of the scores of the blocks is taken as a feature.

Fig. 8. Effect of CA on handwriting text.

3.3. Writer identification process

Two well-known classifiers, SVM and SBC (Bernal, Hospevian, Karadeniz, & Lassez, 2003), were utilized to identify the writer. To enhance the performance of DCWI, the results from these two classifiers are merged, and writer identification is finally accomplished using the ranking-based score scheme. The two classifiers are as follows:

i Support Vector Machine (SVM): SVM is a classification technique that was initially designed for binary classification; it separates two classes using the maximum-margin criterion (Cortes & Vapnik, 1995). One way to use the SVM as a multi-class classifier is to train several binary classifiers, i.e., the one-versus-all approach, and select the classifier with the best result. Instead of creating several classifiers, we use the approach proposed by Weston and Watkins (1998), which distinguishes all classes in a single optimization process. In this approach, {(x1, c1), (x2, c2), ..., (xk, ck)} is the labeled dataset, where xi ∈ Rd and ci ∈ {1, ..., k}. The proposed model is based on the decision rule expressed in Eq. (11).
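Two of the features above, the GBCUA and the critical region, translate directly into code (an illustrative sketch with helper names of our choosing; the block-wise binary score of Eq. (7) is omitted):

```python
import numpy as np

def gbcua(img):
    """Global binary count (number of black pixels) per unit area."""
    return (img == 0).sum() / img.size

def critical_region(img):
    """Smallest rectangular window outside which no black pixel is present."""
    rows = np.where((img == 0).any(axis=1))[0]
    cols = np.where((img == 0).any(axis=0))[0]
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```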
arg maxm fm(x) = arg maxm (wmT ψ(x) + bm)    (11)
SVM is used to predict the writer in the DDC-based model. The dataset is generated from the given data in the form of a 23-dimensional feature vector for every line of the document. These feature vectors are processed with the SVM for writer identification.

ii Similarity-Based Classifier (SBC): In the proposed CA-based model, features are extracted at the word level, unlike the line level in our DDC-based model. The SBC is based on the maximum similarity between words of different writers. Initially, to identify the writer using feature similarity, features are extracted from every word of the document and stored in labeled feature vectors. To classify a new sample, the feature vector is calculated for each word, and the most similar stored feature vector corresponding to each word in the sample is selected. The label occurring most often among the selected feature vectors is taken, and the corresponding writer is identified as the writer of the document.

iii Result generation based on SVM and SBC: In our observations, both models usually produce the same prediction; however, different predictions are obtained for a few text samples. When the predictions agree, the writer predicted by both classifiers is the identified writer, while a weighted system is used to remove ambiguity when they differ. The top five candidate writers are selected by each classifier, i.e., SVM and SBC. Let ksvmi and ksimi be the ranks of a candidate writer in the SVM and SBC lists, respectively. The score for a given writer is calculated as expressed in Eq. (12).
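The word-level voting of the SBC can be sketched as follows (a minimal nearest-neighbour illustration; using Euclidean distance as the similarity measure is our assumption, as the paper does not fix one):

```python
import numpy as np
from collections import Counter

def sbc_predict(train_feats, train_labels, doc_word_feats):
    """For each word vector, pick the most similar training vector and
    vote on its writer label; return the majority label."""
    votes = []
    for f in doc_word_feats:
        dist = np.linalg.norm(train_feats - f, axis=1)
        votes.append(train_labels[int(dist.argmin())])
    return Counter(votes).most_common(1)[0][0]
```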
Score(Wi) = 1/ksvmi + 1/ksimi    (12)
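Eq. (12), together with the convention that a candidate absent from a classifier's top five receives rank ∞ (so its term contributes 0), can be sketched as:

```python
import math

def fuse_top5(svm_top5, sbc_top5):
    """Ranking-based score fusion: Score(W) = 1/k_svm + 1/k_sim."""
    def rank(ordered, w):
        # rank 1..5 if present, infinity otherwise (1/inf == 0.0)
        return ordered.index(w) + 1 if w in ordered else math.inf
    candidates = set(svm_top5) | set(sbc_top5)
    score = {w: 1 / rank(svm_top5, w) + 1 / rank(sbc_top5, w)
             for w in candidates}
    return max(score, key=score.get)
```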
A writer not present in the top five candidates of the SVM has rank ksvmi = ∞, and a writer not present in the top five candidates of the SBC has rank ksimi = ∞. The writer with the highest score is identified as the writer of the document.

4. Experimental results and discussion

This section is divided into two parts, i.e., experimental setup and experimental results. A detailed description is given in the following sections.

4.1. Experimental setup

The proposed model was evaluated on the IAM (Marti & Bunke, 2002) English dataset, the IFN/ENIT (Pechwitz et al., 2002) Arabic dataset, a Kannada dataset (Alaei et al., 2011; Alaei et al., 2012), and a Devanagari (Hindi) script dataset (developed in-house). The content in all of the documents for training and testing was mutually exclusive, with the exception of the Devanagari (Hindi) dataset. The different datasets used for the experiments are shown in Table 2.

4.1.1. English script dataset

There are several datasets available for the English script. However, the IAM dataset (Marti & Bunke, 2002) is a widely used English handwritten dataset for handwriting text recognition as well
Table 2. IAM, IFN/ENIT, Kannada, and Devanagari (Hindi) script corpus.

Dataset                                           # Documents  # Writers  # Training documents  # Testing documents
IAM (Marti & Bunke, 2002)                         1539         657        1100                  439
IFN/ENIT (Pechwitz et al., 2002)                  822          411        411                   228
Kannada (Alaei et al., 2011; Alaei et al., 2012)  228          57         114                   114
Devanagari (Hindi)                                648          81         405                   243
Table 3. Result generation based on SVM, SBC, and combined (DCWI) for different datasets.

Dataset                                           SVM (%)  SBC (%)  Combined accuracy (DCWI) (%)
IAM (Marti & Bunke, 2002)                         96.4     91.5     97.8
IFN/ENIT (Pechwitz et al., 2002)                  95.8     92.2     97.5
Kannada (Alaei et al., 2011; Alaei et al., 2012)  97.5     94.5     99.8
Devanagari (Hindi)                                97.8     93.5     99.9
Table 4. Performance of the proposed model with different datasets.

Dataset                                           Precision (%)  Recall (%)  F1-score (%)
IAM (Marti & Bunke, 2002)                         97.32          98.17       97.74
IFN/ENIT (Pechwitz et al., 2002)                  95.97          99.16       97.53
Kannada (Alaei et al., 2011; Alaei et al., 2012)  99.82          99.82       99.82
Devanagari (Hindi)                                99.93          99.87       99.89
as for writer verification/identification. It consists of handwritten documents from 657 writers, each contributing several pages of handwritten English text, with a total of 13,353 lines. Each handwritten document is digitized at 300 dpi and saved in PNG format with 256 gray levels. To evaluate the proposed model, 60% and 40% of the data were selected randomly for training and testing, respectively.

4.1.2. Arabic script dataset

The IFN/ENIT (Pechwitz et al., 2002) dataset is a well-known Arabic dataset employed for writer identification (Hannad et al., 2016) and is widely used for research purposes. It consists of 26,000 handwritten Arabic words written by 411 writers. Documents are also available in binary image format. To evaluate the proposed model, 60% and 40% of the data were selected randomly for training and testing, respectively.

4.1.3. Kannada script dataset

The Kannada handwritten text dataset (Alaei et al., 2011; Alaei et al., 2012) was introduced owing to increasing interest in Indic handwriting datasets. It consists of 228 handwritten documents from 57 writers. The model was evaluated by considering varying percentages of training and testing data.

4.1.4. Devanagari script dataset

A dataset of handwritten documents in the Indic (Devanagari) script was created because, to the best of our knowledge, no such dataset existed. It consists of 648 documents written by 81 writers, with 8 documents per writer. The average number of text lines per document is 12, and the average number of words per line is 10. All of the images in the dataset are in binary form. The model was evaluated by considering varying percentages of training and testing data.
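The random train/test splits described above can be sketched as (a trivial illustration; the seed and function name are ours):

```python
import numpy as np

def split_indices(n_docs, train_frac=0.6, seed=0):
    """Shuffle document indices and cut them into train/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_docs)
    cut = int(round(train_frac * n_docs))
    return idx[:cut], idx[cut:]
```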
4.2. Experimental results

The experiments were carried out on handwritten documents in different languages, such as English, Arabic, and Indic (Devanagari), for writer identification. The IFN/ENIT and Kannada datasets contain the same text in the documents given by each writer, whereas the IAM and Devanagari datasets contain random text in the training and testing samples. When the content of the documents written by each writer is the same, the model learns the features responsible for the writing style rather than the content itself. If all writer classes have the same content in the training samples, the performance of the system improves significantly, although the reason is still arguable, as the difference in performance may be due to the dataset.

The overall performance of DCWI based on various performance metrics, such as Accuracy, Precision, Recall, and F1-score, is summarized in Tables 3 and 4. Table 3 presents the results of the SVM and SBC classifiers as well as those of our proposed model (i.e., DCWI) for the IAM, IFN/ENIT, Kannada, and Devanagari (Hindi) datasets. The results in Table 3 indicate that using either the SVM or SBC classifier alone, neither resulting model, (DDC + SVM) or (CA + SBC), is sufficient to improve on the state of the art in terms of accuracy. However, when the results of these two models (DDC + SVM and CA + SBC) are combined, the new model, i.e., DCWI, outperforms the state-of-the-art techniques for writer identification. The confusion matrix is computed for all classes, and the performance metrics (i.e., Precision, Recall, F1-score, and Accuracy) are calculated as expressed in Eqs. (13)–(16). Table 4 shows the performance of the proposed model for the different datasets. The following definitions were used to estimate the four evaluation metrics:

• True Positive (TP): The number of persons detected in class 1 who actually belonged to class 1.
• False Positive (FP): The number of persons misclassified as class 1 who belonged to class 2.
• True Negative (TN): The number of class 2 persons classified as class 2.
• False Negative (FN): The number of class 2 persons misclassified as class 1.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (13)
Table 5. Performance comparison of various state-of-the-art writer identification models for the IAM dataset.

Dataset: IAM (Marti & Bunke, 2002); Language: English; # Writers: 657

Methods                                 Accuracy (%)
Codebook (Ghiasi & Safabakhsh, 2013)    93.70
MSHT (Bertolini et al., 2013)           96.70
SMB (Kumar et al., 2014)                88.40
TDHF (Hannad et al., 2016)              89.50
BDCTF (Khan et al., 2017)               97.20
BWLBC (Chahi et al., 2018)              88.99
HMM-Based (Schlapbach & Bunke, 2004)    82.50
Proposed model                          97.80
Table 6 Performance comparison of various state-of-the-art writer identification models for IFN/ENIT dataset. Dataset
IFN/ENIT (Pechwitz et al., 2002)
Language
Arabic
# Writers
411
Methods
Accuracy (%)
TAF-Based (Bulacu & Schomaker, 2007) MBA (Abdi & Khemakhem, 2015) TDHF (Hannad et al., 2016) BDCTF (Khan et al., 2017) BWLBC (Chahi et al., 2018) Proposed model
80.00 90.00 94.90 76.00 96.43 97.50
Table 7. Performance comparison of various state-of-the-art writer identification models for the Kannada and Devanagari datasets.

Dataset                                           Language  # Writers  Methods                    Accuracy (%)
Kannada (Alaei et al., 2011; Alaei et al., 2012)  Kannada   57         HSR (Alaei & Roy, 2014)    90.20
                                                                       Proposed model             99.80
Devanagari Script                                 Hindi     81         Proposed model             99.90

Precision = TP / (TP + FP)    (14)

Recall = TP / (TP + FN)    (15)

Fβ=1-score = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)    (16)
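Eqs. (13)–(16) translate directly into code:

```python
def metrics(tp, fp, tn, fn, beta=1.0):
    """Accuracy, precision, recall, and F-beta from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return accuracy, precision, recall, f_beta
```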
The overall performance of the proposed model, DCWI, and its comparison with state-of-the-art methods on the different datasets are presented in Tables 5–7. The outcome of the proposed model on the IAM dataset is compared with state-of-the-art techniques, such as writer identification using a codebook (Ghiasi & Safabakhsh, 2013), writer identification using multi-script handwritten text (MSHT) (Bertolini et al., 2013), sparse-model-based (SMB) writer identification (Kumar et al., 2014), writer identification using texture descriptors of handwritten fragments (TDHF) (Hannad et al., 2016), writer identification using bagged discrete cosine transform features (BDCTF) (Khan, Tahir, Khelifi, Bouridane, & Almotaeryi, 2017), block-wise local binary count (BWLBC)-based writer identification (Chahi et al., 2018), and HMM-based writer identification (Schlapbach & Bunke, 2004). The comparison of these models with DCWI is tabulated in Table 5 and depicted in Fig. 9.

Another comparison was carried out using the IFN/ENIT dataset. Results obtained for the IFN/ENIT dataset using DCWI were compared with those obtained using various methods discussed in the literature, such as writer identification using textural and allographic features (TAF) (Bulacu & Schomaker, 2007), the model-based approach (MBA) for writer identification (Abdi & Khemakhem, 2015), writer identification using TDHF (Hannad et al., 2016), writer identification using BDCTF (Khan et al., 2017), and BWLBC-based writer identification (Chahi et al., 2018). The comparison of these techniques with DCWI is tabulated in Table 6 and depicted in Fig. 10.
Several studies have been conducted using datasets such as IAM (English) and IFN/ENIT (Arabic). However, to date, writer identification for Indic languages such as Kannada and Devanagari (Hindi) has not been explored much. The model was evaluated on two Indic datasets, namely Kannada and Devanagari (Hindi), and DCWI achieves more than 99% accuracy for both languages. To evaluate the proposed model on these two datasets, 90% of the data was used for training and the remainder for testing; the amount of training data was then decreased in steps of 10% until only 10% remained. A comparison of results for Kannada and Devanagari (Hindi) is tabulated in Table 7 and depicted in Fig. 11. The results obtained for different percentages of training samples on the Kannada and Devanagari (Hindi) datasets are illustrated in Fig. 12. With 90% training data, DCWI obtains accuracies of 99.8% and 99.9% on the Kannada and Devanagari datasets, respectively. A comparison of the accuracy of DCWI on all four datasets with training data of 50%, 70%, and 90% is represented in Fig. 13. Averaging the accuracy over all evaluated training percentages yields 98.11% for Kannada and 98.62% for Devanagari (Hindi); averaging these two values gives 98.36% for the Indic datasets. The accuracy of DCWI on the IAM and IFN/ENIT datasets is 97.8% and 97.5%, respectively. The average accuracy of DCWI across all four datasets is thus approximately 98%, indicating that DCWI outperforms the state-of-the-art techniques.
Fig. 9. Comparative analysis with IAM dataset.

Fig. 10. Comparative analysis with IFN/ENIT dataset.

4.2.1. Comparison of proposed DCWI model with state-of-the-art techniques

A comparative study of the proposed DCWI model was performed against state-of-the-art writer identification techniques, with results evaluated on the IAM, IFN/ENIT, Kannada, and Devanagari (Hindi) datasets. Our proposed DCWI model achieves accuracies of 97.8% and 97.5% on the IAM and IFN/ENIT datasets, respectively, which outperforms the state-of-the-art technique of Khan et al. (2017) by 0.6% and 21.5%, Chahi et al. (2018) by 8.81% and 1.07%, and Hannad et al. (2016) by 8.3% and 2.6% for the IAM and IFN/ENIT datasets, respectively. The state-of-the-art method BDCTF proposed by Khan et al. (2017) uses a universal codebook to generate multiple predictor models, while DCWI focuses on a combination of the DDC- and CA-based models. The final decision regarding writer identification in BDCTF is made using the majority voting rule, while DCWI utilizes the ranking-based score. However, BDCTF fails to perform well on documents presented in binary form, while there is no such constraint with DCWI. Another best-performing method, BWLBC, proposed by Chahi et al. (2018), extracts connected components from the text sample, then calculates the BWLBC and the corresponding histogram; the histogram is then matched with the histograms of known data samples. Instead of storing information about local binary blocks, DCWI stores features of the whole sample. BWLBC faces an issue of space complexity if a high-resolution image dataset is considered or if the training samples are large, while in DCWI, time complexity may become an issue if the value of r is greater than two in the CA-based model. Another method, TDHF, proposed by Hannad et al. (2016), uses texture-based descriptors of handwritten fragments: each fragment is considered as a texture, from which LBP, LTP, and LPQ are computed, while DCWI focuses on the distribution curve and CA-based features. The major limitation of TDHF is that a smaller fragment size leads to a large number of fragments, which eventually becomes computationally expensive, while DCWI addresses the problem of computational complexity by selecting limited but effective descriptors. Other state-of-the-art approaches that contribute significantly to writer identification are Bertolini et al. (2013) and Ghiasi and Safabakhsh (2013). The proposed model DCWI outperforms Bertolini et al. (2013) and Ghiasi and Safabakhsh (2013) by 1.1% and 4.1%, respectively, on the IAM dataset. The method MSHT
proposed by Bertolini et al. (2013) focuses on a dissimilarity-representation-based classification scheme for writer identification and verification. The MSHT method is limited to local features only because it relies on LBP and LPQ, while DCWI extends the scope of the extracted features to the global level with respect to a word in the document. The Codebook method proposed by Ghiasi and Safabakhsh (2013) is based on a grapheme codebook. An infinitely large number of graphemes is possible for a handwritten text sample, but only a limited number of graphemes can be considered, so the choice of graphemes is crucial to the performance of this method. DCWI is similar in that it stores information from the sample in the SBC, but it has no such problem: the stored information is finite and provides good results for different samples. In the case of the IFN/ENIT dataset, another method, MBA, proposed by Abdi and Khemakhem (2015), has an accuracy of 90%; DCWI outperforms MBA by 7.5%. The MBA method uses a grapheme approach, but rather than using the natural graphemes present in the text sample, it synthesizes its own graphemes based on the beta-elliptic model. DCWI instead uses effective data-dependent features to obtain good results with considerably better time complexity. The proposed DCWI model achieves an accuracy of 99.8% on the Kannada dataset, outperforming the state-of-the-art HSR technique (Alaei & Roy, 2014), at 90.2%, by 9.6%. The HSR method proposed by Alaei and Roy (2014) extracts 92 features from each text line based on connected components, enclosed regions, lower and upper contours, fractal codes, and Curvelets, and a histogram is created for each extracted feature of every writer. DCWI uses DDC- and CA-based feature extraction, which is a novel approach for writer identification. The proposed DCWI model achieves an accuracy of 99.9% on the Devanagari script dataset; to the best of our knowledge, no existing method has been evaluated on a Devanagari script (Hindi) dataset.

Fig. 11. Comparative analysis with Kannada and Devanagari dataset.
4.2.2. Computational complexity

In our experiments, the computational complexity is derived in terms of the size of the images. Let the size of an image be m × n. To generate the DDC, every pixel of the image is traversed until the entire word image is scanned a fixed number of times; therefore, the complexity of generating the DDC curve (taking m = n for ease of calculation) is O(n²). If we further consider that there are approximately k × n such words in the training data, where k << n, the complexity becomes O(n³). The complexity of training the SVM model is O(n³). Hence, the overall complexity of the (DDC + SVM) model is O(n³) + O(n³) = O(n³). To derive the complexity of the CA model, we need to traverse the pixels of the image a fixed number of times; if there are n such words, the overall complexity of the CA model is likewise O(n³). During the testing phase, the DDC model generates the DDC curve of a limited number of words, which can be accomplished in O(n²) time. For the CA-based model, features are first generated from the given data and then matched against the features obtained from the training data. This feature generation of
Fig. 12. Training data vs accuracy (%) (a) Kannada dataset, (b) Devanagari dataset.
Fig. 13. Training data vs accuracy (%) for various datasets.
test data requires O(n²) time. Further, matching the features against the training dataset using the SBC requires O(n²) time. Hence, the total time required to identify the writer using the already trained model is O(n²). The CA-based model is computationally expensive when larger values of r are used in the Moore neighborhood; in DCWI, the value of r is set to two, so the computational complexity of our CA-based model is not very high.

5. Conclusions and future work

A writer identification model based on DDC and CA was presented in this paper. The DDC and CA are used to extract features from handwritten text images, and the writer is then identified using the SVM and SBC classifiers. The proposed model, DCWI, merges the results from both classifiers, and writer identification is finally achieved using the ranking-based score scheme. A comparative evaluation and analysis of DCWI using different datasets for different languages was carried out. The results obtained by DCWI show a significant improvement in performance compared with existing state-of-the-art techniques. However, as an important observation, the CA model is found to be computationally expensive when larger values of r are used in the Moore neighborhood. Therefore, as future research, improvements of the DDC and CA models can be envisaged, together with the consideration of other suitable classifiers for writer identification.

Credit authorship contribution statement

Parveen Kumar: Conceptualization, Data curation, Formal analysis, Visualization, Writing - original draft, Writing - review & editing. Ambalika Sharma: Visualization, Writing - review & editing.

References

Abdi, M. N., & Khemakhem, M. (2015). A model-based approach to offline text-independent Arabic writer identification and verification. Pattern Recognition, 48(5), 1890–1903.
Al-Maadeed, S., Hassaine, A., Bouridane, A., & Tahir, M. A. (2016). Novel geometric features for off-line writer identification. Pattern Analysis and Applications, 19(3), 699–708.
Alaei, A., Nagabhushan, P., & Pal, U. (2011). A benchmark Kannada handwritten document dataset and its segmentation. In Document analysis and recognition (ICDAR), 2011 international conference on (pp. 141–145). IEEE.
Alaei, A., Pal, U., & Nagabhushan, P. (2012). Dataset and ground truth for handwritten text in four different scripts. International Journal of Pattern Recognition and Artificial Intelligence, 26(04), 1253001.
Alaei, A., & Roy, P. P. (2014). A new method for writer identification based on histogram symbolic representation. In Frontiers in handwriting recognition (ICFHR), 2014 14th international conference on (pp. 216–221). IEEE.
Bensefia, A., Paquet, T., & Heutte, L. (2005). A writer identification and verification system. Pattern Recognition Letters, 26(13), 2080–2092.
Bernal, A. E., Hospevian, K., Karadeniz, T., & Lassez, J.-L. (2003). Similarity based classification. In International symposium on intelligent data analysis (pp. 187–197). Springer.
Bertolini, D., Oliveira, L. S., Justino, E., & Sabourin, R. (2013). Texture-based descriptors for writer identification and verification. Expert Systems with Applications, 40(6), 2069–2080.
Bulacu, M., & Schomaker, L. (2007). Text-independent writer identification and verification using textural and allographic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4), 701–717.
Chahi, A., El Khadiri, I., El Merabet, Y., Ruichek, Y., & Touahni, R. (2018). Block wise local binary count for off-line text-independent writer identification. Expert Systems with Applications, 93(1), 14.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Djeddi, C., Siddiqi, I., Souici-Meslati, L., & Ennaji, A. (2013). Text-independent writer recognition using multi-script handwritten texts. Pattern Recognition Letters, 34(10), 1196–1202.
Ghiasi, G., & Safabakhsh, R. (2013). Offline text-independent writer identification using codebook and efficient code extraction methods. Image and Vision Computing, 31(5), 379–391.
Gray, L. (2003). A mathematician looks at Wolfram's new kind of science. Notices of the American Mathematical Society, 50(2), 200–211.
Hadeler, K.-P., & Müller, J. (2017). Cellular automata: Basic definitions. In Cellular automata: Analysis and applications (pp. 19–35). Springer.
Hannad, Y., Siddiqi, I., & El Kettani, M. E. Y. (2016). Writer identification using texture descriptors of handwritten fragments. Expert Systems with Applications, 47, 14–22.
He, Z., You, X., & Tang, Y. Y. (2008a). Writer identification of Chinese handwriting documents using hidden Markov tree model. Pattern Recognition, 41(4), 1295–1307.
He, Z., You, X., & Tang, Y. Y. (2008b). Writer identification using global wavelet-based features. Neurocomputing, 71(10–12), 1832–1841.
Helli, B., & Moghaddam, M. E. (2010). A text-independent Persian writer identification based on feature relation graph (FRG). Pattern Recognition, 43(6), 2199–2209.
Khan, F. A., Tahir, M. A., Khelifi, F., Bouridane, A., & Almotaeryi, R. (2017). Robust off-line text independent writer identification using bagged discrete cosine transform features. Expert Systems with Applications, 71, 404–415.
Kumar, R., Chanda, B., & Sharma, J. (2014). A novel sparse model based forensic writer identification. Pattern Recognition Letters, 35, 105–112.
Marti, U.-V., & Bunke, H. (2002). The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1), 39–46.
Ogier, J.-M. (2008). Ancient document analysis: A set of new research problems. In Colloque international francophone sur l'écrit et le document (pp. 73–78). Groupe de Recherche en Communication Écrite.
Olszewska, J. I. (2015). Active contour based optical character recognition for automated scene understanding. Neurocomputing, 161, 65–71.
Pechwitz, M., Maddouri, S. S., Märgner, V., Ellouze, N., Amiri, H., et al. (2002). IFN/ENIT-database of handwritten Arabic words. In Proc. of CIFED: 2 (pp. 127–136). Citeseer.
Pervouchine, V., & Leedham, G. (2007). Extraction and analysis of forensic document examiner features used for writer identification. Pattern Recognition, 40(3), 1004–1013.
Popovici, A., & Popovici, D. (2002). Cellular automata in image processing. In Fifteenth international symposium on mathematical theory of networks and systems: 1 (pp. 1–6). Citeseer.
Schlapbach, A., & Bunke, H. (2004). Using HMM based recognizers for writer identification and verification. In Ninth international workshop on frontiers in handwriting recognition (pp. 167–172). IEEE.
Schlapbach, A., Liwicki, M., & Bunke, H. (2008). A writer identification system for on-line whiteboard data. Pattern Recognition, 41(7), 2381–2397.
Schomaker, L. (2007). Advances in writer identification and verification. In Document analysis and recognition, 2007. ICDAR 2007. Ninth international conference on: 2 (pp. 1268–1273). IEEE.
Siddiqi, I., & Vincent, N. (2010). Text independent writer recognition using redundant writing patterns with contour-based orientation and curvature features. Pattern Recognition, 43(11), 3853–3865.
Tan, G. X., Viard-Gaudin, C., & Kot, A. C. (2009). Automatic writer identification framework for online handwritten documents using character prototypes. Pattern Recognition, 42(12), 3313–3323.
Walsh, D., & Raftery, A. E. (2002). Accurate and efficient curve detection in images: The importance sampling Hough transform. Pattern Recognition, 35(7), 1421–1431.
Weisstein, E. W. (2005). Moore neighborhood. From MathWorld, a Wolfram web resource. http://mathworld.wolfram.com/MooreNeighborhood.html.
Weisstein, E. W. (2013). Von Neumann neighborhood. From MathWorld, a Wolfram web resource.
Wen, J., Fang, B., Chen, J., Tang, Y., & Chen, H. (2012). Fragmented edge structure coding for Chinese writer identification. Neurocomputing, 86, 45–51.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines. Technical report. Citeseer.
Zois, E. N., & Anastassopoulos, V. (2000). Morphological waveform coding for writer identification. Pattern Recognition, 33(3), 385–398.