Shreds Assembly Based on Character Stroke Feature



Available online at www.sciencedirect.com


ScienceDirect

www.elsevier.com/locate/procedia

Procedia Computer Science 116 (2017) 151–157

2nd International Conference on Computer Science and Computational Intelligence 2017, ICCSCI 2017, 13-14 October 2017, Bali, Indonesia

Nan Xing*, Siqi Shi, Yuhua Xing

Xi'an University of Technology, School of Automation and Information Engineering, NO.5 South Jinhua Road, Xi'an, Shaanxi, 710048, China

Abstract

Shredded document recovery is an important research sub-field of information security. A paper document is broken into a large number of similar shreds by the shredder. Usually these shreds carry important information. If they are restored, this can have an important impact on judicial investigation, military command, and archaeological research. Research on shredded document recovery therefore has great value. In consideration of the large differences between languages and the great demand for Chinese document recovery, our study focuses on shredded Chinese documents. In this paper, a method of shredded document recovery based on the character stroke feature is proposed. We use horizontal strokes to match Chinese characters, and finally the shredded document is restored. In tests on actual samples, the method proposed in this paper achieved good recovery results. The average accuracy is 13.96 percentage points higher than that of the traditional algorithm.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the 2nd International Conference on Computer Science and Computational Intelligence 2017.

Keywords: Document recovery; Character stroke; Shred stitching; Information security

1. Introduction

Nowadays, various types of shredders have become indispensable tools in offices. For the sake of confidentiality, the majority of government agencies, schools, and the armed forces generally use shredders to destroy important

* Corresponding author. Tel.: 86+29+82312301. E-mail address: [email protected]


1877-0509 © 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the 2nd International Conference on Computer Science and Computational Intelligence 2017. 10.1016/j.procs.2017.10.060


documents and materials. In fact, shredded documents need to be restored in many cases. However, manual recovery is extremely difficult when there is a large number of similar shreds. The computer, with its powerful data processing capability, offers an effective way to restore shredded documents.

The earliest study on shredded document recovery was reported in 2004. Ukovich et al. [1] classified shreds using seven kinds of MPEG-7 features and applied a puzzle algorithm to match shreds; three kinds of color features proved to be effective. Subsequently, Skeoch [2] adopted pixel values and color histograms as features in a multi-scale space and recovered the shredded document with a genetic algorithm. Prandtstetter et al. [3] spliced shreds based on the pixels along the shred edges. Lin et al. [4] used the average word length as the feature to restore the shredded document. Deever et al. [5] matched shreds interactively, according to the contour information of the shreds.

Unlike torn pieces (documents torn by hand), which can be told apart by differences in shape [6], shreds (documents shredded by machine) are extremely similar in appearance, which makes shred stitching difficult. Although previous studies have made progress, the existing methods do not work well for real shreds, which are affected by information loss, noise, etc. At the same time, most existing methods focus on shredded English documents and ignore documents in other languages. Because languages differ greatly in appearance, structure, etc. [7], such methods are difficult to apply to documents in other languages. As a world language, Chinese has a long history and great influence. Moreover, the demand for shredded Chinese document recovery keeps growing with the development of the Chinese economy. In this paper, we focus on shredded Chinese document recovery.

2. Shreds assembly based on character stroke feature

The shreds assembly algorithm proposed in this paper consists of three parts: shred scanning, shred preprocessing and shred stitching. The system flowchart is shown in Fig. 1.

Fig. 1. System flowchart of the recovery algorithm.

2.1. Shred scanning

Since paper shreds cannot be processed directly by the computer, we convert them into digital images with a scanner and restore them on the computer. In this paper, a collected template is used to fix the shreds while they are scanned. With the aid of the collected template, soft shreds can be flattened, and the distortion caused by folding, tilting, curling, etc. can be suppressed. The collected template also improves scanning efficiency (one template holds 10 to 20 shreds at a time) and can be used repeatedly. As shown in Fig. 2, the collected template is rectangular and contains several grooves. The width of each groove is slightly larger than the width of a shred.

Fig. 2. The collected template.




Fig. 3. The scanned image.


Fig. 4. Examples of the shreds extracted from scanned image.

After the shreds are fixed in a collected template, they are scanned by a scanner. The scanned image is shown in Fig. 3. The image is stored on the computer for subsequent processing. This paper addresses strip-cut shredded documents, not cross-cut shredded documents; the size of the collected template is therefore designed for strip shreds.

2.2. Shred preprocessing

To extract the shreds from the scanned image, image processing methods are applied. Firstly, the shreds are separated from the background template by an image binarization algorithm [8]. Secondly, the 8-connected Freeman chain code method [9] is used to obtain the edges of the shreds. Thirdly, a denoising algorithm [10] is applied to reduce noise in the shreds. Finally, the shreds are segmented from the image. In addition, to facilitate subsequent processing, every shred image is set to the same format: a rectangle of equal width. Examples of shreds extracted from the scanned image are shown in Fig. 4, in which the shreds are white, the characters are black, and the background is gray. Note that noise and information defects remain in the shreds after preprocessing.

2.3. Shred stitching

For strip shreds, although the contours are similar, the characters contained in the shreds usually differ. The research idea of this paper is to exploit the fixed internal structure of characters: the shreds are spliced by reassembling the characters. It is well known that Chinese characters have a square-shaped structure, namely all strokes of a character are distributed within a square area. The stroke is the basic unit of a Chinese character; as shown in Fig. 5, its common types include the horizontal stroke, the vertical stroke, the left-to-right diagonal stroke, and the right-to-left diagonal stroke. Tseng et al. [11] found that the various strokes occur in Chinese characters with different frequencies. The frequency of the horizontal stroke (31%) was much higher than that of the other strokes. According to the study of Guo et al. [12], 20902 Chinese characters were in use. Each character contained 12.8 strokes on average, and the characters with 12 strokes were the most common


Fig. 5. Common types of strokes: (a) the horizontal stroke; (b) the vertical stroke; (c) the left-to-right diagonal stroke; (d) the right-to-left diagonal stroke.

Fig. 6. A shred is divided into a number of text blocks.

Fig. 7. The extension of horizontal strokes in the shred edge.

words. These studies show that the horizontal stroke plays an important role in Chinese characters, and each Chinese character contains several horizontal strokes on average. Although the overall structure of the characters is destroyed by the shredder, very few strokes are lost. The strokes are not distorted; they are merely interrupted at the shred edges. If those strokes are reconnected, the Chinese characters can be restored, and once all characters are restored, the shredded document is recovered.

Strip shreds come from shredded documents, so each shred contains many horizontal strokes that belong to split Chinese characters. Because horizontal strokes have a good linear property in the horizontal direction, they can link the two parts of a character structure that lie on different shreds. This paper uses the features of horizontal strokes to achieve shredded document recovery.

To match character structures precisely, the shreds are first split along the vertical direction. Namely, the text height and line spacing are obtained from the horizontal projection of the shreds, and each shred is divided into a number of text blocks of the same size. As shown in Fig. 6, the top and bottom of a text block are blank, and the middle contains text. Within each text block, the text strokes are enhanced by the morphological closing operation. Subsequently, strokes along the shred edges of the text block are searched from top to bottom, and the horizontal strokes of the text block are detected using the 5 × 5 feature matrix M given in Eq. (1).
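The division into text lines by horizontal projection, described above, can be illustrated with a minimal Python sketch. It assumes a binarized shred stored as a list of rows with 0 for black ink and 1 for white paper. The function name `split_text_lines` and the rule that any row containing ink belongs to a text line are illustrative assumptions, not details from the paper; real shreds would probably need a noise threshold.

```python
def split_text_lines(shred):
    """Return (top, bottom) row ranges of text lines in a binarized shred.

    shred: list of rows of 0/1 pixels, 0 = black ink, 1 = white paper
    (hypothetical layout). A row counts as text when its horizontal
    projection, i.e. the number of black pixels in the row, is non-zero.
    """
    profile = [row.count(0) for row in shred]  # horizontal projection
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                    # a text line begins
        elif not ink and start is not None:
            lines.append((start, y))     # the line ends (bottom is exclusive)
            start = None
    if start is not None:                # text runs to the last row
        lines.append((start, len(shred)))
    return lines


# Toy shred, 8 rows by 3 columns, with two text lines.
shred = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
]
print(split_text_lines(shred))  # [(1, 3), (5, 7)]
```

From such row ranges the text height and line spacing could be estimated, and equal-size text blocks with blank margins could then be cut as in Fig. 6.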

1 0  M  0  0 1

1 1 1 1 0 0 0 0  0 0 0 0  0 0 0 0 1 1 1 1 

(1)
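As a rough illustration of how the feature matrix M of Eq. (1) might be slid along a shred edge, the following Python sketch scans the five columns nearest one edge of a binarized text block from top to bottom and reports the rows where the 5 × 5 window equals M. It takes 0 as a black pixel and 1 as a white pixel, following the paper's convention. The function name `find_edge_strokes`, the list-of-lists image layout, and the exact-match criterion are assumptions made for illustration.

```python
# Feature matrix M from Eq. (1): 0 = black pixel, 1 = white pixel.
M = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]


def find_edge_strokes(block, side="right"):
    """Return the top-row indices where M matches the edge of a text block.

    block: list of rows, each a list of 0/1 pixels (hypothetical layout).
    side:  "right" checks the last 5 columns, "left" the first 5.
    """
    size = len(M)
    hits = []
    for top in range(len(block) - size + 1):  # search from top to bottom
        rows = block[top:top + size]
        window = [r[-size:] if side == "right" else r[:size] for r in rows]
        if window == M:  # exact match is an assumption of this sketch
            hits.append(top)
    return hits


# Toy 9x6 block with one 3-pixel-thick stroke touching the right edge only.
block = [[1] * 6 for _ in range(9)]
for y in (3, 4, 5):
    block[y] = [1, 0, 0, 0, 0, 0]
print(find_edge_strokes(block, "right"))  # [2]
print(find_edge_strokes(block, "left"))   # []
```

Because the window covers only a narrow band at the edge, ink elsewhere in the block does not affect the result.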

In matrix M, the element "0" represents a black pixel, and the element "1" represents a white pixel. During the search, if an area of pixels at the shred edge matches matrix M, the text block has a horizontal stroke there. This horizontal stroke is then extended to the right or left edge of the text block, as shown in Fig. 7. Otherwise, the shred edge of the text block remains unchanged. After the entire text block has been searched, the positions of all extended strokes are recorded.

After all horizontal strokes at the edges of the text blocks have been confirmed, any shred in the set of shreds is selected as the initial shred, and shred stitching begins. Firstly, the set of the numbers of matching points between the right edge of the initial shred and the left edges of the other shreds is defined as




$$
P(k) = \{ P_{k,1}, P_{k,2}, \ldots, P_{k,i}, \ldots, P_{k,n-1} \} \tag{2}
$$

where k indicates that the initial shred is the k-th shred in the set of shreds, $P_{k,i}$ is the number of matching horizontal strokes between the k-th shred and the i-th shred, and n is the total number of shreds in a document. $P_{k,i}$ is the sum of the numbers of matching horizontal strokes over the text blocks:

$$
P_{k,i} = \sum_{j=1}^{m} W_j \tag{3}
$$

where m is the number of text lines in a shred and $W_j$ is the number of matching horizontal strokes between the text blocks of the two shreds in the j-th line. Note that, when two text blocks are compared, two strokes match only if they lie at the same position along the vertical direction. The matched shred is the one whose entry in $P(k)$ is the maximum; it is defined as

$$
i^{*} = \arg\max_{i} P_{k,i} \tag{4}
$$
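The selection rule of Eqs. (2)-(4), applied repeatedly to the right edge of the growing strip, can be sketched in Python as follows. This is only a sketch under stated assumptions: each shred edge is represented as a list with one set per text line, holding the vertical positions of the extended horizontal strokes. That representation, the names `match_count` and `stitch`, and the tie-breaking rule are hypothetical and not taken from the paper.

```python
def match_count(right_edge, left_edge):
    """P_{k,i} of Eq. (3): the sum over text lines j of W_j.

    Each edge is a list with one set per text line; a set holds the vertical
    positions of the extended horizontal strokes (hypothetical representation).
    W_j counts strokes found at the same vertical position on both edges.
    """
    return sum(len(r & l) for r, l in zip(right_edge, left_edge))


def stitch(shreds, start=0):
    """Greedy left-to-right ordering following Eqs. (2)-(4).

    shreds: list of (left_edge, right_edge) pairs.
    Returns the shred indices in stitched order.
    """
    order = [start]
    remaining = set(range(len(shreds))) - {start}
    while remaining:
        right = shreds[order[-1]][1]  # right edge of the merged strip
        # Eq. (4): take the candidate with the largest P_{k,i}; ties go to
        # the lowest index (an assumption of this sketch).
        best = max(sorted(remaining),
                   key=lambda i: match_count(right, shreds[i][0]))
        order.append(best)
        remaining.remove(best)
    return order


# Three toy shreds with two text lines each: (left_edge, right_edge).
shreds = [
    ([set(), set()], [{3, 7}, {5}]),   # shred 0
    ([{2}, {9}], [set(), set()]),      # shred 1
    ([{3, 7}, {5}], [{2}, {9}]),       # shred 2
]
print(stitch(shreds, start=0))  # [0, 2, 1]
```

Each step compares the current right edge with every remaining shred, so the number of edge comparisons grows quadratically with the number of shreds.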

In Eq. (4), the left-hand side is the index number of the matched shred. When two shreds are stitched together, they are regarded as a single shred. The above process is repeated until all shreds are matched, at which point the shred stitching algorithm ends.

3. Experiments and results

The proposed method was tested on actual shreds. In the experiment, the paper size was A4, all text in the documents was edited in Microsoft Word, the font size was small four (12 pt), and the line spacing was 1.5. Ten document pages were randomly selected as samples. All samples were shredded into strip shreds; there were 465 shreds in total, and each shred was about 3 mm wide. The experiment was run on a computer with an Intel Core i5-4200M 2.5 GHz CPU, 3 GB of memory and a 500 GB hard disk.

For the shreds to be tested, the paper shreds are first transformed into digital images through the processing of Section 2.1; with the help of the collected template, images with little distortion are obtained. Secondly, each shred is extracted separately from the scanned image, and shreds in a unified format are produced through the processing of Section 2.2. Finally, the method proposed in Section 2.3 is used to stitch the shreds, and the shredded documents are recovered.

The 10 document pages were tested separately. Since all document samples were Chinese documents, two methods suited to shredded Chinese documents were compared: the method proposed in this paper and the information quantity algorithm proposed in Ref. [13]. Table 1 shows the recovery results, that is, the reconstruction accuracy, of the two algorithms. "IQ Algorithm" denotes the information quantity algorithm, and "Our Proposal" denotes the algorithm proposed in this paper. As shown in Table 1, the proposed method has a clear advantage over the IQ algorithm.
Its average accuracy is 13.96 percentage points higher than that of the IQ algorithm. The IQ algorithm uses the height and width of text as features, which are not suitable for some Chinese characters. In terms of character structure, only a few Chinese characters are symmetrical between left and right. In fact, when most Chinese characters are cut, the two parts of the character structure differ in height or width. In addition, noise and information defects are inevitable in actual shreds, and they have a great impact on the gray distribution feature of the IQ algorithm. In contrast, the proposed method uses the horizontal stroke as its feature, which is widespread in Chinese characters. Since the horizontal stroke has a good linear characteristic in the


horizontal direction, it is well suited to handling the characters in strip shreds. Meanwhile, because the horizontal strokes at the edge of a shred occupy only a small area, the effect of noise is greatly reduced. Therefore, the proposed method achieves higher accuracy on actual shreds. Fig. 8 shows one page of a document reconstructed by the proposed method. The page includes 47 shreds, excluding blank shreds.

In addition, in terms of time complexity, the horizontal stroke search step is O(n) and the shred stitching step is O(n^2). The complexity of the whole algorithm is not high, and document recovery can be achieved quickly.

Note that the leftmost shred of a document is usually stitched wrongly, probably because the initial shred is chosen at random in the proposed method. In addition, due to the properties of paper, the shred edges are very complex. As a result, some horizontal strokes cannot be detected effectively, which degrades the performance of shredded document recovery.

Table 1. The recovery results of two algorithms

Page       Shred Number   IQ Algorithm   Our Proposal
Page 1     46             42.22%         57.78%
Page 2     47             41.30%         47.83%
Page 3     46             47.83%         64.44%
Page 4     46             55.56%         68.89%
Page 5     46             46.67%         62.22%
Page 6     47             47.83%         60.87%
Page 7     47             58.70%         76.09%
Page 8     46             37.78%         51.11%
Page 9     47             63.04%         73.91%
Page 10    47             52.17%         69.57%
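The 13.96 figure quoted in the text can be checked against the per-page accuracies of Table 1. The short Python sketch below averages the two columns and takes the difference; the variable names are illustrative, and an unweighted mean over the ten pages is assumed.

```python
# Per-page accuracies (%) copied from Table 1, pages 1 to 10.
iq = [42.22, 41.30, 47.83, 55.56, 46.67, 47.83, 58.70, 37.78, 63.04, 52.17]
ours = [57.78, 47.83, 64.44, 68.89, 62.22, 60.87, 76.09, 51.11, 73.91, 69.57]

mean_iq = sum(iq) / len(iq)        # unweighted mean over the pages
mean_ours = sum(ours) / len(ours)
gap = mean_ours - mean_iq          # difference in percentage points

print(f"IQ: {mean_iq:.2f}%  Ours: {mean_ours:.2f}%  Gap: {gap:.2f} points")
# IQ: 49.31%  Ours: 63.27%  Gap: 13.96 points
```

The gap is therefore a difference of about 13.96 percentage points between the two mean accuracies, not a relative improvement of 13.96%.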

Fig. 8. Restored document.




4. Conclusion

This paper proposes a shreds assembly method based on the character stroke feature. The method uses the linear feature of the horizontal stroke to reassemble the character structures that belong to the same character, and the shredded document is restored by matching the characters. The experimental results show that the proposed method performs well on real shreds. It can help people obtain useful information from shredded documents. Our research in this paper targets the recovery of shredded documents of pure text. More complicated cases, including images and tables, will be studied in future work.

References

1. Ukovich A, Ramponi G, Doulaverakis H, Kompatsiaris Y. Shredded document reconstruction using MPEG-7 standard descriptors. In Proceedings of the 4th IEEE International Symposium on Signal Processing and Information Technology; 2004 Dec 18-21; Rome, Italy. p. 334-337.
2. Skeoch A. An investigation into automated shredded document reconstruction using heuristic search algorithms [dissertation]. Bath, UK: University of Bath; 2006.
3. Prandtstetter M, Raidl GR. Combining forces to reconstruct strip shredded text documents. In Proceedings of the 5th International Workshop on Hybrid Metaheuristics; 2008 Oct 8-9; Málaga, Spain. p. 175-189.
4. Lin HY, Fan-Chiang WC. Reconstruction of shredded document based on image feature matching. Expert Systems with Applications. 2012 Feb 15: p. 3324-3332.
5. Deever A, Gallagher A. Semi-automatic assembly of real cross-cut shredded documents. In Proceedings of the 19th IEEE International Conference on Image Processing; 2012 Sep 30-Oct 3; Orlando, Florida, USA. p. 233-236.
6. Richter F, Ries CX, Lienhart R. Evaluation of discriminative models for the reconstruction of hand-torn documents. In Proceedings of the 12th Asian Conference on Computer Vision; 2014 Nov 1-5; Singapore. p. 671-686.
7. Chen LL, Zhang FX. A study of cultural difference through comparisons of words. Journal of Henan University of Science and Technology (Social Science). 2007 July: p. 61-64.
8. Long JY, Jin LW. An image binarization method based on global mean and local standard deviation. Computer Engineering. 2004 Jan: p. 70-72.
9. Freeman H. On the encoding of arbitrary geometric configurations. IRE Transactions on Electronic Computers. 1961 June: p. 260-268.
10. Zhao GC, Zhang L, Wu FB. Application of improved median filtering algorithm to image de-noising. Journal of Applied Optics. 2011 July 15: p. 678-682.
11. Tseng HC, Chang LH, Chen CK. The relative frequencies of the various stroke types of the Chinese ideograms. Acta Psychologica Sinica. 1965 June 30: p. 212-214.
12. Guo SL, Piao ZJ. GB13000.1 character set: Chinese character sequence (stroke sequence) standard stroke number statistics report. Modern Chinese. 2006 Nov: p. 39-40.
13. Zhao B, Zhou Y, Zhang Z, Na Y, Ma T. Information quantity based automatic reconstruction of shredded Chinese documents. In Proceedings of the 26th IEEE International Conference on Tools with Artificial Intelligence; 2014 Nov 10-12; Limassol, Cyprus. p. 1016-1020.