Data service generation framework from heterogeneous printed forms using semantic link discovery


Future Generation Computer Systems

Contents lists available at ScienceDirect

Journal homepage: www.elsevier.com/locate/fgcs

Han Yu, Hongming Cai *, Jun Zhou, Lihong Jiang

School of Software, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai, China

Highlights

• A complete and feasible framework from printed forms to data services is proposed.
• An automatic approach to form data extraction and structuring is presented.
• A usable prototype system integrating heterogeneous printed resumes is implemented.

Article info

Article history:
Received 28 March 2017
Received in revised form 7 September 2017
Accepted 22 September 2017
Available online xxxx

Keywords:
Form recognition
Heterogeneous data integration
Semantic data model
Table matching
Data service generation

Abstract

Printed forms carry rich information in business processes and daily life. However, large numbers of heterogeneous printed forms that contain the same categories of information are difficult to manage and share, so massive amounts of data in printed forms go unused. Automatically integrating and sharing these data would markedly improve the efficiency of enterprises; the key problem is how to extract heterogeneous data from printed forms and integrate them for quick use. To solve this problem, we propose a framework that discovers semantic links in printed forms and generates data services for easy data management and rapid data sharing in enterprise systems. First, a multiple-OCR-based form recognition approach is proposed to make forms computer-readable. Next, forms are modeled as semi-structured data using structure-based semantic link discovery, refined with massive data. Then, a linked data model is built by table matching to align data. Finally, data services are generated based on the linked data model. A series of experiments on printed resumes is conducted, and the results show that our framework performs well in recognition rate, link discovery accuracy, data compression ratio and data resource accuracy. A prototype system is presented to illustrate the feasibility of the proposed framework.

© 2017 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (H. Yu), [email protected] (H. Cai), [email protected] (J. Zhou), [email protected] (L. Jiang).

1. Introduction

Forms are among the most commonly used data carriers in business and daily life: HR staff select qualified candidates by looking through hundreds of different resumes a day, auditors go through hundreds of heterogeneous balance sheets to find mistakes, and travel agents provide different kinds of itineraries for different travelers. Millions of printed forms still exist, even though digitalization has been under way for decades. The rich information in printed forms is still useful and irreplaceable. In the era of the Internet, information is power. Better data integration and sharing mean more efficient work and, in turn, greater benefits. Therefore, rapid and convenient information retrieval, communication and management have become the processes most valued by enterprises. Digitalizing printed forms and extracting their rich information for data integration and sharing will thus greatly benefit enterprises by improving the efficiency of data communication and management.

However, with no unified standard or template, people use heterogeneous forms to describe instances of the same category, such as resumes (F1, F2 in Fig. 1), balance sheets (F3 in Fig. 1) and itineraries (F4 in Fig. 1). These forms have several notable features. (1) They are heterogeneous in both structure and semantics. For instance, F1 and F2 in Fig. 1 are both resumes but have different organizing structures and different property fields (the gray-background cells in the forms). Furthermore, some of these fields are expressed differently but semantically represent the same concept, like Self-evaluation in F1 and Self-assessment in F2. (2) Forms are loosely connected. Compared with long text, forms have no sentence components, syntax or grammar, only form lines and discrete expressions with almost no connecting words like "is", "for", etc.,

https://doi.org/10.1016/j.future.2017.09.059 0167-739X/© 2017 Elsevier B.V. All rights reserved.

Please cite this article in press as: H. Yu, et al., Data service generation framework from heterogeneous printed forms using semantic link discovery, Future Generation Computer Systems (2017), https://doi.org/10.1016/j.future.2017.09.059.


Fig. 1. Common printed form examples.

which leads to the third feature: (3) forms are not computer-understandable. Because of these features, there are no automatic approaches for dealing with data in printed forms. Meanwhile, manually reading and processing these forms, and preparing their data for further use, costs a large amount of time and manpower.

The core problem we face is how to extract and integrate heterogeneous data from printed forms. It can be broken down into four key problems:

(1) We need to recognize forms and convert them into computer files, which is an OCR issue. Sophisticated OCR engines already exist, such as Tesseract [1], which allows training on data to obtain better performance in particular cases, and ABBYY (footnote 1), which widely supports rich-text documents. However, these OCR engines perform unsatisfactorily when recognizing documents in form format.

(2) We want computers to automatically understand the expressions in forms and the semantic relations among them, which is a semantic link discovery problem. Much semantics research has been done on linked data, for example key discovery [2], association discovery [3] and long-text analysis based on open data [4] and context [5]. However, a solution to semantic discovery on forms has not yet been proposed.

(3) We need to merge heterogeneous forms with various fields into a single linked data model, which is similar to the table-matching problem. Some approaches match short tables to open data [6], some match data from different open data sources [7], and some match terms incrementally [8]. Our matching targets are large and rich forms, which differs from the above approaches, but their matching methods can be drawn upon.

(4) We need to provide data services based on the linked data model, which is a data service generation problem. REST [9], proposed by R.T. Fielding, is a widely used and lightweight web service architecture. Based on it, some sophisticated approaches from ontology [10] and general data model [11] to data service have been proposed and applied in practice.

Footnote 1: http://ocrsdk.com/.

In this paper, we propose a complete framework for generating data services from heterogeneous printed forms. First, we combine two OCR techniques for form recognition. Then, we propose a new semantic link discovery approach to model the forms automatically. Next, we use a table matching approach to build a linked data model that stores all form instances. Finally, we provide data services based on the linked data model, including basic data services such as retrieval and search, plus special services such as data homogenization and evaluation. Furthermore, we conduct a series of experiments on our framework and build a prototype system based on it to demonstrate its feasibility and usability. The contributions of this paper can be summarized as follows:

• A feasible framework is designed, for the first time, to generate data services from heterogeneous printed forms. It covers the complete process from printed forms directly to data services, which has not been proposed before.
• An adoptable data extraction method is presented. First, an improved form recognition approach is proposed. Then a two-phase semantic link discovery method is applied to generate the relation model. Both methods perform acceptably in experiments and are usable in practice.
• Data integration and service generation are applied to printed forms for the first time, and a prototype system is implemented to illustrate the usage of the framework.

The paper is organized as follows. Section 2 introduces the framework of our approach, and Section 3 details the methods used in the framework. Section 4 describes the experiments and their results. Section 5 presents a prototype system based on our framework. Finally, Section 6 reviews related work and Section 7 concludes the paper.


Fig. 2. The framework.

2. The framework

In this section, we introduce the architecture of our framework. The framework has four phases, as illustrated in Fig. 2.

Phase I: Form Recognition. In this phase, multiple OCR engines are used to convert heterogeneous printed forms into computer-readable files that describe forms as line and character elements. Each form is then processed, based on its structure, into a set of discrete cell models F.C that contain position information and text content. However, certain semantic relations hold among the cells of a form; for example, Amy is a value of Name in a resume. The cell model records only the text content of each cell and loses the relations among cells; that is, the form is in a discrete format.

Phase II: Semantic Link Discovery. In this phase, we find semantic links between the cell models in F.C and construct a tuple model F.T for the form, which makes forms computer-understandable. First, a structure-based approach determines which cells are likely to be properties, which cells are the values of another cell, and what semantic layer relations hold among properties. Analyzing the structure yields a primary tuple model. Then, to eliminate structural ambiguity, we use a linked data model to refine the relations. The linked data model M contains large quantities of historical data extracted from forms, so it brings semantic experience to the link discovery; this complements the structure-based approach and makes the tuple model more precise.

Phase III: Linked Data Construction. In this phase, a unified linked data model M is constructed by merging the tuple models F.T generated in the previous steps. The linked data construction has two aims: the first is to align heterogeneous data for better storage and usage; the other is to use this unified model to provide semantic information to the previous steps and to serve as a corpus for further use. Two kinds of merging are applied in this phase: property-based merging and value-based merging. Property-based merging calculates the semantic similarity of property names with the support of WordNet [12]. Value-based merging records the frequencies with which values appear in order to judge whether two properties are the same. The merged linked data model is used both in the semantic link discovery phase, to refine the tuple model, and in the data service generation phase, to provide the data model.

Phase IV: Data Service Generation. In this phase, the linked data model M is converted directly into XML resources, since the two have a similar graph structure. For each resource, RESTful data services described in WADL are built. Users can then obtain the resources of a given category via the web services. Based on the provided services, data management, sharing, homogenization and evaluation can be easily implemented.

Through these four phases, we implement a system that easily and rapidly provides data services based on heterogeneous printed forms.

3. The methods

In this section, we first define the fundamental problems of our work and then introduce the core methods of each phase illustrated in the framework.


3.1. Definitions

A printed form F is composed of a set of cells, denoted as F.C = {C1, C2, ..., C|F.C|}. Each cell can be a property name or a property value. The form F1 in Fig. 1 is a simple example: the cells with gray backgrounds are property names and the others, with white backgrounds, are property values. In the form F2, some property names (such as F2.Name) are also values of a higher-level property name (such as F2.PersonalInfo). A tuple T = ⟨P, V⟩ denotes a pair of a property name P and the set V of corresponding values; each single value v in V, together with P, composes a property-value pair ⟨P, v⟩. A form F can then be represented by a set of tuples, denoted as F.T = {T1, T2, ..., T|F.T|}.

Data Merging. Forms in the same category may have different structures and properties, but many of the property names are semantically similar to each other. Data merging merges similar properties into a single node of the linked data model. As shown in Fig. 3, the merged model M = ⟨N, L⟩ is a directed acyclic graph, where M.N is the set of nodes and M.L is the set of links. Each node N in M.N represents a property (also known as a concept) and is denoted as N = ⟨P, I⟩, where P is the property name and N.I is the set of items that contain this property. An item is a triple ⟨uF, P, V⟩, where uF is the unique id of the form F, and P and V are the original property name and the value that matches this property in F. Each link in M.L is denoted as ⟨uF, from, to⟩, which records not only the two nodes that the link connects but also the form to which the link belongs.

3.2. Form recognition

3.2.1. Recognition

A form can be decomposed into a set of characters and a set of lines, and the first step of form recognition is to extract the character set and the line set using OCR. Tesseract [1] and ABBYY (see footnote 1) are two widely used, sophisticated OCR engines, but neither of them focuses on form recognition. The former performs well in line recognition and the latter recognizes characters more accurately. To achieve accurate recognition of forms, we combine the results of the two OCR engines: the character set CH = {Ch1, Ch2, ..., Ch|CH|}, where Ch = ⟨value, page, startX, startY, height, width⟩, is taken from the ABBYY recognition result, and the line set LN = {Ln1, Ln2, ..., Ln|LN|}, where Ln = ⟨page, startX, startY, height, width⟩, is taken from Tesseract. This step corresponds to process (a) in Fig. 5.

3.2.2. Cell model construction

In this step, we use the position information to build the cell model F.C from CH and LN according to Algorithm 1. A cell is denoted as ⟨value, left, above, width, height⟩, representing the value and the position of the cell. First, we sort CH and LN in reading order. Then, for each Chi in CH, we use its position information to find the four nearest lines leftLn, rightLn, aboveLn and bottomLn in LN. The positions of these four lines determine the position of the cell Ci to be built. If another cell Cj in F.C has the same position as Ci, we merge Ci.value into Cj.value; otherwise, we add Ci to F.C. Process (b) in Fig. 5 is an example of cell set construction.
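The cell model construction described above (Algorithm 1) might be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the class names `Char`, `Line` and `Cell` and the function `build_cells` are hypothetical, pages are ignored, a line counts as vertical whenever it is taller than it is wide, and characters are visited in (y, x) order as a stand-in for reading order.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One recognized character with its bounding box (hypothetical layout)."""
    value: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class Line:
    """One ruling line of the form, described by its bounding box."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class Cell:
    value: str
    left: float
    above: float
    width: float
    height: float


def build_cells(chars, lines):
    """Group characters into cells delimited by their nearest ruling lines."""
    vertical = [ln for ln in lines if ln.height > ln.width]
    horizontal = [ln for ln in lines if ln.height <= ln.width]
    cells = {}
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        cx, cy = ch.x + ch.width / 2, ch.y + ch.height / 2
        # Only lines that span the character's row or column are candidates.
        lefts = [ln.x for ln in vertical if ln.x <= cx and ln.y <= cy <= ln.y + ln.height]
        rights = [ln.x for ln in vertical if ln.x > cx and ln.y <= cy <= ln.y + ln.height]
        aboves = [ln.y for ln in horizontal if ln.y <= cy and ln.x <= cx <= ln.x + ln.width]
        belows = [ln.y for ln in horizontal if ln.y > cy and ln.x <= cx <= ln.x + ln.width]
        if not (lefts and rights and aboves and belows):
            continue  # the character is not enclosed by four lines
        left, right = max(lefts), min(rights)
        above, below = max(aboves), min(belows)
        key = (left, above)
        if key in cells:
            # Same top-left corner: the character belongs to an existing cell.
            cell = cells[key]
            cell.value += ch.value
            cell.width = max(cell.width, right - left)
            cell.height = max(cell.height, below - above)
        else:
            cells[key] = Cell(ch.value, left, above, right - left, below - above)
    return list(cells.values())
```

Keying cells by their top-left corner mirrors the merge condition of Algorithm 1, where a character joins an existing cell when its left and above lines coincide with that cell's.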

3.3. Semantic link discovery

The cell model contains all the layout and lexical information of a form, but semantic relations are not kept. Semantic link discovery analyzes the relationships among cells and builds a linked tuple model that represents the form semantically.


Algorithm 1 Cell Model Construction
Input: Character set {CH}, Line set {LN}
Output: The cell model for the form F.{C}
F.{C} ← ∅
Sort {CH} by {page, startX, startY} in ascending order
Sort {LN} by {page, startX, startY} in ascending order
LNh ← ∅, LNv ← ∅
for ln ∈ {LN} do
    if ln.height ≫ ln.width then
        Put ln in LNv
    else
        Put ln in LNh
    end if
end for
for ch ∈ {CH} do
    find the ln in LNv that has the biggest ln.startX on the horizontal left of ch
    ch.leftLn ← ln
    find the ln in LNv that has the smallest ln.startX on the horizontal right of ch
    ch.rightLn ← ln
    Similarly for ch.aboveLn and ch.bottomLn
    if some c ∈ F.{C} s.t. c.left = ch.leftLn and c.above = ch.aboveLn then
        c.value ← c.value + ch.value
        c.width ← max(c.width, ch.rightLn − ch.leftLn)
        c.height ← max(c.height, ch.bottomLn − ch.aboveLn)
    else
        Put ⟨ch.value, ch.leftLn, ch.aboveLn, ch.rightLn − ch.leftLn, ch.bottomLn − ch.aboveLn⟩ in F.{C}
    end if
end for
return F.{C}

3.3.1. Structure-based discovery

According to the way a form is read, we classify form structures into the column-wise type, the row-wise type, the mix-wise type and three value-alternate types (see Fig. 4). In practice, a form is usually a combination of these six types. Take the forms in Fig. 1 as examples. The form F1 is in row-wise mode. The upper portion of the form F2 combines the row-wise and the row-wise value-alternate modes, and the bottom portion is in the column-wise value-alternate mode. The forms F3 and F4 as a whole are both in column-wise mode, and the sub-forms under their titles are both in row-wise mode.

As shown in Algorithm 2, we build the tuple model F.T based on the sizes and the relative positions of the cells in F.C. For a cell Ci that is adjacent to a group of smaller cells, we first build the tuples of those smaller cells iteratively and then take the property names of these tuples, {Tij.P | 0 < j ≤ |Ti|}, as the value set Ti.V to build Ti. If Ci is of the smallest size, we take the value of its right or lower neighbor as the potential values of Ti. According to the interpretive mode of the form, we choose among the potential values to decide the single value in Ti.V and build the tuple Ti. Process (c) in Fig. 5 is an example of tuple tree model construction.

3.3.2. Model refinement

Since the structures of forms are diverse and flexible, structure alone is not enough to decide relations, so we add a refinement step to the model construction. When we deal with a cell Ci that has no smaller cells in F.C, we take the merged model M as a reference to decide which cell contains the value of Ti. First, we match Ci.value against M.N to get the candidate node set Ni. Then, for each Nij in Ni, we get Vij, the union of all values of the items in Nij.I. Then, we match the values of all potential value


Fig. 3. An example of merged model M.

Fig. 4. Examples of six types of form structure: (a) column-wise; (b) row-wise; (c) mix-wise; (d) column-wise value-alternate; (e) row-wise value-alternate; (f) column-wise multi-value-alternate. Legend: property cell, name cell.

cells Cp in the potential value cell set of Ci against the value set Vij. According to the value-based similarity (see Section 3.4.2) of each Cp, we select the most likely Cp to obtain Ti.
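The refinement step can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the merged model is reduced to a plain dictionary from property names to previously seen values, candidate nodes are found by exact case-insensitive name matching rather than semantic matching, and the names `refine_value` and `token_overlap` are hypothetical. The similarity function is passed in as a parameter; `token_overlap` is a toy stand-in for the value-based similarity of Section 3.4.2.

```python
def refine_value(property_text, candidates, model, similarity):
    """Pick the candidate cell text that best fits the values seen before.

    model: dict mapping a node's property name to the values of its items.
    similarity: callable(candidate_text, historical_values) -> float.
    Returns None when the model offers no evidence for this property.
    """
    wanted = property_text.strip().lower()
    # Candidate nodes: here simply the nodes whose name equals the cell text.
    value_sets = [vals for name, vals in model.items() if name.strip().lower() == wanted]
    if not value_sets or not candidates:
        return None
    scored = [
        (max(similarity(text, vals) for vals in value_sets), text)
        for text in candidates
    ]
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text if best_score > 0 else None


def token_overlap(text, historical_values):
    """Toy similarity: share of the candidate's tokens already seen as values."""
    known = {w for v in historical_values for w in v.lower().split()}
    tokens = text.lower().split()
    return sum(t in known for t in tokens) / len(tokens) if tokens else 0.0
```

When the function returns None, a caller would fall back on the structure-based decision alone, which matches the role of the refinement as a complement to structure rather than a replacement for it.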

3.4. Linked data construction

Linked data construction builds a single model by merging form instances into a directed acyclic graph. Every time a new form is extracted, a merge operation is executed for each property node whose matching score is above a threshold θ. The matching score denotes the similarity between a tuple T in F.T and a node N in M.N. It is determined by two factors: the property similarity and the value similarity.

3.4.1. Property-based merging

The central process of property-based merging is to calculate the semantic similarity of the names T.P and N.P. To calculate the semantic similarity, we use WordNet [12], a large lexical database, and the Lin algorithm [13], a word lexical similarity

calculating algorithm. For two properties P and Q, we first split their names, at stop words, into two sets of words (entries in WordNet), denoted as P.W and Q.W. Then, for each P.Wi, we calculate the lexical similarity simij with every Q.Wj in Q.W to build a similarity matrix Msim. Next, we find the min(|P.W|, |Q.W|) highest similarities in which no word is repeated; if a similarity simij is higher than θ1, we regard P.Wi and Q.Wj as the same word when forming the union of P.W and Q.W. Finally, we apply the Jaccard coefficient to P.W and Q.W to get the similarity simPQ of P and Q.

3.4.2. Value-based merging

Similar property names do not always mean that two properties are the same, as with Title for a book and Title for a person; nor do dissimilar names always mean that two properties are different, as with Contact Info and Email. To match properties more accurately, we also use the values of properties to calculate the similarity when merging. For two value sets, V of property P and U of property Q, we first calculate the term frequency (TF) in V and U to get the term frequency vectors


Fig. 5. An example from printed form to tuple model.

Algorithm 2 Find Tuples
Input: The cell model F.{C}
Output: The tuple model F.{T}
Sort F.{C} by {left, above} in ascending order
F.{T} ← ∅
while F.{C} ≠ ∅ do
    FindTuple(F.C0, F.{C} \ F.C0, F.{T})
end while
return F.{T}

function FindTuple(Cell, valueScope, F.{T})
    if Cell.value contains more than 100 words then
        return False
    end if
    Find all successive c with smaller height on the horizontal right of Cell as rightPotentialValues; similarly get bottomPotentialValues
    if rightPotentialValues ∪ bottomPotentialValues ≠ ∅ then
        for valueCell ∈ rightPotentialValues ∪ bottomPotentialValues do
            if valueCell ∈ F.{C} then
                FindTuple(valueCell, F.{C} \ valueCell, F.{T})
                Put ⟨Cell.value, valueCell.value⟩ in F.{T}
                Remove valueCell from F.{C}
            end if
        end for
    end if
    Find rightNeighbour and bottomNeighbour as potentialValues
    if some neighbour ∈ potentialValues found tuple then
        The other is the value of Cell
    else
        Compare the matching score of ⟨Cell.value, neighbour.value⟩ to get the more possible value
    end if
    return True
end function

v⃗ and u⃗. The value-based similarity simVU is the cosine similarity of v⃗ and u⃗.
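The two similarity measures of Sections 3.4.1 and 3.4.2 could be sketched in Python as follows. This is an illustrative sketch with several assumptions: `word_sim` is a caller-supplied placeholder for a WordNet-based measure such as Lin similarity, property names arrive already split into word lists, values are tokenized on whitespace, and each matched word pair is counted once in both the intersection and the union of the Jaccard coefficient. The function names and the default `theta1=0.5` (the setting reported as optimal in Section 4.2.3) are choices made for this example.

```python
import math
from collections import Counter


def property_similarity(p_words, q_words, word_sim, theta1=0.5):
    """Jaccard coefficient over word sets after greedy one-to-one matching.

    word_sim(a, b) -> float in [0, 1] stands in for a WordNet-based measure.
    """
    p_words = list(dict.fromkeys(p_words))  # drop duplicates, keep order
    q_words = list(dict.fromkeys(q_words))
    if not p_words or not q_words:
        return 0.0
    # Similarity matrix flattened into (score, i, j) triples, best first.
    scored = sorted(
        ((word_sim(p, q), i, j)
         for i, p in enumerate(p_words)
         for j, q in enumerate(q_words)),
        reverse=True,
    )
    used_p, used_q, matched = set(), set(), 0
    for score, i, j in scored:
        if score <= theta1:
            break  # remaining pairs are all below the word threshold
        if i not in used_p and j not in used_q:
            used_p.add(i)
            used_q.add(j)
            matched += 1
    union = len(p_words) + len(q_words) - matched
    return matched / union


def value_similarity(values_v, values_u):
    """Cosine similarity of the term-frequency vectors of two value lists."""
    tf_v = Counter(w for v in values_v for w in v.lower().split())
    tf_u = Counter(w for u in values_u for w in u.lower().split())
    dot = sum(tf_v[w] * tf_u[w] for w in tf_v.keys() & tf_u.keys())
    norm_v = math.sqrt(sum(n * n for n in tf_v.values()))
    norm_u = math.sqrt(sum(n * n for n in tf_u.values()))
    return dot / (norm_v * norm_u) if norm_v and norm_u else 0.0
```

Because at most min(|P.W|, |Q.W|) pairs can be matched without reusing a word, the greedy loop respects the bound stated in Section 3.4.1 without tracking it explicitly.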

Once the property-based similarity simPQ and the value-based similarity simVU are obtained, the matching score is simply their arithmetic mean, score = (simPQ + simVU) / 2. If the matching score is higher than the threshold θ, we consider the properties P and Q to be the same property and merge them. Model merging is applied whenever a new form instance is added to the model M.

3.5. Data service generation

To better enrich and utilize the linked data model, it is necessary to encapsulate the data and provide it as web services. We therefore map the linked data model onto XML resources and generate RESTful [9] web services. A node N is mapped to a resource category, and the items in N are grouped by uF and mapped onto resources. In the example of Fig. 6, the node Resume is mapped to resume in XML, and the 2 instances F1 and F2 are converted into resume


Fig. 6. An example from linked data model to XML resource.

resources in this node, each of which has two items. For each item, we map P to the name of the resource, uF to the id, and V to a sub-property if V is also a node linked to Resume in this instance, or to a value otherwise. Applying this mapping to every node in the model yields resources in XML format that can be published as data services. Then, the Web Application Description Language (WADL) is used to describe the RESTful data services. When a new node is added to the model, a new resource and the corresponding web services are registered in a WADL file, and the four basic operations of a RESTful data service, GET, PUT, POST and DELETE, are supported. Unlike a traditional PUT operation, which simply inserts a new resource, in our work the data merging described in Section 3.4 is applied every time a new form instance is added, in order to maintain the linked data model. After the WADL files are registered on a server, the data services can be accessed by sending HTTP requests to the server, and the data can be easily

and rapidly obtained. For the resource in Fig. 7, for example, the instances of the resource resume can be obtained using the URL http://resume.com/root/resume. Moreover, by visiting http://resume.com/root/resume/0351, one can get the information of the resume with id 0351.

Data services generated from heterogeneous printed forms have several typical applications, including data homogenization and data evaluation.

Data Homogenization. Some enterprise processes require homogeneous forms. With our data services, heterogeneous printed forms are easily converted into homogeneous ones. A standard S is given as the input, denoted as S = ⟨FMT, P⟩, where FMT represents the format of the output data and P is the set of requested properties, denoted as P = {P1, P2, ..., P|P|}. Given S, the central process of data homogenization is to query properties via the data services and organize the properties according to


Fig. 7. An example of WADL service specification.

the form instance to convert each form F into a new form F′ in the standard format FMT.

Data Evaluation. Comparing data across instances is common in business activities, so evaluating instances is another important application of data services built from heterogeneous printed forms. Given a scoring criterion for each property, evaluating heterogeneous forms becomes simple.

4. Experimental evaluation

In this section, we evaluate the performance of our framework and methods. First, we introduce our data set and evaluation metrics. Then we describe how the methods are evaluated in each phase. Finally, we analyze and discuss the experimental results.

4.1. Data set

In this experiment, we use resumes as our test data. To obtain a large number of heterogeneous printed resume forms, we collected 100 resumes at a career fair from 100 job seekers applying for diverse positions, which ensures the diversity of our data set. Then, to find the optimal threshold settings for the model merging phase, we randomly selected 50 of the resumes as the optimization set for the threshold settings and used the remaining 50 as the evaluation set for performance testing. The statistics of the data sets are shown in Table 1. A tuple ⟨P, V⟩ is a one-to-many relation between a property P and its value set V, while a property-value pair consists of a property P and a single value in V.

Table 1
Statistics of the data set.

                    Cell    Tuple/property    Property-value pair
Optimization set    1915    931               1010
Evaluation set      1804    798               985
Total               3719    1729              1995

4.2. Evaluation

4.2.1. Form recognition

First, we take the two OCR engines, Tesseract and ABBYY, as the baselines for evaluating form recognition. The experimental results are shown in Table 2. Our method recognizes 3677 of the 3719 cells in the 100 forms, while ABBYY and Tesseract


Table 2
Results for the form recognition.

              # Cell    Precision    Recall    F1
Our method    3677      95.43%       96.67%    96.04%
ABBYY         2619      96.30%       70.99%    81.73%
Tesseract     3483      77.06%       96.32%    85.62%

Table 3
Results for the tuple model construction.

                                # Pair    Precision    Recall    F1
Structure-based only            1850      94.22%       96.34%    95.27%
Structure-based + refinement    1909      95.50%       96.74%    96.11%

extract 2619 cells and 3483 cells, respectively. The results show that our method identifies more cells than either baseline and has the highest F1 score. We had expected our method to obtain the best results in both precision and recall, but its precision is nearly 1% lower than that of the ABBYY baseline. The evaluation results indicate that, although some data is lost during the combination and the cell model construction, which reduces precision, our method of combining the two OCR engines still joins their strengths and performs better overall. With an F1 score of 96.04%, we conclude that our form recognition method is feasible and acceptable.

4.2.2. Semantic link discovery

To evaluate semantic link discovery, we first manually generate a standard tuple set for all 100 resumes, which contains 1729 tuples in total. Then, we test the structure-based model construction method alone and the full semantic link discovery method with the refinement involved. A property-value pair is our statistical unit. A true positive is a pair that has a corresponding pair in the standard set. The experimental results are shown in Table 3. The table shows that our structure-based model construction method achieves good results, with more than 90% in both precision and recall, which supports the soundness of our approach. Moreover, combining the structure-based method with the refinement performs better than the structure-based method alone: the full semantic link discovery method is nearly 1% better than the structure-based version. The gap is small, so an application faces a trade-off between the refinement and the runtime. However, with improvements to the model merging and the refinement, this trade-off can diminish, and our method can outperform the structure-based method by a larger margin.

4.2.3. Linked data construction

First, we use the 50 resumes in the optimization set to set the thresholds for linked data construction. There are two thresholds: θ for the matching score and θ1 for the word similarity in property-based merging. We vary each of them from 0.1 to 0.9 in steps of 0.1, record every merge operation and manually judge whether the two property nodes should be merged. Fig. 8 shows the F1 scores for the different thresholds. Along a single curve, the F1 score generally rises as θ1 increases to 0.5 and then tails off. Comparing across curves, the F1 scores for all θ1 settings ascend as θ increases from 0.1 to 0.6 and descend as θ grows to 0.9. From these results, we conclude that the optimal threshold settings are θ = 0.6 and θ1 = 0.5. After the thresholds are set, we test the performance of the linked data construction method on the remaining 50 resumes in the evaluation set with the optimal threshold settings. In this

Fig. 8. Results of F1 score for linked data construction with different thresholds.

Fig. 9. Performance of the linked data construction in data compression ratio.

phase, we introduce a new evaluation matrix Compression Ratio that denotes the proportion of the property number of the model M in the tuple or the property number of all form instances:

|M .N | . CompressionRatio = ∑|F | i=1 |Fi .T |

(1)
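The compression ratio in Eq. (1) can be illustrated with a minimal sketch. The data layout is an assumption made only for illustration: each form instance is represented as a list of property names, and the merged model as a list of node names. The function name `compression_ratio` and the toy resumes are hypothetical and are not part of the original framework.

```python
from typing import Iterable, Sequence


def compression_ratio(model_nodes: Sequence[str],
                      instance_tuples: Iterable[Sequence[str]]) -> float:
    """Return |M.N| divided by the summed property counts |F_i.T|."""
    total_properties = sum(len(props) for props in instance_tuples)
    if total_properties == 0:
        raise ValueError("at least one instance property is required")
    return len(model_nodes) / total_properties


# Hypothetical toy data: three resumes whose properties overlap.
instances = [
    ["Name", "Gender", "Major"],
    ["Name", "Email"],
    ["Name", "Gender", "Email"],
]
# Assumed merged model: one node per distinct property after matching.
merged_nodes = ["Name", "Gender", "Major", "Email"]

ratio = compression_ratio(merged_nodes, instances)
print(f"compression ratio = {ratio:.2f}")
```

In this toy case the model holds 4 nodes for 8 instance properties. Adding further resumes that reuse the same properties would increase only the denominator, so the ratio would fall.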

Fig. 9 demonstrates that the number of properties in the merged model grows much more slowly than the number of properties in the instances. As more instances of the same category are merged into the model, the growth rate of the model decreases and the compression ratio decreases as well. When the number of instances is large enough, the model tends to stabilize, and the compression ratio can fall below 20%.

4.2.4. Data service generation

In the previous steps, 191 nodes were constructed in the linked data model from the 50 form instances. To evaluate the correctness of data service generation, we set two baselines: the data produced by the previous steps and the original form data. We evaluate the generated data services, described in WADL, by requesting resources via HTTP requests and manually checking the correctness of the returned data resources in XML files. A category of resource

Please cite this article in press as: H. Yu, et al., Data service generation framework from heterogeneous printed forms using semantic link discovery, Future Generation Computer Systems (2017), https://doi.org/10.1016/j.future.2017.09.059.


Table 4
Results for the data service generation compared to the two baselines.

Baseline             Precision   Recall    F1
Linked data model    100%        99.49%    99.83%
Original data        89.60%      95.84%    92.61%

stands for a node in the linked data model, and a resource represents an instance in a node of the linked data model. A true positive is a resource with the correct resource name, value, and related resource URIs. In total, 191 categories of resources are requested. By comparing the resources with the corresponding data in the two baselines, we get the results shown in Table 4. From the first row of Table 4, we can see that our approach loses little information when generating data services from the linked data model, which demonstrates that our data service generation approach is practical. Recall is not perfect because some nodes in the linked data contain special characters that are not compatible with XML. The second row of Table 4 is based on the original data in the printed forms and therefore represents the end-to-end result of our framework from printed forms to data services. The final precision is much lower than in the previous steps. After analyzing the false positives, we think the main reason is that the merge operation in the linked data construction introduces errors. However, merging does reduce the number of nodes in the linked data, which speeds up searching over heterogeneous data. With an F1 score over 90%, we conclude that our framework is feasible.

4.2.5. Overall results

By going through each phase, we obtain the overall results of the experiments on the framework. As already shown in Table 4, comparing the generated data services with the original data in the printed forms gives 89.60% precision, 95.84% recall, and an F1 score of 92.61%. The results show that our methods are accurate. In terms of efficiency, our methods also perform well in space utilization. However, the performance may be degraded by data sparsity, and the good space utilization comes at a cost in running time, which is an inevitable trade-off.
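As a quick sanity check on the overall figures, the sketch below recomputes the F1 score from the reported precision and recall. It assumes the standard definition of F1 as the harmonic mean of precision and recall, which this section does not restate. The helper name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall row of Table 4: data services compared with the original forms.
overall_precision = 0.8960
overall_recall = 0.9584
overall_f1 = f1_score(overall_precision, overall_recall)

print(f"overall F1 = {overall_f1:.4f}")
```

Under this assumption the recomputed value is about 0.926, which agrees with the reported 92.61% up to rounding of the published precision and recall.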
Further, the experimental results also allow some conclusions about the characteristics of the test data. From the linked data model, we find that there are some common properties, such as Name, Gender, Self Evaluation, Graduation School, and Aimed Position. More than 70% of the resumes include these properties, which matches the intuition and habits of resume writing. Surprisingly, Birth Place has a 66% occurrence rate; it is a common property in Chinese personal information forms but not in English ones. Moreover, we find that the most popular resume format is under the title of personal resume and has the popular properties mentioned above plus Email, Project Experience, Major, and Educational Background. More than a quarter of the resumes are in this format. In addition, nearly 10% of the resumes have four separate portions: the Personal Information portion, which includes properties such as Name and Gender, the Work Experience portion, the Project Experience portion, and the Educational Background portion.

5. Case study

To illustrate the whole framework and its methods intuitively, we build a prototype system. The system is a resume management system that allows HR staff to upload images of printed forms and select candidates with self-defined requests. When an image of a printed form is uploaded, the system extracts the corresponding cell models, then constructs and shows the tuple model. When the information is needed, the whole linked data model can be


retrieved, the needed information can be searched, and the resume entries are displayed in the vertical table format.

Tuple Model Construction. Fig. 10 shows the web page for tuple model construction, which displays the tuple tree of a printed form instance. On this page, we upload an image file of a printed form, and the system recognizes the file and builds cell models, which corresponds to the first phase of our framework, namely form recognition. Afterwards, the system constructs a tuple model for the form using the semantic link discovery technique introduced in Section 3.3. Meanwhile, in the backend, the newly constructed tuple model is merged into the merged model using the methods in Section 3.4 for further use. In this case study, we use Chinese resumes as the data source. Therefore, beyond the methods we proposed, we also use some other techniques to adapt our framework to the Chinese language. First, after form recognition, we preprocess the cell models to correct spelling or recognition errors in the values using the Tencent Wenzhi API. Next, to apply the WordNet dictionary in model merging, we translate Chinese expressions into English using the Google Translate API. For display, however, the property names are replaced by the most frequently used Chinese property name in each node, which standardizes the property names.

Linked Data Construction. The linked data model is constructed and updated after every form instance is modeled. The linked data model is shown in Fig. 11 and can be retrieved at any time in any status. Nodes in the model have different colors and sizes. The color indicates the instance from which the node was added to the model, so nodes with different colors come from different instances. The size of a node represents the quantity of data it contains: a bigger node contains more instances. The lower right part of Fig. 11 shows a more detailed model, from which we can find that the merged model is closely connected in general, with few outliers.

Vertical Table Generation. As shown in Fig. 12, the system provides the set of properties that appear in the merged model via the generated data services, by getting all resource names. Users select the resources they need, and the system retrieves the resources by sending GET requests and shows the returned resources as a vertical table. This is one application of the data services generated by our framework. In a typical application scenario, an HR staff member wants to find candidates for a position that has requirements on major or graduation year; he or she can then select major or graduation year together with name to find suitable candidates.

6. Related work and discussion

Our work contains four main phases, and related research has been done on each phase. Many form transformation approaches [14–16] have been proposed for transforming printed forms into computer-readable files. However, little work has been found on turning printed forms into structured or semi-structured data [17–19]. There is some research on web table extraction and matching [6,7,20]. Unlike a web table, which always contains multiple entities with the same properties, a printed form represents a single entity and has a more complicated structure. Moreover, web tables resemble vertical tables and are easier for property labeling, which is a significant challenge for printed forms. Nevertheless, the matching idea in web table matching inspires the linked data construction part. Before linked data can be constructed, form recognition and semantic link discovery need to be applied. Regarding data service generation, much research [10,11,21] has been done and implemented in enterprise systems. Table 5 compares our work with related work. It can be seen that our framework is the only one that provides a complete process from printed forms to data services, and that the approaches used are practical.


Fig. 10. Tuple model construction page.

Table 5 Related work comparison. Our approach

Phases Form recognition

Format free Works on heterogeneous forms F1 score Property field determination Predefines property fields Property matching Value matching Form type Data source

Y Y 96.08% Y N Y Y Large&Rich

Semantic link discovery

Linked data construction

[16]

[14]

[17]

[18]

[19]

N N –a N

Y Y nearly 100% N

Y Y 91.89%b Y N

N N 96.33% Y N

Y Y 96.03% Y Y

[7]

Y Y Y N Small&Narrow Web table

Printed form

a The F1 score is not found.
b Only the recall is provided in [17]; the F1 score is computed under the assumption that the precision is 100%.
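Footnote b can be checked with a short worked calculation. It assumes the standard F1 definition and uses the 85% recall that the discussion of [17] reports; the symbols P and R for precision and recall are introduced here only for this check.

```latex
\[
F_1 = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
F_1\big|_{P=1} = \frac{2R}{1 + R},
\qquad
F_1\big|_{P=1,\;R=0.85} = \frac{1.70}{1.85} \approx 0.9189 .
\]
```

This matches the 91.89% listed for [17] in Table 5. Because the precision of [17] is assumed to be 100%, its true F1 score cannot be higher than this value.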

Form Recognition: Form recognition is a traditional topic in document analysis and recognition. Structure-based recognition [14–16] is the mainstream approach. Ting et al. [14] proposed a form recognition method that transforms printed forms into descriptions of lines, texts, and spacings. For an uncertain element, they measure its similarity to confirmed elements to determine the category to which it belongs. The success rate is almost 100%, and the method can be used in our future work. Lin et al. [16] recognize the line structure by line pattern matching. They first use a large number of forms as templates in the learning phase and then recognize new forms. This method is less useful when the forms are widely heterogeneous.

Semantic Link Discovery: Semantic link discovery in linked data [2,3] is broadly discussed. The key point is to discover hidden relationships among large data. However, to our knowledge, there is scarcely any research on semantic link discovery in forms. Hirayama et al. [17] worked on recognizing forms with no template or predefined layout knowledge. They score labels, values, and alignment, and determine property-value relations by verifying all possible combinations. Like ours, this approach is based on structure. However, its 85% recall is much lower than ours. Emilie et al. [18] use semantic and physical constraints to recognize forms based on a Bayesian network. The precision and recall of their approach are 96.5% and 96.17% on average, which

[6]

Data service generation [8]

[10]

[11]

Ontology

Data model

N Y

are close to our results of 95.5% and 96.7%. However, their training and learning phases are based on a single form format, so the approach does not apply when forms are heterogeneous. Kasar et al. [19] proposed a query-based table recognition method, which adopts client queries to specify the set of key fields, that is, the property fields are preset manually. Their result is close to ours, with a higher precision of 97.1% and a lower recall of 94.99%. Therefore, we conclude that our semantic link discovery approach is as effective as manual presetting.

Linked Data Construction: The central process of linked data construction is property matching and merging. Ritze et al. [6] proposed T2K Match, which is designed for ‘‘matching large quantities of mostly small and narrow HTML tables against large cross-domain knowledge bases’’. Fan et al. [7] proposed a concept-based mapping approach for finding the best-matched concept in a knowledge base. Zhang [8] proposed TableMiner for web table interpretation, which does not use relevance-based matching but matches on entity-based overlaps.

Data Service Generation: Since R.T. Fielding [9] first proposed the RESTful architecture, the RESTful design style has been widely used in web services. Nowadays, this straightforward and feasible technique plays a significant role in data as a service. Cai et al. [10] presented an ontology-based approach to constructing data as a service, and Petri et al. [11] gave a model-driven process for RESTful


Fig. 11. More detailed example of generated linked data model. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Web service design, namely generating data services from data models, from which our data service generation approach learned. As the data generated from printed forms can be sparse and inconsistent, data service generation and updating can be time-consuming. Yu et al. [22] proposed an approach that employs user clusters and service clusters to address this problem, which may inspire improvements to the Quality of Service (QoS) of our work. Alsaig et al. [23] provided an approach to ranking services, which can be applied to our work to measure and improve the quality of the linked data model and the data services. Overall, our framework is a highly automated, feasible and easy approach to generating data services from printed forms and provides a way to digitize heterogeneous data and integrate information.

7. Conclusion

In this paper, we have presented a framework for generating data services from heterogeneous forms based on semantic link discovery. Our work makes three key contributions. First, we propose a framework from printed forms to data services, which is feasible and time-saving. Next, we present a practical data extraction method that improves form recognition and semantic relation discovery in large and rich


Fig. 12. Vertical table generation page.

forms. Finally, we conduct experiments and implement a prototype showing that our work is a usable and easy way to integrate and share data from heterogeneous printed forms.

In the future, we are going to explore three directions. First, we will improve the quality of the linked data model by improving the table-matching method to better match properties, and we will mine the semantic information more deeply and introduce open data into the framework to discover hidden semantic relations in printed forms. Meanwhile, we will consider integrating the properties extracted from printed forms into a more closely connected relational model to resolve data sparsity and inconsistency and to provide more robust and efficient data services. Further, we will work on an automated system based on the services as one potential extension of our work.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61373030 and 71171132.

References

[1] C. Patel, A. Patel, D. Patel, Optical character recognition by open source OCR tool Tesseract: A case study, Int. J. Comput. Appl. 55 (10) (2012) 50–56.
[2] N. Pernelle, F. Saïs, D. Symeonidou, An automatic key discovery approach for data linking, Web Semant. Sci. Serv. Agents World Wide Web 23 (2013) 16–30.
[3] Q. Zheng, H. Chen, T. Yu, G. Pan, Collaborative semantic association discovery from linked data, in: IEEE International Conference on Information Reuse & Integration, 2009, IRI’09, IEEE, 2009, pp. 394–399.
[4] Y. Peng, J. Wei, Improving Cross-Document Knowledge Discovery Through Content and Link Analysis of Wikipedia Knowledge, Springer Berlin Heidelberg, 2015, pp. 161–184.
[5] K.L. Skillen, L. Chen, C.D. Nugent, M.P. Donnelly, W. Burns, I. Solheim, Ontological user modelling and semantic rule-based reasoning for personalisation of Help-On-Demand services in pervasive environments, Future Gener. Comput. Syst. 34 (4) (2014) 97–109.
[6] D. Ritze, O. Lehmberg, C. Bizer, Matching HTML tables to DBpedia, in: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, ACM, 2015, p. 10.
[7] J. Fan, M. Lu, B.C. Ooi, W.-C. Tan, M. Zhang, A hybrid machine-crowdsourcing system for matching web tables, in: 2014 IEEE 30th International Conference on Data Engineering, (ICDE), IEEE, 2014, pp. 976–987.
[8] Z. Zhang, Towards efficient and effective semantic table interpretation, in: International Semantic Web Conference, 2014, pp. 487–502.
[9] S.A. Mcilraith, T.C. Son, H. Zeng, Semantic web services, IEEE Intell. Syst. 16 (2) (2011) 46–53.
[10] H. Cai, C. Xie, L. Jiang, L. Fang, C. Huang, An ontology-based semantic configuration approach to constructing data as a service for enterprises, Enterp. Inf. Syst. 10 (3) (2016) 325–348.
[11] M. Laitkorpi, P. Selonen, T. Systa, Towards a model-driven process for designing RESTful web services, in: IEEE International Conference on Web Services, ICWS 2009, Los Angeles, CA, USA, 6–10 July 2009, pp. 173–180.
[12] B. You, X.R. Liu, N. Li, Y.S. Yan, Using information content to evaluate semantic similarity on HowNet, in: Eighth International Conference on Computational Intelligence and Security, 2012, pp. 142–145.
[13] X. Yu, L. Peng, Z. Huang, H. Zhuge, A framework for automated construction of resource space based on background knowledge, Future Gener. Comput. Syst. 32 (C) (2014) 222–231.
[14] A. Ting, M.K.H. Leung, Form recognition using linear structure, Pattern Recognit. 32 (4) (1999) 645–656.
[15] X.U. Jun-Gang, Overview of data extraction, transformation and loading, Comput. Sci. (2011).
[16] C.-F. Lin, C.-Y. Hsiao, Structural recognition for table-form documents using relaxation techniques, Int. J. Pattern Recognit. Artif. Intell. 12 (07) (1998) 985–1005.
[17] J. Hirayama, H. Shinjo, T. Takahashi, T. Nagasaki, Development of template-free form recognition system, in: 2011 International Conference on Document Analysis and Recognition, (ICDAR), IEEE, 2011, pp. 237–241.
[18] P. Emilie, B. Yolande, B. Abdel, Use of semantic and physical constraints in Bayesian networks for form recognition, in: International Conference on Document Analysis and Recognition, 2011, pp. 946–950.
[19] T. Kasar, T.K. Bhowmik, A. Belaid, Table information extraction and structure recognition using query patterns, in: 2015 13th International Conference on Document Analysis and Recognition, (ICDAR), IEEE, 2015, pp. 1086–1090.
[20] D. Ritze, O. Lehmberg, Y. Oulabi, C. Bizer, Profiling the potential of web tables for augmenting cross-domain knowledge bases, in: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2016, pp. 251–261.
[21] S.A.R. Dezfouli, J. Habibi, S.H. Yeganeh, Semantic web services for handling data heterogeneity in an e-business framework, in: Advances in Computer Science and Engineering, Springer, 2008, pp. 453–460.
[22] C. Yu, L. Huang, CluCF: a clustering CF algorithm to address data sparsity problem, Serv. Oriented Comput. Appl. 11 (1) (2017) 1–13.
[23] A. Alsaig, V. Alagar, M. Mohammad, W. Alhalabi, A user-centric semantic-based algorithm for ranking services: design and analysis, Serv. Oriented Comput. Appl. 11 (1) (2017) 1–20.


Han Yu received her B.S. degree in Software Engineering from Shanghai Jiao Tong University, Shanghai, China, in 2016. She is now a postgraduate student in the School of Software, Shanghai Jiao Tong University. She obtained the Outstanding Thesis Paper Award from the School of Electronic Information and Electrical Engineering at Shanghai Jiao Tong University in 2016. Her research interests are in the areas of data recognition and analysis, data integration, web services, and information systems.

Jun Zhou received the B.S. degree in software engineering from Shanghai Jiao Tong University, China, in 2014. She is now a postgraduate student in the School of Software, Shanghai Jiao Tong University, China.

Hongming Cai received his B.S., M.S., and Ph.D. degrees from Northwestern Polytechnical University, China, in 1996, 1999, and 2002, respectively. He is now a professor in the School of Software, Shanghai Jiao Tong University, China. He is a standing director of China Graphics Society, a senior member of ACM/IEEE, and a senior member of China Computer Federation. He was named ‘‘National outstanding scientific and technological worker’’ by China Association for Science and Technology in 2012.

Lihong Jiang received her B.S., M.S., and Ph.D. degrees from Tianjin University, China, in 1989, 1992, and 1996, respectively. During 1992–1993, she worked as an assistant professor in the Department of Computer, Qingdao Ocean University, China. During 1996–1998, she worked as a postdoctoral research fellow at the School of Management, Fudan University, China. She is now an associate professor in the School of Software, Shanghai Jiao Tong University, China.
