The Journal of Systems and Software 86 (2013) 1679–1688
Graph-based reference table construction to facilitate entity matching

Fangda Wang a, Hongzhi Wang b,∗, Jianzhong Li b, Hong Gao b

a School of Computing, National University of Singapore, Singapore 117417, Singapore
b School of Computer Science and Technology, Harbin Institute of Technology, 150001 Harbin, China

∗ Corresponding author. Tel.: +86 451 86403492x810; fax: +86 451 86415827.
E-mail addresses: [email protected] (F. Wang), [email protected] (H. Wang), [email protected] (J. Li), [email protected] (H. Gao).
http://dx.doi.org/10.1016/j.jss.2013.02.026
Article info

Article history: Received 7 July 2012; Received in revised form 21 December 2012; Accepted 13 February 2013; Available online 5 March 2013.

Keywords: Entity matching; Reference table; Graph clustering
Abstract

Entity matching plays a crucial role in information integration among heterogeneous data sources, and numerous solutions have been developed. Entity resolution based on a reference table has the benefits of high efficiency and easy updating. In such methods, the reference table is important for effective entity matching. In this paper, we focus on the construction of an effective reference table by relying on the co-occurrence relationships between tokens to identify suitable entity names. To achieve high efficiency and accuracy, we first model the data set as a graph, and then cluster the vertices in the graph in two stages. Based on the connectivity between vertices, we also mine synonyms and obtain an expanded reference table. We develop an iterative system and conduct an experimental study using real data. Experimental results show that the method in this paper achieves both high accuracy and efficiency.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

Entity matching identifies object instances that refer to the same real-world entity. In some applications, it is also called entity resolution (ER for short), entity identification or deduplication. It is important for many applications such as information integration and personal information management, and it has therefore received significant attention in the literature (Cohen and Sarawagi, 2004; Agrawal et al., 2008; Koudas et al., 2006; Getoor and Machanavajjhala, 2012). Typical applications of entity matching include the identification of entities in search engines and the combination of information from heterogeneous data sources, such as product sorting in shopping web sites, which periodically groups "near" duplicate records together to facilitate users' queries and support commercial analytics.

For efficient and effective entity matching in a data warehouse with data from heterogeneous data sources, the tokens representing possible entities are extracted into a table, named the reference table (Chaudhuri et al., 2009). By exploiting the reference table, applications are able to check whether or not a record matches an entity. For example, a data warehouse with data from shopping web sites needs to maintain a list of products (see Table 1) as a reference table, which comes either from its retailers or from related staff in the company. For a given description of a product s = "SONY CYBERSHOT DSC-TX5 10.2 MP CAMERA + 4GB SDHC HDTV", to
determine its corresponding entity, s is compared with entries in the reference table, and the entity with the highest similarity score, or with similarity larger than a predefined threshold, is considered the one that the product belongs to. Compared with state-of-the-art entity resolution methods (Bellare et al., 2012; Kolb et al., 2012; Rastogi et al., 2011; Shu et al., 2011; Whang and Garcia-Molina, 2010, 2012), the benefit of reference-table-based entity matching is that each data object only needs to be compared with related rows in the reference table, and pairwise comparisons between data objects are avoided. Thus entity matching is efficient. Another benefit is that by updating the reference table, the latest knowledge is represented, so the method can handle entity resolution on new data sets. Obviously, the quality of the reference table determines the effectiveness of entity matching. It is crucial for information integration systems to address these problems and construct high-quality reference tables.

The construction of reference tables suitable for entity matching brings technical challenges. Firstly, the importance of tokens in the description or name of an object should be identified. This depends on the semantics of the tokens, and the semantic relationships between tokens should be discovered to distinguish their importance. Secondly, since the difference in the occurrences of different objects may be notable, it is difficult to identify the importance of tokens simply by frequencies. For example, the frequency of "Aigo", a brand name of cameras, is likely to be much lower than that of "Cyber-shot", a series name of "Sony" cameras, because the market share of "Sony" ranks at the top in the field of cameras. But the brand is clearly important during entity matching.
Table 1
An example of reference table.

ID    Entity name
e1    Sony Cyber-shot DSC-TX5 Digital Camera w 10.2 mp green **New**
e2    Sony Cyber-shot DSC-TX7 Digital Camera
e3    Canon Digital IXUS 200 IS Compact Camera
e4    Sony DSC-W120 for parts
e5    Pentax K1000 35 mm Film Camera NR w/2 ex lens
To the best of our knowledge, very few existing studies have focused on the construction of reference tables. In most systems, reference tables are made manually or provided by the original data sources. Such solutions have two problems. One is that entity names in the reference table are always full names with many words which seldom co-occur. For example, from common knowledge, among the entities in the reference table in Table 1, it is known that "Sony TX5" refers to e1 "Sony Cyber-shot DSC-TX5 Digital Camera". However, the (unweighted) Jaccard similarity between them is only 2/7. Most likely, "Sony TX5" cannot be identified as matching e1. The other is that, for identifying an entity, the importance of the tokens in the object's description differs. Simply storing the whole description in the reference table neglects such differences and will affect the accuracy of matching. For example, the similarity between "Sony Cyber-shot TX5" and e1 is 4/7 and the similarity between "Sony Cyber-shot DSC TX7" and e1 is 5/8. However, it is obvious that "Sony Cyber-shot TX5" correlates with e1 much more strongly, since the token "TX5" is more important. With these problems, even approximate matching on the reference table may not obtain satisfactory entity matching results, because approximate matching can only identify tokens similar in grammar but not words similar in semantics. A related approach is to identify variations of entities in a reference table (Chaudhuri et al., 2009). This method uses a relatively large amount of documents, thus favoring good match quality but at the expense of extra overhead for finding variations.

In this paper, the problem of reference table construction is studied. To represent the correlation between tokens, we model the tokens in the data set as a graph, with each token as a vertex and the correlation between two tokens as an edge between their corresponding vertices. An observation on the graph is that the connectivity of vertices represents their correlation. Based on this observation, we propose an efficient technique for hierarchically clustering the graph, with the key idea of performing a two-stage clustering. The first stage identifies the most important exemplar vertices and the second stage performs a top-down clustering on the vertices close to the exemplar vertices.

In summary, we make the following contributions:

(1) To the best of our knowledge, this is the first attempt to adaptively generate and process a reference table on a data set, and we propose a solution that models the problem as a graph with an affinity property among vertices.
(2) To the best of our knowledge, this is also the first work to propose hierarchical clustering in entity matching, which efficiently distinguishes different kinds of tokens occurring in the data set.
(3) We develop a graph-based method of identifying synonyms to improve the accuracy of clustering.
(4) We develop pruning and partition techniques to achieve high performance.
(5) We propose a novel method of weight decision, based on the features of vertices in the graph, leading to reasonable and discriminative results.
(6) Experimental results show that our method achieves both high accuracy and efficiency.

The rest of this paper is organized as follows. We describe the architecture of the whole system in Section 2. In Section 3, we describe our techniques for solving the graph clustering problem, and in Section 4, we discuss the determination of token weights in the reference entities we obtain.
In Section 5, we present an experimental
evaluation. We review the related work in Section 6 and conclude the paper in Section 7.

2. System architecture

In this section, we discuss the whole procedure of reference table construction. The input of the system is a set of records R and the output is a reference table with each entry containing the tokens representing an entity e.

As the first step, input records are pre-processed by using typical white spaces to delimit tokens and by transforming all upper-case letters into lower-case ones. A word directly before "-" and a digit directly after "-" are merged as one token. All other special characters are replaced by white spaces.

Then the system iteratively processes the records for reference table construction. Each iteration consists of three phases. In the first phase, the maximal r-radius subgraph search phase, each record is scanned once and the token graph is constructed. Then the maximal r-radius subgraphs in the token graph are found. These techniques are discussed in Section 3. The second phase, the weight and similarity score computation phase, sets weights for the tokens in every maximal r-radius subgraph SG possibly identifying an entity, and computes the correlation between each candidate record and the entity corresponding to SG. The third, thresholding phase outputs, for each entity, the records with correlation above a threshold. Then the system scans the remaining records matching no entity again and repeats the above procedure. The iterative procedure continues until no new entry is to be added to the reference table. We show the architecture in Fig. 1.

For example, the token graph G is created from the record set R as shown in Fig. 2. After the first phase, maximal 3-radius subgraphs (e.g., the subgraph SG with edges marked in bold) are obtained. Taking these maximal 3-radius subgraphs as input to the second phase, we compute token weights in SG referring to an entity e = ("sony": 1, {"cyber", "shot", "dsc"}: 1, "tx5": 3), and we also get the correlation between e and every record in the set R = {"Sony Cyber-Shot TX5 Black", "Battery for Panasonic W350 W100, Sony W350 TX5", "Canon EOS Rebel 5D"}. Then, in the next phase, a threshold is set to identify the relevant records corresponding to e. For the convenience of discussion, we set threshold = 0.5. In practice, the threshold could be set by machine learning strategies, which are out of the scope of this paper and left for future work. Then we get the output set R = {"Sony Cyber-Shot TX5 for parts"}; the rest of the set R is taken as input into the next iteration because the correlations between those records and e are 4/11 and 0 (lower than 0.5), respectively. "Battery for Panasonic W350 W100, Sony W350 TX5" will be added into R if it relates to e the most after all iterations. Then the result set is R = {"Sony Cyber-Shot TX5 for parts", "Battery for Panasonic W350 W100, Sony W350 TX5"}.

3. Graph-based reference table construction

In this section, the reference table construction algorithm is proposed. At first, the problem is defined. Then the algorithms are presented.

3.1. Problem definition

In this section, the problem is defined. R denotes the set of records referring to entities in one category. For each r ∈ R, T(r) denotes the set of tokens in r. fre(t) denotes the frequency of a token t occurring over all records in R. We model the data set as a token graph G, with each vertex v representing a token tv and each edge (u, v) representing the connectivity between u and v.
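For concreteness, the following minimal Python sketch shows one possible reading of the pre-processing rules of Section 2 and of the frequency notation fre(t). The function names, the exact regular expressions and the sample record are our own illustrative assumptions, not part of the paper; later sketches reuse tokenize.

```python
import re
from collections import Counter

def tokenize(record):
    """Pre-processing as described in Section 2 (one possible reading):
    lower-case, merge a word directly before '-' with a digit directly
    after '-', replace remaining special characters by spaces, split."""
    s = record.lower()
    s = re.sub(r'([a-z]+)-(\d)', r'\1\2', s)   # e.g. "eos-5" -> "eos5"
    s = re.sub(r'[^a-z0-9]+', ' ', s)          # other special characters -> spaces
    return s.split()

def token_frequencies(records):
    """fre(t): occurrences of token t over all records in R."""
    freq = Counter()
    for r in records:
        freq.update(tokenize(r))
    return freq

print(tokenize("Sony Cyber-shot DSC-TX5 Black"))
# ['sony', 'cyber', 'shot', 'dsc', 'tx5', 'black']
```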
Fig. 1. Architecture overview.

Fig. 2. An example of the token graph.
The advantage of the graph model is that we can identify entities based on the connectivity between tokens. To identify the affinity between tokens, weights are assigned to the edges in G. With the consideration that the connectivity between two different vertices can represent their affinity, and in order to compute more flexible and richer relationships in data sets, the weights of edges are defined with respect to a given P. Then we have the following definition.

Definition 1. (P-distance edge) In a token graph G corresponding to a data set R, given a non-negative integer P, for t1, t2 in T(r), an edge (t1, t2) is added to G iff the distance between them is smaller than P. Such an edge is called a P-distance edge.

Example 1. The record r is "Sony Cyber-shot DSC-TX5 Black". The 2-distance edges among its tokens are illustrated in Fig. 3.

Fig. 3. Edges between tokens of "Sony Cyber-shot DSC-TX5 Black".

Based on Definition 1, Formula (1) scores the affinity between any two vertices:

weight_e(v_i, v_j) = \sum_{r=1}^{n} weight_r(r, v_i, v_j)    (1)

where weight_r(r, v1, v2) equals 1/(distance between v1 and v2 in r)^2 within P distance, for every v1, v2 ∈ T(r). For instance, in the record r1 = "SONY CYBERSHOT DSC-TX5 10.2 MP CAMERA + 4GB SDHC HDTV", weight_r(r1, sony, dsc) = 0.25; and in the record r2 = "Sony DSC-W120", weight_r(r2, sony, dsc) = 1.0 when we set P = 2. Then weight_e(sony, dsc) = 0.25 + 1.0 = 1.25. If two vertices are not connected within the restriction P, their affinity is set to −∞.

In the weighted graph, the goal is to find its subgraphs, with each SG ⊆ G uniquely identifying an entity e. However, in many cases, a graph is complex and it is ambiguous to determine the subgraph referring to an entity.
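A small sketch of Definition 1 and Formula (1) follows, reusing the tokenize helper from the earlier sketch. Note that the paper's worked example (distance 2 with P = 2) suggests that pairs at distance up to and including P receive an edge, which is what the code below assumes.

```python
from collections import defaultdict
from itertools import combinations

def build_token_graph(records, P=2):
    """Accumulate weight_e(u, v) = sum over records of 1 / d(u, v)^2
    for token pairs whose distance d within a record is at most P."""
    weights = defaultdict(float)
    for r in records:
        tokens = tokenize(r)
        for i, j in combinations(range(len(tokens)), 2):
            d = j - i
            if d > P:                      # only P-distance edges are added
                continue
            u, v = sorted((tokens[i], tokens[j]))
            weights[(u, v)] += 1.0 / (d * d)
    return weights

# Worked example from the text: "sony" and "dsc" are 2 tokens apart in
# "sony cybershot dsc tx5 ..." (contributing 0.25) and adjacent in
# "sony dsc w120" (contributing 1.0), so weight_e(sony, dsc) = 1.25.
```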
Table 2
An entity table in one database.

RID    Name
r1     Sony Cyber-shot DSC-W150 Digital Camera
r2     Sony Cyber-shot DSC-T5 5.1 Megapixel
r3     Sony Cyber-shot DSC-TX5 Black *NEW*
r4     SONY DSC-W350 PINK
r5     Sony DSC-W120 for parts
It makes things worse that different entities may often contain similar subgraphs of tokens. For example, consider the records in Table 2. In the ideal case, an application might want to identify "Sony TX5" as an entity and r3 = "Sony Cyber-shot DSC-TX5 Black" as referring to this entity. The application also needs to distinguish objects with similar representations that refer to different entities, such as "Sony Cyber-shot DSC-T5 5.1 Megapixel" and "Sony Cyber-shot DSC-TX5". Additionally, some meaningless tokens (e.g., "digital") should be eliminated since they are unhelpful during matching. To address these problems, we introduce the maximal r-radius graph to describe the subgraph referring to an entity. To formally describe the maximal r-radius graph, we define several concepts.

Definition 2. (exemplar vertex) For each vertex v in a graph whose vertices have been clustered, we use fre(v) to denote the frequency of the token corresponding to v in the document set. Given two fractions Fup, Flow ∈ [0, 1] and a non-negative integer K, v is an exemplar vertex if (i) Flow < fre(v) < Fup; (ii) v is the center vertex in its cluster and the size of this cluster is bigger than K; and (iii) weight_e(v, u) = max_{i∈V} {weight_e(u, i)/edge(u, i)} for any vertex u satisfying (i) and (ii), where edge(u, i) is the connectivity between u and i.

Definition 3. (core vertex) Given an exemplar vertex U, a vertex v is a core vertex if (i) edge(U, v) exists and weight_e(U, v) > Mweight(U), where Mweight(U) is a threshold; this threshold should be higher than the median value of the weights of the edges incident to U; and (ii) v is the center of its cluster.

Definition 4. (support vertex) Given a core vertex C, a vertex s is a support vertex if (i) s is in C's cluster; and (ii) s is the center of another cluster.

From the definitions, exemplar vertices, core vertices and support vertices form a hierarchy. Exemplar vertices are in the highest class and support vertices are in the lowest class. For a vertex v, the graph induced by all vertices in its corresponding lower class is denoted by Lv. We now formally define the maximal r-radius graph problem for identifying reference entities over the graph.

Definition 5. (maximal r-radius graph) For a subgraph SG of a token graph G in which the longest distance between any two vertices is smaller than r, SG is called a maximal r-radius subgraph of G if (i) SG contains only one exemplar vertex, at least one core vertex and some support vertices; and (ii) there is no other r-radius subgraph containing SG.

Since r-radius subgraphs contain only the three kinds of vertices, namely exemplar, core and support vertices, the problem of identifying maximal r-radius subgraphs is to find these three kinds of vertices. To find these vertices efficiently, clustering is performed on the token graph. We will discuss the clustering method in Section 3.2.
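Once a clustering of the vertices is available, the conditions of Definition 2 translate into a simple filter. The sketch below only illustrates conditions (i) and (ii); condition (iii) and the representation of the inputs (a vertex-to-center map and relative frequencies) are our assumptions about how the intermediate results could be stored, not the paper's data structures.

```python
from collections import Counter

def candidate_exemplars(rel_freq, center_of, F_low, F_up, K):
    """Return cluster centers that pass conditions (i) and (ii) of Definition 2.
    rel_freq:  token -> relative frequency in [0, 1]
    center_of: vertex -> center of its cluster (centers map to themselves)
    Condition (iii), the maximal normalised-affinity test, would be applied to
    the survivors in a second pass (or resolved manually, as the paper allows)."""
    cluster_size = Counter(center_of.values())
    survivors = []
    for v, center in center_of.items():
        if v != center:                                # (ii) must be a cluster center
            continue
        if not (F_low < rel_freq.get(v, 0.0) < F_up):  # (i) frequency window
            continue
        if cluster_size[v] <= K:                       # (ii) cluster larger than K
            continue
        survivors.append(v)
    return survivors
```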
3.2. Clustering-based reference table construction algorithm

Intuitively, a straightforward way to cluster the graph is to compute the affinity of all pairs of vertices. However, this becomes prohibitively expensive when the data set is large. Consequently, this paper proposes an efficient technique whose key idea is that the number of affinity computations required for clustering is vastly reduced by shrinking the graph size. With this idea, we attempt to prune and partition the graph into overlapping subgraphs to identify the most important exemplar vertices, and then measure affinity among pairs of vertices that are close to an identified exemplar vertex. At first, to reduce the search space, graph pruning and partition strategies are discussed in Sections 3.2.1 and 3.2.2, respectively. In Section 3.2.3, we discuss the determination of synonyms to improve the quality of clustering. Finally, in Section 3.2.4, the clustering method is proposed, as well as the construction of the reference table based on the clustering results.

3.2.1. Graph pruning

To alleviate the problem that loading the whole graph into memory is not practical for large graphs, as a preprocessing step the token graph is pruned based on the frequency of tokens in the data set. Vertices in the token graph with high frequency or high association with other vertices usually carry common information that contributes little to clustering, while vertices with low frequency or low association are unimportant because they typically come from wrong spellings or other coincidental occurrences. For example, a portion of the token graph created from a collection of 50,600 camera product records is shown in Fig. 4. From common knowledge, we observe that "new", with both high frequency and high association, is a common token which is unhelpful for distinguishing subgraphs, since it connects many scattered vertices and makes the graph complex. Such tokens should be pruned. As for vertices with low frequency (e.g., vertices occurring just once) or low association (e.g., vertices connecting to no more than 2*P other vertices), their occurrence is obviously occasional, possibly stemming from spelling mistakes (e.g., "digital" misspelled as "digtal"). Such tokens should also be considered unimportant.

We prune the graph by deleting the vertices with the top-2 highest frequencies, with the intuition that each category has at least two common tokens (e.g., "digital" and "camera" in the category of camera products). Also, the vertices with association lower than 2*P or occurring just once are deleted, since they occur too rarely to affect the clustering results. Furthermore, we note that these parameters could be determined by machine learning methods to obtain more accurate results. As discussed above, the token graph is pruned at low cost. Pruning the token graph can not only reduce the required memory but also accelerate the clustering, since the running time is driven by the magnitude of connectivity between vertices.

3.2.2. Graph partition

In order to efficiently cluster a pruned token graph, it is partitioned into r-radius subgraphs based on candidate exemplar vertices, with the intuition that vertices at distance more than r are too far apart to be in the same maximal r-radius subgraph. Moreover, vertices far from each other are usually far in semantics, because the tokens corresponding to them are far apart in records. Consider r2 in Table 2: "sony" and "5.1" are far apart, and they indeed have little affinity in both semantics and position. This observation helps to obtain accurate clustering results and to improve efficiency. In this procedure, the partition is based on the candidate exemplar vertices.
Even though the candidate exemplar vertices are chosen randomly, accuracy can be assured by strategies including allowing the vertices of different subgraphs to overlap and setting a proper radius r (range threshold). The partition is performed by traversing the token graph from each candidate exemplar v, and the vertices with distance smaller than r are considered to be in the same partition. On each subgraph, we employ an existing clustering method such as the Affinity Propagation algorithm (Frey and Dueck, 2007).
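A sketch of the pruning heuristics of Section 3.2.1 and the radius-r partition of Section 3.2.2 is given below. The adjacency structure (vertex -> set of neighbours) and the breadth-first traversal are our implementation choices; the thresholds (top-2 frequency, degree below 2*P, single occurrence) are the paper's heuristics.

```python
from collections import deque

def prune(adj, freq, P):
    """Drop the two most frequent tokens and every vertex that occurs only
    once or has fewer than 2*P neighbours (Section 3.2.1)."""
    top2 = set(sorted(freq, key=freq.get, reverse=True)[:2])
    keep = {v for v in adj
            if v not in top2 and freq.get(v, 0) > 1 and len(adj[v]) >= 2 * P}
    return {v: {u for u in adj[v] if u in keep} for v in keep}

def partition(adj, candidates, r):
    """Grow one (possibly overlapping) subgraph of radius r around each
    candidate exemplar vertex (Section 3.2.2)."""
    parts = []
    for e in candidates:
        if e not in adj:
            continue
        dist = {e: 0}
        queue = deque([e])
        while queue:
            u = queue.popleft()
            if dist[u] == r:          # stop expanding at distance r
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        parts.append(set(dist))
    return parts
```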
Fig. 4. A portion of the token graph G. The size of a token reflects its frequency: the higher the frequency, the larger the size. Here, the frequency and association of "new" are 0.03 and 2320, respectively. Vertex "n1" represents "supervalue" with frequency 0.000002, and "n2" represents "flashcam" with association 4.
Example 2. Consider the data set in Table 2. The candidate exemplar vertices are chosen randomly; a possible set is {"megapixel", "shot"}. Obviously, this is not a good choice, since we know the exemplar vertex here should be "sony". With these two vertices as candidates and setting r = 1, we cluster the graph and obtain the two partitions shown in Fig. 5. Then the set {"cyber", "sony"} is obtained as the clustering center set. We may then choose "sony" as an exemplar vertex from the set manually.
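The paper runs Affinity Propagation (Frey and Dueck, 2007) on each partition. One readily available implementation is scikit-learn's, used below on a similarity matrix assembled from the edge weights of the build_token_graph sketch; the wrapper and the choice of library are ours, not the authors'.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation  # one off-the-shelf implementation

def cluster_partition(vertices, weights):
    """Cluster one partition. vertices: ordered list of tokens in the partition;
    weights: edge-weight dict {(u, v): weight_e} used as precomputed similarity."""
    idx = {v: i for i, v in enumerate(vertices)}
    S = np.zeros((len(vertices), len(vertices)))
    for (u, v), w in weights.items():
        if u in idx and v in idx:
            S[idx[u], idx[v]] = S[idx[v], idx[u]] = w
    ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
    centers = [vertices[i] for i in ap.cluster_centers_indices_]
    center_of = {vertices[i]: centers[label] for i, label in enumerate(ap.labels_)}
    return centers, center_of   # cluster centers and a vertex -> center map
```

The returned vertex-to-center map has exactly the shape assumed by the candidate_exemplars sketch in Section 3.1.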
Considering the complexity of the vertices and the differences in their properties, to assure the quality of clustering, exemplar vertices may be chosen manually. This only requires choosing a small number of exemplar vertices. For example, we applied our method on a real-world data set and obtained a center set with 62 vertices from nearly 20,000 vertices at the first iteration, and 9 of them are ideal exemplar vertices. This means that about 1/7 of them are chosen as real exemplar vertices.

3.2.3. Synonym identification

Identifying synonyms helps improve the clustering results. An accurate method for finding synonyms is to record all possible properties of different vertices, such as frequency, neighbors and the weights to neighbors. However, with many vertices to be compared, the cost of such a method is high.

Consider the three kinds of vertices. Exemplar vertices are clustered from the token graph; as the centers of their clusters, they can hardly be "close" either in neighbors or in associations. As for support vertices, they are the most distinguishing vertices, so they rarely have synonyms. Therefore, only core vertices are considered. Observing that synonyms often co-occur and have strong associations, a cheap and approximate method is adopted instead. This method is based on graph similarity, as defined in Definition 6.

Definition 6. (graph similarity) Given two core vertices C1 and C2 in a graph, their low-class graph similarity, S(C1, C2), is |N1 ∩ N2|/|N1 ∪ N2|, where N1 and N2 denote the direct support vertex sets of C1 and C2, respectively.

For each core vertex v, a comparison is performed between v and each vertex v' in the same class as v, on frequency and on the graph similarity between Lv and Lv'. Simultaneously, a comparison is performed with the vertices in both the higher class and the lower class, on affinity. Considering the transitivity between synonyms, vertices grouped together separately will be included in the same group eventually.

Example 3. Suppose we get "sony" as the exemplar vertex, and "cyber", "shot" and "dsc" can be found as core vertices according to Definition 3.
Fig. 5. Graph partition based on candidate exemplar vertices.

Fig. 6. The graph model for some records.
By comparing weight_e(cyber, dsc) with the weights of other token pairs, and by comparing fre(cyber) and fre(dsc) with the frequencies of other tokens, we observe that "cyber" and "dsc" have an overwhelmingly stronger affinity with each other. So they are grouped together as synonyms. Also, according to the token graph shown in Fig. 6, by computing the graph similarity of L"dsc" and L"shot", SIM("dsc", "shot") = 5/8. Then we identify "dsc" and "shot" as synonyms. Because of transitivity, "cyber", "shot" and "dsc" are all synonyms.
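A sketch of Definition 6 and the transitive grouping of synonyms follows. The support-set mapping and the numeric similarity threshold are placeholders, since the paper does not fix a threshold value.

```python
from itertools import combinations

def graph_similarity(n1, n2):
    """Definition 6: Jaccard overlap of the direct support-vertex sets."""
    return len(n1 & n2) / len(n1 | n2) if (n1 or n2) else 0.0

def synonym_groups(support_sets, threshold=0.5):
    """support_sets: core vertex -> set of its direct support vertices.
    Core vertices whose low-class graphs are similar enough are merged;
    transitivity is handled with a small union-find. The threshold value
    is a placeholder, not one given in the paper."""
    parent = {v: v for v in support_sets}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in combinations(support_sets, 2):
        if graph_similarity(support_sets[a], support_sets[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for v in support_sets:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())
```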
3.2.4. The construction of the reference table

In this section, the construction method of the reference table is presented as a summary of the previous sections. The pseudo code of the reference table construction algorithm is shown in Algorithm 1. The input of this algorithm is the token graph G. Firstly, the weights of its edges are computed (Line 1) and the graph is pruned with the strategy in Section 3.2.1 (Line 2). The candidate exemplar vertices are selected randomly or according to the users' selection (Line 3). The selected candidates are clustered (first stage) according to Definition 2 by employing the Affinity Propagation algorithm (Frey and Dueck, 2007) (Line 4). After that, based on the clusters, in Line 5, the exemplar vertices are labeled by manual work with low overhead or according to some knowledge base. Subsequently, in Line 6, G is partitioned into subgraphs and the subgraphs are clustered (second stage) to find candidate core vertices and their support vertices (Lines 7–10), as discussed in Section 3.2.2. Affinity and graph similarity of these core vertices are computed to identify synonyms among core vertices with the strategies in Section 3.2.3 (Line 11). Finally, the maximal r-radius subgraphs are found (Line 12). Since each maximal r-radius subgraph identifies an entity, for each maximal r-radius subgraph SG, an entry is inserted into the reference table with the name formed as the concatenation of the tokens corresponding to the vertices in SG (Lines 14 and 15).

Algorithm 1: Reference table construction algorithm
Input: A token graph G
Output: Reference table T
1.  Add weights to each edge of G
2.  Prune useless vertices and edges of G
3.  Select candidate exemplar vertex set S from G
4.  G is clustered according to S into {G1, G2, . . ., Gk}
5.  Each v ∈ S is assigned a label lv
6.  G is partitioned into {P1, P2, . . ., Pr}
7.  for each Pi do
8.      candidate core vertex set Ci is selected
9.      for each v ∈ Ci do
10.         the support vertex set Rv is selected
11. Find synonyms with a similarity join on the core vertex sets
12. for each partition Pi do
13.     if Pi is a maximal r-radius graph then
14.         extract token set Ti from Pi
15.         insert Ti into T
16. return T

To illustrate the flow, consider the example below.

Example 4. To construct the reference table for the record set in Table 2, we first create the graph in Fig. 6. Then the graph is clustered in the first stage and the exemplar vertex "Sony" is obtained. Next the graph is clustered in the second stage and the core vertex set {"dsc", "cyber", "shot"} is obtained. Then we find the synonym set {"dsc", "cyber", "shot"}. Their support vertex set is {"w150", "t5", "tx5", "w350", "w120"}. Finally, we get the refined maximal 3-radius graphs in Table 3. Each subgraph is stored as an entry in the table, representing an entity.

Table 3
Maximal 3-radius graphs.

No.    Maximal 3-radius graph
1      Sony {Cyber, shot, DSC} W150
2      Sony {Cyber, shot, DSC} T5
3      Sony {Cyber, shot, DSC} TX5
4      Sony {Cyber, shot, DSC} W350
5      Sony {Cyber, shot, DSC} W120

Complexity analysis: Let the numbers of vertices and edges in G be n and m, respectively. The time complexity of Lines 1–3 is O(m). According to the algorithm in Frey and Dueck (2007), the time complexity of the clustering algorithm is O(kn), where k is the maximal number of iterations, which can be treated as a constant. Thus the time complexity of Line 4 and Line 6 is O(n). The time complexity of Line 11 is O(N^2) in the worst case, where N is the number of core vertices. Lines 7–10 and Lines 12–15 also take O(n). In summary, the time complexity is O(m + N^2) in the worst case. During the execution of this algorithm, the required extra space does not exceed the size of the token graph. Thus the space complexity is the same as that of the input graph.
4. Token weight decision

In this section, we first discuss the main idea of the weight decision methods in existing proposals. Next, we propose a new measure based on the maximal r-radius subgraphs and synonyms presented in Section 3.

4.1. TF·IDF-based weight

A popular method is to use a standard IR-ranking formula (Chaudhuri et al., 2009; Hristidis and Papakonstantinou, 2002; Liu et al., 2006). For example, the TF·IDF-based method weights an entity by considering the term frequency (tf) and inverse document frequency (idf) of its tokens. Although TF·IDF-based evaluation methods are effective for textual documents, they do not perform well enough for structured data such as entity names in reference tables, for the following two reasons. (i) Simply assuming that the less a token occurs, the more important it is, is not reasonable, since the importance of a token potentially relates to many factors such as market share. (ii) This method can show neither the position information of a token nor its importance.

Example 5. Consider the record r = "sony cybershot dsc w80 8.1 mp lowprice/supervalue". Token weights are IDF weights, computed from a collection of 50,600 real-world records. Based on the computation, the token weights are sony (1.04), cybershot (1.72), dsc (1.18), w80 (2.57), 8.1 (2.01), mp (0.78), lowprice (4.40) and supervalue (4.40). Note that the meaningless token (e.g., "lowprice") has a relatively high weight and may affect the result.

4.2. Hierarchy-based weight

To determine the weights of tokens, we should consider the features of the different kinds of vertices. Considering that synonyms are typically similar both in position and in importance, our weight decision method regards synonyms as one group, and the synonyms in the same group share the same weight. Therefore, synonyms in the same group are considered as the same token during the determination of weights. Note that two maximal r-radius graphs are distinguished at least by their different support vertices, so support vertices are more likely to be meaningful and important under the same exemplar vertex. Accordingly, their weights should be larger if their exemplar vertex occurs in the same record. We now define the token weight formally:

weight(t) =
  0            if t is a pruned vertex in the graph
  1            if t is an exemplar vertex or a core vertex
  X (X ≥ 1)    if t is a support vertex and its exemplar vertex occurs in the record
  Y (Y ≤ 1)    otherwise                                                            (2)

An example of weight decision is shown in Table 4.

Table 4
Weights of tokens in an entity.

Token                 Weight
sony                  1
{cyber, shot, dsc}    1
tx5                   3 if "sony" occurs in the record, 1 otherwise

Example 6. Consider the same record as in Example 5. In our method, the token weights can be sony (1.0), cybershot (1.0), dsc (1.0), w80 (3.0), 8.1 (0.5), mp (0.5), lowprice (0.0) and supervalue (0.0). We note that this captures a more reasonable relationship between tokens. For example, "lowprice" has IDF weight 4.4 but is obviously unhelpful for matching. In our method, its weight is 0.0 and it will not affect the result.

We now define the correlation between an entity e and a record r. At first, we define a stricter notion g1 of a record referring to an entity e, where all tokens of e are required to be present in the record:

g1 = 1 if T(e) ⊆ T(r), and 0 otherwise    (3)

We also define a relaxed notion g2 of a record referring to an entity e, which is qualified by the fraction of tokens present in either T(e) or T(r):

g2 = \sum_{t ∈ T(r)∩T(e)} weight(t) / \sum_{t ∈ T(r)∪T(e)} weight(t)    (4)

Obviously, g1 needs less time to compute, by simply comparing the records from inverted document indices. However, it is stricter and may have lower precision, since all tokens of the entity should be present. For example, the record "Cyber Shot TX5 for parts" indeed refers to the entity "Sony Cyber Shot TX5", but g1 cannot capture this because of the lack of "Sony". On the contrary, g2, as a relaxed notion, can solve this problem, at the cost of more computation and space.
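To make Formulas (2)–(4) concrete, the sketch below assigns hierarchy-based weights and computes the two correlation notions for a record/entity pair. The way an entity is represented (exemplar, core group, support token), the default X and Y values and the sample record are illustrative assumptions; the formulas themselves follow the paper.

```python
def token_weight(t, record_tokens, entity, X=3.0, Y=1.0, pruned=frozenset()):
    """Formula (2); entity = (exemplar, core_group, support_token).
    X, Y and the pruned set are illustrative defaults (Table 4 uses X = 3)."""
    exemplar, cores, support = entity
    if t in pruned:
        return 0.0
    if t == exemplar or t in cores:
        return 1.0
    if t == support and exemplar in record_tokens:
        return X
    return Y

def entity_tokens(entity):
    exemplar, cores, support = entity
    return {exemplar, support} | set(cores)

def g1(record_tokens, entity):
    """Formula (3): 1 iff every token of the entity appears in the record."""
    return 1.0 if entity_tokens(entity) <= record_tokens else 0.0

def g2(record_tokens, entity, **kw):
    """Formula (4): weighted-Jaccard-style correlation between r and e."""
    ets = entity_tokens(entity)
    inter = sum(token_weight(t, record_tokens, entity, **kw) for t in record_tokens & ets)
    union = sum(token_weight(t, record_tokens, entity, **kw) for t in record_tokens | ets)
    return inter / union if union else 0.0

e = ("sony", {"cyber", "shot", "dsc"}, "tx5")    # entity from Table 4
r = {"sony", "cyber", "shot", "tx5", "black"}    # tokens of "Sony Cyber-Shot TX5 Black"
print(g1(r, e), round(g2(r, e), 2))              # 0.0 0.75
```

The WJS measure used in Section 5 has the same weighted-Jaccard shape as g2, with the TF·IDF weights of the given reference table substituted for weight(t).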
5. Experimental study

We use real data for the experiments. The test data are extracted from the EBAY web site (http://www.ebay.com) and form a collection of more than 50,000 records input by various users. The number of tokens in one record varies from 5 to 30, with 12 tokens on average. The experiments were conducted on an Intel(R) Core(TM) 1.67 GHz computer with 2 GB of RAM running Windows XP, and the algorithms were implemented in the C language.

To test the quality of the reference entities (maximal r-radius subgraphs), we compare our constructed reference table with a given reference table manually created from the official websites of the products, using a representative string-based similarity measure, the weighted Jaccard similarity measure (Elmagarmid et al., 2007). In fact, many other string similarity measures, e.g., edit distance, can be rewritten in terms of Jaccard similarity (Chakrabarti et al., 2008). It is defined as follows:

WJS(r, e) = \sum_{t ∈ T(r)∩T(e)} weight(t) / \sum_{t ∈ T(r)∪T(e)} weight(t)

The weights of tokens in the given reference tables are TF·IDF weights. We also compare our method with the method in Chaudhuri et al. (2009).

5.1. Comparison with given reference table

We denote by WJS the weighted Jaccard similarity measure, by G1 the measure with g1 (Formula (3)), and by G2 the measure with g2 (Formula (4)), where weight(t) is decided according to Formula (2) (we roughly set X = Y = 1). We vary the threshold from 0.2 to 0.6 to test its impact on all three measures. The precision-recall curves are plotted in Fig. 7. It is observed that the reference table constructed by our method performs significantly better than the given reference table.

Fig. 7. Precision-recall w.r.t. measures.
Fig. 8. Precision w.r.t P.
Among G1 and G2, G1 has a constant precision and a constant recall; this is because G1 only requires all tokens of the entity to be present in the record. The high precision and recall verify the high quality of the reference table constructed by our method. Since G2 is a relaxed formulation, it has relatively lower precision and recall for certain thresholds, but still higher than those of WJS.
5.2. Effect of P-distance

In this section, we test the effect of the edge property, P-distance. We vary the value of P from 1 to 12 and test the quality of the reference tables. The impact on precision and recall is shown in Fig. 8 and Fig. 9, respectively. It is observed that precision is not very sensitive to P, since it remains high for all tested values of P. The recall generally increases with P: with a relatively larger P, the recall becomes higher, since more important and richer connectivity between tokens is found. When P reaches 5, recall peaks; beyond that point, the interference caused by the growth of P, namely that weights and associations between vertices with low affinity are strengthened, affects the clustering results. We also test the efficiency of our reference table construction method. The results with different P are shown in Fig. 10: a larger P results in more associations and weights, and as a result the cost of clustering gets larger. From the results in Figs. 8–10, it is observed that P = 5 is a good trade-off between time cost and result quality.

Fig. 9. Recall w.r.t. P.
Fig. 10. Exec. time at the first iteration.

5.3. The impact of record number

To demonstrate the impact of the number of records on the effectiveness of our method, we set P = 5 and vary the number of records from 550 to 50,600. To test the precision and recall with various numbers of records, we choose both relatively small numbers, i.e. 550 and 5500, and large numbers, i.e. 10,000 and 50,600. The results are reported in Table 5. We observe that our method is very effective when the number of records is relatively large.

Table 5
Precision and recall w.r.t. the number of records at the first iteration.

Number    Precision    Recall
550       0.78         0.32
5,500     0.79         0.38
10,000    0.92         0.72
50,600    0.91         0.96

5.4. Universality

To test the universality of our method, we select 4 categories of popular goods, as shown in Table 6. We extract 50,600 records from the EBAY web site (http://www.ebay.com) in the category of camera, and 10,000, 3580 and 500 records in the categories of watch, laptop and shoes, respectively. Employing our method, the precision and recall of the results are shown in Table 6. From these results, it is observed that our method performs well on various data sets, especially when the data set is large and the product names themselves have a hierarchy. This shows that our method has good universality.

Table 6
Categories employed in the experiments and their precision and recall at the first iteration.

CID    Name      Precision    Recall
C1     Camera    0.91         0.96
C2     Watch     0.87         0.91
C3     Laptop    0.80         0.42
C4     Shoes     0.80         0.37

5.5. Comparisons with existing method

To show the benefits of our method, we compare it with Chaudhuri et al. (2009), which is a state-of-the-art reference-table-based method. For the convenience of precision and recall computation, we randomly choose 100, 200, . . ., 600 products from the camera data set and manually label the entities they refer to.
Table 7
Comparisons with Chaudhuri et al. (2009).

          Chaudhuri et al. (2009)                     Our method
Number    Pre    Rec    F–S    CT (s)   RT (s)        Pre    Rec    F–S    CT (s)   RT (s)
100       0.52   0.02   0.02   1987     37.12         0.61   0.24   0.17   272      35.09
200       0.78   0.06   0.06   2045     41.98         0.69   0.24   0.18   311      40.77
300       0.63   0.05   0.05   2177     54.31         0.71   0.30   0.21   450      55.62
400       0.77   0.01   0.01   2300     66.72         0.75   0.35   0.24   591      67.01
500       0.74   0.02   0.02   2405     73.05         0.75   0.38   0.25   612      72.59
600       0.80   0.02   0.02   2512     74.88         0.82   0.37   0.25   701      73.35
We implemented the method in Chaudhuri et al. (2009) ourselves and tuned it by varying the thresholds and the window size. As a result, we chose a threshold of 0.6 and a window size of 50, which balance precision and recall. To find proper documents for generating their reference table, we input the keywords in the product names to a web search engine and extract the related documents. The comparison results are shown in Table 7, where Pre denotes precision, Rec denotes recall, F–S denotes F-score, CT denotes reference table construction time and RT denotes entity resolution time. From the results, the quality of the reference tables generated by our method outperforms that of the existing method, since our method considers the relationships between tokens sufficiently. The construction time of our method is also lower than that of Chaudhuri et al. (2009), because their method requires processing a large amount of documents while our method only needs to process the product information. The entity resolution times are similar, because both methods only need to look up the reference table and the sizes of the reference tables are similar.
6. Related work

Many researchers and practitioners have recognized the desirability of grouping similar records in a database into clusters. The methods used tend to fall into two categories: string similarity functions and clustering algorithms. Many previous approaches for finding near-duplicate records rely on string similarity functions, which measure similarity by considering the information from the candidate string and the target entity string that it would match with (e.g., Chandel et al., 2006; Chaudhuri et al., 2006). However, string similarity can hardly capture the correlation between tokens, and the count of records containing the original entity is often low in the real world. These limitations reduce accuracy. To improve the string similarity between input records, techniques for labeling matching and non-matching records by users have been developed (Arasu et al., 2009; Michelson and Knoblock, 2007). These techniques primarily rely on "unaligned" token set pairs, so they cannot discover the important class of subset synonyms. Co-occurrence is also used to identify synonyms (e.g., Manning and Schütze, 1999), and mutual information has been considered for quantifying distributional similarity between words (Lin, 1998). But co-occurrence is more applicable to the "symmetric" scenario. Chaudhuri et al. (2009) proposed an approach to identify variations of entities in a reference table. However, it also relies on existing entity strings and is sensitive to the order of tokens.

Some of the latest work (Shu et al., 2011; Rastogi et al., 2011; Kolb et al., 2012) focuses on the efficiency of entity resolution. Blocking (Shu et al., 2011), partitioning (Rastogi et al., 2011) and cloud strategies (Kolb et al., 2012) are used to accelerate entity resolution. Compared with these strategies, our method avoids pairwise comparisons of records and thus naturally accelerates entity resolution. Additionally, the reference table is constructed offline and the online part is quite efficient.
A separate line of work on entity matching clusters records based on many fields. Records are usually first sorted separately on multiple keys (such as email address or social security number) (Whang et al., 2009; Whang and Garcia-Molina, 2010, 2012). This method assumes an exact match in at least one field but cannot cluster "near" duplicate records. Also, a specific key can be used to sort records, and then a more expensive similarity is computed between records that are close in the sorted list (Monge and Elkan, 1996, 1997). The computing cost is large if each record has to be compared with all the other records in the same sorted list.

7. Conclusions and future work

In this paper, we considered the problem of constructing a high-quality reference table for heterogeneous data. We model tokens as a graph and find reference entities on the graph. To the best of our knowledge, this is the first attempt to efficiently construct an effective reference table for different data sources. We proposed a hierarchical clustering architecture that takes into account the structural relationships and importance of tokens. We considered measures for quantifying the correlation between a reference entity and a record, and developed a measuring function according to the positions and importance of tokens. Using real data, we demonstrated the efficiency and effectiveness of our method. As future work, we plan to develop effective and efficient algorithms on external memory for massive data. In particular, we plan to consider graphs with larger and more complex structure.

Acknowledgements

This paper was partially supported by NGFR 973 grant 2012CB316200, NSFC grants 61003046 and 61111130189, and NGFR 863 grant 2012AA011004; the Doctoral Fund of the Ministry of Education of China (No. 20102302120054); the Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education (No. KF2011003); and the Fundamental Research Funds for the Central Universities (No. HIT. NSRIF. 2013064).

References

Agrawal, S., Chakrabarti, K., Chaudhuri, S., Ganti, V., 2008. Scalable ad-hoc entity extraction from text collections. In: VLDB.
Arasu, A., Chaudhuri, S., Kaushik, R., 2009. Learning string transformations from examples. In: VLDB.
Bellare, K., Iyengar, S., Parameswaran, A.G., Rastogi, V., 2012. Active sampling for entity matching. In: KDD.
Chakrabarti, K., Chaudhuri, S., Ganti, V., Xin, D., 2008. An efficient filter for approximate membership checking. In: SIGMOD Conference, pp. 805–818.
Chandel, A., Nagesh, P.C., Sarawagi, S., 2006. Efficient batch top-k search for dictionary-based entity recognition. In: ICDE, p. 28.
Chaudhuri, S., Ganti, V., Kaushik, R., 2006. A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5.
Chaudhuri, S., Ganti, V., Xin, D., 2009. Mining document collections to facilitate accurate approximate entity matching. In: VLDB.
Cohen, W.W., Sarawagi, S., 2004. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In: KDD, pp. 89–98.
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S., 2007. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering 19 (1), 1–16.
Frey, B.J., Dueck, D., 2007. Clustering by passing messages between data points. Science 315, 972–976.
Getoor, L., Machanavajjhala, A., 2012. Entity resolution: theory, practice and open challenges. In: PVLDB.
Hristidis, V., Papakonstantinou, Y., 2002. Discover: keyword search in relational databases. In: VLDB.
Kolb, L., Thor, A., Rahm, E., 2012. Load balancing for MapReduce-based entity resolution. In: ICDE.
Koudas, N., Sarawagi, S., Srivastava, D., 2006. Record linkage: similarity measures and algorithms. In: SIGMOD Conference, pp. 802–803.
Lin, D., 1998. Automatic retrieval and clustering of similar words. In: COLING-ACL, pp. 768–774.
Liu, F., Yu, C., Meng, W., Chowdhury, A., 2006. Effective keyword search in relational databases. In: SIGMOD.
Manning, C., Schütze, H., 1999. Foundations of Statistical Natural Language Processing. The MIT Press.
Michelson, M., Knoblock, C.A., 2007. Mining heterogeneous transformations for record linkage. In: IIWeb, pp. 68–73. AAAI Press.
Monge, A., Elkan, C., 1996. The field-matching problem: algorithms and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
Monge, A., Elkan, C., 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery.
Rastogi, V., Dalvi, N.N., Garofalakis, M.N., 2011. Large-scale collective entity matching. In: PVLDB.
Shu, L., Chen, A., Xiong, M., Meng, W., 2011. Efficient SPectrAl neighborhood blocking for entity resolution. In: ICDE.
Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H., 2009. Entity resolution with iterative blocking. In: SIGMOD.
Whang, S.E., Garcia-Molina, H., 2010. Entity resolution with evolving rules. In: PVLDB.
Whang, S.E., Garcia-Molina, H., 2012. Joint entity resolution. In: ICDE.

Fangda Wang was born in 1988. She is a master's student at the National University of Singapore. She graduated from Harbin Institute of Technology in 2010. Her research area is databases.

Hongzhi Wang was born in 1978 and holds a PhD. He is an associate professor. His research area is data management, including data quality, XML data management and graph management. He is a recipient of the outstanding dissertation award of CCF, a Microsoft Fellowship and an IBM PhD Fellowship.

Jianzhong Li was born in 1950. He is a professor and doctoral supervisor at Harbin Institute of Technology. He is a senior member of CCF. His research interests include databases, parallel computing and wireless sensor networks.

Hong Gao was born in 1966. She is a professor and doctoral supervisor at Harbin Institute of Technology. She is a senior member of CCF. Her research interests include data management, wireless sensor networks and graph databases.