Advances in Engineering Software 37 (2006) 129–132 www.elsevier.com/locate/advengsoft
Short Communication
ADSS: An approach to determining semantic similarity
Lixin Han a,b,c,*, Linping Sun a, Guihai Chen b, Li Xie b
a Department of Mathematics, Nanjing University, Nanjing 210093, People’s Republic of China
b State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210093, People’s Republic of China
c Department of Computer Science and Engineering, Hohai University, Nanjing 210024, People’s Republic of China
Received 7 October 2004; received in revised form 6 May 2005; accepted 20 May 2005
Available online 12 July 2005
Abstract
Determining semantic similarity is an important issue in the development of semantic search technology. In this paper, we propose an approach to determining semantic similarity. The approach takes into consideration both the similarity between two entities and their similarity reflected in context. Furthermore, it combines an efficient Tabu Search algorithm with a multi-objective programming algorithm to improve precision.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Semantic Web; Ontology; Information retrieval; Search engine; Optimization
1. Introduction
The volume of information on the Web is increasing dramatically, and helping users find useful information has become more and more important to information retrieval systems. Although information retrieval technology has improved greatly, users remain dissatisfied with its low precision and recall. The wide availability of machine-understandable information on the Semantic Web offers opportunities to improve traditional search, and some semantic search methods [1,2] have been proposed for this purpose. Just as the ranking of documents is a critical component of today’s search engines, the ranking of relationships will be essential for tomorrow’s semantic search engines that would support discovery and mining of the Semantic Web [3]. However, ranking a set of interconnected entities and relations is more complex than ranking a set of documents or paths of semantic associations [3]. The semantic ranking approach considers the total number of entities and relations that match a user’s interests by assigning a computed value to each of them. How to compute the semantic similarity is a critical issue in semantic ranking.
In this paper, we propose a method called ADSS (an Approach to Determining Semantic Similarity). The approach takes into consideration such criteria as the similarity between two entities and their similarity reflected in context. The ranking score is defined as a function of some particular parameters. ADSS differs from other methods in that it combines an efficient Tabu Search algorithm with an efficient multi-objective programming algorithm to improve precision.

* Corresponding author. Address: Building 8, Apartment 105, Second New Village, West Beijing Road, 210008, Nanjing, Jiangsu, China.
0965-9978/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.advengsoft.2005.05.003
2. Related work
Aleman-Meza et al. [3] discuss a framework that uses ranking techniques to identify more interesting and more relevant semantic associations. They define a ranking formula that considers subsumption weight, path length weight, context weight and trust weight, and use it to assess the effectiveness of the ranking scheme outlined. Rodriguez and Egenhofer [4] present an approach to computing semantic similarity across different ontologies. A similarity function determines similar entity classes by using a matching process over synonym sets, semantic neighborhoods, and distinguishing features. In the SWAP project, Broekstra et al. [5] aim at overcoming the lack of semantics by combining the Peer-to-Peer paradigm with Semantic Web technologies. They propose a data model for encoding semantic information
that combines ontology features with a flexible description and rating model. Rodriguez and Egenhofer’s approach [4] presents three ideas: word matching, feature matching, and semantic-neighborhood matching. Broekstra et al. [5] extend this approach with a fourth idea: instance matching. Thus, two objects can be identified through these similarity measures. Pekar and Staab [6] address the problem of automatically enriching a thesaurus by classifying new words into its classes. The proposed classification method makes use of both the distributional data about a new word and the strength of the semantic relatedness of its target class to the other likely candidate classes. In contrast to the above work, ADSS introduces a multi-objective programming algorithm to compute the weights and the Tabu Search to compute the optimal solution. Hence the approach can obtain results with higher precision.
3. ADSS method
ADSS is an approach to determining the semantic similarity among a set of entities from different ontologies. An ontology is an explicit specification of a conceptualization. In an ontology, definitions associate the names of entities in the universe of discourse with human-readable text describing what the names mean, and with formal axioms that constrain the interpretation and well-formed use of these terms [7].

3.1. The similarity between entities
ADSS first considers the similarity between two entities. Formula (1) denotes the similarity between two entities $a^p$ and $b^q$:

$$S(a^p, b^q) = \frac{|A \cap B|}{|A \cap B| + \alpha(a^p, b^q)\,|A - B| + \bigl(1 - \alpha(a^p, b^q)\bigr)\,|B - A|}, \quad \text{for } 0 \le \alpha \le 1 \tag{1}$$

where $a^p$ is an entity of ontology p, $b^q$ is an entity of ontology q, $|\cdot|$ is the cardinality of a set, and the function $\alpha$ is defined in terms of the depth of the entities. It expresses that similarity values are greater from deep to shallow entities than from shallow to deep entities. $A = \{u_{a1}, \ldots, u_{an}\}$ and $B = \{u_{b1}, \ldots, u_{bn}\}$, where $u_{a1}, \ldots, u_{an}$ are the features of $a^p$, and $u_{b1}, \ldots, u_{bn}$ are the features of $b^q$.

$$\alpha(a^p, b^q) = \frac{\mathrm{len}(a^p)/\mathrm{maxlen}(a^p)}{\mathrm{len}(a^p)/\mathrm{maxlen}(a^p) + \mathrm{len}(b^q)/\mathrm{maxlen}(b^q)}, \quad \text{for } \mathrm{len}(a^p) \le \mathrm{len}(b^q) \tag{2}$$

$$\alpha(a^p, b^q) = \frac{\mathrm{len}(b^q)/\mathrm{maxlen}(b^q)}{\mathrm{len}(a^p)/\mathrm{maxlen}(a^p) + \mathrm{len}(b^q)/\mathrm{maxlen}(b^q)}, \quad \text{for } \mathrm{len}(b^q) < \mathrm{len}(a^p) \tag{3}$$

In formulae (2) and (3), the function len() is the length of the shortest path from the entity to the root, and the function maxlen() is the length of the shortest path from the root to the leaf through the entity. Related research on similarity measures [4] assumes that entities located at a lower level of an ontology are more meaningful than those located at a higher level. Thus, more weight is assigned to more ‘specific’ semantic entities. In contrast to the equations in [4], ADSS employs the relative length of the path instead of the absolute length. The absolute length of a path considers only len(), whereas the relative length considers both len() and maxlen(). Because it accounts for both, the relative length of the path is a more reasonable measure.

3.2. The contextual similarity between entities
In addition to the features of the entities themselves, ADSS also considers the features of adjacent entities across neighborhoods. Formula (4) denotes the contextual similarity between two entities:

$$SN(a_i^p, b_j^q) = (k/n) \times (k/m) \sum_{i \le n} \sum_{j \le m} S(a_i^p, b_j^q), \quad \text{for } k/n \le 1,\ k/m \le 1 \tag{4}$$

where $a_i^p$ and $b_j^q$ are entities in the semantic neighborhoods of $a^p$ and $b^q$, respectively, n and m are the numbers of entities in the corresponding semantic neighborhoods, k is an amplification constant, and the function S(), given by formula (1), is the semantic similarity between entities. The function SN() is the semantic similarity of entities across neighborhoods. In contrast to the equations in [4], formula (4) is easier to calculate.
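Formulae (1) to (4) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the treatment of degenerate cases (zero depths, empty feature sets, empty neighborhoods), the assumption that maxlen() is positive, and the example entities are all assumptions introduced here.

```python
from itertools import product


def alpha(len_a, maxlen_a, len_b, maxlen_b):
    """Formulae (2)-(3): weight built from the relative depths len/maxlen."""
    rel_a = len_a / maxlen_a  # assumes maxlen > 0
    rel_b = len_b / maxlen_b
    total = rel_a + rel_b
    if total == 0:
        # Assumption (not in the paper): two root entities get a symmetric weight.
        return 0.5
    # Formula (2) applies when len(a) <= len(b), formula (3) otherwise.
    return rel_a / total if len_a <= len_b else rel_b / total


def similarity(features_a, features_b, a):
    """Formula (1): shared features over weighted shared and distinct features."""
    A, B = set(features_a), set(features_b)
    common = len(A & B)
    denom = common + a * len(A - B) + (1 - a) * len(B - A)
    # Assumption: a zero denominator is treated as "no similarity".
    return common / denom if denom else 0.0


def neighborhood_similarity(neigh_a, neigh_b, k, sim):
    """Formula (4): (k/n)(k/m) times the sum of S over all neighbor pairs."""
    n, m = len(neigh_a), len(neigh_b)
    if n == 0 or m == 0:
        return 0.0
    if k > n or k > m:
        raise ValueError("formula (4) requires k/n <= 1 and k/m <= 1")
    total = sum(sim(x, y) for x, y in product(neigh_a, neigh_b))
    return (k / n) * (k / m) * total


# Hypothetical entities: a course in ontology p and a module in ontology q.
course = {"name", "credits", "lecturer"}
module = {"name", "credits", "room", "term"}
a = alpha(len_a=2, maxlen_a=4, len_b=3, maxlen_b=4)
s = similarity(course, module, a)
```

Here `sim` is any callable that scores a pair of neighboring entities; in ADSS it would be S() from formula (1) applied to the neighbors' feature sets.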
3.3. Determining semantic similarity criteria
First, the WM algorithm is used to compute the weight $u_1$. Then, formulae (1) and (4) are combined into formula (5) by the widely used linear weighting method [8]. Finally, the ITSTDSS algorithm is employed to compute the optimal solution. In this way, the semantic similarity among the set of entities is determined.

$$f = \mathrm{MAX}\bigl(u_1\, S(a^p, b^q) + (1 - u_1)\, SN(a_i^p, b_j^q)\bigr), \quad \text{for } u_1 \ge 0 \tag{5}$$

where the function S() is the semantic similarity between entities, the function SN() is the semantic similarity between entities across neighborhoods, and $u_1$ is their corresponding weight. The function f gives the maximal value of the semantic similarity among the set of entities.
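The linear weighting in formula (5) can be illustrated with a short Python sketch. It assumes the S and SN scores have already been computed for each candidate entity; the function names, the candidate names and scores, and the extra restriction of the weight to at most 1 (so that both weights stay non-negative) are assumptions made for illustration only.

```python
def combined_score(s, sn, u1):
    """Weighted sum inside formula (5): u1 * S + (1 - u1) * SN."""
    if not 0.0 <= u1 <= 1.0:
        # Assumption: the paper states u1 >= 0; u1 <= 1 keeps (1 - u1) non-negative.
        raise ValueError("u1 is expected to lie in [0, 1]")
    return u1 * s + (1.0 - u1) * sn


def best_match(candidates, u1):
    """Return the (name, score) pair with the maximal combined score.

    `candidates` maps a candidate entity name to its (S, SN) pair.
    """
    scored = ((name, combined_score(s, sn, u1)) for name, (s, sn) in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical scores for two candidate matches of one entity.
candidates = {"Course": (0.6, 0.2), "Module": (0.5, 0.5)}
balanced = best_match(candidates, u1=0.5)
entity_heavy = best_match(candidates, u1=0.9)
```

With these made-up scores, the preferred candidate changes with the weight, which is why the choice of the weight matters for the final ranking.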
3.3.1. WM algorithm for computing the weights
In contrast to [3,4], ADSS uses the WM (Weights Method) algorithm to compute the weights. The WM algorithm introduces the Powell direction accelerating method [8] for unconstrained optimization problems in order to compute the local minimum solutions. The Powell direction accelerating method [8] is an effective direct search method that does not need derivatives. In addition, the sub-objective functions are measured on different scales, so they need to be normalized. The WM algorithm is described as follows:

Input: a set of entities, function S(x, y), function SN(x, y), a given appropriate positive number M
Output: $u_1$
{
  the Powell direction accelerating method is used to compute the local minimum solution of S(x, y), that is, $s = \min_{x,y \in D} S(x, y)$;
  the Powell direction accelerating method is used to compute the local minimum solution of SN(x, y), that is, $sn = \min_{x,y \in D} SN(x, y)$;
  S(x, y) and SN(x, y) are normalized, that is, $S(x, y) = [S(x, y) + M]/s$, $SN(x, y) = [SN(x, y) + M]/sn$;
  the Powell direction accelerating method is used to compute the local minimum solution of S(x, y), that is, $s = \min_{x,y \in D} S(x, y)$;
  the Powell direction accelerating method is used to compute the local minimum solution of SN(x, y), that is, $sn = \min_{x,y \in D} SN(x, y)$;
  linear Eq. (6) is constructed to compute $u_1$:
  $$u_1\, s + (1 - u_1)\, sn = (1 - u_1)\, s + u_1\, sn \tag{6}$$
}

3.3.2. ITSTDSS algorithm for determining optimal solution
The Tabu Search (TS) is an iterative procedure designed for solving optimization problems. As described by Glover, the basic concept of the Tabu Search is a meta-heuristic superimposed on another heuristic. The Tabu Search is designed to cross boundaries of feasibility or local optimality and to systematically impose and release constraints to permit exploration of otherwise forbidden regions [9]. It has been used to solve a wide range of hard optimization problems, is still actively researched, and continues to evolve and improve. The Tabu Search proceeds on the assumption that there is no point in accepting a new solution unless it is to avoid a path already investigated. This ensures that new regions of a problem’s solution space will be investigated, so as to avoid local minima and ultimately find the desired solution. The approach is therefore to avoid entrainment in cycles by forbidding or penalizing moves to points in the solution space previously visited. In many cases, the differences between various implementations of the Tabu method have
to do with the size, variability, and adaptability of the Tabu memory to a particular problem domain [10]. The ITSTDSS (Introduce Tabu Search to Determining Semantic Similarity) algorithm introduces the Tabu Search to determine the semantic similarity among the set of entities. The ITSTDSS algorithm can escape from the current local optimal solution. In particular, an increased number of entities leads to a bigger computing workload; in this situation, it is very meaningful to seek the optimal solution that would bring better precision. In the ITSTDSS algorithm, the length of the Tabu lists is $a\sqrt[3]{n}$ $(0 < a < 1)$, where n is the number of elements in N(S). The number of candidate solutions is the number of nodes divided by 3. ‘Tabu’ prohibits the entities that have already been matched. The ITSTDSS algorithm stops if $f(s) - z_{LB} \le \varepsilon$ or $cis - cbis \ge maxno$. The aspiration criterion states that if x in the Tabu lists satisfies $f(x) < f(bs)$, x is selected, where bs is the current optimal solution.
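The Tabu Search idea described above (a bounded Tabu list, an aspiration criterion, and stopping after maxno iterations without improvement) can be sketched in Python. This is a generic minimal sketch under stated assumptions, not the ITSTDSS algorithm itself: it omits the Lagrangian lower bound, picks the best candidate by direct comparison instead of the Powell method, stores whole solutions in the Tabu list, and uses a hypothetical similarity matrix with a swap neighborhood over entity assignments.

```python
from collections import deque
from itertools import combinations


def tabu_search(cost, start, neighbors, tabu_len=5, maxno=20):
    """Minimise `cost` starting from `start`; returns (best solution, best cost)."""
    best = current = start
    best_cost = cost(best)
    tabu = deque([start], maxlen=tabu_len)  # H: most recently visited solutions
    cis = cbis = 0  # current step, and step at which the best solution was found
    while cis - cbis < maxno:
        cis += 1
        # Candidate set: neighbors that are not tabu, or tabu ones that
        # satisfy the aspiration criterion f(x) < f(bs).
        candidates = [x for x in neighbors(current)
                      if x not in tabu or cost(x) < best_cost]
        if not candidates:
            break
        current = min(candidates, key=cost)  # may be worse than the previous solution
        tabu.append(current)
        if cost(current) < best_cost:
            best, best_cost, cbis = current, cost(current), cis
    return best, best_cost


def swap_neighbors(perm):
    """All assignments reachable by swapping two positions (assumed neighborhood)."""
    for i, j in combinations(range(len(perm)), 2):
        p = list(perm)
        p[i], p[j] = p[j], p[i]
        yield tuple(p)


# Hypothetical similarity matrix: sim[i][j] scores entity i of ontology p
# against entity j of ontology q. perm[i] is the entity matched to entity i.
sim = [[0.9, 0.1, 0.2],
       [0.2, 0.8, 0.3],
       [0.1, 0.3, 0.7]]


def matching_cost(perm):
    # Negated total similarity, so that minimising cost maximises similarity.
    return -sum(sim[i][j] for i, j in enumerate(perm))


best, best_cost = tabu_search(matching_cost, (2, 1, 0), swap_neighbors)
```

Accepting the best non-tabu neighbor even when it is worse than the current solution is what lets the search leave a local optimum, while the bounded Tabu list keeps it from cycling straight back.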
The ITSTDSS algorithm is described as follows:

Input: a set of entities, initial solution s0, function f, the Tabu lists length, the element number n in N(S), the candidate number
Output: the optimal solution of the f function
{
  bs = s0;   // initialize optimal solution bs
  S = s0;    // initialize S
  cis = 0;   // cis is the current iterative step
  cbis = 0;  // cbis is the iterative step that owns the optimal solution
  H = ∅;     // initialize H, H is the Tabu lists
  the widely used Lagrange relaxation algorithm is used to solve the lower bound zLB;
  while (f(s) − zLB > ε) and ((cis − cbis) < maxno)  // ε is a given sufficiently small positive number; maxno is the maximum number of iterations allowed without improvement on the current optimal solution
  {
    cis = cis + 1;
    candidate set V* is created from N(S), where every candidate element x is either not tabu or satisfies the aspiration criterion;  // N(S) is the neighboring region of S
    the Powell direction accelerating method is used to compute the optimal solution S* from V*;
    update the Tabu lists;
    if f(S*) < f(bs) then { bs = S*; cbis = cis; }
    S = S*;
  }
}

3.4. Experimental results and discussion
In this paper, the concept of precision is used to evaluate the experimental results of the approach. Precision is the proportion of entities that are actually similar to each other. This is similar to the standard evaluation measures in information retrieval. Ontologies such as teaching, institution, staff, project, student, course, service and paper are constructed to show that the ADSS approach has a higher
precision. The experiment compares one entity and its adjacent entities in the corresponding semantic neighborhood with a set of entities and their adjacent entities in the corresponding semantic neighborhoods in other ontologies, in order to obtain a similarity ranking and to evaluate the quality of the experimental results. The ADSS approach and Rodriguez and Egenhofer’s approach [4] were each used to obtain precision results. In Fig. 1, 50 entities are selected from these ontologies. Case 1 and Case 2 show the experimental results obtained from the ADSS approach and from Rodriguez and Egenhofer’s approach, respectively. Fig. 1 shows that ADSS performs better in the experiments. This is because ADSS employs the WM algorithm to compute more reasonable weights and uses the ITSTDSS algorithm to compute the optimal approximation. Thus, the semantic similarity among the set of entities is better determined. In contrast, Rodriguez and Egenhofer’s approach [4] obtains the weights and the semantic similarity among the set of entities through manual intervention. Fig. 1 also shows that as the entity number increases, the precision of the ADSS approach improves more markedly than that of Rodriguez and Egenhofer’s approach. This is because the increased entity number leads to a bigger computing workload. In this situation, it is more meaningful to seek the optimal solution that would bring better precision.

Fig. 1. Experimental results: precision (0–100%) versus entity number (5–50) for Case 1 and Case 2.

3.5. Conclusion
The wide availability of machine-understandable information on the Semantic Web offers opportunities to improve traditional search. With the development of semantic search engines, semantic ranking is becoming more and more important. Semantic ranking in semantic search engines is harder than the ranking performed by a traditional search engine. How to determine semantic similarity is a critical issue in semantic ranking. In this paper, we propose a method called ADSS. The approach takes into consideration such criteria as the similarity between two entities and their similarity reflected in context. The ranking score is defined as a function of some particular parameters. ADSS differs from other methods in that it combines the ITSTDSS algorithm with the WM algorithm to improve precision. In particular, our approach achieves higher precision as the computing workload increases with a larger number of entities.
Acknowledgements
This work is supported by the National Grand Fundamental Research 973 Program of China under No. 2002CB312002, Jiangsu Planned Projects for Postdoctoral Research Funds, the State Key Laboratory Foundation of Novel Software Technology at Nanjing University under grant A200308, the Natural Science Foundation of Jiangsu Province of China under grant BK2004114 and the Key Natural Science Foundation of Jiangsu Province of China under grant BK2003001.
References
[1] Guha R, McCool R, Miller E. Semantic search. In: Proceedings of the 12th international world wide web conference. Budapest, Hungary, May 20–24; 2003.
[2] Heflin J, Hendler J. Searching the web with SHOE. AAAI-2000 workshop on AI for Web search. California: AAAI Press; 2000.
[3] Aleman-Meza B, Halaschek C, Arpinar IB, Sheth A. Context-aware semantic association ranking. Semantic web and databases workshop proceedings. Berlin, Germany, September 7–8; 2003.
[4] Rodriguez M, Egenhofer M. Determining semantic similarity among entity classes from different ontologies. IEEE Trans Knowl Data Eng 2003;15(2):442–56.
[5] Broekstra J, Ehrig M, Haase P, van Harmelen F, Kampman A, Sabou M, et al. A metadata model for semantics-based peer-to-peer systems. Proceedings of the WWW’03 workshop on semantics in peer-to-peer and grid computing. Budapest, Hungary, May 20–24; 2003.
[6] Pekar V, Staab S. Word classification based on combined measures of distributional and semantic similarity. In: Proceedings of the research note sessions of the 10th conference of the European chapter of the association for computational linguistics (EACL’03). Budapest, Hungary, April 12–17; 2003.
[7] Gruber TR. A translation approach to portable ontology specifications. Knowledge Acquisition 1993;5(2):199–220.
[8] Xie K, Han L, Lin Y. Optimization method. Tientsin: University Press; 1997.
[9] Glover F. Future paths for integer programming and links to artificial intelligence. Computers and Operations Research 1986;13(5):533–49.
[10] Gray P, Hart W, Painton L, Phillips C, Trahan M, Wagner J. A survey of global optimization methods; 1997. http://www.cs.sandia.gov/opt/survey/main.html.