Journal of Integrative Agriculture
May 2012
2012, 11(5): 800-807
RESEARCH ARTICLE
An Ontology-Based Information Retrieval Model for Vegetables E-Commerce
TAO Teng-yang and ZHAO Ming
College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, P.R.China
Abstract With the rapid growth of information on the web, traditional keyword-based information retrieval falls far short of users' expectations in recall and precision. To improve the recall ratio and the precision ratio of information retrieval (IR) engines for vegetables e-commerce, this paper presents an information retrieval model based on a vegetables e-commerce ontology. The ontology was constructed by gathering and analyzing vegetables e-commerce domain information on the web, and it is composed of vegetable classes and the hierarchical relationships among those classes. During retrieval, the domain ontology supports information indexing and inference. An ontology-based information retrieval model is implemented, which offers more functions than keyword-based web information retrieval engines. The experimental results show that the recall ratio and the precision ratio of the ontology-based model are higher, to a certain extent, than those of a keyword-based information retrieval engine.
Key words: domain-ontology, vegetables e-commerce, information retrieval
Received 28 June, 2011 Accepted 31 August, 2011
Correspondence ZHAO Ming, Tel: +86-10-62737855, E-mail: [email protected]
© 2012, CAAS. All rights reserved. Published by Elsevier Ltd.
INTRODUCTION
With the development of global information technology, electronic commerce has become an integral part of the market economy. At the same time, the internet has become a huge source of information over the last several years (Osterwalder and Pigneur 2002). Even though search engine technology has improved impressively in the last decade, the information retrieval (IR) technology it builds upon is still mostly based on keywords. Keyword-based IR relies on word-form matching, which has many problems and therefore cannot meet users' needs. The key to solving this problem is to move from keyword matching to semantic matching, understood as searching by meanings rather than literal strings (Fernández et al. 2011). Once information retrieval is raised from the keyword level to the concept level, semantic retrieval can be realized. Semantic search has been present in the IR field since the early eighties (Croft 1986). Some of these approaches are based on statistical methods that study the co-occurrence of terms (Deerwester et al. 1990; Dumais 1990), and therefore they capture and exploit rough and fuzzy conceptualizations. How to retrieve information efficiently and accurately has become more and more important, and it is an urgent problem awaiting a solution.
At present, domain ontology serves as a backbone of the Semantic Web by providing vocabularies and formal conceptualization of a given domain to facilitate information sharing and exchange (Gruber 1993; Berners-Lee et al. 2001). Its potential to overcome the limitations of keyword-based search in IR systems has been explored by several researchers in ontology-based information retrieval (Guha et al. 2003; Maedche et al. 2003). Ontology technology is widely used in the field of information retrieval, where it aims to improve the efficiency of the information retrieval model (Muller et al. 2004; Al-Jadir et al. 2010). Domain ontology can capture the knowledge of a specific field, provide a common understanding of that knowledge, and identify the specific definitions of words that belong to different levels of formal models (Deng et al. 2002). Domain ontology is the basis for understanding knowledge and for inference in the field of information retrieval, and is a powerful tool for supporting effective information retrieval. In this study, ontology is used to achieve semantic extension and to provide users with better information services that meet customer needs more closely.
In this paper, an IR model based on a domain ontology is proposed for vegetables e-commerce web pages. The experimental data show that the recall ratio and the precision ratio of the ontology-based information retrieval model are higher, to a certain extent, than those of the general keyword-based web information retrieval model. The structure of the paper is as follows. First, the construction process and implementation of the vegetables e-commerce ontology are briefly described. Second, the information retrieval model is discussed in detail, including the key modules and key functions. Next, the test and evaluation of the model are presented. Finally, some conclusions and further work are offered.
THE CONSTRUCTION OF VEGETABLES E-COMMERCE ONTOLOGY
Construction method of vegetables e-commerce domain ontology
An ontology is a shared conceptual model, which describes the relationships among pieces of knowledge by means of classes, relations, functions, axioms, and instances. An ontology is a type of structured vocabulary in which the terms and the logical relationships that hold between them are well-defined (Balhoff et al. 2010). It gives the entity concepts and their mutual relations in the area of activity, as well as a formal description of features and rules. Domain ontology provides the essential terms and correlative forms of a specific area in a way that can be identified by the computer. It is intended to enable machines to reach a consensus on related or similar terms (Niu et al. 2008). The following steps are needed to build the vegetables ontology.
Determine the scope and purpose of the domain ontology
This stage clarifies the aim, the scope, and the function for which the domain ontology is constructed. The purpose of the domain ontology should be clear before construction begins. The vegetables e-commerce ontology provides semantic support to improve the efficiency of information retrieval over web page information. Therefore, the semantic relationships among the concepts should be provided as fully as possible to improve the ontology-based information service.
Domain information collection and analysis
This stage is an important precondition for building the vegetable ontology. Only when the domain information has been fully collected and the domain knowledge is well understood is it possible to build a usable and correct ontology with a sufficient amount of information. For the ontology to be versatile, the contents of the domain ontology must be authoritative and standard, and its terms must be accurate and complete. The information for the vegetable ontology comes from authoritative sources, such as professional books on vegetables, agricultural information websites, domain experts, and other relevant existing ontologies.
Define the classes and the class hierarchy
Currently there are three class design methods. A top-down method first defines the most general concepts in the domain and then gradually specializes them. A bottom-up method first defines the most specific concepts, starting from the smallest classes at the bottom, and then generalizes these concepts into more comprehensive ones. A combination method combines the top-down and bottom-up methods: it defines the more salient concepts first and then generalizes and specializes them appropriately. In this paper, a top-down approach was applied to construct the vegetables e-commerce domain ontology.
Define the properties of classes
The classes alone will not provide enough information to answer the competency questions. Therefore, once some of the classes have been defined, the properties of the classes must be described. Initially, it is important to get a comprehensive list of terms without worrying about overlap between the concepts they represent, relations among the terms, or any properties that the concepts may have. These terms contain object properties and datatype properties. All subclasses inherit the properties of their classes, so a property slot should be attached to the most general class that possesses the property.
Create instances
The last step is to create individual instances of classes in the hierarchy. Defining an individual instance of a class requires (1) choosing a class, (2) creating an individual instance of that class, and (3) filling in the slot values (Noy and McGuinness 2001).
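The last three steps (class hierarchy, inherited properties, instances) can be illustrated with a minimal Python sketch. It is not the Protégé/OWL implementation used in this work. The class names follow the examples in the text, `hasproducer` is one of the object properties named in the paper, and the `price` property and its value are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntClass:
    """A class in a toy ontology; `parent` is the is-a link to its superclass."""
    name: str
    parent: Optional["OntClass"] = None
    properties: set = field(default_factory=set)

    def all_properties(self) -> set:
        # A subclass inherits every property attached to its ancestors.
        inherited = self.parent.all_properties() if self.parent else set()
        return inherited | self.properties


@dataclass
class Instance:
    """An individual of a class, with its filled-in slot values."""
    name: str
    cls: OntClass
    slots: dict = field(default_factory=dict)

    def fill(self, prop: str, value) -> None:
        # Only properties defined on the class or its ancestors may be filled.
        if prop not in self.cls.all_properties():
            raise KeyError(f"{prop!r} is not a property of {self.cls.name}")
        self.slots[prop] = value


# Top-down design: the most general concept first, then its specializations.
# Properties are attached to the most general class that possesses them.
vegetable = OntClass("Vegetable", properties={"hasproducer", "price"})
leaf = OntClass("LeafVegetable", parent=vegetable)
cabbage = OntClass("ChineseCabbage", parent=leaf)

# Steps (1)-(3): choose a class, create the individual, fill in slot values.
jin_wa_wa = Instance("jin wa wa", cabbage)
jin_wa_wa.fill("price", 2.5)  # hypothetical value
```

Because the two properties are attached to `Vegetable`, `cabbage.all_properties()` returns both of them without any declaration on the subclasses.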
The building of the vegetables domain ontology
The vegetables e-commerce domain ontology describes the concepts of vegetables e-commerce and the relationships among them. In this model, OWL DL was used to describe the ontology concepts, and the domain ontology was built with Protégé 4.0.2. The vegetables ontology is composed of several categories, for example, the vegetable species and the vegetable area. Each level contains many concept sets (classes, class relations, and properties), and each concept set has concrete instance information. The model first determines the category, and then defines the classes, properties, and relations that correspond to that category. Finally, the classes are populated with instances. According to the analysis of related information on various vegetables e-commerce websites, the vegetable domain ontology framework has three main categories: vegetable species, vegetable area, and company. The vegetable species comprise about seven kinds of vegetables, such as leaf vegetables, bulb vegetables, and solanaceous vegetables. The ontology construction graph produced with Protégé 4.0.2 is given in Fig. 1.
Fig. 1 Graph of vegetables e-commerce domain ontology.
Domain ontology defines the classes and subdivides the corresponding class information. The vegetables e-commerce ontology describes the properties of each class, the relationships, and the expansion relations. The different concepts, as well as the relationships among them, are shown in different forms in Fig. 2. There are three kinds of concept relations: (1) association relations: general relations; (2) generalization relations: is-kind-of relations; (3) aggregation relations: is-part-of relations. The property relations of classes include IsPartOf and HasPartOf. The IsPartOf relation, one of the three kinds of relations above, is a sub-relation of IsVegetableOf and is transitive, and IsPartOf is the inverse of HasPartOf. The object properties include hasproduct, hasproducer, locatedIn, etc.
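To make the two property characteristics concrete, the following Python sketch derives the facts implied by a transitive IsPartOf relation and its inverse HasPartOf. It is a minimal illustration and not the OWL DL reasoning performed by the system; the place names are hypothetical facts for the vegetable area category.

```python
def transitive_closure(pairs):
    """Return every (a, c) implied by chaining (a, b) and (b, c)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over snapshots so the set can grow inside the loops.
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def inverse(pairs):
    """Swap subject and object: IsPartOf(a, b) gives HasPartOf(b, a)."""
    return {(b, a) for a, b in pairs}


# Hypothetical asserted facts: IsPartOf(subject, object).
asserted = {("Shouguang", "Shandong"), ("Shandong", "China")}

is_part_of = transitive_closure(asserted)
has_part_of = inverse(is_part_of)
```

Here `is_part_of` additionally contains `("Shouguang", "China")`, and `has_part_of` contains `("China", "Shouguang")`, although neither fact was asserted directly.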
CONSTRUCTING THE INFORMATION RETRIEVAL MODEL BASED ON DOMAIN ONTOLOGY
A semantic retrieval system was constructed based on the domain ontology, with the aim of achieving higher efficiency than keyword-based search engines. The model uses the domain ontology to semantically annotate the vegetables e-commerce websites, and builds a semantic information retrieval system on the vegetables e-commerce ontology (Mayfield and Finin 2003). A domain ontology contains the knowledge of a specific field, describes the concepts of that field and the relations among them, and provides a common understanding of the knowledge at different levels of the pattern. The information retrieval model based on domain ontology is proposed in this paper, and its efficiency is higher than that of a keyword-based search engine. The key modules of the information retrieval model are shown in Fig. 2.
Fig. 2 Information retrieval model based on ontology.
The module of semantic annotation based on domain ontology
Semantic annotation is the process of indexing information from the related resource. It uses manual, automatic, or semi-automatic means to express the content of the resource and the knowledge of its key concepts through the ontology classes and ontology instances revealed in the process (Zhou et al. 2008). First, the related information was crawled from some vegetables e-commerce websites. To analyze the web information, the crawled pages were preprocessed as follows: (1) remove the HTML tags, so that the information becomes free text; (2) identify the words in the free text; (3) identify the part of speech of the words; (4) remove the stop words; (5) extract the word stems by removing prefixes and suffixes. Preprocessing makes the web information more precise. Then, the concepts and instances were extracted from the free text, which may contain "product name", "phone", "email", "release date", etc. The instances were obtained from noun phrases in the free text. For example, "jin wa wa" is an instance of Chinese cabbage; its position was fixed at the word "jin wa wa", and a pair of tags was added around it. The information was annotated in this way. Fig. 3 shows the key modules of semantic annotation.
Fig. 3 Graph of semantic annotation.
Constructing and implementing inference engine based on Jena
The function of this module is to enable the machine, based on the domain ontology, to understand questions in the related field, so as to construct more complete and accurate concepts and knowledge, and to use the revised questions and the search engine to match the resources. Jena is a Java framework for building Semantic Web applications. The Jena inference engine provides good support for ontology modeling, manipulation, inference, and related activities. SPARQL is a query language recommended by the W3C, and its query form resembles that of SQL statements (Wang and Zaniolo 2008; Guo and Zhang 2009). As a first step, all the jar files in the lib directory of the Jena development kit were imported into MyEclipse, and then the appropriate packages or classes were imported in the Java source code. Jena's API is organized by function into 24 packages. To operate on the OWL ontology effectively, MySQL was used as the persistent storage for the ontology. The Jena API allows the OWL graph data structure to be stored in various databases, and reading the ontology model from the database saves much time. Jena comes with a simple rule-based inference engine. In addition, Racer, a reasoner based on the DIG interface, was used in this system. In this case, the classes and instances are limited and the relationships are not very complex. Some relationship reasoning rules are as follows:
(1) transitive rule: $\text{Is-a}(C_1, C_2) \wedge \text{Is-a}(C_2, C_3) \Rightarrow \text{Is-a}(C_1, C_3)$
(2) object attribute inheritance rule: $\text{Is-a}(C_1, C_2) \wedge \text{HasAttribute}(C_2, A) \Rightarrow \text{HasAttribute}(C_1, A)$
(3) instance transitive rule: $\text{Is-a}(C_1, C_2) \wedge \text{Instance-of}(e, C_1) \Rightarrow \text{Instance-of}(e, C_2)$
(4) superclass-subclass inverse relationship rule: $\text{subclassof}(C_1, C_2) \Rightarrow \text{superclassof}(C_2, C_1)$
Constructing and implementing information retrieval module
In this information retrieval model, there are two information retrieval modules. The first is a full-text search module (information retrieval based on keywords), which is used for comparison. The second is a semantic retrieval system based on the domain ontology, which can search for information not only by keyword but also by the relationships among pieces of information. In the first module, information is obtained from a keyword-based full-text search engine, but the returned resources lack semantic information. The second module analyzes the user's query, loads the domain ontology, matches the related vocabulary in the ontology, and returns that vocabulary as the expanded vocabulary. In this retrieval system, the user input is commonly a vegetable name. The reasoning engine matches the input vocabulary and obtains the subclasses, superclasses, equivalent classes, and related properties that have a certain semantic relation to it. The related information is found through the class hierarchy relationships and the appropriate reasoning rules. The structure of the semantic retrieval system based on domain ontology is shown in Fig. 4.
Fig. 4 The structure of semantic retrieval.
Different users have different requirements, and users are also interested in how the retrieval results are ranked. Considering the needs of users in vegetables e-commerce, three information retrieval ranking models were constructed.
Information retrieval ranking by price
The results of the information retrieval are ranked by the price of the products, so that users can find the cheaper products by comparing prices among the different commodities; the results are returned in order of price.
Information retrieval ranking by manufacturing date
In this model, the results of the information retrieval are ranked by the valid time of the products, so that users can obtain more useful information by comparing the manufacturing dates of the different commodities; the results are returned in order of date. This method is already used on some e-commerce websites.
Information retrieval ranking by region
In this model, the system identifies the location of the user and returns the results in order of the distance between the location of the user and the place of the products. The distance is not a measured value but is predefined. These methods can supply more convenient transport information for the users and are suitable for vegetables e-commerce.
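The retrieval path of this section, query expansion through the class hierarchy followed by one of the three rankings, can be sketched in Python as follows. This is a simplified stand-in for the Jena-based implementation, which is not reproduced in the paper. The hierarchy, offers, prices, dates, and distance levels are hypothetical. The sketch expands a query only downward, to subclasses and instances, and it assumes that newer manufacturing dates rank first.

```python
from datetime import date

# Hypothetical is-a links (class -> superclass) and instance links.
SUBCLASS_OF = {
    "Chinese cabbage": "leaf vegetables",
    "spinach": "leaf vegetables",
    "leaf vegetables": "vegetable",
}
INSTANCE_OF = {"jin wa wa": "Chinese cabbage"}

# Predefined distance levels between user region and product region.
REGION_DISTANCE = {
    ("Beijing", "Beijing"): 0,
    ("Beijing", "Hebei"): 1,
    ("Beijing", "Shandong"): 2,
}

OFFERS = [
    {"product": "jin wa wa", "price": 3.0,
     "produced": date(2011, 6, 5), "region": "Shandong"},
    {"product": "spinach", "price": 2.0,
     "produced": date(2011, 6, 3), "region": "Hebei"},
    {"product": "potato", "price": 1.0,
     "produced": date(2011, 6, 2), "region": "Beijing"},
]


def superclasses(cls):
    """Follow is-a links upward, applying the transitive rule."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def expand(term):
    """Return the query term plus all of its subclasses and instances."""
    terms = {term}
    # Subclasses: every class whose superclass chain contains the term.
    terms |= {c for c in SUBCLASS_OF if term in superclasses(c)}
    # Instance transitive rule: instances of the term or of any subclass.
    terms |= {i for i, c in INSTANCE_OF.items() if c in terms}
    return terms


def rank(offers, by, user_region=None):
    """Order offers by price, manufacturing date, or predefined distance."""
    if by == "price":
        return sorted(offers, key=lambda o: o["price"])
    if by == "date":
        return sorted(offers, key=lambda o: o["produced"], reverse=True)
    if by == "region":
        return sorted(
            offers, key=lambda o: REGION_DISTANCE[(user_region, o["region"])]
        )
    raise ValueError(f"unknown ranking: {by}")


def search(query, offers, by="price", user_region=None):
    """Match offers against the expanded vocabulary, then rank the hits."""
    terms = expand(query)
    hits = [o for o in offers if o["product"] in terms]
    return rank(hits, by, user_region)
```

A query for "leaf vegetables" matches no offer literally, yet `search("leaf vegetables", OFFERS)` returns the spinach and "jin wa wa" offers, which is the kind of result a pure keyword match would miss.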
IMPLEMENTATION OF THE PROTOTYPE SYSTEM AND THE ANALYSIS OF TEST DATA
Experiment design
To test this system, Nutch (an open-source text search engine toolkit) was used to build a keyword-based search engine for comparison with the information retrieval system based on domain ontology. An information retrieval system can be evaluated in two respects: functional evaluation and performance assessment. Given the characteristics of vegetables e-commerce, functional evaluation is of greater concern from the users' perspective, which means that the system should meet the user's search requests. In this paper, precision ratios and recall ratios were used to compare the two systems. The recall ratio is the measure of how many of the possible correct answers are found by the approach, and the precision ratio (P) is the measure of how many of the answers given are actually correct. Several vegetables e-commerce websites were crawled, yielding 1 023 items of vegetable trading information covering Chinese cabbage, onion, eggplant, potato, cucumber, and so on.
Experiment results
Chinese cabbage, onion, eggplant, potato, and cucumber were taken to compare the two systems. There are 32 items of information about Chinese cabbage. The keyword-based search engine returned 25 of them, of which 23 are correct; the semantic retrieval system returned 28, of which 27 are correct. The recall ratio and the precision ratio of the two systems were calculated for each vegetable. Two methods were mainly used to improve the recall and precision of information retrieval: (1) If retrieval relies only on the keywords entered by the user, synonyms may cause relevant results to be omitted. To solve this problem, the concept specification in the ontology can be used to map a term to its set of synonyms, which improves the recall ratio. (2) A word may have multiple meanings, and simply matching one of these meanings may lead to erroneous results. To solve this problem, the ontology can be used to analyze the degree of match between the user's search terms and the semantic types of the information resources. The user can also directly enter the semantic type of a word or a semantic relationship, so that the semantics of the search terms are determined by a specific concept.
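The Chinese cabbage counts quoted at the start of this subsection can be checked against the values reported in Tables 1 and 2 with a short Python calculation. The reading of the two ratios used here is an inference from the numbers and is not stated explicitly in the text: the tabulated values are reproduced when the recall ratio is taken as returned items over the 32 available items and the precision ratio as correct items over returned items.

```python
from decimal import Decimal, ROUND_HALF_UP


def ratio(numerator: int, denominator: int) -> Decimal:
    """Divide exactly and round half up to four decimal places."""
    value = Decimal(numerator) / Decimal(denominator)
    return value.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


TOTAL_ITEMS = 32  # items of information about Chinese cabbage

# Counts quoted in the text for Chinese cabbage.
systems = {
    "keyword-based": {"returned": 25, "correct": 23},
    "semantic": {"returned": 28, "correct": 27},
}

results = {}
for name, counts in systems.items():
    results[name] = {
        # Assumed reading: returned items over all available items.
        "recall": ratio(counts["returned"], TOTAL_ITEMS),
        # Correct items over returned items.
        "precision": ratio(counts["correct"], counts["returned"]),
    }

for name, r in results.items():
    print(f"{name}: recall={r['recall']}, precision={r['precision']}")
```

`Decimal` with half-up rounding is used because 25/32 is exactly 0.78125, a tie that binary floating-point formatting would round to 0.7812 rather than the tabulated 0.7813.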
Tables 1 and 2 compare the recall ratio and the precision ratio of the two retrieval systems in the experiments. The tables show that the precision ratio and the recall ratio of the ontology-based information retrieval model are higher, to a certain extent, than those of the keyword-based information retrieval engine. According to Table 1, the recall ratio increased by about 12.8% at most and 9.4% at least, with an average increase of about 11.9%. According to Table 2, the precision ratio increased by about 9.6% at most and 1% at least, with an average increase of about 4.2%. The analysis of the two models shows that the semantic retrieval model can obtain more related information, because it includes the concepts and the relationships among different concepts. Synonyms and some simple relations can therefore be retrieved, which is impossible for the keyword-based search engine. The key module of the semantic retrieval system is the construction of the domain ontology. The more relationships among the classes, the more semantic information can be retrieved.
Table 1 The result of recall ratio by different retrieval system
Options                                          Chinese cabbage  Onion   Eggplant  Potato  Cucumber
Recall ratio by keyword-based search engine      0.7813           0.7333  0.7692    0.8049  0.8095
Recall ratio by semantic retrieval system        0.875            0.8333  0.8974    0.9024  0.9048
Table 2 The result of precision ratio by different retrieval system
Options                                          Chinese cabbage  Onion   Eggplant  Potato  Cucumber
Precision ratio by keyword-based search engine   0.92             0.9091  0.8667    0.8788  0.8824
Precision ratio by semantic retrieval system     0.9643           0.92    0.9426    0.8919  0.9477
CONCLUSION
In this paper, an ontology-based information retrieval model was presented, together with the construction theory and process of the vegetables e-commerce domain ontology. Besides the semantic information annotation model, a reasoning engine based on Jena and an information retrieval module were also presented. The experimental results show that the recall ratio increases by an average of about 11.9%, and the precision ratio increases by an average of about 4.2%. It can be concluded that the information retrieval system based on the vegetables e-commerce domain ontology improves the precision ratio and the recall ratio. Ontology and semantic retrieval have been widely studied in recent years, and how to further improve the efficiency of information retrieval and express semantic relationships remains a challenge for the future.
Acknowledgements
This research is supported by the National High Technology Research and Development Program of China (2006AA10Z239). The authors are grateful to the anonymous reviewers who made constructive comments.
References
Al-Jadir L, Parent C, Spaccapietra S. 2010. Reasoning with large ontologies stored in relational databases: the OntoMinD approach. Data and Knowledge Engineering, 69, 1158-1180.
Balhoff J P, Dahdul W M, Kothari C R, Lapp H. 2010. Phenex: Ontological annotation of phenotypic diversity. PLoS One, 5, 1-10.
Berners-Lee T, Hendler J, Lassila O. 2001. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American, 285, 34-43.
Croft W. 1986. User-specified domain knowledge for document retrieval. In: SIGIR 1986 Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, Pisa, Italy. pp. 201-206.
Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.
Deng Z, Tang S, Zhang M. 2002. The ontology research summarizes. The Beijing University Journal: Natural Sciences Version, 38, 730-733.
Dumais S. 1990. Enhancing performance in latent semantic indexing retrieval. TM-ARH-017527. Bellcore, USA.
Fernández M, Cantador I, López V, Vallet D. 2011. Semantically enhanced information retrieval: An ontology-based approach. Web Semantics: Science, Services and Agents on the World Wide Web, 9, 434-452.
Gruber T. 1993. A translation approach to portable ontology specification. Knowledge Acquisition, 5, 199-220.
Guha R, McCool R, Miller E. 2003. Semantic search. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003). Budapest, Hungary. pp. 700-709.
Guo Q, Zhang M. 2009. Semantic information integration and question answering based on pervasive agent ontology. Expert Systems with Applications, 36, 10068-10077.
Maedche A, Staab S, Stojanovic N, Studer R, Sure Y. 2003. SEmantic portAL: the SEAL approach. In: Proceedings of Spinning the Semantic Web. MIT Press, Cambridge London. pp. 317-359.
Mayfield J, Finin T. 2003. Information retrieval on the Semantic Web: Integrating inference and retrieval. In: Proceedings of SIGIR 2003 Semantic Web Workshop. Toronto, Canada.
Muller H M, Kenny E E, Sternberg P W. 2004. An ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2, 1984-1998.
Niu Q, Qiu B, Xia S. 2008. Ontology-based learning resources in the field of semantic search model. Computer Application Research, 25, 1977-1982.
Noy N F, McGuinness D L. 2001. Ontology development 101: A guide to creating your first ontology. In: Stanford Knowledge Systems Laboratory Technical Report KSL-01-05. Stanford Press, USA.
Osterwalder A, Pigneur Y. 2002. An e-business model ontology for modeling e-business. In: Proceedings of the 15th Bled Electronic Commerce Conference eReality: Constructing the e-Economy. Bled, Slovenia. pp. 17-19.
Wang F, Zaniolo C. 2008. Temporal queries and version management in XML-based document archives. Data and Knowledge Engineering, 65, 304-324.
Zhou D, Bian J, Zheng S. 2008. Exploring social annotations for information retrieval. In: WWW 2008 Proceedings of the 17th International Conference on World Wide Web. Beijing, China. pp. 715-724.
(Managing editor ZHANG Juan)