Knowledge-Based Systems 24 (2011) 393–405
Word AdHoc Network: Using Google Core Distance to extract the most relevant information

Ping-I Chen ([email protected], [email protected]), Shi-Jen Lin ([email protected])
Department of Information Management, National Central University, Chung-Li 320, Taiwan, ROC
Article info

Article history: Received 4 March 2010; Received in revised form 23 November 2010; Accepted 23 November 2010; Available online 29 November 2010.

Keywords: Similarity distance; Search engines; Information retrieval; Keyword sequence; n-gram
Abstract

In recent years, finding the most relevant documents or search results in a search engine has become an important issue. Most previous research has focused on expanding the keyword into a more meaningful sequence or using a higher concept to form the semantic search. All of those methods need predictive models, which are based on the training data or Web log of the users' browsing behaviors. In this way, they can only be used in a single knowledge domain, not only because of the complexity of the model construction but also because the keyword extraction methods are limited to certain areas. In this paper, we describe a new algorithm called "Word AdHoc Network" (WANET) and use it to extract the most important sequences of keywords to provide the most relevant search results to the user. Our method needs no pre-processing, and all the executions are real-time. Thus, we can use this system to extract any keyword sequence from various knowledge domains. Our experiments show that the extracted sequence of the documents can achieve high accuracy and can find the most relevant information in the Top-1 search results, in most cases. This new system can increase users' effectiveness in finding useful information for the articles or research papers they are reading or writing.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

The search engine has become an indispensable tool in people's daily lives. Individuals can search for any kind of knowledge and can find the newest information in the world. The only way to use the search engine is to enter some keywords that represent what the user wants to know. Jansen et al. [21] found that most users enter only 2.35 terms in the search engine, usually because the users lack sufficient domain knowledge to enter precise keywords that describe their thoughts. In the past few years, the Google search engine has offered keyword expansion, a feature that can provide useful next or additional keywords to help the user find the most relevant and accurate search results. But the possible combinations of keywords are so numerous that the Google recommendations will not always work. The search engine can only provide the most frequently entered keyword sets, which are determined by other users. This method is called "collaborative recommendation". Both methods use tools such as semantic nets, ontology [48], and Markov chains [5] to model users' behavior and to identify their interests. Thus, the system can find people who share interests in order to form
communities and to use their search results as the best potential keyword sequences. However, each time users want to search for information, they may want to find different kinds of information in different knowledge domains. A user might enter the keyword "Apple" in order to learn about the Apple company's newest product. But in another search, the same user might use the same keyword to search for McDonald's apple pie. Using the traditional methods, the training model can only be constructed in a single domain. When modeling several domains of knowledge at the same time, the model will be so huge that it will take an extremely long time to search for the potential keywords in all those domains. In addition, keyword extraction methods, like TF-IDF, always rely on term frequency (TF) to find the most important keywords. The TF-IDF methods need pre-collected documents or Webpage sets to calculate the inverse document frequency (IDF) values. But in Web information retrieval, all the execution should be on-line and real-time. Additionally, users' browsing behaviors are widely varied. Therefore, it is impossible to determine the exact IDF values in order to evaluate the importance of the keywords except by observing users and collecting information about their search behaviors. To use this system on a mobile handset device, the repository of that information must be very small; hence, it is impossible to save the pre-defined model or dataset on such devices. Thus, there is a need for a new way to enhance the keyword expansion methods,
turning them into real-time execution systems and minimizing the system footprint in order to save space in the repository. In our previous work, we used the Google similarity distance algorithm to calculate and find the potential keywords in articles that users viewed [6]. The users can acquire information about each keyword so that all the articles or Web pages that the users are reading can become a Wikipedia-like system. In other words, this system can automatically provide information about which keywords in an article are most important to that user. This is easier than marking the keywords and searching in Google to discover their meanings. By using the NGD algorithm, the system can achieve the goal of on-line, real-time execution without using any repository. However, this system can only provide information about each individual keyword. To find the most relevant information about the article, the keywords must be expanded into a longer sequential set. For example, if a user reads an article about the Google similarity distance, the user may want to know whether any research relates to that article or whether other papers reference it. A set of keywords must be extracted as the "term vector" to represent this article; that vector can then be used to search Google or a database to find the most relevant research articles [22]. The term vector model uses index terms as vectors of identifiers to represent documents. The Apache Lucene text search engine uses this method to calculate documents' relevancy rankings. The document similarities theory is then used to compare the deviations of angles between the documents' vectors and the query keywords. Many studies explain how to find a meaningful keyword sequence in order to enhance the accuracy of the search results. Still, the model must be trained in advance, so the system's usability is always restricted. In this paper, we adapt the NGD algorithm and try to extract a meaningful keyword sequence based on the Webpage or article that is viewed. We have used the NGD algorithm to build several systems, and we proved that using the number of search results to calculate the relations among the keywords is a workable method. However, we found a significant problem: the number of search results for a keyword varies over time, so the relations between keywords become unstable, because the NGD value is computed only from those numbers of search results. We want to use the NGD algorithm to extract a sequence from each document as its index terms and to use this sequence to cluster related documents. When using this method, two similar documents whose data were collected at different times are likely to look totally different because the relations of the keywords have changed so greatly. We propose a new algorithm, called "Google Core Distance" (GCD), to improve the stability of the relations. In this process, the GCD algorithm becomes a distance-based method rather than a probability-based method. Therefore, it is impossible to combine this method with the traditional LSI algorithm to find an n-gram sequence of keywords as the term vector. Instead, we used the famous PageRank algorithm and combined it with BB's graph-based clustering algorithm to find the sequence. This idea is based on the sensor network's routing algorithm, and we have named it "Word AdHoc Network" (WANET). Our method can be used to find the single most important keyword, the most important two or three keywords, and so on.
By using a Hop-by-Hop Routing algorithm, we can extract the word-by-word sequences from the documents or Web pages to represent the term vectors. Thus, to find the best keyword sequences, our system focuses on the co-occurrence of each two keywords, the connectivity of those relative keywords (including the relations of keyword to keyword and the relations of a keyword's relative keywords), and the best routing path. We used the previously mentioned algorithms to build the system, and we used the Elsevier Web site to choose four different knowledge domains. We randomly selected 10 journals in each
domain as subjects for the experiment. In each journal, we chose the top 25 most-downloaded papers as our data set, and we used our system to extract the most important keyword sequences. Next, we used those sequences to search in Google and evaluated how well the sequences could find the original papers in the Top-k search results. The experiment's results showed that the 4-gram sequence of the determined keywords can identify the most relevant search results. We believe that using our system can help users, especially researchers, to find the most relevant information in a more efficient way. When a user is writing an introduction to a research paper, if the system can immediately calculate and find related previous research, then it will be much easier for the user to define the paper topic and to determine whether any previous work has examined the same problem [27]. When the manager of a company is reading a newspaper on a mobile device, if the system can find other related articles and provide a summary, then the manager can make informed decisions, anytime and anywhere. Information from around the world is collected on the Internet, but users must know how to find the precise information and then use it properly. By using this kind of system, users can control the information and can understand current events, thereby enhancing their knowledge. The rest of this article is organized as follows. In Section 2, we introduce some related research and compare it with the system proposed in this article. In Section 3, we describe our proposed methods in detail and explain the thinking behind the system's design. In Section 4, we evaluate the NGD algorithm and our proposed GCD algorithm by using the Spearman footrule measurement. Then, we conduct an experiment that uses research articles from different knowledge domains to evaluate whether our system can achieve both high accuracy and cross-domain applicability.

2. Related works

Most users do not know how to enter precise keywords to represent what they want to find in the search engine, especially when looking for knowledge in unfamiliar domains [1]. Two main problems need to be solved: (a) extracting the potential keywords; (b) finding a meaningful keyword sequence [7,13,20,34,40]. In the past few years, many different solutions to these problems have been developed. A sequence of keywords can be extracted to represent the documents, and searching for that sequence in the search engine will offer the most relevant information to the user. In this section, these methods will be explained and compared with our system in detail.

2.1. Semantic similarity

Semantic similarity is a concept that has been used to measure the similarity of documents or terms. Some algorithms use human-defined ontologies to measure the distance between words. Others use the vector-based model to represent their correlations.

2.1.1. Vector-based model

In 1990, Deerwester et al. proposed latent semantic analysis (LSA), a technique used for natural language processing and for providing relationship measurements of word-word and word-passage. This method analyzes the potential relationships among a set of documents and terms by using a set of concepts related to those documents and terms. Thus, a document becomes a column vector, and the query that is entered by the user also becomes a vector. Finally, the two vectors can be compared to measure their similarity.
The problem with the LSA is that it does
not consider the order of words within a sequence. Nowadays, the order of the entered keywords will affect the search results in the search engine. The results of using this method are not satisfactory because the method only considers the co-occurrence of keywords as vectors and then compares them [11]. Hofmann [19] adapted the LSA algorithm into the probabilistic latent semantic analysis (PLSA). This method can find the relations of topics that are associated with terms and documents so that users can find documents that belong to the same topic. Additionally, the PLSA can solve the problems of polysemy and synonymy. Thus, this algorithm can be used to extract meaningful information more accurately. The PLSA's flaw is the same as the LSA's; it does not consider the order of the sequence [37]. Generalized latent semantic analysis (GLSA) was proposed by Matveeva et al. [35]. This method uses a large corpus to compute the term vectors for the vocabulary of a document collection. This algorithm uses a word-by-word matrix, so the computational cost of creating the vector space will increase with the number of all possible keywords. Thus, GLSA cannot be used to combine information on the World Wide Web. Hyperspace analogue to language (HAL) considers context only as the words that immediately surround a given word. It creates a word-by-word matrix, based on word co-occurrence in a large corpus of text, and uses these vectors to compare and measure the similarity of the documents. This method is still time-consuming, however, and it still needs a pre-processing stage to create a term matrix [33]. The best path length on a semantic self-organizing map (BLOSSOM) method uses the SOM (self-organizing map) algorithm to reduce the dimensions and to calculate the semantic distance from one keyword to another, just like a traversal algorithm [30]. By finding the similarity score of each pair of keywords, the algorithm uses node selection and finds the shortest path on the undirected graph to form a concept path. This algorithm is similar to our proposed method. However, our method does not use any training process to extract cross-domain information. Pointwise mutual information (PMI) uses statistical methods to measure the association between two events. It can find the actual probability of the co-occurrence of two keywords in the document set. If the keywords x and y occur together more often than by chance, then the PMI value will be higher, so the relationship between the two keywords will be very strong. This method can be used in any search engine, and many search engine optimization (SEO) methods are based on this algorithm. Nevertheless, this method requires huge document collections.

2.1.2. Vocabulary-based model

WordNet groups words into synsets (synonym sets) and records the semantic relations of these synsets. This method was first developed by George A. Miller et al. in 1985. Currently, WordNet contains about 117,659 synsets and 206,941 word-sense pairs. It is also widely used in semantic Web applications. Some studies use it to calculate the similarity between keywords. However, WordNet only covers vocabulary from certain domains, so it cannot support information extraction in every domain. Explicit semantic analysis (ESA) is a measure that uses Wikipedia as a dataset to compute the semantic relatedness between two arbitrary keywords [17]. The input texts are extracted by the TF-IDF algorithm and then become the weighted vectors of the concept.
Therefore, the relatedness of each two keywords can be calculated by using the cosine similarity metric. The problem with using this method is that the data need to be pre-collected. If users read an article that lacks the pre-collected information, then they will spend a great deal of time downloading and parsing the data from the Internet.
2.1.3. World Wide Web-based model The algorithm that uses the World Wide Web can overcome the restrictions of the limited corpus and can be used in extracting information from different knowledge domains [9]. Normalized Google Distance (NGD) was proposed by Cilibrasi and Vitanyi [8], and it has been used to calculate the relationship between two words. The researchers treated the World Wide Web as the largest database on earth. In this way, they did not need to pre-collect the data as a corpus; rather, they just entered each keyword in the search engine and used the number of search results to calculate the similarity of the keywords. For example, the relationship between ‘‘USA’’ and ‘‘New York’’ will be higher than the relationship between ‘‘Taiwan’’ and ‘‘New York’’. But for each two keywords, Google must be searched three times to find the number of search results and to calculate the NGD score. The keywords ‘‘USA’’ and ‘‘New York’’ must be entered both individually and together to find the number of Web pages that contain both keywords at the same time. Although this algorithm can be used in cross-domain knowledge extraction, the execution time will grow along with the number of potential keywords. Another problem is that the algorithm cannot extract the sequence of keywords as the term vector because its original design is only for the similarity measurement of two events [38]. Batra and Bawa [2] used the NSS to evaluate words in predefined categories and to discover the Web services semantically. Normalized similarity score (NSS) is derived from the NGD algorithm. Some of the pre-defined categories include zip code, temperature, weather, and the like, and the system will extract terms to calculate the most relevant documents and allocate the service to it. 2.2. Sequence of words The Google search engine can cope with the sequence order of keywords. If a user enters the same keyword set into the search engine but in different order, the search results will change. Thus, it is important to use these algorithms to find the best sequences in order to achieve the most relevant search results. 2.2.1. n-Gram n-Gram matching techniques are one of the most common approaches [16]. An n-gram is a set of n consecutive characters extracted from a word. It relies on the likelihood of sequences of words, such as word pairs (in the case of bigrams) or word triples (in the case of trigrams), and therefore it is less restrictive. The algorithm is as follows:
P(W) = P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w2, w1) ... P(wn | w1, w2, w3, ..., wn-1)    (1)

Here W represents the string w1 w2 ... wn, and P(wi | w1, ..., wi-1) represents the conditional probability that the word wi appears given that the words w1 to wi-1 have appeared. In other words, each word's appearance depends on the last n words. The model can also be used to represent sequences of words, such as "Walt Disney Parks", "Disney Parks and", and "parks and resorts", which are trigrams. Then, these keyword sequences can be used as the vectors of the document and can be compared to other documents to find similar information.
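To make the chain rule in Eq. (1) concrete, the following Python sketch (not from the paper) estimates bigram probabilities by maximum likelihood and scores a short word sequence; the toy corpus and the omission of the unigram prior P(w1) are simplifications for illustration.

from collections import Counter

def bigram_model(tokens):
    """Estimate P(w_i | w_{i-1}) by maximum likelihood from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

def sequence_score(prob, tokens):
    """Approximate P(w_1 ... w_n) with the bigram chain rule of Eq. (1),
    ignoring the unigram prior P(w_1) for brevity."""
    score = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        score *= prob(prev, word)
    return score

corpus = "walt disney parks and resorts walt disney world".split()
prob = bigram_model(corpus)
print(sequence_score(prob, ["walt", "disney", "parks"]))  # 1.0 * 0.5 = 0.5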
Feng and Croft [14] use the ME algorithm to extract English noun phrases automatically. The system can use probability methods to find a sequence of keywords from one word to the next word. These noun phrases can be used to realize the article summarization. Li et al. [28] use the ME algorithm to extract a sequence of keywords as an index of the news. Thus, those relative pieces of information can be clustered together, which provides better search results to the user. However, the algorithms mentioned here require training processes and training datasets. Therefore, they can only be used in restricted knowledge domains [25,39]. 2.3. Keyword expansion When a keyword is entered into the Google search engine, the engine provides some relevant next-keywords and the number of search results for that keyword set. For example, if ‘‘Taiwan’’ is entered, the search engine will show that ‘‘Taiwan news’’ has about one hundred million search results. This kind of method, however, can only help users understand the importance of these keyword sets. In most situations, that information is not useful because it is not customized to each user’s thought. Thus, many research studies have focused on expanding a user-entered single keyword into a set of sequential keywords in order to get more accurate results [4,15,31]. There are two main types of keyword expansion methods [12]. 2.3.1. User profiles A user’s browsing or using behavior profiles need to be collected to form a personal profile. Then, some techniques can be used to construct a tree-based or probability-based model in advance so that the search engine system can provide some customized information [29,36,44]. Cui et al. [10] propose a log-based query expansion and suppose that a user will choose the relevant information about the document being read. Thus, the researchers could calculate the similarity between the user’s interests and the keyword. For example, one user might frequently read articles about computer technology. If that user enters the keyword ‘‘Apple’’ into the search engine, then the system will expand the keyword to ‘‘Apple computer’’ or some other relevant information about the computer science knowledge domain, not the fruit. The weakness of this system is that if the user really wants to know information about the fruit, his search results will be disappointing. This method can also analyze the local context, which is based on the top-ranked documents initially retrieved for a given query, and then add the best-scoring concepts to the query [45]. In this way, the keywords can be expanded by using the co-occurrences of the global and local terms and thereby getting more accurate search results. However, this method does not consider the semantic relationships between keywords, and it takes a great deal of time to collect those top-ranked documents for each search. Some researchers try to add the semantic relationship of the keywords, using ontology to solve this problem. There are two problems with this method: (a) the ontology needs to be pre-constructed by the domain expert; (b) the input of the keyword must match the format of the semantic relationship. For example, for a user to find information about a person, the system must include a column about the person’s name and a column about his or her occupation. Thus, to find information about Michael Jackson, a user must enter both ‘‘Michael Jackson’’ and ‘‘singer’’ into the system. 
Then, the system can draw from its ontology and use the concept-level keyword to search and to get more accurate search results [24]. The weakness of using user profiles is that many data must be collected about each user, and the constructed models are always huge. Thus, the computational time will be very long. Also, if a keyword is used that was not previously connected to this kind of knowledge, then the user profile method will not be able to provide any useful information.
2.3.2. Automated collaborative filtering The collaborative recommendation method depends on collecting information about each search engine user’s interests and grouping the users who have similar interests or knowledge backgrounds. In this way, if a user enters a keyword, then the system can provide the potential next keywords to the user based on information from similar community members. This kind of method always needs a large amount of information and a significant computational effort; hence, only big companies or groups can use this kind of technique to help users expand their keywords [32,46,26,41,18]. The advantage is that if the user is a new member of this system, and hence the system lacks information about that individual, then this method can be used to improve the quality of the search results. However, if a user does not know how to enter the precise first keywords or if the user misspells the keywords, then this method will be useless. Additionally, people in the same community might have similar interests, so if they want to search for something outside of their knowledge domain, then their search results will be disappointing. The long computational time and the need for vast room to store these huge user profiles are also significant problems for this kind of system. In our previous work, we used the NGD algorithm to deal with this problem [6]. By using the NGD algorithm, we can treat the search engine as the largest semantic database in the world. In this paper, we will improve the stability of the NGD algorithm and will find a new way to extract the keyword sequence from an article or Webpage. This method is totally different from the methods of the above-mentioned researches that collected user information and used training processes. The most important feature of this system is that it can be used to extract cross-domain knowledge.
3. Proposed method

In this section, we will introduce our system in detail, describing our design and calculation methods. Our system is composed of three main parts:

1-gram filtering
Google Core Distance algorithm (GCD)
Hop-by-Hop Routing algorithm (HHR)

Fig. 1 shows our system architecture. The 1-gram filtering algorithm is used to find potential keywords. The GCD algorithm uses these extracted keywords to calculate and discover the relationship of each two keywords. Finally, the Hop-by-Hop Routing algorithm from the sensor network decides the strongest sequence of the keywords.

Fig. 1. System architecture.

In this stage, we can only deal with 1-gram keywords, because the sequence of keywords is calculated and formulated in the final stage of the system. In this way, we only filter 1-gram keywords in the system instead of 2- or 3-gram keywords. In this system, we do not use a stemming procedure, for the following two reasons:

1. Every word which we extract is considered a potential keyword for further combination into a multi-word term or into the keyword sequence. If we stem some words and then use the 1-gram filtering algorithm to combine them with other words, the combined multi-word term might be different from the correct one, and the results will not be very good. For example, consider the two words "united" and "states". By using the stemming procedure, the system might change these two words into "unite" and "state". Thus, when we use the 1-gram filtering algorithm to combine them, the search results of "united states" will
be different from those of "unite state". Also, the meanings of these two keywords are totally different. Another example is the words "traditional" and "Chinese": the 1-gram filtering algorithm may combine them into the keyword "traditional Chinese", which most commonly refers to characters in the standardized character sets of Taiwan and Hong Kong. But if we use the stemming procedure, the combined keyword may become "tradition Chinese" or "tradition China". Thus, the meaning of this keyword is also totally different [47].

2. We hope our system can be used in different knowledge domains and can provide immediate information to the user. Therefore, it is almost impossible to collect all of the keywords in advance and provide them to our system for stemming.
3.1. 1-gram filtering method

In our experience, the following three key points should be noted to extract potential keywords from a Webpage [6].

3.1.1. Part-of-speech and word combinations

We use the Qtag tagger to read text and, for each token in the text, to return the part-of-speech (such as noun, verb, punctuation, etc.). Words tagged as NN (common noun, singular), NP (proper noun), DT (determiner), or JJ (adjective) are chosen because, in our experience, most desired keywords are composed of those words. All the executions should be on-line and in real-time, without using any pre-set database or word dictionary. We provide a simple method to solve this problem and to combine single words together into a more meaningful keyword. In our previous work, the experiments showed that this method can successfully improve the accuracy of the entire system.

3.1.2. Length of the words

We choose words that have at least three characters as our potential keywords, because most of the keywords that we need, including abbreviations, are at least three characters long. For example, the word "GPS" is an abbreviation of "Global Positioning System" and is exactly three characters long, so we still choose it as a potential keyword for the user. In this way, some unimportant words are filtered out, which makes the whole system more efficient [43].

3.1.3. Number of Google search results

After we filter out some irrelevant words and find the potential keywords, we use those words to search Google. We can thus use the Google search results to decide which words should be retained and which deleted, guaranteeing that the retained keywords are worth being calculated in the next stage. We use a simple algorithm to present a full view of our system's first processing step:

Algorithm 1. 1-gram filtering.

gresult denotes the number of Google search results; length denotes the length of the word; P1 = {NN, JJ, NP, DT} denotes the allowed POS of the first word; and P2 = {NN, NP} denotes the allowed POS of the second word.

INPUT: words[0..n-1]: array of strings parsed from the paragraphs of a webpage.
  // Types of POS for the first word of any two consecutive words.
  P1 = {NN, JJ, NP, DT};
  // Types of POS for the second word of any two consecutive words.
  P2 = {NN, NP};
  a = threshold on the number of Google search results;
OUTPUT: candidates; // Potential keywords for the NGD calculation.
begin
  candidates = ∅; i = 0;
  while (i < n) {
    if (words[i].length() >= 3 && words[i].pos() ∈ P1)
      if (words[i+1].length() >= 3 && words[i+1].pos() ∈ P2) {
        keyword = words[i] + words[i+1];   // combine the two consecutive words
        if (gresult(keyword) < a)
          candidates = candidates ∪ {keyword};
      }
      else if (gresult(words[i]) < a)
        candidates = candidates ∪ {words[i]};
    i++;
  }
end
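For readers who prefer executable code, the sketch below mirrors Algorithm 1 in Python. The tagged input is assumed to come from a POS tagger such as Qtag, and google_hits is a hypothetical helper that returns the number of search results for a query; both are stand-ins rather than part of the paper's implementation.

from typing import Callable, List, Tuple

P1 = {"NN", "JJ", "NP", "DT"}   # allowed POS tags for the first word of a pair
P2 = {"NN", "NP"}               # allowed POS tags for the second word of a pair

def one_gram_filter(tagged: List[Tuple[str, str]],
                    google_hits: Callable[[str], int],
                    a: int) -> List[str]:
    """Mirror of Algorithm 1: keep single words, or two consecutive words combined,
    whose number of Google search results is below the threshold a."""
    candidates: List[str] = []
    for i, (word, pos) in enumerate(tagged):
        if len(word) >= 3 and pos in P1:
            nxt = tagged[i + 1] if i + 1 < len(tagged) else None
            if nxt and len(nxt[0]) >= 3 and nxt[1] in P2:
                keyword = f"{word} {nxt[0]}"            # combine the two words
                if google_hits(keyword) < a:
                    candidates.append(keyword)
            elif google_hits(word) < a:
                candidates.append(word)
    return candidates

# toy usage with a fake hit-count function
fake_hits = lambda kw: 1_000_000 if " " in kw else 50_000_000
print(one_gram_filter([("global", "JJ"), ("positioning", "NN"), ("system", "NN")],
                      fake_hits, a=10_000_000))
# -> ['global positioning', 'positioning system']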
3.2. Google Core Distance The Google similarity distance algorithm, which was created by Cilibrasi and Vitanyi [8], is used to calculate the relationship between two words. The researchers treated the World Wide Web as the largest database on earth. This was the original algorithm:
NGD(x, y) = [max{log f(x), log f(y)} - log f(x, y)] / [log N - min{log f(x), log f(y)}]    (2)
The attributes f(x) and f(y) represent the number of search results for the words "x" and "y", respectively. The attribute f(x, y) represents the number of Web pages containing both x and y, and N is the total number of pages indexed by Google. For each pair of keywords, the NGD needs to search Google three times. This is very time-consuming, especially when many potential keywords need to be calculated. In the first round, the system sends each keyword to Google and obtains the number of search results. Next, permutation methods are used to rearrange the keywords into pairs and then resubmit them to Google. Then, the NGD value can be used as the relationship degree of each two keywords. We have implemented the NGD algorithm to construct several systems. The original NGD algorithm supposed that both the total number of pages and the total number of search terms in Google would continue to grow for some time, so that the similarity distance between two keywords would not be significantly changed. We found, however, that the relationship between two keywords will vary along with any substantial variation in the number of Google search results. As seen in Table 1, the numbers of search results in September 2009 are remarkably different from those in September 2008. For example, the search results for the keyword "Algorithm design" are 35 times more numerous than before. Thus, if we want to use the NGD algorithm to find the relationship of this keyword with the keyword "divide-and-conquer", the result will be totally different from when we last calculated it. When the change is not very significant, however, the value of the NGD will remain almost the same. We hope the relationship of each two keywords can remain stable and will not experience a time-variance effect. Therefore, we want to find a new algorithm that measures the relationship of two keywords while still using the number of Google search results, as the NGD algorithm does. We formed a very simple idea about finding the distance between the two centers of the circles. As seen in Fig. 2, the original NGD algorithm calculates the intersection of the two circles. Thus, when one circle contains almost the same number of search results but another has a great increase or decrease, the intersection will experience a great change, so that the score of the two keywords using NGD will also vary. But if the distance between the two centers is used to represent the similarity distance, then the changes should not be too significant, and the score will remain almost stable.
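As a point of reference, Eq. (2) can be computed directly from the hit counts. The small Python sketch below assumes f(x), f(y), f(x, y) and the index size N have already been obtained from the search engine; the counts in the example call are illustrative only.

import math

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance (Eq. (2)) computed from raw hit counts."""
    lfx, lfy, lfxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lfx, lfy) - lfxy) / (ln - min(lfx, lfy))

print(ngd(fx=2_020_000, fy=2_140_000, fxy=250_000, n=8_058_044_651))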
Table 1
Number of Google search results in 2008 and 2009.

Keywords                 Sept-2008    Sept-2009     Increase/decrease (%)
Algorithm design         2,020,000    70,100,000    3470
Mathematical process     3,650,000    14,900,000    408
Algorithm engineering    1,260,000    9,790,000     777
Dynamic programming      2,140,000    11,800,000    551
Divide-and-conquer       1,190,000    1,120,000     94
Template method          663,000      24,600,000    3710
Decorator                7,850,000    6,110,000     78
Internet retrieval       751,000      11,300,000    1505
Fig. 2. Search results variation.
Fig. 3. Distance-based Google similarity distance.
Fig. 3 presents a simple example of our idea. Suppose that circle ‘‘a’’ represents the total number of search results in Google. Circles ‘‘b’’ and ‘‘c’’ represent the number of search results for the two keywords that we want to measure. We can use the radius of circle a (R), subtracting the radius of circle b to get the value of r1. In the same way, we can find the value of r2. Thus, the distance of the two centers will be as follows:
0 ≤ GCD(x, y) ≤ √((R - r1)² + (R - r2)²)    (3)
A small example can illustrate this method. We call this method Google Core Distance (GCD) to represent its special property of using Google as the database. The official data for the number of Web pages show that about 8,058,044,651 pages are within the circle area of a. The numbers of search results for the keywords "algorithm design" and "dynamic programming" are 2,020,000 and 2,140,000, respectively, and those numbers represent the circle areas of b and c. Although a circle's area is π multiplied by the radius squared, we can ignore the π because it will not significantly change the answer. The numbers of search results are always so huge that we use their logarithms in the calculation. Thus, we can obtain the radii R, r1, and r2. Finally, the similarity score of these two keywords is about 0.511. The calculation process is as follows:
R² = log 8,058,044,651 = 9.906
r1² = log 2,020,000 = 6.305    // Algorithm design
r2² = log 2,140,000 = 6.330    // Dynamic programming
⇒ R = 1.780, r1 = 1.417, r2 = 1.420

GCD(x, y) ≤ √((R - r1)² + (R - r2)²)
⇒ GCD(x, y) ≤ √((1.78 - 1.417)² + (1.78 - 1.42)²) = 0.511
The GCD is a measurement of the distance between the two keywords. We use the number of search results as the circle area and calculate the distance between the two circle centers as the relation. The range of the GCD score is given by formula (3). In this way, a smaller GCD value means that the two circle centers are closer; in other words, the relationship between the two keywords is stronger.
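A minimal Python sketch of Eq. (3) follows, assuming the circle "areas" are taken directly as the logarithms of the hit counts. The worked example above scales these areas by an extra constant before taking square roots, so absolute values differ by a constant factor, but the relative ordering of keyword pairs is unchanged.

import math

def gcd_score(fx: int, fy: int, n: int) -> float:
    """Google Core Distance (Eq. (3)): distance between the two circle centres,
    with squared radii taken as log10 of the hit counts."""
    big_r = math.sqrt(math.log10(n))
    r1 = math.sqrt(math.log10(fx))
    r2 = math.sqrt(math.log10(fy))
    return math.sqrt((big_r - r1) ** 2 + (big_r - r2) ** 2)

# "algorithm design" vs. "dynamic programming", Sept-2008 counts from Table 1
print(gcd_score(2_020_000, 2_140_000, 8_058_044_651))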
3.3. Hop-by-Hop Routing algorithm

After we find the relationship score of each two keywords, we can understand the relative importance of various keywords. As shown in Fig. 4, this method is just like a sensor network, and a best routing word-by-word path can be found as the keyword sequence to represent the article's information. A keyword sequence can be used to represent a vector of the article so that it can be used to compare the similarity between two documents and to cluster related articles together. Still, finding the most important sequence in such a system is not easy, and no previous research has considered that task. Therefore, we must find a new way to decide which keywords are the best in order to hop and link them together. We must also consider the threshold that decides when the system will stop the hopping process. Our two main concerns are integrating the relationship of the keywords and their linking statuses. We will introduce our design and explain why we use these methods in detail in this section.

Fig. 4. WANET.

3.3.1. PageRank algorithm

PageRank was used by the Google search engine to assign a numerical weighting to each element of a hyperlinked set of documents. That weight can be calculated for collections of documents of any size. As mentioned previously, we can find the relationship of each two keywords from the GCD algorithm. We rank those GCD values in ascending order and choose half of them as the most important relations. Then, we can use the PageRank algorithm to analyze each relation and to determine which keyword is the most important. Finally, we can use this keyword as the starting point to find the best routing path to represent the document. The PageRank algorithm proceeds as follows: a link from keyword A to keyword B is a vote, by keyword A, for keyword B. The algorithm also analyzes the keyword that casts the vote.

N = the number of important relations; d = the damping factor (0.85); PR = the PageRank score; L = the normalized number of outbound links.

PR(A) = (1 - d)/N + d (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)    (4)

After we find the PR value of each keyword and the relationship of each two keywords, we can use the RF power dissipation algorithm to combine these two values into one synthetic score (SS). The original algorithm is as follows:

PT = PR · d²    (5)

This algorithm has been used to choose the best next sensor node in order to hop from one node to another. The value of PR is the power of the receiver, and d is the distance. Thus, if the power is stronger and the distance is not too far from the sender, then the node will choose the strongest node to send the message. We adapted the algorithm to our system as follows: PRR = the PageRank score of the selected keyword; GCD = the GCD score between two keywords;

SS = PRR · GCD²    (6)
The power of the receiver (PR) is expressed in dBm; the larger the absolute value of this negative number is, the better the receive sensitivity will be. In our proposed method, the meaning of the PageRank score is analogous to PR in that it represents the importance of the keyword. Also, the GCD score is the distance between two circle centers, so we can use the idea of RF power dissipation to calculate an SS score. If the SS score of the root keyword with one of its relative keywords is higher than the others, we choose this relative keyword as the next hop and link to it. Thus, we can start with the keyword that has the highest PageRank value and then choose the next keywords that have the highest SS values. For example, as shown in Fig. 5, the keyword that has the highest PR score is "RFID-based". We use the algorithm to calculate the SS scores and find that the keyword "remanufacturing" has the highest value. Therefore, we can find the sequence "RFID-based → Remanufacturing". By using this method, we can create a list of keyword sequences that can be used to represent the document.

Fig. 5. Path choosing example.
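A minimal sketch of this next-hop choice, assuming the PageRank scores and the pairwise GCD values have already been computed; following the rule stated above, the candidate with the highest synthetic score SS = PRR · GCD² is selected. All numbers in the example call are illustrative.

from typing import Dict, Tuple

def next_hop(neighbour_gcd: Dict[str, float],
             pagerank: Dict[str, float]) -> Tuple[str, float]:
    """Pick the next keyword by the synthetic score of Eq. (6)."""
    scored = {w: pagerank[w] * g ** 2 for w, g in neighbour_gcd.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

print(next_hop({"remanufacturing": 0.42, "reverse logistics": 0.55},
               {"remanufacturing": 0.31, "reverse logistics": 0.12}))
# -> ('remanufacturing', 0.0546...)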
3.3.2. BB's graph-based clustering algorithm

After we find a keyword sequence, we face a new problem: deciding whether the sequence is strong enough and when to stop the hopping procedure. We adapted the BB graph-based clustering algorithm [3] to solve this problem. This algorithm was designed for measuring the similarities between query terms and search results. Thus, we use this algorithm to weight our keywords and their relative next-keywords. This is the original BB algorithm:
BB(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)| if |N(x) ∪ N(y)| > 0, and 0 otherwise    (7)

In the original algorithm, N(x) and N(y) represent the sets of neighboring vertices of x and y, respectively. However, we want to use them to measure the relative score of two keywords. Thus, our definition for this system is as follows:

N(x) = the set of relative keywords of the keyword x; N(y) = the set of relative keywords of the keyword y.

We can use a small example to explain how this algorithm is used. As shown in Table 2, there are five keywords that are relative to "RFID-based" and two that are relative to "remanufacturing". Among those relative keywords, only one keyword, "genetic algorithm", belongs to both keywords at the same time. Thus, their BB score is 0.167 (1/6). We use the BB score as a weight to multiply with the SS score, and we hope to use this method to strengthen the accuracy of the whole system.

Table 2
Keywords and their highly relative keywords.

Relative words    RFID-based              Remanufacturing
1                 Remanufacturing         Genetic algorithm
2                 Closed-loop supply      Recycling simulation
3                 Reverse logistics
4                 Genetic algorithm
5                 Reasonable recycling

3.3.3. Hop-by-Hop Routing algorithm (HHR)

The last step involves automatically setting the threshold and deciding when to stop the routing process. We can combine all of the algorithms in this section as follows:

D(x0 → ... → xn) = (1/n) Σ_{i=0}^{n-1} BB(xi, xi+1) · PR(xi+1) · (GCD(xi, xi+1))²    (8)

And the threshold for stopping the hopping is as follows:

D(xi → xi+2) > D(xi → xi+1) + D(xi+1 → xi+2), for i = 0 to n - 2    (9)
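The pieces above can be combined in a few lines of Python. The sketch below implements Eq. (7), the averaged path score of Eq. (8), and the extension check described in the next subsection; the relative-keyword sets, PageRank scores and GCD values are assumed to be precomputed, and any keyword names passed in are placeholders.

from typing import Dict, List, Set, Tuple

def bb_score(nx: Set[str], ny: Set[str]) -> float:
    """Eq. (7): overlap of the two relative-keyword sets."""
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0

def path_score(path: List[str],
               neigh: Dict[str, Set[str]],
               pr: Dict[str, float],
               gcd: Dict[Tuple[str, str], float]) -> float:
    """Eq. (8): averaged weighted hop score D(x0 -> ... -> xn)."""
    hops = list(zip(path, path[1:]))
    total = sum(bb_score(neigh[a], neigh[b]) * pr[b] * gcd[(a, b)] ** 2
                for a, b in hops)
    return total / len(hops)

def keep_extending(x0: str, x1: str, x2: str, neigh, pr, gcd) -> bool:
    """Keep the mediated 3-gram x0 -> x1 -> x2 when its averaged score
    exceeds the direct x0 -> x2 score (the HHR stop test)."""
    return path_score([x0, x1, x2], neigh, pr, gcd) > path_score([x0, x2], neigh, pr, gcd)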
Fig. 6. Expansion score checking.
In Fig. 6, a 3-gram keyword sequence is used to provide an example. The score of the Top-1 and Top-2 sequence is D(X0, X1), and the score of the Top-2 and Top-3 sequence is D(X1, X2). Thus, according to our HHR algorithm, the value of the Top-1-to-Top-3 sequence will be (D(X0, X1) + D(X1, X2))/2. Accordingly, we want to ensure that this 3-gram sequence, which is combined from two individual 2-gram sequences, has a stronger meaning to represent the document. We use the same algorithm to calculate the HHR value and to find D(X0, X2). If the 3-gram HHR value is more than D(X0, X2), then we treat this 3-gram sequence as valid, because the meaning of X0 and X2, which is mediated by X1, is stronger than it would be without the mediating effect. Thus, the keyword sequence will become X0 → X1 → X2. In the same way, the system repeats this process and continues to check until the threshold is reached.

Algorithm 2. Hop-by-Hop Routing algorithm.

INPUT: words[0..n-1]: array of strings parsed from the paragraphs of a webpage; SS score function score(l, k).
OUTPUT: k-sequence; // Potential keyword sequence.
begin
  add words[0] and words[1] to k-sequence;
  i = 2;
  while (i < n) {
    compute score(0, 1), score(1, 2), ..., score(i - 1, i);
    compute score(0, i);
    s = [score(0, 1) + score(1, 2) + ... + score(i - 1, i)] / i;
    if (s < score(0, i))
      return k-sequence;
    else {
      add words[i] to k-sequence;
      i++;
    }
  }
end
Fig. 7. PLSA relations.

This idea is basically derived from the PLSA [19]. Fig. 7 represents the relationship "z is affected by d, and w is affected by z". The relationships between documents and keywords are mediated by a potential theme or intention. With the mediator z, the relationship of d and w will be stronger than before. For instance, suppose we are reading an article about "the Taiwan high speed railway". The system can extract the words "Taiwan", "high speed", and "railway". If we only use two of them (i.e., "Taiwan Railway") to search Google, it will not be easy to find information exactly related to the article, because the Taiwan Railway company is different from the Taiwan High Speed Railway company. Thus, we can use the algorithm to add the keyword "high speed" to the middle of the keywords and get the sequence "Taiwan → High Speed → Railway". The search results will now be better than before. We use this idea to formulate our new algorithm design. The PLSA algorithm can only extract the relationship among three entities, but our algorithm finds more than three relative keywords in order to formulate an n-gram sequence. Also, the PLSA is basically a probability-based algorithm, whereas our GCD algorithm is a distance-based algorithm. Thus, we combined several algorithms to develop a new way to reach the same goal as the PLSA.
4. Experimental results

In this section, we conduct some experiments to evaluate the performance and accuracy of our new algorithm and to compare it with the old methods. Because our new algorithm is intended to solve the problem of the NGD algorithm's instability, we first evaluate the GCD algorithm's ranking results, using the data that we collected in September 2009 and in January 2010. The data here represent the number of search results for each keyword. Then, we use the same data and the NGD algorithm to form a list of Top-k keywords. We can use the Spearman footrule to compare the difference between the two algorithms, using data from different times to find the degree of impact of the time-variance problem.

4.1. Time variance effect of the Google search results

As mentioned in the previous section, the relationship of any two keywords will be affected by the number of Google search results for each keyword. The NGD algorithm cannot deal with this problem; hence, the execution results will vary over time. Our new algorithm tries to solve this problem and to provide more consistent results. The experiments presented here endeavor to find the relationships between words and to compose a list of those relationships according to their NGD and GCD values. We use the old data from September 2009 to run the executions and then compare them to the new data, which are the latest Google search results. We use the Spearman footrule to compare the sequences that were extracted by those two algorithms. The first important element of this experiment is forming a list of ranked keywords. Because the algorithms can only find the relationship between each two keywords and because the Spearman footrule can only be used to measure a one-dimensional sequence of keywords, we must find some way to transform the data into a single dimension. For example, we get the Top-1 result, which has the smallest NGD or GCD value, as (Apple, Mac); the Top-2 result is (Apple, iPod). We use the following simple algorithm to get the list (Apple → Mac → iPod), and then we use the list to compare the results of our new algorithm and the NGD. In our previous work [6], we proved that this algorithm for creating a list of keywords is quite efficient and that the ranked results are truly meaningful. The algorithm is as follows. We can describe our data as a three-tuple Ti = (word1, word2, NGD), where: word1 denotes the first keyword in termset T; word2 denotes the second keyword in termset T, and word1 ≠ word2.
NGD is the NGD value and represents the relationship between word1 and word2. Qlist is a sequence queue which saves those keywords in order.

Algorithm 3. The keyword filtering.

INPUT: combinations[n]: array of tuples (word1, word2, NGDscore);
  // word1 and word2 are any two elements of "candidates".
  // Tuples are sorted in ascending order by NGDscore.
OUTPUT: Qlist: the ranked queue of elements of candidates.
  // The first keyword is the highest recommendation.
begin
  i = 0;
  while (i < n) {
    if (combinations[i].word1 not in Qlist)
      append combinations[i].word1 to Qlist;
    if (combinations[i].word2 not in Qlist)
      append combinations[i].word2 to Qlist;
    i++;
  }
end
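In Python, the ranking step of Algorithm 3 can be written as below; the input pairs are assumed to be already sorted in ascending order of their NGD (or GCD) score, and the scores in the example call are illustrative.

from typing import List, Tuple

def ranked_keyword_list(pairs: List[Tuple[str, str, float]]) -> List[str]:
    """Flatten sorted keyword pairs into one ranked list (Algorithm 3):
    keywords appearing in stronger (earlier) pairs are ranked higher."""
    qlist: List[str] = []
    for word1, word2, _score in pairs:
        if word1 not in qlist:
            qlist.append(word1)
        if word2 not in qlist:
            qlist.append(word2)
    return qlist

# example from the text: Top-1 pair (Apple, Mac), Top-2 pair (Apple, iPod)
print(ranked_keyword_list([("Apple", "Mac", 0.12), ("Apple", "iPod", 0.18)]))
# -> ['Apple', 'Mac', 'iPod']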
Next, we use the Spearman footrule to evaluate the similarity of the rankings provided by the NGD and GCD algorithms. The Spearman footrule distance between two given rankings is defined as the sum of the absolute differences between the ranks of each item i with respect to the two full lists. We previously used the NGD algorithm to construct a keyword suggestion system which provides a list of keywords based on the importance of the relationship between each two keywords. But we found that the ranking of the keywords varies when we use the same method to calculate the NGD score at different times. We believe that in most situations the ranking results should not vary much, because the relationships between most of the keywords also remain stable. Thus, we use the Spearman footrule to measure the ranking results of using the same algorithms at different times. This measurement can evaluate the Top-k ranking quality of the context and has been in use for a long time. Therefore, we think this measurement can provide evidence of how much the ranking results of the NGD algorithm vary and of whether our proposed algorithm can improve the instability of the ranking results [23]. Formally, this is the Spearman footrule distance between two rankings r and s:
Fr_|S|(r, s) = Σ_{i=1}^{|S|} |r(i) - s(i)|    (10)

NFr = Fr_|S| / max Fr(|S|)    (11)

F = 1 - NFr    (12)

When the two lists are identical, Fr_|S| is zero; its maximum value is |S|²/2 when |S| is even, and (1/2)(|S| + 1)(|S| - 1) when |S| is odd. If the value of F is close to 1, then the two lists are nearly the same. In other words, the keyword rankings in the GCD results that use the new data will be the same as in the GCD results that use the old data. The numbers of search results provided by the new data may sometimes differ greatly from the old data. Thus, some keywords that did not pass the threshold in the past, because their numbers of search results were too small, may now be contained in more Web pages, so they can successfully pass the threshold and be ranked in the list of keywords. The Spearman footrule cannot evaluate two lists that contain different items. Thus, before we start using the Spearman footrule as a measurement, we must prune the keywords that do not appear on both lists at the two different times. In most situations, only a few keywords appear on a single list. The Spearman measurement results are presented in Table 3. Our new algorithm remains more consistent than the NGD algorithm. The value of 0.74 means that the ranked keyword lists of new data have a 74% chance of being the same as the old lists if our GCD algorithm is used. But with the NGD, the chance diminishes to 56%.

Table 3
Spearman's footrule measurement of GCD and NGD.

Measurement             GCD     NGD
Spearman's footrule     0.74    0.56
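A small Python helper for Eqs. (10)-(12), assuming the two ranked lists have already been pruned to the keywords they share:

from typing import List

def footrule_similarity(old_rank: List[str], new_rank: List[str]) -> float:
    """Return F = 1 - NFr (Eqs. (10)-(12)); 1.0 means identical rankings."""
    assert set(old_rank) == set(new_rank), "prune to the common keywords first"
    s = len(old_rank)
    pos_new = {kw: i for i, kw in enumerate(new_rank)}
    fr = sum(abs(i - pos_new[kw]) for i, kw in enumerate(old_rank))     # Eq. (10)
    max_fr = s * s / 2 if s % 2 == 0 else (s + 1) * (s - 1) / 2         # max of Fr
    return 1.0 - fr / max_fr                                            # Eqs. (11)-(12)

print(footrule_similarity(["Apple", "Mac", "iPod"], ["Mac", "Apple", "iPod"]))  # 0.5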
4.2. Execution time

The number of words in each article deeply affects the system's execution time. In our second experiment, we evaluate whether the system that uses the new algorithm can be more efficient than before. In our previous work, we also used conditional probability to simplify the NGD algorithm into a new algorithm, called SNGD, and we proved that the precision and recall of the execution results were almost the same. However, SNGD basically evolved from the NGD algorithm, so it was also unable to solve the problem of time variance. Apart from this problem, the execution time of that algorithm was very satisfactory. We also combined the SNGD algorithm with the PLSA to find the most important sequence of keywords to represent the documents; we call this method GLSD. Thus, we need to compare that system's execution time with our new system's time. Although the original NGD system was designed for finding the relationship between two keywords, we think that the NGD system's execution time can provide a baseline for comparison. We want to show that even though we add some other algorithms to the system, the whole execution time is less than that of the NGD system, because we only need to search Google once for each keyword. The features of these three systems are displayed in Table 4. Because Google does not provide its Google search API for users to search the keywords, we use an HTML parser to automatically submit each keyword which the system extracted to Google. Then, we use a text extraction method to get the number of search results from the Google result page. Thus, the performance is affected by this text processing; the time needed to get the number of search results has been included in the execution time. Our system sends each keyword to Google and obtains the number of search results, so each time we run the program to extract the representative sequence of a document, it automatically searches Google hundreds of times. Thus, we set the time interval between searches to 0.5 s to prevent the overuse of Google and of the network bandwidth.

Table 4
Features of our proposed method and previous works.

Feature                                        WANET                  GLSD           NGD
Gram                                           More than 3            3              1
Algorithm                                      GCD + PageRank + BB    SNGD + PLSA    NGD
No. of times searched in Google per keyword    1                      1              2
Fig. 8. Execution time vs. number of words in the article.

The results showed that our GCD algorithm could reduce the execution time when the article contained many keywords. As shown in Fig. 8, when there are 800 words in a single article, the average execution time is only about 250 s, which is almost 100 s less than with the NGD. In most cases, the execution time was about one minute shorter than with the NGD. The GCD algorithm's execution time, however, is longer than the SNGD's in most situations, except for the 800-word case. We integrated the PageRank algorithm and the new Hop-by-Hop routing algorithm into this system in order to increase the overall accuracy, which makes the execution time longer. We hope to improve the algorithm's performance in the future.

4.3. Precision and recall rate

The experiments in the previous section show that the new system's execution time is very satisfactory. In information retrieval, however, the accuracy of the system is a great concern. The accuracy of the system can be measured by using precision and recall rates, as defined in Table 5. We conducted experiments that used the abstracts of 200 research papers to parse the most important sequential keywords and then used them to search. We wanted to see whether a sequence of keywords could find the original paper. Our purpose was to show that the keyword sequences we found are strong enough to represent the documents, without resorting to SVM or the other algorithms used in much previous research. The papers were randomly selected from the Elsevier Web site. We chose four knowledge domains and ten journals for each domain, and in each journal we chose the 10 most-downloaded articles as our dataset. We reasoned that the most-downloaded papers may have more related information on the Internet, which would increase the difficulty of our experiments. After we found a keyword sequence, we searched for it in Google to see whether the original paper was in the Top-5 search results.

Precision defines the purity of the retrieval. A precision of 100% implies that every keyword sequence the system found can locate the original document in Google. For example, 80% precision means that of ten keyword sequences extracted from a journal, eight could locate their original papers. Some research papers could not be mined for 3-gram keywords because the abstracts were too short or contained no meaningful keywords. Recall identifies the completeness of the retrieval: the number of accurate results divided by the total number of papers in each journal, so a recall of 100% implies that every paper in the journal could be located.

Precision = tp / (tp + fp)
Recall = tp / (tp + fn)

Table 5
Contingency table used to compute precision and recall rates.

                     Obtained result: E1      Obtained result: E2
Correct result E1    tp (true positive)       fn (false negative)
Correct result E2    fp (false positive)      tn (true negative)

The execution results for papers from four knowledge domains are presented in Table 6. In Fig. 9, the precision rate of our WANET system, which here extracts only 3-gram keyword sequences, is almost the same as that of the GLSD system, which also extracts 3-gram sequences. The recall rate of WANET, however, is better than that of GLSD, and although WANET's computation time is a bit longer than GLSD's, its robustness against changes in the search results over time is much better. The GLSD system could extract 3-gram keywords from only about 74% of the papers, because some abstracts were too short to contain many potential keywords. With the WANET system, on the other hand, 3-gram sequences could be extracted from almost 90% of the articles, so that the original documents could be found in Google. In the space domain, the accuracy of the keywords is not very high. Most of the abstracts in this domain contain more than 250 words, and many of the keywords contain mathematical symbols to represent terms or ideas. Our system's original design did not consider this kind of keyword; therefore, the precision and recall rates in the space domain were not very good.

Fig. 9. Precision and recall evaluation.
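The following is a minimal sketch of how the precision and recall figures above could be computed. The record layout and the search_top5 helper (a stand-in for a Top-5 Google query) are hypothetical; the bookkeeping follows the paper's verbal definitions, with fp counting extracted sequences that fail to retrieve their paper and fn counting papers that are not retrieved at all.

def evaluate(papers, search_top5):
    """Precision/recall over one journal's papers (a sketch, not the authors' code).

    Each paper record is assumed to look like
    {"url": "...", "sequence": "kw1 kw2 kw3"}, with sequence=None when no
    3-gram sequence could be mined from the abstract.
    search_top5(query) is assumed to return the top-5 result URLs.
    """
    extracted = found = 0
    for paper in papers:
        if paper["sequence"] is None:
            continue                          # abstract too short: nothing to search for
        extracted += 1
        if paper["url"] in search_top5(paper["sequence"]):
            found += 1                        # the sequence locates the original paper

    tp = found
    fp = extracted - found                    # extracted, but the Top-5 results miss the paper
    fn = len(papers) - found                  # papers the system fails to locate at all

    precision = tp / (tp + fp) if extracted else 0.0   # = found / extracted
    recall = tp / (tp + fn) if papers else 0.0          # = found / total papers
    return precision, recall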
Table 6
Execution results of four knowledge domains, using WANET.

              3-gram                                  4-gram
              Accurate   Error   Total-extracted      Accurate   Error   Total-extracted
BM            33         10      43                   32         4       36
CS            33         9       42                   31         1       32
Space         32         17      49                   35         11      46
Psychology    39         8       47                   39         4       43
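Under the paper's own example (ten sequences extracted, eight locating their papers, giving 80% precision), per-domain precision can be read off Table 6 as accurate / total-extracted. A quick check of the 3-gram columns:

# 3-gram columns of Table 6: (accurate, total extracted) per domain.
table6_3gram = {"BM": (33, 43), "CS": (33, 42), "Space": (32, 49), "Psychology": (39, 47)}
for domain, (accurate, total) in table6_3gram.items():
    print(f"{domain}: {accurate / total:.1%}")
# BM 76.7%, CS 78.6%, Space 65.3%, Psychology 83.0%

These figures should correspond roughly to the 3-gram precision bars in Fig. 9 and make the space domain's weakness visible at a glance.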
Because the WANET system can extract keyword sequences longer than 3-gram, we repeated the same experiment with 4-gram sequences to determine whether the system's accuracy would improve. Only 78% of the articles yielded 4-gram keyword sequences, but the precision of the WANET system increased to almost 90%. The recall rate with 4-gram sequences was not very different from that with 3-gram sequences. We think that using a 4-gram sequence to represent the concept of an article or paragraph can achieve high accuracy. Our system can provide sequences longer than 4-gram, and longer sequences should have higher precision rates; however, if we want to use the sequences to compare documents and measure their similarity, the computational cost increases with the sequence length.
4.4. Top-k search results analysis

With our system, the most relevant information can easily be found with a single 4-gram keyword sequence. We also want to know the distribution of the search results: if the original document appears as the Top-1 Google result, we can say that the sequence is truly the most representative one, which is better than merely appearing in the Top-5 results. Here, we analyze the Google search results obtained with the WANET system's 3-gram keywords and compare them with those of the GLSD system. As shown in Fig. 10, almost 70% of the search results for the four domains were found in the Top-1 Google search results; in other words, once the keyword sequence can be extracted, there is about a 70% chance of finding the document in the first search result. Next, Fig. 11 shows the execution results of WANET using only 3-gram sequences. The WANET system is better than GLSD in the BM and psychology domains. In the CS domain, however, the results are not as good as GLSD's, because a 3-gram sequence could not be extracted from every article; when the system does find a sequence, the results are very good. We think the sequences found by WANET may not be as strong as those of the GLSD system. GLSD uses the well-known PLSA algorithm to enhance its NGD-based technique, which is essentially a probability-based method; because we changed the system into a distance-based measurement, we cannot use PLSA to enhance the new system. On the other hand, the new system can extract keyword sequences longer than 3-gram and remains consistent as time goes by. Thus, we took a further step and compared the 4-gram WANET with the GLSD to see whether the search results would improve. Fig. 12 shows the execution results of searching with the 4-gram keywords. The results are better than those of the 3-gram sequences; in particular, the psychology domain's search results have a 90% chance of being found in the Top-1 search results. The worst domain is still the space domain, probably because the keywords in this domain contain too many mathematical symbols; hence, our system cannot find the exact keywords to represent the articles, even when it uses four keywords.

Fig. 10. Top-k search results of 3-gram GLSD.
Fig. 11. Top-k search results of 3-gram WANET.
Fig. 12. Top-k search results of 4-gram WANET.
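The Top-k breakdown reported in Figs. 10-12 amounts to recording the rank at which each paper's own URL appears in the results for its extracted sequence. A sketch of that tally is given below; the search_topk helper and the record layout are again hypothetical stand-ins.

from collections import Counter

def topk_distribution(papers, search_topk, k=5):
    """Tally the rank at which each paper is found (cf. Figs. 10-12).

    search_topk(query, k) is a stand-in for a Google query returning the
    top-k result URLs in rank order.
    """
    buckets = Counter()
    for paper in papers:
        if paper["sequence"] is None:
            continue                       # no sequence extracted for this paper
        results = search_topk(paper["sequence"], k)
        if paper["url"] in results:
            rank = results.index(paper["url"]) + 1
            buckets[f"Top-{rank}"] += 1    # e.g. Top-1, Top-2, ...
        else:
            buckets[f"not in top-{k}"] += 1
    return buckets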
5. Conclusion

In this paper, we proposed a new system that can extract the most important keyword sequence to represent a document and then help users automatically find relevant documents or Web pages. Users simply want to read articles or browse Web pages; they should not need to enter any keywords into a search engine. Our new GCD algorithm, which is adapted from the NGD algorithm, achieves high accuracy, performs well, and remains consistent, so that the Google search engine can be used as a large semantic corpus to calculate the importance of each keyword pair. We combined the GCD algorithm with the PageRank algorithm, and finally we used the AdHoc network's Hop-by-Hop algorithm to find the most important sequence. Our experiments show that the extracted 4-gram sequences can be used to obtain the most relevant information from the Internet with high accuracy. The precision of our system was nearly 90%, and the recall rates averaged 65%. When the extracted sequences of keywords were used, most of the documents could be found in the Top-1 Google search results, which means that the keyword sequences are representative and highly accurate. All executions of our new system are on-line and real-time, and the system can be used in every knowledge domain because it requires no pre-collected data or training process. We believe that this method can help users who browse different types of Web pages or read articles in various knowledge domains. Additionally, our system does not need any database to save user logs, because our approach works just like collaborative recommendation
and because it uses Google information to provide suggestions. Thus, we are trying to implement this system in a browser, and we hope that one day it can be used on a mobile device or an e-book reading device such as the Kindle. Our system is based on the precondition that when users see something they do not know or want to know more about, they may search for further relevant information. Even if a user has never browsed related articles before, the system will still work, because it does not rely on any pre-constructed user profiles.