Available online at www.sciencedirect.com
ScienceDirect Procedia Technology 24 (2016) 1256 – 1262
International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015)
User Search Goals Evaluation with Feedback Sessions Febna V a, * , Anish Abraham a a
Government Engineering College Thrissur, Kerala, 680009, India.
Abstract In today’s e-world, search engines play a vital role in retrieving and organizing relevant data for various purposes. Different methods are used to find user search goals. Personalization is the process of finding exact needs of a user using different representations and machine learning techniques. These methods exploit feedback sessions and bipartite graphs, along with machine learning techniques such as clustering, classification and Apriori algorithms. This paper proposes a variant of feedback session method for inferring user search goals, where bag of words approach is employed for representation. K-Medoid clustering algorithm is used to derive the cluster for the keywords entered by the user. The performance improvement can be evaluated by using evaluation measures like Average Precision (AP), Voted Average Precision (VAP) and Classified Average Precision (CAP). © byby Elsevier Ltd.Ltd. This is an open access article under the CC BY-NC-ND license © 2016 2016The TheAuthors. Authors.Published Published Elsevier (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of ICETEST – 2015. Peer-review under responsibility of the organizing committee of ICETEST – 2015 Keywords: Feedback Sessions; Pseudo-documents; CAP; K-Medoid.
1. Introduction Information Retrieval is the activity of obtaining resources relevant to information from a collection of resources. Searches can be based on full text indexing. User search goals are the information on different aspects of a query that user group wants to obtain. When a user submits a query, different users may want to get different information. Fig. 1 shows an example for an Ambiguous query. Nowadays, evaluating user search goals is an indispensable part of most of the search engines. Various
* Corresponding author. Tel.: +91-703-477-3188. E-mail address:
[email protected]
2212-0173 © 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of ICETEST – 2015 doi:10.1016/j.protcy.2016.05.107
V. Febna and Anish Abraham / Procedia Technology 24 (2016) 1256 – 1262
1257
identifiable works have been developed and works are still upcoming focusing this area. According to Lu, Zheng et al. (2013) [1], user search goals analysis can be classified into three classes: query classification, search result reorganization, and session boundary detection. For every access of a Web page, a Web server usually registers a web log entry; it includes the requested URL and a timestamp [2, 3, 6]. Then, construct a feedback session based on the Weblog records. Because Weblog data provide information about what kind of users will access what kind of Web pages. The inference and analysis of user search goals can have a lot of advantages for improving search engine relevance and user experience. Some of the advantages are summarized as follows. Initially, restructure web search results according to user search goals by grouping the search results with the same search goal; thus, users with different search goals can easily find what they want. Second, user search goals represented by some keywords can be utilized in query recommendation. Thus, the distributions of user search goals can be useful in applications such as re-ranking web search results that contain different user search goals [5].
Fig. 1. Result shown for an Ambiguous query
1.1. Organization of the paper The rest of the paper is organized as follows: The literature survey related to the proposed method is presented in Section 2. The proposed method is described in Section 3 and which includes several subsections namely Feedback Sessions, User search goals by clustering pseudo-documents and finally Evaluation methods. Section 4 concludes the paper. 2. Related Works According to Susan gauch et al. (2005), personalization is the process of representing right information to right user. User feedbacks like preferences and rating are the method used for the analysis of explicit method. Whereas in implicit method, analysis is by observing the user behaviours like how much time spent for reading an online document. Browsing histories are used for the re-ranking purpose [15]. Another method for user search goals evaluation introduced by Patnaik L M et al. (2013) improves response time and throughput of search engine with web caching. Crawl based web can be divided into four stages namely data acquisition, mining and pre-processing, index construction and query processing. In this paper, a cache search
1258
V. Febna and Anish Abraham / Procedia Technology 24 (2016) 1256 – 1262
algorithm can be applied on the top of the search engine and the implement a cache search engine. Cache memory is refreshed in every 15 days. The major advantages of web caching is as follows; Faster delivery of web objects to the end user, it reduces the bandwidth cost and needs, it benefits the user, the service provider and the website owner and also reduces the load on the website servers [13]. Farida Achemoukh, Rachid Ahmed Ouamer et al. (2013) proposes a method for information retrieval based on Bayesian Approach. A Bayesian network is modeled by combining query and user document to infer user profile. This probabilistic model is used to evaluate the user relevance [21]. Another method is an efficient searching for web recipes, which is introduced by Martin Junghans, Sudhir Agarwal (2013). Here semantic search is used for query evaluation. Web browsing recipes is an alternative technique for information retrieval and which is a goal-oriented approach. The proposed method can be applicable on structured data [22]. Yet another contribution in user search goal analysis was by Zheng Lu, Xiaokang Yang (2002) focusing on combining image visual information with the click session information. Here is to propose a Classification Risk based approach for automatically selecting the optimal number of search goals [23]. User search goals analysis by using customer feedback and customer behavior by Nandhakumar C et al. (2014) was also a notable work. Apriori algorithm is used to find out the frequently used product [24]. Gaurav Agarwal et al. (2012), propose a heuristic based AGE algorithm for search engine. A heuristic based AGE algorithm worked on the basis of three parameters namely density of keywords, number of successors to the node and the age of the web page. This algorithm is compared with Google’s page rank algorithm. Here the evaluation results show that the proposed method is better than Google search [14]. N. N Das et al. (2013) explains the technique of search engine optimization and also explains the basic working principle of search engine with the help of six activities such as crawling, indexing, processing, calculating, relevancy and retrieving. Major concepts in page ranking are the number of visitors in a particular page. This paper explains how page rank algorithm ranks a web page [16]. Another contribution by Rushikesh M. Shete, V. S. Gulhane (2013) is an enhanced web search engine based on user profiles and click through patterns. Here is to propose a web graph mining method. The directed links between the pages of WWW are described by web graph. Filtering techniques are used to represent users’ interest, also use collaborative filtering, query suggestion, image recommendation and click through data analysis [17]. Jagroop Kaur et al. (2014), use ranking algorithm for search engine optimization. White hat technique is used to promote search engines natural listing and Black hat technique is for promote search engine result list. Black hat technique breaks search engine rules and regulations, which is an optimization technique in unethical manner. This paper discusses onsite and offsite optimization techniques, ranking algorithm and search engine process. Also mention about Google’s PANDA and PENGUIN techniques. Here IP address is the security factor and proposed works are focused on re-ranking algorithms [18]. Mandar Mokashi et al. (2013), introduce a bipartite graph for representing user interest for different keywords. Bipartite graph is made up of two disjoint sets A and B for representing different keywords and corresponding URLs in the graph respectively. Bisecting K-means algorithm is used for clustering, which is better than K-means. K-means clustering is not applicable for large amount of log data [19]. Personalization is the process of representing exact need of a user. The above mentioned various works focused on personalization techniques. Users search history is a major contribution for document ranking. Bipartite graph is an efficient method for representing user search history corresponds to different keywords. Ranking algorithm is the most common for document re-ranking. Web caching method is useful to improve throughput.
1259
V. Febna and Anish Abraham / Procedia Technology 24 (2016) 1256 – 1262
3. User Search Goals Evaluation with Feedback Sessions Search goals evaluation is an indispensible part of most of the data mining techniques. The previous section handled with various related works. By analyzing those works, here is to propose a new evaluation method namely “User Search Goals Evaluation with Feedback Sessions”. The proposed method use K-Medoid clustering algorithm for clustering requested URLs. This algorithm is more efficient than K-means clustering with respect to noise reduction. Bag of words representation is used to analyses the rank of clicked URLs. The proposed method handles two techniques for retrieving requested data. Initially, for a query feedback sessions are used for data analysis and those results are saved in the database. For later request for the same query, normal database mining process is used. By calculating the retrieval time, we can identify whether the data retrieved either by creating feedback sessions or by using database mining. Changing parameter value like number of cluster is also gives better performance. The detailed description for the proposed method is handled in the following sub titles. In this paper, focuses on discovering the different user search goals for a query and relate each goal with some keywords. For the same, first gives an approach to deduce user search goals for a query by clustering the feedback sessions. The feedback session is defined as the series of both clicked and unclicked URLs, ends with the last URL that was clicked in a session from user click-through logs [3, 4]. Optimization method can be used to map feedback sessions to pseudo-documents which can efficiently reflect user information needs. Finally, cluster these pseudodocuments to conclude user search goals and relate them with some keywords. To overcome the difficulties of evaluation, Classified Average Precision (CAP) is used to evaluate the performance. Also demonstrate that the given evaluation criterion can help us to optimize the parameters in the clustering methods [2, 3].
User
Enter Query
Database
Data
No
Default Search (Bing Search) Feedback data formation Generate Pseudo-documents
Yes
K-Medoid Clustering Algorithm Map similarity and Clustered Results
Restructure Search Results
Fig. 2. System Architecture
CAP Evaluation
1260
V. Febna and Anish Abraham / Procedia Technology 24 (2016) 1256 – 1262
As shown in Fig. 2, if the feedback data is not present in the database for query, then using Bing Search API which produces results same as that of Google. When for a query gets user’s feedback then map those feedback data to pseudo-document. Finally K-Medoid clustering algorithm used for pseudo-document clustering [7, 8, 9]. Then map the similarity between the documents by using cosine similarity and Euclidian distance. Finally re-ranking links such that most visited links should appear at the top of the list [1]. 3.1. Feedback Session Web search results are a series of successive queries to satisfy a single information need and some clicked search results. Therefore, the single session containing only one query is introduced, which distinguishes from the conventional session. Feedback Sessions are the URLs list obtained after a query submission. Direct representation of feedback session is not suitable. So some representations like Bag of words representation is used to represent feedback sessions more efficiently. The proposed method gives a method to map feedback sessions to pseudodocuments, which is a converted format of a typical document. Pseudo-documents can be built in two steps, they are the following [10, 11, 20]: x URL Representation. In the first step, each URL in a feedback session is represented by a small text paragraph that consists of its titles and snippet. Then, some textual processes are implemented to those text paragraphs, such as transforming all the letters to lowercase, stemming and removing stop words. Finally, each URLs titles and snippet are represented by a Term Frequency-Inverse Document Frequency (TF-IDF) vector. URLs titles and snippets have different significances and represent each URL by the weighted sum of titles and snippets. x Pseudo-document Construction. In order to obtain the feature representation of a feedback session, apply an optimization method to combine both clicked and unclicked URLs in the feedback session. Feature representation of a feedback session is constructed such a way that the distance between the feedback session and clicked URL is minimized and the distance between the feedback session and unclicked URL is maximized. Each and every term is independent and hence performs optimizations independently. 3.2. User Search Goals by Clustering Pseudo-documents User search goals can be inferred by the given pseudo-documents. In this section, describe how to infer user search goals and depict them with some meaningful keywords. Titles and snippets extracted from the URLs are represented as an n-dimensional vector namely pseudo-documents. Similarity between two pseudo-document is computed as the cosine score of the feature representations, F fs and F fs j . i
Simi , j
cos( Ffsi , Ffs j )
(1)
1 Simi , j
(2)
And the distance between the feedback sessions is,
Disi , j
K-Medoid clustering is used to cluster pseudo-documents, which is simple and effective. Since the exact number of user search goals is not known for each query. The optimal value will be determined through certain evaluation criterion. Each cluster can be considered as one user search goal, after clustering all the pseudo-documents. The centre point of a cluster is calculated as the average of the vectors of all the pseudo-documents in the cluster, as shown in, ci
Fcenteri
¦F k 1
fsk
y Ci , ( Ffsk Clusteri )
(3)
V. Febna and Anish Abraham / Procedia Technology 24 (2016) 1256 – 1262
Where
1261
Fcenteri is the feature representation of the i th cluster’s center and C i is the number of the pseudoth
documents in the i cluster. Finally, the term with the highest values in the centre points are used as the keywords to depict user search goals. 3.3. Evaluation User search goals evaluation is a big problem. Therefore, an evaluation method based on restructuring web search results is used to evaluate whether user search goals are inferred properly or not. Here is to apply Classified Average Precision to evaluate the restructured results, which is also helpful to select the best cluster number [12]. Average precision (AP) is not applicable when the number of clicks in the two classes is same; hence it is replaced with VAP. A good restructuring of search results should have higher VAP. If each URL in the click session is categorized into one class, VAP will always be the highest value namely 1 no matter whether users have so many search goals or not. Therefore, there should be a risk to avoid classifying search results into too many classes by error. Risk as follows, m
Risk
¦
i , j 1( i j )
di , j y Cm2
(4)
It calculates the normalized number of clicked URL pairs that are not in the same class, where m is the number of the clicked URLs. If the pair of the i
th
clicked URL and the
j th clicked URL is not categorized into one class, di , j
will be 1, otherwise it will be 0. VAP can be further extend by introducing the above Risk and propose Classified AP, as shown below,
CAP VAP *(1 Risk )J
(5)
CAP selects the AP of the class that users is interested in and takes the risk of wrong classification and is used to adjust the influence of Risk on CAP. This can be learned from training data. Finally, CAP is used to evaluate the performance of restructuring search results. Parallel calculation of data retrieval time along with data retrieval is useful to analyse whether data is retrieved either by using feedback session or by using database mining. So it is sufficient to evaluate the performance of the proposed work. 4. Conclusion In this paper, different techniques for inferring user search goals for a query are discussed. Specifically user search goals for a query by clustering its feedback session are explained in detail. A variant of this method is proposed. Binary vector representation is not informative enough to tell the contents of user search goals. Bag of words representation has been used to represent the feedback sessions instead of binary vectors. When the number of cluster is high, K-Medoid clustering algorithm is being used to reduce the noise. Thus, hybrid approach of bag of words representation and K-Medoid clustering algorithm will improve the process of finding user search goals. References [1] Lu, Zheng, et al. "A new algorithm for inferring user search goals with feedback sessions." Knowledge and Data Engineering, IEEE Transactions on 25.3 (2013): 502-513. [2] Baeza-Yates, R. "Hurt ado, C., and Mendoza, M. “Query recommendation using query logs in search engine”. In proc. of int." Workshop On clustering information over the web, Crete, Springer. 2004. [3] Huang, ChienǦKang, LeeǦFeng Chien, and YenǦJen Oyang. "Relevant term suggestion in interactive web search based on contextual information in query session logs." Journal of the American Society for Information Science and Technology 54.7 (2003): 638-649. [4] Joachims, Thorsten. "Evaluating Retrieval Performance Using Clickthrough Data." (2003): 79-96. [5] Dou, Zhicheng, Ruihua Song, and Ji-Rong Wen. "A large-scale evaluation and analysis of personalized search strategies". Proceedings of the 16th international conference on World Wide Web. ACM, 2007.
1262
V. Febna and Anish Abraham / Procedia Technology 24 (2016) 1256 – 1262 [6] Qiu, Feng, and Junghoo Cho. "Automatic identification of user interest for personalized search". Proceedings of the 15th international conference on World Wide Web. ACM, 2006. [7] Shah, Saurabh, and Manmohan Singh. "Comparison of a time efficient modified K-mean algorithm with K-mean and K-medoid algorithm". Communication Systems and Network Technologies (CSNT), 2012 International Conference on. IEEE, 2012. [8] Aljoby, Walid, and Khaled Alenezi. "Parallelization of K-medoid clustering algorithm". Information and Communication Technology for the Muslim World (ICT4M), 2013 5th International Conference on. IEEE, 2013. [9] Mishra, Debahuti, and Saroj Hiranwal. "Analysis & implementation of item based collaboration filtering using K-Medoid". Advances in Engineering and Technology Research (ICAETR), 2014 International Conference on. IEEE, 2014. [10] Wang, Xuanhui, and ChengXiang Zhai. "Learn from web search logs to organize search results." Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2007. [11] Rohini B.Mothe, V.S.Deshmukh. “A Noval Approach to Cluster Search Result based on Search Goals”. International Journal of Computer Applications. [12] Bhuvaneswari.S, Shobana.P,Vaishnavi.V,Ramya.P. “Classified Average Precision for Retrieving Information using Feedback sessions”. International journal of Advanced Research in Computer Science Engineering and Information Technology, Volume:3, NO:2321-3337, March -2014. [13] Pushpa C N, Thriveni J, Venugopal K R, Patnaik L M. “Improving Response Time and Throughput of Search Engine with Web Caching”. International Journal of Engineering Research & Technology (IJERT) Vol. 1 Issue 4, June - 2012 ISSN: 2278-0181. [14] Harshita Bhardwaj, Gaurav Agarwal, Chanchal Dhaka. “A Heuristic Based AGE Algorithm For Search Engine”. International Journal of Engineering Research & Technology (IJERT) Vol. 1 Issue 8, October - 2012 ISSN: 2278-0181. [15] Mirco Speretta, Susan Gauch. “Personalized search based on user search histories”. Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05) 0-7695-2415-X/05 $20.00 © 2005 IEEE. [16] N. N. Das, Shivani Gupta, “An Algorithm For Ranking The Web Pages Of Search Engine”. International Journal of Engineering Research & Technology (IJERT)ISSN: 2278-0181.Vol. 2 Issue 7, July – 2013. [17] Rushikesh M. Shete, V. S. Gulhane. “An Enhanced Web Graph Search Engine Based on User Profiles and Clickthrough Patterns”. International Journal of Engineering Research & Technology (IJERT).Vol. 2 Issue 12, December – 2013. ISSN: 2278-0181. [18] Navkirandeep Kaur, Jagroop Kaur. “Development of a Ranking Algorithm for Search Engine Optimization”. International Journal of Engineering Research & Technology (IJERT). ISSN: 2278-0181. Vol. 3 Issue 4, April – 2014. [19] Harshada P. Bhambure, Mandar Mokashi. “Inferring User Search Goals Using Feedback Session”. International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013): 6.14 | Impact Factor (2013): 4.438. [20] Gawade, Sukanya S., and Gyankamal J. Chhajed. "Using Feedback Sessions for Inferring User Search Goals". Published by International Journal of Advanced Research in Computer and Communication Engineering (2014). [21] Farida Achemoukh, Rachid Ahmed-Ouamer. “Exploitation of User Profile in the Information Retrieval based on Bayesian Approach”. 9781-4799-3392-1/13/$31.00 ©2013 IEEE. [22] Martin Junghans, Sudhir Agarwal. “Efficient Search for Web Browsing Recipes”. 2013 IEEE 20th International Conference on Web Services. [23] Zheng Lu, Xiaokang Yang, Senior Member, IEEE Weiyao Lin, Hongyuan Zha, and Xiaolin Chen. “Inferring User Image-Search Goals Under the Implicit Guidance of Users”. JOURNAL OF LATEX CLASS FILES, VOL. 1, NO. 8, AUGUST 2002. [24] Nandhakumar C, Gayathri A, Gokulavani M, Santhamani V. “Inferring User Goals Using Customer Feedback and Analyzing Customer Behavior”. International Journal of Computer Applications Technology and Research Volume 3– Issue 2, 125 - 129, 2014, ISSN: 23198656.