Accessing Relevant and Accurate Information using Entropy


Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 54 (2015) 449 – 455

Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015)

Sarowar Kumar∗, Kumar Abhishek and M. P. Singh

Department of Computer Science, National Institute of Technology, Patna, India

Abstract

Keyword-based search engines generally return a result set containing a large number of web pages, most of them irrelevant. The World Wide Web is a large collection of web/hypertext documents, so effort is required to provide relevant data for a given query with low response time. This paper implements the semantic-synaptic web mining algorithm and compares its results with those of other existing algorithms. The algorithm focuses on the use of entropy to find accurate results for any given query.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015).

Keywords:

Clustering; Entropy; Information content; Meta tag; Semantic web; Synaptic web.

1. Introduction

The Internet is growing exponentially in terms of users, servers, and data. The last three years in particular have witnessed a huge increase in the volume of data on the Internet, along with the emergence of the Internet of Things and Big Data. Retrieval processes are still unable to fetch only relevant information: the search result set contains unwanted documents or links. Web mining, a subset of data mining, is a handy technique for improving the retrieval process; it focuses on extracting desired information from the web. Web mining is classified into three categories based on the web data involved:

a. Web Content Mining
b. Web Usage Mining
c. Web Structure Mining

1.1 Web content mining

Web Content Mining deals with the extraction of useful information from the different types of data present on the web.

1.2 Web usage mining

Web Usage Mining deals with the analysis of server logs, web cookies, metadata, etc. to track users' interests.

∗ Corresponding author.

E-mail address: [email protected]

1877-0509 © 2015 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015) doi:10.1016/j.procs.2015.06.052


Sarowar Kumar et al. / Procedia Computer Science 54 (2015) 449 – 455

1.3 Web structure mining (WSM)

Web Structure Mining makes use of graph theory to analyze the hyperlink structure of the web, based on the hyperlink topology. It also categorizes websites and the links between them.

This paper implements an algorithm proposed in [4], which focuses on providing relevant data to users. The remainder of the paper is organized as follows: Section 3 describes the related work, and Section 4 explains the implementation of the algorithm.

2. Motivation

The motivation for this research arises from the need to improve the result set of a search engine. The concept of entropy evolved in thermodynamics, but today it is broadly used in information science. This paper uses entropy to find a relevant result set: the entropy value of a document is based on the probability mass function and the information content, and this value gives a more suitable result for the user's query according to the entropy concepts studied in [4, 18]. The idea was proposed as semantic-synaptic web mining in [4]; this paper implements that idea, which is based on the entropy concept, and combines it with page rank to enhance the present search result set.

3. Related Work

The concept of web mining was first introduced by Oren Etzioni [9]. R. Kosala et al. [12] divided web mining into three categories according to the type of data mined. The author of [15] claims that the data present on the web is of three types, namely web content mining, web structure mining and web usage mining. The author of [8] categorized the data on the web into four types, namely structure data, content data, profile data and usage data. The author of [22] categorizes web mining into three areas, namely web text mining, web usage mining and user modeling mining. To improve the efficiency of web mining techniques, some researchers have merged structure mining and content mining, as described in [7, 11, 19].
A study of several research works related to web mining shows that most researchers do not agree with the above web mining classifications. Today, the most popular categories of web mining are web content mining, web structure mining and web usage mining. In 2001, Tim Berners-Lee introduced the ontological approach, which increases the machine-understandability of web data and is known as the semantic web [13]. In 2006, Tim Berners-Lee came up with the principles of linked data [6] and also described how to use standard web technologies to connect data across the web. In 2014, the authors of [3] proposed an idea for improving the accuracy and relevancy of web page data: a semantic-synaptic web mining model that combines the best ideas of the semantic web and the synaptic web with the low-entropy concept; they also gave the semantic-synaptic web mining architecture. The authors of [4] propose a semantic-synaptic web mining algorithm, in a paper titled “Entropy Measurement and Algorithm for Semantic-Synaptic Web Mining”, which is based on the entropy value but comes with no implementation proof.

Semantic web mining [13] can be described as an improved version of web mining [23]. The semantic web increases the efficiency and quality of web mining: it adds meaning to the existing information for better co-operation or results, and these semantics help in machine learning. The concept of the semantic web was first introduced by T. Berners-Lee [13] to give machines a better understanding of web information; according to Berners-Lee, the multi-layer architecture of the semantic web is useful for applying semantic concepts to different kinds of web mining. The semantic web plays a very important role in improving the first-generation World Wide Web (WWW). The architecture of the semantic web stack is shown in Fig. 1.
The synaptic web [5] establishes relations between objects, where the objects may be either content or information. In the synaptic web, the capability of filtering is much more important than search functionality. In other words, it can be said


Fig. 1. The semantic web stack [2].

Fig. 2. Contribution of synaptic web in the present scenario [3].

that it is a next-generation concept for handling and solving issues related to document search. The contribution of the synaptic web in the present scenario is shown in Fig. 2. The web is referred to today as a web of data, and the synaptic web tries to interlink this web of data.

Semantic-synaptic web mining [4] combines the concepts of the semantic web and the synaptic web; it is a web mining technique that uses the low-entropy concept. If web content has a lower entropy value, the probability of similarity between the web content is higher, which gives more accurate and relevant data over the web. The paper [4] also gives a semantic-synaptic web mining architecture.

The concept of entropy based on information theory was introduced by Shannon and is also known as Shannon entropy [20]. It is a measure of the uncertainty (inconsistency, lack of structure, etc.) of a random variable. From it, the information content of a web page can be obtained. The information content of a document tends to give efficient results. A web page with low information content has the least unstructured data; in other words, the page has the lowest uncertainty, which means that it leads towards relevancy. If a web page is more relevant and accurate, then it has low information content.

According to information theory, entropy is the expected (average) value of the information content I(x_i) associated with the random variable X. The entropy of X is a function of P(x_i) = Pr[X = x_i], x_i ∈ X, where i ∈ [1, N], and is given by

$$E(X) = \sum_{i=1}^{N} P(x_i)\, I(x_i) \tag{1}$$

or, equivalently [26],

$$E(X) = \sum_{i=1}^{N} P(x_i) \log \frac{1}{P(x_i)} = -\sum_{i=1}^{N} P(x_i) \log\big(P(x_i)\big) \tag{2}$$
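As a small illustration of Eqs. (1) and (2), the following Python sketch computes the entropy of a discrete distribution. It is not taken from the paper: the function names are illustrative, and the base-2 logarithm is an assumption, since the paper does not state which base it uses.

```python
import math


def information_content(p):
    """I(x) = log(1 / P(x)) = -log P(x); base 2 is assumed here."""
    return -math.log2(p)


def entropy(probs):
    """Eq. (1): E(X) = sum_i P(x_i) * I(x_i).

    Outcomes with zero probability are skipped, since they contribute
    nothing to the sum.
    """
    return sum(p * information_content(p) for p in probs if p > 0)


# Hypothetical distributions, used only to exercise the function.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))  # 2.0 bits: four equally likely outcomes
print(entropy(skewed))   # lower than the uniform case (less uncertainty)
```

The skewed distribution has the lower entropy, which matches the paper's reading of low entropy as low uncertainty.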

The negative sign in Eq. (2) ensures that the entropy value is never negative: each probability lies between 0 and 1, so its logarithm is at most zero.

3.0.1 Method for measuring the entropy

For entropy calculation, the information content method of Resnik [18] is used. The steps of this method are as follows.

Step 1: The web pages are first clustered, and the frequency of each word belonging to the concept c is measured. Mathematically,

$$\mathrm{Frequency}(c) = \sum_{n \in \mathrm{words}(c)} \mathrm{count}(n) \tag{3}$$

where words(c) is the set of words having the concept c [16, 18].


Step 2: With the frequency of the concept measured in Step 1, the probability of the words of a web page having the concept c_i is computed as

$$p(c_i) = \frac{\mathrm{Frequency}(c_i)}{N} \tag{4}$$

where N is the total number of words present in the web page.

Step 3: With the probability from Step 2, the information content of the concept is calculated as

$$I(c_i) = -\log\big(p(c_i)\big) \tag{5}$$
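Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' PHP implementation: the concept-to-words table `CONCEPT_WORDS` is a hypothetical stand-in for a taxonomy such as WordNet, the logarithm is taken in base 2, and the page entropy is obtained by applying Eq. (1) to the concepts, which is one plausible reading of the method.

```python
import math
from collections import Counter

# Hypothetical concept -> words table (assumption; stands in for a taxonomy).
CONCEPT_WORDS = {
    "entropy": {"entropy", "uncertainty", "information"},
    "web": {"web", "page", "hyperlink"},
}


def concept_frequency(tokens, concept_words):
    """Eq. (3): Frequency(c) = sum of count(n) over all n in words(c)."""
    counts = Counter(tokens)
    return {c: sum(counts[w] for w in words) for c, words in concept_words.items()}


def concept_probability(freq, total_words):
    """Eq. (4): p(c_i) = Frequency(c_i) / N, with N the number of words on the page."""
    return {c: f / total_words for c, f in freq.items()}


def concept_information_content(prob):
    """Eq. (5): I(c_i) = -log p(c_i); concepts that never occur are skipped."""
    return {c: -math.log2(p) for c, p in prob.items() if p > 0}


def page_entropy(tokens, concept_words):
    """Eq. (1) applied to concepts: E = sum_i p(c_i) * I(c_i)."""
    if not tokens:
        return 0.0
    freq = concept_frequency(tokens, concept_words)
    prob = concept_probability(freq, len(tokens))
    ic = concept_information_content(prob)
    return sum(prob[c] * ic[c] for c in ic)


# Toy page of 8 words: each concept matches 2 of them, so p(c_i) = 0.25.
tokens = "entropy measures the uncertainty of a web page".split()
freq = concept_frequency(tokens, CONCEPT_WORDS)
prob = concept_probability(freq, len(tokens))
print(freq, prob, page_entropy(tokens, CONCEPT_WORDS))
```

In the toy page each concept has probability 0.25 and information content 2 bits, so the two concepts together contribute an entropy of 1 bit.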

Using the above steps, the entropy value of each web page is calculated, and the web pages are arranged in order of increasing entropy.

Algorithm for semantic-synaptic web mining using page entropy [4]

The algorithm consists of two phases: the first phase clusters the web pages and, once clustering is done, the entropy of the web pages is calculated. The semantic-synaptic web mining algorithm is shown below.

Algorithm 1. Semantic-Synaptic Web Mining [4]

Et0 denotes the web page with the least entropy value, placed as the root of the hierarchy, and Etn denotes the web page with the highest entropy value. The hierarchy is given in Fig. 3. In Fig. 4, A is the root with Et0; nodes E, F, G, H, I, J, K, L and M are leaf nodes with Etn; and B, C and D are pages at the adjacent level i.

4. Implementation

To validate the semantic-synaptic web mining algorithm [4], a database of approximately 125,000 unique web pages was built using a web crawler application, available for all major operating systems. A search-engine-like web application is used to show the results of the entropy-based algorithm [4]. Given a user query, the search engine first clusters the web pages using keywords, then calculates the entropy of each web page in the cluster, and finally sorts the results.


Fig. 3. Semantic-synaptic web mining architecture [4].

4.1 Clustering

Clustering of web pages [10, 21] is the first phase of the algorithm proposed in [4]. Web page clustering groups together web pages that have similar content or similar meaning, or that are related to each other in terms of content or functionality. The web pages in a cluster are strongly coupled and are treated as a single type of item. Clustering is especially suitable when the volume of web page information obtained from the server or the Internet is very large, because it groups together the web pages that share some relationship.

4.1.1 Important web page clustering algorithms

1. k-means analysis [14]
2. hierarchical clustering
3. nearest neighbor clustering
4. Clustering based on keywords [17]

In this paper, clustering based on keywords is used to implement the semantic-synaptic web mining algorithm and to show the relevancy of the results.

4.2 Clustering of web page

A keyword-based clustering technique is used for clustering the web pages. The input is taken from the user and stop words (e.g., in, the, a, an, at) are removed. The user's keywords and their synonyms (obtained from the WordNet 2.1 [25] dictionary) are then compared with the keywords of the web pages present in the database. The matching results are grouped, and the URL and title of the corresponding web pages are returned for the next step, i.e., entropy calculation. The clustered result is shown in Fig. 4.

4.3 Calculation of web page entropy

The entropy of a web page is calculated using the formula given in the section above, as proposed by Resnik [18]. The entropy of a web page gives an idea of the inconsistency of the words present on the page. The entropy of every page present in the taxonomy is calculated using a PHP script [1]. The results are arranged in increasing order of entropy value. The output after implementing the semantic-synaptic web mining algorithm is shown in Fig. 5.
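The two phases described in Sections 4.2 and 4.3 (keyword clustering, then ordering by increasing entropy) can be sketched in Python as below. The paper's implementation is a PHP script that is not reproduced in the text, so everything here is an assumption made for illustration: the in-memory `PAGES` store, the tiny `SYNONYMS` table standing in for WordNet 2.1, and `word_entropy`, which is a simplified word-level entropy rather than the concept-level measure of Resnik.

```python
import math
from collections import Counter

STOP_WORDS = {"in", "the", "a", "an", "at"}

# Hypothetical synonym table (the paper looks synonyms up in WordNet 2.1).
SYNONYMS = {"car": {"automobile"}}

# Hypothetical page store: URL -> (title, page keywords, body text).
PAGES = {
    "http://example.org/a": ("Car prices", {"car", "price"}, "car car car price"),
    "http://example.org/b": ("Automobile history", {"automobile", "history"},
                             "automobile history engine wheel"),
    "http://example.org/c": ("Cooking", {"recipe"}, "recipe pasta"),
}


def expand_query(query):
    """Remove stop words, then add the synonyms of each remaining keyword."""
    terms = {w for w in query.lower().split() if w not in STOP_WORDS}
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


def cluster_by_keywords(query, pages):
    """Return (URL, title) for every page sharing a keyword with the query."""
    terms = expand_query(query)
    return [(url, title) for url, (title, keywords, _) in pages.items()
            if terms & keywords]


def word_entropy(text):
    """Simplified stand-in: Shannon entropy (base 2) of the word distribution."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def search(query, pages):
    """Cluster by keywords, then sort the cluster by increasing entropy."""
    cluster = cluster_by_keywords(query, pages)
    return sorted(cluster, key=lambda hit: word_entropy(pages[hit[0]][2]))


results = search("the car", PAGES)
print(results)
```

For the query "the car", the stop word is dropped, "automobile" is added as a synonym, pages a and b form the cluster, and page a is listed first because its word distribution is more concentrated.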


Fig. 4. Screenshot of clustering web page.

Fig. 5. Screenshot of result according to entropy value.

4.4 System implementation

The logic of the code, written as a PHP script [1], is shown below in the form of an algorithm:

Algorithm 2. Code Implementation

4.5 Page rank vs entropy rank

Google uses several factors [24] to rank search results, such as standard IR measures, proximity, anchor text and PageRank. Applying the concept of entropy together with these factors will reduce the result set and make it more relevant. The implementation results in this paper indicate the following advantages of entropy over page rank:

1. Page rank is a single value for every web page, while entropy rank varies according to the query given by the user.
2. Page rank depends on the in-links and out-links of a web page, where out-links are very difficult to calculate, whereas the entropy value depends on the information content [18] of the web page.
3. Page rank is generally high for a website's home page, because most out-links are present on the home page and other websites also generally reference the home page, whereas the entropy value of a home page may be high if the information on the page is irrelevant and low otherwise.
4. Page rank remains unaffected by changes in the information content of a website; it is affected only if the links (in-links and out-links) of the website are modified, whereas the entropy value is directly proportional to the information content of the web page.
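To make points 1, 2 and 4 concrete, the sketch below computes page rank by power iteration on a toy link graph. It is only an illustration: the graph `LINKS`, the damping factor of 0.85 and the normalized formulation are assumptions, not details given in this paper. The function takes only the link structure as input, so neither the user's query nor the text of the pages can change its output.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration page rank on a dict: page -> list of pages it links to.

    Assumes every link target also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among its out-links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page site: both inner pages link back to the home page.
LINKS = {"home": ["about", "news"], "about": ["home"], "news": ["home"]}
ranks = pagerank(LINKS)
print(ranks)
```

In this toy graph the home page ends up with the highest rank because both inner pages link to it, which is consistent with point 3.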


From the above, it can be concluded that if entropy is applied on top of Google PageRank, the search result set will improve in terms of both size and relevancy.

5. Conclusion

This paper implements the semantic-synaptic web mining algorithm, which is based on entropy value and information content. The implementation was carried out on a large data set of approximately 125,000 web pages. The results show that the web pages at the root node, which have low entropy values, generally provide more relevant data than some available search engines that do not use the entropy concept. The results were compared with the first ten results returned by different search engines that do not use the entropy concept. The implementation uses keyword-based search; if implemented with full-text search, its accuracy will improve further.

References

[1] http://www.php.net/.
[2] http://www.w3.org/DesignIssues/diagrams/sweb-stack/2006a.png.
[3] H. K. Azad and Kumar Abhishek, Semantic-synaptic Web Mining: A Novel Model for Improving the Web Mining, IEEE International Conference on Communication Systems and Network Technologies (CSNT-2014), pp. 454–457, (2014).
[4] H. K. Azad and Kumar Abhishek, Entropy Measurement and Algorithm for Semantic-Synaptic Web Mining, IEEE International Conference on Data Mining and Intelligent Computing (ICDMIC-2014).
[5] C. Bizer, T. Heath, K. Idehen and T. Berners-Lee, Linked Data on the Web, Workshop Summary, In Proceedings of the International World Wide Web Conference, LDOW, (2010).
[6] T. Berners-Lee, Linked Data Design Issues, http://www.w3.org/DesignIssues/LinkedData.html, (2006).
[7] S. Chakrabarti, Data Mining for Hypertext: A Tutorial Survey, ACM SIGKDD Explorations, vol. 1, pp. 01–11, (2000).
[8] R. Cooley, The Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, PhD thesis, Department of Computer Science, University of Minnesota, May (2000).
[9] Oren Etzioni, The World Wide Web: Quagmire or Gold Mine, Communications of the ACM, pp. 65–68, (1996).
[10] Wai-Chiu Wong and Ada Wai-Chee Fu, Incremental Document Clustering for Web Page Classification, Chinese University of Hong Kong, pp. 21–38, July (2000).
[11] J. Fürnkranz, Web Structure Mining: Exploiting the Graph Structure of the World Wide Web, Österreichische Gesellschaft für Artificial Intelligence (OGAI), vol. 21, pp. 17–26, (2002).
[12] R. Kosala and H. Blockeel, Web Mining Research: A Survey, SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM, pp. 1–15, (2000).
[13] T. Berners-Lee, J. Hendler and O. Lassila, The Semantic Web, Scientific American, pp. 34–43, May (2001).
[14] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, (1967).
[15] S. K. Madria, S. S. Bhowmick, E. P. Lim, et al., Research Issues in Web Data Mining, In Proceeding Conference, Dawak, pp. 303–312, (1999).
[16] George Miller, Wordnet: An On-Line Lexical Database, International Journal of Lexicography, vol. 3(4), (2001).
[17] Filippo Ricca, Paolo Tonella, Christian Girardi and Emanuele Pianta, An Empirical Study on Keyword-Based Web Site Clustering, Proceedings of the 12th IEEE International Workshop on Program Comprehension, (2004).
[18] P. Resnik, Semantic Similarity in a Taxonomy: An Information-based Measure and its Application to Problems of Ambiguity in Natural Language, Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, July.
[19] F. Sebastiani, Machine Learning in Automated Text Categorization, Tech. Report B4-31, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, (1999).
[20] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, pp. 379–423, 623–656, July, October (1948).
[21] Y. Fu, K. Sandhu and M. Shi, Clustering of Web Users Based on Access Patterns, Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, vol. 1836, pp. 21–38, (2000).
[22] M. Spiliopoulou, Data Mining for the Web, In Proceeding of Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD, pp. 588–589, (1999).
[23] B. Berendt, A. Hotho and G. Stumme, Towards Semantic Web Mining, In I. Horrocks, J. A. Hendler (Eds.), The Semantic Web: ISWC 2002, First International Semantic Web Conference, Proceedings, LNCS, Springer, vol. 2342, pp. 264–278, (2002).
[24] L. Page, S. Brin, R. Motwani and T. Winograd, The Pagerank Citation Ranking: Bringing Order to the Web, Tech. Report, Stanford University, January (1998).
[25] https://wordnet.princeton.edu/
[26] Saeed V. Vaseghi, Information Theory and Probability Models, Advanced Digital Signal Processing and Noise Reduction, Fourth Edition, March (2009). (doi:10.1002/9780470740156.ch3)
