A web page distillation strategy for efficient focused crawling based on optimized Naïve bayes (ONB) classifier


Accepted Manuscript

Title: A Web Page Distillation Strategy for Efficient Focused Crawling Based on Optimized Naïve Bayes (ONB) Classifier
Authors: Ahmed I. Saleh, Arwa E. Abulwafa, Mohammed F. Al Rahmawy
PII: S1568-4946(16)30653-6
DOI: http://dx.doi.org/10.1016/j.asoc.2016.12.028
Reference: ASOC 3967
To appear in: Applied Soft Computing
Received: 15-9-2015; Revised: 1-9-2016; Accepted: 18-12-2016

A Web Page Distillation Strategy for Efficient Focused Crawling Based on Optimized Naïve Bayes (ONB) Classifier Ahmed I. Saleh1, Arwa E. Abulwafa1, and Mohammed F. Al Rahmawy2 1:Dept. of Computer Eng. & Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt 2:Dept. of Computer Science, Faculty of Computers, Mansoura University, Mansoura, Egypt

Graphical Abstract
[Figure: the proposed pipeline. An input web page passes through text extraction, string tokenization, stop-keyword removal, and term stemming to build a page vector space model; domain keywords are picked and disambiguated, then transferred to concepts; a genetic-algorithm-optimized SVM performs outlier rejection before classification.]

Highlights
- An effective modification on the behavior of focused crawlers by adding a domain distiller is proposed.
- The distiller relies on an Optimized Naïve Bayes (ONB) classifier, which combines naïve Bayes (NB) and Support Vector Machines (SVM).
- Word sense disambiguation (WSD) is employed to identify the accurate sense of each domain keyword extracted from the input page.
- Results indicate that the proposed distiller improves the performance of focused crawling.

Abstract
The target of a focused crawler (FC) is to retrieve pages related to a specific domain of interest (DOI). However, FCs may fall into haste if bad links are injected into their crawling queue, causing them to be gradually skewed away from their DOI. This paper introduces an effective modification of the behavior of FCs by adding a domain distiller: before the retrieved page is passed to the indexer, or its links are embedded into the crawling queue, the page must pass through a domain distiller. The proposed domain distiller relies on an Optimized Naïve Bayes (ONB) classifier, which combines naïve Bayes (NB) and Support Vector Machines (SVM). Initially, a genetic algorithm (GA) is used to optimize the soft margins of SVM. Then the optimized SVM is employed to eliminate outliers from the available training examples. Next, the pruned examples are used to train the traditional NB classifier. Moreover, ONB employs word sense disambiguation (WSD) to identify the accurate sense of each domain keyword extracted from the input page. This is accomplished by using a proposed domain ontology, called the Disambiguation Domain Ontology (D2O). ONB has been tested against recent classification techniques. Experimental results have proven the effectiveness of ONB, as it introduces the maximum classification accuracy. Also, results indicate that the proposed distiller improves the performance of focused crawling in terms of crawling harvest rate.


Keywords: Web Page Classification, Focused Crawling, Domain Ontology, Support Vector Machines, Naïve Bayes, Genetic Algorithm.

1. Introduction
Due to the explosive growth of the Internet, millions of web pages are added daily, and searching the web has become a true challenge [1]. Search engines (SEs) are information retrieval systems designed mainly to help users find what they need. A crawler is one of the main components of a SE; it retrieves web pages and passes them to the indexer. In spite of their effectiveness, general-purpose search engines suffer from low precision and recall, the freshness problem, poor retrieval rate, time consumption due to long result lists, and storage problems caused by the huge amount of indexed information. To overcome those problems, more specialized (vertical) search engines (VSEs), also called domain-specific search engines [2], have been introduced. The aim of a VSE is to answer users' queries in a specific domain of interest (DOI); accordingly, VSEs use a special type of crawler called the focused crawler [3]. A VSE offers a good solution to the limitations of general-purpose search engines, as it covers only the portion of the web related to its DOI. Moreover, VSEs can easily provide more precise results and more customized functions. However, building an accurate VSE is also a true challenge, as the web is full of noisy and volatile material.

Focused crawling aims to download only pages related to a specific DOI. The web is divided into a set of interconnected domains, and one (or more) crawlers are allowed to discover each domain. Hence, indexing the web can be done in parallel, and accordingly more web coverage can be achieved. However, the area of focused crawling still has many challenges and unsolved problems. One of the serious problems that harms focused crawling efficiency is the "Haste Problem". It takes place when bad links are injected into the crawling queue of the focused crawler. After retrieving the pages behind those bad links, more bad links are added to the crawling queue, causing the focused crawler to be skewed away from its DOI. The main cause of the haste problem is that traditional focused crawlers rely on estimation: they estimate whether a page is related to the DOI before actually retrieving it. Although recent focused crawlers implement accurate estimation techniques, relevancy estimation is not always accurate. To overcome this problem, web mining techniques can be applied to calculate a true relevancy score based on the actual page contents rather than such inaccurate estimation. To accomplish this aim, a domain distiller can be employed to calculate the relevancy of the page after actually retrieving it. The decision to pass the retrieved page to the search engine's indexer, or to add the page's embedded links to the crawling queue, is then based on the distiller's decision.

Web page classification can be defined as the assignment of a web page to one (or more) predefined classes. It is often posed as a supervised learning problem: a set of training examples is used to train the classifier by setting its classification rules, which can then be applied to classify future examples. Based on the number of employed classes, classification can be binary or multi-class. Binary classification categorizes items into exactly one of two classes, while multi-class classification employs more than two classes.
Web page domain distillers are binary classifiers designed to decide whether an input web page is related to a specific DOI. Several classification techniques can be employed in domain distillers, such as support vector machines (SVM) [4], k-nearest neighbor (KNN) [5], decision trees [6], neural networks [7], and Bayesian classifiers [8]. However, binary classification is still a challenge. It has been found that results returned by search engines are often not what is actually needed. This happens because a query is likely to contain ambiguous (polysemous) words, which have multiple meanings. Word Sense Disambiguation (WSD) [9], which tries to assign a unique sense to a word, is an important area of NLP [10]. Generally, WSD can be employed to promote classification performance, as it can effectively resolve classifier confusion. Several techniques can be used to implement WSD. One technique is based on the collocation of other words, in which nearby words provide consistent clues to the sense of a target word [9]. Another technique is word sense based on discourse, in which the sense is consistent within any given document.

The originality of this paper lies in introducing a new architecture for focused crawling by integrating evidence from machine learning and web mining. The paper introduces an effective modification on the behavior of focused crawlers by employing a domain distiller to decide whether a retrieved page is related to the crawler's DOI, and accordingly whether to index the page and add its links to the crawling queue. The proposed domain distiller combines SVM and NB classifiers in a new instance called the Optimized Naïve Bayes (ONB) classifier. Initially, a genetic algorithm (GA) is used to optimize the soft margins of SVM. Then, the optimized SVM is employed to eliminate outliers from the available training examples. Next, the pruned examples are used to train the traditional NB classifier. Furthermore, in order to guarantee an effective classification task, WSD has been implemented to create innovative features that perfectly represent the input page for classification. With the help of WSD, a set of specially selected ambiguous domain keywords, called the "Confusion Set" (CS), is identified; CS is the subset of ambiguous domain keywords that most likely confuse the classifier. Then, the sense of each ambiguous keyword is identified based on a pre-stored collection of discriminative keywords (for each ambiguous keyword), called "Partners". ONB employs a proposed domain ontology for both mapping domain keywords to the corresponding concepts and implementing WSD; hence, it is called the Disambiguation Domain Ontology (D2O). ONB has been tested against recent classification techniques. Experimental results have proven the effectiveness of ONB, as it introduces the maximum classification accuracy. Also, results indicate that the proposed distiller improves the performance of focused crawling in terms of crawling harvest rate.

This paper is organized as follows: section 2 introduces the background and basic concepts; section 3 illustrates the effective focused crawling that combines traditional focused crawling with the domain distiller; section 4 presents previous efforts in the area of web page classification; section 5 introduces the employed disambiguation domain ontology (D2O); section 6 illustrates in detail the proposed ONB classifier; section 7 presents the performance analysis and experimental results; and section 8 summarizes our conclusions.

2. Background and Basic Concepts In this section, an explanation about traditional focused crawlers as well as the crawling haste problem will be introduced. Then, a brief introduction for word sense disambiguation (WSD) will be illustrated.

2.1. Web Search Engines

Search engines are the most popular tools for finding required information on the web. A typical search engine consists of five basic components: (i) a crawler, which may be focused or unfocused, (ii) an indexer, (iii) a database, (iv) a query manager, and (v) a user interface. The user sends his query to the search engine in the form of "search keywords"; the search engine retrieves the pages relevant to the user's query from the database; finally, a ranked list of pages is presented to the user. Search engines rely on crawlers to traverse the web [11]. Crawlers collect pages, pass them to the indexer, and then follow links from one page to another [12]. The indexer, in turn, analyzes each page and stores its features in the database.

2.2. Focused Crawlers and the Haste Problem

A focused crawler [13] is a special type of crawler that retrieves pages related to a specific topic or domain of interest based on both content and link structure [14]. As illustrated in figure 1, the focused crawler operates in five steps. Initially, its priority crawling queue is initialized with a number of seed pages [15], which are highly relevant pages chosen manually. Then, the focused crawler fetches the link located at the head of its priority queue and retrieves the corresponding page. Third, it analyzes the page (using parsers to extract keywords and links). Fourth, the focused crawler assigns a score to each link in the processed page based on several criteria, such as the link's position in the page, the link's anchor window, and/or the page's rank; the extracted links, after scoring, are injected into the crawling queue. Finally, the crawler sorts the links in its queue so that the links with higher scores appear at the queue head and hence are processed first. A focused crawler continues operating as long as its queue has URLs to process. This procedure ensures that the crawler moves toward relevant pages, under the assumption that relevant pages tend to be neighbors of each other. However, focused crawlers are very sensitive to the quality of the seeds initially injected into their queue: falsely chosen seeds will dramatically affect the crawler's performance. Moreover, an inaccurate link scoring strategy badly impacts the crawler's behavior in future crawling cycles, as more and more bad links, extracted from retrieved low-quality pages, are added to the crawling queue. As a result, the crawler may be involuntarily skewed away from its main target of retrieving high-quality pages relevant to a specific domain. We call such a skew the crawling "Haste Problem".
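The five-step loop above maps naturally onto a priority queue. The following Python fragment is only a minimal sketch of that loop, not the paper's implementation; `fetch`, `parse`, and `score_link` are hypothetical callables standing in for page retrieval, parsing, and the link-scoring criteria discussed above.

```python
import heapq

def focused_crawl(seeds, fetch, parse, score_link, max_pages=1000):
    """Minimal focused-crawler loop over a priority queue of (score, url),
    initialized with manually chosen seed pages."""
    # heapq is a min-heap, so scores are negated to pop the best link first
    queue = [(-1.0, url) for url in seeds]
    heapq.heapify(queue)
    visited, results = set(), []
    while queue and len(results) < max_pages:
        _, url = heapq.heappop(queue)           # link at the head of the queue
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                       # retrieve the corresponding page
        keywords, links = parse(page)           # extract keywords and links
        results.append((url, keywords))
        for link in links:                      # score and inject extracted links
            if link not in visited:
                heapq.heappush(queue, (-score_link(page, link), link))
    return results
```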


2.3. Word Sense Disambiguation

Human language is fairly ambiguous: numerous words can be interpreted in several different ways depending on the context in which they occur. Most words in natural languages are polysemous, i.e., they have multiple possible meanings or senses. Word sense disambiguation (WSD) is defined as the task of identifying the appropriate sense of an ambiguous word in a context. In the English language, a variety of ambiguity types exists; table 1 illustrates the most famous ones. WSD typically involves two main tasks: (1) determining the different possible senses (or meanings) of each word, and (2) tagging each word of a text with its appropriate sense. The former task, that is, the precise definition of a sense, is still a challenge within the Natural Language Processing (NLP) community; currently, the most widely used sense repository is WordNet [16]. The second task (tagging each word with its appropriate sense) involves developing a system capable of tagging polysemous words in running text with sense labels. The WSD community classifies these systems into two general categories: knowledge-based and corpus-based. Although both categories build a representation of the examples to be tagged using previously collected information, they differ in the source of this information: knowledge-based methods obtain it from external knowledge sources, such as Machine Readable Dictionaries (MRDs) and/or lexico-semantic ontologies, whereas corpus-based methods gather it from the contexts of previously annotated instances (i.e., examples) of the word.

3. Effective Focused Crawler
Usually, the crawling queue of the focused crawler is initially fed with high-quality pages (seeds) that are highly related to the DOI, and the focused crawler maintains an efficient link weighting strategy. Nevertheless, some of the retrieved pages may be obsolete, as traditional focused crawlers suffer from the "Haste Problem". This happens because the focused crawler is a blind entity: its operation relies mainly on predictions. A focused crawler retrieves a page if it predicts that it is a good page; but what happens if the crawler's prediction is inaccurate? In traditional focused crawling, the links extracted from any retrieved page are weighted and then injected into the crawling queue. If the retrieved page is an irrelevant one (i.e., not related to the domain of interest), bad links are injected into the crawling queue, causing the crawler to skew away from its main target of retrieving high-quality pages related to a specific domain. To work around this problem, as illustrated in figure 2, a domain distiller can be used to guide the crawler's operation and compensate for its blindness.

Hence, before the retrieved page is passed to the indexer, or the page's links are added to the crawling queue, the page is passed to a domain distiller that decides whether it is relevant to the DOI. Only good pages (pages that are highly related to the DOI) are passed to the indexer. The decision here is taken according to the page's actual contents; hence, it is an accurate decision.
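In code terms, the modification amounts to one extra check between page retrieval and link injection. Below is a hedged sketch reusing the shape of the loop from section 2.2; `distiller.is_relevant(page)` and `index(url, keywords)` are hypothetical stand-ins for the proposed ONB-based distiller and the search engine's indexer.

```python
import heapq

def effective_focused_crawl(seeds, fetch, parse, score_link,
                            distiller, index, max_pages=1000):
    """Focused crawler with a domain-distiller gate: a page is indexed and
    its links enter the queue only if the retrieved page itself is judged
    domain-relevant based on its actual contents."""
    queue = [(-1.0, url) for url in seeds]
    heapq.heapify(queue)
    visited, kept = set(), 0
    while queue and kept < max_pages:
        _, url = heapq.heappop(queue)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if not distiller.is_relevant(page):  # decision on actual contents:
            continue                         # do not index, do not inject
                                             # links (avoids the haste problem)
        keywords, links = parse(page)
        index(url, keywords)                 # only good pages reach the indexer
        kept += 1
        for link in links:
            if link not in visited:
                heapq.heappush(queue, (-score_link(page, link), link))
    return kept
```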

4. Related Work
The main contribution of this paper is to enhance the performance of focused crawling by using domain distillers. A distiller is a binary classifier that is initially provided with the knowledge of a specific domain in the form of classification rules; it then uses such pre-stored rules to discover web pages relevant to the crawler's DOI. This section gives a quick review of recent work in the area of web page classification, covering both binary and multi-class classification. SVM has been applied to web page classification in [17], in which the original SVM classifier is combined with the BEV (Bagging Ensemble Variation) algorithm to create a new classifier called VOTEM. A web document is assigned to a sub-category based on voting from all category-to-category classifiers. In this work, a hierarchical classification algorithm starts from the top of the hierarchical tree and proceeds downward recursively until it triggers a stop condition or reaches the leaf nodes. Because imbalanced data decreases the performance of the original SVM classifier, VOTEM is used to provide an improved binary classifier that solves the problem brought by BEV. In [18], a web page classification

method using an SVM based on a weighted voting schema has been proposed. The feature vectors are extracted using both LSA (latent semantic analysis) and WPFS (web page feature selection). LSA extracts common semantic relations between terms and documents and classifies semantically related web pages, while WPFS extracts four text features from the web page content so that the category of a web page can be correctly determined. [19] presented an algorithm based on the Cost-Sensitive Support Vector Machine (CS-SVM) to improve classification accuracy. During the training of CS-SVM, different cost factors are attached to the training errors to generate an optimized hyperplane. Experiments have shown that CS-SVM outperforms SVM on the standard ODP dataset. In [20], the problem of feature selection is highlighted; the aim is to find a subset of features for optimal classification. A critical part of feature selection is to rank features according to their importance for classification. [26] developed a new feature scaling method, called class-dependent-feature-weighting (CDFW), using a naive Bayes (NB) classifier; the resulting method, called CDFW-NB-RFE, combines CDFW and recursive feature elimination (RFE). In [21], an auxiliary feature method is proposed. It determines features by an existing feature selection method and selects an auxiliary feature that can reclassify the text space with respect to the chosen features; the corresponding conditional probability is then adjusted in order to improve classification accuracy. In [22], a hybrid of k-nearest neighbor (KNN) and SVM classifiers for multiclass classification of gene expression data has been introduced. This hybrid classifier, called HKNNSVM, uses KNN to prune training samples and uses SVM to classify samples. Compared with SVM and KNN, the misclassification rate of HKNNSVM was lower, indicating that its classification performance is stable. [23] proposes a discriminant analysis method for the categorization of text documents. It categorizes text by finding coordinate transformations that reflect similarity in the data using generalized singular value decomposition (GSVD); however, the cost of classification is extremely high in document analysis. In [24], a statistical re-examination of five text categorization methods, namely KNN, SVM, NN, NB, and LLSF (Linear Least Squares Fit), has been introduced. Among them, SVM, KNN, and LLSF outperform NB and NN when the number of positive training examples per category is small. In [25], a hybrid algorithm based on the Variable Precision Rough Set (VPRS) is proposed, which combines the strengths of the KNN and Rocchio techniques to overcome their weaknesses. First, the feature space of the training data is partitioned using VPRS, and lower and upper approximations of each category are defined. Then a KNN classifier and two Rocchio classifiers are built on these new subspaces. The two Rocchio classifiers classify most new documents effectively and efficiently, while KNN finds the nearest neighbors of a new document within a subset of the training dataset, which obviously saves time compared with searching the whole training dataset. Experimental results indicate that the proposed hybrid algorithm achieves significant performance improvement.
[26] proposed a model for text categorization that concentrates on the underlying meaning of words in their context (i.e., on learning the meaning of words and on identifying and distinguishing between different contexts of word usage). This model can be summarized in the following steps: first, it maps each word in a text document to explicit concepts; then, it learns classification rules using the newly acquired information; finally, it interleaves the two steps using a latent variable model. The model combines natural language processing techniques, such as word sense disambiguation and part-of-speech tagging, with statistical learning techniques such as Naïve Bayes, in order to improve classification accuracy and achieve robustness with respect to language variations. In [27], a new text classifier is proposed by integrating the nearest neighbor (NN) and SVM algorithms. The proposed SVM-NN approach aims to reduce the impact of parameters on classification accuracy. In the training stage, SVM is used to reduce the training samples of each available class to their support vectors (SVs). The SVs from the different classes are then used as the training data of a nearest neighbor classifier in which the nearest centroid distance function is used instead of the Euclidean function, which reduces time consumption. [28] proposes hybrid classifiers involving various two-classifier and four-classifier combinations for two-level text categorization. It shows that the classification accuracy of the hybrid combination is better than the classification accuracies of all the corresponding single classifiers. The constituent classifiers of the hybrid combination operate on different subspaces obtained by semantic separation of the data; experiments show that dividing a document space into different semantic subspaces increases the efficiency of such hybrid classifier combinations. [29] introduced a new crawler architecture, called Treasure-Crawler (TC). In TC, a new methodology employing specific HTML elements of the input page is used to predict the target topical domain of each unvisited link inside that page; only on-topic pages are then sorted, based on their relevancy to the crawler's domain of interest, for further actual downloads. TC employs a hierarchical structure called the T-Graph, which assigns an appropriate priority score to each unvisited link; these URLs are downloaded later based on their pre-assigned priorities. In [30], an effective focused crawler has been developed,

which is called OntoCrawler. It provides a semantic-level solution offering fast, precise, and stable query results based on ontology-supported website models; hence, OntoCrawler can benefit both user requests and domain semantics. It has been practically applied to the Yahoo and Google search engines to actively search for web pages of related information. Experimental results have shown that OntoCrawler can definitely promote both the precision and recall rates of web page queries.

5. Disambiguation Domain Ontology (D2O)
In this section, a proposed structure for the domain ontology is presented; it is used both for mapping domain keywords to the corresponding domain concepts and for achieving keyword disambiguation. Hence, it is called the Disambiguation Domain Ontology (D2O). With the aid of WordNet, D2O organizes the considered domain keywords into groups so that each group consists of a set of synonymous keywords that express a unique domain concept. For dimensionality reduction, one keyword from each group is selected to express the underlying concept and is called the concept's Representative Term (RT). In addition to the "synonymous" relation, D2O maintains several relations among the domain keywords that indicate the semantic strength between each domain keyword and the other keywords in D2O.

To the best of our knowledge, web page classification techniques rely mainly on a bag-of-words representation of the input page. However, employing individual keywords as features may lead to feature ambiguity, called the polysemy effect [9]. This happens because some keywords are shared among several domains and accordingly have several meanings. For illustration, suppose that page X includes the keywords "Java", "Programming", and "Computer", while another page Y includes the keywords "Java", "Cup", and "Coffee". If the keywords are used directly as classification features, the classifier will certainly be misled, as it may give the same label to both pages (e.g., classify them into the same class). However, if "Java" is considered an ambiguous keyword, WSD can be used to disambiguate it into its correct sense, and the page can be classified accordingly. As nearby keywords can provide strong, consistent clues to the sense of an ambiguous keyword, D2O maintains a list of discriminative keywords for each ambiguous domain keyword, called its Partners List (PL); the distiller then relies on those partners to sense the correct meaning of the polysemous (ambiguous) keyword. However, selecting those partners is a true challenge, which will be clarified in the next sections. Hence, when classifying a new page P that includes an ambiguous keyword Kamb, PL(Kamb) is first identified with the aid of D2O; then the correct meaning of Kamb is sensed based on the absence/existence of the keywords of PL(Kamb) in the tested page P. Afterward, a decision can be taken whether to include Kamb as a feature during the classification process. The next subsections illustrate in detail the D2O construction, the methodology used to identify the ambiguous keywords, and the procedure followed to elect the partner list for each ambiguous keyword.

5.1. D2O Construction

Algorithm 1 shows the procedure for D2O construction, which consists of four steps: (i) Conceptualization, (ii) Concept Weight Calculation, (iii) Inter-Concept Relationship Assignment, and (iv) Graph Construction. In the first step (Conceptualization), the domain's keywords are collected by a domain expert from highly domain-related web pages and represented by the set K = {k1, k2, ..., kg}. Then, synonyms are grouped into one cluster, so that each cluster represents a distinct domain concept. The most popular keyword in each cluster is selected as the concept's Representative Term (RT), while the remaining cluster members are the RT's synonyms. Finally, after conceptualization, the DOI is expressed by a group of concepts, represented by the set C = {c1, c2, c3, ..., cn}. In the second step (Concept Weight Calculation), the weight of each domain concept, denoted w(ci) ∀ i ∈ {1, 2, ..., n}, is calculated over the web pages collected from the considered domain corpus using the Odd Ratio Numerator (OddN) [31] method, as calculated in (1).

$w(c_i) = \mathrm{OddN}(c_i) = tpr(c_i)\,[1 - fpr(c_i)]$    (1)

where $tpr(c_i) = tp(c_i)/pos$ is the sample true positive rate of concept $c_i$, and $fpr(c_i) = fp(c_i)/neg$ is its sample false positive rate. Here $tp(c_i)$, the true positive count of the domain given concept $c_i$, is the number of positive pages (pages already related to the DOI) containing the concept $c_i$; $fp(c_i)$, the false positive count, is the number of negative pages (pages not related to the domain) containing the concept $c_i$; $pos$ is the number of positive pages of the domain; and $neg$ is the number of negative pages of the domain.

Disambiguation Domain Ontology (D2O) Generation Algorithm

Inputs:
    K: a set of keywords of the DOI, K = {k1, k2, ..., kg}.
Output:
    Domain Ontology Graph (OG), G = (C, E, W, R).
Steps:
    // keyword clustering by WordNet
    1:  Set E = Φ, X1 = K, X2 = K
    2:  While X1 ≠ Φ
    3:      For each keyword x1 ∈ X1 Do
    4:          While X2 ≠ Φ
    5:              For each keyword x2 ∈ X2 Do
    6:                  If (x1 ≠ x2) and (no edge between them) Then
    7:                      Get the semantic relation between them from WordNet
    8:                      Add an edge connecting x1 and x2
    9:                      Let E = E ∪ {x1, x2}
    10:                 End If
    11:                 X2 = X2 - x2
    12:             End For
    13:         End While
    14:         X1 = X1 - x1
    15:     End For
    16: End While
    17: Get the related keywords of the domain as the set of domain concepts C.
    // calculate each concept weight
    18: For each concept ci in C Do
    19:     Use the considered domain corpus.
    20:     Calculate w(ci) = OddN(ci) = tpr(ci)[1 - fpr(ci)],
            where tpr(ci) = tp(ci)/pos and fpr(ci) = fp(ci)/neg
    21: End For
    // calculate each concept-pair relation using the WebOverlap coefficient
    22: For each concept pair cx, cy in C Do
    23:     Find the number of pages retrieved from Google for the query "cx + cy"
    24:     Find the number of pages retrieved from Google for the query "cx"
    25:     Find the number of pages retrieved from Google for the query "cy"
    26:     Calculate the WebOverlap coefficient for cx and cy
    27: End For
    28: Gather all related concepts with their weights and relations in a graph G.

Algorithm Parameters:
    K: the set of all domain keywords; x1, x2: keywords; G: the graph of the Disambiguation Domain Ontology; C: the set of n concepts, C = {c1, c2, ..., cn}; E: the set of edges; W: the set of node (concept) weights; R: the set of edge weights; w(A): the weight of concept A in the domain; OddN(A): the Odd Ratio Numerator feature selection method; tp(A): true positive, the number of positive pages containing concept A; fp(A): false positive, the number of negative pages containing concept A; pos: the number of positive pages; neg: the number of negative pages; tpr(A): sample true positive rate, tp(A)/pos; fpr(A): sample false positive rate, fp(A)/neg; r(A,B): the relation between concepts A and B.

Algorithm 1. Disambiguation Domain Ontology (D2O) generation.

In the third step (Inter-Concept Relationship Assignment), the relation between every pair of concepts, denoted r(cx, cy) ∀ cx, cy ∈ C, is calculated using the WebOverlap coefficient [32], as illustrated in (2).

$r(c_x, c_y) = \mathrm{WebOverlap}(c_x, c_y) = N(c_x \wedge c_y) / \min\big(N(c_x), N(c_y)\big)$    (2)

where N(c) is the number of pages retrieved from Google that contain the concept c, and N(cx ∧ cy) is the number of pages retrieved from Google that contain both concepts cx and cy. In the fourth step (Graph Construction), the domain concepts are arranged in a graph-like structure, called the Disambiguation Domain Ontology (D2O), considering the inter-concept relationships between every pair of concepts as well as the concepts' weights.
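Equations (1) and (2) are straightforward to compute once the page counts are known. The sketch below is a minimal illustration (not the authors' code); the hit counts are assumed to be supplied by an external search-engine lookup, and the worked example mirrors the one given in section 5.2.

```python
def oddn_weight(tp, fp, pos, neg):
    """Concept weight by Odd Ratio Numerator, eq. (1):
    w(c) = tpr(c) * (1 - fpr(c)), with tpr = tp/pos and fpr = fp/neg."""
    return (tp / pos) * (1.0 - fp / neg)

def weboverlap(n_xy, n_x, n_y):
    """Inter-concept relation by the WebOverlap coefficient, eq. (2):
    r(cx, cy) = N(cx AND cy) / min(N(cx), N(cy)),
    where the N(...) values are search-engine hit counts."""
    return n_xy / min(n_x, n_y)

# Worked example from section 5.2 (hit counts taken from the text):
# r(A, B) = 642,000,000 / min(25,270,000,000, 10,320,000,000) ≈ 0.06
print(weboverlap(642_000_000, 25_270_000_000, 10_320_000_000))  # ~0.062
```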

5.2. D2O Generation: Illustrative Example

In this section, an illustrative example showing how to construct a simple D2O will be introduced. The task is to generate a new ontology that represents a specific domain by following the proposed procedure illustrated in algorithm 1.

A. Conceptualization and Calculating the Concept Weights
As depicted in figure 3, the domain concepts are identified. Then, the Odd Ratio Numerator (OddN) method is used to calculate the weight w(ci) of each concept using (1). The number of positive pages used is pos = 10, and the number of negative pages is neg = 10; the negative pages are those that contain several domain concepts but are not related to the domain.

B. Inter-Concept Relationship Assignment
In this step, the relationship r(cx, cy) between each concept pair is calculated using (2). For illustration, consider the concept pair (A, B). The query "A+B" appears in 642,000,000 pages on Google, while the queries "A" and "B" return 25,270,000,000 and 10,320,000,000 pages respectively, so r(A,B) = 642,000,000 / min(25,270,000,000, 10,320,000,000) = 642,000,000 / 10,320,000,000 ≈ 0.06. The calculations continue for the rest of the domain concepts, as shown in figure 3.

C. Graph Construction
Graph construction is the final step, in which all domain concepts are arranged together in the form of a weighted conceptual graph that represents the domain of interest, as illustrated in the last step of figure 3.

5.3. Identifying Ambiguous Keywords (Confusion Set)

In order to solve the polysemy dilemma after setting up D2O, it is essential to identify the set of keywords that need to be disambiguated. This set of keywords, called the "Confusion Set" (CS), may cause serious problems, as it confuses the classification algorithm. Since manual identification of CS is a tedious and subjective task, this section illustrates an automatic methodology for identifying it. WordNet, as a valuable lexical resource, attempts to model the knowledge of a native speaker of English. However, identifying the ambiguous set by directly employing WordNet results in too many senses for almost all considered domain keywords. Generally, WordNet maintains a semantic relation among keyword senses by grouping keywords into the same semantic domain (Education, Sports, Medicine, etc.). However, too many senses may be considered for the same keyword, which may be harmful and time-consuming for the classification task. For illustration, the keyword "bank" has ten senses in WordNet 3.1, three of which, namely "bank#1", "bank#3", and "bank#6", are grouped under the same domain label, "Economy", while "bank#2" and "bank#7" are grouped under the labels "Geography" and "Geology" respectively. Moreover, WordNet also considers the rare senses of a word. For illustration, the word "java" has three senses in WordNet 3.1; hence "java" is shared between three domains: "Food" with the sense "coffee", "Communication" with the sense "object-oriented programming language", and "Location" with the sense "an island in Indonesia". However, the second sense is by far the most widely used. Since WSD is expensive, it is unnecessary to disambiguate every domain keyword appearing in the tested page; too much WSD may also increase the classifier's risk of overfitting, as WSD is not always accurate. From another point of view, WordNet makes subtle distinctions between keyword senses, which can be harmful for the classification task: for illustration, WordNet distinguishes between bass (the lowest part in polyphonic music) and bass (the lowest part of the musical range). Hence, relying completely on WordNet for identifying CS would not be a good decision. In this section, a simple but effective methodology for identifying the subset of keywords that most likely confuse the classifier is introduced, called the Fuzzy Based Ambiguity Identification (FBAI) strategy. As illustrated in figure 4, FBAI consists of two sequential phases: (i) Pre-Processing (PP) and (ii) the Fuzzy Inference Engine (FIE).

5.3.1. Pre-Processing (PP)

During PP, three different parameters are calculated for each input keyword k ∈ D2O: (i) the Domain Relatedness of k, denoted DR(k), (ii) the Keyword Popularity, denoted KP(k), and (iii) the domain Information Content of k, denoted IC(k). To calculate those parameters for k, Google is first queried to retrieve the η pages that contain k, expressed by the set Pages(k). Then, for each retrieved page P ∈ Pages(k), the set of snippets that contain k is identified and combined to act as a representative of the page P. A snippet is defined as the set of keywords around k in the page P; hence, the center of the snippet is the considered keyword k. The snippet has two wings of length χ, i.e., each wing is represented by χ words on one side of k. The collected snippets of k in page P are combined and used to classify P, using a traditional NB classifier, either to the domain (class D+) or not (class D-). This process continues until all pages ∈ Pages(k) have been classified. Then, the numbers of pages ∈ D+ and pages ∈ D- are counted, denoted P+(k) and P-(k) respectively. Moreover, the frequency of k is calculated over all snippets collected from all pages ∈ D+, denoted TF+(k), and over all snippets collected from all pages ∈ D-, denoted TF-(k). Also considered are the number of occurrences of k in the considered domain corpus, denoted N(k), as well as the total number of occurrences of the other domain keywords, denoted NT. Finally, DR(k), KP(k), and IC(k) are calculated using (1), (2), and (3) respectively.

$DR(k) = \frac{TF^{+}(k)}{TF^{-}(k)}$    (1)

$KP(k) = \frac{P^{+}(k)}{P^{-}(k)}$    (2)

$IC(k) = -\log P(k) = -\log\big(N(k)/N_T\big)$    (3)

Definition 1: Keyword Domain Relevancy, DR(k), measures the degree of relevancy of a keyword k to the domain of interest; it is defined as the ratio of the occurrences of k in the positive pages (pages related to the domain) to its occurrences in the negative ones (pages unrelated to the domain).

Definition 2: Keyword Domain Popularity, KP(k), measures the popularity of a keyword k in the considered domain against other domains; it is defined as the ratio of the positive pages that contain k to the negative ones.

Definition 3: Keyword Domain Information Content, IC(k), measures k's ability to convey domain information, i.e., the amount of domain knowledge gained by the keyword's occurrence.

As illustrated in definition 3, a highly recurring keyword conveys little information due to its ubiquitous use, and thus has a small IC. On the other hand, a rare keyword conveys much more domain information, as it tends to be more specialized and to carry more independent meaning; accordingly, it owns a high IC value. Finally, after calculating DR(k), KP(k), and IC(k) ∀ k ∈ D2O, the Confusion Set (CS) can be identified through a fuzzy inference process. Generally, the higher the domain relevancy, popularity, and information content of a keyword, the lower its ambiguity level.
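For concreteness, the three fuzzy-inference inputs of eqs. (1)-(3) can be computed as below. This is only a sketch; it assumes the snippet-classification counts (TF+, TF-, P+, P-, N(k), NT) have already been produced by the NB snippet classifier described above.

```python
import math

def keyword_ambiguity_inputs(tf_pos, tf_neg, p_pos, p_neg, n_k, n_total):
    """Compute the fuzzy-inference inputs for a keyword k:
    DR(k) = TF+(k) / TF-(k)    domain relatedness, eq. (1)
    KP(k) = P+(k) / P-(k)      keyword popularity,  eq. (2)
    IC(k) = -log(N(k) / NT)    information content, eq. (3)"""
    dr = tf_pos / tf_neg
    kp = p_pos / p_neg
    ic = -math.log(n_k / n_total)
    return dr, kp, ic
```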

5.3.2. Fuzzy Inference Engine (FIE)

Fuzzy inference is suitable for approximate reasoning, as it can be used efficiently for decision making under incomplete or uncertain data. Hence, fuzzy inference can be successfully employed to assign an ambiguation value to a specific keyword. Generally, fuzzy inference is applied through three sequential steps: (i) fuzzification of inputs, (ii) fuzzy rule induction, and (iii) defuzzification.

a. Fuzzification of Inputs
Three different fuzzy sets, DR, KP, and IC, are considered. During fuzzification, the input crisp values are mapped into grades of membership for the linguistic terms "Low" and "High" of the used fuzzy sets. The membership functions employed for the considered fuzzy sets (DR, KP, and IC) are illustrated in figure 4.

b. Fuzzy Rule Induction
For the inference process, a set of fuzzy rules is employed in the form IF (A is X) AND (B is Y) AND (C is Z) THEN (D is M), where A, B, and C represent the input variables (DR, KP, and IC), X, Y, and Z represent the corresponding linguistic terms (Low or High), D represents the rule output, and M represents a linguistic term (Low or High). Hence, the output of fuzzification is the input of fuzzy rule induction. There are 8 rules, listed in table 2 ('L' refers to "Low", 'H' refers to "High"). For illustration, the first rule in table 2 reads: IF DR(k) is Low AND KP(k) is Low AND IC(k) is Low THEN Output is Low.

Generally, there are four fuzzy rule inference methods: max-min, max-product, sum-dot, and drastic product. Max-min is the method used in this paper. It is based on choosing the min operator for the conjunction in the premise of the rule and for the implication task, while the max operator is used for aggregation. Hence, for w input variables and q states of the output linguistic terms, the max-min inference rule is illustrated in (4).

$\mu_{out}(x) = \underbrace{\max}_{aggregation}\Big[\underbrace{\min\big(\mu_{inp(1)}, \ldots, \mu_{inp(w)}\big)}_{implication}\Big], \quad \forall x \in \{1, 2, 3, \ldots, q\}$    (4)

c. Defuzzification
Defuzzification is accomplished using the output membership function illustrated in figure 4. Consider a keyword k whose input parameters are DR(k), KP(k), and IC(k): the output value of the defuzzification process is a crisp value that expresses the Ambiguation Value (AV) of the keyword k, i.e., AV(k). Finally, the decision whether k is ambiguous or not is taken based on a simple rule, expressed by the simple step identification function illustrated in figure 5. Figure 6 illustrates a simple example considering a keyword k whose parameters DR(k), KP(k), and IC(k) are 3, 3.5, and 7 respectively.
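A compact sketch of the whole FIE pipeline follows. It is illustrative only: the actual membership functions and output terms are those of figure 4 (not reproduced here), so linear ramps on an assumed 0..10 scale and assumed output-term centers are used instead, and only the first rule of table 2 (quoted in the text) is filled in.

```python
def mu_low(x, lo=0.0, hi=10.0):
    # 'Low' membership: assumed linear ramp; the paper's actual
    # membership functions are defined in figure 4
    return max(0.0, min(1.0, (hi - x) / (hi - lo)))

def mu_high(x, lo=0.0, hi=10.0):
    return max(0.0, min(1.0, (x - lo) / (hi - lo)))

TERMS = {"L": mu_low, "H": mu_high}

# Only the first rule of table 2 is quoted in the text; the remaining
# seven rules would be filled in from the table.
RULES = [
    (("L", "L", "L"), "L"),  # IF DR Low AND KP Low AND IC Low THEN Output Low
]

def ambiguation_value(dr, kp, ic, rules=RULES):
    """Max-min inference, eq. (4): min over a rule's premises (implication),
    max over rules firing the same output term (aggregation), followed by a
    simple centroid defuzzification to a crisp AV(k)."""
    strength = {"L": 0.0, "H": 0.0}
    for (t_dr, t_kp, t_ic), t_out in rules:
        fire = min(TERMS[t_dr](dr), TERMS[t_kp](kp), TERMS[t_ic](ic))
        strength[t_out] = max(strength[t_out], fire)
    centers = {"L": 2.5, "H": 7.5}  # assumed output-term centers
    total = sum(strength.values())
    return sum(strength[t] * centers[t] for t in strength) / total if total else 0.0

# Example from figure 6: DR(k) = 3, KP(k) = 3.5, IC(k) = 7
print(ambiguation_value(3.0, 3.5, 7.0))
```

The crisp AV(k) would then be thresholded by the step identification function of figure 5 to decide whether k joins the Confusion Set.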

5.4. Identifying the Partner List (PL) for each Ambiguous Keyword
After the domain CS has been identified, the task is to choose PL(k) ∀ k ∈ CS. A partner of a keyword k is a nearby domain keyword that provides a strong, consistent clue to the sense of k with respect to the domain of interest (DOI). Hence, a good way is to choose the keywords closest to the ambiguous keyword k in D2O as the partners of k (e.g., PL(k)). To accomplish this aim, as depicted in figure 7, a circle centered at k, called the Partner Circle (PC), is drawn over D2O with radius λ, where λ represents the number of hops that separates k from the farthest partners in PL(k). However, not all keywords inside PC are considered partners of k: only those keywords related to k with an acceptable strength are accepted, relying on the relation strengths between k and the keywords inside PC(k) that were previously calculated during D2O construction. Assuming a threshold strength value θ, a keyword M is considered a partner of k if r(k,M) ≥ θ. Also, as illustrated in figure 7, if a keyword is rejected from the partner list of a keyword, its directly connected keywords are rejected as well.
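The selection just described behaves like a pruned breadth-first search over the D2O graph. The sketch below is one plausible reading of it (an assumption, not the authors' code): `graph` is a hypothetical adjacency dict and `rel` a lookup for the relation strengths computed during D2O construction.

```python
from collections import deque

def partner_list(graph, rel, k, lam, theta):
    """Select PL(k): keywords within `lam` hops of k in the D2O graph whose
    relation strength r(k, m) >= theta. Rejected keywords are not expanded,
    so keywords reachable only through them are rejected too (figure 7).

    graph: dict mapping keyword -> iterable of directly connected keywords
    rel:   function rel(a, b) returning the relation strength r(a, b)
    """
    partners, seen = [], {k}
    frontier = deque([(k, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == lam:                 # outside the partner circle PC(k)
            continue
        for m in graph.get(node, ()):
            if m in seen:
                continue
            seen.add(m)
            if rel(k, m) >= theta:       # accepted partner
                partners.append(m)
                frontier.append((m, depth + 1))
    return partners
```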

6. The Proposed Domain Distiller

As shown in figure 8, the input of the proposed distiller is a web page to be classified. The distiller analyzes the page and extracts the existing domain keywords using the Disambiguation Domain Ontology (D2O), which maintains all available domain keywords. However, before mapping the extracted domain keywords to the corresponding concepts, a disambiguation process is performed to discard those ambiguous keywords that are not related to the domain. As the target of the proposed distiller is to decide whether the input page is related to the domain of interest, the core of the distiller is a binary classifier. To perfectly accomplish the distillation process, as illustrated in figure 8, the proposed distiller is divided into two sequential modules: the analysis module (AM) and the classification module (CM). During AM, the input page is expressed in a vector space model; CM then takes the decision whether to accept or reject the page using an optimized Naïve Bayes classifier. These modules are explained in the next subsections.

6.1. Analysis Module (AM)
The target of AM is to represent the input page in the vector space model. To accomplish this aim, the domain keywords found in the page are extracted and mapped to the corresponding domain concepts with the aid of the proposed D2O. As illustrated before, if the considered domain of interest (DOI) has k keywords, these keywords are clustered using WordNet. In each cluster, synonymous keywords are grouped and represented by one domain concept (the underlying concept), whose most frequently used term is called the concept's Representative Term (RT); the remaining cluster terms are the RT's synonyms. Considering an input page P, after extracting the page's domain keywords, expressed by the set Keywords(P), the ambiguous keywords are identified with the aid of the proposed D2O. Disambiguation is done by discovering the correct sense of a word through looking for the partners of the ambiguous keyword inside the processed page P. Considering an ambiguous keyword kamb, the existence of kamb's partners gives good evidence that kamb carries the correct domain sense. Hence, to disambiguate kamb, Partners(kamb) are first identified, and then the set RES = Keywords(P) ∩ Partners(kamb) is constructed. Finally, the strengths of the relations between kamb and all keywords ∈ RES are read from D2O, and the average of the strengths of the identified relations is calculated. Then, the decision whether the sense of kamb lies within the DOI, and hence whether kamb is considered a domain keyword, is taken based on Rule 1.

Rule 1: Disambiguation Rule
An ambiguous keyword kamb is considered a domain keyword if the average of the strengths of the relations to those partners of kamb that exist in the processed page is greater than a threshold value. This can be expressed as follows:
If [Average(Relation_Strength(kamb, k)) ∀ k ∈ RES] ≥ threshold Then Sense(kamb) ∈ DOI Else Sense(kamb) ∉ DOI

After removing those keywords that are not related to the DOI, the remaining domain keywords are mapped to the corresponding domain concepts using D2O. Afterward, the input page can be represented as a vector of domain concepts whose dimension equals the number of considered domain concepts. If a concept is found in the input page, the corresponding place of that concept in the vector is set to 1; otherwise, it is set to 0. At the end of this stage, an input page is expressed as a vector of ones and zeros, called the vector space model. As our domain has a set of n concepts C = {c1, c2, c3, ..., cn}, the vector of the input page is n-dimensional.
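The AM can be summarized in a short sketch. This is a hedged reading of Rule 1 and of the binary vector construction, not the authors' implementation; `beta` names the threshold left unnamed in the text, and `partners` and `rel` are hypothetical lookups into D2O.

```python
def disambiguate(k_amb, page_keywords, partners, rel, beta):
    """Rule 1: keep an ambiguous keyword only if the average relation
    strength to its partners found in the page reaches the threshold
    (here called `beta`, an assumed name)."""
    res = [p for p in partners(k_amb) if p in page_keywords]  # RES set
    if not res:
        return False
    avg = sum(rel(k_amb, p) for p in res) / len(res)
    return avg >= beta

def page_vector(page_keywords, concepts, keyword_to_concept,
                ambiguous, partners, rel, beta):
    """Binary vector space model over the n domain concepts."""
    kept = {k for k in page_keywords
            if k not in ambiguous
            or disambiguate(k, page_keywords, partners, rel, beta)}
    found = {keyword_to_concept[k] for k in kept if k in keyword_to_concept}
    return [1 if c in found else 0 for c in concepts]
```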

6.2. Classification Module (CM)
This module is a binary classifier in which pages are classified into only two classes: the processed web page either relates to our DOI or it does not. Related pages are classified to "Class 1", while the others are classified to "Class 2". In CM, a new binary classifier is proposed by integrating an SVM, optimized by means of a genetic algorithm (GA), with the Naïve Bayes (NB) algorithm. Although the optimization is done on the SVM, it also optimizes the behavior of NB; hence, the new binary classifier, which is the core of the proposed distiller, is called the Optimized NB (ONB) classifier. Initially, ONB rejects outliers by selecting the most informative examples using the optimized SVM (with the aid of GA). After rejecting the outliers, the remaining pages (examples) are used to train the NB classifier. Finally, the decision whether an input page is related to "Class 1" or "Class 2" is taken using the NB classifier at testing time. ONB operates in three phases: (i) Outlier Rejection, (ii) Training, and (iii) Testing. More details about these three phases are introduced in the next subsections; they are also depicted in algorithm 2.

6.2.1. Outlier Rejection
This phase aims to discard false examples that may lead to constructing wrong classification rules during the next phase, classifier training. Surely, constructing wrong rules would badly impact the performance of the distiller, which in turn would direct the focused crawler in the wrong direction when such a distiller is embedded in an EFC. Rejecting outliers is accomplished in two steps: (A) SVM Optimization, and (B) Selection of informative pages.

(A) SVM Optimization
The genetic algorithm (GA) is one of the most effective, powerful, and unbiased heuristic search approaches in the area of Artificial Intelligence (AI). It pursues an optimized solution for a given problem based on several mechanisms, such as mutation, inheritance, and selection. GA has several advantages that make it one of the most common search algorithms in AI: (i) it can address any optimization problem based on the chromosome approach; (ii) it can handle multiple-solution search spaces and perfectly solve the given problem in such an environment; (iii) it is less complex and more straightforward compared to classical algorithms; (iv) it is easier to transfer and apply on different platforms, thereby increasing the system's flexibility; and (v) it supports multi-objective optimization, making it a good choice for "noisy" environments. Moreover, GA has the ability to find a global optimum. On the other hand, GA differs from traditional search and optimization methods in several significant points, which makes it a perfect choice for our work: (i) it searches in a parallel manner through the population and thus avoids being trapped in a local optimum, unlike traditional techniques that search from a single point; (ii) it implements probabilistic selection rules, not deterministic ones; (iii) it works on the chromosome, which is an encoded clone of the potential solution's parameters, rather than on the parameters themselves; (iv) it employs a fitness score derived from objective functions, with no derivative or auxiliary information; (v) it requires no explicit expression of the solution model, only the fitness function and the related variables; (vi) it can effectively handle arbitrary types of constraints and objectives; and (vii) its complexity is almost linearly correlated with the scale of the considered problem, so there is no chance of a dimension disaster.

In this step (SVM Optimization), a genetic algorithm (GA) is used to improve the performance of the traditional SVM. GA is used to optimize a specific parameter of SVM, the soft margin parameter C [33]. Parameter C is used during the training of SVM and indicates how strongly outliers are taken into account in calculating the support vectors. To implement the proposed approach, this paper uses the linear kernel function for the SVM classifier, as the linear kernel is the most suitable one for text. The optimization procedure can be divided into five processes: (i) Population, (ii) Evaluation, (iii) Encoding, (iv) Selection, and (v) Crossover. In the population process, different values of the parameter C are assumed as chromosomes, which are the possible solutions of the problem. Then, in the evaluation process, each chromosome (representing a value of parameter C) is assigned a fitness value (Ft) by training the SVM classifier on the training dataset and measuring its accuracy on a testing dataset, as depicted in (5).

$Ft_i = Acc_i = \frac{CorrectAssignments}{TotalPages} = \frac{A_i + C_i}{A_i + B_i + C_i + D_i}$    (5)

where Fti is the fitness value of the ith chromosome, Acci is the accuracy of the SVM classifier when using the ith chromosome, Ai is the number of pages assigned correctly, Bi is the number of pages assigned incorrectly, Ci is the number of documents rejected correctly, and Di is the number of documents rejected incorrectly when using the ith chromosome. Then, in the encoding process, the solutions are encoded as binary numbers. During the selection process, the better solutions are selected using the Roulette Wheel Selection (RWS) technique [34], as it is the simplest selection technique to implement. The probability that each chromosome will be selected is calculated as in (6).

Pi  Fti



x

i 1

Fti

(6)

where Pi is the selection probability of chromosome i, x is the number of chromosomes in the population, and Fti is the fitness value of chromosome i. Then the sum of the probabilities of all chromosomes is calculated as depicted in (7).

$SumP = \sum_{i=1}^{x} P_i$    (7)

where SumP is the sum of the probabilities of all chromosomes, x is the number of chromosomes in the population, and Pi is the selection probability of the ith chromosome. Then a random number Rn is generated from the interval (0, SumP), as expressed in (8).

Rn  Rand (0, SumP)

(8)

where Rn is a random number between 0 and SumP, and SumP is the sum of the probabilities of all chromosomes in the population. The population is then traversed again, accumulating the probability values from 0 toward SumP; while summing, once the accumulated value becomes greater than or equal to the random number Rn, the process stops and that chromosome is selected. The crossover process interchanges genes between chromosomes to create offspring using the Single Point (SP) technique [35]. In this technique, a random number Rn[i] from 0 to SumP is first generated; if Rn[i] is smaller than the crossover probability (Pc), chromosome i is chosen as a parent. The crossover point within a chromosome is chosen randomly, and the two parent chromosomes are interchanged at that point to produce two new offspring.
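The two GA operators of eqs. (6)-(8) are simple to realize. The sketch below is illustrative only; it spins the roulette wheel directly on the fitness values (equivalent to first normalizing by the fitness sum, as in eq. (6)) and applies single-point crossover to binary-encoded chromosomes.

```python
import random

def roulette_select(fitness):
    """Roulette Wheel Selection, eqs. (6)-(8): draw a random point on the
    wheel and walk the cumulative sum until it is reached."""
    total = sum(fitness)                 # wheel circumference
    rn = random.uniform(0.0, total)      # eq. (8), on the unnormalized wheel
    acc = 0.0
    for i, ft in enumerate(fitness):
        acc += ft
        if acc >= rn:                    # stop and select this chromosome
            return i
    return len(fitness) - 1

def single_point_crossover(parent_a, parent_b):
    """Single Point crossover: pick a random cut point and swap the tails
    of the two binary-encoded parent chromosomes."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])
```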

(B) Selection of informative pages. SVM has several salient properties that other techniques lack: (i) it guarantees accurate classification even if the input data is non-linearly separable; (ii) the classification accuracy does not depend on expertise in choosing the kernel function (in the case of non-linear input data); (iii) since SVM has only two free parameters, namely the kernel function and the upper bound, it can be easily controlled; (iv) SVM ensures the existence of a global, unique, and optimal solution, since its training is equivalent to solving a linearly constrained quadratic programming problem; (v) SVM can adapt its learning characteristics through the kernel function and can adequately classify data even in a high-dimensional feature space with little training data; and (vi) SVM works well on real-world applications and can be applied to any kind of data for which a kernel is available. In this step, the optimized SVM is used to select the most informative examples from the available ones. Inspecting the distribution of the available training examples between the two classes, as illustrated in figure 9, makes it clear that examples far from the hyperplane are highly related to their classes. Because the hyperplane separates the two classes, the examples close to the hyperplane may be related to both classes, or their assignments may be erroneous; hence, some of them may be assigned incorrectly. Accordingly, they are not discriminative examples and may lead to obsolete classification rules during the training phase. A good practice is to eliminate these ambiguous examples. However, to keep as many training examples as possible, such elimination should be controlled. A simple approach, which is the one followed here, is to eliminate the support vectors, as they are the most ambiguous examples, as illustrated in figure 9.
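A sketch of this pruning step, assuming scikit-learn is available and that X, y are NumPy arrays; the value of C would be the GA-optimized soft margin found above:

```python
import numpy as np
from sklearn.svm import SVC

def prune_support_vectors(X, y, C=0.5):
    """Train a linear SVM and drop its support vectors, i.e., the examples
    closest to the hyperplane, treated here as the least discriminative ones."""
    svm = SVC(kernel="linear", C=C).fit(X, y)
    keep = np.ones(len(X), dtype=bool)
    keep[svm.support_] = False      # svm.support_ holds the indices of the SVs
    return X[keep], y[keep]
```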

6.2.2. Training Phase
NB has proven to be among the most effective classifiers in the area of text classification. It has several advantages: (i) NB has a short computational training time under the assumption of independent features; (ii) it is easy to implement and often has superior performance; (iii) applying NB requires low resources in terms of time and memory during the testing phase, which makes it suitable for focused crawlers that need to take the classification decision on time; and (iv) NB is robust in noisy environments and requires only a small amount of training data. During the training phase, the traditional NB classifier is trained using the most informative examples selected during the previous phase. The simple binary NB classifier is employed, with two target classes represented by the set CL = {cl1, cl2}, where cl1 is the class containing the pages related to the DOI, while cl2 is the opposite class. During NB training, the task is to calculate the conditional probabilities P(ci|clj) ∀ci ∈ C, clj ∈ CL, and the classes' prior probabilities P(clj) ∀clj ∈ CL, as illustrated in (9) and (10), respectively.

$P(cl_j) = \frac{Pg_j}{Pg}$    (9)

$P(c_i \mid cl_j) = \frac{N_{i,j}}{N_j}$    (10)

Where Pgj is the number of pages related to class clj, Pg is the total number of pages of all domain classes, Nij is the number of occurrences of concept ci in pages of class clj, and Nj is the total number of concepts in pages of class clj.
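A minimal training sketch of Eqs. (9) and (10), assuming each page is represented as the list of domain concepts extracted from it (names are illustrative):

```python
from collections import defaultdict

def train_nb(pages, labels, classes=("cl1", "cl2")):
    """Estimate the priors P(cl_j) (Eq. 9) and conditionals P(c_i | cl_j) (Eq. 10)."""
    prior, cond = {}, {}
    for cl in classes:
        docs = [p for p, lab in zip(pages, labels) if lab == cl]
        prior[cl] = len(docs) / len(pages)                   # Pg_j / Pg
        counts = defaultdict(int)
        for page in docs:
            for concept in page:
                counts[concept] += 1
        n_j = sum(counts.values())                           # total concepts in class cl_j
        cond[cl] = {c: n / n_j for c, n in counts.items()}   # N_ij / N_j
    return prior, cond
```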

6.2.3. Testing Phase
During the testing phase, the decision of whether an input page is related to the DOI or not is taken. If it is related, the page is assigned to "cl1"; otherwise, it is assigned to "cl2" and rejected. Initially, the probability that the page pi is related to each class (also called the likelihood of belonging to the class) is calculated based on the domain concepts extracted from pi. If pi contains z domain concepts, the likelihood of belonging to class clj ∀clj ∈ CL can be calculated as expressed in (11). Then, pi is assigned to the class (cltarget) with the maximum calculated likelihood, using (12).

$Likelihood(p_i \mid cl_j) = P(cl_j) \cdot \prod_{x=1}^{z} P(c_{x,i} \mid cl_j)$    (11)

$Target(p_i) = cl_{target} = \arg\max_{cl_j} \left( P(cl_j) \cdot \prod_{x=1}^{z} P(c_{x,i} \mid cl_j) \right)$    (12)

Algorithm 2. Optimized Naïve Bayes (ONB) Algorithm

Inputs:
- CL: a set of predefined 'm' classes, CL = {cl1, cl2, …, clm}.
- TEs: a set of 's' training examples, TEs = {te1, te2, …, tes}.
- pgi: the input test page.

Output:
- A class for the input test page.

Steps:
// 1 - SVM Optimization
// a - Population:
1. Assume different chromosomes.
// b - Evaluation:
2. Calculate the fitness function for each chromosome:
   $Ft_i = Acc_i = \frac{CorrectAssignments}{TotalPages} = \frac{A_i + C_i}{A_i + B_i + C_i + D_i}$
// c - Encoding:
3. Represent each chromosome as a binary value.
// d - Selection: select the best chromosome.
4. Calculate the selection probability of each chromosome: $P_i = Ft_i / \sum_{i=1}^{x} Ft_i$
5. Calculate the sum of all probability values: $SumP = \sum_{i=1}^{x} P_i$
6. For each chromosome i in the population Do
7.    Generate a random number Rn in the range [0, SumP]: $Rn = \mathrm{Rand}(0, SumP)$
8.    SumP += Pi
9.    If (SumP >= Rn) Then
10.     Stop and select chromosome i.
11.   End If
12. End For
// e - Crossover: using the Single Point technique
13. For each chromosome i in the population Do
14.   Generate a random number Rn in the range [0, SumP]: $Rn = \mathrm{Rand}(0, SumP)$
15.   If (Rn < Pc) Then
16.     Select chromosome i as a parent.
17.   End If
18. End For
19. Choose the crossover point within a chromosome randomly.
20. Interchange the two parent chromosomes at this point to produce the offspring.
// 2 - Selection of informative pages
21. Train the SVM with the selected chromosome.
22. Find the support vectors (SVs).
23. Eliminate the SVs.
// 3 - Training Phase
24. Train the NB classifier with the informative pages.
25. Calculate the conditional probabilities P(ci|clj) ∀ci ∈ C, clj ∈ CL: $P(c_i \mid cl_j) = N_{i,j} / N_j$
26. Calculate the classes' prior probabilities P(clj) ∀clj ∈ CL: $P(cl_j) = Pg_j / Pg$
// 4 - Testing Phase
27. Calculate the probability (likelihood) that the page pi is related to each class:
    $Likelihood(p_i \mid cl_j) = P(cl_j) \cdot \prod_{x=1}^{k} P(c_{x,i} \mid cl_j)$
28. Select the target class for page pi (cltarget) with the maximum calculated likelihood:
    $Target(p_i) = cl_{target} = \arg\max_{cl_j} \left( P(cl_j) \cdot \prod_{x=1}^{k} P(c_{x,i} \mid cl_j) \right)$

Algorithm Parameters:
- CL: the set of 'm' classes, CL = {cl1, cl2, …, clm}.
- C: the set of 'n' domain concepts, C = {c1, c2, …, cn}.
- TEs: the set of 's' training examples, TEs = {te1, te2, …, tes}.
- pgi: the input test page.
- x: the number of chromosomes in the population.
- Fti: the fitness value of chromosome i.
- Pi: the selection probability of chromosome i.
- SumP: the sum of the probability values of all chromosomes.
- Rn: a random number between 0 and SumP.
- Acci: the accuracy of the SVM when using chromosome i.
- Ai: the number of documents that are assigned correctly.
- Bi: the number of documents that are assigned incorrectly.
- Ci: the number of documents that are rejected correctly.
- Di: the number of documents that are rejected incorrectly.
- Pc: the crossover probability.
- SVs: the support vectors.
- P(ci|clj): the conditional probability of concept ci given class clj.
- Pgj: the number of pages related to class clj.
- Pg: the total number of pages of all domain classes.
- P(clj): the prior probability of class clj.
- Nij: the number of occurrences of concept ci in pages of class clj.
- Nj: the total number of concepts in pages of class clj.
- k: the number of domain concepts found in the page pi.
- cltarget: the target class for page pi.

6.3. SVM Optimization: Illustrative Example

The optimization procedure can be divided into five steps: (i) Population, (ii) Evaluation, (iii) Encoding, (iv) Selection, and (v) Crossover. The following assumptions are made:
- Number of chromosomes (population size) = 6
- Number of generations = 50
- Crossover probability (Pc) = 0.8

(i) Population: different values of the parameter C are assumed as chromosomes, as shown in table 3, column 2.
(ii) Evaluation: for each chromosome, the fitness value (Ft) is calculated using the fitness function. The SVM classifier is trained on the training dataset, and the testing dataset is used to calculate the accuracy, which serves as the fitness value as in (5); the results are illustrated in table 3, column 4.
(iii) Encoding: the chromosomes are encoded in a binary representation, as illustrated in table 3, column 3.
(iv) Selection: the Roulette Wheel Selection (RWS) technique is used. First, the selection probability (P) of each chromosome is calculated using (6), as in table 3, column 5. Then, the sum of all probabilities (SumP) is calculated using (7), and a random number (Rn) in the interval (0, SumP) is generated using (8), as in table 3, column 8. The population is then traversed while the cumulative sum of the probabilities is computed; when this sum becomes greater than or equal to the random number (i.e., sum ≥ Rn), the traversal stops and that chromosome is selected. This is repeated six times, yielding the new population in table 3, column 9.
(v) Crossover: as illustrated in algorithm 2, a random number Rn[i] (i = 1, 2, …, 6) from 0 to 1 is generated for each chromosome, as shown in table 3, column 10. If Rn[i] is smaller than the crossover probability Pc, chromosome i is selected as a parent. Since Pc = 0.8, the parent chromosomes are chromosomes 1, 2, and 5; hence, chromosome 1 is crossed with chromosome 2, chromosome 2 with chromosome 5, and chromosome 5 with chromosome 1. The crossover point is chosen randomly at bit number 2 for the three crossovers, as follows:
Chromosome 1 = chromosome 1 >< chromosome 2 = 10.1 >< 00.1 = 00.1 = 0.5
Chromosome 2 = chromosome 2 >< chromosome 5 = 00.1 >< 01.1 = 00.1 = 0.5
Chromosome 5 = chromosome 5 >< chromosome 1 = 01.1 >< 10.1 = 11.1 = 3.5
Finally, the chromosome population after the crossover process is illustrated in table 3, column 11; it becomes the population for the next generation of the genetic algorithm. The next generation follows the same steps as the first, and the results are illustrated in table 4. The population after the crossover process (table 4, column 11) becomes the following population, and the process continues until the assumed number of generations is reached. Finally, the fitness values of all chromosomes are calculated, and the highest one is selected as the best solution.
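The selection probabilities and cumulative sums of the first generation can be reproduced directly from the fitness values in table 3:

```python
fitness = [0.66, 0.71, 0.69, 0.70, 0.50, 0.50]   # Ft column of table 3
total = sum(fitness)                              # 3.76
probs = [round(f / total, 2) for f in fitness]    # Eq. (6)
cum, s = [], 0.0
for p in probs:
    s += p
    cum.append(round(s, 2))                       # Eq. (7), accumulated
print(probs)   # [0.18, 0.19, 0.18, 0.19, 0.13, 0.13]
print(cum)     # [0.18, 0.37, 0.55, 0.74, 0.87, 1.0]
```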

7. Performance Analysis and Implementation

In this section, the proposed distiller (i.e., ONB), which is the core of CM, is evaluated against traditional classification techniques, namely SVM, NB, and KNN, as well as recent classification techniques, namely Domain Oriented Naïve Bayes (DONB) [36] and Domain Oriented KNN (DOKNN) [36]. For each one, the accuracy, precision, and error are reported. The Web Data Commons dataset series contains all structured data extracted from the various Common Crawl corpora [37]; currently, the extraction process considers the Microdata, RDFa, and microformats data formats. The documents of Web Data Commons are not pre-designated as training or testing patterns; hence, some of them were chosen as training and testing subsets: 10000 web pages were randomly selected for training and 500 for testing. The parameters used through the next experiments, with the corresponding values, are illustrated in table 5.

7.1. Performance Metrics

Table 6 depicts the possible outcomes of a binary classifier; they are used to measure the performance of the proposed classifier, and table 7 shows the corresponding confusion matrix. The performance metrics used are illustrated in (13)-(15) [38]:

$Precision = P = \frac{\text{Pages Assigned Correctly}}{\text{Total Assigned Pages}} = \frac{A}{A+B}$    (13)

$Accuracy = Acc = \frac{\text{Correct Assignments}}{\text{Total Pages}} = \frac{A+C}{A+B+C+D}$    (14)

$Error = E = \frac{\text{Incorrect Assignments}}{\text{Total Pages}} = \frac{B+D}{A+B+C+D}$    (15)
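These three metrics follow directly from the outcome counts of table 6; a direct transcription of Eqs. (13)-(15):

```python
def metrics(A, B, C, D):
    """Precision (13), accuracy (14), and error (15) from the outcome counts."""
    precision = A / (A + B)
    accuracy = (A + C) / (A + B + C + D)
    error = (B + D) / (A + B + C + D)
    return precision, accuracy, error
```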

7.2. Implementing D2O

To speed up the searching, mapping, and conceptualization processes, D2O is implemented as a database in the form of three related tables, as illustrated in figure 10. The first table stores the representative terms used for the considered domain concepts, while the second table stores the synonym terms corresponding to each representative term. The last table stores the partners of each ambiguous keyword as well as the relations between the keyword and the other domain keywords, if any; this table is used mainly for keyword disambiguation. The "Partners" field in the Relation-Partner table is a flag that is set to "1" if the corresponding related keyword is a partner; otherwise, its value is set to "0".
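A possible realization of this schema, assuming SQLite as the storage engine (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect("d2o.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS rt_table (          -- representative terms (concepts)
    idc INTEGER PRIMARY KEY,
    rt  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS synonyms_table (    -- synonyms of each representative term
    idc INTEGER REFERENCES rt_table(idc),
    idk INTEGER,
    synonym TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS relation_partner (  -- relations and partner flags per keyword
    idk INTEGER,
    idr INTEGER,
    relation_weight REAL,
    partner INTEGER CHECK (partner IN (0, 1))  -- 1 if the related keyword is a partner
);
""")
conn.commit()
```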

7.3. Evaluating the Proposed Distiller

The performance of the proposed distiller (i.e., ONB) is affected by (i) the number of training pages (TPs), (ii) the kernel function, and (iii) the value of the soft margin parameter "C" used in SVM training. According to [39], the most appropriate kernel function for binary text classification is the "linear" kernel, especially for a large number of concepts. Based on the linear kernel function, throughout this section, the aforementioned performance metrics (i.e., P, Acc, and E) are measured for ONB as well as its competitors using different numbers of training pages. Considering figures (11)-(13), increasing the number of TPs promotes the performance of all classification techniques. The reason is that, as the number of TPs increases, the classifiers are better trained because they collect more domain knowledge; hence, performance can be promoted by training the classifier with more TPs. The best precision, accuracy, and error are obtained at the maximum number of training pages (i.e., TPs = 10000). As generally depicted in figures (11)-(13), ONB outperforms all other classification techniques. When TPs = 10000, ONB's precision, accuracy, and error are 0.86, 0.89, and 0.11, respectively. On the other hand, KNN has the worst performance because it is highly affected by the noise due to outliers that may exist in the training pages, which ONB successfully neglects by eliminating the support vectors. When TPs = 10000, the precision, accuracy, and error of KNN are 0.7, 0.75, and 0.25, respectively. DOKNN performs better than KNN; when TPs = 10000, its precision, accuracy, and error are 0.72, 0.79, and 0.21, respectively. Moreover, NB and DONB perform well as they are probabilistic classifiers; they both outperform SVM and KNN as they depend on the Bayes theorem. When TPs = 10000, NB's precision, accuracy, and error are 0.78, 0.83, and 0.17, respectively, while DONB's precision, accuracy, and error are 0.81, 0.86, and 0.14. On the other hand, SVM's precision, accuracy, and error are 0.74, 0.81, and 0.19, respectively. Considering figures (11)-(13) again, figure 11 shows the precision against TPs; generally, the precision of all classifiers increases gradually with the number of TPs, and ONB has the highest precision compared with the others. As illustrated in figure 12, ONB also introduces higher accuracy than the other classifiers; its accuracy improves gradually with the number of TPs until it reaches 0.89 at TPs = 10000. Finally, figure 13 depicts the error; ONB introduces a significant error reduction compared with its competitors. The reason is that ONB combines evidence from the NB and SVM classifiers.

7.4. Effects of the Proposed Distiller on the Crawling Performance

In general, focused crawlers can be evaluated by measuring their ability to retrieve "good" pages, where a "good" page is one that is highly relevant to the domain of interest. As depicted in figure 14, all web pages are represented by the "Web" set, denoted "W", while the set of "good" pages (i.e., those relevant to the domain of interest) is called the "Relevant" set and denoted "R". The "Crawled" set, denoted "C", represents the pages that the crawler has visited; hence, R, C ⊂ W. The target of the focused crawler is to maximize the set R∩C, which indicates a large number of good retrieved pages, while keeping the set X = C − (R∩C) as small as possible. An ideal focused crawler, which is unrealistic, has C ⊂ R and accordingly X = ∅. Figure 14(A) represents the ideal focused crawler, while figure 14(B) represents the traditional focused crawler. The target of the focused crawler is expressed in (16):

$Target(Focused\_Crawler) = [\max(C \cap R)] \text{ AND } [\min(C - (C \cap R))]$    (16)

For illustration, assume a focused crawler CRL whose domain of interest is "Education". The set W, which represents all available pages in all domains, is assumed to contain 1000 pages, of which 100 pages are related to the "Education" domain and represent the set R. In order to minimize the cost in terms of time and storage penalties, CRL tries to visit only the relevant pages while discarding irrelevant ones. The set C, which contains the pages the crawler has already visited, is assumed to contain 200 pages, of which only 50 are related to "Education". Hence, |C∩R| = 50 and |C − (C∩R)| = 150; accordingly, CRL's precision is 50/200 = 0.25. If CRL crawls the same number of pages (i.e., 200 pages) but 70 of them are relevant to "Education", then |R∩C| = 70 and CRL's precision is 70/200 = 0.35; thus, one of CRL's targets is to maximize |C∩R|. On the other hand, if CRL can retrieve 50 pages related to "Education" (i.e., |C∩R| = 50) by visiting only 125 pages (i.e., |C| = 125, and hence |C − (C∩R)| = 75), then CRL's precision is 50/125 = 0.4. Hence, an effective focused crawler should maximize the number of crawled pages that are related to the domain (i.e., the set C∩R) while minimizing the retrieved irrelevant ones (i.e., the set C − (C∩R)).
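The arithmetic of this example can be checked with plain set operations (the page identifiers below are arbitrary stand-ins):

```python
R = set(range(100))                          # 100 relevant "Education" pages
C = set(range(50)) | set(range(100, 250))    # 200 crawled pages, 50 of them relevant
precision = len(C & R) / len(C)              # |C ∩ R| / |C| = 50 / 200
print(precision)                             # 0.25
```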

The Education domain was selected as the domain of interest. In this section, the focused crawling task is performed in two different scenarios: the first is the Traditional Focused Crawler (TFC) explained in section 2.2, while the second is the Effective Focused Crawler (EFC) illustrated in section 3. The crawling queue of both crawlers is filled with 10 seed pages. Although these seeds are highly related to the domain of interest (the Education domain), they also contain "bad" links (i.e., links pointing to irrelevant pages), such as advertisement and utility pages. Several link weighting strategies, illustrated in table 8, are used for implementing both crawlers (i.e., TFC and EFC). The tested crawlers are allowed to retrieve 10000 pages, and then the crawling harvest rate and average relevancy are calculated.

7.4.1. Measuring the Harvest Rate
The Harvest Rate (HR) is defined as the percentage of retrieved relevant pages over all pages retrieved during the crawl. Therefore, if 10 relevant pages are found in the first 100 crawled pages, a harvest rate of 10% at 100 pages is concluded. HR can be calculated using (19), where nr is the number of retrieved relevant pages and n is the total number of retrieved pages; the results are shown in table 9.

$HR = \frac{n_r}{n}$    (19)

Figures (15)-(18) show the harvest rate of the proposed EFC as well as TFC against the number of retrieved pages using several link weighting strategies, namely STS, Bay, STD, and TC, respectively. As illustrated in these figures, EFC outperforms TFC regardless of the used link weighting strategy, which proves the effectiveness of adding domain distillers to traditional focused crawlers.

7.4.2. Relevancy Score
In this experiment, the quality of the pages retrieved by the various crawlers (i.e., TFC and EFC) is evaluated by employing the different weighting strategies illustrated in table 8 (i.e., STS, Bay, STD, and TC). Generally, given N different crawlers that start from the same seeds (10 manually chosen seeds related to the "Education" domain) and run until each crawler retrieves P pages, N × P pages (from all crawlers) are collected. After removing similar pages retrieved by different crawlers, M pages remain; these are ranked by a human assessor into 5 different categories with corresponding scores according to table 10. Then, the crawler score can be calculated using (20):

$Score(Crawler_j) = \frac{\sum_{i=1}^{P} score(page_{ji})}{\sum_{j=1}^{N} \sum_{i=1}^{P} score(page_{ji})}$    (20)

Where pageji is the ith page retrieved by the jth crawler. This evaluation strategy was followed using P = 4000, 5000, …, 10000 and N = 2 (the number of crawling strategies, namely EFC and TFC). The results are illustrated in figures (19)-(22), which show that EFC outperforms TFC as it achieves the higher score, indicating the higher quality of the retrieved pages. This again confirms the effectiveness of adding domain distillers to traditional focused crawlers.
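A sketch of Eq. (20), where page_scores[j] holds the human-assigned scores (table 10) of the P pages retrieved by crawler j (the sample numbers are hypothetical):

```python
def crawler_scores(page_scores):
    """Normalized relevancy score of each crawler (Eq. 20)."""
    grand_total = sum(sum(scores) for scores in page_scores)
    return [sum(scores) / grand_total for scores in page_scores]

# e.g., two crawlers (EFC, TFC) scored on three pages each:
print(crawler_scores([[20, 15, 10], [10, 5, 0]]))   # [0.75, 0.25]
```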

8. Conclusion

Traditional focused crawlers rely on estimation: they estimate whether a page is related to the DOI or not before actually retrieving it. If the estimation decides that the page is good, the page is retrieved and passed to the indexer, and its links are injected into the crawling queue. However, if the estimation is not accurate, the crawler is skewed away from its target of retrieving pages related to a specific domain. The originality of this paper lies in introducing an effective modification of the behavior of focused crawlers by employing a domain distiller to decide whether a retrieved page is related to the crawler's DOI, and accordingly whether to index the page and add its links to the crawling queue. The proposed domain distiller combines the SVM and NB classifiers in a new instance called the Optimized Naïve Bayes (ONB) classifier. Initially, a genetic algorithm (GA) is used to optimize the soft margins of SVM; the optimized SVM is then employed to discard the outliers from the available training examples, and the pruned examples are used to train the traditional NB classifier. Moreover, ONB employs word sense disambiguation to identify the accurate sense of each domain keyword extracted from the input page. ONB has been tested against recent classification techniques, and the experimental results have proven its effectiveness, as it introduces the maximum classification accuracy. The results also indicate that the proposed distiller improves the performance of focused crawling in terms of crawling harvest rate.

9. References
[1] M. Razek, "Credible Mechanism for More Reliable Search Engine Results", International Journal of Information Technology and Computer Science, vol. 3, pp. 12-17, 2015.
[2] R. Shettar and R. Bhuptani, "A vertical search engine based on domain classifier", International Journal of Computer Science and Security, vol. 2, no. 4, pp. 18-27, 2007.
[3] A. Elyasir and K. Anbananthen, "Focused Web Crawler", International Conference on Information and Knowledge Management, vol. 45, 2012.
[4] A. Sun, E. Lim and W. Ng, "Web classification using support vector machine", Proceedings of the 4th International Workshop on Web Information and Data Management, New York, ACM Press, pp. 96-99, 2002.
[5] O. Kwon and J. Lee, "Web page classification based on k-nearest neighbor approach", Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, ACM Press, pp. 9-15, 2000.
[6] J. Orallo, "Extending Decision Trees for Web Categorization", Proceedings of the 2nd Annual Conference of the ICT for EU India Cross Cultural Dissemination, 2005.
[7] Z. Liu and Y. Zhang, "A Competitive Neural Network Approach to Web-Page Categorization", International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, vol. 9, no. 6, pp. 731-741, 2001.
[8] A. Saleh, A. El Desouky, and S. Ali, "Promoting the Performance of Vertical Recommendation Systems by Applying New Classification Techniques", Knowledge-Based Systems, vol. 75, pp. 192-223, 2015.
[9] R. Navigli, "Word Sense Disambiguation: a survey", ACM Computing Surveys, vol. 41, no. 2, 2009.
[10] H. Isahara and K. Kanzaki, "Advances in Natural Language Processing", Proceedings of the 8th International Conference on NLP, Springer, Japan, October 22-24, 2012.
[11] T. Udapure, R. Kale and R. Dharmik, "Study of Web Crawler and its Different Types", IOSR Journal of Computer Engineering (IOSR-JCE), vol. 16, no. 1, pp. 1-5, 2014.
[12] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, vol. 30, no. 7, pp. 107-117, 1998.
[13] M. Selvakumar and A. Vijaya, "Design and Development of a Domain Specific Focused Crawler Using Support Vector Learning Strategy", International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, no. 5, October 2014.
[14] M. Jamali, H. Sayyadi, B. Hariri, and H. Abolhassani, "A method for focused crawling using combination of link structure and content similarity", In Web Intelligence, IEEE Computer Society, pp. 753-756, 2006.
[15] S. Zheng, P. Dimitriev and C. L. Giles, "Graph based crawler seed selection", Proceedings of the 18th International Conference on World Wide Web (WWW), pp. 1089-1090, 2009.
[16] G. A. Miller, "WordNet: A lexical database for English", Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.
[17] Y. Wang and Z. Gong, "Hierarchical classification of web pages using support vector machine", Lecture Notes in Computer Science, Springer, vol. 5362, pp. 12-21, 2008.
[18] R. Chen and C. Hsieh, "Web page classification based on a support vector machine using a weighted vote schema", Expert Systems with Applications, vol. 31, no. 2, pp. 427-435, August 2006.
[19] W. Liu, G. Xue, Y. Yu and H. Zeng, "Importance-based web page classification using cost-sensitive SVM", Proceedings of the International Conference on Web-Age Information Management, pp. 127-137, 2005.
[20] E. Youn and M. Jeong, "Class dependent feature scaling method using naive Bayes classifier for text data mining", Pattern Recognition Letters, Elsevier, vol. 30, pp. 477-485, 2009.
[21] W. Zhang and F. Gao, "An Improvement to Naive Bayes for Text Classification", Procedia Engineering, vol. 15, pp. 2160-2164, 2011.
[22] Z. Mei, Q. Shen and B. Ye, "Hybridized KNN and SVM for gene expression data classification", Life Science Journal, vol. 6, no. 1, pp. 61-66, 2009.
[23] T. Li, S. Zhu, and M. Ogihara, "Text categorization via generalized discriminant analysis", Information Processing and Management, vol. 44, no. 5, pp. 1684-1697, 2008.
[24] F. Li and Y. Yang, "A loss function analysis for classification methods in text categorization", in ICML 2003, pp. 472-479, 2003.
[25] D. Miao, Q. Duan, H. Zhang and N. Jiao, "Rough set based hybrid algorithm for text classification", Expert Systems with Applications, vol. 36, no. 5, pp. 9168-9174, 2009.
[26] G. Ifrim, "A Bayesian Learning Approach to Concept-Based Document Classification", M.Sc. Thesis, Computer Science Dept., Saarland University, Saarbrücken, Germany, February 2005.
[27] R. Vinoth, A. Jayachandran, M. Balaji and R. Srinivasan, "A Hybrid Text Classification Approach Using KNN and SVM", International Journal of Advance Foundation and Research in Computer (IJAFRC), vol. 1, no. 3, pp. 2348-4853, March 2014.
[28] N. Tripathi, M. Oakes and S. Wermter, "Hybrid classifiers based on semantic data subspaces for two-level text categorization", International Journal of Hybrid Intelligent Systems, vol. 10, pp. 33-41, 2013.
[29] A. Seyfi, A. Patel, and J. C. Júnior, "Empirical evaluation of the link and content-based focused Treasure-Crawler", Computer Standards & Interfaces, vol. 44, pp. 54-62, 2016.
[30] S. Yang, "OntoCrawler: A focused crawler with ontology-supported website models for information agents", Expert Systems with Applications, vol. 37, pp. 5381-5389, 2010.
[31] G. Forman, "An extensive empirical study of feature selection metrics for text classification", Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[32] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using web search engines", Proceedings of the International Conference on World Wide Web, pp. 757-766, May 2007.
[33] V. Cherkassky and M. Yunqian, "Practical selection of SVM parameters and noise estimation for SVM regression", Neural Networks, vol. 17, no. 1, pp. 113-126, 2004.
[34] T. Pencheva, K. Atanassov, and A. Shannon, "Modelling of a Roulette Wheel Selection Operator in Genetic Algorithms Using Generalized Nets", BIOAUTOMATION, vol. 13, no. 4, pp. 257-264, 2009.
[35] F. Alabsi and R. Naoum, "Comparison of Selection Methods and Crossover Operations using Steady State Genetic Based Intrusion Detection System", Journal of Emerging Trends in Computing and Information Sciences, vol. 3, no. 7, July 2012.
[36] H. Ali, A. El Desouky, and A. Saleh, "Studying and Analysis of a Vertical Web Page Classifier Based on Continuous Learning Naïve Bayes (CLNB) Algorithm", IGI Global, pp. 210-245, 2009.
[37] R. Meusel, P. Petrovski, and C. Bizer, "The Web Data Commons Microdata, RDFa and Microformat Dataset Series", Proceedings of the 13th International Semantic Web Conference (ISWC 2014), Italy, Springer Berlin Heidelberg, pp. 277-292, 2014.
[38] Y. Lin, J. Jiang and S. Lee, "A similarity measure for text classification and clustering", IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575-1590, 2014.
[39] C. Hsu, C. Chang and C. Lin, "A practical guide to support vector classification", Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003.
[40] J. Zhao, M. Lan, and J. Tian, "ECNU: Using Traditional Similarity Measurements and Word Embedding for Semantic Textual Similarity Estimation", Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 117-122, June 2015.
[41] A. Seyfi, A. Patel, and J. Júnior, "Empirical evaluation of the link and content based focused Treasure Crawler", Computer Standards & Interfaces, vol. 44, pp. 54-62, February 2016.


Figure 1. The Structure of a Typical Focused Crawler (TFC)

Figure 2. The Structure of the Effective Focused Crawler (EFC)

Figure 3. D2O Generation in the Illustrative Example

Figure 4. Identifying Ambiguous Keywords

Figure 5. The Used Identification Function

Figure 6. An illustrative example showing how the ambiguation value (AV) of a keyword is calculated using the fuzzy inference system (Step 1: finding the fuzzy set membership values of DR, KP, and IC; Step 2: calculating the fuzzy output of each rule; Step 3: defuzzification using the center-of-gravity method, which yields AV = 6.2, so the keyword is ambiguous)

Figure 7. Identifying the partner list of an ambiguous keyword, assuming λ = 2

Figure 8. The Proposed Distiller Framework

Figure 9. The region around the SVM hyperplane

Figure 10. Data structure used to implement D2O

Figure 11. Precision of all used techniques

Figure 12. Accuracy of all used techniques

Figure 13. Error of all used techniques

Figure 14. Crawling behavior of the ideal focused crawler (A) and the traditional focused crawler (B). W: the Web set; R: the Relevant set; C: the Crawled set.


Figure 15. Harvest rate of TFC and EFC against retrieved pages using STS as the link weighting strategy

Figure 16. Harvest rate of TFC and EFC against retrieved pages using Bay as the link weighting strategy

Figure 17. Harvest rate of TFC and EFC against retrieved pages using STD as the link weighting strategy

Figure 18. Harvest rate of TFC and EFC against retrieved pages using TC as the link weighting strategy

Figure 19. Relevancy score against retrieved pages of TFC and EFC using STS as the link weighting strategy

Figure 20. Relevancy score against retrieved pages of TFC and EFC using Bay as the link weighting strategy


Figure 21. Relevancy score against retrieved pages of TFC and EFC using STD as the link weighting strategy

Figure 22. Relevancy score against retrieved pages of TFC and EFC using TC as the link weighting strategy


Table 1. Types of word sense ambiguity

Type of ambiguity | Definition | Example
Homographs | Words with the same spelling and the same pronunciation but either the same or a different meaning | minute (extremely small; measure of time)
Homonyms | Words with the same spelling and the same pronunciation but a different meaning | rose (flower; past tense of the verb "rise")
Heteronyms | Words with the same spelling but a different pronunciation and a different meaning | dove (bird; past tense of the verb "dive")

Table 2. The used fuzzy rules

ID | DR | KP | IC | Rule output
1 | L | L | L | L
2 | L | L | H | L
3 | L | H | L | L
4 | L | H | H | H
5 | H | L | L | L
6 | H | L | H | H
7 | H | H | L | H
8 | H | H | H | H

Table 3. The first generation: population, evaluation, encoding, selection, and crossover

ID | Chromosome | Binary Encoding | Fitness (Ft) | Probability (P) | Cumulative Sum (SumP) | Random Number (Rn) | New ID | New Population | Random Number (Rn) | Population after Crossover
1 | 0.5 | 00.1 | 0.66 | 0.18 | 0.18 | 0.84 | 5 | 2.5 | 0.28 | 0.5
2 | 1 | 01.0 | 0.71 | 0.19 | 0.37 | 0.02 | 1 | 0.5 | 0.55 | 0.5
3 | 1.5 | 01.1 | 0.69 | 0.18 | 0.55 | 0.75 | 5 | 2.5 | 0.96 | 2.5
4 | 2 | 10.0 | 0.7 | 0.19 | 0.74 | 0.42 | 3 | 1.5 | 0.97 | 1.5
5 | 2.5 | 10.1 | 0.5 | 0.13 | 0.87 | 0.35 | 2 | 1 | 0.16 | 3.5
6 | 3 | 11.0 | 0.5 | 0.13 | 1.0 | 0.88 | 6 | 3 | 0.95 | 3
Total | | | 3.76 | 1.0 | | | | | |

Table 4. The second generation: population, evaluation, encoding, selection, and crossover

ID | Chromosome | Binary Encoding | Fitness (Ft) | Probability (P) | Cumulative Sum (SumP) | Random Number (Rn) | New ID | New Population | Random Number (Rn) | Population after Crossover
1 | 0.5 | 00.1 | 0.66 | 0.18 | 0.18 | 97.0 | 5 | 3.5 | 360. | 3.5
2 | 0.5 | 00.1 | 0.66 | 0.18 | 0.36 | 6970 | 6 | 3 | 440. | 2
3 | 2.5 | 10.1 | 0.5 | 0.13 | 0.49 | 6970 | 4 | 1.5 | 69.0 | 1.5
4 | 1.5 | 01.1 | 0.69 | 0.19 | 0.68 | 4979 | 1 | 0.5 | 50.0 | 2.5
5 | 3.5 | 11.1 | 0.71 | 0.19 | 0.87 | 5970 | 5 | 3.5 | 8597 | 3.5
6 | 3 | 11.0 | 0.5 | 0.13 | 1.0 | 3397 | 2 | 0.5 | 9.00 | 0.5
Total | | | 3.72 | 1.0 | | | | | |

Table 5. The tunable parameters used through the experiments

Parameter | Assigned value | Description
n | 485 | Number of domain concepts (popular terms only)
t | 1305 | Number of domain keywords (popular and unpopular terms)
CL | 2 | Number of classes: cl1 is related to the DOI, cl2 otherwise
Training web pages | 10000 | From the Web Data Commons dataset
Testing web pages | 500 | From the Web Data Commons dataset
Kernel function | Linear | K(xi, xj) = xi · xj
C | From 0.5 to 5 | Soft margin parameter
x | 100 | Number of chromosomes in the population
Pc | 0.8 | Crossover probability
No. of generations | 50 | Number of generations (iterations)
λ | 2 | Number of hops separating keyword k from the farthest partners in its partner list PL(k)
θ | 0.2 | Threshold relation strength between the keyword k and its partners
η | 100 | Number of retrieved pages that contain keyword k from Google
χ | 50 | Length of the wing of a snippet in page P that contains keyword k

Table 6. All possible outcomes of a binary classifier

Classifier outcome | Description
A | Number of documents that are assigned correctly
B | Number of documents that are assigned incorrectly
C | Number of documents that are rejected correctly
D | Number of documents that are rejected incorrectly

Table 7. Confusion matrix

Actual class | Predicted Positive | Predicted Negative
Positive | A | D
Negative | B | C

Table 8. The link weighting strategies used for implementing both crawlers (i.e., TFC and EFC)

Similarity To Seeds (STS): According to this strategy, the score of the currently processed page is the sum of its similarities to the considered seeds. The input page as well as the seeds are expressed in a vector space model of 485 dimensions (the number of concepts in the "Education" domain), {a1, a2, …, a485}. In this representation, if the concept Ci exists in the seed, the corresponding value ai is set to "1" in the vector; otherwise, it is "0". Let Doc be the input page; its score is the sum of its similarity scores to all seeds, calculated by (17) (assuming z training examples {e1, e2, …, ez}). Finally, the page's score is assigned to all links extracted from it.

$Score(Doc) = \sum_{i=1}^{z} sim(Doc, e_i)$    (17)

The used similarity measure is the cosine similarity [40], which can be calculated using (18):

$Sim_{cos}(Doc, e) = \frac{\sum_{k \in (Doc \cup e)} a_k(Doc) \cdot a_k(e)}{\sqrt{\sum_{k} (a_k(Doc))^2} \cdot \sqrt{\sum_{k} (a_k(e))^2}}$    (18)

Bayesian (Bay): According to this weighting technique, the crawler uses a Bayesian similarity score to calculate the relevancy of the processed page according to the Bayes theorem, using the seeds as training examples. In the training phase, the texts of all seeds are combined; then the domain keywords are extracted and mapped to the corresponding concepts, and the probability of each domain concept is calculated. To assign a Bayesian score to an input page, the page's domain keywords are extracted and mapped to the corresponding domain concepts, and the page's relevancy to the domain (i.e., its domain membership) is calculated. This domain relevance score is assigned (as a weighting score) to all links extracted from the page.

Similarity To Domain (STD): In STD, Term Frequency (TF) is used to estimate the similarity between the currently processed page and the domain of interest; the more domain keywords are found within the processed page, the higher the page-to-domain relevancy. The calculated page-to-domain relevancy is then assigned to all links extracted from the page.

Treasure Crawler (TC): In TC [41], specific HTML elements are extracted from the input page. These elements are given to a domain relevancy calculator, which is supplied with the Dewey Decimal Classification (DDC) system; the set of DDC entries specifies the crawler's topic. If the relevancy calculator decides that an unvisited link is on-topic, its HTML elements are compared to the T-Graph nodes and a priority score is assigned. If an unvisited link is off-topic, it receives the lowest priority score and is ignored by the crawler. On-topic URLs with their priority scores are then injected into the fetcher queue for future downloads.

Table 9. Harvest Rate (HR) and relevancy score using different link weighting strategies for both crawlers (i.e., TFC and EFC)

Strategy | Visited pages | TFC Relevant | TFC Irrelevant | TFC Harvest Rate | TFC Relevancy Score | EFC Relevant | EFC Irrelevant | EFC Harvest Rate | EFC Relevancy Score
STS | 4000 | 2204 | 1796 | 0.551 | 0.469263 | 2245 | 1755 | 0.56125 | 0.530737
STS | 5000 | 2240 | 2760 | 0.448 | 0.362875 | 3320 | 1680 | 0.664 | 0.637125
STS | 6000 | 2264 | 3736 | 0.377333333 | 0.362202 | 3376 | 2624 | 0.562666667 | 0.637798
STS | 7000 | 2273 | 4727 | 0.324714286 | 0.291012 | 4478 | 2522 | 0.639714286 | 0.708988
STS | 8000 | 3360 | 4640 | 0.42 | 0.348836 | 5510 | 2490 | 0.68875 | 0.651164
STS | 9000 | 3394 | 5606 | 0.377111111 | 0.334054 | 5567 | 3433 | 0.618555556 | 0.665946
STS | 10000 | 4430 | 5570 | 0.443 | 0.36998 | 6613 | 3387 | 0.6613 | 0.63002
Bay | 4000 | 2279 | 1721 | 0.56975 | 0.363967 | 3302 | 698 | 0.8255 | 0.636033
Bay | 5000 | 3336 | 1664 | 0.6672 | 0.443952 | 3378 | 1622 | 0.6756 | 0.556048
Bay | 6000 | 3398 | 2602 | 0.566333333 | 0.389742 | 4456 | 1544 | 0.742666667 | 0.610258
Bay | 7000 | 4429 | 2571 | 0.632714286 | 0.422513 | 5534 | 1466 | 0.790571429 | 0.577487
Bay | 8000 | 5503 | 2497 | 0.687875 | 0.403021 | 6623 | 1377 | 0.827875 | 0.596979
Bay | 9000 | 5569 | 3431 | 0.618777778 | 0.411013 | 7723 | 1277 | 0.858111111 | 0.588987
Bay | 10000 | 6613 | 3387 | 0.6613 | 0.439517 | 8802 | 1198 | 0.8802 | 0.560483
STD | 4000 | 3309 | 691 | 0.82725 | 0.482959 | 3342 | 658 | 0.8355 | 0.517041
STD | 5000 | 3369 | 1631 | 0.6738 | 0.410157 | 4421 | 579 | 0.8842 | 0.589843
STD | 6000 | 4453 | 1547 | 0.742166667 | 0.426159 | 5512 | 488 | 0.918666667 | 0.573841
STD | 7000 | 5521 | 1479 | 0.788714286 | 0.405317 | 6623 | 377 | 0.946142857 | 0.594683
STD | 8000 | 5587 | 2413 | 0.698375 | 0.402777 | 7723 | 277 | 0.965375 | 0.597223
STD | 9000 | 6614 | 2386 | 0.734888889 | 0.411544 | 8807 | 193 | 0.978555556 | 0.588456
STD | 10000 | 6653 | 3347 | 0.6653 | 0.37163 | 9901 | 99 | 0.9901 | 0.62837
TC | 4000 | 3699 | 301 | 0.92475 | 0.483875 | 3948 | 52 | 0.987 | 0.516125
TC | 5000 | 4060 | 940 | 0.812 | 0.450434 | 4932 | 68 | 0.9864 | 0.549566
TC | 6000 | 4997 | 1003 | 0.832833 | 0.455419 | 5902 | 98 | 0.983667 | 0.544581
TC | 7000 | 5845 | 1155 | 0.835 | 0.465589 | 6822 | 178 | 0.974571 | 0.534411
TC | 8000 | 6233 | 1767 | 0.779125 | 0.442433 | 7855 | 145 | 0.981875 | 0.557567
TC | 9000 | 7188 | 1812 | 0.798667 | 0.449597 | 8853 | 147 | 0.983667 | 0.550403
TC | 10000 | 8864 | 1136 | 0.8864 | 0.474608 | 9930 | 70 | 0.993 | 0.525392

Table 10. The different categories and the corresponding scores

Category ID | Description | Score
1 | Very relevant | 20
2 | Relevant | 15
3 | Medium | 10
4 | Related | 5
5 | Irrelevant | 0