Future Generation Computer Systems (2015). DOI: 10.1016/j.future.2015.08.003
Received 3 March 2015; revised 26 July 2015; accepted 6 August 2015.

A Novel Framework for Semantic Entity Identification and Relationship Integration in Large Scale Text Data*

Dingxian Wang¹, Xiao Liu¹, Hangzai Luo², Jianping Fan²

¹ East China Normal University, Shanghai, China
² Northwest University of China, Xi'an, China

[email protected], [email protected], [email protected], [email protected]

Abstract. Semantic entities carry the most important semantics of text data. Therefore, the identification of semantic entities and the integration of their relationships are very important for applications requiring the semantics of text data. However, current strategies still face many problems, such as semantic entity identification, new word identification and relationship integration among semantic entities. To address these problems, a two-phase framework for semantic entity identification with relationship integration in large scale text data is proposed in this paper. In the first phase, semantic entity identification, we propose a novel strategy to extract unknown text semantic entities by integrating statistical features, Decision Tree (DT), and Support Vector Machine (SVM) algorithms. Compared with traditional approaches, our strategy is more effective in detecting semantic entities and more sensitive to new entities that have just appeared in fresh data. After extracting the semantic entities, the second phase of our framework integrates Semantic Entity Relationships (SER), which helps to cluster the semantic entities. A novel classification method using features such as similarity measures and co-occurrence probabilities is applied to tackle the clustering problem and discover the relationships among semantic entities. Comprehensive experimental results show that our framework can beat state-of-the-art strategies in semantic entity identification and discover over 80% of the relationship pairs among related semantic entities in large scale text data.

Keywords: Semantic Entity Identification, New Word Identification, Decision Tree, SVM, Semantic Entity Relationships

* The initial work was published in The 14th International Conference on Web Information Systems Engineering (WISE 2013), pp. 354-367, Nanjing, China, October 2013.

1 Introduction

In most text applications, it is very important to understand the semantics of the input multimedia data. In most semantic models [1] of multimedia data, the various semantic entities referring to real-world semantics are essential to the model, because the semantics of multimedia data can generally be modeled as entities and their relationships [2]. As a result, the identification of semantic entities and semantic entity relationship integration [3, 4] are the fundamental basis for understanding the semantics of multimedia data.

To address the problem of both semantic entity identification and relationship integration, an algorithm with a high recall rate should first be applied to find the semantic entities. Researchers have proposed different algorithms for different types of multimedia data. For example, several named entity extraction algorithms [5-7] have been proposed as a first phase to detect special semantic entities in text data, e.g. person names, location names and organization names. However, most text applications need not only named entities but also other, more general semantic entities (highlighted in the following examples).

English Examples:
Tel Aviv will continue to abide by its peace treaty with Egypt despite the attack on its embassy in Cairo.
The search for extraterrestrial life has taken another step forward - even if we are unlikely to find life as we know it any time soon, if at all. A team led by Swiss astronomers has recently discovered more than 50 exoplanets - planets orbiting stars outside the solar system.
He won three of the four Grand Slam titles this year -- at the Australian Open, Wimbledon and US Open -- and is talking about adding to his collection.
LAKE ARROWHEAD, Calif. (AP) - An 8-year-old boy with severe autism was found Tuesday after being lost and alone for more than 24 hours in the San Bernardino Mountains.

Chinese Examples (English glosses of the highlighted semantic entities): Lang XianPing (person name); Chinese manufacturing industry; National post-disaster reconstruction funds; multifunctional school buildings; multifunctional dining halls; U.S. aircraft carrier; Yellow Sea military exercises; Zhang HengChu (person name); 91 years old (age); Bertelsmann book club; Chinese market.

From the above samples, we can find that more than half of the semantic entities are not named entities such as person names, location names and organization names; yet they carry very important semantics of the text. To be more specific, semantic entities extend named entities by adding mixed verb and noun phrases. For example, the semantic entity "The search for extraterrestrial life" is very different from the named entity "extraterrestrial life" and carries much more important information than the simple named entity. As a result, text applications will miss important semantics if they only use named entity detection algorithms to detect the entities in the text data. Moreover, it is very hard to find the relationships among semantic entities when so much important semantic information is missing. In theory, the semantic entity detection problem can be treated as a generalized named entity detection problem and addressed with similar algorithms such as Hidden Markov Models (HMM) [8] or Conditional Random Fields (CRF) [9, 10]. However, as the diversity of semantic entities is much higher than that of named entities, the detection accuracy can be low.

After extracting semantic entities from large scale text data, the relationships among semantic entities should be integrated to help cluster the semantic entities into classes. It is very hard to cluster semantic entities and define the boundaries between different classes. Several existing techniques, such as the well-known centroid-based clustering methods including the representative k-means clustering [11], hierarchical clustering [12], distribution-based clustering [13] and density-based clustering [14], are most suitable for datasets with relatively fixed distance-based features. However, our framework is designed to be more general and to find related semantic entities which may have different features. Therefore, a method which can cluster semantic entities with many different features should be employed.

In this paper, to address the above issues, we propose a novel framework to detect semantic entities and their relationships. There are two phases in our framework: 1) Semantic Entities Extraction and 2) Semantic Relationship Integration. The purpose of our framework is to first employ an effective approach to detect semantic entities from large scale text data, and then use a regional connectivity scanning process to find the relationships between these semantic entities. The main contributions of this paper include: 1) the inner, outer and novelty statistical features are combined in our strategy and are proved to be useful for discovering semantic entities; 2) a novel two-step DT-SVM classification algorithm is proposed and proved to be more effective in dealing with imbalanced datasets compared with state-of-the-art approaches; 3) an effective clustering method based on semantic entity pairs is proposed to detect the relationships among semantic entities.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 presents our framework in detail. Section 4 demonstrates the experimental results and analyzes the performance of our strategies. Section 5 presents a case study. Finally, we conclude the paper and point out future work in Section 6.

2 Related Work

Statistics-based machine learning methods such as Hidden Markov Models (HMM), Decision Trees (DT) and maximum entropy models have been widely used in research on English Named Entity Recognition (NER) as well as New Word Identification (NWI) [8, 15, 16]. The experimental results of these methods are quite good on datasets with a consistent, relatively fixed format when finding common named entities such as person names, location names and organization names. However, they may suffer performance degradation on datasets with high diversity, such as web page data.

As Chinese NER is more difficult than English NER, more advanced algorithms are needed to achieve comparable performance. Due to unique syntax and grammar usage, it is often very hard for Chinese NER to achieve results as satisfactory as English NER. Gao [17] uses statistical filtering as an important phase to identify real Chinese named entities. Wu [6] combines a statistical model with a back-off model as well as a Chinese thesaurus to help find Chinese named entities. Takeuchi [18] investigates the identification and classification of technical terms in the molecular biology domain by using a combined HMM bigram model, but due to the complexity of the dataset its performance needs to be improved. Other methods such as CRF, class-based language models (LM), pattern-based, rule-based as well as hybrid methods have also been employed for Chinese NER. Chen [19] presents a Chinese NER system which incorporates basic features and additional features based on CRF and obtains satisfactory results on the MSRA datasets. Bai [20] creates a system for tokenization and named entity recognition of ideographic languages.

Research on Chinese NWI is also one of the most critical issues in Chinese NLP, and it is closely related to Chinese NER and Chinese word segmentation research. Sproat and Emerson [21] found that inefficient new word detection causes over 60% of word segmentation errors. Since then, many innovative algorithms based on statistical information, class-based LMs, user behavior and collaborative methods have been brought forward to improve the accuracy of Chinese NWI. Wu [22] presents a mechanism for new word identification in Chinese text where probabilities are used to filter candidate character strings and assign part-of-speech (POS) tags to the selected strings in a rule-based system. Li [23] uses a statistical learning approach based on an SVM classifier employing features such as the in-word probability of a character, the analogy between new words and lexicon words, the anti-word list and document frequencies to achieve state-of-the-art performance; however, it is very time consuming given the complexity of the features. Fu [24] proposes a modified class-based LM approach by turning the problem into a classification problem using part-of-speech information to classify each unknown word. Zheng [25] adds collaborative filtering to incorporate user behaviors into their new word detection system. Chien [26] implements a keyword extraction system by extracting significant lexical patterns from related documents and constructing a PAT tree which indexes the full text of documents, so as to efficiently retrieve and update all possible character strings, including their frequency counts. However, these algorithms focus on Chinese text data only, and their performance on other languages is unclear. In general, many of the works mentioned above use features with high computational complexity and are thus not suitable for large scale datasets.

The second part of the work discussed here is a very difficult task in the current semantic NLP research area, as it covers the techniques of both Semantic Entity Recognition (SER) and Semantic Entity Relationship Integration (SERE). Generally speaking, the approaches for SER and SERE are mainly of three types, namely supervised, semi-supervised and bootstrapping approaches, plus some other approaches extracting higher-order relationships. The supervised approaches can be further divided into feature-based methods and kernel methods. Kambhatla [27] used lexical, syntactic, and semantic features to train a log-linear model to handle the task of entity classification, whereas Zhao [28] and GuoDong [29] used SVMs to train these features in order to classify different types of entity relationships. These feature-based methods need heuristic choices on a trial-and-error basis to select useful features, and hence they are very time consuming. On the other hand, Lodhi et al. [30] applied string kernels to relationship detection, which led to the popularity of kernel methods. Bunescu [31] improved these methods using subsequence kernels in conjunction with SVMs, improving both precision and recall, and Culotta [32] used dependency tree kernels that produce a richer structured representation, which also leads to significant performance gains. However, the supervised approaches are difficult to extend to new entity relationships, their pre-processing stage is error prone, and the training data is limited and burdensome to produce.

The algorithms used by Yarowsky [33] and Blum et al. [34] are considered the prototypes of semi-supervised relationship integration methods. Their main idea is to use the output of weak learners as training data for further processing. After that, Agichtein et al. [35] made one of the earliest attempts to automatically extract semantic relationships between entities in text using semi-supervised and bootstrapping techniques. Their Snowball system can efficiently generate patterns and extract relationship tables from document collections with only a small amount of user-labeled training data. However, the system depends significantly on a large set of domain extraction patterns. KnowItAll [36] tackles this disadvantage: as a large scale relationship-specific Web IE system, it can learn a set of relationship-specific extraction patterns from a few domain-independent extraction patterns and frequency information. Snowball and KnowItAll, as relationship-specific systems, share the shortcoming that the relationships to be extracted must first be labeled by humans. TextRunner [37], a novel self-supervised algorithm, can learn relationships, classes and entities from the text in its document collections without specifying the demanded relationships, thus overcoming this limitation. Most recently, Chang Wang et al. [38] proposed a novel way to address the SERE problem which leverages the knowledge learned from previously trained relationship detectors; each relationship is considered to follow a multinomial distribution over the existing relationships. Meanwhile, Chang Wang et al. [39] improved their algorithm and employed it for relationship integration as well as scoring in DeepQA. Mohit Bansal and Dan Klein [40] applied a diffuse but robust way to capture a range of semantic entities as well as their relationships. In their system, Web n-gram features are introduced so as to obtain significant gains in the accuracy of identifying named entities across multiple datasets. However, the features they use require detailed POS (part-of-speech) tags for each word, which are very difficult to acquire for Chinese datasets. In contrast, our work needs only a little POS information, so our strategy can address the SER and SERE problems on both English and Chinese datasets. Moreover, many other clustering methods [41-44] have also been proposed in different fields.

3 A Novel Two-Phase Framework

3.1 Framework Overview

The proposed framework is shown in Figure 1. There are two main phases in our framework: 1) Semantic Entities Extraction and 2) Semantic Relationship Integration. Specifically, the procedures of our framework include data preprocessing while storing data into the file system (Step 1), calculating statistical features on the server (Step 2), training the DT-SVM model for semantic entity extraction (Step 3), generating semantic entity pairs and calculating statistical features from the semantic entities (Step 4), training an SVM model to decide whether a pair of semantic entities is related (Step 5) and then applying the model to cluster the semantic entities into classes (Step 6). Steps 1-4 belong to the Semantic Entities Extraction phase and Steps 5-6 to the Semantic Relationship Integration phase. In the following, a detailed explanation of the two phases is presented. As shown in Figure 2, the basic idea of Semantic Entities Extraction is to first extract statistical features for each potential semantic entity text string from the data of interest and then feed them to a classifier to determine whether the string is a semantic entity or not. However, to achieve acceptable performance, two problems need to be solved.

Fig. 1. Strategy Framework

First, the statistical features used in the strategy must be carefully selected, since the classifier can only achieve good accuracy with representative features. In addition, the features must be extracted efficiently; otherwise, the strategy will run very slowly because the number of potential text strings can be extremely large. To resolve this problem, we propose a set of fast statistical features. Furthermore, we propose a set of novelty features which are sensitive to new entities occurring in fresh data, so that entities never seen in the training data can be detected more accurately.

Fig. 2. Semantic Entities Extraction Process

Second, the dataset for the classifier is highly imbalanced: only around 1% to 5% of all potential text strings are semantic entities. Therefore, most existing classification algorithms cannot achieve satisfactory accuracy. To tackle this problem, we propose the DT-SVM algorithm that integrates DT (Decision Tree) and SVM (Support Vector Machine). The proposed algorithm is designed to handle extremely imbalanced data.

Fig. 3. Relationship Integration Process

As shown in Figure 3, relationship integration among semantic entities is also proposed in this paper. The main idea of this phase is to make use of the features and semantic entities extracted through the semantic entity identification process so as to effectively define the relationships among semantic entities. In the relationship integration process, semantic entity pairs are first constructed. Then, based on the semantic entities and statistical features computed during the Semantic Entities Extraction process, statistical features, similarity features and co-occurrence features are calculated. After that, an SVM algorithm is applied to train the relationship extraction model, which determines whether the two semantic entities of a semantic entity pair are related or not. Finally, a regional connectivity scanning process is executed to cluster the semantic entities into different classes. Technical details are presented in the following sections.

3.2 Feature Extraction

The semantic entity detection task is executed as a scan procedure: the detector scans the input text sequentially with a window and outputs true when the substring in the scan window is a semantic entity, and false otherwise. In this paper, the string in the scan window is denoted $S = w_1 w_2 \dots w_n$, where $w_i$ is a segmented word and $n$ is the size of the scan window, and $S$ lies in a sentence $c_{-m} \dots c_{-1}\,S\,c_1 \dots c_m$, where each $c_j$ is a context word. The semantic entity detection algorithm then extracts features from $S$ and feeds them to the classifier.
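A minimal sketch of this scan procedure follows (our own illustration, not the authors' code): every window of up to $n$ consecutive segmented words becomes a candidate string to be classified.

```python
# Hypothetical helper: enumerate every candidate string S = w_i ... w_{i+k}
# (k < max_n) in a segmented sentence, as the scan window slides forward.
def candidate_windows(sentence_words, max_n=5):
    for i in range(len(sentence_words)):
        for j in range(i + 1, min(i + max_n, len(sentence_words)) + 1):
            yield tuple(sentence_words[i:j])

# Each candidate is later turned into a feature vector and fed to the classifier.
for s in candidate_windows(["the", "search", "for", "extraterrestrial", "life"]):
    pass  # extract features for s and classify
```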

To achieve accurate semantic entity detection, there must be a set of features that carry abundant information about semantic entities and can be extracted quickly from large volumes of data. To decide whether $S$ is a semantic entity or not, several types of features must be extracted from $S$ and fed to the classifier. As discussed above, the feature extraction must be of low complexity, since there is an extremely large number of different candidate strings: there are at least one thousand words in each article, at least one thousand articles need to be handled, and since the size of $S$ ranges from 1 to $n$, millions of candidates may exist. We therefore propose the following features, which can be obtained quickly while still carrying abundant information about semantic entities.

First, the words or phrases composing a semantic entity must co-occur frequently rather than randomly. Therefore, any statistical quantity measuring correlation, closeness, or similar properties among the words and phrases of a semantic entity can be helpful for our target. Since these features are extracted from the internal components of the semantic entity, they are called "inner statistical features" in this paper.

Second, the context words $c_{-m}, \dots, c_m$, and especially $c_{-1}$ and $c_1$, may carry important information regarding the boundary of a semantic entity. For example, $c_{-1}$ has a high chance of being an article if $S$ is a semantic entity. As a result, extracting statistical quantities from the context words as features may improve the accuracy of semantic entity detection. Since they are computed from words outside $S$, they are called "outer statistical features" in this paper.

Third, some novel semantic entities may appear more frequently in new data, while some old semantic entities gradually disappear. For most applications, those novel semantic entities are more important than general entities. However, they are more difficult to detect because they never occur in the training data. To resolve this problem, we propose a novelty feature to measure the novelty of a semantic entity.

In the following subsections, we introduce the three types of features in detail. To simplify notation, $p(w)$ and $p_{doc}(w)$ are used to represent the probability that word $w$ occurs at any position and in any document, respectively. They can be approximated as:

$p(w) \approx \frac{tf(w)}{N_w}, \qquad p_{doc}(w) \approx \frac{|D(w)|}{N_{doc}}$

In these formulas, $tf(w)$ is the number of times $w$ appears in the whole dataset, $N_w$ is the total number of words in the dataset, $N_{doc}$ is the number of documents in the dataset and $|D(w)|$ is the number of documents carrying $w$.
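As a hedged sketch of these estimates (variable names are our own, not the paper's), both tables can be filled in one pass over the corpus:

```python
from collections import Counter

# corpus: a list of documents, each a list of segmented words.
def term_probabilities(corpus):
    tf = Counter(w for doc in corpus for w in doc)        # tf(w) over the whole dataset
    total_words = sum(tf.values())                        # N_w
    df = Counter(w for doc in corpus for w in set(doc))   # |D(w)|: documents carrying w
    n_docs = len(corpus)                                  # N_doc
    p = {w: c / total_words for w, c in tf.items()}       # p(w)
    p_doc = {w: c / n_docs for w, c in df.items()}        # p_doc(w)
    return p, p_doc
```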

3.2.1 Inner Statistical Features

The information content $IC(S)$, mutual information $MI(S)$ [45], correlation (dependence) $DEP(S)$, TF-IDF [46] $ntfidf(S)$, cosine index $cos(S)$, E index $E(S)$ and dice index $dice(S)$ [47] of $S$ are computed as the most important inner statistical features.

The information content $IC(S)$ is computed as the entropy of $S$:

$IC(S) = -\sum_{i=1}^{n} p(w_i) \log p(w_i)$

In this formula, $n$ is the number of words $S$ has. $IC(S)$ reflects how much information content $S$ carries and the importance of $S$ to the current news. Unlike mutual information and correlation, the information content is calculated to determine the information that the whole semantic entity includes and the degree of confusion it reflects.

The mutual information $MI(w_i, w_j)$ of two words $w_i, w_j$ is defined as:

$MI(w_i, w_j) = \log \frac{p(w_i w_j)}{p(w_i)\, p(w_j)}$

In this formula, $p(w_i w_j)$ is the joint term probability of $w_i$ and $w_j$. If $MI(w_i, w_j)$ is close to 0, $w_i$ and $w_j$ behave like independent random variables and have little connection; therefore, only with high $MI(w_i, w_j)$ values do the two words have correlation. Since the words or phrases composing a semantic entity must appear at the same time, they may have higher mutual information than words or phrases co-occurring randomly. In this paper, the traditional mutual information of two variables is extended to measure the mutual information over multiple variables as:

$MI(S) = \log \frac{p(S)}{\prod_{i=1}^{n} p(w_i)}$

Mutual information measures the correlation between random variables from the viewpoint of information theory. From the viewpoint of statistical theory, the dependence has a similar effect. The dependence of $S$ is defined as:

$DEP(S) = p(S) - \prod_{i=1}^{n} p(w_i)$

If $DEP(S)$ is larger than 0, the words in $S$ may not be independent, and $S$ has a higher chance of being a semantic entity.

The TF-IDF $tfidf(S)$ is a statistical quantity that shows the significance of a word or phrase to a document, and may therefore also help our semantic entity detection task. In this paper, we use a normalized TF-IDF value so that the feature is comparable across document boundaries:

$ntfidf_d(S) = \frac{tf_d(S)}{|d|} \cdot idf(S)$

In this formula, $idf(S)$ is the inverse document frequency of $S$ [46], $tf_d(S)$ is the term frequency of $S$ in document $d$ and $|d|$ is the length of $d$. In addition to the TF-IDF value of the whole entity $S$, the TF-IDF values of its components $w_i$ may also carry useful information about semantic entities. Therefore, the sum, variance and median of the TF-IDF values of the $w_i$ are also computed as features.

Moreover, the cosine index $cos(S)$, E index $E(S)$ and dice index $dice(S)$ [47] are computed as:

$cos(S) = \frac{p(S)}{\sqrt{\prod_{i=1}^{n} p(w_i)}}, \qquad E(S) = \frac{p(S)^2}{\prod_{i=1}^{n} p(w_i)}, \qquad dice(S) = \frac{n \cdot p(S)}{\sum_{i=1}^{n} p(w_i)}$

Please note that all the formulas using the term probability $p(w)$ are also applicable to the document probability $p_{doc}(w)$, so the above features computed over $p_{doc}(w)$ are also included in our strategy. All of the above features only use $p(S)$ and $p(w_i)$. If there is a table storing $p(S)$ and $p(w)$ for all $S$ and $w$, these features can be computed with constant complexity by look-up in the table. In addition, the table can be computed via the term frequencies of $S$ and $w$, which can be obtained by a sequential scan over the whole dataset with a scan window of max length $n$. Therefore, all features can be computed in linear complexity with respect to the word length of the dataset.
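A hedged sketch of the per-candidate computation (assuming the probability tables above; the exact forms of the cosine, E and dice indexes follow our reconstruction of the garbled originals):

```python
import math

# words: the component words of S; p: term probability table; p_s: p(S).
def inner_features(words, p, p_s):
    probs = [p[w] for w in words]
    prod = math.prod(probs)
    ic = -sum(q * math.log(q) for q in probs)     # information content IC(S)
    mi = math.log(p_s / prod)                     # extended mutual information MI(S)
    dep = p_s - prod                              # dependence DEP(S)
    cos = p_s / math.sqrt(prod)                   # cosine index
    e_idx = p_s ** 2 / prod                       # E index (reconstructed form)
    dice = len(words) * p_s / sum(probs)          # dice index (n-word extension)
    return {"IC": ic, "MI": mi, "DEP": dep, "cos": cos, "E": e_idx, "dice": dice}
```

Each value is a constant number of table look-ups per candidate, which is what keeps the overall feature extraction linear in the dataset size.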

3.2.2 Outer Statistical Features

Even though the inner statistical features may identify words and phrases that are parts of semantic entities, they may not carry enough information regarding the boundaries of semantic entities. As a result, features extracted from the context of a semantic entity are needed for semantic entity detection. In theory, the above features could be used to identify the semantic entity boundary if they were computed on the contexts of the candidate strings. However, the computation of the above features needs the term frequency table; if they were computed on the context of each candidate, one term frequency table would have to be built for each potential semantic entity string. As there are too many distinct candidates, this means either an extremely large memory space (PB or EB size) or a large number of scans (millions) over the whole data. Apparently, this is too time and storage consuming.

To resolve this problem, we propose several outer statistical features that can be computed fast enough with reasonable memory consumption. They are the normalized term frequency mean $\overline{tf}_k(S)$, max probability $p^{max}_k(S)$, outer mutual information $MI_k(S)$, outer dependence $DEP_k(S)$ and the expanded versions of the cosine index $cos_k(S)$, E index $E_k(S)$ and dice index $dice_k(S)$ of $S$.

First, the context words may have higher diversity than semantic entity elements. To measure the diversity of a context position $c_k$, the normalized term frequency mean $\overline{tf}_k(S)$ of the context position is used:

$\overline{tf}_k(S) = \frac{\sum_{w \in W_k(S)} p(w)}{p(S) \cdot |W_k(S)|}$

In this formula, $c_k$ is the context position and $W_k(S)$ is the set of words that appear at context position $c_k$ of $S$. The feature $p^{max}_k(S)$ measures the probability of the word that appears most often at position $c_k$:

$p^{max}_k(S) = \frac{\max_{w \in W_k(S)} p(w, S)}{p(S)}$

where $p(w, S)$ is the joint probability that word $w$ appears at position $c_k$ together with $S$. The outer mutual information between $c_k$ and the string $S$ is defined as:

$MI_k(S) = \sum_{w \in W_k(S)} p(w, S) \log \frac{p(w, S)}{p(w)\, p(S)}$

The outer dependence between $c_k$ and $S$ is defined as:

$DEP_k(S) = \sum_{w \in W_k(S)} \bigl( p(w, S) - p(w)\, p(S) \bigr)$

Also, the expanded versions of the cosine index $cos_k(S)$, E index $E_k(S)$ and dice index $dice_k(S)$ are computed as:

$cos_k(S) = \max_{w \in W_k(S)} \frac{p(w, S)}{\sqrt{p(w)\, p(S)}}, \qquad E_k(S) = \max_{w \in W_k(S)} \frac{p(w, S)^2}{p(w)\, p(S)}, \qquad dice_k(S) = \max_{w \in W_k(S)} \frac{2\, p(w, S)}{p(w) + p(S)}$

The regular form of the above features is computationally intensive, since a large number of different words appearing at position $c_k$ would need to be considered. However, only the words at $c_{-1}$ and $c_1$ are usually believed to be really useful, because they are the direct prefix and suffix of $S$. Moreover, if the scan window is large enough, the phrases $\langle c_{-1}\, S \rangle$ and $\langle S\, c_1 \rangle$ are already included in the phrase term frequency tables. Therefore, only the features with $k = \pm 1$ are used in our algorithm.
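A sketch of the outer features at the direct prefix position $c_{-1}$ (the suffix position $c_1$ is symmetric); the formulas follow our reconstruction above, and all names are illustrative:

```python
import math

# joint: maps each word w seen immediately before S to p(w, S);
# p: term probability table; p_s: p(S).
def outer_features(joint, p, p_s):
    words = list(joint)
    ntf_mean = sum(p[w] for w in words) / (len(words) * p_s)   # normalized tf mean
    p_max = max(joint.values()) / p_s                          # max probability
    mi_out = sum(joint[w] * math.log(joint[w] / (p[w] * p_s))  # outer mutual information
                 for w in words)
    dep_out = sum(joint[w] - p[w] * p_s for w in words)        # outer dependence
    cos_out = max(joint[w] / math.sqrt(p[w] * p_s)             # expanded cosine index
                  for w in words)
    return {"ntf_mean": ntf_mean, "p_max": p_max,
            "MI": mi_out, "DEP": dep_out, "cos": cos_out}
```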

3.2.3 Novelty Statistical Features

New semantic entities may be repeated time after time and news item after news item during a period, so the frequencies of these semantic entities are often very large during that time, while they scarcely appear in the previous period. This implies that the novelty statistical feature $NV(S)$ is proportional to the occurrence probability $p_{cur}(S)$ in the current data and inversely proportional to the historical occurrence probability $p_{hist}(S)$:

$NV(S) = \alpha \cdot \frac{p_{cur}(S)}{p_{hist}(S)}$

In this formula, $\alpha$ is a normalization factor. The historical occurrence probability $p_{hist}(S)$ can be calculated in many different ways, because different time intervals provide different effects. In this paper, several time intervals such as one day, one week, one month and one year are used as historical data. Meanwhile, several cumulative periods are also calculated: for example, with 3 days of data, the first day, the second day, the third day, the first plus second day, and the first plus second plus third day are all used as historical data. The historical data are used to calculate several novelty features, which reflect the real novelty of semantic entities.
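A minimal sketch of this feature over several hypothetical historical intervals (the interval bookkeeping and the smoothing constant are our own illustration):

```python
# p_cur: occurrence probability of S in the current data.
# hist_probs: maps an interval name (e.g. 'day', 'week') to p_hist(S) there.
def novelty_features(p_cur, hist_probs, alpha=1.0, eps=1e-9):
    return {name: alpha * p_cur / (p_hist + eps)   # eps guards never-seen strings
            for name, p_hist in hist_probs.items()}

# Example: a string frequent today but rare last month scores high.
nv = novelty_features(0.002, {"day": 0.0018, "week": 0.0004, "month": 0.0})
```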

3.3 DT-SVM Classification Algorithm

With representative features, semantic entities can be detected by a sophisticated classifier. The SVM (Support Vector Machine) algorithm [48] is a widely used classification algorithm and can be adopted for semantic entity detection. However, because our method requires the statistical features of all adjacent string combinations occurring in the dataset, without considering whether each is a plausible entity or not, the distribution of true semantic entities is very sparse (1%-3%) and complicated. Therefore, the traditional SVM algorithm may not be able to achieve acceptable performance on such extremely imbalanced data. To resolve this problem, we propose to use a decision tree to filter out most negative samples before the data are fed to the SVM classifier. The decision tree algorithm is chosen as the filter because it trains fast and is easy to tune between precision and recall; it can also achieve good performance on training data with little noise. Therefore, the decision tree algorithm is suitable as the first stage algorithm. In our algorithm, we need to tune the filter to achieve almost 100% recall on semantic entities. To do so, we use the following filter training steps: (1) train a decision tree model via the C4.5 algorithm; (2) check all leaf nodes of the tree, mark leaf nodes covering only positive samples with a "+" state, leaf nodes covering only negative samples with a "-" state and all other leaf nodes with a "0" state. With this filter, the proposed DT-SVM works as shown in Fig. 4.

Fig. 4. DT-SVM framework

As discussed above, the key part of our framework is the DT-SVM-based classification algorithm. To give a more specific view of it, the DT-SVM algorithm is presented in Figure 5. In the algorithm, Xk represents the feature vector composed of the novelty features, inner features and outer features, and Yk is the class label of Xk. After applying the DT filtering algorithm to remove most useless negative samples, the remaining samples are fed to the final SVM classification algorithm and the classifier model is trained.

Fig. 5. DT-SVM Classification Algorithm

To sum up, the key part of our framework is the DT-SVM-based classification algorithm. It is adopted for several reasons: 1) a decision tree, as a basic classification method, can swiftly achieve quite reasonable performance on datasets that are easily classified; 2) SVM methods struggle when there is too much noise in an unbalanced dataset, the results are greatly influenced, and the training time also increases substantially, so a decision tree with a high recall rate (nearly 100%), acceptable precision and quite low computational complexity is used to pre-process the dataset. Since a lot of negative data is filtered out beforehand and only the small, comparatively complex part of the data is left to be handled by the support vector machine, both the efficiency and the effectiveness are improved; 3) the support vector machine is based on the maximum-margin rule, so samples far from the margin contribute little to the whole process and are usually treated as distractors. Thus, there is no need to worry whether the deleted negative samples would have been useful to the classification algorithm. Semantic entities are extracted through the DT-SVM-based classification algorithm; however, the relationships among the semantic entities remain uncertain. Thus, the next part of our framework addresses relation extraction among semantic entities.
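A minimal scikit-learn sketch of the two-step idea (the paper uses C4.5; DecisionTreeClassifier is a stand-in, and the class below is our own illustration, not the authors' implementation). Samples landing in pure negative ("-") leaves are discarded, pure positive ("+") leaves are accepted directly, and only the mixed ("0") leaves are passed to the SVM:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

class DTSVM:
    """Two-step DT-SVM: a tree filter followed by an SVM on the hard cases.
    Assumes y is a 0/1 numpy array (1 = semantic entity)."""
    def __init__(self):
        self.tree = DecisionTreeClassifier(min_samples_leaf=20)
        self.svm = SVC(kernel="rbf")

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)
        # Mark each leaf: +1 all-positive, -1 all-negative, 0 mixed.
        self.leaf_state = {}
        for leaf in np.unique(leaves):
            ys = y[leaves == leaf]
            self.leaf_state[leaf] = 1 if ys.min() == 1 else (-1 if ys.max() == 0 else 0)
        mixed = np.array([self.leaf_state[l] == 0 for l in leaves])
        self.svm.fit(X[mixed], y[mixed])   # the SVM sees only the hard, less imbalanced part
        return self

    def predict(self, X):
        states = np.array([self.leaf_state.get(l, 0) for l in self.tree.apply(X)])
        out = (states == 1).astype(int)    # "+" leaves are positive, "-" leaves stay 0
        mixed = states == 0
        if mixed.any():
            out[mixed] = self.svm.predict(X[mixed])
        return out
```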

3.4 Relationship Integration

Like the semantic entity identification task, most SER and SERE algorithms employ linguistic features and part-of-speech tagging information to some extent, and it is very hard to avoid error-prone behavior with such information. Our framework therefore only makes use of several statistical properties from the first stage, together with similarity measures and co-occurrence probabilities. A more detailed interpretation is given in the following.

Our semantic entity relationship integration task is executed as a scan procedure: the detector scans the input text sequentially with a window, and if several semantic entities are found in the scan window, they are regarded as having occurred together (a co-occurrence matrix is constructed to record the co-occurrence frequencies among the semantic entities in the document collection). After the scanning process, the co-occurrence matrix is complete and features can be collected for each pair of semantic entities. In this paper, the semantic entities extracted in the first stage are denoted $E = \{e_1, e_2, \dots, e_m\}$, where $e_i$ is a semantic entity lying in a paragraph $c_1 c_2 \dots e_i \dots c_l$ of context words $c_j$, and the words contained in $e_i$ are denoted $e_i = w_1 w_2 \dots w_n$. The semantic entity relationship integration algorithm extracts co-occurrence features from $E$, together with several other features, and feeds them to a classifier, which outputs true when two entities are connected and false otherwise. The features, together with their popularities (co-occurrence counts), and the algorithm used to extract the relationships among entities are introduced in the following subsections.

3.4.1 Features for Relationship Integration

To decide whether one semantic entity is related to another, several types of features must be extracted from $E$ and fed to the classifier. As in the semantic entity identification process, the features must be extracted with low complexity due to the extremely large number of possible pairs. The features used in this section therefore make use of the features from the semantic entity identification task and of co-occurrence properties.

The first kind of features is the feature similarity between the entities of each pair, calculated through several similarity functions: the Manhattan Distance [49], the Euclidean Distance-based Similarity [50] and the Chebyshev Distance [51]. Each similarity function is introduced as follows.

The Manhattan Distance is a simple method for measuring the difference between two semantic entities. The Manhattan Distance between the corresponding features of two semantic entities $e_a$ and $e_b$ is defined as:

$d_{Man}(e_a, e_b) = \sum_i |F_i(e_a) - F_i(e_b)|$

In this formula, $F_i(e_a)$ denotes the $i$-th feature of entity $e_a$ and $F_i(e_b)$ likewise denotes the $i$-th feature of entity $e_b$. By applying the formula to each feature of the two semantic entities, the Manhattan Distance can be obtained, and thus the rough difference between $e_a$ and $e_b$ can be estimated.

Apart from the Manhattan Distance, the Euclidean Distance-based Similarity is also a useful tool to estimate the divergence between two entities. It is defined as:

$d_{Euc}(e_a, e_b) = \sqrt{\sum_i \bigl( F_i(e_a) - F_i(e_b) \bigr)^2}$

The Euclidean Distance is a very common distance measurement for quantifying differences.

Moreover, the Chebyshev Distance is used to assess the discrepancy between the two entities and is calculated as:

$d_{Che}(e_a, e_b) = \max_i |F_i(e_a) - F_i(e_b)|$

With the help of these similarity features, the connection between two entities can be known approximately. All three features are used in our framework.

Moreover, the co-occurrence properties are the key to improving our framework, since they measure the co-occurrence relationships among entities more accurately: the more often several entities appear together, the more likely they are related or convey the same meaning. Therefore, the co-occurrence frequencies among the entities are counted to measure their relationships. In our framework, a feature vector is adopted to represent the co-occurrences of an entity:

$V(e_a) = [\, C(e_a, e_1),\ C(e_a, e_2),\ \dots,\ C(e_a, e_m) \,]$

In this formula, $e_a$ is a certain entity with index $a$, and $C(e_a, e_j)$ is the frequency with which entity $e_a$ occurs with entity $e_j$ within a certain size of context. After acquiring $V(e_a)$ for every entity in $E$, the Manhattan Distance, the Euclidean Distance-based Similarity and the Chebyshev Distance are also applied to measure the similarity between two entities $e_a$ and $e_b$ through $V(e_a)$ and $V(e_b)$:

$d_{Man}(V(e_a), V(e_b)) = \sum_{j=1}^{m} |C(e_a, e_j) - C(e_b, e_j)|$

$d_{Euc}(V(e_a), V(e_b)) = \sqrt{\sum_{j=1}^{m} \bigl( C(e_a, e_j) - C(e_b, e_j) \bigr)^2}$

$d_{Che}(V(e_a), V(e_b)) = \max_j |C(e_a, e_j) - C(e_b, e_j)|$

Using these similarity measures, the relationship between entities $e_a$ and $e_b$ can be described, which promotes the efficiency of discovering related entities. Besides these computation methods, the kernel similarity between entities $e_a$ and $e_b$ is given by:

$K(e_a, e_b) = \langle V(e_a), V(e_b) \rangle = \sum_{j=1}^{m} C(e_a, e_j)\, C(e_b, e_j)$

Furthermore, the Jaccard Similarity and the Jaccard Distance [52] are also applied to $V(e_a)$ and $V(e_b)$:

$J(e_a, e_b) = \frac{\sum_{j} \min\bigl( C(e_a, e_j),\, C(e_b, e_j) \bigr)}{\sum_{j} \max\bigl( C(e_a, e_j),\, C(e_b, e_j) \bigr)}, \qquad d_J(e_a, e_b) = 1 - J(e_a, e_b)$

In these formulas, $V(e_a)$ and $V(e_b)$ stand for the co-occurrence feature vectors of entities $e_a$ and $e_b$. With this set of feature vectors, the variance among entities can be found, so that the framework recognizes related entities more correctly.

Also, the edit distance $ed(s, t)$ [53] is helpful to measure the similarity between two semantic entities $s = s_1 \dots s_p$ and $t = t_1 \dots t_q$; it can be computed by dynamic programming:

$ed(i, 0) = i, \qquad ed(0, j) = j$

$ed(i, j) = \min \bigl\{\, ed(i-1, j) + 1,\ ed(i, j-1) + 1,\ ed(i-1, j-1) + [\, s_i \ne t_j \,] \,\bigr\}$

Here $ed(i, j)$ is the edit distance between the prefixes $s_1 \dots s_i$ and $t_1 \dots t_j$.

The statistical features here can be acquired by scanning the segmented files and the semantic entity files once. The computational complexity is $O(m^2 + N)$, where $m$ is the number of semantic entities output by the first phase and $N$ is the number of segmented words in the segmented files. Since the semantic entities account for only about 1%-5% of all potential text strings, $m$ is a very small number and the computational complexity is acceptable.
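A hedged sketch of the pair features above (our own naming; Fa/Fb are the first-phase feature vectors of the two entities and Va/Vb their co-occurrence rows):

```python
import numpy as np

def pair_features(Fa, Fb, Va, Vb):
    manhattan = np.abs(Fa - Fb).sum()                 # d_Man over first-phase features
    euclidean = np.sqrt(((Fa - Fb) ** 2).sum())       # d_Euc
    chebyshev = np.abs(Fa - Fb).max()                 # d_Che
    kernel = float(Va @ Vb)                           # inner-product kernel similarity
    inter = np.minimum(Va, Vb).sum()
    union = np.maximum(Va, Vb).sum()
    jaccard = inter / union if union else 0.0         # Jaccard similarity over V(.)
    return [manhattan, euclidean, chebyshev, kernel, jaccard, 1.0 - jaccard]

def edit_distance(s, t):
    """Classic dynamic program implementing the recurrence above."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(t) + 1)]
         for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return d[-1][-1]
```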

3.4.2 Relationship Integration Algorithm

Fig. 6. Relationship Integration Algorithm

As shown in Figure 6, relationship integration, the second phase of our framework, is conducted as a scanning process making use of the segmented files and semantic entity files generated by the first stage of the framework, i.e. the identification of semantic entities. With the semantic entity files, the features of the first phase can be retrieved and the semantic entity pairs are formed. Statistical features such as the Manhattan Distance, the Euclidean Distance-based Similarity and the Chebyshev Distance discussed in Section 3.4.1 can then be computed. After that, the scanning process is conducted on the segmented files, and a context window is defined to measure the co-occurrence between each pair of semantic entities. In addition, a co-occurrence matrix is generated to record the co-occurrences of each pair of semantic entities appearing in the context window, and each semantic entity is assigned an index. If a semantic entity $e_i$ of index $i$ occurs with another semantic entity $e_j$ of index $j$ in the context window, the matrix elements at positions $(i, j)$ and $(j, i)$ are increased. After building up the matrix, the co-occurrence properties can be obtained through the formulas given before. Then, the features of each semantic entity pair are computed, and the SVM method (as introduced in Section 3.3), which is good at handling nonlinear, relatively complex, high dimensional data, is employed. Finally, the classification models are trained; the resulting classification model can be used to group related semantic entity pairs together.

Figure 6 shows the pseudo-code of the relationship integration algorithm. In the algorithm, Xk and Xn represent the feature vectors composed of the novelty, inner and outer features discussed in Section 3.4.1; Xkn is the statistical feature vector, introduced in the preceding paragraph, computed from Xk and Xn; and Ykn is a value used to define the relationship between Xk and Xn. After feeding the training data into the SVM classification algorithm, the classifier model is trained as the RE-SVM defined in the figure.
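A minimal sketch of the scanning step that fills the symmetric co-occurrence matrix (assuming, for simplicity, that each entity appears as a single token in the segmented files; entity_index and the window size are illustrative):

```python
import numpy as np

def build_cooccurrence(docs, entity_index, window=20):
    """docs: list of segmented documents (lists of tokens);
    entity_index: maps known semantic entities to matrix indices."""
    m = len(entity_index)
    C = np.zeros((m, m), dtype=np.int64)
    for words in docs:
        for start in range(0, max(1, len(words) - window + 1)):
            # Entities falling inside the same window co-occur.
            hits = {entity_index[w] for w in words[start:start + window]
                    if w in entity_index}
            for i in hits:
                for j in hits:
                    if i < j:
                        C[i, j] += 1    # increment both (i, j) and (j, i)
                        C[j, i] += 1
    return C
```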

4 Evaluation

In this section we evaluate the effectiveness of our framework. We use both Chinese and English text data for comparison, so that we can evaluate the effectiveness of the framework on different languages. The data are news web pages downloaded from the Internet. There are 4.1 million Chinese pages and 690 thousand English pages covering 13 months in the dataset. The dataset is segmented into two parts, where the first 150 days of data are used for training and the rest for testing. For each dataset, the last day's semantic entities are manually annotated for quantitative evaluation. The other 210 days of data are used as background statistical material. We collected the dataset ourselves; all the data are news pages crawled from online news websites. All the experimental results, including the text data and algorithms, can be found online¹.

Fig. 7. Precision Rate

We first evaluate the effectiveness of the inner statistical features, outer statistical features and novelty features. Afterwards, our DT-SVM algorithm, the standalone decision tree algorithm and the standalone SVM algorithm are also evaluated and compared to illustrate the characteristics of each algorithm.

¹ http://pan.baidu.com/s/1o6zfXpS

Measurements including precision, recall and F-One [54] are used for comparison. The results are shown in Figure 7 to Figure 9, where "NF" stands for the results using only novelty features, "IF" for the results using only inner statistical features, "OF" for the results using only outer statistical features, "WF" for the results using word features, viz. inner and outer statistical features, and "AF" for the results using all features, viz. inner, outer and novelty statistical features. In our experiments, we use C4.5 to implement the DT algorithm. One of the great advantages of the decision tree is that we do not need to set parameters manually; therefore, C4.5 can be easily implemented and applied. Meanwhile, grid searching with 5-fold cross validation is used to find the optimal parameters for the SVM algorithm. Specifically, c and gamma are the two key parameters, where c is the penalty factor that controls the acceptable error rate and gamma determines the distribution of the data when mapped to the new feature space. We use grid searching to find the best match of c and gamma: the search range for c and gamma is set to a large number N (e.g. 50) and the step to a small number M (e.g. 0.5), so as to search the parameters precisely.
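As a sketch, the described search maps directly onto scikit-learn's grid search (X_train and y_train are hypothetical placeholders; the ranges follow the N = 50 and M = 0.5 settings in the text):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = {"C": np.arange(0.5, 50.5, 0.5),       # penalty factor c
        "gamma": np.arange(0.5, 50.5, 0.5)}   # kernel width gamma
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5, scoring="f1")
# search.fit(X_train, y_train); search.best_params_ holds the chosen c and gamma
```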

Fig. 8. Recall Rate

Fig. 9. F-One Measure

Figure 7 to Figure 9 depict the precision rate, recall rate and F-One measure of semantic entity identification on both English and Chinese text, applying different algorithms such as SVM, decision tree, pruned decision tree and DT-SVM. From the figures, one can find that the three types of features provide precision rates ranging from 35.28% to 68.3% for NF, 60.2% to 88.2% for IF and 63.7% to 90% for OF; recall rates ranging from 27% to 75.98% for NF, 4.8% to 74% for IF and 20.2% to 75.6% for OF; and F-One measures ranging from 38.18% to 57.16% for NF, 9.06% to 78.2% for IF and 32.25% to 72% for OF. By using the WF and AF features, the precision rate, recall rate and F-One measure improve by 5%-25% over the standalone features NF, IF and OF. In the meantime, the precision rate, recall rate and F-One measure of AF improve by 5%-10% over WF. Among them, the DT-SVM strategy using the AF features provides a precision rate ranging from 87.3% to 92.3% with an average of 89.3%, a recall rate ranging from 73.5% to 81.8% with an average of 77.6%, and an F-One measure ranging from 81.53% to 84.56% with an average of 82.54% on both English and Chinese text data, which is better than all the other strategies such as SVM, DT and pruned DT. Specifically, the average increases in precision rate, recall rate and F-One are 10%, 27% and 25% respectively.

Based on the above results we can conclude that: (1) all three types of features are effective for semantic entity detection; (2) the combination of these features provides much better results than any individual features alone; (3) our proposed DT-SVM strategy outperforms the standalone decision tree and SVM algorithms, which proves that the proposed two-step classification strategy is very effective for handling imbalanced classification.

To further evaluate the effectiveness of our strategy, we have compared it with the CRF-based semantic entity detection approach [9, 55], which is regarded as one of the most effective NER approaches. In this paper, the experiments are conducted with the CRF++ toolkit [9]. The three key parameters are '-a', '-c' and '-f'. Specifically, '-a' selects the type of algorithm used in the experiments; we choose 'CRF-L2' because it is proved to be better than 'CRF-L1' [56]. '-c' sets the hyper-parameter which determines the balance between overfitting and underfitting; we use cross validation to find the optimal '-c'. '-f' determines the cut-off threshold of features; we use a simple parameter search, namely setting a range, applying the CRF methods to obtain the results for every '-f', and selecting the '-f' that produces the best result.

To further demonstrate the effectiveness of our strategy in dealing with imbalanced datasets, we have also compared our strategy with Bootstrapping-SVM [57], since bootstrapping is a common method for efficiently and effectively solving imbalanced data problems. The simple bootstrap method involves first taking the original data set of N samples, and then sampling from it to form a new sample (called a 'resample' or bootstrap sample) which is also of size N.
This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples a mean is computed (these values are called bootstrap estimates). The precision, recall and F-One results are shown in Figure 10, where "Eng-P" stands for the precision rate on English text, "Eng-R" for the recall rate on English text, "Eng-F" for the F-One measure on English text, "Chn-P" for the precision rate on Chinese text, "Chn-R" for the recall rate on Chinese text and "Chn-F" for the F-One measure on Chinese text.
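A minimal sketch of the bootstrap resampling just described (our own illustration of the baseline's data preparation, not the authors' code):

```python
import numpy as np

def bootstrap_means(data, n_rounds=1000, seed=0):
    """Draw n_rounds resamples of size N with replacement and
    return their means (the bootstrap estimates)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    return [rng.choice(data, size=len(data), replace=True).mean()
            for _ in range(n_rounds)]
```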

Fig. 10. Comparison between DT-SVM, CRF and Bootstrapping-SVM

Figure 10 depicts the comparison between DT-SVM, the CRF-based approach and Bootstrapping-SVM on the precision rate, recall rate and F-One measure of semantic entity identification on both English and Chinese text data. From the figure, we can easily find that our DT-SVM strategy outperforms the CRF-based approach and Bootstrapping-SVM significantly on both Chinese and English datasets. The precision, recall and F-One measure of DT-SVM on the English and Chinese datasets range from 87.3% to 92.3%, 73.5% to 81.8% and 81.53% to 84.56%, while the highest precision, recall and F-One measure of the CRF-based algorithms are 53.2%, 61.8% and 57.18%. In the meantime, DT-SVM is better than Bootstrapping-SVM in dealing with imbalanced datasets: the best precision, recall and F-One measure of the Bootstrapping-SVM algorithm are 79.3%, 50.5% and 56.92% respectively, much lower than their counterparts for DT-SVM.

Through the DT-SVM model trained in the first step, most semantic entities can be identified and the features used in the first step are calculated. In the meantime, the other features mentioned in Section 3.4.1 can also be extracted and computed from the segmented files produced in the first step. For example, the training and testing files fed to the SVM can be organized as shown in Table 1.

Table 1. Samples of Entity Pairs

Entity Pair                   Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
Video Games-Play Station 4      17.2       0.85       78.1        8.6        7.3        9.5
Liverpool-Gerrard                3.1       1.23       62.3       12.8        7.8       23.7
Dior-Prada                       2.04      3.10       35.76      20.6        9.6       26.8

As shown in Table 1, each sample of a semantic entity pair is composed of two semantic entities linked by "-". If the two semantic entities which form a pair are related to each other, the pair is regarded as a positive sample, and otherwise as a negative one.

Afterwards, the training files as well as the testing files are processed. From Table 1, three semantic entity pairs can be found: Video Games with Play Station 4 (Play Station 4 is a type of video game), Liverpool with Gerrard (Gerrard is a player of Liverpool FC) and Dior with Prada (both Dior and Prada are luxury fashion brands). The semantic entities of each pair therefore do have relationships, as shown in the table. However, the closeness of the relationships differs among the three pairs: there are affiliation relationships, such as Play Station 4 with Video Games and Liverpool with Gerrard, and parallel relationships, such as Dior with Prada. As mentioned in Section 3.4.1, several statistical features describe the similarity and relationship between semantic entities, and different values of these features reflect the differences. Table 1 also illustrates the file format used in our framework. After acquiring all the features, the training files are prepared in the same format as Table 1. In this example, feature 1 represents the Manhattan Distance, feature 2 the Euclidean Distance-based Similarity, feature 3 the Chebyshev Distance, feature 4 the kernel similarity, feature 5 the Jaccard Similarity and feature 6 the edit distance. The training files are fed to the SVM; then, the classification model which determines the relationship between the two semantic entities of a pair can be built. Finally, since the purpose of our framework is to cluster the semantic entities into different groups with the help of the semantic entity pair classification model, 4987 semantic entity pairs extracted from the first phase are used to test the effectiveness of the framework as a whole. The results show that the total number of errors is 958 and the error rate is 19.20% (hence the precision rate is about 80.8%). Specifically, there are 1326 positive samples and 3661 negative samples in the dataset. In this experiment, the relationship threshold is set dynamically by maximizing the distance between clusters.

5 Case Study

In order to demonstrate the effectiveness of our framework as a whole, a case study is presented in Figure 11.

Fig. 11. Sample of Semantic Entities Relationship Clustering

In the case study, we randomly select several files as samples from the results of the first, semantic entity identification phase of our framework. As shown in the figure, four classes have been clustered through our relationship integration method: a video games based class in purple, a Liverpool F.C. based class in red, a mobile phones based class in green and a luxury brand based class in blue. The font sizes of the semantic entities differ according to the values obtained to describe the closeness between the semantic entities in a pair; this value is calculated through our classification model. The distance between semantic entities is defined according to the closeness value between the semantic entity and the core semantic entity of each class, such as Video Games, Louis Vuitton, Liverpool F.C. and Mobile Phones. The core semantic entity is selected manually before the experiment. The semantic entities are clustered into classes through the following process. First, the semantic entities are formed into semantic entity pairs. Second, these pairs are provided to the classification method; each semantic entity pair receives a value which defines the closeness of the relationship between its two semantic entities, and only when the value is above a certain threshold are the two semantic entities considered related. In this paper, the classes are clustered through a regional connectivity scanning process on the semantic entity pairs, which scans all the semantic entity pairs and clusters all the related pairs into groups. Therefore, related semantic entities can be found and clustered as shown in Figure 11.
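A minimal sketch of this step (our own illustration): treating each related pair as an edge, the regional connectivity scan amounts to finding the connected components of the resulting graph.

```python
from collections import defaultdict

def connectivity_clusters(related_pairs):
    """Cluster entities into connected components of the relatedness graph."""
    graph = defaultdict(set)
    for a, b in related_pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                       # iterative depth-first scan
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Example with pairs from Table 1:
# connectivity_clusters([("Video Games", "Play Station 4"), ("Liverpool", "Gerrard")])
```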

6 Conclusion and Future Work

In this paper, a novel two-phase framework has been proposed to identify semantic entities and integrate the relationships between semantic entities in large scale text data. The motivation of our work is that a highly effective semantic entity identification strategy is first required to produce accurate semantic entities, and that only with many accurate semantic entities can satisfactory results for semantic relationship integration be achieved. Accordingly, for the first phase of the framework, we have defined a set of statistical features which are sensitive to new semantic entities and proposed a two-step classification algorithm which integrates a decision tree and an SVM to handle an extremely imbalanced classification task. For the second phase of the framework, we have proposed a novel method to produce semantic entity pairs and then clustered the semantic entities using the regional connectivity scanning process. Comprehensive experiments have been conducted to evaluate our framework. Specifically, the first part of the experiments has shown that the statistical features can identify semantic entities effectively and that our proposed two-step classification algorithm achieves high performance on imbalanced data. The second part of the experiments has shown that our proposed strategy outperforms representative approaches such as CRF and Bootstrapping-SVM on semantic entity detection. Finally, through a case study, the third part of the experiments has demonstrated the effectiveness of our framework as a whole.

In the future, we will focus on clustering methods which can group the semantic entities more accurately. Furthermore, our research will not only focus on the integration of relationships among semantic entities but will also try to apply the relationships to improve text clustering accuracy among documents. To be more specific, relationships between texts can be identified through the connected semantic entities and relationships discovered in the texts, so that whether two texts are connected can be effectively determined.

7 Acknowledgement

The research work reported in this paper is partly supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61300042 and the Shanghai Knowledge Service Platform Project under Grant No. ZF1213. Xiao Liu is the corresponding author.

8 References

1. Hunter, J., Adding multimedia to the Semantic Web - Building and applying an MPEG-7 ontology. 2005: Wiley.
2. Arndt, R., et al., COMM: designing a well-founded multimedia ontology for the web, in The Semantic Web. 2007, Springer. p. 30-43.
3. Zhuge, H. and Y. Sun, The schema theory for semantic link network. Future Generation Computer Systems, 2010. 26(3): p. 408-420.
4. Zhuge, H., Dimensionality on Summarization. arXiv preprint arXiv:1507.00209, 2015.
5. Tsai, T.-H., et al., Mencius: A Chinese Named Entity Recognizer Using the Maximum Entropy-based Hybrid Model. International Journal of Computational Linguistics and Chinese Language Processing, 2004. 9(1).
6. Wu, Y., J. Zhao, and B. Xu. Chinese named entity recognition combining a statistical model with human knowledge. in Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition - Volume 15. 2003. Association for Computational Linguistics.
7. Wu, Y., et al. Chinese named entity recognition based on multiple features. in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. 2005. Association for Computational Linguistics.
8. Altun, Y., I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. in ICML. 2003.
9. Kudo, T., CRF++: Yet another CRF toolkit. Software available at http://crfpp.sourceforge.net, 2005.
10. Zhao, H., C.-N. Huang, and M. Li. An improved Chinese word segmentation system with conditional random field. in Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing. 2006. Sydney.
11. Wagstaff, K., et al. Constrained k-means clustering with background knowledge. in ICML. 2001.
12. Johnson, S.C., Hierarchical clustering schemes. Psychometrika, 1967. 32(3): p. 241-254.
13. Xu, X., et al. A distribution-based clustering algorithm for mining in large spatial databases. in Proceedings of the 14th International Conference on Data Engineering. 1998. IEEE.
14. Ester, M., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. in KDD. 1996.
15. Berger, A.L., V.J.D. Pietra, and S.A.D. Pietra, A maximum entropy approach to natural language processing. Computational Linguistics, 1996. 22(1): p. 39-71.
16. Sekine, S., R. Grishman, and H. Shinnou. A decision tree method for finding and classifying names in Japanese texts. in Proceedings of the Sixth Workshop on Very Large Corpora. 1998.
17. Gao, J., et al., Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics, 2005. 31(4): p. 531-574.
18. Takeuchi, K. and N. Collier. Use of support vector machines in extended named entity recognition. in Proceedings of the 6th Conference on Natural Language Learning - Volume 20. 2002. Association for Computational Linguistics.
19. Chen, A., et al. Chinese named entity recognition with conditional probabilistic models. in 5th SIGHAN Workshop on Chinese Language Processing, Australia. 2006.
20. Bai, S., et al., System for Chinese tokenization and named entity recognition. 2001, Google Patents.
21. Sproat, R. and T. Emerson. The first international Chinese word segmentation bakeoff. in Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17. 2003. Association for Computational Linguistics.
22. Wu, A. and Z. Jiang. Statistically-enhanced new word identification in a rule-based Chinese system. in Proceedings of the Second Workshop on Chinese Language Processing, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 12. 2000. Association for Computational Linguistics.
23. Li, H., et al., The use of SVM for Chinese new word identification, in Natural Language Processing - IJCNLP 2004. 2005, Springer. p. 723-732.
24. Fu, G. and K.-K. Luke, Chinese unknown word identification using class-based LM, in Natural Language Processing - IJCNLP 2004. 2005, Springer. p. 704-713.
25. Zheng, Y., et al. Incorporating user behaviors in new word detection. in IJCAI. 2009.
26. Chien, L.-F. PAT-tree-based keyword extraction for Chinese information retrieval. in ACM SIGIR Forum. 1997. ACM.
27. Kambhatla, N. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. in Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. 2004. Association for Computational Linguistics.
28. Zhao, S. and R. Grishman. Extracting relations with integrated information using kernel methods. in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. 2005. Association for Computational Linguistics.
29. GuoDong, Z., et al. Exploring various knowledge in relation extraction. in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. 2005. Association for Computational Linguistics.
30. Lodhi, H., et al., Text classification using string kernels. The Journal of Machine Learning Research, 2002. 2: p. 419-444.
31. Bunescu, R.C. and R.J. Mooney. A shortest path dependency kernel for relation extraction. in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. 2005. Association for Computational Linguistics.
32. Culotta, A. and J. Sorensen. Dependency tree kernels for relation extraction. in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. 2004. Association for Computational Linguistics.
33. Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. 1995. Association for Computational Linguistics.
34. Blum, A. and T. Mitchell. Combining labeled and unlabeled data with co-training. in Proceedings of the Eleventh Annual Conference on Computational Learning Theory. 1998. ACM.
35. Agichtein, E. and L. Gravano. Snowball: Extracting relations from large plain-text collections. in Proceedings of the Fifth ACM Conference on Digital Libraries. 2000. ACM.
36. Etzioni, O., et al., Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 2005. 165(1): p. 91-134.
37. Yates, A., et al. TextRunner: open information extraction on the web. in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2007. Association for Computational Linguistics.
38. Wang, C., et al. Relation extraction with relation topics. in Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2011. Association for Computational Linguistics.
39. Wang, C., et al., Relation extraction and scoring in DeepQA. IBM Journal of Research and Development, 2012. 56(3.4): p. 9:1-9:12.
40. Bansal, M. and D. Klein. Coreference semantics from web features. in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. 2012. Association for Computational Linguistics.
41. Hindle, A., et al., Clustering web video search results based on integration of multiple features. World Wide Web, 2011. 14(1): p. 53-73.
42. Huang, F., et al., Clustering web documents using hierarchical representation with multi-granularity. World Wide Web, 2013: p. 1-22.
43. Khy, S., Y. Ishikawa, and H. Kitagawa, A novelty-based clustering method for on-line documents. World Wide Web, 2008. 11(1): p. 1-37.
44. Li, L., et al., An efficient approach to suggesting topically related web queries using hidden topic model. World Wide Web, 2013: p. 1-25.
45. Latham, P.E. and Y. Roudi, Mutual information. Scholarpedia, 2009. 4(1): p. 1658.
46. Jones, K.S., A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 1972. 28(1): p. 11-21.
47. Zhao, Y., L. Cui, and H. Yang, Evaluating reliability of co-citation clustering analysis in representing the research history of subject. Scientometrics, 2009. 80(1): p. 91-102.
48. Cortes, C. and V. Vapnik, Support-vector networks. Machine Learning, 1995. 20(3): p. 273-297.
49. Sherwood, T., et al. Automatically characterizing large scale program behavior. in ACM SIGARCH Computer Architecture News. 2002. ACM.
50. Su, M.-C. and C.-H. Chou, A modified version of the K-means algorithm with a distance based on cluster symmetry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. 23(6): p. 674-680.
51. Kløve, T., et al., Permutation arrays under the Chebyshev distance. IEEE Transactions on Information Theory, 2010. 56(6): p. 2611-2617.
52. Hamers, L., et al., Similarity measures in scientometric research: the Jaccard index versus Salton's cosine formula. Information Processing & Management, 1989. 25(3): p. 315-318.
53. Ristad, E.S. and P.N. Yianilos, Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998. 20(5): p. 522-532.
54. Powers, D., Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2011. 2(1): p. 37-63.
55. Finkel, J.R., T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. 2005. Association for Computational Linguistics.
56. Tellier, I., et al., POS-tagging for oral texts with CRF and category decomposition. Natural Language Processing and its Applications, 2010. 46: p. 79-90.
57. Niu, C., et al. A bootstrapping approach to named entity classification using successive learners. in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1. 2003. Association for Computational Linguistics.
