Knowledge-Based Systems 40 (2013) 27–35
Integrating statistical and lexical information for recognizing textual entailments in text Yu-Chieh Wu Department of Communications Management, Ming Chuan University, Taiwan
Article info

Article history: Received 23 March 2012; Received in revised form 5 November 2012; Accepted 23 November 2012; Available online 11 December 2012.

Keywords: Textual entailment; Text mining; Natural language processing; Machine learning; Kernel methods
Abstract

Recognizing textual entailment is the task of inferring whether the meaning of a given hypothesis follows from a given text span. Better recognition usually requires deep text processing components such as syntactic parsers and semantic taggers; however, these resources are often unavailable for non-English languages. In this paper, we present a light-weight Chinese textual entailment recognition system that uses part-of-speech information only. We designed two different feature models from the training data and employed the well-known kernel method to learn to predict the testing data. One feature set abstracts generic statistics over the text pairs, while the other directly models lexical features based on the traditional bag-of-words model. The proposed feature models not only bring additional statistical information from the datasets but also enhance the prediction capability. To validate this, we conducted experiments on the recent benchmark corpus, NTCIR-RITE 2011. The empirical results demonstrate that our method achieves the best results in comparison to the other competitors. In terms of accuracy, our method achieves 54.77% on the NTCIR RITE MC task.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Textual entailment (also known as paraphrasing) is the task of discovering entailment relationships between texts. These relationships usually include forward entailment, reverse entailment, bidirectional entailment, independence, and contradiction, and they directly provide useful information for downstream purposes. Useful applications include eliminating duplicate descriptions in question answering (QA) systems [25,17] and finding redundant sentences for machine translation and text summarization [9]. The goal of recognizing textual entailment relations is to identify, given two text fragments t and h, whether t entails h or not (where t is the entailing text and h is the hypothesis, or entailed text). With the rapid growth of Chinese language processing techniques, NTCIR-RITE [20] has offered an early investigation into these issues, especially for Chinese, since 2011. NTCIR-RITE's challenge has been to create Chinese/English/Japanese benchmark corpora dedicated to textual entailment, that is, recognizing that the meaning of one text is entailed by another. This task is very competitive and has drawn on many text mining techniques, such as machine learning [11,3], Chinese text processing [31,30], and natural language processing [16]. However, previous approaches are required to integrate full parsers. For resource-limited languages, developing a simple yet parser-free textual entailment identification system is indispensable.

E-mail address:
[email protected]
http://dx.doi.org/10.1016/j.knosys.2012.11.009
English textual entailment has received considerable attention in recent years, with the well-known PASCAL workshop on RTE (Recognizing Textual Entailment challenge) [6] as the best example. Tatu and Moldovan [21] proposed a logic-based approach which converts the input text into a set of predicate-argument forms based on an English full parser. To increase lexical coverage, sense knowledge such as WordNet [5] and FrameNet [2] must be included; unfortunately, this method is not easy to apply to Chinese, where such parsers and WordNet-like thesauri are not available for traditional Chinese. Another well-known technique is to introduce machine learning methods [1,14,21,15,13,26], which involve training and testing phases. In the training phase, a number of labeled examples are first mapped to a vector space model through the selected features; the machine learning algorithm then learns to predict the testing examples, which are mapped to the same features. One advantage of this approach is its simplicity and flexibility. Both Malakasiotis and Androutsopoulos [15] and Li et al. [13] derived parse-tree features and similarity-based features [27] to enhance prediction capability. NTCIR RITE [20] was the pioneering competition for Asian textual entailment, covering four languages: English, Japanese, and (simplified and traditional) Chinese. There are three subtasks, namely BC (binary), MC (multiple), and QA (question answering). BC simply identifies whether an entailment relationship holds between the given pair of text fragments (yes or no), while MC labels which of the five relations holds for the given pair. Chinese textual entailment (TE) is a new
Y.-C. Wu / Knowledge-Based Systems 40 (2013) 27–35
open research issue, and there is little literature on the topic. Huang et al. [10] presented a complex Chinese textual entailment recognition system; because no traditional Chinese syntactic parser [28] was available, they convert the text into simplified Chinese for parsing. Furthermore, they propose many heuristics to correct Chinese word segmentation errors and to normalize numeric text. They employed LibSVM (Chang and Lin [3]) to learn the textual entailment relationship. As reported in [10], their most useful feature is "tree mapping", which requires a parser. In this paper, we design a shallow syntactic pattern-based and machine learning-based TE recognition system for the task of Chinese textual entailment. Only Chinese word segmentation and POS tagging information is required. Our approach automatically extracts two different feature types from the training data, statistical features and lexical features, and feeds them into the classifier for training and testing. Each feature type was designed by observing general statistics and patterns. The statistical features were derived from simple word similarity measurements and numeric operations on the pair of hypothesis and given text. The second feature type makes use of lexical information in the training set. Each feature set was evaluated with different kernels using SVMs under different multiclass strategies, such as one-versus-one and one-versus-all [3,11,29]. To further improve the result, we designed a simple classifier ensemble approach that combines the outputs of the individual SVMs; the ensemble framework is based on linear combinations of the classifier outputs. With the above settings, we conducted experiments on the NTCIR-9 RITE MC/BC tasks. Through our experiments, we found that the radial basis function (RBF) kernel SVM achieves better accuracy for our needs.
The experimental results show that we achieved the best result in the traditional Chinese MC task. The enhanced model further improved the official result from 53.56% (official best result) to 54.77%, a 2.25% relative improvement. The system has been implemented for online demonstration purposes.

2. Task analysis

As in paraphrasing, textual entailment in the RITE task requires a system that can identify five different relationships (Forward, Reverse, Bidirectional, Contradiction, and Independence) between the given pairs of text fragments. The forward type is forward entailment, where the hypothesis includes the meaning of the text; that is, h → t, but t cannot infer h in reverse. An example pair of the forward type is shown below. Forward h: 芮氏規模是美國地震學家芮氏在一九三五年所創立的計算公式 (The formula of the Richter scale was designed by US seismologist Richter in 1935) t: 芮氏規模是美國科學家芮氏所發明的。 (The Richter scale was created by US scientist Richter)
The second type is the reverse entailment relation: t can be inferred from h, but h cannot be inferred from t. In other words, t completely includes the entire meaning of h, as seen below. Reverse h: 小柳由紀是埼玉縣人。 (Koyanagi Yuki is from Saitama Prefecture) t: 小柳由紀出生於昭和57年 (1982年) 1月26日, 埼玉縣人。 (Koyanagi Yuki was born on January 26, Showa 57 (1982), and is from Saitama Prefecture.)
For the third type, t and h can each be inferred from the other; that is, h → t ∧ t → h. Below is an example of a bidirectional entailment pair. In this example, the main difference is that "Yahoo! Inc." is rephrased as "Yahoo!" in t. Both terms have the same meaning, so the relationship is "bidirectional." Bidirection h: 雅虎公司執行長楊致遠 (Yahoo! Inc. CEO Jerry Yang) t: 雅虎執行長楊致遠 (Yahoo! CEO Jerry Yang)
Both independence and contradiction indicate that no entailment relationship holds between the text pair. Independence means the given text pair expresses two different meanings, while contradiction means a fact stated in h is false in t. Both types are listed below. Independence h: 911攻擊對美國經濟造成衝擊 (The 911 attacks had an economic impact on the U.S.) t: 2001年第三季購買美國股票的外資淨額遠低於2000年同期的 406億美元 (In the third quarter of 2001, net foreign purchases of U.S. stocks were far lower than the $40.6 billion of the same period in 2000)
Contradiction h: 2010年台印雙邊貿易總值50億美元 (Taiwan-India bilateral trade amounted to $5 billion in 2010) t: 2010年台灣印度雙邊貿易額為64.7億美元, 僅佔台灣貿易總額 的1.2% (Taiwan-India bilateral trade was $6.47 billion in 2010, accounting for only 1.2% of Taiwan's total trade)
As shown in the contradiction type, h states that there was 5 billion US dollars of trade between Taiwan and India in 2010, while t reveals that there was 6.47 billion US dollars of trade in 2010. Obviously, h and t contradict each other.

3. Methodology

Fig. 1 shows the proposed Chinese textual entailment recognition system. The first component (Chinese text preprocessing) initially segments the Chinese text and gives a POS tag to each word. Secondly, we construct the feature set from the training data and map each training example into SVM format. In the third stage, the SVM training and testing modules receive the instances and perform learning and classification. The trained SVM model was used to predict the testing data and give an entailment label to each pair of text fragments. In the following sections, we introduce the three important modules.

3.1. Chinese text processing

The first step in Chinese text processing is to find the words in Chinese. This step is still an ongoing research issue [22,29,30] since, in contrast to most Western languages such as English, there is no explicit space symbol between Chinese words. To resolve this, a
Fig. 1. System architecture of the proposed Chinese textual entailment system.
Chinese word segmentation tool is needed. This tool plays an important role in the preprocessing step, since word information provides the basic term-level concepts for downstream applications. In addition, the POS tag information also reveals basic syntactic structure in text. While there are a number of Chinese word segmentation tools available, in this paper we revise our in-house CMM-based Chinese word segmentation and POS tagging method [30]. This approach has one good property: it can integrate an external vocabulary. The CMM treats the vocabulary as features to enhance the algorithm [22,34]. Thus, we collected terms from the RITE corpus using simple string matching and the initial tagged results; if a word was found more than twice in the RITE corpus, it was added to the vocabulary. The revised vocabulary was then combined with the CMM to form the final Chinese word segmentation component and POS tagger.

3.1.1. Text normalization

A set of Chinese words shares the same meaning as Arabic numerals; for example, 伍 equals 5. These full-width numeral forms also need to be normalized. However, directly transforming these words into numbers is not ideal, as some words might be part of a person's name. To solve this, the normalization process only deals with a small set of POS tags. For Neu (number) and Nd (date) words, Chinese numerical words are directly converted into digits, e.g., 一 = 1, 二 = 2, 叄 = 3. There are still complex Chinese words used to express numbers, such as 二十一 = 21. A simple rule has been designed to handle them: if a specified Chinese word is found (e.g. 十、廿、卅、百、千、萬), a left-right numeric character search is also applied. This search strategy aims to convert compound Chinese numeric words into Arabic numbers. All Chinese numeric words located on the left-hand side of the specified word are converted using the above text normalization method and multiplied by the specified word.
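This left-right conversion rule can be sketched in a few lines. The digit tables and function name below are illustrative, not the paper's actual implementation:

```python
# A minimal sketch of the left-right numeric conversion rule described
# above. Mappings and names are illustrative, not the paper's code.

DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9,
          "伍": 5, "叄": 3}               # a few full-width "capital" variants
MULTIPLIERS = {"十": 10, "百": 100, "千": 1000, "萬": 10000}

def normalize_number(word):
    """Convert a compound Chinese numeral (e.g. 五百五十三) into an integer."""
    total, current = 0, 0
    for ch in word:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in MULTIPLIERS:
            # digits on the left-hand side multiply the specified word;
            # a bare 十 (as in 十三) counts as 1 x 10
            total += (current or 1) * MULTIPLIERS[ch]
            current = 0
        elif ch.isdigit():                 # already-Arabic digits
            current = current * 10 + int(ch)
    return total + current                 # right-hand digits are added
```

For 五百五十三 this yields 5 × 100 + 5 × 10 + 3 = 553, matching the worked example that follows.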
Similarly, all right-hand-side Chinese numeric words are normalized and added to the left-hand-side numbers. For example, for the Chinese numeral 五百五十三 (553): the digits are 五 = 5 and 三 = 3, and the specified words are 百 = 100 and 十 = 10. The result is 5 × 100 + 5 × 10 + 3 = 553.

3.2. Feature model construction

For better prediction power, we construct two different feature types, namely statistical features and lexical features. The former
measures general statistical information about each text pair, while the latter captures lexical-level information using both Chinese words and POS tags. Below we list the features used in this paper.

3.2.1. Type I (statistical features)

Length difference (character-level)
Ldiff_char = | |t_char| − |h_char| |    (1)
where |tchar| is the number of Chinese characters and English words in the fragment t and |hchar| is the number of Chinese characters and English words in the fragment h. Length difference (word-level)
Ldiff_word = | |t_word| − |h_word| |    (2)
where |tword| is the number of Chinese words and English words in the fragment t and |hword| is the number of Chinese words and English words in the fragment h. Character match ratio in t
MatchR(t_char) = |t_char ∩ h_char| / |t_char|    (3)
where |t_char ∩ h_char| is the number of common Chinese characters or English words in the fragments t and h. Character match ratio in h
MatchR(h_char) = |t_char ∩ h_char| / |h_char|    (4)
Word match ratio in t
MatchR(t_word) = |t_word ∩ h_word| / |t_word|    (5)
where |t_word ∩ h_word| is the number of common Chinese words or English words in the fragments t and h. Word match ratio in h
MatchR(h_word) = |t_word ∩ h_word| / |h_word|    (6)
POS match ratio in t
MatchR(t_pos) = |t_pos ∩ h_pos| / |t_pos|    (7)
where |t_pos ∩ h_pos| is the number of common part-of-speech tags in the fragments t and h.
POS match ratio in h
MatchR(h_pos) = |t_pos ∩ h_pos| / |h_pos|    (8)
Pattern match ratio in t

MatchR(t_pattern) = |t_pattern ∩ h_pattern| / |t_pattern|    (9)

where |t_pattern ∩ h_pattern| is the number of common patterns in the fragments t and h. A pattern is defined in Section 3.2.2 below.

Pattern match ratio in h

MatchR(h_pattern) = |t_pattern ∩ h_pattern| / |h_pattern|    (10)

Reversed pattern match ratio in t

MatchR(t_rpattern) = |t_rpattern ∩ h_pattern| / |t_rpattern|    (11)

where t_rpattern is the reversed pattern of fragment t; it reverses the order of the original pattern.

Reversed pattern match ratio in h

MatchR(h_rpattern) = |t_pattern ∩ h_rpattern| / |h_rpattern|    (12)

where h_rpattern is the reversed pattern of fragment h.

Minimum date difference

The date feature, which includes year, month, and day, plays an important role in the RTE task. For the contradiction and independence types, h and t usually mention two different, contradictory dates, while for the other three types the date information in h and t should be consistent. To capture this, we propose a minimum date difference measurement:

f_year = min_{i∈y1, j∈y2} |y1_i − y2_j|    (13)

f_month = min_{i∈m1, j∈m2} |m1_i − m2_j|    (14)

f_day = min_{i∈d1, j∈d2} |d1_i − d2_j|    (15)

where y1 denotes the year terms in h and y2 the year terms in t; m1 the month terms in h and m2 the month terms in t; and d1 the day terms in h and d2 the day terms in t. f_year is the minimum year difference between h and t over all year terms, and f_month and f_day are defined similarly. The idea of this measurement is quite simple: we enumerate all date terms, compute the basic numeric differences, and take the minimum distance as the feature value. Together, these measurements allow the classification algorithm to learn a strong feature for judging the entailment relation. In Section 4.3, we show how much this feature type contributes.

3.2.2. Definition of a pattern

Here, a pattern is predefined as one of the specified POS bigrams and trigrams. We define the following six patterns:

Noun + Verb; Verb + Noun; Noun + Noun; Noun + Verb + Noun; Verb + Noun + Verb; Noun + Noun + Noun

The six patterns are defined to collect the matched statistics. We also reverse the order of each pattern, so that the reversed patterns can be used to find contradictory sentence pairs. To enhance the results, both words and POS tags were used to represent a pattern; for example, for the first pattern, Noun + Verb, both the word bigram and the POS bigram are extracted. In total, 6 × 2 (POS and word) × 2 (plus reverse order) = 24 pattern types were extracted. The following example illustrates how patterns are extracted from a text fragment; it is easy to extend this idea to find all patterns.

h: 日本(Nc) 是(SHI) 投資(VC) 馬來西亞(Nc) 的(DE) 三(Neu) 大(VH) 外商(Na) 之(DE) 一(Neu) (English: Japan is one of the three biggest foreign investors in Malaysia)

The Noun+Verb patterns are: 日本(Nc)+投資(VC), 三(Neu)+大(VH)
The Verb+Noun patterns are: 投資(VC)+馬來西亞(Nc), 大(VH)+外商(Na)
The Noun+Verb+Noun patterns are: 日本(Nc)+投資(VC)+馬來西亞(Nc), 三(Neu)+大(VH)+外商(Na)

Example: Below, we give an example of estimating the word and POS matching scores in Type I; the same procedure can be used to derive the other features, such as the pattern match ratios. Fig. 2 lists the word-level, character-level, and POS tags of a given h and t. The main difference between the two texts is that h contains one more word, w4, than t. Using the definitions above, we can estimate the statistical features as follows.
Length difference (character-level): Ldiff_char = |10 − 8| = 2
Length difference (word-level): Ldiff_word = |4 − 3| = 1
Character match ratio in t: MatchR(t_char) = 8/8 = 1
Character match ratio in h: MatchR(h_char) = 8/10
Word match ratio in t: MatchR(t_word) = 3/3
Word match ratio in h: MatchR(h_word) = 3/4
POS match ratio in t: MatchR(t_pos) = 3/3
POS match ratio in h: MatchR(h_pos) = 3/4
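The Type I computations above can be reproduced with a short sketch. Function and key names are illustrative; the ratios here use set semantics for simplicity, whereas the paper does not specify how duplicate tokens are counted:

```python
# Sketch of the Type I statistical features (Eqs. (1)-(8)) and the minimum
# date difference (Eqs. (13)-(15)), assuming the text pair is already
# segmented and POS-tagged. All names are illustrative.

def match_ratio(a, b):
    """|a ∩ b| / |a|, using set semantics for simplicity."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

def type1_features(t_words, h_words, t_pos, h_pos):
    t_chars = [c for w in t_words for c in w]
    h_chars = [c for w in h_words for c in w]
    return {
        "len_diff_char": abs(len(t_chars) - len(h_chars)),  # Eq. (1)
        "len_diff_word": abs(len(t_words) - len(h_words)),  # Eq. (2)
        "char_match_t":  match_ratio(t_chars, h_chars),     # Eq. (3)
        "char_match_h":  match_ratio(h_chars, t_chars),     # Eq. (4)
        "word_match_t":  match_ratio(t_words, h_words),     # Eq. (5)
        "word_match_h":  match_ratio(h_words, t_words),     # Eq. (6)
        "pos_match_t":   match_ratio(t_pos, h_pos),         # Eq. (7)
        "pos_match_h":   match_ratio(h_pos, t_pos),         # Eq. (8)
    }

def min_date_diff(values_h, values_t):
    """Minimum numeric difference over extracted year/month/day terms."""
    pairs = [abs(a - b) for a in values_h for b in values_t]
    return min(pairs) if pairs else 0
```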
3.2.3. Type II (lexical features)

Matched POS tags. The common POS tags between h and t are kept. We place no constraint on which POS tags may be extracted; each matched POS tag is treated as a word in the model. Following the above example, the matched POS tags of h and t are: Nb, Na, Nb.

Matched Bi-POS tags. This feature set captures the common POS bigrams between h and t. For example, if the POS sequence "NN+NP" occurs in both h and t, the feature is activated. For the above example, the matched Bi-POS tag of h and t is "Na+Nb."

Mismatched POS tags. In contrast to the matched POS feature, a POS tag that appears in only one of h and t is treated as a feature of this type. For example, if the POS tag "NP" appears in h but not in t (or vice versa), then "NP" is activated.
Fig. 2. The word-level, char-level, and POS tag of two given text fragments.
Matched Verb tags. This feature keeps track of the matched verb POS tags in h and t. For instance, the verb POS tag "VP" is activated when it occurs in both h and t.

Mismatched Verb tags
Here, α_i is the weight of training example x_i with non-zero weight (i.e., α_i > 0), and b denotes the bias threshold of the decision. SVs denotes the support vectors, which are exactly the training examples with non-zero weights α_i. K(X, x_i) = φ(X) · φ(x_i) is a pre-defined kernel function that may transform the original feature space from R^D to R^{D′} (usually D ≪ D′).

4. Evaluations and results
This feature extracts the verb POS tags that occur in only one of h and t, showing the differences at the verb level. For example, if the POS tag "VP" is missing from t and "VG" is missing from h, then both "VP" and "VG" are activated as features.

Mismatched Verb words. This feature extracts the verb words that occur in only one of h and t, extending the "Mismatched Verb tags" feature type from POS tags to the verb words themselves.
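The Type II lexical features can be sketched as follows, assuming pre-segmented words and POS tags. All names are illustrative, and we assume verb tags share a common prefix such as "V":

```python
# Sketch of the Type II lexical features: matched/mismatched POS tags,
# POS bigrams, and the verb-level variants. Names are illustrative.

def bigrams(seq):
    return ["+".join(p) for p in zip(seq, seq[1:])]

def type2_features(t_words, h_words, t_pos, h_pos, verb_prefix="V"):
    t_set, h_set = set(t_pos), set(h_pos)
    feats = set()
    feats |= {"POS=" + p for p in t_set & h_set}          # matched POS tags
    feats |= {"BiPOS=" + b                                 # matched POS bigrams
              for b in set(bigrams(t_pos)) & set(bigrams(h_pos))}
    feats |= {"misPOS=" + p for p in t_set ^ h_set}        # tags on one side only
    # verb-level variants (assumes verb tags start with verb_prefix)
    t_verbs = {w for w, p in zip(t_words, t_pos) if p.startswith(verb_prefix)}
    h_verbs = {w for w, p in zip(h_words, h_pos) if p.startswith(verb_prefix)}
    feats |= {"V=" + p for p in t_set & h_set if p.startswith(verb_prefix)}
    feats |= {"misV=" + p for p in t_set ^ h_set if p.startswith(verb_prefix)}
    feats |= {"misVw=" + w for w in t_verbs ^ h_verbs}     # mismatched verb words
    return feats
```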
The testing data was derived from the NTCIR-RITE 2011 task [20], one of the latest benchmark corpora. The training data contained 421 (h, t) pairs, while there were 900 instances in the testing set. We also separated 121 training pairs as development data to tune the parameters: in the training phase, we used 300 instances for training while the other 121 examples were treated as validation data. Once the optimal settings were found, we trained on all 421 training pairs in order to predict the 900 testing pairs. The data statistics are listed in Table 1. The performance of the task is officially measured by accuracy [20].
3.3. Classification algorithm
We adopt the SVM [24] to learn to classify the testing examples. The SVM is a kernel-based classifier which can solve non-linearly separable problems. Given a set of training examples,
(x_1, y_1), (x_2, y_2), …, (x_n, y_n),  x_i ∈ R^D, y_i ∈ {+1, −1}

where x_i is a feature vector in D-dimensional space for the i-th example, and y_i is the label of x_i, either positive or negative. Training an SVM minimizes the following objective function (primal form, soft margin [24]):
minimize: W(w) = (1/2) w · w + C Σ_{i=1}^{n} Loss(w · x_i, y_i)    (16)
The loss function indicates the loss of training error. Usually, the hinge-loss is used [12]. The factor C in (16) is a parameter that allows one to trade off training error and margin size. To classify a given testing example X, the decision rule takes the following form:
y(X) = sign( Σ_{x_i ∈ SVs} α_i y_i K(X, x_i) + b )    (17)
Accuracy = A / N

where N is the number of testing samples and A is the number of samples recognized correctly. In addition, we also provide the well-known evaluation metrics recall, precision, and F-measure [32] to show more detailed results. The recall rate estimates the ratio of answers found by the system, while precision measures the percentage of the system's predicted answers that are correct. The definition of the precision rate here is the same as the accuracy used by the NTCIR-RITE task. Finally, the F-measure combines the recall and precision rates, defined as follows:
F = 2 · recall · precision / (precision + recall)
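These metrics can be sketched directly; for instance, the reported testing recall and precision (54.67% and 52.66%) combine to the reported F-measure of about 53.64%:

```python
# A sketch of the evaluation metrics: accuracy over N testing samples,
# and the F-measure combining recall and precision.

def accuracy(gold, pred):
    """Fraction of testing samples whose label is recognized correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (precision + recall)
```

For example, `f_measure(0.5467, 0.5266)` evaluates to roughly 0.5365.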
4.1. Settings

For the classification algorithm, we adopt LibSVM [3] and SVMlight [11] for training and testing. LibSVM and SVMlight use different strategies for solving multiclass problems. The default setting of LibSVM is one-versus-one multiclass
Table 1. Data statistics of the traditional Chinese RITE task.

Dataset                                   # of instances   Avg. length (word)   Avg. length (char.)
Training data                             421              15.06                24.11
Validation data (derived from training)   121              11.79                20.87
Testing data                              900              15.05                26.10
SVM, while we implemented our own one-versus-all strategy for SVMlight. The kernels used in this paper are: (1) the polynomial kernel with degree 2 (d = 2); and (2) the RBF kernel with a Gaussian width between 0.001 and 1. As seen in (16), the parameter C trades off the trained margin against training errors. Based on our observations, we set C between 0.01 and 10 for all experiments.

4.2. Results

First, we divided the training data to find the optimal parameter settings for the SVM, using classical 10-fold cross-validation on the development data. After parameter tuning, the training process was performed using the entire training data. Table 2 lists the experimental results of our method on the validation and testing datasets. The settings for the RBF kernel were C = 4 and g = 0.0234, while C = 5 and d = 2 for the polynomial kernel. The table lists the optimal settings: RBF kernel + type I features, polynomial kernel + type I features, and the ensemble method, which combines the type I and type II features with the RBF and polynomial kernels. The ensemble method combines four classifiers with weighted voting, where the weight is mainly determined by accuracy on the validation data; details of the ensemble method are described in Section 4.4. The table shows the effectiveness of the proposed approach. In terms of accuracy, the RBF kernel-based SVM achieves better results than the polynomial kernel, with a recall rate of 54.67%, a precision rate of 52.66%, and an F-measure of 53.64%. Under the accuracy metric, our method achieves 54.67% on the testing data. Second, we list the performance of each participant in the NTCIR RITE 2011 task. Table 3 shows the compared results. In the BC track, we convert {F, B, R (swapping t1 and t2)} to the true entailment label, while {I, R, C} are treated as false. As shown in Table 3, nine distinct participants submitted results for this task. The official best system was obtained by our previous work [31].
A well-known manual rule-based approach [8] was excluded from the official comparison (see the official paper [20]), since it seemed to require substantial human effort rather than proposing an automatic method. It was the only pure rule-based method in the NTCIR-RITE task, achieving more than 90% accuracy in the BC task and better than 75% in the MC task; however, a comparison with this approach would be unfair since its rules were created manually. Rule-based systems depend on domain knowledge and human resources, and their development may take several weeks [8]. When porting to larger data and different domains, the rules must be re-developed, and the cost of maintaining rules is much higher than for a learning-based approach. Excluding this system, our method is the state-of-the-art textual entailment system in the competition. Clearly, our method requires no external resources as mentioned above; it merely derives the statistical features and POS pattern features after performing Chinese word segmentation and POS tagging. The second best method was produced by IMTKU [4], who compiled two well-known Chinese word lexicons and a
Table 3. Competition results of the NTCIR RITE 2011 task (accuracy).

Participants                   MC (%)   BC (%)
MCU (previous version)         53.56    55.44
IMTKU [4]                      52.22    55.56
IASLD [19]                     50.11    66.11
ICRC_HITSZ [33]                49.89    61.33
III_CYUT_NTHU [31]             49.11    65.00
NTU [10]                       48.33    60.78
Yuntech [7]                    47.67    52.78
NTOUA (Lin and Hsiao, 2011)    46.11    61.33
UIOWA [8]                      78.67    90.78
This paper                     54.80    66.60
self-developed antonym list. Their approach requires a syntactic parser and extracts multiple features, such as word/character similarity, lexical overlap, and pair length, based on the three lexicons. In comparison to our method, it is not only more complicated but also requires many more external resources, and it is quite difficult to port such a complex system to different domains and languages. One telling example is ICRC_HITSZ [33], which showed the best performance on the simplified Chinese task but performed significantly worse than the other approaches here, even though they integrated the most resources for simplified Chinese. For the Japanese RITE task, IBM_JA (Tsuboi et al. [23]) also demonstrated the power of integrating rich external resources and a self-developed ontology. Although these approaches produced good results in a single language, they are quite difficult to port to other resource-scarce domains and languages, and it is not easy to identify which resource contributes to the result. The third best system was produced by IASLD [19]. Similar to IMTKU, they employed rich lexicons and human effort (an exclusive word list), with a small difference: IASLD further combined head-modifier information and named entity tagging. They showed that the use of head-modifier relations obtained positive results; however, the relation is defined specifically for certain categories, such as bidirectional. In comparison, our POS patterns use more general POS bigram and trigram combinations, and no parser is required. On the other hand, we evaluated the statistical significance between methods by comparing their output labels. The null hypothesis (h0) is that two approaches A and B have no difference, i.e., A = B. We performed three tests, namely the s-test, McNemar test, and p-test. Table 4 lists the results of the significance tests.
It is not surprising that the methods adopted in this paper differ significantly from our previous work [31]. Also, only the s-test reveals a significant difference between the classifier ensemble method and the polynomial kernel SVM.

4.3. The effect of SVM implementations

In this section, we report the effect of different SVM implementations and their multiclass performance. This experiment was carried out by applying the one-versus-one (OVO) and one-versus-all (OVA) strategies to LibSVM and SVMlight. SVMlight does not
Table 2. Experimental results on the traditional Chinese MC task.

                     Validation data                               Testing data
Method               Recall (%)  Precision (acc.) (%)  F (%)       Recall (%)  Precision (acc.) (%)  F (%)
RBF kernel           50.78       48.61                 49.67       54.67       52.66                 53.64
Polynomial kernel    49.33       47.38                 48.34       52.89       52.56                 52.72
Ensemble method      50.78       48.85                 49.79       54.78       52.56                 53.65
Table 4. Statistical significance tests using the classifier outputs.

Method A            Method B            s-test   McNemar test   p-test
This paper          [31]                >>       >>             >>
Polynomial kernel   [31]                >>       >>             >>
RBF kernel          [31]                >>       >>             >>
RBF kernel          Polynomial kernel   >        >              >>
Ensemble            RBF kernel          >>       >>             >
Ensemble            Polynomial kernel   >        ~              ~

> means p-value < 0.05; >> means p-value < 0.01; ~ means p-value greater than 0.05.
implement any multiclass algorithm, whereas LibSVM is essentially an OVO-based multiclass SVM. We implemented OVO and OVA for SVMlight to see the effect. The validation data was again used to find the optimal parameter settings for the different classifiers. Table 5 lists the results of SVMlight and LibSVM using the polynomial kernel, and Table 6 shows the results with the RBF kernel. It is clear that the polynomial kernel in SVMlight beats LibSVM in both the MC (53.67%) and BC (65.67%) tasks. On the contrary, using the RBF kernel, LibSVM shows much better accuracy than SVMlight + OVO (54.66% versus 36.67%). The best performance is obtained by LibSVM + OVO using the RBF kernel, and SVMlight + OVO
Table 5. Performance comparison between SVMlight and LibSVM using the polynomial kernel (accuracy).

Method            MC (%)   BC (%)
SVMlight + OVA    49.78    -
SVMlight + OVO    53.67    65.67
LIBSVM + OVO      52.89    65.22
Table 6. Performance comparison between SVMlight and LibSVM using the RBF kernel (accuracy).

Method            MC (%)   BC (%)
SVMlight + OVA    34.22    -
SVMlight + OVO    36.67    37.89
LIBSVM + OVO      54.66    66.55
using the polynomial kernel obtains the second best accuracy. This experiment also reveals that SVMs with the OVO multiclass strategy achieve better accuracy than with OVA.

4.4. The effect of features

Recall that two feature types are used in this paper (statistical and lexical). Here, we report the actual performance of the two feature types on the testing data. Table 7 lists the results using the RBF and polynomial kernels. In this experiment, the RBF kernel is realized with LibSVM + OVO, while SVMlight + OVO is used for the polynomial kernel. As shown in Table 7, the statistical features (type I) yield higher accuracy than the lexical features (type II) and than type I + II. The type II features alone are much weaker for textual entailment, and even combining the two feature types does not reach the performance of type I alone. Second, we continued the experiment by removing each feature in type I. Initially, all features were in the pool; we then eliminated one feature at a time and performed training and testing on the testing data. Table 8 shows the detailed results using the RBF and polynomial kernels. The third row of Table 8 is the preprocessing step rather than a feature (see Section 3.2); it shows the effect of our text normalization method. The word match ratio plays the most important role in this task: without it, accuracy drops from 54.66% to 50.22%. The second most important feature is the pattern match ratio (54.66% → 53.22%). By contrast, the length difference feature causes the smallest drop in accuracy.

4.5. Combining multiple classifiers

Based on the previous experiments, we propose a simple weighted classifier ensemble approach that determines the final answer by combining multiple classifiers. Four classifiers are used: RBF + type I, polynomial + type I, polynomial + type II, and RBF + type I + type II. Fig. 3 illustrates the basic idea of the ensemble, where the weight factors W1 to W4 represent the weight of each classifier.
Table 7. The effect of different feature types (accuracy).

  Method                     Accuracy on testing data (%)
  Type I + RBF               54.66
  Type II + RBF              39.66
  Type I + II + RBF          48.44
  Type I + Polynomial        53.67
  Type II + Polynomial       40.22
  Type I + II + Polynomial   47.67

We use the following formula to decide the label of a testing instance:

  y(X) = arg max_{c_i ∈ C} Σ_j W_j · y_j(c_i)
where c_i is the category score of class i and y_j(c_i) is the output probability of classifier j. In this paper, we use the sigmoid function of Platt [18] to generate the probability outputs of the SVMs. The empirical results of the classifier ensemble are as follows. In the BC task, it achieves 66.60% accuracy, while in the MC task it reaches 54.80%. Compared to the best single classifier (RBF + OVO + type I), the ensemble method performs slightly better, and it clearly improves on the polynomial kernel + OVO + type II SVM.
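The weighted voting rule above can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes per-classifier class probabilities (in the paper these come from Platt's sigmoid fitting on SVM outputs [18]), and the weights and probabilities shown are placeholders, not the values used in the experiments.

```python
# A minimal sketch of the weighted soft-voting ensemble:
# pick the class c_i maximizing sum_j W_j * y_j(c_i).
import numpy as np

def ensemble_label(probs, weights, classes):
    """probs: (n_classifiers, n_classes) matrix of y_j(c_i);
    weights: length-n_classifiers vector of W_j."""
    scores = np.asarray(weights) @ np.asarray(probs)  # weighted sum per class
    return classes[int(np.argmax(scores))]

classes = ["F", "R", "B", "C", "I"]
probs = [[0.6, 0.1, 0.1, 0.1, 0.1],   # e.g. RBF + type I
         [0.2, 0.5, 0.1, 0.1, 0.1],   # e.g. polynomial + type I
         [0.3, 0.3, 0.2, 0.1, 0.1],   # e.g. polynomial + type II
         [0.5, 0.2, 0.1, 0.1, 0.1]]   # e.g. RBF + type I + II
weights = [0.4, 0.3, 0.1, 0.2]        # illustrative W_1..W_4
print(ensemble_label(probs, weights, classes))  # -> "F"
```

Note that with uniform weights this reduces to plain probability averaging; the weights let stronger classifiers (e.g. RBF + type I) dominate the vote.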
Table 8. Feature importance analysis (accuracy on testing data, %).

  Method                                       RBF kernel   Polynomial kernel
  Full features                                54.66        53.67
  - text normalization (preprocessing step)    54.00        52.22
  - length difference                          54.33        53.00
  - character match ratio                      54.33        52.55
  - word match ratio                           50.22        48.78
  - POS match ratio                            54.33        52.66
  - pattern match ratio                        53.22        52.44
  - minimum date difference                    53.66        52.66
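The leave-one-feature-out analysis behind Table 8 can be sketched as a loop that retrains the classifier with each feature column removed. The data, classifier settings, and feature names here are placeholders standing in for the paper's type I features, not the actual experimental setup.

```python
# A hedged sketch of the feature-ablation loop: compare the full feature set
# against retraining with one feature column dropped at a time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

feature_names = ["length_diff", "char_match", "word_match",
                 "pos_match", "pattern_match", "date_diff"]  # illustrative
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
for i, name in enumerate(feature_names):
    X_tr_i = np.delete(X_tr, i, axis=1)   # drop feature i from training data
    X_te_i = np.delete(X_te, i, axis=1)   # and from testing data
    acc = SVC(kernel="rbf").fit(X_tr_i, y_tr).score(X_te_i, y_te)
    print(f"- {name}: {acc:.4f} (full: {full_acc:.4f})")
```

A large drop relative to the full-feature accuracy flags an important feature, which is how the word match ratio stands out in Table 8.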
Fig. 3. An illustration of the simple weighted classifier ensemble.

4.6. Learning curve
At this stage, the optimal features and kernels for the SVM had been found. We further tested the performance under different sizes of training data. In this experiment, the training data was divided into 10 groups, from 10% to 100% of the training examples. Figs. 4 and 5 plot the learning curves (accuracy and F-measure) obtained by feeding varying amounts of training data. Here, we report the accuracy of three classifiers: RBF + type I, polynomial + type I, and the ensemble method.

As shown in Fig. 4, RBF + type I and the ensemble method have almost identical curves, while polynomial + type I is clearly worse. Surprisingly, however, polynomial + type I attains a better F-measure with 40% of the training data. Its recall and precision at that point are very balanced (recall = 47.22%, precision = 49.73%) compared to RBF + type I (recall = 48.22%, precision = 45.31%). Beyond this point, the precision of polynomial + type I drops to 45–47% while its recall grows steadily. The main reason is that polynomial + type I is too weak to distinguish "independence" from the other classes (especially "bidirectional").

Table 9 lists the class-to-class match and mismatch counts. The classifier performs poorly on the independence class and strongly on the "Forward" and "Reverse" categories. For the independence gold class, each of the three classes F, R, and B receives more instances than I itself is correctly assigned. Adding more training data does not resolve this problem (the precision rate does not increase). Therefore, a better feature for class I is needed; we leave this improvement for future work.
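The learning-curve experiment can be sketched as training on increasing fractions of the training set and scoring on a fixed test set. The synthetic data and model here are placeholders, not the NTCIR-RITE setup.

```python
# A minimal sketch of the learning-curve experiment: train on 10%..100% of
# the training data and record accuracy on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

curve = []
for pct in range(10, 101, 10):
    n = max(2, len(X_tr) * pct // 100)          # size of this training slice
    clf = SVC(kernel="rbf").fit(X_tr[:n], y_tr[:n])
    curve.append((pct, clf.score(X_te, y_te)))

for pct, acc in curve:
    print(f"{pct:3d}% of training data -> accuracy {acc:.3f}")
```

Plotting the `curve` pairs reproduces the shape of a figure like Fig. 4; a flat tail indicates that extra training data no longer helps, which matches the paper's observation for the independence class.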
Fig. 4. Learning curve of different SVM kernels (accuracy).

Fig. 5. Learning curve of different SVM kernels (F-measure).

5. Conclusion
Recognizing Inference in Text has become an important research topic in recent years. This paper presents early work on resource-limited textual entailment recognition. A machine learning approach that integrates pattern-based and statistical information was designed to address the RITE problems. Two well-known kernel methods were embedded as classifiers to learn to tag the entailment relation. In addition, an ensemble method that combines four classifiers and feature sets was presented. The experimental study was carried out on the NTCIR RITE task. The empirical results show that (1) the RBF kernel with the statistical features achieves the best single-classifier result; (2) the one-versus-one multiclass strategy is better than one-versus-all in all experiments; (3) purely lexical features perform poorly in testing; and (4) the classifier ensemble did improve the overall performance. In the future, we plan to incorporate more unlabeled data (semi-supervised learning) to improve the results. Also, when a parser is available, we will adopt parse tree features to enhance the method. An online demonstration of our method can be found at http://120.96.128.186/ritc_ct.
Table 9. Actual class-to-class match and mismatch result (rows: gold class; columns: system prediction).

  Gold class      Forward   Reverse   Bidirection   Contradiction   Independence
  Forward         127       1         36            9               7
  Reverse         4         123       32            16              5
  Bidirection     21        9         125           4               21
  Contradiction   41        25        27            84              3
  Independence    35        25        92            11              17
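The weakness on the independence class can be made explicit by recomputing per-class recall and precision from the confusion matrix in Table 9. This sketch assumes the reconstructed layout (rows = gold class, columns = system prediction).

```python
# Per-class recall/precision derived from Table 9's confusion matrix,
# assuming rows are gold classes and columns are system predictions.
import numpy as np

classes = ["Forward", "Reverse", "Bidirection", "Contradiction", "Independence"]
cm = np.array([[127,   1,  36,   9,   7],
               [  4, 123,  32,  16,   5],
               [ 21,   9, 125,   4,  21],
               [ 41,  25,  27,  84,   3],
               [ 35,  25,  92,  11,  17]])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / gold-class total
precision = cm.diagonal() / cm.sum(axis=0)   # correct / predicted total
for name, r, p in zip(classes, recall, precision):
    print(f"{name:13s} recall={r:.3f} precision={p:.3f}")
```

Under this layout, Independence recall is only 17/180 ≈ 9.4%, while Forward and Reverse both exceed 68%, consistent with the discussion above.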
Acknowledgement

The author acknowledges support under NSC Grants NSC 101-2221-E-130-027- and NSC 101-2622-E-130-006-CC3.

References

[1] I. Androutsopoulos, P. Malakasiotis, A survey of paraphrasing and textual entailment methods, Journal of Artificial Intelligence Research 38 (2010) 135–187.
[2] C.F. Baker, C.J. Fillmore, J.B. Lowe, The Berkeley FrameNet project, in: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, 1998, pp. 86–90.
[3] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (27) (2011) 1–27.
[4] M.Y. Day, R.Y. Lee, C.T. Liu, C. Tu, C.S. Tseng, L.T. Yap, C.L. Huang, Y.H. Chiu, W.Z. Hong, IMTKU textual entailment system for recognizing inference in text at NTCIR-9 RITE, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 339–344.
[5] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
[6] D. Giampiccolo, B. Magnini, I. Dagan, B. Dolan, The third PASCAL recognizing textual entailment challenge, in: ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 1–9.
[7] N.H. Han, L.W. Ku, The Yuntech system in the NTCIR-9 RITE task, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 345–348.
[8] C.G. Harris, UIOWA at NTCIR-9 RITE: using the power of the crowd to establish inference rules, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 318–324.
[9] R. He, B. Qin, T. Liu, A novel approach to update summarization using evolutionary manifold-ranking and spectral clustering, Expert Systems with Applications 39 (2012) 2375–2384.
[10] W.C. Huang, S.H. Wu, L.P. Chen, T. Ku, Chinese textual entailment analysis, in: Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing, 2011.
[11] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the European Conference on Machine Learning, 1998, pp. 137–142.
[12] S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research 6 (2005) 341–361.
[13] B. Li, J. Irwin, E.V. Garcia, A. Ram, Machine learning based semantic inference: experiments and observations at RTE-3, in: ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 159–164.
[14] P. Malakasiotis, Paraphrase recognition using machine learning to combine similarity measures, in: Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, 2009, pp. 27–35.
[15] P. Malakasiotis, I. Androutsopoulos, Learning textual entailment using SVMs and string similarity measures, in: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 42–47.
[16] C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, London, 1999.
[17] H.J. Oh, S.H. Myaeng, M.G. Jang, Semantic passage segmentation based on sentence topics for question answering, Information Sciences 177 (18) (2007) 3696–3717.
[18] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: Advances in Large Margin Classifiers, 1999.
[19] C.W. Shih, C.W. Lee, T.H. Yang, W.L. Hsu, IASL RITE system at NTCIR-9, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 379–385.
[20] H. Shima, H. Kanayama, C.W. Lee, C.J. Lin, T. Mitamura, Y. Miyao, S. Shi, K. Takeda, Overview of NTCIR-9 RITE: recognizing inference in text, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 291–301.
[21] M. Tatu, D. Moldovan, COGEX at RTE3, in: ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 22–27.
[22] R.T.H. Tsai, Chinese text segmentation: a hybrid approach using transductive learning and statistical association measures, Expert Systems with Applications 37 (5) (2010) 3553–3560.
[23] Y. Tsuboi, H. Kanayama, M. Andohno, Syntactic difference-based approach for the NTCIR-9 RITE task, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 404–411.
[24] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[25] E.M. Voorhees, Overview of the TREC 2001 question answering track, in: Proceedings of the 10th Text Retrieval Conference, 2001, pp. 42–52.
[26] S. Wan, M. Dras, R. Dale, C. Paris, Using dependency-based features to take the "parafarce" out of paraphrase, in: Proceedings of the Australasian Language Technology Workshop, 2006, pp. 131–138.
[27] R. Wang, G. Neumann, Recognizing textual entailment using sentence similarity based on dependency tree skeletons, in: ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 36–41.
[28] Y.C. Wu, Y.S. Lee, J.C. Yang, Robust and efficient Chinese word dependency analysis with linear kernel support vector machines, in: Proceedings of COLING, 2008, pp. 135–138.
[29] Y.C. Wu, Y.S. Lee, J.C. Yang, Robust and efficient multiclass SVM models for phrase pattern recognition, Pattern Recognition 41 (9) (2008) 2874–2889.
[30] Y.C. Wu, J.C. Yang, Y.S. Lee, S.J. Yen, Chinese word segmentation with conditional support vector inspired Markov models, in: Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP 2010), 2010, pp. 228–233.
[31] S.H. Wu, W.C. Huang, L.P. Chen, T. Ku, Binary-class and multi-class Chinese textual entailment system description in NTCIR-9 RITE, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 422–426.
[32] Y. Yang, An evaluation of statistical approaches to text categorization, Journal of Information Retrieval 1 (1/2) (1999) 67–88.
[33] Y. Zhang, J. Xu, C. Liu, X. Wang, R. Xu, Q. Chen, X. Wang, Y. Hou, B. Tang, ICRC_HITSZ at RITE: leveraging multiple classifiers voting for textual entailment recognition, in: Proceedings of the NTCIR-9 Workshop, 2011, pp. 325–329.
[34] H. Zhao, C. Kit, Incorporating global information into supervised learning for Chinese word segmentation, in: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, 2007, pp. 66–74.