Information Processing and Management 43 (2007) 353–364
Answer extraction and ranking strategies for definitional question answering using linguistic features and definition terminology Kyoung-Soo Han, Young-In Song, Sang-Bum Kim, Hae-Chang Rim


Department of Computer Science and Engineering, Korea University, 1, 5-ga, Anam-dong, Seongbuk-gu, Seoul 136-701, Republic of Korea Received 26 May 2006; accepted 25 July 2006

Abstract

We propose answer extraction and ranking strategies for definitional question answering using linguistic features and definition terminology. A passage expansion technique based on simple anaphora resolution is introduced to retrieve more informative sentences, and a phrase extraction method based on syntactic information of the sentences is proposed to generate a more concise answer. In order to rank the phrases, we use several types of evidence, including external definitions and definition terminology. Although external definitions are useful, it is obvious that they cannot cover all possible targets. The definition terminology score, which reflects how definition-like a phrase is, is devised to compensate for the incomplete external definitions. Experimental results show that the proposed answer extraction and ranking methods are effective, and also show that our proposed system is comparable to state-of-the-art systems.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Passage expansion; Phrase extraction; Definition terminology; Definitional question answering

1. Introduction

Definitional question answering (QA) is the task of answering definitional questions, such as What are fractals? and Who is Andrew Carnegie?, initiated by the TREC Question Answering Track (Voorhees, 2003). Definitional QA has several characteristics that differ from factoid QA, which handles questions such as What country is the Aswan High Dam located in? Definitional questions do not clearly imply an expected answer type but contain only the question target (e.g., fractals and Andrew Carnegie in the above example questions), contrary to factoid questions, which involve a narrow answer type (e.g., country name for the above example question). Thus, it is hard to determine which information is useful for the answer to a definitional question. Another difference is that a short passage cannot answer a definitional question, because a definition needs several essential pieces of information about the target. Therefore, the answer to a definitional question can consist of several pieces of component information called information nuggets. Each answer nugget is naturally represented by a short noun phrase or verb phrase. Good definitional QA systems should find more answer nuggets within a shorter length.

The general architecture of definitional QA systems consists of the following three components: question analysis, passage retrieval, and answer selection. The question target extracted from the question sentence in the question analysis phase is used for composing a passage retrieval query. The query can be expanded with additional terms extracted by co-occurrence statistics from the definitions of external resources such as a dictionary and an encyclopedia (Gaizauskas, Greenwood, Hepple, Roberts, & Saggion, 2004; Saggion & Gaizauskas, 2004), or extracted from the WordNet synset (Wu et al., 2004). Usually all sentences in the passages retrieved by the query are used as answer candidates. The top-ranked candidates, based on several types of evidence such as definition patterns and similarity to the definitions of external resources, are selected as the answer (Cui, Li, Sun, Chua, & Kan, 2004; Cui, Kan, Chua, & Xiao, 2004; Harabagiu et al., 2003; Hildebrandt, Katz, & Lin, 2004; Saggion & Gaizauskas, 2004).

This paper mainly focuses on the passage retrieval and answer selection components. After documents relevant to the question target are retrieved, the sentences containing the target are extracted from the retrieved documents in the passage retrieval phase. The question target is expressed in various ways in documents, and an anaphor is often used to indicate it. Therefore, it is necessary to expand retrieved passages so that passages in which the target is represented by an anaphor can also be retrieved. Some works (Gaizauskas et al., 2004; Katz et al., 2004) introduced anaphora resolution techniques into definitional QA. Katz et al. (2004) applied generic coreference resolution techniques to the entire corpus, but the coreference resolution could not improve the performance. Full coreference resolution is computationally expensive, and incorrect resolution may cause a definitional QA system to fail to find the correct answer. It is necessary to limit the resolution scope to anaphora referring to the question target which can be correctly resolved. Thus, this paper suggests a passage expansion method using simple pronoun resolution rules.

By using shorter text segments instead of the sentence itself, we can reduce answer granularity and include more information per unit length. Harabagiu et al. (2003), Blair-Goldensohn, McKeown, and Schlaikjer (2003), and Xu, Weischedel, and Licuanan (2004) extracted shorter text segments from retrieved sentences as answer candidates. Xu et al. (2004) extracted linguistic constructs, such as relations and propositions, using information extraction tools. Blair-Goldensohn et al. (2003) used a predicate set defined by semantic categories such as genus, species, cause, and effect. In contrast to the previous works, we extract noun and verb phrases as answer candidates by using definition patterns which are constructed based on the syntactic parsing results of the sentences. The syntactic patterns are useful for extracting phrases which have few lexical clues and are distant from the question target. Even a few syntactic patterns can cover a lot of descriptive phrases.
External definitions, that is, definitions from external resources such as online dictionaries and encyclopedias, are known to be useful for ranking answer candidates (Cui, Kan, et al., 2004). Although external definitions provide relevant information, they cannot cover all question targets. Thus we propose a ranking criterion that considers the characteristics of the definition itself. Although Echihabi et al. (2003) used a similar approach for ranking answer candidates, ours is different in that we identify the target type and build the terminology according to the type. We believe that term importance in a definition varies according to the target type. For example, scientist, born, and died are important terms in person definitions, whereas member, found, and locate are meaningful in organization definitions. Therefore, we pre-compile the definition terminology (i.e., terms used for defining something) and find definition-like candidates using it.

The remainder of this paper is organized as follows. Our candidate extraction and ranking strategies are described in Sections 2 and 3, respectively, and experimental results are given in Section 4. Finally, we conclude our work in Section 5.

2. Answer candidate extraction

The question target is assumed to be expressed explicitly in the question sentence. Questions that need more inference to identify the real target, such as What is the most popular UK outdoors activity?, are out of the scope of this paper.

The target is extracted from the question sentence, and the type of the target is identified using the named entity tagger IdentiFinder (Bikel, Schwartz, & Weischedel, 1999). The target is classified into one of three types: person, organization, or other things. If a target is not classified as person or organization by the named entity tagger, it is classified as other things. The target type is used for expanding passages and ranking answer candidates in later stages.

2.1. Passage retrieval and expansion

2.1.1. Two-phase retrieval

As the target tends to be expressed differently in documents and in the question, a lot of relevant information cannot be retrieved by a one-phase passage retrieval method. Therefore, we first retrieve only documents relevant to the target by generating a relatively strict query, and then extract relevant sentences by using a more relaxed one. The query for document retrieval consists of words and phrases of the target filtered with a stopword list. If there is a sequence of two words starting with a capital letter, a phrase query is generated from the two words. The remaining words are also used as single query words. For example, for the target Berkman Center for Internet and Society, the query would include a phrase berkman_center and two single words internet and society.

Once the documents are retrieved, passages consisting of sentences that contain the head word of the target are generated. We check whether each passage can be expanded into a multiple-sentence passage using the simple anaphora resolution technique described in the next section.

2.1.2. Passage expansion using target-focused anaphora resolution

We also carry out a passage expansion method to retrieve sentences in which the question target appears as a personal pronoun. We try to resolve only those pronouns which can be correctly resolved. When the question target is used as the subject of a sentence, the following simple resolution rules are applied to the following sentence according to the target type:

- Person: If the starting word of the next sentence is he or she, it is replaced with the question target.
- Organization or things: If the starting word of the next sentence is it or they, it is replaced with the question target.

For example, if sentence (a) is followed by sentence (b) in a document, the pronoun he in (b) is replaced with Bill Clinton from (a).

(a) Former president Bill Clinton was born in Hope, Arkansas.
(b) He was named William Jefferson Blythe IV after his father, William Jefferson Blythe III.

Using this simple method, we can extract informative sentences related to the question target without using a full anaphora resolution method.

2.2. Candidate extraction using syntactic patterns

All sentences in the retrieved passages are usually used as answer candidates. However, a sentence may be so long that it is likely to contain information which is not related to the question target. Thus we try to extract target-related parts of sentences using the syntactic structure of the sentences. If such parts are extracted, they are used as answer candidates; otherwise, the sentences themselves are used as candidates. We extract noun phrases and verb phrases from the sentences using the syntactic patterns shown in Table 1. In this study, we use the syntactic information generated by the Conexor FDG parser (Tapanainen & Jarvinen, 1997).¹

¹ For newspaper articles, the parser can attach heads with 95.3% precision and 87.9% recall.
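As a concrete illustration of the passage expansion rules in Section 2.1.2, the following Python sketch applies the sentence-initial pronoun substitution to pre-split sentences. The function name and data layout are illustrative, and the subject test of the original system is approximated here by a simple substring check; this is a sketch of the rules, not the original implementation.

# Illustrative sketch of the target-focused pronoun resolution rules
# (Section 2.1.2). Assumes sentences are pre-split strings and the
# target type is one of "person", "organization", or "thing".
TARGET_PRONOUNS = {
    "person": ("he", "she"),
    "organization": ("it", "they"),
    "thing": ("it", "they"),
}

def expand_passage(sentences, target, target_type):
    """Replace a sentence-initial pronoun with the target when the
    previous sentence uses the target (approximated by a substring check,
    whereas the original rules require the target to be the subject)."""
    pronouns = TARGET_PRONOUNS[target_type]
    expanded = []
    for i, sent in enumerate(sentences):
        words = sent.split()
        if (i > 0 and words
                and words[0].lower() in pronouns
                and target.lower() in sentences[i - 1].lower()):
            sent = " ".join([target] + words[1:])
        expanded.append(sent)
    return expanded

sents = [
    "Former president Bill Clinton was born in Hope, Arkansas.",
    "He was named William Jefferson Blythe IV after his father, William Jefferson Blythe III.",
]
print(expand_passage(sents, "Bill Clinton", "person")[1])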


Table 1
Syntactic patterns for extracting answer candidates

ModNP: Noun phrases that have a direct syntactic relation to the question target.
  Example: Former world and Olympic champion Alberto Tomba missed out on the chance of his 50th World Cup win when he straddled a gate in the first run.

IsaNP: Noun phrases used as a complement of the verb be.
  Example: TB is a bacterial disease caused by the Tuberculosis mycobacterium and transmitted through the air.

RelVP: Verb phrases where a nominative or possessive relative pronoun directly modifies the question target.
  Example: Copland, who was born in Brooklyn, would have turned 100 on Nov. 14, 2000.

PartVP: Present or past participles, without a subject, directly modifying the question target or the main verb directly related to the question target.
  Example: Tomba, known as "La Bomba" (the bomb) for his explosive skiing style, had hinted at retirement for years, but always burst back on the scene to stun his rivals and savor another victory.

GenVP: Verb phrases directly modified by the question target when the target is the subject of the sentence. If the head word of a phrase is among the stop verbs, the phrase is not extracted; the stop verbs are uninformative functional verbs such as be, say, talk, and tell.
  Example: Iqra will initially broadcast eight hours a day of children's programs, game shows, soap operas, economic programs and religious talk shows.
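To illustrate how a pattern such as RelVP in Table 1 might be realized on a dependency parse, the sketch below uses spaCy purely as a stand-in for the Conexor FDG parser used in this work. The matching logic is a simplified approximation, not the original implementation, and the exact output depends on the parsing model.

# Simplified sketch of a RelVP-style pattern (Table 1) over a dependency
# parse, with spaCy standing in for the Conexor FDG parser.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relvp(sentence, target_head):
    """Return relative-clause verb phrases modifying the target head noun,
    e.g. 'Copland, who was born in Brooklyn, ...' -> 'was born in Brooklyn'."""
    doc = nlp(sentence)
    phrases = []
    for token in doc:
        # relcl = relative clause modifier; its head should be the target noun
        if token.dep_ == "relcl" and token.head.text.lower() == target_head.lower():
            subtree = [t.text for t in token.subtree]
            # drop a leading relative pronoun such as who/which/that
            if subtree and subtree[0].lower() in ("who", "whom", "which", "that", "whose"):
                subtree = subtree[1:]
            phrases.append(" ".join(subtree))
    return phrases

print(extract_relvp(
    "Copland, who was born in Brooklyn, would have turned 100 on Nov. 14, 2000.",
    "Copland"))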

The syntactic information produced by the parser is prone to errors. Thus, we complement the error-prone syntactic information with POS information as follows:

- If any word between the first and the last word of the extracted phrase is not extracted, it is inserted between the two words. For the RelVP example in Table 1, if the phrase "born Brooklyn" is extracted, the phrase is changed into "born in Brooklyn".
- If the last word of the extracted phrase is labeled with a noun-dependent POS such as adjective, determiner, or preposition, the immediately following noun phrase is merged into the extracted phrase. For the above example, if the phrase "born in" is extracted, the phrase is altered into "born in Brooklyn".
- If the extracted phrase is incomplete, that is, it ends with a POS such as conjunction or relative pronoun, the last word is removed from the extracted phrase. For an example sentence "Copland was born in Brooklyn and won an Oscar.", if the phrase "born in Brooklyn and" is extracted, the phrase is changed into "born in Brooklyn".

Phrases containing more than two content words and a noun or a number are considered to be valid candidates. We eliminate redundant candidates using a word overlap measure and semantic class matching of the head word. A pair of candidates is considered redundant when the candidates highly overlap (above 80%) at the lexical level, or when their head words belong to the same synset in WordNet (Fellbaum, 1998) with a modest word overlap (above 50%). Once redundancy is detected, the more highly overlapping candidate is eliminated. Although redundancy is an obstacle to producing a short, novel definition, redundant information is likely to be important; redundancy has also been used as an effective ranking measure in factoid question answering (Dumais, Banko, Brill, Lin, & Ng, 2002). Therefore, we use the redundancy count of the eliminated candidates in the candidate ranking phase (a short illustrative sketch of this redundancy check is given below).

3. Answer candidate ranking

It is difficult to decide which candidates are definitions and which are not. Thus, we try to rank the candidates according to their definition likelihood. We use several criteria to rank answer candidates: redundancy, term statistics in the relevant passages, external definitions, and definition terminology. We normalize each score to between 0 and 1 and combine them into a final score.
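The redundancy check described at the end of Section 2.2 can be sketched as follows. The thresholds are those quoted in the text, the overlap normalization (by the shorter candidate) is an assumption, and NLTK's WordNet interface stands in for the WordNet lookup used in the original system.

# Sketch of the redundancy check from Section 2.2: two candidates are
# treated as redundant if their lexical overlap exceeds 80%, or if their
# head words share a WordNet synset and the overlap exceeds 50%.
# Requires the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def overlap(a, b):
    """Word-overlap ratio between two candidate strings (assumed here to
    be normalized by the shorter candidate)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def same_synset(head_a, head_b):
    """True if the two head words share at least one WordNet synset."""
    return bool(set(wn.synsets(head_a)) & set(wn.synsets(head_b)))

def is_redundant(cand_a, head_a, cand_b, head_b):
    ov = overlap(cand_a, cand_b)
    return ov > 0.8 or (ov > 0.5 and same_synset(head_a, head_b))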

3.1. Redundancy

Important facts or events are usually mentioned repeatedly. As redundancy is checked in the preceding candidate extraction phase, the redundancy of an answer candidate C can be expressed by the following redundancy ratio:

\mathrm{RddRatio}(C) = \frac{r}{n}

where r is the redundancy count of answer candidate C in the candidate set, and n is the total number of answer candidates. The redundancy score is calculated by the following scaled version:

\mathrm{Rdd}(C) = \log_2\left(\frac{\mathrm{RddRatio}(C)}{\max_j \mathrm{RddRatio}(C_j)} + 1\right)    (1)

where \max_j \mathrm{RddRatio}(C_j) is the maximum redundancy ratio among all candidates.

3.2. Local term statistics

As important facts or events are mentioned repeatedly, target-related terms occur frequently. Thus the frequent words in the retrieved passages (i.e., local passages) are considered to be important. Loc(C) is a score based on the term statistics in the retrieved passages and is calculated as follows:

\mathrm{Loc}(C) = \log_2\left(\frac{\sum_{t_i \in C} sf_i / \max_j sf_j}{|C|} + 1\right)    (2)

where sf_i is the number of sentences in which the term t_i occurs, \max_j sf_j is the maximum value of sf among all terms, and |C| is the number of content words in the answer candidate C.

3.3. External definitions

Definitions of a question target extracted from external resources such as online dictionaries or encyclopedias are called external definitions. Candidates that have a higher probability of occurring in the external definitions than in general text are likely to be an answer. The probability ratio is measured by the following equation:

\mathrm{ExtRatio}(C) = \frac{P(C|E)}{P(C)}

where P(C|E) is the probability that the candidate C occurs in the external definitions E, and P(C) is the prior probability of the candidate. The external definition score Ext(C) is calculated by the following hyperbolic tangent sigmoid function:

\mathrm{Ext}(C) = \frac{1 - e^{-\mathrm{ExtRatio}(C)}}{1 + e^{-\mathrm{ExtRatio}(C)}}    (3)

Since the probability ratio ExtRatio(C) is above 0, the score Ext(C) is between 0 and 1. Each probability is estimated by MLE (maximum likelihood estimation):

P(C|E) = \left(\prod_{t_i \in C} \frac{freq_{i,E}}{|E|}\right)^{1/|C|}    (4)

P(C) = \left(\prod_{t_i \in C} \frac{freq_{i,B}}{|B|}\right)^{1/|C|}    (5)

where freq_{i,E} is the number of occurrences of the term t_i in the external definitions E, and |E| is the total number of term occurrences in the external definitions; freq_{i,B} and |B| are the corresponding counts in the background general collection. |C| is the number of content words in the candidate C and is used for normalizing the probabilities.

3.4. Definition terminology

Although external definitions are useful for ranking candidates, it is obvious that they cannot cover all possible targets. In order to alleviate this problem, we devise a definition terminology score which reflects how definition-like the candidate phrase is. For the definition terminology, we collected external definitions according to the three target types. We compare the term statistics in the definitions to those in general text, assuming that the difference in term statistics can serve as a measure of definition terminology. The definition terminology score of an answer candidate C is calculated based on the term statistics as follows:

\mathrm{Tmn}(C) = \log_2\left(\frac{\sum_{t_i \in C} \mathrm{DefTerm}(t_i) / \max_j \mathrm{DefTerm}(t_j)}{|C|} + 1\right)    (6)

where DefTerm(t_i) is the definition terminology score for a term t_i in the candidate, and \max_j \mathrm{DefTerm}(t_j) is the maximum value of the score. In order to measure DefTerm(t_i), we tried several measures, including ones which have been used for feature selection in text categorization (Yang & Pedersen, 1997). Each measure refers to the following two-way contingency table of a term t and the definition class D: a is the number of times t and D co-occur, b is the number of times t occurs without D, c is the number of times D occurs without t, d is the number of times neither t nor D occurs, and N is the total number of documents.

- Mutual information: a criterion commonly used in statistical language modeling of word associations. Under mutual information, rare terms receive a higher score than common terms:

  I(t_i, D) = \log \frac{P(t_i, D)}{P(t_i) P(D)}

  \mathrm{DefTerm}_{mi}(t_i) = \log \frac{a \cdot N}{(a + c)(a + b)}

- Information gain: measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document:

  G(t_i) = \sum_{c_j \in \{D, \bar{D}\}} P(t_i, c_j) I(t_i, c_j) + \sum_{c_j \in \{D, \bar{D}\}} P(\bar{t}_i, c_j) I(\bar{t}_i, c_j)

  \mathrm{DefTerm}_{ig}(t_i) = \frac{a}{N} \log \frac{a \cdot N}{(a + c)(a + b)} + \frac{b}{N} \log \frac{b \cdot N}{(a + b)(b + d)} + \frac{c}{N} \log \frac{c \cdot N}{(c + d)(a + c)} + \frac{d}{N} \log \frac{d \cdot N}{(c + d)(b + d)}

- χ² statistic: measures the lack of independence between t and D:

  \mathrm{DefTerm}_{chi}(t_i) = \frac{N (ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}

- Probability ratio: the ratio of the probability of a term in the definitions D to its probability in general text:

  \mathrm{DefTerm}_{pratio}(t_i) = \frac{P_D(t_i)}{P(t_i)}

The probabilities are estimated by MLE as follows:

P_D(t_i) = \frac{freq_{i,D}}{|D|}

P(t_i) = \frac{freq_{i,B}}{|B|}
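To make the ranking formulas concrete, the following sketch computes the single scores of Eqs. (1)–(3) and (6) for one candidate from pre-computed statistics (redundancy count, sentence frequencies, the external-definition probability ratio, and per-term DefTerm values). The data structures and the toy numbers are illustrative only, not those of the original system.

import math

def rdd_score(r, n, max_ratio):
    """Redundancy score, Eq. (1): scaled redundancy ratio r/n."""
    return math.log2((r / n) / max_ratio + 1)

def loc_score(content_words, sf, max_sf):
    """Local term statistics, Eq. (2): sentence frequencies sf_i normalized
    by the maximum sf and by candidate length |C|."""
    total = sum(sf.get(t, 0) / max_sf for t in content_words)
    return math.log2(total / len(content_words) + 1)

def ext_score(ext_ratio):
    """External definition score, Eq. (3): hyperbolic-tangent-style squashing
    of the probability ratio P(C|E)/P(C) into (0, 1)."""
    return (1 - math.exp(-ext_ratio)) / (1 + math.exp(-ext_ratio))

def tmn_score(content_words, defterm, max_defterm):
    """Definition terminology score, Eq. (6)."""
    total = sum(defterm.get(t, 0.0) / max_defterm for t in content_words)
    return math.log2(total / len(content_words) + 1)

# Toy usage with made-up statistics
words = ["bacterial", "disease", "caused", "tuberculosis"]
sf = {"bacterial": 3, "disease": 7, "caused": 5, "tuberculosis": 9}
print(rdd_score(r=2, n=40, max_ratio=0.2),
      loc_score(words, sf, max_sf=9),
      ext_score(1.7),
      tmn_score(words, {"disease": 4.0, "caused": 2.5}, max_defterm=4.0))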

3.5. Score combination

The criteria mentioned so far are linearly combined into a single score:

\mathrm{Score}(C) = \lambda_1 \mathrm{Rdd}(C) + \lambda_2 \mathrm{Loc}(C) + \lambda_3 \mathrm{Ext}(C) + \lambda_4 \mathrm{Tmn}(C)    (7)

where the \lambda_j are tuning parameters satisfying \sum_j \lambda_j = 1 and are set empirically. The top-ranked candidates are selected as the final answer.

4. Experimental results

4.1. Experimental setup

We experimented with 50 TREC 2003 topics and 64 TREC 2004 topics, and found the answers in the AQUAINT corpus used for the TREC Question Answering Track evaluation. The TREC answer set for the definitional question answering task consists of several definition nuggets for each target, and each nugget is a short string classified as either vital or okay. A vital nugget is information that must be present in a good definition, whereas an okay nugget is information that is interesting but not essential. Because TREC 2004 presents a series of questions about each question target, including factoid, list, and definitional questions, the answers for definitional questions exclude the answers for factoid and list questions (Voorhees, 2004). However, for evaluating definitional question answering systems only, the answers for those questions about a target can be considered part of the answer for the definitional question about the target. Thus we expanded the TREC 2004 answers by adding the answers for factoid questions of each topic, and used them to evaluate the TREC 2004 topics.²

² When we compare our systems with other TREC participant systems, we use the original gold-standard answers.

The evaluation of systems involves matching up the answer nuggets and the system output. Because manual evaluation such as the TREC evaluation is very costly, we evaluated our system using the automatic measure POURPRE (Lin & Demner-Fushman, 2005). In order to automatically match the nuggets, a match score for each nugget n_i in the TREC answer A is calculated based on term co-occurrences as follows:

\mathrm{MS}(n_i) = \frac{\max_j |n_i \wedge s_j|}{|n_i|}

where |n_i| is the number of terms in the answer nugget, and |n_i \wedge s_j| is the number of terms overlapping with a system output nugget s_j. POURPRE estimates the TREC metrics, recall R, allowance a, precision P, and F-measure, using the match score as follows:

R = \frac{\sum_{n_i \in A} \mathrm{MS}(n_i)}{R}    (8)

a = 100 \times (r' + a')    (9)

P = \begin{cases} 1 & (l < a) \\ 1 - \frac{l - a}{l} & (l \geq a) \end{cases}    (10)

F(\beta) = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}    (11)
where R (in the denominator) is the number of vital answer nuggets, l is the number of non-white-space characters in the entire system output, and r' and a' are the numbers of vital and okay nuggets, respectively, that have a non-zero match score. The F-measure is the official measure in the TREC evaluation, and β is set to three, favoring recall heavily.

We used external definitions from various online sites: Biography.com, Columbia Encyclopedia, Wikipedia, FOLDOC, The American Heritage Dictionary of the English Language, and the Online Medical Dictionary. The external definitions are collected at query time by submitting a query consisting of the head words of the question target to each site. In order to extract definition terminology, we also collected definitions according to the target type: 14,904 person, 994 organization, and 3639 thing entries. We used our own document retrieval engine based on the OKAPI BM25 ranking function (Sparck Jones, Walker, & Robertson, 1998), and processed the top 200 retrieved documents in all experiments.

4.2. Passage expansion

Table 2 shows the effect of passage expansion using target-focused anaphora resolution. The upper two rows show the result for all retrieved sentences without applying candidate extraction and ranking, and the lower two rows show the result for top-ranked candidates up to 2000 non-white-space characters with candidate extraction and ranking. Without PE means that passage expansion is not applied, and With PE means that the expansion is applied.

Table 2
Effect of passage expansion

                                  TREC 2003                      TREC 2004
                                  Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
All sent           Without PE     0.4909  0.0493  0.1260         0.5367  0.0285  0.1434
                   With PE        0.4909  0.0476  0.1249         0.5380  0.0286  0.1437
Top ranked 2000B   Without PE     0.4314  0.1911  0.3522         0.4524  0.2820  0.4227
                   With PE        0.4478  0.1959  0.3662         0.4535  0.2829  0.4238

While passage expansion makes little difference in answering performance for all sentences, the performance is slightly improved for top-ranked candidates. The median number of added sentences is three and two for TREC 2003 and 2004, respectively. As only a few sentences are added, they contain little new information, as indicated by the negligible change for all sentences. On the other hand, the added sentences improve the ranking by promoting candidates eligible for the answer.

4.3. Candidate extraction

Table 3 shows the performance according to the candidate units. Sent only and Phrase only mean that all sentences and all phrases, respectively, are used as answer candidates. Phrase + sent is the system that uses phrases if any syntactic pattern is matched but uses raw sentences otherwise. Passage expansion is applied to all three systems, and the performance is measured over all candidates without ranking.

Table 3
Comparison of candidate units

                TREC 2003                      TREC 2004
                Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
Sent only       0.4909  0.0476  0.1249         0.5380  0.0286  0.1437
Phrase only     0.3993  0.1892  0.2765         0.4233  0.2657  0.3225
Phrase + sent   0.4859  0.0579  0.1439         0.5357  0.0380  0.1641

As the phrases are extracted from the sentences, the recall of Phrase only is lower than that of Sent only. In spite of the shorter text, the phrase system covers 81.3% (0.3993/0.4909) and 78.7% (0.4233/0.5380) of the answer nuggets covered by all sentences for the TREC 2003 and 2004 data, respectively. Phrase + sent balances recall and precision.

Figs. 1 and 2 show the changes in performance of each system according to the answer length, measured by the number of non-white-space characters. As shown in the figures, the phrase-based system Phrase only is better than the sentence-based system Sent only for short answers, up to about 900 bytes for TREC 2003 and 600 bytes for TREC 2004, but it is better to use phrases and sentences together for longer answers.

Fig. 1. Performance changes according to answer length: TREC 2003. [Plot of F(β = 3) versus answer length in non-white-space characters (0–2000) for sent only, phrase only, and phrase + sent.]

Fig. 2. Performance changes according to answer length: TREC 2004. [Plot of F(β = 3) versus answer length in non-white-space characters (0–2000) for sent only, phrase only, and phrase + sent.]
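The F(β = 3) values plotted in Figs. 1 and 2 and reported in the tables follow the POURPRE-style computation of Eqs. (8)–(11). The sketch below spells that computation out under the definitions of Section 4.1, with nuggets and system output represented as plain strings; it is an illustration of the formulas, not the official POURPRE implementation, and it assumes at least one vital nugget per target.

# Simplified sketch of the POURPRE-style scoring in Eqs. (8)-(11).
# answer_nuggets: list of (text, is_vital) pairs; system_nuggets: list of strings.
def match_score(nugget, system_nuggets):
    """MS(n_i): best term-overlap ratio against any system nugget."""
    n_terms = set(nugget.lower().split())
    best = max((len(n_terms & set(s.lower().split())) for s in system_nuggets),
               default=0)
    return best / len(n_terms)

def pourpre_f(answer_nuggets, system_nuggets, output_length, beta=3.0):
    vital = [n for n, is_vital in answer_nuggets if is_vital]
    okay = [n for n, is_vital in answer_nuggets if not is_vital]
    ms = {n: match_score(n, system_nuggets) for n, _ in answer_nuggets}
    # Eq. (8): recall (sum restricted to vital nuggets here, divided by their count)
    recall = sum(ms[n] for n in vital) / len(vital)
    # Eq. (9): length allowance from matched vital and okay nuggets
    allowance = 100 * (sum(ms[n] > 0 for n in vital) +
                       sum(ms[n] > 0 for n in okay))
    l = output_length
    # Eq. (10): length-based precision
    precision = 1.0 if l < allowance else 1 - (l - allowance) / l
    # Eq. (11): F-measure with beta = 3 favoring recall
    denom = beta ** 2 * precision + recall
    return 0.0 if denom == 0 else (beta ** 2 + 1) * precision * recall / denom

answer = [("born in Hope, Arkansas", True), ("42nd president", True),
          ("named after his father", False)]
system = ["Bill Clinton was born in Hope, Arkansas.",
          "He was the 42nd president of the United States."]
length = sum(len(s.replace(" ", "")) for s in system)
print(round(pourpre_f(answer, system, length), 4))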

For the TREC 2004 data, the performance of Phrase + sent is slightly worse than Sent only for long answers, probably because of an insufficient number of phrases: the median number of extracted phrases is 81.5 for the TREC 2004 data, compared to 110.5 for the TREC 2003 data. In general, it is safe and effective to use phrases and sentences together.

4.4. Candidate ranking

Table 4 shows the ranking performance of each definition terminology measure. In order to compare the definition terminology measures in isolation, no other ranking criterion is used in this experiment. As shown in the table, the proposed probability ratio is the best definition terminology measure, and mutual information is the worst.

Table 4
Comparison of definition terminology measures (top ranked 2000B)

                  TREC 2003                      TREC 2004
Measure type      Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
mi                0.3699  0.1707  0.3050         0.3868  0.2451  0.3625
ig                0.3977  0.1778  0.3260         0.4040  0.2542  0.3779
chi               0.3795  0.1725  0.3108         0.3875  0.2448  0.3628
pratio            0.4172  0.1859  0.3415         0.4297  0.2682  0.4015

Table 5 shows the results of ranking score combinations, where rdd, loc, ext, and tmn indicate the redundancy, local term statistics, external definition, and definition terminology scores, respectively. Each of these systems uses only the single ranking criterion for ranking candidates. The all system is the combination of all scores, where the tuning parameters λ1, λ2, λ3, and λ4 are set to 0.15, 0.15, 0.4, and 0.3, respectively, considering the single-measure performance. The table shows that the external definition score is the best ranking measure, and that the definition terminology score is also a very good measure. The definition terminology score is expected to make the performance robust even when there is no definition of the question target in the external resources.

Table 5
Comparison of ranking score combinations (top ranked 2000B)

                     TREC 2003                      TREC 2004
Combination type     Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
rdd                  0.3756  0.1754  0.3108         0.3959  0.2493  0.3705
loc                  0.3673  0.1700  0.3026         0.3797  0.2375  0.3548
ext                  0.4440  0.1938  0.3625         0.4530  0.2826  0.4234
tmn                  0.4172  0.1859  0.3415         0.4297  0.2682  0.4015
all                  0.4478  0.1959  0.3662         0.4535  0.2829  0.4238

4.5. Comparison with TREC participant systems

We compared our proposed system with previous TREC participant systems. For these experiments, the raw run files of each system were provided by NIST.³ For the TREC 2004 evaluation, run files for only the definitional questions are used. Table 6 shows the POURPRE evaluation results of the TREC top five systems and of our proposed system, Proposed, with the answer length set to 1500 bytes and 2000 bytes, respectively. Because the responses of the TREC 2004 participant systems were generated on the assumption that the answers for definitional questions do not have to include the answers for other question types, we evaluated the systems with the original TREC 2004 answers. Our system may be slightly underestimated for the TREC 2004 questions because it does not consider other types of questions.

³ http://trec.nist.gov/

Table 6
Comparison with TREC participant systems based on POURPRE

                                    TREC 2003                      TREC 2004
                                    Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
TREC participant systems (top 5)    0.3979  0.3513  0.3644         0.3468  0.1920  0.3139
                                    0.4229  0.2009  0.3531         0.3412  0.1920  0.3107
                                    0.3939  0.2062  0.3402         0.2950  0.2451  0.2803
                                    0.3314  0.4961  0.3348         0.3053  0.1682  0.2766
                                    0.3790  0.2658  0.3321         0.3174  0.1894  0.2676
Proposed (1500B)                    0.4224  0.2336  0.3648         0.3124  0.1939  0.2907
Proposed (2000B)                    0.4478  0.1959  0.3662         0.3323  0.1555  0.2933

The table shows that our system is comparable to state-of-the-art definitional QA systems.

5. Conclusions

This paper proposed a definitional question answering system that extracts answer candidates based on linguistic features and ranks the candidates based on various ranking measures, in particular definition terminology. Our main findings can be summarized as follows:

- The passage expansion technique using a simple target-focused anaphora resolution technique can add informative sentences related to the question target. The added sentences have a positive effect on the ranking performance.
- The phrase extraction method based on syntactic patterns is useful for short definitions. However, for a long definition, it is better to use phrases and sentences together.
- The external definitions and the definition terminology turn out to be effective and complementary measures for ranking candidates.

In this study, we designed our definitional QA system so that it can cope with error-prone linguistic tools, which sometimes forces us to apply very strict measures. In future work, we will try to gradually relax these constraints without degrading the performance.

References

Bikel, D. M., Schwartz, R. L., & Weischedel, R. M. (1999). An algorithm that learns what's in a name. Machine Learning, 34(1–3), 211–231.
Blair-Goldensohn, S., McKeown, K. R., & Schlaikjer, A. H. (2003). A hybrid approach for QA track definitional questions. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 185–192).
Cui, H., Kan, M.-Y., Chua, T.-S., & Xiao, J. (2004). A comparative study on sentence retrieval for definitional question answering. In SIGIR workshop on information retrieval for question answering (IR4QA).
Cui, H., Li, K., Sun, R., Chua, T.-S., & Kan, M.-Y. (2004). National University of Singapore at the TREC-13 question answering. In Proceedings of the 13th text retrieval conference (TREC-2004).
Dumais, S., Banko, M., Brill, E., Lin, J., & Ng, A. (2002). Web question answering: is more always better? In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-2002) (pp. 291–298).
Echihabi, A., Hermjakob, U., Hovy, E., Marcu, D., Melz, E., & Ravichandran, D. (2003). Multiple-engine question answering in TextMap. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 772–781).
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. The MIT Press.
Gaizauskas, R., Greenwood, M. A., Hepple, M., Roberts, I., & Saggion, H. (2004). The University of Sheffield's TREC 2004 Q&A experiments. In Proceedings of the 13th text retrieval conference (TREC-2004).
Harabagiu, S., Moldovan, D., Clark, C., Bowden, M., Williams, J., & Bensley, J. (2003). Answer mining by combining extraction techniques with abductive reasoning. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 375–382).
Hildebrandt, W., Katz, B., & Lin, J. (2004). Answering definition questions with multiple knowledge sources. In Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics (HLT-NAACL-2004) (pp. 49–56).
Katz, B., Bilotti, M., Felshin, S., Fernandes, A., Hildebrandt, W., Katzir, R., Lin, J., Loreto, D., Marton, G., Mora, F., & Uzuner, O. (2004). Answering multiple questions on a topic from heterogeneous resources. In Proceedings of the 13th text retrieval conference (TREC-2004).
Lin, J., & Demner-Fushman, D. (2005). Automatically evaluating answers to definition questions. In Proceedings of the human language technology conference and conference on empirical methods in natural language processing (HLT-EMNLP-2005).
Saggion, H., & Gaizauskas, R. (2004). Mining on-line sources for definition knowledge. In Proceedings of the 17th international Florida artificial intelligence research society conference (FLAIRS-2004).
Sparck Jones, K., Walker, S., & Robertson, S. E. (1998). A probabilistic model of information retrieval: development and status. Technical Report 446, University of Cambridge Computer Laboratory.
Tapanainen, P., & Jarvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th conference on applied natural language processing (pp. 64–71).
Voorhees, E. M. (2003). Overview of the TREC 2003 question answering track. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 54–68).
Voorhees, E. M. (2004). Overview of the TREC 2004 question answering track. In Proceedings of the 13th text retrieval conference (TREC-2004).
Wu, L., Huang, X., You, L., Zhang, Z., Li, X., & Zhou, Y. (2004). FDUQA on TREC 2004 QA track. In Proceedings of the 13th text retrieval conference (TREC-2004).
Xu, J., Weischedel, R., & Licuanan, A. (2004). Evaluation of an extraction-based approach to answering definitional questions. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-2004) (pp. 418–424).
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th international conference on machine learning (pp. 412–420).