Information Processing and Management 43 (2007) 353–364
Answer extraction and ranking strategies for definitional question answering using linguistic features and definition terminology Kyoung-Soo Han, Young-In Song, Sang-Bum Kim, Hae-Chang Rim


Department of Computer Science and Engineering, Korea University, 1, 5-ga, Anam-dong, Seongbuk-gu, Seoul 136-701, Republic of Korea Received 26 May 2006; accepted 25 July 2006

Abstract

We propose answer extraction and ranking strategies for definitional question answering using linguistic features and definition terminology. A passage expansion technique based on simple anaphora resolution is introduced to retrieve more informative sentences, and a phrase extraction method based on syntactic information of the sentences is proposed to generate a more concise answer. In order to rank the phrases, we use several types of evidence, including external definitions and definition terminology. Although external definitions are useful, it is obvious that they cannot cover all possible targets. The definition terminology score, which reflects how definition-like a phrase is, is devised to compensate for the incomplete external definitions. Experimental results show that the proposed answer extraction and ranking methods are effective, and also show that our proposed system is comparable to state-of-the-art systems.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Passage expansion; Phrase extraction; Definition terminology; Definitional question answering

1. Introduction

Definitional question answering (QA) is the task of answering definitional questions, such as What are fractals? and Who is Andrew Carnegie?, initiated by the TREC Question Answering Track (Voorhees, 2003). Definitional QA has several characteristics that differ from factoid QA, which handles questions such as What country is the Aswan High Dam located in? Definitional questions do not clearly imply an expected answer type but contain only the question target (e.g., fractals and Andrew Carnegie in the above example questions), contrary to factoid questions, which involve a narrow answer type (e.g., country name for the above example question). Thus, it is hard to determine which information is useful for the answer to a definitional question. Another difference is that a short passage cannot answer a definitional question, because a definition needs several essential pieces of information about the target. Therefore, the answer to a definitional question can consist of several pieces of component information called information nuggets. Each answer nugget is naturally represented by a short noun phrase or verb phrase. Good definitional QA systems should find more answer nuggets within a shorter length.

The general architecture of definitional QA systems consists of the following three components: question analysis, passage retrieval, and answer selection. The question target extracted from the question sentence in the question analysis phase is used for composing a passage retrieval query. The query can be expanded with additional terms extracted by co-occurrence statistics from the definitions of external resources such as a dictionary and an encyclopedia (Gaizauskas, Greenwood, Hepple, Roberts, & Saggion, 2004; Saggion & Gaizauskas, 2004), or extracted from the WordNet synset (Wu et al., 2004). Usually all sentences in the passages retrieved by the query are used as answer candidates. The top-ranked candidates, based on several types of evidence such as definition patterns and similarity to the definitions of external resources, are selected as the answer (Cui, Li, Sun, Chua, & Kan, 2004; Cui, Kan, Chua, & Xiao, 2004; Harabagiu et al., 2003; Hildebrandt, Katz, & Lin, 2004; Saggion & Gaizauskas, 2004).

This paper mainly focuses on the passage retrieval and answer selection components. After documents relevant to the question target are retrieved, the sentences containing the target are extracted from the retrieved documents in the passage retrieval phase. The question target is expressed in various ways in documents, and an anaphor is often used to indicate it. Therefore, it is necessary to expand retrieved passages so that passages in which the target is represented by an anaphor can also be retrieved. Some works (Gaizauskas et al., 2004; Katz et al., 2004) introduced anaphora resolution techniques into definitional QA. Katz et al. (2004) applied generic coreference resolution techniques to the entire corpus, but the coreference resolution could not improve the performance. Full coreference resolution is computationally expensive, and incorrect resolution may cause a definitional QA system to fail to find the correct answer. It is necessary to limit the resolution scope to anaphora referring to the question target which can be correctly resolved. Thus, this paper suggests a passage expansion method using simple pronoun resolution rules.

By using shorter text segments instead of the sentence itself, we can reduce answer granularity and include more information per unit length. Harabagiu et al. (2003), Blair-Goldensohn, McKeown, and Schlaikjer (2003), and Xu, Weischedel, and Licuanan (2004) extracted shorter text segments from retrieved sentences as answer candidates. Xu et al. (2004) extracted linguistic constructs, such as relations and propositions, using information extraction tools. Blair-Goldensohn et al. (2003) used a predicate set defined by semantic categories such as genus, species, cause, and effect. In contrast to the previous works, we extract noun and verb phrases as answer candidates by using definition patterns which are constructed based on the syntactic parsing results of the sentences. The syntactic patterns are useful for extracting phrases which have few lexical clues and are distant from the question target. Even a few syntactic patterns can cover a lot of descriptive phrases.
External definitions, that is, definitions from external resources such as online dictionaries and encyclopedias, are known to be useful for ranking answer candidates (Cui, Kan, et al., 2004). Although external definitions provide relevant information, they cannot cover all question targets. Thus we propose a ranking criterion that considers the characteristics of the definition itself. Although Echihabi et al. (2003) used a similar approach for ranking answer candidates, ours is different in that we identify the target type and build the terminology according to the type. We believe that term importance in a definition varies according to the target type. For example, scientist, born, and died are important terms in person definitions, whereas member, found, and locate are meaningful in organization definitions. Therefore, we pre-compile the definition terminology (i.e., terms used for defining something) and find definition-like candidates using it.

The remainder of this paper is organized as follows. Our candidate extraction and ranking strategies are described in Sections 2 and 3, respectively, and experimental results are given in Section 4. Finally, we conclude our work in Section 5.

2. Answer candidate extraction

The question target is assumed to be expressed explicitly in the question sentence. Questions that need more inference to identify the real target, such as What is the most popular UK outdoors activity?, are out of the scope of this paper.

The target is extracted from the question sentence, and the type of the target is identified using the named entity tagger IdentiFinder (Bikel, Schwartz, & Weischedel, 1999). The target is classified into one of three types: person, organization, or other things. If a target is not classified as person or organization by the named entity tagger, it is classified as other things. The target type is used for expanding passages and ranking answer candidates in later stages.

2.1. Passage retrieval and expansion

2.1.1. Two-phase retrieval

As the target tends to be expressed differently in documents and in the question, a lot of relevant information cannot be retrieved by a one-phase passage retrieval method. Therefore, we first retrieve only documents relevant to the target by generating a relatively strict query, and then extract relevant sentences by using a more relaxed one. The query for document retrieval consists of words and phrases of the target filtered with a stopword list. If there is a sequence of two words starting with a capital letter, a phrase query is generated from the two words. The remaining words are also used as single query words. For example, for the target Berkman Center for Internet and Society, the query would include a phrase berkman_center and two single words internet and society.

Once the documents are retrieved, passages consisting of sentences that contain the head word of the target are generated. We check whether each passage can be expanded into a multiple-sentence passage using the simple anaphora resolution technique described in the next section.

2.1.2. Passage expansion using target-focused anaphora resolution

We also carry out a passage expansion method to retrieve sentences in which the question target appears as a personal pronoun. We try to resolve only those pronouns which can be correctly resolved. When the question target is used as the subject of a sentence, the following simple resolution rules are applied to the following sentence according to the target type:

- Person: If the starting word of the next sentence is he or she, it is replaced with the question target.
- Organization or things: If the starting word of the next sentence is it or they, it is replaced with the question target.

For example, if sentence (a) is followed by sentence (b) in a document, the pronoun he in (b) is replaced with Bill Clinton from (a).

(a) Former president Bill Clinton was born in Hope, Arkansas.
(b) He was named William Jefferson Blythe IV after his father, William Jefferson Blythe III.

Using this simple method, we can extract informative sentences related to the question target without using a full anaphora resolution method.

2.2. Candidate extraction using syntactic patterns

All sentences in the retrieved passages are usually used as answer candidates. However, a sentence may be so long that it is likely to contain information which is not related to the question target. Thus we try to extract target-related parts of sentences using the syntactic structure of the sentences. If such parts are extracted, they are used as answer candidates; otherwise, the sentences themselves are used as candidates. We extract noun phrases and verb phrases from the sentences using the syntactic patterns shown in Table 1. In this study, we use the syntactic information generated by the Conexor FDG parser (Tapanainen & Jarvinen, 1997).¹

¹ For newspaper articles, the parser can attach heads with 95.3% precision and 87.9% recall.
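As a concrete illustration of the passage expansion rules in Section 2.1.2, the following Python sketch applies the sentence-initial pronoun substitution to pre-split sentences. The function name and data layout are illustrative, and the subject test of the original system is approximated here by a simple substring check; this is a sketch of the rules, not the original implementation.

# Illustrative sketch of the target-focused pronoun resolution rules
# (Section 2.1.2). Assumes sentences are pre-split strings and the
# target type is one of "person", "organization", or "thing".
TARGET_PRONOUNS = {
    "person": ("he", "she"),
    "organization": ("it", "they"),
    "thing": ("it", "they"),
}

def expand_passage(sentences, target, target_type):
    """Replace a sentence-initial pronoun with the target when the
    previous sentence uses the target (approximated by a substring check,
    whereas the original rules require the target to be the subject)."""
    pronouns = TARGET_PRONOUNS[target_type]
    expanded = []
    for i, sent in enumerate(sentences):
        words = sent.split()
        if (i > 0 and words
                and words[0].lower() in pronouns
                and target.lower() in sentences[i - 1].lower()):
            sent = " ".join([target] + words[1:])
        expanded.append(sent)
    return expanded

sents = [
    "Former president Bill Clinton was born in Hope, Arkansas.",
    "He was named William Jefferson Blythe IV after his father, William Jefferson Blythe III.",
]
print(expand_passage(sents, "Bill Clinton", "person")[1])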


Table 1
Syntactic patterns for extracting answer candidates

ModNP: Noun phrases that have a direct syntactic relation to the question target.
  Example: Former world and Olympic champion Alberto Tomba missed out on the chance of his 50th World Cup win when he straddled a gate in the first run.

IsaNP: Noun phrases used as a complement of the verb be.
  Example: TB is a bacterial disease caused by the Tuberculosis mycobacterium and transmitted through the air.

RelVP: Verb phrases where a nominative or possessive relative pronoun directly modifies the question target.
  Example: Copland, who was born in Brooklyn, would have turned 100 on Nov. 14, 2000.

PartVP: Present or past participles, without a subject, directly modifying the question target or the main verb directly related to the question target.
  Example: Tomba, known as "La Bomba" (the bomb) for his explosive skiing style, had hinted at retirement for years, but always burst back on the scene to stun his rivals and savor another victory.

GenVP: Verb phrases directly modified by the question target when the target is the subject of the sentence. If the head word of a phrase is among the stop verbs, the phrase is not extracted; the stop verbs are uninformative functional verbs such as be, say, talk, and tell.
  Example: Iqra will initially broadcast eight hours a day of children's programs, game shows, soap operas, economic programs and religious talk shows.
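To illustrate how a pattern such as RelVP in Table 1 might be realized on a dependency parse, the sketch below uses spaCy purely as a stand-in for the Conexor FDG parser used in this work. The matching logic is a simplified approximation, not the original implementation, and the exact output depends on the parsing model.

# Simplified sketch of a RelVP-style pattern (Table 1) over a dependency
# parse, with spaCy standing in for the Conexor FDG parser.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relvp(sentence, target_head):
    """Return relative-clause verb phrases modifying the target head noun,
    e.g. 'Copland, who was born in Brooklyn, ...' -> 'was born in Brooklyn'."""
    doc = nlp(sentence)
    phrases = []
    for token in doc:
        # relcl = relative clause modifier; its head should be the target noun
        if token.dep_ == "relcl" and token.head.text.lower() == target_head.lower():
            subtree = [t.text for t in token.subtree]
            # drop a leading relative pronoun such as who/which/that
            if subtree and subtree[0].lower() in ("who", "whom", "which", "that", "whose"):
                subtree = subtree[1:]
            phrases.append(" ".join(subtree))
    return phrases

print(extract_relvp(
    "Copland, who was born in Brooklyn, would have turned 100 on Nov. 14, 2000.",
    "Copland"))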

The syntactic information produced by the parser is prone to errors. Thus, we complement the error-prone syntactic information with POS information as follows:

- If any word between the first and the last word of the extracted phrase is not extracted, it is inserted between the two words. For the RelVP example in Table 1, if the phrase "born Brooklyn" is extracted, the phrase is changed into "born in Brooklyn".
- If the last word of the extracted phrase is labeled with a noun-dependent POS such as adjective, determiner, or preposition, the immediately following noun phrase is merged into the extracted phrase. For the above example, if the phrase "born in" is extracted, the phrase is altered into "born in Brooklyn".
- If the extracted phrase is incomplete, that is, it ends with a POS such as conjunction or relative pronoun, the last word is removed from the extracted phrase. For an example sentence "Copland was born in Brooklyn and won an Oscar.", if the phrase "born in Brooklyn and" is extracted, the phrase is changed into "born in Brooklyn".

Phrases containing more than two content words and a noun or a number are considered to be valid candidates. We eliminate redundant candidates using a word overlap measure and semantic class matching of the head word. A pair of candidates is considered redundant when the candidates highly overlap (above 80%) at the lexical level, or when their head words belong to the same synset in WordNet (Fellbaum, 1998) with a modest word overlap (above 50%). Once redundancy is detected, the more highly overlapping candidate is eliminated. Although redundancy is an obstacle to producing a short, novel definition, redundant information is likely to be important; redundancy has also been used as an effective ranking measure in factoid question answering (Dumais, Banko, Brill, Lin, & Ng, 2002). Therefore, we use the redundancy count of the eliminated candidates in the candidate ranking phase (a short illustrative sketch of this redundancy check is given below).

3. Answer candidate ranking

It is difficult to decide which candidates are definitions and which are not. Thus, we try to rank the candidates according to their definition likelihood. We use several criteria to rank answer candidates: redundancy, term statistics in the relevant passages, external definitions, and definition terminology. We normalize each score to between 0 and 1 and combine them into a final score.
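The redundancy check described at the end of Section 2.2 can be sketched as follows. The thresholds are those quoted in the text, the overlap normalization (by the shorter candidate) is an assumption, and NLTK's WordNet interface stands in for the WordNet lookup used in the original system.

# Sketch of the redundancy check from Section 2.2: two candidates are
# treated as redundant if their lexical overlap exceeds 80%, or if their
# head words share a WordNet synset and the overlap exceeds 50%.
# Requires the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def overlap(a, b):
    """Word-overlap ratio between two candidate strings (assumed here to
    be normalized by the shorter candidate)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def same_synset(head_a, head_b):
    """True if the two head words share at least one WordNet synset."""
    return bool(set(wn.synsets(head_a)) & set(wn.synsets(head_b)))

def is_redundant(cand_a, head_a, cand_b, head_b):
    ov = overlap(cand_a, cand_b)
    return ov > 0.8 or (ov > 0.5 and same_synset(head_a, head_b))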

3.1. Redundancy

Important facts or events are usually mentioned repeatedly. As redundancy is checked in the preceding candidate extraction phase, the redundancy of an answer candidate C can be expressed by the following redundancy ratio:

\mathrm{RddRatio}(C) = \frac{r}{n}

where r is the redundancy count of answer candidate C in the candidate set, and n is the total number of answer candidates. The redundancy score is calculated by the following scaled version:

\mathrm{Rdd}(C) = \log_2\left(\frac{\mathrm{RddRatio}(C)}{\max_j \mathrm{RddRatio}(C_j)} + 1\right)    (1)

where \max_j \mathrm{RddRatio}(C_j) is the maximum redundancy ratio among all candidates.

3.2. Local term statistics

As important facts or events are mentioned repeatedly, target-related terms occur frequently. Thus the frequent words in the retrieved passages (i.e., local passages) are considered to be important. Loc(C) is a score based on the term statistics in the retrieved passages and is calculated as follows:

\mathrm{Loc}(C) = \log_2\left(\frac{\sum_{t_i \in C} sf_i / \max_j sf_j}{|C|} + 1\right)    (2)

where sf_i is the number of sentences in which the term t_i occurs, \max_j sf_j is the maximum value of sf among all terms, and |C| is the number of content words in the answer candidate C.

3.3. External definitions

Definitions of a question target extracted from external resources such as online dictionaries or encyclopedias are called external definitions. Candidates that have a higher probability of occurring in the external definitions than in general text are likely to be an answer. The probability ratio is measured by the following equation:

\mathrm{ExtRatio}(C) = \frac{P(C|E)}{P(C)}

where P(C|E) is the probability that the candidate C occurs in the external definitions E, and P(C) is the prior probability of the candidate. The external definition score Ext(C) is calculated by the following hyperbolic tangent sigmoid function:

\mathrm{Ext}(C) = \frac{1 - e^{-\mathrm{ExtRatio}(C)}}{1 + e^{-\mathrm{ExtRatio}(C)}}    (3)

Since the probability ratio ExtRatio(C) is above 0, the score Ext(C) is between 0 and 1. Each probability is estimated by MLE (maximum likelihood estimation):

P(C|E) = \left(\prod_{t_i \in C} \frac{freq_{i,E}}{|E|}\right)^{1/|C|}    (4)

P(C) = \left(\prod_{t_i \in C} \frac{freq_{i,B}}{|B|}\right)^{1/|C|}    (5)

where freq_{i,E} is the number of occurrences of the term t_i in the external definitions E, and |E| is the total number of term occurrences in the external definitions; freq_{i,B} and |B| are the corresponding counts in the background general collection. |C| is the number of content words in the candidate C and is used for normalizing the probabilities.

3.4. Definition terminology

Although external definitions are useful for ranking candidates, it is obvious that they cannot cover all possible targets. In order to alleviate this problem, we devise a definition terminology score which reflects how definition-like the candidate phrase is. For the definition terminology, we collected external definitions according to the three target types. We compare the term statistics in the definitions to those in general text, assuming that the difference in term statistics can serve as a measure of definition terminology. The definition terminology score of an answer candidate C is calculated based on the term statistics as follows:

\mathrm{Tmn}(C) = \log_2\left(\frac{\sum_{t_i \in C} \mathrm{DefTerm}(t_i) / \max_j \mathrm{DefTerm}(t_j)}{|C|} + 1\right)    (6)

where DefTerm(t_i) is the definition terminology score for a term t_i in the candidate, and \max_j \mathrm{DefTerm}(t_j) is the maximum value of the score. In order to measure DefTerm(t_i), we tried several measures, including ones which have been used for feature selection in text categorization (Yang & Pedersen, 1997). Each measure refers to the following two-way contingency table of a term t and the definition class D: a is the number of times t and D co-occur, b is the number of times t occurs without D, c is the number of times D occurs without t, d is the number of times neither t nor D occurs, and N is the total number of documents.

- Mutual information: a criterion commonly used in statistical language modeling of word associations. Under mutual information, rare terms receive a higher score than common terms:

  I(t_i, D) = \log \frac{P(t_i, D)}{P(t_i) P(D)}

  \mathrm{DefTerm}_{mi}(t_i) = \log \frac{a \cdot N}{(a + c)(a + b)}

- Information gain: measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document:

  G(t_i) = \sum_{c_j \in \{D, \bar{D}\}} P(t_i, c_j) I(t_i, c_j) + \sum_{c_j \in \{D, \bar{D}\}} P(\bar{t}_i, c_j) I(\bar{t}_i, c_j)

  \mathrm{DefTerm}_{ig}(t_i) = \frac{a}{N} \log \frac{a \cdot N}{(a + c)(a + b)} + \frac{b}{N} \log \frac{b \cdot N}{(a + b)(b + d)} + \frac{c}{N} \log \frac{c \cdot N}{(c + d)(a + c)} + \frac{d}{N} \log \frac{d \cdot N}{(c + d)(b + d)}

- χ² statistic: measures the lack of independence between t and D:

  \mathrm{DefTerm}_{chi}(t_i) = \frac{N (ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}

- Probability ratio: the ratio of the probability of a term in the definitions D to its probability in general text:

  \mathrm{DefTerm}_{pratio}(t_i) = \frac{P_D(t_i)}{P(t_i)}

The probabilities are estimated by MLE as follows:

P_D(t_i) = \frac{freq_{i,D}}{|D|}

P(t_i) = \frac{freq_{i,B}}{|B|}
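To make the ranking formulas concrete, the following sketch computes the single scores of Eqs. (1)–(3) and (6) for one candidate from pre-computed statistics (redundancy count, sentence frequencies, the external-definition probability ratio, and per-term DefTerm values). The data structures and the toy numbers are illustrative only, not those of the original system.

import math

def rdd_score(r, n, max_ratio):
    """Redundancy score, Eq. (1): scaled redundancy ratio r/n."""
    return math.log2((r / n) / max_ratio + 1)

def loc_score(content_words, sf, max_sf):
    """Local term statistics, Eq. (2): sentence frequencies sf_i normalized
    by the maximum sf and by candidate length |C|."""
    total = sum(sf.get(t, 0) / max_sf for t in content_words)
    return math.log2(total / len(content_words) + 1)

def ext_score(ext_ratio):
    """External definition score, Eq. (3): hyperbolic-tangent-style squashing
    of the probability ratio P(C|E)/P(C) into (0, 1)."""
    return (1 - math.exp(-ext_ratio)) / (1 + math.exp(-ext_ratio))

def tmn_score(content_words, defterm, max_defterm):
    """Definition terminology score, Eq. (6)."""
    total = sum(defterm.get(t, 0.0) / max_defterm for t in content_words)
    return math.log2(total / len(content_words) + 1)

# Toy usage with made-up statistics
words = ["bacterial", "disease", "caused", "tuberculosis"]
sf = {"bacterial": 3, "disease": 7, "caused": 5, "tuberculosis": 9}
print(rdd_score(r=2, n=40, max_ratio=0.2),
      loc_score(words, sf, max_sf=9),
      ext_score(1.7),
      tmn_score(words, {"disease": 4.0, "caused": 2.5}, max_defterm=4.0))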

3.5. Score combination

The criteria mentioned so far are linearly combined into a single score:

\mathrm{Score}(C) = \lambda_1 \mathrm{Rdd}(C) + \lambda_2 \mathrm{Loc}(C) + \lambda_3 \mathrm{Ext}(C) + \lambda_4 \mathrm{Tmn}(C)    (7)

where the \lambda_j are tuning parameters satisfying \sum_j \lambda_j = 1 and are set empirically. The top-ranked candidates are selected as the final answer.

4. Experimental results

4.1. Experimental setup

We experimented with 50 TREC 2003 topics and 64 TREC 2004 topics, and found the answers in the AQUAINT corpus used for the TREC Question Answering Track evaluation. The TREC answer set for the definitional question answering task consists of several definition nuggets for each target, and each nugget is a short string classified as either vital or okay. A vital nugget is information that must be present in a good definition, whereas an okay nugget is information that is interesting but not essential. Because TREC 2004 presents a series of questions about each question target, including factoid, list, and definitional questions, the answers for definitional questions exclude the answers for factoid and list questions (Voorhees, 2004). However, for evaluating definitional question answering systems only, the answers for those questions about a target can be considered part of the answer for the definitional question about the target. Thus we expanded the TREC 2004 answers by adding the answers for factoid questions of each topic, and used them to evaluate the TREC 2004 topics.²

² When we compare our systems with other TREC participant systems, we use the original gold-standard answers.

The evaluation of systems involves matching up the answer nuggets and the system output. Because manual evaluation such as the TREC evaluation is very costly, we evaluated our system using the automatic measure POURPRE (Lin & Demner-Fushman, 2005). In order to automatically match the nuggets, a match score for each nugget n_i in the TREC answer A is calculated based on term co-occurrences as follows:

\mathrm{MS}(n_i) = \frac{\max_j |n_i \wedge s_j|}{|n_i|}

where |n_i| is the number of terms in the answer nugget, and |n_i \wedge s_j| is the number of terms overlapping with a system output nugget s_j. POURPRE estimates the TREC metrics, recall R, allowance a, precision P, and F-measure, using the match score as follows:

R = \frac{\sum_{n_i \in A} \mathrm{MS}(n_i)}{R}    (8)

a = 100 \times (r' + a')    (9)

P = \begin{cases} 1 & (l < a) \\ 1 - \frac{l - a}{l} & (l \geq a) \end{cases}    (10)

F(\beta) = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}    (11)
where R (in the denominator) is the number of vital answer nuggets, l is the number of non-white-space characters in the entire system output, and r' and a' are the numbers of vital and okay nuggets, respectively, that have a non-zero match score. The F-measure is the official measure in the TREC evaluation, and β is set to three, favoring recall heavily.

We used external definitions from various online sites: Biography.com, Columbia Encyclopedia, Wikipedia, FOLDOC, The American Heritage Dictionary of the English Language, and the Online Medical Dictionary. The external definitions are collected at query time by submitting a query consisting of the head words of the question target to each site. In order to extract definition terminology, we also collected definitions according to the target type: 14,904 person, 994 organization, and 3639 thing entries. We used our own document retrieval engine based on the OKAPI BM25 ranking function (Sparck Jones, Walker, & Robertson, 1998), and processed the top 200 retrieved documents in all experiments.

4.2. Passage expansion

Table 2 shows the effect of passage expansion using target-focused anaphora resolution. The upper two rows show the result for all retrieved sentences without applying candidate extraction and ranking, and the lower two rows show the result for top-ranked candidates up to 2000 non-white-space characters with candidate extraction and ranking. Without PE means that passage expansion is not applied, and With PE means that the expansion is applied.

Table 2
Effect of passage expansion

                                  TREC 2003                      TREC 2004
                                  Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
All sent           Without PE     0.4909  0.0493  0.1260         0.5367  0.0285  0.1434
                   With PE        0.4909  0.0476  0.1249         0.5380  0.0286  0.1437
Top ranked 2000B   Without PE     0.4314  0.1911  0.3522         0.4524  0.2820  0.4227
                   With PE        0.4478  0.1959  0.3662         0.4535  0.2829  0.4238

While passage expansion makes little difference in answering performance for all sentences, the performance is slightly improved for top-ranked candidates. The median number of added sentences is three and two for TREC 2003 and 2004, respectively. As only a few sentences are added, they contain little new information, as indicated by the negligible change for all sentences. On the other hand, the added sentences improve the ranking by promoting candidates eligible for the answer.

4.3. Candidate extraction

Table 3 shows the performance according to the candidate units. Sent only and Phrase only mean that all sentences and all phrases, respectively, are used as answer candidates. Phrase + sent is the system that uses phrases if any syntactic pattern is matched but uses raw sentences otherwise. Passage expansion is applied to all three systems, and the performance is measured over all candidates without ranking.

Table 3
Comparison of candidate units

                TREC 2003                      TREC 2004
                Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
Sent only       0.4909  0.0476  0.1249         0.5380  0.0286  0.1437
Phrase only     0.3993  0.1892  0.2765         0.4233  0.2657  0.3225
Phrase + sent   0.4859  0.0579  0.1439         0.5357  0.0380  0.1641

As the phrases are extracted from the sentences, the recall of Phrase only is lower than that of Sent only. In spite of the shorter text, the phrase system covers 81.3% (0.3993/0.4909) and 78.7% (0.4233/0.5380) of the answer nuggets covered by all sentences for the TREC 2003 and 2004 data, respectively. Phrase + sent balances recall and precision.

Figs. 1 and 2 show the changes in performance of each system according to the answer length, measured by the number of non-white-space characters. As shown in the figures, the phrase-based system Phrase only is better than the sentence-based system Sent only for short answers, up to about 900 bytes for TREC 2003 and 600 bytes for TREC 2004, but it is better to use phrases and sentences together for longer answers.

Fig. 1. Performance changes according to answer length: TREC 2003. [Plot of F(β = 3) versus answer length in non-white-space characters (0–2000) for sent only, phrase only, and phrase + sent.]

Fig. 2. Performance changes according to answer length: TREC 2004. [Plot of F(β = 3) versus answer length in non-white-space characters (0–2000) for sent only, phrase only, and phrase + sent.]
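The F(β = 3) values plotted in Figs. 1 and 2 and reported in the tables follow the POURPRE-style computation of Eqs. (8)–(11). The sketch below spells that computation out under the definitions of Section 4.1, with nuggets and system output represented as plain strings; it is an illustration of the formulas, not the official POURPRE implementation, and it assumes at least one vital nugget per target.

# Simplified sketch of the POURPRE-style scoring in Eqs. (8)-(11).
# answer_nuggets: list of (text, is_vital) pairs; system_nuggets: list of strings.
def match_score(nugget, system_nuggets):
    """MS(n_i): best term-overlap ratio against any system nugget."""
    n_terms = set(nugget.lower().split())
    best = max((len(n_terms & set(s.lower().split())) for s in system_nuggets),
               default=0)
    return best / len(n_terms)

def pourpre_f(answer_nuggets, system_nuggets, output_length, beta=3.0):
    vital = [n for n, is_vital in answer_nuggets if is_vital]
    okay = [n for n, is_vital in answer_nuggets if not is_vital]
    ms = {n: match_score(n, system_nuggets) for n, _ in answer_nuggets}
    # Eq. (8): recall (sum restricted to vital nuggets here, divided by their count)
    recall = sum(ms[n] for n in vital) / len(vital)
    # Eq. (9): length allowance from matched vital and okay nuggets
    allowance = 100 * (sum(ms[n] > 0 for n in vital) +
                       sum(ms[n] > 0 for n in okay))
    l = output_length
    # Eq. (10): length-based precision
    precision = 1.0 if l < allowance else 1 - (l - allowance) / l
    # Eq. (11): F-measure with beta = 3 favoring recall
    denom = beta ** 2 * precision + recall
    return 0.0 if denom == 0 else (beta ** 2 + 1) * precision * recall / denom

answer = [("born in Hope, Arkansas", True), ("42nd president", True),
          ("named after his father", False)]
system = ["Bill Clinton was born in Hope, Arkansas.",
          "He was the 42nd president of the United States."]
length = sum(len(s.replace(" ", "")) for s in system)
print(round(pourpre_f(answer, system, length), 4))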

For the TREC 2004 data, the performance of Phrase + sent is slightly worse than Sent only for long answers, probably because of an insufficient number of phrases: the median number of extracted phrases is 81.5 for the TREC 2004 data, compared to 110.5 for the TREC 2003 data. In general, it is safe and effective to use phrases and sentences together.

4.4. Candidate ranking

Table 4 shows the ranking performance of each definition terminology measure. In order to compare the definition terminology measures in isolation, no other ranking criterion is used in this experiment. As shown in the table, the proposed probability ratio is the best definition terminology measure, and mutual information is the worst.

Table 4
Comparison of definition terminology measures (top ranked 2000B)

                  TREC 2003                      TREC 2004
Measure type      Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
mi                0.3699  0.1707  0.3050         0.3868  0.2451  0.3625
ig                0.3977  0.1778  0.3260         0.4040  0.2542  0.3779
chi               0.3795  0.1725  0.3108         0.3875  0.2448  0.3628
pratio            0.4172  0.1859  0.3415         0.4297  0.2682  0.4015

Table 5 shows the results of ranking score combinations, where rdd, loc, ext, and tmn indicate the redundancy, local term statistics, external definition, and definition terminology scores, respectively. Each of these systems uses only the single ranking criterion for ranking candidates. The all system is the combination of all scores, where the tuning parameters λ1, λ2, λ3, and λ4 are set to 0.15, 0.15, 0.4, and 0.3, respectively, considering the single-measure performance. The table shows that the external definition score is the best ranking measure, and that the definition terminology score is also a very good measure. The definition terminology score is expected to make the performance robust even when there is no definition of the question target in the external resources.

Table 5
Comparison of ranking score combinations (top ranked 2000B)

                     TREC 2003                      TREC 2004
Combination type     Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
rdd                  0.3756  0.1754  0.3108         0.3959  0.2493  0.3705
loc                  0.3673  0.1700  0.3026         0.3797  0.2375  0.3548
ext                  0.4440  0.1938  0.3625         0.4530  0.2826  0.4234
tmn                  0.4172  0.1859  0.3415         0.4297  0.2682  0.4015
all                  0.4478  0.1959  0.3662         0.4535  0.2829  0.4238

4.5. Comparison with TREC participant systems

We compared our proposed system with previous TREC participant systems. For these experiments, the raw run files of each system were provided by NIST.³ For the TREC 2004 evaluation, run files for only the definitional questions are used. Table 6 shows the POURPRE evaluation results of the TREC top five systems and of our proposed system, Proposed, with the answer length set to 1500 bytes and 2000 bytes, respectively. Because the responses of the TREC 2004 participant systems were generated on the assumption that the answers for definitional questions do not have to include the answers for other question types, we evaluated the systems with the original TREC 2004 answers. Our system may be slightly underestimated for the TREC 2004 questions because it does not consider other types of questions.

³ http://trec.nist.gov/

Table 6
Comparison with TREC participant systems based on POURPRE

                                    TREC 2003                      TREC 2004
                                    Rec     Prec    F (β = 3)      Rec     Prec    F (β = 3)
TREC participant systems (top 5)    0.3979  0.3513  0.3644         0.3468  0.1920  0.3139
                                    0.4229  0.2009  0.3531         0.3412  0.1920  0.3107
                                    0.3939  0.2062  0.3402         0.2950  0.2451  0.2803
                                    0.3314  0.4961  0.3348         0.3053  0.1682  0.2766
                                    0.3790  0.2658  0.3321         0.3174  0.1894  0.2676
Proposed (1500B)                    0.4224  0.2336  0.3648         0.3124  0.1939  0.2907
Proposed (2000B)                    0.4478  0.1959  0.3662         0.3323  0.1555  0.2933

The table shows that our system is comparable to state-of-the-art definitional QA systems.

5. Conclusions

This paper proposed a definitional question answering system that extracts answer candidates based on linguistic features and ranks the candidates based on various ranking measures, in particular definition terminology. Our main findings can be summarized as follows:

- The passage expansion technique using a simple target-focused anaphora resolution technique can add informative sentences related to the question target. The added sentences have a positive effect on the ranking performance.
- The phrase extraction method based on syntactic patterns is useful for short definitions. However, for a long definition, it is better to use phrases and sentences together.
- The external definitions and the definition terminology turn out to be effective and complementary measures for ranking candidates.

In this study, we designed our definitional QA system so that it can cope with error-prone linguistic tools, which sometimes forces us to apply very strict measures. In future work, we will try to gradually relax these constraints without degrading the performance.

References

Bikel, D. M., Schwartz, R. L., & Weischedel, R. M. (1999). An algorithm that learns what's in a name. Machine Learning, 34(1–3), 211–231.
Blair-Goldensohn, S., McKeown, K. R., & Schlaikjer, A. H. (2003). A hybrid approach for QA track definitional questions. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 185–192).
Cui, H., Kan, M.-Y., Chua, T.-S., & Xiao, J. (2004). A comparative study on sentence retrieval for definitional question answering. In SIGIR workshop on information retrieval for question answering (IR4QA).
Cui, H., Li, K., Sun, R., Chua, T.-S., & Kan, M.-Y. (2004). National University of Singapore at the TREC-13 question answering. In Proceedings of the 13th text retrieval conference (TREC-2004).
Dumais, S., Banko, M., Brill, E., Lin, J., & Ng, A. (2002). Web question answering: is more always better? In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-2002) (pp. 291–298).
Echihabi, A., Hermjakob, U., Hovy, E., Marcu, D., Melz, E., & Ravichandran, D. (2003). Multiple-engine question answering in TextMap. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 772–781).
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. The MIT Press.
Gaizauskas, R., Greenwood, M. A., Hepple, M., Roberts, I., & Saggion, H. (2004). The University of Sheffield's TREC 2004 Q&A experiments. In Proceedings of the 13th text retrieval conference (TREC-2004).
Harabagiu, S., Moldovan, D., Clark, C., Bowden, M., Williams, J., & Bensley, J. (2003). Answer mining by combining extraction techniques with abductive reasoning. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 375–382).
Hildebrandt, W., Katz, B., & Lin, J. (2004). Answering definition questions with multiple knowledge sources. In Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics (HLT-NAACL-2004) (pp. 49–56).
Katz, B., Bilotti, M., Felshin, S., Fernandes, A., Hildebrandt, W., Katzir, R., Lin, J., Loreto, D., Marton, G., Mora, F., & Uzuner, O. (2004). Answering multiple questions on a topic from heterogeneous resources. In Proceedings of the 13th text retrieval conference (TREC-2004).
Lin, J., & Demner-Fushman, D. (2005). Automatically evaluating answers to definition questions. In Proceedings of the human language technology conference and conference on empirical methods in natural language processing (HLT-EMNLP-2005).
Saggion, H., & Gaizauskas, R. (2004). Mining on-line sources for definition knowledge. In Proceedings of the 17th international Florida artificial intelligence research society conference (FLAIRS-2004).
Sparck Jones, K., Walker, S., & Robertson, S. E. (1998). A probabilistic model of information retrieval: development and status. Technical Report 446, University of Cambridge Computer Laboratory.
Tapanainen, P., & Jarvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th conference on applied natural language processing (pp. 64–71).
Voorhees, E. M. (2003). Overview of the TREC 2003 question answering track. In Proceedings of the 12th text retrieval conference (TREC-2003) (pp. 54–68).
Voorhees, E. M. (2004). Overview of the TREC 2004 question answering track. In Proceedings of the 13th text retrieval conference (TREC-2004).
Wu, L., Huang, X., You, L., Zhang, Z., Li, X., & Zhou, Y. (2004). FDUQA on TREC 2004 QA track. In Proceedings of the 13th text retrieval conference (TREC-2004).
Xu, J., Weischedel, R., & Licuanan, A. (2004). Evaluation of an extraction-based approach to answering definitional questions. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-2004) (pp. 418–424).
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th international conference on machine learning (pp. 412–420).