A reliable FAQ retrieval system using a query log classification technique based on latent semantic analysis




Information Processing and Management 43 (2007) 420–430 www.elsevier.com/locate/infoproman

Harksoo Kim a,*, Hyunjung Lee b, Jungyun Seo c

a Program of Computer and Communications Engineering, College of Information Technology, Kangwon National University, 192-1 Hyoja 2(i)-dong, Chuncheon-si, Gangwon-do 200-701, Republic of Korea
b Natural Language Processing Laboratory, Department of Computer Science, Sogang University, Sinsu-dong 1, Seoul 121-742, Republic of Korea
c Department of Computer Science and Interdisciplinary Program of Integrated Biotechnology, Sogang University, 1 Sinsu-dong, Mapo-gu, Seoul 121-742, Republic of Korea

Received 11 May 2006; accepted 25 July 2006
Available online 6 October 2006

Abstract

To obtain high performance, previous work on FAQ retrieval has used high-level knowledge bases or handcrafted rules. However, constructing such knowledge bases and rules is a time- and effort-consuming job whenever application domains change. To overcome this problem, we propose a high-performance FAQ retrieval system that uses only users' query logs as knowledge sources. At indexing time, the proposed system efficiently clusters users' query logs using classification techniques based on latent semantic analysis. At retrieval time, the proposed system smoothes the FAQs using the query log clusters. In our experiments, the proposed system outperformed conventional information retrieval systems in FAQ retrieval. Based on various experiments, we found that the proposed system can alleviate critical lexical disagreement problems in short document retrieval. In addition, we believe that the proposed system is more practical and reliable than previous FAQ retrieval systems because it uses only data-driven methods, without high-level knowledge sources.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: FAQ retrieval; Lexical disagreement problem; Query log clusters; Latent semantic analysis

1. Introduction

FAQ (frequently asked question) retrieval plays an increasingly important role in e-commerce websites because FAQs accommodate both customer needs and business requirements. As a useful tool for information access, most commercial sites provide customers with a keyword search. However, the keyword search sometimes does not perform well in FAQ retrieval because FAQ collections consist of ordinary lists which have several deficiencies, as follows (Sneiders, 1999). First, information suppliers do not know users' actual questions.

* Corresponding author. Tel.: +82 33 250 6388; fax: +82 33 252 6390. E-mail addresses: [email protected] (H. Kim), [email protected] (H. Lee), [email protected] (J. Seo).

0306-4573/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2006.07.018


Instead, the information suppliers construct question candidates in advance by using their own knowledge and answer those question candidates. However, the question candidates do not always satisfy users' needs. Second, each FAQ consists of a small number of words, unlike ordinary documents. Information suppliers choose the words in each FAQ according only to their own knowledge, and those words may not be the words users actually use. These deficiencies can cause lexical disagreement problems in keyword search. For example, the query ''How can I remove my login ID'' and the FAQ ''A method to secede from the membership'' have very similar meanings, but there is no overlap between the words in the two sentences. As a result, keyword search systems often misdirect users to irrelevant FAQs that share one or more common words, such as ''How to create login ID'' and ''How can I change my password''.

The representative FAQ retrieval systems are FAQ Finder (Hammond, Burke, Martin, & Lytinen, 1995), Auto-FAQ (Whitehead, 1995), and Sneiders' system (Sneiders, 1999). FAQ Finder was designed to improve navigation through already existing external FAQ collections. To match users' queries to the FAQ collections, FAQ Finder uses a syntactic parser to identify verb and noun phrases and then performs concept matching using semantic knowledge such as WordNet (Miller, 1990). Auto-FAQ matches users' queries to predefined FAQs using a keyword comparison method based on shallow NLP (natural language processing) techniques. Sneiders' system classifies keywords into three types (required keywords, optional keywords, and irrelevant keywords) and retrieves and ranks relevant FAQs according to these three types. Although these systems perform well, constructing their knowledge bases is a time- and effort-consuming job whenever application domains change. To reduce this burden, we propose a FAQ retrieval system that uses only query logs as knowledge sources, because query logs have two useful characteristics: (1) users' query logs are easy to collect, and (2) similar meanings of words can be observed across various query logs.

There have been numerous studies on how clustering can be employed to improve retrieval results (Liu & Croft, 2004). Cluster-based retrieval can be divided into two types: static clustering methods (Jardine & van Rijsbergen, 1971; van Rijsbergen & Croft, 1975) and query-specific clustering methods (Hearst & Pedersen, 1996; Tombros, Villa, & van Rijsbergen, 2002). Static clustering methods group the entire collection in advance, independently of the user's query, and clusters are retrieved based on how well their centroids match the query. Query-specific clustering methods group the set of documents retrieved by an IR system for a query; their main goal is to improve the ranking of relevant documents at search time. Some studies have shown that cluster-based retrieval does not outperform document-based retrieval, except on small collections (El-Hamdouchi & Willet, 1989; Tombros et al., 2002; Voorhees, 1985; Willet, 1988). Despite this skepticism about cluster-based retrieval, the proposed system uses query log clusters as a form of document smoothing (Liu & Croft, 2004) because the size of a FAQ collection is probably much smaller than the size of an ordinary document collection.

This paper is organized as follows.
In Section 2, we propose a cluster-based FAQ retrieval system that uses query logs as knowledge sources. In Section 3, we explain the experimental results. Finally, we draw some conclusions in Section 4.

2. FRACT: FAQ retrieval and clustering techniques

If we can automatically group query logs with similar meanings, we may be able to use the query log clusters as informative knowledge sources for FAQ retrieval, because the clusters will include various surface forms of sentences that reflect users' preferences for specific words. Based on this assumption, we propose a cluster-based FAQ retrieval system called FRACT (Faq Retrieval And Clustering Technique). FRACT consists of two sub-systems: a query log clustering system and a cluster-based retrieval system. The query log clustering system periodically collects and refines users' query logs. It then treats each FAQ as an independent category and classifies the query logs into the FAQ categories by using a vector similarity measure in latent semantic space. Based on the classification results, it groups the query logs and computes the centroid of each query log cluster. When a user inputs a query, the cluster-based retrieval system calculates the similarities between the query and the FAQs smoothed by the query log clusters and, according to these similarities, ranks and returns a list of relevant FAQs.


2.1. Vector representation of indexing terms

In linguistics, a syntagmatic lexical affinity, also termed a lexical relation, between two units of language stands for a correlation of their common appearance in a sentence (Saussure, 1949). The observation of lexical affinities in large textual corpora has been shown to convey information on both the syntactic and semantic levels and provides a powerful way of taking context into account (Smadja, 1989). Ideally, lexical affinities are extracted from a text by parsing it, since two words share a lexical affinity if they are involved in a modifier-modified relation. In IR (information retrieval), sentences are generally represented as sets of unigrams, but unigrams do not provide contextual information about co-occurring words. A possible solution to this problem is to supplement the unigrams with dependency bigrams (i.e., lexical affinities in the linguistic sense) in order to provide further control over phrase formation. Unfortunately, automatic syntactic parsing of free-style text is still not very efficient, and the number of dependency bigrams extracted by parsing is not sufficient to measure similarity between sentences unless a large corpus is available. Therefore, we apply a sliding window technique (Maarek, Berry, & Kaiser, 1991) to FRACT in order to robustly extract co-occurrence information. The technique slides a window over the text and stores pairs of words consisting of the head of the window, if it is a content word, and any of the other content words in the window. The window slides word by word from the first word of the sentence to the last, and its size decreases at the end of the sentence so as not to cross sentence boundaries. Because the window size is bounded by a constant, the number of extracted bigrams is linear in the number of unigrams in the sentence. In the experiment, we set the window size to three. Fig. 1 shows the process of term extraction with examples.

After extracting terms, FRACT assigns a weight score to each term according to the weighting scheme of the 2-poisson model (Robertson & Walker, 1992), as shown in Eq. (1).

    w_{ij} = \frac{tf_j}{k_1\left((1-b) + b \cdot \frac{dl_i}{avdl}\right) + tf_j} \cdot \log\frac{N - df_j + 0.5}{df_j + 0.5}    (1)

Fig. 1. An example of term extraction.
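To make the sliding-window extraction concrete, the following is a minimal Python sketch of the idea; the whitespace tokenizer, the toy stop-word list, and the function names are illustrative assumptions rather than the paper's implementation (FRACT works on Korean text and presumably relies on a morphological analyzer to identify content words).

```python
from typing import List

STOP_WORDS = {"a", "an", "the", "to", "of", "from", "how", "can", "i", "my"}  # toy list (assumption)

def content_words(sentence: str) -> List[str]:
    """Very rough content-word filter; a real system would use morphological analysis."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def extract_terms(sentence: str, window: int = 3) -> List[str]:
    """Return unigrams plus co-occurrence bigrams taken from a sliding window.

    The window head is paired with every other content word inside the window,
    and the window shrinks at the end of the sentence so it never crosses
    sentence boundaries (cf. Maarek et al., 1991).
    """
    words = content_words(sentence)
    terms = list(words)                      # unigrams
    for i, head in enumerate(words):
        for other in words[i + 1 : i + window]:
            # store each pair in a canonical order so "id_login" == "login_id"
            terms.append("_".join(sorted((head, other))))
    return terms

if __name__ == "__main__":
    print(extract_terms("A method to secede from the membership"))
    # -> unigrams plus window bigrams such as 'method_secede' and 'membership_secede'
```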

[Fig. 2. A pseudo-document matrix (the truncated SVD of Eq. (2) and the pseudo-document construction of Eq. (3)).]

In Eq. (1), w_ij is the weight score of the jth term in the ith document (in this paper, we consider FAQs and query logs as short documents), and tf_j is the frequency of the jth term in the document. dl_i is the length of the ith document, and avdl is the average document length. N is the total number of documents, and df_j is the number of documents containing the jth term. k1 and b are constants for performance tuning.

2.2. Query log clustering system using LSA

The similarities between documents can be calculated by popular methods such as the cosine measure, the Dice coefficient, the Jaccard coefficient, and the overlap coefficient (van Rijsbergen, 1979). However, these measures may not be effective for calculating the similarities between sentences because there is often very little overlap between the words in the sentences. LSA (latent semantic analysis) is a method of extracting and representing the contextual-usage meaning of words by statistical computations (Landauer, Foltz, & Laham, 1998). Some researchers have shown that LSA can bridge lexical gaps between words by mapping all terms in the texts to a representation in so-called latent semantic space. Based on this fact, we apply LSA techniques to FRACT in order to increase the performance of query log classification.

The LSA process is as follows. First, FRACT constructs an m × n term-document matrix X_{m×n}, where m is the number of terms, n is the number of documents, and an element w_ij is a weight score that indicates the degree of association between the ith term and the jth document, as shown in Eq. (1). Then, FRACT applies SVD to the term-document matrix X_{m×n}, as shown in Eq. (2).

    X_{m \times n} = U_{m \times m} \cdot S_{m \times n} \cdot V^T_{n \times n}    (2)

where U_{m×m} is an m × m orthonormal matrix and V_{n×n} is an n × n orthonormal matrix. S_{m×n} is an m × n positive matrix whose nonzero values are s_11, ..., s_rr, where r is the rank of X, arranged in descending order s_11 ≥ s_22 ≥ ... ≥ s_rr > 0. After applying SVD, FRACT reduces S_{m×n} to S_{r×r} by selecting the top r dimensions and obtains a pseudo-document matrix, V_{n×r} · S_{r×r}, in the r-dimensional space called latent semantic space, as shown in Eq. (3).

    \hat{X}^T_{m \times n} = V_{n \times r} \cdot S_{r \times r} \cdot U^T_{m \times r}, \qquad \hat{X}^T_{m \times n} \cdot U_{m \times r} = V_{n \times r} \cdot S_{r \times r}    (3)

Fig. 2 illustrates the pseudo-document matrix generated according to Eq. (3).
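The construction of the weighted term-document matrix and the pseudo-document matrix in Eqs. (1)-(3) can be sketched with numpy as follows. The function names, data layout, and toy example are our own assumptions; the dense SVD shown here would be replaced by a sparse truncated SVD for realistic collection sizes, and in FRACT the documents would be FAQs and query logs represented by the unigrams and window bigrams of Section 2.1.

```python
import numpy as np
from collections import Counter
from typing import Dict, List

def build_term_document_matrix(docs: List[List[str]], k1: float = 1.2, b: float = 0.75):
    """Eq. (1): 2-poisson (BM25-style) weight for every term/document pair."""
    vocab = sorted({t for d in docs for t in d})
    index: Dict[str, int] = {t: i for i, t in enumerate(vocab)}
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))          # document frequency per term
    avdl = sum(len(d) for d in docs) / n_docs              # average document length
    X = np.zeros((len(vocab), n_docs))
    for j, d in enumerate(docs):
        for t, f in Counter(d).items():
            idf = np.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * ((1.0 - b) + b * len(d) / avdl) + f
            X[index[t], j] = (f / norm) * idf
    return X, vocab

def pseudo_documents(X: np.ndarray, r: int = 200):
    """Eqs. (2)-(3): truncated SVD; rows of V_r * S_r are r-dimensional pseudo-documents."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)       # X = U S V^T
    r = min(r, len(S))                                     # cannot keep more dimensions than rank
    U_r, S_r, V_r = U[:, :r], S[:r], Vt[:r, :].T
    return V_r * S_r, U_r                                  # V_{n x r} S_{r x r} and U_{m x r}

if __name__ == "__main__":
    docs = [["method", "secede", "membership"],
            ["remove", "login", "id"],
            ["create", "login", "id"]]
    X, vocab = build_term_document_matrix(docs)
    pseudo, U_r = pseudo_documents(X, r=2)
    print(pseudo.shape)   # (3, 2): one pseudo-document per column of X
```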


In Fig. 2, the shaded portions of the matrices are what we use as the basis for the term and document vector representations. As shown in Fig. 2, FRACT selects the top r diagonal elements of S_{m×n} to reduce the representation dimension; therefore, the actual representations of the term and document vectors are U_{m×r} and V_{n×r}, scaled by the elements of S_{r×r}. Once the pseudo-document matrix has been constructed, FRACT compares FAQ vectors to query log vectors in the latent semantic space (not in the original vector space) by using the cosine similarity measure, as shown in Eq. (4) (Salton & McGill, 1983), in order to increase the performance of query log classification.

    \cos(f^i, ql^j) = \frac{\sum_{k=1}^{r} f^i_k \, ql^j_k}{\sqrt{\sum_{k=1}^{r} (f^i_k)^2 \cdot \sum_{k=1}^{r} (ql^j_k)^2}}    (4)

In Eq. (4), f^i_k and ql^j_k are the kth element values of the ith FAQ vector f^i and the jth query log vector ql^j in the r-dimensional pseudo-document matrix V_{n×r} · S_{r×r}. After comparing FAQ vectors to query log vectors, FRACT classifies each query log vector into the FAQ category (in this paper, we consider each FAQ as an individual category) with the maximum cosine similarity. Then, FRACT generates the centroid vector of each FAQ category according to Eq. (5) and constructs the centroid matrix C_{fn×r}, where fn is the number of FAQ categories, by gathering the centroid vectors.

    c^i = \frac{f^i + \sum_{ql^j \in cat_i} ql^j}{num_i}, \quad \text{if } \cos(f^i, ql^j) > \theta    (5)

In Eq. (5), c^i is the centroid vector of cat_i, the category of the ith FAQ vector f^i. num_i is the number of all vectors (i.e., one FAQ vector and its query log vectors) belonging to cat_i, and θ is a threshold value that prevents the centroid vectors from leaning excessively toward query logs that may be misclassified. Finally, FRACT restores the representation dimension to the original m dimensions, as shown in Eq. (6).

    \hat{C}_{fn \times m} = C_{fn \times r} \cdot U^T_{m \times r}    (6)

In Eq. (6), FRACT sets element values smaller than zero to zero in Ĉ_{fn×m} because term weights should generally be greater than zero. We call Ĉ_{fn×m} a latent centroid matrix because it contains latent term weights that are not calculated directly from actual term occurrences but are estimated by the LSA techniques.

2.3. Cluster-based FAQ retrieval system

The similarity in vector space models is determined by associative coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity. The inner product is usually normalized, and the most popular similarity measure is the cosine coefficient, which measures the angle between the document vector and the query vector. In other words, given a query and FAQs, we can calculate the vector similarity between the ith FAQ vector f^i and the query vector q by using the cosine coefficient. However, we may obtain very low similarities because there is often little overlap between the query vector and the FAQ vector. To overcome this problem, FRACT smoothes the representations of the FAQ vectors using the latent term weights and calculates cosine similarities between the queries and the smoothed FAQ vectors, as shown in Eqs. (7) and (8).

    \tilde{f}^i_k = \lambda \cdot \frac{f^i_k}{max\_val(f^i)} + (1-\lambda) \cdot \frac{\hat{c}^i_k}{max\_val(\hat{c}^i)}    (7)

    \cos(\tilde{f}^i, q) = \frac{\sum_{k=1}^{m} \tilde{f}^i_k \, q_k}{\sqrt{\sum_{k=1}^{m} (\tilde{f}^i_k)^2 \cdot \sum_{k=1}^{m} (q_k)^2}}    (8)

In Eq. (7), ĉ^i_k is the kth term weight of the ith vector (i.e., the latent centroid vector associated with the FAQ vector f^i) in the latent centroid matrix Ĉ_{fn×m}. λ is a parameter which determines the degree of smoothing and has a value between zero and one, and max_val(x) is a normalizing factor that represents the maximum among the element values of the vector x.
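The following numpy sketch strings Eqs. (4)-(8) together, continuing from the hypothetical pseudo_documents() output of the earlier sketch: query logs are classified into FAQ categories by cosine similarity in latent space (Eq. (4)), each category is averaged into a centroid (Eq. (5)), the centroids are projected back to term space and clamped at zero (Eq. (6)), and FAQ vectors are smoothed and ranked against a query (Eqs. (7) and (8)). All names and the data layout are assumptions; the defaults θ = 0.3 and λ = 0.7 are the values the paper reports later, in Section 3.1.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Eqs. (4) and (8): cosine similarity of two vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def latent_centroid_matrix(faq_latent, log_latent, U_r, theta=0.3):
    """Eqs. (4)-(6): classify query logs, average each category, map back to term space."""
    members = [[f] for f in faq_latent]                   # each category starts with its own FAQ vector
    for ql in log_latent:
        sims = [cosine(f, ql) for f in faq_latent]
        best = int(np.argmax(sims))                       # category with the maximum similarity
        if sims[best] > theta:                            # threshold filters likely misclassifications
            members[best].append(ql)
    C = np.vstack([np.mean(vs, axis=0) for vs in members])   # Eq. (5): centroids in latent space
    C_hat = C @ U_r.T                                     # Eq. (6): restore the original m dimensions
    return np.maximum(C_hat, 0.0)                         # negative latent weights clamped to zero

def smooth(faq_term_vec, centroid_term_vec, lam=0.7):
    """Eq. (7): mix the normalized FAQ vector with its normalized latent centroid."""
    f = faq_term_vec / max(faq_term_vec.max(), 1e-12)
    c = centroid_term_vec / max(centroid_term_vec.max(), 1e-12)
    return lam * f + (1.0 - lam) * c

def rank_faqs(query_term_vec, faq_term_vecs, C_hat, lam=0.7):
    """Eq. (8): rank FAQs by cosine between the query and the smoothed FAQ vectors."""
    scores = [cosine(smooth(f, C_hat[i], lam), query_term_vec)
              for i, f in enumerate(faq_term_vecs)]
    return np.argsort(scores)[::-1], scores
```

Here faq_term_vecs and query_term_vec are assumed to be ordinary Eq. (1)-weighted vectors in the original m-dimensional term space, while faq_latent and log_latent are rows of the pseudo-document matrix V_{n×r} · S_{r×r}.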


[Fig. 3. The effect of normalization.]

According to this normalizing scheme, we expect that the vectors will lean toward axes with high term weights and that elements with high term weights will gain an advantage, as shown in Fig. 3. Based on Eq. (8), we believe that FRACT can alleviate lexical disagreement problems because FRACT uses the centroids of the FAQ categories, which include many more terms than the FAQs themselves, as smoothing factors. For example, if FRACT has the latent centroid vector [how:0.1, membership:0.5, method:0.4, remove:0.1, secede:0.9] associated with the FAQ ''A method to secede from the membership'', FRACT may return this FAQ with a high rank when a user inputs the query ''How can I remove my login ID'', because there are two common words between the query and the latent centroid even though there is no common word between the query and the original FAQ.

3. Evaluation experiments

3.1. Data sets and experimental settings

We collected 406 Korean FAQs from three domains: LGeShop (www.lgeshop.com, 91 FAQs), Hyundai Securities (www.youfirst.co.kr, 81 FAQs), and KTF (www.ktf.com, 234 FAQs). LGeShop is an internet shopping mall, Hyundai Securities is a securities company, and KTF is a mobile communication company. Over two months, we also collected a large number of query logs created by the commercial search engines installed on the above websites for FAQ retrieval. After eliminating additional information such as IP address, date, and time, we automatically selected 5845 unique query logs (1549 from LGeShop, 744 from Hyundai Securities, and 3552 from KTF) that consisted of two or more content words. Then, we manually classified the query logs into the 406 FAQ categories and annotated each query log with the identification number of its FAQ category. Finally, we constructed a test collection called KFAQTEC (Korean test collection for evaluation of FAQ retrieval systems). KFAQTEC consists of 406 FAQs and 5845 query logs, and the average number of content words per query is 5.337. Table 1 shows a sample of KFAQTEC.

Table 1
A sample of KFAQTEC

Domain     Type   ID   Sentence
KTF        FAQ    8    (A method to secede from the membership)
KTF        LOG    8    (The removal of login ID)
KTF        LOG    8    (How to secede from the membership)
LGeShop    FAQ    4    (How to use e-money)
LGeShop    LOG    4    (Can I buy goods using e-money?)

The manual annotation was done by graduate students majoring in language analysis and was post-processed for consistency. To experiment with FRACT from various viewpoints, we reorganized KFAQTEC into three types of data sets: FAQSET-1, FAQSET-2, and FAQSET-3. In FAQSET-1, the query logs were divided into 10 folds, and each fold used 9/10 of the query logs and all FAQs for system building and 1/10 of the query logs for system testing.


Using FAQSET-1, we evaluated the performance of FRACT in comparison with conventional IR systems. In FAQSET-2, the query logs were divided into 10 folds, and each fold used 9/10 of the query logs and fake FAQs (1/10 of the query logs) for system building and the real FAQs for system testing. The fake FAQs were used as both retrieval target sentences and categories for query log classification, as if they were real FAQs pre-constructed by information suppliers. Using FAQSET-2, we evaluated the robustness of FRACT with respect to the word combinations of FAQs. In FAQSET-3, the query logs were divided into 10 folds, and each fold used varying numbers of query logs (from 1/10 to 9/10 of the query logs) and all FAQs for system building and 1/10 of the query logs for system testing. Using FAQSET-3, we evaluated the performance of FRACT according to the size of the training data for query log clustering. In all our experiments, we set k1 and b in Eq. (1) to 1.2 and 0.75, respectively, according to Okapi BM25 (Robertson, Walker, Jones, Beaulieu, & Gatford, 1994). We also set the reduced dimension r in Eq. (3) to 200, the threshold value θ in Eq. (5) to 0.3, and the smoothing rate λ in Eq. (7) to 0.7.

3.2. Evaluation methods

To evaluate the performance of the query log clustering system, we computed the F1 measure, as shown in Eq. (9).

    F_1 = \frac{2PR}{P + R}    (9)

In Eq. (9), P is precision, the proportion of correctly classified query logs among the returned query logs, and R is recall, the proportion of returned query logs among the classification targets. To evaluate the performance of the cluster-based retrieval system, we computed the MRR (mean reciprocal rank) and the miss rate. The MRR is the average of the reciprocal ranks of the first relevant FAQ returned for each query, as shown in Eq. (10).

    MRR = \frac{1}{num} \sum_{i=1}^{num} \frac{1}{rank_i}    (10)

In Eq. (10), rank_i is the rank of the first relevant FAQ returned for the ith query, and num is the number of queries. The miss rate is the proportion of queries for which the search engine fails to return any relevant FAQ, as shown in Eq. (11).

    MissRate = \frac{\text{the number of failure queries}}{\text{the number of queries}}    (11)
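Eqs. (9)-(11) can be computed directly; the sketch below assumes a simple interface (one ranked list of FAQ ids per query plus the set of relevant ids), which is our own convention rather than the paper's.

```python
from typing import Sequence, Set

def f1(precision: float, recall: float) -> float:
    """Eq. (9): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mrr(rankings: Sequence[Sequence[int]], relevant: Sequence[Set[int]]) -> float:
    """Eq. (10): mean reciprocal rank of the first relevant FAQ per query (0 if none is returned)."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for pos, faq_id in enumerate(ranked, start=1):
            if faq_id in rel:
                total += 1.0 / pos
                break
    return total / len(rankings)

def miss_rate(rankings: Sequence[Sequence[int]], relevant: Sequence[Set[int]]) -> float:
    """Eq. (11): fraction of queries whose result list contains no relevant FAQ."""
    failures = sum(1 for ranked, rel in zip(rankings, relevant) if not rel.intersection(ranked))
    return failures / len(rankings)

if __name__ == "__main__":
    rankings = [[3, 8, 1], [2, 5, 9]]        # ranked FAQ ids returned for two queries
    relevant = [{8}, {7}]                    # gold FAQ ids per query
    print(mrr(rankings, relevant), miss_rate(rankings, relevant))   # 0.25 0.5
```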

3.3. Evaluation results

To evaluate the seriousness of lexical disagreement problems in short document retrieval, we analyzed the degree of word overlap between the query logs and their relevant FAQs, as shown in Table 2.

Table 2
The word overlaps between query logs and FAQs

            Number of overlap words (unigrams)
Domain      0              1              2              3             4            5
LGeShop     393            606            350            156           56           26
Hyundai     185            221            213            116           66           55
KTF         760            1131           822            441           159          89
All (%)     1338 (22.9%)   1958 (33.5%)   1385 (23.7%)   713 (12.2%)   281 (4.8%)   170 (2.9%)


Table 3
The comparison of performances in different vector spaces

                          Query log clustering system    Cluster-based retrieval system
                          Average F1                     Average MRR    Average miss rate
Original vector space     0.249                          0.501          0.193
Latent semantic space     0.420                          0.592          0.033

Table 4
The performances of each IR system

          LGeShop                     Hyundai                     KTF                         Total
System    Avg. MRR   Avg. miss rate   Avg. MRR   Avg. miss rate   Avg. MRR   Avg. miss rate   Avg. MRR   Avg. miss rate
FRACT     0.541      0.040            0.723      0.042            0.511      0.018            0.592      0.033
OKAPI     0.504      0.250            0.679      0.202            0.476      0.235            0.553      0.229
LM        0.510      0.250            0.682      0.202            0.472      0.235            0.554      0.229

As shown in Table 2, 1338 query logs (22.9% of all query logs) did not overlap with the original FAQs at all, and 56.4% of the query logs overlapped by at most one word. This reveals that lexical disagreement problems occur frequently in short document retrieval.

To evaluate the effectiveness of the latent term weights, we implemented two versions of FRACT that perform query log classification in different vector spaces (i.e., latent semantic space and the original term-document space) and compared their performance using FAQSET-1, as shown in Table 3. As shown in Table 3, the version of FRACT using latent term weights greatly outperformed the version using the original term weights. This reveals that the proposed latent term weights carry more effective information.

To evaluate the overall performance of FRACT, we compared FRACT with conventional IR systems using FAQSET-1, as shown in Table 4. In Table 4, OKAPI is the Okapi BM25 retrieval model (Robertson et al., 1994), and LM is the KL-divergence language model with JM smoothing (Zhai & Lafferty, 2001). We implemented these IR systems using the Lemur Toolkit version 3.0 (The Lemur project). As shown in Table 4, FRACT outperforms both comparison systems in average MRR and average miss rate; in particular, FRACT reduced the average miss rate by 0.196. It is difficult to compare FRACT directly with the other systems because the other systems do not use query log information. Even so, we think that the proposed method can be an effective solution to the lexical disagreement problems between queries and FAQs.

As an additional experiment, we observed the changes of ranks with respect to the top 10 in comparison with OKAPI. As shown in Table 5, FRACT moved an average of about 60.3 relevant FAQs into the top 10 and, moreover, ranked about 129.3 relevant FAQs in the top 10 that OKAPI could not find at all. This reveals that FRACT ranks relevant FAQs higher than OKAPI does.

To evaluate the robustness of FRACT with respect to the word combinations of FAQs, we compared the miss rates of FRACT and OKAPI using FAQSET-2, as shown in Fig. 4. As shown in Fig. 4, FRACT is less sensitive to word combinations than the representative IR system, OKAPI. To evaluate the changes of system performance according to the number of query logs, we computed the performance of FRACT using FAQSET-3, as shown in Fig. 5. As shown in Fig. 5, the MRR of FRACT increases slowly as the number of query logs grows.


Table 5
The changes of ranks in comparison with OKAPI

Domain     # of relevant FAQs      # of relevant FAQs        # of relevant FAQs
           upgraded into top-10    degraded out of top-10    newly found in top-10
LGeShop    45                      28                        136
Hyundai    1                       4                         57
KTF        135                     73                        195
Average    60.3                    35.0                      129.3

[Fig. 4. The changes of miss rates according to various word combinations: miss rates of FRACT and OKAPI over the fake-FAQ sets fake-0 to fake-9.]

[Fig. 5. The changes of performances according to the number of query logs: average MRR of FRACT as the number of query logs and FAQs used for system building grows from 567 to 5667.]

3.4. Failure analysis

We analyzed the cases in which FRACT failed to rank relevant FAQs highly and found several reasons why the relevant FAQs were ranked low or missed. First, there were still lexical disagreement problems between users' queries and FAQs. FRACT could resolve some lexical disagreement problems because it used query log clusters to smooth the FAQs, but we found many cases in which there was very little overlap between the words in a query and the words in the query log clusters. To solve this problem at a fundamental level, we need to study new methods that match users' queries with FAQs at the semantic level. Second, there were cases in which a single query was associated with several FAQs; in these cases, we could not select the FAQs that were entirely relevant to the query. To solve this problem, information suppliers should construct the initial FAQs accurately and should update them constantly. Third, there were cases in which several relevant FAQs were ranked much lower than by OKAPI. To solve this problem, we need


to study new methods that effectively combine latent term weights with original term weights. Finally, there were cases in which irrelevant FAQs were returned because of syntactically inadequate bigrams. To solve this problem, we need to develop a high-performance dependency parser and replace the simple bigrams with dependency bigrams.

4. Conclusion

In IR, query logs can be informative knowledge sources because we can observe various surface forms of sentences with similar meanings. Based on this merit of query logs, we proposed a high-performance FAQ retrieval system that uses query log clusters as smoothing factors. Using LSA techniques at indexing time, the FAQ retrieval system effectively groups query logs and generates cluster centroids containing latent term weights. When a user inputs a query, the FAQ retrieval system smoothes the retrieval target FAQs using the cluster centroids. By virtue of the cluster-based retrieval techniques using the latent centroids of query log clusters, the FAQ retrieval system could resolve some lexical disagreement problems regardless of the word combinations of the FAQs. We believe that the FAQ retrieval system is more practical and reliable than previous FAQ retrieval systems because it does not require high-level knowledge sources such as thesauruses and handcrafted rules.

Acknowledgement

This study was supported by a 2006 Research Grant from Kangwon National University. It was also partially supported by the Kangwon Institute of Telecommunications and Information (KITI).

References

El-Hamdouchi, A., & Willet, P. (1989). Comparison of hierarchic agglomerative clustering methods for document retrieval. The Computer Journal, 32(3), 220–227.
Hammond, K., Burke, R., Martin, C., & Lytinen, S. (1995). FAQ Finder: a case-based approach to knowledge navigation. In Proceedings of the 11th conference on artificial intelligence for applications (pp. 80–86).
Hearst, M. A., & Pedersen, J. O. (1996). Re-examining the cluster hypothesis: scatter/gather on retrieval results. In Proceedings of SIGIR 1996 (pp. 76–84).
Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR 2004 (pp. 25–29).
Maarek, Y. S., Berry, D. M., & Kaiser, G. E. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8), 800–813.
Miller, G. (1990). WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), 1–12.
Robertson, S. E., & Walker, S. (1992). Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of SIGIR 1992 (pp. 232–241).
Robertson, S. E., Walker, S., Jones, S., Beaulieu, M. M., & Gatford, M. (1994). Okapi at TREC-3. In Proceedings of TREC-3 (pp. 109–126).
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval (Computer Series). New York: McGraw-Hill.
Saussure, F. (1949). Cours de linguistique générale (Quatrième éd.). Paris: Librairie Payot.
Smadja, F. A. (1989). Lexical co-occurrence: the missing link. Literary and Linguistic Computing, 4(3).
Sneiders, E. (1999). Automated FAQ answering: continued experience with shallow language understanding. In Papers from the 1999 AAAI fall symposium (pp. 97–107).
The Lemur project. The Lemur toolkit for language modeling and information retrieval (Version 3.0).
Tombros, A., Villa, R., & van Rijsbergen, C. J. (2002). The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management, 38, 559–582.
van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworth.
van Rijsbergen, C. J., & Croft, W. B. (1975). Document clustering: an evaluation of some experiments with the Cranfield 1400 collection. Information Processing and Management, 11, 171–182.
Voorhees, E. M. (1985). The cluster hypothesis revisited. In Proceedings of SIGIR 1985 (pp. 188–196).
Whitehead, S. D. (1995). Auto-FAQ: an experiment in cyberspace leveraging. Computer Networks and ISDN Systems, 28(1–2), 137–146.
Willet, P. (1988). Recent trends in hierarchical document clustering: a critical review. Information Processing and Management, 24(5), 577–597.
Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR 2001 (pp. 334–342).


Harksoo Kim is an assistant professor of computer and communications engineering at Kangwon National University. He received a B.A. degree in computer science from Konkuk University in 1996, an M.S. degree in computer science from Sogang University in 1998, and a Ph.D. degree in computer science, with a major in natural language processing, from Sogang University in 2003. He visited the CIIR at the University of Massachusetts, Amherst, as a research fellow in 2004. In 2005, he worked at the Electronics and Telecommunications Research Institute (ETRI) as a senior researcher. His research interests include natural language processing, dialogue understanding, information retrieval, and question answering.

Hyunjung Lee is a Ph.D. student majoring in natural language processing at Sogang University. She received a B.A. degree in computer science from Dongduk Women's University in 1995 and an M.S. degree in computer science from Sogang University in 1997. She worked for NHN Corp. as a manager of the Language Processing Team from 2000 to 2004. Her research interests include natural language processing, dialogue understanding, and information retrieval.

Jungyun Seo is a full professor of computer science at Sogang University. He was educated at Sogang University, where he obtained a B.S. degree in mathematics in 1981. He continued his studies in the department of computer science at the University of Texas at Austin, receiving an M.S. and a Ph.D. in computer science in 1985 and 1990, respectively. He returned to Korea in 1991 to join the faculty of the Korea Advanced Institute of Science and Technology (KAIST) in Taejon, where he led the Natural Language Processing Laboratory in the Computer Science Department. In 1995, he moved to Sogang University in Seoul and became a full professor in 2001. His research interests include multi-modal dialogues, statistical methods for NLP, machine translation, and information retrieval.