A relevance model for a data warehouse contextualized with documents


Information Processing and Management 45 (2009) 356–367


Juan Manuel Pérez*, Rafael Berlanga, María José Aramburu
Universitat Jaume I, Campus de Riu Sec, E-12071 Castelló de la Plana, Spain


Article history: Received 24 July 2008; received in revised form 17 October 2008; accepted 9 November 2008; available online 9 January 2009.

Keywords: Relevance-based language model; Data warehouse; Text-rich document collection

Abstract: This paper presents a relevance model to rank the facts of a data warehouse that are described in a set of documents retrieved with an information retrieval (IR) query. The model is based on language modeling and relevance modeling techniques. We estimate the relevance of the facts by the probability of finding their dimension values and the query keywords in the documents that are relevant to the query. The model is the core of the so-called contextualized warehouse, a new kind of decision support system that combines structured data sources and document collections. The paper evaluates the relevance model with the Wall Street Journal (WSJ) TREC test subcollection and a self-constructed fact database. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction

For decades, the information retrieval (IR) area has provided users with methods and tools for finding interesting pieces of text in huge document collections. Until very recently, however, these techniques were implemented apart from databases because of the very different nature of the objects they manage: whereas data is well-structured with well-defined semantics, texts are unstructured and require approximate query processing (Baeza-Yates & Ribeiro-Neto, 1999). Nowadays, corporate information systems need to include internal and external text-based sources (e.g., web documents) in the information processes defined within the organization. For example, decision support systems would greatly benefit from text-rich sources (e.g., financial news and market research reports), as these can help analysts understand the historical trends recorded in corporate data warehouses. Opinion forums and blogs are also valuable text sources that can be of great interest for enhancing decision making processes. Unfortunately, few works in the literature are concerned with a true integration of data and document retrieval techniques.

Recent proposals in the field of IR include language modeling (Ponte & Croft, 1998) and relevance modeling (Lavrenko & Croft, 2001). Language modeling represents each document as a language model, and documents are ranked according to the probability of emitting the query keywords from the corresponding language model. Relevance modeling estimates the joint probability of the query keywords and the document words over the set of documents deemed relevant to that query.

In this paper, we apply the language modeling and relevance modeling approaches to develop a new model that estimates the relevance of the facts stored in a data warehouse with respect to an IR query. These facts are well-structured data tuples whose meaning is described by a set of documents retrieved with the same IR query from a separate text repository. The proposed relevance model is the core of the contextualized warehouse described in Pérez, Berlanga, Aramburu, and Pedersen (2008). However, the topic of Pérez et al. (2008) was the multidimensional model of the contextualized warehouse, rather than the relevance model. In the current paper, we describe the relevance model in detail, and we compare it with the relevance-based language model techniques that support it. The paper provides a series of experiments over a well-known IR collection in order to demonstrate that the ranking of facts provided by the model is good enough to help analysts in their tasks. This evaluation is completely new and has not been previously published. The review of the language modeling and relevance modeling approaches included in this paper is also an original contribution.

The rest of the paper is organized as follows: Section 2 overviews the contextualized warehouse. Section 3 reviews the language modeling and relevance modeling IR approaches. Section 4 presents the contextualized warehouse relevance model and Section 5 evaluates it. Finally, Section 6 discusses some conclusions and future lines of work.

* Corresponding author. Tel.: +34 964 728368; fax: +34 964 728435. E-mail addresses: [email protected] (J.M. Pérez), [email protected] (R. Berlanga), [email protected] (M.J. Aramburu). doi:10.1016/j.ipm.2008.11.001


2. The contextualized warehouse

A contextualized warehouse is a new kind of decision support system that allows users to obtain strategic information by combining sources of structured data and documents. Fig. 1 shows the architecture of the contextualized warehouse presented in Pérez et al. (2008). Its three main components are the corporate data warehouse, the document warehouse and the fact extractor module. Next, we briefly describe these components:

(a) The corporate data warehouse integrates data from the organization's structured data sources (e.g., its different department databases). The integrated data is organized into multidimensional data structures called OLAP (On-Line Analytical Processing) cubes (Codd, 1993). In these cubes, the data is divided into facts, the central entities/events for the desired analysis (e.g., the sales), and hierarchical dimensions, which characterize the facts (e.g., the products sold and the grouping of products into categories). Typically, the facts have associated numerical measures (e.g., profit), and analysis operations aggregate the fact measure values up to a certain level of detail, e.g., total profit by product category and month (Pedersen & Jensen, 2005). The facts can be conceptually modeled as data tuples whose elements depict dimension and measure values. For instance, the fact f = (Product.ProductID = fo1, Customer.Country = Japan, Time.Month = 1998/10, SUM(Profit) = 300,000$) could represent the total profit of the sales of product fo1 made to Japanese customers during October 1998: 300,000$. The OLAP cubes can be stored by following either a so-called ROLAP or a so-called MOLAP approach. ROLAP stands for Relational OLAP, since the data is stored in relational tables. In order to map the multidimensional data cubes onto tables, different logical schemas have been proposed; the star and the snowflake schemas are the most commonly used. The star schema consists of a fact table plus one dimension table for each dimension: each tuple in the fact table has a foreign key column to each of the dimension tables, and some numeric columns that represent the measures. The snowflake schema extends the star schema by normalizing and explicitly representing the dimension hierarchies. In the Multidimensional OLAP (MOLAP) alternative, special data structures (e.g., multidimensional arrays) are used for storage instead. The construction of a corporate warehouse for structured data has been broadly discussed in classical references like Inmon (2005).

(b) The document warehouse stores the unstructured data coming from internal and external sources. These documents describe the context of the corporate facts.
They provide users with additional information related to the facts, which is very useful for understanding the results of the analysis operations. For instance, a petrol crisis reported in an economy article may explain a sales drop.

(c) The objective of the fact extractor module is to relate the facts of the corporate warehouse to the documents that describe their contexts. This module first identifies dimension values in the metadata and textual contents of the documents, and then links each document with those facts that are characterized by the same dimension values (a minimal sketch of this linking step is shown after Fig. 1). In Pérez (2007) we showed that the information extraction techniques proposed in Danger, Berlanga, and Ruiz-Shulcloper (2004) and Llidó, Berlanga, and Aramburu (2001) can be used for identifying the dimension values in the documents' contents.

Fig. 1. Contextualized warehouse architecture.
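The following is a minimal sketch of the fact extractor's linking step (an illustration, not the authors' implementation). It assumes the dimension values occurring in each document have already been identified; Fact and link_documents are hypothetical names introduced for the example.

```python
# Minimal sketch (illustrative): a document is linked to every fact whose
# dimension values all appear among the values found in the document.

from dataclasses import dataclass, field

@dataclass
class Fact:
    fact_id: str
    dim_values: frozenset            # e.g. {"fo1", "Japan", "1998/10"}
    linked_docs: set = field(default_factory=set)

def link_documents(facts, documents):
    """documents: doc id -> set of dimension values found in its contents."""
    for doc_id, found_values in documents.items():
        for fact in facts:
            # Link the document to the facts characterized by the same values.
            if fact.dim_values <= found_values:
                fact.linked_docs.add(doc_id)

facts = [Fact("f4", frozenset({"fo1", "Japan", "1998/10"}))]
documents = {"d84": {"Japan", "1998/10", "fo1", "petrol"}}
link_documents(facts, documents)
print(facts[0].linked_docs)          # {'d84'}
```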


Table 1
Example R-cube (Product Id, Country and Month are dimensions; Profit is a measure; R and Ctxt are the R-cube special dimensions).

Fact Id | Product Id | Country | Month   | Profit     | R    | Ctxt
f1      | fo1        | Cuba    | 1998/03 | 4,300,000$ | 0.05 | d23^0.005
f2      | fo2        | Japan   | 1998/02 | 3,200,000$ | 0.1  | d23^0.005, d47^0.02
f3      | fo2        | Korea   | 1998/05 | 900,000$   | 0.2  | d50^0.04
f4      | fo1        | Japan   | 1998/10 | 300,000$   | 0.4  | d84^0.04, d123^0.08, d7^0.08
f5      | fo2        | Korea   | 1998/11 | 400,000$   | 0.25 | d7^0.08, d69^0.01

The fact extractor module would link the example fact f to those documents of the warehouse that report events located in Japan during October 1998.

In a contextualized warehouse, the user specifies an analysis context by supplying a sequence of keywords (i.e., an IR query Q like "petrol crisis"). The analysis is performed on a new type of OLAP cube, called R-cube, which is materialized by retrieving the documents and facts relevant to the selected context. Table 1 shows an example R-cube. Each row represents a fact f_i, and each column a dimension. In this case, the facts represent the total profit of the sales per product, country and month. R-cubes have two special dimensions: the relevance (R) and the context (Ctxt) dimensions. In the relevance dimension, each fact has a numerical value representing its relevance with respect to the specified context (e.g., how important the fact is for a "petrol crisis"); hence the name R-cube (Relevance cube). The context dimension links each fact to the set of documents that describe its context. In the R-cube, each d_j^r denotes a document d_j whose relevance with respect to the analysis context is r. The most relevant facts of our example R-cube are f4 and f5, which involve the sales made to Japanese and Korean customers during October and November 1998. By studying the documents associated with these facts, e.g., the most relevant one, d7, we may find a report on a petrol crisis that affected Japan and Korea during the second half of 1998. This report could explain why the sales represented by f4 and f5 experienced the sharpest drop. The formal definition of the R-cube's multidimensional data model and algebra was given in Pérez et al. (2008). A prototype of a contextualized warehouse was presented in Pérez, Berlanga, Aramburu, and Pedersen (2007).

This paper presents the IR model of the contextualized warehouse. Given a context of analysis (i.e., an IR query), we first retrieve the documents of the warehouse by following a language modeling approach. Then, we rely on relevance modeling to rank the facts described in the retrieved documents. Language modeling and relevance modeling establish a formal foundation based on probability theory, which is also well-suited for studying the influence of the R-cube algebra operations on the relevance values of the facts (Pérez et al., 2008).

3. Language models and relevance-based language models

The work on language modeling estimates a language model m_j for each document d_j. A language model is a stochastic process which generates documents by emitting words randomly. The documents d_j are then ranked according to the probability P(Q|m_j) of emitting the query keywords Q from the respective language model m_j (Ponte & Croft, 1998). The calculation of the probability P(Q|m_j) differs from model to model. In Song and Croft (1999) the query Q is represented as a sequence of independent keywords q_i, Q = q_1, q_2, ..., q_n (let q_i ∈ Q mean that the keyword q_i appears in the sequence Q), and the probability P(Q|m_j) is computed by

P(Q|m_j) = ∏_{q_i ∈ Q} P(q_i|m_j)    (1)

Song and Croft (1999) propose to approximate the probability P(q_i|m_j) of emitting the keyword q_i from m_j by smoothing the relative frequency of the query keyword in the document d_j. Their approach avoids zero probabilities in P(Q|m_j) when a document does not contain all the query keywords. They make the assumption that finding a keyword in a document should be at least as probable as observing it in the entire collection of documents, and estimate this probability as follows:

P(q_i|m_j) = (1 − λ) · freq(q_i, d_j) / |d_j|_w + λ · cwf_{q_i} / coll_size_w    (2)

In formula (2), freq(q_i, d_j) is the frequency of the keyword q_i in the document d_j. The term |d_j|_w denotes the total number of words in the document, cwf_{q_i} is the number of times that the query keyword q_i occurs in all the documents of the collection, and coll_size_w is the total number of words in the collection. The factor λ ∈ [0, 1] is the smoothing parameter, and its value is determined empirically.
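As an illustration, the following minimal sketch implements formulas (1) and (2) over a toy two-document collection. The corpus, the tokenization and λ = 0.1 are assumptions made for the example; log-probabilities are used to avoid numerical underflow.

```python
# Query-likelihood ranking with Jelinek-Mercer smoothing (formulas (1)-(2)).

import math
from collections import Counter

docs = {
    "d1": "petrol crisis hits japan oil prices rise".split(),
    "d2": "new software company reports record profit".split(),
}

coll = Counter()                    # collection word frequencies (cwf)
for words in docs.values():
    coll.update(words)
coll_size = sum(coll.values())      # coll_size_w: total words in the collection

def p_keyword(q, words, lam=0.1):
    """Formula (2): smoothed probability of emitting keyword q from m_j."""
    return (1 - lam) * words.count(q) / len(words) + lam * coll[q] / coll_size

def log_p_query(query, words, lam=0.1):
    """Log of formula (1): product over the independent query keywords."""
    return sum(math.log(p_keyword(q, words, lam)) for q in query)

query = ["petrol", "crisis"]
ranking = sorted(docs, key=lambda d: log_p_query(query, docs[d]), reverse=True)
print(ranking)                      # ['d1', 'd2']: d1 contains both keywords
```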


The retrieval model of the contextualized warehouse proposed in this paper also models queries as sequences of keywords and follows a similar approach to compute the relevance of the documents.

Many well-known IR techniques, such as relevance feedback, have a very intuitive interpretation in the classical probabilistic models (Robertson, 1997). These techniques require modifying a sample set of relevant documents according to the user's relevance judgments. However, they are difficult to integrate into the language modeling framework, where there is no such notion of a set of relevant documents. The work on relevance modeling returns to the probabilistic models' view of the document ranking problem, i.e., the estimation of the probability P(w_i|R) of finding the word w_i in an ideal set of relevant documents R. The purpose of relevance modeling is to identify those words that indicate relevance and thus will be effective when comprising a query. These papers make the assumption that, in the absence of training data but given a query Q = q_1, q_2, ..., q_n, the probability P(w_i|R) can be approximated by the probability P(w_i|q_1, q_2, ..., q_n) of the co-occurrence of the sequence of query keywords Q and the word w_i (Lavrenko & Croft, 2001), that is

P(w_i|R) ≈ P(w_i|Q) = P(w_i, Q) / P(Q)    (3)

Let M = {m_j} be the finite universe of language models m_j that (notionally) generated the documents of the collection. If we assume independence between the word w_i and the query keywords Q,¹ the joint probability P(w_i, Q) can then be computed as the total probability of emitting the word and the query keywords from each language model in M:

P(w_i, Q) = Σ_{m_j ∈ M} P(m_j) P(w_i|m_j) P(Q|m_j)    (4)

Formula (4) can be interpreted as follows: P(m_j) is the probability of selecting a language model m_j from the set M, P(w_i|m_j) is the probability of emitting the word w_i from the language model m_j, and P(Q|m_j) is the probability of emitting the query keywords Q from the same language model. As in the language modeling approach (Song & Croft, 1999), the probability P(w_i|m_j) can be estimated by the smoothed relative frequency of the word in the document; see formula (2). By applying Bayes' conditional probability theorem, the probability P(Q|m_j) can be computed by

P(Q|m_j) = P(m_j|Q) P(Q) / P(m_j)    (5)

Replacing P(Q|m_j) by the previous expression in formula (4), we obtain:

P(w_i, Q) = Σ_{m_j ∈ M} P(w_i|m_j) P(m_j|Q) P(Q)    (6)

Finally, by including formula (6) in expression (3), the approximation of the probability P(w_i|R) results in

P(w_i|R) ≈ Σ_{m_j ∈ M} P(w_i|m_j) P(m_j|Q)    (7)

In order to implement relevance models in an IR system, the set M is restricted to contain only the language models of the k top-ranked documents retrieved by the query Q. The system performs the following two steps (Lavrenko et al., 2002):

1. Retrieve from the document collection the documents that contain all or most of the query keywords, and rank them according to the probability P(m_j|Q) that they are relevant to the query. As formula (5) shows, this is equivalent to ranking the documents by the probability P(Q|m_j), since the probabilities P(m_j) and P(Q) are constant across queries. The language modeling formula proposed in Song and Croft (1999) can be used for this purpose; see formula (1). Let RQ be the set composed of the language models associated with the top r ranked documents (RQ stands for documents Relevant to the Query).

2. Approximate the probability P(w_i|R) of finding a word w_i in the ideal set of relevant documents R by the probability P(w_i|RQ) of emitting it from the set of relevant document language models RQ:

P(w_i|R) ≈ P(w_i|RQ) ≈ Σ_{m_j ∈ RQ} P(w_i|m_j) P(m_j|Q)    (8)
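The two steps above can be sketched as follows. This is an illustrative implementation with toy inputs: step 1 is assumed to have already produced the query likelihoods P(Q|m_j) (e.g., with formula (1)), and step 2 uses unsmoothed P(w|m_j) for brevity (the paper smooths it as in formula (2)).

```python
# Sketch of the two-step relevance-model estimation (formula (8)).

def relevance_model(query_scores, doc_words, top_r=2):
    """query_scores: doc id -> P(Q|m_j); doc_words: doc id -> token list.
    Returns the estimate of P(w|RQ) of formula (8)."""
    # Step 1: RQ = language models of the top-r ranked documents, with
    # P(m_j|Q) proportional to P(Q|m_j) (P(m_j) and P(Q) are constant).
    rq = sorted(query_scores, key=query_scores.get, reverse=True)[:top_r]
    z = sum(query_scores[d] for d in rq)
    p_m_given_q = {d: query_scores[d] / z for d in rq}
    # Step 2: P(w|RQ) = sum over m_j in RQ of P(w|m_j) P(m_j|Q).
    p_w = {}
    for d in rq:
        words = doc_words[d]
        for w in set(words):
            p_w[w] = p_w.get(w, 0.0) + words.count(w) / len(words) * p_m_given_q[d]
    return p_w

scores = {"d1": 0.030, "d2": 0.001, "d3": 0.020}     # toy P(Q|m_j) values
tokens = {"d1": ["petrol", "crisis", "japan"],
          "d2": ["profit"],
          "d3": ["petrol", "korea"]}
print(relevance_model(scores, tokens))               # d2 falls outside RQ
```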

The main contribution of relevance modeling is the probabilistic approach discussed above to estimate P(w_i|R) using the query alone, which had been done in a heuristic fashion in previous works (Robertson & Jones, 1976). This approximation to P(w_i|R) can later be used for applying the probability ranking principle. For instance, the authors of Lavrenko and Croft (2001) represent the documents as sequences of independent words (let w_i ∈ d_j be each one of these words) and propose to rank the documents by

¹ Notice that it is not a realistic assumption, since correlation between words always exists in texts. However, like many other retrieval models, we need to assume independence in order to compute the joint probability.


P(d_j|R) / P(d_j|R̄) ≈ ∏_{w_i ∈ d_j} P(w_i|R) / P(w_i|R̄)    (9)

where R̄ denotes the set of non-relevant documents.

Relevance models have been shown to outperform baseline language modeling and tf·idf IR systems in TREC ad hoc retrieval and TDT topic tracking tasks (Lavrenko & Croft, 2001). Moreover, relevance modeling provides a theoretically well-founded framework where it is possible not only to calculate the probability of finding a word in the set of documents relevant to an IR query, but also to estimate the probability of observing any arbitrary type of object described in this set of relevant documents. For example, in Lavrenko, Feng, and Manmatha (2003) relevance models are applied to image retrieval tasks to compute the joint probability of sampling a set of image features and a set of image annotation words. The notion of the set RQ of documents relevant to an IR query can be used for representing the context of analysis in a contextualized warehouse. The relevance model presented in this paper adapts this idea to estimate the probability of observing a corporate fact described in the set of documents relevant to the context of analysis.

4. The facts relevance model

In this section, we propose a relevance model to calculate the relevance of a fact with respect to a selected context (i.e., to an IR query). Intuitively, a fact will be relevant to the selected context if it is found in a document which is also relevant to this context. We will consider that a fact is important in a document if its dimension values are mentioned frequently in the document's textual contents. We assume that each document d_j describes a set of facts {f_i}, and that the document and its fact set were generated by a model m_j that emits words, with a probability P(w_i|m_j), and facts, with a probability P(f_i|m_j).

Definition 1. Let D_1, D_2, ..., D_n be the dimensions defined in the corporate warehouse OLAP cubes. A fact f_i consists of an n-tuple of dimension values (v_1, v_2, ..., v_n), where v_k ∈ D_k, meaning that each v_k is a value of the dimension D_k. By v_k ∈ f_i we will mean that v_k is a dimension value of the fact f_i.

The tuple (fo1, Japan, 1998/10) ∈ Products × Customers × Time represents the fact f4 of the cube characterized by the dimensions Products, Customers and Time, shown in Table 1. Notice that at this point we are only concerned with the occurrence of dimension values in the documents, independently of the hierarchy level to which they belong. Thus, in the relevance model, we simply consider a dimension as the flat set that includes all the members of the dimension hierarchy levels, as specified in the corporate warehouse schema. For example, we do not make explicit the mapping of customers into cities, or states into countries, in the Customers dimension; we just represent the dimension Customers by the set that comprises all the values of its hierarchy levels (e.g., Customers = Customer ∪ City ∪ State ∪ Country).

Definition 2. Let Q = q_1, q_2, ..., q_n be an IR query, consisting of a sequence of keywords q_i, and let RQ be the set of models that generated the documents relevant to this query. We compute the relevance of a fact f_i to the query Q by the probability P(f_i|RQ) of emitting this fact from the set RQ of models relevant to the query, as follows:

P(f_i|RQ) = [ Σ_{m_j ∈ RQ} P(f_i|m_j) P(Q|m_j) ] / [ Σ_{m_j ∈ RQ} P(Q|m_j) ]    (10)

That is, we estimate the relevance of a fact by calculating the probability of observing it in the set of documents relevant to the query. In formula (10), P(Q|m_j) is the probability of emitting the query keywords from the model m_j. This probability is computed with the language modeling formula (1).

Definition 3. P(f_i|m_j) is the probability of emitting the fact f_i from the model m_j, which is estimated as follows:

P(f_i|m_j) = Σ_{v_k ∈ f_i} freq(v_k, d_j) / |d_j|_v    (11)

where freq(v_k, d_j) is the number of times that the dimension value v_k is mentioned in d_j, and |d_j|_v is the total number of dimension values found in the document d_j.

The approach discussed above to compute the probability P(f_i|RQ) is based on the relevance modeling techniques presented in Section 3. However, we have adapted these techniques to estimate the probability of facts instead of document words. Next, we point out the major similarities and differences between the two approaches. The probability P(Q|m_j) can be expressed in terms of the probability P(m_j|Q) by applying the conditional probability formula (5). By including expression (5) in formula (10), we have that

P(f_i|RQ) = [ Σ_{m_j ∈ RQ} P(f_i|m_j) P(m_j|Q) P(Q) / P(m_j) ] / [ Σ_{m_j ∈ RQ} P(Q|m_j) ]    (12)

In formula (12), P(Q) is the joint probability of emitting the query keywords from the set of models RQ, and P(m_j) denotes the probability of selecting a model from this set. In order to estimate the probability P(Q), we compute the total probability of emitting the query keywords from each model in RQ; see formula (13).


Table 2
Topic number, title and expected top-ranked industry of the TREC topics selected for the experiment.

Topic # | Title | Industry
109 | Find Innovative Companies | Software & Computer Services
112 | Funding Biotechnology | Biotechnology
124 | Alternatives to Traditional Cancer Therapies | Health Care Equipment & Services
133 | Hubble Space Telescope | Aerospace & Defense
135 | Possible Contributions of Gene Mapping to Medicine | Biotechnology
137 | Expansion in the US Theme Park Industry | Media
143 | Why Protect US Farmers? | Food Producers
152 | Accusations of Cheating by Contractors on US Defense Projects | Aerospace & Defense
154 | Oil Spills | Oil & Gas Producers
162 | Automobile Recalls | Automobiles & Parts
165 | Tobacco Company Advertising and the Young | Tobacco
173 | Smoking Bans | Tobacco
179 | U.S. Restaurants in Foreign Lands | Restaurants & Bars
183 | Asbestos Related Lawsuits | Construction & Materials
187 | Signs of the Demise of Independent Publishing | Media
198 | Gene Therapy and Its Benefits to Humankind | Biotechnology

Fig. 2. Average precision versus recall obtained for the selected TREC topics.

Notice that the assumption made in formula (13) is equivalent to the one made by the relevance modeling works in formula (4) to calculate the joint probability P(w_i, Q):

P(Q) = Σ_{m_j ∈ RQ} P(Q|m_j) P(m_j)    (13)

By considering that the probability P(m_j) is constant, and replacing the probability P(Q) by the previous expression, we have that formula (12) is equivalent to

P(f_i|RQ) = Σ_{m_j ∈ RQ} P(f_i|m_j) P(m_j|Q)    (14)

Notice the similarity between formula (14) and the relevance modeling formula (8) used for computing the probability P(w_i|RQ). The difference is that, whereas the ordinary relevance modeling proposals approximate the probability P(w_i|R) by the probability of observing the word w_i once the query keywords Q have been observed in the documents, i.e., P(w_i|R) ≈ P(w_i|Q), we approximate the probability P(f_i|R) by the probability of finding the fact f_i when the query keywords Q have been previously found in the documents, that is, P(f_i|R) ≈ P(f_i|Q).
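A small numeric check makes the equivalence between formulas (10) and (14) concrete. The probabilities below are toy values, and P(m_j) is taken to be uniform over RQ, as discussed above.

```python
# Toy check (invented values) that formulas (10) and (14) coincide when
# P(m_j) is uniform over RQ and P(m_j|Q) follows Bayes' rule.

p_q_m = [0.03, 0.01, 0.02]     # P(Q|m_j) for the three models in RQ
p_f_m = [0.50, 0.00, 0.25]     # P(f_i|m_j)

# Formula (10): sum_j P(f|m_j) P(Q|m_j) / sum_j P(Q|m_j).
f10 = sum(pf * pq for pf, pq in zip(p_f_m, p_q_m)) / sum(p_q_m)

# Formula (14): sum_j P(f|m_j) P(m_j|Q), with
# P(m_j|Q) = P(Q|m_j) / sum_k P(Q|m_k) under a uniform P(m_j).
p_m_q = [pq / sum(p_q_m) for pq in p_q_m]
f14 = sum(pf * pm for pf, pm in zip(p_f_m, p_m_q))

assert abs(f10 - f14) < 1e-12
print(f10)                     # 0.3333...
```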


Table 3
Average precision versus recall values obtained for the selected TREC topics.

Recall level | Average precision
0.0 | 0.8403
0.1 | 0.6671
0.2 | 0.5690
0.3 | 0.5283
0.4 | 0.4472
0.5 | 0.4167
0.6 | 0.3728
0.7 | 0.3525
0.8 | 0.2556
0.9 | 0.1697
1.0 | 0.0057
Average | 0.4205

Table 4
R-Precision obtained for each TREC topic.

Topic # | R-Precision
109 | 0.2727
112 | 0.2500
124 | 0.5000
133 | 0.4762
135 | 0.7500
137 | 0.7083
143 | 0.4615
152 | 0.1852
154 | 0.5294
162 | 0.3333
165 | 0.3500
173 | 0.5526
179 | 0.1250
183 | 0.6842
187 | 0.3718
198 | 0.7419
Average | 0.4558

Fig. 3. R-Precision histogram for the selected TREC topics.

Fig. 4. Average F-measure for the selected TREC topics with different sizes of the result set.

5. Experiments and results

This section evaluates the proposed relevance model with the Wall Street Journal (WSJ) TREC test collection (Harman, 1995) and a fact database constructed from the metadata available in the documents. In our experiments, we took a set of example information requests (called topics in TREC), determined the expected most relevant fact in the result for each topic, and analyzed the quality of the fact ranking provided by our model. It is important to emphasize that the objective here is not to evaluate document retrieval performance. The formulas used in our approach for estimating the relevance of a document and building the set RQ of relevant models are based on those of language modeling, which have already been shown to obtain good performance results (Song & Croft, 1999). The final objective of our experiments is to evaluate the proposed fact relevance ranking approach. Next, we introduce the document collection, the fact database and the topics selected for the experiments. Afterwards, we show how we built the IR queries for the topics and tuned the set RQ. Finally, we study the results obtained when ranking the facts with the relevance model.

5.1. Document collection, fact base and topics

In our experiments we considered the 1990-WSJ subcollection from TREC disk 2, a total of 21,705 news articles published during 1990. The news articles of the WSJ subcollection contain metadata. These metadata comprise, among other information, the date of publication of the article and the list of companies reported on by the article. By combining the date of publication and the company list of each article, we built a (Date, Company) fact database; a sketch of this construction is given below. For each fact, we also kept the news articles where the corresponding (Date, Company) pair was found. Thus, our experiments involved two dimensions: the Date and the Companies dimensions. In the Companies dimension, the companies described by the WSJ articles are organized into Industries, which are in turn classified into Sectors. The correspondence between companies, industries and sectors is based on the Yahoo Finance² companies classification.

We selected 16 topics from the TREC-2 and TREC-3 conferences by choosing the topics that have at least 20 documents in the provided solution set of documents relevant to the topic. We made such a restriction to ensure that the set of relevant documents was big enough to find several samples of the dimension values relevant to the query. Furthermore, we examined the textual description of each selected topic in order to determine the industry that is most likely related to the theme of the topic, that is, the industry of the companies that are expected to be found in the top-ranked facts for each topic.

² http://finance.yahoo.com.
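The construction of the (Date, Company) fact database described above can be sketched as follows. The article records and field names are illustrative assumptions, not the actual WSJ TREC markup.

```python
# Sketch: build (Date, Company) facts from article metadata.

from collections import defaultdict

articles = [
    {"doc_id": "WSJ-0001", "date": "1990-04-05",
     "companies": ["exxon"]},
    {"doc_id": "WSJ-0002", "date": "1990-04-05",
     "companies": ["exxon", "mobil"]},
]

# (Date, Company) -> set of documents where the pair was found.
facts = defaultdict(set)
for art in articles:
    for company in art["companies"]:
        facts[(art["date"], company)].add(art["doc_id"])

print(facts[("1990-04-05", "exxon")])   # {'WSJ-0001', 'WSJ-0002'}
```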


Table 5
Top-ranked industries for the TREC topics 109–152.

Industry | Relevance

Topic 109, expected industry = Software & Computer Services
Software & computer services | 0.6772
Technology hardware & equipment | 0.2510
Fixed line telecommunications | 0.0297
Chemicals | 0.0211

Topic 112, expected industry = Biotechnology
Biotechnology | 0.7565
Pharmaceuticals | 0.0981
Aerospace & defense | 0.0426

Topic 124, expected industry = Health Care Equipment & Services
Biotechnology | 0.6496
Health care equipment & services | 0.1778
Pharmaceuticals | 0.1439
Food & drug retailers | 0.0249
Technology hardware & equipment | 0.0038

Topic 133, expected industry = Aerospace & Defense
Aerospace & defense | 0.9793
General retailers | 0.0207

Topic 135, expected industry = Biotechnology
Biotechnology | 0.8870
Pharmaceuticals | 0.0460
Chemicals | 0.0385
Health care equipment & services | 0.0213

Topic 137, expected industry = Media
Media | 0.6262
Industrial metals | 0.3019
Food producers | 0.0234
General retailers | 0.0151

Topic 143, expected industry = Food Producers
Food producers | 0.9999
Chemicals | 3.2e–5

Topic 152, expected industry = Aerospace & Defense
Aerospace & defense | 0.9881
Technology hardware & equipment | 0.0083
Electronic & electrical equipment | 0.0016

Table 2 shows the topic number, title and expected top-ranked industry of the TREC topic set considered in our experiments. For example, as this table shows, the expected most relevant industry for TREC topic number 198, entitled "Gene Therapy and Its Benefits to Humankind", is Biotechnology.

5.2. Building the set RQ

In order to estimate the relevance of the facts accurately, we need an acceptable description of each selected topic in the corresponding set of relevant models RQ. Next, we show how we constructed and tuned the context of analysis (i.e., the set RQ) for the test topics. For each topic, we specified a short IR query (fewer than 4 keywords), and then retrieved the set of documents relevant to this query, as discussed in Section 3. The smoothing parameter λ of formula (2) determines the features of the top-ranked documents, mainly their length (Losada & Azzopardi, 2008). Larger documents are usually ranked in the first positions as λ decreases. In our case, larger documents are more likely to describe more dimension values than shorter ones, and therefore they can contribute better to contextualizing facts. Additionally, it is well known that short queries require less smoothing than larger ones. For these reasons, we set the smoothing parameter λ to 0.1 in our experiments. Nevertheless, a deeper study of the influence of the smoothing method on the results must be carried out in the future.

The query keywords were interactively selected to reach an acceptable precision versus recall figure (Baeza-Yates & Ribeiro-Neto, 1999). Typically, an "acceptable" retrieval performance is considered to be achieved when the precision is over 40% at low recall values (e.g., 20%), greater than 30% for a recall of 50%, and no lower than 10% for high recall percentages like 80%; see, for example, the evaluations of Harman (1995), Lavrenko and Croft (2001) and Song and Croft (1999). Fig. 2 illustrates the average precision values obtained at the 11 standard recall levels for the selected topics. The percentages are over the acceptable margins quoted above. Table 3 details these precision values.

The R-Precision is a useful parameter for measuring the quality of the result set for each individual topic when the ideal set R of documents judged to be relevant is known (Baeza-Yates & Ribeiro-Neto, 1999). Given |R|, the number of documents in the ideal set R, it calculates the precision for the |R| top-ranked documents in the result set. Both R-Precision and the F-measure used later for tuning RQ are sketched in the code after this paragraph.
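The following is a minimal sketch of the two quality measures, on a toy ranking with invented relevance judgments.

```python
# Sketch of R-Precision and the F-measure.

def r_precision(ranking, relevant):
    """Precision over the first |R| positions, R being the ideal relevant set."""
    cut = ranking[: len(relevant)]
    return sum(1 for d in cut if d in relevant) / len(relevant)

def f_measure(ranking, relevant, k):
    """Harmonic mean of precision and recall over the k top-ranked documents."""
    hits = len(set(ranking[:k]) & relevant)
    if hits == 0:
        return 0.0
    precision, recall = hits / k, hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

ranking = ["d7", "d84", "d2", "d123", "d9"]
relevant = {"d7", "d123", "d50"}
print(r_precision(ranking, relevant))     # 1 relevant in top 3 -> 0.333...
print(f_measure(ranking, relevant, k=4))  # P = 0.5, R = 2/3 -> 0.571...
```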


Table 6
Top-ranked industries for the TREC topics 154–198.

Industry | Relevance

Topic 154, expected industry = Oil & Gas Producers
Oil & gas producers | 0.6348
Oil equipment, services & distribution | 0.3192
Industrial transportation | 0.0460

Topic 162, expected industry = Automobiles & Parts
Aerospace & defense | 0.5426
Automobiles & parts | 0.4562
Oil & gas producers | 0.0008
Chemicals | 0.0002

Topic 165, expected industry = Tobacco
Tobacco | 0.6356
Media | 0.1473
Airlines | 0.1263
Industrial transportation | 0.0877
Aerospace & defense | 0.0029

Topic 173, expected industry = Tobacco
Airlines | 0.4585
Tobacco | 0.3525
Media | 0.0701
Industrial transportation | 0.0431

Topic 179, expected industry = Restaurants & Bars
Restaurants & bars | 0.8930
Beverages | 0.0476
Travel & leisure | 0.0229

Topic 183, expected industry = Construction & Materials
Construction & materials | 0.7459
Media | 0.0591
Chemicals | 0.0579

Topic 187, expected industry = Media
Media | 0.9375
Technology hardware & equipment | 0.0283
General retailers | 0.0225

Topic 198, expected industry = Biotechnology
Biotechnology | 0.8858
Chemicals | 0.0718
Pharmaceuticals | 0.0249

Table 4 shows the R-Precision values obtained for each topic, as well as the resulting average R-Precision. Fig. 3 depicts the corresponding R-Precision histogram.

As stated in Section 3, the set RQ comprises the models associated with the k top-ranked documents of the query. We now turn our attention to tuning the size of the RQ sets. Here, our purpose is to determine the number k of top-ranked documents to be considered in RQ that maximizes the retrieval performance. In this case, we use a different performance measure, called the F-measure (Baeza-Yates & Ribeiro-Neto, 1999), which calculates the harmonic mean of precision and recall. Maximizing the F-measure means finding the best possible combination of precision and recall. We computed the average F-measure for the selected TREC topics with different sizes of the result set. As Fig. 4 shows, the maximum value is 0.4534, reached when the result set contains the 36 top-ranked documents.

5.3. Evaluation of results

Finally, in this section we evaluate the fact relevance ranking results obtained with our model. For each topic, we considered the facts described by the 36 top-ranked documents in the corresponding set RQ. We grouped the facts by industry and calculated their relevance to the IR query following the approach discussed in Section 4. Tables 5 and 6 show the industries, along with their relevance, at the top of the ranking for each topic.

We can conclude that the results demonstrate the effectiveness of the approach. For all the topics, even for those where the R-Precision was low (see, for example, topics 152 and 179), the expected industry is found at the first (81% of the topics) or the second (19%) position of the ranking. Furthermore, the relevance value assigned to the facts clearly differentiates the industries that are directly related to the topic of analysis from those that are not so relevant. In almost all cases, the relevance value decreases by approximately one order of magnitude. For example, in topic number 154, the first (Oil & Gas Producers) and second (Oil Equipment, Services & Dis-


tribution) ranked industries are clearly related to the theme of the topic ("Oil Spills"). The relevance values assigned to these industries (0.6348 and 0.3192, respectively) are significantly greater than the relevance value of the next industry in the ranking (Industrial Transportation, 0.0460).

We also find an explanation for some of the topics where the ranking was not completely accurate. The top-ranked industry for topic number 173 is Airlines, whereas the expected industry, Tobacco, is found at the second position of the ranking. The reason is that a number of the documents judged relevant to this topic report smoking bans on flights. The industry at the top of the ranking for topic 137 is Media, since many media companies also own theme parks (e.g., Time Warner/Warner Bros. Entertainment). In fact, in our Companies dimension, the Media industry also comprises these recreation and entertainment companies. The second top-ranked industry for this topic is Industrial Metals, which still has a relatively high relevance value. Although this industry initially seemed irrelevant to topic 137, after reading some of the documents retrieved for this topic, we discovered a group of news articles describing the Japanese company Nippon Steel's diversification strategy in the amusement-park sector.

6. Conclusions

This paper introduces a new relevance model aimed at ranking the structured data (facts) and documents of a contextualized warehouse when the user establishes an analysis context (i.e., runs an IR query). The approach can be summarized as follows. First, we use language modeling formulas (Ponte & Croft, 1998) to rank the documents by the probability of emitting the query keywords from the respective language model. Then, we adapt relevance modeling techniques (Lavrenko & Croft, 2001) to estimate the relevance of the facts by the probability of observing their dimension values in the top-ranked documents.

We have evaluated the model with the Wall Street Journal (WSJ) TREC test subcollection and a fact database self-constructed from the metadata available in the documents. The results obtained are encouraging. The experiments show that our relevance model is able to clearly differentiate the facts that are directly related to the test topics from those that are not so relevant. We found the expected top-ranked fact in the first or the second position of the ranking for the 16 topics selected. A deeper study of the influence of the smoothing method on our approach remains to be done.

In the prototype of the contextualized warehouse presented in Pérez et al. (2007), a corporate warehouse with data from the world's major stock indices is contextualized with a repository of business articles, also selected from the WSJ TREC collection. The prototype involved a dataset of 1936 (Date, Market, Stock Index value) facts and 132 documents. Although we did not formally evaluate the relevance model of the prototype, we showed some analysis examples where the relevant articles explain the increases and decreases of the stock indices. Testing the performance of the contextualized warehouse analysis operations with larger datasets and studying query optimization techniques is also future work.

One of the current research lines in the field of IR is opinion retrieval (Eguchi & Lavrenko, 2006; Liu, Hu, & Cheng, 2005). These papers propose specific techniques for retrieving and classifying opinions expressed in small text fragments (like the posts of a web forum).
We are currently working on extending our retrieval model with opinion retrieval techniques in order to contextualize a traditional company's sales data warehouse with documents gathered from web forums, where the customers review the products/services of the company.

References

Baeza-Yates, R. A., & Ribeiro-Neto, B. A. (1999). Modern information retrieval. ACM Press/Addison-Wesley.
Codd, E. F. (1993). Providing OLAP to user-analysts: An IT mandate.
Danger, R., Berlanga, R., & Ruiz-Shulcloper, J. (2004). CRISOL: An approach for automatically populating semantic web from unstructured text collections. In Proceedings of the 15th international conference on database and expert systems applications (pp. 243–252).
Eguchi, K., & Lavrenko, V. (2006). Sentiment retrieval using generative models. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 345–354).
Harman, D. K. (1995). Overview of the third text retrieval conference (TREC-3). In D. K. Harman (Ed.), Overview of the third text retrieval conference (TREC-3) (pp. 1–19). NIST Special Publication 500-225.
Inmon, W. H. (2005). Building the data warehouse. John Wiley & Sons.
Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., & Thomas, S. (2002). Relevance models for topic detection and tracking. In Proceedings of the second international conference on human language technology research (pp. 115–121). San Francisco, CA: Morgan Kaufmann Publishers Inc.
Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 120–127).
Lavrenko, V., Feng, S. L., & Manmatha, R. (2003). Statistical models for automatic video annotation and retrieval. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (pp. 17–21).
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th international conference on the world wide web (pp. 342–351).
Llidó, D. M., Berlanga, R., & Aramburu, M. J. (2001). Extracting temporal references to assign document event-time periods. In Proceedings of the 12th international conference on database and expert systems applications (pp. 62–71).
Losada, D. E., & Azzopardi, L. (2008). An analysis on document length retrieval trends in language modeling smoothing. Information Retrieval, 11(2), 109–138.
Pedersen, T. B., & Jensen, C. S. (2005). Multidimensional databases. In R. Zurawski (Ed.), The industrial information technology handbook (pp. 1–13). CRC Press.
Pérez, J. M. (2007). Contextualizing a data warehouse with documents. PhD thesis, Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I de Castelló (Spain).
Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2007). R-cubes: OLAP cubes contextualized with documents. In Proceedings of the IEEE 23rd international conference on data engineering (pp. 1477–1478).
Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2008). Contextualizing data warehouses with documents. Decision Support Systems, 45(1), 77–94.


Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 275–281). New York, NY: ACM Press.
Robertson, S. (1997). The probability ranking principle in IR. In Readings in information retrieval (pp. 281–286). Morgan Kaufmann Publishers Inc.
Robertson, S., & Jones, K. S. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.
Song, F., & Croft, W. B. (1999). A general language model for information retrieval. In Proceedings of the eighth international conference on information and knowledge management (pp. 316–321). New York, NY: ACM Press.

Juan Manuel Pérez obtained the B.S. degree in Computer Science in 2000, and the Ph.D. degree in 2007, both from Universitat Jaume I, Spain. Currently, he is associate lecturer at this university. He is author of a number of papers in international journals and conferences such as Decision Support Systems, IEEE Transactions on Knowledge and Data Engineering, DEXA, ECIR, ICDE, DOLAP, etc. His research interests are information retrieval, multidimensional databases, and web-based technologies.

Rafael Berlanga is an associate professor of Computer Science at Universitat Jaume I, Spain. He received the B.S. degree in Physics from Universidad de Valencia, and the Ph.D. degree in Computer Science in 1996 from the same university. He is author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, and Applied Intelligence, among others, and numerous communications in international conferences such as DEXA, ECIR, CIARP, etc. His current research interests are knowledge bases, information retrieval, and temporal reasoning.

María José Aramburu is an associate professor of Computer Science at Universitat Jaume I, Spain. She obtained the B.S. degree in Computer Science from Universidad Politécnica de Valencia in 1991, and a Ph.D. from the School of Computer Science of the University of Birmingham (UK) in 1998. She is author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, and Applied Intelligence, and numerous communications in international conferences such as DEXA, ECIR, etc. Her main research interests include document databases and their applications.