Relevance theory and distributions of judgments in document retrieval

Information Processing and Management 53 (2017) 1080–1102
http://dx.doi.org/10.1016/j.ipm.2017.02.010


Howard D. White
College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA
E-mail address: [email protected]

Disclaimer: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.


Article history: Received 18 December 2016; Revised 18 February 2017; Accepted 21 February 2017

Keywords: Scale language; Cognitive effects; Processing effort

Abstract

This article extends relevance theory (RT) from linguistic pragmatics into information retrieval. Using more than 50 retrieval experiments from the literature as examples, it applies RT to explain the frequency distributions of documents on relevance scales with three or more points. The scale points, which judges in experiments must consider in addition to queries and documents, are communications from researchers. In RT, the relevance of a communication varies directly with its cognitive effects and inversely with the effort of processing it. Researchers define and/or label the scale points to measure the cognitive effects of documents on judges. However, they apparently assume that all scale points as presented are equally easy for judges to process. Yet the notion that points cost variable effort explains fairly well the frequency distributions of judgments across them. By hypothesis, points that cost more effort are chosen by judges less frequently. Effort varies with the vagueness or strictness of scale-point labels and definitions. It is shown that vague scales tend to produce U- or V-shaped distributions, while strict scales tend to produce right-skewed distributions. These results reinforce the paper's more general argument that RT clarifies the concept of relevance in the dialogues of retrieval evaluation.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

When information retrieval systems are evaluated, judges assess retrieved documents for their relevance to queries, the simplest scale being "not relevant" or "relevant." A persistent finding in retrieval experiments is that scales with intermediate degrees of relevance produce distributions that are roughly U- or V-shaped. That is, whether the scales have few or many values, judges choose the low and high endpoints more frequently than the midpoints (Saracevic, 2007b: 2137), as illustrated at length by Janes (1993). However, in other articles, such as Sormunen (2002) and Lykke, Larsen, Lund, and Ingwersen (2010), the distributions of judgments on graded scales are right-skewed: from a high frequency at "not relevant," they simply decrease. This paper gives a cognitive account of factors underlying both kinds of distributions.

The argument is that the distributions depend not only on properties of the documents being assessed but also on judges' interpretations of the scale points, which are communications from researchers. The researchers may label and perhaps define the points, or they may simply let judges infer them as positions on a line. In either case, judges must expend cognitive effort to process the points as inputs, and the effort will vary, depending on properties of these inputs as communications. My general claim is that judges tend to assign documents less frequently to scale points that cost them greater effort to process. That is, the more demanding the label, definition, or requirements of a point, the less judges choose it, resulting in
highly non-uniform distributions of grades. White (2011) similarly equated greater effort with lower document frequencies to explain the skewing of citation distributions over time. This line of thought is influenced by Dan Sperber and Deirdre Wilson's (1995) relevance theory, a major subfield of linguistic pragmatics brought most fully into information science (IS) by Harter (1992), Huang (2009), Saracevic (2007a), and White (2007a,b, 2009, 2010, 2011, 2014). Relevance theory (RT) and cognitive information science are mutually reinforcing. RT affords a clearer understanding of relevance in the dialogues of information retrieval (IR). Information science is a rich source of data for testing RT's explanatory powers.

The 50-plus experiments discussed here were found through topical searches, direct citation searches, and advice from experts. They carry forward Janes's (1993: 113) insight into what judges of documents do: "Determine, very quickly, if the document is really good or really bad. If so, say so (and the data appears to show that they don't much care exactly how really good or bad it is). If not, then more time and effort must be taken to determine how much of it is good, whether or not it is from a trustworthy source, addresses the right issues, is in the right language, is available and accessible, etc. The first of these processes is quick, relatively easy, and is done with confidence. The second is slower, less certain, and is done with more difficulty."

Studies mostly not covered by Janes are presented here as replicated tests of two hypotheses:

1. U- or V-shaped distributions of relevance judgments tend to be associated with vague scales. The vagueness is most problematic at the midpoints: the middle range is the muddle range (cf. Kulas & Stachowski, 2009). A sign of vagueness is that judges interpret scale points inconsistently; if they are asked why they placed a document at a certain point, as in Spink, Greisdorf, and Bateman (1998: 611) or Maglaughlin and Sonnenwald (2002: 335), their answers are all over the map. Vague scales also make for disagreements about document scoring, which have beset IR experiments from their earliest days (Saracevic, 2007b: 2134–2136).

2. Right-skewed distributions of relevance judgments tend to be associated with relatively strict scales. The definitions of points in these scales are somewhat less vague and become more exacting as they move rightward to the "most relevant" pole. Document counts at those points decline as the demands placed on judges increase. Strict scales are a reaction to the two-point scale defined by TREC, the Text REtrieval Conference (2006): "Only binary judgments ('relevant' or 'not relevant') are made, and a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document)." Researchers such as Sormunen (2002) reject this formulation as too liberal. Accordingly, documents called "relevant" on TREC's binary scale in earlier studies have been reassigned to finer grades of relevance in, e.g., Sormunen (2002) and Järvelin (2013). Kekäläinen (2005: 283) writes: "Judging relevance liberally is fast. In graded assessment, extra work is required to specify the degree of relevance of each document."

Researchers traditionally measure differences in effort by timing people's judgments: more difficult judgments take longer to make (Carterette & Soboroff, 2010; Smucker & Jethani, 2012; Wang, 2011).
Retrieval evaluations in IS are analogous, though unintentional, tests of scale points. To differentiate the effort these points cost, I rely not on timing but on linguistic arguments buttressed by past empirical results. I developed effort-based accounts of the distributions on several prototypical scales and then sought uses of the same scales (or close variants) in other papers, none known in advance, to see whether distributions of the same shape occurred. The causal arrow runs from language to frequencies. Thus effort-based predictions can be confirmed or disconfirmed. Judgments can of course be affected by cognitive factors other than the effort of processing scales. Research on such factors appears, for instance, in Ruthven, Baillie, and Elsweiler (2007) and Ruthven (2014), which also review older papers. The investigation reported here is not a meta-analysis of studies; it merely seeks, through RT, a parsimonious explanation of patterns in judgment frequencies at the level of entire experiments (or higher). Such aggregated patterns greatly simplify the causally complex distributions of judgments on individual queries. Distributions not fitting predicted shapes are noted.

2. Related work

2.1. New emphasis on effort

Several recent papers have analyzed user effort as a key variable in IR evaluation, especially as it bears on satisfaction with searches. Verma, Yilmaz, and Craswell (2016) measures and demonstrates the importance of three components of effort—reading a document, understanding it, and looking for specific information in it. According to a linked paper, Yilmaz, Verma, Craswell, Radlinski, and Bailey (2014), anyone must expend effort along these lines in the first stage of evaluating documents. But the authors also distinguish two kinds of judges. Further effort by "relevance judges" will consist simply in choosing the best grade for a document's degree of topical match with the query. By contrast, "real users" with acceptable first-stage results will put considerably more effort into a second stage, in which they extract the document's utility for their own purposes.

Jiang and Allan (2016) discusses user effort in processing a ranked list of documents to various depths. In past measures of retrieval quality, every ranked document has been treated as having the same cost to judge. The authors propose and test more realistic combined models that assign higher effort-costs to relevant documents in the list than to nonrelevant ones. Zuccon (2016) provides some measures that integrate document understandability (i.e., ease of reading) with topicality in
standard retrieval evaluation. The benefits and effort-costs that emerge as users pursue an overall search strategy, including evaluation, are analyzed in Azzopardi and Zuccon (2016).

Fig. 1. Judgment effort on scales with varying numbers of points; reproduced from Cuadra and Katter (1967).

2.2. Some background on scales and shapes

Earlier researchers have related effort to scale endpoints and midpoints. The graph in Fig. 1 comes by way of Janes (1993) from Cuadra and Katter (1967, v.2: 57–66). The judges in that study were more than 40 information scientists at a 1966 conference. They volunteered to assess the relevance of five short papers to four paragraph-long queries on topics in information systems research. In the sub-study used here, Cuadra and Katter created scales with two to eight degrees of relevance and two orientations for using documents. Respondents in groups of 10 to 13 were each assigned a scale with different numbers of points and asked to grade documents for "decision-oriented use" and "quantity-oriented use." They were then asked how difficult their grades were to assign and how certain they were of their verdicts. Fig. 1 shows their average difficulty scores on each set of decision-oriented scales. The four-, six-, and eight-point scales produce U- or V-shaped distributions (as did the same scales under the quantity orientation). Judgments at the endpoints were said to be less difficult and more certain than those at the midpoints.

Tang, Shaw, and Vevea (1999: 263) presents a similar finding. The authors ran experiments to determine the best number of categories for relevance judgment scales, measured by confidence ratings. On scales ranging from three to 11 points, participants were more confident of their judgments at the extremes and less confident at the midpoints, producing U-shaped curves.

Spink et al. (1998) found that end-users tended to grade documents "partially relevant" on a three-point scale when they were not yet clear on what they wanted. The middle grade was associated with self-reported changes in their goals, their understanding of a topic, or their ideas of what counted as relevant. "However," the researchers add (p. 611), "these judgments are fuzzy in nature, and so may be the changes." A judgment of "partially relevant" would thus suggest greater uncertainty and mental effort.

Scholer, Turpin, and Sanderson (2011) looked at TREC assessors' disagreements with themselves when they unintentionally graded the same document twice in the same evaluation. In experiments using a three-point scale, the authors state
(p. 1071), "Partially relevant documents were found to contribute disproportionately to assessor inconsistency"—that is, to giving the same document two different grades.

Villa and Halvey (2013: 765) asked: "Does the degree of relevance of a document to a topic affect the effort and accuracy of the judging process?" To answer, the authors drew a sample from the AQUAINT collection of news stories used in the HARD (High Accuracy Retrieval from Documents) Track of TREC 2005. TREC assessors had previously graded every document in the 50-topic collection on the standard scale of "not relevant," "relevant," or "highly relevant." Forty-nine volunteer participants regraded nine documents apiece (N = 440). In a special online set-up, they could move from topic statement to news story and back as often as they wished, but their topic views were automatically counted as an indicator of uncertainty. The accuracy of their grades was determined by comparing them to the TREC grades. The effort of grading each document was determined by their self-evaluations on six scales from the NASA task load index: mental demand, physical demand, temporal demand, performance, effort, and frustration. Participants agreed with the TREC assessors on 81% of 147 "not relevant" documents and on 83% of 147 "highly relevant" documents. They were least accurate on the midpoint "relevant," agreeing with assessors on only 68% of 146 documents. The distribution is thus V-shaped (with significant differences between the points). Participants also said that midpoint documents cost them more effort to judge. These documents were significantly higher on measures of mental demand, physical demand, and effort; also higher, though not significantly so, on measures of temporal demand and frustration. On the measure of self-estimated performance they were lower, suggesting that participants' judgments in the intermediate category were "less secure." The midpoint documents required significantly more topic views as well, again suggesting greater uncertainty.

The results in Gwizdka (2014) were similar. He had 24 students give grades of "not relevant" or "relevant" to three documents on each of 21 topical queries drawn from TREC's 2005 Question Answering Track. "Relevant" in this case meant "answers the question in the query." The three-document packages reflected his version of the grades previously assigned by TREC assessors: I (irrelevant); T (on-topic but lacks an answer); R (relevant; has an answer). The packages always contained, in randomized order, one R document and two documents with some combination of the other two grades. The I-T-R scale was treated as an independent variable. The dependent variables measured cognitive effort in evaluating documents by how long the students took to decide on grades and by eye-tracking their reading fixation sequences. Gwizdka's question was "Does the degree of relevance of a text document affect how it is read?" His analyses of variance with I-T-R, normalized for document length, are all statistically significant and answer the question affirmatively. Irrelevant documents tended to be scanned and required low cognitive effort. On-topic/no-answer documents at midpoint required high cognitive effort, including repeated scanning and reading. Relevant documents tended to be read and required medium cognitive effort. Graphed, a low-high-medium pattern is an inverse V, which pertains to accounts of V-shaped frequency distributions in later sections.
3. Relevance theory

This brings us to relevance theory (RT), with its emphases on inference and effort in communication (Sperber & Wilson, 1995; Wilson & Sperber, 2012). Good introductions are Allott's (2013) chapter and Clark's (2013) book. Sperber and Wilson (S&W or W&S) claim that human minds are evolutionarily adapted to prioritize what they heed on the basis of its present relevance in context (Allott, 2013: 60–64). S&W use "context" as a psychological term. It is internal to minds and dynamic, not external or static. Goatly (1997: 137) glosses it as "your existing beliefs/thoughts." Carston (2002: 376) defines it more formally for RT as "that subset of mentally represented assumptions which interacts with newly impinging information (whether received via perception or communication) to give rise to 'contextual effects'." RT now usually calls the latter "cognitive effects." The three types of effects are (1) to strengthen an assumption, (2) to contradict and eliminate it, or (3) to combine with it to yield a new conclusion—a conclusion derived from neither the new information nor the assumption alone, but from both together (W&S, 2012: 176; Clark, 2013: 102). If an utterance—or any input—fails to do any of these things, it is not relevant to the hearer. For a given individual at a given time, innumerable possible inputs are unheeded—and irrelevant—because they produce no effects.

The greater the effects of an input, the greater its relevance. The more effort required to process the input, the less its relevance. (Re-quoting Janes, the effects of a document might include "how much of it is good, whether or not it is from a trustworthy source, addresses the right issues"; the effort of processing it is affected by whether it "is in the right language, is available and accessible, etc.") Both factors operate simultaneously. Relevance thus varies directly with effects and inversely with effort. The claim can be represented as a benefit-cost ratio: the relevance of an input to an individual = cognitive effects / processing effort. According to RT, people automatically and constantly process inputs in this manner, but never with exact, ratio-level numbers. They can only judge relevance comparatively—as greater or lesser in degree—which accords with the ordinal nature of relevance scales in IR experiments (cf. Tague-Sutcliffe, 1995: 85–86). My own scales of effort and effects/effort ratios below are likewise ordinal; they are crude attempts to portray small differences in very rapid thought processes. Much more sophisticated versions of the ratio created for IR evaluations appear in Jiang and Allan (2016).

RT grounds its account of communication in a theory of cognition, and it discusses inputs not only from other people, but from the physical environment, memory, or inferential reasoning. However, as befits its origin in linguistics, it focuses on conversations. It seeks to explain how hearers spontaneously derive what speakers mean from what they say. To do
this, it subsumes a code model of communication under an inferential model (W&S, 2012: 263–265; Clark, 2013: 14–17). Speakers, that is, use words to encode evidence of what they want to communicate, but hearers cannot grasp their actual meanings simply by decoding words or phrases through lookups in a mental dictionary. Rather, they infer meanings by mindreading speakers’ intentions on the fly. This involves instantly deriving the implicit premises and conclusions that make a speaker’s utterance relevant in the hearer’s present context (Allott, 2013; Clark, 2013: 159–199). In claiming people’s attention, speech is privileged: hearers automatically expect talk addressed to them to be relevant, and speakers, wanting to be understood, meet that expectation to the extent their abilities and preferences allow (W&S, 2012: 64–66; Clark, 2013: 108–112). A speaker thus implicitly guarantees that an utterance will be at least worth processing. Worth is determined by the relevance-seeking hearer, for whom one interpretation of the utterance, out of the many possible, will usually be most accessible in the present context. Hearers stop at this first interpretation as the right one—the one the speaker intended—and go on to process the next utterance. They thereby avoid interpretive processing that churns on without resolution, which would leave them disastrously behind the speaker. At the same time, the relevance-guided interpretation very often is the right one, in that they have correctly inferred the speaker’s meaning. The process for hearers is summed up in the relevance-theoretic comprehension procedure (Wilson, 2012: 7): “Follow a path of least effort in looking for cognitive effects: Test interpretive hypotheses (disambiguations, contextual assumptions, implications, etc.) in order of accessibility. Stop when you have enough cognitive effects to satisfy your expectations of relevance.” The procedure may also be stopped if expectations are not met. Least effort means that hearers expend no unnecessary effort in deriving effects, which leads to the split-second interpretations characteristic of talk. Readers grasp writers’ meanings in the same way and often with similar speed. Emphatically, however, there is no guarantee that an utterance will be relevant in the sense of truly improving what the hearer knows. S&W call improvements in this latter sense positive cognitive effects—or in Wilson’s latest phrasing (2014: 133) “warranted conclusions, warranted strengthenings or revisions of available information.” Speakers frequently do meet this standard, of course, but for all sorts of reasons, they may not—what they say may be incomprehensible or evasive or deceitful or wrong—and hearers, for their part, may draw mistaken inferences from what they hear. Inferences in RT are defeasible. Communication is inherently fallible (W&S, 2012: 277; Clark, 2013: 6–7). According to RT, people differ in the inferences they draw from an input because their accessible cognitive contexts differ. Given a new piece of information, the cognitive contexts readily accessible to Sherlock Holmes are not the same as Dr. Watson’s, and Holmes reaches conclusions Watson does not. Differing cognitive contexts also explain why the same input affects people to different degrees–e.g., Watson greatly; Holmes not at all. Reading and responding to forms in retrieval experiments is a special case in this regard. In IR, people routinely draw different inferences about the same document (Saracevic, 2008; Scholer et al., 2011). 
They might differ on its degree of relevance to a given query or on why it is relevant. They might also revise their own earlier judgments, differing with themselves over time. RT would thus predict key IR findings summarized in Saracevic (1995: 143): "…there are considerable individual differences in relevance assessment by people; relevance is assessed by people in gradations, i.e., it is not a binary yes-no decision; it also heavily depends on circumstances, context, and so on."

A one-word distillation of "relevant" in its RT sense is consequential. A retrieved document is consequential if something true follows from it that alters an individual's cognitive context and thereby improves his or her representation of the world. (Such improvements are not necessarily pleasant.) This jibes with how "information" in the cognitive sense has been defined in information science (e.g., by Harter, 1992: 611–612; Barry, 1994: 150; Furner, 2004: 440–444). In RT an informative input is the same as an input that achieves relevance through positive cognitive effects. Inputs that misinform may seem relevant but are not genuinely so (S&W, 1995: 263–266).

4. Relevance theory and retrieval

The RT framework can be adapted to dialogues in document retrieval. When users put queries to a retrieval system, they act like speakers. Frequently their queries—e.g., a brief noun phrase—will greatly underspecify what they actually want. Underspecification is the norm in talk, because speakers can depend on hearers' inferential abilities to fill in the gaps (Carston, 2002: 15–28; Clark, 2013: 296–297). But unlike human hearers, major present-day systems (e.g., Google) cannot infer what queries really mean, nor do they interview users like a reference librarian to find out. Their powers are limited to what system designers can imply through various technologies, which is that some set of documents, often large, will be fruitful to browse. This set of documents is the system's response to the query, its predictions as a speaker—i.e., designers' mouthpiece. Systems that rank documents by predicted relevance also imply, on the designers' part, that the top documents will be more productive to browse than those lower down—a technique for increasing relevance by reducing effort. In IR-speak, retrievals with greater precision reduce effort.

Outputs from literature searches are of course quite unlike a speaker's utterances in face-to-face dialogues or a writer's utterances in text. Both computerized and manual searches depend on systems that cannot be said to intend meanings in the same way that a speaker or writer does. The people in charge of the systems no doubt intend each output to be as relevant as possible, but responsibility for how its content actually relates to queries is so diffuse as to be unassignable. Even so, the simplest explanation of system users' behavior is that they employ the same cognitive mechanisms by which they seek relevance in conversations or a book. When they interpret retrieved documents in light of queries, they want positive cognitive effects at an acceptable cost in effort.

Regarding effects, consider experiments in which judges decide how well documents in a retrieved set match a query in topic or respond to it as a question. These are inferences in S&W's sense. For example, quotations (a) through (e) are from students in Maglaughlin and Sonnenwald (2002: 332–333). They show that bibliographic information on a new document might:

Strengthen an assumption:
(a) "In the preliminary research I have done…this guy has come up."
(b) "So tripartism [as discussed in the document] is an issue I'm interested in."

Eliminate an assumption:
(c) "I'm nervous about this article because I was really hoping nothing had been done, so I really need to look at it."

Combine with an assumption to yield a new conclusion:
(d) "And the title implies that it is going to be broad in scope" [which requires an assumption of what constitutes breadth].
(e) "There is actually nothing new there" [which requires an assumption of what is already known].

Any such judges start with two contextual assumptions: a written query (their own or someone else's), and a written scale for judging the degree of relevance of a document to the query or to their own needs. They are given as new information the surrogate or full text of a document. As one new conclusion among others, they express that document's relevance in terms of a scale point.

The First, or Cognitive, Principle of RT is that "Human cognition tends to be geared to the maximisation of relevance" (S&W, 1995: 260–266). The Cognitive Principle is a claim about how human attention is allocated for efficient processing of inputs (any inputs, whether a bunch of documents or a tiger's snarl). Accordingly, as judges grade documents, they automatically seek maximal relevance for themselves, which Higashimori and Wilson (1996: 2) defines as "the greatest possible effects for the smallest possible effort." As judges consider scale points, they can reduce their effort by preferring easier interpretations to more problematical ones. In experiments with vague scales, endpoint values of "relevant" and "not relevant" are less vague than what Saracevic (1969: 298) calls the "fuzzy gray" midpoint values and hence easier to interpret when combined with newly presented texts. Judges stop at the easier values when they can, even if longer consideration of documents might make their judgments more accurate.

This does not necessarily mean that judges give researchers valid and reliable measures of query-document fit. Constrained by time pressures and their own subjectivity, they will at best mark their forms so as to respond as relevantly as they can at the time. In RT this is called optimal relevance (W&S, 2012: 64–66). Under RT's Second, or Communicative, Principle (there are only two), an optimally relevant communication carries the presumption that it is: (a) relevant enough to be worth processing, and (b) the most relevant one compatible with the communicator's abilities and preferences. According to Higashimori and Wilson (1996: 2), a communication worth processing gives hearers or readers "adequate effects for no unjustifiable effort." The Communicative Principle is a claim about how communicators exploit the Cognitive Principle to modify the thoughts and beliefs of their relevance-seeking addressees. For many reasons, communicators cannot (or may choose not to) guarantee that their utterances will be maximally relevant to addressees in terms of effects and effort.
However, they can speak or write so as to meet goals (a) and (b) above, and this "more modest" optimal relevance (Clark, 2013: 365) is what their hearers or readers can reasonably expect at a given time. For example, researchers presuming optimally relevant responses from judges would not want them to say "Don't know" about most of their assigned documents, because doing so would yield inadequate cognitive effects. Borlund (2000: 108) gave her judges a "Can't say" option, but adds that they used it only six times in 10,407 assessments, which implies that they knew its overuse would make their grades not worth processing. Nor would researchers want judges to add pluses and minuses to an existing scale, because interpreting those would cost unjustifiable effort. But only a rare researcher would challenge a judge because of a disagreement over a grade or because other judges marked the same document differently.

Disagreements on relevance are notorious in IS (Saracevic, 2008; Turpin et al., 2015). The multiple independent judges in Rees and Schultz (1967: v1, 117–118), for instance, gave some documents every possible grade on an 11-point scale. Whatever judges think of their decisions, or whatever researchers think of them, no absolute gold standard exists. There are only approximations, some better than others (Carterette & Soboroff, 2010; Huang & Soergel, 2013; Scholer et al., 2011). Naturally, better evaluations can be sought by having documents graded by more than one judge and looking for agreements. When this is done, the percentages of over-all agreement may seem impressively high. However, a major qualification emerges from independent observers (Saracevic, 1971: 138; Kekäläinen, 2005: 1021; Harman, 2005: 44; Al-Maskari, Sanderson, & Clough, 2008: 683; Bailey et al., 2008: 671–672). These authors all reveal in various ways that judges are likelier to agree on the nonrelevant documents than on the relevant ones. This is also indirect evidence that judgments of "not relevant" cost less cognitive effort to make.

5. Vagueness

When researchers utter relevance scales to judges, they intend them to be optimally relevant. In practice, however, they often create brief, vague instruments that require judges to do most of the interpretive work. This reflects researchers' abilities as well as preferences; it is hard to write a scale with sharp criteria that is also reasonably short.


Fig. 2. Distribution of judgments on a scale of cognitive effects in two studies.

RT holds that utterances are merely clues to speakers' intended meanings. Adequate clues restrict possible meanings to those likely to meet a hearer's standard of relevance in context. In ordinary talk, this standard is frequently met. Utterances are vague, however, when their possible meanings are under-restricted. According to Clark (2013: 27), vagueness occurs (1) "because the concepts being communicated are open to several interpretations and the speaker does not want to be committed to any particular one of them," or (2) because the speaker's "own understanding is vague." Either or both conditions may characterize researchers in IR.

As to (1)—commitment to a particular interpretation of relevance—many researchers explicitly or implicitly use grounded theory method. That is, rather than imposing their own definitions of relevance on judges, they prefer them to have freedom of response. Clark and Schober (1994: 27) predicts the result, which, in the phrase I have italicized, is reminiscent of RT's path of least effort: "Whenever a surveyer chooses a vague word…respondents can presume that he means something specific by it, namely the interpretation most obvious to them at that moment. These interpretations will often be idiosyncratic just because vague words allow such latitude of interpretation." Thus, improvising criteria over many documents, judges may give documents the same grade for multiple reasons.

As to (2), many IR researchers' own understanding of "relevance" appears vague. Authors of the major IR textbooks, for instance, shy away from analyzing the term in any detail. In Patrick Wilson's view (1978: 17), "The best way to understand how the term is actually used seems to be this: it is simply the chief evaluative term in information retrieval, and means approximately 'retrieval-worthy.'" (Compare RT's "at least worth processing.") Wilson also notes that no fixed test of retrieval-worthiness exists. Or as Spink et al. (1998: 600–601) puts it, "Although relevance has been debated for more than three decades, a clear definition or viable operationalization within the context of IR system evaluation has not emerged."

RT has addressed vagueness on another front. Literary works, such as lyric poems, are likewise open to multiple interpretations not strongly intended by their authors. About such works Deirdre Wilson (2011: 6) writes: "The stronger the communication, the greater the author's responsibility for what is conveyed; the weaker the communication, the more the responsibility falls on the reader." In her non-pejorative sense, relevance scales in IR, especially the vague ones, are weak communications.

6. Results: a vague four-point scale

The ordinal scale at the bottom of Fig. 2 consists of four vague labels with no definitions or instructions. It was used in Greisdorf (2000, 2003) and a replication (Greisdorf & Spink, 2001) at the University of North Texas. In the first study, conducted in academic year 1997–98, 32 end-users used it to judge 1432 documents from 54 Dialog searches on topics of their own choice. In the second, comprising three sub-studies from 1998–99, 36 end-users carried out 57 Dialog or Inquirus searches on their own topics and used it to judge a total of 1295 documents.

Any relevance scale implicitly measures the ordinal cognitive effects of documents on users—the degrees of relevance postulated in both RT and IS. The effects scale added in boldface below the verbal labels in Fig. 2 makes that explicit: 0 to 3, or none to high.
When users respond, they are inferring the upper term in their own effects/effort ratios. But what of the lower term, the effort of making judgments? Like other researchers, Greisdorf and Spink presumably thought they were simply testing document effects. Yet an effort scale is implicit in the labels they created, because pre-worded verdicts require some degree of effort to process, and the degree will vary with their linguistic and logical complexity (Wilson, 2012: 4; Allott, 2013: 66).


Table 1. Mapping an effort scale onto the Greisdorf & Spink effects scale, with relevance ratios for documents.

          Not relevant   Partially not relevant   Partially relevant   Relevant
Effects   0              1                        2                    3
Effort    1              4                        3                    2
Ratio     0              0.25                     0.67                 1.5

The following effort scale thus reorders Greisdorf and Spink's labels. Also, it starts at 1 rather than 0, since it takes a bit of processing even to reject a document:

1 Not relevant. The easiest judgment to make is quick dismissal. The document is simply perceived as wrong; it fails to match the query in sense or is otherwise unsatisfactory. It thus has negligible cognitive effects and can be rejected with little effort. "It takes less time to judge a nonrelevant document than to judge a document with any degree of relevance," says Carterette and Soboroff (2010: 540)—a conclusion the authors reached by mining assessor interaction logs in the TREC 2009 Million Query Track.

2 Relevant. The second-easiest judgment says the document does match the sense of the query. The high cognitive effects result from a perceived topical match and/or other good features. However, in contrast to the simple judgment of "not relevant," somewhat greater processing effort is needed to infer what these effects are. They might differ for each document examined, since the researchers place no limits on interpretation.

3 Partially relevant. This judgment is the second hardest to make because contrary inferences are involved. The document is relevant to the query in some ways but not in others—not good or bad, but a mixture of good-bad. Greisdorf and Spink (2001: 847) states: "Partially relevant represented a judgment that confirmed some relation by inference existed, but the relationship may be weaker than a relevant relation at the time the judgment was made…"

4 Partially not relevant. This is the hardest judgment because it requires the most complicated inferences. In this case, the judge must decide not only that the document is somehow deficient, but also that its bad aspects outweigh the good ones. Moreover, from psycholinguistics we know that wordings are harder to process when they involve negatives, especially more than one (partially not relevant = not all not relevant). For corroboration, look no further than this gloss by Greisdorf and Spink (2001: 847): "…partially not relevant represented that some non-relation existed, but the inference may not be strong enough to totally reject the relation as not relevant at the time the judgment was made." The judges were not shown this definition, but it indicates that "partial non-relevance" is a difficult notion even for its devisers.

Assume, then, that the labels and values from the effort scale are mapped back onto their equivalents in the Fig. 2 effects scale, as if relevance = effects/effort ratios are being formed. The result appears in Table 1. The ratios are not precise measures of mental activity, of course, but they do preserve the order of apparent document worth. Relevance varies directly with the effects scale and with these ratios. It is the effort scale, however, that predicts the frequency distributions of judgments in the two studies. The scale labels can be ranked 1-4-3-2 by imputed effort, and the judgment frequencies can be ranked 4-1-2-3 by their relative elevation. A standard measure of correlation between two ranked variables is Spearman's rho. In Fig. 3 and elsewhere in this paper, a rho of −1 indicates a perfect inverse relationship: as the effort of interpreting scale points goes up, the frequencies with which they are chosen go down.
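To make the arithmetic behind Table 1 and Fig. 3 concrete, the following Python sketch (my illustration, not code from the original studies) computes the effects/effort ratios for the Greisdorf and Spink labels and Spearman's rho between the imputed effort ranks and the judgment-frequency ranks quoted above; the rank vectors are the ones given in the text, and the helper uses the standard no-ties formula.

```python
# Minimal sketch of the Table 1 / Fig. 3 arithmetic; the scale values and rank
# vectors below are taken from the text, not new data.

def spearman_rho(rank_x, rank_y):
    """Spearman's rho for two untied rankings: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

labels  = ["Not relevant", "Partially not relevant", "Partially relevant", "Relevant"]
effects = [0, 1, 2, 3]   # ordinal cognitive-effects scale of Fig. 2
effort  = [1, 4, 3, 2]   # imputed processing-effort scale of Section 6

# Effects/effort ratios as in Table 1: 0, 0.25, 0.67, 1.5
for label, c, e in zip(labels, effects, effort):
    print(f"{label:24s} effects={c}  effort={e}  ratio={c / e:.2f}")

# Judgment frequencies ranked 4-1-2-3 "by their relative elevation" (see text);
# effort ranks are 1-4-3-2. Their Spearman's rho is -1: more effort, fewer choices.
frequency_rank = [4, 1, 2, 3]
print("Spearman's rho:", spearman_rho(effort, frequency_rank))   # -1.0
```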
The incorporation of processing effort into what the researchers presumably saw as a straightforward scale of progressively larger cognitive effects turns out to explain the V-shaped distributions on that scale.

Fig. 3. Assuming the boxed effort scale, the elevations of the frequencies vary inversely with it; Spearman's rho = −1.

7. Results: unlabeled scales

The analysis of Greisdorf and Spink's four-point scale has suggested that it is the wording of midpoint labels that repels judges. However, some scales are simply lines between labeled poles. Such lines are intentional communications from researchers but even vaguer. Respondents must nevertheless infer where to mark them, which means dealing with uncertainty. Fig. 4 displays their solutions.

Fig. 4. Top distributions from Rees and Schultz (1967) and Janes (1993), both reproduced by permission; bottom distributions plotted from Greisdorf and Spink (2001).

At top left is Janes's (1993) display of data from Rees and Schultz (1967: v1, 116). There, in the context of a hypothetical research project, 184 judges from medical institutions were each asked to read 16 documents on diabetes and grade them on an 11-point scale. In the legend, "R" stands for the relevance of a document, and "U" stands for its utility; "1" and "3" stand for the project's first and third stages. Relevance and utility proved to be indistinguishable concepts, but the stage of research made a difference. In the third stage, many more documents were deemed irrelevant or useless, and fewer were deemed extremely relevant or useful, than in the first. Rees and Schultz expected their judges to prefer the midpoints of the scale to the endpoints (v1, 117–118). Instead, the judges most often preferred to reject documents by giving them scores of 1 or 2. At the other extreme, we see small increases in documents with scores of 11, especially in the first, more tentative stage of the research. Although the judges did put documents at the most nuanced point a scale allows, they tended to prefer the relatively clear-cut endpoints.

At top right is one of several bimodal distributions that Janes (1993: 111) reproduced from his own studies. In this one, respondents were asked to mark the relevance of documents to queries on a 100 mm line; the farther right the mark, the
higher the relevance (otherwise undefined). These "secondary judges" were non-experts who were re-evaluating documents originally judged by experts in psychology and education (Janes & McKinney, 1992). The result strikingly shows them preferring endpoints to midpoints.

Greisdorf and Spink likewise supplemented their verbal scale with plain lines in two sub-studies (p. 850), the results of which I have graphed. At bottom left are the judgment frequencies, symmetrically binned by them, on a 100 mm line. The plot resembles the one above it from Rees & Schultz. On their 77 mm line at bottom right, the binned frequencies plot rather like the frequencies on the eight-point scale from Cuadra & Katter in Fig. 1. The RT notion of differences in cognitive effects and processing effort provides a common thread of explanation.

8. Results: vague three-point scales

More than 34,000 documents were graded in the eight chronologically ordered studies in Table 2. Percentaged to facilitate comparisons, every distribution in the table but one is at least slightly V-shaped. The studies are united by a vague scale introduced in Saracevic (1969, 1971), which in a sense has been inadvertently tested in 17 further experiments. Saracevic himself used it in his later research, and the other authors in Table 2 use his labels or close variants, if not his definitions. The scale is:

Relevant—Any document which on the basis of the information it conveys is considered to be related to your question, even if the information is outdated or already familiar to you.

Partially Relevant—Any document which on the basis of the information it conveys is considered only somewhat or in some part related to your question or to any part of your question.

Nonrelevant—Any document which on the basis of the information it conveys is not at all related to your question.

Here, the vague phrase "related to your question" adds little to what "relevant" itself conveys. The relatedness would presumably often be interpreted as topical, but that possibility is not spelled out, and so other interpretations are implicitly licensed.

Table 2. 18 distributions on the Saracevic vague scale in row percentages.

Study                         0      1      2      N
Saracevic, 1971               85.2   5.3    9.5    2626
Saracevic & Kantor 1988–1     41.0   28.3   30.7   8956
Saracevic et al., 1991        42.7   28.2   29.1   6225
Spink et al., 1998 Study A    40.9   28.6   30.5   609
Spink et al., 1998 Study B    41.6   29.0   29.4   3120
Spink et al., 1998 Study C    57.0   13.5   29.5   474
Spink et al., 1998 Study D    49.4   13.8   36.8   269
Pao, 1993 pilot               39.7   22.5   37.9   1384
Pao, 1993 field               47.5   23.4   29.2   5558
Borlund, 2000 feasibility–1   51.1   18.5   30.4   135
Borlund, 2000 feasibility–2   44.4   22.2   33.3   45
Borlund, 2000 feasibility–3   60.6   17.2   22.2   279
Borlund, 2000 feasibility–4   48.8   22.4   28.9   246
Borlund, 2000 feasibility–5   40.7   25.2   34.1   246
Borlund, 2000 feasibility–6   51.1   24.4   24.4   90
Christoffersen, 2004          52.6   23.3   24.1   3526
Gehanno et al., 2007 meta     40.1   14.2   45.8   212
Gehanno et al., 2007 MeSH     43.4   16.9   39.8   83
Totals                        47.5   24.4   28.1   34,083

In Table 2, the labels "nonrelevant," "partially relevant," and "relevant" are coded as 0-1-2; they continue to stand for degrees of cognitive effects. Also presumed is an implicit effort scale of 1-3-2, based as with Greisdorf and Spink on scale-point interpretability. The endpoint 1, "nonrelevant," is taken as the easiest to process, and the other endpoint 2, "relevant," as the second easiest. The midpoint 3, "partially relevant," is the most demanding, because judges must decide whether part of a document matches all of query, or all of a document matches part of query, or part of a document matches part of query. Judges not given this definition must still work out a mixture of good-bad characteristics. Mapped onto the 0-1-2 effects scale, the 1-3-2 effort scale yields effects/effort ratios of 0, .33, 1, which again order documents by approximate worth, as in Table 1. Borlund (2000: 148) intuitively chose analogous values of 0, .5, 1.

Even when the differences in counts at "partially relevant" and "relevant" are very small, the distributions in Table 2 corroborate Janes (1993) with new examples and strengthen the claim that judgment frequencies in retrieval trials do not depend solely on query-document fit. Moreover, added together, the Table 2 studies produce the V-shaped upper plot of Fig. 5.
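As a worked check (my own sketch, not part of the original studies), the same arithmetic applies to the three-point Saracevic-style scale: the 1-3-2 effort scale mapped onto the 0-1-2 effects coding gives ratios of 0, .33, and 1, and correlating the effort scale with the Table 2 totals row yields a Spearman's rho of −1.

```python
# Minimal sketch for the three-point vague scale; the percentages are the
# Table 2 totals row, and 1-3-2 is the effort scale imputed in Section 8.

def ranks(values):
    """Rank values in ascending order (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

labels  = ["Nonrelevant", "Partially relevant", "Relevant"]
effects = [0, 1, 2]              # coding used in Table 2
effort  = [1, 3, 2]              # imputed effort scale
totals  = [47.5, 24.4, 28.1]     # Table 2 totals row, in percent

for label, c, e in zip(labels, effects, effort):
    print(f"{label:20s} effects/effort = {c / e:.2f}")   # 0.00, 0.33, 1.00

# More effort at a point, fewer documents assigned to it:
print("Spearman's rho:", spearman_rho(effort, ranks(totals)))   # -1.0
```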


Fig. 5. Total counts from Table 2 and alternative total counts. Assuming an effort scale of 1-3-2, Spearman’s rho for both = −1.

8.1. The eight studies

Saracevic (1971), which subsumes his 1969 paper, shows the distribution of grades for the 2626 full-text documents in his pioneering Case Western study of retrieval in tropical medicine. These documents were responses to 124 queries and had multiple expert judges. Subsets of this distribution appear in 40 tables in which counts on his relevance scale are the dependent variable. In every one of the 40, the count at "partially relevant" is lowest.

Saracevic and Kantor (1988) has outcomes from a Rutgers project in which 40 end-users from various disciplines supplied one topic apiece and judged the retrievals. Nine intermediaries chose an appropriate Dialog database for each of the 40 topics, and then all nine searched it in parallel. This produced many overlapping retrievals. The 360 searches in 22 databases yielded a total of 8956 documents before duplicates were removed. The frequencies in Saracevic & Kantor 1988–1 include these duplicates. Saracevic & Kantor 1988–2, in which duplicates were removed, is discussed in the next subsection.

Saracevic, Mokros, Su, and Spink (1991) reports on a 1989–90 project at Rutgers that used four librarians as searchers and focused on their interactions with 40 end-users, who were graduate students and faculty. In this case, retrievals across multiple Dialog databases did not overlap.

Pao (1993) evaluated term retrieval vs. citation retrieval in parallel searches on medical topics. In her pilot study, an information professional chose 33 topics in nutrition and found documents on them in a custom database. Two subject experts did the scoring. The field study compared retrievals from MEDLINE, which uses MeSH terms, and Scisearch, which uses citation indexing. Users supplied librarians with 89 real topics and judged the results themselves.

Studies A through D in Spink et al. (1998) confirmed the value of adding "partially relevant" to the traditional binary scale. Studies A, C, and D had non-overlapping sets of 13, 13, and 11 end-users at the University of North Texas. Studies A and C involved end-users doing initial Dialog searches. In Saracevic et al. (1991), 18 of the 40 end-users had been doing initial searches, and data from these 18 were reused for comparison in Study B.

The six distributions in Borlund (2000: 230–231) come from trials that tested the feasibility of simulating situational relevance in retrieval experiments. Her judges were three library science students and two information professionals. The top category in her Saracevic-style scale was "highly relevant."

Christoffersen (2004) tried a search strategy for differentiating a core of relevant documents from a periphery of only partially relevant ones in MEDLINE, EMBASE, and Scisearch. He called the intermediate category "perhaps relevant," adding (p. 390) that it "may cloud the picture" [cf. Saracevic's "fuzzy gray"; Spink et al.'s "fuzzy in nature"]. The queries came from various sources, and 10 medical experts judged the retrievals for topical relevance.

Gehanno, Thirion, and Darmoni (2007) compared retrievals with two types of indexing: standard MeSH terms and broad metaterms from a catalog/index of French-language health resources on the Internet. Queries on 16 medical topics supplied by librarians were formed with terms from each index. A physician judged the results.

8.2. Discordant outcomes

Table 3 has two distributions in which the prediction based on levels of effort is not borne out.

Table 3. Two discordant distributions on the Saracevic scale in row percentages.

Study                        0      1      2      N
Maglaughlin & Sonnenwald     16.1   36.9   47.0   236
Saracevic & Kantor 1988–2    48.4   26.8   24.8   5411
The distribution from Maglaughlin and Sonnenwald (2002: 337) has extraordinarily few "not relevant" documents. The judges were 12 students doing graduate-level research, for whom one of the authors performed searches in multiple Dialog databases. The students read surrogates of the 20 most recent documents on their topics and were asked to mark words justifying their verdicts. The Saracevic labels they saw were undefined. Their pre-announced reward was a full-text photocopy of every document they thought relevant to their research. This may have inclined them to judge generously. Alternatively, Sonnenwald observes (personal communication) that the extensive analytical effort asked of students may have caused them to rate documents more favorably. In RT "greater effort imposed" can indicate that "greater effect is intended"—effect that addressees themselves then infer creatively (S&W, 1985: 167).

In Saracevic & Kantor 1988–2, the duplicates found in Saracevic & Kantor 1988–1 were removed, and the documents judged "relevant" become fewest. The causes of the drop are not obvious, but judges plainly vary in their reactions to the 40 individual topics in the study (pp. 195–196). Indeed, their judgment distributions are decidedly mixed; 10 are V-shaped, 19 are right-skewed, and 11 take other shapes. Yet as the lower plot in Fig. 5 shows, the overall distribution of Table 2 remains V-shaped even if the Saracevic & Kantor 1988–2 totals replace those of Saracevic & Kantor, 1988–1 and the Maglaughlin & Sonnenwald totals are added to it. Globally, judges appear to conserve effort.

8.3. Cultural heritage in CLEF (CHiC)

Percentaged distributions from experiments described in Petras, Bogers, Ferro, and Masiero (2013) appear in Table 4. The experiments were conducted for the 2013 Cross Language Evaluation Forum (CLEF), and they involved retrieval of mixed-media documents on aspects of European cultural heritage in 13 languages. Some 50 queries were translated into the various languages and sent to participating IR teams. Over a two-week period, 15 assessors, all native speakers of the languages (except English), judged 146,760 documents. The second Polish-language study, a separate project that dealt with complexities of retrievals in Polish and involved two assessors, has been added from Petras, Bogers, Toms, et al. (2013).

Table 4. CHiC distributions on a Saracevic-style vague scale in row percentages.

Study       0      1      2      N
Dutch       77.3   7.7    15.0   10,548
English     84.4   0.4    15.2   16,696
Finnish     88.0   0.8    11.2   2465
French      83.6   2.4    14.0   17,978
German      80.7   0.3    19.0   18,460
Greek       95.9   1.4    2.6    10,032
Hungarian   85.9   8.4    5.7    5834
Italian     78.4   5.4    16.3   13,387
Norwegian   80.4   2.8    16.7   10,287
Polish–1    84.9   5.5    9.6    11,342
Polish–2    58.7   14.8   26.5   32,144
Slovenian   89.9   2.9    7.2    6718
Spanish     81.2   3.9    14.9   11,373
Swedish     89.0   2.9    8.1    11,640
Totals      79.3   5.3    15.4   178,904

The teams were instructed by internal guidelines (Petras, personal communication) that "a record is relevant when it fulfills the information need represented by the original query (in title) and by the suggested information need description (in description)." They were given a vague scale defined with tautologous terms similar to Saracevic's:

Not relevant—the record does not fulfill the information need; the information is not relevant.

Partially relevant—the record partially fulfills the information need, but there are some doubts as to whether the whole information need is covered.

Highly relevant—the record as represented in the DIRECT system fulfills the information need and is highly relevant.

To assign a grade of "partially relevant," an assessor must decide whether "some doubts" exist that "the whole information need is covered." Presumably these cloudy phrases and good-bad equivocality increase the cognitive demands of the midpoint. Confirming the prediction from a 1-3-2 scale of effort, the distributions of judgments in Table 4 are again all V-shaped except the right-skewed one for Hungarian documents. The total frequencies from Table 4 are plotted in Fig. 6.


Fig. 6. Total counts from Table 4, CHiC trials. Assuming an effort scale of 1-3-2, Spearman's rho = −1.

Table 5. Distributions on a TREC vague scale in row percentages.

Track          0      1     2      N
Medical 2011   80.1   1.6   18.3   8865
Medical 2012   82.9   1.6   15.5   24,219
Totals         82.2   1.6   16.2   33,084

Fig. 7. Total counts from Table 5, TREC Medical Tracks. Assuming an effort scale of 1-3-2, Spearman's rho = −1.

8.4. TREC medical tracks

In the TREC Medical Records Tracks (Voorhees & Hersh, 2012) patients were mapped onto studies. That is, the topics were descriptions of clinical studies, the documents were multiple medical records per patient, and the judgments, made by experts, were whether the records indicated the patients were candidates for the studies. The glosses on scale labels were: 0, "not relevant," definitely not a candidate; 1, "partially relevant," possibly a candidate but not enough information; and 2, "relevant," definitely a candidate. Some 34 topics were judged in 2011, and 47 in 2012. Both produce V-shaped patterns in Table 5. Fig. 7 displays the combined frequency counts for the two studies.

9. Results: strict scales

As noted earlier, judges read documents as inputs to contexts of assumptions set by search topics and scales, and then infer as new conclusions the scale points to assign. Strict scales and vague scales differ in which points cost more effort. In
In both cases, however, even the most effortful grading decisions are likely to be quick, not laborious. To repeat, they reflect “small differences in very rapid thought processes.”

Vague-scale researchers do not tell judges specifically what to look for in documents. Judges must improvise interpretations of what scale points mean so as to fit new documents to them. Because the points leave so much to inference, the judges’ interpretations are not necessarily fixed; they can shift to accommodate documents in various ways. Judges also know that, in typical experiments, no one is going to ask them to justify their grades. Thus, they can adequately perform their overall task and still conserve effort by assigning documents more frequently to scale points that seem easier—less complicated—to interpret. The preceding sections showed that those tend to be endpoints.

By contrast, strict-scale researchers want judges to be more critical of retrieval quality, especially at the high end, and they tell them more specifically what to look for in documents. They also sometimes instruct judges to infer topical relevance only. Judges may well like these clearer criteria. Nevertheless, the effort of interpreting definitions does not disappear. With strict scales, the points as defined may be less vague, but they also require judges to hold documents to progressively higher standards, each costing more effort to apply. Again, no one is going to ask judges to justify their grades, but the very fact that criteria are more explicit heightens justifiability as a concern. This alters judges’ priorities: the higher the grade, the more they tend to withhold it. Accordingly, the distributions of judgments are no longer U- or V-shaped; from modal values at left they simply fall.

9.1. A strict four-point scale

The four studies in Table 6 demonstrate the falloff. They use a scale created in Sormunen (2000: 63) to evaluate a test collection of newspaper stories in Finnish. The scale was revised somewhat in Sormunen (2002: 325) to regrade English-language newspaper stories that assessors in TREC-7 and TREC-8 had graded on a binary scale. Kekäläinen (2005) kept Sormunen’s version from 2002 and also his dataset, merely adding three new topics to the original 38. Lykke et al. (2010) used the 2002 version to build a new 65-topic testbed of documents in physics. The 2002 wording is:

0 Not relevant. The document does not contain any information about the topic.
1 Marginally relevant. The document only points to the topic. It does not contain more or other information than the topic description. Typical extent: one sentence or fact.
2 Relevant. [Lykke et al. and Kekäläinen use Fairly relevant.] The document contains more information than the topic description but the presentation is not exhaustive. In case of a multi-faceted topic, only some of the sub-themes or viewpoints are covered. Typical extent: one text paragraph, 2–3 sentences or facts.
3 Highly relevant. The document discusses the themes of the topic exhaustively. In case of a multi-faceted topic, all or most sub-themes or viewpoints are covered. Typical extent: several text paragraphs, at least 4 sentences or facts.

The percentaged distributions in Table 6 are predicted from this scale. All are right-skewed. Damessie, Scholer, and Culpepper (2016) reports a small experiment in which “not relevant” documents on the Sormunen scale required significantly less processing time than documents at any of the higher levels.
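To make the “typical extent” clauses concrete, here is a deliberately crude sketch (added for illustration; it is not how any of these studies operationalized the scale) that grades a document purely by the number of on-topic sentences or facts, as the 2002 wording suggests. The informativeness criteria in the same definitions, discussed next, are ignored.

```python
# A crude sketch of only the "typical extent" half of the 2002 wording:
# grading by the number of on-topic sentences (or facts) a judge thinks a
# document contains. The informativeness criteria are not modeled.
def sormunen_extent_grade(on_topic_sentences: int) -> int:
    if on_topic_sentences == 0:
        return 0   # not relevant: no information about the topic
    if on_topic_sentences == 1:
        return 1   # marginally relevant: one sentence or fact
    if on_topic_sentences <= 3:
        return 2   # (fairly) relevant: 2-3 sentences or facts
    return 3       # highly relevant: at least 4 sentences or facts

print([sormunen_extent_grade(n) for n in (0, 1, 3, 6)])  # [0, 1, 2, 3]
```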
Since Sormunen’s scale is designed for judging brief news stories in full, its distinctions are in terms of prose snippets (e.g., one sentence vs. 2–3 sentences). Yet talk of themes, subthemes, viewpoints, and whether a presentation is exhaustive seems more appropriate for longer works. Sormunen (2002: 325) even says that documents at the highest level “are expected to help the user to take a good command of the topic”—a tall order for a news story. Still, he is plainly trying to make judgment criteria more explicit. His typical-extent instructions tell judges to infer approximately how much of each document, by sentence count, fits the search topic. Effort then increases with the number of sentences that can be said to fit. However, since meaning tends to be holistic rather than atomistic, deciding how many sentences or “facts” to count may be tricky. Thus, Sormunen also defines the scale points more abstractly, in terms of progressively greater informativeness. Determining fit with these criteria has a cost as well. In fact, the Sormunen scale is actually two scales, in which the typical-extent criteria and the informativeness criteria must be considered simultaneously, and the two do not necessarily coincide. Trying to reconcile these (perhaps conflicting) demands would also increase interpretive effort.

In Fig. 8 the Sormunen scale as coded stands again for cognitive effects. The implicit effort scale below it now runs simply 1-2-3-4, since cognitive effort for judges increases with each label.


Fig. 8. Total counts on Sormunen scale from Table 6. The boxed effort scale is inversely related to the elevations of the frequencies; Spearman’s rho = −1.

Table 7
Relevance-theoretic comparisons of Sormunen strict scale with Greisdorf & Spink vague scale.

Sormunen           Not relevant  Marginally relevant     Fairly relevant     Highly relevant
Effects            0             1                       2                   3
Effort             1             2                       3                   4
Ratio              0             0.5                     0.67                0.75

Greisdorf & Spink  Not relevant  Partially not relevant  Partially relevant  Relevant
Effects            0             1                       2                   3
Effort             1             4                       3                   2
Ratio              0             0.25                    0.67                1.5

A “marginally relevant” document mentions the topic. A “fairly relevant” one must extend the topic with additional information. A “highly relevant” one must feature the topic at length and in detail. These criteria would tend to reduce grade frequencies at each point, and that is what we see in Fig. 8, which plots the right-skewed total frequencies from Table 6. Effort is again inversely related to the ranked elevations of the frequencies, 4-3-2-1.

The Sormunen effects/effort ratios in Table 7 once more order documents by presumed worth. Yet the diminishing differences between each ratio and the one preceding (.5, .17, .08) suggest effort growing relative to effects, indicating greater demands as the scale moves rightward. The table also prompts a comparison of Sormunen’s language with Greisdorf and Spink’s. Sormunen’s labels come with instructions; G&S’s are undefined. In both scales, cognitive effects have the same codes, and the ratios increase with them. However, the effort scales differ, as do the ratios. Sormunen’s “marginally relevant” (.5) is accompanied by one sharp criterion and thus costs less effort than G&S’s complicated phrase “partially not relevant” (.25). On the other hand, the complex explicit criteria behind Sormunen’s “highly relevant” (.75) make it harder to assign than G&S’s free-floating “relevant” (1.5). In between, Sormunen’s “fairly relevant” and G&S’s “partially relevant” yield the same ratio (.67). Both labels convey that a document, while retainable, does not fully satisfy the query, but the effort scale is too crude to distinguish their nuances.
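The ratios and their diminishing first differences can be reproduced directly from the Table 7 codings; the short sketch below (added for illustration) simply restates that arithmetic.

```python
# Restating the Table 7 arithmetic for the Sormunen scale: effects/effort
# ratios and the diminishing differences cited in the text (.5, .17, .08).
effects = [0, 1, 2, 3]   # coded cognitive effects
effort = [1, 2, 3, 4]    # assumed effort scale for the strict labels

ratios = [x / y for x, y in zip(effects, effort)]
diffs = [round(b - a, 2) for a, b in zip(ratios, ratios[1:])]

print([round(r, 2) for r in ratios])  # [0.0, 0.5, 0.67, 0.75]
print(diffs)                          # [0.5, 0.17, 0.08]
```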

9.2. Strict three-point scales

TREC’s ternary scale refines its “too liberal” binary scale. The precision-oriented refinement splits the upper point in two. As with the binary scale, assessors use the ternary scale along with TREC topic statements to judge each retrieval. Voorhees (2001: 75) glosses the three points thus: “Assume that you have the information need stated in the topic and that you are at home searching the web for relevant material. If the document contains information that you would find helpful in meeting your information need, mark it relevant. If the document directly addresses the core issue of the topic, mark it highly relevant. Otherwise, mark it not relevant.” Her standard for “highly relevant” may seem to go without saying, but it is not said in vague-scale definitions.


Table 8
Relevance-theoretic comparisons of the Voorhees strict scale with vague scales from Saracevic and Petras.

Scale      Label               Definitions (slightly edited)                                         Effects  Effort  Ratio
Voorhees   Not relevant        Does not address core issue of topic; not even helpful. [imputed]     0        1       0
Saracevic  Not relevant        Not related to your question.                                         0        1       0
Petras     Not relevant        Does not fulfill the information need.                                0        1       0
Voorhees   Relevant            Helpful in meeting your information need.                             1        2       0.5
Saracevic  Partially relevant  In some part related to your question or any part of your question.  1        3       0.33
Petras     Partially relevant  Some doubts whether the whole information need is covered.            1        3       0.33
Voorhees   Highly relevant     Directly addresses the core issue of the topic.                       2        3       0.67
Saracevic  Relevant            Related to your question.                                             2        2       1
Petras     Highly relevant     Fulfills the information need.                                        2        2       1

Table 9
Distributions on 3-point strict scales from unrelated studies. Assuming an effort scale of 1-2-3, Spearman’s rho = −1 for the first four studies but not the last.

Study                    0     1     2     N
HARD track 2005          82.6  10.0  7.4   37,798
Web track 2009           70.9  20.5  8.6   23,601
MQ track 2008            81.0  13.0  6.0   15,211
Borlund, 2000 main       88.0  6.9   5.1   2179
Al-Maskari et al., 2008  48.5  23.6  27.9  4395
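The two shapes invoked throughout this paper can be told apart mechanically from rows like these. The sketch below (an illustration; the cutoffs are my own simplification, with strictly falling percentages counted as right-skewed and a dip followed by a rise as V-shaped) classifies each Table 9 row.

```python
# A mechanical reading of the Table 9 rows: strictly falling percentages are
# called right-skewed; a dip at the midpoint followed by a rise, V-shaped.
rows = {
    "HARD track 2005":         [82.6, 10.0, 7.4],
    "Web track 2009":          [70.9, 20.5, 8.6],
    "MQ track 2008":           [81.0, 13.0, 6.0],
    "Borlund, 2000 main":      [88.0, 6.9, 5.1],
    "Al-Maskari et al., 2008": [48.5, 23.6, 27.9],
}

def shape(p):
    if p[0] > p[1] > p[2]:
        return "right-skewed"
    if p[0] > p[1] < p[2]:
        return "V-shaped"
    return "other"

for study, pcts in rows.items():
    print(study, shape(pcts))
# Only the Al-Maskari et al. row comes out V-shaped, as the caption implies.
```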

To suggest again how effort varies with language, the Voorhees strict scale can be compared to the vague scales of Saracevic and Petras. Table 8 shows that the three scales have the same 0-1-2 coding for effects, but differ in coding for effort; Voorhees is 1-2-3; Saracevic and Petras are 1-3-2. Hence, their ratios differ as well. In the context of Voorhees’s high grade, her open-ended midpoint—“helpful in meeting your information need” (.5)—presumably means “is somehow related to your topic.” The Saracevic and Petras ratios are lower (.33) because their good-bad midpoint as defined requires slightly more pondering. However, the Saracevic and Petras ratios for their high grade (1) are higher than the Voorhees ratio (.67) because their vague definitions give assessors considerable leeway of interpretation. By contrast, the Voorhees standard for the high grade—“directly addresses the core issue of the topic”—imposes a more rigorous test on assessors.

The TREC ternary scale was used in the Villa and Halvey experiment (Section 2.2), and the top grade there cost less effort than the midpoint to assign. This makes sense: a sharp criterion like “addresses the core issue of the topic” is easier for judges to interpret. Why, then, is “addresses the core issue” considered more effortful here? Because TREC experiments are different. Villa and Halvey’s participants each regraded only nine documents. TREC assessors usually judge far larger sets per topic. Typically most of these documents will be screened out as irrelevant, but the remaining ones are still frequently quite numerous. In this context, the top grade, “addresses the core issue,” has presumably been made clearer than “helpful” so that assessors will bear in mind not to over-award it—an extra inferential burden that Villa and Halvey’s participants are unlikely to have felt. If assessors must continuously satisfy themselves that the top grade is warranted, they will tend to exclude as many documents as possible. Thus Ruthven’s statement (2014: 1105–1106), based on data from TREC’s 2003 HARD Track, applies: “…where assessors have stricter relevance criteria then we should expect fewer documents to match these criteria and also to see fewer documents that do not match these criteria to be judged relevant.”

Expectations about strict ternary scales were tested with a miscellany of studies: TREC’s 2005 HARD Track, its 2009 Web Track, its 2008 Million Query Track, Borlund (2000), and Al-Maskari et al. (2008). In the 2005 HARD Track, 50 topics that had proved difficult in earlier TREC competitions were searched in a new corpus, AQUAINT (Allan, 2006). Six assessors (not the originators of the topics) used TREC’s strict ternary scale to regrade the results. Table 9 confirms that the summed document frequencies for all 50 HARD topics form another right-skewed distribution.

TREC’s Web Track for 2009 had a similar structure: the ad hoc task involved 50 topical queries, and assessors judged Web pages against them on scale values of “not relevant,” “relevant,” and “highly relevant” (Clarke, Craswell, & Soboroff, 2009: 5). Table 9 gives the right-skewed outcome.

TREC’s Million Query Tracks of 2007–2010 involved putting hundreds of queries to very large Web collections, so as to test the efficacy of “many shallow” vs. “fewer thorough” relevance judgments. In the 2008 MQ Track, retrievals on 784 queries were judged, with right-skewed results as in Table 9 (Allan, Aslam, Carterette, Pavlu, & Kanoulas, 2008: 9).
This track augmented TREC’s three-point scale with a category labeled “not relevant but reasonable” (also labeled “related”) to cover pages that were relevant to an unintended interpretation of the query. Table 9 merges the 5% of retrievals in this category with those called “not relevant,” since that is what a real user would call them. Carterette and Soboroff (2010: 540) presents a bar graph of six assessors’ judgments on the augmented scale, and they largely accord with strict-scale predictions.

Borlund (2000) used a ternary scale roughly comparable to TREC’s in her main experiment. She had 24 students do searches in a database of newspaper articles.

Table 10
Distributions on a TREC 5-point strict scale in row percentages.

Track     −2   0     1     2    3    N
Web 2010  5.6  73.7  15.9  4.3  0.5  25,329
Web 2011  5.3  78.5  10.5  3.7  2.1  19,381

They searched both on needs simulated from TREC topics and on a real information need of their own. The real-need findings are given here. Rather than defining her scale points, she relied on students’ assumptions and inferences about their needs as end-users. Her guidelines to them suggest that, like Voorhees, she takes “relevant” to mean “somehow topic-related,” which is less stringent than “useful.” She writes (italics hers): “The test persons were instructed that their job was to retrieve as many useful documents as it would take to satisfy their information need rather than as many relevant documents as possible. The retrieval should stop when the need was satisfied, or when the test persons felt it was not possible to satisfy the information need from the actual newspaper collection” (Borlund, 2000: 114). The instruction to “stop when the need was satisfied” (or abandoned) resembles the stopping rule in S&W’s relevance-theoretic comprehension procedure given in Section 3.

Table 9 aggregates the relevance judgments of all 24 students on a scale with degrees of low, medium, and high (Borlund, 2000: 249–256), recoded here as 0-1-2. The great majority of articles were rejected, while 6.9% were related to the need but not personally useful to the students. The 5.1% “useful” documents then corresponded to TREC documents that “directly address your topic.” The strict-scale predictions are again supported.

In contrast, the distribution in Al-Maskari et al. (2008: 683) in Table 9 is discordantly V-shaped. The research team had 56 assessors perform new searches on 56 existing topics from TREC-8. Documents originally graded on TREC’s two-point scale were then regraded on a strict three-point scale. The judges were told, à la Sormunen, that a “partially relevant” document “only points to the topic: it does not discuss the themes of the topic thoroughly”; and, à la Voorhees, that a “highly relevant” one “addresses the core issue of the topic.” Some possible reasons for discordant distributions on strict scales are put forth in Section 10, but whether any of them apply in this case is an open question.

9.3. Longer strict scales

During 2010–2014, TREC undertook experiments in which retrievals from an enormous collection of Web pages were assessed with strict scales of five and then six points. The five-point version used in the 2010 and 2011 Web Tracks was defined as follows (Clarke, Craswell, Soboroff, & Voorhees, 2011: 4):

−2 Junk. This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0 Not relevant. The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1 Relevant. The content of this page provides information on the topic; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2 Key. This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
3 Navigational. This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

The pages were obtained by Web crawling, which picked up many worthless items. Hence the “junk” value was created. The “not relevant” pages probably cost slightly more effort to assess, given how they are defined. The implicit scale of effort thus again runs from least to most, with a qualification at the high end.
A “navigational” item is not necessarily even more informative than a “key” item. It is rather the home page of an entity named in the query, of which there will often be only one (Clarke, Craswell, & Voorhees, 2012: 2). For that reason, and because many query entities lack home pages, one would expect the “navigational” items to be the fewest in any distribution.

Table 10 displays the outcomes. The data reflect judgments on the 50 topics of the ad hoc task. Since the −2 retrievals were apparently almost meaningless, they are unsurprisingly far fewer than those at 0, which at least made sense of a wrong kind. From 0 to 3 the counts fall off as predicted. If both grades of nonrelevance are combined, as TREC itself did in some analyses, the falloff is monotonic, as plotted in Fig. 9.

The six-point scale of 2012–2014 had the same definitions, differing only in that the “relevant” midpoint was subdivided (Collins-Thompson, Bennett, Diaz, Clarke, & Voorhees, 2013: 6–7):

−2 Junk.
0 Not relevant.
1 Relevant. The content of this page provides some information on the topic, which may be minimal.
2 Highly relevant. The content of this page provides substantial information on the topic.
3 Key.
4 Navigational.


Fig. 9. Total counts from Table 10, TREC Web Tracks. Assuming an effort scale of 1-2-3-4, Spearman’s rho for both = −1.
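The monotonic falloff behind Fig. 9 can be checked from the Table 10 rows. The sketch below (an added illustration) folds the junk grade into “not relevant,” as TREC did in some analyses, and confirms that each merged row decreases strictly from left to right, which is what a rho of −1 against the 1-2-3-4 effort scale amounts to.

```python
# Folding the "junk" grade (-2) into "not relevant" (0) for the Table 10 rows,
# as TREC did in some analyses, and checking that each merged row then falls
# strictly from left to right.
web_tracks = {
    "Web 2010": [5.6, 73.7, 15.9, 4.3, 0.5],
    "Web 2011": [5.3, 78.5, 10.5, 3.7, 2.1],
}

for track, pcts in web_tracks.items():
    merged = [round(pcts[0] + pcts[1], 1)] + pcts[2:]   # junk folded into 0
    monotone = all(a > b for a, b in zip(merged, merged[1:]))
    print(track, merged, monotone)
# Web 2010 [79.3, 15.9, 4.3, 0.5] True
# Web 2011 [83.8, 10.5, 3.7, 2.1] True
```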

Table 11
Distributions on a TREC 6-point strict scale in row percentages.

Track     −2   0     1     2     3    4    N
Web 2012  5.3  72.7  13.8  2.5   0.3  5.3  16,055
Web 2013  1.6  69.7  21.0  6.4   1.2  0.1  14,474
Web 2014  3.9  56.9  26.2  11.2  1.6  0.2  14,432

Fig. 10. Total counts from Table 11, TREC Web Tracks. Assuming an effort scale of 1-2-3-4-5, Spearman’s rho = −1 for 2013 and 2014 but not 2012.

Each ad hoc Web Track again comprised 50 topics. As Table 11 reveals, the resulting distributions for the 2013 and 2014 tracks exhibit the same falloff seen in Table 10. The one for 2012, however, has almost no “key” pages and a relatively high percentage of “navigational” ones—i.e., home pages. Without seeing the retrievals, one cannot review the judgments, but they are unusual compared to other Web Track results of the period. Fig. 10 has the plots.
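Assuming that Fig. 10, like Fig. 9, merges the two nonrelevance grades before ranking (an assumption on my part, not stated in the track overviews), the caption’s contrast between 2012 and the later years can be recomputed from Table 11 as follows.

```python
# Recomputing the Fig. 10 correlations from Table 11, assuming (as in Fig. 9)
# that the two nonrelevance grades are merged before ranking, so the effort
# scale 1-2-3-4-5 spans the remaining five points.
from scipy.stats import spearmanr

table_11 = {
    "Web 2012": [5.3, 72.7, 13.8, 2.5, 0.3, 5.3],
    "Web 2013": [1.6, 69.7, 21.0, 6.4, 1.2, 0.1],
    "Web 2014": [3.9, 56.9, 26.2, 11.2, 1.6, 0.2],
}
effort = [1, 2, 3, 4, 5]

for track, pcts in table_11.items():
    merged = [pcts[0] + pcts[1]] + pcts[2:]
    rho, _ = spearmanr(effort, merged)
    print(track, round(rho, 2))
# Web 2012 -0.7, Web 2013 -1.0, Web 2014 -1.0: only 2012 breaks the pattern
```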


10. Conclusion: explaining distributions

The distributions in the present article are aggregates of judgments across entire experiments (e.g., Table 2) or multiple experiments combined (e.g., Fig. 2). These high-level distributions consist of numerous lower-level ones on individual query topics. The latter can be broken out from the online TREC datasets, such as HARD 2005. They are also occasionally reported in non-TREC studies, such as Borlund (2000). The following remarks address these two levels of aggregation.

At query level, there is typically one stated topic, one judge, one set of retrieved documents, and one shape for the grade frequencies. But across all the queries in the experiment, these factors become sources of variation. For example, topics vary in difficulty (Smith, 2008) and in verbal cohesion with a given corpus of documents (Carmel & Yom-Tov, 2010). Retrieved sets differ in size, judges are not uniform in their thoughts and beliefs, and distributions of grades take two or more shapes.

At the level of the experiment, however, the grade distributions from the queries again sum to one shape. What remains invariant at this level is the graded relevance scale, to which all judges are exposed. Because the scale spans the entire experiment, it lends itself to explaining the shape of the summed judgment frequencies.

The present paper has compared results from more than 50 experiments and two types of scales. It has shown that the scale types have fairly consistent patterns of association with the distribution shapes. While not perfect, the patterns tend to support a scale-based explanation. That is, the patterns mostly hold across a variety of researchers, and a much greater variety of judges, who brought different levels of expertise, interest, and conscientiousness to their tasks. They mostly hold whether or not these judges also created the search topics, and whether or not librarians helped them search. They mostly hold across multiple databases and subject areas, retrieval sets of various sizes and compositions, queries of unequal difficulty, and documents of different types and lengths. They mostly hold across different experimental settings (some in TREC or CLEF; some not) over decades of time. Usually researchers were not focused on the shapes of the judgment distributions, nor did participants in the experiments intend to produce them; they arose from oblivious behaviors. This suggests an automatic cognitive mechanism at work. Following RT, I have identified it as conservation of effort in interpreting scale points.

There remain cases in which the predictions for vague or strict scales fail. For example, in Saracevic and Kantor (1988), discussed in Section 8.2, the query distributions are of very mixed shapes, and when duplicate documents are removed across the entire experiment, the vague-scale prediction is wrong. In the 2005 HARD Track, a strict scale correctly predicted a right-skewed distribution at the experiment level, but at the query level only 31 of the distributions were right-skewed; 19 were V-shaped. Distribution shapes at query level necessarily affect the overall distribution, and explanations of them are likely to be multivariate. Judges, that is, are motivated by more variables than the effort of processing scale points.

A possibility within the RT framework is that some discordant distributions are caused by judges’ errors. Ideally, judges guaranteeing the optimal relevance of their grades to researchers will choose reasonable—i.e., justifiable—scale points.
But such communication is always conditioned by their abilities and preferences. Some judges may be unable to grade documents in a way that is at least justifiable. Other judges may be unwilling to grade documents justifiably if it turns out to be too much trouble. According to Carterette and Soboroff (2010: 539, 541–542), TREC assessors may misjudge “due to misunderstandings of the task or documents, fatigue, boredom, and for many other reasons.” These authors used large amounts of data from the TREC Million Query Track to model seven types of systematic errors attributable to assessors’ different judgment styles. Three can be considered failures of ability, in which assessors made good-faith attempts to grade documents reasonably but were wrong-headed. That is, an “optimistic” assessor “takes an overly-broad view of the topic and ends up judging things relevant that are not.” A “pessimistic” assessor is described as “taking an overly-narrow view of the topic and judging documents nonrelevant that should be considered relevant.” A “Markovian” assessor is one whose “judgments are conditional on previous judgments”—one who “‘feels bad’ about judging too many nonrelevant documents in a row and thus takes a broader view of the topic over time,” or “who takes a narrower view after judging many relevant documents in a row.” In the remaining four types, assessors apparently preferred not to communicate in good faith. Instead, they assigned grades so as to get through the judgment task as quickly as possible. Carterette and Soboroff label them by their various grading patterns as “unenthusiastic,” “topic-disgruntled,” “lazy-overfitting,” or “fatigued.” If detected elsewhere, grades like these would presumably give inadequate cognitive effects and cost unjustifiable effort to straighten out.

Although judging retrievals is a highly structured activity, it does require fresh inferences with each new query-document pair. That leaves room for judges to exhibit idiosyncrasies even when grading reasonably. Any text in a given retrieval (queries, search terms, documents, instructions, scale points) will cost judges some effort to interpret, and such effort will vary over persons. So will existing assumptions and cognitive effects. For example, assessors in HARD 2005 were asked some background questions, and Ruthven (2014: 1104) reports on the correlation of their answers with their assessment patterns. He states that they graded more documents “partially relevant” on topics where they had “low levels of interest, familiarity and specific knowledge.” However, when they were interested in, familiar with, and specifically knowledgeable about a topic, they made “higher use of the highly relevant category.”

So when discordant V-shaped distributions occur with strict scales on individual topics, subject expertise may be a factor; the judge’s take on the query overrides scale considerations and makes the top grade easier to assign. But it may also be that judges make errors of the “optimist” type precisely because they are not domain experts.


For example, library school students in Janes and McKinney (1992: 164–165) and “generalists” in Kinney, Huffman, and Zhai (2008: 591) tended to grade documents higher than domain experts because they interpreted queries more loosely and rewarded mere term matches—a tendency noted in IR research as long ago as Saracevic (1975: 341). Similar notes might be extended to vague scales, but the intent here is merely to highlight cognitive factors in grading, as do many other papers (e.g., Cole, Gwizdka, Liu, & Belkin, 2011; Liu, Liu, & Belkin, 2013; Liu, Liu, Cole, Belkin, & Zhang, 2012; Scholer et al., 2013). The interaction of scale-point effort with other variables affecting judgment invites further investigation.

11. Discussion: relevance theory and information science

In linguistic pragmatics, relevance theorists analyze effects and effort at the level of sentences from mini-dialogues. This paper has presented evidence that they can also be analyzed in the aggregations of recorded judgments typical of information science. Moreover, because RT includes least-effort behavior in the very definition of relevance, it can fairly well explain certain distributions important to IS (cf. White, 2011). Least-effort behavior has received much attention in IS, but it has not usually been integrated with relevance-seeking behavior. RT provides an integration—one complementing new work in retrieval evaluation such as Jiang and Allan (2016).

Because of the IS research agenda, “relevant” has predominantly meant that a retrieved document is viewed as topically related to a query (Bean & Green, 2001: 116–119). In the branch of RT called lexical pragmatics, this would count as a narrowing of the range of meanings traditionally encoded in “relevant,” such as “bearing on the matter in hand” or “counting as evidence for or against a claim.” The narrowed term means something like “matching or fitting a pre-set topic.” Patrick Wilson (1968: 43–45) argued that “fitting a description” is not at all the same as “relevant,” which he equated with evidentiary value. At present, however, a topical fit is the best system designers can do (abilities and preferences again). Even achieving that much can produce positive cognitive effects—accurate consequential inferences—in users at acceptable levels of effort. “Fitting a description” is one way of “bearing on the matter in hand.” Most users probably want documents to be relevant in at least this sense, just as they want people to stick to the subject in certain conversations. If on-topic documents also answer a user’s question or count as evidence, so much the better.

Nevertheless, the narrowed definition has left IS struggling for decades with misleading dichotomies, such as “relevance vs. utility” or “relevance vs. pertinence.” RT ends this clutter with a generalization from non-demonstrative logic: if inputs strengthen or eliminate existing assumptions, or combine with them to produce new conclusions, they are in some degree relevant; if they do not, they are irrelevant. The definition of “relevant” in RT is content-neutral and fixed. By contrast, inputs and assumptions are content-laden and variable. For example, the 80 factors of relevance in Schamber (1994: 11) are simply different contexts of assumptions evoked in judges by researchers’ scales or questionnaires. Such contexts are also tabulated in Schamber and Bateman (1999: 387–388) and Chu (2011: 269–270); significantly, some affect processing effort. An RT-based approach can handle these contexts in all their variety, while upholding relevance as the central notion in information science by making it more abstract.
It is easy enough to see relevance judgments in terms of cognitive effects on retrieval system users. That, after all, is what motivated Harter (1992) to bring RT into information science. He wanted to dislodge the idea that, for a document to be relevant to a query, it must be on the query’s topic. A searcher’s expressed topic is always part of his or her larger context of interests. With RT, Harter could counter-argue that (1) off-topic documents may produce valuable cognitive effects within that larger context, and (2) on-topic documents may not produce valuable cognitive effects within it. Being on topic is therefore neither necessary nor sufficient for relevance in the RT sense.

RT thus clarifies the concept of nonrelevant documents. The most general formulation of nonrelevance is not “off topic” but “having no effects in context.” A document might well have no effects because it is off topic. But, as many have noted, it could be on topic yet have no effects because it is already known. (Recall that Saracevic told judges to mark such a document relevant “even if the information is…already familiar to you.”) Or the effects of a document might be negligible because it is not “from a trustworthy source” (Janes, 1993: 113). Or, again, if a document seems to be on topic and new and credible, it could be rendered irrelevant by the effort factor: it is in some way problematical to understand (Wilson, 1977: 54).

Unfortunately, Harter (1992) set topicality and cognitive effects in opposition—another misleading dichotomy also seen in Saracevic’s (1997: 319) “manifestations of relevance.” On the contrary, perceived topicality is a cognitive effect (Huang, 2009: 28–31; Hjørland, 2010: 227). A judge considers new information—a document—in the context of an existing assumption—the query—and infers the degree of topical fit—a new conclusion. This is a key inference in IS; it is just not the only one (Bateman, 1999; Bean & Green, 2001; Greisdorf, 2003; Huang, 2009). There are in fact many cognitive contexts in which people can be asked to draw inferences about the same document (and do draw them on their own): its bearing on a real or simulated task, its credibility, its familiarity or novelty, the prestige of its author or publisher, its style of exposition, its evidentiary value, and so on (Wilson, 1978: 17–19; Bates, 1996). What keeps the number of contexts in check is that only a few turn out to be consistently important to people across studies; others are relatively minor or idiosyncratic (Barry & Schamber, 1998; Maglaughlin & Sonnenwald, 2002: 338–340; Xu & Chen, 2006).

As we have seen, judges in retrieval experiments also infer where documents fit on scales, which evoke additional contexts of assumptions. I have argued that scale points differ in the effort required to interpret or comply with them, which affects judgment frequencies at those points. To summarize with rhetorical questions:

• If aggregated judgment frequencies simply reflect the characteristics of retrieved documents, what in retrieval technology repeatedly produces U- and V-shaped or right-skewed distributions?


• Why do U- and V-shaped distributions occur?
• Why do right-skewed distributions occur?

My answer to all three is that the aggregated distributions do not depend solely on the characteristics of retrieved documents. They also depend on judges conserving their own effort in interpreting the scales by which they judge. This cognitive account, derived from tenets of relevance theory, unifies many retrieval studies and illustrates RT anew as a source of explanation in information science.

Acknowledgments

I thank Marcia Bates, Pia Borlund, Howard Greisdorf, Vivien Petras, Ian Ruthven, Philipp Schaer, Diane Sonnenwald, Ellen Voorhees, and Deirdre Wilson for advice on various versions of this paper. I also thank Joseph Janes and the Association for Information Science and Technology for permission to reproduce two of the graphics in Fig. 4.

References

Allan, J. (2005). HARD Track overview in TREC 2005 high accuracy retrieval from documents. In Proceedings of the fourteenth text retrieval conference (TREC 2005). http://trec.nist.gov/pubs/trec14/t14_proceedings.html.
Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million Query Track 2008 overview. In Proceedings of the seventeenth text retrieval conference (TREC 2008). http://trec.nist.gov/pubs/trec17/t17_proceedings.html.
Allott, N. (2013). Relevance theory. In A. Capone, F. Lo Piparo, & M. Carapezza (Eds.), Perspectives on linguistic pragmatics (Perspectives in pragmatics, philosophy & psychology, v. 2) (pp. 57–98). Berlin: Springer. http://folk.uio.no/nicholea/papers/.
Al-Maskari, A., Sanderson, M., & Clough, P. (2008). Relevance judgments between TREC and non-TREC assessors. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 683–684). ACM.
Azzopardi, L., & Zuccon, G. (2016). An analysis of the cost and benefit of search interactions. In Proceedings of the international conference on the theory of information retrieval (pp. 59–68). ACM.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter? In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 667–674).
Barry, C. L. (1994). User-defined relevance criteria: An exploratory study. Journal of the American Society for Information Science, 45, 149–159.
Barry, C. L., & Schamber, L. (1998). User criteria for relevance evaluation: A cross-situational comparison. Information Processing & Management, 34, 219–236.
Bateman, J. (1999). Modeling the importance of end-user relevance criteria. In Proceedings of the American Society for Information Science and Technology: 36 (pp. 396–406).
Bates, M. J. (1996). Document familiarity, relevance, and Bradford’s law: The Getty Online Searching Project report no. 5. Information Processing & Management, 32, 697–707.
Bean, C. A., & Green, R. (2001). Relevance relationships. In C. A. Bean (Ed.), Relationships in the organization of knowledge (pp. 115–132). Dordrecht, The Netherlands: Kluwer.
Borlund, P. (2000). Evaluation of interactive information retrieval systems (Doctoral dissertation). Åbo, Finland: Åbo Akademi University Press.
Carmel, D., & Yom-Tov, E. (2010). Estimating the query difficulty for information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool.
Carston, R. (2002). Thoughts and utterances: The pragmatics of explicit communication. Oxford, UK: Blackwell.
Carterette, B., & Soboroff, I. (2010). The effect of assessor errors on IR system evaluation. In Proceedings of the 33rd annual international ACM SIGIR conference on research and development in information retrieval (pp. 539–546).
Christoffersen, M. (2004). Identifying core documents with a multiple evidence relevance filter. Scientometrics, 61, 385–394.
Chu, H. (2011). Factors affecting relevance judgment: A report from TREC Legal Track. Journal of Documentation, 67, 264–278.
Clark, B. (2013). Relevance theory. Cambridge, UK: Cambridge University Press.
Clark, H. H., & Schober, M. F. (1994). Asking questions and influencing answers. In J. M. Tanur (Ed.), Questions about questions: Inquiries into the cognitive bases of surveys (pp. 15–47). New York: Russell Sage Foundation.
Clarke, C. L. A., Craswell, N., & Soboroff, I. (2009). Overview of the TREC 2009 Web Track. In Proceedings of the eighteenth text retrieval conference (TREC 2009). http://trec.nist.gov/pubs/trec18/t18_proceedings.html.
Clarke, C. L. A., Craswell, N., Soboroff, I., & Voorhees, E. M. (2011). Overview of the TREC 2011 Web Track. In Proceedings of the twentieth text retrieval conference (TREC 2011). http://trec.nist.gov/pubs/trec20/t20.proceedings.html.
Clarke, C. L. A., Craswell, N., & Voorhees, E. M. (2012). Overview of the TREC 2012 Web Track. In Proceedings of the twenty-first text retrieval conference (TREC 2012). http://trec.nist.gov/pubs/trec21/t21.proceedings.html.
Cole, M. J., Gwizdka, J., Liu, C., & Belkin, N. J. (2011). Dynamic assessment of information acquisition effort during interactive search. In Proceedings of the American Society for Information Science and Technology: 48 (pp. 1–10).
Collins-Thompson, K., Bennett, P., Diaz, F., Clarke, C. L. A., & Voorhees, E. M. (2013). TREC 2013 Web Track overview. In Proceedings of the twenty-second text retrieval conference (TREC 2013). http://trec.nist.gov/pubs/trec22/trec2013.html.
Cuadra, C. A., & Katter, R. V. (1967). Experimental studies of relevance judgments (3 vols.). Santa Monica, CA: System Development Corporation.
Damessie, T. T., Scholer, F., & Culpepper, J. S. (2016). The influence of topic difficulty, relevance level, and document ordering on relevance judging. In Proceedings of the 21st Australasian document computing symposium (pp. 41–48). ACM.
Furner, J. (2004). Information studies without information. Library Trends, 52, 427–446.
Gehanno, J.-F., Thirion, B., & Darmoni, S. J. (2007). Evaluation of meta-concepts for information retrieval in a quality-controlled health gateway. In Proceedings of the AMIA annual symposium (pp. 269–273).
Goatly, A. (1997). The language of metaphors. London and New York: Routledge.
Greisdorf, H. (2000). Relevance thresholds: A cognitive/disjunctive model of end-user cognition as an evaluative process (Doctoral dissertation). University of North Texas.
Greisdorf, H. (2003). Relevance thresholds: A multi-stage predictive model of how users evaluate information. Information Processing & Management, 39, 403–423.
Greisdorf, H., & Spink, A. (2001). Median measure: An approach to IR systems evaluation. Information Processing & Management, 37, 843–857.
Gwizdka, J. (2014). Characterizing relevance with eye-tracking measures. In Proceedings of the 5th information interaction in context symposium (pp. 58–67). ACM.
Harman, D. K. (2005). The TREC test collections. In E. M. Voorhees & D. K. Harman (Eds.), TREC: Experiment and evaluation in information retrieval (pp. 21–52). Cambridge, MA: The MIT Press.
Harter, S. P. (1992). Psychological relevance and information science. Journal of the American Society for Information Science, 43, 602–615.
Higashimori, I., & Wilson, D. (1996). Questions on relevance. University College London Working Papers in Linguistics, 8. http://www.phon.ucl.ac.uk/home/PUB/WPL/96papers/higashi.pdf.


Hjørland, B. (2010). The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 61, 217–237.
Huang, X. (2009). Topicality reconsidered: A multidisciplinary inquiry into topical relevance relationships (Doctoral dissertation). University of Maryland.
Huang, X., & Soergel, D. (2013). Relevance: An improved framework for explicating the notion. Journal of the American Society for Information Science and Technology, 64, 18–35.
Janes, J. W. (1993). On the distribution of relevance judgments. Proceedings of the American Society for Information Science, 30, 104–114.
Janes, J. W., & McKinney, R. (1992). Relevance judgments of actual users and secondary judges: A comparative study. Library Quarterly, 62, 150–168.
Järvelin, K. (2013). Test collections and evaluation metrics based on graded relevance. Lecture Notes in Computer Science, 7536, 280–294.
Jiang, J., & Allan, J. (2016). Adaptive effort for search evaluation metrics. In European conference on information retrieval (pp. 187–199). Springer International Publishing.
Kekäläinen, J. (2005). Binary and graded relevance in IR evaluations—Comparison of the effects on ranking of IR systems. Information Processing & Management, 41, 1019–1033.
Kinney, K. A., Huffman, S. B., & Zhai, J. (2008). How evaluator domain expertise affects search result relevance judgments. In Proceedings of the 17th ACM conference on information and knowledge management (CIKM’08) (pp. 591–598).
Kulas, J. T., & Stachowski, A. A. (2009). Middle category endorsement in odd-numbered Likert response scales: Associated item characteristics, cognitive demands, and preferred meanings. Journal of Research in Personality, 43, 489–493.
Liu, C., Liu, J., Cole, M., Belkin, N. J., & Zhang, X. (2012). Task difficulty and domain knowledge effects on information search behaviors. In Proceedings of the American Society for Information Science and Technology: 49 (pp. 1–10).
Liu, J., Liu, C., & Belkin, N. (2013). Examining the effects of task topic familiarity on searchers’ behaviors in different task types. In Proceedings of the American Society for Information Science and Technology: 50 (pp. 1–10).
Lykke, M., Larsen, B., Lund, H., & Ingwersen, P. (2010). Developing a test collection for the evaluation of integrated search. Lecture Notes in Computer Science, 5993, 627–630.
Maglaughlin, K. L., & Sonnenwald, D. H. (2002). User perspectives on relevance criteria: A comparison among relevant, partially relevant, and not-relevant judgments. Journal of the American Society for Information Science and Technology, 53, 327–342.
Pao, M. L. (1993). Term and citation retrieval: A field study. Information Processing & Management, 29, 95–112.
Petras, V., Bogers, T., Ferro, N., & Masiero, I. (2013a). Cultural heritage in CLEF (CHiC) 2013. Multilingual task overview. http://ims-sites.dei.unipd.it/documents/71612/430938/CLEF2013wn-CHiC-PetrasEt2013.pdf.
Petras, V., Bogers, T., Toms, E., Hall, M., Savoy, J., Malak, P., Pawlowski, A., Ferro, N., & Masiero, I. (2013b). Cultural heritage in CLEF (CHiC) 2013. Lecture Notes in Computer Science, 8138, 192–211.
Rees, A. M., & Schultz, D. G. (1967). A field experimental approach to the study of relevance assessments in relation to document searching. Final report (2 vols.). Cleveland, OH: Center for Documentation and Communication Research, School of Library Science, Case Western Reserve University.
Ruthven, I. (2014). Relevance behaviour in TREC. Journal of Documentation, 70, 1098–1117.
Ruthven, I., Baillie, M., & Elsweiler, D. (2007). The relative effects of knowledge, interest and confidence in assessing relevance. Journal of Documentation, 63(4), 483–504.
Saracevic, T. (1969). Comparative effects of titles, abstracts, and full texts on relevance judgments. Proceedings of the American Society for Information Science, 6, 293–299.
Saracevic, T. (1971). Selected results from an inquiry into testing of information retrieval systems. Journal of the American Society for Information Science, 22, 126–139.
Saracevic, T. (1975). Relevance: A review of the literature and a framework for thinking on the notion in information science. Journal of the American Society for Information Science, 26, 321–343.
Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval (pp. 139–146).
Saracevic, T. (1997). The stratified model of information retrieval interaction: Extension and applications. In Proceedings of the American Society for Information Science: 34 (pp. 313–327).
Saracevic, T. (2007a). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: Nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58, 1915–1933.
Saracevic, T. (2007b). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 58, 2126–2144.
Saracevic, T. (2008). Effects of inconsistent relevance judgments on information retrieval test results. Library Trends, 56, 763–783.
Saracevic, T., & Kantor, P. (1988). A study of information seeking and retrieving. II. Users, questions, and effectiveness. Journal of the American Society for Information Science, 39, 177–196.
Saracevic, T., Mokros, H., Su, L. T., & Spink, A. (1991). Interaction between users and intermediaries in online searching. In Proceedings of the 12th national online meeting (pp. 329–341).
Schamber, L. (1994). Relevance and information behavior. Annual Review of Information Science and Technology, 29, 3–48.
Schamber, L., & Bateman, J. (1999). Relevance criteria uses and importance: Progress in development of a measurement scale. In Proceedings of the American Society for Information Science annual meeting: 36 (pp. 381–389).
Scholer, F., Turpin, A., & Sanderson, M. (2011). Quantifying test collection quality based on consistency of relevance judgments. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval (pp. 1063–1072).
Scholer, F., Kelly, D., Wu, W.-C., Lee, H. S., & Webber, W. (2013). The effect of threshold priming and need for cognition on relevance calibration and assessment. In Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval (pp. 623–632). ACM.
Smith, C. L. (2008). Searcher adaptation: A response to topic difficulty. Proceedings of the American Society for Information Science and Technology, 45, 1–10.
Smucker, M. D., & Jethani, C. P. (2012). Time to judge relevance as an indicator of assessor error. In Proceedings of the 35th annual international ACM SIGIR conference on research and development in information retrieval (pp. 1053–1054).
Sormunen, E. (2000). A method for measuring wide range performance of Boolean queries in full-text databases (Doctoral dissertation). University of Tampere.
Sormunen, E. (2002). Liberal relevance criteria of TREC—Counting on negligible documents? In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (pp. 324–330).
Sperber, D., & Wilson, D. (1985). Loose talk. In Proceedings of the Aristotelian Society: 86 (pp. 153–171).
Sperber, D., & Wilson, D. (1995). Relevance: Communication and cognition (2nd ed.). Oxford, UK: Blackwell.
Spink, A., Greisdorf, H., & Bateman, J. (1998). From highly relevant to not relevant: Examining different regions of relevance. Information Processing & Management, 34, 599–621.
Tague-Sutcliffe, J. (1995). Measuring information: An information services perspective. San Diego, CA: Academic Press.
Tang, R., Shaw, W. M., & Vevea, J. L. (1999). Towards the identification of the optimal number of relevance categories. Journal of the American Society for Information Science, 50, 254–264.
Text Retrieval Conference. (2006). Ad hoc test collections. Relevance judgments. Data – English relevance judgments. http://trec.nist.gov/data/reljudge_eng.html.
Turpin, A., Scholer, F., Mizzaro, S., & Maddalena, E. (2015). The benefits of magnitude estimation relevance assessments for information retrieval evaluation. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 565–574). ACM.
Verma, M., Yilmaz, E., & Craswell, N. (2016). On obtaining effort based judgements for information retrieval. In Proceedings of the ninth ACM international conference on web search and data mining (pp. 277–286). ACM.


Villa, R., & Halvey, M. (2013). Is relevance hard work? Evaluating the effort of making relevance assessments. In Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval (pp. 765–768).
Voorhees, E. M. (2001). Evaluation by highly relevant documents. In Proceedings of the 24th international ACM SIGIR conference on research and development in information retrieval (pp. 74–82).
Voorhees, E. M., & Hersh, W. (2012). Overview of the 2012 Medical Records Track. In Proceedings of the twenty-first text retrieval conference (TREC 2012) (pp. 1–6).
Wang, J. (2011). Accuracy, agreement, speed, and perceived difficulty of users’ relevance judgments for e-discovery. In Proceedings of the SIGIR information retrieval for e-discovery workshop. http://www.umiacs.umd.edu/~oard/sire11/papers/wang2.pdf.
White, H. D. (2007a). Combining bibliometrics, information retrieval, and relevance theory. Part 1: First examples of a synthesis. Journal of the American Society for Information Science and Technology, 58, 536–559.
White, H. D. (2007b). Combining bibliometrics, information retrieval, and relevance theory. Part 2: Some implications for information science. Journal of the American Society for Information Science and Technology, 58, 583–605.
White, H. D. (2009). Some new tests of relevance theory in information science. Scientometrics, 83, 653–667.
White, H. D. (2010). Relevance in theory. In M. J. Bates & M. N. Maack (Eds.), Encyclopedia of library and information sciences (pp. 4498–4511). Boca Raton, FL: CRC Press.
White, H. D. (2011). Relevance theory and citations. Journal of Pragmatics, 43, 3345–3361.
White, H. D. (2014). Co-cited author retrieval and relevance theory: Examples from the humanities. Scientometrics, 102, 2275–2299.
Wilson, D. (2011). Relevance and the interpretation of literary works. University College London Working Papers in Linguistics, 23. https://www.ucl.ac.uk/pals/research/linguistics/publications/wpl/11papers/Wilson2011.
Wilson, D. (2012). Handout 1—Relevance. Wilson seminar, University of Gent, Belgium. http://www.gist.ugent.be/wilsonseminar.
Wilson, D. (2014). Relevance theory. University College London Working Papers in Linguistics, 26. http://www.ucl.ac.uk/pals/research/linguistics/publications/wpl/14papers/Wilson_UCLWPL_2014.pdf.
Wilson, D., & Sperber, D. (2012). Meaning and relevance. Cambridge, UK: Cambridge University Press.
Wilson, P. (1968). Two kinds of power; An essay on bibliographical control. Berkeley and Los Angeles: University of California Press.
Wilson, P. (1977). Public knowledge, private ignorance. Westport, CT: Greenwood Press.
Wilson, P. (1978). Some fundamental concepts of information retrieval. Drexel Library Quarterly, 14(2), 10–24.
Xu, Y., & Chen, Z. (2006). Relevance judgment: What do information users consider beyond topicality? Journal of the American Society for Information Science and Technology, 57, 961–973.
Yilmaz, E., Verma, M., Craswell, N., Radlinski, F., & Bailey, P. (2014). Relevance and effort: An analysis of document utility. In Proceedings of the 23rd ACM international conference on information and knowledge management (pp. 91–100). ACM.
Zuccon, G. (2016). Understandability biased evaluation for information retrieval. In European conference on information retrieval (pp. 280–292). Springer International Publishing.