Evaluating the impact of MeSH (Medical Subject Headings) terms on different types of searchers




Information Processing and Management 53 (2017) 851–870

Contents lists available at ScienceDirect

Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman

Ying-Hsang Liu a,b,∗, Nina Wacholder c

a School of Information Studies, Charles Sturt University, Wagga Wagga, NSW 2678, Australia
b Research School of Computer Science, The Australian National University, Acton ACT 0200, Australia
c School of Communication & Information, Rutgers University, New Brunswick, NJ 08901, USA

Article info

Article history: Received 3 October 2016; Revised 7 March 2017; Accepted 27 March 2017

Keywords: Information retrieval evaluation; Medical Subject Headings; Controlled vocabulary; Domain knowledge; User characteristics

Abstract

To what extent do MeSH terms improve search effectiveness for different kinds of users? We observed four different kinds of information seekers using an experimental information retrieval system: (1) search novices; (2) domain experts; (3) search experts and (4) medical librarians. Participants searched using either a version of the system in which MeSH terms were displayed or another version in which they had to formulate their own terms. The information needs were a subset of the relatively difficult topics originally created for the Text REtrieval Conference (TREC). Effectiveness of retrieval was based on the relevance judgments provided by TREC. The results of the study provide experimental evidence of the usefulness of MeSH terms and further identify a significant relationship between the user characteristics of domain knowledge and search training and search performance in an interactive search environment. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Controlled retrieval languages such as MeSH (Medical Subject Headings) and Library of Congress Subject Headings (LCSH) are standard tools for information access in library collections. The debate over the usefulness of these indexing languages first emerged in the early days of the field of information retrieval (IR) itself, when scholars such as Salton (1972) and Sparck Jones (1981) suggested that free text searching was as effective as controlled retrieval languages. More recently this question has been raised in the knowledge organization community (Hjørland, 2016) and at major American libraries. For example, a report prepared for the Library of Congress suggested that relative to automatic subject indexing, manual indexing is not cost-effective (Calhoun, 2006). A report prepared by the University of California Libraries Bibliographic Services Task Force (2005, p. 24) suggested that the question of “abandoning the use of controlled vocabularies [LCSH, MESH, etc.] for topical subjects in bibliographic records” merits consideration, in part because manual indexing is difficult to understand and use. The Library of Congress Working Group on the Future of Bibliographic Control included sixteen metadata experts who recommended, inter alia, the re-purposing of LCSH and recognition of the potential usefulness of computational methods for subject analysis (Library of Congress Working Group on the Future of Bibliographic Control, 2008). In response to the recommendations, a task force established by the Association for Library Collections and Technical Services Heads of Technical Services in Large Research Libraries Interest Group suggested a research agenda that addresses various aspects of bibliographic data use, “to determine if, in fact, these elements do provide value towards facilitating the user tasks” (Stalberg & Cronin, 2011, p. 132). The suggestion that controlled indexing languages are no longer worth the resources required to produce them has been vigorously opposed, for example by Byrd et al. (2006) and Mann (2006). Given computer users’ widespread comfort and familiarity with free text searching in modern search engines, the question of the conditions under which the resources required to create manual indexing are justified is significant.

The goal of the research described in this paper is to provide empirical evidence to help answer the question of how useful controlled indexing languages are for different kinds of users. We evaluate the usefulness of controlled indexing languages, exemplified by MeSH terms, by considering their usefulness for searchers with different levels of domain knowledge and search training. The research question raised in this paper is of practical importance for the librarians currently engaged in development of controlled indexing languages and for the libraries that employ them, as well as for the users of libraries. Although the articles mentioned above have focused on whether expending the resources to create controlled indexing languages is justified, the study we describe here is focused on the usefulness of the languages to different kinds of users.

∗ Corresponding author at: School of Information Studies, Charles Sturt University, Wagga Wagga, NSW 2678, Australia. E-mail addresses: [email protected], [email protected] (Y.-H. Liu), [email protected] (N. Wacholder).

http://dx.doi.org/10.1016/j.ipm.2017.03.004
0306-4573/© 2017 Elsevier Ltd. All rights reserved.

2. Background and literature review

Traditionally an index language refers to a language used to describe documents for retrieval purposes; index terms are the elements of the index language (Cleverdon, 1967; van Rijsbergen, 1979). The need for controlled vocabulary was primarily based on the supposition that natural language is not systematic enough to represent complex concepts predictably and therefore is of limited value in accessing all and only relevant documents (Svenonius, 1986). Controlled vocabularies provide a mechanism to compensate for the pervasive ambiguity and synonymy of natural language (Wacholder, Ravin, & Choi, 1997).
For example, without controlled vocabulary, a user using the ambiguous word cell to search for information about the basic structural unit of human biology might instead retrieve documents on jail cells or on databases. A controlled indexing language provides contextualized disambiguation, for example through the use of broader or narrower terms and hierarchical relationships among the terms. If the index term cell appears under the heading Biology, the ambiguity is removed. The index term cell can also represent major as well as minor concepts at a relatively high level of abstraction for document indexing. An information seeker searching for information on the affective problem of manic disorder might miss text that discusses synonymous or near-synonymous concepts such as mania, bipolar disorder and bipolar depression. Controlled indexing languages address the problem of vocabulary variability by including cross-references to term variants, thereby supporting the search process.

Some studies have suggested that complex term relations are not useful in terms of retrieval effectiveness (e.g., Keen, 1973; Sparck Jones, 1981). Sparck Jones’ (1981) review of index language tests from 1958–1978 suggests that different index languages can achieve comparable levels of search performance; more importantly, simple indexing with uncontrolled vocabulary that is based on selected textwords (the words used in the text) is as good as sophisticated indexing. Studies of library online catalogs have demonstrated that the value of subject headings lies in the retrieval of relevant records not retrieved by keyword searching of bibliographic records alone (e.g., Gross & Taylor, 2005; Gross, Taylor, & Joudrey, 2015; Voorbij, 1998). Most users seemed to use keyword searching in part because they have problems using controlled indexing languages such as subject headings to formulate search queries (Larson, 1991).
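The cross-reference mechanism just described, mapping synonyms and near-synonyms to a single preferred term, can be sketched in a few lines. The entry-term mappings below are illustrative only, not actual MeSH vocabulary records:

```python
# Minimal sketch of vocabulary control: entry terms (synonyms and
# near-synonyms) are cross-referenced to a single preferred term.
# These mappings are illustrative, not real MeSH entry-term data.
ENTRY_TERMS = {
    "mania": "Bipolar Disorder",
    "manic disorder": "Bipolar Disorder",
    "bipolar depression": "Bipolar Disorder",
    "bipolar disorder": "Bipolar Disorder",
}

def preferred_term(query_term: str) -> str:
    """Map a searcher's term to the controlled preferred term, if known;
    otherwise leave the term unchanged (free-text fallback)."""
    return ENTRY_TERMS.get(query_term.lower(), query_term)
```

Under this sketch, `preferred_term("Mania")` and `preferred_term("bipolar depression")` both resolve to the same preferred term, so documents indexed under either variant become retrievable with one query term.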
The terms from controlled indexing languages can be used as queries for finding unique relevant documents not retrieved by keyword searching (Tenopir, 1985). Studies that directly compare the usefulness of manual and automatic indexing systems without the involvement of real users have suggested that automatic indexing methods for identifying terms and search systems, if implemented properly, can be as effective as manual indexing systems (e.g., Boyce & Lockard, 1975; Salton, 1972). In a comparison of the Boolean search system with human-developed controlled terms in MEDLARS (Medical Literature Analysis and Retrieval System) and the automatic vector matching techniques in the SMART system, Salton (1972) claimed that the search effectiveness of automatic indexing techniques can be comparable to that of manual indexing. This study is often cited even though it used a relatively small test collection of 450 documents and searched over bibliographic records with abstracts (Salton, 1972). Boyce and Lockard’s (1975, p. 383) comparison between automatic and manual indexing methods using MeSH terms concluded that the performance of the automatic indexing method was comparable to that of systems using manual indexing methods. Savoy’s (2005) evaluation of various search models revealed that the mean average precision obtained by a combination of manual and automatic indexing schemes is significantly better than that of manual or automatic indexing alone. In an evaluation of the search effectiveness of MeSH terms for different retrieval models, Abdou and Savoy (2008) found that for some retrieval models, including MeSH terms for indexing documents significantly improves the mean average precision (MAP). Since both studies were conducted in a laboratory environment without any human searchers, the setting may reflect the behavior of the underlying indexing models and be too removed from operational settings.
Nonetheless, the comparative studies contribute to our understanding of the conditions of indexing and searching strategies in which controlled indexing languages improve search effectiveness. In a different approach, a study by Hersh, Buckley, Leone, and Hickam (1994) compared the usefulness of MeSH terms and text words by asking clinical physicians to conduct searches over bibliographic records on a search system based on the vector space retrieval model. The findings indicated that clinical physicians can effectively use the retrieval system, with significantly higher recall and reduced precision relative to each of the other search groups. In a study of family practice physicians’ querying behavior in the workplace, Lykke, Price, and Delcambre (2012) found that experienced searchers are able to use a search feature based on a semantic component model for structuring queries. Wacholder and Liu (2006, 2008) specifically compared the usefulness of query terms (the units of indexing languages that searchers submit to a search system) identified by different methods: one constructed by a human indexer and two others identified automatically. The findings suggested that query terms do affect search outcome and that a set of automatic terms using linguistically-motivated rules can be as effective as terms identified by a human indexer.

Svenonius (1986) found that the usefulness of indexing languages is affected by factors such as the subject domain of the document collection, the rules by which the controlled vocabulary is created and the skills of indexers and searchers. The important factors that have been considered by researchers include the types of indexing languages used (Boyce & Lockard, 1975; Salton, 1972; Savoy, 2005; Tenopir, 1985), retrieval systems (retrieval model and system implementation) (Abdou & Savoy, 2008; Hersh et al., 1994; Savoy, 2005), elements of the test collection (whether full-text, records, or abstracts were searched) (Tenopir, 1985; Wacholder & Liu, 2006, 2008), evaluation measures (whether the evaluation involved real user judgments) (Gross & Taylor, 2005; Hersh et al., 1994; Keen, 1973) and searching mechanism (whether human searchers were involved) (Abdou & Savoy, 2008; Hersh et al., 1994; Salton, 1972). From the perspective of this research, our main concern is that previous studies have not taken into account the impact of user characteristics on search results. One reason researchers do not conduct this kind of study is the difficulty of isolating user characteristics (e.g., Anderson & José Pérez-Carballo, 2001; Golub et al., 2016; Svenonius, 1986) and the possible interactions among the characteristics in assessing the specific impact of particular variables on search performance.
The cost and value of human creation of controlled indexing languages has been identified as an important research agenda item from the perspective of technical services (Stalberg & Cronin, 2011). We conclude that there remains a need for deeper analysis of the usefulness of controlled indexing languages, in part because of conflicting results from different studies and in part because of a lack of understanding of the user characteristics that affect index term usefulness in different contexts.

2.1. User characteristics of domain knowledge and search experience

In the investigation of user search behaviors, researchers have attempted to identify specific user characteristics related to search performance, such as domain knowledge and search experience. These two user characteristics have been identified as important research variables (Moore, Erdelez, & Wu, 2007; Wildemuth, 2004).

Domain knowledge refers to an individual’s level of knowledge in a particular subject discipline. This variable has been operationalized and measured in several ways, depending on the purposes of the study. For example, medical students’ clinical knowledge was measured by a standardized test since the study population was medical students (Pao et al., 1993). Hembrooke, Granka, Gay, and Liddy (2005), whose subjects were undergraduates, used the subjects’ self-report of search topic familiarity as a measure of domain expertise. To study the effect of domain knowledge on user search behavior, users’ self-assessed knowledge (Cole et al., 2011; Mu, Lu, & Ryu, 2014; Tang, Liu, & Wu, 2013; Zhang, Liu, Cole, & Belkin, 2015), stages of course instruction in a particular field (Sihvonen & Vakkari, 2004) and formal training in a subject domain (Hsieh-Yee, 1993; Marchionini & Dwiggins, 1990; Meadow, Wang, & Yuan, 1995; Wildemuth, 2004) have also been used as measures of the participant’s level of domain knowledge.

Search experience refers to a searcher’s skills in interacting with information retrieval systems. It has been operationalized as whether searchers have had extensive use of online databases and whether they were proficient in system features, such as search commands or indexing thesauri. Researchers have used the total time spent using online databases as a measure of different levels of search experience (Fenichel, 1981; Howard, 1982). Search experience was also determined by formal training in online database searching (Bellardo, 1985; Saracevic & Kantor, 1988).
More recent studies tend to assess whether the search experience gained in a specific type of information retrieval system can be transferred to another (Palmquist & Kim, 2000; Vakkari, Pennanen, & Serola, 2003). However, whether knowledge obtained from search experience is transferable within Boolean-based IR systems or between different types of IR systems has not been specifically studied. Nonetheless, despite the different measurements in the aforementioned studies, ongoing research into the effect of search experience on search performance does provide a rationale for formal search training.

Somewhat surprisingly, studies that investigated the effect of domain knowledge on search effectiveness have shown that these two variables are not correlated (e.g., Allen, 1991; Pao et al., 1993). Studies have generally suggested that there are large individual differences in search performance, even within a user group distinguished by different levels of either domain knowledge or search experience (e.g., Fenichel, 1981; Howard, 1982; Pao et al., 1993). However, one of the limitations of these studies is that a relatively small number of search tasks (referred to as “search topics” in TREC) have been used without specific consideration of search topic variations (e.g., Buckley & Voorhees, 2005; Sparck Jones & van Rijsbergen, 1976; Voorhees & Harman, 2005). That is, the effect of user characteristics on search results has not been properly assessed due to the dominating effect of search topics.
Overall, empirical studies have obtained some mixed results: (a) Domain knowledge or specific topic knowledge is not correlated with search performance (Allen, 1991; Pao et al., 1993), but domain experts are able to use controlled vocabularies to expand their queries and obtain good search performance (Hersh et al., 1994; Nielsen, 2004; Shiri & Revie, 2006); (b) Search experience with online databases cannot predict search performance (Fenichel, 1981; Howard, 1982; Sutcliffe, Ennis, & Watkinson, 2000), but the recall score improves with search experience (Liu, 2010; McKibbon et al., 1990) and experienced users can find relevant documents more efficiently than nonexperienced users (Yoo & Mosa, 2015);



(c) There is a crossover interaction effect between domain knowledge and search experience (Hsieh-Yee, 1993; Meadow, Marchionini, & Cherry, 1994; Vakkari et al., 2003).

In the following sections, we report the results of a study designed to address some of the uncertainties in understanding the effect of the method of creation of index terms, search expertise and domain knowledge on the efficiency and effectiveness of searching.

3. Research questions and hypotheses

The goal of this study was to answer two research questions:

1. Do controlled indexing languages help users produce better search results?
2. Do controlled indexing languages help different kinds of users produce better search results?

We used MeSH terms as a case study because MeSH is widely recognized as one of the most sophisticated and state-of-the-art controlled vocabularies for IR systems in the biomedical domain (Nelson, Johnston, & Humphreys, 2001; Yoo & Mosa, 2015). A major chemical information system designed by Chemical Abstracts Service (CAS) for the representation and retrieval of chemical structures, specifically the CAS Registry, has also been incorporated into MeSH (Willett, 1987). The large number and high quality of the indexed documents in the MEDLINE database have been critical to biomedical research and development. The two primary research questions led to the formulation of several research hypotheses, developed from the perspective of IR experiments:

Hypothesis 1. Queries searched using MeSH will get better results than queries searched without MeSH when precision and recall are used to measure result quality.

To test this hypothesis, we assess the quality of MeSH terms by measuring the quality of search results in terms of search effectiveness. In IR experiments the search effectiveness of different retrieval techniques is assessed by comparing search performance across queries.
IR researchers have widely used micro-averaging in summarizing precision and recall values for comparing the search effectiveness of different retrieval techniques in order to meet the statistical requirements (e.g., Kelly, 2009; Tague-Sutcliffe, 1992; van Rijsbergen, 1979). The method of micro-averaging is intended to obtain reliable results in comparing search performance of different retrieval techniques by giving equal weight to each query. This approach is required to produce a robust statistical design. Within an interactive IR experiment that involves human searchers, it is often difficult to use a large set of search topics. Empirical evidence has demonstrated that 50 topics are necessary to determine the relative performance of different retrieval techniques in batch mode evaluations (Buckley & Voorhees, 2005). Nonetheless, as will be demonstrated in the methodology section, this study has explicitly diminished the overriding topic effect through an experimental design that controls searchers, systems and search topic pairs and uses a relatively large number of search topics.

The second hypothesis is concerned with the usefulness of MeSH terms at the level of searchers:

Hypothesis 2. Searchers using MeSH will get better results than searchers who do not use MeSH when precision and recall are used to measure result quality.

In addition to the experimental considerations, with Hypothesis 2 we are concerned about the impact of searcher characteristics on user search performance, since previous research has been inconclusive (Fenichel, 1981; Howard, 1982; Hsieh-Yee, 1993; McKibbon et al., 1990; Pao et al., 1993). It is uncertain which kinds of searchers will benefit the most from using MeSH terms while searching for documents about complex biomedical topics. The third set of hypotheses posits that the answer to Hypothesis 2 depends on the characteristics of the searcher:

Hypothesis 3. Quality of search results using MeSH will vary by searcher type.
Hypothesis 3a. Domain experts using MeSH will get better results than domain novices using MeSH.

Hypothesis 3b. Search experts using MeSH will get better results than untrained searchers using MeSH.

Since domain experts can better understand relatively technical biomedical topics and MeSH terms, they are expected to obtain better search results. Search experts are expected to do better than untrained searchers because they are familiar with the structure and use of controlled indexing languages like MeSH and with system features in their interactions with IR systems. This set of hypotheses should be considered exploratory rather than confirmatory, because previous investigations have suggested large individual differences (e.g., Bellardo, 1985; Fenichel, 1981; Saracevic & Kantor, 1988) and relatively small differences in system performance (Sparck Jones, 1981). Within these constraints, it is potentially difficult to identify statistically significant differences in search results. Nonetheless, the exploratory investigation will contribute to our understanding of the impact of user characteristics and search assistance tools on information seeking effectiveness.
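The equal-weight-per-query averaging of precision and recall used to compare retrieval conditions can be sketched as follows. The per-query scores here are computed from hypothetical document-ID sets; the study's actual scores come from trec_eval:

```python
def precision_recall(retrieved, relevant):
    """Per-query precision and recall from sets of document IDs."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_over_queries(runs):
    """Give each query equal weight: average the per-query scores.

    runs: list of (retrieved_ids, relevant_ids) pairs, one per query.
    Returns (mean precision, mean recall) over all queries.
    """
    scores = [precision_recall(ret, rel) for ret, rel in runs]
    n = len(scores)
    return (sum(p for p, _ in scores) / n,
            sum(r for _, r in scores) / n)
```

For example, a run that scores (P=0.5, R=1.0) on one query and (P=1.0, R=0.5) on another averages to (0.75, 0.75); a single very easy or very hard query cannot dominate the summary, which is the point of weighting queries equally.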



Table 1
Four types of searchers categorized by domain knowledge and search training.

Searcher type            Domain knowledge    Search training
Search Novice (SN)       −                   −
Domain Expert (DE)       +                   −
Search Expert (SE)       −                   +
Medical Librarian (ML)   +                   +

Note. Plus (+) and minus (−) indicate the high level and low level of the specified searcher characteristic respectively.

Fig. 1. MeSH+ search interface based on Greenstone (New Zealand Digital Library Project, 2006). The fold and stem options refer to the features of case-folding and stemming respectively. MeSH terms are not specially weighted in the retrieval model; they are added to the search indexes.
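As the caption notes, MeSH terms receive no special weight: they are simply appended to the indexed text. A minimal TF×IDF sketch of that design choice follows; it is illustrative only (the actual system used Greenstone's MGPP indexer, not this code), and the document texts are hypothetical:

```python
import math
from collections import Counter

def build_index(docs):
    """docs: list of (abstract_text, mesh_terms) pairs. MeSH terms are
    appended to the abstract tokens with no special weighting, mirroring
    the MeSH+ indexing design. Returns token lists, document frequencies
    and the collection size."""
    token_lists = [abstract.lower().split() + [t.lower() for t in mesh]
                   for abstract, mesh in docs]
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return token_lists, df, len(docs)

def tfidf_score(query, tokens, df, n):
    """Sum of TF×IDF contributions of the query terms for one document."""
    tf = Counter(tokens)
    return sum(tf[q] * math.log(n / df[q]) for q in query if q in df)
```

Because an assigned MeSH term becomes just another indexed token, a document whose abstract never mentions a query word can still match through its MeSH terms, with no extra boost beyond ordinary term frequency.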

4. Methodology

We observed four different kinds of information seekers using an experimental IR system: (1) search novices; (2) domain experts; (3) search experts and (4) medical librarians. The information needs were a subset of the relatively difficult topics originally created for the TREC Genomics Track 2004 (Hersh, 2004), given the growing complexity of the biomedical literature and the significance of structuring knowledge in databases (Hersh et al., 2006). Effectiveness of retrieval was measured using the relevance judgments provided by TREC. The four types of searchers were distinguished by their levels of domain knowledge and search training (Table 1) and were operationalized as follows:

1. Search Novice (SN). Undergraduate students without formal training in online searching courses and without advanced knowledge in the biomedical domain. These undergraduates were not biology majors. While many of these students are experienced and heavy Web users, they are not expected to have an in-depth understanding of online bibliographic databases.
2. Domain Expert (DE). Graduate students in a biomedical domain, i.e., biology or medicine. DEs did not have formal training in searching, such as online searching courses.
3. Search Expert (SE). Graduate students enrolled in Master of Library and Information Science (MLIS) programs who had previously taken online database searching or other related courses and did not have advanced knowledge in the biomedical domain. SEs had not majored in biology and did not have a Master’s degree or above in any biomedical field.
4. Medical Librarian (ML). Medical librarians specializing in online searching services. Their domain knowledge is defined by formal education in biomedical areas or more than two years of work experience in medical libraries.

The search task was to conduct searches to help biologists perform their research.
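The Table 1 categorization amounts to a lookup on two binary user characteristics; as a small sketch:

```python
# Sketch of the Table 1 categorization: the two binary user
# characteristics jointly determine the searcher type.
SEARCHER_TYPES = {
    # (domain_knowledge, search_training): searcher type
    (False, False): "Search Novice (SN)",
    (True, False): "Domain Expert (DE)",
    (False, True): "Search Expert (SE)",
    (True, True): "Medical Librarian (ML)",
}

def searcher_type(domain_knowledge: bool, search_training: bool) -> str:
    """Return the searcher category for a participant's characteristics."""
    return SEARCHER_TYPES[(domain_knowledge, search_training)]
```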
Participants searched on two different search interfaces built on a single system: a MeSH+ version and a MeSH− version. One interface allowed searchers to use MeSH terms (MeSH+); the other did not provide this search option (MeSH−). The MeSH+ interface displayed MeSH terms and abstracts in retrieved bibliographic records; the MeSH− interface displayed only abstracts. Because we were concerned that participants might respond to a cue that could signal the experimenters’ intent, the search interfaces were labeled ‘System Version A’ and ‘System Version B’ for the MeSH+ and MeSH− versions respectively (see Fig. 1).

The IR system was built using the Greenstone Digital Library Software (New Zealand Digital Library Project, 2006). It was constructed as a Boolean-based system with ranking by the TF×IDF weighting rule (Witten, Moffat, & Bell, 1999). The search engine MGPP (MG++), a re-implementation of the MG (Managing Gigabytes) searching and compression algorithms, was used for indexing and querying. A search help with detailed instructions about system features was prepared for user reference purposes. For example, either Hypertension or hypertension will appear in search results when the fold option is used. The query hypertension and its morphological variants, such as hypertensive and hypertensions, will appear in search results when the stem option is used. MeSH terms in the MeSH+ version were not specially weighted in the retrieval model; they were added to the search indexes.

We deliberately refrained from implementing certain system features that allow users to take advantage of the hierarchical structure of MeSH, such as hyperlinked MeSH terms, an explode function that automatically includes all narrower terms, and automatic query expansion (e.g., Lu, Kim, & Wilbur, 2009; Matos, Arrais, Maia-Rodrigues, & Oliveira, 2010), available on other search systems. The use of those features would have invalidated the results by introducing other variables at the levels of search interface and query processing, although a full integration of those system features would have increased the usefulness of MeSH terms. An undesirable side effect of this rigorous experimental design is that users of the MeSH+ system, especially experienced ones, were handicapped because they could not use the hierarchical browsing designed to increase MeSH usability.

The experiment was a 4 × 2 × 2 factorial design with four types of searchers, two versions of the experimental system and controlled search topic pairs. Thirty-two participants (eight of each type of searcher) each searched four topics with each of the two search interfaces. With a Graeco-Latin square balanced design (Fisher, 1935), we used 20 search topics (10 pairs) in an IR user experiment to alleviate search topic variability (Robertson, 1981, 1990; Sparck Jones, 2000; Lagergren & Over, 1998; Voorhees & Buckley, 2002).

ID: 39
Title: Hypertension
Need: Identify genes as potential genetic risk factors candidates for causing hypertension.
Context: A relevant document is one which discusses genes that could be considered as candidates to test in a randomized controlled trial which studies the genetic risk factors for stroke.

Fig. 2. Sample search topic from the TREC Genomics Track 2004 document set data file (Hersh, 2004).
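A Graeco-Latin square balances two nuisance factors at once: in each row and column every level of both factors occurs exactly once, and every combination of the two factors occurs exactly once across the square. The 4 × 4 square below is a generic illustration with hypothetical factor labels; the actual arrangement of experimental conditions is the one given in Appendix A:

```python
# A 4x4 Graeco-Latin square. Rows and columns might represent, e.g.,
# searcher groups and sessions; Latin letters one balanced factor
# (such as topic blocks) and lower-case letters another (such as
# system-version order). Labels here are illustrative only.
LATIN = ["ABCD", "BADC", "CDAB", "DCBA"]
GREEK = ["abcd", "cdab", "dcba", "badc"]

def is_graeco_latin(latin, greek):
    """Check that both squares are Latin (no repeats in any row or
    column) and mutually orthogonal (every letter pair occurs once)."""
    n = len(latin)
    for sq in (latin, greek):
        for i in range(n):
            if len(set(sq[i])) != n:                    # row check
                return False
            if len({sq[r][i] for r in range(n)}) != n:  # column check
                return False
    pairs = {(latin[r][c], greek[r][c]) for r in range(n) for c in range(n)}
    return len(pairs) == n * n  # orthogonality: all n*n pairs distinct
```

The orthogonality check is what guarantees that neither balanced factor is confounded with the other, which is how the design spreads topic and system-order effects evenly across searcher groups.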
Each topic was searched 16 or 20 times in total (see Appendix A for the arrangement of experimental conditions). A power analysis for a balanced one-way design indicated that our experimental design has a power of .92, with a medium effect size of .28 (Cohen, 1988).

4.1. Search topics

The search topics used in this study were created for the TREC Genomics Track 2004 for the purpose of evaluating the search effectiveness of different retrieval techniques (see Fig. 2 for an example). They covered a range of genomics questions typically asked by biomedical researchers. Because of the technical nature of genomics topics, we considered whether the search topics were intelligible for human searchers, particularly for those without advanced training in the biomedical field. Given that these search topics were designed for machine runs with little or no consideration for searches by real users, we selected twenty of the fifty topics using the following procedure:

1. Consulting an academic librarian with twenty years of practice and a Bachelor’s degree in biology, together with a graduate student in neuroscience, to help judge whether the topics would be comprehensible to the participants who were not domain experts;
2. Ensuring that major concepts identified by concept analysis in search topics could be mapped to MeSH by searching the MeSH Browser. Concept analysis was introduced as part of the training session for participants;
3. Eliminating topics with very low MAP (mean average precision) and P10 (precision at top ten documents) scores in the relevance judgment set.

We selected twenty search topics from the pool of fifty topics. These topics were then randomly paired to create ten search topic pairs for the arrangement of experimental conditions (see Appendix B for the selected twenty search topics).

4.2. Experimental procedure

After signing the consent form, participants filled out a searcher background questionnaire before the search assignment.
After a brief training session to familiarize participants with the interface, they were assigned to one of the arranged experimental conditions and conducted eight search tasks. After they completed each task, they filled out a questionnaire about their perception of search task difficulty and the usefulness of abstracts/MeSH terms. To check on participants’ ability to comprehend the abstracts, they were also asked to indicate the relevance of two pre-judged documents when they finished each search topic (discussed below). A brief interview was conducted when they finished all search topics.

To ensure that participants received consistent training, an experimental guideline with scripted instructions in colloquial English, together with a training search topic, was prepared and used. This training session was designed to familiarize participants with available system features and search tasks, particularly search concept formulation and examination of search results. A sample document record was used to illustrate the availability of particular index terms, of which MeSH terms were only accessible half of the time. A search help sheet describing advanced system features was provided to the participant on paper as part of the tutorial. To help searchers recognize potentially useful search terms, participants were then instructed to do concept analysis by identifying important concepts from search topic descriptions and devising other terms within each concept. The chosen practice topic (see Fig. 2) was analyzed as three (hypertension, genetic risk and stroke) or four (hypertension, risk factors, genetics and stroke) main concepts. All types of searchers seemed to understand this process, although only trained searchers had received this kind of training before the experiment. So SNs may have been able to make good use of the instructions in this study.

The MeSH Browser (U.S. National Library of Medicine, 2003), an online vocabulary look-up aid prepared by the U.S. National Library of Medicine, was designed to help searchers find appropriate MeSH terms and display the hierarchy of terms for retrieval purposes. As noted earlier, advanced system features were not implemented in our experimental IR system. In particular, experienced users of MeSH (the MLs) may have been at a disadvantage because the terms were not displayed in the MeSH hierarchy. The stripped-down MeSH Browser was only available when participants were assigned to the MeSH+ version of the experimental system; in the MeSH− version, participants had to formulate their own terms without any assistance from the MeSH Browser or from the MeSH terms displayed in bibliographic records. Each search topic was allocated up to ten minutes.
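The concept analysis taught in the training session — grouping synonyms within each concept and intersecting the concepts — maps naturally onto a Boolean query. A sketch, using the practice topic's concepts (the synonym lists are illustrative, not terms participants actually submitted):

```python
def boolean_query(concepts):
    """Build a Boolean query from concept analysis: OR the synonyms
    within each concept group, AND the groups together."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts]
    return " AND ".join(groups)
```

For the three-concept analysis of the practice topic this would yield a query such as `(hypertension) AND (genetic risk OR risk factor) AND (stroke)`, where each parenthesized group captures one concept and its variant terms.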
The last query within the time limit was used for calculating search performance, since searchers were instructed to stop searching when they were satisfied with the search results. Searchers who used up the allocated time were asked to indicate which previous query was most satisfying. We kept search logs that recorded search terms, a ranked list of retrieved documents, and time-stamps.

4.3. Participants

To verify that participants demonstrated the expected level of domain knowledge, each participant was instructed to rate the relevance of two documents for each assigned search topic after finishing searches on that topic. These two documents were randomly selected from the pool of relevance-judged documents; one was 'definitely relevant' and the other was 'not relevant'. The order of presentation was also randomized. This search topic judgment, known as the Comprehension Test, was intended to ascertain that DEs and MLs demonstrated sufficient knowledge to understand technical topics in genomics. Overall, the participant profile satisfied the requirement of four kinds of searchers as specified by design. However, the Comprehension Test showed that MLs did not have as high a level of knowledge in genomics as we expected (see Table 3). It is possible that medical librarians were at a disadvantage because of the biological content of the topics, in which they were not necessarily domain experts. The participants' demographic profile revealed that all DEs and half of the SNs were non-native speakers of English. Still, this verification of participants' level of domain knowledge increased the internal validity of this study for investigating the impact of user characteristics on user search behavior.

4.4. Data analysis

We used the trec_eval program and the relevance judgments from the TREC Genomics track dataset to measure search effectiveness in terms of precision and recall measures (Buckley, 1999; Hersh, 2004).
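As a rough sketch of how the precision and recall of a single query are computed against a relevance judgment set (a simplified stand-in for trec_eval; the document IDs below are hypothetical):

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k retrieved documents."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc in top_k if doc in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical ranked result list and relevance judgments for one topic.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8", "d2"}

p, r = precision_recall_at_k(ranked, relevant, k=5)
# Two of the five retrieved documents are relevant (P@5 = 0.4),
# and 2 of the 4 relevant documents are found (recall = 0.5).
```

trec_eval computes these and related measures (e.g., MAP) for every topic from a run file and a qrels file; the function above shows only the core counting step.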
To ensure that the TREC pooled relevance judgment set was sufficiently complete and valid for our study (e.g., Hersh et al., 2004; Voorhees, 2000; Zobel, 1998), we conducted a reliability test of the relevance judgment set. An analysis of the top 10 retrieved documents from all human searches revealed that about one-third of the documents retrieved in our study had not been judged in the TREC data set (762 out of 2277 analyzed documents). A comparison of the judged and un-judged documents for each search topic showed significant differences between the two sets in terms of the MAP (t(19) = −3.69, p < .01), P10 (t(19) = −3.89, p < .001) and P100 (t(19) = −3.95, p < .001) measures. However, since the mean differences for MAP, P10 and P100 were small (approximately 2.7%, 9.9% and 4.9%, respectively), we concluded that the TREC relevance judgments were applicable to our study.

Given that this study is based on a factorial experimental design and we are concerned with the effects of system, searcher and topic, we constructed a linear fixed-effects model to fit the data. The Graeco-Latin square design controlled three sources of variation: four types of searchers, two versions of a system and 10 search topic pairs. We performed square-root transformations on the precision and recall scores to approximate the normality of residuals, an assumption of analysis of variance (ANOVA) (Fox, 2016; Rutherford, 2001; Tague-Sutcliffe, 1992, p. 485). ANOVA was used to analyze the data because searchers and queries are assumed to be randomly selected from the population (Carterette, 2015; Tague-Sutcliffe, 1992) and the queries are assumed to be independent (Hull, 1993). This approach of factorial design and analysis allowed us to separate the effects of systems, searchers and search topics with a relatively small number of participants.
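The transformation-plus-ANOVA step can be sketched as follows, with hypothetical precision scores for two groups; SciPy's one-way f_oneway stands in for the full fixed-effects model:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-search precision scores for two searcher groups.
group_a = np.array([0.36, 0.51, 0.21, 0.28, 0.45, 0.30])
group_b = np.array([0.23, 0.29, 0.38, 0.42, 0.19, 0.25])

# Square-root transformation to bring the residuals closer to
# normality, as assumed by ANOVA.
f_stat, p_value = f_oneway(np.sqrt(group_a), np.sqrt(group_b))
```

The actual analysis fits a multi-factor fixed-effects model (system, searcher, topic) rather than a single factor, but the transform-then-test sequence is the same.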
Since we were interested in the relationship between user characteristics and search performance, the data were also analyzed by a logarithmic cross-ratio analysis (Fleiss, Levin, & Paik, 2003) for dichotomous outcomes. The independent numeric variables of user characteristics (e.g., domain knowledge and search training, excluding gender and language) as well as the dependent variables of search performance, measured by precision and recall scores, were converted into binary variables, using the mean as the cut-off point, which is conventional and convenient for further analysis. This data analysis technique

Table 2
Use of MeSH terms search field in MeSH+ version.

Searcher type    Yes           No            Total
SN               2 (6.3%)      30 (93.8%)    32 (100.0%)
DE               10 (31.3%)    22 (68.8%)    32 (100.0%)
SE               16 (50.0%)    16 (50.0%)    32 (100.0%)
ML               22 (68.8%)    10 (31.3%)    32 (100.0%)
Total            50 (39.1%)    78 (60.9%)    128 (100.0%)

Note. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian. For each searcher type, there are 32 searches in total (8 searchers × 4 topics = 32 searches).

Table 3
Correctness of comprehension test by searcher type.

Searcher type    Both correct    One correct    None correct or not sure    Total
SN               37 (57.8%)      19 (29.7%)     8 (12.5%)                   64 (100.0%)
DE               35 (54.7%)      27 (42.2%)     2 (3.1%)                    64 (100.0%)
SE               28 (43.8%)      20 (31.3%)     16 (25.0%)                  64 (100.0%)
ML               26 (40.6%)      28 (43.8%)     10 (15.6%)                  64 (100.0%)
Total            126 (49.2%)     94 (36.7%)     36 (14.1%)                  256 (100.0%)

Note. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian. For each searcher type, there are 64 searches in total (8 searchers × 4 topics × 2 system versions = 64 searches).

was chosen because it is resistant to sample selection bias and was successfully applied in a previous study (Saracevic, Kantor, Chamis, & Trivison, 1988).

5. Research findings

Generally, our results show that MeSH terms are more useful, in terms of the precision measure, for domain experts than for search experts. Users other than domain experts achieve the same level of search performance regardless of whether MeSH terms are offered. In line with previous findings from IR experiments in laboratory environments, we demonstrate that automatic indexing techniques can be competitive with controlled indexing languages in an interactive search environment in which the hierarchical browsing designed to increase MeSH usability has not been implemented. Specific elements of the findings are discussed under the headings Overall Use of MeSH Terms, Domain Knowledge, Search Training, Search Efficiency and Search Outcome.

5.1. Overall use of MeSH terms

The search logs revealed that participants overall did use MeSH terms during the search process when they used the MeSH+ version (Table 2). Searchers specified MeSH terms as a search field in 39.1% of all searches. Further analysis suggested a statistically significant relationship between searcher type and use of MeSH terms (Fisher's Exact Test, p < .001). Searchers' levels of search training (SEs and MLs) were reflected in the use of MeSH terms: the more search training one had, the more likely one was to use MeSH terms. These results validated our experimental instruments and procedures.

5.2. Domain knowledge

DEs generally had the most biomedical knowledge, as suggested by the large number of undergraduate (median = 15) and graduate (median = 7.5) courses taken, followed by MLs. However, MLs' domain knowledge was much lower than that of the DEs, and their biomedical knowledge primarily came from undergraduate courses (Fig. 3).
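The association between searcher type and MeSH use (Table 2) can be sketched as follows. Since SciPy's fisher_exact handles only 2 × 2 tables, this sketch collapses the table into trained (SE + ML) versus untrained (SN + DE) searchers — a simplification of the 4 × 2 test reported above:

```python
from scipy.stats import fisher_exact

# Counts of searches that did / did not use the MeSH search field
# (from Table 2), collapsed into a 2 x 2 table.
#                  used MeSH   did not
untrained = [12, 52]   # SN (2, 30) + DE (10, 22)
trained = [38, 26]     # SE (16, 16) + ML (22, 10)

odds_ratio, p_value = fisher_exact([untrained, trained])
# The association between search training and MeSH use is strongly
# significant for these counts (p well below .001).
```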
The DE searchers came from the subfields of Cancer Biology, Biochemistry & Molecular Biology, Chemical Biology, Neuroscience, Pharmacology, and Computational Biology & Molecular Biophysics. Our results from the Comprehension Test indicated that DEs demonstrated significantly better understanding of the search topics than SEs did. The correctness of judgment comprised three categories: (1) both correct; (2) one correct; and (3) none correct or not sure. For each assigned search topic, searchers of all types judged both documents correctly only about 40–58% of the time (Table 3). A chi-square test of independence was performed to examine the relationship between searcher type and comprehension. The relationship between the variables was significant, χ2(6, N = 256) = 17.60, p < .05. Pearson residuals analysis indicated that SEs were more likely than expected to fall into the none correct or not sure category, while MLs were less likely to do so. There was no significant difference between DEs and SNs in terms of their comprehension of search topics, as measured by the relevance judgment tasks.
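The chi-square test and Pearson residuals step can be sketched from the Table 3 counts (computed with SciPy; a residual above roughly +2 marks a cell occurring more often than independence would predict):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Comprehension Test counts from Table 3:
# rows = SN, DE, SE, ML; columns = both correct, one correct,
# none correct or not sure.
observed = np.array([
    [37, 19, 8],
    [35, 27, 2],
    [28, 20, 16],
    [26, 28, 10],
])

chi2, p, dof, expected = chi2_contingency(observed)

# Pearson residuals: (observed - expected) / sqrt(expected).
residuals = (observed - expected) / np.sqrt(expected)
# residuals[2, 2] is the SE / "none correct or not sure" cell,
# which exceeds +2, i.e. SEs fall into that category more often
# than expected under independence.
```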


Fig. 3. Users' level of domain knowledge: boxplot of the number of Biology courses taken at the undergraduate and graduate levels. A rectangular box is drawn from the lower quartile to the upper quartile, with the median dividing the box. Outliers are represented as dots. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

Table 4
Amount of MeSH use experience.

Searcher type    None    A little    Some    A lot    Total
SN               8       0           0       0        8
DE               8       0           0       0        8
SE               4       3           1       0        8
ML               0       0           2       6        8
Total            20      3           3       6        32

Note. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

In post-search questionnaires, participants were asked to indicate the perceived difficulty of the search task and the usefulness of MeSH terms after they finished each search topic. A chi-square test between searcher type and perceived difficulty revealed that the relationship between the two variables was significant, χ2(12, N = 256) = 38.40, p < .001. Pearson residuals analysis indicated that SEs were more likely than expected to perceive the search task as extremely difficult, while DEs were less likely to do so. In post-search interviews, some MLs also commented that these genomics topics were especially challenging because of the rapid development of the field and the difficulty of identifying the different names for a specific gene. These results allowed us to conclude that DEs had a significantly higher level of biomedical knowledge than SEs in this study. Overall, participants did not perform well in the Comprehension Test, and the results confirmed that level of education is a good indicator of domain knowledge.

5.3. Search training

Search training, measured by formal training in online searching courses, indicated that MLs had participated in the largest number of online searching courses (median = 8.5), followed by SEs (median = 1) (Fig. 4). Most DEs and SNs had no formal search training. As would be expected, MLs also had the most experience using MeSH terms among the four types of searchers; none of the SNs or DEs had used MeSH terms before they participated in the study (Table 4).


Fig. 4. Users' level of search training: boxplot of the number of online searching courses taken. A rectangular box is drawn from the lower quartile to the upper quartile, with the median dividing the box. Outliers are represented as dots. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

The results demonstrate that we successfully recruited different kinds of participants distinguished by level of domain knowledge and search training, although MLs' level of knowledge in genomics was not as high as expected. As revealed by academic training in the biomedical domain and the Comprehension Test, DEs demonstrated a significantly higher level of domain knowledge than SEs; SNs' and MLs' level of domain knowledge lay between those of DEs and SEs. MLs did not have as much academic training in Biology as we expected. Although we would have liked to treat MLs as both SEs and DEs, their scientific knowledge was much less than that of the DEs. Furthermore, the fact that most MLs work in the medical field rather than in Biology may be a limiting factor in their comprehension of genomics topics. We cannot eliminate the possibility that the MLs would have performed better had the topics been more strictly medical.

5.4. Search efficiency

The participants were very engaged with the assigned search tasks. A boxplot of time spent by searcher type and system version showed that the median time was predominantly above 500 s (Fig. 5). There was no significant difference in time spent between the MeSH+ and MeSH− versions (ANOVA, F(1, 254) = .15, p > .05), and the difference in time spent across searcher types was not statistically significant either (ANOVA, F(3, 252) = .31, p > .05). The amount of time may reflect at least two factors: (1) searchers found the topics difficult; (2) searchers were engaged with the search tasks in this study. We speculate that, because of the relatively technical nature of the search topics and their amount of search training, MLs remain persistent in searching even when they are given difficult topics.

5.5. Search outcome

We measured search outcome in terms of precision and recall measures for search effectiveness and time spent for search efficiency.
The overall comparison of the MeSH+ and MeSH− versions suggested that there was no statistically significant difference between the two versions of the experimental system, in terms of either the precision (ANOVA, F(1, 254) = .01, p > .05) or the recall (ANOVA, F(1, 254) = .30, p > .05) measure. The hypothesis that queries using MeSH obtain better results than queries not using MeSH is thus not supported. In the discussion, we will consider possible reasons for this result. Different types of searchers obtained comparable results when we compared all search results regardless of system version. Search effectiveness did not differ significantly across searcher types in terms of the precision (ANOVA, F(3, 252) = 1.86, p > .05) or recall (ANOVA, F(3, 252) = 1.66, p > .05) measures. All four types of searchers were only able to achieve mean precisions of approximately .30 to .40, and mean recalls between .15 and .23 (Table 5). This result showed that the search tasks were difficult for all searchers. But when we compared the search effectiveness of different types of searchers across system versions, we found a very strong combined effect of system version and searcher type on the precision measure (ANOVA, F(7, 248) = 3.48, p < .01) (Table 6). There were highly significant differences in precision between DEs and SEs when they used the MeSH+ version (Tukey's


Fig. 5. Boxplot of time spent by searcher type and system version. A rectangular box is drawn from the lower quartile to the upper quartile, with the median dividing the box. Outliers are represented as dots. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

Table 5
Search effectiveness by searcher type in terms of precision and recall measures.

Searcher type    Mean precision    N      Mean recall    N
SN               0.29              64     0.21           64
DE               0.40              64     0.15           64
SE               0.30              64     0.15           64
ML               0.35              64     0.23           64
Total            0.34              256    0.18           256

Note. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian. For each searcher type, there are 64 searches in total (8 searchers × 4 topics × 2 system versions = 64 searches).

HSD, p < .01) and between DEs' use of the MeSH+ version and SNs' use of the MeSH− version (Tukey's HSD, p < .01) (Fig. 6). The precision of SEs and MLs decreased when using MeSH+ rather than MeSH− (Table 6 and Fig. 6), but there were no significant differences between these groups (Tukey's HSD, p > .05). The significant differences in search performance between DEs and SEs using MeSH+ were consistent with user-perceived search task difficulty from the post-search questionnaires: SEs were much more likely than DEs to perceive the task as extremely difficult, owing to their different levels of domain knowledge. However, there was no statistically significant difference in the recall measure among searcher types or between system versions (Fig. 7). For each searcher type, there are 32 searches in total for each system version (8 searchers × 4 topics = 32 searches). Search training alone did not make a difference in terms of precision (F(1, 252) = .35, p > .05), but there were strong interaction effects between search training and system version (F(1, 252) = 17.37, p < .001). One possible explanation is that searchers with a low level of search training can search reasonably well, partly because the experimental system was equipped with state-of-the-art retrieval techniques, the search results were ranked by order of relevance, and DEs and SNs closely followed the search procedures introduced in this study. Interestingly, after a brief training session, DEs were able to search with MeSH terms even though they had not used MeSH terms before (Table 2), but this was not the case for SNs. The precision

Table 6
Search effectiveness by system version and searcher type in terms of precision and recall.

                 MeSH+                                    MeSH−
Searcher type    Mean precision    Mean recall    N      Mean precision    Mean recall    N
SN               0.36              0.21           32     0.23              0.20           32
DE               0.51              0.15           32     0.29              0.15           32
SE               0.21              0.16           32     0.38              0.13           32
ML               0.28              0.22           32     0.42              0.24           32
Total            0.34              0.19           128    0.33              0.18           128

Note. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

Fig. 6. Boxplot of search performance by searcher type and system version, in terms of the precision measure. A rectangular box is drawn from the lower quartile to the upper quartile, with the median dividing the box. Outliers are represented as dots. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

score of SNs was significantly lower when MeSH terms were not offered. Possible explanations of why SNs rarely used MeSH terms while DEs did are provided in the discussion section. In general, the searchers with the most domain knowledge (DEs) were capable of obtaining significantly better search results than SEs with the help of MeSH terms. SEs and MLs did not perform significantly better when MeSH terms were offered. Trained searchers (SEs and MLs) were able to achieve better search results than untrained searchers (SNs and DEs) when MeSH terms were not offered, although the increase in effectiveness was not statistically significant. But trained searchers did not substantially benefit from MeSH terms, probably because they did not have the level of domain knowledge needed to understand the genomics search topics. These results reveal the significance of domain knowledge for the usefulness of MeSH terms in IR systems. The Comprehension Test results, based on relevance judgment tasks, showed that the contrast between DEs' and SEs' levels of domain knowledge was significant and that MLs' level of domain knowledge was not as high as expected. Our findings revealed that domain knowledge is crucial for searching technical topics, and that domain experts using MeSH terms can significantly enhance the precision of their searches. We cannot eliminate the possibility that MLs would have performed better on more traditionally medical topics like the ones they normally encounter in their work, or if they had been able to use the standard hierarchical MeSH interface implemented in search systems such as Ovid and EBSCOhost. On the whole, domain experts benefit most from the use of MeSH terms. Later we will discuss why this might be and why we should not generalize from the result.
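The Tukey HSD comparisons reported in this section can be sketched as follows; the group scores below are hypothetical, and the studentized range statistic is evaluated directly against SciPy's studentized_range distribution:

```python
import numpy as np
from scipy.stats import studentized_range

# Hypothetical square-root-transformed precision scores for two
# searcher-by-system cells being compared (equal group sizes).
groups = [
    np.array([0.72, 0.68, 0.75, 0.64, 0.70, 0.66]),  # e.g., DE / MeSH+
    np.array([0.45, 0.50, 0.42, 0.48, 0.44, 0.47]),  # e.g., SE / MeSH+
]
k = len(groups)
n = len(groups[0])
df_error = k * (n - 1)

# Pooled within-group mean square error.
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_error

# Studentized range statistic for the largest pairwise difference.
means = [g.mean() for g in groups]
q = (max(means) - min(means)) / np.sqrt(mse / n)
p_value = studentized_range.sf(q, k, df_error)
```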


Fig. 7. Boxplot of search performance by searcher type and system version, in terms of the recall measure. A rectangular box is drawn from the lower quartile to the upper quartile, with the median dividing the box. Outliers are represented as dots. SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian.

Table 7
Summary of the relationship between user characteristics and the precision score (N users = 32; N questions = 20; N all searches = 256; statistical significance at 95%).

User characteristic                         Cut point (Mean)      Odds ratio    Log odds    Stand. error (+/−)    t-value    Stat. signif.
Gender                                      Male/Female           0.92          −0.09       0.26                  −0.34      No
Native language                             Native/Non-Native     1.27          0.24        0.26                  0.91       No
# of undergraduate Biology courses          4.94                  2.57          0.94        0.29                  3.21       Yes
# of graduate Biology courses               1.84                  1.82          0.60        0.29                  2.04       Yes
# of online searching courses               3.47                  1.06          0.06        0.31                  0.19       No
Experience of MeSH use                      0.84                  0.98          −0.02       0.27                  −0.09      No
Experience as information professional      0.97                  1.29          0.26        0.27                  0.95       No
Experience of database use                  2.84                  1.25          0.22        0.27                  0.83       No
Frequency of database use                   4.06                  0.96          −0.04       0.28                  −0.15      No
Age                                         3.59                  0.77          −0.26       0.26                  −0.97      No

5.6. User characteristics and search outcome

Previous research has suggested that the user characteristics of gender, age and language skills are important factors in searching (e.g., Chevalier, Dommes, & Marquié, 2015; Ford, Miller, & Moss, 2001; Lorigo, Pan, Hembrooke, Joachims, Granka, & Gay, 2006; Vanopstal, Stichele, Laureys, & Buysschaert, 2012). An examination of the relationship between user characteristics and search effectiveness using a logarithmic cross-ratio analysis showed that searchers' domain knowledge, measured by the number of undergraduate/graduate level Biology courses taken, was correlated with the precision measure (Table 7). Searchers who had taken more than five undergraduate level Biology courses were estimated to obtain a higher precision score by a factor of 2.57, or 157% more often than those with fewer courses. Searchers who had taken more than two graduate level Biology courses were more likely, by a factor of 1.82 or 82%, to obtain a higher precision score than those with fewer courses. Other user characteristics, however, were not correlated with the precision score. These results suggest that the searcher's formal education in Biology was significantly correlated with the precision of searches, probably because of the technical nature of the genomics search topics.
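A minimal sketch of the cross-ratio step, assuming hypothetical course counts and precision scores: each variable is dichotomized at its mean, and the log odds ratio is taken from the resulting 2 × 2 table (a 0.5 continuity correction guards against zero cells):

```python
import math

# Hypothetical per-searcher values: number of Biology courses taken
# and mean precision achieved.
courses = [2, 8, 1, 12, 3, 9, 0, 7]
precision = [0.21, 0.45, 0.18, 0.52, 0.30, 0.41, 0.15, 0.38]

# Dichotomize each variable at its mean.
c_mean = sum(courses) / len(courses)
p_mean = sum(precision) / len(precision)
high_c = [c > c_mean for c in courses]
high_p = [p > p_mean for p in precision]

# 2 x 2 table with a 0.5 continuity correction to avoid zero cells.
a = sum(hc and hp for hc, hp in zip(high_c, high_p)) + 0.5
b = sum(hc and not hp for hc, hp in zip(high_c, high_p)) + 0.5
c = sum(not hc and hp for hc, hp in zip(high_c, high_p)) + 0.5
d = sum(not hc and not hp for hc, hp in zip(high_c, high_p)) + 0.5

odds_ratio = (a * d) / (b * c)
log_odds = math.log(odds_ratio)
```

An odds ratio above 1 (log odds above 0) indicates that searchers above the mean on the characteristic are more likely to score above the mean on precision, which is how the Table 7 and Table 8 entries are read.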

Table 8
Summary of the relationship between user characteristics and the recall score (N users = 32; N questions = 20; N all searches = 256; statistical significance at 95%).

User characteristic                         Cut point (Mean)      Odds ratio    Log odds    Stand. error (+/−)    t-value    Stat. signif.
Gender                                      Male/Female           1.02          0.02        0.28                  0.07       No
Native language                             Native/Non-Native     0.76          −0.27       0.28                  −0.97      No
# of undergraduate Biology courses taken    4.94                  0.68          −0.39       0.34                  −1.16      No
# of graduate Biology courses               1.84                  0.53          −0.63       0.35                  −1.79      No
# of online searching courses               3.47                  2.21          0.79        0.32                  2.51       Yes
Experience of MeSH use                      0.91                  1.91          0.65        0.28                  2.30       Yes
Experience as information professional      0.97                  1.60          0.47        0.29                  1.65       No
Experience of database use                  2.84                  1.61          0.48        0.29                  1.63       No
Frequency of database use                   4.06                  1.06          0.06        0.29                  0.19       No
Age                                         3.59                  1.44          0.37        0.28                  1.31       No

The results for the relationship between user characteristics and the recall measure indicated that formal search training and experience of MeSH use were correlated with the recall score (Table 8). Searchers with more than four online searching courses were estimated to be about twice as likely to obtain a high recall score as searchers with fewer courses. Searchers with any experience using MeSH terms were 91% more likely to obtain a high recall score (above .18). We speculate that well-trained searchers obtained better recall scores because they knew the potential usefulness of MeSH terms and were able to use more unique terms in searches than other types of searchers, thus obtaining more comprehensive results. Overall, domain knowledge measured by level of education was correlated with the precision score, whereas search training measured by the amount of formal training and experience of MeSH use was correlated with the recall score.

6. Discussion

This study indicates that domain knowledge plays an important role in the effective use of MeSH terms, especially when the search topics are technical in nature. Specifically, MeSH terms are most useful, in terms of precision, for domain experts. Our results therefore partly contradict earlier research (e.g., Allen, 1991; Pao et al., 1993) suggesting that domain knowledge is not correlated with search performance. This may be because the earlier research used a small number of search topics and homogeneous groups of participants with relatively similar subject backgrounds. Our study has demonstrated that the use of a relatively large number of search topics in an interactive search environment is feasible through experimental design. The participants' considerable differences in level of domain knowledge made it possible to observe subtle differences in search performance.
One prominent finding from this study concerns the conditions under which a searcher's domain knowledge makes a difference in search performance. It suggests that searchers can benefit the most from the proper use of search tools, such as MeSH terms, when they have sufficient knowledge about the search topic. This is also supported by previous studies (e.g., Nielsen, 2004; Sihvonen & Vakkari, 2004) indicating that domain expert searchers can benefit more than search novices from the use of thesaurus tools in selecting potentially useful terms for expanding initial queries. We did not find evidence that these terms are useful for other kinds of users, though it is possible that MLs would have performed better with the standard hierarchical MeSH display and advanced search functions within the system, such as automatic term mapping and an explode function for retrieving documents indexed with more specific terms in addition to the specified MeSH terms. In fact, recent studies on the relationship between user-perceived topic familiarity, an indicator of level of domain knowledge, and search user interfaces proposed to support query formulation and reformulation (e.g., Mu et al., 2014; Tang et al., 2013; Zhang et al., 2015) have suggested a significant relationship between topic familiarity and querying behavior. The comparable performance in topic understanding between DEs and SNs, as measured by relevance judgment tasks, might be related to the factors of verbal ability and knowledge of domain-specific terminology, as suggested in previous studies (Bellardo, 1985; Dumais & Schmitt, 1991; Saracevic & Kantor, 1988; Vanopstal, Stichele, Laureys, & Buysschaert, 2012). The impact of verbal ability and domain knowledge on relevance judgment tasks and search performance therefore needs further research.
Our finding that formal search training and experience of MeSH use were correlated with the recall score provides further evidence of the importance of search training and of understanding the potential usefulness of MeSH terms. For example, users have been found to obtain better recall scores when searching a best-match IR system with thesaurus-based query enhancement features (Jones et al., 1995), and search novices do not know how to improve search results through query reformulation when they search technical topics. The results also demonstrate that the level of formal search training was correlated with search outcome. Searchers with formal online search training and experience of MeSH use were more likely to obtain a high recall score. Our results therefore contradict prior research (e.g., Fenichel, 1981; Howard, 1982; Pao et al., 1993) suggesting that search experience as a theoretical construct is not correlated with the recall score. One possible explanation is that earlier studies depended on users' subjective self-reports to measure search experience rather than objective criteria. Some researchers have reservations about the reliability of using subjective self-reporting of exposure to IR systems in the measurement of search experience (e.g., Dalrymple & Zweizig, 1992; Moore et al., 2007). This study primarily used the objective criterion of formal search training, supplemented by self-reports of the frequency and years of experience searching online databases. We found that the number of online searching courses was a good predictor of recall score, while the frequency and years of use (i.e., experience) were not. Experience of MeSH use was correlated with the recall score. We speculate that searchers with a high level of search training were able to identify more potentially useful terms during query reformulation, contributing to significantly better search performance.

Search experts did not find MeSH terms useful, and they did not do well in searches. They perceived the search task as very difficult, especially when they used the MeSH+ system. The results from the Comprehension Test revealed that the technical topics were especially challenging for search experts. This suggests that search training alone cannot compensate for a lack of domain knowledge in searches on technical topics such as those used in this study. Medical librarians had the most search training in our study, but their knowledge of Biology was not as high as we expected. The impact of search training was also reflected in the overall use of MeSH terms: the more search training one had, the more likely one was to use MeSH terms. Searchers with a low level of search training, especially search novices, could search reasonably well, in part because of the information retrieval system used in this study. With regard to the question of why SNs rarely used MeSH terms while DEs did, there are two possible explanations: (1) the brief MeSH training was more helpful for DEs because their understanding of technical terminology made it much easier for them to see why MeSH terms might be helpful.
(2) DEs are researchers, so even if they do not have much experience using MeSH, they are more likely to have research experience with other library databases that carries over to MeSH; DEs are thus also more experienced searchers than SNs. This is consistent with the research literature reporting that SNs have problems using controlled vocabularies (Larson, 1991). As suggested in Tenopir (1985), because of their lack of appreciation and understanding of MeSH terms, SNs could not appreciate their potential value, and may even have found them confusing. SEs and MLs did appreciate the value of MeSH terms because they had been trained to use controlled indexing languages. DEs understood the MeSH terms, which may simply have made their usefulness more obvious. Furthermore, DEs can be expected to be more experienced researchers than SNs, since they presumably have to do research in their graduate work, whereas SNs may never have written a research paper. Our research findings are consistent with the observation that DEs are able to make good use of controlled vocabularies to expand queries (Hersh et al., 1994; Nielsen, 2004; Shiri & Revie, 2006). We can conclude that people who lack either formal search training or domain knowledge do not appreciate or make use of index terms and thesauri.

Methodologically, this study has demonstrated the feasibility of re-using a test collection originally created for evaluating the effectiveness of retrieval techniques in an ad hoc search task for a controlled user experiment. We used a relatively large number of search topics in a user experiment through experimental design techniques and by ensuring the reliability of the relevance judgment sets. We were therefore able to detect subtle differences in the search effectiveness obtained by different kinds of users. This study was designed to assess the impact of MeSH terms on search effectiveness in an interactive search environment.
One limitation of the design is that participants were a self-selected group of searchers who may not be representative of the broader population. The use of search topics in genomics may have disadvantaged medical librarians, and our experts may have overestimated participants' knowledge when selecting appropriate search tasks. These factors may pose a threat to external validity, since the search tasks might not be representative of the search topics medical librarians receive in practice. Recruiting more experienced professional medical librarians and increasing the number of participants in each category would enhance the validity of the results.

7. Future research

An experimental design and methodology similar to ours can be used to assess the quality of automatically extracted phrases as displayed index terms in support of browsing or interactive query expansion tasks. For example, researchers have proposed several methods for automatically identifying index terms to support interactive information retrieval tasks (e.g., Kim, Yeganova, & Wilbur, 2016; Wacholder, Evans, & Klavans, 2001; Workman, Fiszman, Cairelli, Nahl, & Rindflesch, 2016). By using a restricted version of MeSH in our user experiment, we did not take full advantage of MeSH's affordances in this regard.

We used genomics search topics that were technical and difficult for most searchers in our experiment. We chose these difficult search topics because of the availability of the TREC relevance judgments, the changing role of medical librarians in genomics and translational medicine (Cleveland, Holmes, & Philbrick, 2012), and the development of MeSH-based literature mining systems in genomics (Xiang, Qin, Qin, & He, 2013). Nonetheless, we recommend that future research use other search topics, such as clinical topics, to assess the usefulness of MeSH terms in the settings of medical or hospital libraries (see e.g. Wittek, Liu, Darányi, Gedeon, & Lim, 2016).
The use of clinical topics might yield different results for medical librarians, who are more familiar with such topics. It would also enhance the external validity of this kind of study, since medical librarians usually receive clinical topics in their work settings. Our study provides empirical evidence that the effort to create MeSH terms is worthwhile for domain experts' searches on technical topics.

This study has focused on the impact of displayed controlled index terms on search effectiveness. For comparative purposes we used MeSH terms as a case study of the impact of controlled vocabularies and used a

866

Y.-H. Liu, N. Wacholder / Information Processing and Management 53 (2017) 851–870

Boolean-based retrieval system with ranking functions. We compared only one kind of controlled vocabulary within a single IR system because we were concerned about the difficulty of separating out the effects of users, systems and topics in an interactive search environment. Future research should replicate this study using different kinds of controlled vocabularies, search topics and search systems, such as Ovid and PubMed, before the results are generalized to other settings. Since query reformulation tasks involve forming mental representations of the search topic and translating them into search expressions, searchers' domain knowledge and verbal ability are crucial for the execution of these tasks (e.g., Smith, 2017; Wacholder, 2011). There is some evidence that searchers' verbal ability is associated with search performance in controlled user experiments (e.g., Bellardo, 1985; Dumais & Schmitt, 1991; Saracevic & Kantor, 1988). Verbal ability as an important user characteristic, and its role in the cognitive processing of query reformulations, therefore needs further research.

8. Conclusion

Our results provide experimental evidence of the usefulness of MeSH terms in an interactive search environment. Previous research has compared the retrieval performance of MeSH terms and automatic indexing techniques in laboratory settings without human searchers. Salton (1972, p. 81) claimed that "fully automatic text processing methods can be used to obtain retrieval output of an effectiveness substantially equivalent to that provided by conventional, manual indexing" (emphasis original). Our results support this general conclusion and further identify the significant relationship between the user characteristics of domain knowledge and search training and search performance in an interactive search environment.

Acknowledgement

This study was funded by NSF grant #0414557, PIs Michael Lesk and Nina Wacholder.
We would like to thank the participants, who generously shared their expertise; Nick Belkin, Paul Kantor, Chung-Chieh Shan and Gloria Leckie for constructive suggestions; and Lu Liu for technical assistance. This article is a substantially expanded version of the Liu and Wacholder (2008) conference paper.

Appendix A. Arrangement of experimental conditions by 4×4 Graeco–Latin square design

Participant  Type  Search topic IDs
 1  SN  38 12 29 50 42 46 32 15
 2  DE  12 38 50 29 46 42 15 32
 3  SE  29 50 12 38 32 15 42 46
 4  ML  50 29 38 12 15 42 46 32
 5  DE  38 12 27 45  9 36 30 20
 6  SN  12 45 38 27 36  9 20 30
 7  ML  27 38 45 12 30 20  9 36
 8  SE  45 27 12 38 20 30 36  9
 9  SE  29 50 27 45  2 43  1 49
10  ML  50 29 45 27 43  1 49  2
11  SN  27 29 45 50  1 49  2 43
12  DE  45 27 50 29 49  2 43  1
13  ML  42 46  9 36  2 43 33 23
14  SE  46 36 42  9 43  2 23 33
15  DE   9 42 36 46 33 23  2 43
16  SN  36  9 46 42 23 33 43  2

Note. Numbers 1–16 refer to participant ID. SN, DE, SE and ML refer to types of searchers: SN, Search Novice; DE, Domain Expert; SE, Search Expert; ML, Medical Librarian. Shaded and non-shaded blocks refer to the MeSH+ and MeSH− versions of the experimental system. Numbers in blocks refer to search topic ID numbers from the TREC Genomics Track 2004 document set data file (Hersh, 2004). Ten search topic pairs, randomly selected from a pool of twenty selected search topics, include (38, 12), (29, 50), (42, 46), (32, 15), (27, 45), (9, 36), (30, 20), (2, 43), (1, 49) and (33, 23). See Appendix B for a list of the twenty selected search topics.
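The balancing property that makes this design work can be checked mechanically. The sketch below is illustrative only: the symbol encoding is our assumption (Latin symbols 0–3 might stand for topic blocks, Greek symbols for system/order conditions), not the authors' actual assignment. A 4×4 Graeco–Latin square superimposes two orthogonal Latin squares so that every (Latin, Greek) combination occurs exactly once across the 16 cells:

```python
# Two orthogonal 4x4 Latin squares over symbols 0-3 (a known construction).
latin = [[0, 1, 2, 3],
         [1, 0, 3, 2],
         [2, 3, 0, 1],
         [3, 2, 1, 0]]

greek = [[0, 1, 2, 3],
         [2, 3, 0, 1],
         [3, 2, 1, 0],
         [1, 0, 3, 2]]

def is_latin(sq):
    """Every symbol appears exactly once in each row and each column."""
    n = len(sq)
    symbols = list(range(n))
    rows_ok = all(sorted(row) == symbols for row in sq)
    cols_ok = all(sorted(sq[r][c] for r in range(n)) == symbols
                  for c in range(n))
    return rows_ok and cols_ok

def is_graeco_latin(a, b):
    """Both squares are Latin and their superposition yields all n*n pairs."""
    n = len(a)
    pairs = {(a[r][c], b[r][c]) for r in range(n) for c in range(n)}
    return is_latin(a) and is_latin(b) and len(pairs) == n * n

print(is_graeco_latin(latin, greek))  # True
```

Orthogonality is what lets the experiment cross participant type, system version and topic block without confounding any pair of factors.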


Appendix B. Twenty search topics selected from the TREC Genomics Track 2004 document set data file (Hersh, 2004)

Topic 1. Title: Ferroportin-1 in humans. Need: Find articles about Ferroportin-1, an iron transporter, in humans. Context: Ferroportin1 (also known as SLC40A1; Ferroportin 1; FPN1; HFE4; IREG1; Iron regulated gene 1; Iron-regulated transporter 1; MTP1; SLC11A3; and Solute carrier family 11 (proton-coupled divalent metal ion transporters), member 3) may play a role in iron transport.

Topic 2. Title: Generating transgenic mice. Need: Find protocols for generating transgenic mice. Context: Determine protocols to generate transgenic mice having a single copy of the gene of interest at a specific location.

Topic 9. Title: mutY. Need: Find articles about the function of mutY in humans. Context: mutY is particularly challenging, because it is also known as hMYH. This is further complicated by the fact that myoglobin genes are also typically located in search results.

Topic 12. Title: Genes regulated by Smad4. Need: Find articles describing genes that are regulated by the signal transducing molecule Smad4. Context: Project is to characterize Smad4 knockout mouse in skin (specifically skin) to establish signaling network. Identify all Smad4 targets to compare gene expression patterns of the knockout mouse to the normal mouse.

Topic 15. Title: ATPase and apoptosis. Need: Find information on the role of ATPases in apoptosis. Context: The laboratory wants to know more about the role of ATPases in apoptosis.

Topic 20. Title: Substrate modification by ubiquitin. Need: Which biological processes are regulated by having constituent proteins modified by covalent attachment to ubiquitin or ubiquitin-like proteins? Context: Ubiquitin and ubiquitin-like proteins have important roles in controlling cell division, signal transduction, embryonic development, endocytic trafficking, and the immune response.

Topic 23. Title: Saccharomyces cerevisiae proteins involved in ubiquitin system. Need: Which Saccharomyces cerevisiae proteins are involved in the ubiquitin proteolytic pathway? Context: The researcher identified a protein in another yeast species and wants to compare it to the same one in Saccharomyces cerevisiae.

Topic 27. Title: Role of autophagy in apoptosis. Need: Experiments establishing positive or negative interconnection between autophagy and apoptosis. Context: New information about experiments and genes involved in autophagic cell death.

Topic 29. Title: Phenotypes of gyrA mutations. Need: Documents containing the sequences and phenotypes of E. coli gyrA mutations. Context: The laboratory has isolated some gyrA mutations in E. coli. They want to compare their mutant gyrA with the wild-type and other mutant sequences.

Topic 30. Title: Regulatory targets of the Nkx gene family members. Need: Documents identifying genes regulated by Nkx gene family members. Context: The laboratory needs markers to follow Nkx family-member expression and activity.

Topic 32. Title: Xenograft animal models of tumorogenesis. Need: Find reports that describe xenograft models of human cancers. Context: A xenograft animal model of cancer is one in which foreign tumor tissue is grafted into animals, usually rodents, providing a means to test various compounds for their ability to slow or halt tumor growth.

Topic 33. Title: Mice, mutant strains, and Histoplasmosis. Need: Identify research on mutant mouse strains and factors which increase susceptibility to infection by Histoplasma capsulatum. Context: The ultimate goal of this initial research study is to identify mouse genes that will influence the outcome of blood borne pathogen infections.

Topic 36. Title: RAB3A. Need: Background information on RAB3A. Context: Further information about a gene is needed after it is identified through a gene expression profile. The genes are related to synaptic plasticity in learning and memory.

Topic 38. Title: Risk factors for stroke. Need: Information concerning genetic loci that are associated with increased risk of stroke, such as apolipoprotein E4 or factor V mutations. Context: Candidate gene testing within a large Scottish case-control study of genetic risk factors for stroke. Future research includes investigations into other ethnically distinct populations.

Topic 42. Title: Genes altered by chromosome translocations. Need: What genes show altered behavior due to chromosomal rearrangements? Context: Information is required on the disruption of functions from genomic DNA rearrangements.

Topic 43. Title: Sleeping Beauty. Need: Studies of Sleeping Beauty transposons. Context: A relevant document is one that discusses studies on Sleeping Beauty. Interviewee's group studies a related element and wants to know what others are doing in a similar field.

Topic 45. Title: Mental Health Wellness-1. Need: What genetic loci, such as Mental Health Wellness 1 (MWH1), are implicated in mental health? Context: Want to identify genes involved in mental disorders.

Topic 46. Title: RSK2. Need: What human biological processes is RSK2 known to be involved in? Context: After being identified via microarrays, the biological processes the genes are involved in need to be discovered.

Topic 49. Title: Glyphosate tolerance gene sequence. Need: Find reports and glyphosate tolerance gene sequences in the literature. Context: A DNA sequence isolated in the laboratory is often sequenced only partially, until enough sequence is generated to identify the gene. In these situations, the rest of the sequence is inferred from matching clones in the public domain. When there is difficulty in the laboratory manipulating the DNA segment using sequence-dependent methods, the laboratory isolate must be re-examined.

Topic 50. Title: Low temperature protein expression in E. coli. Need: Find research on improving protein expressions at low temperature in Escherichia coli bacteria. Context: The researcher is not satisfied with the yield of expressing a protein in E. coli when grown at low temperature and is searching for a better solution. The researcher is willing to try a different organism and/or method.

References

Abdou, S., & Savoy, J. (2008). Searching in MEDLINE: Query expansion and manual indexing evaluation. Information Processing & Management, 44(2), 781–789. Allen, B. (1991). Topic knowledge and online catalog search formulation. Library Quarterly, 61(2), 188–213.


Anderson, J. D., & Perez-Carballo, J. (2001). The nature of indexing: How humans and machines analyze messages and texts for retrieval. Part I: Research, and the nature of human indexing. Information Processing & Management, 37(2), 231–254. Bellardo, T. (1985). An investigation of online searcher traits and their relationship to search outcome. Journal of the American Society for Information Science, 36(4), 241–250. Boyce, B., & Lockard, M. (1975). Automatic and manual indexing performance in a small file of medical literature. Bulletin of the Medical Library Association, 63(4), 378–385. Buckley, C. (1999). trec_eval IR evaluation package. Retrieved from ftp://ftp.cs.cornell.edu/pub/smart. Buckley, C., & Voorhees, E. M. (2005). Retrieval system evaluation. In E. M. Voorhees, & D. K. Harman (Eds.), TREC: Experiment and evaluation in information retrieval (pp. 53–75). Cambridge, MA: The MIT Press. Byrd, J., Charbonneau, G., Charbonneau, M., Courtney, A., Johnson, E., & Leonard, K. (2006). A white paper on the future of cataloging at Indiana University. Retrieved from http://www.iub.edu/∼libtserv/pub/Future_of_Cataloging_White_Paper.doc. Calhoun, K. (2006). The changing nature of the catalog and its integration with other discovery tools. Retrieved from http://www.loc.gov/catdir/calhoun-reportfinal.pdf. Carterette, B. (2015). The best published result is random: Sequential testing and its effect on reported effectiveness. In Proceedings of the ACM SIGIR conference (pp. 747–750). Chevalier, A., Dommes, A., & Marquié, J.-C. (2015). Strategy and accuracy during information search on the web: Effects of age and complexity of the search questions. Computers in Human Behavior, 53, 305–315. Cleveland, A. D., Holmes, K. L., & Philbrick, J. L. (2012). Genomics and translational medicine for information professionals: An innovative course to educate the next generation of librarians. Journal of the Medical Library Association, 100(4), 303–305. Cleverdon, C. W. (1967). 
The Cranfield tests on index language devices. Aslib Proceedings, 19(6), 173–193. Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: L. Erlbaum Associates. Cole, M. J., Gwizdka, J., Liu, C., Bierig, R., Belkin, N. J., & Zhang, X. M. (2011). Task and user effects on reading patterns in information search. Interacting with Computers, 23(4), 346–362. Dalrymple, P. W., & Zweizig, D. L. (1992). Users’ experience of information retrieval systems: An exploration of the relationship between search experience and affective measures. Library & Information Science Research, 14, 167–181. Dumais, S. T., & Schmitt, D. G. (1991). Iterative searching in an online database. In Proceedings of the human factors society 35th annual meeting (pp. 398–402). Santa Monica, CA: Human Factors Society. Fenichel, C. H. (1981). Online searching: Measures that discriminate among users with different types of experiences. Journal of the American Society for Information Science, 32(1), 23–32. Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd. Fleiss, J. L., Levin, B. A., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Hoboken, NJ: John Wiley. Ford, N., Miller, D., & Moss, N. (2001). The role of individual differences in internet searching: An empirical study. Journal of the American Society for Information Science and Technology, 52(12), 1049–1066. Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). Thousand Oaks, CA: Sage. Golub, K., Soergel, D., Buchanan, G., Tudhope, D., Lykke, M., & Hiom, D. (2016). A framework for evaluating automatic indexing or classification in the context of retrieval. Journal of the Association for Information Science and Technology, 67(1), 3–16. Gross, T., & Taylor, A. G. (2005). What have we got to lose? The effect of controlled vocabulary on keyword searching results. College & Research Libraries, 66(3), 212–230. Gross, T., Taylor, A. 
G., & Joudrey, D. N. (2015). Still a lot to lose: The role of controlled vocabulary in keyword searching. Cataloging & Classification Quarterly, 53(1), 1–39. Hembrooke, H. A., Granka, L. A., Gay, G. K., & Liddy, E. D. (2005). The effects of expertise and feedback on search term selection and subsequent learning. Journal of the American Society for Information Science and Technology, 56(8), 861. Hersh, W. (2004). TREC 2004 Genomics Track document set data file. Retrieved from http://ir.ohsu.edu/genomics/data/2004/. Hersh, W., Buckley, C., Leone, T. J., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. Proceedings of the ACM SIGIR Conference, 17, 192–201. Hersh, W. R., Bhuptiraju, R. T., Ross, L., Johnson, P., Cohen, A. M., & Kraemer, D. F. (2004). TREC 2004 genomics track overview. In E. M. Voorhees, & L. P. Buckland (Eds.). In Proceedings of the Text REtrieval Conference: Vol. 13. Gaithersburg, MD: NIST. Hersh, W. R., Bhupatiraju, R. T., Ross, L., Roberts, P., Cohen, A. M., & Kraemer, D. F. (2006). Enhancing access to the Bibliome: The TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1. doi:10.1186/1747-5333-1-3. Hjørland, B. (2016). Does the traditional thesaurus have a place in modern information retrieval. Knowledge Organization, 43(3), 145–159. Howard, H. (1982). Measures that discriminate among online searchers with different training and experience. Online Review, 6(4), 315–327. Hsieh-Yee, I. (1993). Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers. Journal of the American Society for Information Science, 44(3), 161–174. Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. Proceedings of the ACM SIGIR Conference, 16, 329–338. Jones, S., Gatford, M., Robertson, S., Hancock-Beaulieu, M., Secker, J., & Walker, S. (1995). Interactive thesaurus navigation: Intelligence rules OK. 
Journal of the American Society for Information Science, 46(1), 52–59. Keen, E. M. (1973). The Aberystwyth index languages test. Journal of Documentation, 29(1), 1–35. Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1/2), 1–224. Kim, S., Yeganova, L., & Wilbur, W. J. (2016). Meshable: Searching PubMed abstracts by utilizing MeSH and MeSH-derived topical terms. Bioinformatics, 32(19), 3044–3046. doi:10.1093/bioinformatics/btw331. Lagergren, E., & Over, P. (1998). Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment. Proceedings of the ACM SIGIR Conference, 21, 164–172. Larson, R. R. (1991). The decline of subject searching: Long-term trends and patterns of index use in an online catalog. Journal of the American Society for Information Science, 42(3), 197–215. Liu, Y.-H. (2010). A meta-analysis of the effects of search experience on search performance in terms of the recall measure in controlled IR user experiments. In F. Scholer, A. Trotman, & A. Turpin (Eds.), ADCS 2010: Proceedings of the fifteenth australasian document computing symposium (pp. 105–110). Melbourne, Australia: School of Computer Science and IT, RMIT University. Liu, Y.-H., & Wacholder, N. (2008). Do human-developed index terms help users? An experimental study of MeSH terms in biomedical searching. Proceedings of the ASIS&T Annual Meeting, 45, 1–16. Lorigo, L., Pan, B., Hembrooke, H., Joachims, T., Granka, L., & Gay, G. (2006). The influence of task and gender on search and evaluation behavior using google. Information Processing & Management, 42(4), 1123–1131. Lu, Z., Kim, W., & Wilbur, W. (2009). Evaluation of query expansion using MeSH in PubMed. Information Retrieval, 12(1), 69–80. Lykke, M., Price, S., & Delcambre, L. (2012). How doctors search: A study of query behaviour and the impact on search results. 
Information Processing & Management, 48(6), 1151–1170. Mann, T. (2006). The changing nature of the catalog and its integration with other discovery tools. final report. March 17, 2006. Prepared for the Library of Congress by Karen Calhoun: A critical review. Retrieved from http://www.guild2910.org/AFSCMECalhounReviewREV.pdf. Marchionini, G., & Dwiggins, S. (1990). Effects of search and subject expertise on information seeking in a hypertext environment. Proceedings of the ASIS Annual Meeting, 27, 129–142.


Matos, S., Arrais, J. P., Maia-Rodrigues, J., & Oliveira, J. L. (2010). Concept-based query expansion for retrieving gene related publications from MEDLINE. BMC Bioinformatics, 11(212). doi:10.1186/1471-2105-11-212. McKibbon, K. A., Haynes, R. B., Walker Dilks, C. J., Ramsden, M. F., Ryan, N. C., Baker, L., et al. (1990). How good are clinical MEDLINE searches? A comparative study of clinical end-user and librarian searches. Computers and Biomedical Research, 23(6), 583–593. Meadow, C. T., Marchionini, G., & Cherry, J. M. (1994). Speculations on the measurement and use of user characteristics in information retrieval experimentation. Canadian Journal of Information and Library Science, 19(4), 1–22. Meadow, C. T., Wang, J., & Yuan, W. (1995). A study of user performance and attitudes with information retrieval interfaces. Journal of the American Society for Information Science, 46(7), 490–505. Moore, J. L., Erdelez, S., & Wu, H. (2007). The search experience variable in information behavior research. Journal of the American Society for Information Science and Technology, 58(10), 1529–1546. Mu, X., Lu, K., & Ryu, H. (2014). Explicitly integrating MeSH thesaurus help into health information retrieval systems: An empirical user study. Information Processing & Management, 50(1), 24–40. doi:10.1016/j.ipm.2013.03.005. Nelson, S. J., Johnston, W. D., & Humphreys, B. L. (2001). Relationships in Medical Subject Headings (MeSH). In C. A. Bean, & R. Green (Eds.), Relationships in the organization of knowledge (pp. 171–184). Dordrecht, The Netherlands: Kluwer Academic Publishers. New Zealand Digital Library Project. (2006). Greenstone Digital Library Software (Version 2.70). Hamilton, New Zealand: Department of Computer Science, The University of Waikato. Retrieved from http://sourceforge.net/projects/greenstone/. Nielsen, M. L. (2004). Task-based evaluation of associative thesaurus in real-life environment. Proceedings of the ASIS&T Annual Meeting, 41, 437–447. doi:10.1002/meet.1450410151.
Palmquist, R. A., & Kim, K. S. (2000). Cognitive style and on-line database search experience as predictors of Web search performance. Journal of the American Society for Information Science, 51(6), 558–566. Pao, M. L., Grefsheim, S. F., Barclay, M. L., Woolliscroft, J. O., McQuillan, M., & Shipman, B. L. (1993). Factors affecting students' use of MEDLINE. Computers and Biomedical Research, 26(6), 541–555. Robertson, S. E. (1981). The methodology of information retrieval experiment. In K. Sparck Jones (Ed.), Information retrieval experiment (pp. 9–31). London: Butterworths. Robertson, S. E. (1990). On sample sizes for non-matched-pair IR experiments. Information Processing & Management, 26(6), 739–753. Rutherford, A. (2001). Introducing ANOVA and ANCOVA. Thousand Oaks, CA: SAGE. Salton, G. (1972). A new comparison between conventional indexing (MEDLARS) and automatic text processing (SMART). Journal of the American Society for Information Science, 23(2), 75–84. Saracevic, T., Kantor, P., Chamis, A. Y., & Trivison, D. (1988). A study of information seeking and retrieving. I. Background and methodology. Journal of the American Society for Information Science, 39(3), 161–176. Saracevic, T., & Kantor, P. (1988). A study of information seeking and retrieving. III. Searchers, searches, and overlap. Journal of the American Society for Information Science, 39(3), 197–216. Savoy, J. (2005). Bibliographic database access using free-text and controlled vocabulary: An evaluation. Information Processing & Management, 41(4), 873–890. Shiri, A., & Revie, C. (2006). Query expansion behavior within a thesaurus-enhanced search environment: A user-centered evaluation. Journal of the American Society for Information Science and Technology, 57(4), 462–478. Sihvonen, A., & Vakkari, P. (2004). Subject knowledge improves interactive query expansion assisted by a thesaurus. Journal of Documentation, 60(6), 673–690. Smith, C. L. (2017).
Investigating the role of semantic priming in query expression: A framework and two experiments. Journal of the Association for Information Science and Technology, 68(1), 168–181. doi:10.1002/asi.23611. Sparck Jones, K. (1981). Retrieval system tests 1958–1978. In K. Sparck Jones (Ed.), Information retrieval experiment (pp. 213–255). London: Butterworths. Sparck Jones, K. (2000). Further reflections on TREC. Information Processing & Management, 36(1), 37–85. Sparck Jones, K., & van Rijsbergen, C. J. (1976). Information retrieval test collections. Journal of Documentation, 32(1), 59–75. Stalberg, E., & Cronin, C. (2011). Assessing the cost and value of bibliographic control. Library Resources & Technical Services, 55(3), 124–137. Sutcliffe, A. G., Ennis, M., & Watkinson, S. J. (2000). Empirical studies of end-user information searching. Journal of the American Society for Information Science, 51(13), 1211–1231. Svenonius, E. (1986). Unanswered questions in the design of controlled vocabularies. Journal of the American Society for Information Science, 37(5), 331–340. Tang, M.-C., Liu, Y.-H., & Wu, W.-C. (2013). A study of the influence of task familiarity on user behaviors and performance with a MeSH term suggestion interface for PubMed bibliographic search. International Journal of Medical Informatics, 82(9), 832–843. doi:10.1016/j.ijmedinf.2013.04.005. Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing and Management, 28(4), 467–490. Tenopir, C. (1985). Full text database retrieval performance. Online Information Review, 9(2), 149–164. The Library of Congress Working Group on the Future of Bibliographic Control. (2008). On the record: Report of the library of congress working group on the future of bibliographic control. Retrieved from http://www.loc.gov/bibliographic-future/news/lcwg-ontherecord-jan08-final.pdf. The University of California Libraries Bibliographic Services Task Force.
(2005). Rethinking how we provide bibliographic services for the University of California. Retrieved from http://libraries.universityofcalifornia.edu/sopag/BSTF/Final.pdf. U.S. National Library of Medicine. (2003). MeSH Browser (2003 MeSH). Retrieved from http://www.nlm.nih.gov/mesh/2003/MBrowser.html. Vanopstal, K., Stichele, R. V., Laureys, G., & Buysschaert, J. (2012). PubMed searches by Dutch-speaking nursing students: The impact of language and system experience. Journal of the American Society for Information Science and Technology, 63(8), 1538–1552. Vakkari, P., Pennanen, M., & Serola, S. (2003). Changes of search terms and tactics while writing a research proposal: A longitudinal case study. Information Processing & Management, 39(3), 445–465. van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths. Voorbij, H. J. (1998). Title keywords and subject descriptors: A comparison of subject search entries of books in the humanities and social sciences. Journal of Documentation, 54(4), 466–476. Voorhees, E. M. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5), 697–716. Voorhees, E. M., & Buckley, C. (2002). The effect of topic set size on retrieval experiment error. Proceedings of the ACM SIGIR Conference, 25, 316–323. Voorhees, E. M., & Harman, D. K. (Eds.). (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, MA: The MIT Press. Wacholder, N. (2011). Interactive query formulation. Annual Review of Information Science and Technology, 45, 157–196. Wacholder, N., Evans, D. K., & Klavans, J. L. (2001). Automatic identification and organization of index terms for interactive browsing. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 1, 126–134. Wacholder, N., & Liu, L. (2006). User preference: A measure of query-term quality. Journal of the American Society for Information Science and Technology, 57(12), 1566–1580.
Wacholder, N., & Liu, L. (2008). Assessing term effectiveness in the interactive information access process. Information Processing & Management, 44(3), 1022–1031. Wacholder, N., Ravin, Y., & Choi, M. (1997). Disambiguation of proper names in text. In Proceedings of the fifth conference on applied natural language processing (pp. 202–208). doi:10.3115/974557.974587. Wildemuth, B. M. (2004). The effects of domain knowledge on search tactic formulation. Journal of the American Society for Information Science and Technology, 55(3), 246–258. Willett, P. (1987). A review of chemical structure retrieval systems. Journal of Chemometrics, 1(3), 139–155.


Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes: Compressing and indexing documents and images (2nd ed.). San Francisco: Morgan Kaufmann. Wittek, P., Liu, Y.-H., Darányi, S., Gedeon, T., & Lim, I. S. (2016). Risk and ambiguity in information seeking: Eye gaze patterns reveal contextual behavior in dealing with uncertainty. Frontiers in Psychology, 7, 1790. doi:10.3389/fpsyg.2016.01790. Workman, T. E., Fiszman, M., Cairelli, M. J., Nahl, D., & Rindflesch, T. C. (2016). Spark, an application based on Serendipitous Knowledge Discovery. Journal of Biomedical Informatics, 60, 23–37. doi:10.1016/j.jbi.2015.12.014. Xiang, Z., Qin, T., Qin, Z. S., & He, Y. (2013). A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks. BMC Systems Biology, 7(3), 1–15. Yoo, I., & Mosa, A. S. M. (2015). Analysis of PubMed user sessions using a full-day PubMed query log: A comparison of experienced and nonexperienced PubMed users. JMIR Med Inform, 3(3), e25. doi:10.2196/medinform.3740. Zhang, X., Liu, J., Cole, M., & Belkin, N. (2015). Predicting users' domain knowledge in information retrieval using multiple regression analysis of search behaviors. Journal of the Association for Information Science and Technology, 66(5), 980–1000. Zobel, J. (1998). How reliable are the results of large-scale information retrieval experiments? Proceedings of the ACM SIGIR Conference, 21, 307–314.