Information Processing & Management, Vol. 30, No. 2, pp. 223-238, 1994
Printed in Great Britain.
0306-4573/94 $6.00 + .00
Copyright © 1993 Pergamon Press Ltd.

AN EXPERT SYSTEM APPROACH TO ONLINE CATALOG SUBJECT SEARCHING

CHRISTOPHER S. G. KHOO and DANNY C. C. POO
Science Library and Department of Information Systems and Computer Science, National University of Singapore, Kent Ridge, Singapore 0511, Republic of Singapore

(Received 7 August 1992; accepted in final form 16 February 1993)
Abstract-The various ways of improving the online catalog for subject searching are reviewed. The paper then discusses the expert system approach to developing a subject search front-end. It is suggested that an effective expert front-end can be developed by focusing on search strategies. A design for a rule-based expert system front-end is described. Possible search strategies and selection rules are illustrated. The inference structure of the system is based on Clancey's model of heuristic classification.
1. INTRODUCTION
This paper proposes an expert system front-end as a way of improving subject access in online public access catalogs. A design for such a system is described. This is intended to provide a framework for future work and discussion on the topic.

The need to provide users with more help with subject searching is well documented in the literature. Recent studies of online catalog use found that subject searches formed a large proportion of online catalog use, and that users have difficulty doing subject searches. In a nationwide survey of U.S. libraries, Matthews et al. (1983, p. 144) found that about 59% of all searches in the online catalog are for subject information. They also found that difficulty in subject searching is the most important single factor in user satisfaction (Matthews et al., 1983, p. 170). Markey (1984, p. 77) reviewed the results of four studies of online catalog use, involving five academic libraries, a community college, and a public library. In the libraries surveyed, searches that involved the subject access points accounted for 34% to 55% of online catalog use. The total proportion of subject searches would actually be higher because the studies did not take into account subject searches carried out in the title field.

Online catalog users have various problems when performing subject searches. The Online Catalog Evaluation Projects sponsored by the Council on Library Resources found that users have the following problems with subject searching (Markey, 1984, p. 89):
• Users have problems matching their terms with those indexed in the online catalog.
• They have difficulty identifying terms broader or narrower than their topic of interest.
• They do not know how to increase the search results when too little or nothing is retrieved.
• They do not know how to reduce the search results when too much is retrieved.
• They lack understanding of the printed LCSH (Library of Congress Subject Headings) (e.g., abbreviations, subdivisions, etc.).
Hildreth (1987) pointed out that subject searching is not easy, and that "to optimize retrieval results in subject searching, more than one search approach may have to be employed in the overall search strategy. . . . Conventional information retrieval systems place the burden on the user to reformulate and re-enter searches until satisfactory results are obtained."

Mr. Khoo is currently a doctoral student at the School of Information Studies, Syracuse University, Syracuse, NY 13210, U.S.A. Requests for reprints should be sent to Dr. Danny C. C. Poo.

2. SOLUTIONS DESCRIBED IN THE LITERATURE
Some of the ways described in the literature for helping users with subject searching are:

1. designing more helpful interfaces,
2. providing non-Boolean "best match" search capability,
3. using automatic search sequencing,
4. incorporating natural language processing and knowledge-based processing capabilities, and
5. using a hypertext or an enhanced thesaurus system.
We briefly review these solutions.

2.1 Designing more helpful interfaces
Hildreth (1987) suggested that "online catalogs can include a message-response system that tells the user to try available search and display options that offer ways out of the current difficulty." Said Hildreth, such messages should tell the user what to do, how to do it, and why it may improve the results. When there are no or few retrievals, the system can suggest shortening the search phrase or word, substituting synonyms or more general terms for the initial search words, or retrying the search using a different search method. When too many records are retrieved, the system can ask the user to enter additional search words or enter limiting criteria to narrow the search.

Online prompts and help screens can provide only a limited amount of help. To carry out searches successfully, some search strategies require an understanding of the bibliographic record and the Library of Congress subject headings, as well as a knowledge of cataloging rules and online searching. Too much information about "what to do, how to do it, and why" will overwhelm the user and discourage him or her from using the catalog.

2.2 Non-Boolean "best match" search
Some writers have advocated a "best match" approach, where records containing some or all of the user's terms are retrieved and then displayed in ranked sequence, with the records that most closely match the user's query being shown first. Two well-known systems that use this approach are CITE (Doszkocs, 1983), a front-end to the National Library of Medicine's online catalog CATLINE, and Okapi (Walker, 1988), an experimental online catalog at the Polytechnic of Central London. CITE stems keywords from the user query and identifies word variants of these keywords. It then assigns weights to these terms based inversely on the document frequencies of these terms. The higher a search term's frequency, the lower its weight.
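This weighting-and-ranking idea can be sketched in a few lines of Python. This is a hedged illustration of the general inverse-document-frequency technique, not CITE's actual code; the logarithmic weight formula and all names are our own:

```python
import math

def idf_weights(terms, doc_freq, n_docs):
    """Assign each term a weight that falls as its document frequency
    rises (an inverse-document-frequency formula of our own choosing)."""
    return {t: math.log(n_docs / doc_freq.get(t, 1)) for t in terms}

def best_match_rank(records, terms, doc_freq, n_docs):
    """Score each record by summing the weights of the query terms it
    contains, then return record ids in descending score order."""
    w = idf_weights(terms, doc_freq, n_docs)
    scored = [(sum(w[t] for t in terms if t in rec_terms), rec_id)
              for rec_id, rec_terms in records.items()]
    return [rec_id for score, rec_id in sorted(scored, reverse=True)]
```

A record matching both of two query terms thus outranks a record matching only the more common term, which is the "best match" display order described above.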
(The user is allowed to override the automatically derived weights by ranking the terms.) The weights of the terms occurring in each retrieved record are summed and the records are ranked. The best match approach was found to give relatively good results. However, its performance can be improved by incorporating some relevance feedback routines. Indeed, both CITE and Okapi use relevance feedback to reformulate a search, as described in the next section.

2.3 Automatic search sequencing
Lynch (1987) advocated the use of heuristic procedures to increase the power of retrieval systems: "One approach to this goal of increased power and ease of use is to extend existing command-language-based systems into highly interactive, dialogue-oriented systems by incorporating heuristic algorithms that interpret user commands appropriately and guide the user in refining his or her search criteria. . . . Heuristic techniques are a key tool in creating systems to meet today's user expectations."
One way of employing heuristic procedures is to use information in records already retrieved to retrieve even more records automatically. Said Hildreth (1987), the bibliographic record "can also be a source of relevance feedback from the user to the system. Additional dialogue and automatic search routines can assist the searcher in tracking down related materials without requiring the user to continually reformulate precise, well-structured queries until satisfactory results are obtained."

At the University of Illinois at Urbana-Champaign, a microcomputer-based front-end to the online catalog (Cheng, 1985) executes a sequence of strategies until the user is satisfied or the strategies are exhausted. When an initial search does not find a subject heading, the front-end performs a sequence of modifications on the search string, including adding a truncation sign, displaying alphabetically similar headings, separating the words and searching using the Boolean "AND", and finally searching for the words in the title. If the title search retrieves a relevant record, the system extracts the first subject heading from the record and continues the subject search with that heading.

The CITE system (Doszkocs, 1983), mentioned previously, performs a frequency analysis of the subject headings that occur in the records that the user indicates to be relevant. It then displays a ranked list of the subject headings for the user's selection. Classification numbers from the "relevant" records are used together with the selected headings in a "best match" search. In Okapi Version 3 (Walker & de Vere, 1988), the "query expansion" facility takes the subject headings and classification numbers from records chosen by the user and assigns each a weight based on the number of chosen records in which they appear.
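The frequency analysis underlying this kind of query expansion might be sketched as follows. This is an illustrative Python sketch in the spirit of the facilities described above, not the actual code of CITE or Okapi; the record structure and names are assumptions:

```python
from collections import Counter

def expansion_candidates(chosen_records):
    """Rank subject headings by the number of user-chosen records they
    appear in; the top-ranked headings are candidates for expanding the
    query. Each record is a dict with a "headings" list (our assumption)."""
    counts = Counter()
    for record in chosen_records:
        # Count each heading at most once per record, even if repeated.
        counts.update(set(record["headings"]))
    return counts.most_common()
```

A heading occurring in two of the user's chosen records would thus rank above one occurring in only one record.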
All the above systems use a fixed strategy or a fixed sequence of strategies, and thus cannot be considered expert systems. An expert system that can select from a repertory of heuristics in response to different situations should be able to perform more effectively than fixed-strategy systems.

2.4 Natural language processing and knowledge-based processing
Paice (1986) suggested developing an expert system to help users select appropriate terms to search in the online catalog. Paice said that "the task of sorting out the best terms to represent a client's need is probably the biggest task of the human search intermediary." This task is approached (1) through careful questioning of the client, and (2) through references to appropriate thesauri, which provide some of the subject expertise that intermediaries themselves do not possess. Paice saw the thesaurus as a kind of semantic network that "contains facts about a vocabulary, that is, about the properties and interrelationships of terms in some domain."

ALEX-DOC, a natural language retrieval front-end package developed by the Paris firm ERLI (Etude et Recherche en Linguistique et Informatique), performs linguistic analysis to match the user query with terms in the thesaurus, and constructs a Boolean search statement (Hildreth, 1989, p. 80). It does this using a set of rules of linguistic analysis and transformation. The front-end can be used with a variety of retrieval systems. Status with IQ (Pape & Jones, 1988), a prototype system, incorporates a "natural language analyzer" to interpret user queries entered in natural language. It performs syntactic analysis on the user query and constructs phrases suitable for searching in the catalog.

The above interfaces use natural language processing and knowledge-based processing to translate the user query into an appropriate form for searching in the catalog.
Knowledge-based processing is not used in the tasks of selecting and executing an initial search strategy and reformulation strategies after relevance feedback. We shall show how knowledge-based processing can be used for these tasks.
2.5 Hypertext and enhanced thesaurus systems
In "best match" searching, automatic search sequencing, and searching using an expert system, the user plays a passive role in the search. Bates (1990) pleaded for developing systems that give the user an active role in directing the search process. To this end, she proposed a "front-end system mind" (Bates, 1986), a semantic network consisting of an end-user thesaurus enhanced by a network of associations. Associations suggested by Bates include relations taken from the classification system, terms that co-index a document, and title terms from documents indexed by a subject heading. Such an enhanced thesaurus increases the chances of the user's terms matching one or more terms in the thesaurus, and allows the user to explore a rich network of links and associations.

The idea of a hypertext-based library system was put forward by Hjerppe (1986) in his description of HYPERCATalog, a project of LIBLAB at Linkoping University, Sweden. It was envisaged that HYPERCATalog would support browsing and navigation as the primary modes of using the catalog. To support navigation, the system would have five different kinds of links: record to record links, inter-record field to field links, field to record links, record to field links, and fields linked together to form a record.

Designing a transparent user interface for these systems is a major problem. Not only does the system have to allow the user to navigate easily, but it also has to be able to display different views of the system and alert the user to different types of links. A hypertext system is ideal for exploratory searches. It would, however, be difficult to carry out a comprehensive search in such a system. Although such systems can have a rich network of associations, a human can only follow one link at a time in linear fashion. For comprehensive retrieval, a parallel search in multiple branches may have to be carried out automatically by the system.

3. THE EXPERT SYSTEM APPROACH
An expert system can be defined as a computer program that uses expert knowledge to attain high levels of performance in a narrow problem area. An expert system that embodies within it the knowledge and skills of a librarian or "search intermediary" for carrying out online searches in bibliographic or textual databases has been called an expert intermediary system or an expert retrieval assistance system.

The expert intermediary system has some salient characteristics when compared with other kinds of expert systems. Paice (1986) pointed out two differences:

1. The expert intermediary system is concerned with indirect access to information. Its expertise is centered on the techniques for retrieving references to documents, rather than actually deducing and providing facts.
2. The domain or subject coverage of the retrieval system is usually wider, and often very much wider, than for a typical expert system.

Instead of rules and facts, the knowledge base of an expert intermediary system would consist mainly of strategies for clarifying the search topic, strategies for searching the information retrieval system, and rules for selecting the strategies.

Some writers (Salton, 1986; Brooks, 1987) have expressed doubts that an expert system can be developed for information retrieval, because of the many heterogeneous tasks involved in information retrieval, and because information retrieval systems typically serve large heterogeneous user populations. Expert systems have so far succeeded only in narrow, structured problem areas. We feel, however, that an effective system for information retrieval can be built by focusing on search strategies rather than subject knowledge.

Several expert intermediary systems, with varying degrees of expertise, have been reported in the literature. All of these are for information retrieval (IR) systems that deal with abstracts or full texts of journal articles. The online catalog is, in some ways, a simpler form of IR system.
Some strategies used with IR systems can be used in a simpler form in online catalogs. Conversely, lessons learned from online catalog searching are relevant to IR systems.
There are, however, some important ways in which the online catalog is different from IR systems:

1. Users of the online catalog are more likely to be casual users than users of IR systems. Studies have revealed that most users seemed satisfied with a relatively small number of pertinent documents (see, for example, Akeroyd, 1990). The reason for this is uncertain, but it suggests that online catalog users are not likely to tolerate a long pre-search interview.
2. The online catalog does not contain abstracts. The records in an online catalog are less rich in information than records in an IR system. Sophisticated strategies that make use of the richer information in IR systems cannot be used in online catalogs. Long search statements are likely to fail in the online catalog. On the other hand, online catalog searches are less prone to "false drops," because a higher proportion of the words in an online catalog record are important content-bearing words.
3. The online catalog contains references to books, whereas an IR system contains references to journal articles. Books usually cover broader topics than do journal articles. Since subject headings in the online catalog are meant to summarize the content of the whole book rather than individual topics covered in the book, one may have to search for a broader subject than actually wanted. Searching for narrow topics in an online catalog is likely to result in a null retrieval.

Compared to other kinds of online catalog interfaces, an expert system front-end would have the following important characteristics:

1. Search heuristics are derived from human experts.
2. The expert system does not have just one heuristic, but a repertory of strategies.
3. It does not execute a fixed sequence of strategies, but has rules for selecting strategies in different situations.
4. The knowledge base of strategies is modular. Strategies can easily be added or removed.
5. It can monitor its own performance.
6. It can explain the search strategy used and why it selected the strategy.

4. A DESIGN FOR AN EXPERT SYSTEM FRONT-END
Figure 1 shows the main modules that we envisage an expert system front-end will have.

The Control Module controls the functioning of the system using a script that specifies the sequence of events that will take place during a search session. It activates the other modules when appropriate. The script has two parts: the initial search and the search reformulation.

The User Interface accepts the user's search query, formats various displays (e.g., thesaurus terms and search results), and obtains feedback from the user on the displays.

The Stemming Module removes suffixes from the words in the user's search query.

The Synonyms Database provides synonyms for frequently used terms. The file also matches keywords with LC "free-floating" subdivisions.

The Knowledge Base of Search Strategies contains procedures for implementing various search strategies, rules for selecting search strategies, and statistics on the performance of the strategies. Search strategies can be divided into initial search strategies and reformulation strategies.

The Fact Base stores all temporary information needed to select the next search strategy. Rules from the Knowledge Base are applied to information in the Fact Base to select a strategy. Several rules may have to trigger before a search strategy is finally selected. The intermediate results of applying the rules are also stored in the Fact Base. The Fact Base is reset whenever a search profile has been implemented. It then starts accumulating facts to help determine the next strategy to use.

The Search History contains a temporary record of the search profiles that have been implemented during the search session and the search results obtained by each profile.
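The interplay between the Fact Base and the selection rules in the Knowledge Base might be sketched as a simple forward-chaining loop. This is a toy Python illustration of the mechanism described above, not the authors' implementation; the (condition, action) rule format and all names are our own inventions:

```python
def select_strategy(fact_base, rules):
    """Repeatedly fire every rule whose condition holds against the Fact
    Base, letting rules assert intermediate facts, until some rule names
    a strategy (or no rule adds anything new)."""
    while True:
        fired = False
        for condition, action in rules:
            if condition(fact_base):
                new_facts = action(fact_base)
                if not new_facts.items() <= fact_base.items():
                    fact_base.update(new_facts)
                    fired = True
        if "strategy" in fact_base or not fired:
            return fact_base.get("strategy")

# Two illustrative rules: a null result is first classified as a problem,
# and a second rule maps that problem to a reformulation strategy.
example_rules = [
    (lambda f: f.get("retrieved") == 0,
     lambda f: {"problem": "null result"}),
    (lambda f: f.get("problem") == "null result",
     lambda f: {"strategy": "broaden search"}),
]
```

Note how the first rule's conclusion is an intermediate fact that the second rule consumes, mirroring the statement that several rules may have to trigger before a strategy is finally selected.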
Fig. 1. Expert system architecture.
("Search profiles" refer to search strategies that have been implemented and stored in the internal representation format used by the expert system.) The Search History is deleted at the end of the search session.

The Session Log, which resides on a secondary storage device, keeps a permanent record of all the information stored in the Search History, as well as any other information useful for analyzing the performance of the expert system.

The Explanation Facility provides the capability for explaining how a search strategy was selected and on what criteria it was selected.

The Spelling Correction Module provides the facility for detecting and correcting erroneous words in a search query.

The System Interface communicates with the online catalog. It translates the search profile from the expert system's internal representation to the search language of the online catalog. It also interprets responses from the online catalog.

In our design of the front-end, we shall assume that the online catalog uses Boolean retrieval operations, rather than "best match" searching or probabilistic retrieval. We agree with Marcus (1986) that enhanced (or "smart" or "intelligent") ways of using Boolean retrieval operations will prove superior, and that

statistical methods will be best as supportive techniques, rather than basic search modes in expert assistant systems. . . . the human input is best induced when the human can easily perceive the nature of the matching algorithm; a term is or is not present in the Boolean scheme as opposed to the less obvious statistical algorithms where strategy modification may be effectuated by adjusting some numerical parameter(s) without direct and obvious correlation in terms of problem or document description.
With Boolean searching, it is easier to evaluate the performance of the system and to understand why certain strategies failed in certain situations.

In the next sections, we shall describe the Initial Search and the Reformulation Search scripts to be used by the Control Module. We shall also look in greater detail at the functions of the modules.

4.1 Initial search
The Initial Search has the following main steps:

1. accept search query from user,
2. search for synonyms,
3. stem query words,
4. match query words with controlled vocabulary, and
5. construct and implement the initial search profile.
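The five steps above can be wired together as a simple pipeline. This is an illustrative Python sketch only: each step is passed in as a function because the text describes the steps abstractly, and the stopword list is a placeholder of our own:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for"}  # placeholder list

def initial_search(query, synonyms, stem, match_headings, build_profile):
    """Run the five Initial Search steps in order; each step's concrete
    behaviour is supplied by the caller, since the text defines the steps
    but leaves their implementations to later sections."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]  # 1
    words += synonyms(words)                                          # 2
    stems = [stem(w) for w in words]                                  # 3
    headings = match_headings(stems)                                  # 4
    return build_profile(headings, stems)                             # 5
```

The individual steps are elaborated in the paragraphs that follow.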
Accept search query. The system can either prompt the user to enter a search query one concept at a time or simply ask the user to state a request in natural language. Marcus (1986) found that users have problems breaking down their search requirements into component concepts. Some disadvantages of requiring a user to enter one concept at a time are:

1. it places a burden on the user to break down his or her information need into separate concepts;
2. it is not always certain whether a phrase should be considered as one concept or broken up into two concepts; and
3. some users will misunderstand what the system requires of them. For example, they may enter synonyms as different concepts. It will then be difficult for the system to decide whether the Boolean operator "AND" or "OR" should be used to link the concepts.

Accepting natural language input brings its own problems. Ideally, the system should parse the search query and perform a linguistic analysis. Most systems that accept natural language input simply ignore this problem, and either use a "best match" approach or assume that the Boolean "AND" is implied in the search input. Since search queries entered in an online catalog tend to be short, there is reason to believe that assuming a Boolean "AND" will work most of the time. The Okapi project found that 85% of search statements entered consist of three terms or fewer (Walker & Jones, 1987). It is also possible that errors from assuming an implicit "AND" can be corrected later during relevance feedback and search reformulation. For our expert system, we propose to follow the procedure outlined by Marcus (1986) in accepting natural language input, removing stopwords, and assuming an implicit "AND" among the remaining query words.

Search for synonyms. The Synonyms Database provides additional keywords to supplement the query words. Additional keywords are provided not only for commonly used words, but for commonly used phrases as well.
The query words, stripped of stopwords, are retained in their original order for searching in the Synonyms Database. The Synonyms Database is searched for an entry made up of the first few query words. If there is more than one match, the longest entry is accepted. The process is iterated, each time dropping the first of the remaining query words. All the keywords retrieved from the Synonyms Database are added to the query words. The Synonyms Database can be built by entering keywords from the Library of Congress free-floating subdivisions. Word frequency analysis can also be carried out on user queries to identify commonly used words and phrases.

Stem query words. A number of stemming algorithms have been described in the literature. Lennon et al. (1981) tested seven of these algorithms experimentally, and found relatively little difference in their performance. They found that simple, fully automated methods, like Porter's (1980), perform as well as procedures that involve a large degree of manual involvement in their development.
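The iterative longest-match lookup in the Synonyms Database described above might be sketched as follows. This is a Python illustration of the procedure as we read it; the database entries in the usage example are invented:

```python
def expand_with_synonyms(query_words, synonyms_db):
    """Greedy longest-match lookup: at each iteration, find the longest
    database entry that is a prefix of the remaining query words, collect
    its keywords, then drop the first remaining word and repeat.
    synonyms_db maps a phrase to a list of extra keywords."""
    extra = []
    words = list(query_words)
    while words:
        for length in range(len(words), 0, -1):
            phrase = " ".join(words[:length])
            if phrase in synonyms_db:
                extra.extend(synonyms_db[phrase])
                break  # the longest matching entry is accepted
        words.pop(0)  # drop the first of the remaining query words
    return list(query_words) + extra
```

For example, with an invented entry mapping "heart attack" to medical keywords, the phrase match is preferred over matching "heart" and "attack" separately.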
The Okapi project found that even Porter's algorithm could cause serious loss of precision in an online catalog environment (Walker & Jones, 1987, p. 30). Walker and Jones explained that "when automatic stemming is used on searches consisting of only one or two words, even the most conservative procedure often gives false drops which would have been ANDed out in a longer search." The National Library of Medicine's CITE system circumvents this problem by displaying the list of matching terms for the user to select (Ulmschneider & Doszkocs, 1983). CITE first stems a query word and then retrieves all words in the database that begin with the stem. Not all the retrieved words will be correct, however. For example, the stem "CAT" retrieves "CATALYZE" and "CATHETER," among other things, in addition to "CATS." The system thus examines each retrieved term to see if it can be reduced to the original stem after suffix stripping. "CATALYZE," for instance, cannot be stemmed to "CAT," and is thus rejected. A list of the words that succeed is displayed for the user to select. In our system, we propose to use Porter's stemming algorithm, complemented by the procedure used by the CITE system described above. We shall hereafter refer to the words selected by the user as the "user words."

Match user words with controlled vocabulary. Aragon-Ramirez and Paice (1985) described a design for an interface that helped the user define a search topic using a thesaurus. The interface led the user through a fairly thorough "topic elucidation" process. Online catalog users may not be willing to go through such a detailed pre-search interview. We propose a simpler way of matching the user words with the Library of Congress (LC) headings.

The particular matching method used depends on the capabilities of the online catalog. Ideally, an online catalog should have a searchable thesaurus with "broader term" and "narrower term" references. Hildreth (1989, p. 69) reported that there was no OPAC that provided online access to the LCSH. Some systems have an authority file of LC headings with "see" references. Other systems do not have "see" references, but can display a list of heading-subdivision combinations used in the catalog. Some systems allow keyword searching in the subject headings, whereas others allow the searching only of subject heading phrases, with or without right truncation. We shall assume in our matching procedure that the online catalog has either an online thesaurus or an index file of assigned LC headings with keyword access.

All LC headings and "see" references containing at least one of the user words are retrieved. The headings and "see" references composed wholly of user words are automatically selected. The remaining headings partially contain user words. They are weighted according to the number of user words they contain, and are displayed in descending weight order for the user to select. If two or more headings have the same weight, they are displayed in reverse order of the number of words they contain. If a selected heading is a "see" reference, it is replaced by the authorized heading. For each selected heading, all narrower headings and related terms are retrieved and displayed for the user to select. This is iterated until all narrower headings have been scanned by the user.

In systems that can display heading-subdivision combinations, heading-subdivision combinations are displayed only if both the heading and the subdivision contain user words. Otherwise there may be too many heading-subdivisions for the user to scan. For each narrower heading selected, the subdivisions of the narrower heading are scanned for user words. The subdivisions are automatically selected if they are composed wholly of user words. Otherwise they are displayed for the user to select.

The system uses a Headings Table to store and process the retrieved LC headings. It consists of seven columns of information:

1. Heading: This records the LC headings and "see" references retrieved.
2. Type: This indicates whether the heading is an LC heading or a "see" reference.
3. Weight: This indicates the number of user words contained in (or represented by) the heading.
4. User words represented: This indicates which user words are found in (or represented by) the heading.
5. User words NOT represented: This indicates which user words are not represented.
6. Selected: This indicates whether the heading is selected by the user.
7. Expanded: This indicates whether narrower and related terms of the heading have been displayed.
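The Headings Table and the display ordering of partially matching headings might be sketched as below. The class is our own illustrative construction (its fields follow the seven columns above), and we interpret "reverse order to the number of words they contain" as fewer-word headings first, which is an assumption:

```python
from dataclasses import dataclass

@dataclass
class HeadingRow:
    """One row of the Headings Table; weight and the represented user
    words are derived from the heading text rather than stored."""
    heading: str
    is_see_reference: bool = False   # the "Type" column
    selected: bool = False
    expanded: bool = False

    def user_words_represented(self, user_words):
        words = set(self.heading.lower().split())
        return {w for w in user_words if w in words}

    def weight(self, user_words):
        return len(self.user_words_represented(user_words))

def display_order(rows, user_words):
    """Descending weight; among equal weights, headings with fewer words
    first (our reading of the tie-breaking rule)."""
    return sorted(rows, key=lambda r: (-r.weight(user_words),
                                       len(r.heading.split())))
```

A two-word heading matching both user words thus displays before a one-word heading matching one, which in turn displays before a three-word heading matching one.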
Construct and implement search profile. The Control Module then selects an Initial Search Strategy to construct the search profile of the initial search. Three possible initial search strategies are outlined here as examples.

One strategy is to prefer pre-coordinated headings to post-coordinating the headings. The LC heading-subdivision in the Headings Table representing the highest number of user words is selected as the first term in the search profile. If two or more heading-subdivisions have the highest number of words, then each becomes the first term of a separate sub-profile. The sub-profiles are linked together with the Boolean "OR" to form the complete search profile. After selecting the first term of the profile, the system constructs a sub-profile to cover the user words not represented by the first term. Constructing a sub-profile involves selecting a first term covering the highest number of user words, and then constructing a sub-sub-profile to cover the remaining user words. The sub-sub-profile is ANDed to the first term to form the sub-profile. This is repeated recursively until no more user words remain. If no LC heading-subdivision can be found to cover the remaining user words, the sub-profile is constructed by ANDing the user words together as keywords to be searched in all the fields.

One variation of the above procedure is to include in the search profile all the LC headings selected by the user. Instead of selecting the heading-subdivision covering the highest number of user words, every heading-subdivision in the Headings Table covering one or more of the user words is selected. Each becomes the first term of a sub-profile. The sub-profiles created are merged using the Boolean "OR." This strategy should, however, include a procedure for identifying and eliminating redundant sub-profiles.

A third strategy is not to give preference to pre-coordinated heading-subdivisions.
For each user word, all the heading-subdivisions that contain the word are collected and linked with the Boolean "OR." The groups of heading-subdivisions, each representing a user word, are "ANDed" together. An algorithm goes through this search profile and removes redundant headings. After the search profile has been constructed, it is sent to the System Interface for translation into the search language of the online catalog.

4.2 Search reformulation
Search reformulation has the following steps:

1. Display a sample of the retrieved titles and obtain relevance feedback.
2. Display retrieved records.
3. Record the performance statistics for the strategy used.
4. Select a strategy to reformulate the search.
The cycle is repeated until a satisfactory search result is obtained.

Display a sample of the retrieved titles and obtain relevance feedback. A sample of the retrieved titles is displayed for the user to indicate which titles are relevant. The sample should be big enough for the system to carry out word frequency analyses on the relevant and irrelevant titles, yet small enough not to overwhelm the user. Twenty titles is probably a good sample size. The system has to keep a record of the sample titles displayed and the user's relevance judgments, so that if the same records appear in future sets, the titles need not be displayed again for the user to judge their relevance.

Display retrieved records. Users are then asked whether they wish to refine the search or to look through the references first. If the user chooses the latter, he or she can still decide to refine the search later.
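The bookkeeping of relevance judgments described above might be sketched as follows. This is an illustrative Python sketch of our own; the class and method names are not from the paper:

```python
class RelevanceTracker:
    """Remember the user's relevance judgments so that records already
    judged are not displayed again in later samples."""

    def __init__(self):
        self.judged = {}  # record id -> True (relevant) / False (not)

    def sample_to_display(self, retrieved_ids, size=20):
        # Show only records the user has not yet judged; 20 is the
        # sample size suggested in the text.
        return [r for r in retrieved_ids if r not in self.judged][:size]

    def record_judgments(self, judgments):
        self.judged.update(judgments)
```

After a judgment is recorded, the same record id is silently skipped in any future sample.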
This approach of giving the user the choice of displaying intermediate answer sets is better than the alternative of displaying only the final search result, because some relevant titles in earlier answer sets may be lost in later sets. The answer set scanned by the user can be "NOTed" out from future answer sets.

Record the performance statistics for the strategy used. Statistics on the performance of the strategy used are stored for later analysis by the system developer. The following statistics can be collected:
1. the selection rules triggered that resulted in the strategy being selected,
2. retrieval size of the search result BEFORE the strategy was executed,
3. retrieval size of the search result AFTER the strategy was executed,
4. precision of the sample search result BEFORE the strategy was executed, and
5. precision of the sample search result AFTER the strategy was executed.
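The per-strategy performance record listed above can be modelled as a simple data structure; the field names are illustrative, and the "before" fields are empty for an initial search, as the text notes.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StrategyStats:
    rules_triggered: List[str]          # selection rules that chose the strategy
    size_before: Optional[int]          # retrieval size BEFORE execution (None for initial search)
    size_after: int                     # retrieval size AFTER execution
    precision_before: Optional[float]   # sample precision BEFORE execution (None for initial search)
    precision_after: float              # sample precision AFTER execution

    def precision_change(self):
        """Improvement in precision, where a 'before' value exists."""
        if self.precision_before is None:
            return None
        return self.precision_after - self.precision_before

# A narrowing strategy cut retrieval from 160 to 45 and doubled precision.
s = StrategyStats(["narrow-if-low-precision"], 160, 45, 0.30, 0.60)
```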
If the search is an initial search, there is obviously no previous search result, and items 2 and 4 will have no value. From the statistics collected, the alternative search strategies can be compared by plotting their precision versus retrieval size curves. If the search is a reformulation search, the strategy used can be evaluated in terms of the improvement in retrieval size and precision of the search result. For example, if the purpose of the reformulation strategy was to narrow the search, then the strategy could be judged effective if the precision increased substantially. If the purpose of the reformulation strategy was to broaden the search, then the strategy was effective if the retrieval size increased substantially without a substantial loss in precision. The statistics described above allow the system developer to calculate the change in retrieval size and precision of the search output as a result of implementing the strategy. A record of the selection rules triggered that resulted in a strategy being selected is useful for analyzing the performance of the strategy in different situations. Furthermore, when two strategies are associated with identical selection rules, the system developer can compare the relative effectiveness of the two strategies under the same conditions.

Select a strategy to reformulate the search. Our design for selecting a search strategy is modelled after Clancey's (1985) model of heuristic classification. Clancey divided the problem-solving methods used by expert systems into two basic types: heuristic classification and heuristic construction. Taken as a whole, carrying out a literature search requires a heuristic construction method. A search is "constructed" or synthesized through several cycles of search and feedback from the user, using a variety of search strategies.
However, if each search cycle is handled in isolation, the method of heuristic classification can be used to select an appropriate search strategy, because the number of search strategies is small and can be enumerated. The inference structure of the heuristic classification method takes the following form: data abstraction rules are applied to raw data to produce data abstractions; heuristic matching rules match the data abstractions to solution abstractions or classes of solutions; then refinement rules select one or more solutions from a class of solutions.

Figure 2 gives an example of a possible chain of inference in our expert system. Beginning with the raw data that the retrieval size of the search is 160 and the precision is 30%, the system applies the rules in the Knowledge Base to select an appropriate reformulation strategy. Applying the data abstraction rules,

1. IF the retrieval size is 101-200, THEN the retrieval level is 4 (high).
2. IF the precision is >20% and ≤40%, THEN the precision level is 2 (low).

the system infers that the retrieval size is big and the precision is low. Applying the heuristic matching rule,

IF the precision level is 2 or 3 AND the retrieval level is >2, THEN use a narrowing strategy.
[Fig. 2. A possible chain of inference rules: raw data → "use a narrowing strategy" → "use the strategy 'Use terms that have high frequencies in the relevant records'".]
It concludes that the appropriate reformulation strategy to use is a narrowing strategy. It then selects a particular narrowing strategy by applying the following refinement rule:

IF a narrowing strategy is needed, THEN select the strategy "Use terms that have high frequencies in the relevant records" WITH WEIGHT 0.8.

Each rule can assign a weight or certainty factor to its conclusion. In our design, only the refinement rules assign weights to the strategies they recommend. The set of data abstraction rules, heuristic matching rules, and refinement rules is stored in the knowledge base. To illustrate, the following is a set of data abstraction rules for the precision level:

IF the precision is ≤20%, THEN the precision level is 1.
IF the precision is >20% and ≤40%, THEN the precision level is 2.
IF the precision is >40% and ≤60%, THEN the precision level is 3.
IF the precision is >60% and ≤80%, THEN the precision level is 4.
IF the precision is >80%, THEN the precision level is 5.
The data abstraction rules use a six-point scale:

0  invalid or not appropriate
1  very low
2  low
3  average
4  high
5  very high
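The three rule types in the chain of inference above can be sketched as plain functions. The precision thresholds follow the data abstraction rules given in the text; the retrieval-size bands other than 101-200 → level 4, and the level-0 case, are illustrative assumptions, not from the paper.

```python
def retrieval_level(size):
    """Data abstraction: map a raw retrieval size onto the six-point scale."""
    if size == 0:
        return 0                       # assumed: empty result -> level 0
    for level, upper in ((1, 20), (2, 50), (3, 100), (4, 200)):
        if size <= upper:
            return level               # only the 101-200 -> 4 band is from the text
    return 5

def precision_level(p):
    """Data abstraction rules for precision, as given in the text."""
    if p <= 0.20: return 1
    if p <= 0.40: return 2
    if p <= 0.60: return 3
    if p <= 0.80: return 4
    return 5

def match(retr_lvl, prec_lvl):
    """Heuristic matching: data abstractions -> a class of strategies."""
    if prec_lvl in (2, 3) and retr_lvl > 2:
        return "narrowing"
    return None

def refine(strategy_class):
    """Refinement: pick a concrete strategy, with a weight."""
    if strategy_class == "narrowing":
        return ("Use terms that have high frequencies in the relevant records", 0.8)
    return None
```

Chaining `refine(match(retrieval_level(160), precision_level(0.30)))` reproduces the inference in Fig. 2.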
Some examples of heuristic matching rules are given below:

IF the retrieval size is 0, THEN select the strategy "Replace the concept group that has 0 postings" WITH WEIGHT 0.8.
IF the retrieval size is 0 AND there is more than one concept group, THEN select the strategy "Leave out the least important concept group" WITH WEIGHT 0.6.
IF the retrieval size is 0, THEN backtrack WITH WEIGHT 0.4.
IF the retrieval level is 5, THEN use a narrowing strategy.
IF the precision is 0, THEN backtrack WITH WEIGHT 0.8.
IF the precision level is <4, THEN use a narrowing strategy.
IF the precision level is 4 AND the retrieval level is <4, THEN use a broadening strategy.
IF the precision level is 5 AND the retrieval level is <5, THEN use a broadening strategy.
IF the precision change level is 1, THEN backtrack WITH WEIGHT 0.8.
IF the precision change level is 3 AND the retrieval change level is 3, THEN backtrack WITH WEIGHT 0.8.

Some examples of refinement rules are:

IF a broadening strategy is needed AND the search profile has more than one concept group, THEN an appropriate strategy is "Leave out the least important concept group from the search."
IF an appropriate strategy is "Leave out the least important concept group from the search" AND there is a concept group consisting only of keywords, THEN select the strategy "Leave out from the search the concept group consisting only of keywords" WITH WEIGHT 0.8.
IF an appropriate strategy is "Leave out the least important concept group from the search", THEN select the strategy "Leave out from the search the concept group with the highest posting" WITH WEIGHT 0.6.
IF an appropriate strategy is "Leave out the least important concept group from the search", THEN select the strategy "Ask the user to rank the concept groups and drop the least important from the search" WITH WEIGHT 0.6.
IF a broadening strategy is needed, THEN select the strategy "Use the class number to find similar books" WITH WEIGHT 0.6.
IF a broadening strategy is needed, THEN select the strategy "Convert the Boolean operator AND to OR" WITH WEIGHT 0.4.
IF a narrowing strategy is needed, THEN select the strategy "Leave out synonyms that give poor results" WITH WEIGHT 0.8.
IF a narrowing strategy is needed, THEN select the strategy "Use terms that have high frequencies in the relevant records" WITH WEIGHT 0.8.

Besides rules, the knowledge base also contains compiled procedures. The procedures are of two kinds:

1. Search strategies in the form of compiled procedures for constructing an initial search profile or for reformulating the search profile. A reformulation strategy performs a specific kind of analysis on the sample records retrieved and modifies the previous search formulation. The strategies can be broadly characterized as broadening strategies or narrowing strategies. Some of these are described in the Appendix.
2. Procedures for obtaining specific information needed to determine whether a premise of a particular rule succeeds. As an example, suppose we have the following rule: "IF a broadening strategy is needed AND the search profile contains a keyword (i.e., a non-LC heading), THEN select the strategy 'Drop a keyword from the search profile' WITH WEIGHT 0.8." This rule requires the system to have a procedure for checking whether the search profile contains keywords.

Having described the rules and the procedures, we now consider the design of the control structure for applying the rules. The rules can be applied in a forward manner (called forward chaining): the rules are applied to the information stored in the fact base, the inferences are stored in the fact base, and the rules are applied again, the cycle repeating until a search strategy is selected. The rules can also be applied using backward chaining. Each search strategy is considered in turn as a "candidate" strategy, and the rules are applied backwards to see if the data support the strategy.
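The forward-chaining cycle just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: rules are modelled as (premise predicate, conclusion) pairs over a set of fact strings, and the cycle repeats until no rule posts a new inference.

```python
def forward_chain(fact_base, rules):
    """Apply rules to the fact base, post inferences, repeat to a fixed point."""
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise(fact_base) and conclusion not in fact_base:
                fact_base.add(conclusion)   # post the inference
                changed = True
    return fact_base

# Toy rule set mirroring the Fig. 2 example (fact strings are illustrative).
rules = [
    (lambda f: "retrieval size 160" in f, "retrieval level 4"),
    (lambda f: "precision 30%" in f, "precision level 2"),
    (lambda f: {"retrieval level 4", "precision level 2"} <= f,
     "use a narrowing strategy"),
]
facts = forward_chain({"retrieval size 160", "precision 30%"}, rules)
```

Starting from the raw retrieval size and precision, the abstraction rules fire first, and the matching rule then fires on their conclusions.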
First, all the rules having this strategy as their conclusion are examined to see if their premises are supported by the information in the Fact Base. If their premises are supported, the candidate strategy succeeds and is selected. If the premises are neither supported nor contradicted, these premises are themselves set up as candidate conclusions to be considered: the rules that have them as conclusions are examined to see if their premises are supported. The process is repeated until the candidate strategy is supported or contradicted.

We propose a mixed forward chaining and backward chaining strategy. The fact base begins with information on the retrieval size, the precision, the change in retrieval size over the previous strategy, and the change in precision. The data abstraction rules and the heuristic matching rules are applied iteratively to the information in the fact base, posting all inferences in the fact base, until no more rules trigger. This approach is efficient because most of the search strategies are selected based, at least in part, on retrieval and precision considerations. However, a pure forward chaining strategy cannot be used, because the fact base does not contain all the information needed to select a strategy. The selection rules for some strategies require special information (e.g., whether keywords exist in the search profile) obtained by executing one of the compiled procedures described previously. Using a backward chaining strategy, the system obtains this special information only when there is a need for it.

Using backward chaining, each search strategy is considered in turn. All the refinement rules that select this strategy are examined to see if their premises are satisfied by the information in the fact base. A "breadth-first" search strategy is used to control which premise is considered first.
In breadth-first search, all the premises at the same level are considered before premises at a lower level. Using breadth-first search, premises contradicted by information in the fact base are quickly detected without wasting time considering the other premises in depth.

The selected reformulation strategies are ranked by the weights assigned by the refinement rules. The strategy with the highest weight is implemented. If the strategy fails during implementation, the strategy with the next highest weight is selected instead. A strategy can fail during implementation for two reasons:
1. The search profile constructed resembles a search that has been implemented previously.
2. The system is unable to complete the strategy. This happens, for example, when the system prompts the user for additional keywords and the user declines.

We have assumed that the system will automatically select a strategy to be implemented. The system can easily be programmed to display a few recommended strategies for the user to choose from. Procedures can be attached to the selection rules to explain why each strategy is recommended.

5. CONCLUSION
A simpler version of the design we have described was implemented by a team of three final-year undergraduates from the Department of Information Systems and Computer Science at the National University of Singapore as part of an industrial attachment program with the university library (Tan et al., 1991). A prototype front-end to the library's MINISIS system was developed on an IBM-compatible microcomputer using Turbo Prolog and the Turbo Prolog Toolbox. The demonstration prototype showed that a front-end with a repertory of heuristic procedures can provide a user with substantial help in doing subject searches. The design described in this paper is based on experience gained in the initial project and on comments by librarians on the earlier system. However, a mature expert front-end can be developed only after our design has been implemented (this is currently being carried out; the results will be documented in a separate paper), the system put through its paces by library users, and its performance analyzed.

We also see the design we have described as a framework for future research. Each aspect of the system merits further study. In particular, each of the strategies described in this paper has many variations. Further study is needed to see whether using different variations in different situations will improve the performance of the system. The expert front-end can also be a vehicle for research in online searching. By attempting to model the searching behavior of human intermediaries, we can gain more insight into the knowledge and skills used in online searching.

Acknowledgements-Preparation of this article was supported in part by National University of Singapore Grant RP910688 to Dr. Danny C. C. Poo. We gratefully acknowledge the assistance of the two anonymous reviewers for their comments on a draft of this article.
REFERENCES

Akeroyd, J. (1990). Information seeking in online catalogues. Journal of Documentation, 46(1), 33-52.
Aragon-Ramirez, V., & Paice, C. D. (1985). Design of a system for the online elucidation of natural language search statements. Advances in Intelligent Retrieval: Informatics, 8, 163-190.
Bates, M. J. (1986). Subject access in online catalogs: A design model. Journal of the American Society for Information Science, 37(6), 357-376.
Bates, M. J. (1990). Where should the person stop and the information search interface start? Information Processing & Management, 26(5), 575-591.
Brooks, H. M. (1987). Expert systems and intelligent information retrieval. Information Processing & Management, 23(4), 367-382.
Cheng, C.-C. (1985). Microcomputer-based user interface. Information Technology and Libraries, 4(4), 346-351.
Clancey, W. J. (1985). Heuristic classification. Artificial Intelligence, 27, 289-350.
Doszkocs, T. E. (1983). CITE NLM: Natural-language searching in an online catalog. Information Technology and Libraries, 2(4), 364-380.
Hildreth, C. (1987). Beyond Boolean: Designing the next generation of online catalogs. Library Trends, 35(4), 647-667.
Hildreth, C. R. (1989). Intelligent interfaces and retrieval methods for subject searching in bibliographic retrieval systems. Washington, D.C.: Cataloging Distribution Service, Library of Congress.
Hjerppe, R. (1986). Project HYPERCATalog: Visions and preliminary conceptions of an extended and enhanced catalog. In B. C. Brookes (Ed.), Intelligent information systems for the information society: Proceedings of the Sixth International Research Forum in Information Science, pp. 211-232. Amsterdam: North-Holland.
Lennon, M., Pierce, D. S., Tarry, B. D., & Willett, P. (1981). An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(4), 177-183.
Lynch, C. A. (1987). The use of heuristics in user interfaces for online information retrieval systems. In C.-C. Chen (Ed.), ASIS '87: Proceedings of the 50th ASIS Annual Meeting, vol. 24, pp. 148-152. Medford, N.J.: Learned Information.
Marcus, R. S. (1986). Design questions in the development of expert systems for retrieval assistance. In J. M. Hurd (Ed.), ASIS '86: Proceedings of the 49th ASIS Annual Meeting, vol. 23, pp. 185-189. Medford, N.J.: Learned Information.
Markey, K. (1984). Subject searching in library catalogs: Before and after the introduction of online catalogs. (OCLC Library, Information and Computer Science Series, 4). Dublin, Ohio: OCLC Online Computer Library Center.
Matthews, J. R., Lawrence, G. S., & Ferguson, D. K. (Eds.) (1983). Using online catalogs: A nationwide survey. New York: Neal-Schuman Publishers.
Paice, C. (1986). Expert systems for information retrieval? Aslib Proceedings, 38(10), 343-353.
Pape, D. L., & Jones, R. L. (1988). STATUS with IQ: Escaping from the Boolean straightjacket. Program, 22(1), 32-43.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Salton, G. (1986). On the use of knowledge-based processing in automatic text retrieval. In J. M. Hurd (Ed.), ASIS '86: Proceedings of the 49th ASIS Annual Meeting, vol. 23, pp. 277-287. Medford, N.J.: Learned Information.
Tan, D., Tan, M., & Tan, W. S. (1991). An intelligent interface to the library online catalogue system: Report and program listing. National University of Singapore, Department of Information Systems and Computer Science, 3rd Year Project No. 954.
Ulmschneider, J. E., & Doszkocs, T. (1983). A practical stemming algorithm for online search assistance. Online Review, 7(4), 301-318.
Walker, S., & de Vere, R. (1988). Okapi: Developing an intelligently interactive online catalogue. Vine, 71, 4-11.
Walker, S., & Jones, R. M. (1987). Improving subject retrieval in online catalogues: 1, Stemming, automatic spelling correction and cross-reference tables. (British Library Research Paper, 24). London: British Library.
Walker, S. (1988). Improving subject access painlessly: Recent work on the OKAPI online catalogue projects. Program, 22(1), 21-31.

APPENDIX: REFORMULATION STRATEGIES
1. Broadening strategies

1.1 Leave out an unimportant search term. Some of the terms entered by a user may be unimportant to the search and, if used, will restrict the search unnecessarily. Such terms include:
- terms that are already implied by the other concepts. For example, in the search query building an expert system, the word building should not be searched, because most books on expert systems are about how to build expert systems;
- terms that are not specific or can mean many things. Examples are methods, applications, and effect.
1.2 Use term frequency analysis of LC headings in "relevant" records to identify appropriate LC headings to use as synonyms. If an LC heading or a combination of two headings appears in most of the "relevant" records, then that heading (or combination) is probably a good one to use.

1.3 Display broader LC headings for the user to select. For each LC heading used in the search, display the broader heading to see if the user is interested in it. If the user approves, then add the broader heading as a synonym.

1.4 Use the classification scheme to search for similar books or broaden the search. It is difficult to combine call number searching with subject heading searching or keyword searching in the title. Perhaps the easiest way to use the class number is to treat it as a separate step in the search sequence. After the subject search has been done, the system can offer to show the user books with similar call numbers. The system can carry out a frequency analysis to identify the class number range where the highest number of "relevant" records are classified.
1.5 Convert the Boolean operator "AND" to "OR". The initial search strategy assumes that all the concepts entered by the user should be ANDed together. This will not work if the user enters a series of synonyms as the search query. The strategy of converting all occurrences of the Boolean "AND" to "OR" can be tried if the retrieval size is very low or nil.

2. Narrowing strategies

2.1 Leave out synonyms that give poor results. In the search profile, synonyms and related terms are linked together with the Boolean OR. One or more of the related terms may give poor results. For example, in the search query Computer-aided software engineering or CASE, the synonym CASE will result in a high proportion of false drops, and should thus be excluded from the search profile. After relevance feedback by the user on the sample of titles retrieved, the system can carry out a term frequency analysis to compare the relative frequency of each related term in the "relevant" and "irrelevant" records. Terms with a high frequency in the "irrelevant" records but a low frequency in the "relevant" records can be dropped from the search.

2.2 Use terms having high frequencies in the "relevant" records. Identify the words and subject headings having a high frequency in the "relevant" records and a low frequency in the "irrelevant" records. This strategy seeks to identify a group of such terms that together will be able to retrieve all the "relevant" records. The group of terms is then ORed together and the whole appended to the search profile with the Boolean AND. This strategy can only be used if the sets of "relevant" and "irrelevant" records are big enough for a word frequency analysis to be meaningful. A sample size of at least four "relevant" and four "irrelevant" records will probably give satisfactory results.

2.3 Ask user for more concepts to narrow down the search. Prompt the user for another term and its synonyms. LC headings and see references containing any of the terms are displayed for the user to select. All narrower terms of selected headings are automatically selected. If no LC heading can be found for a term, it is used as a keyword. The words and headings are ORed together, and the whole appended to the search profile with the Boolean AND.

2.4 Ask user for words that can be "NOT"ed from the search. After the user has given feedback about the sample titles, display the "irrelevant" titles with their assigned LC headings, and ask the user to identify the words and headings that he or she does not like in those titles. The Boolean NOT can then be used to eliminate these terms from the search. This strategy is to be used as a last resort, because it can eliminate good titles as well as bad ones. This is particularly true for words that can have different meanings in different contexts. Consider the search query Computer-aided software engineering or CASE. In this search, if the user identifies CASE as a problem word to be eliminated, works on computer-aided software engineering may also be eliminated.
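The term frequency analysis behind narrowing strategies 2.1 and 2.2 can be sketched as follows. The record format (plain title strings), the sample titles, and the 0.5 frequency cut-offs are illustrative assumptions, not values from the paper.

```python
from collections import Counter
import re

def term_freqs(titles):
    """Fraction of titles in which each word occurs at least once."""
    counts = Counter()
    for t in titles:
        counts.update(set(re.findall(r"[a-z]+", t.lower())))
    return {w: c / len(titles) for w, c in counts.items()}

def terms_to_drop(relevant, irrelevant):
    """Strategy 2.1: terms frequent in 'irrelevant' but rare in 'relevant' records."""
    rel, irr = term_freqs(relevant), term_freqs(irrelevant)
    return {w for w, f in irr.items() if f > 0.5 and rel.get(w, 0.0) < 0.5}

# Illustrative judged samples for the CASE example in the text.
relevant = ["Computer-aided software engineering tools",
            "Introduction to software engineering environments"]
irrelevant = ["The case of the missing heir",
              "Case studies in marketing"]
dropped = terms_to_drop(relevant, irrelevant)
```

Here the word case is frequent among the "irrelevant" titles and absent from the "relevant" ones, so it is flagged for removal from the search profile, as strategy 2.1 prescribes.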