Int. J. Man-Machine Studies (1986) 24, 125-139
Comparison of decision support strategies in expert consultation systems

PERETZ SHOVAL
Department of Industrial Engineering and Management, Ben Gurion University of the Negev, P.O. Box 653, Beer Sheva, Israel

(Received 2 February 1985, and in revised form 25 September 1985)

Different strategies of decision support can be identified in a consultation process and implemented in an expert system. This paper concentrates on experiments that were carried out with an expert system, developed in the area of information retrieval, that performs the job of an information specialist who assists users in selecting query terms for database searches. Three different support strategies are utilized in the system. One is a "participative" strategy, in which the system performs a search within its knowledge base and, during the search, interacts with the user: the system informs the user of intermediate findings, and the user judges their relevancy and directs the search. The second is a more "independent" support strategy, in which the system performs a search and evaluates its findings without informing the user before the search is completed. The third is a "conventional" strategy (not an expert system), in which the system only provides information according to the user's request but does not make judgments or decisions; the user himself is expected to evaluate and to decide. Three main questions are examined in the experiments: (a) which of the three support strategies or systems is more effective in suggesting the appropriate query terms; (b) which of the approaches is preferred by users; and (c) which of the expert systems is more efficient, i.e. more "accurate" and "fast" in performing its consultation job. The experiments reveal that the performance of the system with the first two strategies is similar, and significantly better than the performance with the third strategy. Similarly, users generally prefer these two strategies over the "conventional" strategy. Between the first two, the more "independent" system behaves more "intelligently" than the more "participative" one.
1. Introduction

We present the results of comparative experiments conducted (in a laboratory setting) with an expert system that has been developed in the area of information retrieval to support users of retrieval systems in selecting appropriate search terms for queries. This section starts with a short introduction to the problem area, then briefly describes the essentials of the expert system and introduces the three different decision support strategies, which were developed into three systems. Section 2 outlines the objectives, hypotheses and experimental design, and section 3 presents and discusses the results. Section 4 concludes and suggests guidelines for a real-world implementation of the systems.

1.1. THE PROBLEM AREA

Our objective is to support users of information retrieval systems which are based on a controlled vocabulary. In such systems, data is indexed via index-terms which
comprise a controlled vocabulary, and retrieval of data is achieved by first having the user formulate a query which consists of a set of index-terms. In most retrieval systems the retrieval mechanism is based on a "perfect" match between query terms and index terms, where the query terms are connected with Boolean operators. Other systems are based on a "fuzzy" match between sets of query terms and index terms, where the terms may be weighted, such that the retrieved items can also be rank-ordered according to the similarity between the two sets of terms (Salton & McGill, 1983).

The vocabulary of an application area can be organized in a thesaurus, which is a collection of terms, including the index-terms which belong to the controlled vocabulary and other related words/terms/phrases. These are cross-referenced to each other according to various types of semantic relationships. A thesaurus usually consists of thousands of terms and cross-references, organized into several indexes which presumably aid in finding the right index terms for a query. (For more information on thesauri, their construction and usage see, e.g. Beck, McKechnie & Peters, 1979; Goodman, 1979; Soergel, 1974.)

Utilizing a thesaurus is complicated, and it is unrealistic to expect a casual user to employ the various indexes effectively. Users may be neither capable nor willing to invest the necessary time. (On the limitations of thesauri see, e.g. Sparck Jones & Kay, 1973, and Schultz, 1978.) Therefore information centers usually employ information specialists or consultants, whose job it is (among other things) to support users in formulating their queries. Basically, the specialist obtains from the user some narrative description of his need, performs a search in his knowledge base (mainly the thesaurus), and returns with his advice on which index terms should be used. Our expert system has the same objective, i.e. to support users in selecting appropriate index terms for queries.
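As an aside, the rank-ordered "fuzzy" matching mentioned above can be pictured as scoring a document's weighted index terms against a weighted set of query terms and sorting by the score. The sketch below uses an invented scoring choice and toy data; it is not the scheme of Salton & McGill (1983).

```python
def fuzzy_score(query, doc_terms):
    """Sum of weight products over shared terms, normalized by the query weight mass."""
    shared = set(query) & set(doc_terms)
    overlap = sum(query[t] * doc_terms[t] for t in shared)
    return overlap / sum(query.values())


# Hypothetical weighted index terms per document and a weighted query.
docs = {
    "doc1": {"database design": 1.0, "access methods": 0.5},
    "doc2": {"file organization": 0.8},
}
query = {"database design": 1.0, "file organization": 0.3}

ranked = sorted(docs, key=lambda d: fuzzy_score(query, docs[d]), reverse=True)
print([(d, round(fuzzy_score(query, docs[d]), 2)) for d in ranked])
```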
1.2. PRINCIPLES OF THE EXPERT SYSTEM

A detailed description of the expert system can be found in Shoval (1981, 1985). Briefly, the system accepts the user's problem, as expressed in his terminology, and suggests a set of appropriate index terms to represent the problem. The system consists of two main elements: a knowledge base and a set of procedures and decision rules. The knowledge base comprises the knowledge which is available to the human experts; its sources are the thesaurus and additional knowledge of terms, meanings and associations. The procedures and decision rules utilized in the system are search and evaluation rules, which reflect the work procedures and decision rules employed by the human experts, the information specialists.

The knowledge base is represented as a semantic network, in which the nodes are words, terms and phrases; a distinction is made between nodes which denote index terms (that may be used in a query) and non-index terms. The links in the network denote various types of semantic relationships between the nodes, such as hierarchy and synonymy. For the purpose of the experiments we created a relatively small knowledge base, which consists of approximately 500 nodes and 850 links (in the area of information systems and management). A detailed description of the knowledge base and its creation is provided in Shoval (1983).

The consultation process of the expert system has two main stages: "search" and "suggest". The "search" stage is aimed at finding specific relevant index terms. At the beginning the user enters a set of terms which he selects to represent his information need.
Starting with these terms, the system performs a guided search in the semantic network, which is carried out by repetitive activities of "expand", "match", "relevance judgment" and "replacement". "Expand" refers to a process of exciting, or expanding, a node in the network along its directed links, so that the associated nodes become "active" too. "Expand" is complemented by "match", which tries to find intersections of active nodes: if two (or more) active terms link to the same term, their successor, the matched term, has the potential to be relevant to the problem, because its meaning is common to the meanings of its originating terms. Having potential does not yet make the term relevant; at this point the system conducts a "relevance judgment": to be considered relevant, the matched term should also inherit from each of its "parents" some explanation capability not provided by the other parents, i.e. there should be some new, additional entry terms involved in the match.

The system maintains a list of terms which are considered relevant; when a new, more specific term is discovered and deemed relevant, it is assumed to embody the meaning of its originators, so it is added to that list, replacing the more general terms that preceded it. Assume that during the search stage there is a "front line" of relevant terms that represent the user problem. At the beginning the front line consists of just the user terms themselves. When these terms are expanded and matched, some of the new terms are recognized as relevant, and these new relevant terms take the place of their originators in the "front line". The process of "replacement" may continue as long as terms can be expanded and matched. At the end, the "front line" comprises a set of active, relevant terms, which are assumed to represent the user problem best.

At the end of the "search" process we have a set of active, relevant terms, some of which may be "hits" but others "misses". The "suggest" stage is aimed at evaluating these terms and suggesting the right ones to the user. The system employs an evaluation scheme according to which the most relevant or important index terms are suggested first, and the less relevant terms last. Two criteria are used by the system for this evaluation. One is the metric of "strength" of a term, which is the number of originators (user terms) involved in the selection of a relevant term: the more user terms involved, the more promising or important a selected term is assumed to be in representing the meaning of the problem. The other criterion is "complementation": we want not only to suggest "good" terms; we want to make sure that the suggested terms represent the user's problem as a whole, capturing the meaning of all his expressed need. This can be done by assembling a cluster of terms (out of the active, relevant terms) which together encompass the meaning of all the user terms. Elements of such a cluster are said to "complement" one another. "Complementation" is measured by the number of new user terms that are involved in the creation of a suggested term ("new" refers to user terms that have not yet been considered or embodied in the meaning of the previously suggested and accepted terms); thus the meaning of the term to be suggested complements the meanings of the formerly accepted terms.
This criterion ensures that a cluster of index terms which captures the meaning of the user problem as a whole will be suggested first, and that only subsequently will more terms be suggested (irrespective of the way they complement each other). The system suggests first the term which has the greatest "strength"; the next-best terms to be suggested are determined according to a combination of the two rules of "strength" and "complementation".
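To make the two stages more concrete, the following is a minimal sketch, with an invented toy vocabulary, of how an "expand/match/relevance judgment/replacement" search and a "strength/complementation" ordering could be realized. All data structures and names are hypothetical, the relevance judgment is simplified to "matched by at least two distinct entry terms that add new information", and the sketch is not the author's implementation.

```python
# Illustrative sketch only: a toy semantic network (directed links) in which
# some nodes are marked as index terms. Not the knowledge base of the paper.
LINKS = {
    "computer files":    ["data organization"],
    "file organization": ["data organization", "access methods"],
    "retrieval speed":   ["access methods"],
    "data organization": ["database design"],
}
INDEX_TERMS = {"data organization", "access methods", "database design"}


def search(entry_terms):
    """The "search" stage: expand, match, judge relevance and replace until the
    front line stops changing. Each node carries the set of user entry terms
    it originated from."""
    front = {t: {t} for t in entry_terms}     # front line: term -> originating entry terms
    origins = {t: {t} for t in entry_terms}   # everything activated so far
    changed = True
    while changed:
        changed = False
        # "expand": activate the successors of every active node
        successors = {}
        for term, src in origins.items():
            for nxt in LINKS.get(term, []):
                successors.setdefault(nxt, set()).update(src)
        for term, src in successors.items():
            known = origins.get(term, set())
            if src <= known:
                continue                      # nothing new learned about this node
            origins[term] = known | src
            changed = True
            # "match" + simplified "relevance judgment": two or more entry terms involved
            if len(origins[term]) >= 2:
                # "replacement": the new term displaces the front-line terms it covers
                for parent, parent_src in list(front.items()):
                    if parent != term and parent_src <= origins[term]:
                        del front[parent]
                front[term] = set(origins[term])
    # keep only index terms as candidates (unreplaced entry terms are simply dropped here)
    return {t: src for t, src in front.items() if t in INDEX_TERMS}


def suggest(candidates):
    """The "suggest" stage: a simplified stand-in for the evaluation scheme, ordering
    candidates by "complementation" (new entry terms covered) and "strength"."""
    covered, ordered = set(), []
    remaining = dict(candidates)
    while remaining:
        best = max(remaining,
                   key=lambda t: (len(remaining[t] - covered), len(remaining[t])))
        ordered.append(best)
        covered |= remaining.pop(best)
    return ordered


if __name__ == "__main__":
    user_terms = ["computer files", "file organization", "retrieval speed"]
    candidates = search(user_terms)
    print(candidates)           # front-line index terms with their originating entry terms
    print(suggest(candidates))  # the order in which they would be suggested
```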
Terms are suggested to the user one at a time. Once a term is suggested, the user has several options for reaction. If the user accepts a term, meaning that he judges it relevant to his problem, the system fires the evaluation function, computes the next-best term and suggests it to the user. If the user rejects a term, meaning that he judges it not relevant (this may happen in spite of the "intelligence" of the system), the system employs a "backtracking" procedure, which tries to find alternatives to the rejected term (alternatives are considered among the relevant terms that were previously replaced in the course of the development of the rejected term). If the meaning of a suggested term is not clear to the user, he can ask to see its dictionary definition (or scope note). He can also ask to see the "family tree" of a suggested term, in which case the system traces back along the chained paths from the term to the user's entry terms from which it originated. This provides a kind of explanation as to how, or why, a term was suggested. For a more detailed description of the structure and procedures of the system see Shoval (1985).

1.3. CONSULTATION STRATEGIES: THREE APPROACHES

A consultation process is naturally interactive. In the case of information consulting, the user provides a problem statement and the specialist conducts a search within his knowledge base, applying decision rules and making evaluations. The consultant may sometimes return to the user, presenting him with intermediate results, asking his opinion about the findings, or providing explanations of their meaning. Different styles or levels of interaction between a consultant and a client can be identified. We may assume that in some situations the consultant will accept the user's description of the problem and perform his job with little user participation and involvement, and that in other situations a more interactive and participative process is expected, in which the consultant may present intermediate results to his client and ask his opinion in order to guide the search. In reality there may be different types of consultants, clients and problem situations, and hence different styles of interaction.

We have decided to model two styles, or strategies, of system-user interaction. These were developed into two versions of the expert system, referred to as systems "I" and "B". System "I" takes a more informative and participative approach than system "B". During the "search" stage system "I" interacts actively with the user: when it finds and evaluates a term as relevant, it immediately informs the user and awaits his reaction (note that at this stage these are just intermediate findings, "milestones" in the search process, not necessarily index terms that will finally be suggested as query terms). If the user judges such an intermediate result as being irrelevant to his problem, the system does not continue to search in that direction; if he judges the finding relevant, the system continues the search, focusing in that direction. System "B", on the other hand, does not inform the user of intermediate findings and expects no guidance. Because of that there is a risk that it will end up with more unsuitable or "bad" terms than system "I". It is therefore desirable to include a capability to separate the "good" terms from the "bad" before they are suggested to the user.
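The difference between the two interaction strategies can be pictured as the same search routine run with or without a user-feedback hook. The sketch below is illustrative only; the names are hypothetical and the simulated yes/no judgment stands in for an interactive prompt.

```python
from typing import Callable, Iterable, List, Optional


def consult(findings: Iterable[str],
            confirm: Optional[Callable[[str], bool]] = None) -> List[str]:
    """Walk over intermediate findings. With a confirm hook (system "I") the user
    prunes the search as it goes; without one (system "B") every finding is kept
    and filtered only later, by the evaluation scheme of the "suggest" stage."""
    kept = []
    for term in findings:
        if confirm is not None and not confirm(term):
            continue          # system "I": the user rejected this direction
        kept.append(term)
    return kept


# Hypothetical intermediate findings and a simulated user judgment.
relevant_to_user = {"database design", "access methods"}

def ask_user(term: str) -> bool:
    # System "I" would actually prompt the user here.
    return term in relevant_to_user

intermediate = ["database design", "file backup", "access methods"]
print(consult(intermediate, confirm=ask_user))   # participative: pruned during the search
print(consult(intermediate))                     # independent: pruning deferred
```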
During the "suggest" stage, the two systems differ in the evaluation function, and hence in the order they suggest the final list of index terms to the user. System "I" orders and suggests the terms according to the metric of "strength" only, assuming
that the more user terms involved in the creation of a term, the more relevant it is. Since this system was previously guided by the user, it is highly probable that most of the terms suggested at this stage will be accepted. System "B" employs the metrics of both "strength" and "complementation", in order to suggest at the beginning a cluster of index terms that, taken together, express the meaning of the user-stated problem. Only after this condition is satisfied (and if there are still more terms) are the rest of the terms suggested, upon the user's request. In this manner system "B" tries to overcome the risk of having previously evaluated "bad" terms as relevant.

Another difference between the two systems is in the handling of alternative terms, if such are found after backtracking, i.e. after the user has judged some suggested term irrelevant. If an alternative term is found and evaluated as relevant, system "I" immediately informs the user and awaits his reaction; if the user also rejects the alternative, a search for other alternatives is started at once. System "B", as we may expect, does not inform the user of finding alternatives; it simply adds them (if such are found) to the rest of the relevant terms, and suggests them according to its own evaluation scheme.

Thus far we have discussed the main differences between the two versions of the expert system. At any rate, both systems perform a search, apply decision rules, judge the relevancy of findings, respond to user feedback, explain the meaning of and reason for suggestions, and evaluate the importance of the results, i.e. they exhibit intelligent behavior.

In contrast to the expert system approach, we have also examined a more "conventional" support strategy. To do so we developed another system, referred to as system "T", which does not employ the above techniques. The objective of system "T" is the same as the objective of the expert systems, but its principles and characteristics are entirely different. It attempts to present to the user all the information he requests, but it is not supposed to make decisions and suggestions. Everything depends on the user: the more he asks, the more information he gets, but it is up to him to decide how to deal with that information and what to do with it. Specifically, as the user enters a term into the system (out of the set of terms that represent his problem/information need), the system shows him all other terms in the knowledge base that are related to it. If the user wishes, he can continue to enter terms from the related terms shown to him, and the system will show the new terms and relationships. Alternatively, he can enter a new term from his initial set. Theoretically the user has access to all terms in the knowledge base, and nothing prevents him from getting to the best terms. He is expected, however, to face problems similar to those faced by the user of a manual thesaurus; namely, to become confused by the large amount of information the system provides, which has neither been pruned of irrelevant relationships nor been accompanied by any type of decision support.

To summarize, we have developed three alternative decision support systems which have the same objective. Two of them are expert systems, which differ mainly in their interaction and evaluation strategies, whereas the third system is limited to providing information, not advice. These three systems were subjected to comparative experiments.
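For contrast, the "conventional" behavior of system "T" amounts to a plain cross-reference lookup, with no pruning and no ranking; everything beyond the lookup is left to the user. A minimal sketch with invented data:

```python
# Hypothetical cross-reference table; in system "T" the user browses it term by term.
RELATED = {
    "computer files":    ["file organization", "data organization"],
    "file organization": ["computer files", "access methods"],
    "access methods":    ["file organization", "database design"],
}


def show_related(term):
    """Return every term cross-referenced to the entered term, unjudged and unranked."""
    return RELATED.get(term, [])


print(show_related("computer files"))
print(show_related("file organization"))
```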
2. The experiments

2.1. OBJECTIVES
We have developed three systems to support users in selecting the appropriate index terms for queries, and we have made assumptions regarding their capabilities of
performing the job of a human expert utilizing different support or consultation strategies. Some of these assumptions were tested in a set of experiments, which had three main objectives:
(a) To examine the performance of the systems, i.e. which of the systems provides better advice about which index terms should be used in a query.
(b) To determine which of the systems, i.e. which of the support strategies, users prefer.
(c) To establish which of the expert systems ("I" or "B") is more efficient in performing its job, i.e. how "fast" and "accurately" it generates its results. (This question is not relevant for system "T", which is not an expert system.)

The experiments were conducted in a laboratory environment with a group of users and a predefined set of problem statements. A problem which is often encountered in testing laboratory-type systems is the lack of a direct measure of performance. A direct measure in our case could have been the quality of the information retrieved from a database (e.g. in terms of "recall" and "precision") as a result of using the query terms suggested by our systems. This cannot, however, be done here, because our systems have only a limited knowledge base and are not connected to a real retrieval system and database. The product of our systems is just a set of terms, so we try to measure the quality of these terms as a surrogate for the desired direct measure. We may ask: how "good" are the terms? How correctly do they represent the meaning of the user problem? In order to make this evaluation we need a "gold standard", an estimate of the "best" set of query terms that should be used for a given problem. Such an estimate can be provided by human experts, who are familiar with the problem statements and can look up the right set of index terms in the relevant vocabulary. Thus, for each problem statement, a solution set of terms (the "gold standard") was determined by a group of five experts (each expert suggested a set of terms for each of the problems; the sets of terms were then compared and agreed upon by the whole group of experts before the experiments began).

2.2. HYPOTHESES

We now present the hypotheses that were tested with regard to the above objectives. For each objective we define a hypothesis, discuss the rationale for our assumption and explain the measurement of the results. (The results are presented in section 3.)
(a) System performance
(1) Hypothesis:
    H0: Per(Sys-B) = Per(Sys-I) = Per(Sys-T)
    H1: Per(Sys-B) = Per(Sys-I) > Per(Sys-T)
(2) Discussion. The hypothesis is that the performance of system "B" will be equal, on the average, to that of system "I", and better than the performance of system "T". It is assumed that a rational, consistent user with a given problem will get the same (or almost the same) results with both strategies of the expert system. Differences will occur because users receive different amounts of information in the course of the search and are presented with the suggested terms in a different manner (which may cause different reactions), but it is assumed that on the average the differences between
these two systems will balance each other. The other part of the hypothesis is that system "T" will not perform as well as the others, because this system exposes the user to an enormous amount of information but does not help him to make the right choice; he may become confused and uncertain about what he should do, resulting in low performance.
(3) Measurement. For each of the problem statements in the experiments an "expert solution" was prepared, as explained earlier. For each problem worked out by a user (with a given system) a set of terms, the "user solution", was obtained. To measure system performance we used a metric of "similarity" and two other, complementary metrics: "recall" and "precision".† The "similarity" between a "user solution" and the "expert solution" measures the performance of the user with the given system. "Similarity" is computed using the formula:

    Similarity = Tc / (Tu + Te - Tc)

where Tu = the number of terms in the "user solution", Te = the number of terms in the "expert solution", and Tc = the number of "hits", i.e. terms in common (see Fig. 1). "Similarity" ranges from 0, in the case of a "perfect miss" (Tc = 0), to 1, in the case of a "perfect hit" (Tu = Te = Tc).

[FIG. 1. Similarity: the overlap Tc between the user-solution terms (Tu) and the expert-solution terms (Te).]

Once the similarity ratio was computed for all users who worked with a given problem, the average similarity was calculated for the users within the three groups, i.e. for the three systems. The same was repeated for all the problems, resulting in three sets of average similarities. These three sets of ratios were compared with a test for the significance of the difference. In addition to "similarity", the two related and commonly used measures of performance were employed:
- "Recall" measures the ratio of "hits" to the "expert solution" only: Recall = Tc / Te.
- "Precision" measures the ratio of "hits" to the "user solution" only: Precision = Tc / Tu.

† "Similarity", "recall" and "precision" are common measures of performance in the Information Retrieval Systems arena; see, for example, Lancaster (1979) and Salton & McGill (1983). "Similarity", also termed "consistency", is used, e.g., to measure the extent to which two or more indexers agree on the choice of terms needed to represent the subject matter of a particular document. "Recall" and "precision" measure the performance of retrieval systems: "recall", also termed "sensitivity", is the ratio of the number of relevant records retrieved in a query to the total number of relevant records in the database; "precision", also termed "relevance ratio" and "acceptance rate", is the ratio of the number of relevant records retrieved to the total number of records retrieved in a query.
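The three measures are straightforward to compute from the two sets of terms; the following sketch (hypothetical function names and data) is included only to make the definitions concrete.

```python
def similarity(user_terms, expert_terms):
    """Tc / (Tu + Te - Tc): 0 for a perfect miss, 1 for a perfect hit."""
    user, expert = set(user_terms), set(expert_terms)
    hits = len(user & expert)
    return hits / (len(user) + len(expert) - hits)


def recall(user_terms, expert_terms):
    """Tc / Te: the share of the expert solution covered by the user solution."""
    user, expert = set(user_terms), set(expert_terms)
    return len(user & expert) / len(expert)


def precision(user_terms, expert_terms):
    """Tc / Tu: the share of the user solution that is also in the expert solution."""
    user, expert = set(user_terms), set(expert_terms)
    return len(user & expert) / len(user)


user = ["database design", "access methods", "file backup"]
expert = ["database design", "access methods", "data organization"]
print(similarity(user, expert), recall(user, expert), precision(user, expert))
```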
Average "recall" and average "precision" were computed for each of the three systems and for each of the problems, and the significance of the difference between them was tested, as before.
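One plausible way to reproduce this kind of comparison is a paired t-test over the six per-problem averages of two systems. The sketch below uses scipy and purely illustrative numbers, not the paper's data.

```python
from scipy import stats

# Hypothetical per-problem averages of some measure for two systems (six problems).
system_i = [0.70, 0.68, 0.35, 0.55, 0.60, 0.62]
system_t = [0.30, 0.20, 0.40, 0.20, 0.25, 0.40]

# Paired (dependent-samples) t-test: the six problems pair the observations.
t_stat, p_value = stats.ttest_rel(system_i, system_t)
print(t_stat, p_value)
```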
(b) User preferences
(1) Hypothesis:
    H0: Pref(Sys-B) = Pref(Sys-I) = Pref(Sys-T)
    H1: Pref(Sys-B) > Pref(Sys-I) > Pref(Sys-T)
(2) Discussion. The hypothesis is that, on the average, users will prefer system "B" over system "I" over system "T". The rationale is that users will prefer a system that is easier to use, that does not require too much work or previous experience, that gives support in providing suggestions and making decisions, and, generally, a system that is more "intelligent". According to these assumed criteria, system "B" was expected to be the most preferred and system "T" the least preferred.
(3) Measurement. To evaluate preferences we used a forced-choice question, in which the users were asked to rank-order the three systems and assign points to each. (This was done after each of the users had had a chance to work with all three systems.) The points assigned by all users to each system were averaged, and the significance of the differences between the averages was tested. In addition, each user was asked to answer an open-ended question, in which he was expected to explain his preference and to state the advantages and disadvantages of the systems. The responses were qualitatively analysed in order to identify the criteria for user evaluation, and to determine how these criteria were related to what the users liked or disliked about each of the systems.
(c) System efficiency
(1) Hypotheses:
(i) "Interaction Load" (IL):
    H0: IL(Sys-B) = IL(Sys-I)
    H1: IL(Sys-B) < IL(Sys-I)
(ii) "Irrelevancy" (IR):
    H0: IR(Sys-B) = IR(Sys-I)
    H1: IR(Sys-B) < IR(Sys-I)
(2) Discussion. The efficiency of a system can be measured in many respects. Efficiency here is used in a limited sense, to mean how "fast" and "accurately" the system comes up with its solution. Both expert systems interact with the user and both may suggest "good" and "bad" terms, so efficiency can be expressed in terms of the number of "good" and "bad" terms found and suggested by each of the systems during the consultation process. The hypothesis is that system "B" is "faster" and more "accurate" than system "I", because during the "search" stage it makes its own assumptions and decisions, not "bothering" the user with intermediate (sometimes irrelevant) findings, and during the "suggest" stage it employs an evaluation scheme which is assumed to separate the "good" terms from the "bad", suggesting the "good" terms first.
(3) Measurement. Two measures were utilized:
(a) In measuring how "fast" a system is we used the "Interaction Load":

    IL = (#YES + #NO) / Tu

i.e. the number of times the user was asked to respond, normalized by the number of accepted terms. The less interaction load the system places on the user, the "faster" it is in achieving its results, i.e. the lower this load, the "better" the system. "Better" is meant not only in the sense of demanding less of the user, but also in the sense of the system's capability: fewer demands on the user imply more capabilities of its own.
(b) In measuring how "accurate" a system is we used the "Irrelevancy" ratio:

    IR = #NO / Tu

This ratio relates only the number of NO responses to the number of accepted terms: the lower the ratio, the more accurate the system is assumed to be. (N.B. this measure should not be confused with the measure of system performance: the latter relates actual performance to the "gold standard" and is therefore a measure of effectiveness, whereas IR relates NO responses to the actual results, which is a type of efficiency measure.)
The IL and IR ratios were computed for each of the results (the counts of YES and NO included all steps of the systems). Then, for each problem, the average ratio for the users of each system was computed, and the significance of the difference was tested. In addition, we separately computed the IR ratio and the ratio #NO/#YES for the "suggest" stage only, to test the efficiency of the evaluation scheme (as will be further explained in the next section).
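Both ratios are simple functions of the counts collected during a run; a minimal sketch with hypothetical names and counts:

```python
def interaction_load(n_yes, n_no, accepted_terms):
    """IL = (#YES + #NO) / Tu: user responses per accepted term (lower is "faster")."""
    return (n_yes + n_no) / accepted_terms


def irrelevancy(n_no, accepted_terms):
    """IR = #NO / Tu: rejections per accepted term (lower is more "accurate")."""
    return n_no / accepted_terms


# A hypothetical run: 8 YES responses, 4 NO responses, 5 accepted terms.
print(interaction_load(8, 4, 5), irrelevancy(4, 5))
```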
2.3. EXPERIMENTAL DESIGN

In designing the experiments, the following factors were considered: (a) in order for a user to express a preference and rank-order the systems, he has to use all three systems; (b) to eliminate bias due to previous exposure to a given problem, users should not use the same problems with all systems; (c) to eliminate ordering effects, the order in which the systems are used by each subject should be randomized or else equally distributed.
To ensure that each problem is used the same number of times with each system, and that the order of system usage is equally distributed, a schedule for the experiments was prepared. It was designed for 15-18 users and six different problems, such that each user would run two problems per system, where the pairs are randomly mixed and all together provide five to six users for each of the six problems with each of the three systems.
A set of instructions was prepared and given to the users before the experiments began. It included the objectives of the systems and the experiments, a short explanation of the systems, and printouts of an example as run with the three systems. When an experiment began, each user first randomly selected a number, which determined the set of problems he would use with each of the systems and the order in which he would use the systems. Then he was given the first problem to read, and he accessed the terminal and followed the system instructions. Two problems were run with each
system before he continued with the next system. Problems were given to the user one at a time. Users took notes during or after running the systems, and at the end of the six runs they were asked to rank-order and assign points to each system and to write an answer to the open-ended question. Seventeen users, all of them graduate and doctoral students who had no previous exposure to the systems, participated in the experiments, resulting in a total of 102 system runs (34 runs with each of the systems).
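The construction of the schedule is not spelled out in the paper; the sketch below shows one hypothetical way such a balanced schedule could be generated: within every block of three users each problem is used exactly once with each system, the pairing of problems is re-shuffled per block, and the order in which each user meets the systems is randomized.

```python
import random


def build_schedule(n_users, problems=("P1", "P2", "P3", "P4", "P5", "P6"),
                   systems=("T", "I", "B"), seed=1):
    rng = random.Random(seed)
    # Three rotations of the system list: over a block of three users every
    # problem pair meets every system exactly once.
    rotations = [tuple(systems[(i + r) % 3] for i in range(3)) for r in range(3)]
    schedule, pairs = [], None
    for u in range(n_users):
        if u % 3 == 0:                       # new block: re-pair the six problems
            shuffled = rng.sample(list(problems), len(problems))
            pairs = [shuffled[0:2], shuffled[2:4], shuffled[4:6]]
        runs = list(zip(rotations[u % 3], pairs))
        rng.shuffle(runs)                    # randomize the order of system usage
        schedule.append({"user": u + 1, "runs": runs})
    return schedule


for entry in build_schedule(18):             # 18 users: 6 uses of each problem per system
    print(entry)
```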
3. Results

3.1. SYSTEM PERFORMANCE

The sets of terms that users accepted to represent a problem ("user solutions") were compared with the sets of terms suggested by the group of experts ("expert solutions"). The "similarity", "recall" and "precision" ratios were computed for each of the results, and then the average "similarity", "recall" and "precision" were computed for the three groups of users, i.e. for the three systems. This was done for the six problems. Table 1 shows the average "similarity", and Table 2 shows the average "recall" and average "precision". The results confirm the stated hypotheses: systems "B" and "I" have a similar performance, which is better than the performance of system "T".
TABLE 1
System performance: "Similarity"

Problem #    System "T"    System "I"    System "B"
1            0.276         0.654         0.695
2            0.194         0.714         0.654
3            0.393         0.271         0.314
4            0.192         0.560         0.523
5            0.270         0.586         0.569
6            0.411         0.608         0.628
Average      0.289         0.565         0.564
TABLE 2
System performance: "Recall" and "Precision"

                   System "T"             System "I"             System "B"
Problem #      Recall    Precision    Recall    Precision    Recall    Precision
1              0.375     0.519        0.854     0.722        0.850     0.795
2              0.361     0.303        0.777     0.833        0.766     0.795
3              0.625     0.482        0.350     0.517        0.500     0.497
4              0.375     0.281        0.750     0.690        0.667     0.694
5              0.457     0.460        0.738     0.742        0.762     0.687
6              0.567     0.567        0.778     0.732        0.778     0.765
Average        0.460     0.435        0.708     0.714        0.720     0.705
Pairwise t-tests on the three sets of measures revealed that systems "I" and "B" have a similar performance, and that both perform significantly better than system "T" (at α > 0.995).
To supplement these findings, it is interesting to examine the number of terms in the "user solutions". We claimed that the expert system helps the user focus on the right set of terms, whereas we anticipated that the alternative may confuse him with a surfeit of information. If this is true, we should expect a smaller spread in the number of terms in the user solutions obtained with the expert systems than with system "T". We computed the difference between the minimal and maximal number of terms in all user solutions for the six problems. The averages for the three systems clearly confirm our assumptions:

    system "I": 3.0    system "B": 3.67    system "T": 10.0

In summary, the results tend to confirm the hypotheses that the two expert systems have a similar performance, which is significantly better than the performance of system "T".

3.2. USER PREFERENCES
Table 3 shows the mean, median and range of the weighting points assigned by the users to each system.

TABLE 3
User preference

Measure      System "T"    System "I"    System "B"
Average      22.3          37.1          40.6
Median       20            35            40
Range        0-60          22-55         10-70
Pairwise t-tests were conducted to test the hypothesis that users prefer system "B" over "I" over "T". The differences between the average preferences for both system "B" and system "I" over system "T" were highly significant (at α > 0.995), whereas the difference between systems "B" and "I" was not statistically significant (α < 0.80). Analysis of the users' rank-orderings of the systems revealed that 38% (6.5† out of 17 users) rated system "B" first, system "I" second and system "T" last; another 38% (6.5 users) rated "I" first, "B" second and "T" last. Only 18% (three users) rated system "T" first, in which case they always rated "I" second and "B" last. These results clearly show that users prefer the expert system approach over the "conventional" approach. They also show that users behaved systematically: those who liked the expert system approach either rated system "B" first and system "I" second, or vice versa, but system "T" was always last. On the other hand, the few who preferred system "T" did so because they liked to get more information and less decision support from the system
(as explained by them in their answers to the open-ended question), and they therefore rated system "B" last! Between systems "B" and "I" there is some preference for "B", but basically preference is "shared" between them, meaning that the advantages that some users found in system "B" (less interaction, more decision) were balanced by the advantages that other users found in system "I" (more interaction, more information).
These results were supplemented by a qualitative analysis of the answers to the open-ended question. It is beyond the limits of this paper to present a detailed analysis of these verbal answers. Instead, Fig. 2 gives an informal summary of the responses. It presents a list of criteria or characteristics which were often attributed to the three systems. Not surprisingly, users viewed systems "B" and "T" as two opposing approaches, with system "I" in between. Although the listed criteria are neither exhaustive nor exclusive, the figure accurately summarizes the user responses and is consistent with the quantitative analysis of preferences.

† The reason for "6.5" users is that one of them rated systems "B" and "I" the same.
Criteria/attribute                                                         Sys "B"       Sys "I"    Sys "T"
Type of problem: routine problems vs specific ones                         routine                  specific
Types of users: naive/inexperienced users vs experts                       naive                    expert
User effort: how much user effort is required                              easy                     uneasy
Amount of information: how much information is provided                    little                   much
User participation: how much user participation is required by the system  little                   much
Decision support: how helpful is the system in making decisions            much                     little
"Intelligence": how smart/complex/sophisticated is the system's behavior   intelligent              simple
FIG. 2. Criteria/attributes for user preferences.

3.3. SYSTEM EFFICIENCY

Table 4 shows the average "Interaction Load" (IL) and "Irrelevancy" (IR) ratios over the six queries for systems "B" and "I". The IL ratios are significantly different, as hypothesized, meaning that system "B" places much less "load" on the user, and so reaches its solution "faster". These results are not surprising, because system "I" inherently requires more user participation during the "search" stage.
TABLE 4
System efficiency

Measure                    System "I"    System "B"
Interaction load (IL)      4.02          1.91
Irrelevancy (IR)           1.156         0.796
The IR ratios suggest that system "B" is, on the average, more "accurate" than system "I", namely it suggests fewer "bad" terms (per accepted term), although a t-test revealed that this difference is not statistically significant. The small difference can be explained by the fact that both systems are actually based on the same principles; if there is a difference, it is in the order in which "bad" terms are suggested. System "I" presents to the user all relevant concepts during the search process, and this is when most of the NO responses are expected. System "B", on the other hand, shows more "intelligent" behavior, because its evaluation function causes it to suggest first a cluster of terms which best complement each other and are therefore more likely to be accepted. Only at the next stage (and only if the user desires) does it suggest more terms, among which more "bad" terms are to be expected. To corroborate these explanations and assumptions, the distribution of NO and YES responses was examined:
- In system "I" the IR ratio (#NO/Tu) was computed for the "suggest" stage alone, and was found to be 0.25; i.e. on the average, for every four accepted terms only one is rejected. This ratio is obviously low compared with the overall IR ratio for system "I" (1.156). The interesting point here is that the "second chance" given to users in the "suggest" stage is not redundant, because they do reject terms despite having previously (in the "search" stage) directed the search.
- In system "B" a distinction is made by the evaluation function in the "suggest" stage between two phases: in phase 1 a whole cluster of terms is suggested, and in phase 2 the rest of the terms. The average ratio of #NO/#YES in each phase was:
    phase 1: 0.085
    phase 2: 3.140
The difference is very significant: in phase 1 almost all terms are "good", whereas in phase 2 there are, on the average, three times more "bad" than "good" terms. This ratio shows that the function which combines the factors "strength" and "complementation" is very powerful. This result was also supported by the user responses to the open-ended question, which typically attributed more "intelligence" to this system. To sum up: although the "accuracy" of system "B" is not significantly better than that of system "I", it behaves more "intelligently" by succeeding in suggesting first a cluster of (mostly) good terms, and only later, upon user request, showing more terms, among which more bad terms are found. This is in spite of the fact that the user was neither informed of nor asked to participate in its search process.
4. Conclusions We have described three systems which represent three strategies for consultation and decision support. Two of them are expert systems, which differ mainly in the level of interaction and involvement expected of their users; the third is a "conventional" system which only provides information and does not have the "expert" capabilities which the others have. We have conducted comparative experiments, to test the
performance of the systems, the users' preferences and the systems' efficiency. We have shown that both expert systems ("I" and "B") have similar capabilities, significantly better than the performance of the "conventional" alternative (system "T"). Similarly, we have shown that users definitely prefer the expert system approach over the other approach, and that between the two expert systems first preference is shared almost equally. We have learned that different users, or the same users in different situations, may prefer different search strategies. Since with different users or in different situations either approach may be preferred, the conclusion is that both strategies should be available, so that users have the choice of which to use. In addition, since the preference expressed for the "conventional" system "T" is not inconsequential (some users liked and justified this option of getting more information and being less dependent on the system's decisions), there needs to be a way to incorporate this advantage into the expert system. One way of doing so is to add an option (at the "suggest" stage) in which the user can ask to see, and also to select, terms that are cross-referenced to a suggested term.

The performance of the expert systems was clearly better than the performance of the alternative. The "efficiency" measures clearly favored system "B". This can be attributed to the powerful evaluation function, which "intelligently" succeeds in suggesting first a cluster of good terms, reserving the bad ones (as options) for the end. The conclusion is that this same function may also be used with the system "I" approach.

As was said, the existing system is of a "laboratory" nature. More work is needed to implement it in a real-world setting, and more research is needed to improve its effectiveness and efficiency. To implement a real-world system the following additions are needed:
- We have to integrate the three alternative approaches into a "combined" system, according to the above conclusions.
- We have to create a large, real knowledge base, based on a complete thesaurus used by an existing information-retrieval system.
- We have to connect the expert system to the retrieval system, so that the terms suggested to the user by the expert system can be immediately used for a database search to retrieve information. Feedback from the database will allow the user to change terms (if needed) and consult the expert system again.
- We have to add an algorithm to the system, such that the user's natural-language terms are accepted and standardized into the forms used in the knowledge base.

The result of these additions will be a "complete" system, in which the user will first consult the expert system in natural-language words, obtain the system's advice about vocabulary terms to use, formulate them into a query, receive feedback from the database, enter changes to the initial set of terms (if needed), receive "improved" advice from the system, and so on. This implementation will enable large-scale, real-world experiments to be made with the alternative approaches. Additional experiments will enable us to explore the performance of the system when used by different types of users, and to see to what degree they are satisfied with it. With this type of implementation it will also become possible to compare the expert system approach with other alternatives, such as human experts.

I am thankful to Dr H. E. Pople, from the Decision Support Lab.
at the University of Pittsburgh who advised me in developing the expert systems.
References

BECK, C., MCKECHNIE, T. & PETERS, P. E. (1979). Political Science Thesaurus II. University of Pittsburgh, PA.
GOODMAN, F. (1979). The role and function of the thesaurus in education. In Thesaurus of ERIC Descriptors, 5th edn. New York: Macmillan Information.
LANCASTER, F. W. (1979). Information Retrieval Systems: Characteristics, Testing and Evaluation, 2nd edn. New York: Wiley and Sons.
SALTON, G. & MCGILL, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.
SCHULTZ, C. (1978). Thesaurus of Information Science Terminology. New Jersey: Scarecrow Press.
SHOVAL, P. (1981). An Expert Consultation System for a Retrieval Database with a Semantic Network of Concepts. Ph.D. Thesis, University of Pittsburgh, PA.
SHOVAL, P. (1983). Knowledge representation in consultation systems for users of retrieval systems. In KEREN, C. & PERLMUTER, L., Eds, The Application of Mini and Micro Computers in Information, Documentation and Libraries. The Netherlands: North-Holland.
SHOVAL, P. (1985). Principles, procedures and rules in an expert system for information retrieval. Information Processing and Management, 21, 475-499.
SOERGEL, D. (1974). Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles, CA: Melville.
SPARCK JONES, K. & KAY, M. (1973). Linguistics and Information Science. New York: Academic Press.