Computers in Human Behavior 18 (2002) 745–759 www.elsevier.com/locate/comphumbeh
Assessing problem solving in expert systems using human benchmarking

Harold F. O'Neil Jr. a,b, Yujing Ni b, Eva L. Baker b, Merlin C. Wittrock c

a Rossier School of Education, University of Southern California, 600 Waite Phillips Hall, University Park, Los Angeles, CA 90089-0031, USA
b University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 300 Charles E. Young Drive North, Room 301, Los Angeles, CA 90095-1522, USA
c University of California, Graduate School of Education and Information Studies, Moore Hall 3022A/B, 405 Hilgard Avenue, Los Angeles, CA 90095-1521, USA
Abstract

The human benchmarking approach attempts to assess problem solving in expert systems by measuring their performance against a range of human problem-solving performances. We established a correspondence between functions of the expert system GATES and the human problem-solving skills required to perform a scheduling task. We then developed process and outcome measures and gave them to people of different assumed problem-solving ability. The problem-solving ability or "intelligence" of this expert system is extremely high in the narrow domain of scheduling planes to airport gates, as indicated by its superior performance compared to that of undergraduates, graduate students, and expert human schedulers (i.e. air traffic controllers). In general, the study supports the feasibility of using human benchmarking methodology to evaluate the problem-solving ability of a specific expert system.
© 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Assessment; Problem solving; Human benchmarking; Expert systems; Technology
1. Introduction

Existing evaluation techniques for the assessment of expert systems (e.g. O'Neil, Baker, Ni, Jacoby, & Swigger, 1994) have experienced two major challenges.
First, many evaluation approaches are based on a model of conventional software development that has performance specifications predefined before implementation (O'Neil & Baker, 1994b; Swigger, 1994). Unlike conventional software development, expert system development is an exploratory process in which performance specifications are usually developed following knowledge engineering, because of the particular iterative nature of the technology involving both knowledge engineers and domain experts (Keravnou & Washbrook, 2001; Pandit, 1994; Talebzadeh, Mandutianu, & Winner, 1995; Wei, Hu, & Liu-Sheng, 2001). Thus, conventional verification approaches that assess the simple match between predefined specifications and outcomes become impossible, as specifications in the traditional sense do not exist.

Second, an essential task of expert system evaluation is to translate selected criteria into testable requirements and to design test cases for the manipulation of parameters of interest. The most serious problem in carrying out these procedures is that testable requirements are hard to define because of the knowledge engineering processes typically involved in developing an expert system. Moreover, an expert system performs a complex task that a human expert can perform, such as a scheduling task, and the more complex the task, the more variation in human strategic approaches to the task (Hayes, 1997; O'Neil, Baker, Jacoby, Ni, & Wittrock, 1990). There is therefore great uncertainty about whether a set of selected test cases represents a reasonable range of experts' performance (Ericsson & Smith, 1991; Garibaldi, Westgate, & Ifeachor, 1999; Hanks, Pollack, & Cohen, 1993; Hayes, 1997; Moret-Bonillo, Mosqueira-Rey, & Alonso-Betanzos, 1997).

There have been efforts to explore alternatives for assessing artificial intelligence (AI) systems that break from traditional, standard evaluation techniques (e.g. Berry & Hart, 1990; Hayes, 1997; O'Neil & Baker, 1994a; Park, Kim, & Lim, 2001; Sharma & Conrath, 1992). The human benchmarking approach is one of those nontraditional evaluation attempts; it assesses performance improvement in artificial intelligence systems by measuring system performance against a range of human performances (Baker, 1994). In general, human benchmarking is a type of benchmarking, a quality control practice in industrial organizations. In the literature, benchmarking is used in two senses: (a) as a continuous process of searching for quality problems and quality solutions against industry best performance (Camp, 1989; Landry, 1992); and (b) in software development, as a set of standard tasks used to measure whether a software system exhibits desired performance capabilities (Cheney, 1998; Hanks et al., 1993; Letmanyi, 1984). Our human benchmarking approach refers to an evaluation procedure in which an expert system's performance is judged against differentiated performances by different samples of people.

Our search of the literature indicated only one study that was similar in intent to ours. Hayes (1997) used a method in which the competence of an expert system that generated manufacturing plans was evaluated against differentiated performances by machinists with different years of experience. In her study, Hayes gave the same set of tasks to the system and to seven human machinists with years of experience ranging from 2 to 10 years. An experience interval was thus created, based on the performances of the human machinists, to judge the experience level of the expert system.
Our human benchmarking approach was first developed in the context of an assessment of a natural language understanding system in a study by Baker, Turner, and Butler (1990). The goal of their study was to explore the feasibility of benchmarking the performance of an artificial intelligence system against the performance of humans, and to use the results of the human benchmarking procedure as indicators of improvement in computer system performance (Baker, 1994). The strategy for the study was to design a test, called the Natural Language Elementary Test, that was syntactically equivalent to the natural language system software, and to administer it to children of different ages. By grouping the children according to their grade-equivalent scores in order to represent language proficiency levels, levels of the children's performance on the Natural Language Elementary Test were differentiated. This performance was used as a "scale" to measure the performance of the natural language computer system. Since the vast majority of children in kindergarten and first grade could accomplish these syntactically equivalent tasks and the natural language system could not, the human benchmarking procedure resulted in a classification of "not very intelligent" for the natural language system, as its intelligence was no greater than that of a kindergarten student. This initial attempt at benchmarking the performance of an AI system against the performance of humans proved feasible and therefore was followed by a further exploration of the methodology in the evaluation of an expert system (O'Neil et al., 1994).

The human benchmarking methodology was augmented when it was applied to the evaluation of the expert system. The human benchmarking study of the natural language understanding system was based on the equivalence between the natural language system and the Natural Language Elementary Test. This equivalence relied only on the common syntactic structures of language while ignoring the semantic aspects. Also, the natural language study focused only on outputs of the program, disregarding analogous processes of the program. However, expert systems such as scheduling systems involve a considerable amount of domain-specific knowledge and iterative strategic planning. Therefore, human benchmarking of such expert systems has to take into consideration both the syntactic structure and the specific content of both processes and outcomes in order to compare the expert system's performance against human performance.

The tasks involved in a scheduling expert system clearly constitute a problem-solving task. This conceptualization is a refinement of O'Neil et al.'s (1994) study, where the focus was on a metaphorical comparison using the construct of intelligence (i.e. artificial intelligence). Our current thinking views these processes and outcomes as examples of problem solving. Mayer and Wittrock (1996) have defined problem solving as "cognitive processing directed at achieving a goal when no solution method is obvious to the problem solver" (p. 47). The definition and assessment of problem solving was further refined by O'Neil (1999) as consisting of domain knowledge, problem-solving strategies, and self-regulation. In this study, problem-solving strategies and elements of self-regulation were assessed; domain knowledge per se was not assessed.
Our general approach in the study (O’Neil et al., 1994) was to establish the correspondence between functions of an expert system and the problem-solving skills required by a person to perform the task of the expert system. Based on the inferred correspondences, we developed both process and outcome problem-solving measures.
We then gave both the process and outcome measures to human participants of different assumed problem-solving ability. The participants included community college students, university undergraduate and graduate students, and airport scheduling experts (i.e. air traffic controllers). The expert system GATES, a scheduling system written in Prolog that was used for gate assignment of airplanes at TWA's JFK and St. Louis airports, was chosen for these studies.

In previous studies (see O'Neil et al., 1994, for details), we first piloted the human scheduling task materials for content, clarity, and administration time with several human participants. Then the tasks were given to a sample of community college students and to the GATES system. Results indicated that the problem-solving ability of the system was extremely high in the narrow domain of scheduling compared to that of the community college students. However, the results did not allow us to create a scale with intervals for human benchmarking, because only one group of human participants, with an assumed narrow level of problem-solving ability, was used.

In order to create the human benchmarking metric needed to quantify "extremely high," we expanded the methodology to other groups of human participants, including university undergraduate students, graduate students, and airport scheduling experts. We assumed that graduate students would have more problem-solving ability than undergraduate students, and that airport scheduling experts would have more problem-solving ability than graduate students. To facilitate the comparison, we also used scheduling tasks varying in difficulty level and think-aloud protocols to probe cognitive processes while the human participants solved the scheduling tasks.

The goal of the present study was to administer both an easy task and a difficult task to undergraduate and graduate university students, who were assumed to have higher problem-solving ability than the community college students who participated in the O'Neil et al. (1994) study. In California, where this study was conducted, higher SAT scores and higher prior achievement, as indicated by higher high school grade point averages, were required for admission to a university than for admission to a community college; to be admitted to a community college, students were required only to be 18 years of age or older, and high school graduation was not required. We also felt that the use of graduate students would provide a very select group on the underlying problem-solving ability variable, as graduate students were selected from successful undergraduates. The final group, air traffic controllers, would constitute the group with the highest level of problem-solving ability in the scheduling domain. This group allowed us to test the high end of our human benchmarking scale.
2. Method

2.1. Participants

Fifty-one undergraduate (17 male, 34 female) and 51 graduate (21 male, 30 female) university students from a southern California university participated in the study.
The undergraduate students were regular university students from different departments taking a beginning psychology class during a summer session. The graduate students were regular university students in the Education Department working on research or enrolled in a research methods class or a teacher education class during the same summer session. All students were paid for their participation: the undergraduate students were paid $20 and the graduate students $30. Six additional education graduate students (four females and two males) served as participants for the think-aloud aspects of the study; these students were paid $50 for their participation. Finally, three air traffic controllers working for a major airline at a major New York City airport also participated as our airport scheduling experts. The air traffic controllers had varying degrees of scheduling experience, ranging from 6 months to 10 years, and were paid $50 for their participation.

2.2. Outcome measure materials

Both an easy version and a difficult version of the scheduling task for the human participants, based on GATES, were developed by varying the number and type of constraints in the task. There are certain constraints (rules and restrictions) in doing such scheduling tasks (for example, different kinds of flights, domestic and international, go to different kinds of gates), and an extensive discussion of the constraints can be found in O'Neil, Baker, et al. (1990). The constraints in our task were taken from the expert system GATES (Brazile & Swigger, 1988).

In order to do the task, GATES follows a plan that has three phases. In the first phase, GATES tries to complete the task using all the constraints; the second phase gives guidelines to relax the constraints in order to schedule all unscheduled flights; and the third phase gives guidelines to reapply selected constraints to the extent possible so that all the flights remain assigned. In our study, this plan, with slight modifications, was given to the human participants solving the task. The plan was modified in the following manner: (a) we added an explanation of how to use the plan that the expert system uses by default, such as picking out the constraints relevant to the task; and (b) to reduce memory demands, the participants also received a job aid, a tabular worksheet showing the flight times vertically and the gates with appropriate taxiways horizontally.

The scheduling problem given to the participants in the study included a list of incoming flights with the following information about each flight: flight destination (domestic or international), flight number, plane type, arrival time, and departure time. The task for the participants was to assign a set of incoming flights to available gates without violating the constraints.

We characterized the scheduling tasks for this study by difficulty level as easy or difficult. The easy task included 15 flights, 10 gates, and three taxiways. Only domestic flights were given, and no continuation flights, moves, mobile lounges (predesigned flights that use a bus-like vehicle to load/unload passengers), tow-on gates (gates that need some type of equipment to place the plane at the gate), or ferry flights were included. The easy task can be solved using only the first phase of the plan, that is, by doing the task with all the constraints.
The difficult task included 15 flights, nine gates, and five taxiways. There were domestic and international flights, four continuation flights, and four moves. The second phase of the plan is needed in order to assign all the flights by relaxing some of the restrictions.

2.3. Process measure: metacognitive questionnaires

The metacognitive questionnaires were developed as measures of either a trait approach to metacognition (the initial measure) or a state metacognition approach (the post measure). The trait measure was given with instructions to "describe how you generally think." The state measure was given with retrospective state instructions, that is, "describe how you thought during the scheduling task." The trait measure asked participants to rate the frequency of occurrence of the thought process, whereas the state measure asked participants to rate the intensity of the thought process. The item stems were the same for both measures. For example, an item on the post-thinking questionnaire (a state measure) was "I set useful goals for myself." The participants were asked to rate this item as not at all, somewhat, moderately so, or very much so. The parallel item on the pre-thinking questionnaire (a trait measure) was "I set useful goals for myself," and participants were asked to rate this item as almost never, sometimes, often, or almost always.

Our conceptualization of metacognition as having both a trait and a state component is a synthesis of the previous work of Weinstein and Mayer (1986), Beyer (1988), and Pintrich and De Groot (1990). We view metacognition as conscious, periodic self-checking of one's goal achievement and, when necessary, selecting and applying different strategies. Metacognition consists of planning and self-assessment. Thus, for metacognition one must have a goal (either assigned or self-directed) and a plan to achieve the goal. The plan requires cognitive or affective strategies to execute, and these strategies can be either domain-independent or domain-dependent. In addition, one needs a self-checking or self-monitoring mechanism to know which strategy among competing ones to select initially to solve the task and, further, when to change such a strategy when it is ineffective in achieving the goal. Finally, the process is conscious to the individual. Similar conceptions of metacognition and self-regulation have been reported by Paris and Paris (2001).

The lack of an appropriate paper-and-pencil measure for these attributes of metacognition led us to develop a trait and a state measure of metacognition (O'Neil & Abedi, 1996; O'Neil & Herl, 1998; O'Neil, Ni, Jacoby, & Swigger, 1990). Our goal for each instrument was to have four scales of five items each, one to measure each aspect of metacognition: planning, self-checking, cognitive strategy, and awareness. In addition, state worry was measured. The state worry scale (Morris, Davis, & Hutchings, 1981) asked the participants to rate, on a 5-point scale, their attitudes or thoughts while doing the task. A sample item is "I was afraid that I should have studied more for this test." Participants rated each item using the following scale: 1 (the statement does not describe my past condition); 2 (the condition was barely noticeable); 3 (the condition was moderate); 4 (the condition was strong); 5 (the condition was very strong).
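To make the structure of these questionnaires concrete, the sketch below shows one way the measures could be represented and scored. The four five-item scales (planning, self-checking, cognitive strategy, awareness) and the trait and state response anchors come from the description above; the item identifiers, the simple sum scoring, and the example administration are illustrative assumptions rather than the published instrument (the 5-point state worry scale could be handled in the same way).

```python
# A minimal sketch of how the trait/state metacognition measures described in
# Section 2.3 might be represented and scored. The four 5-item scales and the
# response anchors follow the text; item IDs and sum scoring are assumptions.

TRAIT_ANCHORS = {"almost never": 1, "sometimes": 2, "often": 3, "almost always": 4}
STATE_ANCHORS = {"not at all": 1, "somewhat": 2, "moderately so": 3, "very much so": 4}

# Four scales of five items each; "plan_1", "check_1", etc. are hypothetical item IDs.
SCALES = {
    "planning": [f"plan_{i}" for i in range(1, 6)],
    "self_checking": [f"check_{i}" for i in range(1, 6)],
    "cognitive_strategy": [f"strat_{i}" for i in range(1, 6)],
    "awareness": [f"aware_{i}" for i in range(1, 6)],
}


def score_metacognition(responses, anchors):
    """Sum the item ratings within each scale; `responses` maps an item ID to
    the verbal anchor the participant selected."""
    scores = {scale: sum(anchors[responses[item]] for item in items)
              for scale, items in SCALES.items()}
    scores["total"] = sum(scores[scale] for scale in SCALES)
    return scores


# Example: a trait administration in which a participant answers "often" to every item.
trait_responses = {item: "often" for items in SCALES.values() for item in items}
print(score_metacognition(trait_responses, TRAIT_ANCHORS))
```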
2.4. Procedure

Participants were recruited and told they would be paid for their participation. Participants were randomly assigned to either the easy or the difficult task. When participants came to the experiment room, they were asked to fill out a consent form, payment information, and the trait metacognition questionnaire. Then the task was handed out and briefly explained. All participants had 55 min to complete the task. At the end of the allocated time the forms were collected and participants were asked to complete a state worry questionnaire, the post-metacognitive questionnaire, a process questionnaire, and additional process questions (the latter two measures are discussed in O'Neil, Baker, et al., 1990).
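The logic of the scheduling task and of the three-phase plan described in Section 2.2 can also be expressed schematically. The sketch below, written in Python for illustration (GATES itself is implemented in Prolog), models only two of the constraints named in the paper (the 20-minute separation between flights at the same gate and the domestic/international gate distinction) and a greedy two-pass assignment in the spirit of phases 1 and 2 of the plan. The data classes, field names, and the choice of which constraint to relax are assumptions made for illustration, not the GATES implementation.

```python
# Schematic sketch of the gate-assignment task and the first two phases of the
# plan described in Section 2.2. Only two task constraints are modelled; the
# data structures and the relaxed constraint are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Flight:
    number: str
    international: bool
    arrival: int      # minutes after midnight
    departure: int


@dataclass
class Gate:
    name: str
    international: bool
    occupied: list = field(default_factory=list)   # (arrival, departure) intervals


def separation_ok(gate, flight, minimum=20):
    """At least `minimum` minutes between flights using the same gate."""
    return all(flight.arrival >= dep + minimum or flight.departure + minimum <= arr
               for arr, dep in gate.occupied)


def gate_type_ok(gate, flight):
    """International flights go to international gates, domestic to domestic."""
    return gate.international == flight.international


def assign(flights, gates):
    """Phase 1: schedule under all constraints. Phase 2: relax the gate-type
    constraint for whatever is still unscheduled. (Phase 3, re-imposing relaxed
    constraints where possible, is omitted for brevity.)"""
    schedule = {}

    def one_pass(pending, constraints):
        left = []
        for flight in pending:
            for gate in gates:
                if all(check(gate, flight) for check in constraints):
                    gate.occupied.append((flight.arrival, flight.departure))
                    schedule[flight.number] = gate.name
                    break
            else:
                left.append(flight)
        return left

    unscheduled = one_pass(list(flights), [separation_ok, gate_type_ok])
    unscheduled = one_pass(unscheduled, [separation_ok])
    return schedule, unscheduled


# A toy example in the spirit of the easy task (domestic flights only).
gates = [Gate("G1", international=False), Gate("G2", international=False)]
flights = [Flight("TW101", False, 480, 530), Flight("TW102", False, 485, 540)]
print(assign(flights, gates))
```

Violations of constraint predicates of this kind correspond to the error categories used to score the human participants' solutions in Section 3.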
3. Results

The error scoring scheme used to analyze the community college data (O'Neil et al., 1994) was used to analyze the university data. The errors made in gate assignment were subdivided into eight categories: separation time (less than 20 min separation time at the same gate between one flight and another), wrong plane type (a flight with a specific plane type is assigned to a gate that is not appropriate for that plane type), time conflict (two flights are assigned to the same gate at the exact same time), gate 4 not free when it should be (a flight using the 747 plane type is assigned to gate 3, and gate 4 is not left open), taxiway conflict (less than 5 min separation time between flights using the same taxiway), unavailable gate (a flight is assigned to a gate that was not available for the task), illegal move (a flight is moved from one gate to another when it is not supposed to be), and missing assignment (a landed flight is not assigned to any gate). Every wrong assignment was categorized into one of these categories.

In addition to the errors for the easy task, there were two other possible errors related to the difficult task. One scheduling constraint states that a particular gate can be used only for departure flights and not for arrival flights; an error of "not arrival gate" occurs when an arriving flight is assigned to this gate. Another constraint is that an international flight must use an international gate and a domestic flight must use a domestic gate; violation of this constraint results in the error of "illegal change between international and domestic gates."

With respect to the trait and state metacognition scales, the alpha reliabilities in this study were 0.90 and 0.92, respectively. The alpha reliability for the state worry scale was also excellent, 0.84. The state worry value is consistent with other data for this scale (Morris et al., 1981).

Table 1 provides means and standard deviations for process and outcome variables for undergraduates and graduate students in the easy and difficult tasks.

Table 1
Means and standard deviations (in parentheses) of process and outcome variables

                                   Undergraduate                         Graduate
Variable                           Easy (n=27)     Difficult (n=24)      Easy (n=24)     Difficult (n=27)
GPA a                              3.13 (0.40)     3.14 (0.40)           3.68 (0.37)     3.79 (0.21)
Schedule experience                0.07 (0.26)     0.00 (0.00)           0.20 (0.42)     0.18 (0.36)
Total correct assignment           13.29 (1.32)    11.79 (1.31)          13.04 (2.07)    11.07 (2.55)
Domestic travel experience         3.18 (1.46)     3.04 (1.49)           2.87 (1.11)     3.18 (1.03)
International travel experience    1.96 (1.09)     2.29 (1.23)           2.25 (1.15)     1.81 (0.83)
Taxiway violation                  0.77 (0.80)     0.91 (0.77)           0.62 (0.82)     0.81 (0.73)
Time conflict                      0.44 (0.89)     0.45 (0.72)           0.50 (0.78)     0.29 (0.54)
Wrong plane type                   0.03 (0.19)     0.25 (0.44)           0.12 (0.44)     0.22 (0.57)
Unavailable gate                   0.07 (0.26)     0.29 (0.69)           0.04 (0.20)     0.33 (0.73)
Gate 4 violation                   0.22 (0.50)     0.37 (0.49)           0.16 (0.38)     0.22 (0.42)
Separation time violation          0.14 (0.45)     0.00 (0.00)           0.12 (0.33)     0.00 (0.00)
Illegal move                       0.00 (0.99)     0.00 (0.00)           0.00 (0.00)     0.00 (0.00)
Missing assignment                 0.03 (0.19)     0.08 (0.28)           0.37 (0.12)     0.96 (2.37)
Not arrival gate b                 NA              0.75 (0.53)           NA              0.63 (0.49)
Illegal gate change b              NA              0.08 (0.28)           NA              0.40 (1.15)
Trait metacognition c              80.52 (9.72)    79.35 (10.70)         79.87 (9.26)    79.96 (11.19)
State metacognition                82.46 (10.61)   72.37 (8.30)          77.30 (12.07)   68.57 (12.98)
State worry                        9.44 (4.33)     11.75 (5.08)          8.70 (3.34)     11.88 (4.03)

a n=22. b These categories were not included in the easy task. c n=25.

As may be seen in Table 1, for undergraduate students the mean for total correct assignments on the easy task was 13.29, which means that on average these students performed about 90% of the task correctly. The most frequent error was taxiway violation. For graduate students, the mean for total correct assignments on the easy task was 13.04, which means that on average these students also performed about 90% of the task correctly.
The most frequent error for graduate students, as for undergraduate students, was taxiway violation.

The mean for total correct assignments for undergraduate students on the difficult task was 11.79, which means that on average the undergraduate students performed about 75% of the task correctly. The most frequent error, as in the easy task, was the taxiway violation; another common error was assigning an arriving plane to a gate that was not an arrival gate. For graduate students, the mean for total correct assignments on the difficult task was 11.07, which means that on average the graduate students performed about 75% of the task correctly. The most frequent error was again taxiway violation. The difficult task also had the highest mean of missing assignments. As also shown in Table 1, state worry was higher for the difficult task than for the easy task.

With respect to relationships among the variables, for all students there was a positive correlation between gate assignment performance and state metacognition (r = 0.35, p < 0.001), but a negative correlation between performance and state worry (r = -0.20, p < 0.05). State worry was also significantly negatively associated with post-task state metacognition (r = -0.28, p = 0.05). Moreover, the trait metacognition scores were not related to correct gate assignment performance. Thus, state metacognition predicted performance, but trait metacognition did not. Such results, in which states predict performance but traits do not, are common in the affective domain. For example, state worry is associated with poor performance, whereas trait worry is not (O'Neil, Baker, & Matsuura, 1992; Sieber, O'Neil, & Tobias, 1977).
In general, with few exceptions, the following variables were not related to each other: GPA, scheduling experience, international travel experience, and trait metacognition. As expected, trait metacognition was related to state metacognition (r = 0.34, p = 0.001).

The results of a 2 × 2 ANOVA on the number of correct gate assignments, with task difficulty (easy vs. difficult) and educational level (undergraduate vs. graduate) as factors, showed a main effect of task difficulty, F(1, 98) = 21.18, p < 0.001, but no effect of educational level and no interaction. These results indicated that both university graduate and undergraduate students performed better on the easy task than on the difficult task. Unexpectedly, there was no significant difference in performance between the graduate and undergraduate students.

The same ANOVA (task difficulty × educational level) was conducted on the following variables: state worry, state metacognition, and the eight errors common to both tasks. A main effect of task was found for both state worry and state metacognition. The harder the task, the higher the students' state worry, F(1, 98) = 10.63, p < 0.05, and the lower their state metacognition, F(1, 95) = 17.58, p < 0.001. It appears that the increased level of worry decreased state metacognition. Although the graduate and undergraduate students did not differ in their level of state worry while they were doing the scheduling task, they did differ on state metacognition, F(1, 95) = 3.96, p < 0.05, with undergraduates exhibiting more state metacognition than graduate students. We felt that our undergraduates (at a select university) were "smarter" or better problem solvers than our graduate students (education graduate students at the same university) and thus showed higher levels of state metacognition. This interpretation was also supported by the lack of difference in task performance between the two groups. Trait metacognition scores were equivalent for both groups; thus, the different levels of state metacognition reflect mainly reactions to the tasks.

Further, there was no differential pattern of errors for either the graduate or the undergraduate students, as no main effect of educational level was found for any kind of error except "missing assignment," F(1, 98) = 4.9, p < 0.05. The analysis suggested that graduate students made more of this kind of error than the undergraduates; however, the raw data showed that only two graduate students were responsible for this effect.

3.1. Qualitative aspects

Six graduate students at the same university as the undergraduate students volunteered to participate in the part of the study in which they were asked to think aloud while doing the scheduling task. They were paid $50 for their participation. Two students did the easy task with the plan information, two did the same task without the information, and two did the difficult task with the plan information. There were no time limits on completing the task. In this part of the study, the participants were told that we were interested in the processes they used to approach the scheduling task, and they were encouraged to talk aloud, saying everything they were thinking while solving the task. Sessions were run individually.
The sessions were both audio- and videotaped, and detailed notes were taken by the experimenters. The data were analyzed by abstracting the notes and listening to the audiotapes; transcripts were not typed. The data were analyzed for stages, plans, strategies, and errors. The detailed think-aloud protocols for each participant are provided in O'Neil, Baker, et al. (1990) and summarized below.

In general, the qualitative data from the think-aloud protocols indicated that the process of scheduling consists of three stages: reading the task information, assigning flights to gates, and checking the assignments. These stages were clearly evident in the think-aloud data of four of the six participants (i.e. participants 1, 2, 3, and 5). Across these stages, several cognitive processes were reflected in the think-aloud protocols.

3.1.1. Identifying relevant task information for the scheduling task
All participants tried to figure out information relevant to the task during the first stage. Three behaviors demonstrated this process: (a) taking notes on task information participants thought was important for the task; (b) crossing off task information they considered not relevant to the task; and (c) wondering whether or not some information was related to the task.

3.1.2. Scheduling flights to gates and evaluating the assignments according to the constraints related to the task
Usually, participants were concerned with the 20-min separation time, 5-min taxiway separation time, and turn time. Interestingly, some participants (i.e. participants 2 and 6) were very concerned with the rule "minimized tow on gate 1."

3.1.3. Checking the assignments
Although the participants checked their assignments during the second stage, this checking seemed to happen at a local level, while doing the scheduling. It became more global after they had finished all the assignments, and they then used more systematic approaches, such as checking the taxiway constraint for arrival times first and then departure times, or for domestic flights first and then international flights. During this stage, the participants usually tried to optimize a schedule, but sometimes a participant made the original schedule worse (e.g. participant 3).

For the easy task, all four participants did the scheduling basically in the order of the incoming flights given in the task, with small variation. The small variation entailed a participant assigning an MD80 flight first, because this type of plane is most restricted, and then scheduling the other flights in the given order. For the difficult task, the two participants used somewhat different strategic plans for the scheduling. One participant first divided the flights into domestic and international flights, and then, within the domestic or international flights, treated non-continuation flights first and continuation flights next. The other participant grouped the incoming flights according to their arrival time: he first assigned four flights that all arrived at 8:00, then scheduled another set of flights that arrived at 8:05.

The fact that there was more variation in the strategic plans that participants used for the difficult task than for the easy task suggests that the more complicated a task, the more variation in the strategic plan people use for the task, which was reflected in both the think-aloud data and the state metacognition data.
This does not seem to be the case for the computer system GATES, which uses the same strategic plan for any level of gate assignment task. Yet it did take GATES more time to do the difficult task than to do the easy one.

The participants in the think-aloud study made three kinds of errors: taxiway violation (participants 3 and 6), assigning unavailable gates to flights (participant 6), and assigning a flight to a gate not intended for arrivals (participant 5). The errors were assumed to occur because the participants were not aware of relevant constraints. For example, the error made by participants 5 and 6, of assigning a plane to an inappropriate gate, occurred because the participants did not realize this was a relevant constraint. The only error participant 3 made was a taxiway violation. She made this error not because she was unaware of this constraint but because she considered the restriction to apply only to the departure times of flights; thus, she checked her assignments against the taxiway constraint only for departure times and not for arrival times. As she explained, "The flights are already being there, I even didn't think about this!" A quantitative summary of the qualitative data is shown in Table 2.

Table 2
Data from think-aloud protocol participants

Participant   Task                  % Correct   Wrong assignments   Types of errors                       Time
1             Easy with plan        100         0                   -                                     1 h 50 min
2             Easy with plan        100         0                   -                                     50 min
3             Easy without plan     80          3                   Taxiway violation                     50 min
4             Easy without plan     100         0                   -                                     - a
5             Difficult with plan   90          1                   Not arrival gate                      1 h 50 min
6             Difficult with plan   80          3                   Not arrival gate; unavailable gate    45 min

a Time was not recorded for this participant.

3.2. Expert schedulers: air traffic controllers' performance

Three air traffic controllers working for a major airline at a major New York City airport were asked to do the difficult task and complete the state worry and metacognition questionnaires. They were chosen to participate in the study as expert schedulers because they were considered to represent the highest level on the scale of performance our human benchmarking approach is intended to create. One of the schedulers was an expert with 10 years of scheduling experience, and he also provided the domain knowledge represented in the GATES system. The schedulers' data are summarized in Table 3.

Table 3
Summary of the expert schedulers' performance on the difficult task

Participant   Time (min)   Scheduling experience   Errors   State worry   Trait metacognition   State metacognition
1             50           10 years                1        5             83                    88
2             45           4 years                 1        10            101                   90
3             55           6 months                2        9             85                    80

As may be seen in Table 3, unexpectedly, the schedulers made some errors. For example, Scheduler 1 assigned an incoming flight to gate 38, which is not for arrival according to the given restrictions. In this case, the error was probably due to too much knowledge.
Scheduler 2 also made an error: she wrote gate number "13" next to flight 13, but gate 13 was not provided in the task. Finally, Scheduler 3 made two errors of using hardstand gates that were not available for the task, although she said in the process questionnaire that she avoided using hardstand gates.

The schedulers in the study had different levels of scheduling experience: one had 10 years, one 4 years, and one 6 months. The data suggest that different levels of experience had an impact on the processes of solving the scheduling problem. Scheduler 1, who had 10 years of experience, had a more explicit plan and strategies for doing the scheduling task compared with the other two schedulers. For example, Scheduler 1 stated that when he planned the scheduling, he considered four kinds of restrictions: gate restrictions and types of planes, inbound and outbound times, domestic and international distinctions, and required moves. Scheduler 2 said she used her own worksheet to do the task, and she scheduled the flights with the most restrictions first, then those with the longest turn times. Scheduler 3 mentioned only using the plotting sheet to highlight those gates that were provided in the task. Scheduler 1 had the lowest score on the state worry measure.

The schedulers outperformed the vast majority of the university students on the difficult task. Out of 51 students, only two did the task without error (one undergraduate and one graduate student); two students made one error (one graduate and one undergraduate); and two undergraduate students and six graduate students made two errors. However, when the easy and difficult tasks were given to GATES to solve, the expert system completed both tasks with no errors. Thus, the program had a better track record than the expert schedulers or the students.
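The comparison underlying the human benchmarking scale can be expressed compactly. The sketch below uses the difficult-task results reported above (mean correct assignments out of 15 for the two student groups, and the experts' error counts converted to correct assignments) to locate GATES's error-free performance relative to the human reference groups. The conversion of errors to correct assignments and the simple ranking rule are illustrative assumptions, not the authors' scoring procedure.

```python
# Toy illustration of the human benchmarking idea: place the expert system's
# score on a scale anchored by human reference groups. The group values are
# the difficult-task results reported above; treating one error as one
# incorrect assignment (out of 15) and the ranking rule are assumptions.

DIFFICULT_TASK_BENCHMARKS = {        # mean correct gate assignments (of 15)
    "undergraduate students": 11.79,
    "graduate students": 11.07,
    "expert schedulers": (14 + 14 + 13) / 3,   # schedulers made 1, 1 and 2 errors
}

GATES_SCORE = 15.0                   # GATES solved the difficult task without error


def benchmark(system_score, groups):
    """Name the highest-scoring human group whose mean the system matches or exceeds."""
    reached = [name for name, mean in groups.items() if system_score >= mean]
    if not reached:
        return "below all human reference groups"
    return "at or above the level of " + max(reached, key=groups.get)


print("GATES performs " + benchmark(GATES_SCORE, DIFFICULT_TASK_BENCHMARKS) + ".")
# -> GATES performs at or above the level of expert schedulers.
```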
4. Discussion

There was a significant difference in gate assignment performance between the community college students in the O'Neil et al. (1994) study (M = 11.96) and the university undergraduate (M = 13.29) and graduate (M = 13.04) students in this study. The community college students were not given the difficult task. As expected, university students performed better than community college students. In this study, both undergraduate and graduate students performed better on the easy task than on the difficult task.
Surprisingly, graduate students did not perform better than undergraduates, although it was expected that graduate students' performance would be better than that of undergraduate students on both tasks. One reason may be that the sample of undergraduate students was more representative than that of graduate students: the undergraduate students were from several departments in the university, whereas the graduate students were from the Education Department only.

In general, the data supported the validity of the state and trait metacognition questionnaires. For the state measure, the higher the metacognition scores, the better the performance. Also consistent with our expectations, state worry was negatively associated with performance and with state metacognition. Further, trait metacognition and state metacognition were significantly positively correlated, although our expectation was that the magnitude of the relationship would be stronger. Also as expected, the trait measure of metacognition was not correlated with performance.

Our think-aloud protocol data indicated that participants who received the plan for the easy task followed the plan to do the scheduling. Interestingly, participants who did not receive the plan for the easy task used a similar strategy. However, the two participants doing the difficult task, even though they received the plan, used different strategies: one participant scheduled by domestic/international flights, and the other scheduled by continuation/non-continuation flights. In general, for the difficult task, participants were overwhelmed by the amount of information and therefore had problems identifying relevant constraints. It also appeared that the more difficult a task, the more likely people are to use different strategies to approach it. Thus, we can argue that it is inadequate to use test cases generated by only one or two experts for the evaluation of expert systems, because there may be great variation in strategic plans for a given task, and that it is inadequate to evaluate the "intelligence" or problem-solving ability of an expert system according to outcome alone.

In general, this study supports the feasibility of using human benchmarking methodology to evaluate the problem-solving ability of a specific expert system. The "intelligence" or problem-solving ability of this expert system is extremely high in the narrow domain of scheduling planes to gates, as indicated by its superior performance compared with that of undergraduates, graduate students, and expert schedulers.
Acknowledgements

The authors wish to thank other members of the UCLA human benchmarking group: Drs. Robert Brazile, Frances Butler, Anat Jacoby and Kathleen M. Swigger. In addition, we wish to thank Dr. Harold Levine for his methodological assistance with the qualitative aspects of our study. This research was supported in part by contract number N00014-86-K-0395 from the Defense Advanced Research Projects Agency (DARPA), administered by the Office of Naval Research (ONR), to the
UCLA Center for the Study of Evaluation. However, the opinions expressed do not necessarily reflect the positions of DARPA or ONR, and no official endorsement by either organization should be inferred. This research also was supported in part under the Educational Research and Development Centers Program cooperative agreement R117G10027, and in part under the Educational Research and Development Centers Program, PR/Awards No. R305B60002 and R305B960002–01, as administered by the Office of Educational Research and Improvement, US Department of Education. The findings and opinions expressed in this report do not reflect the positions or policies of the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, or the US Department of Education.
References

Baker, E. L. (1994). Human benchmarking of natural language systems. In H. F. O'Neil Jr., & E. L. Baker (Eds.), Technology assessment in software applications (pp. 85–97). Hillsdale, NJ: Lawrence Erlbaum Associates.
Baker, E. L., Turner, J. L., & Butler, F. A. (1990). An initial inquiry into the use of human performance to evaluate artificial intelligence systems. Los Angeles: University of California, Center for Technology Assessment/Center for the Study of Evaluation.
Berry, D. C., & Hart, A. E. (1990). Evaluating expert systems. Expert Systems, 7, 199–207.
Beyer, B. K. (1988). Developing a thinking skills program. Boston, MA: Allyn & Bacon.
Brazile, R., & Swigger, K. (1988). GATES: An expert system for airlines. IEEE Expert, 3, 33–39.
Camp, R. C. (1989). The search for industry best practices that lead to superior performance. Milwaukee, WI: Quality Press (American Society for Quality Control).
Cheney, S. (1998). Benchmarking. ASTD Info-line (Issue 9801). Alexandria, VA: American Society for Training and Development.
Ericsson, K. A., & Smith, J. (1991). Toward a general theory of expertise. Cambridge: Cambridge University Press.
Garibaldi, J. M., Westgate, J. A., & Ifeachor, E. C. (1999). The evaluation of an expert system for the analysis of umbilical cord blood. Artificial Intelligence in Medicine, 17, 109–130.
Hanks, S., Pollack, M. E., & Cohen, P. R. (1993). Benchmarks, test beds, controlled experimentation, and design of agent architectures. AI Magazine, 14(4), 17–42.
Hayes, C. C. (1997). A study of solution quality in human expert and knowledge-based system reasoning. In P. J. Feltovich, K. M. Ford, & R. R. Hoffman (Eds.), Expertise in contexts: human and machine (pp. 339–362). Menlo Park, CA/Cambridge, MA: American Association for Artificial Intelligence Press/The MIT Press.
Keravnou, E. T., & Washbrook, J. (2001). Abductive diagnosis using time-objects: criteria for the evaluation of solutions. Computational Intelligence, 17(1), 87–131.
Landry, P. (1992). Benchmarking, a competitive strategy for the 1990s. In Proceedings of the APICS 35th international conference and exhibition (pp. 54–55). Falls Church, VA: American Production and Inventory Control Society.
Letmanyi, H. (1984). Assessment of techniques for evaluating computer systems for federal agency (Final Report). Washington, DC: National Bureau of Standards (DOD), Institute for Computer Sciences and Technology.
Mayer, R. E., & Wittrock, M. C. (1996). Problem-solving transfer. In D. C. Berliner, & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 47–62). New York: Simon & Schuster Macmillan.
Moret-Bonillo, V., Mosqueira-Rey, E., & Alonso-Betanzos, A. (1997). Information analysis and validation of intelligence monitoring systems in intensive care units. IEEE Transactions on Information Technology in Biomedicine, 1(2), 87–99.
Morris, L. W., Davis, M. A., & Hutchings, C. H. (1981). Cognitive and emotional components of anxiety: literature review and a revised worry-emotionality scale. Journal of Educational Psychology, 73, 541–555.
O'Neil, H. F., Jr. (Ed.). (1999). Computer-based performance assessment of problem solving [Special issue]. Computers in Human Behavior, 15(3/4).
O'Neil, H. F. Jr., & Abedi, J. (1996). Reliability and validity of a state metacognitive inventory: potential for alternative assessment. Journal of Educational Research, 89, 234–243.
O'Neil, H. F. Jr., & Baker, E. L. (1994a). Introduction. In H. F. O'Neil Jr., & E. L. Baker (Eds.), Technology assessment in software applications (pp. 1–12). Hillsdale, NJ: Lawrence Erlbaum Associates.
O'Neil, H. F. Jr., & Baker, E. L. (Eds.). (1994b). Technology assessment in software applications. Hillsdale, NJ: Lawrence Erlbaum Associates.
O'Neil, H. F. Jr., Baker, E. L., Jacoby, A., Ni, Y., & Wittrock, M. (1990). Human benchmarking studies of expert systems (Report to DARPA, Contract No. N00014-86-K-0395). Los Angeles: University of California, Center for Technology Assessment/Center for the Study of Evaluation.
O'Neil, H. F. Jr., Baker, E. L., & Matsuura, S. (1992). Reliability and validity of Japanese trait and state worry and emotionality scales. Anxiety, Stress, and Coping, 5, 225–239.
O'Neil, H. F. Jr., Baker, E. L., Ni, Y. J., Jacoby, A., & Swigger, K. M. (1994). Human benchmarking for the evaluation of expert systems. In E. L. Baker, & H. F. O'Neil Jr. (Eds.), Technology assessment in software applications (pp. 13–45). Hillsdale, NJ: Lawrence Erlbaum Associates.
O'Neil, H. F. Jr., & Herl, H. E. (1998). Reliability and validity of a trait measure of self-regulation. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
O'Neil, H. F. Jr., Ni, Y., Jacoby, A., & Swigger, K. (1990). Human benchmarking of expert systems (Technical report). Los Angeles: University of California, Center for Technology Assessment/Center for the Study of Evaluation.
Pandit, V. B. (1994, May). Artificial intelligence and expert systems: a technology update. In Proceedings of the 1994 IEEE instrumentation and measurement technology conference (Vol. 3, pp. 77–81). IEEE Instrumentation and Measurement Technology Conference, Hamamatsu, Japan.
Paris, S. G., & Paris, A. H. (2001). Classroom applications of research on self-regulated learning. Educational Psychologist, 36, 89–101.
Park, H.-J., Kim, B. K., & Lim, K. Y. (2001). Measuring the machine intelligence quotient (MIQ) of human-machine cooperative systems. IEEE Transactions on Systems, Man, and Cybernetics, 31(2), 89–96.
Pintrich, P. R., & De Groot, E. V. (1990). Motivational and self-regulated learning components of classroom academic performance. Journal of Educational Psychology, 82, 33–40.
Sharma, R. S., & Conrath, D. W. (1992). Evaluating expert systems: the socio-technical dimensions of quality. Expert Systems, 9, 125–137.
Sieber, J. E., O'Neil, H. F., & Tobias, S. (1977). Anxiety, learning, and instruction. Hillsdale, NJ: Lawrence Erlbaum Associates.
Swigger, K. M. (1994). Assessment of software engineering. In H. F. O'Neil Jr., & E. L. Baker (Eds.), Technology assessment in software applications (pp. 153–176). Hillsdale, NJ: Lawrence Erlbaum Associates.
Talebzadeh, H., Mandutianu, S., & Winner, C. F. (1995). Countrywide loan-underwriting expert system. AI Magazine, 16(1), 51–64.
Wei, C.-P., Hu, P. J.-H., & Liu-Sheng, O. R. (2001). A knowledge-based system for patient image prefetching in heterogeneous database environments: modeling, design and evaluation. IEEE Transactions on Information Technology in Biomedicine, 5(1), 33–45.
Weinstein, C. E., & Mayer, R. E. (1986). The teaching of learning strategies. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 315–327). New York: Macmillan.