An expert natural language interface for statistical packages


Expert Systems With Applications, Vol. 5, pp. 71-77, 1992. Printed in the USA. 0957-4174/92 $5.00 + .00. © 1992 Pergamon Press Ltd.



RICHARD LYCZAK AND SYLVIA WEBER-RUSSELL
University of New Hampshire, Durham, NH

Abstract--A natural language interface has been developed to facilitate the use of statistical packages. Queries are parsed into "case frames" based on statistical primitives. A rule-based expert system uses the case frame to choose a statistical test and generate a batch file that, when executed, answers the query.

1. INTRODUCTION

Statistical packages are used to analyze data in a wide variety of academic, business, and research settings. Users are typically experts in some field or discipline that requires the analysis of data, but have limited backgrounds in statistics and computing. The purpose of this project was to facilitate the use of statistical packages by developing an interface that allows the user to ask research questions in natural English, without having to specify the statistical test to be used.

The most extensive use of natural language interfaces (NLIs) has been in the area of database query systems. Several such systems have found their way into commercial applications (Brown, 1988; Hendrix & Walter, 1987; Winston, Taylor, & Leeds, 1989), and studies have shown that they are easily mastered by database users (Capindale & Crawford, 1990). While early interfaces were based on augmented transition networks (Woods, Kaplan, & Nash-Webber, 1972) or semantic grammars (Hendrix, Sacerdoti, Sagalowicz, & Slocum, 1978; Thompson & Thompson, 1975; Waltz, 1978), more recent systems have employed an intermediate representation language (IRL) (Bates & Bobrow, 1983; Grosz, Appelt, Martin, & Pereira, 1987; Warren & Pereira, 1982), which is then translated into the database query language. Although IRL systems typically construct the IRL representation of a query from a syntactic parse tree, methods do exist for going directly from natural language to the IRL representation. One such system is the "conceptual analyzer" developed by Riesbeck (1975) and refined by Birnbaum and Selfridge (1981). The conceptual analyzer is based on Schank's (1975) conceptual dependency theory, which uses a small number of "conceptual primitives" to represent actions or states. Although this approach is well suited for applications that have a small number of primitive operations, such as database management, it has not been applied to the development of database query systems.

Like database management systems, statistical packages use a query language to obtain information from a highly structured data file. While a database query typically specifies subsets of data to be retrieved, a statistical package query typically specifies subsets of data to be described or compared. Writing a statistical query requires considerable statistical expertise since the user must specify not only the data to be analyzed but also the statistical test to be used in the analysis. Because of the complexity involved in choosing an appropriate statistical test, numerous expert systems have been developed to guide the selection process (Blum, 1982; Gale, 1986; HaKong & Hickman, 1985; Hand, Ozsoyoglu, & Cubitt, 1990; Jamison & Metzler, 1985; Marion, 1987; Smith, Lee, & Hand, 1983). Among the products that are commercially available are Statistical Navigator and Exsample by Idea Works (Coffee, 1990). The former guides the selection of a statistical test and the latter guides the selection of a statistical sample.

While most statistical expert systems employ menus or question the user about features of the research design, Bucci, Lella, and Pavan (1985) appear to have incorporated some level of statistical expertise into an NLI, which queries a demographic database maintained by the Italian government. However, this expert NLI is dedicated to a specific statistical database and offers little guidance for the development of more general systems.

While the work of Bucci et al. (1985) was limited in scope, we believe that the approach of building statistical expertise into an NLI for statistical packages holds promise. An ideal interface would accept a natural language query, choose an appropriate statistical test, then generate and execute the code needed to answer the query. We have therefore developed a system called "EXPERSTAT," which includes a natural language interpreter based on Riesbeck's (1975) conceptual analyzer and an expert system based on the principles employed by Marion (1987). In its current form, this interface generates and executes code for SPSS, a statistical package widely used in the social sciences. However, since the NLI, expert system, and code generation modules are all independent, the system could readily be adapted to work with any statistical package. Unlike the system of Bucci et al., it can also be used with any predefined data set.

Requests for reprints should be sent to Dr. Richard Lyczak, 301 Pettee House, University of New Hampshire, Durham, NH 03824.

2. OVERVIEW

Processing takes place in four phases. First, a dictionary is constructed that contains labels from the data file, key statistical terms, and various connectives such as prepositions, relative pronouns, and interrogatives. Second, a statistical query is obtained from the user. Queries must be entered as questions that make reference to variables in the data file. They must end with a question mark and contain no other punctuation. Queries may be entered interactively or in a file. Parsing the query involves looking up each word in the dictionary and applying functions associated with words in the dictionary to construct a case frame that represents the meaning of the query. Third, the completed case frame is examined by the expert system to select an appropriate statistical test. Finally, the expert system passes the case frame to the code-generating routine for the statistical test chosen. Code is generated from the contents of the case frame and executed. The output provides an answer to the query. Once a query has been answered, phases two through four are repeated until there are no more queries. Let us consider each of these four phases in detail.

3. DICTIONARY CONSTRUCTION

Like most statistical packages, SPSS requires the user to define the data set by assigning labels to each variable. Labels may also be assigned to individual values within a variable (e.g., within the variable SEX, 1 = male, 2 = female). EXPERSTAT begins by reading these labels into a hash table, which serves as its dictionary. In addition to "variable labels" and "value labels," the dictionary contains "key words" (e.g., mean, difference, relationship), which determine the type of statistical operation being requested, and certain "connectives" (e.g., prepositions, relative pronouns, interrogatives, conjunctions, and certain helping verbs), which define the relationship between labels and key words.

In keeping with Riesbeck's approach, each word in the dictionary is associated with one or more "requests," which are actually executable routines designed to set up and fill the slots in a case frame. These requests are similar to production rules in that they contain a test and an action to be carried out if the test succeeds (e.g., "if the query contains this word, set up the case frame for descriptive statistics"). When a word is found in the dictionary, a list of its requests is returned. It is important to note that only the user-defined labels are added to the dictionary at run time. All of the remaining words and their requests are built into the system.

4. PARSING

The parser starts at the beginning of the query and looks up each word in the dictionary. If the word is found, the requests returned are added to a list of requests called the RLIST. If the word is not found, the parser tries a variation of the word by adding or removing suffixes, such as "s" or "es," and substituting characters, such as "man" for "men." If neither the word nor a variation of the word can be found, the parser simply moves to the next word in the query. In that way, only "significant" words are used to construct a representation of the query's meaning. This focus on significant words is made possible by the very restricted context (i.e., analysis of a specific data file) in which the queries are interpreted.

After each addition to the RLIST, the parser executes all of the requests on the list. Most requests contain a conditional clause. If the conditions of the request are met and some action is performed related to setting up or filling a case frame, the request is removed from the RLIST. Otherwise, it remains on the list until its conditions are met.

Ultimately, all requests are aimed at representing the query's meaning with a completed case frame. However, since the parser processes a query sequentially, it is often necessary to store concepts temporarily until it is known what type of case frame is needed to represent the query. This list of concepts, which have been encountered but not used, is called the CLIST. The conditional clauses of requests usually refer to the relative positions of items on the CLIST. For instance, "if the CLIST contains an item suggesting that descriptive statistics are required and that item is preceded by the name of a variable, set up a DESCRIBE case frame and insert the variable in its OBJECT slot." Once an item on the CLIST has been used, it is removed from the list.

All queries are interpreted in terms of five "primitive" statistical operations: describe, tabulate, compare, regress, and relate. Each primitive operation is represented by a different case frame. Slots in these case frames are designed to collect the information needed to carry out that type of analysis. The case frames constructed by our parser are analogous in concept to Schank's "conceptual case frames." That is, our parser, consistent with Schank's theory as implemented by Riesbeck, links concepts in a sentence to a governing primitive in a conceptual or semantic frame. In Schank's frames, this primitive is a conceptual act or state. In our higher level frames, however, the governing primitive is the desired statistical operation, regardless of the conceptual acts or states underlying the actual query. The operation called for by a specific query is determined by the presence of certain "key words" in the query. Examples of key words for each operation are:

DESCRIBE: average, descriptive, mean, deviation
TABULATE: fraction, frequency, many, percent, portion
COMPARE: compare, differ, comparative forms of adjectives
REGRESS: affect, depend, determine, effect, impact
RELATE: correlation, relate, relationship

The first request associated with each of these key words in the dictionary is to set up the case frame for the appropriate statistical operation. Additional requests are aimed at filling the slots in the case frame and vary from one key word to another depending on where (and in what form) information is likely to be found on the CLIST. For instance, the following queries are both asking for the same comparison, but the requests needed to locate the groups to be compared would be quite different:

Are the GPAs of males higher than the GPAs of females?

Do the GPAs of the two sexes differ?

Even so, many keywords do share the same set of requests. When this occurs, the dictionary entry for those words is simply a reference to a request list stored elsewhere in the dictionary. Having several words referring to the same generic request list considerably reduces the size of the dictionary. All case frames contain slots for the variable or variables to be analyzed (e.g., the dependent variable), a definition of the sample to be examined, and any subsamples that need to be analyzed separately. Where cause-effect relationships are being explored, there may also be slots for an independent variable and the names of groups within the independent variable that are to be included in the analysis.
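The dictionary lookup, the RLIST, and the CLIST described in Sections 3 and 4 can be illustrated with a small Python sketch. It is not EXPERSTAT's implementation: the function names, the dictionary-based frame with only "variables" and "values" slots, and the two request types are all assumptions made for illustration, and the real requests inspect positions on the CLIST in ways this sketch does not attempt.

```python
# Hypothetical sketch of a request-driven parser; every name here is illustrative.

# Key words for each primitive operation, taken from the examples in the text.
KEY_WORDS = {
    "DESCRIBE": ["average", "descriptive", "mean", "deviation"],
    "TABULATE": ["fraction", "frequency", "many", "percent", "portion"],
    "COMPARE": ["compare", "differ"],
    "REGRESS": ["affect", "depend", "determine", "effect", "impact"],
    "RELATE": ["correlation", "relate", "relationship"],
}


def make_setup(operation):
    """Request that sets up a case frame if none exists yet."""
    def request(state):
        if state["frame"] is None:
            state["frame"] = {"operation": operation, "variables": [], "values": []}
        return True  # fired, so it leaves the RLIST
    return request


def make_fill(slot, concept):
    """Request that parks a concept on the CLIST until a frame can take it."""
    def request(state):
        if state["frame"] is None:
            if concept not in state["clist"]:
                state["clist"].append(concept)
            return False  # condition not met, so it stays on the RLIST
        if concept in state["clist"]:
            state["clist"].remove(concept)
        if concept not in state["frame"][slot]:
            state["frame"][slot].append(concept)
        return True
    return request


def build_dictionary(variable_labels, value_labels):
    """Map each known word to its list of requests."""
    dictionary = {}
    for operation, words in KEY_WORDS.items():
        for word in words:
            dictionary[word] = [make_setup(operation)]
    for variable in variable_labels:
        dictionary[variable.lower()] = [make_fill("variables", variable.lower())]
    for variable, labels in value_labels.items():
        for label in labels:
            concept = (variable.lower(), label.lower())
            dictionary[label.lower()] = [make_fill("values", concept)]
    return dictionary


def lookup(word, dictionary):
    """Try the word itself, then simple suffix and character variations."""
    candidates = [word, word + "s", word + "es", word.replace("men", "man")]
    if word.endswith("es"):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    for candidate in candidates:
        if candidate in dictionary:
            return dictionary[candidate]
    return None  # not a significant word; the parser skips it


def parse(query, dictionary):
    state = {"frame": None, "clist": []}
    rlist = []
    for word in query.rstrip("?").lower().split():
        requests = lookup(word, dictionary)
        if requests is None:
            continue
        rlist.extend(requests)
        # Re-run the pending requests until none of them fires.
        fired = True
        while fired:
            remaining = [request for request in rlist if not request(state)]
            fired = len(remaining) < len(rlist)
            rlist = remaining
    return state["frame"]
```

With a dictionary built from the labels GPA and SEX (values male and female), the query "Do the GPAs of males and females differ?" leaves the three labels waiting on the CLIST until the key word "differ" sets up a COMPARE frame, at which point the pending requests fire and fill it.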

5. EXPERT SYSTEM

When parsing is finished, the completed case frame is examined by a rule-based expert system, which determines the appropriate statistic needed to answer the query. Decisions are based on four factors:

1. the type of primitive operation;
2. the number of variables involved;
3. each variable's level of measurement (nominal, ordinal, interval/ratio); and
4. the number of groups involved in the analysis.

Thus, a typical rule might state:

IF the operation is a comparison
AND there is a single variable in the INDEPENDENT VARIABLE slot of the case frame
AND the level of measurement for that single variable is nominal
AND there is a single variable in the DEPENDENT VARIABLE slot of the case frame
AND the level of measurement for that single variable is interval
AND the number of groups in the GROUPS slot is 2
THEN the statistic needed is a "t-test."

In this case, the case frame would be passed to the routine in the code generator that generates code for t-tests.

In addition to choosing a statistical test, the expert system module also traps errors. If no case frame has been created, or key slots in a case frame are empty, the expert system calls error message routines that provide feedback to the user. For instance, if a COMPARE case frame arrived with the INDEPENDENT VARIABLE slot completed but the DEPENDENT VARIABLE slot empty, the user might receive the following message:

The word HIGHER implies that you wish to compare two or more groups with respect to some dependent variable.

It appears that the groups are: MALE FEMALE. However, you have not mentioned the dependent variable.

Please rephrase your query so that it identifies both the groups and the dependent variable using the names of variables and values from your data file.
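The t-test rule and the error trapping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' rule base: the Rule fields, the QueryError exception, the frame layout, and the shortened message text are hypothetical, and only the single rule quoted in the text is encoded.

```python
from dataclasses import dataclass


class QueryError(Exception):
    """Raised when the case frame is missing or a key slot is empty."""


@dataclass(frozen=True)
class Rule:
    operation: str
    independent_level: str
    dependent_level: str
    n_groups: int
    statistic: str


# Only the rule quoted in the text is encoded; a real rule base would hold many.
RULES = [Rule("compare", "nominal", "interval", 2, "t-test")]


def choose_test(frame, levels):
    """Return the statistic for a case frame, or None if no rule matches.

    `levels` maps each variable name to its level of measurement.
    """
    if frame is None:
        raise QueryError("No statistical operation was recognized in the query.")
    if frame["operation"] == "compare" and not frame["dependent"]:
        groups = " ".join(name.upper() for name in frame["groups"])
        raise QueryError(
            f"It appears that the groups are: {groups}. "
            "However, you have not mentioned the dependent variable."
        )
    for rule in RULES:
        if (
            frame["operation"] == rule.operation
            and len(frame["independent"]) == 1
            and levels[frame["independent"][0]] == rule.independent_level
            and len(frame["dependent"]) == 1
            and levels[frame["dependent"][0]] == rule.dependent_level
            and len(frame["groups"]) == rule.n_groups
        ):
            return rule.statistic
    return None
```

Keeping the rules as data rather than as nested conditionals is a design choice of this sketch; it makes it easy to see which combinations of operation, level of measurement, and group count are covered.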


In the event that the user cannot remember the variable labels and value labels used in the data file, EXPERSTAT includes a utility that will list them on the screen.

6. CODE GENERATOR

As noted earlier, the code generator currently in use produces executable code for SPSS. The code generation module consists of a separate code generation routine for each of the statistical tests that can be "recommended" by the expert system. These routines simply construct a sequence of SPSS commands from the information contained in the case frame slots. For instance, the query:

Among freshmen who work more than 20 hours per week are the GPAs of males and females significantly different?

would be represented by the following case frame:

OPERATION: compare
INDEPENDENT VARIABLE(S): sex
GROUPS: male (1) female (2)
DEPENDENT VARIABLE: gpa
SAMPLE: class eq 1, work gt 20
SUBSAMPLES: none

From this case frame the code generator would produce the following SPSS code:

GET FILE = STUDENTS
SELECT IF CLASS EQ 1 AND WORK GT 20
T-TEST GROUPS = SEX (1 2) /VARIABLES = GPA
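A code generation routine of this kind might look like the following Python sketch. The frame layout (a dictionary with "sample" stored as variable/operator/value triples and "groups" as label/code pairs) and the function name are assumptions made for illustration; the command text simply reproduces the SPSS lines shown in the example.

```python
def generate_t_test(frame, data_file):
    """Build the SPSS commands for a t-test from a (hypothetical) case frame."""
    lines = [f"GET FILE = {data_file}"]

    # SAMPLE slot: each condition becomes one clause of a SELECT IF command.
    if frame["sample"]:
        clauses = [
            f"{variable} {operator} {value}".upper()
            for variable, operator, value in frame["sample"]
        ]
        lines.append("SELECT IF " + " AND ".join(clauses))

    # GROUPS slot: only the numeric codes of the value labels are emitted.
    grouping = frame["independent"][0].upper()
    codes = " ".join(str(code) for _label, code in frame["groups"])
    dependent = frame["dependent"][0].upper()
    lines.append(f"T-TEST GROUPS = {grouping} ({codes}) /VARIABLES = {dependent}")
    return "\n".join(lines)


example_frame = {
    "operation": "compare",
    "independent": ["sex"],
    "groups": [("male", 1), ("female", 2)],
    "dependent": ["gpa"],
    "sample": [("class", "eq", 1), ("work", "gt", 20)],
    "subsamples": [],
}
print(generate_t_test(example_frame, "STUDENTS"))
```

For the case frame in the example, the sketch prints the same three commands as the SPSS code given for this query.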

This assumes, of course, that there is an SPSS system file called STUDENTS, which contains the variables "class," "work," "sex," and "gpa." It also assumes that within "class" the value 1 has been labeled "freshman" and within "sex" the values 1 and 2 have been labeled "male" and "female."

7. SPECIAL PARSING PROBLEMS

7.1. Noun Groups

As pointed out by Birnbaum and Selfridge (1981), special provisions must be made for parsing noun groups to prevent "premature" decisions about the role of each noun. In their example, encountering the word "stairway" in the sentence "George sat on the stairway handrail" could result in the premature filling of the case frame slot containing George's location. Their solution to this problem was to process only the requests associated with individual nouns until the end of the noun group was reached, at which point all requests on the RLIST were processed. We extended this approach to cover entire phrases. Consider the queries:

Are the mean GPAs of senior males and senior females different?

Are the mean GPAs of senior males and junior males different?

Deferring action on the word "senior" until the end of the noun group "senior males" is reached would not guarantee a correct interpretation. The true role of "senior" is not known until the end of the entire prepositional phrase. In the first query, "senior" defines the sample to be analyzed in a comparison of males and females; in the second, it defines one of the two groups being compared in a sample of males. We, therefore, collect all nouns (which are always variable labels or value labels) on a temporary list and determine which slots they should fill in the case frame when the end of the phrase is encountered.
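One heuristic that reproduces the two interpretations above is sketched below in Python. The paper does not state the rule EXPERSTAT actually applies at the end of the phrase, so the rule used here (labels shared by every conjunct define the sample, the remaining labels define the groups) and the function name are assumptions.

```python
def assign_roles(conjuncts):
    """Decide slot roles once the whole phrase has been collected.

    `conjuncts` holds one list of (variable, value label) pairs per noun
    group joined by "and", e.g. "senior males" and "senior females".
    """
    first, rest = conjuncts[0], conjuncts[1:]
    # Labels that appear in every conjunct restrict the sample.
    shared = [label for label in first if all(label in other for other in rest)]
    # Labels that differ between conjuncts name the groups being compared.
    groups = []
    for conjunct in conjuncts:
        for label in conjunct:
            if label not in shared and label not in groups:
                groups.append(label)
    return {"sample": shared, "groups": groups}


# "senior males and senior females": seniors are the sample, sexes the groups.
first_query = [
    [("class", "senior"), ("sex", "male")],
    [("class", "senior"), ("sex", "female")],
]
# "senior males and junior males": males are the sample, classes the groups.
second_query = [
    [("class", "senior"), ("sex", "male")],
    [("class", "junior"), ("sex", "male")],
]
print(assign_roles(first_query))
print(assign_roles(second_query))
```

The decision can only be made after both conjuncts are available, which is why the nouns are held on a temporary list rather than assigned to slots as they are read.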

7.2. Premature Case Frame Selection

Premature decisions cannot always be avoided. In the above queries, for instance, the word "mean" is a key word that triggers construction of the DESCRIBE case frame. When the end of the prepositional phrase is encountered, the OBJECT and SAMPLE slots of that case frame are filled. However, the next word, "different," is a key word that triggers construction of the COMPARE case frame. Since comparisons often involve means, but descriptive statistics do not involve differences, it is clear that this case of "conflicting case frames" can be resolved by moving the contents of the DESCRIBE frame into appropriate slots of the COMPARE frame and destroying the former. Fortunately, the DESCRIBE/COMPARE conflict appears to be an exception because queries typically contain just a single key word.
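The resolution of a DESCRIBE/COMPARE conflict can be sketched in Python as follows. The text says only that the contents of the DESCRIBE frame are moved into "appropriate slots" of the COMPARE frame, so the particular mapping used here (OBJECT to DEPENDENT VARIABLE, SAMPLE to SAMPLE), the dictionary layout, and the function name are assumptions.

```python
def add_key_word(frame, operation):
    """Return the case frame to use after a key word for `operation` is seen."""
    if frame is None:
        return {"operation": operation}
    if frame["operation"] == "describe" and operation == "compare":
        # Assumed mapping: OBJECT -> DEPENDENT VARIABLE, SAMPLE -> SAMPLE.
        # The DESCRIBE frame is discarded once its contents are copied.
        return {
            "operation": "compare",
            "dependent": list(frame.get("object", [])),
            "sample": list(frame.get("sample", [])),
            "independent": [],
            "groups": [],
        }
    # Any other combination keeps the existing frame in this sketch.
    return frame


describe = {"operation": "describe", "object": ["gpa"], "sample": [("class", "senior")]}
print(add_key_word(describe, "compare"))
```

The asymmetry is deliberate: a COMPARE frame followed by a key word such as "mean" is left unchanged, since comparisons often involve means.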

7.3. Logical Operators

Statistical queries tend to be relatively free of ambiguity when compared to everyday speech. Research questions, by their very nature, call for a precise use of language. Problems can arise, however, when the rules of formal logic conflict with the way we normally interpret human speech. Consider the following query:


What is the average GPA for students who are not males or seniors?

The rules of formal logic give highest precedence to "not," followed by "and," followed by "or." Applying these rules, we would interpret this query to be requesting the average for students "who are not males or who are seniors." The typical human interpretation, on the other hand, is likely to be that an average is being requested for students "who are not males and who are not seniors." The dilemma here was whether to interpret all queries in terms of the well-established rules governing the precedence of logical operators or generate a whole new set of precedence rules based on common usage. The former would likely satisfy the professional researcher who is familiar with the rules of logic but would confound the casual user, while the latter would likely satisfy the casual user but would confound someone schooled in formal logic.

Our solution was a compromise. Logical precedence rules remain in effect, but one further rule has been added. "Or" is given higher precedence than "not," thereby creating a complete cycle of precedences: "not" precedes "and" precedes "or" precedes "not." This scheme produces interpretations that would be expected by users familiar with logical operations, but makes an exception for queries of the type noted above to conform to common usage.

One additional exception has been made to a strictly logic-based interpretation of queries. As noted by Templeton and Burger (1986) in their discussion of End-User Friendly Interface to Data Management (EUFID), the natural language use of "and" and "or" in database (and statistical) queries does not always correspond to their logical meaning. For instance, the query:

What is the median GPA for students who are juniors AND seniors?

actually means:

What is the median GPA for students who are juniors OR seniors?

On the other hand, the query:

What is the median GPA for students who are juniors and female?

means exactly what it says. Clearly, "juniors" and "seniors" are mutually exclusive categories, while "juniors" and "females" are not. EXPERSTAT can detect mutual exclusivity by checking whether two value labels come from the same variable. If they do, it converts the "and" operator to "or."

8. EVALUATION

EXPERSTAT has been tested with over 500 queries generated by the authors on three different data files. One file contained data on individual students, another contained data about colleges and universities, and a third contained health-related data on each of the 50 states. Table 1 contains a sample query and resulting SPSS code involving each of the five primitive operations in each of the three sample data files. Notice in reading this table that the use of user-defined labels sometimes requires departure from a truly natural language format. For example, since variable labels in SPSS cannot exceed eight characters, the label for class rank must be spelled CLASRANK. Also, since value labels cannot contain spaces, words within the label must sometimes be run together (OVER65) or separated by underscores (HAVE_NO_SEATBELT_LAW). These problems could be readily overcome by soliciting synonyms from the user and inserting them in the dictionary.

A final test was conducted by obtaining 50 queries from 4 university instructors who use statistical packages in their courses. EXPERSTAT generated correct analyses for 41 (82%) of these 50 queries. Five of the remaining queries resulted in a request for clarification; two resulted in incorrect analyses; and two resulted in no output at all. The system does not currently handle analyses involving covariates, repeated measures, or multiple dependent variables. Apart from these exceptions, it has been able to interpret and generate SPSS code for a very wide range of complex statistical queries. With further testing and refinement, we believe that this approach would lead to a very reliable interface for statistical applications.

9. CONCLUSIONS

EXPERSTAT has demonstrated the feasibility of integrating an expert system and natural language processor to construct an intelligent interface for statistical packages. In addition to being a new application of natural language processing techniques, the project has employed several innovations in the way that it focuses on key words and labels, the way it handles noun groups, and particularly in the way it has resolved conflicts between formal logic and common language usage. Also noteworthy is its flexibility. EXPERSTAT can be applied to any predefined data set and can be readily adapted to any statistical package. Only the modules that generate code and that read labels from the user's data file need to be rewritten when the interface is transported to a new package.

Future work on EXPERSTAT will be aimed at extending the range of statistical queries for which it can accurately generate code. Additional testing will be used to reveal and correct parsing problems with particular syntactic structures, and additional parsing and code generation routines will be written to deal with repeated measures designs.

TABLE 1
Queries From Sample Data Files and Their Resulting SPSS Code

WITHIN THE SENIOR CLASS WHAT IS THE RANGE OF GPAS FOR EACH SEX?
    GET FILE=STUDENTS
    SELECT IF (CLASS EQ 4)
    TEMPORARY
    SELECT IF SEX EQ 1
    DESCRIPTIVES VARIABLES= GPA
    TEMPORARY
    SELECT IF SEX EQ 2
    DESCRIPTIVES VARIABLES= GPA

WHAT PROPORTION OF MALE STUDENTS WHO WORK OVER 20 HOURS PER WEEK ARE SENIORS?
    GET FILE=STUDENTS
    SELECT IF WORK GT 20
    CROSSTABS SEX (1 2) BY CLASS (1 4)

AMONG STUDENTS OVER 21 YEARS OF AGE WHAT IS THE CORRELATION BETWEEN GPA AND THE NUMBER OF HOURS HE/SHE WORKS EACH WEEK?
    GET FILE=STUDENTS
    SELECT IF (AGE GT 21)
    CORRELATIONS VARIABLES= GPA WORK

DO MALES HAVE A HIGHER CLASRANK THAN FEMALES?
    GET FILE=STUDENTS
    NPAR TESTS M-W = CLASRANK BY SEX (1 2)

TO WHAT EXTENT IS GPA AFFECTED BY AGE AMONG STUDENTS WHO ARE NOT FRESHMEN OR SOPHOMORES?
    GET FILE=STUDENTS
    SELECT IF (CLASS NE 1) AND (CLASS NE 2)
    REGRESSION VARIABLES = AGE /DEPENDENT = GPA

HOW STRONG IS THE RELATIONSHIP BETWEEN MATH AND VERBAL SAT SCORES AMONG COLLEGES IN NEW-HAMPSHIRE AND VERMONT?
    GET FILE=ACADEMIC
    SELECT IF (REGION EQ 3) OR (REGION EQ 5)
    CORRELATIONS VARIABLES= MATH VERBAL

IS A COLLEGE'S TUITION DETERMINED BY REGION AND WHETHER IT IS PUBLIC OR PRIVATE?
    GET FILE=ACADEMIC
    ANOVA TUITION BY OWNER (1 2) BY REGION (1 6)

IS THE SIZE OF A COLLEGE'S LIBRARY DETERMINED BY ITS TUITION AND ENROLMNT?
    GET FILE=ACADEMIC
    REGRESSION VARIABLES = ENROLMNT TUITION /DEPENDENT = LIBRARY

WHAT IS THE MEDIAN TUITION CHARGED BY VERMONT COLLEGES WITH ENROLMNTS UNDER 2000 STUDENTS?
    GET FILE=ACADEMIC
    SELECT IF ENROLMNT LT 2000 AND (REGION EQ 5)
    DESCRIPTIVES VARIABLES = TUITION

IN MAINE HOW MANY PUBLIC AND PRIVATE COLLEGES ARE THERE WITH TUITIONS EQUAL TO OR GREATER THAN 10000 DOLLARS PER YEAR?
    GET FILE=ACADEMIC
    SELECT IF TUITION GE 10000
    CROSSTABS REGION (1 6) BY OWNER (1 2)

WHAT IS THE AVERAGE DEATH RATE FROM CANCER IN NEW-ENGLAND STATES WHICH HAVE NUKES?
    GET FILE=HEALTH
    SELECT IF (REGION EQ 1 AND NUKE EQ 1)
    DESCRIPTIVES VARIABLES= CANCER

HOW MANY STATES IN WHICH 20% OR MORE OF THE POPULATION IS OVER65 HAVE NO SEATBELT LAW?
    GET FILE=HEALTH
    SELECT IF OVER65 GE 20
    FREQUENCIES VARIABLES=SEATBELT

ARE THE NUMBER OF DOCTORS IN A STATE AND THE STATE'S INCOME LEVEL CORRELATED?
    GET FILE=HEALTH
    CORRELATIONS VARIABLES= DOCTORS INCOME

IN NEW-ENGLAND AND THE MID-ATLANTIC STATES WHAT IS THE EFFECT OF SEATBELT LAWS ON THE NUMBER OF PEOPLE ADMITTED TO HOSPITALS?
    GET FILE=HEALTH
    SELECT IF (REGION EQ 1 OR REGION EQ 2)
    T-TEST GROUPS = SEATBELT (1 0) /VARIABLES = ADMITTED

WHAT EFFECT DOES THE NUMBER OF DOCTORS IN A MID-ATLANTIC STATE HAVE ON ITS DEATH RATE FROM HEART DISEASE?
    GET FILE=HEALTH
    SELECT IF (REGION EQ 2)
    REGRESSION VARIABLES = DOCTORS /DEPENDENT = HEART


REFERENCES

Bates, M., & Bobrow, R.J. (1983). A transportable natural language interface. Proceedings of the Sixth Annual International SIGIR Conference on Research and Development in Information Retrieval. ACM Transactions Database Systems, 17(4), 81-86.

Birnbaum, L., & Selfridge, M. (1981). Conceptual analysis of natural language. In R. Schank & C. Riesbeck (Eds.), Inside computer understanding (pp. 319-353). Hillsdale, NJ: Lawrence Erlbaum Associates.

Blum, R.L. (1982). Discovery and representation of causal relationships from a large time-oriented clinical database: The RX project. Lecture notes in medical informatics. New York: Springer-Verlag.

Brown, A.W. (Ed.). (1988). Querying databases in plain English. IEEE Expert, 3(2), p. 75.

Bucci, P., Lella, G., & Pavan, S. (1985). NLI-ESD: An expert natural language interface to a statistical data bank. Expert Systems and Their Applications, 2, 667-671.

Capindale, R., & Crawford, R. (1990). Using a natural language interface with casual users. International Journal of Man-Machine Studies, 32, 341-361.

Coffee, P. (1990). Two expert-system packages guide users in choice of analyses methods. PC Week, 7, p. 127.

Gale, W. (1986). Artificial intelligence and statistics. Reading, MA: Addison-Wesley.

Grosz, B.J., Appelt, D.E., Martin, P., & Pereira, G. (1987). TEAM: An experiment in the design of transportable natural language interfaces. Artificial Intelligence, 32, 173-243.

HaKong, L., & Hickman, F.R. (1985). Expert systems techniques: An application in statistics. Proceedings of the Fifth Technical Conference of the British Computer Society (pp. 43-63). Cambridge: Cambridge University Press.

Hand, D., Ozsoyoglu, G., & Cubitt, R. (1990). Panel session on statistical expert systems. Database Engineering, 13(3), 52-54.

Hendrix, G., Sacerdoti, E., Sagalowicz, D., & Slocum, J. (1978). Developing a natural language interface to complex data. ACM Transactions Database Systems, 3(2), 105-147.

Hendrix, G., & Walter, B. (1987). The intelligent assistant. Byte, 12(14), 251-258.

Jamison, W., & Metzler, D. (1985). An expert system for statistical consulting. Proceedings of the 48th ASIS Annual Meeting (pp. 293-296). White Plains, NY: Knowledge Industry Publications.

Marion, R. (1987). An expert system for selecting the correct biomedical statistical procedure. Collegiate Microcomputer, 5, 230-236.

Riesbeck, C.K. (1975). Conceptual analysis. In R. Schank (Ed.), Conceptual information processing. New York: Elsevier.

Schank, R.C. (1975). Conceptual information processing. New York: Elsevier.

Smith, A.M.R., Lee, L.S., & Hand, D.J. (1983). Interactive user-friendly interfaces to statistical software. The Computer Journal, 26, 199-204.

Templeton, M., & Burger, J. (1986). Considerations for the development of natural-language interfaces to database management systems. In L. Bolc & M. Jarke (Eds.), Cooperative interfaces to information systems (pp. 67-99). New York: Springer-Verlag.

Thompson, F.B., & Thompson, B.H. (1975). Practical natural language processing: The REL system prototype. In M. Rubinoff & M. Yovits (Eds.), Advances in computers (pp. 109-168). New York: Academic Press.

Waltz, D.L. (1978). An English language question answering system for a large relational database. Communications of the Association for Computing Machinery, 21(7), 526-539.

Warren, D.H.D., & Pereira, F.C.N. (1982). An efficient easily adaptable system for interpreting natural language queries. American Journal of Computational Linguistics, 8(3-4), 110-122.

Winston, T., Taylor, M., & Leeds, R. (1989). Natural language query processing. AI Expert, 4(2), 50-58.

Woods, W.A., Kaplan, R.M., & Nash-Webber, B.L. (1972). The lunar sciences natural language information system: Final report. BBN Rep. 2378. Cambridge, MA: Bolt Beranek & Newman.