Case-based retrieval of software components

Expert Systems With Applications, Vol. 9, No. 3, pp. 397-405, 1995
Copyright © 1995 Elsevier Science Ltd
Pergamon
Printed in the USA. All rights reserved
0957-4174/95 $9.50 + .00

0957-4174(95)00012-7

CARMEN FERNÁNDEZ-CHAMIZO,* PEDRO A. GONZÁLEZ-CALERO, LUIS HERNÁNDEZ-YÁÑEZ, AND ALVARO URECH-BAQUÉ

Dep. Informática, Fac. Física, Universidad Complutense, 28040 Madrid, Spain

Abstract-A major problem concerning the reusability of software is the retrieval of software components. Different approaches have been followed to solve this problem. In this paper we present the Reuse Assistant, a hybrid approach to support the retrieval of software components from a library of object classes. The Reuse Assistant consists of two subsystems that follow two different approaches: information retrieval techniques based on statistical methods, and knowledge-based techniques using some of the representation and indexing mechanisms found in case-based systems. The Information Retrieval approach grants system extendibility, and permits the use of a natural language interface. The Case-Based approach enables reasoning about concepts, allowing the retrieval of "approximate" components. Both subsystems can be operated from a common interface, where free-text and form-filling queries can be posed.

This work is supported by the Spanish Committee of Science & Technology (CICYT, TIC92-0058).
* Corresponding author. Requests for reprints should be sent to Carmen Fernández-Chamizo, Dep. Informática, Fac. Física, Universidad Complutense, 28040 Madrid, Spain. E-mail: cfernan@dia.ucm.es

1. INTRODUCTION

Software reuse is widely believed to be one of the most promising technologies for improving software quality and productivity (Biggerstaff & Richter, 1987). Object-oriented languages constitute an important step on the way to reusability (Meyer, 1987). However, the reusability of objects, although promising, must face the problem of classification, storage, and retrieval of reusable components (classes and methods).

By considering the natural language documentation of software components, we could apply some of the general purpose Information Retrieval (IR) techniques (Salton & McGill, 1983) to the retrieval of software components. IR techniques have proved to be useful in certain domains. However, the software domain has been found to be quite different from some of the usual application domains of IR techniques, not only due to the nature of the documents but also due to the special requirements of the user. Some of the special characteristics of the software domain that make IR techniques ill-suited for retrieving software components are described below.

Much research on information retrieval assumes that the required information can be fully specified a priori, and then concentrates on retrieval efficiency. Studies in software design have shown, however, that the integration of problem setting and problem solving is necessary to articulate adequate queries. Therefore, tools are needed for incrementally constructing queries and exploring an information space that can support the evolution of an information need. These tools should aid in comprehension, as well as location, by helping users comprehend the relevance of retrieved information (Henninger, 1991).

In order to evaluate the relevance of the retrieved components, a careful review of their code and documentation is necessary. If the number of retrieved components is too high, it can discourage users and promote programming from scratch instead of reusing software. Therefore, achieving high retrieval precision is crucial in the software domain.

Object-oriented programming, although improving the whole software development process, adds its own drawbacks to the software retrieval problem (Wilde & Huitt, 1992). The use of polymorphism and inheritance introduces a large number of dependencies between components. Dynamic binding increases the number of implementations to be examined, and the dispersion of functionality into different components makes global understanding difficult. All these factors reduce the effectiveness of retrieval.

This paper is concerned with the retrieval of software components. In particular, our work has concentrated on the problem of retrieval of Object-Oriented Programming (OOP) classes. Section 2 summarizes related work in this area, classifying the existing systems into two basic groups: automatic indexing systems and knowledge-based systems. Section 3 analyzes the advantages and disadvantages of both approaches and justifies the need for a hybrid approach. The Reuse Assistant, a system based on this hybrid approach, is also introduced in this section by describing its two subsystems along with the functionality of the user interface. In Section 4 a usage scenario is described. Section 5 is concerned with the evaluation of the retrieval effectiveness in the Reuse Assistant. Section 6 concludes the paper with some remarks on current and further research.

2. THE RETRIEVAL OF SOFTWARE COMPONENTS: RELATED WORK

Traditional approaches to software retrieval fall into two complementary categories: low-level cross-reference tools, which facilitate browsing at the code level, and high-level classification techniques, which emphasize retrieval by software category.

Most of the low-level tools automatically generate a database of relationships between the different program entities, and respond to user queries according to the database information. Examples of such tools are CScope (Steffen, 1985) and CIA (Chen & Ramamoorthy, 1986), which work on C programs. More recent tools, such as Trellis (O'Brien, Halbert, & Killian, 1987) and CIA++ (Grass & Chen, 1990), work on object-oriented programs, managing the complex relations established between classes.

The high-level classification of software components can follow basically two approaches. We can extract the classification information from the component itself (code, documentation) using automatic indexing methods. Or, on the other hand, we can implement a knowledge-based classification scheme using information about the components that lies outside of them. Both high-level approaches are described in the following sections.

2.1. Automatic Indexing Approach

Due to the increasing size of the natural language descriptions of software components in recent libraries, IR techniques based on statistical methods (Salton & McGill, 1983) are becoming more common in component retrieval. This approach extracts information from the natural language documentation of the software components. It does not use any semantic knowledge, nor does it attempt to understand the documentation. The goal of this approach is to characterize each component by a set of indices that are automatically extracted from its natural language documentation. The system proposed in Frakes and Nejmeh (1987) uses an existing IR system, CATALOG, for storing and retrieving C software components. Each component is characterized by a set of single-term indices that are automatically extracted from the natural language headers of C programs.


The GURU system (Maarek, Berry, & Kaiser, 1991) classifies the software components according to attributes automatically extracted from their natural language documentation using an index scheme based on pairs of related words. These kinds of indices, along with some clustering methods used by the GURU system, have proved to be successful when applied to the documentation of the Unix command set. GURU has also been applied to an object-oriented class library (Helm & Maarek, 1991) by complementing it with domain-specific approaches based on code analysis. Some recent proposals (Girardi & Ibrahim, 1993) include a partial natural language analysis of the component descriptions in order to improve the retrieval effectiveness. As the information provided by IR tools is derived automatically, this approach presents advantages in cost, transportability, and scalability. Statistical methods, however, cannot be a substitute for meaning.

2.2. Knowledge-Based Approach

There is a growing interest in the potential contributions of Artificial Intelligence to Software Engineering. Development of knowledge-based tools for software reusability is one of the most promising research topics in this area. The key feature of this approach is that it draws semantic information about software components from a human expert. Knowledge-based systems are often very sophisticated. Unfortunately, as a trade-off they require domain analysis and a great deal of pre-encoded, manually provided semantic information.

Prieto-Diaz (Prieto-Diaz & Freeman, 1987) created a classification scheme based on library science. He proposes a set of six facets: three related to the functionality of the component and three related to its environment. The different values a facet can have are called terms. These terms are organized in a conceptual graph that represents manually encoded knowledge about the domain.

The system proposed in Wood and Sommerville (1988) uses Conceptual Dependency (Schank, 1972) to represent knowledge about software components. This knowledge is encoded in the component descriptor frames that represent the function performed by the component and the objects manipulated by the function.

Embley and Woodfield (1987) define a knowledge structure for a software library consisting of abstract data types (ADTs). This knowledge structure supports different relations among ADTs, and it includes natural language descriptions and keywords to assist in finding and browsing activities.

The LaSSIE system (Devanbu, Ballard, Brachman, & Selfridge, 1991) embodies a frame-based knowledge representation. Software components are described in terms of the operations they perform. Each of these actions is described by providing its actor, object, recipient, agent, environment, and so on. These relationships are coded in a specialized knowledge representation system that classifies them into a conceptual hierarchy.

All these knowledge-based systems have several things in common. They all represent the knowledge in frame-like structures with similar sets of slots and fillers. They all organize the knowledge around the functions performed by the components, although some of them also include frame representations of the objects involved. In every single case, the characterization of the components with slots and fillers is done manually, following a pre-established model of the domain.

Another approach is undertaken in CODEFINDER (Henninger, 1991). This system faces a slightly different goal. It explores the problem of finding program examples that are relevant to a design task. CODEFINDER uses an associative spreading activation method for locating software objects.

Curtis (1989) has analyzed different software indexing methods used by knowledge-based systems. He postulates that the effective use of a reusable library will require an indexing scheme similar to the cognitive structures held by most experts in the application area. The identification and representation of those cognitive structures still remain an open issue. Although it is generally assumed that semantic retrieval may lead to more effective software retrieval, present knowledge-based systems are being questioned due to the high cost of building knowledge bases.

3. THE REUSE ASSISTANT: A HYBRID APPROACH

Currently, in terms of retrieval efficiency it is not easy to decide which is the best approach, automatic indexing or knowledge-based, because there are no comparable empirical results about the performance of systems based on those approaches. The differences come from the effort necessary during the knowledge acquisition process, the availability of additional knowledge (not included explicitly in the components), and the type of interface imposed by the selected representation.

Both approaches have advantages and drawbacks. Knowledge-based systems make use of a deep knowledge about the component design and implementation, and interrelate the components through a rich set of relationships expressing system architecture, design decisions and all the information needed to use and, more important, to reuse (adapt) the components. Usually, this information is presented to ease query construction by reformulation, in a form-filling interface or even in a restricted natural language interface (as in LaSSIE), and serves as the basis for some kind of browsing system where dependencies among components can be explored. The major drawback of this approach comes from the need for a manual, high-cost knowledge acquisition process, which handicaps the scalability and extensibility of these systems.

On the other hand, automatic indexing systems make use of the readily available information, performing some kind of analysis on the text (and, sometimes, also on the code) associated with the components. Therefore, extendibility is granted because this process may be automatically applied to any new component added to the software library. The most sophisticated of these systems use advanced techniques, such as clustering, and text or code understanding, to impose a more conceptual structure on the representation, trying to resemble that of a hand-coded knowledge base. But the current state of development of these technologies for information extraction is still far from being capable of extracting the knowledge that can be easily elicited from an expert. Usually, these systems use a natural language interface, where queries are processed in the same way as the text associated with the components.

In this paper we present the Reuse Assistant, a hybrid approach to support the retrieval of software components from a library of object classes (Fernández-Chamizo, Hernández-Yáñez, González-Calero, & Urech-Baqué, 1993). The Reuse Assistant consists of two subsystems: an IR module based on statistical methods, and a knowledge base, indexed with the techniques used in case-based (CB) systems (Riesbeck & Schank, 1989). We intend to maintain the advantages of both approaches while minimizing the drawbacks, and, as stated in Callan and Croft (1993), we believe that a combination of both approaches is superior to either one. In our system, every component is indexed using both techniques.
In this way, the system offers a richer interface, where it is possible to pose queries in natural language (treated by the IR module), to incrementally fill a form specifying the user needs (access to the knowledge base), and to use any retrieved component as an entry point to browse through the knowledge base.

Our work has concentrated on general purpose software libraries for object-oriented languages. The commercially available systems (Smalltalk/80, Smalltalk/V, C++, and Eiffel environments, for example) include some basic libraries with general purpose components for graphical interfacing, data structure manipulation, communication with the host system, and some utilities considered of general interest. When building applications for a particular domain, these components are a crucial part of any implementation. We believe that the components in the basic libraries cannot be considered at the same level, in terms of reusability, as the modules implemented by the users. And, due to their higher reuse potential, we believe that a complete hand-coded representation of those components is well justified, and that it will yield enough benefit to be worth the effort.

User-defined components are indexed through the IR module by a statistical analysis of the comments in the code, so that they are partially included in the retrieval system, which is, to some extent, scalable and extensible. It would be desirable that the users could add the representations of their own components to the knowledge base, but this would cause a number of problems concerning consistency and knowledge quality. The control of the component descriptions introduced in the system by the users as they create new components is still an open issue in our system.

The next subsections describe in more detail the IR and CB modules, and the user interface.

3.1. The IR Subsystem

We have developed a subsystem for class retrieval based on IR techniques (Salton & McGill, 1983). This tool is a modified version of a previously developed prototype (Fernández, Buenaga, & Vaquero, 1993) and it uses an indexing scheme based on pairs of related words according to the notions of lexical affinity (LA) and quantity of information proposed in Maarek et al. (1991). An LA between two units of language stands for a correlation of their common appearance. When two words appear frequently in a single sentence (separated by at most five other words), a potential LA (or lexical relation between the two words, e.g. "system administrator", "remote computer") is identified. In order to reduce the influence of words appearing too often in a given context, only those LAs carrying a large quantity of information (high frequency in a document, low frequency in the rest of the documents) are selected as indices.

The IR subsystem takes the class library documentation in natural language as input and produces the component profiles. A component profile is a set of LA-based indices characterizing the component. These indices are sorted in decreasing order of the quantity of information associated with them. Profiles are obtained by performing a statistical analysis of the distribution of words in the document describing the component and in the whole set of documents (the library documentation). In this process, names of methods and objects are given a higher weight because in an object-oriented programming environment they have been carefully selected to transmit as much information as possible.

The analysis of the class library documentation is performed only once, and the results of this analysis, the component profiles, are stored in auxiliary files to be used during the retrieval stage. When retrieving a component, the user expresses a query in free style natural language.
The system obtains the query profile, applying the same method used to obtain the component profiles. The query profile is then compared with the component profiles and a ranked list of potentially relevant components is obtained. Relevance is assessed as a function of the LAs common to the query and the component, taking into account the quantity of information associated with the required LAs in the different component profiles. The ranked list is shown to the user as described in Section 3.3.
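The indexing and matching steps described in this subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the tf-idf-style weight that stands in for the quantity of information, the function names, and the two toy documentation strings are all assumptions made for illustration. The extra weight given to method and object names is not modeled.

```python
import math
import re
from collections import Counter

WINDOW = 5  # at most five intervening words, as stated in the text
# Assumed stop-word list; the paper does not give one.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "from", "and", "is", "at", "into"}


def lexical_affinities(text):
    """Count unordered word pairs that co-occur within WINDOW words in a sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = [w for w in re.findall(r"[a-z]+", sentence) if w not in STOPWORDS]
        for i, first in enumerate(words):
            for second in words[i + 1 : i + 2 + WINDOW]:
                if second != first:
                    counts[tuple(sorted((first, second)))] += 1
    return counts


def build_profiles(docs, top_k=20):
    """Keep, per component, the top_k LAs by an assumed tf-idf-style weight."""
    raw = {name: lexical_affinities(text) for name, text in docs.items()}
    doc_freq = Counter(la for counts in raw.values() for la in counts)
    n_docs = len(docs)
    profiles = {}
    for name, counts in raw.items():
        # Frequent in this document and rare elsewhere gives a high weight.
        weighted = {la: c * math.log(1 + n_docs / doc_freq[la]) for la, c in counts.items()}
        best = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        profiles[name] = dict(best)
    return profiles


def rank(query, profiles):
    """Rank components by the summed weight of the LAs they share with the query."""
    query_las = set(lexical_affinities(query))
    scored = []
    for name, profile in profiles.items():
        score = sum(weight for la, weight in profile.items() if la in query_las)
        if score > 0:
            scored.append((name, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy documentation strings, invented for this example.
docs = {
    "String": "Answer an array of substrings from the receiver. "
              "Extract words from the receiver string.",
    "Stream": "Answer the next object in the stream. Position the stream at the end.",
}
ranked = rank("extract words from a string", build_profiles(docs))
```

In this toy run only the String entry shares lexical affinities with the query, so it is the only component in the ranked list. A real profile would be built once from the library documentation and stored, as the text describes.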

3.2. The CB Subsystem

Human programmers perform many programming tasks by "reusing" previously acquired mental schemes, rather than by reasoning from first principles (Rich & Waters, 1988; Steier, 1991). Therefore, case-based techniques constitute a natural approach to represent the experience acquired in the design of past applications. This assertion is even more relevant when dealing with object-oriented design where, very often, the new programs are built from components available in the class library. Descriptions of these components will constitute the case base in our system.

First, it is necessary to specify a mechanism to represent cases in the knowledge base. We use a classification-based knowledge representation system (Fernández-Valmayor & Fernández-Chamizo, 1992), which is based on a subset of the KL-ONE system (Brachman & Schmolze, 1985). This system can automatically classify a structured concept with respect to a taxonomy of other concepts. That is, on the basis of its structure and the structure of concepts already in the taxonomy, a new concept can be automatically inserted at the correct position. The same classification algorithm is used for retrieval. The classifier can determine the position in the concept hierarchy where a given description should be located.

The knowledge base is implemented as a semantic network, where every node is a frame-like structure representing a concept. Concepts are restricted by a number of slots that relate them to other nodes in the network. In order to build the knowledge base, a domain analysis of the software library has to be made. The representation is organized around the operations performed by the routines in the library, and the objects involved in those operations. Therefore, the domain analysis must identify the operations and the objects needed to characterize the components, along with dependencies among them.
This analysis leads to an inheritance hierarchy containing general purpose programming concepts (operations, objects, and relations) at higher levels, concepts specific to the library, and the actual descriptions of the components (i.e. the cases) at a lower level of the hierarchy. The root concepts of the knowledge base are ACTION and OBJECT. Operations are represented as specializations of the ACTION concept, whereas objects manipulated by those operations are represented as specializations of the OBJECT concept. In Object-Oriented Languages (OOLs) the basic components are classes. A class consists of data structures and routines (in Smalltalk terminology, a routine is called a method, which is the term we will use).


[Figure 1: concept hierarchy rooted at Collection. Abstract concepts shown: Orderable_Collection, Fixed_Size_Collection, Key_Collection, and Allow_Duplicates_Collection. Actual classes shown: Dictionary and Indexed_Collection. Slot labels visible in the figure include Order:, Size:, Key:, Condition:, and item:.]

FIGURE 1. A partial view of the concept hierarchy.

For every method and every class in the software library, a description is included in the knowledge base, containing the following information:

Methods. A frame representing a method consists of an action name, and three required slots specifying the owner class of that method, the actual implementation of the method, and the identifier of the method. Additionally, the frame may include some slots to connect the method with other methods that play an important role in its functionality. And, finally, there will be some slots to represent the specific modifiers of the action. For example, a search will have an object searched and a medium where the search is accomplished.

Classes. We create for each class a frame containing information about its structural description, its code level relationships with other classes (i.e., inheritance, clientelism), its conceptual relationships with other classes and abstract concepts in the knowledge base, and the links to all its methods. The conceptual relationships have to be hand coded, but the code level relationships are automatically inferred from the source code. The structural description defines a set of values for the class (i.e., a list of other classes needed to construct instances of the current class). For example, the structural description of the Stream class in Smalltalk consists of an IndexedCollection, and two integers representing the position and the readLimit in the stream. Class descriptions are represented as OBJECT concepts, because they are the objects manipulated by the methods.

The descriptions of the actual components included in the class library are defined as specializations of more abstract concepts detected in the domain analysis. Figure 1 shows a partial view of a portion of the concept hierarchy. The figure shows the abstract concept Collection, which describes any aggregation of objects.
Collections are classified according to the following criteria (extracted from the domain analysis): ordered; fixed or variable size; allows duplicates; and accessible by an external key. These criteria are further specialized to reflect more specific constraints. The figure also shows two concepts describing actual classes, Indexed_Collection and Dictionary, connected to their corresponding abstractions.
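The automatic classification of a structured concept can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the KL-ONE-based system used in the paper: a concept is reduced to a set of slot names, a concept subsumes another when its slots are a subset of the other's, and the slot sets attached to the Collection concepts below are assumptions read loosely from Figure 1. A full classifier would also compare slot fillers and relink existing concepts beneath a newly inserted one.

```python
class Concept:
    """A node of the taxonomy, described only by the slots that restrict it."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = frozenset(slots)
        self.parents = []
        self.children = []


def subsumes(general, specific):
    """A concept subsumes another if all its slot restrictions also hold there."""
    return general.slots <= specific.slots


def most_specific_subsumers(root, slots):
    """Walk down from the root and return the lowest concepts that subsume the description."""
    probe = Concept("?", slots)
    found, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        lower = [child for child in node.children if subsumes(child, probe)]
        if lower:
            stack.extend(lower)   # a more specific subsumer exists below
        else:
            found.append(node)    # nothing below fits: this is a position
    return sorted(found, key=lambda concept: concept.name)


def classify(root, name, slots):
    """Insert a new concept under its most specific subsumers."""
    concept = Concept(name, slots)
    concept.parents = most_specific_subsumers(root, slots)
    for parent in concept.parents:
        parent.children.append(concept)
    return concept


# Hypothetical slot sets for the concepts of Figure 1.
collection = Concept("Collection", [])
classify(collection, "Orderable_Collection", ["order"])
classify(collection, "Fixed_Size_Collection", ["size"])
classify(collection, "Key_Collection", ["key"])
classify(collection, "Allow_Duplicates_Collection", ["condition"])
dictionary = classify(collection, "Dictionary", ["key"])
indexed = classify(collection, "Indexed_Collection", ["order", "size"])

# The same search serves retrieval: where would a keyed description go?
hits = most_specific_subsumers(collection, ["key"])
```

Using one routine for both insertion and lookup mirrors the statement in the text that the same classification algorithm is used for retrieval.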

3.3. The User Interface

The interface reflects the two sources of information available in the system: natural language queries are processed through the IR module, and a form-filling interface along with a concept browser are supported by the knowledge base. First, the user must state his or her needs by describing the class or the method(s) s/he requires, using one of the following query modes:

Natural language. To retrieve a component, the user constructs a free-text query that is indexed as if it were the documentation associated with the components. A profile of this query is produced and then it is compared with the profiles obtained from the components in the software library. As a result of this comparison, the system produces an ordered list of candidate classes, ranked by degree of similarity with the profile of the query. Inside each candidate class, methods related to the functionality requested by the user are activated. Studies have shown that people are often very ungrammatical when communicating in a problem-solving setting. Therefore, keywords are also allowed as queries. Although keyword matching is generally not an effective retrieval mechanism, its simplicity and ease of use are an advantage for query construction. Keywords provided by the user are taken as a query profile by the IR subsystem, in the same way as profiles obtained from free-text queries.

Form-filling. The user can also construct the query by using a form-filling interaction mode. This interaction is driven by the frames stored in the knowledge base. The user has to select either a verb describing the action the component performs or a noun representing an object manipulated by the component. From the user input the system displays a skeleton frame corresponding to the action or the object. For each frame slot the system prompts the user for the corresponding filler. The user can ask for help regarding appropriate responses, but it is not necessary for every slot to be filled.

Both query modes, natural language and form filling, give access to the appropriate frames in the knowledge base. In form-based queries the access is conducted by the classifying mechanism of the knowledge base. For free-text queries the access is through the IR indices. Components whose profiles are similar to the query profile are selected as candidate components, and each component has previously been linked to its corresponding frame in the knowledge base. Therefore, no matter what the query mode, the frames that encode the knowledge about the retrieved components are accessed.

Figure 2 shows the interface of the Reuse Assistant. This figure is the result of a query in natural language. The user has posed the query, then has selected one of the retrieved classes, and finally has selected one of the methods activated by the query in that class. At the upper right corner the description associated with that method is shown. At the bottom, the actual code (and comments) of the method appears.

[Figure 2: screen of the Reuse Assistant after the natural language query "extract words from a string". It shows the frame of the selected method (traverse&conditionally-select; obj: char-sequence; condition: not space-char; client-of: ReadStream; implementor: String; selector: asArrayOfSubstrings), the description of the method, and its Smalltalk code.]

FIGURE 2. The user interface.

If the answer to an initial query is unsatisfactory, the user can reformulate the query or browse amongst functionally related classes. The concept hierarchy browser in our system lacks a graphical interface, so the interaction must be accomplished in text mode. The user may ask for more abstract or more specific concepts than the one selected, or follow any link from that frame to any of its fillers, because these are also concepts in the knowledge base. These requests will result in the system displaying the corresponding frame(s).

Retrieval by reformulation (Devanbu et al., 1991) considers retrieval as an incremental process of retrieval cue construction. With this technique, the user and the computer system cooperate, with the user able to incrementally improve his or her query according to the results of the previous queries. In order to enhance the retrieval process, traditional relevance feedback systems ask the user to assign relevance ratings to the items retrieved by the initial query. From these values, the system automatically reformulates the query. New retrievals are made without any need for user intervention. On the other hand, retrieval by reformulation puts the user in control of query formation. We believe this is more efficient because it is easier for the user to understand the knowledge representation mechanism when there is some kind of mapping between the natural language query and its corresponding frame.

In the Reuse Assistant, after a query, a number of component description frames are retrieved and the users can use them as a cue for refining their query. These descriptions will show concrete information about the retrieved component that may help users refine their information needs and express their query in the system's terms. The code of the components along with the associated comments (shown on screen at the same time as the frames) may help novice users of the system to understand the more cryptic frame-based description.

Traditional information retrieval systems assume that users know exactly what they are searching for. At the other end, browsing systems assume that users are exploring the information space, taking the risk of getting lost. Our system can close the gap between exact queries and browsing. In addition to finding software components that match the query, we can explore the conceptual hierarchy to obtain near, inexact matches. This provides a means to browse amongst functionally related components instead of browsing components related by the inherited code, as browsers usually supplied with class libraries do.

4. A USAGE SCENARIO

Let us assume that a user is in the process of implementing a method for "extracting words from a string." This would be the initial query specified in the natural language query mode. This query is indexed by the same statistical method used to index the classes. The profile of lexical affinities detected in the query allows access to the classes with similar profiles. If the requested actions are present in the method profiles, some methods of these classes may also be activated.

Figure 2 shows the user interface after processing this query. The closest retrieved class is String. The selection of this class in the list pane displays a list with the String methods activated by the query. The first of these methods is the asArrayOfSubstrings method. As the figure shows, the frame associated with this method is displayed when the method is selected. This method divides a string into substrings at the occurrences of one or more space characters. At first glance, this may look to be a good solution for the problem at hand, but the "words" extracted by this method will be any sequence of characters different from the space character, including punctuation characters. If the "words" that the user meant to extract were sequences of alphanumeric characters, then it will be necessary to find a different method.

The frame description for this method has two action-specific slots: obj and condition. The obj slot identifies the collection that is traversed, a character sequence, and the condition slot specifies the selection criterion, namely not being the space character. The user must recognize that the condition slot is the point of disagreement between his or her specification and the retrieved method. If the user detects this divergence then s/he can query the system about concepts related to space-char to find out if there is some other concept that could serve as a character selection condition. Doing so, the user will find the concept alphanumeric-char, which is connected to the method isAlphaNumeric implemented in the Character class. It is also possible to follow the link, through the concept hierarchy browser, from the condition slot to the space-char concept and look in its surroundings. The concept alphanumeric-char is close to it. With this information the user can build a new query using the action that appears in the retrieved frame together with the alphanumeric-char concept:

    traverse&conditionally-select
        obj: char-sequence
        condition: alphanumeric-char

Using the classification mechanism implemented in the knowledge base, the system will retrieve the closest cases, having the same action and as many compatible slots as possible. The closest description is the one associated with the nextWord method implemented in the Stream class. This method answers a string containing the next word in the receiver stream, where a word starts with a letter, followed by a sequence of letters and digits.
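The gap between the two notions of "word" in this scenario is easy to see in a small example. The Python functions below are stand-ins written for illustration, not the Smalltalk methods asArrayOfSubstrings and nextWord themselves; their names and the sample sentence are assumptions.

```python
import re


def split_at_spaces(text):
    """Substrings between runs of one or more space characters."""
    return [piece for piece in re.split(r" +", text) if piece]


def alphanumeric_words(text):
    """Words that start with a letter and continue with letters or digits."""
    return re.findall(r"[A-Za-z][A-Za-z0-9]*", text)


sample = "Reuse, don't rewrite: v2 works."
by_spaces = split_at_spaces(sample)
by_alnum = alphanumeric_words(sample)
```

Splitting at spaces keeps punctuation attached to the pieces (for instance the comma after the first word), whereas the alphanumeric definition drops it. That difference is what the condition slot of the frame makes explicit.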
This method does not constitute a complete solution because the goal is to extract words from a string and not from a stream, so it is necessary to find a method to transform strings into streams. A query such as “transform a string into a stream” will retrieve some frame descriptions specializing the action class-conversion, where the action-specific slots are origin and destination. With this information it is possible to build the form:

    class-conversion
        origin:       string
        destination:  stream

And the classification facility of the knowledge base, using the subsumption relationships among concepts, will retrieve the following frame:


    class-conversion
        origin:       indexed-collection
        destination:  stream
        implementor:  Stream
        selector:     on:
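The subsumption test behind this retrieval can be sketched as follows. This is a minimal illustration in Python and not the classifier used in the Reuse Assistant: the dictionary layout, the function names, and every hierarchy entry other than "string is a kind of indexed-collection" are assumptions made for the example.

```python
# Hypothetical fragment of the concept hierarchy, stored as child -> parent.
# Only the string / indexed-collection link is taken from the text.
PARENT = {
    "string": "indexed-collection",
    "array": "indexed-collection",
    "indexed-collection": "collection",
    "collection": "object",
    "stream": "object",
}


def subsumes(general, specific):
    """Return True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


def matches(frame, query):
    """A frame answers a query when the actions agree and each query
    filler is subsumed by the frame's filler for the same slot."""
    if frame["action"] != query["action"]:
        return False
    return all(
        slot in frame["slots"] and subsumes(frame["slots"][slot], filler)
        for slot, filler in query["slots"].items()
    )


# The retrieved frame and the user's form from the scenario.
stream_on = {
    "action": "class-conversion",
    "slots": {"origin": "indexed-collection", "destination": "stream"},
    "implementor": "Stream",
    "selector": "on:",
}
query = {
    "action": "class-conversion",
    "slots": {"origin": "string", "destination": "stream"},
}

print(matches(stream_on, query))
```

Under these assumptions the frame is accepted for the query, while a query whose origin is a more general concept than indexed-collection would be rejected.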

This method can be used to transform strings into streams because the filler of the origin slot, indexed-collection, is a more abstract concept than string, the filler of that slot in the query.

The example shows some of the benefits achieved by the hybrid approach to software component retrieval. This approach results in a more flexible interface, where IR allows the use of informal specifications expressed in natural language, and the underlying knowledge representation permits a more fine-grained search. Using only IR techniques, it is necessary to pose the “right” query to retrieve the required component. This is not a trivial issue, because the way in which object-oriented libraries are organized depends on code inheritance criteria, and it may not be intuitive for the reuser. A useful component may be located in an unexpected place in the class library, or it may be less specific than what is needed. Some functional relationships among components are easily detected by a statistical analysis of their corresponding documentation. However, detecting other relevant relationships would demand an in-depth understanding of the library documentation. This is the kind of knowledge that can be easily identified by an expert, and it is very unlikely that an automatic indexing technique could produce similar results.

5. EVALUATION

The IR module and the interface of our system have been implemented in C for a SUN SPARCStation platform. The knowledge base is implemented in LISP for the same platform. Our work has concentrated on building the descriptions of a subset of the Smalltalk class library, along with the general purpose programming concepts needed to articulate this representation. The system needs to be completed with the descriptions of the rest of the class library, and the related programming concepts. In order to evaluate the performance of our system we concentrate on the retrieval effectiveness.
Usually, the performance of an information retrieval system is measured in terms of recall and precision. Recall measures the proportion of relevant material that is actually retrieved. Precision measures the proportion of retrieved material that is relevant. In general, relevance is a subjective notion, but in the retrieval of software components subjectivity is increased by the acceptability of close matches. Therefore, the evaluation of this kind of system is particularly difficult. In other domains, several test collections are available to evaluate and compare the retrieval effectiveness of the systems being developed. In the software domain, however, these test collections are practically nonexistent. To our knowledge, there is a test collection of 30 queries for the AIX 3 system (Maarek et al., 1991), but no test collection is available for any object-oriented library.

Ideally, in order to have a valid test collection, we would need a large number of independent users, selected at random, accessing the system. Collecting the queries actually issued by these users, along with the relevance judgments made by several Smalltalk experts, we would obtain a test collection for our system. For the moment, we are preparing a list of queries selected from real questions posed by our undergraduate Smalltalk students. This test collection cannot be considered ideal, but it will be useful to evaluate the IR subsystem until the whole system is completed and can be tested at different sites.

The evaluation of the CB subsystem is even more difficult, because of the inherently interactive nature of the formulate-retrieve-reformulate cycle. In this case, it would be necessary to measure the effort expended by the user to locate the required component with and without the Reuse Assistant. Due to the difficulty of monitoring this kind of situation, we believe that it is better to analyze the retrieval effectiveness in a qualitative manner, as in Wood and Sommerville (1988). In this way, we are analyzing the improvements in precision we can obtain using the CB approach in addition to the IR subsystem. Examples like the one shown in the usage scenario are being carefully analyzed.

We have tested the prototype with some groups of Smalltalk students. The first results are encouraging. After some initial difficulties (students tend to overestimate the system's potential and they ask for classes that directly solve their problems) they find that the Reuse Assistant is a useful tool.
Usually, they begin with a free-text query and, after the corresponding frame is shown, they modify the slots to reformulate the query instead of introducing a new free-text query. Our preliminary results show that, in general, free-text queries are preferred when the requirements for the component are not well established. Form filling is used directly when the requirements are clear and a certain level of experience with the system has been attained.
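The recall and precision measures defined earlier in this section can be computed for a single query as in the following minimal sketch. The function name is hypothetical, and the retrieved and relevant sets are invented purely for illustration: they are not results or relevance judgments from our evaluation.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = retrieved relevant items / all relevant items
    precision = retrieved relevant items / all retrieved items
    Either measure is reported as 0.0 when its denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Illustrative data only: three components returned, two judged relevant.
retrieved = ["asArrayOfSubstrings", "nextWord", "size"]
relevant = ["nextWord", "on:"]

recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

With close matches accepted as relevant, the relevant set grows or shrinks depending on the judge, which is why the same retrieved list can yield quite different figures.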

6. CONCLUSIONS AND FUTURE WORK

We have presented a hybrid approach to the problem of indexing and retrieving software components. In this approach, CB techniques are used to represent the knowledge about components, and, simultaneously, statistical indexing methods are used to facilitate access to the components through natural language queries. The Reuse Assistant, a system following this approach, has also been introduced.

The goal of our approach is to maintain the advantages of both techniques, while minimizing their drawbacks. Automatic indexing methods provided by the IR subsystem allow users to incorporate new component profiles easily, making the system extensible. The conceptual organization of the case base provides a means to browse among functionally related components when exact retrieval is not achieved. The high cost of hand coding the component descriptions for a basic class library is compensated by its high reuse potential.

At the moment, only a subset of the Smalltalk classes has been represented in the knowledge base. In order to evaluate the system in actual environments, the knowledge base is going to be completed. We are also studying the construction of knowledge bases for other object-oriented languages (C++ and Eiffel), using the same general purpose programming concepts (the upper levels of the concept hierarchy). Also, we plan to enhance the two subsystems, incorporating natural language processing techniques into the IR module, and new reasoning mechanisms into the CB module that assess the applicability of the retrieved components.

The Reuse Assistant is part of a wider environment that will embody several tools for helping with and teaching the usage of object-oriented class libraries. This will be a case-based environment because we believe that CB approaches are particularly useful when teaching how to reuse software components. By considering the components in the library as cases, the programming experience embodied in those cases can be of use to novices.

REFERENCES

Biggerstaff, T. J., & Richter, C. (1987). Reusability framework, assessment, and directions. IEEE Software, 4, 2.

Brachman, R. J., & Schmolze, J. G. (1985). An overview of the KL-ONE knowledge representation system. Cognitive Science, 9, 2.

Callan, J., & Croft, B. (1993, March). An approach to incorporating CBR concepts in IR systems. Symposium on Case-Based Reasoning and Information Retrieval, Stanford University, AAAI Spring Symposium Series.

Chen, Y. F., & Ramamoorthy, C. V. (1986, Oct.). The C information abstractor. COMPSAC, Chicago.

Curtis, B. (1989). Cognitive issues in reusing software artifacts. In T. J. Biggerstaff & A. J. Perlis (Eds.), Software reusability, volume II, applications and experience. Reading, MA: ACM Press, Addison-Wesley Publishing Company.

Devanbu, P., Ballard, B. W., Brachman, R. J., & Selfridge, P. G. (1991). LaSSIE: A knowledge-based software information system. In M. R. Lowry & R. D. McCartney (Eds.), Automating software design. Cambridge: AAAI Press/The MIT Press.

Embley, D. W., & Woodfield, S. N. (1987). A knowledge structure for reusing abstract data types. Proceedings of the Ninth International Conference on Software Engineering, ACM, pp. 360-368.

Fernández, B., Buenaga, M., & Vaquero, A. (1993). An on line intelligent assistant. World Conference on Educational Multimedia and Hypermedia (ED-MEDIA '93), Orlando, FL.

Fernández-Chamizo, C., Hernández-Yáñez, L., González-Calero, P. A., & Urech-Baqué, A. (1993). A case-based approach to software component retrieval. Symposium on Case-Based Reasoning and Information Retrieval, Stanford University, AAAI Spring Symposium Series, March 1993.

Fernández-Valmayor, A., & Fernández Chamizo, C. (1992). Educational and research utilization of a dynamic knowledge base. Computers & Education, 18(1-3), 51-61.

Frakes, W. B., & Nejmeh, B. A. (1987). Software reuse through information retrieval. Proceedings of the 20th Annual HICSS, Kona, HI.

Girardi, M. R., & Ibrahim, B. (1993). A software reuse system based on natural language specifications. Proceedings of International Conference on Computing and Information (ICCI '93), Sudbury, Ontario, Canada, May 27-29, 1993, pp. 507-511.

Grass, J. E., & Chen, Y. F. (1990). The C++ information abstractor. Proceedings of USENIX C++ Conf. 1990.

Helm, R., & Maarek, Y. S. (1991). Integrating information retrieval and domain specific approaches for browsing and retrieval in object-oriented class libraries. Proceedings of OOPSLA-91.

Henninger, S. (1991). Retrieving software objects in an example-based programming environment. Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval.

Maarek, Y. S., Berry, D. M., & Kaiser, G. E. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17, 8.

Meyer, B. (1987). Reusability: The case for object-oriented design. IEEE Software, 4, 2.

O'Brien, P. D., Halbert, D. C., & Kilian, M. F. (1987). The Trellis programming environment. Proceedings of OOPSLA '87.

Prieto-Diaz, R., & Freeman, P. (1987, January). Classifying software for reusability. IEEE Software.

Rich, C., & Waters, R. C. (1988). Automatic programming: Myths and prospects. Computer, August, pp. 40-51.

Riesbeck, C. K., & Schank, R. C. (1989). Inside case-based reasoning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Schank, R. C. (1972). Conceptual dependency: A theory of natural language understanding. Cognitive Psychology, 552-631.

Steffen, J. L. (1985). Interactive examination of a C program with CScope. Proceedings USENIX Association Winter Conf. 1985.

Steier, D. (1991). Automating algorithm design within a general architecture for intelligence. In M. R. Lowry & R. D. McCartney (Eds.), Automating software design. Cambridge: AAAI Press/The MIT Press.

Wilde, N., & Huitt, R. (1992). Maintenance support for object-oriented programs. IEEE Transactions on Software Engineering, 18, 12.

Wood, M., & Sommerville, I. (1988). An information retrieval system for software components. ACM SIGIR, 22, 3, 4.