Software retrieval by samples using concept analysis


The Journal of Systems and Software 54 (2000) 179-183

www.elsevier.com/locate/jss

Young Park
School of Computer Science, University of Windsor, Windsor, Ont., Canada N9B 3P4
Received 13 April 1999; received in revised form 3 May 1999; accepted 13 June 1999

Abstract

Finding and retrieving software components is one of the tasks of the building-block approach to software reuse. One interesting property of code components, unlike other types of software artifacts, is that they can be executed. Execution-based retrieval, however, tends to take too long to be incorporated in practice and faces the problems of non-termination and very long execution time. This paper describes a software component retrieval method that uses the sample input-output behavior of the components (but without actual execution) and is based on concept analysis. The retrieval uses samples chosen by the developers of the components (rather than generated randomly or provided by the users). Based on the validity relation between components and samples, a concept lattice is constructed for the library by applying formal concept analysis. The user retrieves components by incrementally selecting valid samples for a desired component from a dynamically created menu of samples available in the library. Our method avoids the problems associated with actual execution-based retrieval, such as non-termination and very long execution time, and also improves the retrieval time. Our approach can be directly applied to software components at levels other than code, as long as the components can be described in terms of some input-output relation. © 2000 Elsevier Science Inc. All rights reserved.

Keywords: Software reuse; Software component retrieval; Concept analysis; Execution-based retrieval; Concept lattice; Sample behavior

1. Introduction

Finding and retrieving software components is one of the tasks of the building-block approach to software reuse (Krueger, 1992; Mili et al., 1995). This paper considers the problem of retrieving reusable code components. One interesting property of code components, unlike other types of software artifacts, is that they can be executed. Podgurski and Pierce (1993) and Hall (1993) proposed methods of retrieving reusable code components from an unstructured software library by actually executing components on sample inputs and comparing their output with the desired output specified by the user. The input samples are either generated totally randomly from the argument domains of the function components (Podgurski and Pierce, 1993) or provided solely by the user (Hall, 1993). Since all components in the reuse library are actually executed during retrieval, the retrieval process tends to take too long to be incorporated in practice and faces the problems of non-termination and very long execution time.


In order to achieve more efficient and effective retrieval by execution, Park (1996) presented a method of organizing existing code components in a reuse library based on memoization of the execution of function components on sample inputs. When a component is stored into the reuse library, we execute, if needed, the component on some sample inputs and store (memoize) the result. Later, when we retrieve components, we can reuse the memoized input and output values for each component without execution. Thus, component retrieval can be done just by comparison, without involving any actual execution. The sample inputs, however, depend on the order in which components are stored.

In Lindig (1995), an efficient and incremental method of retrieving software components that are indexed by keywords was presented, based on a library structured by applying concept analysis. Concept analysis is a formal method used for the analysis of data. Such data are structured into units that are formal abstractions of concepts of human thought, allowing meaningful and comprehensible interpretation (Wille, 1982; Ganter and Wille, 1996). An overview of concept analysis as a new framework and its applications in software engineering areas including software component retrieval (Lindig, 1995), analysis of configuration spaces

0164-1212/00/$ - see front matter © 2000 Elsevier Science Inc. All rights reserved. PII: S0164-1212(00)00036-4


(Snelting, 1996) and modularization of legacy code (Lindig and Snelting, 1997; Siff and Reps, 1997) was presented in Snelting (1998).

In this paper, we present a method of retrieving code components that uses the input-output behavior of the components (but without actual execution) and is based on concept analysis. The retrieval is based on samples carefully chosen by the developers of the components rather than generated randomly or provided by the users. In our method, the user retrieves components by selecting the samples for the desired component from a dynamically created list of samples available in the library. Our method is similar to execution-based retrieval, but does not involve any actual execution of components on sample inputs. It thus avoids the problems associated with actual execution-based retrieval, such as non-termination and very long execution time, and also improves the retrieval time. It can be used as an alternative to execution-based retrieval. Our approach can also be directly applied to software components at levels other than code, as long as the components can be described in terms of an input-output relation.

The remainder of this paper is organized as follows: Section 2 describes a sample-based representation of code components; Section 3 presents the process of building a concept lattice for the component-sample relation by applying formal concept analysis; Section 4 demonstrates the retrieval based on the concept lattice; Section 5 discusses an improvement using other negative samples; concluding remarks and future work appear in Section 6.

2. Representing components using samples

In our method, we use a set of sample input-output pairs to describe a code component. Such a set can be viewed as an approximate specification of the code component. We define the validity of a sample S with respect to a component C as follows:

Definition. A sample S is valid for a component C if, when the component C is executed on the sample input, it would produce the sample output.

If the code component is a pure function, the sample consists of the argument values and the return value. If the code component is a procedure (a function with side effects), the sample consists of the argument values, the return value (recorded as void when there is no return value) and the argument values after execution.

Example. Consider a component library L that contains the following five string-handling function components:
· scopy(str1, str2): copy the string str2 to the string str1.
· sswap(str1, str2): swap the string str1 and the string str2.
· srmv(str1, str2): remove the string str2 from the string str1.
· strim(str1, str2): trim the string str2 from the string str1 both at the beginning and at the end.
· sconcat(str1, str2): concatenate the string str2 to the string str1 at the end.

Suppose that the functions are described by the samples shown in Table 1. Each sample consists of the actual arguments before execution, the arguments after execution and the return value after execution.

The sets of samples for the components in the library are provided by the developers of the components. The rationale is that the developers know their components best and can provide good samples that describe the components well. The user retrieves a desired component by choosing the samples that are expected to hold when the desired component is executed.

Table 1
Components and samples in the library L

Component   Samples
scopy       s1 = [("abc","abc"), ("abc","abc"), void]
            s4 = [("","abc"), ("abc","abc"), void]
            s8 = [("abc","xyz"), ("xyz","xyz"), void]
sswap       s1 = [("abc","abc"), ("abc","abc"), void]
            s5 = [("xyz","abc"), ("xyz","xyz"), void]
            s9 = [("abc","xyz"), ("xyz","abc"), void]
srmv        s2 = [("xyz","xyz"), ("","xyz"), void]
            s6 = [("xyzabcxyz","abc"), ("xyzxyz","abc"), void]
            s10 = [("abc",""), ("abc",""), void]
strim       s2 = [("xyz","xyz"), ("","xyz"), void]
            s7 = [("abcxyzabc","abc"), ("xyz","abc"), void]
            s10 = [("abc",""), ("abc",""), void]
sconcat     s3 = [("abc","xyz"), ("abcxyz","xyz"), void]
            s4 = [("","abc"), ("abc","abc"), void]
            s10 = [("abc",""), ("abc",""), void]

The notion of a formal context is defined as follows (Wille, 1982; Ganter and Wille, 1996; Snelting, 1998): Let O be a finite set of elements called objects and A be a finite set of elements called attributes. Let R be a set of ordered pairs of objects and attributes, that is, a binary relation between O and A. A triple (O, A, R) is called a formal context.

Let C be the set of all components in the library:

$C = \{c_1, c_2, c_3, \ldots, c_m\}$.

Let S be the set of all samples used to describe the components in the library:


$S = \{s_1, s_2, s_3, \ldots, s_n\}$.

On the set $C \times S$, we define a binary relation R based on the validity of the sample for the component as follows:

$R = \{(c, s) \mid c \in C,\ s \in S,\ \text{and } s \text{ is valid for } c\}$.

Then the triple (C, S, R) that represents the library forms a formal context.

Example. The validity relation q between components and samples in the library L is shown in Table 2.

Table 2
A component-sample validity relation q (x: the sample is valid for the component)

          s1   s2   s3   s4   s5   s6   s7   s8   s9   s10
scopy     x              x                   x
sswap     x                   x                   x
srmv           x                   x                   x
strim          x                        x              x
sconcat             x    x                             x

3. Building a concept lattice

We build a concept lattice for the validity relation between the components and the samples by applying formal concept analysis. Formal concept analysis can be summarized as follows (Wille, 1982; Ganter and Wille, 1996; Snelting, 1998): Consider any two sets $S_1$ and $T_1$ such that
· $S_1$ is a subset of O and $T_1$ is a subset of A,
· $T_1$ is the set of attributes common to the objects in $S_1$, and
· $S_1$ is the set of objects common to the attributes in $T_1$.
Then the pair $(S_1, T_1)$ is called a formal concept. $S_1$ is called the extent and $T_1$ is called the intent of the concept $(S_1, T_1)$.

The concepts of a given context are ordered by a subconcept-superconcept relation (≤) defined as follows. For two concepts $(S_1, T_1)$ and $(S_2, T_2)$ of a given context,

$(S_1, T_1) \le (S_2, T_2) \iff S_1 \subseteq S_2$ (or equivalently $\iff T_2 \subseteq T_1$).

The partially ordered set of all concepts of a given context is called the concept lattice. It can be shown that the concept lattice is a complete lattice. The greatest lower bound (∧) of two concepts is given by intersecting their extents and finding the common attributes of the resulting extent:

$(S_1, T_1) \wedge (S_2, T_2) = (S_1 \cap S_2,\ \{a \in A \mid (o, a) \in R \text{ for all } o \in S_1 \cap S_2\})$.
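To make these definitions concrete, the following Python sketch encodes the validity relation of Table 2 as a dictionary and implements the two derivation operators (the attributes common to a set of objects, and the objects common to a set of attributes), a formal-concept check, and the greatest lower bound as in the formula above. It is only an illustration under stated assumptions: the names CONTEXT, common_attributes, common_objects, is_concept and meet are hypothetical and do not come from the paper.

```python
# Validity relation of the example library: component -> set of valid samples.
# The entries follow Table 2; every identifier below is an illustrative choice.
CONTEXT = {
    "scopy": {"s1", "s4", "s8"},
    "sswap": {"s1", "s5", "s9"},
    "srmv": {"s2", "s6", "s10"},
    "strim": {"s2", "s7", "s10"},
    "sconcat": {"s3", "s4", "s10"},
}
OBJECTS = frozenset(CONTEXT)
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def common_attributes(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    result = set(ATTRIBUTES)
    for obj in objects:
        result &= CONTEXT[obj]
    return frozenset(result)


def common_objects(attributes):
    """Objects that have every attribute in `attributes`."""
    wanted = set(attributes)
    return frozenset(obj for obj in OBJECTS if wanted <= CONTEXT[obj])


def is_concept(extent, intent):
    """A pair is a formal concept when each of the two sets determines the other."""
    return (common_attributes(extent) == frozenset(intent)
            and common_objects(intent) == frozenset(extent))


def meet(c1, c2):
    """Greatest lower bound: intersect the extents, then take their common attributes."""
    extent = frozenset(c1[0]) & frozenset(c2[0])
    return extent, common_attributes(extent)
```

For instance, the meet of ({srmv, strim, sconcat}, {s10}) and ({scopy, sconcat}, {s4}) has the extent {sconcat} and the intent {s3, s4, s10}.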

The least upper bound (∨) of two concepts is given by intersecting their intents and finding the common objects of the resulting intent:

$(S_1, T_1) \vee (S_2, T_2) = (\{o \in O \mid (o, a) \in R \text{ for all } a \in T_1 \cap T_2\},\ T_1 \cap T_2)$.

Example. The triple (C, S, q) from the example library L forms a formal context where
· C (Objects) = {scopy, sswap, srmv, strim, sconcat}.
· S (Attributes) = {s1, s2, s3, s4, s5, s6, s7, s8, s9, s10}.
· q (Binary relation) = the validity relation as given in Table 2.

Based on the relation between the sample input-output pairs and the components, we construct a complete lattice-structured reuse library L using concept analysis. There are several algorithms to construct a concept lattice for a given context.

Example. The concepts and their subconcept-superconcept relations for the context (C, S, q) from the example library L can be computed as follows:
· Concept Bottom (B) = ({}, {s1, s2, s3, s4, s5, s6, s7, s8, s9, s10}).
· Concept X1 = ({scopy}, {s1, s4, s8}).
· Concept X2 = ({sswap}, {s1, s5, s9}).
· Concept X3 = ({srmv}, {s2, s6, s10}).
· Concept X4 = ({strim}, {s2, s7, s10}).
· Concept X5 = ({sconcat}, {s3, s4, s10}).
· Concept X6 = X1 ∨ X2 = ({scopy, sswap}, {s1}).
· Concept X7 = X1 ∨ X5 = ({scopy, sconcat}, {s4}).
· Concept X8 = X3 ∨ X5 = ({srmv, strim, sconcat}, {s10}).
· Concept X9 = X3 ∨ X4 = ({srmv, strim}, {s2, s10}).
· Concept Top (T) = ({scopy, sswap, srmv, strim, sconcat}, {}).

A concept lattice can be depicted in a simpler form of lattice diagram using a reduced labeling, in which each object and each attribute is entered only once in the diagram. A concept is labeled with attribute a if it is the largest concept having a in its intent. This means that all the concepts below the concept labeled with the attribute a contain a in their intent. Similarly, a

concept is labeled with an object o if it is the smallest concept having o in its extent.

Example. The resulting concept lattice for the library L is depicted as a lattice diagram with reduced labeling with attributes in Fig. 1.

Fig. 1. The concept lattice for the library L.

4. Retrieval based on the concept lattice

Initially the retrieval system displays a list of all available samples in the library, and the user selects the samples that are valid for the desired component. The user chooses the desired samples incrementally from the menu of available samples in the library. The retrieval model is shown in Fig. 2. When the user chooses some sample(s), the retrieval system dynamically displays a list of all (context-sensitive) samples available from that point on. The retrieval is more flexible than facet-based retrieval (Prieto-Diaz, 1991) because it does not require a complete and fixed sequence of samples (Lindig, 1995; Snelting, 1998).

Fig. 2. Component retrieval based on concept lattice.

The available sample list at a concept X consists of all the attributes that label concepts such that the object set of the greatest lower bound (glb) of these concepts and the concept X is not empty. Suppose that the user has selected some samples so far. If the next desired sample is not in the currently available sample list, then there is no component that has this sample, along with the samples selected so far, as valid samples.
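The available sample list can also be computed directly from the validity relation: a sample remains available when at least one component has it as a valid sample together with every sample selected so far. The sketch below assumes the relation of Table 2; VALID, matching_components and available_samples are hypothetical names introduced only for this illustration.

```python
# Component -> valid samples, following Table 2 (all names are illustrative).
VALID = {
    "scopy": {"s1", "s4", "s8"},
    "sswap": {"s1", "s5", "s9"},
    "srmv": {"s2", "s6", "s10"},
    "strim": {"s2", "s7", "s10"},
    "sconcat": {"s3", "s4", "s10"},
}


def matching_components(selected):
    """Components for which every selected sample is valid."""
    wanted = set(selected)
    return {name for name, samples in VALID.items() if wanted <= samples}


def available_samples(selected):
    """Samples that can still be added without emptying the result set."""
    offered = set()
    for name in matching_components(selected):
        offered |= VALID[name]
    # Samples that are already selected are not offered again.
    return offered - set(selected)


if __name__ == "__main__":
    for query in ([], ["s10"], ["s10", "s3"]):
        menu = sorted(available_samples(query), key=lambda s: int(s[1:]))
        print(query, "->", menu)
```

Recomputing the menu after each selection gives the incremental, context-sensitive behavior described above without fixing the order in which samples are chosen.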

Example. For the example library L, the retrieval system initially displays s1, s2, s3, s4, s5, s6, s7, s8, s9 and s10 as available samples. If the user selects s10, then the available samples are s2, s3, s4, s6 and s7. This means that there is no component in the library that has both s10 and any of s1, s5, s8 and s9 as valid samples. When the user further chooses s3, s4 becomes the only available sample from that point on.

Suppose the user is searching for a component and let Q be the set of selected samples s1, s2, ..., sk. The retrieval process is sketched as follows:
Step 1: Find all the concepts labeled with the attributes s1, s2, ..., sk.
Step 2: Find the glb of all these concepts.
Step 3: Retrieve all the components, i.e., the object set of this glb concept.
All the retrieved components have the selected samples s1, s2, ..., sk as their valid samples. If the object set of this concept is empty, then we can conclude that there is no component in the library that satisfies the query.

Example. Consider the user's query Q1 = {s10}. The system goes to the concept X8, which is labeled with s10, and retrieves the components srmv, strim and sconcat. Note that s10 is a valid sample for all these components. For the query Q2 = {s10, s4}, the system finds the concept X5, which is the glb of the concept X8 labeled with s10 and the concept X7 labeled with s4, and the component sconcat is retrieved. Suppose that the user selects s10 and s1, i.e., the query Q3 = {s10, s1}. In this case, the glb of the concept X8 labeled with s10 and the concept X6 labeled with s1 is the bottom concept. Thus the system retrieves no component. Note that there is no component in the library L that has both s10 and s1 as valid samples.

Updating the concept lattice: The concept lattice for the library is restructured when new components are stored, when existing components are removed from the library, or when the samples for existing components in the library are modified.
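Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the relation is that of Table 2, a concept is an (extent, intent) pair of frozensets, and the names VALID, intent_of, extent_of, labeled_concept, glb, retrieve and all_concepts are hypothetical. The brute-force all_concepts enumeration closes every subset of components and is only practical for a tiny library.

```python
from functools import reduce
from itertools import combinations

# Component -> valid samples, following Table 2 (all names are illustrative).
VALID = {
    "scopy": {"s1", "s4", "s8"},
    "sswap": {"s1", "s5", "s9"},
    "srmv": {"s2", "s6", "s10"},
    "strim": {"s2", "s7", "s10"},
    "sconcat": {"s3", "s4", "s10"},
}
ALL_SAMPLES = frozenset().union(*VALID.values())


def intent_of(components):
    """Samples that are valid for every component in `components`."""
    result = set(ALL_SAMPLES)
    for name in components:
        result &= VALID[name]
    return frozenset(result)


def extent_of(samples):
    """Components for which every sample in `samples` is valid."""
    wanted = set(samples)
    return frozenset(name for name in VALID if wanted <= VALID[name])


def all_concepts():
    """Enumerate all concepts by closing each subset of components."""
    concepts = set()
    names = list(VALID)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = intent_of(subset)
            concepts.add((extent_of(intent), intent))
    return concepts


def labeled_concept(sample):
    """Step 1: the largest concept whose intent contains `sample`."""
    extent = extent_of({sample})
    return extent, intent_of(extent)


def glb(c1, c2):
    """Step 2: intersect the extents and recompute the common samples."""
    extent = c1[0] & c2[0]
    return extent, intent_of(extent)


def retrieve(query):
    """Step 3: the object set of the glb of all concepts labeled by the query."""
    top = (frozenset(VALID), intent_of(VALID))
    concept = reduce(glb, (labeled_concept(s) for s in query), top)
    return set(concept[0])
```

With the example library, retrieve(["s10"]) gives srmv, strim and sconcat, retrieve(["s10", "s4"]) gives sconcat, and retrieve(["s10", "s1"]) gives the empty set, matching the queries Q1, Q2 and Q3. In this sketch, an update to the library amounts to editing VALID and recomputing the concepts.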


5. Using other samples

So far we have used a set of samples to represent a component, and a sample is associated with a component only when it was chosen to describe that component. However, the library contains many more samples that were originally used to describe other components. We can improve the retrieval by utilizing these already existing samples for the component. We actually execute each component on these other samples and determine whether they are valid or invalid for the component. If a sample is valid, we use it as a new valid sample for the component. Otherwise we can use it as a negative sample, which means that when the component is executed on the sample input, it should not produce the sample output. In general, we can extend the sample set S with negative samples into S':

$S' = S \cup \{\neg s_1, \neg s_2, \neg s_3, \ldots, \neg s_n\}$,

where $\neg s_i$ denotes the negative sample for $s_i$, and accordingly we extend the validity relation R into R' on the set $C \times S'$:

$R' = R \cup \{(c, \neg s) \mid c \in C,\ s \in S,\ \text{and } s \text{ is not valid for } c\}$.

Then the triple (C, S', R') forms a formal context, and all the concepts in this context are structured into a concept lattice, on which the retrieval is based.

6. Conclusion and future work

We have presented a concept-based method of retrieving code components using the sample input-output behavior of the components (as a simple and approximate specification). The method uses samples chosen by the developers of the components (rather than generated randomly or provided by the users) and does not involve actual execution. Based on the validity relation between components and samples, a concept lattice is constructed for the library by applying formal concept analysis. The user retrieves components by incrementally selecting valid samples for the desired component from a dynamically created (context-sensitive) menu of samples available in the library. The retrieval is done based on the concept lattice of the library.
An improvement of the retrieval that utilizes the samples already existing for other components is also discussed. The concept-based retrieval is more flexible than facet-based retrieval. Compared with execution-based retrieval, our method avoids the problems of non-termination and very long execution time, and also improves the retrieval time, though it could


have higher space overhead. Our method can be used as an alternative to execution-based retrieval. Our approach can also be directly applied to software components at levels other than code, as long as the components can be described in terms of an input-output relation. Future work includes evaluating and scaling up the method and extending it to other component levels.

Acknowledgements

This research was supported in part by NSERC under grant OGP0138415.

References

Ganter, B., Wille, R., 1996. Applied Lattice Theory: Formal Concept Analysis, http://www.math.tu-dresden.de/~ganter/fca.html.
Hall, R., 1993. Generalized behavior-based retrieval. In: Proceedings of the International Conference on Software Engineering, pp. 371-380.
Krueger, W., 1992. Software reuse. ACM Computing Surveys 24 (2), 131-183.
Lindig, C., 1995. Concept-based component retrieval. In: Proceedings of the IJCAI Workshop on Formal Approaches to the Reuse of Plans, Proofs and Programs.
Lindig, C., Snelting, G., 1997. Assessing modular structure of legacy code based on mathematical concept analysis. In: Proceedings of the International Conference on Software Engineering, pp. 349-359.
Mili, H., Mili, F., Mili, A., 1995. Reusing software: Issues and research directions. IEEE Transactions on Software Engineering 21 (6), 528-561.
Park, Y., 1996. Organizing reusable components for execution-based retrieval. In: Proceedings of the International Symposium on Applied Corporate Computing, pp. 147-155.
Podgurski, A., Pierce, L., 1993. Retrieving reusable software by sampling behavior. IEEE Transactions on Software Engineering 2 (3), 286-303.
Prieto-Diaz, R., 1991. Implementing faceted classification for software reuse. Journal of ACM 34 (5), 89-97.
Siff, M., Reps, T., 1997. Identifying modules via concept analysis. In: Proceedings of the International Conference on Software Maintenance, pp. 170-179.
Snelting, G., 1996. Reengineering of configurations based on mathematical concept analysis. ACM Transactions on Software Engineering and Methodology 5 (2), 146-189.
Snelting, G., 1998. Concept analysis - A new framework for program understanding. In: Proceedings of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tool and Engineering, pp. 1-10.
Wille, R., 1982. Restructuring lattice theory: an approach based on hierarchies of concepts. In: Rival, I. (Ed.), Ordered Sets, pp. 445-470.

Young Park is an associate professor of computer science at the University of Windsor, Canada. His current research interests are software reuse, component-based software development, software evolution/reengineering, program understanding, and formal concept analysis and its application in software engineering and data engineering. He received M.S. and Ph.D. degrees in computer science from the Courant Institute of Mathematical Sciences, New York University.