Information Systems 30 (2005) 1–19
Generating domain representations using a relationship model

Irene Díaz a,*, Juan Llorens b, Gonzalo Génova b, J. Miguel Fuentes c
a Artificial Intelligence Center, Computer Science Department, Universidad de Oviedo, Carr. de Castiello de Bernueces s/n, 33203 Gijón, Spain
b Information Engineering Group, Computer Science Department, Universidad Carlos III de Madrid, Spain
c Research Department, dTinf S.L., Spain

Received 31 January 2002; received in revised form 29 July 2003; accepted 31 July 2003
Recommended by Maurizio Lenzerini.

* Corresponding author. Tel.: +34-985-18-26-65; fax: +34-985-18-21-25. E-mail addresses: [email protected] (I. Díaz), [email protected] (J. Llorens), [email protected] (G. Génova), [email protected] (J. Miguel Fuentes).
Abstract

Domain analysis (DA) techniques are increasingly applied in information science, software engineering and even knowledge management to help in the creation of a controlled vocabulary to represent, index and retrieve every kind of information. The main limitation to the expansion of DA is the enormous investment in human resources that these techniques demand. This paper presents a particular technology to perform computer-based (semi-)automatic domain analysis (ADA) based on a relationship-centered information representation model called RSHP. The proposed technology is presented, describing its different stages and algorithms. Finally, experimental results obtained when the methodology was applied to the "Software Reuse" domain are presented.
© 2003 Elsevier Ltd. All rights reserved.

Keywords: Classification; Domain analysis; Filtering
1. Introduction

A domain can be defined as a knowledge area for which a certain software system is developed. Domain analysis (DA) can then be defined as the process of eliciting, classifying and modeling domain information at a non-executable level [1]. The results of performing DA are domain models, formed by knowledge structures representing the commonalities and variability of all the possible
software applications that can be modeled within the selected domain [2]. Therefore, the main advantage of performing DA is that, once a particular domain has been modeled (usually inside an organization), the development of information systems can be based on high-level reuse principles. Information systems developed in that way are usually cheaper and of better quality. An introduction to domain analysis and its application to information systems modeling can be found in [3]. Over the last 20 years, different methods for performing domain analysis have been developed. Neighbors [4,5] presents the first approach to DA, named Draco. Draco is an interactive system that supplies mechanisms to enable the definition
of problem domains as special-purpose, high-level languages and the manipulation of statements in these languages into an executable form. McCain [6] creates a hierarchical DA process with three steps: market analysis, DA, and specification and implementation of reusable resources. Prieto-Díaz [7] presents a technology based on a software library organized around a faceted classification scheme. The system supports search and retrieval of reusable components and librarian functions such as cataloging and classification. However, he showed that automatic faceted classification is not easy, as it requires huge human resources. DARE [8,9] is a CASE tool that supports an analyst in extracting and recording domain information from documents and code, acquiring and recording domain knowledge from experts, analyzing domain knowledge, producing various domain models, and producing a repository of reusable assets for the domain. Many other DA methods have been proposed, such as the intelligent libraries of Simos [10], Feature Oriented Domain Analysis (FODA) [11], the Synthesis project [12], the intelligent design of Lubars [13], the RAPID project [14], the KAPTUR tool [15], Gomaa's domain analysis method [16] and Organization Domain Modeling [17]. Although DA presents clear benefits in the development of information systems, one of the main drawbacks of starting a DA project within an organization is that it needs huge human resources, most of them highly qualified (experts in the subject) and therefore very expensive. Finding ways of automating DA is the main goal of the research presented in this paper. An Automatic Domain Analysis (ADA) methodology will allow organizations to perform DA for their expertise domains at low cost, and this will greatly ease the development of their information systems. In this paper we present a methodology to perform automatic DA based on a relationship representation model, called RSHP, used to represent the domain. Section 2 presents the general aspects of how automatic DA should be performed. Section 3 describes the RSHP domain representation model. The RSHP-DA methodology is presented in Section 4. The results of applying RSHP-DA to a particular domain are shown in Section 5, and some conclusions are presented in Section 6.
2. Automatic domain analysis

ADA stands for a set of techniques, methods and algorithms that allow a computer to create candidate domain representations gathered from electronic information sources. The different DA techniques try to build a domain according to the steps described in Fig. 1 [18]. The first step is called the plan of the project; this is a preliminary phase, not specific to the DA process, where the domain has to be identified and delimited. In the second step, relevant domain information is selected in order to generate a corpus of documents to be used as input to the computer-based algorithms. The document corpus must be as complete as possible: it must cover the whole domain to be modeled. The third step is composed of the following two activities: information analysis and classification.
Fig. 1. Steps for building automatic domain representations: plan of the project (identification of the domain); information acquisition (obtaining the set of relevant documents); information processing (information analysis and information classification); domain model validation (domain based on knowledge).
Information analysis processes the document corpus. The main goal is to build the domain vocabulary, as well as to reject irrelevant data. The classification activity tries to obtain a conceptual representation of the information by generating relationships between the vocabulary elements. Finally, a validation step must be taken into account. Indeed, in automatic DA, validation is the most important step, as computer algorithms usually cannot model the complete semantics of the domain, and human surveillance becomes critical. Validation is not usually performed only at the end of the process, but rather throughout it: vocabulary generation needs validation, as does ontology creation, etc. A few approaches toward the automatic generation of domain representations were made in the mid-1990s [19]. Among them, the authors experimented with the application of statistical and neural algorithms for ontology creation [20]. In order to automatically build ontologies, filtering (IDF and n-grams) and classification (k-means, Chen co-wording, max-min, ISODATA, ART neural networks, etc.) algorithms were used. This approach provided an improvement in the automation process, although a limited one. At the end of the 1990s, the authors oriented their research towards linguistic-semantic and NLP technologies, using syntactically based relationship indexers [21]. The RSHP scheme was proposed in [22] as a domain representation model, where the domain is represented using relationships between concepts. Relationship indexers had to be developed to extract information from all the different information sources (free text, ER models, Unified Modeling Language (UML) models, etc.). The application of RSHP indexers to domain generation, as they automatically extract relationships, allows merging the information acquisition and information organization stages presented in Fig. 1. This feature is the basis of the ADA presented in this paper.
3. RSHP: relationships information representation model

In recent years, the Software and Information Engineering communities have been
working on the idea of finding an information representation model capable of handling all the different artifact types present in the software environment. The achievement of this goal would imply a clear improvement towards the mechanization of Software Engineering, as one of the main problems of the field is the impossibility of applying real reuse in the Software Development Process (SDP). Having a unique representation model for every kind of artifact involved in the development of software will allow classifying them using the same criteria and, therefore, will enhance their retrieval and reuse. For example, in the Finances domain, where several DA projects have been carried out, the kind of information needed to represent a domain can be very heterogeneous: lists of vocabulary terms, relationships between terms (thesauri), business requirements described in free text or formal methods, business rules represented by UML activity diagrams, class-object models, system communication needs represented by UML sequence diagrams, software test cases, Entity-Relationship database models, financial formulas represented in MS Excel, source code, etc. In order to represent all this information, we could use each kind's own representation schema: the ISO 2788 standard [23] for thesaurus structures, the UML metamodel for class, object, use case, sequence, collaboration, activity, state, component and node diagrams, ER for database information models, keywords for free text requirements, and so on. But using an ad hoc information representation model for each kind of information, we would not be able to create a complete domain representation, as there would be no possibility of tracing from one knowledge element to another. For example, if the Information Technology Department of a bank demands knowledge regarding credit cards, the domain representation would not allow traversal from requirements describing the needs for a customer to acquire a credit card, to functional software models developed for controlling credit card payments, or to text documents describing credit card usage in a teller machine. This is not possible because there is no way to navigate from one representation model to
another. Therefore, a generic information model which allows representing all sorts of knowledge would improve domain representation models. Although this discussion is not the topic of this paper (see [22]), several knowledge representation models have been studied to cope with our requirement, coming mainly from the web area (OWL, OIL, etc.) and from information science (the facets model [24], the ISO 13250 Topic Maps standard [25]), but none of them completely fit the domain representation needs. The RSHP representation model was developed and presented in [21,22,26,27]. The model is based on the basic idea that any piece of information can be described as a group of relationships between concepts. Therefore, the leading element of an information unit is the relationship. For example, Entity/Relationship data models are certainly represented as relationships between entity types [28]; in the software modeling area, object models can also be represented as relationships between objects and classes [29]; in the process modeling area [30], processes can be represented as causal/sequential relationships between sub-processes. Moreover, the UML metamodel itself can also be modeled as a set of relationships between model elements. UML (the Unified Modeling Language) has been designed to specify, visualize, construct and document the artifacts of a software system [31], and can be considered a standard tool for the analysis and design of software [31]. The RSHP repository model presented here is designed to allow reusing not only code but also the information produced earlier in the process (use case diagrams, requirements documents, etc.) about the problem. This repository tries to include all the information given by the UML paradigm, and to be constructed as automatically as possible. In fact, as the RSHP model contains the UML metamodel, storing and retrieving UML information is automatic as well as quite easy. Furthermore, free text information can certainly be represented as relationships between terms by means of the same structure, since coherent human language text can be represented as a set of well-constructed sentences where the subject+verb+predicate (SVP) structure is present.
Indeed, the SVP structure can be considered as a relationship typed V between the S and the P. The RSHP representation model is presented in depth in [26]. The RSHP information representation model is based on the following principles. In order to represent information, the main description element should be the relationship; this relationship is in charge of linking information elements. An information element (IE) is the atomic component of information that appears in an artifact and that is linked by one or more relationships with other IEs to form information. It is defined by a concept, and it can also be an artifact (an information container found inside a wider artifact). A concept is represented by a normalized term (a keyword coming from a controlled vocabulary, or domain). In RSHP, the simple representation model for describing the content of whatever artifact type (models, tests, maps, text docs, etc.) should be:

RSHP representation for artifact a = ia = {(RSHP 1), (RSHP 2), ..., (RSHP n)},

where every single RSHP is called an RSHP-descriptor and must be described using IEs:

RSHP = {IE representing a verb; IE1 representing a noun; ...; IEn representing a noun}.

For example, consider two different textual artifacts α and β, where α contains the following text "Cars are vehicles, and they have motor [...]. The Cosworth motor is a new motor [...]", and β includes "A FORD Focus is a little car" as its free text content. For the α artifact, the RSHP content representation would be

iα = {(RSHP 1), (RSHP 2), (RSHP 3)}, where
RSHP 1 = {to be; Vehicle; Car}Hierarchy
RSHP 2 = {to have; Car; Motor}Aggregation
RSHP 3 = {to be; Motor; Cosworth Motor}Hierarchy
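As an illustration only (our own sketch, not the authors' implementation), such RSHP-descriptors could be encoded as simple data structures; all class and field names below are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class RelationType(Enum):
    HIERARCHY = "Hierarchy"
    AGGREGATION = "Aggregation"
    ASSOCIATION = "Association"
    QUALIFICATION = "Qualification"
    OCCURRENCE = "Occurrence"


@dataclass(frozen=True)
class RSHPDescriptor:
    verb: str                    # IE representing the verb (may be "no name")
    arguments: Tuple[str, ...]   # IEs representing the nouns being linked
    rel_type: RelationType       # semantic type of the relationship


@dataclass
class ArtifactRepresentation:
    name: str
    descriptors: List[RSHPDescriptor] = field(default_factory=list)


# RSHP representation of the alpha artifact described in the text
alpha = ArtifactRepresentation("alpha", [
    RSHPDescriptor("to be", ("Vehicle", "Car"), RelationType.HIERARCHY),
    RSHPDescriptor("to have", ("Car", "Motor"), RelationType.AGGREGATION),
    RSHPDescriptor("to be", ("Motor", "Cosworth Motor"), RelationType.HIERARCHY),
])
```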
For the β artifact, the content representation would be

iβ = {(RSHP 1), (RSHP 2)}, where
RSHP 1 = {to be; Car; FORD Focus}Hierarchy
RSHP 2 = {no name; FORD Focus; little}Qualification

In order to represent all sorts of information schemas, the relationships between concepts covered by the RSHP model are defined by collecting them from the information models to be stored. For example, in order to represent UML models, dependency, hierarchy and association (including aggregation and composition) relationships should be defined. If we want to represent free text, the occurrence relationship (which represents the occurrence of a descriptor in a document) should be defined.

4. RSHP technology for generating domain representations: RSHP-DA

A complete methodology to perform DA usually involves technical, organizational and human factors. This paper focuses only on the technical stages of the methodology needed to create domain representations using computer algorithms (ADA). As described in the general framework for ADA (Fig. 1), domains are built from the information that can be extracted from a complete set of electronic documents. RSHP-DA suggests that these documents must include not only textual descriptions, but also any other kind of artifact that can describe the domain (design patterns, analysis models, process representations, tests, requirements, etc.). Nowadays, almost every non-textual artifact from an SDP can be represented using a software modeling methodology, describing a part of the static models, behavioral models, usage models, or architectural models. In order to automatically build a domain, each process, activity, entity, object diagram, etc. represented in a model can be used as an input
artifact, and must be translated to RSHP elements to form a part of a domain representation. To transform each artifact's information structure into RSHP, transformation-indexing models must be created. Most of the modern software modeling methodologies and languages, UML [32] for example, provide a syntactical definition of themselves: a meta-model to represent their limitations. This meta-model must be used to create the transformation procedures. The RSHP-DA domain analysis methodology is based on two main principles:

1. A domain can be represented using the RSHP information representation model.
2. The information of all the different artifacts can be represented using RSHP descriptors.

The combination of these principles is the kernel of the RSHP-DA methodology. A domain is represented using concepts (that can also be artifacts) and relationships between them. These relationships and concepts can be found in the artifacts by performing special indexing (RSHP indexing) on them. Two essential problems must be solved to apply RSHP-DA:

1. The original information representation model of the incoming artifact must be transformed into the RSHP model (the description of the different transformation models is out of the scope of this paper).
2. When all the information is represented in the RSHP model, all the selected concepts (artifacts and sub-artifacts) and relationships (hierarchies, aggregations, associations, equivalences, etc.) gathered from document information extraction must be linked together in order to form a semantic network. Especially relevant is the resolution of relationship conflicts (e.g. one document includes a hierarchy between concepts A and B while another document includes the opposite). Three different conflicts must be identified, both during indexing and afterwards:

(a) Not allowed reflexive RSHPs (i.e. A is specific of A).
(b) Symmetric loops found in asymmetric RSHPs (i.e. A is specific of B and B is specific of A).
(c) Transitivity loops found in non-transitive RSHPs (i.e. A is specific of B, B is specific of C and C is specific of A).

Fig. 2. Class diagram for a motor structure.
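As an illustration only (the paper does not give an algorithm for this step), a minimal sketch of how conflicts (a)-(c) could be detected over a set of "A is specific of B" pairs; the function and variable names are our own:

```python
from itertools import product

def find_hierarchy_conflicts(pairs):
    """Detect reflexive, symmetric and transitive-loop conflicts in a set
    of 'A is specific of B' pairs (asymmetric, non-transitive RSHPs)."""
    pairs = set(pairs)
    reflexive = {(a, b) for (a, b) in pairs if a == b}            # (a) A is specific of A
    symmetric = {(a, b) for (a, b) in pairs if (b, a) in pairs}   # (b) A->B and B->A
    transitive_loops = {
        (a, b, c)
        for (a, b), (b2, c) in product(pairs, pairs)
        if b == b2 and (c, a) in pairs                            # (c) A->B, B->C, C->A
    }
    return reflexive, symmetric, transitive_loops

# Example: the second document contradicts the first one
conflicts = find_hierarchy_conflicts([("Car", "Vehicle"), ("Vehicle", "Car")])
```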
The process that solves these two problems is called RSHP indexing, as it, in a sense, also creates indexes of relationships gathered from all kinds of documents. Once a complete set of documents/artifacts has been indexed, a candidate domain representation can be presented to domain engineers and experts. Therefore, RSHP-DA is a methodology that builds a domain representation while documents are indexed. This feature makes the methodology incremental, which is very important because it means that the more artifacts are indexed, the more complete the domain representation can be. This property is also interesting for maintenance purposes. All domains evolve over time, creating update problems for previous domain representations. With RSHP-DA there is no need to re-engineer the domain representation; only specific indexing processes are needed. Let us present an example of domain representation by document indexing. Consider that the previous α and β textual artifacts have been included in the electronic corpus from which the domain representation must be built, together with the γ artifact, a UML class diagram representing the analysis of the motor structure for Cosworth (Fig. 2). The representations of the three artifacts can be linked together, forming the draft representation of the domain included in them (only static-structural documents have been considered; therefore, only a static domain representation is generated):

Representation of the α artifact
RSHP 1 = {to be; Vehicle; Car}Hierarchy
RSHP 2 = {to have; Car; Motor}Aggregation
RSHP 3 = {to be; Motor; Cosworth Motor}Hierarchy
Representation of the β artifact
RSHP 1 = {to be; Car; FORD Focus}Hierarchy
RSHP 2 = {no name; FORD Focus; little}Qualification

Representation of the γ artifact
RSHP 1 = {no name; <Cosworth Motor>class; <Chassis>class}Association

where <Cosworth Motor>class represents an IE that is also a class sub-artifact. According to RSHP rules, the representation of the "Cosworth Motor" sub-artifact is

RSHP 1 = {no name; Motor; Cosworth Motor}Hierarchy
RSHP 2 = {no name; Cosworth Motor; Cosworth}Association
IE1 = {<Power>attribute}

And the representation of the Power sub-artifact is

Pty1 = {Visibility; Public}

where Pty is a property. Graphically, the draft presentation of the results is as shown in Fig. 3. The RSHP-DA domain analysis method, as a slight variation of the general model presented in Fig. 1, includes the following steps:
1. Delimitation of the target domain.
2. Creation of an electronic document corpus.
3. Generation of the domain vocabulary.
4. Validation of the domain vocabulary.
5. Generation of the candidate domain representation.
6. Validation of the domain representation.
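Purely as an illustration of how descriptors indexed from several artifacts, such as α, β and γ above, could be merged into one draft domain (our own sketch, reusing the hypothetical ArtifactRepresentation structure shown in Section 3, not the authors' implementation):

```python
from collections import defaultdict

def merge_into_domain(artifact_representations):
    """Merge RSHP descriptors indexed from several artifacts into a single
    draft domain representation: a set of concepts plus typed relationships,
    each remembering the artifacts it was found in."""
    concepts = set()
    relationships = defaultdict(set)   # (verb, arguments, rel_type) -> artifact names
    for artifact in artifact_representations:
        for d in artifact.descriptors:
            concepts.update(d.arguments)
            relationships[(d.verb, d.arguments, d.rel_type)].add(artifact.name)
    return concepts, dict(relationships)

# Usage with the alpha, beta and gamma artifacts of the running example:
# concepts, relationships = merge_into_domain([alpha, beta, gamma])
```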
Fig. 3. RSHP representation of a domain.
As this technique intends to create domain representations using computer programs, the validation steps are critical. Humans must be provided with navigation and management tools to validate the results produced by the algorithms. This paper concentrates on describing the domain vocabulary generation and final representation steps of the RSHP-DA technique.

4.1. Delimitation of the target domain

Even if it may seem irrelevant, the ADA process must start by clearly defining what is not a part of the domain, in order to implicitly determine the borders of the target domain.

4.2. Creation of an electronic document corpus

The domain experts, together with domain engineers and librarians, must sample a set of electronic documents including relevant information about the domain. In order to define what kind of artifacts should be selected, the authors have chosen the artifact notion presented by Jacobson in [33]: "a tangible piece of information that (1) is created, changed, and used by workers when performing activities, (2) represents an area of responsibility, and (3) is likely to be put under separate version control. An artifact can be a model, a model element, or a document". Relevant examples of interesting artifacts could be: UML class diagrams, DFDs, hierarchical charts, UML
activity diagrams, product manuals, structured requirements specifications, Entity-Relationship data models, SQL databases, position text-based papers, memorandums, Excel spreadsheets, e-mails, etc. One of the main problems of this step is to evaluate the quality of the document corpus. Different variables must be controlled, such as corpus completeness, corpus homogeneity, document relevance, document duplicity, etc. Aside from completeness (controlled by domain experts), the authors have used scientometric analysis to measure the "quality" of the document corpus, based on context variables. The main goal of these tools is to select the best document set of all the possible ones, rejecting those documents that are irrelevant to the domain generation process. A Ph.D. thesis was developed to study these possibilities; it is published, in Spanish, in [34], although partial results applied to IR can be found in [35].

4.3. Generation of the domain vocabulary

Step 2 provides DA engineers with a complete set of documents including software and textual artifacts. In this step, these documents must be treated by indexers in order to extract their relevant terms (IEs) and build with them a stable domain vocabulary. To do so, all the different indexers, independently of the artifacts' information structure, must include the following processes:

* Lexical analysis to exclude punctuation marks, letters and hyphens.
* Digit/date identification.
* Elimination of stop words.
* Normalization of words.
* Identification of IEs, either single terms or phrases.
Phrase IEs are information elements formed by more than one single term: IEi = t1 + ... + tn. Indeed, an IE can be formed by other IEs, e.g. "The president of the United States of America".
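As a rough illustration of these indexing steps (a simplified sketch under our own assumptions, not the RSHP indexer itself, which also performs syntactic analysis; the stop-word list is assumed):

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and", "is", "are", "in", "to"}  # assumed list

def index_terms(text):
    """Very simplified term extraction: lexical analysis, digit/date
    identification, stop-word elimination and normalization; phrase
    identification is omitted here."""
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*|\d+(?:[/-]\d+)*", text)   # words, digits/dates
    normalized = [t.lower() for t in tokens]                            # normalization
    return [t for t in normalized if t not in STOP_WORDS]               # stop-word removal

terms = index_terms("Cars are vehicles, and they have motor.")
# -> ['cars', 'vehicles', 'they', 'have', 'motor']
```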
In order to identify IEs, two different strategies can be considered when treating phrase IEs:

* A-strategy: when a phrase is identified, the system generates an IE for each single term ti and for the phrase as well. E.g. for Phi = t1 + t2, where t1 and t2 can themselves be phrases, the system generates three IEs: IEPh, IEt1, IEt2. From "Cosworth Motor" the system gets the following IEs: Cosworth Motor; Cosworth; Motor.
* B-strategy: when a phrase is identified, the system generates only an IE for the phrase and eliminates the internal terms. E.g. for Phi = t1 + t2, the system generates just the phrase IE: IEPh. In this case, from "Cosworth Motor" the system gets only the IE Cosworth Motor.
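A minimal sketch of the two strategies (our own illustration; the handling of nested sub-phrases in the real indexer may be richer):

```python
def phrase_ies_a_strategy(phrase):
    """A-strategy: keep the phrase and every constituent single term."""
    terms = phrase.split()
    return [phrase] + terms

def phrase_ies_b_strategy(phrase):
    """B-strategy: keep only the phrase itself."""
    return [phrase]

print(phrase_ies_a_strategy("Cosworth Motor"))  # ['Cosworth Motor', 'Cosworth', 'Motor']
print(phrase_ies_b_strategy("Cosworth Motor"))  # ['Cosworth Motor']
```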
Describing the functionality of the different RSHP indexers is also out of the scope of this paper; however, the free text RSHP indexer is partially presented in [20]. The indexing process provides domain engineers with a large set of candidate concepts to form the domain vocabulary. Among them it is possible to find:

* Human Language General Terms (HLGT);
* Generic Domain Concepts (GDC);
* Specific Domain Concepts (SDC);
* stop words not previously controlled;
* badly formed words, grammatical errors, indexing errors, etc.
Therefore, a filtering process must be applied to try to eliminate everything but GDCs and SDCs. The expected results will form the IE base. So, filtering refers to the selection of those relevant IEs that will represent the domain vocabulary. The better the filtering process is, the better the domain representation will be. Several filtering algorithms have been proposed in the literature. The most common ones, like tf-idf [36] or n-grams [37], are not applicable to domain analysis, as they normally try to identify those terms that can be used to discriminate documents in information retrieval. Unfortunately, domain analysis has opposite needs: relatively common terms usually represent the upper level of the vocabulary hierarchies, and must be taken into account. Therefore, other algorithms must be used. The origin of semantic representation was stated by Zipf [38] in what is known as Zipf's law: if we distribute by occurrence frequency all the terms used by humans in written language, those having a very high frequency normally represent common terms, stop words and rather irrelevant information; those having a very low frequency usually reflect the particular knowledge and style of the writer more than the important semantics; and the really relevant information can usually be gathered from those terms located in the middle of the interval. The IE filtering process cannot be fully based on Zipf's law, for two reasons:

* The RSHP indexer already solves the stop-word elimination problem, so Zipf's law is not directly applicable, because its application would imply the elimination of representative terms.
* The RSHP indexer also identifies phrases. Phrases usually have lower frequencies than single terms, so if we were to follow Zipf's law strictly we would have to reject almost all phrases as representative concepts of a domain.
So, in order to select the relevant IEs from the whole set of candidate terms, the term occurrence frequencies are computed for every document. Then, considering ni to be the number of occurrences of a term in document j, and N the number of terms in artifact j, a confidence interval (left freq, right freq) for the mean frequency is computed for every artifact (document) [39]. This confidence interval is built by using a contrast for the difference between the mean and a contrast value, with a confidence level of 95%. According to this confidence interval for frequencies, a candidate-term filtering interval (a, b) must be assigned. The usual choice is to set (a, b) = (left freq, right freq), but other configurations can be applied.
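A minimal sketch of this statistical filtering, under our reading of the method (an approximate two-sided 95% confidence interval around the mean term frequency of each document); the helper names are hypothetical:

```python
import math
from collections import Counter

def filtering_interval(term_frequencies, z=1.96):
    """Approximate 95% confidence interval (left_freq, right_freq) for the
    mean occurrence frequency of the terms of one document."""
    freqs = list(term_frequencies.values())
    n = len(freqs)
    mean = sum(freqs) / n
    var = sum((f - mean) ** 2 for f in freqs) / max(n - 1, 1)
    margin = z * math.sqrt(var / n)
    return mean - margin, mean + margin

def filter_terms(term_frequencies, a=None, b=None):
    """Keep candidate IEs whose frequency falls inside the (a, b) interval."""
    left, right = filtering_interval(term_frequencies)
    a = left if a is None else a
    b = right if b is None else b
    return {t for t, f in term_frequencies.items() if a <= f <= b}

# Usage: term_frequencies = Counter(index_terms(document_text))
#        vocabulary_ies = filter_terms(term_frequencies)
```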
As a result of this statistical method, all the candidate terms whose occurrence frequency is located inside the (a, b) interval will be accepted as valid domain vocabulary IEs. They will form the domain vocabulary that will be validated by the domain experts. More advanced filtering algorithms were under development at the time of writing this paper.

4.4. Validation of the domain vocabulary

The domain vocabulary generated by the computer algorithms must be validated by humans. Essentially, domain experts must validate every single IE. The management operations to be performed are:

* accept an IE;
* reject an IE;
* update the concept that describes an IE;
* suggest a new IE.
The strongest difficulty of this process lies in managing it for multiple domain experts at the same time, located in different geographical places. This scenario, which is the most common one, implies coping with different opinions of domain experts regarding a particular IE, all of them stated at different dates and times. Computer-based management applications can be very effective aid tools, and classical statistical methods can be applied to reconcile the experts' decisions about IEs. The authors have developed Web-based management applications to support this validation stage.

4.5. Generation of the candidate domain representation

4.5.1. Evolution of the vocabulary to an artifacts' taxonomy

Four steps must be performed:

1. Transform IEs into artifacts if they represent additional information. As the domain vocabulary has been created by finding IEs in the corpus document set, some of them will certainly be representing different
artifacts. E.g. an IE named "Chassis" can certainly be a class from a UML class diagram:

IE = {<Chassis>class}

Therefore, many IEs will be transformed into artifact IEs.

2. Generate static relationships from phrases. Phrase IEs can generate static RSHPs. Those RSHPs will be added to the vocabulary and they can form taxonomic structures. E.g. from the following phrase IE (which is a class artifact):

IE1 = {<Cosworth Motor>class}

it is possible to conclude the following RSHPs:

RSHP 1 = {no name; Motor; Cosworth Motor}Hierarchy
RSHP 2 = {no name; Cosworth Motor; Cosworth}Association

Graphically, this is shown in Fig. 4.

3. Extract relationships from structured information. Structured information, like for example a class diagram artifact, includes relationships that can be gathered directly from the artifact.

4. Generate relationships from textual information. According to the RSHP representation model, it is possible to identify relationships in text chains by linking normalized verb structures to semantic relationships. For example, the verbal phrase "TO BE GENERIC OF" denotes an RSHP of type Hierarchy; therefore, indexing with the RSHP indexer a text stating "A is generic of B" will imply RSHP 1 = {no name; A; B}Hierarchy.
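An illustrative sketch of steps 2 and 4 (our own simplification; the verb-to-relationship mapping shown is only an assumed example):

```python
def relationships_from_phrase(phrase):
    """Step 2: derive hierarchy/association RSHPs from a multi-word phrase IE,
    e.g. 'Cosworth Motor' -> Hierarchy(Motor, Cosworth Motor) and
    Association(Cosworth Motor, Cosworth)."""
    words = phrase.split()
    if len(words) < 2:
        return []
    head = words[-1]                    # 'Motor'
    modifier = " ".join(words[:-1])     # 'Cosworth'
    return [("no name", (head, phrase), "Hierarchy"),
            ("no name", (phrase, modifier), "Association")]

VERB_TO_RELATION = {"is generic of": "Hierarchy"}   # assumed mapping

def relationship_from_sentence(subject, verb, obj):
    """Step 4: map a normalized verb structure to a semantic relationship."""
    rel_type = VERB_TO_RELATION.get(verb, "Association")
    return ("no name", (subject, obj), rel_type)

print(relationships_from_phrase("Cosworth Motor"))
print(relationship_from_sentence("A", "is generic of", "B"))
```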
4.5.2. Identification of relationships between IEs and artifacts

Once the IEs forming the vocabulary have been chosen, the system has to select those relationships existing among them that can be considered relevant to represent the domain. In this version of RSHP-DA this step has been simplified to the following rule: "The candidate domain representation will include all the RSHPs found during the indexing process that link ONLY those IEs already included in the Domain Vocabulary". This means that only those RSHPs linking IEs included in the Domain Vocabulary will be accepted as candidate RSHPs. Afterwards, the domain experts and engineers have to eliminate reflexive, symmetric and transitive problems within the RSHP set.

4.6. Validation of the domain representation
Fig. 4. Example of an RSHP structure.
The domain representation validation is also essentially a management task.
Table 1
Document corpus (document and origin)

94 css oibs.txt: Acta Informatica (1994)
Ieee icsr.txt: IEEE-ICSR'94, International Conference on Software Reuse (1994)
fm c to opetri.txt: Proceedings of the Application and Theory of Petri Nets (1995)
cs.txt: IEEE Transactions on Software Engineering (1991)
Leda.txt: Communications of the ACM (1995)
Annals of Software Engineering 5 (1998).txt: Annals of Software Engineering 5 (1998)
Ase-journal.txt: Automated Software Engineering (1996)
simil-concept.txt: Proceedings of the 11th European Conference on Artificial Intelligence (1994)
nature-92-08.txt: Conference on Advanced Information Systems Engineering (1993)
drc-orlando.txt: Fourth International Conference on Software Reuse (1995)
tr-cs91-112.txt: Workshop on Formal Aspects of Measurement (1991)
reposit-for.txt: VLDB Journal: Very Large Data Bases (1993)
tr92-29.txt: Proceedings of the 15th International Conference on Software Engineering (1993)
nature93-03.txt: Proceedings Second International Workshop on Software Reuse (1993)
tracz-paper2.txt: ADAGE-IBM-92-02 (1995)
srwn20.txt: Proceedings of the International Conference on Software Engineering (1997)
from subroutines.txt: The American Programmer (1996)
Seke93.txt: Proceedings of the Fifth International Conference on Software Engineering and Knowledge Engineering (1993)
re97.txt: First International IEEE Symposium on Requirements Engineering (1993)
sr94.txt: Proceedings of Third International Conference on Software Reuse (1994)
tr37.txt: Technical Report OSU-CISRC-9/95-TR37 (1994)
otso sew20 ps.txt: 20th Software Engineering Workshop (1995)
seke92.txt: Fourth International Conference on Software Engineering and Knowledge Engineering (1992)
sigsoft93.txt: Proceedings of SIGSOFT (1993)
icse15.txt: Proceedings Fifteenth International Conference on Software Engineering (1993)
icse18.txt: Proceedings International Conference on Software Engineering (1996)
tse s97001.txt: Technical Report TR-95-12 (1995)
infocom98.txt: Infocomm'98 (1998)
cade-paper.txt: Proceedings of the Twelfth International Conference on Automated Deduction (1994)
calpap.txt: Proceedings International Conference on Software Engineering (1998)
ieee94.txt: Proceedings of the Seventh Annual Software Technology Conference (1995)
jts.txt: Proceedings of the Fifth International Conference on Software Engineering (1993)
Groupware and organization aspects are relevant to success. This step is essentially a continuation of the domain vocabulary validation step. The same tools, policies and methods can be used.
5. Experiments

The experiments were performed taking the software reuse domain as the target. All of them followed the RSHP-DA methodology described in the previous section. The document corpus was formed by selecting a set of 32 text artifacts from the CiteSeer database of the NEC Research Institute (http://citeseer.nj.nec.com/cs) (Table 1). A lexical treatment was applied to the corpus, erasing problematic characters and cleaning the documents. Once this preprocessing was performed, the RSHP indexer was used both to create the vocabulary and to extract the relationships. Regarding domain vocabulary generation, several experiments were made to study which
algorithm configuration performs better. As described in the previous section, the generation of the domain vocabulary is a very important task, as the relationships are gathered only from the selected IEs. The experiment structure is described in a UML activity diagram (Fig. 5). Four different sub-experiments were performed to study the possibilities offered in Section 4.3. Each one created the domain vocabulary using the following configuration for the filtering interval (a, b):

* Sub-experiment 1: A-strategy, considering (a, b) = (min(percent 20, left freq), right freq);
* Sub-experiment 2: B-strategy, considering (a, b) = (min(percent 20, left freq), right freq);
* Sub-experiment 3: A-strategy, considering (a, b) = (left freq, right freq);
* Sub-experiment 4: B-strategy, considering (a, b) = (left freq, right freq).
The intention of these experiments was to contrast the importance of the (a, b) interval selection in the domain vocabulary generation process. Initially, sub-experiments 1 and 2 contrast
Fig. 5. Experiments structure.
the possibility of expanding the (a, b) interval from the initial (left freq, right freq), increasing the left border by accepting the minimum of the percent-20 frequency and the left freq frequency. This implies an expanded vocabulary, but it must be controlled whether the quality of the new incoming terms is acceptable. The b limit (the high frequency) has been kept at the confidence interval, as the indexing process has already eliminated the stop words and, therefore, high-frequency terms would quite probably represent common language words. Sub-experiments 3 and 4 were performed considering the (left freq, right freq) interval for filtering terms. However, once the vocabulary has been created, the extraction of relationships is common to all the sub-experiments. Before describing the experiments, it is interesting to remark that, as the corpus was formed only by textual artifacts, the domain representation created did not include software artifacts, but only IEs and relationships. Therefore, the final result was close to an ontology about software reuse.

Table 2
Data about vocabulary reduction when A-strategy is performed

                         Total     Phrases   Single concepts
Concepts                 18,587    12,871    5716
Filtered concepts        1058      123       935
% vocabulary reduction   94.33     99.04     83.64
Number of RSHPs per kind (percentage), 1025 in total: 98 (9.561%), 194 (18.927%), 62 (6.0488%), 393 (38.341%), 119 (11.61%), 57 (5.561%), 23 (2.2439%), 18 (1.7561%), 10 (0.9756%), 51 (4.9756%).
5.1. Experiment 1: A-strategy with (a, b) = (min(percent 20, left freq), right freq)

Considering the A-strategy, where a phrase yields candidate IEs for the phrase itself and for its single included terms and sub-phrases, 18,587 candidate IEs were extracted during the indexing process. The (a, b) filtering interval was calculated for every document, and the results ranged from the (3.0, 26.11) of the "From-Subr" document to the (9.4, 58.7) obtained for "Ase-Journal". This result was also applied in Experiment 2. The result of the statistical confidence calculations was that 1058 terms passed the filtering process, becoming domain vocabulary IEs. This filtering strategy reduces the search space by over 94%. This reduction is especially relevant (and negative) for phrases (reduced by 99%), while the single-term reduction was 84%. The results are presented in Table 2. Once the domain vocabulary was built, the indexing process found 1025 relationships. Fig. 6 shows the distribution of the RSHP elements found. As can be seen, RSHP qualification is the most frequent element, with RSHP association the second one. The qualification RSHP, described in [27], stands for the attribute/value relationship; it is represented as an "=" in the results graphs. The following graphs show some examples of the classification obtained by the RSHP indexer when
Fig. 6. Distribution of RSHP elements using A filtering strategy in Experiment 1.
A-strategy was performed in this experiment (Figs. 7 and 8).

5.2. Experiment 2: B-strategy with (a, b) = (min(percent 20, left freq), right freq)

Considering the B-strategy, where a phrase yields a candidate IE only for the phrase itself, and not for its single terms and sub-phrases, 17,409 candidate IEs were extracted during the indexing process. After the statistical filtering, only 1299 concepts passed, forming the domain vocabulary; 353 of these elements correspond to phrases, which represents 27.2% of all the IEs. Table 3 shows the reduction of the vocabulary if this strategy is used. Fig. 9 summarizes this domain representation. As in the previous experiment, the RSHP qualification and RSHP association elements are the most frequent ones.
Fig. 7. Domain representation of RSHP elements where Algorithm is the index term.
5.3. Experiment 3: A-strategy with (a, b) = (left freq, right freq)

The (a, b) filtering interval must be calculated for every document, and the results ranged from the (3.89, 26.11) of the "From-Subr" document to the (26.77, 58.7) obtained for "Ase-Journal". Considering (a, b) = (left freq, right freq) as the filtering interval, the number of accepted terms is even smaller than in Experiments 1 and 2. In this experiment, the domain representation is formed by 385 IEs. Table 4 shows the reduction of the vocabulary if this strategy is used. The phrases obtained with the filtering process are presented in Table 5. The distribution of the relationships extracted by the process is presented in Fig. 10.
Fig. 8. Domain representation of RSHP elements where Reuse is the index term.
5.4. Experiment 4: B-strategy with (a, b) = (left freq, right freq)

The domain vocabulary gathered in Experiment 4 is formed by 476 concepts. Table 6 shows the reduction of the vocabulary if this strategy is used. The phrases obtained in this filtering process are presented in Table 7. Fig. 11 shows the results of the relationships generation process.
5.5. Results analysis

5.5.1. Generation of domain vocabulary

Regarding the impact of the selected strategy (A or B) for phrase IE gathering, the results are presented in Table 8. It can be seen that the B-strategy generates more phrases than the A-strategy for both intervals. This result is totally expected, as the B-strategy reduces the total frequency of the single terms found in phrases, and therefore phrase frequencies become relatively higher (they move to the right of the interval).
On the other hand, 754 of the IEs are obtained both by strategy A and by strategy B in Experiments 1 and 2. That represents 71.3% of the IEs obtained in Experiment 1 and 58.1% of those obtained in Experiment 2; 94 of these 754 common IEs (12.5%) correspond to phrases. This result suggests that the domain vocabulary generated by the A-strategy is close to being a subset of the one gathered by the B-strategy.
Table 3
Data about vocabulary reduction when B-strategy is performed

                         Total     Phrases   Single concepts
Concepts                 17,409    13,254    4155
Filtered concepts        1299      353       946
% vocabulary reduction   92.54     97.34     77.23

Number of elements per kind (percentage), 1293 in total: 145 (11.21423%), 276 (21.34571%), 65 (5.027069%), 483 (37.35499%), 146 (11.29157%), 56 (4.331013%), 39 (3.016241%), 24 (1.856148%), 11 (0.850735%), 48 (3.712297%).

Table 4
Vocabulary reduction when A-strategy is performed

                         Total     Phrases   Single concepts
Concepts                 18,587    12,871    5716
Filtered concepts        385       15        370
% vocabulary reduction   97.93     99.88     93.53
Table 5
Filtered phrases in Experiment 3

Figure 1, Figure 4, Full article text, Leveraged reuse, Neural network, Reuse libraries, Side effects, Software components, Software metrics, Software reuse, Stage 1, Stage 3, Stage 4, Stage 5, Systems identifiers.

Fig. 9. Distribution of RSHP elements using B filtering strategy in Experiment 2.
Number of elements per kind (percentage), 1520 in total: 167 (10.98684%), 281 (18.48684%), 140 (9.210526%), 492 (32.36842%), 200 (13.15789%), 67 (4.407895%), 29 (1.907895%), 54 (3.552632%), 19 (1.25%), 71 (4.671053%).

Fig. 10. Distribution of RSHP elements using A filtering strategy in Experiment 3.
Table 6
Vocabulary reduction when B-strategy is performed in Experiment 4

                         Total     Phrases   Single concepts
Concepts                 17,409    13,254    4155
Filtered concepts        476       78        398
% vocabulary reduction   97.27     99.41     90.42
Comparing the impact of the filtering interval configuration (Experiments 1 and 2 against 3 and 4), the results show that the behavioral patterns are maintained:

* A-strategy gets fewer phrases than B-strategy;
* B-strategy does not perform worse than A-strategy in single-term identification;
* the domain vocabulary generated by the A-strategy is roughly a subset of the one generated by the B-strategy.
Only 10 phrases are obtained by both filtering processes in Experiments 3 and 4, which represents 66.7% of the phrases obtained with the A-strategy and 12.8% of the phrases obtained with the B-strategy. The common phrases are presented in Table 9.

5.5.2. Generation of the domain representation

As can be seen in Figs. 10 and 11, the filtering process performed in Experiment 3 provides more RSHPs for single-term IEs than the one performed in Experiment 4. One example is presented in Figs. 12 and 13, where the dashed arrow means a dependency, filled diamonds mean compositions,
hollow diamonds mean aggregations and lines represent associations. However, the B-strategy provides more RSHPs for phrase IEs. For example, Figs. 14 and 15 show the results for the phrase IE "Software components". The IEs shown in gray represent terms obtained only in Experiment 4. The complete results for these experiments can be found in [26].

6. Conclusions

Using a general information representation model, which allows representing all kinds of information artifacts, domain analysis techniques can be automated (ADA) so that computers can create candidate domain representations. Therefore, the Information Science, Software Engineering and Knowledge Management fields can apply these techniques to manage information more efficiently. The RSHP-DA method is presented to perform ADA. RSHP-DA is based on the principle that all the information in electronic documents can be represented using concepts and relationships. RSHP-DA suggests performing DA by locating an artifact corpus representing the domain, indexing the documents, generating a domain vocabulary, and building on top of it a candidate domain representation based on concepts and relationships.
Table 7
B1 filtered phrases

Analogical reasoning, Assets certification framework, Assets creation, Astronomical domain, Basic interoperability, Certification policies, Class libraries, Classification scheme, Client perspective, Cocomo II POS architectures, Conceptual limitations, COTS software, Current object, Definition 4, Definition 5, Domain analysis, Domain theory, Equation 1, Evaluation criteria, Executables object, Figure 1, Foundations of research, Genvoca generators, Goals specification, Indirect reuse, Inheritance hierarchy graph, Interfaces Software, ISA relation, Kinds of modifications, Length of indirect path, Leveraged reuse, Libraries class, Libraries components, Los Angeles, Measurements theory, Meta level, MTV generators, Multigraph abstraction, Multiple regression analysis, New systems, Object identifiers, Object oriented, Object oriented languages, Object oriented systems, Object Petri, OTSO methods, Partial morphisms, Person months, Private reuse, Quantity of reuse, Recursive path, Reuse components, Reuse libraries, Reuse measurements, Reuse profiles, Searches goals, Semantic cases, Side effects, Software architectures, Software components, Software description, Software engg, Software entities, Software metrics, Software reuse, Stages 3, Stages 5, Structure mappings, Subroutine libraries, Systems identifiers, Systems perspective, Table 2, Technology Hellas, Two objects, Two technologies, Verbatim reuse, William Frakes, World wide.

Number of elements per kind (percentage), 1140 in total: 176 (15.4386%), 196 (17.19298%), 103 (9.035088%), 406 (35.61404%), 108 (9.473684%), 39 (3.421053%), 27 (2.368421%), 31 (2.719298%), 6 (0.526316%), 48 (4.210526%).

Fig. 11. Distribution of RSHP elements using B filtering strategy in Experiment 4.
One of the main features of RSHP-DA is that it is incremental, and therefore easy to maintain over time. The domain update process is accomplished by indexing new documents and including the results in the domain representation. However, human validation processes must be performed for the
Table 8
Comparison of the phrase filtering process for all the experiments

Experiment 1 (A-strategy, expanded interval): 123 phrases, 1058 total IEs, 11.63% phrases over total IEs.
Experiment 2 (B-strategy, expanded interval): 353 phrases, 1299 total IEs, 27.2% phrases over total IEs.
Experiment 3 (A-strategy, not expanded interval): 15 phrases, 385 total IEs, 3.90% phrases over total IEs.
Experiment 4 (B-strategy, not expanded interval): 78 phrases, 476 total IEs, 16.39% phrases over total IEs.
Table 9
Phrases filtered by both A- and B-strategies for Experiments 3 and 4

Figure 1, Leveraged reuse, Reuse libraries, Side effects, Software components, Software metrics, Software reuse, Stage 3, Stage 5, Systems identifiers.

Fig. 12. RSHPs found in the domain representation for the "software" IE using A-strategy filtering in Experiment 3.

Fig. 13. RSHPs found in the domain representation for the "software" IE using B-strategy filtering in Experiment 4.

Fig. 14. RSHPs found in the domain representation for the "software components" IE using A-strategy filtering in Experiment 3.

Fig. 15. RSHPs found in the domain representation for the "software components" IE using B-strategy filtering in Experiment 4.
previously described stages. Several experiments have been presented, where the ‘‘Software Reuse’’ domain has been modeled using different strategies
(all of them described in this paper), which intend to confirm that ADA can provide useful domain representations, ready to be "only" validated by domain experts. Therefore, these techniques can reduce the economic costs of modeling domains, as well as the time needed to create these representations.
References

[1] Fraunhofer Institute for Experimental Software Engineering (Fraunhofer IESE). http://www.iese.fhg.de/pubs and links/spl/bibliography/definitions/index.html.
[2] J. Neighbors, Software construction using components, Ph.D. Thesis, Department of Information and Computer Science, University of California, Irvine, 1981.
[3] A.T. Berztiss, Domain analysis for business software systems, Inf. Systems 24 (7) (1999) 555–568.
[4] J. Neighbors, The Draco approach to constructing software from reusable components, IEEE Trans. Software Eng. SE-10 (5) (1984).
[5] J. Neighbors, The evolution from software components to domain analysis, Int. J. Software Eng. Knowledge Eng. 2 (3) (1990) 325–354.
[6] R. McCain, Reusable software component construction: a product-oriented paradigm, Proceedings of the 5th AIAA/ACM/NASA/IEEE Computers in Aerospace Conference, Long Beach, CA, 1985.
[7] R. Prieto-Díaz, Implementing faceted classification for software reuse, Communications of the ACM 34 (5) (1991) 88–97.
[8] W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[9] W. Frakes, R. Prieto-Díaz, C. Fox, DARE: domain analysis and reuse environment, Ann. Software Eng. 5 (1998) 125–141.
[10] M. Simos, The Growing of an Organon: A Hybrid Knowledge-Based Technology and Methodology for Software Reuse, IEEE Computer Society Press, Silver Spring, MD, 1991, pp. 204–221.
[11] K. Kang, S. Cohen, J. Hess, W. Novak, S. Peterson, Feature-Oriented Domain Analysis (FODA) Feasibility Study, Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Pittsburgh, 1990.
[12] G. Campbell, S. Faulk, D. Weiss, Introduction to Synthesis, Technical Report Intro Synthesis-90019-N, Software Productivity Consortium, Herndon, 1990.
[13] M. Lubars, Domain analysis and domain engineering in IDeA, in: R. Prieto-Díaz, G. Arango (Eds.), Domain Analysis and Software Systems Modeling, IEEE Computer Society Press, Silver Spring, MD, 1991, pp. 163–178.
[14] W. Vitaletti, E. Guerrieri, Domain analysis within the ISEC RAPID Center, Proceedings of the Eighth Annual National Conference on ADA Technology, 1990.
[15] S. Bailin, Domain Analysis with KAPTUR, Course Notes, CTA Inc., Rockville, 1992.
[16] H. Gomaa, L. Kerschberg, V. Sugumaran, C. Bosch, I. Tavakoli, A prototype domain modeling environment for reusable software architectures, Proceedings of the Third International Conference on Software Reuse, IEEE Computer Society Press, Rio de Janeiro, 1994, pp. 74–83.
[17] M. Simos, Organization Domain Modeling (ODM): formalizing the core domain modeling life cycle, SIGSOFT Software Engineering Notes, Special Issue on the 1995 Symposium on Software Reusability, 1995.
[18] R. Prieto-Díaz, Domain analysis for reusability, Proceedings of COMPSAC'87, Tokyo, 1987, pp. 23–29.
[19] G. Van Slype, Les Langages d'Indexation: Conception, Construction et Utilisation dans les Systèmes Documentaires, Les Éditions d'Organisation, Paris, 1991.
[20] I. Díaz, J. Morato, J. Llorens, An algorithm for term conflation based on tree structures, J. Am. Soc. Inform. Sci. Technol. 53 (3) (2002) 199–208.
[21] J. Llorens, J.M. Fuentes, I. Díaz, RSHP: a scheme to classify information in a domain analysis environment, Proceedings of the Third International Conference on Enterprise Information Systems, IEEE-AAAI-ICEIS, Setúbal, Portugal, 2001, pp. 686–690.
[22] J. Llorens, J. Morato, G. Génova, RSHP: an information representation model based on relationships, in: Special book on Soft Computing, Springer, Berlin, accepted for publication April 2003. Available at: http://www.ie.inf.uc3m.es/grupo/descargas/rshp.zip.
[23] ISO 2788-1986 (E), Guidelines for the Establishment and Development of Monolingual Thesauri, Second edition, UDC 025.48, ISO, 1986.
[24] S.R. Ranganathan, Prolegomena to Library Classification, Asian Publishing House, India, 1967.
[25] ISO/IEC 13250, Information Technology - SGML Applications - Topic Maps, ISO, Geneva, 2000.
[26] I. Díaz, Esquemas de Representación de Información basados en Relaciones. Aplicación a la Generación Automática de Representaciones de Dominios, Ph.D. Thesis, Universidad Carlos III de Madrid, Leganés, Madrid, Spain, 2001.
[27] J. Llorens, I. Díaz, J.M. Fuentes, G. Génova, A new approach to domain analysis, Second Symposium on Reusable Architectures and Components for Developing Distributed Systems (RACDS 2000), II-SCI 2000/ISAS 2000, Orlando, FL, USA.
[28] P. Chen, The entity–relationship model: toward a unified view of data, ACM Trans. Database Systems 1 (1) (1976) 9–36.
[29] J. Rumbaugh, Object-Oriented Modelling and Design, Prentice-Hall, New York, 1994.
[30] W.S. Humphrey, Managing the Software Process, Addison-Wesley, Reading, MA, 1989.
[31] G. Arango, R. Prieto-Díaz, Domain Analysis and Software Systems Modeling, IEEE Computer Society Press, Silver Spring, MD, 1991.
[32] UML: http://www.uml.org.
[33] I. Jacobson, G. Booch, J. Rumbaugh, The Unified Software Development Process, Addison-Wesley, Reading, MA, 1999.
[34] J. Morato, Análisis de las relaciones cienciométricas y lingüísticas en un entorno automatizado, Ph.D. Thesis, Universidad Carlos III de Madrid, Leganés, Madrid, Spain, 1999.
[35] J. Morato, J. Llorens, G. Génova, J.A. Moreiro, Experiments in discourse analysis impact on information classification and retrieval algorithms, Information Processing & Management 39 (6) (2003) 825–851.
[36] G. Salton, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.
[37] J.D. Cohen, Highlights: language- and domain-independent automatic indexing terms for abstracting, J. Am. Soc. Inform. Sci. 46 (3) (1995) 162–174.
[38] G.K. Zipf, Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology, Haffner, New York, 1972.
[39] S. Ríos, Introducción a los Métodos de la Estadística, Nuevas Gráficas, Madrid, 1954.