An object-oriented database design for improved performance
Data & Knowledge Engineering 37 (2001) 117–138

www.elsevier.com/locate/datak

Narasimhaiah Gorla *
Information System and Manufacturing Department, Wayne State University, 5229 Cass Avenue, Detroit, MI 48202, USA
Received 3 October 1997; received in revised form 3 December 1999; accepted 28 November 2000

Abstract

Object-oriented databases (OODBs) are known to be rich in functionality but poor in performance. One of the important factors that affect performance is the physical database design. We developed a methodology for the design of an efficient storage structure of OODB that minimizes the database operating costs. The input for our method is the logical OODB schema and a set of user transactions of retrieval and update types. The output of our method is the determination of which instance variables should be inherited from direct and indirect superclasses and stored in which subclasses. We used a genetic algorithm (GA) to solve this intractable problem. The methodology was applied on a university database. Compared to previous storage models, the storage model produced with our methodology showed database performance improvement ranging from 26% to 31%, on the average. Our results demonstrate a cost-effective storage structure design that boosts the operating performance of OODBs. © 2001 Published by Elsevier Science B.V.

Keywords: Object-oriented databases; Database operating cost; Performance evaluation; Physical database design; Genetic algorithms; Optimization

1. Introduction

Object-oriented databases (OODBs) are the state-of-the-art database technology of the 1990s, just as relational databases were for the 1980s. These advanced systems provide enhanced database management capabilities by reducing the limitations of relational and other record-oriented data models such as hierarchical and network. An object-oriented data model provides capabilities necessary for next-generation applications such as CAD [20], expert systems [2,12], and flexible manufacturing systems [1]. However, performance of data access is a major concern in OODBs [35]. OODBs are reputed to be `rich in functionality but poor in performance', especially when many classes are involved [18]. A major consideration in determining the performance of OODBs is the physical storage structure [36]. There has been little research on the issue of physical storage models for OODBs. The implementors of object-oriented database management systems (OODBMS) use various models of physical structures [37]. These physical implementation models differ in the manner in which the individual objects and instance variables are stored in the

* Tel.: +1-313-577-2568; fax: +1-313-577-4880. E-mail address: [email protected] (N. Gorla).

0169-023X/01/$ - see front matter © 2001 Published by Elsevier Science B.V. PII: S0169-023X(01)00004-0


database. In this research, we propose a methodology for OODB design in terms of the physical organization of these instances in the database that results in the least database operating cost. We use the number of data accesses needed for users' requests as a pseudo measure of database operating cost.

The object-oriented data model is more complex than the relational model, and there is no universally accepted data model for OODB [32]. Object orientation is not based on a formal (mathematical) model like the relational model [28]. Object-oriented data models support the notions of classes, subclasses, class hierarchies, and objects [10]. In an OODB, each entity is an object. Objects include instance variables (similar to attributes in the relational model) that describe the state of the object. Objects also include methods, which contain instructions to manipulate the object or return the object state. Objects with the same set of instance variables and methods are grouped into a class, and the objects are called instances of the class. Similar classes are grouped into superclasses, thus forming superclass–subclass relationships.

In this research, we propose a methodology to determine the OODB storage structure, i.e., to determine which instance variables should be inherited and stored in a subclass and which instance variables should be stored in a superclass, such that users' database requirements can be met in the most economical manner. Instance variables can be inherited from both direct and indirect superclasses; i.e., an instance variable can be inherited by a subclass from its superclasses, whose instance variables are composed of home variables and inherited variables. The possible number of database designs can be very large even with a moderate number of classes and home variables.
If there are C classes with an average of A home variables per class and a fraction f of the classes are leaf classes, then the number of possible storage structures is at least 2^(A·C·(1−f)). Explanation: the total number of home instance variables is A·C, of which A·C·(1−f) belong to classes that have subclasses and may therefore be inherited by those subclasses; in addition, some of these inherited variables may in turn be inherited by further subclasses. Thus, the number of possible physical arrangements of the instance variables is at least 2^(A·C·(1−f)). Even for a small OODB logical schema with five classes, five home instance variables per class, and leaf classes constituting 20% of the total number of classes, there are at least 2^20, i.e., over a million, possible storage structures. It is very difficult to determine the best, or optimum, solution in a reasonable amount of time even for this small problem.

In order to solve this intractable problem, we need an efficient heuristic algorithm that can provide a good solution for larger databases. We use a genetic algorithm (GA) approach [8] to solve this optimization problem. GAs, which are based on genetics and evolution theory, are known to be efficient search methods for difficult optimization problems [11,33]. Though there are other optimization methods for such problems, such as traditional hill climbing and stochastic optimization techniques like simulated annealing (SA) and the chemotaxis algorithm (CA), they present certain disadvantages [24,33]. In traditional hill climbing, the search process starts from a single point and ignores other potential solution points in the search space, thus resulting in local optima. SA [23] is based on the process in which a solid is sufficiently heated to a liquid form and then cooled, so that the particles arrange themselves into a lattice; by selecting parameters for the initial high temperature and the cooling rate, various solution states are obtained. Unlike hill climbing techniques, SA accepts both low-cost solutions and high-cost solutions, thus escaping from local optima. Though this leads to a better solution compared to hill climbing techniques, it results in much longer


computational time. Furthermore, neither hill climbing nor SA uses information from solutions that were already found. CA is another stochastic algorithm, which simulates the chemotaxis of bacteria. A new state is generated from a previous state based on a Gauss distribution. In the search process, CA accepts only lower-cost states and thus converges rapidly; however, CA has difficulty getting out of local optima and thus does not produce good global solutions. GA maintains several solution points in parallel (unlike hill climbing) and exploits them to build better solutions, thus leading toward the global optimum. Unlike SA, GA builds better solutions by using information from already existing good solutions, thus reaching a global optimum much faster. For these reasons, we chose GA over the other optimization algorithms.

The rest of the paper is organized as follows: Section 2 discusses previous related research on OODB storage structures, relevant relational database design techniques, and the description of GA. Section 3 develops the problem description using an integer programming formulation. Section 4 briefly describes the design methodology, and Section 5 illustrates this methodology using a university database. Section 6 presents simulation experiments and results. Section 7 contains a summary, conclusions, and future research directions.

2. Background

In recent years, an enormous amount of research has been published on OODBs. Case studies comparing OODBs with relational databases reveal the superiority of OODBs [17]. Some of the OODBMSs available in the market are Vbase [5], Iris [7], ORION [3,20], and Object Store [30]. A typical OODBMS consists of a query processor, a storage manager, and a user interface module similar to that of SQL. The OODB concepts need to be standardized [16], since there has been much confusion and controversy about what ``object-oriented'' means [20].
Performance of OODBs can be improved by object clustering [34] and by the use of access methods. Class instances can be accessed using a class-hierarchy (C-H) tree index, a B-tree index for all the classes and subclasses [22]. The H-tree provides another efficient access method for both single-class retrievals and class–subclass retrievals [25]. The use of signature files is another promising technique for accessing the instances in OODBs [15].

2.1. Genetic algorithms

GAs are becoming popular for their ability to solve complex problems and provide good, though not necessarily optimal, solutions. The basic idea of GAs is as follows. A set of candidate solutions in the form of bit-strings, called ``chromosomes,'' is randomly generated. Each of these solutions is evaluated using a performance measure called the fitness function. The two best solutions are selected and mated to produce two new solutions. This is called crossbreeding, where fragments of the parent solutions are crossed over to produce offspring. The offspring are merged with the initial population, and the weaker solutions are removed from the solution pool. This becomes the solution pool for the next iteration (also called a generation). Thus, each subsequent generation produces better and better solutions. This form of selection is based on the premise of survival of the fittest. Normally, some form of mutation is also used to escape from local search; mutation involves changing some bits of the bit-string at random [8]. GAs have been applied in the database area to query optimization in relational databases [4]


and distributed database design [26]. To the best of the author's knowledge, no previous work applies GAs to OODB design.

2.2. Relevant relational database design techniques

The file design and accessing techniques for OODBs are similar to those for relational databases. Data fragmentation is used in relational databases for performance improvement. While Hoffer and Severance [13] studied database partitioning using a bond energy algorithm, March [26] provides a survey of these techniques in centralized databases. Navathe et al. [29] provide several algorithms for vertical fragmentation to be used in central and distributed databases. Gorla and Boe [9] discuss a fragmentation procedure in a multi-relational environment. Chu and Ieong [6] discuss attribute partitioning using a transaction-based approach. In all the previous research, attribute partitioning is determined based on user information needs.

2.3. OODB storage models

The issue of physical storage models for OODBs has not been addressed very well in the database literature. Consequently, there is no universally accepted model for the physical storage of the instances in an OODB. Willshire and Kim [37] presented several physical models of OODBs. The difference between the various models lies in the manner in which the instance variables and the instances of each class are stored. In the Home-Class (HC) model [36], used in the ORION OODBMS [3], an instance is stored in the lowest subclass in the hierarchy; each instance contains the values of all instance variables, and an instance is stored only once in the hierarchy. In the Repeat-Classes (RC) model, an instance is stored in all the subclasses to which it applies, and each instance contains all the instance variables of that subclass; in the RC model, there is some redundancy of inherited instance variables. The Split-Instance (SI) model is similar to the RC model, except that the inherited variables are stored only in the superclass, not in any subclass. Fig.
1 shows an example of the logical OODB model, while Figs. 2 and 3 show the storage models SI and RC, respectively. The SI model is used in the IRIS OODBMS [7]. In this research, we propose a physical storage structure that is expected to perform better than these previous data models.

3. Problem description

The problem can be stated as follows: given a logical OODB schema and a set of user retrieval and update requests, determine the storage structure that yields the minimum database operating cost. The objective is to determine which instance variables should be inherited (and stored) in which class of the OODB schema. The evaluation criterion is the total number of instances accessed to process all the users' retrieval and update requirements, which is a pseudo measure for database operating cost. Since I/O access cost is the major cost in data-intensive commercial applications, similar evaluation functions were used in previous research on relational databases [6,29]. In this research, we assume that, as in the SI and RC models, the instances are stored in all the applicable classes and subclasses. Thus the subproblem can be stated as follows: given a logical OODB schema and user retrieval and update requests, determine an efficient storage structure in terms of which instance variables should be inherited (and stored) in each class, such that the proposed design results in the fewest data accesses.


Fig. 1. Class-Hierarchy structure of a University database.
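The hierarchy of Fig. 1 can be written down for experimentation. Below is a minimal sketch, with class names as they appear in Section 5 and the structure assumed from the superclass–subclass edges listed there; the dictionary layout and helper function are this sketch's own conventions, not the paper's.

```python
# Class hierarchy of Fig. 1 (superclass -> direct subclasses),
# reconstructed from the superclass-subclass edges of Section 5.1.
HIERARCHY = {
    "PERSON":   ["EMPLOYEE", "STUDENT"],
    "EMPLOYEE": ["STAFF", "FACULTY"],
    "STUDENT":  ["TA"],
    "FACULTY":  ["TA"],
    "STAFF":    [],
    "TA":       [],
}

def superclasses(cls):
    """Direct superclasses of cls; TA has two (multiple inheritance)."""
    return [sup for sup, subs in HIERARCHY.items() if cls in subs]

print(superclasses("TA"))  # ['STUDENT', 'FACULTY']
```

Note that TA is the only class with two direct superclasses, which is what makes its inheritance constraints (Section 5.1) more involved than those of the other classes.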

3.1. Discussion

If the inherited instance variables are also stored in the applicable subclasses (as in the RC model), then retrievals are executed in the most efficient manner, because all the data pertaining to the instances of any class is available in that class itself and there is no need to visit other classes. However, update operations are performed in the most inefficient way in the RC model: updates require visiting more than one class because instance variables may be replicated in several classes. On the other hand, in the SI model, the instance variables are not inherited (and stored) in the subclasses. Only home instance variables (instance variables originally contained in a class, based on the logical design) are stored in each class, thus eliminating redundancy of the instance variables. Because of this, update operations are performed more efficiently. However, retrieval operations are performed inefficiently, because processing them may require visiting more than one class to get complete information about the instances. Although the RC model is generally better for retrievals than the SI model, SI performs better than RC if the retrievals need only home instance variables. The above comparison of the RC and SI models can be made analytically as follows:


Fig. 2. Split-Instance (SI) model.

Let R be the number of retrieval transactions, U the number of update transactions, N the number of classes in the C-H, T the average number of instances in a class that are accessed by any transaction, u = U/(U + R) the proportion of update transactions, and r = R/(U + R) the proportion of retrieval transactions.


Fig. 3. Repeat-Class (RC) model.

3.1.1. The RC model

As stated before, all instance variables are inherited in the RC model.

Retrieval cost = R · T
Update cost = U · 2 · T · N/2
/* on the average, N/2 classes will be updated; update transactions need to access instances twice, to read and write */

DBCOST_RC (total database operating cost with the RC model)
  = Retrieval cost + Update cost = T · (R + U · N)
  = c · (1 + u · (N − 1)), where c = T · (R + U).     (1)

3.1.2. The SI model

As stated before, no instance variables are inherited in the SI model.

Retrieval cost = R · T · log(N)
/* log(N) nodes should be visited to answer a query, in the worst case */
Update cost = U · 2 · T

DBCOST_SI (total database operating cost with the SI model)
  = Retrieval cost + Update cost = T · (R · log N + 2 · U)
  = c · ((1 − u) · log N + 2u).     (2)

Eqs. (1) and (2) above are plotted for varying number of classes (N) and varying proportion of updates (u). The following can be observed from the graphs: for update proportion u > 50%, DBCOST_SI < DBCOST_RC. In a database environment where retrievals dominate, i.e., u = 0, DBCOST_SI > DBCOST_RC. In a database environment where updates dominate, i.e., u = 1, DBCOST_SI < DBCOST_RC.
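The two cost functions can be compared numerically. The sketch below encodes Eqs. (1) and (2) with c = 1; a base-2 logarithm is assumed, since the paper does not state the base.

```python
import math

def dbcost_rc(u, n, c=1.0):
    """Eq. (1): RC-model operating cost, c * (1 + u*(N-1))."""
    return c * (1 + u * (n - 1))

def dbcost_si(u, n, c=1.0):
    """Eq. (2): SI-model operating cost, c * ((1-u)*log(N) + 2u).
    Base-2 logarithm assumed."""
    return c * ((1 - u) * math.log2(n) + 2 * u)

n = 8
print(dbcost_rc(0.0, n), dbcost_si(0.0, n))  # 1.0 3.0  (retrieval-only: SI costs more)
print(dbcost_rc(1.0, n), dbcost_si(1.0, n))  # 8.0 2.0  (update-only: SI costs less)
```

For N = 8, the two curves cross at an update proportion between 0 and 0.5, which is consistent with the qualitative observations above.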

Thus, in order to arrive at an efficient storage structure for OODBs, a trade-off is required between the costs of retrieval operations and the costs of update operations. The efficient storage


structure lies between the SI and RC models. A generalized, computationally feasible solution method is required in order to provide a database design that minimizes the total costs of retrievals and updates over all the users.

4. Design methodology

We use the cost of database operations (comprising retrieval and update operations) to evaluate a storage structure. The objective function is the number of instance accesses needed to satisfy the retrieval and update requests of users – a pseudo measure for the cost of database operations. Previous researchers considered similar pseudo measures for database operating cost. Since the design problem is computationally complex, it is difficult to obtain an optimal solution for the problem. We obtain a heuristic solution based on the GA. Before presenting the solution method, we formulate the problem as a mathematical programming problem.

4.1. Mathematical formulation

Problem formulation serves two purposes: (i) it helps us understand the OODB design problem better; (ii) the objective function of the formulation serves as the database design evaluation function. The OODB design problem is formulated below as a 0–1 integer programming problem with the objective of minimizing the total cost of database operations. In the following, transaction t, instance variable k, and classes i, j are used.

Minimize  Σ_t c_t Σ_j Q_tj · T_tj
/* the cost function is the total number of instances accessed by all the transactions in all the classes in a given storage structure */

subject to  T_tj ≤ T_ti, where i is a superclass of j, ∀ i, j
/* the number of instances in a subclass is no more than that in its superclass */

where

x^k_ij = 1 if instance variable k is inherited from class i to class j, 0 otherwise; for all i that are not leaf classes and all j that are not root classes
/* these are the decision variables that indicate which instance variables should be inherited */
T_tj = number of instances in class j to be accessed by transaction t
c_t = cost coefficient /* c_t = 1 if t is a retrieval transaction, 2 if t is an update transaction; update transactions access an instance twice, for read and write, while a retrieval transaction accesses it only once */
A_i = set of home variables in class i


I_ij = set of instance variables inherited from class i to class j
Q_tj = 1 if transaction t needs to access class j, 0 otherwise (i.e., Q_tj = 1 if Σ_k q^k_t · y^k_j ≥ 1)
q^k_t = 1 if transaction t needs instance variable k, 0 otherwise
y^k_j = 1 if instance variable k is present in class j (i.e., y^k_j = 1 if k ∈ A_j ∪ I_ij)

4.2. Algorithm

We illustrate our design method using the University-database example in Fig. 1. The storage structures of RC and SI are shown in Figs. 2 and 3, respectively. The following are the steps for the OODB design, considering the example problem.

/* Generation of solution bitstrings: a solution bitstring is composed of several substrings, one per superclass–subclass relationship. The substring contains 0s and 1s, where a 1 denotes that an instance variable from the superclass is inherited (and stored) in the subclass and a 0 denotes that it is not. Thus, the length of the substring for a superclass–subclass relationship is the number of variables – home or inherited – present in that superclass. */

1. Generate randomly several solution bitstrings based on the above format. Apply the inheritance constraint to each bitstring to check whether the bitstring is valid.
/* The inheritance constraint is satisfied only if the instance variables – home or inherited – corresponding to the 1s in a superclass–subclass bitstring are present in the superclass. */

2. Compute the cost of database operations for each of the above solutions using the objective function given in the mathematical formulation above, i.e.,

Solution cost = Σ_t c_t Σ_j Q_tj · T_tj

Also compute the fitness function and selection factor for each solution bitstring i:

Fitness-of-solution_i = 1 − Solution cost_i / Σ_i Solution cost_i
Selection-factor-of-solution_i = Fitness-of-solution_i / Σ_i Fitness-of-solution_i

/* The fitness function is a measure that reflects the goodness of a solution: the higher the fitness of a solution, the more likely the solution will produce offspring in the next generation [8]. In the database design problem, the objective function is the database operating cost, which should be minimized. Thus, we map the database operating cost linearly to the fitness function, so that the fitness value lies between 0 and 1 and a solution with low cost has high fitness. We also compute a relative fitness, called the selection factor, obtained by dividing the fitness of a solution by the sum of the fitness of all solutions. The selection factor makes it convenient to select solutions for the next generation, which can be done using the roulette wheel mechanism [8]. */

3. REPEAT steps 3A–3E
3A. Pick two design solutions at random from the pool based on selection factors.
/* Design solutions with high selection factors are more likely to be selected. */
3B. Derive the offspring from these two solutions.


/* In order to do that: (i) Pick a point at random in the bitstrings and obtain two offspring by crossing over. This is done by concatenating the first segment of the first bitstring with the second segment of the second bitstring, and by concatenating the first segment of the second bitstring with the second segment of the first bitstring. (ii) Verify that the offspring solutions satisfy the inheritance constraints. If any of the offspring are invalid, generate two corrected solutions for each invalid offspring. These corrections are made by forward and backward adjustments. (iii) The forward adjustment assumes the first segment is correct and modifies the second segment so that the inheritance constraint is satisfied. The backward adjustment assumes the second segment is correct and changes the first segment so that the inheritance constraint is satisfied. (iv) For example, consider two solutions `11 10 1111' and `10 11 0011'. One of the offspring obtained by crossing over between the fourth and fifth bits is `10 11 1111'. This solution violates the inheritance constraint. Using the forward adjustment, it can be corrected to `10 11 1011'; using the backward adjustment, `11 11 1111' is obtained. */
3C. Mutate the above offspring. This is done by selecting one bit at random and deciding, using a random number, whether to flip it. Perform forward and backward adjustments, if necessary, to satisfy the inheritance constraint.
/* For a population of 20 solutions with 20 bits each, mutation is applied to one bit out of 400 bits, resulting in a mutation probability of 0.0025. */
3D. Compute the database operating cost, fitness, and selection factor for each of the solutions obtained in steps 3A and 3B.
3E. Add these solutions to the solution pool and remove an equal number of inferior solutions from the pool based on selection factors.
UNTIL a satisfactory solution is obtained.
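The loop of steps 1–3 can be sketched generically in Python. This is a sketch, not the paper's implementation: it omits the inheritance-constraint checks and forward/backward adjustments of step 3B, and it substitutes an illustrative toy cost function (the number of 1-bits) for the database operating cost; all function names are this sketch's own.

```python
import random

def fitness(costs):
    """Step 2: fitness_i = 1 - cost_i / sum(costs)."""
    total = sum(costs) or 1  # guard for a degenerate all-zero-cost pool
    return [1 - c / total for c in costs]

def roulette_pick(pool, fit):
    """Step 3A: pick a solution with probability proportional to fitness."""
    r = random.uniform(0, sum(fit))
    cumulative = 0.0
    for solution, f in zip(pool, fit):
        cumulative += f
        if cumulative >= r:
            return solution
    return pool[-1]

def crossover(a, b):
    """Step 3B: one-point crossover (constraint repair omitted here)."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, p=0.0025):
    """Step 3C: flip each bit with the small probability used in the paper."""
    return [b ^ 1 if random.random() < p else b for b in bits]

def ga(cost_fn, n_bits=20, pop_size=20, generations=200):
    pool = [[random.randint(0, 1) for _ in range(n_bits)]
            for _ in range(pop_size)]
    for _ in range(generations):
        fit = fitness([cost_fn(s) for s in pool])
        a, b = roulette_pick(pool, fit), roulette_pick(pool, fit)
        o1, o2 = crossover(a, b)
        pool.extend([mutate(o1), mutate(o2)])
        pool.sort(key=cost_fn)   # step 3E: keep the pool size by dropping
        pool = pool[:pop_size]   # the two worst solutions
    return pool[0]               # best (lowest-cost) solution found

random.seed(42)
best = ga(lambda s: sum(s))  # toy cost: number of 1-bits
```

Because the truncation in step 3E is elitist, the best cost in the pool never worsens from one generation to the next.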
/* Steps 3A–3E are repeated until the crossed-over solutions are worse than the parent solutions; alternatively, the algorithm may be stopped after a certain time has elapsed, and the best solution in the pool is taken as the desired solution. */

5. An illustrative example

We consider the OODB logical schema of the University database shown in Fig. 1. The following retrieval and update transactions are considered on the database. The frequency of each transaction is assumed to be 1.

Transaction                                               Type of transaction
1. Get all details of teaching-assistant with SSN = 498   Retrieval
2. List all faculty names, ranks, and their salaries      Retrieval
3. List all the names of students and their GPA           Retrieval
4. Increase the salary of all employees by 5%             Update
5. Modify the GPA to 3.5 for student with SSN = 450       Update
6. Increase the age of staff #234 by 1 year               Update


5.1. Solution bitstring format

A typical solution consists of a bitstring of 1's and 0's, in which a `1' denotes that the corresponding instance variable is to be stored in the subclass and a `0' denotes otherwise. The bitstring in this example contains six substrings corresponding to the six superclass–subclass edges. These edges are as follows.

The edge on the graph     Possible variable inheritance               Example substring
1. PERSON → EMPLOYEE      Name, Age                                   10
2. PERSON → STUDENT       Name, Age                                   00
3. EMPLOYEE → STAFF       Name, Age, Date-Hired, Salary               1011
4. EMPLOYEE → FACULTY     Name, Age, Date-Hired, Salary               1001
5. STUDENT → TA           Name, Age, Major, GPA                       0011
6. FACULTY → TA           Name, Age, Date-Hired, Salary, Rank, Dept   100110

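The edge encoding above, together with the inheritance constraints a valid solution must satisfy, can be sketched as executable code. The edge list and home-variable sets below are read off the example's description (SSN is included in every class, since the text notes it is replicated everywhere); the function names and data layout are this sketch's own assumptions.

```python
# Hypothetical encoding of the example: each edge lists the variables its
# substring can inherit; HOME holds each class's home instance variables.
EDGES = [
    ("PERSON",   "EMPLOYEE", ["Name", "Age"]),
    ("PERSON",   "STUDENT",  ["Name", "Age"]),
    ("EMPLOYEE", "STAFF",    ["Name", "Age", "Date-Hired", "Salary"]),
    ("EMPLOYEE", "FACULTY",  ["Name", "Age", "Date-Hired", "Salary"]),
    ("STUDENT",  "TA",       ["Name", "Age", "Major", "GPA"]),
    ("FACULTY",  "TA",       ["Name", "Age", "Date-Hired", "Salary",
                              "Rank", "Dept"]),
]
HOME = {
    "PERSON":   {"SSN", "Name", "Age"},
    "EMPLOYEE": {"SSN", "Date-Hired", "Salary"},
    "STUDENT":  {"SSN", "Major", "GPA"},
    "STAFF":    {"SSN", "Duties"},
    "FACULTY":  {"SSN", "Rank", "Dept"},
    "TA":       {"SSN", "Work-Load"},
}

def stored_vars(cls, solution):
    """Variables stored in cls: home variables plus everything the
    solution inherits into cls along any incoming edge."""
    present = set(HOME[cls])
    for substring, (sup, sub, eligible) in zip(solution, EDGES):
        if sub == cls:
            present.update(v for bit, v in zip(substring, eligible)
                           if bit == "1")
    return present

def satisfies_inheritance(solution):
    """A '1' bit is legal only if the variable is present in the superclass."""
    for substring, (sup, sub, eligible) in zip(solution, EDGES):
        sup_vars = stored_vars(sup, solution)
        for bit, v in zip(substring, eligible):
            if bit == "1" and v not in sup_vars:
                return False
    return True

rc = ["11", "11", "1111", "1111", "1111", "111111"]   # full inheritance
si = ["00", "00", "0000", "0000", "0000", "000000"]   # no inheritance
bad = ["10", "11", "1111", "1111", "1111", "111111"]  # STAFF inherits Age,
                                                      # but EMPLOYEE did not
```

With this encoding, `satisfies_inheritance(rc)` and `satisfies_inheritance(si)` hold, while the `bad` string fails on edge 3, mirroring constraint (a) below.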
Some of the inheritance constraints are: (a) STAFF can inherit Name and Age only if EMPLOYEE has inherited them from PERSON. (b) FACULTY can inherit Name and Age only if EMPLOYEE has inherited them from PERSON. (c) TA can inherit Name and Age only if STUDENT has inherited them from PERSON. (d) TA can inherit Name, Age, Date-Hired, and Salary only if these are inherited by FACULTY from EMPLOYEE, and EMPLOYEE in turn must have inherited Name and Age from PERSON. Since TA involves multiple inheritance, TA can inherit Name and Age either through STUDENT or through FACULTY (not necessarily both).

A solution string `00 00 0000 0000 0000 000000' represents the SI model, since none of the instance variables is inherited. Similarly, a solution string `11 11 1111 1111 1111 111111' represents the RC model, since all the instance variables are inherited by each class. A `1' signifies that an instance variable is inherited and stored in the subclass, and a `0' implies that it is not. The first substring `11' represents the inheritance along edge #1 above, i.e., EMPLOYEE inherits both Name and Age from PERSON, and they are stored in the EMPLOYEE class together with its home instance variables SSN, Date-Hired, and Salary. Similarly, the third substring `1111' implies that STAFF inherits four instance variables – Name, Age, Date-Hired, and Salary – which are stored in STAFF in addition to its home instance variables SSN and Duties; since Name and Age are inherited by EMPLOYEE from PERSON, the constraint is satisfied. Similarly, the sixth substring `111111' implies that the instance variables Name, Age, Rank, Dept, Major, GPA, SSN, and Work-Load are stored in TA. It can be observed that SSN is replicated in all classes, since SSN is used to relate the instances of the several subclasses corresponding to one PERSON instance.

5.2. Cost of database operations

We consider the number of instances accessed in executing database transactions to be synonymous with the database operating cost. Since the time to execute a transaction is predominantly I/O


bound in business information processing, the number of instance accesses was taken as synonymous with the database operating cost in relational databases [9,29]. We take a similar approach, in that we consider the number of instance accesses to be a pseudo measure of database operating cost.

The computation of database operating costs for the fully inherited case (the RC model) is as follows. Transaction #1 (get all details of teaching-assistant with SSN = 498) needs to access one instance in the class Teaching-Assistant, and transaction #2 (list all faculty names, ranks, and their salaries) needs to access four instances of the Faculty class. Similarly, transaction #3 needs five instances. Update transactions require, in general, about twice the number of I/Os compared to retrieval transactions. Transaction #4 needs to update instance variables in the classes Employee, Faculty, Staff, and Teaching-Assistant; thus, it needs 26 accesses. Transaction #5 needs four accesses and transaction #6 needs six accesses. The total cost of database operations for the fully inherited storage structure is then 46. Similarly, the database operating cost for the non-inherited case (i.e., solution #2) is 43 (the retrieval transactions cost 5, 12, and 10, respectively; the update transactions cost 12, 2, and 2, respectively). Other solutions in the pool and their database operating costs are given in the following section.

5.3. Initial solution pool

The following table gives an initial solution pool for the University-database example.

Solution#   Solution bitstring            Cost   Fitness   Selection
1           11 11 1111 1111 1111 111111   46     0.8962    0.0996
2           00 00 0000 0000 0000 000000   43     0.9029    0.1003
3           11 11 1100 1100 1100 110000   46     0.8962    0.0996
4           11 11 0111 0101 1111 010111   52     0.8826    0.0981
5           10 10 0100 1010 1010 111010   35     0.9210    0.1023
6           11 11 1111 1100 1001 101101   50     0.8871    0.0986
7           00 11 1100 0100 0000 010100   47     0.8939    0.0993
8           00 00 1100 1100 0000 010100   52     0.8826    0.0981
9           10 11 0000 1000 1100 010000   35     0.9210    0.1023
10          00 11 1000 1000 0011 000011   37     0.9165    0.1018
Total                                     443    9.0000    1.0000
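The totals and the fitness/selection columns of the pool can be checked with a few lines. The per-transaction access counts below are those given in Section 5.2; the variable names are this sketch's own.

```python
# Per-transaction access counts from Section 5.2 (RC = fully inherited,
# SI = non-inherited), followed by the fitness computation of step 2.
rc_accesses = [1, 4, 5, 26, 4, 6]
si_accesses = [5, 12, 10, 12, 2, 2]
print(sum(rc_accesses), sum(si_accesses))  # 46 43

costs = [46, 43, 46, 52, 35, 50, 47, 52, 35, 37]
total_cost = sum(costs)                        # 443
fitness = [1 - c / total_cost for c in costs]
selection = [f / sum(fitness) for f in fitness]
print(round(fitness[0], 4), round(selection[0], 4))  # 0.8962 0.0996
```

Note that the fitness values always sum to (pool size − 1) = 9 under this linear mapping, which is why the Total row shows 9.0000.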

The average cost of solutions in the solution pool is 44.3, while the best cost is 35.

5.4. Cross-over

Using a random number on the selection factors, two parents – solutions #4 and #9 – are selected for mating from the pool. Crossing over these parents produces the following offspring. One node in the class hierarchy is selected at random as a cross-over point, and each parent bitstring is divided into two parts: one part is the subtree below the cross-over point and the other part is the rest of the tree.
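The roulette-wheel selection used here can be sketched as follows, using the selection factors of the initial pool; the function name is this sketch's own.

```python
import random

def roulette_pick(selection_factors):
    """Return the index of the chosen solution; the selection factors
    are assumed to sum to 1 (as in the pool table above)."""
    r = random.random()
    cumulative = 0.0
    for i, s in enumerate(selection_factors):
        cumulative += s
        if cumulative >= r:
            return i
    return len(selection_factors) - 1

selection = [0.0996, 0.1003, 0.0996, 0.0981, 0.1023,
             0.0986, 0.0993, 0.0981, 0.1023, 0.1018]
random.seed(1)
print(roulette_pick(selection))
```

Because the selection factors are nearly uniform in this small pool, the bias toward low-cost solutions is mild; it compounds over generations.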


Solution #4    11 11 0111 0101 1111 010111    Cost = 52
Solution #9    10 11 0000 1000 1100 010000    Cost = 35

Crossed-over solutions:
Offspring 1:   11 11 0000 0101 1111 010111
Offspring 2:   10 11 0111 1000 1100 010000

After forward adjustment:
Offspring 1':  11 11 0000 0101 1111 010111    Cost = 48    Solution 11
Offspring 2':  10 11 0110 1000 1100 010000    Cost = 37    Solution 12

After backward adjustment:
Offspring 1'': 11 11 0000 0101 1111 010111    Cost = 48
Offspring 2'': 11 11 0111 1000 1100 010000    Cost = 41    Solution 13
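One way to read this example is a substring-level crossover: the cross-over point is the STAFF node, so only the EMPLOYEE→STAFF substring (index 2) is exchanged between the parents. A sketch (the function name and index-set interface are this sketch's own):

```python
def subtree_crossover(parent_a, parent_b, swap_indices):
    """Exchange the substrings of the edges inside the chosen subtree."""
    child_a = [b if i in swap_indices else a
               for i, (a, b) in enumerate(zip(parent_a, parent_b))]
    child_b = [a if i in swap_indices else b
               for i, (a, b) in enumerate(zip(parent_a, parent_b))]
    return child_a, child_b

s4 = ["11", "11", "0111", "0101", "1111", "010111"]  # cost 52
s9 = ["10", "11", "0000", "1000", "1100", "010000"]  # cost 35
o1, o2 = subtree_crossover(s4, s9, {2})
print(" ".join(o1))  # 11 11 0000 0101 1111 010111
print(" ".join(o2))  # 10 11 0111 1000 1100 010000
```

The two printed strings match Offspring 1 and Offspring 2 above, before the forward and backward adjustments are applied.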

5.5. Second generation solution pool

Offspring 1', 2', and 2'' replace solutions #4, #6, and #8 in the initial pool, resulting in the second generation solution pool:

Solution#   Solution bitstring              Cost   Fitness   Selection
1           11 11 1111 1111 1111 111111      46    0.8892    0.0988
2           00 00 0000 0000 0000 000000      43    0.8964    0.0996
3           11 11 1100 1100 1100 110000      46    0.8892    0.0988
11          11 11 0000 0101 1111 010111      48    0.8843    0.0983
5           10 10 0100 1010 1010 111010      35    0.9157    0.1017
12          10 11 0110 1000 1100 010000      37    0.9108    0.1012
7           00 11 1100 0100 0000 010100      47    0.8867    0.0985
13          11 11 0111 1000 1100 010000      41    0.9012    0.1001
9           10 11 0000 1000 1100 010000      35    0.9157    0.1017
10          00 11 1000 1000 0011 000011      37    0.9108    0.1012
Total                                       415    9.0000    1.0000
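One full generation (roulette-wheel parent selection, cross-over, and replacement into the pool) can be sketched as below. The replacement policy shown (an offspring replaces the currently worst pool member only if it is cheaper) is an assumption for illustration; the worked example replaces specific solutions. The `crossover` and `evaluate` functions are problem-specific and supplied by the caller.

```python
import random

def next_generation(pool, costs, crossover, evaluate, rng=None):
    """One GA generation: pick two parents by roulette wheel on the
    selection factors, cross them over, and let each offspring replace
    the worst pool member if the offspring is cheaper.
    (Replacement policy is an assumption; crossover/evaluate are supplied.)"""
    rng = rng or random.Random(0)
    total = sum(costs)
    fitness = [1 - c / total for c in costs]
    weights = [f / sum(fitness) for f in fitness]     # selection factors
    i, j = rng.choices(range(len(pool)), weights=weights, k=2)
    for child in crossover(pool[i], pool[j]):
        worst = max(range(len(pool)), key=lambda k: costs[k])
        child_cost = evaluate(child)
        if child_cost < costs[worst]:
            pool[worst], costs[worst] = child, child_cost
    return pool, costs
```

Because offspring enter the pool only when they improve on the worst member, the maximum cost in the pool never increases from one generation to the next.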


Fig. 4. Generation-wise database operating costs.

The average cost of solutions in the second generation is 41.5, which represents an improvement in the solution pool. The best solution has a cost of 35. The process is repeated, and the optimal solution is obtained in the ninth generation. The generation-wise average and best costs for this example problem are shown in Fig. 4.

5.6. The `Optimal' design

The final solution pool, obtained in the ninth iteration, is given below:

Solution#   Solution bitstring              Cost   Fitness   Selection
1           11 11 1111 1111 1111 111111      46    0.8786    0.0976
2           00 00 0000 0000 0000 000000      43    0.8865    0.0985
17          11 11 1111 1111 1100 010000      32    0.9156    0.1017
14          11 11 0111 1000 1100 010000      41    0.8918    0.0991
5           10 10 0100 1010 1010 111010      35    0.9077    0.1009
12          10 11 0110 1000 1100 010000      37    0.9024    0.1003
15          11 11 0111 1000 1111 111011      39    0.8971    0.0997
16          11 11 0000 0101 1111 010111      34    0.9103    0.1011
9           10 11 0000 1000 1100 010000      35    0.9077    0.1009
10          00 11 1000 1000 0011 000011      37    0.9024    0.1003
Total                                       379    9.0000    1.0000


Fig. 5. An optimal OODB storage structure for the university database (instance variables in italics are inherited from direct or indirect superclasses).

The average cost in the solution pool is 37.9, while the best solution (solution #17) has a cost of 32. The storage structure corresponding to the best solution is shown in Fig. 5. Through exhaustive enumeration, we found that an optimal solution has a cost of 32. Hence, GA iterations were terminated as soon as a solution with a cost of 32 was obtained. For large problems, the algorithm could instead be terminated after a specified number of iterations. In this example problem, if the algorithm were terminated after three iterations, the best GA solution has a cost of 35, i.e. within 9% of the optimal cost of 32; this represents about a 25% improvement over the RC solution (solution #1) and about 19% over the SI solution (solution #2). After six iterations, the best GA solution has a cost of 34, i.e. within 5% of optimal, an improvement of 27% over RC and 22% over SI. Thus, our GA approach provides an efficient storage structure that is very close to the optimal solution and is a substantial improvement over the previous RC and SI storage models.

6. Simulation experiment results

A computer program was written in C to automate the database design process using the GA, and it was run on Unix. The parameters selected are as follows:

Number of classes in the C-H                            4-25
Number of home instance variables in a class            1-15
Number of instances in a class                          max 1000
Number of solutions in the GA pool                      20
Number of iterations used in GA to get best solution    100
Number of transactions                                  00
Proportion of updates in the total transactions         10-30%

The number of instances in a subclass is determined using a random number between 50% and 100% of the number of instances in the superclass with the fewest instances. Each transaction is generated considering the logical OODB model, such as in Fig. 1. The target class for a transaction is chosen randomly from the classes in the database schema. Then all the superclasses (up to the root) of that target class are identified. Some (0-100%) of these superclasses are selected at random; instance variables are then selected randomly from these selected superclasses. The number of instances needed for a transaction is determined from the number of instances available in the target class. This constitutes a transaction. The GA is applied to schemas with 5, 10, 15, 20, and 25 classes. The proportions of updates considered are 0%, 5%, 10%, 15%, 20%, 25%, and 30%. Thus we solved 35 database problems to determine the `best' storage structures, and computed the GA, SI, and RC costs. As described in Section 4, we use the number of instances needing access as a pseudo-measure of database operating cost. Fig. 6 shows the best solution in each of the 20 iterations used by the GA for a schema with 25 classes and a 10% update proportion. In the first iteration, the best solution showed an improvement of 30% over the SI model and 17% over the RC model. At the end of 20 iterations, the best solution showed a 42% cost reduction compared to the SI model and a 30% reduction compared to the RC model; the best solution itself improved by 18% over the 20 iterations. Fig. 7 shows the database performance of GA, compared to the SI and RC models, for a schema with 25 classes under different update proportions. It may be noted that the absolute database operating cost depends heavily on the data and transactions, but the relative costs are not very sensitive.
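The transaction-generation procedure can be sketched as follows. The data structures are illustrative assumptions: `schema` maps each class to its home instance variables, `superclasses` maps each class to its ancestors up to the root, and `instances` maps each class to its instance count.

```python
import random

def generate_transaction(schema, superclasses, instances, rng=None):
    """Generate one synthetic transaction following the simulation procedure:
    a random target class, a random 0-100% subset of its superclasses,
    instance variables picked from those superclasses, and an instance
    count drawn from the target class population."""
    rng = rng or random.Random()
    target = rng.choice(sorted(schema))               # random target class
    ancestors = superclasses[target]                  # up to the root
    chosen = rng.sample(ancestors, rng.randint(0, len(ancestors)))
    variables = [rng.choice(schema[c]) for c in chosen if schema[c]]
    n = rng.randint(1, instances[target])             # instances needed
    return target, variables, n
```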
When evaluating the performance of the GA solution, the relative database operating costs (of GA, SI, and RC) are more important than the absolute costs. As can be seen from Fig. 7, the best-solution (GA) cost coincides with the RC cost in an all-retrieval transaction environment. The worst performance in that environment is obtained by the SI model; these results agree with the analytical explanations of the RC and SI models in Section 3. Thus the GA solution shows a performance improvement of 77% over the SI model and 0% over the RC model. As the update proportion increases, the RC cost increases and the SI cost decreases relative to the GA cost. In an all-update transaction environment, with a database schema of 25 classes, the GA solution showed a performance improvement of 10% over the SI model and 72% over the RC model.

6.1. Database operating costs in various update environments

As can be seen from Tables 1-3, the database performance improvement of GA ranged from 7% to 27% over SI, and from 20% to 58% over RC. In each of the 35 database design problems, the GA solution showed the lowest cost compared to SI or RC. The database performance of GA is better in schemas with more classes as compared to schemas


Fig. 6. Iteration-wise performance with 25 classes.

Fig. 7. Database performance improvement of GA with 25 classes.


Table 1
Database operating cost in low update environment

Number of    GA solution   SI solution   RC solution   Improvement    Improvement
classes      cost          cost          cost          over SI (%)    over RC (%)
5            71,229        88,230        71,332        19             0
10           72,435        88,142        88,487        18             18
15           46,329        79,032        61,588        41             25
20           63,708        71,824        81,983        11             22
25           85,060        138,280       120,361       38             29
Average      67,752.2      93,101.6      84,750.2      27.23          20.06

Table 2
Database operating cost in medium update environment

Number of    GA solution   SI solution   RC solution   Improvement    Improvement
classes      cost          cost          cost          over SI (%)    over RC (%)
5            45,195        45,231        62,142        0              27
10           58,353        62,725        113,641       7              49
15           60,044        79,998        106,953       25             44
20           71,025        71,378        195,826       1              64
25           58,833        72,291        118,002       19             50
Average      58,690        66,324.6      119,312.8     11.51          50.81

Table 3
Database operating cost in high update environment

Number of    GA solution   SI solution   RC solution   Improvement    Improvement
classes      cost          cost          cost          over SI (%)    over RC (%)
5            74,617        78,850        97,396        5              23
10           61,240        68,582        159,247       11             62
15           72,227        78,309        184,109       8              61
20           69,083        70,545        152,667       2              55
25           53,222        58,937        187,833       10             72
Average      66,077.8      71,044.6      156,250.4     6.99           57.71

with fewer classes. It can be inferred that GA performance will not deteriorate with much larger database schemas. The reason for the better performance of GA in larger schemas is that GA considers user transaction patterns when assigning instance variables to classes, so that database operating costs remain low. On the other hand, SI and RC are insensitive to user transaction patterns and thus incur high database costs, especially in larger schemas. The best performance improvement of GA over RC is 72%, obtained with 25 classes in the schema in the high-update environment. In schemas with fewer classes, only a few possible arrangements of the instance variables exist; thus, the cost difference between the GA solution and the SI or RC solutions is marginal.


6.2. Time performance

The average time taken for the GA to solve a database problem with five classes is 1.4 min, while it is 60 min for the schema with 25 classes (these times are for 100 iterations). To arrive at a satisfactory solution, 20 iterations suffice, in which case the time taken by the GA ranges from 16 s to 12 min. These times were measured on a VAX system with the C implementation and are averaged over the different update proportions.

7. Conclusions

OODBs provide the modeling power needed for the data requirements of complex systems that conventional systems cannot model. However, the overhead associated with object-oriented database implementations is substantial, resulting in poor performance of OODB systems. In this research, we provide a database design methodology that helps reduce database operating cost in OODBs. In our approach, we consider the user information requirements, composed of retrievals and updates, in determining the best storage structure for an OODB. To solve this intractable problem, we used the GA approach. We applied our methodology to a sample university database with six classes. Our design solution is close to the optimal solution (within 5%) after six iterations, and we obtained the optimal solution after nine generations of the GA. A C program implementing our design methodology was run on a VAX computer under the UNIX operating system. We solved 35 database design problems, varying the schema size from 5 to 25 classes and the update proportion from 0% to 30% of the transaction mix. In all cases our GA methodology produced the best solution. With a database schema of 25 classes, the performance improvement of GA over the SI model ranged from 77% in a 0% update environment down to 10% in a 30% update environment. The performance improvement of GA over the RC model ranged from 0% at 0% updates up to 72% at 30% updates for the same database size.
The time taken by the GA ranged from 0.8 s per iteration with five classes to 3.5 s per iteration with 25 classes. Thus our time-efficient design method can help arrive at a cost-efficient OODB design, resulting in performance improvement in OODBs. In this research we did not consider data access methods such as indexing and hashing. Since that is a research area in itself, our future studies involve further experimentation considering the effect of access-method strategy on the cost-efficient OODB design solution. Furthermore, we present in this research a static model, whereby usage patterns are assumed known in advance and fixed. Since user transactions (retrievals and updates) can change over time, there is a need to develop dynamic models. Wolfson et al. [38] have shown that adaptive algorithms produce better performance than static algorithms in distributed databases. Our future research is aimed at developing a procedure to change the physical structure of an OODB dynamically. Two issues are to be addressed in such a procedure: what incremental modifications should be made to the storage structure, and how frequently such modifications should be made. The first issue involves making incremental changes to the OODB storage structure by comparing present usage patterns with previous usage patterns. Typical modifications include deleting one or more variables that were previously inherited/stored in a subclass (along with the corresponding variables in its subclasses), and/or inheriting/storing an additional variable in a subclass. These operations involve reorganizing the storage structures of the relevant


subclasses. This gives rise to the second issue, i.e., determining the frequency of incremental modification; this requires consideration of the trade-offs between the cost of making incremental changes to the storage structure and the benefit of reduced access cost with the new storage structure. These issues are the subject of our future research efforts.

8. For further reading

[14,19,21,27,31].

Acknowledgements

The author would like to thank the anonymous reviewers for their valuable comments, which improved the paper, and wishes to acknowledge the assistance of X. Li.

References

[1] S. Ahmed, A. Wong, D. Sriram, R. Logcher, Object-oriented database management systems for engineering: a comparison, Journal of Object-Oriented Programming (1992) 27-44.
[2] N. Ballou et al., Coupling an expert system shell with an object-oriented database system, Journal of Object-Oriented Programming (1992) 27-44.
[3] J. Banerjee, H.-T. Chou, J.F. Garza, W. Kim, D. Woelk, N. Ballou, Data model issues for object-oriented applications, ACM Transactions on Office Information Systems 5 (1) (1987).
[4] K. Bennett, M.C. Ferris, Y.E. Ioannidis, A genetic algorithm for database query optimization (1991).
[5] C. Damon, G. Landis, Abstract data types and storage types in an OO-DBMS, in: IEEE Conference, 1988.
[6] W.W. Chu, I.T. Ieong, A transaction-based approach to vertical partitioning for relational database systems, IEEE Transactions on Software Engineering 19 (8) (1993) 804-812.
[7] D.H. Fishman, D. Beech, H.P. Cate, E.C. Chow, T. Connors, J.W. Davis, N. Derrett, C.G. Hoch, W. Kent, P. Lyngbaek, B. Mahbod, M.A. Neimat, T.A. Ryan, M.C. Shan, Iris: an object-oriented database management system, ACM Transactions on Office Information Systems 5 (1) (1987).
[8] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[9] N. Gorla, W. Boe, Database operating efficiency in fragmented databases in mainframe, mini, and microsystem environments, Data and Knowledge Engineering 5 (1990).
[10] R. Gupta, E. Horowitz (Eds.), Object-Oriented Databases with Applications to Case, Networks, and VLSI CAD, Prentice-Hall, Englewood Cliffs, NJ, 1991.
[11] K.A. De Jong, W.M. Spears, Using genetic algorithms to solve NP-complete problems, in: Proceedings of the Third International Conference on Genetic Algorithms, 1989.
[12] K. Higa, M. Morrison, J. Morrison, O.R. Liu Sheng, An object-oriented methodology for knowledge base/database coupling, Communications of the ACM (1992) 99-113.
[13] J.A. Hoffer, D.G. Severance, The use of cluster analysis in physical database design, in: Proceedings of the International Conference on Very Large Databases, 1975.
[14] S.P. Hufnagel, J.C. Browne, Performance properties of vertically partitioned object-oriented systems, IEEE Transactions on Software Engineering 15 (8) (1989).
[15] Y. Ishikawa, H. Kitagawa, N. Ohbo, Evaluation of signature files as set access facilities in OODBs, ACM SIGMOD (1993) 247-256.
[16] J.V. Joseph, S.M. Thatte, C.W. Thompson, D.L. Wells, Object-oriented databases: design and implementation, Proceedings of the IEEE 79 (1) (1991).
[17] M.A. Ketabchi, T. Risch, S. Mathur, J. Chen, Comparative analysis of RDBMS and OODBMS: a case study, in: IEEE International Conference, 1990.
[18] S. Khoshafian, Insight into object-oriented databases, Information and Software Technology 32 (1990) 274-289.
[19] M. Kifer, W. Kim, Y. Sagiv, Querying object-oriented databases, ACM SIGMOD (1992).


[20] W. Kim, Object-oriented databases: definition and research directions, IEEE Transactions on Knowledge and Data Engineering (1990).
[21] W. Kim, Object-oriented database systems: strengths and weaknesses, Journal of Object-Oriented Programming (1991).
[22] W. Kim, K.-C. Kim, A. Dale, Indexing techniques for object-oriented databases, Working paper, TR-87-14, May 1987.
[23] S. Kirkpatrick, C. Gelatt, M. Vecchi, Optimization by simulated annealing, Science 220 (1983).
[24] B. Li, W. Jiang, A novel stochastic optimization algorithm, IEEE Transactions on Systems, Man, and Cybernetics, Part B 30 (1) (2000).
[25] C.C. Low, B.C. Ooi, H. Lu, H-trees: a dynamic associative search index for OODB, ACM SIGMOD (1992).
[26] S.T. March, Techniques for structuring database records, ACM Computing Surveys 15 (1) (1983).
[27] S.T. March, S. Rho, Allocating data and operations to nodes in distributed database design, IEEE Transactions on Knowledge and Data Engineering 7 (2) (1995).
[28] M.S. Oliver, S. Von Solms, A taxonomy for secure object-oriented databases, ACM Transactions on Databases (1994).
[29] S. Navathe, S. Ceri, G. Wiederhold, J. Dou, Vertical partitioning algorithms for database design, ACM Transactions on Database Systems 9 (4) (1984).
[30] J. Orenstein, S. Haradhvala, B. Margulies, D. Sakahara, Query processing in the ObjectStore database system, ACM SIGMOD (1992).
[31] D.D. Straube, M.T. Ozsu, Queries and query processing in object-oriented database systems, ACM Transactions on Information Systems 8 (4) (1990).
[32] D.D. Straube, M.T. Ozsu, Query optimization and execution plan generation in object-oriented data management systems, IEEE Transactions on Knowledge and Data Engineering 7 (2) (1995).
[33] K.Y. Tam, Genetic algorithms, function optimization, and facility layout design, European Journal of Operations Research 63 (2) (1992).
[34] M.M. Tsangaris, J.F. Naughton, On the performance of object clustering techniques, ACM SIGMOD (1992).
[35] K.Y. Whang, A seamless integration in object-oriented database systems, in: Proceedings of the Fifth International Conference on Data Engineering, Los Angeles, February 1989.
[36] M.J. Willshire, How spacy can they get? Space overhead for storage and indexing with object-oriented databases, IEEE Transactions (1991).
[37] M.J. Willshire, H.-J. Kim, Properties of physical storage models for object-oriented databases, in: IEEE Conference on Databases, Parallel Architectures, and Applications, 1990.
[38] O. Wolfson, S. Jajodia, Y. Huang, An adaptive data replication algorithm, ACM Transactions on Database Systems 22 (2) (1997).

Narasimhaiah Gorla is an Associate Professor in the Information Systems and Manufacturing Department at Wayne State University, Detroit, Michigan. He has a Bachelor of Engineering from Sri Venkateswara University, an M.B.A. from the Indian Institute of Management, and a Ph.D. in Management Information Systems from the University of Iowa, Iowa City. Prior to his Ph.D., he held responsible positions in information systems in industry for over nine years. He has published in Data and Knowledge Engineering, The Computer Journal, Information and Management, Information Processing and Management, Information Systems, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Software Engineering, Journal of Systems and Software, and Information and Software Technology.