Information Processing Letters 58 (1996) 123-128
ELSEVIER

Applicability of genetic algorithms to optimal evaluation of path predicates in object-oriented queries

Sang Koo Seo ¹, Yoon Joon Lee *
Department of Computer Science, Korea Advanced Institute of Science and Technology, 371-1, Kuseong-Dong, Yuseong-Ku, Daejeon 305-701, South Korea

Received 24 June 1994; revised 14 September 1995
Communicated by K. Ikeda

Keywords: Algorithms; Databases; Object-oriented databases; Query optimization; Genetic algorithms

* Corresponding author. Email: [email protected].
¹ Present address: Hyundai Electronics Ind., Software R & D Center, 9th fl. Hyundai Jeonja Bldg., 66 Jeokseon-dong, Jongro-ku, Seoul, Korea 110-052. Email: [email protected].
0020-0190/96/$12.00 © 1996 Elsevier Science B.V. All rights reserved
PII S0020-0190(96)00036-1

1. Introduction

Query processing is one of the most important issues in object-oriented database systems. Much research effort has been devoted to the optimization of object-oriented queries [2,8,12,15]. Many authors have pointed out that traditional query optimization techniques should be tailored to object-oriented databases. The evaluation of Boolean expressions in queries is one such area, since object-oriented queries usually involve methods and path expressions in their selection predicates, and these are very costly to evaluate [6,9,12].

In this paper we address the optimization of object-oriented queries containing path predicates. A path predicate is a predicate on a nested attribute (i.e., a path expression) of a class. For example, "u.Manufacturer.President.Name == John" is a path predicate that tests whether a Vehicle object u is manufactured by a Company whose president is "John". A single path predicate can be evaluated in either the forward or the backward direction of the path [7,10]. In the above example the forward traversal starts from an object u of class Vehicle and retrieves the value of its nested attribute Name. The backward traversal begins with the attribute Name and finds the objects of Vehicle which indirectly reference the value "John". The cost of traversing a path forward or backward varies significantly depending on factors such as the existence of nested indexes [2] in the path, the selectivity of the predicate, and the length of the path. Thus multiple path predicates in a query may be evaluated by mixed traversals: some in the forward and others in the backward direction [7,10].

In mixed traversals of path predicates, the backward traversal predicates are evaluated first and their results are intersected (assuming that the predicates are logically ANDed). With these partially qualified objects the remaining predicates are processed sequentially using forward traversals. The challenge is to partition the predicates into two groups (backward traversals and forward traversals) and to order the latter such that the cost of processing all predicates is minimized. In this paper we formulate the objective optimization function based on our
models and assumptions. It is shown that the search space of our optimization problem can be reduced by virtue of a well-known optimal ordering rule for processing forward traversal predicates [5]. The reduced search space, however, is still exponential in the number of predicates in a query. Further, it is not easy to find heuristics that avoid exhaustive searching in our problem. For this reason we apply a randomized search strategy to our problem. We explore the applicability of genetic algorithms (GAs) [4], and show by experiments the feasibility of genetic search strategies for query optimization in object-oriented database systems.

We note that there have been initial efforts to apply GAs to optimization problems in the database area [1,3]. In [1], an encoding method was presented to model arbitrary binary trees using genetic algorithms. The binary trees were meant to represent query graphs for queries with a large number of joins in relational DBMSs. In [3], a machine learning model was proposed to cope with the dynamics of user queries and to help decide on which attributes of a relation indexing is profitable. Both the previous works and ours deal with the application of GAs, but the problem domains are quite different. Although genetic algorithms are regarded as a robust, practical approach for various application domains, not every kind of problem can benefit from the genetic search method [4]. Therefore it is important to formulate a given kind of problem for GAs and to assess their applicability to it.

The rest of the paper is organized as follows. In Section 2 we present models and assumptions on queries and query processing. Section 3 formulates our optimization problem. In Section 4 we present the application of genetic algorithms, including the results of experiments. Finally, Section 5 summarizes the paper.
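The mixed traversal strategy outlined above can be illustrated with a small in-memory sketch. Everything in it is an assumption made for illustration: objects are nested Python dictionaries, a backward traversal is simulated by a lookup in a nested index built in advance, and the names `forward`, `build_nested_index` and `evaluate_mixed`, as well as the `Location` attribute and the sample data, are hypothetical and do not come from the paper.

```python
def forward(obj, path):
    """Forward traversal: follow the path from one object to the nested value."""
    for attr in path:
        obj = obj[attr]
    return obj


def build_nested_index(objects, path):
    """Hypothetical nested index: nested attribute value -> ids of target objects."""
    index = {}
    for oid, obj in objects.items():
        index.setdefault(forward(obj, path), set()).add(oid)
    return index


def evaluate_mixed(objects, backward_preds, forward_preds):
    """Intersect the backward results first, then filter the survivors forward."""
    candidates = set(objects)
    for index, value in backward_preds:
        candidates &= index.get(value, set())
    for path, value in forward_preds:  # processed sequentially, in the given order
        candidates = {oid for oid in candidates
                      if forward(objects[oid], path) == value}
    return candidates


# Toy Vehicle objects (illustrative data only).
vehicles = {
    1: {"Manufacturer": {"President": {"Name": "John"}, "Location": "Seoul"}},
    2: {"Manufacturer": {"President": {"Name": "John"}, "Location": "Daejeon"}},
    3: {"Manufacturer": {"President": {"Name": "Mary"}, "Location": "Seoul"}},
}
name_path = ("Manufacturer", "President", "Name")
location_path = ("Manufacturer", "Location")
name_index = build_nested_index(vehicles, name_path)

# Name == "John" is evaluated backward, Location == "Seoul" forward.
result = evaluate_mixed(vehicles,
                        backward_preds=[(name_index, "John")],
                        forward_preds=[(location_path, "Seoul")])
print(result)
```

An empty `backward_preds` list makes every object a candidate, which corresponds to evaluating all predicates forward.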
2. Models and assumptions
Typical object-oriented queries involve finding objects of a class (called a target class) which satisfy restrictions on its (nested) attribute values. The predicates in a query can be connected by the Boolean
operators AND, OR, and NOT. In this paper we assume that queries are transformed into conjunctive queries, since these are considered to be frequently used; this conforms to the assumptions made in previous works [2,10].

Path traversal costs are estimated in terms of disk page accesses. The backward traversal cost for a predicate is the cost of finding the object IDs of all objects satisfying the predicate. Any kind of nested attribute index can be exploited, whenever possible, to reduce the traversal cost [2,8]. The forward traversal cost for a predicate is the cost of checking whether the predicate is satisfied for a single object in the target class. Due to space constraints, we do not present detailed cost formulas, which can be found in our previous work in a different context [18].

It is assumed that the selectivities of path predicates are known in advance. The selectivity factor of a predicate is the fraction of qualified objects in the target class. It is usually estimated on the basis of default values and system statistics for different types of restrictions (e.g., =, <, IN, ...) as discussed in [17]. Recently, sampling techniques have been proposed for estimating selectivities [14].

Path predicates subsume predicates that restrict simple attributes (i.e., direct attributes of a target class), for which a single object fetch suffices for all forward traversals. For ease of presentation we consider only path predicates on nested attributes throughout the paper. Besides path predicates there may be other kinds of expensive predicates, such as those involving methods, subqueries as in SQL, and join predicates [6,9,13]. These can be incorporated into our framework provided that their costs are correctly estimated.
3. Problem formulation

We have a query of $n$ path predicates for a target class $C$ of cardinality $|C|$. Associated with the $i$th predicate, $1 \le i \le n$, are a backward traversal cost $b_i$, a forward traversal cost $f_i$, and a selectivity factor $r_i$. We choose some predicates and evaluate them backward. With the resulting partially qualified objects, the remaining predicates are processed forward sequentially. The cost formula is defined as follows:

$$\sum_{i \in T_1} b_i \; + \; |C| \prod_{i \in T_1} r_i \left( f_{l_1} + f_{l_2} r_{l_1} + \cdots + f_{l_k} \prod_{j=1}^{k-1} r_{l_j} \right), \qquad (1)$$

where $T_1$ is the set of predicates for backward traversals and $(l_1, l_2, \ldots, l_k)$, $0 \le k \le n$, is an ordered set, say $T_2$, of the remaining predicates for forward traversals. The problem is to partition the predicates into $T_1$ and $T_2$ and to give an ordering to $T_2$ such that the cost formula (1) is minimized.

We explain the cost formula briefly. The cost for $T_1$ is the sum of the $b_i$ of its elements. Considering the large main memory of today's computers and the cascaded intersections of object IDs, the cost of storing intermediate results is assumed to be negligible. The coefficient of the second term is the number of objects of the target class $C$ reduced by the product of the selectivity factors of all predicates in $T_1$. Each forward traversal predicate reduces the cost of the subsequent predicates by its selectivity. For $k$ predicates in $T_2$ there are $k!$ possible orderings in which to process the predicates. Thus, the total number of possible strategies for a query is $\sum_{k=0}^{n} ({}_nC_k \cdot k!)$, i.e. $\sum_{k=0}^{n} {}_nP_k$, which grows very fast with $n$.

Let us first consider the problem of how to order the predicates in $T_2$. We are required to find a permutation such that the processing cost becomes minimal when the predicates are processed in that order. Intuitively, a plausible strategy is to arrange the predicates in increasing order of their $f$ and $r$ values. A theorem described in [5] discloses a more concrete relationship between the $f$ and $r$ values, allowing a total ordering for an optimal evaluation sequence. The theorem is applied as follows: assuming that $k$ predicates are to be processed sequentially, the processing cost is minimized if the predicates are processed in increasing order of $f_i/(1 - r_i)$, $1 \le i \le k$. Note that this property is also adopted in [6] as a ranking measure for ordering expensive predicates, including join predicates. Although the search space is reduced from $\sum_{k=0}^{n} {}_nP_k$ to $\sum_{k=0}^{n} {}_nC_k$ $(= 2^n)$, the problem of partitioning the predicates is not trivial.
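As a rough illustration of cost formula (1) and of the ordering rule of [5], the following sketch evaluates the cost of a given partition and checks, by brute force over all permutations, that ordering the forward predicates by increasing $f_i/(1 - r_i)$ reaches the minimum for one small instance. The function names `mixed_cost` and `rank_order` and all parameter values are hypothetical, chosen only for this example; predicates are indexed from 0.

```python
from itertools import permutations


def mixed_cost(card, backward, forward_order, b, f, r):
    """Cost formula (1): backward set T1 plus an ordered forward list T2."""
    cost = sum(b[i] for i in backward)
    survivors = float(card)
    for i in backward:
        survivors *= r[i]          # objects left after intersecting backward results
    for i in forward_order:
        cost += survivors * f[i]   # each surviving object is checked forward
        survivors *= r[i]          # and only a fraction r_i passes on
    return cost


def rank_order(forward, f, r):
    """Order forward predicates by increasing f_i / (1 - r_i)."""
    return sorted(forward, key=lambda i: f[i] / (1.0 - r[i]))


# Hypothetical parameters for four predicates.
b = [40.0, 90.0, 25.0, 60.0]
f = [12.0, 5.0, 30.0, 8.0]
r = [0.10, 0.50, 0.25, 0.05]
card = 10000

backward = [2]          # T1
forward = [0, 1, 3]     # T2, not yet ordered

ranked = rank_order(forward, f, r)
ranked_cost = mixed_cost(card, backward, ranked, b, f, r)
best_cost = min(mixed_cost(card, backward, p, b, f, r)
                for p in permutations(forward))
print(ranked, ranked_cost, best_cost)
```

The brute-force check is only feasible for small $k$; it is included to make the rule tangible, not as part of the optimization procedure.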
Because the parameters in the cost function (1) are mixed in a complex way, it seems difficult, if not impossible, to find good heuristics that further reduce the search space. This is the main reason why we choose to apply genetic algorithms to our optimization problem. We note that randomized search strategies have recently been drawing attention for their extensibility in query optimization, which helps cope with various kinds of application requirements [11].
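To give a feel for the two search-space sizes discussed in this section, the short sketch below counts the strategies with and without the ordering rule. The function names are hypothetical; `math.perm` and `math.comb` require Python 3.8 or later.

```python
from math import comb, perm


def orderings_space(n):
    """Sum over k of nPk: choose k forward predicates and also order them."""
    return sum(perm(n, k) for k in range(n + 1))


def partitions_space(n):
    """Sum over k of nCk: only the backward/forward partition is chosen."""
    return sum(comb(n, k) for k in range(n + 1))


for n in (5, 10, 15):
    print(n, orderings_space(n), partitions_space(n))
```

The second count equals $2^n$, so the reduced space is still exponential, which is the motivation given above for a randomized search.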
4. Genetic search strategies

4.1. Brief overview of genetic algorithms
Since they were suggested by John Holland in his book, Adaptation in Natural and Artificial Systems (University of Michigan Press, 1975), genetic algorithms have been applied to a wide range of domains such as searching, numerical function optimization, adaptive control system design, and machine-learning problems in artificial intelligence. GAs attempt to solve problems in a fashion similar to the way in which genetic processes seem to operate, i.e., the survival over generations of the genes that fit the environment better.

GAs begin with a randomly (or heuristically, if applicable) selected population of function inputs, called chromosomes. The chromosomes are represented as strings of bits. GAs use the current population of strings to create a new one such that the strings in the new population are, on average, "better" than those in the current population. Three processes, called genetic operators, are repeatedly used to make the transition from one population generation to the next: selection, mating (or crossover) and mutation.

The selection process determines the strings in the current generation that will be used to create the next generation. This is usually done by a biased random selection, i.e., the best strings in the population have the greatest chance of being selected as parents for the next generation. The second step is the mating process, which creates children strings from the selected parents. If the length of each string is $l$, then two random numbers, say $r$ and $s$, are determined with $1 \le r \le s \le l$. For two of the selected parents, bits $r$ through $s$ of the first parent are swapped with bits $r$ through $s$ of the second parent. In this way two new strings are created as children. The final
step is mutation. Using a fixed small mutation probability set at the start of the algorithm, each bit in the new strings is changed ("flipped") with that probability. These three steps are repeated to create each new generation. The process continues until some stopping condition is reached, e.g., a maximum number of generations, an acceptable approximate solution, or any application-specific criterion. Although GAs do not guarantee obtaining the best solution, they may avoid the high cost of optimization. A detailed description of GAs and their techniques can be found in [3,16].

4.2. Applying the GAs

In order to use GAs we need to build an objective function that receives strings of bits as inputs. We first arrange the $n$ predicates according to the optimal ordering rule of [5]. A string of $n$ bits then represents a strategy for processing the $n$ path predicates: predicates whose bits are set to "1" in the string are processed in the backward direction and the remaining ones in the forward direction. The following function is evaluated for each bit string $x$ of length $n$:

$$\sum_{i=1}^{n} b_i\, x[i] \; + \; |C| \prod_{i=1}^{n} r_i^{x[i]} \left( f_1 (1 - x[1]) + f_2 (1 - x[2])\, r_1^{(1 - x[1])} + \cdots + f_n (1 - x[n]) \prod_{j=1}^{n-1} r_j^{(1 - x[j])} \right). \qquad (2)$$
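A minimal sketch of formula (2) as a function over bit strings is given below, assuming the predicates have already been arranged by the ordering rule of [5]. The function name `objective` and the parameter values are hypothetical. The final assertion compares the result for the bit string (1, 0, 0, 1, 0) with a direct evaluation of formula (1) for $T_1 = \{1, 4\}$ and forward order (2, 3, 5).

```python
import math


def objective(x, card, b, f, r):
    """Formula (2): x[i] == 1 means predicate i is processed backward.

    Predicates are assumed to be pre-sorted by f_i / (1 - r_i), so the
    forward predicates are applied simply in index order.
    """
    backward_cost = sum(bi for bi, xi in zip(b, x) if xi == 1)
    survivors = float(card)
    for ri, xi in zip(r, x):
        if xi == 1:
            survivors *= ri            # |C| * prod r_i ** x[i]
    forward_cost = 0.0
    carried = 1.0                      # prod of r_j ** (1 - x[j]) over earlier j
    for fi, ri, xi in zip(f, r, x):
        if xi == 0:
            forward_cost += fi * carried
            carried *= ri
    return backward_cost + survivors * forward_cost


# Hypothetical parameters for five predicates, sorted by f_i / (1 - r_i).
b = [40.0, 90.0, 25.0, 60.0, 15.0]
f = [5.0, 10.0, 12.0, 20.0, 30.0]
r = [0.5, 0.2, 0.4, 0.1, 0.3]
card = 10000

x = (1, 0, 0, 1, 0)
via_formula_2 = objective(x, card, b, f, r)
# Formula (1) written out by hand (list indices are 0-based here).
via_formula_1 = (b[0] + b[3]
                 + card * r[0] * r[3] * (f[1] + f[2] * r[1] + f[4] * r[1] * r[2]))
assert math.isclose(via_formula_1, via_formula_2)
```

Because a bit set to 1 contributes an exponent of 0 to the forward product, backward predicates drop out of the forward chain without any special casing, which is what makes the fixed-length bit string a convenient encoding.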
As an example, suppose that five predicates are ordered accordingly. If the bit string is (1, 0, 0, 1, 0), then only the 1st and 4th predicates are processed backward. It can easily be seen that formula (2) applied to this bit string coincides with function (1) with $n = 5$ and $k = 3$, where the predicate subscripts $(l_1, l_2, l_3)$ equal (2, 3, 5).

4.3. Experiment

We have conducted an experiment to validate the applicability of the genetic search strategy to our problem. A GA package called GAucsd, from UC San Diego [16], is used under SunOS 4.1.1 on a Sun SPARC 4/280. We have also implemented an exhaustive optimal searching algorithm and compared
the performance. The queries tested are conjuncts of up to 16 predicates, each of which is assigned a forward traversal cost, a backward traversal cost, and a selectivity factor. The limit on the number of predicates is due to the inability of the exhaustive searching algorithm to handle a larger number of predicates, primarily because of its huge memory requirement. The values of the parameters associated with the queries are generated randomly within the following ranges:
- forward traversal costs: 5 - 30
- backward traversal costs: 10 - 100
- selectivity factors: 5% - 50%
- cardinality of a target class: 10000 - 15000

The forward traversal cost of a predicate mainly depends on the length of the path expression and the existence of set-valued attributes in the path. Thus, a smaller value accounts for a shorter path expression without set-valued attributes, and a larger value for the opposite. The backward traversal cost is affected by the existence of nested attribute indexes in the path, the cardinalities and sharing degrees of the classes in the path, and the length of the path. Note that, since the forward traversal cost is meant for a single object in the target class, its value range is set smaller than that of the backward traversal costs. The random mixture of these values simulates various situations in path predicates.

In randomized search strategies it is occasionally proposed that algorithms be executed multiple times on a given problem instance [1,16]. In our experiments each query is run five times and the average of the five runs is used as the performance measure of the GA. Each run is set to terminate if three generations have passed without creating a new chromosome, if the bits at each position of the chromosomes in a population converge beyond a threshold (70%), or if a maximum number of trials (i.e., function evaluations) per run (200 in our experiment) is reached.

Some typical results are presented in the figures. Fig. 1 shows the performance of the GA compared with the exhaustive algorithm for varying population sizes (i.e., the number of bit strings in a generation) when the number of predicates is 15. The x-axis is the number of generations, and the y-axis represents the ratio of the average of the GA's best solutions over all five runs to the optimal solution. The figure demonstrates that our objective formula is well suited to the GA's behavior. In the initial
generations the deviation from the optimal value ranged from 55% to 65%. As generations proceed, however, the deviation decreases for each population size and begins to converge at around the 15th to 25th generation. A smaller population converges faster at the expense of a larger deviation from the optimum. We also see clearly that a larger population size yields better performance (within 2% of the optimum), which coincides with the general observations on GAs [1,4,16]. It should be noted that in most queries the optimal solution was found by the GA in at least one run. This has an important implication for the GA's effectiveness on our problem, since the GA evaluates only a small fraction of all possible combinations. That is, the GA evaluates at most 5 (runs) × 200 (evaluations) = 1000 bit strings, which is less than 5% of the $2^{15}$ = 32768 possible bit strings.

We compare the execution time of the GA with that of the exhaustive algorithm in Fig. 2. The query size is the number of predicates in a query. For a given query size, five different queries were tested and the average elapsed time was measured. Note that the time for the GA is the sum over all five runs. The GA begins to run faster when the query size is larger than 10. While the increase in time for the GA is almost linear, the time taken by the exhaustive algorithm increases exponentially. For larger queries (query size > 16) the time for the exhaustive algorithm would be intolerably high.

We want to make a comment regarding the optimization costs and the gap between the optimal values and the average of the best values of the GA. In many tests the optimal values range around 100-300 disk accesses.
Fig. 2. Comparison of optimization times (pop. size = 15, max. trials = 200, no. of runs = 5). [x-axis: query size; curves: exhaustive, GA]
Assuming today's disk access time to be 15 milliseconds, the 2-15% deviation of the GA seen in Fig. 1 can be sufficiently offset by its reduced optimization effort for queries of more than 10 predicates. We note that queries with a large number of predicates are frequently needed in new database application domains [1,6,9,11].
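The genetic search of Sections 4.1 to 4.3 can be sketched in a few dozen lines. This is not the GAucsd package used in the experiment and makes no claim to reproduce its behavior or the reported figures. It is a minimal illustration under stated assumptions: binary tournament selection stands in for the biased random selection, the only stopping criterion is the maximum number of function evaluations, and the mutation probability, the random seed and all function names are hypothetical. Parameter ranges follow those listed in Section 4.3.

```python
import random
from itertools import product


def objective(x, card, b, f, r):
    """Formula (2); predicates are assumed pre-sorted by f_i / (1 - r_i)."""
    cost, survivors = 0.0, float(card)
    for bi, ri, xi in zip(b, r, x):
        if xi:                       # backward predicate
            cost += bi
            survivors *= ri
    for fi, ri, xi in zip(f, r, x):
        if not xi:                   # forward predicate, applied in index order
            cost += survivors * fi
            survivors *= ri
    return cost


def random_instance(n, rng):
    """Random query with the parameter ranges of Section 4.3."""
    preds = [(rng.uniform(5, 30), rng.uniform(10, 100), rng.uniform(0.05, 0.5))
             for _ in range(n)]                      # (f, b, r) per predicate
    preds.sort(key=lambda p: p[0] / (1.0 - p[2]))    # ordering rule of [5]
    f, b, r = (list(col) for col in zip(*preds))
    return rng.randint(10000, 15000), b, f, r


def genetic_search(card, b, f, r, rng, pop_size=15, max_evals=200, p_mut=0.02):
    n = len(b)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best, best_cost, evals = None, float("inf"), 0
    while evals + pop_size <= max_evals:
        costs = [objective(x, card, b, f, r) for x in pop]
        evals += pop_size
        for x, c in zip(pop, costs):
            if c < best_cost:
                best, best_cost = x[:], c

        def pick():
            # Assumed selection scheme: binary tournament on the cached costs.
            i, j = rng.sample(range(pop_size), 2)
            return pop[i] if costs[i] <= costs[j] else pop[j]

        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            lo, hi = sorted(rng.sample(range(n + 1), 2))   # two cut points
            for child in (p1[:lo] + p2[lo:hi] + p1[hi:],
                          p2[:lo] + p1[lo:hi] + p2[hi:]):
                # Mutation: flip each bit with probability p_mut.
                children.append([bit ^ 1 if rng.random() < p_mut else bit
                                 for bit in child])
        pop = children[:pop_size]
    return best, best_cost


def exhaustive_search(card, b, f, r):
    """Evaluate all 2**n bit strings and return (cost, bit string) of the best."""
    return min((objective(x, card, b, f, r), x)
               for x in product((0, 1), repeat=len(b)))


rng = random.Random(1996)                 # arbitrary fixed seed
card, b, f, r = random_instance(12, rng)
optimum, _ = exhaustive_search(card, b, f, r)
runs = [genetic_search(card, b, f, r, rng)[1] for _ in range(5)]
ratio = (sum(runs) / len(runs)) / optimum
print(f"average GA best / optimum = {ratio:.3f}")
```

The printed ratio corresponds to the quantity plotted on the y-axis of Fig. 1, although the values obtained from this sketch depend on the seed and on the simplifications listed above.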
5. Summary

In this paper we have explored the problem of optimal evaluation of path traversal predicates in object-oriented queries. We formulated the problem and applied genetic search strategies to our query optimization problem. The experimental results show that our problem formulation is well suited to the GA's behavior. The GA worked efficiently in terms of both solution quality and execution time, showing its feasibility for use in optimization problems of database systems. As further work, we plan to apply the GA approach to the optimal index configuration problem in object-oriented databases [18]. It would also be an interesting research direction to formalize path predicate queries with relational expressions, especially for object/relational and object wrapper-based DBMSs.
Fig. 1. Average of best values over five runs. [x-axis: no. of generations; y-axis: ratio of the average of the GA's best values to the optimal value]

References
[1] K. Bennett, M.C. Ferris and Y.E. Ioannidis, A genetic algorithm for database query optimization, Computer Sciences Tech. Rept. TR1004, University of Wisconsin-Madison, 1991.
[2] E. Bertino, Optimization of queries using nested indices (Extended Data Base Technologies, 1990).
[3] F. Fotouhi and C.E. Galarce, Genetic algorithms and the search for optimal database index selection, in: Lecture Notes in Computer Science 507 (Springer, Berlin, 1991).
[4] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA, 1989).
[5] M.Z. Hanani, An optimal evaluation of boolean expressions in an online query system, Comm. ACM 20 (1977).
[6] J.M. Hellerstein and M. Stonebraker, Predicate migration: Optimizing queries with expensive predicates, in: Proc. ACM SIGMOD Conf. (1993).
[7] B.P. Jenq, D. Woelk, W. Kim and W.L. Lee, Query processing in distributed ORION (Extended Data Base Technologies, 1990).
[8] A. Kemper and G. Moerkotte, Advanced query processing in object bases, in: Proc. VLDB Conf. (1990).
[9] A. Kemper, G. Moerkotte and M. Steinbrunn, Optimizing Boolean expressions in object bases, in: Proc. VLDB Conf. (1992).
[10] K.C. Kim et al., Acyclic query processing in object-oriented databases, in: Proc. 7th Internat. Conf. on E-R Approach (1989).
[11] R.S.G. Lanzelotte and P. Valduriez, Extending the search strategy in a query optimizer, in: Proc. VLDB Conf. (1991).
[12] R.S.G. Lanzelotte, P. Valduriez and M. Zaït, Optimization of object-oriented recursive queries using cost-controlled strategies, in: Proc. ACM SIGMOD Conf. (1992).
[13] D.F. Lieuwen and D.J. DeWitt, A transformation-based approach to optimizing loops in database programming languages, in: Proc. ACM SIGMOD Conf. (1992).
[14] R.J. Lipton, J.F. Naughton and D.A. Schneider, Practical selectivity estimation through adaptive sampling, in: Proc. ACM SIGMOD Conf. (1990).
[15] J. Orenstein et al., Query processing in the ObjectStore database system, in: Proc. ACM SIGMOD Conf. (1992).
[16] N.N. Schraudolph and J.J. Grefenstette, A User's Guide to GAucsd 1.4, Tech. Rept. CS92-249, UC San Diego, 1992.
[17] P.G. Selinger et al., Access path selection in a relational database management system, in: Proc. ACM SIGMOD Conf. (1979).
[18] S.K. Seo and Y.J. Lee, Optimal configuration of nested attribute indexes in object-oriented databases, in: Proc. EUROMICRO Conf. (1994).