An Optimal Access Path Handling Strategy in a Database Environment Yuksel Uckan Miami University, Oxford, Ohio
William D. Haseman University of Wisconsin-Milwaukee
This paper presents an efficient access path storage and manipulation facility in a relational database management system. It includes an access path storage structure and an artificial intelligence which enables the DBMS to generate, store, and maintain paths which are more likely to be needed by random query requests. A model is developed for simulation, and several path delete strategies are simulated. Path delete strategies aiming for least recently used and least frequently used access paths are shown to require minimal DBA interference and to result in lowest average query processing cost. The proposed path handling strategy is sufficiently general to permit application in an arbitrary database system.
INTRODUCTION A problem of major importance in a database environ-
ment is that of secondary key retrieval, i.e., fetching tuples of database relations based on a predicate which includes a subset of nonkey domains of a given relation. There are quite a few techniques suggested for the solution of this problem, some applicable under special circumstances [ 1,2]. However, the most widely used technique for a nonassociative computing environment still remains that of creating inverted list indexes and using such indexes for the computation of access paths which correspond to the predicates for conditional data retrieval [ 31. There are difficulties to this approach, especially compounded in a corporate database in which a sub-
Address correspondence io Dr. Yuksel Uckan. Associate Professor. Systems Analysis Departmeni, Miami University. Oxford, Ohio 45056.
stantial number of entity attributes are normally nonkey. Inverting the entire database for all possible nonkey fields and field values, and maintaining list indexes as part of the database are obviously too costly a strategy; for a large database, it is definitely infeasible. Besides, some of the index tables may not at all be popular and may hardly, if ever, be requested by the database users to justify their cost. On the other hand, some low demand list index structures may experience a sudden gain in popularity, in view of dynamically changing demands on the database. The database administrator should find it no easy task to accurately anticipate high demand indexes and maintain only those in the database until future changes in database use pattern necessitate further modifications in the database structure. The significance of inverted database structures and their optimal performance have been recognized by many database researchers [4,5]. The problem of selecting the right access paths has also been investigated [6]. The present investigation further explores the same problem and presents the design of an access path handling strategy within the context of a relational database management system; however, the proposed design is sufficiently general to be easily applicable in any corporate database system. The relational database management system on which the present investigation is based is INTREL/ DBMS (for Intelligent Relational) [ 7,8]. INTREL/ DBMS is a relational experimental database management system which is partially implemented on a mainframe computer system. It supports a user-friendly, English-like, and relationally complete query language; a comprehensive data manipulation language; and a 231
The Journal of Systems and Software 6, 237-249 0 Elsevier Science Publishing Co., Inc., 1986
(1986)
0164-1212/86/%3.50
Y. Uckan and W. D. Haseman schema description and manipulation language [9]. This paper is one of the products of a continuing research effort related to the design and implementation of the INTREL/DBMS. INTREL/DBMS has a uniform design in the sense that it includes ten system control structures, each introduced into its design to handle a specific task, but all conceived and implemented as relations. One of these data structures is for the purpose of storing primitive access paths, which are inverted list indexes corresponding to attribute-value pairs (i.e., simple queries) for invertible nonkey fields. The access path relation (APR) is a fixed-size storage structure; its size and initial content can be determined by the DBA. In order to store and maintain in APR only those paths which are most likely to be needed by a random conditional query request, an artificial intelligence has been integrated into the INTREL/DBMS design. The system is capable of generating primitive access paths which are demanded by a random conditional query request, adding them to the current content of the APR, and permitting future events to make use of those paths available in the APR. This leads to a cost-shared setup, thereby reducing substantially the average cost of conditional query processing. After the APR becomes full, the system automatically selects for deletion a subset of the primitive access paths and replaces them with the new paths defined by the current query request. Seven different path delete strategies are proposed for the APR. The central purpose of this investigation is to determine the “best” delete strategy, which will provide lowest average query processing cost over a longest period of database use, such that the need for the DBA to interfere and rejuvenate the system is kept at a minimum. We are interested in a delete model which will behave optimally for any arbitrary relational database. A simulation model has been developed approximating the secondary key retrieval aspect of the INTREL/DBMS. In order to study the behavior of the APR under controllable conditions and to determine a generally optimal path delete strategy, the model has been simulated extensively. The results are reported in Section 5 of the paper. Section 1 of this paper describes the data structures in the INTREL/DBMS that are relevant to the theme of this research and analytically investigates the efficiency and cost of conditional retrieval. In Section 2, we present the main features of the INTREL/DBMS design for conditional query processing. Section 3 formulates several alternatives for path deletion in the APR. Section 4 discusses the model for simulation and the basic steps of the simulation process. Finally, Section 5 is devoted to an analysis of the simulation results.
1. DATA STRUCTURES FOR CONDITIONAL RETRIEVAL
INTREL database management system is a relational DBMS with uniform design in that its design includes ten data structures each conceived and implemented as a relation. Two of these data structures are directly relevant to the conditional retrieval capability of INTREL, and therefore to the theme of this research. They are the data structure for database relations proper, and the one to store access paths, the Access Path Relation (APR). Figure 1 shows the physical implementation for the database relations. It is actually a direct file whose cells may be accessed by relative cell address. The file consists of a prime data area and a preceding index area. Each cell of the prime area is used to store a tuple of the relation. On the other hand, the index area tuples associate relative addresses of prime area tuples with their key attribute values and also maintain a sequence number field to define an ascending sequence of the relation tuples by key. This design facilitates unconditional retrieval, tuple retrieval by key, and conditional retrieval. It also makes it possible to generate work area relations following retrieval, already sorted by the key field. Each relation in the corporate database is implemented using this basic structure. It should be noted that the prime area cell capacity and the index area blocking factor are determined by the tuple length of the relation, and that the index area tuple length is fixed by design assumptions pertaining to the three domains in its tuple structure. Figure 2 is the physical implementation for the Access Path Relation. It is a special relation conceived to store inverted lists based on any nonkey domain of any data base relation. Only one APR is defined for an integrated relational database of an arbitrary number of data relations. Similar to the previous structure, APR comprises an index area and a prime area. The prime area tuple structure is shown in Figure 2. Each prime area tuple, and therefore each access path, is uniquely identified by an attribute name, attribute value pair. It is assumed that the name of each nonkey domain in the database is unique to a single relation, thus no relation reference is maintained in APR. PATH-TYPE domain in the APR prime area tuple structure may have only two values: B for access paths expressed as a binary string, and R for access paths in terms of a string of relative addresses. INVERSIONTYPE may be C if the implied relation has been completely inverted with respect to the attribute, or P if only a proper subset of inverted lists are currently stored in APR. The term “event” in CREATION-EVENT-NO and
Access Path Handling
239
in a Database System INDEX AREA TUPLE STRUCTURE key-attribute-value ~_-
sequence-no
I t
--
b---
relative-addr
/---
--
1 I
1 2
INDEX AREA
Figure 1. Data structure for database relations.
n. 1 ni+l
I
ni+2
I
j
.
I
f
j -i
I PRI?t?3 DhTA AREA
1
i ni+n
P
i 1 PRIME AREA TUPLE STRUCTURE attribute-l
refers to the sequence number LAST-EVENT-USED assigned by the INTREL/DBMS to a conditional query after the DBA resets the system to zero. TIMESUSED is in terms of the number of conditional query requests which used the available access path in the past. COST is the creation and maintenance cost of the access path as computed by the DBMS. INVESTMENT-RETURN-PERIOD is the number of uses for an access path before its cost becomes zero; it is decided upon by the DBA, and in turn, determines the UNRETURNED-COST for an access path. Finally, ACCESS-PATH is either a binary string or a relative address string specification, whichever is shorter, of the inverted list corresponding to the identifying attribute name, attribute value pair in the tuple. The index area tuple structure for APR simply associates the attribute name with the start address of the cell block in which tuples with the attribute name under consideration reside. Again, the blocking factor in the index area, the index area tuple length, and the prime area tuple length for APR can be fixed on the basis of a set of design assumptions, all under the control of the DBA.
. . . . . * I
The query formulation sublanguage of the INTREL/DBMS permits conditional retrieval from a given relation, r, using a complex or simple predicate, p, and specifying a set of target variables, a, as follows: Get tuples with attributes a of relation r satisfying condition p INTREL/DBMS decomposes the complex predicate p into its constituent attribute-value pairs, accesses APR-index using attribute names, fetches the access paths from the APR-prime area, and generates incore a new access path to correspond to the predicate p. Then, it uses this access path to retrieve the qualifying tuples from the prime area of the direct-access file for the relation r, after exhaustively searching r-index. The time required to do that is:
(11
240
Y. Uckan and W. D. Haseman INDEX ARRA TUPLE STRUCTURE attribute-name
start-address
INDEX AREA
PRI!+lE AREA
I I
PRIME AREA TUPLE STRUCTURE attributename
attributevalue
path-type
\
inversion-
type
\
creationevent-no
\ I \
last-event
times_used
cost
investmentreturn-period
Figure 2. Data structure for access path relation.
where time to search APR-index; TAPR-index: time to directly access an APR-prime tuple; TAPR-prime: Tr-index:time to exhaustively search r-index; Tr-prime:time to directly access one r-prime tuple;
unreturnedcost
accesspath
Approximating the timing terms [IO] and ignoring the computational overhead, we obtain:
~=g&]$+(f+2~)s~ 1
a’+1
b
+ 2 a 1 iLf/L{j 1 t’
(2)
+ f (3a + q + c) $
where L,: r-prime tuple length in bytes;
TWA:time to write one work-area tuple;
Li: r-index tuple length;
TAPR_prime: time to search the ith block of APR-prime area;
L,: APR-prime tuple length;
CO: computational overhead; a: number of attributes in the predicate;
LWA:work area tuple length; n,: number of prime tuples in r;
s: number of tuples in r satisfying the predicate.
t’: bulk transfer rate for direct-access storage device;
Li: APR-index tuple length;
Access Path Handling in a Database System a’: number of APR-index tuples (i.e., number of attribute definition blocks in APR); c: total number of APR-prime tuples in blocks defined by the a attributes in the predicate; q: total number of APR-prime tuples to be used by the predicate. Figure 3 is a plot of T vs. np for fixed values of the parameters listed above, and different values of LL corresponding to specific ranges of np. Line 1 in Figure 3 corresponds to the T-n, variation for a less sophisticated implementation in which APR is nonexistent and all database relations are actually sequential files. Clearly, the INTREL/DBMS handles conditional retrieval requests much more efficiently. Furthermore, for normal direct-access bulk transfer rates (e.g., t’ = 800 Kbyte/ set), the increase in the cost of processing a relation for small variations in np is negligibly small. Therefore, one may safely state that, the cost of conditional query processing in the INTREL/DBMS environment is almost constant for reasonably stable database relations.
241 cessor, into a finite number of attribute-value pairs. The constituent attribute-value pairs for a given relation are dependent on the predicate and the instance of the relation cited in the statement. The Access Path Relation stores in a fixed size storage area, a limited number of primitive access paths, each corresponding to an attribute-value pair. The INTREL/DBMS, having identified the required simple access paths, accesses APR and fetches the stored paths. They are then combined into a single access path for the predicate. The resultant access path is used to extract the selection in the work area. A Boolean predicate is first decomposed into range queries. We arbitrarily classify range queries in Table 1. A represents an attribute of the given relation, and V is an element of its corresponding domain. APR is designed to store only Type 5 query access paths. All other range queries identify access paths which are con-
Table 1. Classification
of Range Queries Range query
Type
2. DESIGN OF DBMS FOR CONDITIONAL RETRIEVAL
Relevant to the theme of the current investigation, the INTREL/DBMS has the following design characteristics: a conditional retrieval statement may be used to specify a selection (followed by a projection) for any one database relation. The predicate in such a statement is, in general, a Boolean predicate which is automatically decomposed by the INTREL language pro-
T.t'.lO
A
1 2 3 4 5 6
Figure 3. Cost of DBMS.
AjV A>V A>V A=V A#V
conditional query processing in INTREL/
-5
/ L
“1
L)
= 30
30
s
= 10
LWA”‘~=“’ a = 3
4
=
a’
c
= 12
7
= 20
’
Y. Uckan and W. D. Haseman
242 junctions of a finite number of access paths corresponding to Type 5 queries. In case none or only some of the needed paths are found to exist in APR, the INTREL/ DBMS sequentially processes the database relation and generates and stores all required access paths in APR. Such access paths are used by the present query as well as all future queries which may require them, as long as the access paths are kept in APR. Evidently, a database user without fully intending to do so, may cause, through conditiona retrieval requests, certain access paths to be created and added to APR, provided APR has space to accommodate them. The INTREL/DML provides an INVERT statement, primarily for the database administrator’s convenience, whose only function is to create and store Type 5 access paths in APR. Thus, the DBA may rejuvenate the system any time using a DELETE and INVERT pair of DML statements. Alternatively, the DBA may request certain access paths to be added to the system through an INVERT statement. A set of new access paths are unconditionally accommodated in APR in case there is sufficient storage in it. If this is not the case, then enough paths must be deleted to make room for the new ones. The paths to be deleted are to be selected such that not only the paths that are most likely to be used are retained in APR, but also the average cost of query processing following this update is kept low. In the following analysis regarding the average cost of conditional query processing, two assumptions are made: Cost of full inve~ion of a database relation with respect to an attribute is almost the same as that of partial inversion of the same relation for a range query. As inversion is an I/O-bound operation, and as it requires exhaustive sequential processing, whether full or partial, this assumption is justified. Cost of processing a conditional query (i.e., cost of inversion) is almost independent of the cardinality of the relation to be inverted within a fixed time period. For the INTREL/DBMS, the analysis presented in the previous section justifies this assumption. Seven different automatic access path deletion strategies are suggested for the INTREL/DBMS. The following section formuIates delete strategies. 3. ACCESS PATH DELETION STRATEGIES
The proposed seven access path deletion strategies are based on four different approaches: 1. Delete enough paths whose unreturned cost is down to zero.
2. Delete those paths for which unreturned costs are
lowest. 3. Delete the least recently used paths. 4. Delete the least frequently used access paths. We define E’ = {dietilt
. e’,j}
(3)
as the set of mj access paths existing in APR at event j (an “event” being defined by a user’s conditional retrieval request), Iy = {d’i,d)?,. . , d{j}
(4)
as the set of nj paths defined by a conditional retrieval statement, X’ =
(xl;,x:, . . . , Xii)
(5)
as the set of p’ paths requested by event j and existing in APR, and Y’ = (y{,yJ?*. .
. 9 y’,j)
(6)
as the set of qj paths requested by event j and nonexistent in APR. Clearly, X’ = E’ n D’, yj = DJ - EJ, 9J = n’ - p.
(7) (8) (9)
We introduce another set A:[ = {Ai, A’, . . . * Xi,j)
where j - b,, if j - bi > 0 otherwise 0,
(11)
is the “life” of path i at event j, and bi is the birth event for path i. Then, we define “extrapolated-times-used” and “normalized-times-used” for event i as follows: extra~lated-times-used~ = (times-used:. max &)/X_[,rounded normalized-times-usedj = .extrapolated-times-used:, I0
(12) (13)
if A: > pi if A{< pti
where pi is the maturity period for path i during which the path may not be deleted. Finally, we denote the capacity of APR by c and the available space in APR at event j by ariand note that ifc2mJ otherwise
(14)
243
Access Path Handling in a Database System Now, three linear lists which we call delete-order lists (DOL) are introduced: Least Unreturned Cost DOL: LUC/DOL :: = (unreturned-cost) 11 (last-event used) /( (extrapolated-times-used) Least Recently Used DOL: LRU/DOL :: = (last-event-used) (1 (unreturned-cost} I\ (extrapolated-times-used) Least Frequently Used DOL: LFU/DOL :: = (normaIized-times-used} I/ (unreturned~ost} /1 (last-event-used)
(1%
(16)
(171
Note that all list definitions contain fields which are either the attribute values of the APR prime area tuples or may be computed using such attribute values, as in Eqs. (1 l)-(13). The proposed automatic path delete strategies are formulated below:
Strategy t
Determine D’, Xi, Yj, ti, and at’. Structure LUC/DOL of mj paths in APR and determine the first mz’ entries of LUC/DOL for which unreturned-cost is zero. If qj - cuiI mzj (i.e., enough paths with zero unreturned cost can be found in APR), then delete the first qj - c$ entries of LUC/DOL from APR and add qJ new paths to APR. If qj - n’ > mz’, then do not permit any deletions. Consider this an isolated event; create ail the required access paths, use them to process the query, charge the user the total cost of inversion, and release the created access paths.
Strategy II
1. Determine D’, X, Y’, A’, and d. 2. Structure LUC/DOL of mJ entries. 3. Delete the first q - at’ entries of LUC/DOL from APR and add qJ new paths to APR. 4. Determine the sum of unreturned costs of qj - or’ deleted paths. 5. Divide this sum by q’. 6. Add this extra cost to the creation cost per path for qj new paths.
Strategy III 1-4. Same as in Strategy II.
5. Divide this sum by mj+‘. 6. Add this extra cost to the creation cost of each path in the set Ej+‘.
Strategy IV 1. Determine D, Xj, Yj, nj, and aj. 2. Structure LRU/DOL of mj entries. 3. Delete the first qJ - ol’ entries of LRU/DOL from APR and add qj new paths to APR. 4. Determine the sum of unreturned costs of qj - ($ deleted paths. 5. Divide this sum by 4. 6. Add this extra cost to the creation cost per path for qj new paths.
Strategy V
1-4. Same as in Strategy IV. 5. Divide this sum by mj”. 6. Add this extra cost to the creation cost of each path in the set E’+‘.
Strategy VI 1. Determine Dj, x’, Yj, il’, and aj. 2. Structure LFU/DOL of m’ entries. 3. Delete the first qJ - aj entries of LFU/DOL from APR and add q new paths to APR. 4. Determine the sum of unreturned costs of qJ - a” deleted paths. 5. Divide this sum by 4. 6. Add this extra cost to the creation cost per path for qJ new paths.
Strategy VII Same as in Strategy VI. 5. Divide this sum by mJi’. 6. Add this extra cost to the creation cost of each path in the set Ej+‘.
l-4.
The main issue in the present investigation is to determine which of the above seven delete strategies is the “best” in general. It is natural to expect that the DBA would initially fill APR with a maximum number of access paths of highest user demand. On the basis of the past usage in a corporate environment, the DBA can easily determine which access paths are most likely to be in high demand. An analytical formulation to determine high demand access paths will be presented in the following section. Following a favorable initial loading, the database will be exposed to a series of query processing demands which will randomly specify access paths and consequently cause a deterioration of database performance. One of the major controllable factors which influence the content of the APR and, therefore, the overall da-
Y. Uckan and W. D. Haseman tabase performance, is the chosen delete strategy. We define the best delete strategy as the one which keeps the average cost of conditional query processing lowest for a maximum number of events. Another possible measure of performance, namely, the number of inversions required for a fixed number of events, obviously correlates to the chosen one. In order to determine the best path delete strategy, we need to compute average cost of query processing for a sufficiently large number of random events. An analytical solution is clearly not possible. Therefore, we have attempted a computer simulation of the model. The model and the simulation technique used are presented in the next section.
(1% ki=,
where nfi
w’= 5 j!(n;”
The controllable variables for the model are the capacity of the APR, the access paths initially loaded into the APR, the investment return period for access paths, q, the path delete strategies, and the maturity period, CL, for access paths (for Strategies VI and VII.) The dependent variable is the average cost of query processing. Evidently, the larger the capacity of the APR, the smaller will be the average cost of query processing. Intuitively it is obvious that, if the investment return period is long, then the cost of using a random path will be low, and hence, overall cost of query processing will be positively affected. Similarly, excessively large values for p adversely effects database performance. The effects of n and p on the cost function are verified by the results of the simulation. It is important to choose a subset of access paths which are most likely to be requested by the users of the database, and to load APR with this subset. On the basis of past utilization, the DBA may easily compute two classes of probabilities. These are: :probability that database relation R, is
requested in a conditional query p,(Y”,J :probability that the predicate of a conditional query is based on the k,th distinct combination of the nonkey fields of the relation Ri,
Here, nr and nfi, respectively, denote the number of relations in the database, and the number of nonkey fields in the relation Ri. Given p2, it is a simple matter to compute a third class of probabilities,
If we assume uniform probability distributions for the type of range queries comprising the predicate and the field value appearing in a given range query, then we can prove that p(Fi.1= Vi,J = f . P,(RJ * PdFi.J* for all k = 1, . , nv ,,,,
and nvij is the total number of discrete values for the jth field of the ith database relation. The above assumptions are alternately expressible as follows: for all fields in the database, any admissible values are equally likely to appear in a range query, and a given range query is equally likely to be of any one type out of six possible types. Equation (21) can easily be modified in case a database environment suggests nonuniform distributions. Example. Consider a database of three relations, R,, Rz, and RJ, and let R, have three nonkey fields F,,,, F,.,, and F,,. Given the probabilities p, and p2 as Relation R,
where ki can be regarded as a nfidigit binary number (4, . . . dj - * d,d,), in which dj = 1 implies that jth
RI
field of Ri appears in the combination. For example, for a relation with three fields, F,, Fz and F,, 36 is the following subset:
p,(h)
Field combination
0.60 0.30 0.10
F, F2
F3 F,, Fz F,, F3 F,. F, F,, F,, F,
We can easily compute p3 from p2 as follows:
3, = 9, ,o = {F,, F,}.
Field
It should be noted that
F,
nr
dRJ = 1
(21)
where p(Fij = Vij,k)represents the probability that the path Fij = Vi.j.k is requested by the conditional query,
R2
c
(20)
j)!
p3(Fid) :probability that jth field of the ith relation appears in the predicate of a conditional query either alone or with other fields.
4. THE MODEL AND ITS SIMULATION
p,(R,)
and
(18)
F, F,
P~(FI.J 0.43 0.50 0.45
PAW
0.20 0.25 0.20 0.10 0.10 0.12 0.03
245
Access Path Handling in a Database System Hence, the probability that access paths based on, say F,, will be requested, for all such access paths will be be p(F,,, = v,,,,~) =
0.129.
Having thus chosen the most favorable paths, the DBA may fill up the APR with these, and may occasionally rejuvenate the system on the basis of usage statistics accumulated so far. In our simulations, we always began with a favorable state and, using the probabilities inherent in the initial state, carried out the computations. The basic steps of the simulation are briefly presented below. Step 1. (Which database relation?) Generate a random integer i in the range 1 li
using a probability distribution p,(R). Step 2. (How many fields of Ri appear in the predicate?) Generate a random integer in the range 1 5 n 5
Type of range query 0)
Access paths (F,,, = Vi,,.J
1
1 IS s 5
2
1 5 s 5 ki,,q
3 4 5 6
k,,,, -C s 5 wJq ki.,q 5 s 5 qlq s=k ‘.N 1 5 s < ki,,4Tk,,,, < s 5 nvi,,q
k,,,,
program consists of: 1. Database description, in terms of number of relations, number of nonkey fields for each relation, and number of values for each field of each relation. 2. Probability distributions p,(R,), and p,(Fi, j) for all relations. 3. Capacity of APR. 4. Initial content of APR as determined using the method given in this section. 5. Investment return period (7). 6. Maturity period (h). 7. Probability distributions p4, ps, and ph.
nf,
using a probability distribution p4. Step 3. (Which fields of Ri appear in the predicate?) Generate n distinct random integers j,, jz, . . . , j, in the range 1 5 jqS
Table 2. Access Paths for Range Query Types
nf,
using probability distributions F3(Fi,,k). Step 4. (What type of range query?) For each field Fi,lq,generate a random integer t, in the range 11t16
to specify the type, using a probability distribution ps. Step 5. (What value of F,,,, appears in the range query?) For each field F,,j,, generate a random integer k,,,, in the range
According to the assumptions intrinsic to Eq. (21), and p6 are taken as uniform distributions. p4 is also PS assumed to be uniform. The model was simulated for 100 events, and in certain cases for 300 events. In order to provide for a fair ground for comparison, all seven delete strategies were run using the same collection of pseudorandom numbers. The output of simulation runs includes the average cost of processing queries and the number of required inversions for all j events following initialization, for 1 5 j 5 100. It should be noted that cost values are normalized to 1.O and that 1.0 is the cost of a single inversion. Individual costs of events are at worst 1.0, which corresponds to a query request that could not be accommodated by the APR because no path could be deleted, and hence the query request necessitated an isolated inversion. The results of the simulation runs together with the conclusions of the investigation are presented in the next section.
1 5 k,,, 5 nv,,,,
(number of values for the field F,,,,,).using a probability distribution pn. Step 6. (Which access paths are indicated?) Determine access paths using t of Step 4 and ki,lqof Step 5 and Table 2. A BASIC program for a mainframe computer was written to run the simulation for a variety of databases and for all seven proposed delete strategies. Input to the
5. RESULTS
After the simulation software was developed, it was run experimentally for a small database (Database 1 of Table 3) for about 40 times, each for 300 events. On the basis of these initial runs, it was possible to arrive at the following conclusions: 1. The number of inversions required for a fixed period is strongly correlated to the average cost of condi-
246
Y. Uckan and W. D. Haseman
Table 3. Databases for Simulation
tional query processing at the end of the period. It was, therefore, reasonable to take the average cost function as the sole measure of performance for all seven delete strategies. 2. Figure 4 shows average cost function variations for a typical simulation run for 100 events. With the exception of Strategy I (which was subsequently proven to be the worst delete strategy), all delete strategies exhibited a similar behavior. The average cost of query processing was acceptably low (about 0.40) up to event 70. After this event, there was a gradual deterioration in the performance of the system. Therefore, for performance comparison purposes, it was decided to limit the subsequent simulation runs to 100 events, and use the average cost at event 100 as the discrete measure of performance.
Fields Database
1
1
p,
Relation
o,80 Values P3
1
2
2
o,15
Values
3
o,05
VaZes
o.4.
Vaces
3
o,20
Values
P3
p3 o,80
Values P3
3
2
o,15
Values
3
0.05
Values
P3
P3
o,50 Values P3
4
o,30
Values
o,20
Vakzs P3
Figure 4. Typical simulation
-
-
-
-
2 0.50 7
4 0.50 -
6 0.50 -
-
0.50
2
1
6 0.90
5 0.65 2
Values
4
5
1.00
o , 4.
3
0.90
P3
I
2
-
-
6 0.65 4
6
-
0.50
0.50
-
-
-
-
-
-
-
2 0.88 4 0.83 2 0.68
4 0.13 10 0.14 2 0.37
6 0.40 5 0.17 3 0.20
2 0.12 -
2 0.60 4 0.23 2 0.68
4 0.40 10 0.20 2 0.37
6 0.30 5 0.54 3 0.20
2 0.29 -
7
1.00
-
The results indicated that, for the same database, as investment return period increases, average cost for all events generally decreases. Increasing maturity periods for Strategy VI and VII tend to increase the average cost. Maturity period is not a parameter for other strategies. As the capacity of APR increases, for all delete strategies the average cost comes down. Having thus verified certain intuitive conclusions and therefore the correctness of the model, we chose
results.
Average cost
0.7
0.6
---7-
10
20
30
40
50
60
70
a0
90
100
STRATEGY
I
STRATEGY
III
STRATEGY
IV
Event
247
Access Path Handling in a Database System four different databases and ran a total of 350 simulations in 10 batches. The four databases to be simulated are described in Table 3 in terms of the number of relations, probabilities p,(R) that a particular relation may be indicated by a random query, number of nonkey fields for each relation, probabilities pj that a particular field may be requested in a random query, and number of values for each nonkey field. It should be noted that databases 1 and 2 are structurally similar, and so are databases 3 and 4. They differ in what we shall term “database bias,” which in fact denotes the tendency for a database to contain a subset of access paths which are substantially more likely to be requested than the other paths. Database 1 is a high-bias database compared to database 2, in that p as given in Eq. (21) is markedly higher for all access paths which are used to initialize the APR in database 1 than it is for the same initial paths in database 2. The results of the simulation indicate that for all strategies, high-bias databases produce lower average cost values than those corresponding to low-bias databases of similar structures. In all simulation runs APR capacity was kept at about 30% of all access paths as defined by the database. Investment return period and maturity period were both held at a constant value of five events. The results of the simulation runs are summarized in Table 4. Each batch consists of 35 independent runs of all seven strategies for 100 events. In each run, the seven proposed strategies were simulated using the same random number sequence generated for all involved probability distributions. Table 4 gives for each batch the number of runs in which a particular strategy was found superior to others. In batch 1, for instance, Strategy IV was found to be the best in 16 runs out of a total of 35. We have found that in high-bias databases (batches 1,2, 5,6,7), Strategies IV-VII were significantly better than Strat-
Table 4. Result of Simulation for All Delete Strategies Strategy Database I 1 2 2 3 3 3 4 4 4
I
Batch #
II
III
-
-
I---
2 3 4 5 6 7 8 9 10 Total
4 8-
113
----86 1 12 38 2
1 -
5 8 4 31
IV 16 18 18 5 20 15 12 7 5 5 104
V 4 3
VI 9 10 3 14 2021
14 2 112 1 12 1 12 34 113
VIII 6 4 5 7
I 2 2 1 28
Total number of runs 35 35 35 35 35 35 35 35 35 35 350
egies I, II, and III, and that II and III were definitely superior to I. Statistical Z-test was used to determine significant differences of average cost values. In lowbias databases (batches 3,4,8,9, lo), the performances were not as markedly distinguishable. On the basis of the average of all 35 average costs in each batch, batch 3 and 9 showed that Strategy I is significantly worse than the rest, and batch 4 indicated that Strategies II and III performed worse than Strategies IV-VII. We could find no significant differences between Strategies IV-VII. However, Strategies IV and VI seemed to pro duce lowest average costs in a high majority of simulated runs. As the last row of Table 4 indicates, Strategy IV was favored in 30% of all runs and Strategy VI was superior in 33% of all runs. In reality, it is impossible to predict when a high-bias database becomes a low-bias one, or vice versa. A highbias database, in the sense we define it in this study, is definitely not a rarity. Consequently, it is reasonable to conclude that Strategies I, II, and III are generally poor performers, and to concentrate on the remaining four strategies. Table 5 shows the results of the simulation runs for Strategies IV-VII. Batches 8-10 for the lowbias database clearly emphasize Strategies IV and VI in the absence of Strategies I-III, whereas batches 3 and 4 favor Strategies V and VII. However, for all 350 runs of Table 5, Strategies IV and VI outperform the rest 37% and 40%, respectively, of all simulation runs. One can, then, conclude that in general, and especially for high-bias databases, Strategies IV and VI are the best, and either can be implemented in a real database environment. 6. SUMMARY AND CONCLUSIONS This investigation is concerned with the design of a cost-efficient secondary key retrieval strategy in a re
Table 5. Result of Simulation for Delete Strategies IVVII Strategy Database
Batch #
IV
V
VI
VII
Total number of runs
1 2 3 4 5 6 I 8 9 10 Total
16 18 8 11 20 15 12 12 9 9 130
4 3 12 16 2 3 I 2 43
9 10 4 14 20 21 17 22 23 140
6 4 11 8 I 3 3 I 37
35 35 35 35 35 35 35 35 35 35 350
248 lational database environment. It emphasizes the conditional query processing capability of the INTREL/ DBMS, an experimental relational database management system, and proposes a dynamic secondary key index (access path) storage structure and an associated access path handling strategy which help optimize conditional query processing. INTREL/DBMS is a generalized relational database management system with schema definition, database manipulation, and retrieval capabilities. It includes a full data language, and its query processing language is relationally complete. Secondary key retrieval (or selection) from a specified database relation is among the INTREL/DBMS retrieval capabilities. To store access paths (or inverted lists) which are used for secondary key retrieval, a special storage structure (APR) is designed: this is one of the ten system control structures incorporated into the design of INTREL/ DBMS. INTREL/DBMS has a uniform design characteristic in that all system control structures, as well as the corporate database, are implemented as relations. APR is a fixed size, limited capacity storage structure which may contain only a small subset of all possible primitive access paths that may be based on a relational database. A primitive access path for a nonkey and invertible field of a database relation defines a subset of all tuples of the relation for which the said field value is equal to one specific admissible element of the domain set of the field. APR stores only primitive access paths from which access paths corresponding to complex predicates may be trivially computed. To store and maintain in APR only those paths which are most likely to be needed by a random conditional query request, an artificial intelligence has been integrated into the INTREL/DBMS design. The system adds new paths requested by a random query (an “event”) to APR and permits future events to use these paths at a fraction of the total cost of path creation, It keeps on adding new paths until APR becomes full. At that point, to accommodate new paths, the system will have to delete some of the existing paths. Clearly, the average cost of processing a random query under this strategy is much lower than the cost of processing each conditional query independently of others. The performance of the system may further be improved if the database administrator initially loads the APR with a set of paths which are most likely to be requested by the users of the database. A method which enables the DBA to determine such favorable paths is presented in the paper. The performance of the system will eventually deteriorate as a sequence of random events will alter the initial favorable content of the APR. The DBA may re-
Y. Uckan and W. D. Haseman juvenate the system by reloading the APR with a possibly new set of good access paths as soon as average cost of an event becomes intolerably high. Our objective is to specify an access path delete strategy that will be automatically carried out by the INTREL/DBMS, based on what the system has learned from the past demands of the database users, such that the average cost of query processing is lowest, and thus, the period between any two consecutive reloadings of the APR is longest. Seven different path delete strategies are investigated. These are based on the principles of “delete paths with zero unreturned cost,” “delete paths with lowest unreturned cost,” “ delete paths which are least recently used,” and “delete paths which are least frequently used.” A simulation model is developed and all seven models are simulated for four different databases, in 350 simulation runs each lasting 100 events. It has been concluded that two of the proposed delete strategies, “delete the least recently used paths and distribute the sum of unreturned cost among the new paths added to APR,” and “delete the least frequently used paths, and distribute the unreturned cost among the new paths” are the best. No significant difference between these two delete models is observed. Hence, either can be implemented in the design of INTREL/DBMS, or in general, in any database system whose secondary key retrieval characteristics are similar to those of the INTREL/DBMS. The current study suggests a powerful indexing strategy for any database management system design in coping with the fundamental problem of efficient secondary key retrieval. It also provides the DBA with a methodology to optimize and control the overall performance of a database management system in which the suggested indexing strategy or one that is similar to it exists.
REFERENCES 1. D. E. Knuth, The Art of Computer Programming, Vol. 3, Sorting and Searching, Addison-Wesley, Reading,
MA, 1973, p. 550. 2. J. Martin, Computer Data-Base Organization, 2nd ed., Prentice-Hall,
Englewood
Cliffs, NJ, 1977.
3. S. P. Ghosh, Data Base Organization for Data Management, Academic
Press, New York, 1977. Analysis and Performance of Inverted Data Base Structures, Commun. ACM 18, 253-263 (May 1975). J. S. Johnson and D. B. Webster, Updating An Inverted Index-A Performance Comparison of Two Techniques, Computer J. 25, 169-175 (Feb 1982).
4. A. F. Cardenas,
5.
Access Path Handling in a Databue System 6. A. Putkonen, Selection of the Access Path in Inverted Database Organization, Information Systems 4, 219226 (March 1979). ‘7. Y. Uckan, Design of A Relational Data Language and A Data Base Management System, Technical Report, Department of Computer Science, Middle East Technical University, Ankara, Turkey, 1980.
249 8. Y. Uckan, Design of A Relational Data Base System,
Proceedings ofrhe Fuu~feen~h Annual Conference on ~od~~~ng and ~~rnu~ati~n 14,959-964 (1983). 9. Y. Uckan, A Relational Schema Description and Manipulation Facility, Computer J.. to appear. 10. G. Wiederhold, Database Design, 2nd ed.. McGraw-
Hill, New York, 1983, p. 73.