Information Systems Vol. 12, No. 2, pp. 203-213, 1987
0306-4379/87 $3.00 + 0.00
Pergamon Journals Ltd. Printed in Great Britain
THE DESIGN OF OPTIMAL ACCESS PATHS FOR RELATIONAL DATABASES

DALIA MOTZKIN

Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, U.S.A.
(Received 25 June 1986)

Abstract-A generalized model for the performance optimization of physical relational database access has been developed and implemented. The model consists of a set of algorithms and cost equations. It assists the database designer in specifying and selecting optimal access schemes within given system constraints. The model is an extension of previous work. It is more comprehensive and flexible: it addresses problems that have not been considered in previous models, it integrates into one model aspects that were treated individually before, and it produces database access configurations that can work within given system constraints.
1. INTRODUCTION
The problems of optimizing database access have been approached from two standpoints. One is query optimization, which seeks the optimal selection of an access path for a given query and a given access structure of a physical database. The other is the selection of an optimal access structure. This paper is concerned with the second aspect. The databases under consideration are composed of unsorted relations stored on a random access device such as a disk, and the access paths are via indices or directly through the file.

The selection of indices for a database is an important part of physical database design. Since index performances vary, there is a need to select the most suitable index for each field of each file. While appropriate indexing improves performance considerably, excessive indexing can result in major performance degradation, as well as in a significant increase of storage requirements. The performance of some types of indices deteriorates over time due to overflow situations and other problems; reorganization is then required.

To review the problem, consider the following situation: given an unsorted file of 1,000,000 records (a file of this size is not unusual for a large corporation) and an access time of 0.1 sec per record, assume that only one record is retrieved with one access. The average number of accesses needed to find one record is half the file, i.e. 500,000 accesses, so the average time needed to obtain a record from such a file is 13.8 h. If the file is sorted and a binary search is used, then only log2(1,000,000) accesses (i.e. 20 accesses) are needed to find a record, that is, 2 sec. Sorting a file cannot solve all access problems, since the file can be sorted on only one field; sorted files also lead to considerable updating problems. Assume instead that the given file has four-level B-tree indices for some of the fields. Retrieving a record from the file using such a B-tree requires 4 accesses in the B-tree and one in the file, i.e. 5 accesses, which in the above system take 0.5 sec. However, while indices improve the search time for records, they obviously slow down database maintenance. Each time a record is inserted in a file, all indices of all fields must have entries inserted in them. In some situations the overall effect of an index is degraded performance. In addition, the indices occupy a considerable amount of space, and available space is often limited, especially on mini- or microcomputers. Selecting an appropriate index for a field is another problem; no index is "best" in all situations.

The problem of index selection, along with other problems of modelling, optimization and prediction of database performance, has been studied by many researchers. Interesting results have been published by Batory [1-3], Bonfatti et al. [4], Carlis et al. [5-7], Chen and Yao [8], Christodoulakis [9], Hoffer [10], Lum and Ling [13], March and Severance [14], Mendelson [15], Nicolas [19], Schkolnick [20, 21], Whang et al. [23, 24], Yao et al. [25, 27], and others. Additional bibliography, related to earlier work, can be found in the extensive surveys by Schkolnick [21] and Yao et al. [26, 28]. However, existing modelling and optimization techniques are not complete; they suffer from one or more of the following problems:

1.1. The work is file oriented rather than database oriented

1.2. The list of evaluated indices is inadequate

In some models only a few indices can be evaluated, the entire B-tree family and other indices having been omitted. In some, a variety of updating techniques is not incorporated. Others compare too many indices, rendering the models slow and inefficient.
1.3. Periodic reorganization is not addressed

Some index structures require periodic reorganization due to overflow and other problems. Some models do not take reorganization into consideration in the selection process.

1.4. System constraints are not taken into consideration

Many computer systems, especially microcomputers, have a limited amount of space. Other systems have time constraints due to heavy work loads. These constraints do not play a role in many current models.

1.5. Important transaction types and the effects of these transactions are not included

Some models are oriented towards retrieval only. Others take into consideration maintenance operations such as insertion, deletion and modification, but do not include the multi-index/multi-field effects of transactions. The following example may illustrate a problem of this nature. To modify a field i (say a salary field) of a record R, the system uses the value of field j (say the social security number). An index on field j will obviously improve the search performance, but an index on field i will not contribute to the speed of the search. On the other hand, the index on field i must be modified, thus decreasing the performance of this transaction.

1.6. The interaction and distinction between primary and secondary fields is lacking

Some models are oriented towards primary fields only, others toward secondary fields. There are models which evaluate indices for all fields but do not distinguish between primary and secondary fields.

1.7. There is no discrimination between dense and nondense field attributes

1.8. Performance prediction is not provided

There are models which present optimal database configurations and/or reorganization points, but they do not provide the user with the time and space requirements of the database.

This work is an extension of previous work. Solutions to the performance issues 1.1-1.8 above have been developed and implemented in one model. As mentioned above, the databases under consideration are composed of unsorted relations; no reference is made to hierarchical or network data models. Preliminary results of the initial version of this model have been described in Motzkin [17]. This paper describes the revised and refined version of the model, including the algorithms, the cost equations and the output produced by the model. The model has been implemented with a PASCAL program and also with an APL program.

2. RELATIONAL OPERATORS AND INDEXING

The operations on relational databases can be grouped into two groups:

Group I. Operations which require access to each record in a file or files. This group includes: projection, union, intersection, difference, natural join and equijoin operations, and some update operations (e.g. "increase the salary of all employees by 5%").

Group II. Operations which require accesses to some selected subset of records in a file. This group includes: selection, deletion, most modifications and some general join operations.

While the performance of operations of the second group can be improved considerably by using indices, operations of the first group do not benefit from indexing. Projection, for example, requires only one pass over the file, while a natural join is efficiently done by first sorting the two files to be joined and then comparing them, using one pass over both files. This work is concerned with the design of optimal access paths and, therefore, the model described below is associated with operations of Group II.

3. THE COMPONENTS OF THE MODEL

In the following sections, the details of the model for the design of optimal access paths are described. The model has four components: input parameters, output, algorithms and cost equations.

4. INPUT PARAMETERS
The input to the system consists of four groups of parameters: system parameters, database parameters, user workload parameters, and index parameters. For examples of input parameters, see Figs 1 and 2.
4.1. System parameters

These parameters are concerned with system and operating environment constraints and costs. They include: the total number of disk blocks available for the database, the size of a block, the cost of a block, the average access time and the cost per access. A block is defined as in [24], i.e. the unit of disk allocation as well as the unit of disk transfer and buffer size.

4.2. Database parameters

These parameters provide database information such as the number of files, the number and names of fields in each file, and needed information on each field. These parameters include: the number of files; for each file, the file name; for each field, the name of the field, a flag indicating whether the field attribute
is dense or not dense (this parameter is needed since different types of indices are suitable for dense attributes and nondense attributes), the number of distinct attribute values (this is only meaningful for nondense attributes), and the number of characters in the field.
4.3. User workload parameters

This group of parameters is concerned with various user transactions. The user workload parameters are broken into two groups: those that are associated with a file, and those that are associated with a field. The workload parameters associated with a file include the number of insertions and deletions into each file per unit of time (such as a day), and the frequency of file reorganization. (This parameter is used for index organizations that require periodic reorganization.) The workload parameters that are associated with individual fields include the average number of daily searches based on a field value and the average number of field modifications (called updates here) per day.

It can be difficult to obtain the values for user workload parameters. The values may be estimated, or a program may count them over a period of time and produce an average per unit of time such as a day or a month. This paper does not discuss methods by which user workload parameters may be obtained; it is assumed that such input is available for the model.

User workload parameters include processes which can use indices, i.e. accesses which are the results of operations called here Group II operations (see the previous sections). Insertions and deletions are measured per file, due to the fact that when a record is inserted or deleted all indices have to be modified. Searches are measured per field, since database users usually request records with given field values. Updates (modifications) also affect individual fields; for example, if a "Quantity" field in an inventory file is changed, only the "Quantity" index is modified. It is assumed that the searches for records to be deleted or modified are done using the primary field.

The "frequency of reorganization" parameter indicates how often reorganization is done. Reorganization involves merging overflow areas with main areas, removing empty areas that might have been created due to deletions, regenerating some indices, and other related operations. This input parameter is required per file, assuming that all indices that require reorganization are reorganized at the same time.

An example of a set of system parameters, database parameters and user workload parameters is provided in Fig. 1. The sample input of Fig. 1 is one of the data sets that were actually used in simulation runs of the implemented model. Regarding the format of the input parameters (see Fig. 1): a "/" in column 1 of an input line indicates a comment line, and the input may include comments. The first line of input consists of the system parameters. The parameters are separated by blanks; blanks are always ignored.
[Fig. 1 (tabular input data, not reproduced here): sample system parameters (available blocks, characters per block, access time in sec, cost per block per day, cost per access), followed by per-file and per-field database and workload parameters (file name, primary attribute, total number of records, field names, field type dense/nondense, number of characters, number of distinct values, insertions, deletions, searches and updates per day, and frequency of reorganization in days).]
Fig. 1. An example of input parameters.
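To make the four input groups of Section 4 concrete, the following is a minimal sketch of how they might be represented. The class and attribute names are illustrative assumptions, not the identifiers used in the original PASCAL or APL implementations.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SystemParams:            # Section 4.1
    available_blocks: int      # total disk blocks available for the database
    chars_per_block: int       # block size (unit of allocation, transfer and buffering)
    cost_per_block: float      # storage cost per block per day
    access_time: float         # average access time in seconds
    cost_per_access: float

@dataclass
class FieldParams:             # Sections 4.2 and 4.3, per-field part
    name: str
    dense: bool                # dense or nondense attribute
    num_chars: int
    distinct_values: int       # meaningful for nondense attributes only
    searches_per_day: float    # S_i
    updates_per_day: float     # U_i

@dataclass
class FileParams:              # Sections 4.2 and 4.3, per-file part
    name: str
    num_records: int
    insertions_per_day: float  # I
    deletions_per_day: float   # D
    reorg_frequency_days: float
    fields: List[FieldParams] = field(default_factory=list)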
Next, there are two tables containing database information and user workload information regarding files and fields. Each table is preceded and followed by a line containing a "*" in column 1.

4.4. Index parameters

The number of indices that have been defined and described in the literature, the variations on each index, the different variations of updates, and the possible combinations of variations may lead to tens or hundreds of different possible combinations. It is obviously not feasible for a database to have an excessive number of distinct indices and updating procedures. Such a choice would require an excessive number of subroutines in the database and would produce a database management system that is cumbersome and difficult to maintain and modify. A model that compares hundreds of combinations and tries to choose the "best" few will also be exceedingly slow and cumbersome. On the other hand, it is difficult to choose a small number of indices for comparison and be sure that we have picked exactly the ones that every database designer would choose. We therefore suggest selecting a few indices that are widely used and providing the database designer the option of testing any additional indices that he/she wishes to test. In this way we provide a model which is fast, comprehensive and flexible.

(a) The built-in indices. The system provides a few of the more widely used indices as a default option. The file designer may add any additional indices to be used in the evaluation. For dense fields the system evaluates the B-tree index and the sequential index. The B-tree index is chosen as a representative of the B-tree family, which includes the B-tree, B+-tree, B*-tree and the multi-level sequential index with block splitting techniques used for updating. These four directories have similar performance; therefore one representative is selected. The structure of the B-tree is as given in Horowitz and Sahni [11]. The B-tree family has a very efficient access time, but space utilization may be as low as 50%. Therefore, the other directory chosen as a default option for a dense attribute is a simple one-level sequential directory. The sequential directory is not as fast as a B-tree, but it is more economical in its space requirements. In some database environments, especially on microcomputers, the space constraints may be stronger than the time constraints, and thus the slower but more compact sequential directory may be more suitable. For the nondense attribute an inverted file is selected as the default option. It was pointed out by Motzkin [17] and others that inverted files are superior to multilists in most situations. The uniform organization of inverted files, as described by Motzkin et al. [16], is assumed.

(b) Additional user-selected indices to be evaluated. A database designer may wish to evaluate and compare indices other than the defaults. It is possible to enter additional indices and their characteristics. The system will incorporate all additional indices into the optimization process. The user-defined indices are entered as follows: the index name, followed by a flag indicating whether the index is suitable for dense or nondense attributes, followed by a list of formulas. Each formula has the form: Variable name ← Expression. The variables in each expression can be either system variables or variables appearing on the left-hand side of a preceding formula. The operators include symbols for addition, subtraction, multiplication, division, exponentiation and logarithms. The APL implementation allows APL symbols while the PASCAL implementation allows PASCAL symbols. An example of a user-defined index is depicted in Fig. 2. The list of system variables that may be used by the designer is provided in Fig. 3.
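As a rough illustration of how such a formula list can be evaluated, the sketch below processes user-supplied formulas in order against an environment of system variables (Fig. 3). The parsing shortcut (Python's eval over a restricted namespace) and the sample formulas are assumptions made for illustration only; the actual implementations use APL and PASCAL expression syntax.

import math

def evaluate_index_formulas(formulas, env):
    """Evaluate 'VAR <- expression' formulas in order.

    Each expression may reference system variables (Fig. 3) or
    variables defined by a preceding formula, as in Section 4.4(b).
    """
    scope = dict(env)                                  # copy of the system variables
    scope.update({"LOG": math.log, "round": round})    # helpers allowed in expressions
    for name, expr in formulas:
        # expressions see only the current scope; no builtins
        scope[name] = eval(expr, {"__builtins__": {}}, scope)
    return scope

# Hypothetical user-defined index: a one-level sequential directory.
formulas = [
    ("NCE", "NCF + 10"),                # characters per entry (assumed layout)
    ("NEB", "NCB // NCE"),              # entries per block, truncated
    ("NB",  "-(-NE // NEB)"),           # blocks, rounded up
    ("NAS", "round(LOG(NB, 2)) + 1"),   # binary search of the directory
    ("NAI", "NAS + 2"),
    ("NAD", "NAS + 1"),
]
env = {"NCF": 20, "NCB": 640, "NE": 160000}
result = evaluate_index_formulas(formulas, env)
print({k: result[k] for k in ("NB", "NAS", "NAI", "NAD")})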
5. THE OUTPUT

The output includes a system summary and a table indicating the optimal set of indices as well as the cost, time and space estimates associated with each index. Cost and time are computed per day. A sample output is depicted in Fig. 4. Additional interpretation of the output is provided in Section 6.
[Fig. 2 (formula listing, partly illegible in the source): a user-defined index named MULTILEVEL, marked as suitable for dense attributes, defined by formulas that derive NCE, NEB, L, NBMAIN, NBOVF, NB, NAS, NAI, NAD and NAR from the system variables of Fig. 3 in APL notation.]
Fig. 2. User defined index.
AVAILBLK = number of available blocks.
CA = cost per access.
CB = cost per block.
FREQ = frequency of merge/reorganization.
D = number of deletions per day.
DV = number of distinct values.
I = number of insertions per day.
NAD = number of accesses per deletion.
NAI = number of accesses per insertion.
NAR = number of accesses for reorganization/merge (per day).
NAS = number of accesses per search.
NAU = number of accesses per update.
NB = number of blocks.
NCB = number of characters per block.
NCE = number of characters per entry.
NCF = number of characters per field.
NCR = number of characters per record.
NDAI = number of daily accesses for insertions.
NDAD = number of daily accesses for deletions.
NDAR = number of daily accesses for reorganization.
NDAS = number of daily accesses for searches.
NDAU = number of daily accesses for updates.
NE = number of entries.
NEB = number of entries per block.
NR = number of records.
NRB = number of records per block.
S_i = number of searches per day based on the value of field i.
TA = average time per access.
TDA = number of total daily accesses.
TDC = total daily cost.
TDT = total daily time.
TOTALSPACE = total database space.
U_i = number of updates of field i per day.
U = total number of daily updates of all fields.
Fig. 3. System variables.
6. THE ALGORITHM

6.1. An overview of the algorithm

For each field of each file the system first selects the best index from all the indices to be evaluated. The "best" index is the index with the lowest cost; the cost considerations are described in Section 7. After the "best" index has been determined for a field, the cost of the related processing of the field without an index is computed. The "cost without an index" (Section 7.2.1) includes the cost of a direct search in the file for records associated with given field values. Obviously the cost of a direct search in the file will be significantly higher than that of a search that uses an index; however, there will be no cost for index space and index maintenance. Now, for each field, the "cost without index" is compared with the cost with the "best" index, and it is then determined whether the "best" index or no index will be selected for the field.

When indices (or no indices) have been selected for all fields, a lowest cost organization has been achieved. However, this low cost organization may require more space than is available, or may require too much time. The space problem is handled as follows. After indices (or no indices) have been selected for all fields, the total database space is computed; this includes the space occupied by the files and the indices. If the total database space is greater than the available space, then the least "beneficial" index is removed. The process of removing the least "beneficial" index continues until enough indices have been removed and the total database space is less than or equal to the available space (Section 6.2). The "benefit" of an index is its cost benefit, which is the difference between the cost associated with the corresponding field when no index is used and the cost associated with the field when the "best" index is provided. An index is considered less "beneficial" if it has a lower cost benefit. An index is normally more beneficial when the corresponding field has more searches and fewer modifications, and if the index does not occupy a very large amount of space.

In the algorithm described below the concept of separability [4, 23, 24] is used and extended to separability of fields. Each field is processed individually, as opposed to testing combinations of fields and files. To illustrate the efficiency of this approach, consider a database with n fields and m indices. The number of possible combinations is m^n, while the number of separate tests is m × n. No loss of accuracy occurs because the interrelations between fields are taken into consideration in the input data structure and by the cost equations, rather than by trying out combinations.
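For instance, with m = 3 candidate indices and n = 10 fields, an exhaustive search would have to examine 3^10 = 59,049 index combinations, whereas the separability approach evaluates only 3 × 10 = 30 field/index pairs.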
[Fig. 4 (tabular output, not reproduced here): the first selection of optimal indices, listing for each file and field the chosen index (e.g. MULTILEVEL, SEQUENTIAL, INVERTED, BTREE or NO INDEX) with its total daily cost, followed by a system summary of total daily cost, total time and total space; then the second selection of the optimal set of indices after the best-fit step, with its reduced space requirement and revised system summary.]

Fig. 4. Output.
This approach is deterministic; it alleviates the problems associated with "add heuristics" and "drop heuristics" [24].

At the end of the computations the database designer is presented with the total space, the total time of accesses computed from the input parameters, and the related cost of the database. It is possible that while the space is acceptable, the time figure is too large. The database designer may then allow more space for the database and run the optimization program again. The additional space allocation will allow the database to use more indices and thus improve the time figure. The user may also try to put less weight on the space by reducing the "cost" of a block; this reduction may also increase the number of indices used. The "frequency of reorganization" parameters can also be changed. This iterative procedure may continue until an acceptable configuration is achieved or until there is no further improvement.

6.2. Outline of the algorithm
PROCEDURE MAIN
  FOR I = 1 TO NUMBER OF FILES DO
    FOR J = 1 TO NUMBER OF FIELDS IN FILE I DO
      CALL FIND_BEST_INDEX/NO_INDEX(I, J)
      STORE INFO (best index, also no index)
    END FOR
  END FOR
  COMPUTE TOTAL SYSTEM SPACE
  COMPUTE TOTAL SYSTEM TIME
  COMPUTE TOTAL SYSTEM COST
  OUTPUT RESULTS
  IF AVAILABLE SPACE < TOTAL SYSTEM SPACE THEN
    CALL BEST-FIT
    COMPUTE NEW TIME, SPACE AND COST
    OUTPUT RESULTS AFTER BEST FIT
END OF PROCEDURE
PROCEDURE COMPUTE_INDEX_PARAMETERS(INDEX NAME, FILE I, FIELD J, VAR: NAS, NAD, NAI, NAR, TOTAL SPACE)
  (*The procedure calls specific subroutines that compute NAS, NAI, NAD, NAR and TOTAL SPACE for a given file, field and index*)
END OF PROCEDURE

PROCEDURE FIND_BEST_INDEX/NO_INDEX(I, J) (*file I, field J*)
  CHECK DENSE/NONDENSE ATTRIBUTE
  FOR L = 1 TO NUMBER OF INDICES for this dense/nondense field DO
    CALL COMPUTE_INDEX_PARAMETERS(parameters)
    C1 ← COST OF INDEX L for field (I, J)
    C2 ← COST OF FILE ACCESSES WHEN INDEX IS USED
    COSTS_WITH_INDEX ← C1 + C2
    STORE INFO
  END FOR
  FIND MIN(COSTS_WITH_INDEX, L)
    (*L is an output parameter indicating the index with the lowest cost; COSTS_WITH_INDEX is also an output parameter, the cost with index L*)
  FIND COST_WITHOUT_AN_INDEX
  BEST INDEX/NO INDEX ← MIN(COST_WITH_INDEX, COST_WITHOUT_AN_INDEX)
END OF PROCEDURE

PROCEDURE TOTAL SYSTEM TIME
  TOTAL TIME ← 0
  FOR I = 1 TO NUMBER OF FILES DO
    FOR J = 1 TO NUMBER OF FIELDS IN FILE I DO
      IF FIELD HAS AN INDEX THEN
        TOTAL TIME ← TOTAL TIME + TOTAL TIME OF INDEX ACCESSES + TOTAL TIME OF FILE ACCESSES WHEN INDEX IS USED
      ELSE
        TOTAL TIME ← TOTAL TIME + TOTAL TIME OF FILE ACCESSES WHEN INDEX NOT USED
    END FOR
  END FOR
END OF PROCEDURE

PROCEDURE TOTAL SYSTEM SPACE
  TOTAL SPACE ← 0
  FOR I = 1 TO NUMBER OF FILES DO
    FOR J = 1 TO NUMBER OF FIELDS IN FILE I DO
      IF FIELD J HAS AN INDEX THEN
        TOTAL SPACE ← TOTAL SPACE + INDEX SPACE
    END FOR
    TOTAL SPACE ← TOTAL SPACE + FILE SPACE
  END FOR
END OF PROCEDURE

PROCEDURE TOTAL SYSTEM COST
  TOTAL COST ← 0
  FOR I = 1 TO NUMBER OF FILES DO
    FOR J = 1 TO NUMBER OF FIELDS IN FILE I DO
      IF FIELD J HAS AN INDEX THEN
        TOTAL COST ← TOTAL COST + INDEX COST + COST OF FILE ACCESSES WHEN INDEX IS USED
      ELSE
        TOTAL COST ← TOTAL COST + COST OF FILE ACCESSES WHEN INDEX IS NOT USED
    END FOR
    TOTAL COST ← TOTAL COST + FILE SPACE COST
  END FOR
END OF PROCEDURE
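The selection logic of FIND_BEST_INDEX/NO_INDEX above can be sketched as follows. This is a simplified, self-contained illustration: the cost callables stand in for the cost equations of Section 7, and the function and parameter names are assumptions rather than identifiers from the original PASCAL or APL programs.

def find_best_index_or_none(field, candidate_indices, cost_without_index):
    """Return (index_name_or_None, daily_cost) for one field.

    candidate_indices maps an index name to a callable that returns the
    total daily cost (index cost + file-access cost) for this field,
    mirroring C1 + C2 in the pseudocode above.
    cost_without_index returns the daily cost of direct file access (Section 7.2.1).
    """
    best_name, best_cost = None, cost_without_index(field)
    for name, cost_with_index in candidate_indices.items():
        c = cost_with_index(field)
        if c < best_cost:                      # keep the cheapest alternative
            best_name, best_cost = name, c
    return best_name, best_cost

# Hypothetical usage with made-up constant costs (dollars per day):
field = {"name": "QUANTITY"}
candidates = {"BTREE": lambda f: 7.2, "SEQUENTIAL": lambda f: 9.8}
print(find_best_index_or_none(field, candidates, lambda f: 30.0))
# -> ('BTREE', 7.2): the B-tree is cheaper than direct file access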
PROCEDURE BEST-FIT (*fit database into available space*)
  (*compute benefit of index (I, J)*)
  FOR I = 1 TO NUMBER OF FILES DO
    FOR J = 1 TO NUMBER OF FIELDS IN FILE I DO
      IF FIELD (I, J) HAS AN INDEX THEN
        BENEFIT (I, J) ← COST OF ACCESSES WITHOUT INDEX - COST OF ACCESSES WHEN INDEX IS USED
        STORE BENEFIT (I, J), I, J
    END FOR
  END FOR
  SORT LIST OF BENEFIT (I, J) (*denote sorted list by BENEFIT (K)*)
  K ← 1
  REPEAT
    REMOVE INDEX (K)
    TOTAL SPACE ← TOTAL SPACE - SPACE OF INDEX K
    K ← K + 1
  UNTIL AVAILABLE SPACE >= TOTAL SPACE OR ALL INDICES HAVE BEEN REMOVED
END OF PROCEDURE
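A self-contained sketch of this best-fit step in Python: given per-field benefits and index sizes, it drops the least beneficial indices until the database fits in the available space. The data layout (a list of dictionaries) and the sample numbers are assumptions made for the example.

def best_fit(selections, available_blocks, total_blocks):
    """Drop least-beneficial indices until the database fits.

    selections: list of dicts with keys 'field', 'benefit' (daily cost saved
    by the index) and 'blocks' (space the index occupies). Returns the set
    of fields whose indices were removed and the resulting total space.
    """
    removed = set()
    # least beneficial first, as in PROCEDURE BEST-FIT
    for sel in sorted(selections, key=lambda s: s["benefit"]):
        if total_blocks <= available_blocks:
            break
        total_blocks -= sel["blocks"]          # give the index's space back
        removed.add(sel["field"])
    return removed, total_blocks

# Hypothetical example: 3 indexed fields, 21,356 blocks used, 20,000 available.
selections = [
    {"field": "F1.NAME", "benefit": 35.6, "blocks": 900},
    {"field": "F2.QTY",  "benefit":  2.1, "blocks": 2500},
    {"field": "F3.SSN",  "benefit": 50.0, "blocks": 1800},
]
print(best_fit(selections, available_blocks=20000, total_blocks=21356))
# -> ({'F2.QTY'}, 18856): removing the least beneficial index is enough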
7. COST FUNCTIONS

7.1. The Basic Assumptions

The following assumptions are used in the cost computations. Each file is assumed to be an unsorted relation with one primary field. It is assumed that records to be modified and deleted are located using the primary field value. These assumptions simplify the task of obtaining input data and simplify the formulas. The results provide reasonable estimates, since this situation holds for the majority of deletions and modifications. Without loss of generality the indices are assumed to index single fields. If a group of fields has to be indexed (e.g. name and address), the group can be viewed as one virtual field. An extended separability assumption [23, 24] is used: the costs of each field are computed individually. However, it is shown how interfield effects are taken into consideration.

The costs of each field are composed of file costs and index costs. The cost of each field is computed twice. One computation assumes that no index is provided for the field; in this computation there are only file costs. The second computation assumes that an index is used to locate records; the cost of the field in this case is given by the sum of the file cost and the index cost. The variables used in the functions and formulas are described in Fig. 3. The detailed formulas are provided in the following sections.

7.2. The Costs of File Accesses

7.2.1. Cost of accesses per field per day when an index is not used

(a) Dense attribute.

NAS = NB/2: on the average, half the blocks have to be read to access a record.
NAI = 4 for the primary field. NAI is not computed for nonprimary fields. The assumption here is that available space is maintained in a linked list. When a record is inserted into the file, two accesses are needed to read and rewrite the header of the linked list, and two additional accesses to read and rewrite the block into which the record is inserted. Since an insertion affects a complete record, each insertion is only counted once per file; it is counted here with the primary field of the file. Note: insertions take fixed time, and do not use any index or any specific field to locate records, so no bias is caused by computing insertions with the primary field computations.

NAD = NAS + 3 for the primary field. NAD is not computed for a nonprimary field. NAS accesses are needed to locate and read the record to be deleted; two accesses are needed to read and rewrite the linked-list header, and one to rewrite the block from which the record has been deleted. This links the record space to the linked list of available space. As for insertions, deletions are counted for the whole file rather than for individual fields. Here the search for the record to be deleted is assumed to use the value of the primary key, so the deletion computations are done with the primary field.

NAU = NAS + 1 for the primary field; NAU = 1 for a nonprimary field. As before, the assumption here is that when a field has to be modified, the value of the primary field is known, and that the
primary field and its index (if it has an index) are used to locate the record to be modified. The update of a nonprimary field demonstrates a situation where the costs for each field are computed individually, while costs that involve combinations of fields are fully taken into consideration: the cost of locating the record to be updated is added to the primary field costs, while the cost of writing the modification is added to the costs of the modified field. (Also see the formula for NDAU below, where all searches for records to be updated are added to the costs of the primary field.)

Using the above formulas, the daily cost of the transactions can be computed:

NDAS = NAS × S_i.
NDAI = 4 × I (computed with the primary field only).
NDAD = NAD × D (computed with the primary field only).
NDAU = NAS × U + U_i for the primary field; = U_i for a nonprimary field.
TDT = TA × (NDAS + NDAI + NDAD + NDAU) for the primary field; = TA × (NDAS + NDAU) for a nonprimary field.
TDC = CA × (NDAS + NDAI + NDAD + NDAU) for the primary field; = CA × (NDAS + NDAU) for a nonprimary field.

(b) Nondense attribute.
NAS = NB: the whole file must be scanned, since more than one record may correspond to the query. The rest of the formulas are the same as for the dense attribute.

7.2.2. Cost of accesses per field when an index is used

(a) Dense attribute.
NAS = 1: the correct address was found using the index, so only one read from the file is charged to the file.
NAI = 4: same as in Section 7.2.1 (not computed for nonprimary fields).
NAD = 4: here, too, the record's address is found using the index. The four accesses are needed to add the freed record space to the linked list of available space. As before, NAD is not computed for nonprimary fields.
NAU = 2 for the primary field: read the record, modify it, and write it back.
NAU = 1 for nonprimary fields (again, the record is located using the primary field).
TDT = TA × (NDAS + NDAI + NDAD + NDAU) for the primary field; = TA × (NDAS + NDAU) for a nonprimary field.
TDC = CA × (NDAS + NDAI + NDAD + NDAU) for the primary field; = CA × (NDAS + NDAU) for a nonprimary field.

(b) Nondense attribute. For nondense attributes the following formula, derived by Yao [28], is used:
NAS = NB × [1 - ∏_{i=1..R} (NR - NR/NB - i + 1)/(NR - i + 1)],

where R denotes the number of records retrieved by the search.
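The block-access estimate above can be computed directly; the following is a small sketch of that product formula (the function name and the sample numbers are illustrative).

def yao_expected_blocks(nr, nb, r):
    """Expected number of blocks touched when r of nr records,
    stored in nb blocks, are retrieved (the formula above)."""
    prod = 1.0
    for i in range(1, r + 1):
        prod *= (nr - nr / nb - i + 1) / (nr - i + 1)
    return nb * (1.0 - prod)

# e.g. 160,000 records in 8,000 blocks, a nondense value matching 40 records:
print(round(yao_expected_blocks(160_000, 8_000, 40), 1))
# -> 39.9 (about 40 of the 8,000 blocks are touched)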
NAI, NAD and NAU are the same as for a nonprimary field above.

7.3. File Space Computation

NEB = NCB/NCE, truncated (the assumption here is that entries are not split between blocks).
NB = NR/NEB, rounded up.
Cost of file space = NB × CB.

7.4. Index Costs
The values of NAS, NAI and NAD are derived individually for each index of each field. For example, in a sequential index computation NAS = log2 NB + 1, rounded to the nearest integer, where NB is the number of blocks occupied by the index. Formulas for the performance of standard indices can be found in standard texts such as [11, 12] and others. The formulas for the default-option indices are part of the model, while the formulas for additional indices are given by the database designer. The daily values are computed in a way similar to the file cost computations. We get:

NDAS = NAS × S_i.
NDAI = NAI × I: this value is computed for each index of each field since, when a record is inserted into a file, all indices must have entries inserted in them.
NDAD = NAD × D: this value is also computed for all indices of all fields.
NDAU = NAD × U_i + NAI × U_i + NAS × (U - U_i) for the primary field; = NAD × U_i + NAI × U_i for a nonprimary field.

When a field value is modified in the file, an entry must be deleted from the corresponding index and a new entry inserted. In addition, the address of the record that needs modification is found using the primary index.
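A small sketch of these daily index-access computations follows; the function and parameter names are assumptions, and the cost-per-access and space terms of the next paragraph are omitted here.

def daily_index_accesses(nas, nai, nad, s_i, u_i, ins, dels,
                         total_updates, is_primary):
    """Daily accesses charged to one index of one field (Section 7.4).

    nas, nai, nad: per-operation access counts for this index.
    s_i, u_i: daily searches and updates on this field; ins, dels: per file.
    total_updates: U, the daily updates over all fields of the file.
    """
    ndas = nas * s_i
    ndai = nai * ins                  # every insertion touches every index
    ndad = nad * dels                 # every deletion touches every index
    ndau = nad * u_i + nai * u_i      # delete the old entry, insert the new one
    if is_primary:
        ndau += nas * (total_updates - u_i)   # locate records updated via the primary key
    return ndas + ndai + ndad + ndau

# Hypothetical B-tree on a primary field: 5 accesses per search, insertion and deletion.
print(daily_index_accesses(nas=5, nai=5, nad=5, s_i=200, u_i=10,
                           ins=310, dels=400, total_updates=60, is_primary=True))
# -> 4900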
Index time and cost per day are also computed in a way similar to the file cost computations:

TDT = TA × (NDAS + NDAI + NDAD + NDAU + NDAR) for the primary field; = TA × (NDAS + NDAU + NDAR) for a nonprimary field.
TDC = CA × (NDAS + NDAI + NDAD + NDAU + NDAR) + NB × CB for the primary field; = CA × (NDAS + NDAU + NDAR) + NB × CB for a nonprimary field.

The index cost is composed of both the cost of accesses and the cost of space. The above cost functions are utilized in the algorithms described above: the "best" index is the index with the lowest TDC value. The computations of the estimates for the total system time and cost are listed in the procedures in Section 6.

8. THE COMPLEXITY OF THE ALGORITHM

The complexity of the algorithm is O(NF × (NI + 1)), where NF is the total number of fields in the database and NI is the number of evaluated indices. The speed and efficiency are due to the separability approach applied here.
9. CONCLUDING REMARKS

A model for the design and prediction of an optimal set of access paths for relational databases has been developed. The model is concerned with the selection of an optimal set of access structures. It provides estimates for the total cost, time and space associated with the selected indices for the given input parameters. It is a natural extension of previous work: it further extends the separability approach to individual fields and achieves increased efficiency, while using a deterministic (rather than heuristic) approach. All cost-beneficial indices are included at first, thus starting with a lowest cost configuration. If this lowest cost configuration requires more space than is available, then indices are removed one at a time in order of increasing cost benefit, starting with the least beneficial indices and continuing until the database fits into the available space (or all indices have been removed). A near optimal configuration within the system space constraints is thus achieved, and trade-offs of time and space are also possible. Furthermore, the model takes into consideration the effects of transactions on different fields, as well as different reorganization points, and it allows the designer to evaluate any additional indices of interest. The model distinguishes between primary and secondary fields, and between dense and nondense attributes. Thus an efficient, comprehensive and flexible model is achieved. The model has been implemented by a PASCAL program and an APL program. It is more complete than previous work, and the implementations of the model are fast and easy to use.
Acknowledgements. The author wishes to thank Dr Kenneth Williams for his comments and suggestions. This research was supported in part by a fellowship grant from Western Michigan University.

REFERENCES
[1] D. S. Batory. B+-trees and indexed sequential files: a performance comparison. Proc. ACM SIGMOD, pp. 30-39 (1981).
[2] D. S. Batory. Optimal file designs and reorganization points. ACM Trans. Database Systems 7(1), 60-81 (1982).
[3] D. S. Batory and C. C. Gotlieb. A unifying model of physical databases. ACM Trans. Database Systems 7(4), 509-538 (1982).
[4] F. Bonfatti, D. Maio and P. Tiberio. A separability-based method for secondary index selection in physical database design. In Methodology and Tools for Data Base Design, pp. 149-160. North-Holland, Amsterdam (1983).
[5] J. V. Carlis, S. T. March and G. W. Dickson. Physical database design: a DSS approach. Inform. Mgmt 6(4), 211-224 (1984).
[6] J. V. Carlis and S. T. March. A descriptive model of physical database design problems and solutions. Proc. Int. Conf. on Data Engineering, pp. 24-27 (1984).
[7] J. V. Carlis and S. T. March. Computer aided physical database design methodology. Comput. Perform. 4(4), 198-214 (1983).
[8] P. P. Chen and S. B. Yao. Design and performance tools for database systems. Proc. Int. Conf. on Very Large Data Bases, pp. 3-15 (1977).
[9] S. Christodoulakis. Estimating record selectivities. Inform. Systems 8(2), 105-115 (1983).
[10] J. A. Hoffer. An integer programming formulation of computer database design problems. Inform. Sci. 11(1), 29-48 (1976).
[11] E. Horowitz and S. Sahni. Fundamentals of Data Structures. Computer Science Press, Maryland (1976).
[12] D. E. Knuth. Sorting and searching. The Art of Computer Programming. Addison-Wesley, Reading, Mass. (1972).
[13] V. Y. Lum and H. Ling. An optimization problem on the selection of secondary keys. Proc. ACM Ann. Conf., pp. 349-356 (1971).
[14] S. T. March and D. G. Severance. The determination of efficient record segmentation and blocking factors for shared data files. ACM Trans. Database Systems 2(3), 279-296 (1977).
[15] H. Mendelson. Analysis of extendible hashing. IEEE Trans. Software Engng SE-8(6), 611-619 (1982).
[16] D. Motzkin, K. Williams and K. Chang. Uniform organization of inverted files. Proc. 1984 National Computer Conf., July, pp. 567-585 (1984).
[17] D. Motzkin. The use of normal multiplication tables for information storage and retrieval. Commun. ACM 22(3), 193-207 (1979).
[18] D. Motzkin. Database performance optimization. Proc. 1985 National Computer Sci. Conf., July, pp. 555-566 (1985).
[19] G. S. Nicolas. A generalized database access path model. AFIPS Proc. National Computer Conf., pp. 529-535 (1981).
[20] M. Schkolnick. The optimal selection of secondary indices for files. Inform. Systems 1, 141-146 (1975).
[21] M. Schkolnick. A survey of physical database design methodology and techniques. Proc. 4th Int. Conf. on Very Large Data Bases, pp. 474-487 (1978).
[22] M. Schkolnick and P. Tiberio. Estimating the cost of updates in a relational database. ACM Trans. Database Systems 10(2), 163-179 (1985).
[23] K. Y. Whang, G. Wiederhold and D. Sagalowicz. Separability - an approach to physical database design. Proc. 7th Int. Conf. on Very Large Data Bases, pp. 320-332 (1982).
[24] K. Y. Whang, G. Wiederhold and D. Sagalowicz. Separability - an approach to physical database design. IEEE Trans. Computers 33(3), 209-222 (1984).
[25] S. B. Yao and A. G. Merten. Selection of file organization using an analytic model. Proc. Int. Conf. on Very Large Data Bases, pp. 255-267 (1975).
[26] S. B. Yao, K. S. Das and T. J. Teorey. A dynamic database reorganization algorithm. ACM Trans. Database Systems 1(2), 159-174 (1976).
[27] S. B. Yao. Modelling and performance evaluation of physical database structures. Proc. ACM National Conf., pp. 303-309 (1976).
[28] S. B. Yao. An attribute based model for database access cost analysis. ACM Trans. Database Systems 2(1), 45-67 (1977).
[29] S. B. Yao. Approximating block accesses to database organizations. Commun. ACM 20, 260-261 (1977).