Data & Knowledge Engineering 68 (2009) 499–508
Contents lists available at ScienceDirect
Data & Knowledge Engineering journal homepage: www.elsevier.com/locate/datak
An optimal workload-based data allocation approach for multidisk databases Ming-Hua Lin * Department of Information Technology and Management, Shih Chien University, No. 70, Ta-Chih Street, Taipei 10462, Taiwan
a r t i c l e
i n f o
Article history: Received 18 January 2008 Received in revised form 4 February 2009 Accepted 5 February 2009 Available online 15 February 2009
Keywords: Multi DBs Data allocation Query optimization Integer programming
a b s t r a c t Parallel processing mechanisms and data layout approaches that significantly affect access performance of database systems have received increased attention in the last few decades. Multidisk allocation problems try to find an allocation of relations to disks such that expected query cost is minimized. Solving this NP-complete problem is extremely timeconsuming, especially because the need for solution time rises exponentially as the number of 0–1 variables increases. This study presents a novel and efficient approach for deriving an optimal layout of relations on disks based on database statistics of access patterns and relation sizes. In addition to minimizing query cost, the proposed model allows replication of relations, minimizes storage cost, and enhances computational efficiency by reducing the number of 0–1 variables and constraints. Illustrative examples and experimental results demonstrate the advantages and efficiency of the proposed method. Ó 2009 Elsevier B.V. All rights reserved.
1. Introduction Over the last few decades, numerous parallel processing mechanisms and data layout methods have been developed to improve the performance of database systems. Since CPU speed far exceeds that of main memory and I/O bandwidth, I/O throughput rates have become a bottleneck in system performance, especially for large databases [13,16]. Most enterprises use relational database systems in the back-end to support applications, such as decision-support systems, geographical information systems, business intelligence, engineering and scientific systems. As the size of databases continues to increase, efficient I/O performance plays a critical role in attaining acceptable and scalable overall performance for database access [1]. A number of I/O parallel approaches [2,7,13,15,16,18] for data array systems, such as disk striping, disk shadowing, and data partitioning, have been developed to increase I/O bandwidth. Kurose and Simha [8] investigated the resource-allocation problem based on economic models in a distributed computer system. They developed a decentralized algorithm to allocate file fragments in a cooperative and non-competitive manner among agents (computer systems). In their approach, one file is divided into different parts without meaningful relationships among the file fragments. Fragmentation is distributed to different computers to minimize communication cost and average processing delay related to file access. However, in relational database systems in which some objects are accessed simultaneously, the relationships among objects must be considered. Lin and Orlowska [9] solved the data allocation problem in relational database systems, and minimized total communication cost within the network using an integer linear programming approach. Sarathy et al. [14] analyzed the problem of allocating copies of relations from a global database to a geographically distributed
* Tel.: +886 2 25381111x8930; fax: +886 2 25381111x8819. E-mail address:
[email protected] 0169-023X/$ - see front matter Ó 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2009.02.001
500
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
communication network. This allocation problem is formulated as a constrained nonlinear 0–1 program and solved using subgradient optimization. Agrawal et al. [1] developed a heuristic search algorithm for allocating objects in relational databases to minimize total I/O response time for a given workload. Via experiments conducted on Microsoft SQL Server 2000, they determined that exploiting the co-access knowledge of database objects is important in designing a database layout with better I/O performance than that achieved with full striping. Physical data allocation significantly impacts I/O performance of database systems, especially in relational database systems where two-way joins comprise a significant portion of the application mix [13]. To solve a class of two-way join problems in relational database systems, Rotem et al. [13] developed a mathematical framework for allocation of nonfragmented relations to multidisk storage systems. Chang [4] generated a concise method to reformulate the model developed by Rotem et al. using a reduced number of 0–1 variables at the cost of prohibiting the replication of relations. The prohibition means that each relation can only be assigned to one disk. Loomis and Popek [11] developed guidelines for data replication and allocation via optimized query strategies. They determined that multiple copies of data should be placed on different servers located in a network to maximize parallel processing within queries. Therefore, the method developed by Chang [4] may lose the benefit of data replication for maximizing disk resource utilization. Tsai and Li [17] recently presented another model that improves the models generated by Rotem et al. [13] and Chang [4] for solving multidisk vertical allocation (MDVA) problems. Existing methods [4,13,17] minimize query cost but neglect storage cost. This study presents a novel and efficient approach for reformulating the data allocation problem as a solvable mixed 0–1 programming problem such that a globally optimal solution can be obtained. The advantages of the proposed approach over existing techniques are summarized as follows: (i) Compared with the methods of Rotem et al. [13] and Tsai and Li [17], the proposed model uses less storage space to achieve minimal query cost. Moreover, the proposed approach has superior solution efficiency by utilizing fewer additional 0–1 variables and constraints. (ii) Compared with the method developed by Chang [4], the proposed method can maximize parallel processing by retaining the possibility of replication of relations on different disks. The rest of this paper is organized as follows. Section 2 discusses the models of Rotem et al. [13] and Tsai and Li [17]. Subsequently, Section 3 presents the proposed model for the multidisk allocation problems considering storage cost and possible replication of relations. In Section 4, two examples are used to demonstrate the effectiveness of the proposed model. In Section 5, we present the experimental results to illustrate the performance difference among the previous models and the proposed model. Finally, the paper is concluded in Section 6. 2. Review of the models by Rotem et al. and Tsai and Li 2.1. Rotem et al.’s model Rotem et al., who developed a mathematical model for allocating nonfragmented data objects to multidisk storage systems, discussed how to allocate relations to disks in relational databases [13]. Their model minimized expected query cost based on database statistics of access patterns and relation sizes. They defined query cost in this study to be the query communication cost of transmitting data from disks to execute a two-way join. Computational cost of the JOIN operation is not considered in this model. The solution must satisfy the constraint such that each relation appears at least once in a system and each disk capacity limit is not exceeded. The notations used throughout this study are as follows: CIO Ck M N pij Ri Si T xik yij zijk
I/O cost unit (in terms of I/O time per block of data) capacity of disk k number of disks in the multidisk storage systems number of relations in the databases P N1 PN the probability of a query involving JOIN operation on relation Ri and Rj i¼1 j¼iþ1 pij ¼ 1 relation i in the databases size of the relation Ri (number of blocks) value of the objective (cost) function an indicator of relation Ri being allocated to disk k; xik = 1 if Ri is allocated to disk k, otherwise xik = 0 an indicator of Ri and Rj not available on two different disks; yij = 1 if Ri, Rj are not available on any two different disks in the databases, otherwise yij = 0 an indicator of Ri and Rj being placed on the disk k; zijk = 1 if Ri, Rj are stored on disk k, otherwise zijk = 0
SMaxði;jÞ ¼ MaxðSi ; Sj Þ;
SMinði;jÞ ¼ MinðSi ; Sj Þ:
A multidisk allocation problem can be formulated as
Minimize
T¼
N1 X N X i¼1 j¼iþ1
pij C IO ðSMaxði:jÞ þ yij SMinði;jÞ Þ
ð1Þ
501
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508 N X
subject to
Si xik 6 C k ;
k ¼ 1; 2; . . . ; M;
ð2Þ
i¼1 M X
xik P 1;
i ¼ 1; 2; . . . ; N;
ð3Þ
k¼1
zijk P xik þ xjk 1; yij > 1 þ
i ¼ 1; . . . ; N 1;
M X
ðzijk xik xjk Þ;
j ¼ i þ 1; . . . ; N;
i ¼ 1; . . . ; N 1;
k ¼ 1; . . . ; M;
ð4Þ
j ¼ i þ 1; . . . ; N;
ð5Þ
k¼1
xik ; yij ; zijk 2 f0; 1g forall i; j; k: The objective function represents expected query cost. Constraint (2) indicates that the total size of relations stored on a disk cannot exceed disk capacity. Constraint (3) ensures that each relation is assigned to at least one disk. Constraints (4) and (5) guarantee correct yij values. If yij = 1 for some pair i and j, then no pair of disks exists such that Ri is on one disk and Rj is on the other disk; otherwise, yij = 0. Although Rotem et al. [13] proposed a promising solution for multidisk allocation problems, their model contains numerous 0–1 variables in constraints (4) and (5). With n 0–1 variables, there are 2n solutions be considered. The need for solution time in treating an integer programming problem rises exponentially as the number of 0–1 variables increases [3,6,12]. Therefore, solving this model is excessively time-consuming. For the multidisk allocation problem with N relations and M disks, MN + N(N 1)/2 + MN(N 1)/2 0–1 variables are required in their model. 2.2. Tsai and Li’s model Tsai and Li [17] proposed the following model for solving the multidisk allocation problem: Minimize (1) subject to (2), (3), M X
z0ijk 6 yij ;
i ¼ 1; . . . ; N 1;
k¼1
Mðxik 1Þ þ xjk
X
j ¼ i þ 1; . . . ; N;
xjk0 6 z0ijk 6 Mð1 xik Þ þ xjk
k0 –k
Mxik 6 z0ijk 6 Mxik ;
ð6Þ X
xjk0 ;
i ¼ 1; . . . ; N 1;
j ¼ i þ 1; . . . ; N;
k ¼ 1; . . . ; M;
ð7Þ
k0 –k
i ¼ 1; . . . ; N 1;
j ¼ i þ 1; . . . ; N;
k ¼ 1; . . . ; M;
ð8Þ
where z0ijk is a continuous variable (for i = 1, . . . , N 1, j = i + 1, . . . , N, k = 1, . . . , M), and the other variables are the same as defined in Rotem et al.’s model. Constraint (6) has same function as previous constraints (4) and (5). Constraints (7) and (8) are used to linearize the theP nonlinear term, xik xjk k0 –k xjk0 . Since binary variable zijk in the model by Rotem et al. can be replaced by continuous variable z0ijk in the model by Tsai and Li, the number of 0–1 variables is reduced from MN + N(N 1)/2 + MN(N 1)/2 to MN + N(N 1)/2. However, the number of constraints increases from M + N + N(N 1)(M + 1)/2 to M + N + N(N 1) (4M + 1)/2. The following section presents a novel technique for reducing the number of 0–1 variables and auxiliary constraints for the same problem.
3. Proposed method Consider the following proposition: Proposition 1. Constraints (4) and (5) in the model of Rotem et al. can be replaced by the following expression:
xik
X 0
k –k
xik0
X
xjk0 6 y0ij ;
i ¼ 1; . . . ; N 1;
j ¼ i þ 1; . . . ; N;
k ¼ 1; . . . ; M;
ð9Þ
0
k –k
where y0ij is a non-negative continuous variable and the other variables are the same as defined previously. Proof P P Case 1: For any k, if xik = 1 and k0 –k xik0 þ k0 –k xjk0 ¼ 0, then y0ij ¼1. P P Case 2: For any k, if xik = 0 or k0 –k xik0 þ k0 –k xjk0 P 1, then y0ij ¼ 0 or 1. Since the objective should be minimized, y0ij is then forced to 0.
502
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
Therefore, the proposition is proved. h According to Proposition 1, Rotem et al.’s model can be reformulated as follows: Minimize
T¼
N1 X N X
pij C IO ðSMaxði:jÞ þ y0ij SMinði;jÞ Þ
ð10Þ
i¼1 j¼iþ1
subject to (2), (3) and (9), where y0ij is a non-negative continuous variable (for i = 1, . . . , N 1, j = i + 1, . . . , N), and the other variables are the same as defined previously. Since binary variable yij can be replaced by continuous variable y0ij , and the binary variable zijk has been removed from the model by Rotem et al., the number of 0–1 variables is reduced from MN + N(N 1)/2 + MN(N 1)/2 to MN. Moreover, the number of constraints is also reduced. Table 1 lists the comparison of the number of variables and constraints in three models. In the models by Rotem et al. [13] and Tsai and Li [17], the total size of relations stored on a disk cannot exceed disk capacity. However, neither work specifies how to reduce the amount of storage space used while minimizing query cost. Therefore, their methods increase disk cost via unnecessary replication of relations. To consider query cost and storage cost simultaneously, our present work modifies the models developed by Rotem et al. [13] and Tsai and Li [17] as Minimize
T ¼ Bweight
N1 X N X
pij C IO ðSMaxði:jÞ þ y0ij SMinði;jÞ Þ þ
i¼1 j¼iþ1
N X M X i¼1
xij
ð11Þ
j¼1
subject to (2), (3) and (9), , w = min{pijCIO SMin(i,j)j1 6 i 6 N, 1 6 j 6 M}, and all the other variables are the same as defined previwhere Bweight ¼ MNþ1 w P PN ously. In the proposed model, the objective function has the following two parts: Bweight N1 i¼1 j¼iþ1 P P P P P P N M N1 N N M 0 pij C IO ðSMaxði:jÞ þ y0ij SMinði;jÞ Þ and i¼1 j¼1 xij . Notably, i¼1 j¼iþ1 pij C IO ðSMaxði:jÞ þ yij SMinði;jÞ Þ is query cost, i¼1 j¼1 xij is storage cost, and w is the minimal reduction in query cost when one additional table is allocated to the disks. Storage cost PN PM * * * i¼1 j¼1 xij is in the range of [0, M N]. Since Bweight the minimal reduction in query cost = M N + 1 exceeds the maximal * change M N in storage cost, Bweight can put more emphasis on minimizing query cost than on minimizing the number of taP P bles (i.e., Ni¼1 M j¼1 xij ) allocated to disks. That is, the model minimizes storage cost without sacrificing query cost. After modifying the models developed by Rotem et al. [13] and Tsai and Li [17] by considering storage cost in the objective function and reducing the number of 0–1 variables and constraints, the proposed model can be efficiently solved to identify an optimal data layout that uses minimal storage space to achieve minimal expected query cost. 4. Numerical examples Two examples are presented in the following to illustrate the superior performance of the proposed approach compared with other methods. Table 1 Comparison of the number of variables and constraints in three models. Number of variables and constraints
Rotem et al. [13]
Tsai and Li [17]
Proposed model
Binary Continuous Constraints
MN + N(N 1)/2 + MN(N 1)/2 0 M + N + N(N 1)(M + 1)/2
MN + N(N 1)/2 MN(N 1)/2 M + N + N(N 1)(4M + 1)/2
MN N(N 1)/2 M + N + MN(N 1)/2
M: number of disks in the multidisk storage system; N: number of relations in the databases.
R4 0.1 R1
p12 = 0.5, p13 = 0.1, p14 = 0.1, 0.1
p23 = 0.1, p24 = 0.1, p34 = 0.1,
0.1
S1 = 0.5, S 2 = 0.3, S 3 = 0.45, R2
0.5
S 4 = 0.4.
0.1
0.1 R3
Fig. 1. Example of two disks and four relations (Tsai and Li [17]).
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
503
Example 1. Consider a database with four relations depicted in Fig. 1 where each edge represents a two-way join query likelihood. We assume all applications in the example run only two-way join queries and the entire database must be allocated to two disks. The capacities of disks 1 and 2 are 1.5 and 1, respectively. This example can be formulated by the proposed method as the following program.
Minimize Bweight
ð0:5ð0:5 þ 0:3y012 Þ þ 0:1ð0:5 þ 0:45y013 Þ þ 0:1ð0:5 þ 0:4y014 Þ þ 0:1ð0:45 þ 0:3y023 Þ þ 0:1ð0:4 þ 0:3y024 Þ þ 0:1ð0:45 þ 0:4y034 ÞÞ þ x11 þ x21 þ x31 þ x41 þ x12 þ x22 þ x32 þ x42
subject to
0:5x11 þ 0:3x21 þ 0:45x31 þ 0:4x41 6 1:5; 0:5x12 þ 0:3x22 þ 0:45x32 þ 0:4x42 6 1; x11 þ x12 P 1; x21 þ x22 P 1; x31 þ x32 P 1; x11 x12 x22 6 y012 ; x12 x11 x21 6 y012 ; x11 x12 x32 6 y013 ;
x12 x11 x31 6 y013 ;
y014 ; y023 ; y024 ; y034 ;
x12 x11 x41 6 y014 ;
x11 x12 x42 6 x21 x22 x32 6 x21 x22 x42 6
x41 þ x42 P 1;
x22 x21 x31 6 y023 ; x22 x21 x41 6 y024 ;
x32 x31 x41 6 y034 ; 8þ1 Bconst ¼ ¼ 300; minf0:15; 0:045; 0:04; 0:03g y012 P 0; y013 P 0; y014 P 0; y023 P 0; y024 P 0; x31 x32 x42 6
xik 2 f0; 1g for i ¼ 1; 2; 3; 4;
y034 P 0;
k ¼ 1; 2:
The program uses fewer 0–1 variables and constraints than the models by Rotem et al. [13] and Tsai and Li [17]. In solving the program using LINGO [10], the optimal allocation can be obtained as depicted in Fig. 2. With the capability to replicate relations, relation 1 must be allocated to disks 1 and 2 to reach the globally optimal solution. The query cost obtained by the proposed model is 0.51, which is better than the query cost of 0.55, acquired by Chang [4]. Commercial database systems allow a database administrator to flexibly allocate each object to multiple disk drives. Therefore, a database administrator can determine the placement of each relation on specific disks as suggested by the solution of the proposed model to enhance I/O performance. Example 2. Fig. 3 illustrates an example of a relational database in Rotem et al. [13]. Suppose that a database with 19 relations must be allocated to three disks, each with a capacity of 256 MB, and all applications run only two-way join queries. For comparison with model results obtained by Rotem et al. [13] and Tsai and Li [17], the problem is decomposed into three sub-problems. Table 2 compares the problem sizes of the example in these three models. The proposed model obviously requires fewer 0–1 variables and constraints than the models of Rotem et al. and Tsai and Li. Although solving this problem using the models of Rotem et al. and Tsai and Li can obtain an optimal allocation with minimal query cost, the results from these two models usually involve unnecessary replication of tables; thus, the database storage space consumed is increased. In solving sub-problems using LINGO [10], the proposed model can identify the optimal data allocation as indicated in Fig. 4. A comparison of the allocation obtained by the proposed model and that obtained by Rotem et al. [13] (Fig. 5) indicates that the optimal allocation via the proposed model utilizes 354 MB of disk space to reach the same minimal query cost, which is less than the 391 MB of disk space used by the model by Rotem et al. 5. Computational experiments This section presents experimental results when evaluating the performance of the proposed model. Twelve groups of test examples are characterized as follows: (number of disks (M), number of tables (N)) = ((2, 10), (2, 12), (2, 14), (2, 16), (2, 18), (2, 20), (4, 10), (4, 12), (4, 14), (4, 16), (4, 18), (4, 20)). Ten problems are randomly generated for each group. In each test examples, disk capacity (Ck) is 256 MB, the sizes of the relations (Si) are drawn from a uniform distribution of values in the
R1
R2
Disk #1
R4
R1
R3
Disk #2
Fig. 2. Optimal allocation of Example 1.
504
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
R8
R10
.10
.05
.03 S6 = 10Mb S7 = 2Mb S8 = 5Mb S9 = 3Mb S10= 1Mb S11=35Mb S12= 5Mb
.05
R6
.05 R7
R4
.02
.08
.025
R9
.01
.05
R1
.03
R5
.025
R11
.01
.15
R2
.05
.05 R12
.05 S1 = 20Mb S2 = 12Mb S3 = 16Mb S4 = 8Mb S5=160Mb
R3
Sub-Graph 2:Administration
Sub-Graph 1:Accounts.
R19
.02 S13= 5Mb S14=20Mb S15= 5Mb S16= 4Mb S17=25Mb S18=16Mb S19= 2Mb
.02 R13
R15
256Mb
.05
R17
Sub-Graph3:Assets and Investments
256Mb
D#1
R14
.03
.05
.02
.01
R18
.0 2
R16
256Mb
D#2 D#3 Fig. 3. A weighted graph of a relational database (Rotem et al. [13]).
Table 2 Problem sizes of Example 2. Number of variables and constraints
Binary Continuous Constraints
Sub-graph 1
Sub-graph 2 (or 3)
Rotem et al. [13]
Tsai and Li [17]
Proposed model
Rotem et al. [13]
Tsai and Li [17]
Proposed model
55 0 48
25 30 138
15 10 38
105 0 94
42 63 283
21 21 73
Sub-Graph 1
R1 R5
R3 R4
R2
Sub-Graph 2
R8 R10 R12
R6 R11
R7 R9
Sub-Graph 3
R18 R19
R13 R15 R17
R14 R16
Total Slack: 414Mb Total Usage: 354Mb
Disk #1 Slack:47Mb
Disk #2 Slack:152Mb
Disk #3 Slack:215Mb
Fig. 4. Optimal allocation of Example 2 by the proposed model.
range of [5 MB, 25 MB], the I/O cost unit (CIO) are all assumed to be 1 s, and the probability of a query a JOIN oper involving 1 and normalized to ation on each pair of relations (pij) is drawn from a uniform distribution of values in the range of 0; NN PN1 PN i¼1 j¼iþ1 pij ¼ 1. Each test example is formulated using the models of Rotem et al. [13] and Tsai and Li [17] and the proposed model, which are then solved by LINGO [10] on a Notebook with an Intel 1.66 GHz CPU. Table 3 compares the
505
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
Sub-Graph 1
R2 R3 R4
R1 R2
R5
Sub-Graph 2
R6 R11 R12
R8 R10
R7 R9 R12
Sub-Graph 3
R14 R16 R18
R13 R17 R19
R14 R15
Disk #1 Slack:130Mb
Disk #2 Slack:186Mb
Total Slack: 377Mb Total Usage: 391Mb
Disk #3 Slack:61Mb
Fig. 5. Optimal allocation of Example 2 by Rotem et al.’s model (Rotem et al. [13]).
Table 3 Comparisons of average performance among three models. Problem size (M,N)
Query cost
Used storage space (MB)
CPU time (s)
Rotem et al. [13]
Tsai and Li [17]
Proposed model
Rotem et al. [13]
Tsai and Li [17]
Proposed model
Rotem et al. [13]
Tsai and Li [17]
Proposed model
(2,10) (2,12) (2,14) (2,16) (2,18) (2,20)
1.98 2.00 1.84 1.48 1.23 0.82
1.98 2.00 1.84 1.48 1.23 0.82
1.98 2.00 1.84 1.48 1.23 0.82
252.90 319.10 356.00 367.80 387.30 425.70
255.50 315.10 373.20 359.70 417.60 448.80
230.80 277.50 311.30 306.10 338.80 362.00
0.08 0.15 0.26 0.37 0.60 0.91
0.35 0.47 0.60 1.76 1.95 3.25
0.01 0.02 0.03 0.02 0.04 0.05
(4,10) (4,12) (4,14) (4,16) (4,18) (4,20)
1.79 1.90 1.71 1.55 1.18 0.67
1.79 1.90 1.71 1.55 1.18 0.67
1.79 1.90 1.71 1.55 1.18 0.67
263.50 310.70 322.30 378.80 410.40 407.50
268.00 324.10 394.10 451.80 447.70 522.60
160.10 186.30 216.90 250.10 256.00 281.70
0.20 0.28 0.40 0.52 0.60 1.34
0.48 1.82 2.64 4.12 5.88 7.79
0.02 0.03 0.04 0.07 0.20 0.14
average performance of the models by Rotem et al. [13] and Tsai and Li [17] and the proposed model measured by query cost, storage space used and CPU solution time. In terms of solution quality, the three models have the same query cost. However, the proposed model uses less storage space than the other two models. The storage space used is the disk capacity used by PN PM the data allocation strategy; that is, i¼1 j¼1 xij Si . Compared with the proposed model, the model by Rotem et al. uses roughly 136% of average storage space, and that by Tsai and Li uses roughly 148% of average storage space. The proposed method uses less CPU time to solve the problem than the other two methods. Compared with the proposed model, the model by Rotem et al. takes roughly 10 times the average CPU time and that by Tsai and Li takes roughly 40 times the average CPU time. Furthermore, multidisk allocation problems cannot be solved by the models by Tsai and Li or Rotem et al. within a reasonable amount of time when problem sizes exceed (M, N) = (2, 22). The model by Rotem et al. cannot solve problems larger than (M, N) = (4, 36) and the model by Tsai and Li cannot solve problems larger than (M, N) = (6, 38) because the number of integer variables or number of constraints exceeds the capacity of the educational software version. Conversely, the proposed model can solve problems sized (M, N) = (4, 36) or (6, 38) to generate a data allocation plan with minimal query cost within 1 min. Experimental results in Table 3 demonstrate that the proposed model uses less CPU time when solving multidisk allocation problems and uses less storage space to achieve a minimal query cost than the other two models. Additionally, given the same number of tables, the storage space used in the four disks is less than that in the two disks. The proposed model prioritizes minimizing query cost over minimizing storage cost. If only two disks are available to allocate the tables to the disks, then the table that has co-access requirements with many other tables must be allocated to both disks to reduce query costs. However, using four disks increases flexibility in allocating tables with co-access requirements. Fig. 6a and b compare the performance of the models by Rotem et al. [13] and Tsai and Li [17] and the proposed model under 10 disks and various tables. Again, the three models have identical query costs in all cases. The proposed model requires less CPU time when solving multidisk allocation problems than the other two models, and consumes the least storage space among these three methods. As the number of tables increases, the required CPU time is more with the models by Tsai and Li [17] and Rotem et al. [13] than with the proposed model. To demonstrate the scalability of the proposed model for practical datasets, multidisk allocation problems with up to 4 disks and 1000 tables are solved using LINGO 11 [10], windows 64 version, on a server with an Intel Xeon 3.2 GHz CPU and 2G RAM. In all randomly generated cases, the capacity of all disks (Ck) is 40 GB, the sizes of relations (Si) are drawn from a uniform distribution of values in the range of [5 MB, 25 MB], the I/O cost units (CIO) are all assumed to be 1 s, 20% of pairs of relations are involved in the JOIN operation, and probabilities (pij) are drawn from a uniform distribution of values in the
506
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
Rotem et al. [10]
Rotem et al. [10]
Tsai and Li [14]
Proposed model
800
16
700
Used storage space (MB)
14
CPU time (seconds)
Tsai and Li [14]
Proposed model
12 10 8 6 4
600 500 400 300 200 100
2
0
0 (10,2)
(10,4)
(10,8)
(10,16)
(10,2)
(number of disks, number of tables)=(M,N)
(10,4)
(10,8)
(10,16)
(number of disks, number of tables)=(M,N)
(a) CPU time
(b) Used storage space
Fig. 6. Comparisons of average performance among three models under 10 disks and various tables.
P PN 1 range of 0; NN and normalized to N1 i¼1 j¼iþ1 pij ¼ 1. For the case of allocating 1000 tables to 3 disks, the data layout with minimal query cost can be solved within 3 min with the proposed model. For the case of allocating 1000 tables to 4 disks, the data layout with minimal query cost can be solved within 4 min with the proposed model. In our computational experiments, the disk capacity is assumed as 256 MB or 40 GB and the sizes of tables are drawn from a uniform distribution of values in the range of [5 MB, 25 MB]. As the disk capacity continues to increase, for instance, 512 GB drives are normal in these days, the following two conditions should be considered in applying the proposed method for solving multidisk allocation problems: (i) If the number of tables in a database system is determined by information system requirements analysis and design, then the increase in disk capacity will not change the number of tables. Since the performance of the proposed approach depends on the number of disks and tables, this situation will not affect the performance of the proposed approach. (ii) If system requirements change and the number of tables increases due to increasing available disk space, then the performance of the proposed model will be affected by the number of tables. Fig. 7 indicates that the required CPU time for solving the proposed model tends to grow exponentially as the number of tables increase. The proposed approach formulates a mathematical model to obtain an optimal data allocation for multidisk databases. This deterministic method is guaranteed to find a global optimum but the computational complexity grows rapidly by increasing the number of variables and constraints, especially for integer programming problems. More investigation and research will be required to develop an efficient approach for solving large-scale multidisk allocation problems, for instance, integrating heuristic algorithms or employing distributed computing techniques to enhance solution efficiency. 250
CPU time (seconds)
3 disks 200
4 disks
150 100 50 0 100
200
300
400 500 600 700 number of tables
800
900 1000
Fig. 7. The required CPU time for solving the proposed model.
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
507
This work presents a novel model that efficiently derives a data layout using minimal storage space to achieve minimal query cost based on database statistics of access patterns and relation sizes. From the viewpoint of an organization, the performance benefits of the proposed method are as follows: (i) Decreasing query time: The proposed model guarantees minimal average query time. Over the course of a week or month of database access, the difference in total query time of all queries between an optimal layout and a sub optimal layout depends on the frequency of database queries. The demand for database services may vary significantly as databases have become integrated into information delivery systems both inside and outside organizations [5]. The proposed model can decrease total query time when the arrival rate of a database query is high. Additionally, when queries are in a queue, total system time (the sum of wait time and query time) should be considered in performance evaluation. (ii) Reducing disk space: The proposed method utilizes minimal storage space in achieving minimal query cost. As the size of databases increases, storage requirements can increase dramatically. Thus, decreasing storage cost is beneficial to organizations. (iii) Shortening solution time: The proposed method spends less computational time obtaining an optimal layout than the models of Rotem et al. [13] and Tsai and Li [17]. The need to alter a database layout arises from changes in query patterns or the addition/deletion of tables. If databases in an organization change frequently, then the number of operations needed to redesign a data layout will increase. Adopting an efficient model can save considerable time when the frequency of redesigning a database layout is high.
6. Conclusions and future work Physical data allocation significantly affects I/O performance of database systems. Since I/O performance is a bottleneck that reduces overall processing capability of a system and limits its scalability characteristics; thus, the need for an efficient data layout approach is paramount. Although the models developed by Rotem et al. [13] and Tsai and Li [17] obtain optimal allocation of relations for a multidisk database system, their models contain numerous 0–1 variables and constraints, and do not minimize storage cost. The model created by Chang [4] only treats the multidisk allocation problem without replicating relations. That model, therefore, cannot reduce retrieval time by keeping copies of some relations on more than one disks. The proposed method not only uses fewer 0–1 variables and constraints than the models by Rotem et al. [13] and Tsai and Li [17], but is also a more general model than that developed by Chang [4]. Furthermore, the proposed model guarantees that query cost is minimal. Under the same minimal query cost, the proposed model utilizes less storage space than existing methods. The results of the two illustrative examples and computational experiments demonstrate the advantages and efficiency of the proposed method. Additionally, other components can be utilized in the proposed model to optimize data allocation. First, data partitioning into multiple disks can be incorporated into the model. We assume that a relation is stored entirely on a disk. However, in the real world, a relation may be striped across multiple disks for capacity and performance. The data allocation problem becomes very complex when data partition is considered. For instance, if M disks exist in a database, then each table has PM M combinations to allocate among various disks. Moreover, the speeds of different striping disk groups may not i¼1 i be identical and should be considered when solving this problem. Second, real world applications require that multiple relations be joined. A general case of multiple joins problem is challenging as it is difficult to formulate a mathematical model and the resulting model may contain a large number of binary variables and constraints. These issues are important in supporting realistic database configurations and worthy of future investigation. Acknowledgement The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this research under Contract No. NSC 96-2416-H-158-003-MY3. References [1] S. Agrawal, S. Chaudhuri, A. Das, V. Narasayya, Automating layout of relational databases, in: Proceedings of ICDE’03, IEEE Computer Society, Bangalore, pp. 607–618. [2] J.A. Bakker, Semantic partitioning as a basis for parallel I/O in database management systems, Parallel Computing 26 (11) (2000) 1491–1513. [3] R. Bosch, M. Trick, Integer programming, in: E.K. Burke, G. Kendall (Eds.), Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques, Springer, Boston, 2005, pp. 69–95. [4] C.T. Chang, Optimization approach for data allocation in multidisk database, European Journal of Operational Research 143 (2002) 210–217. [5] A.N.K. Chen, P.B. Goes, A. Gupta, J.R. Marsden, Database design in the modern organization – identifying robust structures under changing query patterns and arrival rate conditions, Decision Support Systems 37 (2004) 435–447. [6] F.S. Hillier, G.J. Lieberman, Introduction to Operations Research, eighth ed., McGraw-Hill, New York, 2005. [7] M.Y. Kim, Synchronized disk interleaving, IEEE Transactions on Computers C-35 (11) (1986) 978–988.
508
M.-H. Lin / Data & Knowledge Engineering 68 (2009) 499–508
[8] J.F. Kurose, R. Simha, A microeconomic approach to optimal resource allocation in distributed computer systems, IEEE Transactions on Computers 38 (5) (1989) 705–717. [9] X. Lin, M. Orlowska, An integer linear programming approach to data allocation with minimum total communication, Information Sciences 85 (1995) 1–10. [10] LINGO Release 9.0, LINDO System Inc., Chicago, 2004. [11] M. Loomis, G. Popek, A model for database distribution, in: Proceedings of Trends and Application: Computer Networks, IEEE Computer Society, 1976, pp. 162–169. [12] J. Rantala, Linear programming and mixed integer programming in management of seedling transportation, International Journal of Forest Engineering 15 (1) (2004) 41–51. [13] D. Rotem, G.A. Schloss, A. Segev, Data allocation for Multidisk Database, IEEE Transactions on Knowledge and Data Engineering 5 (5) (1993) 882–887. [14] R. Sarathy, B. Shetty, A. Sen, A constrained nonlinear 0–1 program for data allocation, European Journal of Operational Research 102 (1997) 626–647. [15] E.J. Schwabe, I.M. Sutherland, B.K. Holmer, Evaluating approximately balanced parity-declustered data layouts for disk arrays, Parallel Computing 23 (4) (1997) 501–523. [16] P. Triantafillou, C. Faloutsos, Overlay striping and optimal parallel I/O for modern applications, Parallel Computing 24 (1) (1998) 21–43. [17] J.F. Tsai, H.L. Li, On optimization approach for multidisk vertical allocation problems, European Journal of Operational Research 165 (2005) 835–842. [18] K.A. Zaman, S. Padmanabhan, An efficient workload-based data layout scheme for multidimensional data, Data & Knowledge Engineering 39 (3) (2001) 271–291.
Ming-Hua Lin received her B.S. degree from National Taiwan University, Taiwan, in 1995, and M.S. and Ph.D. from National Chiao-Tung University, Taiwan, in 1997 and 2005, all in information Management. At present, she is an assistant professor in Department of Information Technology and Management, Shih Chien University, Taiwan. Her research areas include optimization methods and applications and wireless networks management.