Computers ind. Engng Vol. 16, No. 2, pp. 225-234, 1989 Printed in Great Britain. All rights reserved
0360-8352/89 $3.00+0.00 Copyright © 1989 Pergamon Press plc
MULTIPLE COPY FILE ALLOCATION AND PLACEMENT IN A DISTRIBUTED INFORMATION NETWORK
ANAND S. KUNNATHUR¹ and RAFAEL SOLIS²
¹Department of Information Systems & Operations Management, The University of Toledo, Toledo, OH 43606 and ²Department of Information Systems and Decision Sciences, California State University, Fresno, CA 93740, U.S.A.
(Received for publication 10 May 1988)
Abstract. The problem of minimizing query, update and storage costs in a distributed information network in which one or more sites may receive a copy of one or more of m different files is studied in this work. The problem is known to be NP-complete. Cost efficiency considerations further dictate that for each query emanating site there be exactly one designated responder site. A model to solve the problem is formulated. Two heuristics based on the problem structure are proposed and their performance relative to each other and to an optimal procedure is reported for problems with up to four files and up to eight sites.
INTRODUCTION
We consider the problem of optimally placing copies of m distinct files among n sites (nodes) of a computer information network. The problem, as originally posed by Casey [1], assumes that the file utilization (query and update traffic), storage costs and communication costs among all nodes are known quantities. The objective is then to find the assignment that minimizes the overall communications and storage costs. The problem is known to be NP-complete [2], so heuristic procedures are the only practical alternatives for large networks [3-7].
Variations of the above original formulation incorporate more realistic assumptions. For example, Levin and Morgan [6] take into consideration the allocation of the programs which process the files. Mahmoud and Riordon [8] assume that the capacities of the network's communication lines differ, thus restricting the traffic within certain portions of the network. Hatzopoulos and Kollias [3] make the assumption that the file usage changes over time, thus resulting in a dynamic allocation of the files.
Traditionally the file allocation methods have been applied to computer networks where no copies of the files are permitted. This file organization, called partitioned [9], results in an easier problem to solve than the replicated case (also described in [9]), where the network's organization allows for multiple copies of the files. Kollias and Hatzopoulos [5] proposed solving the multiple copy problem simultaneously for all files (as opposed to all previous attempts, where allocation is performed on a per file basis). Storage constraints in the form of a maximum number of files per site are also introduced in Ref. [5], and a branch and bound solution procedure is specified there.
This paper considers the replicated case with the addition of the obvious allocation constraint that for any site i posing queries against file k there be exactly one site j that will handle all queries which have emanated from site i. Further, we express the storage constraint at each site in the form of the number of units of storage available, as opposed to the number of files restriction in Ref. [5]. The impracticality of requiring all nodes to be familiar with the location of all copies of all the files that are to be updated is pointed out in Dutta [10]. In that paper a model is also formulated to designate sites that coordinate the updates for all copies of a file. The solution proposed in this work accommodates this practical requirement with the simple ploy of redefining the update costs. We suggest the use of an update coordinating site; only this site need have the facilities to communicate with other sites having copies of the file to be updated.
MODEL DESCRIPTION
Following Casey [1], we define the following:
$m$: number of distinct files to be allocated.
$n$: number of sites (nodes) in the network.
$s_i$: storage capacity of site i.
$V_k$: size of file k.
$\lambda_{ik}$: query traffic for file k originating at node i.
$\psi_{ik}$: update traffic for file k originating at node i.
$q_{ijk}$: communications cost of a query originating at node i and satisfied by file k at node j ($q_{ijk} = 0$ if $i = j$).
$u_{ijk}$: communications cost of an update originating at node i for file k at node j ($u_{ijk} = 0$ if $i = j$).
$\sigma_{ik}$: storage cost for file k at node i.
$\delta_{ik}$: binary variable which equals 1 if file k is placed at site i, and 0 otherwise.
$x_{ijk}$: binary variable which equals 1 if a query originating at node i for file k is satisfied at node j.
$L_n$: index set denoting the nodes in the network, i.e. (1, 2, ..., n).
$L_m$: index set denoting the files to be allocated, i.e. (1, 2, ..., m).
The problem of allocating and placing copies of the m distinct files among the n sites (nodes) of the network can be formulated as:
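$$\min \sum_{k=1}^{m}\sum_{i=1}^{n} \sigma_{ik}\delta_{ik} + \sum_{k=1}^{m}\sum_{i=1}^{n}\sum_{j=1}^{n} \psi_{ik} u_{ijk} \delta_{jk} + \sum_{k=1}^{m}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_{ik} q_{ijk} x_{ijk} \qquad (1)$$
such that:
$$\sum_{j=1}^{n} x_{ijk} = 1, \quad i \in L_n,\ k \in L_m \qquad (2)$$
$$\sum_{i=1}^{n} \delta_{ik} \geq 1, \quad k \in L_m \qquad (3)$$
$$\delta_{jk} - x_{ijk} \geq 0, \quad \text{all } i, j, k \qquad (4)$$
$$\sum_{k=1}^{m} \delta_{ik} V_k \leq s_i, \quad \text{all } i \in L_n. \qquad (5)$$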
The model (1)-(5) will be referred to as (M). The three terms in equation (1) correspond to the storage and communications (update and query) costs, respectively. Constraint (3) guarantees that all files will be allocated. Constraints (2) and (4) guarantee that exactly one query cost per inquiring node will be taken into account for those files with multiple copies. We call this the multiple copy file allocation and placement problem (MCFAP).
SOLUTION PROCEDURE
Clearly model (M) can be solved using standard integer programming techniques such as branch and bound [11]. It is worth noting that, except for the last constraint (5), the constraints (2)-(4) are distinct for distinct files. That is, the model without constraint (5) is separable into m subproblems whose solution, if feasible to constraint (5), results in an optimal solution to the overall problem. Two heuristics that take advantage of the separability of the MCFAP problem are developed in this work. The actual algorithms, an illustrative example, and computational experience in using the two algorithms are described in the next four sections. A brief overview of the procedure precedes the specification of Heuristic 1. The MCFAP problem for m files had been identified earlier as being separable into m subproblems with the exception of the storage constraint (5). Ignoring the storage constraint, each of the m subproblems can be
cast in the form of a fixed charge transportation problem (FCTP) (see Ref. [12] for its formulation). The placement of files and the allocation of query sites determined by solving the FCTP's (corresponding to each file) would be optimal with respect to query, update and storage costs if constraint (5) is satisfied at each of the n sites. If the storage constraint is violated at any site, then files are dropped from that site in increasing order of incremental total cost until constraint (5) becomes feasible at that site. The FCTP's are reworked after excluding the dropped file copies for those sites where constraint (5) was violated. The algorithm terminates on that pass in which the solutions to the m separable FCTP's are feasible to constraint (5). Some simplifying notation precedes the description of Heuristic 1. Denote by $f_{jk}$ the fixed cost of updating a copy of file k located at node j plus the cost $\sigma_{jk}$ of storing the copy at node j, i.e.
$$f_{jk} = \sigma_{jk} + \sum_{i \in L_n} \psi_{ik} u_{ijk}. \qquad (6)$$
Similarly, denote by $v_{ijk}$ the cost of a query on file k emanating from node i being answered at node j, i.e.
$$v_{ijk} = \lambda_{ik} q_{ijk}. \qquad (7)$$
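As a small illustration of expressions (6) and (7), the sketch below computes the coefficients for the two-file, five-node data given in Tables 1 to 3 of the EXAMPLE section. The function names `fixed_costs` and `query_costs`, the 0-based site indexing and the list layout are assumptions made for this illustration only; they are not part of the paper.

```python
# Communication costs of Table 1 (symmetric); Q[i][j] = q_ijk = u_ijk, sites 0-based.
Q = [
    [0, 6, 12, 9, 6],
    [6, 0, 6, 12, 9],
    [12, 6, 0, 6, 12],
    [9, 12, 6, 0, 6],
    [6, 9, 12, 6, 0],
]
LAM = [[24, 24, 24, 24, 24], [32, 10, 10, 30, 5]]  # Table 2, LAM[k][i]
PSI = [[2, 3, 4, 6, 8], [2, 2, 3, 10, 3]]          # Table 3, PSI[k][i]
SIGMA = [[0] * 5, [0] * 5]                         # storage costs are neglected


def fixed_costs(q, psi_k, sigma_k):
    """f_jk of expression (6) for one file k, as a list over sites j."""
    n = len(q)
    return [sigma_k[j] + sum(psi_k[i] * q[i][j] for i in range(n))
            for j in range(n)]


def query_costs(q, lam_k):
    """v_ijk of expression (7) for one file k, as an n x n list indexed [i][j]."""
    n = len(q)
    return [[lam_k[i] * q[i][j] for j in range(n)] for i in range(n)]


f1 = fixed_costs(Q, PSI[0], SIGMA[0])  # [168, 180, 174, 126, 123]
f2 = fixed_costs(Q, PSI[1], SIGMA[1])  # [156, 177, 132, 78, 126]
v2 = query_costs(Q, LAM[1])
```

The two lists of fixed costs agree with the values of $f_{jk}$ quoted in the worked example of Heuristic 1.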
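The m separable problems can then be formulated as follows; for each $k \in L_m$:
$$\min \sum_{j \in L_n} f_{jk} x_{jjk} + \sum_{i \in L_n}\sum_{j \in L_n} v_{ijk} x_{ijk} \qquad (T)$$
$$\sum_{i \in L_n} x_{ijk} \leq n, \quad j \in L_n$$
$$\sum_{j \in L_n} x_{ijk} = 1, \quad i \in L_n$$
$$x_{jjk} - x_{ijk} \geq 0 \quad \text{for all } i, j \in L_n \qquad (7a)$$
$$x_{ijk} = \begin{cases} 1 & \text{if a query on file } k \text{ originating at site } i \text{ is answered at site } j, \\ 0 & \text{otherwise.} \end{cases}$$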
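For very small instances, model (T) can be checked by exhaustive enumeration: once the set of sites holding file k is fixed, each query site is simply served by its cheapest holder. The sketch below does this for the data of Tables 1 to 3 of the EXAMPLE section. It is not the MODI-based procedure of Heuristic 1 nor a branch and bound code; the function name `solve_T_exhaustive`, the 0-based site indices and the assumption that $q_{jjk} = 0$ (so a holder always answers its own queries) are choices made for this illustration.

```python
from itertools import combinations

Q = [
    [0, 6, 12, 9, 6],
    [6, 0, 6, 12, 9],
    [12, 6, 0, 6, 12],
    [9, 12, 6, 0, 6],
    [6, 9, 12, 6, 0],
]
LAM = [[24, 24, 24, 24, 24], [32, 10, 10, 30, 5]]  # query traffic, LAM[k][i]
PSI = [[2, 3, 4, 6, 8], [2, 2, 3, 10, 3]]          # update traffic, PSI[k][i]


def solve_T_exhaustive(f, v):
    """Return (cost, sites) minimizing model (T) over all non-empty site subsets.

    f[j] is the fixed cost of holding a copy at site j, v[i][j] the cost of
    answering the queries of site i at site j. Exponential in n, so only
    suitable for tiny examples.
    """
    n = len(f)
    best_cost, best_sites = float("inf"), ()
    for r in range(1, n + 1):
        for sites in combinations(range(n), r):
            cost = sum(f[j] for j in sites)
            cost += sum(min(v[i][j] for j in sites) for i in range(n))
            if cost < best_cost:
                best_cost, best_sites = cost, sites
    return best_cost, best_sites


results = []
for k in range(2):
    f = [sum(PSI[k][i] * Q[i][j] for i in range(5)) for j in range(5)]  # (6), no storage cost
    v = [[LAM[k][i] * Q[i][j] for j in range(5)] for i in range(5)]     # (7)
    results.append(solve_T_exhaustive(f, v))

total_without_storage_limit = sum(cost for cost, _ in results)
```

Enumeration is used here only because n = 5; for networks of realistic size the number of subsets grows as $2^n$, which is the motivation for the heuristic treatment in the text.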
The FCTP, of which model (T) is a more restricted case, has been optimally solved using branch and bound approaches (see Ref. [12] for example). While problems of size 30 × 20 have been optimally solved using such an approach, the need here is for a device which can be used iteratively at low computational expenditure and yet has acceptable deviation from optimality. We are not aware of any available algorithm, optimal or otherwise, for solving FCTP's accommodating constraint (7a) other than branch and bound. Note that problem (T) without constraint (7a) is simply a transportation problem with the property that $x_{ijk} = 0$ or 1 at optimality for all i, j for each k. We "solve" the m FCTP's heuristically by dynamically adjusting the costs and by applying the MODI Method (see Ref. [13]) to each of the m subproblems (T). If good lower bounds are desired to solve the MCFAP, general integer programming methods (see Ref. [14] for example) may be used to solve each of the m FCTP's (T). Since the branch and bound scheme for solving model (T) is computationally much more expensive, albeit optimal, we propose heuristically "solving" the FCTP's, which leads to a heuristic for "solving" the MCFAP. Our procedure specified in steps 1 and 2 of Heuristic 1 clearly yields a local minimum to model (T). We shall denote by $c_{ijk}$ the cost to be assigned to cell (i, j) of the transportation table corresponding to file k.
HEURISTIC 1
Phase I
For each file $k \in L_m$ execute steps 1 and 2.
1. Solve the transportation problem (TP) with supply nodes $j \in L_n$, each with a supply amount of n units, and with demand nodes $i \in L_n$, corresponding to the n sites, with (query) demand of 1 unit each. Set $c_{ijk} = v_{ijk}$. Denote by $T_k$ the optimal transportation table with entries $x_{ijk} = 1$, $i \in L_n$, obtained on termination of step 1.
2. Solve once again the above TP represented by $T_k$ for file k after modifying the transportation costs $c_{ijk}$ (excluding the costs in the dummy row) as follows. Let
$$B_j = [(i, j) \mid (i, j) \text{ is a basic cell},\ i, j \in L_n].$$
(a) If $|B_j| = 1$ ($|\cdot|$ denotes the cardinality of the set), then: $c_{jjk} = f_{jk}$ and $c_{ijk} = v_{ijk}$, $i \neq j$.
(b) For each j such that $B_j = \emptyset$ let $c_{jjk} = f_{jk}$, $c_{ijk} = \infty$, $i \neq j$, $i \in L_n$.
(c) If for any j, $|B_j| > 1$, then set $c_{ijk} = v_{ijk}$, $i, j \in L_n$.
Record the above solution in
$$R_k = [(i, j) \mid x_{ijk} = 1,\ i, j \in L_n] \qquad (8)$$
and the solution's cost (including all fixed costs) in $C_k$. Retain the solution tableau $T_k$. Denote by $I^k$ the set of designated query sites for file k, i.e. $I^k = [j \mid x_{jjk} = 1,\ j \in L_n]$.
Phase II (Feasibility)
3. If constraint (5) is infeasible, then identify $L_u \subset L_n$ such that
$$\sum_{k \in L_m} \delta_{jk} V_k > s_j \quad \text{for all } j \in L_u \qquad (9)$$
where
$$\delta_{jk} = \begin{cases} 1 & \text{if } j \in I^k,\ k \in L_m \\ 0 & \text{otherwise.} \end{cases}$$
If $L_u = \emptyset$ STOP.
4. Calculate the incremental cost $P_{rk}$ incurred in dropping file k from site r, if $|I^k| > 1$, applying rules (a), (b) and (c) in step 2 and revising $T_k$ to obtain $T'_k$.
5. Find j and k such that
$$P_{jk} = \min_{r \in L_u,\ t \in L_m} P_{rt},$$
breaking ties arbitrarily. Set $I^k = I^k - j$, $T'_k \to T_k$, $\delta_{jk} = 0$. GO TO Step 3.
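The feasibility phase (steps 3 to 5) can be sketched as the loop below. One important assumption: instead of the MODI-based cost revision of step 2, each per-file subproblem is re-solved by exhaustive enumeration over the sites still permitted for that file, so the incremental costs and the final allocation need not coincide with those obtained in the worked example. The names `best_placement` and `heuristic1_sketch`, the `banned` bookkeeping and the 0-based indices are illustrative only; data are those of the EXAMPLE section.

```python
from itertools import combinations

Q = [
    [0, 6, 12, 9, 6],
    [6, 0, 6, 12, 9],
    [12, 6, 0, 6, 12],
    [9, 12, 6, 0, 6],
    [6, 9, 12, 6, 0],
]
LAM = [[24, 24, 24, 24, 24], [32, 10, 10, 30, 5]]  # query traffic, LAM[k][i]
PSI = [[2, 3, 4, 6, 8], [2, 2, 3, 10, 3]]          # update traffic, PSI[k][i]
SIZE = [60, 40]                                    # V_k
CAP = [100, 100, 100, 90, 100]                     # s_i


def best_placement(k, banned):
    """Cheapest (cost, sites) for file k using only sites not in `banned`."""
    n = len(Q)
    f = [sum(PSI[k][i] * Q[i][j] for i in range(n)) for j in range(n)]
    allowed = [j for j in range(n) if j not in banned]
    best = (float("inf"), ())
    for r in range(1, len(allowed) + 1):
        for sites in combinations(allowed, r):
            cost = sum(f[j] for j in sites)
            cost += sum(LAM[k][i] * min(Q[i][j] for j in sites) for i in range(n))
            best = min(best, (cost, sites))
    return best


def heuristic1_sketch():
    m, n = len(SIZE), len(CAP)
    banned = [set() for _ in range(m)]
    sol = [best_placement(k, banned[k]) for k in range(m)]    # Phase I stand-in
    while True:
        # Step 3: sites whose storage constraint (5) is violated.
        overfull = [i for i in range(n)
                    if sum(SIZE[k] for k in range(m) if i in sol[k][1]) > CAP[i]]
        if not overfull:
            return sol
        # Step 4: incremental cost of dropping file k from site r (only if |I^k| > 1).
        candidates = []
        for r in overfull:
            for k in range(m):
                if r in sol[k][1] and len(sol[k][1]) > 1:
                    alt = best_placement(k, banned[k] | {r})
                    candidates.append((alt[0] - sol[k][0], r, k, alt))
        if not candidates:
            raise ValueError("no copy can be dropped from an overfull site")
        # Step 5: apply the cheapest drop and return to step 3.
        _, r, k, alt = min(candidates)
        banned[k].add(r)
        sol[k] = alt


solution = heuristic1_sketch()
total_cost = sum(cost for cost, _ in solution)
```

Keeping a per-file set of banned sites is one simple way to make sure a dropped copy is not reintroduced on a later pass; the paper achieves the same effect through the revised cost entries of the tableau.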
The revision of costs in step 2 (a)-(c) guarantees that the fixed costs will be figured in only once per file site. Further, step 2 ensures that queries cannot be posed against a file at a site if that site is not holding a copy of the file.
EXAMPLE
The data from the example provided by Kollias and Hatzopoulos [3] is used to illustrate Heuristics 1 and 2. We consider the problem of allocating two files in a network with five nodes. Table 1 shows the communications costs (symmetric). These costs were originally taken from Casey [1]. Communication costs for a query or an update are assumed equal. Tables 2 and 3 show the query and update traffic ($\lambda_{ik}$ and $\psi_{ik}$) respectively. These were taken from Kollias and Hatzopoulos [4]. We neglect the storage costs (i.e. $\sigma_{ik} = 0$ for i = 1, 2, ..., 5 and k = 1, 2).
Table 1. Communications cost ($q_{ijk} = u_{ijk}$ for k = 1, 2)
Nodes     1     2     3     4     5
1         0
2         6     0
3        12     6     0
4         9    12     6     0
5         6     9    12     6     0
Table 2. Query traffic ($\lambda_{ik}$)
          Nodes
File      1     2     3     4     5
1        24    24    24    24    24
2        32    10    10    30     5
Table 3. Update traffic ($\psi_{ik}$)
          Nodes
File      1     2     3     4     5
1         2     3     4     6     8
2         2     2     3    10     3
We assume the following storage capacities (in units of storage) for each of the sites: $s_1 = s_2 = s_3 = s_5 = 100$, $s_4 = 90$. The two files are assumed to be 60 and 40 units of storage in size, respectively. Applying Heuristic 1 to the data in the example above we obtain:
[Transportation tableaux for Steps 1 to 4 of Heuristic 1 (files 1 and 2, showing the designated sites and the dummy row) are not legible in this copy.]
Where $f_{11} = 168$, $f_{21} = 180$, $f_{31} = 174$, $f_{41} = 126$, $f_{51} = 123$ and $f_{12} = 156$, $f_{22} = 177$, $f_{32} = 132$, $f_{42} = 78$, $f_{52} = 126$.
Entries in the upper right corner of each cell are the $c_{ijk}$, initially equal to $v_{ijk}$ of expression (7).
Step 5
$P_{41} = 48$ is the minimum incremental cost. Set $I^1 = (1, 4, 5) - (4)$, $\delta_{41} = 0$, $R_1 = [(1, 1), (2, 1), (3, 3), (4, 5), (5, 5)]$, $R_2 = [(1, 1), (2, 1), (5, 1), (3, 4), (4, 4)]$.
Step 3
Since $L_u = \emptyset$ we stop. The solution value is $\sum_k C_k + P_{41} = 1089 + 48 = 1137$, and $R_1$, $R_2$ contain the allocations for query handling.
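The reported solution value can be cross-checked by evaluating objective (1) and constraint (5) directly for the allocations $R_1$ and $R_2$. The sketch below does so with 0-based site indices; the function names `objective` and `storage_feasible` and the representation of an allocation as a list `routing[k][i]` (the site answering queries from site i on file k) are assumptions of this illustration.

```python
Q = [
    [0, 6, 12, 9, 6],
    [6, 0, 6, 12, 9],
    [12, 6, 0, 6, 12],
    [9, 12, 6, 0, 6],
    [6, 9, 12, 6, 0],
]
LAM = [[24, 24, 24, 24, 24], [32, 10, 10, 30, 5]]  # query traffic, LAM[k][i]
PSI = [[2, 3, 4, 6, 8], [2, 2, 3, 10, 3]]          # update traffic, PSI[k][i]
SIZE = [60, 40]                                    # V_k
CAP = [100, 100, 100, 90, 100]                     # s_i

# R_1 and R_2 of the worked example, rewritten 0-based:
# routing[k][i] is the site that answers queries from site i on file k.
routing = [
    [0, 0, 2, 4, 4],   # R_1 = (1,1), (2,1), (3,3), (4,5), (5,5)
    [0, 0, 3, 3, 0],   # R_2 = (1,1), (2,1), (3,4), (4,4), (5,1)
]
placement = [set(r) for r in routing]  # sites holding a copy of each file


def objective(routing, placement):
    """Value of objective (1) with storage costs neglected, as in the example."""
    n = len(Q)
    total = 0
    for k, sites in enumerate(placement):
        # Update term: every update from site i goes to every copy of file k.
        total += sum(PSI[k][i] * Q[i][j] for i in range(n) for j in sites)
        # Query term: site i is served by its single designated site.
        total += sum(LAM[k][i] * Q[i][routing[k][i]] for i in range(n))
    return total


def storage_feasible(placement):
    """Constraint (5): the files held at a site fit into its capacity."""
    return all(
        sum(SIZE[k] for k, sites in enumerate(placement) if i in sites) <= CAP[i]
        for i in range(len(CAP))
    )


value = objective(routing, placement)   # 1137, as reported
ok = storage_feasible(placement)        # True
```

Constraints (2) and (4) hold by construction here, since each query site has exactly one designated site and the placement is derived from the designated sites.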
HEURISTIC 2
Two results due to Grapa and Belford [15] leading to Heuristic 2 to solve the MCFAP stated in model (M) are the following.
Theorem 1 [15]
If only one copy of a file k can be allocated (i.e. the partitioned case) all optimal allocations must include site i if
$$\lambda_{ik} \min_{j \neq i} q_{ijk} > Z_{ik}, \quad k \in L_m \qquad (10)$$
where:
$$Z_{ik} = \sigma_{ik} + \sum_{j \neq i} \psi_{jk} u_{jik}.$$
Theorem 2 [15]
If multiple copies of file k can be allocated (i.e. the replicated case), no optimal allocation including more than one site will include site i if:
$$Z_{ik} > \sum_{j=1}^{n} \lambda_{jk} \left( \max_{l \in L_n} (q_{jlk}) - q_{jik} \right); \quad i \in L_n,\ k \in L_m. \qquad (11)$$
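The two tests can be evaluated mechanically. The sketch below applies inequalities (10) and (11), as written above, to one file at a time using the data of Tables 1 to 3. It evaluates the inequalities only: the storage feasibility test of step 1 of Heuristic 2 is not included, so the sets it returns need not coincide with the starting sets of the worked example. The function names and the 0-based indexing are assumptions of this illustration.

```python
Q = [
    [0, 6, 12, 9, 6],
    [6, 0, 6, 12, 9],
    [12, 6, 0, 6, 12],
    [9, 12, 6, 0, 6],
    [6, 9, 12, 6, 0],
]
LAM = [[24, 24, 24, 24, 24], [32, 10, 10, 30, 5]]  # query traffic, LAM[k][i]
PSI = [[2, 3, 4, 6, 8], [2, 2, 3, 10, 3]]          # update traffic, PSI[k][i]
SIGMA = [[0] * 5, [0] * 5]                         # storage costs neglected


def z_value(q, psi_k, sigma_k, i):
    """Z_ik: storage cost at site i plus the cost of sending all updates to i."""
    return sigma_k[i] + sum(psi_k[j] * q[j][i] for j in range(len(q)) if j != i)


def sites_satisfying_10(q, lam_k, psi_k, sigma_k):
    """Sites i for which inequality (10) holds."""
    n = len(q)
    return [
        i for i in range(n)
        if lam_k[i] * min(q[i][j] for j in range(n) if j != i)
        > z_value(q, psi_k, sigma_k, i)
    ]


def sites_satisfying_11(q, lam_k, psi_k, sigma_k):
    """Sites i for which inequality (11) holds."""
    n = len(q)
    result = []
    for i in range(n):
        bound = sum(lam_k[j] * (max(q[j]) - q[j][i]) for j in range(n))
        if z_value(q, psi_k, sigma_k, i) > bound:
            result.append(i)
    return result


attractive = [sites_satisfying_10(Q, LAM[k], PSI[k], SIGMA[k]) for k in range(2)]
excluded = [sites_satisfying_11(Q, LAM[k], PSI[k], SIGMA[k]) for k in range(2)]
```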
An outline of Heuristic 2 is as follows:
Heuristic 2
Repeat steps 1 and 2 for each file $k \in L_m$.
1. For each file k identify a set of initially ineligible sites using equation (11). Also using equation (10), identify the set of sites $I^k$ for locating file k. Test $I^k$ for feasibility in terms of file size vs node capacity. Denote by $I^k_e$ the set of remaining eligible sites that may receive file k. Let the total cost, communication plus update, of having file k in the sites in $I^k$ be $C^k$.
2. Find a site $J_k \in I^k_e$, if one exists, that reduces $C^k$ the most, by $\theta_k$, if $J_k$ is added to $I^k$ (breaking ties arbitrarily). If $I^k_e = \emptyset$ for all k go to step 4.
3. Order the candidate sites for addition in decreasing order of the reductions $\theta_k$, placing at the head of the list sites that are to receive a copy of a file k such that $I^k = \emptyset$. Systematically add, according to the order in this list, file k to site $J_k$ if the addition is storage feasible. If not, move on to the next element in the ordered list. Delete from $I^k_e$ the site $J_k$ and repeat step 3 until the ordered list is empty. If $I^k_e = \emptyset$ for all k then go to step 4. Else go to step 2.
4. If $I^k \neq \emptyset$ for all k then stop. Else for each $I^k = \emptyset$ set $I^k_e$ to the set of initially ineligible sites and go to step 2.
In a nutshell, Heuristic 2 assigns one file at a time to a designated query site, adding that file first to a site which reduces total cost the most among all files that are candidates for being added to that site. Note that the problem to be solved in step 2 of Heuristic 2 is a fixed charge transportation problem. The procedure outlined in step 2 of Heuristic 1 may be used to "solve" this problem quickly.
EXAMPLE
The solution to the example problem using Heuristic 2 follows.
Step 1
$I^1 = (4)$, $I^2 = (1)$. $C^1 = 918$, $C^2 = 636$.
Step 2
The site $J_1 = 1$ decreases $C^1$ the most (by 318). $J_2 = 4$ with a cost reduction of 234 in $C^2$.
Step 3
$I^1 = (1, 4)$, $I^2 = (1)$; site 4 cannot receive a copy of file 2 due to its storage capacity being exceeded.
Step 2
The site $J_2 = 3$ decreases $C^2$ the most (by 60). No other sites are attractive for file 1.
Step 3
$I^1 = (1, 4)$, $I^2 = (1, 3)$.
Step 2
No other site is cost reducing for either file.
Step 4
Since $I^1 \neq \emptyset$ and $I^2 \neq \emptyset$ STOP.
Solution
Place file 1 at sites 1 and 4. Allocate queries from sites 1 and 2 to site 1 and from sites 3, 4 and 5 to site 4. Place file 2 at sites 1 and 3. Allocate queries from sites 1 and 5 to site 1 and from sites 2, 3 and 4 to site 3. Total cost = $C^1 + C^2$ = 600 + 576 = 1176.
Heuristics 1 and 2 can accommodate the requirement, proposed in Dutta [10], that only a few sites be required to know the location of other copies of a file. To achieve this, designate a site to be the update coordinator for a file, funnelling all updates of that file from all sites through this coordinating site. The fixed update cost associated with a file location is then simply the cost of communicating the updates from the coordinator site to that location.
COMPUTATIONAL EXPERIENCE
The performance of the two heuristics was compared computationally against optimal solutions obtained using LINDO [14] on a microcomputer. The results are presented in Table 4. Preliminary runs of Heuristic 1 with loosely constrained storage requirements (5) in model (M) revealed that a near optimal (10% variation or less) solution is almost always generated. The "solutions" in problem (T) were feasible to constraint (5) in almost all these cases. The contents of Table 4 represent our experience with tightly constrained storage requirements requiring at least two iterations in using Heuristic 1. The notation used in Table 4 is explained below. Each row in the table represents 10 randomly generated problems.
Table 4. Computational comparison of Heuristics 1 and 2
N     K     MVH1    AVH1    MVH2    AVH2
5     2     0.44    0.34    0.80    0.62
5     3     0.34    0.19    0.14    0.06
5     4     0.81    0.35    0.62    0.35
6     2     0.86    0.34    0.47    0.31
6     3     0.62    0.26    0.73    0.36
6     4     0.56    0.23    0.55    0.31
7     2     0.86    0.36    0.75    0.38
7     3     0.46    0.35    0.86    0.40
8     2     0.98    0.41    0.64    0.45
Denote the optimal solution value by $C^*$ and the solution value obtained by using heuristic i (= 1, 2) by $H_i$. For each of the 10 random instances and for each of the combinations (N, K), we computed the deviation from optimality as follows:
$$DH_i = \frac{H_i - C^*}{C^*}, \quad i = 1, 2.$$
The following quantities are reported in Table 4:
N: number of sites
K: number of distinct files
MVH$_i$: maximum value of $DH_i$ for Heuristic i
AVH$_i$: average value of $DH_i$ for Heuristic i.
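The two summary statistics can be computed as in the short sketch below. The function names and the sample numbers are hypothetical and chosen only to illustrate the definitions; they are not data from the experiments reported in Table 4.

```python
def deviation(h, c_star):
    """DH = (H - C*) / C*, the relative deviation of a heuristic value from the optimum."""
    return (h - c_star) / c_star


def summarize(heuristic_values, optimal_values):
    """Return (MVH, AVH): maximum and average deviation over a set of instances."""
    devs = [deviation(h, c) for h, c in zip(heuristic_values, optimal_values)]
    return max(devs), sum(devs) / len(devs)


# Hypothetical values for two instances (not taken from the paper).
mvh, avh = summarize([150, 125], [100, 100])
```

A deviation of 0.25 means the heuristic solution costs 25% more than the optimal one, so the entries of Table 4 are fractions rather than percentages.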
The heuristics perform about equally well on the problems studied with no consistent advantage or disadvantage apparent in Table 4. It is worth noting that increasing problem size does not seem to adversely affect the performance of either heuristic. Kennington [16] has pointed out that fixed charge problems with equal numbers of sources and destinations are among the hardest to solve optimally. The model (T) that we "solve" repeatedly is of this variety. As N and K grow the number of integer variables in model (M) grows very quickly, e.g. for N = 10 and K = 5 it is 500! In practical terms such problems would be very difficult if not impossible to solve optimally in any reasonable length of time. The heuristics on the other hand appear to be obtaining good and at times near optimal solutions in a small fraction of the time required to solve model (M) optimally.
CONCLUSION
The observation that cost minimization requires exactly one designated query site for each query emanating node, together with separability of model (M) (without the storage capacity restriction), enables the "solution" of the traditional multiple copy file allocation problem using the heuristics presented here. Network reliability can easily be preserved by allowing queries to be posed against alternate designated sites whenever the designated site for a given query is unavailable. By varying the maximum number of sites a file site can accommodate for queries, response time considerations (in model T) can be handled in our procedure. The centralized updating approach proposed earlier to accommodate Dutta's [10] requirement can lead to problems in that, if the central file goes down, the updates will not be available until the site comes back up. Further, this may not be a very cost effective procedure. A future direction of research is to adopt a decentralized updating approach and yet accommodate the requirement in Ref. [10].
The computational performance reported in this work is based on problem sets generated at random where invariably some storage capacities were designed to be exceeded. The average variations from the optimum in Table 4 are admittedly not very small. However, the solution of two essentially NP-hard problems is attempted in both heuristics: namely, a problem at least as hard as the fixed charge transportation problem for each file, and a multiple knapsack problem to determine the least expensive file copy to discard from sites where storage capacity has been exceeded. The cost basis for such minimal discarding is unfortunately linked to the solution of m problems (T), complicating the solution procedure. It is heartening to note that in our testing neither procedure appeared to get worse with increasing problem size. This holds promise that, for large problems that are not optimally solvable, the heuristic solutions would be acceptable as good alternatives.
REFERENCES
1. R. G. Casey. Allocation of copies of a file in an information network. Proc. AFIPS, SJCC 40, 617-625 (1972).
2. K. Eswaran. Placement of records in a file and file allocation in computer networks. Proc. IFIP Conf. 304-307 (1977).
3. M. Hatzopoulos and J. G. Kollias. The file allocation problem under dynamic usage. Inf. Systems 5, 197-201 (1980).
4. J. G. Kollias and M. Hatzopoulos. Criteria to aid in solving the problem of allocating copies of a file in a computer network. Computer J. 24, 29-30 (1981).
5. J. G. Kollias and M. Hatzopoulos. Allocation of copies of s distinct files in an information network. Inf. Systems 6, 201-204 (1981).
6. K. D. Levin and H. L. Morgan. Optimizing distributed databases: A framework for research. Proc. AFIPS SJCC 44, 473-478 (1975).
7. H. L. Morgan and K. D. Levin. Optimal program and data locations in computer networks. Comm. ACM 20, 315-322 (1977).
8. S. Mahmoud and J. S. Riordon. Optimal allocation of resources in distributed information networks. ACM TODS 1, 66-78 (1976).
9. G. M. Booth. Distributed information systems. Proc. NCC 789-794 (1976).
10. A. Dutta. Modeling of multiple copy update costs for file allocation in distributed databases. Int. J. Computer Inf. Sci. 14, 29-34 (1985).
11. A. Warszawski. Multidimensional location problems. Opns Res. Q. 24, 165-179 (1973).
12. P. Gray. Exact solution of the fixed charge transportation problem. Opns Res. 19, 1529-1538 (1971).
13. F. S. Hillier and G. J. Lieberman. Operations Research, 2nd edition. Holden Day, San Francisco (1974).
14. L. Schrage. Linear, Integer and Quadratic Programming with LINDO. The Scientific Press, Palo Alto (1984).
15. E. Grapa and G. G. Belford. Some theorems to aid in solving the file allocation problem. Comm. ACM 20, 878-882 (1977).
16. J. Kennington. The fixed-charge transportation problem: A computational study with a branch-and-bound code. AIIE Trans. 241-247 (1976).