Parallel Computing 17 (1991) 1009-1016 North-Holland
An efficient algorithm for mapping VLSI circuit simulation programs onto multiprocessors

S. Selvakumar and C. Siva Ram Murthy
Department of Computer Science and Engineering, Indian Institute of Technology, Madras 600 036, India
Received 17 December 1990 Revised 26 July 1991
Abstract
Selvakumar, S. and C. Siva Ram Murthy, An efficient algorithm for mapping VLSI circuit simulation programs onto multiprocessors, Parallel Computing 17 (1991) 1009-1016.

Recent advances in VLSI and communications technology have made the use of multiprocessors in circuit simulation cost-effective. However, for achieving maximum speedup in concurrent circuit simulation, the problem of mapping the tasks of a concurrent program onto the processors of a multiprocessor must be satisfactorily solved. Unfortunately, the mapping problem, i.e. the problem of assigning the tasks of a concurrent program to the processors of a multiprocessor such that the difference of load among the processors is the smallest and the communication between the processors is minimum, is known to be NP-hard and hence there is not much hope of designing a polynomial time algorithm for solving it. Consequently, researchers have focused on designing heuristic algorithms which obtain near-optimal solutions in a reasonable amount of computation time. In this paper we propose a new heuristic algorithm for solving the mapping problem and compare its performance with that of simulated annealing. Experimental results show that our algorithm is typically an order of magnitude faster and produces solutions which are substantially better than those obtained with annealing using M = 10.

Keywords. VLSI technology; circuit simulation; mapping problem; simulated annealing; experimental results.
1. Introduction

Recent advances in VLSI and communications technology have resulted in widespread interest in the development of multiprocessor systems and their application to suitable problem areas. Logic level simulation problems exhibit a high level of concurrent activity and are well-suited for execution on these systems. Auckland et al. [1] describe a concurrent timing simulator. In their simulator the MOS circuit is represented as a set of capacitive nodes interconnected by voltage controlled current sources. Nodes sharing the same bidirectional current sources are grouped together and allocated the same vertex on the concurrent simulation graph. The number of nodes allocated to a vertex gives the weight of that vertex. At each simulation time step the various voltage outputs from each vertex in the graph are evaluated in terms of the inputs. Transmission of discrete voltages takes place over unidirectional communication links between the vertices (a bidirectional link between two vertices is considered as two unidirectional links). In their
implementation each unidirectional link has a weight of one. Details of the circuit and computation model for the concurrent timing simulator are given in [1]. Efficient simulation of a circuit represented by its concurrent simulation graph requires optimal mapping of the tasks of the concurrent simulation graph onto the processors in the multiprocessor, where an optimal mapping is one in which the difference in computational load among the processors is the smallest and the amount of communication between the processors is minimum. Unfortunately, this mapping problem is known to be NP-hard and hence there is not much hope for designing a polynomial time algorithm for solving it. Consequently, researchers have focused on developing heuristic algorithms which obtain near-optimal solutions in a reasonable amount of computation time. Sheild [2] has applied the simulated annealing heuristic to solve the mapping problem. Even though the quality of mappings produced by simulated annealing is better than that of random mappings, the solution quality heavily depends upon the value of M, where M is a parameter in simulated annealing to be described in section 4. For M = 1, the algorithm runs in an acceptable time but the solution quality is very poor. For M = 100, the solution quality is good, but the execution time is unacceptably high. In this paper we propose a new algorithm for the mapping problem which produces solutions that are better in quality than those obtained with simulated annealing using M = 10. Also our algorithm typically runs an order of magnitude faster than simulated annealing using M = 10.
2. The cost function
We briefly describe the cost function developed by Sheild for evaluating the quality of a mapping. The execution time of a concurrent simulation graph on a multiprocessor can be minimized: (i) by keeping the computational load on the processors nearly equal, i.e. by minimizing the load imbalance cost C_b, and (ii) by keeping the amount of interprocessor communication to a minimum, i.e. by minimizing the communication cost C_c. Let the number of vertices in the concurrent simulation graph be V and the number of processors in the system be P. Let w_i be the computational weight on vertex i and S_k be the set of vertices allocated to processor k. Then the load imbalance on processor k is given by

    \left| \sum_{i \in S_k} w_i - \frac{1}{P} \sum_{i=1}^{V} w_i \right|

The total cost because of unbalanced loading of the multiprocessor system, i.e. the load imbalance cost C_b, is therefore given by

    C_b = \sum_{k=1}^{P} \left| \sum_{i \in S_k} w_i - \frac{1}{P} \sum_{i=1}^{V} w_i \right|
The edges joining the vertices in the graph correspond to data communications between the concurrent processes. Let e_{i,j} represent the communication cost for data transfer from vertex i to vertex j. We assume the cost of data transmission between vertices allocated to the same processor to be negligible. Now the total communication cost, C_c, from edges having vertices on different processors is given by

    C_c = \sum_{i,j=1}^{V} e_{i,j}, \quad i \in S_m,\ j \in S_n,\ m \neq n.
Therefore, an overall cost function or cost (denoted C) for the system may be formulated as
    C = C_b + w * C_c

where w is the weight given to the contribution of the communications cost relative to the cost of computational load imbalance across the system. For the S/NET star multiprocessor network [4], Sheild estimates the value of w as 1. We will use the above cost function with w = 1 for evaluating the performance of our algorithm.
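The cost function above is simple enough to state directly in code. The following is a minimal sketch in Python (the paper's own experiments were written in C); the graph representation — a list of vertex weights and a dictionary of directed edge costs — and all names are our own assumptions, not the authors' data structures.

```python
def overall_cost(w, e, assign, P, comm_weight=1.0):
    """Overall cost C = C_b + w * C_c for a mapping, as defined above.

    w      -- list of computational weights, w[i] for vertex i
    e      -- dict {(i, j): cost} of inter-vertex communication costs
    assign -- assign[i] is the processor holding vertex i
    P      -- number of processors
    """
    V = len(w)

    # Load on each processor versus the ideal (perfectly balanced) load.
    loads = [0.0] * P
    for i in range(V):
        loads[assign[i]] += w[i]
    ideal = sum(w) / P

    # Load imbalance cost C_b: sum over processors of |load_k - ideal|.
    c_b = sum(abs(load - ideal) for load in loads)

    # Communication cost C_c: only edges crossing processors contribute;
    # intra-processor communication is taken as negligible.
    c_c = sum(cost for (i, j), cost in e.items() if assign[i] != assign[j])

    return c_b + comm_weight * c_c
```

For example, on a four-vertex graph with unit weights and two edges, placing each communicating pair on its own processor balances the load and cuts no edges, giving a cost of zero.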
3. Our algorithm

3.1. The algorithm

We define some terminology before presenting the algorithm. A configuration is an assignment of vertices to processors. A move of vertex v_i is an operation which changes a configuration by transferring vertex v_i from its processor to another. An exchange of vertices v_i and v_j is an operation which changes a configuration by simultaneously transferring vertex v_i to the processor of v_j and vertex v_j to the processor of v_i. The overall cost function is defined for a configuration and when the configuration changes the overall cost function might change. To illustrate the terminology, consider a two-processor system in the following configuration:
Vertices 0, 3, and 4 are assigned to Processor 0 and Vertices 1, 2, and 5 are assigned to Processor 1. Then a move which moves vertex 0 from processor 0 to processor 1 will change the configuration to:
Vertices 3 and 4 are assigned to Processor 0 and Vertices 0, 1, 2, and 5 are assigned to Processor 1. If this move is followed by an exchange of vertex 3 with vertex 5 then the resulting configuration is: Vertices 4 and 5 are assigned to Processor 0 and Vertices 0, 1, 2, and 3 are assigned to Processor 1.

Now, we present the algorithm:

BEGIN ALGORITHM
Step 1. Randomly assign the vertices to the processors. This assignment gives the initial configuration. Record it as the best configuration and its cost as the best cost. Goto step 2.
Step 2. IF there are moves which decrease the cost THEN Make that move which decreases the cost most thereby changing the configuration. Go back to step 2. ELSE Goto step 3.
Step 3. IF there are exchanges which decrease the cost THEN Make the exchange which decreases the cost most thereby changing the configuration. Goto step 2. ELSE Goto step 4.
Step 4. Whenever the algorithm reaches this step we append the cost at the current configuration to a queue of cost values. This queue has length L, where L is a parameter. As elements are appended at the tail of the queue they might drop off the head because of the restriction on the length of the queue. If the current value of the cost is less than that of the best cost achieved
so far, record the current configuration as the best configuration and the current cost as the best cost. Goto step 5.
Step 5. IF the value of the last element in the queue is less than the value of the last but one element in the queue THEN Delete all but the last element in the queue, thereby leaving only the last element in the queue. Goto step 6. ELSE IF the queue is full (i.e. if the queue has L elements) THEN Exit the algorithm. ELSE Goto step 6.
Note that, because of steps 4 and 5, the algorithm terminates iff the cost never decreased during the L immediately past transfers of control to step 4.
Step 6. Perform R random moves, where R is a parameter. This changes the configuration. Goto step 2.
END ALGORITHM
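As a concreteness check, the six steps above can be sketched in Python. This is our own illustrative rendering, not the authors' C implementation: the cost function is inlined with the communication weight w fixed at 1 (as in the paper's experiments), and the graph representation (vertex-weight list plus edge-cost dictionary) is assumed.

```python
import random

def greedy_mapping(w, e, P, L=3, R=2, seed=0):
    """Sketch of steps 1-6 above. w: vertex weights, e: {(i, j): cost}."""
    rng = random.Random(seed)
    V = len(w)
    ideal = sum(w) / P

    def cost(a):
        loads = [0.0] * P
        for i in range(V):
            loads[a[i]] += w[i]
        return (sum(abs(load - ideal) for load in loads)
                + sum(c for (i, j), c in e.items() if a[i] != a[j]))

    def best_move(a):
        # Step 2's candidate: the single-vertex move that lowers cost most.
        base, best = cost(a), None
        for i in range(V):
            old = a[i]
            for p in range(P):
                if p != old:
                    a[i] = p
                    if cost(a) < base:
                        base, best = cost(a), (i, p)
            a[i] = old
        return best

    def best_exchange(a):
        # Step 3's candidate: the pairwise exchange that lowers cost most.
        base, best = cost(a), None
        for i in range(V):
            for j in range(i + 1, V):
                if a[i] != a[j]:
                    a[i], a[j] = a[j], a[i]
                    if cost(a) < base:
                        base, best = cost(a), (i, j)
                    a[i], a[j] = a[j], a[i]
        return best

    a = [rng.randrange(P) for _ in range(V)]      # step 1: random start
    best_a, best_c = a[:], cost(a)
    queue = []                                    # step 4's cost queue

    while True:
        while True:                               # steps 2 and 3: descend
            m = best_move(a)
            if m is not None:
                a[m[0]] = m[1]
                continue
            x = best_exchange(a)
            if x is None:
                break
            a[x[0]], a[x[1]] = a[x[1]], a[x[0]]
        c = cost(a)                               # step 4: record the cost
        queue = (queue + [c])[-L:]
        if c < best_c:
            best_a, best_c = a[:], c
        if len(queue) >= 2 and queue[-1] < queue[-2]:
            queue = queue[-1:]                    # step 5: progress, reset queue
        elif len(queue) == L:
            return best_a, best_c                 # step 5: L visits, no progress
        for _ in range(R):                        # step 6: R random moves
            a[rng.randrange(V)] = rng.randrange(P)
```

On the small example from Section 2 (two communicating pairs of unit-weight vertices, two processors) the descent reaches the zero-cost mapping that co-locates each pair, and the L-length queue then fills with equal costs and triggers termination.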
3.2. Computation complexity

Evaluating the condition part of step 2 takes O(PV) time. If N_2 is the average number of times the THEN part of step 2 is executed before going to step 3 then the complexity of step 2 is O(N_2 PV). Evaluating the condition part of step 3 takes O(V^2) time. Hence, the time taken to execute the block comprised of steps 2 and 3 once is O(V^2 + N_2 PV). After step 3 control passes either to step 2 or to step 4. Let N_3 be the average number of times control passes to step 2 from step 3 before it passes to step 4. Then it takes O(N_3 V^2 + N_3 N_2 PV) time to execute the block comprised of steps 2, 3. Note that N_3 is also the number of times step 3 is executed for each execution of the block comprised of steps 2, 3, 4, and 5. From step 4 control passes to step 5. At step 5, the algorithm either terminates or jumps to step 6. Hence it takes O(N_3 V^2 + N_3 N_2 PV) time to execute the block comprised of steps 2, 3, 4, and 5 once. Finally, from step 6, after making R random moves, control passes to step 2. If N_6 is the total number of times step 6 is executed in the algorithm, then the computation complexity of the complete algorithm is given by O(N_6 N_3 V^2 + N_6 N_3 N_2 PV). Clearly, if N_2 >> 1, then execution of step 3, which takes time quadratic in V, will be kept to the bare minimum required for improving the solution quality and therefore will not shoot up the run time of the algorithm.
3.3. Discussion

Our algorithm is based on two assumptions:
(i) Exchanges which decrease the cost might be found even if no move which decreases the cost could be found. Clever use of such exchanges might greatly improve the solution quality without making the optimization intolerably slow.
(ii) If stuck in a locally optimal configuration in which no move or exchange which decreases the cost can be found, it is better to make R random moves to jump out of the local optimum.
Our extensive experiments indicate that both these assumptions are valid. Just like M in annealing, L is a parameter which determines the quality of solutions obtained with our algorithm. Larger values of L always produce better quality solutions than smaller values of L. But larger values of L require greater execution time of the algorithm.
4. Simulated annealing algorithm

Simulated annealing is a popular optimization technique devised by an analogy to the physical process of annealing [2,6]. We briefly describe the simulated annealing algorithm that is used by Sheild [2] for solving the mapping problem. Note that unless explicitly mentioned control does not flow from one step to another in the algorithm below.

BEGIN ALGORITHM
Step 1. Start with a random initial configuration, CON_0. Goto step 2.
Step 2. This algorithm uses three parameters, viz. temperature (T), rate of cooling (β), and number of moves performed at a given temperature (N). Sheild [2] recommends that the three parameters of simulated annealing be determined as below.
(i) Set β = 0.95.
(ii) Compute ΔCF_av, the average of ΔCF for all possible moves from the initial randomly assigned configuration. Then choose that value for T which makes e^(-ΔCF_av/T) equal to 0.9. This gives the initial value of T.
(iii) The value of N is computed using the formula N = MV(P - 1), where M is an input parameter to the algorithm. Note that N is directly proportional to M. Typically, M is 0.01 or 0.1 or 1 or 10 or 100. Higher values of M result in higher values of N (i.e. a higher value of M results in more moves being performed at each value of the temperature T), leading to better solution quality. Goto step 3.
Step 3. DO steps 4 and 7 WHILE (e^(-1/T) > 2^(-31)); EXIT the algorithm. Note that the algorithm terminates when the temperature T becomes so low that e^(-1/T) ≤ 2^(-31).
Step 4. DO steps 5 and 6 WHILE (the number of moves performed in step 6 is less than or equal to N);
Step 5. Let ΔCF be the change in cost function value resulting from a move: ΔCF = CF_new - CF, where CF is the cost function value at the current configuration and CF_new is the cost function value of the new configuration resulting from the current configuration after performing the move. Clearly, for a move that decreases the cost function value ΔCF is negative and for a move that increases the cost function value ΔCF is positive. Arbitrarily select a move and compute its ΔCF.
Step 6. IF (ΔCF < 0) THEN The selected move decreases the cost function value and is made, resulting in a new configuration. ELSE Generate a random number r between 0 and 1. IF (r < e^(-ΔCF/T)) THEN Even though the selected move does not decrease the cost function value it is made, resulting in a new configuration.
Step 7. Set T ← βT. That is, lower the temperature according to the rate of cooling.
END ALGORITHM

Simulated annealing has a time complexity of O(MNK log(T_start/T_stop)) where T_start is the initial value of T and T_stop is the value of T just before the termination of the algorithm.
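The annealing loop described above can be rendered as a short Python sketch. Again, this is our own illustration rather than Sheild's implementation: the cost function is inlined with w = 1, a "move" is taken to be a single-vertex transfer to a different processor, and the graph representation is the same assumed one as before.

```python
import math
import random

def anneal(w, e, P, M=1, beta=0.95, seed=0):
    """Sketch of the annealing algorithm above. w: vertex weights,
    e: {(i, j): cost}, P: processors, M: the quality parameter."""
    rng = random.Random(seed)
    V = len(w)
    ideal = sum(w) / P

    def cost(a):
        loads = [0.0] * P
        for i in range(V):
            loads[a[i]] += w[i]
        return (sum(abs(load - ideal) for load in loads)
                + sum(c for (i, j), c in e.items() if a[i] != a[j]))

    a = [rng.randrange(P) for _ in range(V)]       # step 1: random start
    cf = cost(a)

    # Step 2(ii): average |dCF| over all possible single-vertex moves,
    # then pick the initial T so that exp(-avg / T) = 0.9.
    deltas = []
    for i in range(V):
        old = a[i]
        for p in range(P):
            if p != old:
                a[i] = p
                deltas.append(abs(cost(a) - cf))
        a[i] = old
    avg = sum(deltas) / len(deltas)
    T = -avg / math.log(0.9) if avg > 0 else 1.0

    N = M * V * (P - 1)                            # step 2(iii)
    while math.exp(-1.0 / T) > 2.0 ** -31:         # step 3: outer loop
        for _ in range(N):                         # step 4: N moves at this T
            i = rng.randrange(V)                   # step 5: arbitrary move
            p = (a[i] + 1 + rng.randrange(P - 1)) % P
            old = a[i]
            a[i] = p
            d = cost(a) - cf                       # dCF = CF_new - CF
            if d < 0 or rng.random() < math.exp(-d / T):
                cf += d                            # step 6: accept the move
            else:
                a[i] = old                         # reject: undo the move
        T *= beta                                  # step 7: cool
    return a, cf
```

Note how uphill moves (ΔCF > 0) are still accepted with probability e^(-ΔCF/T), which shrinks as the temperature falls, so the search gradually hardens into pure descent.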
5. Results

We have evaluated the performance of our algorithm in mapping concurrent programs by comparing its mappings with the mappings produced by simulated annealing for an eight-processor completely connected multiprocessor. We have used the following concurrent programs in our performance evaluation: (i) the concurrent circuit simulation graphs reported in [2] and (ii) 40 randomly generated concurrent simulation graphs. The graphs ranged in size from 10 vertices to 150 vertices with different percentages of coupling [5]. The percentage coupling, A, of a graph is defined as:

    A = \frac{1}{1 + \sum_{i=1}^{V} w_i / \sum_{i,j=1}^{V} e_{i,j}} * 100, \quad 0 < A < 100.

Note that for a heavily 'computation bound' graph A tends to zero and for a heavily 'communication bound' graph A tends to 100. In all our experiments with our algorithm, we set L = 3 because preliminary investigations suggested this value of L as the best. However, we varied R from 2 to 10. For a given input graph, for each combination of values of L and R, we performed the mapping once. We have evaluated the simulated annealing algorithm using M = 1, M = 10, and M = 100. For each graph, for each value of M, we performed the mapping once.
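The coupling measure is a one-liner. The definition is partly garbled in our copy of the text; the sketch below reads it as A = 100 / (1 + Σw_i / Σe_{i,j}), which is the reading consistent with the stated limiting behaviour (A → 0 for computation-bound graphs, A → 100 for communication-bound graphs). The function and its inputs are our own naming.

```python
def percentage_coupling(w, e):
    """Percentage coupling A of a concurrent simulation graph, taking
    A = 100 / (1 + sum(w_i) / sum(e_ij)): edge costs dominating pushes
    A toward 100, vertex weights dominating pushes A toward 0."""
    return 100.0 / (1.0 + sum(w) / sum(e.values()))
```

For instance, equal total computation and communication weight gives A = 50, and tripling the computation weight drops A to 25.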
Unfortunately, the inordinate amount of time consumed by simulated annealing using M = 100 made it possible for us to test only a few cases and hence we do not include our results of simulated annealing using M = 100 for mapping random graphs due to insufficient data. However, we include the results of simulated annealing in mapping the concurrent circuit simulation graphs. In the case of mapping the concurrent circuit simulation graphs we evaluated (i) the cost of the initial random configuration, (ii) the cost of the final mapping, and (iii) the time taken to obtain the final mapping,
Table 1
Mappings of the concurrent circuit simulation graphs onto an eight-processor completely connected multiprocessor

                        Simulated annealing                        Our algorithm
Graph name              INIT     M1     M10    M100   TIME         INIT     MIN    MAX    AVG    TIME
4 x 4 multiplier        185.25   49.75  47.75  47.75  3739.44      178.25   40.35  42.75  42.05  52.04
Frequency lock loop     103.25   74.75  53.75  52.75  5302.50      99.25    48.25  51.25  49.27  109.67

NOTE. For both simulated annealing and our algorithm, INIT is the cost of the initial random assignment generated by the algorithm. M1, M10, and M100 are the minimum of the cost of the mappings obtained with simulated annealing using M = 1, M = 10, and M = 100, respectively. MIN, MAX, and AVG are, respectively, the minimum, maximum, and average of the cost of the mappings obtained with our algorithm. TIME for our algorithm is its average computation time (in seconds) over all values of R tested and TIME for simulated annealing is its computation time (in seconds) using M = 10.
Table 2
Comparison of random graph mappings onto an eight-processor completely connected multiprocessor by our algorithm and simulated annealing using M = 1

                                                                Percentage of coupling (A)
                                                                33       50       67
No. of graphs tested                                            15       13       12
No. of graphs for which our algorithm produces
better solutions                                                13       11       6
No. of graphs for which simulated annealing produces
better solutions                                                0        0        4
No. of graphs for which our algorithm and simulated
annealing produce same quality solutions (i.e. ties)            2        2        2
Average improvement in cost by our algorithm over
simulated annealing using M = 1                                 36.14    17.29    3.81
Maximum improvement in cost by our algorithm over
simulated annealing using M = 1                                 83.93    36.47    29.63
Computation time of simulated annealing /
Avg. computation time of our algorithm                          3.09     3.34     4.49
for our algorithm and the simulated annealing algorithm. Our results in this case, which are tabulated in Table 1, demonstrate that for these graphs our algorithm outperforms simulated annealing; in particular, it outperforms even simulated annealing using M = 100. In the case of mapping random concurrent simulation graphs, we compared our algorithm using L = 3 with simulated annealing using M = 1 and M = 10. Our results in this case are presented in Tables 2 and 3.
Table 3
Comparison of random graph mappings onto an eight-processor completely connected multiprocessor by our algorithm and simulated annealing using M = 10

                                                                Percentage of coupling (A)
                                                                33       50       67
No. of graphs tested                                            15       13       12
No. of graphs for which our algorithm produces
better solutions                                                13       9        3
No. of graphs for which simulated annealing produces
better solutions                                                0        2        5
No. of graphs for which our algorithm and simulated
annealing produce same quality solutions (i.e. ties)            2        2        4
Average improvement in cost by our algorithm over
simulated annealing using M = 10                                27.46    10.20    -3.18
Maximum improvement in cost by our algorithm over
simulated annealing using M = 10                                65.14    28.21    4.35
Computation time of simulated annealing /
Avg. computation time of our algorithm                          30.93    33.44    44.93
Table 2 clearly demonstrates that our algorithm outperforms simulated annealing using M = 1 for mapping simulation graphs which are not communication bound (i.e. program graphs with A = 33 and A = 50). In fact, improvements as high as 83.93% have been obtained. Moreover, simulated annealing never outperformed our algorithm while mapping computation dominated program graphs (with A = 33). However, while mapping communication bound program graphs (with A = 67), our algorithm performed only slightly better than simulated annealing using M = 1. Further, a comparison of the running times shows that our algorithm is around 3.5 times faster than simulated annealing using M = 1, on an average.

Table 3 clearly shows that our algorithm fares much better than simulated annealing using M = 10 for mapping program graphs which are not communication bound (i.e. program graphs with A = 33 and A = 50). In fact, improvements as high as 65.14% have been obtained. Moreover, simulated annealing never outperformed our algorithm while mapping computation dominated program graphs (with A = 33). However, while mapping communication bound program graphs (with A = 67), simulated annealing performed slightly better than our algorithm. Further, a comparison of the running times shows that our algorithm is around 35 times faster than simulated annealing using M = 10, on an average.

Finally, we observe that the quality of solutions obtained with our algorithm is largely independent of the value of R (which we varied from 2 to 10 in mapping each random graph using our algorithm) and improves with higher values of L. We conducted all our experiments using the C programming language on a network of SUN 3/50 workstations.
6. Conclusions

In this paper we developed a new and efficient heuristic for solving the problem of mapping the tasks of concurrent VLSI circuit simulation programs onto the processors in a multiprocessor system. We then compared its performance with that of simulated annealing. The results of this comparison study demonstrate that our algorithm runs several times faster and produces substantially better quality solutions than simulated annealing.
Acknowledgements The authors would like to acknowledge the constructive comments and suggestions from the referees of this paper.
References

[1] B. Auckland, S. Ahuja, T.L. Lindstrom and D.J. Romero, CEMU - A concurrent timing simulator, in: Proc. IEEE Internat. Conf. on Computer Aided Design, Santa Clara, CA, USA (1985) 122-124.
[2] J. Sheild, Partitioning concurrent VLSI simulation programs onto a multiprocessor by simulated annealing, IEE Proc. 134, Pt. E (1) (1987) 24-28.
[3] S. Selvakumar and C. Siva Ram Murthy, An efficient algorithm for mapping VLSI circuit simulation programs onto arbitrary multiprocessors, Technical Report, Department of Computer Science and Engineering, Indian Institute of Technology, Madras, October 1990.
[4] S.R. Ahuja, S/Net: A high speed interconnect for multiple computers, IEEE J. Selected Areas in Communications SAC-1 (5) (1983) 751-756.
[5] C. Siva Ram Murthy and V. Rajaraman, Task assignment in a multiprocessor system, Microprocessing and Microprogramming 26 (1) (1989) 63-71.
[6] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671-680.