JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 43, 21–35 (1997), Article No. PC971325
Simulated Annealing Based Parallel State Assignment of Finite State Machines¹

Gagan Hasteer*,² and Prithviraj Banerjee†,³

*Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois, 1308 West Main Street, Urbana, Illinois 61801; and †Center for Parallel and Distributed Computing, 4386 Technological Institute, Northwestern University, 2145 Sheridan Road, Evanston, Illinois 60208
Simulated annealing is an effective tool for many optimization problems in VLSI CAD, but its time requirements are prohibitive. In this paper, we report parallel algorithms for a well-established simulated annealing based algorithm for the state assignment problem for finite state machines. Our parallel annealing strategy uses parallel moves by multiple processes, each performing local moves within its assigned subspace of the state encoding space. The novelty lies in the dynamic repartitioning of the state space among processors, so that each processor gets to perform moves on the entire space over time. This is important to keep the quality of the parallel algorithm comparable to the serial algorithm. Our algorithm gives quality results within 0.05% of the serial algorithm on 64 processors. Our algorithms, ProperJEDI and PartJEDI, are portable across a wide range of MIMD machines. PartJEDI is memory scalable and is incrementally developed from ProperJEDI, which is data replicated. For a large circuit, ProperJEDI reduces the runtime from 11 h to 10 min on a 64-processor machine. For the same circuit, PartJEDI reduces the runtime from 11 h to 20 min and the memory requirement from 114 MB to 2 MB. © 1997 Academic Press

¹ This research was supported in part by the Semiconductor Research Corporation under Contract SRC 95-DP-109 and the Advanced Research Projects Agency under Contract DAA-H04-94-G-0273 administered by the Army Research Office. We are also grateful to the National Center for Supercomputing Applications for providing us access to their Thinking Machines CM-5.
² E-mail: [email protected].
³ E-mail: [email protected].
⁴ Memory scalability is the property of a parallel algorithm of being able to solve problems of increasing size using a larger number of processors on a parallel machine. If we can solve a problem of size N on P processors, an ideal memory scalable algorithm should be able to solve a problem of size 2N on 2P processors. However, any sublinearly scalable algorithm would suffice; i.e., a problem of size αN on 2P processors, where 1 < α < 2.

1. INTRODUCTION

With the rapid improvement in VLSI technology, circuit design is becoming extremely complex and is placing increasing demands on CAD tools. Parallel processing is fast becoming an attractive solution to reduce the inordinate amount of time spent in VLSI circuit design [1]. Researchers in VLSI CAD have reported parallel implementations for cell placement [2], circuit extraction [3, 4], test generation [5, 6], and logic synthesis [7].¹

In this paper, we present two parallel algorithms, ProperJEDI and PartJEDI, for the state assignment problem of finite state machines (FSMs). Although PartJEDI is slower than ProperJEDI, it is memory scalable⁴ and provides the capability to encode large FSMs on multiple processors. The parallel algorithms are based on the most recent sequential algorithm, JEDI [8]. PartJEDI was incrementally derived from ProperJEDI and illustrates the efficacy of a modular approach: first developing a data-replicated parallel algorithm from a serial algorithm and then making it memory scalable. ProperJEDI and PartJEDI have been implemented on top of the ProperCAD library, a C++ runtime environment for distributed and parallel CAD applications [9]. ProperCAD has been ported to shared memory multiprocessors such as the SGI Challenge and the SunSparcServer1000, distributed memory multicomputers such as the Intel Paragon and the Thinking Machines CM-5, and networks of workstations.

FSMs form an important class of VLSI circuits and are especially useful for designing the control units of microprocessors. Sequential circuits containing a moderate number of latches (e.g., 20) can be viewed as large FSMs containing more than a million states ($2^{20}$). In [10], Ashar et al. generate implicit state transition graphs (ISTGs) for such sequential circuits. They further report that state encoding algorithms such as MUSTANG [11] and JEDI were not able to re-encode many such circuits due to large memory and time requirements. There is thus a clear case for parallel encoding algorithms that can handle FSMs for which the generation of even a conventional state transition graph (STG) is prohibitive. Researchers have tried to use techniques such as FSM decomposition and the isolation of smaller interacting FSMs to make the time and memory requirements manageable. These smaller subcircuits are locally optimized by conventional
techniques such as state minimization and optimal state assignment [10, 12]. While such techniques work in some cases where the decomposition is well defined and easy to extract, the penalty for forgoing global optimization may affect the quality of the results in many cases. We believe that the area requirement is significantly higher for a piecewise, locally optimized sequential machine than for a globally optimized one. In this light, our attempt to parallelize JEDI assumes significance: parallelizing JEDI would greatly aid current sequential circuit optimization tools by providing the option of re-encoding large extracted FSMs globally rather than optimizing poorly decomposed smaller FSMs.

Besides large problem sizes, the compute-intensive nature of JEDI (due to its simulated annealing based encoding scheme) also makes a parallel implementation desirable. JEDI can take several hours for large circuits, as shown in Table III. Further, it offers four different gain heuristics and is run once for each to generate four state assignments, the best of which is chosen; hence the time requirements increase by a factor of 4. Unlike other parallel CAD applications which suffer a loss of quality due to partitioning of the problem across processors, our algorithms maintain strict control over quality degradation. It will be amply demonstrated that a significant reduction in time and memory requirements can be achieved for the state assignment of large FSMs with negligible compromise in quality.

An earlier parallel state assignment algorithm, ProperSTATE, was based on a greedy heuristic for encoding the states [13]. A divide and conquer technique was used, partitioning the state space across processors and allowing each processor to encode on the basis of its local information. ProperSTATE is highly parallel and gives results comparable to MUSTANG, the sequential algorithm on which it is based [11]. JEDI, the most recent state assignment algorithm, improves upon MUSTANG as well as ProperSTATE in quality; in exchange, its time requirement is an order of magnitude more than ProperSTATE's (even on 1 processor). Hence, ProperSTATE and JEDI offer a designer a speed-quality tradeoff. Our parallel implementation of JEDI justifies itself with its remarkable speedups and good quality, thus achieving the best of both approaches. On average, our algorithms give quality within 0.05% of the serial algorithm on 64 processors. For a large FSM, ProperJEDI reduces the runtime from 11 h to 10 min on a 64-processor machine. For the same circuit, PartJEDI reduces the runtime from 11 h to 20 min and the memory requirement from 114 MB to 2 MB.

The rest of this paper is organized as follows: In Section 2, we give the motivation for state assignment algorithms and review some basic definitions. Section 3 discusses related work in state assignment for FSMs and in parallel simulated annealing implementations. Section 4 describes the ProperJEDI algorithm in detail. The heuristics used for quality and error control are discussed in Section 5. The incremental development of PartJEDI from ProperJEDI is described in Section 6. An analysis of time and memory requirements is
given in Section 7. Section 8 discusses the performance of the parallel algorithms on several parallel machines. We conclude the paper with a discussion of future work in Section 9.
2. PRELIMINARIES
In this section we review some definitions and motivate the state assignment problem.

2.1. Definitions

A variable is a symbol representing a single coordinate of the Boolean space (e.g., $a$). A literal is a variable or its negation (e.g., $a$ or $\bar{a}$). A cube is a set of literals and represents the Boolean function corresponding to their product (e.g., $\{a, b\}$ is a cube representing the product $ab$). An expression is a set of cubes and represents the Boolean function corresponding to their disjunction (e.g., $\{\{a, b\}, \{\bar{b}, a\}\}$ is an expression representing the function $ab + \bar{b}a$).

The Hamming distance $H(c_i, c_j)$ between two codes $c_i$ and $c_j$ equals the number of bits in which they differ. It can also be interpreted as the number of empty literals in the intersection of the corresponding cubes that the codes represent. Conversely, the proximity $P(c_i, c_j)$ between two codes $c_i$ and $c_j$ equals the number of identical bits (the number of nonempty literals in the intersection of the corresponding cubes).

The most commonly used representation of an FSM is a state transition graph (STG). An STG $G(V, E, I(E), O(E))$ is a set of vertices $V$ connected by directed edges belonging to $E$. Each edge is labeled by two fields: an input field $I(E)$ and an output field $O(E)$. The vertices are in 1-1 correspondence with the states of the FSM. A directed edge $(v_i, v_j)$ joins $v_i$ to $v_j$ if there is a transition from the state corresponding to $v_i$ to the state corresponding to $v_j$ in the FSM. $I(v_i, v_j)$ is the set of primary inputs that cause the transition from $v_i$ to $v_j$, and $O(v_i, v_j)$ is the set of primary outputs produced on that transition. We denote the numbers of states, edges, inputs, and outputs of the FSM by $N_s$, $N_e$, $N_i$, and $N_o$, respectively. The minimum number of bits used to encode the states of the FSM is $N_b = \lceil \log_2 N_s \rceil$. Correspondingly, the size of the encoding space for the states of the FSM is $N_c = 2^{N_b}$.

In order to make a fair comparison between JEDI and different processor runs of ProperJEDI, we give equal weight to all the benchmarks and assess quality in terms of a scaled count rather than the actual literal count. The scaled literal count of an algorithm with respect to a reference algorithm is defined as

$$LC_{\text{scaled}} = \frac{100}{n} \sum_{i=1}^{n} \frac{\text{literal count of Benchmark}_i \text{ for the given algorithm}}{\text{literal count of Benchmark}_i \text{ for the reference algorithm}}.$$
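As a concrete illustration of the Hamming distance and proximity definitions above, the following is a minimal C++ sketch; the fixed code width kNb is our assumption for the example and is not taken from the paper:

```cpp
#include <bitset>

constexpr int kNb = 4;                   // assumed code width for this example
using Code = std::bitset<kNb>;

// Hamming distance H(ci, cj): number of bit positions in which the codes differ.
int hamming(const Code& ci, const Code& cj) {
    return static_cast<int>((ci ^ cj).count());
}

// Proximity P(ci, cj): number of identical bit positions, so H + P = Nb always.
int proximity(const Code& ci, const Code& cj) {
    return kNb - hamming(ci, cj);
}
```

For example, hamming(Code(0b0110), Code(0b1010)) is 2, and proximity of a code with itself is kNb.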
TABLE I
ProperJEDI vs ProperSTATE on 4 Processors on a SunSparcServer: Time in Seconds and Quality in Literal Counts (LC)

| FSM Benchmarks | MUSTANG Time | MUSTANG LC | ProperSTATE Time | ProperSTATE LC | JEDI Time | JEDI LC | ProperJEDI Time | ProperJEDI LC |
| bbara_bbtas | 83.6 | 1110 | 0.8 | 1049 | 5779.20 | 837 | 679.6 | 716 |
| scf | 67.6 | 1046 | 2.8 | 1016 | 3140.12 | 957 | 178.36 | 981 |
| s298 | 842.4 | 8219 | 5.6 | 8739 | 32,450.8 | 8034 | 3820.1 | 8142 |
| fetchXlog | 11,861.6 | 9240 | 39.2 | 9265 | Timeout | 9291 | 30,954.5 | 9657 |
| dvramXmark1 | 11,434.4 | 10,248 | 155.6 | 9973 | Timeout | 10,290 | 28,916.9 | 10,201 |
| Scaled count | 100 | 100 | 0.8 | 99.1 | 4163.6 | 93.1 | 265.7 | 92.3 |
This prevents the larger benchmarks from dominating the results. The scaled time count in Table I is calculated by a similar formula.

2.2. The State Assignment Problem

Figure 1 shows the state transition graph (STG) for a finite state machine which outputs a 1 on detecting either of the two patterns 0110 or 1010. Three bits are needed to encode the states of this FSM, and there is a one-bit output; this translates to four Boolean functions. Let $x$ represent the input; $y_1$, $y_2$, and $y_3$ the three bits which encode the states of the FSM; and $z$ the output. Consider the following state assignment: A(000), B(001), C(010), D(011), E(100), F(101), G(111). The resulting Boolean equations for the output and next-state functions are

$$z = x\, y_1 y_2 y_3 \quad (1)$$
$$y_1 = x\, \bar{y}_1 y_3 + x\, \bar{y}_1 y_2 + y_1 \bar{y}_2 \bar{y}_3 \quad (2)$$
$$y_2 = x\, \bar{y}_1 \bar{y}_2 y_3 + x\, \bar{y}_2 \bar{y}_3 + x\, \bar{y}_1 \bar{y}_3 \quad (3)$$
$$y_3 = x\, \bar{y}_2 \bar{y}_3 + x\, \bar{y}_1 \bar{y}_2 + x\, \bar{y}_1 y_3 \quad (4)$$

The combinational circuit required to realize these functions requires two 4-input AND gates, eight 3-input AND gates, and three 3-input OR gates, assuming complemented inputs are available. This makes a total of 13 gates. On the other hand, the assignment A(000), B(010), C(011), D(110), E(111), F(100), G(101) yields the following next-state functions:

$$z = x\, y_1 \bar{y}_2 y_3 \quad (5)$$
$$y_1 = x\, y_1 \bar{y}_2 + x\, \bar{y}_1 \bar{y}_2 + x\, y_1 y_2 y_3 \quad (6)$$
$$y_2 = \bar{y}_3 \quad (7)$$
$$y_3 = y_2 \quad (8)$$

The combinational circuit required to realize these functions requires only two 4-input AND gates, two 3-input AND gates, and one 3-input OR gate, making a total of five gates. Thus the state assignment problem involves assigning binary coded values to the states of an FSM in such a way that an estimate of the area of the combinational circuit required to realize the FSM is minimized.

FIG. 1. STG for an FSM (pattern detector).

3. RELATED WORK
In terms of quality, JEDI is the best known algorithm for state assignment targeting multilevel implementations [8]. MUSTANG, another algorithm from Berkeley, is much faster than JEDI but does not achieve the same quality [11]. ProperSTATE, a parallel implementation based on MUSTANG, has been reported in [13]; it handles huge circuits on multiprocessors which MUSTANG fails to handle. In ProperSTATE, clusters of states with similar behavior are encoded closely by each processor based on heuristic gain estimates. The encoding is greedy and prohibits a state from being moved around once it has been encoded. The results from the various processors are combined in a tree-based merge. In contrast, in JEDI the states are moved around in the encoding space using a probabilistic hill climbing heuristic
(simulated annealing). While the quality of parallel JEDI is 9% better than that of ProperSTATE, the runtimes are about 200 times longer. Hence parallelizing both algorithms provides the designer a tradeoff between time and quality. Table I shows ProperJEDI vs ProperSTATE on four processors on the SunSPARCServer1000. Since a majority of the time taken by JEDI is spent in this simulated annealing based encoding, most of the effort in implementing ProperJEDI and PartJEDI went into efficiently parallelizing the encoding phase.

3.1. Parallel Annealing Strategies

A significant amount of the work done in parallel simulated annealing has been for cell placement applications [14]. Using the taxonomy defined in [15], there are three major classes of parallel simulated annealing algorithms: serial-like, altered generation, and asynchronous. Serial-like algorithms preserve the convergence characteristics of the sequential algorithm through the use of single move acceleration or serializable subsets. Kravitz and Rutenbar investigated both approaches and found that the available parallelism is limited and that the approach is more appropriate for shared memory architectures [16]. Speculative computation also falls into this class but has been shown to be ineffective for parallel placement [17].

The second class of parallel simulated annealing techniques, altered generation, is distinguished from serial-like algorithms in that it does not follow the exact search space laid out by the sequential algorithm. This is usually accomplished with a processor or group of processors either exploring a restricted state space or using a restricted search on the entire state space. To ensure proper global convergence, the global state is kept up to date through periodic solution exchanges or with a shared memory architecture. This approach has been followed for cell placement with some success [18].

The final class of parallel simulated annealing algorithms is the asynchronous or "parallel moves" class, where each processor generates and evaluates moves independently. This differs from altered generation methods in that the state space is not restricted: each processor contains information on the entire circuit regardless of whether the global layout information is accurate in the local processor. Obviously, the cost function calculations may be incorrect due to the moves made by the other processors. There are various methods to address the effect of this error, but all involve some form of periodic update. The number of updates is directly related to the average acceptance rate of the particular annealing schedule chosen. Several researchers have pursued this approach for cell placement [2, 18–21].

ProperJEDI is a hybrid of the last two classes described above. Each processor explores a restricted search space, and parallel moves are carried out in the local search space by each processor asynchronously. The novelty of the algorithm lies in the quality control, which is achieved by dynamic repartitioning of the global search space among processors. Repartitioning
achieves the effect of moves being carried out in the global state space. Dynamic repartitioning is explained in detail in Section 5.1.

4. PROPERJEDI: A PARALLEL STATE ASSIGNMENT ALGORITHM
JEDI (like most state assignment algorithms) operates in two phases. In Phase 1 it generates an estimate of the affinity of states to each other: two states which behave similarly in the FSM are given a high gain estimate. Phase 2 encodes the states, using a suitable heuristic to maximize the sum of the estimates generated in Phase 1. We call the two phases the Gain Computation phase and the Encoding phase.

In JEDI, there are four heuristics for generating gain estimates between pairs of states: input dominant, output dominant, coupled, and variation. All of them have been parallelized in ProperJEDI. The input dominant algorithm works on the source states of the transitions of the FSM (the states from which the transitions are triggered). A pair of present states which asserts similar outputs and produces similar sets of next states is given a high gain estimate. This has the effect of maximizing the size of common cubes in the Boolean functions corresponding to the output and next state lines. The output dominant algorithm, on the other hand, works on the destination states of the transitions. Pairs of next states which are produced by similar inputs and similar sets of present states are given high gain estimates; this has the effect of maximizing the number of common cubes in the Boolean functions corresponding to the next state lines. The coupled approach uses a hybrid of these two heuristics. The variation heuristic is similar to the input dominant algorithm, except that the method of calculating gain estimates is different. Since all the heuristics have a similar computation mechanism, we describe only the input dominant algorithm for the sake of simplicity.

4.1. Gain Computation

The input to ProperJEDI is an FSM in the form of a state transition graph (STG), which is replicated on all processors. A gain matrix $M_I$ is constructed in the Gain Computation phase. It is an $N_s \times N_s$ symmetric matrix where each entry $M_I[x, y]$ corresponds to the gain that will be realized if states $s_x$ and $s_y$ are encoded close together. This estimate is equal to the sum of the number of identical output bits in each pair of transitions from states $s_x$ and $s_y$. Mathematically,

$$M_I[x, y] = \sum_{i=1}^{|O_x|} \sum_{j=1}^{|O_y|} P(o_{xi}, o_{yj}),$$

where $O_x$ and $O_y$ are the sets of transitions out of states $s_x$ and $s_y$ in the STG, and $o_{xi}$ is the set of binary outputs produced by the $i$th transition out of state $s_x$. Hence, for calculating the gain matrix $M_I$, each pair of transitions in the STG is considered and the corresponding entry in $M_I$ is updated.
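To make the gain computation concrete, here is a minimal serial C++ sketch of the input dominant estimate; the Edge record, the fixed output width kNo, and the matrix-of-vectors representation are our assumptions for illustration, not the data structures used by JEDI or ProperJEDI:

```cpp
#include <bitset>
#include <vector>

constexpr int kNo = 8;                  // assumed number of primary output bits
using Outputs = std::bitset<kNo>;       // the output field O(E) of one transition

struct Edge { int src; Outputs out; };  // one STG transition (only the fields used here)

// Proximity P(o_i, o_j): the number of identical bits in two output vectors.
int proximity(const Outputs& a, const Outputs& b) {
    return kNo - static_cast<int>((a ^ b).count());
}

// Input dominant gain matrix M_I: every pair of transitions contributes the
// proximity of its output fields to the entry for the pair of source states.
// Pairs sharing a source accumulate on the diagonal, which the cost function ignores.
std::vector<std::vector<int>> gainMatrix(const std::vector<Edge>& edges, int Ns) {
    std::vector<std::vector<int>> M(Ns, std::vector<int>(Ns, 0));
    const int Ne = static_cast<int>(edges.size());
    for (int i = 0; i < Ne; ++i)
        for (int j = i + 1; j < Ne; ++j) {
            const int p = proximity(edges[i].out, edges[j].out);
            M[edges[i].src][edges[j].src] += p;   // symmetric update
            M[edges[j].src][edges[i].src] += p;
        }
    return M;
}
```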
If there are $N_e$ edges in the STG, then for an edge numbered $i$, edge pairs $(i, i+1)$ through $(i, N_e)$ have to be considered for building $M_I$. This gives rise to a triangular workload, which justifies a cyclic distribution of edges. However, a plain cyclic distribution gives rise to a load imbalance of $O(N_e)$, which can be significant for densely connected FSMs. In the parallel algorithm, the edges are therefore divided among processors in a modified cyclic fashion, as shown in Fig. 2. The modified cyclic approach reduces the load imbalance from $O(N_e)$ to $O(P)$, as proved in Lemma 1 below.

LEMMA 1. Modified cyclic distribution reduces the load imbalance from $O(N_e)$ to $O(P)$, where $P$ is the number of processors.

Proof. Let there be $N_e$ edges in the STG and $P$ processors. Order the edges arbitrarily. For an edge numbered $i$, edge pairs $(i, i+1)$ through $(i, N_e)$ have to be considered. Partition the set of edges into contiguous blocks of $P$ edges, as shown in Fig. 2. In a cyclic distribution, Processor 0 performs the heaviest workload, since it gets the first edge of each block to work on; correspondingly, Processor $P-1$ gets the least workload. The difference in the workloads of these processors in any block is $P-1$ edge pairs. Since there are $\lceil N_e/P \rceil$ blocks, the load imbalance adds up to $(P-1)N_e/P$, which is $O(N_e)$. When the modified cyclic approach is used, the work done by any two processors over two consecutive blocks is the same. A load imbalance occurs only due to the last block, if the total number of blocks is odd. As shown above, the maximum load imbalance in any block is $P-1$, which is $O(P)$. Hence the result.

FIG. 2. Motivation for modified cyclic distribution of edges for gain function computation.

Each processor builds a local gain matrix based on the edge pairs assigned to it. These local matrices are summed using a tree-based merge technique ($\log_2 P$ communication steps per processor) to generate the full gain matrix $M_I$ on each processor. Figure 3 gives a detailed description of the parallel algorithm for this phase.

FIG. 3. Gain function computation phase of ProperJEDI (input dominant algorithm).
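The following sketch shows one plausible realization of the modified cyclic distribution; the exact interleaving in Fig. 2 may differ in detail. Blocks of P consecutive edges are assigned in forward order in even blocks and in reverse order in odd blocks, so any two processors do equal work over two consecutive blocks, as in Lemma 1:

```cpp
#include <vector>

// Edge indices assigned to processor p under a modified cyclic distribution.
// Within even-numbered blocks processor p takes offset p; within odd-numbered
// blocks the order is reversed, so the triangular workload (edge i pairs with
// edges i+1..Ne) balances out over every pair of consecutive blocks.
std::vector<int> myEdges(int Ne, int P, int p) {
    std::vector<int> mine;
    for (int block = 0; block * P < Ne; ++block) {
        const int offset = (block % 2 == 0) ? p : (P - 1 - p);
        const int e = block * P + offset;
        if (e < Ne) mine.push_back(e);   // last block may be partial
    }
    return mine;
}
```

Each processor then accumulates its local gain matrix over the pairs (e, e') with e' > e for every assigned edge e, before the tree-based merge combines the local matrices.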
4.2. Encoding Phase

The encoding phase aims at assigning binary codes to the states so that the following weighted sum is minimized:

$$\sum_{i=1}^{N_s} \sum_{j=i+1}^{N_s} M_I[i, j] \, H[b_i, b_j],$$
where $b_i$ and $b_j$ are the binary codes of states $s_i$ and $s_j$ and $M_I[i, j]$ is the gain estimate between them. Intuitively, if two states have a high gain estimate $M_I[i, j]$, then the Hamming distance $H[b_i, b_j]$ between their binary encodings should be small. JEDI achieves this by a probabilistic hill climbing heuristic (simulated annealing). States are randomly assigned to the vertices of the encoding hypercube. The temperature for simulated annealing follows a modified Newtonian cooling schedule, as shown in Fig. 4. At higher temperatures, two vertices are picked at random and their encodings are exchanged if this results in a reduction of the above cost. Once the temperature falls below the stopping temperature, low probability moves are weeded out by ensuring that the two vertices being considered for an exchange are adjacent (the Hamming distance between them is 1). Local minima are avoided by probabilistically making a move even if it increases the cost estimate. This process is repeated until a sufficiently low cost is reached and making more moves appears futile. The stopping criterion for simulated annealing is satisfied when the temperature is below the stopping temperature and there has been no cost change during the four previous temperature points. Figure 4 gives the encoding algorithm of ProperJEDI in detail.

FIG. 4. Encoding phase of ProperJEDI.

The parallel algorithm for the encoding phase has to deal with two main issues. First, it cannot allow any processor to exchange any two vertices: this would cause errors in the state space when many processors simultaneously try to exchange a particular vertex with other vertices. To avoid this, the encoding space is partitioned in a disjoint manner among processors, and each processor attempts a series of moves within its own partition.
But, if moves are restricted to static partitions, potentially useful moves across partitions would never be attempted, resulting in a degradation in quality. Our algorithm dynamically reconfigures the partitions in such a way that all possible exchange moves are explored, and it achieves this in the minimum possible number of reconfigurations. Figure 5 shows two possible partitions of an encoding space of size 16 among four processors; these are the minimum number of partitions needed to attempt any exchange in the global state space. An analysis of quality control using repartitioning is given in Section 5.1.

The second issue is to avoid the accumulation of errors, since a processor does not know what changes have been made outside its local space. A periodic global update of the state space at the time of repartitioning achieves this objective. As a result, there are alternating stages of annealing and state update in ProperJEDI. This strategy of asynchronous parallel moves in the local state space causes some computational redundancy in the algorithm in the form of repeated state updates. When errors accumulate, they induce inaccuracy in the gain estimate of a move. The cost of an encoding is given by $\sum_{i=1}^{N_s} \sum_{j=i+1}^{N_s} M_I[i, j] H[b_i, b_j]$; note that the diagonal entries of $M_I$ and $H$ are defined to be 0 and do not affect the cost function. Clearly, calculating the cost of an encoding from scratch is $O(N_s^2)$, and after each state update the correct cost estimate would have to be recalculated. This computational overhead is avoided by the following observation: when states $s_x$ and $s_y$ are exchanged in a move, the change in cost is given by

$$\sum_{\substack{i=1 \\ i \neq x, y}}^{N_s} (M_I[x, i] - M_I[y, i]) \,(H[b_y, b_i] - H[b_x, b_i]).$$

This expression can be calculated in $O(N_s)$ time and is independent of the initial cost of the encoding. Thus, recalculating the global cost of an encoding at the time of repartitioning can be avoided altogether. This was a significant optimization in our algorithm and contributes to its scalability, as analyzed in Section 7.
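A minimal sketch of this O(N_s) incremental evaluation follows; the matrix-of-vectors representation and the code array mapping states to encoding vertices are our assumptions:

```cpp
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Change in total cost if the codes of states x and y are exchanged, computed
// in O(Ns) time as in Section 4.2: only the distances of x and y to every
// other state i change. A negative result means the move reduces the cost.
int costDelta(const Matrix& M, const Matrix& H,
              const std::vector<int>& code,   // code[s] = encoding vertex of state s
              int x, int y, int Ns) {
    int delta = 0;
    for (int i = 0; i < Ns; ++i) {
        if (i == x || i == y) continue;
        delta += (M[x][i] - M[y][i]) * (H[code[y]][code[i]] - H[code[x]][code[i]]);
    }
    return delta;
}
```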
FIG. 5. An example of dynamic repartitioning (this is also the minimum number of partitions for attempting any exchange in the global state space for this example).
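For illustration, partitions such as those of Fig. 5 can be computed as in the sketch below, where processor p owns exactly those codes whose log2(P) window bits equal p in binary. We assume P is a power of two and Nb < 32, and we let the window wrap around modulo Nb, as described in Section 5.1; the paper's own routine is not shown:

```cpp
#include <cstdint>
#include <vector>

// Codes owned by processor p when the log2(P) bits starting at bit position
// `window` (wrapping modulo Nb) are constrained to the binary value p.
std::vector<uint32_t> localPartition(int Nb, int P, int p, int window) {
    int k = 0;
    while ((1 << k) < P) ++k;                    // k = log2(P), P a power of two
    std::vector<uint32_t> mine;
    for (uint32_t code = 0; code < (1u << Nb); ++code) {
        uint32_t constrained = 0;
        for (int b = 0; b < k; ++b)              // gather the k window bits
            constrained |= ((code >> ((window + b) % Nb)) & 1u) << b;
        if (constrained == static_cast<uint32_t>(p)) mine.push_back(code);
    }
    return mine;
}
```

With Nb = 4, P = 4, and window = 0, processor 0 owns 0000, 0100, 1000, 1100; with window = 2, it owns 0000, 0001, 0010, 0011, matching the example in Section 5.1.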
Another source of computational redundancy is that the algorithm may make unproductive moves in the middle of an annealing stage, where it assumes that everything outside its local space is fixed, as shown in Fig. 7. Although such moves cannot be avoided, they can be minimized. The error control mechanism is discussed in detail in Section 5.2.

5. QUALITY AND ERROR CONTROL
5.1. Quality Control

If each processor were restricted to work only in its local space, quality would suffer. There has to be a mechanism ensuring that any two arbitrary states are potential candidates for an exchange at some point in time. Our dynamic repartitioning scheme ensures this. At each global update step, each processor recalculates the subset of the encoding space in which it will work during the next annealing stage. This is achieved by fixing a subset of $\log_2 P$ bits of the encoding space. These bits are guaranteed to be different on each processor (more precisely, they represent the processor number in binary). At each repartitioning a different set of bits is chosen. For the encoding space of Fig. 6 there are six possible partitions (constraining any two bits out of four). Our algorithm starts from the LSB and constrains sets of $\log_2 P$ consecutive bits at a time. This window of size $\log_2 P$ is moved toward the MSB at each repartitioning in a modulo fashion; i.e., the window jumps back to the LSB after covering the MSB.

Let $N_b$ be the number of bits used to represent a code; the number of constrained bits is $\log_2 P$. We call a bit position free if it is not constrained. To explore all possible exchanges, each bit position has to be free during some annealing stage. The number of free bits in each stage is $N_b - \log_2 P$. Therefore, the minimum number of partitions required is $\lceil N_b / (N_b - \log_2 P) \rceil$. This is also the minimum number of partitions required to carry out an arbitrary exchange in the global encoding space.

As a simple example, consider a four-bit encoding space (16 codes) and a four-processor parallel annealing run on this space. The minimum number of stages needed to enable every possible exchange in this case is 2. During the first stage the two least significant bits are fixed (00 for processor 0, 01 for processor 1, and so on). The local space of processor 0 contains 0000, 0100, 1000, 1100; similarly, the local space of processor 3 contains 0011, 0111, 1011, 1111. A direct exchange between codes 0000 and 1111 can never be carried out in this partition, since they always lie on different processors. The exchange is achieved by dynamic repartitioning as follows. Let processor 0 exchange 0000 and 1100, and processor 3 exchange 0011 and 1111. In the next annealing stage the two most significant bits are constrained; hence, after the repartitioning stage, processor 0 owns codes 0000, 0001, 0010, 0011 and processor 3 owns 1100, 1101, 1110, 1111. Now a local exchange of 0000 and 0011 on processor 0, and of 1100 and 1111 on processor 3, has the net effect of exchanging 0000 and 1111 in the global space. Figure 6 illustrates the entire process.

FIG. 6. An example of indirect exchanges through dynamic repartitioning.

5.2. Error Control

All processors are allowed to anneal their local partitions in parallel. This introduces some error in the cost function estimates; Fig. 7 illustrates how this error can trigger unproductive moves. Hence, an important task of the parallel algorithm is to control this error, which accumulates with each move. The following analysis shows how our algorithm achieves this.

The annealing process consists of alternating stages of annealing and state update. A fixed number of moves is attempted, after which the global state is updated with the latest information. At this stage the temperature is also updated to
its new value. Assume for simplicity that the acceptance rate $A$ is constant over the entire simulated annealing process, and let the number of attempted moves per annealing stage be $k$. The number of successful moves per stage is then $kA$ on average. If the number of processors is $P$, the number of invalid state entries on a processor at the end of an annealing stage is $2(P-1)kA$ (in the worst case, each successful move on some other processor affects two state encodings). The percentage error in the cost function is directly proportional to the ratio of the number of wrong state entries to the total number of states; hence, the worst case error fraction is $2(P-1)kA/N_s$. The task of our parallelization process was to keep this fraction bounded. If the problem size ($N_s$) doubles, the number of attempted moves also doubles. On the other hand, if the number of processors doubles, the parallel algorithm attempts half the moves on each processor. Further, with decreasing temperature the acceptance rate $A$ decreases; our algorithm heuristically increases the number of attempted moves as the temperature drops to offset the decrease in $A$. Hence a bound on $2(P-1)kA/N_s$ is maintained throughout the algorithm. This bound is determined by the initial value of the parameter $k$, as explained in Fig. 4.

FIG. 7. An example of an unproductive move due to error in the cost estimate.
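The bound can be made concrete with a small numeric sketch; worstCaseError restates the expression above, while moveBudget is our illustrative inversion of it rather than the paper's actual rule (the paper fixes k initially and scales the number of attempts up as A falls):

```cpp
// Worst-case fraction of stale state entries after one annealing stage
// (Section 5.2): each of the other P-1 processors accepts about k*A moves,
// and each accepted move invalidates up to two encodings.
double worstCaseError(int P, int k, double A, int Ns) {
    return 2.0 * (P - 1) * k * A / Ns;
}

// One way to pick a per-stage move budget k so that the error stays below
// `bound`; a hypothetical helper for illustration, not the paper's heuristic.
int moveBudget(int P, double A, int Ns, double bound) {
    return static_cast<int>(bound * Ns / (2.0 * (P - 1) * A));
}
```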
6. DATA PARTITIONED IMPLEMENTATION OF JEDI

The parallel JEDI implementation described above is data-replicated. In this section we describe PartJEDI, a memory-partitioned implementation which was incrementally developed from the data-replicated version. The primary memory requirements of ProperJEDI are the intermediate data structures which store the $N_s^2$ gain estimates and the $N_c^2$ Hamming distances in the form of matrices, where $N_s$ is the
number of states in the FSM and $N_c$ is the number of binary codes in the encoding hypercube. To avoid conflicts and inconsistencies during the parallel simulated annealing stages, each processor works on a disjoint partition of the encoding hypercube and exchanges vertices of its own local partition only. Consequently, in order to estimate the cost saving of a move, a processor needs only a part of the distance and gain matrices. The relevant part of the distance matrix, $N_c^2/P$ in size, corresponds to $N_c/P$ columns of $N_c$ Hamming distances, one column for each local code. Correspondingly, the only relevant gain estimates are the ones whose states are encoded on the local partition of the encoding hypercube. In the worst case, all of the $N_c/P$ local codes may be assigned to states; hence, the worst case size of the gain matrix partition which a processor must keep is $N_c/P$ columns of size $N_c$ each. Introducing this modification reduces the memory requirement from $N_c^2$ to $N_c^2/P$. Thus the memory bottleneck in these $O(N^2)$ data structures is reduced by a factor of $P$, the number of processors.

A fair amount of bookkeeping, and communication of a large amount of information, has to be performed to implement this approach. All moves are proposed in the global address space.
The cost evaluation of the moves is done by appropriately mapping the global values of code and state numbers to their corresponding local values. For this mapping, two bookkeeping arrays of size $N_c/P$ are maintained. At each repartitioning stage, a processor gets a different part of the encoding space reassigned to it. The information about the new local codes and states is communicated to it by the processors which owned those particular states and codes in the previous partition, and the bookkeeping arrays are updated to reflect the new partition.

LEMMA 2. Each pair of processors exchanges $N_c/P^2$ encoding vertices at the time of repartitioning.

Proof. Before repartitioning, a set $S_1$ of $\log_2 P$ bits out of $N_b$ is constrained. These bits represent the binary number $i$ for all the local vertices on processor $i$. The size of the local encoding space on each processor is $2^{N_b}/2^{\log_2 P} = N_c/P$. After repartitioning, a different set $S_2$ of $\log_2 P$ bits will be constrained, and processor $i$ will own all the encoding vertices where the bit set $S_2$ represents the binary integer $i$. Conversely, processor $i$ sends the information about all the local vertices
where the bit set $S_2$ represents the binary integer $k$ to processor $k$. By symmetry, for any $0 \le k \le P-1$, exactly $1/P$ of the local vertices on processor $i$ have the bit set $S_2$ representing $k$ in binary. Thus processor $i$ sends the information for exactly $N_c/P^2$ vertices to each processor (including itself).

Let state $s_i$ be encoded on encoding vertex $b_j$. At the time of repartitioning, the processor which owns $b_j$ sends $M_I[i, :]$ and $H[j, :]$ to the new owner of $b_j$. By Lemma 2, $N_c/P^2$ columns of $M_I$ and $H$ are exchanged between every pair of processors. The communication therefore involves $P(P-1)$ messages of size $2N_c^2/P^2$ each and is responsible for a significant time overhead in the implementation. Tables IV and VII show that the time taken by PartJEDI is almost twice the time taken by ProperJEDI on a network of workstations. The quality of the results of the two implementations is identical, because the same moves are attempted with identical results on all the processors.
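A sketch of the bookkeeping such a mapping might use is given below. The paper specifies two arrays of size N_c/P; here a hash map stands in for the reverse (global-to-local) direction, so the structure is an assumption for illustration rather than the ProperCAD implementation:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Per-processor view of the partitioned encoding space: only Nc/P columns of
// H and M_I are kept locally, indexed through these mappings.
struct LocalView {
    std::vector<uint32_t> localToGlobal;             // size Nc/P
    std::unordered_map<uint32_t, int> globalToLocal;

    // Rebuild the mappings after a repartitioning stage, given the newly
    // owned encoding vertices.
    void rebuild(const std::vector<uint32_t>& owned) {
        localToGlobal = owned;
        globalToLocal.clear();
        for (int i = 0; i < static_cast<int>(owned.size()); ++i)
            globalToLocal[owned[i]] = i;
    }

    // Map a globally proposed code to its local column index, or -1 if the
    // code is not owned by this processor.
    int localIndex(uint32_t globalCode) const {
        auto it = globalToLocal.find(globalCode);
        return it == globalToLocal.end() ? -1 : it->second;
    }
};
```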
7. TIME AND MEMORY REQUIREMENTS

Let $P$ be the number of processors. For simplicity, the communication costs are assumed to be independent of the network load. A distributed memory MIMD machine is assumed, with a communication cost model of $\alpha + \beta \times (\text{message size})$, where $\alpha$ is the fixed cost of initiating a message and $\beta$ is the cost incurred per byte of the message [22]. We derive the time bound for ProperJEDI and show that the time requirement reduces by a factor of $P$ for a $P$-processor run asymptotically as $N_s \to \infty$. We also derive the time and memory bounds for PartJEDI, which likewise reduce by a factor of $P$ asymptotically.

7.1. Runtime Analysis of ProperJEDI

Gain computation phase (Fig. 3). Reading and initializing the STG on Processor 0 takes $O(N_s + N_e(N_o + N_i))$ (lines 2–4). This is followed by the computation of the Hamming distance matrix, which takes $O(N_c^2)$ (line 5). For building the local gain matrix $M_I$, each pair of transitions in the STG is considered; this leads to a triangular workload of $O(N_e^2)$ complexity, as shown in Fig. 2. The time taken by each processor to compute its part of the gain matrix is $O(N_e^2/P)$ (lines 6–14). The local components of $M_I$ are summed in a tree-based merge on all the processors, which involves sending $\log_2 P$ messages of size $O(N_s^2)$ from each processor. Thus the communication part of the gain computation phase takes $O(\log_2 P(\alpha + \beta N_s^2))$.

Encoding phase (Fig. 4). Initializing the parameters for simulated annealing takes constant time. Assume that there are $n$ annealing stages in the serial algorithm. The number of moves attempted in annealing stage $i$ is directly proportional to the problem size and inversely proportional to the acceptance rate $A_i$ of the stage; let the number of moves attempted by the serial algorithm in stage $i$ be $k_i N_c / A_i$. By the construction of the parallel algorithm, the number of moves attempted on each processor in annealing stage $i$ is $k_i N_c / (A_i P)$ for a $P$-processor
run. The gain estimate of a move can be calculated in $O(N_s)$ time, as explained in Section 4.2, and all local updates after a move is accepted can be carried out in constant time. Hence, the total time spent in annealing stage $i$ is $O(k_i N_c N_s / (A_i P)) \approx O(k_i N_c^2 / (A_i P))$.

After each annealing stage, the information about the accepted moves is broadcast to all the processors. The number of accepted moves in annealing stage $i$ is $(k_i N_c / A_i) \times A_i = k_i N_c$, so the size of the broadcast message is $O(k_i N_c)$ on average. Hence, the total cost of this communication is $O(P(\alpha + \beta k_i N_c))$. Each processor then receives the information about the accepted moves of the other processors and updates its state space, which takes $O((P-1)k_i N_c)$. Testing for the stopping criterion and resetting the simulated annealing parameters takes constant time.

In practice, $k_i$ is a small constant independent of $i$, and the number of simulated annealing stages is always in the range 1000 to 10,000, depending on when the convergence criterion is satisfied. Thus the runtime of the implementation is dominated by the simulated annealing stages of the encoding phase. As seen from the above analysis, the time spent in annealing scales linearly with the number of processors. For most test cases the number of stages in the parallel algorithm remained the same as in the serial algorithm; the convergence characteristics of the parallel algorithm are thus similar to those of the serial algorithm, and the parallel algorithm succeeds in maintaining quality comparable to the serial algorithm.

7.2. Runtime Analysis of PartJEDI

Gain computation phase (Fig. 8). Reading and initializing the STG on Processor 0 takes $O(N_s + N_e(N_o + N_i))$ (lines 2–4). The initial encoding takes $O(N_s)$ (line 5). This is followed by the computation of the local Hamming distance matrix, which takes $O(N_c^2/P)$ (lines 6–9). Each processor computes the gain estimates for the states it owns. Since the encoding space is cyclically partitioned initially, each processor can own $N_c/P$ states in the worst case. For each of these states, all fanout edges are considered pairwise with all the edges in the STG. If $f_m$ is the maximum fanout degree of any state, the work done by each processor is $O(f_m N_c N_e / P)$.

The encoding phase of PartJEDI is similar to that of ProperJEDI, with some extra communication during the state update at the time of repartitioning. Consider a state $s_x$ encoded on vertex $b_y$. If processor $i$ owns vertex $b_y$ before repartitioning and processor $j$ owns it afterwards, then the gain estimates corresponding to $s_x$ and the Hamming distances corresponding to $b_y$ must be communicated by processor $i$ to processor $j$; in terms of the global address space, $H[y, :]$ and $M_I[x, :]$ are sent from processor $i$ to $j$. By the symmetry of the algorithm, each pair of processors exchanges $N_c/P^2$ encodings, as shown in Lemma 2. Thus the total communication overhead per processor is $O(\alpha + \beta(2(P-1)N_c^2/P^2))$ after each annealing stage; for large $P$ this reduces to $O(\alpha + \beta(2N_c^2/P))$. This overhead leads to a factor of 2 slowdown
of PartJEDI over ProperJEDI on high latency networks such as a network of workstations. The above expression shows that the communication overhead of the state updates scales linearly with the number of processors and is of the same order of complexity as the computation part of a simulated annealing stage. The computation and communication costs of the alternating simulated annealing and state update stages dominate the total runtime of the algorithm. The convergence criterion is the same as that of ProperJEDI, so the analysis of Section 7.1 holds.

FIG. 8. Modified gain computation phase for PartJEDI.

7.3. Memory Requirement of PartJEDI

The primary data structures on each processor are

1. the entire STG, of size $O(N_c + N_e(N_o + N_i))$;
2. the local Hamming distance matrix $H$, of size $N_c^2/P$;
3. the local gain matrix $M_I$, of size $N_c^2/P$ in the worst case;
4. the encoding hypercube, of size $N_c$.

The gain computation phase does not involve any communication. Messages are created dynamically during the state update part of the encoding phase; this is the only part of the algorithm which requires communication between processors. The size of the message for broadcasting the information about accepted moves is $O(N_c)$. The size of the message used to send Hamming distance and gain estimate information is $O(2N_c^2/P)$; this buffer is reused to send the relevant information to each of the remaining $P-1$ processors.

8. EXPERIMENTAL RESULTS⁵
In this section, we describe the results obtained by applying ProperJEDI to various circuits in the MCNC benchmark suite. The output obtained from ProperJEDI is passed through the multilevel logic optimization scripts of SIS, a logic synthesis tool from Berkeley. The quality of a solution is measured as the number of literals in a factored form of the logic after optimization; fewer literals imply less area to implement the combinational logic and hence better quality. Speedups are reported for a SparcServer1000 shared memory multiprocessor, a CM-5 distributed memory multicomputer, and a network of Sun workstations.

Due to a lack of tools for parsing sequential circuits and generating STGs from them, we constructed representative FSMs of large sizes; a cross product generator [13] was used to generate some large FSMs. Results are reported for FSMs containing up to 423 states, which corresponds to a moderate size sequential circuit with 9 latches.

As seen from Tables II, III, and IV, the speedups are good for both shared memory and distributed memory machines and increase with increasing problem size; they range from 50 to 80 for 64-processor runs on the CM-5. We attribute the superlinear speedups seen in Table II to cache effects: since the amount of data accessed by a processor decreases with an increasing number of processors, the cache hit ratio increases.

⁵ The runtimes reported in this paper are just for the input dominant algorithm. JEDI (and ProperJEDI) is actually run once for each of the four gain estimate heuristics.
TABLE II
ProperJEDI on a CM-5 (Distributed Memory Multicomputer): Time in Seconds (Speedups)ᵃ

| FSM Benchmark | 1 PE | 4 PE | 16 PE | 32 PE | 64 PE |
| s1488 | 62.97 (1.0) | 16.11 (3.91) | 4.84 (13.01) | 3.97 (15.86) | — |
| ram_test | 169.45 (1.0) | 37.22 (4.55) | 12.58 (13.47) | 7.31 (23.18) | — |
| bbara_bbtas | 889.02 (1.0) | 210.87 (4.22) | 51.35 (17.31) | 27.94 (31.81) | 17.38 (51.15) |
| scf | 797.57 (1.0) | 174.24 (4.58) | 44.01 (18.12) | 26.91 (29.64) | 15.64 (51.00) |
| s298 | 4944.31 (1.0) | 1008.88 (4.9) | 254.16 (19.4) | 130.06 (38.0) | 69.42 (71.22) |
| fetchXlog | 39,783.7 (1.0) | 9496.1 (4.19) | 1978.8 (20.1) | 988.6 (40.24) | 499.0 (79.73) |
| dvramXmark1 | 36,363.8 (1.0) | 8988.9 (4.04) | 2018.4 (18.0) | 933.3 (38.96) | 469.2 (77.50) |

ᵃ The actual time taken by JEDI (and ProperJEDI) is four times the figures given, since all four weight generation heuristics are used to generate four outputs. The results reported in this paper are just for the input dominant algorithm. Benchmarks s1488 and ram_test were too small to be run on 64 processors.
TABLE III
ProperJEDI on a SparcServer1000 (8-Processor Shared Memory Multiprocessor): Time in Seconds (Speedups)

| FSM Benchmarks | JEDI | 1 PE | 2 PE | 4 PE | 8 PE |
| s1488 | 38.57 | 20.22 (1.0) | 9.91 (2.04) | 8.66 (2.33) | 5.29 (3.82) |
| ram_test | 124.57 | 47.04 (1.0) | 26.18 (1.80) | 20.53 (2.29) | 13.83 (3.4) |
| bbara_bbtas | 684.42 | 273.27 (1.0) | 143.64 (1.9) | 79.57 (3.43) | 46.60 (5.86) |
| scf | 607.95 | 227.18 (1.0) | 116.63 (1.95) | 62.35 (3.64) | 46.74 (4.86) |
| s298 | 3702.31 | 1345.16 (1.0) | 685.84 (1.96) | 334.99 (4.01) | 193.38 (6.96) |
| fetchXlog | 31,052.78 | 13,657.75 (1.0) | 7017.81 (1.95) | 3315.31 (4.12) | 1377.41 (9.91) |
| dvramXmark1 | 28,652.42 | 12,807.65 (1.0) | 6569.63 (1.95) | 3219.63 (3.98) | 1559.78 (8.21) |
Further, the stopping criterion for simulated annealing may be satisfied early and produce superlinear effects. Some optimizations were also performed in the serial code during the course of parallelization; as a result, the uniprocessor implementation of ProperJEDI is more efficient than JEDI, as seen in Tables III and IV.

Table V shows that s1488 and ram_test have 48 and 72 states, respectively, so their encoding state spaces are of size 64 and 128. In the case of s1488, for a 32-processor run each processor has exactly two vertices of the encoding hypercube for any partition, so only one simulated annealing move would be made repeatedly. Hence results have not been reported for such cases, where the problem size is very small relative to the number of processors.

Table V gives the literal counts of the solutions produced by ProperJEDI after optimization through SIS. The results produced by JEDI are the same as those of ProperJEDI on 1 processor, because the same pattern of moves as JEDI is tried, with the same outcome, in the uniprocessor run of ProperJEDI. The quality degradation for multiprocessor runs is minimal; the worst degradation is around 1% for an 8-processor run. Note that with an increasing number of processors the quality remains stable, showing that ProperJEDI scales well with the number of processors. To illustrate the power
of ProperJEDI, observe that the time requirement for fetchXlog has been brought down from 11 h to less than 10 min for a 64-processor run on the CM-5; the corresponding quality loss is just 2.3%.

Since dynamic repartitioning improves the quality, it seems logical that trying more partitions at a given temperature during simulated annealing would further improve the quality. We experimented and found results contrary to our expectation, as seen in Table VI. Since the number of attempted moves is fixed for a particular temperature, more partitions imply fewer attempted moves per partition; hence we would implicitly be restricting our exploration by not allowing sufficient annealing attempts for a given partition. Moreover, in order to overlap computation with communication, the temperature update and repartitioning should be carried out at the same time as the global space update. This makes the choice of exploring one partition per temperature point natural to the algorithm.

Table VII shows the time requirements and speedups of PartJEDI on a network of workstations, where the communication latency is high and the massive communication overhead is very severe. Speedups of around 3 are obtained for a 4-processor run. A comparison of Tables IV and VII shows that the runtimes of PartJEDI are almost twice the runtimes of ProperJEDI.

The efficacy of PartJEDI in reducing memory demands is evident in Table VIII.⁶ For a 64-processor run on the CM-5, the memory required by the dynamic data structures of the algorithm is reduced by a factor of more than 50 for all the benchmarks. (The memory requirement of the static data structures of the algorithm is negligible and hence ignored.) Thus, PartJEDI has the potential to fit very large FSMs and encode them successfully on multiprocessors, whereas JEDI on a uniprocessor may run out of memory.

⁶ Only the memory required by the data structures of PartJEDI is reported. The maximum memory required per node is chosen for multiprocessor runs.
TABLE IV
ProperJEDI on a Network of Sun SPARCstation 5 Workstations: Time in Seconds (Speedups)

| FSM Benchmarks | JEDI | 1 PE | 2 PE | 4 PE |
| s1488 | 55.02 | 27.16 (1.0) | 12.46 (2.18) | 8.21 (3.31) |
| ram_test | 169.25 | 70.52 (1.0) | 32.66 (2.16) | 19.83 (3.56) |
| bbara_bbtas | 904.88 | 350.69 (1.0) | 167.12 (2.1) | 105.23 (3.33) |
| scf | 828.67 | 308.71 (1.0) | 151.80 (2.03) | 95.52 (3.23) |
| s298 | 5066.22 | 1887.25 (1.0) | 827.34 (2.28) | 438.75 (4.3) |
| fetchXlog | 40,591.27 | 15,027.0 (1.0) | 7731.66 (1.94) | 4185.91 (3.59) |
| dvramXmark1 | 37,151.28 | 13,935.09 (1.0) | 7068.96 (1.97) | 4010.16 (3.48) |
TABLE V
Quality Comparison (in Literal Counts) of ProperJEDI for the Input Dominant Algorithmᵃ

| FSM Benchmarks | JEDI | 1 PE | 2 PE | 4 PE | 8 PE | 16 PE | 32 PE | 64 PE |
| s1488 | 741 | 741 | 703 | 710 | 756 | 745 | 710 | — |
| ram_test | 650 | 650 | 677 | 890 | 760 | 867 | 841 | — |
| bbara_bbtas | 996 | 996 | 925 | 832 | 906 | 1178 | 1104 | 1197 |
| scf | 1242 | 1242 | 1154 | 1179 | 1201 | 1176 | 1180 | 1178 |
| s298 | 8143 | 8143 | 8161 | 8466 | 8819 | 8291 | 8597 | 8175 |
| fetchXlog | 9291 | 9291 | 9423 | 9657 | 9304 | 9508 | 9342 | 9508 |
| dvramXmark1 | 10,290 | 10,290 | 10,172 | 10,201 | 10,505 | 10,215 | 10,294 | 10,268 |
| Total literals | 31,353 | 31,353 | 31,215 | 31,935 | 32,251 | 31,980 | 32,068 | 30,326* |
| Scaled count | 100.0 | 100.0 | 99.6 | 101.9 | 102.9 | 102.0 | 102.3 | 101.2* |

ᵃ The scaled count for the 64-processor results (*) is calculated using only the corresponding entries of JEDI.
TABLE VI
Effect of the Number of Partitions per Temperature Point on Literal Count for ProperJEDI

| FSM Benchmarks | 4 Processors, 2 Partitions | 4 Processors, 1 Partition | 8 Processors, 2 Partitions | 8 Processors, 1 Partition | 16 Processors, 3 Partitions | 16 Processors, 1 Partition |
| s1488 | 756 | 710 | 714 | 756 | 766 | 745 |
| ram_test | 741 | 890 | 763 | 760 | 948 | 867 |
| bbara_bbtas | 1051 | 832 | 1047 | 906 | 1252 | 1178 |
| scf | 1220 | 1179 | 1220 | 1201 | 1271 | 1176 |

TABLE VII
Data-Partitioned ProperJEDI (PartJEDI) on a Network of Sun Workstations: Time in Seconds (Speedups)

| FSM Benchmark | JEDI | 1 PE | 2 PE | 4 PE |
| s1488 | 55.02 | 28.74 (1.0) | 16.51 (1.74) | 10.21 (2.83) |
| ram_test | 169.25 | 90.53 (1.0) | 59.17 (1.53) | 35.64 (2.54) |
| bbara_bbtas | 904.88 | 586.64 (1.0) | 366.65 (1.6) | 193.61 (3.03) |
| scf | 828.67 | 476.82 (1.0) | 296.17 (1.61) | 172.14 (2.77) |
| s298 | 5066.22 | 2730.78 (1.0) | 1569.41 (1.74) | 703.81 (3.88) |
| fetchXlog | 40,591.27 | 22,371.38 (1.0) | 12,427.53 (1.79) | 7193.37 (3.11) |
| dvramXmark1 | 37,151.28 | 22,363.13 (1.0) | 13,154.78 (1.7) | 7032.43 (3.18) |

9. CONCLUSIONS AND FUTURE WORK

The compute-intensive nature of simulated annealing forces a designer either to spend a lot of time or to suffer a loss in the quality of the results. The parallelization of such applications gives the designer the option of saving time or of annealing longer to get an improvement in quality. Until now, parallel implementations of annealing applications were restricted mainly to cell placement; we have successfully applied the approach to the problem of state assignment. We have shown that ProperJEDI gives very good speedups on both shared memory and message passing machines. PartJEDI gives impressive memory scaledowns and provides a promising alternative for very large FSMs. In terms of quality, both algorithms perform identical moves and give the same literal counts. The quality is comparable to the serial algorithm, making them an attractive tool set for reducing the time and memory requirements of the design cycle.
TABLE VIII
Memory Requirement per Node for PartJEDI on a CM-5: Memory in Megabytes (Memory Scaledowns)

| FSM Benchmarks | 1 PE | 4 PE | 8 PE | 16 PE | 32 PE | 64 PE |
| s1488 | 1.63 (1.0) | 0.44 (3.7) | 0.22 (7.4) | 0.12 (13.6) | — | — |
| ram_test | 6.2 (1.0) | 1.58 (3.92) | 0.78 (7.95) | 0.4 (15.5) | 0.22 (28.18) | — |
| bbara_bbtas | 7.24 (1.0) | 1.85 (3.91) | 0.98 (7.4) | 0.51 (14.2) | 0.26 (27.85) | 0.14 (51.7) |
| scf | 7.05 (1.0) | 3.6 (1.96) | 0.97 (7.27) | 0.5 (14.1) | 0.25 (28.2) | 0.14 (50.36) |
| s298 | 29.54 (1.0) | 7.5 (3.94) | 3.86 (7.65) | 1.97 (15.0) | 1.02 (28.96) | 0.57 (51.82) |
| fetchXlog | 112.5 (1.0) | 28.35 (3.97) | 14.3 (7.87) | 7.27 (15.5) | 3.74 (30.1) | 1.93 (58.3) |
| dvramXmark1 | 113.81 (1.0) | 28.7 (3.96) | 14.3 (7.96) | 7.21 (15.8) | 3.7 (30.76) | 1.87 (60.9) |
REFERENCES

1. Banerjee, P. Parallel Algorithms for VLSI Computer-Aided Design Applications. Prentice Hall, Englewood Cliffs, NJ, 1994.
2. Rose, J. S., Snelgrove, W. M., and Vranesic, Z. G. Parallel cell placement algorithms with quality equivalent to simulated annealing. IEEE Trans. Comput.-Aided Design 7 (Mar. 1988), 387–396.
3. Belkhale, K. P., and Banerjee, P. Parallel algorithms for VLSI circuit extraction. IEEE Trans. Comput.-Aided Design 10 (May 1991), 604–618.
4. Ramkumar, B., and Banerjee, P. ProperEXT: A portable parallel algorithm for VLSI circuit extraction. Proceedings of the International Parallel Processing Symposium, 1993, pp. 434–438.
5. Patil, S., and Banerjee, P. A parallel branch and bound algorithm for test generation. IEEE Trans. Comput.-Aided Design 9 (Mar. 1990), 313–322.
6. Ramkumar, B., and Banerjee, P. Portable parallel test generation for sequential circuits. Digest of Papers, International Conference on Computer-Aided Design, Santa Clara, CA, 1992, pp. 220–223.
7. De, K., Chandy, J. A., Roy, S., Parkes, S., and Banerjee, P. Portable parallel algorithms for logic synthesis using the MIS approach. Proceedings of the International Parallel Processing Symposium, Santa Barbara, CA, 1995, pp. 579–585.
8. Lin, B., and Newton, A. Synthesis of multiple level logic from symbolic high-level description languages. Proceedings of the International Conference on VLSI, 1989, pp. 187–196.
9. Parkes, S., Chandy, J. A., and Banerjee, P. A library-based approach to portable, parallel, object-oriented programming: Interface, implementation, and application. Supercomputing '94, Washington, DC, 1994, pp. 69–78.
10. Ashar, P., Ghosh, A., Devadas, S., and Newton, A. Implicit state transition graphs: Applications to sequential logic synthesis and test. International Conference on Computer-Aided Design, 1990, pp. 84–87.
11. Devadas, S., Ma, H., Newton, A. R., and Sangiovanni-Vincentelli, A. MUSTANG: State assignment of finite state machines targeting multilevel logic implementations. IEEE Trans. Comput.-Aided Design (Dec. 1988), 1290–1300.
12. Ashar, P., Devadas, S., and Newton, A. Optimum and heuristic algorithms for finite state machine decomposition and partitioning. International Conference on Computer-Aided Design, 1989.
13. Hasteer, G., and Banerjee, P. A parallel algorithm for state assignment of finite state machines. [Submitted for publication]
14. Durand, M. D. Accuracy vs speed in placement. IEEE Design Test Comput. (June 1989), 8–34.
15. Greening, D. R. Parallel simulated annealing techniques. Physica D 42 (1990), 293–306.
16. Kravitz, S. A., and Rutenbar, R. A. Placement by simulated annealing on a multiprocessor. IEEE Trans. Comput.-Aided Design CAD-6 (July 1987), 534–549.
17. Chandy, J. A., and Banerjee, P. Parallel simulated annealing strategies for VLSI cell placement. [Submitted for publication]
18. Banerjee, P., Jones, M. H., and Sargent, J. S. Parallel simulated annealing algorithms for standard cell placement on hypercube multiprocessors. IEEE Trans. Parallel Distrib. Systems 1 (Jan. 1990), 91–106.
19. Casotto, A., Romeo, F., and Sangiovanni-Vincentelli, A. A parallel simulated annealing algorithm for the placement of macro-cells. IEEE Trans. Comput.-Aided Design CAD-6 (Sep. 1987), 838–847.
20. Darema, F., Kirkpatrick, S., and Norton, V. A. Parallel algorithms for chip placement by simulated annealing. IBM J. Res. Dev. 31 (May 1987), 391–402.
21. Kim, S., Chandy, J. A., Parkes, S., Ramkumar, B., and Banerjee, P. ProperPLACE: A portable parallel algorithm for cell placement. Proceedings of the International Parallel Processing Symposium, Cancun, Mexico, 1994, pp. 932–941.
22. Kumar, V., Grama, A., Gupta, A., and Karypis, G. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings, 1993.
GAGAN HASTEER received his B.Tech. in computer science from the Indian Institute of Technology, Delhi, India, in May 1994, and the M.S. in computer science from the University of Illinois at Urbana–Champaign in December 1995, where he is currently pursuing his Ph.D. Gagan worked in parallel algorithms for CAD problems during his M.S. His research areas include formal verification of VLSI circuits with an emphasis on property checking at the RTL level. He is also interested in design for low power. PRITHVIRAJ BANERJEE received the B.Tech. in electronics and electrical engineering from the Indian Institute of Technology, Kharagpur, India, in August 1981, and the M.S. and Ph.D. in electrical engineering from the University of Illinois at Urbana–Champaign in December 1982 and December 1984, respectively. Dr. Banerjee is currently the Walter P. Murphy Chaired Professor of Electrical and Computer Engineering and Director of the Center for Parallel and Distributed Computing at Northwestern University in Evanston, Illinois. Prior to that he was the Director of the Computational Science and Engineering program and a professor of electrical and computer engineering and the coordinated science laboratory at the University of Illinois at Urbana– Champaign. Dr. Banerjee’s research interests are in parallel algorithms for VLSI design automation, distributed memory parallel compilers, and parallel architectures
with an emphasis on fault tolerance; he is the author of over 170 papers in these areas. He leads the PARADIGM compiler project for compiling programs for distributed memory multicomputers and the ProperCAD project for portable parallel VLSI CAD applications. He is also the author of "Parallel Algorithms for VLSI CAD," published by Prentice Hall. He has supervised 17 Ph.D. and 23 M.S. student theses. Dr. Banerjee is the recipient of the 1996 Frederick Emmons Terman Award of ASEE's Electrical Engineering Division, sponsored by Hewlett-Packard. He was elected to the Fellow grade of the IEEE in 1995. He received the University Scholar award from the University of Illinois in 1993, the Senior Xerox Research Award in 1992, the IEEE Senior Membership in 1990, the National Science Foundation's Presidential Young Investigators' Award in 1987, the IBM Young Faculty Development Award in 1986, and the President
of India Gold Medal from the Indian Institute of Technology, Kharagpur, in 1981. Dr. Banerjee has served as Program Chair of the International Conference on Parallel Processing for 1995. He has also served as General Chairman of the International Workshop on Hardware Fault Tolerance in Multiprocessors, 1989. He has served on the Program and Organizing Committees of numerous conferences and symposia. He is an Associate Editor of the Journal of Parallel and Distributed Computing, the IEEE Transactions on VLSI Systems, and the IEEE Transactions on Computers. In the past he has served as the Editor of the Journal of Circuits, Systems and Computers. He is also a consultant to AT&T, Westinghouse Corporation, Jet Propulsion Laboratory, General Electric, the Research Triangle Institute, the United Nations Development Program, and Integrated Computing Engines.
Received November 21, 1996; revised March 17, 1997; accepted March 21, 1997