Microelectronics Journal, Vol. 23, No. 3 (1992) 191-196
QUISC3: A Distributed Processing Silicon Compiler
D. Haines*, S. Penstone and S. Tavares
Department of Electrical Engineering, Queen's University, Kingston, Ontario, Canada K7L 3N6
QUISC3 is a distributed processing silicon compiler, operating under the ELECTRIC Layout Tool. This new version of QUISC is designed to perform distributed processing on a LAN of workstations. It exploits the inherent hierarchy of VHDL to create macrocells on remote network servers, thereby improving both user response time and the allowable design size.
1. Introduction
Silicon compilation is the process of creating a physical layout of an integrated circuit design from either a structural or a behavioral description.
The structural compiler utilizes connectivity data for a set of components to create a layout based on a set of primitive or standard cells. The compiler determines the appropriate placement of these standard cells based on selected criteria, and then performs cell placement and routing to create either a macrocell layout or a complete integrated circuit layout.
Parallel processing applications can be implemented on many different types of computers. Computers such as the Connection Machine are widely used in massively-parallel processing applications, although they cost several million dollars. This type of computer is suitable for parallelism at the instruction level. Specialized shared-memory multiprocessor computers, such as those manufactured by Sequent, are suited to medium-scale parallel processing. This type of system would cost several hundred thousand dollars. The lowest cost alternative is to implement parallel processing on an existing network of engineering workstations, utilizing a communications interconnection such as Ethernet (see Fig. 1). Owing to the physical distribution of the processors, parallel processing on this type of configuration is generally called distributed processing.
QUISC3 is a distributed processing silicon compiler operating under the ELECTRIC layout tool [1]. This new version of QUISC [2] is designed to perform distributed processing on a LAN of workstations. It exploits the inherent hierarchy of VHDL to create macrocells on remote network servers, thereby improving both the user response time and the allowable design size. It is based on two predecessors, QUISC and QUISC2 [3], both of which also run under the ELECTRIC design environment. The QUISC series of compilers is integrated into the ELECTRIC tool, as shown in Fig. 2.
*Present address: 17 Ismailia Cr., Border, Ontario, Canada L0M 1C0.
0026-2692/92/$5.00
© 1992, Elsevier Science Publishers Ltd.
Fig. 1. Network of workstations.
Fig. 2. The ELECTRIC VLSI layout tool. Blocks shown include the VHDL compiler, cell libraries, block partition, place and route (QUISC3), simulators (QUAIL), the ELECTRIC design database, and the basic tools (graphics, DRC, network, router, stitcher, input-output, extractors, user interface, etc.).
2. Workload Partitioning
As chip designs increase in size, the computational time required to perform silicon compilation also increases. However, many of the procedures utilized in the silicon compilation process are of order $O(n^2)$ or $O(n^3)$ [4]. This means that the performance of the computers used to perform these tasks may also have to increase at a rate proportional to $O(n^2)$ or $O(n^3)$ so that the computational run times remain proportional. The problem with this increase in computer performance is that it is very expensive to replace computer equipment, and the required performance can very quickly become too costly for the general user. One solution is to utilize a cluster of cost-effective workstations connected via a local area network, and to distribute the workload among these available machines. This method appears to be very cost-effective, since many such networks of workstations already exist in both industry and academia, awaiting the arrival of software which can utilize them to their full potential.
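As a rough illustration of the growth rates cited above, the following sketch computes how much longer a procedure of a given polynomial order would run when the design grows by some factor. Constant factors and lower-order terms are ignored, which is an assumption, and `runtime_growth` is a hypothetical helper, not part of QUISC3.

```python
def runtime_growth(size_ratio: float, order: int) -> float:
    """Relative increase in run time for an O(n^order) procedure
    when the design size grows by size_ratio (constants ignored)."""
    return size_ratio ** order


# Compare a doubling and a tenfold increase in design size.
for order in (2, 3):
    print(f"O(n^{order}): 2x design -> {runtime_growth(2, order)}x time, "
          f"10x design -> {runtime_growth(10, order)}x time")
```

Under this simplification, doubling the design size multiplies the run time by 4 for a quadratic procedure and by 8 for a cubic one.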
When redesigning an algorithm or a program for use on a distributed processing system, the main task is to partition the workload efficiently [5]. In QUISC3, as in its predecessor QUISC2, the partitioning of the layout is done by the user, who selects which macrocells in the VHDL hierarchy are to be retained. With a hierarchical approach to task distribution, the layout is created starting from the lowest level of the hierarchy and proceeds upward until only the top level of the tree remains. This is more effective than the alternative method of creating the complete layout on one flat level and assigning an area of the overall layout to each processor. In the flat approach, each processor's result depends on the results of the others, so intermediate results must be communicated to all of the other processors, and a large amount of time would be spent exchanging them.
Partitioning the design into macrocells of approximately equal size, and allowing smaller complex cells to be absorbed into the larger macrocells, helps to keep the workloads approximately equal (see Fig. 3). Performing a complete action on each macrocell on one processor keeps the overhead as low as possible. For instance, when the estimate or placement of a block is required, the complete action is performed by the remote processor. Although the time required to set up a communication channel, send the data to the remote processor, and return the result to the client is relatively significant, its impact is minimized by assigning a complete task to a processor [6].
Fig. 3. Absorption of complex cells into a hierarchy.
The total time required for the whole layout process depends on several factors. One of these is the workload on each processor, since on multi-user networks other users may also be consuming processor resources. Another is network traffic, which is random in nature and, like workload, is partially caused by other users on the network. A further factor is the total physical resources available on the network: more physical memory can significantly decrease the actual time to create a layout, since virtual memory is much slower owing to the required page-swapping to and from disk.
Provided that an identical partitioning is done, the overall layout created with QUISC3 is identical to that created with QUISC2, since the same set of algorithms is used to perform the estimation, floorplanning, placement and routing tasks.
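The absorption of small complex cells into larger macrocells can be sketched as a bottom-up pass over the design hierarchy. This is only an illustration under stated assumptions: in QUISC3 the user chooses which macrocells to retain, whereas here a hypothetical size threshold (`min_size`) makes that choice, and the `Cell` type, the example hierarchy and its cell counts are all invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    std_cells: int                        # standard cells placed directly in this cell
    children: list["Cell"] = field(default_factory=list)


def _flatten(cell, min_size, retained):
    """Return the size of `cell` after absorbing its small children."""
    size = cell.std_cells
    for child in cell.children:
        child_size = _flatten(child, min_size, retained)
        if child_size >= min_size:
            retained[child.name] = child_size   # kept as a separate macrocell
        else:
            size += child_size                  # absorbed into its parent
    return size


def partition(top, min_size):
    """Map each retained macrocell (including the top cell) to its size."""
    retained = {}
    retained[top.name] = _flatten(top, min_size, retained)
    return retained


# Hypothetical hierarchy: 'adder' and 'ctrl' are too small to justify a remote task.
design = Cell("top", 10, [
    Cell("alu", 400, [Cell("adder", 30)]),
    Cell("ctrl", 50),
    Cell("regfile", 450),
])
print(partition(design, min_size=100))
```

With these invented numbers, `alu` (430 cells after absorbing `adder`) and `regfile` (450 cells) remain as macrocells of similar size, while `ctrl` is absorbed into the top-level cell.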
3. General Description
The overall program consists of a master (client) program and a set of remote servers. The user interacts with the master program, which in turn transmits the required data and commands to the remote servers. The servers then execute their commands and return the results to the client. The partitioning of the design is done by user interaction with the master program. The master then schedules each task once all of its prerequisites have been satisfied. This scheduling process is dynamic, since it is re-evaluated each time a server becomes available. This prevents multiple independent tasks from waiting for the same server while another server sits idle, a possible consequence of static scheduling. VHDL does not permit recursive structures in a design, thereby eliminating the possibility of deadlock during the creation of a layout.
Since this is a conversion of an existing sequential program to operate in a parallel mode, only the main layout resolver module has actually been parallelized. This module, however, is the point from which all of the various layout modules are called. By installing the remote execution module (REM) into this resolver module, the required layout module procedures are performed in parallel.
The REM is made up of three main submodules (see Fig. 4):
The Task Scheduler Module (TSM), which is responsible for ensuring that all of the required sub-tasks are dynamically scheduled and monitored for completion status.
The Data Transfer Module (DTM), which is responsible for the marshalling of data, the process of pointer resolution, and the transfer and receipt of the necessary data to and from the remote servers. A DTM also exists on each server to perform the same operations for the remote machine.
The Data Resolution Module (DRM), which is responsible for the reconstruction of the design database and the merging of the results from the remote servers into the existing database.
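The advantage of dynamic over static scheduling described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the TSM itself: tasks are independent, their durations are known and invented for the example, static scheduling is modelled as a round-robin pre-assignment, and dynamic scheduling hands the next task to whichever server becomes free first.

```python
import heapq


def static_makespan(durations, servers):
    """Tasks are pre-assigned round-robin; each server runs its own queue."""
    loads = [0] * servers
    for i, d in enumerate(durations):
        loads[i % servers] += d
    return max(loads)


def dynamic_makespan(durations, servers):
    """Each task goes to the server that becomes available first."""
    free_at = [0] * servers           # time at which each server is next idle
    heapq.heapify(free_at)
    finish = 0
    for d in durations:
        start = heapq.heappop(free_at)
        end = start + d
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


durations = [8, 1, 8, 1]              # hypothetical task times
print(static_makespan(durations, 2))  # both long tasks queue on the same server
print(dynamic_makespan(durations, 2))
```

In this example the static assignment leaves one server idle while both long tasks wait on the other, whereas the dynamic assignment keeps both servers busy.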
Fig. 4. QUISC3 internal block diagram.
4. General Operation
The user creates a complete layout by progressing through the required steps of the layout process:
(1) The floorplan is created for all flexible macrocells. This requires size estimates of each lower macrocell. The operations for all macrocells below the top level cell are passed to remote servers for execution.
(2) The placement is created for the top macrocell. This requires that the placement and routing be completed for any lower macrocells to fix their exported port positions. The operations for all macrocells below the top level cell are passed to remote servers for execution.
(3) The routing on the top level macrocell is then done. Since the placement and routing of all lower cells have already been completed by this point, this is a relatively trivial task, requiring only that the master program complete the routing between the macrocells of the next lower level of the hierarchy.
(4) The "make" task actually creates the layout design in the database of the VLSI Layout Tool (VLT), which in this case is ELECTRIC. This is done directly by the master process.
Through all of these steps in the layout creation process the user may view and evaluate the result at each stage. Desired changes can then be implemented, thereby modifying the layout, with one or more layout steps then being executed again. This iterative process allows the user to modify and improve the layout, while providing the capability to create and repeat each step of the layout incrementally.
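The ordering constraints among the steps above can be written down as a prerequisite graph. The sketch below uses Python's standard `graphlib.TopologicalSorter` to group tasks into batches whose prerequisites are all satisfied. The two-macrocell hierarchy (`A`, `B` under `top`), the task names, and the assumption that lower-level placement and routing wait for the top-level floorplan are all hypothetical choices made for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
deps = {
    "estimate:A": set(),
    "estimate:B": set(),
    "floorplan:top": {"estimate:A", "estimate:B"},      # step (1)
    "place_route:A": {"floorplan:top"},                 # assumed dependency
    "place_route:B": {"floorplan:top"},                 # assumed dependency
    "place:top": {"place_route:A", "place_route:B"},    # step (2)
    "route:top": {"place:top"},                         # step (3)
    "make:top": {"route:top"},                          # step (4)
}


def ready_batches(graph):
    """Group tasks into successive batches that could run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                  # raises CycleError on a recursive structure
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in ready_batches(deps):
    print(batch)
```

In this sketch the tasks for `A` and `B` fall into the same batch and could be sent to different servers, while the tasks on `top` form single-task batches that stay on the master.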
5. Results
A series of layouts was performed with QUISC3 utilizing different configurations of distributed servers running on SUN4 workstations. Actual user clock time was reported by the program at various stages of the layout process. The actual time required for the program to complete a layout procedure was used as the metric of performance improvement. All testing was carried out when the workstations were not in use and the network was relatively idle.
5.1 Layout of MBUB
The first test layout was performed on a circuit known as MBUB, which implements a bubble sorting contender stack. The circuit consisted of a total of 1702 standard cells, and was partitioned into four main sub-blocks of 85, 87, 139 and 1391 cells. Test layouts were performed for MBUB without distributed processing, and then with 2, 3 and 4 servers. The results of these tests are shown in Table 1.

TABLE 1
Layout times in seconds for MBUB

                         Number of servers
Layout operation            0      2      3      4
Compile design              5      4      4      4
Partition design           19     20     20     20
Create estimate           666    349    332    321
Create floorplan            0      1      1      1
Create placement         1017    505    489    471
Create routing              2      3      2      2
Create actual layout       51     10     11     13
Total time required
Effective speed up                2.1   1.03   1.03

Summarizing the results, it can be seen that the layout using only 2 servers was completed 2.1 times faster than using no remote servers, while increasing the number of servers to 3 or 4 produced negligible improvement. The super-linear speed up can be attributed to several possible factors, such as the lower processor loading on the remote servers and the lesser memory requirements to create sub-blocks, thereby reducing paging and swapping of memory to disk. The negligible improvement observed when increasing beyond 2 servers is attributable to the large disparity in the size of the sub-blocks. The time required to complete the layout of the largest block of 1391 cells is much greater than the time required for the others, so that the layout of all remaining blocks can be completed on one server prior to the completion of the largest block.
5.2 Layout of VITERBI Test Cell
To enable QUISC3 to perform the layout of a larger chip with more evenly sized sub-blocks, a VHDL description was created which contained three identical sub-blocks, each consisting of a complete VITERBI implementation. Each VITERBI sub-block consists of 566 cells, for a total of 1698 cells. The layout process times are shown in Table 2.

TABLE 2
Layout times in seconds for VITERBI test cell

                         Number of servers
Layout operation            0      2      3
Compile design              8      8      8
Partition design           54     60     56
Create estimate          1084    236    175
Create floorplan            0      2      2
Create placement         1896    745    553
Create routing              8      8      8
Create actual layout      244    230    208
Total time required      3294   1289   1010
Effective speed up               2.55   1.27
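The effect of sub-block size on the benefit of additional servers, for both test layouts, can be sketched with a simple model. The cost model (time proportional to the square of the cell count, loosely motivated by the $O(n^2)$ procedures mentioned in Section 2) and the greedy longest-first assignment are assumptions made for illustration. They are not the scheduling used by QUISC3, and `makespan` is a hypothetical helper. Data transfer overhead and work done on the master are ignored.

```python
import heapq


def makespan(block_sizes, servers, cost=lambda n: n * n):
    """Finish time when blocks are handed, longest first, to the
    least-loaded server (assumed cost model: cost(n) = n^2)."""
    loads = [0] * servers             # accumulated cost per server
    heapq.heapify(loads)
    for c in sorted((cost(n) for n in block_sizes), reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + c)
    return max(loads)


MBUB = [85, 87, 139, 1391]            # sub-block sizes from Section 5.1
VITERBI = [566, 566, 566]             # sub-block sizes from Section 5.2

for name, blocks in (("MBUB", MBUB), ("VITERBI", VITERBI)):
    for k in (2, 3, 4):
        print(name, k, makespan(blocks, k))
```

Under this model the MBUB finish time is fixed by the 1391-cell block for any number of servers from 2 upward, whereas the modelled time for the VITERBI sub-blocks halves when a third server is added. The model leaves out communication costs, so it overstates the gain that can be expected in practice.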
In the case of this layout, super-linear speed up (a factor of 2.55) was again obtained when utilizing 2 servers, although now an improvement was also observed when increasing the number of servers to 3. However, the overhead required to transport the necessary data between the client and the servers is sufficient to limit this improvement to only 27% (a factor of 1.27) when the additional server is added.
6. Conclusions
The distributed processing implementation of the QUISC silicon compiler was developed both to test the feasibility of parallelizing an existing sequential program and to create a demonstrable implementation of a complete distributed silicon compiler. The aim was to create a complete system, as opposed to a specialized implementation of a particular distributed algorithm, which may perform either parallel floorplanning or parallel routing on a specialized computer system. The program successfully performs the complete layout task and will create a layout identical to that produced by its sequential counterpart. The present implementation utilizes remote servers running concurrently on both Sun3 and Sun4 workstations connected via Ethernet. The data transfer between remote processors was done utilizing the Sun External Data Representation Protocol, XDR [7], which will allow this program to be ported to workstations from other manufacturers, such as the DEC 3100 or the IBM RS/6000. In summary, QUISC3, the new distributed processing version of QUISC, runs under the ELECTRIC Design Tool and successfully performs the layout of large integrated circuits, utilizing a network of general purpose engineering workstations.
References
[1] S. M. Rubin, ELECTRIC: An Integrated Aid for Top-Down Electrical Design, Schlumberger Palo Alto Research, 1987.
[2] A. R. Kostiuk, QUISC: An interactive silicon compiler, M.Sc. Thesis, Queen's University, 1987.
[3] D. Yurach, QUISC2: An interactive extensible hierarchical silicon compiler, M.Sc. Thesis, Queen's University, 1989.
[4] B. P. Preas and P. G. Karger, Automatic placement: A review of current techniques, Proc. 23rd Design Automation Conf., 1986, pp. 622-629.
[5] R. Kling and P. Banerjee, Concurrent ESP: A placement algorithm for execution on distributed processors.
[6] J. Marantz, Exploiting parallelism in VLSI CAD, Proc. International Conference Computer Design, Oct 1986.
[7] Sun Microsystems, XDR: External Data Representation Protocol, Sun Microsystems, Inc., Mountain View, CA, USA, 1990.