JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 35, 91–96 (1996)
ARTICLE NO. 0071

Comparing SIMD and MIMD Programming Modes

RAVIKANTH GANESAN, KANNAN GOVINDARAJAN, AND MIN-YOU WU

Department of Computer Science, State University of New York, Buffalo, New York 14260

The Connection Machine CM-5 supports both SIMD and MIMD programming modes. The SIMD mode is simulated over the MIMD mode. This simulation is likely to lead to a loss of performance in SIMD programs. This paper describes a comparison of the two programming modes with CM Fortran and message-passing Fortran. Two kinds of benchmarks are discussed. The first kind consists of synthetic benchmarks in which we measure the time for basic arithmetic operations and the communication time. The second kind consists of application benchmarks. The experimental results conclusively show that message-passing Fortran performs considerably better than CM Fortran. While the CM-5 is obsolete, these issues show up in the T3D and other current machines. © 1996 Academic Press, Inc.

1. INTRODUCTION

The current parallel supercomputers have been developed along two major architectures, the SIMD (single instruction multiple data) architecture and the MIMD (multiple instruction multiple data) architecture. The SIMD architecture consists of a central control unit and many processing units [5]. Only one instruction can execute at a time, and every processor executes the same instruction. Advantages of a SIMD machine include its simple architecture and its synchronous control structure, which makes programming easy [9]. The designers of SIMD architectures have been motivated by the fact that an important, though limited, class of problems fits the SIMD architecture extremely well [6, 11]. The MIMD architecture is based on the duplication of control units, a separate unit for each individual processor. Different processors can execute different instructions at the same time [7]. It is more flexible for different problem structures and can be applied to general applications. This, however, means that MIMD computers are more difficult to program, especially when programmers are forced to program at the level of explicit message-based communication. Although some machines, such as PASM [10] and OPSILA [2], have been built for research purposes, commercially available parallel machines are typically either pure MIMD machines or pure SIMD machines. The Connection Machine CM-5 is unique in that it supports programming in both styles [15]. The architecture of the CM-5, however, is inherently MIMD, and the SIMD programming mode is simulated over it. The purpose of this paper is to benchmark the MIMD and SIMD programming modes with message-passing Fortran and CM Fortran, respectively, and to bring out the difference in speed between the two modes.

2. METHODS AND ASSUMPTIONS

The CM-5 is a highly scalable parallel computer. Each node of the CM-5 is a SPARC processor running at 33 MHz and rated to give a performance of 22 MIPS or 5 MFLOPS. Each node has four vector units. Each node has its own main memory (8, 16, or 32 MBytes) and a cache (32 or 64 KBytes). The nodes can be space-shared by configuring them in different partitions. Time sharing is done on each partition; i.e., all the nodes in a partition are allocated to a single task during its time-sharing slot. The control network and the data network support point-to-point communication, broadcast, and reductions. Some architecture and performance issues have been discussed in [8, 16].

2.1. The Programming Environment

A CM-5 parallel program runs on all nodes of the partition in which it was initiated. The program can only access addresses within its own partition. Each partition is supervised by a partition manager that serves requests from programs for access outside the respective partition. The CM-5 supports both SIMD and MIMD programming modes.

2.1.1. The SIMD Programming Mode. The SIMD programs were written in CM Fortran [12]. CM Fortran, henceforth referred to as CMF, allows the user to map array elements or array sections onto different virtual processors by using compiler directives such as cmf$layout and cmf$align, which control the layout of arrays [12]. CMF presents the user with a logical view of the machine composed of any number of virtual processors: array elements are mapped onto virtual processors, and a set of these virtual processors is then run on each physical processor. The number of physical processors is hidden from the user. This means that parallel algorithms that involve a high degree of data parallelism can be formulated naturally as CMF programs. This is the major advantage of the SIMD mode.


The CM-2 implemented virtual processors entirely in firmware. The CM-5, instead of using emulation, relies on software technology to support virtual processors [15]. CM-5 compilers for CM Fortran generate control-loop code and runtime library calls to be executed by the processors. This adds further opportunities for compile-time optimization.

2.1.2. The MIMD Programming Mode. The MIMD programs were written in message-passing Fortran [14]. The source program is compiled with the Fortran 77 compiler and is linked with the CM message-passing library, CMMD. Message-passing programs on the CM-5 consist of a host program, which runs on the host (partition manager), and a node program, which runs on all the nodes. The node program is the user-written application program. Programs can be written in two styles, based on who writes the host program:

1. Host/node programming style: the code to be run on the host processor is written by the user. This style of programming will be deprecated in future versions of CMMD [14].
2. Hostless programming style: the code to be run on the host processor is the standard host program supplied by the CMMD library itself. This standard host program initiates the node programs, performs I/O for the node programs, and acts as an I/O server for the nodes. This style is recommended for writing message-passing programs in [13].

All MIMD codes for this experiment were written in the hostless programming style.

2.2. Important Assumptions

This work was done on a 32-node CM-5 at the Northeast Parallel Architectures Center (NPAC), Syracuse University. Each node has a 32-MByte main memory and a 64-KByte cache. The performance presented in this paper was measured with CMOST 7.2 between January 10 and January 15, 1995. The objective of this work is to compare the performance of the SIMD and MIMD programming modes of the CM-5. Some of the important assumptions are the following:

1. In our benchmark programs we have used arrays of similar shape. The compiler places different arrays of the same shape in the same set of virtual processors and places their corresponding elements in the same virtual processor. In this case, elemental operations between the arrays require no interprocessor communication.
2. For the synthetic benchmarks to make sense, we assume that the corresponding calls in the CMF and CMMD libraries are indeed comparable. This means that the CMF calls that perform operations such as shift, sum, maximum, and minimum and the CMMD library calls that perform the same functions can be compared. Moreover, the data layouts of the two modes are made exactly the same.
3. We assume that the fair way to compare applications written in the two styles is to code the same algorithm in both the SIMD and MIMD styles. For example, in Gaussian elimination we used the same partitioning scheme and the same back-substitution scheme when comparing the running times under the two modes.
4. In all SIMD programs, data arrays are stored in the node processors. The host processor (front end) is used only for I/O.
5. In our subsequent discussions, the SIMD mode has vector units and the MIMD mode is without vector units. Programs written in the MIMD mode make calls to the CMMD library functions.
6. We use word-sized data (messages) in our synthetic benchmarks. Each word in our implementation consists of four bytes.
7. All MIMD test programs are written in the SPMD (single program multiple data) style.

3. THE SYNTHETIC BENCHMARKS

The synthetic benchmarks were written to compare typical operations that parallel programs would need to perform as part of their computation. The operations that we compare can be broadly classified into two categories:

1. Arithmetic benchmarks.
2. Communication benchmarks.

3.1. Arithmetic Benchmarks

The arithmetic benchmarks comprise the three basic arithmetic operations: addition, multiplication, and division. This suite of benchmarks was chosen to test pure processor performance. There is no data dependency among the processor operations. We present the average execution time over five runs for each operation in Table I. The vector length is 1000. In the current implementation, the SIMD mode uses the vector units but the MIMD mode does not. The table shows that SIMD is much faster, indicating a significant improvement from the vector units for computation-intensive applications.

3.2. Communication Benchmarks

The communication benchmarks consist of the standard communication patterns: node–node communication, circular shift, broadcast, and reduction. Table II lists the corresponding calls that were used to benchmark each of these patterns. The benchmark results are shown in Tables III and IV.

TABLE I
Timing Results for Arithmetic Synthetic Benchmarks (in Microseconds)

Type     Mode    Add      Multiply   Divide
Integer  MIMD    0.738    1.852      4.576
         SIMD    0.084    0.085      0.187
Real     MIMD    0.708    0.699      1.124
         SIMD    0.084    0.085      0.141
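The measurement methodology behind Table I (elementwise operations on length-1000 vectors, averaged over five runs) can be mimicked on any machine. The following Python sketch is a hypothetical stand-in for the paper's Fortran benchmarks, not the original code; the helper name `time_elementwise` is our own:

```python
import time

VECTOR_LEN = 1000   # vector length used in the paper's arithmetic benchmarks
RUNS = 5            # timings in Table I are averages over five runs

def time_elementwise(op, runs=RUNS, n=VECTOR_LEN):
    """Average per-element time (in microseconds) of an elementwise operation."""
    a = [float(i + 1) for i in range(n)]
    b = [float(i + 2) for i in range(n)]
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        _ = [op(x, y) for x, y in zip(a, b)]
        total += time.perf_counter() - start
    return (total / runs) / n * 1e6  # microseconds per element

for name, op in [("add", lambda x, y: x + y),
                 ("multiply", lambda x, y: x * y),
                 ("divide", lambda x, y: x / y)]:
    print(f"{name}: {time_elementwise(op):.3f} us/element")
```

The absolute numbers such a sketch produces depend entirely on the host and language runtime; only the relative ranking of the operations is comparable to Table I.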


TABLE II
Correspondence of Calls for CM Fortran and Message-Passing Fortran

Benchmark pattern          Message-passing Fortran                              CM Fortran
Node–node communication    CMMD_send( ), CMMD_receive( )                        Assignment of array subsections
Circular shift             CMMD_send_and_receive( )                             CSHIFT( )
Broadcast                  CMMD_bc_to_nodes( ), CMMD_receive_bc_from_node( )    SPREAD( )
Reduction                  CMMD_reduce_v( )                                     MAXVAL( ), ANY( )

3.2.1. Node–Node Communication. The communication time TC is modeled by the following equation:

    TC(l) = TS + TB × l,

where l is the message length in words, TS is the startup time, and TB is the time for transferring one word [3, 4]. This model is a good fit to the experimental data obtained for the communication times we have measured on the CM-5. In the MIMD implementation we use the blocking send CMMD_send( ) and receive CMMD_receive( ) calls of the CMMD library, so we just measured the time taken for a single message to be sent and received.

Another issue in measuring the time taken for node–node communication is the influence of the interconnection network on the time taken for message passing between two nodes. The routing algorithm for the data network is very simple. The network interface compares the physical destination address to its own and determines how far up the tree the message must travel. The message can then take any path up the tree, which allows the switches to do load balancing on the fly. Once the message has reached the necessary height in the tree, it must then travel down a particular path which does not depend on the path taken upward. Messages that had to go higher up in the fat tree to reach their eventual destination took longer, but the difference was not appreciable.

CM Fortran provides the user with a logical view of the machine in which there are many virtual processors. There is no explicit way of sending messages from one virtual processor to another. Each virtual processor is allowed to access array elements which are in another virtual processor. If the number of virtual processors is the same as the number of physical processors, we assume that each physical processor gets exactly one virtual processor. The need for communication between virtual processors (which arises when a virtual processor accesses an element of an array in another virtual processor) results in a message between the corresponding physical processors. This means that the communication times that we measure in the SIMD mode are all genuinely interprocessor communication times: since no two virtual processors are on the same physical processor, the node–node communication time has no component corresponding to two virtual processors residing on the same processor (in which case the communication between them would just involve an on-chip copy). In both modes, an array was sent as a whole instead of one element at a time. The following equations are calculated from Table III.
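The model parameters TS and TB can be recovered from measured data by an ordinary least-squares fit. The sketch below (plain Python; `fit_linear` is our own helper, not part of the paper's toolchain) applies such a fit to the MIMD node–node times of Table III. It yields values in the same range as the TS = 68.3 and TB = 0.479 reported in the text, though the paper does not state its exact fitting procedure, so small discrepancies are expected:

```python
def fit_linear(lengths, times):
    """Ordinary least-squares fit of T(l) = Ts + Tb*l; returns (Ts, Tb)."""
    n = len(lengths)
    mx = sum(lengths) / n
    my = sum(times) / n
    tb = sum((x - mx) * (y - my) for x, y in zip(lengths, times)) \
         / sum((x - mx) ** 2 for x in lengths)
    ts = my - tb * mx
    return ts, tb

# MIMD node-node times from Table III (message length in words, time in us)
lengths = [1, 10, 100, 1000]
mimd_times = [68.4, 73.5, 99.4, 547]
ts, tb = fit_linear(lengths, mimd_times)
print(f"Ts ~ {ts:.1f} us, Tb ~ {tb:.3f} us/word")
```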

• The equation for the MIMD communication time is TC(l) = 68.3 + 0.479 × l microseconds. This compares well with the values of TS and TB obtained in [1] in their measurements of the MIMD mode on the CM-5.
• The equation for the SIMD communication time is TC(l) = 212 + 21.1 × l microseconds.

MIMD is faster than SIMD for all message lengths since it has a shorter startup time and a higher bandwidth.
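Evaluating the two fitted models side by side makes the gap concrete: SIMD is slower at every message length, and the ratio grows from roughly 3x at one word to nearly 40x at 1000 words. A quick check in Python, using the equations above:

```python
def mimd_time(l):
    # Fitted MIMD node-node model from the text (microseconds)
    return 68.3 + 0.479 * l

def simd_time(l):
    # Fitted SIMD node-node model from the text (microseconds)
    return 212 + 21.1 * l

for l in [1, 10, 100, 1000]:
    print(f"l={l:4d}: MIMD {mimd_time(l):8.1f} us, SIMD {simd_time(l):8.1f} us, "
          f"ratio {simd_time(l) / mimd_time(l):.1f}x")
```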

TABLE III
Average Timing Results of Node–Node Communication Synthetic Benchmarks (in Microseconds)

Size   Mode   Node–node
1      MIMD   68.4
       SIMD   213
10     MIMD   73.5
       SIMD   431
100    MIMD   99.4
       SIMD   2317
1000   MIMD   547
       SIMD   21333

TABLE IV
Timing Results of Circular Shift, Broadcast, MAX, and OR Synthetic Benchmarks (in Microseconds)

Size   Mode   CSHIFT   BC      MAX     OR
1      MIMD   104      6.45    6.15    6.15
       SIMD   142      181     256     193
10     MIMD   115      47.1    17.3    17.1
       SIMD   157      257     397     319
100    MIMD   199      454     134     134
       SIMD   499      1140    1950    1800
1000   MIMD   1010     4540    1300    1300
       SIMD   4043     10100   18900   16600


The high startup time and low bandwidth make SIMD less competitive with MIMD in communication. This is because there is an additional cost involved in using the vector units in the SIMD mode, which does not pay off in these benchmarks since no arithmetic operations are being performed.

3.2.2. Circular Shift. The circular shift is a regular communication pattern. Vectors of data sizes 1, 10, 100, and 1000 were shifted one unit in each measurement run. We present the average time over five runs of the circular shift benchmark for each data size in Table IV. The following equations are calculated from the circular shift results in Table IV.

• The equation for the MIMD communication time is TC(l) = 104 + 0.89 × l microseconds.
• The equation for the SIMD communication time is TC(l) = 141 + 3.90 × l microseconds.
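The end-around data movement that CSHIFT( ) provides can be illustrated in a few lines. This is a plain-Python sketch of the pattern only, not of the CM Fortran call; note that a positive shift rotates right here, while direction conventions vary between implementations:

```python
def cshift(x, shift=1):
    """End-around (circular) shift of a list by `shift` positions --
    the communication pattern benchmarked in 3.2.2, sketched serially."""
    shift %= len(x)
    return x[-shift:] + x[:-shift]

print(cshift([1, 2, 3, 4]))  # → [4, 1, 2, 3]: each element moves one place
```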

3.2.3. Broadcast. The broadcast communication pattern is one in which a single processor broadcasts a message to all the other processors in the partition. In our implementation, processor 0 broadcast to every other processor. In the case of the SIMD benchmark, virtual processor 0 performed a spread of a vector of varying size to every other virtual processor in each run. The broadcast times are tabulated in Table IV. The following equations are calculated for broadcast.

• The equation for the MIMD communication time is TC(l) = 6.1 + 4.5 × l microseconds.
• The equation for the SIMD communication time is TC(l) = 180 + 9.9 × l microseconds.

3.2.4. Reduction. In this benchmark suite, we present the timings of two reductions, maximum (MAX) and logical inclusive-OR (OR). All processors take part in a reduction call. The timings are shown in Table IV.

3.3. Analysis of Synthetic Benchmarks

The SIMD arithmetic operations are about eight times faster than their MIMD counterparts. This mainly reflects the performance of the vector units, because SIMD uses the vector units in the current implementation but MIMD does not. Without the vector units, the speed of SIMD and MIMD arithmetic operations should be about the same, since the arithmetic benchmark only tests simple operations.

Beyond the overhead of copying and using the vector units, the difference between the SIMD and MIMD communication primitives is mainly caused by the compilers and library routines. The CMMD library routines are implemented efficiently. The SIMD compiler and its library routines, on the other hand, have not yet been well optimized. The SIMD random node–node operation involves large overhead. Although the CM-5 relies on software technology to support virtual processors instead of using emulation

[15], the control-loop code and runtime library calls generated by the CM-5 compilers are still not optimized. The SIMD shift, on the other hand, has been better implemented. Therefore, whenever a communication has a shift pattern, CSHIFT should be used instead of array assignments. The SIMD broadcast and reduction operations were much faster on the CM-2, which had hardware support. The CM-5 supports the MIMD broadcast and reduction operations directly in hardware; the corresponding SIMD library routines are simulated over the MIMD mode and therefore have a large overhead.

4. APPLICATION BENCHMARKS

The two applications that we have used to compare the performance of the SIMD and the MIMD modes of the machine are:

• Gaussian elimination.
• Odd–even transposition sorting.

These two applications were chosen because they have different characteristics and there is a natural way to program them in both the SIMD and the MIMD styles. Gaussian elimination exhibits coarse-grain parallelism, as it involves a good number of floating-point operations [17]. Odd–even sort, on the other hand, exhibits fine-grain parallelism, as each node does not have to do many operations.

4.1. Gaussian Elimination Algorithm

The Gaussian elimination algorithm is as follows:

/* ELIMINATION PHASE */
for i = 1 to MAXROW do
    find the pivot element, normalize the pivot column, and mark the row index
    broadcast this information to all other processors
    update each unmarked row
end for

/* BACK-SUBSTITUTION PHASE */
for i = MAXROW to 1 do
    back substitute to get x(i)
    send x(i) to all processors holding x(j) where j < i
    update the b vector (right side of the equation) in parallel
end for

The partitioning scheme that we used for the MIMD implementation of Gaussian elimination is column partitioning, which is the best partitioning for the MIMD implementation. For SIMD, column partitioning is not as good as fine-grain partitioning. Therefore, fine-grain partitioning is used for the SIMD implementation, where the array is laid out using the (:NEWS,:NEWS) directive. The results are reported in Table V. The timings reported here are averaged over five runs. For a matrix size of 2047, MIMD gives 13.36 MFLOPS and SIMD 8.84 MFLOPS on 32 processors.
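For reference, the elimination and back-substitution phases of the pseudocode above can be sketched serially. This Python version is our own illustration (with partial pivoting via row exchange rather than the paper's column scheme, and no partitioning); it shows the data dependencies but none of the parallel structure:

```python
def gaussian_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting and
    back substitution -- a serial sketch of the algorithm in the text."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies, leave inputs untouched
    b = b[:]
    # Elimination phase: pick pivot, eliminate below it, update rows
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))  # pivot row
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    # Back-substitution phase: solve for x(i) from the last row upward
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

print(gaussian_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

In the parallel versions, the inner row updates are what each node performs on its own partition, and the pivot information is what gets broadcast each iteration.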


TABLE V
Timing Results for Gaussian Elimination (in Seconds)

Size   Mode   Time
511    MIMD   6.04
       SIMD   9.94
1023   MIMD   49.2
       SIMD   84.6
2047   MIMD   427
       SIMD   645

4.2. Odd–Even Parallel Sort

This is a parallel sorting algorithm that runs in linear time across a set of parallel processors. Assume the algorithm uses N logical processors for a data set of size N. Each logical processor holds one element of the input data set. The algorithm consists of two steps that are performed repeatedly. In the first step, each odd-numbered processor P(i) obtains x(i+1) from P(i+1); if x(i) > x(i+1), then P(i) and P(i+1) exchange the elements they held at the beginning of this step. In the second step, all even-numbered processors perform the same operations as the odd-numbered ones did in the first step. After N/2 repetitions of these two steps in this order, no further exchanges of elements can take place and the algorithm terminates. At this time, each processor P(i) holds an x value in the sorted sequence. The algorithm is given below:

for j = 1 to N/2 do
    for i = 1 to N-1 step 2 do in parallel
        if x(i) > x(i+1) then swap x(i) and x(i+1) end if
    end for
    for i = 2 to N-2 step 2 do in parallel
        if x(i) > x(i+1) then swap x(i) and x(i+1) end if
    end for
end for

In Table VI, the timings for data sizes of 16K and 32K are given. The timings reported are averaged over five runs.

TABLE VI
Timing Results for Odd–Even Parallel Sort (in Seconds)

Size   Mode   Time
16K    MIMD   9.51
       SIMD   25.1
32K    MIMD   31.1
       SIMD   69.5

4.3. Analysis of Application Benchmarks

Gaussian elimination is computation-bound. Its communications include reduction, broadcast, and node–node communication. Although the SIMD arithmetic operations are much faster, the MIMD Gaussian elimination program outperforms its SIMD counterpart. This indicates that SIMD communications are much slower than MIMD communications. The following table shows the breakdown of the computation time and communication time in the SIMD and MIMD implementations; the matrix size is 1023.

Mode   Computation time (s)   Communication time (s)
MIMD   43.8                   5.4
SIMD   5.1                    79.5
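The two alternating phases of the odd–even transposition sort described in Section 4.2 can also be sketched serially. This Python version is our own illustration (0-indexed, with each list element standing in for one logical processor); it is not the benchmarked code:

```python
def odd_even_sort(x):
    """Odd-even transposition sort: alternate compare-exchange phases over
    adjacent pairs until the list is sorted -- a serial sketch of the
    parallel algorithm, where each element stands for one processor."""
    x = x[:]
    n = len(x)
    for _ in range(n // 2 + 1):        # N/2 repetitions of the phase pair
        for start in (0, 1):           # "odd" phase, then "even" phase
            for i in range(start, n - 1, 2):
                if x[i] > x[i + 1]:
                    x[i], x[i + 1] = x[i + 1], x[i]
    return x

print(odd_even_sort([5, 1, 4, 2, 3]))  # → [1, 2, 3, 4, 5]
```

Each inner loop corresponds to one parallel step: all pairs in a phase exchange simultaneously, which is why each node performs only a single compare-exchange per step and the parallelism is fine-grained.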

The odd–even parallel sort is communication-bound, so the SIMD implementation is much slower than the MIMD implementation.

5. CONCLUSIONS

In this paper, we have presented a set of synthetic and application benchmarks which show that the SIMD programming mode with CM Fortran is slower than the MIMD programming mode with message-passing Fortran. The difference between the two modes can be ascribed to two causes:

1. The SIMD mode is intrinsically slower than the MIMD mode, because the CM-5 is truly a MIMD machine and the SIMD mode has to be emulated on top of it.
2. The compiler for SIMD programs introduces further inefficiencies.

We believe that a user of the CM-5 will probably program only in higher-level languages. A comparison like the one done in this paper provides valuable information to users about how well their programs are likely to perform.

ACKNOWLEDGMENT

The performance data were gathered on the CM-5 at NPAC, Syracuse University.

REFERENCES

1. Z. Bozkus, S. Ranka, and G. C. Fox, Modeling the CM-5 multicomputer. The 4th Symposium on the Frontiers of Massively Parallel Computation, Oct. 1992, pp. 100–107.
2. P. Duclos, F. Boeri, M. Auguin, and G. Giraudon, Image processing on SIMD/SPMD architecture: OPSILA. 9th Int'l Conf. Pattern Recognition, Nov. 1988, pp. 430–433.
3. T. H. Dunigan, Performance of the Intel iPSC/860 hypercube. Tech. Report ORNL/TM-11491, Oak Ridge National Laboratory, June 1990.
4. D. C. Grunwald and D. A. Reed, Benchmarking hypercube hardware and software. UIUCDCS-R-86-1303, Dept. of Computer Science, Univ. of Illinois at Urbana–Champaign, Nov. 1986.
5. W. D. Hillis, The Connection Machine. MIT Press, Cambridge, MA, 1985.
6. W. D. Hillis and G. L. Steele, Data parallel algorithms. Comm. ACM 29(12), 1170–1183 (Dec. 1986).
7. J. P. Hayes et al., A microprocessor-based hypercube supercomputer. IEEE Micro 6, 6–17 (Oct. 1986).
8. C. E. Leiserson et al., The network architecture of the Connection Machine CM-5. The Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, 1992, pp. 272–285.
9. J. L. Potter and W. C. Meilander, Array processor supercomputers. Proc. IEEE 77(12), 1896–1914 (Dec. 1989).
10. H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J. Davis IV, An overview of the PASM parallel processing system. In Computer Architecture (D. D. Gajski et al., Eds.), pp. 387–407. IEEE Comput. Soc. Press, Los Alamitos, CA, 1987.
11. Thinking Machines Corp., Introduction to Connection Machine Scientific Software Library (CMSSL), Version 2.2, Nov. 1991.
12. Thinking Machines Corp., CM Fortran Reference Manual, Version 5.2-0.6, Sep. 1989.
13. Thinking Machines Corp., CMMD User's Guide, Version 2.0, Nov. 1992.
14. Thinking Machines Corp., CMMD Version 2.0 Release Notes, Version 2.0, Dec. 1992.
15. Thinking Machines Corp., Connection Machine CM-5 Technical Summary, Nov. 1992.
16. K. Wu and Y. Saad, Performance of the CM-5 message passing primitives. Tech. Report 93-20, Dept. of Computer Science, Univ. of Minnesota, Mar. 1993.
17. M. Y. Wu and W. Shu, Performance estimation of Gaussian elimination on the Connection Machine. Proc. 1989 Int'l Conference on Parallel Processing, Aug. 1989, Vol. 3, pp. 181–184.

Received September 30, 1993; revised January 23, 1995; accepted October 20, 1995

RAVIKANTH GANESAN has been working with The Information Bus Company Inc. (TIBCO), formerly known as Teknekron Software Systems Inc., Palo Alto, California, since October 1994. He received his Bachelor's in electronics engineering from Annamalai University, India, in 1984, a Master's in computer engineering from the Regional Engineering College, Tiruchirapalli, India, in 1988, and a Master's in computer science from the State University of New York at Buffalo in 1994. In 1989 he was awarded the United Nations Development Programme (UNDP) fellowship to work on distributed knowledge-based computing at Purdue University. His current research interests include concurrency, load balancing, thread safety, fault tolerance, and performance issues in large distributed and multiprocessor systems.

KANNAN GOVINDARAJAN received his B.Tech. in computer science from the Indian Institute of Technology, Madras, in 1991. Since then he has been a graduate student in the Computer Science Department at the State University of New York at Buffalo. His research interests are in programming languages, parallel and distributed computing, and databases.

MIN-YOU WU received the M.S. from the Graduate School of Academia Sinica, Beijing, China, and the Ph.D. from Santa Clara University, California. Before joining the Department of Computer Science, State University of New York, Buffalo, where he is currently an assistant professor, he held various positions at the University of Illinois at Urbana–Champaign, the University of California at Irvine, Yale University, and Syracuse University. His research interests include parallel operating systems, compilers for parallel computers, programming tools, and VLSI design. He has published over 50 journal and conference papers in these areas and edited two special issues on parallel operating systems. He is a member of the IEEE.