On Factors Limiting the Generation of Efficient Compiler-Parallelized Programs




Algorithms and Parallel VLSI Architectures III, M. Moonen and F. Catthoor (Editors). © 1995 Elsevier Science B.V. All rights reserved.


M. R. WERTH and P. FEAUTRIER

PRISM Laboratory, University of Versailles, 45, avenue des États-Unis, 78035 Versailles Cedex, France. {Mourad.Raji.Werth, Paul.Feautrier}@prism.uvsq.fr

ABSTRACT. In this paper we discuss two factors which may limit the performance of parallel programs: 1. the rate of inactivity, where, within a virtual processor grid, a number of processors remain idle during program execution; 2. overheads due to a communication optimization step. These factors were first observed while testing the performance of programs generated by a parallelizing compiler.

KEYWORDS. Parallelizing compilers, inactivity rate, unimodular transformations, optimization of communications.

1 INTRODUCTION

Our parallelization method starts with a program written in a conventional language (e.g. Fortran), then produces the parallel form of the program using a systematic transformation scheme. This form is then adapted to generate code for a given parallel architecture. The underlying idea of our systematic transformation scheme is to retrieve a space-time mapping of the program (a similar approach is used by the systolization community [1]), starting from a powerful tool, the Data Flow Graph [2], which provides the exact source of every value manipulated in the program, i.e. the exact instruction and the corresponding iteration (in the presence of a loop nest) that produces the value. The schedule [3] and the placement function [4] of the program instructions are then derived. They are linear functions of the surrounding loop indices. The schedule expresses the date at which a given iteration must be executed; the placement function provides the coordinates of the virtual processor (within a multi-dimensional grid) that will execute the corresponding computation.

In a subsequent step the program is expanded in order to obtain its single-assignment form and the new loop nests are constructed. The logical time and the grid axes represent the new indices. Each original loop nest is embodied by a multi-dimensional virtual processor grid. The generated program contains a single global sequential loop over time, having an entirely parallel body [5]. Unfortunately, the resulting body incorporates several overheads that limit the performance of the generated code. These overheads are due to the presence of instruction guards and conditional operands, and to the high cost of communications. These problems are dealt with in [6] and [7] respectively.

Our experiments have shown that the performance of the generated programs (even when optimized) can be limited by several factors. In this paper we address two of them. In section 2, we present a simple example in order to illustrate our purposes. The first factor, which is related to the choice of the transformation applied to the sequential program, concerns the rate of inactivity of the virtual processors within the grid; we show that unimodular transformations are not always the best choice. This topic is treated in section 3. The second factor is a side effect of the optimization of communications: when gets are transformed into sends, the set of sending processors must be delimited, and this operation may, in some cases, induce an important overhead that considerably reduces the performance of the optimized program. Section 4 deals with that issue. In section 5, we present a discussion of possible solutions to the above-mentioned problems, as well as our conclusions.

2 AN EXAMPLE

In order to illustrate our purposes, we use the Lamport loop as an example:

    do i = 1, l
      do j = 2, n-1
        do k = 2, n-1
          a(j, k) = 0.25 * (a(j, k-1) + a(j, k+1) + a(j-1, k) + a(j+1, k))
        end do
      end do
    end do

The Lamport loop is an example of a relaxation algorithm in which an element of the matrix is determined by computing the average of its four neighbors. This computation is included in a global loop (the one on i, from 1 to l), which makes the dependency issue a bit more complex, because the newly computed element must be considered in the subsequent computations. Figure 1 illustrates the dependencies between the program iterations. The timing function of the computing instruction is t = 2i + j + k - 6. Thus, for any given time value, the iterations belonging to the hyperplane of equation 2i + j + k - 6 - t = 0 are entirely parallel.
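As a quick illustration (ours, not part of the original paper), the following Python sketch enumerates the iteration domain of the Lamport loop and checks that the schedule t = 2i + j + k - 6 places every producer of a value at an earlier time step than its consumer, so that iterations sharing a time value are indeed independent; the problem sizes and helper names are arbitrary.

    # Hypothetical illustration: verify that the schedule t = 2i + j + k - 6
    # respects every flow dependence of the Lamport loop, i.e. that the four
    # values read by iteration (i, j, k) are produced at an earlier time step.

    def schedule(i, j, k):
        return 2 * i + j + k - 6

    def check_schedule(l, n):
        def in_domain(i, j, k):
            return 1 <= i <= l and 2 <= j <= n - 1 and 2 <= k <= n - 1

        for i in range(1, l + 1):
            for j in range(2, n):
                for k in range(2, n):
                    # a(j, k-1) and a(j-1, k) come from the current sweep over i,
                    # a(j, k+1) and a(j+1, k) from the previous one.
                    producers = [(i, j, k - 1), (i, j - 1, k),
                                 (i - 1, j, k + 1), (i - 1, j + 1, k)]
                    for p in producers:
                        if in_domain(*p):
                            assert schedule(*p) < schedule(i, j, k)

    check_schedule(l=5, n=10)
    print("all dependences cross at least one time step")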


Figure 1: Dependencies between iterations of the Lamport loop example

In order to obtain a unimodular transformation, we have chosen the placement function (q0, q1) = (i + j, k) (see [5]). The transformation is thus:

    T(i, j, k) = (t, q0, q1) = (2i + j + k - 6, i + j, k)

The new loop nest is then indexed by t, q0 and q1, whose bounds are obtained by applying the Fourier-Motzkin algorithm and are given as follows:

    0 ≤ t ≤ 2l + 2n - 8
    max(⌈(t + 8 - n)/2⌉, 3, t + 6 - l - n) ≤ q0 ≤ min(⌊(n + t + 2)/2⌋, t + 2, l + n - 1)
    max(2, t + 7 - 2q0, t + 5 - q0 - l) ≤ q1 ≤ min(n - 1, n + t + 4 - 2q0, t + 4 - q0)

The outermost loop is sequential; the two other loops are entirely parallel and are incarnated by a 2-dimensional virtual processor (VP) grid indexed by (q0, q1) and of size 2n × n.
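As a sanity check (ours, not from the paper; written in plain Python purely for illustration), the sketch below runs the Lamport loop both in its original order and in the transformed order, recovering (i, j, k) from (t, q0, q1) by inverting T, i.e. i = t - q0 - q1 + 6, j = q0 - i, k = q1, and guarding out the inactive grid points. Both orders must produce the same array.

    # Hypothetical check: run the Lamport loop in original order and in the
    # wavefront order given by T(i, j, k) = (2i + j + k - 6, i + j, k), and
    # compare the results.  The inverse mapping used below is
    #   i = t - q0 - q1 + 6,  j = q0 - i,  k = q1.

    def lamport_original(a, n, l):
        for i in range(1, l + 1):
            for j in range(2, n):
                for k in range(2, n):
                    a[j][k] = 0.25 * (a[j][k-1] + a[j][k+1] + a[j-1][k] + a[j+1][k])

    def lamport_wavefront(a, n, l):
        for t in range(0, 2 * l + 2 * n - 7):           # sequential logical time
            for q0 in range(3, l + n):                   # conceptually parallel ...
                for q1 in range(2, n):                   # ... (q0, q1) grid
                    i = t - q0 - q1 + 6
                    j = q0 - i
                    k = q1
                    if 1 <= i <= l and 2 <= j <= n - 1:  # guard: keep only active VPs
                        a[j][k] = 0.25 * (a[j][k-1] + a[j][k+1] + a[j-1][k] + a[j+1][k])

    n, l = 10, 4
    init = [[float(j * k) for k in range(n + 1)] for j in range(n + 1)]
    a1, a2 = [row[:] for row in init], [row[:] for row in init]
    lamport_original(a1, n, l)
    lamport_wavefront(a2, n, l)
    assert a1 == a2
    print("wavefront execution reproduces the sequential result")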

3 RATE OF INACTIVITY

At any given iteration of the loop on t, only those processors verifying the above constraints on q0 and q1 may be active. In essence, these inequalities define a parametric polyhedron (the parameters being t, n and l), of which it is not possible to have a direct visual representation. Therefore, in order to understand the behavior of the parallel program, we shall study the shape of this polyhedron within two time intervals, which will enable us to get a visual representation of it. This amounts to expressing the bounds of q0 and q1 in their simplest form. For simplicity's sake, we assume that l = n. Let us begin with the time interval 0 ≤ t ≤ n - 2. Within this interval we have (see [8]):

    3 ≤ q0 ≤ t + 2

and

    max(2, t + 7 - 2q0) ≤ q1 ≤ t + 4 - q0

Figure 2: Rate of inactivity for t ≤ n - 2

Figure 2 shows the shape of the active VPs' domain in terms of t when 0 ≤ t ≤ n - 2. Notice that the rate of activity of the VPs is very low: more than 3/4 of the processors are idle.¹

¹ This is a rough approximation. If we assume that the number of integer-coordinate points of a triangle is equal to its area (which is asymptotically acceptable), the number of active processors (represented by the dark surface) is roughly (1/2) · ((t - 1)/2) · (t - 1) = (t - 1)²/4 ≈ t²/4. Given that the total number of processors is 2n² = 8 · (n²/4), more than 7 out of 8 processors are idle.

As for n - 2 ≤ t < 2n - 5, the bounds of q0 and q1 are expressed as follows (see [8]):

    ⌈(t + 8 - n)/2⌉ ≤ q0 ≤ ⌊(n + t + 2)/2⌋

and

    max(2, t + 7 - 2q0, t + 5 - q0 - n) ≤ q1 ≤ min(n - 1, n + t + 4 - 2q0, t + 4 - q0)

Figure 3: Rate of inactivity for n - 2 ≤ t < 2n - 5

Figure 3 shows the shape of the active VPs' domain. The rate of activity is higher than in the previous case, but it remains quite low.² We stop our study at this point; suffice it to note that the VPs' activity rate remains low in the remaining cases: for 2n - 5 ≤ t < 3n - 4 we still have ⌈(t + 8 - n)/2⌉ ≤ q0 ≤ ⌊(n + t + 2)/2⌋, and for 3n - 3 ≤ t ≤ 4n - 8 we have t + 6 - 2n ≤ q0 ≤ 2n - 1. In both cases the maximum number of active processors along the q0 axis is less than n - 3.

This obviously limits the performance of the generated program, especially for SIMD virtualized systems like the CM-2(200). Generally speaking, this is a problematic issue. Let us assume, for example, that T is the number of operations of a given program and l its latency (the maximum value of time). In order to obtain an optimal parallel execution, one should ensure that T/l virtual processors are simultaneously active at each time step; the inactivity rate would then be zero. This requires the number of operations executed at each time step to be the same, which is generally not the case. The trouble is that only a small number of operations are carried out at a given time, while a large number of VPs are available. Since these VPs have to be mapped onto the physical processors of the computer, including those VPs to which no computation has been assigned (because of the SIMD architecture of our target machine), it is clear that a high inactivity rate leads inevitably to poor performance.

² Asymptotically, the dark surface does not exceed n²/2: it is less than the area of the including parallelogram, which is itself equal to n² - 2(n²/4) = n²/2 (the area of the including square minus the surface of the two other triangles). This leads to a rate of inactivity greater than 3/4.
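To make these figures concrete, the following sketch (ours, not from the paper; plain Python, arbitrary problem sizes) measures the per-time-step inactivity rate of the 2n × n grid by brute-force enumeration of the active virtual processors.

    # Hypothetical measurement of the VP inactivity rate for the placement
    # (q0, q1) = (i + j, k) on the 2n x n grid (l = n as in the text).

    from collections import defaultdict

    def inactivity_per_step(n, l):
        grid_size = 2 * n * n
        active = defaultdict(set)                    # t -> set of active (q0, q1)
        for i in range(1, l + 1):
            for j in range(2, n):
                for k in range(2, n):
                    active[2 * i + j + k - 6].add((i + j, k))
        return {t: 1.0 - len(v) / grid_size for t, v in active.items()}

    rates = inactivity_per_step(n=32, l=32)
    print("lowest inactivity over all time steps: %.2f" % min(rates.values()))
    print("mean   inactivity over all time steps: %.2f" % (sum(rates.values()) / len(rates)))
    # Even the busiest time step leaves roughly three quarters of the VPs idle;
    # on average more than 7 out of 8 are idle, matching the estimate above.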


We have replaced the placement function by (j, k) in order to obtain a smaller VP grid (n² instead of 2n²). This led to a non-unimodular transformation, but the resulting parallel program ran more than three times faster than the previous one. This result leads to the conclusion that unimodular transformations are not always the best choice. A number of authors have shown how to reduce the overheads of non-unimodular transformations (see [10], for example).
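The same brute-force enumeration can be used to compare the two placements; the sketch below (ours, not from the paper, assuming l = n and an arbitrary problem size) reports the average fraction of active VPs for the placement (i + j, k) on the 2n × n grid and for the placement (j, k) on the n × n grid.

    # Hypothetical comparison of the two placements discussed above.

    from collections import defaultdict

    def average_activity(n, l, placement, grid_size):
        active = defaultdict(set)                    # t -> set of active VPs
        for i in range(1, l + 1):
            for j in range(2, n):
                for k in range(2, n):
                    active[2 * i + j + k - 6].add(placement(i, j, k))
        return sum(len(v) for v in active.values()) / (len(active) * grid_size)

    n = l = 32
    print("placement (i+j, k), grid 2n x n: %.3f" %
          average_activity(n, l, lambda i, j, k: (i + j, k), 2 * n * n))
    print("placement (j, k),   grid n x n:  %.3f" %
          average_activity(n, l, lambda i, j, k: (j, k), n * n))
    # Halving the grid size doubles the average activity; the text reports that
    # the (j, k) version ran more than three times faster on the CM-2(200).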

4 A SIDE EFFECT OF THE OPTIMIZATION OF COMMUNICATIONS

It is well known that communication time is one of the most limiting factors in obtaining high-performance programs for parallel architectures. In a parallel program generated by our parallelization method, data has to be fetched from the processor where it is stored whenever it is not local. This is most easily accomplished by using the get primitive. This solution has the advantage of being natural and easy to generate automatically³, but unfortunately it is inefficient. Indeed, the source processor pointed out by the get instruction can be entirely arbitrary, given that all active processors execute the get at the same time and in parallel. This might lead to conflicts on the communication links during the forwarding of the messages. Moreover, on some machines (such as the CM-2(200)) the target processor first has to send its address to the source processor in order to allow the latter to send it the desired datum. As a consequence, gets are twice as expensive as sends.

In [7] we address the issue of communication optimization by proposing ways to replace the costly get communication scheme by a more efficient equivalent one. This amounts to a decomposition of the get operation into one or more elementary operations, and our goal is to detect the most economical decomposition. We proposed an approach based on linear algebra. Two tracks have been explored:

1. Transforming gets into sends. As a result of this transformation, communications called spreads (or broadcasts), in which all or some of the active processors ask for the same data, are detected. These communications are efficiently implemented on most architectures.

2. Detecting regular communications between neighbor processors, called shift⁴ communications. Owing to the regular communication pattern involved, these communications are also very efficiently implemented.

Transforming gets into sends amounts to designating a set of processors within the sending grid that will be in charge of sending data to a set of processors within the receiving grid. In order to achieve this, one has to determine which of the processors belonging to the sending grid should be active. This operation, which is detailed in [7], may in certain cases require supplementary tests in order to ensure that the receiving processors actually have integer coordinates.

³ One can find the same choice in other works; see [9] for example.
⁴ They are also called NEWS communications.


For instance, in the parallel version of the Lamport loop example, we find the following get operation:

    (q0, q1) ← (2q0 + q1 - 5 - t, q1 - 1)

which is transformed into a send as follows:

    (q0, q1) → ((q0 - q1 + t + 4)/2, q1 + 1)

consequently requiring the addition of the test modulo(q0 - q1 + t + 4, 2) = 0. This happens 4 times in the Lamport loop and results in the communications-optimized program being about three times slower than the original one! This loss of performance is chiefly due to the fact that the modulo function is extremely expensive (60 to 75 times the cost of a floating-point addition) on our target machine (the CM-2(200)). This overhead is likely to be smaller on other architectures.
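As a small consistency check (ours, not from the paper), the sketch below verifies that the send form and the modulo test above are indeed the inverse of the get access function: every sender obtained by inverting the get reaches the original receiver and automatically satisfies the parity condition.

    # Hypothetical check that the send form and the parity guard above are the
    # inverse of the get of the Lamport example.

    def get_source(q0, q1, t):
        # processor (q0, q1) fetches its operand from this source processor
        return (2 * q0 + q1 - 5 - t, q1 - 1)

    def send_dest(q0, q1, t):
        # processor (q0, q1) sends its value to this destination (possibly fractional)
        return ((q0 - q1 + t + 4) / 2, q1 + 1)

    t = 7
    for q0 in range(3, 20):
        for q1 in range(2, 20):
            src = get_source(q0, q1, t)
            # the sender's destination is the original receiver ...
            assert send_dest(src[0], src[1], t) == (float(q0), q1)
            # ... and such a sender always passes the modulo test
            assert (src[0] - src[1] + t + 4) % 2 == 0
    print("send form and modulo test are consistent with the get")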

5 DISCUSSION AND CONCLUSIONS

In this paper, we have presented two performance-limiting factors that we encountered during our experiments.

The first one concerns the rate of inactivity of the virtual processors. As a consequence of the so-called "VP looping" mechanism in SIMD-virtualized machines, which leads to the evaluation of expressions on every VP, even on those which need not participate in the computation, the performance of programs with a high inactivity rate is dramatically affected. A number of authors from the systolic array synthesis community (see for example [11] [12] [13]) have proposed solutions in the context of partitioning/clustering that we aim to exploit and adapt to our parallelization context in order to minimize the number of virtual processors. We have also shown that unimodular transformations are not always the best choice. On the other hand, the shape of the wavefronts (the sets of iterations to be executed in parallel for a given value of t) is generally not rectangular. The wavefronts are rectangular only when the schedule and each component of the placement function are parallel to one of the iteration domain axes, i.e. in the particular case where the schedule and the components of the placement function take the linear form in which only one coefficient is non-zero. Unfortunately, the great majority of current machines allow the use of rectangular grids only. Implementing non-rectangular wavefronts on a rectangular grid therefore requires making the extra processors idle. In our future work we plan to submit our method to experimentation on a MIMD architecture. We expect the run-time system of such machines to deal with the inactivity problem more efficiently than SIMD machines do: since the physical processors are not obliged to work synchronously, i.e. they do not have to execute the same instruction at the same time, the system may be able to manage the virtualization so as to distribute the actual computations efficiently.

The second factor is a side effect of the optimization of communications. When turning gets into sends, one has to ensure that only the processors involved in the communication send their data. This operation may induce a large overhead in some cases. Further investigation will be needed to determine whether the generation of a space-time mapping is possible when taking this issue into account.

Acknowledgements

We are grateful to our friend Andreas Lötter for proofreading the manuscript.

References

[1] P. Quinton. Automatic Synthesis of Systolic Arrays from Uniform Recurrence Equations. IEEE Int. Symp. on Computer Architecture, pp. 208-214, 1984.

[2] P. Feautrier. Dataflow Analysis of Scalar and Array References. Int. J. of Parallel Programming 20(1), pp. 23-53, 1991.

[3] P. Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, I: One-Dimensional Time. Int. J. of Parallel Programming 21(5), pp. 313-348, 1992.

[4] P. Feautrier. Toward Automatic Partitioning of Arrays for Distributed Memory Computers. Seventh ACM Int. Conf. on Supercomputing, pp. 175-184, Tokyo, 1993.

[5] M. R. Werth and P. Feautrier. On Parallel Program Generation for Massively Parallel Architectures. High Performance Computing II, M. Durand and F. El Dabaghi (editors), North-Holland, pp. 115-126, Oct. 1992.

[6] M. R. Werth, J. Zahorjan and P. Feautrier. Using Compile-Time Conditional Analysis to Improve the Performance of Compiler Parallelized Programs. Proc. Int. Conference on Massively Parallel Processing, Applications and Development, Elsevier Science Publisher, 1994.

[7] M. R. Werth and P. Feautrier. Optimizing Communications by Using Compile Time Analysis. CONPAR'94, Linz, Austria, LNCS, Springer-Verlag, Sept. 1994.

[8] Mourad Raji Werth. Génération Systématique de Programmes Parallèles pour Architectures Massivement Parallèles. Ph.D. Thesis, Université P. et M. Curie, Nov. 1993.

[9] Luc Bougé. The Data-Parallel Programming Model: a Semantic Perspective. Tech. Report LIP-IMAG 92-45, 1992.

[10] M. Barnett and C. Lengauer. Unimodularity Considered Non-Essential. Proc. CONPAR 92 - VAPP V, LNCS 634, Springer-Verlag, pp. 659-664, 1992.

[11] P. Clauss, C. Mongenet and G.-R. Perrin. Calculus of Space-Optimal Mappings of Systolic Algorithms on Processor Arrays. Int. Conference on Application Specific Array Processors, IEEE Computer Society, pp. 591-602, 1990.

[12] J. Bu, E. Deprettere and P. Dewilde. A Design Methodology for Fixed-Size Systolic Arrays. In Application-Specific Array Processors (S. Y. Kung, E. Swartzlander, J. Fortes, eds.), IEEE Computer Society Press, pp. 591-602, 1990.

[13] J. Teich and L. Thiele. Partitioning of Processor Arrays: a Piecewise Regular Approach. Integration: the VLSI Journal, 1992.