A note on scaling the Linpack benchmark


J. Parallel Distrib. Comput. 67 (2007) 491 – 498 www.elsevier.com/locate/jpdc

Robert W. Numrich
Minnesota Supercomputing Institute, Minneapolis, MN, USA

Received 8 September 2006; received in revised form 4 January 2007; accepted 6 January 2007. Available online 23 January 2007.

Abstract

Dimensional analysis yields a new scaling formula for the Linpack benchmark. The computational power r(p0, q0) on a set of processors decomposed into a (p0, q0) grid determines the computational power r(p, q) on a set of processors decomposed into a (p, q) grid by the formula r(p, q) = (p/p0)^α (q/q0)^β r(p0, q0). The two scaling parameters α and β measure interprocessor communication overhead required by the algorithm. A machine that scales perfectly corresponds to α = β = 1; a machine that scales not at all corresponds to α = β = 0. We have determined the two scaling parameters by imposing a fixed-time constraint on the problem size such that the execution time remains constant as the number of processors changes. Results for a collection of machines confirm that the formula suggested by dimensional analysis is correct. Machines with the same values for these parameters are self-similar. They scale the same way even though the details of their specific hardware and software may be quite different.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Performance analysis; Scaling; Self-similarity; Linpack benchmark; Dimensional analysis; Pi Theorem

1. Introduction

This note describes a new scaling formula for the Linpack benchmark [14]. Given the computational power r(p0, q0) for a problem of size n0 that executes in time t0 on a subset of processors decomposed into a (p0, q0) grid, there exists a problem size n that executes in the same time on a different subset of processors decomposed into a (p, q) grid with computational power given by the formula,

    r(p, q) = (p/p0)^α (q/q0)^β r(p0, q0).    (1)

This formula is valid because we have imposed a fixed-time constraint that determines the scaling parameters α and β. Since the total time is fixed, the time spent doing computation decreases to compensate for the increase in communication time as the machine size increases. The problem size changes in a carefully prescribed way to keep the total time constant, and this prescription is encoded in the scaling parameters. They contain

E-mail address: [email protected]. doi:10.1016/j.jpdc.2007.01.002

information about the interprocessor communication network for a particular algorithm coded for a particular machine. The fixed-time constraint is different from the fixed-memory constraint [5] normally applied to the Linpack benchmark. It is a variant of the scaled speedup first described by Gustafson and co-workers [7] and a generalization of the SLALOM benchmark [8]. Our approach can also be thought of as the converse of the approach taken by Worley [15]. We could also determine the self-similarity parameters by imposing a fixed-memory or a fixed-work or some other constraint on the problem. Different constraints, of course, yield different values for the parameters. The fixed-time constraint has the advantage that it requires fewer system resources to run: lower execution time, less memory, and fewer floating-point operations. It has the disadvantage that it may not yield the absolute best performance for a given machine. But unlike the usual Linpack results, which yield no information about scaling, our approach yields important information about how the machine scales for a modest expenditure of resources. Two machines with the same parameters are self-similar. They scale the same way even though the details of their interconnection network, their local memory design, and their software environment may be totally different.


2. Units and dimensions

This note extends an approach to computer performance analysis based on the Pi Theorem of dimensional analysis, an approach already outlined in previous papers [10,11,13]. Dimensional analysis provides a recipe [1–3] for scaling variables to dimensionless self-similarity parameters in such a way as to reduce the number of variables and to reveal the nature of unknown functional relationships. The first step in this recipe is to identify the quantities that are relevant to the problem and to assign units and dimensions to each quantity.

We want to determine how the execution time for the Linpack benchmark behaves as a function of some set of hardware and software parameters. Which parameters to pick is based on computational intuition, part of the art that must be employed in dimensional analysis [1,3]. We want to pick as few parameters as possible that describe the quantity of interest without missing anything important. How well our analysis matches actual measurements determines how good our choice is. We assume that execution time for the Linpack benchmark is an unknown function of four independent variables,

    t = f(w, r*, p, q).    (2)

The matrix of dimensions (3) lists the dimensions of the quantities involved. We measure the dependent variable t in seconds (s), and assign it the dimension of time. The first of the four independent variables, w, is the computational work done in the program. We measure it in floating-point operations (flop) and assign it the dimension of energy. Time and energy are the primary quantities for this problem.

            T (s)   E (flop)
    t         1        0
    w         0        1
    r*       -1        1
    p         0        0
    q         0        0
                                  (3)

The second independent variable r* is the peak computational power of a single processor. We measure it in the unit of power, floating-point operations per second (flop/s), and assign it the derived dimension of computational power, work per time. The parameters p and q are dimensionless. They represent the number of processors decomposed into a (p, q) grid.

3. Applying the Pi Theorem

In this section, we derive our main result, already displayed as formula (1) in the Introduction. Because there are only two primary units, the Pi Theorem says we can pick a new system of units such that our function of four variables (2) reduces to a dimensionless function of only two dimensionless self-similarity parameters [1,3]. The Pi Theorem requires that our problem be independent of the units we pick [2]. Every other physical discipline makes this assumption, so it is reasonable to make it here. This independence means that we may define two scaling parameters, λ_T and λ_E, one for each primary unit, and we may multiply each quantity in relationship (2) according to its dimension such that the original relationship still holds,

    λ_T t = f(λ_E w, λ_E λ_T^(-1) r*, p, q),    (4)

for whatever values we pick for the scaling parameters. If we pick them such that λ_E w = 1 and λ_E λ_T^(-1) r* = 1, the scaled relationship (4) becomes

    t/(w/r*) = f(1, 1, p, q).    (5)

If we define the dimensionless time, τ = t/(w/r*), and the dimensionless function, φ(p, q) = f(1, 1, p, q), we obtain the conclusion of the Pi Theorem,

    τ = φ(p, q).    (6)
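The unit-independence that the Pi Theorem rests on is easy to check numerically: the dimensionless time τ = t/(w/r*) comes out the same no matter which units we express the work and power in. A small sketch, with illustrative numbers:

```python
# Sketch: the dimensionless time tau = t / (w / r*) does not depend on the
# units chosen. One run expressed in two unit systems (values are assumptions).
t_s, w_flop, r_flops = 14.5, 2.0e10, 2.0e9       # seconds, flop, flop/s
tau_1 = t_s / (w_flop / r_flops)

w_gflop, r_gflops = w_flop / 1e9, r_flops / 1e9  # same run in Gflop, Gflop/s
tau_2 = t_s / (w_gflop / r_gflops)

assert abs(tau_1 - tau_2) < 1e-12 * tau_1
```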

This formula simplifies our job of analysis considerably. Now we must apply our computational intuition again. What form might the function φ(p, q) take? We know from experience that execution time decreases with an increase in the number of processors. The simple conjecture,

    φ(p, q) = c p^(-α) q^(-β),    (7)

represents such behavior, where the three unknown, dimensionless parameters, c, α and β, are positive numbers independent of p and q. Whether this formula is correct or not is the subject of the rest of this paper. What can we say immediately about what to expect of our three parameters? First of all, a machine that shows no decrease in execution time as the number of processors increases corresponds to α = β = 0. A machine that scales perfectly, on the other hand, corresponds to α = β = 1. Hence, we have the inequalities,

    1 < p^α q^β < pq.    (8)

For the case p < q, we have

    1 < p^α q^β < q^(α+β) < q^2.    (9)

We expect, therefore, that the sum of these two parameters satisfies the set of inequalities,

    0 < α + β < 2.    (10)

The case p > q yields the same conclusion. We also expect the parameter c to be greater than one. In fact, for a single processor, p = q = 1, we have t/(w/r*) = c. Since the time w/r* is the minimum time required to complete the work at full power, t > w/r* and c > 1. It follows, therefore, that the constant c is the reciprocal of the efficiency on a single processor,

    e(1, 1) = 1/c.    (11)

Next we show that the efficiency on (p, q) processors is determined by the efficiency on (p0, q0) processors by the formula,

    e(p, q) = (p/p0)^(α-1) (q/q0)^(β-1) e(p0, q0).    (12)


To obtain this result, observe that full power for a machine with pq processors is pq r*. The minimum time to complete the computation is

    t*(p, q) = w/(pq r*),    (13)

and from Eqs. (6) and (7), the actual time to complete the computation is

    t(p, q) = c p^(-α) q^(-β) w/r*.    (14)

Dividing (13) by (14), we find

    e(p, q) = p^(α-1) q^(β-1) / c.    (15)
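A short numeric check of the efficiency relation (12), with assumed (not measured) values of α, β and e(1, 1):

```python
# Sketch of Eq. (12): e(p,q) = (p/p0)^(alpha-1) * (q/q0)^(beta-1) * e(p0,q0).
# The parameter values here are assumptions for illustration only.

def efficiency(p, q, p0=1, q0=1, e0=0.8, alpha=0.9, beta=0.8):
    return (p / p0) ** (alpha - 1.0) * (q / q0) ** (beta - 1.0) * e0

# With alpha, beta < 1 the efficiency can only fall as the grid grows.
assert efficiency(4, 4) < efficiency(2, 2) < efficiency(1, 1) == 0.8
```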

Evaluating this formula for (p, q) and for (p0, q0) and eliminating the constant c from the two equations, we obtain result (12). We can now obtain our main result. Multiplying both sides of Eq. (12) by the peak power pq r* and inserting the identity in the form 1 = p0 q0 / p0 q0 on the right side, we obtain

    e(p, q) pq r* = (p/p0)^(α-1) (q/q0)^(β-1) (pq/(p0 q0)) e(p0, q0) p0 q0 r*.    (16)

Defining computational power,

    r(p, q) = e(p, q) pq r*,    (17)

for all values of (p, q), we obtain the scaling formula,

    r(p, q) = (p/p0)^α (q/q0)^β r(p0, q0),    (18)

which is formula (1) displayed in the Introduction.

4. Fixed-time constraint

Our formula (18) implies that the computational power on some subset of processors r(p0, q0) determines the computational power on any other subset of processors r(p, q). But first we need to know the values for the scaling parameters α and β. To determine them, we must impose a constraint. We have chosen a fixed-time constraint for reasons discussed in Section 6. We perform a kind of scaled speedup experiment [7,8,15] by increasing the problem size as the number of processors increases in such a way that the execution time remains constant. We then use a least-squares fit of the data to find α and β. Our procedure is a generalization of the SLALOM benchmark [8]. That benchmark fixes the time and then, for a given machine size, determines the problem size that requires that much time to execute. It yields a single data point for one problem size for one machine size. We determine a sequence of problem sizes that execute in the same time for a sequence of machine sizes. Our procedure yields a scaling formula as a function of machine size.

The fixed-time constraint tells us how the problem size changes as the size of the machine increases. Recall that for LU decomposition, the work done for a matrix of size n(p, q) is approximately

    w = (2/3) n(p, q)^3.    (19)

According to formula (14), the time to run a problem of this size is

    t(p, q) = 2c n(p, q)^3 / (3 p^α q^β r*).    (20)

If we measure the execution time on (p0, q0) processors for a matrix of size n(p0, q0), the fixed-time constraint requires that we find a problem size n(p, q) on (p, q) processors such that

    t(p, q) = t(p0, q0).    (21)

Evaluating the right side of formula (20) for the two different processor decompositions, setting them equal to each other, and solving for n(p, q), we find

    n(p, q) = (p/p0)^(α/3) (q/q0)^(β/3) n(p0, q0).    (22)

Since we do not yet know the values for α and β, we determine the problem size n(p, q) that runs in a fixed time experimentally using the following iterative procedure. We pick a fixed execution time t0 and guess the problem size n0 that will execute in that time on a single processor. We then perform an iterative search that alters the problem size by the formula,

    n_{k+1} = (t0/t_k)^(1/3) n_k,    (23)

which usually converges to the fixed time in about 10 iterations. Table 1 shows a typical set of data obtained in this way for a Cray X1E. We picked a target execution time equal to 14.5 s and an initial problem size n0 = 5000.

Table 1
Fixed-time measurements for a Cray X1E in MSP mode

    p    q    n      t (s)
    1    1    5551   14.49
    1    2    6921   14.51
    2    1    6508   14.35
    2    2    8307   14.51
    1    4    8425   14.50
    4    1    7383   14.44
    2    4    10196  14.49
    4    2    9572   14.51
    1    8    9780   14.39
    8    1    8014   14.50
    4    4    12309  14.50
    2    8    12393  14.51
    8    2    10532  14.44
    1    16   11255  14.49
    16   1    6947   14.48

The parameters p and q are the number of MSP processors, each MSP processor being four physical SSP processors. The problem size n is determined experimentally to execute in fixed time using the iterative search method from Eq. (23).
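The iterative search of Eq. (23) is easy to sketch. Since we cannot run Linpack here, the "timer" below is a synthetic cubic cost model with an assumed machine power; on real hardware the loop would time an actual LU factorization, and with noisy timings it typically needs the roughly 10 iterations mentioned above.

```python
# Sketch of the fixed-time search, Eq. (23): n_{k+1} = (t0/t_k)^(1/3) * n_k.
# The "timer" is a synthetic stand-in, t = (2/3) n^3 / r, with an assumed power r.

R_ASSUMED = 2.0e9  # flop/s, illustrative single-processor power

def measured_time(n):
    """Stand-in for a timed LU run of size n."""
    return (2.0 / 3.0) * n**3 / R_ASSUMED

def fixed_time_size(t0, n0, tol=1.0e-3, max_iter=20):
    """Search for the problem size whose run time is t0, starting from guess n0."""
    n = float(n0)
    for _ in range(max_iter):
        t = measured_time(round(n))
        if abs(t - t0) / t0 < tol:
            break
        n *= (t0 / t) ** (1.0 / 3.0)  # Eq. (23)
    return round(n)

n = fixed_time_size(t0=14.5, n0=5000)
assert abs(measured_time(n) - 14.5) / 14.5 < 1.0e-3
```

With an exact cubic model the search converges almost immediately; the point of the loop is that it also converges when each "measurement" is a real, noisy timing.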


Table 2
Self-similarity parameters for a collection of machines obtained from least-squares fits of measured data to the objective function (25)

    Machine                       α      β      log(r0)
    Cray X1 (MSP) p ≤ q           0.99   0.86   0.77
    Cray X1 (MSP) p > q           0.16   1.37   0.99
    Cray X1E (SSP)                0.86   0.86   0.55
    Cray X1E (MSP) p ≤ q          0.88   0.75   0.95
    Cray X1E (MSP) p > q          0.13   1.44   1.14
    AMD Opteron Cluster p ≤ q     1.03   0.77   0.38
    Intel Xeon Cluster p ≤ q      0.93   0.77   0.73
    Intel Xeon Cluster p > q      0.56   0.79   0.86

The single processor power r0 is measured in Gflop/s.

From the data collected, we compute the computational power from the problem size and the measured time,

    r_i = (2/3) n_i^3 / t_i,    (24)

for each observation point. We fit the results to the logarithm of formula (18) with p0 = q0 = 1 by minimizing the objective function,

    J2 = Σ_i [ log(r_i) - α log(p_i) - β log(q_i) - log(r0) ]^2,    (25)

with respect to the three parameters α, β and log(r0). This minimization problem leads to a system of three equations for the three unknown parameters. The quantity r0 is the projected value of the computational power on a single processor measured in Gflop/s.

Table 2 shows the result of such fits for a collection of machines. The parameters shown in Table 2 are valid for the specific machines shown and for the particular codes used on those machines. The Cray machines use a custom interconnect with a processor frequency of 0.8 GHz for the X1 and 1.13 GHz for the X1E. These machines can be run in MSP mode with four tightly coupled vector processors or in SSP mode with each vector processor on its own (www.cray.com/products). The code was written in Co-Array Fortran using CafLib 1.2 [12] with the vendor-supplied LAPACK routines and the cyclic-block-wrapped algorithm of Dongarra and Walker [4]. The two cluster machines use a Myrinet 2000 interconnect with the MX messaging protocol (www.myri.com). The AMD Opteron 240 processor has a frequency of 1.4 GHz and the Intel Xeon 64EMT processor has a frequency of 3.2 GHz, both with dual processor cores. The code was compiled with the gcc compiler using the GoTo BLAS 1.03 (www.tacc.utexas.edu/resources/software) for performance and MPI for communication. The details of the program can be found on the Top 500 website [14]. In all cases, the same code compiled the same way was run for all values of (p, q).

Fig. 1. Computational power as a function of the number of processors for an Intel Xeon cluster. The bullets represent measured values. The dotted lines are the least-squares fits of the measured values.

Fig. 2. Computational power as a function of the number of processors for a Cray X1E in MSP mode. The bullets represent measured values. The dotted lines are the least-squares fits of the measured values.

5. Reality check

To see how well the theory matches reality, we compare computational power calculated from our formula with values

measured on a collection of machines. Setting p0 = q0 = 1 in formula (18) and taking logarithms on both sides, we find

    log(r) = α [ log(p) + (β/α) log(q) ] + log(r(1, 1)).    (26)

If we plot the logarithm of the computational power versus the logarithm of p q^(β/α), we expect to see a straight line with slope α. The intercept is the extrapolated value for the logarithm of the computational power on a single processor. Figs. 1 and 2 show that the computational power does indeed exhibit the expected linear behavior. The bullets in each figure are the measured points and the dotted lines are the scaling laws obtained from the least-squares fit of the data.

The first thing to notice is that the data splits into two cases depending on whether p ≤ q or p > q. Originally, we had assumed that the parameters α and β were independent of the number of processors. Our measurements reveal that the assumption is only correct if we distinguish the two cases. This asymmetry with respect to processor decomposition has been discussed previously [5,6]. One source of asymmetry is the partial pivoting part of the algorithm. It requires searching for the maximum pivot element in the p direction of the processor


decomposition and the interchange of rows of the matrix. This part of the algorithm is latency-dependent, and a small value of p reduces the execution time. The more latency-sensitive a machine is, the more asymmetric it is likely to be. A second source of asymmetry is the broadcast part of the algorithm. Interconnection networks may support different bandwidths in the p and q directions. The Linpack benchmark is normally run with p ≤ q to obtain better performance. Running it with p > q may reveal important information about the machine. The Cray X1 in SSP mode appears to be the only symmetric machine in the list. We made measurements on this machine only up to 16 SSP processors, which means they were all on the same physical module with shared memory. Measurements for a larger number of processors may reveal an asymmetry for the SSP mode as well.

It would be asking a lot to expect the two scaling parameters to hold for all problem sizes and all machine sizes. Outside the range of measurement from which they were determined, we should be surprised if the formula worked very well. Now that we know that it works within a range of specified measurements, the way to use it is to pick a range of machine sizes, say between some small machine (p1, q1) and some large machine (p2, q2), and then to pick a problem size n00 expected to run for some reasonably short time on a machine of intermediate size (p0, q0). After a single run on that machine to determine t00, two more short runs, using the iterative procedure (23) to determine n11 and n22 that run for the same time on the smallest and largest machines, yield three pieces of input data to the two-by-two system of linear equations,

    ( log(1/p0)   log(1/q0) ) ( α )   ( log(r(1, 1)/r(p0, q0)) )
    ( log(p/p0)   log(q/q0) ) ( β ) = ( log(r(p, q)/r(p0, q0)) ),    (27)

whose solution yields the scaling parameters. These measurements take only a few minutes of time and yield parameters that should be valid across a large range of machine sizes.

Fig. 3 shows the result for such an experiment. The three points marked with crosses are the three initial measurements starting with n00 = 9423 executing for t00 = 9.95 s on eight processors, (p0, q0) = (2, 4). We found the fixed-time problem sizes n11 = 5171 on one processor, (p1, q1) = (1, 1), and n22 = 14 034 on 32 processors, (p2, q2) = (4, 8), and the scaling parameters α = 0.85 and β = 0.87. The bullets in the figure are the measured values for problem sizes calculated from the scaling formula. They should all lie on the dotted line. The agreement is very good with a few scattered points. Such behavior is to be expected from dimensional analysis alone. It indicates that we have obtained the first-order behavior of the problem but that there are other effects that we have missed. It usually means that we need to go back to our original statement of the problem and to add additional independent variables.

Fig. 3. Computational power as a function of the number of processors for a Cray X1E in MSP mode. The three crosses (⊕) mark the initial measurements for three points at the ends and in the middle of the range of processors. These values were inserted into the linear system (27) and the scaling parameters determined from that system were used to calculate problem sizes from Eq. (22). The bullets (•) represent the measured values for those problem sizes, and the dotted line is the computational power calculated from formula (18).

6. Other constraints

Other approaches to performance analysis for LU decomposition might impose constraints different from the fixed-time constraint. Dongarra and coworkers, for example, imposed the constraint that the problem size should increase such that the memory per processor remains constant [4,5],

    n(p, q)^2 / (pq) = n(p0, q0)^2 / (p0 q0),    (28)

which implies that the problem size grows by the formula,

    n(p, q) = (p/p0)^(1/2) (q/q0)^(1/2) n(p0, q0).    (29)

Substituting formula (29) for the problem size into Eq. (20) for the execution time and using t(p0, q0) from formula (20) evaluated at (p0, q0), we find

    t(p, q) = (p/p0)^(3/2-α) (q/q0)^(3/2-β) t(p0, q0).    (30)

We could, therefore, determine the scaling parameters α and β by measuring the execution times for problem sizes that increase according to formula (29) and then fitting the data using an objective function based on the logarithm of formula (30). We would expect, of course, to obtain different values for the scaling parameters since the constraints imposed for the fixed-memory requirement are different from those imposed for the fixed-time requirement. The disadvantage of this method is that the execution time increases quickly as the problem size increases.


Another approach might impose the fixed-work constraint such that the work per processor remains constant,

    w(p, q)/(pq) = w(p0, q0)/(p0 q0),    (31)

which implies that the problem size grows by the formula,

    n(p, q) = (p/p0)^(1/3) (q/q0)^(1/3) n(p0, q0),    (32)

and the execution time changes by the formula,

    t(p, q) = (p/p0)^(1-α) (q/q0)^(1-β) t(p0, q0).    (33)

This approach yields another set of scaling parameters following the same experimental procedure as for the fixed-memory case.

We can compare the memory requirements that result from the three different constraints. Suppose that in each case we start with a problem size n(1, 1) on a single processor close to the largest problem that fits in a single processor's memory. Setting (p0, q0) = (1, 1) in Eq. (22) for a fixed-time experiment and dividing by the number of processors, we find

    n_T(p, q)/(pq) = p^(α/3-1) q^(β/3-1) n(1, 1).    (34)

A similar relationship was noted by Gustafson and coworkers in their discussion of the SLALOM benchmark. They assumed "linear performance increases," which means they assumed α = β = 1. In that case, the problem size on each processor decreases by a two-thirds power,

    n_T(p, q)/(pq) = (pq)^(-2/3) n(1, 1),    (35)

an observation made by the authors of the SLALOM benchmark [8, Fig. 1]. Our model is a generalization of their approach. We specifically model the effect of communication overhead that results in nonlinear performance increases. For the fixed-work experiment, from Eq. (32), we find

    n_W(p, q)/(pq) = p^(-2/3) q^(-2/3) n(1, 1),    (36)

and for the fixed-memory experiment, from the definition (28) and formula (29),

    n_M(p, q)/(pq) = (pq)^(-1/2) n(1, 1).    (37)

From the constraints (10) on the scaling parameters, we conclude that the memory requirements per processor for the three different constraints are ordered as

    n_T(p, q)/(pq) ≤ n_W(p, q)/(pq) ≤ n_M(p, q)/(pq).    (38)

We can also compare the work per processor as the processor count increases. For the fixed-time constraint, we find

    w_T(p, q)/(pq) = p^(α-1) q^(β-1) w(1, 1);    (39)

for fixed-memory, from formula (29),

    w_M(p, q)/(pq) = p^(1/2) q^(1/2) w(1, 1);    (40)

and for fixed-work,

    w_W(p, q)/(pq) = w(1, 1).    (41)

So we have another set of inequalities,

    w_T(p, q)/(pq) ≤ w_W(p, q)/(pq) ≤ w_M(p, q)/(pq).    (42)
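The two orderings can be checked numerically. The per-processor sizes below follow from Eqs. (22), (29) and (32) with (p0, q0) = (1, 1); the decomposition and the scaling parameters are assumed for illustration only:

```python
# Numeric check of the per-processor orderings (38) and (42).
# p, q, alpha, beta and n(1,1) are illustrative assumptions, not measurements.
p, q = 8, 16
alpha, beta = 0.9, 0.8
n11 = 1000.0
w11 = (2.0 / 3.0) * n11**3

# Problem size per processor under each constraint:
n_T = p ** (alpha / 3 - 1) * q ** (beta / 3 - 1) * n11  # fixed-time, Eq. (22)
n_W = p ** (-2.0 / 3.0) * q ** (-2.0 / 3.0) * n11       # fixed-work, Eq. (32)
n_M = (p * q) ** (-0.5) * n11                           # fixed-memory, Eq. (29)
assert n_T <= n_W <= n_M                                # ordering (38)

# Work per processor under each constraint:
w_T = p ** (alpha - 1) * q ** (beta - 1) * w11          # fixed-time
w_W = w11                                               # fixed-work
w_M = (p * q) ** 0.5 * w11                              # fixed-memory
assert w_T <= w_W <= w_M                                # ordering (42)
```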

It is clear that the fixed-time constraint has advantages over other approaches to the benchmark. It takes less time to run, it uses less memory, and it performs fewer floating-point operations.

Worley took a different approach to fixed-time constraints [15]. Rather than looking for an unknown function for the execution time, he assumed a function for the execution time with an explicit dependence on the problem size, the number of processors, and a collection of hardware parameters. From this function, he computed the appropriate problem size that satisfied a time constraint. In contrast, we do not know the function that describes the execution time. We have used the methods of dimensional analysis to discover what the function might be. The power of dimensional analysis is that it allows us to find a very simple formula without the need for detailed analysis of the algorithm. It reveals the essence of the problem under very simple assumptions about the nature of the problem.

7. Summary

We have shown how the application of the principles of dimensional analysis results in a scaling formula for the Linpack benchmark that reproduces measured results with remarkable accuracy. Once we have determined two self-similarity parameters, formula (1) allows us to think of the Linpack benchmark as an initial-value problem. Under the fixed-time constraint, given the computational power r(p0, q0) for problem size n0 on (p0, q0) processors, there exists a problem size n on (p, q) processors that runs in the same time with computational power r(p, q). We can rewrite our result as a commutation relation for the efficiency,

    e(p, q) p0^(α-1) q0^(β-1) - p^(α-1) q^(β-1) e(p0, q0) = 0.    (43)

This result conforms with our intuition that a highly optimized version of the code for a single processor is critically important. Setting p0 = q0 = 1 in relation (43), we see that the computational efficiency e(p, q) decreases from its value e(1, 1) on a single processor as p and q increase. The value of e(1, 1) measures how well the code takes advantage of the floating-point hardware and the memory design on a single processor. As the number of processors increases, the power per processor decreases in a way prescribed by the two scaling parameters. Only a perfect machine, with α = β = 1,


achieves the ideal case of iso-efficiency, e(p, q) = e(1, 1), where the machine maintains the same fraction of peak power that it had on a single processor [5,9]. The fact that these parameters are not equal to one is a measure of how far each machine is from a perfect machine because of the overhead it incurs, from both hardware and software, to handle the interprocessor communication requirements of LU decomposition.

Our initial assumption that the self-similarity parameters are independent of the decomposition (p, q) was not strictly correct. We found from our measurements that it is only correct if we distinguish two separate cases. For p ≤ q, the measured results follow the formula quite closely, the logarithm of the computational power showing only a little scatter away from linear behavior. For p > q, the measured results show more scatter. Such deviation normally implies that we have missed one or more variables of dependence in our original conjecture. Since we suspect that the case p > q is sensitive to memory latency and to asymmetries in bandwidths, more so than the case p ≤ q, we probably need to add another variable in Eq. (2) to represent either memory latency or memory bandwidth for the case p > q.

We have described three different constraints that might be used to obtain scaling formulas for the Linpack benchmark: fixed-time, fixed-memory, and fixed-work. The advantage of the fixed-time constraint is that it requires fewer system resources to run the benchmark. As the machine size increases, the execution time remains fixed at a low value, and the memory used and the work done by each processor decrease. The other constraints require longer execution times as the problem size increases and require more memory and more computational work per processor.

The scaling formula suggests that we characterize machines by the two self-similarity parameters. A simple addition to the Linpack benchmark rules would yield these parameters very quickly. Three short runs, taking only a few minutes to perform, provide input to the linear system (27) whose solution yields the parameters. For a given machine and a given code, these parameters lie in the (α, β) plane with a perfect machine at the point (1, 1). Two machines with the same parameters are self-similar. They scale the same way even though the details of their interconnection network, their local memory design, and their software environment may be totally different.

Fig. 4. Self-similarity parameters for the collection of machines listed in Table 2. The dotted diagonal line represents the constraint (10) that the sum of the two parameters should be less than two. The dotted anti-diagonal line represents symmetric machines with equal parameters.

Fig. 4, for example, shows the distribution of the parameters from Table 2 in the (α, β) plane for the machines we measured. The dotted diagonal line represents the constraint (10) that the sum of the two parameters should be less than two. The dotted anti-diagonal line represents symmetric machines with equal parameters. For the case p ≤ q, all the machines cluster in a small region very near the best-case values α = β = 1. They all lie within an area that is less than 1% of the possible range of values for the self-similarity parameters. They all scale pretty much the same way. The case p > q shows a bigger difference in machines. The two Cray machines are very asymmetric, suggesting that they are quite sensitive to memory latency in the p direction. The Intel Xeon Cluster, on the other hand, appears to be somewhat less sensitive to memory latency in the
p direction. This difference is not apparent from the normal benchmark results.

Acknowledgments

The Department of Energy supported this research by Grant No. DE-FG02-04ER25629 through the Office of Science. I thank Piotr Łuszczek for making measurements on two cluster systems at the University of Tennessee Knoxville and for helpful discussions of the Linpack benchmark. I thank Robert van de Geijn for a helpful discussion of the observed asymmetry with respect to processor decomposition. I thank Cray Inc. for a grant of computer time on the Cray X1 and Cray X1E to make measurements for this paper.

References

[1] G.I. Barenblatt, Scaling, Cambridge University Press, Cambridge, 2003.
[2] G. Birkhoff, Hydrodynamics: A Study in Logic, Fact and Similitude, second ed., Princeton University Press, Princeton, NJ, 1960.
[3] P.W. Bridgman, Dimensional Analysis, second ed., Yale University Press, New Haven, 1931.
[4] J.J. Dongarra, D.W. Walker, Software libraries for linear algebra computations on high performance computers, SIAM Rev. 37 (2) (1995) 151–180.
[5] J.J. Dongarra, R.A. van de Geijn, D.W. Walker, Scalability issues affecting the design of a dense linear algebra library, J. Parallel Distributed Comput. 22 (1994) 523–537.
[6] R. van de Geijn, Massively parallel LINPACK benchmark on the Intel Touchstone DELTA and iPSC/860 systems: preliminary report, Technical Report TR-91-28, Department of Computer Sciences, University of Texas, 1991.
[7] J.L. Gustafson, G.R. Montry, R.E. Benner, Development of parallel methods for a 1024-processor hypercube, SIAM J. Sci. Statist. Comput. 9 (1988) 609–638.
[8] J. Gustafson, D. Rover, S. Elbert, M. Carter, The design of a scalable, fixed-time computer benchmark, J. Parallel Distributed Comput. 12 (4) (1991) 388–401.
[9] V. Kumar, A. Grama, N. Vempaty, Scalable load balancing techniques for parallel computers, J. Parallel Distributed Comput. 22 (1) (1994) 60–79.
[10] R.W. Numrich, Cray-2 memory organization and interprocessor memory contention, in: C. Meyer, R.J. Plemmons (Eds.), Linear Algebra, Markov Chains, and Queueing Models, The IMA Volumes in Mathematics and Its Applications, vol. 48, Springer, Berlin, 1992, pp. 267–294.


[11] R.W. Numrich, Memory contention for shared memory vector multiprocessors, in: Proceedings of Supercomputing '92, IEEE Computer Society, 1992, pp. 316–325.
[12] R.W. Numrich, A parallel numerical library for co-array Fortran, in: Parallel Processing and Applied Mathematics: Proceedings of the Sixth International Conference on Parallel Processing and Applied Mathematics (PPAM05), Springer Lecture Notes in Computer Science, vol. 3911, Poznan, Poland, 2005, pp. 960–969.
[13] R.W. Numrich, What does the Pi Theorem tell us about computer performance analysis?, in: High Productivity Computing Systems Productivity Team Meeting, Marina del Rey, CA, January 10–11, 2006.
[14] Top 500 Benchmark, http://www.top500.org/lists/linpack.php.
[15] P.J. Worley, The effect of time constraints on scaled speedup, SIAM J. Sci. Statist. Comput. 11 (5) (1990) 838–858.

Bob Numrich is a Senior Research Associate at the Minnesota Supercomputing Institute, University of Minnesota, Minneapolis. His research interests include parallel architectures, parallel programming languages, and parallel numerical algorithms. He also studies computer performance analysis and the development of theoretical models that yield self-similarity relationships between systems. Before his position at the University of Minnesota, he was Principal Scientist at Cray Research, where he worked on the Cray-2 and Cray-3 architectures and was a member of the core development teams for the Cray-T3D and Cray-T3E. He invented the one-sided parallel programming model that became the SHMEM Library, and he is the principal author of the Co-Array Fortran programming model.