IMPLEMENTING LANCZOS-LIKE ALGORITHMS ON HYPERCUBE ARCHITECTURES

David S. SCOTT

Intel Scientific Computers, 15201 Northwest Greenbriar Parkway, Beaverton, OR 97006, USA

Received 8 August 1988
This paper discusses the Lanczos algorithm for large symmetric eigenvalue problems and its implementation on the Intel hypercube distributed-memory, message-passing parallel computer system.
1. Introduction

Large symmetric eigenvalue problems arise in a number of scientific and engineering disciplines. In recent years the Lanczos algorithm has emerged as the most efficient method of computing some or even all of the eigenvalues of a large sparse symmetric matrix. The algorithm converges quickly to well separated eigenvalues at either end of the spectrum. The algorithm can converge to several eigenvalues at once and there are no adjustable parameters that have to be set by the user. The matrix appears in the algorithm only in the formation of matrix-vector products. Thus, the sparsity of the matrix is preserved and the matrix-vector product subroutine can be coded by the user to take advantage of any other special structure of the matrix. The bizarre numerical behavior of the algorithm is now understood and careful implementations exist which circumvent the problems. See ref. [1] for a survey of results up to 1980. Recent research has focused primarily on the spectral transformation or shift-and-invert Lanczos [2]. This technique, which is analogous to inverse iteration, takes advantage of "inverting" the operator to allow rapid convergence to arbitrary eigenvalues. The Lanczos algorithm is described in greater detail in section 2 of this paper.

The most cost effective form of computing is obtained with integrated microprocessors. The most powerful computers in the future will be machines consisting of thousands of microprocessors. Unfortunately, it is not possible to connect thousands (or even hundreds) of processors to a common memory. Instead each processor will have its own private memory. To allow these separate computers to work together on a common problem, they will be connected by a network which supports messages passing from one processor to another. The Intel hypercube (iPSC®/2 system) is a particular example of a distributed memory, message passing parallel computer. The name hypercube comes from the topology of the hardware interconnection network. Because there is no common memory, each processor has its own program, although it may be just its copy of the same program, and has its share of the data. The main difficulty in programming these machines is in determining how to distribute the data among the processors. Once an algorithm is chosen for a particular problem and a decomposition strategy is determined, it is relatively straightforward to produce the code.

Section 3 of this paper describes the Intel machine in more detail and gives a programming example. Section 4 discusses how to implement the simple Lanczos algorithm on a distributed memory message passing system; the algorithm itself is given in section 2. Section 4 describes the distribution of the data as well as how each step of the algorithm is implemented on the machine. Section 5 gives some conclusions, appendix A documents the global communication utilities and a complete listing of the code appears in appendix B.
2. The Lanczos algorithm

The Lanczos algorithm [3] was originally used to tridiagonalize symmetric matrices. To obtain numerical accuracy, it had to be augmented with an expensive reorthogonalization step. It was soon replaced by more effective methods based on explicit orthogonal similarity transformations. In the modern interpretation, the Lanczos algorithm is viewed as applying the Rayleigh-Ritz procedure to a Krylov subspace. The Rayleigh-Ritz procedure is a mechanism for extracting the best approximate eigenvectors and associated eigenvalues from a given subspace (ref. [4], p. 213). A Krylov subspace of a matrix A is

    K(A, q, j) = span{q, Aq, A^2 q, ..., A^(j-1) q}

for some starting vector q and some dimension j. The simple Lanczos algorithm is as follows.

Simple Lanczos algorithm
For a starting vector r_0 ≠ 0, with q_0 = 0 and β_0 = ||r_0||,
for j = 1, 2, ..., do steps L1 to L6:
    L1. q_j = r_{j-1}/β_{j-1},
    L2. u_j = A q_j - q_{j-1} β_{j-1},
    L3. α_j = u_j^T q_j,
    L4. r_j = u_j - q_j α_j,
    L5. β_j = ||r_j||,
    L6. If convergence, STOP.

Let Q_j = (q_1, q_2, ..., q_j) and let T_j be the symmetric tridiagonal matrix with diagonal entries α_1, ..., α_j and off-diagonal entries β_1, ..., β_{j-1}. These matrices explicitly satisfy the equation

    A Q_j = Q_j T_j + r_j e_j^T,    (E1)

where e_j^T = (0, 0, ..., 0, 1) is a j-vector, and, because of the symmetry of A, Q_j satisfies

    Q_j^T Q_j = I_j.    (E2)

The span of the columns of Q_j is the Krylov space K(A, r_0, j). It follows from eqs. (E1) and (E2) that

    T_j = Q_j^T A Q_j,

which is the reduced matrix in the Rayleigh-Ritz procedure. The eigenvalues of T_j ("Ritz values") are the approximate eigenvalues of A derived from this space. The corresponding Ritz vectors are the columns of

    Y_j = Q_j S_j,

where S_j is the eigenvector matrix of T_j.

Vectors in the Krylov space are of the form p(A)q_1, where p is a polynomial of degree j - 1. Various bounds on the accuracy of these Ritz vectors have been derived using Tchebychev polynomials [5,6,4 (p. 242)]. Three factors affect the convergence rate:
1. Extreme eigenvalues, those on the edges of the spectrum, converge faster.
2. Eigenvalues which are well separated from the bulk of the spectrum converge faster.
3. Eigenvalues which are well represented in the starting vector have a head start. In particular, if the starting vector is an eigenvector, the process will terminate immediately, having converged to that eigenvalue.

The simplest a posteriori way to bound the accuracy of a Ritz value θ is to compute the residual norm of the associated Ritz vector y, ||Ay - yθ||. There is an eigenvalue of A at least this close to θ (ref. [4], p. 69). Fortunately, it is possible to monitor the residual norm of a Ritz vector during a Lanczos run without computing the Ritz vector. In fact, from eq. (E1),

    ||A y_i - y_i θ_i|| = β_j |s_{ji}|,

where s_{ji} is the last element of the i-th eigenvector of T_j. The quantity β_j is calculated by the algorithm anyway and, since j is usually very small compared to n (the dimension of A), the calculation of s_{ji} is relatively inexpensive. This is the calculation which is done in step L6 of the algorithm. Note that until the Ritz vectors are computed, it is only the bottom row of S_j which is needed.

Unfortunately, not all of what is described above is preserved in finite precision arithmetic. Eq. (E1) is still satisfied to working precision but eq. (E2) is completely destroyed. This is the infamous "loss of orthogonality" of the Lanczos algorithm. This behavior was known from the beginning, as Lanczos himself recommended explicit orthogonalization of the Lanczos vectors, but only with the work of Paige [7] was the underlying cause understood. In essence, loss of orthogonality occurs hand in hand with the convergence of a Ritz vector. Paige discovered that this loss of orthogonality did NOT prevent the convergence of other Ritz vectors; it just caused multiple copies of the converged Ritz vectors to appear. Various schemes have been proposed to control this loss of orthogonality, including doing nothing (and purging the repeated copies after the fact) [8], selective orthogonalization [9] and periodic orthogonalization [10]. In most cases periodic orthogonalization is the most efficient.

The Lanczos algorithm is incapable of finding an eigenvector which is orthogonal to the starting vector. For simple eigenvalues this event has probability zero and is unstable in the face of rounding errors. However, for multiple eigenvalues this is inevitable. Only the projection of the starting vector into the eigenspace will be found. The orthogonal eigenvectors will be missed. For applications where determining multiplicities is required, it is necessary to make multiple Lanczos runs, orthogonalizing against previously found eigenvectors.

It is also possible to implement a block version of the Lanczos algorithm [11] in which each q vector is now an n × b matrix and each α and β is a b × b matrix. This allows multiplicities up to size b to be found simultaneously. Block Lanczos also has advantages when the matrix is not stored in main memory. If the matrix is being recalled from disk, or is being recomputed each time, then it is much more efficient to multiply several vectors at once. Block Lanczos is also more complicated. The T matrix is now block tridiagonal, which makes extracting its eigenvalues more difficult, and there is the added concern of loss of rank in a block.

The most common problem encountered with the Lanczos algorithm is slow convergence to the desired eigenvalues. Even if the desired eigenvalues are at one end of the spectrum, they may be very poorly separated from the rest of the spectrum. Even worse, the desired eigenvalues may lie in the middle of the spectrum (see ref. [12] for example). The best solution to this problem is to choose a shift σ near the desired eigenvalues and compute the LU factorization of (A - σI). With this factorization it is easy to solve equations (A - σI)x = b, and it is possible to run Lanczos on (A - σI)^(-1) by solving a system of equations at each step. This is the "spectral transformation" or "shift-and-invert" Lanczos algorithm [2]. Sparse factorization packages have been developed [13,14] which make this task easier. In the transformed spectrum, the eigenvalues near σ are now the extreme eigenvalues and the rest of the spectrum is compressed near zero. Convergence is extremely rapid. If factoring the matrix is too expensive then it is possible to use preconditioning techniques to improve convergence rates [15].

There is also a nonsymmetric version of the Lanczos algorithm. It suffers from an additional kind of numerical difficulty [16] and there is no developed convergence theory, but it seems to work well in practice. Finally, it should be mentioned that Lanczos is closely related to the conjugate gradient method of solving systems of linear equations. In fact there are some circumstances in which it is advantageous to use the Lanczos formulation [17].
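Before turning to the parallel implementation, the following minimal serial sketch of steps L1-L5 may help fix the notation. The test matrix, its dimension and the number of steps are illustrative only; step L6 (the convergence test) and any reorthogonalization are omitted, and the actual distributed code is the one listed in appendix B.

      program simplan
c     serial sketch of steps L1-L5 of the simple Lanczos algorithm
      integer n, maxj
      parameter (n = 4, maxj = 4)
      double precision a(n,n), qold(n), qnew(n), r(n), u(n)
      double precision alpha, beta, betold
      integer i, j, k
c     example matrix: the tridiagonal (-1, 2, -1) matrix
      do 10 i = 1, n
         do 5 k = 1, n
            a(i,k) = 0.0d0
    5    continue
         a(i,i) = 2.0d0
         if (i .gt. 1) a(i,i-1) = -1.0d0
         if (i .lt. n) a(i,i+1) = -1.0d0
   10 continue
c     starting residual r0, q0 = 0, beta0 = ||r0||
      betold = 0.0d0
      do 20 i = 1, n
         r(i) = 1.0d0
         qold(i) = 0.0d0
         betold = betold + r(i)*r(i)
   20 continue
      betold = sqrt(betold)
      do 100 j = 1, maxj
c        L1. q(j) = r(j-1)/beta(j-1)
         do 30 i = 1, n
            qnew(i) = r(i)/betold
   30    continue
c        L2. u = A*q(j) - q(j-1)*beta(j-1)
         do 50 i = 1, n
            u(i) = -betold*qold(i)
            do 40 k = 1, n
               u(i) = u(i) + a(i,k)*qnew(k)
   40       continue
   50    continue
c        L3. alpha(j) = u'*q(j)
         alpha = 0.0d0
         do 60 i = 1, n
            alpha = alpha + u(i)*qnew(i)
   60    continue
c        L4. r(j) = u - q(j)*alpha(j),  L5. beta(j) = ||r(j)||
         beta = 0.0d0
         do 70 i = 1, n
            r(i) = u(i) - alpha*qnew(i)
            beta = beta + r(i)*r(i)
   70    continue
         beta = sqrt(beta)
         print *, 'j =', j, '  alpha =', alpha, '  beta =', beta
c        prepare for the next step
         do 80 i = 1, n
            qold(i) = qnew(i)
   80    continue
         betold = beta
  100 continue
      end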
3. The iPSC®/2 system

The iPSC/2 system is a distributed memory, message passing parallel computer. Each processing node consists of an Intel 80386 cpu, an 80387 numeric coprocessor, an optional Weitek 1167 numeric coprocessor, one to sixteen megabytes of memory, a communication daughter board with eight bidirectional communication channels, and an optional vector coprocessor board. Up to seven of the channels can be used for node to node communication. The eighth channel is for I/O. The number of processors in a machine is always a power of two. If the processors are thought to lie at the corners of a D-dimensional cube, then the wires form the edges of the cube. This gives rise to the popular name hypercube. The I/O channel of node zero is connected to the System Resource Manager (SRM). Both a product summary [18] and a technical summary [19] are available. The I/O channels of the other nodes can be connected to an optional concurrent I/O system consisting of I/O nodes with access to disks, tapes and channels to other computing systems. A summary of the I/O subsystem is available [20].

The compilers for the system reside on the SRM. FORTRAN, C and LISP are currently available. Vast-2, a vectorizing precompiler for FORTRAN, provides automatic access to the vector library. ADA is in development and will be available soon. A version of Flat Concurrent Prolog (FCP) is available from the Weizmann Institute [21]. System diagnostics are run from the SRM. The SRM is the network gateway to TCP/IP ethernet. Also provided is a remote hosting environment in which users drive the cube directly from a workstation on the ethernet. Sun 3's are currently supported and Vaxes running VMS will be supported soon. In addition, the SRM manages the allocation of nodes of the cube. Individual nodes are not time shared. Users request, and are allocated, subcubes which are privately held until released. Many users can be on the system simultaneously.

Despite the fact that the wires themselves are point-to-point between adjacent nodes, the Direct-Connect® network hardware provides automatic circuit switching so that a message may traverse several wires without interrupting the intervening nodes in any way. The performance of the communication subsystem can be modelled with two terms. There is a startup term, which consists of the software overhead at each end and the time to set up the circuit, and there is the actual transmission time. Messages are transmitted at 2.8 Mbytes/s. The startup time for a short message is about 280 µs. The actual time to set up the circuit is about 5 µs per wire, so the communication performance between any two nodes is essentially constant, independent of the number of wires traversed. Because of the very high bandwidth, it is more efficient to send a few large messages than a lot of little ones, but it hardly matters which nodes are the destinations of the messages.

It is possible for a user to load multiple processes on a single node. The Node eXecutive (NX/2) operating system provides for process management, memory management and communication services. It has been optimized for speed and reliability. The standard languages have been augmented by system calls for sending and receiving messages. Environmental calls are provided which allow codes to run correctly regardless of the number of nodes being used. The environmental calls are

    mynode()     the processor number I am running on,
    numnodes()   the number of processors,
    mypid()      the process number I am (user assigned at load time),
    myhost()     the host number I am attached to.

The basic system calls are

    csend(type, buffer, mlength, node, pid),
    crecv(type, buffer, blength),

where "c" stands for complete: the call will not return until the send or receive is complete. The completion of a send does not mean that a process has received the message. It just means that the message has been sent to the destination node. The operating system will then hold the message until the process asks for it. The parameters are as follows:
    type      is a user supplied positive integer used to filter which message will be received;
    buffer    is the actual message;
    mlength   is the length (in bytes) of the message;
    blength   is the length (in bytes) of the buffer;
    node      is the destination node number;
    pid       is the destination pid number.
The type parameter is needed to ensure proper receipt of messages. The nodes are running asynchronously and it is possible for several messages to be waiting to be received at a node. A receive call chooses the oldest message of the requested type. Thus, messages can be received in a different order than they arrived. It is possible to query the operating system after receiving a message to determine its length, source node or source pid. There are also asynchronous versions of send and receive (called isend and irecv, "i" for initiate). Finally, there is an interrupt driven receive (hrecv). Only the complete (synchronous) forms of message passing will be discussed further here.

There are also numerous global communication utilities available. The only three used in this paper are:

    gsend    sends a message to all nodes; this is implemented as a normal send with the special destination node "-1";
    gssum    computes the global (single precision) sum of contributions from each node;
    gcol     collects (concatenates) strings from each node in node number order.

gssum and gcol are documented in appendix A.

For an existing code, there are usually four steps in the process of porting the code to the iPSC/2. First, the code is run on the SRM to check for any compiler problems. Then the compute portion of the code is extracted and run on one node while the user interface is left on the host. Next, the compute part is distributed among the nodes and, finally, the code can be tuned for performance. To demonstrate this procedure, we will port an example code to the iPSC/2. One way to compute pi is as the integral
    pi = ∫_0^1 4/(1 + x^2) dx.

The rectangle rule for numerical integration approximates the integral as the sum of rectangles. If the number of rectangles (n) is 10 then the graph looks like fig. 1. As the number of rectangles is increased, the sum converges to pi. A code to compute this sum is as follows.

      print *, ' input n'
      read *, n
      h = 1.0/n
      halfh = 0.5*h
      sum = 0
      do 10 i = 0, n - 1
         x = i*h + halfh
         sum = sum + f(x)
   10 continue
      pi = h*sum
      print *, n, pi
      end

where f(x) is the function 4/(1 + x*x).
In this case the compute part of the code is the central portion. The node numbers in the cube start with zero. The single node in a subcube of dimension zero is logical node number zero regardless of what physical node number it is. If the compute part of the code is ported to node zero, the result is as follows.

Host code:

      print *, ' input n'
      read *, n
      call csend(0, n, 4, 0, mypid())
      call crecv(1, pi, 4)
      print *, n, pi
      end

Node code:

      call crecv(0, n, 4)
      h = 1.0/n
      halfh = 0.5*h
      sum = 0
      do 10 i = 0, n - 1
         x = i*h + halfh
         sum = sum + f(x)
   10 continue
      pi = h*sum
      call csend(1, pi, 4, myhost(), mypid())
      end
This code will run slower on one node than just on the host since there is the extra cost of sending n and receiving pi. Some of the computation could have been done on the host, but it is best to use the host only for user interface functions since it is a multiuser system.

To make use of multiple nodes it is necessary to modify the code. Several changes are necessary. The main change is that the work must be partitioned. In this case, it is the DO loop which must be divided. It is also necessary that the input data be distributed to all the nodes and that the correct answer be collected at the end. The multiple node code looks like this:

Host code:

      print *, ' input n'
      read *, n
      call csend(0, n, 4, -1, mypid())
      call crecv(1, pi, 4)
      print *, n, pi
      end

Node code:

      np = numnodes()
      call crecv(0, n, 4)
      h = 1.0/n
      halfh = 0.5*h
      sum = 0
      do 10 i = mynode(), n - 1, np
         x = i*h + halfh
         sum = sum + f(x)
   10 continue
      pi = h*sum
      call gssum(pi, 1, dummy)
      if (mynode() .eq. 0) then
         call csend(1, pi, 4, myhost(), mypid())
      endif
      end

In this version, a copy of the node program is loaded on to each node of the allocated subcube and runs simultaneously. The csend with destination node "-1" will cause a copy of the message to be sent to each node. Each node will determine its part of the work by calling numnodes() and mynode(). It will compute its part of the answer and then combine it with the other parts by a call to gssum. The IF test at the end of the node code is needed so that only one message arrives at the host.
The two main issues in programming these machines are load balancing and communication overhead. A program is load balanced if all the processors are busy all the time. If one processor has most of the work to do then the others will end up being idle most of the time and the program will be inefficient. The pi program is almost perfectly balanced. The only lack of balance is that some nodes have one more rectangle than the rest. For large n, this is insignificant.

Processors can be busy but spend almost all their time talking to each other rather than doing any work. Programs like that are said to have a high communication overhead. The pi program has low communication overhead provided n is not tiny. However, if n is small enough it would be faster to run the program on the host and avoid the cost of sending n and receiving pi.

Computer scientists have traditionally evaluated parallel algorithms by solving a fixed sized problem on more and more processors. If the problem runs p times faster on p processors it has achieved perfect speedup. Because any parallel program will involve some overhead, perfect speedup cannot be obtained. In fact, for a fixed size problem, speedup is limited to some finite number independent of the number of processors. This statement is the parallel version of what is known as Amdahl's law [22]. It has been used to claim that there is no point in having machines with more than a few processors. Cleve Moler [23] pointed out the flaw in this argument several years ago. The point is that as machines get bigger, so do problems. And as problems get bigger, parallel machines get more efficient. We refer to this result as "Moler's law".
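In its standard form, Amdahl's law says that if a fraction f of the work is perfectly parallelizable, the speedup on p processors is at most

    S(p) = 1/((1 - f) + f/p) <= 1/(1 - f),

so the serial fraction alone caps the attainable speedup, no matter how large p becomes; Moler's observation is that f itself grows with the problem size.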
Fig. 1. The rectangle rule approximation with n = 10 rectangles.
In fact, researchers at Sandia [24] recently achieved speedups of over 1000 on a machine with 1024 processors.

The whole programming process, as well as a survey of decomposition strategies, is described in ref. [25].

4. Lanczos on the iPSC/2

4.1. Data distribution

To run the Lanczos algorithm on the iPSC/2 it is necessary to distribute the computation among the nodes, which in turn requires the distribution of the data. The fundamental data elements in the algorithm are the q-vectors of length n. The simplest division of a vector of length n is the block partition. Each processor gets a sequential set of elements. If n is 21 and p is 4 then node zero gets the first six elements and each of the other nodes gets five. Since p does not divide n, one processor gets an extra element, which is assigned to node zero. The Q matrix is thus distributed among the nodes. Similarly, the matrix A is sliced by rows and distributed to the nodes. On the other hand, the T matrix, which must be accessed by all nodes, should be replicated on each node.

In general, each step of the Lanczos algorithm on the iPSC/2 requires some local computation and most require some global communication. We now investigate each step in detail.

L1. q_j = r_{j-1}/β_{j-1}. This is purely a local operation. Each processor divides its part of r_{j-1} by its copy of β_{j-1}.

L2. u_j = A q_j - q_{j-1} β_{j-1}. The subtraction of q_{j-1} β_{j-1} is also a local operation but the formation of A q_j is not (unless A is block diagonal). Each processor has the part of A needed to compute the required inner products but it does not have all of q_j. To get all of q_j, each processor should copy its part of q_j into a temporary n-vector w and then call gcol. gcol is a system routine for the global collection (concatenation) of data. At the end of the call w will contain all of q_j. The computation of A q_j is then the local computation of dot products.

L3. α_j = u_j^T q_j. This is a global dot product. Each node can compute its part. Then each node calls gssum.

L4. r_j = u_j - q_j α_j. This is a local vector subtraction.

L5. β_j = ||r_j||. This is a local inner product, followed by a gssum and a local square root. (To be sure, the inner product may underflow or overflow when the norm is well defined. A more careful implementation is possible but will not be pursued here.)

L6. If convergence, STOP. This is an eigensystem calculation on a small tridiagonal matrix. Each node has a copy of the matrix and the simplest approach is to have each node compute the answer simultaneously. This is certainly better than having one node compute it and then send it to everyone else, since the others would just be waiting anyway.

4.2. Load balancing

Steps L1, L3, L4 and L5 are all almost perfectly load balanced. Step L6 is not load balanced at all. If the T matrix got sufficiently large that the eigensystem calculation were a significant fraction of the entire calculation, it would be important to do a distributed implementation. This is possible in several ways. It would be possible to do simultaneous bisection with each processor resolving a different part of the spectrum. It is also possible to first compute the eigenvalues and then do perfect shift QR with each processor updating a different slice of the eigenvector matrix. These possibilities are discussed in ref. [26].

Step L2 is the most interesting. As long as there are about the same number of nonzeroes in each row of A, this approach is load balanced. If one row were full then the processor with that row would work much harder while the rest were idle waiting for the next result. However, there is no requirement that each processor have the same number of rows of A. It actually makes more sense to adjust the partition so that each node has about the same number of elements of A. Then the matrix-vector product computation would be balanced. A sketch of one distributed step is given below.
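The following sketch shows how one Lanczos step maps onto the gssum and gcol primitives described in section 3 and appendix A. The routine and variable names are illustrative (they are not the ones used in appendix B), the arithmetic is single precision to match the gssum interface, and for the first step the caller passes qold = 0. Between calls the caller moves qnew into qold and bnew into bold.

c     one distributed Lanczos step (steps L1-L5); each node holds m
c     consecutive elements of the n-vectors and the matching m rows
c     of the dense matrix a
      subroutine lanstep(n, m, a, qold, qnew, r, w, bold, alpha, bnew)
      integer n, m, i, k, ncnt
      real a(m,n), qold(m), qnew(m), r(m), w(n)
      real bold, alpha, bnew, dummy
c     L1. local scaling of the residual by beta(j-1)
      do 10 i = 1, m
         qnew(i) = r(i)/bold
   10 continue
c     L2. gather the full q vector into w (the block partition is in
c     node number order, so concatenation reassembles the vector),
c     then form the local rows of A*q and subtract bold*qold
      call gcol(qnew, 4*m, w, 4*n, ncnt)
      do 30 i = 1, m
         r(i) = -bold*qold(i)
         do 20 k = 1, n
            r(i) = r(i) + a(i,k)*w(k)
   20    continue
   30 continue
c     L3. global dot product: local part, then gssum
      alpha = 0.0
      do 40 i = 1, m
         alpha = alpha + r(i)*qnew(i)
   40 continue
      call gssum(alpha, 1, dummy)
c     L4. local vector update
      do 50 i = 1, m
         r(i) = r(i) - alpha*qnew(i)
   50 continue
c     L5. local sum of squares, global sum, local square root
      bnew = 0.0
      do 60 i = 1, m
         bnew = bnew + r(i)*r(i)
   60 continue
      call gssum(bnew, 1, dummy)
      bnew = sqrt(bnew)
      return
      end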
4.3. Communication overhead

Only three steps of the algorithm require communication. Two of these are calls to gssum and the third one is a call to gcol. There are two possible optimizations to the communication. To form the matrix-vector product, each processor needs vector values for each nonzero in each row of A that the processor holds. For general sparse matrices, this may include every vector element, and in any case the needed elements are distributed randomly throughout the vector, so a call to gcol is the most efficient way of obtaining them. If the matrix A is banded with bandwidth less than n/p, then only vector elements from adjacent processors are needed. Taking advantage of this can reduce the overhead significantly.

In exact arithmetic it is possible to combine the two calls to gssum. The process is orthogonalizing a vector y against a normalized vector x and then normalizing y. The algorithm presented in section 2 does the following:

    α = x^T y,
    y = y - xα,
    β = ||y||,
    y = y/β.

Alternatively one could compute

    α = x^T y,  β = y^T y,
    β = sqrt(β - α^2),
    y = y - xα,
    y = y/β.

This rearrangement allows one gssum to compute both α and β. Unfortunately, it is numerically dangerous. The final result for β could be very inaccurate if x and y are nearly parallel, since the subtraction inside the square root would cause cancellation. For this reason this alternative will not be pursued further.
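Indeed, since x is normalized and α = x^T y, in exact arithmetic

    ||y - xα||^2 = y^T y - 2α(x^T y) + α^2 (x^T x) = y^T y - α^2,

so the two forms agree; the danger is purely one of finite precision.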
5. Numerical results and conclusions

It is easy to implement the Lanczos algorithm on any distributed memory message passing computer, including the iPSC/2. The distribution of data is straightforward and the resulting code is not much different from the original algorithm. The implementation is naturally load balanced unless the matrix has rows with widely varying numbers of nonzeroes. Even in this case it is possible to adjust the partitioning of the problem to obtain a good load balance.

The Lanczos code in appendix B implements the simple Lanczos algorithm for a dense symmetric matrix in double precision arithmetic. It was run for various numbers of processors and various sizes of matrix. Both vector and nonvector machines were used. Some of the results are given in tables 1 and 2. The times for one step of the Lanczos algorithm are given in milliseconds. Three subtotals are given: the communication time, the arithmetic time for the matrix multiply and the rest of the arithmetic time. The tables also give the megaflop speed based on (2n^2 + 7n) flops per step. No monitoring for convergence was done in these tests. Efficient methods of monitoring [27,28] exist but were not implemented here.

Three general conclusions can be obtained from the results. Communication overhead increases as a fixed size problem is run on more and more processors. Relative communication overhead decreases as larger problems are run on the same number of processors. The vector results are both much faster and less efficient than the nonvector results.

Table 1
Nonvector results (times in ms for one Lanczos step)

N, P       Total    Comm.    MM       Arith.   Mflops
320, 1     1528     0        1512     16.3     0.14
320, 2     763      4.4      750      8.4      0.27
320, 4     383      6.4      372      4.7      0.54
320, 8     196      9.1      184      3.3      1.06
320, 16    105      11.8     91.2     2.1      1.97
672, 4     1668     8.1      1652     8.7      0.54
672, 8     833      10.0     818      5.1      1.09
672, 16    426      13.3     408      3.4      2.13
1024, 16   965      14.7     946      4.0      2.18

Table 2
Vector results (times in ms for one Lanczos step)

N, P       Total    Comm.    MM       Arith.   Mflops
320, 1     78.7     0        77.4     1.3      2.63
320, 2     43.7     3.5      38.9     1.3      4.74
320, 4     26.6     5.8      19.7     1.1      7.78
320, 8     19.5     8.6      9.9      1.0      10.62
320, 16    17.8     11.3     5.0      1.5      11.63
672, 4     85.8     7.3      77.2     1.3      10.6
672, 8     50.1     10.2     38.8     1.1      18.1
672, 16    33.7     12.8     19.7     1.2      26.9
1024, 16   59.4     13.7     43.8     1.9      35.4
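For example, one step at n = 320 involves 2(320)^2 + 7(320) = 207040 floating point operations; at the 1528 ms of the single nonvector processor this is about 0.14 Mflops, and at the 17.8 ms of sixteen vector processors it is about 11.6 Mflops, in agreement with the tabulated rates.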
The Lanczos algorithm is just one of many computational problems in science and engineering that have been successfully implemented on distributed memory message passing computers. The software effort in porting an application to such a machine is not zero. However, the effort need only be made once. Once the decomposition of a problem has been implemented it will run as written on larger machines. As both the individual node performance improves and the number of available nodes in these systems grows, the performance which can be obtained will be remarkable.

Fig. 2. Node numbering of a three-dimensional cube (nodes 0 to 7).

Fig. 3. Each node holding its own node number as data (data in parentheses).
Appendix A. Global communication routines

A D-dimensional hypercube has p = 2^D nodes. They are numbered from zero up to p - 1. Two nodes are directly connected if the binary representations of their node numbers differ by only one bit. A three dimensional cube is numbered as shown in fig. 2. All the global communication subroutines are implemented taking advantage of the hypercube architecture. There are D phases in each global routine. In each phase, each node trades current data with an adjacent node. At the end of D phases each node has exchanged with all of its neighbors and has the correct result. Suppose gssum is called with each node having its node number as data (see fig. 3).
In phase 1, the nodes trade in the horizontal direction and add the received result, obtaining the data shown in parentheses in fig. 4. In phase 2, the nodes trade and add in the vertical direction (fig. 5). Finally, in phase 3 the nodes trade and add in the depth direction (fig. 6). At the end of the operation all nodes have the correct answer. All of the other operations are implemented in a similar fashion.

Fig. 4. Data after the phase 1 trade and add in the horizontal direction.

The calling sequence for gssum is

      gssum(x, n, work)
      integer n
      real x(n), work(n)

All processors call gssum with input x and n; x is returned with the sum of all the input x's; work is just workspace (to receive messages).

The calling sequence for gcol is

      gcol(x, xlen, y, ylen, ncnt)
      integer xlen, ylen, ncnt
      character x(xlen), y(ylen)

x is the input string; xlen is the number of BYTES in x; y is the output string, which is the concatenation of the input strings in node number order (0, 1, 2, ...);
‘1
4(221 ,. 0(6~
21L
5(,Z2)
~-
0(~-4(?’&). - 1(~)5(28) 6(~8)-
-
2(1 Fig. 5.
Fig. 6.
7~28)
280
D.S. Scott
/
Implementing Lanczos-like algorithms
ylen is the length of y in BYTES ncnt is the length of the output in BYTES, of course, x and y need not be character arrays. Any type can be concatenated as long as the correct lengths in BYTES are used.
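A minimal sketch of the D-phase exchange for a single precision scalar is given below, using only the csend, crecv, mynode, numnodes and mypid calls described in section 3. The message type numbers are arbitrary, the real gssum works on vectors and receives into its work array, and ieor (the bitwise exclusive-or intrinsic) is assumed to be available.

c     D-phase exchange-and-add for one single precision value
      subroutine mygsum(x)
      real x, w
      integer d, nbr, me, p
      me = mynode()
      p  = numnodes()
      d  = 1
c     loop over the dimensions of the cube
   10 if (d .lt. p) then
c        the partner in this phase differs from me in exactly one bit
         nbr = ieor(me, d)
c        trade current partial sums with the partner and accumulate
         call csend(100 + d, x, 4, nbr, mypid())
         call crecv(100 + d, w, 4)
         x = x + w
         d = 2*d
         goto 10
      endif
      return
      end

Since a csend completes as soon as the message has been handed to the destination node, both partners can send before receiving without deadlock. The same exchange pattern, with concatenation in place of addition, yields gcol; because the lengths are in bytes, a slice of m double precision numbers is passed as 8*m bytes.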
Appendix B. Code
      subroutine lanczos(op, n, m, maxj, nsmall, nlarge,
     >                   tol, a, q, r, x, alpha, beta, beta2,
     >                   w, z, ritz, ind)
      external op
      integer n, m, maxj, nsmall, nlarge, ind(*)
      double precision tol, a(m,n), q(m,maxj), r(m), x(n),
     >                 alpha(maxj), beta(maxj), beta2(maxj),
     >                 w(maxj), z(maxj,maxj), ritz(m,*)
c
c     this subroutine implements the Lanczos algorithm
c
c     n       dimension of A
c     m       number of rows of A on this processor
c     maxj    maximum number of lanczos steps
c     nsmall  number of small (leftmost) eigenvalues desired
c     nlarge  number of large (rightmost) eigenvalues desired
c     tol     desired relative accuracy
c     a       my slice of the matrix
c     q       my slice of the q matrix (m x (maxj+1))
c     r       my slice of the current residual
c     x       temporary vector for computing A*q(.,j)
c     alpha   diagonal of Tj
c     beta    off-diagonal of Tj
c     beta2   squares of the off-diagonal of Tj
c     w       the computed eigenvalues of Tj
c     z       eigenvector matrix of Tj
c     ritz    returns the ritz vectors (the ritz values are returned in w)
c
      include 'vpnode.h'
      double precision buf(1024)
      equivalence (vp$fast, buf)
      include 'fcube.h'
      integer i, j, k, ii, ncnt, me, jk, idummy, ierr, qstart
      real urand
      double precision b, b2, dummy, random, ddot, lb, ub, epsi, bound
      me = mynode()
      qstart = me*m + 1
      epsi = 0.0d0
c     Initialize r and b
      do 10 i = 1, m
         r(i) = urand(idummy)
   10 continue
      b2 = ddot(m, r, 1, r, 1)
      call gdsum(b2, 1, dummy)
      b = sqrt(b2)
c     Main loop
      do 100 j = 1, maxj
c     step 1
c        call dcopy(m, r, 1, q(1,j), 1)
c        call dscal(m, 1.0/b, q(1,j), 1)
c        replaced by a single call
         call dsvtsp(m, 1.0/b, 0.0d0, r, 1, q(1,j), 1)
         beta(j) = b
c     step 2
         call dcopy(m, q(1,j), 1, x(qstart), 1)
         call gcolxd(m, x)
         call op(n, m, a, r, x)
         if (j .ne. 1) call daxpy(m, -b, q(1,j-1), 1, r, 1)
c     step 3
         alpha(j) = ddot(m, r, 1, q(1,j), 1)
         call gdsum(alpha(j), 1, dummy)
c     step 4
         call daxpy(m, -alpha(j), q(1,j), 1, r, 1)
c     step 5
         b2 = ddot(m, r, 1, r, 1)
         call gdsum(b2, 1, dummy)
         b = sqrt(b2)
c     step 6
c        (no convergence test in this version)
  100 continue
      return
      end

      subroutine op(n, m, a, r, buf)
      integer n, m
      real*8 a(m,n), r(m), buf(n), ddot
      integer i
      do 20 i = 1, m
         r(i) = ddot(n, a(i,1), m, buf, 1)
   20 continue
      return
      end
References

[1] D.S. Scott, in: Sparse Matrices and their Uses, ed. Iain Duff (Academic Press, London, 1981) p. 139.
[2] T. Ericsson and A. Ruhe, Math. Comput. 34 (1980).
[3] C. Lanczos, J. Res. Nat. Bur. Stand. 45 (1950) 255.
[4] B.N. Parlett, The Symmetric Eigenvalue Problem (Prentice-Hall, Englewood Cliffs, NJ, 1980).
[5] S. Kaniel, Math. Comput. 20 (1966) 369.
[6] Y. Saad, SIAM J. Numer. Anal. (1980).
[7] C.C. Paige, PhD Thesis, University of London (1971).
[8] J. Cullum and R.A. Willoughby, in: Sparse Matrix Proc. 1978 (SIAM, Philadelphia, 1979) p. 220.
[9] B.N. Parlett and D.S. Scott, Math. Comput. 33 (1979) 217.
[10] H.D. Simon, PhD Thesis, University of California (1982).
[11] R. Underwood, PhD Thesis, Stanford University (1975).
[12] J.G. Lewis, PhD Thesis, Stanford University (1976).
[13] A. George, SPARSPAK, University of Waterloo.
[14] M. Schultz, Yale Sparse Matrix Package.
[15] R.B. Morgan, PhD Thesis, University of Texas at Austin (1986).
[16] D. Taylor, PhD Thesis, University of California.
[17] B.N. Parlett, Lin. Alg. Appl. (1980) 323.
[18] iPSC/2 Product Summary, Intel Scientific Computers.
[19] iPSC/2 Technical Summary, Intel Scientific Computers.
[20] Concurrent I/O Product Summary, Intel Scientific Computers.
[21] E. Shapiro, IEEE Comput. (1986) 44.
[22] G.M. Amdahl, Proc. AFIPS Comput. Conf. 30 (1967).
[23] C. Moler, ISC Technical Note (1987).
[24] J. Gustafson et al., Sandia Technical Report (1988).
[25] The Art and Science of Programming Parallel Computers, Intel Scientific Computers.
[26] D.S. Scott, Solving Tridiagonal Eigenvalue Problems on the iPSC/2, in preparation.
[27] B.N. Parlett and B. Nour-Omid, Comput. Phys. Commun. 53 (1989) 169.
[28] B.N. Parlett and B. Nour-Omid, Lin. Alg. Appl. 68 (1985) 179.