Journal of Parallel and Distributed Computing 59, 423–444 (1999)
Article ID jpdc.1999.1586, available online at http://www.idealibrary.com
A BSP Recursive Divide and Conquer Algorithm to Compute the Inverse of a Tridiagonal Matrix

Joan-Josep Climent, Leandro Tortosa, and Antonio Zamora

Departament de Ciència de la Computació i Intel·ligència Artificial, Universitat d'Alacant, E-03071 Alacant, Spain

Received December 22, 1998; revised July 6, 1999; accepted July 12, 1999
In this paper we discuss a recursive divide and conquer algorithm to compute the inverse of an unreduced tridiagonal matrix. It is based on the recursive application of the Sherman–Morrison formula to a diagonally dominant tridiagonal matrix to avoid numerical stability problems. A theoretical study of the computational cost of this method is developed, comparing it with the experimental times measured on an IBM SP2 using switch and Ethernet hardware for communications between processors. Experimental results are presented for two and four processors. Finally, the method is compared with a divide and conquer method for solving tridiagonal linear systems. © 1999 Academic Press
Key Words: tridiagonal matrix; BSP; divide and conquer; Sherman–Morrison formula; diagonally dominant.
1. INTRODUCTION

The problem of computing the inverse of a matrix A is much more expensive than solving the linear system

Ax = d,   (1)

and most computational problems involving inverses can be reformulated in terms of the solution of linear systems. So, we can say, in broad terms, that the explicit computation of the inverse should be avoided whenever possible. However, we must note that there are some practical applications where the inverse of a matrix needs to be computed explicitly.

Linear algebra in general, and the computation of the inverse of a matrix in particular, are used quite heavily in structural engineering. The reason for this is that the analysis of a structure in equilibrium involves writing down many equations with many unknowns. Often, these equations are linear, even when the material deformation is considered. For instance, we consider the problem of a horizontal elastic beam supported at each end and subjected to forces at points 1, 2, ..., n. If x ∈ R^n is the list of forces at these points and d ∈ R^n is the amount of
deflection of the beam at these n points, then by Hooke's law in physics we have a linear system like (1), where A is called the flexibility matrix. Its inverse is called the stiffness matrix. In most cases it is necessary to compute A^{-1} because of the physical significance of its columns: the first column of A^{-1} represents the forces that must be applied at the n points in order to produce a particular deflection at point 1 and zero deflections at the other points, and so on for the remaining columns.

Many structural engineering or physical problems where we have to obtain the stiffness matrix in order to obtain information from the columns or rows of such a matrix require the computation of the inverse of a tridiagonal matrix. For example, we can refer to the spring–mass problem in physics, where some masses are suspended vertically by a series of springs and we must consider the spring constants and the displacements of each spring from its equilibrium position. When applying Newton's second law to this problem we obtain a linear system whose coefficient matrix is tridiagonal and diagonally dominant. The inverse of this matrix, that is, the stiffness matrix, is closely connected to the displacement of the masses when some external forces are imposed on them. So, the (i, j)th entry of the inverse matrix tells us what the displacement of mass i will be if a unit external force is imposed on mass j.

But we can also find applications where inverse matrices are computed outside physical or engineering problems. In dynamical systems theory one of the most important processes in numerical analysis is the determination of fixed points, either by direct or iterative methods. This is important due to the fact that in a dynamical system all the periodic orbits can be expressed as fixed points. The best-known process to determine fixed points is Newton's method. This method requires the inversion of a certain matrix, the calculation of which is difficult in higher dimensional problems (see Stoer and Bulirsch [18]). Also related to dynamical systems and chaos, we find that the numerical simulation of the bidomain equations for cardiac tissue requires the inversion of a large matrix at every time step (see Keener and Bogar [9]).

In modern numerical simulation tools for partial differential equations, particularly in the field of finite element or finite volume approaches, mathematical methods such as adaptive meshing and error control, together with sophisticated iterative solvers of multigrid type, have proven to be very suitable. Assuming tensor product meshes, the application of inverse tridiagonal matrices as preconditioners can be easily performed and provides rather good convergence properties (see Altieri [1]). For developing and studying tridiagonal preconditioners for iterative methods like the Gauss–Seidel or the conjugate gradient method, it is often of interest to know the properties of the inverse, for instance, how the elements of the inverse decay along a row or a column. For a more detailed description of the applications of the inverse related to preconditioners, see Meurant [14] and the references therein.

We consider the problem of computing the inverse of a tridiagonal matrix A with the property of being unreduced and diagonally dominant, to avoid numerical stability problems.
Some authors have studied this problem and have proposed efficient algorithms to compute explicitly the inverse of such matrices, e.g., Lewis [10], Meurant [14], Usmani [20], and Huang and McColl [8].
In this paper we propose a divide and conquer algorithm to compute the inverse of an unreduced, diagonally dominant tridiagonal matrix, which constitutes an application of the Sherman–Morrison formula to the above class of matrices. The algorithm proposed is based on the divide and conquer principle: we divide the general problem into two smaller problems, and then merge the solutions of the subproblems to recover the solution of the original problem.

The plan of this paper is as follows: in Section 2 the BSP model of computation is introduced, while in Section 3 we describe the basic ideas of the method. Section 4 is devoted to discussing the parallelization of the method, obtaining two different implementations in parallel. In Section 5 we obtain the theoretical computational cost of both implementations. We present some theoretical and experimental times on an IBM SP2 with switch and Ethernet hardware for communications in Section 6. A comparison between this method and a divide and conquer method for tridiagonal linear systems is performed in Section 7. Finally, conclusions are summarized in Section 8.

2. THE BSP MODEL

The BSP model (see McColl [11, 12] and Valiant [21]) provides software developers with a way out of the world of architecture-dependent parallel software. A BSP computer consists of the following elements: a set of p processor–memory pairs, a global communication network that allows processors to have access to nonlocal data, and a mechanism for the global synchronization of all the processors.

The BSP programming model is based on the concept of a superstep. A BSP program is a sequence of supersteps, where each one is subdivided into three ordered phases. In the first one, local computation is performed in each processor; in the second one, there are communication actions between the processors; and the last one is a barrier synchronization, which waits for all the communication actions to be completed. At the end of a superstep, after the barrier synchronization is done, the data that were moved to the local memories of the destination processors are available. If the maximum number of incoming or outgoing messages per processor is h, then such a communication pattern is called an h-relation.

The cost model summarizes in a very reduced number of parameters the differences between the different parallel architectures. If we define the time step as the time required for a single local operation, such as an addition or multiplication in floating point, then a BSP computer can be characterized by the following four parameters: p, the number of processors; s, the processor speed (the number of time steps per second); l, the synchronization cost; and g, the maximum number of local operations performed by all processors in one second divided by the total number of words delivered by the communication network in one second. The parameter l can be viewed as the time required for a nonlocal memory access when the message traffic is dense; an accurate value of it for each parallel architecture may be obtained empirically because, in practice, it tends to be dominated by the effects of the software in each processor.

The computational cost of a BSP algorithm is easily obtained as the sum of the costs of all the individual supersteps. The cost of each individual superstep can be
established by taking into account that we must compute the arithmetic cost of the processor that performs the maximum number of operations along the superstep, the communication cost between processors, which is proportional to the maximum number of messages sent or received by the processor that performs the maximum number of communications, and the synchronization cost. The cost of a single superstep, measured in time steps or flops, is upper bounded by w + hg + l, where w is the arithmetic cost and hg is the communication cost, with h the maximum number of words involved in the communication. We must point out that the model does not distinguish between the cost of sending m messages of size one and sending one message of size m.

The total cost (in flops) of a BSP algorithm is upper bounded by the sum of the individual costs of all the supersteps needed to implement it, that is, W + Hg + Sl, where W is the total arithmetic cost of the algorithm, H is the total number of words circulating through the network during the execution of the computation, and S is the total number of supersteps. If we divide the total cost (in flops) by s we obtain the cost, in seconds, of the algorithm.
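As a concrete illustration of this cost model, the following small Python sketch (ours, not part of the paper) evaluates the bounds w + hg + l and W + Hg + Sl. The function names and the per-superstep figures in the example are made up; the g, l, and s values are the switch parameters reported later in Table 2 for p = 4.

```python
# Minimal sketch (not from the paper) of the BSP cost bounds described above.
# All quantities are in flops; dividing by the speed s converts them to seconds.

def superstep_cost(w, h, g, l):
    """Upper bound of one superstep: w + h*g + l (flops)."""
    return w + h * g + l

def bsp_cost_seconds(supersteps, g, l, s):
    """Total cost, in seconds, of an algorithm given as a list of (w, h) pairs:
    w = maximum arithmetic work and h = maximum number of words sent or
    received by any processor in that superstep."""
    total_flops = sum(superstep_cost(w, h, g, l) for w, h in supersteps)
    return total_flops / s

# Hypothetical three-superstep algorithm on a machine with g = 46 flops/word,
# l = 33255 flops, and s = 45e6 flops/s.
print(bsp_cost_seconds([(1000, 1500), (80000, 16000), (900000, 0)],
                       g=46, l=33255, s=45e6))
```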
3. DESCRIPTION OF THE METHOD

The method we consider is a modification of the recursive decoupling method by Spaletta and Evans [17] and Mehrmann [13]. We consider a nonsingular tridiagonal matrix

A = \begin{bmatrix}
a_1 & b_1 & & & \\
c_2 & a_2 & b_2 & & \\
& c_3 & a_3 & \ddots & \\
& & \ddots & \ddots & b_{n-1} \\
& & & c_n & a_n
\end{bmatrix}.

We also assume that matrix A is unreduced, that is, b_i \neq 0 and c_{i+1} \neq 0 for i = 1, 2, ..., n-1, and that A is diagonally dominant, that is,

|a_i| \geq |b_i| + |c_i|,   for i = 1, 2, ..., n,

with c_1 = b_n = 0. Our objective is to compute A^{-1}.

Assume that n = 2^q for some q \geq 1 and let

B_{q-1,i} = \begin{bmatrix} a^{(q-1)}_{2i+1} & b_{2i+1} \\ c_{2(i+1)} & a^{(q-1)}_{2(i+1)} \end{bmatrix},   i = 0, 1, 2, ..., 2^{q-1}-1,
where a^{(q-1)}_1 = a_1,

a^{(q-1)}_{2i} = a_{2i} - c_{2i+1},   i = 1, 2, ..., 2^{q-1}-1,   (2)

a^{(q-1)}_{2i-1} = a_{2i-1} - b_{2i-2},   i = 2, 3, ..., 2^{q-1},   (3)

and a^{(q-1)}_n = a_n. We can compute B^{-1}_{q-1,i}, for i = 0, 1, 2, ..., 2^{q-1}-1, easily as

B^{-1}_{q-1,i} = \frac{1}{\Delta_i} \begin{bmatrix} a^{(q-1)}_{2(i+1)} & -b_{2i+1} \\ -c_{2(i+1)} & a^{(q-1)}_{2i+1} \end{bmatrix},   (4)

where

\Delta_i = a^{(q-1)}_{2i+1} a^{(q-1)}_{2(i+1)} - b_{2i+1} c_{2(i+1)}.   (5)
Now, for k = q-2, q-3, ..., 2, 1, 0 and for i = 0, 1, 2, ..., 2^k - 1, we can obtain the 2^{q-k} \times 2^{q-k} matrices

B^{-1}_{k,i} =
\begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix}
- \alpha_{k,i}
\begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix}
u_{k+1,2i} v^T_{k+1,2i}
\begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix},   (6)

where

\alpha_{k,i} = \frac{1}{1 + v^T_{k+1,2i} \begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix} u_{k+1,2i}},   (7)

and

u_{k+1,2i} = e_{2^{q-(k+1)}} + e_{2^{q-(k+1)}+1},   (8)

v_{k+1,2i} = c_{2^{q-(k+1)}(2i+1)+1} e_{2^{q-(k+1)}} + b_{2^{q-(k+1)}(2i+1)} e_{2^{q-(k+1)}+1},   (9)

where e_{2^{q-(k+1)}} and e_{2^{q-(k+1)}+1} are the 2^{q-(k+1)}th and (2^{q-(k+1)}+1)th columns of the 2^{q-k} \times 2^{q-k} identity matrix. In accordance with the method, B^{-1}_{0,0} = A^{-1}.

We can summarize the method by saying that to compute the inverse of a 2^q \times 2^q tridiagonal matrix A we need to perform q steps, for k = q-1, q-2, ..., 0. The initial step, k = q-1, consists of the computation of the inverses of the blocks B_{q-1,i}, for i = 0, 1, 2, ..., 2^{q-1}-1. The remaining steps, for k = q-2, q-3, ..., 1, 0, consist of the application of formula (6) in order to build new blocks using the previous ones.
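To make the recursion concrete, the following sequential Python sketch (ours, not the authors' implementation; all function names are hypothetical) carries out exactly these steps for n = 2^q: it forms the modified diagonals (2)-(3), inverts the 2 x 2 blocks with (4)-(5), and then repeatedly merges neighbouring blocks with the Sherman–Morrison update (6)-(9). It is intended only for checking the method against a dense inverse on small examples and makes no attempt to exploit the structure that the parallel algorithms below rely on.

```python
import numpy as np

def tridiag_inverse(a, b, c):
    """Sequential sketch of the recursive decoupling method of Section 3.
    a[i] is the diagonal entry, b[i] the entry (i, i+1), c[i] the entry
    (i, i-1) (0-indexed); b[n-1] and c[0] are ignored. Assumes n = 2**q."""
    n = len(a)
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    # Level q-1: decouple A into 2x2 blocks by subtracting, at every junction
    # r = 1, 3, ..., n-3, the rank-one term u v^T with u = e_r + e_{r+1} and
    # v = c[r+1] e_r + b[r] e_{r+1}; this reproduces (2) and (3).
    d = a.copy()
    for r in range(1, n - 1, 2):
        d[r] -= c[r + 1]
        d[r + 1] -= b[r]
    # Inverses of the decoupled 2x2 blocks, cf. (4) and (5).
    blocks = []
    for i in range(0, n, 2):
        det = d[i] * d[i + 1] - b[i] * c[i + 1]
        blocks.append(np.array([[d[i + 1], -b[i]], [-c[i + 1], d[i]]]) / det)
    # Merge neighbouring blocks with the Sherman-Morrison update (6)-(9)
    # until a single block, A^{-1}, is left.
    while len(blocks) > 1:
        size = blocks[0].shape[0]
        merged = []
        for i in range(0, len(blocks), 2):
            Dinv = np.zeros((2 * size, 2 * size))
            Dinv[:size, :size] = blocks[i]
            Dinv[size:, size:] = blocks[i + 1]
            r = (i + 1) * size - 1            # global row index of the junction
            u = np.zeros(2 * size); u[size - 1] = u[size] = 1.0
            v = np.zeros(2 * size); v[size - 1] = c[r + 1]; v[size] = b[r]
            x = Dinv @ u                      # auxiliary vectors
            yT = v @ Dinv
            alpha = 1.0 / (1.0 + yT @ u)      # cf. (7): 1 / (1 + v^T Dinv u)
            merged.append(Dinv - alpha * np.outer(x, yT))
        blocks = merged
    return blocks[0]

if __name__ == "__main__":
    # Check against a dense inverse on a random diagonally dominant example.
    rng = np.random.default_rng(0)
    n = 16
    b = rng.uniform(0.5, 1.0, n); b[-1] = 0.0
    c = rng.uniform(0.5, 1.0, n); c[0] = 0.0
    a = b + c + 2.0                           # strict diagonal dominance
    A = np.diag(a) + np.diag(b[:-1], 1) + np.diag(c[1:], -1)
    print(np.allclose(tridiag_inverse(a, b, c), np.linalg.inv(A)))
```

On small examples the check against a dense inverse passes; each junction removed at the decoupling level is re-added exactly once during the merges, which is what makes B^{-1}_{0,0} equal to A^{-1}.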
4. ALGORITHMS

Suppose that the number of processors is p = 2^m. A BSP algorithm based on the characteristics presented above should include the following supersteps: in the first one, data must be communicated from the main processor P_0 to the remaining ones. In the second superstep we use (6) recursively in each processor P_j, for j = 0, 1, ..., p-1, to compute the (n/p) \times (n/p) inverse B^{-1}_{m,j}; then processor P_j sends this matrix to processor P_0. Finally, in the third superstep, only processor P_0 works to compute B^{-1}_{0,0}, using (6) recursively again. We illustrate the method described above in Fig. 1, where n = 16 and p = 2; in this case, q = 4 and m = 1. Then, we can give the following algorithm.

Algorithm 1. BSP algorithm to compute the inverse of a tridiagonal matrix A.

Superstep 1. Processor P_0 sends the elements a_{2^{q-m}j+i}, b_{2^{q-m}j+i-1}, and c_{2^{q-m}j+i+1}, for i = 1, 2, ..., 2^{q-m}, to processor P_j, for j = 1, 2, ..., 2^m - 1, assuming that c_{n+1} = 0.

Superstep 2. In processor P_j, for j = 0, 1, ..., 2^m - 1,

• compute a^{(q-1)}_{2^{q-m}j+i}, for i = 1, 2, ..., 2^{q-m}, using (2) and (3);
• compute B^{-1}_{q-1, 2^{q-m-1}j+i}, for i = 0, 1, 2, ..., 2^{q-m-1} - 1, using (4) and (5);
• for k = q-2, q-3, ..., m, compute B^{-1}_{k, 2^{k-m}j+i}, for i = 0, 1, 2, ..., 2^{k-m} - 1, using (6) recursively.

Processor P_j, for j = 1, 2, ..., 2^m - 1, sends the matrix B^{-1}_{m,j} to processor P_0.

Superstep 3. For k = m-1, m-2, ..., 1, 0, processor P_0 computes B^{-1}_{k,i}, for i = 0, 1, 2, ..., 2^k - 1, using (6).

The basis of this method is the recursive application of formula (6). The computational work that it represents can be divided into several parts, as we can see in the following algorithm.

Algorithm 2. Compute the matrix B^{-1}_{k,i} from the matrices B^{-1}_{k+1,2i} and B^{-1}_{k+1,2i+1}.
1. Define the vectors u_{k+1,2i} and v_{k+1,2i} using (8) and (9).

2. Compute the auxiliary vectors

x = \begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix} u_{k+1,2i}
and
y^T = v^T_{k+1,2i} \begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix}.

3. Use (7) to compute \alpha_{k,i} as

\alpha_{k,i} = \frac{1}{1 + y^T u_{k+1,2i}}.   (10)

4. Use (6) to compute B^{-1}_{k,i} as

B^{-1}_{k,i} = \begin{bmatrix} B^{-1}_{k+1,2i} & \\ & B^{-1}_{k+1,2i+1} \end{bmatrix} - \alpha_{k,i} \, x y^T.

FIG. 1. Scheme of the method where n = 16 and p = 2.

Because of the structure of the vectors u_{k+1,2i} and v_{k+1,2i}, we compute \alpha_{k,i} using (10) instead of

\alpha_{k,i} = \frac{1}{1 + v^T_{k+1,2i} x}

to avoid some operations.
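Viewed in isolation, one application of Algorithm 2 is just a Sherman–Morrison rank-one update of a block-diagonal inverse. The sketch below (ours, not the authors' code; the function and variable names are hypothetical) implements the four steps above for two neighbouring blocks and checks the result against a dense inverse.

```python
import numpy as np

def merge_inverse(B_left_inv, B_right_inv, b_junction, c_junction):
    """One application of Algorithm 2: combine the inverses of two decoupled
    neighbouring blocks into the inverse of the merged block via (6)-(10).
    b_junction = A[r, r+1] and c_junction = A[r+1, r] at the junction row r."""
    s = B_left_inv.shape[0]
    Dinv = np.zeros((2 * s, 2 * s))
    Dinv[:s, :s] = B_left_inv
    Dinv[s:, s:] = B_right_inv
    # Step 1: the decoupling vectors u and v of (8) and (9), in local indexing.
    u = np.zeros(2 * s); u[s - 1] = u[s] = 1.0
    v = np.zeros(2 * s); v[s - 1] = c_junction; v[s] = b_junction
    # Step 2: auxiliary vectors x and y^T.
    x = Dinv @ u
    yT = v @ Dinv
    # Step 3: alpha via (10); only two entries of y are needed because of u.
    alpha = 1.0 / (1.0 + yT[s - 1] + yT[s])
    # Step 4: rank-one update (6).
    return Dinv - alpha * np.outer(x, yT)

if __name__ == "__main__":
    # 4x4 example: decouple A at the middle junction, invert the halves, merge.
    A = np.array([[4.0, 1, 0, 0], [1, 4, 1, 0], [0, 1, 4, 1], [0, 0, 1, 4]])
    b_r, c_r1 = A[1, 2], A[2, 1]
    D = A.copy(); D[1, 1] -= c_r1; D[2, 2] -= b_r; D[1, 2] = D[2, 1] = 0.0
    B = merge_inverse(np.linalg.inv(D[:2, :2]), np.linalg.inv(D[2:, 2:]), b_r, c_r1)
    print(np.allclose(B, np.linalg.inv(A)))
```

The staging of x, y^T, and alpha mirrors steps 1-4, and the advantage of (10) is visible in step 3, where y^T u reduces to the sum of two entries of y.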
Figure 2 depicts graphically the different steps that are performed in Superstep 2 of Algorithm 1 in processor P_0 for the case shown in Fig. 1, where n = 16 and p = 2.

FIG. 2. Superstep 2 of Algorithm 1 in processor P_0 for q = 4 and m = 1.
The figure shows the three steps that we need to run in P_0 to obtain the 8 \times 8 inverse block before any communication is performed.

So far, we have considered an implementation in parallel of the method with a special communication pattern, in which all the processors communicate their solutions to the main processor and this one computes the final solution. This implementation has the advantage that we need very few supersteps to run the algorithm, but it has the disadvantage that in Superstep 3 only the main processor works.

We will now consider a different implementation based on the same computational work but with a rather different communication pattern. The special characteristics of the method suggest a parallel fan-in algorithm. In this algorithm, the communication pattern is as follows: initially, the responsibility for applying formula (6) recursively is divided among all the processors. Every processor P_j is responsible for computing the inverse block B^{-1}_{m,j} of size 2^{q-m}. In the first q-m steps of the process, no communication is necessary between processors, since each processor may apply (6) recursively on its own; at the end of these steps each processor j has computed the inverse block B^{-1}_{m,j}. In the last m steps, processors must communicate their results to other processors, where the inverse blocks are merged and used as input in the next step.

The processors can be divided into two categories that we label as on (when working) and off (when stopped). At the beginning of the program, the original tridiagonal matrix is partitioned among all the processors, so each one works on a submatrix of size 2^{q-m}. In steps q-1 through m all processors work independently of one another. At the end of step m, every processor j has computed an inverse block B^{-1}_{m,j}. In order to continue with the computations, it becomes necessary to merge the results obtained by different processors. So, every odd-numbered processor, say d, sends its inverse block to its neighboring processor d-1. Processor d-1 combines the two inverse blocks and applies (6) once, in step m-1. In this step, the even-numbered processors remain on, as they have received blocks from other processors, while the odd-numbered processors become off. At the end of step m-1, it again becomes necessary to merge the results retained by the processors that are still on. In step m-2 every processor whose index is a multiple of 4 remains on and the remaining ones are stopped. The process continues in this manner until, in the last step, processor 0 is the only processor on and the rest are off.

Figure 3 shows an example of a fan-in algorithm for the case where n = 16 and p = 4; in this case, q = 4 and m = 2. We can compare it with the scheme shown in Fig. 1 and note that now we need four supersteps for the implementation, while only three supersteps were needed for the previous implementation. On the other hand, with a fan-in algorithm we exploit the parallel characteristics of the method more efficiently, and we split the computational work among processors in a more reasonable way. Clearly, for two processors both implementations coincide. The following algorithm, for p \geq 4, summarizes the parallel implementation of the method when using a fan-in algorithm.
FIG. 3. A scheme of the computations and communications of the method for the case n=16 for four processors, following a fan-in algorithm.
Algorithm 3. A fan-in BSP algorithm to compute the inverse of a tridiagonal matrix A.

Superstep 1. Processor P_0 sends the elements a_{2^{q-m}j+i}, b_{2^{q-m}j+i-1}, and c_{2^{q-m}j+i+1}, for i = 1, 2, ..., 2^{q-m}, to processor P_j, for j = 1, 2, ..., 2^m - 1, assuming that c_{n+1} = 0.

Superstep 2. In processor P_j, for j = 0, 1, ..., 2^m - 1,

• compute a^{(q-1)}_{2^{q-m}j+i}, for i = 1, 2, ..., 2^{q-m}, using (2) and (3);
• compute B^{-1}_{q-1, 2^{q-m-1}j+i}, for i = 0, 1, 2, ..., 2^{q-m-1} - 1, using (4) and (5);
• for k = q-2, q-3, ..., m, compute B^{-1}_{k, 2^{k-m}j+i}, for i = 0, 1, 2, ..., 2^{k-m} - 1, using (6) recursively.

Processor P_{2j+1}, with j = 0, 1, ..., 2^{m-1} - 1, sends the block B^{-1}_{m,2j+1} to processor P_{2j}.

Superstep m-k+2, for k = m-1, m-2, ..., 2, 1.

• Processor P_{2^{m-k}i}, for i = 0, 1, ..., 2^k - 1, computes B^{-1}_{k,i} using (6).
• Processor P_{2^{m-k}+2^{m-k+1}i}, for i = 0, 1, ..., \lfloor(2^k - 1)/2\rfloor, communicates its block B^{-1}_{k,i} to processor P_{2^{m-k+1}i}.

Superstep m+2. Processor P_0 computes B^{-1}_{0,0} using (6).
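The numerical work in Algorithm 3 is the same as before; only the schedule changes. The following small sketch (ours; the function name is hypothetical) prints, for p = 2^m processors, which processors are on in each superstep of the fan-in phase and which block transfers take place. It models only the schedule of Algorithm 3, not the arithmetic.

```python
def fanin_schedule(m):
    """Print the on-processors and block transfers of the fan-in phase of
    Algorithm 3 for p = 2**m processors (schedule only, no numerical work)."""
    p = 2 ** m
    # End of Superstep 2: every odd processor sends its block to its left neighbour.
    print("superstep 2 sends:", [(2 * j + 1, 2 * j) for j in range(p // 2)])
    # Supersteps m-k+2, for k = m-1, ..., 1: processors whose index is a
    # multiple of 2**(m-k) stay on; half of them then forward their block.
    for k in range(m - 1, 0, -1):
        step = 2 ** (m - k)
        on = [step * i for i in range(2 ** k)]
        sends = [(step + 2 * step * i, 2 * step * i) for i in range(2 ** (k - 1))]
        print(f"superstep {m - k + 2}: on={on} sends={sends}")
    print(f"superstep {m + 2}: processor 0 computes the final inverse")

fanin_schedule(2)   # the n = 16, p = 4 case
```

For m = 2 this prints the transfers (1, 0) and (3, 2) after Superstep 2, then (2, 0) in the next superstep, which is the pattern described above for Fig. 3.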
5. COMPUTATIONAL COST

In this section we compute the computational cost of the method described, in accordance with the parallel implementations given by Algorithms 1 and 3.

5.1. Computational Cost of Algorithm 1

Cost of Superstep 1. In this superstep no arithmetic operations are done, so the computational cost is given by the communication cost. The main processor sends 3n - 3n/p elements of matrix A to the remaining processors. Moreover, each processor, except the last one, receives two elements out of the tridiagonal block of A; the last processor only receives one further element. So, the communication cost is (3n - 3n/p + 2p - 3) g flops, and therefore the total cost of this superstep is

\left(3n - \frac{3n}{p} + 2p - 3\right) g + l   flops.   (11)
Cost of Superstep 2. We start with the arithmetic cost, which can be divided into two parts.

1. Arithmetic cost of step q-1. To compute the diagonal elements of the matrices B_{q-1, 2^{q-m-1}j+i}, for i = 0, 1, 2, ..., 2^{q-m-1} - 1, using (2) and (3), we need 2^{q-m} flops, that is, n/p flops. Observe that for j = 0 or j = p-1 we need n/p - 1 flops. To compute B^{-1}_{q-1, 2^{q-m-1}j+i}, for i = 0, 1, 2, ..., 2^{q-m-1} - 1, using (4) and (5), we need 7 \cdot 2^{q-m-1} flops. So, the arithmetic cost of this step is

9 \cdot 2^{q-m-1} = \frac{9n}{2p}   flops.

2. Arithmetic cost of step k, for k = q-2, q-3, ..., m. Due to the structure of the vectors u_{k+1,2i} and v_{k+1,2i}, if we know B^{-1}_{k+1,2i} and B^{-1}_{k+1,2i+1}, we can compute B^{-1}_{k,i} using (6) with

10 (2^{q-(k+1)})^2 + 2 \cdot 2^{q-(k+1)} + 3   flops.

As we apply this formula 2^{k-m} times in step k, computing B^{-1}_{k,i}, for i = 0, 1, 2, ..., 2^{k-m} - 1, requires

2^{k-m} [10 (2^{q-(k+1)})^2 + 2 \cdot 2^{q-(k+1)} + 3]   flops.
Therefore, the arithmetic cost is

\sum_{k=m}^{q-2} 2^{k-m} [10 (2^{q-(k+1)})^2 + 2 \cdot 2^{q-(k+1)} + 3]   flops.

Consequently, the arithmetic cost of the superstep is

\frac{9n}{2p} + \sum_{k=m}^{q-2} 2^{k-m} [10 (2^{q-(k+1)})^2 + 2 \cdot 2^{q-(k+1)} + 3]   flops.

To compute the communication cost of the superstep, we note that the main processor receives (n/p)^2 elements from each of the remaining processors. This represents a total cost for communicating data of (n/p)^2 (p-1) g flops. Adding the arithmetic and communication costs in this superstep we obtain that the computational cost of Superstep 2 is

\frac{9n}{2p} + \frac{1}{p} \sum_{k=m}^{q-2} 2^{k} [10 (2^{q-(k+1)})^2 + 2 \cdot 2^{q-(k+1)} + 3] + \left(\frac{n}{p}\right)^2 (p-1) g + l   flops.   (12)
Cost of Superstep 3. In this superstep we do not perform any communication. The arithmetic work is given by the recursive application of (6), for k = m-1, m-2, ..., 0. If we take into account that only the main processor works, and we follow a reasoning analogous to that of the previous superstep, it is easy to see that the total cost of the superstep is

\sum_{k=0}^{m-1} 2^{k} [10 (2^{q-(k+1)})^2 + 2 \cdot 2^{q-(k+1)} + 3] + l   flops.   (13)
To obtain the computational cost C^{(1)}_T of Algorithm 1, we add expressions (11)–(13), and after the suitable simplifications we obtain a cost of

C^{(1)}_T = \left[ \frac{5}{p^2}(p^2 - p + 1) n^2 + \frac{1}{p}(q - m + pm - 5) n + 3p - 6 \right]
+ \left[ \frac{1}{p^2}(p - 1) n^2 + \frac{3}{p}(p - 1) n + 2p - 3 \right] g + 3l   flops.   (14)
5.2. Computational Cost of Algorithm 3

Now, we compute the cost of Algorithm 3 in order to compare both implementations.

Cost of Superstep 1. This superstep is the same as Superstep 1 in Algorithm 1.

Cost of Superstep 2. This superstep also coincides with Superstep 2 in Algorithm 1, except for the communication cost. Now, only half of the processors send data to other processors. So, the cost of this superstep is

\frac{9n}{2p} + \sum_{k=m}^{q-2} 2^{k-m} [10 (2^{q-k-1})^2 + 2 \cdot 2^{q-k-1} + 3] + \left(\frac{n}{p}\right)^2 \frac{p}{2} g + l   flops.   (15)
Cost of Superstep m-k+2, for k = m-1, m-2, ..., 2, 1. In this superstep each on processor only computes an inverse matrix; that is, formula (6) is used only once. Therefore, the arithmetic cost of this superstep is

10 (2^{q-k-1})^2 + 2 \cdot 2^{q-k-1} + 3   flops.

Now we compute the communication cost. Observe that we are computing the inverse of a 2^{q-k} \times 2^{q-k} matrix, while the number of on processors that must communicate their block is 2^{k-1}; then the communication cost is

(2^{q-k})^2 \, 2^{k-1} g = 2^{2q-k-1} g   flops.

So, the total cost of the superstep is

10 (2^{q-k-1})^2 + 2 \cdot 2^{q-k-1} + 3 + 2^{2q-k-1} g + l   flops.   (16)

Cost of Superstep m+2. In this superstep we only have to perform arithmetic computations. The arithmetic cost is given by

10 (2^{q-1})^2 + 2 \cdot 2^{q-1} + 3   flops.

So, the total cost is

10 (2^{q-1})^2 + 2 \cdot 2^{q-1} + 3 + l   flops.   (17)
Finally, adding expressions (11), (15), (16), and (17) and performing the suitable simplifications, we obtain that the total cost C^{(3)}_T of Algorithm 3 is

C^{(3)}_T = \left[ \frac{5}{3}\left(\frac{1}{p^2} + 2\right) n^2 + \frac{1}{p}(q - m + 2p - 7) n + 3(m - 1) \right]
+ \left[ \frac{1}{2}\left(1 - \frac{1}{p}\right) n^2 + 3\left(1 - \frac{1}{p}\right) n + 2p - 3 \right] g + (m + 2) l   flops.

TABLE 1
Theoretical Cost, in Seconds, for Algorithms 1 and 3 on an IBM SP2 with Switch and Ethernet, for Four Processors with 2^9 ≤ n ≤ 2^20

                 Switch                        Ethernet
   n        C^{(1)}_T      C^{(3)}_T       C^{(1)}_T        C^{(3)}_T
  2^9         0.2126         0.3921          0.4798           0.9221
  2^10        0.5168         0.9181          1.7457           3.3633
  2^11        1.4180         2.3917          6.6565          12.8398
  2^12        4.3914         7.0254         25.9151          50.0117
  2^13       15.0218        23.0387        102.3911         197.6512
  2^14       54.4574        80.9297        407.0371         785.8326
  2^15      208.8272       305.7646       1623.1051        3133.8042
  2^16      817.3199      1187.1683       6491.2946       12534.0807
  2^17     3269.1119      4748.4969      25891.3440       49990.8917
  2^18    12789.7735     18420.9667     103560.9040      199959.0745
  2^19    51158.4640     73683.2084     414234.6761      799827.3242
  2^20   204632.5971    294731.5197    1656920.8254     3199291.3581

FIG. 4. Difference, measured in %, of theoretical times for both implementations using switch and Ethernet with 2^6 ≤ n ≤ 2^20 for p = 4.
To finish this section, we compare the theoretical costs of both implementations for four processors (for two processors there is no difference between the implementations). Table 1 summarizes the theoretical results obtained for Algorithms 1 and 3 of Section 4, measured in seconds. We observe that the times predicted theoretically for the fan-in implementation are always considerably greater than the ones predicted for the first implementation. The times predicted for the implementation with a small number of barrier synchronizations are almost half the times predicted for the fan-in algorithm when using the Ethernet connection, while with the switch the times for the fan-in algorithm grow by almost 50%.

Figure 4 shows graphically the differences in time, from a theoretical point of view, between executing Algorithms 1 and 3. Algorithm 1 is always faster than Algorithm 3, in a proportion that we can see in Fig. 4. For example, for n = 2^12 Algorithm 3 is 60% slower than Algorithm 1 using the switch and 90% slower using Ethernet. Differences are greater for Ethernet hardware.

6. THEORETICAL AND EXPERIMENTAL TIMES

In this section we present theoretical and experimental times for Algorithm 1 using a randomly generated diagonally dominant tridiagonal matrix A. The experiments have been performed on an IBM SP2, using a high performance switch and an Ethernet hardware connection. To implement the BSP algorithms we used BSPlib (see Miller [15] and Hill, Crumpton, and Burgess [5]).
TABLE 2
Values of BSP Parameters for an IBM SP2

             s (Mflops)    p    l (flops)    g (flops/word)    N_{1/2} (words)
 Switch          45        2      15928            70               1600
                 45        4      33255            46               1600
 Ethernet        45        2      24750           700                100
                 45        4      52290           680                100
To measure the BSP parameters, we follow the model proposed by Hockney [7]. In the refined model, g is defined as a function of the message size r as

g = g(r) = \left(1 + \frac{N_{1/2}}{r}\right) g_\infty,

where g_\infty is the asymptotic communication cost for messages of large size and N_{1/2} is the size of the message that produces half the optimal bandwidth of the machine. For message sizes where r \ll N_{1/2},

g(r) \approx \frac{g_\infty N_{1/2}}{r},
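A direct transcription of this refined model (our sketch, not the paper's code; the function name is hypothetical) is given below. The numerical values in the example read the switch row of Table 2 for p = 4 as g_\infty = 46 flops/word and N_{1/2} = 1600 words.

```python
def g_of_r(r, g_inf, n_half):
    """Effective BSP gap for messages of r words in Hockney's refined model:
    g(r) = (1 + N_{1/2} / r) * g_inf."""
    return (1.0 + n_half / r) * g_inf

# Short messages pay a much larger effective gap than long ones:
print(g_of_r(10, 46, 1600))      # 7406 flops/word
print(g_of_r(100000, 46, 1600))  # ~46.7 flops/word, close to the asymptotic value
```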
TABLE 3
Theoretical and Experimental Times Measured on an IBM SP2 with Switch and Ethernet, for Two and Four Processors with 16 ≤ n ≤ 1024

                    Switch                    Ethernet
   n     p     C^{(1)}_T    C^{(1)}_E    C^{(1)}_T    C^{(1)}_E
   16    2       0.0080       0.0025       0.0060       0.0036
         4       0.0068       0.0044       0.0064       0.0065
   32    2       0.0130       0.0037       0.0115       0.0061
         4       0.0057       0.0057       0.0087       0.0086
   64    2       0.0232       0.0073       0.0243       0.0252
         4       0.0143       0.0087       0.0203       0.0169
  128    2       0.0442       0.0206       0.0624       0.0499
         4       0.0467       0.0202       0.0490       0.0448
  256    2       0.0881       0.0642       0.1886       0.1778
         4       0.0971       0.0620       0.1434       0.1462
  512    2       0.2351       0.2182       0.6394       0.5963
         4       0.2126       0.2059       0.4798       0.4114
 1024    2       0.6159       0.5716       2.3392       2.1815
         4       0.5168       0.5005       1.7457       1.4968
FIG. 5. Theoretical and experimental cost with 16 ≤ n ≤ 1024 for two processors (switch).

FIG. 6. Theoretical and experimental cost with 16 ≤ n ≤ 1024 for four processors (switch).

FIG. 7. Theoretical and experimental cost with 16 ≤ n ≤ 1024 for two processors (Ethernet).

FIG. 8. Theoretical and experimental cost with 16 ≤ n ≤ 1024 for four processors (Ethernet).
which is obtained when we neglect the quotient r/N_{1/2}. Based on this cost model, and by means of software, we have measured the BSP parameters, whose values are summarized in Table 2, where l and g are measured in flops, s in Mflops, and N_{1/2} in words of 64 bits.

In Table 3 both the experimental time measurements and the theoretical cost, measured in seconds, are presented for two and four processors, where C^{(1)}_T and C^{(1)}_E represent the theoretical and experimental costs (in seconds), respectively. The sizes of the matrices range from 16 to 1024. We distinguish between switch and Ethernet, according to the type of connection between processors.

Figures 5–8 display a comparison between the theoretical and experimental times for two and four processors of Table 3. Observe that the theoretical prediction is always an upper bound for the experimental times when using switch and Ethernet hardware for communications. When we increase the number of processors from two to four, the execution times of Algorithm 1 decrease, as we expected. The saving in time using four processors instead of two reaches 30% for the case where n = 1024 with Ethernet.
7. COMPARING WITH BONDELI'S METHOD

In this section we compare the recursive divide and conquer method presented in Section 3 with the divide and conquer method for solving tridiagonal linear systems by Bondeli [3]. The reason for comparing with this method is that, when Bondeli's method is compared with other classic methods for solving tridiagonal linear systems, such as the cyclic reduction method, Wang's partition method, or the recursive doubling method, it requires fewer operations and presents more accuracy than the classic ones (see Bondeli [3]).

Now, we briefly describe the divide and conquer method by Bondeli. Assume that we want to solve system (1) and consider A partitioned into blocks as
A = \begin{bmatrix}
A_0 & B_0 & & & \\
C_1 & A_1 & B_1 & & \\
& \ddots & \ddots & \ddots & \\
& & C_{p-2} & A_{p-2} & B_{p-2} \\
& & & C_{p-1} & A_{p-1}
\end{bmatrix},
where each of the diagonal blocks A_i, for i = 0, 1, ..., p-1, is a tridiagonal k \times k matrix and each one of the subdiagonal blocks C_{i+1} (respectively, superdiagonal blocks B_i), i = 0, 1, ..., p-2, is a k \times k matrix whose only nonzero entry is c_{(i+1)k+1} (respectively, b_{(i+1)k}) at position (1, k) (respectively, (k, 1)). Therefore, we can write such blocks as

C_{i+1} = c_{(i+1)k+1} e_1 e_k^T,   B_i = b_{(i+1)k} e_k e_1^T,   i = 0, 1, ..., p-2,   (18)

where e_1 and e_k are, respectively, the first and the last column of the k \times k identity matrix. If the vectors x and d are partitioned according to matrix A, we can rewrite system (1) as

C_i x_{i-1} + A_i x_i + B_i x_{i+1} = d_i,   i = 0, 1, ..., p-1,   (19)

where C_0 = B_{p-1} = O. Since A_i, i = 0, 1, ..., p-1, is a nonsingular matrix, from (18) and (19) we obtain that

x_i = A_i^{-1} d_i - c_{ik+1} A_i^{-1} e_1 e_k^T x_{i-1} - b_{(i+1)k} A_i^{-1} e_k e_1^T x_{i+1},   i = 0, 1, ..., p-1.   (20)

If we define

y_i = A_i^{-1} d_i,   (21)
z_{2i-1} = A_i^{-1} e_1,   z_{2i} = A_i^{-1} e_k,   (22)
\alpha_{2i-1} = -c_{ik+1} e_k^T x_{i-1},   \alpha_{2i} = -b_{(i+1)k} e_1^T x_{i+1},   (23)

for i = 0, 1, ..., p-1 and \alpha_{-1} = \alpha_{2p-2} = 0, then we can write (20) as

x_i = y_i + \alpha_{2i-1} z_{2i-1} + \alpha_{2i} z_{2i},   i = 0, 1, ..., p-1,   (24)

with z_{-1} = z_{2p-2} = 0. Analyzing the relations between the scalars \alpha_0, \alpha_1, \alpha_2, ..., \alpha_{2p-4}, \alpha_{2p-3} and the vectors z_i, for i = 0, 1, ..., 2p-4, 2p-3, by means of expressions (20) and (24), we obtain an auxiliary tridiagonal system

H\alpha = \beta   (25)
with new unknowns \alpha_i, for i = 0, 1, ..., 2p-3; coefficient matrix

H = \begin{bmatrix}
\delta_0 & \mu_0 & & & \\
\nu_1 & \delta_1 & \mu_1 & & \\
& \ddots & \ddots & \ddots & \\
& & \nu_{2p-4} & \delta_{2p-4} & \mu_{2p-4} \\
& & & \nu_{2p-3} & \delta_{2p-3}
\end{bmatrix}   (26)

with

\delta_i = e_k^T z_i,   \mu_i = \frac{1}{c_{(i/2+1)k+1}},   \nu_i = e_k^T z_{i-1},   for i = 0, 2, ..., 2p-4,
\delta_i = e_1^T z_i,   \mu_i = e_1^T z_{i+1},   \nu_i = \frac{1}{b_{((i+1)/2)k}},   for i = 1, 3, ..., 2p-3,   (27)

where \nu_0 = 0 and \mu_{2p-3} = 0, and right-hand side vector \beta = [\beta_0, \beta_1, ..., \beta_{2p-3}]^T with

\beta_i = \begin{cases} -e_k^T y_{i/2}, & i = 0, 2, ..., 2p-4, \\ -e_1^T y_{(i+1)/2}, & i = 1, 3, ..., 2p-3. \end{cases}   (28)
When the solution \alpha of system (25) is obtained, by substituting it into (24) we get the solution x of system (1).

The most expensive computational part of the method is the solution of the initial subsystems given by (21) and (22), which allows us to get the vectors y_i and z_i. This part may be carried out in parallel in such a way that processor P_0 solves the systems

A_0 [y_0, z_0] = [d_0, e_k],   (29)

processor P_i, for i = 1, 2, ..., p-2, solves the systems

A_i [y_i, z_{2i-1}, z_{2i}] = [d_i, e_1, e_k],   (30)

and processor P_{p-1} solves the systems

A_{p-1} [y_{p-1}, z_{2p-3}] = [d_{p-1}, e_1].   (31)

The following BSP algorithm summarizes the above characteristics.

Algorithm 4. A BSP divide and conquer method for tridiagonal systems.

Superstep 1. Send data to the processors. For i = 1, 2, ..., p-1, processor P_i receives the blocks A_i and d_i from the main processor.

Superstep 2. Resolution of the intermediate systems and communication of the auxiliary variables to the main processor.
In processor P_0, system (29) is solved. In processor P_i, for i = 1, 2, ..., p-2, system (30) is solved. In processor P_{p-1}, system (31) is solved. Processor P_i, for i = 1, 2, ..., p-1, sends the vectors y_i, z_{2i-1}, z_{2i} to processor P_0.

Superstep 3. Get the final solution. In the main processor,

• H and \beta are obtained by expressions (26)–(28);
• system (25) is solved;
• the solution x is computed using (24).
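For reference, the following sequential Python sketch (ours, not the authors' BSP code; all function names are hypothetical) runs through the steps of Algorithm 4 on one processor: it solves the local systems (29)-(31) with a tridiagonal LU solve, assembles the auxiliary system (25) from (26)-(28), and recombines the solution with (24). It assumes n is divisible by p and that the local solves need no pivoting, as is the case for diagonally dominant matrices.

```python
import numpy as np

def thomas(a, b, c, rhs):
    """Tridiagonal LU solve without pivoting. a: diagonal, b[i]: entry (i, i+1),
    c[i]: entry (i, i-1); rhs may have several columns."""
    n = len(a)
    d = np.array(a, dtype=float)
    x = np.array(rhs, dtype=float).reshape(n, -1)
    for i in range(1, n):
        w = c[i] / d[i - 1]
        d[i] -= w * b[i - 1]
        x[i] -= w * x[i - 1]
    x[n - 1] /= d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (x[i] - b[i] * x[i + 1]) / d[i]
    return x

def bondeli_solve(a, b, c, rhs, p):
    """Sequential sketch of the partition method of Section 7 for A x = rhs,
    with p diagonal blocks of size k = n // p."""
    n = len(a); k = n // p
    y = np.zeros((p, k)); z = np.zeros((2 * p - 2, k))
    e1, ek = np.eye(k)[:, 0], np.eye(k)[:, -1]
    for i in range(p):                       # local solves, cf. (29)-(31)
        lo, hi = i * k, (i + 1) * k
        cols = [rhs[lo:hi]] + ([e1] if i > 0 else []) + ([ek] if i < p - 1 else [])
        sol = thomas(a[lo:hi], b[lo:hi], c[lo:hi], np.column_stack(cols))
        y[i] = sol[:, 0]
        if i > 0:
            z[2 * i - 1] = sol[:, 1]
        if i < p - 1:
            z[2 * i] = sol[:, -1]
    # Assemble the (2p-2) x (2p-2) tridiagonal system H alpha = beta, cf. (26)-(28).
    m = 2 * p - 2
    delta, mu, nu, beta = (np.zeros(m) for _ in range(4))
    for j in range(m):
        if j % 2 == 0:                       # even rows of (27) and (28)
            delta[j] = z[j][-1]
            mu[j] = 1.0 / c[(j // 2 + 1) * k]
            nu[j] = z[j - 1][-1] if j > 0 else 0.0
            beta[j] = -y[j // 2][-1]
        else:                                # odd rows of (27) and (28)
            delta[j] = z[j][0]
            mu[j] = z[j + 1][0] if j + 1 < m else 0.0
            nu[j] = 1.0 / b[((j + 1) // 2) * k - 1]
            beta[j] = -y[(j + 1) // 2][0]
    H = np.diag(delta) + np.diag(mu[:-1], 1) + np.diag(nu[1:], -1)
    alpha = np.linalg.solve(H, beta)
    # Recombine, cf. (24): x_i = y_i + alpha_{2i-1} z_{2i-1} + alpha_{2i} z_{2i}.
    x = y.copy()
    for i in range(p):
        if i > 0:
            x[i] += alpha[2 * i - 1] * z[2 * i - 1]
        if i < p - 1:
            x[i] += alpha[2 * i] * z[2 * i]
    return x.reshape(n)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 32, 4
    b = rng.uniform(0.5, 1.0, n); b[-1] = 0.0
    c = rng.uniform(0.5, 1.0, n); c[0] = 0.0
    a = b + c + 2.0                          # diagonally dominant test matrix
    A = np.diag(a) + np.diag(b[:-1], 1) + np.diag(c[1:], -1)
    d = rng.standard_normal(n)
    print(np.allclose(bondeli_solve(a, b, c, d, p), np.linalg.solve(A, d)))
```

The auxiliary system H is only of order 2p - 2, so its solve (here a dense solve for brevity) is negligible next to the local tridiagonal solves, which is what makes the partitioning attractive.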
Now, since computing the inverse of A is equivalent to solving the n tridiagonal linear systems

Ax_r = e_r,   for r = 1, 2, ..., n,   (32)

where e_r is the rth column of the n \times n identity matrix, we need to perform the suitable modifications of Algorithm 4 to solve (32). These modifications may be summarized in the following items:

• In Superstep 1, it is not necessary to transmit the vector d from the main processor to the remaining ones, since the right-hand side vectors are the columns e_r of the n \times n identity matrix.

• In Superstep 2, we have to solve in each of the processors P_i, for i = 0, 1, ..., p-1, the k systems

A_i y_{ik+j} = e_{ik+j},   for j = 1, 2, ..., k,   (33)

where e_{ik+j} represents the vector formed by the components ik+1, ik+2, ..., (i+1)k of the (ik+j)th column of the n \times n identity matrix. Note that we only solve k tridiagonal k \times k systems in each processor. Note also that the vectors e_1 and e_k of Bondeli's method are now the vectors e_{ik+1} and e_{(i+1)k}, respectively, so the systems A_i [z_{2i-1}, z_{2i}] = [e_{ik+1}, e_{(i+1)k}] are also solved when solving (33).

• In the communication process of Superstep 2, each processor P_i, for i = 1, 2, ..., p-1, sends k vectors y_{ik+j} of k components to the main processor.

• In Superstep 3, we have to solve the n auxiliary systems H\alpha_r = \beta_r, for r = 1, 2, ..., n. These n systems are solved using the LU factorization of matrix H. From the \alpha_r we obtain the n vectors of the solution of (32).

We can summarize the above characteristics in the following algorithm.

Algorithm 5. A BSP divide and conquer method for computing the inverse of a matrix A.

Superstep 1. Send data to the processors. For i = 1, 2, ..., p-1, processor P_i receives the block A_i from the main processor.
Superstep 2. Resolution of the intermediate systems and communication of the auxiliary variables to the main processor. Processor P_i, for i = 0, 1, ..., p-1,

• solves the systems (33);
• sends the vectors y_{ik+j}, for j = 1, 2, ..., k, to the main processor.

Superstep 3. Get the final solution. In the main processor,

• H and \beta_r, for r = 1, 2, ..., n, are obtained;
• the systems H\alpha_r = \beta_r, for r = 1, 2, ..., n, are solved;
• the solutions x_r, for r = 1, 2, ..., n, are computed.

The computational cost of Algorithm 5 is given by adding the costs of its three supersteps, according to the cost model presented in Section 2. To obtain the arithmetic cost of Algorithm 5, we must note that the LU factorization of a tridiagonal k \times k matrix costs 3k-3 flops and that solving the system A_i y_{ik+j} = e_{ik+j}, for j = 1, 2, ..., k, requires 4k - 2(j+1) flops. Therefore, the solution of the system A_i y_{ik+1} = e_{ik+1} costs 4k-4 flops, while the solution of the system A_i y_{(i+1)k} = e_{(i+1)k} only costs 2k-2 flops. The detailed cost of every superstep of Algorithm 5 is shown in Table 4.

TABLE 4
Computational Cost of Algorithm 5

                  Arithmetic cost            Communication cost    Total
 Superstep 1            —                       (3n-3k) g          (3n-3k) g + l
 Superstep 2         3k^2 - 3                      kng             3k^2 - 3 + kng + l
 Superstep 3   4n^2 + 5kn - 4n + 6p - 9             —              4n^2 + 5kn - 4n + 6p - 9 + l

Adding the costs of the three supersteps in Table 4, we have that the computational cost of Algorithm 5 is

4n^2 + 3k^2 + 5kn - 4n + 6p - 12 + (nk + 3n - 3k) g + 3l   flops.   (34)

Table 5 summarizes the comparison of theoretical times between the recursive divide and conquer method and Bondeli's method, using the cost functions obtained in expressions (14) and (34). The machine is an IBM SP2 and the BSP parameters are those shown in Table 2. From Table 5, it easily follows that the recursive divide and conquer method is always faster than Bondeli's method. The greatest differences appear when p = 2; with this number of processors, the time needed to execute Algorithm 1 is half the time needed to execute Algorithm 5, for any matrix size. Nevertheless, it is also observed that the behavior of Algorithm 5 improves more than that of Algorithm 1 when the number of processors increases: the results of Algorithm 5 for p = 4 are about a factor of two smaller than the results obtained for p = 2, while for Algorithm 1 the decrease in execution times is not so notable.
TABLE 5
Comparison of Theoretical Times Measured on an IBM SP2 with Switch and Ethernet, for Two and Four Processors with 2^9 ≤ n ≤ 2^20

                          Switch                            Ethernet
   n      p     Algorithm 1    Algorithm 5      Algorithm 1      Algorithm 5
  2^9     2        0.3968         0.7780           0.7407           1.4698
          4        0.3731         0.4834           0.7043           0.9326
  2^10    2        0.9345         1.8471           2.5405           5.0590
          4        0.8340         1.0974           2.1882           2.9086
  2^11    2        2.4466         4.8502           9.3295          18.6075
          4        2.0487         2.7164           7.5263          10.0217
  2^12    2        7.2186        14.3169          35.6652          71.1849
          4        5.6490         7.5184          27.6834          36.8919
  2^13    2       23.7530        47.0914         139.3673         278.2613
          4       17.5332        23.3791         105.9211         141.2085
  2^14    2       84.7839       168.0052         550.8942        1100.1004
          4       60.0362        80.1265         414.0905         552.1471
  2^15    2      318.6937       631.2921        2190.4393        4374.5240
          4      219.9803       293.7254        1637.2052        2183.2453
  2^16    2     1233.9057      2443.7030        8735.4941       17446.3527
          4      839.6216      1121.3393        6510.5387        8682.3259
  2^17    2     4853.8996      9611.8726       34889.4626       69681.9366
          4     3277.9158      4378.2320       25965.6212       34628.0239
  2^18    2    19252.1661     38121.6040      139452.8351      278520.8096
          4    12950.5510     17298.6770      103709.4488      138309.5673
  2^19    2    76681.8143    151834.6352      557601.3219     1113669.3768
          4    51480.0086     68766.2051      414531.7546      552833.2433
  2^20    2   306073.5721    606034.9706     2229985.2622     4453849.7958
          4   205275.6729    274207.8142     1657514.9683     2210522.9522
8. CONCLUSIONS

A recursive divide and conquer method to compute the inverse of a tridiagonal matrix has been studied, based on the recursive application of the Sherman–Morrison formula. Two implementations have been developed for this method, one of them based on a fan-in algorithm. The theoretical cost has been computed for both implementations, showing that the one based on a fan-in algorithm is always slower than the other one. We have obtained numerical results for the faster one, for two and four processors on an IBM SP2, verifying that the cost model provides us with a good tool to predict the numerical results. Finally, when comparing this method with the divide and conquer method for tridiagonal systems by Bondeli, we observe that the recursive divide and conquer method is faster than the other one.
REFERENCES

1. M. Altieri, ``Robuste und effiziente Mehrgitter-Verfahren auf verallgemeinerten Tensorprodukt-Gittern,'' Diploma Thesis, to be published.
2. R. H. Bisseling and W. F. McColl, ``Scientific Computing on Bulk Synchronous Parallel Architectures,'' Report, University of Utrecht, 1994. [A short version appears in B. Pehrson and I. Simon, Eds., ``Proceedings of the 13th IFIP World Computer Congress,'' Vol. I, pp. 509–514, Elsevier, Amsterdam, 1994.]
3. S. Bondeli, Divide and conquer: A parallel algorithm for the solution of a tridiagonal linear system of equations, Parallel Comput. 17 (1991), 419–434.
4. D. J. Evans, A recursive decoupling method for solving tridiagonal linear systems, Internat. J. Comput. Math. 33 (1990), 95–102.
5. J. M. Hill, P. I. Crumpton, and D. A. Burgess, Theory, practice, and a tool for BSP performance prediction, in ``Euro-Par'96,'' Lecture Notes in Computer Science, Vol. 1124, pp. 697–705, Springer-Verlag, Berlin/New York, August 1996.
6. R. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J. Assoc. Comput. Mach. 12 (1965), 95–113.
7. R. Hockney, Performance parameters and benchmarking of supercomputers, Parallel Comput. 17 (1991), 1111–1130.
8. Y. Huang and W. F. McColl, A two-way BSP algorithm for tridiagonal systems, Future Gener. Comput. Systems 13 (1998), 337–347.
9. J. P. Keener and K. Bogar, A numerical method for the solution of the bidomain equations in cardiac tissue, Chaos 8 (1998), 234–241.
10. J. W. Lewis, Inversion of tridiagonal matrices, Numer. Math. 38 (1982), 333–345.
11. W. F. McColl, BSP programming, in ``Proceedings of the DIMACS Workshop'' (G. E. Blelloch, K. M. Chandy, and S. Jagannathan, Eds.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 18, pp. 21–35, American Mathematical Society, Providence, 1994.
12. W. F. McColl, Scalable computing, in ``Computer Science Today: Recent Trends and Developments'' (J. van Leeuwen, Ed.), Lecture Notes in Computer Science, Vol. 1000, pp. 46–61, Springer-Verlag, Berlin/New York, 1995.
13. V. Mehrmann, Divide and conquer methods for tridiagonal linear systems, in ``Notes on Numerical Fluid Mechanics (Parallel Algorithms for Partial Differential Equations)'' (W. Hackbusch, Ed.), Proc. of the Sixth GAMM-Seminar, 1991.
14. G. Meurant, A review on the inverse of symmetric tridiagonal and block tridiagonal matrices, SIAM J. Matrix Anal. Appl. 13 (1992), 707–728.
15. R. Miller, A library for bulk-synchronous parallel programming, in ``Proceedings of the British Computer Society Parallel Processing Specialist Group Workshop on General Purpose Parallel Computing,'' December 1993.
16. R. Miller and J. L. Reed, ``The Oxford BSP Library User's Guide,'' Technical report, Programming Research Group, University of Oxford, 1993.
17. G. Spaletta and D. J. Evans, The parallel recursive decoupling algorithm for solving tridiagonal linear systems, Parallel Comput. 19 (1993), 563–576.
18. J. Stoer and R. Bulirsch, ``Introduction to Numerical Analysis,'' Springer-Verlag, New York, 1992.
19. H. Stone, An efficient parallel algorithm for the solution of a tridiagonal linear system of equations, J. Assoc. Comput. Mach. 20 (1973), 27–38.
20. R. A. Usmani, Inversion of Jacobi's tridiagonal matrix, Comput. Math. Appl. 27 (1994), 59–66.
21. L. G. Valiant, A bridging model for parallel computation, Commun. Assoc. Comput. Mach. 33 (1990), 103–111.