
Chaos, Solitons and Fractals 31 (2007) 580–585. doi:10.1016/j.chaos.2005.10.033

A parallel implementation of Wang's method for solving tridiagonal systems

Salman H. Abbas

Department of Mathematics, College of Science, University of Bahrain, P.O. Box 32038, Bahrain

Accepted 4 October 2005

Communicated by Prof. Gerardo Iovane

Abstract

In this paper, a parallel implementation of Wang's method for solving tridiagonal systems of equations on a multiprocessor machine using the occam language is presented. The parallel algorithm has been designed for shared and distributed memory machines that support data parallelism and message passing. The overall performance of this implementation on up to 9 processors is given. The communication times are very important, and any improvement in communication would have a significant effect on the performance of the implementation. The significance of these results is discussed.
© 2005 Elsevier Ltd. All rights reserved.

1. Introduction

Tridiagonal systems form a very important class of linear algebraic equations and occur repeatedly in finite difference approximations to second-order differential equations, such as that of simple harmonic motion. Consequently, the solution of tridiagonal linear systems is at the core of many programs for scientific computation. It is normal practice to solve these equations using a recursive algorithm based on Gaussian elimination or using LU-factorization. Wang [1] described a parallel method for solving tridiagonal linear systems using a partition method in which Gaussian elimination is applied to each subsystem within a block, which creates fill-ins during the elimination. The method had a scalar operation count close to, but still larger than, that of cyclic reduction. Wang [2] proposed a modification of his original partition method which resulted in an algorithm with a scalar count the same as that of cyclic reduction. In this paper, Wang's [2] parallel algorithm for solving tridiagonal systems is implemented in occam. A comprehensive list of references is available in the recent literature (see, for example, [3–25]).

The material of this paper is arranged as follows: in Section 2, the algorithm for the linear system of equations is described. The implementation of this algorithm is presented in Section 3. In Section 4, we present some computational results; the theoretical timings are given in Section 5. Finally, we give our conclusion in Section 6.
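The recursive sequential algorithm referred to above is the classical tridiagonal Gaussian elimination (the Thomas algorithm). As a point of reference, a minimal sketch in C follows; this is my illustration, not code from the paper. The right-hand side is named d here to avoid the paper's reuse of b, and diagonal dominance is assumed so that no pivoting is performed.

```c
/* Thomas algorithm: solves a tridiagonal system with sub-diagonal c[1..n-1],
   diagonal a[0..n-1], super-diagonal b[0..n-2] and right-hand side d[0..n-1].
   The arrays a and d are overwritten; the solution is returned in x.
   Assumes the matrix is diagonally dominant, so no pivoting is needed. */
static void thomas(int n, const double *c, double *a, const double *b,
                   double *d, double *x)
{
    /* Forward elimination: remove the sub-diagonal. */
    for (int i = 1; i < n; i++) {
        double m = c[i] / a[i - 1];
        a[i] -= m * b[i - 1];
        d[i] -= m * d[i - 1];
    }
    /* Back substitution. */
    x[n - 1] = d[n - 1] / a[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = (d[i] - b[i] * x[i + 1]) / a[i];
}
```

This recursion is inherently sequential (each step depends on the previous one), which is exactly what Wang's partition method is designed to break up.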



2. The algorithm

In this section the following problem is considered. Let

$$A = \begin{pmatrix} a_1 & b_1 & & & \\ c_2 & a_2 & b_2 & & \\ & \ddots & \ddots & \ddots & \\ & & c_{n-1} & a_{n-1} & b_{n-1} \\ & & & c_n & a_n \end{pmatrix} \in \mathbb{R}^{n \times n}$$

be a given tridiagonal matrix of order $n$ with the property that $A^{\mathrm{T}}$ is diagonally dominant,

$$|a_i| > |c_{i+1}| + |b_{i-1}|, \quad i = 1, 2, \ldots, n$$

(with the usual convention $c_{n+1} = b_0 = 0$).
The linear system of equations Ax = b for a given b ∈ R^n is considered, where x and b are vectors of size n × 1. Parallelism is achieved by slicing the matrix into p (the number of processors) blocks, each of fairly equal size, so that each processor deals with finding only its slice of the complete solution. This can be done largely independently by each processor, but at some points communication is required between adjacent processors. Hence, an array of transputers would seem to be an ideal environment for an implementation of this algorithm. The problem is divided into three stages:

1. Distribution of the data amongst the slave (worker) processors, which is done by the master processor.
2. Wang's algorithm to compute the solution.
3. Collection of the results by the master processor.

Once the data is on the slave processors, the algorithm can be split into five parts (a sketch of the first two parts is given after this list):

1. Elimination in the lower diagonal of the $c_i$'s, which creates fill-ins denoted by $f_i$'s.
2. Elimination in the upper diagonal of the $b_i$'s, which creates fill-ins denoted by $g_i$'s.
3. Elimination of the $f_i$'s.
4. Elimination of the $g_i$'s.
5. Calculation of the $x_i$'s from the $a_i$'s and the $b_i$'s.
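As promised above, here is a minimal C sketch (my illustration, not the paper's occam) of the two local phases on one interior block occupying rows s to e of the global system. The inter-block phases 3 and 4, which need data from neighbouring processors, and the final division of phase 5 are omitted; the right-hand side is again named d.

```c
/* Phases 1 and 2 of Wang's method on one interior block (rows s..e, s > 0).
   Inputs: sub-diagonal c[], diagonal a[], super-diagonal b[], rhs d[].
   Outputs: fill-in columns f[] (entries in column s-1) and g[] (entries in
   column e+1); after both phases row i holds only f[i], a[i], g[i], d[i]. */
static void wang_local_phases(int s, int e, const double *c, double *a,
                              double *b, double *d, double *f, double *g)
{
    /* Phase 1: eliminate the sub-diagonal going down the block.
       Row s cannot eliminate c[s] locally (its column belongs to the
       previous block), so it becomes the first fill-in f[s]. */
    f[s] = c[s];
    for (int i = s + 1; i <= e; i++) {
        double m = c[i] / a[i - 1];   /* c[i] is now conceptually zero */
        f[i] = -m * f[i - 1];         /* new fill-in in column s-1 */
        a[i] -= m * b[i - 1];
        d[i] -= m * d[i - 1];
    }
    /* Phase 2: eliminate the super-diagonal going up the block.
       Row e keeps its coupling to column e+1 as the fill-in g[e]. */
    g[e] = b[e];
    for (int i = e - 1; i >= s; i--) {
        double m = b[i] / a[i + 1];
        g[i] = -m * g[i + 1];         /* new fill-in in column e+1 */
        f[i] -= m * f[i + 1];
        d[i] -= m * d[i + 1];
        b[i] = 0.0;
    }
}
```

After these two phases each row of the block couples only to its own diagonal and to the two boundary columns s − 1 and e + 1, which is why phases 3 and 4 need only a single row from a neighbouring block; this is consistent with the small communication terms, such as the 3t_c of Eq. (4), in Section 5.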

3. Implementation

The program, which runs under TDS on a T8 hosted by a PC, was written in stages: first a sequential version, followed by a pseudo-parallel version, then the full parallel version. The largest problem the program can handle is n = 35,000; this is due to the memory restrictions of the system being used, which can run up to 9 slave processors simultaneously. These limits can easily be changed using the VAL statements at the start of both the master and the slave programs, depending on whether more slaves or memory are available. The topology used is that of a linear chain.

The program assumes that the data is in the master in the form of four arrays of size n + 1 (a, b, c, b), where a(1) is equivalent to a_1 in the notation used earlier; it is to be noted that occam arrays start with subscript 0. The program then asks how many slave processors (p) are to be used, up to the maximum limit given in the system, and then calls a procedure with a, b, c, and b as parameters. This procedure distributes the data amongst the required number of slave processors, then waits for these processors to calculate and return the result, in the original form of an array, to the master. The slave code must be loaded onto the slave processors before the master is run. After the slave processors have returned the result, they wait for the next problem, as repeated runs of the master are possible. The code is shared by all the slave processors in the system; however, if desired, only some of these may be used on a particular problem. The only restrictions on n and p are that n/p ≥ 2 and n ≥ 3.
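The paper's occam source is not reproduced here, but the master/slave structure it describes maps naturally onto any message-passing layer. Below is a hedged sketch in C with MPI of stages 1 and 3; the use of MPI rather than occam/TDS, and the function names, are my assumptions, and the stage-2 solver wang_solve_slice() is a hypothetical helper that is not shown.

```c
#include <mpi.h>
#include <stdlib.h>

/* Stage 1 and stage 3 of the three-stage structure described above:
   the master (rank 0) scatters near-equal slices of the n entries of
   a, b, c, d to the workers and gathers the solution slices back.
   Assumes MPI_Init() has already been called by the caller. */
void distribute_solve_collect(int n, double *a, double *b, double *c,
                              double *d, double *x)
{
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Slice sizes: the first (n % p) slices get one extra row, so the
       sizes are "fairly equal" as in the paper; requires n/p >= 2. */
    int *counts = malloc(p * sizeof *counts);
    int *displs = malloc(p * sizeof *displs);
    for (int k = 0, off = 0; k < p; k++) {
        counts[k] = n / p + (k < n % p ? 1 : 0);
        displs[k] = off;
        off += counts[k];
    }

    int m = counts[rank];
    double *la = malloc(m * sizeof *la), *lb = malloc(m * sizeof *lb);
    double *lc = malloc(m * sizeof *lc), *ld = malloc(m * sizeof *ld);

    /* Stage 1: distribute the four arrays. */
    MPI_Scatterv(a, counts, displs, MPI_DOUBLE, la, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatterv(b, counts, displs, MPI_DOUBLE, lb, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatterv(c, counts, displs, MPI_DOUBLE, lc, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatterv(d, counts, displs, MPI_DOUBLE, ld, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Stage 2: Wang's algorithm on the local slice (hypothetical helper). */
    /* wang_solve_slice(m, lc, la, lb, ld); */

    /* Stage 3: gather the local solutions (here the transformed ld). */
    MPI_Gatherv(ld, m, MPI_DOUBLE, x, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(la); free(lb); free(lc); free(ld); free(counts); free(displs);
}
```

On the transputer chain of the paper, the scatter and gather are of course realised as store-and-forward passes along the links, which is precisely what makes stages 1 and 3 so expensive in Section 4.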


4. Computational results

In this section, the actual and theoretical timings and the speed-ups for n = 35,000 are illustrated by assuming that a_i = i, b_i = 1, x_i = i, the c_i are random real numbers between 0 and 100, and the right-hand side b is given by the product Ax, i = 1, ..., 35,000. Table 1 gives the timings (in seconds) for the three stages of the problem.

In some cases, a solution to a tridiagonal system would be required when the data is already distributed to the slave processors. In such a case, the data distribution would not be a problem. However, the result collection needs some attention, as it is probable that the complete solution might need to be gathered in one place. It is shown in Table 3 that for stage 2 alone the speed-ups are fairly good and give an efficiency of approximately 80%. The time taken to distribute the data and collect the results increases as the number of slave processors involved increases. This is due to the fact that parallel input and output on the links do not run at full speed when the C004 switches are being used. The data distribution and result collection dominate the calculation time, as shown in Table 2. Further, the actual calculation time as a percentage of total time decreases quite considerably as the number of slave processors increases. The main problem, as Table 2 indicates for the linear chain, is that the time for distribution of the data and collection of the result is too large compared to the time for the implementation of the algorithm, which restricts the whole program to a maximum speed-up of approximately 2. The speed-ups for the entire program and also for the second computational stage alone are shown in Table 3.
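A minimal sketch of the test problem described above (my code, not the paper's; the random c_i use C's rand(), and the right-hand side is formed as d = Ax so that the exact solution x_i = i is known):

```c
#include <stdlib.h>

/* Build the n x n test system of Section 4: a_i = i, b_i = 1,
   c_i random in [0, 100], exact solution x_i = i, and d = A x.
   Arrays are 1-based as in the paper (index 0 unused); c[1] and b[n]
   lie outside the matrix and are simply ignored when forming d. */
static void build_test_system(int n, double *a, double *b, double *c,
                              double *x, double *d)
{
    for (int i = 1; i <= n; i++) {
        a[i] = (double)i;                        /* diagonal        */
        b[i] = 1.0;                              /* super-diagonal  */
        c[i] = 100.0 * rand() / RAND_MAX;        /* sub-diagonal    */
        x[i] = (double)i;                        /* exact solution  */
    }
    for (int i = 1; i <= n; i++) {
        d[i] = a[i] * x[i];
        if (i > 1) d[i] += c[i] * x[i - 1];
        if (i < n) d[i] += b[i] * x[i + 1];
    }
}
```

Generating d from a known x makes the accuracy of the computed solution trivial to check after stage 3.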

5. Theoretical timings

We give the theoretical timings of the above algorithm in terms of the arithmetic operation time $t_f$, the data communication time $t_c$, and the degradation factor $D_c$ due to parallel communication problems. The total program running time is $T = T(n, p, t_f, t_c) = T_1 + T_2 + T_3$, where $T_i$, $i = 1, 2, 3$, is the time required to complete stage $i$ of the program. The $T_i$ are evaluated as follows. $T_1$ is the time required to pass all the data onto the slave processors. Therefore,

$$T_1 = \begin{cases} 4nt_c, & p = 1 \\ 4nt_c \left[ \dfrac{1 + (p-1)D_c}{p} \right], & p \geq 2 \end{cases} \tag{1}$$

Table 1
Timings (in seconds) for distribution of the data, calculation of the solution and collection of the result

No. of slaves   Distribution of the data   Calculation of the solution   Collection of the result   Total
1               0.616                      1.254                         0.154                      2.024
2               0.700                      0.785                         0.175                      1.660
3               0.728                      0.523                         0.182                      1.433
4               0.742                      0.392                         0.186                      1.320
5               0.751                      0.314                         0.188                      1.253
6               0.757                      0.262                         0.189                      1.208
7               0.761                      0.224                         0.190                      1.175
8               0.764                      0.197                         0.191                      1.152
9               0.766                      0.175                         0.191                      1.132

Table 2
Percentage of program time spent on stages 1, 2 and 3

No. of slave processors   Stage 2 (%)   Stages 1 and 3 (%)
1                         62            38
3                         36            64
6                         22            78
9                         15            85


Table 3
Speed-ups for the whole program and for stage 2 alone

No. of slave processors   Whole program   Stage 2 only
1                         1               1
2                         1.22            1.60
3                         1.41            2.40
4                         1.53            3.20
5                         1.62            3.99
6                         1.68            4.79
7                         1.72            5.58
8                         1.76            6.38
9                         1.79            7.17

Evaluation of $T_2$ is as follows: $T_2$ is the time required to complete the algorithm, $T_2 = s_1 + s_2 + s_3 + s_4 + s_5$, where $s_i$, $i = 1, \ldots, 5$, is the time required to complete phase $i$ of the algorithm. The time given for phase $i$ is recorded on the processor that finishes phase $i$ last.

Phase 1: This time is recorded on the first processor in the chain, as it gets its data last and thus begins last. Even though it has less to do than the others, the time it gains is not enough to compensate.

$$s_1 = \text{time required for phase one} = \begin{cases} 5(n-1)t_f, & p = 1 \\ 5\left(\dfrac{n}{p} - 1\right)t_f, & p \geq 2 \end{cases} \tag{2}$$

Phase 2: As communication takes place, the last processor is the last to finish, even though it was the first to begin.

$$s_2 = \text{time required for phase two} = \begin{cases} 5(n-2)t_f, & p = 1 \\ 5\left(\dfrac{n}{p} - 2\right)t_f + (p-1)(4t_c + 7t_f), & p \geq 2 \end{cases} \tag{3}$$

Phase 3: Here processor $p$ (the last one) is the last to finish.

$$s_3 = \text{time required for phase three} = \begin{cases} 0, & p = 1 \\ (3t_c + 5t_f) + 5\left(\dfrac{n}{p} - 1\right)t_f, & p \geq 2 \end{cases} \tag{4}$$

Phase 4: Here processor 1 is the last to finish.

$$s_4 = \text{time required for phase four} = \begin{cases} 3(n-1)t_f, & p = 1 \\ (p-1)(2t_c + 3t_f) + 3\left(\dfrac{n}{p} - 1\right)t_f, & p \geq 2 \end{cases} \tag{5}$$

Phase 5: As every processor does the same number of operations in this phase, processor 1 is again the last to finish.

$$s_5 = \text{time required for phase five} = \begin{cases} nt_f, & p = 1 \\ \dfrac{n}{p}\, t_f, & p \geq 2 \end{cases} \tag{6}$$

Therefore, $T_2 = s_1 + s_2 + s_3 + s_4 + s_5$, i.e.

$$T_2 = \begin{cases} (14n - 21)t_f, & p = 1 \\ \left(19\dfrac{n}{p} + 10p - 28\right)t_f + (6p - 3)t_c, & p \geq 2 \end{cases} \tag{7}$$


Evaluation of $T_3$: $T_3$ is the time required to return the result to the master. Therefore,

$$T_3 = \begin{cases} nt_c, & p = 1 \\ nt_c \left[ \dfrac{1 + (p-1)D_c}{p} \right], & p \geq 2 \end{cases} \tag{8}$$

Therefore, the total running time can be written as

$$T = \begin{cases} (14n - 21)t_f + 5nt_c, & p = 1 \\ \left(19\dfrac{n}{p} + 10p - 28\right)t_f + \left[(6p - 3) + 5n\,\dfrac{1 + (p-1)D_c}{p}\right]t_c, & p \geq 2 \end{cases} \tag{9}$$

For the data used in the illustration:

$$n = 35{,}000, \qquad p = 1, 2, \ldots, 9, \qquad t_c = 4.4\ \mu\mathrm{s}, \qquad t_f = 2.4\ \mu\mathrm{s}, \qquad D_c = 1.27.$$

These are the standard values for single-precision real numbers with the links operating at 10 Mbits/s [13].
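As a check, Eq. (9) with these values can be evaluated directly; a short sketch (my code, computing the model's predicted total time and speed-up for each p) is:

```c
#include <stdio.h>

/* Predicted total running time T(n, p) in microseconds, from Eq. (9). */
static double total_time(double n, int p, double tf, double tc, double Dc)
{
    if (p == 1)
        return (14.0 * n - 21.0) * tf + 5.0 * n * tc;
    return (19.0 * n / p + 10.0 * p - 28.0) * tf
         + ((6.0 * p - 3.0) + 5.0 * n * (1.0 + (p - 1) * Dc) / p) * tc;
}

int main(void)
{
    const double n = 35000.0, tf = 2.4, tc = 4.4, Dc = 1.27;
    double t1 = total_time(n, 1, tf, tc, Dc);
    for (int p = 1; p <= 9; p++) {
        double t = total_time(n, p, tf, tc, Dc);
        printf("p = %d: T = %.3f s, speed-up = %.2f\n",
               p, t * 1e-6, t1 / t);
    }
    return 0;
}
```

For p = 1 this gives about 1.95 s and for p = 9 about 1.13 s, in close agreement with the measured totals of 2.024 s and 1.132 s in Table 1, which also suggests that D_c is a dimensionless factor.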

6. Conclusion

From the available literature, it is known that the problem with achieving good results is the bottleneck of getting the data off the master. It can be presumed that there is no way to overcome this within the present framework. However, there are ways to recover at least some of this communication time by performing calculation alongside it. One possibility is to put two processes, instead of only one, on each slave processor, with soft channels connecting them wherever necessary, so that some work towards the solution is done while the communication proceeds. This can be done in two known ways: (1) assign two consecutive processes to each processor, or (2) assign adjacent processes to adjacent processors, looping round at the end of the chain. A form of the first of these methods has been implemented, with the result that there is a small improvement when the number of slave processors is small, but a loss in overall performance when the number of slave processors is increased beyond a critical value of around 3. The second approach is hence more promising. In the present study, another possibility for saving excessive communication time is explored: distributing the data unequally amongst the slave processors. Since the last processor starts first, while the first processor takes longer to get started, one can give the last processor more work to absorb the excessive communication time. It is evident that the communication times are very important in this problem, and any improvement in communication times would have a significant effect on the overall performance of the program. For example, if links operating at 20 Mbits/s were used, the overall effect would be very significant: in the illustration, the speed-up of the 9-slave version would increase from 1.79 to 3.85.
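The two-process idea described above is easy to picture outside occam as well. The sketch below is my illustration using POSIX threads rather than occam's PAR and soft channels, with usleep() standing in for real link transfers and elimination work; it only demonstrates that the two activities overlap rather than run back to back.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One slave running a communication process and a worker process
   concurrently, so useful computation proceeds while data for the
   next step is still in transit along the chain. */
static void *comms_process(void *arg)
{
    (void)arg;
    usleep(100 * 1000);        /* pretend to forward a slice along the chain */
    puts("comms: slice forwarded");
    return NULL;
}

static void *worker_process(void *arg)
{
    (void)arg;
    usleep(100 * 1000);        /* pretend to run a local elimination phase */
    puts("worker: local phase done");
    return NULL;
}

int main(void)
{
    pthread_t comms, worker;
    pthread_create(&comms, NULL, comms_process, NULL);
    pthread_create(&worker, NULL, worker_process, NULL);
    pthread_join(comms, NULL); /* both finish in ~0.1 s total, not ~0.2 s */
    pthread_join(worker, NULL);
    return 0;
}
```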

References

[1] Wang H. A parallel method for tridiagonal equations. ACM Trans Math Software 1981;7:170–83.
[2] Wang H. The partition method for solving tridiagonal equations on multiprocessor computers. Document No. G 320-3499, IBM Palo Alto Scientific Center, Palo Alto, CA 94304, 1987.
[3] Abbas SH, Delves LM. Parallel solution of linear systems of equations. Report CSMR, University of Liverpool, UK, 1989.
[4] Abbas SH. Parallel solution of linear systems of equations using a column oriented approach. Chaos, Solitons & Fractals, in press, doi:10.1016/j.chaos.2005.08.190.
[5] Abbas SH. Parallel solution of dense linear equations. Analele Universitatii din Timisoara, Seria Math Inform 2001;XXXIX(1):3–12.


[6] Abbas SH. On the cost of sequential and parallel algorithm for solving linear system of equation. Int J Comput Math 2000;74:391–403.
[7] Abbas SH. Complexity of parallel block Gauss–Jordan algorithm. Int J Comput Numer Anal Appl 2003;4(2):157–66.
[8] Abbas SH. Parallel algorithms of linear systems and initial value problems. PhD Thesis, University of Liverpool, 1990.
[9] Adams MF. A distributed memory unstructured Gauss–Seidel algorithm for multigrid smoothers. Technical Report, University of California, Berkeley, 2001.
[10] Adams MF. A parallel maximal independent set algorithm. In: Proceedings of the 5th Copper Mountain conference on iterative methods, 1998.
[11] Bulgakov VE, Kuhn G. High-performance multilevel iterative aggregation solver for large finite-element structural analysis problems. Int J Numer Meth Eng 1995;38:3529–44.
[12] Fish J, Belsky V, Gomma S. Unstructured multigrid method for shells. Int J Numer Meth Eng 1996;39:1181–97.
[13] Duff I, Erisman A, Reid J. Direct methods for sparse matrices. Oxford: Clarendon Press; 1986.
[14] Hanson RJ. A cyclic reduction solver for the IMSL MATH LIBRARY. Technical Report, IMSL, 1990.
[15] Henson VE, Yang UM. A parallel algebraic multigrid solver and preconditioner. Technical Report UCRL-JC-139098, Lawrence Livermore National Laboratory, 2000.
[16] Johnsson L. Solving tridiagonal systems on ensemble architectures. ACM Trans Math Software 1985;11:271–88.
[17] Karypis G, Kumar V. Parallel multilevel K-way partitioning scheme for irregular graphs. In: ACM/IEEE Proceedings of SC96: high performance networking and computing, 1996.
[18] Lawrie D, Sameh A. The computation and communication complexity of a parallel banded system solver. ACM Trans Math Software 1984;10:33–43.
[19] Meier U. A parallel partition method for solving banded systems of linear equations. Parallel Comput 1985;2:33–43.
[20] Michielse P, van der Vorst H. Data transport in Wang's partition method. Parallel Comput 1988;7:185–95.
[21] Ortega JM. Introduction to parallel and vector computers. New York: Plenum Press; 1988.
[22] Reuter R. Solving tridiagonal systems of linear equations on the IBM 3090 VF. Parallel Comput 1988;8:371–6.
[23] Sameh AH, Kuck DJ. A parallel QR algorithm for symmetric tridiagonal matrices. IEEE Trans Comput 1977;C-26:147–51.
[24] Smith B, Bjorstad P, Gropp W. Domain decomposition. Cambridge University Press; 1996.
[25] Vanek P, Mandel J, Brezina M. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. In: 7th Copper Mountain conference on multigrid methods, 1995.