Advances in Engineering Software 34 (2003) 185–201 www.elsevier.com/locate/advengsoft
Parallel computing with load balancing on heterogeneous distributed systems
Primož Rus, Boris Štok*, Nikolaj Mole
Faculty of Mechanical Engineering, University of Ljubljana, Aškerčeva 6, Ljubljana SI-1000, Slovenia
Received 13 September 2002; revised 9 December 2002; accepted 9 December 2002
Abstract
In the present work, the parallelization of the solution of a system of linear equations, in the framework of finite element computational analyses, is dealt with. As the substructuring method is used, the basic idea refers to a way of decomposing the considered spatial domain, discretized by finite elements, into a finite set of non-overlapping subdomains, each assigned to an individual processor and computationally analysed in parallel. Considering the fact that personal computers and workstations are still the most frequently used computers, a parallel computational platform can be built by connecting the available computers into a computer network. Since the incorporated computers are usually of different computational power and memory size, the efficiency of parallel computations on such a heterogeneous distributed system depends mainly on proper load balance. To cope with the load balance problem, an algorithm for efficient load balancing of structured and free 2D quadrilateral finite element meshes, based on the rearrangement of elements among the respective subdomains, has been developed. © 2003 Elsevier Science Ltd. All rights reserved.
Keywords: Parallel computing; Heterogeneous computer system; Load balancing
1. Introduction

With the development of computer technology on one hand and of computational methods on the other, the technical problems which can be computationally simulated by corresponding numerical models become more and more comprehensive. In consequence, along with facilitated investigations and increased physical understanding, the efficiency of problem solution, especially in the R&D stage, is considerably improved. Unfortunately, numerical treatment of complex physical problems by discrete methods of analysis can lead to very large systems of linear equations. Computer systems development, in the sense of parallel computers and of connecting single computers into a group via a network, offers a possibility for the solution of such large-size problems even when the capability of a single computer is exceeded. Irrespective of the parallelization strategies used, the issue of proper load balancing, enabling the highest computational efficiency, becomes very important when the distributed computer
* Corresponding author. E-mail addresses:
[email protected] (P. Rus),
[email protected] (B. Štok),
[email protected] (N. Mole).
system is heterogeneous [1]. This is especially true for non-linear and non-stationary problems, which are solved incrementally, usually with many iterations needed within an incremental step. Domain decomposition has become a widely adopted technique for developing parallel applications based on discrete approximation methods [2], in particular the finite element method [3–6] and the boundary element method [7]. The FE computational scheme used in our parallel approach relies on the substructuring method, according to which the general problem equation, resulting from the consideration of the whole problem domain, can be split into a series of equations pertaining to the individual subdomains, as obtained by the corresponding domain decomposition. These equations can be solved only after the solution of the associated interface problem, which involves the complete determination of the physical variables on the interfaces between adjacent subdomains. For the solution of the interface problem direct [4] or iterative solvers [2,6,8] may be applied. Here, use of the direct solution approach will be made. The efficiency of using domain decomposition on a heterogeneous system depends on the way the computational
domain is divided into subdomains, and how the latter are assigned to the processors. Clearly, the subdomains should be assigned in such a way that all processors are engaged in the computation evenly, the execution of the allocated tasks thus taking approximately the same time. Due to the heterogeneity of the considered computing system it is obvious that allocation of the same amount of work to each processor would result in inefficient parallel computation, with several processors possibly being idle for a long time. Even considering the established standard computer performance characteristics, it is almost impossible to define a priori such a balance ratio between the computers in the heterogeneous distributed system that would yield efficient parallel computation. This is mostly due to the fact that the actual computational performance of a processor, performing the allocated work, depends primarily on the number of degrees of freedom and the number of interface variables pertaining to the assigned subdomain. Load balancing for heterogeneous parallel systems is a relatively new and less-explored subject of investigation. For the reasons discussed above, static load balancing based solely on prior knowledge of the components' performance is rarely a successful option. In consequence, some sort of dynamic load balancing must be applied. One way of doing it is by decomposing the problem into small and independent tasks, sending a task to each component of the computing system, and sending a new task to a component when it completes the previous task [9,10]. In this approach, the issue of load imbalance is actually ignored. Another possibility, which seems to be reasonable only when the range of processing power is broad, is given by asymmetric load balancing [11]. A real dynamic load balancing technique should use observed computational and communication performance to predict the time in which a task would complete on a given component and the time needed to query the component. This approach, which can be addressed in many different ways, such as the standard clock approach [12], dynamic distributed load balancing based on aspect ratio [13,14], or a diffusion algorithm for heterogeneous computers [15], is also the path we are following. In order to achieve efficient computation, a strategy is developed which, in the course of the ongoing incremental computation, automatically rearranges the subdomains in the sense of load balancing. Considering that the solution of the problem equations presents the predominant source of CPU time consumption, the load balancing is conceived upon
the computing times realized on a processor while establishing all the respective subdomain data needed for the setting and solution of the interface problem. As the freely accessible METIS program [16], which was used for mesh partitioning, proved to have some deficiencies associated with automatic load balancing, we have developed an algorithm for the rearrangement of elements among subdomains. The computations being most efficient with a smaller number of interface nodes, the algorithm is designed with the minimization of the number of interface nodes in mind. All communications and data transfers among the computers in the distributed system were managed through the Parallel Virtual Machine (PVM) communication library [17].
2. Substructuring method

Let us consider a physical problem defined on a domain Ω. In the context of the FEM, the continuous problem is discretized to yield a finite set of nodal primary and secondary variables, u and f, respectively. The relationship between them is given by the matrix equation Ku = f. The problem is fully determined by specifying the corresponding boundary conditions u = u^r on Γ_u and f = f^m on Γ_f. The basic idea of the substructuring method [3–5] consists in decomposing the considered spatial domain Ω into a finite set of non-overlapping subdomains Ω_s (Fig. 1). As a result of the domain decomposition, any two adjacent subdomains share a common interface boundary Γ_I, with conditions of physical continuity and consistency to be respected on it. Following the performed decomposition of the domain Ω into N_s subdomains Ω_s, the general problem equation Ku = f must be accordingly restructured, with the equations associated with the internal nodes of each subdomain numbered first, and the equations associated with the interface nodes numbered last. In general, the resulting system of equations has the following matrix form:

$$
\begin{bmatrix}
K_{11} & 0 & \cdots & 0 & K_{1I} \\
0 & K_{22} & \cdots & 0 & K_{2I} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & K_{N_s N_s} & K_{N_s I} \\
K_{I1} & K_{I2} & \cdots & K_{I N_s} & K_{II}
\end{bmatrix}
\begin{Bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N_s} \\ u_I \end{Bmatrix}
=
\begin{Bmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N_s} \\ f_I \end{Bmatrix}
\qquad (1)
$$
Fig. 1. Decomposition of the considered domain into two subdomains.
The system matrix K, usually also called the stiffness matrix, has a block-arrowhead form. Each diagonal submatrix K_ss represents the local stiffness of the subdomain Ω_s, while the submatrix K_II is the stiffness associated with the nodal variables at the interface Γ_I. An off-diagonal submatrix K_sI denotes the coupling stiffness between the subdomain Ω_s and the interface Γ_I. Accordingly, the vector of primary variables u is partitioned into the subvectors u_s, which correspond to the DOF pertaining to the interior of the subdomains Ω_s, and the subvector u_I, associated with the DOF on Γ_I. Similarly, f_s denotes the part of the loading vector f associated with the interior of Ω_s, while f_I is the subvector corresponding to the interface variables on Γ_I.

If the matrix Eq. (1) is further restructured with respect to the given boundary conditions for the primary and secondary variables, u = u^r on Γ_u and f = f^m on Γ_f, respectively, the following solution path can be traced:

Step 1. Given

$$ K_{II}^{*} \, u_{I}^{*} = f_{I}^{*} \qquad (2) $$

solve for the unknown primary variables u_I^m at the interface nodes. The respective matrix and vectors in Eq. (2) are defined as

$$ K_{II}^{*} = K_{II}^{mm} - \sum_{s=1}^{N_s} K_{s}^{*}, \qquad u_{I}^{*} = u_{I}^{m}, \qquad f_{I}^{*} = f_{I}^{m} - K_{II}^{mr} u_{I}^{r} - \sum_{s=1}^{N_s} f_{s}^{*} \qquad (3) $$

where

$$ K_{s}^{*} = (K_{sI}^{mm})^{T} (K_{ss}^{mm})^{-1} K_{sI}^{mm}, \qquad f_{s}^{*} = K_{Is}^{mr} u_{s}^{r} - (K_{sI}^{mm})^{T} (K_{ss}^{mm})^{-1} \left( f_{s}^{m} - K_{ss}^{mr} u_{s}^{r} - K_{sI}^{mr} u_{I}^{r} \right) \qquad (4) $$

Step 2. With u_I^m determined in Step 1, solve for the primary variables u_s^m at the internal nodes of each subdomain Ω_s (s = 1, 2, …, N_s):

$$ u_{s}^{m} = (K_{ss}^{mm})^{-1} \left( f_{s}^{m} - K_{ss}^{mr} u_{s}^{r} - K_{sI}^{mr} u_{I}^{r} - K_{sI}^{mm} u_{I}^{m} \right) \qquad (5) $$

Step 3. Finally, the unknown secondary variables at the internal nodes of each subdomain Ω_s (s = 1, 2, …, N_s) and at the interface nodes, f_s^r and f_I^r, respectively, can be determined:

$$ f_{s}^{r} = K_{ss}^{rr} u_{s}^{r} + K_{sI}^{rr} u_{I}^{r} + K_{ss}^{rm} u_{s}^{m} + K_{sI}^{rm} u_{I}^{m}, \qquad f_{I}^{r} = K_{II}^{rr} u_{I}^{r} + K_{II}^{rm} u_{I}^{m} + \sum_{s=1}^{N_s} \left[ K_{Is}^{rr} u_{s}^{r} + K_{Is}^{rm} u_{s}^{m} \right] \qquad (6) $$

The fact that the local stiffness submatrices associated with the internal primary variables, K_ss^mm, K_sI^mm, K_ss^mr, K_sI^mr and K_Is^mr, are uncoupled is advantageous. It can be exploited for concurrent computation of the matrices K_s^* and vectors f_s^* for each subdomain Ω_s, provided the computations of each subdomain are allocated to a different processor. The parallel implementation of the problem solution, as determined by Eqs. (2)–(6) of the substructuring method, is schematically presented in Fig. 2. From the scheme, the sequence of computing steps, with the respective objects of computation defined, as well as the master/slave allocation of a particular computation, can be seen.

Fig. 2. Parallel implementation of the substructuring method algorithm.
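To make the algebra of Eqs. (2)–(6) concrete, the following minimal NumPy sketch performs the same condensation and back-substitution for dense subdomain blocks. It is only an illustration under our own simplifying assumption that the prescribed values u^r have already been eliminated, so that the r-superscripted terms vanish; the function and variable names are ours, and the paper's slave programs work with banded skyline storage and Cholesky factorization rather than dense solves.

```python
import numpy as np

def solve_by_substructuring(K_ss, K_sI, f_s, K_II, f_I):
    """Solve the block-arrowhead system of Eq. (1) via interface condensation.

    K_ss, K_sI, f_s are lists holding one dense block per subdomain; in the
    paper these per-subdomain condensations run in parallel on the slaves,
    here they are simply a loop on one machine.
    """
    n_I = K_II.shape[0]
    S = np.array(K_II, dtype=float)   # condensed interface matrix K*_II
    g = np.array(f_I, dtype=float)    # condensed interface right-hand side f*_I
    for Kss, KsI, fs in zip(K_ss, K_sI, f_s):
        X = np.linalg.solve(Kss, np.column_stack([KsI, fs]))
        S -= KsI.T @ X[:, :n_I]       # subtract the subdomain contribution K*_s
        g -= KsI.T @ X[:, n_I]        # and the corresponding part of f*_s
    u_I = np.linalg.solve(S, g)       # Step 1: interface unknowns, Eq. (2)
    # Step 2: back-substitution for the interior DOF of every subdomain, Eq. (5)
    u_s = [np.linalg.solve(Kss, fs - KsI @ u_I)
           for Kss, KsI, fs in zip(K_ss, K_sI, f_s)]
    return u_s, u_I
```

In the parallel setting the loop body is exactly what each slave computes independently, while the accumulation into S and g and the solution for u_I correspond to the interface problem assembled and solved on the master (Fig. 2).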
3. Performance characterization of heterogeneous computers
In order to take advantage of the mathematical structure of the domain-decomposed problem, which enables concurrent computing, it is of major importance for the efficiency of such a computation on heterogeneous distributed systems that the allocated processor work is well load balanced. Allocation of the same amount of work to two individual computers of different speed would mean that the faster computer would be idle, waiting for the slower one to complete its computation. To avoid such a situation the speed ratios of the computers in the heterogeneous distributed system should be known, thus allowing for a corresponding proper load balancing. There exist several standard benchmarks, such as Whetstone [18], LINPACK [19] and HINT [20], that identify the computational efficiency of a computer by different procedures.
Table 1
Characteristics of computers in the heterogeneous distributed system

Label  Type          Processor         RAM (MB)  OS            Type of program
M      WS, HP B2000  PA-8000 400 MHz   1024      HP-UX 10.0    Master
S1     PC            P IV 1.7 GHz      1024      WIN 2000 SP2  Slave
S2     PC            P IV 1.7 GHz      1024      WIN 2000 SP2  Slave
S3     PC            P III 800 MHz     256       WIN 2000 SP2  Slave
S4     PC            P III 800 MHz     256       WIN 2000 SP2  Slave
S5     PC            P III 800 MHz     256       WIN 2000 SP2  Slave
S6     PC            P III 600 MHz     128       WIN 2000 SP2  Slave

Computers are connected via an Intel InBusiness 8-Port 10/100 Fast Hub.
The Whetstone benchmark reports KIPS, the number of operations per time unit for different combinations of real mathematical operations. The LINPACK benchmark reports the number of real operations per time unit for the solution of a linear system of N equations. Finally, HINT is a benchmark designed to measure processor and memory performance without fixing either the time or the problem size; HINT measures 'QUIPS' (QUality Improvement Per Second) as a function of time, while 'NetQUIPS' is a single number that summarizes the QUIPS over time. Our heterogeneous distributed computer system, used as a testing system for the development of a proper parallel computation strategy, is composed of a master computer and six slave computers, interconnected by a hub. Table 1 presents their characteristics. All the cited standard benchmarks were performed in double precision. The benchmark results are presented in Table 2 in terms of the respective performance parameters. The second part of the table shows, for each benchmark, the performance ratios of the individual computers normalized with respect to the first one (S1), which, according to the performed benchmarks, seems to perform most efficiently. It is evident from Table 2 that the analyzed computational performance, expressed by the corresponding performance ratios, depends heavily on the benchmark. This validation, though qualitatively fair, does not give a clear answer as to what the performance of the computers will be when running a real computational problem. Because of that, and considering that finite element calculations are themselves specific, in particular when the substructuring method of analysis is applied, we decided to perform special tests, taking the real nature of the considered problems into account. The main objective of these tests was to find out how the tested computers perform with respect to different densities of finite element meshes, as well as with respect to different numbers of interface nodes. The tests were performed on a steady state heat transfer case, considering division of a square domain into two subdomains (Fig. 3), with the subdomain Ω2 allocated to the computer whose computational performance is to be identified.
Table 2
Whetstone, LINPACK and HINT benchmarks and the corresponding performance ratios

Label  KIPS  mflops  MNetQUIPS  r_KIPS  r_mflops  r_MNetQUIPS
S1     0.84  169.50  55.13      1.00    1.00      1.00
S2     0.82  177.10  50.11      1.03    0.96      1.10
S3     0.68   61.54  28.76      1.24    2.75      1.92
S4     0.68   45.95  27.35      1.24    3.69      2.02
S5     0.68   45.64  27.23      1.24    3.71      2.02
S6     0.51   27.41  18.22      1.66    6.18      3.03

Fig. 3. Test domain Ω divided into two subdomains Ω1 and Ω2 with the corresponding boundary conditions.

Since a great amount of the computational time in the slave program is spent on the Cholesky decomposition of the matrix K_ss^mm = V_s^T V_s, which is a symmetric banded matrix stored in 'sky-line' form, we decided to define the speed of each computer from the analysis of that time. The number of floating-point operations for the Cholesky decomposition can be estimated by

$$ NO_{Ch} = \frac{N_e (\bar{N}_{BW} + 1)^2}{2} \qquad (7) $$

where N_e is the number of equations, corresponding to the number of DOF of the problem, and N̄_BW denotes the half-bandwidth of the matrix K_ss^mm.

Fig. 4. Speed v_Ch for a constant number of elements in the x direction (N_x = 100).

For a constant number of elements in the x direction (N_x = 100), the speed of calculation v_Ch, defined as the ratio NO_Ch/t_Ch, is presented as a function of the problem size in Fig. 4. Variation of the number of elements in the y direction (N_y), with N_x fixed, has a dual effect on the computation: first, it affects the number of matrix coefficients, and second, it affects the number of interface nodes as well. Especially the latter, being directly associated with the matrix bandwidth, has a tremendous influence on the efficiency of the computation. As evident from the diagrams in Fig. 4, the speed of computers S1, S2, S3, S4 and S5 decreases drastically after a certain number of interface nodes is reached. When inspecting the reason for such behaviour, we found, by analysing the program for the Cholesky decomposition, that the decrease of computational speed is a direct consequence of the computer hardware structure, actually of the speed of communication depending on whether the processor, the cache memory or the RAM is addressed. As a rule, increasing the distance of a memory level from the processor decreases the communication speed. Therefore, when the amount of data exceeds the amount of available memory on a certain level, the data must be delivered from a slower memory, which reflects negatively in the speed of calculation. Similar behaviour, which confirms our conclusion, can also be recognized from the diagram in Fig. 5, which represents the results of the HINT benchmark for QUIPS versus memory. It should be emphasized that during the testing of computational performance all unnecessary processes on the tested slave computers were turned off in order to obtain repeatability of the computing times. Besides, we limited the page file to the minimum and prevented the system processes from using it.

Fig. 5. HINT benchmark: QUIPS versus memory.
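The way such a speed figure can be obtained in practice is sketched below: the operation count of Eq. (7) is combined with a measured factorization time. The dense NumPy Cholesky used for timing is only a stand-in for the skyline solver of the slave program, and the synthetic banded test matrix is our own construction, so the absolute numbers will differ from those in Fig. 4.

```python
import time
import numpy as np

def cholesky_speed(n_eq, half_bw, seed=0):
    """Estimate v_Ch = NO_Ch / t_Ch on a synthetic banded SPD matrix."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_eq, n_eq))
    for k in range(half_bw + 1):          # fill a symmetric band
        d = rng.random(n_eq - k)
        A += np.diag(d, k)
        if k:
            A += np.diag(d, -k)
    A += n_eq * np.eye(n_eq)              # diagonal dominance -> positive definite
    no_ch = n_eq * (half_bw + 1) ** 2 / 2.0   # operation count estimate, Eq. (7)
    t0 = time.perf_counter()
    np.linalg.cholesky(A)
    t_ch = time.perf_counter() - t0
    return no_ch / t_ch                   # speed in operations per second

print(cholesky_speed(n_eq=2000, half_bw=200))
```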
4. Load balancing on heterogeneous computer systems

Having in mind the development of a parallel strategy for heterogeneous distributed computer systems which
could be continuously evolved, and considering the fact that it is practically impossible to perform all kinds of different tests with respect to different densities of finite element meshes, it can be concluded that proper load balancing is highly dependent on the problem size and on the structure of the available computers. Since it is therefore rather difficult to obtain proper load balancing for the actually considered analysis case only upon preliminary computing performance testing, we have developed an iterative strategy for the load balancing of the processors solving the specified problem. According to it, a suitable domain decomposition in the sense of the load balancing of the processors, considering the computing times realized in the previous iteration of the computation, is performed if needed. The main assumption of this iterative algorithm is that the speed of calculation (the number of operations per time unit) does not change between two consecutive iterations. This assumption may, however, be roughly violated if in a new domain decomposition the number of interface nodes is subject to change (Fig. 4). Assuming the same speed of computation, NO_s^(i-1)/t_s^(i-1) = NO_s^(i)/t_s^(i), in two consecutive iterations, with the computation times t_s^(i-1) on each computer known from the (i−1)th iteration and the ith iteration yet to be performed, new shares, aimed at simultaneous computing (t_1^(i) = t_2^(i) = ⋯ = t_{N_s}^(i)) on all the processors in the new iteration, can be defined:

$$ d_s^{(i)} = \frac{NO_s^{(i)}}{\sum_{k=1}^{N_s} NO_k^{(i)}} = \frac{NO_s^{(i-1)}}{t_s^{(i-1)}} \left( \sum_{k=1}^{N_s} \frac{NO_k^{(i-1)}}{t_k^{(i-1)}} \right)^{-1}, \qquad s = 1, 2, \ldots, N_s \qquad (8) $$

Here, we assume that the number of operations NO_s performed on the sth processor can be estimated in some way, considering the characteristic discretization parameters of the sth subdomain. Irrespective of the estimation formula, the reshaping of the subdomains must be done in accordance with the constraint equation

$$ \sum_{s=1}^{N_s} N_{e_s}^{(i)} + N_{e_I}^{(i)} = \sum_{s=1}^{N_s} N_{e_s}^{(i-1)} + N_{e_I}^{(i-1)} = N_e = \mathrm{const} \qquad (9) $$

where N_e is the total number of equations associated with the solution of the original problem, defined on the whole domain Ω, while N_es and N_eI are, respectively, the number of equations referring to the internal DOF in the subdomain Ω_s and to the DOF on the interface Γ_I in the substructured problem. The above load balancing algorithm is conceived generally. However, its efficiency can be affected by deviations from the basic assumption actually realized within the load balancing iteration process. In consequence, to keep the load balancing stable and convergent, the changes associated with the rearrangement of the subdomains should be smooth enough to prevent eventual destruction of the established mathematical structure of the parallelized problem. In the sequel, we will point out some issues which arise when applying the described procedure to structured and free domain decomposition.
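A small sketch of the update in Eq. (8): given the times measured in the previous iteration and an estimate of the operation counts performed there, it returns the shares d_s that would let all processors finish simultaneously if their speeds stayed constant. The NumPy formulation and the names are ours.

```python
import numpy as np

def balanced_shares(no_prev, t_prev):
    """Shares d_s^(i) according to Eq. (8).

    no_prev[s] ... estimated number of operations NO_s^(i-1) on processor s
    t_prev[s]  ... measured computing time t_s^(i-1) on processor s
    """
    speed = np.asarray(no_prev, float) / np.asarray(t_prev, float)
    return speed / speed.sum()          # shares proportional to observed speed

# example with roughly equal previous workloads and the iteration-0 slave
# times of Table 3: the slowest processor receives the smallest share
print(balanced_shares([3.3e6] * 6, [1.66, 1.92, 2.65, 2.78, 2.78, 4.78]))
```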
4.1. Structured domain decomposition

Fig. 6. Decomposition of the ring into (a) circular cutouts and (b) concentric rings.

When a problem domain is characterized by simple geometry, a regular discretization pattern can usually be applied. The domain can then be decomposed by following some topological pattern which is not subject to shape variations, irrespective of eventual subdomain rearrangements due to load balancing. Such a domain decomposition, which is reflected only in aspect ratio variation, will be referred to as a structured one. Although the associated load balancing problem can be trivial, it can serve as a good test example of the described numerical procedure, revealing some general properties. The domain decomposition of the ring domain in Fig. 6, which is discretized with an orthogonal polar mesh, can be structured following two patterns. Decomposition into circular cutouts (Fig. 6(a)) yields subdomains which are equivalent with respect to the number of incorporated interface nodes. Decomposition into concentric rings (Fig. 6(b)) results, on the contrary, in non-equivalent subdomains: the ring subdomains which include no part of the problem domain boundary have double the number of interface nodes with respect to the two subdomains that incorporate the external or internal domain boundary. Although the first decomposition is, from the discussed point of view, undoubtedly advantageous, it does not mean that it is also computationally more efficient; the actual situation depends on the parameters of discretization in the radial and circumferential direction. Both decompositions are characterized by the fact that, with the number of subdomains given, the number of interface nodes pertaining to a subdomain is fixed, irrespective of eventual subdomain rearrangements. This fact facilitates the load balancing considerably. Whether shaped in circular cutouts or concentric rings, the number of operations according to Cholesky
decomposition, Eq. (7), when applied to the subdomain Ω_s, can be expressed by the linear relationship

$$ NO_s^{(i)} = c_s^{(i)} N_{e_s}^{(i)}, \qquad c_s^{(i)} = \frac{(\bar{N}_{BW_s}^{(i)} + 1)^2}{2} = c_s^{(i-1)} = \mathrm{const} \qquad (10) $$
Considering the above relationship in Eq. (8) and respecting the constraint Eq. (9), a system of (N_s − 1) equations can be obtained:

$$ c_s^{(i)} N_{e_s}^{(i)} = d_s^{(i)} \left[ \sum_{k=1}^{N_s - 1} c_k^{(i)} N_{e_k}^{(i)} + c_{N_s}^{(i)} \left( N_e - \sum_{k=1}^{N_s - 1} N_{e_k}^{(i)} \right) \right], \quad s = 1, 2, \ldots, N_s - 1 \qquad (11) $$

The solution of the above system is a vector of N_{e_s}^{(i)} which ensures, within some tolerance, the required subdomain shares d_s^{(i)}. For the considered ring domain, with the discretization defined by the corresponding number of elements in the r (N_r) and w (N_w) direction, the task of load balancing is really trivial. The number of subdomain equations N_{e_s}^{c(i)} = N_{e_s}^{(i)} for a circular cutout subdomain Ω_s^{c(i)} can be directly expressed in terms of the respective discretization in the w direction. With essential boundary conditions assumed, the following relationship is true:

$$ N_{e_s}^{c(i)} = (N_r + 1)(N_{w_s}^{(i)} - 1) - 2(N_{w_s}^{(i)} - 1) \qquad (12) $$

which after its resolution yields the respective number of elements in the w direction for the considered subdomain:

$$ N_{w_s}^{c(i)} = \frac{N_{e_s}^{(i)}}{N_r - 1} + 1 \qquad (13) $$

Similarly, the number of subdomain equations N_{e_s}^{r(i)} = N_{e_s}^{(i)} for a concentric ring subdomain Ω_s^{r(i)} can be expressed in terms of the respective discretization in the r direction as

$$ N_{e_s}^{r(i)} = (N_{r_s}^{(i)} - 1) N_w \qquad (14) $$

yielding the respective number of elements in the r direction for the considered subdomain:

$$ N_{r_s}^{r(i)} = \frac{N_{e_s}^{(i)}}{N_w} + 1 \qquad (15) $$

Because the values of the vectors N_{w_s}^{c(i)} and N_{r_s}^{r(i)} are in general real numbers, they must be transformed into natural numbers considering the constraint equations N_w = Σ_{s=1}^{N_s} N_{w_s} and N_r = Σ_{s=1}^{N_s} N_{r_s}.
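For the circular cutout case the system (11) with the above constraint has a simple closed-form solution, since the required equation counts are proportional to d_s/c_s. The following sketch, with names of our own choosing, evaluates Eqs. (11) and (13) in this way and rounds the result to whole elements; fed with the iteration-0 times of Table 3 it reproduces the N_ws distribution of iteration 1 in that table to within one element per subdomain.

```python
import numpy as np

def ring_cutout_elements(d, c, n_e_total, n_r, n_w):
    """Elements per circular-cutout subdomain from the shares of Eq. (8).

    d[s] ... share d_s^(i); c[s] ... constant c_s^(i) of Eq. (10)
    Solves Eq. (11) under the constraint on the total number of internal
    equations (counts proportional to d_s / c_s), maps them to N_ws via
    Eq. (13) and rounds so that the N_ws sum up to the prescribed N_w.
    """
    d, c = np.asarray(d, float), np.asarray(c, float)
    w = d / c
    n_es = n_e_total * w / w.sum()            # real-valued solution of Eq. (11)
    n_ws = n_es / (n_r - 1) + 1               # Eq. (13)
    n_ws_int = np.rint(n_ws).astype(int)
    n_ws_int[np.argmax(n_ws_int)] += n_w - n_ws_int.sum()   # enforce the total
    return n_ws_int

# ring of Table 3: N_r = 100, N_w = 1000, equal constants c_s, shares from
# the iteration-0 times; 98,406 is the sum of N_es in iteration 0 of Table 3
t0 = np.array([1.66, 1.92, 2.65, 2.78, 2.78, 4.78])
d = (1.0 / t0) / (1.0 / t0).sum()
print(ring_cutout_elements(d, np.ones(6), 98_406, 100, 1000))
```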
Due to the simplicity of the considered problem, some characteristic properties regarding load balancing can easily be revealed by running a real physical problem. In the sequel, results from the steady state heat transfer analysis of the ring in Fig. 6, performed in parallel on six processors, are considered. The corresponding computations were performed on the heterogeneous distributed system described in Table 1 and qualitatively characterized in Table 2 and Fig. 4. On the three distinct meshes (N_r × N_w) considered, (100 × 1000), (1000 × 100) and (250 × 250), respectively, the effects of the structured domain decomposition into circular cutouts and concentric rings on the computational efficiency were investigated. Load balancing was performed on the slave level according to the respective load balancing formulas (8), (11) and (13) or (15). The parameters listed in Tables 3–6 characterize a particular computational case. Referring to a subdomain Ω_s they are, in turn:

Ω_s → S_c    allocation of the subdomain Ω_s to the processor S_c
N_rs, N_ws   respective discretization parameters in the r and w direction
N_es         number of equations
NPF_s        profile of the stiffness matrix
N_Is         number of interface nodes
N̄_BWs        average half-bandwidth of the stiffness matrix

The iteration counter Iter stands for the number of iterations needed to achieve load balance within a required tolerance. The load balancing is performed upon the times t_Sls realized in the (i−1)th iteration, by computing the subdomain shares d_s^(i) according to Eq. (8). The time t_Sls is the time of assembling the local stiffness matrices and the local loading vector, plus the time of computing the matrix K_s^* and the vector f_s^* in the slave program (Fig. 2). With dev(t_Sl), the percentage of deviation from the ideal balance is measured, taking the difference between the maximal and minimal value of the time t_Sls over their mean value.

Table 3
Structured domain decomposition of the ring (100 × 1000) into circular cutouts

Iter  Ω_s → S_c  N_ws  N_es    NPF_s      N_Is  N̄_BWs  t_Sls [s]
0     Ω1, S1     167   16,434  3,281,121  202   199    1.66
      Ω2, S2     167   16,434  3,281,121  202   199    1.92
      Ω3, S3     167   16,434  3,281,121  202   199    2.65
      Ω4, S4     167   16,434  3,281,121  202   199    2.78
      Ω5, S5     166   16,335  3,263,832  202   199    2.78
      Ω6, S6     166   16,335  3,263,832  202   199    4.78
      dev(t_Sl) (%)                                     96.75
1     Ω1, S1     248   24,453  4,889,129  202   199    2.45
      Ω2, S2     214   21,087  4,215,963  202   199    2.45
      Ω3, S3     156   15,345  3,065,182  202   199    2.49
      Ω4, S4     148   14,553  2,906,226  202   199    2.46
      Ω5, S5     147   14,454  2,882,661  202   199    2.46
      Ω6, S6     87    8,514   1,686,579  202   198    2.52
      dev(t_Sl) (%)                                     2.81

Table 4
Structured domain decomposition of the ring (1000 × 100) into concentric rings

Iter  Ω_s → S_c  N_rs  N_es    NPF_s      N_Is  N̄_BWs  t_Sls [s]
0     Ω1, S1     167   16,600  1,699,302  100   102    0.52
      Ω2, S2     167   16,600  3,339,302  200   201    1.94
      Ω3, S3     167   16,600  3,339,302  200   201    2.79
      Ω4, S4     167   16,600  3,339,302  200   201    2.80
      Ω5, S5     166   16,500  3,319,005  200   201    2.81
      Ω6, S6     166   16,500  1,689,005  100   102    1.59
      dev(t_Sl) (%)                                     137.61
1     Ω1, S1     473   47,200  4,850,184  100   102    1.46
      Ω2, S2     123   12,200  2,446,234  200   200    1.44
      Ω3, S3     85    8,400   1,674,948  200   199    1.43
      Ω4, S4     85    8,400   1,674,948  200   199    1.44
      Ω5, S5     84    8,300   1,654,651  200   199    1.45
      Ω6, S6     150   14,900  1,524,253  100   102    1.44
      dev(t_Sl) (%)                                     2.07
Table 5
Structured domain decomposition of the ring (250 × 250) into circular cutouts

Iter  Ω_s → S_c  N_ws  N_es    NPF_s      N_Is  N̄_BWs  t_Sls [s]
0     Ω1, S1     42    10,209  4,946,193  502   484    10.29
      Ω2, S2     42    10,209  4,946,193  502   484    11.05
      Ω3, S3     42    10,209  4,946,193  502   484    33.22
      Ω4, S4     42    10,209  4,946,193  502   484    33.98
      Ω5, S5     41    9,960   4,761,353  502   478    32.83
      Ω6, S6     41    9,960   4,761,353  502   478    41.54
      dev(t_Sl) (%)                                     120.56
1     Ω1, S1     80    19,671  9,701,988  502   493    19.24
      Ω2, S2     75    18,426  9,018,459  502   489    19.25
      Ω3, S3     25    5,976   2,757,809  502   461    19.83
      Ω4, S4     25    5,976   2,757,809  502   461    20.20
      Ω5, S5     25    5,976   2,757,809  502   461    20.35
      Ω6, S6     20    4,731   2,191,848  502   463    20.95
      dev(t_Sl) (%)                                     8.52
2     Ω1, S1     82    20,169  9,952,233  502   493    19.66
      Ω2, S2     76    18,675  9,201,480  502   492    19.69
      Ω3, S3     25    5,976   2,757,809  502   461    19.83
      Ω4, S4     24    5,727   2,692,692  502   470    19.86
      Ω5, S5     24    5,727   2,692,692  502   470    20.02
      Ω6, S6     19    4,482   2,006,447  502   447    19.39
      dev(t_Sl) (%)                                     3.20
Table 6
Structured domain decomposition of the ring (250 × 250) into concentric rings

Iter  Ω_s → S_c  N_rs  N_es    NPF_s      N_Is  N̄_BWs  t_Sls [s]
0     Ω1, S1     42    10,250  2,530,627  250   246    3.14
      Ω2, S2     42    10,250  4,968,127  500   484    11.12
      Ω3, S3     42    10,250  4,968,127  500   484    33.23
      Ω4, S4     42    10,250  4,968,127  500   484    34.05
      Ω5, S5     41    10,000  4,842,380  500   484    33.33
      Ω6, S6     41    10,000  2,467,380  250   246    4.69
      dev(t_Sl) (%)                                     166.19
1     Ω1, S1     111   27,500  6,894,670  250   250    8.29
      Ω2, S2     32    7,750   3,710,657  500   478    8.57
      Ω3, S3     11    2,500   1,069,970  500   427    9.51
      Ω4, S4     11    2,500   1,069,970  500   427    9.75
      Ω5, S5     11    2,500   1,069,970  500   427    9.77
      Ω6, S6     74    18,250  4,554,531  250   249    8.39
      dev(t_Sl) (%)                                     16.41
2     Ω1, S1     114   28,250  7,084,411  250   250    8.51
      Ω2, S2     32    7,750   3,710,657  500   478    8.56
      Ω3, S3     10    2,250   944,223    500   419    8.77
      Ω4, S4     10    2,250   944,223    500   419    8.97
      Ω5, S5     10    2,250   944,223    500   419    9.01
      Ω6, S6     74    18,250  4,554,531  250   249    7.96
      dev(t_Sl) (%)                                     12.38
3     Ω1, S1     112   27,750  6,957,917  250   250    8.36
      Ω2, S2     31    7,500   3,584,910  500   477    8.31
      Ω3, S3     10    2,250   944,223    500   419    8.77
      Ω4, S4     9     2,000   818,476    500   409    8.18
      Ω5, S5     9     2,000   818,476    500   409    8.19
      Ω6, S6     79    19,500  4,870,766  250   249    8.46
      dev(t_Sl) (%)                                     6.97
As can be seen from Tables 3–6, the initial partition was performed into six equally sized subdomains, which, because of the heterogeneity of the computer system, resulted in non-balanced computation times t_Sls. With the implementation of the load balancing algorithm and the respective reshaping of the subdomains from the previous iteration, load balanced subdomains were achieved in one further iteration for the first two cases (Tables 3 and 4), while for the cases presented in Tables 5 and 6 two and three iterations, respectively, were needed. With the limit of the deviation dev(t_Sl) set to 5%, the load balance for the first three cases was achieved within this limit, while in the fourth case the load balancing procedure was stopped because of the repetition of the respective discretization parameters in the r direction, obviously due to an insufficiently refined discretization. As the (100 × 1000) and (1000 × 100) discretization cases are totally equivalent with respect to the number of elements, while considering the nature of the prescribed boundary conditions they differ only slightly regarding the respective number of unknown primary variables, actually 99,000 versus 99,900, the considered two cases are convenient for a comparative analysis, especially for considering the effect of the subdomain pattern used in the domain decomposition. The advantage of using the decomposition in concentric rings is obvious, the reason for that, according to Eq. (7), being the strongly reduced number of interface nodes for the two boundary subdomains Ω1 and Ω6, respectively. The same conclusions can be made when considering the (250 × 250) discretization case. Though the number of unknown primary variables is smaller than in the former two cases, actually 62,250 versus 99,000 (or 99,900), the respective parallel computations are considerably longer; the role of the number of interface nodes is even more exposed. What is certainly worth our attention is the fact that the processor S6, though the worst with respect to the performed characterization (Table 2, Fig. 4), is able to contribute significantly to the overall computational efficiency if the correct subdomain, characterized by favourable discretization properties, is allocated to it. Otherwise, the computational performance can be spoiled tremendously.

4.2. Free domain decomposition

In most real life cases structured finite element meshes cannot be applied, which means that, in consequence, structured domain decomposition cannot be performed either.
With a free mesh discretization given, free domain decomposition is needed. Along with the many automatic mesh generators developed in the past, a lot of automatic algorithms for the partitioning of free meshes were developed as well. Nowadays, the most used algorithms are based on multi-level graph partitioning. Such a software program is METIS [16], which is freely accessible on the Internet at http://www.cs.umn.edu/~metis. It is used for partitioning of large irregular graphs, partitioning of large meshes, and computing fill-reducing orderings of sparse matrices based on multi-level graph partitioning. The included algorithms can partition finite element meshes into the shares of elements requested in each subdomain, with a minimal number of interface nodes. Contrary to the relative simplicity of the load balancing problem associated with the structured domain decomposition, which proved to respect quite well the assumption leading to the load balancing Eq. (8), the load balancing in the free domain decomposition problem is far more complex. The main difficulty arises from the fact that there is no evident and simple direct relationship between the number of elements and the number of interface nodes for an arbitrarily shaped domain. In consequence, a supposition can be made, defining the number of operations in Eq. (8) as a function of the number of elements:

$$ NO_s^{(i)} = \tilde{c}_s^{(i)} N_{El_s}^{(i)} \qquad (16) $$

where c̃_s^(i) is a constant characterizing the considered subdomain. If we suppose further that the shape of the subdomain, and consequently the number of respective interface nodes, does not change too much in two consecutive steps of load balancing, which otherwise applies exactly to the structured domain decomposition of the previously considered ring case, we can assume with high probability that the respective number of operations and number of elements change proportionally by domain reshaping. Therefore, c̃_s^(i-1) = c̃_s^(i) can be assumed. The assumption yielding Eq. (8) being still true, this results in a system of (N_s − 1) equations

$$ \tilde{c}_s^{(i)} N_{El_s}^{(i)} = d_s^{(i)} \sum_{k=1}^{N_s} NO_k^{(i)}, \quad s = 1, 2, \ldots, N_s - 1 \qquad (17) $$

By considering the element constraint equation regarding the total number of elements N_El in the whole domain Ω,

$$ \sum_{k=1}^{N_s} N_{El_k}^{(i)} = N_{El} = \mathrm{const} \qquad (18) $$

the system of Eq. (17) can be rewritten as

$$ \tilde{c}_s^{(i)} N_{El_s}^{(i)} = d_s^{(i)} \left[ \sum_{k=1}^{N_s - 1} \tilde{c}_k^{(i)} N_{El_k}^{(i)} + \tilde{c}_{N_s}^{(i)} \left( N_{El} - \sum_{k=1}^{N_s - 1} N_{El_k}^{(i)} \right) \right], \quad s = 1, 2, \ldots, N_s - 1 \qquad (19) $$

with a vector of N_{El_s}^{(i)} as the solution. With the shares d_s^{(i)} specified upon the known data from the previous iteration (8), the solution vector, which must be transformed into natural numbers considering Eq. (18), tries to achieve the conditions for load balance. The efficiency of the load balancing is definitely dictated by the degree of actual deviation from the basic assumptions. The above load balancing strategy, applied to free domain decomposition, was tested by considering the steady state heat transfer analysis of the square domain shown in Fig. 3. Though discretized with a regular orthogonal mesh, free domain decomposition is applied in all steps of the load balancing process. In the METIS program there are 24 different combinations of parameters with which the behaviour of the partitioning algorithm can be influenced. Since for a small number of partitions the multi-level recursive bisection algorithm should be used instead of the multi-level k-way algorithm, the number of possible parameter combinations decreases to six. Our experience has shown, however, that for different domain shapes and different shares of elements in the respective subdomains these parameter combinations do attain the required element shares, but produce more or less good results with respect to topological properties, thus affecting the efficiency of the respective parallel computations considerably. Also, in parallel computations it is usual and reasonable that the computations regarding a specific subdomain are allocated to the same processor all through the analysis; it is certainly the only secure way of controlling the load balancing process. But when using the METIS program in consecutive steps of the load balancing procedure, we found out that there is no guarantee for the METIS program either to retain the respective subdomain positions in space or to perform the respective reshapings smoothly. This behaviour has actually a similar consequence as if the subdomain processor allocations were changed, which makes the METIS program definitely unreliable for load balancing purposes. Another deficiency of the METIS program concerns the minimization of the number of interface nodes, which is not really effective; though of minor importance, it is to be considered in eventual reshapings performed later on. Performing the load balancing algorithm, coupled with the free domain decomposition as generated by the METIS program, on the test case of Fig. 3 revealed the described deficiencies of the METIS program evidently. The results of the performed three load balancing iterations, tabulated in Table 7, show the catastrophic effect caused by the change of the relative position of the respective subdomains in space (Fig. 7). In the considered case this is manifested by a rough change of the number of interface nodes N_Is, contrary to the assumed smooth variation on which the load balancing algorithm is based. Yet, from the presented computing times it is evident that the load balancing algorithm does work correctly if the subdomain positions are not subject to change during the consecutive load balancing iterations. This can be inspected by considering the results of Table 7 for the first and third iteration.
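The whole balancing cycle for a free decomposition can be put together as in the outline below. It is a schematic sketch only: run_iteration and rearrange are placeholders for the actual master/slave programs and for the element-transfer algorithm of Section 5, the smoothing the paper requires for stability is not modelled, and the names are ours. The shares follow Eq. (8), the per-element constants Eq. (16), and the targeted element counts the solution of Eqs. (18) and (19).

```python
import numpy as np

def rebalance_loop(run_iteration, rearrange, n_el_start, tol_dev=0.05, max_iter=10):
    """Iterative load balancing of a freely decomposed mesh (cf. Table 8)."""
    n_el = np.asarray(n_el_start, dtype=int)
    for _ in range(max_iter):
        t_sl, no_ops = run_iteration(n_el)            # one parallel FE solution
        t_sl, no_ops = np.asarray(t_sl, float), np.asarray(no_ops, float)
        dev = (t_sl.max() - t_sl.min()) / t_sl.mean() # dev(t_Sl)
        if dev < tol_dev:
            break
        d = (no_ops / t_sl) / (no_ops / t_sl).sum()   # shares, Eq. (8)
        c_tilde = no_ops / n_el                       # operations per element, Eq. (16)
        w = d / c_tilde
        target = n_el.sum() * w / w.sum()             # solution of Eqs. (18)-(19)
        delta = np.rint(target).astype(int) - n_el    # surplus/deficit per subdomain
        delta[np.argmax(delta)] -= delta.sum()        # keep the total unchanged
        n_el = rearrange(n_el, delta)                 # element transfer, Section 5
    return n_el
```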
Table 7
Load balancing upon domain partitioning by METIS (N_El = 100,000)

Iter  Ω_s → S_c  shEl_s  N_Is  t_Sls [s]
0     Ω1, S3     0.333   114   1.79
      Ω2, S4     0.333   110   1.89
      Ω3, S5     0.334   224   6.33
      t_Slmax                  6.33
1     Ω1, S3     0.447   108   2.40
      Ω2, S4     0.425   104   2.40
      Ω3, S5     0.128   212   2.37
      t_Slmax                  2.40
2     Ω1, S3     0.448   109   2.39
      Ω2, S4     0.423   216   7.20
      Ω3, S5     0.129   107   0.78
      t_Slmax                  7.20
3     Ω1, S3     0.455   106   2.41
      Ω2, S4     0.143   252   5.73
      Ω3, S5     0.402   146   2.54
      t_Slmax                  5.37
Because of those facts we decided to follow a different approach, with only the initial free domain decomposition performed by the METIS program, considering a decomposition with the minimum number of interface nodes. For all subsequent rearrangements of elements needed due to load balance, we have developed our own algorithm.
5. Algorithm for elements rearrangement between subdomains

Considering the basic assumptions adopted in the derivation of the load balancing algorithm, Eq. (8), the major demand on the algorithm for the rearrangement of elements between subdomains is that the respective number of interface nodes should change smoothly and be minimized as much as possible by the rearrangement of elements. The algorithm should work for structured and free 2D quadrilateral finite element meshes.

5.1. General considerations

Let us consider two subdomains, Ω1 and Ω2, with e1 and e2 being the respective numbers of elements. If a certain amount of elements Δe is rearranged between the two subdomains, this results in the formation of new subdomains Ω̃1 and Ω̃2, with ẽ1 and ẽ2 being the respective numbers of elements. The integrity of the whole domain being preserved, i.e. Ω1 ∪ Ω2 = Ω̃1 ∪ Ω̃2, the identity e1 + e2 = ẽ1 + ẽ2 holds. The performed elements rearrangement can be referred to as the subdomains reshaping.
The most natural way of reshaping the subdomains Ω1 and Ω2 is certainly by transferring the required amount of elements Δe+ = Δe1 > 0 from a surplus subdomain Ω+ = Ω1 to a deficit subdomain Ω− = Ω2 (Δe− = Δe2 = −Δe1) over the common interface. In principle, this transfer could be done arbitrarily, but with the risk of not attaining the computational performance aimed at. Though obtaining the demanded shares of elements ẽ1 and ẽ2, with no surplus/deficit element (Δẽ1 = Δẽ2 = 0), the efficiency of the computation would very probably be damaged significantly, because there is no controlled variation of the other parameters, which affect the computation as well. In order to fulfil the basic supposition of our approach, those parameters should change sufficiently smoothly during the elements transfer. The transfer procedure presented in the sequel is conceived in such a way that this goal can be attained. The number of interface nodes is, no doubt, the parameter of greatest relevance. Its effect on the computation time is multiple: it affects the computations performed in parallel on the substructure level, as well as the computation of the interface problem solution, performed by the master computer. Considering the optimal organization of the substructure matrices [3], the number of interface nodes can be directly associated with the bandwidth, with quadratic dependence of the number of operations in the Cholesky decomposition, Eq. (7). Our elements transfer algorithm is therefore designed so that the number of interface nodes is kept as small as possible. First, let us introduce some basic principles and the corresponding terminology regarding the elements transfer operation. Whatever the number of subdomains, the transfer is always performed between two subdomain groups
Fig. 7. Subdomain space allocation according to domain partitioning by METIS.
Ω+ and Ω− having equal numbers of surplus and deficit elements. These groups are constructed in some convenient way from the existing subdomains Ω_s, not necessarily including all of them. The common boundary represents the transfer interface Γ_T, along which elements will be transferred from the surplus subdomain Ω+ to the deficit subdomain Ω−. All the elements from the subdomain Ω+ facing or just touching the transfer interface Γ_T, i.e. elements having at least one node in common with the interface, are potential candidates for the transfer. From those elements the interface front is formed, the sequence of elements being arranged continuously from the beginning (S) to the end (E) of the transfer interface, thus respecting inter-element connectivity. Since elements can be connected either through a single node or through an edge, the interface front can be further characterized by identifying sequences of elements, referred to in the sequel as connectivity rows, having edge or node connectivity. By definition, no element in a row may be connected, either by edge or by node, with more than two row elements. In such a row, the first and the last element, as well as the sequential order of the elements in between, are clearly defined. As an exception, a row may consist of one single element. The interface front will thus consist, in general, of several edge-connectivity rows, which are themselves linked by node-connectivity rows.

5.2. Priority criteria for elements transfer across a two-domain interface

The transfer of elements from the interface front should be performed in such a way that we arrive at a computationally advantageous subdomain reshaping. Therefore, the transfer priority criteria have to be specified first. These criteria can be established upon topological considerations, actually upon the classification of the relative position of each connectivity row with respect to the interface layout. With respect to the topological properties of the respective first (F) and last (L) element, a connectivity row can be classified into one of six classes. Also, to one of these two elements the status of the starting element is attributed, the transfer of the row elements starting with it. The properties characterizing the six classes into which connectivity rows are classified are as follows:

• R4 class: when either one or both of the F/L elements has/have four interface nodes; the starting element is the one sharing three edges with the interface, if existing, otherwise the first element in the row.
• R3D class: when both of the F/L elements have three interface nodes; the starting element is the first element in the row.
• R3 class: when one of the F/L elements has three interface nodes; the starting element is the one with three interface nodes.
• R2 class: when both of the F/L elements have two interface nodes; the starting element is the first element in the row.
• R1 class: when a row consists of one single element with interface node connectivity.
• R0 class: when a row consists of several elements with interface node connectivity; the starting element is the first element in the row.

The edge-connectivity rows (R4 class to R2 class) are ranked in descending order, R4 class having the highest priority and R2 class the lowest. The node-connectivity rows have no priority. R0 class rows are not subject to transfer at all, thus avoiding the possibility of wedge-like reshaping, while the transfer of R1 class rows is only conditional. Actually, a considered R1 class row is transferred only if, at a given stage of the transfer operation, both adjacent edge-connectivity rows prove the same priority and are to be moved due to fulfilling the highest priority criterion then applicable.

5.3. Two-domain reshaping

With the transfer criteria determined, the procedure of elements rearrangement between the considered subdomains Ω+ and Ω− can begin. The subdomains reshaping is a multistep process which demands at each step the identification of the actual transfer parameters: the number of elements to be transferred, the interface front and the sequence of connectivity rows. Upon the classification of the rows into classes, only the rows ranked highest according to the specified transfer priority criteria are eligible to move within the considered step. If the actual number of surplus elements is greater than the number of elements eligible for the transfer, the latter are moved completely, otherwise only a part of them. Elements are thus transferred row by row, element after element, starting from the starting element in a row and following the established sequential order in it. If the number of surplus elements is less than or equal to the number of elements eligible to move, the subdomains reshaping is accomplished within the considered step, otherwise another step is required with further transfer of elements. The sequence of pictures (a–k) in Fig. 8 shows the steps in the elements transfer procedure, which are performed according to the highest priority criterion valid at a considered step. The example clearly demonstrates the tendency of the transfer algorithm to smooth the interface between the considered subdomains, thus minimizing the number of nodes on it. The elements transfer is finished when all the elements needed to establish the required subdomain element ratio ẽ1 : ẽ2 between the considered subdomains are moved across the interface Γ_T. As a consequence, the subdomains Ω+ = Ω1 and Ω− = Ω2 are reshaped into the subdomains Ω̃1 and Ω̃2. If, however, for any reason the integrity of the subdomain Ω̂1, which is obtained from the surplus subdomain Ω1+, is violated during the elements rearrangement, thus resulting in the division of Ω̂1 into several disconnected subdomains Ω̂1.1, Ω̂1.2, …, the following procedure is applied.
Fig. 8. Steps in elements transfer according to transfer priority criteria.
Fig. 9. Undesirable division of the surplus subdomain.
One among the subdomains Ω̂1.1, Ω̂1.2, …, having some advantageous property for further reshaping, is kept, while the others are merged into the subdomain Ω̂2. The merged subdomain thus becomes a surplus subdomain and the process of elements transfer to the remainder of Ω̂1 can start again. The described anomalous domain disintegrity is likely to happen if the domain is not convex, or if the finite element discretization is characterized by free meshing. In Fig. 9, two subdomain layouts leading to this phenomenon are shown. In both cases the transfer of elements, if performed sufficiently long, leads to disintegration of the initial surplus domain into two subdomains. Considering the necessary merging of one of them into the initial deficit domain, two properties that are advantageous with respect to further reshaping can be visualized: it is certainly advantageous to merge small subdomains, retaining the largest one (Fig. 9(a)), and to merge island-like subdomains, retaining the boundary one (Fig. 9(b)).

5.4. Multi-domain reshaping

The rearrangement of elements in a multi-domain follows essentially the same elements transfer procedure as developed for two-domain reshaping. It is actually performed as a multi-step operation, solving at each step a sequence of respective two-domain reshaping problems. In order to transform the multi-domain reshaping problem into a two-domain reshaping problem (or a sequence of them), the same strategy as used by the METIS program in the domain decomposition is applied. There, the recursive bisection method is applied to partition the domain Ω into N_s subdomains Ω_s according to the respective shares of elements. Any subdomain Ω_i^(j), arrived at after the jth bisection step but not yet reduced to a single subdomain
from the Ω_s set, is subject to further division into two subdomains, respecting the element shares resulting from the respective subdomain structures. The evolution of the partitioning process, following the described approach, can be symbolically resumed by

$$
\begin{aligned}
\Omega &= \Omega_1^{(1)} \cup \Omega_2^{(1)} = (\Omega_1^{(2)} \cup \Omega_2^{(2)}) \cup (\Omega_3^{(2)} \cup \Omega_4^{(2)}) \\
&= ((\Omega_1^{(3)} \cup \Omega_2^{(3)}) \cup (\Omega_3^{(3)} \cup \Omega_4^{(3)})) \cup ((\Omega_5^{(3)} \cup \Omega_6^{(3)}) \cup (\Omega_7^{(3)} \cup \Omega_8^{(3)})) \\
&= \cdots = ((\cdots((\Omega_1 \cup \Omega_2) \cup \cdots \cup (\Omega_{N_m - 1} \cup \Omega_{N_m}))\cdots) \cup (\cdots((\Omega_{N_m + 1} \cup \Omega_{N_m + 2}) \cup \cdots \cup (\Omega_{N_s - 1} \cup \Omega_{N_s}))\cdots))
\end{aligned}
\qquad (20)
$$
with N_m defined as N_m = int(N_s/2). The same evolution path, as established by the partitioning of the domain Ω, will be followed in the procedure of rearrangement of elements in a multi-domain. First, the subdomains Ω_s (s = 1, …, N_m) and the subdomains Ω_s (s = N_m + 1, …, N_s), with respective surplus/deficit elements Δe_s, are merged to yield two subdomains, Ω_1^(1) and Ω_2^(1), respectively. The respective surplus/deficit elements Δe_1^(1) and Δe_2^(1) are given by

$$ \sum_{s=1}^{N_m} \Delta e_s = \Delta e_1^{(1)}, \qquad \sum_{s=N_m+1}^{N_s} \Delta e_s = \Delta e_2^{(1)} \;\Rightarrow\; \Delta e_1^{(1)} + \Delta e_2^{(1)} = 0 \qquad (21) $$

If Δe_1^(1) ≠ 0, rearrangement of elements between the two subdomains is needed, which is performed as described in
Section 5.3. The reshaping of the subdomains Ω_1^(1) and Ω_2^(1) results in the subdomains Ω̃_1^(1) and Ω̃_2^(1), with respective numbers of elements ẽ_1^(1) = e_1^(1) + Δe_1^(1) and ẽ_2^(1) = e_2^(1) + Δe_2^(1), and with no surplus/deficit element (Δẽ_1^(1) = Δẽ_2^(1) = 0). As a consequence of the performed reshaping of the subdomains Ω_1^(1) and Ω_2^(1), the subdomains Ω_s (s = 1, …, N_m) and the subdomains Ω_s (s = N_m + 1, …, N_s) are correspondingly reshaped as well, yielding the subdomains Ω_s^(1) (s = 1, …, N_s). Because of the performed reshaping, the respective surplus/deficit elements Δe_s that were characterizing the subdomains Ω_s (s = 1, …, N_s) before the reshaping are correspondingly changed to Δ^(1)e_s, but now fulfilling the identities

$$ \sum_{s=1}^{N_m} \Delta^{(1)} e_s = 0, \qquad \sum_{s=N_m+1}^{N_s} \Delta^{(1)} e_s = 0 \qquad (22) $$

The elements rearrangement process can now be moved to the next level, in which the described rearrangement procedure is performed separately and independently for the subdomains Ω̃_1^(1) and Ω̃_2^(1). Thus, the set of subdomains Ω_s^(1) (s = 1, …, N_m), incorporated in Ω̃_1^(1), is correspondingly halved and the respective subdomains are merged to yield two subdomains, Ω_1^(2) and Ω_2^(2), with respective surplus/deficit elements Δe_1^(2) and Δe_2^(2) demanding the subdomains reshaping. Similarly, from the set of subdomains Ω_s^(1) (s = N_m + 1, …, N_s), incorporated in Ω̃_2^(1), the subdomains Ω_3^(2) and Ω_4^(2), with respective surplus/deficit elements Δe_3^(2) and Δe_4^(2), are obtained. The respective reshapings of the two subdomain couples Ω_1^(2)–Ω_2^(2) and Ω_3^(2)–Ω_4^(2) yield the subdomains Ω̃_1^(2), Ω̃_2^(2) and Ω̃_3^(2), Ω̃_4^(2), with no surplus/deficit elements (Δẽ_1^(2) = Δẽ_2^(2) = Δẽ_3^(2) = Δẽ_4^(2) = 0). The respective subdomains Ω_s^(1) are correspondingly reshaped into the subdomains Ω_s^(2), exhibiting Δ^(2)e_s surplus/deficit elements. The described procedure can be repeated, following the structure of Eq. (20), until complete fulfilment of the required element shares on the level of the subdomains Ω_s (s = 1, …, N_s).
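The recursive order of Eqs. (20)–(22) can be outlined compactly as follows. This is a schematic sketch under our own naming, with the two-domain transfer of Section 5.3 abstracted into a callable that moves elements across the common interface of two subdomain groups and updates their individual surplus/deficit counters.

```python
def multi_domain_reshape(subdomains, delta, transfer):
    """Recursively balance surplus/deficit elements following Eq. (20).

    subdomains ... list of subdomain identifiers (length N_s)
    delta      ... dict mapping a subdomain to its surplus (+) or deficit (-)
    transfer(surplus_group, deficit_group, n)
               ... two-domain transfer of Section 5.3, moving n elements and
                   updating delta for the subdomains it touches
    """
    if len(subdomains) < 2:
        return
    n_m = len(subdomains) // 2                     # N_m = int(N_s / 2)
    left, right = subdomains[:n_m], subdomains[n_m:]
    d_left = sum(delta[s] for s in left)           # aggregated Delta e_1^(1), Eq. (21)
    if d_left > 0:                                 # left half holds the surplus
        transfer(left, right, d_left)
    elif d_left < 0:
        transfer(right, left, -d_left)
    multi_domain_reshape(left, delta, transfer)    # next level; Eq. (22) now holds
    multi_domain_reshape(right, delta, transfer)
```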
At the end, to demonstrate the general property of the developed rearrangement algorithm, interface smoothing and interface nodes minimization, a simple problem is considered. When a square domain, regularly meshed with (100 × 100) elements, is partitioned by the METIS program into six subdomains of the required shares (shEl1 = shEl2 = shEl3 = shEl4 = 0.167 and shEl5 = shEl6 = 0.166), subdomains with the layout displayed in Fig. 10(a) are obtained. Though the decomposition is performed in the sense of minimizing the number of interface nodes, it is evident that the achieved number of interface nodes is not optimal. By applying our rearrangement algorithm, however, after the execution of several reshapings of those subdomains, enforced by the imposition of artificially created surplus/deficit elements but ending finally at the initially required shares shEl1, shEl2, …, the subdomain layout displayed in Fig. 10(b) is obtained. This layout is definitely optimal, which proves the quality of our algorithm.

Fig. 10. Demonstration of the interface smoothing and interface nodes minimization property of the developed algorithm.

5.5. Load balancing of the square domain

Testing of the computational conformity of the developed algorithm for the rearrangement of elements, as the key factor in load balancing, was performed on the steady state heat transfer analysis case of the square domain in Fig. 3. Parallel computations were performed on the same computer system as used in the investigations of structured domain decomposition. The notations introduced there still apply; therefore, only N_Els and N_ΔEls need to be clarified: they are the number of elements and the number of surplus/deficit elements in each subdomain, respectively. The results of the load balancing, performed as before on the slave level, are tabulated in Table 8. The (300 × 250) elements square domain was initially partitioned into six equally sized subdomains by the METIS program, considering free domain decomposition. Actually, the demanded shares of elements were 0.167 for the first four subdomains and 0.166 for the last two, respectively, which was accomplished rather satisfactorily by the METIS program.
Table 8
Load balancing of the square domain partitioned into six subdomains

Iter  Ωs → Sc    NEls     Nes      NPFs        NIs   NBWs   tSls [s]   ΔNEls
0     Ω1, S5    12,490   12,362   1,918,231   230   155    1.75      -2,648
      Ω2, S4    12,599   12,477   1,984,182   237   159    1.79      -2,476
      Ω3, S1    12,631   12,473   3,691,794   404   295    5.31       3,344
      Ω4, S6    12,385   12,261   1,953,792   237   159    2.78         639
      Ω5, S3    12,449   12,312   1,996,895   240   162    1.88      -2,028
      Ω6, S2    12,446   12,305   3,471,697   378   282    5.09       3,169
      dev(tSl) = 100.68 %
1     Ω1, S5    15,138   15,006   2,504,478   250   166    2.61         -79
      Ω2, S4    15,075   14,943   2,484,410   248   166    2.52        -354
      Ω3, S1     9,287    9,153   2,471,705   338   270    3.30         921
      Ω4, S6    11,746   11,627   1,787,372   232   153    2.50        -325
      Ω5, S3    14,477   14,358   2,337,727   243   162    2.24      -1,286
      Ω6, S2     9,277    9,139   2,416,896   343   264    3.49       1,123
      dev(tSl) = 43.37 %
2     Ω1, S5    15,217   15,086   2,500,464   249   165    2.51        -570
      Ω2, S4    15,429   15,299   2,569,331   250   167    2.65        -137
      Ω3, S1     8,366    8,234   2,171,144   329   263    2.85         211
      Ω4, S6    12,071   11,951   1,845,611   234   154    2.59        -249
      Ω5, S3    15,763   15,642   2,657,118   254   169    2.85         422
      Ω6, S2     8,154    8,020   2,057,456   326   256    2.93         323
      dev(tSl) = 15.46 %
3     Ω1, S5    15,787   15,655   2,661,782   255   170    2.91         474
      Ω2, S4    15,566   15,433   2,614,834   252   169    2.75          43
      Ω3, S1     8,155    8,022   2,139,848   329   266    2.86         178
      Ω4, S6    12,320   12,201   1,907,414   232   156    2.67        -149
      Ω5, S3    15,341   15,223   2,527,548   250   166    2.52        -652
      Ω6, S2     7,831    7,699   1,980,058   322   257    2.81         105
      dev(tSl) = 14.34 %
4     Ω1, S5    15,312   15,181   2,548,529   250   167    2.65        -363
      Ω2, S4    15,523   15,394   2,585,794   251   167    2.69        -248
      Ω3, S1     7,977    7,847   2,050,285   323   261    2.66        -174
      Ω4, S6    12,469   12,349   1,945,402   233   157    2.74         -82
      Ω5, S3    15,993   15,871   2,709,616   256   170    3.06         741
      Ω6, S2     7,726    7,592   1,988,878   325   261    2.87         126
      dev(tSl) = 14.34 %
5     Ω1, S5    15,675   15,543   2,642,529   253   170    2.84         325
      Ω2, S4    15,771   15,638   2,659,097   254   170    2.86         380
      Ω3, S1     8,151    8,018   2,144,592   329   267    2.84         169
      Ω4, S6    12,551   12,432   1,920,845   232   154    2.68        -100
      Ω5, S3    15,252   15,134   2,510,315   249   165    2.52        -610
      Ω6, S2     7,600    7,468   1,891,993   323   253    2.61        -163
      dev(tSl) = 12.62 %
6     Ω1, S5    15,351   15,221   2,536,140   250   166    2.62        -340
      Ω2, S4    15,391   15,261   2,557,027   251   167    2.65        -250
      Ω3, S1     7,982    7,852   2,052,093   323   261    2.65        -130
      Ω4, S6    12,651   12,530   1,993,434   235   159    2.82         187
      Ω5, S3    15,862   15,740   2,672,029   255   169    2.85         315
      Ω6, S2     7,763    7,629   2,007,068   326   263    2.90         218
      dev(tSl) = 10.13 %
7     Ω1, S5    15,691   15,560   2,630,145   252   169    2.80          69
      Ω2, S4    15,641   15,508   2,628,058   253   169    2.77          20
      Ω3, S1     8,112    7,979   2,125,741   329   266    2.81          68
      Ω4, S6    12,464   12,345   1,939,155   232   157    2.72         -60
      Ω5, S3    15,547   15,428   2,610,970   252   169    2.73         -93
      Ω6, S2     7,545    7,413   1,928,996   322   260    2.76          -4
      dev(tSl) = 3.25 %
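One way to read the ΔNEls column in Table 8: a positive entry is a surplus that the subdomain passes on before the next iteration, a negative entry a deficit it receives, so the element counts of consecutive iterations are related (up to a one-element rounding difference in some iterations) by NEls(k+1) = NEls(k) - ΔNEls(k). A quick check on the first iteration:

    nel_iter0  = [12490, 12599, 12631, 12385, 12449, 12446]
    dnel_iter0 = [-2648, -2476, 3344, 639, -2028, 3169]
    nel_iter1  = [n - d for n, d in zip(nel_iter0, dnel_iter0)]
    print(nel_iter1)  # [15138, 15075, 9287, 11746, 14477, 9277], as in Table 8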
As can be seen from Table 8, the subdomains Ω3 and Ω6 have more interface nodes than the other subdomains; therefore, the most powerful computers (S1 and S2) were allocated to them. As expected, due to the heterogeneity of the distributed system, the computing times tSls were initially not balanced. By continuing the computational analysis with load balancing according to the developed algorithm, implemented by means of our element transfer algorithm, load balance between the subdomains was achieved in seven iterations, the deviation dev(tSl) dropping from 100.68% to 3.25%.
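In schematic form, the balancing loop applied in this subsection can be outlined as follows. This is a sketch only: the concrete share-update rule derived earlier in the paper is abstracted as the argument update_shares, the stopping criterion is an assumption of the sketch, and measure_slave_times() and rearrange_elements() are hypothetical placeholders for the parallel finite element computation and the element rearrangement algorithm, respectively.

    def balance_loop(nel, update_shares, measure_slave_times,
                     rearrange_elements, tol=0.05, max_iter=10):
        # nel: current element counts per subdomain (one entry per slave)
        for iteration in range(max_iter):
            times = measure_slave_times(nel)           # realized slave times tSls
            mean_t = sum(times) / len(times)
            # relative spread of slave times, used here as the imbalance measure
            if (max(times) - min(times)) / mean_t < tol:
                break                                  # balance reached
            shares = update_shares(times, nel)         # new required shares shEl_s
            total = sum(nel)
            # surplus/deficit elements per subdomain (cf. the ΔNEls column)
            deltas = [n - round(sh * total) for n, sh in zip(nel, shares)]
            nel = rearrange_elements(nel, deltas)      # reshape the subdomains
        return nel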
6. Conclusion

Several issues regarding parallel computations on a system of heterogeneous distributed computers were addressed in the paper, the parallelization being based on the domain decomposition approach and the finite element substructuring method. Although different standard benchmarks were performed to characterize the computational performance of the heterogeneous components, this characterization proved, though qualitatively correct, insufficient for a realistic description, in particular with respect to the large finite element systems considered. By analysing the structure of the substructuring method and identifying the parameters that most strongly affect the speed of computation, we concluded that specific self-adjusting algorithms should be developed to cope with the problem, irrespective of the problem size and of the heterogeneity of the computer system.

To optimize the parallel computational performance by an even distribution of the work among the processors, proper load balancing is needed. A load balancing algorithm, based on the assumption that the speed of computation is preserved in consecutive iterations, was developed. Irrespective of the subdomain sizes and processor performance, it determines, from the computing times realized in the previous computing step, the shares needed for a balanced computation. The efficiency of the algorithm is first demonstrated on cases allowing structured domain decomposition. When the algorithm is used in connection with free domain decomposition, based on the METIS partitioning program, several deficiencies of the latter tend to spoil the efficiency of the developed load balancing algorithm. Therefore, an algorithm for reshaping the existing subdomains in accordance with the required shares was developed. The rearrangement of elements between subdomains is performed in such a way that the assumptions used in the derivation of the load balancing algorithm are respected as much as possible; this includes, above all, smooth shape variations and minimization of the number of interface nodes.

Since in the majority of non-linear solvers the linear solution step is almost always the predominant source of CPU time consumption, the adopted approach and the developed algorithms cope quite well with this class of problems as well. In fact, when the whole investigated domain is subject to non-linear effects, as in metal forming processes characterized by large plastic deformation throughout the entire domain, the parallel computational efficiency is not expected to deteriorate considerably: the load balancing, which relies on the time needed for the formation of the respective subproblem matrices and the corresponding computations prior to the solution of the interface problem, will perform quite well in computations where a small number of iterations suffices to attain convergence within an incremental step. A substantial increase in the number of iterations would, however, degrade the parallel computational efficiency; to what extent depends on the properties of the considered problem. In such cases, if appropriate, the basis on which the load balance is performed can be changed at will, which again can be handled by the same developed algorithms. If, on the contrary, only a minor part of the domain exhibits a non-linear response, as in elasto-plastic problems where only a part of the whole domain is plastically deformed, the components allocated to the subdomains exhibiting elastic behaviour will inevitably be idle, while the components performing the needed recalculations on the plastic subdomains remain busy. Rearranging the established domain decomposition in such a situation is not justified, since it would spoil the performance of that part of the computation, i.e. the solution of the system of equations, which dominates the computing time.

Both developed algorithms enable efficient parallel computing on heterogeneous distributed systems. Moreover, it is not necessary to know the performance characteristics of computers newly added to the existing distributed system, because the load balancing algorithm adapts the allocated work to the available computational power directly during the computation.
References

[1] Bittnar Z, Kruis J, Nemeček J, Patzak B, Rypl D. Parallel and distributed computations for structural mechanics: a review. In: Topping B, editor. Civil and structural engineering computing. Stirling, UK: Civil-Comp Press; 2001.
[2] Keyes DE. Domain decomposition in the mainstream of computational science. Proceedings of the 14th International Conference on Domain Decomposition Methods, Cocoyoc, Mexico; 2002.
[3] Doltsinis IS, Nolting S. Generation and decomposition of finite element models for parallel computations. Comput Syst Engng 1991;2(5/6):427–49.
[4] Farhat C, Wilson E, Powell G. Solution of finite element systems on concurrent processing computers. Engng Comput 1987;2(3):157–65.
[5] Kocak S, Akay H. Parallel Schur complement method for large-scale systems on distributed memory computers. Appl Math Modell 2001;25:873–86.
[6] Roux F, Farhat C. Parallel implementation of the two-level FETI method. Proceedings of the 11th International Conference on Domain Decomposition Methods, Greenwich, England; 1998.
[7] Hriberšek M, Kuhn G. Domain decomposition methods for conjugate heat transfer computations by BEM. Fifth World Congress on Computational Mechanics, Vienna, Austria; 2002.
[8] Khan AI, Topping BHV. Parallel finite element analysis using Jacobi-conditioned conjugate gradient algorithm. Adv Engng Software 1996;25:309–19.
[9] Silva LM. Number-crunching with Java applications. Proc Int Conf Parallel Distrib Process Tech Appl 1998;379–85.
[10] Labriaga R, Williams D. A load balancing technique for heterogeneous distributed networks. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications; 2000.
[11] Bohn CA, Lamont GB. Load balancing for heterogeneous clusters of PCs. Future Gener Comput Syst 2002;18:389–400.
[12] Darcan O, Kaylan R. Load balanced implementation of standard clock method. Simul Practice Theor 2000;8:177–99.
[13] Diekmann R, Preis R, Schlimbach F, Walshaw C. Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM. Parallel Comput 2000;26:1555–81.
[14] Decker T, Luling R, Tschoke S. A distributed load balancing algorithm for heterogeneous parallel computing system. Proc Int Conf Parallel Distrib Process Tech Appl 1998;933–40.
[15] Lingen FJ. A versatile load balancing framework for parallel applications based on domain decomposition. Int J Numer Meth Engng 2000;49:1431–54.
[16] Karypis G, Kumar V. A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis; 1998. URL: http://www.cs.umn.edu/~karypis.
[17] Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R, Sunderam V. PVM: parallel virtual machine, a users' guide and tutorial for networked parallel computing. Cambridge: The MIT Press; 1994.
[18] Double Precision Whetstone, ORNL. URL: http://www.netlib.org/benchmark/whetstoned.
[19] Dongarra J. Performance of various computers using standard linear equations software in a Fortran environment, ORNL, updated periodically. URL: http://www.netlib.org/benchmark/performance.ps.
[20] Gustafson JL, Snell QO. HINT: a new way to measure computer performance. Scalable Computing Laboratory, Ames Laboratory. URL: http://www.scl.ameslab.gov/Projects/HINT/index.html.