Parallel Computing 32 (2006) 195–204 www.elsevier.com/locate/parco
Efficient algorithms for estimating the general linear model

Petko Yanev a,*, Erricos John Kontoghiorghes b,c

a IRISA/INRIA – Rennes, Campus de Beaulieu, 35042 Rennes cedex, France
b Department of Public and Business Administration, University of Cyprus, P.O. Box 20537, CY-1678 Nicosia, Cyprus
c School of Computer Science and Information Systems, Birkbeck College, University of London, Malet Street, London WC1E 7HX, UK

This work is in part supported by the Swiss National Foundation Grants 2000-061875.00/1 and 101412-105978. Part of the experiments were performed while the first author was visiting the University of Edinburgh under the support of the HPC-Europa project.
* Corresponding author. E-mail addresses: [email protected] (P. Yanev), [email protected] (E.J. Kontoghiorghes).
Received 9 December 2004; received in revised form 7 June 2005; accepted 21 June 2005. Available online 28 November 2005.
Abstract

Computationally efficient serial and parallel algorithms for estimating the general linear model are proposed. The sequential block-recursive algorithm is an adaptation of a known Givens strategy that has the generalized QR decomposition as its main component. The proposed algorithm is based on orthogonal transformations and exploits the triangular structure of the Cholesky factor of the variance–covariance matrix. Specifically, it computes the estimator of the general linear model by recursively solving a series of smaller and smaller generalized linear least squares problems. The new algorithm is found to outperform significantly the corresponding LAPACK routine. A parallel version of the new sequential algorithm, which utilizes an efficient distribution of the matrices over the processors and has low inter-processor communication, is developed. The theoretical computational complexities of the algorithms are derived and analyzed. Experimental results are presented which confirm the theoretical analysis. The parallel strategy is found to be scalable and highly efficient for solving large-scale general linear estimation problems.
© 2005 Elsevier B.V. All rights reserved.

Keywords: General linear model; Generalized QR decomposition; Parallel algorithms
1. Introduction

The estimation of the general linear model (GLM) is a well-studied problem in statistics [17–19]. It arises in diverse areas such as econometrics, genetics and engineering, to name but a few [5,15,16,21]. Emphasis is given to the properties of the estimator rather than to the computational merits of the estimation procedures. The GLM is written as

    y = X\beta + \varepsilon, \qquad \varepsilon \sim (0, \sigma^2 \Omega),                      (1)
where $y \in \mathbb{R}^m$ is the response vector, $X \in \mathbb{R}^{m \times n}$ is the full-rank exogenous data matrix, $\beta \in \mathbb{R}^n$ are the coefficients to be estimated and $\varepsilon \in \mathbb{R}^m$ is the noise with zero mean and variance–covariance matrix $\sigma^2\Omega$. It is assumed that $\Omega$ has full rank. The best linear unbiased estimator (BLUE) of $\beta$, say $\hat\beta$, is obtained from the solution of the normal equations

    X^T \Omega^{-1} X \hat\beta = X^T \Omega^{-1} y.

That is,

    \hat\beta = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y.                                      (2)

The derivation of $\hat\beta$ using (2) is computationally expensive and numerically unstable when $\Omega$ is ill-conditioned. This can be avoided by reformulating the estimation of the GLM as the generalized linear least squares problem (GLLSP)

    \mathop{\mathrm{argmin}}_{u,\beta} \; u^T u \quad \text{subject to} \quad y = X\beta + Cu.    (3)
Here $\Omega = CC^T$, $C \in \mathbb{R}^{m \times m}$, and $u$ is a random vector defined by $Cu = \varepsilon$, i.e. $u \sim (0, \sigma^2 I_m)$. The matrix $C$ is the triangular Cholesky factor of the variance–covariance matrix $\Omega$. In many applications the triangular $C$ is known rather than $\Omega$ [6,10–12,15]. Hereafter it is assumed that $C$ is available and has an upper-triangular form. The GLLSP can be solved using the generalized QR decomposition (GQRD) of $\tilde X \equiv (X \;\; y)$ and $C$:

    Q^T \tilde X = Q^T (X \;\; y) =
    \begin{pmatrix} R & \tilde y \\ 0 & \eta \\ 0 & 0 \end{pmatrix}
    \begin{matrix} n \\ 1 \\ m-n-1 \end{matrix}                                                  (4a)

and

    Q^T C P = U =
    \begin{pmatrix} U_{1,1} & r & U_{1,2} \\ 0 & \delta & g \\ 0 & 0 & U_{2,2} \end{pmatrix}
    \begin{matrix} n \\ 1 \\ m-n-1 \end{matrix}                                                  (4b)

Here $R$ and $U$ are upper-triangular matrices of orders $n$ and $m$, respectively, and $Q, P \in \mathbb{R}^{m \times m}$ are orthogonal [2,4]. The GLLSP (3) is equivalent to

    \mathop{\mathrm{argmin}}_{u,\beta} \; \|P^T u\|^2 \quad \text{subject to} \quad
    Q^T y = Q^T X \beta + Q^T C P P^T u,
which can also be written as

    \mathop{\mathrm{argmin}}_{v_1, t, v_2, \beta} \; \bigl( \|v_1\|^2 + t^2 + \|v_2\|^2 \bigr)
    \quad \text{subject to} \quad
    \begin{cases}
      \tilde y = R\beta + U_{1,1} v_1 + r t + U_{1,2} v_2, \\
      \eta = \delta t + g v_2, \\
      0 = U_{2,2} v_2,
    \end{cases}                                                                                  (5)

where the vector $u^T P$ is partitioned as $(v_1^T \;\; t \;\; v_2^T)$ and $\|\cdot\|$ denotes the Euclidean norm. The values $v_2 = 0$ and $t = \eta/\delta$ can be derived from the last two constraints in (5). Then, setting $v_1 = 0$, the BLUE of $\beta$ is obtained by solving

    R\beta = \tilde y - \eta r / \delta.                                                         (6)
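The reduction (3)–(6) can be made concrete with a small NumPy/SciPy sketch. It is an illustrative prototype only (the function names are ours): it forms $Q$ explicitly, uses library QR/RQ factorizations and ignores the triangular structure of $C$, which is exactly the structure the algorithms of Sections 2 and 3 exploit; the result is cross-checked against the normal-equations formula (2).

    import numpy as np
    from scipy.linalg import qr, rq, solve_triangular

    def blue_via_gqrd(X, y, C):
        """Solve the GLLSP (3) through the GQRD (4a)-(4b) and the reduction (5)-(6)."""
        m, n = X.shape
        # (4a): QRD of the augmented matrix (X y); rows n+2,...,m of Q^T(X y) vanish
        Q, T = qr(np.column_stack([X, y]))
        R, ytilde, eta = T[:n, :n], T[:n, n], T[n, n]
        # (4b): RQ decomposition Q^T C = U P^T with U upper triangular of order m
        U, Pt = rq(Q.T @ C)
        r, delta = U[:n, n], U[n, n]
        # (5): v2 = 0 and t = eta/delta; setting v1 = 0 gives (6)
        return solve_triangular(R, ytilde - (eta / delta) * r)

    def blue_via_normal_equations(X, y, Omega):
        """Direct evaluation of (2); the expensive and potentially unstable baseline."""
        Oi = np.linalg.inv(Omega)
        return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        m, n = 60, 5
        X = rng.standard_normal((m, n))
        beta = rng.standard_normal(n)
        C = np.triu(rng.standard_normal((m, m))) + m * np.eye(m)   # upper-triangular Cholesky factor
        y = X @ beta + C @ rng.standard_normal(m)
        print(np.allclose(blue_via_gqrd(X, y, C), blue_via_normal_equations(X, y, C @ C.T)))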
The LAPACK routine DGGGLM, which computes the estimator of the GLM, makes no assumptions on the structure of $C$. However, the commonly available triangular structure of $C$ facilitates the development of efficient algorithms for solving the GLLSP [7,12–14]. In this work block-recursive sequential and parallel strategies which exploit the structure of the matrices are proposed. The new methods are rich in BLAS-3 operations and solve a series of reduced-size GLLSPs. The algorithms have been implemented on a 32-CPU IBM p690+ high-end compute node with 27 GB of distributed memory. This parallel machine performs at 6.8 Gflop/s, has a bandwidth of 3077 Mbyte/s and a communication latency of 1.92 μs. The communications between the processors are realized using the MPI library. The
performance of the algorithms has been evaluated experimentally. In addition, the theoretical complexities of the algorithms in number of flops (floating point operations) are also presented. In Section 2 the serial block-Givens strategy is described and compared with the existing LAPACK routine for estimating the GLM. In Section 3 the parallel algorithm is considered and the theoretical and computational results are presented. In Section 4 some conclusions are drawn. The Appendix shows in detail the derivation of the complexities of the various algorithms.

2. Serial block Givens strategy

An efficient sequential algorithm based on Givens rotations for estimating the GLM has been proposed by Paige [12]. Here a block version based on this sequential strategy is investigated [8,22]. Consider the GLLSP (3), where the matrices $\tilde X \equiv (X \;\; y)$ and $C$ are partitioned, respectively, as

    \tilde X =
    \begin{pmatrix} X_1 & y_1 \\ X_2 & y_2 \\ \vdots & \vdots \\ X_k & y_k \end{pmatrix}
    \quad \text{and} \quad
    C =
    \begin{pmatrix} C_{1,1} & C_{1,2} & \cdots & C_{1,k} \\ & C_{2,2} & \cdots & C_{2,k} \\ & & \ddots & \vdots \\ & & & C_{k,k} \end{pmatrix}.                                                  (7)
Here $C_{i,i}$ ($i = 1, \ldots, k$) are upper triangular where, for simplicity, it is assumed that $m = kn$. The sequential block algorithm computes the solution of the GLLSP (3) in $k$ steps. The GQRD (4a) is computed during the first $(k-1)$ steps. For $j = k - i$, the $i$th ($i = 1, \ldots, k-1$) step computes the smaller GQRD of

    \begin{pmatrix} X_j & y_j^{(i)} \\ \tilde X_{j+1} & y_{j+1}^{(i)} \end{pmatrix}
    \quad \text{and} \quad
    \begin{pmatrix} C_{j,j} & \tilde C_{j,j+1} \\ 0 & \tilde C_{j+1,j+1} \end{pmatrix},           (8)
where $y_{k-1}^{(1)} = y_{k-1}$, $y_k^{(1)} = y_k$, $\tilde X_k = X_k$, $\tilde C_{k-1,k} = C_{k-1,k}$ and $\tilde C_{k,k} = C_{k,k}$. That is, initially the QRD of the first matrix in (8) is computed by

    Q_i^T \begin{pmatrix} X_j & y_j^{(i)} \\ \tilde X_{j+1} & y_{j+1}^{(i)} \end{pmatrix}
    = \begin{pmatrix} \tilde X_j & \tilde y_j^{(i)} \\ 0 & \eta_i \\ 0 & 0 \end{pmatrix}
    \begin{matrix} n \\ 1 \\ n-1 \end{matrix}                                                     (9)

where $Q_i \in \mathbb{R}^{2n \times 2n}$ is orthogonal and $\tilde X_j$ is upper triangular. Then, the orthogonal matrix $Q_i^T$ is applied from the left of the second matrix in (8), which is then re-triangularized from the right, i.e.

    Q_i^T \begin{pmatrix} C_{j,j} & \tilde C_{j,j+1} \\ 0 & \tilde C_{j+1,j+1} \end{pmatrix} P_i
    = \begin{pmatrix} \tilde C_{j,j} & r_j^{(i)} & \bar C_{j,j+1} \\ 0 & \delta_i & g_i \\ 0 & 0 & \bar C_{j+1,j+1} \end{pmatrix}
    \begin{matrix} n \\ 1 \\ n-1 \end{matrix}                                                     (10)

Here $P_i$ is orthogonal and of order $2n$; $\tilde C_{j,j}$ and $\bar C_{j+1,j+1}$ are upper triangular. Once the $i$th GQRD of (8) has been computed, $P_i$ is applied from the right of the affected $j$th and $(j+1)$th block-columns of $C$, i.e. the product

    \bigl( \tilde C_{1:j-1,j} \;\; r_{1:j-1}^{(i)} \;\; \bar C_{1:j-1,j+1} \bigr)
    = \bigl( C_{1:j-1,j} \;\; C_{1:j-1,j+1} \bigr) P_i                                            (11)
is computed. Note that this is the most time-consuming task in each step of the computation of the GQRD (4a), especially when $k$ is large, i.e. when $m \gg n$. Now let $(u_{i-1}^T \;\; \tilde u_{i-1}^T) P_i$ be partitioned conformably as $(u_i^T \;\; \tilde u_i^T \;\; t_i \;\; \bar u_i^T)$, where $(u_0^T \;\; \tilde u_0^T) \equiv u^T$. The GLLSP (3) after the $i$th step of computing the GQRD (4a) can be written as

    \mathop{\mathrm{argmin}}_{u_i, \tilde u_i, t_i, \bar u_i, \beta}
    \bigl( \|u_i\|^2 + \|\tilde u_i\|^2 + t_i^2 + \|\bar u_i\|^2 \bigr)
    \quad \text{subject to}

    \begin{pmatrix} y_{1:j-1}^{(i)} \\ \tilde y_j^{(i)} \\ \eta_i \\ 0 \end{pmatrix}
    =
    \begin{pmatrix} X_{1:j-1} \\ \tilde X_j \\ 0 \\ 0 \end{pmatrix} \beta
    +
    \begin{pmatrix}
      C_{1:j-1,1:j-1} & \tilde C_{1:j-1,j} & r_{1:j-1}^{(i)} & \bar C_{1:j-1,j+1} \\
      0 & \tilde C_{j,j} & r_j^{(i)} & \bar C_{j,j+1} \\
      0 & 0 & \delta_i & g_i \\
      0 & 0 & 0 & \bar C_{j+1,j+1}
    \end{pmatrix}
    \begin{pmatrix} u_i \\ \tilde u_i \\ t_i \\ \bar u_i \end{pmatrix}.

From this it follows that $\bar C_{j+1,j+1} \bar u_i = 0$, i.e. $\bar u_i = 0$, and $\eta_i = \delta_i t_i + g_i \bar u_i$, i.e. $t_i = \eta_i / \delta_i$. Thus, the GLLSP (3) is equivalent to the reduced GLLSP

    \mathop{\mathrm{argmin}}_{u_i, \tilde u_i, \beta} \bigl( \|u_i\|^2 + \|\tilde u_i\|^2 \bigr)
    \quad \text{s.t.} \quad
    \begin{pmatrix} y_{1:j-1}^{(i+1)} \\ y_j^{(i+1)} \end{pmatrix}
    =
    \begin{pmatrix} X_{1:j-1} \\ \tilde X_j \end{pmatrix} \beta
    +
    \begin{pmatrix} C_{1:j-1,1:j-1} & \tilde C_{1:j-1,j} \\ 0 & \tilde C_{j,j} \end{pmatrix}
    \begin{pmatrix} u_i \\ \tilde u_i \end{pmatrix},

where
    \begin{pmatrix} y_{1:j-1}^{(i+1)} \\ y_j^{(i+1)} \end{pmatrix}
    =
    \begin{pmatrix} y_{1:j-1}^{(i)} \\ \tilde y_j^{(i)} \end{pmatrix}
    - t_i
    \begin{pmatrix} r_{1:j-1}^{(i)} \\ r_j^{(i)} \end{pmatrix}.                                   (12)
Eq. (12) shows the size-reduction of the GLLSP after each step of the computation. Following the completion of the $(k-1)$th step, the final, smallest GLLSP has the form

    \mathop{\mathrm{argmin}}_{\tilde u_{k-1}, t_{k-1}, \beta}
    \bigl( \|\tilde u_{k-1}\|^2 + \|t_{k-1}\|^2 \bigr)
    \quad \text{s.t.} \quad
    \begin{pmatrix} \tilde y_1^{(k-1)} \\ \eta_{k-1} \end{pmatrix}
    =
    \begin{pmatrix} \tilde X_1 \\ 0 \end{pmatrix} \beta
    +
    \begin{pmatrix} \tilde C_{1,1} & r_1^{(k-1)} \\ 0 & \delta_{k-1} \end{pmatrix}
    \begin{pmatrix} \tilde u_{k-1} \\ t_{k-1} \end{pmatrix}

or

    \mathop{\mathrm{argmin}}_{\tilde u_{k-1}, \beta} \|\tilde u_{k-1}\|^2
    \quad \text{s.t.} \quad
    y_1^{(k)} = \tilde X_1 \beta + \tilde C_{1,1} \tilde u_{k-1},
where $\tilde X_1$ and $\tilde C_{1,1}$ are upper triangular and are derived from the QRD (9) and the RQ decomposition (RQD) (10), respectively, and $y_1^{(k)} = \tilde y_1^{(k-1)} - \eta_{k-1} r_1^{(k-1)} / \delta_{k-1}$. Thus, the $k$th step computes the BLUE of $\beta$ by setting $\tilde u_{k-1} = 0$ and solving the upper-triangular system $\tilde X_1 \beta = y_1^{(k)}$.

Algorithm 1. The sequential block Givens algorithm for solving the GLLSP (3).
 1: Let $\tilde X$ and $C$ in (4a) be partitioned as in (7), where $m = kn$.
 2: Let $y_{k-1}^{(1)} = y_{k-1}$, $y_k^{(1)} = y_k$, $\tilde X_k = X_k$, $\tilde C_{k-1,k} = C_{k-1,k}$ and $\tilde C_{k,k} = C_{k,k}$.
 3: for $i = 1, \ldots, k-1$ do
 4:   Set $j := k - i$
 5:   Compute the GQRD of $\begin{pmatrix} X_j & y_j^{(i)} \\ \tilde X_{j+1} & y_{j+1}^{(i)} \end{pmatrix}$ and $\begin{pmatrix} C_{j,j} & \tilde C_{j,j+1} \\ 0 & \tilde C_{j+1,j+1} \end{pmatrix}$ as in (9) and (10)
 6:   if $i \neq k-1$ then
 7:     Compute $\bigl( \tilde C_{1:j-1,j} \;\; r_{1:j-1}^{(i)} \;\; \bar C_{1:j-1,j+1} \bigr) = \bigl( C_{1:j-1,j} \;\; C_{1:j-1,j+1} \bigr) P_i$
 8:   end if
 9:   Update the vector $y$: $\begin{pmatrix} y_{1:j-1}^{(i+1)} \\ y_j^{(i+1)} \end{pmatrix} = \begin{pmatrix} y_{1:j-1}^{(i)} \\ \tilde y_j^{(i)} \end{pmatrix} - \dfrac{\eta_i}{\delta_i} \begin{pmatrix} r_{1:j-1}^{(i)} \\ r_j^{(i)} \end{pmatrix}$
10: end for
11: Solve $\tilde X_1 \beta = y_1^{(k)}$
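The recursion of Algorithm 1 can be traced with the following NumPy/SciPy sketch. It is a minimal dense prototype under the assumption $m = kn$: the QRD (9) and the RQD (10) are carried out with library factorizations rather than the blocked Householder kernels used in the experiments, and the function name and index bookkeeping are ours, so it illustrates the data flow of lines 3–11 rather than the performance of the implementation. On small random test problems it agrees, up to rounding, with the GQRD sketch given after (6).

    import numpy as np
    from scipy.linalg import qr, rq, solve_triangular

    def blue_block_recursive(X, y, C, n):
        """Trace of Algorithm 1 for m = k*n; C is the upper-triangular Cholesky factor."""
        X, y, C = X.astype(float).copy(), y.astype(float).copy(), C.astype(float).copy()
        m = X.shape[0]
        while m > n:                                   # steps i = 1, ..., k-1
            j = m - 2 * n                              # first row of the trailing 2n-row window
            # (9): QRD of the last two block rows of (X y)
            Q, T = qr(np.column_stack([X[j:m], y[j:m]]))
            X[j:m], y[j:m], eta = T[:, :n], T[:, n], T[n, n]
            # (10): apply Q^T to the trailing 2n x 2n block of C, re-triangularize from the right
            U, Prq = rq(Q.T @ C[j:m, j:m])             # Q^T C_blk = U Prq, so P_i = Prq^T
            C[j:m, j:m] = U
            delta, r_j = U[n, n], U[:n, n]
            t = eta / delta                            # t_i = eta_i / delta_i
            y[j:j+n] -= t * r_j
            if j > 0:                                  # lines 7 and 9: update the blocks above
                B = C[:j, j:m] @ Prq.T                 # the product (11)
                C[:j, j:m] = B
                y[:j] -= t * B[:, n]                   # r^{(i)}_{1:j-1} is column n of the product
            m -= n                                     # recurse on the reduced GLLSP (12)
        return solve_triangular(X[:n], y[:n])          # line 11: solve X~_1 beta = y_1^{(k)}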
Table 1
Execution times (s) and theoretical results of Algorithm 1 (BG) and LAPACK (LP)

n            25                          50                          100
k            128     256     384         64      128     192         32      64      96
LP           25.77   213.35  601.58      26.25   216.62  603.03      27.77   223.35  605.58
BG           0.76    2.88    6.59        1.44    5.32    9.71        2.62    10.56   19.04
LP/BG        33.91   74.08   91.29       18.23   40.72   62.10       10.60   21.15   31.81
T_LP/T_BG    42.69   85.35   128.01      21.38   42.69   64.02       10.77   21.39   32.04
Algorithm 1 summarizes the steps of this block Givens strategy for estimating the GLM. For the factorizations (9) and (10) Householder transformations are employed. The orthogonal matrices $Q_i$ and $P_i$ ($i = 1, \ldots, k-1$) are not explicitly constructed. The theoretical complexity (see Appendix) of this algorithm is given by

    T_{BG}(m, n) \approx 4m^2 n + 14mn^2 - 18n^3.                                                (13)

Table 1 shows the execution times in seconds and the theoretical complexities in number of flops of the block Givens Algorithm 1 (BG) and the LAPACK (LP) routine DGGGLM, which estimates the GLM, for some values of $n$ and $k$, where $m = kn$ [1]. The theoretical complexity of LAPACK is given by

    T_{LP}(m, n) \approx (4m^3 + 12m^2 n - 2n^3)/3.                                              (14)
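For $m \gg n$ the speed-up quoted below follows directly from the leading terms of (14) and (13):

    \frac{T_{LP}(m,n)}{T_{BG}(m,n)} \;\approx\; \frac{4m^3/3}{4m^2 n} \;=\; \frac{m}{3n}.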
Note that, theoretically, Algorithm 1 is approximately $m/(3n)$ times faster than the LAPACK routine, which is confirmed by the experimental results. This improvement is due to the fact that Algorithm 1 exploits the triangular structure of the large and computationally expensive matrix $C$ in (3), while the LAPACK routine assumes that $C$ is dense. The experimental results (LP/BG) confirm the theoretical ones (T_LP/T_BG). There is a negligible discrepancy between the two ratios when $k$ is big and $n$ is relatively much smaller. This is due to the increasing overheads which occur from the frequent data exchanges of the submatrices in (7).

3. Parallel algorithm

The computation of the product (11) is the most time-consuming task in Algorithm 1. This cost can be reduced by applying the orthogonal matrices $P_i$ ($i = 1, \ldots, k-1$) to the block columns of $C$ (see line 7 of Algorithm 1) in parallel. An efficient parallel algorithm requires a load-balanced distribution of the matrices over the processors and low inter-processor communication [23]. Let $p$ denote the number of processors and assume that $(k-2)$ is a multiple of $p$, where $m = kn$.

Consider the partitioning of the matrices $\tilde X$ and $C$ as in (7). To achieve low inter-processor communication, the GQRDs computed in line 5 of Algorithm 1 are executed simultaneously by all processors. That is, the data matrix $X$, the last two block rows of $y$ and the main, sub- and super-block diagonals of $C$ are duplicated on each processor. The remaining $(k-2)$ block-rows of $C$ and $y$ are allocated to the processors using a block-row cyclic distribution. Specifically, $C_{i,:}$ ($i = k-2, \ldots, 1$) is allocated to the processor $P_{k_i}$, where $k_i = p - (i-1) \bmod p$. The same distribution scheme is used for the vector $y$. This distribution will result in the processor $P_j$ ($j = 1, \ldots, p$) being allocated the vector

    c^{(j)} = \bigl( y_{p+1-j}^T \;\; y_{2p+1-j}^T \;\; \cdots \;\; y_{k-1-j}^T \bigr)^T          (15)

and the matrix

    C^{(j)} = \bigl( C_{p+1-j,:}^T \;\; C_{2p+1-j,:}^T \;\; \cdots \;\; C_{k-1-j,:}^T \bigr)^T,   (16)
where $c^{(j)} \in \mathbb{R}^{n(k-2)/p}$ and $C^{(j)} \in \mathbb{R}^{n(k-2)/p \times m}$. Fig. 1 shows the distribution of the matrices $\tilde X$ and $C$ over the processors, with $p = 4$ and $k = 18$. The shaded blocks indicate those copied to all processors. The blank, unshaded blocks are null and are unaffected during the computation.
Fig. 1. The row-block cyclic distribution of the matrices on 4 processors, when k = 18.
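The cyclic allocation just described can be checked with a few lines of Python (the helper names are ours, not part of the paper); for the $p = 4$, $k = 18$ configuration of Fig. 1 it lists the block rows held by each processor.

    def owner_of_block_row(i, p):
        # processor index k_i = p - (i - 1) mod p owning block row C_{i,:} (and y_i), 1 <= i <= k-2
        return p - (i - 1) % p

    def block_rows_on_processor(j, k, p):
        # block-row indices p+1-j, 2p+1-j, ..., k-1-j stored on processor P_j, as in (15)-(16)
        return list(range(p + 1 - j, k - 1, p))

    p, k = 4, 18                                # the configuration of Fig. 1
    for j in range(1, p + 1):
        rows = block_rows_on_processor(j, k, p)
        assert all(owner_of_block_row(i, p) == j for i in rows)
        print(f"P_{j}: block rows {rows}")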
The parallel algorithm solves the GLLSP (3) in $k$ steps. During the first $(k-1)$ steps, all processors initially compute the same factorizations (9) and (10). Then each processor updates its allocated submatrix $C^{(j)}$ and subvector $c^{(j)}$. Note that, at the $i$th step, each processor updates $\lceil q/p \rceil$ blocks, where $q = k - i - 1$. Thus, the processors have equal computational loads. When the local computations have been completed, one processor, say $P_j$, sends one block from $C^{(j)}$ and $c^{(j)}$, which are required for the next step, to the other processors $P_r$ ($r = 1, \ldots, p$ and $r \neq j$). That is, at each step only one processor broadcasts an $n \times n$ submatrix and an $n$-element subvector. This broadcast acts as a barrier-synchronization point for the processors before the next step commences. Note that the duplication of the computations results in a one-to-all broadcast, which has a low communication cost compared to the extremely expensive all-to-all broadcast. The latter, i.e. the all-to-all broadcast, is required when the computations are not duplicated and consequently results in a highly inefficient algorithm. The parallel strategy is summarized in Algorithm 2. The broadcast performed by processor $P_j$ is shown in lines 13–17 of the parallel algorithm.
Algorithm 2. The parallel algorithm for solving the GLLSP (3) on p processors.
 1: Let $\tilde X$ and $C$ be partitioned as in (7), where $m = kn$ and $(k-2)$ is a multiple of $p$.
 2: Allocate $X$, $y_{k-1}$, $y_k$, $C_{k,k}$, $C_{i,i}$ and $C_{i,i+1}$ ($i = 1, \ldots, k-1$) to all processors.
 3: Allocate $c^{(j)}$ and $C^{(j)}$ as in (15) and (16), respectively, to processor $P_j$ ($j = 1, \ldots, p$).
 4: each processor $P_j$ ($j = 1, \ldots, p$) do in parallel:
 5: for $i = 1, \ldots, k-1$ do
 6:   Set $t := k - i$
 7:   Compute the GQRD of $\begin{pmatrix} X_t & y_t^{(i)} \\ \tilde X_{t+1} & y_{t+1}^{(i)} \end{pmatrix}$ and $\begin{pmatrix} C_{t,t} & \tilde C_{t,t+1} \\ 0 & \tilde C_{t+1,t+1} \end{pmatrix}$ as in (9) and (10).
 8:   Compute $y_{k-i}^{(i+1)} = \tilde y_{k-i}^{(i)} - (\eta_i / \delta_i)\, r_{k-i}^{(i)}$.
 9:   if $i \neq k-1$ then
10:     Set $q := k - i - 1$
11:     Compute $C^{(j)}_{1:n\lceil q/p \rceil,\, nq+1:nq+2n} = C^{(j)}_{1:n\lceil q/p \rceil,\, nq+1:nq+2n} P_i$.
12:     Compute $c^{(j)}_{1:n\lceil q/p \rceil} = c^{(j)}_{1:n\lceil q/p \rceil} - t_i\, C^{(j)}_{1:n\lceil q/p \rceil,\, nq+n+1}$.
13:     if $j = (i-1) \bmod p + 1$ then
14:       Send $y_{t-1}^{(i+1)}$ and $\tilde C_{t-1,t}$ to $P_r$, where $r = 1, \ldots, p$ and $r \neq j$.
15:     else
16:       Receive $y_{t-1}^{(i+1)}$ and $\tilde C_{t-1,t}$ from $P_r$, where $r = (i-1) \bmod p + 1$.
17:     end if
18:   end if
19: end for
20: Solve $\tilde X_1 \beta = y_1^{(k)}$
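The communication structure of lines 13–17 can be traced with the following mpi4py sketch. It is a skeleton under stated assumptions rather than the implementation used in the experiments: mpi4py is assumed to be available, the numerical work of lines 7–12 is elided, and the payloads are dummy arrays standing in for the $n \times n$ block and the $n$-element subvector that are broadcast at each step.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p = comm.Get_size()
    j = comm.Get_rank() + 1              # 1-based processor index, P_1, ..., P_p

    n = 4
    k = 3 * p + 2                        # toy sizes with (k - 2) a multiple of p

    for i in range(1, k):                # the k-1 reduction steps of Algorithm 2
        # lines 7-12: every P_j recomputes the same 2n x 2n GQRD and updates its
        # locally stored block rows of C and y (numerical work omitted in this sketch)
        if i != k - 1:                   # line 9: no broadcast after the last reduction step
            owner = (i - 1) % p + 1      # line 13: P_j with j = (i - 1) mod p + 1 holds the next block
            block = np.full((n, n), float(i)) if j == owner else None
            piece = np.full(n, float(i)) if j == owner else None
            block = comm.bcast(block, root=owner - 1)    # lines 14/16: one-to-all broadcast
            piece = comm.bcast(piece, root=owner - 1)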
Table 2
Execution times (s) and efficiency of Algorithm 2, where l = m/n - 2; theoretical efficiencies in brackets

                         2 Processors      4 Processors      8 Processors      16 Processors     32 Processors
l      n    Serial       Time   Eff.       Time   Eff.       Time   Eff.       Time   Eff.       Time   Eff.
128    25     0.76       0.41   .93 (.95)  0.24   .79 (.86)  0.16   .59 (.72)  0.13   .37 (.54)  0.19   .13 (.37)
256    25     2.88       1.54   .94 (.97)  0.82   .88 (.92)  0.50   .72 (.83)  0.36   .50 (.70)  0.38   .24 (.53)
384    25     6.59       3.47   .95 (.98)  1.75   .94 (.95)  1.08   .76 (.88)  0.63   .65 (.78)  0.72   .29 (.63)
512    25    11.95       6.09   .98 (.99)  3.17   .94 (.96)  1.81   .83 (.91)  1.12   .67 (.82)  1.01   .37 (.69)
640    25    18.77       9.51   .99 (.99)  4.82   .97 (.97)  2.69   .87 (.92)  1.50   .78 (.85)  1.28   .46 (.74)
768    25    26.49      13.54   .98 (.99)  6.82   .97 (.97)  3.75   .88 (.94)  2.02   .82 (.87)  1.61   .51 (.77)
896    25    36.23      18.51   .98 (.99)  9.35   .97 (.98)  4.96   .91 (.95)  2.65   .85 (.89)  1.94   .58 (.80)
 64    50     1.44       0.79   .91 (.90)  0.49   .73 (.76)  0.36   .50 (.57)  0.31   .29 (.39)  0.34   .13 (.23)
128    50     5.32       2.83   .94 (.95)  1.58   .84 (.86)  0.97   .69 (.72)  0.77   .43 (.54)  0.86   .19 (.37)
192    50     9.71       5.18   .94 (.96)  2.76   .88 (.90)  1.88   .65 (.79)  1.26   .48 (.64)  1.27   .24 (.36)
256    50    17.47       9.09   .96 (.97)  4.80   .91 (.92)  2.82   .77 (.83)  1.97   .55 (.70)  1.70   .32 (.53)
320    50    27.13      14.14   .96 (.98)  7.45   .91 (.94)  4.22   .80 (.86)  2.76   .61 (.74)  2.12   .40 (.58)
384    50    39.35      20.07   .98 (.98) 10.76   .91 (.95)  5.95   .83 (.88)  3.78   .65 (.78)  2.61   .47 (.63)
448    50    53.71      27.38   .98 (.98) 14.34   .94 (.95)  7.83   .86 (.90)  4.62   .73 (.80)  3.04   .55 (.66)
 32   100     2.62       1.61   .81 (.84)  1.11   .59 (.63)  0.88   .37 (.42)  0.81   .20 (.26)  0.83   .10 (.14)
 64   100    10.56       5.94   .89 (.90)  3.61   .73 (.76)  2.47   .53 (.57)  1.98   .33 (.39)  1.78   .19 (.23)
 96   100    19.04      10.39   .92 (.93)  5.92   .80 (.82)  3.81   .62 (.66)  2.76   .43 (.48)  2.56   .23 (.31)
128   100    26.67      14.27   .93 (.95)  8.06   .83 (.86)  4.89   .68 (.72)  3.45   .48 (.54)  3.15   .26 (.37)
160   100    41.18      21.78   .95 (.96) 11.91   .86 (.88)  7.08   .73 (.76)  4.86   .53 (.60)  3.63   .35 (.42)
192   100    58.91      30.93   .95 (.96) 16.76   .88 (.90)  9.83   .75 (.79)  6.47   .57 (.64)  4.75   .39 (.46)
224   100    79.85      41.65   .96 (.97) 22.40   .89 (.91) 12.86   .78 (.81)  8.02   .62 (.67)  5.68   .44 (.50)
The theoretical computational complexity (see Appendix) of this algorithm is given by

    T_P(m, n, p) \approx (4m^2 n - 16mn^2 + 16n^3)/p + 30mn^2 - 34n^3.                           (17)
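The bracketed theoretical efficiencies of Table 2 follow from (13) and (17); the short check below (our code, with $m = (l+2)n$ as in Table 2) prints, for example, 0.95 for $n = 25$, $l = 128$ on 2 processors and 0.14 for $n = 100$, $l = 32$ on 32 processors.

    def T_BG(m, n):
        return 4*m**2*n + 14*m*n**2 - 18*n**3                              # eq. (13)

    def T_P(m, n, p):
        return (4*m**2*n - 16*m*n**2 + 16*n**3)/p + 30*m*n**2 - 34*n**3    # eq. (17)

    def theoretical_efficiency(l, n, p):
        m = (l + 2) * n                                                    # l = m/n - 2
        return T_BG(m, n) / (p * T_P(m, n, p))

    print(round(theoretical_efficiency(128, 25, 2), 2))                    # 0.95
    print(round(theoretical_efficiency(32, 100, 32), 2))                   # 0.14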
From (13) and (17) it follows that the computational efficiency of Algorithm 2 approaches one for very large $m$, i.e. $\lim_{m \to \infty} T_{BG}(m,n) / (p \cdot T_P(m,n,p)) \approx 1$. This does not take into account, however, the inter-processor communication. Table 2 shows the execution times and actual (and, in brackets, the theoretical) efficiency of Algorithm 2 for some $l$ and $n$, where $l = m/n - 2$. The experimental results confirm the theoretical complexities. Note that the communication time increases with the number of processors, which affects the efficiency of the algorithm for small-size problems. Furthermore, Algorithm 2 is scalable in the sense that the efficiency remains constant when the size of the problem $m$, and consequently $l$, is multiplied by $2\sqrt{2}$ and the number of processors $p$ is doubled. Fig. 2 shows the scalability plots based on the theoretical and computational results presented in Table 2. The model size indicates the number of blocks $(m/n - 2)$, where each block has dimension $n \times n$.

[Fig. 2. Theoretical and computational scalability plots based on the results in Table 2: (a) theoretical efficiency, n = 25; (b) computational efficiency, n = 25; (c) theoretical efficiency, n = 50; (d) computational efficiency, n = 50; (e) theoretical efficiency, n = 100; (f) computational efficiency, n = 100. Each panel plots efficiency against model size for 2, 4, 8, 16 and 32 processors.]

4. Conclusion

Computationally efficient sequential and parallel algorithms for computing the best linear unbiased estimator of the general linear model (1) have been proposed. The sequential algorithm is a block version of an efficient serial approach that employs the generalized QR decomposition as its main computational component [12]. The new block Givens algorithm exploits the triangular structure of the Cholesky factor $C$ of the dispersion matrix $\Omega$ and is rich in BLAS-3 operations. It is found to be approximately $m/(3n)$ times faster than the corresponding LAPACK routine DGGGLM for estimating the GLM [1].

The parallel approach is based on the new sequential strategy. The parallel algorithm copies the augmented matrix $\tilde X$ and the main, sub- and super-block diagonals of $C$ to all processors. The rest of the matrix $C$ is evenly distributed across the processors. The algorithm duplicates parts of the computation. However, this is compensated for by the load-balanced distribution of the computationally expensive matrix $C$, resulting in minimal inter-processor communication. The algorithms have been implemented on a parallel computer with distributed memory. The theoretical complexities of both algorithms are stated and experimental results are
presented and analyzed. Overall, the parallel algorithm is found to be scalable and capable of solving large-scale GLM estimation problems, where $m \gg n$. The minimal inter-processor communication of the algorithm makes it applicable to other parallel computers (shared memory or distributed memory) with different system specifications. That is, the algorithm is expected to have a similar efficiency on other distributed systems due to the high ratio of local computation to communication cost.
Currently, an adaptation of the parallel algorithm to estimate seemingly unrelated regressions (a special class of the GLM which involves Kronecker structures) is being investigated [3,6,9,20,24].

Acknowledgements

The authors are grateful to Maurice Clint and the three anonymous referees for their valuable comments and suggestions.

Appendix

The theoretical complexities of computing the QRD of an $m \times n$ matrix and of applying the orthogonal matrix $Q$ to an $m \times q$ matrix, using the corresponding LAPACK routines, are given, respectively, by

    QR(m, n) = 2n^2 (m - n/3)  \quad \text{and} \quad  AppQ(m, n, q) = 2qn(2m - n + 1).

In order to solve the GLM, the LAPACK routine computes the GQRD (4), i.e. it computes the QRD of the $m \times n$ matrix $\tilde X$, it then applies the orthogonal $Q^T$ to the matrix $C \in \mathbb{R}^{m \times m}$ and finally it computes the RQD of $(Q^T C)$. Thus, the theoretical complexity of LAPACK is given by

    T_{LP}(m, n) = QR(m, n) + AppQ(m, n, m) + QR(m, m)
                 = 2n^2 (m - n/3) + 2mn(2m - n + 1) + 2m^2 (m - m/3)
                 = (4m^3 + 12m^2 n - 2n^3 + 6mn)/3.

This corresponds to the complexity given in (14).

The sequential block Givens algorithm (Algorithm 1) takes advantage of the upper-triangular structure of $C$ and solves the GLM in $(k-1)$ steps, where $m = kn$. At the $i$th step ($i = 1, \ldots, k-1$) a GQRD of $2n \times n$ and $2n \times 2n$ matrices is computed (line 5, Algorithm 1). This computation takes $T_{LP}(2n, n)$ flops. Then, the orthogonal matrix $P_i$ is applied from the right to a $(m - (i+1)n) \times 2n$ matrix (line 7, Algorithm 1). This computation takes $AppQ(2n, 2n, m - (i+1)n)$ flops. The computation for updating the vector $y$ (line 9, Algorithm 1) is negligible and thus it is ignored. Finally, the total theoretical complexity of Algorithm 1 shown in (13) is calculated by

    T_{BG}(m, n) = \sum_{i=1}^{m/n - 1} \bigl( T_{LP}(2n, n) + AppQ(2n, 2n, m - (i+1)n) \bigr)
                 = (m - n)(32n^3 + 48n^3 - 2n^3 + 12n^2)/(3n) + \sum_{i=1}^{m/n - 1} 4n(m - (i+1)n)(2n + 1)
                 = (m - n)(26n^2 + 4n) + 2(m - n)(m - 2n)(2n + 1)
                 = 4m^2 n + 14mn^2 - 18n^3 + 2m^2 - 2mn.

The computations of the parallel algorithm (Algorithm 2) are dominated by the last processor (e.g. $P_4$ in Fig. 1). Similarly to the sequential algorithm, it solves the GLM in $(k-1)$ steps, where $m = kn$. At the $i$th step ($i = 1, \ldots, k-1$) a GQRD of $2n \times n$ and $2n \times 2n$ matrices is computed (line 7, Algorithm 2). This computation takes $T_{LP}(2n, n)$ flops. Then, the orthogonal $P_i$ is applied from the right of a $n\lceil (m - (i+1)n)/(np) \rceil \times 2n$ matrix (line 11, Algorithm 2). This computation requires $AppQ(2n, 2n, n\lceil (m - (i+1)n)/(np) \rceil)$ flops. The computations for updating the vector $y$ (line 12, Algorithm 2) are very small and are neglected. Thus, the total theoretical computational complexity of Algorithm 2 in (17) has been calculated by

    T_P(m, n, p) = \sum_{i=1}^{m/n - 1} \bigl( T_{LP}(2n, n) + AppQ(2n, 2n, n\lceil (m - (i+1)n)/(np) \rceil) \bigr)
                 = (m - n)(26n^2 + 4n) + 2(m - 2n)(m - 2n + pn)(2n + 1)/p
                 = (4m^2 n - 16mn^2 + 16n^3 + 2m^2 - 8mn + 8n^2)/p + 30mn^2 - 34n^3 + 6mn - 8n^2.
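The closed forms above can be cross-checked numerically. The following short script (our illustration, transcribing the Appendix formulas) confirms that the step-by-step sums and the closed-form expressions for $T_{BG}$ and $T_P$ agree, for instance for $m = 3250$, $n = 25$ and $p = 2$, so that $(k-2)$ is a multiple of $p$.

    from math import ceil

    def T_LP(m, n):                      # LAPACK cost of the GQRD, from the Appendix
        return (4*m**3 + 12*m**2*n - 2*n**3 + 6*m*n) / 3

    def AppQ(m, n, q):                   # cost of applying Q to an m x q matrix
        return 2*q*n*(2*m - n + 1)

    def T_BG_sum(m, n):                  # Algorithm 1: sum over the k-1 steps
        k = m // n
        return sum(T_LP(2*n, n) + AppQ(2*n, 2*n, m - (i + 1)*n) for i in range(1, k))

    def T_P_sum(m, n, p):                # Algorithm 2: the last processor's share
        k = m // n
        return sum(T_LP(2*n, n) + AppQ(2*n, 2*n, n * ceil((m - (i + 1)*n) / (n*p)))
                   for i in range(1, k))

    m, n, p = 3250, 25, 2                # k = 130, so (k - 2) is a multiple of p
    closed_BG = 4*m**2*n + 14*m*n**2 - 18*n**3 + 2*m**2 - 2*m*n
    closed_P = (4*m**2*n - 16*m*n**2 + 16*n**3 + 2*m**2 - 8*m*n + 8*n**2)/p \
               + 30*m*n**2 - 34*n**3 + 6*m*n - 8*n**2
    assert abs(T_BG_sum(m, n) - closed_BG) < 1e-6 * closed_BG
    assert abs(T_P_sum(m, n, p) - closed_P) < 1e-6 * closed_P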
References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[2] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
[3] P. Foschi, D. Belsley, E.J. Kontoghiorghes, A comparative study of algorithms for solving seemingly unrelated regressions models, Comput. Statist. Data Anal. 44 (1–2) (2003) 3–35.
[4] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, Maryland, 1996.
[5] G.G. Judge, W.E. Griffiths, R.C. Hill, H. Lütkepohl, T.C. Lee, The Theory and Practice of Econometrics, Wiley Series in Probability and Mathematical Statistics, second ed., John Wiley and Sons, 1985.
[6] E.J. Kontoghiorghes, Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems, Advances in Computational Economics, vol. 15, Kluwer Academic Publishers, Boston, MA, 2000.
[7] E.J. Kontoghiorghes, Parallel Givens sequences for solving the general linear model on a EREW PRAM, Parallel Algorithms Appl. 15 (1–2) (2000) 57–75.
[8] E.J. Kontoghiorghes, Parallel strategies for rank-k updating of the QR decomposition, SIAM J. Matrix Anal. Appl. 22 (3) (2000) 714–725.
[9] E.J. Kontoghiorghes, M.R.B. Clarke, An alternative approach for the numerical solution of seemingly unrelated regression equations models, Comput. Statist. Data Anal. 19 (4) (1995) 369–377.
[10] E.J. Kontoghiorghes, E. Dinenis, Computing 3SLS solutions of simultaneous equation models with a possible singular variance–covariance matrix, Comput. Econom. 10 (1997) 231–250.
[11] S. Kourouklis, C.C. Paige, A constrained least squares approach to the general Gauss–Markov linear model, J. Am. Statist. Assoc. 76 (375) (1981) 620–625.
[12] C.C. Paige, Numerically stable computations for general univariate linear models, Commun. Statist. Simulation Comput. 7 (5) (1978) 437–453.
[13] C.C. Paige, Computer solution and perturbation analysis of generalized linear least squares problems, Math. Comput. 33 (145) (1979) 171–183.
[14] C.C. Paige, Fast numerically stable computations for generalized linear least squares problems, SIAM J. Numer. Anal. 16 (1) (1979) 165–171.
[15] D.S.G. Pollock, The Algebra of Econometrics, Wiley Series in Probability and Mathematical Statistics, John Wiley and Sons, 1979.
[16] D.S.G. Pollock, A Handbook of Time-Series Analysis, Signal Processing and Dynamics, Signal Processing and Its Applications, Academic Press, 1999.
[17] C.R. Rao, H. Toutenburg, Linear Models: Least Squares and Alternatives, Springer Series in Statistics, Springer, 1995.
[18] S.R. Searle, Linear Models, John Wiley and Sons Inc., 1971.
[19] G.A.F. Seber, Linear Regression Analysis, John Wiley and Sons Inc., 1977.
[20] V.K. Srivastava, T.D. Dwivedi, Estimation of seemingly unrelated regression equations models: a brief survey, J. Econom. 10 (1979) 15–32.
[21] V.K. Srivastava, D.E.A. Giles, Seemingly Unrelated Regression Equations Models: Estimation and Inference, Statistics: Textbooks and Monographs, vol. 80, Marcel Dekker, Inc., 1987.
[22] P. Yanev, E.J. Kontoghiorghes, Efficient algorithms for block downdating of least squares solutions, Appl. Numer. Math. 49 (1) (2004) 3–15.
[23] P. Yanev, E.J. Kontoghiorghes, Parallel algorithms for downdating the QR decomposition, submitted for publication.
[24] A. Zellner, An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias, J. Am. Statist. Assoc. 57 (1962) 348–368.