Efficient algorithms for estimating the general linear model

Parallel Computing 32 (2006) 195–204
www.elsevier.com/locate/parco

Petko Yanev a,*, Erricos John Kontoghiorghes b,c

a IRISA/INRIA – Rennes, Campus de Beaulieu, 35042 Rennes cedex, France
b Department of Public and Business Administration, University of Cyprus, P.O. Box 20537, CY-1678 Nicosia, Cyprus
c School of Computer Science and Information Systems, Birkbeck College, University of London, Malet Street, London WC1E 7HX, UK

This work is in part supported by the Swiss National Foundation Grants 2000-061875.00/1 and 101412-105978. Part of the experiments were performed while the first author was visiting the University of Edinburgh under the support of the HPC-Europa project.
* Corresponding author. E-mail addresses: [email protected] (P. Yanev), [email protected] (E.J. Kontoghiorghes).

Received 9 December 2004; received in revised form 7 June 2005; accepted 21 June 2005
Available online 28 November 2005
doi:10.1016/j.parco.2005.06.007

Abstract

Computationally efficient serial and parallel algorithms for estimating the general linear model are proposed. The sequential block-recursive algorithm is an adaptation of a known Givens strategy that has as a main component the generalized QR decomposition. The proposed algorithm is based on orthogonal transformations and exploits the triangular structure of the Cholesky QRD factor of the variance–covariance matrix. Specifically, it computes the estimator of the general linear model by solving recursively a series of smaller and smaller generalized linear least squares problems. The new algorithm is found to outperform significantly the corresponding LAPACK routine. A parallel version of the new sequential algorithm which utilizes an efficient distribution of the matrices over the processors and has low inter-processor communication is developed. The theoretical computational complexity of the parallel algorithms is derived and analyzed. Experimental results are presented which confirm the theoretical analysis. The parallel strategy is found to be scalable and highly efficient for estimating large-scale general linear estimation problems.
© 2005 Elsevier B.V. All rights reserved.

Keywords: General linear model; Generalized QR decomposition; Parallel algorithms

1. Introduction

The estimation of the general linear model (GLM) is a well studied problem in statistics [17–19]. It arises in diverse areas such as econometrics, genetics and engineering, to name but a few [5,15,16,21]. Emphasis is given to the properties of the estimator rather than the computational merits of the estimation procedures. The GLM is written as

    y = Xb + e,    e ~ (0, σ²Ω),                                                        (1)


where y ∈ R^m is the response vector, X ∈ R^{m×n} is the full rank exogenous data matrix, b ∈ R^n are the coefficients to be estimated and e ∈ R^m is the noise with zero mean and variance–covariance matrix σ²Ω. It is assumed that Ω has full rank.

The best linear unbiased estimator (BLUE) of b, say b̂, is obtained from the solution of the normal equations

    X^T Ω^{-1} X b̂ = X^T Ω^{-1} y.

That is,

    b̂ = (X^T Ω^{-1} X)^{-1} X^T Ω^{-1} y.                                              (2)

The derivation of b̂ using (2) is computationally expensive and numerically unstable when Ω is ill-conditioned. This can be avoided by reformulating the estimation of the GLM as the generalized linear least squares problem (GLLSP)

    argmin_{u,b} u^T u    subject to    y = Xb + Cu.                                    (3)

Here Ω = CC^T, C ∈ R^{m×m}, and u is a random vector defined by Cu = e, i.e. u ~ (0, σ²I_m). The matrix C is the triangular Cholesky factor of the variance–covariance matrix Ω. In many applications the triangular C is known rather than Ω [6,10–12,15]. Hereafter it is assumed that C is available and has an upper-triangular form. The GLLSP can be solved using the generalized QR decomposition (GQRD) of X̃ ≡ (X y) and C:

    Q^T X̃ = Q^T ( X   y ) = ( R   ỹ )    n
                            ( 0   η )    1
                            ( 0   0 )    m − n − 1                                      (4a)

and

    Q^T C P = U = ( U_{1,1}   r   U_{1,2} )    n
                  ( 0         δ   g       )    1
                  ( 0         0   U_{2,2} )    m − n − 1.                               (4b)

Here R and U are upper triangular matrices of orders n and m, respectively, and Q, P ∈ R^{m×m} are orthogonal [2,4]. The GLLSP (3) is equivalent to

    argmin_{u,b} ||P^T u||²    subject to    Q^T y = Q^T X b + Q^T C P P^T u,

which can also be written as

    argmin_{v_1,t,v_2,b} (||v_1||² + t² + ||v_2||²)    subject to    ỹ = Rb + U_{1,1} v_1 + r t + U_{1,2} v_2,
                                                                     η = δ t + g v_2,
                                                                     0 = U_{2,2} v_2,   (5)

where the vector u^T P is partitioned as (v_1^T  t  v_2^T) and ||·|| denotes the Euclidean norm. The values v_2 = 0 and t = η/δ can be derived from the last two constraints in (5). Then, setting v_1 = 0, the BLUE of b is obtained by solving

    Rb = ỹ − η r/δ.                                                                      (6)
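As an illustration of (4a)–(6) (an addition to the paper, not part of it), the following NumPy/SciPy sketch computes the BLUE for dense inputs by forming the GQRD explicitly; unlike the algorithms developed below, it does not exploit the triangular structure of C.

```python
# Dense GQRD solution of the GLLSP (3)-(6); an illustrative sketch only.
import numpy as np
from scipy.linalg import qr, rq, solve_triangular

def glm_blue_dense(X, y, C):
    m, n = X.shape
    # (4a): QRD of the augmented matrix (X y); T = Q^T (X y) is upper trapezoidal
    Q, T = qr(np.column_stack([X, y]))
    R, y_tilde, eta = T[:n, :n], T[:n, n], T[n, n]
    # (4b): RQD of Q^T C, so that Q^T C P = U is upper triangular
    U, Pt = rq(Q.T @ C)                  # P = Pt.T, although P itself is not needed for b
    r, delta = U[:n, n], U[n, n]
    # (6): BLUE of b from the leading n x n upper triangular system
    return solve_triangular(R, y_tilde - eta * r / delta)

# Small random example: C upper triangular (Cholesky-QRD factor of Omega)
rng = np.random.default_rng(0)
m, n = 12, 3
X = rng.standard_normal((m, n))
C = np.triu(rng.standard_normal((m, m)) + 5 * np.eye(m))
y = X @ rng.standard_normal(n) + C @ rng.standard_normal(m)
print(glm_blue_dense(X, y, C))
```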

The LAPACK routine (DGGGLM) which computes the estimator of the GLM makes no assumptions on the structure of C. However, the commonly available triangular structure of C facilitates the development of efficient algorithms for solving the GLLSP [7,12–14]. In this work block-recursive sequential and parallel strategies which exploit the structure of the matrices are proposed. The new methods are rich in BLAS-3 operations and solve a series of reduced-size GLLSPs. The algorithms have been implemented on a 32-CPU IBM p690+ high-end compute node with 27 GB of distributed memory. This parallel machine performs at 6.8 Gflop/s, has a bandwidth of 3077 Mbyte/s and a communication latency of 1.92 μs. The communication between the processors is realized using the MPI library. The


performance of the algorithms has been evaluated experimentally. In addition, the theoretical complexities of the algorithms in number of flops (floating point operations) are also presented.

In Section 2 the serial block-Givens strategy is described and compared with the existing LAPACK routine for estimating the GLM. In Section 3 the parallel algorithm is considered and the theoretical and computational results are presented. In Section 4 some conclusions are drawn. The Appendix shows in detail the derivation of the complexities of the various algorithms.

2. Serial block Givens strategy

An efficient sequential algorithm based on Givens rotations for estimating the GLM has been proposed by Paige [12]. Here a block version based on this sequential strategy is investigated [8,22]. Consider the GLLSP (3), where the matrices X̃ ≡ (X y) and C are partitioned, respectively, as

    X̃ = ( X_1   y_1 )              ( C_{1,1}   C_{1,2}   ···   C_{1,k} )
         ( X_2   y_2 )              ( 0         C_{2,2}   ···   C_{2,k} )
         (  ⋮     ⋮  )    and   C = (  ⋮          ⋮        ⋱      ⋮     )
         ( X_k   y_k )              ( 0         0         ···   C_{k,k} ),              (7)

with X_i, C_{i,j} ∈ R^{n×n} and y_i ∈ R^n.

Here C_{i,i} (i = 1, ..., k) are upper triangular and, for simplicity, it is assumed that m = kn. The sequential block algorithm computes the solution of the GLLSP (3) in k steps. The GQRD (4a) is computed during the first (k − 1) steps. For j = k − i, the ith (i = 1, ..., k − 1) step computes the smaller GQRD of

    ( X_j       y_j^{(i)}     )           ( C_{j,j}   C̃_{j,j+1}   )
    ( X̃_{j+1}   y_{j+1}^{(i)} )    and    ( 0         C̃_{j+1,j+1} ),                    (8)

where y_{k−1}^{(1)} = y_{k−1}, y_k^{(1)} = y_k, X̃_k = X_k, C̃_{k−1,k} = C_{k−1,k} and C̃_{k,k} = C_{k,k}. That is, initially the QRD of the first matrix in (8) is computed by:

    Q_i^T ( X_j       y_j^{(i)}     )  =  ( X̃_j   ỹ_j^{(i)} )    n
          ( X̃_{j+1}   y_{j+1}^{(i)} )     ( 0      η_i       )    1
                                           ( 0      0         )    n − 1,                 (9)

where Q_i ∈ R^{2n×2n} is orthogonal and X̃_j is upper triangular. Then, the orthogonal matrix Q_i^T is applied from the left of the second matrix in (8), which is then re-triangularized from the right, i.e.

    Q_i^T ( C_{j,j}   C̃_{j,j+1}   ) P_i  =  ( C̃_{j,j}   r_j^{(i)}   C̄_{j,j+1}   )    n
          ( 0         C̃_{j+1,j+1} )         ( 0          δ_i         ḡ_i          )    1
                                              ( 0          0           C̄_{j+1,j+1} )    n − 1.      (10)

Here P_i is orthogonal and of order 2n; C̃_{j,j} and C̄_{j+1,j+1} are upper triangular. Once the ith GQRD of (8) has been computed, P_i is applied from the right of the affected jth and (j + 1)th block-columns of C, i.e. the product

    ( C̃_{1:j−1,j}   r_{1:j−1}^{(i)}   C̄_{1:j−1,j+1} )  =  ( C_{1:j−1,j}   C_{1:j−1,j+1} ) P_i       (11)


is computed. Note that this is the most time consuming task in each step of the computation of the GQRD (4a), especially when k is large, i.e. when m ≫ n. Now let (u_{i−1}^T  ũ_{i−1}^T) P_i be partitioned conformably as (u_i^T  ũ_i^T  t_i  ū_i^T), where (u_0^T  ũ_0^T) ≡ u^T. The GLLSP (3) after the ith step of computing the GQRD (4a) can be written as

    argmin_{u_i, ũ_i, t_i, ū_i, b} (||u_i||² + ||ũ_i||² + t_i² + ||ū_i||²)    subject to

    ( y_{1:j−1}^{(i)} )     ( X_{1:j−1} )        ( C_{1:j−1,1:j−1}   C̃_{1:j−1,j}   r_{1:j−1}^{(i)}   C_{1:j−1,j+1} ) ( u_i )
    ( ỹ_j^{(i)}       )  =  ( X̃_j       ) b  +  ( 0                 C̃_{j,j}       r_j^{(i)}         C̄_{j,j+1}     ) ( ũ_i )
    ( η_i             )     ( 0          )        ( 0                 0             δ_i               ḡ_i           ) ( t_i )
    ( 0               )     ( 0          )        ( 0                 0             0                 C̄_{j+1,j+1}   ) ( ū_i ).

From this it follows that C̄_{j+1,j+1} ū_i = 0, i.e. ū_i = 0, and η_i = δ_i t_i + ḡ_i ū_i, i.e. t_i = η_i/δ_i. Thus, the GLLSP (3) is equivalent to the reduced GLLSP

    argmin_{u_i, ũ_i, b} (||u_i||² + ||ũ_i||²)    s.t.    ( y_{1:j−1}^{(i+1)} )  =  ( X_{1:j−1} ) b  +  ( C_{1:j−1,1:j−1}   C̃_{1:j−1,j} ) ( u_i )
                                                           ( y_j^{(i+1)}       )     ( X̃_j       )       ( 0                 C̃_{j,j}     ) ( ũ_i ),

where

    ( y_{1:j−1}^{(i+1)} )     ( y_{1:j−1}^{(i)} )          ( r_{1:j−1}^{(i)} )
    ( y_j^{(i+1)}       )  =  ( ỹ_j^{(i)}       )  −  t_i  ( r_j^{(i)}       ).          (12)

Eq. (12) shows the size-reduction of the GLLSP after each step of the computation. Following the completion of the (k − 1)th step the final, smallest GLLSP has the form:

    argmin_{ũ_{k−1}, t_{k−1}, b} (||ũ_{k−1}||² + t_{k−1}²)    s.t.    ( ỹ_1^{(k−1)} )  =  ( X̃_1 ) b  +  ( C̃_{1,1}   r_1^{(k−1)} ) ( ũ_{k−1} )
                                                                       ( η_{k−1}     )     ( 0    )       ( 0         δ_{k−1}     ) ( t_{k−1} )

or

    argmin_{ũ_{k−1}, b} ||ũ_{k−1}||²    s.t.    y_1^{(k)} = X̃_1 b + C̃_{1,1} ũ_{k−1},

where X̃_1 and C̃_{1,1} are upper triangular and are derived from the QRD (9) and RQ decomposition (RQD) (10), respectively, and y_1^{(k)} = ỹ_1^{(k−1)} − η_{k−1} r_1^{(k−1)}/δ_{k−1}. Thus, the kth step computes the BLUE of b by setting ũ_{k−1} = 0 and solving the upper triangular system X̃_1 b = y_1^{(k)}.

Algorithm 1. The sequential block Givens algorithm for solving the GLLSP (3).
 1: Let X̃ and C in (4a) be partitioned as in (7), where m = kn.
 2: Let y_{k−1}^{(1)} = y_{k−1}, y_k^{(1)} = y_k, X̃_k = X_k, C̃_{k−1,k} = C_{k−1,k} and C̃_{k,k} = C_{k,k}.
 3: for i = 1, ..., k − 1 do
 4:   Set j := k − i
 5:   Compute the GQRD of ( X_j  y_j^{(i)} ; X̃_{j+1}  y_{j+1}^{(i)} ) and ( C_{j,j}  C̃_{j,j+1} ; 0  C̃_{j+1,j+1} ) as in (9) and (10)
 6:   if i ≠ k − 1 then
 7:     Compute ( C̃_{1:j−1,j}  r_{1:j−1}^{(i)}  C̄_{1:j−1,j+1} ) = ( C_{1:j−1,j}  C_{1:j−1,j+1} ) P_i
 8:   end if
 9:   Update the vector y:  ( y_{1:j−1}^{(i+1)} ; y_j^{(i+1)} ) = ( y_{1:j−1}^{(i)} ; ỹ_j^{(i)} ) − (η_i/δ_i) ( r_{1:j−1}^{(i)} ; r_j^{(i)} )
10: end for
11: Solve X̃_1 b = y_1^{(k)}
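To make the recursion concrete, here is a NumPy/SciPy sketch of Algorithm 1. It is an illustrative rendering under stated simplifications: the orthogonal factors Q_i and P_i are formed explicitly (the paper applies Householder transformations without constructing them), and it assumes m = kn with C upper triangular.

```python
# Sketch of Algorithm 1 (block Givens strategy) for the GLLSP (3); illustrative only.
import numpy as np
from scipy.linalg import qr, rq, solve_triangular

def glm_blue_block(X, y, C, n):
    """X: m x n, y: length m, C: m x m upper triangular, m = k*n."""
    Xw = np.array(X, dtype=float)
    yw = np.array(y, dtype=float)
    Cw = np.array(C, dtype=float)
    m = Xw.shape[0]
    k = m // n
    s = m                                   # size of the current (shrinking) GLLSP
    for i in range(1, k):                   # steps i = 1, ..., k-1 (block index j = k - i)
        lo, mid = s - 2 * n, s - n
        # (9): QRD of the bottom 2n x (n+1) block of (X y)
        Qi, T = qr(np.column_stack([Xw[lo:s], yw[lo:s]]))
        Xw[lo:mid], Xw[mid:s] = T[:n, :n], 0.0
        y_tilde, eta = T[:n, n], T[n, n]
        # (10): apply Qi^T to the trailing 2n x 2n block of C, re-triangularize from the right
        U, Pt = rq(Qi.T @ Cw[lo:s, lo:s])   # U = Qi^T C_block P with P = Pt.T
        Cw[lo:s, lo:s] = U
        r_j, delta = U[:n, n], U[n, n]
        # (11): apply P_i to the affected block columns of the rows above
        if lo > 0:
            Cw[:lo, lo:s] = Cw[:lo, lo:s] @ Pt.T
        # (12): update the right-hand side with t_i = eta_i / delta_i
        t = eta / delta
        yw[:lo] -= t * Cw[:lo, mid]         # column 'mid' now holds r_{1:j-1}^{(i)}
        yw[lo:mid] = y_tilde - t * r_j
        s -= n                              # the GLLSP shrinks by one block
    # Step k: solve the n x n upper triangular system X~_1 b = y_1^{(k)}
    return solve_triangular(Xw[:n], yw[:n])
```

On the dense example given earlier (m = 12, n = 3), glm_blue_block(X, y, C, 3) should return the same estimate as glm_blue_dense up to rounding, while only ever factorizing 2n x 2n blocks of C.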


Table 1
Execution times (s) and theoretical results of Algorithm 1 and LAPACK

  n:            25                        50                        100
  k:        128     256     384      64      128     192      32      64      96

  LP        25.77   213.35  601.58   26.25   216.62  603.03   27.77   223.35  605.58
  BG        0.76    2.88    6.59     1.44    5.32    9.71     2.62    10.56   19.04
  LP/BG     33.91   74.08   91.29    18.23   40.72   62.10    10.60   21.15   31.81
  TLP/TBG   42.69   85.35   128.01   21.38   42.69   64.02    10.77   21.39   32.04

Algorithm 1 summarizes the steps of this block Givens strategy for estimating the GLM. For the factorizations (9) and (10) Householder transformations are employed. The orthogonal matrices Q_i and P_i (i = 1, ..., k − 1) are not explicitly constructed. The theoretical complexity (see Appendix) of this algorithm is given by:

    T_BG(m, n) ≈ 4m²n + 14mn² − 18n³.                                                   (13)

Table 1 shows the execution times in seconds and the theoretical complexities in number of flops of the block Givens Algorithm 1 (BG) and the LAPACK (LP) routine DGGGLM which estimates the GLM, for some values of n and k, where m = kn [1]. The theoretical complexity of LAPACK is given by:

    T_LP(m, n) ≈ (4m³ + 12m²n − 2n³)/3.                                                 (14)
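As an illustrative numeric check (an addition, not from the paper), the snippet below evaluates (13) and (14) for the first problem size of Table 1; their ratio is close to the approximation m/(3n).

```python
# Evaluate the flop counts (13) and (14); illustrative check only.
def t_bg(m, n):                      # Algorithm 1, Eq. (13)
    return 4 * m**2 * n + 14 * m * n**2 - 18 * n**3

def t_lp(m, n):                      # LAPACK DGGGLM, Eq. (14)
    return (4 * m**3 + 12 * m**2 * n - 2 * n**3) / 3

n, k = 25, 128
m = k * n
print(t_lp(m, n) / t_bg(m, n))       # about 42.5
print(m / (3 * n))                   # about 42.7
```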

Note that, theoretically, Algorithm 1 is approximately m/3n times faster than the LAPACK routine, which is confirmed by the experimental results. This improvement is due to the fact that Algorithm 1 exploits the triangular structure of the large and computationally expensive matrix C in (3), while the LAPACK routine assumes that C is dense. The experimental results (LP/BG) confirm the theoretical ones (TLP/TBG). There is a negligible discrepancy between the two ratios when k is large and n is relatively much smaller. This is due to the increasing overheads which occur from the frequent data exchanges of the submatrices in (7).

3. Parallel algorithm

The computation of the product (11) is the most time consuming task in Algorithm 1. This cost can be reduced by applying the orthogonal matrices P_i (i = 1, ..., k − 1) to the block columns of C (see line 7 of Algorithm 1) in parallel. An efficient parallel algorithm requires a load-balanced distribution of the matrices over the processors and low inter-processor communication [23]. Let p denote the number of processors and assume that (k − 2) is a multiple of p, where m = kn.

Consider the partitioning of the matrices X̃ and C as in (7). To achieve low inter-processor communication, the GQRDs computed in line 5 of Algorithm 1 are executed simultaneously by all processors. That is, the data matrix X, the last two block rows of y and the main, sub- and super-block diagonals of C are duplicated on each processor. The remaining (k − 2) block-rows of C and y are allocated to the processors using a block-row cyclic distribution. Specifically, C_{i,:} (i = k − 2, ..., 1) is allocated to the processor P_{k_i}, where k_i = p − (i − 1) mod p. The same distribution scheme is used for the vector y. This distribution will result in the processor P_j (j = 1, ..., p) being allocated the vector

    c^{(j)} = ( y_{p+1−j}^T   y_{2p+1−j}^T   ···   y_{k−1−j}^T )^T                      (15)

and the matrix

    C^{(j)} = ( C_{p+1−j,:}^T   C_{2p+1−j,:}^T   ···   C_{k−1−j,:}^T )^T,               (16)

where c^{(j)} ∈ R^{n(k−2)/p} and C^{(j)} ∈ R^{n(k−2)/p×m}. Fig. 1 shows the distribution of the matrices X̃ and C over the processors, with p = 4 and k = 18. The shaded blocks indicate those copied to all processors. The blank, unshaded, blocks are null and are unaffected during the computation.


Fig. 1. The row-block cyclic distribution of the matrices on 4 processors, when k = 18.
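The mapping in (15) and (16) can be spelled out with a small helper (an illustrative addition, not from the paper), which lists for each processor P_j the block rows C_{i,:} it owns.

```python
# Block-row cyclic distribution of C_{i,:} (i = 1, ..., k-2) over p processors,
# following (15)-(16); illustrative helper, assuming (k - 2) is a multiple of p.
def owner(i, p):
    """Processor (1-based) owning block row C_{i,:}."""
    return p - (i - 1) % p

def blocks_of(j, p, k):
    """Block rows allocated to processor P_j, as in (16)."""
    return [i for i in range(1, k - 1) if owner(i, p) == j]

p, k = 4, 18
for j in range(1, p + 1):
    print(j, blocks_of(j, p, k))
# P_1 owns block rows 4, 8, 12, 16; P_2 owns 3, 7, 11, 15; P_3 owns 2, 6, 10, 14;
# P_4 owns 1, 5, 9, 13, i.e. the row-block cyclic pattern of (15)-(16).
```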

The parallel algorithm solves the GLLSP (3) in k steps. During the first (k − 1) steps, all processors initially compute the same factorizations (9) and (10). Then each processor updates its allocated submatrix C^{(j)} and subvector c^{(j)}. Note that, at the ith step, each processor updates ⌈q/p⌉ blocks, where q = k − i − 1. Thus, the processors have equal computational loads. When the local computations have been completed one processor, P_j say, sends one block from C^{(j)} and c^{(j)}, which are required for the next step, to the other processors P_r (r = 1, ..., p and r ≠ j). That is, at each step only one processor broadcasts an n × n submatrix and an n-element subvector. This broadcast acts as a barrier-synchronization point for the processors before the next step commences. Note that the duplication of the computations results in a one-to-all broadcast, which has a low communication cost compared to the extremely expensive all-to-all broadcast. The latter, i.e. the all-to-all broadcast, is required when there are no duplicated computations and consequently results in a highly inefficient algorithm. The parallel strategy is summarized in Algorithm 2. The broadcast performed by processor P_j is shown in lines 13–17 of the parallel algorithm.

Algorithm 2. The parallel algorithm for solving the GLLSP (3) on p processors.
 1: Let X̃ and C be partitioned as in (7), where m = kn and (k − 2) is a multiple of p.
 2: Allocate X, y_{k−1}, y_k, C_{k,k}, C_{i,i} and C_{i,i+1} (i = 1, ..., k − 1) to all processors.
 3: Allocate c^{(j)} and C^{(j)} as in (15) and (16), respectively, to processor P_j (j = 1, ..., p).
 4: each processor P_j (j = 1, ..., p) do in parallel:
 5: for i = 1, ..., k − 1 do
 6:   Set t := k − i
 7:   Compute the GQRD of ( X_t  y_t^{(i)} ; X̃_{t+1}  y_{t+1}^{(i)} ) and ( C_{t,t}  C̃_{t,t+1} ; 0  C̃_{t+1,t+1} ) as in (9) and (10).
 8:   Compute y_{k−i}^{(i+1)} = ỹ_{k−i}^{(i)} − (η_i/δ_i) r_{k−i}^{(i)}.
 9:   if i ≠ k − 1 then
10:     Set q := k − i − 1
11:     Compute C^{(j)}_{1:n⌈q/p⌉, nq+1:nq+2n} = C^{(j)}_{1:n⌈q/p⌉, nq+1:nq+2n} P_i.
12:     Compute c^{(j)}_{1:n⌈q/p⌉} = c^{(j)}_{1:n⌈q/p⌉} − t_i C^{(j)}_{1:n⌈q/p⌉, nq+n+1}.
13:     if j = (i − 1) mod p + 1 then
14:       Send y_{t−1}^{(i+1)} and C̃_{t−1,t} to P_r, where r = 1, ..., p and r ≠ j.
15:     else
16:       Receive y_{t−1}^{(i+1)} and C̃_{t−1,t} from P_r, where r = (i − 1) mod p + 1.
17:     end if
18:   end if
19: end for
20: Solve X̃_1 b = y_1^{(k)}
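The one-to-all broadcast of lines 13–17 maps naturally onto MPI. Below is a hedged mpi4py sketch of the communication skeleton only; the GQRD and the local block updates are elided, and the dummy payload stands in for y_{t−1}^{(i+1)} and C̃_{t−1,t}.

```python
# Sketch of the per-step one-to-all broadcast in Algorithm 2 (lines 13-17),
# using mpi4py with dummy data in place of the GQRD updates; run with, e.g.,
#   mpiexec -n 4 python this_script.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p = comm.Get_size()
rank = comm.Get_rank()             # processor P_{rank+1} in the paper's numbering

n, k = 3, 10                       # toy block size and number of blocks (m = k*n)

for i in range(1, k):              # steps i = 1, ..., k-1
    # ... here every processor would compute the same GQRD (9)-(10) and update its
    #     locally stored blocks of C and y (lines 7-12 of Algorithm 2) ...
    if i != k - 1:
        root = (i - 1) % p         # rank holding the block needed at step i + 1
        if rank == root:
            payload = (np.zeros(n), np.zeros((n, n)))   # y_{t-1}^{(i+1)}, C~_{t-1,t}
        else:
            payload = None
        # One n-vector and one n x n block are broadcast; the broadcast also acts
        # as a barrier synchronizing the processors before the next step.
        y_next, C_next = comm.bcast(payload, root=root)
```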


Table 2
Execution times (s) and efficiency of Algorithm 2, l = m/n − 2; theoretical efficiencies are given in brackets

  l     n     Serial    2 processors        4 processors        8 processors        16 processors       32 processors
                        Time    Eff.        Time    Eff.        Time    Eff.        Time    Eff.        Time    Eff.

  128   25    0.76      0.41    .93 (.95)   0.24    .79 (.86)   0.16    .59 (.72)   0.13    .37 (.54)   0.19    .13 (.37)
  256   25    2.88      1.54    .94 (.97)   0.82    .88 (.92)   0.50    .72 (.83)   0.36    .50 (.70)   0.38    .24 (.53)
  384   25    6.59      3.47    .95 (.98)   1.75    .94 (.95)   1.08    .76 (.88)   0.63    .65 (.78)   0.72    .29 (.63)
  512   25    11.95     6.09    .98 (.99)   3.17    .94 (.96)   1.81    .83 (.91)   1.12    .67 (.82)   1.01    .37 (.69)
  640   25    18.77     9.51    .99 (.99)   4.82    .97 (.97)   2.69    .87 (.92)   1.50    .78 (.85)   1.28    .46 (.74)
  768   25    26.49     13.54   .98 (.99)   6.82    .97 (.97)   3.75    .88 (.94)   2.02    .82 (.87)   1.61    .51 (.77)
  896   25    36.23     18.51   .98 (.99)   9.35    .97 (.98)   4.96    .91 (.95)   2.65    .85 (.89)   1.94    .58 (.80)
  64    50    1.44      0.79    .91 (.90)   0.49    .73 (.76)   0.36    .50 (.57)   0.31    .29 (.39)   0.34    .13 (.23)
  128   50    5.32      2.83    .94 (.95)   1.58    .84 (.86)   0.97    .69 (.72)   0.77    .43 (.54)   0.86    .19 (.37)
  192   50    9.71      5.18    .94 (.96)   2.76    .88 (.90)   1.88    .65 (.79)   1.26    .48 (.64)   1.27    .24 (.36)
  256   50    17.47     9.09    .96 (.97)   4.80    .91 (.92)   2.82    .77 (.83)   1.97    .55 (.70)   1.70    .32 (.53)
  320   50    27.13     14.14   .96 (.98)   7.45    .91 (.94)   4.22    .80 (.86)   2.76    .61 (.74)   2.12    .40 (.58)
  384   50    39.35     20.07   .98 (.98)   10.76   .91 (.95)   5.95    .83 (.88)   3.78    .65 (.78)   2.61    .47 (.63)
  448   50    53.71     27.38   .98 (.98)   14.34   .94 (.95)   7.83    .86 (.90)   4.62    .73 (.80)   3.04    .55 (.66)
  32    100   2.62      1.61    .81 (.84)   1.11    .59 (.63)   0.88    .37 (.42)   0.81    .20 (.26)   0.83    .10 (.14)
  64    100   10.56     5.94    .89 (.90)   3.61    .73 (.76)   2.47    .53 (.57)   1.98    .33 (.39)   1.78    .19 (.23)
  96    100   19.04     10.39   .92 (.93)   5.92    .80 (.82)   3.81    .62 (.66)   2.76    .43 (.48)   2.56    .23 (.31)
  128   100   26.67     14.27   .93 (.95)   8.06    .83 (.86)   4.89    .68 (.72)   3.45    .48 (.54)   3.15    .26 (.37)
  160   100   41.18     21.78   .95 (.96)   11.91   .86 (.88)   7.08    .73 (.76)   4.86    .53 (.60)   3.63    .35 (.42)
  192   100   58.91     30.93   .95 (.96)   16.76   .88 (.90)   9.83    .75 (.79)   6.47    .57 (.64)   4.75    .39 (.46)
  224   100   79.85     41.65   .96 (.97)   22.40   .89 (.91)   12.86   .78 (.81)   8.02    .62 (.67)   5.68    .44 (.50)

The theoretical computational complexity (see Appendix) of this algorithm is given by:

    T_P(m, n, p) ≈ (4m²n − 16mn² + 16n³)/p + 30mn² − 34n³.                              (17)

From (13) and (17) it follows that the computational efficiency of Algorithm 2 approaches one for very large m, i.e. lim_{m→∞} T_BG(m, n)/(p · T_P(m, n, p)) ≈ 1. This does not take into account, however, the inter-processor communication. Table 2 shows the execution times and actual (and, in brackets, the theoretical) efficiency of Algorithm 2 for some l and n, where l = m/n − 2. The experimental results confirm the theoretical complexities. Note that the communication time increases with the number of processors, which affects the efficiency of the algorithm for small size problems. Furthermore, Algorithm 2 is scalable in the sense that the efficiency remains constant when the size of the problem m, and consequently l, is multiplied by 2√2 and the number of processors p is doubled. Fig. 2 shows the scalability plots based on the theoretical and computational results presented in Table 2. The model size indicates the number of blocks (m/n − 2), where each block has dimension n × n.

Fig. 2. Theoretical and computational scalability plots based on the results in Table 2. (a) Theoretical efficiency, n = 25; (b) computational efficiency, n = 25; (c) theoretical efficiency, n = 50; (d) computational efficiency, n = 50; (e) theoretical efficiency, n = 100; (f) computational efficiency, n = 100.

4. Conclusion

Computationally efficient sequential and parallel algorithms for computing the best linear unbiased estimator of the general linear model (1) have been proposed. The sequential algorithm is a block version of an efficient serial approach that employs as a main computational component the generalized QR decomposition [12]. The new block Givens algorithm exploits the triangular structure of the Cholesky factor C of the dispersion matrix Ω and is rich in BLAS-3 operations. It is found to be m/3n times faster than the corresponding LAPACK routine DGGGLM for estimating the GLM [1].

The parallel approach is based on the new sequential strategy. The parallel algorithm copies the augmented matrix X̃ and the main, sub- and super-block diagonals of C to all processors. The rest of the matrix C is evenly distributed across the processors. The algorithm duplicates parts of the computation. However, this is compensated for by the load-balanced distribution of the computationally expensive matrix C, resulting in minimal inter-processor communication. The algorithms have been implemented on a parallel computer with distributed memory. The theoretical complexities of both algorithms are stated and experimental results are


presented and analyzed. Overall, the parallel algorithm is found to be scalable and capable of solving large-scale GLM estimation problems, where m ≫ n. The minimal inter-processor communication of the algorithm makes it applicable to other parallel computers (shared memory or distributed memory) with different system specifications. That is, the algorithm is expected to have a similar efficiency on other distributed systems due to the high ratio of local computation to communication cost.


Currently, an adaptation of the parallel algorithm to estimate Seemingly Unrelated Regressions (a special class of the GLM which involves Kronecker structures) is being investigated [3,6,9,20,24].

Acknowledgements

The authors are grateful to Maurice Clint and the three anonymous referees for their valuable comments and suggestions.

Appendix

The theoretical complexities of computing the QRD of an m × n matrix and of applying the orthogonal matrix Q to an m × q matrix, using the corresponding LAPACK routines, are given, respectively, by

    QR(m, n) = 2n²(m − n/3)    and    AppQ(m, n, q) = 2qn(2m − n + 1).

In order to solve the GLM, the LAPACK routine computes the GQRD (4), i.e. it computes the QRD of the m × n matrix X̃, it then applies the orthogonal Q^T to the matrix C ∈ R^{m×m} and finally it computes the RQD of (Q^T C). Thus, the theoretical complexity of LAPACK is given by

    T_LP(m, n) = QR(m, n) + AppQ(m, n, m) + QR(m, m)
               = 2n²(m − n/3) + 2mn(2m − n + 1) + 2m²(m − m/3)
               = (4m³ + 12m²n − 2n³ + 6mn)/3.

This corresponds to the complexity given in (14). The sequential block Givens algorithm (Algorithm 1) takes advantage of the upper-triangular structure of C and solves the GLM in (k − 1) steps, where m = kn. At the ith step (i = 1, ..., k − 1) a GQRD of 2n × n and 2n × 2n matrices is computed (line 5, Algorithm 1). This computation takes T_LP(2n, n) flops. Then, the orthogonal matrix P_i is applied from the right to a (m − (i + 1)n) × 2n matrix (line 7, Algorithm 1). This computation takes AppQ(2n, 2n, m − (i + 1)n) flops. The computation for updating the vector y (line 9, Algorithm 1) is negligible and is therefore ignored. Finally, the total theoretical complexity of Algorithm 1 shown in (13) is calculated by:

    T_BG(m, n) = Σ_{i=1}^{m/n−1} ( T_LP(2n, n) + AppQ(2n, 2n, m − (i + 1)n) )
               = (m − n)(32n³ + 48n³ − 2n³ + 12n²)/3n + Σ_{i=1}^{m/n−1} 4n(m − (i + 1)n)(2n + 1)
               = (m − n)(26n² + 4n) + 2(m − n)(m − 2n)(2n + 1)
               = 4m²n + 14mn² − 18n³ + 2m² − 2mn.

The computations of the parallel algorithm (Algorithm 2) are dominated by the last processor (e.g. P_4 in Fig. 1). Similarly to the sequential algorithm, it solves the GLM in (k − 1) steps, where m = kn. At the ith step (i = 1, ..., k − 1) a GQRD of 2n × n and 2n × 2n matrices is computed (line 7, Algorithm 2). This computation takes T_LP(2n, n) flops. Then, the orthogonal P_i is applied from the right of a (n⌈(m − (i + 1)n)/np⌉ × 2n) matrix (line 11, Algorithm 2). This computation requires AppQ(2n, 2n, n⌈(m − (i + 1)n)/np⌉) flops. The computations for updating the vector y (line 12, Algorithm 2) are very small and are neglected. Thus, the total theoretical complexity of Algorithm 2 in (17) has been calculated by:

    T_P(m, n, p) = Σ_{i=1}^{m/n−1} ( T_LP(2n, n) + AppQ(2n, 2n, n⌈(m − (i + 1)n)/np⌉) )
                 = (m − n)(26n² + 4n) + 2(m − 2n)(m − 2n + pn)(2n + 1)/p
                 = (4m²n − 16mn² + 16n³ + 2m² − 8mn + 8n²)/p + 30mn² − 34n³ + 6mn − 8n².
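For reference, a small script (an illustrative addition, not part of the paper) that encodes the building blocks QR(m, n) and AppQ(m, n, q) and accumulates the step-by-step sums above, so the closed-form expressions and the tabulated ratios can be checked numerically.

```python
# Flop-count building blocks and step sums from the Appendix; illustrative check only.
from math import ceil

def QR(m, n):              # QRD of an m x n matrix
    return 2 * n**2 * (m - n / 3)

def AppQ(m, n, q):         # applying Q (from an m x n QRD) to an m x q matrix
    return 2 * q * n * (2 * m - n + 1)

def T_LP(m, n):            # LAPACK: QRD of (X y), apply Q^T to C, RQD of Q^T C
    return QR(m, n) + AppQ(m, n, m) + QR(m, m)

def T_BG(m, n):            # Algorithm 1: sum over the k - 1 steps, m = k*n
    return sum(T_LP(2 * n, n) + AppQ(2 * n, 2 * n, m - (i + 1) * n)
               for i in range(1, m // n))

def T_P(m, n, p):          # Algorithm 2: trailing block rounded up to n*ceil(./(n*p))
    return sum(T_LP(2 * n, n)
               + AppQ(2 * n, 2 * n, n * ceil((m - (i + 1) * n) / (n * p)))
               for i in range(1, m // n))

m, n, p = 3200, 25, 2
print(T_LP(m, n) / T_BG(m, n))            # about 42, roughly m/(3n), cf. Table 1
print(T_BG(m, n) / (p * T_P(m, n, p)))    # about 0.95, cf. the bracketed entries in Table 2
```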


References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[2] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
[3] P. Foschi, D. Belsley, E.J. Kontoghiorghes, A comparative study of algorithms for solving seemingly unrelated regressions models, Comput. Statist. Data Anal. 44 (1–2) (2003) 3–35.
[4] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, Maryland, 1996.
[5] G.G. Judge, W.E. Griffiths, R.C. Hill, H. Lütkepohl, T.C. Lee, The Theory and Practice of Econometrics, Wiley Series in Probability and Mathematical Statistics, second ed., John Wiley and Sons, 1985.
[6] E.J. Kontoghiorghes, Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems, Advances in Computational Economics, vol. 15, Kluwer Academic Publishers, Boston, MA, 2000.
[7] E.J. Kontoghiorghes, Parallel Givens sequences for solving the general linear model on a EREW PRAM, Parallel Algorithms Appl. 15 (1–2) (2000) 57–75.
[8] E.J. Kontoghiorghes, Parallel strategies for rank-k updating of the QR decomposition, SIAM J. Matrix Anal. Appl. 22 (3) (2000) 714–725.
[9] E.J. Kontoghiorghes, M.R.B. Clarke, An alternative approach for the numerical solution of seemingly unrelated regression equations models, Comput. Statist. Data Anal. 19 (4) (1995) 369–377.
[10] E.J. Kontoghiorghes, E. Dinenis, Computing 3SLS solutions of simultaneous equation models with a possible singular variance–covariance matrix, Comput. Econom. 10 (1997) 231–250.
[11] S. Kourouklis, C.C. Paige, A constrained least squares approach to the general Gauss–Markov linear model, J. Am. Statist. Assoc. 76 (375) (1981) 620–625.
[12] C.C. Paige, Numerically stable computations for general univariate linear models, Commun. Statist. Simulation Comput. 7 (5) (1978) 437–453.
[13] C.C. Paige, Computer solution and perturbation analysis of generalized linear least squares problems, Math. Comput. 33 (145) (1979) 171–183.
[14] C.C. Paige, Fast numerically stable computations for generalized linear least squares problems, SIAM J. Numer. Anal. 16 (1) (1979) 165–171.
[15] D.S.G. Pollock, The Algebra of Econometrics, Wiley Series in Probability and Mathematical Statistics, John Wiley and Sons, 1979.
[16] D.S.G. Pollock, A Handbook of Time-Series Analysis, Signal Processing and Dynamics, Signal Processing and Its Applications, Academic Press, 1999.
[17] C.R. Rao, H. Toutenburg, Linear Models: Least Squares and Alternatives, Springer Series in Statistics, Springer, 1995.
[18] S.R. Searle, Linear Models, John Wiley and Sons Inc., 1971.
[19] G.A.F. Seber, Linear Regression Analysis, John Wiley and Sons Inc., 1977.
[20] V.K. Srivastava, T.D. Dwivedi, Estimation of seemingly unrelated regression equations models: a brief survey, J. Econom. 10 (1979) 15–32.
[21] V.K. Srivastava, D.E.A. Giles, Seemingly Unrelated Regression Equations Models: Estimation and Inference, Statistics: Textbooks and Monographs, vol. 80, Marcel Dekker, Inc., 1987.
[22] P. Yanev, E.J. Kontoghiorghes, Efficient algorithms for block downdating of least squares solutions, Appl. Numer. Math. 49 (1) (2004) 3–15.
[23] P. Yanev, E.J. Kontoghiorghes, Parallel algorithms for downdating the QR decomposition, submitted for publication.
[24] A. Zellner, An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias, J. Am. Statist. Assoc. 57 (1962) 348–368.