Two space-saving algorithms for computing the permuted transpose of a sparse matrix

Two space-saving algorithms for computing the permuted transpose of a sparse matrix

~,'i: Advances in Engineering Software 17 (1993) 49 60 r Two space-saving algorithms for computing the permuted transpose of a sparse matrix Frank ...

959KB Sizes 0 Downloads 23 Views

~,'i:

Advances in Engineering Software 17 (1993) 49 60

r

Two space-saving algorithms for computing the permuted transpose of a sparse matrix Frank Cameron Tampere University of Technology, Control Engineering Laboratory, PO Box 692, 33101 Tampere, Finland Given an m × n sparse matrix A having na nonzeros and a permutation matrix P, we consider the problem of finding (PA) T in a space-saving manner. Two algorithms, TRANSPERM1 and TRANSPERM2, are presented. Both algorithms make use of the in-place inversion of permutation vectors. To save even more space TRANSPERM1 forms a piecewise linear model of the nonzero distribution. The computational complexity of TRANSPERM2 is O(na,n,m). For a particularly treacherous nonzero distribution we derive an upper bound for the complexity of TRANSPERM1 of O(na,'rnan/nt, m,n, nl), where nl is the number of intervals in the piecewise model and ~- < 1/2. These two algorithms were compared with Gustavson's efficient HALFPERM algorithm. 3 We expected our algorithms to be slower than HALFPERM since they were written with the intention of saving space. In tests on 21 matrices, HALFPERM was on average 2.5 times faster than TRANSPERM 1 and 1.6 times faster than TRANSPERM2. However, for these 21 matrices, TRANSPERMI's storage requirements were on average only 51% of HALFPERM's, and TRANSPERM2's requirements were only 67% of HALFPERM's. Key words." sparse matrix permuted transpose, sparse matrix algorithms, sparse matrix storage requirements, permutation vector inversion.

1 INTRODUCTION

F o r such storage schemes, if what we actually need is the permuted transpose, then it must be calculated. One such c o m m o n storage scheme is the sparse rowwise format. For this storage scheme, Gustavson 3 presented an elegant algorithm, H A L F P E R M , for finding (PA) T. The algorithms presented in this paper, T R A N S P E R M 1 and T R A N S P E R M 2 , can be viewed as space-saving alternatives to H A L F P E R M . In addition to calculating (PA) T, all three algorithms share the following common properties: (1) the columns of the input A need not be stored in order; (2) the columns of the output (PA) T will be stored in order; and (3) via two consecutive calls one can calculate P A Q -l, where P and Q are permutation matrices. The calculation of P A Q - t may be needed in some LU-decomposition implementations, as Gustavson has explained. If the purpose of using specialized formats to store sparse matrices is to minimize space requirements, it would be consistent then to try to manipulate sparse matrices in a space-saving manner. The algorithms presented here, T R A N S P E R M 1 and T R A N S P E R M 2 , are space-saving; in general they will require significantly less storage space than H A L F P E R M . As is often the case, there is a trade-off to be made between run time and space requirements. It is not surprising, then, that

One of the most basic and c o m m o n operations on a matrix is that of transposition, i.e. given A, find A T. This can be generalized somewhat to the problem of finding the permuted transpose: given A and a permutation matrix P find (PA) T. In this paper, two space-saving algorithms are presented for finding (PA) T. One might wonder whether it is actually necessary to explicitly calculate (PA) T. I f we have A stored in a twodimensional array, and P stored appropriately in a vector, p, then if we want the (i, j ) element of (PA) T, we simply access A(j,p(i)). The situation for sparse matrices is different. To save space, we usually store simply the nonzeros of the sparse matrix in some specialized storage format. Tewarson 1 (Ch. 1) and Duff et al. 2 (Ch. 2) discuss such storage formats. Standard storage formats usually have the property that we cannot directly access the elements simply by knowing the row and column indices. Such storage formats, while reducing the space needed to store the matrix, often make actually using the matrix in computations difficult. Advances in Engineering Software 0965-9978/93/$06.00 © 1993 Elsevier Science Publishers Ltd. 49

50

F. Cameron

T R A N S P E R M 1 and T R A N S P E R M 2 are slower than H A L F P E R M . However, they are not so much slower that they can be automatically rejected as alternatives, especially in cases where one is concerned about storage space requirements. At the heart of T R A N S P E R M 1 and T R A N S P E R M 2 is an algorithm for the in-place inversion of a permutation vector. That such is the case is no surprise; the intimate connection between transposition and permutation vector inversion has been known for some time. Those who have investigated the problem of the transposition of full matrices 5'6 have been able to make use of special properties of the permutation vector associated with the normal storage of a full matrix. To our knowledge, no one has yet tried to make use of permutation vector inversion when computing the transpose of a matrix stored in some sparse format. The paper is organized as follows. Section 2 contains a description of the variables used. T R A N S P E R M 1 and T R A N S P E R M 2 are described in sections 3 and 4. Some discussion of storage requirements and computational complexity is given in section 5. All three algorithms, T R A N S P E R M 1 , T R A N S P E R M 2 and H A L F P E R M , are tested in section 6. 2 VARIABLES

T R A N S P E R M 1 and T R A N S P E R M 2 calculate (PA) + when given A, an m × n matrix having na nonzeros, and P, an m × m permutation matrix. The same data structures are used here as in Gustavson, 3 and one is referred to his paper for more details about these structures. Matrix A is stored in sparse row-wise format as three vectors: ia, Ja and a. The vector ia, whose length is m + 1, is a pointer vector; i~(i) contains the position in j~ and a of the first nonzero of row i. The last element in i~, ia(m + 1), is equal to na + 1. The column indices of the nonzeros are contained in vector j~; the column indices of the nonzeros of row i are contained in j~(ia(i)), ja(ia(i) + 1),... ,j~(ia(i + 1) - 1). The vector a contains the actual values of the nonzeros. Both a and j~ have n a elements. For explanation purposes we define the row-wise storage vectors of (PA) T to be iat, Jat and a t. Gustavson's H A L F P E R M algorithm has vectors which correspond directly to i~t, J~t and at. T R A N S P E R M 1 and T R A N S P E R M 2 produce the contents of iat, Jar and at, but put some of the results in the vectors initially used to store A. Thus vectors explicitly corresponding to all of the three vectors, iat , Jat and at, do not appear in T R A N S P E R M 1 and T R A N S P E R M 2 . The ordering of the storage of the nonzeros of a matrix in row-wise format is of significance. We may speak of row-ordered and column-ordered storage. Let us assume we have an m × n matrix A stored in i a, j~ and a, as defined above. If the nonzeros are in column-ordered storage then

ja(ia(i)) < ja(ia(i ) + 1) < . . . < j a ( i a ( i + 1) - 1), for i : 1 , 2 , . . . , m , i.e. the column indices for all rows are in order. If the nonzeros are in row-ordered storage then i~(1) < ia(2 ) _< ... _< i~(m) <_ ia(m + 1), i.e. row 1 of A is stored first in j , and a, then row 2 and so on. The ordering of the inputs and outputs to T R A N S PERM1 and T R A N S P E R M 2 is as follows. Conceptually these algorithms take i~, ja and a as inputs and produce i a , J~t and at as outputs. The inputs are assumed to be row-ordered, but they need not be column-ordered. The outputs are both row-ordered and column-ordered. The permutation matrix P is stored as a vector, p, of length m with the following interpretation: the p(i)th row of A becomes the ith row of the product PA. T R A N S P E R M 1 also uses an integer work space vector, W, of length n, and a real parameter vector, C, of length nt. The significance of C and n t are explained in the next section. 3 TRANSPERMI

In this section we will describe how T R A N S P E R M 1 works. In doing this, we will make reference to Fig.B1 of Appendix B. We will refer often to the contents of some variables in this code; when doing so we use the variable names as they appear in the code. The contents of these variables are not at all points in the code synomous with the similar-looking variables described in section 2. On input, IA, JA and A0 of Fig. B1 are assumed to contain ia, Ja and a as defined above. On output the iat, J~t and a t vectors corresponding to (PA) T are contained in IAT, J A and A0. In Fig. B1, T R A N S P E R M 1 is divided into four stages. These stages will first be explained for the case when simply A T is being calculated, i.e. P = I. The explanation for a general permutation matrix then follows. Calculation of A T

Stage 1 is essentially the same as Steps 1 and 2 of Gustavson's H A L F P E R M algorithm. The row pointers of A T are contained in IAT. In stage 2 the position in Jar of a nonzero is calculated and placed into its old position in JA. For row i the counter, W(i), indicates how m a n y nonzeros in row i of A T have been placed. Consider the ( j + 1)th nonzero of the ith row of A: its corresponding position in ]a is ia(i ) + j . The row in A T corresponding to nonzero is Ja(a(i)+j). The position in Jat corresponding to this nonzero is iat(Jat(ia(i ) + j ) ) -- 1 + W(ja(ia(i ) = +j)). Statement 17 puts this position into the old position in Jh. At the start of stage 3 we can interpret JA(i) as follows: i is the position of some nonzero in Ja and JA(i)

Space-saving algorithms f o r the p e r m u t e d transpose o f a sparse m a t r i x

is the position of the same nonzero in Jar. Thus J A is a permutation vector. The procedure I N V P E R M 2 inverts a permutation vector, in this case JA, and reorders another vector, in this case A0. It is well known that permutation vector inversion can be accomplished in-place, 4 i.e. without additional storage. After inversion, JA(i) can be interpreted as follows: JA(i) is the position of some nonzero in Ja whose position in Jat is i. After 18 we can now search by some means for the corresponding row of A (column index of A T) from i~. This search is performed in stage 4. Stage 4 starts with calls to two procedures. For P = I the call to I N V P E R M has no effect on P. The call to P W L I N creates a piecewise linear model for the distribution of nonzeros amongst the rows and will be explained later. Let us consider the nonzero corresponding to JA(i). We must look for the column index in A v, i.e. the row index in A, of this nonzero. If we assume that this nonzero was in row k of A, then i~(k) < JA(i) _< i~(k + 1) - 1. We must search for the k that satisfies these inequalities. In statement 24 a 'first guess' for the row index is calculated based on the piecewise linear model and is placed in row_es. Statements 25 and 26 simply make sure this first guess is within sensible bounds. Assuming this first guess is incorrect, we must either search from earlier rows or later rows. In 27 and 28 we determine whether we must search from earlier (inc ~-- - 1 ) or later (inc +-- 1) rows. The testing of the inequalities and the incrementing is done in 29 and 30. The correct row index is assigned in statement 31.

Calculation of (PA) T We will now explain how the inclusion of a nontrivial permutation matrix, P ~ I, can be taken into account. Stages 1 and 3 are not affected by P, so we need concern ourselves only with stages 2 and 4. In stage 2 we calculate the positions in Jat of the nonzero. When finding A T we simply handled the rows of A in order in the loop starting at 9. N o w when finding (PA) T instead of handling the ith row of A, we wish to handle the ith row of PA, i.e. the p(i)th row of A. Hence we replace row i by row p(i) at 10. Let us consider stage 4. F o r the nonzero corresponding to JA(i) we must search for its column index in (PA) T, i.e. its row in PA. N o w if this nonzero was in row k, then JA(i) still obeys ia(k ) < JA(i) < ia(k + 1) - 1. However, we want v, where k = p(v). We could of course find k first and then search p for v. Let p* be the inverse of p where p is treated as a permutation vector. Then v = p*(k). Hence in stage 4, p is first inverted with the call to I N V P E R M at 20, and then the row index in PA is assigned using this inverted p at 31. To recover the original p, I N V P E R M is invoked again at 32.

51

Searching strategy In stage 4 we are faced with the following problem for each nonzero: given the nonzero's position in Ja, what is its corresponding row in A. This is the most timeconsuming part of T R A N S P E R M 1 and as such it deserves some attention. The strategy used in T R A N S P E R M 1 to try to reduce the time spent in searching is to model the distribution of nonzeros amongst the rows. Such a model will have the form r = f ( j , c)

(1)

where j is the position in ja, e is a vector of model parameters and r is the row index. The purpose of the model is to give us a good starting guess from which to start the row index search. There are a couple of dangers in including such a model. One is that if we need much storage space to estimate and store the parameters c, then we have at least partially defeated the purpose of T R A N S P E R M 1. Another danger is that if the model is too complex, then any computational savings we gain in reducing the time spent searching we lose in estimating e and in evaluating the model. Both these dangers suggest the use of a simple model with few parameters. With these factors in mind we decided to use a piecewise linear model. Let the nonzeros be broken up in nt intervals whose endpoints are given by ti: ti= l + (i-1)na/nt,

i= l,...,nt+

l

(2)

As our model is piecewise linear, on interval i the model has the form r = ci, 1 ~- Ci,2J ,

t i <_j < ti+ 1

The parameters ci,1 and Ci,2 conditions m i = ci, 1 q- ci,2ti,

where

are

(3)

found from interpolation

mi+ 1 = ci, 1 q- ci,2ti+l

(4)

comes from the condition ia(mi) < t i The procedure P W L I N , shown in Fig. B4, which is called at statement 19 of T R A N S P E R M 1 , calculates ci, j and ci,2, i = 1,... ,nl. Note that in P W L I N the parameters corresponding to ci, l of eqn (3) are integers and are given in variable IC. By restricting ci,1 to be an integer, we need no additional work space in T R A N S P E R M 1 for these parameters; they can be stored in the available work vector W. In T R A N S P E R M 1 then, the model eqn (3) is used at 24 to provide a good starting guess for the row index. For a particular position j from 22, the number of times we must increment in the while-loop of statement 29 depends upon the deviation of our model from the actual distribution as given by ia. I f for a given j the true row is given by k ( j ) , and the estimate from our model eqn (3) is given by ?(j), then the total number of mi

< ia(miq-1).

52

F. Cameron

increments in this while-loop is given by

available, whereas after exiting from T R A N S P E R M 1 or T R A N S P E R M 2 only (PA) T is available. As Cameron 7 shows, it is possible to make the 'structural' information about A, i.e. ia and j~, available upon exiting from T R A N S P E R M 2 . This requires reconstruction of Ja by placing a loop at the end of T R A N S P E R M 2 which essentially reverses loop 8-18. If what we actually want to calculate, however, is P A Q -1 where P and Q are permutation matrices, then there is no need to reconstruct j~.

na

St°t = Z

I ~ ( j ) -- k ( j ) [

(5)

j=l

In Appendix A an upper bound is derived for Stot for a particular distribution.

4 TRANSPERM2

In this section we will briefly describe T R A N S P E R M 2 , which is another algorithm for performing permuted transposition. T R A N S P E R M 2 is a hybrid algorithm, in that it combines features of both T R A N S P E R M 1 and Gustavson's H A L F P E R M . T R A N S P E R M 2 is shown in Appendix B in Fig. B5. Up to statement 18, T R A N S P E R M 2 is in fact the same as H A L F P E R M . In H A L F P E R M there would be the statement ' A T ( k l ) + - - A 0 0 ( j ) ' , which does not occur, however, in T R A N S P E R M 2 . Instead, at 18, the position in Jat of a nonzero is placed into its old position in JA. This is precisely what is done at 17 in T R A N S P E R M 1 . So after loop 8-18 in T R A N S PERM2, the contents of J A are the same as after Stage 2 of T R A N S P E R M 1 . A call to I N V P E R M 2 will now invert the permutation vector contained in JA, and reorder the contents of A0. We do not need the inverted JA, but the reordered A0 now contains at, in the same manner as in T R A N S P E R M 1 . After exiting T R A N S P E R M 2 the iat, Jat and a t vectors corresponding to (PA) T are contained in IAT, J A T and A0. Let us consider the differences between T R A N S PERM1, T R A N S P E R M 2 and H A L F P E R M . T R A N S P E R M 1 requires a search for row indices which neither T R A N S P E R M 2 nor H A L F P E R M has. Both T R A N S PERM1 and T R A N S P E R M 2 require inversion of a permutation vector of length n, and in addition T R A N S P E R M 1 inverts twice a permutation vector of length m. H A L F P E R M has no such inversions. Both T R A N S P E R M 2 and H A L F P E R M need the J A T vector which is not needed in T R A N S P E R M 1 . Only T R A N S P E R M 1 needs the work vectors W and C. Only H A L F P E R M needs a vector corresponding explicitly to a t. Upon exit from H A L F P E R M both A and (PA) T are

5 STORAGE REQUIREMENTS AND COMPUTATIONAL COMPLEXITY

In this section we will consider the storage space requirements and computational complexity of algorithms HALFPERM, TRANSPERM 1 and TRANSPERM2. Storage requirements

The arrays required by H A L F P E R M , T R A N S P E R M 1 and T R A N S P E R M 2 are given in Table 1. To compare the storage required by these routines, some assumption must be made regarding nt, i.e. the number of intervals in the piecewise linear model discussed in section 3. We will assume here that n / << na, so that the space required by the parameter vector e is negligible. In the tests we ran, presented in section 6, this was indeed the case. Let us consider a typical example to see how much storage the three algorithms require. We will consider a square matrix, n = m, with n >> 1, where the number of nonzeros is given by na = a.n, where a is the average number of nonzeros per row. We assume that the elements of the matrix are real, and that a real variable requires 4 bytes while an integer variable requires 2. From Table 1 we can calculate H A L F P E R M ' s storage requirement to be aproximately 6.n + 12"na = (6 + 12.c~)n bytes. For T R A N S P E R M 1 and T R A N S P E R M 2 the corresponding storage requirements are (8 + 6.a)n and (6 + 8-a)n bytes, respectively. For this case, then, the ratio of the storage space required by T R A N S P E R M 1 to that required by

Table 1. The arrays needed by HALFPERM, TRANSPERM1 and TRANSPERM2 for computing (PA) T where A is an m × n matrix having n a nonzeros

Name

I/O or work

TRANSPERM2

TRANSPERM 1

HALFPERM Size

Name

I/O or work

Size

Name

I/O or work

Size

p

I

m

p

I

m

p

I

m

IA JA A0

I I I O 0 0

m+l na na n+ 1 na na

IA JA A0 IAT W C

I I&O I&O O Work Work

m+l no no n+ 1 n nt

IA JA A0 IAT JAT --

I I&O I&O O 0 --

m+l na na n+ 1 no --

IAT JAT AT

Space-saving algorithms for the permuted transpose of a sparse matrix H A L F P E R M is (4 + 3.a)/(3 + 6.a), and the ratio of storage required by T R A N S P E R M 2 to that required by H A L F P E R M is given by (3 + 4-a)/(3 + 6.a). Even if there is on average only one nonzero per row, i.e. a = 1, we still save 23% storage space using either TRANSPERM1 or T R A N S P E R M 2 .

Computational complexity Gustavson 3 gives the computational complexity of H A L F P E R M as O(n~,n,m) where ' . . . t h e constants for n, m and na in the O notation are values [which] sum up to about 10 elementary operations...'. In analyzing the computational complexity of T R A N S P E R M 1 and T R A N S P E R M 2 , very little effort will be made to estimate the constants of terms inside the O notation. Corresponding constants could have been estimated for T R A N S P E R M 2 . To get similar constants for TRANSP E R M 1, however, would be difficult for two reasons: (i) T R A N S P E R M 1 contains floating point operations while T R A N S P E R M 2 and H A L F P E R M do not; hence assumptions would have had to be made regarding the relative cost of floating point and nonfloating point operations; (ii) T R A N S P E R M 1 depends upon the distribution of nonzeros, while T R A N S P E R M 2 and H A L F P E R M do not. In analyzing T R A N S P E R M 1 , then, we will make use of a kind of 'worst case' distribution and so try to obtain an upper bound for the computational complexity. Let us start with T R A N S P E R M 2 since it is fairly simple. Reference will be made to Fig. B5. The costs of loops 1, 2 - 4 and 6-7 are O(n),O(n~) and O(n), respectively. The loop 8-18 examines each nonzero; it simply does so row by row, hence the inner and outer loops. Loop 8-18 is then O(n~, m). Finally, from Knuth 4 (p. 175) we know that the complexity of the INVPERM2 algorithm at statement 20 is O(na). Hence the complexity of T R A N S P E R M 2 is O(n~, n, m). Let us now consider T R A N S P E R M 1 as given in Fig. B1. By observation, the complexities of stages 1 and 2 are O(na, n) and O(na,m ), respectively. As in TRANSPERM2 the complexity of the I N V P E R M 2 algorithm in stage 3 is O(n~). Hence, up to stage 3, the complexity of T R A N S P E R M 1 is O(n~, n, m). Stage 4 starts with the call to P W L I N which is shown in Fig. B4. Except for the loop 9-10, P W L I N is clearly O(nt). How often 9 and 10 are performed depends upon the error in estimating the next m;, where ia(mi) < t i < ia(m i + 1). In Cameron, 7 an upper bound is derived for the total number of times loop 9-10 must be performed for a treacherous distribution of nonzeros, similar to the distribution considered in Appendix A. This upper bound is n + n I if n a < ntm , and n otherwise. For a normal more friendly distribution of nonzeros, loop 9-10 will not need to be performed so often. Hence in general the computational complexity of P W L I N is O(nl, crn) where ~r < 1.

53

The call to P W L I N is followed by a call to I N V P E R M at statement 20. From Knuth 4 we know that this call to I N V P E R M has complexity O(m). In statement 32, p is reinverted, again with complexity

o(m). Finally we have then to consider the loop 21-31. Apart from loop 29-30, all other statements in loop 2 1 31 are performed n a times. How many times loop 29-30 is performed depends upon the estimation error of the piecewise linear model which is used in statement 24. Let Sto t be the total number of times loop 29-30 must be performed. In Appendix A a particularly treacherous nonzero distribution is considered and the following upper bound is derived:

I nna 2nl

(v~n(v~m--1) \ m(v~ + 2)

if nt < m a x l ,

--

S,o, <_

'

nl2 v -_

nno( n,)

~

1+ ~

2(v/-m+ 1) J otherwise (6)

where/3 is the matrix density, i.e. na = flnm. Note that loop 29-30 contains only nonfloating point operations. For the test matrices considered in section 6, nl values were chosen which were the same order of magnitude as the bound in eqn (6). The tradeoff implied by the second line of eqn (6) is clear; if we increase nt too much so that the condition in the first line of eqn (6) is not satisfied, then the upper bound o n Sto t increases by n/2. For a more conciliatory nonzero distribution, Sto t will be considerably less than the bounds in eqn (6). Hence, the complexity of the loop 21-31 is O(na, An, rnan/nt), where A _< 1/2 and r < 1/2. Collecting all terms, then, we can say that a reasonable upper bound for the complexity of T R A N S P E R M 1 is O(n~,rn~n/nl, m,n, nl), where the (rnan/nt) term applies to elementary, i.e. nonfloating point, operations, and r < 1/2.

6 TESTS Tests were run to see how fast T R A N S P E R M I and T R A N S P E R M 2 performed. In essence, we wanted to find answers to two questions: 1 How much slower than H A L F P E R M do TRANSPERM1 and T R A N S P E R M 2 run? 2 How does the nonzero distribution among the rows affect T R A N S P E R M l's run time? We expect T R A N S P E R M 1 and T R A N S P E R M 2 to be slower than H A L F P E R M , since they were written with the intention of saving space. Hence question 1 could be formulated as follows: 'How much must we

54

F. Cameron Table 2. T h e matrices used for testing; the matrix dimension is n x n

Matrix

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

n a

n

Distribution pattern

1000

500 500 500 700 700 700 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900

15 15 15 20 20 20 20 20 20 30 30 30 30 30 30 40 40 40 100 100 100

Storage required (kbytes) HALFPERM

TRANSPERM 1

TRANSPERM2

nl upper bound from eqn (6)

183 183 183 244 244 244 245 245 245 365 365 365 365 365 365 485 485 485 1205 1205 1205

94 94 94 126 126 126 127 127 127 187 187 187 187 187 187 247 247 247 607 607 607

123 123 123 164 164 164 165 165 165 245 245 245 245 245 245 325 325 325 805 805 805

18 18 18 16 16 16 12 12 12 19 19 19 19 19 19 28 28 28 93 93 93

a; almost uniform b; nonlinear c; 2 clumps a; almost uniform b; nonlinear c; 2 clumps a; almost uniform b; nonlinear c; 2 clumps a; almost uniform b; nonlinear c; 2 clumps c; 4 clumps c; 8 clumps c; 12 clumps a; almost uniform b; nonlinear c; 2 clumps a; almost uniform b; nonlinear c; 2 clumps

p a y in terms o f c o m p u t a t i o n time for saving some space?' The m o s t t i m e - c o n s u m i n g p a r t o f T R A N S P E R M 1 is stage 4 (Fig. B1), where searching for row indices is p e r f o r m e d . A s e x p l a i n e d in section 3 an a t t e m p t is m a d e to speed up this search by m o d e l l i n g the d i s t r i b u t i o n o f the n o n z e r o s a m o n g s t the rows. By using such a m o d e l , the speed o f T R A N S P E R M 1 will d e p e n d to a large extent u p o n h o w well the m o d e l fits the d i s t r i b u t i o n o f the nonzeros, a n d hence we pose question 2 above. T o try to answer these questions, a n u m b e r o f test matrices were generated. T a b l e 2 c o n t a i n s some i n f o r m a t i o n a b o u t these test matrices. W h e n c o m p a r e d to H A L F P E R M , storage r e q u i r e m e n t s o f T R A N S P E R M 1 a n d T R A N S P E R M 2 for these matrices are on average 49 a n d 33% less, respectively.

T h e test matrices o f T a b l e 2 were ' r a n d o m ' , b u t certain p a t t e r n s were i n t e n t i o n a l l y used in generating the matrices. These p a t t e r n s require a little e x p l a n a t i o n . T o get some sensible answer to question 2 above, it is necessary to test T R A N S P E R M 1 on matrices whose n o n z e r o d i s t r i b u t i o n s differ m a r k e d l y . Three different d i s t r i b u t i o n p a t t n e r s were used: (a) a d i s t r i b u t i o n which is a l m o s t uniform, i.e. the n u m b e r o f n o n z e r o s per row is close to being constant; (b) a d i s t r i b u t i o n which is n o n u n i f o r m , b u t r e a s o n a b l y s m o o t h ; a n d (c) a n o n u n i f o r m d i s t r i b u t i o n which is a p p r o x i m a t e l y piecewise

L

16000 ::patternb ::.

! , ." ~--~ /

12000

•:.... ..................

7

....

....

"?: :.,~.

.........

• .(.:e?;,;~2

.. •~...:~

8000 . . . . . .

.... ! , ~ " :. . . . . . . . :patterna :/ :.: : ? z:" 3 '...'

4000

¢

:

/

'

." " ~

100

i

x"

:

i

.

:

:

i ~ te~ ~.

200 300 row number

400

Fig. 1. Examples of nonzero distributions.

500

Fig. 2. Sparsity pattern of test matrix 3.

Space-saving algorithms for the permuted transpose of a sparse matrix

55

The run times of H A L F P E R M , T R A N S P E R M 1 and T R A N S P E R M 2 are shown in Table 3. All runs were performed on an HP 9000/720 computer using Fortran code. The fc-compiler developed by HP was used without any optimization options. In all runs, a nontrivial permutation matrix P was used, i.e. P was not equal to the identity matrix. By using a nontrivial P matrix we include the cost of having to find the inverse of the permutation vector p in T R A N S P E R M 1 . No such inversion p is needed in H A L F P E R M or TRANSPERM2. A number of expected observations can be made from Table 3: (1) the run times of H A L F P E R M , T R A N S PERM1 and T R A N S P E R M 2 increase with na; (2) H A L F P E R M is always the fastest while T R A N S PERM1 is always the slowest; (3) the run times of H A L F P E R M and T R A N S P E R M 2 do not depend upon the distribution; (4) increasing n t from 10 to 20 improves the run time of T R A N S P E R M I in all but 3 cases, matrices 3, 4 and 10; (5) for a fixed nl, the run times of T R A N S P E R M 1 for the nonuniform distributions b and c are usually slower than for the uniform distribution a; this is always true for nt = 10. Note that the run times for T R A N S P E R M 1 decrease in going from the 8-clump distribution of matrix 14, to the 12-clump distribution of matrix 15. This is due to the fact that the clumps were evenly spaced along the diagonal and hence the more clumps there are, the closer the distribution comes to being uniform. Let us return to the two questions posed at the start of this section. One response to question 1 is obtained by

uniform. Figure 1 shows examples of these different distribution patterns. In Fig. 1 the cumulative count of nonzeros is plotted against the row index. (The cumulative count of nonzeros up to and including row i is given by ia(i + 1) - 1.) Pattern (a) corresponds to the cumulative count being (almost) linear; for pattern (b) the cumulative count is nonlinear and for pattern (c) the cumulative count is approximately piecewise linear. Pattern (c) corresponds to the matrix having 'clumps' of nonzeros; for example, test matrix 3 is shown in Fig. 2. In generating the test matrices with clumps, the clumps were spread out along the diagonal-region, as in Fig. 2. Whether the clumps are in this diagonal region or not is of no consequence to T R A N S P E R M I , since in the cumulative count of nonzeros a clump will appear as a region of higher gradient regardless of its proximity to the diagonal. In T R A N S P E R M 1 it is the cumulative count of nonzeros - - as, for example, in Fig. 1 - - which we model. With these three different patterns we can get a reasonable idea of how sensitive T R A N S P E R M I ' s computation time is to the modelling of the nonzero distribution. We have one model parameter, nt, the number of intervals in the piecewise linear model. In addition to changing n, n a and the distribution pattern, two different values of nt were used, 10 and 20. The upper bound for n t given in eqn (6) is shown in Table 2. Recall that this upper bound was obtained for a 'treacherous' distribution of nonzeros. The upper bound is used here merely as an indicator for a reasonable order of magnitude for nv

Table 3. Run times for HALFPERM, TRANSPERMI and TRANSPERM2

(Matrix, pattern, n,/1000)

(1 ,a, 15)

(2,b, 15) (3,c, 15) (4,a,20) (5,b,20) (6,c,20) (7,a,20) (8,b,20) (9,c,20) (10,a,30) (11 ,b,30) (12,c,30) (13,c,30) (14,c,30) (15,c,30) (16,a,40) (17,b,40) (18,c,40) (19,a, 100) (20,b, 100) (21,c, 100)

Run times (ms) HALFPERM

60 56 52 74 70 74 70 72 74 114 104 106 122 124 108 158 162 158 382 386 372

TRANSPERM 1

TRANSPERM2

nl = 10

nl = 20

128 136 146 154 174 192 174 178 204 242 290 304 390 392 316 344 350 412 904 1030 1180

122

126 152 166 172 180 174 168 192 264 256 266 272 302 288 332 328 348 892 950 960

84 82 88 114 116 110 110 112 120 176 180 178 174 166 174 236 238 250 668 674 666

56

F. Cameron

computing the ratio of the computation time of H A L F P E R M to that of the other three alternatives in Table 3. If we calculate this ratio for all matrices and then take the mean, we find that TRANSPERM1 with n l = 10 takes on average about 2.6 times as long as H A L F P E R M , T R A N S P E R M 1 with nt = 20 takes 2-4 times as long, and T R A N S P E R M 2 takes about 1.6 times as long. To answer question 2, we must consider groups of matrices with the same n and no. For one such group and for a fixed value of nt, let us consider the ratio of the maximum run time to the minimum run time. Thus, for example, for nt = 10 and the group of matrices {1,2,3}, this ratio is 146/128 ~ 1-14. Considering all matrix groups, the maximum values of this ratio are 1"62 for n~ = 10, (matrices 10 and 14), and 1.25 for nt = 20, (matrices 1 and 3). The value of this ratio is on average smaller for nt = 20 than for nt = 10; hence for nt : 20 TRANSP E R M I ' s run time depends less upon the nonzero distribution.

REFERENCES

1. Tewarson, R.P. Sparse Matrices, Academic Press, New York, USA, 1973. 2. Duff, I.S., Erisman, A.M. & Reid, J.K. Direct Methods for Sparse Matrices, Oxford Science Publications, Clarendon Press, Oxford, UK, 1986. 3. Gustavson, F.G. Two fast algorithms for sparse matrices: Multiplication and permuted transposition, A C M Trans. Math. Softw., 1978, 4, 250-69. 4. Knuth, D.E. The Art of Computer Programming, Vol. 1, Fundamental Algorithms, Addison-Wesley, Reading, Massachusetts, USA, 1973. 5. Cate, E.G. & Twigg, D.W. Algorithm 513. Analysis of insitu transposition, A C M Trans. Math. Softw., 1977, 3, 104-10. 6. Brenner, N. Algorithm 467. Matrix transposition in place. Comm. ACM, 1973, 16, 692 4. 7. Cameron, F. Space-saving computation of the permuted transpose of a sparse matrix, Mathematics Report 60, Dept. of Electrical Engineering, Tampere University of Technology, Tampere, Finland, 1993.

APPENDIX A INTERPOLATION ERROR

7 CONCLUSIONS We have presented two space-saving algorithms, TRANSPERM1 and T R A N S P E R M 2 , for computing the permuted transpose of a sparse matrix. The savings in storage requirements arise essentially from an in-place inversion of a permutation vector. T R A N S P E R M 1 requires less space than TRANSPERM2, but it also requires a search for row indices which is not needed in TRANSPERM2. To speed up this search in T R A N S P E R M 1 , a model is made of the distribution of nonzeros amongst the rows. All the information needed about this distribution is included in the arrays used to store the sparse matrix. A simple piecewise linear model is used, which has one parameter, nt, the number of intervals. For reasonable n t, the storage required by the model will in general be negligible when compared with the storage required for typical sparse matrices. We compared T R A N S P E R M 1 and T R A N S P E R M 2 to the efficient H A L F P E R M algorithm of Gustavson. 3 For a square matrix with an average of a nonzeros per row, the ratio of the storage space required by T R A N S P E R M I to that required by H A L F P E R M is approximately ( 4 + 3.a)/(3 + 6.a). The corresponding ratio for T R A N S P E R M 2 is ( 3 + 4 . a ) / (3 + 6 . a ) . In tests, T R A N S P E R M 1 took on average 2"5 times as long as H A L F P E R M , while TRANSPERM2 took 1.6 times as long. The run times of both H A L F P E R M and T R A N S P E R M 2 depend only upon the size of the matrix and the number of nonzeros; T R A N S P E R M I ' s run time also depends upon the pattern of the nonzero distribution and on nl.

In this appendix we consider the interpolation error which occurs in Stage 4 of procedure T R A N S P E R M 1 of Fig. B1. Our purpose is to establish an upper bound for the sum of these interpolation errors for an unlikely and treacherous distribution of nonzeros. We start by defining the integers d and 6: d =- [ n J n l j = na/n I

(A.1)

6 =_ [n/ntJ = n/nl

(A.2)

We have assumed for simplicity that no and n are integral multiples of nt. As discussed in section 3 the nonzero distribution is approximated using a piecewise linear model. The treacherous and unlikely distribution we will consider for the first interval of this piecewise linear model is given by ia(K) =

1 + ( k - 1)m, d + 1,

k = 1 , 2 , . . . , L(d- 1)/mJ + 1 k = [ ( d - 1)/m] + 1 , . . . , m 2 (A.3)

where k is the row index, mi was defined via ia(mi) <_ ti < ia(mi + 1) and ti is given by eqn (2). What makes distribution (A.3) treacherous is that all d nonzeros in this first interval are maximally packed into the first [ ( d - l)/m] + 1 rows; the remaining rows [ ( d - 1)/m] + 2 , . . . ,m2 contain only zeros. Given j, the position in Ja, the actual row indices, k, implied by eqn (A.3) are k = 1 + Lj/mJ,

j=

1,...,d

(a.4)

Next we derive the model for the first interval. Using 6 from eqn (A.2), the row index m i corresponding to the interval breakpoint is given by mi

=

1 + ( i - 1)6

(A.5)

Space-saving algorithms for the permuted transpose of a sparse matrix From eqns (3), (4) and (A.5), the model for the first interval is

~(j) = 1 +jn/na,

1
$1 :

Z ~(j) -- k ( j ) j=l

~--- Z

eqn (A. 13) is J0(nl) = n/2(2 - m) + ntn(1 - 2/3 + 2/3m) -/32n2m

(A.6)

In deriving eqn (A.6) it was assumed that na > n. Now we consider the interpolation error, i.e. the difference between the row estimate from model (A.6) and the actual row index (A.4). For the first interval, the sum of these interpolation errors is $1, where

(A.7)

--

j=l

57

_< 0 Let the

(A.14)

roots

of a polynomial f

be given by

z i ( f ) , i = 1 , . . . , d e g ( f ) , where Izi(f)[ < Izi+l(f)l. It can be shown that zl(fo ) and z2(fo) are real. Then a sufficient condition for f0(nl) _< 0 is that n t < zl(f0). We will not analyzer0 of eqn (A. 14) directly, but two other polynomials that are more positive than f0. First let us consider fl:

fl(nl) =--nZ(-m/2) + nln(1 + 2tim) -/32nZm < 0

In forming eqn (A.7) the 'floor' of f was taken in accordance with statement 24 of TRANSPERM1 (see Fig. B1). By ignoring the floors, we can get an estimate of Sl:

(A.15) For m _> 4, fl ->Jo. Hence &(fo) >- z l ( f l ) , and nt <_zt(fl) implies fo(nt) <_O. For the smaller root of fl of eqn (A. 15) we can write

~-~ jn j _ (nm - na)(nl + na) j=l n--a m 2mn 2

(A.8)

zl(fl) = ( n ) [ 1 + 2/3m- (1 + 4/3m + 2m2/3) °5]

where d = na/nl has been used. Let the matrix sparsity be given by/3:

/3 = na/(nm)

(A.9)

Assuming nt <_/3na/(1 -/3), then an upper bound can be found for eqn (A.8):

nna nd2 -- Su, 81 < 2 n ~ - 2na

if nl < /3na - 1--S--~

(A.10)

> x/2n(x/2/3m - 1) = 01 -

(A.16)

m(2 + x/2)

Thus if nt _< 01, then fo(nt) <_0 and $1 < S~. It can be shown that n t < 01 implies d > m; hence the condition d > m is superfluous. Let us now consider the following polynomial:

f2(nl) =--n/2(2 - m) + nln(1 + 2/3m) -/32n2m

(A.17)

Since f2->fo, it follows that zl(Jo ) > zl(f2 ). Hence Our approach will be to see under what conditions we can find an upper bound of Su for $1. We first note that if d < m, then the following upper bound for $1 is easily obtained:

dt~aJJn - k j ] < nna( 1 + n~a) $1 =~--~ j=l ~

nt<_zl(f2) implies fo(nt)<0. The following upper bound can be derived for zl(f2): n zl(f2) = ( ~ - - ~ ) [ 1 + 2/3m- (1 + 4/3m + 8m/32)°5] n(2flx/~

ifd
-

-> 2(1+v~)

1)

-02

(A.18)

(A.11) The right-hand side of eqn (A.11) clearly holds as an upper bound for d _> m as well. Our goal is to find the conditions under which $1 < S~. Assuming d _> m, then

$1
~--'~lJJ j=l

nd 2 nd < 2na ~ 2na

(A.12)

(d - m) (d - m + 2) 2m

I nna 2n~ '

Comparing eqns (A.12) and (A.10), then a sufficient condition for $1 < S~ is

nd 2na

In deriving eqn (A.18) it was assumed that m > ( 1 - 4 / 3 2 ) -1, which is reasonable for sparse matrices. Thus nl _< 02 implies fo(nt) <_0 and $1 < S~. The bound nt _< 02 can be shown to imply d > m; hence d _> m is redundant. We now have two bound for nl, from eqns (A.16) and (A. 18). Either of these bounds on nt is sufficient to imply $1 < Su, so in general we may take the larger one:

( d - m) ~m- ( d - m + 2) _O

(A.13)

Using d = na/n l and eqn (A.9), an equivalent form of

if nl <

max

-v/2n( v'~/3m - 1) m(2 + v'2)

n ( 2 / 3 v ~ - 1)] 2(1 + x/~) ]

$1 <

nna(1 nl~,

~n~ ~ -I-ha/

otherwise (A.19)

58

F. C a m e r o n

The lower part of eqn (A. 19) comes from eqn (A. 11). In all this we have only been considering the first interval. Let Slot be the sum of all Si, i = 1 , . . . , n l. If we assume that for all other intervals the distribution is analogous to that of eqn (A.3), then Stot = ntSl, and the upper bound given in eqn (6) comes directly from eqn (A.19).

17.

JA(j) <---W(k) + IAT(k) - I end end

comment STAGE 3 Interpreting the contents of JA as a permutation vector, invert JA and reorder vector A0 according to the inverted JA. INVPERM2(JA, A0, ha)

18. APPENDIX B CODE

comment STAGE 4 In this stage we search for the rows of the nonzeros which are represented by positions in JA. To facilitate this search we first create a piecewise linear model for the distribution of nonzeros over the rows. PWLIN(IA, m, hi, C, W , d) c o m m e n t The permutation vector p is inverted. INVPERM(p, m) comment The loop in which the row indices are searched for. for i ~-- 1 until na d o begin j ~-- JA(i) comment Using the piecewise linear model, estimate the row index which corresponds to the position contained in j jintv ~ Lj/dl +1

p r o c e d u r e TRANSPERMI(m, n, na, JA, IA, IAT, A0, p, W, C, hi) comment Given an m×n matrix A in row-wise format, and the permutation matrix P as a vector,

1. 2.

3. 4. 5. 6. 7. 8.

9.

10. 1 1. 12. 13. 14. 15. 16.

TRANSPERM1 determines (PA) T in ordered rowwise format. begin comment STAGE 1. In this stage the pointer vector IAT is set up. for i ~- 1 until n+l do I A T ( i ) ~ - 0 for i ~ - 1 until n a d o begin k <---JA(i) IAT(k+I) ~-- IAT(k+I) + 1 end IAT(1) ~-- 1 for i ~-- 1 until n d o begin w(i) ~- 0 IAT(i+I) ~-- IAT(i+I) + IAT(i) end comment STAGE 2 Calculate the positions of the nonzeros in (PA) T, and put them in JA. for i ~-- 1 until m d o begin ii ~ p(i) m 1 ~- IA(ii) m2 <-- I A ( i i + I ) - I i f m 2 - m l > 0 then f o r j ~-- ml until m2 d o begin k <-- JAQ) W(k) <-- W(k) + 1

19.

20.

21. 22.

23. 24. 25. 26.

r o w _ e s ~-- W ( j i n t v ) + L c ( j int v) * jJ

if row es > m then row es <-- m if r o w es < 1 then r o w es ~-- 1 comment Set the diredion of the search increment, inc. ifj > I A ( r o w _ e s ) then inc <-- 1 else inc ~- - 1 while j < I A ( r o w _ e s ) o r j > I A ( r o w _ e s + l ) - 1 do begin row es ~-- row es + inc end JA(i) ~-- p(row_es) end comment Re-invert the permutation vector p. INVPERM(p, m)

27. 28. 29.

30. 31.

32. end

Fig. BI. Procedure TRANSPERM1 for permuted transposition of matrix.

Space-saving algorithms f o r the permuted transpose o f a sparse matrix procedure INVPERM(p, n) c o m m e n t This code inverts a permutation vector p of size n in place. begin c o m m e n t Search for a suitable starting index. i4-1 2. w h i l e p(i) = i and i < n 3. doi4-i+l 4. if i = n then r e t u r n c o m m e n t A suitable starting index has been found and is placed in variable m. m4- i 6. j 4-- p(i) c o m m e n t The loop where we go around the circuit graph. 7.circuit: k 4-- p(/) 8. p(j) 4-- - i 9. i4-j 10. j4- k 11. if i ~ m then g o t o circuit c o m m e n t When we reach the next statement, o

.

a circuit has been traced. Find starting point for new circuit if uncovered circuits still exist. 12.next: j 4-- p(m) 13. i f m > n then r e t u r n 14. ifj < 0 then begin 15. p(m) 4-- - j 16. m 4-m+l 17. g o t o next end else begin 18. ifj = m then begin 19. m 4-m+ 1 20. g o t o next end 21. i4- m 22. g o t o circuit end end

Fig. B2. Procedure INVPERM for in-place inversion of permutation vector.

procedure INVPERM2(p, q, n) c o m m e n t This code inverts a permutation vector p of size n in place. It also reorders the vector q using the same reordering as applied to p. begin 1. i4-1 2. while p(i) = i and i < n 3. doi4-i+ 1 4. i f i = n then r e t u r n 5. m 4- i 6. j 4 - P(i) 7. qO 4--- q(/) 8.circuit: k 4-- p(/) 9. p(/) 4-- - i 10. i 4-- j 11. j 4- k 12. ql 4-- q(i) 13. q(i) 4-- qO 14. q 0 4 - ql 15. if i # m then g o t o circuit

59

16.next: j 4-- p(m) 17. if m > n then r e t u r n 18. ifj < O then begin 19. p(m) 4- - j 20. m 4-m+l 21. goto next end else begin 22. ifj = m then begin 23. m 4-- m + 1 24. g o t o next end 25. i 4- m 26. q0 4-- q(i) 27. g o t o circuit end end

Fig. B3. Procedure INVPERM2 for in-place inversion of a permutation vector and reordering of an associated vector.

60

F. C a m e r o n

procedure PWLIN(IA, n, hi, C, IC, d) comment The parameters of a piecewise linear model are calculated. The model has the following form:

y = IC(j) + C(/)*x, for t(]) <= x < t(]+l), j = 1..... nl

L(IA(n)-

where t(j) = IA(I) + (j-l)* IA(1))/nl] = I A ( I ) + (j'-l)*d, f o r j = 1. . . . . nl and t(nl+l) = IA(n). The parameters IC(/') and C(j) are found from the following two interpolation conditions: m q ) = I C q ) + CQ)*t(j) m(j+l) = IC(]) + Cfj)*t(j+l) where IA(m(])) < tO) < IA(m(j)+l). It is assumed that ni < n and that IA(J) _


2. 3. 4.

begin d e-- [ ( I A ( n ) - I A ( 1 ) ) / n l J ianew 6-- IA(1) + d mold e-- 1 m n e w <--- LnlnlJ

for i <-- 1 until h i - 1 d o begin if m n e w > n then m n e w <--- n if I A ( m n e w ) > i a n e w then inc e-- - 1 else inc <--- 1 while I A ( m n e w ) > i a n e w or I A ( m n e w + l) < ianew do begin

.

6. 7. 8. 9.

10.

renews-- m n e w + inc

end C(i) ~-- ( m n e w - mold)/d IC(i) ~-- m o l d - L C ( i ) * ( i a n e w - d ) J

11. 12. 13. 14. 15.

ianew ~-- ianew + d m o l d ~-- m n e w renew ~-- IC(i) + [C(i)* ianew]

end comment The parameters for the last interval are calculated.

16. 17. 18.

iaold ~-- i a n e w - d C(nl) ~ - (n - mold)[(IA(n) - iaold) IC(nl) ~ m o l d - [C(nl)* iaoldJ

end

comment The parameters for the first h i - 1 intervals are calculated in the following loop.

Fig. B4. Procedure PWLIN for estimating parameters of piecewise linear model.

procedure TRANSPERM2(m, n, na, JA, IA, I A T , J A T , A0, p) comment Given an m×n matrix A in row-wise format, and the permutation matrix P as a vector, TRANSPERM2 determines (PA) T in ordered rowwise format.

.

2. 3. 4. 5. 6. 7.

begin for i <--- 1 until n+l do IAT(i) <-- 0 comment Set up pointer vector I A T . for i ~-- 1 until na do begin k e-- JA(i) IAT(k) <---IAT(k) + 1 end I A T ( n + I ) e - na + 1 - IAT(n) for i <--- n - 1 step - 1 until 1 d o begin IAT(i+I) <-- IAT(i+2) - IAT(i) end comment Determine vector J A T . Put positions of nonzeros in J A T into JA.

8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.

19.

20.

for i <-- 1 until m d o begin ii ~-- p(i) m 1 ,-- IA(ii) m2 <---I A ( i i + I ) - I if m 2 - m l > 0 t h e n f o r j <-- m l until m2 d o begin k ~-- JA(]) kl ~ I A T ( k + I ) J A T ( k l ) ~-- i I A T ( k + I ) ~-- kl + 1 JA(]) ~-- kl end end IAT(1) e-- 1 comment Interpreting the contents of JA as a permutation vector, invert JA and reorder vector A0 according to the inverted JA. INVPERM2(JA, A0, na) end

Fig. B5. Procedure TRANSPERM2 for computing the permuted transposition of a matrix.