Chemometrics and Intelligent Laboratory Systems, 2 (1987) 187-197
Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

Original Research Paper

Analysis of Two Partial-Least-Squares Algorithms for Multivariate Calibration

ROLF MANNE

Department of Chemistry, University of Bergen, N-5007 Bergen (Norway)
ABSTRACT

Manne, R., 1987. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems, 2: 187-197.

Two algorithms for multivariate calibration are analysed in terms of standard linear regression theory. The matrix inversion problem of linear regression is shown to be solved by transformations to a bidiagonal form in PLS1 and to a triangular form in PLS2. PLS1 gives results identical with the bidiagonalization algorithm by Golub and Kahan, similar to the method of conjugate gradients. The general efficiency of the algorithms is discussed.
INTRODUCTION

Partial least squares (PLS) is the name of a set of algorithms developed by Wold for use in econometrics [1,2]. They have in common that no a priori assumptions are made about the model structure, a fact which has given rise to the name "soft modelling" for the PLS approach. Instead, estimates of reliability may be made using the "jack-knife" or cross-validation (for a review of these techniques, see ref. 3). Although such reliability estimates seem to be essential in the description of the PLS approach [4], they are not considered here.

The PLS approach has been used in chemometrics for extracting chemical information from complex spectra which contain interference effects from other factors (noise) than those of primary interest [4-10]. This problem can also be solved by using more or less standard least-squares methods, provided that the collinearity problem is considered [11]. What is required by these methods is a proper method for calculating generalized inverses of matrices. The PLS approach, however, is so far only described through its algorithms, and appears to have an intuitive character. There has been some confusion about what the PLS algorithms do, what their underlying mathematics are, and how they relate to other formulations of linear regression theory.

The purpose of this paper is to place the two PLS algorithms that have been used in chemometrics in relation to a more conventional description of linear regression. The two PLS algorithms are called, in the terminology of chemometrics, PLS1 and PLS2 [6]. PLS2 considers the case when several chemical variables are to be fitted to the spectrum and has PLS1, which fits one such variable, as a special case. A third algorithm, which is equivalent to PLS1, has been suggested by Martens and Naes [7]. It differs from the latter in its orthogonality relations, but has not been used for actual calculations. It will be referred to here only in passing.

Some properties of the PLS1 algorithm have previously been described by Wold et al. [4], in particular its equivalence to a conjugate-gradient algorithm. This equivalence, however, has not been used further in the literature, e.g., in comparisons with other methods. Such comparisons have been made, particularly with the method of principal components regression (PCR), by Naes and Martens [12] and Helland [13]. A recent tutorial article by Geladi and Kowalski [14] also attempts to present the PLS algorithms in relation to PCR but, unfortunately, suffers from a certain lack of precision.

The outline of this paper is as follows. After the notation has been established, the solution to the problem of linear regression is developed using the Moore-Penrose generalized inverse. The bidiagonalization algorithm of Golub and Kahan [15] is sketched and shown to be equivalent to the PLS1 algorithm. Properties of this solution are discussed, in particular its relation to Lanczos' method for tridiagonalization of symmetric matrices [16]. The PLS algorithms consist of two steps, which may be called calibration and prediction. It will be shown that, in contrast to other approaches to linear regression, the matrix inversion is performed not in the calibration step but in the prediction step. The matrices inverted are triangular in PLS2 and bidiagonal in PLS1. Comparisons are made with PCR. In a final section, questions of the numerical efficiency of the PLS algorithms are discussed.

The Ulvik workshop made it clear that within chemistry and the geosciences there is growing interest in the methods of multivariate statistics but, at the same time, the theoretical background of workers in the field is highly variable. With this situation in mind, an attempt has been made to make the presentation reasonably self-contained.
NOTATION

The notation used for the PLS method varies from publication to publication. Further confusion is caused by the use of different conventions for normalization. In the following, we arrange the measurements of the calibration set in a matrix X = {x_{ij}}, where each row contains the measurements for a given sample and each column the measurements for a given variable. The number of samples (or rows) is given as n and the number of variables (or columns) as p. With the experimental situation in mind, we shall call the p measurements for sample i a spectrum, which we denote by x_i. For each sample in the calibration set there is, in addition, a "chemical" variable y_i which is represented by a column vector y. In the prediction step the spectrum x of a new sample is used to predict the value ŷ for that sample.

We use bold-face capital letters for matrices and bold-face lower-case letters for vectors. The transpose of vectors and matrices is denoted by a prime, e.g., y' and X'. A scalar product of two (column) vectors is thus written (a'b). Whenever possible, scalar quantities obtained from vector or matrix multiplication are enclosed in parentheses. The Euclidean norm of a column vector a is written ||a|| = (a'a)^{1/2}. The "Kronecker delta", δ_ij, is used to describe orthogonality relations. It takes the value 1 for i = j and 0 for i ≠ j.

CALIBRATION AND THE LEAST-SQUARES METHOD

The relationship between a set of spectra X and the known values y of the chemical variable is assumed to be

    y_i = b_0 + \sum_{j=1}^{p} x_{ij} b_j + \text{noise} \qquad (i = 1, 2, \ldots, n)    (1)

If all variables are measured relative to their averages, the natural estimate of b_0 is zero, and in this sense b_0 may be eliminated from eq. 1. We thus assume that the zero points of the variables y, x_1, x_2, ..., x_p are chosen so that

    \sum_{i=1}^{n} y_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_{ij} = 0 \qquad (j = 1, \ldots, p)    (2)

Eq. 1 may therefore be written as

    y_i = \sum_{j=1}^{p} x_{ij} b_j    (3)

or, in matrix notation,

    Xb = y    (4)
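As a concrete illustration of eqs. 2-4, the following is a minimal NumPy sketch (our own; the array names X_raw, y_raw and so on are hypothetical, not from the paper) of how a calibration set may be centred so that the model takes the matrix form of eq. 4.

    import numpy as np

    # Hypothetical calibration data: n samples (rows), p spectral variables (columns).
    rng = np.random.default_rng(0)
    n, p = 20, 50
    X_raw = rng.normal(size=(n, p))                                   # spectra
    y_raw = X_raw @ rng.normal(size=p) + 0.01 * rng.normal(size=n)    # "chemical" variable

    # Eq. 2: choose the zero points so that each column of X and the vector y
    # have zero mean over the calibration set.
    x_mean = X_raw.mean(axis=0)
    y_mean = y_raw.mean()
    X = X_raw - x_mean          # centred spectra
    y = y_raw - y_mean          # centred chemical variable

    # Eq. 4: the calibration problem is now X b = y (in the least-squares sense),
    # and b_0 is recovered afterwards as y_mean - x_mean @ b.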
For the estimation of b, eq. 4 in general does not have an exact solution. In standard least squares one instead minimizes the residual error ||e||^2 defined by the relationship

    e = y - Xb    (5)

This leads to the normal equations

    X'Xb = X'y    (6)

which reduce to eq. 4 for non-singular square matrices X. Provided that the inverse of X'X exists, the solution may be written as

    b = (X'X)^{-1} X'y    (7)

If the inverse does not exist, there will be non-zero vectors c_j which fulfil

    X'X c_j = 0    (8)

Then, if b is a solution to eq. 6, so is b + \sum_j \lambda_j c_j (\lambda_j arbitrary scalars). This may be expressed by substituting for (X'X)^{-1} any generalized inverse of X'X. A particular such generalized inverse is chosen as follows. Write X as a product of three matrices:

    X = URW'    (9)

where U and W are orthogonal and of dimensions n x a and p x a, respectively, R is of dimension a x a and non-singular, and a is the rank of the matrix X. That is,

    U'U = W'W = 1    (10)

where 1 is here the unit matrix of dimension a x a. A generalized inverse may be written as

    X^{-} = W R^{-1} U'    (11)

Truncations are commonly introduced by choosing the dimension of R equal to r, smaller than the rank a of X. As written (no truncation implied), X^{-} fulfils not only the defining condition for a generalized inverse, i.e.,

    X X^{-} X = X    (12)

but also

    X^{-} X X^{-} = X^{-}; \quad (X X^{-})' = X X^{-}; \quad (X^{-} X)' = X^{-} X    (13)

which define the Moore-Penrose generalized inverse (see, e.g., ref. 17). Insertion of eq. 9 into eq. 6 gives

    WW'b = W R^{-1} U'y    (14)

It may be shown [17] that the Moore-Penrose generalized inverse gives the minimum-norm solution to the least-squares problem, i.e.,

    b = WW'b = X^{-}y = W R^{-1} U'y    (15)

is the solution which minimizes (b'b) in the case that X'X is singular and eq. 6 has multiple solutions. The orthogonal decomposition 9 gives together with eq. 15 a simple expression for the residual error:

    e = y - Xb = (1 - URW'WR^{-1}U')y = (1 - UU')y    (16)

There are many degrees of freedom in the orthogonal decomposition 9. From the computational point of view, it is important that the inverse R^{-1} is simple to calculate. This may be achieved, e.g., with R diagonal or triangular. For a right triangular matrix one has R_{ij} = 0 for i > j. In that case, from the definition of the inverse it follows that

    \sum_{k < j} (R^{-1})_{ik} R_{kj} + (R^{-1})_{ij} R_{jj} = \delta_{ij}    (17)

or

    (R^{-1})_{ij} = \left[ \delta_{ij} - \sum_{k < j} (R^{-1})_{ik} R_{kj} \right] / R_{jj}    (18)

A bidiagonal matrix is a special case of a triangular matrix with R_{ij} = 0 except for i = j and either i = j - 1 (right bidiagonal) or i = j + 1 (left bidiagonal). From successive application of eq. 18 it follows that the inverse of a right triangular matrix (including the bidiagonal case) is itself right triangular. With a diagonal matrix R, i.e., with R_{ij} = 0 for i ≠ j, the decomposition 9 is known as the singular-value decomposition. The singular values, which are chosen to be >= 0, are the diagonal elements R_{ii}. Another decomposition of interest for matrix inversion is the QR decomposition, with R right triangular and W = 1, the unit matrix.
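The minimum-norm solution of eq. 15 and the residual of eq. 16 can be made concrete by taking the singular-value decomposition as the orthogonal decomposition 9. The following NumPy sketch is our own illustration (the function name min_norm_least_squares and the tolerance are ours, not from the paper); it is not the route taken by PLS, which is developed below.

    import numpy as np

    def min_norm_least_squares(X, y, tol=1e-10):
        """Minimum-norm least-squares solution b = W R^{-1} U'y (eq. 15),
        using the singular-value decomposition X = URW' with R diagonal."""
        U, s, Wt = np.linalg.svd(X, full_matrices=False)
        keep = s > tol * s[0]                  # drop (numerically) zero singular values
        U, s, Wt = U[:, keep], s[keep], Wt[keep, :]
        b = Wt.T @ ((U.T @ y) / s)             # W R^{-1} U' y
        e = y - X @ b                          # residual, eq. 16: (1 - UU') y
        return b, e

    # Example with collinear columns (X'X singular):
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 4))
    X = np.hstack([X, X[:, :2]])               # duplicate two columns
    y = rng.normal(size=10)
    b, e = min_norm_least_squares(X, y)
    b_ref = np.linalg.lstsq(X, y, rcond=None)[0]          # NumPy's minimum-norm solution
    print(np.allclose(b, b_ref), np.linalg.norm(X.T @ e) < 1e-8)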
MATRIX TRANSFORMATIONS - BIDIAGONALIZATION AND PLS1
Wold et al. [4] mention that the transformation step of the PLS1 algorithm is equivalent to a conjugate-gradient algorithm given by Paige and Saunders [18]. This algorithm, which is applied to X, however, is not the general conjugate-gradient algorithm, which requires a symmetric X [19,20], but a bidiagonalization algorithm developed by Golub and Kahan [15] and called Bidiag2 by Paige and Saunders [18]. The detailed properties of Bidiag2, which in our view is the key to the understanding of the PLS1 algorithm, have so far not been exploited in the literature.

The Bidiag2 algorithm gives a decomposition of the type described in eq. 9. It can be described as follows: given w_1 = X'y/||X'y|| and u_1 = Xw_1/||Xw_1||, one obtains successive column vectors of the matrices U and W through

    w_i = k_i [X'u_{i-1} - w_{i-1}(w_{i-1}' X'u_{i-1})]    (19)

    u_i = k_i' [Xw_i - u_{i-1}(u_{i-1}' Xw_i)]    (20)

where k_i and k_i' are appropriate normalization constants. Eqs. 19 and 20 may be obtained from the general expressions

    Xw_i = \sum_{j=1}^{i} u_j (u_j' X w_i)    (21)

    X'u_i = \sum_{j=1}^{i+1} w_j (w_j' X' u_i)    (22)
which define the vectors u_i and w_{i+1}, respectively. The following properties are easily shown: (i) the sets {u_i} and {w_i} are orthonormal, (ii) eq. 21 gives (u_j' X w_i) = 0 for j > i, and (iii) eq. 22 gives (w_j' X' u_i) = 0 for j > i + 1. The decomposition of X according to eq. 9 therefore makes R_{ij} = (u_i' X w_j) a right bidiagonal matrix. Eqs. 21 and 22 may therefore be reformulated as

    Xw_i = u_i (u_i' X w_i) + u_{i-1} (u_{i-1}' X w_i)    (23)

    X'u_i = w_{i+1} (w_{i+1}' X' u_i) + w_i (w_i' X' u_i)    (24)

which are equivalent to eqs. 20 and 19, respectively.

There is a close relationship between the present bidiagonalization scheme and Lanczos' iteration scheme for making a symmetric matrix tridiagonal [16]. The latter is obtained by multiplying the two bidiagonalization equations together, e.g.,

    w_{i+1} = N [X'X w_i - w_i (w_i' X'X w_i) - w_{i-1} (w_{i-1}' X'X w_i)]    (25)

where N is a normalization constant. A similar equation applies to u_{i+1}. In the Lanczos basis {w_i} the matrix X'X is tridiagonal. It is further obvious from eq. 25 that the vectors {w_i} represent a Schmidt orthogonalization of the Krylov sequence {k_i; k_i = (X'X)^{i-1} w_1}. The relationship between the Krylov sequence and the vectors w_i generated by the PLS1 algorithm has previously been pointed out by Helland [13].

In order to show the equivalence between the bidiagonalization algorithm and PLS1, the calibration step of the latter can be written as follows:

Step 1: X_1 = X; y_1 = y.
Step 2: For i = 1, 2, ..., r (r <= a) do Steps 2.1-2.4:
  Step 2.1: w_i = X_i'y_i / ||X_i'y_i||
  Step 2.2: u_i = X_i w_i / ||X_i w_i||
  Step 2.3: X_{i+1} = (1 - u_i u_i') X_i
  Step 2.4: y_{i+1} = (1 - u_i u_i') y_i

The iteration in Step 2 may be continued until the rank of X_i equals zero (a = rank of X) or may be interrupted earlier using, e.g., a stopping criterion from cross-validation. The present description differs from that given, e.g., by Martens and Naes [7] only by the introduction of normalized vectors {u_i}. The latter write

Step 2.2a: t_i = X_i w_i

but have equations for the other steps that give results identical with our formulation.
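The calibration loop above translates directly into code. The following NumPy sketch is our own illustration (the name pls1_calibrate and the test data are ours, not from the paper); it runs Steps 2.1-2.4 and checks that R = U'XW comes out right bidiagonal and that {u_i} and {w_i} are orthonormal, as stated for the Bidiag2 decomposition.

    import numpy as np

    def pls1_calibrate(X, y, r):
        """PLS1 calibration, Steps 2.1-2.4, with normalized score vectors u_i."""
        Xi, yi = X.copy(), y.copy()
        U, W = [], []
        for _ in range(r):
            w = Xi.T @ yi
            w /= np.linalg.norm(w)              # Step 2.1
            u = Xi @ w
            u /= np.linalg.norm(u)              # Step 2.2
            Xi = Xi - np.outer(u, u @ Xi)       # Step 2.3: (1 - u u') X_i
            yi = yi - u * (u @ yi)              # Step 2.4: (1 - u u') y_i
            U.append(u)
            W.append(w)
        return np.column_stack(U), np.column_stack(W)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 8)); X -= X.mean(axis=0)
    y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=30); y -= y.mean()

    U, W = pls1_calibrate(X, y, r=5)
    R = U.T @ X @ W
    # R should be right bidiagonal: zero except on the diagonal and first superdiagonal.
    mask = np.triu(np.ones_like(R), k=2) + np.tril(np.ones_like(R), k=-1)
    print(np.allclose(R * mask, 0.0))            # True (up to rounding)
    print(np.allclose(U.T @ U, np.eye(5)), np.allclose(W.T @ W, np.eye(5)))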
The PLS decomposition of the X matrix into

    X = TP    (26)

where T is orthogonal and of dimension n x a and P is of dimension a x p, therefore corresponds in our notation to

    X = U(U'X) = U(RW')    (27)

The alternative algorithm of Martens and Naes gives a decomposition of X according to eq. 26, but with orthogonal rows in the matrix P [7]. These rows are, apart from a normalization factor, identical with those of the matrix W' defined here. In our notation this algorithm corresponds to the decomposition

    X = (XW)W' = (UR)W'    (28)

The equivalence of this algorithm with the ordinary PLS1 algorithm has been pointed out by Naes and Martens [12] and shown in detail by Helland [13].

In the PLS1 algorithm the orthogonality of {u_i} follows from Steps 2.2 and 2.3 through induction. One may then reformulate Step 2.3 as

Step 2.3b: X_{i+1} = (1 - \sum_{k=1}^{i} u_k u_k') X

Step 2.1 may be simplified into

Step 2.1b: w_i = X_i'y / ||X_i'y||

Step 2.4 is thus shown to be unnecessary. Sjöström et al. [5] updated y by

    y_{i+1} = (1 - c_i u_i u_i') y_i    (29)

with specifications for the parameter c_i, called the inner PLS relation. This updating therefore has no effect on the result. Later publications which use the "inner PLS relationship" [6,14] have, in fact, the same updating expression as used here in Step 2.4.

Comparing PLS1 with Bidiag2 one finds that Steps 2.2 and 2.3b of the former give the second step of the latter, eq. 20, since (u_k'Xw_i) = 0 for k < i - 1. In order to show the equivalence of the first step of Bidiag2, eq. 19, and Step 2.1b, we write the latter as

    w_{i+1} = X_i'(1 - u_i u_i') y / ||X_{i+1}'y||    (30)

The orthogonality between w_i and w_{i+1} is then obtained from

    w_i'w_{i+1} \propto w_i' X_i'(1 - u_i u_i') y \propto u_i'(1 - u_i u_i') y = 0    (31)

From

    X_{i+1}'y = X_i'y - X'u_i (u_i'y)    (32)

one thus obtains, by multiplication with w_i',

    ||X_i'y|| = (w_i' X'u_i)(u_i'y)    (33)

which, inserted back in eq. 30, gives the bidiagonalization equation 19 apart from, possibly, a sign factor.

MATRIX TRANSFORMATIONS - PLS2

The PLS2 algorithm was designed for the case when several "chemical" variable vectors y_k are to be fitted using the same measured spectra X. The "chemical" vectors are collected as columns in a matrix Y. The algorithm may be described as follows:

Step 1: X_1 = X; Y_1 = Y.
Step 2: For i = 1, 2, ..., r (r <= a) do Steps 2.1-2.4:
  Step 2.1: v_i = first column of Y_i
  Step 2.2: Repeat until convergence of w_i:
    Step 2.2.1: w_i = X_i'v_i / ||X_i'v_i||
    Step 2.2.2: u_i = X_i w_i / ||X_i w_i||
    Step 2.2.3: z_i = Y_i'u_i / ||Y_i'u_i||
    Step 2.2.4: v_i = Y_i z_i / ||Y_i z_i||
  Step 2.3: X_{i+1} = (1 - u_i u_i') X_i
  Step 2.4: Y_{i+1} = (1 - u_i u_i') Y_i

As for PLS1, the iteration may be continued until the rank of X_i is zero or may be stopped earlier (r < a). A code sketch of this loop is given at the end of this section. Several simplifications of the algorithm can be made. What is important in the present context, however, is that the u_i and w_i vectors form two orthonormal sets, and that the transformed matrix U'XW is right triangular. Also, it may be shown that PLS2 reduces to PLS1 when the Y matrix has only one column. In the latter case z_i from Step 2.2.3 has only a single element, equal to 1, and convergence of w_i is obtained in the first iteration.

The orthogonality u_i'u_j = δ_ij follows from Steps 2.2.2 and 2.3 in the same way as in PLS1. This in turn makes it possible to show that eq. 21 is valid also for PLS2, which proves the triangularity of the transformed matrix R = U'XW. Finally, the orthogonality w_i'w_j = δ_ij (i > j) may be established from

    w_j'w_i \propto w_j' X_i'v_i = w_j' X'(1 - \sum_{k<i} u_k u_k') v_i = 0    (34)

The orthogonality relationships of PLS2 are well established in the literature.
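The PLS2 loop can be sketched in the same way. The code below is our own illustration (the name pls2_calibrate, the convergence tolerance and the test data are ours, not from the paper); it iterates Step 2.2 to convergence and verifies that R = U'XW is right triangular.

    import numpy as np

    def pls2_calibrate(X, Y, r, tol=1e-12, max_inner=100):
        """PLS2 calibration, Steps 2.1-2.4, with normalized u_i, w_i, z_i, v_i."""
        Xi, Yi = X.copy(), Y.copy()
        U, W = [], []
        for _ in range(r):
            v = Yi[:, 0].copy()                      # Step 2.1: first column of Y_i
            w_old = np.zeros(X.shape[1])
            for _ in range(max_inner):               # Step 2.2: iterate until w_i converges
                w = Xi.T @ v; w /= np.linalg.norm(w)         # 2.2.1
                u = Xi @ w;  u /= np.linalg.norm(u)          # 2.2.2
                z = Yi.T @ u; z /= np.linalg.norm(z)         # 2.2.3
                v = Yi @ z;  v /= np.linalg.norm(v)          # 2.2.4
                if np.linalg.norm(w - w_old) < tol:
                    break
                w_old = w
            Xi = Xi - np.outer(u, u @ Xi)            # Step 2.3
            Yi = Yi - np.outer(u, u @ Yi)            # Step 2.4
            U.append(u); W.append(w)
        return np.column_stack(U), np.column_stack(W)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(25, 6)); X -= X.mean(axis=0)
    Y = X @ rng.normal(size=(6, 3)) + 0.1 * rng.normal(size=(25, 3)); Y -= Y.mean(axis=0)

    U, W = pls2_calibrate(X, Y, r=4)
    R = U.T @ X @ W
    print(np.allclose(np.tril(R, k=-1), 0.0))        # right triangular (zeros below diagonal)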
MATRIX INVERSION AND PREDICTION

For a sample with the spectrum x (row vector) but with unknown "chemical" composition, the predicted value of the chemical variable is obtained as (cf. eqs. 4 and 14)

    \hat{y} = (xb) = (x W R^{-1} U'y)    (35)

We write this expression here as for PLS1; the extension to PLS2, however, is trivial. Utilizing the fact that R^{-1} is right triangular both in PLS1 and in PLS2, the vector of regression coefficients can be written as

    b = \sum_{j} \sum_{i} w_i (R^{-1})_{ij} (u_j'y) = \sum_{j} d_j (u_j'y)    (36)

where w_j and u_j are columns of the matrices W and U, respectively. The substitution

    d_j = \sum_{i} w_i (R^{-1})_{ij}    (37)

or

    \sum_{k} d_k R_{kj} = w_j    (38)

simplifies for a bidiagonal matrix R to

    d_j = (w_j - d_{j-1} R_{j-1,j}) / R_{jj}    (39)

This equation makes it possible to calculate b with little use of computer memory, especially since also

    (u_j'y) = -(u_{j-1}'y) R_{j-1,j} / R_{jj}    (41)

The proof of eq. 41 is given in the next section.

In the following, we develop eq. 35 to yield the equations for prediction used in the PLS literature. A basic feature of these equations is that the regression vector b is never explicitly calculated. For this reason, the predicted value is written as

    \hat{y} = (xb) = \sum_{j} (x d_j)(u_j'y) = \sum_{j} h_j (u_j'y)    (42)

where h_j = (x d_j); from eq. 39 one obtains

    h_j = \left[ (x w_j) - \sum_{k<j} h_k R_{kj} \right] / R_{jj}    (43)

These recursions are illustrated in a short code sketch at the end of this section. The expressions above differ from those given by Martens and Naes [7] only in the normalization. The latter write

    \hat{y} = \sum_{j=1}^{r} \hat{t}_j q_j    (44)

with (cf. PLS1 Step 2.2a)

    q_j = (t_j'y)/(t_j't_j) = (u_j'y)/(u_j'Xw_j)    (45)

    \hat{t}_j = \hat{x}_j w_j, \qquad \hat{x}_j = x - \sum_{k<j} \hat{t}_k p_k    (46)

and

    p_k = t_k'X_k /(t_k't_k) = u_k'X_k /(u_k'X_k w_k)    (47)

i.e.,

    \hat{t}_j = (x w_j) - \sum_{k<j} \hat{t}_k R_{kj} / R_{kk}    (48)

From the identification \hat{t}_j = h_j R_{jj} in eq. 48, it follows that the prediction equation 44 of Martens and Naes [7] gives the same result as eq. 42.

Wold et al. [6] and Geladi and Kowalski [14] give expressions for prediction containing the "inner PLS relationship". They write for the PLS1 case

    \hat{y} = \sum_{j=1}^{r} c_j \hat{t}_j q_j    (49)

The need for the inner relationship comes from the normalization of q_j to |q_j| = 1 used by these authors. A detailed calculation shows that c_j q_j in eq. 49 equals q_j as defined by Martens and Naes [7]. Also for PLS2, the inner relationship of eq. 49 gives results identical with those of Martens and Naes [7]. On the other hand, we believe that Sjöström et al. [5] make an erroneous use of the "inner PLS relationship" in the prediction step. These authors make still another choice of normalization. Also in other respects their prediction equations indicate an early stage of development.
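To make the two prediction routes concrete, the following sketch (our own; the helper predict_latent and the data are hypothetical, not from the paper) computes a prediction once through the explicit regression vector b of eqs. 36 and 39, and once through the recursion of eqs. 42-43 in which b is never formed; the two results agree.

    import numpy as np

    # Recreate a small calibration set and PLS1 factors (same loop as in the
    # earlier sketch; repeated here so that this example is self-contained).
    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 8)); X -= X.mean(axis=0)
    y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=30); y -= y.mean()

    U, W, Xi, yi = [], [], X.copy(), y.copy()
    for _ in range(5):
        w = Xi.T @ yi; w /= np.linalg.norm(w)
        u = Xi @ w;   u /= np.linalg.norm(u)
        Xi -= np.outer(u, u @ Xi); yi = yi - u * (u @ yi)
        U.append(u); W.append(w)
    U, W = np.column_stack(U), np.column_stack(W)
    R = U.T @ X @ W                             # right bidiagonal
    uy = U.T @ y                                # the scalars (u_j'y)

    # Route 1: explicit regression vector b (eqs. 36 and 39).
    D = np.zeros_like(W)
    for j in range(W.shape[1]):
        num = W[:, j] if j == 0 else W[:, j] - D[:, j - 1] * R[j - 1, j]
        D[:, j] = num / R[j, j]                 # d_j, eq. 39
    b = D @ uy                                  # eq. 36

    # Route 2: prediction without forming b (eqs. 42-43).
    def predict_latent(x):
        h = np.zeros(R.shape[0])
        for j in range(R.shape[0]):
            h[j] = (x @ W[:, j] - sum(h[k] * R[k, j] for k in range(j))) / R[j, j]
        return float(h @ uy)                    # eq. 42

    x_new = X[0]                                # an illustrative prediction spectrum
    print(np.isclose(x_new @ b, predict_latent(x_new)))   # the two routes agree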
Geladi and Kowalski, in their tutorial [14], discuss a procedure for obtaining "orthogonal t values" which we have so far chosen to overlook. This procedure, however, is said to be "not absolutely necessary". What is described is a scaling procedure for the vectors p_j, t_j and w_j so that

    p_j^{new} = p_j/||p_j||; \quad t_j^{new} = t_j ||p_j||; \quad w_j^{new} = w_j ||p_j||    (50)

It should be noted that both before and after this scaling the vectors t_j are orthogonal. The replacements in eq. 50 also scale the values of c_j and q_j appearing in eq. 49 but have no effect upon the predicted value ŷ.

STOPPING RULES FOR BIDIAGONALIZATION

It is well known that the Krylov sequence converges to the eigenvector with the numerically largest eigenvalue. This is the basis of the power method for finding eigenvalues of symmetric matrices. In the chemometrics literature the power method is often called NIPALS and attributed to Wold [21]. The method, however, is well established in numerical analysis and can, according to Householder [22], be traced back at least to the work of Müntz in 1913 [23]. A Krylov or, for that matter, a Lanczos basis thus becomes linearly dependent if the iteration is continued too far. In exact arithmetic this linear dependence would at some point give a vector w_{s+1} or u_s exactly equal to zero. Roundings prevent this from happening, however.

Like the Bidiag1 algorithm discussed in detail by Paige and Saunders [18], the Bidiag2 algorithm produces simple estimates of quantities that can be used in stopping rules for the iteration scheme. We write e_s for the residual error (5) obtained with a matrix R of s dimensions, i.e.,

    e_s = y_{s+1}    (51)

with y_{s+1} defined in Step 2.4, and

    ||y_{s+1}||^2 = ||y_s||^2 - (u_s'y)^2    (52)

As the iteration proceeds, ||y_s||^2 becomes smaller, and the stability of this quantity may be taken as a stopping criterion. Using Step 2.1 and eq. 21 we write

    (y'Xw_s) = ||X'y|| (w_1'w_s) = (y'u_s)(u_s'Xw_s) + (y'u_{s-1})(u_{s-1}'Xw_s)    (53)

Since (w_1'w_s) = 0 for s ≠ 1, one obtains the iteration equation (s > 1)

    (y'u_s) = -(y'u_{s-1})(u_{s-1}'Xw_s)/(u_s'Xw_s)    (54)

which may be used to evaluate the residual error (eq. 52). As mentioned above (eq. 41), eq. 54 may also be used in the evaluation of the regression coefficients. Another quantity of interest is the normalization integral ||X'e_s|| = ||X_{s+1}'y||, which appears in the denominator of Step 2.1 of PLS1. When this quantity approaches zero the iteration scheme becomes unstable. The equation

    ||X_{s+1}'y|| = ||X'y|| \prod_{k=1}^{s} |(u_k'Xw_{k+1})/(u_k'Xw_k)|    (55)

may be derived from eq. 30. The use of eqs. 52 and 55 and further criteria for stopping was discussed by Paige and Saunders [18]. These criteria are simple to evaluate and relate directly to the numerical properties of the PLS1 iteration scheme. For this reason, they may be of advantage as a complement to the cross-validation currently used.
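A minimal sketch of how these quantities might be monitored during the PLS1 iteration (our own illustration; the paper gives no code): the residual norm of eq. 52 and the recursion of eq. 54 are obtained from numbers already available in the loop, and ||X_s'y|| is the denominator of Step 2.1 referred to in connection with eq. 55.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(40, 10)); X -= X.mean(axis=0)
    y = X @ rng.normal(size=10) + 0.05 * rng.normal(size=40); y -= y.mean()

    Xi, yi = X.copy(), y.copy()
    res2 = float(y @ y)                          # ||y_1||^2
    u_prev, uy_prev = None, None
    for s in range(1, 7):
        xty = Xi.T @ yi
        norm_xty = np.linalg.norm(xty)           # ||X_s'y||; near zero => stop (cf. eq. 55)
        w = xty / norm_xty                       # Step 2.1
        t = Xi @ w
        R_ss = np.linalg.norm(t)                 # R_ss = (u_s'Xw_s)
        u = t / R_ss                             # Step 2.2
        uy = float(u @ y)                        # (u_s'y)
        if u_prev is not None:                   # eq. 54 reproduces (u_s'y) recursively
            uy_rec = -uy_prev * float(u_prev @ X @ w) / R_ss
            print("eq. 54 check:", np.isclose(uy, uy_rec))
        res2 -= uy ** 2                          # eq. 52: ||y_{s+1}||^2 = ||y_s||^2 - (u_s'y)^2
        print("s =", s, " ||X_s'y|| =", norm_xty, " residual norm^2 =", res2)
        Xi -= np.outer(u, u @ Xi)                # Step 2.3
        yi = yi - u * (u @ yi)                   # Step 2.4
        u_prev, uy_prev = u, uy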
COMPARISON OF BIDIAG2/PLS1 AND PCR/SINGULAR-VALUE DECOMPOSITION

It is frequently stated that, with the same number of factors w_i, PLS1 has a better predictive ability than principal components regression (PCR)/singular-value decomposition. This may be understood by expanding the PLS1 results in terms of the factors of the singular-value decomposition (= principal components). We write the latter as

    X = \sum_{i} g_i d_i f_i'    (56)

and obtain

    X'X = \sum_{i} d_i^2 f_i f_i'    (57)

The first PLS1 factor becomes

    w_1 = X'y/||X'y|| = \sum_{i} f_i c_i, \qquad c_i = d_i (g_i'y)/||X'y||    (58)

(no contribution from vectors f_i with d_i = 0). Application of the Lanczos equation (25) expanded according to eq. 57 yields

    w_2 \propto X'Xw_1 - w_1(w_1'X'Xw_1) = \sum_{i} f_i c_i [d_i^2 - (w_1'X'Xw_1)]    (59)

Partial summations over functions f_i with degenerate eigenvalues d_i give the same results for w_2 as for w_1. One may therefore show that the number of linearly independent terms in the sequence {w_i} is no greater than the number of eigenvectors with distinct eigenvalues contributing to the expansion of w_1. Almost degenerate or clustered eigenvalues coupled with finite numerical accuracy may make the expansion even shorter in practice. Compare this with PCR, where all eigenvectors with large eigenvalues are used irrespective of their degeneracy. These properties of PLS1 have been pointed out by Naes and Martens [12] and by Helland [13]. They are also well established in the literature on the conjugate gradient method (e.g., ref. 20).

The exclusion of vectors with d_i = 0 not only from the w_i but also from the u_i is the main advantage of Bidiag2 over the Bidiag1 algorithm used by Paige and Saunders in their LSQR algorithm [18]. The latter starts the bidiagonalization with u_1 = y, w_1 = X'y, and obtains a left bidiagonal matrix along similar lines to Bidiag2. The two algorithms generate the same set of vectors {w_i}, but Bidiag1 runs into singularity problems for least-squares problems, which, however, are solved by the application of the QR algorithm. The resulting right bidiagonal matrix is the same as that obtained directly from Bidiag2. We have not found any obvious advantage of this algorithm over the direct use of Bidiag2 as in PLS1.

In PCR the stopping or truncation criterion is usually the magnitude of the eigenvalue d_i. As discussed by Joliffe [24], this may lead to the omission of vectors f_i which are important for reducing the residual error ||e_s||^2, eq. 51. On the other hand, the Bidiag2/PLS1 algorithms or, equivalently, the Lanczos algorithm do not favour only the principal components with large eigenvalues d_i. Instead, it is our own experience from eigenvalue calculations with the Lanczos algorithm in the initial tridiagonalization step [25] that convergence is first reached for eigenvalues at the ends of the spectrum. That means in the present context that vectors with large and small eigenvalues are favoured over those with eigenvalues in the middle. The open question is the extent to which the principal components with small eigenvalues represent noise and therefore should be excluded from model building. Without additional information about the data there is no simple solution to this problem.

UNDERSTANDING PLS2
In this section we consider the triangularization algorithm in PLS2. At convergence, the iteration loop contained in Step 2.2 of the algorithm (see Matrix transformations - PLS2) leads to the eigenvalue relationships

    X_i'Y_i Y_i'X_i w_i = k_i^2 w_i    (60)

    Y_i'X_i X_i'Y_i z_i = k_i^2 z_i    (61)

where k_i^2, which are the numerically largest eigenvalues, may be evaluated from the product of normalization integrals in Steps 2.2.1-2.2.4. We interpret these relationships as principal-component relationships of the matrix X_i'Y_i. For each such matrix one thus obtains the principal component with largest variance. As mentioned before, the vectors w_i obtained in this way are mutually orthogonal, but the vectors z_i are not.

The matrix X_i'Y_i may be simplified to X'Y_i. Hence for each iteration, those parts of the column vectors of Y which overlap with u_i = X_i w_i/||X_i w_i|| are removed. Eventually, a u_i is obtained that has zero (or a small) overlap with all the columns of Y, and the iteration has to be stopped.

As for PLS1, it is of interest to relate the vectors w_i to the singular-value decomposition of X, eq. 56. We write eq. 60 as

    X'A_i X w_i = k_i^2 w_i    (62)

with A_i = Y_i Y_i'. Inserting eq. 56 and

    w_i = \sum_{j} f_j c_{ji}    (63)

we obtain

    \sum_{j} f_j d_j (g_j'A_i X w_i) = k_i^2 \sum_{j} f_j c_{ji}    (64)

or, using the orthogonality of the f's,

    k_i^2 c_{ji} = d_j (g_j'A_i X w_i)    (65)

From k_i > 0 and d_j = 0 it follows that c_{ji} = 0. Hence there is no contribution to w_i from a vector f_j with the singular value d_j = 0. In this way, we are assured that the vectors w_i of PLS2 form a subspace of the space spanned by {f_j; d_j ≠ 0}. The transformed matrix R = U'XW is therefore invertible.
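Eq. 60 says that the converged inner loop of Step 2.2 is a power iteration on X_i'Y_i Y_i'X_i. The short check below is our own illustration (the data and names are hypothetical): for the first PLS2 factor, the converged w coincides, up to sign, with the dominant eigenvector of X'YY'X.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(25, 6)); X -= X.mean(axis=0)
    Y = X @ rng.normal(size=(6, 3)) + 0.1 * rng.normal(size=(25, 3)); Y -= Y.mean(axis=0)

    # Inner loop of Step 2.2 for i = 1 (X_1 = X, Y_1 = Y).
    v = Y[:, 0].copy()
    for _ in range(200):
        w = X.T @ v; w /= np.linalg.norm(w)
        u = X @ w;  u /= np.linalg.norm(u)
        z = Y.T @ u; z /= np.linalg.norm(z)
        v = Y @ z;  v /= np.linalg.norm(v)

    M = X.T @ Y @ Y.T @ X                     # the symmetric matrix of eq. 60
    evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    w_dom = evecs[:, -1]                      # eigenvector with the largest eigenvalue
    print(np.isclose(abs(w @ w_dom), 1.0))    # w coincides with it up to sign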
DISCUSSION

The first result of this study is that both PLS algorithms, as given by Martens and Naes [7], yield the ordinary least-squares solution for invertible matrices X'X. The algorithms correspond to standard methods for inverting matrices or solving systems of linear equations, and the various steps of these methods are identified in the PLS algorithms. This result is likely to be known to those who know the method in detail. However, as parts of the PLS literature are obscure, and as even recent descriptions of the algorithms in refereed publications contain errors, it is felt necessary to make this statement. There are, however, no reasons to believe that the errors mentioned carry over into current computer codes.

The close relationship with conjugate-gradient techniques makes it possible to speculate about the computational utility of PLS methods relative to other methods of linear regression. As pointed out also by Wold et al. [4], the matrix transformations of Bidiag2 are computationally simpler than those of the original PLS1 method. Further, in the prediction step some saving would be possible using the equations given here. For problems of moderate size this saving will not, however, be large. For small matrices a still faster procedure for bi- or tridiagonalization is Householder's method. Savings by using this method would be important for both matrix inversion and matrix diagonalization (principal components regression). The real saving with methods of the conjugate-gradient type discussed here is for large and sparse matrices where the elements can only be accessed in a fixed order.

On the other hand, with present technology neither matrix inversion nor matrix diagonalization is particularly difficult, even on a small computer. The cost of obtaining high-quality chemical data for the calibration is likely to be much higher than the cost of computing. This puts a limit on the amount of effort one may want to invest in program refinement.

Compared with principal components regression/singular-value decomposition it is clear that PLS1/Bidiag2 manages with fewer latent vectors. Like PCR, the PLS methods avoid exact linear dependences, i.e., the zero eigenvalues of the X'X matrix. On the other hand, there is room for uncertainty in how PLS treats approximate linear dependences, i.e., small positive eigenvalues of X'X. Is it desirable to include such eigenvalues irrespective of the data considered? Detailed studies of this problem in a PCR procedure might lead to a cut-off criterion where the smallness of the eigenvalue is compared with the importance of the eigenvector for reducing the residual error.

The points where the PLS algorithms depart most from standard regression methods are the use of latent vectors (PLS factors) instead of regression coefficients in the prediction step, and the fact that the matrix inversion of standard regression methods is actually performed anew for each prediction sample. As is clear from the present work and also from that of Helland [13], the latter procedure is by no means a requirement. Once the latent vectors are obtained they may be combined into regression coefficients (eq. 36), i.e., into one vector giving the same predicted value as obtained with several PLS or PCR vectors. A possible use of the PLS factors would then be for the detection of outliers among samples supplied for prediction. For this purpose, a regression vector is insufficient as it spans only one dimension. On the other hand, there seems to be no guarantee that the space spanned by the PLS vectors is more suitable for this purpose than that spanned by principal components.

It seems as if the PLS2 method has few numerical or computational advantages, both relative to PLS1/Bidiag2 performed for each dependent variable y and relative to PCR. The power method of extracting eigenvalues, although simple to program, is inefficient, especially for near-degenerate eigenvalues.
In contrast to principal components analysis, the PLS2 eigenvalue problem changes from iteration to iteration, which makes the saving small if matrix diagonalization is used instead. As long as the number of dependent variables is relatively small, the use of PLS1 for each dependent variable may well be worth the effort.

In conclusion, it can be stated that the PLS1 algorithm provides one solution to the calibration problem using collinear data. This solution has a number of attractive features, some of which have not yet been exploited. It is an open question, however, whether this method is the optimal solution to the problem or not. For an answer one would have to consider the structure of the input data in greater detail than has been done so far.
ACKNOWLEDGEMENTS

Numerous discussions with Olav M. Kvalheim are gratefully acknowledged. Thanks are also due to John Birks, Inge Helland, Terje V. Karstang, H.J.H. MacFie, Harald Martens and an unnamed referee for valuable comments.

REFERENCES

1 H. Wold, Soft modelling. The basic design and some extensions, in K. Jöreskog and H. Wold (Editors), Systems under Indirect Observation, North-Holland, Amsterdam, 1982, Vol. II, pp. 1-54.
2 H. Wold, Partial least squares, in S. Kotz and N.L. Johnson (Editors), Encyclopedia of Statistical Sciences, Vol. 6, Wiley, New York, 1985, pp. 581-591.
3 B. Efron and G. Gong, A leisurely look at the bootstrap, jackknife and cross-validation, The American Statistician, 37 (1983) 37-48.
4 S. Wold, A. Ruhe, H. Wold and W.J. Dunn III, The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM Journal of Scientific and Statistical Computations, 5 (1984) 735-743.
5 M. Sjöström, S. Wold, W. Lindberg, J.-A. Persson and H. Martens, A multivariate calibration problem in analytical chemistry solved by partial least-squares models in latent variables, Analytica Chimica Acta, 150 (1983) 61-70.
6 S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjöström, Pattern recognition: finding and using regularities in multivariate data, in H. Martens and H. Russwurm, Jr. (Editors), Food Research and Data Analysis, Applied Science Publishers, London, 1983, pp. 147-188.
7 H. Martens and T. Naes, Multivariate calibration by data compression, in H.A. Martens, Multivariate Calibration. Quantitative Interpretation of Non-selective Chemical Data, Dr. techn. thesis, Technical University of Norway, Trondheim, 1985, pp. 167-286; K. Norris and P.C. Williams (Editors), Near Infrared Technology in the Agricultural and Food Industries, American Cereal Association, St. Paul, MN, in press.
8 T.V. Karstang and R. Eastgate, Multivariate calibration of an X-ray diffractometer by partial least squares regression, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 209-219.
9 A.A. Christy, R.A. Velapoldi, T.V. Karstang, O.M. Kvalheim, E. Sletten and N. Telnaes, Multivariate calibration of diffuse reflectance infrared spectra of coals as an alternative to rank determination by vitrinite reflectance, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 199-207.
10 K.H. Esbensen and H. Martens, Predicting oil-well permeability and porosity from wire-line geophysical logs - a feasibility study using partial least squares regression, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 221-232.
11 P.J. Brown, Multivariate calibration, Proceedings of the Royal Statistical Society, Series B, 44 (1982) 287-321.
12 T. Naes and H. Martens, Comparison of prediction methods for multicollinear data, Communications in Statistics - Simulations and Computations, 14 (1985) 545-576.
13 I.S. Helland, On the structure of partial least squares regression, Reports from the Department of Mathematics and Statistics, Agricultural University of Norway, 21 (1986), 44 pp.
14 P. Geladi and B.R. Kowalski, Partial least-squares regression: a tutorial, Analytica Chimica Acta, 185 (1986) 1-17.
15 G.H. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, SIAM Journal of Numerical Analysis, Series B, 2 (1965) 205-224.
16 C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, Journal of Research of the National Bureau of Standards, 45 (1950) 255-282.
17 C.R. Rao and S.K. Mitra, Generalized Inverse of Matrices and its Applications, Wiley, New York, 1971.
18 C.C. Paige and M.A. Saunders, A bidiagonalization algorithm for sparse linear equations and least squares problems, ACM Transactions on Mathematical Software, 8 (1982) 43-71.
19 M.R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, 49 (1952) 409-436.
20 M.R. Hestenes, Conjugate Direction Methods in Optimization, Springer, New York, 1980, p. 247 ff.
21 H. Wold, Estimation of principal components and related models by iterative least squares, in P.R. Krishnaiah (Editor), Multivariate Analysis, Academic Press, New York, 1966, pp. 391-420.
22 A.S. Householder, The Theory of Matrices in Numerical Analysis, Blaisdell Publ. Corp., New York, 1964; reprinted by Dover Publications, New York, 1975, p. 198.
23 C.R. Müntz, Solution directe de l'équation séculaire et des problèmes analogues transcendentes, Comptes Rendus de l'Académie des Sciences, Paris, 156 (1913) 43-46.
24 I.T. Joliffe, A note on the use of principal components in regression, Applied Statistics, 31 (1982) 300-303.
25 R. Arneberg, J. Müller and R. Manne, Configuration interaction calculations of satellite structure in photoelectron spectra of H2O, Chemical Physics, 64 (1982) 249-258.