Computer Aided Innovation of New Materials II M. Doyama, J. Kihara, M. Tanaka and R. Yamamoto (Editors) © 1993 Elsevier Science Publishers B.V. All rights reserved.
The UNIversal PArtial Least Squares, UNIPALS, algorithm for Partial Least Squares, PLS, regression

William J. Dunn III, Department of Medicinal Chemistry and Pharmacognosy, University of Illinois at Chicago, 833 S. Wood St., Chicago, Illinois 60680, USA

PLS regression is one of the most powerful data analytic methods to become available for the analysis of chemical data. It was developed by a chemist to be applied to measured data and has evolved into a method which can be applied to theoretical (computed) data as well. It is usually applied as the Nonlinear Iterative PArtial Least Squares, NIPALS, algorithm. Here an alternative, the UNIPALS algorithm, is discussed.

1. THE NIPALS ALGORITHM
PLS regression extracts latent variables, t and u, from two data blocks, X and Y, respectively, with the constraints that the latent variables lie along the axes of greatest variation in X and Y and are optimally correlated. The NIPALS algorithm for PLS regression was first discussed by Wold et al. (1) but was not well understood until a recent paper by Höskuldsson (2). The NIPALS algorithm for extracting the latent variables is shown below in algebraic form (2).

Start with u = the first column of Y.
1. w = X'u/(u'u)
2. Normalize w to length 1
3. t = Xw/(w'w)
4. c = Y't/(t't)
5. Normalize c to length 1
6. u = Yc/(c'c)

Figure 1. The NIPALS algorithm.
Once the latent variables have been computed, the loadings, p and q, and weights, w and c, are computed. The algorithm usually converges rapidly for matrices of small to moderate order. To show what the NIPALS algorithm does, Höskuldsson demonstrated that the loadings and weights at each stage in the extraction of the latent variables are eigenvectors associated with the largest eigenvalue of four matrices derived from X and Y. This is shown below in Figure 2 for the Y score, u.
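The NIPALS iteration of Figure 1 can be sketched as follows. This is an illustrative implementation, not the authors' BASIC code; the function and variable names are hypothetical, and centered data matrices X (n x m) and Y (n x k) are assumed.

```python
import numpy as np

def nipals_component(X, Y, tol=1e-10, max_iter=2000):
    """One pass of the NIPALS cycle of Figure 1: extract a single pair of
    PLS latent variables t, u with weights w, c and loadings p, q."""
    u = Y[:, 0].copy()                    # start with u = first column of Y
    for _ in range(max_iter):
        w = X.T @ u / (u @ u)             # 1. w = X'u/(u'u)
        w /= np.linalg.norm(w)            # 2. normalize w to length 1
        t = X @ w / (w @ w)               # 3. t = Xw/(w'w)
        c = Y.T @ t / (t @ t)             # 4. c = Y't/(t't)
        c /= np.linalg.norm(c)            # 5. normalize c to length 1
        u_new = Y @ c / (c @ c)           # 6. u = Yc/(c'c)
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    p = X.T @ t / (t @ t)                 # X loadings
    q = Y.T @ u / (u @ u)                 # Y loadings
    return t, u, w, c, p, q
```

At convergence, t satisfies XX'YY't = at, in line with the eigenvector interpretation developed in the text.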
u_s = Yc_s/(c'_s c_s)
    = YY't_s/[(c'_s c_s)(t'_s t_s)]
    = YY'Xw_s/[(c'_s c_s)(t'_s t_s)(w'_s w_s)]
    = YY'XX'u_{s-1}/[(c'_s c_s)(t'_s t_s)(w'_s w_s)(u'_{s-1} u_{s-1})]

Figure 2. Derivation of u. Similar derivations of t, w, and c can be written.

Thus NIPALS extracts the eigenvector associated with the largest eigenvalue of a matrix (2). Further, Höskuldsson (2) showed that t, u, w, and c are eigenvectors associated with a common eigenvalue, as shown below in Figure 3.
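The common-eigenvalue property can be checked numerically. The following sketch uses synthetic data (the variable names and dimensions are illustrative, not from the paper) and compares the largest eigenvalue of each of the four product matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 6))   # n x m data block
Y = rng.normal(size=(15, 4))   # n x k data block

# Largest eigenvalue of each of the four matrices built from X and Y
eigs = [
    np.linalg.eigvals(Y @ Y.T @ X @ X.T).real.max(),  # YY'XX' (n x n)
    np.linalg.eigvals(Y.T @ X @ X.T @ Y).real.max(),  # Y'XX'Y (k x k)
    np.linalg.eigvals(X @ X.T @ Y @ Y.T).real.max(),  # XX'YY' (n x n)
    np.linalg.eigvals(X.T @ Y @ Y.T @ X).real.max(),  # X'YY'X (m x m)
]
print(eigs)  # the four largest eigenvalues coincide
```

The agreement follows because the four matrices are cyclic permutations of the same product, and the nonzero eigenvalues of AB and BA always coincide.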
YY'XX'u = au   (n x n)
Y'XX'Yc = ac   (k x k)
XX'YY't = at   (n x n)
X'YY'Xw = aw   (m x m)

Figure 3. Equations showing that t, u, w, and c are eigenvectors associated with a common eigenvalue, a. The orders of the corresponding matrices are given in parentheses.

This suggests another algorithm for computing PLS models, which we have programmed in our laboratory. The algorithm and its advantages are discussed.

2. THE UNIPALS ALGORITHM

For PLS, UNIPALS (3,4) works as follows. Form the covariance matrix, D = X'Y, which is of order m x k. Since there are usually fewer y-variables than x-variables, D'D is smaller (k x k) than the association matrix, DD', which is m x m. D'D is the matrix Y'XX'Y of Figure 3. From D'D, compute the eigenvector, c, associated with the first eigenvalue using the power method (||c|| = 1.0). Then compute, in order, the following 7 steps:

1. u = Yc   (1)
2. w = X'u   (2)
3. t = Xw   (3)
w is then normalized to length 1. Then compute the loadings as:

4. X loadings as p = X't/(t't)   (4)
5. Y loadings as q = Y'u/(u'u)   (5)

From the inner relation b = t'u/(t't), form û = bt and from this update X and Y as:

6. E = X - tp'   (6)
7. F = Y - ûc' = Y - bXwc'   (7)

Note that in equation 7, Y is modeled in terms of the X matrix and the PLS output vectors. To compute the next component, form E'F as the updated X'Y and continue.
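One full UNIPALS cycle, as described above, can be sketched as follows: the power method on D'D yields c, and the seven update steps then follow as plain matrix-vector products. This is an illustrative reimplementation, not the authors' BASIC program; names are hypothetical and centered X (n x m) and Y (n x k) are assumed.

```python
import numpy as np

def unipals_component(X, Y, tol=1e-13, max_iter=2000):
    """Extract one PLS component by the UNIPALS scheme of equations (1)-(7)."""
    D = X.T @ Y                          # covariance matrix D = X'Y, m x k
    DtD = D.T @ D                        # D'D = Y'XX'Y, the smaller k x k matrix
    c = np.ones(DtD.shape[0])
    c /= np.linalg.norm(c)
    for _ in range(max_iter):            # power method for the first eigenvector
        c_new = DtD @ c
        c_new /= np.linalg.norm(c_new)   # keep ||c|| = 1.0
        if np.linalg.norm(c_new - c) < tol:
            c = c_new
            break
        c = c_new
    u = Y @ c                            # step 1, eq (1)
    w = X.T @ u                          # step 2, eq (2)
    w /= np.linalg.norm(w)               # w normalized to length 1
    t = X @ w                            # step 3, eq (3)
    p = X.T @ t / (t @ t)                # step 4, X loadings, eq (4)
    q = Y.T @ u / (u @ u)                # step 5, Y loadings, eq (5)
    b = t @ u / (t @ t)                  # inner relation
    u_hat = b * t                        # û = bt
    E = X - np.outer(t, p)               # step 6, eq (6)
    F = Y - np.outer(u_hat, c)           # step 7, eq (7)
    return t, u, w, c, p, q, b, E, F
```

A subsequent component would be obtained by calling the function again with E and F in place of X and Y, i.e. forming E'F as the updated X'Y.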
3. ADVANTAGES OF UNIPALS

The major advantage of UNIPALS is that it is much faster at extracting PLS components from moderately sized matrices. It uses the power method to compute the eigenvalues sequentially, and the subsequent loadings and scores are vector multiplications, which reduces the computing time required for a cycle. Even though the power method is an iterative procedure, it converges rapidly except when two eigenvalues are approximately the same; NIPALS also does not converge readily in this case. In addition, by selecting the smaller of D'D and DD' to operate on, UNIPALS gains a further advantage in speed. In the special case in which Y is a single column vector, c = 1.

A disadvantage of UNIPALS is that it cannot deal with missing data in X or Y, while NIPALS can handle missing data. This is less of a disadvantage when measured data, such as those from optical spectroscopy, are being analyzed.

UNIPALS has been programmed in BASIC and is used routinely at the University of Illinois at Chicago. A more detailed description of PLS and principal components analysis, PCA, has been published (3), as has a more detailed description of the algorithm (4).

REFERENCES

1. S. Wold, A. Ruhe, H. Wold and W. J. Dunn III, "The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses", SIAM J. Sci. Stat. Comput. (1984) 5, 735.
2. A. Höskuldsson, "PLS Regression Methods", J. Chemometrics (1988) 2, 211.
3. W. G. Glen, W. J. Dunn III and D. R. Scott, "Principal Components Analysis and Partial Least Squares Regression", Tetrahedron Computer Methodology (1989) 2, 349.
4. W. G. Glen, M. Sarker, W. J. Dunn III and D. R. Scott, "UNIPALS: Software for Principal Components Analysis and Partial Least Squares Regression", Tetrahedron Computer Methodology (1989) 2, 377.