CHAPTER 23

CALCULATING THE SOLUTION FOR REGRESSION TECHNIQUES: PART 3—PARTIAL LEAST SQUARES REGRESSION MADE SIMPLE
For the past three chapters we have described the most basic calculations for MLR, PCR, and PLS. Our intent is to show basic computations for these regression methods while avoiding unnecessary complexity that could confuse rather than instruct. There are, of course, a number of difficulties in taking this simplistic approach: the assumptions made for our simple cases do not always hold, and poorly behaved matrices are the rule rather than the exception. We have not yet discussed the concepts of rank, collinearity, scaling, or data conditioning. Issues of graphical representation, details of computational methods, and assessment of model performance are forthcoming. We ask that you bear with us over the next several chapters, as we intend to delve much more deeply into the details and problems associated with regression methods. For this chapter we illustrate the straightforward calculations used for PLSR utilizing the SVD. For PLS, a special case of the SVD is used. You will notice that the PLS form of the SVD includes the use of the concentration vector c as well as the data matrix A. The reader will note that the scores and loadings are determined using the concentration values for the PLS SVD, whereas only the data matrix A is used to perform the SVD for PCA. The SVD and PLS SVD will be the subject of several future chapters, so we only introduce their use here, not their derivation. All mathematical operations are completed using MATLAB software [1,2]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in Refs. [3–7]. You may wish to program these operations yourself or use other software to make the calculations routinely. As in our last installment, we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix.
For our example each row represents a different sample spectrum, and each column a different data channel, absorbance, or frequency. We arbitrarily designate A for our example as:

$$
\mathbf{A}_{rc} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = \mathbf{A}_{IK} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{23-1}
$$
Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2; and so on.

Chemometrics in Spectroscopy. https://doi.org/10.1016/B978-0-12-805309-6.00023-4
© 2018 Elsevier Inc. All rights reserved.
If we arbitrarily set our concentration vector c, representing a single component, to be a single column of numbers as:

$$
\mathbf{c}_{rc} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = \mathbf{c}_{I1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{23-2}
$$
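For readers following along outside MATLAB, the same A matrix and c vector can be set up in Python with NumPy; this is a sketch we add for convenience, not part of the original MATLAB workflow:

```python
import numpy as np

# Data matrix A: rows are sample spectra, columns are data channels (Eq. 23-1)
A = np.array([[1.0,  7.0,  9.0],
              [4.0, 10.0, 12.0],
              [6.0, 14.0, 16.0]])

# Concentration vector c: one value per sample (Eq. 23-2)
c = np.array([4.0, 8.0, 11.0])

print(A.shape)  # (3, 3): three samples by three channels
print(c.shape)  # (3,): one concentration per sample
```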
We now have both the data matrix A and the concentration vector c required to calculate the PLS SVD; both are necessary for this special case of the SVD. The operation performed in the PLS SVD is sometimes referred to as the PLS form of eigenanalysis, or factor analysis. If we perform the PLS SVD on the A matrix and the c vector, the result is three matrices: the left singular values (LSV) or U matrix; the singular values matrix (SVM) or S matrix; and the right singular values (RSV) or V matrix. We now have enough information to find our PLS Scores matrix and PLS Loadings matrix. First of all, the PLS Loadings matrix is simply the RSV or V matrix; this matrix is referred to as the P matrix in PCA and PLS terminology. The PLS Scores matrix is calculated as the data matrix A times the PLS Loadings matrix V:

$$
\mathbf{T} = \mathbf{A}\,\mathbf{V} \tag{23-3}
$$
Note: the PLS Scores matrix is referred to as the T matrix in PCA and PLS. Let us look at what we have completed so far by showing the PLS SVD calculations in MATLAB, as illustrated in Table 23-1. We can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed using the preselected number of factors (i.e., columns in our T and V matrices) as:

$$
\mathbf{A}(\text{estimated}) = \mathbf{T}\,\mathbf{V}' \tag{23-4}
$$
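The score and reconstruction algebra of Eqs. (23-3) and (23-4) can be sketched in Python with NumPy. Note the assumption here: we use the standard SVD of A as a stand-in decomposition, since the CPAC SVDPLS routine (which also involves c) is not reproduced in this chapter; the U, S, and V values differ from the PLS ones, but the reconstruction algebra is identical when all three factors are retained.

```python
import numpy as np

A = np.array([[1.0,  7.0,  9.0],
              [4.0, 10.0, 12.0],
              [6.0, 14.0, 16.0]])

# Stand-in decomposition: standard SVD of A (an assumption; the chapter's
# SVDPLS routine also uses c and yields different U, S, V values)
U, s, Vt = np.linalg.svd(A)
V = Vt.T

# Eq. (23-3): scores = data matrix times loadings
T = A @ V

# Eq. (23-4): reconstruct A from scores and loadings, all 3 factors kept
A_est = T @ V.T
print(np.allclose(A_est, A))  # True: full-rank reconstruction is exact
```

With fewer than the full number of factors, A_est becomes a lower-rank approximation of A rather than an exact reconstruction.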
The set of regression coefficients (i.e., the regression vector) is calculated as:

$$
\mathbf{b}(\text{regression vector}) = \mathbf{V}\,\mathbf{S}^{-1}\,\mathbf{U}'\,\mathbf{c} \tag{23-5}
$$
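Eq. (23-5) can be evaluated with the same stand-in factors. Because this example's A is square and full rank and all three factors are used, V S⁻¹ U′ c reduces to the exact solution of A b = c, regardless of which decomposition produced U, S, and V (a Python/NumPy sketch, not the original MATLAB):

```python
import numpy as np

A = np.array([[1.0,  7.0,  9.0],
              [4.0, 10.0, 12.0],
              [6.0, 14.0, 16.0]])
c = np.array([4.0, 8.0, 11.0])

# Stand-in for the PLS SVD factors (see text): standard SVD of A
U, s, Vt = np.linalg.svd(A)

# Eq. (23-5): b = V * inv(S) * U' * c, using all three factors;
# here S is diagonal, so its inverse is diag(1/s)
b = Vt.T @ np.diag(1.0 / s) @ U.T @ c

print(np.allclose(A @ b, c))  # True: b reproduces the concentrations
```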
The predicted or estimated values of c are computed as:

$$
\mathbf{c}(\text{estimated}) = (\mathbf{T}\,\mathbf{V}')\,\mathbf{b} \tag{23-6}
$$

This expression is equivalent to

$$
\mathbf{c}(\text{estimated}) = \mathbf{A}(\text{estimated})\,\mathbf{b} = \mathbf{A}\,\mathbf{b} \tag{23-7}
$$

or can be used to predict a single sample spectrum a using the expression:

$$
c(\text{estimated}) = \mathbf{a}(\text{estimated})\,\mathbf{b} = \mathbf{a}\,\mathbf{b} \tag{23-8}
$$
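The prediction steps of Eqs. (23-6) through (23-8) can be sketched the same way: once b is in hand, prediction is a matrix product (all samples at once) or a single row-times-vector product (one spectrum). In this Python/NumPy sketch we obtain b from the exact full-rank solution rather than from the CPAC routine, which gives the same result for this square, full-rank example:

```python
import numpy as np

A = np.array([[1.0,  7.0,  9.0],
              [4.0, 10.0, 12.0],
              [6.0, 14.0, 16.0]])
c = np.array([4.0, 8.0, 11.0])

# Regression vector; for this square, full-rank A with all factors
# retained, V*inv(S)*U'*c equals the exact solution of A b = c
b = np.linalg.solve(A, c)

# Eqs. (23-6)/(23-7): predict all samples at once
c_est = A @ b
print(np.allclose(c_est, c))  # True: no residual for this example

# Eq. (23-8): predict a single sample spectrum a (here, row 1 of A)
a = A[0]
print(np.allclose(a @ b, c[0]))  # True: predicts 4.0
```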
Now, using MATLAB at the command line, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations, as shown in Table 23-2.
Table 23-1 Matrix Operations in MATLAB to Compute the PLS SVD Calculations of Data Matrix A (See Eqs. 23-1 through 23-3)

≫ A = [1 7 9; 4 10 12; 6 14 16]   % Enter the A matrix
A =                                % Display the A matrix
     1     7     9
     4    10    12
     6    14    16

≫ c = [4; 8; 11]                   % Enter the c vector
c =                                % Display the c vector
     4
     8
    11

≫ [U, S, V] = SVDPLS(A, c, 3)     % Perform the PLS SVD on the A matrix and c vector.
                                   % This is a CPAC [7] version of the PLS SVD algorithm

≫ U                                % Display the U matrix, or left singular values (LSV) matrix
U =
    0.3817    0.9067    0.1797
    0.5451    0.0638    0.8359
    0.7465    0.4170    0.5186

≫ S                                % Display the S matrix, or singular values (SV) matrix
S =
   29.5796    0.2076    0.0000
    0.0000    1.9904    0.0367
    0.0000    0.0000    0.2038

≫ V                                % Display the PLS V matrix, or right singular values (RSV)
                                   % matrix [Note: this is also known as the P matrix or
                                   % PLS Loadings matrix]
V =
    0.2446    0.9345    0.2588
    0.6283    0.0506    0.7764
    0.7386    0.3525    0.5747

≫ T = A*V                          % Calculate the PLS Scores matrix, or T matrix
T =
   11.2894    1.8839    0.0034
   16.1236    0.0138    0.1680
   22.0801    0.6750    0.1210
Table 23-2 Matrix Operations in MATLAB to Compute Eqs. (23-4) Through (23-8)

≫ Aest = T*V'                      % Estimate the A data matrix
Aest =                             % Display the estimate for A
    1.0000    7.0000    9.0000
    4.0000   10.0000   12.0000
    6.0000   14.0000   16.0000

≫ b = V*inv(S)*U'*c                % Calculate the PLS regression vector [Note: the inverse
                                   % operation refers only to the singular values matrix S.
                                   % The calculation of b uses three columns in each of the
                                   % V, S, and U matrices; this number is equivalent to the
                                   % number of latent variables (or PLS factors) used]
b =                                % Display the regression vector
    1.1667
   -0.6667
    0.8333

≫ cest = (T*V')*b                  % Predict the concentrations [Note: this computation is
                                   % equivalent to Aest*b]
cest =                             % Display the concentration vector [Note: for this simple
    4.0000                         % example of PLS no residual (or difference) exists between
    8.0000                         % the predicted and the actual concentrations 4, 8, and 11]
   11.0000
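The regression vector in Table 23-2 can be checked independently: because A is square and nonsingular here, b is simply the exact solution of A b = c, which matches the tabulated magnitudes (the middle coefficient comes out negative). This Python/NumPy check is ours, not part of the original table:

```python
import numpy as np

A = np.array([[1.0,  7.0,  9.0],
              [4.0, 10.0, 12.0],
              [6.0, 14.0, 16.0]])
c = np.array([4.0, 8.0, 11.0])

# Exact solution of A b = c; equals 7/6, -2/3, 5/6
b = np.linalg.solve(A, c)
print(np.round(b, 4))      # 1.1667, -0.6667, 0.8333

# The predictions reproduce the actual concentrations with no residual
print(np.round(A @ b, 4))  # 4, 8, 11
```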
REFERENCES
[1] MATLAB Software, Version 4.2 for Windows, The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760-1500. Internet: [email protected].
[2] T.C. O'Haver, Teaching and learning chemometrics with MATLAB, Chemom. Intell. Lab. Syst. 6 (1989) 95.
[3] J. Workman, H. Mark, Chemometrics in spectroscopy—matrix algebra and multilinear regression, part I, Spectroscopy 8 (9) (1993) 16.
[4] J. Workman, H. Mark, Chemometrics in spectroscopy—matrix algebra and multilinear regression, part II, Spectroscopy 9 (1) (1994) 16.
[5] J. Workman, H. Mark, Chemometrics in spectroscopy—matrix algebra and multiple linear regression, part III: the concept of determinants, Spectroscopy 9 (4) (1994) 18.
[6] H. Mark, J. Workman, Chemometrics in spectroscopy: elementary matrix algebra and multiple linear regression: conclusion, Spectroscopy 9 (5) (1994) 22.
[7] Center for Process Analytical Chemistry, University of Washington, Seattle, WA, m-Script Library, 1993 (contact Mel Koch or Dave Veltkamp for current versions).