Chemometrics and Intelligent Laboratory Systems 55 (2001) 23–38

Variable and subset selection in PLS regression

Agnar Höskuldsson
Technical University of Denmark, Building 358, 2800 Lyngby, Denmark
Tel.: +45-4525-5643; fax: +45-4593-1577; e-mail: [email protected]

Received 1 May 1999; accepted 6 November 2000

Abstract

The purpose of this paper is to present some useful methods for introductory analysis of variables and subsets in relation to PLS regression. We present methods that are efficient in finding the appropriate variables or subset to use in the PLS regression. The general conclusion is that variable selection is important for successful analysis of chemometric data. An important aspect of the results is that lack of variable selection can spoil the PLS regression, and that cross-validation measures computed on a test set can vary more across different subsets of X than across different methods of analysis. We also present an approach to orthogonal scatter correction. The procedures and comparisons are applied to industrial data. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Variable selection; Partial Least Squares (PLS); Principal Component Analysis (PCA); H-principle; Stepwise regression; Orthogonal scatter correction (OSC)

1. Introduction

PLS regression has been found important in handling regression tasks where there are many variables. The theoretical basis for PLS regression is found in Ref. [1]. The basic idea of PLS regression is that we should maximize the covariance between a score vector in X-space and a score vector in Y-space, or equivalently maximize the size of the loading vector in Y-space derived from the score vector in X-space. In the applied situation we are given an N × K matrix X and an N × M matrix Y. When working with chemometric data the matrix X


can be large. When working with NIR (near-infrared) instruments, it is common that there are 1050 variables (K = 1050). These instruments are becoming popular because of their ability to measure samples without touching them. In Denmark, a society has been established to promote the use of NIR technology within the food industry and agriculture. There are also NIR instruments available that give around 8000 variables, and many other kinds of instruments may give a large number of variables. At chemometric and related conferences, there are frequent discussions on the topic of variable selection. The reason is partly that there are many proposals around and partly that people do not quite agree on which method should be used. The program packages in statistics, e.g., SAS, SPSS and BMDP, suggest stepwise regression, where the variables are



selected according to, e.g., the increase in R², adjusted R², AIC, Mallows' Cp and other criteria. For chemometric data, these procedures often lead to overfitting, because they only measure the degree of fit and do not take into account the prediction aspect of the model. It is possible to show that these measures are invariant to the size of the score vectors in the model. Therefore, these measures are only appropriate if we know the correct model, which seldom happens for chemometric data. In this work we search for intervals of variables that should be used in the analysis. The reason for finding intervals instead of individual variables is that it is useful to get as large score vectors as possible. Score vectors based on intervals are larger than ones based on single variables and thus give more stable predictions. The literature on variable selection is very large, but very little has been done on finding intervals. In Ref. [2] the variables are divided into a number of intervals and a PLS regression is carried out on each interval. By working with sufficiently many (or small) intervals, this approach is useful in finding the interval that we should work with. The importance of the present work is that the procedures can be used for automatic selection of intervals in company process environments.

The second part of the paper contains an approach to orthogonal scatter correction (OSC). The present approach is simple and it is easy to apply to new test samples. The MATLAB code in the Appendix also shows that it is easy to program the procedure and apply it in regression or other contexts. We shall start by illustrating the problems of variable selection by a case study.
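As a small illustration of the covariance criterion mentioned above, one PLS component for mean-centered data can be computed with a few MATLAB lines; the code is only a sketch, and the names X, y, w, t, p and q are chosen here for the illustration.

% One PLS component for mean-centered X (N x K) and y (N x 1).
% For a single response, the weight vector that maximizes the covariance
% between the score t = X*w and y is proportional to X'*y.
w = X' * y;
w = w / norm(w);          % unit-length weight vector
t = X * w;                % score vector in X-space
p = X' * t / (t' * t);    % X-loading
q = y' * t / (t' * t);    % regression coefficient of y on t
X = X - t * p';           % deflate X before the next component
y = y - t * q;            % deflate y

Further components are obtained by repeating these lines on the deflated X and y.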

2. A case study

2.1. NIR data in breweries

The data consist of 926 variables observed on 61 samples. The data are presented in Fig. 1, which shows that the samples represent smooth curves as functions of the variable number. At both ends the samples show larger variation. At the lower end we see larger variation in the curves, while at the higher end we see some instrumental noise. The response

Fig. 1. Plot of NIR data. 61 samples vs. 926 variables.

Table 1
Variation found in the first 10 components in PLS regression

No.   |ΔXi|²    Σ|ΔXi|²    |ΔYi|²    Σ|ΔYi|²    r(u,t)
1     85.3917   85.3917     5.7205    5.7205    0.239
2     10.0954   95.4871    40.2205   45.9410    0.653
3      1.2161   96.7032    38.1453   84.0863    0.840
4      0.3421   97.0453    12.8755   96.9618    0.899
5      0.2177   97.2630     1.8843   98.8461    0.788
6      0.1030   97.3661     0.9463   99.7924    0.906
7      0.0606   97.4266     0.1708   99.9631    0.907
8      0.0906   97.5172     0.0236   99.9867    0.799
9      0.1163   97.6335     0.0085   99.9952    0.800
10     0.0698   97.7033     0.0034   99.9986    0.838

variable is a quality measure of beer. Table 1 shows the results from a PLS regression with 10 components. The table shows that the first component selects much of X but not very much of Y. In fact, the first correlation coefficient, 0.239, is not significant from a statistical point of view. The description of Y basically starts at the second component. The last column shows that there is a high correlation between the reduced Y and the score vectors. Thus, statistically all 10 components are highly significant. The last three


components describe very little of Y, and cross-validation shows that they do not improve the prediction of the model. Thus, the last three components should not be used, although they show high correlation with the y-variable. It is also instructive to look at the plots of the first four u-vectors (reduced y's) against the score vectors, shown in Fig. 2. The figure shows that the first two score vectors are not linearly related to y, or at least the linear relationship is not good. We often see an inclination to different non-linearities when the first few components do not explain much of the variation in Y. The conclusion of this analysis is that the results are not satisfactory. We should in one way or another reduce or transform X before we carry out a PLS regression. Here we suppose that we only choose variable nos. 411 to 490, i.e., we select 80 variables in the middle of the data. We shall later see that this choice can be motivated by the correlations in the data. A PLS regression with 10 components gives the results in Table 2. Here we see that the first PLS components describe Y. A cross-validation shows that three or possibly four components should be used. Although all the 10 correlation coefficients are statistically signifi-

Fig. 2. Plot of the first four u-vectors vs. the score vectors.

Table 2
Variation found in the first 10 components in PLS regression, 80 variables

No.   |ΔXi|²    Σ|ΔXi|²    |ΔYi|²    Σ|ΔYi|²    r(u,t)
1     97.7140   97.7140    94.8634   94.8634    0.974
2      1.1889   98.9029     4.0663   98.9297    0.890
3      0.3405   99.2434     0.3394   99.2691    0.563
4      0.0596   99.3031     0.1704   99.4395    0.483
5      0.0690   99.3721     0.0639   99.5034    0.338
6      0.0356   99.4077     0.0823   99.5857    0.407
7      0.0380   99.4457     0.0555   99.6412    0.366
8      0.0436   99.4894     0.0664   99.7077    0.430
9      0.0437   99.5330     0.0647   99.7724    0.471
10     0.0325   99.5656     0.0517   99.8241    0.477

cant (the 99% limit with df = 60 is around 0.30), only the first three or four should be used. Fig. 3 shows the reduced y-vectors vs. the score vectors. It is similar to Fig. 2 except that here we use variable nos. 411 to 490. The first two score vectors now show a good linear relationship with the response variable. The conclusion of this case study is that it was successful to reduce the data to the selected 80 variables. A considerable improvement was obtained compared to using all the variables.
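The quantities reported in Tables 1 and 2 can be accumulated directly from the deflation steps of the PLS algorithm. The following MATLAB sketch assumes mean-centered X and y; for each component it prints the percentage of the variation removed from X and from y and the correlation between the reduced y and the score vector (the cumulative columns of the tables follow by summation).

% Per-component variation and correlation r(u,t), as in Tables 1 and 2.
A    = 10;                          % number of PLS components
ssX0 = sum(sum(X.^2));              % total variation in X
ssY0 = sum(y.^2);                   % total variation in y
for a = 1:A
    w = X' * y;  w = w / norm(w);               % weight vector
    t = X * w;                                  % score vector
    p = X' * t / (t' * t);                      % X-loading
    q = y' * t / (t' * t);                      % y-loading
    r_ut = (y' * t) / (norm(y) * norm(t));      % corr. of reduced y and t
    dX   = 100 * (t' * t) * (p' * p) / ssX0;    % % of X removed by t*p'
    dY   = 100 * (t' * t) * q^2 / ssY0;         % % of y removed by t*q
    fprintf('%2d  %9.4f  %9.4f  %6.3f\n', a, dX, dY, r_ut);
    X = X - t * p';                             % deflate X and y
    y = y - t * q;
end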

3. Selection of variables based on intervals

For some types of data it is useful to look at 'intervals' of variables that show high correlation to Y. When working with intervals, we get more stable PLS regressions than if we use a few selected variables. There are many ways to find intervals that are useful to work with. The steps we suggest are as follows:

(a) Find a variable that is 'good' to work with.
(b) Find an interval around the variable that should be used in the regression.
(c) Carry out the PLS regression based on the interval.
(d) Adjust X by the results obtained by this interval.
(e) Start over at (a) to find a good variable.

The procedure in (a) that we shall consider here is to find the variable that has the highest correlation with Y.

Fig. 3. Plot of u-vectors (reduced y's) vs. score vectors.


The interval around this variable is the one that contains significant correlations with Y. When we have found an interval, it amounts to writing X as X = (X1, X2, X3), where the interval corresponds to X2. PLS regression is carried out using X2 and Y. This amounts to finding the weight vector w2 and working with w = (w1, w2, w3), where w1 = 0 and w3 = 0. When the PLS regression using X2 and Y has been carried out, X and Y are adjusted by the score vectors found. With these reduced X and Y the procedure starts over again.

The procedure is best illustrated by an example. The data we shall consider are the ones from the previous section. There are 926 variables. For each variable we compute the squared correlation coefficient, r², with the response variable. Fig. 4 shows the 926 coefficients vs. the variable number. We have used here the square of the simple correlation coefficient, r, between a variable (a column in X) and y; other measures of correlation could also be used. The figure shows that there are a few 'intervals' where the values of the squared correlation coefficient are large. Values above the horizontal line


indicate statistical significance. The figure shows that the variables from 1 to about 280 are not significant. Also the variables above number 600 show only very little correlation. The large interval contains variable nos. 400 to 500. Following the procedure above, we should start with the variables X2 = (x400, x401, ..., x500). If we carry out the PLS regression of X2 and y, we get results similar to Table 2. X and y are then adjusted for the score vectors found, and we again plot the squared correlation coefficients, now for the reduced data. We then find that all correlations are close to zero and far below the horizontal line that indicates statistical significance. We may conclude that only X2 should be used in the PLS regression. We may 'tune' X2 a little by the method in the next section. In this example we only selected one interval, but the procedure can be continued by finding the next interval until we arrive at a situation like the one above, where all correlations are close to zero. NIR data are important in the analysis of food and agricultural data. This example shows that it is important to remove or eliminate the 'ends' of the data.

Fig. 4. Plot of squared correlation coefficients, r², vs. variable no.



If we allow the first and the last variables to be included, we do not get a satisfactory analysis. The company Foss Electric, Hillerød, Denmark, has had success with this type of procedure, where the variables in (a) are selected as principal variables, see Ref. [1], and the intervals in (b) are carefully selected.
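A rough MATLAB sketch of steps (a)-(e) is given below. The 1% significance limit for r², the simple rule for growing the interval, and the use of a single PLS component per round are assumptions made for the illustration; they are not part of the procedure as used in practice.

% Interval selection around the most correlated variable (sketch of Section 3).
% X (N x K) and y (N x 1) are mean-centered.
[N, K] = size(X);
r2lim  = (2.58 / sqrt(N))^2;                % rough 1% limit for r^2 (assumption)
for step = 1:5                              % a few rounds of steps (a)-(e)
    r2 = zeros(K, 1);                       % squared correlation per variable
    for k = 1:K
        r2(k) = (X(:,k)' * y)^2 / ((X(:,k)' * X(:,k)) * (y' * y) + eps);
    end
    [r2max, k0] = max(r2);                  % (a) best single variable
    if r2max < r2lim, break; end            % nothing significant is left
    lo = k0;  hi = k0;                      % (b) grow interval while significant
    while lo > 1 && r2(lo-1) > r2lim, lo = lo - 1; end
    while hi < K && r2(hi+1) > r2lim, hi = hi + 1; end
    X2 = X(:, lo:hi);                       % (c) PLS on the interval
    w2 = X2' * y;  w2 = w2 / norm(w2);      %     (one component, for brevity)
    t  = X2 * w2;
    p  = X' * t / (t' * t);
    q  = y' * t / (t' * t);
    X  = X - t * p';                        % (d) adjust X and y, (e) start over
    y  = y - t * q;
end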

4. Most correlated variables

The type of procedure in the previous section is efficient when we observe a few, relatively large intervals. In case we observe many small intervals and the correlations or covariances are small, it may be better to use a different approach. In these methods we want to focus only on those variables that show the highest correlation or covariance with the response variable. The steps we suggest are as follows:

(a) Decide on a measure of correlation or covariance that should be used.
(b) Find the variable that has the highest numerical value.
(c) Divide the range from zero to the numerically highest value into, say, 20 levels.
(d) Find the variables that show values above each of the limits found in (c).
(e) Carry out the 20 PLS regressions, one for each limit.

The procedure is best illustrated by an example. As the measure in (a) we shall use the squared simple correlation coefficient with Y; the choice of measure is not important for the analysis. The data used in this example are from an industrial company. The X-data are transmittance spectra (transformed by -log10) and the y-data are quality measurements. The number of columns in X is 1056. There are 45 samples in X. The samples consist of three replicates, so actually 15 samples are measured three times. In the study, one of the three replicates is selected randomly, and these 15 samples are used as a test set. Thus, in the analysis X is 30 × 1056 and y is 30 × 1, and there is a test set (X0, y0), where X0 is 15 × 1056 and y0 is 15 × 1. Note that this is not traditional cross-validation, because the test set consists of replications. When we choose the replications as a test set, we are evaluating the measurement equipment. The present data show large variations,

Fig. 5. Plot of squared correlation coefficients, r², versus variable no. Two horizontal lines.


also across replicates. Therefore, it has practical interest for the company to use replications as a test set. We first show the 1056 values of r², the squared simple correlation coefficient between a column in X and y. Fig. 5 shows a plot of these values vs. the variable number. Two horizontal lines are drawn in the figure; we shall explain them more closely. The lower horizontal line represents the 1% significance level. Variables having values above it show a significant relationship to the response variable. In order to find out which variables should be used, we place a horizontal line on the figure and select all the variables giving a squared correlation coefficient, r², higher than the one corresponding to the line. For these variables, we carry out a PLS regression. We start the horizontal line so that we get a few variables, say 3, and then we lower the line. We carry out around 20 PLS regressions; in the last one, we use all the variables. In all 20 PLS regressions, four components were used. Four components were found appropriate for the model having the largest Q². For the 20 PLS regressions, we compute R² for the given data and Q² for the test data (the 15 samples). In Fig. 6 we plot the 20 values of R² and Q².

The values of R² and Q² are computed as

R² = 1 - ( Σ (yi - yi,est)² ) / Σ yi²,
Q² = 1 - ( Σ (yi,0 - yi,0,est)² ) / Σ yi,0²,

the degree of fit for the two data sets. Note that the x-axis indicates an increasing number of variables in the PLS regressions. The maximum value of Q² is obtained at no. 6, the 6th regression. If we look at it more closely, we see that the corresponding limit for the squared correlation coefficient is 0.6, and that there are 50 variables, in four intervals, that give values of r² above 0.6. The four intervals can be seen in Fig. 5. Thus, we conclude that in the analysis of X → y we should use these 50 variables. This gives an R² of 99.1% and a Q² of 98.5%. If we select a new test set in a different way, we get approximately the same results. For instance, if we randomly select three samples and their replicates, the results are almost the same. This procedure is very simple and fast to carry out. There are some aspects of Fig. 6 that are important to notice. The numbers of variables used in the first six PLS regressions are 3, 13, 15, 18, 20 and 50, respectively. All PLS regressions were carried out

Fig. 6. Plot of R² and Q² from 20 PLS regressions, each with four components. R²: -o-, Q²: -+-.



using four components (except the first one), which is the correct number when the 50 variables are used. When we increase the number of variables beyond 50, the regressions get worse. If we use all the data, we do not get results that can be used. When all 1056 variables, or most of them, are used, we get considerably worse results than in the beginning. This is also the experience of the industrial company working with these types of data; it uses extensive computations to find a good set of variables to use in the PLS regressions. In the literature we often see comparisons of methods based on the value of Q². This example emphasises the fact that, when comparing methods, we must have identified the appropriate variables to work with. The variation in Q²-values, when we change the variables in the model, can be larger than the variation from one method of analysis to another.
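The procedure of this section can be sketched in MATLAB as follows. The sketch assumes mean-centered training data X and y, a test set X0 and y0 centered with the training means, a fixed number of PLS components, and a simple PLS1 algorithm; all names are chosen for the illustration.

% Threshold sweep over squared correlations (sketch of Section 4).
[N, K] = size(X);
Acomp  = 4;                                      % components in every regression
r2 = zeros(K, 1);
for k = 1:K
    r2(k) = (X(:,k)'*y)^2 / ((X(:,k)'*X(:,k)) * (y'*y) + eps);
end
limits = linspace(max(r2), 0, 21);
limits = limits(2:end);                          % 20 decreasing limits
R2 = zeros(20,1);  Q2 = zeros(20,1);
for j = 1:20
    sel = find(r2 >= limits(j));                 % variables above the limit
    Xs  = X(:,sel);   Xs0 = X0(:,sel);
    B = zeros(numel(sel),1);  E = eye(numel(sel));
    Xw = Xs;  yw = y;
    for a = 1:min(Acomp, numel(sel))             % simple PLS1
        w = Xw'*yw;  w = w/norm(w);
        t = Xw*w;  p = Xw'*t/(t'*t);  q = yw'*t/(t'*t);
        r = E*w;  E = E - r*p';  B = B + r*q;    % coefficients for Xs
        Xw = Xw - t*p';  yw = yw - t*q;
    end
    yest  = Xs*B;   y0est = Xs0*B;
    R2(j) = 1 - sum((y  - yest ).^2) / sum(y.^2);
    Q2(j) = 1 - sum((y0 - y0est).^2) / sum(y0.^2);
end
plot(1:20, R2, '-o', 1:20, Q2, '-+')             % cf. Fig. 6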

5. Orthogonal scatter correction (OSC)

We often observe the situation illustrated in Table 1, where the first PLS components are explaining X and very little of Y. PLS regressions in these cases are not very successful. The method of OSC has been suggested [3,4] as an alternative to variable selection. The idea is to remove from X that which does not relate to the response variables Y. We shall first study the problem in question. The PLS component t = Xw is found by selecting w as the eigenvector associated with the largest eigenvalue of the eigensystem X'YY'Xw = λw. In other terms, we are looking for the score vector t giving a large value of

w'X'YY'Xw / (t't) = w'X'YY'Xw / (w'X'Xw)
                  = w'(X1'Y + X2'Y)(Y'X1 + Y'X2)w / (w'X1'X1w + w'X2'X2w).

Suppose that there is a subspace of X, X2, that is not correlated to Y, Y'X2 = 0. Then we can write

w'X'YY'Xw / (t't) = w'X1'YY'X1w / (w'X1'X1w + w'X2'X2w).

An important question now is: what can be said about the term w'X2'X2w in PLS regression? Unfortunately, nothing or very little can be said about it. It can be small, and it can also be extremely large, giving almost zero fit and thus spoiling the PLS regression. Thus, if there is a large part of X that does not contribute to Y, it must be removed before the analysis. One approach, variable selection, is to remove a part of X, as presented above. Another one is OSC.

The OSC methods that have been presented in the literature [3–5] have not been satisfactory. The reason is that the score vectors selected are not in the column space of X: the suggested methods do find a score vector t* which is orthogonal to Y, but the score vectors found are not in the column space of X, i.e., we cannot find a loading vector w such that t* = Xw, where X is the reduced matrix. The basic idea of OSC is to remove a part of X that does not contribute to the modelling of Y. Therefore, the task is to find w* such that t* = Xw* and Y't* = 0. We shall consider more closely how we can obtain this. The task is to find large score vectors t = Xw such that Y't = Y'Xw = 0. If Z = Y'X, we want to maximize Xw subject to Zw = 0. This is obtained by projecting the rows of X onto Z and subtracting the result from X,

V = X - XZ'(ZZ')⁻¹Z = X - XX'Y(Y'XX'Y)⁻¹Y'X.

If M is the projection operator associated with the orthogonal complement of Z, M = I - Z'(ZZ')⁻¹Z, it is easy to show that for Z of full rank, Zw = 0 if and only if w = Mw. Thus, Xw = XMw. The matrix V = XM is the matrix to work with, and it is simple to show that Y'V = 0. Thus, the singular value decomposition of V will show us how we can

select one or more components that do not contribute to the description of Y. The components t = XMw chosen in this way will have maximal sizes. Consider the singular value decomposition of V, V = SDF'. Then Y'V = 0 implies Y'S = 0. If S1 is the matrix that contains the first columns of S, we adjust X by projecting X onto S1 and subtracting the result,

X1 = X - S1S1'X.

The matrix X1 is the resulting matrix; we work with X1 instead of X. This approach to finding the OSC components is equivalent to Ref. [6], but it is more transparent. The MATLAB code in the Appendix shows that it is easy to compute the OSC components and implement them in the PLS procedure.

The reviewer pointed out that the procedure above computes the maximal size of the component t, but that it does not give the maximal reduction in X subject to Zw = 0. The reason is that we are maximizing in a subspace of X. If we want t = XMw to maximize the reduction in X, we must find the w that maximizes the expression w'MX'XX'XMw / (w'MX'XMw). The solution w is found as the eigenvector of the generalised eigenvalue system

MX'XX'XMw = λMX'XMw.

We shall not consider this question further here. But it should be pointed out, as shown in Ref. [7], that we must be careful when solving the generalized eigenvalue problem, because the matrices have reduced rank.

5.1. One response variable

It is instructive to consider the case where we only have one response variable, Y = y. In this case the formulae above simplify. Let z be the covariance vector, z = X'y. In this case (Y'XX'Y) is a number, Y'XX'Y = z'z. The matrix V is

V = X - Xzz'/(z'z) = X(I - zz'/(z'z)).

The vector z is orthogonal to V and F, Vz = 0, F'z = 0. The projected X matrix, X1, has the same covariance with y as X, y'X1 = y'X - y'S1S1'X = y'X = z'. Thus, the matrix X1 that we work with does not change the criterion we optimize. But we can reduce the size of the score vector in the PLS regression by selecting the appropriate number of columns in S1.

5.2. Regression coefficients

We need the regression coefficients in the PLS regression, when OSC components have been computed, in order to compute the response values for new samples. Here we follow closely the approach shown in Ref. [1]. The regression coefficients are computed as B = Σ ra qa'. The transformation vectors ra are computed for each OSC component and each PLS component. The regression coefficients qa are by definition zero for the OSC components. The idea in Ref. [1] is to write the adjustment of X as

Xnew = X - tp' = X(I - wp'),

where I is the identity matrix. Then the transformation vectors ra are computed from the recursive equations

r = Ew,   Enew = E - rp',

where the initial value of E is the identity matrix. We need to write the OSC components in a similar way. This is obtained by writing

X - XX'Y(Y'XX'Y)⁻¹Y'X = X(I - X'Y(Y'XX'Y)⁻¹Y'X) = XU0.

E.g., the first OSC component, o1, can be written as o1 = XU0z1/s1 = Xw0, where U0 = I - X'Y(Y'XX'Y)⁻¹Y'X and w0 = U0z1/s1 (z1 and s1 being the first right singular vector and the first singular value of V). The loading vector p in the formulae above is given by p = X'o1. See the Appendix for how the computations in a MATLAB program are arranged. For further details on the computation of the transformation vectors, ra, see Chap. 3 in Ref. [1].
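For a single response, the OSC step described in Sections 5.1 and 5.2 can be written in a few MATLAB lines. The sketch below assumes mean-centered X and y and extracts one OSC component; the names are chosen for the illustration.

% One OSC component for a single response (sketch of Sections 5.1-5.2).
z  = X' * y;                          % covariance vector
U0 = eye(size(X,2)) - z*z' / (z'*z);  % removes the y-direction
V  = X * U0;                          % part of X with y'*V = 0
[S, D, F] = svd(V, 'econ');           % V = S*D*F'
o1 = S(:,1);                          % first OSC component (score vector)
w0 = U0 * F(:,1) / D(1,1);            % o1 = X*w0, so o1 is in the column space of X
p1 = X' * o1 / (o1' * o1);            % loading vector (o1'*o1 = 1, so p1 = X'*o1)
X1 = X - o1 * p1';                    % adjusted X; equals X - S1*S1'*X here

A new, centered sample row x0 is corrected in the same way, x0 - (x0*w0)*p1', which is what makes the approach easy to apply to test samples.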

5.3. Illustration

Let us look at the NIR data above. Table 3 gives the measure of fit, R², for 0 to 5 OSC components.

Table 3
Cumulative R² value for 0 to 5 OSC components (columns) and up to eight PLS components (rows)

No. of PLS components      0        1        2        3        4        5
1                          5.721   44.760   79.116   91.259   96.334   96.338
2                         45.941   83.973   96.604   98.123   99.566   99.574
3                         84.086   96.953   98.754   99.725   99.906   99.937
4                         96.962   98.844   99.783   99.953   99.964   99.984
5                         98.846   99.792   99.962   99.980   99.991   99.996
6                         99.792   99.963   99.986   99.994   99.998   99.999
7                         99.963   99.987   99.995   99.998   99.999  100.000
8                         99.987   99.995   99.999  100.000  100.000  100.000

The table shows that a considerable improvement in fit is obtained by eliminating a few OSC components. We see from the table that if we, e.g., want a 96.962% degree of fit, it can be obtained by four PLS

components, or by three PLS components and one OSC component, or by two PLS and two OSC components. In this case, the sum of PLS and OSC components that gives a given R² value is approximately constant; we see this often for NIR data. If we compare the table with Table 2, we see that the results from selecting an interval are better than the results in Table 3. It is also of interest to compute the squared simple correlation coefficients, r², when X has been reduced by one, two, three and four OSC components. This is shown in Fig. 7. If we compare the figure with

Fig. 7. Squared simple correlation coefficient, r², for reduced X and y, when X has been reduced by one, two, three and four OSC components.


Fig. 4, we see that the variables from 1 to 400 become more correlated to y when more OSC components are selected. We also see that the variables at the right end continue to stay uncorrelated to y. These results indicate that we may get some improvement if the extraction of OSC components is combined with the methods in Sections 3 and 4.
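As a small illustration of such a combination, the following lines recompute the squared correlation coefficients after X has been reduced by one OSC component (mean-centered X and y assumed; names chosen for the illustration).

% Squared correlations per variable after removing one OSC component.
z  = X' * y;
V  = X - X * (z * z') / (z' * z);     % part of X orthogonal to y
[S, D, F] = svd(V, 'econ');
X1 = X - S(:,1) * (S(:,1)' * X);      % X reduced by one OSC component
r2 = zeros(size(X1,2), 1);
for k = 1:size(X1,2)
    r2(k) = (X1(:,k)'*y)^2 / ((X1(:,k)'*X1(:,k)) * (y'*y) + eps);
end
plot(r2)                              % compare with Fig. 4 for the original X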

6. Discussion

Variable selection is one of the most studied subjects in theoretical statistics. For a review of methods and literature, see Ref. [8]. Most people working with data have established some approach that they have found useful or acceptable. Also, different research groups in chemometrics have defined their standard approach to variable selection. The present author has carried out an extensive and systematic study of the question of reducing the data before carrying out PLS regression. The study covered (a) different measures of covariance/correlation, (b) choice of variables or intervals, (c) transformation of variables, (d) adjustment procedures of X, (e) size/covariance of the X → T results, (f) changes of X (differences, relative differences), (g) robust measures, (h) non-parametric measures (ranks), and some others. The general result is that an appropriate selection of intervals of variables can considerably improve the results of PLS regression, especially for NIR and FT-IR types of data. This result has also been confirmed by working groups in chemometrics at the Danish Royal Veterinary University and at Foss Electric. The methods presented in Sections 3 and 4, and variants of these methods, are the ones that provide the best results. For data like, e.g., fluorescence, Raman and some others, the results were not as clear as for, e.g., NIR data. In some cases, the benefits of these interval methods were only marginal.

Extraction of OSC components is an important new development initiated by Wold et al. [3]. It has many interesting applications. Its importance is due to the fact that it improves the following PLS regression. The recommendation of this paper is to analyse, at each step of the modelling procedure, the weight vector w. If many of its values are not significant, then remove those variables from the analysis. But there


can be arguments for keeping the variables when the loading matrix P is computed, e.g., because we want to see the effects of the variables used, or for the sake of interpretation.

In the present paper we have focused on PLS regression. But as shown in Ref. [1], PCA can be considered a special case of PLS regression, where Y = X. Therefore, selection of variables is also of fundamental importance for PCA. If there is much noise in the data matrix X, the largest singular values tend to be larger than if there is little or no noise in the data. This can be illustrated as follows. Suppose that X has rank three and X1 is given by X1 = X + E, where E is a matrix of random numbers. Then the first three singular values of X1 will be larger than the three singular values of X. The results can be quite different, making an analysis based on X1 unreliable, even if cross-validation or other methods to validate the results have been carried out. The same considerations apply to PLS regression. If a large portion of X does not correlate with Y, the score vectors tend to be too large. In these cases, some reduction of X as shown in this paper is necessary. In general, a large component variance is considered good, but this is not the case if it is derived from many noisy variables. In our notation, if w2'X2'X2w2 is relatively large, but X2 does not describe the phenomenon of interest, we should consider not using X2 in the (further) analysis.

We would like to repeat the results shown in Fig. 6. A cross-validation may show that a good model is available if the appropriate part of the data is used; if all of the data are used, the results may be useless. This result is important when we want to compare different methods. Before we compare the methods, we must have ensured that the data set is appropriate for the given methods.
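The effect of noise on the singular values can be illustrated with a few MATLAB lines; the rank-three matrix and the noise level below are arbitrary choices for the illustration.

% Noise inflates the leading singular values of a low-rank matrix.
N = 60;  K = 500;
X  = randn(N,3) * randn(3,K);         % a matrix of rank three
E  = 0.5 * randn(N,K);                % random noise matrix
X1 = X + E;
s  = svd(X);   s1 = svd(X1);
disp([s(1:3), s1(1:3)])               % the first three singular values of X1
                                      % are typically larger than those of X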

Appendix A

The following MATLAB program shows how the computations can be arranged, such that OSC is carried out before the PLS regression and also at each step of the PLS regression. The arrangement of the computations follows closely Ref. [1].
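A rough outline of how such a program can be arranged is sketched below for a single response; X and y are assumed mean-centered, the numbers of OSC and PLS components (nosc, npls) and all other names are chosen for the illustration, and the sketch is only an outline of the steps in Sections 5.1 and 5.2, not a complete program.

function [B, T] = osc_pls_sketch(X, y, nosc, npls)
% OSC components followed by a PLS1 regression (sketch of Sections 5.1-5.2).
% X (N x K) and y (N x 1) are assumed mean-centered. B collects the
% regression coefficients, T the score vectors (OSC components first).
[N, K] = size(X);
E = eye(K);                             % transformation matrix, r_a = E*w_a
B = zeros(K, 1);
T = zeros(N, nosc + npls);
a = 0;
% OSC components: q_a = 0 by definition, so they do not enter B.
for i = 1:nosc
    z  = X' * y;
    U0 = eye(K) - z*z' / (z'*z);        % removes the y-direction
    [S, D, F] = svd(X * U0, 'econ');    % V = X*U0, y'*V = 0
    w = U0 * F(:,1) / D(1,1);           % OSC weight, t = X*w = S(:,1)
    t = X * w;
    p = X' * t / (t' * t);              % loading (here t'*t = 1)
    a = a + 1;  T(:,a) = t;
    r = E * w;  E = E - r * p';         % same recursion as for PLS components
    X = X - t * p';                     % adjust X; y is unchanged since y'*t = 0
end
% PLS components.
for i = 1:npls
    w = X' * y;  w = w / norm(w);
    t = X * w;
    p = X' * t / (t' * t);
    q = y' * t / (t' * t);
    a = a + 1;  T(:,a) = t;
    r = E * w;  E = E - r * p';
    B = B + r * q;                      % B = sum of r_a * q_a
    X = X - t * p';
    y = y - t * q;
end
end

New, centered samples X0 are then predicted as y0_est = X0*B; the OSC correction is built into the coefficients through the transformation vectors.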


References

[1] A. Höskuldsson, Prediction Methods in Science and Technology, Basic Theory, vol. 1, Thor Publishing, Copenhagen, 1996.
[2] L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval Partial Least Squares Regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy, Applied Spectroscopy 54 (3) (2000) 413–419.
[3] S. Wold, H. Antti, F. Lindgren, J. Öhman, Orthogonal signal correction of near-infrared spectra, Chemometrics and Intelligent Laboratory Systems 44 (1998) 175–186.
[4] J. Sjöblom, O. Svensson, M. Josefson, H. Kullberg, S. Wold, An evaluation of orthogonal signal correction applied to calibration transfer of near infrared spectra, Chemometrics and Intelligent Laboratory Systems 44 (1998) 229–244.
[5] C.A. Andersson, Direct orthogonalization, Chemometrics and Intelligent Laboratory Systems 47 (1999) 51–63.
[6] T. Fearn, On orthogonal signal correction, Chemometrics and Intelligent Laboratory Systems 50 (2000) 47–53.
[7] K. Faber, On solving generalised eigenvalue problems using MATLAB, Journal of Chemometrics 11 (1998) 87–91.
[8] M.L. Thomson, Selection of variables in multiple regression, International Statistical Review 46 (1978) 1–19.