
CHAPTER 6

The Sequential and Orthogonalized PLS Regression for Multiblock Regression: Theory, Examples, and Extensions

Alessandra Biancolillo*,1, Tormod Næs†,‡
*Department of Chemistry, University of Rome "La Sapienza", Rome, Italy; †Nofima AS, Ås, Norway; ‡Quality and Technology, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Frederiksberg, Denmark
1Corresponding author

1. INTRODUCTION

Linking several datasets with a predictive order is usually called multiblock regression. Combining two or more spectroscopies for calibration, or linking raw material properties and process settings to end-product quality, are typical examples where this type of methodology is needed. A number of methods exist for this purpose; the best known in the chemometric tradition is multiblock PLS regression (MB-PLS), obtained by concatenating the input datasets and performing PLS as usual. This is a very useful and simple approach, but in certain situations other methods may offer important advantages. One of these is sequential and orthogonalized PLS regression (SO-PLS) [1], which is the method discussed here.



This chapter presents the basic theory and points out a number of properties, advantages, and disadvantages of the method. Typical examples of use are discussed and concrete results presented, with focus both on prediction ability and on interpretation. In addition, we discuss a number of extensions and new ways of applying the method, among them classification, variable selection, and the explicit handling of interactions and three-way data.

2. HOW IT STARTED

The SO-PLS method can be seen as a further development of the least squares-PLS (LS-PLS) methodology developed by Jørgensen et al. [2] for the purpose of combining experimental designs (X_des) with noncategorical and highly collinear predictor blocks. As discussed in Ref. [2], the combination of categorical blocks with blocks of quantitative variables can make standard PLS-based multiblock models more complex, yielding, for instance, an overestimation of the number of components. LS-PLS handles the blocks sequentially, overcoming this limitation. Briefly, in LS-PLS the noncategorical block is deflated with respect to the design matrix to remove the redundancy with the categorical block; LS is used for the fitting of the design matrix, and PLS is used for the fitting of the orthogonalized information in the other block. It was soon realized that the same idea could also be used in situations with no restriction on the type of input data. The present version of SO-PLS can therefore be used for any type of input data that can be handled by standard PLS regression, and LS-PLS can be seen as a special case of SO-PLS.

3. MODEL AND ALGORITHM

For the presentation of the method, we will restrict attention to the two-input-block case, with blocks called X and Z. The output dataset, i.e., the response matrix, will be named Y. The multiblock linear regression model can then be represented by the equation:

Y = XB + ZC + E    (6.1)

where X and Z are predictor blocks of dimensions (N × J) and (N × L), respectively; Y is the response matrix, of dimension (N × K); B and C are the regression coefficients, of dimensions (J × K) and (L × K), respectively; and E is the residual matrix, of dimension (N × K).

The SO-PLS algorithm can be summarized by the following four main steps (a code sketch follows Fig. 6.1):

1. First regression: the response matrix Y is fitted to X by PLS regression.
2. Orthogonalization: Z is orthogonalized with respect to the scores (T_X) of the PLS regression in step (1), obtaining Z_Orth.
3. Second regression: the Y-residuals from step (1) are fitted to Z_Orth by PLS regression.
4. Final prediction: Y is predicted by summing the predictions of the two individual regression models in (1) and (3). This can also be seen as a regression of Y onto the scores from the two models in (1) and (3).

In these steps, the only mathematical operations involved are PLS regression and orthogonalization. For more than two input blocks, the same procedure is used, iterating between (2) and (3) before (4). Fig. 6.1 illustrates the sequential use of PLS regression and orthogonalization.

The order of the blocks may have an impact on predictions. In some cases, for instance, when there is a design block, the order is quite obvious, i.e., one will normally use the design block first as X (see LS-PLS in the earlier text). In other cases, it may be less obvious, unless there is a particular goal of estimating the additional contribution of one block to another. If in doubt, one can fit both orderings and use the comparison as an extra tool for interpretation.

FIGURE 6.1 Sequential use of PLS regression and orthogonalization in SO-PLS.
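The four steps map almost line by line onto code. The following is a minimal sketch, assuming NumPy and scikit-learn; the helper names (so_pls_fit, so_pls_predict) and the use of PLSRegression are our choices for illustration, not part of Ref. [1]. Storing the projection coefficients G is what allows step (2) to be applied consistently to new samples at prediction time.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def so_pls_fit(X, Z, Y, a_x, a_z):
    """Two-block SO-PLS with a_x components for X and a_z for Z_Orth."""
    # Step 1: first PLS regression, Y on X
    pls_x = PLSRegression(n_components=a_x, scale=False).fit(X, Y)
    T_x = pls_x.transform(X)                       # X-scores T_X
    E_y = Y - pls_x.predict(X)                     # Y-residuals
    # Step 2: orthogonalize Z with respect to T_X (Eq. 6.3); G is kept so
    # that new samples can be deflated the same way at prediction time
    G = np.linalg.lstsq(T_x, Z, rcond=None)[0]
    Z_orth = Z - T_x @ G
    # Step 3: second PLS regression, Y-residuals on Z_Orth
    pls_z = PLSRegression(n_components=a_z, scale=False).fit(Z_orth, E_y)
    return {"pls_x": pls_x, "pls_z": pls_z, "G": G}

def so_pls_predict(model, X_new, Z_new):
    # Step 4: sum the two partial predictions (Eq. 6.4)
    T_new = model["pls_x"].transform(X_new)
    Z_orth_new = Z_new - T_new @ model["G"]
    return model["pls_x"].predict(X_new) + model["pls_z"].predict(Z_orth_new)
```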


4. SOME MATHEMATICAL FORMULAE AND PROPERTIES

In this section we present the main formulae involved in the steps listed above.

1. This step produces the X-scores (T_X), the X-weights (W_X), the X-loadings (P_X), and the Y-loadings (Q_X). The Y-residual matrix E_Y is then calculated as:

E_Y = Y - T_X Q_X^T    (6.2)

If the number of components extracted in this step is equal to the full column rank of X, SO-PLS is identical to LS and the relation to LS-PLS becomes evident [1].

2. Mathematically, the orthogonalized Z (Z_Orth) is calculated as:

Z_Orth = Z - T_X (T_X^T T_X)^{-1} T_X^T Z    (6.3)

i.e., Z_Orth is obtained by deflating Z with respect to T_X. This is the core of the method. It removes redundancies between the two predictor blocks and provides a number of advantages, which are described in detail in the following text.

3. The residuals from the first PLS regression are then fitted to Z_Orth by standard PLS. In this step the Z_Orth-scores T_{Z_Orth}, the Z_Orth-loadings P_{Z_Orth}, the Z_Orth-weights W_{Z_Orth}, and the Y-loadings Q_{Z_Orth} are calculated.

4. Owing to the orthogonality between T_X and T_{Z_Orth}, Y can be predicted by summing the predictions from steps (1) and (3):

Ŷ = T_X Q_X^T + T_{Z_Orth} Q_{Z_Orth}^T    (6.4)

The predicted Ŷ can also be written as:

Ŷ = X V_X Q_X^T + Z_Orth V_{Z_Orth} Q_{Z_Orth}^T    (6.5)

where V_X = W_X (P_X^T W_X)^{-1} and V_{Z_Orth} = W_{Z_Orth} (P_{Z_Orth}^T W_{Z_Orth})^{-1} are the weights allowing the direct calculation of the X-scores and Z_Orth-scores, respectively. With standard regression notation, this equation can be written as Ŷ = X B̂_X + Z_Orth B̂_{Z_Orth}. Moreover, Eq. (6.5) can be expressed in terms of the original predictor blocks (using Z instead of Z_Orth):

Ŷ = X V_X Q_X^T + (I - T_X (T_X^T T_X)^{-1} T_X^T) Z V_{Z_Orth} Q_{Z_Orth}^T    (6.6)


5. Because of the sequential orthogonalization, the SO-PLS method bears some resemblance to so-called type I ANOVA, i.e., the type of ANOVA where effects are incorporated sequentially. For instance, the orthogonalization allows splitting the total sum of squares SS_Tot for Y into contributions from each block:

SS_Tot = SS_X + SS_Z^Orth + SS_E    (6.7)

where SS_X = tr((X B̂_X)^T X B̂_X) and SS_Z^Orth = tr((Z_Orth B̂_{Z_Orth})^T Z_Orth B̂_{Z_Orth}) are the sums of squares of the contributions from X and Z_Orth, and SS_E is the residual sum of squares. This again points to the possibility of testing for significant improvement as new blocks are incorporated. The degrees of freedom for PLS are, however, unclear, so a simpler possibility is to use CV-ANOVA [3], which compares the cross-validated residuals from the first block with those obtained using two blocks and judges significance with a standard paired t-test (a sketch is given below). An example of this is given in the following discussion.
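As a minimal illustration of the CV-ANOVA idea (our own sketch with synthetic numbers, assuming NumPy and SciPy; the variable names are ours): the squared cross-validated residuals of the smaller and larger models are compared with a paired t-test.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
n = 40  # hypothetical number of cross-validated samples

# Cross-validated residuals per sample; synthetic values for illustration only
resid_x_only = rng.normal(scale=1.0, size=n)   # model with X alone
resid_x_and_z = rng.normal(scale=0.6, size=n)  # model with X and Z

# CV-ANOVA: paired t-test on the squared cross-validated residuals
t_stat, p_val = ttest_rel(resid_x_only**2, resid_x_and_z**2)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")  # small p: adding Z helps significantly
```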

5. HOW TO CHOOSE THE OPTIMAL NUMBER OF COMPONENTS

The estimation of the optimal complexity in SO-PLS is not straightforward. The number of latent variables is estimated individually for each PLS model, but the complexity chosen for one block clearly influences the number of components to be extracted from the blocks added later. In most cases, cross-validation is used for the estimation, as is common practice when working with PLS models. In the literature [1], two different approaches for choosing the optimal complexity in SO-PLS have been discussed. The first is the so-called sequential approach, based on fixing the optimal number of components for each block before the next one is considered. The other is called the global approach and is based on finding the combination of components over all blocks that optimizes the predictions (a cross-validation sketch is given below). Although the global approach is more time-consuming, it is the most widely applied procedure; a reason for this is that the order of the predictor blocks then has less influence on the predictions. It involves, however, more choices and can therefore be more prone to overfitting. A proper test


set validation is therefore to be recommended. A wider discussion of this based on a simulation study can be found in Ref. [4].
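Under cross-validation, the global approach amounts to a grid search over all component combinations. A minimal sketch follows (our own code, reusing the hypothetical so_pls_fit/so_pls_predict helpers from the sketch in Section 3; the normalization of the RMSECV is a simplification):

```python
import numpy as np
from sklearn.model_selection import KFold

def global_search(X, Z, Y, max_ax=6, max_az=6, n_splits=5):
    """RMSECV for every (a_x, a_z) combination; NaN where not evaluated."""
    rmsecv = np.full((max_ax + 1, max_az + 1), np.nan)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for a_x in range(1, max_ax + 1):
        for a_z in range(1, max_az + 1):
            press = 0.0
            for train, test in cv.split(X):
                model = so_pls_fit(X[train], Z[train], Y[train], a_x, a_z)
                y_hat = so_pls_predict(model, X[test], Z[test])
                press += np.sum((Y[test] - y_hat) ** 2)
            rmsecv[a_x, a_z] = np.sqrt(press / Y.size)
    return rmsecv
```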

5.1 The Måge Plot

A graphical tool, called the "Måge plot," has been developed to ease the choice of the optimal number of components. This tool is particularly suitable when the global approach is used. The horizontal axis shows the total number of components, and each point in the plot gives the prediction error for one combination of components across the blocks. The best predictions for each total number of components (summed over the two, or more, blocks) lie along the lower envelope of the points. Inspection of the global and local minima in the plot suggests which combination of components leads to the best solution (see the sketch below). An example of a Måge plot is given in Fig. 6.2. In this case, four components in X and one in Z were the optimal choice.
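Given an RMSECV grid such as the one produced by the global search above, a Måge plot can be drawn by grouping the combinations by their total number of components. A sketch, assuming matplotlib (the function name is ours):

```python
import numpy as np
import matplotlib.pyplot as plt

def mage_plot(rmsecv):
    """Måge plot: RMSECV vs. total components, one point per (a_x, a_z) pair."""
    for a_x in range(rmsecv.shape[0]):
        for a_z in range(rmsecv.shape[1]):
            if np.isnan(rmsecv[a_x, a_z]):
                continue
            total = a_x + a_z                        # horizontal axis: total components
            plt.plot(total, rmsecv[a_x, a_z], "ko", markersize=3)
            plt.annotate(f"{a_x},{a_z}", (total, rmsecv[a_x, a_z]), fontsize=7)
    plt.xlabel("Total number of components")
    plt.ylabel("RMSECV")
    plt.show()
```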

6. HOW TO INTERPRET THE MODELS

An SO-PLS model can be graphically interpreted by inspecting the same types of plots that are usually inspected for PLS, i.e., plots based on scores and loadings. In principle, regression coefficients could also be investigated to explore the relation between predictor variables and response.

FIGURE 6.2 Example of a Måge plot. Figure reprinted from T. Næs, O. Tomic, B.-H. Mevik, H. Martens, Path modelling by sequential PLS regression, J. Chemom. 25 (2011) 28–40, with permission from Wiley.


Nevertheless, regression coefficients can be quite difficult to interpret in multicollinear systems; a wider discussion of this aspect can be found below and in Refs. [5,6].

6.1 Interpretation of the Scores Plots

The information present in each predictor block can be investigated by looking at the scores plots extracted from the individual PLS models. It has to be pointed out, however, that these plots do not display all the information in the Z-block, because only the orthogonalized part is used in the model. The scores from the Z-block are not in the column space of Z but can be thought of as representing the extra information in Z that is not already accounted for by X.

6.2 Interpretation of the Loadings Plots

As in regular PLS, loadings plots can be used to see which variables (in each block) influence the model the most. In SO-PLS, the loadings for the X-block can be plotted directly. Owing to the orthogonalization step, the loadings of the second PLS model are not in the original space of the Z measurements. Therefore, in most cases one will calculate loadings by regressing the original Z onto T_{Z_Orth}, i.e.:

P_Z = (T_{Z_Orth}^T T_{Z_Orth})^{-1} T_{Z_Orth}^T Z    (6.8)
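In code, Eq. (6.8) is a single least-squares step. A minimal sketch (our function name; T_zorth and Z follow the notation above):

```python
import numpy as np

def z_loadings(T_zorth, Z):
    """Eq. (6.8): regress (centered) Z onto the Z_Orth-scores to obtain P_Z."""
    Zc = Z - Z.mean(axis=0)
    # lstsq computes (T^T T)^{-1} T^T Zc; transpose so rows correspond to Z-variables
    return np.linalg.lstsq(T_zorth, Zc, rcond=None)[0].T
```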

6.3 Interpretation by the Use of the PCP Plots

To reduce the number of plots to inspect, it was proposed in Ref. [1] to use the PCP (principal components of predictions) plot for interpretation. This plot is based on first applying PCA to the predicted Y-values and then regressing the different X-variables onto the scores from the PCA. One then obtains Y-loadings and Y-scores directly from the PCA and a separate X-loadings plot from the regression coefficients (a sketch is given below). The PCP plot is particularly useful when the number of blocks is larger than two, as it typically is when applying the method in, for instance, path modeling [1].
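A minimal sketch of the PCP computation (our own illustration, via SVD; the function name and the representation of the blocks as a list are assumptions):

```python
import numpy as np

def pcp(Y_hat, X_blocks):
    """PCP: PCA on the predicted Y, then regression of each block onto the scores."""
    Yc = Y_hat - Y_hat.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    T = U * s                                   # PCA scores (Y-scores)
    Q = Vt.T                                    # PCA loadings (Y-loadings)
    # X-loadings: regression coefficients of each centered block onto the scores
    P_blocks = [np.linalg.lstsq(T, Xb - Xb.mean(axis=0), rcond=None)[0].T
                for Xb in X_blocks]
    return T, Q, P_blocks
```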

7. SOME FURTHER PROPERTIES OF THE SO-PLS METHOD

In general, the SO-PLS method gives accurate predictions, and it is also very suitable for interpretation. At the same time, like all other methods, it


also has some drawbacks. The main benefits and disadvantages are listed and discussed briefly:

1. SO-PLS allows handling of large, ill-conditioned matrices. Because PLS is the method of choice for fitting in each step of the algorithm, ill-conditioned (collinear) predictor blocks can easily be handled by the SO-PLS method.

2. SO-PLS is not affected by different variances of the blocks (scale invariance). Since the blocks used for PLS regression are orthogonal, SO-PLS is independent of the relative scaling of the blocks. In other words, the predictions will be identical, and so will the interpretation of each step. This is different from, for instance, MB-PLS, which requires block scaling before PLS regression can take place (for more information see Refs. [7–10]).

3. The contribution (to the model) of each predictor block can be investigated individually. In SO-PLS, each predictor matrix takes part in an individual PLS regression model for interpretation. This may represent an advantage; for example, in process monitoring, the interpretation of the individual PLS models in SO-PLS could highlight the presence of an anomaly at a particular place in the production process. In addition, owing to the orthogonalization step, it is possible to interpret the "unique" information carried by each block (because any redundancy with the already modeled block is removed). A possible drawback of this aspect is that one will have to interpret more than one model. These can, however, be summarized using the so-called PCP method mentioned earlier and illustrated later.

4. The number of latent variables can be optimized for each individual block. In, for instance, MB-PLS, all the blocks are modeled together and the number of components is chosen for the full model. This means there is a risk of extracting a number of components that is too high or too low with respect to the underlying dimensionality of the different blocks. In SO-PLS, the number of components is optimized individually (for each block) in each PLS regression. In this way, one can obtain a better estimate of the number of components needed in each block, which means that blocks with very different underlying dimensionality can easily be combined. A possible disadvantage is that there are more parameters to determine, with a risk of overfitting and an even stronger need for test set validation.


8. EXAMPLES OF STANDARD SO-PLS REGRESSION

The example presented here is from multivariate calibration [11] using two spectroscopic principles, near infrared (NIR) and Raman. A major focus was to investigate whether Raman could improve prediction when NIR was already available. The dataset used for illustration comes from Ref. [12] and consists of 69 emulsions in which the contents of whey proteins, water, and fat are varied according to a mixture design. In addition, the fat composition in these samples is varied by including mixtures of five different vegetable and marine oils according to another mixture design. In this study, one specific fatty acid feature, namely, the content of polyunsaturated fatty acids (PUFAs), is expressed both as a percentage of total sample weight (Y1 = PUFA% of emulsion) and as a percentage of total fat content (Y2 = % PUFA). In the analyses, NIR was used as X to highlight the additional benefit of adding Raman as Z. Both Y1 and Y2 were used as y-variables.

Prediction results are presented in Fig. 6.3. As can be seen, Raman clearly improves on NIR in this case. More than 90% of the variance of the responses is explained when both spectroscopies are used. The incremental contribution can also be read directly from this type of plot; in this particular case it was quite large for both y-variables.

FIGURE 6.3 Prediction ability of the SO-PLS model. Explained variances for the two responses after the addition of the first (NIR) and the second block (Raman). Figure reprinted from T. Næs, O. Tomic, N.K. Afseth, V. Segtnan, I. Måge, Multi-block regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemom. Intell. Lab. Syst. 124 (2013) 32–42, with permission from Elsevier.


TABLE 6.1 CV-ANOVA (for Two Responses Y1 and Y2) Using Two Input Blocks (X-NIR, Z-Raman)

                  Y1                     Y2
                  RMSEP    p-value       RMSEP    p-value
Mean              2.79     –             15.75    –
Addition of X     1.37     <0.0001       11.10    0.001
Addition of Z     0.69     <0.0001       2.23     <0.0001

The ANOVA for these results using CV-ANOVA is given in Table 6.1. In all cases the effects are strongly significant for both responses.

The interpretation of the spectra was then done by inspecting the PCP loadings displayed in Fig. 6.4, where the solid and the dotted lines represent the first and the second components, respectively. From these, the absorptions related to the main functional groups in PUFA can be recognized. Both the solid and dotted lines in Fig. 6.4A show a peak at 1706 nm, which is related to vibrations of polyunsaturated C–C bonds and to the water band. The PCP loadings related to the Raman spectra (Fig. 6.4B) are more informative (from the chemical point of view) than those extracted from the NIR measurements. The first component (solid line in Fig. 6.4B) shows two major positive peaks (at 1263 and 1658 cm−1), confirming the presence of unsaturated carbon–carbon chains, and two main negative bands (at 1304 and 1439 cm−1), which are related to vibrations of saturated C–C bonds. The second component (dotted line in Fig. 6.4B) shows a signature that can be linked to proteins and water in the emulsions.

FIGURE 6.4 Loadings for the first two PCP components. (A) From the X-NIR block; (B) from the Z-Raman block. Solid line, first component; dotted line, second component. Figure reprinted from T. Næs, O. Tomic, N.K. Afseth, V. Segtnan, I. Måge, Multi-block regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemom. Intell. Lab. Syst. 124 (2013) 32–42, with permission from Elsevier.

9. EXTENSIONS AND MODIFICATIONS OF SO-PLS

9.1 SO-PLS Can Be Extended to Handle Multiway Data Arrays (Without Unfolding)

Being a PLS-based method, SO-PLS can handle only data matrices, not arrays of dimensionality higher than two such as, for instance, data from fluorescence excitation–emission spectroscopy. Consequently, a multiway predictor block has to be unfolded before the analysis. This procedure has some drawbacks and, for this reason, a multiway version of SO-PLS has been proposed, called SO-N-PLS [13]. An illustration of the different situations that can be handled by this procedure is given in Fig. 6.5.

The orthogonalization step in SO-PLS here plays a major role. Because each estimation starts "from scratch" after each new orthogonalization, a three-way array can easily be incorporated at any step by using, for instance, N-way PLS regression (N-PLS) [14] instead of standard PLS regression. The SO-N-PLS method therefore consists of the same steps as described earlier (Section 3), except that wherever a three-way dataset is involved, N-PLS [14] is used instead of standard PLS. The orthogonalization is performed on each column of the unfolded three-way array, i.e., on each variable combination, for all samples (a sketch follows Fig. 6.5).


FIGURE 6.5 Graphical representation of SO-N-PLS and multiblock methods on unfolded data. SO-N-PLS is applied on multiway arrays, avoiding unfolding. (A) The X-block is three-way, whereas Z is a matrix. (B) Both X- and Z-blocks are three-way arrays.
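The columnwise orthogonalization of a three-way block can be sketched as follows (our own illustration; a full SO-N-PLS implementation would then fit N-PLS to the refolded array, for which no standard scikit-learn routine exists):

```python
import numpy as np

def orthogonalize_threeway(Z3, T_x):
    """Deflate a three-way block Z3 (N x J x K) with respect to X-scores T_x (N x a).

    The array is unfolded along the sample mode, each column is orthogonalized
    exactly as in the two-way case (Eq. 6.3), and the result is refolded.
    """
    n, j, k = Z3.shape
    Z_unf = Z3.reshape(n, j * k)                      # sample-mode unfolding
    G = np.linalg.lstsq(T_x, Z_unf, rcond=None)[0]    # projection coefficients
    return (Z_unf - T_x @ G).reshape(n, j, k)         # refolded; ready for N-PLS
```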

9.1.1 Interpretation of Scores Plots in SO-N-PLS Models

As for SO-PLS, the scores of the submodels can be graphically interpreted by inspecting the scores plots. The same considerations discussed for SO-PLS apply here as well.

9.1.2 Interpretation of Loadings Plots/Weights Plots in SO-N-PLS Models

In SO-N-PLS, the interpretation of the model loadings for the three-way arrays is not as straightforward as described earlier, because of the use of N-PLS regression [14]. This multiway regression method adheres to Martens' PLS algorithm [15], in which loading vectors are not calculated as described earlier. In N-PLS, given the three-way array X, two sets of weights (w^J and w^K) are calculated in such a way that the covariance between the scores T and (the unexplained part of) the response is maximized. These weights can be interpreted in place of loadings. Plots of the weights can be obtained in two different ways: displaying w^J and w^K individually or plotting their outer product according to the original three-way structure [13]. Another possibility is to regress the original (unfolded) Z onto the scores T_Z, as in Section 6, i.e.:

W_Z = (T_Z^T T_Z)^{-1} T_Z^T Z_Unfolded    (6.9)

An example of the representation of the weights as a landscape is given later.

9.1.3 Interpretation of Regression Coefficients Plots in SO-N-PLS Models

Regression coefficients can be investigated to gain an indication of the relation between predictor variables and response. Nevertheless, it must


be noted that they should be properly normalized before interpretation [5,6]. Moreover, it has to be taken into account that they do not always lead to a reliable interpretation. For example, when analyzing a chemical system where the instrumental signal produced by a specific analyte is used to predict the response, the regression coefficients are influenced not only by the compounds under study but also by possible interferents. For these reasons, the interpretation of the regression coefficients is not pursued in this chapter (see, for instance, Refs. [5,6]).

9.1.4 Example

Different industrial butters have been analyzed to investigate the oxidation process when samples are stored under different conditions. Twenty-one samples of butter were stored under different light conditions (no light, or exposed to red, green, or violet light) and under different atmospheres (low or high oxygen). The butter samples were then analyzed by excitation–emission fluorescence spectroscopy (EEM) (emission scanned from 580 to 720 nm, excitation from 350 to 452 nm) and by another fluorimeter (single emission, 405–563 nm) having a higher signal-to-noise ratio. After light exposure, 11 panelists evaluated the acidic odor of the samples. The fluorescence spectra were used as predictor blocks: the EEM array as X (21 × 274 × 35) and the emission fluorescence matrix as Z (21 × 392). The optimal complexity of the models was determined using a Måge plot, as described earlier (plot not shown). The number of latent variables used, the root mean square errors of cross-validation (RMSECVs), and the explained variances for the different models tested are reported in Table 6.2. As can be seen from Table 6.2, the three methods show comparable prediction performance. This is not surprising; in fact, as shown in Ref. [13], SO-N-PLS gives better predictions than the other two methods only when the datasets are highly noisy (and possess a genuine underlying three-way structure). Note that here MB-PLS uses a lower total number of components than the other two.

TABLE 6.2 RMSECVs, Explained Variances, and Number of Latent Variables (LVs) Used for the SO-N-PLS, SO-PLS, and MB-PLS Models

Method      LVs     RMSECV   Explained Variance (%)
SO-N-PLS    2, 4    0.53     88
SO-PLS      2, 4    0.53     88
MB-PLS      4       0.58     88


FIGURE 6.6 Butter dataset: X-weights plots for the SO-N-PLS (A and D), SO-PLS (B and E), and MB-PLS (C, F–H) models.

In Fig. 6.6, the X-weights extracted from SO-N-PLS, SO-PLS, and MB-PLS (the latter run by unfolding and concatenating the blocks before standard PLS regression [7]) are reported. It is clear that, because the MB-PLS method needs to take both blocks into account in all calculations, the information in the X-block is not found before component 4 (Fig. 6.6H). For the SO-PLS-based methods this is different; only two components are needed in X (Fig. 6.6D and E). Comparing the X-weights plots of the second component extracted from the SO-N-PLS (Fig. 6.6D) and SO-PLS (Fig. 6.6E) models, one can notice that the latter presents a significant negative contribution close to the lower wavelengths (perhaps ascribable to the orthogonalization with respect to the first component, see Fig. 6.6B). This effect is not present for SO-N-PLS, which on the other hand shows a wavy tendency in other areas. There is also a slight tendency (here most visible along the border) toward a noise component in the SO-PLS plot.

9.2 SO-PLS Can Be Used for Classification

As mentioned earlier, SO-PLS can be used as a feature extraction method (like all PLS-based methods) and, consequently, as a starting point for classification, using a dummy Y matrix representing the groups. A simple possibility is to extract the model scores from all blocks and concatenate them before applying Fisher's linear discriminant analysis (LDA) [16], resulting in a method called SO-PLS-LDA [17].


The algorithm can be summarized in the following points (a code sketch follows at the end of this subsection):

1. The SO-PLS model is created as described in Section 3, but now with a dummy 0/1 matrix as Y. Two sets of scores are obtained: T_X and T_{Z_Orth}.
2. The scores extracted from the regression model are concatenated: T_C = [T_X T_{Z_Orth}].
3. The classification model is obtained by applying LDA to the T_C matrix.

For selecting the number of components there are two options: Måge plots based either on the RMSECVs obtained from the dummy regression or on the percentages of correct classification. Because classification is the final aim of SO-PLS-LDA, using the classification rates to select the optimal model complexity seems more appropriate. Because classification results are categorical, however, the RMSECV plot can give a smoother and more continuous curve, which can simplify the choice of components. The method has been demonstrated to give predictions/classifications comparable with other state-of-the-art methods. As demonstrated in Ref. [18], LDA can be applied either to the predicted Y (Ŷ) or to the scores obtained from the regression model. Consequently, it would be possible to avoid step 2 by calculating the LDA on the predicted response.

Classification results can be reported in different ways. Classification rates and confusion matrices are useful, but graphical representations of the results can give deeper insight. Usually, classification results are interpreted graphically by analyzing the scores plot or by displaying the predicted Y against the sample number. Another possibility is to use canonical variates, either directly on the predicted results or on their cross-validated analogues, as proposed in Ref. [17].
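A minimal sketch of the three points above (our own code, reusing the hypothetical so_pls_fit helper from the Section 3 sketch; the dummy coding and the LDA step use scikit-learn):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import LabelBinarizer

def so_pls_lda_fit(X, Z, labels, a_x, a_z):
    # Step 1: SO-PLS on a dummy 0/1 response coding the class memberships
    Y_dummy = LabelBinarizer().fit_transform(labels)
    model = so_pls_fit(X, Z, Y_dummy, a_x, a_z)       # helper from Section 3 sketch
    # Step 2: concatenate the two score sets, T_C = [T_X  T_Zorth]
    T_c = np.hstack([model["pls_x"].x_scores_, model["pls_z"].x_scores_])
    # Step 3: LDA on the concatenated scores
    lda = LinearDiscriminantAnalysis().fit(T_c, labels)
    return model, lda
```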


9.2.1 Example Based on Lambrusco Wines

Lambrusco is a PDO Italian wine produced in the Modena region (Northern Italy). This wine can be made using different Lambrusco grapes, but their relative amounts are strictly fixed by law. Fifty-eight samples of Lambrusco wine (produced in the same year) were measured by EEM (used as the three-way X, 58 × 16 × 121) and by nuclear magnetic resonance (used as Z, 58 × 9168). These belonged to three different types of Lambrusco: 19 "Lambrusco Grasparossa di Castelvetro PDO," 20 "Lambrusco Salamino di Santa Croce PDO," and 19 "Lambrusco di Sorbara PDO." More details can be found in Ref. [19]. Samples were classified by SO-N-PLS-LDA and, after unfolding, by SO-PLS-LDA and MB-PLS-LDA. Misclassified samples and total errors for all the models are reported in Table 6.3.

TABLE 6.3 Lambrusco Dataset: Misclassified Samples and Total Errors by SO-N-PLS-LDA, SO-PLS-LDA, and MB-PLS-LDA

Method      Grasparossa Misclassified   Salamino Misclassified   Sorbara Misclassified   Total Error (%)
SO-N-PLS    3                           5                        3                       19
SO-PLS      6                           5                        3                       24
MB-PLS      6                           7                        2                       26

The graphical representation of the cross-validated predictions in the space of the canonical variates obtained by SO-N-PLS-LDA is shown in Fig. 6.7. In the figure, Grasparossa samples are in red, Salamino in green, and Sorbara in blue. Misclassified samples are squared; the color of the square indicates the predicted class.

FIGURE 6.7 Classification of Lambrusco wines by SO-N-PLS-LDA. Cross-validated predictions in the space of canonical variates; squared samples are the misclassified ones.

The classification rate is not very good; further investigation highlighted that this is because these wines are not prepared using 100% Grasparossa, Salamino, or Sorbara grapes but using mixtures of grapes from the Modena area (see Ref. [13] for more details). Nevertheless, it is clear that the SO-N-PLS-LDA method gives slightly better results than the calculations based on unfolded data.

9.3 SO-PLS Can Handle Interactions Between Blocks

In ANOVA, and also in polynomial regression, it is standard practice to incorporate interactions among variables. The authors of this chapter are not aware of any clear advice in the literature, except the paper by Næs et al. [20], on how to extend this to interactions among blocks. In that paper, it was proposed that an interaction matrix be constructed by

9. EXTENSIONS AND MODIFICATIONS OF SO-PLS

173

multiplying each column in XV1 with each column in ZV2, where V1 and V2 are matrices representing linear combinations of the blocks. This new block is then added to the first two in the model:

Y = XB + ZC + (XV1 · ZV2)D + E    (6.10)

Note that this definition of interaction encompasses products of principal components of the blocks and also of reduced blocks after variable selection (see the later discussion); a code sketch is given at the end of this subsection. It can also be seen as a direct generalization of the standard multiplication trick used in polynomial regression. The model is fitted by SO-PLS as earlier, but now the extra matrix XV1·ZV2 is added as a new block U after the fitting of X and Z, in line with the type I ANOVA principle discussed briefly earlier. An ANOVA formula similar to Eq. (6.7) can be written for this model as well by adding a sum of squares for the interactions.

The method was tested in Ref. [20] on data from the salting of salmon fillets. Forty-five salmon fillets of different sizes (3–4, 4–5, and 5–6 kg) underwent a salting procedure of different durations (8, 16, and 24 h) and were measured by NIR spectroscopy (here only six wavelengths were used; more details on the dataset can be found in Ref. [21]). An SO-PLS model was created using a dummy-coded X matrix (level 1 as reference for both factors) for the design factors, the NIR data as Z, and the salt concentration as the response vector. The X-matrix was fitted with the maximum number of components, which corresponds to LS (see LS-PLS earlier). The optimal numbers of components were two for the Z-block and two for the U-block. Fig. 6.8 shows the RMSECVs obtained using different numbers of components from Z and U.

From Table 6.4 it is evident that the addition of each individual block increases the prediction ability of the model. Moreover, it shows that the incorporation of the U-block provides an improvement of 8%–9%, suggesting the presence of a real interaction between the design factors and the spectral measurements. Regression coefficients for U are displayed in Fig. 6.9. All are very close to zero except those corresponding to the third level of the second factor, namely, the salting effect. This indicates that the main contribution to the interaction is the different effect of fish chemistry at 24 h of salting as compared with the other experimental conditions.
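A sketch of the interaction-block construction of Eq. (6.10) (our own code, using the first few principal component scores of each block as XV1 and ZV2; the function names are assumptions):

```python
import numpy as np

def interaction_block(X, Z, n_pc=2):
    """Interaction matrix of Eq. (6.10), using principal component scores as XV1, ZV2."""
    def pc_scores(M, a):
        Mc = M - M.mean(axis=0)
        U, s, _ = np.linalg.svd(Mc, full_matrices=False)
        return U[:, :a] * s[:a]                       # scores on the first a PCs
    XV1, ZV2 = pc_scores(X, n_pc), pc_scores(Z, n_pc)
    # All columnwise products between the two reduced blocks
    return np.einsum("ni,nj->nij", XV1, ZV2).reshape(X.shape[0], -1)
```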

9.4 Variable Selection in SO-PLS

Owing to the sequential nature of SO-PLS, variable selection can take place at different points during the creation of the regression model. For instance, the data blocks can be reduced before the multiblock analysis, and the model is then created using the reduced predictor blocks. Another option is to embed the feature reduction step in the creation of the


FIGURE 6.8 RMSECVs obtained extracting different components in Z and U, after four components have been extracted from X. Figure reprinted from T. Næs, I. Måge, V.H. Segtnan, Incorporating interactions in multi-block sequential and orthogonalised partial least squares regression, J. Chemom. 25 (2011) 601–609, with permission from Wiley.

TABLE 6.4 Root Mean Square Errors of Cross-Validation When Only X; X and Z; or X, Z, and U Are Used for the Creation of the SO-PLS Model

          Only X   LV   X and Z   LV   X, Z and U   LV
RMSECV    1.04     4    0.57      2    0.49         2

FIGURE 6.9 Regression coefficients for U. Figure reprinted from T. Næs, I. Måge, V.H. Segtnan, Incorporating interactions in multi-block sequential and orthogonalised partial least squares regression, J. Chemom. 25 (2011) 601–609, with permission from Wiley.


modeling itself, by first reducing the model for X and then reducing the model for Z_Orth (for a detailed description of the different procedures, see Ref. [22]). A third possibility is a variant of forward regression [23], whereby the variable with the best performance in terms of RMSECV is selected from either X or Z_Orth, where Z_Orth is always the Z matrix orthogonalized with respect to the information already extracted from X.

In Ref. [22] the different approaches were compared using a sensory dataset, a spectroscopic dataset, and a number of simulated datasets. For the backward selection methods, the reduction was based on variable importance in projection (VIP) [24,25] and the selectivity ratio (SR) [26,27], which were found in an extensive simulation study to be the preferred criteria for this purpose (a sketch of the VIP computation is given below). It was shown that SO-PLS combined with SR or VIP can provide a large reduction in the number of variables and still give good prediction results. In general, it has been observed that, when the goal is to reduce the number of variables without penalizing the predictive ability, the forward selection procedure seems to be the best approach. Nevertheless, it has to be taken into account that this procedure is computationally demanding and can be very sensitive to overfitting [22]. From an interpretation point of view, there was a tendency, for the spectroscopic dataset, for SO-PLS combined with VIP to perform better than SO-PLS combined with SR. Moreover, it was observed that, when handling noisy data, VIP is slightly more appropriate than SR, whereas in the presence of systematic errors, the suggested variable selection method is SR.
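For reference, a minimal VIP computation for a single PLS model is sketched below (our own code, following the standard VIP formula of Ref. [24]; applying it per block within SO-PLS is then straightforward):

```python
import numpy as np

def vip(T, W, Q):
    """VIP per predictor variable for one PLS model.

    T (N x a): scores; W (J x a): weights; Q (K x a): Y-loadings. For
    scikit-learn's PLSRegression these are x_scores_, x_weights_, y_loadings_.
    """
    p, a = W.shape
    # Variance of Y explained by each component
    ssy = np.array([(Q[:, i] @ Q[:, i]) * (T[:, i] @ T[:, i]) for i in range(a)])
    w_norm_sq = (W / np.linalg.norm(W, axis=0)) ** 2   # normalized squared weights
    return np.sqrt(p * (w_norm_sq @ ssy) / ssy.sum())  # variables with VIP > 1 are commonly retained
```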

10. CONCLUSIONS

The aim of this chapter has been to describe the SO-PLS method for multiblock regression and to present its main published modifications and extensions. Both theoretical aspects and applications to real data have been discussed. The method is invariant to the relative scales of the blocks involved, which means that block scaling is not needed. A block sums-of-squares decomposition is possible, and a way of testing the significance of incorporating an extra block was also discussed. An alternative method based on some of the same principles is the PO-PLS method [11], which focuses on the common and unique information in the blocks instead of the additional variability focused on in SO-PLS.

It has been shown that SO-PLS is a versatile method for both regression and classification problems. In both cases, it is particularly suitable from an interpretation point of view. An extension of the method for handling multiway arrays has also been described and illustrated. This method, called SO-N-PLS, combines SO-PLS with N-PLS regression. It keeps the main characteristics of SO-PLS and is


therefore applicable to both regression and classification problems, for both prediction and interpretation. SO-N-PLS is a robust method that can reduce the effect of noise better than methods based on, for instance, unfolding of the data. As a consequence, it is particularly suitable when samples are few and data are noisy. Because of its sequential nature, SO-PLS also allows incorporation of interactions between the predictor blocks, for the purpose of improving both the predictions and the interpretation of the model. The method has also been proposed as an exploratory approach to structural equation modeling [1]; the main advantage of SO-PLS in this context is that it can handle blocks with more than one underlying dimension.

References

[1] T. Næs, O. Tomic, B.-H. Mevik, H. Martens, Path modelling by sequential PLS regression, J. Chemom. 25 (2011) 28–40.
[2] K. Jørgensen, V. Segtnan, K. Thyholt, T. Næs, A comparison of methods for analysing regression models with both spectral and designed variables, J. Chemom. 18 (2004) 451–464.
[3] U. Indahl, T. Næs, Evaluation of alternative spectral feature extraction methods of textural images for multivariate modeling, J. Chemom. 12 (1998) 261–278.
[4] A. Biancolillo, Method Development in the Area of Multi-block Analysis Focused on Food Analysis (Ph.D. thesis), Department of Food Science, Faculty of Science, University of Copenhagen, 2016.
[5] M. Seasholtz, B. Kowalski, Qualitative information from multivariate calibration models, Appl. Spectrosc. 44 (1990) 1337–1348.
[6] K. Kjeldahl, R. Bro, Some common misunderstandings in chemometrics, J. Chemom. 24 (2010) 558–564.
[7] S.J. Qin, S. Valle, M.J. Piovoso, On unifying multiblock analysis with application to decentralized process monitoring, J. Chemom. 15 (2001) 715–742.
[8] L.E. Wangen, B.R. Kowalski, A multiblock partial least squares algorithm for investigating complex chemical systems, J. Chemom. 3 (1988) 3–20.
[9] J.A. Westerhuis, A.K. Smilde, Deflation in multiblock PLS, J. Chemom. 15 (2001) 485–493.
[10] J.A. Westerhuis, T. Kourti, J.F. MacGregor, Analysis of hierarchical PCA and PLS models, J. Chemom. 12 (1998) 301–321.
[11] T. Næs, O. Tomic, N.K. Afseth, V. Segtnan, I. Måge, Multi-block regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemom. Intell. Lab. Syst. 124 (2013) 32–42.
[12] N.K. Afseth, V.H. Segtnan, B.J. Marquardt, J.P. Wold, Raman and near-infrared spectroscopy for quantification of fat composition in a complex food model system, Appl. Spectrosc. 59 (2005) 1324–1332.
[13] A. Biancolillo, T. Næs, R. Bro, I. Måge, Extension of SO-PLS to multi-way arrays: SO-N-PLS, Chemom. Intell. Lab. Syst. 164 (2017) 113–126.
[14] R. Bro, Multiway calibration. Multilinear PLS, J. Chemom. 10 (1996) 47–61.
[15] H. Martens, T. Næs, Multivariate Calibration, John Wiley & Sons, New York, 1989.
[16] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[17] A. Biancolillo, I. Måge, T. Næs, Combining SO-PLS and linear discriminant analysis for multi-block classification, Chemom. Intell. Lab. Syst. 141 (2015) 58–67.
[18] T. Næs, U. Indahl, A unified description of classical classification methods for multicollinear data, J. Chemom. 12 (1998) 205–220.
[19] M. Silvestri, A. Elia, D. Bertelli, E. Salvatore, C. Durante, M. Li Vigni, A. Marchetti, M. Cocchi, A mid level data fusion strategy for the varietal classification of Lambrusco PDO wines, Chemom. Intell. Lab. Syst. 137 (2014) 181–189.
[20] T. Næs, I. Måge, V.H. Segtnan, Incorporating interactions in multi-block sequential and orthogonalised partial least squares regression, J. Chemom. 25 (2011) 601–609.
[21] V.H. Segtnan, M. Høy, F. Lundby, B. Narum, J.P. Wold, Fat distributional analysis in salmon fillets using non-contact near infrared interactance imaging: a sampling and calibration strategy, J. Near Infrared Spectrosc. 17 (2009) 247–253.
[22] A. Biancolillo, K.H. Liland, I. Måge, T. Næs, R. Bro, Variable selection in multi-block regression, Chemom. Intell. Lab. Syst. 156 (2016) 89–101.
[23] N.R. Draper, H. Smith, Applied Regression Analysis, Wiley-Interscience, Hoboken, NJ, 1998, pp. 307–312.
[24] I.G. Chong, C.H. Jun, Performance of some variable selection methods when multicollinearity is present, Chemom. Intell. Lab. Syst. 78 (2005) 103–112.
[25] S. Favilla, C. Durante, M. Li Vigni, M. Cocchi, Assessing feature relevance in NPLS models by VIP, Chemom. Intell. Lab. Syst. 129 (2013) 76–86.
[26] T. Rajalahti, R. Arneberg, F.S. Berven, K.M. Myhr, R.J. Ulvik, O.M. Kvalheim, Biomarker discovery in mass spectral profiles by means of selectivity ratio plot, Chemom. Intell. Lab. Syst. 95 (2009) 35–48.
[27] O.M. Kvalheim, Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots, J. Chemom. 24 (2010) 496–504.