A continuum decorrelation of the variables; Application to multivariate regression


Chemometrics and Intelligent Laboratory Systems 162 (2017) 111–122





El Mostafa Qannari, Angélina El Ghaziri, Mohamed Hanafi
StatSC, ONIRIS, INRA, 44322 Nantes, France

Keywords: Standardization; Mahalanobis distance; PLS regression; Redundancy analysis

Abstract

A continuum standardization, called αM-standardization, of a dataset, whereby the variances of the variables and their correlations are gradually shaded off, is proposed. It is tightly connected to the Mahalanobis distance. After investigating its properties, it is used to set up a continuum approach for predicting one dataset from another. It turns out that this continuum strategy is tightly linked to an approach called PLS power regression [1,2]. It is also shown that this strategy of analysis is broad enough to encompass PLS regression and redundancy analysis. An illustration on the basis of two case studies is outlined.

1. Introduction

Practitioners who aim at predicting a dataset, Y, from another dataset, X, may face a dilemma. With a method of analysis such as redundancy analysis [3,4], the focus is entirely on recovering the variation in Y, which is a positive point. The drawback, however, is that if the X-variables are plagued with multicollinearity, the model may be of poor quality. An alternative strategy is PLS regression, which does tackle the problem of multicollinearity. Its drawback is that if the variances of the X-variables are high and, moreover, these variables are highly correlated, then the focus shifts to recovering the variation in X rather than in Y; the first latent variables are then more likely related to X than to Y.

One of the purposes of the present paper is to set up a continuum approach which depends on a tuning parameter α. This approach bridges the gap between PLS regression and redundancy analysis, since the variances and the correlations among the X-variables are gradually shaded off as α increases. In a previous paper [5], we introduced a continuum standardization strategy of a dataset, X, which consists in dividing each variable of X by its standard deviation raised to the power α. In matrix form, this amounts to transforming X into X(α) = X S^{−α/2}, where S is the diagonal matrix containing the respective variances of the variables in X. By varying α between 0 and 1, we move from a situation where the variables are not standardized to a situation where the variables are standardized; midway along this continuum (α = 0.5) we find Pareto standardization, which is commonly used among practitioners in metabolomics and chemometrics, among other disciplines [6,7].

We propose herein a more general strategy which consists not only in a continuum standardization of the variables but also in a continuum decorrelation of the variables. The key point of this continuum approach is the Mahalanobis transformation, which is tightly linked to the so-called Mahalanobis distance [8,9]. In the following, we shall refer to this continuum decorrelation of the variables as αM-standardization, where α refers to the tuning parameter and M to the Mahalanobis distance.

Once the principle of αM-standardization is defined, we sketch some of its properties. In particular, we assess the similarity between the original dataset and the αM-standardized dataset (Section 2.1). Thereafter, we show that the αM-standardization is of little interest in principal component analysis (PCA) (Section 2.2). More emphasis is put on the application of the αM-standardization to PLS regression and redundancy analysis (Section 2.3). In particular, we show that the proposed continuum approach is very tightly linked to a strategy of analysis called continuum power PLS [1]. In Section 3, we illustrate the interest of the continuum strategy using two datasets. As a general rule, all the proofs of the properties are left to the appendix.

2. Methods

2.1. αM-standardization

Let us consider a dataset X formed of the measurements of p variables x_j (j = 1, …, p) on n individuals. Let X = U Λ^{1/2} V^T be the singular value decomposition of X, where U and V are unitary matrices (i.e. U^T U, U U^T, V^T V and V V^T are all equal to an identity matrix) and Λ = diag(λ_1, …, λ_p) is the diagonal matrix formed of the eigenvalues of X^T X or, equivalently, X X^T. We have X^T X = V Λ V^T and X X^T = U Λ U^T, which are the spectral (or Jordan) decompositions of

Corresponding author. E-mail address: [email protected] (E.M. Qannari).

http://dx.doi.org/10.1016/j.chemolab.2017.01.014 Received 18 October 2016; Received in revised form 12 January 2017; Accepted 17 January 2017 Available online 18 January 2017 0169-7439/ © 2017 Elsevier B.V. All rights reserved.


Fig. 1. Effect of the αM-standardization on two correlated variables, and ellipses corresponding to various Mahalanobis distances from the origin.

X^T X and X X^T, respectively. We consider a scalar α between 0 and 1 and define (X^T X)^{−α/2} = V Λ^{−α/2} V^T. The αM-standardized dataset of X is then given by:

X_α = X (X^T X)^{−α/2}

We have X_0 = X and X_1 = X (X^T X)^{−1/2}, which is the Mahalanobis transformation of X. This latter dataset is characterized by the fact that X_1^T X_1 = I (identity matrix), which means that its columns are standardized and uncorrelated. More generally, we have X_α^T X_α = V Λ^{1−α} V^T. This entails that, as α moves from 0 to 1, we achieve a gradual decorrelation of the variables.

In order to illustrate the effect of the αM-standardization, we depict in Fig. 1 the scatterplot of two-dimensional data (α = 0) and their αM-standardized counterparts for α = 0.3, 0.7 and 1. Additionally, we show ellipses of points that are Mahalanobis-equidistant from the origin. It is clear that, as α increases, the correlation coefficient between the two variables decreases and the ellipses become less elongated, gradually taking the shape of circles.

Belsley (1991, [10]) proposed to use the condition indices as collinearity diagnostics. These indices are given by:

η_j = (λ_1 / λ_j)^{1/2},  j = 1, …, p   (1)

Obviously, η_j ≥ 1 for all j, and large η_j indicate the presence of near collinearity among the X-variables. By applying the αM-standardization to X, the matrix X^T X is transformed into (X^T X)^{1−α}. Therefore, the associated condition indices are given by:

η_j(α) = (λ_1 / λ_j)^{(1−α)/2},  j = 1, …, p

The derivative of this quantity with respect to α is:

η_j′(α) = (1/2) × (λ_1 / λ_j)^{(1−α)/2} × ln(λ_j / λ_1)

which is negative since λ_j ≤ λ_1. This indicates that η_j(α) is a decreasing function of α and entails that, by increasing α, we achieve a better conditioning of the variance-covariance matrix.

As a matter of fact, the purpose of the condition indices is to assess the discrepancy among the eigenvalues of the variance-covariance matrix of X_α. This discrepancy can also be reflected by CV(α), the coefficient of variation (ratio of the standard deviation to the mean) associated with the eigenvalues of the variance-covariance matrix of X_α. CV(α) is given by:

CV(α) = √( p ∑_{j=1}^p λ_j^{2(1−α)} − (∑_{j=1}^p λ_j^{1−α})² ) / ∑_{j=1}^p λ_j^{1−α}

We can show (see A.1.1) that CV(α) is a decreasing function of α. This means that, as α moves from 0 to 1, the eigenvalues associated with the variance-covariance matrix of X_α come closer to each other. In particular, we have CV(1) = 0, which corresponds to the situation where all the eigenvalues are equal (orthogonal design). It is worth noting that:

CV(α) = √( p / β(α) − 1 ),  where  β(α) = (∑_{j=1}^p λ_j^{1−α})² / ∑_{j=1}^p λ_j^{2(1−α)}

is known as the sphericity index [11,12]; β(α) is also used to assess the complexity of a dataset by Kazi-Aoual et al. (1995) [13].

Another way to assess the impact of the αM-standardization is to compute some measure of association (or similarity) between X and X_α as α varies from 0 to 1. Hereinafter, we single out two measures of association between two datasets, namely the multivariate coefficient of correlation [14,5] and the RV coefficient [15,16,5]. The multivariate coefficient of correlation between X and X_α is defined by:

MR(α) = trace(X^T X_α) / √( trace(X^T X) trace(X_α^T X_α) )   (2)

It is comprised between −1 and 1; the lower bound is reached if and only if X = a X_α for some negative scalar a, and the upper bound is reached if and only if X = a X_α for some positive scalar a. It is easy to check that:
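To make the construction concrete, the αM-standardization and the diagnostics above can be computed directly from the SVD of X. The following Python sketch is illustrative only (the toy data and all names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated, centered variables, as in the setting of Fig. 1.
X = rng.standard_normal((200, 2)) @ np.array([[1.0, 0.9], [0.0, 0.5]])
X = X - X.mean(axis=0)

def alpha_m(X, alpha):
    """X_alpha = X (X^T X)^{-alpha/2} = U Lambda^{(1-alpha)/2} V^T, via the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * s ** (1.0 - alpha)) @ Vt

diag = []
for alpha in (0.0, 0.3, 0.7, 1.0):
    Xa = alpha_m(X, alpha)
    lam = np.sort(np.linalg.eigvalsh(Xa.T @ Xa))[::-1]  # eigenvalues of X_alpha^T X_alpha
    eta_max = np.sqrt(lam[0] / lam[-1])                 # largest condition index, Eq. (1)
    cv = lam.std() / lam.mean()                         # CV(alpha): std/mean of eigenvalues
    r = np.corrcoef(Xa[:, 0], Xa[:, 1])[0, 1]           # correlation between the two columns
    diag.append((alpha, r, eta_max, cv))
    print(f"alpha={alpha:.1f}  corr={r:+.3f}  eta_max={eta_max:.2f}  CV={cv:.3f}")

# alpha = 1 gives the Mahalanobis transformation: X_1^T X_1 = I.
assert np.allclose(alpha_m(X, 1.0).T @ alpha_m(X, 1.0), np.eye(2))
```

As α grows, the printed correlation shrinks towards 0 and both the condition index and CV(α) decrease, in line with the properties above.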

MR(α) = ∑_{j=1}^p λ_j^{1−α/2} / √( ∑_{j=1}^p λ_j ∑_{j=1}^p λ_j^{1−α} )   (3)

Moreover, we show that this is a decreasing function of α (A.1.2). The implication of this property is that the similarity between X and X_α becomes smaller and smaller as α increases. The second similarity index between X and X_α is the RV coefficient:

RV(α) = trace(W W_α) / √( trace(W W) trace(W_α W_α) )   (4)

where W = X X^T and W_α = X_α X_α^T. The RV coefficient reflects the similarity of the spatial configurations of the individuals associated with X and X_α. It ranges between 0 and 1. The lower bound is reached if all the variables in X are uncorrelated with those in X_α; RV = 1 if and only if X and X_α can be matched by a rotation and a multiplication by a scaling factor. We can show that:

RV(α) = ∑_{j=1}^p λ_j^{2−α} / √( ∑_{j=1}^p λ_j² ∑_{j=1}^p λ_j^{2(1−α)} )   (5)

We also show that this is a decreasing function of α (A.1.3). The implication of this property is that, as α increases, the spatial configuration of the individuals changes.

2.2. Application to PCA

From the singular value decompositions of X and X_α, namely X = U Λ^{1/2} V^T and X_α = U Λ^{(1−α)/2} V^T, it readily follows that X and X_α have the same standardized principal components (i.e. U) and the same vectors of loadings (i.e. V). The only thing that changes is the respective magnitude of the eigenvalues associated with the variance-covariance matrix and, more importantly, their relative importance. Let us consider the percentage of the total variance recovered by the kth component of X_α:

p_k = λ_k^{1−α} / ∑_{j=1}^p λ_j^{1−α}   (6)

We show in A.2 that the percentage of variation explained by the first principal components tends to decrease and, contrariwise, the percentage of variation explained by the last components tends to increase. All in all, the αM-standardization does not seem to be of particular interest insofar as PCA is concerned.

2.3. Relating two datasets

2.3.1. Multivariate Y

We consider the setting where we aim at predicting a dataset Y, formed of q quantitative variables, from X. We assume that both X and Y are centered. For the investigation of the relationships between two datasets, PLS regression [17,8] is a very popular method, especially when the number of X-variables is large compared to the number of observations and/or when the X-variables are highly correlated. The application of PLS regression to predict Y from X_α yields a family of models indexed by α ∈ [0, 1]. For a given α, we seek, in a first step, a pair of latent variables t_α = X_α w_α and u_α = Y ν_α so as to maximize cov(t_α, u_α) under the constraints ∥w_α∥ = ∥ν_α∥ = 1. It is known that the solution is obtained as follows: w_α is the eigenvector of X_α^T Y Y^T X_α associated with the largest eigenvalue, and ν_α = Y^T t_α / ∥Y^T t_α∥. Subsequent pairs of latent variables are determined following the same strategy of analysis after deflation of X_α and Y with respect to t_α. More precisely, this consists in replacing X_α and Y by

X_α^{(2)} = X_α − (t_α t_α^T / t_α^T t_α) X_α  and  Y^{(2)} = Y − (t_α t_α^T / t_α^T t_α) Y,

respectively, and starting anew the same procedure on X_α^{(2)} and Y^{(2)}. Thereafter, the procedure extends in a natural way to obtain subsequent components. Obviously, for α = 0, we obtain the PLS regression of Y on X. In the following, we show that for α = 1 we are led to redundancy analysis [3]. It is known that redundancy analysis consists in determining, in a first step, a latent variable t = Xa so as to maximize t^T Y Y^T t under the constraint a^T X^T X a = 1. The vector a is obtained as the eigenvector of (X^T X)^{−1} X^T Y Y^T X associated with the largest eigenvalue, which we denote by μ. We have:

(X^T X)^{−1} X^T Y Y^T X a = μ a

This is equivalent to:

(X^T X)^{−1/2} X^T Y Y^T X (X^T X)^{−1/2} (X^T X)^{1/2} a = μ (X^T X)^{1/2} a

Let us denote w = (X^T X)^{1/2} a. We have w^T w = 1 and X_1^T Y Y^T X_1 w = μ w, where X_1 = X (X^T X)^{−1/2}. This means that w is the loading vector corresponding to the PLS regression of Y on X_α for α = 1.

Summing up, the continuum strategy offers a whole range of models which encompass PLS regression and redundancy analysis. As α moves from 0 to 1, the prediction ability of the models is likely to improve, but the models are also likely to be unstable in case of collinearity among the X-variables. The rationale behind the continuum approach is to find an appropriate α that realizes a good compromise between stability and good prediction. This can be done by cross-validation: screen values of α from 0 to 1 and select the parameter α* that corresponds to the minimum of the predicted residual sum of squares (PRESS) or some associated statistic such as the root mean squared error of prediction (RMSEP). For more details regarding this strategy of selecting the tuning parameter, we refer to Stone [18].

2.3.2. Univariate y

In the following, we show additional properties of the continuum strategy of analysis in the particular case where Y = y is formed of a single variable. We believe that these properties could be extended to the multivariate case but, so far, we have not been able to prove them. We denote by t_α the first latent variable associated with the PLS regression of y on X_α. We have t_α = X_α w_α = X (X^T X)^{−α/2} w_α with w_α = X_α^T y / ∥X_α^T y∥. Instead of t_α, we consider the latent variable t_α* = X w_α*, where

w_α* = (X^T X)^{−α/2} w_α / ∥(X^T X)^{−α/2} w_α∥ = (X^T X)^{−α} X^T y / ∥(X^T X)^{−α} X^T y∥.

Obviously, t_α* and t_α are collinear; therefore, any regression model upon t_α is equivalent to the same regression model upon t_α*. The latent variable t_α* presents the advantage of being expressed as a linear combination of the original variables in X with a loading vector, w_α*, of unit length. Thus, as α increases from 0 to 1, the components t_α* are set on the same footing and can be compared regardless of the variation of scale due to the αM-standardization. We consider the following functions:

f(α) = var(t_α*),  g(α) = cov²(t_α*, y)  and  R²(α) = cov²(t_α*, y) / ( var(t_α*) var(y) ) = cov²(t_α, y) / ( var(t_α) var(y) )

We show in A.3 the following interesting properties: f and g are decreasing functions of α, whereas R²(α) is an increasing function of α. The interpretation of these properties is straightforward: as α increases, we are exploring less stable directions of the space, since the variance of t_α* decreases; but R²(α), which is the proportion of variance of y explained by t_α* (or, equivalently, t_α), increases. The implication of these properties is that one should strive to find a good compromise between stability (α near 0) and good prediction (α near 1).
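A minimal numerical sketch of these properties for a toy univariate y (all data and names are ours; the monotonicity checks mirror the properties proved in A.3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 6
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated X-variables
X = X - X.mean(axis=0)
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)
y = y - y.mean()

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U Lambda^{1/2} V^T, lambda_j = s_j^2

def t_star(alpha):
    """First latent variable t*_alpha = X w*_alpha, with a unit-length loading vector."""
    w = Vt.T @ ((s ** (-2.0 * alpha)) * (Vt @ (X.T @ y)))  # (X^T X)^{-alpha} X^T y
    w = w / np.linalg.norm(w)
    return X @ w

variances, r2s = [], []
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    t = t_star(alpha)
    variances.append(t.var())
    r2s.append((t @ y) ** 2 / ((t @ t) * (y @ y)))  # R^2(alpha)
    print(f"alpha={alpha:.2f}  var(t*)={variances[-1]:9.3f}  R2={r2s[-1]:.3f}")

# f(alpha) = var(t*_alpha) decreases while R^2(alpha) increases (proofs in A.3):
assert all(np.diff(variances) <= 1e-9) and all(np.diff(r2s) >= -1e-9)
```

At α = 1 the latent variable spans the least-squares fitted values, so R²(1) coincides with the R² of the ordinary regression of y on X, in agreement with the redundancy-analysis limit derived above.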

2.4. Comparison to continuum power PLS regression

We consider the general case where Y is a multivariate dataset. We show the tight connection between the continuum approach discussed herein and a continuum approach called continuum power PLS regression, which was introduced by Wise & Ricker [1] and also discussed by de Jong et al. [2]. For the prediction of a multivariate Y from a dataset X and given a positive scalar γ, continuum power PLS is defined as the PLS regression of Y upon X_γ = U Λ^{γ/2} V^T, where X = U Λ^{1/2} V^T is, as stated above, the singular value decomposition of X. The approach discussed herein amounts to performing a PLS regression of Y upon X_α = X (X^T X)^{−α/2}, and it is easy to check that X_α = U Λ^{(1−α)/2} V^T. This shows that the two continuum approaches are identical up to a re-parametrization of the tuning parameters. Our presentation has the advantage of showing the rationale behind the continuum strategy of analysis through the αM-standardization. Moreover, new properties have been shown with the aim of shedding more light on this continuum approach; among these properties, we single out the connection of the continuum approach with redundancy analysis.

3. Illustration

3.1. Nutrimouse dataset

The Nutrimouse dataset pertains to a nutrigenomic study on mice under low-fat nutrition [19]; it is available in the R package CCA [20]. The experimental design was originally set up to assess the effect of low-fat dietary fatty acids in relation to PPARα (peroxisome proliferator-activated receptor α), which is a major transcriptional regulator of lipid metabolism. For this purpose, forty mice were submitted to an experimental study, half of them PPARα-deficient (ppar) and the other half wild-type (wt). Five diets were applied (four mice per diet and per genotype ppar/wt):

– ref: reference diet, corn and colza oils (50/50)
– sun: sunflower oil diet, rich in ω6 fatty acids
– lin: linseed oil diet, rich in ω3
– coc: hydrogenated coconut oil diet, a saturated fatty-acid diet
– fish: fish oil diet, corn/colza/docosahexaenoic acid-enriched fish oils.

Two sets of variables were measured. The X-variables consisted of 120 gene expressions selected as informative in this context, and the Y dataset was formed of 21 variables measuring the concentrations of hepatic fatty acids obtained by gas chromatography. Several statistical analyses were performed on these data [21,19,22]. Among these methods, we single out regularized canonical correlation analysis (RCCA), which is a canonical correlation analysis (CCA) adapted to the case where the number of observations is smaller than the number of X-variables or Y-variables, or in the presence of multicollinearity in X or Y. More precisely, RCCA amounts to a procedure of regularization akin to ridge regularization applied to CCA [22].

We performed hereinafter the continuum approach of regression, consisting in applying PLS regression of Y on X_α. Fig. 2 shows the influence of α on the first latent variable t_α*. As α varies between 0 and 1, the variance of t_α* decreases (Fig. 2, left) and so does the function g(α) = t_α*^T Y Y^T t_α* (Fig. 2, center). By contrast, the function R²(α) increases (Fig. 2, right). In particular, it appears that, for α = 1, the variance of t_α* is close to 0, whereas the percentage of explained variance in Y is as high as 40%. Fig. 3 shows the cumulative percentage of the total variance in X and Y recovered by the first four latent components obtained by means of PLS regression of Y on X_α. It is clear that, as α increases, the percentage of total variance in X recovered by the latent variables decreases, whereas the percentage of total variance in Y recovered by the latent variables increases.

A three-fold cross-validation was also performed to assess the effect of varying the parameter α on the prediction. Fig. 4 shows the root mean squared error (RMSE) associated with this cross-validation for various values of α, as a function of the number of components. It turns out that the case α = 1 (i.e. redundancy analysis), which was found in the previous figure to be the most efficient in recovering the variation in Y, is the worst model in terms of prediction of Y for new observations. For this case study, it would be better to choose a tuning parameter around 0.25. Fig. 5 shows the representation of the forty mice (left) on the basis of the first two latent components obtained by running the PLS regression of Y on X_α for α = 0.25. The mice are labelled according to the diet they were submitted to. Fig. 5 (right) shows the correlation plot of the X- and Y-variables with these two components; in this latter graphical display, only the variables with a significant correlation coefficient with either the first or the second latent variable are depicted. We can see a separation between the PPARα-deficient mice (ppar) and the wild-type mice (wt) on the first component. It opposes mainly the fatty acid variable C16.0 to the variables C18.2n.6, C20.2n.6 and C20.1n.9 in Y. Only the variable X36b4 in X is highly correlated with this first latent variable. The second latent variable highlights the wild-type group of mice. It opposes two groups of variables in Y, namely C18.1n.9, C16.1n.9, C18.1n.7 and C16.1n.7 on the one hand, and C18.0, C22.6n.3 and C20.5n.3 on the other hand. Only the variable ACAT1 in X seems to be highly correlated with this second latent variable. By way of comparing methods, we also investigated the outputs obtained by means of PLS regression of Y on X. The results (not shown) indicated a high agreement with those discussed herein. It is also worth mentioning that the graphical displays in Fig. 5 show some similarities with those obtained by means of GCCA [22].

3.2. Orange dataset

The second case study concerns the authentication of orange juice using ¹H NMR spectroscopy [23]. The dataset X is composed of 480 spectral variables and 150 observations. The observations are formed of thirty authentic juices (20 pure orange and 10 pure clementine) and 120 adulterated juices (mixes of orange and clementine juices). Dataset Y is composed of one variable, y, which is the proportion of clementine juice. This dataset is available in the R package ClustVarLV [24].

We applied the αM-standardization to the X-dataset. Fig. 6 shows the variations of three diagnostic coefficients as a function of α. The coefficient of variation associated with the eigenvalues of the variance-covariance matrix of X_α (Fig. 6, left) decreases relatively sharply, starting from CV(0) ≃ 2. The multivariate correlation (MR) coefficient between X and X_α (Fig. 6, center) also decreases relatively sharply with α and reaches the value 0.5 for α = 1. The RV coefficient between X and X_α (Fig. 6, right) decreases abruptly, starting from α around 0.4, and reaches the value 0.2 at α = 1. Fig. 7 shows the variations, as a function of α, of three functions related to the first latent variable t_α* of the PLS regression of y on X_α. Both the variance of t_α* and the squared covariance between t_α* and y decrease. The percentage of explained variance in y is equal to 50% for α = 0 and increases swiftly to reach 100% at α = 1. In order to assess the effect of the parameter α in terms of prediction, Fig. 8 (left) shows, for various values of α, the RMSE on the basis of the calibration set (RMSEC), and Fig. 8 (right) shows the RMSE associated with a three-fold cross-validation (RMSECV), as a function of the number of components introduced in the model. The models with small values of α perform better in terms of RMSECV (α around 0.3) whereas, as is usually the case, the models associated with α close to 1 perform better in terms of RMSEC.
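The cross-validated choice of α used in both case studies can be sketched, for a one-component model and a univariate y, as follows (a minimal analogue of the RMSECV computation; the data and all names are illustrative, not the paper's):

```python
import numpy as np

def rmsecv(X, y, alpha, n_folds=3, seed=0):
    """RMSE of a one-component continuum model, estimated by K-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    press = 0.0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Xtr = X[train] - X[train].mean(axis=0)
        ytr = y[train] - y[train].mean()
        _, s, Vt = np.linalg.svd(Xtr, full_matrices=False)
        w = Vt.T @ ((s ** (-2.0 * alpha)) * (Vt @ (Xtr.T @ ytr)))  # (X^T X)^{-alpha} X^T y
        w = w / np.linalg.norm(w)
        t = Xtr @ w
        b = (t @ ytr) / (t @ t)                  # regression of y on the latent variable
        Xte = X[test] - X[train].mean(axis=0)    # apply the training centering
        pred = y[train].mean() + b * (Xte @ w)
        press += np.sum((y[test] - pred) ** 2)   # predicted residual sum of squares
    return float(np.sqrt(press / n))

rng = np.random.default_rng(2)
X = rng.standard_normal((90, 8)) @ rng.standard_normal((8, 8))
y = X[:, 0] - X[:, 3] + 0.3 * rng.standard_normal(90)
grid = np.linspace(0.0, 1.0, 11)
scores = [rmsecv(X, y, a) for a in grid]
best = grid[int(np.argmin(scores))]
print(f"alpha* = {best:.1f}  (RMSECV = {min(scores):.3f})")
```

The selected α* is the grid value minimizing RMSECV, in the spirit of the PRESS/RMSEP criterion discussed in Section 2.3.1.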

Chemometrics and Intelligent Laboratory Systems 162 (2017) 111–122

E.M. Qannari et al.

Fig. 2. Nutrimouse dataset: three properties related to the first latent variable t_α* associated with the PLS regression of Y on X_α, as functions of α: the variance of t_α* (left), g(α) = t_α*^T Y Y^T t_α* (center) and R²(α) (right).

Fig. 3. Nutrimouse dataset: percentages of total variance in X and Y explained by the first four latent components obtained by means of PLS regression of Y on Xα.

Fig. 4. Nutrimouse dataset: RMSE obtained by a three-fold cross-validation, as a function of the number of latent variables introduced in the model, for different values of α (0, 0.25, 0.4, 0.5 and 0.75).

Fig. 5. Nutrimouse dataset: for α = 0.25, representation on the first two latent components associated with the PLS regression of Y on X_α of the observations (left), and correlation plot (right) of the X- (full lines and bold letters) and Y- (dashed lines) variables with these two components. Only the variables whose sum of squared correlations with the first two components is higher than 0.075 are shown.

Fig. 6. Orange dataset: evolution, as a function of α, of the coefficient of variation (left) associated with the eigenvalues of the variance-covariance matrix of X_α, and of the multivariate correlation (center) and RV (right) coefficients between X and X_α.

Fig. 7. Orange dataset: three properties related to the first latent variable t_α* associated with the PLS regression of y on X_α, as functions of α: var(t_α*) (left), cov²(t_α*, y) (center) and R²(α) (right).

Fig. 8. Orange dataset: RMSE of calibration (left) and RMSE associated with a three-fold cross-validation (right), as functions of the number of components, for five values of α (0, 0.3, 0.4, 0.5 and 0.6).

4. Conclusion

The αM-standardization provides a continuum strategy to perform a gradual decorrelation of the variables in a given dataset. Its scope of application is wide, since it could potentially be used in many statistical analyses. We have shown how it can be applied within the framework of multivariate regression. This led us to set up a continuum approach which is broad enough to encompass PLS regression and redundancy analysis. It turns out that this approach is equivalent, up to a re-parametrization, to an existing approach called PLS power regression [1,2]. The αM-standardization not only made it possible to unveil the rationale behind this latter method, but also to exhibit new properties that shed more light on how this continuum approach operates. Since the continuum regression strategy encompasses both PLS regression and redundancy analysis, it is likely to outperform these two regression methods, provided that the tuning parameter α is appropriately chosen. For the choice of such a parameter, we advocate using a cross-validation strategy, in the same vein as the continuum approaches proposed by Stone & Brooks [25] and Sundberg [26].

Other potential uses of the αM-standardization include linear and quadratic discriminant analysis. This can simply be done by replacing, in the assignment rules, the dataset X by its αM-standardized dataset X_α. In effect, this results in a regularization of the Mahalanobis distance among the individuals. More research work on this strategy of analysis is needed and will be reported elsewhere.
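As an illustration of this suggestion (ours, not a method developed in the paper), a nearest-centroid rule can be applied after αM-standardizing X, so that α interpolates between a Euclidean (α = 0) and a Mahalanobis (α = 1) assignment rule; all names and the toy data below are hypothetical:

```python
import numpy as np

def fit_predict(X, labels, Xnew, alpha):
    """Nearest class centroid in the alpha-M standardized coordinates of X."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    T = Vt.T @ ((s ** -alpha)[:, None] * Vt)     # (X^T X)^{-alpha/2}
    Za, Zn = (X - mu) @ T, (Xnew - mu) @ T       # standardized training / new coordinates
    groups = np.unique(labels)
    cents = np.stack([Za[labels == g].mean(axis=0) for g in groups])
    d = ((Zn[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return groups[d.argmin(axis=1)]

rng = np.random.default_rng(3)
shape = np.array([[1.0, 0.8], [0.0, 0.6]])       # correlated within-class structure
Xa = rng.standard_normal((60, 2)) @ shape
Xb = rng.standard_normal((60, 2)) @ shape + np.array([3.0, 2.0])
X = np.vstack([Xa, Xb])
labels = np.repeat([0, 1], 60)
acc = (fit_predict(X, labels, X, alpha=0.5) == labels).mean()
print(f"resubstitution accuracy at alpha=0.5: {acc:.2f}")
```

Intermediate values of α thus play the role of the regularization of the Mahalanobis distance mentioned above.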

Appendix A. Appendices

A.0. Prerequisite: Chebyshev's weighted sum inequality

We start by recalling Chebyshev's weighted sum inequality [27], since it is used hereinafter for the proofs of several properties. Let us consider p scalars (a_j), p scalars (b_j) and p positive scalars (q_j) (j = 1, …, p) such that:

a_1 ≤ a_2 ≤ … ≤ a_p
b_1 ≤ b_2 ≤ … ≤ b_p
q_1 + q_2 + ⋯ + q_p = 1

Then, we have:

(∑_{k=1}^p a_k q_k) (∑_{k=1}^p b_k q_k) ≤ ∑_{k=1}^p a_k b_k q_k

Note that Chebyshev's sum inequality also holds in the case where a_1 ≥ a_2 ≥ … ≥ a_p and b_1 ≥ b_2 ≥ … ≥ b_p. Moreover, if the direction of the inequalities concerning either (a_j)_{j=1,…,p} or (b_j)_{j=1,…,p} is inverted, then the direction of the latter inequality is also inverted.
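A quick numerical check of the inequality on randomly generated, similarly ordered sequences (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
a = np.sort(rng.standard_normal(10))   # a_1 <= ... <= a_p
b = np.sort(rng.standard_normal(10))   # b_1 <= ... <= b_p
q = rng.random(10)
q = q / q.sum()                        # positive weights summing to 1

lhs = (a * q).sum() * (b * q).sum()    # (sum a_k q_k)(sum b_k q_k)
rhs = (a * b * q).sum()                # sum a_k b_k q_k
assert lhs <= rhs + 1e-12
print(f"(sum a q)(sum b q) = {lhs:.4f}  <=  sum a b q = {rhs:.4f}")
```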

A.1. Effect of α

A.1.1. Coefficient of variation

We start by proving that the coefficient of variation is a decreasing function of α. We recall that CV(α) is given by:

CV(α) = √( p ∑_{j=1}^p λ_j^{2(1−α)} − (∑_{j=1}^p λ_j^{1−α})² ) / ∑_{j=1}^p λ_j^{1−α}

CV(α) has the same variation as K(α) = (CV(α))², since CV(α) ≥ 0. We have:

K(α) = [ p ∑_{j=1}^p λ_j^{2(1−α)} − (∑_{j=1}^p λ_j^{1−α})² ] / (∑_{j=1}^p λ_j^{1−α})² = p ∑_{j=1}^p λ_j^{2(1−α)} / (∑_{j=1}^p λ_j^{1−α})² − 1

It follows that:

K′(α) = [ −2 (∑_{j=1}^p ln(λ_j) λ_j^{2(1−α)}) (∑_{j=1}^p λ_j^{1−α}) + 2 (∑_{j=1}^p λ_j^{2(1−α)}) (∑_{j=1}^p ln(λ_j) λ_j^{1−α}) ] × p / (∑_{j=1}^p λ_j^{1−α})³

where the factor p / (∑_{j=1}^p λ_j^{1−α})³ is positive. Thus, K′(α) ≤ 0 if and only if:

(∑_{j=1}^p λ_j^{2(1−α)}) (∑_{j=1}^p ln(λ_j) λ_j^{1−α}) ≤ (∑_{j=1}^p ln(λ_j) λ_j^{2(1−α)}) (∑_{j=1}^p λ_j^{1−α})

Let us denote q_j = λ_j^{1−α} / ∑_{j=1}^p λ_j^{1−α}. Obviously, q_j ≥ 0 and ∑_{j=1}^p q_j = 1. By dividing both members of the inequality above by (∑_{j=1}^p λ_j^{1−α})², it can be written as:

(∑_{j=1}^p λ_j^{1−α} q_j) (∑_{j=1}^p ln(λ_j) q_j) ≤ ∑_{j=1}^p ln(λ_j) λ_j^{1−α} q_j

Since λ_1 ≥ λ_2 ≥ … ≥ λ_p and 0 ≤ α ≤ 1, we have λ_1^{1−α} ≥ λ_2^{1−α} ≥ … ≥ λ_p^{1−α} and ln(λ_1) ≥ ln(λ_2) ≥ … ≥ ln(λ_p). It follows that the inequality above holds by a direct application of Chebyshev's weighted sum inequality (prerequisite A.0). Therefore K′(α) ≤ 0, which shows that the coefficient of variation is a decreasing function of α.

A.1.2. Multivariate coefficient of correlation

In the following, we prove that the multivariate coefficient of correlation between X and X_α is a decreasing function of α.

MR(α) = trace(X^T X_α) / √( trace(X^T X) trace(X_α^T X_α) )

By using the singular value decompositions of X and X_α, namely X = U Λ^{1/2} V^T and X_α = X (X^T X)^{−α/2} = U Λ^{(1−α)/2} V^T, where Λ is the diagonal matrix whose diagonal elements are the λ_j (j = 1, …, p) and U, V are unitary matrices, MR(α) can also be written as:

MR(α) = ∑_{j=1}^p λ_j^{1−α/2} / √( ∑_{j=1}^p λ_j ∑_{j=1}^p λ_j^{1−α} )

Since ∑_{j=1}^p λ_j does not depend on α, MR(α) has the same variation as:

F(α) = ln( ∑_{j=1}^p λ_j^{1−α/2} ) − (1/2) ln( ∑_{j=1}^p λ_j^{1−α} )

By deriving this latter expression with respect to α, we are led to:

F′(α) = −(1/2) ∑_{j=1}^p ln(λ_j) λ_j^{1−α/2} / ∑_{j=1}^p λ_j^{1−α/2} + (1/2) ∑_{j=1}^p ln(λ_j) λ_j^{1−α} / ∑_{j=1}^p λ_j^{1−α}
     = (1/2) [ (∑_{j=1}^p ln(λ_j) λ_j^{1−α}) (∑_{j=1}^p λ_j^{1−α/2}) − (∑_{j=1}^p ln(λ_j) λ_j^{1−α/2}) (∑_{j=1}^p λ_j^{1−α}) ] / [ (∑_{j=1}^p λ_j^{1−α/2}) (∑_{j=1}^p λ_j^{1−α}) ]

This derivative is negative if and only if:

(∑_{j=1}^p ln(λ_j) λ_j^{1−α}) (∑_{j=1}^p λ_j^{1−α/2}) ≤ (∑_{j=1}^p ln(λ_j) λ_j^{1−α/2}) (∑_{j=1}^p λ_j^{1−α})

By dividing both members of this inequality by (∑_{j=1}^p λ_j^{1−α})² and noting that λ_j^{1−α/2} = λ_j^{1−α} λ_j^{α/2}, we obtain:

(∑_{j=1}^p ln(λ_j) q_j) (∑_{j=1}^p λ_j^{α/2} q_j) ≤ ∑_{j=1}^p ln(λ_j) λ_j^{α/2} q_j

with q_j = λ_j^{1−α} / ∑_{j=1}^p λ_j^{1−α} ≥ 0 and ∑_{j=1}^p q_j = 1. Since λ_1 ≥ λ_2 ≥ … ≥ λ_p and 0 ≤ α ≤ 1, we have ln(λ_1) ≥ ln(λ_2) ≥ … ≥ ln(λ_p) and λ_1^{α/2} ≥ λ_2^{α/2} ≥ … ≥ λ_p^{α/2}. It follows that the inequality above holds by a direct application of Chebyshev's weighted sum inequality (prerequisite A.0). This shows that the multivariate coefficient of correlation is a decreasing function of α.

A.1.3. RV coefficient

In the following, we show that the RV coefficient between X and X_α is a decreasing function of α. We recall that:

RV(α) = trace(W W_α) / √( trace(W W) trace(W_α W_α) )

with W = X X^T and W_α = X_α X_α^T. By using the singular value decomposition of X, namely X = U Λ^{1/2} V^T, with Λ the diagonal matrix with the λ_j as diagonal elements and U, V unitary matrices, RV(α) can also be written as:

RV(α) = ∑_{j=1}^p λ_j^{2−α} / √( ∑_{j=1}^p λ_j² ∑_{j=1}^p λ_j^{2(1−α)} )

Let us denote μ_j = λ_j². Then RV(α) can be expressed as:

RV(α) = ∑_{j=1}^p μ_j^{1−α/2} / √( ∑_{j=1}^p μ_j ∑_{j=1}^p μ_j^{1−α} )

This function is very similar to MR(α), and the decrease of RV(α) can be shown in a very similar way.

A.2. Total variance recovered by the principal components

The purpose of this appendix is to prove that the percentage of variation explained by the first principal components tends to decrease and, contrariwise, the percentage of variation explained by the last components tends to increase. We recall that the percentage of variation recovered by the kth component of X_α, Eq. (6), is given by:

p_k(α) = λ_k^{1−α} / ∑_{j=1}^p λ_j^{1−α}

The derivative of pk with respect to α is given by:

$$p_k'(\alpha) = \frac{-\ln(\lambda_k)\,\lambda_k^{1-\alpha}\left(\sum_{j=1}^{p} \lambda_j^{1-\alpha}\right) + \lambda_k^{1-\alpha}\left(\sum_{j=1}^{p} \ln(\lambda_j)\,\lambda_j^{1-\alpha}\right)}{\left(\sum_{j=1}^{p} \lambda_j^{1-\alpha}\right)^2} = \frac{\lambda_k^{1-\alpha}}{\sum_{j=1}^{p} \lambda_j^{1-\alpha}}\left(\sum_{j=1}^{p} q_j \ln(\lambda_j) - \ln(\lambda_k)\right)$$

with $q_j = \lambda_j^{1-\alpha} / \sum_{j=1}^{p} \lambda_j^{1-\alpha}$ (so that $\sum_{j=1}^{p} q_j = 1$).

It follows that:

$$p_k'(\alpha) \geq 0 \quad \text{iff} \quad \ln(\lambda_k) \leq \sum_{j=1}^{p} q_j \ln(\lambda_j)$$

or, equivalently,

$$\lambda_k \leq \prod_{j=1}^{p} \lambda_j^{q_j}$$

Moreover,

$$p_k'(\alpha) \leq 0 \quad \text{iff} \quad \ln(\lambda_k) \geq \sum_{j=1}^{p} q_j \ln(\lambda_j)$$

This is equivalent to:

$$\lambda_k \geq \prod_{j=1}^{p} \lambda_j^{q_j}$$

Note that $\prod_{j=1}^{p} \lambda_j^{q_j}$ is a (weighted) geometric average of the eigenvalues λ1, …, λp, to which λk is compared. In particular, since λ1 is always larger than this geometric average and λp is always smaller, p1(α) decreases with α whereas pp(α) increases.
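The flattening of the spectrum can be observed numerically. The following sketch (a simulated spectrum; the eigenvalues and the helper `p_k` are illustrative assumptions) tracks the shares of the first and last components along the continuum:

```python
import numpy as np

# As alpha grows, the spectrum of X_alpha flattens: the share p_1(alpha) of
# the first component shrinks while the share p_p(alpha) of the last grows,
# and all shares meet at 1/p when alpha = 1.
rng = np.random.default_rng(1)
lam = np.sort(rng.uniform(0.1, 10.0, 6))[::-1]   # lambda_1 >= ... >= lambda_p

def p_k(k, alpha):
    # p_k(alpha) = lambda_k^{1-alpha} / sum_j lambda_j^{1-alpha}   (equation (6))
    w = lam ** (1 - alpha)
    return w[k] / w.sum()

alphas = np.linspace(0.0, 1.0, 21)
first = np.array([p_k(0, a) for a in alphas])
last = np.array([p_k(len(lam) - 1, a) for a in alphas])
assert np.all(np.diff(first) <= 1e-12)           # first share decreases
assert np.all(np.diff(last) >= -1e-12)           # last share increases
assert np.isclose(first[-1], 1 / len(lam))       # equal shares at alpha = 1
```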

A.4. Properties related to t*α

The aim of this appendix is to prove that the functions f(α) = var(t*α) and g(α) = cov²(t*α, y) are decreasing functions of α, whereas $R^2(\alpha) = \frac{\operatorname{cov}^2(t_\alpha^*, y)}{\operatorname{var}(t_\alpha^*)\,\operatorname{var}(y)}$ is an increasing function of α. We recall that:

$$X_\alpha = X (X^T X)^{-\alpha/2}$$

$$t_\alpha^* = X w_\alpha^*, \qquad w_\alpha = \frac{X_\alpha^T y}{\|X_\alpha^T y\|} \qquad (7)$$

$$w_\alpha^* = \frac{(X^T X)^{-\alpha/2}\, w_\alpha}{\|(X^T X)^{-\alpha/2}\, w_\alpha\|}, \qquad \|w_\alpha^*\| = 1 \qquad (8)$$
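The recalled quantities (7)–(8) can be sketched numerically as follows (simulated X and y; the data and the helper `t_star` are illustrative assumptions). The check confirms the closed form of t*α on which the next subsections rely:

```python
import numpy as np

# Build t*_alpha from equations (7)-(8) and check that it coincides with
# X (X^T X)^{-alpha} X^T y normalised by ||(X^T X)^{-alpha} X^T y||.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
y = rng.standard_normal(40)
lam, V = np.linalg.eigh(X.T @ X)

def t_star(alpha):
    S_half = V @ np.diag(lam ** (-alpha / 2)) @ V.T   # (X^T X)^{-alpha/2}
    Xa = X @ S_half                                    # X_alpha
    wa = Xa.T @ y
    wa = wa / np.linalg.norm(wa)                       # w_alpha, equation (7)
    ws = S_half @ wa
    ws = ws / np.linalg.norm(ws)                       # w*_alpha, equation (8)
    return X @ ws

alpha = 0.6
t1 = t_star(alpha)
M = V @ np.diag(lam ** (-alpha)) @ V.T                 # (X^T X)^{-alpha}
t2 = X @ (M @ X.T @ y) / np.linalg.norm(M @ X.T @ y)   # closed form of t*_alpha
assert np.allclose(t1, t2)
```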

A.4.1. Variance

We start by proving the first property: f(α) = var(t*α) is a decreasing function of α.

$$t_\alpha^* = X w_\alpha^* = \frac{X (X^T X)^{-\alpha} X^T y}{\|(X^T X)^{-\alpha} X^T y\|} \qquad (9)$$

It follows that

$$f(\alpha) = \frac{1}{n}\,\frac{y^T X (X^T X)^{-\alpha}\, X^T X\, (X^T X)^{-\alpha} X^T y}{y^T X (X^T X)^{-2\alpha} X^T y} \qquad (10)$$

Setting X = UΛ^{1/2}V^T, we have:

$$f(\alpha) = \frac{1}{n}\,\frac{y^T U \Lambda^{2(1-\alpha)} U^T y}{y^T U \Lambda^{1-2\alpha} U^T y} = \frac{1}{n}\,\frac{\sum_{j=1}^{p} \lambda_j^{2(1-\alpha)} a_j}{\sum_{j=1}^{p} \lambda_j^{1-2\alpha} a_j} \qquad (11)$$

with aj = cov²(y, uj) (uj being the jth column of the matrix U). By deriving with respect to α we have:

$$f'(\alpha) = \frac{1}{n}\,\frac{\left(-2\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{2(1-\alpha)}a_j\right)\left(\sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j\right) - \left(\sum_{j=1}^{p}\lambda_j^{2(1-\alpha)}a_j\right)\left(-2\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{1-2\alpha}a_j\right)}{\left(\sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j\right)^2}$$

Writing $\lambda_j^{2(1-\alpha)} = \lambda_j\,\lambda_j^{1-2\alpha}$, this gives:

$$n\,f'(\alpha) = -2\sum_{j=1}^{p}\ln(\lambda_j)\,\lambda_j\, q_j + 2\left(\sum_{j=1}^{p}\lambda_j\, q_j\right)\left(\sum_{j=1}^{p}\ln(\lambda_j)\, q_j\right)$$

with $q_j = \lambda_j^{1-2\alpha}a_j / \sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j$ positive and $\sum_{j=1}^{p} q_j = 1$. Since the λj (j = 1, …, p) are arranged in a decreasing order, the ln(λj) are also arranged in a decreasing order. By a direct application of Chebyshev's weighted sum inequality (A.1), we have f′(α) ≤ 0, and var(t*α) is a decreasing function of α.
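A numerical sketch of this decrease, evaluating f(α) through equation (11) on simulated data (the spectrum, y and all variable names are illustrative assumptions):

```python
import numpy as np

# f(alpha) = (1/n) sum_j lambda_j^{2(1-alpha)} a_j / sum_j lambda_j^{1-2 alpha} a_j
# is decreasing in alpha for any positive spectrum and nonnegative a_j.
rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.standard_normal((n, p)) @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5])
y = rng.standard_normal(n)
U, s, _ = np.linalg.svd(X, full_matrices=False)
lam = s ** 2
a = (U.T @ y) ** 2        # a_j proportional to cov^2(y, u_j)

def f(alpha):
    return (lam ** (2 * (1 - alpha)) * a).sum() / (lam ** (1 - 2 * alpha) * a).sum() / n

vals = np.array([f(al) for al in np.linspace(0.0, 1.0, 21)])
assert np.all(np.diff(vals) <= 1e-10)    # var(t*_alpha) decreases with alpha
```

Note that a_j enters the numerator and the denominator homogeneously, so any positive rescaling of a_j leaves f(α), and hence the check, unchanged.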

A.4.2. Covariance

We prove in the following that the function g(α) = cov²(t*α, y) is a decreasing function of α.

$$g(\alpha) = \operatorname{cov}^2(t_\alpha^*, y) = \frac{1}{n^2}\,\frac{\left(y^T X (X^T X)^{-\alpha} X^T y\right)^2}{y^T X (X^T X)^{-2\alpha} X^T y} \qquad (12)$$

Setting X = UΛ^{1/2}V^T, it follows that:

$$g(\alpha) = \frac{1}{n^2}\,\frac{(y^T U \Lambda^{1-\alpha} U^T y)^2}{y^T U \Lambda^{1-2\alpha} U^T y} = \frac{1}{n^2}\,\frac{\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)^2}{\sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j} \qquad (13)$$

with aj = cov²(uj, y). It follows:

$$n^2\, g'(\alpha) = \frac{-2\left(\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{1-\alpha}a_j\right)\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)\left(\sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j\right) + 2\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)^2\left(\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{1-2\alpha}a_j\right)}{\left(\sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j\right)^2}$$

We factorize by $\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j$ and rearrange this expression:

$$n^2\, g'(\alpha) = 2\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)\left[-\left(\sum_{j=1}^{p}\ln(\lambda_j)\,\lambda_j^{\alpha}\,q_j\right) + \left(\sum_{j=1}^{p}\lambda_j^{\alpha}\,q_j\right)\left(\sum_{j=1}^{p}\ln(\lambda_j)\,q_j\right)\right]$$

where $q_j = \lambda_j^{1-2\alpha}a_j / \sum_{j=1}^{p}\lambda_j^{1-2\alpha}a_j \geq 0$ and $\sum_{j=1}^{p} q_j = 1$. Since the λj (j = 1, …, p) are arranged in a decreasing order, the λj^α are arranged in a decreasing order and so are the quantities ln(λj). By a direct application of Chebyshev's weighted sum inequality (A.1), we have g′(α) ≤ 0, hence cov²(t*α, y) is a decreasing function of α.
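As before, the decrease of g(α) can be checked numerically through equation (13) (simulated data; all names are illustrative assumptions):

```python
import numpy as np

# g(alpha) = (1/n^2) (sum_j lambda_j^{1-alpha} a_j)^2 / sum_j lambda_j^{1-2 alpha} a_j
# is decreasing in alpha for any positive spectrum and nonnegative a_j.
rng = np.random.default_rng(4)
n, p = 60, 5
X = rng.standard_normal((n, p)) @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5])
y = rng.standard_normal(n)
U, s, _ = np.linalg.svd(X, full_matrices=False)
lam = s ** 2
a = (U.T @ y) ** 2        # a_j proportional to cov^2(u_j, y)

def g(alpha):
    return ((lam ** (1 - alpha) * a).sum()) ** 2 / (lam ** (1 - 2 * alpha) * a).sum() / n ** 2

vals = np.array([g(al) for al in np.linspace(0.0, 1.0, 21)])
assert np.all(np.diff(vals) <= 1e-10)    # cov^2(t*_alpha, y) decreases with alpha
```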

A.4.3. Variation in y explained by t*α

The aim of this appendix is to show that the percentage of variation in y explained by t*α, given by the ratio $R^2(\alpha) = \frac{\operatorname{cov}^2(t_\alpha^*, y)}{\operatorname{var}(t_\alpha^*)\,\operatorname{var}(y)}$, is a non-decreasing function. Using Eqs. (11) and (13), R²(α) is given by:

$$R^2(\alpha) = \frac{1}{n\,\operatorname{var}(y)}\,\frac{\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)^2}{\sum_{j=1}^{p}\lambda_j^{2(1-\alpha)}a_j} \qquad (14)$$

R²(α) has the same variation as the function:

$$\psi(\alpha) = \frac{\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)^2}{\sum_{j=1}^{p}\lambda_j^{2(1-\alpha)}a_j}$$

The derivative of ψ(α) with respect to α is equal to:

$$\psi'(\alpha) = \frac{A}{B}\left[-\left(\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{1-\alpha}a_j\right)\left(\sum_{j=1}^{p}\lambda_j^{2(1-\alpha)}a_j\right) + \left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)\left(\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{2(1-\alpha)}a_j\right)\right]$$

where $A = 2\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)$ and $B = \left(\sum_{j=1}^{p}\lambda_j^{2(1-\alpha)}a_j\right)^2$, which are both positive. Thus ψ′(α) ≥ 0 iff:

$$\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)\left(\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{2(1-\alpha)}a_j\right) \geq \left(\sum_{j=1}^{p}\ln(\lambda_j)\lambda_j^{1-\alpha}a_j\right)\left(\sum_{j=1}^{p}\lambda_j^{2(1-\alpha)}a_j\right)$$

Dividing both members of this inequality by $\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j\right)^2$ and denoting $q_j = \lambda_j^{1-\alpha}a_j / \sum_{j=1}^{p}\lambda_j^{1-\alpha}a_j$ (clearly $q_j \geq 0$ and $\sum_{j=1}^{p} q_j = 1$), the inequality above can be written as:

$$\sum_{j=1}^{p}\ln(\lambda_j)\,\lambda_j^{1-\alpha}\,q_j \geq \left(\sum_{j=1}^{p}\ln(\lambda_j)\,q_j\right)\left(\sum_{j=1}^{p}\lambda_j^{1-\alpha}\,q_j\right)$$

This inequality holds according to Chebyshev's weighted sum inequality (prerequisite A.1) since the λj (j = 1, …, p) are arranged in a decreasing order, and so are ln(λj) and λj^{1-α}.
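Finally, the non-decrease of R²(α) can be checked through ψ(α), which carries the same variation as equation (14) (simulated data; all names are illustrative assumptions):

```python
import numpy as np

# psi(alpha) = (sum_j lambda_j^{1-alpha} a_j)^2 / sum_j lambda_j^{2(1-alpha)} a_j
# has the same variation as R^2(alpha) and is non-decreasing in alpha.
rng = np.random.default_rng(5)
n, p = 60, 5
X = rng.standard_normal((n, p)) @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5])
y = rng.standard_normal(n)
U, s, _ = np.linalg.svd(X, full_matrices=False)
lam = s ** 2
a = (U.T @ y) ** 2        # a_j proportional to cov^2(y, u_j)

def psi(alpha):
    return ((lam ** (1 - alpha) * a).sum()) ** 2 / (lam ** (2 * (1 - alpha)) * a).sum()

vals = np.array([psi(al) for al in np.linspace(0.0, 1.0, 21)])
assert np.all(np.diff(vals) >= -1e-8)    # psi, hence R^2, is non-decreasing
```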

References

[1] B.M. Wise, N.L. Ricker, Identification of finite impulse response models with continuum regression, J. Chemom. 7 (1993) 1–14.
[2] S. de Jong, B.M. Wise, N.L. Ricker, Canonical partial least squares and continuum power regression, J. Chemom. 15 (2001) 85–100.
[3] A.L. Van Den Wollenberg, Redundancy analysis: an alternative for canonical correlation analysis, Psychometrika 42 (1977) 207–219.
[4] K.E. Muller, Relationships between redundancy analysis, canonical correlation, and multivariate regression, Psychometrika 46 (1981) 139–142.
[5] A. El Ghaziri, E.M. Qannari, A continuum standardization of the variables. Application to principal components analysis and PLS-regression, Chemom. Intell. Lab. Syst. 148 (2015) 95–105.
[6] R.A. van den Berg, H.C.J. Hoefsloot, J.A. Westerhuis, A.K. Smilde, M.J. van der Werf, Centering, scaling, and transformations: improving the biological information content of metabolomics data, BMC Genom. 7 (2006) 1–15.
[7] L. Eriksson, T. Byrne, E. Johansson, J. Trygg, C. Vikström, 13. Centering and Scaling, MKS Umetrics AB, third revised edition, Sweden, pp. 243–254.
[8] K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in Chemometrics, Taylor & Francis, CRC Press, Boca Raton, FL, USA, 2009.
[9] W.K. Härdle, L. Simar, Applied Multivariate Statistical Analysis, fourth edition, Springer, Berlin, Heidelberg, 2015.
[10] D.A. Belsley, A guide to using the collinearity diagnostics, Comput. Sci. Econ. Manag. 4 (1991) 33–50.
[11] K.J. Worsley, K.J. Friston, Analysis of fMRI time-series revisited-again, NeuroImage 2 (1995) 173–181.
[12] H. Abdi, Congruence: congruence coefficient, RV coefficient, and Mantel coefficient, in: Encyclopedia of Research Design, 2010, pp. 222–229.
[13] F. Kazi-Aoual, S. Hitier, R. Sabatier, J.-D. Lebreton, Refined approximations to permutation tests for multivariate inference, Comput. Stat. Data Anal. 20 (1995) 643–656.
[14] J.O. Ramsay, J.T. Berge, G.P.H. Styan, Matrix correlation, Psychometrika 49 (1984) 403–423.
[15] P. Robert, Y. Escoufier, A unifying tool for linear multivariate statistical methods: the RV-coefficient, Appl. Stat. 25 (1976) 257–265.
[16] P. Schlich, Defining and validating assessor compromises about product distances and attribute correlations, in: T. Næs, E. Risvik (Eds.), Data Handling in Science and Technology, vol. 16, Elsevier, 1996, pp. 259–306.
[17] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130.
[18] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. R. Stat. Soc., Ser. B (Methodol.) 36 (1974) 111–147.
[19] P.G.P. Martin, H. Guillou, F. Lasserre, S. Déjean, A. Lan, J.-M. Pascussi, M. SanCristobal, P. Legrand, P. Besse, T. Pineau, Novel aspects of PPARα-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study, Hepatology 45 (2007) 767–777.
[20] I. González, S. Déjean, CCA: Canonical Correlation Analysis, R package version 1.2, 2012.
[21] A. Baccini, P. Besse, S. Déjean, P.G.P. Martin, C. Robert-Granié, M. San Cristobal, Stratégies pour l'analyse statistique de données transcriptomiques, Journal de la Société Française de Statistique 146 (2005) 5–44.
[22] I. González, S. Déjean, P. Martin, A. Baccini, CCA: an R package to extend canonical correlation analysis, J. Stat. Softw. 23 (2008) 1–14.
[23] E. Vigneau, F. Thomas, Model calibration and feature selection for orange juice authentication by 1H NMR spectroscopy, Chemom. Intell. Lab. Syst. 117 (2012) 22–30.
[24] E. Vigneau, M. Chen, ClustVarLV: Clustering of Variables around Latent Variables, R package version 1.2, 2014.
[25] M. Stone, R.J. Brooks, Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression, J. R. Stat. Soc., Ser. B 52 (1990) 237–269.
[26] R. Sundberg, Continuum regression and ridge regression, J. R. Stat. Soc., Ser. B (Methodol.) 55 (1993) 653–659.
[27] Z. Cvetkovski, Inequalities, Springer, Berlin, Heidelberg, 2012.