Clusterwise PLS regression on a stochastic process


Computational Statistics & Data Analysis 49 (2005) 99–108
www.elsevier.com/locate/csda

C. Preda a,∗, G. Saporta b

a Département de Statistique-CERIM, Faculté de Médecine, Université de Lille 2, 1, Place de Verdun, 59045 Lille Cedex, France
b CNAM Paris, Chaire de Statistique Appliquée, CEDRIC, 292, Rue Saint Martin, 75141 Paris Cedex 03, France

Received 27 May 2003; received in revised form 4 May 2004; accepted 4 May 2004. Available online 9 June 2004.

∗ Corresponding author. Tel.: +33-320-62-69-69; fax: +33-320-52-10-22. E-mail addresses: [email protected] (C. Preda), [email protected] (G. Saporta).

doi:10.1016/j.csda.2004.05.002

Abstract

The clusterwise linear regression is studied when the set of predictor variables forms an $L_2$-continuous stochastic process. For each cluster, the estimators of the regression coefficients are given by partial least squares (PLS) regression. The number of clusters is treated as unknown and the convergence of the clusterwise algorithm is discussed. The approach is compared with other methods via an application on stock-exchange data.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Clusterwise regression; PLS regression; Principal component analysis; Stochastic process

1. Introduction

Cluster analysis based on stochastic models considers a cluster to be a subset of data points that can be modeled adequately, so as to reflect homogeneity with respect to the data analysis problem at hand. Clusterwise linear regression supposes that the points of each cluster are generated according to some linear regression relation: given a dataset $\{(x_i, y_i)\}_{i=1,\ldots,n}$, the aim is to find simultaneously an optimal partition $G$ of the data into $K$ clusters, $1 \le K < n$, and regression coefficients $(\alpha, \beta) = \{(\alpha_i, \beta_i)\}_{i=1,\ldots,K}$ within each cluster which maximize the overall fit:

$$G : \{1,\ldots,n\} \to \{1,\ldots,K\}, \qquad G^{-1}(i) \neq \emptyset, \ \forall i = 1,\ldots,K,$$
$$y_j = \alpha_i + \langle \beta_i, x_j \rangle + \varepsilon_{ij}, \qquad \forall j : G(j) = i, \ \forall i = 1,\ldots,K,$$
$$(G, (\alpha, \beta)) = \operatorname*{arg\,min} \sum_{i=1}^{K} \sum_{j : G(j)=i} \varepsilon_{ij}^2 .$$

In such a model, the parameters to be estimated are the number of clusters $K$, the regression coefficients $\{(\alpha_i, \beta_i)\}_{i=1,\ldots,K}$ for each cluster and the variance of the residuals $\varepsilon_{ij}$ within each cluster. Charles (1977) and Spaeth (1979) propose methods for estimating these parameters, considering a kind of piecewise linear regression based on the least-squares algorithm of Bock (1969). The algorithm is a special case of k-means clustering with a criterion based on the minimization of the squared residuals instead of the classical within-cluster dispersion. The estimation of the local models $\{(\alpha_i, \beta_i)\}_{i=1,\ldots,K}$ can be a difficult task (number of observations smaller than the number of explanatory variables, multicollinearity). Solutions such as clusterwise principal component regression (CW-PCR) or ridge regression (RR) are considered in Charles (1977). The partial least squares (PLS) approach (Wold et al., 1984) is considered for a finite number of predictors by Esposito Vinzi and Lauro (2003) as PLS typological regression. Other approaches based on mixtures of distributions are developed in DeSarbo and Cron (1988), Hennig (2000) and Hennig (1999).

In this paper, we propose to use the PLS estimators for the regression coefficients of each cluster in the particular case where the set of explanatory variables forms a stochastic process $X = (X_t)_{t\in[0,T]}$, $T > 0$. Thus, clusterwise PLS regression on a stochastic process is an extension of the global PLS approach given in Preda and Saporta (2002).

The paper is divided into three parts. In the first part we introduce some tools for linear regression on a stochastic process (PCR, PLS) and justify the choice of the PLS approach. The clusterwise linear regression algorithm adapted to PLS regression, as well as aspects related to the prediction problem, are discussed in the second part. In the last part we present an application of clusterwise PLS regression to stock-exchange data and compare the results with those obtained by other methods such as Aguilera et al. (1997) and Preda and Saporta (2002).

2. Some tools for linear regression on a stochastic process

Let $X = (X_t)_{t\in[0,T]}$ be a random process and $Y = (Y_1, Y_2, \ldots, Y_p)$, $p \ge 1$, a random vector defined on the same probability space $(\Omega, \mathcal{A}, P)$. We assume that $(X_t)_{t\in[0,T]}$ and $Y$ are of second order, $(X_t)_{t\in[0,T]}$ is $L_2$-continuous and, for any $\omega \in \Omega$, $t \mapsto X_t(\omega)$ is an element of $L_2([0,T])$. Without loss of generality we also assume that $E(X_t) = 0$, $\forall t \in [0,T]$, and $E(Y_i) = 0$, $\forall i = 1,\ldots,p$.

It is well known that the approximation of $Y$ obtained by the classical linear regression on $(X_t)_{t\in[0,T]}$, $\hat Y = \int_0^T \beta(t) X_t \, dt$, is such that $\beta$ is in general a distribution rather than a function of $L_2([0,T])$ (Saporta, 1981). This difficulty also appears in practice when one tries to estimate the regression coefficients $\beta(t)$ using a sample of size $N$. Indeed, if $\{(Y_1, X_1), (Y_2, X_2), \ldots, (Y_N, X_N)\}$ is a finite sample of $(Y, X)$, the system
$$Y_i = \int_0^T X_i(t)\,\beta(t)\,dt, \qquad \forall i = 1,\ldots,N,$$
has an infinite number of solutions (Ramsay and Silverman, 1997). Regression on the principal components (PCR) of $(X_t)_{t\in[0,T]}$ (Deville, 1978) and the PLS approach (Preda and Saporta, 2002) give satisfactory solutions to this problem.

2.1. Linear regression on principal components

Also known as the Karhunen–Loève expansion (Deville, 1974), the principal component analysis (PCA) of the stochastic process $(X_t)_{t\in[0,T]}$ consists in representing $X_t$ as
$$X_t = \sum_{i \ge 1} f_i(t)\,\xi_i, \qquad \forall t \in [0,T], \qquad (1)$$
where the set $\{f_i\}_{i\ge1}$ (the principal factors) forms an orthonormal system of deterministic functions of $L_2([0,T])$ and $\{\xi_i\}_{i\ge1}$ (the principal components) are uncorrelated zero-mean random variables. The principal factors $\{f_i\}_{i\ge1}$ are solutions of the eigenvalue equation
$$\int_0^T C(t,s)\,f_i(s)\,ds = \lambda_i f_i(t), \qquad (2)$$
where $C(t,s) = \mathrm{cov}(X_t, X_s)$, $\forall t, s \in [0,T]$. Therefore, the principal components $\{\xi_i\}_{i\ge1}$, defined as $\xi_i = \int_0^T f_i(t) X_t\,dt$, are eigenvectors of the Escoufier operator (Escoufier, 1970) $W^X$, defined by
$$W^X Z = \int_0^T E(X_t Z)\,X_t\,dt, \qquad Z \in L_2(\Omega). \qquad (3)$$
The process $(X_t)_{t\in[0,T]}$ and the set of its principal components $\{\xi_k\}_{k\ge1}$ span the same linear space. Thus, the regression of $Y$ on $(X_t)_{t\in[0,T]}$ is equivalent to the regression on $\{\xi_k\}_{k\ge1}$ and we have $\hat Y = \sum_{k\ge1} \big(E(Y\xi_k)/\lambda_k\big)\,\xi_k$. In practice we need to choose an approximation of order $q$, $q \ge 1$:
$$\hat Y_{\mathrm{PCR}(q)} = \sum_{k=1}^{q} \frac{E(Y\xi_k)}{\lambda_k}\,\xi_k = \int_0^T \hat\beta_{\mathrm{PCR}(q)}(t)\,X_t\,dt. \qquad (4)$$
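To make (1)–(4) concrete, here is a minimal numerical sketch of PCR for curves observed on a fine equidistant grid. It is an illustration under stated assumptions (equidistant sampling, empirical covariance, scalar response), not the authors' implementation; all names are hypothetical.

```python
import numpy as np

def functional_pcr(X, Y, q, dt):
    """PCR of a scalar response Y on curves X sampled on an equidistant grid.

    X  : (N, p) array, N curves observed at p grid points with step dt
    Y  : (N,) response
    q  : number of principal components kept (assumed small enough that
         the corresponding eigenvalues are strictly positive)
    dt : grid step, so that sums over the grid approximate integrals on [0, T]
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    C = (Xc.T @ Xc) / X.shape[0]              # empirical covariance C(t, s) on the grid
    # Discretised eigenvalue problem (2): the integral becomes a sum weighted by dt
    eigval, eigvec = np.linalg.eigh(C * dt)
    order = np.argsort(eigval)[::-1][:q]
    lam = eigval[order]
    f = eigvec[:, order] / np.sqrt(dt)        # principal factors, orthonormal in L2([0, T])
    xi = Xc @ f * dt                          # principal components xi_k = int f_k(t) X_t dt
    coef = (xi.T @ Yc) / X.shape[0] / lam     # E(Y xi_k) / lambda_k, as in (4)
    beta_pcr = f @ coef                       # beta_hat_PCR(q)(t) on the grid
    Y_hat = Y.mean() + Xc @ beta_pcr * dt     # approximation (4) of Y
    return beta_pcr, Y_hat
```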

But the use of principal components for prediction is heuristic, because they are computed independently of the response. The difficulty of choosing the principal components used for the regression is discussed in detail in Saporta (1981).

2.2. PLS regression on a stochastic process

The PLS approach offers a good alternative to the PCR method by replacing the least-squares criterion with that of maximal covariance between $(X_t)_{t\in[0,T]}$ and $Y$ (Preda and Saporta, 2002).


The PLS regression is an iterative method. Let $X_{0,t} = X_t$, $\forall t \in [0,T]$, and $Y_0 = Y$. At step $q$, $q \ge 1$, of the PLS regression of $Y$ on $(X_t)_{t\in[0,T]}$, we define the $q$th PLS component, $t_q$, as the eigenvector associated to the largest eigenvalue of the operator $W^X_{q-1} W^Y_{q-1}$, where $W^X_{q-1}$, respectively $W^Y_{q-1}$, are the Escoufier operators associated to $(X_{q-1,t})_{t\in[0,T]}$, respectively to $Y_{q-1}$. The PLS step is completed by the ordinary linear regression of $X_{q-1,t}$ and $Y_{q-1}$ on $t_q$. Let $X_{q,t}$, $t \in [0,T]$, and $Y_q$ be the random variables which represent the errors of these regressions: $X_{q,t} = X_{q-1,t} - p_q(t)\,t_q$ and $Y_q = Y_{q-1} - c_q t_q$. Then $\{t_q\}_{q\ge1}$ forms an orthogonal system in $L_2(X)$ and the following decomposition formulas hold:
$$Y = c_1 t_1 + c_2 t_2 + \cdots + c_q t_q + Y_q,$$
$$X_t = p_1(t)\,t_1 + p_2(t)\,t_2 + \cdots + p_q(t)\,t_q + X_{q,t}, \qquad t \in [0,T].$$

The PLS approximation of $Y$ by $(X_t)_{t\in[0,T]}$ at step $q$, $q \ge 1$, is given by
$$\hat Y_{\mathrm{PLS}(q)} = c_1 t_1 + \cdots + c_q t_q = \int_0^T \hat\beta_{\mathrm{PLS}(q)}(t)\,X_t\,dt. \qquad (5)$$
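For discretised curves (see Remark 1 below) and a scalar response, the iteration above reduces to the familiar finite-dimensional PLS1 recursion: the weight function at each step is proportional to the covariance between the current residual predictors and the current residual response, followed by deflation. The sketch below is an assumption-based illustration with hypothetical names, not the authors' code.

```python
import numpy as np

def pls1(X, y, q):
    """PLS regression of a scalar response on q components (NIPALS-style).

    X : (N, p) centred predictor matrix, e.g. discretised curves weighted
        by the square roots of the interval lengths (Remark 1)
    y : (N,) centred response
    Returns the coefficients in the original X scale and the PLS components.
    """
    N, p = X.shape
    Xk, yk = X.copy(), y.copy()
    T = np.zeros((N, q))                      # PLS components t_1, ..., t_q
    P = np.zeros((p, q))                      # X-loadings p_k
    W = np.zeros((p, q))                      # weight vectors
    c = np.zeros(q)                           # y-loadings c_k
    for k in range(q):
        w = Xk.T @ yk                         # ~ cov(X_{k-1,t}, Y_{k-1}) on the grid
        w /= np.linalg.norm(w)
        t = Xk @ w                            # k-th PLS component
        tt = t @ t
        P[:, k] = Xk.T @ t / tt               # regression of X_{k-1} on t
        c[k] = yk @ t / tt                    # regression of y_{k-1} on t
        Xk = Xk - np.outer(t, P[:, k])        # deflation: residual process X_k
        yk = yk - c[k] * t                    # residual response Y_k
        T[:, k], W[:, k] = t, w
    beta = W @ np.linalg.solve(P.T @ W, c)    # so that y_hat = X @ beta
    return beta, T
```

The closing identity beta = W (PᵀW)⁻¹ c is the usual finite-dimensional expression of the PLS fit in the original predictor scale; the number of components q would be chosen by cross-validation, as in the text.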

de Jong (1993) and Phatak and De Hoog (2001) show that, for a fixed $q$, the PLS regression fits closer than PCR, that is,
$$R^2(Y, \hat Y_{\mathrm{PCR}(q)}) \le R^2(Y, \hat Y_{\mathrm{PLS}(q)}). \qquad (6)$$

In Preda and Saporta (2002) we show the convergence of the PLS approximation to the approximation given by the classical linear regression:
$$\lim_{q\to\infty} E\big(\|\hat Y_{\mathrm{PLS}(q)} - \hat Y\|^2\big) = 0. \qquad (7)$$

In practice, the number of PLS components used for the regression is determined by cross-validation (Tenenhaus, 1998).

Remark 1 (Numerical solution). Because $(X_t)_{t\in[0,T]}$ is a continuous-time stochastic process, in practice we need a discretization of the time interval in order to obtain a numerical solution. In Preda (1999) we give such an approximation. Thus, if $\Delta = \{0 = t_0 < t_1 < \cdots < t_p = T\}$, $p \ge 1$, is a discretization of $[0,T]$, consider the process $(X^{\Delta}_t)_{t\in[0,T]}$ defined as
$$X^{\Delta}_t = \frac{1}{t_{i+1} - t_i} \int_{t_i}^{t_{i+1}} X_s\,ds, \qquad \forall t \in [t_i, t_{i+1}[, \ \forall i = 0,\ldots,p-1. \qquad (8)$$
Denote by $m_i$ the random variable $\frac{1}{t_{i+1}-t_i}\int_{t_i}^{t_{i+1}} X_s\,ds$, $i = 0,\ldots,p-1$. Then the approximation of the PLS regression on $(X_t)_{t\in[0,T]}$ by that on $(X^{\Delta}_t)_{t\in[0,T]}$ is equivalent to the PLS regression on the finite set $\{m_i \sqrt{t_{i+1}-t_i}\}_{i=0,\ldots,p-1}$ (Cazes, 1997). For some fixed $p$, we give in Preda (1999) a criterion for the choice of the optimal discretization according to the approximation given in (8).
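As an illustration of (8), a small sketch (hypothetical names, not the authors' C++ implementation) that turns raw curve observations into the weighted interval means $m_i\sqrt{t_{i+1}-t_i}$ on which the finite-dimensional PLS regression is then run:

```python
import numpy as np

def discretise_curves(curves, times, grid):
    """Approximation (8): average each curve on every subinterval of the grid,
    then weight by the square root of the interval length, so that PLS on the
    weighted means approximates PLS on the process (Cazes, 1997).

    curves : (N, n_obs) observed values of the N curves at the instants `times`
    times  : (n_obs,) observation instants in [0, T]
    grid   : (p+1,) discretisation points 0 = t_0 < t_1 < ... < t_p = T
    Assumes every subinterval contains at least one observation.
    """
    N, p = curves.shape[0], len(grid) - 1
    M = np.zeros((N, p))
    for i in range(p):
        if i < p - 1:
            mask = (times >= grid[i]) & (times < grid[i + 1])
        else:                                  # include the right end point T
            mask = (times >= grid[i]) & (times <= grid[i + 1])
        m_i = curves[:, mask].mean(axis=1)     # empirical mean as proxy for the integral mean
        M[:, i] = m_i * np.sqrt(grid[i + 1] - grid[i])
    return M
```

For the application of Section 4, `grid` would be the equidistant discretisation `np.arange(0, 3601, 60)` of [0, 3600] into 60 subintervals.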


3. Clusterwise PLS regression

Without loss of generality, let us suppose that $Y$ is a real random variable, i.e. $p = 1$. The clusterwise linear regression of $Y$ on $X$ states that there exists a random variable $G$, $G : \Omega \to \{1, 2, \ldots, K\}$, $K \in \mathbb{N} - \{0\}$, such that
$$E(Y \mid X = x, G = i) = \alpha_i + \beta_i x, \qquad V(Y \mid X = x, G = i) = \sigma_i^2 > 0, \qquad \forall x, \ \forall i = 1,\ldots,K, \qquad (9)$$

where $E$ and $V$ stand for expectation and variance, respectively, and $(\alpha_i, \beta_i)$ are the regression coefficients associated to cluster $i$, $i = 1,\ldots,K$.

Let us denote by $\hat Y$ the approximation of $Y$ given by the global linear regression of $Y$ on $X$, by $\hat Y^i = \alpha_i + \beta_i X$ the approximation of $Y$ given by the linear regression of $Y$ on $X$ within cluster $i$, $i = 1,\ldots,K$, and by $\hat Y^L = \sum_{i=1}^{K} \hat Y^i \mathbf{1}_{G=i}$ (L stands for "local"). Then the following decomposition formula holds (Charles, 1977):
$$V(Y - \hat Y) = V(Y - \hat Y^L) + V(\hat Y^L - \hat Y) = \sum_{i=1}^{K} P(G = i)\, V(Y - \hat Y^i \mid G = i) + V(\hat Y^L - \hat Y). \qquad (10)$$

Thus, the residual variance of the global regression is decomposed into a part of residual variance due to the linear regression within each cluster and another part representing the distance between the predictions given by the global and local models. This formula defines the criterion for estimating the local models used in the algorithms of Charles (1977) and Bock (1969).

3.1. Estimation

Let us consider $K$ fixed and assume that the homoscedasticity hypothesis holds, i.e. $\sigma_i^2 = \sigma^2$, $\forall i = 1,\ldots,K$. Charles (1977) and Bock (1969) use the following criterion for estimating the distribution of $G$, $\mathcal{L}(G)$, and $\{\alpha_i, \beta_i\}_{i=1}^{K}$:
$$\min_{\{\alpha_i,\beta_i\}_{i=1}^{K},\; \mathcal{L}(G)} V(Y - \hat Y^L). \qquad (11)$$

If $n$ data points $\{x_i, y_i\}_{i=1}^{n}$ have been collected, the clusterwise linear regression algorithm finds simultaneously an optimal partition of the $n$ points, $\hat G$ (as an estimation of $\mathcal{L}(G)$), and the regression models for each cluster (element of the partition), $(\hat\alpha, \hat\beta) = \{\hat\alpha_i, \hat\beta_i\}_{i=1}^{K}$, which minimize the criterion
$$V(K, \hat G, \hat\alpha, \hat\beta) = \sum_{i=1}^{K} \sum_{j : \hat G(j) = i} \big(y_j - (\hat\alpha_i + \hat\beta_i x_j)\big)^2. \qquad (12)$$

Notice that, under the classical hypotheses on the model (residuals within each cluster are independent and normally distributed), this criterion is equivalent to the maximization of the likelihood function (Hennig, 1999).


In order to minimize (12), the clusterwise linear regression algorithm iterates the following two steps:

(i) For given $\hat G$, $V(K, \hat G, \hat\alpha, \hat\beta)$ is minimized by the LS-estimators $(\hat\alpha_i, \hat\beta_i)$ computed from the points $(x_j, y_j)$ with $\hat G(j) = i$.

(ii) For given $\{\hat\alpha_i, \hat\beta_i\}_{i=1}^{K}$, $V(K, \hat G, \hat\alpha, \hat\beta)$ is minimized according to
$$\hat G(j) = \operatorname*{arg\,min}_{i \in \{1,\ldots,K\}} \big(y_j - (\hat\alpha_i + \hat\beta_i x_j)\big)^2. \qquad (13)$$

That is, $V(K, \hat G, \hat\alpha, \hat\beta)$ is monotonically decreasing if steps (i) and (ii) are carried out alternately:
$$\hat G^0 \Rightarrow (\hat\alpha^0, \hat\beta^0) \Rightarrow \hat G^1 \Rightarrow (\hat\alpha^1, \hat\beta^1) \Rightarrow \cdots \Rightarrow \hat G^l \Rightarrow (\hat\alpha^l, \hat\beta^l) \Rightarrow \cdots, \qquad V^0 \ge V^1 \ge \cdots \ge V^l \ge \cdots, \qquad (14)$$
where the index of each quantity denotes the iteration number, $\hat G^0$ being an initial partition of the $n$ data points.
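A minimal sketch of the alternating scheme (i)–(ii) for a finite number of predictors, with ordinary least squares within clusters (Section 3.2 replaces this within-cluster fit by the PLS regression when the predictors are curves). Names, the random initialisation and the stopping rule are illustrative assumptions.

```python
import numpy as np

def clusterwise_regression(X, y, K, n_iter=50, seed=None):
    """Alternating minimisation of criterion (12).

    X : (n, p) predictor matrix (finite-dimensional case)
    y : (n,) response
    K : number of clusters
    Returns the cluster labels G and the per-cluster coefficients (alpha_i, beta_i).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    G = rng.integers(K, size=n)                    # G^0: random initial partition
    Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
    coefs = np.zeros((K, p + 1))
    for _ in range(n_iter):
        # step (i): least-squares fit within each cluster
        for i in range(K):
            idx = np.flatnonzero(G == i)
            if len(idx) > p + 1:                   # keep the previous fit if the cluster is too small
                coefs[i], *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        # step (ii): reassign every point to its best-fitting local model, eq. (13)
        sq_resid = (y[:, None] - Xd @ coefs.T) ** 2     # (n, K) squared residuals
        G_new = sq_resid.argmin(axis=1)
        if np.array_equal(G_new, G):               # criterion (12) cannot decrease any further
            break
        G = G_new
    return G, coefs
```

Each iteration can only decrease criterion (12), which is the monotonicity expressed in (14).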

3.2. Functional data. The PLS approach

When the predictor variables form a stochastic process $X = (X_t)_{t\in[0,T]}$, i.e. the $x_i$ are curves, the classical linear regression is not adequate to provide estimators for the linear models within the clusters, $\{\alpha_i, \beta_i\}_{i=1}^{K}$ (Preda and Saporta, 2002). We propose to adapt the PLS regression to the clusterwise algorithm in order to overcome this problem. Thus, the local models are estimated using the PLS approach given in the previous section. Notice that the PCR and ridge regression approaches are discussed for the discrete case (finite number of predictor variables) in Charles (1977).

Let us denote by $\hat G_{\mathrm{PLS},s}$, $\{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K}$ these estimators at step $s$ of the clusterwise algorithm using the PLS regression. However, a natural question arises: is the clusterwise algorithm still convergent in this case? Indeed, the LS criterion is essential in the proof of the convergence of the algorithm (Charles, 1977). The following proposition justifies the use of the PLS approach in this context.

Proposition 2. For each step $s$ of the clusterwise PLS regression algorithm there exists a positive integer $q(s)$ such that $\hat G_{\mathrm{PLS},s}$ and $\{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K}$ given by the PLS regressions using $q(s)$ PLS components preserve the decreasing monotonicity of the sequence $\{V(\hat G_{\mathrm{PLS},s}, \{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K})\}_{s\ge1}$.

Proof. Let $s \ge 1$ and $\big(\hat G_{\mathrm{PLS},s}, \{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K}\big)$ be the estimators given by the clusterwise PLS algorithm at step $s$. From (13) we have, on the one hand,
$$V\big(\hat G_{\mathrm{PLS},s}, \{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K}\big) \ \ge\ V\big(\hat G_{\mathrm{PLS},s+1}, \{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K}\big).$$
On the other hand, from (7), there exists $q(s+1)$ such that
$$V\big(\hat G_{\mathrm{PLS},s+1}, \{\hat\alpha^i_{\mathrm{PLS},s}, \hat\beta^i_{\mathrm{PLS},s}\}_{i=1}^{K}\big) \ \ge\ V\big(\hat G_{\mathrm{PLS},s+1}, \{\hat\alpha^i_{\mathrm{PLS},s+1}, \hat\beta^i_{\mathrm{PLS},s+1}\}_{i=1}^{K}\big),$$
where $\{\hat\alpha^i_{\mathrm{PLS},s+1}, \hat\beta^i_{\mathrm{PLS},s+1}\}_{i=1}^{K}$ are the estimators of the regression coefficients within each cluster using $q(s+1)$ PLS components. Thus the proof is complete. □

Remark 3. (i) Because of the infinite dimension of the problem, one cannot state, for the general case, that the previous result remains valid when imposing decreasing monotonicity on the sequence $\{q(s)\}_{s\ge1}$. One can assert that, even for the case of a finite number of predictors, the strictly decreasing monotonicity of the sequence $\{q(s)\}_{s\ge1}$ is not guaranteed. (ii) In our experiments, the number of PLS components given by cross-validation with a confidence level higher than 0.9 provides a good approximation for $q(s)$.

Let us denote by $\hat G_{\mathrm{PLS}}$, $\{\hat\alpha^i_{\mathrm{PLS}}, \hat\beta^i_{\mathrm{PLS}}\}_{i=1}^{K}$ the estimators given by the clusterwise algorithm using the PLS regression.

Prediction: Given a new data point $(x_{i^*}, y_{i^*})$ for which we observe only $x_{i^*}$, the prediction of $y_{i^*}$ reduces to a pattern recognition problem: to which cluster does the point $(x_{i^*}, y_{i^*})$ belong? A rule that uses the k-nearest-neighbours approach is proposed by Charles (1977). Let $m$, $M$ be two positive integers such that $m \le M$. For each $k \in [m, M]$ let

• $N_k$ be the set of the $k$ nearest neighbours of $x_{i^*}$,
• $n_j(k) = |G^{-1}(j) \cap N_k|$, $\forall j \in \{1,\ldots,K\}$,
• $J(k) = \{j \in \{1,\ldots,K\} : n_j(k) = \max_{l=1,\ldots,K} n_l(k)\}$,
• $\nu_j = \sum_{k=m}^{M} \mathbf{1}_{J(k)}(j)$.

Then,
$$G(i^*) = \operatorname*{arg\,max}_{j} \nu_j. \qquad (15)$$
Therefore,
$$\hat y_{i^*} = \hat\alpha^{G(i^*)}_{\mathrm{PLS}} + \int_0^T \hat\beta^{G(i^*)}_{\mathrm{PLS}}(t)\,x_{i^*}(t)\,dt. \qquad (16)$$
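A short sketch of the assignment rule (15), using a plain Euclidean distance between discretised curves (an assumption; any distance between curves could be used) and hypothetical names:

```python
import numpy as np

def assign_cluster(x_new, X_train, G_train, K, m=3, M=10):
    """Charles' k-nearest-neighbour rule (15) for a new curve x_new.

    For every k in [m, M], count how many of the k nearest training curves fall
    in each cluster, record the majority cluster(s) J(k), and finally return the
    cluster that is most often in the majority.
    """
    dist = np.linalg.norm(X_train - x_new, axis=1)     # distances to the training curves
    neighbours = np.argsort(dist)
    votes = np.zeros(K)
    for k in range(m, M + 1):
        labels = G_train[neighbours[:k]]               # clusters of the k nearest neighbours
        counts = np.bincount(labels, minlength=K)      # n_j(k)
        J_k = np.flatnonzero(counts == counts.max())   # J(k)
        votes[J_k] += 1                                # nu_j = sum_k 1_{J(k)}(j)
    return int(votes.argmax())                         # rule (15)
```

The prediction (16) then applies the intercept and the PLS coefficient function of the selected cluster to the new curve, the integral being computed as a sum over the discretisation grid.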

It is important to notice that the properties of the clusterwise PLS regression do not change if $Y$ is a random vector of finite or infinite dimension. When $Y = (X_t)_{t\in[T, T+a]}$, the clusterwise PLS regression is used to predict the future of the process from its past. We will consider this situation in the following application on stock-exchange data.


Fig. 1. Evolution of the share 85: (1) before approximation; (2) after approximation.

The number of clusters is considered unknown. Charles (1977) proposes to choose $K$ by observing the behavior of the decreasing function $c(K) = V(Y - \hat Y^L)/V(Y)$. Other criteria based on the decomposition formula (10) are proposed in Plaia (2001). Hennig (1999) gives some approximations of $K$ based on the likelihood function.

4. Application on stock exchange data

The clusterwise PLS regression on a stochastic process presented in the previous sections is used to predict the behavior of shares over a certain lapse of time. We have developed a C++ application which implements the clusterwise PLS approach, varying the number of clusters and using the cross-validation criterion for different significance levels of the PLS components.

We have 84 shares quoted at the Paris stock exchange, for which we know the whole behavior of the growth index during one hour (between 10:00 and 11:00). Notice that a share is likely to change every second. We also know the evolution of the growth index of a new share (numbered 85) between 10:00 and 10:55. The aim is to predict the way that share will behave between 10:55 and 11:00 using the clusterwise PLS approach built with the other 84 shares. We use the approximation given in (8) by taking an equidistant discretization of the interval [0, 3600] (time expressed in seconds) into 60 subintervals. Fig. 1 gives the evolution of the share 85 in [0, 3300] before and after this approximation. The forecasts obtained will then match the average level of the growth index of share 85 on each interval $[60 \cdot (i-1), 60 \cdot i)$, $i = 56,\ldots,60$.

The same data are used in Preda and Saporta (2002), where the global PCR and PLS regressions are fitted. We denote by CW-PLS(k) the clusterwise PLS regression with k clusters, by PCR(k), respectively PLS(k), the global regression on the first k principal components, respectively on the first k PLS components. Considering the cross-validation approach with a significance level of 0.95, we obtain the following results:

Model        m̂_56(85)   m̂_57(85)   m̂_58(85)   m̂_59(85)   m̂_60(85)   SSE
Observed      0.700      0.678      0.659      0.516     −0.233      —
PLS(2)        0.312      0.355      0.377      0.456      0.534     0.911
PLS(3)        0.620      0.637      0.677      0.781      0.880     1.295
PCR(3)        0.613      0.638      0.669      0.825      0.963     1.511
CW-PLS(3)     0.643      0.667      0.675      0.482      0.235     0.215
CW-PLS(4)     0.653      0.723      0.554      0.652     −0.324     0.044
CW-PLS(5)     0.723      0.685      0.687      0.431     −0.438     0.055

Using the sum of squared errors (SSE) as a measure of fit, we observe that the clusterwise models give better results than the global analyses. The models with 4 and 5 clusters predict the crash of share 85 during the last 5 min, whereas the global models do not. For the model with 4 clusters, the estimation of the distribution of $G$ is $(\tfrac{17}{84}, \tfrac{32}{84}, \tfrac{10}{84}, \tfrac{25}{84})$, the point 85 belonging to the first cluster.
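A quick arithmetic check of the SSE column, using the rounded values printed in the table (the small gap with the published figure comes from the rounding of the displayed predictions):

```python
# Observed values and CW-PLS(4) forecasts for m_56(85), ..., m_60(85), from the table above
observed = [0.700, 0.678, 0.659, 0.516, -0.233]
cw_pls4  = [0.653, 0.723, 0.554, 0.652, -0.324]
sse = sum((o - f) ** 2 for o, f in zip(observed, cw_pls4))
print(round(sse, 3))   # about 0.042, consistent with the reported 0.044
```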

5. Conclusions

The clusterwise PLS regression on a stochastic process offers an interesting alternative to classical methods of clusterwise analysis. It is particularly well adapted to multicollinearity problems in regression, and also to situations where the number of observations is smaller than the number of predictor variables, which is often the case in the context of clusterwise linear regression.

Acknowledgements

We are grateful to the Groupe SBF (Bourse de Paris) for providing us with the stock-exchange data treated in this paper.

References

Aguilera, A.M., Ocaña, F., Valderrama, M.J., 1997. An approximated principal component prediction model for continuous-time stochastic process. Appl. Stochastic Models Data Anal. 13, 61–72.
Bock, H.H., 1969. The equivalence of two extremal problems and its application to the iterative classification of multivariate data. Lecture note, Vortragsausarbeitung, Tagung "Medizinische Statistik", Mathematisches Forschungsinstitut Oberwolfach, 10 pp.
Cazes, P., 1997. Adaptation de la régression PLS au cas de la régression après Analyse des Correspondances Multiples. Rev. Statist. Appl. XLIV (4), 35–60.
Charles, C., 1977. Régression typologique et reconnaissance des formes. Thèse de doctorat, Université Paris IX.
de Jong, S., 1993. PLS fits closer than PCR. J. Chemometr. 7, 551–557.


DeSarbo, W.S., Cron, W.L., 1988. A maximum likelihood methodology for clusterwise linear regression. J. Classification 5, 249–282.
Deville, J.C., 1974. Méthodes statistiques et numériques de l'analyse harmonique. Ann. l'INSEE France 15, 3–101.
Deville, J.C., 1978. Analyse et prévision des séries chronologiques multiples non stationnaires. Statist. Anal. Données 3, 19–29.
Escoufier, Y., 1970. Echantillonnage dans une population de variables aléatoires réelles. Publ. Inst. Statist. Univ. Paris 19 (4), 1–47.
Esposito Vinzi, V., Lauro, C., 2003. PLS regression and classification. In: PLS and Related Methods, Proceedings of the PLS'03 International Symposium, Decisia, Paris, pp. 45–56.
Hennig, C., 1999. Models and methods for clusterwise linear regression. In: Classification in the Information Age. Springer, Berlin, pp. 179–187.
Hennig, C., 2000. Identifiability of models for clusterwise linear regression. J. Classification 17, 273–296.
Phatak, A., De Hoog, F., 2001. PLSR, Lanczos, and conjugate gradients. CSIRO Mathematical & Information Sciences, Report No. CMIS 01/122, Canberra.
Plaia, A., 2001. On the number of clusters in clusterwise linear regression. In: Xth International Symposium on Applied Stochastic Models and Data Analysis, Proceedings, vol. 2, Compiègne, France, pp. 847–852.
Preda, C., 1999. Analyse factorielle d'un processus: problèmes d'approximation et de régression. Thèse de doctorat, No. 2648, Université de Lille 1, France.
Preda, C., Saporta, G., 2002. Régression PLS sur un processus stochastique. Rev. Statist. Appl. (France) L (2), 27–45.
Ramsay, J.O., Silverman, B.W., 1997. Functional Data Analysis. Springer Series in Statistics. Springer, New York.
Saporta, G., 1981. Méthodes exploratoires d'analyse de données temporelles. Cahiers du B.U.R.O., No. 37–38, Université Pierre et Marie Curie, Paris.
Spaeth, H., 1979. Clusterwise linear regression. Computing 22, 367–373.
Tenenhaus, M., 1998. La régression PLS. Théorie et pratique. Editions Technip, Paris.
Wold, S., Ruhe, A., Dunn III, W.J., 1984. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Statist. Comput. 5 (3), 735–743.