PLS regression on a stochastic process


Computational Statistics & Data Analysis 48 (2005) 149–158. www.elsevier.com/locate/csda

C. Preda (a,*), G. Saporta (b)

(a) Département de Statistique - CERIM, Faculté de Médecine, Université de Lille 2, 1, Place de Verdun, 59043 Lille Cedex, France
(b) CNAM Paris, Chaire de Statistique Appliquée, CEDRIC, 292, Rue Saint Martin, 75141 Paris Cedex 03, France

Received 9 October 2003; received in revised form 9 October 2003; accepted 9 October 2003

Abstract

Partial least squares (PLS) regression on an L2-continuous stochastic process is an extension of the finite case of a set of predictor variables. The existence of the PLS components as eigenvectors of some operator and convergence properties of the PLS approximation are proved. The results of an application to stock-exchange data are compared with those obtained by other methods.
© 2003 Elsevier B.V. All rights reserved.

Keywords: PLS regression; Stochastic process; Escoufier's operator; Principal component analysis

* Corresponding author. Tel.: +33-3-20-62-69-69; fax: +33-3-20-52-10-22. E-mail addresses: [email protected] (C. Preda), [email protected] (G. Saporta).

0167-9473/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2003.10.003

1. Introduction

It does not seem usual to perform a linear regression when the number of predictors is infinite. However, this is the case when one tries to predict a response variable Y from the observation of a time-dependent variable Xt, t ∈ [0, T] (for example, (Xt)t∈[0,T] can represent temperature curves observed in n places and Y the amount of crops). Theoretically, this can be expressed as the regression of Y on the process (Xt)t∈[0,T] (Fig. 1). The aim of this paper is to adapt PLS regression to the case where the set of predictor variables forms a stochastic process.


Fig. 1. Regression on a stochastic process.

The problems raised by the classical linear regression on a process, namely the indetermination of the regression coefficients (Ramsay and Dalzell, 1991; Ramsay and Silverman, 1997; Saporta, 1981) and the choice of the principal components of (Xt)t∈[0,T] as predictor variables (Deville, 1978; Saporta, 1981; Aguilera et al., 1998), get satisfactory solutions within this framework, the main results being derived from the properties of the Escoufier operator associated with the process (Xt)t∈[0,T]. PLS regression on a stochastic process is an extension of the finite case (finite set of predictors) developed by Wold et al. (1984), Tenenhaus et al. (1995) and Cazes (1997) (see also Eldén, 2003; Nguyen and Rocke, 2003). We prove the existence of the PLS components as well as some convergence properties towards the classical linear regression. The case Y = (Xt)t∈[T,T+a], a > 0, provides an alternative to the prediction methods proposed by Aguilera et al. (1998) and Deville (1978). The results of an application to stock-exchange data are compared with those obtained by other methods.

2. Some results about principal component analysis (PCA) and regression when data are curves

Let (Xt)t∈[0,T] be a random process and Y = (Y1, Y2, ..., Yp), p ≥ 1, a random vector defined on the same probability space (Ω, A, P). We assume that (Xt)t∈[0,T] and Y are of second order, that (Xt)t∈[0,T] is L2-continuous and that, for any ω ∈ Ω, t → Xt(ω) is an element of L2([0, T]). Without loss of generality we also assume that E(Xt) = 0 for all t ∈ [0, T] and E(Yi) = 0 for all i = 1, ..., p.

2.1. PCA of a stochastic process

Also known as the Karhunen–Loève expansion, the principal component analysis (PCA) of the stochastic process (Xt)t∈[0,T] consists in representing Xt as

    Xt = Σ_{i≥1} fi(t) ξi,  ∀t ∈ [0, T],  (1)


where the set {fi}i≥1 (the principal factors) forms an orthonormal system of deterministic functions of L2([0, T]) and {ξi}i≥1 (the principal components) are uncorrelated zero-mean random variables. The principal factors {fi}i≥1 are solutions of the eigenvalue equation

    ∫_0^T C(t, s) fi(s) ds = λi fi(t),  (2)

where C(t, s) = cov(Xt, Xs), ∀t, s ∈ [0, T]. The principal components {ξi}i≥1 are then defined as ξi = ∫_0^T fi(t) Xt dt.

For f, g ∈ L2([0, T]), let C be the covariance operator, C: f → g, with

    g(t) = ∫_0^T C(t, s) f(s) ds.

Then, (2) can be written as

    C fi = λi fi,  i ≥ 1.

The principal components are also eigenvectors of the Escoufier operator W^X, defined by

    W^X Z = ∫_0^T E(Xt Z) Xt dt,  Z ∈ L2(Ω).
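As a concrete illustration, the eigenproblem (2) can be approximated on an equidistant grid by an eigendecomposition of the empirical covariance matrix weighted by the quadrature step. The following is a minimal sketch under this assumption; the function name process_pca and the simulated curves are illustrative, not part of the original paper.

```python
import numpy as np

def process_pca(X, T):
    """Sketch of the PCA (Karhunen-Loeve expansion) of a process observed on an
    equidistant grid of [0, T]; X holds one curve per row."""
    n, p = X.shape
    dt = T / p                       # quadrature weight of each grid point
    Xc = X - X.mean(axis=0)          # centre the curves (E[X_t] = 0)
    C = (Xc.T @ Xc) / n              # empirical covariance C(t, s) on the grid
    # discretized eigenproblem of Eq. (2): sum_s C(t, s) f(s) dt = lambda f(t)
    lam, F = np.linalg.eigh(C * dt)
    order = np.argsort(lam)[::-1]
    lam, F = lam[order], F[:, order]
    F = F / np.sqrt(dt)              # factors normalised in L2([0, T])
    xi = Xc @ F * dt                 # principal components xi_i = int f_i(t) X_t dt
    return lam, F, xi

# usage on simulated Brownian-like curves (100 curves, 60 grid points on [0, 1])
rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(100, 60)), axis=1) / np.sqrt(60)
lam, F, xi = process_pca(X, T=1.0)
```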

Thus, the principal component analysis of (Xt)t∈[0,T] is the extension of the classical case corresponding to a finite number of random variables (see also Deville, 1974). This analogy is summarized in Table 1.

Table 1. Principal component analysis

             Finite case                         Infinite case
Variables    X1, ..., Xp, p ≥ 1                  Xt, t ∈ [0, T]
Data         vectors in R^p                      curves in L2([0, T])
Covariance   matrix V ∈ M_{p×p}(R)               operator C: L2([0, T]) → L2([0, T])
Factors      vector u ∈ R^p, Vu = λu             function f ∈ L2([0, T]), Cf = λf
Components   r.v. ξ ∈ L2(Ω), Wξ = λξ             r.v. ξ ∈ L2(Ω), Wξ = λξ

2.2. Linear regression on a stochastic process

Without loss of generality, let us consider Y to be a one-dimensional random vector (p = 1) and put Y = {Y}. Thus, the approximation of Y given by the linear regression on the stochastic process (Xt)t∈[0,T] is the projection Ŷ of Y on L2(X).


By the projection theorem, Ŷ satisfies the system

    E(Y Xt) = E(Ŷ Xt),  ∀t ∈ [0, T].  (3)

The linear model consists in writing Ŷ as

    Ŷ = ∫_0^T β(t) Xt dt,  β ∈ L2([0, T]),  (4)

and from (3) it follows that the function β must satisfy the Wiener–Hopf equation

    E(Xt Y) = ∫_0^T C(t, s) β(s) ds,  ∀t ∈ [0, T].  (5)

The solution of this equation is characterized by the Picard theorem (Saporta, 1981): Eq. (5) has a unique solution in L2([0, T]) if and only if

    Σ_{k≥1} c_k² / λ_k² < ∞,  with c_k = ∫_0^T E(Xt Y) fk(t) dt,  (6)

where {λk}k≥1 and {fk}k≥1 are the elements of the PCA of (Xt)t∈[0,T]. If condition (6) is not satisfied, β is a distribution instead of a function (Saporta, 1981). This difficulty also appears in practice when one tries to estimate the regression coefficients β(t) from a sample of size N. Indeed, if {(Y1, X1), (Y2, X2), ..., (YN, XN)} is a finite sample of (Y, X), X = (Xt)t∈[0,T], the system

    Yi = ∫_0^T Xi(t) β(t) dt,  ∀i = 1, ..., N,  (7)

has an infinite number of solutions (Ramsay and Silverman, 1997). Several approaches are proposed by Green and Silverman (1994), Palm and Iemma (1995) and Ramsay and Silverman (1997). In the next section we present two of these.

2.2.1. Linear regression on principal components

The process (Xt)t∈[0,T] and the set of its principal components {ξk}k≥1 span the same linear space. Therefore, the regression of Y on (Xt)t∈[0,T] is equivalent to the regression on {ξk}k≥1, and we have

    Ŷ = Σ_{k≥1} (E(Y ξk) / λk) ξk.  (8)

The fit of the regression is usually measured by

    R²(Y, Ŷ) = (1 / E(Y²)) Σ_{k≥1} E²(Y ξk) / λk.

In practice we need to choose an approximation of order q, q ≥ 1:

    Ŷ_q = Σ_{k=1}^{q} (E(Y ξk) / λk) ξk.  (9)
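A minimal numerical sketch of the truncated expansion (9) is given below, reusing the hypothetical process_pca helper from the earlier sketch; the function name pcr_fit and its interface are illustrative assumptions.

```python
import numpy as np

def pcr_fit(X, y, T, q):
    """Regression of a response y on the first q principal components of the
    process (Eq. (9)); X holds one curve per row, y is the response vector."""
    lam, F, xi = process_pca(X, T)          # helper from the previous sketch
    y = y - y.mean()
    # coefficients E(Y xi_k) / lambda_k, estimated by empirical moments
    coef = np.array([np.mean(y * xi[:, k]) / lam[k] for k in range(q)])
    y_hat = xi[:, :q] @ coef                # Y_hat_q as in Eq. (9)
    r2 = np.sum(coef**2 * lam[:q]) / np.mean(y**2)   # fit measure of Section 2.2.1
    return coef, y_hat, r2
```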


But the use of principal components for prediction is heuristic because they are computed independently of the response. The difficulty of choosing the principal components used for regression is discussed in detail in Saporta (1981).

2.2.2. Projection method

Let {ei}i=1,...,q be an orthonormal system of L2([0, T]). The projection method (Aguilera et al., 1998) consists in approximating β by β(t) ≈ Σ_{i=1}^{q} bi ei(t), bi ∈ R, ∀i = 1, ..., q. Relation (5) then becomes

    K b = a,

where K = (k_{i,j})_{1≤i,j≤q}, k_{i,j} = (C ei | ej), a = (ai)i=1,...,q ∈ R^q, ai = (ei, φ), and φ(t) = E(Y Xt), ∀t ∈ [0, T]. Notice that if {ei}i=1,...,q = {fi}i=1,...,q, the projection method is equivalent to the regression on the first q principal components.

The choice of the approximation method employed to perform linear regression on a stochastic process appears to be a compromise between the stability of the model and its quality of fit. The PLS approach gives a solution to this compromise and is presented in the next section.

3. PLS regression on a stochastic process

Under the hypotheses of Section 2, let C_YX and C_XY be the operators defined as

    C_YX: L2([0, T]) → R^p,  f → x,  xi = ∫_0^T E(Xt Yi) f(t) dt,  i = 1, ..., p,

    C_XY: R^p → L2([0, T]),  x → f,  f(t) = Σ_{i=1}^{p} E(Xt Yi) xi,  t ∈ [0, T],

and denote U_X = C_XY ∘ C_YX and U_Y = C_YX ∘ C_XY. Obviously, the operators C_YX and C_XY generalize the cross-covariance matrices used in the finite case (Tenenhaus et al., 1995). In addition, U_X and U_Y are self-adjoint, positive and compact operators with the same spectrum. Therefore, the spectral analyses of U_X and U_Y lead to a countable set of positive eigenvalues. The following proposition justifies the interest of these operators and gives the solution to the PLS problem.

Proposition 1 (Tucker criterion). The maximum

    max_{w,c} Cov²( ∫_0^T Xt w(t) dt, Σ_{i=1}^{p} ci Yi ),

where w ∈ L2([0, T]), ||w|| = 1, and c ∈ R^p, ||c|| = 1, is reached for w and c the eigenvectors associated with the largest eigenvalues of U_X and U_Y, respectively.


Let w1 ∈ L2([0, T]) be the eigenfunction of U_X associated with the largest eigenvalue. Then, the first PLS component (Tenenhaus et al., 1995) of the regression of Y on the process (Xt)t∈[0,T] is the random variable defined as

    t1 = ∫_0^T Xt w1(t) dt.  (10)
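Numerically, on an equidistant grid, the dominant eigenvectors of U_X and U_Y of Proposition 1 are, up to the quadrature weight, the leading left and right singular vectors of the discretized cross-covariance matrix E(Xt Yi). A minimal sketch follows; the name first_pls_direction and the matrix conventions (one curve per row of X, one observation per row of Y) are assumptions made for illustration.

```python
import numpy as np

def first_pls_direction(X, Y, T):
    """Sketch of Proposition 1 and Eq. (10) on a grid: w and c maximising
    Cov^2( int X_t w(t) dt , sum_i c_i Y_i ) are the leading singular vectors
    of the discretized cross-covariance E(X_t Y_i)."""
    n, p = X.shape
    dt = T / p
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxy = (Xc.T @ Yc) / n                   # p x q matrix approximating E(X_t Y_i)
    U, s, Vt = np.linalg.svd(Cxy * np.sqrt(dt), full_matrices=False)
    w = U[:, 0] / np.sqrt(dt)               # eigenfunction of U_X, ||w||_{L2} = 1
    c = Vt[0]                               # eigenvector of U_Y, ||c|| = 1
    t1 = Xc @ w * dt                        # first PLS component, Eq. (10)
    return w, c, t1
```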

Denote by W^X and W^Y the Escoufier operators (Escoufier, 1970) associated with (Xt)t∈[0,T] and Y, respectively, defined by

    W^X Z = ∫_0^T E(Xt Z) Xt dt,   W^Y Z = Σ_{i=1}^{p} E(Yi Z) Yi,  ∀Z ∈ L2(Ω).

Our main result is the following theorem.

Theorem 2. t1 is the eigenvector of W^X W^Y associated with the largest eigenvalue.

The PLS regression is an iterative method. Let X_{0,t} = Xt, ∀t ∈ [0, T], and Y_{0,i} = Yi, ∀i = 1, ..., p. At step h, h ≥ 1, of the PLS regression of Y on (Xt)t∈[0,T], we define the hth PLS component th as the eigenvector associated with the largest eigenvalue of the operator W^X_{h−1} W^Y_{h−1},

    W^X_{h−1} W^Y_{h−1} th = λ_max th,  (11)

where W^X_{h−1} and W^Y_{h−1} are the Escoufier operators associated with (X_{h−1,t})t∈[0,T] and Y_{h−1} = (Y_{h−1,i})i=1,...,p, respectively, and λ_max is the largest eigenvalue of W^X_{h−1} W^Y_{h−1}. Finally, the PLS step is completed by the ordinary linear regression of X_{h−1,t} and Y_{h−1,i} on th. Let X_{h,t}, t ∈ [0, T], and Y_{h,i}, i = 1, ..., p, be the random variables representing the residuals of these regressions:

    X_{h,t} = X_{h−1,t} − p_h(t) th,  t ∈ [0, T],
    Y_{h,i} = Y_{h−1,i} − c_{h,i} th,  i = 1, ..., p.

As in the finite case (Tenenhaus et al., 1995), the following statements hold.

Proposition 3. For each h ≥ 1:

(a) {th}h≥1 forms an orthogonal system in L2(X);
(b) Yi = c_{1,i} t1 + c_{2,i} t2 + · · · + c_{h,i} th + Y_{h,i}, i = 1, ..., p;
(c) Xt = p_1(t) t1 + p_2(t) t2 + · · · + p_h(t) th + X_{h,t}, t ∈ [0, T];
(d) E(Y_{h,i} tj) = 0, ∀i = 1, ..., p, ∀j = 1, ..., h;
(e) E(X_{h,t} tj) = 0, ∀t ∈ [0, T], ∀j = 1, ..., h.

From Proposition 3(b), the PLS approximation of Y by (Xt)t∈[0,T] at step h, h ≥ 1, is given by

    Ŷ_h = c1 t1 + · · · + ch th,  ci ∈ R^p,  i = 1, ..., h.  (12)
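The iterative scheme above can be sketched numerically by alternating extraction of a component and deflation by ordinary regression on it, as in Proposition 3 and Eq. (12). The sketch below reuses the hypothetical first_pls_direction helper from the previous example; pls_process and its conventions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pls_process(X, Y, T, h):
    """Sketch of the iterative PLS regression of Section 3: at each step the
    current residuals are regressed on the new component t_h (deflation),
    and Y_hat_h = c_1 t_1 + ... + c_h t_h as in Eq. (12)."""
    Xh = X - X.mean(0)
    Yh = Y - Y.mean(0)
    components, Y_hat = [], np.zeros_like(Yh, dtype=float)
    for _ in range(h):
        _, _, t = first_pls_direction(Xh, Yh, T)   # helper from the previous sketch
        norm2 = np.mean(t**2)
        p_t = (Xh * t[:, None]).mean(0) / norm2    # p_h(t): regression of X_{h-1,t} on t_h
        c_t = (Yh * t[:, None]).mean(0) / norm2    # c_h   : regression of Y_{h-1}   on t_h
        Xh = Xh - np.outer(t, p_t)                 # residuals X_{h,t}
        Yh = Yh - np.outer(t, c_t)                 # residuals Y_{h,i}
        Y_hat = Y_hat + np.outer(t, c_t)           # running PLS approximation, Eq. (12)
        components.append(t)
    return np.column_stack(components), Y_hat
```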


Let Ŷ denote the approximation of Y given by the ordinary linear regression on (Xt)t∈[0,T]. Then, the sequence {Ŷ_h}h≥1 converges in L2(Ω) to Ŷ.

Proposition 4.

    lim_{h→∞} E(||Ŷ_h − Ŷ||²) = 0.  (13)

Finally, the choice of h using the cross-validation criterion (Green and Silverman, 1994) remains applicable in this setting.

Remark 5 (Numerical solution). Because (Xt)t∈[0,T] is a continuous-time stochastic process, in practice we need a discretization of the time interval in order to obtain a numerical solution. In Preda (1999) we give such an approximation. Thus, if Δ = {0 = t_0 < t_1 < · · · < t_p = T}, p ≥ 1, is a discretization of [0, T], consider the process (X^Δ_t)t∈[0,T] defined as

    X^Δ_t = (1 / (t_{i+1} − t_i)) ∫_{t_i}^{t_{i+1}} Xs ds,  ∀t ∈ [t_i, t_{i+1}],  ∀i = 0, ..., p − 1.  (14)

Let m_i denote the random variable (1 / (t_{i+1} − t_i)) ∫_{t_i}^{t_{i+1}} Xt dt, i = 0, ..., p − 1. Then, the approximation of the PLS regression on (Xt)t∈[0,T] by that on (X^Δ_t)t∈[0,T] is equivalent to the PLS regression on the finite set {m_i √(t_{i+1} − t_i)}i=0,...,p−1. For a fixed p, we give in Preda (1999) a criterion for the choice of the optimal discretization Δ according to the approximation given in (14).

3.1. The continuous case

The previous results remain valid in the particular case Y = (Xt)t∈[T,T+a], a > 0. Indeed, because of the L2-continuity of the process (Xt)t∈[0,T+a], C_XY and C_YX are compact and therefore U_X and U_Y are compact. The results of Proposition 1 and Theorem 2 are preserved. The decomposition formulas of Proposition 3(b, c) become in this case

    Xt = t1 p_1(t) + · · · + th p_h(t) + X_{h,t},  ∀t ∈ [0, T],
    Xt = t1 c_1(t) + · · · + th c_h(t) + X_{h,t},  ∀t ∈ [T, T + a].  (15)

For each s ∈ [0, a], the "forecast" of X_{T+s} by (Xt)t∈[0,T] is given by

    X̂_{T+s} = t1 c_1(T + s) + · · · + th c_h(T + s).  (16)
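The approximation of Remark 5 amounts to averaging each observed curve over subintervals and weighting the means by the square root of the interval length. A minimal sketch, assuming curves sampled on a fine equidistant grid and stored one per row; the function name discretize_means is hypothetical.

```python
import numpy as np

def discretize_means(X_raw, n_intervals):
    """Remark 5 sketch: average each curve over n_intervals equal subintervals
    (Eq. (14)) and weight by sqrt(Delta t), so that PLS on the process is
    approximated by PLS on the finite set {m_i * sqrt(t_{i+1} - t_i)}."""
    n, n_points = X_raw.shape
    width = n_points // n_intervals
    # interval means m_i, one column per subinterval
    m = X_raw[:, :width * n_intervals].reshape(n, n_intervals, width).mean(axis=2)
    dt = width                      # here time is expressed in sampling units (e.g. seconds)
    return m * np.sqrt(dt)

# e.g. the setting of Section 4: 3600 one-second observations, 60 subintervals
# X_raw = ...  (84 curves x 3600 seconds); Z = discretize_means(X_raw, 60)
```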

4. Application on stock-exchange data

The PLS regression on a stochastic process is used to predict the behaviour of shares over a given period of time.


Fig. 2. Evolution of share 85: (1) before approximation; (2) after approximation.

Table 2. Percentages of variance

PLS component    %      % cum.      Principal component    %      % cum.
t1               91.7   91.7        1                      91.7   91.7
t2               2.9    94.6        2                      4.0    95.7
t3               2.7    97.3        3                      2.5    98.2

We have 84 shares quoted at the Paris stock exchange, for which we know the complete behaviour of the growth index during one hour (between 10:00 and 11:00); note that a share is likely to change every second. We also know the evolution of the growth index of a new share (numbered 85) between 10:00 and 10:55. The aim is to predict the way the new share will behave between 10:55 and 11:00 using a PLS model built with the other 84 shares. We use the approximation given in (14) by taking an equidistant discretization of the interval [0, 3600] (time expressed in seconds) into 60 subintervals. Fig. 2 gives the evolution of share 85 on [0, 3300] before and after this approximation. The forecasts obtained will thus match the average level of the growth index of share 85 on each interval [60(i − 1), 60i], i = 56, ..., 60. Using the SIMCA-P software (Umetri, 1996) we build several PLS models according to the number of components retained for the regression; we refer to the model with k PLS components as PLS(k). The forecasts obtained with these models are compared with those given by the regression on principal components (models denoted PCR(k)) and by the NIPALS algorithm (see Tenenhaus, 1998, for details). Table 2 gives the percentages of variance associated with the first three PLS components and with the first three principal components of the set {m_i}i=1,...,55.


Table 3. Forecasts and errors (SSE = Σ_{i=56}^{60} (m̂_i − m_i)²)

              m̂56(85)   m̂57(85)   m̂58(85)   m̂59(85)   m̂60(85)   SSE
Observed       0.700      0.678      0.659      0.516     −0.233      —
PLS(1)        −0.327     −0.335     −0.338     −0.325     −0.302     3.787
PLS(2)         0.312      0.355      0.377      0.456      0.534     0.928
PLS(3)         0.620      0.637      0.677      0.781      0.880     1.318
PCR(1)        −0.356     −0.365     −0.368     −0.355     −0.331     4.026
PCR(2)        −0.332     −0.333     −0.335     −0.332     −0.298     3.786
PCR(3)         0.613      0.638      0.669      0.825      0.963     1.538
NIPALS         0.222      0.209      0.240      0.293      0.338     1.000


Fig. 3. Forecasts for share 85: observed values and the PLS(2), PLS(3), PCR(3) and NIPALS forecasts.

To rate the quality of the models, we compare the forecasts obtained by each of them, denoted m̂_i(85), i = 56, ..., 60, with the observed values. The forecasts of these models are presented in Table 3 and Fig. 3. Considering the SSE as a global measure of model quality, the best forecasts are those of the PLS regression with two components. The models PLS(3) and PCR(3) are particularly good for short-term prediction (3300, 3480), but much less so for the interval (3480, 3600). The NIPALS algorithm also provides, on average, good predictions.


5. Conclusions

The PLS regression on a stochastic process offers an alternative to regression on principal components. It gives a solution both to the problem of multicollinearity among the predictors and to the situation where the number of observations is smaller than the number of predictor variables, which is often the case in this context.

References

Aguilera, A.M., Ocaña, F., Valderrama, M.J., 1998. An approximated principal component prediction model for continuous-time stochastic processes. Appl. Stochastic Models Data Anal. 13, 61–72.
Cazes, P., 1997. Adaptation de la régression PLS au cas de la régression après Analyse des Correspondances Multiples. Rev. Statist. Appl. XLIV (4), 35–60.
Deville, J.C., 1974. Méthodes statistiques et numériques de l'analyse harmonique. Ann. l'INSEE 15, 3–101.
Deville, J.C., 1978. Analyse et prévision des séries chronologiques multiples non stationnaires. Statist. Anal. Données 3, 19–29.
Eldén, L., 2003. Partial least-squares vs. Lanczos bidiagonalization I: analysis of a projection method for multiple regression. Comput. Statist. Data Anal., in press.
Escoufier, Y., 1970. Echantillonnage dans une population de variables aléatoires réelles. Publications de l'Institut de Statistique de l'Université de Paris 19 (4), 1–47.
Green, P.J., Silverman, B.W., 1994. Nonparametric Regression and Generalized Linear Models. A Roughness Penalty Approach, Monographs on Statistics and Applied Probability, No. 58. Chapman & Hall, London.
Nguyen, D.V., Rocke, D.M., 2003. On partial least squares dimension reduction for microarray-based classification: a simulation study. Comput. Statist. Data Anal., in press.
Palm, R., Iemma, A.F., 1995. Quelques alternatives à la régression classique dans le cas de colinéarité. Rev. Statist. Appl. XLIII (2), 5–33.
Preda, C., 1999. Analyse factorielle d'un processus: problèmes d'approximation et de régression. Thèse de Doctorat, No. 2648. Université de Lille 1, France.
Ramsay, J.O., Dalzell, C.J., 1991. Some tools for functional data analysis. J. Roy. Statist. Soc. (B) 53 (3), 539–572.
Ramsay, J.O., Silverman, B.W., 1997. Functional Data Analysis, Springer Series in Statistics. Springer, New York.
Saporta, G., 1981. Méthodes exploratoires d'analyse de données temporelles. Cahiers du B.U.R.O., No. 37–38. Université Pierre et Marie Curie, Paris.
Tenenhaus, M., 1998. La régression PLS. Théorie et Pratique. Editions Technip, Paris.
Tenenhaus, M., Gauchi, J.P., Menardo, C., 1995. Régression PLS et applications. Rev. Statist. Appl. XLIII (1), 7–63.
Umetri, A.B., 1996. SIMCA-P for Windows, Graphical Software for Multivariate Process Modeling. Umeå, Sweden.
Wold, S., Ruhe, A., Dunn III, W.J., 1984. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5 (3), 735–743.