Time-series forecasting through wavelets transformation and a mixture of expert models

Neurocomputing 28 (1999) 145-156

Ruy L. Milidiú*, Ricardo J. Machado, Raúl P. Rentería

Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil

Abstract

This paper describes a system formed by a mixture of expert models (MEM) for time-series forecasting. We deal with several different competing models, such as partial least squares, K-nearest neighbours and carbon copy. The input space, after changing its base using the Haar wavelets transform, is partitioned into disjoint regions by a clustering algorithm. For each region, a benchmark is performed among the different competing models, aiming at selecting the most adequate one. MEM has improved the forecast performance when compared with the single models, as experimentally demonstrated through two different time series: laser data and exchange rate data. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Partial least squares; Wavelets; Time-series forecasting; Mixture of experts

1. Introduction

Many potential applications of predictive models are related with forecasting discrete time series. A discrete time series is a finite or infinite enumerable set of observations of a given variable z(t) ordered according to the parameter time, and denoted as $z_1, z_2, \ldots, z_N$, where N is the size of the time series. A key assumption frequently adopted in time-series forecasting is that the statistical properties of the data generator are time independent. The goal is, given part of a time series $z_t, z_{t-1}, \ldots, z_{t-w+1}$ (named a pattern), to predict a future value $z_{t+h}$, where w is the length of the window on the time series and $h \ge 1$ is named the prediction horizon. We will focus on models that are able to explore the regularities of the patterns observed in the past for accurately predicting the short evolution of the system (h small).

Earlier research efforts on time-series forecasting focused on linear models typified by ARMA models. Two crucial developments appeared around 1980 [4]:
1. the state-space reconstruction paradigm by time-delay embedding, which drew on ideas from differential topology and dynamical systems to provide a technique for recognizing when a time series has been generated by deterministic governing equations and, if so, for understanding the geometrical structure underlying the observed behaviour;
2. the emergence of the field of machine learning, typified by neural networks, which can adaptively explore a large space of potential models [1,9].

In the first case, where an experimentally observed quantity arises from deterministic equations, the idea is to use time-delay embedding to recover a representation of the relevant internal degrees of freedom of the system from the observable [4]. Although the precise values of these reconstructed variables are not meaningful (because of the unknown change of coordinates), they can be used to make precise forecasts, since the embedded maps preserve their geometric structure. In the second case, the idea is to make function approximations by using an adaptive model, such as a connectionist network, to learn to emulate the input-output behavior. By moving a window along the discrete time series, we can create a training data set consisting of many sets of input values (patterns) with the corresponding target values (a small sketch of this windowing is given at the end of this section). Once the adaptive model has been trained, it can be presented with a pattern of observed values $z_t, z_{t-1}, \ldots, z_{t-w+1}$ and used to make a prediction for $z_{t+h}$.

One lesson from all the research already developed on time-series forecasting is that there is no one method universally superior to another. In this paper, we describe a method based on a mixture of expert models (MEM) for time-series forecasting. We consider the problem of learning a mapping in which the form of the mapping is different for different regions of the input space. The idea is to construct a specific predictive model for each input space region. To improve the quality of the information available to the models, we first perform a wavelet transformation of the time-series data.

The paper is organized as follows. Section 2 describes the MEM method; an extensive description of partial least squares (PLS) is included because it has been used intensively in our experiments. Section 3 contains experimental results on two time series distributed by the Santa Fe Institute [11]. Concluding remarks are given in Section 4.
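To make the windowing described above concrete, the following is a minimal sketch (with hypothetical names; the window width w = 8 used later in the paper is just an example) of how a discrete time series can be turned into input patterns and one-step-ahead targets.

```python
import numpy as np

def make_patterns(series, w=8, h=1):
    """Slide a window of width w over the series; each pattern
    z_t, ..., z_{t-w+1} is paired with the target z_{t+h}."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[t - w + 1:t + 1] for t in range(w - 1, len(series) - h)])
    y = np.array([series[t + h] for t in range(w - 1, len(series) - h)])
    return X, y

# Hypothetical usage on a toy series:
z = np.sin(0.3 * np.arange(200))
X, y = make_patterns(z, w=8, h=1)
print(X.shape, y.shape)   # (192, 8) (192,)
```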

2. Methodology

In this section we present MEM, a method for time-series forecasting based on a mixture of expert models.

MEM focuses on the problem of learning a mapping in which the form of the mapping is different for different regions of the input space. Although a single homogeneous adaptive model could be applied to this problem, we might expect that the task would be better performed if we assign different 'expert' models to tackle each of the different regions, and then use an extra 'gating' model, which also examines the input vector, to decide which one of the experts should be used to determine the output. If the problem has an obvious decomposition of this form, then it may be possible to design the MEM system by hand. However, a more powerful and more general approach is to discover a suitable decomposition as part of the learning process. Previous work adopting this approach may be seen in [4,6,10]. Our implementation of the gating model in MEM is based on the clustering algorithm Isodata [2], applied to Haar-wavelet-transformed data, which assumes previous knowledge of the number of input space classes (clusters) and uses the Euclidean distance as the similarity measure.

The idea of MEM is 'to divide and conquer': a complex problem is subdivided into simpler subproblems that are treated individually. The MEM method is implemented in five phases, described below (a compact sketch of the main phases is given right after this description).

(I) Changing the base of the input vector space: The base of the input space is changed by applying the Haar wavelets transform. In this way each sample is described by an overall shape plus details [7], thus allowing the clustering algorithm to group patterns that have closer shapes. This approach enables better modeling in the following phases.

(II) Input space partitioning: The transformed data, plus any pertinent data available for the training of the expert models, are partitioned by the Isodata algorithm into a predefined number of classes. As output, we obtain the samples and the centers of mass for each cluster (input space class). If we provide Isodata with an arbitrarily high number of initial random seeds (classes), we may observe that, at the end of Isodata's processing, many classes turn out to be empty (no samples associated). By discarding such classes, we get a smaller number of effective classes to partition the input space.

(III) Training of expert models: Unlike other MEM implementations, we deal with several categories of predictive models, such as neural networks, statistical models, etc. For each model type, we construct and train one specific model for each input space class identified in phase II, using its corresponding training samples. Hence we generate the set of models $\{m_{ij}\}$, with $i = 1, \ldots, C$ and $j = 1, \ldots, M$, where M is the number of model categories used in MEM and C is the number of effective input space classes identified in phase II.

(IV) Testing and benchmarking: Given an independent set of test patterns, we first classify them among the input space classes identified in phase II. This is done by selecting, for each test pattern, the centroid with the minimum Euclidean distance to it. Next, we test each expert model $m_{ij}$ using the test patterns belonging to class i, obtaining the performance measure $\mathrm{RMSE}_{ij}$ (root mean squared error). Finally, we select for each class i the winner expert model, denoted B(i), which is the one presenting the minimum $\mathrm{RMSE}_{ij}$ for $j = 1, \ldots, M$. The vector B(i), for $i = 1, \ldots, C$, works as the gating mechanism of MEM and indicates the winner expert model for each input space class.
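As a concrete illustration, here is a minimal sketch of phases I-IV, together with the forecast routing of phase V, under clearly stated assumptions: the pattern matrix X and target vector y are assumed to have been built by windowing as in Section 1, the window width is a power of two so the hand-coded Haar transform applies, scikit-learn's KMeans stands in for the Isodata algorithm (so seeding and empty-cluster handling differ from the paper), the benchmark is run on the training samples of each cluster rather than on an independent test set, and two off-the-shelf regressors play the role of expert categories. None of this is the authors' code; all names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def haar_transform(window):
    """Phase I: unnormalized Haar transform -> overall average followed by
    detail coefficients from coarse to fine (window length must be 2**k)."""
    coeffs, details = np.asarray(window, dtype=float), []
    while len(coeffs) > 1:
        avg = (coeffs[0::2] + coeffs[1::2]) / 2.0
        det = (coeffs[0::2] - coeffs[1::2]) / 2.0
        details = list(det) + details
        coeffs = avg
    return np.array(list(coeffs) + details)

def fit_mem(X, y, n_seeds=100):
    """Phases I-IV: wavelet base change, clustering, per-cluster training and benchmark."""
    W = np.apply_along_axis(haar_transform, 1, X)                        # phase I
    gating = KMeans(n_clusters=min(n_seeds, len(W)), n_init=10).fit(W)  # phase II (Isodata stand-in)
    categories = {"linear": LinearRegression,
                  "knn": lambda: KNeighborsRegressor(n_neighbors=1)}
    experts, winners = {}, {}
    for c in np.unique(gating.labels_):                                  # phases III-IV
        idx = gating.labels_ == c
        best = np.inf
        for name, make_model in categories.items():
            m = make_model().fit(W[idx], y[idx])
            rmse = np.sqrt(np.mean((m.predict(W[idx]) - y[idx]) ** 2))   # benchmark (here in-sample)
            experts[(c, name)] = m
            if rmse < best:
                best, winners[c] = rmse, name
    return gating, experts, winners

def mem_forecast(pattern, gating, experts, winners):
    """Phase V: route the wavelet-transformed pattern to its cluster's winning expert."""
    w = haar_transform(pattern).reshape(1, -1)
    c = int(gating.predict(w)[0])
    return float(experts[(c, winners[c])].predict(w)[0])
```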

(V) Forecasting: Given a wavelet-transformed pattern taken from a time series, we first identify its input space class i by selecting its nearest centroid. Then we select the model B(i) to treat this sample and produce the required forecast.

2.1. Model categories

MEM is a flexible method because it allows us to apply the most adequate model category to each input space region. The user is free to include in the benchmark the model categories that he judges may be useful for forecasting his particular time series, e.g., linear models, nonlinear models such as neural networks, and state-space embedding approaches. The MEM method will select the best model category for each input space region. In this paper we have dealt with only a limited number of model types for time-series prediction: partial least squares (PLS), K-nearest neighbours (KNN), and carbon copy (CRB).

2.1.1. Partial least squares (PLS)

Partial least squares (PLS) is a multivariate statistical method, based on the use of factors, which is aimed at prediction [5]. The goal is to predict the values of a set of variables y based on the observed values of a set of variables x. In our case, x may be formed by the values of a time-series window and y may be taken as the value of a single future observation, or as the values of a set of future points in the same time series. The construction of a PLS model requires a set of observation samples (patterns) and also their respective future values. Let X be the matrix containing in its rows the patterns of observations and Y be the matrix containing in its rows the future values to be predicted.

The PLS method provides additional predictive power when compared with both multiple linear regression (MLR) and principal components regression (PCR) [8]. MLR models Y from X using a least-squares criterion: all of X is available to model Y, and the method depends on the inversion of a matrix.

(Footnote) In vector notation, the first factor or principal component of X can be shown to be equal to a linear combination of the original variables as follows: $Xv = u$, where X is the original matrix of observations, v is the first eigenvector of the covariance matrix $X^{T}X$ and u is the score vector. Using all the eigenvectors, the matrix X can be expressed as $XVV^{T} = UV^{T}$, hence $X = UV^{T}(VV^{T})^{-1}$, where $V^{T}(VV^{T})^{-1}$ is the generalized inverse of V, the matrix of eigenvectors. The elements of the generalized inverse of V are called the principal component loadings. They range between -1 and +1 and are the cosines between the eigenvectors and the axes of the original variables.

PCR first models X by a score matrix U and then uses MLR to estimate the relationship between U and Y. The possibility of deleting factors, and thus reducing dimensionality, is considered its major advantage.

The PLS method is a modelling procedure that simultaneously estimates underlying factors in both X and Y. These factors are then used to define a subspace in X that is more adequate to model Y. With PCR, the rotation defined by the eigenvectors is used to find a subspace in X that subsequently is used to model Y. The approach taken by PLS is very similar to that of principal component analysis (PCA) [3], except that the factors are chosen to describe the variables in Y as well as in X. This is accomplished by using the columns of the Y matrix to estimate the factors for X. At the same time, the columns of X are used to estimate the factors of Y. The resulting models are

$$X = TP + E \quad \text{and} \quad Y = UQ + F,$$

where the elements of T and U are called the scores of X and Y, respectively, and the elements of P and Q are called the loadings. The matrices E and F are the errors associated with modeling X and Y with the PLS model. The T factors are not optimal for estimating the columns of X, as is the case with PCA, but are rotated so as to simultaneously describe the Y matrix. In the ideal situation, the sources of variation in X are exactly equal to the sources of variation in Y, and the factors for X and Y are identical. In real applications, X varies in ways not correlated to the variation in Y, and therefore $t \neq u$ (t and u are columns of T and U, respectively). However, when both matrices are used to estimate factors, the factors for the X and Y matrices have the following relationship:

$$u = bt + e,$$

where b is termed the inner relationship between u and t and is used to calculate subsequent factors if the intrinsic dimensionality of X is greater than one. Geometrically, this states that the vectors u and t are approximately equal except for their lengths. The main advantage of PLS over PCR is that it incorporates more information in the model-building phase.

The advantages gained by employing factor-based methods, such as PLS, are due to the characteristics of the factors. The first advantage of factor models comes from a computational limitation of MLR. MLR cannot be used in cases where the columns of measurement variables in X are linear combinations of each other. This is because MLR is based on the inversion of $X^{T}X$, which is not possible if X has some columns that are linear combinations of the others. Furthermore, the inversion process is unstable when some columns are nearly linear combinations of others. A model built on such an inverse will give large errors in the output variable estimates when noise is present in the X matrix. Factor-based models can eliminate collinearities without removing useful information.
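The collinearity point above can be made concrete with a few lines of NumPy. This is a hypothetical toy example (the data and the truncation rank are made up, and it uses an SVD-truncated pseudoinverse rather than PLS itself) showing that $X^{T}X$ becomes singular when one column duplicates another, while a factor-truncated solution still yields usable predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.column_stack([X, X[:, 0]])        # fourth column duplicates the first -> exact collinearity
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.normal(size=50)

# MLR needs (X^T X)^{-1}, which does not exist here:
print(np.linalg.matrix_rank(X.T @ X))    # prints 3, not 4 -> singular matrix

# A factor-based workaround: keep only the leading factors (here via truncated SVD).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3                                     # number of retained factors
beta = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T @ y
print(np.max(np.abs(X @ beta - y)))       # small residual despite the singular X^T X
```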

The second main advantage of factor-based models is the possibility of removing noise from the data matrix. Again, using PCA as an example, one can demonstrate that the inclusion of all the eigenvectors can actually decrease the predictive ability of the model. To figure out when this is occurring, the analyst can use cross-validation and build a factor-based calibration model using a set of samples named the training set. This calibration model is then applied to the X matrix of an independent set of samples named the test set, and an estimate of the Y matrix for the test set is obtained. Observe that, for the test set, the Y matrix is also known. The sum of squared deviations between the predicted values $\hat{y}_{ij}$ and the true values $y_{ij}$ is calculated for the test set. This value is termed the PRESS value, an acronym for predictive residual error sum of squares [5]. Factors are then added one at a time, and the PRESS value is recalculated. What usually occurs is that the PRESS value either levels off or reaches a minimum before all the factors have been included. Inclusion of additional factors beyond this point actually results in an increased PRESS value, which means poorer prediction. This is because the random variation, or noise, in the X matrix is being used to fit the Y matrix, that is, overfitting is occurring. By not including the factors that describe mainly noise, it is possible to increase the predictive ability of the model. Both PCR and PLS offer such a noise reduction capability.
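As an illustration of the factor-selection loop just described, the sketch below uses scikit-learn's PLSRegression to fit models with up to six factors (the count used by the paper's general PLS model) and scores each on a held-out test set. The data are synthetic and the split is arbitrary, so this is only a schematic of the PRESS procedure, not the authors' implementation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                     # patterns (e.g. 8-point windows)
y = X[:, :3] @ np.array([0.7, -1.2, 0.4]) + 0.1 * rng.normal(size=200)

X_train, y_train = X[:150], y[:150]               # calibration (training) set
X_test, y_test = X[150:], y[150:]                 # independent test set

press = []
for n_factors in range(1, 7):
    pls = PLSRegression(n_components=n_factors).fit(X_train, y_train)
    y_hat = pls.predict(X_test).ravel()
    press.append(np.sum((y_test - y_hat) ** 2))   # PRESS on the test set

best = 1 + int(np.argmin(press))                  # keep the factor count that minimizes PRESS
print(press, "-> selected", best, "factors")
```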

2.1.2. K-nearest neighbours (KNN)

KNN may be used to solve the forecasting problem for low-dimensional deterministic systems. If there is an understandable structure in the embedding space, it can be detected and roughly approximated by KNN. Given a pattern x, we compute its Euclidean distances to the centroids determined by the Isodata algorithm and select its nearest centroid. Then we select the K nearest neighbours of x belonging to the selected cluster. The prediction is taken as the average of the future values associated with the K nearest neighbours.

2.1.3. Carbon copy (CRB)

In this simple method we take as the prediction the last observed value of the time-series window, that is, $\mathrm{Prediction}(t+1) = z_t$.

3. Experimental results

We have tested the MEM method on two time series taken from the Santa Fe Time Series Prediction Analysis Competition, held during the fall of 1990 under the auspices of the Santa Fe Institute [4,11].

3.1. Performance measures

The following three performance measures, expressed in terms of one-step-ahead prediction errors, were employed in this paper:
1. RMSE - root mean squared error.
2. NMSE - normalized mean squared error.
3. DVS - direction variation symmetry.

To compute the value of RMSE we use the formula

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j \in \tau}\left(\mathrm{observation}_j - \mathrm{prediction}_j\right)^2},$$

where $\tau$ is the test set and N is the size of $\tau$. Similarly, for the NMSE computation we use

$$\mathrm{NMSE} = \frac{\sum_{j \in \tau}\left(\mathrm{observation}_j - \mathrm{prediction}_j\right)^2}{\sum_{j \in \tau}\left(\mathrm{observation}_j - \mathrm{mean}_\tau\right)^2}.$$

NMSE = 1 corresponds to predicting the unconditional mean. For the computation of DVS, we use

$$\mathrm{DVS} = \frac{1}{N-1}\sum_{j=2}^{N}\Psi\left(\Delta\mathrm{obs}_j \cdot \Delta\mathrm{pred}_j\right),$$

where $\Delta\mathrm{obs}_j = \mathrm{observation}_j - \mathrm{observation}_{j-1}$, $\Delta\mathrm{pred}_j = \mathrm{prediction}_j - \mathrm{prediction}_{j-1}$, and $\Psi$ is the Heaviside function, i.e., $\Psi(x) = 1$ if $x > 0$ and $\Psi(x) = 0$ otherwise. DVS is the percentage of correctly predicted direction variations with respect to the target variable.
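For concreteness, the three measures might be computed as follows; this is a small sketch with hypothetical array names, and the treatment of zero products in DVS follows the Heaviside definition above.

```python
import numpy as np

def rmse(obs, pred):
    return np.sqrt(np.mean((obs - pred) ** 2))

def nmse(obs, pred):
    return np.sum((obs - pred) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def dvs(obs, pred):
    d_obs = np.diff(obs)                    # observation_j - observation_{j-1}
    d_pred = np.diff(pred)                  # prediction_j - prediction_{j-1}
    return np.mean(d_obs * d_pred > 0)      # fraction of correctly predicted direction changes

# Hypothetical usage on a tiny test set:
obs = np.array([1.0, 1.4, 1.1, 1.3, 0.9])
pred = np.array([1.1, 1.3, 1.2, 1.2, 1.0])
print(rmse(obs, pred), nmse(obs, pred), dvs(obs, pred))
```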

3.2. Experiments on laser data

The first time series we have used came from a clean laboratory physics experiment. A total of 1000 points of far-infrared laser fluctuations, approximately described by three coupled nonlinear ordinary differential equations, were used as the training data for the construction of one-step-ahead predictive models. The next 100 points were used as the test data. We have used a sliding window with a width of eight points; this window size was the one that produced the best performance. The resulting patterns were transformed by the application of the Haar wavelet transform into sets of eight coefficients [7]. These patterns of wavelet coefficients were used as inputs for the predictive models and for the Isodata clustering algorithm. We started Isodata with 100 random seeds. At the end of its processing, Isodata produced 24 input space classes, that is, nonempty clusters.

Table 1 presents the performance of the single predictive models PLS, KNN and CRB on the test data set, as well as the performance of MEM. The general PLS model used 6 factors for the prediction. KNN used a neighbourhood size of 1. Fig. 1 plots the values predicted by MEM and the observed values of the 100 points of the laser time series used for testing. Table 2 shows in how many clusters each type of model (PLS, KNN, and CRB) was the winner of the benchmark made during the construction of MEM.

Table 1
Performance of the single models and of MEM on the laser test set

Model    RMSE      NMSE      DVS
PLS      33.035    0.4865    0.6837
KNN      16.940    0.0930    0.8367
CRB      54.180    0.9443    0.6837
MEM       4.757    0.0073    0.8673

Fig. 1. Prediction of the laser time series using MEM.

PLS was the best model in 9 out of the 24 clusters. When the number of training samples in a cluster was less than 10, we did not generate a specific PLS model to compete in the benchmark, but used the general PLS model (reported in row 1 of Table 1) to make the predictions; this occurred five times. When the general PLS model presented better performance (RMSE) than the specific PLS model built for a given cluster, we adopted the general PLS model; this occurred once.

We can observe that MEM presented a significantly better performance when compared with the single models (PLS, KNN, or CRB). Nevertheless, in terms of DVS, this did not happen.

Table 2
Victories of each model type in the MEM benchmark using the laser data

Model type    Number of clusters
PLS           9
KNN           13
CRB           2

The excellent performance presented is the result of subdividing the input space into 24 disjoint regions and of selecting the best available model type for each one of them. Because the laser data are approximately described by a set of deterministic equations, KNN was used more intensively by MEM. The PLS model, which adopts the function approximation approach, was also selected for a considerable number of clusters.

3.3. Currency exchange rate data

The second time series we have used contains 10 segments of 3000 points each, representing the high-frequency exchange rate between the Swiss franc and the US dollar. If one assumes an efficient market hypothesis, then such data should be a random walk. In this experiment, we have used only daily closing exchange rates, reducing in this way the size of the time series to 107 points. We have used a sliding window with a width of eight points; this window size was the one that produced the best performance. The resulting patterns were transformed into sets of eight wavelet coefficients by the application of the Haar wavelet transform. The wavelet coefficients, as well as the day of the week, were used as inputs for the predictive models and for the Isodata clustering algorithm. We started Isodata with 100 random seeds. At the end of its processing, Isodata produced 11 input space classes (nonempty clusters).

Table 3 presents the performance of the single predictive models PLS, KNN and CRB on the test data set, as well as the performance of MEM. The general PLS model used 6 factors for the prediction. KNN used a neighbourhood size equal to 1. Table 4 shows in how many clusters each type of model (PLS, KNN, and CRB) was the winner of the benchmark made during the construction of MEM. Four input space classes did not present patterns in the test set. PLS was the best model in 5 of the 7 nonempty clusters. When the general PLS model (reported in row 1 of Table 3) presented better performance than the expert PLS model built for a given cluster, we adopted it; this occurred four times.

We can observe that MEM presented a slightly better performance when compared with the single models (PLS, KNN, or CRB). KNN was less useful for forecasting the exchange rate data: it was adopted in only two clusters in the benchmark. This could be expected, since exchange rate data have a strong random character.

Table 3
Performance of the single models and of MEM on the exchange rate data

Model    RMSE      NMSE      DVS
PLS      0.0307    0.3547    0.5000
KNN      0.0417    0.6010    0.5000
CRB      0.0335    0.4537    0.6429
MEM      0.0264    0.2419    0.5714

Table 4
Victories of each model type in the MEM benchmark using the exchange rate data

Model type    Number of clusters
PLS           5
KNN           2
CRB           0

4. Concluding remarks

We have presented a system formed by a mixture of expert models (MEM) for time-series forecasting. Since there is no single predictive method universally superior to the others, we have expanded previous implementations of MEM by allowing different types of predictive models to participate in its constitution. After performing a base change with the Haar wavelets transform, the input space was partitioned into disjoint regions by a clustering algorithm, and for each region a benchmark was performed among the different competing model types, aiming at selecting the most adequate one for that region.

We have made experiments with two very different time series: one of them formed by laser data and the other formed by financial data. The first series may be approximately described by deterministic equations, while the financial time series has a strong random character. We have employed three types of predictive models: partial least squares, K-nearest neighbours and carbon copy. In this way MEM combined the approaches of state-space reconstruction and of function approximation. MEM has been able to improve the forecasting performance on both types of time series when compared with the single models.

As the main advantages of the MEM method we can enumerate:
1. The use of the Haar wavelets transform to perform a base change of the input vector space, giving an overall shape description of each pattern to the clustering algorithm.
2. The independence of models and data, which makes it possible to train and adjust the different predictive models individually and in parallel.

3. The possibility of using different classes and variations of adaptive models, selecting those that best fit particular regions of the input space.

Future work involves the inclusion of other model types in MEM, such as neural networks, and the testing of MEM with other types of time series.

References

[1] C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
[2] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[3] R. Gnanadesikan, Methods for Statistical Data Analysis of Multivariate Observations, Wiley, New York, 1977.
[4] A.S. Weigend, N.A. Gershenfeld, Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1992.
[5] P. Geladi, B.R. Kowalski, Partial least squares regression: a tutorial, Anal. Chim. Acta 185 (1986) 1-17.
[6] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79-87.
[7] E.J. Stollnitz, T.D. DeRose, D.H. Salesin, Wavelets for computer graphics: a primer, Part 1, IEEE Comput. Graphics Appl. 15 (3) (1995) 76-84.
[8] K.R. Beebe, B.R. Kowalski, An introduction to multivariate calibration and analysis, Anal. Chem. 59 (1987) 1007-1017.
[9] R.J. Machado, P.E.C.S.A. Neves, Multi-model neural network for image classification, Proceedings of the II Workshop on Cybernetic Vision, São Carlos, Brazil, 1996.
[10] A.C.G. Thome, A massively parallel nonlinear system identification: a committee architecture, Ph.D. Dissertation, Purdue University, USA, 1993.
[11] Santa Fe Institute, ftp://ftp.santafe.edu/pub/Time-Series/Competition/.

Ruy Luiz Milidiú received the Ph.D. degree in operations research from the University of California, Berkeley. He is currently an Assistant Professor in the Informatics Department at the Pontifícia Universidade Católica do Rio de Janeiro, Brazil, where he also coordinates the Algorithms Engineering and Neural Networks Lab. His research activity is in data compression, neural networks and systems optimization.

Ricardo Machado received a D.Sc. in Computer Science from the Federal University of Rio de Janeiro, Brazil, in 1985. Until his untimely passing in late 1997 he was with the Algorithms Engineering and Neural Networks Lab at the Catholic University of Rio de Janeiro, Brazil. Prior to that he was with the IBM Rio Scientific Center, where he conducted research on neural network applications.

Raúl Rentería graduated in computer engineering in 1996 from the Pontifícia Universidade Católica do Rio de Janeiro, Brazil. He is currently pursuing the M.S. degree in the Informatics Department there, where he is also a research assistant at the Algorithms Engineering and Neural Networks Lab.