Adaptive forecasting of irregular demand processes


Engineering Applications of Artificial Intelligence 17 (2004) 137–143

José L. Carmo (a), António J. Rodrigues (b,*)

(a) Faculty of Science and Technology, Universidade do Algarve, Faro 8000, Portugal
(b) Centre for Operational Research, DEIO - Faculty of Sciences, Universidade de Lisboa, Campo Grande, Edifício C6, Lisboa 1749-016, Portugal

Received 31 October 2003; received in revised form 31 October 2003; accepted 12 January 2004

Abstract

The paper presents comparative results from different approaches to the problem of forecasting irregularly spaced time series, including those representing irregular demand processes in inventory management or production planning problems. The best-known methods for that purpose are based on simple exponential smoothing of the time and magnitude of occurrence of demand. It is shown how supervised neural networks can be used as suitable filtering and forecasting models for many of those processes. In particular, we discuss the identification and estimation issues, as well as the limitations, associated with the application of Gaussian radial or elliptical basis function networks.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Forecasting; Radial basis function networks; Recursive estimation

1. Introduction

In many time series analysis and forecasting problems, the variable of interest occurs or is registered at irregularly spaced dates or points in time. Most notably, many inventory management or production planning systems have to handle the problem of forecasting irregular, or intermittent, demand processes, where both the quantities and the time intervals vary.

Let {(x_k, z_k)}, k = 1,...,N, denote such a time series, recorded at dates x_k (x_{k-1} < x_k). Usually, z_k represents the quantity demanded on date x_k, but other interpretations are possible, as in the case of the sample series shown in Fig. 1. The demand size may be correlated, in particular, with the previous time interval, Δx_k = x_k − x_{k−1}. Furthermore, let y_k = Σ_{i=1}^{k} z_i denote the cumulative demand up to, and including, time x_k, and let d_k = z_k / Δx_k denote the demand rate.

Common objectives in practice are: to estimate the total demand up to a given future date t > x_N; and to estimate the time and magnitude of the next occurrence of demand. One might be tempted to directly model the evolution of the demand rate, but this would be inadequate, as it might mask or completely ignore important features of the process. Instead, two other approaches are defensible, and are discussed in the present work: modelling either the bivariate series of differenced data, {(Δx_k, z_k)}, or the bivariate series of cumulative data, {(x_k, y_k)}.

Gaussian radial basis function networks are shown here to be adequate nonlinear autoregressive models for handling the data in the differenced form. However, in some cases, it may be preferable to handle the data in the cumulative form, using suitable time-varying parameter linear models, namely special exponential smoothing methods (Wright, 1985; Cipra et al., 1995; Carmo, 1997), or state-space models, estimated through the Kalman filter (Rodrigues, 1996). In the following sections, these different approaches are briefly described and demonstrated, and some comparative experimental results are presented.

This work was motivated by two real-world problems, where forecasting and decision support systems were developed to handle hundreds of irregular time series: one in marketing and production planning of kraft paper rolls, and the other in the area of petrol supply chain management. In either case, demand and sales volume virtually coincide. The data sets reported in the present paper come from the first of those projects.

*Corresponding author. Tel.: +351-217500036; fax: +351-217500081. E-mail address: [email protected] (A.J. Rodrigues).

0952-1976/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.engappai.2004.01.001
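The two representations above, differenced {(Δx_k, z_k)} and cumulative {(x_k, y_k)}, are straightforward to construct from a raw series. A minimal sketch (the dates and quantities below are made up for illustration):

```python
import numpy as np

# Hypothetical irregular demand series: dates x_k (in days) and quantities z_k.
x = np.array([3.0, 10.0, 12.0, 25.0, 31.0])
z = np.array([120.0, 80.0, 200.0, 50.0, 90.0])

# Differenced form {(dx_k, z_k)}: time intervals paired with demand sizes
# (the first interval is measured from time zero, an arbitrary convention here).
dx = np.diff(x, prepend=0.0)

# Cumulative form {(x_k, y_k)}: running total of demand up to and including x_k.
y = np.cumsum(z)

# Demand rate d_k = z_k / dx_k.
d = z / dx
```

Either bivariate series can then be fed to the smoothing methods or neural models discussed below.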


Fig. 1. Sample time series "A": (a) orders received by a paper manufacturing company during 9 years (time axis in days); (b) the cumulative demand series; (c) scatter plot of quantities vs. time intervals; (d) idem, after basic preprocessing (log-transformation and standardization).

2. Exponential smoothing methods

2.1. Croston's method

The best-known method for handling irregular demand was proposed by Croston (1972), and is currently considered a reference method against which more complex alternatives may be compared. This method is based on the separate handling of the series of time intervals, {Δx_k}, and the series of demand quantities, {z_k}. The simple exponential smoothing method is applied to each of these series, with the same smoothing coefficient, in order to compute one-step-ahead predictions, ẑ_{k+1|k} and Δx̂_{k+1|k}. The future demand rate is then estimated by d̂_{k+1} = ẑ_{k+1} / Δx̂_{k+1}. This has been proved to be a biased estimate, and one may consider possible corrective formulae (Syntetos and Boylan, 2001). However, the optimization of a common smoothing coefficient for both series may lead to noticeably poorer results in some cases.
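A minimal sketch of Croston's updates as just described, with simple exponential smoothing applied separately to intervals and sizes (initializing from the first observation and the value of alpha are illustrative choices here, and the Syntetos–Boylan bias correction is omitted):

```python
def croston(dx, z, alpha=0.2):
    """One-step-ahead Croston forecasts: simple exponential smoothing, with
    the same coefficient alpha, applied separately to the intervals {dx_k}
    and the sizes {z_k}; the demand-rate forecast is their ratio.
    Returns (z_hat, dx_hat, d_hat) for the next occurrence of demand."""
    z_hat, dx_hat = z[0], dx[0]           # initialize with the first observation
    for zk, dxk in zip(z[1:], dx[1:]):
        z_hat += alpha * (zk - z_hat)     # SES update for the demand size
        dx_hat += alpha * (dxk - dx_hat)  # SES update for the time interval
    return z_hat, dx_hat, z_hat / dx_hat

z_next, dx_next, rate = croston([3, 7, 2, 13, 6], [120, 80, 200, 50, 90])
```

In practice one smoothing coefficient is kept for both series, as in the original method; as noted above, optimizing a single common coefficient may nevertheless give noticeably poorer results in some cases.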

2.2. Wright's method

Wright (1985) proposed an extension of Holt's exponential smoothing method for irregular time series with a trend component. Later, Cipra et al. (1995) further extended the method to seasonal time series. In some cases, one may find such trend or periodic behavior in the basic time series {z_k}, but it is preferable to apply the above methods to the series of cumulative demand, {y_k}.

As in Holt's method, Wright's method considers a model linear in two time-varying parameters, level (a_k) and slope (b_k). These are exponentially smoothed under the control of independent adaptive smoothing hyperparameters (α_k, β_k), which in turn are updated according to the variations in the time intervals, {Δx_k}.

For nondecreasing time series, the original estimation formulae of the method may lead to negative predictions of demand, ẑ_{k+1|k} = ŷ_{k+1|k} − y_k. It is then preferable to define, instead,

    ẑ_{k+1|k} = (Δx_{k+1}) b̂_k.    (1)

In a separate task, time intervals can be predicted through simple exponential smoothing. In the event that the original parameter updating formulae (Wright, 1985) produce an estimate b̂_k < 0, we propose the following redefinitions:

    â_k = â_{k−1} + (1 − 1/β_k) ẑ_{k|k−1},   b̂_k = 0.    (2)

From Carmo (1997), the method seems to be especially appealing when the demand sizes are nonstationary in time, but it suffers in the presence of outliers in {z_k} or in {Δx_k}.
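Wright's full updating formulae are not reproduced here. The sketch below only illustrates two pieces of the discussion above: the prediction form of Eq. (1) with the non-negativity guard of Eq. (2), and one common way, assumed here rather than taken from Wright (1985), of letting the effective smoothing weight depend on the elapsed interval:

```python
def ses_weight(alpha, dt):
    """Effective smoothing weight after an elapsed interval dt: the one-step
    weight alpha compounds so that long gaps discount the old estimate more
    heavily. This is one common formulation for irregular intervals; it is
    an assumption here, not Wright's exact updating rule."""
    return 1.0 - (1.0 - alpha) ** dt

def predict_next_demand(slope, dx_next):
    """Eq. (1): forecast the next demand size from the estimated slope b_k of
    the cumulative demand, rather than as y_hat - y_k (which can go negative
    for a nondecreasing series). A negative slope estimate is clamped at
    zero, in the spirit of the redefinition of Eq. (2)."""
    return dx_next * max(slope, 0.0)
```

For example, `predict_next_demand(12.5, 4)` forecasts a demand of 50 units over a 4-day interval, while a negative slope estimate yields a forecast of zero rather than a negative quantity.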


3. Neural models

3.1. Supervised neural networks

Supervised neural networks have been widely applied as predictive nonlinear models for regularly spaced time series, {y_k}, k = 1,...,N, usually according to an autoregressive (single or multi-step ahead) input-output setup:

    (y_{k−p+1}, ..., y_{k−1}, y_k) → (y_{k+1}, ..., y_{k+q}).    (3)

This setup has been adopted by many practitioners even when the time series is heavily nonstationary. However, poor generalization is a natural consequence to expect when the training patterns have a nonstationary distribution. Generalization is riskier, in general, when the new input patterns are relatively different, i.e. far from the previously observed ones.

The autoregressive setup can be adapted to the case of irregularly spaced time series, taking into consideration the above remarks. A practical and appropriate approach is, therefore, to model the bivariate series of differenced data, {(Δx_k, z_k)} (Carmo, 1997). To predict the next value of demand conditional on knowing the (expected) time of occurrence, one can consider the following structure for the training patterns:

    (Δx_{k−p+1}, z_{k−p+1}, ..., Δx_k, z_k, Δx_{k+1}) → (z_{k+1}).    (4)

A similar scheme can be used to solve the dual problem, namely, to estimate the time needed to observe a given total size of demand. More generally, one can also synthesize a neural model with two outputs, in order to simultaneously predict both variables:

    (Δx_{k−p+1}, z_{k−p+1}, ..., Δx_k, z_k) → (Δx_{k+1}, z_{k+1}).    (5)

3.2. Basis function networks

Basis function networks are usually depicted by a three-layer structure (Fig. 2). The input layer of sensors has the simple task of forwarding the input patterns, unchanged, to all the processing units, or neurons, of the intermediate layer. Most commonly, the nonlinear transfer functions considered for these units are Gaussian functions, whose locations are determined by centers that can be interpreted as hyperparameters of the model. The output nodes compute linear combinations of the values produced by the basis functions, using vectors of weights.

Therefore, a single-output basis function network corresponds to a continuous mapping f_w: R^p → R, nonlinear in the inputs but linear in the weight parameters, w. More specifically, considering m basis functions {φ_i},

    f_w(x) = w_0 + Σ_{i=1}^{m} w_i φ_i(x),    (6)

where x is a generic input pattern and w_i is the weight of the arc connecting neuron i to the output neuron (w_0 is the bias, or independent term). In multi-output versions of the model, there is a common set of basis functions, but a different set of weights, w, for each output variable.

Fig. 2. A single output basis function network.

Given a set of patterns, or observations, {(x_k, y_k)}, k = 1,...,N, the goal is to identify the number and characteristics of the basis functions, and then to estimate the parameters {w_i}, according to the model y_k = f_w(x_k) + e_k, where the {e_k} are assumed to be white noise. Conditional on fixing the values of the nonlinear parameters (or hyperparameters) of the model, the problem reduces to that of solving, usually in the least-squares (LS) sense, a system of N stochastic linear equations with m+1 unknowns (N >> m): y = Rw + e. The LS solution is then defined by w* = R⁺y, where R⁺ = (RᵀR)⁻¹Rᵀ is the Moore–Penrose pseudo-inverse matrix, assuming RᵀR is invertible.

With patterns built from nonstationary time series, and especially with streaming data, it is more appropriate to recursively estimate the parameters, through the recursive least squares (RLS) algorithm, or more
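The training-pattern structure of Eq. (4) can be built as follows (a sketch; the toy series is made up):

```python
def make_patterns(dx, z, p):
    """Training patterns per Eq. (4): each input is the last p (interval,
    size) pairs followed by the next interval; the target is the next
    demand size, so each input vector has 2p + 1 components."""
    X, t = [], []
    for k in range(p - 1, len(z) - 1):
        inputs = []
        for i in range(k - p + 1, k + 1):
            inputs += [dx[i], z[i]]
        inputs.append(dx[k + 1])   # conditioning on the (expected) next interval
        X.append(inputs)
        t.append(z[k + 1])
    return X, t

X, t = make_patterns([3, 7, 2, 13, 6], [120, 80, 200, 50, 90], p=2)
```

The dual problem of Eq. (5) is obtained by the symmetric construction, with the pair (Δx_{k+1}, z_{k+1}) as a two-dimensional target.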


complex variants of it. Furthermore, a recursive procedure enables the adaptive revision of the model hyperparameters.
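A minimal RLS step with a forgetting factor, in its standard textbook form; the toy regression data and the forgetting value below are illustrative assumptions, not settings from the paper:

```python
import numpy as np

def rls_update(w, P, phi, y, lam=0.99):
    """One recursive least-squares step with forgetting factor lam.
    w   - current weight vector,
    P   - current inverse-covariance matrix,
    phi - regressor (e.g. basis-function outputs, with a leading 1 for the bias),
    y   - observed target.
    Returns the updated (w, P) and the a priori prediction error."""
    phi = np.asarray(phi, dtype=float)
    err = y - w @ phi                        # a priori prediction error
    k = P @ phi / (lam + phi @ P @ phi)      # gain vector
    w = w + k * err
    P = (P - np.outer(k, phi @ P)) / lam     # covariance update with forgetting
    return w, P, err

# Illustrative stream: fit y = 2*x + 1 from noise-free samples.
w = np.zeros(2)
P = np.eye(2) * 1e4                          # large initial P: weak prior on w
for xv in [0.0, 1.0, 2.0, 3.0, 4.0]:
    w, P, _ = rls_update(w, P, [1.0, xv], 2.0 * xv + 1.0)
```

A forgetting factor below 1 discounts old patterns geometrically, which is what makes the estimator track time-varying parameters in nonstationary series.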

3.3. Gaussian basis functions: radial and elliptical

For basis functions {φ_i} that are radially symmetrical relative to the corresponding centers c_i ∈ R^p, the output they produce depends only on the Euclidean distances ||x − c_i||. Most commonly, Gaussian radial basis functions (RBF) are considered:

    φ_i(x) = φ(||x − c_i||) = exp(−0.5 ||x − c_i||² / σ_i²).    (7)

The model is then just a linear mixture of multidimensional Gaussians depending on the parameters {w_i} (weights) and on the location and dispersion hyperparameters {(c_i, σ_i)} (centers and widths, or radii).

Each neuron, represented by a Gaussian RBF, focuses on a particular region of the input space, giving equal importance to the different dimensions in that space, as the dispersion is described simply by a scalar, σ_i. This is somewhat inadequate, since, for relatively complex distributions of the input-output patterns, a large number of basis functions may then be required. Alternatively, one may consider a model based on elliptical basis functions (EBF), with different dispersion coefficients for the different dimensions:

    φ_i(x) = exp(−(1/2) Σ_{k=1}^{p} (x_k − c_{ki})² / σ_{ki}²)
           = exp(−(1/2) (x − c_i)ᵀ S_i⁻¹ (x − c_i)),    (8)

where S_i = diag(σ²_{1i}, σ²_{2i}, ..., σ²_{pi}).

Another, much simpler option is to use normalized radial basis function (NRBF) networks, where the original functions, φ_i(x), are replaced by φ_i(x) / Σ_{j=1,...,m} φ_j(x). Werntges (1993), among other authors, has discussed the advantages of this normalization, while, on the other hand, Shorten and Murray-Smith (1996) have identified some possibly undesirable side effects.

3.4. Model identification issues

As stated earlier, the estimation of the linear parameters in a basis function network can be easily accomplished, either in a direct (en bloc) or in a recursive way. The other relevant methodological decisions include determining:

* the dimension of the input space (with patterns built from a sequence of time series observations, one needs to identify the order of the nonlinear autoregression);
* the model size, i.e. the number of (Gaussian) basis functions;
* the location of centers and the radii values (or, for elliptical basis functions, the matrices S_i).

Most of the heuristics discussed in the literature for neural model identification, such as cross-validation, do not take into proper account the special nature of the prediction problem. Instead of solving an approximation problem for a set of patterns, one should seek to solve a filtering problem for a sequence of patterns, serially correlated and possibly nonstationary. Moreover, one may wish to consider time-varying parameters, adaptive hyperparameters, or even dynamic structures. In all these cases, cross-validatory methods are not appropriate.

Building on the experience gathered in a previous study (Carmo and Rodrigues, 2002), the following heuristic procedure is proposed here for the identification and estimation of Gaussian RBF networks as time series forecasting models:

Phase 1 (preliminary identification):

* the number of inputs (p) and the number of basis functions (m) are optimized; for this purpose, a simple hold-out, or validation, set of patterns is considered, namely those based on the most recent observations;
* the RBF centers are inferred from k-means clustering of the patterns; the clustering is performed in the input-output space, with cluster centers then projected onto the input space;
* the RBF radii are inferred using the k-nearest neighbor heuristic (Moody and Darken, 1989);
* the weights are estimated en bloc (static least-squares estimation).

Phase 2 (recursive estimation):

* the best values (p*, m*) found before are considered (and held fixed);
* the weights are estimated using RLS, starting from the values determined in Phase 1;
* meanwhile, the RBF centers are continuously adapted, using a heuristic proposed by Chen et al. (1992), starting from those determined in Phase 1, and the radii are also recomputed (k-NN);
* the first 20 or so estimation errors are ignored when computing any forecasting performance measures.
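Phase 1 can be sketched as below, assuming a plain Lloyd's k-means and ordinary least squares for the weights; the toy 1-D target and all numeric settings are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, m, iters=50):
    """Plain Lloyd's algorithm, a stand-in for the clustering step."""
    centers = points[rng.choice(len(points), size=m, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers

def fit_rbf(X, y, m=6, neighbours=2):
    """Phase 1 sketch: cluster the (input, output) patterns, project the
    centres onto the input space, set each radius to the mean distance to
    its nearest-neighbour centres (k-NN heuristic), then estimate the
    weights en bloc by least squares (Eq. (7) basis functions)."""
    XY = np.column_stack([X, y])                     # clustering in input-output space
    centers = kmeans(XY, m)[:, :X.shape[1]]          # projection onto the input space
    dists = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
    radii = np.sort(dists, axis=1)[:, 1:neighbours + 1].mean(axis=1)
    radii = np.maximum(radii, 1e-6)                  # guard against duplicate centres
    Phi = np.exp(-0.5 * np.linalg.norm(X[:, None] - centers[None], axis=-1) ** 2 / radii ** 2)
    R = np.column_stack([np.ones(len(X)), Phi])      # design matrix with bias column
    w, *_ = np.linalg.lstsq(R, y, rcond=None)
    return centers, radii, w

def rbf_predict(X, centers, radii, w):
    Phi = np.exp(-0.5 * np.linalg.norm(X[:, None] - centers[None], axis=-1) ** 2 / radii ** 2)
    return np.column_stack([np.ones(len(X)), Phi]) @ w

# Toy 1-D check: approximate a smooth curve with 6 Gaussian units.
X = np.linspace(-2, 2, 40)[:, None]
y = np.sin(X[:, 0])
centers, radii, w = fit_rbf(X, y, m=6)
pred = rbf_predict(X, centers, radii, w)
```

Phase 2 would then switch to the recursive weight estimation sketched earlier (RLS), adapting centers and radii as new patterns arrive.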

Naturally, depending on the problem at hand, it is possible to obtain reasonable results through Phase 1 only, or through Phase 2 only; in the latter case, using simple heuristics for the choice of p and m and for the initialization of the estimation process.


4. Experiments

4.1. The data

The models and methods described above have been tested for short-term prediction (one interval ahead) of 10 real-world time series ("A" to "J"), with from 89 up to 263 observations. For Croston's method and for the neural models, the data are first preprocessed. Taking into account the observed asymmetry of the distribution of demand sizes, as well as of the distribution of time intervals, both are subject to a logarithmic transformation, and then separately standardized to zero mean and unit variance. The latter (linear) transformation seeks to avoid the problem of dealing with different measuring units, on different scales, thus facilitating the clustering procedure.

4.2. The evaluation and optimization criteria

Two types of problem were considered: (Type 1) predicting future total demand for a given horizon; and (Type 2) predicting the time and quantity of the next occurrence of demand. In both cases, the performance measure used for optimization and evaluation of the models and methods was always the root mean squared one-interval-ahead prediction error (RMSE) of the demand rate:

    d_k − d̂_{k|k−1} = z_k / Δx_k − ẑ_{k|k−1} / Δx̂_{k|k−1}.    (9)

For Type 1 problems, this measure is sensible enough to discount the effect of a large forecasting horizon on the demand size forecasting error. However, for Type 2 problems, it implicitly favors accuracy in demand forecasts at the possible expense of larger errors in the time interval forecasts. Alternative approaches, which handle the accuracy problem as a truly bicriteria one, or which handle asymmetrical costs, are currently being investigated.

The accuracy of the tested models was also compared to that of the naïve method that assumes both {z_k} and {Δx_k} to be random walks.

Fig. 3. Series "A" forecasts from an adaptive RBF model: (a) estimates for z_k given x_k known; (b) corresponding estimates for the demand rate.
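The preprocessing of Section 4.1 and the error measure of Eq. (9) can be sketched as:

```python
import numpy as np

def log_standardize(v):
    """Preprocessing used above: log-transform to reduce the asymmetry of
    the distribution, then standardize to zero mean and unit variance so
    that intervals and sizes end up on comparable scales for clustering."""
    lv = np.log(np.asarray(v, dtype=float))
    return (lv - lv.mean()) / lv.std()

def demand_rate_rmse(z, z_hat, dx, dx_hat):
    """Root mean squared one-interval-ahead prediction error of the demand
    rate, Eq. (9): d_k - d_hat_{k|k-1} = z_k/dx_k - z_hat/dx_hat."""
    err = np.asarray(z) / np.asarray(dx) - np.asarray(z_hat) / np.asarray(dx_hat)
    return float(np.sqrt(np.mean(err ** 2)))
```

Note that the intervals and the sizes are standardized separately, each with its own mean and variance, as described in Section 4.1.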


4.3. Prediction of demand sizes: assessing the performance of RBF models

First, we demonstrate the applicability of Gaussian RBF networks with series "A". The two-phase procedure described in Section 3.4 was applied. The validation set included the patterns from the last third of observations. Optimization of the order of the nonlinear autoregression, p, and of the number of basis functions, m, was done through exhaustive search in the sets {1, 2, ..., 7} and {6, 7, ..., 30}, respectively. The best values found were p = 4, m = 18. It should be clear that such optimal values are tradeoff solutions between too simple, underparameterized models and too large ones, vis-à-vis prediction performance rather than fitting performance. Fig. 3 exhibits the very good performance of the adaptive RBF(4,18) model in all parts of the series.

Table 1
Prediction of demand sizes: accuracy from different basis function networks

Series (N)   Model (p,m)   RMSE training   RMSE validation
A (89)       RBF (4,18)     4.58            3.05
             EBF (7,11)     2.53            4.38
             NRBF (6,24)    3.88            2.76
B (212)      RBF (5,23)     4.37            2.37
             EBF (2,21)     4.47            2.92
             NRBF (6,18)    4.79            2.34
C (92)       RBF (7,18)    19.30           17.49
             EBF (6,28)    23.80           22.59
             NRBF (7,18)   19.27           12.71
D (167)      RBF (3,15)    55.96           28.81
             EBF (4,12)    58.48           28.18
             NRBF (3,15)   57.28           29.14
E (119)      RBF (7,15)     8.92            5.64
             EBF (4,25)     5.50            6.50
             NRBF (4,13)    4.53            7.86
F (210)      RBF (5,29)     5.46            4.52
             EBF (7,13)     6.08            5.45
             NRBF (4,14)    6.04            4.82
G (163)      RBF (5,23)     4.64            0.78
             EBF (5,15)     5.22            0.94
             NRBF (4,12)    5.17            0.81
H (210)      RBF (5,28)     4.63            1.43
             EBF (7,30)     4.32            2.34
             NRBF (5,27)    4.74            1.16
I (118)      RBF (2,20)     2.88            0.79
             EBF (7,18)     2.34            1.37
             NRBF (2,8)     2.81            0.71
J (263)      RBF (6,29)    24.18           22.50
             EBF (3,15)    24.25           25.31
             NRBF (4,27)   23.98           24.03
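The exhaustive search over (p, m) described in Section 4.3 is a plain grid search over a hold-out score; the stand-in score function below is hypothetical, used only to exercise the skeleton:

```python
import itertools

def grid_search(score, p_values=range(1, 8), m_values=range(6, 31)):
    """Exhaustive search over the autoregression order p in {1,...,7} and
    the network size m in {6,...,30}, as in Section 4.3. 'score' should
    train a model with (p, m) and return its validation RMSE, computed on
    a hold-out set of the most recent patterns. Returns the best (p, m)
    and the corresponding score."""
    best = min((score(p, m), p, m) for p, m in itertools.product(p_values, m_values))
    return (best[1], best[2]), best[0]

# Illustrative stand-in score with a known minimum at (4, 18).
(p_star, m_star), s = grid_search(lambda p, m: (p - 4) ** 2 + (m - 18) ** 2)
```

In the experiments above, the score would wrap the Phase 1 fit: build patterns of order p, fit an m-unit RBF network on the older two thirds, and return the validation RMSE on the most recent third.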

4.4. Prediction of demand sizes: comparing neural models

Next, RBF and EBF models were tested, with 10 time series, for Type 1 problems. Only Phase 1 of the procedure presented in Section 3.4 was carried out. For the EBF networks, the different variances {σ²_{1i}, ..., σ²_{pi}} were empirically estimated as follows. Denoting by Ω_i the cluster corresponding to the basis function φ_i, S_i was estimated by computing the empirical variances and covariances,

    R_i = (1/|Ω_i|) Σ_{x∈Ω_i} (x − c_i)(x − c_i)ᵀ,    (10)

and then zeroing the off-diagonal elements. Provided the size of the cluster, |Ω_i|, is large enough, the full variance-covariance matrix could be used instead; however, this would require time series rather longer than the available ones.

The results of Table 1 show that, despite its greater flexibility, the EBF model gives worse performance, in general, than the RBF model. As expected, it often requires fewer parameters (m) for a similar performance. One might try to optimize all the variance hyperparameters, but that is rather cumbersome. Lacking enough data to compute robust variance estimates, the extra flexibility of elliptical basis functions becomes counterproductive. Moreover, from examination of the distribution of the data {(Δx_k, z_k)}, in virtually no case was it possible to detect clearly defined clusters; and that is even less likely to happen when the patterns are defined in higher dimensions (p > 1).

Given reasonably large data sets, more complex methods could be used to support the identification of EBF networks. These include, for instance, RPCL (Rival Penalized Competitive Learning; Xu et al., 1992) and its extensions, designed to facilitate the handling of clusters with more complex forms and distributions.
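The dispersion estimate of Eq. (10), with off-diagonal elements zeroed, can be sketched as follows (the sample points and labels are made up; in practice a floor on the variances would avoid degenerate zero radii):

```python
import numpy as np

def ebf_dispersions(patterns, labels, m):
    """Empirical per-cluster covariances of Eq. (10), with the off-diagonal
    elements zeroed, giving the diagonal dispersion matrices S_i used by
    the elliptical basis functions of Eq. (8)."""
    S = []
    for i in range(m):
        cluster = patterns[labels == i]
        c = cluster.mean(axis=0)                             # cluster centre c_i
        R = (cluster - c).T @ (cluster - c) / len(cluster)   # Eq. (10)
        S.append(np.diag(np.diag(R)))                        # keep the variances only
    return S

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 4.0], [1.0, 6.0]])
labels = np.array([0, 0, 1, 1])
S = ebf_dispersions(pts, labels, 2)
```

With enough data per cluster, the full matrix R_i could be kept instead of its diagonal, as noted above.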

From Table 1, one can also note that normalized Gaussian RBF networks lead, in general, to results similar to those of nonnormalized networks, although in some cases requiring fewer units.

4.5. Comparing adaptive RBF models with exponential smoothing methods

The described exponential smoothing methods leave little room for optimization, namely only one or two smoothing hyperparameters. Their performance was therefore compared with RBF models that were not fully optimized. Specifically, we set the default values (p, m) = (3, 15) and proceeded with the adaptive estimation procedure (Phase 2).

Table 2
Prediction of demand sizes: accuracy from different approaches

Series   Naive   Wright   RBF (3,15)
A        13.62   13.36     4.12
B         8.23    6.91     4.67
C        36.37   35.31    33.94
D        59.71   46.92    36.74
E        13.48   15.94    11.40
F        10.14    8.59     6.29
G         5.28    6.03     2.99
H         6.92    6.24     4.08
I         3.81    4.04     2.74
J        32.20   30.83    25.99

Table 3
Prediction of time intervals and demand sizes: accuracy from different approaches

Series   Naive   Croston   RBF (3,15)   RBF (p,m)
A        18.86    13.60      13.33         9.67
B         9.72     7.28       7.33         6.03
C        53.25    37.48      36.65        32.73
D        65.61    47.77      42.56        39.97
E        22.64    16.23      16.16        13.64
F        12.14     8.93       9.16         8.77
G         8.60     6.13       4.68         4.44
H         8.98     6.51       6.55         5.33
I         5.79     4.04       4.11         3.41
J        45.56    32.18      31.26        31.07

For Type 1 problems, RBF models were compared with Wright's method (optimized), while for Type 2 problems they were compared with Croston's method (also optimized). The results are shown in Tables 2 and 3. Most notably, Wright's method, even with the correction of Eq. (2), suffers from the high degree of irregularity of these time series, and would probably perform better if optimized with respect to prediction accuracy over longer horizons. The adaptive RBF model does not improve significantly upon Croston's method unless p and m are optimized. Naturally, this apparent evidence of the improved performance of adaptive RBF models might be reinforced or attenuated for other optimization criteria or for time series with much different characteristics.

5. Conclusions

Gaussian basis function networks were shown to be adequate models for short-term prediction of irregularly spaced time series, with the ability to deliver better predictive performance than alternative models by taking into account nonlinear correlations in the data. In particular, from the preliminary evidence gathered, they seem to compare favorably with Croston's and Wright's exponential smoothing methods.

The applicability and usefulness of elliptical basis function models were put into question, and radial basis functions seem more appropriate, in general, provided the patterns are conveniently preprocessed.

The performance of RBF models is yet to be assessed in terms of other performance measures, for longer prediction horizons, or for time series with low-frequency periodicities. But they show promise that, if adequately designed, identified, and estimated, they might be equally satisfactory for more intricate irregular demand forecasting problems.

References

Carmo, J.L., 1997. Redes de Funções de Base Radiais: Identificação e Aplicação na Previsão de Séries Irregularmente Espaçadas. M.Sc. Thesis, Universidade de Lisboa (in Portuguese).
Carmo, J.L., Rodrigues, A.J., 2002. Identificação de redes neuronais gaussianas como modelos de previsão. Investigação Operacional 22 (1), 43–57 (in Portuguese).
Chen, S., Billings, S.A., Grant, P.M., 1992. Recursive hybrid algorithm for non-linear system identification using radial basis function networks. International Journal of Control 55, 1051–1070.
Cipra, T., Trujillo, J., Rubio, A., 1995. Holt–Winters method with missing observations. Management Science 41 (1), 174–178.
Croston, J.D., 1972. Forecasting and stock control for intermittent demands. Operational Research Quarterly 23 (3), 289–303.
Moody, J., Darken, C.J., 1989. Fast learning in networks of locally-tuned processing units. Neural Computation 1, 281–294.
Rodrigues, A.J., 1996. Dynamic regression and supervised learning methods in time series modelling and forecasting. Ph.D. Thesis, Lancaster University, UK.
Shorten, R., Murray-Smith, R., 1996. Side effects of normalising radial basis function networks. International Journal of Neural Systems 7 (2), 167–179.
Syntetos, A.A., Boylan, J.E., 2001. On the bias of intermittent demand estimates. International Journal of Production Economics 71, 457–466.
Werntges, H.W., 1993. Partitions of unity improve neural function approximators. In: Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, Vol. 2. IEEE Neural Networks Council, pp. 914–918.
Wright, D.J., 1985. Extensions of forecasting methods for irregularly spaced data. In: Anderson, O.D. (Ed.), Time Series Analysis: Theory and Practice, Vol. 7. Elsevier, Amsterdam, pp. 169–181.
Xu, L., Krzyzak, A., Oja, E., 1992. Unsupervised and supervised classification by rival penalized competitive learning. In: Proceedings of the 11th International Conference on Pattern Recognition, The Hague, The Netherlands. IEEE Computer Society Press, Los Alamitos, CA, pp. 492–496.