Sales forecasting using longitudinal data models

International Journal of Forecasting 20 (2004) 99-114
www.elsevier.com/locate/ijforecast

Sales forecasting using longitudinal data models Edward W. Frees *, Thomas W. Miller University of Wisconsin—Madison, School of Business, 975 University Avenue, Madison, WI 53706, USA

Abstract

This paper shows how to forecast using a class of linear mixed longitudinal, or panel, data models. Forecasts are derived as special cases of best linear unbiased predictors, also known as BLUPs, and hence are optimal predictors of future realizations of the response. We show that the BLUP forecast arises from three components: (1) a predictor based on the conditional mean of the response, (2) a component due to time-varying coefficients, and (3) a serial correlation correction term. The forecasting techniques are applicable in a wide variety of settings. This article discusses forecasting in the context of marketing and sales. In particular, we consider a data set of the Wisconsin State Lottery, in which 40 weeks of sales are available for each of 50 postal codes. Using sales data as well as economic and demographic characteristics of each postal code, we forecast sales for each postal code. © 2003 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Panel data models; Unobserved effects; Random coefficients; Heterogeneity

* Corresponding author. E-mail addresses: [email protected] (E.W. Frees), [email protected] (T.W. Miller).
0169-2070/$ - see front matter © 2003 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved. doi:10.1016/S0169-2070(03)00005-0

1. Introduction

Forecasting is an integral part of a marketing manager's role. Sales forecasts are important for understanding market share and the competition, future production needs, and the determinants of sales, including promotions, pricing, advertising and distribution. Market researchers and business analysts are often faced with the task of predicting sales. One approach to prediction is to use cross-sectional data, working with one point in time or aggregating over several time periods. We search for variables that relate to sales and use those variables as explanatory variables in our models. Another approach is to work with time series data, aggregating across sales territories or accounts. We use past and current sales as a predictor of future sales as we search for explanatory variables that relate to sales.

Cross-sectional and simple time series approaches do not make full use of the data available to sales and marketing managers. Typical sales data have a hierarchical structure. They are longitudinal or panel data, having both cross-sectional and time series characteristics. They are cross-sectional because they include observations from many cases, sales across stores, territories or accounts; we say that these data are differentiated across 'space'. They are time series data because they represent many points in time. Longitudinal data methods are appropriate for these types of data. Longitudinal data methods have been widely developed for understanding relationships among variables in the social and biological sciences, including marketing research; see, for example, Ailawadi and Neslin (1998) and Erdem (1996). But there is relatively little literature available for forecasting using longitudinal data methods. Some important exceptions include Battese, Harter, and Fuller (1988) and Baltagi and Li (1992).

By using information in both the cross section (space) and time, we are able to provide forecasts that are superior to traditional forecasts that use only one dimension. We can forecast at the subject/micro level, providing managers with additional information for making both strategic and tactical decisions, including decisions about sizing and capacity planning for manufacturing plants, pricing, marketing promotions, advertising, sales organization and sales processes.

The longitudinal data mixed model is introduced in Section 2. Appendix A shows that the longitudinal data mixed model can be represented as a special case of the mixed linear model. Thus, there is a large literature on estimation of the regression parameters ($\beta$) as well as variance components; see, for example, Searle, Casella, and McCulloch (1992) or Verbeke and Molenberghs (2000). In the data analysis Section 4, we find it convenient to use the SAS procedure for mixed linear models (PROC MIXED) when estimating longitudinal data mixed models; see Littell, Milliken, Stroup, and Wolfinger (1996) for an introduction from this perspective. Similar procedures are available for S-PLUS (Pinheiro and Bates, 2000).

Section 3 develops longitudinal data mixed model forecasts using best linear unbiased predictors (BLUPs). These predictors were introduced by Goldberger (1962) and developed by Harville (1976) in the context of the mixed linear model. One goal of this paper is to show how this type of predictor can be used as an optimal forecast for longitudinal data mixed models. This section is more technical, and many readers may wish to go directly to the Section 4 case study. Specifically, Section 4 describes the case study motivating the theoretical modeling work, forecasting Wisconsin lottery sales.
Here, we consider a data set that contains forty weeks of lottery sales from a random sample of 50 Wisconsin postal codes. Section 4 shows how to specify an appropriate longitudinal data mixed model and forecast using the specified model. Section 5 closes with some concluding remarks.

2. Longitudinal data mixed model

Longitudinal data models are regression models in which repeated observations of subjects, such as stores, are available. Using longitudinal data models, we can provide detailed representations of characteristics that are unique to each subject, thus accounting for the classical misspecification problem of heterogeneity. Furthermore, the repeated observations over time allow us to consider flexible models of the evolution of responses, such as sales, known as the dynamic structure of a model. This article introduces forecasting for a broad class of dynamic longitudinal data models that we call the longitudinal data mixed model. As an example of this class of models, consider the basic two-way model

$$y_{it} = \alpha_i + \lambda_t + x_{it}'\beta + \varepsilon_{it}, \quad i = 1, \ldots, n, \; t = 1, \ldots, T. \quad (2.1)$$
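To make the two-way model in Eq. (2.1) concrete, a minimal simulation sketch follows; it is not part of the paper, and the sample sizes and variance values below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, K = 50, 40, 2           # subjects, time periods, covariates (arbitrary)
beta = np.array([0.5, -0.2])  # common slope parameters (arbitrary values)

alpha = rng.normal(0.0, 0.7, size=n)      # subject-specific effects alpha_i
lam = rng.normal(0.0, 0.3, size=T)        # time-specific effects lambda_t
x = rng.normal(size=(n, T, K))            # covariates x_it
eps = rng.normal(0.0, 0.5, size=(n, T))   # disturbances eps_it

# y_it = alpha_i + lambda_t + x_it' beta + eps_it
y = alpha[:, None] + lam[None, :] + x @ beta + eps
print(y.shape)  # (50, 40): one balanced series per subject
```

Each row of `y` is one subject's balanced time series; the shared draw `lam` induces the cross-subject dependence that the time-specific term is meant to capture.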

Baltagi (1988) and Koning (1989) developed forecasts for this model. Here, $y_{it}$ denotes the response (sales) for the ith subject, such as a store, during the tth time period. This is a model of balanced data in that we assume that the same number, T, of observations is available for each of the n stores. The quantity $\beta = (\beta_1, \ldots, \beta_K)'$ is a $K \times 1$ vector of parameters that is common to all subjects, and $x_{it} = (x_{it,1}, x_{it,2}, \ldots, x_{it,K})'$ is the corresponding vector of covariates. The term $\alpha_i$ is specific to subject i yet is common to all time periods. This variable may account for unique, yet unobserved, characteristics of each subject. The term $\lambda_t$ is specific to the time period t yet is common to all stores. This variable may account for common, yet unobserved, events that affect sales. Both terms $\alpha_i$ and $\lambda_t$ are random variables, and hence the model in Eq. (2.1) is also known as the two-way error components model.

The longitudinal data mixed model is considerably more complex than the model in Eq. (2.1) because it has the ability to capture many additional features of the data that may be of interest to an analyst. We focus on three aspects:

1. The longitudinal data mixed model does not require balanced data. To illustrate, it is possible to allow new subjects to enter the data by

allowing the first observation to be after time t = 1. Similarly, the last observation may be prior to time t = T, allowing for early departure. Using an underlying continuous stochastic process for the disturbances, the model allows for unequally spaced (in time) observations as well as missing data.
2. The longitudinal data mixed model allows for covariates associated with vector error components. This allows one to handle broad classes of mixed models, such as random coefficient models.
3. The longitudinal data mixed model allows for specification of dynamic aspects in two fashions, through the error terms $\{\varepsilon_{it}\}$ and through the specification of $\{\lambda_t\}$ as a stochastic process.

To illustrate the third point, in the traditional longitudinal data mixed model, such as introduced by Laird and Ware (1982), the dynamics are specified through the correlation structure of subject-specific errors. For example, it is common to consider an autoregressive of order p (AR(p)) model for the disturbances $\{\varepsilon_{it}\}$ of the form

$$\varepsilon_{i,t} = \phi_1 \varepsilon_{i,t-1} + \phi_2 \varepsilon_{i,t-2} + \cdots + \phi_p \varepsilon_{i,t-p} + \zeta_{i,t}, \quad (2.2)$$

where $\{\zeta_{i,t}\}$ are initially assumed to be identically and independently distributed, mean zero, random variables. Alternative structures are easily accommodated; see Section 3 for further discussion. Alternatively, we may model the dynamics using a stochastic process for $\{\lambda_t\}$. For comparison, note that Eq. (2.2) is a model of serial relationships at the subject level, whereas a dynamic model of $\lambda_t$ is one that is common to all subjects. To illustrate the latter specification, Section 3 considers a random walk model for the common, time-specific components.

Beginning with the basic two-way model in Eq. (2.1), more generally, we use

$$z_{\alpha,i,t,1}\alpha_{i,1} + \cdots + z_{\alpha,i,t,q}\alpha_{i,q} = z_{\alpha,i,t}'\alpha_i \quad (2.3)$$

and

$$z_{\lambda,i,t,1}\lambda_{t,1} + \cdots + z_{\lambda,i,t,r}\lambda_{t,r} = z_{\lambda,i,t}'\lambda_t. \quad (2.4)$$

With these terms, we define the longitudinal data mixed model as

$$y_{it} = z_{\alpha,i,t}'\alpha_i + z_{\lambda,i,t}'\lambda_t + x_{it}'\beta + \varepsilon_{it}, \quad t = 1, \ldots, T_i, \; i = 1, \ldots, n. \quad (2.5)$$

Here, $\alpha_i = (\alpha_{i,1}, \ldots, \alpha_{i,q})'$ is a $q \times 1$ vector of subject-specific terms and $z_{\alpha,i,t} = (z_{\alpha,i,t,1}, \ldots, z_{\alpha,i,t,q})'$ is the corresponding vector of covariates. Similarly, $\lambda_t = (\lambda_{t,1}, \ldots, \lambda_{t,r})'$ is an $r \times 1$ vector of time-specific terms and $z_{\lambda,i,t} = (z_{\lambda,i,t,1}, \ldots, z_{\lambda,i,t,r})'$ is the corresponding vector of covariates. We use the notation $t = 1, \ldots, T_i$ to indicate the unbalanced nature of the data. Without the time-specific terms, this model was introduced by Laird and Ware (1982) and is widely used in the biological sciences (Diggle, Liang, & Zeger, 1994). We have added the time-specific terms to provide another mechanism for handling temporal, or dynamic, patterns. We allow the time-specific term to be a vector for symmetry with the subject-specific terms and to handle some special cases described in Section 3, where we give more details of the assumptions on these quantities.

3. Longitudinal data mixed model and forecasting

3.1. The longitudinal data mixed model

A more compact form of Eq. (2.5) can be given by stacking over t. This yields a matrix form of the longitudinal data mixed model

$$y_i = Z_{\alpha,i}\alpha_i + Z_{\lambda,i}\lambda + X_i\beta + \varepsilon_i, \quad i = 1, \ldots, n. \quad (3.1)$$

This expression uses vectors of responses, $y_i = (y_{i1}, y_{i2}, \ldots, y_{iT_i})'$, and of disturbances, $\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT_i})'$. Similarly, the matrices of covariates are $X_i = (x_{i1}, x_{i2}, \ldots, x_{iT_i})'$, of dimension $T_i \times K$, and $Z_{\alpha,i} = (z_{\alpha,i,1}, z_{\alpha,i,2}, \ldots, z_{\alpha,i,T_i})'$, of dimension $T_i \times q$, and

$$Z_{\lambda,i} = \left( \begin{pmatrix} z_{\lambda,i,1}' & 0 & \cdots & 0 \\ 0 & z_{\lambda,i,2}' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_{\lambda,i,T_i}' \end{pmatrix} : 0_i \right)$$

of dimension $T_i \times rT$, where $0_i$ is a $T_i \times r(T - T_i)$ zero matrix. Finally, $\lambda = (\lambda_1, \ldots, \lambda_T)'$ is the $rT \times 1$ vector of time-specific coefficients. We assume that the sources of variability, $\varepsilon_i$, $\alpha_i$ and $\lambda$, are mutually independent and mean zero. The non-zero means are accounted for in the $\beta$ parameters. The disturbances are independent between subjects, yet we allow for serial correlation and heteroscedasticity through the notation $\mathrm{Var}\,\varepsilon_i = R_i$. Further, we assume that the subject-specific effects $\{\alpha_i\}$ are random with variance-covariance matrix $D$, a $q \times q$ positive definite matrix. Time-specific effects $\lambda$ have variance-covariance matrix $\Sigma_\lambda$, an $rT \times rT$ positive definite matrix. With this notation, we may express the variance of each subject as $\mathrm{Var}\,y_i = V_{\alpha,i} + Z_{\lambda,i}\Sigma_\lambda Z_{\lambda,i}'$, where

$$V_{\alpha,i} = Z_{\alpha,i} D Z_{\alpha,i}' + R_i. \quad (3.2)$$

3.2. Forecasting for the longitudinal data mixed model

For forecasting, we wish to predict

$$y_{i,T_i+L} = z_{\alpha,i,T_i+L}'\alpha_i + z_{\lambda,i,T_i+L}'\lambda_{T_i+L} + x_{i,T_i+L}'\beta + \varepsilon_{i,T_i+L} \quad (3.3)$$

for L lead time units in the future. We use results for best linear unbiased prediction (BLUP) for the mixed linear model; see Robinson (1991) or Frees, Young, and Luo (1999, 2001) for recent reviews. A BLUP is the best linear combination of responses that is unbiased and has the smallest mean square error over the class of all linear, unbiased predictors. When using available data to approximate random variables, such as $y_{i,T_i+L}$, we use the term 'prediction' in lieu of 'estimation.' To calculate these predictors, we use the sum of squares $\Sigma_{ZZ} = \sum_{i=1}^{n} Z_{\lambda,i}' V_{\alpha,i}^{-1} Z_{\lambda,i}$. We summarize the results in the following proposition. The details of the derivation are in Appendix A.

Proposition. Consider the longitudinal data mixed model described in Section 3.1. Then, the best linear unbiased predictor of $\lambda$ is

$$\lambda_{BLUP} = \left(\Sigma_{ZZ} + \Sigma_\lambda^{-1}\right)^{-1} \sum_{i=1}^{n} Z_{\lambda,i}' V_{\alpha,i}^{-1} e_{i,GLS} \quad (3.4)$$

with residuals $e_{i,GLS} = y_i - X_i b_{GLS}$, where $b_{GLS}$ is the generalized least squares estimator of $\beta$. The best linear unbiased predictors for $\varepsilon_i$ and $\alpha_i$ are

$$e_{i,BLUP} = R_i V_{\alpha,i}^{-1}\left(e_{i,GLS} - Z_{\lambda,i}\lambda_{BLUP}\right) \quad (3.5)$$

and

$$\alpha_{i,BLUP} = D Z_{\alpha,i}' R_i^{-1} e_{i,BLUP}. \quad (3.6)$$

Further, the best linear unbiased predictor of $y_{i,T_i+L}$ is

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP} + z_{\lambda,i,T_i+L}' \mathrm{Cov}(\lambda_{T_i+L}, \lambda)' \Sigma_\lambda^{-1} \lambda_{BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' R_i^{-1} e_{i,BLUP}. \quad (3.7)$$

Remarks. We may interpret the BLUP forecast as arising from three components. The first component, $x_{i,T_i+L}' b_{GLS} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP}$, is due to the conditional mean. The second component, $z_{\lambda,i,T_i+L}' \mathrm{Cov}(\lambda_{T_i+L}, \lambda)' \Sigma_\lambda^{-1} \lambda_{BLUP}$, is due to time-varying coefficients. The third component, $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' R_i^{-1} e_{i,BLUP}$, is a serial correlation 'correction term,' analogous to a result due to Goldberger (1962); see Example 1.2 below. An expression for the variance of the forecast error, $\mathrm{Var}(\hat{y}_{i,T_i+L} - y_{i,T_i+L})$, is available from the authors.

3.3. Forecasting for special cases of the longitudinal data mixed model

The Proposition provides sufficient structure to calculate forecasts for a wide variety of models. Still, it is instructive to interpret the BLUP forecast in a number of special cases. We first consider the case of independent and identically distributed time-specific components $\{\lambda_t\}$.

Example 1. (Random time-specific components) We consider the special case where $\{\lambda_t\}$ are i.i.d. and assume that $T_i + L > T$. Thus, from Eq. (3.7), the BLUP forecast of $y_{i,T_i+L}$ is

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' R_i^{-1} e_{i,BLUP}. \quad (3.8)$$

Suppose further that the disturbance terms are serially uncorrelated, so that $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i) = 0$. As an immediate consequence of Eq. (3.8), we have

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP}.$$

We note that even when $\{\lambda_t\}$ are i.i.d., the time-specific components appear in $\alpha_{i,BLUP}$. Thus, the presence of $\lambda_t$ changes the forecasts.

Example 1.1. (No time-specific components) We now consider the case of no time-specific component $\lambda_t$. Here, using Eq. (3.8), the BLUP forecast of $y_{i,T_i+L}$ is

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' V_{\alpha,i}^{-1}(y_i - X_i b_{GLS}),$$

where, from Eqs. (3.5) and (3.6), $\alpha_{i,BLUP} = D Z_{\alpha,i}' V_{\alpha,i}^{-1}(y_i - X_i b_{GLS})$. To help further interpret this case, consider:

Example 1.2. (AR(1) serial correlation) An interesting special case that provides a great deal of intuition is the case where we assume autoregressive of order 1 (AR(1)), serially correlated errors. For example, Baltagi and Li (1991, 1992) considered this serially correlated structure in the error components model ($q = 1$) in the balanced data case. More generally, from Eq. (3.7), it can be checked that

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP} + \rho^L e_{iT_i,BLUP}.$$

Thus, the L step forecast equals the conditional mean, with a correction factor of $\rho^L$ times the most recent BLUP residual. This result was originally given by Goldberger (1962) in the context of ordinary regression without random effects (that is, assuming $D = 0$).

Example 1.3. (Time-varying coefficients) Suppose that the model is $y_{it} = x_{it}'\beta_t + \varepsilon_{it}$, where $\{\beta_t\}$ are i.i.d. We can re-write this as $y_{it} = z_{\lambda,i,t}'\lambda_t + x_{it}'\beta + \varepsilon_{it}$, where $\mathrm{E}\,\beta_t = \beta$, $\lambda_t = \beta_t - \beta$ and $z_{\lambda,i,t} = x_{i,t}$. With this notation and Eq. (3.8), the forecast of $y_{i,T_i+L}$ is $\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS}$.

Example 1.4. (Two-way error components model) Consider the basic two-way model given in Eq. (2.1). As in Example 1.2, we have that $q = r = 1$, $D = \sigma_\alpha^2$ and $z_{\alpha,i,T_i+L} = 1$. Thus, from Eq. (3.8), we have that the BLUP forecast of $y_{i,T_i+L}$ is $\hat{y}_{i,T_i+L} = \alpha_{i,BLUP} + x_{i,T_i+L}' b_{GLS}$.

For additional interpretation, we assume balanced data, so that $T_i = T$, as in Baltagi (1988) and Koning (1989); see also Baltagi (1995, p. 38). To ease notation, define $\zeta = T\sigma_\alpha^2 / (\sigma^2 + T\sigma_\alpha^2)$. Then, it can be shown that

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + \zeta(\bar{y}_i - \bar{x}_i' b_{GLS}) - \frac{n(1-\zeta)\sigma_\lambda^2}{\sigma^2 + n(1-\zeta)\sigma_\lambda^2}(\bar{y} - \bar{x}' b_{GLS}).$$

Example 2. (Random walk model) Through minor modifications, other temporal patterns of common, yet unobserved, components can be easily included. For this example, we assume that $r = 1$ and $\{\lambda_t\}$ are i.i.d., so that the partial sum process $\{\lambda_1 + \lambda_2 + \cdots + \lambda_t\}$ is a random walk process. Thus, the model is

$$y_{it} = z_{\alpha,i,t}'\alpha_i + \lambda_1 + \lambda_2 + \cdots + \lambda_t + x_{it}'\beta + \varepsilon_{it}, \quad t = 1, \ldots, T_i, \; i = 1, \ldots, n.$$

Stacking over t, this can be expressed in matrix form as Eq. (3.1), where the $T_i \times T$ matrix $Z_{\lambda,i}$ is a lower triangular matrix of 1's for the first $T_i$ rows, and zero elsewhere. That is,

$$Z_{\lambda,i} = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 & 0 & \cdots & 0 \end{pmatrix}.$$

Then, it can be shown that

$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + \sum_{t=1}^{T_i} \lambda_{t,BLUP} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' R_i^{-1} e_{i,BLUP}.$$
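The AR(1) special case of Example 1.2 can be sketched numerically. The following is an illustrative implementation for a single subject under a one-way error components model ($q = 1$, no time-specific terms) with AR(1) disturbances; the function name and the treatment of the variance components as known are our own simplifying assumptions, not the paper's estimation procedure.

```python
import numpy as np

def blup_forecast(y_i, X_i, x_new, b_gls, sig2_a, sig2_e, rho, L):
    """L-step BLUP forecast for one subject under a one-way error
    components model with AR(1) disturbances (Example 1.2 style).
    Variance components (sig2_a, sig2_e, rho) are treated as known."""
    T = len(y_i)
    idx = np.arange(T)
    # R_i: AR(1) covariance matrix of the disturbances
    R = (sig2_e / (1.0 - rho**2)) * rho**np.abs(idx[:, None] - idx[None, :])
    V = sig2_a * np.ones((T, T)) + R          # V_{alpha,i} with q = 1, z = 1
    e_gls = y_i - X_i @ b_gls                 # GLS residuals
    Vinv_e = np.linalg.solve(V, e_gls)
    a_blup = sig2_a * np.ones(T) @ Vinv_e     # alpha_{i,BLUP}
    e_blup = R @ Vinv_e                       # e_{i,BLUP}
    # conditional mean plus rho^L times the most recent BLUP residual
    return x_new @ b_gls + a_blup + rho**L * e_blup[-1]
```

With `rho = 0` and `sig2_a = 0`, the forecast collapses to the regression mean `x_new @ b_gls`, matching the Goldberger (1962) interpretation of the correction term.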


4. Case study: Forecasting Wisconsin lottery sales

In this section, we forecast the sale of state lottery tickets from 50 postal (ZIP) codes in Wisconsin. Lottery sales are an important component of state revenues. Accurate forecasting helps in the budget planning process. Further, a model is useful in assessing the important determinants of lottery sales. Understanding the determinants of lottery sales is useful for improving the design of the lottery sales system and making decisions about the number of retail sales licenses to grant within postal (ZIP) codes.

4.1. Sources and characteristics of data

State of Wisconsin lottery administrators provided weekly lottery sales data. We consider online lottery tickets that are sold by selected retail establishments in Wisconsin. These tickets are generally priced at $1.00, so the number of tickets sold equals the lottery revenue. We analyze lottery sales (ZOLSALES) over a 40-week period, April 1998 through January 1999, from 50 ZIP codes randomly selected from more than 700 ZIP codes within the state of Wisconsin. We also consider the number of retailers within a ZIP code for each time period (NRETAIL).

A budding literature, such as Ashley, Liu, and Chang (1999), suggests variables that influence lottery sales. In developing models for lottery sales, we can draw upon this literature and anecdotal evidence concerning the determinants of sales volume. Higher lottery jackpots lead to higher sales. Ticket sales should be higher in areas with higher population. Ticket sales should be higher in areas better served by online ticket retailers; i.e., higher numbers of retailers should lead to higher sales. Lower income, less educated people may buy more lottery tickets per capita than higher income, more educated people. Senior citizens may buy more lottery tickets per person than people in other age groups. The thinking here is that seniors have more free time to engage in recreational and gaming activities. Table 1 lists the economic and demographic characteristics that we consider in this analysis.

Much of the empirical literature on lotteries is based on annual data that examine the state as the unit of analysis. In contrast, we examine much finer economic units, the ZIP code level, and weekly lottery sales. The economic and demographic characteristics were abstracted from the 1990 and 1995 United States census, as organized and distributed by the Direct Marketing Education Foundation. These variables summarize characteristics of individuals within ZIP codes at a single point in time and thus are not time-varying.

Table 2 summarizes the economic and demographic characteristics of 50 Wisconsin ZIP codes. To illustrate, for the population variable (POPULAT), we see that the smallest ZIP code contained 280 people whereas the largest contained 39 098. The average, over 50 ZIP codes, was 9311.04. Table 2 also summarizes average online sales and average number of retailers. Here, these are averages over 40 weeks. To illustrate, we see that the 40-week average

Table 1
Lottery, economic and demographic characteristics of 50 Wisconsin ZIP codes

Lottery characteristics
  ZOLSALES: Online lottery sales to individual consumers
  NRETAIL: Number of listed retailers

Economic and demographic characteristics
  PERPERHH: Persons per household times 10
  MEDSCHYR: Median years of schooling times 10
  OOMEDHVL: Median home value in $100s for owner-occupied homes
  PRCRENT: Percent of housing that is renter occupied
  PRC55P: Percent of population that is 55 or older
  HHMEDAGE: Household median age
  CEMI: Estimated median household income, in $100s
  POPULAT: Population


Table 2
Summary statistics of lottery, economic and demographic characteristics of 50 Wisconsin ZIP codes

Variable           Mean      Median    Standard deviation   Minimum   Maximum
Average ZOLSALES   6494.83   2426.41   8103.01              189       33 181
Average NRETAIL    11.94     6.36      13.29                1         68.625
PERPERHH           27.06     27        2.09                 22        32
MEDSCHYR           126.96    126       5.51                 122       159
OOMEDHVL           570.92    539       183.73               345       1200
PRCRENT            24.68     24        9.34                 6         62
PRC55P             39.70     40        7.51                 25        56
HHMEDAGE           48.76     48        4.14                 41        59
CEMI               451.22    431       97.84                279       707
POPULAT            9311.04   4405.5    11 098               280       39 098

of online sales was as low as $189 and as high as $33 181.

It is possible to examine cross-sectional relationships between sales and economic/demographic characteristics. For example, Fig. 1 shows a positive relationship between average online sales and population. Further, the ZIP code corresponding to the city of Kenosha, Wisconsin has unusually large average sales for its population size. However, cross-sectional relationships alone, such as correlations and plots

similar to Fig. 1, do not show the dynamic patterns of sales. Fig. 2 is a multiple time series plot of logarithmic (weekly) sales over time. Here, each line traces the sales patterns for a ZIP code. This figure shows the increase in sales for most ZIP codes at approximately weeks 8 and 18. For both time points, the jackpot prize of one online game, PowerBall, grew to an amount in excess of $100 million. Interest in lotteries, and sales, increases dramatically when jackpot prizes

Fig. 1. Scatter plot of average lottery sales versus population size. Sales for Kenosha are unusually large for its population size.


Fig. 2. Multiple time series plot of logarithmic (base 10) lottery sales.

reach large amounts. Moreover, Fig. 2 suggests a dynamic pattern that is common to all ZIP codes. Specifically, logarithmic sales for each ZIP code are relatively stable with the same approximate level of variability. Another form of the response variable to consider is the proportional, or percentage, change. Specifically, define the percentage change to be

$$\mathrm{pchange}_{it} = 100\left(\frac{\mathrm{sales}_{it}}{\mathrm{sales}_{i,t-1}} - 1\right). \quad (4.1)$$

A multiple time series plot of the percentage changes, not displayed here, shows autocorrelated serial patterns. We consider models of this transform of the series in the following subsection on model selection.

4.2. In-sample model specification

This subsection considers the specification of a model, a necessary component prior to forecasting. We decompose model specification criteria into two components, in-sample and out-of-sample criteria. To this end, we partition the data into two subsamples:

we use the first 35 weeks to develop alternative models and estimate parameters, and we use the last 5 weeks to 'predict' our held-out sample. The choice of 5 weeks for the out-of-sample validation is somewhat arbitrary; it was made with the rationale that lottery officials consider it reasonable to try to predict 5 weeks of sales based on 35 weeks of historical sales data.

Our first forecasting model is an ordinary regression model,

$$y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it},$$

where the intercept is common to all subjects; this is also known as a pooled cross-sectional model. The model fits the data well; the coefficient of determination turns out to be $R^2 = 69.6\%$. The estimated regression coefficients appear in Table 3. From the corresponding t-statistics, we see that each variable is statistically significant.

Our second forecasting model is an error components model,

$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it},$$
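The paper estimates these models with SAS PROC MIXED. As a rough sketch of what an equivalent random-intercept fit looks like in Python's statsmodels, consider the following; the data frame below is simulated stand-in data (the ZIP-code covariates themselves are not reproduced here), and the column names are our own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in panel: 50 'ZIP codes' observed for 35 weeks each.
rng = np.random.default_rng(1)
n, T = 50, 35
df = pd.DataFrame({
    "zip": np.repeat(np.arange(n), T),
    "x": rng.normal(size=n * T),
})
alpha = rng.normal(0.0, 0.7, size=n)         # subject-specific intercepts
df["log_sales"] = (10.0 + alpha[df["zip"].to_numpy()]
                   + 0.5 * df["x"]
                   + rng.normal(0.0, 0.5, size=n * T))

# Error components model y_it = alpha_i + x_it' beta + eps_it:
# a linear mixed model with a random intercept for each ZIP code.
fit = smf.mixedlm("log_sales ~ x", df, groups=df["zip"]).fit()
print(fit.params["x"])  # fixed-effect slope, near the simulated 0.5
```

Unlike the pooled cross-sectional fit, the mixed-model output also reports the estimated between-subject variance, the analogue of $\sigma_\alpha^2$ in Table 3.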


Table 3
Lottery model coefficient estimates

                 Pooled cross-sectional    Error components         Error components model
                 model                     model                    with AR(1) term
Variable         Estimate   t-statistic    Estimate   t-statistic   Estimate   t-statistic
Intercept        13.821     10.32          18.096     2.47          15.255     2.18
PERPERHH         0.108      6.77           0.129      1.45          0.115      1.36
MEDSCHYR         0.082      11.90          0.108      2.87          0.091      2.53
OOMEDHVL         0.001      5.19           0.001      0.50          0.001      0.81
PRCRENT          0.032      8.51           0.026      1.27          0.030      1.53
PRC55P           0.070      5.19           0.073      0.98          0.071      1.01
HHMEDAGE         0.118      5.64           0.119      1.02          0.120      1.09
CEMI             0.004      8.18           0.005      1.55          0.004      1.58
POP/1000         0.057      9.41           0.121      4.43          0.080      2.73
NRETAIL          0.021      5.22           0.027      1.56          0.004      0.20

Var alpha (σ²_α)                           0.607                    0.528
Var eps (σ²_ε)   0.700                     0.263                    0.279
AR(1) corr (ρ)                                                      0.555      25.88
AIC              4353.25                   2862.74                  2269.83

Based on in-sample data of n = 50 ZIP codes and T = 35 weeks. The response is (natural) logarithmic sales.

where the intercept varies according to subject. Table 3 provides parameter estimates and the corresponding t-statistics, as well as estimates of the variance components, $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$. As is common in longitudinal data analysis, allowing intercepts to vary by subject can result in regression coefficients for other variables becoming statistically insignificant. When comparing this model to the pooled cross-sectional model, we may use the Lagrange multiplier test described in Baltagi (1995, Chapter 3). The test statistic turns out to be TS = 11,395.5, indicating that the error components model is strongly preferred to the pooled cross-sectional model. Another piece of evidence is Akaike's Information Criterion (AIC). The smaller this criterion, the more preferred is the model. Table 3 shows that the error components model is preferred to the pooled cross-sectional model based on its smaller value of the AIC statistic.

To assess further the adequacy of the error components model, we calculated residuals from the fitted model and examined diagnostic tests and graphs. One such diagnostic graph, not displayed here, is a plot of residuals versus lagged residuals. This graph shows a strong relationship between residuals and lagged residuals, which we can represent using an autocorrelation structure for the error terms. The graph also shows a strong pattern of clustering corresponding to weeks with large PowerBall jackpots. A variable that captures information about the size of PowerBall jackpots would help in developing a model of lottery sales. However, for forecasting purposes, we require one or more variables that anticipate large PowerBall jackpots. That is, because the size of PowerBall jackpots is not known in advance, variables that proxy the event of large jackpots are not suitable for forecasting models. These variables could be developed through a separate forecasting model of PowerBall jackpots.

Other types of random effects models for forecasting lottery sales could also be considered. To illustrate, we also fit a more parsimonious version of the AR(1) version of the error components model; specifically, we fit this model after deleting those variables with insignificant t-statistics. It turned out that this fitted model did not perform substantially better in terms of overall model fit statistics such as AIC. We explore alternative transforms of the response when examining a held-out sample in the following subsection.

Table 4 reports the estimation results from fitting the two-way error components model in Eq. (2.1), with and without an AR(1) term. For comparison


Table 4
Lottery model coefficient estimates

                 One-way error components   Two-way error            Two-way error components
                 model with AR(1) term      components model         model with AR(1) term
Variable         Estimate   t-statistic     Estimate   t-statistic   Estimate   t-statistic
Intercept        15.255     2.18            16.477     2.39          15.897     2.31
PERPERHH         0.115      1.36            0.121      1.43          0.118      1.40
MEDSCHYR         0.091      2.53            0.098      2.79          0.095      2.70
OOMEDHVL         0.001      0.81            0.001      0.71          0.001      0.75
PRCRENT          0.030      1.53            0.028      1.44          0.029      1.49
PRC55P           0.071      1.01            0.071      1.00          0.072      1.02
HHMEDAGE         0.120      1.09            0.118      1.06          0.120      1.08
CEMI             0.004      1.58            0.004      1.59          0.004      1.59
POP/1000         0.001      2.73            0.001      5.45          0.001      4.26
NRETAIL          0.004      0.20            0.009      1.07          0.003      0.26

Var alpha (σ²_α) 0.528                      0.564                    0.554
Var eps (σ²_ε)   0.279                      0.022                    0.024
Var lambda (σ²_λ)                           0.241                    0.241
AR(1) corr (ρ)   0.555      25.88                                    0.518      25.54
AIC              2270.97                    1574.02                  1109.61

Based on in-sample data of n = 50 ZIP codes and T = 35 weeks. The response is (natural) logarithmic sales.
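The AIC comparisons in Tables 3 and 4 rest on the usual definition $AIC = -2\log L + 2k$, where $\log L$ is the maximized log-likelihood and $k$ the number of estimated parameters (mixed-model software varies in exactly which parameters it counts, so values are only comparable within one convention). A minimal helper, with hypothetical log-likelihood values for illustration:

```python
def aic(loglik: float, n_params: int) -> float:
    """Akaike's Information Criterion; smaller values indicate
    a preferred model under a common fitting convention."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical maximized log-likelihoods (illustrative numbers only).
models = {
    "error components": aic(-1420.0, 12),
    "two-way + AR(1)": aic(-540.0, 14),
}
best = min(models, key=models.get)
print(best)  # the specification with the smaller AIC
```

The point of the penalty term $2k$ is that the richer two-way AR(1) specification must buy its extra variance parameters with a sufficiently higher log-likelihood.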

purposes, the fitted coefficients for the one-way model with an AR(1) term are also presented in this table. As in Table 3, we see that the model selection criterion, AIC, indicates that the more complex two-way models provide an improved fit compared to the one-way models. As with the one-way models, the autocorrelation coefficient is statistically significant even with the time-varying parameter $\lambda_t$. In each of the three models in Table 4, only the population size (POP) and education levels (MEDSCHYR) have a significant effect on lottery sales.

4.3. Out-of-sample model specification

This subsection compares the ability of several competing models to forecast values outside of the sample used for model parameter estimation. As in Section 4.2, we use the first 35 weeks of data to estimate model parameters. The remaining 5 weeks are used to assess the validity of model forecasts. For each model, we compute forecasts of lottery sales for weeks 36 through 40, at the ZIP code level, based on the first 35 weeks. Denote these forecast values as $\widehat{ZOLSALES}_{i,35+L}$, for L = 1 to 5. We summarize the

accuracy of the forecasts through two statistics, the mean absolute error

MAE = \frac{1}{5n} \sum_{i=1}^{n} \sum_{L=1}^{5} \left| \widehat{ZOLSALES}_{i,35+L} - ZOLSALES_{i,35+L} \right|   (4.2)

and the mean absolute percentage error

MAPE = \frac{100}{5n} \sum_{i=1}^{n} \sum_{L=1}^{5} \left| \frac{\widehat{ZOLSALES}_{i,35+L} - ZOLSALES_{i,35+L}}{ZOLSALES_{i,35+L}} \right|   (4.3)
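The two criteria in Eqs. (4.2) and (4.3) are straightforward to compute once forecasts are in hand. The following is a minimal numpy sketch; the array values at the bottom are made-up numbers for illustration only, not values from the lottery data.

```python
import numpy as np

def mae_mape(actual, forecast):
    """Forecast criteria of Eqs. (4.2) and (4.3).

    actual, forecast: arrays of shape (n, 5), i.e. n ZIP codes by
    5 holdout weeks (weeks 36-40 in the paper's setup)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = forecast - actual
    mae = np.mean(np.abs(err))                    # (1/(5n)) * sum of |errors|
    mape = 100.0 * np.mean(np.abs(err / actual))  # percentage version
    return mae, mape

# Tiny illustration (hypothetical data, n = 2 ZIP codes, 5 weeks):
actual = np.array([[100.0, 110.0, 105.0, 120.0, 115.0],
                   [200.0, 190.0, 210.0, 205.0, 195.0]])
forecast = actual + 10.0  # every forecast off by +10 dollars
mae, mape = mae_mape(actual, forecast)
# mae is exactly 10.0; mape weights the same dollar error more
# heavily for the smaller-sales ZIP code
```

Note that MAPE penalizes a fixed dollar error more for low-sales ZIP codes, which is one reason the two criteria can rank models differently, as Table 5 illustrates.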

The competing models include the models of logarithmic sales summarized in Tables 3 and 4. Because the autocorrelation term is highly statistically significant in Table 3, we also fit a pooled cross-sectional model with an AR(1) term. Further, we fit two modifications of the error components model with the AR(1) term: in the first case we use lottery sales as the response (not the logarithmic version), and in the second case we use the percentage change of lottery sales, defined in


Eq. (4.1), as the response. Finally, we also consider a basic fixed effects model,

y_{it} = \alpha_i + \varepsilon_{it},

with an AR(1) error structure. In fixed effects models, the term \alpha_i is treated as a fixed parameter, not a random variable. Because this parameter is time-invariant, it is not possible to include our time-invariant demographic and economic characteristics in the fixed effects model.

Table 5 presents the model forecast criteria in Eqs. (4.2) and (4.3) for each model. We first note that Table 5 re-confirms that the AR(1) term improves each model: for the pooled cross-sectional model, as well as the one-way and two-way error components models, the version with an AR(1) term outperforms the analogous model without this term. Table 5 also shows that the one-way error components model dominates the pooled cross-sectional model; this was anticipated by our pooling test, an in-sample test procedure. Somewhat surprisingly, the two-way model did not perform as well as the one-way model. Table 5 confirms that the error components model with an AR(1) term and logarithmic sales as the response is the preferred model, based on either the MAE or the MAPE criterion. The next best model was the corresponding fixed effects model. It is interesting to note that the model with sales as the response outperformed the model with percentage change as the response based on the MAE criterion, although the reverse is true based on the MAPE criterion.

4.4. Forecasts

We now forecast using the model that provides the best fit to the data, the error components model with an AR(1) term. The forecasts and forecast intervals for this model are special cases of the results for the more general longitudinal data mixed model, given in Eq. (3.6) and Appendix A, respectively. Fig. 3 displays the forecasts and forecast intervals. Here, we use T = 40 weeks of data to estimate parameters and provide forecasts for L = 5 weeks. Parameter estimates, point forecasts and forecast intervals were calculated using logarithmic sales as the response; point forecasts and forecast intervals were then converted to dollars to display the ultimate impact of the model forecasting strategy.

Fig. 3 shows the forecasts and forecast intervals for two selected postal codes. The lower forecast represents a postal code from Dane County, whereas the upper represents a postal code from Milwaukee. For each postal code, the middle line represents the point forecast, and the upper and lower lines represent the bounds of a 95% forecast interval. When calculating this interval, we applied a normal curve approximation, using the point forecast plus or minus 1.96 times the standard error. Compared to the Dane County code, the Milwaukee postal code has higher forecast sales. Thus, although the standard errors on the logarithmic scale are about the same as for the Dane County code, this higher point forecast leads to a larger interval when rescaled to dollars.
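The conversion from a logarithmic-scale forecast to a dollar-scale interval can be sketched as follows. The numbers used here are hypothetical, not values from the Wisconsin data; the sketch only illustrates why intervals with the same log-scale standard error widen in dollars as the point forecast grows.

```python
import math

def dollar_interval(log_point, log_se, z=1.96):
    """Normal-approximation 95% forecast interval built on the log scale,
    then exponentiated to dollars. Symmetric in logs, asymmetric in dollars."""
    lower = math.exp(log_point - z * log_se)
    point = math.exp(log_point)
    upper = math.exp(log_point + z * log_se)
    return lower, point, upper

# Two hypothetical postal codes with the same log-scale standard error:
lo1, pt1, up1 = dollar_interval(8.0, 0.25)   # smaller point forecast
lo2, pt2, up2 = dollar_interval(9.5, 0.25)   # larger point forecast

# Same se in logs, but the dollar-scale interval widens with the level:
width1, width2 = up1 - lo1, up2 - lo2
```

This mirrors the behavior seen in Fig. 3: the Milwaukee code, with the higher log-scale point forecast, produces the wider dollar-scale interval.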

Table 5
Out-of-sample forecast comparison of nine alternative models

Model                                            Model response      MAE        MAPE
Pooled cross-sectional model                     Logarithmic sales   3012.68    83.41
Pooled cross-sectional model with AR(1) term     Logarithmic sales    680.64    21.19
Error components model                           Logarithmic sales   1318.05    33.85
Error components model with AR(1) term           Logarithmic sales    571.14    18.79
Error components model with AR(1) term           Sales               1409.61   140.25
Error components model with AR(1) term           Percentage change   1557.82    48.70
Fixed effects model with AR(1) term              Logarithmic sales    584.55    19.07
Two-way error components model                   Logarithmic sales   1257.21    33.14
Two-way error components model with AR(1) term   Logarithmic sales   1202.97    32.47


Fig. 3. Forecast intervals for two selected postal codes. For each postal code, the middle line corresponds to point forecasts for 5 weeks. The upper and lower lines correspond to endpoints of 95% prediction intervals.

4.5. Sales and marketing applications

The lottery sales example has a structure similar to many sales and marketing applications. Sales data are organized in time (days, weeks, months, or years) and in space (country, geographical region, sales areas or districts). We model sales volume as a function of market conditions and the 'marketing mix.' Market conditions include information about competitors, substitute products, and the economic climate. The 'marketing mix' refers to the actions that firms take with regard to product features, pricing, distribution, advertising, and promotion. Like the sales response, most explanatory variables vary in time and space.

Consider the case of supermarket scanner data collected weekly from thousands of stores across the United States. To forecast sales volume for a particular product, say a frozen dinner, in a particular store, researchers build upon observed sales volumes for all products in the frozen dinner category. Product price, discount coupon, promotion, and advertising data would be collected, and appropriate explanatory variables defined. In building a forecasting model for a food manufacturer or retail chain, a researcher might combine data across stores within cities and across products within brands. Models at various levels of aggregation can be built from longitudinal data. Leeflang, Wittink, Wedel, and Naert (2000) and Hanssens, Parsons, and Schultz (2001) discuss traditional time series and regression approaches to scanner data analysis. Longitudinal data mixed models, as discussed in this paper, represent a set of flexible, dynamic models for sales and marketing applications of this type.

5. Summary and concluding remarks This article considers the longitudinal data mixed model, a class of models that extends the traditional two-way error components longitudinal data models


yet still falls within the framework of a mixed linear model. In particular, the theory allows us to consider unbalanced data, slopes that may vary by subject or time, and parametric forms of heteroscedasticity as well as serial correlation. Forecasts for these models are derived as special cases of best linear unbiased predictors (BLUPs), together with the variance of the forecast errors. The theory provides optimal predictions assuming that the variance components are known. If estimators replace the true variance components, then the mean squared error of the predictors increases. The magnitude of this increase, as well as approximate adjustments, have been studied by Kackar and Harville (1984), Harville and Jeske (1992) and Baillie and Baltagi (1999).

The theory is substantially broader than prevailing forecasting practice. To illustrate how this theory may be applied, this article considers forecasting Wisconsin lottery sales. We examined a variety of model specifications to arrive at a simple one-way error components model with an AR(1) term. This simple model provides a good fit to the data. In subsequent work, we intend to investigate more complex models in order to realize more useful forecasts of future sales. One direction that subsequent research may take is to examine longitudinal data models with spatial, as well as time-series, error components.
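For readers who want a concrete instance of the BLUP machinery, here is a minimal numpy sketch for the simplest member of the model class: a one-way error components model without the AR(1) term, with the variance components and the regression coefficients treated as known. In this special case the BLUP of the subject effect reduces to a shrinkage (credibility) factor times the subject's mean residual; the function name and the illustrative numbers are ours, not the paper's.

```python
import numpy as np

def blup_forecast(y, X, x_future, var_alpha, var_eps, beta):
    """BLUP forecast for one subject under the one-way error components
    model y_it = x_it' beta + alpha_i + eps_it with iid eps (no AR(1) term).

    y: (T,) responses for the subject; X: (T, k) regressors;
    x_future: (k,) regressors at the forecast date;
    var_alpha, var_eps: variance components, taken as known here."""
    T = len(y)
    resid = y - X @ beta
    # shrinkage ("credibility") factor: T*var_alpha / (T*var_alpha + var_eps)
    zeta = T * var_alpha / (T * var_alpha + var_eps)
    alpha_blup = zeta * resid.mean()     # shrunken subject-effect predictor
    return x_future @ beta + alpha_blup

# Illustration with made-up numbers: intercept-only model, beta = [10].
y = np.array([12.0, 11.0, 13.0, 12.0])   # this subject runs ~2 above beta
X = np.ones((4, 1))
f = blup_forecast(y, X, np.ones(1),
                  var_alpha=1.0, var_eps=1.0, beta=np.array([10.0]))
# forecast = 10 + (4*1/(4*1 + 1)) * 2 = 11.6
```

The shrinkage factor approaches 1 as T grows or as the subject-effect variance dominates, so the forecast moves from the population regression line toward the subject's own history, which is the qualitative behavior the BLUPs of Appendix A deliver in the general case.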

Appendix A. Inference for the longitudinal data mixed model

To express the model more compactly, we use the mixed linear model specification. Define y = (y_1', y_2', \ldots, y_n')', \varepsilon = (\varepsilon_1', \varepsilon_2', \ldots, \varepsilon_n')', \alpha = (\alpha_1', \alpha_2', \ldots, \alpha_n')',

X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}, \qquad
Z_\lambda = \begin{pmatrix} Z_{\lambda,1} \\ Z_{\lambda,2} \\ \vdots \\ Z_{\lambda,n} \end{pmatrix}, \qquad
Z_\alpha = \begin{pmatrix} Z_{\alpha,1} & 0 & \cdots & 0 \\ 0 & Z_{\alpha,2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & Z_{\alpha,n} \end{pmatrix}.

With these choices, the longitudinal data mixed model in Eq. (2.1) can be expressed as a mixed linear model, given by

y = Z_\alpha \alpha + Z_\lambda \lambda + X\beta + \varepsilon.   (A.0)

Further, we use the notation R = \mathrm{Var}\,\varepsilon = \mathrm{blockdiag}(R_1, \ldots, R_n) and note that \mathrm{Var}\,\alpha = I_n \otimes D. With this notation, we may express the variance-covariance matrix of y as

\mathrm{Var}\,y = V = Z_\alpha (I_n \otimes D) Z_\alpha' + Z_\lambda \Sigma_\lambda Z_\lambda' + R.   (A.1)

Moreover, define

V_\alpha = R + Z_\alpha (I_n \otimes D) Z_\alpha' = \mathrm{blockdiag}(V_{\alpha,1}, \ldots, V_{\alpha,n}),

where V_{\alpha,i} is defined in Eq. (3.2). With this notation, we use standard matrix algebra results to write

V^{-1} = (V_\alpha + Z_\lambda \Sigma_\lambda Z_\lambda')^{-1} = V_\alpha^{-1} - V_\alpha^{-1} Z_\lambda (Z_\lambda' V_\alpha^{-1} Z_\lambda + \Sigma_\lambda^{-1})^{-1} Z_\lambda' V_\alpha^{-1}.   (A.2)

For best linear unbiased prediction (BLUP) in the mixed linear model, suppose that we observe an N x 1 random vector y with mean \mathrm{E}\,y = X\beta and variance \mathrm{Var}\,y = V. The generic goal is to predict a random variable w such that \mathrm{E}\,w = c'\beta and \mathrm{Var}\,w = \sigma_w^2. Denote the covariance between w and y by \mathrm{Cov}(w, y). The BLUP of w is

w_{BLUP} = c' b_{GLS} + \mathrm{Cov}(w, y)' V^{-1} (y - X b_{GLS}),   (A.3)

where b_{GLS} = (X' V^{-1} X)^{-1} X' V^{-1} y. The mean square error is

\mathrm{Var}(w_{BLUP} - w) = (c' - \mathrm{Cov}(w, y)' V^{-1} X)(X' V^{-1} X)^{-1} (c' - \mathrm{Cov}(w, y)' V^{-1} X)' - \mathrm{Cov}(w, y)' V^{-1} \mathrm{Cov}(w, y) + \sigma_w^2.   (A.4)

Proof of the Proposition. We first derive the BLUP of \lambda. Let c_\lambda be an arbitrary vector of constants and set w = c_\lambda' \lambda. For this choice of w, we have c = 0. From Eq. (A.3),

c_\lambda' \lambda_{BLUP} = c_\lambda' \Sigma_\lambda Z_\lambda' V^{-1} (y - X b_{GLS}).

Using Eq. (A.2) and S_{ZZ} = \sum_{i=1}^{n} Z_{\lambda,i}' V_{\alpha,i}^{-1} Z_{\lambda,i} = Z_\lambda' V_\alpha^{-1} Z_\lambda, we have

\Sigma_\lambda Z_\lambda' V^{-1} = \Sigma_\lambda Z_\lambda' \left( V_\alpha^{-1} - V_\alpha^{-1} Z_\lambda (Z_\lambda' V_\alpha^{-1} Z_\lambda + \Sigma_\lambda^{-1})^{-1} Z_\lambda' V_\alpha^{-1} \right)
= \Sigma_\lambda \left( I - S_{ZZ} (S_{ZZ} + \Sigma_\lambda^{-1})^{-1} \right) Z_\lambda' V_\alpha^{-1}
= (S_{ZZ} + \Sigma_\lambda^{-1})^{-1} Z_\lambda' V_\alpha^{-1}.

Thus,

\lambda_{BLUP} = \Sigma_\lambda Z_\lambda' V^{-1} (y - X b_{GLS}) = (S_{ZZ} + \Sigma_\lambda^{-1})^{-1} Z_\lambda' V_\alpha^{-1} (y - X b_{GLS}),   (A.5)

which is sufficient for Eq. (3.4).

To derive the BLUP of \varepsilon_i, let c_\varepsilon be an arbitrary vector of constants and set w = c_\varepsilon' \varepsilon_i. With c = 0, we have

\mathrm{Cov}(c_\varepsilon' \varepsilon_i, y_j) = \begin{cases} c_\varepsilon' R_i & \text{for } j = i \\ 0 & \text{for } j \neq i. \end{cases}

Using this, Eqs. (A.2) and (A.3) yield

c_\varepsilon' \varepsilon_{i,BLUP} = \mathrm{Cov}(c_\varepsilon' \varepsilon_i, y)' V^{-1} (y - X b_{GLS})
= \mathrm{Cov}(c_\varepsilon' \varepsilon_i, y)' V_\alpha^{-1} (y - X b_{GLS}) - \mathrm{Cov}(c_\varepsilon' \varepsilon_i, y)' V_\alpha^{-1} Z_\lambda (S_{ZZ} + \Sigma_\lambda^{-1})^{-1} Z_\lambda' V_\alpha^{-1} (y - X b_{GLS})
= \mathrm{Cov}(c_\varepsilon' \varepsilon_i, y)' V_\alpha^{-1} (y - X b_{GLS}) - \mathrm{Cov}(c_\varepsilon' \varepsilon_i, y)' V_\alpha^{-1} Z_\lambda \lambda_{BLUP}
= c_\varepsilon' R_i V_{\alpha,i}^{-1} \left( (y_i - X_i b_{GLS}) - Z_{\lambda,i} \lambda_{BLUP} \right).   (A.6)

Using Wald's device, we have the BLUP of \varepsilon_i given in Eq. (3.5).

The derivation for the BLUP of \alpha_i is similar. Let c_\alpha be an arbitrary vector of constants and set w = c_\alpha' \alpha_i. This yields

\mathrm{Cov}(c_\alpha' \alpha_i, y_j) = \begin{cases} c_\alpha' D Z_{\alpha,i}' & \text{for } j = i \\ 0 & \text{for } j \neq i. \end{cases}

Using this, Eqs. (A.2), (A.3) and (3.4), we have

c_\alpha' \alpha_{i,BLUP} = \mathrm{Cov}(c_\alpha' \alpha_i, y)' V^{-1} (y - X b_{GLS})
= \mathrm{Cov}(c_\alpha' \alpha_i, y)' V_\alpha^{-1} (y - X b_{GLS}) - \mathrm{Cov}(c_\alpha' \alpha_i, y)' V_\alpha^{-1} Z_\lambda (S_{ZZ} + \Sigma_\lambda^{-1})^{-1} Z_\lambda' V_\alpha^{-1} (y - X b_{GLS})
= c_\alpha' D Z_{\alpha,i}' V_{\alpha,i}^{-1} \left( e_{i,GLS} - Z_{\lambda,i} \lambda_{BLUP} \right),   (A.7)

where e_{i,GLS} = y_i - X_i b_{GLS}. This is sufficient for Eq. (3.6).

Now, to calculate the BLUP forecast of y_{i,T_i+L} in Eq. (3.7), we wish to predict w = y_{i,T_i+L}. With this choice of w, we have c = x_{i,T_i+L}. We examine

\mathrm{Cov}(y_{i,T_i+L}, y) = \mathrm{Cov}(z_{\alpha,i,T_i+L}' \alpha_i, y) + \mathrm{Cov}(z_{\lambda,i,T_i+L}' \lambda_{T_i+L}, y) + \mathrm{Cov}(\varepsilon_{i,T_i+L}, y).   (A.8)

Using Eqs. (A.0) and (A.5), we have

\mathrm{Cov}(z_{\lambda,i,T_i+L}' \lambda_{T_i+L}, y)' V^{-1} (y - X b_{GLS}) = z_{\lambda,i,T_i+L}' \mathrm{Cov}(\lambda_{T_i+L}, \lambda') Z_\lambda' V^{-1} (y - X b_{GLS}) = z_{\lambda,i,T_i+L}' \mathrm{Cov}(\lambda_{T_i+L}, \lambda') \Sigma_\lambda^{-1} \lambda_{BLUP}.   (A.9)

Similarly, from Eq. (A.7), we have

\mathrm{Cov}(z_{\alpha,i,T_i+L}' \alpha_i, y)' V^{-1} (y - X b_{GLS}) = z_{\alpha,i,T_i+L}' \alpha_{i,BLUP},   (A.10)

and, similar to Eq. (A.6), we have

\mathrm{Cov}(\varepsilon_{i,T_i+L}, y)' V^{-1} (y - X b_{GLS}) = \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' R_i^{-1} \varepsilon_{i,BLUP}.   (A.11)

Thus, from Eqs. (A.3), (3.3) and (A.8)-(A.11), the BLUP forecast of y_{i,T_i+L} is

\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + \mathrm{Cov}(y_{i,T_i+L}, y)' V^{-1} (y - X b_{GLS})
= x_{i,T_i+L}' b_{GLS} + z_{\lambda,i,T_i+L}' \mathrm{Cov}(\lambda_{T_i+L}, \lambda)' \Sigma_\lambda^{-1} \lambda_{BLUP} + z_{\alpha,i,T_i+L}' \alpha_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)' R_i^{-1} \varepsilon_{i,BLUP},

as in Eq. (3.7). This is sufficient for the proof of the Proposition.

References

Ailawadi, K. L., & Neslin, S. A. (1998). The effect of promotion on consumption: Buying more and consuming it faster. Journal of Marketing Research, 35, 390-398.
Ashley, T., Liu, Y., & Chang, S. (1999). Estimating net lottery revenues for states. Atlantic Economic Journal, 27, 170-178.
Baillie, R. T., & Baltagi, B. H. (1999). Prediction from the regression model with one-way error components. In C. Hsiao, K. Lahiri, L. Lee, & M. H. Pesaran (Eds.), Analysis of panels and limited dependent variable models. Cambridge, UK: Cambridge University Press.
Baltagi, B. H. (1988). Prediction with a two-way error component regression model. Problem 88.1.1. Econometric Theory, 4, 171.
Baltagi, B. H. (1995). Econometric analysis of panel data. New York: Wiley.
Baltagi, B. H., & Li, Q. (1991). A transformation that will circumvent the problem of autocorrelation in an error-component model. Journal of Econometrics, 48(3), 385-393.
Baltagi, B. H., & Li, Q. (1992). Prediction in the one-way error component model with serial correlation. Journal of Forecasting, 11, 561-567.
Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28-36.


Diggle, P. J., Liang, K.-Y., & Zeger, S. L. (1994). Analysis of longitudinal data. Oxford: Oxford University Press.
Erdem, T. (1996). A dynamic analysis of market structure based on panel data. Marketing Science, 15, 359-378.
Frees, E. W., Young, V., & Luo, Y. (1999). A longitudinal data analysis interpretation of credibility models. Insurance: Mathematics and Economics, 24, 229-247.
Frees, E. W., Young, V., & Luo, Y. (2001). Credibility ratemaking using panel data models. North American Actuarial Journal, 5(4), 24-42.
Goldberger, A. S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association, 57, 369-375.
Hanssens, D. M., Parsons, L. J., & Schultz, R. L. (2001). Market response models: Econometric and time series analysis (2nd ed.). Boston: Kluwer.
Harville, D. (1976). Extension of the Gauss-Markov theorem to include the estimation of random effects. Annals of Statistics, 4, 384-395.
Harville, D., & Jeske, J. R. (1992). Mean square error of estimation or prediction under a general linear model. Journal of the American Statistical Association, 87, 724-731.
Kackar, R. N., & Harville, D. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.
Koning, R. H. (1989). Prediction with a two-way error component regression model. Solution 88.1.1. Econometric Theory, 5, 175.
Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.
Leeflang, P. S. H., Wittink, D. R., Wedel, M., & Naert, P. A. (2000). Building models for marketing decisions. Boston: Kluwer.
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Institute.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York: Springer.
Robinson, G. K. (1991). That BLUP is a good thing: The estimation of random effects. Statistical Science, 6, 15-51.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York: Wiley.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer-Verlag.

Biographies: Edward W. (Jed) FREES is a Professor of Business and Statistics at the University of Wisconsin—Madison and is holder of the Fortis Health Professorship of Actuarial Science. He is a Fellow of both the Society of Actuaries and the American Statistical Association. Professor Frees is a frequent contributor to scholarly journals. His papers have won several awards for quality, including the Actuarial Education and Research Fund's annual Halmstad Prize for the best paper published in the actuarial literature (three times). Professor Frees is currently the Editor of the North American Actuarial Journal and an Associate Editor for Insurance: Mathematics and Economics. The National Science Foundation (Grant Number SES-0095343) provided funding to support this research.


Thomas W. MILLER is Director of the A.C. Nielsen Center for Marketing Research at the University of Wisconsin—Madison. He holds graduate degrees in psychology (PhD, psychometrics) and statistics (MS) from the University of Minnesota, and in business (MBA) and economics (MS) from the University of Oregon. An expert in applied statistics and modeling, Tom has designed and conducted numerous empirical and simulation studies comparing traditional and data-adaptive methods. Tom's current research includes explorations of online research methods and studies of consumer lifestyles, choices, and uses of technology products. He won the David K. Hardin Award for the best paper in Marketing Research in 2001.