Using information criteria to determine the number of regimes in threshold autoregressive models

E.L. Hamaker
Methods and Statistics, Faculty of Social Sciences, Utrecht University, The Netherlands

Article history: Received 1 September 2008; received in revised form 20 July 2009; available online 15 August 2009.
Keywords: Information criteria; Threshold autoregressive model; Time series analysis; Regime-switching

Abstract
Threshold autoregressive models can be used to study processes that are characterized by recurrent switches between two or more regimes, where switching is triggered by a manifest threshold variable. In this paper the performance of diverse information criteria for selecting the number of regimes in small to moderate sample sizes (i.e., n = 50, 100, 200) is investigated. In addition, it is investigated whether these information criteria can be used to determine whether the residual variances are identical across the regimes. It is concluded that for small sample sizes the AICu should be preferred, while for larger sample sizes either the BIC or the AICcp should be considered: the latter is the only information criterion that includes a penalty for the unknown threshold parameters.
1. Introduction

Regime-switching has been recognized as an important phenomenon in psychology. In cognitive psychology, for instance, stage-wise development has been studied extensively (Dolan, Jansen, & Van der Maas, 2004; Schmittmann, Dolan, & Van der Maas, 2005; Van der Maas & Molenaar, 1992). At a microlevel, it has been suggested that the variability in performance on certain cognitive tasks can also be understood as stemming from switches between different strategies or regimes (cf. Wagenmakers, Farrell, & Ratcliff, 2004). In clinical psychology, a variety of psychological disorders as defined in the DSM-IV clearly distinguish between different regimes (e.g., bipolar disorder, several eating disorders, and posttraumatic stress disorder; see American Psychiatric Association, 1994). To develop a deeper understanding of such processes, researchers need to employ models that allow for such regime-switches. A potentially useful model for this purpose is the class of threshold autoregressive (TAR) models introduced by Tong and Lim (1980). Characteristic of a TAR process is that the system switches repeatedly between two or more regimes (which may also be referred to as states), that each regime is a distinct autoregressive (AR) process, and that regime-switching depends on the value of a manifest threshold variable. TAR models have been used to describe and predict diverse phenomena including predator–prey
population cycles (Stenseth et al., 1998), sunspot data (Tong & Lim, 1980), river flows (Amendola, Niglio, & Vitale, 2006; Tsay, 1998), climatic data (Al-Alwadhi & Jolliffe, 1998), monthly unemployment rates (Caner & Hansen, 2001), and exchange rates (Narayan, 2007). In psychological research these models have been used to distinguish between periods of relapse and recovery in the behavior of alcohol addicts and sex offenders (Warren, 2002; Warren, Hawkins, & Sprott, 2003), and more recently Hamaker, Zhang, and Van der Maas (in press) have shown that the models considered by Gottman, Murray, Swanson, Tyson, and Swanson (2002) for dyadic interactions also fit within the TAR framework. Furthermore, TAR models have been used to model the switches between mania and depression in a patient with bipolar disorder (Hamaker, Grasman, & Kamphuis, in press). An important topic in TAR modeling is how to determine the appropriate number of regimes. If the thresholds are known, models with differing numbers of regimes can be compared using conventional hypothesis testing. If the thresholds are unknown however, and thus need to be estimated from the data, conventional hypothesis testing using nested models is inappropriate because of nuisance parameters that are absent (or unidentified) under the null model (Chan & Tong, 1990; Tong, 2003, p. 234).1 To overcome this problem, several alternative statistical tests have been developed for TAR models (Chan & Tong, 1990; Chan & Ng, 2004; Hansen, 1999; Strikholm & Teräsvirta, 2006), but these techniques are typically computationally demanding.
1 A similar problem is encountered in determining the number of components in a mixture (cf. Frühwirth-Schnatter, 2006, pp. 114–115).
A simple and easy-to-use alternative is provided by information criteria, which consist of −2 times the log likelihood and a penalty for model complexity that is based on the number of unknown parameters. Gonzalo and Pitarakis (2002) and Strikholm and Teräsvirta (2006) performed simulation studies that focus on determining the number of regimes of a TAR model using information criteria. However, they considered only moderate to large sample sizes (i.e., n = 200, 400, 600, 800). Peña and Rodriguez (2005) studied the same issue in small to moderate sample sizes (i.e., n = 50, 100, 250), using the BIC, AIC and a small sample corrected AIC. However, the latter was not developed for TAR models and may not be the appropriate expression for the small sample corrected AIC (see Burnham & Anderson, 2002, p. 379).
The current paper contributes to this existing TAR literature in several ways. First, we consider small to moderate sample sizes (n = 50, 100, 200), and compare the BIC with two small sample corrected AICs, which were specifically developed for TAR modeling by Wong and Li (1998) and De Gooijer (2001). The performance of these small sample TAR AICs for the purpose of selecting the number of regimes has not been studied yet. Second, we include an alternative AIC, which was recently derived for change-point models by Ninomiya (2005). Since there is a strong relationship between TAR models and change-point models (e.g., Chen, 1998), this may be a more appropriate AIC measure than the regular AIC. It is the convention in TAR modeling not to penalize the threshold parameters, and the change-point AIC (AICcp) is the only information criterion that includes a penalty for the unknown threshold parameters. Third, we investigate whether it is possible to distinguish between TAR models with identical residual variances across regimes, and models with differing residual variances. This issue has not been studied yet.
The remainder of this paper is organized as follows. In the next section the TAR model is presented. This is followed by a section on model estimation and a section on model selection in the context of TAR modeling with information criteria. In the following section the focus is on how to evaluate the performance of information criteria in simulation studies. Next, the results of four simulation studies are presented and the final section consists of a discussion.

2. The TAR model

Since the appearance of Box and Jenkins's Time Series Analysis: Forecasting and Control (Box & Jenkins, 1970), a variety of nonlinear extensions of the original autoregressive moving-average (ARMA) model have been proposed (Engle, 1982; Fan & Yao, 2003; Granger & Andersen, 1978; Tong, 2003). One of these extensions is the threshold autoregressive (TAR) model introduced by Tong (1983) and Tong and Lim (1980), which is appropriate for processes that are characterized by regime-switching that is triggered by an observed threshold variable.
The TAR model is based on the linear AR model. A random variable y_t is said to represent an AR process of order p if

y_t = φ_0 + φ_1 y_{t−1} + · · · + φ_p y_{t−p} + σ e_t,   (1)

where φ_0 is a constant, and φ_1 to φ_p are the autoregressive coefficients in the regression of the current value on previous values of the same variable. To ensure stationarity (meaning that both the first and second moments of the series y_t are independent of time), the roots of 1 − φ_1 z − φ_2 z^2 − · · · − φ_p z^p must lie outside the unit circle (Hamilton, 1994). The residual σ e_t is the part of y_t that cannot be predicted from the previous y's, and has a mean of zero and a variance of σ^2 (i.e., e_t is assumed to have a mean of zero and a variance of 1). Assuming the residuals are normally distributed, we have σ e_t ∼ N(0, σ^2). The mean of the AR process is φ_0/(1 − φ_1 − · · · − φ_p).
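As a quick numerical illustration of the stationarity condition and the mean formula just given, the sketch below (Python with NumPy; the function names and coefficient values are my own, chosen only for illustration) checks whether the roots of the AR polynomial lie outside the unit circle and computes φ_0/(1 − φ_1 − · · · − φ_p).

```python
import numpy as np

def ar_is_stationary(phi):
    """Check whether an AR(p) process with coefficients phi = [phi_1, ..., phi_p]
    is stationary: all roots of 1 - phi_1 z - ... - phi_p z^p must lie outside
    the unit circle."""
    # Build the polynomial coefficients in descending powers of z for np.roots.
    poly = np.concatenate(([1.0], -np.asarray(phi, dtype=float)))[::-1]
    roots = np.roots(poly)
    return np.all(np.abs(roots) > 1.0)

def ar_mean(phi0, phi):
    """Mean of a stationary AR(p) process: phi_0 / (1 - phi_1 - ... - phi_p)."""
    return phi0 / (1.0 - np.sum(phi))

# Illustrative values only: phi_0 = 1, phi_1 = .5, phi_2 = -.3.
print(ar_is_stationary([0.5, -0.3]))   # True
print(ar_mean(1.0, [0.5, -0.3]))       # 1.25
```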
A TAR process consists of two or more AR processes — which are referred to as regimes — between which the system switches. The switching occurs when the threshold variable passes one of the thresholds. Suppose there are k regimes, let z_{t−d} be the threshold variable with delay d, and let p^{(i)} be the order of the AR process in regime i, with i = 1, . . . , k. Then a TAR(k, p^{(1)}, . . . , p^{(k)}) process is defined as

y_t = Σ_{i=1}^{k} {φ_0^{(i)} + φ_1^{(i)} y_{t−1} + · · · + φ_{p^{(i)}}^{(i)} y_{t−p^{(i)}} + σ^{(i)} e_t} I(z_{t−d} ∈ A_i),   (2)

where I(·) is the indicator function, which equals 1 if z_{t−d} ∈ A_i, and is 0 otherwise. Typically A_i = (τ_{i−1}, τ_i], with −∞ = τ_0 < τ_1 < · · · < τ_k = ∞, where τ_1 to τ_{k−1} are the thresholds of interest. That is, if z_{t−d} ≤ τ_1 (meaning z_{t−d} ∈ A_1), y_t falls in regime 1; if τ_1 < z_{t−d} ≤ τ_2 (meaning z_{t−d} ∈ A_2), y_t falls in regime 2; and so on. The regimes can differ from each other with respect to the order of the AR process (i.e., p^{(i)}), the constant (i.e., φ_0^{(i)}), the AR coefficients (i.e., φ_1^{(i)} to φ_{p^{(i)}}^{(i)}), and/or the residual variance (i.e., σ^{(i)2}).
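The partition into the sets A_i can be made concrete with a small helper that maps a value of the threshold variable to its regime. This is an illustrative sketch only; the function name and the example thresholds are hypothetical, not from the paper.

```python
import bisect

def regime_index(z, thresholds):
    """Return the regime i (1-based) such that z falls in A_i = (tau_{i-1}, tau_i],
    given sorted thresholds tau_1 < ... < tau_{k-1}; values above the last
    threshold fall in regime k."""
    # bisect_left places values equal to a threshold in the lower regime,
    # matching the half-open intervals (tau_{i-1}, tau_i].
    return bisect.bisect_left(thresholds, z) + 1

# Two thresholds define k = 3 regimes.
print(regime_index(2.0, [2.5, 4.0]))  # 1
print(regime_index(2.5, [2.5, 4.0]))  # 1 (boundary belongs to the lower regime)
print(regime_index(3.0, [2.5, 4.0]))  # 2
print(regime_index(5.0, [2.5, 4.0]))  # 3
```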
Based on the kind of threshold variable that is used, a distinction can be made between several subclasses of the general TAR model in (2). If y_t serves as its own threshold variable z_t, the model is referred to as a self-exciting TAR (SETAR) model. In Fig. 1 a linear AR process and two SETAR(2,1,1) processes are depicted (i.e., SETAR processes with 2 regimes, where both regimes are formed by first-order AR processes). The linear AR process is presented at the top, and was generated with φ_0 = 0, φ_1 = .4 and σ = 1. The first SETAR process can be expressed as

y_t = 1 + .4 y_{t−1} + e_t   if y_{t−1} ≤ 3
      2 + .4 y_{t−1} + e_t   if y_{t−1} > 3,   (3)

and the second SETAR process can be expressed as

y_t = 1 + .8 y_{t−1} + e_t   if y_{t−1} ≤ 3
      1 + .4 y_{t−1} + e_t   if y_{t−1} > 3.   (4)
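A minimal simulation sketch for these two SETAR(2,1,1) processes is given below (Python with NumPy; the helper name simulate_setar and its defaults are my own). The same helper is reused in later sketches.

```python
import numpy as np

def simulate_setar(n, phi0, phi1, tau, sigma=(1.0, 1.0), y0=3.0, seed=0):
    """Simulate a SETAR(2,1,1) series of length n.
    phi0, phi1, sigma: length-2 tuples with the constant, AR(1) coefficient and
    residual standard deviation of regime 1 (y_{t-1} <= tau) and regime 2
    (y_{t-1} > tau)."""
    rng = np.random.default_rng(seed)
    y = np.empty(n)
    y[0] = y0
    for t in range(1, n):
        i = 0 if y[t - 1] <= tau else 1          # regime of the threshold variable
        y[t] = phi0[i] + phi1[i] * y[t - 1] + sigma[i] * rng.standard_normal()
    return y

# The two SETAR(2,1,1) processes in (3) and (4), both with threshold 3:
y_eq3 = simulate_setar(200, phi0=(1.0, 2.0), phi1=(0.4, 0.4), tau=3.0)
y_eq4 = simulate_setar(200, phi0=(1.0, 1.0), phi1=(0.8, 0.4), tau=3.0)
```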
The plots on the left in Fig. 1 are the time series plotted against time; the plots in the middle are the autocorrelation functions; and the plots on the right are the state-space plots, that is, the plots of y_t against y_{t−1}. The φ_1 parameters given above can be recognized as the slopes in (parts of) the state-space, while the constants φ_0 are the intercepts (i.e., the intersection of the regression lines with the y-axis). Note that while the first SETAR process is characterized by an almost cyclic switching between the two regimes, the second SETAR process switches much more often, which results in lower autocorrelations and a more random appearance. The state-space plots of both SETAR processes show a break at the threshold value 3: for the first SETAR process the observations above the threshold are shifted upward, while for the second SETAR process the observations above the threshold are shifted downward. Moreover, the slope in the second regime (depicted in grey) of the second SETAR process is less steep than in the first regime (depicted in black). There are two other subclasses of the TAR model (Tong & Lim, 1980). If the threshold variable is an exogenous variable, the model is referred to as an open-loop TAR system (TARSO). If two variables serve as each other's threshold variable, the model is referred to as a closed-loop TAR system (TARSC). Besides these subclasses, a number of extensions of the basic TAR model in (2) have been proposed, which consist of including other (lagged) variables as predictors in the equation, and incorporating moving average terms (Tong, 2003). In addition, multivariate (vector) extensions of the TAR model have been developed (Koop, Pesaran, & Potter, 1996; Tsay, 1998).
Fig. 1. Linear AR process and two different kinds of SETAR processes, both with a threshold at 3. The left column contains plots of the sequences, the middle column contains autocorrelation functions, and the right column contains the state-space plots. The latter show a break at the threshold value 3 for the two SETAR processes. The observations that fall in the first regime are depicted in black, while the observations that fall in the second regime are depicted in grey. The regression lines in the state-space plots illustrate that the first SETAR process has different intercepts (φ_0^{(1)} < φ_0^{(2)}) but a common slope φ_1, while the second SETAR process has a common intercept φ_0, but different slopes (φ_1^{(1)} > φ_1^{(2)}).
These models have some desirable properties which can be of interest to the psychological researcher. First, SETAR processes (i.e., TAR processes in which the variable serves as its own threshold variable) can be understood as processes that incorporate a feedback mechanism, such that if the system moves away from some equilibrium too much, the system responds to this by switching to another regime. In the second SETAR process in Fig. 1, the regime-switching can be understood as a correction, in that the second regime leads to less extreme development than the first regime (and as a result, the system switches quite regularly between the regimes). The first SETAR process in Fig. 1 can be understood as incorporating a maladaptive feedback system, in that extreme values make the system "surrender" to this extremism by switching to another regime which is more extreme (and as a result, this system switches less often, which leads to the higher autocorrelations). Hence, SETAR models may be of interest to cognitive psychologists when the purpose is to model a process in which participants correct their own behavior (Wagenmakers et al., 2004), but also to clinical psychologists, when the purpose is to detect maladaptive mechanisms in, for instance, the regulation of emotions (Hamaker & Grasman et al., in press).
Second, TARSOs (i.e., TAR models with an exogenous threshold variable) can be useful to model regime-switching processes when switches are triggered by some exogenous manifest variable. For instance, in studying behavioral, affective and physical symptoms related to premenstrual syndrome (PMS), one can consider the number of days until the onset of menses, or hormonal blood levels, as the exogenous threshold variable. TARSOs could also be of interest to cognitive researchers to determine whether the effect of a (continuous) independent variable on the performance of a participant is of a gradual nature, or whether it is a step function. This can be done by including the independent variable both as a predictor and as the threshold variable.
Third, TARSCs (i.e., two TAR processes where each variable serves as the other's threshold variable) can be used if two variables respond to changes in each other. This approach has been used for the study of dyads (Gottman et al., 2002; Hamaker & Zhang et al., in press), but it can be used for any bivariate system. For instance, in diary studies it could be used to determine whether an increase in anxiety triggers a switch in negative affect and/or vice versa. In experimental and social psychology TARSCs could prove useful in modeling the behavior of two participants who have to negotiate with one another. In this scenario, the thresholds could represent sudden, large changes in one actor's behavior in response to small changes in the partner's behavior.
3. Estimation of a TAR model

Estimation of a TAR model is typically done using least squares estimation. In this section, the least squares procedure is explained and the relatedness between TAR model estimation and change-point analysis is emphasized. Moreover, the log likelihood function for a TAR model is given, as this is necessary for the information criteria discussed in the next section.
Let {y_t}_{t=1}^{n} be an observed series of length n that was generated by a TAR process. The unknown parameters of the TAR model are the constants φ_0^{(i)} and the autoregressive coefficients φ_j^{(i)} (where j = 1, . . . , p^{(i)}), the residual variances σ^{(i)2}, the thresholds τ_r (where r = 1, . . . , k − 1), and the delay parameter d.
Estimation of the parameters of a TAR model can be done through reformulating the problem so that it becomes a change-point problem, and subsequently minimizing the sum of squared residuals (Chen, 1998; Tsay, 1998). To illustrate this, suppose our observed series is the sequence {6, 7, 5, 6, 9, 8, 5, 3, 4}, and let's assume we want to estimate a SETAR(2,1,1). First, we write the vector with the dependent variable (i.e., y) and the matrix with regressors (i.e., X) as

        | 7 |        | 1  6 |
        | 5 |        | 1  7 |
        | 6 |        | 1  5 |
  y  =  | 9 |,  X =  | 1  6 |.   (5)
        | 8 |        | 1  9 |
        | 5 |        | 1  8 |
        | 3 |        | 1  5 |
        | 4 |        | 1  3 |

Next, we rearrange the observations according to the size of the threshold variable (i.e., y_{t−1}, which is the sequence {6, 7, 5, 6, 9, 8, 5, 3}), such that the first case has the smallest value, while the last one has the largest value, and thus we obtain

        | 4 |         | 1  3 |
        | 6 |         | 1  5 |
        | 3 |         | 1  5 |
  ỹ  =  | 7 |,  X̃ =  | 1  6 |.   (6)
        | 9 |         | 1  6 |
        | 5 |         | 1  7 |
        | 5 |         | 1  8 |
        | 8 |         | 1  9 |

Note that the relationship between y and X is exactly the same as the relationship between ỹ and X̃, the only difference being the order in which the cases are presented.
Following this rearrangement of the data, we can consider different values for the threshold, which divides both ỹ and X̃ into two parts, and perform separate least squares estimation on both parts. For instance, if we consider the threshold value 5, the first 3 cases in ỹ and X̃ fall in the regime below this threshold, and the last five cases fall in the regime above this threshold. For both parts the parameters are estimated separately using least squares, and based on these the total sum of squared residuals can be determined. This is repeated for other threshold values, and the threshold value that minimizes the total sum of squared residuals is considered the least squares estimate for the threshold parameter. Note that each observation y_t is always regressed on the previous observation y_{t−1}, whether these observations fall into the same regime or not. That is, in the example above, 9 is always regressed on 6, whether the threshold is 5 — meaning the two observations fall in the same regime — or 7 — meaning the two observations fall in different regimes. However, the parameters by which y_t is regressed on y_{t−1} depend on the regime the threshold variable (here y_{t−1}) falls in.
Above, y_{t−1} is considered the threshold variable, but the same procedure applies when a different delay is chosen, or when another variable is considered as the threshold variable. If the threshold variable is time t, then the above procedure is actually equivalent to a change-point analysis (with autoregression), which illustrates the strong relationship between TAR estimation and change-point analysis.
In more general terms, estimation of TAR models can be described as follows. After rearranging the data based on the size of the threshold variable and conditional on k − 1 threshold values, which we denote as θ = {τ_1, . . . , τ_{k−1}}, the observations can be divided into the k regimes, such that ỹ^{(i)} is a vector that contains the n^{(i)} observations that fall in the i-th regime, and X̃^{(i)} is the matrix with its corresponding regressors. For all i = 1, . . . , k regimes we have

ỹ^{(i)} = X̃^{(i)} φ^{(i)} + σ^{(i)} ẽ^{(i)},   (7)

where φ^{(i)} = [φ_0^{(i)} φ_1^{(i)} . . . φ_{p^{(i)}}^{(i)}]^T, such that the sum of squared residuals of regime i can be defined as

S^{(i)} = {ỹ^{(i)} − X̃^{(i)} φ^{(i)}}^T {ỹ^{(i)} − X̃^{(i)} φ^{(i)}}.   (8)

The constant and AR parameters of regime i can be estimated by least squares as

φ̂_θ^{(i)} = {X̃^{(i)T} X̃^{(i)}}^{−1} X̃^{(i)T} ỹ^{(i)},   (9)

where φ̂_θ^{(i)} is the vector with least squares estimates of the parameters of regime i. To obtain least squares estimates of both φ^{(i)} and the thresholds τ_1, . . . , τ_{k−1}, the threshold values are varied to find the minimum total sum of squared residuals

S_θ = Σ_{i=1}^{k} S_θ^{(i)}.   (10)

In a similar manner the least squares estimate for the delay parameter d can be obtained. Note that since both y_{t−p^{(i)}} and z_{t−d} are needed for the regression of y_t, the first g = max(d, p^{(i)}) observations are discarded for estimation purposes and the model is run over the observations y_{1+g} to y_n.
Chen and Lee (1995) proposed the conditional likelihood function for a two-regime TAR model. Here we extend this to a k-regime model, such that the log likelihood function of an observed series (conditional on the first g observations) is

log L = −((n − g)/2) log(2π) − (1/2) Σ_{i=1}^{k} n^{(i)} log(σ^{(i)2}) − (1/2) Σ_{i=1}^{k} S_θ^{(i)}/σ^{(i)2},   (11)

where n^{(i)} is the number of observations in regime i. Note that since the maximum likelihood estimate of σ^{(i)2} equals S_θ^{(i)}/n^{(i)}, we can replace σ^{(i)2} in (11) by its maximum likelihood estimate and obtain the concentrated log likelihood

log L_c = −((n − g)/2) log(2π) − (1/2) Σ_{i=1}^{k} n^{(i)} log{S_θ^{(i)}/n^{(i)}} − (1/2)(n − g).   (12)

When σ^{(i)2} = σ^2 — that is, if the residual variance is invariant across regimes — the concentrated log likelihood in (12) reduces to

log L*_c = −((n − g)/2) log(2π) − ((n − g)/2) log{S_θ/(n − g)} − (1/2)(n − g),   (13)

from which it is easy to see that minimizing (10) yields the same estimates as maximizing the concentrated log likelihood function in (13). However, if the residual variances vary across the regimes, only maximizing (12) will yield maximum likelihood estimates.
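The grid search over threshold values described above can be sketched as follows (Python with NumPy; function and variable names are mine, and the minimum-regime-size safeguard anticipates the 10% rule used later in the simulations). This is an illustrative implementation of the conditional least squares idea, not the author's code.

```python
import numpy as np

def fit_setar2(y, d=1, p=1, min_frac=0.1):
    """Conditional least squares for a two-regime SETAR with AR order p in both
    regimes and threshold variable y_{t-d}: grid-search the threshold over the
    observed values of the threshold variable, minimizing the total SSR."""
    y = np.asarray(y, dtype=float)
    g = max(d, p)
    n = len(y)
    Y = y[g:]                                             # responses y_{1+g}, ..., y_n
    X = np.column_stack([np.ones(n - g)] +
                        [y[g - j:n - j] for j in range(1, p + 1)])
    z = y[g - d:n - d]                                    # threshold variable y_{t-d}
    best_ssr, best_tau = np.inf, None
    for tau in np.unique(z):
        low = z <= tau
        n_low = int(low.sum())
        if min(n_low, (n - g) - n_low) < min_frac * n:    # lower bound per regime
            continue
        ssr = 0.0
        for mask in (low, ~low):
            coef = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
            resid = Y[mask] - X[mask] @ coef
            ssr += resid @ resid
        if ssr < best_ssr:
            best_ssr, best_tau = ssr, tau
    return best_ssr, best_tau   # minimum total SSR and estimated threshold
```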
4. Determining the number of regimes with information criteria

Two studies have been published that focus on the issue of determining the number of regimes in a TAR model by using information criteria in moderate to large sample sizes (n = 200, 400, 600, 800) (Gonzalo & Pitarakis, 2002; Strikholm & Teräsvirta, 2006). These studies showed that the BIC performs quite well if the order of the autoregressive processes is low, but the performance deteriorates if the orders increase, unless the sample size is large (n = 800). Peña and Rodriguez (2005) have studied the same problem in small to moderate sample sizes (n = 50, 100, 250), using several statistical tests, the AIC, the BIC, and a small sample corrected AIC. They concluded that both AIC measures tend to overfit, while the BIC tends to underfit in these small sample sizes. However, the small sample corrected AIC they used was developed for linear time series models and nonlinear regression models (Hurvich & Tsai, 1989). Using a small sample corrected AIC which is specifically developed for nonlinear time series models such as TAR models may lead to more favorable results. Wong and Li (1998) and De Gooijer (2001) both use small sample corrected versions of the AIC which were specifically developed for TAR models. However, in both studies the focus was on determining the delay parameter d and the order of the AR processes in each regime, while the number of regimes was fixed at 2. Hence, the performance of these small sample corrected criteria for determining the number of regimes is still unknown.
Therefore, the focus in this study is on comparing standard measures such as the AIC and BIC with a number of alternative information criteria in small to moderate sample sizes. First, we include the two small sample corrected AICs developed by Wong and Li (1998) and De Gooijer (2001). Second, we include the BIC2, an ad hoc measure that was considered in previous studies by Gonzalo and Pitarakis (2002) and Strikholm and Teräsvirta (2006) to determine the effect of the penalty term: its penalty weight (i.e., the value by which the number of free parameters is multiplied to obtain the penalty term) is twice that of the regular BIC. Hence, the BIC2 will prefer simpler models than the BIC. Third, we include the AICcp, which was derived for change-point models by Ninomiya (2005). As indicated in the previous section, there is a strong relationship between TAR modeling and change-point analysis (see also Chen, 1998; Tsay, 1998). In contrast to the other information criteria, the change-point AIC (AICcp) includes a penalty for the change-points, or — in case of TAR models — the thresholds. This penalty is three times more severe than for the regular parameters. Hence, in comparison to the original AIC, the AICcp will tend to select fewer regimes.
In Table 1 an overview of these information criteria is given. While the AIC, the BIC, and the BIC2 used in the studies by Gonzalo and Pitarakis (2002) and Strikholm and Teräsvirta (2006) are based on models in which the residual variance is invariant across the regimes, the AICc, the AICu and the AICcp were developed for models in which the residual variances differ across the regimes. To make these measures comparable, the missing versions are also presented in Table 1. The information criteria consist of −2 times the log likelihood plus a penalty for model complexity.
The log likelihood that is used is either the one presented in (12) — in case of varying residual variances across regimes — or (13) — when the residual variances are identical across regimes. If the residual variances are allowed to vary across regimes, we have p^{(i)} + 2 parameters in the i-th regime — that is, a constant φ_0^{(i)}, p^{(i)} autoregressive parameters φ_1^{(i)} to φ_{p^{(i)}}^{(i)}, and the residual variance σ^{(i)2}. Hence the total number of regular parameters across all regimes is Σ(p^{(i)} + 2). If the residual variance is the same in all regimes, the total number of regular parameters is Σ(p^{(i)} + 1) + 1. In addition there are k − 1 threshold parameters, which are referred to as nonregular parameters, and as indicated above, these parameters are only penalized in the AICcp.

5. Evaluating the performance of information criteria

To evaluate the performance of the information criteria in a simulation study, typically the percentage of correctly selected models is used. However, it has been argued that in practice it is unlikely that the "true" model (i.e., the model that generated the data) is among the fitted models (cf. Burnham & Anderson, 2002), such that correct model recovery may not be the most appropriate measure of performance. This concern is handled in two ways in the current paper. First, in one of the simulation studies the "true" model is not among the fitted models. Then, instead of evaluating how often the correct model is selected, the focus is on how often the best approximating model (among the fitted models) is selected. Second, an out-of-sample prediction performance measure is used, which is described in more detail below.
Given a particular information criterion, the best model based on a calibration data set y = {y_1, . . . , y_n} is selected. Then, 100 independent data sets are generated (using the same model and sample size as was used for generating the calibration data set); these serve as cross-validation data sets. Denoting these cross-validation data sets as y_{CV,r} = {y_{1,CV,r}, . . . , y_{n,CV,r}}, where r = 1, . . . , 100, the mean squared prediction error is determined through

MSPE(y_{CV} | y) = (1/100) Σ_{r=1}^{100} (1/n) Σ_{t=1}^{n} (y_{t,CV,r} − ŷ_{t,CV,r})^2,   (14)

where ŷ_{t,CV,r} is the prediction of y_{t,CV,r} using the selected model and parameter estimates obtained in the calibration data set. For instance, if based on the calibration data set the linear AR(1) model was selected with parameter estimates φ̂_0 and φ̂_1, we have ŷ_{t,CV,r} = φ̂_0 + φ̂_1 y_{t−1,CV,r}. Because for each cell 1000 replications are used, the averaged mean squared prediction error is based on a total of 100 000 cross-validation data sets.2
To facilitate interpretation, this averaged mean squared prediction error, which we denote as σ_PE^2, is corrected for the amount of noise in the data σ^2: the latter is variance that cannot be explained even with the true model. Moreover, through dividing by the amount of variance that can be accounted for by the model (i.e., the total variance of the process σ_y^2 minus the residual variance σ^2), we obtain the proportion of unexplained variance that could have been explained (if the true model was known). Subtracting this from 1 results in the explained variance as a proportion of the total explainable variance, that is,

π = 1 − (σ_PE^2 − σ^2)/(σ_y^2 − σ^2).   (15)

Hence, if π = 1, this implies that σ_PE^2 = σ^2, which means that the only unexplained variance is the residual variance. If π = 0, this means that σ_PE^2 = σ_y^2, which means that all the observed variance is left unexplained. From this it follows that information criteria that select models that have large values for π should be favored over criteria that perform less well with respect to this out-of-sample prediction measure.
2 At first, only 1 cross-validation data set was used per replication, but the results proved to be unstable. Therefore, 100 cross-validation data sets per replication were used.
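Assuming predictions for the cross-validation series are available, (14) and (15) can be computed as in the sketch below (Python with NumPy; the function names are mine).

```python
import numpy as np

def mspe(y_cv, y_hat_cv):
    """Average mean squared prediction error over cross-validation series, as in (14).
    y_cv and y_hat_cv are arrays of shape (R, n): R series of length n."""
    return np.mean(np.mean((y_cv - y_hat_cv) ** 2, axis=1))

def pi_measure(sigma2_pe, sigma2, sigma2_y):
    """Explained variance as a proportion of explainable variance, as in (15)."""
    return 1.0 - (sigma2_pe - sigma2) / (sigma2_y - sigma2)

# Illustration with made-up numbers: if the averaged MSPE equals the residual
# variance, pi = 1; if it equals the total variance of the process, pi = 0.
print(pi_measure(1.0, 1.0, 2.0))  # 1.0
print(pi_measure(2.0, 1.0, 2.0))  # 0.0
```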
Table 1
Information criteria for TAR models.

Equal residual variances across regimes:
  AIC   = (n − g) log{S(θ)/(n − g)} + 2k(p + 1) + 2
  BIC   = (n − g) log{S(θ)/(n − g)} + log(n − g) k(p + 1) + log(n − g)
  BIC2  = (n − g) log{S(θ)/(n − g)} + 2 log(n − g) k(p + 1) + 2 log(n − g)
  AICc  = (n − g) log{S(θ)/(n − g)} + Σ_{i=1}^{k} [2n^{(i)}(p + 1)/(n^{(i)} − p − 2)] + 2(n − g)/(n − g − p − 2)
  AICu  = (n − g) log{S(θ)/(n − g)} + Σ_{i=1}^{k} [2n^{(i)}(p + 1)/(n^{(i)} − p − 2) + n^{(i)} log{n^{(i)}/(n^{(i)} − p − 2)}]
  AICcp = (n − g) log{S(θ)/(n − g)} + 2k(p + 1) + 2 + 6(k − 1)

Unequal residual variances across regimes:
  AIC   = Σ_{i=1}^{k} [n^{(i)} log{S^{(i)}(θ)/n^{(i)}} + 2(p^{(i)} + 2)]
  BIC   = Σ_{i=1}^{k} [n^{(i)} log{S^{(i)}(θ)/n^{(i)}} + log(n^{(i)})(p^{(i)} + 2)]
  BIC2  = Σ_{i=1}^{k} [n^{(i)} log{S^{(i)}(θ)/n^{(i)}} + 2 log(n^{(i)})(p^{(i)} + 2)]
  AICc  = Σ_{i=1}^{k} [n^{(i)} log{S^{(i)}(θ)/n^{(i)}} + 2n^{(i)}(p^{(i)} + 2)/(n^{(i)} − p^{(i)} − 3)]
  AICu  = Σ_{i=1}^{k} [n^{(i)} log{S^{(i)}(θ)/n^{(i)}} + 2n^{(i)}(p^{(i)} + 2)/(n^{(i)} − p^{(i)} − 3) + n^{(i)} log{n^{(i)}/(n^{(i)} − p^{(i)} − 2)}]
  AICcp = Σ_{i=1}^{k} [n^{(i)} log{S^{(i)}(θ)/n^{(i)}} + 2(p^{(i)} + 2)] + 6(k − 1)

NOTE: Six information criteria for TAR models. The first panel contains the information criteria when the residual variances across the regimes are identical; the second panel contains the information criteria when the residual variances differ across regimes.
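As an illustration of how the equal-variance panel of Table 1 translates into computations, the sketch below evaluates the six criteria from a fitted total sum of squared residuals (Python with NumPy; the function name and the example numbers are hypothetical).

```python
import numpy as np

def criteria_equal_var(S_total, n_regime, n, g, p, k):
    """Information criteria from Table 1, equal residual variances across regimes.
    S_total: total SSR; n_regime: list with n^(i) per regime; p: common AR order."""
    m = n - g                                   # effective number of observations
    core = m * np.log(S_total / m)              # -2 log L*_c up to a constant
    n_i = np.asarray(n_regime, dtype=float)
    aic   = core + 2 * k * (p + 1) + 2
    bic   = core + np.log(m) * k * (p + 1) + np.log(m)
    bic2  = core + 2 * np.log(m) * k * (p + 1) + 2 * np.log(m)
    aicc  = core + np.sum(2 * n_i * (p + 1) / (n_i - p - 2)) + 2 * m / (m - p - 2)
    aicu  = core + np.sum(2 * n_i * (p + 1) / (n_i - p - 2)
                          + n_i * np.log(n_i / (n_i - p - 2)))
    aiccp = core + 2 * k * (p + 1) + 2 + 6 * (k - 1)
    return {"AIC": aic, "BIC": bic, "BIC2": bic2,
            "AICc": aicc, "AICu": aicu, "AICcp": aiccp}

# Hypothetical fit of a two-regime SETAR(2,1,1) on n = 100 observations:
print(criteria_equal_var(S_total=85.0, n_regime=[60, 39], n=100, g=1, p=1, k=2))
```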
6. Simulations

Because it is often difficult to obtain large numbers of repeated measures in psychological research (with the exception of physiological data), we are particularly interested in the performance of information criteria in small to moderate sample sizes. We consider sample sizes of 50, 100 and 200 observations. For each model and sample size 1000 replications were simulated. Note that to estimate a SETAR model with first-order AR processes in each regime, at least two observations per regime are needed to estimate the parameters φ_0^{(i)} and φ_1^{(i)}. To avoid a few outliers being erroneously identified as a separate regime in a SETAR model, Strikholm and Teräsvirta (2006) suggested using a lower bound of at least 10% of the total number of observations in each regime. While this may be seen as a (weak) constraint against complex models, it seems reasonable to use more than the absolute minimum of two observations per regime to estimate its parameters. Applying the 10% rule in the current study implies that at least 5, 10 and 20 observations need to fall in each regime for sample sizes 50, 100 and 200, respectively.
In the first simulation study a linear AR model is compared to a SETAR model. The performance of six information criteria is evaluated with respect to the correct model detection rate and the out-of-sample prediction measure π described above. In the second study the performance of these information criteria is determined when the true model (i.e., the model that generated the data) is not among the fitted models. The third simulation study includes higher order AR processes and three-regime SETAR models in line with earlier work by Strikholm and Teräsvirta (2006). In the fourth simulation study we investigate whether the information criteria can be used to determine whether the residual variances are equal across the regimes or not. Unless indicated differently, all simulated models have a residual variance of 1.

6.1. Simulation study 1: Linear AR(1) versus SETAR(2,1,1)

The first study was set up to compare the performance of information criteria in case of a linear AR(1) model and a SETAR(2,1,1), that is, a SETAR model with two regimes which are formed by AR(1) processes. For both models different parameter values are used in simulating the data. Subsequently both models are fit to the simulated data.
Tables 2 and 3 contain the results for the data that were generated with a linear AR(1) model. In Table 2 the percentages of correctly identified models are given. The AIC, BIC and BIC2 results are in agreement with those reported by Strikholm and Teräsvirta (2006) for n = 200. In addition, the results indicate that the AIC and AICc falsely select the two-regime model in more than 50% of the cases, which implies we would be better off flipping a coin than using the AIC or AICc if the true model is indeed a linear AR process.
Table 2
Percentage of correct model detection for a linear AR(1) model when compared to a SETAR(2,1,1) model.

AR parameter   n     AIC    AICc   AICu   AICcp   BIC    BIC2
φ1 = .5        50    22.6   47.0   85.6   89.8    75.3    98.7
               100   19.3   31.8   77.9   90.8    87.5    99.9
               200   20.1   24.6   70.6   88.0    91.2   100.0
φ1 = .7        50    24.0   47.7   83.4   90.0    74.7    99.1
               100   20.0   32.6   76.1   88.3    85.1    99.6
               200   18.4   23.0   70.0   88.8    91.5   100.0
φ1 = .9        50    21.1   46.7   83.8   88.8    75.0    99.1
               100   17.2   28.9   75.5   88.2    84.6    99.9
               200   17.1   21.9   66.4   87.2    90.0   100.0

NOTE: Percentages of correctly selected (linear) model based on 1000 replications. The constant φ0 = 0 and the residual variance σ2 = 1.

Table 3
Proportion of explained variance in out-of-sample predictions for an AR(1) when compared to a SETAR(2,1,1).

AR parameter   n     AIC    AICc   AICu   AICcp   BIC    BIC2
φ1 = .5        50    .352   .539   .720   .733    .639   .824
               100   .607   .751   .855   .893    .881   .931
               200   .867   .881   .924   .948    .953   .970
φ1 = .7        50    .713   .785   .862   .876    .827   .924
               100   .888   .902   .942   .955    .951   .973
               200   .950   .954   .971   .980    .983   .989
φ1 = .9        50    .963   .895   .941   .947    .925   .971
               100   .956   .960   .976   .982    .980   .991
               200   .986   .987   .991   .994    .995   .997

NOTE: Proportion of explainable variance that was accounted for in cross-validation data sets (i.e., π = 1 − (σ_PE^2 − σ^2)/(σ_y^2 − σ^2)). For each cell 1000 calibration sets were used, and for each calibration data set either a linear AR(1) model or a SETAR(2,1,1) model was selected. This selected model was fitted to 100 independent cross-validation data sets, which were generated by the same model as the calibration data set.
In contrast, the AICu, AICcp, BIC, and BIC2 perform well above chance level (percentages between 66.4 and 100). Note also that the performance of the AICc and AICu, which were specially developed for small sample sizes, decreases as sample size increases. In contrast, the performance of the BIC (and BIC2) improves as sample size increases. The AICcp is hardly affected by sample size, and outperforms all three AIC-based measures, as well as the BIC if sample size is 50 or 100.
In Table 3 the explained variance as a proportion of explainable variance is given. For instance, in case of the first model, where φ1 = .5 and σ2 = 1, the total variance of the series is σ_y^2 = σ^2/(1 − φ_1^2) = 1.333. The averaged mean squared prediction error (i.e., σ_PE^2) for the models that were selected with the AIC here (with sample size 50) was 1.216. Hence, using the formula in (15) we obtain 1 − (1.216 − 1)/(1.333 − 1) = .352. This implies that when the AIC is used in this scenario, the proportion of explainable variance that is actually explained is .352.
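The arithmetic of this example can be reproduced directly from (15); a one-line check in Python:

```python
sigma2, sigma2_y, sigma2_pe = 1.0, 1.0 / (1 - 0.5 ** 2), 1.216  # AR(1) with phi1 = .5
print(1 - (sigma2_pe - sigma2) / (sigma2_y - sigma2))           # 0.352 (rounded)
```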
Table 4
Percentage correct model detection for a SETAR(2,1,1) model when compared to a linear AR(1) model.
Columns: n, AIC, AICc, AICu, AICcp, BIC, BIC2.

Model 1: τ1 = 2.5; regime 1: φ0(1) = 1, φ1(1) = .4, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
  50     92.3   81.9   46.8   36.8   55.9    9.0
  100    98.8   97.8   82.8   65.8   72.1   15.7
  200   100.0  100.0   97.6   93.1   91.1   40.6

Model 2: τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = .4, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
  50     88.8   73.0   37.1   29.9   47.7    5.8
  100    97.0   94.8   74.4   55.8   62.0    9.2
  200    99.9   99.7   96.9   90.8   89.6   34.8

Model 3: τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = .8, σ(1)2 = 1; regime 2: φ0(2) = 1, φ1(2) = .4, σ(2)2 = 1
  50     98.4   93.4   69.0   58.5   75.7   23.7
  100   100.0   99.9   97.1   92.6   94.6   52.8
  200   100.0  100.0  100.0   99.9   99.7   91.0

Model 4: τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = .8, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
  50     83.1   59.9   19.5   16.4   30.1    1.6
  100    87.3   78.3   39.0   23.4   28.3    1.6
  200    93.1   91.2   64.9   42.6   37.3    2.4
NOTE: Percentages of correctly selected two-regime SETAR models, with p = 1, based on 1000 replications. The percentages of observations falling in each regime are: 50 in both regimes for the first model; 76 in regime 1 and 24 in regime 2 for the second model; 67 in regime 1 and 33 in regime 2 for the third model; and 41 in regime 1 and 59 in regime 2 for the last model. The probabilities of staying in the same regime are: .82 for the first model; .86 for the second model; .54 for the third model; and .77 for the fourth model.
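The regime percentages and staying probabilities reported in the note can be approximated by simulating a long series and counting, as in the rough sketch below. It reuses the simulate_setar helper from the earlier sketch and assumes the staying probability is the proportion of consecutive observations that fall in the same regime.

```python
import numpy as np

# Model 1 of Table 4: threshold 2.5, common slope .4, intercepts 1 and 2.
y = simulate_setar(100_000, phi0=(1.0, 2.0), phi1=(0.4, 0.4), tau=2.5, seed=1)
regime = (y[:-1] > 2.5).astype(int)      # regime of y_t is driven by y_{t-1}
occupancy = np.bincount(regime, minlength=2) / regime.size
stay = np.mean(regime[1:] == regime[:-1])
print(occupancy)  # roughly [.5, .5], cf. the note
print(stay)       # roughly .8, cf. the note
```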
It can be seen that for all six information criteria, the proportion of explained variance increases as sample size increases. The BIC2 performs best, while the AIC and the AICc perform worst. The AICcp performs equally well or better than the AICu and the BIC, depending on sample size. When comparing the results from Table 2 with those in Table 3, it can be concluded that the out-of-sample performance of the six information criteria is quite similar to the correct model detection rates, in that the AIC and AICc perform worst for all sample sizes, and the BIC2 performs best.
In Tables 4 and 5 the results are given for the data generated with four different SETAR(2,1,1) processes, when comparing a linear AR(1) to a SETAR(2,1,1) model. To determine how differences in parameters affect the model selection results, specific parameters were fixed across the regimes. That is, the first two models have the same autoregressive parameter in both regimes (φ_1^{(i)} = .4), while the intercept differs (φ_0^{(1)} = 1 and φ_0^{(2)} = 2). The difference between these two models is the value of the threshold, which is 2.5 in the first model and 3 in the second model. Hence the first model can be expressed as

y_t = 1 + .4 y_{t−1} + e_t   if y_{t−1} ≤ 2.5
      2 + .4 y_{t−1} + e_t   if y_{t−1} > 2.5,   (16)

and the second model can be expressed as

y_t = 1 + .4 y_{t−1} + e_t   if y_{t−1} ≤ 3
      2 + .4 y_{t−1} + e_t   if y_{t−1} > 3.   (17)

The third model has differing autoregressive parameters (φ_1^{(1)} = .8 and φ_1^{(2)} = .4), while the intercepts are fixed across regimes (φ_0^{(i)} = 1), that is,

y_t = 1 + .8 y_{t−1} + e_t   if y_{t−1} ≤ 3
      1 + .4 y_{t−1} + e_t   if y_{t−1} > 3.   (18)

In the fourth model the regimes differ both with respect to the intercept (φ_0^{(1)} = 1 and φ_0^{(2)} = 2) and the autoregressive parameter (φ_1^{(1)} = .8 and φ_1^{(2)} = .4), that is,

y_t = 1 + .8 y_{t−1} + e_t   if y_{t−1} ≤ 3
      2 + .4 y_{t−1} + e_t   if y_{t−1} > 3.   (19)
Because the fourth model is the only model for which both the intercept and the autoregressive parameter differ across the two regimes, it was expected that this model would be most easily recognized as a two-regime model. However, the results in Table 4 show quite the opposite: in comparison to the other models, the last model was most often erroneously identified as a linear AR model. A possible explanation for low detection rates was given by Strikholm and Teräsvirta (2006), who indicated that some combinations of parameter values lead to very few observations in one of the regimes. If no observations were made in a particular regime, it implies that the observations actually come from a linear model and selecting the linear model should be considered the correct choice. To see whether this could explain the counterintuitive results in Table 4, we determined the number of replications for which less than 10% of the observations fell in one of the two regimes (as 10% was set as the minimum number of observations per regime in the estimation). For the first model, when n = 50, 11 out of the 1000 replications had fewer than 10% of the observations in one of the regimes; when n increased there were no replications that failed to meet this lower bound. For the second model things were worse: when n = 50, 151 replications had less than 10% of the observations in one of the regimes; when n = 100 this decreased to 74 cases, and when n = 200 it became 9. For the third model (for which the criteria are performing best) and the fourth model (for which the criteria are performing worst), there were no replications with less than 10% of the observations in one of the regimes. Hence, this cannot be the explanation for the results reported in Table 4.
Inspecting the state-space plots of the four models presented in Fig. 2 is illuminating, however. These plots contain the data points {y_t, y_{t−1}} of series of 10,000 occasions that were simulated for each of the four models. The regression lines based on φ_0^{(i)} and φ_1^{(i)} are depicted in gray. From this it becomes clear that, although the intercepts and slopes in the fourth model differ across regimes, the relationship between y_t and y_{t−1} can be described quite well with a single straight regression line, and this is why the two-regime model is difficult to detect in this case. Further proof for this comes from raising the threshold from 3 to 4, while leaving the other parameters unchanged: this leads to a model for which the two regression lines are not as easily replaced by a single regression line, and as a result the detection rates go up to 37% for the AICu, 32% for the AICcp, 49% for the BIC and 7% for the BIC2 when sample size is 50.
Comparing the performance of the six information criteria, while temporarily ignoring the results for the last model in Table 4, it can be concluded that the AIC and AICc perform quite well here. This may come as no surprise, as it was already shown that these measures select the more complex SETAR model over the true AR model in more than half of the cases.
Fig. 2. State-space plots of the four models presented in Table 4 (Model 1 to Model 4, with y(t) plotted against y(t−1)). Each plot contains the data points {y_t, y_{t−1}} of a series of 10,000 occasions. Regression lines based on φ_0^{(i)} and φ_1^{(i)} are presented in gray.
In contrast, the BIC2 has a strong preference for the simpler (erroneous) linear model, although less so in the third model, which was most easily detected as a SETAR model by all the criteria. The other three measures have trouble detecting the true SETAR model when the sample size is 50, with percentages sometimes below 50. However, this improves as the sample size increases to 100, when the AICu, AICcp, BIC, and BIC2 all perform above chance level (with the exception of the fourth model). Moreover, the performance of all six criteria increases as sample size increases.
In Table 5 the proportions of explained variance are given. From this it can be seen that when the sample size is 200, the information criteria perform about equally well, except for the BIC2, which performs slightly worse than the other measures. In sample sizes 50 and 100, the AICcp outperforms the regular AIC, and typically performs better than the BIC. Moreover, the AICu outperforms most of the other measures most of the time. Combining the results from Tables 4 and 5 leads to an interesting finding: criteria that tend to select the correct nonlinear model more often (e.g., the AIC and the AICc) tend to perform worse with respect to the out-of-sample prediction. This is especially true in small samples (n = 50), and in the fourth model, which due to its specific parameter values is well approximated with a linear AR(1) model as discussed above. This indicates that selecting the correct two-regime model does not necessarily lead to the best out-of-sample predictions. This is in agreement with results reported earlier by Clements and Krolzig (1998), who indicated that having a nonlinear model that characterizes the features of the data will not necessarily lead to better predictions than a simpler linear model, especially when sample size is small. More generally, since the SETAR model has more parameters than the linear AR model, its parameter estimates will be less reliable, which is likely to affect the quality of the predictions made with them.

6.2. Simulation study 2: Linear AR(1) with outliers

To further investigate the issue of selecting a "good approximating" model rather than the true model, data were simulated according to a model that was not fitted to the data. To this end a linear AR(1) process was generated with mean 0, φ1 = .5, and residual variance 1. However, for a certain percentage of the observations (i.e., 20% or 40%), an extra error term was added, which was randomly drawn from a normal distribution with a mean of either 1 or 2, and variance 1. The rationale for this simulation is that outliers may be thought of as observations coming from a different regime. However, since the outliers occur at random occasions instead of being predictable from the observation at the previous occasion (as is the case in a SETAR model), the linear AR(1) model can be considered the best approximating model when comparing it to the SETAR(2,1,1) model.
The results in Table 6 are quite similar to the results for the corresponding model in Table 2 without outliers. The AIC and AICc have a tendency to overfit and their performance gets worse as sample size increases. The AICu performs well in small samples, but its performance also decreases as sample size increases.
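A sketch of the data-generating scheme used in this second study is given below (Python with NumPy; it assumes the outlier time points are drawn independently with probability η, which is one way to realize "a certain percentage of the observations").

```python
import numpy as np

def simulate_ar1_with_outliers(n, phi0=0.0, phi1=0.5, eta=0.2, gamma=1.0, seed=0):
    """Linear AR(1) with additive outliers, as in simulation study 2:
    with probability eta an extra N(gamma, 1) error is added to the observation."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi0 + phi1 * y[t - 1] + rng.standard_normal()
        if rng.random() < eta:                      # outlier occurs at a random occasion
            y[t] += rng.normal(loc=gamma, scale=1.0)
    return y

y = simulate_ar1_with_outliers(100, eta=0.4, gamma=2.0)  # one of the studied conditions
```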
Table 5
Proportion of explained variance in out-of-sample predictions for a SETAR(2,1,1) when compared to an AR(1).
Columns: n, AIC, AICc, AICu, AICcp, BIC, BIC2.

Model 1: τ1 = 2.5; regime 1: φ0(1) = 1, φ1(1) = .4, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
  50    .726   .784   .806   .794   .759   .839
  100   .883   .898   .898   .892   .893   .892
  200   .955   .955   .955   .952   .952   .926

Model 2: τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = .4, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
  50    .623   .689   .747   .721   .688   .797
  100   .851   .866   .863   .856   .857   .860
  200   .939   .941   .940   .935   .934   .906

Model 3: τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = .8, σ(1)2 = 1; regime 2: φ0(2) = 1, φ1(2) = .4, σ(2)2 = 1
  50    .119   .265   .239   .123   .135   .139
  100   .662   .673   .666   .638   .646   .483
  200   .848   .849   .849   .848   .847   .805

Model 4: τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = .8, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
  50    .333   .526   .675   .666   .592   .768
  100   .724   .752   .802   .822   .815   .863
  200   .876   .883   .888   .891   .893   .905

NOTE: Proportion of explainable variance that was accounted for in cross-validation data sets (i.e., π = 1 − (σ_PE^2 − σ^2)/(σ_y^2 − σ^2)). For each cell 1000 calibration sets were used, and for each calibration data set either a linear AR(1) model or a SETAR(2,1,1) model was selected. This selected model was fitted to 100 independent cross-validation data sets, which were generated by the same model as the calibration data set.
Table 6
Percentage of times the linear AR(1) model is selected, compared to a SETAR(2,1,1) model, for a linear AR(1) process with outliers.

η     γ    n     AIC    AICc   AICu   AICcp   BIC    BIC2
.2    1    50    21.1   43.3   79.5   85.4    69.1   98.8
           100   17.0   27.7   67.5   82.0    77.8   98.9
           200   12.9   15.9   54.8   76.9    80.1   99.4
.2    2    50    16.2   39.1   76.8   81.9    64.4   97.2
           100   12.0   20.5   61.4   76.6    71.2   98.6
           200    7.3    9.8   36.9   62.1    65.4   98.0
.4    1    50    21.1   47.9   82.9   87.5    72.9   98.6
           100   17.3   27.6   70.3   84.7    80.6   99.3
           200   13.5   17.8   59.1   80.4    83.7   99.4
.4    2    50    18.8   40.5   79.8   85.6    70.7   98.5
           100   16.3   26.1   64.0   81.2    75.4   99.3
           200   10.9   14.0   50.4   73.4    77.4   99.3

NOTE: Percentages of times the linear AR(1) model is selected in 1000 replications when compared to a SETAR(2,1,1). The constant φ0 = 0, the autoregressive parameter φ1 = .5, and the residual variance σ2 = 1. The frequency with which the additive outliers occur varies (i.e., η = .2, .4), as does the average size of the outliers (i.e., γ = 1, 2). The variance of the outliers is equal to the variance of the AR process (i.e., it is 1).
The AICcp outperforms the other AIC-based measures, but in contrast to the linear AR model without outliers, its performance now deteriorates as sample size increases. The BIC does not perform as well as the AICu and AICcp in small sample sizes (n = 50), but its performance increases as sample size increases, and at n = 200 it performs slightly better than the AICcp. Again, due to its stringent penalty, the BIC2 outperforms the other measures in all sample sizes. When the average size of the outliers increases (i.e., γ = 2 versus γ = 1), the process is more often mistaken for a SETAR model. Moreover, as the proportion of outliers increases (i.e., η = .4 versus η = .2), all information criteria, except the AIC, have a higher detection rate. Hence, if outliers occur more frequently, the process is less easily mistaken for a SETAR model, probably because it becomes easier to detect that the outliers occur at random, rather than as a function of the threshold variable.

6.3. Simulation study 3: SETAR(2,2,2)

The third simulation study was performed to see whether the results for a SETAR(2,2,2) model with sample size 200 reported by Strikholm and Teräsvirta (2006) could be replicated, and to complement these results with different information criteria and
smaller sample sizes. The SETAR(2,2,2) model that was used to generate the data is

y_t = −3 + .5 y_{t−1} − .9 y_{t−2} + e_t   if y_{t−2} ≤ 1.5
       2 + .3 y_{t−1} + .2 y_{t−2} + e_t   if y_{t−2} > 1.5.   (20)

In addition, the aim here is to determine whether the results can be generalized to a SETAR(2,2,1) model, that is, a SETAR model with two regimes, with an AR(2) process in the first regime and an AR(1) process in the second regime. This second model can be expressed as

y_t = 1 + 1.4 y_{t−1} − .5 y_{t−2} + e_t   if y_{t−2} ≤ 3
      2 + .4 y_{t−1} + e_t                 if y_{t−2} > 3.   (21)
Following Strikholm and Teräsvirta (2006), three models were fitted to the data: the linear AR(2) model, a SETAR(2,2,2), and a SETAR(3,2,2,2), that is, a three-regime SETAR model with AR(2) processes in each of the three regimes. Note that for the second model, the true SETAR(2,2,1) model is not among the fitted models. However, one could claim that the SETAR(2,2,1) model is a special case of the SETAR(2,2,2) model: the former can be obtained from the latter by fixing φ_2^{(2)} = 0. Thus, we consider the SETAR(2,2,2) the correct (or best approximating) model here.
The results are presented in Table 7. The results for the first model when n = 200 are in agreement with those reported by Strikholm and Teräsvirta (2006). Moreover, the results for the SETAR(2,2,1) model are quite similar to the results for the SETAR(2,2,2) model. Again, the AIC tends to overfit for all sample sizes, as does the AICc for sample sizes 100 and 200. When the sample size is 50 the AICu outperforms the other measures, but as sample size increases the BIC and BIC2 outperform the other measures. The AICcp does not perform as well as the AICu in small samples, but outperforms all the other AIC-based measures for all (other) sample sizes.

6.4. Simulation study 4: SETAR(2,1,1) with different residual variances

In this final simulation study the goal is to determine whether the information criteria can differentiate between a linear AR process, a SETAR(2,1,1) process with equal residual variances, and a SETAR(2,1,1) process with different residual variances across the two regimes. A linear AR(1) model was simulated with φ0 = 1, φ1 = .7, and residual variance σ2 = 1. The first SETAR model (with equal residual variances) that was simulated was the same as the first SETAR model in simulation study 1; it is given in (22) below.
Table 7
Model selection percentages for a SETAR(2,2,2) and a SETAR(2,2,1) model.
Columns: k, AIC, AICc, AICu, AICcp, BIC, BIC2.

Model 1 (SETAR(2,2,2)): τ1 = 1.5; regime 1: φ0(1) = −3, φ1(1) = .5, φ2(1) = −.9, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .3, φ2(2) = .2, σ(2)2 = 1
  n = 50    k = 1    2.2   18.0   27.6   23.3   23.0   30.8
            k = 2   18.0   60.7   69.6   62.4   61.6   69.2
            k = 3   79.8   21.3    2.8   14.3   15.4     .0
  n = 100   k = 1    1.1    4.2   11.2   10.9   12.4   16.8
            k = 2   24.5   48.3   79.6   78.4   82.9   83.2
            k = 3   74.4   47.5    9.2   10.7    4.7     .0
  n = 200   k = 1     .7     .9    3.8    3.9    4.1    5.4
            k = 2   25.3   39.7   79.8   83.0   92.2   94.6
            k = 3   74.0   59.4   16.4   13.1    3.7     .0

Model 2 (SETAR(2,2,1)): τ1 = 3; regime 1: φ0(1) = 1, φ1(1) = 1.4, φ2(1) = −.5, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, φ2(2) = .0, σ(2)2 = 1
  n = 50    k = 1     .3    6.1   24.3   15.9   14.9   73.6
            k = 2    7.8   63.1   70.5   59.9   58.9   26.2
            k = 3   91.9   30.8    5.2   24.2   26.2     .2
  n = 100   k = 1     .0     .2     .9     .8    2.0   32.5
            k = 2   12.6   40.5   82.2   79.0   86.7   67.4
            k = 3   87.4   59.3   16.9   20.2   11.3     .1
  n = 200   k = 1     .0     .0     .0     .0     .0    1.6
            k = 2   13.0   25.6   76.0   80.5   95.5   98.4
            k = 3   87.0   74.4   24.0   19.5    4.5     .0

NOTE: Simulation results for two-regime SETAR models when fitting a linear AR(2) model (k = 1), a SETAR(2,2,2) model (k = 2), and a SETAR(3,2,2,2) model (k = 3); the correct choice is k = 2 for both generating models. The first model is a SETAR(2,2,2), where in the long run 53% of the observations fall in the first regime and 47% in the second. Its probability of remaining in regime 1 is .88, and for regime 2 it is .86. The second model is a SETAR(2,2,1) where in the long run 30% of the observations fall in the first regime and 70% in the second. Its probability of staying in regime 1 is .53 and for regime 2 it is .80.
y_t = 1 + .4 y_{t−1} + e_t   if y_{t−1} ≤ 2.5
      2 + .4 y_{t−1} + e_t   if y_{t−1} > 2.5.   (22)

The second SETAR model had the same intercepts and autoregressive parameters, but the residual variance in the first regime was 1 and in the second regime it was 1.5, that is,

y_t = 1 + .4 y_{t−1} + e_t        if y_{t−1} ≤ 2.5
      2 + .4 y_{t−1} + √1.5 e_t   if y_{t−1} > 2.5.   (23)
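The unequal-variance model in (23) can be simulated with the simulate_setar helper from the earlier sketch by giving the second regime a residual standard deviation of √1.5 (an illustrative call, not the author's code).

```python
import numpy as np

# Model (23): same regimes as (22), but a residual variance of 1.5 in regime 2.
y_unequal = simulate_setar(100, phi0=(1.0, 2.0), phi1=(0.4, 0.4), tau=2.5,
                           sigma=(1.0, np.sqrt(1.5)), seed=2)
```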
Since three models are fitted to the data (i.e., a linear AR(1), a SETAR(2,1,1) with a common residual variance, and a SETAR(2,1,1) with different residual variances), by chance one third of the selected models should be correct. From the results in Table 8 it can be seen that in general a sample size of 50 is not enough for any of the information criteria to differentiate well between the two versions of the SETAR model. The AICc performs best when looking at all three models, with correct model detection percentages of 43.5 for the linear model, 70.9 for the first SETAR model and 45.4 for the second SETAR model. The other measures have one scenario in which they perform around chance level (i.e., 33.3%) or much lower.
Focusing on the results for sample sizes 100 and 200, it is clear that the AIC and the AICc tend to overfit when the true model is the linear AR(1) model. That is, in less than one third of the cases the correct AR(1) model is selected, meaning these measures are doing worse than what could be expected by chance here. Although these measures perform quite well when the true model is one of the more complex SETAR models, their poor performance under the linear model makes them inappropriate for model selection in the current context. The AICcp and BIC perform about equally well for sample size 100 in case of the linear AR(1) model. For sample size 200, the BIC performs slightly better than the AICcp in this case. Looking at the SETAR models it can be seen that for sample sizes 100 and 200 the BIC more often selects the correct model when there are equal residual variances, while the AICcp more often selects the correct model when there are unequal residual variances. The BIC2 does not distinguish well between the latter two models, and has a serious tendency to underfit, although this improves somewhat when sample size increases to 200. While the AICu performs well in comparison to both the AICcp and the BIC when the true model is one of the SETAR models and the sample size is 100 or 200, it does not perform as well when the true model is a linear AR(1) model. Hence, choosing between the AICu on the one hand, and the AICcp and BIC on the other, depends on whether one finds overfitting or underfitting the more serious problem. Given the convention of a probability of .05 for Type I errors (i.e., overfitting) and a probability of .20 for Type II errors (i.e., underfitting; this is based on a power of .80, which is typically desired), it becomes clear that in general overfitting is considered a bigger problem than underfitting: in that case the AICcp or BIC should be preferred.

7. Discussion

SETAR models are a class of time series models that can be of use in psychological research that focuses on processes. Since many psychological processes are characterized by some kind of switching between different states, regime-switching models such as the SETAR model can be of great importance to gain further insight into the underlying mechanisms that drive such processes. An important issue in SETAR modeling is how to determine the number of regimes. One way of determining this is by use of information criteria. However, it is not clear which information criterion is most appropriate for this purpose, specifically in small to moderate sample sizes (i.e., n = 50 to 200).
In the current study the performance of six information criteria was compared in the context of determining the number of regimes in SETAR models. This study complements existing studies by Gonzalo and Pitarakis (2002) and Strikholm and Teräsvirta (2006), who considered moderate to large sample sizes, and the work by Peña and Rodriguez (2005), who focused on small to moderate sample sizes, but only considered the BIC, AIC and a small sample corrected AIC that was not specifically developed for SETAR models. In the current study, next to the AIC and BIC, two small sample corrected versions of the AIC — which were specifically developed for SETAR modeling — were included. In addition, the BIC2 was included, a measure that was also used by Gonzalo and Pitarakis (2002) and Strikholm and Teräsvirta (2006). Moreover, since there is a strong relationship between the SETAR model and change-point models, an AIC developed for the latter was also included. This AICcp is the only information criterion that includes a penalty for the unknown threshold parameter(s).
Table 8
Model selection percentages for a linear AR(1), a SETAR(2,1,1) with equal residual variances, and a SETAR(2,1,1) with unequal residual variances.

Data generating models (also the three fitted candidate models a, b, and c):
a. Linear AR(1): φ0 = 1, φ1 = .7, σ2 = 1
b. SETAR(2,1,1), equal residual variances: threshold τ1 = 2.5; regime 1: φ0(1) = 1, φ1(1) = .4, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1
c. SETAR(2,1,1), unequal residual variances: threshold τ1 = 2.5; regime 1: φ0(1) = 1, φ1(1) = .4, σ(1)2 = 1; regime 2: φ0(2) = 2, φ1(2) = .4, σ(2)2 = 1.5

Data generated by model a (linear AR(1)); correct selection is model a:
n     Model   AIC     AICc    AICu    AICcp   BIC     BIC2
50    a       13.3    43.5    82.0    78.5    68.9    99.1
      b       40.8    43.6    12.9     6.3    17.4      .5
      c       45.9    12.9     5.1    15.2    13.7      .4
100   a       13.4    24.9    68.3    81.9    83.3    99.8
      b       47.4    50.1    17.8     6.8    11.7      .2
      c       39.2    25.0    13.9    11.3     5.0      .0
200   a       11.6    15.0    62.0    81.8    89.6   100.0
      b       54.5    57.2    23.6     9.3     8.7      .0
      c       33.9    27.8    14.4     8.9     1.7      .0

Data generated by model b (SETAR(2,1,1), equal residual variances); correct selection is model b:
n     Model   AIC     AICc    AICu    AICcp   BIC     BIC2
50    a        3.1    15.2    49.6    52.8    37.0    90.8
      b       56.7    70.9    40.6    25.0    45.6     7.7
      c       40.2    13.9     9.8    22.2    17.4     1.5
100   a         .5     2.3    15.3    30.2    27.0    85.4
      b       75.1    81.2    70.0    52.9    67.3    14.5
      c       24.4    16.5    14.7    16.9     5.7      .1
200   a         .0      .0     1.4     5.4     7.4    59.2
      b       80.5    83.9    82.6    76.0    88.6    40.7
      c       19.5    16.1    16.0    18.6     4.0      .1

Data generated by model c (SETAR(2,1,1), unequal residual variances); correct selection is model c:
n     Model   AIC     AICc    AICu    AICcp   BIC     BIC2
50    a        3.1    13.1    40.7    40.6    31.4    89.7
      b       29.4    41.5    25.0    13.5    30.4     7.0
      c       67.5    45.4    34.3    45.9    38.2     3.3
100   a         .4     1.1     8.0    16.8    21.8    83.5
      b       14.5    19.3    15.6     9.0    22.9     8.0
      c       85.1    79.6    76.4    74.2    55.3     8.5
200   a         .0      .0      .0      .5     3.6    48.9
      b        1.2     1.4     1.4     1.1     7.4     9.0
      c       98.8    98.6    98.6    98.4    89.0    42.1
NOTE: Simulation results for a linear AR(1) model, a two-regime SETAR model with an invariant residual variance, and a two-regime SETAR model with varying residual variances. All three models were fitted to the data. Entries are the percentages of replications in which each model was selected; the correct model for each panel is indicated in the panel heading.
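As an illustration of the kind of data summarized in Table 8, the following Python sketch generates a series from the SETAR(2,1,1) process with unequal residual variances (model c): threshold 2.5, intercepts 1 and 2, an AR coefficient of .4 in both regimes, and residual variances 1 and 1.5. The burn-in length, random seed, and function name are choices of this sketch, not details taken from the study.

import numpy as np

def simulate_setar211(n, tau=2.5, phi0=(1.0, 2.0), phi1=(0.4, 0.4),
                      sigma2=(1.0, 1.5), burn_in=100, seed=0):
    """Simulate a SETAR(2,1,1) series: the regime at time t is determined by
    whether the previous observation lies below or above the threshold tau."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        r = 0 if y[t - 1] <= tau else 1          # regime indicator (delay 1)
        e = rng.normal(0.0, np.sqrt(sigma2[r]))  # regime-specific residual
        y[t] = phi0[r] + phi1[r] * y[t - 1] + e
    return y[burn_in:]                           # discard burn-in observations

y = simulate_setar211(n=200)
print(round(float((y > 2.5).mean()), 2))  # rough share of observations above the threshold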
While the custom of not penalizing the threshold parameters in SETAR modeling may seem awkward, it is clear that these are not regular parameters and should not be treated as such. In the derivation of both the AIC and the BIC, second-order Taylor approximations of the log likelihood function are used, which implies that the log likelihood needs to be twice differentiable with respect to the unknown parameters (cf. Burnham & Anderson, 2002, pp. 365–368; Raftery, 1995). This does not hold for the threshold parameters in a TAR model (or for the change-points in a change-point model), such that the penalty terms in the AIC and the BIC, which are derived for models with regular parameters only, do not apply here. The AIC derived by Ninomiya (2005) for change-point models results in a penalty term for change-points (which are analogous to threshold parameters) that is three times as large as the penalty term for regular parameters. This derivation is not directly applicable to SETAR models, because the change-point model Ninomiya considered did not include predictors (which, in the case of SETAR models, would be the previous observations on which the current observation is regressed). However, the strong resemblance between change-point models and TAR models in general was considered an appropriate argument for including this measure here. When evaluating the simulation results, it can be concluded that (with the exception of the AICu in the smallest sample size) the AICcp outperformed the other AIC-based measures, indicating that the threshold parameters should be penalized, and that the penalty derived by Ninomiya (2005) seems applicable.
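Written out schematically, with ln L the maximized log likelihood, k the number of regular parameters, and m the number of thresholds treated as change-points (notation introduced here for illustration; the exact definition of the AICcp used in the simulations is given earlier in the paper), the penalty structure described above amounts to

AIC = -2 ln L + 2k,    AICcp = -2 ln L + 2(k + 3m),

so that each threshold contributes a penalty of 6, three times the 2 contributed by a regular parameter.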
As indicated above, an important issue in simulation studies such as these is what should be considered good performance of an information criterion. In line with previous studies regarding SETAR models, the correct model detection rate was considered the main measure of performance in the current study. In addition, an out-of-sample performance measure was included in some of the simulations in this study, and in one of the simulation studies the true model (i.e., the data generating function) was not among the fitted models.

In sum, with respect to correct model detection rates, the AICcp outperformed the regular AIC as well as the AICc in most situations, the latter two having a serious tendency to overfit, which makes them inappropriate for the current purpose. In samples of size 50 the AICu outperformed most other measures most of the time, while the BIC did so when the sample size increased to 200. The BIC2 has a serious tendency to underfit. With respect to out-of-sample performance, it appears that in small samples a linear model leads to better predictions than a nonlinear model, even when the data generating model is a nonlinear SETAR model. As a result, the more conservative measures such as the AICcp and BIC outperform the measures that prefer more complex models (i.e., the AIC and AICc). However, the most conservative measure in the current study, the BIC2, has such a strong tendency to underfit that, as sample size increases, it does not perform as well as the other criteria on the out-of-sample-range measure. Overall, it can be concluded that in small samples the AICu is an appropriate measure to use, while in moderate samples the AICcp or BIC should be preferred.

References

Al-Alwadhi, S., & Jolliffe, I. (1998). Time series modelling of surface pressure data. International Journal of Climatology, 18, 443–455.
Amendola, A., Niglio, M., & Vitale, C. (2006). Multi-step SETARMA predictors in the analysis of hydrological time series. Physics and Chemistry of the Earth, 31, 1118–1126.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed., DSM-IV). Washington, DC: APA.
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco, CA: Holden-Day.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. New York, NY: Springer.
Caner, M., & Hansen, B. E. (2001). Threshold autoregression with a unit root. Econometrica, 69, 1555–1596.
Chan, K. S., & Tong, H. (1990). On likelihood ratio tests for threshold autoregression. Journal of the Royal Statistical Society, B, 52, 469–476.
Chan, W.-S., & Ng, M.-W. (2004). Robustness of alternative non-linearity tests for SETAR models. Journal of Forecasting, 23, 215–231.
Chen, C. W. S. (1998). A Bayesian analysis of generalized threshold autoregressive models. Statistics and Probability Letters, 40, 15–22.
Chen, C. W. S., & Lee, J. C. (1995). Bayesian inference of threshold autoregressive models. Journal of Time Series Analysis, 16, 483–492.
Clements, M. P., & Krolzig, H.-M. (1998). A comparison of the forecast performance of Markov-switching and threshold autoregressive models of US GNP. Econometrics Journal, 1, C47–C75.
De Gooijer, J. G. (2001). Cross-validation criteria for SETAR model selection. Journal of Time Series Analysis, 22, 267–281.
Dolan, C. V., Jansen, B. R. J., & Van der Maas, H. L. J. (2004). Constrained and unconstrained multivariate normal finite mixture modeling of Piagetian data. Multivariate Behavioral Research, 36, 69–98.
Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987–1007.
Fan, J., & Yao, Q. (2003). Nonlinear time series: Nonparametric and parametric methods. New York, NY: Springer-Verlag.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer Series in Statistics. New York, NY: Springer.
Gonzalo, J., & Pitarakis, J.-Y. (2002). Estimation and model selection based inference in single and multiple threshold models. Journal of Econometrics, 110, 319–352.
Gottman, J. M., Murray, J. D., Swanson, C. C., Tyson, R., & Swanson, K. R. (2002). The mathematics of marriage: Dynamic nonlinear models. Cambridge, MA: MIT Press.
Granger, C. W. J., & Andersen, A. P. (1978). An introduction to bilinear time series models. Göttingen: Vandenhoeck und Ruprecht.
Hamaker, E. L., Grasman, R. P. P., & Kamphuis, J.-H. (in press). Regime-switching models to study psychological processes. In P. M. C. Molenaar & K. Newell (Eds.), Individual pathways of change in learning and development. American Psychological Association.
Hamaker, E. L., Zhang, Z., & Van der Maas, H. L. J. (2009). Using threshold autoregressive models to study dyadic interactions. Psychometrika, in press (doi:10.1007/s11336-009-9113-4).
Hamilton, J. D. (1994). Time series analysis. Princeton, NJ: Princeton University Press.
Hansen, B. E. (1999). Testing for linearity. Journal of Economic Surveys, 13, 551–576.
Hurvich, C. M., & Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
Koop, G., Pesaran, M. H., & Potter, S. N. (1996). Impulse response analysis in nonlinear multivariate models. Journal of Econometrics, 74, 119–147.
Narayan, P. K. (2007). Are nominal exchange rates and price levels co-integrated? New evidence from threshold autoregressive and momentum-threshold autoregressive models. The Economic Record, 83, 74–85.
Ninomiya, Y. (2005). Information criterion for Gaussian change-point model. Statistics and Probability Letters, 72, 237–247.
Peña, D., & Rodriguez, J. (2005). Detecting nonlinearity in time series by model selection criteria. International Journal of Forecasting, 21, 731–748.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25.
Schmittmann, V. D., Dolan, C. V., & Van der Maas, H. L. J. (2005). Discrete latent Markov models for normally distributed response data. Multivariate Behavioral Research, 40, 461–488.
Stenseth, N. C., Falck, W., Chan, K.-S., Bjørnstad, O. N., O'Donoghue, M., Tong, H., et al. (1998). From patterns to processes: Phase and density dependencies in the Canadian lynx cycle. Proceedings of the National Academy of Sciences: Ecology, 95, 15430–15435.
Strikholm, B., & Teräsvirta, T. (2006). A sequential procedure for determining the number of regimes in a threshold autoregressive model. Econometrics Journal, 9, 472–491.
Tong, H. (1983). Threshold models in non-linear time series analysis. New York, NY: Springer-Verlag.
Tong, H. (2003). Non-linear time series: A dynamic system approach. Oxford: Oxford Science Publications.
Tong, H., & Lim, K. S. (1980). Threshold autoregression, limit cycles and cyclical data. Journal of the Royal Statistical Society, B, 42, 245–292.
Tsay, R. S. (1998). Testing and modeling multivariate threshold models. Journal of the American Statistical Association, 93, 1188–1202.
Van der Maas, H. L. J., & Molenaar, P. C. M. (1992). Stagewise cognitive development: An application of catastrophe theory. Psychological Review, 99, 395–417.
Wagenmakers, E.-J., Farrell, S., & Ratcliff, R. (2004). Estimation and interpretation of 1/f^α noise in human cognition. Psychonomic Bulletin and Review, 11, 579–615.
Warren, K. (2002). Thresholds and the abstinence violation effect: A nonlinear dynamic model of the behaviors of intellectually disabled sex offenders. Journal of Interpersonal Violence, 17, 1198–1217.
Warren, K., Hawkins, R. C., & Sprott, J. C. (2003). Substance abuse as a dynamical disease: Evidence and clinical implications of nonlinearity in a time series of daily alcohol consumption. Addictive Behaviors, 28, 369–374.
Wong, C. S., & Li, W. K. (1998). A note on the corrected Akaike information criterion for threshold autoregressive models. Journal of Time Series Analysis, 19, 113–124.