Computational Statistics and Data Analysis 143 (2020) 106840
Bayesian variable selection in non-homogeneous hidden Markov models through an evolutionary Monte Carlo method

Luigi Spezia

Biomathematics & Statistics Scotland, Craigiebuckler, Aberdeen, AB15 8QH, United Kingdom
E-mail address: [email protected]

Article history: Received 28 September 2018; Received in revised form 21 August 2019; Accepted 14 September 2019; Available online 23 September 2019

Keywords: Time-varying transition probabilities; State-dependent multivariate Normal observed process; Trigonometric separation strategy; Path sampling; Population MCMC; Ozone dynamics

Abstract

Hidden Markov models (HMMs) are dynamic mixture models applied to time series in order to classify the observations into a small number of homogeneous groups, to understand when change points occur, and to model data heterogeneity through the switching between subseries with different state-dependent parameters. In the most general case, HMMs have an unobserved Markov chain whose transition probabilities are time-varying and dependent on exogenous variables through multinomial logit functions. When many covariates are available, it is worthwhile selecting the subsets of variables which might most affect each row of the transition matrices. A Bayesian method for the stochastic selection of subsets of covariates is proposed by developing a novel evolutionary Monte Carlo algorithm. The methodology is illustrated and shown to be effective by performing experiments and comparisons on both synthetic data sets and a real multivariate time series with covariates.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Hidden Markov models (HMMs) are pairs of stochastic processes, one observed and one unobserved, or hidden, the latter affecting the dynamics of the observed one. The hidden process is modelled as a finite-state Markov chain (MC), whereas the observed process, given the MC, is a sequence of conditionally independent random variables, whose conditional distribution depends on the MC only through its contemporary state. Thus, HMMs are the dynamic version of finite mixture models and useful tools to classify time-correlated observations into a finite number of states. In the most general case, the MC is non-homogeneous (NH), that is, the transition probabilities are time-varying and dependent on exogenous variables, giving rise to NH-HMMs. Diebolt et al. (1994) were the first to study these models, with a two-state MC whose transition probabilities depended on deterministic exogenous variables through a logistic function. Then, Raymond and Rich (1997) linked the transition probabilities to the covariates through binary probit models. Hughes et al. (1999) developed a HMM with a non-homogeneous Markov chain (NH-MC), whose transition probabilities depended on multivariate Normal exogenous variables, to analyse precipitation occurrence at different sites. That model was modified by Charles et al. (1999), Bellone et al. (2000), and Neykov et al. (2012) to include precipitation amounts, by Gelati et al. (2010) to generate runoff scenarios, and by Ailliot et al. (2015) for wind time series using von Mises exogenous variables. The Bayesian estimation of a NH-MC with probit link function was introduced by Filardo and Gordon (1998) and Lu and Berliner (1999), whereas Spezia (2006) considered the multinomial logit link function. The logit approach is currently the most popular for non-homogeneous hidden Markov chains (NH-HMCs) and was adopted, among others, by Filardo (1994) to model US industrial production, Robertson et al. (2004) for rainfall amounts, Banachewicz et al. (2007) for portfolio defaults, Meligkotsidou and Dellaportas (2011) for the US three-month treasury


bill rates, Hamilton and Owyang (2012) for employment growth rates, Hokimoto and Shimizu (2014) for relative sea surface elevation, Kaufmann (2015) for a two-pillar Phillips curve for the Euro area, Holsclaw et al. (2017) for precipitation amounts, Monbet and Ailliot (2017) for air temperatures, and Bazzi et al. (2017) for US industrial production. Meligkotsidou and Dellaportas (2011), Kaufmann (2015), and Holsclaw et al. (2017) all conducted Bayesian inference: the first used an auxiliary sampling method to estimate the logit parameters, whereas the other two proposed different samplers based on data augmentation. As an alternative to the previous approaches, NH-HMCs were obtained by Heaps et al. (2015) and Pinto and Spezia (2016) by modelling a finite number of transition matrices that alternate according to the value assumed by a dynamic categorical covariate.

One of the major issues when dealing with NH-HMCs is the selection of the exogenous variables that affect the dynamics of the hidden states. Paroli and Spezia (2008a) compared four different techniques to select the covariates affecting each row of the time-varying transition matrices: Stochastic Search Variable Selection of George and McCulloch (1993), the unconditional priors Gibbs sampler (KM) of Kuo and Mallick (1998), Gibbs Variable Selection (GVS) of Dellaportas et al. (2000), and the Metropolized version of KM (MKMK). MKMK outperformed the rival techniques in simulated examples, especially when the explanatory variables were strongly correlated and/or when the complexity of the model was high, as in the case of univariate Normal NH-HMMs. The better performance of MKMK in selecting the explanatory variables was due to the acceptance in a single block of both the coefficients and the indicators of the explanatory variables, because blocking increases the precision of the estimators and improves the mixing of the Markov chain Monte Carlo (MCMC) algorithm. MKMK was successfully applied in environmental (Paroli and Spezia, 2008b; Birkel et al., 2012) and ecological (Spezia et al., 2014a,b) models including NH-HMCs. Meligkotsidou and Dellaportas (2011) selected a single subset of covariates affecting all transition probabilities by a method based on the reversible jump MCMC algorithm of Green (1995).

In this paper a novel evolutionary Monte Carlo (EMC) algorithm is proposed for the selection of the exogenous variables affecting the different rows of the transition matrices in a NH-HMM. EMC is an MCMC method which processes a population of chains in parallel, with a different temperature attached to each chain. EMC was introduced by Liang and Wong (2000) for drawing samples (also called individuals) from binary distributions; Liang and Wong (2001) generalized the algorithm for sampling from the real space. Draws from the parallel chains are obtained by applying operators borrowed from genetic algorithms (Goldberg, 1989), i.e. mutation (update of a single individual), crossover (partial update of two individuals), and exchange (full update of two individuals), and by performing simulations at various temperatures, as in simulated annealing (Kirkpatrick et al., 1983). Sampling along a temperature ladder can help the chains, especially those associated with a high temperature, to escape from local basins of attraction and, thus, to improve their mixing ability. This happens because the high temperature flattens the posterior density.
When the exchange operator is applied, the temperatures of two samples are swapped: the worse sample climbs up the temperature ladder, whereas the better one moves down. Then, at the high temperatures, mutations are easily accepted and bad samples are eliminated from the population. Further, the crossover operator reinforces the adaptive quality of the algorithm. If two individuals are bad in terms of their posteriors, when they cross each other they are likely to generate bad samples, which will be eliminated from the population; if the individuals are good, they are likely to generate better samples, which will increase the quality of the population. The information gained by the population of chains over its evolution allows the population to adapt, generating better and better samples by using information from other individuals to adjust the sampling distributions, and so learning from past samples.

EMC can be included in the class of population MCMC (PMCMC) methods, as defined by Blackmond-Laskey and Myers (2003), who compared the performance of random walk Metropolis, genetic, and PMCMC algorithms. Other significant contributions to EMC and PMCMC methods are as follows: EMC was generalized by Liang (2005) to sample from a distribution defined on variable-dimensional spaces, with specific mutation, crossover, and exchange operators designed for Bayesian neural networks; the genetic algorithm called Differential Evolution was combined with PMCMC by Ter Braak (2006); four moves for EMC were developed by Goswami and Liu (2007), who also introduced a strategy for designing the temperatures; a review of PMCMC methods was presented by Jasra et al. (2007a), who subsequently proposed a trans-dimensional PMCMC (Jasra et al., 2007b); PMCMC methods were used in model choice by Friel and Pettitt (2008) and Calderhead and Girolami (2009) to compute the marginal likelihood via power posteriors; a stereo matching method using PMCMC was designed by Kim et al. (2009); a PMCMC called Distributed EMC was introduced by Hu and Tsui (2010); a sampling algorithm based upon EMC for variable selection in linear models when a large number of covariates are available was proposed by Bottolo and Richardson (2010); PMCMC algorithms to generate multiple history-matched models in reservoir modelling were presented by Mohamed et al. (2012); and four gradient-free MCMC samplers (including PMCMC) applied to dynamical causal models in neuroimaging were compared by Sengupta et al. (2015).

The plan of the paper is as follows. NH-HMMs are introduced in Section 2; Section 3 describes the EMC algorithm for variable selection; Section 4 presents the results of various simulated examples, where the covariates were either uncorrelated or correlated, with increasing levels of correlation; an application to ozone dynamics is illustrated in Section 5, where results about covariate selection, choice of the number of hidden states, parameter estimation, hidden chain reconstruction, and classification are shown; a discussion concludes the paper.


2. Non-homogeneous hidden Markov models

2.1. The basic model

Let $\{x_t\}$ be a univariate, discrete-time, first-order, non-homogeneous Markov chain on a finite state space $S_x = \{1, \dots, m\}$. For any $t = 2, \dots, T$, $\Gamma_t = [\gamma_{j,i}^t]$ is the $(m \times m)$ time-varying transition matrix, where $\gamma_{j,i}^t = P(x_t = i \mid x_{t-1} = j)$, for any $i, j \in S_x$; the initial distribution is the vector $\delta = (\delta_1, \dots, \delta_m)'$, with $\delta_i = P(x_1 = i)$, for any $i \in S_x$, such that $\delta' = \delta' \Gamma_2$ ($\delta'$ can be obtained from $\Gamma_2$ as the left eigenvector associated with the unit eigenvalue). At each time $t = 2, \dots, T$, the transition probabilities $\gamma_{j,i}^t$ can be obtained as multinomial logit functions of $z_{t-1,\bullet} \beta_{j,i}$, where $z_{t-1,\bullet}$ is a row vector containing the $n$ exogenous variables associated with the departure state $j$, $z_{t-1,\bullet} = (1, z_{t-1,1}, \dots, z_{t-1,n})$, $z^{T-1} = (z_{1,\bullet}, \dots, z_{T-1,\bullet})'$, and $\beta_{j,i}$, when $j \neq i$, is a vector of $n+1$ coefficients, $\beta_{j,i} = (\beta_{j,i,0}, \beta_{j,i,1}, \dots, \beta_{j,i,n})'$, whereas, when $j = i$, it is the null vector:

$$\mathrm{logit}(\gamma_{j,i}^t) = \ln\left(\frac{\gamma_{j,i}^t}{\gamma_{j,j}^t}\right) = z_{t-1,\bullet} \beta_{j,i} \tag{1}$$

and

$$\gamma_{j,i}^t = \frac{\exp(z_{t-1,\bullet} \beta_{j,i})}{\sum_i \exp(z_{t-1,\bullet} \beta_{j,i})}, \tag{2}$$

for any $i, j \in S_x$.

Let $x^T = (x_1, \dots, x_T)'$ be the sequence of the states of the Markov chain, where each $x_t$ ($t = 1, \dots, T$) takes values in $S_x$. The joint probability function of the whole sequence of states is

$$p(x^T \mid \beta) = \delta_{x_1} \prod_{t=2}^{T} \gamma_{x_{t-1}, x_t}^{t},$$

where $\beta = [\beta_{j,i}]$ is the $(m \times m)$ matrix of the covariate coefficients $\beta_{j,i}$. If $n = 0$ for any $t = 1, \dots, T-1$, the Markov chain is homogeneous. The sequence $x^T$ of realizations from the Markov chain cannot be observed directly, because $\{x_t\}$ is a latent process, hidden behind an observed process $\{y_t\}$. Furthermore, we assume that the process $\{y_t\}$, given $\{x_t\}$, is a sequence of conditionally independent random variables, whose conditional distribution depends on $\{x_t\}$ only through the contemporary state $x_t$. Let $y^T = (y_1, \dots, y_T)'$ be the sequence of observations. Given the conditional independence and contemporary conditions we have

$$p(y^T \mid x^T, \theta) = \prod_{t=1}^{T} p(y_t \mid x_t, \theta_{x_t}),$$

where $\theta = (\theta_1, \dots, \theta_m)$ is the vector of parameters indexing the state-dependent distribution of each $y_t$ ($t = 1, \dots, T$). The pair of processes $(\{y_t\}; \{x_t\})$ so defined gives a hidden Markov model. Finally, marginalizing out the sequence of the hidden states $x^T$, the joint density of all variables in the model is

$$p(y^T, \theta, \beta) = p(y^T \mid \theta, \beta)\, p(\theta)\, p(\beta),$$

where $p(\theta)$ and $p(\beta)$ are prior densities and

$$p(y^T \mid \theta, \beta) = \prod_{t=1}^{T} \sum_{i=1}^{m} p(y_t \mid x_t = i, \theta_i)\, \xi_{t|t-1}(i), \tag{3}$$

with, for any $t = 1, \dots, T$, $\xi_{t|t-1}(i) = P(x_t = i \mid y^{t-1}, \theta, \beta)$ and $y^{t-1} = (y_1, \dots, y_{t-1})'$. Filtered probabilities $\xi_{t|t-1}(i)$ can be computed recursively by iterating the following equations, for any $t = 1, \dots, T$ and for any $i = 1, \dots, m$:

$$\xi_{t-1|t-1}(i) \propto \xi_{t-1|t-2}(i)\, p(y_{t-1} \mid \theta_i) \quad \text{and} \quad \xi_{t|t-1}(i) = \sum_{j=1}^{m} \xi_{t-1|t-1}(j)\, \gamma_{j,i}^{t}, \tag{4}$$

with $\xi_{t-1|t-1}(i) = P(x_{t-1} = i \mid y^{t-1}, \theta, \beta)$ and $\xi_{1|0}(i) = \delta_i$ (Hamilton, 1994, ch. 22).
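The likelihood (3) is a by-product of one forward pass of the recursions (4). Below is a minimal sketch under the Gaussian state-dependent densities used later in the paper; gammas[t] denotes the transition matrix governing the move from $x_t$ to $x_{t+1}$ (0-based indexing), built from $z_t$ as in (2), and all names are illustrative.

```python
# Forward filter: log-likelihood of Eq. (3) via the recursions (4).
import numpy as np
from scipy.stats import multivariate_normal

def forward_filter(y, gammas, delta, mus, Sigmas):
    T, m = len(y), len(delta)
    xi_pred = delta.copy()                    # xi_{1|0}(i) = delta_i
    xi_filt = np.empty((T, m))
    loglik = 0.0
    for t in range(T):
        dens = np.array([multivariate_normal.pdf(y[t], mus[i], Sigmas[i])
                         for i in range(m)])
        c = xi_pred @ dens                    # one factor of Eq. (3)
        loglik += np.log(c)
        xi_filt[t] = xi_pred * dens / c       # update step of Eq. (4)
        if t < T - 1:
            xi_pred = xi_filt[t] @ gammas[t]  # prediction step of Eq. (4)
    return loglik, xi_filt
```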

2.2. Covariate selection

Several methods have been introduced in the Bayesian literature for selecting explanatory variables in statistical models. A broad review on Bayesian variable selection can be found in O'Hara and Sillanpää (2009). The most intuitive approach is to associate each covariate $z_{\bullet,h} = (z_{1,h}, \dots, z_{t,h}, \dots, z_{T-1,h})'$ with a state-dependent binary indicator $\eta_{j,h}$ ($h = 1, \dots, n$; $j = 1, \dots, m$), where $z^{T-1} = (1_{(T-1)}, z_{\bullet,1}, \dots, z_{\bullet,h}, \dots, z_{\bullet,n})$, with $1_{(T-1)}$ the $(T-1)$-dimensional column vector of ones. When $\eta_{j,h}$ equals 1 the corresponding variable $z_{\bullet,h}$ is included in the computation of the transition probabilities in the $j$th row, whereas when it equals 0, the $h$th covariate is excluded. Thus, we can have different covariates in the $m$ rows of the time-varying transition matrices. Hence, expressions (1) and (2) respectively become

$$\mathrm{logit}(\gamma_{j,i}^t) = \ln\left(\frac{\gamma_{j,i}^t}{\gamma_{j,j}^t}\right) = z_{t-1,\bullet} H_j \beta_{j,i}$$

and

$$\gamma_{j,i}^t = \frac{\exp(z_{t-1,\bullet} H_j \beta_{j,i})}{\sum_i \exp(z_{t-1,\bullet} H_j \beta_{j,i})}, \tag{5}$$

for any $i, j \in S_x$, with $H_j = \mathrm{diag}[1, \eta_{j,1}, \dots, \eta_{j,n}]$ and $H = (H_1, \dots, H_j, \dots, H_m)$. Note that we assume the transitions from state $j$ ($j = 1, \dots, m$) to any state $i$ are regulated by the same set of covariates, so the same matrix $H_j$ is associated with all entries of the $j$th row of each matrix $\Gamma_t$ ($t = 2, \dots, T$) and, consequently, with each $\beta_{j,i}$ ($i = 1, \dots, m$).

2.3. Prior specification

We also assume the probability density function (pdf) of each observation, given the contemporary hidden state, is a multivariate Normal (MVN), but this statement does not affect the generality of our results. In fact, the inference procedures and the variable selection techniques we present in the rest of the paper hold for any distribution. Let

$$p(y_t \mid x_t = i, \mu_i, \Sigma_i) = [(2\pi)^p |\Sigma_i|]^{-1/2} \exp\{-(y_t' - \mu_i')' \Sigma_i^{-1} (y_t' - \mu_i')/2\}$$

be the state-dependent pdf of a MVN random variable, for any $i \in S_x$, with $y_t = (y_{t,1}, \dots, y_{t,p})$, for any $t = 1, \dots, T$. Let also $\mu = (\mu_1, \dots, \mu_i, \dots, \mu_m)'$, with $\mu_i = (\mu_{i,1}, \dots, \mu_{i,p})$, and $\Sigma = (\Sigma_1, \dots, \Sigma_i, \dots, \Sigma_m)$, with $\Sigma_i = [\sigma_{i,k,l}]$, for any $i \in S_x$ and $k, l = 1, \dots, p$, such that $\theta = (\mu, \Sigma)$ ($\sigma_{i,k,l} = \sigma_{i,k}^2$ when $k = l$).

Given that our model is Bayesian, we introduce the following prior distributions, where $N_{(p)}(\cdot)$ stands for the MVN of dimension $p$, $\mathcal{G}(\cdot)$ for the Gamma, $Be(\cdot)$ for the Bernoulli, and $U(\cdot)$ for the Uniform. A vague MVN is placed on each off-diagonal entry of matrix $\beta$: $\beta_{j,i} \sim N_{(n+1)}(0_{(n+1)}; 10 I_{(n+1)})$, with $j \neq i$, where $0_{(n+1)}$ is the $(n+1)$-dimensional null vector and $I_{(n+1)}$ is the $(n+1)$-dimensional identity matrix. Following Richardson and Green (1997), non-informative priors on the means are defined by allowing them to vary over the interval of variation of the data, with a variance equal to the square of the length of the interval: $\mu_i \sim N_{(p)}(\lambda; \Lambda)$, for any $i = 1, \dots, m$. Thus, we set the entries of vector $\lambda = (\lambda_1, \dots, \lambda_p)$ equal to the mid-point of the range of each of the $p$ univariate time series. To have high variability around the mid-points, we assume the variances in $\Lambda$ are equal to the square of the ranges, whereas the covariances are all null. Also, to avoid the label switching problem (Richardson and Green, 1997; Frühwirth-Schnatter, 2001; Jasra et al., 2005; Spezia, 2009) due to the $m!$ ways of ordering the states, a vector $(\mu_{1,k}, \dots, \mu_{i,k}, \dots, \mu_{m,k})$, with $k \in \{1, \dots, p\}$, is placed in increasing order.

The covariance matrices are modelled a priori by applying the trigonometric separation strategy of Spezia (2019), based on the separation strategy of Barnard et al. (2000). Each $\Sigma_i$ ($i \in S_x$) is modelled by separating out the standard deviations and the correlations. In order to preserve the positive definiteness of the correlation matrices, their Cholesky decompositions are considered, and their entries are reparameterized using trigonometric functions. The reparameterization in terms of angles is due to a few constraints on the Cholesky matrices: the Cholesky decomposition of a correlation matrix is an upper triangular matrix, whose diagonal elements belong to the interval (0, 1), except the first which is equal to one, and those in the upper triangle belong to the interval (-1, 1); finally, the sum of the squares of the entries, taken over each column, must equal one. Thus, we have

$$\Sigma_i = S_i P_i S_i = S_i C_i' C_i S_i,$$

for any $i = 1, \dots, m$, where $P_1, \dots, P_m$ are the correlation matrices, $C_1, \dots, C_m$ are their Cholesky decompositions, and $S_1 = \mathrm{diag}[\sigma_{1,1}, \dots, \sigma_{1,p}]$, ..., $S_m = \mathrm{diag}[\sigma_{m,1}, \dots, \sigma_{m,p}]$. For any state $i \in S_x$, the Cholesky matrices $C_i = [c_{i,k,l}]$, $k, l = 1, \dots, p$, are reparameterized in terms of angles, in order to satisfy the positive definiteness constraint on the $P_i$:

$$\begin{cases} c_{i,1,2} = \cos\alpha_{i,1,2} \\ c_{i,2,2} = \sin\alpha_{i,1,2} \end{cases}$$

with $\alpha_{i,1,2} \in (0, \pi)$;

$$\begin{cases} c_{i,1,3} = \sin\alpha_{i,1,3}\cos\alpha_{i,2,3} \\ c_{i,2,3} = \sin\alpha_{i,1,3}\sin\alpha_{i,2,3} \\ c_{i,3,3} = \cos\alpha_{i,1,3} \end{cases}$$

with $\alpha_{i,1,3} \in (0, \pi/2)$ and $\alpha_{i,2,3} \in (0, 2\pi)$; ...; and

$$\begin{cases} c_{i,1,p} = \sin\alpha_{i,1,p}\cos\alpha_{i,2,p} \\ c_{i,2,p} = \sin\alpha_{i,1,p}\sin\alpha_{i,2,p}\cos\alpha_{i,3,p} \\ \quad\vdots \\ c_{i,p-2,p} = \sin\alpha_{i,1,p}\sin\alpha_{i,2,p}\cdots\sin\alpha_{i,p-2,p}\cos\alpha_{i,p-1,p} \\ c_{i,p-1,p} = \sin\alpha_{i,1,p}\sin\alpha_{i,2,p}\cdots\sin\alpha_{i,p-2,p}\sin\alpha_{i,p-1,p} \\ c_{i,p,p} = \cos\alpha_{i,1,p} \end{cases}$$

with $\alpha_{i,1,p} \in (0, \pi/2)$, $\alpha_{i,2,p}, \dots, \alpha_{i,p-2,p} \in (0, \pi)$, and $\alpha_{i,p-1,p} \in (0, 2\pi)$.
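As a small numerical check of this reparameterization, the sketch below builds the Cholesky factor from the angles for $p = 3$, recovers $P_i = C_i' C_i$, and verifies the unit diagonal and positive definiteness; the angle values are arbitrary illustrations inside the stated ranges.

```python
import numpy as np

def cholesky_from_angles_p3(a12, a13, a23):
    """Upper-triangular Cholesky factor C_i of a 3x3 correlation matrix."""
    C = np.zeros((3, 3))
    C[0, 0] = 1.0
    C[0, 1], C[1, 1] = np.cos(a12), np.sin(a12)   # column 2
    C[0, 2] = np.sin(a13) * np.cos(a23)           # column 3
    C[1, 2] = np.sin(a13) * np.sin(a23)
    C[2, 2] = np.cos(a13)
    return C

C = cholesky_from_angles_p3(a12=0.8, a13=0.6, a23=2.5)
P = C.T @ C                                  # P_i = C_i' C_i
assert np.allclose(np.diag(P), 1.0)          # columns of C have unit norm
assert np.all(np.linalg.eigvalsh(P) > 0)     # positive definite by construction
```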

Hence, the following Uniform priors are placed:

$$\alpha_{i,1,2}, \alpha_{i,2,4}, \dots, \alpha_{i,2,p}, \alpha_{i,3,5}, \dots, \alpha_{i,3,p}, \dots, \alpha_{i,p-3,p-1}, \alpha_{i,p-3,p}, \alpha_{i,p-2,p} \sim U(0, \pi);$$
$$\alpha_{i,1,3}, \alpha_{i,1,4}, \dots, \alpha_{i,1,p} \sim U(0, \pi/2);$$
$$\alpha_{i,2,3}, \alpha_{i,3,4}, \dots, \alpha_{i,p-2,p-1}, \alpha_{i,p-1,p} \sim U(0, 2\pi)$$

for any $i = 1, \dots, m$. Following the guidelines of Seaman, III et al. (2012), we also assume that all standard deviations are mutually independent a priori and use a prior that induces a non-informative prior on the entries of the $\Sigma_i$: $\sigma_{i,k} \sim \mathcal{G}(4, 0.5)$, for any $i = 1, \dots, m$ and any $k = 1, \dots, p$. Finally, the model includes the prior distributions of the $\eta_{j,h}$'s, which are independent Bernoulli with known probability $\omega_h$ ($\eta_{j,h} \sim Be(\omega_h)$, for any $h = 1, \dots, n$, $j = 1, \dots, m$), where $\omega_h = P(\eta_{j,h} = 1) = 1 - P(\eta_{j,h} = 0)$. We set $\omega_h = 0.5$, for any $h = 1, \dots, n$; therefore, we have the indifference prior $p(H) = 2^{-mn}$.

3. MCMC inference through EMC

Let $\psi = (\mu, S, \alpha, \beta, H)$ be the vector of the unknown parameters of our NH-HMM with the MVN as state-dependent pdf, where $S = (S_1, \dots, S_m)$ and $\alpha = (\alpha_1, \dots, \alpha_m)$, with $\alpha_i = (\alpha_{i,1,2}, \alpha_{i,1,3}, \dots, \alpha_{i,p-1,p})$, for any $i = 1, \dots, m$. The posterior distribution of $\psi$ is

$$p(\psi \mid y^T, z^{T-1}) \propto p(y^T \mid z^{T-1}, \psi)\, p(\mu)\, p(S)\, p(\alpha)\, p(\beta)\, p(H),$$

where the sequence of the hidden states $x^T$ has been marginalized out, as in formula (3). We generate $\psi$ from the posterior through an EMC algorithm. In this section, first, we briefly summarize the EMC method, when all entries are real values. Then, we show how an EMC algorithm works when applied to a NH-HMM, whose parameters are both real and binary (i.e., the variable selection indicators). The summary of the MCMC method is based on Liang and Wong (2001), but two novel crossover operators are considered, because they can be efficiently extended for variable selection purposes, as described in the new algorithm of Section 3.2.

3.1. Evolutionary Monte Carlo

Let $p(\psi)$ be the density of a Gibbs distribution, when the system has configuration $\psi$, whose entries are all real values. It can be written as

$$p(\psi) \propto \exp\left\{\frac{-U(\psi)}{\tau}\right\},$$

where $\tau$ is the temperature of the system and $U(\cdot)$ is the energy of the system. In terms of genetic algorithms, $U(\cdot)$ is also called the fitness function. Henceforth, we take the configuration $\psi$ to be the vector of parameters and the energy function to be the negative natural logarithm of the posterior density of $\psi$:

$$U(\psi) = -\ln p(\psi \mid y^T).$$

Now, let $\Psi = \{\psi_1, \dots, \psi_N\}$ be a set of independent configurations and $\{\tau_1, \dots, \tau_N\}$ be a temperature ladder, such that $\tau_1 > \tau_2 > \cdots > \tau_N$. We associate any configuration $\psi_D$ ($D = 1, \dots, N$) with one of $N$ Gibbs distributions:

$$p(\psi_D) \propto \exp\left\{\frac{-U(\psi_D)}{\tau_D}\right\}.$$

A population of $N$ chains is run in parallel and the draws from the $N$th chain are collected to compute the estimates of the unknown parameters. The configuration set $\Psi$ evolves through the application of mutation, crossover (either asymmetric or random walk), and exchange operators, as in genetic algorithms.

The mutation operator is the standard Metropolis-Hastings algorithm with random walk proposal: a configuration $\psi_D$ is selected at random within $\Psi$ and a new configuration $\psi_D^{(new)}$ is obtained by adding to $\psi_D$ a realization of a Gaussian variable with zero mean vector and diagonal covariance matrix. The configuration set $\Psi^{(new)} = \{\psi_1, \dots, \psi_D^{(new)}, \dots, \psi_N\}$ is accepted with probability $\min\{1; R_M\}$, where the acceptance ratio is

$$R_M = \frac{p(\Psi^{(new)})\, q(\Psi \mid \Psi^{(new)})}{p(\Psi)\, q(\Psi^{(new)} \mid \Psi)} = \exp\left\{-\frac{U(\psi_D^{(new)}) - U(\psi_D)}{\tau_D}\right\} = \exp\left\{\frac{\ln p(\psi_D^{(new)} \mid y^T, z^{T-1}) - \ln p(\psi_D \mid y^T, z^{T-1})}{\tau_D}\right\}, \tag{6}$$

as the proposal ratio $q(\Psi \mid \Psi^{(new)})/q(\Psi^{(new)} \mid \Psi)$ cancels out because of the symmetry of the proposal density $q(\cdot)$.

The asymmetric crossover consists of selecting at random two configurations $\psi_D$ and $\psi_G$, fixing a threshold $\nu$, $\nu \in (0; 1)$, and generating two new configurations $\psi_D^{(new)}$ and $\psi_G^{(new)}$ according to the following rule: for each entry of $\psi_D^{(new)}$, we generate a value $u$ from $U(0; 1)$ and, if $u \leq \nu$, the entry of $\psi_D^{(new)}$ is equal to the corresponding entry of $\psi_D$, otherwise, if $u > \nu$, the entry of $\psi_D^{(new)}$ is equal to the corresponding entry of $\psi_G$; the value not included in $\psi_D^{(new)}$ is placed in $\psi_G^{(new)}$. The new configuration set $\Psi^{(new)} = \{\psi_1, \dots, \psi_D^{(new)}, \dots, \psi_G^{(new)}, \dots, \psi_N\}$ is accepted with probability $\min\{1; R_C\}$, where the acceptance ratio is

$$R_C = \frac{p(\Psi^{(new)})\, q(\Psi \mid \Psi^{(new)})}{p(\Psi)\, q(\Psi^{(new)} \mid \Psi)} = \exp\left\{\frac{\ln p(\psi_D^{(new)} \mid y^T, z^{T-1}) - \ln p(\psi_D \mid y^T, z^{T-1})}{\tau_D} + \frac{\ln p(\psi_G^{(new)} \mid y^T, z^{T-1}) - \ln p(\psi_G \mid y^T, z^{T-1})}{\tau_G}\right\}, \tag{7}$$

as the proposal ratio $q(\Psi \mid \Psi^{(new)})/q(\Psi^{(new)} \mid \Psi)$ cancels out because of the symmetry of the proposal density $q(\cdot)$.

The random walk crossover uses the same acceptance rule as the asymmetric crossover, but the way to obtain the new configuration set is different: $\psi_D^{(new)}$ and $\psi_G^{(new)}$ are obtained by adding to $\psi_G$ and $\psi_D$, respectively, the realizations of a Gaussian variable with zero mean vector and diagonal covariance matrix, in order to obtain two random walks.

The exchange operator is based on the random selection of a configuration $\psi_D$ and, successively, a configuration $\psi_G = \psi_{D-1}$ or $\psi_{D+1}$ with probability 0.5. If $D$ is 1 or $N$, $G$ is 2 or $N-1$, with probability one. The new configuration set $\Psi^{(new)}$ is obtained by exchanging the places of $\psi_D$ and $\psi_G$ with respect to $\Psi$, without exchanging their respective temperatures. The configuration set $\Psi^{(new)}$ is accepted with probability $\min\{1; R_E\}$, where the acceptance ratio is

$$R_E = \frac{p(\Psi^{(new)})\, q(\Psi \mid \Psi^{(new)})}{p(\Psi)\, q(\Psi^{(new)} \mid \Psi)} = \exp\left\{(U(\psi_D) - U(\psi_G))\left(\frac{1}{\tau_D} - \frac{1}{\tau_G}\right)\right\} = \exp\left\{\left(\ln p(\psi_G \mid y^T, z^{T-1}) - \ln p(\psi_D \mid y^T, z^{T-1})\right)\left(\frac{1}{\tau_D} - \frac{1}{\tau_G}\right)\right\},$$

as the proposal ratio $q(\Psi \mid \Psi^{(new)})/q(\Psi^{(new)} \mid \Psi)$ cancels out because of the symmetry of the proposal density $q(\cdot)$.

The iterative scheme of the algorithm is the following; a schematic sketch of one sweep is given after the list. [1] Set the iteration counter $g$ equal to zero; generate randomly the starting values from their respective prior distributions. [2] At any iteration $g > 0$, apply the mutation, asymmetric crossover, or random walk crossover operator with probability $\varepsilon$, $(1-\varepsilon)/2$, and $(1-\varepsilon)/2$, respectively. [3] Apply the exchange operator. [4] After the burn-in period, if $g$ is a multiple of $w$, save vector $\psi_N$. [5] Set the counter $g$ equal to $g + 1$ and go to [2].
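The sweep below is a schematic sketch of steps [2]-[3] for real-valued configurations, assuming a user-supplied log_post() function returning $\ln p(\psi \mid y^T, z^{T-1})$ (i.e., $-U$); for brevity it implements only the mutation move with ratio (6) and the exchange move with ratio $R_E$, omitting the two crossovers.

```python
import numpy as np

rng = np.random.default_rng(1)

def emc_sweep(pop, temps, log_post, step=0.1):
    """One EMC sweep over a population of N real-valued configurations."""
    N = len(pop)
    # Mutation: random-walk Metropolis on one randomly chosen chain D
    D = rng.integers(N)
    prop = pop[D] + rng.normal(0.0, step, size=pop[D].shape)
    log_r = (log_post(prop) - log_post(pop[D])) / temps[D]      # Eq. (6)
    if np.log(rng.uniform()) < log_r:
        pop[D] = prop
    # Exchange: swap chain D with a neighbour G; temperatures stay in place
    D = rng.integers(N)
    G = 1 if D == 0 else (N - 2 if D == N - 1 else D + rng.choice([-1, 1]))
    log_r = (log_post(pop[G]) - log_post(pop[D])) * (1 / temps[D] - 1 / temps[G])
    if np.log(rng.uniform()) < log_r:
        pop[D], pop[G] = pop[G].copy(), pop[D].copy()
    return pop
```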

3.2. EMC for variable selection in NH-HMMs

Now, we examine how the four operators (mutation, asymmetric crossover, random walk crossover, exchange) work when applied to NH-HMMs. We present the most general case, when the covariates are selected stochastically. This means that the entries of $\Psi$ are both real and binary. In the case the set of covariates is fixed, all entries of $H$ are constant and never changed. The mutation and the random walk crossover operators incorporate the main ideas of the KM (Kuo and Mallick, 1998) and the MKMK (Paroli and Spezia, 2008a) methods.

Mutation operator: the mutation operator is based on the random walk move, applied to the individual $\psi_D$ ($D = 1, \dots, N$). Random walks can update real parameters, thus the standard deviations $\sigma_{i,f}^D$'s and the angles $\alpha_{i,k,l}^D$'s are mapped onto the real line through a logarithmic transformation. As the angles belong to the interval $(A_{i,k,l}; B_{i,k,l})$, we consider the natural logarithm of $[\alpha_{i,k,l}^D - A_{i,k,l}]/[B_{i,k,l} - \alpha_{i,k,l}^D]$. The values $\beta_{j,i}^D$, $\mu_{i,f}^D$, $\ln \sigma_{i,f}^D$, and $\ln[\alpha_{i,k,l}^D - A_{i,k,l}]/[B_{i,k,l} - \alpha_{i,k,l}^D]$, for any $i, j = 1, \dots, m$ ($j \neq i$), $f = 1, \dots, p$, $k = 1, \dots, p-1$, $l = k+1, \dots, p$, belong to the interval $(-\infty; +\infty)$ and can be generated by the following random walk proposals:

$$\beta_{j,i}^{D(new)} = \beta_{j,i}^D + U_B,$$
$$\mu_{i,f}^{D(new)} = \mu_{i,f}^D + U_M,$$
$$\ln \sigma_{i,f}^{D(new)} = \ln \sigma_{i,f}^D + U_\Sigma,$$
$$\ln\left[\alpha_{i,k,l}^{D(new)} - A_{i,k,l}\right]\Big/\left[B_{i,k,l} - \alpha_{i,k,l}^{D(new)}\right] = \ln\left[\alpha_{i,k,l}^D - A_{i,k,l}\right]\Big/\left[B_{i,k,l} - \alpha_{i,k,l}^D\right] + U_A,$$

where $U_\Omega \sim N(0; \sigma_\Omega^2)$, with $\Omega \in \{B; M; \Sigma; A\}$.

In order to obtain a proposal for $H$, we need to generate an auxiliary sequence of hidden states $x^T$, given $\psi_D$, through the forward-filtering backward-sampling (FFBS) algorithm (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994), which works as follows. First, we compute iteratively the filtering probabilities (4), until we have $\xi_{T|T}(i)$, for any $i \in S_x$. Then, we generate backwards each $x_t$ using the equality

$$p(x^T \mid y^T, \psi_D) = p(x_T \mid y^T, \psi_D) \prod_{t=1}^{T-1} p(x_t \mid x_{t+1}, y^t, \psi_D).$$
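A minimal FFBS sketch follows: given the filtered probabilities $\xi_{t|t}$ stored row-wise in xi_filt (e.g., from the forward filter sketched in Section 2.1) and the transition matrices gammas[t] governing the move from $x_t$ to $x_{t+1}$ (0-based), it draws one hidden path backwards; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def backward_sample(xi_filt, gammas):
    T, m = xi_filt.shape
    x = np.empty(T, dtype=int)
    x[-1] = rng.choice(m, p=xi_filt[-1])          # x_T ~ xi_{T|T}
    for t in range(T - 2, -1, -1):
        # p(x_t | x_{t+1}, y^t) proportional to xi_{t|t}(x_t) * gamma_{x_t, x_{t+1}}
        w = xi_filt[t] * gammas[t][:, x[t + 1]]
        x[t] = rng.choice(m, p=w / w.sum())
    return x
```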

Each coefficient $\eta_{j,h}^{D(new)}$ ($j = 1, \dots, m$; $h = 1, \dots, n$) is independently generated from a Bernoulli distribution $Be\left(a_{j,h}^{D(new)} \big/ \left(a_{j,h}^{D(new)} + b_{j,h}^{D(new)}\right)\right)$, with

$$a_{j,h}^{D(new)} = \left(\prod_{\{t \geq 2 : x_{t-1} = j\}} \frac{\exp\left(z_{t-1} \theta_{j,x_t[h]}^{*(new)}\right)}{\sum_i \exp\left(z_{t-1} \theta_{j,i[h]}^{*(new)}\right)}\right) \omega_h \tag{8}$$

and

$$b_{j,h}^{D(new)} = \left(\prod_{\{t \geq 2 : x_{t-1} = j\}} \frac{\exp\left(z_{t-1} \theta_{j,x_t[h]}^{**(new)}\right)}{\sum_i \exp\left(z_{t-1} \theta_{j,i[h]}^{**(new)}\right)}\right) (1 - \omega_h), \tag{9}$$

where $\theta_{j,i[h]}^{*(new)}$ is a vector whose elements are the product of the regression coefficients and their corresponding indicators, except the $h$th indicator, which is replaced by 1, because the $h$th covariate is always included, i.e.

$$\theta_{j,i[h]}^{*(new)} = \left(\beta_{j,i,0}^{D(new)},\ \eta_{j,1}^D \beta_{j,i,1}^{D(new)},\ \dots,\ \eta_{j,h-1}^D \beta_{j,i,h-1}^{D(new)},\ \beta_{j,i,h}^{D(new)},\ \eta_{j,h+1}^D \beta_{j,i,h+1}^{D(new)},\ \dots,\ \eta_{j,n}^D \beta_{j,i,n}^{D(new)}\right)',$$

whereas $\theta_{j,i[h]}^{**(new)}$ has the same elements as $\theta_{j,i[h]}^{*(new)}$, except the $h$th indicator, which is replaced by 0, because the $h$th covariate is always excluded, i.e.

$$\theta_{j,i[h]}^{**(new)} = \left(\beta_{j,i,0}^{D(new)},\ \eta_{j,1}^D \beta_{j,i,1}^{D(new)},\ \dots,\ \eta_{j,h-1}^D \beta_{j,i,h-1}^{D(new)},\ 0,\ \eta_{j,h+1}^D \beta_{j,i,h+1}^{D(new)},\ \dots,\ \eta_{j,n}^D \beta_{j,i,n}^{D(new)}\right)'.$$
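The sketch below computes the proposal probability $a_{j,h}/(a_{j,h} + b_{j,h})$ of (8)-(9) for a single indicator, working on the log scale; x is the auxiliary FFBS path, Z holds the rows $(1, z_{t,1}, \dots, z_{t,n})$, beta[j] is the $(m, n+1)$ block of proposed coefficients for departure state $j$ (with beta[j, j] the null vector), eta[j] the current indicators of row $j$, and all names are illustrative.

```python
import numpy as np

def log_softmax_pick(logits, k):
    """log( exp(logits[k]) / sum_i exp(logits[i]) ), computed stably."""
    mx = logits.max()
    return logits[k] - mx - np.log(np.exp(logits - mx).sum())

def eta_proposal_prob(j, h, x, Z, beta, eta, omega_h):
    """P(eta_{j,h} = 1) = a_{j,h} / (a_{j,h} + b_{j,h}), as in (8)-(9)."""
    mask = np.concatenate(([1.0], eta[j].astype(float)))   # intercept always in
    log_a, log_b = np.log(omega_h), np.log(1.0 - omega_h)
    for t in range(1, len(x)):
        if x[t - 1] != j:              # product over {t >= 2 : x_{t-1} = j}
            continue
        for forced, side in ((1.0, "a"), (0.0, "b")):      # theta* vs theta**
            mask_h = mask.copy()
            mask_h[h + 1] = forced     # force the h-th covariate in or out
            logits = (beta[j] * mask_h) @ Z[t - 1]
            term = log_softmax_pick(logits, x[t])
            if side == "a":
                log_a += term
            else:
                log_b += term
    return 1.0 / (1.0 + np.exp(log_b - log_a))
```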

( ⏐

q µD ⏐µD(new) q S D ⏐S D(new)

) ( ⏐ ) ⏐ ⏐ ( ) = ( ) ( )× q Ψ (new) |Ψ q µD(new) ⏐µD q S D(new) ⏐S D ( D ⏐ D(new) ) ( D ⏐ D(new) ) ( D ⏐ D(new) D D(new) ) q α ⏐α q β ⏐β q H ⏐H ,β ,β ⏐ ) ( ⏐ ) ( ⏐ ( ). (10) D(ne w ) D D(ne w ) D D(ne w ) D ⏐ ⏐ ⏐ q α α q β β q H H , β D , β D(new) ( ⏐ ) / ( D(new) ⏐ D ) ( ⏐ ) / ( D(new) ⏐ D ) w) ⏐ ⏐β cancel out because of the symmetry of the The ratios q µD ⏐µD(new) q µ and( q β D ⏐⏐β D(ne q(β ⏐ ( D ⏐ D(neµw) ) / ) ) / ( D(new) ⏐ D ) ⏐α reduce to the Jacobians of proposal densities, whereas q S ⏐S q S D(new) ⏐S D and q α D ⏐α D(new) q α )

(



the logarithmic transformations, after cancelling out the proposal ratios because of the symmetry of the proposal densities. The Jacobians are p m ∏ ∏

σ

D(new ) i ,f

i=1 h=1

/m p ∏∏

σiD,f

i=1 h=1

and p p ( m ∏ ∏ ∏

Bi,k,l − α

i=1 k=1 l=1

D(new ) i,k,l

)(

D(new ) i,k,l

α

− Ai,k,l

)

/m p p ∏∏∏(

Bi,k,l − αiD,k,l

i=1 k=1 l=1

)(

) αiD,k,l − Ai,k,l .


Asymmetric crossover operator: the asymmetric crossover operator for variable selection works as that described in Section 3.1.

Random walk crossover operator: the random walk crossover operator is based on two random walk moves, on individuals $\psi_D$ and $\psi_G$, randomly selected from $\Psi$. The new entries of $\psi_D$ are obtained from the current entries of $\psi_G$ and the new entries of $\psi_G$ are obtained from the current entries of $\psi_D$:

$$\beta_{j,i}^{D(new)} = \beta_{j,i}^G + U_B, \quad \mu_{i,f}^{D(new)} = \mu_{i,f}^G + U_M, \quad \ln \sigma_{i,f}^{D(new)} = \ln \sigma_{i,f}^G + U_\Sigma,$$
$$\ln\left[\alpha_{i,k,l}^{D(new)} - A_{i,k,l}\right]\Big/\left[B_{i,k,l} - \alpha_{i,k,l}^{D(new)}\right] = \ln\left[\alpha_{i,k,l}^G - A_{i,k,l}\right]\Big/\left[B_{i,k,l} - \alpha_{i,k,l}^G\right] + U_A,$$

and

$$\beta_{j,i}^{G(new)} = \beta_{j,i}^D + U_B, \quad \mu_{i,f}^{G(new)} = \mu_{i,f}^D + U_M, \quad \ln \sigma_{i,f}^{G(new)} = \ln \sigma_{i,f}^D + U_\Sigma,$$
$$\ln\left[\alpha_{i,k,l}^{G(new)} - A_{i,k,l}\right]\Big/\left[B_{i,k,l} - \alpha_{i,k,l}^{G(new)}\right] = \ln\left[\alpha_{i,k,l}^D - A_{i,k,l}\right]\Big/\left[B_{i,k,l} - \alpha_{i,k,l}^D\right] + U_A,$$

with $U_B$, $U_M$, $U_\Sigma$, and $U_A$ as in the mutation operator.

All coefficients $\eta_{j,h}^{D(new)}$'s and $\eta_{j,h}^{G(new)}$'s ($j = 1, \dots, m$; $h = 1, \dots, n$) are independently generated from the Bernoulli distributions $Be\left(a_{j,h}^{D(new)} \big/ \left(a_{j,h}^{D(new)} + b_{j,h}^{D(new)}\right)\right)$ and $Be\left(a_{j,h}^{G(new)} \big/ \left(a_{j,h}^{G(new)} + b_{j,h}^{G(new)}\right)\right)$, respectively, with $a_{j,h}^{D(new)}$, $b_{j,h}^{D(new)}$, $a_{j,h}^{G(new)}$, and $b_{j,h}^{G(new)}$ obtained as in (8) and (9), after the generation of two auxiliary sequences of hidden states via the FFBS algorithm, i.e., $x^D$, given $\psi_D$, and $x^G$, given $\psi_G$. The proposal $\Psi^{(new)}$ is accepted according to the acceptance ratio $R_C$ in equality (7), where the proposal ratio is obtained following the factorization in (10).

Exchange operator: the exchange operator for variable selection works as that described in Section 3.1.

and obtained as in (8) and (9), after the generation of two auxiliary sequences of hidden states via the FFBS algorithm, i.e., xD , given ψD , and xG , given ψG . The proposal Ψ (new) is accepted according to the acceptance ratio RC in equality (7), where the proposal ratio is obtained following the factorization in (10). Exchange operator — The exchange operator for variable selection works as that described in Section 3.1. 3.3. Using the EMC sample When we tackle the variable selection problem, the EMC algorithm will be iterated until we have a sample large enough to choose the best model, by evaluating the subset of indicators H generated most. In fact, the promising subset of covariates is usually identified by the highest posterior probability of its corresponding vector of indicators, i.e. that with the most frequent appearance in the EMC sample. As an alternative, Barbieri and Berger (2004) suggested to select the median model, that is the model including all covariates whose marginal posterior probability is greater than 0.5. They found optimality conditions for the median model. Although these conditions are quite strong, the median model is the only one which may reach optimality. Once the covariates of the non-homogeneous Markov chain have been selected, that is the indicators H are fixed, we can generate from the posterior a sample large enough to estimate the unknown parameters µ, S, α , and β by the posterior mean. Using these estimates, we can also reconstruct the sequence of the hidden states through the marginal posterior mode of the smoothing probabilities. First, we compute iteratively the filtering probabilities (4), until we have ξT |T (i), for any i ∈ Sx . Then, the smoothed probabilities of any state, that is the probability of any state, at any time, given all observations and parameters are obtained by the following backward recursion (Kim, 1994): P xt = i | y , z T

(

T −1

, ψ = ξt |t (i)

)

( ) m ∑ γit,j P xt +1 = j | yT , z T −1 , ψ j=1

ξt +1|t (j)

,

Finally, the EMC algorithm can also be used to choose the best model, that is, in the case of HMMs, to select the number of hidden states $m$. Here, Bayesian model choice is performed by means of Bayes factors (Kass and Raftery, 1995), in which the marginal likelihoods are evaluated numerically through the path sampling algorithm of Gelman and Meng (1998), also known as thermodynamic integration. Gelman and Meng (1998) introduced path sampling to compute ratios of normalizing constants; then Lartillot and Hervé (2006) and Friel and Pettitt (2008) applied path sampling to the computation of marginal likelihoods. EMC was used in model choice by Friel and Pettitt (2008) and Calderhead and Girolami (2009) to compute the marginal likelihood via power posteriors.

Path sampling works as follows. Let $p(\psi)$ be a distribution that contains a normalizing constant $Z$: $p(\psi) = q(\psi)/Z$, where $Z = \int_\Psi q(\psi)\, d\psi$, with $\psi \in \Psi$. In our context of NH-HMMs with MVN observations, we have

$$q(\psi) = p(y^T \mid z^{T-1}, \psi, m)\, p(\mu \mid m)\, p(S \mid m)\, p(\alpha \mid m)\, p(\beta \mid m)$$

and

$$Z = p(y^T \mid z^{T-1}, H, m),$$

so that

$$p(\psi) = p(\mu, S, \alpha, \beta \mid y^T, z^{T-1}, H, m),$$

with $H$ fixed, that is, the covariates have already been selected and are given. We can also introduce the definition of the unnormalized power posterior (Friel and Pettitt, 2008):

$$q_\phi(\psi) = p(y^T \mid \psi, m)^\phi\, p(\psi \mid m),$$

where $\phi$ is thought of as an inverse temperature, with $\phi \in [0; 1]$. When $\phi = 0$, $q_0(\psi) = p(\psi \mid m)$, whereas when $\phi = 1$, $q_1(\psi) = p(y^T \mid \psi, m)\, p(\psi \mid m)$. The corresponding normalizing constants are $Z_0 = 1$ and $Z_1 = p(y^T \mid m)$, respectively. Therefore, the difference $L = \ln Z_1 - \ln Z_0$ is the natural logarithm of the marginal likelihood, which can be written as $L = \int_0^1 (\partial \ln Z_\phi / \partial\phi)\, d\phi$, where $Z_\phi = \int_\Psi q_\phi(\psi)\, d\psi$. It can be proved (Friel and Pettitt, 2008) that $\partial \ln Z_\phi / \partial\phi = \int_\Psi p_\phi(\psi) \ln p(y^T \mid \psi, m)\, d\psi$, which is an expectation with respect to $p_\phi$ ($p_\phi(\psi) = q_\phi(\psi)/Z_\phi$):

$$\frac{\partial \ln Z_\phi}{\partial \phi} = E_{p_\phi}\left[\ln p(y^T \mid \psi, m)\right].$$

Hence, the natural logarithm of the marginal likelihood is

$$L = \int_0^1 E_{p_\phi}\left[\ln p(y^T \mid \psi, m)\right] d\phi.$$

For any value $\phi \in [0; 1]$, we can obtain an estimate $\hat{E}_{p_\phi}[\ln p(y^T \mid \psi, m)]$ of $E_{p_\phi}[\ln p(y^T \mid \psi, m)]$ by using the EMC draws generated from the power posterior $q_\phi(\psi)$. If we run an EMC algorithm with $N$ chains and an inverse temperature ladder such that $0 = \phi_1 < \cdots < \phi_j < \cdots < \phi_N = 1$, we can obtain an estimate $\hat{L}$ of $L$ by applying the trapezoidal rule:

$$\hat{L} = \sum_{j=1}^{N-1} (\phi_{j+1} - \phi_j)\, \frac{\hat{E}_{p_{\phi_j}}\left[\ln p(y^T \mid \psi, m)\right] + \hat{E}_{p_{\phi_{j+1}}}\left[\ln p(y^T \mid \psi, m)\right]}{2}.$$

In this case, the inverse temperature ladder is applied to the likelihood only, while the priors always have temperature equal to one. The inverse temperatures from $\phi_1 = 0$ to $\phi_N = 1$ define a path from $p_0(\psi)$ to $p_1(\psi)$, hence the name path sampling (Gelman and Meng, 1998).
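A sketch of the resulting estimator is given below, assuming loglik_draws[j] holds the sampled values of $\ln p(y^T \mid \psi, m)$ at inverse temperature phis[j]; the geometric schedule is the one of Friel and Pettitt (2008) discussed in Section 3.4.

```python
import numpy as np

N = 12
phis = (np.arange(N) / (N - 1)) ** 5        # phi_D = [(D - 1)/(N - 1)]^5

def log_marginal_likelihood(phis, loglik_draws):
    """Trapezoidal-rule estimate of L from power-posterior draws."""
    E = np.array([d.mean() for d in loglik_draws])  # E-hat at each rung
    return np.sum(np.diff(phis) * (E[:-1] + E[1:]) / 2.0)
```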

3.4. Tuning the EMC algorithms

An efficient use of EMC demands that some factors are tuned suitably. The number of independent configurations $N$ must be sufficiently large to diversify the configurations $\psi_D$'s and, so, to encourage the mixing of the chains. The temperatures are used to act on the diversity of the population, by increasing or decreasing the similarities of the densities. The temperature ladder should allow the crossover and the exchange operators both to eliminate bad samples from the population and to improve the quality of the good ones. The probability $\varepsilon$ must balance the intervention of the three operators, pursuing both the convergence and the mixing of the parallel chains. The diagonal entries of the covariance matrix of the Gaussian noise appearing in the random walks of the mutation and crossover operators should be chosen to reach an acceptance rate between 0.2 and 0.5. Also the threshold $\nu$ of the asymmetric crossover operator should facilitate the evolution of the different individuals and produce suitable acceptance rates. Vector $\psi_N$ is saved at every $w$th iteration after the burn-in, where $w$ is sufficiently large to allow $\psi_N$ to be updated by at least one operator in $w$ iterations.

The questions of the optimal number of chains and the appropriate temperature ladders are still open. Guidelines for setting the temperature ladder are given by Liang and Wong (2001): temperatures are equally spaced between the lowest temperature, usually equal to one (they also considered 0.1), and the highest temperature, which can be 5, 15, or 20, with a population size of 20 or 40. More complicated temperature ladders are proposed by Jasra et al. (2007a) and Goswami and Liu (2007). In a variable selection problem for linear regression models (with the number of variables much greater than the number of observations), Bottolo and Richardson (2010) designed a temperature ladder with four chains and an automatic tuning of the temperatures. In the evaluation of the marginal likelihood, Friel and Pettitt (2008) and Calderhead and Girolami (2009) set a geometric-based inverse temperature schedule on $[0, 1]$, with partitions clustered towards $\phi_1 = 0$: $\phi_D = [(D-1)/(N-1)]^5$, with $D = 1, \dots, N$. The selection of the inverse temperatures for computing the marginal likelihood with separated chains is discussed also in Friel et al. (2014), Hug et al. (2016), and Vitoratou and Ntzoufras (2017).

In the case of NH-HMMs, we use the following temperature ladders. For parameter estimation and variable selection, the temperature ladder is obtained by an exponential formula: $\tau_D = h_1^{(N-D)h_2}$ ($h_1; h_2 > 0$, $D = 1, \dots, N$). The lowest temperature $\tau_N$ is one, whereas the exponential progression of the others allows the crossover and the exchange operators both to eliminate bad samples from the population and to improve the quality of the good ones. On the other hand, for the marginal likelihood, we follow the guidelines of Friel and Pettitt (2008) and Calderhead and Girolami (2009).


4. Simulated examples

We illustrate the performance of our method on simulated examples, with two, three, and four hidden states (i.e., $m = 2, 3, 4$). We considered synthetic examples with $T = 1{,}000$ observations and five covariates (i.e., $n = 5$). We started with five uncorrelated covariates generated from MVNs of dimension five, whose marginal variables had means equal to zero and variances equal to one. Then, the correlation between covariates was increased. Correlated covariates were obtained as follows (Paroli and Spezia, 2008a): we set the correlations between variables $(z_2, z_4)$ and $(z_1, z_5)$, while the correlations between the other variables were random values smaller in modulus than the previous ones. We considered correlations equal to 0.3, 0.7, and 0.9. The observed process was either MVN of dimension two, with mean vectors

$$\mu_1 = (0.5, 2), \quad \mu_2 = (3, 4.5), \quad \mu_3 = (7, 9), \quad \mu_4 = (10, 12),$$

and covariance matrices

$$\Sigma_1 = \begin{bmatrix} 1.1 & 0.5 \\ 0.5 & 0.8 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.25 & 0.6 \\ 0.6 & 1.0 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.8 & 0.4 \\ 0.4 & 0.6 \end{bmatrix}, \quad \Sigma_4 = \begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix},$$

or dimension three, with mean vectors

$$\mu_1 = (0.5, 2, 1), \quad \mu_2 = (3, 4.5, 6), \quad \mu_3 = (7, 10, 9), \quad \mu_4 = (10, 12, 11),$$

and covariance matrices

$$\Sigma_1 = \begin{bmatrix} 1.0 & 0.5 & 0.4 \\ 0.5 & 0.8 & 0.4 \\ 0.4 & 0.4 & 1.1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.25 & 0.6 & 0.3 \\ 0.6 & 1.0 & 0.6 \\ 0.3 & 0.6 & 1.1 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.8 & 0.4 & 0.2 \\ 0.4 & 0.6 & 0.4 \\ 0.2 & 0.4 & 1.0 \end{bmatrix}, \quad \Sigma_4 = \begin{bmatrix} 1.0 & 0.5 & 0.4 \\ 0.5 & 1.0 & 0.6 \\ 0.4 & 0.6 & 0.8 \end{bmatrix}.$$

The label switching problem was tackled by assuming the means of the first component to be in increasing order. For $m = 2$, the true model contained covariates $z_1$ and $z_2$ in state 1 and covariates $z_3$, $z_4$, and $z_5$ in state 2; the values of the parameters were:

        βj,i,0   z1     z2     z3     z4     z5
β1,2    −1.2     1.9    1.7    0      0      0
β2,1    −1.3     0      0      1.75   1.6    1.8

        1    z1   z2   z3   z4   z5
H1      1    1    1    0    0    0
H2      1    0    0    1    1    1

For m = 3, the true model contained covariates z1 and z2 in state 1, covariates z3 , z4 , and z5 in state 2, and covariates z4 and z5 in state 3; the values of the parameters were:

        βj,i,0   z1     z2     z3     z4     z5
β1,2    −1.2     1.9    1.7    0      0      0
β1,3    −1.65    1.5    1.45   0      0      0
β2,1    −1.3     0      0      1.75   1.6    1.8
β2,3    −1.55    0      0      2      1.75   1.6
β3,1    −1.6     0      0      0      1.7    2
β3,2    −2       0      0      0      1.65   1.45

        1    z1   z2   z3   z4   z5
H1      1    1    1    0    0    0
H2      1    0    0    1    1    1
H3      1    0    0    0    1    1

For m = 4, the true model contained covariates z1 and z2 in state 1, covariates z1 , z2 , and z3 in state 2, covariates z3 , z4 and z5 in state 3, and covariates z4 and z5 in state 4; the values of the parameters were:

        βj,i,0   z1     z2     z3     z4     z5
β1,2    −1.2     1.9    1.7    0      0      0
β1,3    −1.65    1.5    1.45   0      0      0
β1,4    −1.9     1.85   2      0      0      0
β2,1    −1.3     1.75   1.6    1.8    0      0
β2,3    −1.55    2      1.75   1.6    0      0
β2,4    −2.15    1.7    1.7    1.85   0      0
β3,1    −1.8     0      0      1.65   1.6    1.95
β3,2    −1.7     0      0      1.65   2      1.6
β3,4    −1.95    0      0      1.5    2.1    1.9
β4,1    −1.6     0      0      0      1.7    2
β4,2    −2       0      0      0      1.65   1.45
β4,3    −1.75    0      0      0      1.4    2

        1    z1   z2   z3   z4   z5
H1      1    1    1    0    0    0
H2      1    1    1    1    0    0
H3      1    0    0    1    1    1
H4      1    0    0    0    1    1

A compact simulation sketch of one of these designs is given below.
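The sketch below simulates the $m = 2$, $p = 2$, $T = 1{,}000$ design above; for simplicity it sets only the two stated correlations, with the remaining ones equal to zero rather than to smaller random values.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, rho = 1000, 5, 0.3

# Covariates: corr(z2, z4) = corr(z1, z5) = rho, zero elsewhere here
R = np.eye(n)
R[1, 3] = R[3, 1] = R[0, 4] = R[4, 0] = rho
Z = np.column_stack([np.ones(T - 1),
                     rng.multivariate_normal(np.zeros(n), R, size=T - 1)])

beta = np.zeros((2, 2, n + 1))
beta[0, 1] = [-1.2, 1.9, 1.7, 0, 0, 0]      # row 1: z1, z2 active
beta[1, 0] = [-1.3, 0, 0, 1.75, 1.6, 1.8]   # row 2: z3, z4, z5 active
mus = np.array([[0.5, 2.0], [3.0, 4.5]])
Sigmas = np.array([[[1.1, 0.5], [0.5, 0.8]],
                   [[1.25, 0.6], [0.6, 1.0]]])

x = np.zeros(T, dtype=int)
y = np.empty((T, 2))
y[0] = rng.multivariate_normal(mus[x[0]], Sigmas[x[0]])
for t in range(1, T):
    logits = beta[x[t - 1]] @ Z[t - 1]      # multinomial logit, Eq. (2)
    w = np.exp(logits - logits.max())
    x[t] = rng.choice(2, p=w / w.sum())
    y[t] = rng.multivariate_normal(mus[x[t]], Sigmas[x[t]])
```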


Table 1. Results of the simulations for NH-HMMs with two ($m = 2$) hidden states. The observed process was MVN of dimension two ($p = 2$) or three ($p = 3$). The number of covariates was five. The correlations between the second and the fourth covariate and between the first and the fifth were 0, 0.3, 0.7, 0.9. JPP is the joint posterior probability of the correct set of covariates in each state, which was always the highest; MPMP are the marginal posterior median probabilities of the five covariates, which were always greater than 0.5 for the covariates to be included and smaller for those to be excluded. By either criterion, the correct model was always selected.

p = 2:
  corr = 0     state 1: JPP 0.90; MPMP 1.00, 0.94, 0.05, 0.10, 0.04
               state 2: JPP 0.82; MPMP 0.06, 0.11, 0.87, 0.89, 0.87
  corr = 0.3   state 1: JPP 0.76; MPMP 0.92, 0.81, 0.07, 0.06, 0.11
               state 2: JPP 0.87; MPMP 0.05, 0.12, 0.94, 0.91, 0.96
  corr = 0.7   state 1: JPP 0.74; MPMP 0.94, 0.82, 0.06, 0.12, 0.09
               state 2: JPP 0.80; MPMP 0.14, 0.05, 0.82, 0.94, 0.88
  corr = 0.9   state 1: JPP 0.78; MPMP 0.89, 0.84, 0.19, 0.21, 0.09
               state 2: JPP 0.72; MPMP 0.19, 0.35, 0.76, 0.81, 0.77

p = 3:
  corr = 0     state 1: JPP 0.76; MPMP 0.85, 0.91, 0.02, 0.04, 0.09
               state 2: JPP 0.90; MPMP 0.03, 0.00, 0.96, 1.00, 0.94
  corr = 0.3   state 1: JPP 0.60; MPMP 0.74, 0.72, 0.21, 0.11, 0.12
               state 2: JPP 0.85; MPMP 0.09, 0.06, 0.91, 0.94, 0.91
  corr = 0.7   state 1: JPP 0.81; MPMP 0.87, 0.82, 0.07, 0.21, 0.14
               state 2: JPP 0.59; MPMP 0.21, 0.24, 0.64, 0.72, 0.68
  corr = 0.9   state 1: JPP 0.54; MPMP 0.66, 0.67, 0.21, 0.31, 0.19
               state 2: JPP 0.70; MPMP 0.09, 0.12, 0.79, 0.77, 0.76

Table 2. Results of the simulations for NH-HMMs with three ($m = 3$) hidden states. The observed process was MVN of dimension two ($p = 2$) or three ($p = 3$). The number of covariates was five. The correlations between the second and the fourth covariate and between the first and the fifth were 0, 0.3, 0.7, 0.9. JPP and MPMP are as in Table 1; by either criterion, the correct model was always selected.

p = 2:
  corr = 0     state 1: JPP 0.80; MPMP 0.86, 0.91, 0.08, 0.17, 0.07
               state 2: JPP 0.82; MPMP 0.19, 0.21, 0.88, 0.98, 0.89
               state 3: JPP 0.88; MPMP 0.27, 0.08, 0.04, 0.90, 0.95
  corr = 0.3   state 1: JPP 0.69; MPMP 0.77, 0.72, 0.10, 0.31, 0.14
               state 2: JPP 0.81; MPMP 0.21, 0.05, 0.86, 0.94, 0.87
               state 3: JPP 0.78; MPMP 0.12, 0.27, 0.06, 0.85, 0.82
  corr = 0.7   state 1: JPP 0.74; MPMP 0.81, 0.90, 0.88, 0.14, 0.09
               state 2: JPP 0.62; MPMP 0.25, 0.19, 0.76, 0.68, 0.70
               state 3: JPP 0.70; MPMP 0.17, 0.21, 0.12, 0.74, 0.88
  corr = 0.9   state 1: JPP 0.64; MPMP 0.69, 0.72, 0.34, 0.21, 0.09
               state 2: JPP 0.82; MPMP 0.19, 0.17, 0.90, 0.86, 0.86
               state 3: JPP 0.82; MPMP 0.07, 0.25, 0.23, 0.85, 0.92

p = 3:
  corr = 0     state 1: JPP 0.75; MPMP 0.85, 0.78, 0.19, 0.09, 0.12
               state 2: JPP 0.90; MPMP 0.04, 0.02, 0.94, 0.96, 1.00
               state 3: JPP 0.79; MPMP 0.11, 0.10, 0.14, 0.84, 0.90
  corr = 0.3   state 1: JPP 0.82; MPMP 0.88, 0.95, 0.09, 0.07, 0.15
               state 2: JPP 0.62; MPMP 0.27, 0.17, 0.68, 0.80, 0.71
               state 3: JPP 0.79; MPMP 0.14, 0.09, 0.10, 0.85, 0.84
  corr = 0.7   state 1: JPP 0.60; MPMP 0.67, 0.69, 0.30, 0.11, 0.19
               state 2: JPP 0.74; MPMP 0.21, 0.27, 0.84, 0.80, 0.89
               state 3: JPP 0.76; MPMP 0.31, 0.02, 0.16, 0.81, 0.85
  corr = 0.9   state 1: JPP 0.52; MPMP 0.54, 0.75, 0.41, 0.21, 0.32
               state 2: JPP 0.54; MPMP 0.14, 0.37, 0.70, 0.61, 0.58
               state 3: JPP 0.60; MPMP 0.12, 0.17, 0.27, 0.74, 0.70

In all examples, the coefficients $\beta_{j,i,0}$ were set equal to zero for any $j, i \in S_x$ ($i \neq j$). The following tuning factors were set: the number of configurations was $N = 6$; the temperature ladder was obtained by using $h_1 = 1.75$ and $h_2 = 2$; the probability $\varepsilon$ was 0.4, while the threshold $\nu$ was 0.2. Given the tuning factors, the EMC algorithm was run for 1,500,000 iterations, after a burn-in of 100,000 iterations, thinning the sample at every $w = 60$ iterations (3,000,000; 100,000; 120 for the marginal likelihoods). Indicators were estimated by both the joint posterior mode and the marginal posterior median. The results are presented in Tables 1 ($m = 2$), 2 ($m = 3$), and 3 ($m = 4$). The joint posterior mode is the vector of indicators associated with the highest posterior probability, i.e. that with the most frequent appearance in the EMC sample. From Tables 1, 2, and 3, we can see that the correct model was always selected by the joint posterior probability criterion. By the marginal posterior median probabilities of the covariates in each state, the best model includes all covariates whose marginal posterior probability is greater than 0.5. From Tables 1, 2, and 3 we can also see that via the marginal posterior median criterion the correct model was always selected.

In order to see how the method works with a smaller sample size, the example with $m = 3$, $p = 3$, and two correlations equal to 0.3 was re-run considering sequences of 500 and 100 observations. For $T = 500$, we obtained results similar to the case with $T = 1000$. On the other hand, for $T = 100$ all possible combinations of covariates were visited and a few of them had a joint posterior probability very close to the highest. Looking at the marginal median posterior probabilities, a further variable was included in two states. Then, in order to see how the method works with a larger number of covariates, the example with $m = 3$, $p = 3$, and four correlations equal to 0.3 was re-run considering ten covariates, with two covariates included in state 1, three in state 2, and four in state 3. The correct covariates were selected in all three states by the marginal median posterior probabilities. The same example was run also with four correlations equal to 0.7. In states 1 and 3 the correct covariates were selected, while in state 2 a further covariate was included.

Next, in order to see how the method influences the choice of the number of hidden states, the observations generated in the example with $T = 1000$, $m = 3$, $p = 3$, $n = 5$, and two correlations equal to 0.3 were also analysed by the models


Table 3. Results of the simulations for NH-HMMs with four ($m = 4$) hidden states. The observed process was MVN of dimension two ($p = 2$) or three ($p = 3$). The number of covariates was five. The correlations between the second and the fourth covariate and between the first and the fifth were 0, 0.3, 0.7, 0.9. JPP and MPMP are as in Table 1.

p = 2:
  corr = 0     state 1: JPP 0.64; MPMP 0.72, 0.80, 0.41, 0.19, 0.30
               state 2: JPP 0.72; MPMP 0.76, 0.77, 0.89, 0.29, 0.31
               state 3: JPP 0.66; MPMP 0.42, 0.14, 0.20, 0.70, 0.76
               state 4: JPP 0.66; MPMP 0.25, 0.39, 0.37, 0.74, 0.85
  corr = 0.3   state 1: JPP 0.58; MPMP 0.65, 0.69, 0.47, 0.39, 0.27
               state 2: JPP 0.61; MPMP 0.70, 0.79, 0.19, 0.27, 0.35
               state 3: JPP 0.73; MPMP 0.31, 0.15, 0.76, 0.91, 0.80
               state 4: JPP 0.60; MPMP 0.14, 0.29, 0.42, 0.85, 0.69
  corr = 0.7   state 1: JPP 0.53; MPMP 0.65, 0.69, 0.19, 0.44, 0.41
               state 2: JPP 0.53; MPMP 0.61, 0.60, 0.22, 0.11, 0.45
               state 3: JPP 0.60; MPMP 0.32, 0.38, 0.43, 0.67, 0.71
               state 4: JPP 0.60; MPMP 0.22, 0.46, 0.41, 0.65, 0.69
  corr = 0.9   state 1: JPP 0.52; MPMP 0.58, 0.62, 0.21, 0.39, 0.43
               state 2: JPP 0.57; MPMP 0.64, 0.68, 0.17, 0.41, 0.32
               state 3: JPP 0.58; MPMP 0.07, 0.42, 0.29, 0.64, 0.70
               state 4: JPP 0.53; MPMP 0.22, 0.29, 0.45, 0.72, 0.58

p = 3:
  corr = 0     state 1: JPP 0.58; MPMP 0.61, 0.72, 0.44, 0.39, 0.21
               state 2: JPP 0.63; MPMP 0.69, 0.89, 0.70, 0.19, 0.30
               state 3: JPP 0.59; MPMP 0.12, 0.40, 0.70, 0.67, 0.68
               state 4: JPP 0.62; MPMP 0.22, 0.39, 0.03, 0.71, 0.79
  corr = 0.3   state 1: JPP 0.61; MPMP 0.70, 0.64, 0.12, 0.31, 0.27
               state 2: JPP 0.55; MPMP 0.59, 0.69, 0.42, 0.14, 0.23
               state 3: JPP 0.58; MPMP 0.39, 0.17, 0.67, 0.74, 0.62
               state 4: JPP 0.56; MPMP 0.11, 0.43, 0.09, 0.64, 0.64
  corr = 0.7   state 1: JPP 0.62; MPMP 0.72, 0.74, 0.08, 0.39, 0.27
               state 2: JPP 0.54; MPMP 0.61, 0.67, 0.59, 0.32, 0.44
               state 3: JPP 0.56; MPMP 0.39, 0.29, 0.70, 0.63, 0.74
               state 4: JPP 0.54; MPMP 0.39, 0.27, 0.37, 0.59, 0.64
  corr = 0.9   state 1: JPP 0.55; MPMP 0.69, 0.70, 0.29, 0.12, 0.41
               state 2: JPP 0.55; MPMP 0.62, 0.73, 0.68, 0.29, 0.41
               state 3: JPP 0.52; MPMP 0.48, 0.31, 0.29, 0.56, 0.54
               state 4: JPP 0.56; MPMP 0.40, 0.20, 0.37, 0.61, 0.64

with $m = 2$ and $m = 4$. After selecting the covariates, the logarithm of the marginal likelihood was computed (with $N = 12$): −2736.927 for $m = 2$, −2648.911 for $m = 3$, and −2955.320 for $m = 4$. Thus, the correct model was chosen, after the selection of the correct covariates.

Finally, because the mutation and the random walk crossover operators incorporate the main ideas of MKMK (Paroli and Spezia, 2008a), that is, the Metropolized version of the method of Kuo and Mallick (1998), the EMC algorithm was compared with these two techniques and also with the GVS of Dellaportas et al. (2000), which is based on KM as well (see Paroli and Spezia (2008a) for details on the various algorithms in the case of univariate Normal NH-HMMs). We considered the series generated with $m = 3$, $p = 3$, $T = 1{,}000$, $n = 5$, and two correlations equal to 0.3. As shown in Table 2, for the EMC we obtained joint posterior probabilities equal to 0.82, 0.62, and 0.79, for states 1, 2, and 3, respectively. With MKMK we had similar, although slightly lower, results (0.77, 0.54, 0.73): this is not surprising because two operators in the EMC are based on MKMK. The great advantage of EMC is its ability to also compute the marginal likelihood and so tackle the model selection issue. On the other hand, KM and GVS identified the correct model, but with much lower joint posterior probabilities (0.12, 0.08, 0.09, and 0.21, 0.17, 0.18, respectively).

5. Application to ozone data

As an example of the methodology presented in the previous sections, a multiple time series of ozone (O3) data was analysed by applying the NH-HMM $(\{y_t\}; \{x_t\})$ of Section 2 and the EMC algorithm of Section 3. The data set was made up of three time series of O3 recorded, along with a few meteorological variables and other pollutants, by monitoring stations located in Venice (Italy). We considered the natural logarithm of the daily maximum hourly mean concentrations (in µg/m3) of O3 recorded by stations placed in Mestre, Marghera, and the Isle of Giudecca, respectively, over the years 2001, 2002, 2003 (1095 observations, Fig. 1). The covariates we used were the daily maximum hourly mean concentrations (in µg/m3) of nitrogen monoxide (NO) and dioxide (NO2), carbon monoxide (CO), and sulphur dioxide (SO2), the daily maximum hourly mean wind speed (in m/s), the daily maximum hourly mean temperature (in °C), the daily maximum hourly mean relative humidity (in %), the daily maximum hourly mean atmospheric pressure (in mbar), the daily maximum hourly mean precipitation (in cm of rain), and the daily maximum hourly mean solar radiation (in mW/cm2). All these covariates were standardized before the analysis, except the pressure, which was transformed into decimal points. We also considered two seasonal covariates, sine and cosine, to model a harmonic periodic component with a peak in the Summer and a trough in the Winter:

$$\rho_1 \cos(2\pi t/365) + \rho_2 \sin(2\pi t/365).$$

The same binary indicator was used for both of them; thus, both the sine and the cosine were included or excluded at the same time.

Fig. 1. Time series of the three natural logarithms of the ozone concentrations (recorded in Mestre, Marghera, and on the Isle of Giudecca) and the estimated sequence of hidden states.

The Bayesian analysis was developed along three consecutive steps: (i) for any NH-HMM with a number of states up to six, the covariates affecting the hidden process were selected; (ii) for any NH-HMM with a number of states up to six, given the covariates selected at the previous step, the marginal likelihood was computed in order to choose the number of hidden states; (iii) given the best model, parameter estimation, hidden chain reconstruction, and classification of the multiple time series were performed. The label switching problem was tackled by assuming a priori that the means of the third series were in increasing order, except when the marginal likelihood was computed, because it could be obtained by sampling from an unconstrained posterior (Frühwirth-Schnatter, 2004). Graphical analyses confirmed that the constraint on the means of the third series respected the geometry of the posterior density (Frühwirth-Schnatter, 2001).

The algorithms were always run using the following tuning factors. The number of configurations was either $N = 6$ in step (i) or $N = 12$ in steps (ii) and (iii); the temperature ladders were obtained by using $h_1 = 1.75$ and $h_2 = 2$ in step (i), whereas $h_1 = 1.5$ and $h_2 = 1.25$ in step (iii); the probability $\varepsilon$ was 0.4, while the threshold $\nu$ was 0.2. Given the tuning factors, the EMC algorithm was run for 3,000,000 iterations, after a burn-in of 100,000 iterations, thinning the sample at every $w = 120$ iterations (1,500,000; 100,000; 120 in step (i)). The mixing and the convergence of the algorithms were always good; the following acceptance rates were obtained for the four operators (step (iii) of the analysis): 0.34 for the mutation (0.32 for the $N$th population); 0.46 for the random walk crossover; 0.24 for the asymmetric crossover; and 0.43 for the exchange.

Given the six different NH-HMMs we compared, Table 4 shows the exogenous variables $H_j$ selected for each state $j$ ($j = 1, \dots, m$, where $m$ was set up to six) of any model at the first stage of the analysis. The covariates were selected through the median model criterion of Barbieri and Berger (2004). Table 4 also shows the natural logarithms of the marginal likelihoods computed over the second stage of the analysis, where the EMC algorithms were run only with the covariates selected at the previous step, that is, setting their respective indicators equal to one. From Table 4, we can observe that the best model (corresponding to the highest logarithm of the marginal likelihood) is that with four hidden states and the following covariates: NO, NO2, wind speed, temperature, and solar radiation. In fact, O3, unlike other pollutants, is not emitted directly into the atmosphere, but is a secondary pollutant produced by a reaction among volatile organic compounds (not available in our data set) and NO and NO2, in the presence of sunlight and high temperature. Sunlight provides the energy to initiate O3 formation; consequently, high levels of O3 are generally observed during hot, still, sunny summertime weather. On the other hand, wind can clear the air, reducing the level of pollution. NO and NO2 were selected in the first row of the matrix of the logistic functions (5), which represent the time-varying transition probabilities; NO, NO2, and T in the second row; and NO, NO2, WS, T, and SR in the third and fourth rows.

For a further check on the selection procedure, we generated five standardized Normal additional synthetic covariates and repeated the stochastic selection procedure along with the five already selected variables: the same results were obtained. We also compared our variable selection method with those considered in the previous section, i.e. KM (Kuo and Mallick, 1998), GVS (Dellaportas et al., 2000), and MKMK (Paroli and Spezia, 2008a).

Table 4
Vectors of the state-dependent binary indicators H (one row per state) for NO, NO2, CO, SO2, wind speed, temperature, relative humidity, atmospheric pressure, precipitation, solar radiation, and seasonality, respectively; natural logarithms of the marginal likelihoods (ln marg lik) of the corresponding models. The best model is identified by the highest ln marg lik.

m    H                                    ln marg lik
1    –                                    −3402.211
2    1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0      −2566.379
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
3    1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0      −2350.649
     1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
4    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0      −2324.781
     1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
5    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0      −2596.769
     1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0
     1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
6    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0      −2643.434
     1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0
     1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
     1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0
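The ln marg lik column of Table 4 rests on the power posterior (path sampling) identity of Gelman and Meng (1998) and Friel and Pettitt (2008), since each EMC chain targets a tempered posterior. A minimal Python sketch (not the paper's code), assuming draws of the log-likelihood are available from every rung of a ladder with endpoints t = 0 (prior) and t = 1 (posterior), is:

import numpy as np

# Path sampling identity: ln m(y) = integral_0^1 E_{theta|y,t}[ln p(y|theta)] dt,
# where chain i targets the power posterior p(y|theta)^{t_i} p(theta).
# `ladder` is the increasing temperature ladder; `loglik_draws[i]` holds the
# (hypothetical) log-likelihood values sampled by chain i.
def log_marginal_likelihood(ladder, loglik_draws):
    e = np.array([np.mean(d) for d in loglik_draws])     # E_t[ln p(y|theta)] per rung
    widths = np.diff(ladder)
    return float(np.sum(widths * 0.5 * (e[:-1] + e[1:])))  # trapezoidal rule in t

# Toy usage with fabricated draws from three rungs:
rng = np.random.default_rng(0)
ladder = np.array([0.0, 0.5, 1.0])
draws = [rng.normal(-2500 + 100 * t, 5.0, size=500) for t in ladder]
print(log_marginal_likelihood(ladder, draws))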

Table 5
Estimates of the parameters of the NH-HMM with four hidden states, whose time-varying transition probabilities were logistic functions of NO, NO2, WS, T, and SR.

Logistic regression coefficients:
β1,2,0 = −4.577    β1,2,1 = 0.541     β1,2,2 = 0.836
β1,3,0 = −5.434    β1,3,1 = 0.943     β1,3,2 = 0.591
β1,4,0 = −3.715    β1,4,1 = 0.372     β1,4,2 = 0.431
β2,1,0 = −2.484    β2,1,1 = 0.768     β2,1,2 = 0.910     β2,1,4 = −1.269
β2,3,0 = −0.471    β2,3,1 = 1.873     β2,3,2 = 1.945     β2,3,4 = 0.886
β2,4,0 = −3.997    β2,4,1 = 0.227     β2,4,2 = 1.253     β2,4,4 = 1.153
β3,1,0 = −4.857    β3,1,1 = 1.492     β3,1,2 = −0.683    β3,1,3 = −0.933    β3,1,4 = −2.695    β3,1,5 = 0.588
β3,2,0 = 3.107     β3,2,1 = −2.605    β3,2,2 = −0.160    β3,2,3 = 0.873     β3,2,4 = 0.780     β3,2,5 = −4.160
β3,4,0 = −2.309    β3,4,1 = 1.185     β3,4,2 = 0.377     β3,4,3 = 1.136     β3,4,4 = 0.219     β3,4,5 = 1.804
β4,1,0 = −4.286    β4,1,1 = −0.256    β4,1,2 = −0.334    β4,1,3 = −1.635    β4,1,4 = −1.549    β4,1,5 = −0.434
β4,2,0 = −5.325    β4,2,1 = −0.180    β4,2,2 = −0.013    β4,2,3 = −1.500    β4,2,4 = −0.451    β4,2,5 = −0.090
β4,3,0 = −4.287    β4,3,1 = −1.702    β4,3,2 = −0.448    β4,3,3 = −2.371    β4,3,4 = −3.537    β4,3,5 = −0.368

State-dependent means:
µ1,1 = 3.180    µ1,2 = 3.139    µ1,3 = 2.975
µ2,1 = 4.334    µ2,2 = 4.384    µ2,3 = 3.810
µ3,1 = 4.463    µ3,2 = 4.396    µ3,3 = 4.132
µ4,1 = 4.609    µ4,2 = 4.180    µ4,3 = 4.594

State-dependent variances (σ²j,k) and covariances (σj,k,l):
σ²1,1 = 0.554    σ1,1,2 = 0.363    σ1,1,3 = 0.317    σ²1,2 = 0.658    σ1,2,3 = 0.405    σ²1,3 = 0.897
σ²2,1 = 0.087    σ2,1,2 = 0.076    σ2,1,3 = 0.069    σ²2,2 = 0.127    σ2,2,3 = 0.078    σ²2,3 = 0.324
σ²3,1 = 0.084    σ3,1,2 = 0.038    σ3,1,3 = 0.029    σ²3,2 = 0.057    σ3,2,3 = 0.047    σ²3,3 = 0.201
σ²4,1 = 0.150    σ4,1,2 = 0.194    σ4,1,3 = 0.057    σ²4,2 = 0.451    σ4,2,3 = 0.086    σ²4,3 = 0.080
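To make the coefficients in Table 5 concrete, the sketch below (Python, not the paper's code) evaluates one row of the time-varying transition matrix. It relies on two stated assumptions: the regressor index h runs over the intercept (0) and the selected covariates NO (1), NO2 (2), WS (3), T (4), and SR (5), as the pattern of selected covariates per row suggests; and the diagonal element is the reference category of the multinomial logit, consistent with the absence of βj,j entries.

import numpy as np

# Row j (0-indexed) of the transition matrix at one time point. `beta` maps a
# destination state k != j to its coefficient vector (intercept first),
# matching `x`, which stacks a leading 1 and the covariate values at time t.
def transition_row(j, m, beta, x):
    logits = {k: float(np.dot(beta[k], x)) for k in beta}   # x' beta_{j,k}
    denom = 1.0 + sum(np.exp(v) for v in logits.values())   # diagonal = reference
    row = np.empty(m)
    for k in range(m):
        row[k] = np.exp(logits[k]) / denom if k in logits else 1.0 / denom
    return row

# First row of the estimated model: only NO and NO2 are selected, so each
# coefficient vector is (beta_{1,k,0}, beta_{1,k,1}, beta_{1,k,2}) from Table 5.
beta_row1 = {1: [-4.577, 0.541, 0.836],   # state 1 -> 2
             2: [-5.434, 0.943, 0.591],   # state 1 -> 3
             3: [-3.715, 0.372, 0.431]}   # state 1 -> 4
x_t = np.array([1.0, 0.3, -0.1])          # hypothetical standardized covariates at time t
print(transition_row(0, 4, beta_row1, x_t))   # probabilities summing to 1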

We also compared our variable selection method with those considered in the previous section, i.e. KM (Kuo and Mallick, 1998), GVS (Dellaportas et al., 2000), and MKMK (Paroli and Spezia, 2008a). The variables selected through EMC always had a marginal posterior inclusion probability greater than 0.80, whereas for the other techniques the selection probability was usually lower. In fact, EMC selected NO and NO2 in state 1 (with probabilities 0.99 and 0.96, respectively); NO, NO2, and T in state 2 (1.00, 0.98, 0.91); and NO, NO2, WS, T, and SR in states 3 (0.95, 0.98, 0.88, 0.94, 0.87) and 4 (0.97, 1.00, 0.82, 0.89, 0.84). On the other hand, KM selected NO and NO2 in state 1 (0.79, 0.59), NO and T in states 2 (0.87, 0.54) and 3 (0.84, 0.61), and NO, NO2, and T in state 4 (0.81, 0.57, 0.57); GVS selected NO and NO2 in states 1 (0.85, 0.67) and 2 (0.925, 0.74), NO, T, and SR in state 3 (0.91, 0.53, 0.61), and NO, NO2, and SR in state 4 (0.89, 0.63, 0.53); MKMK selected NO, NO2, and SR in all states (0.92, 0.96, 0.54; 0.94, 0.90, 0.51; 1.00, 0.95, 0.54; 0.89, 0.93, 0.52). Finally, we could estimate the parameters of the best model and reconstruct the sequence of the hidden states. The estimates we obtained are listed in Table 5. We can observe that the values of the state-dependent marginal means increase as the labels of the states increase; thus the four states can represent increasing levels of pollution and be used as a synthetic index of ozone pollution. The sequence of the hidden states is shown in Fig. 1 (bottom right).


Fig. 2. Classification of the observations (natural logarithms of the ozone concentrations) into the four hidden states. The first variable (left column) was recorded in Mestre, the second (central column) in Marghera, and the third (right column) on the Isle of Giudecca. The four states represent increasing levels of pollution, since the values of the state-dependent marginal means increase as the labels of the states increase.

The hidden Markov chain visited state one for 247 days, state two for 269, state three for 183, and state four for 396. The classification of the observations, for the three sites, into the four hidden states is shown in Fig. 2, where seasonality is also evident, with the highest values in the warmest periods.

6. Discussion

A novel evolutionary Monte Carlo (EMC) algorithm was proposed in order to stochastically select the exogenous variables affecting the time-varying transition probabilities of a non-homogeneous hidden Markov model (NH-HMM). The method was widely tested on synthetic examples and applied to a real multiple time series of ozone data recorded along with contemporary covariates. We demonstrated that it provides an efficient tool for Bayesian variable selection, model choice, parameter estimation, hidden chain reconstruction, and observation classification.

The major problem when dealing with Bayesian inference in NH-HMMs is the posterior multimodality, but we had evidence that our EMC algorithm could always traverse complicated posterior surfaces without becoming trapped in local basins of attraction. This happens because EMC processes a population of chains in parallel, with a different temperature attached to each chain.
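As an illustration of how the tempered chains share information, the following sketch shows a generic parallel tempering exchange step, assuming power posteriors in which only the likelihood is tempered; the exact form of the paper's exchange operator may differ.

import numpy as np

# Propose swapping the current states of two chains on rungs t_i and t_j.
# Each chain targets p(y | theta)^t p(theta); in the Metropolis swap ratio the
# prior cancels, leaving (t_i - t_j) * (lnL(theta_j) - lnL(theta_i)).
def exchange(theta_i, theta_j, t_i, t_j, loglik, rng):
    log_ratio = (t_i - t_j) * (loglik(theta_j) - loglik(theta_i))
    if np.log(rng.uniform()) < log_ratio:
        return theta_j, theta_i, True    # swap accepted
    return theta_i, theta_j, False       # swap rejected

# Toy usage with a hypothetical log-likelihood:
rng = np.random.default_rng(0)
ll = lambda th: -0.5 * th ** 2
print(exchange(0.2, 1.5, 1.0, 0.8, ll, rng))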


Our method generates draws from the posterior density via four operators: mutation, asymmetric crossover, random walk crossover, and exchange. All of them were designed to generate jointly both real values (the parameters of the hidden and the observed processes) and binary values (the covariate indicators). This strategy, based on the acceptance of a single block of values, greatly improves the mixing ability and the adaptive quality of the algorithm. EMC could also be applied to compute the natural logarithms of the marginal likelihoods of competing models with different numbers of hidden states and to perform model choice by means of Bayes factors. In fact, each marginal likelihood could be obtained by means of the EMC draws generated by the whole set of power posteriors associated with the chains.

NH-HMMs with multivariate Normal observations are special cases of Markov switching vector autoregressions (MSVARs; Krolzig, 1997), with autoregressive order equal to zero. An interesting future extension of the methods presented here would be the development of MSVARs with a positive autoregressive order. This generalization would allow not only classification of the observations into a few states, but also fitting of the actual values, for a better interpretation of the investigated phenomenon. In MSVARs, EMC may be used both to select the covariates to be included in the hidden process and to choose the autoregressive order of the observed process. Another major issue might be the prior setting for the parameters of the observed process. Current research of the author is focussed on parameterizing the matrices of the autoregressive coefficients and the covariance matrices of the noises in terms of the autocovariance matrices through the Yule–Walker equations and on inducing priors on them by extending the methodology of the trigonometric separation strategy (Spezia, 2019).

Finally, MSVARs have been used so far only for macroeconomic analyses and forecasting (e.g., Binder and Gross, 2013; Billio et al., 2013), but they can also be applied to other disciplines, such as environmental sciences and ecology. For example, Birkel et al. (2012) applied univariate Markov switching autoregressive models (MSARMs) to analyse separately two time series of water isotopes, deuterium and oxygen-18. It would be interesting to re-analyse the two time series jointly by MSVARs. Further, in ecology, MSARMs can be applied to animal movement. Observations obtained from bio-loggers can be multivariate and the time series vectorial: e.g., pressure, water temperature, and light level (Shepard et al., 2006); distance between locations and turning angle (Franke et al., 2004).

Acknowledgements

Luigi Spezia's research was funded by the Scottish Government's Rural and Environment Science and Analytical Services Division, United Kingdom. The data set analysed in the paper was kindly provided by Laboratorio ARPAV Provinciale di Venezia. Comments from Glenn Marion and two anonymous referees improved the quality of the final paper.

References

Ailliot, P., Bessac, J., Monbet, V., Pène, F., 2015. Non-homogeneous hidden Markov switching models for wind time series. J. Statist. Plann. Inference 160, 75–88.
Banachewicz, K., Lucas, A., van der Vaart, A., 2007. Modeling portfolio defaults using hidden Markov models with covariates. Econom. J. 10, 1–18.
Barbieri, M.M., Berger, J.O., 2004. Optimal predictive model selection. Ann. Statist. 32, 870–897.
Barnard, J., McCulloch, R., Meng, X.-L., 2000. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statist. Sinica 10, 1281–1311.
Bazzi, M., Blasques, F., Koopman, S.J., Lucas, A., 2017. Time-varying transition probabilities for Markov regime switching models. J. Time Series Anal. 38, 458–478.
Bellone, E., Hughes, J.P., Guttorp, P., 2000. A hidden Markov model for downscaling synoptic atmospheric patterns to precipitation amounts. Clim. Res. 15, 1–12.
Billio, M., Casarin, R., Ravazzolo, F., van Dijk, H.K., 2013. Interactions between eurozone and US booms and busts: a Bayesian panel Markov-switching VAR model. Norges Bank Research, Working Paper n. 20/2013.
Binder, M., Gross, M., 2013. Regime-switching global vector autoregressive models. European Central Bank, Working Paper Series, n. 1569.
Birkel, C., Paroli, R., Spezia, L., Dunn, S.M., Tetzlaff, D., Soulsby, C., 2012. A new approach to simulating stream isotope dynamics using Markov switching autoregressive models. Adv. Water Resour. 26, 308–316.
Blackmond-Laskey, K., Myers, J.M., 2003. Population Markov chain Monte Carlo. Mach. Learn. 26, 175–196.
Bottolo, L., Richardson, S., 2010. Evolutionary stochastic search for Bayesian model exploration. Bayesian Anal. 5, 583–618.
Calderhead, B., Girolami, M., 2009. Estimating Bayes factors via thermodynamic integration and population MCMC. Comput. Stat. Data Anal. 53, 4028–4045.
Carter, C.K., Kohn, R., 1994. On Gibbs sampling for state space models. Biometrika 81, 541–553.
Charles, S.P., Bates, B.C., Hughes, J.P., 1999. A spatiotemporal model for downscaling precipitation occurrence and amounts. J. Geophys. Res. 104, 31657–31669.
Dellaportas, P., Forster, J.J., Ntzoufras, I., 2000. Bayesian variable selection using the Gibbs sampler. In: Dey, D.K., Ghosh, S.K., Mallick, B.K. (Eds.), Generalized Linear Models: A Bayesian Perspective. Marcel Dekker, New York, pp. 273–286.
Diebolt, F.X., Lee, J.H., Weinbach, G.C., 1994. Regime switching with time varying transition probabilities. In: Hargreaves, C.P. (Ed.), Nonstationary Time Series Analysis and Cointegration. Oxford University Press, Oxford, pp. 283–302.
Filardo, A.J., 1994. Business-cycle phases and their transitional dynamics. J. Bus. Econom. Statist. 12, 299–308.
Filardo, A.J., Gordon, S.F., 1998. Business cycle duration. J. Econometrics 85, 99–123.
Franke, A., Caelli, T., Hudson, R.J., 2004. Analysis of movements and behaviour of caribou (Rangifer tarandus) using hidden Markov models. Ecol. Model. 173, 259–270.
Friel, N., Hurn, M., Wyse, J., 2014. Improving power posterior estimation of statistical evidence. Statist. Comput. 24, 709–723.
Friel, N., Pettitt, A.N., 2008. Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. Ser. B 70, 589–607.


Frühwirth-Schnatter, S., 1994. Data augmentation and dynamic linear models. J. Time Series Anal. 15, 183–202.
Frühwirth-Schnatter, S., 2001. Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. J. Amer. Statist. Assoc. 96, 194–209.
Frühwirth-Schnatter, S., 2004. Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econom. J. 7, 143–167.
Gelati, E., Madsen, H., Rosbjerg, D., 2010. Markov-switching model for nonstationary runoff conditioned on El Niño information. Water Resour. Res. 46, W02517.
Gelman, A., Meng, X.-L., 1998. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat. Sci. 13, 163–185.
George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88, 881–889.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading.
Goswami, G., Liu, J.S., 2007. On learning strategies for evolutionary Monte Carlo. Statist. Comput. 17, 23–38.
Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
Hamilton, J.D., 1994. Time Series Analysis. Princeton University Press, Princeton.
Hamilton, J.D., Owyang, M.T., 2012. The propagation of regional recessions. Rev. Econom. Statist. 94, 935–947.
Heaps, S.E., Boys, R.J., Farrow, M., 2015. Bayesian modelling of rainfall data by using non-homogeneous hidden Markov models and latent Gaussian variables. Appl. Statist. 64, 543–568.
Hokimoto, T., Shimizu, K., 2014. A non-homogeneous hidden Markov model for predicting the distribution of sea surface elevation. J. Appl. Stat. 41, 294–319.
Holsclaw, T., Greene, A.M., Robertson, A.W., Smyth, P., 2017. Bayesian non-homogeneous Markov models via Polya-Gamma data augmentation with applications to rainfall modeling. Ann. Appl. Stat. 11, 393–426.
Hu, B., Tsui, K.-W., 2010. Distributed evolutionary Monte Carlo for Bayesian computing. Comput. Stat. Data Anal. 54, 688–697.
Hug, S., Schwarzfischer, M., Hasenauer, J., Marr, C., Theis, F.J., 2016. An adaptive scheduling scheme for calculating Bayes factors with thermodynamic integration using Simpson's rule. Statist. Comput. 26, 663–677.
Hughes, J.P., Guttorp, P., Charles, S.P., 1999. A non-homogeneous hidden Markov model for precipitation occurrence. Appl. Stat. 48, 15–30.
Jasra, A., Holmes, C.C., Stephens, D.A., 2005. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Stat. Sci. 20, 50–67.
Jasra, A., Stephens, D.A., Holmes, C.C., 2007a. On population-based simulation for static inference. Statist. Comput. 17, 263–279.
Jasra, A., Stephens, D.A., Holmes, C.C., 2007b. Population-based reversible jump Markov chain Monte Carlo. Biometrika 94, 787–807.
Kass, R., Raftery, A., 1995. Bayes factors. J. Amer. Statist. Assoc. 90, 773–795.
Kaufmann, S., 2015. K-state switching models with time-varying transition distributions – does loan growth signal stronger effects of variables on inflation? J. Econometrics 187, 82–94.
Kim, C.-J., 1994. Dynamic linear models with Markov-switching. J. Econometrics 60, 1–22.
Kim, W., Park, J., Lee, K.M., 2009. Stereo matching using population MCMC. Int. J. Comput. Vis. 83, 195–209.
Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220, 671–680.
Krolzig, H.-M., 1997. Markov-Switching Vector Autoregressions: Modelling, Statistical Inference and Applications to Business Cycle Analysis. Springer, Berlin.
Kuo, L., Mallick, B., 1998. Variable selection for regression models. Sankhya Ser. B 60, 65–81.
Lartillot, N., Philippe, H., 2006. Computing Bayes factors using thermodynamic integration. Syst. Biol. 55, 195–207.
Liang, F., 2005. Bayesian neural networks for nonlinear time series forecasting. Statist. Comput. 15, 13–29.
Liang, F., Wong, W.H., 2000. Evolutionary Monte Carlo sampling: applications to Cp model sampling and change-point problem. Statist. Sinica 10, 317–342.
Liang, F., Wong, W.H., 2001. Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. J. Amer. Statist. Assoc. 96, 653–666.
Lu, Z.-Q., Berliner, L.M., 1999. Markov switching time series models with application to a daily runoff series. Water Resour. Res. 35, 523–534.
Meligkotsidou, L., Dellaportas, P., 2011. Forecasting with non-homogeneous hidden Markov models. Statist. Comput. 21, 439–449.
Mohamed, L., Calderhead, B., Filippone, M., Christie, M., Girolami, M., 2012. Population MCMC methods for history matching and uncertainty quantification. Comput. Geosci. 16, 423–436.
Monbet, V., Ailliot, P., 2017. Sparse vector Markov switching autoregressive models: application to multivariate time series of temperature. Comput. Stat. Data Anal. 108, 40–51.
Neykov, N., Neytchev, P., Zucchini, W., Hristov, H., 2012. Linking atmospheric circulation to daily precipitation patterns over the territory of Bulgaria. Environ. Ecol. Stat. 19, 249–267.
O'Hara, R.B., Sillanpää, M.J., 2009. A review of Bayesian variable selection methods: what, how and which. Bayesian Anal. 4, 85–117.
Paroli, R., Spezia, L., 2008a. Bayesian variable selection in Markov mixture models. Commun. Stat. Simul. Comput. 37, 25–47.
Paroli, R., Spezia, L., 2008b. Bayesian inference in non-homogeneous Markov mixtures of periodic autoregressions with state-dependent exogenous variables. Comput. Statist. Data Anal. 52, 2311–2330.
Pinto, C., Spezia, L., 2016. Markov switching autoregressive models for interpreting vertical movement data with application to an endangered marine apex predator. Methods Ecol. Evol. 7, 407–417.
Raymond, J.E., Rich, R.W., 1997. Oil and the macroeconomy: a Markov state-switching approach. J. Money Credit Bank. 29, 193–213.
Richardson, S., Green, P.J., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B 59, 731–792.
Robertson, A.W., Kirshner, S., Smyth, P., 2004. Downscaling of daily rainfall occurrence over Northeast Brazil using a hidden Markov model. J. Climate 17, 4407–4424.
Seaman III, J.W., Seaman Jr., J.W., Stamey, J.D., 2012. Hidden dangers of specifying noninformative priors. Amer. Stat. 66, 77–84.
Sengupta, B., Friston, K.J., Penny, W.D., 2015. Gradient-free MCMC methods for dynamic causal modelling. Neuroimage 112, 375–381.
Shepard, E., Ahmed, M., Southall, E., Witt, M., Metcalfe, J., Sims, D., 2006. Diel and tidal rhythms in diving behaviour of pelagic sharks identified by signal processing of archival tagging data. Mar. Ecol. Prog. Ser. 328, 205–213.
Spezia, L., 2006. Bayesian analysis of non-homogeneous hidden Markov models. J. Stat. Comput. Simul. 76, 713–725.
Spezia, L., 2009. Reversible jump and the label switching problem in hidden Markov models. J. Statist. Plann. Inference 139, 2305–2315.
Spezia, L., 2019. Modelling covariance matrices by the trigonometric separation strategy with application to hidden Markov models. Test 28, 399–422.
Spezia, L., Cooksley, S., Brewer, M.J., Donnelly, D., Tree, A., 2014a. Mapping species distributions in one dimension by non-homogeneous hidden Markov models: the case of freshwater pearl mussels in the River Dee. Environ. Ecol. Stat. 21, 487–505.
Spezia, L., Cooksley, S., Brewer, M.J., Donnelly, D., Tree, A., 2014b. Modelling species abundance in a river by negative binomial hidden Markov models. Comput. Stat. Data Anal. 71, 599–614.
Ter Braak, C.J.F., 2006. A Markov chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Statist. Comput. 16, 239–249.
Vitoratou, S., Ntzoufras, I., 2017. Thermodynamic Bayesian model comparison. Statist. Comput. 27, 1165–1180.