Accepted Manuscript

Bayesian model selection using automatic relevance determination for nonlinear dynamical systems

Rimple Sandhu, Chris Pettit, Mohammad Khalil, Dominique Poirel, Abhijit Sarkar

To appear in: Comput. Methods Appl. Mech. Engrg.
PII: S0045-7825(16)31302-0
DOI: http://dx.doi.org/10.1016/j.cma.2017.01.042
Reference: CMA 11328
Received date: 15 October 2016; Revised date: 31 January 2017; Accepted date: 31 January 2017

Please cite this article as: R. Sandhu, C. Pettit, M. Khalil, D. Poirel, A. Sarkar, Bayesian model selection using automatic relevance determination for nonlinear dynamical systems, Comput. Methods Appl. Mech. Engrg. (2017), http://dx.doi.org/10.1016/j.cma.2017.01.042
Bayesian model selection using automatic relevance determination for nonlinear dynamical systems

Rimple Sandhu^a, Chris Pettit^b, Mohammad Khalil^c, Dominique Poirel^d, Abhijit Sarkar^{a,∗}

^a Department of Civil and Environmental Engineering, Carleton University, Ottawa, Ontario, Canada
^b Department of Aerospace Engineering, United States Naval Academy, Annapolis, Maryland, United States
^c Sandia National Laboratories, Livermore, California, United States
^d Department of Mechanical and Aerospace Engineering, Royal Military College of Canada, Kingston, Ontario, Canada
Abstract

Bayesian model selection is augmented with automatic relevance determination (ARD) to perform model reduction of complex dynamical systems modelled by nonlinear, stochastic ordinary differential equations (ODEs). Given noisy measurement data, a parametrically flexible model is envisioned to represent the dynamical system. A Bayesian model selection problem is posed to find the best model nested under the envisioned model. This model selection problem is transferred from the model space to the hyper-parameter space by regularizing the parameter posterior space through a parametrized prior distribution called the ARD prior. The resulting joint prior pdf is the combination of parametrized ARD priors, assigned to parameters whose relevance to the system dynamics is questionable, and known prior pdfs for parameters whose relevance is known a priori. The hyper-parameter of each ARD prior explicitly represents the relevance of the corresponding model parameter. The hyper-parameters are estimated from the measurement data by evidence maximization or type-II maximum likelihood. Superfluous model parameters are switched off during evidence maximization by the corresponding ARD prior, forcing those parameters to be irrelevant for prediction purposes. An efficient numerical implementation for evidence computation using Markov Chain Monte Carlo (MCMC) sampling of the parameter posterior distribution is presented for the case when the analytical evaluation of the evidence is not possible. The ARD approach is validated with synthetic measurements generated from a nonlinear, unsteady aeroelastic oscillator consisting of a NACA0012 airfoil undergoing limit cycle oscillation. A set of intentionally flexible stochastic ODEs having different state-space formulations is proposed to model the synthetic data. ARD is used to obtain an optimal nested model corresponding to each proposed model. The optimal nested model with the maximum posterior model probability is chosen as the overall optimal model. ARD provides a flexible Bayesian platform for finding the optimal nested model, eliminating the need to propose candidate nested models and their prior pdfs.

Keywords: Automatic relevance determination, Bayesian inference, Bayesian model selection, Nonlinear structural dynamics, Markov Chain Monte Carlo simulation, Kalman filter, Empirical Bayes methods, Model evidence

∗ Corresponding author. Tel.: +1 613 520 2600x6320; fax: +1 613 520 3951

1. Introduction

In engineering mechanics, Bayesian methods are increasingly employed to model nonlinear dynamical systems through the statistical inversion of sparse and noisy measurements of the system response [1–8]. While conforming to the axioms of measure theory, Bayesian methods provide robust
inversion tools by incorporating various unavoidable sources of modelling uncertainty in a cohesive and rational manner. A general Bayesian inference procedure powered by Markov Chain Monte Carlo (MCMC) sampling has been implemented by the authors to estimate the posterior probability density function (pdf) of the unknown and uncertain parameters of a given model [4, 7, 9, 10]. When modelling complex engineering systems, multiple plausible models can be envisioned to represent the system physics, each with different data-fit and response-prediction capabilities depending on the model complexity and the quality and density of the measurement data. A Bayesian model selection procedure based on non-asymptotic calculation of the model evidence using MCMC samples from the posterior parameter pdf has been implemented by the authors for any model represented in the form of a nonlinear stochastic ordinary differential equation (ODE) [10, 11]. Bayesian model selection assigns a posterior probability to each model using the measurement data; the model that strikes the right balance between data-fit and prediction uncertainty receives the highest posterior probability [5, 12]. This balance between data-fit and prediction uncertainty is also referred to as the principle of parsimony [13], the bias-variance trade-off [13], or Ockham's razor [14]. In practice, the modeller is often faced with a scenario in which a parametrically flexible model of the engineering system is envisioned to capture the information contained in the noisy measurement data. This model is aimed at capturing all the probable aspects of the system physics, irrespective of what the measurements suggest. In engineering practice, the proposed models often provide an overly complicated representation of the system dynamics in relation to the information content of the measurements. Given a
set of measurements, there exists a unique nested model that best balances data-fit and prediction capability among all the models nested under the proposed model. In our previous work, the optimal nested model was obtained using Bayesian model selection performed on a recurring set of candidate models obtained either from a stepwise model reduction scheme [10] or by manually choosing a finite number of plausible models based on a priori knowledge (before assimilating measurements) about the system dynamics [11]. In this work, the optimal nested model is obtained automatically by performing Bayesian model selection using automatic relevance determination (ARD) priors. ARD enables implicit comparison of all nested models within the Bayesian framework, thereby eliminating the need to manually propose candidate nested models. Another important aspect of the Bayesian framework is the assignment of prior pdfs to model parametric uncertainty. A prior pdf contains prior knowledge about a model parameter that may help in improving or regularizing the posterior results. However, assigning prior pdfs to parameters whose relevance to the system physics is not known a priori is an arbitrary exercise. A common response to this lack of prior knowledge is to assign diffuse (but proper) priors to such parameters. However, Bayesian model selection tends to select a simpler model as the diffusivity of the prior increases [8, 15]. Using ARD priors eliminates the need to manually assign prior pdfs to parameters for which limited a priori information exists to justify a particular prior pdf. Automatic relevance determination [16–18] was first presented in the neural network literature as an effective tool for feature selection and has since been used extensively in the field of machine learning [19]. ARD operates by pruning redundant features or terms in the proposed model
through a data-dependent regularization of the solution space. The regularization is achieved by assigning a parametrized prior distribution to model parameters with limited prior knowledge or with questionable relevance to the system dynamics. The parameters of this prior distribution are known as hyper-parameters, and they are estimated from the measurement data by evidence maximization or type-II maximum likelihood [18]. This approach of assigning prior pdfs using the data is also known as the parametric empirical Bayes (EB) method [19–21]. The EB or ARD approach is in contrast to standard Bayesian statistics, where known prior distributions have to be assigned by the modeller, and also to frequentist statistics, where no prior is assigned. Choosing hyper-parameters by evidence maximization can also be viewed as a Bayesian model selection problem in a continuous model domain, since each choice of hyper-parameter corresponds to a unique model. This continuous model domain increases the chance of obtaining an optimal model with higher evidence than the optimal model obtained from a manually proposed candidate model set. The optimal nested model obtained using evidence maximization automatically embodies the concept of Ockham's razor. The use of ARD for statistical modelling has been limited in engineering mechanics due to the conceptual and implementation challenges involved in evidence computation and maximization [22]. Oh et al. [15] used an ARD prior for a classification problem in earthquake engineering where the models considered were linear combinations of basis functions and the posterior pdf was approximated as Gaussian. The evidence maximization problem was solved analytically using a Laplace approximation of the posterior pdf and by setting the derivatives of the log-evidence to zero. In this paper, we present an automatic model reduction procedure for nonlinear stochastic
ODE models where the model evidence is not available analytically and the posterior pdf can take any general shape. Unlike black-box modelling, we assign a combination of ARD priors and known priors by categorizing the model parameters based on their a priori relevance using prior knowledge about the system dynamics. In other words, the continuous model domain is regularized by the known prior pdfs and is not unbounded. Bayesian LASSO [23] offers a similar approach of regularizing the posterior parameter space using parameterized double-exponential prior pdfs. Bayesian LASSO is primarily used in regression and operates by optimizing the posterior pdf with respect to the hyper-parameters to prune redundant parameters, in contrast to ARD, where the evidence is optimized. The contributions of this paper are as follows:

• ARD is successfully applied to perform automatic model reduction of a nonlinear dynamical system, with a hybrid scheme used to assign parameter prior pdfs. Under the proposed scheme, the joint parameter prior pdf embodies the prior knowledge and also allows for the pruning of redundant parameters.

• An efficient implementation of the evidence computation and maximization algorithm is presented, which uses the Chib-Jeliazkov method to compute the evidence without any assumptions about the shape and type of the parameter posterior pdf.

• The proposed Bayesian model selection procedure using ARD is validated using synthetic data generated from a nonlinear aeroelastic system. The synthetic data resemble the wind-tunnel data that have been previously modelled by the authors using alternative Bayesian schemes [10, 11].
In Section 2, we present the methodology for automatic Bayesian model reduction using ARD, wherein the MCMC-powered evidence computation and maximization algorithm is detailed for the case where the analytical derivatives of the evidence are unavailable. In Section 3, the proposed model reduction methodology is validated using synthetic measurements from a nonlinear aeroelastic oscillator.

2. Methodology: Automatic Bayesian model reduction

2.1. State-space representation of dynamical systems

A typical mathematical model M describing the dynamical system can
be represented in the general discrete stochastic state-space form [7, 24, 25]

u_{j+1} = g_j(u_j, ϕ, f_j, q_j),   (1)

where u ∈ R^{n_u} is the state vector, g ∈ R^{n_u} is the discrete nonlinear model operator, ϕ ∈ R^{n_ϕ} denotes the time-invariant parameters, f ∈ R^{n_u} denotes the known deterministic input and q ∈ R^{n_q} signifies the unknown model approximation noise, for example, due to unmodelled dynamics and/or unknown random input. The measurement vector y ∈ R^{n_m} is mapped to the state vector through the measurement equation

y_k = h_k(u_{j(k)}, ϕ, ϵ_k),   (2)

where h ∈ R^{n_m} is the nonlinear measurement operator and ϵ ∈ R^{n_ϵ} denotes the measurement noise stemming from the data-acquisition system. As the model integration time step and the arrival time of observational data can differ, two indices, j and j(k), are used to denote these quantities, as illustrated in Fig. 1.
In this investigation, q_j ∼ N(q̄_j, Q_j) and ϵ_k ∼ N(ϵ̄_k, Γ_k) are assumed to be mutually independent and Gaussian; q̄_j ∈ R^{n_q} and ϵ̄_k ∈ R^{n_ϵ} are the mean vectors, and Q_j ∈ R^{n_q×n_q} and Γ_k ∈ R^{n_ϵ×n_ϵ} are the covariance matrices. The inclusion of ϵ_k and q_j enables the solution of the estimation problem in a statistical (Bayesian) sense. A more refined colored-noise model (additive or multiplicative) can be adopted for ϵ_k and q_j, and the associated parameter estimation can be carried out as per Kennedy and O'Hagan [26]. This study focuses on the refinement of the physical aspect of the model for a given representation of the error terms. The mutual independence and Gaussianity assumptions on q_j and ϵ_k simplify the computational aspect of the inference and are also supported by the maximum entropy principle [5, 27].
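To make Eq. (1) and Eq. (2) concrete, the sketch below (our illustration; all parameter names and numerical values are hypothetical, not taken from the paper) simulates a forward-Euler discretization of a damped linear oscillator with additive Gaussian process noise q_j, and generates noisy measurements only every few model steps, so that the two time indices j and j(k) differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative time-invariant parameters phi (hypothetical values)
k_stiff, c_damp = 4.0, 0.4     # stiffness and damping
dt = 0.01                      # model integration time step
n_steps, obs_every = 500, 5    # data arrive every 5th model step: j(k) = 5k

def g(u, q):
    """One step of Eq. (1): forward-Euler oscillator plus process noise q_j."""
    x, v = u
    return np.array([x + dt * v,
                     v + dt * (-k_stiff * x - c_damp * v)]) + q

Q = 1e-4 * np.eye(2)           # process noise covariance Q_j
gamma = 0.05 ** 2              # measurement noise variance Gamma_k

u = np.array([1.0, 0.0])       # initial state u_0
data = []
for j in range(n_steps):
    u = g(u, rng.multivariate_normal(np.zeros(2), Q))
    if (j + 1) % obs_every == 0:
        # Eq. (2): observe the displacement state only, plus noise eps_k
        data.append(u[0] + rng.normal(0.0, np.sqrt(gamma)))

d = np.array(data)             # realization d_{1:n_d} of the measurement vector
print(d.shape)
```

Here the measurement operator h_k picks out a single state component; the mismatch between the model step dt and the observation spacing is exactly the j versus j(k) distinction of Fig. 1.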
Figure 1: A general time-discretization scheme for discrete stochastic models of dynamical systems shown in Eq. (1) and Eq. (2). (Adapted from Sandhu et al. [10, 11], Evensen [25])
2.2. Bayesian model selection using ARD

We assume that a realization d_{1:n_d} = {d_1, d_2, . . . , d_{n_d}} of the measurement vector y in Eq. (2) is available from the concerned system. A candidate model set M = {M_r; r = 1, 2, . . . , n_M} is proposed to model the system dynamics captured in the noisy measurement data d_{1:n_d}. The candidate set M consists of overly complicated model equations of the form of Eq. (1), with each model having a different state-space formulation. In other words, no model in M can be represented as nested under another candidate model in M without altering the composition of the state-space vector u. Given d_{1:n_d} and M, we seek the optimal model set M^opt = {M^opt_r; r = 1, 2, . . . , n_M}, where M^opt_r is the optimal nested model that strikes the right balance between model complexity and data-fit among all models nested under M_r. Once M^opt is available, we seek the best of all optimal nested models. In this work, the task of obtaining the optimal nested models M^opt_r and the subsequent ranking of multiple optimal nested models is performed using the Bayesian estimation framework [5].

In the Bayesian framework, the joint posterior pdf of the unknown time-invariant model parameters in ϕ for each M_r ∈ M is obtained by assimilating the measurements d_{1:n_d} using Bayes' theorem. When using ARD, the Bayesian parameter estimation equation is written as

p(ϕ|d_{1:n_d}, ψ) = p(d_{1:n_d}|ϕ, ψ) p(ϕ|ψ) / p(d_{1:n_d}|ψ) = p(d_{1:n_d}|ϕ) p(ϕ|ψ) / p(d_{1:n_d}|ψ),   (3)
where p(d_{1:n_d}|ϕ) is the likelihood function, p(ϕ|ψ) is the parametrized prior pdf, a known function of the hyper-parameter vector ψ, and p(d_{1:n_d}|ψ) is the model evidence, with an implicit dependence on ψ. Note that the explicit dependence on the model M_r is omitted for brevity. The parametrization of the prior pdf forms the foundation of ARD and is discussed in detail in Section 2.3. The maximum a posteriori (MAP) estimate ϕ^MAP = arg max_ϕ {p(ϕ|d_{1:n_d}, ψ)} is the mode of the posterior pdf for a given ψ, and the maximum likelihood estimate (MLE) ϕ^MLE = arg max_ϕ {p(d_{1:n_d}|ϕ)} is the mode of the likelihood function. Note that ϕ^MAP is a function of the hyper-parameter vector ψ.

The model evidence p(d_{1:n_d}|ψ) plays a central role in Bayesian model selection, whereas it acts merely as a normalizing constant for parameter estimation in Eq. (3). The model evidence is calculated as

p(d_{1:n_d}|ψ) = ∫ p(d_{1:n_d}|ϕ) p(ϕ|ψ) dϕ.   (4)
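For intuition about Eq. (4), the integral can be checked numerically in a deliberately simple conjugate setting (our illustration, not the stochastic-ODE case of this paper): for a scalar datum d ∼ N(ϕ, σ²) with an ARD-style prior ϕ ∼ N(0, 1/α), the evidence has the closed form p(d|ψ) = N(d; 0, σ² + 1/α), which a crude Monte Carlo average of the likelihood over prior samples reproduces:

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

d, sigma2, alpha = 0.8, 0.25, 2.0   # illustrative values

# Eq. (4) by Monte Carlo: average the likelihood over prior samples of phi
phi = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=200_000)
ev_mc = gauss_pdf(d, phi, sigma2).mean()

# Closed-form evidence for this conjugate Gaussian toy model
ev_exact = gauss_pdf(d, 0.0, sigma2 + 1.0 / alpha)

print(ev_mc, ev_exact)
```

In the nonlinear stochastic-ODE setting of this paper no such closed form exists, which is why the Chib-Jeliazkov machinery of Section 2.4 is needed.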
The evidence p(d_{1:n_d}|ψ) is known as a quantitative Ockham's razor that performs an explicit trade-off between data-fit and model complexity [5]. This fact is revealed when the logarithm of the model evidence is written in the form [28]

ln p(d_{1:n_d}|ψ) = E[ln p(d_{1:n_d}|ϕ)] − E[ln (p(ϕ|d_{1:n_d}, ψ) / p(ϕ|ψ))],   (5)

where the left-hand side is the log-evidence, the expectation E[·] is with respect to the posterior p(ϕ|d_{1:n_d}, ψ), the first term on the right is the goodness-of-fit, and the second term is the information gain (EIG). The first term, being the expectation of the log-likelihood, provides a measure of the data-fit. The second term provides a measure of the relative entropy (or Kullback-Leibler divergence) between the prior and posterior pdfs [13]. Note that the information extracted from the measurement data d_{1:n_d} is higher for a complex model. As a result, the information content of the posterior pdf will differ vastly from that of the prior pdf, thereby increasing the information gain term in Eq. (5). Hence, the information gain term in Eq. (5) represents model complexity. This representation of model complexity is in contrast to the frequentist approach, where model complexity, in general, depends only on the number of model parameters [13].

The parameterization of the prior pdf p(ϕ|ψ) in Eq. (3) and Eq. (4) is performed in such a fashion that the hyper-parameter vector ψ represents the relevance of the model parameters ϕ. In other words, all the nested models under M can be obtained by varying the values of ψ (details in Section 2.3). In this work, the optimal value of ψ is inferred from the data d_{1:n_d} using the hierarchical Bayes approach [20, 21]. Using Bayes' theorem, the posterior distribution p(ψ|d_{1:n_d}) of the hyper-parameter vector ψ is obtained as

p(ψ|d_{1:n_d}) = p(d_{1:n_d}|ψ) p(ψ) / p(d_{1:n_d}),   (6)
where p(ψ) is the prior distribution of the hyper-parameters and p(d_{1:n_d}) is simply a normalization constant for a given d_{1:n_d}. For model selection purposes, our interest lies in the value of ψ that maximizes the hyper-parameter posterior pdf p(ψ|d_{1:n_d}). Such an instance of ψ is referred to as the optimal hyper-parameter vector ψ^opt. Assuming a flat prior p(ψ) ∝ 1, the optimal hyper-parameter vector ψ^opt can be obtained by maximizing the model evidence p(d_{1:n_d}|ψ); in mathematical terms, ψ^opt = arg max_ψ {p(d_{1:n_d}|ψ)}. Given a special kind of prior parametrization p(ϕ|ψ) and measurement data d_{1:n_d}, the optimal hyper-parameter vector ψ^opt will pertain to the optimal nested model M^opt_r under the candidate model M_r ∈ M. In other words, the optimal prior pdf p(ϕ|ψ^opt) will enforce the pruning of redundant model parameters while retaining the important model parameters in the posterior parameter pdf p(ϕ|d_{1:n_d}, ψ^opt). This technique of transferring the model selection problem from a discrete model space of nested models under M_r to a continuous hyper-parameter space of ψ through parametrized prior pdfs is called automatic relevance determination (ARD). The parameterized prior used for model reduction is referred to as the ARD prior. The selection of an appropriate ARD prior and the subsequent optimization of the model evidence are the critical aspects of the model reduction task.

Once the optimal nested models M^opt_r ∈ M^opt are available, the posterior model probability P(M^opt_r|d_{1:n_d}) is obtained as

P(M^opt_r|d_{1:n_d}) = p(d_{1:n_d}|ψ^opt, M^opt_r) P(M^opt_r) / p(d_{1:n_d}|M^opt),   (7)

where the maximum model evidence p(d_{1:n_d}|ψ^opt, M^opt_r), or p(d_{1:n_d}|ψ^opt) for short, is known and the prior model probability P(M^opt_r) is assumed to be the same for all M^opt_r ∈ M^opt.
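The pruning mechanism just described can be demonstrated on a toy linear-Gaussian model (an assumed illustration with hypothetical values, not the aeroelastic model of Section 3), for which the evidence of Eq. (4) is available in closed form. Maximizing it over the hyper-parameters on a grid, as a stand-in for the line-search of Section 2.5, drives the precision of a superfluous parameter to large values, i.e. switches that parameter off:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear-Gaussian model (illustrative values): d = X phi + noise,
# with the second column of X attached to a truly irrelevant parameter.
n, sig2 = 50, 0.1
X = rng.normal(size=(n, 2))
phi_true = np.array([1.0, 0.0])          # second parameter is superfluous
d = X @ phi_true + rng.normal(0.0, np.sqrt(sig2), n)

def log_evidence(alpha):
    """Closed-form ln p(d|psi): d ~ N(0, sig2*I + X diag(1/alpha) X^T)."""
    C = sig2 * np.eye(n) + (X / alpha) @ X.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

grid = np.logspace(-2, 4, 25)            # candidate hyper-parameter values
best = max(((a1, a2) for a1 in grid for a2 in grid),
           key=lambda a: log_evidence(np.array(a)))
print(best)  # evidence maximization drives alpha_2 far above alpha_1
```

The relevant parameter keeps a moderate optimal precision, while the irrelevant one is pushed toward the Dirac-delta limit of Section 2.3; for the stochastic ODE models of this paper the evidence is not available in closed form, so the same optimization must run on the MCMC-based estimate of Section 2.4.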
2.3. ARD prior

In machine learning applications, the ARD prior is assigned to all model parameters, and the subsequent inferences are entirely data-dependent or frequentist in nature, free from prior beliefs (if any). This modelling approach is not well suited to engineering systems, where some a priori information about the system physics is available and should be incorporated into the modelling. In this work, we implement a hybrid approach to assigning ARD priors that results in the automatic pruning of redundant parameters while still allowing for the inclusion of prior information about certain model parameters. Based on the a priori information available about the system dynamics, the n_ϕ-sized parameter vector ϕ is split into 1) the essential parameter vector ϕ_{-ψ}, and 2) the non-essential parameter vector ϕ_ψ, where ϕ_ψ ∈ R^{n_ψ} and ϕ_{-ψ} ∈ R^{n_ϕ−n_ψ}. The essential parameters in ϕ_{-ψ} are a priori (before assimilating measurement data) known to be salient to the system dynamics, and a proper prior pdf p(ϕ_{-ψ}) can be constructed using the prior information. On the other hand, limited information is available about the non-essential parameters, and therefore ϕ_ψ is assigned a parametrized prior pdf p(ϕ_ψ|ψ).
For this parameterized prior to be an ARD prior, it should satisfy the following conditions: 1) zero mean, and 2) unboundedness. The zero-mean condition ensures that the prior is unbiased and that, for certain values of ψ, the prior pdf of an individual non-essential parameter converges to a Dirac delta function at zero. The unboundedness ensures that the posterior space of the non-essential parameters is unrestricted and can attain any value.

A natural choice of ARD prior that satisfies the conditions of zero mean and unboundedness is the zero-mean Gaussian distribution p(ϕ_ψ|ψ) = N(0, A), where the covariance matrix is A = Diag(α_1^{−1}, α_2^{−1}, . . . , α_{n_ψ}^{−1}) and the hyper-parameter vector is ψ = {α_1, α_2, . . . , α_{n_ψ}}. The joint prior pdf p(ϕ|ψ) of ϕ can then be written as

p(ϕ|ψ) = p(ϕ_{-ψ}) p(ϕ_ψ|ψ) = p(ϕ_{-ψ}) ∏_{i=1}^{n_ψ} N(0, 1/α_i).   (8)
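A minimal numerical sketch of the hybrid prior in Eq. (8) follows (the function names, the stand-in "known" essential prior, and all values are our own illustration): the joint log-prior is the sum of a known Gaussian log-prior for an essential parameter and independent zero-mean ARD factors, and increasing α_i penalizes any nonzero value of the corresponding non-essential parameter ever more heavily:

```python
import numpy as np

def log_joint_prior(phi_ess, phi_non, alpha, ess_mean=0.5, ess_var=1.0):
    """Eq. (8): ln p(phi|psi) = ln p(phi_-psi) + sum_i ln N(phi_psi_i; 0, 1/alpha_i).

    The essential-parameter prior N(ess_mean, ess_var) is a stand-in for a
    'known' proper prior; all values here are illustrative.
    """
    lp = -0.5 * ((phi_ess - ess_mean) ** 2 / ess_var + np.log(2 * np.pi * ess_var))
    var = 1.0 / np.asarray(alpha, dtype=float)     # ARD variances 1/alpha_i
    phi_non = np.asarray(phi_non, dtype=float)
    lp += np.sum(-0.5 * (phi_non ** 2 / var + np.log(2 * np.pi * var)))
    return lp

# A nonzero non-essential value is penalized more heavily as alpha grows,
# approaching the Dirac-delta limit at alpha -> infinity
for a in (1.0, 1e2, 1e4):
    print(a, log_joint_prior(0.5, [0.3], alpha=[a]))
```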
The Laplace (double-exponential) distribution is another choice of ARD prior and is widely used in LASSO regression [23]. The Laplace distribution has wider tails and higher probability mass near the origin than the Gaussian distribution. The Gaussian ARD prior provides more flexibility to the posterior space than the Laplace distribution, thereby reducing the bias in parameter estimates obtained using optimized ARD priors. As evident from Eq. (8), the hybrid approach allows for the inclusion of prior knowledge while avoiding the assignment of known prior pdfs to non-essential parameters.

Also evident from Eq. (8) is that the prior pdf of a non-essential model parameter becomes a Dirac delta function N(0, 0) for α = ∞. The posterior pdf of such a parameter will also be N(0, 0), making it irrelevant for all purposes. We can therefore conclude that ψ in Eq. (8) represents the relevance of the parameters in ϕ_ψ. This property of the ARD prior N(0, A) implies that, for a given prior p(ϕ_{-ψ}) of the essential parameters, all possible models nested under M can be constructed by varying ψ in Eq. (8). Hence, there exists an optimal ψ value that corresponds to the optimal model nested under M. This optimal hyper-parameter ψ^opt is obtained by maximizing the model evidence p(d_{1:n_d}|ψ)
in Eq. (6). The optimal hyper-parameter value will enforce the pruning of redundant parameters while still allowing the relevant parameters in ϕ_ψ to attain a non-zero posterior space based on the evidence in the measurement data. Next, we discuss the sampling-based method used for computing the model evidence.

2.4. Chib-Jeliazkov method of computing model evidence

The model evidence p(d_{1:n_d}|ψ) in Eq. (4) is an n_ϕ-dimensional integral for each choice of ψ, where n_ϕ is the size of the model parameter vector ϕ. Calculating the model evidence requires the computation of the likelihood function p(d_{1:n_d}|ϕ) for a given stochastic model of the form of Eq. (1). Using the state-space structure of the proposed model, the likelihood p(d_{1:n_d}|ϕ) is expanded as [7]

p(d_{1:n_d}|ϕ) = p(d_1, d_2, . . . , d_{n_d}|ϕ)
             = ∏_{k=1}^{n_d} p(d_k|d_{1:k−1}, ϕ)
             = ∏_{k=1}^{n_d} ∫ p(d_k, u_{j(k)}|d_{1:k−1}, ϕ) du_{j(k)}
             = ∏_{k=1}^{n_d} ∫ p(d_k|u_{j(k)}, ϕ) p(u_{j(k)}|d_{1:k−1}, ϕ) du_{j(k)},   (9)

where p(u_{j(k)}|d_{1:k−1}, ϕ) represents the forecast state pdf given the data history d_{1:k−1}.
In this work, we use the extended Kalman filter (EKF) for the state estimation task required to compute p(u_{j(k)}|d_{1:k−1}, ϕ) in Eq. (9) for a given ϕ value [24, 29, 30]. The EKF is an extension of the Kalman filter whereby the nonlinear model and measurement operators are linearized around the current estimate using a first-order Taylor series expansion. The state pdf is assumed to be locally Gaussian, and the updated and forecast state pdfs are available analytically. Appendix C describes the update and forecast steps of the EKF and the associated likelihood function. The assumption of a Gaussian state pdf works well with the dense measurement data d_{1:n_d} considered in this work. Alternatively, the non-Gaussian state estimation problem in Eq. (9) can be handled by sampling-based filters such as the unscented Kalman filter (UKF) [4], the ensemble Kalman filter [4, 25, 31] and the particle filter [4, 31–33].

The evidence in Eq. (4) is analytically intractable for nonlinear stochastic ODE models, where the likelihood computation involves a state estimation procedure. In the case of a nearly Gaussian posterior pdf p(ϕ|d_{1:n_d}, ψ), Laplace's method of asymptotic approximation [34] provides an efficient way of estimating the evidence. However, the assumption of Gaussianity only holds for "large and dense" measurement data d_{1:n_d}. When performing evidence maximization using ARD priors, it is difficult to determine whether the posterior pdf can be approximated as Gaussian for various instances of the hyper-parameter vector ψ. In other words, certain values of the hyper-parameters can result in a prior pdf p(ϕ_ψ|ψ) that alters the Gaussian nature of the joint posterior pdf p(ϕ|d_{1:n_d}, ψ).
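The prediction-error decomposition of Eq. (9) can be sketched for a linear special case, where the EKF linearization is exact and coincides with the standard Kalman filter (a hedged illustration with hypothetical system matrices, not the EKF equations of Appendix C):

```python
import numpy as np

def kf_loglik(d, G, H, Q, R, u0, P0):
    """Log-likelihood of Eq. (9) via Kalman forecast/update recursions.

    Linear special case u_{j+1} = G u_j + q_j, d_k = H u_k + eps_k, for which
    the EKF linearization is exact: each factor p(d_k|d_{1:k-1}, phi) is the
    Gaussian N(H u_f, H P_f H^T + R) of the forecast measurement.
    """
    u, P, ll = u0.copy(), P0.copy(), 0.0
    for dk in d:
        u, P = G @ u, G @ P @ G.T + Q           # forecast step
        v = dk - H @ u                          # innovation
        S = H @ P @ H.T + R                     # innovation covariance (1x1)
        ll += -0.5 * (np.log(2 * np.pi * S[0, 0]) + v[0] ** 2 / S[0, 0])
        K = P @ H.T / S[0, 0]                   # Kalman gain
        u, P = u + (K * v[0]).ravel(), (np.eye(len(u)) - K @ H) @ P  # update
    return ll

# Synthetic data from hypothetical linear dynamics (values illustrative)
rng = np.random.default_rng(4)
G = np.array([[0.99, 0.01], [-0.04, 0.99]])
H = np.array([[1.0, 0.0]])                      # observe the first state only
Q, R = 1e-4 * np.eye(2), np.array([[1e-2]])
u, d = np.array([1.0, 0.0]), []
for _ in range(200):
    u = G @ u + rng.multivariate_normal(np.zeros(2), Q)
    d.append(H @ u + rng.normal(0.0, 0.1, 1))
ll = kf_loglik(np.array(d), G, H, Q, R, np.zeros(2), np.eye(2))
print(ll)
```

In the nonlinear setting of this paper, G and H would be replaced by the Jacobians of g_j and h_k evaluated at the current state estimate, which is the EKF.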
Recent advances in MCMC methods and high-performance computing architectures have enabled fast and efficient sampling of non-Gaussian posterior pdfs. The Chib-Jeliazkov method estimates the model evidence using the posterior samples of Metropolis-Hastings (MH) MCMC [8, 35–38]. The method is derived from the reversibility condition of the transition pdf of the MH algorithm. The Chib-Jeliazkov method is preferred for the current study since it uses samples from the posterior pdf and not the prior pdf, which changes constantly with varying hyper-parameter values. Also, the Chib-Jeliazkov method makes no assumptions about the shape and type of the parameter posterior pdf. Other notable numerical methods of computing the evidence include nested sampling [39], power posteriors [40], annealed importance sampling [41], the harmonic mean estimator [42], Gauss-Hermite quadrature [43] and transitional MCMC [44], among others. See Friel and Wyse [45] for a comprehensive review of evidence calculation methods.

We assume that an MCMC sample set {ϕ^(g), g = 1, 2, . . . , M} has been generated from the posterior pdf p(ϕ|d_{1:n_d}, ψ) at a given ψ value. The MCMC sampling is performed using the MH algorithm, whereby a Markov chain is generated whose stationary pdf is the posterior pdf p(ϕ|d_{1:n_d}, ψ). The proposed move ϕ′ of the Markov chain is generated by sampling from a proposal distribution q(ϕ′|ϕ), where ϕ is the current position of the chain. The proposed move ϕ′ is accepted with probability r(ϕ, ϕ′), where [35]

r(ϕ, ϕ′) = min{1, [p(ϕ′|d_{1:n_d}, ψ) q(ϕ|ϕ′)] / [p(ϕ|d_{1:n_d}, ψ) q(ϕ′|ϕ)]}.   (10)
Invoking the reversibility property of the MH transition kernel, the Chib-Jeliazkov method estimates the posterior ordinate p(ϕ*|d_{1:n_d}, ψ) at any preselected point ϕ* in the posterior space [35] as

p̂(ϕ*|d_{1:n_d}, ψ) = [ (1/M) ∑_{g=1}^{M} r(ϕ^(g), ϕ*) q(ϕ^(g)|ϕ*) ] / [ (1/J) ∑_{j=1}^{J} r(ϕ*, ϕ^(j)) ],   (11)
where the samples ϕ^(j), j = 1, 2, . . . , J, are generated from the fixed proposal pdf q(ϕ′|ϕ*) using Monte Carlo sampling. Cheung and Beck [8] generalised this method, proving that the stationary MCMC samples ϕ^(g) can be obtained from any MCMC sampling procedure meeting the stationarity condition, rather than just from the proposal pdf q(·). Also, the fixed point ϕ* should be chosen in the high-probability region of the posterior pdf for the best performance of the Chib-Jeliazkov method. Using p̂(ϕ*|d_{1:n_d}, ψ), the model evidence is estimated from Eq. (3) as [35]

p̂(d_{1:n_d}|ψ) = p(d_{1:n_d}|ϕ*) p(ϕ*|ψ) / p̂(ϕ*|d_{1:n_d}, ψ),   (12)
where p(d_{1:n_d}|ϕ*) and p(ϕ*|ψ) can be computed using Eq. (9) and Eq. (8), respectively. Note that the accuracy of the evidence estimate p̂(d_{1:n_d}|ψ) obtained using Eq. (11) and Eq. (12) relies on the quality of the MCMC samples ϕ^(g) and the preselected point ϕ*. Next, we detail the MCMC algorithm used in this work, which ensures efficiency and accuracy of the evidence computation using the Chib-Jeliazkov method.

2.5. Numerical implementation of the evidence maximization algorithm

Given a ψ value at each evidence maximization iteration, an MCMC sampler is initiated to generate the samples ϕ^(g) required for computing the evidence using the Chib-Jeliazkov method in Eq. (11) and Eq. (12). The un-normalized posterior pdf p(ϕ|d_{1:n_d}, ψ) ∝ p(d_{1:n_d}|ϕ) p(ϕ|ψ) from Eq. (3) is sampled using a random-walk Metropolis (RWM) algorithm [46, 47]. The proposal pdf for the RWM algorithm is chosen as q(ϕ′|ϕ) = N(ϕ′|ϕ, C). The covariance C dictates the burn-in period and the quality of the MCMC samples (quantified by the sample autocorrelation), which in turn governs the variation among multiple evidence estimates [9]. We propose an adaptive and parallel RWM algorithm that incorporates the following improvements: 1) Delayed Rejection Adaptive Metropolis (DRAM) [48], and 2) Inter-chain Adaptation (INCA) [49]. The DRAM algorithm is the combination of two algorithms: 1) Adaptive Metropolis (AM) [50], and 2) Delayed Rejection
(DR) [51, 52]. AM performs automatic tuning of the proposal covariance using the entire chain history, thereby ensuring rapid convergence and efficient exploration of the posterior space by optimizing the chain rejection ratio. DR complements the AM algorithm by proposing a multi-level RWM step when a Markov chain move is rejected. Furthermore, the parallel implementation of the DRAM algorithm through INCA enables proposal tuning using the chain history from all processes. The Brooks-Gelman-Rubin (BGR) diagnostic [55, 56] monitors the amount of information exchange among the parallel chains and dictates the length of the burn-in period. The proposed sampler ensures that a poor choice of the initially supplied proposal covariance is automatically tuned in minimum simulation time. Once the sampler is terminated, the evidence is calculated by substituting ϕ^MAP and the stationary samples ϕ^(g) (post burn-in) into Eq. (11). The evidence estimate from each process will be different due to the finite length of the MCMC sample history ϕ^(g) used in computing the evidence. The expectation of the evidence estimates over all processes, E[p̂(d_{1:n_d}|ψ)], is maximized using a line-search algorithm to obtain an estimate ψ^opt of the optimal hyper-parameter. The overall evidence maximization algorithm is summarized below. Once a hyper-parameter value exceeds a threshold pre-assigned to each model parameter, the parameter is deemed irrelevant (set to zero) and the maximization continues for the rest of the non-essential parameters. The algorithm is implemented in C++, where the Armadillo library [53] is used to perform the linear algebra operations and the Message Passing Interface (MPI) [57] library is used for interprocessor communication. The simulations were performed on 16 Intel Xeon E5540 cores interconnected through blocked QDR InfiniBand. The HPC architecture used for the simulations was provided by the SCINET consortium [58].
Evidence maximization algorithm
1:  Initiate ψopt and ϕMAP, and the RWM proposal covariance C.
2:  while optimization stopping criterion is not met do
3:      Propose a new ψ′ = βψopt using the line-search algorithm.
4:      Initiate the MCMC sampler on nc processes using dispersed values of ϕMAP and C.
5:      while burn-in is not reached do
6:          Generate W MCMC samples on each process [Algorithm: DRAM].
7:          Update ϕMAP using the new MAP estimates from all processes.
8:          Update the proposal covariance C using samples from all processes [Algorithm: INCA].
9:          Check for burn-in conditions using the BGR diagnostic.
10:     end while
11:     Generate stationary samples {ϕ(g), g = 1, . . . , M} using the proposal covariance C [Algorithm: AM].
12:     Calculate the evidence estimate p̂(d1:nd|ψ′) on each process by substituting ϕMAP and ϕ(g) into Eq. (11) and Eq. (12).
13:     Calculate the expected evidence E[p̂(d1:nd|ψ′)] using the evidence estimates from each process.
14:     if a new evidence optimum is found then update ψopt
15:     end if
16: end while
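The burn-in check of step 9 can be sketched with the potential-scale-reduction statistic computed across parallel chains; this univariate version is a simplified stand-in for the Brooks-Gelman-Rubin diagnostic cited above, which also includes a multivariate correction not shown here:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter.

    `chains` has shape (n_chains, n_samples); values near 1 indicate that the
    parallel chains have mixed, so burn-in can be declared.
    """
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))             # four chains sampling one target
stuck = mixed + 5.0 * np.arange(4)[:, None]    # four chains in different modes
```

For the well-mixed chains R-hat is close to 1, while chains trapped in different modes give a value far above 1 and would keep the sampler in its burn-in loop.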
Given the optimal hyper-parameter ψopt, the corresponding parameter MAP estimate ϕMAP and model evidence E[p̂(d1:nd|ψopt)], the goodness-of-fit and information-gain (or EIG) terms in Eq. (5) can be estimated as:

Goodness-of-fit: E[ln p(d1:nd|ϕ)] ≈ ln p(d1:nd|ϕMAP),
EIG: E[ln (p(ϕ|d1:nd, ψ) / p(ϕ|ψ))] ≈ ln p(d1:nd|ϕMAP) − ln{E[p̂(d1:nd|ψopt)]}.   (13)
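Given the two MCMC outputs, Eq. (13) is a direct arithmetic decomposition of the log-evidence. The numbers below are hypothetical placeholders, not values from the paper's runs:

```python
import numpy as np

# Hypothetical MCMC outputs (placeholder values, not from the paper's runs):
log_like_map = -1250.0    # ln p(d|phi_MAP), likelihood at the MAP estimate
log_evidence = -1290.0    # ln E[p-hat(d|psi_opt)], Chib-Jeliazkov estimate

goodness_of_fit = log_like_map             # first line of Eq. (13)
eig = log_like_map - log_evidence          # second line of Eq. (13)
```

By construction, log-evidence = goodness-of-fit − EIG, so a complexity penalty (large EIG) directly lowers the evidence.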
3. Numerical investigation: Nonlinear aeroelastic oscillator

3.1. Structural model of the oscillator

In this section, we investigate the performance of ARD priors as an automatic model reduction tool using synthetically generated measurement data from a nonlinear aeroelastic oscillator. The oscillator is a rigid NACA0012 wing undergoing pure-pitch limit cycle oscillations (LCO) in the transitional Reynolds number regime 5.0 × 10^4 ≤ Rec ≤ 1.3 × 10^5, where Rec is the chord-based Reynolds number [59]. The self-sustained dynamic instability and the subsequent stabilization to an LCO are attributed to the nonlinear and unsteady aerodynamic loads arising from a laminar separation bubble (LSB) on the wing surface in a feedback relationship with the freely pitching wing [59]. The experimental set-up for the nonlinear oscillator is located in the low-speed wind tunnel at the Royal Military College (RMC) of Canada. A more detailed description of the wind-tunnel apparatus and test conditions can be found in [59]. Note that although the wing apparatus allows for both pitch and plunge displacements, the observed LCO is a pure-pitch phenomenon. The measurement data gathered from wind-tunnel testing has been studied previously by the authors using Bayesian methodology [7, 10, 11]. The use of uncertainty quantification tools such as Bayesian methods reflects a growing appreciation within the aeroelastic science and engineering community of the need to understand the physics of nonlinear aeroelasticity [60–63].

The structural properties of the wing apparatus in the pure-pitch degree-of-freedom (dof) have been estimated previously by the authors [9] using measurements from a free-decay motion under no-airflow conditions. The following nonlinear ODE models the pitching motion of the wing:

IEA θ′′ + Dl θ′ + Kl θ + Knl θ³ = Dnl sign(θ′) + MEA(θ, θ′, θ′′; ϕ),
(14)
where θ = θ(t) is the pitch displacement (rotation), θ′, θ′′ are its derivatives with respect to time t, IEA is the mass moment of inertia, Dl is the linear viscous damping coefficient, Kl is the linear stiffness coefficient, Knl is the nonlinear stiffness coefficient, Dnl is the nonlinear friction damping coefficient, ϕ is the time-invariant aerodynamic parameter vector, MEA = qc²sCM is the aerodynamic moment about the elastic axis (EA), q = ρU²/2 is the dynamic pressure, U is the free-stream airflow velocity, c is the chord of the (constant-chord) wing, s is the span of the wing, and CM(.) is the aerodynamic moment coefficient. Table 1 lists the estimated values of the structural parameters of Eq. (14) and other geometrical properties of the wing. For consistent comparison, the structural model in Eq. (14) is non-dimensionalized by replacing time t with the non-dimensional time τ = 2Ut/c. As a result, Eq. (14) can be rewritten with the derived parameters c1 = -6.875e-05, c2 = -2.038e-02, c3 = 3.819e-02, c4 = -7.275e-03 and c5 = 1.824e-01 as

θ̈ = c1 sign(θ̇) + c2 θ + c3 CM + c4 θ̇ + c5 θ³,
(15)
Parameter   Value                Parameter   Value (Estimated [9])
I           1.40e-03 kg.m2       Kl          3.39e-01 kg.m2/s2
ρ           1.19e+00 kg/m3       Knl         3.03e+00 kg.m2/(s2.rad2)
s           6.10e-01 m           Dl          1.11e-03 kg.m2/s
c           1.56e-01 m           Dnl         1.14e-03 kg.m2.rad/s
Table 1: Estimated structural parameters and other geometrical parameters.
where θ = θ(τ) is the pitch displacement as a function of the non-dimensional time τ, and the derivatives θ̇, θ̈ are with respect to τ. Note that the derived parameters c1, c2, c3, c4, c5 in Eq. (15) are known functions of the parameters of Eq. (14) and can be calculated using the parameter values listed in Table 1. The structural model and its parameters will be treated as known for inverse modelling purposes.

3.2. Data-generating aerodynamic model

The synthetic LCO data resembling the wind-tunnel response (reported in [9–11]) is generated by modelling the aerodynamic moment coefficient CM in Eq. (15) using the following stochastic ODE:

ĊM/B + CM = a1 θ + a2 θ̇ + a3 θ³ + a4 θ² θ̇ + (c6/B) θ̈ + σξ(τ),
(16)
where B, a1, a2, a3, a4 are aerodynamic coefficients, and σ is the strength of the modelling error, with ξ(τ) ∼ N(0, 1) a Gaussian random variable. The modelling error term in Eq. (16) helps mimic the effect of turbulence in the wind-tunnel setting. The pitch response thus obtained has frequency and amplitude modulations similar to the wind-tunnel data.

As evident from Eq. (16), the aerodynamic loading on the wing, as quantified by CM, involves contributions from both nonlinear and unsteady aerodynamics. The nonlinear aerodynamic stiffness is introduced through the a3θ³ term and the nonlinear aerodynamic damping through the a4θ²θ̇ term. The first-order memory effect (or unsteadiness) in the circulatory aerodynamic loading acting on the wing is introduced through the ĊM and θ̈ terms. See [11] and references therein for a detailed discussion of time-domain modelling of unsteady aerodynamics and the mathematical derivation of Eq. (16) using Wagner's theory. Note that Eq. (16) pertains to the assumption of the first-order Wagner function Φ(τ) = 1 − 0.5 exp(−Bτ), for which c6 = π(0.25 − ah²)/(2c). In essence, the Wagner constant B quantifies the effect of unsteadiness in the aerodynamic loading. In other words, a higher value of the Wagner constant B will nullify the terms ĊM/B and c6θ̈/B, and Eq. (16) will approach a quasi-steady model, more commonly known as the Duffing-van der Pol oscillator (without the noise term).

The measurement data needed for inverse modelling is obtained by forward integration of the coupled ODE system comprising the structural model in Eq. (15) and the aerodynamic model in Eq. (16). The coupled ODE system is converted into the state-space form of Eq. (1) using the state vector {θ, θ̇, CM}. The forward integration of the coupled ODE system is initiated from {θ, θ̇, CM} = {0, 0, 0}, similar to the wind-tunnel conditions. The dynamic
instability is induced by the synthetic turbulence introduced through the modelling error term σξ(τ) in Eq. (16). The LCO is reached at different time instants in the forward integration for varying seeds of the random number generator. The aerodynamic coefficients in Eq. (16) are chosen as B = 0.2, a1 = -1.25, a2 = -1.0, a3 = 100.0, a4 = -500.0 and σ = 2e-03. The pitch displacement θk is obtained through forward integration of the ODE system using the Euler-Maruyama scheme with ∆t = 1e-03. The measured pitch displacement dk is obtained by corrupting θk using the measurement equation dk = θk + ϵk, where ϵk ∼ N(0, γ²) is additive Gaussian noise with strength γ. The value of γ is chosen as 2e-03. This additive noise mimics the digital noise induced by the data-acquisition system in the wind-tunnel tests. See [59] for a detailed description of the data-acquisition system used in the wind-tunnel apparatus of the aeroelastic oscillator. Fig. 2 shows the measured pitch displacement and its power spectral density (PSD).
Figure 2: Raw synthetic data and its PSD.
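A minimal sketch of this data-generation step is given below. The structural coefficients c1–c5 and the aerodynamic coefficients are the values quoted above, while c6 is a hypothetical placeholder (its true value depends on the elastic-axis location, which is not restated here), so the trajectory is only qualitatively representative:

```python
import numpy as np

# Structural coefficients of Eq. (15) and aerodynamic coefficients of Eq. (16)
c1, c2, c3, c4, c5 = -6.875e-05, -2.038e-02, 3.819e-02, -7.275e-03, 1.824e-01
B, a1, a2, a3, a4, sigma = 0.2, -1.25, -1.0, 100.0, -500.0, 2e-03
c6 = 2.5          # hypothetical value; depends on the elastic-axis offset a_h
dt = 1e-03        # Euler-Maruyama step in non-dimensional time tau

def simulate(n_steps, seed=0):
    """Euler-Maruyama integration of the coupled system, state {theta, theta_dot, CM}."""
    rng = np.random.default_rng(seed)
    th, thd, cm = 0.0, 0.0, 0.0     # start from rest, as in the wind tunnel
    traj = np.empty((n_steps, 3))
    for k in range(n_steps):
        # Structural model, Eq. (15): theta_ddot from the current state
        thdd = c1 * np.sign(thd) + c2 * th + c3 * cm + c4 * thd + c5 * th**3
        # Aerodynamic model, Eq. (16): CM_dot = B*(rhs - CM) + c6*theta_ddot + noise
        rhs = a1 * th + a2 * thd + a3 * th**3 + a4 * th**2 * thd
        cm += (B * (rhs - cm) + c6 * thdd) * dt + B * sigma * np.sqrt(dt) * rng.normal()
        th += thd * dt
        thd += thdd * dt
        traj[k] = (th, thd, cm)
    return traj

traj = simulate(2000)
# Measured pitch: add N(0, gamma^2) digital noise, d_k = theta_k + eps_k
gamma = 2e-03
d = traj[:, 0] + gamma * np.random.default_rng(1).normal(size=len(traj))
```

Note the noise term scales with √∆t, which is the Euler-Maruyama treatment of the white-noise forcing σξ(τ).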
For modelling purposes, the measured pitch displacement is filtered using a low-pass filter with a cut-off frequency of 25 Hz. Fig. 3 shows the filtered pitch response and its PSD. The 25 Hz cut-off frequency ensures that the super-harmonics at and below 7f are retained in the pitch response, where f is the fundamental frequency of the LCO, while high-frequency measurement noise is removed. The measurement data d1:nd is obtained by trimming the filtered response, as shown in Fig. 3. The trimming keeps the size nd of d1:nd to a minimum, while ensuring that the essential aerodynamics is well represented in d1:nd. The total number of data points is nd = 5000.
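The low-pass filtering step can be sketched with a simple FFT-based filter; the paper does not specify the filter type used, so this zero-phase frequency-domain truncation is only an illustrative stand-in:

```python
import numpy as np

def fft_lowpass(x, fs, cutoff_hz):
    """Zero out all Fourier components above cutoff_hz (zero-phase low-pass)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(X, n=len(x))

# Illustrative signal: a 5 Hz "LCO" component plus a 100 Hz "digital noise" tone
fs = 1000.0                      # hypothetical sampling rate
t = np.arange(1000) / fs         # one second of data (integer cycle counts)
lco = np.sin(2 * np.pi * 5 * t)
noisy = lco + 0.3 * np.sin(2 * np.pi * 100 * t)
filtered = fft_lowpass(noisy, fs, cutoff_hz=25.0)
```

Because both tones fall on exact FFT bins in this example, the 100 Hz component is removed almost exactly while the 5 Hz component passes unchanged.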
Next, the Bayesian model reduction methodology detailed in Section 2 is applied to find the optimal nested model under a given proposed model using d1:nd .
Figure 3: Filtered synthetic data and its PSD. The data used for ARD computations is shown in red.
3.3. Unidimensional ARD prior for an essential parameter

In this section, we demonstrate the performance of the ARD prior in determining the relevance of a non-essential parameter of the proposed model using the information contained in the measurement data d1:nd. For this case, the proposed model is chosen to be the data-generating model, consisting of the coupled ODE system of Eq. (15) (structural model) and Eq. (16) (aerodynamic model). The parameters of the structural model in Eq. (15) and the parameter c6 in Eq. (16) are treated as known for the sake of modelling. Therefore, the unknown parameter vector for the proposed model is ϕ = {B, a1, a2, a3, a4, σ}.

For illustration purposes, we assume that sufficient a priori information exists about the parameters B, a1, a2, a4, σ such that an informative prior pdf can be assigned to them. On the contrary, limited information is assumed to be available about the parameter a3, and therefore the relevance of the a3θ³ term to the LCO dynamics needs to be determined. Note that executing Bayesian model selection using an arbitrarily chosen diffuse prior for a3 can inadvertently alter the model evidence, leading to unreliable posterior model probabilities [12]. ARD offers a rational solution in such circumstances by allowing the data to determine the prior pdf and the relevance of a non-essential parameter.

Based on the ARD methodology outlined in Section 2.3, the model parameters in ϕ are categorized as (1) essential ϕ-ψ = {B, a1, a2, a4, σ}, or (2) nonessential ϕψ = {a3}. Table 2 shows the joint prior pdf chosen for ϕ. The log-normal pdf is chosen for σ in order to enforce the positive and unbounded nature of the modelling-error strength. The prior pdfs for the essential parameters ϕ-ψ are known, and ϕψ = {a3} is assigned an ARD prior consisting of a zero-mean Gaussian with unknown variance 1/α. The hyper-parameter vector is ψ = {α}. For a given value of α, there exists a unique prior pdf p(ϕ|ψ) and a unique value of the model evidence p(d1:nd|ψ), which can be computed using the MCMC-powered Chib-Jeliazkov method described in Section 2.

Fig. 4 shows various statistical entities pertaining to Bayesian model selection for selected values of α. As evident from Fig. 4, the model evidence p(d1:nd|ψ) has a unique optimum at ψopt = αopt = 1e-04. When using the optimal prior pdf, the parameter a3 has a non-zero MAP estimate and its posterior pdf is dictated by d1:nd rather than by the prior pdf. In other words, the cubic aerodynamic stiffness (a3θ³) is found to be crucial to the LCO aerodynamics based on the information contained in
Prior pdf, p(ϕ|ψ) = p(ϕ-ψ) p(ϕψ|ψ); Hyper-parameter, ψ = {α}

p(ϕ-ψ):   L(B|0.2, 50) U(a1|−2, 0) U(a2|−2, 0) U(a4|−600, 0) L(σ|0.002, 50)
p(ϕψ|ψ):  ARD prior, N(a3|0, 1/α)

Table 2: Parameter prior pdf for the proposed model in Eq. (16). Note that L(.|r, s) represents a log-normal distribution with median r and coefficient of variation (COV) of s%; N(.|u, v) represents a Gaussian distribution with mean u and variance v; and U(.|e, f) represents a uniform distribution with bounds (e, f).
d1:nd. Fig. 5 shows the marginal posterior pdf of the parameter a3 for various α values. For α > αopt, the prior pdf p(a3|α) has a relatively low variance and can be treated as a Dirac delta function for all practical purposes. A Dirac delta function has a high information content (more specifically, a low Shannon entropy), which overwhelms the information contained in d1:nd and forces the posterior pdf p(a3|d1:nd, ψ) to approach a Dirac delta function centred at zero. This switching-off of the a3 parameter is evident from Fig. 5 for α > αopt. This pruning of the a3 parameter results in a drastic decrease in the goodness-of-fit, which dictates the decrease in the model evidence. The model evidence is also decreased by the minor increase in the information gain for α > αopt.

As shown in Fig. 5, the marginal posterior pdf p(a3|d1:nd, ψopt) and the MAP estimate of a3 pertaining to the optimal evidence value p(d1:nd|ψopt) show close resemblance to the true value a3 = 100.0 used in generating d1:nd. Note that it has been argued that choosing a zero-mean ARD prior might bias the posterior parameter statistics due to contrary information in the likelihood and the prior. Fig. 5 shows the marginal posterior pdf of a3 when a flat prior p(a3) ∝ 1 is selected for a3 in Eq. (16), while all
Figure 4: Statistical quantities pertaining to Bayesian model selection for various values of the hyper-parameter α.
other priors are unchanged from Table 2. It is evident from Fig. 5 that the bias in the posterior statistics of a3 is negligible when using the zero-mean ARD prior as compared to using a flat prior. In other words, the posterior parameter statistics of non-essential parameters are dictated solely by the likelihood p(d1:nd|ϕ) and (in the case of strong correlation among parameters) the prior pdfs of the essential parameters. Note that this conclusion can only be generalized to the 'large and dense' data case and may not hold when the information content of the likelihood and the prior are comparable.
Figure 5: Marginal posterior pdf of the a3 parameter for various values of the hyper-parameter α. Note that the prior and posterior bounds plotted herein contain 99% of the probability mass.
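The hybrid prior of Table 2 can be written as a single log-density. The sketch below is an illustrative assumption (in particular the (median, COV%) parametrization of the log-normal and all helper names), not the authors' implementation:

```python
import numpy as np

def lognormal_logpdf(x, median, cov_pct):
    """Log-normal log-density parametrized by median and coefficient of variation (%)."""
    if x <= 0:
        return -np.inf
    mu = np.log(median)
    s2 = np.log(1.0 + (cov_pct / 100.0) ** 2)   # variance of ln(x)
    return -0.5 * np.log(2 * np.pi * s2) - np.log(x) - (np.log(x) - mu) ** 2 / (2 * s2)

def uniform_logpdf(x, lo, hi):
    return -np.log(hi - lo) if lo <= x <= hi else -np.inf

def log_prior(phi, alpha):
    """Joint log-prior of Table 2: known priors for {B, a1, a2, a4, sigma} and a
    zero-mean Gaussian ARD prior with variance 1/alpha for a3."""
    B, a1, a2, a3, a4, sigma = phi
    lp = lognormal_logpdf(B, 0.2, 50) + uniform_logpdf(a1, -2, 0)
    lp += uniform_logpdf(a2, -2, 0) + uniform_logpdf(a4, -600, 0)
    lp += lognormal_logpdf(sigma, 0.002, 50)
    lp += -0.5 * np.log(2 * np.pi / alpha) - 0.5 * alpha * a3 ** 2  # ARD prior
    return lp

phi_true = (0.2, -1.25, -1.0, 100.0, -500.0, 2e-3)
```

Evaluating this at the data-generating parameter values shows the pruning mechanism directly: increasing α rapidly penalizes any non-zero a3.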
3.4. Unidimensional ARD prior for a non-essential parameter

In this section, we demonstrate the performance of the model reduction approach when a unidimensional ARD prior is assigned to a non-essential parameter of the proposed model. For this case, the following aerodynamic model is proposed to model the coefficient of moment in Eq. (15) using d1:nd:

ĊM/B + CM = a1 θ + a2 θ̇ + a3 θ³ + a4 θ² θ̇ + a5 θ⁵ + (c6/B) θ̈ + σξ(τ)
(17)
The proposed model has an additional term a5θ⁵ compared to the data-generating model in Eq. (16). The unknown model parameter vector is ϕ = {B, a1, a2, a3, a4, a5, σ}. For illustration purposes, we assume that limited a priori knowledge is available about a5, and its relevance to the LCO dynamics needs to be determined based on the information contained in d1:nd. Hence, we categorize the model parameters as (1) essential ϕ-ψ = {B, a1, a2, a3, a4, σ}, or (2) nonessential ϕψ = {a5}. Table 3 shows the prior pdf for the essential ϕ-ψ and nonessential ϕψ parameters, where the parameter a5 is assigned an ARD prior.

Prior pdf, p(ϕ|ψ) = p(ϕ-ψ) p(ϕψ|ψ); Hyper-parameter, ψ = {α}

p(ϕ-ψ):   L(B|0.2, 50) U(a1|−2, 0) U(a2|−2, 0) U(a3|−250, 250) U(a4|−600, 0) L(σ|0.002, 50)
p(ϕψ|ψ):  ARD prior, N(a5|0, 1/α)

Table 3: Parameter prior pdf for the proposed model in Eq. (17). Note that L(.|r, s) represents a log-normal distribution with median r and coefficient of variation (COV) of s%; N(.|u, v) represents a Gaussian distribution with mean u and variance v; and U(.|e, f) represents a uniform distribution with bounds (e, f).
Fig. 6 shows the Bayesian model selection results obtained for selected values of α. Fig. 7 shows the marginal posterior pdf of a5 for varying α. As evident from Fig. 6, the model evidence plateaus at its maximum for all α > 1e-05, with αopt = ∞ being one of the solutions. As seen in Fig. 7, both the posterior and prior pdfs of a5 approach a Dirac delta function for all α > 1e-05. In other words, the maximum model evidence indicates that the parameter a5 is irrelevant to the aerodynamics represented in the measurement data d1:nd. Hence, the a5θ⁵ term can be pruned from the proposed model in Eq. (17) for all practical purposes. The small amount of noise present in the evidence estimates for α > 1e-05 is due to the sampling-based evidence computation.

Fig. 7 also shows the posterior pdf of a5 using the ARD prior (αopt = 100.0) and the posterior pdf obtained using a flat prior (p(a5) ∝ 1), while all other priors are the same as in Table 3. The pdf bounds in Fig. 7 indicate that the posterior pdf pertaining to the maximum model evidence is a Dirac delta function in comparison to the posterior pdf obtained using the flat prior. This observation further demonstrates the pruning of parameter a5 through ARD.
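The qualitative behaviour in Figs. 4 and 6 (an interior evidence optimum for a relevant parameter, αopt → ∞ for an irrelevant one) can be reproduced in a toy conjugate model where the evidence is available in closed form. The model below (d_i = w + ε_i, with w given a zero-mean ARD prior) is an illustrative stand-in, not the paper's oscillator:

```python
import numpy as np

def log_evidence(d, alpha, noise_sd=0.5):
    """Closed-form log-evidence for d_i = w + eps_i, eps_i ~ N(0, noise_sd^2),
    with ARD prior w ~ N(0, 1/alpha): marginally d ~ N(0, noise_sd^2 I + (1/alpha) 11^T)."""
    n = len(d)
    C = noise_sd**2 * np.eye(n) + (1.0 / alpha) * np.ones((n, n))
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

n = 20
relevant = 1.0 + 0.1 * np.cos(np.arange(n))      # data with a clear nonzero mean
irrelevant = 0.1 * np.cos(np.arange(n))          # data with mean near zero
```

When the parameter genuinely contributes to the data, a moderate α beats a near-delta prior; when it does not, the evidence keeps growing as α increases, signalling that w should be pruned.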
Figure 6: Statistical quantities pertaining to Bayesian model selection for various values of the hyper-parameter α.
3.5. Multi-dimensional ARD prior

In this section, we investigate the performance of a multidimensional ARD prior when the relevance of multiple parameters needs to be determined simultaneously. In other words, we seek the optimal nested model
Figure 7: Marginal posterior pdf of the a5 parameter for various values of α.
in the proposed model when limited information is available about multiple model parameters. The following aerodynamic model is proposed to model the coefficient of moment in Eq. (15):

ĊM/B + CM = a1 θ + a2 θ̇ + a3 θ³ + a4 θ² θ̇ + a5 θ⁵ + a6 θ⁴ θ̇ + (c6/B) θ̈ + σξ(τ)   (18)

The proposed model has the additional terms a5θ⁵ and a6θ⁴θ̇ in comparison to the data-generating aerodynamic model in Eq. (16). The unknown model parameter vector is ϕ = {B, a1, a2, a3, a4, a5, a6, σ}. It is assumed that the relevance of the parameters {a3, a4, a5, a6} is unknown and needs to be determined
using ARD priors. On the contrary, the parameters {B, a1, a2, σ} are known to be relevant based on prior information, and their prior pdfs are available. Using the ARD methodology outlined in Section 2.3, we categorize the model parameters as (1) essential ϕ-ψ = {B, a1, a2, σ}, or (2) nonessential ϕψ = {a3, a4, a5, a6}. Table 4 shows the prior pdf chosen for the essential ϕ-ψ and nonessential ϕψ parameters for the sake of model reduction or relevance determination.

Note that the evidence p(d1:nd|ψ) is a function of the hyper-parameter vector ψ = {α1, α2, α3, α4} for a given d1:nd. The evidence computation and
Prior pdf, p(ϕ|ψ) = p(ϕ-ψ) p(ϕψ|ψ); Hyper-parameter, ψ = {α1, α2, α3, α4}

p(ϕ-ψ):   L(B|0.2, 50) U(a1|−2, 0) U(a2|−2, 0) L(σ|0.002, 50)
p(ϕψ|ψ):  ARD prior, N(a3|0, 1/α1) N(a4|0, 1/α2) N(a5|0, 1/α3) N(a6|0, 1/α4)

Table 4: Parameter prior pdf for the proposed model in Eq. (18). Note that L(.|r, s) represents a log-normal distribution with median r and coefficient of variation (COV) of s%; N(.|u, v) represents a Gaussian distribution with mean u and variance v; and U(.|e, f) represents a uniform distribution with bounds (e, f).
the subsequent maximization are performed using the parallel MCMC algorithm described in Section 2.3. Note that the starting value of ψ for the evidence maximization is chosen as ψ = {1e-10, 1e-10, 1e-10, 1e-10}. These low hyper-parameter values ensure that the ARD prior pdf is nearly flat over the posterior space. This choice of the hyper-parameter is critical to ensure that the posterior pdf of the non-essential parameters is dictated solely by the data d1:nd in the initial iterations of the evidence maximization algorithm.

Fig. 8 shows the statistical entities pertaining to Bayesian model selection for successful evidence maximization iterations. Fig. 9 contrasts the MAP estimate and the bounds of the marginal posterior and prior pdfs of the non-essential parameters for successful evidence maximization iterations. The optimal value of the hyper-parameter was obtained as ψopt = {1e-04, 1e-05, ∞, ∞}. It follows from the ψopt value that the parameters a3 and a4 are relevant to the LCO aerodynamics while the parameters a5 and a6 are redundant. The marginal posterior pdfs and MAP estimates of a5 and a6 further demonstrate the switching-off of these parameters at the optimal evidence value. The goodness-of-fit remains unchanged during the maximization since the posterior pdf of the relevant parameters is unaltered by the ARD priors and the redundant parameters have no significant contribution to the data fit. The information gain decreases due to the decrease in model complexity as the irrelevant parameters are pruned away. The information gain remains constant once the posterior pdfs of a5 and a6 are Dirac delta functions centred at zero. The increase in evidence is thus generated solely by the decrease in EIG while the goodness-of-fit remains constant throughout the maximization. The optimal nested model and the optimized parameter prior pdf are listed in Table 5. Note that the optimal nested model is the same as the data-generating model in Eq. (16). Fig. 10 shows the marginal and joint posterior pdfs of the relevant parameters a3 and a4 using the optimized prior pdf. Fig. 10 also shows the posterior pdfs of the relevant parameters obtained by using flat priors for a3 and a4 while the prior pdfs of the other parameters are the same as in Table 5. The closeness of the posterior pdfs for the two choices of prior further reinforces the fact that a zero-mean ARD prior does not bias the posterior parameter statistics when the information content of d1:nd dominates that of the prior pdf (a 'large' data case). Unlike reversible jump MCMC, the optimal nested model listed in Table 5 can be compared with any other proposed model using the optimal evidence value without any re-calculation [64].
Figure 8: Output of evidence maximization.
Optimal nested model; ln p(d1:nd) = 2.78127e+04

Model:  ĊM/B + CM = a1 θ + a2 θ̇ + a3 θ³ + a4 θ² θ̇ + (c6/B) θ̈ + σξ(τ)
ϕ:      {B, a1, a2, a3, a4, σ}
p(ϕ):   L(B|0.2, 50) U(a1|−2, 0) U(a2|−2, 0) N(a3|0, 1e+04) N(a4|0, 1e+05) L(σ|0.002, 50)

Table 5: Optimal nested model and the optimized parameter prior pdf. Note that L(.|r, s) represents a log-normal distribution with median r and coefficient of variation (COV) of s%; N(.|u, v) represents a Gaussian distribution with mean u and variance v; and U(.|e, f) represents a uniform distribution with bounds (e, f).
Figure 9: Marginal posterior pdfs of non-essential parameters.
3.6. Comparison of non-nested models

In this section, we consider the comparison of multiple plausible models that cannot be considered as nested under a common encompassing model and that have different state-space compositions. Table 6 lists three such
Figure 10: Comparison of the marginal and joint posterior pdfs of the relevant parameters a3 and a4 for the ARD prior with optimal hyper-parameters and for flat priors. Note that ρ in the joint pdf plots is the correlation coefficient between the two parameters.
overly-complicated aerodynamic models, namely {M1, M2, M3}, that are proposed to model the aerodynamics in d1:nd. A hybrid model comparison approach is designed wherein ARD and Bayesian model selection using posterior model probabilities are executed concurrently. We use the ARD approach to perform model reduction on each model in the candidate set {M1, M2, M3}, and the resulting optimal nested models {Mopt1, Mopt2, Mopt3} are compared using posterior model probabilities computed from the corresponding optimal evidence values. The optimal model thus obtained has the highest evidence among all the models nested under the three models {M1, M2, M3}.
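With equal prior model probabilities, the posterior model probabilities follow from the optimal log-evidences by a numerically stable softmax. The three log-evidence offsets used below are the values reported later in Table 10 for {Mopt1, Mopt2, Mopt3} (relative to the common +2.77e+04 offset, which cancels in the normalization):

```python
import numpy as np

def posterior_model_probs(log_evidences):
    """Posterior model probabilities under equal model priors (log-sum-exp softmax)."""
    z = np.asarray(log_evidences, dtype=float)
    z -= z.max()                      # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Optimal log-evidences of the three reduced models (common offset cancels)
probs = posterior_model_probs([18.29, 112.70, 81.25])
```

Mopt2 receives essentially all of the posterior mass, consistent with the posterior probabilities reported in Table 10.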
Unlike black-box modelling, the model set {M1, M2, M3} in Table 6 is motivated by a predefined goal. Performing model reduction on the individual models using ARD will help identify the nature of the nonlinear aerodynamic stiffness through the optimal nested models. On the other hand, model comparison of the optimal nested models will help identify the nature of the unsteady aerodynamics. Hence, the optimal model thus obtained will help quantify the contributions of unsteady aerodynamics and nonlinear aerodynamic stiffness to the LCO dynamics captured in d1:nd. Note that M1 is a quasi-steady model, M2 is a first-order Wagner model, and M3 is a second-order Wagner model. See Sandhu et al. [11] and references therein for a detailed mathematical derivation of the Wagner models M2 and M3 using Wagner's unsteady theory.

M    Model definition

M1   CM = e1 θ + e2 θ̇ + e3 θ³ + e4 θ² θ̇ + e5 θ⁵ + e6 θ⁴ θ̇ + σξ(τ)

M2   ĊM/B + CM = e1 θ + e2 θ̇ + e3 θ³ + e4 θ² θ̇ + e5 θ⁵ + e6 θ⁴ θ̇ + (c6/B) θ̈ + σξ(τ)

M3   C̈M/(B1B2) + (B1 + B2)ĊM/(B1B2) + CM = e1 θ + e2 θ̇ + e3 θ³ + e4 θ² θ̇ + e5 θ⁵ + e6 θ⁴ θ̇ + (c6/(B1B2)) θ⃛ + ((2c6c7 + 0.5)/(B1B2)) θ̈ + σξ(τ)

Table 6: Proposed models.
The Bayesian model reduction results for the model M2 have already been reported and discussed in Section 3.5. The hybrid prior pdfs p(ϕ|ψ) for the quasi-steady model M1 and the second-order Wagner model M3 are shown in Table 7 and Table 8, respectively. Fig. 11 and Fig. 12 show the Bayesian entities during evidence maximization for models M1 and M3, respectively. Fig. 13 and Fig. 14 contrast the marginal posterior and prior pdfs of the non-essential parameters for models M1 and M3. Table 9 lists the optimal nested models obtained using ARD for the model set {M1, M2, M3}. Table 10 lists the statistical entities pertaining to Bayesian model selection for the optimal nested models. Note that Mopt2 is selected as the overall optimal model and has the same parametric structure as the data-generating aerodynamic model. Fig. 15 shows the marginal posterior pdfs and Fig. 16 shows the joint posterior pdfs of the overall optimal model. Note that the parameter estimates show close resemblance to the true parameter values.

Prior pdf, p(ϕ|ψ) = p(ϕ-ψ) p(ϕψ|ψ); Hyper-parameter, ψ = {α1, α2, α3, α4}

p(ϕ-ψ):   U(a1|−2, 0) U(a2|0, 2) L(σ|0.002, 50)
p(ϕψ|ψ):  ARD prior, N(a3|0, 1/α1) N(a4|0, 1/α2) N(a5|0, 1/α3) N(a6|0, 1/α4)

Table 7: Parameter prior pdf for model M1 in Table 6. Note that L(.|r, s) represents a log-normal distribution with median r and coefficient of variation (COV) of s%; N(.|u, v) represents a Gaussian distribution with mean u and variance v; and U(.|e, f) represents a uniform distribution with bounds (e, f).
Prior pdf, p(ϕ|ψ) = p(ϕ-ψ) p(ϕψ|ψ); Hyper-parameter, ψ = {α1, α2, α3, α4}

p(ϕ-ψ):   U(A|0, 0.5) L(B1|0.32, 50) L(B2|0.04, 50) U(a1|−2, 0) U(a2|−2, 0) L(σ|0.002, 50)
p(ϕψ|ψ):  ARD prior, N(a3|0, 1/α1) N(a4|0, 1/α2) N(a5|0, 1/α3) N(a6|0, 1/α4)

Table 8: Parameter prior pdf for model M3 in Table 6. Note that L(.|r, s) represents a log-normal distribution with median r and coefficient of variation (COV) of s%; N(.|u, v) represents a Gaussian distribution with mean u and variance v; and U(.|e, f) represents a uniform distribution with bounds (e, f).
4. Conclusion

An automatic Bayesian model reduction scheme is presented to prune the irrelevant parameters of a proposed model. The concept of automatic
Figure 11: Output of evidence maximization for model M1.
Figure 12: Output of evidence maximization for model M3.
Figure 13: Output of evidence maximization for model M1.
Figure 14: Output of evidence maximization for model M3.
M       Model definition

Mopt1   CM = e1 θ + e2 θ̇ + e3 θ³ + e4 θ² θ̇ + σξ(τ)

Mopt2   ĊM/B + CM = e1 θ + e2 θ̇ + e3 θ³ + e4 θ² θ̇ + (c6/B) θ̈ + σξ(τ)

Mopt3   C̈M/(B1B2) + (B1 + B2)ĊM/(B1B2) + CM = e1 θ + e2 θ̇ + e3 θ³ + e4 θ² θ̇ + (c6/(B1B2)) θ⃛ + ((2c6c7 + 0.5)/(B1B2)) θ̈ + σξ(τ)

Table 9: Optimal nested models obtained using ARD.
Mopt_r   Goodness-of-fit      EIG                                 Log-evidence         Posterior probability
         E[ln p(d1:nd|ϕ)]     E[ln p(ϕ|d1:nd, ψopt)/p(ϕ|ψopt)]    ln p(d1:nd|ψopt)     P(Mopt_r|d1:nd)
         (+2.77e+04)                                              (+2.77e+04)

Mopt1    37.81                19.51                               18.29                0.000
Mopt2    135.55               22.85                               112.70               1.000
Mopt3    129.42               48.17                               81.25                0.000

Table 10: Statistical entities pertaining to Bayesian model selection for the optimal nested models.
Figure 15: Marginal posterior pdf of the overall optimal model Mopt 2 .
40
Figure 16: Joint posterior pdf of the overall optimal model M_2^opt (pairwise joint pdfs with correlation coefficients ρ).
relevance determination (ARD) is used to identify the relevance of individual model parameters through parametrized prior pdfs, called ARD priors. We propose a hybrid approach for choosing prior pdfs, wherein parameters with significant prior knowledge are assigned a known pdf and the rest are assigned ARD priors. The model selection problem is thereby transformed from a discrete model domain to a continuous one governed by the hyper-parameters of the ARD priors. The optimal model corresponds to the optimal ARD prior, that is, to the maximum evidence value within the continuous model domain. The proposed method of model reduction permits implicit comparison of all models nested under the proposed model, without the need to assign prior pdfs to questionable model parameters. The evidence computation and optimization are accelerated by an adaptive MCMC sampler implemented using high-performance computing. The efficiency of the MCMC sampler aids the Chib-Jeliazkov method in computing reliable evidence estimates in minimal computation time.

The model reduction methodology is applied to synthetic data generated from a nonlinear aeroelastic oscillator exhibiting limit cycle oscillations (LCO). Three intentionally over-specified models were proposed with varying capabilities in modelling the unsteady aerodynamics of the LCO. The proposed models had different state-space vectors and could not be represented as nested under a common encompassing model. An optimal nested model was therefore obtained for each proposed model using ARD. The three optimal nested models were then compared through Bayesian model selection, using their optimal evidence values to obtain posterior model probabilities. The optimal nested model with the highest posterior probability was the same as the data-generating model. The posterior parameter samples were obtained as a by-product of the evidence computation algorithm.
The posterior parameter pdfs showed close agreement with the true parameter values used in generating the synthetic data. It was also shown that the zero-mean ARD priors introduce negligible bias in the posterior parameter statistics. Hence, the data-generating model was recovered as the best (maximum evidence) model among all the models nested under the three more complicated models. It is concluded that using ARD with Bayesian model selection enlarges the model domain and increases the chances of obtaining a better model than discrete model comparison alone.

5. Acknowledgement

The first author acknowledges the support of the Ontario Graduate Scholarship program and the Canadian Department of National Defence. The fourth author acknowledges the support of the Canadian Department of National Defence and a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada. The fifth author acknowledges the support of a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada and the Canada Research Chair program. The computing infrastructure is supported by the Canada Foundation for Innovation (CFI), the Ontario Innovation Trust (OIT), and the CLUMEQ and SciNet HPC consortia in Canada.

Appendix A. ARD prior: Application to polynomial regression

In this section, we present the application of Bayesian model reduction using the ARD prior for polynomial regression models. This numerical exercise is chosen to illustrate the benefits of ARD as an automatic model reduction tool while avoiding the complexity of ODE-based models. The measured
response d1:nd = {y1, y2, ..., yi, ..., ynd} is generated using

y_i = 3 + 5x_i + \epsilon_i,   (A.1)
where ϵi ∼ N(0, 0.3^2) is a Gaussian random variable with independent and identically distributed (IID) samples. Figure A.1 shows the measured response and the true response (the measured response minus the noise).
Figure A.1: Measured response and true response
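The synthetic data of Eq. (A.1) can be generated as follows; the sample size n_d, the input grid, and the random seed are assumptions made for illustration, since the text does not state them:

```python
import numpy as np

rng = np.random.default_rng(0)
n_d = 50                                 # number of measurements (assumed)
x = np.linspace(-1.0, 1.0, n_d)          # input grid (assumed)
true_y = 3.0 + 5.0 * x                   # true response of Eq. (A.1), noise removed
y = true_y + rng.normal(0.0, 0.3, n_d)   # measured response, eps_i ~ N(0, 0.3^2), IID
```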
Next, we propose the following intentionally general model for the measured response d1:nd:

y_i = a_0 + a_1 x_i + a_2 x_i^2 + a_3 x_i^3 + a_4 x_i^4 + \epsilon_i,   (A.2)
where ϵi ∼ N(0, γ^2) is a Gaussian random variable with IID samples and unknown strength γ. The unknown parameter vector for the proposed model is ϕ = {a0, a1, a2, a3, a4, γ}. The ai parameters are assigned independent ARD priors, while γ is assigned a known lognormal prior. The joint prior
pdf p(ϕ|ψ) is assigned as

p(\phi|\psi) = p(a_0)\,p(a_1)\,p(a_2)\,p(a_3)\,p(a_4)\,p(\gamma) = N(0, 1/\alpha_0)\, N(0, 1/\alpha_1)\, N(0, 1/\alpha_2)\, N(0, 1/\alpha_3)\, N(0, 1/\alpha_4)\, LN(0.03, 0.5),   (A.3)

(the five zero-mean normal factors are the ARD priors)
where the hyper-parameter vector ψ = {α0 , α1 , α2 , α3 , α4 }. For the model in Eq. (A.2) the likelihood function is
p(d_{1:n_d}|\phi) = \prod_{i=1}^{n_d} p(y_i|\phi) = \prod_{i=1}^{n_d} \frac{1}{\sqrt{2\pi}\,\gamma} \exp\left( -\frac{\left(y_i - \sum_{j=0}^{4} a_j x_i^j\right)^2}{2\gamma^2} \right).   (A.4)
For the data d1:nd considered here, the posterior distribution p(ϕ|d1:nd, ψ) of the model parameters can be approximated as Gaussian for all ψ values (confirmed through MCMC simulations). Consequently, the model evidence can be computed using the Laplace method by approximating the posterior distribution around the maximum a posteriori value ϕMAP with a Taylor series expansion, as detailed in Appendix B. The resulting expression for the log-evidence is

\log p(d_{1:n_d}|\psi) \approx \log\{p(d_{1:n_d}|\phi_{MAP})\, p(\phi_{MAP}|\psi)\} + \frac{n_\phi}{2}\ln 2\pi - \frac{1}{2}\log|H(\psi)|,   (A.5)

where

H(\psi) = -\left.\frac{\partial^2 \log\{p(d_{1:n_d}|\phi)\, p(\phi|\psi)\}}{\partial\phi\,\partial\phi^T}\right|_{\phi=\phi_{MAP}}.   (A.6)
The covariance matrix Σ(ψ) of the posterior p(ϕ|d1:nd, ψ) equals the inverse of the Hessian matrix H(ψ). Note that, given d1:nd, both H(ψ) and Σ(ψ) are functions of the hyper-parameter vector ψ alone. For any choice of ψ, the posterior pdf p(ϕ|d1:nd, ψ) ∝ p(d1:nd|ϕ) p(ϕ|ψ) is optimized to find the MAP estimate ϕMAP and the Hessian H(ψ). Then, ϕMAP and H(ψ) are substituted into Eq. (A.5) to calculate the evidence p(d1:nd|ψ). The evidence is maximized to obtain the optimal hyper-parameter vector ψopt. The posterior pdf p(ϕ|d1:nd, ψopt) corresponding to the maximum model evidence is then sampled using MCMC in order to investigate the posterior statistics of the optimal nested model. Figure A.2 contrasts the marginal posterior pdfs (from the posterior samples) and the prior pdfs (using ψopt) of the ai parameters in ϕ. Notice that both the prior and the posterior of the redundant parameters a2, a3 and a4 resemble a Dirac delta function centred at zero. In other words, these parameters were pruned from the proposed model during evidence maximization and are irrelevant for subsequent model usage. Also, the parameters a0 and a1 have been estimated accurately, and their MAP values are quite close to the true values. This exercise shows the potential of ARD to discern the appropriate complexity of a model within the Bayesian paradigm.
Figure A.2: Marginal prior and posterior pdfs of model parameters
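The evidence-maximization loop of this appendix can be sketched as follows. Two simplifications relative to the text are assumed: the noise strength γ is held fixed at its true value instead of being estimated under a lognormal prior (which makes the posterior over the a_j exactly Gaussian, with the MAP estimate and evidence available in closed form), and the hyper-parameters α_j are updated with the standard fixed-point rule of evidence maximization rather than by direct numerical optimization. The data grid, sample size, and seed are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from Eq. (A.1); grid, sample size and seed are assumed.
n_d, gamma = 50, 0.3
x = np.linspace(-1.0, 1.0, n_d)
y = 3.0 + 5.0 * x + rng.normal(0.0, gamma, n_d)

# Design matrix of the degree-4 polynomial model, Eq. (A.2).
X = np.vander(x, 5, increasing=True)      # columns: 1, x, x^2, x^3, x^4

# ARD by evidence maximization with gamma fixed (simplification vs. the text).
alpha = np.ones(5)                        # ARD hyper-parameters, prior a_j ~ N(0, 1/alpha_j)
beta = 1.0 / gamma**2                     # noise precision
for _ in range(100):
    Sigma = np.linalg.inv(beta * X.T @ X + np.diag(alpha))  # posterior covariance
    m = beta * Sigma @ X.T @ y                              # posterior mean = MAP estimate
    g = 1.0 - alpha * np.diag(Sigma)      # well-determined fraction of each parameter
    alpha = np.clip(g / np.maximum(m**2, 1e-12), 1e-6, 1e12)  # evidence fixed point

# The alphas of the redundant coefficients a2, a3, a4 grow large, shrinking
# their posteriors toward delta functions at zero; a0 and a1 remain close
# to the true values 3 and 5.
```

Because the posterior over the a_j is exactly Gaussian here, the closed-form evidence being maximized coincides with the Laplace approximation of Eq. (A.5).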
Appendix B. Estimating model evidence using Laplace method

For the case of large and dense measurement data, the Laplace method [34] provides an estimate of the model evidence by approximating the posterior distribution as a Gaussian distribution centred at the maximum a posteriori estimate ϕMAP. The logarithm of the model evidence p(d1:nd|ψ) can be written as
\ln p(d_{1:n_d}|\psi) = \ln\left[\int \exp\{L(\phi, \psi)\}\, d\phi\right],   (B.1)
where L(\phi, \psi) = \ln\{p(d_{1:n_d}|\phi)\, p(\phi|\psi)\}. When the posterior p(\phi|d_{1:n_d}, \psi) is close to Gaussian, L(\phi, \psi) is expanded around \phi_{MAP} using the Taylor series expansion as

L(\phi, \psi) = L(\phi_{MAP}, \psi) + \frac{1}{2}(\phi - \phi_{MAP})^T \left.\frac{\partial^2 L(\phi, \psi)}{\partial\phi\,\partial\phi^T}\right|_{\phi=\phi_{MAP}} (\phi - \phi_{MAP}) + \ldots
             \approx L(\phi_{MAP}, \psi) - \frac{1}{2}(\phi - \phi_{MAP})^T \Sigma(\psi)^{-1} (\phi - \phi_{MAP}),   (B.2)

where the Hessian matrix H(\psi) of L(\phi, \psi) and the covariance matrix \Sigma(\psi) of the posterior p(\phi|d_{1:n_d}, \psi) are defined as

H(\psi) = \Sigma(\psi)^{-1} = -\left.\frac{\partial^2 L(\phi, \psi)}{\partial\phi\,\partial\phi^T}\right|_{\phi=\phi_{MAP}}.   (B.3)
Substituting L(\phi, \psi) from Eq. (B.2) into Eq. (B.1), we get

\ln p(d_{1:n_d}|\psi) \approx \ln\left[\int \exp\left\{L(\phi_{MAP}, \psi) - \frac{1}{2}(\phi - \phi_{MAP})^T \Sigma(\psi)^{-1} (\phi - \phi_{MAP})\right\} d\phi\right]
 = \ln\left[\exp\{L(\phi_{MAP}, \psi)\}\right] + \ln\left[\int \exp\left\{-\frac{1}{2}(\phi - \phi_{MAP})^T \Sigma(\psi)^{-1} (\phi - \phi_{MAP})\right\} d\phi\right]
 = L(\phi_{MAP}, \psi) + \ln\sqrt{\frac{(2\pi)^{n_\phi}}{|\Sigma(\psi)^{-1}|}}
 = L(\phi_{MAP}, \psi) + \frac{n_\phi}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma(\psi)^{-1}|
 = L(\phi_{MAP}, \psi) + \frac{n_\phi}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma(\psi)|.   (B.4)

Hence, to calculate the model evidence using the Laplace method, we need the mode ϕMAP and the covariance matrix Σ(ψ) (or the Hessian H(ψ)) for the given measurement data d1:nd and the parameter prior pdfs (or ψ).

Appendix C. Extended Kalman Filter

In this section we list the expressions for obtaining the update and forecast mean and covariance of the state vector, given the nonlinear model and measurement equations in Eq. (1) and Eq. (2), respectively. This section follows the references [4, 24]. See Section 2.1 for details about the notation.

Appendix C.1. Update step
Updated mean:        u^a_{j(k)} = u^f_{j(k)} + K_k \left(d_k - h_k(u^f_{j(k)}, \phi, \bar{\epsilon}_k)\right)
Updated covariance:  P^a_{j(k)} = (I - K_k C_k)\, P^f_{j(k)}   (C.1)
where the Kalman gain matrix is

K_k = P^f_{j(k)} C_k^T \left(C_k P^f_{j(k)} C_k^T + D_k \Gamma_k D_k^T\right)^{-1}   (C.2)
and the Jacobian matrices are

C_k = \left.\frac{\partial h_k(.)}{\partial u_{j(k)}}\right|_{u_{j(k)} = u^f_{j(k)},\, \epsilon_k = \bar{\epsilon}_k}, \qquad D_k = \left.\frac{\partial h_k(.)}{\partial \epsilon_k}\right|_{u_{j(k)} = u^f_{j(k)},\, \epsilon_k = \bar{\epsilon}_k}.   (C.3)
Appendix C.2. Forecast step
Forecast mean:        u^f_{j+1} = g_j(u^a_j, \phi, f_j, \bar{q}_j)
Forecast covariance:  P^f_{j+1} = A_j P^a_j A_j^T + B_j Q_j B_j^T   (C.4)
where the Jacobian matrices are

A_j = \left.\frac{\partial g_j(.)}{\partial u_j}\right|_{u_j = u^a_j,\, q_j = \bar{q}_j}, \qquad B_j = \left.\frac{\partial g_j(.)}{\partial q_j}\right|_{u_j = u^a_j,\, q_j = \bar{q}_j}.   (C.5)
Appendix C.3. Expression for likelihood function

Given the mean and covariance of the forecast state pdf computed using the EKF, the expression for the likelihood function in Eq. (9) reduces to

p(d_{1:n_d}|\phi) = \prod_{k=1}^{n_d} \mathcal{N}\!\left(d_k \,\middle|\, h_k(u^f_k, \bar{\epsilon}_k),\; C_k P^f_k C_k^T + D_k \Gamma_k D_k^T\right).   (C.6)
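The recursions of Eqs. (C.1)-(C.5) and the likelihood of Eq. (C.6) can be sketched on a scalar linear state-space model, for which the Jacobians reduce to constants (an illustrative toy model, not the paper's aeroelastic system; all numerical values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative scalar model (assumed, not the paper's aeroelastic system):
#   forecast:    u_{j+1} = a u_j + q_j,  q_j ~ N(0, Q)
#   measurement: d_k = u_k + eps_k,      eps_k ~ N(0, R)
# For this linear model the Jacobians of Eqs. (C.3) and (C.5) are constants.
a, Q, R = 0.95, 0.05, 0.5

def ekf_step(u_f, P_f, d):
    """One update (Eqs. C.1-C.2) followed by one forecast (Eq. C.4)."""
    C = 1.0                                # measurement Jacobian dh/du
    K = P_f * C / (C * P_f * C + R)        # Kalman gain, Eq. (C.2)
    u_a = u_f + K * (d - u_f)              # updated mean, Eq. (C.1)
    P_a = (1.0 - K * C) * P_f              # updated covariance, Eq. (C.1)
    A = a                                  # model Jacobian dg/du
    return a * u_a, A * P_a * A + Q, u_a, P_a

u_true, u_f, P_f = 1.0, 0.0, 10.0          # true state and diffuse initial forecast
log_lik = 0.0
for _ in range(200):
    u_true = a * u_true + rng.normal(0.0, np.sqrt(Q))   # true state evolution
    d = u_true + rng.normal(0.0, np.sqrt(R))            # noisy measurement
    S = P_f + R                            # innovation covariance, C P C^T + R
    log_lik += -0.5 * (np.log(2 * np.pi * S) + (d - u_f) ** 2 / S)  # Eq. (C.6)
    u_f, P_f, u_a, P_a = ekf_step(u_f, P_f, d)

# The forecast covariance P_f settles at the steady-state Riccati value,
# while log_lik accumulates the log-likelihood of Eq. (C.6).
```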
References

[1] K. Yuen, Bayesian Methods for Structural Dynamics and Civil Engineering, Wiley, 2010. URL: https://books.google.ca/books?id=0iDSezuV9mIC.

[2] G. Box, G. Tiao, Bayesian inference in statistical analysis, volume 40, John Wiley & Sons, 2011.
[3] J. Beck, L. Katafygiotis, Updating Models and Their Uncertainties. I: Bayesian Statistical Framework, Journal of Engineering Mechanics 124 (1998) 455–461.

[4] M. Khalil, D. Poirel, A. Sarkar, Probabilistic parameter estimation of a fluttering aeroelastic system in the transitional Reynolds number regime, Journal of Sound and Vibration 332 (2013) 3670–3691.

[5] J. Beck, Bayesian system identification based on probability logic, Structural Control and Health Monitoring 17 (2010) 825–847.

[6] J. Beck, K. Yuen, Model selection using response measurements: Bayesian probabilistic approach, Journal of Engineering Mechanics 41 (2004) 192–203.

[7] M. Khalil, Bayesian Inference for Complex and Large-scale Engineering Systems, Ph.D. thesis, Carleton University, 2013.

[8] S. H. Cheung, J. Beck, Calculation of Posterior Probabilities for Bayesian Model Class Assessment and Averaging from Posterior Samples Based on Dynamic System Data, Computer-Aided Civil and Infrastructure Engineering 25 (2010) 304–321.

[9] R. Sandhu, Bayesian Model Selection and Parameter Estimation for a Nonlinear Fluid-structure Interaction Problem, Master's thesis, Carleton University, 2012.

[10] R. Sandhu, M. Khalil, A. Sarkar, D. Poirel, Bayesian model selection for nonlinear aeroelastic systems using wind-tunnel data, Computer Methods in Applied Mechanics and Engineering 282 (2014) 161–183.
[11] R. Sandhu, D. Poirel, C. Pettit, M. Khalil, A. Sarkar, Bayesian inference of nonlinear unsteady aerodynamics from aeroelastic limit cycle oscillations, Journal of Computational Physics 316 (2016) 534–557.

[12] R. E. Kass, A. E. Raftery, Bayes Factors, Journal of the American Statistical Association 90 (1995) 773–795.

[13] S. Konishi, G. Kitagawa, Information Criteria and Statistical Modeling (Springer Series in Statistics), Springer, 2007.

[14] H. Walach, Ockham's razor, in: Wiley Interdisciplinary Reviews: Computational Statistics, volume 2, 2007.

[15] C. Oh, J. Beck, M. Yamada, Bayesian learning using automatic relevance determination prior with an application to earthquake early warning, Journal of Engineering Mechanics 134 (2008) 1013–1020.

[16] D. J. MacKay, Bayesian interpolation, Neural Computation 4 (1991) 415–447.

[17] M. E. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res. 1 (2001) 211–244.

[18] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.

[19] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[20] G. Casella, Illustrating empirical Bayes methods, Chemometrics and Intelligent Laboratory Systems 16 (1992) 107–125.
[21] B. Carlin, T. Louis, Bayes and empirical Bayes methods for data analysis, Statistics and Computing 7 (1997) 153–154.

[22] Y. Li, C. Campbell, M. Tipping, Bayesian automatic relevance determination algorithms for classifying gene expression data, Bioinformatics 18 (2002) 1332–1339.

[23] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1994) 267–288.
52
[30] M. S. Grewal, A. P. Andrews, Kalman Filtering: Theory and Practice Using MATLAB, volume 3, John Wiley & Sons, Inc., 2001.

[31] M. Khalil, A. Sarkar, S. Adhikari, Tracking noisy limit cycle oscillation with nonlinear filters, Journal of Sound and Vibration 329 (2010) 150–170.

[32] N. J. Gordon, D. J. Salmond, A. F. M. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings F - Radar and Signal Processing 140 (1993) 107–113.

[33] M. Khalil, A. Sarkar, S. Adhikari, D. Poirel, The estimation of time-invariant parameters of noisy nonlinear oscillatory systems, Journal of Sound and Vibration 344 (2015) 81–100.

[34] L. Tierney, J. B. Kadane, Accurate approximations for posterior moments and marginal densities, Journal of the American Statistical Association 81 (1986) 82–86.

[35] S. Chib, I. Jeliazkov, Marginal likelihood from the Metropolis-Hastings output, Journal of the American Statistical Association 96 (2001) 270–281.

[36] S. Chib, Marginal likelihood from the Gibbs output, Journal of the American Statistical Association 90 (1995) 1313–1321.

[37] W. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1970) 97–109.

[38] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics 21 (1953) 1087–1092.
[39] J. Skilling, Nested sampling for general Bayesian computation, Bayesian Analysis 1 (2006) 833–860.

[40] N. Friel, A. N. Pettitt, Marginal likelihood estimation via power posteriors, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (2008) 589–607.

[41] R. Neal, Annealed importance sampling, Statistics and Computing 11 (2001) 125–139.

[42] M. A. Newton, A. E. Raftery, Approximate Bayesian Inference with the Weighted Likelihood Bootstrap, Journal of the Royal Statistical Society. Series B (Methodological) 56 (1994) 3–48.

[43] Q. Liu, D. A. Pierce, A note on Gauss-Hermite quadrature, Biometrika 81 (1994) 624–629.

[44] J. Ching, Y. Chen, Transitional Markov chain Monte Carlo method for Bayesian model updating, model class selection, and model averaging, Journal of Engineering Mechanics 133 (2007) 816–832.

[45] N. Friel, J. Wyse, Estimating the evidence – a review, ArXiv e-prints (2011).

[46] G. O. Roberts, A. Gelman, W. R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability 7 (1997) 110–120.

[47] S. Chib, E. Greenberg, Understanding the Metropolis-Hastings Algorithm, American Statistician 49 (1995) 327–335.
[48] H. Haario, M. Laine, A. Mira, E. Saksman, DRAM: efficient adaptive MCMC, Statistics and Computing 16 (2006) 339–354.

[49] R. Craiu, J. Rosenthal, C. Yang, Learn From Thy Neighbor: Parallel-Chain and Regional Adaptive MCMC, Journal of the American Statistical Association 104 (2009) 1454–1466.

[50] H. Haario, E. Saksman, J. Tamminen, An adaptive Metropolis algorithm, Bernoulli 7 (2001) 223–242.

[51] A. Mira, On Metropolis-Hastings algorithms with delayed rejection, Metron 59 (2001) 231–241.

[52] L. Tierney, A. Mira, Some adaptive Monte Carlo methods for Bayesian inference, Statistics in Medicine 18 (1999) 2507–2515.

[53] C. Sanderson, Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments, Technical Report, NICTA (2010) 1–16.

[54] T. Chan, G. Golub, R. LeVeque, Updating formulae and a pairwise algorithm for computing sample variances, Technical Report, Stanford University Department of Computer Science, 1979.

[55] S. P. Brooks, A. Gelman, General methods for monitoring convergence of iterative simulations, Journal of Computational and Graphical Statistics 7 (1998) 434–455.

[56] A. Gelman, D. B. Rubin, Inference from iterative simulation using multiple sequences, Statistical Science 7 (1992) 457–472.
[57] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, 2009.

[58] C. Loken, D. Gruner, L. Groer, R. Peltier, N. Bunn, M. Craig, T. Henriques, J. Dempsey, C.-H. Yu, J. Chen, L. J. Dursi, J. Chong, S. Northrup, J. Pinto, N. Knecht, R. V. Zon, SciNet: Lessons learned from building a power-efficient top-20 system and data centre, Journal of Physics: Conference Series 256 (2010) 012026.

[59] D. Poirel, Y. Harris, A. Benaissa, Self-sustained aeroelastic oscillations of a NACA0012 airfoil at low-to-moderate Reynolds numbers, Journal of Fluids and Structures 24 (2008) 700–719.

[60] C. Pettit, Uncertainty quantification in aeroelasticity: Recent results and research challenges, Journal of Aircraft 41 (2004) 1217–1229.

[61] P. Beran, C. Pettit, D. Millman, Uncertainty quantification of limit-cycle oscillations, Journal of Computational Physics 217 (2006) 217–247.

[62] M. Khalil, D. Poirel, A. Sarkar, Bayesian analysis of the flutter margin method in aeroelasticity, Journal of Sound and Vibration 384 (2016) 56–74.

[63] B. Danowsky, J. Chrstos, D. Klyde, C. Farhat, M. Brenner, Evaluation of aeroelastic uncertainty analysis methods, Journal of Aircraft 47 (2010) 1266–1273.

[64] P. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (1995) 711–732.