Journal of Statistical Planning and Inference 142 (2012) 3152–3166
On probabilistic parametric inference
Tomaž Podobnik a,b,*, Tomi Živko c
a Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
b Jožef Stefan Institute, Ljubljana, Slovenia
c Slovenian Nuclear Safety Administration, Ljubljana, Slovenia
Article info
Article history: Received 14 December 2009; received in revised form 22 May 2012; accepted 25 May 2012; available online 1 June 2012.
Keywords: Credible region; Consistency factor; Invariant model; Kalman filter; Inverse probability distribution; Prior probability distribution

Abstract
This paper formulates a theory of probabilistic parametric inference and explores the limits of its applicability. Unlike Bayesian statistical models, the system does not comprise prior probability distributions. Objectivity is imposed on the theory: a particular direct probability density should always result in the same posterior probability distribution. For calibrated posterior probability distributions it is possible to construct credible regions with posterior-probability content equal to the coverage of the regions, but the calibration is not generally preserved under marginalization. As an application of the theory, the paper also constructs a filter for linear Gauss–Markov stochastic processes with unspecified initial conditions. © 2012 Elsevier B.V. All rights reserved.
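1. Direct and inverse probability distributions

First, we define direct and inverse probability distributions.

Assumption 1. There are functions $\Theta : \Omega \to \mathbb{R}^m$, called a parameter, and $X : \Omega \to \mathbb{R}^n$, on a universal set $\Omega$.

Let $\mathcal{B}^m$ denote the Borel $\sigma$-algebra on a set $S_\Theta \subseteq \mathbb{R}^m$, called a parameter space, $\mathcal{B}^n$ denote the Borel algebra on (the state space) $\mathbb{R}^n$, and $\mathcal{B}^m \times \mathcal{B}^n$ denote the Borel algebra on $S_\Theta \times \mathbb{R}^n$.

Assumption 2. There is a set $\{(\Omega_{\Theta=\theta}, \Sigma_{\Theta=\theta}, P_{\Theta=\theta}) : \theta \in S_\Theta\}$, or $\{(\Omega_\theta, \Sigma_\theta, P_\theta) : \theta \in S_\Theta\}$ for short, of (abstract) probability spaces $(\Omega_\theta, \Sigma_\theta, P_\theta)$, with $\theta = \Theta(\omega)$ being a realization of the parameter (the image of an element $\omega$ of $\Omega$ under $\Theta$). Each of the spaces consists of a set $\Omega_\theta \equiv \{\omega \in \Omega : \Theta(\omega) = \theta\}$, of a $\sigma$-algebra $\Sigma_\theta$ on $\Omega_\theta$, and of a (unitary and countably additive) probability measure $P_\theta$ on $\Sigma_\theta$ (Kolmogorov, 1933, Chapter 1, Section 1, p. 2, and Chapter 2, Section 1, p. 13). In addition, for every $\theta \in S_\Theta$, the restriction $X|_{\Omega_\theta}$ of $X$ to $\Omega_\theta$ is $\Sigma_\theta$-measurable: $\{\omega \in \Omega_\theta : X(\omega) \le x\} \in \Sigma_\theta$, $\forall x \in \mathbb{R}^n$. The function $X|_{\Omega_\theta}$ is called a (sampling) random variable and $X(\omega)$ ($\omega \in \Omega_\theta$) is called a realization of the variable.

Definition 1 (Direct probability distribution). A model $P_{X;\Theta}(\cdot\,;\cdot) : \mathcal{B}^m \times \mathcal{B}^n \times S_\Theta \to [0,1]$ is defined by $P_{X;\Theta}(B;\theta) \equiv P_\theta[X^{-1}\{\pi_X(B_\theta)\}]$, where $B_\theta$ is the $\theta$-section of a set $B \in \mathcal{B}^m \times \mathcal{B}^n$ (the restriction of $B$ to $\theta \times \mathbb{R}^n$), $\pi_X(B_\theta)$ is the projection of $B_\theta$ to $\mathbb{R}^n$ ($B_\theta = \theta \times \pi_X(B_\theta)$), and $X^{-1}\{\pi_X(B_\theta)\}$ is the inverse image of $\pi_X(B_\theta)$ under $X$,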
* Corresponding author at: Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia. E-mail addresses: [email protected] (T. Podobnik), [email protected] (T. Živko).
0378-3758/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.jspi.2012.05.009
$X^{-1}\{\pi_X(B_\theta)\} = \{\omega \in \Omega_\theta : X(\omega) \in \pi_X(B_\theta)\} \in \Sigma_\theta$. The restriction $P_{X;\Theta}(\cdot\,;\theta)$ of $P_{X;\Theta}(\cdot\,;\cdot)$ to a fixed $\theta \in S_\Theta$ is called a direct probability distribution for $X$.

The function $P_{X;\Theta}(B;\theta)$ assigns a probability to a set $B \in \mathcal{B}^m \times \mathcal{B}^n$. From Definition 1 it follows that the probability $P_{X;\Theta}(B;\theta)$ of a set $B$ is uniquely determined by the probability $P_{X;\Theta}(B_\theta;\theta)$ of its $\theta$-section $B_\theta$, $P_{X;\Theta}(B;\theta) = P_{X;\Theta}(B_\theta;\theta)$, which motivates the following definition.

Definition 1 (Continued). The probability distribution $\tilde{P}_{X;\Theta}(\cdot\,;\theta)$ on $\mathcal{B}^n$ is defined by $\tilde{P}_{X;\Theta}(\pi_X(B_\theta);\theta) \equiv P_{X;\Theta}(B;\theta)$.

Remark 1. Because of their usual frequency interpretation, the direct probability distributions are also called sampling distributions, $X|_{\Omega_\theta}$ is called a sampling random variable, and the codomain of $X|_{\Omega_\theta}$ is called a sampling space. Until Section 4, however, we do not invoke interpretations of probability distributions and random variables.

The cumulative distribution function $F_{X;\Theta}(\cdot\,;\cdot) : \mathbb{R}^n \times S_\Theta \to [0,1]$ (or $F(x;\theta)$ for short) is defined as $F(x;\theta) \equiv P_{X;\Theta}(C;\theta)$, where $x = (x_i)_{i=1}^{n}$, $C_\theta = \theta \times \pi_X(C_\theta)$, and $\pi_X(C_\theta) = \times_{i=1}^{n} (-\infty, x_i]$, so that $X^{-1}\{\pi_X(C_\theta)\} = \{\omega \in \Omega_\theta : X(\omega) \le x\}$. We assume throughout that the models are identifiable: given $\theta \ne \theta'$, there are $x \in \mathbb{R}^n$ for which $F(x;\theta) \ne F(x;\theta')$.

An (absolutely) continuous model can be expressed as $P_{X;\Theta}(B;\theta) = \int_{\pi_X(B_\theta)} f_{X;\Theta}(x;\theta)\, d^n x$, where $f_{X;\Theta}(\cdot\,;\cdot) : \mathbb{R}^n \times S_\Theta \to \mathbb{R}_0^+$ (or $f(x;\theta)$ for short) is called a probability density function, and $\mathbb{R}_0^+ = \mathbb{R}^+ \cup \{0, \infty\}$; equivalently, $f(x;\theta) = \partial^n_{x_1 \ldots x_n} F(x;\theta)$. From the definitions of $P_{X;\Theta}$ and $f_{X;\Theta}$ it follows immediately that

$$P_\theta[X^{-1}\{\pi_X(B_\theta)\}] = \int_{\pi_X(B_\theta)} f_{X;\Theta}(x;\theta)\, d^n x. \tag{1}$$

Every probability density is normalized: $\int_{\mathbb{R}^n} f(x;\theta)\, d^n x = 1$, $\theta \in S_\Theta$. The set $S_X(\theta) = \{x \in \mathbb{R}^n : f(x;\theta) > 0\}$ is called the support of $f(x;\theta)$.
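The relations between the distribution function, the density, and the normalization can be checked numerically for a concrete scalar model. The sketch below assumes a normal density with location `mu` and scale `sigma` as the model; this choice, the helper names, and the integration settings are assumptions of the illustration and are not taken from the paper.

```python
import math

def normal_cdf(x, mu, sigma):
    """Distribution function F(x; mu, sigma) of the assumed normal model."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma):
    """Density f(x; mu, sigma) = phi((x - mu) / sigma) / sigma."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def trapezoid(func, lo, hi, steps=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

mu, sigma = 1.5, 2.0

# Normalization: the density integrates to one over (a wide truncation of) the real axis.
total_mass = trapezoid(lambda x: normal_pdf(x, mu, sigma),
                       mu - 12.0 * sigma, mu + 12.0 * sigma)

# f = dF/dx: a central difference of F approximates the density at x0.
x0, dx = 0.7, 1e-5
cdf_slope = (normal_cdf(x0 + dx, mu, sigma) - normal_cdf(x0 - dx, mu, sigma)) / (2.0 * dx)
slope_error = abs(cdf_slope - normal_pdf(x0, mu, sigma))
```

Here `total_mass` approximates the normalization integral and `slope_error` measures how closely the finite-difference slope of the distribution function matches the density.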
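Definition 2 (Restricted model). Let $P_{X;\Theta}(\cdot\,;\cdot)$ be a model and let $\Theta = (\Theta_1, \Theta_2) : \Omega \to \mathbb{R}^{m_1} \times \mathbb{R}^{m_2}$, so that $\theta = (\theta_1, \theta_2) \in S_{\Theta_1} \times S_{\Theta_2} \subseteq \mathbb{R}^{m_1} \times \mathbb{R}^{m_2}$, $m_1, m_2 \ge 1$. The restricted model $P_{X;\Theta|\theta_1}(\cdot\,;\cdot)$ is defined as the restriction of $P_{X;\Theta}(\cdot\,;\cdot)$ to $\mathcal{B}^m \times \mathcal{B}^n \times (\theta_1 \times S_{\Theta_2})$, and $P_{X;\Theta|\theta_2}(\cdot\,;\cdot)$ is defined as the restriction of $P_{X;\Theta}(\cdot\,;\cdot)$ to $\mathcal{B}^m \times \mathcal{B}^n \times (S_{\Theta_1} \times \theta_2)$.

The distribution function $F_{\theta_{1(2)}}(x;\theta_{2(1)})$ and the probability density function $f_{\theta_{1(2)}}(x;\theta_{2(1)})$ of the restricted model $P_{X;\Theta|\theta_{1(2)}}(\cdot\,;\cdot)$ coincide with the restrictions of $F(x;\theta)$ and $f(x;\theta)$ to $\mathbb{R}^n \times \theta_1 \times S_{\Theta_2}$ (to $\mathbb{R}^n \times S_{\Theta_1} \times \theta_2$).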
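Example 1 (Normal model). The distribution function $F(x;\mu,\sigma)$ of a location-scale model has the form $\Phi[(x-\mu)/\sigma]$, where $x \in \mathbb{R}$ is a realization of a scalar sampling variable $X$, $\mu$ and $\sigma$ are realizations of a location and a scale parameter, and $(\mu,\sigma) \in S_\Theta = \mathbb{R} \times \mathbb{R}^+$ (it is clear from the context whether $\sigma$ stands for a realization of a scale parameter, for a $\sigma$-algebra, or for $\sigma$-finiteness of a measure). The restricted distribution function $F_{\sigma=1}(x;\mu)$ has the form $\Phi(x-\mu)$. An example of a location-scale model is the normal model with the distribution function $F(x;\mu,\sigma) = [1 + \mathrm{erf}\{(x-\mu)/(\sqrt{2}\,\sigma)\}]/2$ and with the density function $f(x;\mu,\sigma) = \partial_x F(x;\mu,\sigma) = \sigma^{-1}\phi[(x-\mu)/\sigma]$, where $\phi(u) = (2\pi)^{-1/2}\exp(-u^2/2)$. The support of $f(x;\mu,\sigma)$ is the entire real axis.

Definition 3 (Marginal direct probability distribution). Let $P_{X;\Theta}(\cdot\,;\theta)$ be a direct probability distribution with the density function $f(x;\theta)$ and let $X = (X_a, X_b) : \Omega \to \mathbb{R}^{n_a} \times \mathbb{R}^{n_b}$, so that $x = (x_a, x_b) \in \mathbb{R}^{n_a} \times \mathbb{R}^{n_b}$, $n_a, n_b \ge 1$. Let, in addition, $\mathcal{C}_\theta(X_a)$ and $\mathcal{C}_\theta(X_b)$ be the cylinder sub-$\sigma$-algebras of $\mathcal{B}^n$, $\mathcal{C}_\theta(X_a) = \{\pi_X(B_\theta) \in \mathcal{B}^n : \pi_X(B_\theta) = \pi_{X_a}(B_\theta) \times \mathbb{R}^{n_b},\ \pi_{X_a}(B_\theta) \in \mathcal{B}^{n_a}\}$ and $\mathcal{C}_\theta(X_b) = \{\pi_X(B_\theta) \in \mathcal{B}^n : \pi_X(B_\theta) = \mathbb{R}^{n_a} \times \pi_{X_b}(B_\theta),\ \pi_{X_b}(B_\theta) \in \mathcal{B}^{n_b}\}$, where, for example, $\pi_{X_a}(B_\theta)$ is the projection of a set $B_\theta = \theta \times \pi_{X_a}(B_\theta) \times \mathbb{R}^{n_b} \subseteq \mathbb{R}^m \times \mathbb{R}^{n_a} \times \mathbb{R}^{n_b}$ to $\mathbb{R}^{n_a}$. The restriction $P_{X_{a(b)};\Theta}(\cdot\,;\theta)$ of $P_{X;\Theta}(\cdot\,;\theta)$ to $\mathcal{B}^m \times \mathcal{C}_\theta(X_{a(b)})$ is called the marginal direct probability distribution.

The marginal distribution and density functions are denoted by $F(x_{a(b)};\theta)$ and $f(x_{a(b)};\theta)$, which are abbreviations of $F_{X_{a(b)};\Theta}(\cdot\,;\cdot)$ and $f_{X_{a(b)};\Theta}(\cdot\,;\cdot)$, and $f(x_{a(b)};\theta) = \int_{\mathbb{R}^{n_{b(a)}}} f(x;\theta)\, d^{n_{b(a)}} x_{b(a)}$. When $F(x;\theta) = F(x_a;\theta)F(x_b;\theta)$ (when $f(x;\theta) = f(x_a;\theta)f(x_b;\theta)$) for all $\theta \in S_\Theta$, $X_a$ and $X_b$ are called independent random variables.

Assumption 3. There is a set $I_{X_b} \subseteq \mathbb{R}^{n_b}$ such that, for every $x_b \in I_{X_b}$, there is a probability space $(\Omega_{X_b=x_b}, \Sigma_{X_b=x_b}, P_{X_b=x_b})$, or $(\Omega_{x_b}, \Sigma_{x_b}, P_{x_b})$ for short, and

$$\int_{S_\Theta} f(x_b;\theta)\, d^m\theta > 0 \tag{2}$$

holds. Here, $\Omega_{x_b} = \{\omega \in \Omega : X_b(\omega) = x_b\}$ is the $X_b^{-1}(x_b)$-section of $\Omega$, $\Sigma_{x_b}$ is a $\sigma$-algebra on $\Omega_{x_b}$, and $P_{x_b}$ is a probability measure on $\Sigma_{x_b}$. In addition, for every $x_b \in I_{X_b}$, the restrictions $\Theta|_{\Omega_{x_b}}$ and $X_a|_{\Omega_{x_b}}$ of $\Theta$ and $X_a$ to $\Omega_{x_b}$ are $\Sigma_{x_b}$-measurable.

The probability spaces $(\Omega_{x_b}, \Sigma_{x_b}, P_{x_b})$, $x_b \in I_{X_b}$, together with the random variables $\Theta|_{\Omega_{x_b}}$ and $X_a|_{\Omega_{x_b}}$, define a function $P_{\Theta,X_a;X_b}(\cdot\,;\cdot) : \mathcal{B}^m \times \mathcal{B}^n \times I_{X_b} \to [0,1]$, $P_{\Theta,X_a;X_b}(B;x_b) \equiv P_{x_b}[(\Theta,X_a)^{-1}\{\pi_{\Theta,X_a}(B_{x_b})\}]$, where $B_{x_b} = \pi_{\Theta,X_a}(B_{x_b}) \times x_b$ is the $x_b$-section of a set $B \in \mathcal{B}^m \times \mathcal{B}^n$, and $(\Theta,X_a)^{-1}\{\pi_{\Theta,X_a}(B_{x_b})\} = \{\omega \in \Omega_{x_b} : (\Theta,X_a)(\omega) \in \pi_{\Theta,X_a}(B_{x_b})\}$. The function $P_{\Theta,X_a;X_b}(\cdot\,;\cdot)$, restricted to a fixed $x_b \in I_{X_b}$, is a probability distribution, denoted by $P_{\Theta,X_a;X_b}(\cdot\,;x_b)$. The density function of the distribution is denoted by $f(\theta,x_a;x_b)$, which is short for $f_{\Theta,X_a;X_b}(\cdot,\cdot\,;\cdot)$.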
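Definition 4 (Inverse probability distribution). Let $\mathcal{C}_{x_b}(\Theta) = \{\pi_{\Theta,X_a}(B_{x_b}) \in \mathcal{B}^m \times \mathcal{B}^{n_a} : \pi_{\Theta,X_a}(B_{x_b}) = \pi_\Theta(B_{x_b}) \times \mathbb{R}^{n_a},\ \pi_\Theta(B_{x_b}) \in \mathcal{B}^m\}$ be the cylinder sub-$\sigma$-algebra of $\mathcal{B}^m \times \mathcal{B}^{n_a}$, where $\pi_\Theta(B_{x_b})$ is the projection of $B_{x_b}$ to $\mathbb{R}^m$. The restriction $P_{\Theta;X_b}(\cdot\,;x_b)$ of $P_{\Theta,X_a;X_b}(\cdot\,;x_b)$ to $\mathcal{C}_{x_b}(\Theta) \times \mathcal{B}^{n_b}$ is called an inverse probability distribution.

The inverse distribution and density functions are denoted by $F(\theta;x_b)$ and $f(\theta;x_b)$ (by $F_{\Theta;X_b}(\cdot\,;\cdot)$ and $f_{\Theta;X_b}(\cdot\,;\cdot)$ in full notation), where $f(\theta;x_b) = \int_{\mathbb{R}^{n_a}} f(\theta,x_a;x_b)\, d^{n_a} x_a$. In analogy with $\tilde{P}_{X;\Theta}$, we also define $\tilde{P}_{\Theta;X_b}$.

Definition 4 (Continued). The probability distribution $\tilde{P}_{\Theta;X_b}(\cdot\,;x_b)$ on $\mathcal{B}^m$ is defined by $\tilde{P}_{\Theta;X_b}(\pi_\Theta(B_{x_b});x_b) = P_{\Theta;X_b}(B;x_b)$, where $B \in \mathcal{C}_{x_b}(\Theta) \times \mathcal{B}^{n_b}$.

The definition of the functions $P_{\Theta,X_b;X_a}$, $f_{\Theta,X_b;X_a}$, $P_{\Theta;X_a}$, $F_{\Theta;X_a}$, $f_{\Theta;X_a}$, and $\tilde{P}_{\Theta;X_a}$ is analogous to that of $P_{\Theta,X_a;X_b}$, $f_{\Theta,X_a;X_b}$, $P_{\Theta;X_b}$, $F_{\Theta;X_b}$, $f_{\Theta;X_b}$, and $\tilde{P}_{\Theta;X_b}$, respectively.

Remark 2. Whenever the abbreviated notation is used for distribution and density functions, the arguments of the functions also denote the functions themselves, and so the (marginal) distribution function $F(x_{a(b)};\theta)$ is not necessarily the same function as $F(\theta;x_{a(b)})$, $f(x_{a(b)};\theta)$ is not necessarily the same function as $f(\theta;x_{a(b)})$, and $f(x_a;\theta)$ is not necessarily the same function as $f(x_b;\theta)$.

Consider a model with the density function $f(x;\theta)$, let $X = (X_a, X_b) : \Omega \to \mathbb{R}^{n_a} \times \mathbb{R}^{n_b}$, let $f(x_{a(b)};\theta)$ be the marginal direct density, let $X_a = (X_1, X_2) : \Omega \to \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$, and let $\Theta = (\Theta_1, \Theta_2) : \Omega \to \mathbb{R}^{m_1} \times \mathbb{R}^{m_2}$, so that $x_a = (x_1, x_2) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ and $\theta = (\theta_1, \theta_2) \in S_{\Theta_1} \times S_{\Theta_2}$, $n_1, n_2, m_1, m_2 \ge 1$.

Assumption 4. There is a set $I_{\Theta_1,X_2} \subseteq S_{\Theta_1} \times \mathbb{R}^{n_2}$ such that, for every $(\theta_1, x_2) \in I_{\Theta_1,X_2}$, there is a probability space $(\Omega_{\theta_1,x_2}, \Sigma_{\theta_1,x_2}, P_{\theta_1,x_2})$, which is short for $(\Omega_{\Theta_1=\theta_1,X_2=x_2}, \Sigma_{\Theta_1=\theta_1,X_2=x_2}, P_{\Theta_1=\theta_1,X_2=x_2})$, and

$$\int_{S_{\Theta_2}} f_{\theta_1}(x_2;\theta_2)\, d^{m_2}\theta_2 > 0 \tag{3}$$

holds. Here, $\Sigma_{\theta_1,x_2}$ is a $\sigma$-algebra on $\Omega_{\theta_1,x_2} = \{\omega \in \Omega : \Theta_1(\omega) = \theta_1 \wedge X_2(\omega) = x_2\}$, $P_{\theta_1,x_2}$ is a probability measure on $\Sigma_{\theta_1,x_2}$, and $f_{\theta_1}(x_2;\theta_2) = \int_{\mathbb{R}^{n_1}} f_{\theta_1}(x_a;\theta_2)\, d^{n_1} x_1$, where $f_{\theta_1}(x_a;\theta_2)$ is the restriction of $f(x_a;\theta)$ to $\mathbb{R}^{n_a} \times (\theta_1 \times S_{\Theta_2})$. In addition, for every $(\theta_1, x_2) \in I_{\Theta_1,X_2}$, the restrictions $\Theta_2|_{\Omega_{\theta_1,x_2}}$ and $X_1|_{\Omega_{\theta_1,x_2}}$ of $\Theta_2$ and $X_1$ to $\Omega_{\theta_1,x_2}$ are $\Sigma_{\theta_1,x_2}$-measurable.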
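A probability space $(\Omega_{\theta_1,x_2}, \Sigma_{\theta_1,x_2}, P_{\theta_1,x_2})$ and the random variables $\Theta_2|_{\Omega_{\theta_1,x_2}}$ and $X_1|_{\Omega_{\theta_1,x_2}}$ together define a probability distribution $P_{\Theta_2,X_1;\Theta_1,X_2}(\cdot\,;\theta_1,x_2)$ on the sub-$\sigma$-algebra $\mathcal{B}^m \times \mathcal{C}_\theta(X_a)$ of $\mathcal{B}^m \times \mathcal{B}^n$, $P_{\Theta_2,X_1;\Theta_1,X_2}(B;\theta_1,x_2) \equiv P_{\theta_1,x_2}\{(\Theta_2,X_1)^{-1}[\pi_{\Theta_2,X_1}(B_{\theta_1,x_2})]\}$, where $\pi_{\Theta_2,X_1}(B_{\theta_1,x_2})$ is the projection of the $(\theta_1,x_2)$-section $B_{\theta_1,x_2}$ of a set $B \in \mathcal{B}^m \times \mathcal{C}_\theta(X_a)$ to $\mathbb{R}^{m_2} \times \mathbb{R}^{n_1}$, $B_{\theta_1,x_2} = \theta_1 \times \pi_{\Theta_2,X_1}(B_{\theta_1,x_2}) \times x_2 \times \mathbb{R}^{n_b}$. The corresponding density function is denoted by $f_{\Theta_2,X_1;\Theta_1,X_2}(\cdot,\cdot\,;\cdot,\cdot)$, which is often abbreviated to $f(\theta_2,x_1;\theta_1,x_2)$.

Definition 5 (Inverse probability distribution of the restricted model). Let $\mathcal{C}_\theta(X_2) = \{\pi_X(B_\theta) \in \mathcal{B}^n : \pi_X(B_\theta) = \mathbb{R}^{n_1} \times \pi_{X_2}(B_\theta) \times \mathbb{R}^{n_b}\}$ be a sub-$\sigma$-algebra of $\mathcal{C}_\theta(X_a)$. The restriction $P_{\Theta_2;\Theta_1,X_2}(\cdot\,;\theta_1,x_2)$ of $P_{\Theta_2,X_1;\Theta_1,X_2}(\cdot\,;\theta_1,x_2)$ to $\mathcal{B}^m \times \mathcal{C}_\theta(X_2)$ is called the inverse probability distribution of the restricted model $P_{X_2;\Theta|\theta_1}$ (see also Remark 6, below).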
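The distribution and the density functions of the restricted inverse distribution are denoted by $F_{\Theta_2;\Theta_1,X_2}(\cdot\,;\cdot,\cdot)$ and $f_{\Theta_2;\Theta_1,X_2}(\cdot\,;\cdot,\cdot)$ (by $F(\theta_2;\theta_1,x_2)$ and $f(\theta_2;\theta_1,x_2)$ for short), where $f(\theta_2;\theta_1,x_2) = \int_{\mathbb{R}^{n_1}} f(\theta_2,x_1;\theta_1,x_2)\, d^{n_1} x_1$. The definition of the functions $P_{\Theta_2,X_2;\Theta_1,X_1}$, $f_{\Theta_2,X_2;\Theta_1,X_1}$, $P_{\Theta_2;\Theta_1,X_1}$, $F_{\Theta_2;\Theta_1,X_1}$, and $f_{\Theta_2;\Theta_1,X_1}$ is analogous to that of $P_{\Theta_2,X_1;\Theta_1,X_2}$, $f_{\Theta_2,X_1;\Theta_1,X_2}$, $P_{\Theta_2;\Theta_1,X_2}$, $F_{\Theta_2;\Theta_1,X_2}$, and $f_{\Theta_2;\Theta_1,X_2}$, respectively.

Remark 3. The integrals (2) and (3) need not be finite. The reasons for requiring the integrals to be strictly positive will become clear in the context of Proposition 1.

Next, we define the conditional direct probability distribution.

Definition 6 (Conditional direct probability distribution). Suppose there is $f(x;\theta)$, let $X = (X_a, X_b) : \Omega \to \mathbb{R}^{n_a} \times \mathbb{R}^{n_b}$, let $x = (x_a, x_b)$, and let $S_{X_b}(\theta) = \{x_b \in \mathbb{R}^{n_b} : f(x_b;\theta) > 0\}$. Let, in addition, $B, D \in \mathcal{B}^m \times \mathcal{B}^n$, $B_\theta = \theta \times \pi_X(B_\theta)$, $D_\theta = \theta \times \pi_X(D_\theta)$, $\pi_X(D_\theta) \in \mathcal{C}_\theta(X_b)$ (i.e., $\pi_X(D_\theta) = \mathbb{R}^{n_a} \times \pi_{X_b}(D_\theta)$), and $\pi_{X_b}(D_\theta) = \times_{i=1}^{n_b} [x_{b,i}, x_{b,i} + h)$, where $x_{b,i}$ is the $i$-th component of $x_b \in S_{X_b}$, $h > 0$, and $P_\theta[X^{-1}\{\pi_X(D_\theta)\}] > 0$.

First, we define the conditional probability of $X^{-1}\{\pi_X(B_\theta)\}$ given $X^{-1}\{\pi_X(D_\theta)\}$, $P_\theta[X^{-1}\{\pi_X(B_\theta)\} \,|\, X^{-1}\{\pi_X(D_\theta)\}] \equiv P_\theta[X^{-1}\{\pi_X(B_\theta \cap D_\theta)\}] / P_\theta[X^{-1}\{\pi_X(D_\theta)\}] = P_{X;\Theta}(B \cap D;\theta) / P_{X;\Theta}(D;\theta)$.

Second, by applying l'Hôpital's rule we define

$$P_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}] \equiv \lim_{h \to 0} P_\theta[X^{-1}\{\pi_X(B_\theta)\} \,|\, X^{-1}\{\pi_X(D_\theta)\}] = \int_{\pi_{X_a}(B_{\theta,x_b})} \frac{f(x;\theta)}{f(x_b;\theta)}\, d^{n_a} x_a$$

for $x_b \in S_{X_b}$, and $P_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}] \equiv 0$ for $x_b \notin S_{X_b}$, where $\pi_{X_a}(B_{\theta,x_b})$ is the projection of the $(\theta,x_b)$-section $B_{\theta,x_b}$ of $B$ to $\mathbb{R}^{n_a}$.

Third, we define the conditional model $P_{X_a|X_b;\Theta}(\cdot\,|\,\cdot\,;\cdot) : \mathcal{B}^m \times \mathcal{B}^n \times \mathbb{R}^{n_b} \times S_\Theta \to [0,1]$, $P_{X_a|X_b;\Theta}(B|x_b;\theta) \equiv P_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}]$.

Finally, we define the conditional probability density function $f_{X_a|X_b;\Theta}(\cdot\,|\,\cdot\,;\cdot) : \mathbb{R}^{n_a} \times \mathbb{R}^{n_b} \times S_\Theta \to \mathbb{R}_0^+$ (or $f(x_a|x_b;\theta)$), $P_{X_a|X_b;\Theta}(B|x_b;\theta) = \int_{\pi_{X_a}(B_{\theta,x_b})} f(x_a|x_b;\theta)\, d^{n_a} x_a$.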
From the above definition of $f(x_a|x_b;\theta)$ it follows that for $x_b \in S_{X_b}$ the (joint) direct probability density function $f(x;\theta)$ decomposes by the product rule: $f(x;\theta) = f(x_a|x_b;\theta) f(x_b;\theta)$. Similarly, for $x_a$ for which $f(x_a;\theta) > 0$, $f(x;\theta) = f(x_b|x_a;\theta) f(x_a;\theta)$. The definition of the inverse conditional probability distribution is analogous to the definition of the direct one, so that the product rule applies also to the inverse density, $f(\theta;x_i) = f(\theta_{1(2)}|\theta_{2(1)};x_i) f(\theta_{2(1)};x_i)$, $i = a, b$, and $f(\theta_{1(2)};x_i) = \int_{\mathbb{R}^{m_{2(1)}}} f(\theta;x_i)\, d^{m_{2(1)}}\theta_{2(1)} > 0$. Similarly,

$f(\theta,x_{a(b)};x_{b(a)}) = f(\theta|x_{a(b)};x_{b(a)}) f(x_{a(b)};x_{b(a)}) = f(x_{a(b)}|\theta;x_{b(a)}) f(\theta;x_{b(a)})$,

$f(\theta;x_{b(a)}) > 0$ and $f(x_{a(b)};x_{b(a)}) = \int_{S_\Theta} f(\theta,x_{a(b)};x_{b(a)})\, d^m\theta > 0$, and

$f(\theta_2,x_{1(2)};\theta_1,x_{2(1)}) = f(\theta_2|x_{1(2)};\theta_1,x_{2(1)}) f(x_{1(2)};\theta_1,x_{2(1)}) = f(x_{1(2)}|\theta_2;\theta_1,x_{2(1)}) f(\theta_2;\theta_1,x_{2(1)})$,

$f(\theta_2;\theta_1,x_{2(1)}) = \int_{\mathbb{R}^{n_{1(2)}}} f(\theta_2,x_{1(2)};\theta_1,x_{2(1)})\, d^{n_{1(2)}} x_{1(2)} > 0$ and $f(x_{1(2)};\theta_1,x_{2(1)}) = \int_{\mathbb{R}^{m_2}} f(\theta_2,x_{1(2)};\theta_1,x_{2(1)})\, d^{m_2}\theta_2 > 0$. The product rules imply

$$f(\theta|x_{a(b)};x_{b(a)}) = \frac{f(\theta;x_{b(a)})\, f(x_{a(b)}|\theta;x_{b(a)})}{f(x_{a(b)};x_{b(a)})} \tag{4}$$

and

$$f(\theta_2|x_{1(2)};\theta_1,x_{2(1)}) = \frac{f(\theta_2;\theta_1,x_{2(1)})\, f(x_{1(2)}|\theta_2;\theta_1,x_{2(1)})}{f(x_{1(2)};\theta_1,x_{2(1)})}. \tag{5}$$
Remark 4 (Kolmogorov concept of conditioning). It is well known that the definitions of $P_{\theta,x_b}$, $P_{X_a|X_b;\Theta}$, and $f_{X_a|X_b;\Theta}$ (Definition 6) depend on the direction in which the limiting set $\lim_{h \to 0} \pi_X(D_\theta)$ is calculated (Kolmogorov, 1933, Chapter 5, Section 2, pp. 44–45; Rao, 1993, Section 1.4, p. 15, and Section 3.2, pp. 65–66); that is, they depend on the sub-$\sigma$-algebra $\mathcal{C}_\theta(X_b)$ to which the set $\pi_X(D_\theta)$ belongs. The conditional model $P_{X_a|X_b;\Theta}$ solves the system of equations

$$\tilde{P}_{X;\Theta}(\pi_X(B_\theta \cap C_\theta);\theta) = \int_{\pi_X(C_\theta)} P_{X_a|X_b;\Theta}(B|\pi_{X_b}(x);\theta)\, \tilde{P}_{X;\Theta}|_{\mathcal{C}_\theta(X_b)}(d^n x;\theta), \tag{6}$$

$B, C \in \mathcal{B}^m \times \mathcal{B}^n$ and $\pi_X(C_\theta) \in \mathcal{C}_\theta(X_b) \subseteq \mathcal{B}^n$, where $\pi_X(B_\theta)$ and $\pi_X(C_\theta)$ are the projections of the $\theta$-sections of the sets $B$ and $C$ to $\mathbb{R}^n$, $B_\theta = \theta \times \pi_X(B_\theta)$ and $C_\theta = \theta \times \pi_X(C_\theta)$, $\tilde{P}_{X;\Theta}|_{\mathcal{C}_\theta(X_b)}$ is the restriction of $\tilde{P}_{X;\Theta}$ to $\mathcal{C}_\theta(X_b)$, and $\pi_{X_b}(x) = x_b$ (Rao, 1993, Section 2.4, pp. 51–54). The system of Eq. (6) defines the conditional probability distribution within the Kolmogorov concept of conditioning (Kolmogorov, 1933, Chapter 5, Section 1, pp. 41–44): the Radon–Nikodym Theorem assures that given the sub-$\sigma$-algebra $\mathcal{C}_\theta(X_b)$ of $\mathcal{B}^n$, (6) defines the conditional probability distribution $P_{X_a|X_b;\Theta}(\cdot\,|\,\cdot\,;\theta)$ uniquely $\tilde{P}_{X;\Theta}|_{\mathcal{C}_\theta(X_b)}(\cdot\,;\theta)$-almost everywhere (i.e., everywhere except possibly on a set $\pi_X(C_\theta) \in \mathcal{C}_\theta(X_b)$ with $\tilde{P}_{X;\Theta}|_{\mathcal{C}_\theta(X_b)}(\pi_X(C_\theta);\theta) = 0$). $P_{X_a|X_b;\Theta}(\cdot\,|\,\cdot\,;\theta)$ is therefore called a version of the Kolmogorov conditional probability distribution, given $\mathcal{C}_\theta(X_b)$, and so the conditional density $f(x_a|x_b;\theta)$ (Definition 6) is included in the Kolmogorov concept.

$P_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}]$ is a probability of a set $X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\} = \{\omega \in \Omega_{\theta,x_b} : X_a(\omega) \in \pi_{X_a}(B_{\theta,x_b})\}$, defined from $P_{X;\Theta}$ (Definition 6), and $P_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}] = \int_{\pi_{X_a}(B_{\theta,x_b})} f(x_a|x_b;\theta)\, d^{n_a} x_a$ (which is analogous to Eq. (1)). In the same way, from $P_{\Theta,X_a;X_b}$, we obtain another probability, $P'_{\theta,x_b}$, of the same set $X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}$, and $P'_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}] = \int_{\pi_{X_a}(B_{\theta,x_b})} f(x_a|\theta;x_b)\, d^{n_a} x_a$.

Assumption 5. For every $B \in \mathcal{B}^m \times \mathcal{B}^n$, $\theta \in S_\Theta$, and $x_b \in \mathbb{R}^{n_b}$, the probability of the set $X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}$ is unique, $P_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}] = P'_{\theta,x_b}[X_a^{-1}\{\pi_{X_a}(B_{\theta,x_b})\}]$, which is equivalent to assuming $f(x_a|x_b;\theta) = f(x_a|\theta;x_b)$.

For analogous reasons we assume $f(x_b|x_a;\theta) = f(x_b|\theta;x_a)$, $f(\theta|x_a;x_b) = f(\theta|x_b;x_a)$, $f(\theta_2|x_1;\theta_1,x_2) = f(\theta_2|x_2;\theta_1,x_1) = f(\theta_2|\theta_1;x_1,x_2)$ ($= f(\theta_2|\theta_1;x_a)$), and $f(x_{1(2)}|\theta_2;\theta_1,x_{2(1)}) = f(x_{1(2)}|x_{2(1)};\theta_1,\theta_2)$ ($= f(x_{1(2)}|x_{2(1)};\theta)$). Under Assumption 5, Eqs. (4) and (5) reduce to

$$f(\theta|x_a;x_b) = \frac{f(\theta;x_a)\, f(x_b|x_a;\theta)}{f(x_b;x_a)} = \frac{f(\theta;x_b)\, f(x_a|x_b;\theta)}{f(x_a;x_b)} \tag{7}$$

and

$$f(\theta_2|\theta_1;x_a) = \frac{f(\theta_2;\theta_1,x_1)\, f(x_2|x_1;\theta)}{f(x_2;\theta_1,x_1)} = \frac{f(\theta_2;\theta_1,x_2)\, f(x_1|x_2;\theta)}{f(x_1;\theta_1,x_2)}. \tag{8}$$
Remark 5 (Consistency of probability manipulations). Less formally, we may first construct the inverse density $f(\theta;x_a)$ on the basis of the realization $x_a$ and then multiply it by $f(x_b|x_a;\theta)$ that depends also on the realization $x_b$, or we may first construct the inverse density $f(\theta;x_b)$ on the basis of the realization $x_b$ and then multiply it by $f(x_a|x_b;\theta)$ that depends also on $x_a$, but the final result $f(\theta|x_a;x_b)$ should not depend on the order of taking the realizations into account. O'Hagan (1994, Section 3.5, p. 67) calls this property the consistency of probability manipulations.

Assumption 6. The densities $f(x_a;\theta)$ and $f(x_b;\theta)$ are the same function, and the densities $f(x_1;\theta)$ and $f(x_2;\theta)$ are the same function (Remark 2).
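The order-independence described in Remark 5 can be illustrated on a grid. The sketch below is only an illustration under stated assumptions: a normal location model with unit scale, two independent scalar realizations, and two hypothetical choices of the factor that multiplies the direct density, one that does not depend on the realization and one that does. None of the function names or numerical values come from the paper.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Direct density of the assumed normal location model."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(values, step):
    """Rescale grid values so that their Riemann sum equals one."""
    total = sum(values) * step
    return [v / total for v in values]

def two_step_posterior(first, second, grid, step, factor):
    """Inverse density from `first`, then multiplied by the direct density of `second`.

    The realizations are assumed independent, so f(second | first; theta)
    reduces to f(second; theta).
    """
    inverse = normalize([factor(t, first) * normal_pdf(first, t) for t in grid], step)
    return normalize([p * normal_pdf(second, t) for p, t in zip(inverse, grid)], step)

def max_abs_difference(p, q):
    return max(abs(u - v) for u, v in zip(p, q))

def constant_factor(theta, x):
    """Hypothetical factor that depends on the parameter only (here: constant)."""
    return 1.0

def data_dependent_weight(theta, x):
    """Hypothetical weight that also depends on the realization x."""
    return 1.0 / (1.0 + (theta - x) ** 2)

step = 0.01
grid = [-10.0 + i * step for i in range(2201)]  # parameter grid on [-10, 12]
x_a, x_b = 0.3, 1.7

gap_constant = max_abs_difference(
    two_step_posterior(x_a, x_b, grid, step, constant_factor),
    two_step_posterior(x_b, x_a, grid, step, constant_factor))

gap_weighted = max_abs_difference(
    two_step_posterior(x_a, x_b, grid, step, data_dependent_weight),
    two_step_posterior(x_b, x_a, grid, step, data_dependent_weight))
```

With the realization-independent factor the two orders agree up to rounding (`gap_constant`), whereas the realization-dependent weight gives visibly different results (`gap_weighted`).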
In Eqs. (7) and (8), the predictive densities $f(x_{a(b)};x_{b(a)})$ and $f_{\theta_1}(x_{1(2)};x_{2(1)})$ are determined from the normalization of the posterior densities $f(\theta|x_a;x_b)$ and $f(\theta_2|\theta_1;x_a)$, whereas the general form of the densities $f(\theta;x_{a(b)})$ and $f(\theta_2;\theta_1,x_{1(2)})$ is given as follows.

Proposition 1. Eqs. (7) and (8) imply that the densities $f(\theta;x_{a(b)})$ and $f(\theta_2;\theta_1,x_{1(2)})$ factor as

$$f(\theta;x_{a(b)}) = \frac{\zeta_\Theta(\theta)}{\eta_{X_a}(x_{a(b)})}\, f(x_{a(b)};\theta), \tag{9}$$

$$f(\theta_2;\theta_1,x_{1(2)}) = \frac{\zeta_{\Theta|\theta_1}(\theta_2)}{\eta_{X_1}(x_{1(2)})}\, f_{\theta_1}(x_{1(2)};\theta_2). \tag{10}$$

The functions $\zeta_\Theta(\theta)$ and $\zeta_{\Theta|\theta_1}(\theta_2)$ (also abbreviated to $\zeta(\theta)$ and $\zeta_{\theta_1}(\theta_2)$) are called consistency factors.
The proofs of propositions are postponed to Appendix B.

Remark 6. The density of the inverse probability distribution of the restricted model $f(\theta_2;\theta_1,x_{1(2)})$ is proportional to the (marginal) density of the restricted model $f_{\theta_1}(x_{1(2)};\theta_2)$, thus the name of $f(\theta_2;\theta_1,x_{1(2)})$.

The normalization factors $\eta_{X_a}(x_{a(b)})$ and $\eta_{X_1}(x_{1(2)})$ (or $\eta(x_{a(b)})$ and $\eta(x_{1(2)})$) are determined by normalizing $f(\theta;x_{a(b)})$ and $f(\theta_2;\theta_1,x_{1(2)})$ (the non-vanishing integrals (2) and (3) are necessary conditions for $f(\theta;x_{a(b)})$ and $f(\theta_2;\theta_1,x_{1(2)})$ to be normalizable). The consistency factors $\zeta(\theta)$ and $\zeta_{\theta_1}(\theta_2)$ can only be determined up to factors (multipliers), say $\chi$ and $\chi_{\theta_1}$. For example, multiplying $\zeta(\theta)$ by $\chi$ results in multiplying $\eta(x_{a(b)})$ by the same factor, and the factors cancel in the ratio $\zeta(\theta)/\eta(x_{a(b)})$. In addition, a consistency factor cannot switch sign. With no loss of generality we can therefore assume $\zeta(\theta), \zeta_{\theta_1}(\theta_2) \ge 0$.

Corollary 1. From $f(\theta_2|\theta_1;x_a) = f(\theta;x_a)/f(\theta_1;x_a)$, $\theta = (\theta_1,\theta_2)$, and from Eq. (9) it follows

$$f(\theta_2|\theta_1;x_a) = \frac{\zeta(\theta)\, f(x_a;\theta)}{\int_{S_{\Theta_2}} \zeta(\theta)\, f(x_a;\theta)\, d^{m_2}\theta_2},$$

whereas from $f(x_a;\theta) = f(x_2|x_1;\theta) f(x_1;\theta)$, $x_a = (x_1,x_2)$, and from Eqs. (8) and (10) it follows

$$f(\theta_2|\theta_1;x_a) = \frac{\zeta_{\theta_1}(\theta_2)\, f(x_a;\theta)}{\int_{S_{\Theta_2}} \zeta_{\theta_1}(\theta_2)\, f(x_a;\theta)\, d^{m_2}\theta_2},$$

so that

$$\zeta(\theta) = \zeta_{\theta_1}(\theta_2)\, \xi(\theta_1), \tag{11}$$

where $\xi(\theta_1) = \int_{S_{\Theta_2}} \zeta(\theta) f(x_a;\theta)\, d^{m_2}\theta_2 \,/ \int_{S_{\Theta_2}} \zeta_{\theta_1}(\theta_2) f(x_a;\theta)\, d^{m_2}\theta_2$.
For discrete models, $f(\theta;x_{a(b)})$ and $f(\theta_2;\theta_1,x_{1(2)})$ are obtained by replacing the densities $f(x_{a(b)};\theta)$ and $f_{\theta_1}(x_{1(2)};\theta_2)$ in (9) and (10) by the appropriate probability mass functions.

Remark 7 (Bayesian models). A Bayesian statistical model equips the universal set $\Omega$ from Assumption 1 with a $\sigma$-algebra $\Sigma$ on $\Omega$ and with a probability $P$ on $\Sigma$, and imposes $\Sigma$-measurability on $\Theta$ and $X_{a(b)}$ (see, for example, Florens et al., 1990, Section 1.2, pp. 26–36). The probability space $(\Omega, \Sigma, P)$ and the random variables $\Theta$ and $X_{a(b)}$ give rise to a joint probability distribution $P_B(C) = P[(\Theta,X_{a(b)})^{-1}(C)]$, $C \in \mathcal{B}^m \times \mathcal{B}^{n_{a(b)}}$ and $(\Theta,X_{a(b)})^{-1}(C) = \{\omega \in \Omega : (\Theta,X_{a(b)})(\omega) \in C\}$. For continuous $P_B$ with density $f_B(\theta,x_{a(b)})$, $f_B(\theta) \equiv \int_{\mathbb{R}^{n_{a(b)}}} f_B(\theta,x_{a(b)})\, d^{n_{a(b)}} x_{a(b)}$ is the density of the prior distribution. For a non-vanishing prior density, $f_B(x_{a(b)}|\theta) = f_B(\theta,x_{a(b)})/f_B(\theta)$ holds (throughout this paper, index $B$ denotes the Bayesian probability distributions and probability density functions). For non-vanishing density $f_B(x_{a(b)}) = \int_{\mathbb{R}^m} f_B(\theta,x_{a(b)})\, d^m\theta$ of the predictive probability distribution, the Bayesian posterior density $f_B(\theta|x_{a(b)})$ is equal to $f_B(\theta,x_{a(b)})/f_B(x_{a(b)})$, implying Bayes' Theorem (Florens et al., 1990, Section 1.2.2, p. 30, Eq. (1.2.11); O'Hagan, 1994, Section 1.9, p. 4, Eq. (1.6))

$$f_B(\theta|x_{a(b)}) = \frac{f_B(\theta)\, f_B(x_{a(b)}|\theta)}{f_B(x_{a(b)})}. \tag{12}$$

In contrast to Bayesian statistical models, however, we do not assume the unconditional probability distributions that correspond to $f_B(\theta,x_{a(b)})$, $f_B(\theta)$, and $f_B(x_{a(b)})$: while $f_B(\theta)$ and $f_B(x_{a(b)})$ in Bayes' Theorem are (marginal) probability densities with $\int_{\mathbb{R}^m} f_B(\theta)\, d^m\theta = \int_{\mathbb{R}^{n_{a(b)}}} f_B(x_{a(b)})\, d^{n_{a(b)}} x_{a(b)} = 1$, neither $\zeta(\theta)$ nor $\eta(x_{a(b)})$ in (9) (and neither $\zeta_{\theta_1}(\theta_2)$ nor $\eta(x_{1(2)})$ in (10)) need be integrable (see also Sections 2.2 and 2.4). Our primary notions, the direct and the inverse probability distributions (with densities $f(x_{a(b)};\theta)$ and $f(\theta;x_{a(b)})$), are analogous to conditional probability distributions (with densities $f_B(x_{a(b)}|\theta)$ and $f_B(\theta|x_{a(b)})$) in Bayesian settings. That is, all our probabilities are defined on ($\sigma$-algebras on) restrictions of $\Omega$ so that, from the Bayesian perspective, all our probability distributions are conditional. This is in agreement with the maxim that in practice every probability distribution is based on prior information and assumptions, and should therefore be conditional (Rényi, 1970, Section 2.1, pp. 33–34 and Section 2.4, p. 56).
Remark 8 (Improper priors). Often (see, for example, Robert, 2001, Section 1.5, p. 27; Jaynes, 2003, Section 15.12, p. 478; DeGroot, 2004, Section 10.1, p. 191; Ghosh et al., 2006, Section 2.6, pp. 40–41), improper (non-integrable) prior densities $f_B(\theta)$ are introduced as limits of sequences $\{f_B^{(i)}(\theta)\}_{i=1}^{\infty}$ of proper priors $f_B^{(i)}(\theta)$. The concept, however, lacks mathematical precision because the limit of $\{f_B^{(i)}(\theta)\}_{i=1}^{\infty}$ need not exist (the sequence $\{f_B^{(i)}(\theta)\}_{i=1}^{\infty}$ need not converge to a measure; Kadane et al., 1986, Section 5, p. 64).

Remark 9 (Data dependent priors). Like the proportionality factor between $f_B(\theta|x_{a(b)})$ and $f_B(x_{a(b)}|\theta)$ in Bayes' Theorem (12), the factor between the inverse density $f(\theta;x_{a(b)})$ and the direct density $f(x_{a(b)};\theta)$ in Eq. (9) decomposes into a function of (the realization of) the inferred parameter and a function of (the realization of) the sampling random variable. There is, on the other hand, no such decomposition in the general fiducial inference (Fisher, 1930), in the empirical Bayesian analysis (Robbins, 1956, 1964), and in all other systems and methods of probabilistic parametric inference with so-called data-dependent priors (see, for example, Box and Cox, 1964; Lindley, 1965, Section 5.2, pp. 18–19; Akaike, 1980; Wasserman, 2000; Fraser et al., 2010). According to Proposition 1, the decomposition is a necessary condition for consistency of probability manipulations (Eqs. (7) and (8) and Remark 5). This means that the methods with data-dependent priors necessarily lack the consistency: there are sets to which these methods assign non-unique probabilities.

The system of probabilistic parametric inference, based on Assumptions 1–6, is in complete accordance with a generalized Kolmogorov setting, based on the following two assumptions.

Assumption (Kolmogorov 1). There is a $\sigma$-algebra $\Sigma$ on the universal set $\Omega$ and a (not necessarily finite) measure $M$ on $\Sigma$. In addition, the functions $\Theta$ and $X = (X_a, X_b)$ on $\Omega$ are $\Sigma$-measurable.

A function (distribution) $M_{\Theta,X} : \mathcal{C}(\Theta,X_a) \to \mathbb{R}_0^+$ is defined by $M_{\Theta,X_a}(B) \equiv M[(\Theta,X_a)^{-1}\{\pi_{\Theta,X_a}(B)\}]$, where $\mathcal{C}(\Theta,X_a) = \{B \in \mathcal{B}^m \times \mathcal{B}^n : B = \pi_{\Theta,X_a}(B) \times \mathbb{R}^{n_b},\ \pi_{\Theta,X_a}(B) \in \mathcal{B}^m \times \mathcal{B}^{n_a}\} \subseteq \mathcal{B}^m \times \mathcal{B}^n$. For an absolutely continuous distribution, $M_{\Theta,X_a}(B) = \int_{\pi_{\Theta,X_a}(B)} g_{\Theta,X_a}(\theta,x_a)\, d^m\theta\, d^{n_a} x_a$ holds for every $B \in \mathcal{C}(\Theta,X_a)$, where $g_{\Theta,X_a}(\cdot,\cdot) : S_\Theta \times \mathbb{R}^{n_a} \to \mathbb{R}_0^+$ (or $g(\theta,x_a)$) is a density of the distribution.

Assumption (Kolmogorov 2). The restriction $M_{\Theta,X}|_{\mathcal{C}(\Theta)}$ of $M_{\Theta,X}$ to the sub-$\sigma$-algebra $\mathcal{C}(\Theta) = \{C \in \mathcal{C}(\Theta,X_a) : C = \pi_\Theta(C) \times \mathbb{R}^n,\ \pi_\Theta(C) \in \mathcal{B}^m\}$ of $\mathcal{C}(\Theta,X_a)$ is $\sigma$-finite (i.e., there is a sequence $\{C_i\}_{i=1}^{\infty}$ in $\mathcal{C}(\Theta)$ such that $\cup_{i=1}^{\infty} C_i = S_\Theta \times \mathbb{R}^n$ and $M_{\Theta,X}|_{\mathcal{C}(\Theta)}(C_i) < \infty$ for all $i$, while $M_{\Theta,X}|_{\mathcal{C}(\Theta)}(S_\Theta \times \mathbb{R}^n)$ may still take on $\infty$).

For a $\sigma$-finite $M_{\Theta,X}|_{\mathcal{C}(\Theta)}$, the marginal density $g_\Theta(\theta) = \int_{\mathbb{R}^{n_a}} g_{\Theta,X_a}(\theta,x_a)\, d^{n_a} x_a$ (or $g(\theta)$) is finite. Then, in analogy with the conditional probability $P_{X_a|X_b;\Theta}$ (Eq. (6)), the system of equations

$$M_{\Theta,X_a}(B \cap C) = \int_{\pi_{\Theta,X_a}(C)} P_{X_a|\Theta}(B|\pi_\Theta(\theta,x))\, M_{\Theta,X}|_{\mathcal{C}(\Theta)}(d^m\theta\, d^n x), \tag{13}$$

$B \in \mathcal{C}(\Theta,X_a)$ and $C \in \mathcal{C}(\Theta)$, defines a distribution $P_{X_a|\Theta}(\cdot\,|\theta)$ on $\mathcal{C}(\Theta,X_a)$, conditional on $\mathcal{C}(\Theta)$, with $\pi_\Theta(\theta,x) = \theta$. In addition, we define the corresponding conditional density function $f_{X_a|\Theta}(\cdot\,|\,\cdot) : \mathbb{R}^{n_a} \times S_\Theta \to \mathbb{R}_0^+$ (or $f(x_a|\theta)$),

$$P_{X_a|\Theta}(B|\theta) = \int_{\pi_{X_a}(B_\theta)} f(x_a|\theta)\, d^{n_a} x_a,$$

where $B_\theta$ is the $\theta$-section of $B \in \mathcal{C}(\Theta,X_a)$. According to these definitions, for $g(\theta) > 0$, there is a version of $f(x_a|\theta)$ such that $g(\theta,x_a) = f(x_a|\theta) g(\theta)$.

When, on the other hand, the restriction $M_{\Theta,X}|_{\mathcal{C}(X_a)}$ of $M_{\Theta,X}$ to the sub-$\sigma$-algebra $\mathcal{C}(X_a) = \{C \in \mathcal{C}(\Theta,X_a) : C = S_\Theta \times \pi_{X_a}(C) \times \mathbb{R}^{n_b},\ \pi_{X_a}(C) \in \mathcal{B}^{n_a}\}$ of $\mathcal{C}(\Theta,X_a)$ is also $\sigma$-finite (that is, when $g_{X_a}(x_a) = \int_{S_\Theta} g_{\Theta,X_a}(\theta,x_a)\, d^m\theta$ (or $g(x_a)$) is finite), we can define a conditional density $f_{\Theta|X_a}(\cdot\,|\,\cdot)$ (or $f(\theta|x_a)$) such that, for $g(x_a) > 0$, $g(\theta,x_a) = f(\theta|x_a) g(x_a)$, and so $f(\theta|x_a) = g(\theta) f(x_a|\theta)/g(x_a)$. If, in addition, we identify $f(x_a|\theta)$ with $f(x_a;\theta)$, $f(\theta|x_a)$ with $f(\theta;x_a)$, $g(\theta)$ with $\zeta(\theta)$, and $g(x_a)$ with $\eta(x_a)$, this recovers Eq. (9). For this and analogous reasons we henceforth adhere to the standard notation and replace, for example, $P_{X_{a(b)};\Theta}(\cdot\,;\theta)$, $F_{X_{a(b)};\Theta}(x_{a(b)};\theta)$, $f_{X_{a(b)};\Theta}(x_{a(b)};\theta)$, $f_{\Theta;X_{a(b)}}(\theta;x_{a(b)})$, $f_{\Theta_2;\Theta_1,X_a}(\theta_2;\theta_1,x_a)$, $f_{\theta_1}(x_{1(2)};\theta_2)$, and $f_{\Theta_2;\Theta_1,X_{1(2)}}(\theta_2;\theta_1,x_{1(2)})$ by $P_{X_{a(b)}|\Theta}(\cdot\,|\theta)$, $F_{X_{a(b)}|\Theta}(x_{a(b)}|\theta)$, $f_{X_{a(b)}|\Theta}(x_{a(b)}|\theta)$, $f_{\Theta|X_{a(b)}}(\theta|x_{a(b)})$, $f_{\Theta_2|\Theta_1,X_a}(\theta_2|\theta_1,x_a)$, $f_{\theta_1}(x_{1(2)}|\theta_2)$, and $f_{\Theta_2|\Theta_1,X_{1(2)}}(\theta_2|\theta_1,x_{1(2)})$, respectively.

Remark 10. In order to apply the Radon–Nikodym Theorem to the system (13) for $P_{X_a|\Theta}(\cdot\,|\theta)$ (Remark 4), $M_{\Theta,X}|_{\mathcal{C}(\Theta)}$ must be $\sigma$-finite, which is the reason for making Assumption (Kolmogorov 2). The $\sigma$-finite $M_{\Theta,X}|_{\mathcal{C}(\Theta)}$ implies that $M_{\Theta,X}$ is also $\sigma$-finite, yet need not be finite. The measure space $(\Omega, \Sigma, M)$, introduced in Assumption (Kolmogorov 1), is therefore a generalization of the classical Kolmogorov probability space $(\Omega, \Sigma, P)$, adopted in Bayesian statistical models. Unlike the classical Kolmogorov setting, the generalized setting makes the concept of improper prior distributions (of non-integrable consistency factors) mathematically precise by avoiding the inconsistency discussed in Remark 8.
The restriction $M_{\Theta,X}|_{\mathcal{C}(X_a)}$ of $M_{\Theta,X}$ to $\mathcal{C}(X_a)$, however, need not be $\sigma$-finite, even though $M_{\Theta,X}$ is $\sigma$-finite (Florens et al., 1990, Section 1.2.6, p. 34). Then, for $M_{\Theta,X}|_{\mathcal{C}(X_a)}$ that is not $\sigma$-finite, the Radon–Nikodym Theorem does not apply, and so $f(\theta|x_a)$ cannot be defined within the generalized Kolmogorov framework (see Example 2, below). For the same reason it is false to state that according to the Kolmogorov concept, the uniform distribution on $\mathbb{N} \times \mathbb{N}$ decomposes as the product of the marginal and the conditional distribution (Hartigan, 1964, Section 3.0, p. 23).
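The failure of $\sigma$-finiteness can be made concrete with a small numerical sketch. It assumes a normal location-scale model and a factor $1/\sigma$ multiplying the direct density; both are assumptions of this illustration. The integral over the location parameter is carried out analytically (stated in the comments), and only the integral over the scale parameter is evaluated numerically on a truncated range $[\epsilon, 1/\epsilon]$.

```python
import math

def trapezoid(func, lo, hi, steps=4000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

def mass_one_observation(eps):
    """Truncated integral of (1/sigma) * f(x1 | mu, sigma) over mu and sigma.

    Integrating the normal density over mu gives 1, so the remaining integrand
    in sigma is 1/sigma; in the variable t = log(sigma) it is the constant 1.
    """
    return trapezoid(lambda t: 1.0, math.log(eps), -math.log(eps))

def mass_two_observations(x1, x2, eps):
    """Same integral with two independent realizations x1 != x2.

    Integrating the product of the two normal densities over mu gives
    exp(-d^2 / (4 sigma^2)) / (2 sqrt(pi) sigma), d = x1 - x2; multiplying by
    1/sigma and changing to t = log(sigma) leaves the integrand below.
    """
    d2 = (x1 - x2) ** 2
    def integrand(t):
        sigma = math.exp(t)
        return math.exp(-d2 / (4.0 * sigma * sigma)) / (2.0 * math.sqrt(math.pi) * sigma)
    return trapezoid(integrand, math.log(eps), -math.log(eps))

single_coarse = mass_one_observation(1e-3)
single_fine = mass_one_observation(1e-6)
pair_fine = mass_two_observations(0.0, 1.0, 1e-6)
pair_exact = 1.0 / (2.0 * abs(0.0 - 1.0))  # closed form 1 / (2 |x1 - x2|)
```

The single-realization value grows like $2\log(1/\epsilon)$ as the truncation is relaxed, while the two-realization value settles near the closed form $1/(2|x_1 - x_2|)$ obtained under the stated assumptions.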
Example 2. A location-scale model with the density f ðx1 9m, sÞ and a consistency factor zðm, sÞ ¼ s1 (Proposition 3, below) R R yield a density gðm, s,x1 Þ ¼ zðm, sÞf ðx1 9m, sÞ, such that gðx1 Þ ¼ R þ ½ R gðm, s,x1 Þ dm ds ¼ 1. The infinite gðx1 Þ implies that the restriction M H,X 1 9CðX 1 Þ of the corresponding underlying distribution M H,X to CðX 1 Þ ¼ fC : C ¼ R R þ pX 1 ðCÞ, pX 1 ðCÞ 2 B1 g is not s-finite, so that f ðm, s9x1 Þ cannot be defined within the Kolmogorov concept of conditioning. That is, at least two realizations of the scalar sampling variable are needed to make a simultaneous probabilistic inference about a location and a scale parameter, a single one, x1, is not enough. More generally, improper posterior distributions are not defined within the Kolmogorov framework, neither the classical nor the generalized one. 2. Objective probabilistic inference 2.1. Objectivity This section imposes objectivity, also called internal consistency (McCullagh, 1992), on probabilistic parametric inference, and investigates the models for which this requirement leads to unique consistency factors. Definition 7 (Objective inference). A probabilistic parametric inference is called objective if there is a unique posterior probability density function that corresponds to a particular model. That is, a probabilistic parametric inference is called objective if a particular direct probability density function always leads to the same posterior probability density function. The rationale behind this requirement is that at the beginning of the inference, before any realizations fxa ,xb , . . .g of random variables fXa ,Xb , . . .g are given, only the model is known, and inferences based on identical information should be the same. 2.2. Consistency factors and invariance under Lie groups Untill Section 4, we omit indices a,b of Xa ,Xb and xa ,xb . Let Eq. 
(9) hold and let ðs,sÞ : ðH,XÞ!ðs JH,sJXÞ ¼ ðK,YÞ be a differentiable function with non-vanishing Jacobians det½@k sðkÞ and det½@x sðxÞ. From the transformation properties of direct and inverse probability density functions under the change of variables it follows that
zK ðkÞ ¼ wzH ½s 1 ðkÞ9det½@k s 1 ðkÞ9
ð14Þ
and ZY ðyÞ ¼ wZX ½s1 ðyÞ9det½@y s1 ðyÞ9 (Appendix A). The transformations of the factors zH9h and ZX1 from Eq. (10) are 1 analogous. Remark 11. Bayes’ Postulate (the Laplace Principle of insufficient reason) and the Principle of maximum entropy (Catlin, 1989, Section 3.3, p. 82), both suggesting that the prior densities should be uniform, are are not compatible with Eq. (14). Consider a model that is invariant under a group G (cf. Appendix A). Then, because of the objectivity requirement, Eq. (14) reduces to
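\[ \zeta_{\boldsymbol{\Theta}}(\boldsymbol{\theta})=\chi(a)\,\zeta_{\boldsymbol{\Theta}}[\bar{l}(a^{-1},\boldsymbol{\theta})]\,\bigl|\det[\partial_{\boldsymbol{\theta}}\bar{l}(a^{-1},\boldsymbol{\theta})]\bigr|; \tag{15} \]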
that is, as $\chi$ may depend on $a\in G$, the factor $\zeta_{\boldsymbol{\Theta}}$ is relatively invariant under $\bar{l}(a,\boldsymbol{\theta})$.

Example 3 (Parity). Let a model be invariant under the action $l(a,x)=ax$ of a finite group $G=\{-1,1\}$, such that $\bar{l}(a,\theta)=a\theta$; that is, the model has positive parity. Since $\zeta(\theta)$ cannot switch sign, the symmetry implies that $\zeta(\theta)$ also has positive parity; apart from this, it can take on any form.

Example 4 (Invariance of location-scale models). Let $X_1,X_2$ be independent scalar random variables with identical probability distribution from a location-scale model. The model is invariant under the action $l[(a_1,a_2),(x_1,x_2)]=(a_2x_1+a_1,\,a_2x_2+a_1)$ of the two-dimensional Lie group $G=\mathbb{R}\times\mathbb{R}^+$ whose group operation is
\[ a\circ b=(a_2b_1+a_1,\,a_2b_2), \tag{16} \]
and the corresponding induced action is
\[ \bar{l}[(a_1,a_2),(\mu,\sigma)]=(a_2\mu+a_1,\,a_2\sigma). \tag{17} \]
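As a quick sanity check of Example 4, the following minimal Python sketch verifies numerically that operation (16) is compatible with the two actions, that is, $l(a\circ b,x)=l[a,l(b,x)]$ and likewise for the induced action (17). The function names (`compose`, `inverse`, `act_sample`, `act_param`) and the numerical values are illustrative assumptions, not part of the original text.

```python
# Illustrative sketch; all names and numbers are assumptions, not from the paper.

def compose(a, b):
    """Group operation (16): a∘b = (a2*b1 + a1, a2*b2)."""
    return (a[1] * b[0] + a[0], a[1] * b[1])

def inverse(a):
    """Inverse element with respect to (16); the unit element is e = (0, 1)."""
    return (-a[0] / a[1], 1.0 / a[1])

def act_sample(a, xs):
    """Action on the sample: every x_i is mapped to a2*x_i + a1."""
    return tuple(a[1] * x + a[0] for x in xs)

def act_param(a, theta):
    """Induced action (17) on (mu, sigma): (a2*mu + a1, a2*sigma)."""
    mu, sigma = theta
    return (a[1] * mu + a[0], a[1] * sigma)

# Values chosen to be exactly representable in binary floating point.
a, b = (1.5, 2.0), (-0.5, 4.0)
x, theta = (0.25, 3.0), (1.0, 0.5)

# Left-action property for both actions, and a∘a^{-1} = e.
same_on_sample = act_sample(compose(a, b), x) == act_sample(a, act_sample(b, x))
same_on_param = act_param(compose(a, b), theta) == act_param(a, act_param(b, theta))
unit = compose(a, inverse(a))
```

The check only exercises one pair of group elements; it illustrates the algebra and is not a proof of the group properties.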
A model for a scalar random variable that is invariant under a one-dimensional Lie group $G$ is closely related to a location-scale model.

Proposition 2. Consider a probability distribution for a continuous, scalar sampling random variable $X$ from a model that is invariant under a one-dimensional Lie group $G$. Let each of $l(a,x)$, $\bar{l}(a,\theta)$, and $F(x|\theta)$ be differentiable in both of its arguments.
Then, on the subspaces of $\mathbb{R}$ and $S_\Theta$ with non-vanishing derivatives
\[ \partial_a l(a^{-1},x)\big|_{a=e},\qquad \partial_a\bar{l}(a^{-1},\theta)\big|_{a=e}, \tag{18} \]
the model is reducible to a location model.

In order to assign an inverse probability distribution to a scalar parameter of a model that is invariant under a one-dimensional Lie group $G$, it therefore suffices to determine the consistency factor $\zeta_{\sigma=1}(\mu)$, which can subsequently be transformed by applying (14) to the factor for the original parameter.

Example 5. Every restricted location-scale model $P_{X|\boldsymbol{\Theta}|\mu}$ with the distribution function $F_{X|\boldsymbol{\Theta}|\mu}(x|\sigma)=\Phi[(x-\mu)/\sigma]$ is reducible to a model $P_{Y|\Theta_2}$ with the distribution function $F_{Y|\Theta_2}(y|\sigma)=\Phi(y/\sigma)$, $Y=X-\mu$. The model $P_{Y|\Theta_2}$ is invariant under the Lie group $\mathbb{R}^+$ under multiplication, and except for $y=0$, the density $f_{Y|\Theta_2}(y|\sigma)$ of the model can be written as a sum $c^+f^+_{Y|\Theta_2}(y|\sigma)+c^-f^-_{Y|\Theta_2}(y|\sigma)$, where $c^\pm f^\pm_{Y|\Theta_2}(y|\sigma)$ coincides with $f_{Y|\Theta_2}(y|\sigma)$ for $y\gtrless 0$ and is 0 otherwise, while $c^+=1-F_{Y|\Theta_2}(0|\sigma)$ and $c^-=F_{Y|\Theta_2}(0|\sigma)$. For $c^\pm>0$, there exist conditional probability densities $f^\pm_{Y|\Theta_2}(y|\sigma)=f_{Y|\Theta_2}(y|\sigma)/c^\pm$ for $Y$, given $Y\gtrless 0$, which can be further reduced to densities $f^\pm_{Z|\Lambda_1}(z|\lambda_1)=\exp(z-\lambda_1)\,\phi[\pm\exp(z-\lambda_1)]/c^\pm\equiv\tilde{\phi}^\pm(z-\lambda_1)$, where $Z=\log(\pm Y)$, $\Lambda_1=\log(\Theta_2)$, and $\phi(u)=\Phi'(u)$; that is, every scale parameter of a location-scale model is reducible to a location parameter. Finally, the models with densities $f^\pm_{Z|\Lambda_1}$ may be regarded as restrictions of the location-scale models with the densities $f^\pm_{Z|\boldsymbol{\Lambda}}(z|\lambda_1,\lambda_2)=\lambda_2^{-1}\tilde{\phi}^\pm[(z-\lambda_1)/\lambda_2]$, $\boldsymbol{\Lambda}=(\Lambda_1,\Lambda_2)$.

Proposition 3. Consider independent scalar random variables $(X_1,X_2)=X$ with identical probability distribution from a location-scale model $P_{X_1|\boldsymbol{\Theta}}$, and independent scalar random variables $(Z_1,Z_2)=Z$ with identical probability distribution from a location-scale model $P^\pm_{Z_1|\boldsymbol{\Lambda}}$, where the relation between the two models is that from Example 5. Suppose first that there exist objective densities $f_{\boldsymbol{\Theta}|X}$ and $f_{\Theta_1|\Theta_2,X_1}$. Then,
\[ \zeta_{\boldsymbol{\Theta}|\sigma}(\mu)=1. \tag{19} \]
Second, if we assume objective densities $f_{\Theta_2|\Theta_1,X_1}$, $f^\pm_{\boldsymbol{\Lambda}|Z}$, and $f^\pm_{\Lambda_1|\Lambda_2,Z_1}$, then
\[ \zeta_{\boldsymbol{\Theta}|\mu}(\sigma)=\sigma^{-1}. \tag{20} \]
Third, if we assume objective $f_{\boldsymbol{\Theta}|X}$, $f_{\Theta_2|\Theta_1,X_1}$, $f^\pm_{\boldsymbol{\Lambda}|Z}$, and $f^\pm_{\Lambda_1|\Lambda_2,Z_1}$, then
\[ \zeta_{\boldsymbol{\Theta}}(\mu,\sigma)=\sigma^{-1}. \tag{21} \]
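The factor (21) can be checked against the functional Eq. (15) numerically. The sketch below is a minimal illustration under stated assumptions: it evaluates the ratio that plays the role of $\chi(a)$ for the location-scale group and confirms that the ratio does not vary with $(\mu,\sigma)$, and it confirms that the scale factor (20) becomes a constant factor for $\lambda_1=\log\sigma$, as in Example 5. Relative invariance is only a necessary condition (a constant factor would pass the first check as well), so this does not replace the proof in Appendix B. All function names are hypothetical.

```python
import math

def zeta(mu, sigma):
    """Consistency factor (21): zeta(mu, sigma) = 1/sigma."""
    return 1.0 / sigma

def chi(a1, a2, mu, sigma):
    """Ratio zeta(theta) / {zeta[l(a^-1, theta)] * |Jacobian|} from Eq. (15).

    For the location-scale group, l(a^-1, (mu, sigma)) = ((mu - a1)/a2, sigma/a2),
    and the absolute Jacobian determinant of this map is a2**-2.
    """
    transformed = zeta((mu - a1) / a2, sigma / a2) * a2 ** -2
    return zeta(mu, sigma) / transformed

def zeta_log_scale(lam):
    """Factor for lambda_1 = log(sigma): exp(lambda_1) * zeta_scale(exp(lambda_1))."""
    return math.exp(lam) * (1.0 / math.exp(lam))

a1, a2 = 0.3, 2.0
points = [(1.0, 0.5), (-4.0, 3.0), (0.0, 10.0)]
ratios = [chi(a1, a2, mu, sigma) for mu, sigma in points]
```

For this group the ratio equals $a_2$ at every point, so `ratios` holds three (numerically) equal values.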
Remark 12. Functional Eq. (15), which follows from the objectivity requirement, corresponds to what has been called the Principle of relative invariance (Hartigan, 1964). Contrary to Proposition 3, some have asserted that the principle is insufficient to determine uniquely defined priors (consistency factors) (Hartigan, 1964, Section 4, p. 838 and Section 10, p. 845; Villegas, 1977, Section 2, p. 454; Berger, 1980, Section 3.3, pp. 86–87; Kass and Wasserman, 1996, Section 3.2, p. 1348).

Example 6 (Inference about the normal model). For independent random variables $\{X_i\}_{i=1}^n$ with identical probability distribution from the normal model, the factors (19)–(21) yield $f(\mu|\sigma,x)\propto\exp[-n(\bar{x}_n-\mu)^2/(2\sigma^2)]$, $f(\sigma|\mu,x)\propto\sigma^{-(n+1)}\exp\{-n[(\bar{x}_n-\mu)^2+S_n^{\prime 2}]/(2\sigma^2)\}$, $f(\mu,\sigma|x)\propto\sigma^{-(n+1)}\exp\{-n[(\bar{x}_n-\mu)^2+S_n^{\prime 2}]/(2\sigma^2)\}$, $f(\mu|x)\propto 1/[1+(\bar{x}_n-\mu)^2/S_n^{\prime 2}]^{n/2}$, and $f(\sigma|x)\propto\sigma^{-n}\exp[-nS_n^{\prime 2}/(2\sigma^2)]$. Here, $x=(x_i)_{i=1}^n$, $\bar{x}_n=n^{-1}\sum_{i=1}^n x_i$, $S_n^{\prime 2}=n^{-1}\sum_{i=1}^n(x_i-\bar{x}_n)^2$, and $n\ge 1$ for $f(\mu|\sigma,x)$ and $f(\sigma|\mu,x)$, whereas $n\ge 2$ for $f(\mu,\sigma|x)$, $f(\mu|x)$ and $f(\sigma|x)$.

2.3. (Non-)Existence of objective inverse distributions

In Proposition 3, the invariance of the location-scale models under translation and scale transformation implies the unique consistency factor $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma)$ (21). The factor, however, need not be relatively invariant under possible additional symmetries of a particular model; for such a model, there is no objective $f(\mu,\sigma|x)$ (Hartigan, 1983, Section 5.2, p. 47).

Example 7 (Cauchy model). Suppose for a moment that objective $f(\mu,\sigma|x)$ exists for the Cauchy model. Then, since the model is a location-scale model, the objectivity requirement leads to $\zeta(\mu,\sigma)=\sigma^{-1}$ (Proposition 3). Besides the invariance under $G=\mathbb{R}\times\mathbb{R}^+$, the model is also invariant under $s:x_i\to x_i^{-1}$, $i=1,\ldots,n$, and $\bar{s}:(\mu,\sigma)\to(\mu,\sigma)/(\mu^2+\sigma^2)$ (McCullagh, 1992; Kass and Wasserman, 1996), implying that $\zeta(\mu,\sigma)=[\sigma(\mu^2+\sigma^2)]^{-1}$ (Eq. (14)). The inconsistency is based on the assumed existence of objective $f(\mu,\sigma|x)$, and so the assumption is ruled out. (As $n\to\infty$, however, the argument against the existence of the objective $f(\mu,\sigma|x)$ vanishes, because the inverse densities, constructed from $\zeta(\mu,\sigma)=\sigma^{-1}$ and from $\zeta(\mu,\sigma)=[\sigma(\mu^2+\sigma^2)]^{-1}$, differ by at most $O(n^{-1/2})$; McCullagh, 1992.)

When, on the other hand, the symmetry group is discrete, the sampling and the parameter space break up into fundamental regions of the group, with no group transformations linking the points of the same region. We are then free to choose $\zeta(\boldsymbol{\theta})$ in one of these regions, e.g., we can choose $\zeta(\theta)$ for $\theta>0$ in Example 3, so that the invariance of a model under a discrete group $G$ alone does not lead to a unique consistency factor. (When a discrete probability distribution can be approximated (in distribution) by a continuous one, e.g., when a binomial distribution can be approximated by a normal one, the parameter of the former may of course be inferred from the parameter of the latter.)
The system of probabilistic parametric inference may never be applicable to all models; the probabilistic inference is an ideal (Paris, 1994, Chap. 3, p. 33), and ideals cannot always be reached (recall also Example 2).
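The additional symmetry of the Cauchy model invoked in Example 7 can be checked numerically. The following sketch is an illustration only, with hypothetical function names: it confirms that the density of $1/X$ is again a Cauchy density with parameters $(\mu,\sigma)/(\mu^2+\sigma^2)$, and that transforming the factor $\sigma^{-1}$ under this inversion (with the constant in Eq. (14) set to 1, and the Jacobian estimated by finite differences) gives $[\sigma(\mu^2+\sigma^2)]^{-1}$.

```python
import math

def cauchy_pdf(x, mu, sigma):
    """Cauchy density with location mu and scale sigma."""
    return sigma / (math.pi * ((x - mu) ** 2 + sigma ** 2))

def invert_param(mu, sigma):
    """Induced map on the parameters: (mu, sigma) / (mu^2 + sigma^2)."""
    d = mu * mu + sigma * sigma
    return (mu / d, sigma / d)

def pdf_of_reciprocal(y, mu, sigma):
    """Density of Y = 1/X by change of variables: f_X(1/y) * |d(1/y)/dy|."""
    return cauchy_pdf(1.0 / y, mu, sigma) / (y * y)

def abs_jacobian(mu, sigma, h=1e-6):
    """|det| of the Jacobian of invert_param, by central finite differences."""
    m_hi, s_hi = invert_param(mu + h, sigma)
    m_lo, s_lo = invert_param(mu - h, sigma)
    dm_dmu, ds_dmu = (m_hi - m_lo) / (2 * h), (s_hi - s_lo) / (2 * h)
    m_hi, s_hi = invert_param(mu, sigma + h)
    m_lo, s_lo = invert_param(mu, sigma - h)
    dm_dsig, ds_dsig = (m_hi - m_lo) / (2 * h), (s_hi - s_lo) / (2 * h)
    return abs(dm_dmu * ds_dsig - dm_dsig * ds_dmu)

def transformed_factor(mu, sigma):
    """Factor 1/sigma evaluated at the inverted parameters, times the Jacobian."""
    _, sigma_inverted = invert_param(mu, sigma)
    return (1.0 / sigma_inverted) * abs_jacobian(mu, sigma)

mu, sigma, y = 0.7, 1.3, 0.4
lhs_density = pdf_of_reciprocal(y, mu, sigma)
rhs_density = cauchy_pdf(y, *invert_param(mu, sigma))
expected_factor = 1.0 / (sigma * (mu * mu + sigma * sigma))
```

The two factors $\sigma^{-1}$ and $[\sigma(\mu^2+\sigma^2)]^{-1}$ are not proportional, which is the inconsistency that rules out an objective $f(\mu,\sigma|x)$ in Example 7.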
2.4. Non-integrability of the consistency factors

It is easily verified that the existence of normalized direct probability densities from location-scale models guarantees that all the other densities involved in the foregoing derivations are also integrable. Integrability, however, has not been imposed on consistency factors, and the factors (19)–(21) are indeed not integrable. Several authors refrain from using improper priors (see, for example, Stone and Dawid, 1972; Dawid et al., 1973; Stone, 1976; Florens et al., 1990, Section 1.2.6, pp. 34–35 and Section 8.3.2, p. 400; and Kass and Wasserman, 1996, Section 4.2, pp. 1356–1359), apparently because they lead to (mathematical) inconsistencies such as the marginalization paradox and the strong inconsistency. The next two examples show, however, that these inconsistencies are irrelevant to the system of objective probabilistic parametric inference.

Example 8 (Marginalization paradox). Let $f_{X|\boldsymbol{\Theta}}(x|\mu,\sigma)\propto\sigma^{-n}\prod_{i=1}^n\phi[(x_i-\mu)/\sigma]$, $\phi(u)=\exp(-u^2/2)$, and $x=(x_i)_{i=1}^n$, $n\ge 2$. The objective posterior density is $f_{\boldsymbol{\Theta}|X}(\mu,\sigma|x)\propto\sigma^{-1}f_{X|\boldsymbol{\Theta}}(x|\mu,\sigma)$, and so $f_{\boldsymbol{\Lambda}|X}(\boldsymbol{\lambda}|x)\propto f_{X|\boldsymbol{\Theta}}(x|\lambda\sigma,\sigma)\propto f_{Y|\boldsymbol{\Lambda}}(y|\boldsymbol{\lambda})$, where $\boldsymbol{\lambda}=(\lambda,\sigma)$ is a realization of the parameter $\boldsymbol{\Lambda}=(\Theta_1/\Theta_2,\Theta_2)$, $\lambda=\mu/\sigma$, $y=(r,R)$ is a realization of the sampling random variable $Y$ with $R^2=\sum_{i=1}^n x_i^2$ and $rR=\sum_{i=1}^n x_i$, and
\[ f_{Y|\boldsymbol{\Lambda}}(y|\boldsymbol{\lambda})\propto\sigma^{-n}R^{n-1}\left(1-\frac{r^2}{n}\right)^{(n-3)/2}\exp\left\{-\frac{n\lambda^2}{2}+\frac{rR\lambda}{\sigma}-\frac{R^2}{2\sigma^2}\right\}. \]
Hence,
\[ f_{\Lambda_1|X}(\lambda|x)=\int_{\mathbb{R}^+}f_{\boldsymbol{\Lambda}|X}(\boldsymbol{\lambda}|x)\,d\sigma=\frac{a(r,\lambda)\,b(r,R,\lambda)}{\eta_Y(y)}\propto\exp\left\{-\frac{n\lambda^2}{2}\right\}H_{n-2}(r,\lambda), \]
where $H_n(r,\lambda)=\int_{\mathbb{R}^+}u^n\exp\{-u^2/2+r\lambda u\}\,du$, $a(r,\lambda)=f_{Y_1|\boldsymbol{\Lambda}}(r|\boldsymbol{\lambda})=\int_{\mathbb{R}^+}f_{Y|\boldsymbol{\Lambda}}(y|\boldsymbol{\lambda})\,dR$ (i.e., $f_{Y_1|\boldsymbol{\Lambda}}(r|\boldsymbol{\lambda})$ does not vary with $\sigma$), $b(r,R,\lambda)=\int_{\mathbb{R}^+}f_{Y_2|Y_1,\boldsymbol{\Lambda}}(R|r,\boldsymbol{\lambda})\,d\sigma$, and $f_{Y_2|Y_1,\boldsymbol{\Lambda}}(R|r,\boldsymbol{\lambda})=f_{Y|\boldsymbol{\Lambda}}(y|\boldsymbol{\lambda})/f_{Y_1|\boldsymbol{\Lambda}}(r|\boldsymbol{\lambda})$. Since $f_{\Lambda_1|X}(\lambda|x)$ does not vary with $R$, $b(r,R,\lambda)$ factors as $b(r,R,\lambda)=b_1(r,R)\,b_2(r,\lambda)$. According to the Principle of reduction, the marginal densities $f_{Y_1|\boldsymbol{\Lambda}}$ and $f_{\Lambda_1|X}$ should be related as $f_{\Lambda_1|X}(\lambda|x)=\zeta_{\boldsymbol{\Lambda}|\sigma}(\lambda)f_{Y_1|\boldsymbol{\Lambda}}(r|\boldsymbol{\lambda})/\eta_{Y_1}(r)$ (Stone and Dawid, 1972; Bernardo and Smith, 1994, Section 5.6, p. 363); that is, $b_2(r,\lambda)$ should further factor as $\zeta_{\boldsymbol{\Lambda}|\sigma}(\lambda)\,b_3(r)$, but it does not. This is called a marginalization paradox.

Proposition 1 does not imply the relation $f_{\Lambda_1|X}(\lambda|x)=\zeta_{\boldsymbol{\Lambda}|\sigma}(\lambda)f_{Y_1|\boldsymbol{\Lambda}}(r|\boldsymbol{\lambda})/\eta_{Y_1}(r)$ between the marginal densities $f_{Y_1|\boldsymbol{\Lambda}}$ and $f_{\Lambda_1|X}$. That is, the Principle of reduction may well be an axiom outside the theory of objective probabilistic parametric inference, and unless the opposite is proved (i.e., unless the principle can be deduced from the axioms of the theory), the marginalization paradox does not apply to consistency factors, notwithstanding the non-integrability of the latter.

Example 9 (Strong inconsistency). Consider the normal model with the scale parameter constrained to 1, $P_{X|\Theta}$, and a consistency factor (an improper prior density) $\zeta(\mu)=\exp\{4\mu\}$. From the perspective of objectivity, this factor is inappropriate because it does not have the positive parity, whereas the normal model does (Example 3). Stone (1976), on the other hand, rejects the factor because the direct probability distribution $P_{X|\Theta}(A|\mu)=(1-\operatorname{erf}\{\sqrt{2}\})/2$ is uniformly low compared with the inverse distribution $P_{\Theta|X}(A|x)=(1+\operatorname{erf}\{\sqrt{2}\})/2$, $A=\{(\mu,x)\in\mathbb{R}\times\mathbb{R}:\mu-x>2\}$.
This is called strong inconsistency, apparently because of the conclusions that $P_{X|\Theta}(A|\mu)$, not varying with $\mu$, implies $P_{X|\Theta}(A|\mu)=P_{\Theta,X}(A)$ and $P_{\Theta|X}(A|x)$, not varying with $x$, implies $P_{\Theta|X}(A|x)=P_{\Theta,X}(A)$ (Koop et al., 2007, Chap. 8, Exer. 8.9, pp. 88–89), and so $P_{\Theta,X}(A)<P_{\Theta,X}(A)$. The conclusions, however, are false: $\int_A\zeta(\mu)f(x|\mu)\,d\mu\,dx=\infty$, so there is no $P_{\Theta,X}(A)$, whereas $M_{\Theta,X}(A)=\infty$.

3. Calibration
Definition 8 (Calibrated distribution). A density $f(\boldsymbol{\theta}|x)$ is calibrated if for every $\gamma\in[0,1]$ and $\tilde{P}_{X|\boldsymbol{\Theta}}(\cdot|\boldsymbol{\theta})$-almost all $x\in\mathbb{R}^n$, there is at least one algorithm to construct a set $D(x)\in\mathcal{B}^m$ (credible set or region) whose (inverse-) probability content $\tilde{P}_{\boldsymbol{\Theta}|X}(D|x)=\gamma$ is equal to the (direct-) probability content of a set $B(\boldsymbol{\theta})\in\mathcal{B}^n$, whether it is the unconditional or a conditional probability, and for which $\boldsymbol{\theta}\in D\Leftrightarrow x\in B$ holds. Credible sets with these properties are called confidence sets (or regions).

Since the consistency factors (19)–(21) coincide with the elements of the right-invariant Haar measures for the group $\mathbb{R}$ under addition, for the group $\mathbb{R}^+$ under multiplication, and for the group $\mathbb{R}\times\mathbb{R}^+$ under operation (16), the resulting posterior distributions are calibrated. Stein (1965), and Chang and Villegas (1986) proved the calibration of the inverse densities, constructed from the right Haar factors, by demonstrating that the credible sets with $D(x)$ that are equivariant, $D[l(a,x)]=\bar{l}[a,D(x)]$ ($\bar{l}[a,D(x)]=\{\boldsymbol{\theta}\in S_{\boldsymbol{\Theta}}:\bar{l}(a^{-1},\boldsymbol{\theta})\in D(x)\}$, $a\in G$), are confidence sets.

Let $x=(x_i)_{i=1}^n$, $n\ge 2$, be realizations of independent random variables $\{X_i\}_{i=1}^n$ with identical probability distribution from a location-scale model $P_{X|\boldsymbol{\Theta}}$. Consider the corresponding inverse density $f(\mu,\sigma|x)$, constructed from the consistency
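factor (21), and a set of equations
\[ \tilde{P}_{\boldsymbol{\Theta}|X}(D_1|x)=\alpha_1,\quad \tilde{P}_{\boldsymbol{\Theta}|X}(D_2|x)=\alpha_2,\quad \tilde{P}_{\boldsymbol{\Theta}|X}(D_3|x)=\alpha_3,\quad \tilde{P}_{\boldsymbol{\Theta}|X}(D|x)=\gamma, \]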
where $D_1(x)=(-\infty,\mu_1)\times\mathbb{R}^+$, $D_2(x)=(\mu_1,\mu_2)\times\mathbb{R}^+$, $D_3(x)=(\mu_1,\mu_2)\times(0,\sigma_1)$, and $D(x)=(\mu_1,\mu_2)\times(\sigma_1,\sigma_2)$ ($\mu_1,\mu_2\in\mathbb{R}$, $\sigma_1,\sigma_2\in\mathbb{R}^+$, $\mu_1<\mu_2$, and $\sigma_1<\sigma_2$) are the maximal rectangles that solve the set of equations. Every quadruple $(\alpha_1,\alpha_2,\alpha_3,\gamma)$ for which the set is solvable determines a (possibly infinite) rectangle $D(x)$, equivariant under the location-scale transformations $l(a,x_i)=a_2x_i+a_1$ and $\bar{l}[a,(\mu,\sigma)]=(a_2\mu+a_1,a_2\sigma)$, $a_1\in\mathbb{R}$ and $a_2\in\mathbb{R}^+$, and the sides of the rectangle are parallel to the axes $\mu$ and $\sigma$; thus, we say that $f(\mu,\sigma|x)$ is calibrated on rectangles. Since the rectangles include the bands $\mathbb{R}\times(\sigma_1,\sigma_2)$ and $(\mu_1,\mu_2)\times\mathbb{R}^+$ (set $\alpha_2=1$ and $\alpha_2=\gamma$, respectively), the calibration of $f(\mu,\sigma|x)$ automatically ensures the calibration of the marginal densities $f(\mu|x)$ and $f(\sigma|x)$ on the intervals $(\mu_1,\mu_2)$ and $(\sigma_1,\sigma_2)$.

Remark 13. The posterior distributions that are calibrated on rectangles are analogous to the complete simultaneous distributions within the fiducial theory (Bartlett, 1939). A density $f_{\boldsymbol{\Theta}|X}$ calibrated on the rectangles $D(x)$, and a density $f_{\boldsymbol{\Lambda}|Y}$ with $Y=(Y_1,Y_2)=s\circ X$ and $\boldsymbol{\Lambda}=(\Lambda_1,\Lambda_2)=\bar{s}\circ\boldsymbol{\Theta}$, are isomorphic, provided that the functions $\bar{s}$ and $s$ are one-to-one. In addition, the calibration of $f_{\boldsymbol{\Theta}|X}$ is a sufficient condition for $f_{\boldsymbol{\Lambda}|Y}$ to be calibrated on the images $\bar{s}[D(x)]$ of $D(x)$ under $\bar{s}$, but the images are not necessarily rectangles (i.e., the reparametrization does not generally preserve equivariance of the rectangles). When it comes to the calibration of the marginal densities, $f_{\boldsymbol{\Theta}|X}$ and $f_{\boldsymbol{\Lambda}|Y}$ are therefore no longer equivalent for every one-to-one $\bar{s}$; $f_{\Lambda_1|Y}$ is calibrated only if $\Lambda_1=\bar{s}_1\circ\Theta_1$ or $\Lambda_1=\bar{s}_1\circ\Theta_2$, and $f_{\Lambda_2|Y}$ is calibrated only if $\Lambda_2=\bar{s}_2\circ\Theta_2$ or $\Lambda_2=\bar{s}_2\circ\Theta_1$. The marginal density $f_{\Lambda_1|X}$ from Example 8, for instance, is not calibrated, whereas $f_{\Lambda_2|X}$ is indeed calibrated. Similarly, the inverse probability densities $f^{(1)}_{\boldsymbol{\Theta}|X}(\mu,\sigma|x)\propto\sigma^{-1}f_{X|\boldsymbol{\Theta}}(x|\mu,\sigma)$ and $f^{(2)}_{\boldsymbol{\Theta}|X}(\mu,\sigma|x)\propto[\sigma(\mu^2+\sigma^2)]^{-1}f_{X|\boldsymbol{\Theta}}(x|\mu,\sigma)$ for the Cauchy model are both calibrated, even though they do not conform to the objectivity requirement (Example 7). The function $f^{(1)}_{\boldsymbol{\Theta}|X}$ is calibrated on the rectangles $D(x)$, which ensures that both $f^{(1)}_{\Theta_1|X}$ and $f^{(1)}_{\Theta_2|X}$ are calibrated. The function $f^{(2)}_{\boldsymbol{\Theta}|X}$, on the other hand, is calibrated on the images of $D(x)$ under inversion $\bar{s}:(\mu,\sigma)\to(\mu,\sigma)/(\mu^2+\sigma^2)$, and the images are not generally rectangles, so neither $f^{(2)}_{\Theta_1|X}$ nor $f^{(2)}_{\Theta_2|X}$ is calibrated.
These arguments extend without change to parameters of higher dimensions.

4. Interpretations of probability distributions

After considering mathematical aspects of the theory of probabilistic parametric inference, we now turn to interpretations of direct and inverse probability distributions. An abstract, mathematical theory allows for different concrete interpretations (Kolmogorov, 1933, Chap. 1, p. 1).

Assumption 7. For every $\boldsymbol{\theta}\in S_{\boldsymbol{\Theta}}$, the random variable $X|\Omega_{\boldsymbol{\theta}}$ (Assumption 2) consists of $N$ components $X_a|\Omega_{\boldsymbol{\theta}},X_b|\Omega_{\boldsymbol{\theta}},\ldots$ that are independent random variables with identical probability distribution, say $P_{X_a|\boldsymbol{\Theta}}(\cdot|\boldsymbol{\theta})$.

Let $x_a,x_b,\ldots$ be realizations of the $N$ random variables $X_a|\Omega_{\boldsymbol{\theta}},X_b|\Omega_{\boldsymbol{\theta}},\ldots$: $x_i=X_i(\omega)$, $\omega\in\Omega_{\boldsymbol{\theta}}$, $i=a,b,\ldots$, and $\boldsymbol{\Theta}(\omega)=\boldsymbol{\theta}$ be the realization of $\boldsymbol{\Theta}$ on the same $\omega$. By the Glivenko-Cantelli Theorem (van der Vaart, 1998, Section 19.1, pp. 265–266), as $N$ grows to infinity, for $P_{\boldsymbol{\theta}}$-almost all $\omega\in\Omega_{\boldsymbol{\theta}}$, the proportion (relative frequency) $N_{x_i\in B}/N$ of the realizations $x_i$ contained in a (fixed) Borel set $B$ converges to $\tilde{P}_{X_a|\boldsymbol{\Theta}}(B|\boldsymbol{\theta})$. That is to say, the frequency interpretation applies to $P_{X_a|\boldsymbol{\Theta}}(\cdot|\boldsymbol{\theta})$.

The parameter (Assumption 1) is, on the other hand, unique, and so, for a fixed $\omega\in\Omega$, there is a unique (not distributed) realization $\boldsymbol{\Theta}(\omega)=\boldsymbol{\theta}$ (less formally, there is the (inferred) true value $\boldsymbol{\theta}$ of $\boldsymbol{\Theta}$ that uniquely determines a probability distribution $P_{X_a|\boldsymbol{\Theta}}(\cdot|\boldsymbol{\theta})$ within the model $P_{X_a|\boldsymbol{\Theta}}(\cdot|\cdot)$; see also Remark 14, below). That is, the frequency interpretation that applies to the direct probability distribution $P_{X_a|\boldsymbol{\Theta}}(\cdot|\boldsymbol{\theta})$ cannot apply to the inverse probability distributions $P_{\boldsymbol{\Theta}|X_a}(\cdot|x_i)$, $i=a,b,\ldots$: independently of $x_i$, a fixed $D\in\mathcal{B}^m$ either covers or does not cover the true value $\boldsymbol{\theta}$, so that $\lim_{N\to\infty}[N_{\boldsymbol{\theta}\in D}/N]$ is either one or zero, which is not generally equal to $\tilde{P}_{\boldsymbol{\Theta}|X_i}(D|x_i)$.
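The contrast drawn above can be illustrated with a small simulation. The sketch below assumes, purely for illustration, a scalar normal model with unit scale and a half-line $B=(-\infty,\text{upper}]$; the function names, the seed, and the sample size are hypothetical choices. The relative frequency of draws falling in the fixed set $B$ approaches the direct probability, whereas a fixed set $D$ in the parameter space covers the true value either in every repetition or in none.

```python
import math
import random

def relative_frequency_in_set(theta, upper, n_draws, seed=0):
    """Fraction of N(theta, 1) draws that fall in B = (-inf, upper]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if rng.gauss(theta, 1.0) <= upper)
    return hits / n_draws

def direct_probability(theta, upper):
    """Direct probability of B under N(theta, 1), via the error function."""
    return 0.5 * (1.0 + math.erf((upper - theta) / math.sqrt(2.0)))

def coverage_of_fixed_set(theta, d_low, d_high, n_draws):
    """A fixed D = (d_low, d_high) does not depend on the draws x_i,
    so it covers the true value in all repetitions or in none."""
    return sum(1 for _ in range(n_draws) if d_low < theta < d_high) / n_draws

theta_true = 2.0
freq = relative_frequency_in_set(theta_true, upper=3.0, n_draws=20000, seed=1)
prob = direct_probability(theta_true, upper=3.0)
never = coverage_of_fixed_set(theta_true, 0.0, 1.0, n_draws=100)
always = coverage_of_fixed_set(theta_true, 1.0, 3.0, n_draws=100)
```

With 20000 draws the sampling error of `freq` is of order 0.003, so agreement with `prob` to within a few hundredths is expected; the exact value of `freq` depends on the seed.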
Still, when the inverse and the predictive distributions are calibrated, and Assumption 7 is fulfilled, in the limit $N\to\infty$ the probability distributions of the credible regions can be associated with frequencies. For every $\gamma\in[0,1]$, there is at least one algorithm to construct credible regions $D(x_i)\in\mathcal{B}^m$ (varying with $x_i$, not fixed) whose (fixed) inverse-probability content $\tilde{P}_{\boldsymbol{\Theta}|X_i}(D(x_i)|x_i)=\gamma$ coincides with the relative frequency, called the coverage, of the regions that cover the true value $\boldsymbol{\theta}$. (The consistency and normalization factors alone, on the other hand, cannot be associated with frequencies, just as, for example, the factor $(\sqrt{2\pi}\sigma)^{-1}$ of a density from the normal model does not represent any frequency distribution.)

Remark 14. The inferred true value $\boldsymbol{\theta}$ need not even be the same for all $x_i$. Suppose, for example, that the functions $X$ and $\boldsymbol{\Theta}$ (Assumption 1) have the same dimension, that they are both composed of $N$ components, $X=(X_a,X_b,\ldots)$ and $\boldsymbol{\Theta}=(\Theta_a,\Theta_b,\ldots)$, that the components $X_a|\Omega_{\boldsymbol{\theta}},X_b|\Omega_{\boldsymbol{\theta}},\ldots$ of $X|\Omega_{\boldsymbol{\theta}}$ are independent random variables, $F_{X|\boldsymbol{\Theta}}(x|\boldsymbol{\theta})=\prod_i F_{X_i|\boldsymbol{\Theta}}(x_i|\boldsymbol{\theta})$, $i=a,b,\ldots$, and that $F_{X_i|\boldsymbol{\Theta}}(x_i|\boldsymbol{\theta})=\Phi(x_i-\theta_i)$, where $x_i=X_i(\omega)$, $\boldsymbol{\theta}=(\theta_a,\theta_b,\ldots)=\boldsymbol{\Theta}(\omega)$, and $\omega\in\Omega$. (Consequently, the pivotal quantities $Y_i|\Omega_{\boldsymbol{\theta}}$, $Y_i=X_i-\Theta_i$, are independent random variables with identical distribution function $F_{Y_i|\boldsymbol{\Theta}}(y_i|\boldsymbol{\theta})=\Phi(y_i)$ for all $i=a,b,\ldots$.) Then, for the constant (right-Haar) consistency factors $\zeta_{\Theta_i}$, $i=a,b,\ldots$, and for the
credible regions $D(x_i)$ that are equivariant under translations, the (fixed) probability $\tilde{P}_{\Theta_i|X_i}(D(x_i)|x_i)=\gamma$ coincides with the coverage $\lim_{N\to\infty}[N_{\theta_i\in D(x_i)}/N]$ of the true values $\theta_i=\Theta_i(\omega)$, even though $\theta_i$ varies arbitrarily with $i$.

Berger (1980, Section 3.1, pp. 76–77) and Dempster (1966) argue that the difference between the direct (sampling) and the inverse (posterior) probability distributions is important only philosophically, but while marginalization of the direct probability distributions preserves their frequency interpretation, marginalization of calibrated inverse distributions does not in general yield calibrated marginal distributions. That is, the existence of multidimensional confidence regions does not of itself imply the corresponding existence of regions of lower order, equivalent to the elimination (marginalization) of irrelevant (nuisance) parameters (Bartlett, 1946). From an interpretational perspective, Section 3 thus serves as a caveat against the unreserved use of marginal distributions, even though, from a mathematical perspective, marginalization is perfectly consistent with the adopted axioms. The caveat is important because marginalization is considered the principal practical advantage of the Bayesian approach over other methods of inference (Cox and Hinkley, 2000, Section 2.4, p. 53 and Section 10.1, pp. 364–365; Robert, 2001, Section 4.1.3, p. 168; Jaynes, 2003, Preface, p. xxviii; Gelman et al., 2004, Chap. 3, pp. 73–74; Ghosh et al., 2006, Section 2.10, p. 52).

5. Application: initialization of a filter for diffuse Gauss-Markov stochastic processes

Consider a sequence $\boldsymbol{\Theta}^*_k=\{\boldsymbol{\Theta}_i\}_{i=1}^k$ of state vectors $\boldsymbol{\Theta}_i:\Omega\to\mathbb{R}^m$, and its realization $\boldsymbol{\theta}^*_k=\boldsymbol{\Theta}^*_k(\omega)=\{\boldsymbol{\theta}_i\}_{i=1}^k$, $\omega\in\Omega$. Consider, in addition, a sequence $X^*_k=\{X_i\}_{i=1}^k$ of output vectors $X_i:\Omega\to\mathbb{R}^n$, and its realization (a time series) $x^*_k=X^*_k(\omega)=\{x_i\}_{i=1}^k$.

Assumption 8.
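There is a continuous probability distribution $P_{\bar{\boldsymbol{\Theta}}^*_k,X^*_k|\boldsymbol{\Theta}_1}(\cdot|\boldsymbol{\theta}_1)$, $\bar{\boldsymbol{\Theta}}^*_k=\{\boldsymbol{\Theta}_i\}_{i=2}^k$, to which the frequency interpretation applies. Its probability density function is denoted by $f(\bar{\boldsymbol{\theta}}^*_k,x^*_k|\boldsymbol{\theta}_1)$, $\bar{\boldsymbol{\theta}}^*_k=\bar{\boldsymbol{\Theta}}^*_k(\omega)=\{\boldsymbol{\theta}_i\}_{i=2}^k$.

By the product rule, $f(\bar{\boldsymbol{\theta}}^*_k,x^*_k|\boldsymbol{\theta}_1)=f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1)f(x^*_k|\boldsymbol{\theta}^*_k)=f(x^*_k|\boldsymbol{\theta}_1)f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1,x^*_k)$, so that $f(x^*_k|\boldsymbol{\theta}_1)=\int_{\mathbb{R}^m}\cdots\int_{\mathbb{R}^m}f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1)f(x^*_k|\boldsymbol{\theta}^*_k)\,d^m\boldsymbol{\theta}_2\cdots d^m\boldsymbol{\theta}_k$ and $f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1,x^*_k)=f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1)f(x^*_k|\boldsymbol{\theta}^*_k)/f(x^*_k|\boldsymbol{\theta}_1)$. The densities $f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1)$ and $f(x^*_k|\boldsymbol{\theta}^*_k)$ further decompose as $f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1)=\prod_{j=2}^k f(\boldsymbol{\theta}_j|\boldsymbol{\theta}^*_{j-1})$ and $f(x^*_k|\boldsymbol{\theta}^*_k)=f(x_1|\boldsymbol{\theta}^*_k)\prod_{j=2}^k f(x_j|\boldsymbol{\theta}^*_k,x^*_{j-1})$.

Assumption 9 (Transition Equation). The sequence $\boldsymbol{\Theta}^*_k$ is a (discrete) linear Gauss-Markov stochastic process: for $j=2,\ldots,k$, $f(\boldsymbol{\theta}_j|\boldsymbol{\theta}^*_{j-1})$ is an $m$-dimensional normal density with mean $T_j\boldsymbol{\theta}_{j-1}$ and variance (symmetric, positive-definite, real $m\times m$ matrix) $V_j$, called dynamical noise, $f(\boldsymbol{\theta}_j|\boldsymbol{\theta}^*_{j-1})=N(\boldsymbol{\theta}_j;T_j\boldsymbol{\theta}_{j-1},V_j)$, where $T_j$ is an $m\times m$ transition matrix.

The stochastic process is called diffuse because the (prior) distribution of $\boldsymbol{\theta}_1$ (the (prior) density $f(\boldsymbol{\theta}_1)$) is not specified.

Assumption 10 (Observation Equation). For $j=2,\ldots,k$, $f(x_j|\boldsymbol{\theta}^*_k,x^*_{j-1})=N(x_j;H_j\boldsymbol{\theta}_j,U_j)$, and $f(x_1|\boldsymbol{\theta}^*_k)=N(x_1;H_1\boldsymbol{\theta}_1,U_1)$, where $U_1,\ldots,U_k$ are variance matrices, called measurement noise, and $H_1,\ldots,H_k$ are $n\times m$ real matrices.

Assumptions 8–10 imply $f(x^*_k|\boldsymbol{\theta}_1)=N(x^*_k;H^*_{k,1}\boldsymbol{\theta}_1,U^*_{k,1})$. In $N(x^*_k;H^*_{k,1}\boldsymbol{\theta}_1,U^*_{k,1})$, $x^*_k$ denotes a column of vectors $x_i$ of the time series $\{x_i\}_{i=1}^k$ (a $kn$-dimensional column-vector), $H^*_{k,1}$ is a $kn\times m$ matrix, and $U^*_{k,1}$ is a (symmetric and positive) $kn\times kn$ variance matrix (Appendix C).

Assumption 11. For $k\ge k_{\min}$, $k_{\min}\equiv\min\{k:\det(H^{*T}_{k,1}U^{*-1}_{k,1}H^*_{k,1})>0\}$, there is a continuous probability distribution $P_{\boldsymbol{\Theta}^*_k|X^*_k}(\cdot|x^*_k)$ whose probability density function is denoted by $f(\boldsymbol{\theta}^*_k|x^*_k)$. (The frequency interpretation is not assumed for $P_{\boldsymbol{\Theta}^*_k|X^*_k}(\cdot|x^*_k)$.)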
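The product rule implies $f(\boldsymbol{\theta}^*_k|x^*_k)=f(\boldsymbol{\theta}_1|x^*_k)f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1,x^*_k)$, so that $f(\boldsymbol{\theta}^*_k|x^*_k)=f(\boldsymbol{\theta}_1|x^*_k)f(\bar{\boldsymbol{\theta}}^*_k|\boldsymbol{\theta}_1)f(x^*_k|\boldsymbol{\theta}^*_k)/f(x^*_k|\boldsymbol{\theta}_1)$.

Assumption 12. The density $f(\boldsymbol{\theta}_1|x^*_k)$ is objective, continuous in $\boldsymbol{\theta}_1$, and factors according to Proposition 1: $f(\boldsymbol{\theta}_1|x^*_k)=\zeta(\boldsymbol{\theta}_1)f(x^*_k|\boldsymbol{\theta}_1)/\eta(x^*_k)$.

Assumption 12 implies $\zeta(\boldsymbol{\theta}_1)=1$. The proof of this statement invokes the invariance of $f(x^*_k|\boldsymbol{\theta}_1)$ under $l[(a_1,a_2),x^*_k]=a_2x^*_k+H^*_{k,1}a_1$ and $\bar{l}[(a_1,a_2),\boldsymbol{\theta}_1]=a_2\boldsymbol{\theta}_1+a_1$, $(a_1,a_2)\in\mathbb{R}^m\times\{-1,1\}$.

Remark 15 (Existence of $f(\boldsymbol{\theta}_1|x^*_k)$). Constant $\zeta(\boldsymbol{\theta}_1)$ means that $f(\boldsymbol{\theta}_1|x^*_k)$ is an ($m$-dimensional) normal density with mean $(H^{*T}_{k,1}U^{*-1}_{k,1}H^*_{k,1})^{-1}H^{*T}_{k,1}U^{*-1}_{k,1}x^*_k$ and variance $(H^{*T}_{k,1}U^{*-1}_{k,1}H^*_{k,1})^{-1}$. The condition $k\ge k_{\min}$ (Assumption 11) therefore assures that (a proper) $f(\boldsymbol{\theta}_1|x^*_k)$ exists. Because of $k_{\min}n\ge m$, for $m>n$ the condition implies $k>1$. Again, as in Example 2, it is not possible to make a probabilistic inference about a parameter when the dimension of the parameter is higher than the dimension of the corresponding sampling variable(s).

From Assumptions 8–12 it then readily follows that $f(\boldsymbol{\theta}_p|x^*_k)=\int_{\mathbb{R}^m}\cdots\int_{\mathbb{R}^m}f(\boldsymbol{\theta}^*_k|x^*_k)\,d^m\boldsymbol{\theta}_1\cdots d^m\boldsymbol{\theta}_{p-1}\,d^m\boldsymbol{\theta}_{p+1}\cdots d^m\boldsymbol{\theta}_k$, $p=1,\ldots,k$, is a normal density with mean $(\boldsymbol{\Sigma}_p^{-1}+\tilde{\boldsymbol{\Sigma}}_p^{-1})^{-1}(y_p+T_{p+1}^TV_{p+1}^{-1}\tilde{W}_{p+1}\tilde{y}_p)$, and variance $(\boldsymbol{\Sigma}_p^{-1}+\tilde{\boldsymbol{\Sigma}}_p^{-1})^{-1}$, where $\boldsymbol{\Sigma}_p^{-1}$ and $y_p$ are given by a forward recursion $\boldsymbol{\Sigma}_1^{-1}=H_1^TU_1^{-1}H_1$, $y_1=H_1^TU_1^{-1}x_1$, and
\[ W_{j-1}=(T_j^TV_j^{-1}T_j+\boldsymbol{\Sigma}_{j-1}^{-1})^{-1}, \tag{22} \]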
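\[ \bar{\boldsymbol{\Sigma}}_j^{-1}=V_j^{-1}-V_j^{-1}T_jW_{j-1}T_j^TV_j^{-1}, \tag{23} \]
\[ \boldsymbol{\Sigma}_j^{-1}=H_j^TU_j^{-1}H_j+\bar{\boldsymbol{\Sigma}}_j^{-1}, \tag{24} \]
\[ y_j=H_j^TU_j^{-1}x_j+V_j^{-1}T_jW_{j-1}y_{j-1}, \tag{25} \]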
for $j=2,\ldots,p$, whereas $\tilde{W}_{p+1}$, $\tilde{\boldsymbol{\Sigma}}_p^{-1}$, and $\tilde{y}_p$ are given by a backward recursion $\tilde{\boldsymbol{\Sigma}}_k^{-1}=0$, $\tilde{y}_k=0$ and $\tilde{W}_j=(\tilde{\boldsymbol{\Sigma}}_j^{-1}+V_j^{-1}+H_j^TU_j^{-1}H_j)^{-1}$, $\tilde{\boldsymbol{\Sigma}}_{j-1}^{-1}=T_j^T(V_j^{-1}-V_j^{-1}\tilde{W}_jV_j^{-1})T_j$, and $\tilde{y}_{j-1}=H_j^TU_j^{-1}x_j+T_{j+1}^TV_{j+1}^{-1}\tilde{W}_{j+1}\tilde{y}_j$, for $j=k,\ldots,p+1$. (The existence of $W_{j-1}$ (invertibility of $T_j^TV_j^{-1}T_j+\boldsymbol{\Sigma}_{j-1}^{-1}$) is a necessary condition for (the proper) $f(\boldsymbol{\theta}_k|x^*_k)$ to exist and is therefore implicit in Assumptions 8–12.)

For $p<k$, $f(\boldsymbol{\theta}_p|x^*_k)$ is called a smoother (cf. Catlin, 1989, Chapter 9, pp. 188–199), and $f(\boldsymbol{\theta}_k|x^*_k)=N(\boldsymbol{\theta}_k;r_k,\boldsymbol{\Sigma}_k)$, $r_k=\boldsymbol{\Sigma}_ky_k$ and $\boldsymbol{\Sigma}_k=(\boldsymbol{\Sigma}_k^{-1})^{-1}$, is called a filter for $\boldsymbol{\Theta}^*_k$.

For every $k\ge k_{\min}$, there is a probability density $f(r_k|\boldsymbol{\theta}_k)=N(r_k;\boldsymbol{\theta}_k,\boldsymbol{\Sigma}_k)$ for which the frequency interpretation applies (Appendix D). Obviously, $f(\boldsymbol{\theta}_k|x^*_k)=\zeta(\boldsymbol{\theta}_k)f(r_k|\boldsymbol{\theta}_k)/\eta(r_k)$ with $\zeta(\boldsymbol{\theta}_k)=\eta(r_k)=1$. Since the constant $\zeta(\boldsymbol{\theta}_k)$ is the element of the right-invariant Haar measure of the group $\mathbb{R}^m$ under which the density function $f(r_k|\boldsymbol{\theta}_k)$ is invariant, the filter $f(\boldsymbol{\theta}_k|x^*_k)$ is calibrated (Section 3), and because of the frequency interpretation of $f(r_k|\boldsymbol{\theta}_k)$ the filter can be associated with frequencies (Section 4).

By the Spectral Theorem, there is an orthogonal $m\times m$ matrix $O_k$ such that $O_k\boldsymbol{\Sigma}_kO_k^T=\operatorname{diag}(\Sigma_k^{(i)})_{i=1}^m$, and so $f(\boldsymbol{\mu}_k|x^*_k)=f(z_k|\boldsymbol{\mu}_k)=\prod_{i=1}^mN(z_k^{(i)};\mu_k^{(i)},\Sigma_k^{(i)})$, where $z_k^{(i)}$ and $\mu_k^{(i)}$ are the $i$-th components of the vectors $z_k\equiv O_kr_k$ and $\boldsymbol{\mu}_k\equiv O_k\boldsymbol{\theta}_k$. The filter $f(\boldsymbol{\mu}_k|x^*_k)$ is calibrated on ($m$-dimensional) rectangles whose sides are parallel to $\mu_k^{(i)}$ (Section 3). Consequently, the marginal filters $f(\mu_k^{(i)}|x^*_k)=\int_{\mathbb{R}^{m-1}}f(\boldsymbol{\mu}_k|x^*_k)\,d\mu_k^{(1)}\cdots d\mu_k^{(i-1)}\,d\mu_k^{(i+1)}\cdots d\mu_k^{(m)}=N(z_k^{(i)};\mu_k^{(i)},\Sigma_k^{(i)})$ are also calibrated and can be associated with frequencies. The original filter $f(\boldsymbol{\theta}_k|x^*_k)$, on the other hand, is calibrated on the images of the rectangles under $O_k$ (i.e., on rotated rectangles whose sides are not necessarily parallel to the components $\theta_k^{(i)}$ of $\boldsymbol{\theta}_k$), so that the marginal filters $f(\theta_k^{(i)}|x^*_k)$ are not generally calibrated and therefore cannot be associated with frequencies.
Remark 16 (Kalman filter). For $j-1\ge k_{\min}$, Eq. (23) reduces to
\[ \bar{\boldsymbol{\Sigma}}_j=V_j+T_j\boldsymbol{\Sigma}_{j-1}T_j^T, \tag{26} \]
whereas for $j\ge\max\{2,k_{\min}\}$ Eq. (25) reduces to
\[ r_j=\boldsymbol{\Sigma}_jy_j=T_jr_{j-1}+K_j(x_j-H_jT_jr_{j-1}), \tag{27} \]
where
\[ K_j=\boldsymbol{\Sigma}_jH_j^TU_j^{-1} \tag{28} \]
is the Kalman gain. Eqs. (24), (26), (27), and (28) are called the Kalman filter (cf. Brown and Hwang, 1992, Eqs. (6.2.6), (5.4.25), (5.4.8), and (6.2.7)). The (non-diffuse) Kalman filter is initialized by specifying the mean and the variance of the distribution of the initial state vector $\boldsymbol{\theta}_1$ (Kalman, 1960; Brown and Hwang, 1992, Section 5.4, p. 232), which is equivalent to specifying an unconditional (prior) density $f_B(\boldsymbol{\theta}_1)$ (Cox, 1964; Catlin, 1989, Section 7.3, p. 140). With the initial variance $V_1$ being specified, $\boldsymbol{\Sigma}_1^{-1}=H_1^TU_1^{-1}H_1$ is replaced by $\boldsymbol{\Sigma}_1^{-1}=V_1^{-1}+H_1^TU_1^{-1}H_1$. This means that $\boldsymbol{\Sigma}_1^{-1}$ is invertible, so that the filter (Eqs. (26)–(28)) applies for all $j\ge 2$ ($k_{\min}=1$). The requirement that $f_B(\boldsymbol{\theta}_1)$ must be specified, however, reduces applicability of the Kalman filter and therefore represents a drawback for the filter (Makridakis, 1976, III.2, pp. 49–50).

Remark 17 (Initialization of the Kalman filter with diffuse initial conditions). Often, the Kalman filter with unspecified (or diffuse) initial conditions is initialized by an infinite $\boldsymbol{\Sigma}_1$ (Anderson and Moore, 1979, Section 6.3, p. 140; Simon, 2006, Section 5.1, p. 127). Then, the next step in the recursion yields the variance matrix $\boldsymbol{\Sigma}_2=[(V_2+T_2\boldsymbol{\Sigma}_1T_2^T)^{-1}+H_2^TU_2^{-1}H_2]^{-1}$, but for the infinite $\boldsymbol{\Sigma}_1$, $V_2+T_2\boldsymbol{\Sigma}_1T_2^T$ is not invertible. (Switching to the information filter and setting the information matrix $\boldsymbol{\Sigma}_1^{-1}$ to zero (Anderson and Moore, 1979, Section 6.3, pp. 139–141; Simon, 2006, Section 6.2, pp. 156–157) does not solve this problem because the Kalman and the information filters are algebraically equivalent.) An infinite initial variance is analogous to an improper (prior) density $f_B(\boldsymbol{\theta}_1)$ in the Bayesian framework.
Pole and West (1989), for example, choose $f_B(\boldsymbol{\theta}_1)=1$ and obtain the recursion (22)–(25) for $\boldsymbol{\Sigma}_j^{-1}$ and $y_j$, but for $m>n$ their algorithm invokes improper posterior distributions which do not conform to the Kolmogorov axiomatization, not even to the generalized one (Section 1). Catlin (1989, Section 7.4, pp. 149–160; 1993) initializes the Kalman filter with the (generalized) Fisher estimation, but he does not associate the initialized filter with frequencies (i.e., the initialized filter lacks statistical interpretation; Catlin, 1989, Section 7.4, p. 151). Bell and Hillmer (1991), Gómez and Maravall (1994), and Koopman (1997) consider the approach to the initialization by Ansley and Kohn (1985) the most general and rigorous. Within this approach, however, the initial state vector is only partially diffuse (the approach incorporates additional assumptions about the initial conditions; Gómez and Maravall, 1993).
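The relation between the forward recursion (22)–(25) and its Kalman form (24), (26)–(28) can be illustrated in the scalar case. The sketch below assumes $m=n=1$ with time-independent $T$, $V$, $H$, $U$ and $H\neq 0$ (so that $k_{\min}=1$); these simplifications, the function names, and the numbers are assumptions made for brevity. Both recursions start from the diffuse first step $\Sigma_1^{-1}=H^2/U$, $y_1=Hx_1/U$, without any prior for the initial state.

```python
def information_recursion(xs, T, V, H, U):
    """Forward recursion (22)-(25) for m = n = 1; returns [(Sigma_j, r_j), ...]."""
    inv_sigma = H * H / U          # Sigma_1^{-1} = H^T U^{-1} H
    y = H * xs[0] / U              # y_1 = H^T U^{-1} x_1
    out = [(1.0 / inv_sigma, y / inv_sigma)]
    for x in xs[1:]:
        w = 1.0 / (T * T / V + inv_sigma)                  # Eq. (22)
        inv_sigma_bar = 1.0 / V - (T * T / (V * V)) * w    # Eq. (23)
        y = H * x / U + (T / V) * w * y                    # Eq. (25), uses y_{j-1}
        inv_sigma = H * H / U + inv_sigma_bar              # Eq. (24)
        out.append((1.0 / inv_sigma, y / inv_sigma))       # r_j = Sigma_j y_j
    return out

def kalman_recursion(xs, T, V, H, U):
    """Kalman form (24), (26)-(28), started from the same diffuse first step."""
    sigma = U / (H * H)
    r = xs[0] / H
    out = [(sigma, r)]
    for x in xs[1:]:
        sigma_bar = V + T * sigma * T                      # Eq. (26)
        sigma = 1.0 / (H * H / U + 1.0 / sigma_bar)        # Eq. (24)
        gain = sigma * H / U                               # Eq. (28)
        r = T * r + gain * (x - H * T * r)                 # Eq. (27)
        out.append((sigma, r))
    return out

series = [1.0, 0.4, -0.3, 2.2]
info = information_recursion(series, T=0.9, V=0.5, H=2.0, U=0.25)
kalman = kalman_recursion(series, T=0.9, V=0.5, H=2.0, U=0.25)
max_gap = max(abs(a - b) for pair_i, pair_k in zip(info, kalman)
              for a, b in zip(pair_i, pair_k))
```

In this scalar setting the two recursions agree up to rounding. For $m>n$ the first steps of (22)–(25) must be carried out in information form, because $\boldsymbol{\Sigma}_j^{-1}$ is not yet invertible for $j<k_{\min}$; that case is not covered by this sketch.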
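Acknowledgments

We are deeply indebted to professors Frank Coolen of Durham University, Gabrijel Kernel of Ljubljana University, Louis Lyons of Oxford University, and Črtomir Zupančič of Ludwig-Maximilian-University, Munich, for their support. We would like to express our appreciation to the Executive editors, to the Associate editor, and to the two anonymous referees, for their criticism and suggestions that led to a substantial improvement of this article.

Appendix A. Transformations of probability distributions

Consider a model $P_{X|\boldsymbol{\Theta}}$ with the density $f_{X|\boldsymbol{\Theta}}$. Let a function $(\bar{s},s)$ on $S_{\boldsymbol{\Theta}}\times\mathbb{R}^n$ be one-to-one, let $(\boldsymbol{\Lambda},Y)=(\bar{s}\circ\boldsymbol{\Theta},s\circ X)$, and let the Jacobian $\det[\partial_ys^{-1}(y)]$ be finite and non-vanishing for all $y$. Since $\Omega_{\boldsymbol{\Lambda}=\boldsymbol{\lambda}}=\Omega_{\boldsymbol{\Theta}=\bar{s}^{-1}(\boldsymbol{\lambda})}$, the probability space $(\Omega_{\boldsymbol{\Theta}=\bar{s}^{-1}(\boldsymbol{\lambda})},\Sigma_{\boldsymbol{\Theta}=\bar{s}^{-1}(\boldsymbol{\lambda})},P_{\boldsymbol{\Theta}=\bar{s}^{-1}(\boldsymbol{\lambda})})$ coincides with the space $(\Omega_{\boldsymbol{\Lambda}=\boldsymbol{\lambda}},\Sigma_{\boldsymbol{\Lambda}=\boldsymbol{\lambda}},P_{\boldsymbol{\Lambda}=\boldsymbol{\lambda}})$, and because of the common probability space underlying $P_{X|\boldsymbol{\Theta}}(\cdot|\bar{s}^{-1}(\boldsymbol{\lambda}))$ and $P_{Y|\boldsymbol{\Lambda}}(\cdot|\boldsymbol{\lambda})$,
\[ f_{Y|\boldsymbol{\Lambda}}(y|\boldsymbol{\lambda})=f_{X|\boldsymbol{\Theta}}(s^{-1}(y)|\bar{s}^{-1}(\boldsymbol{\lambda}))\,\bigl|\det[\partial_ys^{-1}(y)]\bigr| \tag{A.1} \]
must hold. For scalar variables $X|\Omega_{\boldsymbol{\Theta}=\bar{s}^{-1}(\boldsymbol{\lambda})}$ and $Y|\Omega_{\boldsymbol{\Lambda}=\boldsymbol{\lambda}}$,
\[ F_{Y|\boldsymbol{\Lambda}}(y|\boldsymbol{\lambda})=\begin{cases}F_{X|\boldsymbol{\Theta}}(s^{-1}(y)|\bar{s}^{-1}(\boldsymbol{\lambda})); & [s^{-1}(y)]'>0\\ 1-F_{X|\boldsymbol{\Theta}}(s^{-1}(y)|\bar{s}^{-1}(\boldsymbol{\lambda})); & [s^{-1}(y)]'<0\end{cases} \tag{A.2} \]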
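must also hold. From a mathematical perspective, the inverse and the direct probability distributions are equivalent objects, and so the transformation of the inverse densities under the change of variables is analogous to (A.1),
\[ f_{\boldsymbol{\Lambda}|Y}(\boldsymbol{\lambda}|y)=f_{\boldsymbol{\Theta}|X}(\bar{s}^{-1}(\boldsymbol{\lambda})|s^{-1}(y))\,\bigl|\det[\partial_{\boldsymbol{\lambda}}\bar{s}^{-1}(\boldsymbol{\lambda})]\bigr|. \tag{A.3} \]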
Eq. (14) follows from combining (A.1) and (A.3) with Eq. (9). Let F X9H be the distribution function of a model PX9H , let G ¼ fa,b, . . .g be a group whose unit element is denoted by e, and let l be the left action of G on Rn , satisfying lðe,xÞ ¼ x and lðaJb,xÞ ¼ l½a,lðb,xÞ, for a,b 2 G and x 2 Rn . In addition, suppose that for every a 2 G and every x 2 Rn , there is an induced left action lða, hÞ of G on SH , so that F Y9K ðx9hÞ ¼ F X9H ðx9hÞ, where Y ¼ lJX and K ¼ l JH. The model P X9H is then said to be invariant under the left action of the group G, or Ginvariant. For continuous distributions from the G-invariant model with lða,xÞ differentiable in x and with the nonvanishing Jacobian det½@x lða1 ,xÞ, the invariance further implies that f X9H ðx9hÞ ¼ f X9H ðlða1 ,xÞ9lða1 , hÞÞ9det½@x lða1 ,xÞ9. Appendix B. Proofs of propositions Proof of Proposition 1. From the product rule for f ðxa ,xb ; hÞ it follows immediately that f ðxa 9xb ; hÞ ¼ f ðxb 9xa ; hÞf ðxa ; hÞ=f ðxb ; hÞ. This, combined with Eq. (7), yields kðxa , hÞ=kðxb , hÞ ¼ hðxa ,xb Þ, where kðxaðbÞ , hÞ ¼ f ðh; xaðbÞ Þ=f ðxaðbÞ ; hÞ and hðxa ,xb Þ ¼ f ðxb ; xa Þ=f ðxa ; xb Þ. Hence, hðxa ,xb Þ factors as hðxa ,xb Þ ¼ ZXa ðxb Þ=ZXa ðxa Þ, so that ZXa ðxa Þk ðxa , hÞ ¼ ZXa ðxb Þkðxb , hÞ, further implying that ZXa ðxaðbÞ ÞkðxaðbÞ , hÞ may only vary with h, ZXa ðxaðbÞ ÞkðxaðbÞ , hÞ ¼ zH ðhÞ, which proves Eq. (9). Eq. (10) is proved similarly, by applying (8) instead of (7). & Proof of Proposition 2. For a continuous scalar random variable X with a probability distribution drawn from a model P X9Y that is invariant under a one-dimensional Lie group G, Eq. (A.2) reduces to 8 < Fðlða1 ,xÞ9lða1 , yÞÞ; @x lða,xÞ 4 0 : ðB:1Þ F X9Y ðx9yÞ ¼ Fðx9yÞ ¼ : 1Fðlða1 ,xÞ9lða1 , yÞÞ; @x lða,xÞ o 0 For x and y with non-vanishing derivatives (18) we define functions s(x) and sðyÞ such that ½s0 ðxÞ1 ¼ @a lða1 ,xÞ9a ¼ e and ½s 0 ðyÞ1 ¼ @a lða1 , yÞ9a ¼ e . 
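Differentiating (B.1) with respect to $a$ and setting $a = e$ then yields the equation
\[
\partial_x F(x|\theta)\,\partial_\theta H(x,\theta) - \partial_\theta F(x|\theta)\,\partial_x H(x,\theta) = 0,
\]
where $H(x,\theta) = s(x) - \bar{s}(\theta)$. The general solution of the equation is $F(x|\theta) = \Phi[H(x,\theta)] = \Phi[s(x) - \bar{s}(\theta)]$. Then, by Eq. (A.2),
\[
F_{Y|\Lambda}(y|\mu) =
\begin{cases}
\Phi(y-\mu); & [s^{-1}(y)]' > 0 \\
\tilde{\Phi}(y-\mu); & [s^{-1}(y)]' < 0
\end{cases} ,
\]
where $Y = s \circ X$, $\Lambda = \bar{s} \circ \Theta$, and $\tilde{\Phi}(y-\mu) = 1 - \Phi(y-\mu)$. That is, $P_{Y|\Lambda}$ is a location model (a restricted location-scale model). $\square$

Proof of Proposition 3. First, given a location-scale model, the functional Eq. (15) for the consistency factor $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma)$ reduces to
\[
\zeta_{\boldsymbol{\Theta}}(\mu,\sigma) = h(a_1,a_2)\,\zeta_{\boldsymbol{\Theta}}\!\left(\frac{\mu - a_1}{a_2}, \frac{\sigma}{a_2}\right); \qquad \mu, a_1 \in \mathbb{R}, \quad \sigma, a_2 \in \mathbb{R}^{+} .
\tag{B.2}
\]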
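For $a_1 = \mu$ and $a_2 = \sigma$, (B.2) yields $h(\mu,\sigma) = \zeta_{\boldsymbol{\Theta}}(\mu,\sigma) / \zeta_{\boldsymbol{\Theta}}(0,1)$, while setting $a_1 = \mu$ and $a_2 = 1$ reveals that $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma)$ factors as
\[
\zeta_{\boldsymbol{\Theta}}(\mu,\sigma) = \frac{\zeta_{\boldsymbol{\Theta}}(\mu,1)\,\zeta_{\boldsymbol{\Theta}}(0,\sigma)}{\zeta_{\boldsymbol{\Theta}}(0,1)} .
\tag{B.3}
\]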
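For $a_1 = 0$ and $a_2 = \sigma$, (B.2) thus reduces to $\zeta_{\boldsymbol{\Theta}}(\mu,1) = \zeta_{\boldsymbol{\Theta}}(\mu/\sigma,1)$. Hence, $\zeta_{\boldsymbol{\Theta}}(\mu,1) = C^{\pm}$ for $\mu \gtrless 0$, and so, according to (B.3), $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma) = C^{\pm}\,\zeta_{\boldsymbol{\Theta}}(0,\sigma) / \zeta_{\boldsymbol{\Theta}}(0,1)$. Then, for $a_2 = 1$, $\mu, a_1 > 0$, and $\mu > a_1$, (B.2) further implies $\zeta_{\boldsymbol{\Theta}}(0,1) = C^{+}$, whereas for $a_2 = 1$, $\mu, a_1 < 0$, and $\mu < a_1$ it implies $\zeta_{\boldsymbol{\Theta}}(0,1) = C^{-}$, so that $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma) = \zeta_{\boldsymbol{\Theta}}(0,\sigma)$. Exchange now indices 1 and 2 in Eq. (11) and set $\boldsymbol{\theta} = (\mu,\sigma)$, $\theta_1 = \mu$, and $\theta_2 = \sigma$, so that $\zeta_{\boldsymbol{\Theta}}(0,\sigma) = \zeta_{\boldsymbol{\Theta}|\sigma}(\mu)\,\xi(\sigma)$ must hold, implying constant $\zeta_{\boldsymbol{\Theta}|\sigma}(\mu)$.

Second, every restriction $P_{X_1;\boldsymbol{\Theta}|\mu}$ of a location-scale model $P_{X_1;\boldsymbol{\Theta}}$ is reducible to the model $P_{Y_1|\Theta_2}$ with the density $f_{Y_1|\Theta_2}(y_1|\sigma) = \sigma^{-1}\phi(y_1/\sigma)$ (Example 5). For $y_1 \gtrless 0$, the inverse density $f_{\Theta_2|Y_1}$ reads
\[
f_{\Theta_2|Y_1}(\sigma|y_1) = \frac{\zeta_{\Theta_2}(\sigma)}{\eta_{Y_1}(y_1)}\, f_{Y_1|\Theta_2}(y_1|\sigma) = \frac{\zeta_{\Theta_2}(\sigma)}{\eta^{\pm}_{Y_1}(y_1)}\, f^{\pm}_{Y_1|\Theta_2}(y_1|\sigma),
\]
where $\eta^{\pm}_{Y_1}(y_1) = \eta_{Y_1}(y_1)/c^{\pm}$ and $\zeta_{\Theta_2}(\sigma) = \zeta_{\boldsymbol{\Theta}|\mu}(\sigma)$. In addition,
\[
f^{\pm}_{\Lambda_1|\Lambda_2,Z_1}(\lambda_1|\lambda_2 = 1, z_1) = \frac{\zeta^{\pm}_{\boldsymbol{\Lambda}|\lambda_2=1}(\lambda_1)}{\eta^{\pm}_{Z_1}(z_1)}\, f^{\pm}_{Z_1|\boldsymbol{\Lambda}|\lambda_2=1}(z_1|\lambda_1),
\]
where $\eta^{\pm}_{Z_1}(z_1) = \exp\{z_1\}\,\eta^{\pm}_{Y_1}(\pm\exp\{z_1\})$, $\zeta^{\pm}_{\boldsymbol{\Lambda}|\lambda_2=1}(\lambda_1) = \exp\{\lambda_1\}\,\zeta_{\Theta_2}(\exp\{\lambda_1\})$, and $f^{\pm}_{Z_1|\boldsymbol{\Lambda}|\lambda_2=1}(z_1|\lambda_1) = \exp\{z_1\}\, f^{\pm}_{Y_1|\Theta_2}(\pm\exp\{z_1\}\,|\,\exp\{\lambda_1\})$. Since $\boldsymbol{\lambda} = (\lambda_1,\lambda_2)$ is a realization of a location-scale parameter $\boldsymbol{\Lambda} = (\Lambda_1,\Lambda_2)$, and we assumed that the objective inverse densities $f^{\pm}_{\boldsymbol{\Lambda}|Z}$ exist, $\zeta^{\pm}_{\boldsymbol{\Lambda}|\lambda_2=1}(\lambda_1) = \zeta^{\pm}_{\boldsymbol{\Lambda}|\lambda_2}(\lambda_1) = 1$ by the first part of the proof, and so $\zeta_{\Theta_2}(\sigma) = \zeta_{\boldsymbol{\Theta}|\mu}(\sigma) = \sigma^{-1}$.

Third, set again $\boldsymbol{\theta} = (\mu,\sigma)$, $\theta_1 = \mu$, and $\theta_2 = \sigma$ in Eq. (11), this time without changing indices 1 and 2. Then, according to the identities $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma) = \zeta_{\boldsymbol{\Theta}}(0,\sigma)$ and $\zeta_{\boldsymbol{\Theta}|\mu}(\sigma) = \sigma^{-1}$, Eq. (11) reduces to $\zeta_{\boldsymbol{\Theta}}(0,\sigma) = \sigma^{-1}\xi(\mu)$, finally implying $\zeta_{\boldsymbol{\Theta}}(\mu,\sigma) = \sigma^{-1}$. $\square$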
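As a sanity check on the result of Proposition 3, the short Python sketch below evaluates the residual of the functional equation (B.2) for the factor $\sigma^{-1}$, with $h(a_1,a_2)$ taken as $\zeta(a_1,a_2)/\zeta(0,1)$ as obtained at the start of the proof. The function names and the counterexample $1/(1+\sigma)$ are illustrative assumptions. The check only confirms that $\sigma^{-1}$ solves (B.2); it does not establish uniqueness, which in the proof relies on Eq. (11).

```python
import random

def zeta_inverse_sigma(mu, sigma):
    """Candidate consistency factor 1/sigma (the result of Proposition 3)."""
    return 1.0 / sigma

def zeta_counterexample(mu, sigma):
    """Hypothetical factor 1/(1 + sigma), used only to show a nonzero residual."""
    return 1.0 / (1.0 + sigma)

def b2_residual(zeta, mu, sigma, a1, a2):
    """Left side minus right side of (B.2), with h(a1, a2) = zeta(a1, a2) / zeta(0, 1)."""
    h = zeta(a1, a2) / zeta(0.0, 1.0)
    return zeta(mu, sigma) - h * zeta((mu - a1) / a2, sigma / a2)

random.seed(0)
max_residual = max(
    abs(b2_residual(zeta_inverse_sigma,
                    random.uniform(-5.0, 5.0), random.uniform(0.1, 5.0),
                    random.uniform(-5.0, 5.0), random.uniform(0.1, 5.0)))
    for _ in range(1000)
)
print("largest |residual| for 1/sigma:", max_residual)
print("residual for 1/(1 + sigma):", b2_residual(zeta_counterexample, 0.0, 1.0, 0.0, 2.0))
```

Any power of $\sigma$ would also pass this residual test, which is why the argument above needs the additional constraint from Eq. (11) to fix the exponent.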
Appendix C. The probability density function f ðxnk 9h1 Þ For k¼1, the parameters of f ðxnk 9h1 Þ ¼ Nðx1 ; Hn1;1 h1 ,U n1;1 Þ are Hn1;1 ¼ H1 and U n1;1 ¼ U 1 by Assumption 10. For k 4 1, the parameters are given by a (backward) recursion that comprises ðkjÞn ðkjÞn block diagonal (symmetric and positivedefinite) matrices U nk,j ¼ diagðU j ,M k,j þ 1 Þ, ðkj þ 1Þn m vertical block matrices Hnk,j ¼ vertðHj ,M k,j þ 1 U nk,j þ 1 1Hnk,j þ 1 nT n1 n 1 1 1 ½Hk,j V j þ 1 T j þ 1 Þ, and (symmetric and positive-definite) ðkj þ1Þn ðkj þ 1Þn matrices M k,j ¼ U nk,j þ þ 1 U k,j þ 1 H k,j þ 1 þ V j þ 1 n T Hnk,j V j Hk,j . The recursion is initialized by Hnk,k ¼ Hk and U nk,k ¼ U k . Appendix D. The frequency interpretation of f ðrk 9hk Þ, k Zkmin Suppose first that k¼1. By Assumption 8, the frequency interpretation applies for f ðx1 9h1 Þ. The density f ðr1 9h1 Þ ¼ Nðr1 ; h1 ,R1 Þ is obtained from f ðx1 9h1 Þ by a change of variable and (possibly) by a marginalization which both preserve the frequency interpretation. The frequency interpretation applies also for f ðh nk ,xnk 9h1 Þ, k 41. The density can be Q T 1 expressed as f ðh nk ,xnk 9h1 Þ ¼ aðxnk ÞNðwk ; 0,Rk Þ k1 þ 1 hj þ 1 þ yj Þhj for j ¼ 1, . . . ,k1, j ¼ 1 Nðwj ; 0,W j Þ, where wj ¼ W j ðT j þ 1 V j Q k n n n n wk ¼ rk hk , and aðxk Þ is a function of xk only. Consequently, f ðwk 9h1 Þ ¼ Nðwk ; 0,Rk Þ k1 j ¼ 1 Nðwj ; 0,W j Þ, wk ¼ fwi gi ¼ 1 , is obtained from f ðh nk ,xnk 9h1 Þ by a change of variables and (possibly) by a marginalization, and f ðwk 9h1 Þ ¼ Nðwk ; 0,Rk Þ is obtained from f ðwnk 9h1 Þ by a further marginalization, so that the frequency interpretation applies also to f ðwk 9h1 Þ. Since f ðwk 9h1 Þ does not vary with h1 , we can define f ðrk 9hk Þ Nðrk ; hk ,Rk Þ for which, again, the frequency interpretation applies. References Akaike, H., 1980. The interpretation of improper prior distributions as limits of data dependent proper prior distributions. J. Roy. Statist. Soc. Ser. B 42, 46–52. 
Anderson, B.D.O., Moore, J.B., 1979. Optimal Filtering. Information and System Science Series. Prentice-Hall, Inc., Englewood Cliffs, N.J.
Ansley, C.F., Kohn, R., 1985. Estimation, filtering, and smoothing in state space models with incompletely specified initial conditions. Ann. Statist. 13, 1286–1316.
Bartlett, M.S., 1939. Complete simultaneous fiducial distributions. Ann. Math. Statist. 10, 129–138.
Bartlett, M.S., 1946. A general class of confidence interval. Nature 158, 521.
Bell, W., Hillmer, S.C., 1991. Initializing the Kalman filter for nonstationary time series models. J. Time Ser. Anal. 12, 283–300.
Berger, J.O., 1980. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag.
Bernardo, J.M., Smith, A.F.M., 1994. Bayesian Theory. John Wiley & Sons Inc.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26, 211–252.
Brown, R.G., Hwang, P.Y.C., 1992. Introduction to Random Signals and Applied Kalman Filtering. John Wiley & Sons Inc.
Catlin, D.E., 1989. Estimation, Control, and the Discrete Kalman Filter. Applied Mathematical Sciences. Springer-Verlag.
Catlin, D.E., 1993. Fisher initialization in the presence of ill-conditioned measurements. In: Chen, G. (Ed.), Approximate Kalman Filtering. Vol. 2 of Series in Approximations and Decompositions. World Scientific, pp. 23–38.
Chang, T., Villegas, C., 1986. On a theorem of Stein relating Bayesian and classical inferences in group models. Canad. J. Statist. 14, 289–296.
Cox, D.R., Hinkley, D.V., 2000. Theoretical Statistics. Chapman & Hall/CRC.
Cox, H., 1964. On the estimation of state variables and parameters for noisy dynamic systems. IEEE Trans. Autom. Contr. 9, 5–12.
Dawid, A.P., Stone, M., Zidek, J.V., 1973. Marginalization paradoxes in Bayesian and structural inference. J. Roy. Statist. Soc. Ser. B 35, 189–233.
DeGroot, M.H., 2004. Optimal Statistical Decisions. Wiley Classic Titles. John Wiley & Sons Inc.
Dempster, A.P., 1966. New methods for reasoning towards posterior distributions based on sample data. Ann. Math. Statist. 37, 355–374.
Fisher, R.A., 1930. Inverse probability. Proc. Cambridge Philos. Soc. 26, 528–535.
Florens, J.-P., Mouchart, M., Rolin, J.-M., 1990. Elements of Bayesian Statistics. Marcel Dekker Inc.
Fraser, D.A.S., Reid, N., Marras, E., Yi, G.Y., 2010. Default prior for Bayesian and frequentist inference. J. Roy. Statist. Soc. Ser. B 72, 631–654.
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 2004. Bayesian Data Analysis, 2nd Edition. Chapman & Hall/CRC.
Ghosh, J.K., Delampady, M., Samanta, T., 2006. An Introduction to Bayesian Analysis. Springer Texts in Statistics. Springer.
Gómez, V., Maravall, A., 1993. Initializing the Kalman filter with incompletely specified initial conditions. In: Chen, G. (Ed.), Approximate Kalman Filtering. Vol. 2 of Series in Approximations and Decompositions. World Scientific, pp. 39–62.
Gómez, V., Maravall, A., 1994. Estimation, prediction, and interpolation for nonstationary series with the Kalman filter. J. Amer. Statist. Assoc. 89, 611–624.
Hartigan, J.A., 1964. Invariant prior distributions. Ann. Math. Statist. 35, 836–845.
Hartigan, J.A., 1983. Bayes Theory. Springer Series in Statistics. Springer-Verlag.
Jaynes, E.T., 2003. Probability Theory – The Logic of Science. Cambridge University Press.
Kadane, J.B., Schervish, M.J., Seidenfeld, T., 1986. Statistical implications of finitely additive probability. In: Goel, P.K., Zellner, A. (Eds.), Bayesian Inference and Decision Techniques. Elsevier Science Publishers, Amsterdam, pp. 59–76.
Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. J. Bas. Engr. 82, 35–45.
Kass, R.E., Wasserman, L., 1996. The selection of prior distributions by formal rules. J. Amer. Statist. Assoc. 91, 1343–1370.
Kolmogorov, A.N., 1933. Grundbegriffe der Wahrscheinlichkeitsrechnung. Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 2, Nr. 3. Springer, Berlin.
Koop, G., Poirier, D.J., Tobias, J.L., 2007. Bayesian Econometric Methods. Vol. 7 of Econometric Exercises. Cambridge University Press.
Koopman, S.J., 1997. Exact initial Kalman filtering and smoothing for nonstationary time series models. J. Amer. Statist. Assoc. 92, 1630–1638.
Lindley, D.V., 1965. Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 2 – Inference. Cambridge University Press.
Makridakis, S., 1976. A Survey of Time Series. Int. Statist. Rev. 44, 29–70.
McCullagh, P., 1992. Conditional inference and Cauchy models. Biometrika 79, 247–259.
O'Hagan, A., 1994. Bayesian Inference. Vol. 2B of Kendall's Advanced Theory of Statistics. Arnold, London.
Paris, J.B., 1994. The Uncertain Reasoner's Companion - A Mathematical Perspective. Cambridge Tracts in Theoretical Computer Science 39. Cambridge University Press.
Pole, A., West, M., 1989. Reference analysis of the dynamic linear model. J. Time Ser. Anal. 10, 131–147.
Rao, M.M., 1993. Conditional Measures and Applications. Marcel Dekker Inc.
Rényi, A., 1970. Foundations of Probability. Holden-Day Inc., San Francisco.
Robbins, H., 1956. An empirical Bayes approach to statistics. In: Neyman, J. (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley, pp. 157–163.
Robbins, H., 1964. The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35, 1–20.
Robert, C.P., 2001. The Bayesian Choice, 2nd Edition. Springer Texts in Statistics. Springer.
Simon, D., 2006. Optimal State Estimation. John Wiley & Sons Inc., Hoboken, N.J.
Stein, C., 1965. Approximation of improper prior measures by prior probability measures. In: LeCam, L.M., Neyman, J. (Eds.), Bernoulli–Bayes–Laplace Anniversary Volume: Proceedings of an International Research Seminar. Statistical Laboratory, University of California, Berkeley, 1963. Springer-Verlag, pp. 217–240.
Stone, M., 1976. Strong inconsistency from uniform priors. J. Amer. Statist. Assoc. 71, 114–116.
Stone, M., Dawid, P., 1972. Un-Bayesian implications of improper Bayes inference in routine statistical problems. Biometrika 59, 369–375.
van der Vaart, A.W., 1998. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Villegas, C., 1977. Inner statistical inference. J. Amer. Statist. Assoc. 72, 453–458.
Wasserman, L., 2000. Asymptotic inference for mixture models using data dependent priors. J. Roy. Statist. Soc. Ser. B 62, 159–180.