On the Bingham distribution with large dimension

Journal of Multivariate Analysis 124 (2014) 345–352
A. Kume∗, S.G. Walker

Institute of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, CT2 7NF, UK
Department of Mathematics, and Division of Statistics and Scientific Computation, University of Texas at Austin, TX, United States

Article history: Received 15 March 2013; available online 11 November 2013.
AMS subject classifications: 60E05; 62H11.

Abstract. In this paper, we investigate the Bingham distribution when the dimension p is large. Our approach is to use a series expansion of the distribution from which truncation points can be determined, yielding particular error bounds. A point of comparison with the approach of Dryden (2005) is highlighted. © 2013 Published by Elsevier Inc.

Keywords: Bingham distribution; Dirichlet distribution; Mixture model; Series expansion

1. Introduction

The Bingham distribution is constructed by constraining a multivariate normal distribution to lie on the surface of a sphere of unit radius. Hence, if x = (x_0, x_1, …, x_p) is distributed according to such a distribution then the norm of x is 1, so x² = (x_0², x_1², …, x_p²) lies on the simplex, and we will write s_i = x_i² for each i. To elaborate, the Bingham distribution is obtained via a multivariate normal random vector in R^{p+1} constrained to lie on the unit sphere S^p. In particular, for a given symmetric matrix Σ of dimension (p+1) × (p+1), the density with respect to the uniform measure dS^p(x) on S^p is given as

f(x; Σ) = C(Σ)^{-1} exp(x⊤Σx),

where C(Σ) is the corresponding normalizing constant and x ∈ R^{p+1} with x⊤x = 1. Note that f(x; Σ) is the conditional distribution, given x⊤x = 1, of a multivariate normal vector in R^{p+1} with mean zero and covariance matrix −Σ^{-1}/2. The uniform measure on S^p is invariant under orthogonal transformations; it is easily shown that if X has density f(x; Σ) then, for each orthogonal matrix V ∈ O(p+1), Y = XV has density f(y; V⊤ΣV). A special case from this family of distributions is the complex Bingham, which can be applied to the shape analysis of objects in two dimensions (see [3]); in such cases inference is possible via the likelihood approach. For the general case, Kume and Wood [6] show that the saddlepoint approximation approach is highly accurate and robust for various parameter values. However, our aim in this paper is to study the class of Bingham distributions for large p. The practical importance of these cases is closely related to statistical inference problems for images, where the number of points of interest p is large and invariance to location, scale and possibly rotation is essential for inference.
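These two structural properties — orthogonal equivariance and invariance of the density, up to a constant, under Σ → Σ + cI on the sphere — are easy to check numerically. A minimal sketch (assuming numpy; illustrative only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4  # the sphere S^p lives in R^(p+1)

def bingham_unnorm(x, Sigma):
    """Unnormalised Bingham density exp(x' Sigma x) for a unit vector x."""
    return np.exp(x @ Sigma @ x)

A = rng.standard_normal((p + 1, p + 1))
Sigma = (A + A.T) / 2                       # symmetric parameter matrix
x = rng.standard_normal(p + 1)
x /= np.linalg.norm(x)                      # a point on the sphere

# orthogonal equivariance: Y = V'X has density f(y; V' Sigma V)
V, _ = np.linalg.qr(rng.standard_normal((p + 1, p + 1)))
y = V.T @ x
assert np.isclose(bingham_unnorm(x, Sigma), bingham_unnorm(y, V.T @ Sigma @ V))

# Sigma + cI only rescales the unnormalised density by exp(c) on the sphere
c = 0.7
ratio = bingham_unnorm(x, Sigma + c * np.eye(p + 1)) / bingham_unnorm(x, Sigma)
assert np.isclose(ratio, np.exp(c))
```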
Recent work in this direction is that of Dryden [2], who undertakes inference by considering the connection between the uniform measure on the infinite-dimensional sphere and Wiener processes and their generalizations. The key result which links these processes with



Corresponding author at: Institute of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, CT2 7NF, UK. E-mail addresses: [email protected] (A. Kume), [email protected] (S.G. Walker).

0047-259X/$ – see front matter © 2013 Published by Elsevier Inc. http://dx.doi.org/10.1016/j.jmva.2013.10.023


the sphere is the fact that if X is a multivariate normal vector of dimension p with zero mean and variance–covariance matrix I_p/p, then |X| = 1 + O_p(p^{-1/2}) for large p. Therefore |X|² goes to 1 in probability, so that X approaches the sphere of large dimension p. Hence, under certain conditions, [2] removes the constraint on the multivariate normal distribution. Our approach is instead based on a series expansion of the Bingham distribution in terms of 1/p: we study the limiting behaviour for large p while making sure that the vector of study X remains rescaled so that it stays on the sphere, and then observe the corresponding limits as the sphere increases in dimension. Our paper hence complements the findings of Dryden [2]. The aim is to find the density for large p correct to order p^{-2}, and we show that this results in a standard mixture of Dirichlet distributions, so that inference is straightforward.

In Section 2 we describe the approach based on a series expansion using Dirichlet distributions. This shows that, under certain conditions, a determinable finite mixture achieves a given level of approximation when p is large. With this information, in Section 3, we use a matrix-type expansion and take the terms discovered in Section 2. This gives us directly a value for E(XX⊤), which forms a basis for estimation of Σ. We can at this point make a comparison with the approach of Dryden [2].

2. Expansion of the Bingham distribution

We will start with the Bingham distribution of dimension p + 1; so we have

f(x) ∝ exp(x⊤Σx) 1(x⊤x = 1)
     ∝ exp( Σ_{i=0}^p Σ_ii x_i² + 2 Σ_{i<j} Σ_ij x_i x_j ) 1(x⊤x = 1).

Letting Σ00 be the largest diagonal element of Σ, we can write

x⊤Σx = Σ00 − Σ_{i=1}^p a_i x_i² + 2 Σ_{i<j} Σ_ij x_i x_j

when x⊤x = 1, and where a_i = Σ00 − Σ_ii ≥ 0. If we now transform to (w_0, …, w_p, s_0, …, s_p), where s_i = x_i² and w_i = x_i/|x_i|, then

f(w, s) ∝ exp( −Σ_{i=1}^p a_i s_i − 2 Σ_{i<j} b_ij √(s_i s_j) w_i w_j ) (s_0 ⋯ s_p)^{-1/2},

where b_ij = −Σ_ij.
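The reparametrisation above is a pointwise identity on the sphere and can be verified directly. A small numpy sketch, using the sign conventions adopted here (a_i = Σ00 − Σ_ii with the largest diagonal entry placed first, and b_ij = −Σ_ij):

```python
import numpy as np

rng = np.random.default_rng(7)
p = 5
A = rng.standard_normal((p + 1, p + 1))
Sigma = (A + A.T) / 2
order = np.argsort(-np.diag(Sigma))          # put the largest diagonal entry first
Sigma = Sigma[np.ix_(order, order)]

x = rng.standard_normal(p + 1)
x /= np.linalg.norm(x)                       # a point on the sphere
s, w = x ** 2, np.sign(x)                    # s_i = x_i^2, w_i = x_i / |x_i|

a = Sigma[0, 0] - np.diag(Sigma)             # a_0 = 0 and a_i >= 0
B = -np.triu(Sigma, 1)                       # b_ij = -Sigma_ij for i < j

lhs = x @ Sigma @ x
cross = (B * np.outer(w * np.sqrt(s), w * np.sqrt(s))).sum()
rhs = Sigma[0, 0] - (a * s).sum() - 2.0 * cross
assert np.isclose(lhs, rhs)                  # x' Sigma x in (w, s) coordinates
```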



Hence,

f(s) ∝ exp( −Σ_{i=1}^p a_i s_i ) [ Σ_{w_0,…,w_p ∈ {−1,+1}} exp( −2 Σ_{i<j} b_ij √(s_i s_j) w_i w_j ) ] (s_0 ⋯ s_p)^{-1/2}.
This expression for f(w, s) appeared in [5], where it is used as a starting point for a Gibbs sampler for sampling from f(x). The form of f(w_i | s) is not difficult to obtain. Hence, in what follows we will concentrate on expanding f(s). We start by assuming Σ is diagonal.

2.1. Zero off-diagonal elements

Let us first deal with the case when b_ij = 0 for i ≠ j. From now on, we will write the normalizing constant C(Σ) as Cp(Σ). We have the following lemma:

Lemma 1. For large p, with b_ij = 0 for all i ≠ j and Σ_{i=1}^p a_i = O(p), it is that

f(s) = Cp(Σ) { Dir(s; 1/2, 1/2, …, 1/2) − (p+1)^{-1} Σ_{i=1}^p a_i Dir(s; 1/2, 1/2 + δ_{i,1}, …, 1/2 + δ_{i,p}) + O(p^{-2}) },

where δ_{i,k} = 1 if i = k and δ_{i,k} = 0 if i ≠ k. Here Cp(Σ) is the normalizing constant and Cp(Σ) → 1 as p → +∞. Also, Dir(s; α_0, α_1, …, α_p) is the density function of the Dirichlet distribution with parameters (α_0, α_1, …, α_p); see e.g. Chapter 49 of [4]. The proofs of all lemmas and theorems are deferred to the Appendix.
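As a sanity check on Lemma 1, one can importance-sample s from Dir(1/2, …, 1/2), reweight by exp(−Σ a_i s_i), and compare the resulting moments E(s_j) with those implied by the two-term mixture. A Monte Carlo sketch (assuming numpy and the sign conventions used here; the a_i and tolerances are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 30
a = rng.uniform(0.0, 0.3, size=p)                # a_i >= 0; diagonal case, b_ij = 0

# importance sampling: s ~ Dir(1/2,...,1/2), weights exp(-sum_i a_i s_i)
n = 200_000
s = rng.dirichlet(np.full(p + 1, 0.5), size=n)   # rows are (s_0, s_1, ..., s_p)
wts = np.exp(-(s[:, 1:] @ a))
mc = (wts[:, None] * s).sum(axis=0) / wts.sum()  # estimate of E(s_j) under f(s)

# E(s_j) implied by the two-term mixture, using E_Dir(s_j) = 1/(p+1)
# and E(s_j) = (1 + 2*delta_ij)/(p+3) under the Dirichlet with a bump at i
a_full = np.concatenate(([0.0], a))              # pad a_0 = 0
num = (1.0 - (a.sum() + 2.0 * a_full) / (p + 3)) / (p + 1)
approx = num / (1.0 - a.sum() / (p + 1))

print(float(np.max(np.abs(mc - approx))))        # small for moderate a_i
```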


2.2. Full matrix

An intermediary step between the Bingham distribution and the Bingham distribution with zero off-diagonal elements is to study the case when a_i = 0 for all i.

Lemma 2. If a_i = 0 for all i and Σ_{i<j} b_ij² = p + O(1), then it is that

f(s) = Cp(Σ) { Dir(s; 1/2, …, 1/2) + 2((p+1)(p+3))^{-1} Σ_{i<j} b_ij² Dir(s; 1/2 + δ_{0,i,j}, 1/2 + δ_{1,i,j}, …, 1/2 + δ_{p,i,j}) + O(p^{-2}) },

where δ_{r,i,j} = δ_{r,i} + δ_{r,j}. If we now put this result together with the result for the zero off-diagonal case then we obtain:

Theorem 1. If Σ_{i=1}^p a_i = O(p) and Σ_{i<j} b_ij² = p + O(1), then

f(s) = Cp(Σ) { Dir(s; 1/2, …, 1/2) − (p+1)^{-1} Σ_{i=1}^p a_i Dir(s; 1/2, 1/2 + δ_{i,1}, …, 1/2 + δ_{i,p}) + 2((p+1)(p+3))^{-1} Σ_{i<j} b_ij² Dir(s; 1/2 + δ_{0,i,j}, …, 1/2 + δ_{p,i,j}) + O(p^{-2}) }.
The normalizing constant Cp(Σ)^{-1} = 1 + O(p^{-1}) and converges to 1 as p → ∞. Specifically,

Cp(Σ)^{-1} = 1 − a/(p+1) + b/((p+1)(p+3)) + O(p^{-2}),

where a = Σ_{i=1}^p a_i and b = 2 Σ_{i<j} b_ij².
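The b-term in Lemma 2 and Theorem 1 comes from averaging exp(−2 Σ b_ij √(s_i s_j) w_i w_j) over the sign vectors w, where odd powers cancel and the k = 2 term leaves 2 Σ_{i<j} b_ij² s_i s_j. For small p this average can be computed exactly by enumerating all 2^{p+1} sign vectors (numpy sketch; the small b_ij are an illustrative choice):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
p = 7
s = rng.dirichlet(np.full(p + 1, 0.5))            # a fixed point on the simplex
B = np.triu(0.05 * rng.standard_normal((p + 1, p + 1)), 1)  # small b_ij, i < j
r = np.sqrt(s)

def cross_term(w):
    # sum_{i<j} b_ij * sqrt(s_i s_j) * w_i * w_j
    return float((B * np.outer(w * r, w * r)).sum())

# exact average over all 2^(p+1) sign vectors
avg = np.mean([np.exp(-2.0 * cross_term(np.asarray(w)))
               for w in product([-1.0, 1.0], repeat=p + 1)])

# prediction from the k = 2 term: 1 + 2 * sum_{i<j} b_ij^2 s_i s_j
pred = 1.0 + 2.0 * (B ** 2 * np.outer(s, s)).sum()
print(abs(avg - pred))                            # fourth-order small
```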
3. A matrix expansion

The Bingham density is Cp(Σ)^{-1} exp(x⊤Σx) with respect to the uniform measure dS^p(x) on S^p induced by the Lebesgue measure of x ∈ R^{p+1} with x⊤x = 1, where Cp(Σ) is the corresponding normalizing constant. Using the Taylor expansion, with the conditions obtained in Section 2, it is clear that the distribution above is representable as

Cp(Σ)^{-1} Σ_{k=0}^∞ (x⊤Σx)^k / k!  dS^p(x).

The uniform measure dS^p(x) can be expressed as dS^p(x) = dw(w) ds(s), where dw(w) is the uniform discrete measure on the 2^{p+1} sign configurations of the coordinates of x and ds(s) = Dir(s; 1/2, 1/2, …, 1/2) is the Dirichlet measure with parameters (1/2, 1/2, …, 1/2). Let Σ = V∆V⊤ be a spectral decomposition, with V orthogonal and ∆ = diag(δ_0, …, δ_p). A simple change of variables y = V⊤x, which does not change the uniform measure dS^p, implies that dS^p(x) = dS^p(y) and x⊤Σx = y⊤∆y = Σ_{i=0}^p δ_i y_i². This implies that, for the transformed coordinates y, y⊤∆y does not depend on the signs of the elements y_i.


It now turns out that

E(X) Cp(Σ) = ∫ x Σ_{k=0}^∞ (x⊤Σx)^k / k! dS^p(x)
           = ∫ V y Σ_{k=0}^∞ (y⊤∆y)^k / k! dS^p(y)
           = V Σ_{k=0}^∞ [ ∫ y (y⊤∆y)^k dw(w) ds(s) ] / k!.

Since each ∫ y (y⊤∆y)^k dw(w) ds(s) is zero (the integrand is componentwise sign dependent), it is clear that E(X) = 0. For the second moments,



E(XX⊤) Cp(Σ) = ∫ xx⊤ Σ_{k=0}^∞ (x⊤Σx)^k / k! dS^p(x)
             = ∫ V yy⊤ V⊤ Σ_{k=0}^∞ (y⊤∆y)^k / k! dS^p(y)
             = V [ Σ_{k=0}^∞ ∫ yy⊤ (y⊤∆y)^k dw(w) ds(s) / k! ] V⊤
             = V [ Σ_{k=0}^∞ ∫ yy⊤ ( Σ_{i=0}^p δ_i y_i² )^k dw(w) ds(s) / k! ] V⊤.

Since y_i = w_i √s_i, it is clear that, componentwise, ∫ yy⊤ ( Σ_{i=0}^p δ_i y_i² )^k dw(w) is zero for the off-diagonal elements, so we need focus only on the diagonal components, i.e.

∫ y_j² ( Σ_{i=0}^p δ_i y_i² )^k dS^p(y) = E[ s_j ( Σ_{i=0}^p δ_i s_i )^k ],

since ds(s) = Dir(s; 1/2, 1/2, …, 1/2). In particular, for this Dirichlet distribution,

E(s_j) = (1/2) / ((p+1)/2) = 1/(p+1)

and

E(s_j s_i) = 1/((p+1)(p+3)) if i ≠ j,   E(s_j s_i) = 3/((p+1)(p+3)) if i = j.
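These are standard symmetric Dirichlet moments; a quick Monte Carlo confirmation (numpy sketch; p and n are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 9
n = 500_000
s = rng.dirichlet(np.full(p + 1, 0.5), size=n)   # draws from Dir(1/2,...,1/2)

m1 = s[:, 0].mean()                     # estimate of E(s_j)
m2_same = (s[:, 0] ** 2).mean()         # estimate of E(s_j s_i) with i = j
m2_diff = (s[:, 0] * s[:, 1]).mean()    # estimate of E(s_j s_i) with i != j

print(m1, 1 / (p + 1))
print(m2_same, 3 / ((p + 1) * (p + 3)))
print(m2_diff, 1 / ((p + 1) * (p + 3)))
```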

Therefore, for the original matrix, we have, for k = 0 and k = 1,

∫ yy⊤ ( Σ_{i=0}^p δ_i y_i² )^0 dw(w) ds(s) = I / (p+1)   for k = 0,

∫ yy⊤ ( Σ_{i=0}^p δ_i y_i² )^1 dw(w) ds(s) = trace(∆) I / ((p+1)(p+3)) + 2∆ / ((p+1)(p+3))   for k = 1.

It is clear now that approximating Cp(Σ) E(XX⊤) by the first two terms yields

Cp(Σ) E(XX⊤) ≈ ∫ xx⊤ [ (x⊤Σx)^0 + (x⊤Σx)^1 ] dS^p(x)
             = V [ I/(p+1) + trace(∆) I/((p+1)(p+3)) + 2∆/((p+1)(p+3)) ] V⊤
             = I/(p+1) + trace(Σ) I/((p+1)(p+3)) + 2Σ/((p+1)(p+3)).


Likewise, the normalizing constant Cp(Σ) = ∫ Σ_{k=0}^∞ (x⊤Σx)^k / k! dS^p(x) is approximated by the first two terms as

Cp(Σ) = ∫ [ (x⊤Σx)^0 + (x⊤Σx)^1 ] dS^p(x) + O(p^{-2}) ≈ 1 + trace(Σ)/(p+1).
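This first-order value of Cp(Σ) can be checked by averaging exp(x⊤Σx) over uniform draws on the sphere (numpy Monte Carlo sketch; the diagonal Σ below is an arbitrary small choice with trace(Σ) = O(p)):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 100
Sigma = np.diag(np.linspace(0.0, 0.4, p + 1))     # small entries, trace = O(p)

n = 50_000
Z = rng.standard_normal((n, p + 1))
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # uniform draws on S^p
q = ((X @ Sigma) * X).sum(axis=1)                 # x' Sigma x for each draw

cp_mc = np.exp(q).mean()                          # Monte Carlo C_p(Sigma)
cp_approx = 1.0 + np.trace(Sigma) / (p + 1)       # first-order approximation
print(cp_mc, cp_approx)
```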

Putting these results together we have, correct to order O(p^{-2}),

E(XX⊤) = I/(p+1) + 2Σ/((p+1)(p+3)) − 2 trace(Σ) I/((p+1)²(p+3)).

We have the last term since, under the conditions for Σ, it is allowed that trace(Σ) = O(p). This has some points of discussion:

1. Since trace(XX⊤) = 1 we would expect this for the expectation; this is easily seen to be the case.
2. There is also an invariance property crucial for the Bingham distribution; namely that replacing Σ by Σ + cI changes nothing. This also holds here: if we replace Σ by Σ + cI then E(XX⊤) remains unchanged.
3. Based on 2, we can arbitrarily set trace(Σ) = 0 and therefore base estimation of Σ on

E(XX⊤) = I/(p+1) + 2Σ/((p+1)(p+3)) + O(p^{-2}).   (1)
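Display (1) suggests a method of moments estimator Σ̂ = ((p+1)(p+3)/2)(S − I/(p+1)), with S the sample second-moment matrix and the trace then projected to zero. A numpy sketch (the sampler below produces only approximate Bingham draws, by projecting correlated normals onto the sphere in the spirit of the normal approximation discussed next; all choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 60
A = rng.standard_normal((p + 1, p + 1))
Sigma = (A + A.T) / 2
Sigma -= np.trace(Sigma) / (p + 1) * np.eye(p + 1)   # impose trace(Sigma) = 0

# approximate Bingham draws: correlated normals projected onto the sphere
Psi = np.eye(p + 1) / (p + 1) + 2.0 * Sigma / (p + 1) ** 2
L = np.linalg.cholesky(Psi)                          # Psi must be positive definite
n = 100_000
X = rng.standard_normal((n, p + 1)) @ L.T
X /= np.linalg.norm(X, axis=1, keepdims=True)

# method of moments from (1), trace projected back to zero
S = X.T @ X / n
Sigma_hat = 0.5 * (p + 1) * (p + 3) * (S - np.eye(p + 1) / (p + 1))
Sigma_hat -= np.trace(Sigma_hat) / (p + 1) * np.eye(p + 1)

corr = np.corrcoef(Sigma_hat.ravel(), Sigma.ravel())[0, 1]
print(corr)   # close to 1
```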

The Dryden [2] approach to dealing with large p in the Bingham distribution is to use the approximation, in the distributional sense, X ≈ N(0, Ψ), where

Ψ = I/(p+1) + 2Σ/(p+1)².

The approximation used here is based on the removal of the constraint X⊤X = 1, and Dryden [2] shows that for suitable Ψ it is that X⊤X ≈ 1. This is easy to see in particular when Σ is diagonal, and the result relies essentially on the strong law of large numbers. Estimation is based on

E(XX⊤) = I/(p+1) + 2Σ/(p+1)² + O(p^{-2}),

which differs slightly from what we obtained in (1). Moreover, the approach of Dryden [2] implicitly assumes trace(Σ) = 0.

4. Discussion

This paper shows that direct computation of the limiting behaviour of the Bingham distribution is possible without needing to depart from the sphere. The point of contact between our findings and those of Dryden [2] is strictly the result for E(XX⊤), which is effectively the same. However, our derivation of this result is different and comes via a density which remains on the sphere. This density is a finite mixture of Dirichlet components and so is easy to apply to real data, including data arising from Bingham mixture models. In particular, a method of moments estimator of Σ would coincide with that of [2]. Our approach in the matrix version could easily be applied to the Fisher–Bingham case, with the only difference that the matrix expectations would need to be calculated to higher orders. Finally, we would like to mention that mixtures of Dirichlet distributions are widely used in latent Dirichlet allocation models, whose main focus is on frequency estimation (see [1]).

Acknowledgments

The authors would like to thank the two anonymous referees for their careful review and constructive suggestions which improved the paper.

Appendix

Proof to Lemma 1. It is clear that

f(s) ∝ exp( −Σ_{i=1}^p a_i s_i ) (s_0 ⋯ s_p)^{-1/2}


and

exp( −Σ_{i=1}^p a_i s_i ) = Σ_{m=0}^∞ (−1)^m (m!)^{-1} ( Σ_{i=1}^p a_i s_i )^m,

so

f(s) ∝ Σ_{m=0}^∞ (−1)^m Σ_{l_1+⋯+l_p=m} ( Π_{i=1}^p a_i^{l_i} / l_i! ) s_0^{-1/2} s_1^{l_1−1/2} ⋯ s_p^{l_p−1/2}
     ∝ Σ_{m=0}^∞ (−1)^m [ Γ(1/2) / Γ(m + (p+1)/2) ] Σ_{l_1+⋯+l_p=m} ( Π_{i=1}^p a_i^{l_i} Γ(l_i + 1/2) / l_i! ) Dir(s; 1/2, l_1 + 1/2, …, l_p + 1/2).

So we see that f(s) is a mixture of Dirichlet distributions (with alternating signs attached to the weights), and our aim now is to understand these weights. To this end let us write

f(s) = Σ_{m=0}^∞ (−1)^m w̄_{m,p} α_{m,p}^{-1} Σ_{l_1+⋯+l_p=m} ( Π_{i=1}^p a_i^{l_i} Γ(l_i + 1/2)/l_i! ) Dir(s; 1/2, l_1 + 1/2, …, l_p + 1/2),

where

α_{m,p} = Σ_{l_1+⋯+l_p=m} Π_{i=1}^p a_i^{l_i} Γ(l_i + 1/2)/l_i!,

w̄_{m,p} = w_{m,p} / Σ_{m=0}^∞ (−1)^m w_{m,p}   and   w_{m,p} = α_{m,p} / Γ(m + (p+1)/2).

Now Γ(l + 1/2) = (l − 1/2) ⋯ (1/2) Γ(1/2) and Γ(m + (p+1)/2) = (m − 1 + (p+1)/2) ⋯ ((p+1)/2) Γ((p+1)/2), and so we can take out Γ(1/2)^p / Γ((p+1)/2) as a common term and readjust w_{m,p} accordingly. Based on this we have w_{0,p} = 1, w_{1,p} = (p+1)^{-1} Σ_{i=1}^p a_i and, for m ≥ 2,

w_{m,p} ≤ 2^m ( Σ_{i=1}^p a_i )^m / (p+1)^m.

This follows since (l − 1/2) ⋯ (1/2) < l!, (m − 1 + q) ⋯ q > q^m for q ≥ 1, m!/(l_1! ⋯ l_p!) ≥ 1 when l_1 + ⋯ + l_p = m, and

Σ_{l_1+⋯+l_p=m} Π_{i=1}^p [ a_i^{l_i} (l_i − 1/2) ⋯ (1/2) / l_i! ] ≤ Σ_{l_1+⋯+l_p=m} Π_{i=1}^p a_i^{l_i} ≤ ( Σ_{i=1}^p a_i )^m.

Our result now is that, provided

Σ_{i=1}^p a_i = O(p),

the w_{m,p} are bounded by (K/p)^m for some K < +∞. Therefore, for large p, we have

f(s) = Cp(Σ) { Dir(s; 1/2, 1/2, …, 1/2) − (p+1)^{-1} Σ_{i=1}^p a_i Dir(s; 1/2, 1/2 + δ_{i,1}, …, 1/2 + δ_{i,p}) + O(p^{-2}) },

where δ_{i,k} = 1 if i = k and δ_{i,k} = 0 if i ≠ k.


Proof to Lemma 2. We have that

f(s) ∝ (s_0 ⋯ s_p)^{-1/2} Σ_{w_0,…,w_p ∈ {−1,+1}} exp( −2 Σ_{0≤i<j≤p} b_ij √(s_i s_j) w_i w_j )

     = (s_0 ⋯ s_p)^{-1/2} Σ_{w_0,…,w_p ∈ {−1,+1}} Σ_{k=0}^∞ (−1)^k 2^k (k!)^{-1} ( Σ_{0≤i<j≤p} b_ij √(s_i s_j) w_i w_j )^k

     ∝ Σ_{k=0}^∞ (−1)^k 2^k Σ_{w_0,…,w_p ∈ {−1,+1}} Σ_{∗} Π_{i<j} ( b_ij^{l_ij} (s_i s_j)^{l_ij/2} w_i^{l_ij} w_j^{l_ij} / l_ij! ) (s_0 ⋯ s_p)^{-1/2}

     ∝ Σ_{k even} 2^k Σ_{∗∗} Π_{i<j} ( b_ij^{l_ij} / l_ij! ) s_0^{λ_0 − 1/2} s_1^{λ_1 − 1/2} ⋯ s_p^{λ_p − 1/2},

where

λ_i = Σ_{r=0}^{i−1} l_{ri}/2 + Σ_{r=i+1}^{p} l_{ir}/2.

Here ∗ denotes the sum over { l_ij ≥ 0 : Σ_{i<j} l_ij = k } and ∗∗ denotes the sum over those terms of ∗ for which every λ_i is an integer; summing over the signs w_0, …, w_p removes all other terms, and in particular all terms with k odd. Hence f(s) is a mixture of the Dirichlet densities Dir(s; λ_0 + 1/2, λ_1 + 1/2, …, λ_p + 1/2), where, for even k,

w_{k,p} = 2^k Σ_{∗∗} Π_{i<j} ( b_ij^{l_ij} / l_ij! ) Π_{i=0}^p Γ(λ_i + 1/2) / Γ(k + (p+1)/2).

Now

w_{0,p} = Γ(1/2)^{p+1} / Γ((p+1)/2),

and for k ≥ 2,

w_{k,p} = 2^k Σ_{∗∗} [ Π_{i<j} b_ij^{l_ij}/l_ij! ] [ Π_{i=0}^p (λ_i − 1/2) ⋯ (1/2) Γ(1/2) ] / [ (k − 1 + (p+1)/2) ⋯ ((p+1)/2) Γ((p+1)/2) ].

Now for each i it is that

(λ_i − 1/2) ⋯ (1/2) / ( l_{0i}! ⋯ l_{i−1,i}! l_{i,i+1}! ⋯ l_{ip}! ) ≤ 1,

and so

w_{k,p} ≤ (2/p)^k ( Σ_{i<j} b_ij² )^{k/2} w_{0,p}.

If we now assume that

Σ_{i<j} b_ij² = p + O(1),


then (using similar techniques to those in Section 2.1)

w_{k,p} ≤ 2^k p^{−k/2} w_{0,p}.

Specifically,

w_{2,p} = w_{0,p} · 2 Σ_{i<j} b_ij² / ((p+1)(p+3)),

and so, putting these results together, we see that

f(s) = Cp(Σ) { Dir(s; 1/2, …, 1/2) + 2((p+1)(p+3))^{-1} Σ_{i<j} b_ij² Dir(s; 1/2 + δ_{0,i,j}, 1/2 + δ_{1,i,j}, …, 1/2 + δ_{p,i,j}) + O(p^{-2}) },
where δ_{r,i,j} = δ_{r,i} + δ_{r,j}.

References

[1] D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[2] I.L. Dryden, Statistical analysis on high-dimensional spheres and shape spaces, Ann. Statist. 33 (4) (2005) 1643–1665.
[3] J.T. Kent, The complex Bingham distribution and shape analysis, J. R. Stat. Soc. B 56 (1994) 285–299.
[4] S. Kotz, N. Balakrishnan, N.L. Johnson, Continuous Multivariate Distributions, Vol. 1: Models and Applications, John Wiley & Sons, New York, 2002.
[5] A. Kume, S.G. Walker, On the Fisher–Bingham distribution, Statist. Comput. 19 (2009) 167–172.
[6] A. Kume, A.T.A. Wood, Saddlepoint approximations for the Bingham and Fisher–Bingham normalising constants, Biometrika 92 (2005) 465–476.