Factorization of posteriors and partial imputation algorithm for graphical models with missing data




Statistics & Probability Letters 64 (2003) 369–379. doi:10.1016/S0167-7152(03)00181-0

Zhi Geng*, Kaican Li
Department of Probability and Statistics, Peking University, Beijing 100871, China
Received in revised form October 2000

Abstract

In this paper, we discuss factorization of a posterior distribution and present a partial imputation algorithm for a graphical model with incomplete data. We use an ordinary graph to represent a graphical model and introduce a hypergraph to represent an observed data pattern, where each hyperedge is the set of variables observed for a group of individuals. First, in terms of a decomposition of such a mixed graph, we discuss factorization of a joint posterior distribution into several marginal posterior distributions so that calculation of the posterior distribution can be localized. Then, for a mixed graph which cannot be decomposed without loss of information, we present a partial imputation algorithm which imputes only a part of the missing data and reduces the unnecessary imputation of an ordinary Gibbs sampler. Finally, we discuss the efficiency improved by a decomposition and by the partial imputation algorithm. © 2003 Elsevier B.V. All rights reserved.

Keywords: Contingency table; Dirichlet distribution; Graphical models; Incomplete data; Imputation algorithm

1. Introduction

During the last decades, graphical models have been rapidly developed and widely applied in various fields (see Whittaker, 1990; Edwards, 1995; Lauritzen, 1996). A graphical model is a powerful tool for interpreting conditional independence relationships among a great number of variables and, thus, it can conveniently be used to represent a large complex model. Much attention has been focused on algorithms and manipulations related to structures of graphs, such as local computations (Lauritzen and Spiegelhalter, 1988), global propagation of evidence (Jensen et al., 1990) and decomposition and collapsibility (Lauritzen and Wermuth, 1989; Frydenberg and Lauritzen, 1989).

* Corresponding author. E-mail address: [email protected] (Z. Geng).




Geng et al. (2000) presented a hypergraph representation for an incomplete data pattern and a partial imputation EM algorithm related to structures of hypergraphs.

For a large graphical model, it is not always possible to obtain complete data, and it becomes difficult to calculate a posterior distribution when the observed data are incomplete. In this paper, we use an ordinary graph to represent a graphical model and introduce a hypergraph to represent an observed data pattern, where each hyperedge is the set of variables observed for a group of individuals. For such a mixed graph with two types of edges (ordinary edges and hyperedges), we present a lossless decomposition and a partial imputation (PI) algorithm. A lossless decomposition is one that loses no information for the calculation of the posterior distribution. First, in terms of lossless decomposition, we discuss factorization of a joint posterior distribution into several marginal posterior distributions so that calculation of the posterior distribution can be localized. Then, for a mixed graph or subgraph which cannot be losslessly decomposed, we present the PI algorithm, which imputes only a part of the missing data and reduces the unnecessary imputation performed by an ordinary Gibbs sampler. Finally, we discuss the efficiency improved by a lossless decomposition and by the PI algorithm.

In Section 2, we introduce a hypergraphical representation for an observed data pattern and a lossless decomposition of a mixed graph. Section 3 discusses strong hyper Markov laws and gives a condition for factorization of a joint posterior distribution. Section 4 applies these results to multinomial models with the hyper Dirichlet law. In Section 5, we present the PI algorithm. Finally, in Section 6, we discuss the efficiency of the PI algorithm.

2. Graphical models with missing data and decomposition

Let $V$ be a set of vertices and $E = \{e_1, \ldots, e_M\}$ denote a set of edges, where $e_m = (i, j)$ denotes an edge between two vertices $i$ and $j$. Then an ordinary graph can be denoted as $G = (V, E)$. In the graph $G$, each vertex denotes a variable, and thus the random variables are $X_V = (X_v)_{v \in V}$. The absence of an edge between a pair of vertices means that the corresponding pair of variables is conditionally independent given the other variables, which is the pairwise Markov property with respect to $G$. Define $G_A = (A, E_A)$ to be the subgraph induced by $A$ and let $X_A = (X_v)_{v \in A}$, where $A$ is a subset of $V$ and $E_A = \{e \in E \mid e \in A \times A\} = E \cap (A \times A)$.

Observed data are classified into $S$ groups such that the data in the $s$th group have the same set $t_s$ of observed variables. Let $T = \{t_1, t_2, \ldots, t_S\}$ denote an observed data pattern. Then the observed data can be denoted as $\{x^{(s,i)}_{t_s};\; i = 1, \ldots, n_s\}$ for $s = 1, \ldots, S$, where $n_s$ is the size of the $s$th group; see Fig. 1. We can use a hypergraph to represent an observed data pattern, where a hyperedge is a set of observed variables, $t_s$ say. A monotone pattern of incomplete data is represented as nested hyperedges.

Example 1. Let $V = \{1, 2, 3\}$. Then $X_1$, $X_2$ and $X_3$ denote three variables. Suppose that $X_1$ is conditionally independent of $X_3$ given $X_2$ (i.e. $E = \{(1,2), (2,3)\}$), and that an observed data pattern is given in Fig. 2, denoted by $T = \{\{1,2,3\}, \{1,2\}, \{3\}\}$. In Fig. 3, the ordinary graph represents the graphical model and the hypergraph represents the observed data pattern.

Let $p(x_V \mid \theta_V)$ denote the joint density of the variables $X_V$ and $p(x_A \mid \theta_A)$ the marginal density of $X_A$ for a subset $A$ of $V$.


Fig. 1. An observed data pattern: group $s$ ($s = 1, \ldots, S$) consists of $n_s$ individuals for which the variables $X_{t_s}$ are observed, and the remaining entries are missing data.

Fig. 2. The data pattern $T = \{\{1,2,3\}, \{1,2\}, \{3\}\}$.

Fig. 3. A graph with hyperedges.




Assume throughout this paper that incomplete data arise from a noninformative censoring process; see Dickey et al. (1987). Then the likelihood function can be represented as
$$L(\theta_V \mid T) \propto \prod_{s=1}^{S} \prod_{i=1}^{n_s} p(x^{(s,i)}_{t_s} \mid \theta_{t_s}), \qquad (1)$$
where $p(x_t \mid \theta_t) = \int p(x_V \mid \theta_V)\, dx_{V \setminus t}$. When observed data are incomplete, a Gibbs sampler and the data augmentation (DA) algorithm (Tanner and Wong, 1987) are popularly used to calculate posterior distributions. Below we introduce a lossless decomposition of a graphical model, discuss factorization of a posterior distribution in terms of lossless decomposition, and show that decomposition can improve the efficiency of the calculation of posterior distributions.

Definition 1. A triplet $(A, B, C)$ of disjoint subsets of $V$ is called a decomposition of $G$ if (i) $A \cup B \cup C = V$, (ii) $C$ separates $A$ and $B$ in $G$, and (iii) $C$ is complete.

We say that a graph $G$ can be decomposed into two subgraphs $G_{A \cup C}$ and $G_{B \cup C}$ if the triplet $(A, B, C)$ is a decomposition of $G$. A graph is called decomposable if it can be repeatedly decomposed into complete subgraphs; a decomposable graph denotes a decomposable model. For the case with missing data, we introduce below a lossless decomposition, meaning that the decomposition loses no information in the sense that the original joint posterior distribution can be obtained from marginal posterior distributions based on subgraphs and sub-hypergraphs.

Definition 2. A decomposition $(A, B, C)$ of $G$ is called a lossless decomposition of $G$ with respect to $T$ if each $t$ in $T^B$ contains $C$, where $T^B = \{t \in T \mid t \cap B \neq \emptyset\}$.

The condition that each $t$ in $T^B$ contains $C$ is called "$X_C$ is more observed than $X_B$" by Didelez and Pigeot (1998). If a triplet $(A, B, C)$ is a lossless decomposition of $G$ with respect to $T$, then we say that $G$ with $T$ can be decomposed losslessly into two subgraphs, $G_{A \cup C}$ with $T$ and $G_{B \cup C}$ with $T^B$.

Example 2. Suppose that the observed data pattern of Example 1 is changed to $T = \{\{1,2,3\}, \{1,2\}, \{2,3\}\}$; see Fig. 4(a). Let $A = \{1\}$, $B = \{3\}$ and $C = \{2\}$. Then $T^B = \{\{1,2,3\}, \{2,3\}\}$. Since $(A, B, C)$ is a decomposition of $G$ and each element of $T^B$, that is, $\{1,2,3\}$ and $\{2,3\}$, contains $C$, we obtain that $(A, B, C)$ is a lossless decomposition. Thus the graph $G$ with $T$ can be decomposed losslessly into $G_{\tilde{A}}$ with $T$ and $G_{\tilde{B}}$ with $T^B$, where $\tilde{A} = A \cup C$ and $\tilde{B} = B \cup C$; see Fig. 4(b) and (c). Note that the hyperedge $\{2,3\}$ of $T$ is projected as $\{2\}$ onto the subgraph $G_{\tilde{A}}$, and that the hyperedge $\{1,2,3\}$ is projected as $\{1,2\}$ and $\{2,3\}$ onto the subgraphs $G_{\tilde{A}}$ and $G_{\tilde{B}}$, respectively. For the graph in Fig. 3, $(\{1\}, \{3\}, \{2\})$ is not a lossless decomposition, but $(\{3\}, \{1\}, \{2\})$ is.



Fig. 4. A lossless decomposition $(\{1\}, \{3\}, \{2\})$: (a) $G$ with $T$, (b) $G_{\tilde{A}}$ with $T$, (c) $G_{\tilde{B}}$ with $T^B$.
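To make Definitions 1 and 2 concrete, the following sketch checks the three decomposition conditions and the lossless condition for a candidate triplet $(A, B, C)$ on the graphs and data patterns of Examples 1 and 2. It is an illustration only, not code from the paper; the function name `is_lossless_decomposition` and the representation of edges as two-element frozensets are our own choices.

```python
from itertools import combinations

def is_lossless_decomposition(V, E, A, B, C, T):
    """Check Definitions 1 and 2 for a candidate triplet (A, B, C).

    V: set of vertices; E: set of undirected edges given as frozensets {i, j};
    A, B, C: disjoint vertex sets; T: observed data pattern (list of hyperedges).
    """
    A, B, C = set(A), set(B), set(C)
    # (i) A, B, C are disjoint and cover V.
    if A & B or A & C or B & C or (A | B | C) != V:
        return False
    # (iii) C is complete: every pair of vertices in C is joined by an edge.
    if any(frozenset(pair) not in E for pair in combinations(C, 2)):
        return False
    # (ii) C separates A and B: no path from A to B avoiding C.
    adj = {v: set() for v in V}
    for e in E:
        i, j = tuple(e)
        adj[i].add(j)
        adj[j].add(i)
    reachable, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        for w in adj[v] - C - reachable:
            reachable.add(w)
            stack.append(w)
    if reachable & B:
        return False
    # Definition 2: every hyperedge meeting B must contain the separator C.
    T_B = [set(t) for t in T if set(t) & B]
    return all(C <= t for t in T_B)

# Example 2: X1 is independent of X3 given X2, pattern T = {{1,2,3}, {1,2}, {2,3}}.
V = {1, 2, 3}
E = {frozenset({1, 2}), frozenset({2, 3})}
T = [{1, 2, 3}, {1, 2}, {2, 3}]
print(is_lossless_decomposition(V, E, {1}, {3}, {2}, T))   # True
# For the pattern of Fig. 3, T2 = {{1,2,3}, {1,2}, {3}}:
T2 = [{1, 2, 3}, {1, 2}, {3}]
print(is_lossless_decomposition(V, E, {1}, {3}, {2}, T2))  # False: {3} does not contain C
print(is_lossless_decomposition(V, E, {3}, {1}, {2}, T2))  # True, as noted in Example 2
```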

3. Factorization of posterior distribution

In this section, we give a condition under which a joint posterior distribution can be factorized into marginal posterior distributions such that the calculation of the posterior distribution can be decomposed into local calculations. This decomposition is very important for a large complex system. Let $X \perp Y$ denote that $X$ is independent of $Y$, let $f(\theta_A)$ denote the prior distribution of $\theta_A$, and let $f(\theta_A \mid T)$ denote the posterior distribution given observed data of a pattern $T$. Dawid and Lauritzen (1993) presented hyper Markov laws for graphical models, which describe conditional independence of prior probability distributions.

Definition 3. A prior distribution of $\theta_V$ is strong hyper Markov over $G$ if $\theta_{B \mid C} \perp \theta_{A \cup C}$ for any decomposition $(A, B, C)$ of $G$.

In the following theorem, we show a condition for factorizing a joint posterior distribution into two marginal ones.

Theorem 1. Suppose that the prior distribution of $\theta_V$ is strong hyper Markov over $G$. If the triplet $(A, B, C)$ is a lossless decomposition of $G$ with respect to $T$, then the posterior distribution of $\theta_{B \mid C}$ is updated only by the observed data in $T^B$, that is,
$$f(\theta_{B \mid C} \mid T) = f(\theta_{B \mid C} \mid T^B), \qquad (2)$$
and $\theta_{B \mid C}$ is independent of $\theta_{A \cup C}$, that is,
$$f(\theta_{A \cup C}, \theta_{B \mid C} \mid T) = f(\theta_{A \cup C} \mid T)\, f(\theta_{B \mid C} \mid T^B). \qquad (3)$$

Proof. Let $\tilde{A} = A \cup C$ and $\tilde{B} = B \cup C$. Since $(A, B, C)$ is a decomposition of $G$, we have $p(x_V \mid \theta_V) = p(x_{\tilde{A}} \mid \theta_{\tilde{A}})\, p(x_B \mid x_C, \theta_{B \mid C})$ and $f(\theta_V) = f(\theta_{\tilde{A}})\, f(\theta_{B \mid C})$. Let $R = T \setminus T^B$. If the set of observed variables in the $s$th group, $t_s$, belongs to $R$ (i.e. $t_s \in R$), then the contribution of an individual in this group to the likelihood is $p(x^{(s,i)}_{t_s} \mid \theta_{t_s \cap \tilde{A}})$; otherwise, $t_s$ must belong to $T^B$ (i.e. $t_s \in T^B$), and then the contribution is $p(x^{(s,i)}_{t_s} \mid \theta_{t_s}) = p(x^{(s,i)}_{t_s \cap \tilde{A}} \mid \theta_{t_s \cap \tilde{A}})\, p(x^{(s,i)}_{t_s \cap B} \mid x^{(s,i)}_C, \theta_{(t_s \cap B) \mid C})$. Thus, the likelihood function is
$$L(\theta_{\tilde{A}}, \theta_{B \mid C} \mid T) \propto \left[ \prod_{t_s \in R} \prod_{i=1}^{n_s} p(x^{(s,i)}_{t_s} \mid \theta_{t_s \cap \tilde{A}}) \right] \left[ \prod_{t_s \in T^B} \prod_{i=1}^{n_s} p(x^{(s,i)}_{t_s \cap \tilde{A}} \mid \theta_{t_s \cap \tilde{A}})\, p(x^{(s,i)}_{t_s \cap B} \mid x^{(s,i)}_C, \theta_{(t_s \cap B) \mid C}) \right]$$
$$= \left[ \prod_{t_s \in T} \prod_{i=1}^{n_s} p(x^{(s,i)}_{t_s \cap \tilde{A}} \mid \theta_{t_s \cap \tilde{A}}) \right] \left[ \prod_{t_s \in T^B} \prod_{i=1}^{n_s} p(x^{(s,i)}_{t_s \cap B} \mid x^{(s,i)}_C, \theta_{(t_s \cap B) \mid C}) \right]$$
$$= L(\theta_{\tilde{A}} \mid T)\, L(\theta_{B \mid C} \mid T^B).$$
Hence we obtain the joint posterior distribution of $\theta_{\tilde{A}}$ and $\theta_{B \mid C}$,
$$f(\theta_{\tilde{A}}, \theta_{B \mid C} \mid T) \propto f(\theta_{\tilde{A}})\, f(\theta_{B \mid C})\, L(\theta_{\tilde{A}} \mid T)\, L(\theta_{B \mid C} \mid T^B).$$
So the marginal posterior distribution of $\theta_{\tilde{A}}$ is $f(\theta_{\tilde{A}} \mid T) \propto f(\theta_{\tilde{A}})\, L(\theta_{\tilde{A}} \mid T)$, and the marginal posterior distribution of $\theta_{B \mid C}$ is $f(\theta_{B \mid C} \mid T) \propto f(\theta_{B \mid C})\, L(\theta_{B \mid C} \mid T^B)$.

Theorem 1 indicates that information on separators is important for factorization of a posterior distribution; thus, separators should be observed simultaneously as much as possible. First, we repeatedly apply this theorem to a graph and its decomposed subgraphs until the subgraphs cannot be decomposed further; then we locally calculate the marginal posterior distributions on these subgraphs and, finally, we obtain the joint posterior distribution in the factorized form (3). In this way, we can decompose a large global calculation into several small local calculations.

4. Multinomial models and the hyper Dirichlet law

In this section, we restrict attention to decomposable models of discrete variables. Let $i_V$ be a cell index in a contingency table classified by the variables in $V$. Denote the cell probabilities as $p(i_V)$ for all $i_V$, the marginal probabilities as $p(i_A)$, where $i_A$ is an index in a marginal table classified by the variables in a subset $A$ of $V$, and the conditional probabilities given the variables in a subset $B$ as $p(i_{V \setminus B \mid B})$. Assume that the prior distribution of $\{p(i_V)\}$ is a hyper Dirichlet distribution which is strong hyper Markov over $G$, denoted as $HD(\{p(i_V)\} \mid \{\alpha(i_V)\})$. The hyper Dirichlet distribution of a decomposable model can be described as follows: for the vertex set $e$ of a maximal complete subgraph of $G$, $\{p(i_e)\}$ has the Dirichlet distribution $D(\{p(i_e)\} \mid \{\alpha(i_e)\})$, where $\alpha(i_e)$ is a marginal summation of $\alpha(i_V)$. See Dawid and Lauritzen (1993) for further details.

Example 3. Consider again Example 2. Since $(A, B, C)$ is a lossless decomposition of $G$ with respect to $T$, the posterior distributions can be factorized into
$$f(\{p(i_{12})\}, \{p(i_{3 \mid 2})\} \mid T) = f(\{p(i_{12})\} \mid T)\, f(\{p(i_{3 \mid 2})\} \mid T^B)$$



and
$$f(\{p(i_{3 \mid 2})\} \mid T) = f(\{p(i_{3 \mid 2})\} \mid T^B).$$
Next, we further consider the factorization of $f(\{p(i_{12})\} \mid T)$. Let $\tilde{A} = A \cup C$ and $\tilde{B} = B \cup C$. For the subgraph $G_{\tilde{A}} = (\tilde{A}, E_{\tilde{A}})$ induced by $\tilde{A}$, where $E_{\tilde{A}} = \{e \cap \tilde{A} \mid e \in E\} = \{(1,2)\}$ in the sense of a graphical model, let $A' = \emptyset$, $B' = \{1\}$ and $C' = \{2\}$. Then $(A', B', C')$ is a lossless decomposition of $G_{\tilde{A}}$ with respect to $T$, and $T^{B'} = \{\{1,2,3\}, \{1,2\}\}$. We get
$$f(\{p(i_{1 \mid 2})\} \mid T) = f(\{p(i_{1 \mid 2})\} \mid T^{B'})$$
and
$$f(\{p(i_2)\}, \{p(i_{1 \mid 2})\} \mid T) = f(\{p(i_2)\} \mid T)\, f(\{p(i_{1 \mid 2})\} \mid T^{B'}).$$
So we can locally calculate the posterior distributions of $p(i_{3 \mid 2})$, $p(i_{1 \mid 2})$ and $p(i_2)$ with the data $\{n_{123}(i_{23}) + n_{23}(i_{23})\}$, $\{n_{123}(i_{12}) + n_{12}(i_{12})\}$ and $\{n_{123}(i_2) + n_{12}(i_2) + n_{23}(i_2)\}$, respectively. Note that each of these data sets is a set of complete data for the corresponding marginal posterior distribution. Finally, we get the factorized posterior distribution
$$f(\{p(i_2)\}, \{p(i_{1 \mid 2})\}, \{p(i_{3 \mid 2})\} \mid T) = f(\{p(i_2)\} \mid T)\, f(\{p(i_{1 \mid 2})\} \mid T^{B'})\, f(\{p(i_{3 \mid 2})\} \mid T^B).$$

5. Partial imputation algorithm

The lossless decomposition discussed in the previous sections depends both on the structure of the graph $G$ and on that of the observed data pattern $T$. For a general data pattern $T$, a graph $G$ or one of its subgraphs may not be decomposable into subgraphs without loss of information. In this case, data augmentation and imputation methods can be used to calculate its posterior distribution. Let $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$, where $Y_{\mathrm{obs}}$ denotes the observed data and $Y_{\mathrm{mis}}$ denotes the missing data. In an ordinary Gibbs sampler, incomplete data need to be augmented to "complete" data by imputing $Y_{\mathrm{mis}}$; see Tanner (1991) and Rubin (1987) for further details. In this section, for the case where a graphical model cannot be decomposed because of its data pattern, we present the PI algorithm, which imputes as little of the missing data as possible so that the graphical model can be decomposed further into small models.

Given a model $G$ and an observed data pattern $T$, suppose that the model $G$ cannot be losslessly decomposed because of $T$. In this case, the observed data pattern $T$ is augmented to $\tilde{T}$ such that the model $G$ can be losslessly decomposed with respect to the augmented data pattern $\tilde{T}$. Let $Y_{\mathrm{mis}} = (Y_{\mathrm{mis1}}, Y_{\mathrm{mis2}})$, where $Y_{\mathrm{mis1}}$ denotes the part of the missing data that will be imputed by the PI algorithm and $Y_{\mathrm{mis2}}$ denotes the other part, which will not be imputed. $\tilde{T}$ is the pattern of the augmented data $(Y_{\mathrm{obs}}, Y_{\mathrm{mis1}})$. Suppose that $(A, B, C)$ is a lossless decomposition with respect to $\tilde{T}$. Let $\tilde{A} = A \cup C$ and $\tilde{B} = B \cup C$. Then the PI algorithm can be described as follows. Start with an initial posterior distribution and repeat the following two steps until convergence:

1. Partial imputation step: Impute the missing data $Y_{\mathrm{mis1}}$ from the predictive distribution $f(y_{\mathrm{mis1}} \mid \theta^{(t)}_{\tilde{A}}, \theta^{(t)}_{B \mid C}, T)$ such that $(Y_{\mathrm{obs}}, Y_{\mathrm{mis1}})$ has the pattern $\tilde{T}$.



2. Posterior step: Draw $\theta^{(t+1)}_{\tilde{A}}$ and $\theta^{(t+1)}_{B \mid C}$ from the current approximate posterior distribution
$$f(\theta_{\tilde{A}}, \theta_{B \mid C} \mid T) = f(\theta_{\tilde{A}} \mid \tilde{T})\, f(\theta_{B \mid C} \mid \tilde{T}^B),$$
where $\tilde{T}$ denotes the pattern of the augmented data $(Y_{\mathrm{obs}}, Y_{\mathrm{mis1}})$ and $\tilde{T}^B = \{t \in \tilde{T} \mid t \cap B \neq \emptyset\}$.

Example 4. Consider again Example 3, except that the data pattern is changed back to $T = \{\{1,2,3\}, \{1,2\}, \{3\}\}$, as shown in Fig. 3. The joint posterior distribution can be calculated directly by using a Gibbs sampler or the DA algorithm, in which all of the incomplete data, $\{n_{12}(i_{12})\}$ and $\{n_3(i_3)\}$, need to be imputed completely into $\{n_{12}(i_{123})\}$ and $\{n_3(i_{123})\}$, respectively. This is obviously unnecessary. Let $A = \{3\}$, $B = \{1\}$ and $C = \{2\}$. Then $(A, B, C)$ is a lossless decomposition of $G$, and the graph $G$ with $T$ can be losslessly decomposed into two subgraphs, $G_{\tilde{A}}$ with $T$ and $G_{\tilde{B}}$ with $T^B$, where $\tilde{A} = A \cup C = \{2,3\}$, $\tilde{B} = B \cup C = \{1,2\}$ and $T^B = \{\{1,2,3\}, \{1,2\}\}$. Thus the joint posterior distribution can be factorized into
$$f(\{p(i_{23})\}, \{p(i_{1 \mid 2})\} \mid T) = f(\{p(i_{23})\} \mid T)\, f(\{p(i_{1 \mid 2})\} \mid T^B)$$
and
$$f(\{p(i_{1 \mid 2})\} \mid T) = f(\{p(i_{1 \mid 2})\} \mid T^B).$$
The data pattern $T^B$ is a complete data pattern for the posterior distribution of $\{p(i_{1 \mid 2})\}$, and thus we have
$$f(\{p(i_{1 \mid 2})\} \mid T^B) = \prod_{i_2} D(\{p(i_{1 \mid 2})\} \mid \{\alpha(i_{12}) + n_{123}(i_{12}) + n_{12}(i_{12})\}).$$
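As a small numerical illustration of this complete-data update (not from the paper; the counts and the choice $\alpha(i_{12}) = 2$, corresponding to $\alpha(i_V) = 1$ for every cell, are assumptions), the Dirichlet parameters can be assembled directly from the marginal counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative counts for binary variables (assumed data, not from the paper).
n123 = np.array([[[5, 3], [2, 4]],
                 [[1, 6], [7, 2]]])          # n123[i1, i2, i3]: fully observed group
n12 = np.array([[4, 3], [5, 2]])             # n12[i1, i2]: group with X3 missing
alpha12 = np.full((2, 2), 2.0)               # alpha(i12) = sum over i3 of alpha(i123) = 2

# For each i2, draw {p(i1 | i2)} from Dirichlet(alpha(i12) + n123(i12) + n12(i12)).
params = alpha12 + n123.sum(axis=2) + n12    # indexed [i1, i2]
p1g2 = np.column_stack([rng.dirichlet(params[:, i2]) for i2 in range(2)])
print(p1g2)                                  # column i2 holds p(i1 = 0 | i2), p(i1 = 1 | i2)
```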

The posterior distribution $f(\{p(i_{23})\} \mid T)$ cannot be further factorized by Theorem 1. To calculate $f(\{p(i_{23})\} \mid T)$, we can also use an ordinary Gibbs sampler or the DA algorithm, in which the incomplete data, both $\{n_{12}(i_2)\}$ and $\{n_3(i_3)\}$, need to be augmented to "complete" data, that is, to $\{n_{12}(i_{23})\}$ and $\{n_3(i_{23})\}$, respectively. According to the condition of Theorem 1, however, we need to augment only one of them, say $\{n_3(i_3)\}$ to $\{n_3(i_{23})\}$. Then the observed data pattern $T$ is augmented to $\tilde{T} = \{\{1,2,3\}, \{1,2\}, \{2,3\}\}$, and we get
$$f(\{p(i_{23})\} \mid T) = \sum_{M(n_3(i_{23})\,:\,n_3(i_3))\ \forall i_3} f(\{p(i_{23})\} \mid \{n_{123}(i_{23})\}, \{n_{12}(i_2)\}, \{n_3(i_{23})\})\, f(\{n_3(i_{23})\} \mid T),$$
where $M(n_3(i_{23}) : n_3(i_3))\ \forall i_3$ denotes a sequence of summations over all possible $n_3(i_{23})$ given $n_3(i_3)$ for all $i_3$. By Theorem 1 and $p(i_{23}) = p(i_{3 \mid 2})\, p(i_2)$, $f(\{p(i_{23})\} \mid \{n_{123}(i_{23})\}, \{n_{12}(i_2)\}, \{n_3(i_{23})\})$ can be transformed and then factorized as
$$f(\{p(i_2)\}, \{p(i_{3 \mid 2})\} \mid \{n_{123}(i_{23})\}, \{n_{12}(i_2)\}, \{n_3(i_{23})\}) = f(\{p(i_2)\} \mid \{n_{123}(i_2) + n_{12}(i_2) + n_3(i_2)\})\, f(\{p(i_{3 \mid 2})\} \mid \{n_{123}(i_{23}) + n_3(i_{23})\}).$$
Since each factor on the right-hand side is a posterior with complete data, we get



$$f(\{p(i_2)\}, \{p(i_{3 \mid 2})\} \mid \{n_{123}(i_{23})\}, \{n_{12}(i_2)\}, \{n_3(i_{23})\}) = D(\{p(i_2)\} \mid \{\alpha(i_2) + n_{123}(i_2) + n_{12}(i_2) + n_3(i_2)\}) \times \prod_{i_2} D(\{p(i_{3 \mid 2})\} \mid \{\alpha(i_{23}) + n_{123}(i_{23}) + n_3(i_{23})\}).$$

The predictive density of $\{n_3(i_{23})\}$ is
$$f(\{n_3(i_{23})\} \mid T) = \int f(\{n_3(i_{23})\} \mid \{p(i_{23})\}, T)\, f(\{p(i_{23})\} \mid T)\, d\{p(i_{23})\}.$$
Thus the posterior distribution $f(\{p(i_{23})\} \mid T)$ can be calculated by using the following PI algorithm:

1. Partial imputation step: Generate $\{n_3(i_{23})\}$ from $f(\{n_3(i_{23})\} \mid \{p^{(t)}(i_{23})\}, \{n_3(i_3)\})$.
2. Posterior step: Draw $\{p^{(t+1)}(i_2)\}$ and $\{p^{(t+1)}(i_{3 \mid 2})\}$ from the current estimate of $f(\{p(i_2)\}, \{p(i_{3 \mid 2})\} \mid T)$:
$$f(\{p(i_2)\}, \{p(i_{3 \mid 2})\} \mid T) = D(\{p(i_2)\} \mid \{\alpha(i_2) + n_{123}(i_2) + n_{12}(i_2) + n_3(i_2)\}) \times \prod_{i_2} D(\{p(i_{3 \mid 2})\} \mid \{\alpha(i_{23}) + n_{123}(i_{23}) + n_3(i_{23})\}).$$
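The following Python sketch runs the two steps above for a binary $2 \times 2 \times 2$ version of Example 4. It is our own illustration, not code from the paper: the counts `n123`, `n12`, `n3` and the uniform hyper Dirichlet hyperparameters $\alpha(i_V) = 1$ are assumptions. Only $\{n_3(i_3)\}$ is imputed to $\{n_3(i_{23})\}$; the factor $\{p(i_{1 \mid 2})\}$ needs no imputation and is handled in closed form as shown earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts for binary X1, X2, X3 (illustrative data, not from the paper).
n123 = np.array([[[5, 3], [2, 4]],
                 [[1, 6], [7, 2]]])            # n123[i1, i2, i3]: fully observed group
n12 = np.array([[4, 3], [5, 2]])               # n12[i1, i2]: group with X3 missing
n3 = np.array([6, 9])                          # n3[i3]: group with only X3 observed

# Hyper Dirichlet hyperparameters alpha(i_V) = 1 for every cell (assumption);
# alpha(i2) and alpha(i23) below are the corresponding marginal summations.
alpha123 = np.ones((2, 2, 2))
alpha2 = alpha123.sum(axis=(0, 2))             # alpha(i2)
alpha23 = alpha123.sum(axis=0)                 # alpha(i2, i3)

# Marginal counts that stay fixed over the iterations.
n123_2 = n123.sum(axis=(0, 2))                 # n123(i2)
n123_23 = n123.sum(axis=0)                     # n123(i2, i3)
n12_2 = n12.sum(axis=0)                        # n12(i2)

# PI algorithm for f({p(i2, i3)} | T): only n3(i3) is imputed to n3(i2, i3).
p2, p3g2 = np.full(2, 0.5), np.full((2, 2), 0.5)   # p(i2) and p(i3 | i2), rows indexed by i2
draws = []
for t in range(5000):
    # 1. Partial imputation step: split n3(i3) over i2 using p(i2 | i3) proportional to p(i2) p(i3 | i2).
    n3_23 = np.zeros((2, 2))
    for i3 in range(2):
        w = p2 * p3g2[:, i3]
        n3_23[:, i3] = rng.multinomial(n3[i3], w / w.sum())
    # 2. Posterior step: complete-data Dirichlet draws for {p(i2)} and {p(i3 | i2)}.
    p2 = rng.dirichlet(alpha2 + n123_2 + n12_2 + n3_23.sum(axis=1))
    p3g2 = np.array([rng.dirichlet(alpha23[i2] + n123_23[i2] + n3_23[i2])
                     for i2 in range(2)])
    draws.append(p2[:, None] * p3g2)           # current draw of p(i2, i3) = p(i2) p(i3 | i2)

print("posterior mean of p(i2, i3):\n", np.mean(draws[1000:], axis=0))
```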

6. Efficiency of the partial imputation algorithm

In this section, we examine the efficiency gains obtained by decomposing a model and by using the PI algorithm. It is obvious that a decomposition of a large model into smaller ones can remove a large amount of unnecessary data imputation. We shall show below that reducing data imputation can also speed up the convergence of a Gibbs sampler. First, we give a definition of the rate of convergence; then we present some results on the rate of convergence of a Gibbs sampler and, finally, we compare the rates of convergence of the PI algorithm and of Tanner and Wong's (1987) DA algorithm.

Let $X = (x_1, \ldots, x_m)$ be a random variable with $m$ components, each of which can be multidimensional. A Markov chain, $X^{(0)}, X^{(1)}, \ldots$, is constructed with transition function
$$K(X^{(t)}, X^{(t+1)}) = \prod_{i=1}^{m} \pi(x^{(t+1)}_i \mid x^{(t+1)}_1, \ldots, x^{(t+1)}_{i-1}, x^{(t)}_{i+1}, \ldots, x^{(t)}_m).$$

Let $\pi(X)$ denote the equilibrium distribution of the Markov chain. The Pearson $\chi^2$ distance, $d_{\chi^2}(p, \pi)$, from a density $p(X)$ to $\pi(X)$ is defined as
$$d^2_{\chi^2}(p, \pi) = \mathrm{var}_\pi\!\left[\frac{p(X)}{\pi(X)}\right] = \int \frac{p^2(X)}{\pi(X)}\, dX - 1.$$

Definition 4. The rate $\rho$ of convergence of a Gibbs sampler is the minimum number $r \in [0, 1]$ such that
$$d^2_{\chi^2}(p_t, \pi) \leq r^t\, d^2_{\chi^2}(p_0, \pi),$$
where $p_0$ is the initial distribution and $p_t$ denotes the distribution at the $t$th iteration of the Gibbs sampler.
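For a finite state space the integral reduces to a sum, so the squared $\chi^2$ distance can be computed directly; a minimal sketch (our own, assuming both densities are given as arrays on the same support):

```python
import numpy as np

def chi2_distance_sq(p, pi):
    """Squared Pearson chi-square distance d^2(p, pi) = sum_x p(x)^2 / pi(x) - 1,
    i.e. the variance under pi of the density ratio p(X)/pi(X)."""
    p, pi = np.asarray(p, dtype=float), np.asarray(pi, dtype=float)
    return float(np.sum(p**2 / pi) - 1.0)

pi = np.array([0.5, 0.3, 0.2])        # equilibrium distribution (assumed)
p0 = np.array([0.8, 0.1, 0.1])        # an initial distribution
print(chi2_distance_sq(p0, pi))       # positive; it is 0 if and only if p = pi
```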



Let $F$ denote the forward operator of the Gibbs sampler,
$$F t(X) = \int K(X, Y)\, t(Y)\, dY.$$
Define $L^2(\pi) = \{t(X) \mid \mathrm{E}_\pi[t(X)] = 0,\ \mathrm{var}_\pi[t(X)] < \infty\}$ with the inner product $\langle t(X), s(X) \rangle = \mathrm{E}_\pi[t(X)\, s(X)]$, which is a Hilbert space. The norm of the operator $F$ is defined as $\|F\| = \sup \|F t(X)\|$, with the supremum taken over all functions with $\mathrm{E}_\pi(t^2) = 1$. Let $r$ denote the spectral radius of $F$. Assume that the regularity conditions hold which have been used by many researchers; see conditions (a)-(d) of Liu et al. (1995) for details. Under this assumption, Liu et al. (1995) showed the following lemma.

Lemma 1. The spectral radius of $F$ is strictly less than 1, and $d_{\chi^2}(p_n, \pi) \leq \|F^n\|\, d_{\chi^2}(p_0, \pi)$, where $p_0(x)$ is an initial distribution of the systematic scan Gibbs sampler and $p_n$ is the distribution obtained at the $n$th iteration of the Gibbs sampler.

Using Lemma 1, we show below that the spectral radius $r$ of $F$ is exactly the rate $\rho$ of convergence of the Gibbs sampler.

Lemma 2. Let $r$ denote the spectral radius of $F$. Then $d_{\chi^2}(p_n, \pi) \leq r^n\, d_{\chi^2}(p_0, \pi)$, and $r$ is the rate of convergence of the Gibbs sampler.

Proof. According to Yosida's (1980) result that $\lim_{n \to \infty} \|F^n\|^{1/n} = r$, we immediately get from Lemma 1 that $d_{\chi^2}(p_n, \pi) \leq r^n\, d_{\chi^2}(p_0, \pi)$. Now we show that $r$ is the minimum number that makes the inequality hold, i.e. the rate of convergence. Suppose that there exists an $r_1$ ($0 \leq r_1 < r$) such that $d_{\chi^2}(p_n, \pi) \leq r_1^n\, d_{\chi^2}(p_0, \pi)$. Then we would have $d_{\chi^2}(p_n, \pi) \leq r_1^n\, d_{\chi^2}(p_0, \pi) < r^n\, d_{\chi^2}(p_0, \pi)$. Setting $p_0 = \pi$ gives $0 < 0$, a contradiction. Thus there does not exist such an $r_1$ less than $r$.

Now we compare the rates of convergence of the DA algorithm and the PI algorithm. For simplicity, let $x_1$, $x_2$ and $x_3$ denote the parameter vector $\theta$, $Y_{\mathrm{mis1}}$ and $Y_{\mathrm{mis2}}$, respectively, and let $X = (x_1, x_2, x_3)$ be a random variable with these three components. The DA algorithm can be regarded as a two-component Gibbs sampler, and the PI algorithm as a collapsed Gibbs sampler. Their schemes of systematic scan are, respectively,
$$F_{\mathrm{DA}}: x_1 \to (x_2, x_3) \quad \text{and} \quad F_{\mathrm{PI}}: x_1 \to x_2.$$
According to Liu's (1994) results on the collapsed Gibbs sampler, it is obvious that $\|F_{\mathrm{PI}}\| \leq \|F_{\mathrm{DA}}\|$. Let $\rho_{\mathrm{PI}}$ and $\rho_{\mathrm{DA}}$ denote the rates of convergence of the PI algorithm and the DA algorithm, respectively. Then we immediately obtain the following result.

Theorem 2. The PI algorithm converges at least as fast as the DA algorithm, i.e. $\rho_{\mathrm{PI}} \leq \rho_{\mathrm{DA}}$.
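Theorem 2 can be checked numerically on the $\{p(i_{23})\}$ sub-problem of Example 4. The sketch below (our own experiment, not from the paper; the counts and the uniform Dirichlet prior are assumptions) runs the DA scheme $x_1 \to (x_2, x_3)$, which imputes both $\{n_{12}(i_{23})\}$ and $\{n_3(i_{23})\}$, and the PI scheme $x_1 \to x_2$, which imputes only $\{n_3(i_{23})\}$, and compares the lag-1 autocorrelation of one functional of each chain; the collapsed PI chain typically mixes at least as fast, consistent with $\rho_{\mathrm{PI}} \leq \rho_{\mathrm{DA}}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sub-problem of Example 4: posterior of {p(i2, i3)} (illustrative counts, alpha(i23) = 1 assumed).
n123_23 = np.array([[6, 9], [9, 6]])     # n123(i2, i3): fully observed group, margined over i1
n12_2 = np.array([9, 7])                 # n12(i2): group with X3 missing
n3 = np.array([6, 9])                    # n3(i3): group with only X3 observed
alpha23 = np.ones((2, 2))
alpha2 = alpha23.sum(axis=1)             # induced Dirichlet parameters for p(i2)

def impute_n3(p23):
    """Split n3(i3) over i2 using the current p(i2 | i3)."""
    n3_23 = np.zeros((2, 2))
    for i3 in range(2):
        w = p23[:, i3]
        n3_23[:, i3] = rng.multinomial(n3[i3], w / w.sum())
    return n3_23

def da_step(p23):
    """DA / two-component Gibbs: impute BOTH n12(i2) -> n12(i2, i3) and n3(i3) -> n3(i2, i3)."""
    n12_23 = np.array([rng.multinomial(n12_2[i2], p23[i2] / p23[i2].sum()) for i2 in range(2)])
    n3_23 = impute_n3(p23)
    return rng.dirichlet((alpha23 + n123_23 + n12_23 + n3_23).ravel()).reshape(2, 2)

def pi_step(p23):
    """PI / collapsed Gibbs: impute ONLY n3(i3); n12(i2) enters the p(i2) update directly."""
    n3_23 = impute_n3(p23)
    p2 = rng.dirichlet(alpha2 + n123_23.sum(axis=1) + n12_2 + n3_23.sum(axis=1))
    p3g2 = np.array([rng.dirichlet(alpha23[i2] + n123_23[i2] + n3_23[i2]) for i2 in range(2)])
    return p2[:, None] * p3g2

def lag1_autocorr(x):
    x = np.asarray(x) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

for name, step in [("DA", da_step), ("PI", pi_step)]:
    p23, chain = np.full((2, 2), 0.25), []
    for _ in range(20000):
        p23 = step(p23)
        chain.append(p23[0, 0])
    print(name, "lag-1 autocorrelation of p(0,0):", round(lag1_autocorr(chain[2000:]), 3))
```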



For a large graphical model with incomplete data, a large amount of unnecessary data imputation may be avoided by decomposing the model into smaller models and then using the PI algorithm. Thus a decomposition and the PI algorithm can not only reduce the amount of data imputation but also speed up convergence.

Acknowledgements

This research was supported by NSFC, RFDP and NEYSF.

References

Dawid, A.P., Lauritzen, S.L., 1993. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21, 1272–1317.
Dickey, J.M., Jiang, J.M., Kadane, J.B., 1987. Bayesian methods for censored categorical data. J. Amer. Statist. Assoc. 82, 773–781.
Didelez, V., Pigeot, I., 1998. Maximum likelihood estimation in graphical models with missing values. Biometrika 85, 960–966.
Edwards, D., 1995. Introduction to Graphical Modelling. Springer, New York.
Frydenberg, M., Lauritzen, S.L., 1989. Decomposition of maximum likelihood in mixed graphical interaction models. Biometrika 76, 539–555.
Geng, Z., Wan, K., Tao, F., 2000. Mixed graphical models with missing data and the partial imputation EM algorithm. Scand. J. Statist. 27, 433–444.
Jensen, F.V., Olesen, K.G., Andersen, S.K., 1990. An algebra of Bayesian belief universes for knowledge-based systems. Networks 20, 637–659.
Lauritzen, S.L., 1996. Graphical Models. Oxford University Press, Oxford.
Lauritzen, S.L., Spiegelhalter, D.J., 1988. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Roy. Statist. Soc. Ser. B 50, 157–224.
Lauritzen, S.L., Wermuth, N., 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17, 31–57.
Liu, J.S., 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Amer. Statist. Assoc. 89, 958–965.
Liu, J.S., Wong, W.H., Kong, A., 1995. Covariance structure and convergence rate of the Gibbs sampler with various scans. J. Roy. Statist. Soc. Ser. B 57, 157–169.
Rubin, D.B., 1987. Multiple Imputation for Nonresponse in Surveys. Wiley, New York.
Tanner, M.A., 1991. Tools for Statistical Inference: Observed Data and Data Augmentation Methods. Lecture Notes in Statistics, Vol. 67. Springer, New York.
Tanner, M.A., Wong, W.H., 1987. The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82, 528–550.
Whittaker, J., 1990. Graphical Models in Applied Multivariate Statistics. Wiley, Chichester.
Yosida, K., 1980. Functional Analysis, 6th Edition. Springer, New York.