Journal of Statistical Planning and Inference 79 (1999) 259–287
www.elsevier.com/locate/jspi

Lattice conditional independence models for contingency tables with non-monotone missing data patterns¹

Michael D. Perlman∗, Lang Wu

Department of Statistics, Box 354322, University of Washington, Seattle, WA 98195-4322, USA

Received 30 October 1997; accepted 15 December 1998
Abstract

In the analysis of non-monotone missing data patterns in multinomial distributions for contingency tables, it is known that explicit MLEs of the unknown parameters cannot be obtained. Iterative procedures such as the EM-algorithm are therefore required to obtain the MLEs. These iterative procedures, however, may present several potential difficulties. Andersson and Perlman [Ann. Statist. 21 (1993) 1318–1358] introduced lattice conditional independence (LCI) models for multivariate normal distributions, which can be applied to the analysis of non-monotone missing observations in continuous data (Andersson and Perlman, Statist. Probab. Lett. 12 (1991) 465–486). In this paper, we show that LCI models may also be applied to the analysis of categorical data with non-monotone missing data patterns. Under a parsimonious set of LCI assumptions naturally determined by the observed data pattern, the likelihood function for the observed data can be factored as in the monotone case and explicit MLEs can be obtained for the unknown parameters. Furthermore, the LCI assumptions can be tested by explicit likelihood ratio tests. © 1999 Elsevier Science B.V. All rights reserved.
1. Introduction

Missing observations in categorical data occur frequently in practice. The literature has been concerned with the estimation of the parameters in contingency tables with missing data (Chen and Fienberg, 1974; Hocking and Oxspring, 1974; Fuchs, 1982; Little and Rubin, 1987; Williamson and Haber, 1994). It is known that, for monotone missing data, explicit (or closed-form) MLEs can be obtained by factoring the likelihood of the observed data into a product of likelihoods with distinct parameters (Rubin, 1974; Little and Rubin, 1987). For non-monotone missing data, however, such factorization is generally impossible, so iterative procedures such as the EM-algorithm are required to obtain the MLEs (Fuchs, 1982; Little and Rubin, 1987).
∗ Corresponding author.
E-mail addresses: [email protected] (M.D. Perlman), [email protected] (L. Wu).
1 This research was supported in part by the U.S. National Science Foundation.
Fig. 1. In each column, a '1' (or '2', or '3') indicates that the first (or second, or third) component of that column vector x_i is observed, while a blank indicates a missing observation.
The EM algorithm may present potential difficulties (discussed later in this section). In this paper, we will show that, under a minimal set of lattice conditional independence (LCI) restrictions, the likelihood function for non-monotone missing data factors again as in the monotone case. Thus explicit MLEs for non-monotone missing data can also be obtained. Furthermore, the LCI assumptions can be tested by likelihood ratio (LR) tests.

Before describing the general approach, we first consider a simple example that illustrates the basic points. Let (X_1, X_2, X_3) be categorical variables. Suppose that, for a sample of size n, each observation is classified according to the three categories. This classification then yields a single 3-way contingency table if the data are complete, i.e., if each observation is fully observed (Fig. 1(a)). Frequently in practice, however, some observations can only be partially classified because the values of one or more variables may be missing. Thus the observed data patterns may assume forms such as those in Fig. 1(b) and (c). After permuting columns and combining identical columns, the patterns in Fig. 1 can be represented as S_a = {123}, S_b = {1, 12, 123}, and S_c = {12, 13, 123}, respectively. Each pattern is specified by the class of subsets of indices determined by its columns. We call S_a, S_b, and S_c the observed data patterns.

The missing data pattern in Fig. 1(b) is called monotone or nested. In general, statistical inference for a monotone missing data pattern is relatively simple, since the likelihood function for the observed data can be factored as a product of conditional likelihood functions with distinct parameters in the sense of Rubin (1974). MLEs of these distinct parameters can be obtained explicitly by standard (complete data) methods. Explicit MLEs of the original parameters can then be reconstructed (Rubin, 1974; Fuchs, 1982; Little and Rubin, 1987, pp. 172–181). For high-dimensional contingency tables, however, the vast majority of missing data patterns are non-monotone, such as that in Fig. 1(c). In this case, either the likelihood function cannot be factored simply as in the monotone case, or the parameters are not distinct even if the likelihood function factors. Thus explicit MLEs of the parameters cannot be obtained (Fuchs, 1982; Little and Rubin, 1987, p. 181). In practice, common approaches to this problem are: (i) the EM algorithm (Dempster et al., 1977), which is an iterative algorithm; (ii) deleting all incomplete observations (the complete-data method), or deleting partial observations in order to obtain a monotone pattern. However, the EM algorithm
may not converge to a unique solution, the resulting estimates may depend upon the initial values, or its rate of convergence may be very slow (Murray, 1977; Fuchs, 1982; Diebolt, 1997; Smith, 1997). Estimates obtained by deleting observations usually suffer from reduced efficiency, especially when many observations need to be discarded.

Andersson and Perlman (1993) introduced lattice conditional independence (LCI) models for multivariate normal distributions, which can be applied to the analysis of non-monotone missing data problems (Andersson and Perlman, 1991). They show that every non-monotone observed data pattern naturally determines a set of LCI restrictions. Under these LCI restrictions, the desirable properties of the monotone case are retained. In particular, the MLEs of the parameters can be obtained explicitly under the LCI assumptions. Furthermore, the LCI assumptions can be tested by explicit likelihood ratio tests (Andersson and Perlman, 1995).

In the present paper, we show that LCI models may also be applied to the analysis of categorical data with non-monotone missing data patterns. As an example, consider an i.i.d. sample with the observed data pattern S = {12, 13, 123} (see Fig. 1(c)). Observations that are completely observed can be fully classified and yield a 3-way contingency table, while observations with only the first two categories observed, or only the first and the third categories observed, can only be partially classified and thus yield 2-way supplemental subtables. Let (x_1, x_2, x_3) denote a cell in the 3-way table, and let (x_1, x_2) or (x_1, x_3) denote cells in the 2-way supplemental tables. Let p(x_1, x_2, x_3) denote the cell probabilities in the 3-way table, and denote marginal and conditional probabilities in the usual ways: p(x_1), p(x_1, x_2), p(x_2 | x_1), etc. We assume that the missing data are missing at random in the sense of Rubin (1976), and that the sampling scheme is multinomial, although our results are also applicable under the Poisson or product-multinomial sampling schemes. We denote a random variable by X, and its observed value by x.

If we assume that X_2 ⊥ X_3 | X_1 (the LCI restriction determined by S), i.e., that X_2 and X_3 are conditionally independent given X_1, then p(x_2, x_3 | x_1) = p(x_2 | x_1) p(x_3 | x_1). Here the likelihood function for the observed data pattern S can be factored in the following form (beginning with the two supplemental tables):

  ∏ p(x_1, x_2) · ∏ p(x_1, x_3) · ∏ p(x_1, x_2, x_3)
    = ∏ p(x_1) p(x_2 | x_1) · ∏ p(x_1) p(x_3 | x_1) · ∏ p(x_1) p(x_2 | x_1) p(x_3 | x_1)
    = ∏ p(x_1) · ∏ p(x_2 | x_1) · ∏ p(x_3 | x_1).   (1)
It can be shown that the marginal and conditional parameters p(x_1), p(x_2 | x_1), and p(x_3 | x_1) in (1) are distinct (≡ variation-independent) in the sense of Rubin (1974). Thus the likelihood function for the observed data can be maximized by maximizing each factor in the last expression of (1) separately. The MLEs of the unconstrained parameters p(x_1), p(x_2 | x_1), and p(x_3 | x_1) can be obtained explicitly by standard complete-data methods (see, e.g., Agresti, 1990). Then explicit MLEs of the original (constrained) parameter p(x_1, x_2, x_3) can be reconstructed (see Section 3). We shall call the parameters (p(x_1), p(x_2 | x_1), p(x_3 | x_1)) the K-parameters, where K is the lattice generated from the observed data pattern S (see Definition 2.2).

Rubin (1974) gave a factorization of the likelihood function similar to Eq. (1) for the observed data pattern S_0 = {12, 13}. Essentially, he noted that if it is assumed that X_2 ⊥ X_3 | X_1, then the likelihood function for the observed data can be factored into a product of likelihood functions with distinct parameters. Earlier, for multivariate normal data with non-monotone missing observations, Lord (1955) and Anderson (1957) also considered the observed data pattern S_0 and suggested a possible factorization of the likelihood similar to Eq. (1), but they did not relate the factorization to a conditional independence assumption.

Geng (1988) considered hierarchical log-linear models for a multidimensional contingency table with missing data. He showed that for special missing data patterns, the estimation problems can be reduced to those for a collection of lower-dimensional tables. The reduction depends on both the missing data pattern and the log-linear model. In terms of graphical models, this reduction is achieved by assuming that the graph is decomposable and that the form of the observed data pattern allows a factorization of the likelihood function into a product of likelihood functions for lower-dimensional tables (Geng et al., 1997; Lauritzen, 1996, Proposition 3.18). However, such a reduction is not possible for all missing data patterns. Furthermore, even when such a reduction is possible, an iterative algorithm still may be required to obtain the MLEs, since the factors may not correspond to saturated models. Geng et al. (1996, 1997) proposed a partial imputation method for when the observed data pattern does not allow such a reduction.

In this paper, we present a general approach to the non-monotone missing data problem using LCI models. We will show that every observed data pattern naturally determines a lattice which in turn determines a minimal set of LCI restrictions. If we impose these LCI restrictions on the saturated model, the likelihood function can be factored as in the monotone case and explicit MLEs of the parameters can be obtained. We shall argue that in some cases such LCI assumptions may be reasonable in practice, or alternatively, that these LCI assumptions can be tested by the likelihood ratio test derived in Section 4. If the LCI model appears too restrictive, we may obtain less restrictive LCI models by deleting additional parts of the data.

1.1. Notation

In practice, the cell probabilities and cell counts in a contingency table are usually denoted by subscripts. For example, the cell probabilities and cell counts in a three-way table are usually denoted by (p_ijk) and (n_ijk), respectively. This notation is, however, inconvenient when dealing with the general case (especially for high-dimensional tables) and is impractical for the theoretical developments. Therefore, we will use an alternative notation in this paper for theoretical development. For example, the cells, the cell probabilities, and the cell counts in a three-way table will be denoted
simply by x, p(x), and n(x), respectively, with x := (x_1, x_2, x_3). For the specific tables in Section 6, we return to the conventional notation, since the general notation just described performs less elegantly in those situations.

1.2. Outline of the paper

In Section 2, we introduce the general theory of LCI models for complete categorical data and define the K-parameters for a general lattice K. Under the LCI assumptions, we show that the likelihood function for complete data can be factored into a product of conditional likelihood functions involving distinct K-parameters. Each of these conditional likelihood functions has the form of the likelihood function for a saturated model for a table of lower dimension. Thus the MLEs of the original parameters can be derived explicitly under the LCI model. In Section 3, the minimal LCI model determined by an arbitrary missing data pattern is defined and analyzed. The likelihood ratio (LR) test for testing the LCI assumptions based on monotone data is derived in Section 4. In Section 5, we discuss some related issues in detail. In Section 6 we present several examples to illustrate the general theory and its application to a data set.

2. LCI models for complete categorical data

2.1. LCI models – the general case

Consider a multivariate statistical model that consists of a family of probability distributions P defined on a product space X := ×(X_i | i ∈ I), where I is a finite index set and each X_i is a measurable space (either discrete or continuous). For any subset K ⊆ I, define X_K := ×(X_i | i ∈ K). For x := (x_i | i ∈ I) ∈ X, define x_K := (x_i | i ∈ K) ∈ X_K to be the coordinate projection of x onto X_K. Let K be a ring of subsets (hence a distributive lattice) of the index set I, that is, K is closed under finite unions and intersections. We shall always assume that I, ∅ ∈ K.

Definition 2.1. The LCI model determined by a lattice K is the set of all probability distributions P on X such that for X ∼ P ∈ P,

  X_L ⊥ X_M | X_{L∩M},  ∀ L, M ∈ K,   (2)

that is, X_L and X_M are conditionally independent given X_{L∩M} under P.

Note that, if L ∩ M = ∅, Eq. (2) reduces to X_L ⊥ X_M, and if L ⊆ M, (2) is trivially satisfied. Note also that Eq. (2) is equivalent to the more conventional form X_{L\(L∩M)} ⊥ X_{M\(L∩M)} | X_{L∩M}.

To characterize LCI models in terms of probability densities, we first review the following basic lattice-theoretic concepts.
The reader should consult Andersson and Perlman (1991, 1993) for a more detailed treatment, including examples; also see Section 6 of the present paper, especially Examples 6.5 and 6.6. For each subset K ∈ K, define

  ⟨K⟩ := ∪(K′ ∈ K | K′ ⊂ K),
  [K] := K \ ⟨K⟩,
  J(K) := {K ∈ K | K ≠ ∅, [K] ≠ ∅},
  [J(K)] := {[K] | K ∈ J(K)}.

(It is important to keep in mind that ⊂ signifies proper inclusion in the definition of ⟨K⟩.) Then J(K) is a partially ordered set (poset) under the inclusion ordering ⊆ and is called the set of join-irreducible elements of K (cf. Davey and Priestley, 1990). Note that K = [K] ∪̇ ⟨K⟩ (∪̇ denotes disjoint union) and J(K) is uniquely determined by K. Conversely, J(K) uniquely determines K: for all L ∈ K,

  L = ∪{[K] | K ∈ J(K), K ⊆ L}   (3)

(see Proposition 2.1 of Andersson and Perlman, 1993). In particular,

  I = ∪̇{[K] | K ∈ J(K)},   (4)

so for each x ∈ X we have

  x = (x_[K] | K ∈ J(K)).   (5)
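These lattice operations are straightforward to compute mechanically. The following sketch (ours, not part of the paper; index sets are represented as Python frozensets) generates the ring K from an observed data pattern S by closing under pairwise unions and intersections, and then extracts ⟨K⟩, [K], and the join-irreducible elements J(K):

from itertools import combinations

def generate_lattice(S):
    """Close the observed data pattern S (a collection of index sets)
    under pairwise unions and intersections; include the empty set."""
    K = {frozenset()} | {frozenset(s) for s in S}
    changed = True
    while changed:
        changed = False
        for A, B in combinations(list(K), 2):
            for C in (A | B, A & B):
                if C not in K:
                    K.add(C)
                    changed = True
    return K

def join_irreducibles(K):
    """Map each K in J(K) to (<K>, [K]), where <K> is the union of the
    lattice elements properly contained in K and [K] = K minus <K>."""
    out = {}
    for k in K:
        if not k:
            continue
        angle = frozenset().union(*(m for m in K if m < k))
        bracket = k - angle
        if bracket:
            out[k] = (angle, bracket)
    return out

# The non-monotone pattern of Fig. 1(c): S = {12, 13, 123}.
S = [{1, 2}, {1, 3}, {1, 2, 3}]
JK = join_irreducibles(generate_lattice(S))
for k in sorted(JK, key=sorted):
    angle, bracket = JK[k]
    print(sorted(k), "-> <K> =", sorted(angle), " [K] =", sorted(bracket))
# Output: J(K) = {1, 12, 13}, with <1> = {} and <12> = <13> = {1}.

Steel and Wood (1993), cited in Remark 3.3 below, give a polynomial-time algorithm that finds J(K) directly from S; the brute-force closure above is adequate only for small index sets.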
Andersson et al. (1995, 1997) give the following characterization of LCI models:

Proposition 2.1 (Andersson et al., 1995, 1997). Suppose that the distribution P on X is absolutely continuous with respect to the product measure on X, and let f be the density function. Then the following three conditions are equivalent:
(i) P is a member of the LCI model determined by K;
(ii) for almost every x ∈ X,

  f(x) = ∏(f(x_[K] | x_⟨K⟩) | K ∈ J(K));

(iii) for almost every x ∈ X and every L ∈ K,

  f(x_L) = ∏(f(x_[K] | x_⟨K⟩) | K ∈ J(K), K ⊆ L).

2.2. LCI models for an I-way contingency table

We denote an I-way contingency table by C := ×(J_i | i ∈ I), where I indexes a finite set of categories and J_i is the set of levels of the i-th category. Denote the family of positive probability distributions on C by P(I):

  P(I) := {P ≡ (p(x) | x ∈ C) | p(x) > 0, ∑ p(x) = 1}.

Here the assumption that p(x) > 0 for all x ∈ C ensures the existence of all the conditional probabilities and of the MLE of P, which will be derived later.
Note that P(I) is the saturated model for the contingency table C, i.e., no restrictions are imposed on P(I) except that the cell probabilities are positive and sum to 1. For any cell x ≡ (x_i | i ∈ I) ∈ C and any subset K ⊆ I, let x_K := (x_i | i ∈ K) denote the coordinate projection of x onto the K-way marginal table C_K := ×(J_i | i ∈ K). For any P ∈ P(I) and any K ⊆ I, the marginal probability distribution P_K ≡ (p(x_K) | x_K ∈ C_K) ∈ P(K) is given by

  p(x_K) := ∑_{x_{I\K} ∈ C_{I\K}} p(x),  x_K ∈ C_K,   (6)

where P(K) = {P_K ≡ (p(x_K) | x_K ∈ C_K) | p(x_K) > 0, ∑ p(x_K) = 1} for any K ⊂ I.

Suppose that X ∼ P ∈ P(I). As in Section 2.1, let K be a ring (lattice) of subsets of I. For each K ∈ J(K), the conditional probability distribution of X_[K] given X_⟨K⟩ = x_⟨K⟩ is given by

  P_[K]^{x_⟨K⟩} ≡ (p(x_[K] | x_⟨K⟩) | x_[K] ∈ C_[K]) ∈ P([K]),

where

  p(x_[K] | x_⟨K⟩) = p(x_K)/p(x_⟨K⟩) = p(x_K) / ∑_{x_[K] ∈ C_[K]} p(x_K).

Definition 2.2. For P ∈ P(I), the family of conditional probabilities

  (P_[K]^{x_⟨K⟩} | x_⟨K⟩ ∈ C_⟨K⟩, K ∈ J(K))

is called the family of K-parameters of P.

Let P(K) denote the subfamily of P(I) obtained by imposing the LCI constraints (2) on the saturated model P(I). Then P(K) is called the LCI model for the I-way contingency table C. Note that P(I) = P({∅, I}). Proposition 2.1 can be re-stated as follows for the distributions P ∈ P(I).

Proposition 2.2. For P ∈ P(I), the following three conditions are equivalent:
(i) P ≡ (p(x) | x ∈ C) ∈ P(K);
(ii) for each x ∈ C,

  p(x) = ∏_{K ∈ J(K)} p(x_[K] | x_⟨K⟩);   (7)

(iii) for each L ∈ K and x_L ∈ C_L,

  p(x_L) = ∏(p(x_[K] | x_⟨K⟩) | K ∈ J(K), K ⊆ L).

Thus, under P(K), Eq. (7) states that each cell probability can be factored as the product of its K-parameters. The following result shows that the correspondence between P ∈ P(K) and its K-parameters is one-to-one, so the K-parameters uniquely determine P under the LCI model.
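As a concrete check of Eq. (7), the following sketch (ours, with hypothetical binary levels) builds a distribution in P(K) for the lattice K = {∅, 1, 12, 13, 123} generated by the pattern of Fig. 1(c), directly from a family of K-parameters p(x_1), p(x_2 | x_1), p(x_3 | x_1):

import itertools

# K-parameters for J(K) = {1, 12, 13} (hypothetical numbers):
p1   = {0: 0.4, 1: 0.6}                              # p(x1)
p2_1 = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}    # p(x2 | x1)
p3_1 = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.9, 1: 0.1}}    # p(x3 | x1)

# Eq. (7): each cell probability is the product of its K-parameters.
p = {(x1, x2, x3): p1[x1] * p2_1[x1][x2] * p3_1[x1][x3]
     for x1, x2, x3 in itertools.product((0, 1), repeat=3)}

# The result is a positive distribution satisfying X2 independent of X3 given X1.
assert all(v > 0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

This is exactly the surjectivity construction (8) used in the proof of Proposition 2.3 below.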
Proposition 2.3 (Factorization of the parameter space). The mapping

  λ: P(K) → ×(P([K])^{C_⟨K⟩} | K ∈ J(K)) =: Λ(K),
  P ↦ (P_[K]^{x_⟨K⟩} | x_⟨K⟩ ∈ C_⟨K⟩, K ∈ J(K)) =: λ(P),

is bijective.

Proof. The injectivity of λ is immediate from Eq. (7). We must show that λ is surjective. For any λ ≡ (λ_[K]^{x_⟨K⟩} | x_⟨K⟩ ∈ C_⟨K⟩, K ∈ J(K)) ∈ Λ(K), where λ_[K]^{x_⟨K⟩} := (λ(x_[K] | x_⟨K⟩) | x_[K] ∈ C_[K]) ∈ P([K]), define

  p̃(x) := ∏(λ(x_[K] | x_⟨K⟩) | K ∈ J(K)).   (8)

We claim that P̃ := (p̃(x) | x ∈ C) ∈ P(K). To prove this claim, we first show that P̃ ∈ P(I), i.e., P̃ is a positive probability distribution on C. Clearly, p̃(x) > 0 for each x ∈ C. Let K_1, ..., K_q be a never-decreasing listing of the elements of J(K), i.e., i < j ⇒ K_j ⊄ K_i, so i < j implies ⟨K_i⟩ ∩ [K_j] = ∅. Then p̃(x) can be written as

  p̃(x) = ∏_{i=1}^{q} λ(x_[K_i] | x_⟨K_i⟩).

Since the sets in [J(K)] := ([K_1], [K_2], ..., [K_q]) are disjoint and ∪_{i=1}^{q} [K_i] = I, it follows that

  ∑_{x ∈ C} p̃(x) = ∑_{x_[K_1] ∈ C_[K_1]} ··· ∑_{x_[K_q] ∈ C_[K_q]} ∏_{i=1}^{q} λ(x_[K_i] | x_⟨K_i⟩)
    = ∑_{x_[K_1] ∈ C_[K_1]} ··· ∑_{x_[K_{q−1}] ∈ C_[K_{q−1}]} ∏_{i=1}^{q−1} λ(x_[K_i] | x_⟨K_i⟩) · ∑_{x_[K_q] ∈ C_[K_q]} λ(x_[K_q] | x_⟨K_q⟩)
    = ∑_{x_[K_1] ∈ C_[K_1]} ··· ∑_{x_[K_{q−1}] ∈ C_[K_{q−1}]} ∏_{i=1}^{q−1} λ(x_[K_i] | x_⟨K_i⟩)
    = ··· = 1,

hence P̃ ∈ P(I). To see that in fact P̃ ∈ P(K), note that for each K ∈ J(K) and x_K ∈ C_K, by the construction of p̃(x) in Eq. (8) and by Eq. (3), we have

  p̃(x_K) = ∏(λ(x_[L] | x_⟨L⟩) | L ∈ J(K), L ⊆ K)
    = λ(x_[K] | x_⟨K⟩) ∏(λ(x_[L] | x_⟨L⟩) | L ∈ J(K), L ⊂ K)
    = λ(x_[K] | x_⟨K⟩) ∏(λ(x_[L] | x_⟨L⟩) | L ∈ J(K), L ⊆ ⟨K⟩)
    = λ(x_[K] | x_⟨K⟩) p̃(x_⟨K⟩),

hence

  p̃(x_[K] | x_⟨K⟩) ≡ p̃(x_K)/p̃(x_⟨K⟩) = λ(x_[K] | x_⟨K⟩).   (9)
Combining Eqs. (8) and (9), we have

  p̃(x) = ∏(p̃(x_[K] | x_⟨K⟩) | K ∈ J(K)),

so by Proposition 2.2, it follows that P̃ ∈ P(K).

2.3. Factorization of the likelihood function and the explicit MLEs under the LCI model P(K)

Let x^1, x^2, ..., x^n be an i.i.d. sample from C, where each x^j ≡ (x_i^j | i ∈ I) ∈ C. Suppose that each observation x^j is fully observed, so that it can be completely classified according to the I categories. This sample then yields a single I-way contingency table C. We denote the observed count in cell x of the I-way table C by n(x). The observed count in cell x_K of the K-way marginal table C_K is then given by

  n(x_K) = ∑_{x_{I\K} ∈ C_{I\K}} n(x).   (10)

The joint likelihood function for a multinomial model can be written as follows:

  ℓ(P) := ℓ(P; x^1, ..., x^n) = ∏_{x ∈ C} p(x)^{n(x)}.   (11)

It is well known that, under the saturated model P(I), when all n(x) > 0 the MLE of p(x) exists and is given by

  p̂(x) = n(x)/n,  x ∈ C.   (12)

Under the LCI model P ∈ P(K), by Proposition 2.2, the likelihood function can be factored as follows:

  ℓ_K(P) := ∏_{x ∈ C} p(x)^{n(x)}
    = ∏_{x ∈ C} ∏_{K ∈ J(K)} p(x_[K] | x_⟨K⟩)^{n(x)}
    = ∏_{K ∈ J(K)} ∏_{x ∈ C} p(x_[K] | x_⟨K⟩)^{n(x)}
    = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} ∏_{x_{I\K} ∈ C_{I\K}} p(x_[K] | x_⟨K⟩)^{n(x)}
    = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} p(x_[K] | x_⟨K⟩)^{n(x_K)}
    = ∏_{K ∈ J(K)} ∏_{x_⟨K⟩ ∈ C_⟨K⟩} ∏_{x_[K] ∈ C_[K]} p(x_[K] | x_⟨K⟩)^{n(x_K)}
    =: ∏_{K ∈ J(K)} ∏_{x_⟨K⟩ ∈ C_⟨K⟩} ℓ_{K,x_⟨K⟩}(P_[K]^{x_⟨K⟩}),   (13)

where

  ℓ_{K,x_⟨K⟩}(P_[K]^{x_⟨K⟩}) := ∏_{x_[K] ∈ C_[K]} p(x_[K] | x_⟨K⟩)^{n(x_K)}.   (14)

Note that for fixed K and x_⟨K⟩, ℓ_{K,x_⟨K⟩}(P_[K]^{x_⟨K⟩}) has the form of the likelihood function for the saturated model P([K]) based on a sample of size n(x_⟨K⟩).
Furthermore, by Propositions 2.2 and 2.3, the restrictions specified by the LCI model P(K) on the K-parameters p(x_[K] | x_⟨K⟩) appearing as the arguments of ℓ_{K,x_⟨K⟩}(·) are simply

  p(x_[K] | x_⟨K⟩) > 0,  ∑_{x_[K] ∈ C_[K]} p(x_[K] | x_⟨K⟩) = 1,   (15)

and the restrictions (15) are independent for each K ∈ J(K). Thus under the LCI model P(K), the likelihood function ℓ_K(P) is a product of the likelihood functions for distinct saturated models P([K]), K ∈ J(K), whose cell probabilities are the K-parameters p(x_[K] | x_⟨K⟩). Therefore, ℓ_K(P) can be maximized by maximizing each of the factors ℓ_{K,x_⟨K⟩}(·) separately, so by Eq. (12) the MLEs of these K-parameters are given by

  p̂(x_[K] | x_⟨K⟩) = n(x_K)/n(x_⟨K⟩),  x_[K] ∈ C_[K],   (16)

provided that each n(x_K) > 0. Note that n(x_∅) = n. Finally, by Eq. (7), Proposition 2.3, and Eq. (16), the explicit MLE of the original parameter p(x) under the LCI model P(K) is given by

  p̂(x) = ∏_{K ∈ J(K)} n(x_K)/n(x_⟨K⟩),  x ∈ C.   (17)

Furthermore, the maximum value of the likelihood function ℓ_K(P) under P(K) is given by

  ℓ_K(P̂) = ∏_{x ∈ C} p̂(x)^{n(x)} = ∏_{x ∈ C} ( ∏_{K ∈ J(K)} n(x_K)/n(x_⟨K⟩) )^{n(x)}.   (18)

Remark. The results in this section can be shown to agree with those presented in Lauritzen (1996), where the explicit MLE for a recursive graphical model is derived. In Section 5, we will briefly discuss the relation between LCI models and graphical models.
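Eq. (17) is immediate to implement for complete data. A minimal sketch (ours; a hypothetical 2×2×2 table, again with the lattice K = {∅, 1, 12, 13, 123}, so J(K) = {1, 12, 13}): each factor of the MLE is a ratio of two marginal counts.

import itertools

# Hypothetical complete-data counts n(x), cells indexed by (x1, x2, x3).
n = {c: 1 + 2 * sum(c) for c in itertools.product((0, 1), repeat=3)}
N = sum(n.values())

def marginal(x, axes):
    """n(x_K) of Eq. (10): sum n over the coordinates outside K."""
    return sum(v for c, v in n.items()
               if all(c[a] == x[a] for a in axes))

# (K, <K>) pairs for J(K) = {1, 12, 13}, with 0-based axes.
JK = [((0,), ()), ((0, 1), (0,)), ((0, 2), (0,))]

def p_hat(x):
    """Eq. (17): product over J(K) of n(x_K) / n(x_<K>), with n(x_empty) = N."""
    val = 1.0
    for K, hK in JK:
        val *= marginal(x, K) / (marginal(x, hK) if hK else N)
    return val

print(sum(p_hat(x) for x in n))   # the estimated cell probabilities sum to 1.0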
3. LCI models for categorical data with non-monotone missing observations

3.1. The LCI model determined by an arbitrary missing data pattern

Let y := (x^1, ..., x^n), x^j ∈ C, denote an I × N matrix of n i.i.d. discrete random vectors drawn from C, where N := {1, ..., n}. Suppose that some observations in y are fully classified, while others are only partially classified due to missing values in subsets of x^j for j ∈ N. We denote the observed data subvectors by x′_1, ..., x′_n. For each j ∈ N, let K_j ⊆ I be the subset of I such that the K_j-subvector of x^j is observed while the I\K_j-subvector of x^j is missing. It is assumed that K_j ≠ ∅ and ∪(K_j | j ∈ N) = I, i.e., no column of the data matrix y is completely missing and every category in I is observed at least once. Let D(I) denote the set of all subsets of I. For each K ∈ D(I), let N_K represent the index set of all observed data for which the variables in K are observed but all others are missing, i.e., N_K := {j ∈ N | K_j = K}. Note that

  ∪(N_K | K ∈ D(I)) = N   (19)

and

  ∪(K | K ∈ D(I), N_K ≠ ∅) = I.   (20)

Then the observed data pattern S is defined to be S := {K | K ∈ D(I), N_K ≠ ∅}, that is, S is the pattern of partially observed column vectors in the I × N data matrix y. Note that Eq. (20) can be rewritten as ∪(K | K ∈ S) = I.

The observed data pattern S is called monotone if S is totally ordered under set inclusion; otherwise S is called non-monotone. As noted in Section 1, statistical inference for a monotone observed data pattern is relatively simple – the joint likelihood function can be factored as a product of conditional likelihood functions for distinct parameters. To obtain a similar factorization of the likelihood function for a non-monotone pattern S, we impose a parsimonious set of LCI constraints determined by S as follows. First, every observed data pattern S uniquely determines a lattice (ring) K ≡ K(S), defined to be the smallest subring of D(I) that contains S and ∅, i.e., K is generated from S and ∅ by the set operations ∪ and ∩. Then, the LCI model determined by the observed data pattern S is defined to be the submodel P(K(S)) ≡ P(K). In other words, the LCI model P(K) is obtained by imposing the LCI restrictions (2) on the saturated model P(I).

3.2. Factorization of the likelihood function for non-monotone data

We now factorize the likelihood function ℓ(P) ≡ ℓ(P; x′_1, ..., x′_n) under the model P(K). Observations x′_j with all I categories observed can be fully classified and yield an I-way contingency table, while for L ⊂ I, observations x′_j with only the categories in L observed can only be partially classified and form an L-way supplemental table. We denote the cell counts and cell probabilities in the L-way supplemental table by {n_L(x_L) | x_L ∈ C_L} and {p(x_L) | x_L ∈ C_L}, respectively. Note that when L = I, C_I ≡ C is the full table based on the complete observations and x_L ≡ x. Under the assumption that the missing data are missing at random, the missing data mechanism can be ignored when making likelihood inference (Rubin, 1976). Since N_L = ∅ for L ∈ D(I)\K, the likelihood function for the observed data x′_1, ..., x′_n under the saturated model P(I) can be written as follows:

  ℓ(P) ≡ ℓ(P; x′_1, ..., x′_n) = ∏_{L ∈ K} ∏_{x_L ∈ C_L} p(x_L)^{n_L(x_L)} =: ∏_{L ∈ K} ℓ_L(P),   (21)

where ℓ_L(P) is the likelihood function for the L-way supplemental table.
For any K ⊆ L ⊆ I, we define

  n_L(x_K) := ∑_{x_{L\K} ∈ C_{L\K}} n_L(x_L),  x_K ∈ C_K.   (22)

Note that if K = ∅, we have n_L(x_∅) = ∑(n_L(x_L) | x_L ∈ C_L) ≡ |N_L|. If we now impose the LCI restrictions (2) on the saturated model, i.e., assume that the LCI model P(K) holds, then by Proposition 2.2(iii) and Eq. (21), the likelihood function ℓ_K(P) for the observed data can be factored as follows:

  ℓ_K(P) = ∏_{L ∈ K} ∏_{x_L ∈ C_L} ∏_{K ∈ J(K), K ⊆ L} p(x_[K] | x_⟨K⟩)^{n_L(x_L)}   (23)
    = ∏_{L ∈ K} ∏_{K ∈ J(K), K ⊆ L} ∏_{x_L ∈ C_L} p(x_[K] | x_⟨K⟩)^{n_L(x_L)}
    = ∏_{L ∈ K} ∏_{K ∈ J(K), K ⊆ L} ∏_{x_K ∈ C_K} ∏_{x_{L\K} ∈ C_{L\K}} p(x_[K] | x_⟨K⟩)^{n_L(x_L)}
    = ∏_{L ∈ K} ∏_{K ∈ J(K), K ⊆ L} ∏_{x_K ∈ C_K} p(x_[K] | x_⟨K⟩)^{n_L(x_K)}
    = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} ∏_{L ∈ K, L ⊇ K} p(x_[K] | x_⟨K⟩)^{n_L(x_K)}
    = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} p(x_[K] | x_⟨K⟩)^{n_+(x_K)}
    = ∏_{K ∈ J(K)} ∏_{x_⟨K⟩ ∈ C_⟨K⟩} ∏_{x_[K] ∈ C_[K]} p(x_[K] | x_⟨K⟩)^{n_+(x_K)}   (24)
    =: ∏_{K ∈ J(K)} ∏_{x_⟨K⟩ ∈ C_⟨K⟩} ℓ^+_{K,x_⟨K⟩}(P_[K]^{x_⟨K⟩}),   (25)

where

  n_+(x_K) = ∑_{L ∈ K, L ⊇ K} n_L(x_K),

and

  ℓ^+_{K,x_⟨K⟩}(P_[K]^{x_⟨K⟩}) := ∏_{x_[K] ∈ C_[K]} p(x_[K] | x_⟨K⟩)^{n_+(x_K)}.   (26)
3.3. Explicit MLEs under the LCI model P(K) with non-monotone data

As in Section 2, the likelihood function ℓ_K(P) in Section 3.2 can be maximized by maximizing each ℓ^+_{K,x_⟨K⟩} separately. Since ℓ^+_{K,x_⟨K⟩} is the likelihood function for the saturated model P([K]) whose cell probabilities p(x_[K] | x_⟨K⟩) are unrestricted (except for Eq. (15)), by Eqs. (26) and (12), if n_+(x_K) > 0 for all K ∈ J(K) then the MLE of p(x_[K] | x_⟨K⟩) is given by

  p̂(x_[K] | x_⟨K⟩) = n_+(x_K)/n_{++}(x_K),   (27)

where

  n_{++}(x_K) := ∑_{x_[K] ∈ C_[K]} n_+(x_K).   (28)

By Propositions 2.2 and 2.3, for each x ∈ C, the MLE of the original parameter p(x) under the LCI model P(K) is thus given explicitly by

  p̂(x) = ∏_{K ∈ J(K)} p̂(x_[K] | x_⟨K⟩) = ∏_{K ∈ J(K)} n_+(x_K)/n_{++}(x_K),   (29)

i.e., Eq. (29) is the explicit MLE under the LCI model P(K) determined by the observed data pattern S. The maximum value of the likelihood function ℓ_K(P) under P(K) is given by

  ℓ_K(P̂) = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} p̂(x_[K] | x_⟨K⟩)^{n_+(x_K)} = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} (n_+(x_K)/n_{++}(x_K))^{n_+(x_K)}.   (30)

Remark 3.1. When there is no missing data, i.e., when S = {I}, the MLE of P in Eq. (29) reduces to Eq. (12).

Remark 3.2. One may be tempted to think that n_{++}(x_K) = n_+(x_⟨K⟩). In general, however, n_{++}(x_K) ≤ n_+(x_⟨K⟩). To see this, note that by Eq. (28),

  n_{++}(x_K) = ∑_{x_[K] ∈ C_[K]} ∑_{L ∈ K, L ⊇ K} n_L(x_K) = ∑_{L ∈ K, L ⊇ K} n_L(x_⟨K⟩) ≤ ∑_{L ∈ K, L ⊇ ⟨K⟩} n_L(x_⟨K⟩) ≡ n_+(x_⟨K⟩).

For example, consider K = {∅, 1, 12, 13, 123}. For K = 12 ∈ J(K), we have ⟨K⟩ = 1 and n_{++}(x_12) = n_12(x_1) + n_123(x_1), while n_+(x_1) = n_1(x_1) + n_12(x_1) + n_13(x_1) + n_123(x_1).

Remark 3.3. After determining the observed data pattern S, it is necessary to determine the poset J(K) of join-irreducible elements of the lattice K generated from S. Steel and Wood (1993) show that the elements of K can be expressed as unions of intersections of appropriate members of S. They also give a polynomial-time algorithm which can determine J(K) directly from S without first generating K.
4. Testing LCI models based on monotone data

As shown in the previous sections, in the analysis of categorical data with a non-monotone observed data pattern S, if the LCI restrictions determined by S are imposed, explicit MLEs can be obtained. It becomes important, therefore, to be able to test the appropriateness of the LCI model P(K) before imposing such constraints. In other words, we should consider the following testing problem:

  H_K: P ∈ P(K)  vs.  H_I: P ∈ P(I),   (31)

or, more generally,

  H_K: P ∈ P(K)  vs.  H_M: P ∈ P(M),   (32)

where M is a proper subring of K, so P(K) ⊆ P(M). Note that Eq. (31) is the special case of Eq. (32) where M = {∅, I}. In this section, we derive explicit LR statistics for Eq. (32) based on a sample with a monotone missing data pattern.

A warning about notation is needed here. In general, since K and M are uniquely determined by J(K) and J(M) respectively, we have J(K) ≠ J(M) if K ⊃ M. Therefore, for example, the two elements ⟨K⟩_K and ⟨K⟩_M, which are members of J(K) and J(M) respectively, are not necessarily the same, and quantities such as n_+(x_⟨K⟩) and n_+(x_K) depend not only on the set K (K ⊆ I) but also on the lattice of which K is a member. Henceforth, to simplify the notation, the letter K will denote a member of K, while the letter M will denote a member of M.

Let x_1, ..., x_n denote an i.i.d. sample from the model P(M), and again suppose that some of these vectors are incompletely observed. By further discarding some minimal subset of the observed components, a monotone observed data pattern S′ can be obtained. Denote this monotone subsample by x′_1, ..., x′_m (m ≤ n), and denote its index set by N′ := {1, ..., m}. We will derive the explicit LR test for Eq. (32) based on this monotone subsample, so the notation used in the previous sections must be modified accordingly.

Let L := K(S′) be the lattice determined by the monotone observed data pattern S′, so L is totally ordered under set inclusion. We assume that I ∈ L. The quantities n_L(x_K), for K ⊆ L ∈ L\{∅}, are defined as in Eq. (22), but now they are based on the subsample x′_1, ..., x′_m. For L ∈ L\{∅}, let m_L ≡ |N′_L| be the number of observations in the monotone subsample for which all variables in L are observed and all others are missing. Here N′_L := {j ∈ N′ | K_j = L}, where K_j is defined in Section 3.1. Throughout this section, we assume that, for L ∈ L\{∅},

  m_L/m → c_L  as m → ∞,   (33)

with 0 ≤ c_L < ∞ for L ≠ I and 0 < c_I < ∞. (If c_L = 0, the contribution of the incomplete observations in the L-way supplemental table to the likelihood function becomes negligible as m → ∞.) Finally, we denote the MLEs of P under P(M) and P(K) by P̃_M ≡ P̃ and P̂_K ≡ P̂, respectively.

In Section 2, we derived the explicit MLEs for P ∈ P(K) based on complete data x_1, ..., x_n. Here we present a lemma which gives the explicit MLEs for P ∈ P(K) based on the monotone data x′_1, ..., x′_m.
Lemma 4.1. Under the LCI model P(K), the MLEs based on the monotone observed subsample x′_1, ..., x′_m exist and are given explicitly by

  p̂(x) = ∏_{K ∈ J(K)} n′_+(x_K)/n′_{++}(x_K),  ∀ x ∈ C,   (34)

where

  n′_+(x_K) = ∑_{L ∈ L, L ⊇ K} n_L(x_K),  n′_{++}(x_K) = ∑_{x_[K] ∈ C_[K]} n′_+(x_K),  ∀ K ⊆ I.   (35)

Proof. The proof is similar to the arguments presented in Sections 3.2 and 3.3. We replace K with L in the index set of the first product in Eqs. (21) and (23), then proceed as in Sections 3.2 and 3.3:

  ℓ_K(P) := ∏_{L ∈ L} ∏_{x_L ∈ C_L} p(x_L)^{n_L(x_L)}
    = ∏_{L ∈ L} ∏_{x_L ∈ C_L} ∏_{K ∈ J(K), K ⊆ L} p(x_[K] | x_⟨K⟩)^{n_L(x_L)}
    = ···
    = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} ∏_{L ∈ L, L ⊇ K} p(x_[K] | x_⟨K⟩)^{n_L(x_K)}
    = ∏_{K ∈ J(K)} ∏_{x_K ∈ C_K} p(x_[K] | x_⟨K⟩)^{n′_+(x_K)}.

Hence, as in Section 3.3, the MLE of P under the LCI model P(K) based on the monotone data x′_1, ..., x′_m is given explicitly by Eq. (34).

Proposition 4.1. The LR statistic for testing H_K against H_M in Eq. (32), based on the monotone subsample x′_1, ..., x′_m, is given explicitly by

  G² := 2 log [ℓ_M(P̃)/ℓ_K(P̂)] = 2 ∑_{L ∈ L} ∑_{x_L ∈ C_L} n_L(x_L) log [p̃(x_L)/p̂(x_L)],   (36)

where

  p̂(x_L) = ∑_{x_{I\L} ∈ C_{I\L}} ∏_{K ∈ J(K)} n′_+(x_K)/n′_{++}(x_K),
  p̃(x_L) = ∑_{x_{I\L} ∈ C_{I\L}} ∏_{M ∈ J(M)} n′_+(x_M)/n′_{++}(x_M),

and n′_+(·) and n′_{++}(·) are defined by Eq. (35). Under H_K, as m → ∞, the asymptotic distribution of G² is χ²(f), the chi-square distribution with f degrees of freedom, where

  f = ∑_{M ∈ J(M)} ( ∏_{i ∈ M} |J_i| − ∏_{l ∈ ⟨M⟩} |J_l| ) − ∑_{K ∈ J(K)} ( ∏_{i ∈ K} |J_i| − ∏_{l ∈ ⟨K⟩} |J_l| ).   (37)
Proof. By Lemma 4.1, the MLE under the LCI model P(K) is given by Eq. (34). For L ⊆ I, we have

  p̂(x_L) = ∑_{x_{I\L} ∈ C_{I\L}} p̂(x) = ∑_{x_{I\L} ∈ C_{I\L}} ∏_{K ∈ J(K)} n′_+(x_K)/n′_{++}(x_K).

Thus the maximized value of the likelihood function under the LCI model P(K) is given by

  ℓ_K(P̂) = ∏_{L ∈ L} ∏_{x_L ∈ C_L} p̂(x_L)^{n_L(x_L)}.

Similarly, the maximized value of the likelihood function under the LCI model P(M) is given by

  ℓ_M(P̃) = ∏_{L ∈ L} ∏_{x_L ∈ C_L} p̃(x_L)^{n_L(x_L)}.

Therefore, the LR statistic G² is given by Eq. (36). It follows from Neyman (1949, Theorem 7) that under H_K, the asymptotic distribution of G² as m → ∞ is χ²(f), where f = dim(P(M)) − dim(P(K)). The degrees of freedom f can be derived as follows. Since each P_[K]^{x_⟨K⟩} is a saturated model for fixed x_⟨K⟩ ∈ C_⟨K⟩ and K ∈ J(K), by Proposition 2.3 it follows that

  dim(P(K)) = ∑_{K ∈ J(K)} dim(P(C_[K])^{C_⟨K⟩})
    = ∑_{K ∈ J(K)} dim(P(C_[K])) · |C_⟨K⟩|
    = ∑_{K ∈ J(K)} ( ∏_{i ∈ [K]} |J_i| − 1 ) · ∏_{l ∈ ⟨K⟩} |J_l|
    = ∑_{K ∈ J(K)} ( ∏_{i ∈ K} |J_i| − ∏_{l ∈ ⟨K⟩} |J_l| ),

where ∏_{l ∈ ⟨K⟩} |J_l| = 1 if ⟨K⟩ = ∅. Similarly,

  dim(P(M)) = ∑_{M ∈ J(M)} ( ∏_{i ∈ M} |J_i| − ∏_{l ∈ ⟨M⟩} |J_l| ).

Hence f ≡ dim(P(M)) − dim(P(K)) is given by Eq. (37).

Corollary 4.2. The LR statistic for testing H_K against H_I in Eq. (31) based on the monotone subsample x′_1, ..., x′_m is given explicitly by

  G² = 2 ∑_{L ∈ L} ∑_{x_L ∈ C_L} n_L(x_L) log [p̃(x_L)/p̂(x_L)],

where

  p̂(x_L) = ∑_{x_{I\L} ∈ C_{I\L}} ∏_{K ∈ J(K)} n′_+(x_K)/n′_{++}(x_K),
  p̃(x_L) = ∑_{x_{I\L} ∈ C_{I\L}} n(x)/m.

Under H_K, as m → ∞, the asymptotic distribution of G² is χ²(f), with

  f = ∏_{i ∈ I} |J_i| − 1 − ∑_{K ∈ J(K)} ( ∏_{i ∈ K} |J_i| − ∏_{l ∈ ⟨K⟩} |J_l| ).   (38)

Proof. In Proposition 4.1, choose M = {∅, I}, so J(M) = {I}.

The LR statistic for testing (31) based on completely observed data reduces to the following familiar form (see Lauritzen, 1996):

Corollary 4.3. The LR statistic for testing H_K against H_I in Eq. (31) based on the completely observed subsample x′_1, ..., x′_{m′} (m′ ≤ m) is given explicitly by

  G² = 2 ∑_{x ∈ C} n(x) log [ (n(x)/m′) / ∏_{K ∈ J(K)} (n(x_K)/n(x_⟨K⟩)) ].   (39)

Under H_K, as m → ∞, the asymptotic distribution of G² is χ²(f) with f given by Eq. (38).

Remark 4.1. We have derived the LR test statistics for the testing problems (31) and (32). Alternatively, (31) and (32) can be tested by the corresponding Pearson χ² statistics. It is known that the LR test and Pearson's χ²-test are asymptotically equivalent as m → ∞ (Neyman, 1949).

Remark 4.2. The validity of the χ²-approximations in the above propositions requires that the distribution satisfies certain regularity conditions (Bishop et al., 1975, p. 509). Here, Proposition 2.3 and the assumption that p(x) > 0 for each x ∈ C together imply that these regularity conditions hold.

Remark 4.3. We have derived an explicit LR test statistic for (32) based on observed data x′_1, ..., x′_m with a monotone pattern. More generally, it is also possible to obtain an explicit LR test statistic for (32) based on data with an observed data pattern S such that P(M) ⊆ P(K(S)).
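For the complete-data case of Corollary 4.3, the test takes only a few lines. A sketch (ours; hypothetical counts) computing G² of Eq. (39) and the degrees of freedom (38) for the LCI restriction X_2 ⊥ X_3 | X_1, i.e., J(K) = {1, 12, 13} as in Example 6.3 below:

import itertools
import math

# Hypothetical complete 2x2x2 table of counts n(x).
n = {c: 5 + 3 * c[0] + 2 * c[1] * c[2]
     for c in itertools.product((0, 1), repeat=3)}
m = sum(n.values())

def marg(x, axes):
    return sum(v for c, v in n.items() if all(c[a] == x[a] for a in axes))

g2 = 0.0
for x, nx in n.items():
    # LCI fitted probability: n(x1)/m * n(x12)/n(x1) * n(x13)/n(x1).
    p_lci = (marg(x, (0,)) / m) * (marg(x, (0, 1)) / marg(x, (0,))) * \
            (marg(x, (0, 2)) / marg(x, (0,)))
    g2 += 2 * nx * math.log((nx / m) / p_lci)

# Eq. (38) with |J_1| = |J_2| = |J_3| = 2:
# f = (2*2*2 - 1) - [(2 - 1) + (4 - 2) + (4 - 2)] = 2,
# in agreement with f = |J_1|(|J_2| - 1)(|J_3| - 1) of Example 6.3.
f = (2 * 2 * 2 - 1) - ((2 - 1) + (2 * 2 - 2) + (2 * 2 - 2))
print(g2, f)   # compare g2 with the chi-square(f) critical value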
5. Discussion

5.1. The applicability and efficiency of LCI models

For missing observations in categorical data with a non-monotone missing data pattern, we have derived explicit MLEs by imposing a minimal set of LCI restrictions on the saturated model. In some cases, such LCI assumptions may be reasonable due to the nature of the statistical experiment (Rubin, 1987, p. 191). In the log-linear formulation of the contingency table, conditional independences are equivalent to setting certain
interaction terms to zero (see, e.g., Whittaker, 1990, p. 207), which may be reasonable in practice (Baker and Laird, 1988). Furthermore, the LCI assumptions can be tested by likelihood ratio tests based on the complete data or a monotone subsample. If the LCI restrictions determined by the missing data pattern are considered reasonable, e.g., if the LCI hypothesis is accepted by the LR test, then explicit MLEs can be obtained (under the LCI model) without recourse to iterative algorithms. Moreover, in this case the LCI estimates are expected to be more efficient than those obtained from the EM-algorithm or by restriction to the completely observed data vectors. For the case of continuous data, this has been confirmed by a Monte-Carlo study on the efficiency of LCI models (Wu and Perlman, 1998). Furthermore, the explicit nature of the MLE may facilitate the study of its analytic properties, such as its second moments.

In cases where the LCI restrictions determined by a missing data pattern seem too restrictive, we may discard partial observations in order to obtain a less restrictive LCI model. This may be done in several ways. We may discard as few observations as possible such that the new LCI model determined by the resulting observed data pattern is accepted by the LR test (based on the original completely observed data vectors or on monotone data). In such a case, the resulting observed data pattern may still be non-monotone, so this approach is more general than that of obtaining a monotone pattern by discarding observations. Alternatively, the LCI model estimates may be used as starting values for the EM algorithm. Note that one of the commonly used starting values for the EM-algorithm is the complete-data estimate, which may be unsatisfactory in some situations (Fuchs, 1982; Little and Rubin, 1987, p. 189; also see the discussion at the end of our Example 6.7) and may not even be applicable if none of the observations are fully observed.

A referee has proposed an interesting variation of the EM algorithm for the missing data problems discussed here. In this "restricted" EM algorithm, called ER, the E-step of the EM algorithm is replaced by a restricted E-step based on the LCI model. Preliminary studies suggest that this ER estimator is an efficient compromise between the unrestricted EM estimator and the fully restricted LCI estimator. These ideas are being developed.

5.2. LCI models and graphical models

Andersson et al. (1995, 1997) show that the class of LCI models coincides with the subclass of transitive acyclic directed graphical models. Specifically, they show that P(K) = P(D), where P(D) is the Markov model determined by the transitive acyclic directed graph (ADG) D, with the following vertex set V and edge relation R:

  V := ([K] | K ∈ J(K)),
  R := ([K_1] → [K_2] iff K_1 ⊂ K_2, ∀ K_1, K_2 ∈ J(K)).
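This correspondence is mechanical to compute. A small sketch (ours), reusing the join_irreducibles map from the lattice code in Section 2.1 (each entry maps K to (⟨K⟩, [K])):

def adg_edges(JK):
    """Edges of the transitive ADG D: [K1] -> [K2] iff K1 is a proper
    subset of K2, for K1, K2 in J(K); the vertices are the sets [K]."""
    return [(tuple(sorted(b1)), tuple(sorted(b2)))
            for k1, (_, b1) in JK.items()
            for k2, (_, b2) in JK.items()
            if k1 < k2]

# For the lattice of Fig. 1(c), J(K) = {1, 12, 13}, this yields the
# arrows {1} -> {2} and {1} -> {3}.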
For basic graphical model concepts, see Lauritzen (1996) and Andersson and Perlman (1998).

6. Examples

In this section, several examples will be presented to illustrate the general theory developed in this paper. As we mentioned in Section 1, when working with concrete tables, the general notation used so far is not always convenient, so conventional notation will sometimes be used below. For instance, for a 3-way table with I = 123, we write (n_ijk) for the table of cell counts, and write (p_ijk) for the table of cell probabilities. For marginal counts we write, for example,

  n_{·j·} = ∑_i ∑_k n_ijk,  n_{ij·} = ∑_k n_ijk,
and so on. We also write (n_ij) for the cell counts in the 2-way supplemental table obtained from the observations with only the first two categories observed, and write (n_ik) for the cell counts in the 2-way supplemental table obtained from the observations with only the first and the third categories observed, and so on. In Figs. 2–7, the members of J(K_i) (the join-irreducible elements of K_i) are indicated by open circles.

Example 6.1 (Monotone observed data pattern). Let the observed data pattern S_1 = (K_1, ..., K_q) be such that K_1 ⊂ K_2 ⊂ ··· ⊂ K_q = I. Then the lattice determined by S_1 is K_1 = S_1 ∪ {∅} (Fig. 2). The LCI condition (2) is trivially satisfied, so P(K) = P(I), i.e., no LCI restrictions are imposed on P. Here

  J(K_1) = (K_1, K_2, ..., K_q),  ⟨K_k⟩ = K_{k−1},  k = 1, ..., q,

and the K_1-parameters are (p(x_{K_i\K_{i−1}} | x_{K_{i−1}}) | i = 1, ..., q), where K_0 = ∅. The only observed data pattern that generates K_1 is S_1. By Eq. (29), the MLE of P under the saturated model P(I) is given by

  p̂(x) = ∏_{i=1}^{q} n_+(x_{K_i})/n_{++}(x_{K_i}),

where n_+(x_{K_i}) = ∑_{j=i}^{q} n_{K_j}(x_{K_i}). If I = 12 and S_1 = {1, 12}, then the K_1-parameters are (p_{j|i}, p_i | i = 1, ..., |J_i|, j = 1, ..., |J_j|), and the MLE reduces to

  p̂_ij = (n_i + n_{i·})/n · n_{ij}/n_{i·},

which agrees with (9.3) in Little and Rubin (1987, p. 174).

Fig. 2. The lattice K_1.

If I = 123 and S_1 = {1, 12, 123}, then the K_1-parameters are (p_{k|ij}, p_{j|i}, p_i | i = 1, ..., |J_i|, j = 1, ..., |J_j|, k = 1, ..., |J_k|), and the MLE becomes

  p̂_ijk = (n_i + n_{i·} + n_{i··})/n · (n_ij + n_{ij·})/(n_{i·} + n_{i··}) · n_ijk/n_{ij·}.

Fig. 3. The lattice K_2.
Example 6.2 (Independence of two blocks). Consider the observed data pattern S_2 = {L, M, I} with L ∩ M = ∅, L ∪ M = I. Then the lattice determined by S_2 is K_2 = {∅, L, M, I} (Fig. 3) and the LCI restriction is X_L ⊥ X_M. Here J(K_2) = {L, M}, ⟨L⟩ = ⟨M⟩ = ∅, and the K_2-parameters are (p(x_L), p(x_M) | x_L ∈ C_L, x_M ∈ C_M). Note that the pattern S_2 = {L, M} also generates K_2. By Eq. (29), the MLE of P under the LCI model P(K_2) is given by

  p̂(x) = (n_L(x_L) + n(x_L))/n_L^+ · (n_M(x_M) + n(x_M))/n_M^+,   (40)

where n_L^+ = ∑_{x_L} (n_L(x_L) + n(x_L)) and n_M^+ = ∑_{x_M} (n_M(x_M) + n(x_M)). If I = 123 and S_2 = {12, 3, 123}, then the K_2-parameters are (p_{ij·}, p_{··k} | i = 1, ..., |J_i|, j = 1, ..., |J_j|, k = 1, ..., |J_k|), and Eq. (40) becomes

  p̂_ijk = (n_ij + n_{ij·})/(n_{··} + n_{···}) · (n_k + n_{··k})/(n_· + n_{···}).
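A minimal numerical sketch (ours; hypothetical counts with two levels per category) of the two-block MLE (40) for S_2 = {12, 3, 123}:

import itertools

n12  = {c: 4 for c in itertools.product((0, 1), repeat=2)}    # supplemental (n_ij)
n3   = {0: 6, 1: 10}                                          # supplemental (n_k)
n123 = {c: 2 for c in itertools.product((0, 1), repeat=3)}    # full table (n_ijk)

def p_hat(i, j, k):
    # Eq. (40): (n_ij + n_ij.)/(n_.. + n_...) * (n_k + n_..k)/(n_. + n_...).
    nij_dot = sum(v for c, v in n123.items() if c[:2] == (i, j))
    n__k = sum(v for c, v in n123.items() if c[2] == k)
    n_full = sum(n123.values())
    return ((n12[i, j] + nij_dot) / (sum(n12.values()) + n_full)) * \
           ((n3[k] + n__k) / (sum(n3.values()) + n_full))

print(sum(p_hat(*c) for c in itertools.product((0, 1), repeat=3)))   # -> 1.0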
As an illustration, the LR statistic for testing the LCI assumption (X_1, X_2) ⊥ X_3 against the saturated model, based on a completely observed sample (of size m), is given by

  G_2² = 2 ∑_{i,j,k} n_ijk log [n_ijk · m / (n_{ij·} n_{··k})],

and the degrees of freedom are f = (|J_1||J_2| − 1)(|J_3| − 1).

Example 6.3 (One pairwise LCI condition). Let the observed data pattern be either S_3 = {L, M, I}, or S_3 = {L, M}, or S_3 = {L ∩ M, L, M}, or S_3 = {L ∩ M, L, M, I}, with L, M ⊂ I, L ∩ M ≠ ∅, L ∪ M = I; then the lattice determined by S_3 is K_3 = {∅, L ∩ M, L, M, I} (Fig. 4), and the LCI restriction is

  X_[L] ⊥ X_[M] | X_{L∩M}.
Fig. 4. The lattice K_3.
Here J(K_3) = {L ∩ M, L, M}, ⟨L ∩ M⟩ = ∅, ⟨L⟩ = ⟨M⟩ = L ∩ M, and the K_3-parameters are (p(x_{L∩M}), p(x_{L\L∩M} | x_{L∩M}), p(x_{M\L∩M} | x_{L∩M}) | x_{L∩M} ∈ C_{L∩M}, x_{L\L∩M} ∈ C_{L\L∩M}, x_{M\L∩M} ∈ C_{M\L∩M}). Then, by Eq. (29), the MLE of P under the LCI model P(K_3) is given by

  p̂(x) = n_+(x_{L∩M})/n_{++} · n_+(x_L)/n_{++}(x_L) · n_+(x_M)/n_{++}(x_M),   (41)

where n_{++} = ∑_{x_{L∩M}} n_+(x_{L∩M}). If I = 123 and S_3 = {12, 13, 123} (Fig. 1(c) in Section 1), then the LCI restriction becomes X_2 ⊥ X_3 | X_1 and the K_3-parameters are (p_{i··}, p_{j|i}, p_{k|i} | i = 1, ..., |J_i|, j = 1, ..., |J_j|, k = 1, ..., |J_k|). The MLE in Eq. (41) can be written as

  p̂_ijk = (∑_j n_ij + ∑_k n_ik + n_{i··}) / (∑_{i,j} n_ij + ∑_{i,k} n_ik + n_{···})
          · (n_ij + n_{ij·}) / (∑_j n_ij + n_{i··})
          · (n_ik + n_{i·k}) / (∑_k n_ik + n_{i··}).   (42)

Rubin (1974) considered a special case with observed data pattern S_3′ = {12, 13}. Then the MLE under the LCI restriction becomes

  p̂_ijk = (∑_j n_ij + ∑_k n_ik) / (∑_{i,j} n_ij + ∑_{i,k} n_ik) · n_ij/∑_j n_ij · n_ik/∑_k n_ik.

Note that, when there is no missing data, Eq. (42) reduces to the familiar form:

  p̂_ijk = (n_{ij·})(n_{i·k}) / ((n_{···})(n_{i··})) = (n_{ij·}/n_{···})(n_{i·k}/n_{···}) / (n_{i··}/n_{···}) = (p̂_{ij·})(p̂_{i·k}) / p̂_{i··}.

The LR statistic for testing the LCI restriction X_2 ⊥ X_3 | X_1 against the saturated model, based on a completely observed sample, is given by

  G_3² = 2 ∑_{i,j,k} n_ijk log [n_ijk · n_{i··} / (n_{ij·} n_{i·k})],   (43)

and the degrees of freedom are f = |J_1|(|J_2| − 1)(|J_3| − 1). This agrees with Whittaker (1990, p. 223).

Fuchs (1982) considered an example which has an observed data pattern S_3′ = {1234, 12345, 12346, 123456} with I = 123456. Here L = 12345 and M = 12346, so the corresponding lattice is K_3′ = {∅} ∪ S_3′ and the LCI restriction determined by S_3′ is

  X_5 ⊥ X_6 | (X_1, X_2, X_3, X_4).
Fig. 5. The lattice K_4.
Under this LCI restriction, the explicit MLE can be obtained and is given in (41). In Fuchs' example, only one observation has X_5 missing. If this observation is discarded, a monotone pattern is obtained, and the explicit MLE is found for the monotone pattern without imposing any LCI restrictions and with minimal loss of efficiency.

Example 6.4 (Two pairwise LCI conditions). Suppose the observed data pattern has the form S_4 = {L, M, L′, M′, I} with L′, M′ ⊃ L ∪ M. Then the lattice determined by S_4 is K_4 = {∅, L ∩ M, L, M, L ∪ M, L′, M′, I} (Fig. 5). Here J(K_4) = {L ∩ M, L, M, L′, M′}, ⟨L ∩ M⟩ = ∅, ⟨L⟩ = ⟨M⟩ = L ∩ M, and ⟨L′⟩ = ⟨M′⟩ = L ∪ M. Note that an observed data pattern S generates K_4 iff {L, M, L′, M′} ⊆ S, so there are 2³ = 8 possible observed data patterns that determine the same K_4. In this case, the LCI restrictions are

  X_[L] ⊥ X_[M] | X_{L∩M},  X_[L′] ⊥ X_[M′] | X_{L∪M}.   (44)

By Eq. (29), the MLE of P ∈ P(K_4) is given by

  p̂(x) = n_+(x_{L∩M})/n_{++}(x_{L∩M}) · n_+(x_L)/n_{++}(x_L) · n_+(x_M)/n_{++}(x_M) · n_+(x_{L′})/n_{++}(x_{L′}) · n_+(x_{M′})/n_{++}(x_{M′}).   (45)

If I = 12345 and S_4 = {12, 13, 1234, 1235, 12345}, then the LCI conditions (44) become

  X_2 ⊥ X_3 | X_1,  X_4 ⊥ X_5 | (X_1, X_2, X_3).   (46)

The MLE in Eq. (45) can be written as

  p̂(x) = n_+(x_1)/n_{++}(x_1) · n_+(x_1, x_2)/n_{++}(x_1, x_2) · n_+(x_1, x_3)/n_{++}(x_1, x_3) · n_+(x_1, x_2, x_3, x_4)/n_{++}(x_1, x_2, x_3, x_4) · n_+(x_1, x_2, x_3, x_5)/n_{++}(x_1, x_2, x_3, x_5).

Example 6.5. Rubin (1987, Table 5.6, p. 190) considers the following observed data pattern:

  S_5 = {1, 12, 13, 123, 124, 1234},   (47)

where I = 1234. The general form of this pattern can be written as

  S_5 = {L ∩ M, L, M, L ∪ M, L′, I},   (48)

where L′ ⊃ L, L′ ⊉ M. In this case, K_5 = S_5 ∪ {∅} (Fig. 6), J(K_5) = {L ∩ M, L, M, L′}, and ⟨L ∩ M⟩ = ∅, ⟨L⟩ = ⟨M⟩ = L ∩ M, ⟨L′⟩ = L. The LCI restriction determined by K_5 is

  (X_[L], X_[L′]) ⊥ X_[M] | X_{L∩M}.   (49)

For S_5 in Eq. (47), Eq. (49) becomes (X_2, X_4) ⊥ X_3 | X_1. Here it is more concise to write the MLE of P ∈ P(K_5) in the general form (29).
Fig. 6. The lattice K_5.

Fig. 7. The lattice K_6.
Example 6.6. Rubin (1974, Table 1, p. 469) considers the following observed data pattern: S_6 = {3, 8, 1238, 123678, 345678}, where I = 12345678. The lattice (Fig. 7) determined by S_6 is K_6 = {∅, 3, 8, 38, 3678, 1238, 345678, 123678, 12345678}; J(K_6) = {3, 8, 1238, 3678, 345678}, and ⟨3⟩ = ⟨8⟩ = ∅, ⟨1238⟩ = ⟨3678⟩ = 38, ⟨345678⟩ = 3678. The LCI restrictions are

  X_3 ⊥ X_8,  (X_1, X_2) ⊥ (X_4, X_5, X_6, X_7) | (X_3, X_8).
Here it is more concise to write the MLE of P ∈ P(K_6) in the general form (29).

Example 6.7. We consider an example given in Bishop et al. (1975). The data give information on the survival rate of 715 infants attending two clinics and the amount of care received by the mother, where the amount of care is classified as either "more" or "less". The data are presented in the following three-way contingency table:

  Clinic      Care    Survival: No    Yes
  Clinic 1    Less              3     176
              More              4     293
  Clinic 2    Less             17     197
              More              2      23
  Note: n = 715.

Each observation has three components, namely X_1 = Clinic, X_2 = Care, X_3 = Survival. Suppose that the information on X_1 is always observed, but the information on X_2 or X_3 may be missing. Then we have a non-monotone observed data pattern S_1 = {12, 13, 123} (or S_1′ = {1, 12, 13, 123}). The LCI restriction determined by S_1 (or S_1′) is X_2 ⊥ X_3 | X_1 (i.e., Care and Survival are independent given Clinic). We simulate this situation by randomly deleting the second or the third components in some observations. We do this twice: once to obtain 20% incomplete observations, and once to obtain 60% incomplete observations. To save space, here we only present detailed information for the sample with 60% incomplete observations. Table 1 shows the fully and partially classified tables based on the sample with 60% incomplete observations, where n_123 ≡ |N_123|, n_12 ≡ |N_12|, and n_13 ≡ |N_13| are the numbers of fully and partially classified observations.

Table 1. The fully and partially classified tables obtained from the sample with 60% incomplete observations and an observed pattern S_1.

  (a) Fully classified (n_123 = 286)
  Clinic      Care    Survival: No    Yes
  Clinic 1    Less              2      83
              More              1     113
  Clinic 2    Less              7      73
              More              2       5

  (b) Partially classified, Care observed (n_12 = 55)
              Care: Less    More
  Clinic 1            12      17
  Clinic 2            22       4

  (c) Partially classified, Survival observed (n_13 = 374)
              Survival: No    Yes
  Clinic 1              4     244
  Clinic 2              8     118

Based on the monotone subsample with the observed data pattern L = {13, 123} (the largest monotone subsample), the LR test for testing the LCI model against the saturated model gives a p-value of 0.26, indicating that the LCI assumption may be reasonable for this data. Note that there are a few cells having small counts in the fully classified table (Table 1(a)), but the sample sizes n_123 and n_13 are large, so the LR test may still be considered reliable here, although it may be somewhat conservative (see, e.g., Agresti, 1990, p. 246).
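Using the counts in Table 1, the explicit MLE (42) can be reproduced in a few lines. A sketch (ours); its output should match the LCI-model column of Table 2 below up to rounding:

# Counts from Table 1 (60% incomplete sample); cells are (clinic, care, survival),
# each coded 0/1 as (Clinic 1, Clinic 2), (Less, More), (No, Yes).
n123 = {(0, 0, 0): 2, (0, 0, 1): 83, (0, 1, 0): 1, (0, 1, 1): 113,
        (1, 0, 0): 7, (1, 0, 1): 73, (1, 1, 0): 2, (1, 1, 1): 5}
n12  = {(0, 0): 12, (0, 1): 17, (1, 0): 22, (1, 1): 4}
n13  = {(0, 0): 4, (0, 1): 244, (1, 0): 8, (1, 1): 118}

def p_hat(i, j, k):
    # Eq. (42) for the pattern S = {12, 13, 123}, LCI restriction X2 _||_ X3 | X1.
    ni_12 = sum(v for c, v in n12.items() if c[0] == i)     # sum_j n_ij
    ni_13 = sum(v for c, v in n13.items() if c[0] == i)     # sum_k n_ik
    ni_123 = sum(v for c, v in n123.items() if c[0] == i)   # n_i..
    nij_dot = sum(v for c, v in n123.items() if c[:2] == (i, j))
    ni_k = sum(v for c, v in n123.items() if (c[0], c[2]) == (i, k))
    tot = sum(n12.values()) + sum(n13.values()) + sum(n123.values())   # 715
    return ((ni_12 + ni_13 + ni_123) / tot) * \
           ((n12[i, j] + nij_dot) / (ni_12 + ni_123)) * \
           ((n13[i, k] + ni_k) / (ni_13 + ni_123))

for cell in sorted(n123):
    print(cell, round(p_hat(*cell), 4))   # e.g. (0, 0, 0) -> 0.0044, cf. Table 2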
Table 2. Estimates based on the sample with 60% incomplete observations.

                          LCI model        EM-algorithm     CD method        Actual value*
                          Survival         Survival         Survival         Survival
  Clinic     Care         No      Yes      No      Yes      No      Yes      No      Yes
  Clinic 1   Less         0.0044  0.2788   0.0069  0.2764   0.0070  0.2902   0.0042  0.2462
             More         0.0060  0.3765   0.0035  0.3789   0.0035  0.3951   0.0056  0.4098
  Clinic 2   Less         0.0241  0.2776   0.0202  0.2826   0.0244  0.2552   0.0238  0.2755
             More         0.0026  0.0299   0.0068  0.0246   0.0070  0.0175   0.0028  0.0322

  * Actual value = MLEs based on the original 715 complete observations.

For the sample with 60% incomplete observations, the estimated cell probabilities p̂_ijk based on the LCI model, the EM-algorithm, and the complete-data method (CD) are presented in Table 2. To measure the accuracy of each estimation procedure, we computed the sum of the scaled absolute deviations (SAD) between the estimated cell probabilities p̂_ijk and the MLEs p̃_ijk based on the original 715 complete observations, i.e.,

  SAD := ∑_{i,j,k} |p̂_ijk − p̃_ijk| / p̃_ijk.   (50)

Table 3 presents the SADs for the two samples with 20% and 60% incomplete observations.

Table 3. SAD for the three estimation procedures.

                      LCI     EM      CD      (EM−LCI)/EM    (CD−LCI)/CD
  20% incomplete      0.42    0.90    0.88    53%            52%
  60% incomplete      0.51    3.10    3.31    84%            85%

We see that, for both samples, the LCI model produces better estimates than the EM-algorithm and the complete-data method, especially when the missing data is extensive. For the sample with 60% incomplete observations, the LCI model offers 84% and 85% improvement over the EM-algorithm and the complete-data method, respectively. For the sample with 20% incomplete observations, the LCI model offers 53% and 52% improvement over the EM-algorithm and the complete-data method, respectively. This example indicates that the LCI model estimates may be much better than the estimates based on the EM-algorithm and the complete-data method when the LCI model is accepted by the LR test and the missing data is extensive.

Sometimes the LCI model determined by the observed data pattern may be very restrictive. In the above example, if there are also some observations with the first component missing, then the observed data pattern is S_2 = {12, 13, 23, 123}. The LCI restriction determined by S_2 is X_1 ⊥ X_2 ⊥ X_3 (i.e., mutual independence). The p-value for testing this LCI assumption is zero, indicating that this LCI model is too restrictive. In this case, if we discard the observations with the first component missing, we again obtain the observed data pattern S_1. As we already know, the LCI restriction
determined by S1 is reasonable. This procedure has the advantage that fewer observations are discarded to obtain S1 than to obtain a monotone pattern. To illustrate, suppose that, having randomly created 60% incomplete observations as in Table 1(b) and (c), we further delete the rst components in n23 = 72 (10%) randomly selected observations from the 286 complete observations in Table 1(a). These 72 incomplete observations are summarized in the following table:
Clinic Clinic 1
Care
Less More Clinic 2 Less More Note: n23 = 72.
Survival No Yes 2 0 0 0
24 29 16 1
The SAD for the LCI model estimates obtained after discarding these 72 observations to obtain the pattern S1 , and the SADs for the estimates obtained from the EM-algorithm and the complete-data method without discarding these 72 observations are 1.4, 2.9, and 3.3, respectively. Thus, in this case, the LCI model estimates are still much better than the other two estimates and still provide 52% and 58% improvement over the EM-algorithm and the complete-data method respectively, although here the LCI model estimates are obtained by additionally discarding the 72 observations while the estimates from the EM-algorithm are obtained without discarding any additional observations. Note that here explicit MLEs can also be obtained based on monotone subsamples. In this example, the largest monotone subsample is the one with observed pattern {13; 123} (the size of this subsample is n13 + n123 − n23 = 588). The SAD for the estimates based on this monotone subsample is 3.1. Thus the LCI model estimates obtained after discarding the 72 observations are also much more ecient than the MLEs based on the largest monotone subsample (oering 55% improvement). In summary, this example suggests that, in cases when the LCI model determined by the missing data pattern is too restrictive, if a reasonable LCI model can be obtained by discarding a small proportion of observations, the LCI model estimates obtained after discarding these observations may still be more ecient than other estimates obtained by either discarding or not discarding observations. Returning to the beginning of this example, suppose instead that we have the non-monotone observed data pattern S3 = {12; 3; 123}. Then the LCI restriction determined by S3 is (X1 ; X2 ) ⊥ X3 (i.e., Survival is independent of Clinic and Care). The LR test for this LCI assumption yields a p-value of 0.0005, suggesting that this LCI model may not be reasonable. We now study the eciency of LCI models in this case. We consider two simulated incomplete samples, with 20% and 90% incomplete observations, respectively. (Here we choose 90% rather than 60% in order to study the
Table 4
The fully and partially classified tables obtained from the sample with 90% incomplete observations and the observed pattern S3

(a) Fully classified table

Clinic      Care    Survival
                    No     Yes
Clinic 1    Less     0      13
            More     0      35
Clinic 2    Less     2      20
            More     1       1
Note: n123 = 72.

(b) Partially classified table (Clinic and Care observed)

            Care
            Less   More
Clinic 1    122    204
Clinic 2    150     19
Note: n12 = 495.

(c) Partially classified table (Survival observed)

Survival
No     Yes
5      143
Note: n3 = 148.
Table 5
SAD for the three estimating procedures

                   LCI    EM    CD    (EM−LCI)/EM    (CD−LCI)/CD
20% incomplete     3.1    1.7   1.7      −79%           −79%
90% incomplete     4.1    7.0   7.2       40%            43%
Table 5 shows that the LCI model estimates may be less efficient if the LCI model seems inappropriate and the proportion of incomplete observations is small. However, if the missing data are so extensive that the completely classified table is sparse (e.g., 90% incomplete), the LCI model may still produce better estimates than the EM-algorithm and the complete-data method.
Note that, in this example, the fully classified table (Table 4(a)) has two zero cells but the corresponding cells in the supplemental tables have positive counts. If starting values for the EM-algorithm are based on the fully classified table, the EM-algorithm never allows a zero cell to attain a nonzero probability estimate (Fuchs, 1982; Little and Rubin, 1987), so these initial values are inappropriate. In this case, the LCI model estimates may be used as starting values instead. Furthermore, when the missing data are extensive, the EM-algorithm is usually slow, so using the LCI model estimates as starting values may speed up the algorithm. In this example, the EM-algorithm using the LCI model estimates as starting values converges faster (by 10%) than the EM-algorithm using uniform starting values (equal probabilities in all cells).
This example, together with more extensive simulations not presented here, suggests that when the missing data are so extensive that the completely classified table is sparse, then even if the LR test rejects the LCI model determined by the missing data pattern at common significance levels (e.g., α = 0.05, 0.01), the estimates based on the LCI model may still be better than the competing estimates. In this case, imposing the LCI restrictions reduces the number of parameters to be estimated, so that more data are available to estimate the parameters; the LCI model may therefore yield better estimates. Furthermore, in this case, the LCI model estimates may also be good starting values for the EM-algorithm.
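To make the role of starting values concrete, here is a minimal EM sketch for this setting; it is our illustration, not the authors' code, assuming the 2 x 2 x 2 table and the observed pattern S3 = {12, 3, 123} of Table 4, with em_multinomial and the variable names hypothetical:

import numpy as np

# Counts from Table 4, indexed as (Clinic, Care, Survival).
n123 = np.array([[[0, 13], [0, 35]],
                 [[2, 20], [1, 1]]], dtype=float)     # Table 4(a)
n12 = np.array([[122, 204], [150, 19]], dtype=float)  # Table 4(b), Survival missing
n3 = np.array([5, 143], dtype=float)                  # Table 4(c), Clinic and Care missing

def em_multinomial(n123, n12, n3, p0, tol=1e-10, max_iter=5000):
    p = np.asarray(p0, dtype=float)
    p = p / p.sum()
    for _ in range(max_iter):
        # E-step: allocate each partially classified count over its
        # unobserved dimensions in proportion to the current
        # conditional cell probabilities.
        m = n123.copy()
        m += n12[:, :, None] * (p / p.sum(axis=2, keepdims=True))
        m += n3[None, None, :] * (p / p.sum(axis=(0, 1), keepdims=True))
        # M-step: the complete-data MLE is the normalized expected table.
        p_new = m / m.sum()
        if np.abs(p_new - p).max() < tol:
            break
        p = p_new
    return p_new

# A cell whose probability starts at zero (as when p0 is taken from the
# sparse fully classified table) is never allocated any expected count,
# so it stays at zero: the trap noted above. A uniform start, or the
# explicit LCI model estimates, avoids this.
p_hat = em_multinomial(n123, n12, n3, p0=np.full((2, 2, 2), 1.0 / 8))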
Acknowledgements

We wish to thank Steen Andersson, A.P. Dempster, David Madigan, X.L. Meng, Charles Nelson, Thomas Richardson, Paul Sampson, and Galen Shorack for helpful discussions. The two referees deserve special thanks for their interest and insights on these topics.

References

Agresti, A., 1990. Categorical Data Analysis. Wiley, New York.
Anderson, T.W., 1957. Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. J. Amer. Statist. Assoc. 52, 200–203.
Andersson, S.A., Madigan, D., Perlman, M.D., Triggs, C.M., 1995. On the relation between conditional independence models determined by finite distributive lattices and by directed acyclic graphs. J. Statist. Plann. Inference 48, 25–46.
Andersson, S.A., Madigan, D., Perlman, M.D., Triggs, C.M., 1997. A graphical characterization of lattice conditional independence models. Ann. Math. Artif. Intell. 21, 27–50.
Andersson, S.A., Perlman, M.D., 1991. Lattice-ordered conditional independence models for missing data. Statist. Probab. Lett. 12, 465–486.
Andersson, S.A., Perlman, M.D., 1993. Lattice models for conditional independence in a multivariate normal distribution. Ann. Statist. 21, 1318–1358.
Andersson, S.A., Perlman, M.D., 1995. Testing lattice conditional independence models. J. Multivariate Anal. 53, 18–38.
Andersson, S.A., Perlman, M.D., 1998. Normal linear regression models with recursive graphical Markov structure. J. Multivariate Anal. 66, 133–187.
Baker, S.G., Laird, N.M., 1988. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. J. Amer. Statist. Assoc. 81, 29–41.
Bishop, Y.M.M., Fienberg, S.E., Holland, P.W., 1975. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.
Chen, T.T., Fienberg, S.E., 1974. Two-dimensional contingency tables with both completely and partially cross-classified data. Biometrics 30, 629–642.
Davey, B.A., Priestley, H.A., 1990. Introduction to Lattices and Order. Cambridge Univ. Press, Cambridge.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood estimation from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–38.
Diebolt, J., 1997. Discussion of "The EM algorithm – an old folk-song sung to a fast new tune" by Meng, X.L. and van Dyk, D. J. Roy. Statist. Soc. Ser. B 59 (3), 511–567.
Fuchs, C., 1982. Maximum likelihood estimation and model selection in contingency tables with missing data. J. Amer. Statist. Assoc. 77, 270–278.
Geng, Z., 1988. Multidimensional contingency tables with missing data. Commun. Statist. Theory Meth. 17 (12), 4137–4146.
Geng, Z., Asano, C., Ichimura, M., Tao, F., Wan, K., Kuroda, M., 1996. Partial imputation method in the EM algorithm. In: Proc. Comput. Statist., 12th Symp. held in Barcelona, Spain.
Geng, Z., Wan, K., Tao, F., Guo, J., 1997. Decomposition of mixed graphical models with missing data. Contemporary Multivariate Analysis and its Applications, May 19–22, Hong Kong.
Hocking, R.R., Oxspring, H.H., 1974. The analysis of partially categorized contingency data. Biometrics 30, 469–483.
Lauritzen, S.L., 1996. Graphical Models. Oxford Univ. Press, Oxford.
Little, R.J.A., Rubin, D.B., 1987. Statistical Analysis with Missing Data. Wiley, New York.
Lord, F.M., 1955. Estimation of parameters from incomplete data. J. Amer. Statist. Assoc. 50, 870–876.
Murray, G.D., 1977. Discussion of "Maximum likelihood from incomplete data via the EM algorithm" by A.P. Dempster, N.M. Laird and D.B. Rubin. J. Roy. Statist. Soc. Ser. B 39, 1–38.
Neyman, J., 1949. Contributions to the theory of the χ² test. In: Proc. 1st Berkeley Symp. on Mathematical Statistics and Probability, pp. 239–273.
Rubin, D.B., 1974. Characterizing the estimation of parameters in incomplete data problems. J. Amer. Statist. Assoc. 69, 467–474.
Rubin, D.B., 1976. Inference and missing data. Biometrika 63, 581–592.
Rubin, D.B., 1987. Multiple Imputation for Nonresponse in Sample Surveys. Wiley, New York.
Smith, C.A.B., 1997. Discussion of "The EM algorithm – an old folk-song sung to a fast new tune" by Meng, X.L. and van Dyk, D. J. Roy. Statist. Soc. Ser. B 59 (3), 511–567.
Steel, M.A., Wood, G.R., 1993. On a problem of Andersson and Perlman. Statist. Probab. Lett. 18, 381–382.
Whittaker, J.L., 1990. Graphical Models in Applied Multivariate Statistics. Wiley, New York.
Williamson, G.D., Haber, M., 1994. Models for three-dimensional contingency tables with completely and partially cross-classified data. Biometrics 49, 194–203.
Wu, L., Perlman, M.D., 1998. Efficiency of lattice conditional independence models for continuous data with non-monotone missing observations. Technical Report, Dept. of Statistics, Univ. of Washington.