The doubly smoothed maximum likelihood estimation for location-shifted semiparametric mixtures

Byungtae Seo
Department of Statistics, Sungkyunkwan University, Seoul 03063, Republic of Korea
E-mail: [email protected]
Article history: Received 20 July 2015; received in revised form 30 May 2016; accepted 2 November 2016.

Keywords: Finite mixture; Semiparametric mixture; Doubly smoothed MLE
Abstract

Finite mixtures of a location family of distributions are known to be identifiable if the component distributions are common and symmetric. In such cases, several methods have been proposed for estimating both the symmetric component distribution and the model parameters. In this paper, we propose a new estimation method using the doubly smoothed maximum likelihood, which can effectively eliminate potential biases while maintaining high efficiency. Some numerical examples are presented to demonstrate the performance of the proposed method.
1. Introduction

Finite mixture models have been widely applied as a model-based clustering tool for classifying or clustering data in various scientific fields. In order to employ mixture models, practitioners must first determine a parametric family for the component distribution. The normal distribution is the most commonly used, but other parametric distributions can also be employed, depending on the source of the data or the area of study. For example, in survival analysis, exponential and gamma distributions are common choices, owing to the nature of the data involved. After determining the component distribution, the next task is to specify a suitable number of components. This is related to the number of hidden homogeneous subpopulations of the whole population under study. There are some well-known criteria for choosing the number of components. The most common tools employed for this purpose are information-based criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and their relatives. Such criteria can be used to determine a suitable number of components when a certain class of component distributions is assumed.

Although the procedure described above is standard for fitting finite mixture models, a more desirable method would be to estimate the component distribution itself as well as the model parameters, rather than adopting an assumption. However, we should exercise a high degree of care in addressing this issue. Because mixture distributions are already flexible enough to cover almost all statistical distributions, allowing further flexibility could result in fundamental statistical problems. For example, if one leaves the component density completely unspecified in a finite mixture model, then the model is not identifiable in general. Hall and Zhou (2003) tackled this issue in a multivariate mixture in which the coordinate distributions are conditionally independent given the component membership. Under some mild conditions, they showed the identifiability of the model for mixtures with more than two dimensions. Hohmann and Holzmann (2013) showed the identifiability and proposed an estimation method for bivariate mixtures under additional tail conditions on the component characteristic functions and densities. Recently, Chauveau et al. (2015) provided a comprehensive survey of such semiparametric multivariate mixtures.
For the identifiability issue, the conditional independence assumption can make multivariate mixtures identifiable. For univariate mixtures, however, an additional condition is needed. One feasible condition is that the component distribution is common and symmetric across components. Hunter et al. (2007) and Bordes et al. (2006) studied such a model and demonstrated that this semiparametric mixture model is identifiable with a nonparametric symmetric component distribution when the number of components is two. Hunter et al. (2007) also discussed the identifiability of mixtures with more than two components.

Despite the theoretical justification for the use of semiparametric mixtures, it is not easy to carry out the required estimation owing to the nonparametric component distribution. In order to overcome this challenge, Hunter et al. (2007) suggested a minimum distance approach. However, their method does not produce a proper estimate of the component distribution. The estimating methods proposed by Bordes et al. (2006) and Hunter et al. (2007) are not efficient in general and are hard to generalize to mixtures with more than two components. Bordes et al. (2007) proposed a stochastic EM-like algorithm which can be generalized to semiparametric mixtures with more than two components, but it does not have the descent property. Recently, Chauveau et al. (2015) modified the algorithm so that it has a descent property. Although these estimation methods can produce a reasonable estimator, they are not fully likelihood-based, and the choice of bandwidth remains a difficult task.

The focus of this paper is to develop a likelihood-based estimation method for semiparametric univariate mixtures. In this regard, Chee and Wang (2013) proposed nonparametric mixture models for the nonparametric component density, and applied the constrained Newton method for multiple support points (CNM) to estimate the nonparametric mixing distribution. Their method provides a well-defined likelihood along with an algorithm estimating the nonparametric mixing distribution and model parameters. However, because their method still requires some degree of smoothing, a loss in efficiency is inevitable. Furthermore, although their method is less sensitive to the choice of the bandwidth than other kernel-based methods, the choice of bandwidth remains a nontrivial issue.

When a given problem requires some degree of smoothing, the use of a kernel almost always produces biases and leads to some loss in efficiency. This results from the fact that smoothing involves contamination or distortion of the initial problem, which leads to a certain deviation from the original model or data. In order to minimize this deviation, we propose the use of the doubly smoothed maximum likelihood estimator (DSMLE), as introduced by Seo and Lindsay (2013). The DSMLE was originally developed with the aim of constructing a consistent estimator when the usual maximum likelihood estimator (MLE) fails to be consistent. In addition, the DSMLE can also be useful in reducing biases or information loss when a given problem requires a certain degree of smoothing. In this paper, we develop an estimation method for semiparametric mixture models based on a double smoothing procedure.

The remainder of this paper is organized as follows. In Section 2, we introduce some recently proposed computing methods for semiparametric mixtures. In Section 3, we present the method for applying the double smoothing procedure, along with a brief introduction to the DSMLE. Some computational issues are discussed in Section 4, and numerical examples are presented in Section 5. We then make some concluding remarks in Section 6.
2. Semiparametric mixture models
In this section, we briefly introduce two existing methods for estimating model parameters in semiparametric mixture models. In this context, let us consider the following m-component mixture density with an unspecified symmetric density g:

f(x; \pi, \mu, g) = \sum_{j=1}^{m} \pi_j \, g(x - \mu_j),    (2.1)
where µ = (µ1, . . . , µm) is a vector containing each subpopulation mean and π = (π1, . . . , πm) is the corresponding vector of mixing proportions, which satisfies \sum_{j=1}^{m} \pi_j = 1 and πj > 0, j = 1, . . . , m. To estimate (π, µ, G), Bordes et al. (2007) proposed a stochastic expectation–maximization (EM) algorithm. Although Bordes et al. (2007) did not provide a clear probability model for g, in some sense, their estimating algorithm is equivalent to estimating (π, µ) and B under

g_K(x; B) = \sum_{i=1}^{n} \frac{K_h(x - \tilde{x}_i) + K_h(x + \tilde{x}_i)}{2n} = \int \frac{K_h(x - \tilde{x}) + K_h(x + \tilde{x})}{2} \, dB(\tilde{x}),    (2.2)
where K_h is a symmetric kernel density with bandwidth h, and B is the uniform distribution on x̃1, . . . , x̃n. Using the stochastic EM algorithm, they then iteratively update (π, µ) and the x̃i's. Chee and Wang (2013) modeled g using a similar but more flexible model, as

g_M(x; Q) = \int \frac{K_h(x - \eta) + K_h(x + \eta)}{2} \, dQ(\eta),    (2.3)
where Q is an unspecified probability distribution on R⁺. Readers should note that g_M(x; Q) can be interpreted as a nonparametric mixture density with a component density (K_h(x − η) + K_h(x + η))/2 and an unknown mixing distribution Q.
Since the class of Q contains that of B, g_K can be viewed as a submodel of g_M. Now, the corresponding semiparametric finite mixture densities based on g_K and g_M can be expressed as

f_K(x; \pi, \mu) = \sum_{j=1}^{m} \pi_j \sum_{i=1}^{n} \frac{K_h(x - \tilde{x}_i - \mu_j) + K_h(x + \tilde{x}_i - \mu_j)}{2n}

and

f_M(x; \pi, \mu, Q) = \sum_{j=1}^{m} \pi_j \int \frac{K_h(x - \eta - \mu_j) + K_h(x + \eta - \mu_j)}{2} \, dQ(\eta),    (2.4)
respectively. Modeling g as in (2.3) has several advantages. First, unlike f_K and other kernel-based methods, one can construct a clearly defined likelihood function, meaning that standard maximum likelihood theory and algorithms can be applied. Second, by computing the nonparametric maximum likelihood estimator (NPMLE) of Q, we can obtain a legitimate density estimator for the unknown symmetric density g in (2.1). Third, although we must still specify a bandwidth h, this method is less sensitive to its choice than methods that use kernel density estimators, because the class of Q is flexible enough that the NPMLE of Q can adjust as h changes. For the computation of the MLE, Chee and Wang (2013) suggested an iterative algorithm to compute the MLE of (µ, π) and Q. When Q is fixed, f_M is just a finite location mixture density, and one can easily compute the MLE of (µ, π) using the EM algorithm (Dempster et al., 1977). Based on our experience, the usual Newton-type algorithms are also effective unless m is too large. With (µ, π) fixed, the estimation of Q is then equivalent to computing the NPMLE of the mixing distribution in a nonparametric mixture density. For this computational task, we can use several well-known algorithms to find the NPMLE of Q (Bohning, 1985; Lesperance and Kalbfleisch, 1992; Wang, 2007).
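To make the structure of (2.3) and (2.4) concrete, the following sketch (ours, not code from any of the cited papers) evaluates g_M and f_M for a discrete mixing distribution Q; the normal kernel and the particular support points and weights are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def g_M(x, eta, w, h):
    """Symmetrized kernel density g_M(x; Q) of (2.3) for a discrete Q
    placing weight w[k] on support point eta[k] >= 0; K_h is assumed
    to be the normal kernel with bandwidth h."""
    x = np.atleast_1d(x)[:, None]              # shape (len(x), 1)
    K = norm.pdf(x - eta, scale=h)             # K_h(x - eta_k)
    K_reflect = norm.pdf(x + eta, scale=h)     # K_h(x + eta_k)
    return ((K + K_reflect) / 2) @ w           # symmetrize, then mix over Q

def f_M(x, pi, mu, eta, w, h):
    """Semiparametric mixture density f_M(x; pi, mu, Q) of (2.4)."""
    x = np.atleast_1d(x)
    comps = np.stack([g_M(x - m, eta, w, h) for m in mu], axis=1)
    return comps @ pi

# quick check: the density integrates to about 1 on a fine grid
pi = np.array([0.3, 0.7]); mu = np.array([-2.0, 2.0])
eta = np.array([0.0, 0.5, 1.0]); w = np.array([0.2, 0.5, 0.3])
grid = np.linspace(-8.0, 8.0, 4001)
print(f_M(grid, pi, mu, eta, w, h=0.4).sum() * (grid[1] - grid[0]))
```

With Q fixed, f_M is exactly the finite location mixture to which the EM algorithm applies; with (µ, π) fixed, maximizing over the support points and weights of Q is the NPMLE step.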
3. DSMLE
3.1. Review of DSMLE
In this section, we briefly review the doubly smoothed maximum likelihood procedure (Seo and Lindsay, 2013). Suppose that X1, . . . , Xn is a random sample from a population having probability density f(x; α). Here, α can be either a finite- or infinite-dimensional model parameter. The most popular method for the estimation of α is undoubtedly the maximum likelihood method. However, there are some cases in which the usual MLE fails to produce a consistent estimator. This can result from a certain violation of regularity conditions. In such cases, smoothing either the given model or the data can regularize the problem, such that the usual regularity conditions are fulfilled. However, a smoothing procedure almost always results in some bias or information loss, because smoothing introduces a certain contamination. Of course, if we let the degree of smoothing go to zero as the sample size increases, we may still obtain some desirable statistical properties asymptotically, but not in finite samples. The double smoothing procedure is devised to minimize this discrepancy between the smoothed model or data and the original model or data, by smoothing not only the data but also the model. To explain this, we first produce a smoothed density based on the original model and a chosen kernel K_h with bandwidth h, as

f^*(x; \alpha) = \int K_h(x - t) f(t; \alpha) \, dt.    (3.1)
In this case, we can interpret f* as the density of X + ϵ, where ϵ is a random variable with probability density K_h. Note that f* is a continuous and bounded density for a suitable choice of K_h. Second, we construct a smoothed density using K_h based on the given data as

\hat{f}_n^*(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i).
Again, f̂n*(x) can be interpreted as an empirical version of the probability density of X + ϵ, and this is also a continuous and bounded density for a suitable choice of K_h. We now have two smooth probability densities for X + ϵ, one based on the model and one based on the data. Here, smoothing plays the role of regularizing both the model and the data. The DSMLE of α is then obtained by computing the maximizer of the doubly smoothed (DS) log-likelihood

l^*(\alpha) = \int \log\{f^*(x; \alpha)\} \, \hat{f}_n^*(x) \, dx.
In fact, maximizing l*(α) with respect to α is equivalent to minimizing the Kullback–Leibler divergence between the two smoothed densities. This implies that computing the DSMLE of α is equivalent to finding the f(x; α) that minimizes the Kullback–Leibler divergence between the model-based and data-based smoothed densities.
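As a minimal illustration of the definition (ours, not an example from Seo and Lindsay (2013)), consider a normal location model with known scale and a normal kernel: the smoothed model density f*(x; µ) is then the N(µ, σ² + h²) density in closed form, and l*(µ) can be approximated by averaging log f* over perturbed data points.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=200)   # sample from N(1.5, 1)
h, sigma = 0.5, 1.0                            # fixed bandwidth, known scale

def ds_loglik(mu, x, h, sigma, J=50):
    """Approximate DS log-likelihood l*(mu) for the N(mu, sigma^2) model.

    Smoothing the model with a N(0, h^2) kernel gives
    f*(x; mu) = N(mu, sigma^2 + h^2); smoothing the data amounts to
    evaluating log f* at perturbed points x_i + eps_j (a fixed,
    balanced set of kernel quantiles is used here for the eps_j)."""
    eps = norm.ppf(np.arange(1, J + 1) / (J + 1), scale=h)
    xx = (x[:, None] + eps).ravel()
    return norm.logpdf(xx, loc=mu, scale=np.hypot(sigma, h)).mean()

res = minimize_scalar(lambda m: -ds_loglik(m, x, h, sigma),
                      bounds=(-5.0, 5.0), method="bounded")
print(res.x)   # DSMLE of mu; essentially the sample mean in this toy model
```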
Although the DSMLE still employs smoothing, Seo and Lindsay (2013) proved that the DSMLE is consistent with almost no regularity conditions even with a fixed bandwidth h. In addition, the DSMLE is always more efficient than the MLE that applies single smoothing. This is because the DSMLE takes the smoothing effect into account by smoothing both the data and the model. If a given model satisfies standard regularity conditions, then the usual MLE is clearly superior to the DSMLE. On the other hand, if one requires any kind of smoothing for the initial model or data, then the DSMLE can be used to minimize the effect of smoothing, and produce an estimator that corresponds more closely with the initial model and data.
3.2. DSMLE for semiparametric location mixtures
In this section, we first present an alternative interpretation of the method given by Chee and Wang (2013) for semiparametric location mixtures, and then we propose the DSMLE to reduce biases and gain more efficiency. For this, let us consider the following model for g instead of g_K as in (2.2) or g_M as in (2.3):

g(x; q) = \frac{q(x) + q(-x)}{2},    (3.2)
where q(x) is a completely unspecified probability density on R⁺. That is, we model a symmetric density g(x) via an unspecified q, which is more natural and intuitive than g_K or g_M. The corresponding semiparametric mixture density is then

f(x; \pi, \mu, Q) = \sum_{j=1}^{m} \pi_j \, g(x - \mu_j),    (3.3)
where Q is the distribution function of q. However, leaving q completely unspecified raises some critical problems. For example, the likelihood based on (3.2) is not even bounded unless we impose some constraints on q. That is, at least one of the standard regularity conditions is violated by (3.3). In order to regularize the problem, we may consider a kernel-smoothed version of f(x; π, µ, Q), as

f^*(x; \pi, \mu, Q) = \sum_{j=1}^{m} \pi_j \, g^*(x - \mu_j; Q),

where

g^*(x; Q) = \int K_h(x - t) g(t) \, dt = \int K_h(x - t) \frac{q(t) + q(-t)}{2} \, dt = \int \frac{K_h(t - x) + K_h(t + x)}{2} \, dQ(t).
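This last equality can also be checked numerically. In the quick sketch below (ours), q is taken to be the Exp(1) density, so that g is the standard Laplace density; both quadratures are simple Riemann sums.

```python
import numpy as np
from scipy.stats import norm

h, x0 = 0.4, 0.7
t = np.linspace(-20.0, 20.0, 40001)
dt = t[1] - t[0]

# left-hand side: smooth g(t) = (q(t) + q(-t))/2 with the kernel K_h;
# with q = Exp(1), g is the standard Laplace density exp(-|t|)/2
g = 0.5 * np.exp(-np.abs(t))
lhs = np.sum(norm.pdf(x0 - t, scale=h) * g) * dt

# right-hand side: the symmetrized kernel mixture with dQ(t) = exp(-t) dt
eta = t[t >= 0]
q = np.exp(-eta)
rhs = np.sum(0.5 * (norm.pdf(x0 - eta, scale=h)
                    + norm.pdf(x0 + eta, scale=h)) * q) * dt

print(lhs, rhs)   # the two values agree up to quadrature error
```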
Indeed, g*(x; Q) is exactly the same as g_M(x; Q) in (2.3). Therefore, as an alternative interpretation, g_M(x; Q) can be considered as the smoothed model density for the initial model (3.2). Consequently, the smoothed model density f*(x; π, µ, Q) is also the same as f_M(x; π, µ, Q) in (2.4).

Now, if X1, . . . , Xn is a random sample from a population having density (3.3), the most ideal estimate of (π, µ, Q) should be obtained from the likelihood based on (3.3). However, the estimates of Chee and Wang (2013) are obtained from the likelihood based on f_M(x; π, µ, Q), the density of Xi + ϵ, where the probability density function of ϵ is K_h(·). That is, f_M is the probability density of the random sample contaminated with measurement error, rather than of the given sample. Therefore, the estimation based on f_M inevitably causes biases and information loss when we assume the model (3.3) for a given sample. Of course, when the class of f_M contains the true model for a certain choice of h, no bias would occur if h were suitably chosen. However, even in this case, the choice of h would constitute a critical issue. In addition, since (3.3) is clearly a larger class than the class of f_M, the estimator based on (3.3) provides a more reliable result. Hence, modeling q as a continuous mixture can suitably regularize the estimation problem, but results in some loss of flexibility and stability.

From the double smoothing viewpoint described in Section 3.1, if we use f*, we should also use a smoothed version of the data instead of the raw data. To apply the double smoothing procedure to this problem, following Section 3.1, we construct the smoothed data density using the same kernel K_h as

\hat{f}_n^*(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i).
The DSMLE is then defined as the maximizer of the DS log-likelihood

l^*(\pi, \mu, Q) = \int \log\{f^*(x; \pi, \mu, Q)\} \, \hat{f}_n^*(x) \, dx = \frac{1}{n} \sum_{i=1}^{n} \int \log\{f^*(x; \pi, \mu, Q)\} \, K_h(x - X_i) \, dx.
For any fixed h and K satisfying conditions (A1)–(A3) in Seo and Lindsay (2013), F(x; π̂n, µ̂n, Q̂n) weakly converges to F(x; π0, µ0, Q0) on a set of probability one (Seo and Lindsay, 2013), where F is the distribution function of f, and (π̂n, µ̂n, Q̂n) and (π0, µ0, Q0) are the DSMLE and the true model index of (π, µ, Q), respectively.

For the consistency of the model index (π̂n, µ̂n, Q̂n), first, the model should be identifiable, and this is verified by Bordes et al. (2006) and Hunter et al. (2007) for m ≤ 3. Second, we need to verify the model continuity condition (M2) in Seo and Lindsay (2013); that is, the convergence of the model distribution should imply the convergence of the model index. When the parameter space of (π, µ) is compact, this can be shown using the model identifiability and an argument similar to that of Corollary 1 in Seo and Lindsay (2013).

Lemma 1. The convergence of F(x; π̂n, µ̂n, Q̂n) to F(x; π0, µ0, Q0) implies the convergence of (π̂n, µ̂n) to (π0, µ0) and the weak convergence of Q̂n to Q0 for m ≤ 3.

Proof. For the sequence {(π̂n, µ̂n, Q̂n)}, there exists a subsequence {k} ⊂ {n} such that Q̂k vaguely converges to some Q1 and (π̂k, µ̂k) converges to some (π1, µ1) on a set of probability one. The identifiability of F(x; π, µ, Q) and the distributional consistency of F(x; π̂k, µ̂k, Q̂k) then imply that (π1, µ1, Q1) = (π0, µ0, Q0) except on a set of probability zero. This means that for every convergent subsequence {k} ⊂ {n}, Q̂k weakly converges to Q0 and (π̂k, µ̂k) converges to (π0, µ0). This in turn implies the convergence of (π̂n, µ̂n) to (π0, µ0) and the weak convergence of Q̂n to Q0.

Proposition 1. If the parameter space of (π, µ) is compact and conditions (A1)–(A3) in Seo and Lindsay (2013) are satisfied, the DSMLE of (π, µ, Q) is consistent on a set of probability one for a fixed bandwidth h when m ≤ 3.
Proof. From Seo and Lindsay (2013), F(x; π̂n, µ̂n, Q̂n) converges to F(x; π0, µ0, Q0) on a set of probability one. The identifiability and Lemma 1 imply the consistency of the DSMLE (π̂n, µ̂n, Q̂n).

From a computational point of view, maximizing l* requires the computation of ∫ log{f*(x; π, µ, Q)} K_h(x − Xi) dx for each i. Unfortunately, this integration cannot be solved explicitly, and requires the application of some approximation method. However, the application of an approximation method can make the computation even harder, because l* contains both parametric and nonparametric components that must be estimated. In the next section, we discuss this computational issue, along with the choice of the bandwidth, in detail.
4. Bandwidth selection and computation
4.1. Bandwidth selection
Theoretically, the DSMLE is consistent regardless of the choice of the bandwidth h, which implies that the DSMLE is also robust to the choice of h. Hence, the DSMLE is less sensitive to the choice of h than the estimator of Chee and Wang (2013) or other kernel-based methods. This is not surprising, because both the initial model and the data are artificially smoothed with the same kernel and bandwidth, whereas the usual kernel-based estimators smooth only the data.

In practice, however, there must exist an optimal choice of bandwidth. To choose h, Chee and Wang (2013) tested AIC, BIC, least-squares cross-validation (LSCV), and likelihood cross-validation (LCV) (Bowman, 1984), and compared the performance of their estimators using various simulations. They found that the performances of their estimators are similar over all of the bandwidth selection criteria that they considered. Among those, they reported that LCV performs slightly better than the others. In this paper, we also adopt the LCV criterion for the choice of an optimal h. To compute LCV, we first randomly split the given random sample into V subsamples. Let S_v be the set of all observations except for the v-th subsample, and let |S_v| be the number of observations belonging to S_v. For each given h, LCV can then be calculated by

LCV(h) = -\frac{1}{V} \sum_{v=1}^{V} \frac{1}{|S_v|} \sum_{x_i \in S_v} \log f_M(x_i; \hat{\pi}^{v}, \hat{\mu}^{v}, \hat{Q}^{v}),    (4.1)

where π̂v, µ̂v, and Q̂v are the MLE based on S_v. Because the DS-likelihood is different from the likelihood based on f_M, we cannot directly apply LCV(h) for our purpose. Instead, we propose a smoothed version of LCV(h) as

LCV^{*}(h) = -\frac{1}{V} \sum_{v=1}^{V} \int \log\{f^*(x; \hat{\pi}^{v*}, \hat{\mu}^{v*}, \hat{Q}^{v*})\} \, \hat{f}_n^{v*}(x) \, dx,

where π̂v*, µ̂v*, and Q̂v* represent the DSMLE based on S_v, and f̂nv*(x) is the smoothed data density based on S_v. The best h is then the one that yields the smallest LCV(h) or LCV*(h).
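A minimal skeleton of this V-fold selection procedure might look as follows (a sketch of ours; `fit` and `loglik` are hypothetical placeholders for a model fitter and a mean log-density evaluator, and the sketch scores each fold's held-out observations, the usual cross-validation convention):

```python
import numpy as np

def select_bandwidth(x, grid_h, fit, loglik, V=5, seed=1):
    """Grid search over h using V-fold likelihood cross-validation.

    fit(train, h)           -> fitted parameters (hypothetical placeholder)
    loglik(params, test, h) -> mean log-density on test (hypothetical)
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), V)
    scores = []
    for h in grid_h:
        s = 0.0
        for idx in folds:
            train = np.delete(x, idx)       # S_v: all but the v-th subsample
            params = fit(train, h)
            s += loglik(params, x[idx], h)  # score the held-out subsample
        scores.append(-s / V)
    return grid_h[int(np.argmin(scores))]   # h with the smallest criterion
```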
In general, a grid search is required to apply the LCV (or LCV*) criterion. Because a grid search requires excessive computing time, it is important to restrict the range of the grid beforehand. Chee and Wang (2013) employed a grid between 0.2s and 0.8s in their simulations, where s is the sample standard deviation. However, this choice only works under their particular simulation settings, and in general, it is very difficult to choose a suitable range for h for a grid search beforehand.

For the DSMLE, however, there exists a useful tool to determine a suitable range for h, namely the spectral degrees of freedom (sDOF) proposed by Lindsay et al. (2008). The sDOF is defined based on the spectral decomposition of the L2 distance between the smoothed model and data densities, and is analogous to the degrees of freedom in the usual χ² goodness-of-fit test. In the χ² goodness-of-fit test, there is no optimal rule for the choice of the binwidth or the degrees of freedom; in practice, it is recommended to adopt a binwidth that corresponds to degrees of freedom between 5 and n/5, where n is the sample size. Similarly, Seo and Lindsay (2013) recommended the use of an h that corresponds to sDOF between 5 and n/5 for the DSMLE. Although this guideline cannot give the optimal h, it is very helpful when we need to determine a suitable range for h before we fit the model. Because computing the empirical sDOF is reasonably easy and fast, we can considerably reduce the computing time. Based on various simulations, we also found that the DSMLE is reasonably stable within this range. For a given kernel K and bandwidth h, the empirical sDOF can be calculated by

\widehat{\mathrm{sDOF}} = \frac{\left\{ \sum_{i=1}^{n} \tilde{K}_h^{cen}(X_i, X_i) \right\}^2}{\sum_{i=1}^{n} \sum_{j=1}^{n} \left\{ \tilde{K}_h^{cen}(X_i, X_j) \right\}^2},

where

\tilde{K}_h^{cen}(X_i, X_j) = \tilde{K}_h(X_i, X_j) - \frac{1}{n} \sum_{i=1}^{n} \tilde{K}_h(X_i, X_j) - \frac{1}{n} \sum_{j=1}^{n} \tilde{K}_h(X_i, X_j) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{K}_h(X_i, X_j)

and

\tilde{K}_h(x, y) = \int K_h(x - t) K_h(t - y) \, dt.
For a more detailed discussion of the sDOF, see Lindsay et al. (2008).
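For the normal kernel, K̃_h has a closed form (the N(0, 2h²) density evaluated at x − y), and the centering above is exactly the usual double centering of the kernel Gram matrix, so the empirical sDOF takes only a few lines of code (a sketch of ours, assuming a normal kernel):

```python
import numpy as np
from scipy.stats import norm

def sdof(x, h):
    """Empirical spectral degrees of freedom for the normal kernel K_h.

    For the normal kernel, K~_h(x, y) = ∫ K_h(x - t) K_h(t - y) dt is the
    N(0, 2h^2) density evaluated at x - y; K~_h^cen is its double centering.
    """
    x = np.asarray(x)
    n = len(x)
    K = norm.pdf(x[:, None] - x[None, :], scale=np.sqrt(2.0) * h)
    C = np.eye(n) - np.ones((n, n)) / n
    Kc = C @ K @ C                               # centered kernel matrix
    return np.trace(Kc) ** 2 / np.sum(Kc ** 2)

# scan bandwidths and keep those with 5 <= sDOF <= n/5, as recommended above
rng = np.random.default_rng(0)
x = rng.normal(size=200)
for h in (0.05, 0.1, 0.3, 1.0):
    print(h, round(sdof(x, h), 1))
```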
4.2. Computation

In order to compute the DSMLE, we first require an explicit form for l*. Unfortunately, an explicit form of l* does not exist in our problem. There are several ways to approximate l*, including a low-order Taylor expansion of the integrand, Gaussian quadrature, and other techniques. Among these, we propose the use of the Monte-Carlo approximation (Seo and Lindsay, 2010), using the following representation of l*:

l^*(\pi, \mu, Q) = \int \log\{f^*(x; \pi, \mu, Q)\} \, \hat{f}_n^*(x) \, dx = \int\!\!\int \log\{f^*(x; \pi, \mu, Q)\} \, K_h(t - x) \, d\hat{F}_n(t) \, dx \approx \frac{1}{nJ} \sum_{i=1}^{n} \sum_{j=1}^{J} \log\{f^*(x_{ij}; \pi, \mu, Q)\},    (4.2)
where F̂n is the empirical distribution function based on the given random sample, xij = xi + ϵij for i = 1, . . . , n and j = 1, . . . , J, and the ϵij's are random numbers generated from the kernel density K_h. The use of the Monte-Carlo approximation has two advantages. First, l* can easily be computed without having an explicit form for it. Second, by using the Monte-Carlo sample, the approximated l* has the same structure as the usual log-likelihood function. That is, (4.2) is the same as the log-likelihood based on (2.4), but with the Monte-Carlo sample xij = xi + ϵij, for i = 1, . . . , n and j = 1, . . . , J. Hence, the algorithm proposed by Chee and Wang (2013) can be directly applied without any modification. However, such a Monte-Carlo approximation also has two disadvantages. First, the estimator obtained from (4.2) differs over different simulated samples of the ϵij's. This makes the estimator simulation-dependent and causes an extra variation in the estimator. In order to reduce this variation, a large number of Monte-Carlo random samples is required. Second, because the application of the algorithm proposed by Chee and Wang (2013) to the xij's proceeds as though the data size were J times bigger than the original sample, a large amount of computing time is required. If we use LCV*(h) for the choice of h, the computing time increases greatly, because we need to repeatedly fit the model for each subsample and each h on the chosen grid points.
Fig. 1. Density functions for (a) separate case and (b) overlapped case.
In order to overcome these two disadvantages, we recommend the use of J fixed ϵij's, rather than random numbers for the ϵij. To explain this, let ϵij = K_h⁻¹(j/(J + 1)), j = 1, . . . , J, where K_h⁻¹ denotes the quantile function corresponding to the kernel density K_h(x), so that the ϵij's are identical and fixed for each i. By this means, the ϵij's can be considered as a well-balanced, though not random, sample from K_h(x). In addition, l* can be closely approximated with a relatively small J. By applying this small trick, we can significantly reduce the overall computing time without sacrificing much accuracy. In our experience, the computation of (4.2) using fixed ϵij's with J = 15 gives almost the same result as that obtained from a simulated Monte-Carlo sample with J = 100. Even with J = 7, the result appears to be reliable. However, a suitable choice of J should depend on the given model and sample size. A sketch of this construction is given below.
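The fixed-perturbation augmentation can be written as follows (a sketch of ours; the normal kernel is an assumption, and `f_star` in the closing comment is a hypothetical density evaluator):

```python
import numpy as np
from scipy.stats import norm

def augment(x, h, J=15):
    """Build the augmented sample x_ij = x_i + eps_j of (4.2) with fixed
    eps_j equal to the j/(J+1) quantiles of the kernel density (normal
    kernel assumed), so the same J perturbations are reused for every x_i."""
    eps = norm.ppf(np.arange(1, J + 1) / (J + 1), scale=h)
    return (np.asarray(x)[:, None] + eps).ravel()   # length n * J

# the DS log-likelihood (4.2) is then an ordinary log-likelihood over the
# augmented sample, e.g. l_star ≈ np.mean(np.log(f_star(augment(x, h), params)))
```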
5. Numerical examples

5.1. Simulation studies
For the simulation studies, we largely adopt simulation scenarios similar to those of Chee and Wang (2013), except for several small details. From this point on, we denote the estimator employed by Chee and Wang (2013) as the CW estimator. Because the CW estimator outperforms other kernel-based estimators, we focus on the comparison between the CW estimator and the DSMLE.

Our basic simulation model consists of (2.1) with m = 2. That is, random samples of size n are generated from the population with probability density f(x; π, µ1, µ2) = πg(x − µ1) + (1 − π)g(x − µ2). For the symmetric component density g, we considered four symmetric distributions: (I) Laplace, (II) normal, (III) t5, and (IV) uniform. These involve more extreme cases than the distributions used in Chee and Wang (2013). For the true parameter values, we tested well-separated and overlapped cases. For the well-separated case, we set π = 0.3, µ1 = −2, and µ2 = 2, and for the overlapped case, π = 0.3, µ1 = −1, and µ2 = 1. Fig. 1 shows plots of each probability density function for the two cases. Note that each distribution is scaled to have unit variance so that suitable comparisons can be made; a short sketch of the data-generating scheme is given below. Because Chee and Wang (2013) reported that the CW estimator based on LCV results in a better performance than other criteria, we employ LCV and doubly smoothed LCV for CW and the DSMLE, respectively. In most cases, the chosen optimal bandwidth for the CW estimator is larger than that for the DSMLE. All of our simulation results were obtained using MATLAB code.

Table 1 shows the mean squared error (MSE) for each scenario based on 200 replications in the well-separated case. The number in each parenthesis represents the empirical bias. For cases (I) and (III), the performance of the DSMLE is slightly better than that of CW, and the empirical biases are negligible. For case (II), where the true symmetric distribution is the normal, CW performs slightly better than the DSMLE. This is natural, because the class of normal distributions is a subclass of model (2.3) when K_h is the normal kernel. When the true distribution is uniform (case (IV)), the DSMLE performs considerably better than CW, although CW still produces reasonable estimates. Table 2 shows the results when the two component densities are relatively overlapped compared to the previous case. For all simulated models, the DSMLE achieves a superior performance to the CW estimator. This deterioration of CW is mainly the result of non-negligible biases; in particular, for case (IV), the bias is considerably large. As we described in Section 3, the CW estimator uses a model that is contaminated in some sense, which leads to the production of biases, whereas the DSMLE minimizes them.

In addition, we also investigate the performance of each method in estimating the common symmetric distribution. For this, we computed the estimators of G(0.5), G(1.0), G(1.5), and G(2.0) using the nonparametric estimator of the component densities based on CW and the DSMLE, where G is the distribution function of the symmetric component density g. Figs. 2 and 3 present boxplots for each estimator, based on 200 replications with n = 200.
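A sketch of this data-generating scheme (ours; the helper names are illustrative), with each component scaled to unit variance as stated above:

```python
import numpy as np

def rcomponent(dist, size, rng):
    """Draw from a unit-variance symmetric component density g.

    Scalings: Laplace(b) has variance 2b^2, so b = 1/sqrt(2);
    t_5 has variance 5/3, so multiply by sqrt(3/5);
    Uniform(-a, a) has variance a^2/3, so a = sqrt(3).
    """
    if dist == "laplace":
        return rng.laplace(scale=1.0 / np.sqrt(2.0), size=size)
    if dist == "normal":
        return rng.normal(size=size)
    if dist == "t5":
        return rng.standard_t(df=5, size=size) * np.sqrt(3.0 / 5.0)
    if dist == "uniform":
        return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=size)
    raise ValueError(dist)

def rmixture(n, pi, mu, dist, rng):
    """Sample from pi * g(x - mu[0]) + (1 - pi) * g(x - mu[1])."""
    z = rng.random(n) < pi
    return rcomponent(dist, n, rng) + np.where(z, mu[0], mu[1])

rng = np.random.default_rng(42)
x = rmixture(200, 0.3, (-2.0, 2.0), "laplace", rng)  # well-separated case
```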
Table 1
MSE × 100 (bias × 100) for parameter estimates in the separate case.

n     g             Method   π1              µ1              µ2
100   (I) Laplace   CW       0.23 (0.64)     3.08 (2.02)     1.24 (−0.35)
                    DSMLE    0.23 (0.65)     3.01 (2.06)     1.17 (−0.25)
      (II) Normal   CW       0.25 (0.04)     4.91 (−0.00)    2.01 (−1.96)
                    DSMLE    0.25 (0.21)     5.15 (0.49)     1.91 (−1.39)
      (III) T5      CW       0.22 (−0.18)    4.74 (−0.94)    1.24 (−0.65)
                    DSMLE    0.22 (−0.18)    4.53 (−0.86)    1.22 (−0.83)
      (IV) Uniform  CW       0.25 (−0.60)    7.58 (4.15)     6.86 (−8.05)
                    DSMLE    0.21 (−0.35)    2.02 (0.73)     1.31 (−1.95)
200   (I) Laplace   CW       0.13 (0.08)     1.37 (−0.04)    0.60 (0.01)
                    DSMLE    0.13 (0.06)     1.27 (−0.29)    0.60 (−0.12)
      (II) Normal   CW       0.12 (0.05)     2.31 (0.73)     1.00 (−0.72)
                    DSMLE    0.12 (0.16)     2.46 (1.66)     1.00 (−0.08)
      (III) T5      CW       0.13 (0.32)     1.63 (2.43)     0.71 (−0.52)
                    DSMLE    0.14 (0.27)     1.61 (2.06)     0.69 (−0.42)
      (IV) Uniform  CW       0.17 (−0.07)    2.84 (2.72)     1.58 (−2.21)
                    DSMLE    0.12 (−0.11)    1.08 (0.34)     0.80 (−0.76)
Table 2
MSE × 100 (bias × 100) for parameter estimates in the overlapped case.

n     g             Method   π1              µ1               µ2
100   (I) Laplace   CW       0.34 (0.10)     6.69 (3.19)      2.00 (−2.14)
                    DSMLE    0.32 (0.62)     4.47 (1.48)      1.52 (0.13)
      (II) Normal   CW       0.38 (−0.43)    10.19 (7.12)     3.85 (−4.74)
                    DSMLE    0.38 (−0.28)    8.81 (6.12)      3.92 (−4.30)
      (III) T5      CW       0.37 (−1.27)    9.26 (1.04)      2.36 (−5.42)
                    DSMLE    0.31 (−0.50)    6.33 (1.29)      2.14 (−2.81)
      (IV) Uniform  CW       0.94 (0.65)     18.72 (21.57)    13.11 (−12.49)
                    DSMLE    0.47 (−0.19)    3.79 (4.61)      2.66 (−2.99)
200   (I) Laplace   CW       0.21 (0.26)     2.44 (−0.63)     0.64 (0.11)
                    DSMLE    0.19 (0.36)     2.47 (0.03)      0.73 (0.34)
      (II) Normal   CW       0.21 (−0.35)    6.77 (9.35)      2.52 (−4.90)
                    DSMLE    0.20 (−0.08)    5.72 (7.89)      2.26 (−3.59)
      (III) T5      CW       0.23 (−0.58)    3.50 (0.05)      1.35 (−2.93)
                    DSMLE    0.19 (0.15)     2.82 (1.39)      1.13 (−0.78)
      (IV) Uniform  CW       0.35 (−1.20)    13.27 (17.80)    10.14 (−16.29)
                    DSMLE    0.19 (−0.45)    1.61 (3.74)      0.72 (−2.14)
Note that the horizontal reference line in each boxplot represents the true value. In both figures, it is clear that the DSMLE outperforms CW. In particular, CW produces non-negligible biases except in the normal case.

Because it is known that (2.1) is identifiable up to m = 3, we also carry out similar simulations with a three-component semiparametric mixture,

f(x; \pi_1, \pi_2, \mu_1, \mu_2, \mu_3) = \pi_1 g(x - \mu_1) + \pi_2 g(x - \mu_2) + (1 - \pi_1 - \pi_2) g(x - \mu_3).
We set (π1, π2) = (0.3, 0.4) and (µ1, µ2, µ3) = (−3, 0, 3). Similar to Tables 1 and 2, Table 3 shows the MSE and empirical bias based on 200 replications, and the results are quite similar to the m = 2 case. The boxplots for the estimated G(x) at x = 0.5, 1.0, 1.5, 2.0 also show patterns similar to the m = 2 case, so we omit them here.
5.2. Real data examples
5.2.1. Body dimension data
The first data set we consider consists of body dimension data (Heinz et al., 2003), which contains 24 body dimension measurements for 247 men and 260 women. Among the body measurements, we consider elbow diameters and shoulder girths over deltoid muscles. These two types of measurement are selected because the sample standard deviations for men and women are almost identical. For the model fitting, we ignore the gender indication variable and fit the two-component semiparametric location-shift mixture.

Table 4 shows the estimated parameter values for the two measurements. In the table, "True" is obtained under the true gender identification. The smoothing parameter h in the table represents the best bandwidth selected by the LCV and doubly smoothed LCV criteria.
Fig. 2. Boxplots of estimated G(0.5), G(1.0), G(1.5), and G(2.0) for separate case.

Table 3
MSE × 100 (bias × 100) for parameter estimates with a three-component mixture.

n     g      Method   π1              π2              µ1               µ2              µ3
100   (I)    CW       0.32 (−0.52)    0.71 (1.44)     3.09 (−0.44)     1.96 (−0.76)    4.06 (−0.12)
             DSMLE    0.24 (−0.09)    0.41 (0.53)     3.17 (−0.99)     1.55 (−0.68)    3.29 (0.07)
      (II)   CW       0.30 (0.56)     0.40 (0.50)     8.22 (−0.14)     6.31 (3.24)     6.01 (2.37)
             DSMLE    0.29 (0.40)     0.39 (0.62)     9.74 (−0.40)     4.41 (1.87)     6.51 (2.90)
      (III)  CW       0.27 (0.00)     0.37 (−0.07)    5.16 (0.29)      3.05 (0.88)     5.59 (0.27)
             DSMLE    0.27 (−0.04)    0.38 (0.10)     5.01 (0.15)      2.48 (1.55)     5.51 (0.03)
      (IV)   CW       0.48 (1.49)     0.95 (−3.18)    18.21 (25.51)    9.29 (1.69)     14.92 (−20.51)
             DSMLE    0.34 (0.18)     0.65 (−1.36)    7.06 (10.41)     1.96 (−0.15)    7.63 (−10.46)
200   (I)    CW       0.13 (0.25)     0.21 (0.28)     1.60 (1.73)      0.99 (−0.12)    1.39 (−0.31)
             DSMLE    0.14 (0.09)     0.25 (0.49)     1.55 (1.40)      0.95 (−0.61)    1.42 (−0.27)
      (II)   CW       0.15 (−0.32)    0.19 (0.07)     2.99 (−0.44)     2.92 (−1.79)    3.59 (−4.42)
             DSMLE    0.14 (−0.24)    0.18 (−0.00)    3.17 (−0.56)     2.16 (−1.53)    3.82 (−3.55)
      (III)  CW       0.14 (0.36)     0.18 (0.34)     1.96 (−0.20)     1.48 (−1.34)    2.35 (−0.02)
             DSMLE    0.14 (−0.40)    0.18 (0.36)     1.91 (−0.05)     1.35 (−1.04)    2.31 (−0.14)
      (IV)   CW       0.25 (1.67)     0.46 (−3.33)    9.82 (20.65)     3.68 (−0.09)    12.70 (−24.21)
             DSMLE    0.18 (0.81)     0.29 (−1.27)    2.78 (5.68)      0.80 (1.11)     3.54 (−6.98)
Fig. 3. Boxplots of estimated G(0.5), G(1.0), G(1.5), and G(2.0) for overlapped case.
Table 4
Parameter estimates and elapsed time in seconds for elbow and shoulder measurements.

Variable   Method   h      π      µ1       µ2       σ1     σ2     Elapsed time
Elbow      True     –      0.51   12.37    14.46    0.84   0.88   –
           CW       0.76   0.56   12.46    14.55    0.87   0.87   5.12
           DSMLE    0.16   0.56   12.44    14.55    0.86   0.86   70.95
Shoulder   True     –      0.51   100.31   116.50   6.50   6.50   –
           CW       1.87   0.53   99.87    117.19   5.97   5.97   11.28
           DSMLE    1.48   0.53   100.13   117.13   5.98   5.98   42.97
The standard deviations σ1 and σ2 for each component are always identical for CW and the DSMLE, but can differ for "True". The estimated parameters for both CW and the DSMLE are very close to "True". In the elbow diameter measurements, all estimated parameters from CW and the DSMLE are similar, and close to the true values. For the elbow diameters, it appears that a two-component normal mixture model with homogeneous variance fits well, because it results in the same estimates as CW. In the shoulder girth measurements, the DSMLE provides a slightly better approximation of the true values than CW. In Table 4, we also provide the elapsed computing time for the h chosen by the LCV and smoothed LCV criteria. The computing times for the DSMLE are much longer than those for CW. This is mainly owing to the augmented sample: in this example, we used a fixed Monte-Carlo sample with J = 15 to approximate the doubly smoothed likelihood.
Fig. 4. Histogram of male and female elbow diameter measurements (left) and CDF plot of estimated symmetric component density (right).
Fig. 5. Histogram of male and female shoulder girth measurements (left) and CDF plot of estimated symmetric component density (right).
For both CW and the DSMLE, the smaller the h we use, the longer the computing time we need. Since the optimal h for the DSMLE is smaller than that for CW in most cases, the DSMLE with the optimal h also needs extra computing time.

For further investigation, we provide histograms of all observations without gender labels, and cumulative distribution function (CDF) plots of the estimated common symmetric distributions, in Figs. 4 and 5, respectively. In the CDF plots, the smooth line and the step-function-like line represent the estimated CDFs using CW and the DSMLE, respectively. The dashed line represents the empirical CDF of the observations with the gender means subtracted. Both the CW estimator and the DSMLE give similar estimates for the parameters and the component distribution. In order to see the sensitivity to the bandwidth h, we also present the parameter estimates over various h in Fig. 6. In this figure, the horizontal axis represents h, and the vertical axis represents the estimated values of (π, µ1, µ2, σ) based on CW and the DSMLE. For small h, the two estimators are similar to each other. However, for large h, CW appears to break down, while the DSMLE remains in a reasonable range.
5.2.2. Faithful data
Our second data set is the Old Faithful Geyser data set, which contains waiting times between eruptions and durations of eruptions for the Old Faithful Geyser in Yellowstone National Park in the USA. This data set is commonly employed as an example of a two-component mixture model. For this data set, we do not know the true labels or the true number of components. We fit the data set with a two-component location-shifted semiparametric mixture. A table, histogram, and CDF plot similar to those in Section 5.2.1 are presented in Table 5 and Fig. 7. Because "True" is not applicable in this case, we simply provide the parameter estimates given by CW and the DSMLE. In Table 5, σ is the estimated standard deviation for each group, and S is the estimated standard deviation of the entire data set. Note that the sample standard deviation of the raw data is 13.59. For the same reason, the CDF plot in Fig. 7 is given not for the component distribution but for the semiparametric mixture distribution.
Fig. 6. CW estimator and DSMLE over various h for the shoulder girth measurement.
Fig. 7. Histogram of waiting times (left) and estimated CDF plot of waiting times (right).
Table 5
Parameter estimates for the Old Faithful Geyser data.

Method   h      π      µ1      µ2      σ      S
CW       5.66   0.36   54.70   80.14   5.88   13.56
DSMLE    1.22   0.36   54.90   80.21   5.80   13.48
6. Concluding remarks
In this paper, we introduced a doubly smoothed maximum likelihood estimation method for semiparametric mixtures. For location mixtures with an identical symmetric component density, the most ideal model would be (3.2), in which no distributional assumption is made on the symmetric component density. However, the likelihood for this model is not even bounded. Several methods for modeling or estimating the unspecified symmetric component density have been proposed to resolve this issue. Such methods basically employ a certain degree of smoothing for the model; however, this smoothing results in bias or efficiency loss, and the choice of the bandwidth is always a critical issue. Our proposed method is designed to reduce the smoothing effect by employing the double smoothing technique, which can reduce the introduced bias and efficiency loss. Some numerical examples were shown to demonstrate that the proposed method is less sensitive to the choice of bandwidth than existing methods. In addition, the proposed method is more efficient and produces less bias than the others.

One potential drawback of the proposed method is that the DSMLE does not provide a continuous density estimator for the component density, while other existing methods do; the DSMLE only provides a discrete distribution for the component density. When the main goal of the study is to estimate the location parameters, this is not a problem. However, in some applications, the continuous component density could also be of interest. In such cases, a researcher may produce a continuous density based on the discrete DSMLE by using a smoothing technique. An additional drawback is the increase in required computing time, owing to the adoption of the Monte-Carlo DS-likelihood method. To address a similar issue, Seo (2012) used a Taylor series expansion of the integrand, so that the integral is approximated by several moments of the kernel. This may work in parametric models and some nonparametric models, but it cannot be applied to semiparametric mixtures, owing to the nonparametric mixing distribution. A more efficient approximation method should be investigated in the future.
Acknowledgments

The author would like to thank the associate editor and two anonymous referees for valuable suggestions and comments. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (NRF-2016R1A2B4007373).

References

Bohning, D., 1985. Numerical estimation of a probability measure. J. Statist. Plann. Inference 11, 57–69.
Bordes, L., Chauveau, D., Vandekerkhove, P., 2007. A stochastic EM algorithm for a semiparametric mixture model. Comput. Statist. Data Anal. 51, 5429–5443.
Bordes, L., Mottelet, S., Vandekerkhove, P., 2006. Semiparametric estimation of a two component mixture model. Ann. Statist. 34, 1204–1232.
Bowman, A.W., 1984. An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71, 353–360.
Chauveau, D., Hunter, D.R., Levine, M., 2015. Semi-parametric estimation for conditional independence multivariate finite mixture models. Stat. Surv. 9, 1–31.
Chee, C., Wang, Y., 2013. Estimation of finite mixtures with symmetric components. Stat. Comput. 23, 233–249.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39, 1–38.
Hall, P., Zhou, X., 2003. Nonparametric estimation of component distributions in a multivariate mixture. Ann. Statist. 31, 201–224.
Heinz, G., Peterson, L.J., Johnson, R.W., Kerk, C.J., 2003. Exploring relationships in body dimensions. J. Stat. Educ. 11. www.amstat.org/publications/jse/v11n2/datasets.heinz.html.
Hohmann, D., Holzmann, H., 2013. Two-component mixtures with independent coordinates as conditional mixtures: Nonparametric identification and estimation. Electron. J. Stat. 7, 859–880.
Hunter, D., Wang, S., Hettmansperger, T., 2007. Inference for mixtures of symmetric distributions. Ann. Statist. 35, 224–251.
Lesperance, M.L., Kalbfleisch, J.D., 1992. An algorithm for computing the nonparametric MLE of a mixing distribution. J. Amer. Statist. Assoc. 87, 120–126.
Lindsay, B.G., Markatou, M., Ray, S., Yang, K., Chen, S., 2008. Quadratic distances on probabilities: A unified foundation. Ann. Statist. 36, 983–1006.
Seo, B., 2012. An EM algorithm for the doubly smoothed MLE in normal mixture models. Commun. Stat. Appl. Methods 19, 135–145.
Seo, B., Lindsay, B.G., 2010. A computational strategy for doubly smoothed MLE exemplified by the normal mixture model. Comput. Statist. Data Anal. 54, 1930–1941.
Seo, B., Lindsay, B.G., 2013. A universally consistent modification of maximum likelihood. Statist. Sinica 23, 467–487.
Wang, Y., 2007. On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. J. R. Stat. Soc. Ser. B Stat. Methodol. 69, 185–198.