Integrated squared error estimation of normal mixtures


Computational Statistics & Data Analysis 44 (2004) 517–526

Panagiotis Besbeas and Byron J. T. Morgan
Institute of Mathematics and Statistics, University of Kent at Canterbury, Canterbury, Kent, CT2 7NF, UK
Received 14 August 2002

Abstract

Based on the empirical characteristic function, the integrated squared error criterion for normal mixtures is shown to have a simple form for a particular weight function. When the parameter of that weight function is chosen as the smoothed cross-validation selector in kernel density estimation, the estimator which minimises the criterion is shown to perform well in a simulation study. In comparison with maximum likelihood and a recently proposed alternative method, the method of this paper gives better bias and standard deviation results. Furthermore, the new estimator is less likely to fail and is appreciably more robust. © 2002 Elsevier B.V. All rights reserved.

Keywords: Characteristic function; Integrated squared error; Kernel density estimation; Normal mixtures; Parseval's theorem; Simulated annealing; Smoothed cross-validation selector

1. Introduction

Finite mixtures of distributions have received widespread attention in the statistical literature, mainly due to the large number of areas in which they are encountered. In practice, the most widely used finite mixture distributions are those involving normal components. In its most general form, the mixture of k normal distributions is defined by the probability density function (pdf)

f(x; V) = \sum_{j=1}^{k} p_j \, \phi(x; \mu_j, \sigma_j^2), \qquad x \in \mathbb{R}, \qquad (1)


where each p_j, j = 1, …, k, is non-negative and \sum_{j=1}^{k} p_j = 1, and where φ(x; μ, σ²) denotes the normal pdf with mean μ and variance σ². The vector V denotes the collection of all distinct parameters occurring in the mixture so, for example, in the simulation studies of Section 3, where k = 2, V = (p, μ_1, σ_1, μ_2, σ_2).

The problem of estimating the parameters in a mixture of normal densities has also been the subject of a large, diverse body of literature. The likelihood surface for a normal mixture exhibits singularities, resulting in difficulties with maximum-likelihood estimation (Everitt and Hand, 1981, p. 9). As a result, various authors have considered alternative methods of parameter estimation, including methods based on empirical generating functions: see, for example, Quandt and Ramsey (1978), who use the moment generating function, and Kumar et al. (1979) and Bryant and Paulson (1983), who both use the empirical characteristic function (ecf). In this paper, we provide a new method involving integrated squared error (ISE) based on the ecf (Heathcote, 1977).

In Section 2 of the paper, we derive a simple expression for the ISE criterion for a particular form of weight function. We suggest choosing the parameter of the weight function as the bandwidth in an appropriate kernel density estimation problem. In Section 3 of the paper, we compare the resulting method with maximum likelihood and with a procedure recently proposed by Scott (2001) and Scott and Szewczyk (2001), which also involves minimising an integrated squared error expression; we call this the L2 method. The surfaces searched by the ISE and L2 methods are free of singularities. For the simulation experiments the ISE method is shown to perform well.
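As a concrete illustration of the mixture pdf (1), the following minimal Python sketch (ours, not from the paper; the function name is illustrative) evaluates a k-component normal mixture density.

    import numpy as np
    from scipy.stats import norm

    def normal_mixture_pdf(x, p, mu, sigma2):
        # Evaluate the k-component normal mixture pdf of Eq. (1).
        # p, mu and sigma2 are length-k sequences of mixing proportions,
        # means and variances; x may be a scalar or an array of points.
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return sum(pj * norm.pdf(x, loc=mj, scale=np.sqrt(vj))
                   for pj, mj, vj in zip(p, mu, sigma2))

    # Example: a two-component mixture of the kind fitted in Section 3 (k = 2)
    f = normal_mixture_pdf([-2.0, 0.0, 2.0], p=[0.5, 0.5], mu=[-2.0, 2.0], sigma2=[1.0, 1.0])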

2. The ISE criterion, and selecting the smoothing parameter

The characteristic function corresponding to (1) has the form

\varphi(t; V) = \sum_{j=1}^{k} p_j \exp(\mathrm{i} t \mu_j - \sigma_j^2 t^2 / 2),

which may be estimated by the ecf,

\varphi_n(t) = n^{-1} \sum_{j=1}^{n} \mathrm{e}^{\mathrm{i} t X_j},

formed from the random sample {X_1, …, X_n} from f(x; V). ISE estimators V̂ result from minimising with respect to V the criterion

I(V) = \int_{-\infty}^{\infty} |\varphi_n(t) - \varphi(t; V)|^2 \, \mathrm{d}W(t), \qquad (2)

for some weight function W(t), selected to ensure convergence of the integral in (2). The good robustness and asymptotic properties of the estimators V̂ have been established by Campbell (1993) and Heathcote (1977), respectively. If we take W(t) ≡ W(t; λ) = \int_{-\infty}^{t} \mathrm{e}^{-\lambda^2 y^2} \, \mathrm{d}y for λ > 0, then, writing I(V) ≡ I(V; λ), we find that, omitting


an additive constant which does not depend on V,

I(V; \lambda) = 2\pi \left( \mathbf{p}' I_0 \mathbf{p} - 2 \mathbf{p}' I_1 \right), \qquad (3)

where p = (p_1, p_2, …, p_k)′, I_0 is a symmetric k × k matrix with (i, j)th element

I_{0,ij} = \{2\pi(\sigma_i^2 + \sigma_j^2 + 2\lambda^2)\}^{-1/2} \exp\{-\tfrac{1}{2}(\mu_i - \mu_j)^2 / (\sigma_i^2 + \sigma_j^2 + 2\lambda^2)\},

and I_1 is a k × 1 vector with ith element

I_{1,i} = \frac{1}{n} \sum_{j=1}^{n} \{2\pi(\sigma_i^2 + 2\lambda^2)\}^{-1/2} \exp\{-\tfrac{1}{2}(X_j - \mu_i)^2 / (\sigma_i^2 + 2\lambda^2)\}.

Thus, the second term in (3) is twice the average of the components of the likelihood based on (1) and the random sample {X_i}, but with variances inflated by the addition of 2λ², and the sample does not influence the first term in (3). Furthermore, we can write

\mathbf{p}' I_1 = \frac{1}{n} \sum_{j=1}^{n} g\bigl(X_j; V, (2\lambda^2)^{1/2}\bigr),

where g(x; V, σ) = \sum_{\ell=1}^{k} p_\ell \, \phi(x; \mu_\ell, \sigma_\ell^2 + \sigma^2). It is straightforward to verify that when k = 1 expression (3) reduces to the corresponding form given by Thornton and Paulson (1977).

Estimates resulting from minimising I(V; λ) with respect to V will be functions of the parameter λ, V̂(λ), and we need a method of choosing λ. It is easily shown that λ needs to change with the sample. The size of λ affects the robustness of V̂(λ), robustness increasing as λ decreases. The influence functions of V̂(λ) have been shown to be bounded in Besbeas (1999). Unfortunately, the corresponding estimates of Var(V̂(λ)) are not tractable, so that choosing λ to maximise precision is not straightforward, as it was for example in Besbeas and Morgan (2001). Naive minimisation of I(V; λ) with respect to λ results in λ = ∞ (cf. Aitchison and Aitken, 1976). From Eq. (2), we can write

I(V; \lambda) = \int_{-\infty}^{\infty} |\varphi_n(t) - \varphi(t; V)|^2 \, |w(t; \lambda)|^2 \, \mathrm{d}t
             = \int_{-\infty}^{\infty} |\varphi_n(t) w(t; \lambda) - \varphi(t; V) w(t; \lambda)|^2 \, \mathrm{d}t,

where w(t; λ) = e^{-λ²t²/2}. Hence, by Parseval's theorem,

I(V; \lambda) = 2\pi \int_{-\infty}^{\infty} \bigl[ (f_n * w)(x; \lambda) - (f * w)(x; V; \lambda) \bigr]^2 \, \mathrm{d}x,


where

(f_n * w)(x; \lambda) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{\lambda\sqrt{2\pi}} \exp\Bigl\{-\frac{(x - X_j)^2}{2\lambda^2}\Bigr\}

and (f * w)(x; V; λ) = g(x; V; λ), so that we can write

I(V; \lambda) = 2\pi \int_{-\infty}^{\infty} \bigl[ g_n(x; \lambda) - g(x; V; \lambda) \bigr]^2 \, \mathrm{d}x,

where g_n(x; \lambda) = n^{-1} \sum_{j=1}^{n} \phi(x; X_j, \lambda^2) and g(x; V; λ) is its expectation. As g_n(x; λ) can be regarded as a kernel density estimate based on the normal kernel, for the data {X_j}, we have chosen λ as the smoothed cross-validation bandwidth selector, λ̃ (Wand and Jones, 1995, pp. 78, 86). We experimented with alternative bandwidth selectors, and obtained the best performance with λ̃, although other sophisticated alternatives could also be used.

Scott (2001) has proposed a new general method of parameter estimation which, for a density function f(x; V), results in the estimator V̂_{L2} which minimises the criterion

L_2(V) = \int f(x; V)^2 \, \mathrm{d}x - \frac{2}{n} \sum_{j=1}^{n} f(X_j; V),

corresponding to the random sample X_1, …, X_n. This procedure is motivated by the non-parametric least-squares cross-validation algorithm for choosing the bin width h of a histogram, using an ISE criterion. For the case of a mixture of two normal pdfs, the criterion to be minimised has the form

L_2(V) = \frac{p^2}{2\sigma_1\sqrt{\pi}} + \frac{(1-p)^2}{2\sigma_2\sqrt{\pi}} + 2p(1-p)\,\phi(\mu_1 - \mu_2; 0, \sigma_1^2 + \sigma_2^2)
        - \frac{2}{n} \sum_{j=1}^{n} \bigl\{ p\,\phi(X_j; \mu_1, \sigma_1^2) + (1-p)\,\phi(X_j; \mu_2, \sigma_2^2) \bigr\}.
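Both the ISE criterion (3) and the L2 criterion above have closed forms that are straightforward to code and to hand to a general-purpose optimiser. The following Python sketch is ours, not from the paper; the function names are illustrative and the vectorised layout is only one of several reasonable choices.

    import numpy as np
    from scipy.stats import norm

    def ise_criterion(V, x, lam):
        # Closed-form ISE criterion I(V; lambda) of Eq. (3), up to the additive
        # constant that does not depend on V.  V = (p, mu, sigma2) holds the
        # mixing proportions, means and variances of a k-component mixture.
        p, mu, sigma2 = (np.asarray(a, dtype=float) for a in V)
        x = np.asarray(x, dtype=float)
        s = sigma2[:, None] + sigma2[None, :] + 2.0 * lam ** 2
        I0 = np.exp(-0.5 * (mu[:, None] - mu[None, :]) ** 2 / s) / np.sqrt(2.0 * np.pi * s)
        v = sigma2 + 2.0 * lam ** 2          # inflated variances sigma_i^2 + 2*lambda^2
        I1 = norm.pdf(x[:, None], loc=mu[None, :], scale=np.sqrt(v)[None, :]).mean(axis=0)
        return 2.0 * np.pi * (p @ I0 @ p - 2.0 * p @ I1)

    def l2_criterion(V, x):
        # Scott's L2 criterion: integral of f^2 minus (2/n) times the sum of f at the data.
        p, mu, sigma2 = (np.asarray(a, dtype=float) for a in V)
        x = np.asarray(x, dtype=float)
        s = sigma2[:, None] + sigma2[None, :]
        int_f2 = p @ (np.exp(-0.5 * (mu[:, None] - mu[None, :]) ** 2 / s)
                      / np.sqrt(2.0 * np.pi * s)) @ p
        f_at_x = norm.pdf(x[:, None], loc=mu[None, :], scale=np.sqrt(sigma2)[None, :]) @ p
        return int_f2 - 2.0 * f_at_x.mean()

Either function can then be minimised over V by a derivative-free search such as Nelder-Mead, as discussed in Section 3.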

Because of the similarities between the specification of V̂_{L2} and the ISE estimator, and because Scott (2001) and Scott and Szewczyk (2001) have used the L2 method for fitting normal mixtures, we shall also consider the performance of V̂_{L2} in the simulation study of the next section.

3. Results

In order to compare V̂(λ̃) with the maximum-likelihood estimate V̂ and the L2 estimate V̂_{L2}, we conducted simulation experiments with k = 2 and several values of n. We thus wish to estimate the parameters V = (p, μ_1, σ_1, μ_2, σ_2) in the mixture of two normal distributions

f(x; V) = p\,\phi(x; \mu_1, \sigma_1^2) + (1 - p)\,\phi(x; \mu_2, \sigma_2^2), \qquad x \in \mathbb{R}. \qquad (4)

Fig. 1. The form of f(x; V) for the 7 simulation experiments (one panel per experiment, showing f(x) against x).

Table 1
Summary characteristics of experiments (population parameters)

Experiment   p     μ_1   σ_1²   μ_2   σ_2²
1            0.5   −2    1      2     1
2            0.5   −1    3      1     1
3            0.5    0    9      3     1
4            0.8   −2    1      2     1
5            0.8   −1    3      1     1
6            0.8   −1    9      1     1
7            0.8    0    1      0     16

We present results for n = 50, but there are broadly comparable conclusions for smaller and larger n. The seven populations sampled are shown in Fig. 1, corresponding to the parameter values of Table 1. We can see therefore that some of the populations sampled provide a difficult challenge for fitting a mixture of normals.
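For concreteness, a minimal sketch (ours; the parameter values are those of Table 1, and the function name is illustrative) of drawing samples of size n = 50 from the two-component mixture (4):

    import numpy as np

    rng = np.random.default_rng(0)

    # (p, mu_1, sigma_1^2, mu_2, sigma_2^2) for Experiments 1-7 (Table 1)
    EXPERIMENTS = {
        1: (0.5, -2.0, 1.0, 2.0, 1.0),
        2: (0.5, -1.0, 3.0, 1.0, 1.0),
        3: (0.5,  0.0, 9.0, 3.0, 1.0),
        4: (0.8, -2.0, 1.0, 2.0, 1.0),
        5: (0.8, -1.0, 3.0, 1.0, 1.0),
        6: (0.8, -1.0, 9.0, 1.0, 1.0),
        7: (0.8,  0.0, 1.0, 0.0, 16.0),
    }

    def simulate_mixture(experiment, n=50):
        # Draw a sample of size n from the two-component normal mixture (4).
        p, mu1, v1, mu2, v2 = EXPERIMENTS[experiment]
        z = rng.random(n) < p                      # latent component indicators
        return np.where(z, rng.normal(mu1, np.sqrt(v1), n),
                           rng.normal(mu2, np.sqrt(v2), n))

    sample = simulate_mixture(1, n=50)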

Fig. 2. Histograms showing the number of alternative solutions when the ISE, maximum-likelihood and L2 iterative methods were started from random starting values.

The method-of-moments estimator Ṽ is known to perform relatively poorly for fitting normal mixtures (Everitt and Hand, 1981, p. 45), and it is also evaluated later. Computationally, Ṽ involves the solution of a nonic equation; however, the method of moments is not iterative. By comparison, the other three estimators result from iterative search. For V̂ we used the EM algorithm, while for V̂(λ̃) and V̂_{L2} we used a simplex search with logarithmic transformations for the variances and a logistic transformation for the proportion.

We use simulation to compare estimation performance in three different ways. We initially compared the behaviour of V̂, V̂(λ̃) and V̂_{L2} with regard to the influence of the starting value: for each population, the iterative methods for V̂, V̂(λ̃) and V̂_{L2} were started from the same 10 random starts, over the domain 0 < p < 1, X_(1) < μ_1, μ_2 < X_(n), 0 < σ_1, σ_2 < (X_(n) − X_(1))/4, where X_(i) denotes the ith order statistic. All three methods compared sometimes resulted in multiple solutions, and we can appreciate the superior performance of V̂(λ̃) from Fig. 2. Our findings confirm the statement in Scott (2001) that the L2 method needs several random alternative starts in order to try to avoid local optima. A sketch of this kind of search is given below.
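The following minimal Python sketch is ours, not the authors' code: it assumes the ise_criterion function sketched in Section 2 and the simulated sample above, and it substitutes a simple normal-reference bandwidth for the smoothed cross-validation selector λ̃, which is not part of standard scientific Python libraries.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit  # logistic transform for the mixing proportion

    def unpack(theta):
        # Map unconstrained parameters to (p, mu, sigma2) for a two-component mixture.
        p = expit(theta[0])                      # logistic transformation for the proportion
        mu = theta[1:3]
        sigma2 = np.exp(theta[3:5]) ** 2         # logarithmic transformation for the sds
        return np.array([p, 1.0 - p]), mu, sigma2

    def fit_ise(x, n_starts=10, seed=0):
        rng = np.random.default_rng(seed)
        # Normal-reference bandwidth as a stand-in for the smoothed cross-validation selector.
        lam = 1.06 * x.std() * len(x) ** (-1 / 5)
        x1, xn = x.min(), x.max()
        best = None
        for _ in range(n_starts):
            # Random start over the domain described in the text.
            p0 = rng.uniform(0.01, 0.99)
            mu0 = rng.uniform(x1, xn, size=2)
            s0 = rng.uniform(1e-3, (xn - x1) / 4, size=2)
            theta0 = np.concatenate(([np.log(p0 / (1.0 - p0))], mu0, np.log(s0)))
            res = minimize(lambda th: ise_criterion(unpack(th), x, lam),
                           theta0, method="Nelder-Mead")
            if best is None or res.fun < best.fun:
                best = res
        return unpack(best.x), best.fun

    estimate, criterion_value = fit_ise(sample)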


Table 2
Number of failures, and instances of small/zero estimates of standard deviation, for experiments 1–7 when true parameter values were used to start the iterative searches

             MM                   ML                   ISE                  L2
Experiment   Failures  σ̂_j < 0.1  Failures  σ̂_j < 0.1  Failures  σ̂_j ≈ 0.0  Failures  σ̂_j ≤ 0.1
1             2        0           0        0           0         0           0         0
2            28        1           0        4           0         8           0        18
3            21        0           0        0           0         0           0        12
4             7        0           0        0           0         0           0         5
5            31        1           4        3           0        16           0        20
6            36        0           1        4           0        12           0        17
7            11        0           4        2           0         5           0        14

MM denotes method of moments and ML denotes maximum likelihood. In each experiment, 100 simulations were undertaken. We write σ̂_j to indicate one of the estimated standard deviations.

We next tested the methods by starting the iterations from the parameter values used in the simulations. The results are shown in Table 2. For each experiment there were 100 simulations. The iterative algorithms are said to fail if the iterations diverge. The method of moments is very unreliable, as expected. It requires a negative root of a ninth-degree polynomial equation (Besbeas, 1999, p. 81), and can fail either by this equation having no negative roots, or by the parameter estimates being out of range. It performs best for the two bimodal pdfs of experiments 1 and 4. The L2 and ISE methods have no failures, and the maximum-likelihood method has 9. Additionally, the maximum-likelihood method converged to a singularity (with an estimated standard deviation < 0.1) 13 times. The only defect of the ISE method is that 41 times it essentially estimated a standard deviation as zero, with a non-zero associated probability. This was due to the kernel-density bandwidth procedure producing too large a value for λ in these cases. In each such case, simply reducing λ, by replacing λ by λ/2, removed this anomaly. The L2 method quite frequently produces a solution with a small, or near-zero, estimate for one of the variances. This is a minor problem for experiments 1 and 4. When successful, the method of moments produced a small estimate of a variance only twice.

In Fig. 3 we provide a graphical comparison of the performance of the four estimators, in terms of bias and standard deviation. These are calculated only for the simulated samples which did not result in failures for the method-of-moments or maximum-likelihood methods. For each experiment there are five parameters, estimated four ways. We can see that there is a tendency for the symbols denoting the four methods of estimation to group together, corresponding to the estimation of a single parameter in each case. On the basis of these results, we can see that for each experiment the worst values of bias and of standard deviation correspond to either the L2 method or the method of moments. The maximum-likelihood and ISE methods have similar performance, but the ISE method is better overall in terms of producing fewer instances of worst values of bias and standard deviation.

Fig. 3. Performance of the four estimators, in terms of bias and standard deviation, for experiments 1–7, when the true parameter values were used to start the iterations. Different plotting symbols (◦ for MM, + for ML, with further symbols for ISE and L2) distinguish the four methods.


Table 3
Relative performance of the four methods when simulated annealing is used to select starting values for deterministic searches for methods ML, ISE and L2

Method   Rank 1   Rank 2   Rank 3   Rank 4
MM        9        1       12       13
ML        6       15       10        4
ISE      16       12        7        0
L2        4        7        6       18

Note that in a small number of cases, there were tied ranks. The ties were broken using the values of the mean bias.

Finally, for the three iterative methods, ML, ISE and L2, we used a hybrid algorithm with an initial simulated annealing approach, which produced suitable starting values for the standard deterministic iterations, as suggested by Brooks and Morgan (1994, 1995). This is an approach which is useful in practice when surfaces possess local optima. We used a stringent annealing schedule which resulted in only very small numbers of alternative starts for the deterministic search methods. To compare methods in terms of mean squared error, for example, is not appropriate here, due to the presence of a few outliers among the ML estimates (cf., for example, Leytham, 1984). Consequently, we adopted the approach of Woodward et al. (1984): for each method/parameter/experiment combination we compared estimates with the known values used in the simulations, when all produced solutions. Then, for example, a method is given rank 1 if its estimate was most often closest, over all the relevant simulations, to the value used in the simulation, and so on. The results are given in Table 3. The L2 method is worst overall. We see also the poor performance of the method of moments, and that ISE is better overall than ML. Detailed study of particular examples reveals that better ISE performance compared to maximum likelihood occurs when the maximum-likelihood estimation responds to an outlying subset of data by describing it through one component of the mixture, with small probability; in these cases, the more robust ISE method usually does not produce such a component.

4. Conclusion

Although the conclusions of Section 3 are necessarily limited by the sample sizes and populations selected in the study, we believe that the ISE method will also perform as well more generally, especially when k > 2. Unlike the other transform-based methods mentioned in Section 1, the ISE method is simple to apply. If on occasion it is necessary to reduce the bandwidth λ, then that is easily done. The surface being minimised does not suffer from singularities, it is relatively easy to locate the global minimum of the surface, the ISE method performs well in terms of bias and standard error, and the method is more robust than both maximum likelihood and L2. In this paper, we have


only considered univariate mixtures. However, the expression for I(V; λ) in Eq. (3) generalises simply to the case of a mixture of k multivariate normal distributions (Besbeas, 1999), and so the ISE method may also be used in such cases. We suspect that this is where the real strength of the ISE method lies, involving multivariate kernel density bandwidth selection (Besbeas, 1999). The simplicity of the ISE method in the paper is due to the explicit expression for I(V; λ) in Eq. (3). An explicit form also holds for mixtures of exponential distributions, from setting W(t; λ) = \int_{-\infty}^{t} (1 + \lambda^2 y^2)^{-1} \, \mathrm{d}y (see Besbeas, 1999), and for mixtures of Cauchy distributions, using W(t; λ) = \int_{-\infty}^{t} \mathrm{e}^{-\lambda|y|} \, \mathrm{d}y (see Besbeas and Morgan, 2001). However, it may not always be possible to obtain an explicit form for I(V; λ).

Acknowledgements

The work of Panagiotis Besbeas was supported by an EPSRC research studentship and also by the BBSRC/EPSRC research grant 96/E09745. We thank the referees for their constructive suggestions.

References

Aitchison, J., Aitken, C.G.G., 1976. Multivariate binary discrimination by the kernel method. Biometrika 63, 413–421.
Besbeas, P., 1999. Parameter estimation based on empirical transforms. Ph.D. Thesis, University of Kent, UK.
Besbeas, P., Morgan, B.J.T., 2001. Integrated squared error estimation of Cauchy parameters. Statist. Probab. Lett. 55, 397–401.
Brooks, S.P., Morgan, B.J.T., 1994. Automatic starting point selection for function optimization. Statist. Comput. 4, 173–177.
Brooks, S.P., Morgan, B.J.T., 1995. Optimization using simulated annealing. The Statistician 44, 241–257.
Bryant, J.L., Paulson, A.S., 1983. Estimation of mixing proportions via distance between characteristic functions. Comm. Statist. Theory Methods 12, 1009–1029.
Campbell, E.P., 1993. Influence for empirical transforms. Comm. Statist. Theory Methods 22, 2491–2502.
Everitt, B.S., Hand, D.J., 1981. Finite Mixture Distributions. Chapman & Hall, London.
Heathcote, C.R., 1977. The integrated squared error estimation of parameters. Biometrika 64, 255–264.
Kumar, K.D., Nicklin, E.H., Paulson, A.S., 1979. Comment on "Estimating mixtures of normal distributions and switching regressions". J. Amer. Statist. Assoc. 74, 52–55.
Leytham, K.M., 1984. Maximum likelihood estimates for the parameters of mixture distributions. Water Resour. Res. 20, 896–902.
Quandt, R.E., Ramsey, J.B., 1978. Estimating mixtures of normal distributions and switching regressions. J. Amer. Statist. Assoc. 73, 730–738.
Scott, D.W., 2001. Parametric statistical modeling by minimum integrated square error. Technometrics 43, 274–285.
Scott, D.W., Szewczyk, W.F., 2001. From kernels to mixtures. Technometrics 43, 323–335.
Thornton, J.C., Paulson, A.S., 1977. Asymptotic distribution of characteristic function-based estimators for the stable laws. Sankhyā Ser. A 39, 341–354.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.
Woodward, W.A., Parr, W.C., Schucany, W.R., Lindsey, H., 1984. A comparison of minimum distance and maximum likelihood estimation of a mixture proportion. J. Amer. Statist. Assoc. 79, 590–598.