ARTICLE IN PRESS
Statistics & Probability Letters 77 (2007) 104–110 www.elsevier.com/locate/stapro
A penalized version of the empirical likelihood ratio for the population mean

Francesco Bartolucci
Dipartimento di Economia, Finanza e Statistica, Università di Perugia, 06123 Perugia, Italy

Received 29 June 2005; received in revised form 9 May 2006; accepted 23 May 2006
Available online 7 July 2006
Abstract

A penalized version of the empirical likelihood ratio test statistic for the population mean is proposed which may be computed even when this parameter does not belong to the convex hull of the data. Some theoretical results on the proposed test statistic are provided, together with some guidelines on the choice of the penalization term. A double bootstrap procedure is also described which may be used for calibration when the sample size is small. The approach is illustrated by an example and a small simulation study.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Constrained maximization; Convex hull condition; Double bootstrap; Non-parametric inference
1. Introduction

An effective non-parametric method for making inference on the mean μ of a p-variate distribution is the empirical likelihood (EL) method introduced by Owen (1988); see also Owen (1990, 2001). It is based on the non-parametric likelihood

  L(μ) = max_{p ∈ S_μ(X)} ∏_i p_i,                                        (1)

where X = (x_1, …, x_n) denotes the sample and S_μ(X) denotes the subset of the n-dimensional simplex S containing all the vectors p = (p_1, …, p_n) such that Σ_i p_i x_i = μ. L(μ) is in practice the profile likelihood for μ of a multinomial model that places a probability mass p_i on each data point x_i. This function reaches its maximum value, n⁻ⁿ, only at μ = x̄, with x̄ denoting the sample mean. For making inference on μ, we can therefore use the empirical likelihood ratio (ELR) test statistic

  R(μ) = L(μ)/L(x̄) = max_{p ∈ S_μ(X)} ∏_i n p_i.
Fax: +39 075 5855950.
E-mail address: [email protected].
0167-7152/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.spl.2006.05.016
A very interesting result is that, under mild conditions, −2 log{R(μ₀)}, with μ₀ denoting the true population mean, has an asymptotic χ²(p) distribution regardless of the distribution of the population (Owen, 1988). When the sample size is large enough, this extension of Wilks (1938)'s theorem allows us to compute p-values and cutoff points for R(μ) very easily. For small samples, instead, a bootstrap calibration is required (Owen, 2001, Section 3.3); it is based on the computation of R(x̄) for a suitable number of samples drawn with replacement from the observed sample.

A problem that may limit the applicability of the EL approach is that L(μ), and thus R(μ), may not be computed when μ does not belong to the convex hull of the data, H(X), defined as the subset of R^p containing all the p-dimensional vectors that may be expressed as Σ_i p_i x_i, p ∈ S. The convention commonly followed in this case is to set R(μ) = 0 (Owen, 2001, Section 3.14). In this way, however, the actual significance level of the test for H₀: μ = μ₀ versus H₁: μ ≠ μ₀ is at least equal to the probability that μ₀ ∉ H(X), and so it may be much higher than the nominal level α. Similarly, the actual coverage of the confidence region for μ may be much smaller than the nominal level 1 − α. Obviously, the probability that μ₀ ∉ H(X) is significant only when n is small with respect to p; the greater the skewness of the population, the larger this probability. The problem may also affect bootstrap calibration, since there may exist bootstrap samples for which R(x̄) may not be computed because their convex hull does not contain x̄. If the proportion of these samples is larger than α, this calibration is not feasible (see Owen, 2001, Section 3.3).

To cope with the problem above, Owen (2001, Section 10.4) suggested using the EL-t method (Baggerly, 1999), in which the constraint Σ_i p_i x_i = μ is replaced by a studentized version.
Alternatively, we can use modified versions of EL that, like the Euclidean likelihood, are based on the maximization of functions of p that may be computed even when some p_i are negative (Baggerly, 1998; Owen, 2001, Sections 3.15 and 3.16). These solutions, however, are not considered completely satisfactory, since they may lead to inferential conclusions different from those of EL when the latter may be applied without problems.

A different approach to dealing with the convex hull condition is proposed in this paper. In this approach, the constrained maximization problem involved in the computation of the ELR is turned into an unconstrained maximization problem in which the multinomial likelihood in (1) is multiplied by a suitable penalization term. This term measures the discrepancy between Σ_i p_i x_i and μ, and depends on a positive scalar h in a way that forces the solution towards the EL solution as h approaches 0. The resulting likelihood ratio, denoted by R†(μ,h), may be computed for any μ and is always positive. Moreover, if h is small enough and μ ∈ H(X), our approach leads to the same inferential conclusion on μ as the EL approach. If, instead, μ ∉ H(X), R†(μ,h) is still a measure of the evidence provided by the data in favor of μ, for which bootstrap calibration is required.

The paper is organized as follows. The convex hull condition is illustrated, by using a simulated data set, in the remainder of this section. The proposed approach is described in Section 2, which also deals with the numerical computation of log{R†(μ,h)} and its theoretical properties. The use of this statistic for making inference on μ is outlined in Section 3, where some guidelines for choosing h and bootstrap calibration are also illustrated. Finally, Section 4 proposes an empirical illustration based on the data set introduced below and on a small simulation study.
1.1. An example

The following sample consists of n = 15 observations generated from a vector of four independent random variables, each with χ²(1) distribution (each column is one observation):

  0.65  0.00  0.75  0.07  0.64  0.20  0.01  0.08  0.21  0.40  0.27  0.09  1.28  0.33  0.08
  2.43  0.15  4.65  1.43  0.04  0.00  0.23  0.01  0.44  0.79  0.15  5.66  0.24  1.21  0.84
  0.23  1.31  0.00  1.45  0.65  0.00  0.12  0.00  0.65  0.40  4.22  0.00  0.22  1.16  0.00
  0.07  0.82  0.12  8.27  0.18  1.46  1.04  0.05  0.04  0.03  1.60  1.17  0.01  0.21  0.99
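Whether a given μ belongs to H(X) can be checked with a small linear feasibility problem: does some p in the simplex satisfy Σ_i p_i x_i = μ? A minimal sketch of this check (assuming NumPy and SciPy are available; the data matrix is the sample printed above, one observation per row, and the function name is mine):

```python
import numpy as np
from scipy.optimize import linprog

# The n = 15 observations (rows) of the chi^2(1) sample printed above.
X = np.array([
    [0.65, 2.43, 0.23, 0.07], [0.00, 0.15, 1.31, 0.82],
    [0.75, 4.65, 0.00, 0.12], [0.07, 1.43, 1.45, 8.27],
    [0.64, 0.04, 0.65, 0.18], [0.20, 0.00, 0.00, 1.46],
    [0.01, 0.23, 0.12, 1.04], [0.08, 0.01, 0.00, 0.05],
    [0.21, 0.44, 0.65, 0.04], [0.40, 0.79, 0.40, 0.03],
    [0.27, 0.15, 4.22, 1.60], [0.09, 5.66, 0.00, 1.17],
    [1.28, 0.24, 0.22, 0.01], [0.33, 1.21, 1.16, 0.21],
    [0.08, 0.84, 0.00, 0.99],
])

def in_convex_hull(X, mu):
    """Feasibility LP: is there p >= 0 with sum_i p_i = 1 and sum_i p_i x_i = mu?"""
    n, p = X.shape
    A_eq = np.vstack([X.T, np.ones(n)])        # moment constraints plus normalization
    b_eq = np.concatenate([mu, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

print(in_convex_hull(X, X.mean(axis=0)))   # the sample mean always lies in H(X)
print(in_convex_hull(X, np.ones(4)))       # mu_0 = 1_4 does not, for this sample
```

The LP objective is identically zero because only feasibility matters; any vertex of the feasible region, if one exists, certifies μ ∈ H(X).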
Even though the true population mean is μ₀ = 1₄, where 1_k denotes a column vector of k ones, R(μ₀) cannot be computed, since μ₀ does not belong to the convex hull of the data: there does not exist any convex linear combination of the 15 column vectors above which is equal to μ₀. Following the general convention
according to which we set R(μ) = 0 when μ ∉ H(X), we have to reject H₀: μ = μ₀. In this way, however, the actual significance level of the test is at least equal to the probability that μ₀ ∉ H(X) which, by simulation, we estimated to be equal to 0.128. Bootstrap calibration is also infeasible for the sample above. This is due to the fact that x̄ does not belong to the convex hull of about 30% of the bootstrap samples; we cannot therefore construct a confidence region for μ with a coverage level larger than 0.7. A similar problem occurs for most of the samples generated from the χ²(1) population at issue. In particular, we observed that for only two of the 1000 samples generated from this population it is possible to perform a bootstrap calibration when α is equal to 0.05, and for only 44 samples when α is equal to 0.10.

2. Penalized EL

The proposed approach is based on the following penalized version of the EL:

  L†(μ,h) = max_{p ∈ S} (∏_i p_i) e^{−n d(ν−μ)/(2h²)},

where ν = Σ_i p_i x_i and d(ν−μ) is a distance measure between μ and ν. In particular, we will consider
  d(ν−μ) = (ν−μ)′ V⁻¹ (ν−μ),

where V is the sample variance–covariance matrix; when V is singular, V⁻¹ is substituted by E diag(v)⁻¹ E′, where v is the vector of the positive eigenvalues of V and E is the matrix of the corresponding eigenvectors. An interesting property of L†(μ,h) is that, for any h > 0, the maximum value of this function is attained only at μ = x̄ and this maximum is equal to n⁻ⁿ. So, for making inference on μ, we can use the ratio

  R†(μ,h) = L†(μ,h)/L†(x̄,h) = max_{p ∈ S} (∏_i n p_i) e^{−n d(ν−μ)/(2h²)},

which is similar in spirit to R(μ). The main difference between R(μ) and R†(μ,h) is that the latter escapes the convex hull condition and so may be computed for any value of μ. In practice, however, we will make use of

  r†(μ,h) = log{R†(μ,h)} = max_{p ∈ S} [ Σ_i log(n p_i) − (n/(2h²)) (ν−μ)′ V⁻¹ (ν−μ) ],

which is simpler to compute.

2.1. Computing r†(μ,h)

First of all, note that r†(μ,h) may be expressed in the equivalent form

  r†(μ,h) = max_{ν ∈ H(X)} g(μ,ν,h),

where

  g(μ,ν,h) = r(ν) − (n/(2h²)) (ν−μ)′ V⁻¹ (ν−μ)                            (2)

and r(ν) = log{R(ν)}. Owen (2001, Section 3.14) showed that r(ν) = Σ_i log{n p̂_i(ν)}, where

  p̂_i(ν) = (1/n) · 1/(1 + λ̂(ν)′(x_i − ν)),    i = 1, …, n,

and λ̂(ν) is implicitly defined as the solution of the equation t(λ,ν) = 0, with

  t(λ,ν) = Σ_i (x_i − ν)/(1 + λ′(x_i − ν)).
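These quantities translate directly into code. The sketch below is an illustrative implementation (not the author's original code, and the function name is mine) that solves t(λ,ν) = 0 by a damped Newton iteration and returns r(ν) for a point ν strictly inside H(X); it assumes NumPy:

```python
import numpy as np

def el_log_ratio(X, nu, n_iter=50, tol=1e-10):
    """Return (r(nu), lambda_hat) for nu strictly inside the convex hull of the rows of X.

    lambda_hat solves t(lambda, nu) = sum_i (x_i - nu)/(1 + lambda'(x_i - nu)) = 0,
    and r(nu) = sum_i log(n * p_hat_i(nu)) = -sum_i log(1 + lambda'(x_i - nu)).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, p = X.shape
    Z = X - nu                                      # rows are x_i - nu
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + Z @ lam
        t = (Z / denom[:, None]).sum(axis=0)        # t(lambda, nu)
        if np.linalg.norm(t) < tol:
            break
        t_lam = -(Z / denom[:, None] ** 2).T @ Z    # Jacobian of t with respect to lambda
        step = np.linalg.solve(t_lam, -t)           # Newton step
        # damp the step so that every 1 + lambda'(x_i - nu) stays positive
        while np.any(1.0 + Z @ (lam + step) <= 0):
            step *= 0.5
        lam = lam + step
    r = -np.log(1.0 + Z @ lam).sum()
    return r, lam
```

At ν = x̄ the solution is λ̂ = 0 and r(x̄) = 0; for other interior points the implied weights p̂_i automatically sum to one once t(λ̂,ν) = 0, since Σ_i 1/(1 + λ̂′(x_i − ν)) = n.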
Therefore, for given μ and h, r†(μ,h) may be computed by maximizing g(μ,ν,h) with respect to ν. This may be performed through a Newton–Raphson algorithm that consists of updating, until convergence, the vector ν by adding to its current value ν₀ the vector −g_νν(μ,ν₀,h)⁻¹ g_ν(μ,ν₀,h), where g_ν(μ,ν,h) and g_νν(μ,ν,h) denote, respectively, the first-derivative vector and the second-derivative matrix of g(μ,ν,h) with respect to ν. These are equal to

  g_ν(μ,ν,h) = r_ν(ν) − (n/h²) V⁻¹ (ν−μ),
  g_νν(μ,ν,h) = r_νν(ν) − (n/h²) V⁻¹,

where r_ν(ν) = n λ̂(ν) and, according to the implicit derivative rule,

  r_νν(ν) = −n t_λ{λ̂(ν),ν}⁻¹ t_ν{λ̂(ν),ν}.

After some algebra, we can easily see that

  t_ν{λ̂(ν),ν} = n² Σ_i p̂_i(ν)² (x_i − ν) λ̂(ν)′ − n I_p,
  t_λ{λ̂(ν),ν} = −n² Σ_i p̂_i(ν)² (x_i − ν)(x_i − ν)′,

where I_p is the identity matrix of dimension p.

The point that maximizes g(μ,ν,h) will be denoted by ν̂(μ,h), or simply by ν̂ when its arguments are clear from the context. Note that, since g(μ,ν,h) is the sum of two functions concave in ν, this solution is unique and the starting value of the algorithm above is not crucial. However, we suggest using ν = μ as the starting value when μ ∈ H(X); we observed that in this way the algorithm converges very quickly: three or four iterations are usually enough. When instead μ ∉ H(X), some other starting point, such as x̄, may be taken. In these circumstances, however, the algorithm is slower, since ν̂ tends to be very close to the boundary of H(X). As will be explained below, this requires a careful choice of h.

2.2. Properties of r†(μ,h)

In the following, some results concerning r†(μ,h) are proved. These results clarify the strong relation between this test statistic and r(μ).

Proposition 1. For any sample X and any h > 0, r†(μ,h) is a concave function of μ and reaches its maximum, equal to 0, only at μ = x̄.

Proof. Consider two points μ⁽¹⁾ and μ⁽²⁾ in R^p and the corresponding solutions in terms of ν of r†(μ,h), ν̂⁽¹⁾ and ν̂⁽²⁾. Consider also a third point in R^p, μ̄ = aμ⁽¹⁾ + (1−a)μ⁽²⁾, 0 < a < 1, and let ν̄ be defined in a similar way. We obviously have that r†(μ̄,h) = g(μ̄,ν̂,h) ≥ g(μ̄,ν̄,h) which, in turn, is greater than a r†(μ⁽¹⁾,h) + (1−a) r†(μ⁽²⁾,h) for any 0 < a < 1, and so the concavity of r†(μ,h) follows. This is because ν̄ − μ̄ may be expressed as a(ν̂⁽¹⁾ − μ⁽¹⁾) + (1−a)(ν̂⁽²⁾ − μ⁽²⁾) and because of the concavity of both summands at the right-hand side of (2). For any h > 0, the maximum of r†(μ,h) is reached only at μ = x̄ and r†(x̄,h) = 0, since L†(μ,h) reaches its maximum only at μ = x̄ and L†(x̄,h) = n⁻ⁿ. □
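As a concrete illustration of the computation in Section 2.1, the sketch below evaluates r†(μ,h) in one dimension. For simplicity it replaces the Newton–Raphson scheme with SciPy's bounded scalar optimizer over ν ∈ H(X), a simplification justified by the concavity of g in ν; function names are mine and the implementation is illustrative, not the paper's:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def el_log_ratio_1d(x, nu, n_iter=100):
    """r(nu) = -sum_i log(1 + lam*(x_i - nu)), with lam solving t(lam, nu) = 0 (1-D Newton)."""
    z = x - nu
    lam = 0.0
    for _ in range(n_iter):
        d = 1.0 + lam * z
        t = np.sum(z / d)
        if abs(t) < 1e-12:
            break
        step = t / np.sum(z**2 / d**2)     # Newton step: -t/t_lam with t_lam = -sum z^2/d^2
        while np.any(1.0 + (lam + step) * z <= 0):
            step *= 0.5                    # damping keeps all 1 + lam*z_i positive
        lam += step
    return -np.sum(np.log1p(lam * z))

def pen_el_log_ratio(x, mu, h):
    """r_dagger(mu, h) = max over nu of [ r(nu) - n*(nu - mu)^2/(2 h^2 v) ], v = sample variance."""
    n, v = x.size, x.var(ddof=1)
    lo, hi = x.min(), x.max()
    eps = 1e-6 * (hi - lo)                 # keep nu strictly inside the hull (min, max)
    def neg_g(nu):
        return -(el_log_ratio_1d(x, nu) - n * (nu - mu)**2 / (2 * h**2 * v))
    res = minimize_scalar(neg_g, bounds=(lo + eps, hi - eps), method="bounded")
    return -res.fun
```

Unlike r(μ), this quantity stays finite for μ outside the hull: the maximizer ν̂ simply moves towards the hull boundary and the penalty accounts for the remaining gap.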
The proof of the following proposition is based on the first-step solution of the Newton–Raphson algorithm described in Section 2.1,

  ν̃(μ,h) = μ + [ t_λ{λ̂(μ),μ}⁻¹ t_ν{λ̂(μ),μ} + (1/h²) V⁻¹ ]⁻¹ λ̂(μ),         (3)

which will be denoted by ν̃ when the arguments μ and h are clear from the context.
Proposition 2. For any sample X and any μ, r†(μ,h) is an increasing function of h. Moreover, if μ ∈ H(X), then r†(μ,h) ≥ r(μ) for any h > 0 and ‖ν̂ − μ‖ = O(h²), so that lim_{h→0} r†(μ,h) = r(μ). If instead μ ∉ H(X), ‖ν̂ − μ‖ is bounded away from 0 for any h > 0 and so lim_{h→0} r†(μ,h) = −∞.

Proof. The first derivative of r†(μ,h) with respect to h,

  r†_h(μ,h) = (n/h³) (ν̂ − μ)′ V⁻¹ (ν̂ − μ),

is always non-negative and so r†(μ,h) is increasing in h. To prove that, when μ ∈ H(X), r†(μ,h) ≥ r(μ) for any h > 0, simply consider that r†(μ,h) = g(μ,ν̂,h) ≥ g(μ,μ,h) = r(μ). To prove that, in the same situation, ‖ν̂ − μ‖ = O(h²), it is enough to consider that ‖ν̃ − μ‖ = O(h²), where ν̃ is the first-step solution defined in (3); by substitution, lim_{h→0} r†(μ,h) = r(μ). Finally, since ν̂ ∈ H(X) for any μ and h > 0, ‖ν̂ − μ‖ is bounded away from 0 when μ ∉ H(X) and so the penalization term grows indefinitely as h approaches 0. □

3. Inference on the population mean

On the basis of the penalized ELR introduced in the previous section, we can test a hypothesis of the type H₀: μ = μ₀ versus H₁: μ ≠ μ₀ and construct confidence regions for μ. In particular, the critical region of a test of this type is

  {X: r†(μ₀,h) < r_α(h)},                                                 (4)

where r_α(h) is a suitable cutoff point such that the probability that a random sample belongs to this region under H₀ is equal to the significance level α. A confidence region at the level 1 − α for μ may be expressed as

  {μ: r†(μ,h) ≥ r_α(h)}.                                                  (5)

Convexity of such a region is ensured by Proposition 1. Obviously, a crucial issue for this type of inference is the choice of h and r_α(h).

3.1. Choice of h

The criterion that we follow to choose h is based on the curvature of the function r†(μ,h) at μ = x̄, which may be measured through the absolute determinant of its second derivative, i.e. c(h) = |r†_μμ(x̄,h)|. This is because, provided that r_α(h) does not vary with h, the higher the curvature of r†(μ,h), the more powerful the test with critical region (4) and the smaller the volume of the confidence region (5). In particular, using standard rules on implicit derivatives, we have that

  r†_μμ(x̄,h) = g_μμ(x̄,ν̂,h) − g_μν(x̄,ν̂,h) g_νν(x̄,ν̂,h)⁻¹ g_νμ(x̄,ν̂,h) = −[n/(1+h²)] V⁻¹,

which, in accordance with Proposition 1, is negative definite. So, we have that

  c(h) = [n/(1+h²)]^p / |V|,

whose supremum is c* = n^p/|V|. Not surprisingly, this upper bound is equal to the absolute determinant of r_μμ(x̄) = −n V⁻¹. Since lim_{h→0} c(h) = c*, according to our criterion we should choose a value of h very close to 0. To check whether h is close enough to 0, we suggest considering the ratio c(h)/c* = (1+h²)^(−p); in particular, we can choose the optimal value of h by solving c(h)/c* = 1 − ε/n, where ε is a suitable tolerance level. The solution of this equation is simply

  h = √[(1 − ε/n)^(−1/p) − 1].
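The resulting closed-form rule is immediate to implement; a small sketch (the values of n, p and ε below are illustrative, and the function name is mine):

```python
import math

def choose_h(n, p, eps):
    """Solve c(h)/c* = (1 + h^2)^(-p) = 1 - eps/n for the bandwidth h."""
    return math.sqrt((1.0 - eps / n) ** (-1.0 / p) - 1.0)

# For instance, with n = 15, p = 4 and a tolerance eps = 1e-4
# the rule yields a very small h, as the criterion intends.
h = choose_h(n=15, p=4, eps=1e-4)
```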
Note that, when h is very close to 0 and μ ∉ H(X), we may run into numerical problems in computing r†(μ,h), since ν̂ will be very close to the boundary of H(X) and both r(ν̂) and r†(μ,h) will be very large in absolute value. So, we suggest avoiding too stringent tolerance levels, i.e. ε smaller than 10⁻⁶ n.

3.2. Computing r_α

First of all, consider the following proposition concerning the asymptotic distribution of r†(μ,h_n), where h_n denotes the value of h when the sample size is n. Similarly, we will use V_n to denote the variance–covariance matrix of a sample of size n and ν̂_n to denote ν̂(μ,h_n).

Proposition 3. If the population has mean μ₀ and variance–covariance matrix Σ₀, finite and of full rank, and h_n = O(n^(−1/2)), then

  −2 r†(μ₀,h_n) = −2 r(ν̂_n) + (n/h_n²) (ν̂_n − μ₀)′ V_n⁻¹ (ν̂_n − μ₀)        (6)
has an asymptotic χ²(p) distribution.

Proof. According to Owen (2001, Section 11.2), we have that ‖λ̂(μ₀)‖ = O_p(n^(−1/2)); this implies that ‖ν̃_n − μ₀‖, and so ‖ν̂_n − μ₀‖, is O_p(n^(−3/2)). Then, −2 r(ν̂_n) has an asymptotic χ²(p) distribution (see Owen, 1988, Corollary 1), whereas the second term at the right-hand side of (6) tends in probability to 0, and the result therefore follows. □

According to the above result, when the sample size is large enough and h is close to 0, both −2 r†(μ,h) and −2 r(μ) have approximately a χ²(p) distribution. This is not surprising, because in these circumstances r†(μ,h) is usually very close to r(μ). This result, however, may not be applied when the sample size is small, a situation in which our approach is really advantageous with respect to the usual EL approach and which requires bootstrap calibration. This procedure consists of computing, for a suitable number of bootstrap samples X_1, …, X_B, the statistic r†(x̄,h). Then, following Davison and Hinkley (1997, Section 4.4), the p-value for the observed value of r†(μ,h), r†, may be estimated as

  p = [1 + Σ_b I(r†_b ≤ r†)] / (B + 1),

where I(·) is the indicator function and r†_b is the value of r†(x̄,h) for the bth bootstrap sample. Accordingly, the value of r_α(h) may be found as the value of r† such that p = α. This may be obtained as the b̂th smallest of r†_1, …, r†_B, denoted by r†_(b̂), where b̂ = α(B + 1) − 1.

The bootstrap procedure above may lead to p-values and cutoff points which are consistently biased. An improved procedure is based on double bootstrapping (Davison and Hinkley, 1997, Section 4.5). It consists of drawing, from each bootstrap sample X_b, a suitable number of second-level bootstrap samples X_{b1}, …, X_{bC} and computing, for each of them, the statistic r†(x̄_b,h), where x̄_b is the mean of X_b. Then, the adjusted p-value is computed as

  p_adj = [1 + Σ_b I(p_b ≤ p)] / (B + 1),

where

  p_b = [1 + Σ_c I(r†_{bc} ≤ r†_b)] / (C + 1)                              (7)

is the p-value for r†_b, with r†_{bc} denoting the value of r†(x̄_b,h) for the sample X_{bc}. The adjusted cutoff point r_{adj,α}(h) may be obtained as r†_(b̂_adj), where b̂_adj = p_(b̂)(B + 1) − 1. This procedure is very expensive from the computational point of view, since it requires BC maximizations of the type described in Section 2. Therefore, we substitute r†_b and r†_{bc} in (7) with approximate values obtained by a second-order expansion of r†(μ,h) around μ = x̄.
Table 1
Cutoff points for the penalized ELR for the data in Section 1.1

  α       r_α(h)           r_{adj,α}(h)
  0.01    −3.0231 × 10⁷    −3.0231 × 10⁷
  0.05    −3.2119 × 10⁶    −3.1742 × 10⁶
  0.10    −1.6335 × 10⁶    −1.5587 × 10⁶
  0.25    −23.54           −1.4413 × 10⁶
Table 2
Results of the simulation study

          Bootstrap                     Double bootstrap
  α       â       â_L     â_U          â       â_L     â_U
  0.01    0.003   0.000   0.006        0.019   0.011   0.027
  0.05    0.030   0.019   0.041        0.052   0.038   0.066
  0.10    0.059   0.044   0.074        0.088   0.070   0.106
  0.25    0.211   0.186   0.236        0.245   0.218   0.272

â is the actual significance level of the test of H₀: μ = μ₀ versus H₁: μ ≠ μ₀ and (â_L, â_U) is the corresponding 95% confidence interval.
According to this expansion, we have that

  r†(μ,h) ≈ −[n/(2(1+h²))] (x̄ − μ)′ V⁻¹ (x̄ − μ).
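As an illustration of the calibration described in Section 3.2, the sketch below estimates the single-level bootstrap p-value, computing every statistic through the quadratic approximation above (a shortcut for illustration; the paper applies the approximation only to the second-level statistics). Function names and the simulated data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def r_pen_quad(Xb, mu, h):
    """Quadratic approximation r_dagger(mu,h) ~ -n/(2(1+h^2)) (xbar-mu)' V^{-1} (xbar-mu)."""
    n = Xb.shape[0]
    diff = Xb.mean(axis=0) - mu
    V = np.cov(Xb, rowvar=False)
    return -n / (2.0 * (1.0 + h**2)) * diff @ np.linalg.solve(V, diff)

def bootstrap_pvalue(X, mu0, h, B=199):
    """p = (1 + #{b : r_b <= r_obs}) / (B + 1), with r_b evaluated at mu = xbar of X."""
    n = X.shape[0]
    r_obs = r_pen_quad(X, mu0, h)
    xbar = X.mean(axis=0)
    r_b = np.array([r_pen_quad(X[rng.integers(0, n, n)], xbar, h) for _ in range(B)])
    return (1 + np.sum(r_b <= r_obs)) / (B + 1)
```

With μ₀ equal to the sample mean the observed statistic is 0, its maximum, so the estimated p-value is 1; for a μ₀ far outside the data cloud no resample produces a statistic as extreme, giving the smallest attainable p-value, 1/(B+1).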
This approximation may be computed very quickly.

4. Empirical illustration

To analyze the sample reported in Section 1.1, we used h = 0.002. With this value of h, the penalized ELR for the true value of the population mean, μ₀ = 1₄, is r† = −62,313. With a bootstrap procedure based on B = 199 samples we obtained a p-value for r† equal to p = 0.225, while the adjusted p-value based on the double bootstrap procedure with C = 199 second-level samples is p_adj = 0.265. So, even if μ₀ does not belong to the convex hull of the data, there is not enough evidence against this value of μ.

Using the approach described in Section 3.2, we can also compute the cutoff points r_α(h) and r_{adj,α}(h). These are shown in Table 1 for some values of α. To assess whether these cutoff points are reliable, we carried out a simulation based on 1000 samples drawn from the same population from which the sample considered above was drawn. The results of the simulation are shown in Table 2 where, for some values of α, the actual significance level and the corresponding 95% confidence interval are reported. From this table, it is possible to note that the actual significance level of the test based on r†(μ,h) is close to the nominal level when the double bootstrap procedure is used for calibration.

References

Baggerly, K.A., 1998. Empirical likelihood as a goodness-of-fit measure. Biometrika 85, 535–547.
Baggerly, K.A., 1999. Studentized empirical likelihood and maximum entropy. Technical Report, Department of Statistics, Rice University.
Davison, A.C., Hinkley, D.V., 1997. Bootstrap Methods and their Application. Cambridge University Press, Cambridge.
Owen, A.B., 1988. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249.
Owen, A.B., 1990. Empirical likelihood ratio confidence regions. Ann. Statist. 18, 90–120.
Owen, A.B., 2001. Empirical Likelihood. Chapman & Hall, London.
Wilks, S.S., 1938.
The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 9, 60–62.