The epsilon–skew–normal distribution for analyzing near-normal data

Journal of Statistical Planning and Inference 83 (2000) 291–309
www.elsevier.com/locate/jspi

Govind S. Mudholkar (a), Alan D. Hutson (b, ∗)

(a) Department of Statistics, University of Rochester, Rochester, NY 14627, USA
(b) Division of Biostatistics, Department of Statistics, University of Florida, P.O. Box 100212, Gainesville, FL 32610-0212, USA

Received 16 September 1997; accepted 12 February 1999

Abstract

A family of asymmetric distributions, which first appeared in Fechner (1897, Kollectivmasslehre. Leipzig: Engelmann), is reparameterized using a skewness parameter ε and named the epsilon–skew–normal family. It is denoted by ESN(μ, σ, ε). Its basic properties, such as the relationship between the mean and mode and its higher-order moments, are examined. They are used to obtain simple estimators of the parameters measuring the location μ, the scale σ, and the skewness ε. The maximum likelihood estimates are derived, and it is shown that the estimator of σ is asymptotically independent of those of μ and ε. The estimators reduce properly to the normal case when ε = 0. The ESN(μ, σ, ε) can be used both as a model and as a prior distribution in Bayesian analysis. The posterior distributions in both cases are unimodal, and the modes are available in closed form. © 2000 Published by Elsevier Science B.V. All rights reserved.

Keywords: Bayesian analysis; Binormal; Maximum likelihood estimation; Mode; Skew-normal distribution; Two-piece normal

1. Introduction and summary

The celebrated Gaussian distribution has been known since at least a century before Gauss (1809) popularized it. Its ubiquity continues to be driven by its analytical beauty, its simplicity, the folklore and strength of the associated central limit effect discovered by De Moivre (1733), and the numerous intuitively appealing methods derived assuming it. It appears in probability and stochastic processes, and in the theories and methods of univariate and multivariate, parametric and nonparametric, frequentist and Bayesian statistics. A variety of characterizations and mechanisms have been suggested as its genesis and justification; e.g. see Rao (1973). It has been analyzed deeply, generalized extensively and applied across technologies and disciplines.

Corresponding author.



Yet there have always been doubts, reservations and criticisms about the unqualified use of normality, and resistance to methods resulting from the normality assumptions. This is reflected in the quote attributed by Poincaré to Lippmann; see Stigler (1977): "Everyone believes in the normal law, the experimenters because they imagine it a mathematical theorem, and the mathematicians because they think it an experimental fact." Note also the provocative assertion by Geary (1947): "Normality is a myth; there never was, and never will be, a normal distribution". These remarks are reflected in the plethora of distributional families conceived, studied and used for data modeling, and in the frameworks, such as nonparametric and robust theories and methods, developed as alternatives to normal theory statistics. Some families of near-normal distributions, which include the normal distribution and to some extent share its desirable properties, have played a crucial role in these developments. The better known among these are the classical Gram–Charlier and Edgeworth families of near-normal distributions, and the symmetric extensions such as contaminated normals, the exponential power distributions of Box (1953) and the Tukey lambda family defined in terms of the percentile function; see Freimer et al. (1988). Recent applied research has seen wide use of the inverse Gaussian family as a substitute for the normal family; see Joergensen (1981), and Chhikara and Folks (1989). Other families which contain the normal distributions by strict inclusion or in the limit are Tukey's g and h distributions reviewed by Hoaglin (1985), and the models suggested by Prentice (1976) and Copenhaver and Mielke (1977) for quantal response analysis. The effects of asymmetry on the appropriateness of normal theory methods are in general more serious than those of nonnormal peakedness.
Hence families of asymmetric distributions which are analytically tractable, accommodate practical values of skewness and kurtosis, and strictly include the normal distributions can be useful for data modeling, statistical analysis and robustness studies of normal theory methods. One such family was proposed by O'Hagan and Leonard (1976) for Bayesian analysis of normal means; see also the entry entitled "skew-normal distributions" in the Encyclopedia of Statistical Sciences (Kotz and Johnson, 1988). It is investigated in detail by Azzalini (1985, 1986) and Liseo (1990) and is defined by the density function

f(x) = 2 φ(x) Φ(λx),   (1.1)

where φ(x) and Φ(x) denote the standard normal probability density function and distribution function, respectively, and λ is a skewness parameter. It is studied further by Henze (1986), and used recently by Arnold et al. (1993) in the analysis of screening procedures. As observed by Runnenberg (1978), an alternative skew extension, obtained by splicing two half-normal distributions differing in scale, appears in Fechner (1897) in the context of the mean–median–mode ordering. For independent rediscoveries and applications of this family see Gibbons and Mylroie (1973), John (1982), Kimber (1985), Kimber and Jeynes (1987) and Toth and Szentimrey (1990). It can be re-expressed in terms of an explicit skewness parameter ε. This reparameterized family, denoted by ESN(μ, σ, ε) and named the epsilon–skew–normal family, resembles the normal family in many ways and strictly includes it at ε = 0. For example, its entropy function is of the


same form as that of the normal distribution. Also, the maximum likelihood estimators of the location and scale parameters of the family are asymptotically independent, and they reduce properly when ε is known, and when it is known to be zero. The family is convenient for Bayesian analysis of normal means. It can be used both as a model and as a prior. The posterior distributions in both cases are unimodal, with the modes available in simple closed forms. The purpose of this paper is to introduce the ESN(μ, σ, ε) family, present some known results in streamlined form, derive additional properties, and give some applications. The family and its basic properties are presented in Section 2. Section 3 contains the method of moments estimation of its parameters. The maximum likelihood estimation, together with a score test for symmetry and illustrations using data on 219 of the world's volcanoes, is discussed in Section 4. Section 5 gives some posterior distributions and their modes which are useful in Bayesian applications involving the ESN(μ, σ, ε) family.

2. Basic properties

The epsilon–skew–normal distribution ESN(μ, σ, ε) is a unimodal distribution with mode at μ and probability mass (1 + ε)/2 below the mode. The probability density function, the distribution function, and the quantile function of its canonical form ESN(0, 1, ε) are, respectively,

f₀(x) = (1/√(2π)) exp{−x²/(2(1+ε)²)}   if x < 0,
f₀(x) = (1/√(2π)) exp{−x²/(2(1−ε)²)}   if x ≥ 0,   (2.1)

F₀(x) = (1+ε) Φ(x/(1+ε))       if x < 0,
F₀(x) = ε + (1−ε) Φ(x/(1−ε))   if x ≥ 0,   (2.2)

and

Q₀(u) = F₀⁻¹(u) = (1+ε) Φ⁻¹(u/(1+ε))       if 0 < u < (1+ε)/2,
Q₀(u) = F₀⁻¹(u) = (1−ε) Φ⁻¹((u−ε)/(1−ε))   if (1+ε)/2 ≤ u < 1,   (2.3)

where −1 < ε < 1, and Φ(x) denotes the standard normal d.f.

Remark 2.1. The limiting cases of (2.1) as ε → ±1 are the well-known half-normal distributions. The standard epsilon–skew–normal distribution, ESN(0, 1, ε), is a mixture of two half-normal distributions, and reduces to the standard normal distribution when ε = 0. In other words, the ESN(0, 1, ε) may be viewed as generated by (1 − U)(1 − ε)|N₁| − U(1 + ε)|N₂|, where P(U = 1) = (1 + ε)/2 = 1 − P(U = 0), and U, N₁, N₂ are independent with N₁, N₂ standard normal variates.
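The canonical functions (2.1)–(2.3) and the mixture representation of Remark 2.1 are straightforward to compute. The following is a minimal Python sketch (function names are ours, not from the paper):

```python
import math
import random
from statistics import NormalDist

SQRT2PI = math.sqrt(2.0 * math.pi)

def esn_pdf(x, mu=0.0, sigma=1.0, eps=0.0):
    """Density of ESN(mu, sigma, eps): eq. (2.1) with location/scale."""
    z = (x - mu) / sigma
    s = (1.0 + eps) if z < 0 else (1.0 - eps)
    return math.exp(-z * z / (2.0 * s * s)) / (SQRT2PI * sigma)

def esn_cdf(x, mu=0.0, sigma=1.0, eps=0.0):
    """Distribution function: eq. (2.2)."""
    z = (x - mu) / sigma
    if z < 0:
        return (1.0 + eps) * NormalDist().cdf(z / (1.0 + eps))
    return eps + (1.0 - eps) * NormalDist().cdf(z / (1.0 - eps))

def esn_quantile(u, mu=0.0, sigma=1.0, eps=0.0):
    """Quantile function: eq. (2.3); note Q((1+eps)/2) = mu."""
    if u < (1.0 + eps) / 2.0:
        q = (1.0 + eps) * NormalDist().inv_cdf(u / (1.0 + eps))
    else:
        q = (1.0 - eps) * NormalDist().inv_cdf((u - eps) / (1.0 - eps))
    return mu + sigma * q

def esn_rvs(n, mu=0.0, sigma=1.0, eps=0.0, rng=random):
    """Draws via the half-normal mixture of Remark 2.1."""
    out = []
    for _ in range(n):
        if rng.random() < (1.0 + eps) / 2.0:      # below the mode
            out.append(mu - sigma * (1.0 + eps) * abs(rng.gauss(0.0, 1.0)))
        else:                                      # at or above the mode
            out.append(mu + sigma * (1.0 - eps) * abs(rng.gauss(0.0, 1.0)))
    return out
```

For example, `esn_cdf(0.0, eps=0.3)` returns (1 + 0.3)/2 = 0.65, the mass below the mode.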


The p.d.f. f₀(·) at (2.1) has derivatives of arbitrary orders, and it is differentiable once at the mode. The p.d.f. of ESN(μ, σ, ε) is f₀((x − μ)/σ)/σ, where f₀(·) is given by (2.1). Its quantile function, Q(u) = μ + σQ₀(u), can be used to generate samples from the ESN(μ, σ, ε) population. Note the relationship Q((1 + ε)/2) = μ.

2.1. Mean, median and mode

The mode of the ESN(μ, σ, ε) distribution is μ. It is easy to verify that its mean is

E(X) = μ − 4σε/√(2π)   (2.4)

and that its median is

Q(1/2) = μ + σ(1+ε) Φ⁻¹(1/(2(1+ε)))         if ε > 0,
Q(1/2) = μ + σ(1−ε) Φ⁻¹((1/2 − ε)/(1−ε))    if ε ≤ 0.   (2.5)

2.2. Moments

The moment generating function of the standard epsilon–skew–normal distribution ESN(0, 1, ε) is

M(t) = (1+ε) e^{(1+ε)²t²/2} Φ[−(1+ε)t] + (1−ε) e^{(1−ε)²t²/2} Φ[(1−ε)t].   (2.6)

If X ∼ ESN(μ, σ, ε) then M_X(t) = e^{μt} M(σt). Hence, E(X) is given by (2.4). Also, the variance and the coefficients of skewness and kurtosis of X are given, respectively, by

var(X) = σ² [(3 − 8/π)ε² + 1],   (2.7)

√β₁ = 2√(2/π) ε [(5 − 16/π)ε² − 1] / [(3 − 8/π)ε² + 1]^{3/2},   (2.8)

and

β₂ = [(15 + 16/π − 192/π²)ε⁴ + 10(3 − 8/π)ε² + 3] / [(3 − 8/π)ε² + 1]².   (2.9)

From (2.8) and (2.9) we have the bounds as ε ranges from −1 to +1:

−0.9953 ≤ √β₁ < 0.9953,   (2.10)

3.0000 ≤ β₂ < 3.8692.   (2.11)

It is interesting to note that √β₁ and β₂ of the ESN family have the same range as for the skew-normal distribution introduced by O'Hagan and Leonard (1976) and investigated by Azzalini (1985, 1986), Henze (1986) and Liseo (1990). This may be expected, since the normal and the half-normal distributions are the limiting cases for both families. Fig. 1 gives a selection of the shapes of the probability density functions.
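The closed-form moments (2.4) and (2.7)–(2.9) are easy to evaluate; the sketch below (names ours) also makes it simple to confirm the bounds (2.10) and (2.11) numerically:

```python
import math

def esn_moments(mu, sigma, eps):
    """Mean, variance, sqrt(beta1) and beta2 of ESN(mu, sigma, eps),
    from eqs. (2.4) and (2.7)-(2.9)."""
    pi = math.pi
    d = (3.0 - 8.0 / pi) * eps ** 2 + 1.0
    mean = mu - 4.0 * sigma * eps / math.sqrt(2.0 * pi)
    var = sigma ** 2 * d
    skew = (2.0 * math.sqrt(2.0 / pi) * eps
            * ((5.0 - 16.0 / pi) * eps ** 2 - 1.0) / d ** 1.5)
    kurt = ((15.0 + 16.0 / pi - 192.0 / pi ** 2) * eps ** 4
            + 10.0 * (3.0 - 8.0 / pi) * eps ** 2 + 3.0) / d ** 2
    return mean, var, skew, kurt
```

At ε = 0 this reduces to the normal values (mean μ, variance σ², √β₁ = 0, β₂ = 3), and scanning ε over (−1, 1) recovers the stated extremes ±0.9953 and 3.8692.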


Fig. 1. Some typical ESN probability density functions.

2.3. Entropies

The entropy is a basic structural concept in probability theory; see Csiszar (1975). It can characterize probability distributions, e.g. Kagan et al. (1973), and can be used to test goodness of fit; see Vasicek (1976), Mudholkar and Lin (1984), and Mudholkar et al. (1993). The entropy of the members of the ESN(μ, σ, ε) family is independent of ε. Specifically, the entropy H(f) of the ESN(μ, σ, ε) is given by

H(f) = −∫_{−∞}^{∞} f(x) log f(x) dx = 1/2 + log(σ√(2π)).   (2.12)

Interestingly, a similar property also holds for Renyi's (1961) generalized entropies H_α(f). For the ESN(μ, σ, ε) distribution we have

H_α(f) = (1/(1−α)) log ∫_{−∞}^{∞} f^α(x) dx = (1/2) log(2πσ²) + (log α)/(2(α − 1)),   (2.13)

for all ε, −1 < ε < 1; that is, H_α(f) coincides with the Renyi entropy of N(μ, σ²). It may be noted that H_α(f) was examined earlier by Yule (1938) in the context of frequency moments.
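That the entropy (2.12) does not depend on ε can be checked by direct numerical integration; a sketch (names ours, trapezoid rule over a truncated range):

```python
import math

def esn_entropy_numeric(sigma, eps, n=200000, lim=12.0):
    """Trapezoid-rule value of -integral of f*log(f) for ESN(0, sigma, eps)."""
    def pdf(x):
        s = sigma * ((1.0 + eps) if x < 0 else (1.0 - eps))
        return math.exp(-x * x / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * sigma)

    a, b = -lim * sigma, lim * sigma
    h = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        f = pdf(a + i * h)
        if f > 0.0:
            w = 0.5 if i in (0, n) else 1.0     # trapezoid end weights
            total -= w * f * math.log(f)
    return total * h
```

For any ε the result agrees with the closed form 1/2 + log(σ√(2π)) of (2.12).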


2.4. Sampling distributions

Let X₁, X₂, …, Xₙ be i.i.d. ESN(μ, σ, ε). Then, for large n, the sample mean X̄ is approximately normal. For small samples the approximation may be improved using the coefficients of skewness and kurtosis given by (2.8) and (2.9). It may be noted that in this case X̄ is not a reasonable estimator of the location. Now, if X ∼ ESN(μ, σ, ε) then from (2.6) the moment generating function of Y = (X − μ)²/σ² is

M_Y(t) = (1+ε) [1 − 2(1+ε)²t]^{−1/2} / 2 + (1−ε) [1 − 2(1−ε)²t]^{−1/2} / 2.   (2.14)

So Y is distributed as a mixture of two gamma random variables. From (2.14) we see that, for a random sample of size n from an ESN(μ, σ, ε) population, the moment generating function of U = Σ_{i=1}^{n} (Xᵢ − μ)²/σ² is given by

M_U(t) = Σ_{r=0}^{n} C(n, r) ((1+ε)/2)^r ((1−ε)/2)^{n−r} [1 − 2(1+ε)²t]^{−r/2} [1 − 2(1−ε)²t]^{−(n−r)/2},   (2.15)

which is the m.g.f. of a binomial mixture of sums of independent gamma random variables.

3. Method of moments estimation

Let X₁, X₂, …, Xₙ be a random sample of size n from an ESN(μ, σ, ε) population defined in Section 2. Then relations (2.4) and (2.7) are equivalent to

μ = E(X) + 4σε/√(2π)   (3.1)

and

σ² = var(X) / [(3 − 8/π)ε² + 1].   (3.2)

An expression for ε in terms of √β₁ is analytically intractable. However, one may fit a polynomial through pairs of (ε, √β₁) values from (2.8) and obtain the approximate inverse relation

ε ≈ −0.5835 √β₁ − 0.5861 (√β₁)³ + 1.0763 (√β₁)⁵ − 0.9226 (√β₁)⁷,   (3.3)

which gives an estimate of ε, denoted ε̂_mm, in terms of the sample coefficient of skewness √b₁. However, an empirical examination shows that, if √b₁ < −0.995 or √b₁ > 0.995, i.e. outside the limits given by (2.10), then (3.3) can yield unreasonable estimates of ε. For n = 25 this happens in about 25% of the cases, and in about 1% of the cases if n = 100. The suffix "mm" in this section connotes "method of moments estimator". It may be noted that the sample mean X̄ is not a reasonable estimator of the location parameter μ.


Case of ε known: If ε is known then, for a random sample of size n from an ESN(μ, σ, ε) population, (3.1) and (3.2) give the obvious moment estimates

μ̂_mm = X̄ + 4σ̂_mm ε/√(2π)   (3.4)

and

σ̂²_mm = S² / [(3 − 8/π)ε² + 1],   (3.5)

where X̄ and S² are the sample mean and variance. Now, to obtain the large sample distribution of μ̂_mm and σ̂²_mm, let

μ̂_mm = X̄ + c₁ √(c₂ S²),   (3.6)

σ̂²_mm = c₂ S²,   (3.7)

where c₁ = 4ε/√(2π) and c₂ = 1/[(3 − 8/π)ε² + 1]. Then from standard large sample theory we get the following:

Theorem 3.1. As n → ∞, the limiting distribution of √n(μ̂_mm − μ, σ̂²_mm − σ²) is a centered bivariate normal distribution with variance–covariance matrix Σ_mm with the elements

σ_μμ = μ₂ + c₁ √c₂ μ₃ / √μ₂ + c₁² c₂ (μ₄ − μ₂²) / (4μ₂),   (3.8)

σ_μσ² = c₂ μ₃ + c₁ c₂^{3/2} (μ₄ − μ₂²) / (2√μ₂),   (3.9)

σ_σ²σ² = c₂² (μ₄ − μ₂²),   (3.10)

where μ_r = E[X − E(X)]^r, r = 2, 3, …, denotes the central moments of ESN(μ, σ, ε).

Proof. Use the standard large sample theory; e.g. see Serfling (1980, pp. 122–126).

Remark 3.2. Since the product moment correlation coefficient is invariant with respect to location and scale changes, in (3.8)–(3.10) we can take μ = 0 and σ = 1, and see that for a given ε the asymptotic correlation of μ̂_mm and σ̂²_mm is given by

Corr(μ̂_mm, σ̂²_mm) = [c₂ μ₃ + c₁ c₂^{3/2} (μ₄ − μ₂²)/(2√μ₂)] / { √[μ₂ + c₁ √c₂ μ₃/√μ₂ + c₁² c₂ (μ₄ − μ₂²)/(4μ₂)] · √[c₂² (μ₄ − μ₂²)] }.   (3.11)

A numerical examination of (3.11) for −1 < ε < 1 indicates that

−0.5504 < Corr(μ̂_mm, σ̂²_mm) < 0.5504.   (3.12)


4. Maximum likelihood estimation

In this section we develop the maximum likelihood estimation for the ESN(μ, σ², ε) family. The estimation of the three parameters is considered first, followed by the case where ε is known.

4.1. The general case

The maximum likelihood estimation involves maximization of the loglikelihood with respect to an auxiliary integer k, 0 ≤ k ≤ n, appearing in the definition of the loglikelihood l(μ, σ², ε) at (4.1), in addition to the population parameters μ, σ² and ε. It is convenient first to assume k known, maximize the loglikelihood in each of the n + 1 cases, and then to determine the global maximum of the loglikelihood and the maximum likelihood estimators (μ̂, σ̂², ε̂). Let x(1) ≤ x(2) ≤ ⋯ ≤ x(n) denote the order statistics of a sample from the ESN(μ, σ, ε) population. Also denote x(0) = −∞ and x(n+1) = ∞.

Lemma 4.1. There exists an integer k = k(x(1), …, x(n)), 0 ≤ k ≤ n, such that the loglikelihood l(μ, σ², ε) can be expressed as

l(μ, σ², ε) = −(n/2) log 2πσ² − (1/(8σ²)) Σ_{i=1}^{n} (x(i) − μ)²   if k = 0, n;
l(μ, σ², ε) = −(n/2) log 2πσ² − (1/(2σ²)) [ Σ_{i=1}^{k} (x(i) − μ)²/(1+ε)² + Σ_{i=k+1}^{n} (x(i) − μ)²/(1−ε)² ]   if 1 ≤ k < n.   (4.1)

Proof. The existence of k is obvious. The form of the loglikelihood follows from the p.d.f. of ESN(μ, σ, ε) given by (2.1). Note that the case k = 0 corresponds to the p.d.f. with ε = −1 and the case k = n corresponds to the p.d.f. with ε = 1. In other words, for k = 0 and n the loglikelihood comes directly from the half-normal distributions. The estimation of k, i.e. the identification of the interval containing the maximum likelihood estimate of the mode, is helped by the following two lemmas.

Lemma 4.2. If k = 0 or k = n the maximum likelihood estimate (μ̂, σ̂², ε̂) is given by

(μ̂, σ̂², ε̂) = (x(1), s₀², −1)   if k = 0;
(μ̂, σ̂², ε̂) = (x(n), sₙ², 1)    if k = n,   (4.2)

where s₀² = Σ_{i=2}^{n} (x(i) − x(1))²/(4n) and sₙ² = Σ_{i=1}^{n−1} (x(i) − x(n))²/(4n).

Proof. If k = 0, then ε = −1, and in the loglikelihood x(i) ≥ μ, i = 1, 2, …, n. A maximization of the loglikelihood then yields μ̂ = x(1), the sample minimum, and σ̂₀² = s₀². The proof of (4.2) at k = n follows similarly.


Lemma 4.3. If 1 ≤ k < n then the local minima of

g_j(x, μ) = (1/4) { [Σ_{i=1}^{j} (x(i) − μ)²]^{1/3} + [Σ_{i=j+1}^{n} (x(i) − μ)²]^{1/3} }³,   (4.3)

j = 1, 2, …, n − 1, if any, determine the local maxima of the loglikelihood.

Proof. If 1 ≤ k < n then, for a fixed σ², the loglikelihood l(μ, σ², ε) at (4.1) is maximized by examining the local minima of

g_j(x, μ, ε) = Σ_{i=1}^{j} (x(i) − μ)²/(1+ε)² + Σ_{i=j+1}^{n} (x(i) − μ)²/(1−ε)²,   (4.4)

where −1 < ε < 1 and j = 1, 2, …, n − 1. Now g_j(x, μ, ε) may be minimized sequentially, first with respect to ε for a fixed μ (and consequently for fixed k), and then with respect to μ. Thus, equating the derivative of g_j(x, μ, ε) with respect to ε to zero we have

Σ_{i=1}^{j} (x(i) − μ)²/(1+ε)³ − Σ_{i=j+1}^{n} (x(i) − μ)²/(1−ε)³ = 0.   (4.5)

Eliminating ε between (4.4) and (4.5), and simplifying, we get the minimum of g_j(x, μ, ε) with respect to ε as

g_j(x, μ) = (1/4) { [Σ_{i=1}^{j} (x(i) − μ)²]^{1/3} + [Σ_{i=j+1}^{n} (x(i) − μ)²]^{1/3} }³.   (4.6)

Thus, the problem is reduced to minimizing g_j(x, μ), or equivalently h_j(x, μ) = (4 g_j(x, μ))^{1/3}, with respect to μ subject to x(j) ≤ μ < x(j+1). Clearly, h_j(x, μ) is a differentiable function of μ with derivative

h′_j(x, μ) = −(2/3) { [Σ_{i=1}^{j} (x(i) − μ)] [Σ_{i=1}^{j} (x(i) − μ)²]^{−2/3} + [Σ_{i=j+1}^{n} (x(i) − μ)] [Σ_{i=j+1}^{n} (x(i) − μ)²]^{−2/3} }.   (4.7)

It follows that the solutions of h′_j(x, μ) = 0, j = 1, 2, …, n − 1, if any, correspond to the local stationary points of the loglikelihood. It is observed that in general the function g_j(x, μ) is monotone in most intervals. In other words, most of the half-open intervals [x(j), x(j+1)) do not contain local minima of g_j(x, μ). The problem of identifying the intervals which contain the local minima can be efficiently handled by graphing the continuously differentiable (except possibly at μ = x(1) and x(n)) function

g(μ) = Σ_{j=1}^{n−1} g_j(x, μ) I_j(μ),   (4.8)

where I_j(μ) is the indicator function of the half-open interval [x(j), x(j+1)), and g_j(x, μ) is given by Eq. (4.6). However, the isolation of the global maximum of the likelihood


from the set of the local minima identified as in Lemma 4.3 requires the values of the likelihood at these points. Towards this end we have the following:

Lemma 4.4. Let μ₀ denote a solution of h′_j(x, μ) = 0, j = 1, 2, …, n − 1. Then the corresponding values σ² = σ₀² and ε = ε₀ which maximize the loglikelihood are given by

ε₀ = { [Σ_{i=1}^{j} (x(i) − μ₀)²]^{1/3} − [Σ_{i=j+1}^{n} (x(i) − μ₀)²]^{1/3} } / { [Σ_{i=1}^{j} (x(i) − μ₀)²]^{1/3} + [Σ_{i=j+1}^{n} (x(i) − μ₀)²]^{1/3} },   (4.9)

σ₀² = (1/n) [ Σ_{i=1}^{j} (x(i) − μ₀)²/(1+ε₀)² + Σ_{i=j+1}^{n} (x(i) − μ₀)²/(1−ε₀)² ]
    = (1/(4n)) { [Σ_{i=1}^{j} (x(i) − μ₀)²]^{1/3} + [Σ_{i=j+1}^{n} (x(i) − μ₀)²]^{1/3} }³.   (4.10)

Proof. Follows directly from Eqs. (4.1) and (4.5). The value of the integer j, i.e. the interval containing the maximum likelihood estimator of the mode, is determined by the value of the loglikelihood over the set of all (μ₀, σ₀², ε₀) given by Lemmas 4.2 and 4.4.

Lemma 4.5. The global maximum of the loglikelihood is obtained by examining the following local maxima for k = 0, k = n, and the set (μ₀, σ₀², ε₀) given by Lemmas 4.2 and 4.4:

Max l_k(μ₀, σ₀², ε₀) = −(n/2) [1 + log 2πs₀²]   if k = 0;
Max l_k(μ₀, σ₀², ε₀) = −(n/2) [1 + log 2πσ₀²]   if 1 ≤ k < n;
Max l_k(μ₀, σ₀², ε₀) = −(n/2) [1 + log 2πsₙ²]   if k = n,   (4.11)

where s₀² and sₙ² are given by Lemma 4.2 and σ₀² is given by Lemma 4.4.

Remark 4.6. For ε = −1 the ESN(μ, σ, ε) coincides with the half-normal distribution over the range (μ, ∞) with σ as the scale parameter. In other words, μ becomes a threshold parameter with x(1) as a consistent and reasonable estimator. The scale parameter may then be estimated by substituting x(1) for μ and maximizing the likelihood. The asymptotic theory of estimation in the presence of a threshold parameter, e.g. as discussed by Smith (1985), then applies. The analogous observation applies when ε = 1.

Theorem 4.7. As n → ∞, the limiting distribution of √n(μ̂_ml − μ, σ̂²_ml − σ², ε̂_ml − ε) is a centered trivariate normal distribution with variance–covariance matrix

Σ_ml = [ 3σ²(1−ε²)/(3 − 8/π)        0      2√(2/π) σ(1−ε²)/(3 − 8/π) ]
       [ 0                          2σ⁴    0                         ]
       [ 2√(2/π) σ(1−ε²)/(3 − 8/π)  0      (1−ε²)/(3 − 8/π)          ].   (4.12)


Proof. The regularity conditions for the validity of the large sample theory for maximum likelihood estimation (e.g. Rao, 1973, Section 5g) are obviously satisfied. Also, Σ_ml is the inverse of the expectation of the total sample information matrix given by the density in (2.1) and may be computed readily. The proof is completed by taking expectations using the score equations.

Corollary 4.8. The maximum likelihood estimator σ̂²_ml of the scale parameter σ² is asymptotically independent of both μ̂_ml and ε̂_ml, the maximum likelihood estimators of the location parameter μ and the skewness parameter ε. The asymptotic coefficient of correlation between μ̂_ml and ε̂_ml is 2√2/√(3π) = 0.921.

Proof. Follows from the asymptotic variance–covariance matrix (4.12).

The case when ε is known a priori, say ε = ε₀, further clarifies the analogy with the normal distribution. The maximization of the likelihood in this case is much simpler. The asymptotic distribution of the maximum likelihood estimators is given by the following result, the proof of which is a straightforward application of the standard methodology.

Theorem 4.9. If ε = ε₀ then, as n → ∞, √n(μ̂_ml(ε₀) − μ, σ̂²_ml(ε₀) − σ²) has a centered bivariate normal distribution with variance–covariance matrix

[ σ²(1 − ε₀²)   0   ]
[ 0             2σ⁴ ].   (4.13)

Normal case: If ε₀ = 0, then the maximum likelihood estimators of μ and σ² are the same as in the normal case, namely X̄ and S². Note that if ε is known to be equal to ε₀ then the asymptotic conditional distribution of (μ̂_ml, σ̂²_ml) given ε̂ reduces to the same bivariate normal distribution as given by Theorem 4.9. Thus, if ε is known then the routine normal theory results may be employed for inference about the ESN(μ, σ, ε₀) family. Also note that the variance of the maximum likelihood estimator μ̂ is reduced by a factor of 3/(3 − 8/π) = 6.6149 when the value of ε, known a priori, is used.

Corollary 4.10 (Conditional distribution). As n → ∞ and given ε̂_ml, √n(μ̂_ml − (μ + 2√(2/π) σ(ε̂_ml − ε)), σ̂²_ml − σ²) has a centered bivariate normal distribution with variance–covariance matrix

[ σ²(1 − ε²)   0   ]
[ 0            2σ⁴ ].   (4.14)

Proof. Follows from the standard theory and Theorem 4.7.

Theorem 4.11 (Variance stabilization). Let

h(μ̂_ml, σ̂²_ml, ε̂_ml) = ( μ̂_ml, (1/√2) log σ̂²_ml, √(3 − 8/π) sin⁻¹(ε̂_ml) )   (4.15)


be the vector-valued transform of the maximum likelihood estimates. Then, as n → ∞, √n(h(μ̂_ml, σ̂²_ml, ε̂_ml) − h(μ, σ², ε)) has a centered trivariate normal distribution with variance–covariance matrix

D Σ_ml D′ = [ 3σ²(1−ε²)/(3 − 8/π)     0   2σ√(2(1−ε²)/(3π − 8)) ]
            [ 0                       1   0                     ]
            [ 2σ√(2(1−ε²)/(3π − 8))   0   1                     ],   (4.16)

where

D = [ 1   0           0                       ]
    [ 0   1/(√2 σ²)   0                       ]
    [ 0   0           √((3 − 8/π)/(1 − ε²))   ].   (4.17)

Proof. Follows from the Mann–Wald theorem (e.g. Serfling, 1980) and the asymptotic variance–covariance matrix (4.12).

Confidence intervals: Theorem 4.11 can be used to construct a large sample confidence interval, with approximate confidence coefficient (1 − α)100%, for σ² as follows:

( σ̂²_ml exp(−z_{α/2} √(2/n)), σ̂²_ml exp(z_{α/2} √(2/n)) ).   (4.18)

Similarly, an approximate (1 − α)100% confidence interval for ε is

( sin[ sin⁻¹(ε̂_ml) − z_{α/2}/√(n(3 − 8/π)) ], sin[ sin⁻¹(ε̂_ml) + z_{α/2}/√(n(3 − 8/π)) ] ).   (4.19)

The marginal asymptotic distribution of μ̂_ml involves both σ² and ε. μ̂_ml is independent of σ̂²_ml, but is correlated with ε̂_ml. Yet from Slutsky's theorem it follows that √n {(μ̂_ml − μ) / [σ̂_ml √(3(1 − ε̂²_ml)/(3 − 8/π))]} is asymptotically standard normal. Hence we have the approximate (1 − α)100% confidence interval for μ:

μ̂_ml ± z_{α/2} σ̂_ml √( 3(1 − ε̂²_ml) / (n(3 − 8/π)) ),   (4.20)

where z_α = Φ⁻¹(1 − α).

One-step estimator for μ: For fixed k a one-step estimate of μ, denoted μ̂_os, is given by the Newton-type step

μ̂_os = μ̂_mm − h′(x, μ̂_mm) / h″(x, μ̂_mm),   (4.21)

where h′ is given by (4.7), μ̂_mm is given by (3.4) in conjunction with the approximation for ε̂_mm given by (3.3), and

h″_j(x, μ) = (2/3) { j [Σ_{i=1}^{j} (x(i) − μ)²]^{−2/3} − (4/3) [Σ_{i=1}^{j} (x(i) − μ)]² [Σ_{i=1}^{j} (x(i) − μ)²]^{−5/3}
            + (n − j) [Σ_{i=j+1}^{n} (x(i) − μ)²]^{−2/3} − (4/3) [Σ_{i=j+1}^{n} (x(i) − μ)]² [Σ_{i=j+1}^{n} (x(i) − μ)²]^{−5/3} }.   (4.22)

4.2. Score test for symmetry

In this subsection we consider a test of the composite hypothesis H₀: ε = ε₀, η ∈ Ω versus H_A: ε ≠ ε₀, η ∈ Ω, for ε₀ = 0 corresponding to symmetry, where η = (μ, σ²) is an unknown two-dimensional nuisance parameter lying in an open set Ω. The only alternative values considered are in a collapsing neighborhood of the null value ε₀ = 0, i.e. we consider only contiguous alternatives of the form (h₀/√n, η + h_η/√n). Under H₀ the score test statistic

T_n = [ Σ_{i=1}^{[n/2]} (X(i) − X̄)² − Σ_{i=[n/2]+1}^{n} (X(i) − X̄)² ] / [ S² √(n(3 − 8/π)) ]   (4.23)

is AN(0, 1), where [·] denotes the floor function; see Hall and Mathiason (1990) and Basawa (1991) for the technical details of the methodology pertaining to the construction of the test statistic. The local power of the test is given by

Φ( h_A √(3 − 8/π) − z_{α/2} ) + Φ( −h_A √(3 − 8/π) − z_{α/2} ).   (4.24)

The power function follows straightforwardly from the fact that the effective score reduces to the score for ε after substituting μ̂ = X̄ for μ and σ̂² = S² for σ². The test statistic T_n at (4.23) may easily be shown to be asymptotically efficient.

4.3. Numerical example

The maximum likelihood computations involving the ESN(μ, σ, ε) family are now illustrated using the heights of 219 of the world's volcanoes (Source: National Geographic Society and the World Almanac 1966, pp. 282–283), used in Tukey (1977) for explaining the role of ideas such as "easy re-expressions", boxplots and letter value displays in exploratory data analysis. They are presented in the form of a stem-and-leaf plot in Fig. 2. The basic descriptive statistics for the volcano heights X are: the mean X̄ = 70.246, the variance S² = 1850.56, and the median = 65.000. The coefficient of skewness √b₁ = 0.840 indicates that X is asymmetric, and its coefficient of kurtosis b₂ = 3.482 is within the range of the ESN family given by (2.11).


Fig. 2. Heights of 219 of the world’s volcanoes (unit=100 ft).

The method of moments estimates for the ESN fit to the volcano data, as given by (3.3)–(3.5), are respectively μ̂_mm = 28.845, σ̂²_mm = 1545.28 and ε̂_mm = −0.660. Furthermore, the integer k required for the maximum likelihood estimation of the parameters is 35, i.e. μ̂_ml lies between the 35th and 36th sample order statistics. The maximum likelihood estimates for the volcano data, given by Eqs. (4.7), (4.9) and (4.10), are μ̂_ml = 27.586, σ̂²_ml = 1553.84 and ε̂_ml = −0.673. As further analysis of the data, we can use (4.18)–(4.20) to construct large sample confidence intervals for the parameters μ, σ², and ε, with individual confidence coefficients 95%. These are: μ: (17.654, 37.518); σ²: (1288.43, 1873.92); and ε: (−0.805, −0.516). From the confidence interval for ε it can be seen again that the normal family would be unsuitable for modeling the original volcano heights. As seen in Fig. 3, the ESN family provides a satisfactory model for the volcano heights, especially as compared with the normal family. To further assess the quality of the fit, the data were grouped into 10 classes using consistent estimators of the deciles to define the class boundaries. The Pearson χ² goodness-of-fit statistic at the maximum likelihood estimates given above, based on this grouping, has a value of 4.65 with six degrees of freedom and a p-value of 0.41. The corresponding p-value for the normal fit is 0.06.
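The interval formulas (4.18)–(4.20) are easily coded; the sketch below (names ours) reproduces the volcano-data intervals quoted above from the reported maximum likelihood estimates:

```python
import math

def esn_confints(mu_hat, sigma2_hat, eps_hat, n, z=1.96):
    """Approximate (1-alpha)100% intervals (4.18)-(4.20) from an ML fit;
    z = z_{alpha/2} (1.96 gives 95%)."""
    c = 3.0 - 8.0 / math.pi
    # (4.20): location
    half_mu = z * math.sqrt(sigma2_hat * 3.0 * (1.0 - eps_hat ** 2) / (n * c))
    ci_mu = (mu_hat - half_mu, mu_hat + half_mu)
    # (4.18): scale, symmetric on the log scale
    ci_s2 = (sigma2_hat * math.exp(-z * math.sqrt(2.0 / n)),
             sigma2_hat * math.exp(z * math.sqrt(2.0 / n)))
    # (4.19): skewness, via the arcsine variance-stabilizing transform
    half_e = z / math.sqrt(n * c)
    ci_eps = (math.sin(math.asin(eps_hat) - half_e),
              math.sin(math.asin(eps_hat) + half_e))
    return ci_mu, ci_s2, ci_eps
```

With μ̂_ml = 27.586, σ̂²_ml = 1553.84, ε̂_ml = −0.673 and n = 219 this returns (17.654, 37.518), (1288.43, 1873.92) and (−0.805, −0.516), matching the intervals reported in the text.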

5. Bayesian calculations

O'Hagan and Leonard (1976) introduced the asymmetric normal distribution, denoted SN(λ) by Azzalini (1985, 1986), as a less restrictive prior for Bayesian estimation of


Fig. 3. The normal and the epsilon–skew–normal fits to the 219 volcano heights.

the normal mean. Recently, it was re-examined by Liseo (1990) in the context Bayesian modeling, and used by Arnold et al. (1993) in the Bayesian analysis of Otis test scores. The ESN(; ; ) family, which has the same range of skewness and kurtosis as the SN() distribution, may also be used for such purposes. It can be used both as a model and as a prior distribution. In this section we discuss the posterior distributions in the two cases. It is shown that both posterior distributions are unimodal, and their modes are available in closed form. These can be used for estimating  and for constructing highest posterior density regions. 5.1. Epsilon–skew–normal model, normal prior If X | ∼ ESN(; ; ), and a priori,  ∼ N(; ) then the joint distribution of X and  is  2 2 1  2 exp{− 12 [ (−) + (x−) if 6x; 2 (1−)2 ]} 2 h(x; ) = (5.1) 2 2  1 (x−) 1 (−) exp{− [ + ]} if  ¿ x: 2 2 2 2 (1+)2 Letting 1 = (1 − ) 1 =

21 + 2 21 2

and

2 = (1 + );

and

2 =

22 + 2 22 2

(5.2) (5.3)

306 G.S. Mudholkar, A.D. Hutson / Journal of Statistical Planning and Inference 83 (2000) 291–309

and re-expressing the joint distribution of X and θ at (5.1) in terms of (5.2) and (5.3), we get

$$
h(x,\theta)=
\begin{cases}
\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{\tau_1}{2}\left[\theta-\dfrac{1}{\tau_1}\left(\dfrac{x}{\sigma_1^2}+\dfrac{\mu}{\delta^2}\right)\right]^2\right\}\exp\left[-\dfrac{(\mu-x)^2}{2[\delta^2+\sigma_1^2]}\right] & \text{if } \theta\le x,\\[2mm]
\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{\tau_2}{2}\left[\theta-\dfrac{1}{\tau_2}\left(\dfrac{x}{\sigma_2^2}+\dfrac{\mu}{\delta^2}\right)\right]^2\right\}\exp\left[-\dfrac{(\mu-x)^2}{2[\delta^2+\sigma_2^2]}\right] & \text{if } \theta> x.
\end{cases}
\tag{5.4}
$$

The marginal distribution of X, obtained by integrating out θ over the entire range, is given by

$$
m(x)=\frac{1}{\sqrt{2\pi}\,\sigma\delta\,\tau_1^{1/2}}\exp\left\{-\frac{(\mu-x)^2}{2[\delta^2+\sigma_1^2]}\right\}\Phi\left(\tau_1^{1/2}\left[x-\frac{1}{\tau_1}\left(\frac{x}{\sigma_1^2}+\frac{\mu}{\delta^2}\right)\right]\right)
+\frac{1}{\sqrt{2\pi}\,\sigma\delta\,\tau_2^{1/2}}\exp\left\{-\frac{(\mu-x)^2}{2[\delta^2+\sigma_2^2]}\right\}\Phi\left(-\tau_2^{1/2}\left[x-\frac{1}{\tau_2}\left(\frac{x}{\sigma_2^2}+\frac{\mu}{\delta^2}\right)\right]\right).
\tag{5.5}
$$

The posterior π(θ|x) = h(x, θ)/m(x) is then simply obtained from (5.4) and (5.5). We now show that π(θ|x) is unimodal and obtain an expression for its mode.

Lemma 5.1. The posterior distribution π(θ|x) consists of two truncated normal distributions spliced together continuously at θ = x. It is unimodal.

Proof. Fix x, and denote h(x, θ) at (5.1) by

$$
h(\theta)=
\begin{cases}
h_1(\theta)=\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{1}{2}\left[\dfrac{(\theta-\mu)^2}{\delta^2}+\dfrac{(x-\theta)^2}{\sigma^2(1-\epsilon)^2}\right]\right\} & \text{if } \theta\le x,\\[2mm]
h_2(\theta)=\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{1}{2}\left[\dfrac{(\theta-\mu)^2}{\delta^2}+\dfrac{(x-\theta)^2}{\sigma^2(1+\epsilon)^2}\right]\right\} & \text{if } \theta> x.
\end{cases}
\tag{5.6}
$$

It is obvious that h₁(θ) and h₂(θ) are proportional to normal probability density functions. Furthermore, they are equal at θ = x, i.e. h₁(x) = h₂(x). Hence h(θ) is continuous with at most two modes. Now consider the ratio

$$
h_1(\theta)/h_2(\theta)=\exp\left\{\frac{-2\epsilon(\theta-x)^2}{\sigma^2(1-\epsilon^2)^2}\right\}.
\tag{5.7}
$$

It is clear that

$$
h_1(\theta)/h_2(\theta)\le 1 \quad\text{if } \epsilon>0,
\tag{5.8}
$$

$$
h_1(\theta)/h_2(\theta)\ge 1 \quad\text{if } \epsilon<0.
\tag{5.9}
$$

In other words, h(θ) is a join of one part each of the two functions h₁(θ) and h₂(θ), one of which is larger than the other except at one point, namely θ = x. This establishes the unimodality of h(θ), and hence of its normalized version π(θ|x).

Theorem 5.2. The mode of the posterior distribution π(θ|x) is given by

$$
\operatorname{Mode}(\theta\,|\,X=x)=
\begin{cases}
\dfrac{1}{\tau_1}\left(\dfrac{x}{\sigma_1^2}+\dfrac{\mu}{\delta^2}\right) & \text{if } \mu<x,\\[2mm]
\dfrac{1}{\tau_2}\left(\dfrac{x}{\sigma_2^2}+\dfrac{\mu}{\delta^2}\right) & \text{if } \mu>x.
\end{cases}
\tag{5.10}
$$
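The closed-form mode in (5.10) lends itself to a quick numerical sanity check against brute-force maximization of the joint density (5.1). The sketch below is ours, not the authors'; the function names and test values are illustrative:

```python
import math

def joint(theta, x, sigma, delta, mu, eps):
    # joint density h(x, theta) of (5.1), up to the constant 1/(2*pi*sigma*delta):
    # ESN(theta, sigma, eps) likelihood times N(mu, delta) prior
    s = sigma * (1.0 - eps) if theta <= x else sigma * (1.0 + eps)
    return math.exp(-0.5 * (((theta - mu) / delta) ** 2 + ((x - theta) / s) ** 2))

def mode_510(x, sigma, delta, mu, eps):
    # closed-form posterior mode from (5.10)
    s = sigma * (1.0 - eps) if mu < x else sigma * (1.0 + eps)
    tau = (s**2 + delta**2) / (s**2 * delta**2)
    return (x / s**2 + mu / delta**2) / tau

x, sigma, delta, mu, eps = 2.0, 1.0, 1.5, 0.0, 0.4
grid = [mu - 5.0 + 1e-4 * i for i in range(100000)]
brute = max(grid, key=lambda t: joint(t, x, sigma, delta, mu, eps))
print(abs(brute - mode_510(x, sigma, delta, mu, eps)) < 1e-3)  # True
```

With μ = 0 the formula reduces to xδ²/(σ₁² + δ²) = 2 · 2.25/2.61 ≈ 1.724, which the grid search recovers.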

Proof. For a given x, the mode of π(θ|x) is the same as the mode of h(θ) = h(x, θ) at (5.6). But the mode of h(θ) must coincide with either the mode θ₁ = (1/τ₁)(x/σ₁² + μ/δ²) of h₁(θ), or the mode θ₂ = (1/τ₂)(x/σ₂² + μ/δ²) of h₂(θ). From Lemma 5.1 it also follows that θ₁ and θ₂ must be both less than x or both greater than x. If both θ₁ and θ₂ are less than x then the mode is θ₁. It is easy to verify that θ₁ = (1/τ₁)(x/σ₁² + μ/δ²) < x is equivalent to x > μ. Similarly, if x < μ then the mode is θ₂. This completes the proof.

5.2. Normal model, epsilon–skew–normal prior

Suppose X | θ ∼ N(θ, σ) and, a priori, θ ∼ ESN(μ, δ, ε). Then the joint distribution of X and θ,

$$
h(x,\theta)=
\begin{cases}
\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{1}{2}\left[\dfrac{(\theta-\mu)^2}{\delta^2(1+\epsilon)^2}+\dfrac{(x-\theta)^2}{\sigma^2}\right]\right\} & \text{if } \theta<\mu,\\[2mm]
\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{1}{2}\left[\dfrac{(\theta-\mu)^2}{\delta^2(1-\epsilon)^2}+\dfrac{(x-\theta)^2}{\sigma^2}\right]\right\} & \text{if } \theta\ge\mu,
\end{cases}
\tag{5.11}
$$

may be re-expressed as

$$
h(x,\theta)=
\begin{cases}
\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{\tau_2}{2}\left[\theta-\dfrac{1}{\tau_2}\left(\dfrac{\mu}{\delta_2^2}+\dfrac{x}{\sigma^2}\right)\right]^2\right\}\exp\left[-\dfrac{(\mu-x)^2}{2[\sigma^2+\delta_2^2]}\right] & \text{if } \theta<\mu,\\[2mm]
\dfrac{1}{2\pi\sigma\delta}\exp\left\{-\dfrac{\tau_1}{2}\left[\theta-\dfrac{1}{\tau_1}\left(\dfrac{\mu}{\delta_1^2}+\dfrac{x}{\sigma^2}\right)\right]^2\right\}\exp\left[-\dfrac{(\mu-x)^2}{2[\sigma^2+\delta_1^2]}\right] & \text{if } \theta\ge\mu,
\end{cases}
\tag{5.12}
$$

where δ₁ = δ(1−ε), δ₂ = δ(1+ε), and τ₁ and τ₂ are the analogues of (5.2) and (5.3) with the roles of σ and δ interchanged, i.e. τᵢ = (δᵢ² + σ²)/(δᵢ²σ²), i = 1, 2. Also, the marginal distribution of X is

$$
m(x)=\frac{1}{\sqrt{2\pi}\,\sigma\delta\,\tau_2^{1/2}}\exp\left\{-\frac{(\mu-x)^2}{2[\sigma^2+\delta_2^2]}\right\}\Phi\left(\tau_2^{1/2}\left[\mu-\frac{1}{\tau_2}\left(\frac{\mu}{\delta_2^2}+\frac{x}{\sigma^2}\right)\right]\right)
+\frac{1}{\sqrt{2\pi}\,\sigma\delta\,\tau_1^{1/2}}\exp\left\{-\frac{(\mu-x)^2}{2[\sigma^2+\delta_1^2]}\right\}\Phi\left(-\tau_1^{1/2}\left[\mu-\frac{1}{\tau_1}\left(\frac{\mu}{\delta_1^2}+\frac{x}{\sigma^2}\right)\right]\right),
\tag{5.13}
$$

and the posterior π(θ|x) = h(x, θ)/m(x) is given by (5.12) and (5.13). As in Section 5.1, the posterior distribution π(θ|x) is unimodal, and its mode has a simple closed form. The proofs of the following two results parallel those of Lemma 5.1 and Theorem 5.2.

Lemma 5.3. The posterior distribution π(θ|x), which consists of two truncated normal distributions spliced together continuously at θ = μ, is unimodal.

Theorem 5.4. The mode of the posterior distribution π(θ|x) is given by

$$
\operatorname{Mode}(\theta\,|\,X=x)=
\begin{cases}
\dfrac{1}{\tau_2}\left(\dfrac{\mu}{\delta_2^2}+\dfrac{x}{\sigma^2}\right) & \text{if } x<\mu,\\[2mm]
\dfrac{1}{\tau_1}\left(\dfrac{\mu}{\delta_1^2}+\dfrac{x}{\sigma^2}\right) & \text{if } x>\mu.
\end{cases}
\tag{5.14}
$$
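Theorem 5.4's mode can be checked the same way as Theorem 5.2's, by maximizing the joint density (5.11) on a grid. Again a sketch of ours, with illustrative names and values:

```python
import math

def joint(theta, x, sigma, delta, mu, eps):
    # joint density h(x, theta) of (5.11), up to the constant 1/(2*pi*sigma*delta):
    # N(theta, sigma) likelihood times ESN(mu, delta, eps) prior
    d = delta * (1.0 + eps) if theta < mu else delta * (1.0 - eps)
    return math.exp(-0.5 * (((theta - mu) / d) ** 2 + ((x - theta) / sigma) ** 2))

def mode_514(x, sigma, delta, mu, eps):
    # closed-form posterior mode from (5.14)
    d = delta * (1.0 + eps) if x < mu else delta * (1.0 - eps)
    tau = (d**2 + sigma**2) / (d**2 * sigma**2)
    return (mu / d**2 + x / sigma**2) / tau

x, sigma, delta, mu, eps = -1.5, 1.0, 1.0, 0.0, 0.3
grid = [mu - 5.0 + 1e-4 * i for i in range(100000)]
brute = max(grid, key=lambda t: joint(t, x, sigma, delta, mu, eps))
print(abs(brute - mode_514(x, sigma, delta, mu, eps)) < 1e-3)  # True
```

Since x < μ here, the prior's heavier left piece (scale δ(1+ε)) governs, and the mode xδ₂²/(δ₂² + σ²) ≈ −0.942 lies strictly between the prior mode μ and the observation x.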

6. Conclusion

In this paper we have introduced a family of distributions which includes the Gaussian family and can be used for analyzing near-normal data. Obviously, much additional work is needed.


Acknowledgements

The authors would like to thank Marshal Freimer for his helpful discussions and the reviewer for his helpful comments. Alan Hutson's work is supported by the National Institutes of Health General Clinical Center Grant RR00082.

References

Arnold, B.C., Beaver, R.J., Groeneveld, R.A., 1993. The nontruncated marginal of a truncated bivariate normal distribution. Psychometrika 58, 471–488.
Azzalini, A., 1985. A class of distributions which includes the normal ones. Scand. J. Statist. 12, 171–178.
Azzalini, A., 1986. Further results on a class of distributions which includes the normal ones. Statistica 46, 199–208.
Basawa, I.V., 1991. Generalized score tests for composite hypotheses. In: Estimating Functions. Clarendon Press, Oxford, pp. 121–131.
Box, G.E.P., 1953. A note on regions for tests of kurtosis. Biometrika 40, 465–468.
Chhikara, R.S., Folks, J.L., 1989. The Inverse Gaussian Distribution: Theory, Methodology, and Applications. Dekker, New York.
Csiszar, I., 1975. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3, 146–158.
Copenhaver, T.W., Mielke, P.W., 1977. Quantit analysis: a quantal assay refinement. Biometrics 33, 175–186.
De Moivre, A., 1733. Approximatio ad Summam Terminorum Binomii (a + b)^n in Seriem expansi.
Kotz, S., Johnson, N.L. (Eds.-in-Chief), Read, C.B. (Exec. Ed.), 1988. Skew-normal distributions. In: Encyclopedia of Statistical Sciences, vol. 8. Wiley, New York, p. 507.
Fechner, G.T., 1897. Kollectivmasslehre. Engleman, Leipzig.
Freimer, M., Mudholkar, G.S., Kollia, G., Lin, C.T., 1988. A study of the generalized Tukey lambda family. Comm. Statist. Theory Methods 17, 3547–3567.
Gauss, C.F., 1809. Theoria Motus Corporum Coelestium. Perthes and Besser, Hamburg.
Geary, R.C., 1947. Testing for normality. Biometrika 34, 209–242.
Gibbons, J.F., Mylroie, S., 1973. Estimation of impurity profiles in ion-implanted amorphous targets using joined half-Gaussian distributions. Appl. Phys. Lett. 22, 568–572.
Hall, W.J., Mathiason, D.J., 1990. On large-sample estimation and testing in parametric models. Internat. Statist. Rev. 58, 77–97.
Henze, N., 1986. A probabilistic representation of the 'skew-normal' distribution. Scand. J. Statist. 13, 271–275.
Hoaglin, D.C., 1985. Summarizing shape numerically: the g- and h-distributions. In: Hoaglin, D.C. (Ed.), Exploring Data Tables, Trends and Shapes. Wiley, New York, pp. 461–513.
Joergensen, B., 1981. Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer, New York.
John, S., 1982. The three-parameter two-piece normal family of distributions and its fitting. Comm. Statist. Theory Methods 11, 879–885.
Kagan, A.M., Linnik, Y.V., Rao, C.R., 1973. Characterization Problems in Mathematical Statistics. Wiley, New York.
Kimber, A.C., 1985. Methods for the two-piece normal distribution. Comm. Statist. Theory Methods 14, 235–245.
Kimber, A.C., Jeynes, C., 1987. An application of the truncated two-piece normal distribution to the measurement of depths of arsenic implants in silicon. Appl. Statist. 36, 352–357.
Liseo, B., 1990. The skew-normal class of densities: inferential aspects from a Bayesian viewpoint (in Italian). Statistica 50, 71–82.
Mudholkar, G.S., Chaubey, Y.P., Smethurst, P.A., 1993. On entropy-based goodness-of-fit tests in probability and statistics. In: Basu, S.K., Sinha, B.K. (Eds.), Proceedings of the First International Triennial Calcutta Symposium in Probability and Statistics.
Mudholkar, G.S., Lin, C.T., 1984. On two applications of characterization theorems to goodness-of-fit. Colloq. Math. Soc. Janos Bolyai 45, 395–414.
O'Hagan, A., Leonard, T., 1976. Bayes estimation subject to uncertainty about parameter constraints. Biometrika 63, 201–202.
Prentice, R.L., 1976. Generalizations of the probit and logit methods for dose response curves. Biometrics 32, 761–768.
Rao, C.R., 1973. Linear Statistical Inference and Its Applications, 2nd Edition. Wiley, New York.
Renyi, A., 1961. On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561.
Runnenberg, J.T., 1978. Mean, median, mode. Statist. Neerlandica 32, 73–79.
Serfling, R.J., 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Smith, R.L., 1985. Maximum likelihood estimation in a class of nonregular cases. Biometrika 72, 67–90.
Stigler, S.M., 1977. Do robust estimators work with real data? Ann. Statist. 5, 1055–1098.
Toth, Z., Szentimrey, T., 1990. The binormal distribution: a distribution for representing asymmetrical but normal-like weather elements. J. Climate 3, 128–136.
Tukey, J.W., 1977. Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Vasicek, O., 1976. A test for normality based on sample entropy. J. Roy. Statist. Soc. Ser. B 38, 54–59.
Yule, G.U., 1938. On some properties of normal distributions, univariate and bivariate, based on sums of squares of frequencies. Biometrika 30, 1–10.