Journal of Statistical Planning and Inference 136 (2006) 2197–2212

Asymptotic values and expansions for the correlation between different measures of spread

Anirban DasGupta (Purdue University, West Lafayette, IN, USA) and L.R. Haff (UCSD, La Jolla, CA, USA)

Available online 6 October 2005

Abstract. For i.i.d. samples from a normal, uniform, or exponential distribution, we give exact formulas for the correlation between the sample variance and the sample range for every fixed $n$. These exact formulas are then used to obtain asymptotic expansions for the correlations. It is seen that the correlation converges to zero at the rate $\log n/\sqrt{n}$ in the normal case, at the rate $1/\sqrt{n}$ in the uniform case, and at the rate $(\log n)^2/\sqrt{n}$ in the exponential case. In two of the three cases, we obtain higher-order expansions for the correlation. We then obtain the joint asymptotic distribution of the interquartile range and the standard deviation for any distribution with a finite fourth moment. This is used to obtain the nonzero limits of the correlation between them for some important distributions, as well as some potentially useful practical diagnostics based on the interquartile range and the standard deviation. It is seen that the correlation is higher for thin-tailed distributions and smaller for thick-tailed ones. We also show graphics for the Cauchy distribution, which exhibit interesting phenomena. Other numerics illustrate the theoretical results.

Keywords: Asymptotic; Bahadur representation; Correlation; Expansion; Range; Standard deviation

1. Introduction

The sample standard deviation, the range, and the interquartile range are variously used as measures of spread in the distribution of a random variable. Range-based measures are still common in process control studies, while measures based on the interquartile range are sometimes used as robust measures of spread. It would seem natural to expect that the three measures of spread share some reasonable positive dependence property for most types of populations, and perhaps for all sample sizes $n$. The purpose

of the present article is to investigate the interrelations between them in greater depth than has been done before, mathematically as well as numerically. For example, we investigate the correlation between the sample range $W$ (or the interquartile range) and the sample standard deviation $s$, both for fixed samples and asymptotically. We also investigate the joint asymptotic behavior of the standard deviation and the interquartile range, in the sense of their joint distribution, and use these asymptotics to evaluate the exact limiting values of their correlation for a number of important distributions.

As a matter of mathematical tractability, it is much easier to analyze the correlation between $s^2$ and $W$, both for fixed samples and asymptotically. In the next three sections, we present exact formulas for the correlation between $s^2$ and $W$, for every fixed sample size $n$, when the underlying population is normal, exponential, or uniform. They are common distributions, and they represent a symmetric population with no tail (uniform), a symmetric population with a thin tail (normal), and a skewed population widely used in practice. Another reason for working specifically with these three cases is that they appear to be the only standard distributions for which an exact formula for the correlation can be given for fixed sample sizes. Using the fixed-sample formulas, we then derive asymptotic expansions for the correlation. The first term in the expansion gives the rate of convergence of the correlation to zero in each case. For instance, we prove that the correlation converges to zero at the rate $1/\sqrt{n}$ in the uniform case, at the rate $\log n/\sqrt{n}$ in the normal case, and at the rate $(\log n)^2/\sqrt{n}$ in the exponential case. These derivations involve a great deal of calculation, much of which has been condensed for the sake of brevity.

Next, by use of the Bahadur representation of sample quantiles, we work out the asymptotic bivariate normal distribution of the interquartile range and the standard deviation for any distribution with four moments. We apply it to obtain the limits of the correlation between them for five important distributions; the general result can be used to obtain the limiting correlation for any distribution with four moments. We also use this general result to form some quick diagnostics based on the ratio of the interquartile range and the standard deviation. These may be of some practical use. The article ends with graphics, based on simulations, of the scatterplots of $s$ against $W$ and against the IQR for the Cauchy distribution. The graphics show some interesting outlier and clustering phenomena. We hope that the asymptotic calculations and the graphics presented here give some insight to a practicing statistician as well as to an applied probabilist, and that the asymptotic expansions are of some independent technical interest.

There is considerable literature on a related classical problem, namely the distribution of $W/s$; see, for example, Plackett (1947), David et al. (1954), and Thomson (1955). We have not addressed that problem here.

2. Uniform distribution

Using a well-known property of the order statistics of an i.i.d. uniform sample, we derive below an explicit formula for $\rho_n = \mathrm{Corr}(s^2, W)$ for every fixed sample size $n$. For brevity of presentation, certain intense algebraic details have been omitted.

Theorem 1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $U[a, b]$. Then for each $n \ge 2$,
$$\rho_n = \frac{2\sqrt{5n(n+2)}}{(n+3)\sqrt{2n+3}}. \tag{1}$$

Proof. Without loss of generality, we may assume that $a = 0$ and $b = 1$. We derive expressions for $\mathrm{Cov}(s^2, W)$, $\mathrm{Var}(s^2)$, and $\mathrm{Var}(W)$ in the following steps; we evaluate the correlation between $\sum_{i=1}^n (X_i - \bar{X})^2$ and $W$, which is the same as $\rho_n$.

$\mathrm{Cov}(s^2, W)$: Step 1. It is well known that for $U[0,1]$ data, $W$ has the density $n(n-1)w^{n-2}(1-w)$, $0 < w < 1$. It follows immediately that
$$E(W) = \frac{n-1}{n+1} \qquad \text{and} \qquad \mathrm{Var}(W) = \frac{2(n-1)}{(n+1)^2(n+2)}.$$
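Since $W \sim \mathrm{Beta}(n-1, 2)$ here, these two moments are easy to confirm by simulation. The following Python sketch (ours, not part of the original paper) checks them for $n = 10$:

```python
import numpy as np

# Monte Carlo confirmation (ours) of Step 1: for U[0,1] samples,
# W ~ Beta(n-1, 2), so E(W) = (n-1)/(n+1) and
# Var(W) = 2(n-1)/((n+1)^2 (n+2)).
rng = np.random.default_rng(0)
n, reps = 10, 200_000
x = rng.random((reps, n))
w = x.max(axis=1) - x.min(axis=1)
print(round(w.mean(), 4), (n - 1) / (n + 1))                    # both ~0.8182
print(round(w.var(), 4), 2 * (n - 1) / ((n + 1)**2 * (n + 2)))  # both ~0.0124
```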

Also, obviously, $E\left(\sum_{i=1}^n (X_i - \bar{X})^2\right) = (n-1)/12$. The rest of the steps analyze the term $E\left(W \sum_{i=1}^n (X_i - \bar{X})^2\right)$.

Step 2. Towards this end, we use the fact that conditional on $(X_{(1)}, X_{(n)})$, the remaining order statistics $X_{(2)}, \ldots, X_{(n-1)}$ are distributed as the order statistics of a sample of size $n-2$ from $U[X_{(1)}, X_{(n)}]$. Thus, denoting the order statistics of an i.i.d. sample of size $n-2$ from $U[0,1]$ by $U_{(1)}, \ldots, U_{(n-2)}$, with mean $\bar{U}$, we have
$$E\left(W \sum_{i=1}^n (X_i - \bar{X})^2\right) = E\,E\left(W\Big[(X_{(1)} - \bar{X})^2 + (X_{(n)} - \bar{X})^2 + \sum_{i=2}^{n-1}(X_i - \bar{X})^2\Big] \,\Big|\, X_{(1)}, X_{(n)}\right)$$
$$= E(W^3)\,E\left[\left(\frac{1 + (n-2)\bar{U}}{n}\right)^2 + \left(\frac{n-1-(n-2)\bar{U}}{n}\right)^2 + \sum_{i=1}^{n-2}\left(U_{(i)} - \frac{1+(n-2)\bar{U}}{n}\right)^2\right]. \tag{2}$$
Here we used the fact that, conditionally, the $n$ standardized observations are $0, U_{(1)}, \ldots, U_{(n-2)}, 1$, so each $X_i - \bar{X}$ equals $W$ times the corresponding standardized deviation, and $W$ is independent of the $U_{(i)}$.

Step 3. From (2), using the facts that
$$E(W^3) = \frac{n(n-1)}{(n+2)(n+3)}, \qquad E(\bar{U}) = \frac{1}{2}, \qquad \mathrm{Var}(\bar{U}) = \frac{1}{12(n-2)},$$


after a few lines of algebra it follows that
$$\mathrm{Cov}\left(\sum_{i=1}^n (X_i - \bar{X})^2,\ W\right) = E\left(\sum_{i=1}^n (X_i - \bar{X})^2\, W\right) - E\left(\sum_{i=1}^n (X_i - \bar{X})^2\right) E(W) = \frac{n-1}{3(n+1)(n+3)}. \tag{3}$$
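The covariance formula (3) can likewise be checked by a small Monte Carlo experiment; the sketch below (ours) compares the empirical covariance with $(n-1)/(3(n+1)(n+3))$ at $n = 10$:

```python
import numpy as np

# Monte Carlo check (ours) of the covariance formula (3) at n = 10.
rng = np.random.default_rng(4)
n, reps = 10, 200_000
x = rng.random((reps, n))
ss = ((x - x.mean(axis=1, keepdims=True))**2).sum(axis=1)
w = x.max(axis=1) - x.min(axis=1)
print(round(np.cov(ss, w)[0, 1], 4),
      (n - 1) / (3 * (n + 1) * (n + 3)))   # both ~0.0210
```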

Step 4. Now, finally, we have to evaluate $\mathrm{Var}\left(\sum_{i=1}^n (X_i - \bar{X})^2\right)$. First note the identity
$$\left(\sum_{i=1}^n (X_i - \bar{X})^2\right)^2 = \left(\sum_{i=1}^n X_i^2\right)^2 - 2n\bar{X}^2 \sum_{i=1}^n X_i^2 + n^2 \bar{X}^4. \tag{4}$$

Therefore,
$$E\left(\sum_{i=1}^n (X_i - \bar{X})^2\right)^2 = \mathrm{Var}\left(\sum_{i=1}^n X_i^2\right) + \left(E\sum_{i=1}^n X_i^2\right)^2 - 2n^2 E(X_1^2 \bar{X}^2) + n^2 E(\bar{X}^4). \tag{5}$$

Step 5. Of these, obviously, $\mathrm{Var}\left(\sum_{i=1}^n X_i^2\right) = 4n/45$ and $\left(E\sum_{i=1}^n X_i^2\right)^2 = n^2/9$. And
$$E(X_1^2 \bar{X}^2) = \frac{1}{n^2}\,E\left(X_1^4 + 2X_1^3 \sum_{i \ne 1} X_i + X_1^2 \Big(\sum_{i \ne 1} X_i\Big)^2\right) = \frac{1}{n^2}\left[\frac{1}{5} + \frac{n-1}{4} + \frac{1}{3}\left(\frac{n-1}{12} + \frac{(n-1)^2}{4}\right)\right] = \frac{15n^2 + 20n + 1}{180n^2}. \tag{6}$$

Step 6. Finally,
$$E(\bar{X}^4) = \frac{1}{n^4}\,E\left(\sum_{i \ne j \ne k \ne l} X_i X_j X_k X_l + \sum_{i \ne j \ne k} X_i^2 X_j X_k + \sum_{i \ne j} X_i^2 X_j^2 + \sum_{i \ne j} X_i^3 X_j + \sum_i X_i^4\right). \tag{7}$$

There are $\frac{n(n-1)(n-2)(n-3)}{4!}\cdot\frac{4!}{1!1!1!1!}$ terms of the first kind, $\frac{n(n-1)(n-2)}{3!}\cdot 3\cdot\frac{4!}{2!1!1!}$ terms of the second kind, $\frac{n(n-1)}{2!}\cdot\frac{4!}{2!2!}$ terms of the third kind, $\frac{n(n-1)}{2!}\cdot 2\cdot\frac{4!}{3!1!}$ terms of the fourth kind, and $n$ terms of the fifth kind. Therefore, from (7), on some tedious algebra, it follows that
$$E(\bar{X}^4) = \frac{1}{n^3}\cdot\frac{15n^3 + 30n^2 + 5n - 2}{240}. \tag{8}$$

Step 7. Combining the previous steps,
$$\mathrm{Var}\left(\sum_{i=1}^n (X_i - \bar{X})^2\right) = \frac{2n^2 + n - 3}{360n}, \qquad \mathrm{Var}(W) = \frac{2(n-1)}{(n+1)^2(n+2)},$$
and
$$\mathrm{Cov}\left(\sum_{i=1}^n (X_i - \bar{X})^2,\ W\right) = \frac{n-1}{3(n+1)(n+3)},$$

from which the formula for $\rho_n$ follows. $\square$

Example 1. Using the exact formula of Theorem 1, the correlations in Table 1 are obtained for selected values of $n$. We see that in the uniform case the correlation between $s^2$ and $W$ stays quite high until about $n = 30$, and that it appears to be largest at $n = 3$.

Theorem 2. $\rho_n$ admits the asymptotic expansion
$$\rho_n = \frac{\sqrt{10}}{\sqrt{n}} - \frac{11\sqrt{5}}{2\sqrt{2}\,n^{3/2}} + \frac{251\sqrt{5}}{16\sqrt{2}\,n^{5/2}} + O(n^{-7/2}). \tag{9}$$
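For readers who wish to reproduce these numbers, here is a small Python sketch (ours, not from the paper) evaluating the exact formula (1) against the third-order expansion (9); at $n = 15$ it returns the values .691 and .695 quoted below:

```python
import math

# Exact formula (1) versus the third-order expansion (9), uniform case.
def rho_exact(n):
    return 2 * math.sqrt(5 * n * (n + 2)) / ((n + 3) * math.sqrt(2 * n + 3))

def rho_third_order(n):
    return (math.sqrt(10.0 / n)
            - 11 * math.sqrt(5) / (2 * math.sqrt(2) * n**1.5)
            + 251 * math.sqrt(5) / (16 * math.sqrt(2) * n**2.5))

for n in (5, 15, 50):
    print(n, round(rho_exact(n), 3), round(rho_third_order(n), 3))
# n = 15 gives .691 (exact) and .695 (expansion); by n = 50 both are .424.
```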

This asymptotic expansion follows on a very careful combination of terms from the exact formula in (1); we omit the derivation. Note that we are able to give a third-order expansion in the uniform case because formula (1) is amenable to higher-order expansions. The accuracy of the third-order expansion is excellent; for example, for $n = 15$, $\rho_n = .691$, and the third-order asymptotic expansion gives the approximation .695.

3. Normal distribution

The formula for $\rho_n$ in the normal case follows from a familiar application of Basu's theorem (Basu, 1955).

Theorem 3. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$.


Let
$$a_n = \frac{\Gamma((n+2)/2)}{\Gamma((n-1)/2)}, \qquad b_n = \frac{\Gamma((n+3)/2)}{\Gamma((n-1)/2)}, \qquad \alpha_n = E(W), \qquad \beta_n = E(W^2),$$
and
$$c_n = E(s) = \frac{2^{n-3/2}\,(\Gamma(n/2))^2}{\sqrt{\pi(n-1)}\,(n-2)!}. \tag{10}$$

Then
$$\rho_n = \frac{\alpha_n\left(\dfrac{2\sqrt{2}\,a_n}{(n-1)^{3/2} c_n} - 1\right)}{\sqrt{\left(\dfrac{4b_n}{(n-1)^2} - 1\right)\left(\beta_n - \alpha_n^2\right)}}. \tag{11}$$

Proof. Without loss of generality, we may take $\mu = 0$, $\sigma = 1$, so that $E(s^2) = 1$.

Step 1. By Basu's theorem (Basu, 1955), $W/s$ and $s$ are independent, and hence
$$\mathrm{Cov}(W, s^2) = E(Ws^2) - E(W)E(s^2) = \frac{E(W)}{E(s)}\,E(s^3) - E(W)E(s^2) = \alpha_n\left(\frac{2\sqrt{2}\,a_n}{(n-1)^{3/2} c_n} - 1\right),$$
by a direct calculation of $E(s^3)$ using the chi-square$(n-1)$ density.

Step 2. Next, $\mathrm{Var}(s^2) = E(s^4) - 1 = 4b_n/(n-1)^2 - 1$, again by a direct calculation of $E(s^4)$ from the chi-square$(n-1)$ density. Also, from the definitions of $\alpha_n$ and $\beta_n$, $\mathrm{Var}(W) = \beta_n - \alpha_n^2$.

Step 3. Using the expressions for $\mathrm{Cov}(W, s^2)$, $\mathrm{Var}(W)$, and $\mathrm{Var}(s^2)$, the formula for $\rho_n$ follows on algebra. $\square$

Example 2. Although a general closed-form formula for $\alpha_n$ and $\beta_n$ in terms of elementary functions is impossible, they can be computed for any given $n$ (closed-form formulas for small $n$ are well known; see David (1970) or Arnold et al. (1992)). Table 2 gives the numerical value of the correlation $\rho_n$ for some selected values of $n$. The table reveals that the correlations are substantially higher than in the uniform case (see Table 1); even at $n = 100$ the correlation is about .6. Also, again we see that the maximum correlation occurs at a small $n$, namely $n = 4$.

Table 1. $\rho_n = \mathrm{Corr}(s^2, W)$, uniform case

n       2     3     4     5     10    15    20    30    50    100
rho_n   .956  .962  .944  .917  .786  .691  .622  .529  .424  .308


Table 2. $\rho_n = \mathrm{Corr}(s^2, W)$, normal case

n       2     3     4     5     10    15    20    30    50    100
rho_n   .952  .935  .955  .951  .891  .844  .809  .772  .687  .602
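Table 2 is also easy to reproduce approximately by direct simulation. The sketch below (ours, not the authors' computation) estimates $\mathrm{Corr}(s^2, W)$ for normal samples by Monte Carlo; up to simulation error of a few units in the third decimal, it matches the tabled values:

```python
import numpy as np

# Monte Carlo reproduction (ours) of Table 2: Corr(s^2, W) for N(0,1) samples.
rng = np.random.default_rng(1)

def corr_s2_w(n, reps=50_000):
    x = rng.standard_normal((reps, n))
    s2 = x.var(axis=1, ddof=1)              # sample variance
    w = x.max(axis=1) - x.min(axis=1)       # sample range
    return np.corrcoef(s2, w)[0, 1]

for n in (4, 10, 30, 100):
    print(n, round(corr_s2_w(n), 3))   # approximately .955, .891, .772, .602
```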

The next result gives the asymptotic order of $\rho_n$. Due to the great technical complexity in finding higher-order approximations to the variance of $W$ for normal samples, this result is weaker than the asymptotic expansion in Theorem 2 for the uniform case.

Theorem 4.
$$\rho_n \sim \frac{5\sqrt{3}}{2\sqrt{2}\,\pi}\,\frac{\log n}{\sqrt{n}}. \tag{12}$$

Remark. It is rather interesting that, asymptotically, the correlation between $s$ and $W$ is of the same order, with the exception that the constant is $2\sqrt{6}/\pi$, which is about 1.559, while the constant in Theorem 4 above is $5\sqrt{3}/(2\sqrt{2}\,\pi)$, which is about .975. Thus, $s$ and $W$ are slightly more correlated than $s^2$ and $W$. This is in fact manifest for fixed $n$ as well. For example, at $n = 10, 20, 30, 100$, the correlation between $s$ and $W$ can be computed to be .901, .812, .773 and .606, compared to .891, .809, .772 and .602 for $s^2$ and $W$ (Table 2).

Proof. The proof of Theorem 4 requires asymptotic theory for $W$ in the normal case, as well as careful algebra with Stirling's approximation for the various terms in $\rho_n$.

Step 1. By use of Stirling's formula,
$$a_n \sim \left(\frac{n+2}{2}\right)^{3/2}\left(1 - \frac{3}{n+2}\right)^2\left(1 + \frac{9}{4(n+2)}\right),$$
after several lines of algebra.

Step 2. Again by Stirling's formula, $c_n = 1 - \frac{1}{2n} + o(1/n)$, after algebra.

Step 3. From these, on a little more algebra, one gets
$$\mathrm{Cov}(s^2, W) = \frac{10\sqrt{2\log n}}{4n} + o\!\left(\frac{\sqrt{\log n}}{n}\right).$$

Step 4. By standard uniform integrability arguments, $\mathrm{Var}(s^2) = 2/n + o(1/n)$.

Step 5. For i.i.d. $N(0,1)$ samples, $\sqrt{2\log n}\,(W - \gamma_n) \Rightarrow H$, for $\gamma_n = 2\Phi^{-1}(1 - 1/n)$, with $H$ denoting a distribution with density $2\exp(-x)K_0(2\exp(-x/2))$, $K_0$ being the appropriate Bessel function (see, e.g., Serfling (1980)). The variance of the distribution $H$, by a direct integration, works out to $\pi^2/3$. From uniform integrability arguments, which hold in the normal case, it follows that $\mathrm{Var}(W) \sim \pi^2/(6\log n)$.

Step 6. The first-order approximation to $\rho_n$ now follows by combining Steps 3–5. $\square$

Remark. Comparing the leading term for $\rho_n$ in the uniform case (Theorem 2) to Theorem 4, we see that the correlation dies out at a slower rate in the normal case. This is interesting, and the asymptotic observation is clearly visible on comparing Tables 1 and 2 as well.


4. Exponential distribution

An exact formula for the correlation $\rho_n$ in the exponential case will be derived by using a representation for the order statistics of an exponential sample in terms of i.i.d. exponential variables.

Theorem 5. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\mathrm{Exp}(\lambda)$ variables. Let
$$a_n = \sum_{i=1}^{n-1} \frac{1}{i}, \qquad b_n = \sum_{i=1}^{n-1} \frac{1}{i^2}. \tag{13}$$

Then
$$\rho_n = \frac{4(nb_n - a_n) + 2\left((n+1)a_n - nb_n + n\sum_{i=1}^{n-1} a_i/i - \sum_{i=1}^{n-1} a_i - n + 1\right)}{\sqrt{(8n^3 - 14n^2 + 6n)\,b_n}}. \tag{14}$$
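Formula (14) is straightforward to evaluate numerically. The following sketch (ours, not from the paper) transcribes it directly; at $n = 2$ it returns .894, which can also be verified by hand from $\mathrm{Cov}(Z^2, Z) = 4$ and $\mathrm{Var}(Z^2) = 20$ for $Z \sim \mathrm{Exp}(1)$:

```python
import math

def rho_exp(n):
    # Theorem 5, formula (14): a_i = sum_{j=1}^{i-1} 1/j, b_n = sum_{j=1}^{n-1} 1/j^2.
    a = [0.0] * (n + 1)
    for i in range(2, n + 1):
        a[i] = a[i - 1] + 1.0 / (i - 1)
    b_n = sum(1.0 / j**2 for j in range(1, n))
    num = (4 * (n * b_n - a[n])
           + 2 * ((n + 1) * a[n] - n * b_n
                  + n * sum(a[i] / i for i in range(1, n))
                  - sum(a[i] for i in range(1, n)) - n + 1))
    return num / math.sqrt((8 * n**3 - 14 * n**2 + 6 * n) * b_n)

print(round(rho_exp(2), 3))   # .894, the n = 2 entry of Table 3
```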

Proof. We may assume without loss of generality that $\lambda = 1$. The derivation uses the following well-known representation for the order statistics of an $\mathrm{Exp}(1)$ sample: if $Z_1, Z_2, \ldots, Z_n$ are i.i.d. $\mathrm{Exp}(1)$ variables, then the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ of an i.i.d. $\mathrm{Exp}(1)$ sample admit the representation $X_{(i)} = \sum_{k=1}^{i} Z_k/(n-k+1)$.

Step 1. First note the obvious fact that $\rho_n$ also equals the correlation between $\sum_{i<j}(X_{(i)} - X_{(j)})^2$ and $W$.

Step 2. For $i < j$,
$$(X_{(i)} - X_{(j)})^2 = \left(\sum_{k=i+1}^{j} \frac{Z_k}{n-k+1}\right)^2 = \sum_{k=i+1}^{j} \frac{Z_k^2}{(n-k+1)^2} + 2\sum_{k=i+1}^{j-1}\sum_{l=k+1}^{j} \frac{Z_k Z_l}{(n-k+1)(n-l+1)}.$$

Therefore,
$$\sum_{i<j} (X_{(i)} - X_{(j)})^2 = \sum_{k=2}^{n} \frac{(k-1)Z_k^2}{n-k+1} + 2\sum_{l=3}^{n}\sum_{k=2}^{l-1} \frac{(k-1)Z_k Z_l}{n-k+1}, \tag{15}$$
by rearranging the order of summation in the iterated sums.

Step 3. Likewise, obviously, $W = \sum_{j=2}^{n} Z_j/(n-j+1)$.
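The representation and the rearrangement in (15) can be verified numerically. The sketch below (ours) draws one set of $Z_k$, builds the order statistics, and checks both (15) and the identity $\sum_{i<j}(X_{(i)} - X_{(j)})^2 = n\sum_i (X_i - \bar{X})^2$ underlying Step 1:

```python
import numpy as np

# One-draw numerical check (ours) of the Renyi representation and of (15).
rng = np.random.default_rng(2)
n = 6
z = rng.exponential(size=n)
x = np.cumsum(z / (n - np.arange(n)))      # X_(1) <= ... <= X_(n)
lhs = sum((xi - xj) ** 2 for i, xi in enumerate(x) for xj in x[:i])
k = np.arange(2, n + 1)
quad = np.sum((k - 1) * z[1:] ** 2 / (n - k + 1))
cross = 2 * sum((kk - 1) * z[kk - 1] * z[ll - 1] / (n - kk + 1)
                for ll in range(3, n + 1) for kk in range(2, ll))
print(np.isclose(lhs, quad + cross),        # (15)
      np.isclose(lhs, n * n * x.var()))     # equals n * sum (X_i - Xbar)^2
```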


Step 4. By an easy calculation, using $\mathrm{Cov}(Z_k^2, Z_k) = 4$,
$$\mathrm{Cov}\left(\sum_{k=2}^{n} \frac{(k-1)Z_k^2}{n-k+1},\ \sum_{j=2}^{n} \frac{Z_j}{n-j+1}\right) = 4\sum_{i=1}^{n-1} \frac{n-i}{i^2} = 4(nb_n - a_n).$$

Step 5. On the other hand, since $\mathrm{Cov}(Z_k Z_l, Z_k) = \mathrm{Cov}(Z_k Z_l, Z_l) = 1$ for $k \ne l$,
$$\mathrm{Cov}\left(\sum_{l=3}^{n}\sum_{k=2}^{l-1} \frac{(k-1)Z_k Z_l}{n-k+1},\ \sum_{j=2}^{n} \frac{Z_j}{n-j+1}\right) = \sum_{l=3}^{n}\sum_{k=2}^{l-1} \frac{k-1}{(n-k+1)^2}\,\mathrm{Cov}(Z_k Z_l, Z_k) + \sum_{l=3}^{n}\sum_{k=2}^{l-1} \frac{k-1}{(n-k+1)(n-l+1)}\,\mathrm{Cov}(Z_k Z_l, Z_l)$$
$$= \sum_{k=2}^{n-1} \frac{(k-1)(n-k)}{(n-k+1)^2} + \sum_{k=2}^{n-1} \frac{k-1}{n-k+1} \sum_{j=1}^{n-k} \frac{1}{j}. \tag{16}$$

From (16), by a change of variable,
$$\mathrm{Cov}\left(\sum_{l=3}^{n}\sum_{k=2}^{l-1} \frac{(k-1)Z_k Z_l}{n-k+1},\ \sum_{j=2}^{n} \frac{Z_j}{n-j+1}\right) = \sum_{i=2}^{n-1} \frac{(n-i)(i-1)}{i^2} + \sum_{i=2}^{n-1} \frac{n-i}{i} \sum_{j=1}^{i-1} \frac{1}{j} = (n+1)a_n - nb_n + n\sum_{i=1}^{n-1} \frac{a_i}{i} - \sum_{i=1}^{n-1} a_i - n + 1. \tag{17}$$

Step 6. Having the covariance term done, we now have to find the variances of $\sum_{i<j}(X_{(i)} - X_{(j)})^2$ and $W$. From the representation in Step 3, $\mathrm{Var}(W) = \sum_{j=2}^{n} (n-j+1)^{-2} = b_n$. For the other variance, direct moment calculations for the $\mathrm{Exp}(1)$ distribution give
$$E\left(\sum_{i=1}^{n} X_i^2\right)^2 = 4n^2 + 20n, \qquad E(X_1^2 \bar{X}^2) = \frac{2(n^2 + 5n + 6)}{n^2},$$

and
$$E(\bar{X}^4) = \frac{n^3 + 6n^2 + 11n + 6}{n^3}. \tag{18}$$

Combining all these expressions into the binomial expansion, as in Theorem 1, one gets
$$\mathrm{Var}\left(\sum_{i<j} (X_{(i)} - X_{(j)})^2\right) = 8n^3 - 14n^2 + 6n. \tag{19}$$
Step 7. The formula for $\rho_n$ now follows by substituting the covariance and the variance expressions from Steps 1–6. $\square$

Example 3. Exact values of $\rho_n$ are given in Table 3 for some selected values of $n$.

Table 3. $\rho_n = \mathrm{Corr}(s^2, W)$, exponential case

n       2     3     4     5     10    15    20    30    50    100
rho_n   .894  .861  .844  .831  .786  .753  .727  .688  .634  .558

Remark. From Table 3, we notice an interesting departure from the uniform and the normal cases. For small $n$, the correlations are smaller in the exponential case, and the maximum correlation appears to be attained at $n = 2$ itself, while for larger $n$ the correlation lies between the correlations for the uniform and the normal cases.

The final result gives an asymptotic expansion for $\rho_n$ with an error of the order of $1/\sqrt{n}$. Although it is a two-term expansion, because the error is of the order of $1/\sqrt{n}$, the accuracy is poor unless $n$ is very large. For the sake of completeness, however, it is nice to have the expansion.

Theorem 6.
$$\rho_n = \frac{\sqrt{3}}{2\pi}\left(\frac{(\log n)^2}{\sqrt{n}} + 2(1+C)\,\frac{\log n}{\sqrt{n}}\right) + O\!\left(\frac{1}{\sqrt{n}}\right), \tag{20}$$
where $C$ is the Euler constant.
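The expansion can be compared against the exact formula (14). The sketch below (ours; note that the leading constant reflects our reading of (20), so the output is illustrative rather than authoritative) does this for a few large $n$; the two columns approach each other only slowly, consistent with the remark above about poor accuracy at moderate $n$:

```python
import math

EULER_C = 0.5772156649015329

def rho_exp(n):
    # exact formula (14), as in the sketch after Theorem 5
    a = [0.0] * (n + 1)
    for i in range(2, n + 1):
        a[i] = a[i - 1] + 1.0 / (i - 1)
    b_n = sum(1.0 / j**2 for j in range(1, n))
    num = (4 * (n * b_n - a[n])
           + 2 * ((n + 1) * a[n] - n * b_n
                  + n * sum(a[i] / i for i in range(1, n))
                  - sum(a[i] for i in range(1, n)) - n + 1))
    return num / math.sqrt((8 * n**3 - 14 * n**2 + 6 * n) * b_n)

def rho_expansion(n):
    # the two-term expansion (20), under our reading of its constant
    return (math.sqrt(3) / (2 * math.pi)
            * (math.log(n)**2 + 2 * (1 + EULER_C) * math.log(n))
            / math.sqrt(n))

for n in (10**2, 10**4, 10**6):
    print(n, round(rho_exp(n), 3), round(rho_expansion(n), 3))
```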

Proof. The proof uses the following asymptotic expansions, derived from the Euler summation formula:
$$\sum_{i=n}^{\infty} \frac{1}{i^2} = \frac{1}{n} + \frac{1}{2n^2} + \frac{1}{6n^3} + O(n^{-4}), \qquad \sum_{i=1}^{n} \frac{1}{i} = C + \log n + \frac{1}{2n} - \frac{1}{12n^2} + O(n^{-4}). \tag{21}$$


Step 1. By using (21), $nb_n - a_n = O(n)$, on some algebra.

Step 2. Also by using (21), $(n+1)a_n - nb_n - n + 1 = n\log n + O(n)$, on algebra.

Step 3. Similarly, $\sum_{i=1}^{n-1} a_i = n\log n + O(n)$.

Step 4. The final term, $\sum_{i=1}^{n-1} a_i/i$, is the hardest one. We analyze it by using the summation-by-parts formula and (21):
$$\sum_{i=1}^{n-1} \frac{a_i}{i} = -\sum_{i=1}^{n-1} \frac{a_{i+1}}{i} + a_n^2 = \left(-C\log n - \frac{(\log n)^2}{2} + O(1)\right) + \left((\log n)^2 + 2C\log n + O(1)\right) = \frac{(\log n)^2}{2} + C\log n + O(1).$$

These four steps take care of the covariance term in $\rho_n$.

Step 5. As regards the variance terms in $\rho_n$, $\sqrt{8n^3 - 14n^2 + 6n} = 2\sqrt{2}\,n^{3/2}(1 + O(1/n))$, and $b_n = (\pi^2/6)(1 + O(1/n))$.

Step 6. Substituting these approximations into the exact formula (14) for $\rho_n$, the asymptotic expansion in (20) follows after some algebra. $\square$

5. Interquartile range and standard deviation

The interquartile range is another well-accepted measure of spread. As such, the correlation between it and the sample standard deviation is also of intrinsic interest. Unlike the correlation between $s$ and $W$, the correlation between $s$ and the interquartile range does not converge to zero as $n \to \infty$. In this section, we first derive the joint asymptotic distribution of $s$ and the interquartile range for any population with a finite fourth moment, and then use it to obtain the limiting correlation between them for a number of important distributions. We see some interesting effects of the thickness of the tail of the population on this correlation.

5.1. Joint asymptotics of interquartile range and standard deviation

Theorem 7. Let $X_1, X_2, \ldots, X_n$ be i.i.d. observations from a CDF $F$ on the real line with a finite fourth moment and a positive density function $f(x)$. Let $0 < p_1 < p_2 < 1$, and let $Q = Q_{p_1,p_2} = X_{([np_2])} - X_{([np_1])}$. For any $0 < p < 1$, let $\xi_p = F^{-1}(p)$. Then
$$\sqrt{n}\,\big[(Q, s) - (\xi_{p_2} - \xi_{p_1},\ \sigma)\big] \Rightarrow N(0,\ A\Sigma A'),$$
with $A = ((a_{ij}))$ and $\Sigma = ((\sigma_{ij}))$ defined as follows:
$$a_{11} = a_{12} = 0, \qquad a_{13} = 1, \qquad a_{21} = -\frac{\mu}{\sigma}, \qquad a_{22} = \frac{1}{2\sigma}, \qquad a_{23} = 0, \tag{22}$$

$$\sigma_{11} = \sigma^2, \qquad \sigma_{12} = E(X^3) - \mu E(X^2), \qquad \sigma_{22} = E(X^4) - (E(X^2))^2,$$
$$\sigma_{13} = \frac{\int_{-\infty}^{\xi_{p_1}} x f(x)\,dx}{f(\xi_{p_1})} - \frac{\int_{-\infty}^{\xi_{p_2}} x f(x)\,dx}{f(\xi_{p_2})} - \mu\left(\frac{p_1}{f(\xi_{p_1})} - \frac{p_2}{f(\xi_{p_2})}\right),$$
$$\sigma_{23} = \frac{\int_{-\infty}^{\xi_{p_1}} x^2 f(x)\,dx}{f(\xi_{p_1})} - \frac{\int_{-\infty}^{\xi_{p_2}} x^2 f(x)\,dx}{f(\xi_{p_2})} - E(X^2)\left(\frac{p_1}{f(\xi_{p_1})} - \frac{p_2}{f(\xi_{p_2})}\right),$$
$$\sigma_{33} = \frac{p_1(1-p_1)}{f^2(\xi_{p_1})} + \frac{p_2(1-p_2)}{f^2(\xi_{p_2})} - \frac{2p_1(1-p_2)}{f(\xi_{p_1})\,f(\xi_{p_2})}. \tag{23}$$

Proof. The main tool needed in deriving the joint asymptotic distribution of $(Q, s)$ is the Bahadur representation
$$X_{([np_2])} - X_{([np_1])} = \bar{Z} + o_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \text{where} \qquad Z_i = \xi_{p_2} + \frac{p_2 - I_{\{X_i \le \xi_{p_2}\}}}{f(\xi_{p_2})} - \xi_{p_1} - \frac{p_1 - I_{\{X_i \le \xi_{p_1}\}}}{f(\xi_{p_1})};$$
see, e.g., Serfling (1980) or van der Vaart (1998).

Step 1. By the multivariate central limit theorem, with the obvious centering and normalization by $\sqrt{n}$, $(\bar{X},\ \overline{X^2},\ \bar{Z}) \Rightarrow N(0, \Sigma)$.

Step 2. Consider the transformation $h(u, v, z) = (z, \sqrt{v - u^2})$. The gradient matrix of $h$ has first row $(0, 0, 1)$ and second row
$$\left(-\frac{u}{\sqrt{v - u^2}},\ \frac{1}{2\sqrt{v - u^2}},\ 0\right).$$

Step 3. The stated joint asymptotic distribution of $(Q, s)$ now follows from an application of the delta theorem and the Bahadur representation stated above. We omit the intermediate calculation. $\square$

An important consequence of Theorem 7 is the following result.

Corollary 1. Under the hypotheses of Theorem 7,
$$\sqrt{n}\left(\frac{Q}{s} - \frac{\xi_{p_2} - \xi_{p_1}}{\sigma}\right) \Rightarrow N(0,\ c' A\Sigma A' c),$$
where $c = \left(\dfrac{1}{\sigma},\ -\dfrac{\xi_{p_2} - \xi_{p_1}}{\sigma^2}\right)'$. The corollary follows on another application of the delta theorem to the result in Theorem 7.


Table 4. $\rho_n^* = \mathrm{Corr}(\mathrm{IQR}, s)$ for selected $n$

n                    10     15     20     30     40     50
Normal               .583   .598   .594   .597   .600   .599
Uniform              .699   .774   .787   .792   .803   .813
Exponential          .412   .433   .436   .398   .421   .406
Double Exponential   .470   .476   .441   .447   .423   .424
t(5)                 .464   .489   .453   .442   .440   .423

Corollary 2. Let $\rho_n^* = \mathrm{Corr}(\mathrm{IQR}, s)$. Then
(a) $\lim_{n\to\infty} \rho_n^* = .6062$ if $F$ = Normal,
(b) $\lim_{n\to\infty} \rho_n^* = .8373$ if $F$ = Uniform,
(c) $\lim_{n\to\infty} \rho_n^* = .3982$ if $F$ = Exponential,
(d) $\lim_{n\to\infty} \rho_n^* = .4174$ if $F$ = Double Exponential,
(e) $\lim_{n\to\infty} \rho_n^* = .3191$ if $F = t(5)$.
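The limits in Corollary 2 reduce, via Theorem 7, to elementary one-dimensional integrals. As an illustration, the following sketch (ours, not the authors' code) carries out the computation for the standard normal, using $\int_{-\infty}^{q} x\varphi(x)\,dx = -\varphi(q)$ and $\int_{-\infty}^{q} x^2\varphi(x)\,dx = \Phi(q) - q\varphi(q)$; it returns the value .6062 of part (a):

```python
import math
from statistics import NormalDist

# Corollary 2(a) for F = N(0,1), p1 = 1/4, p2 = 3/4.  Since mu = 0,
# the a21 entry of A drops out and only sigma_22, sigma_23, sigma_33 matter.
nd = NormalDist()
p1, p2 = 0.25, 0.75
q1, q2 = nd.inv_cdf(p1), nd.inv_cdf(p2)
f1, f2 = nd.pdf(q1), nd.pdf(q2)
ex2, ex4 = 1.0, 3.0                          # E(X^2), E(X^4)
i2_1 = nd.cdf(q1) - q1 * f1                  # int_{-inf}^{q1} x^2 phi(x) dx
i2_2 = nd.cdf(q2) - q2 * f2
s23 = (i2_1 - ex2 * p1) / f1 - (i2_2 - ex2 * p2) / f2
s33 = p1*(1-p1)/f1**2 + p2*(1-p2)/f2**2 - 2*p1*(1-p2)/(f1*f2)
var_q = s33                       # asymptotic variance of sqrt(n) * Q
var_s = (ex4 - ex2**2) / 4        # a22^2 * sigma_22, with sigma = 1
cov_qs = s23 / 2                  # a22 * sigma_23
print(round(cov_qs / math.sqrt(var_q * var_s), 4))   # 0.6062
# The variance 1.566 of Corollary 3(a) below equals
# var_q - 2*1.349*cov_qs + 1.349**2 * var_s.
```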

Corollary 2 follows on doing the requisite calculation from the matrix $A\Sigma A'$ in Theorem 7. Thus, note that the correlation between IQR and $s$ does not die out as $n \to \infty$, unlike the correlation between $W$ and $s$. It is also interesting to see the much higher correlation between IQR and $s$ in the uniform case, and that successively heavier tails lead to a progressively smaller correlation. Table 4 provides the values of the correlation for some selected finite values of $n$.

Example 4. The correlation is remarkably stable, except in the uniform case, and even there it is stable for $n \ge 15$ or so, that is, unless $n$ is quite small. We find this stability of the correlation across a very large range of values of $n$ interesting, and also surprising.

5.2. Thumb rule based on IQR and s

Simple thumb rules for quick diagnostics can be formed by using the result of Corollary 1. We present them only for normal and exponential data here, but they can be formed for any distribution with four moments by using Corollary 1.

Corollary 3. (a) $\sqrt{n}\left(\dfrac{\mathrm{IQR}}{s} - 1.349\right) \Rightarrow N(0,\ 1.566)$ if $F$ = Normal;
(b) $\sqrt{n}\left(\dfrac{\mathrm{IQR}}{s} - 1.099\right) \Rightarrow N(0,\ 3.060)$ if $F$ = Exponential.


Corollary 3 is a direct consequence of Corollary 1, obtained by taking $p_1 = \frac{1}{4}$, $p_2 = \frac{3}{4}$ and computing the mean and the variance of the limiting normal distribution of $\mathrm{IQR}/s$ from the expressions in Corollary 1. Using 1.5 standard deviations around the mean value as the plausible range for $\mathrm{IQR}/s$ (the choice of 1.5 is admittedly somewhat subjective), we have the following thumb rules for normal and exponential data.

Thumb rule: For i.i.d. normal data, $\mathrm{IQR}/s$ should be in the intervals [.85, 1.85], [.975, 1.725], [1.05, 1.65], [1.08, 1.6] for $n = 15, 25, 40, 50$, respectively. For i.i.d. exponential data, $\mathrm{IQR}/s$ should be in the intervals [.4, 1.75], [.6, 1.6], [.68, 1.5], [.75, 1.45] for $n = 15, 25, 40, 50$, respectively. The overlap between the two sets of intervals decreases as $n$ increases. The thumb rule for the normal case may be of some practical use.
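The intervals above are just mean $\pm\, 1.5\sqrt{\text{variance}/n}$ with the constants of Corollary 3; the short sketch below (ours) regenerates them up to rounding:

```python
import math

# Thumb-rule intervals: mean +/- 1.5*sqrt(var/n), with (mean, var) equal
# to (1.349, 1.566) for normal and (1.099, 3.060) for exponential data.
def interval(mean, var, n):
    d = 1.5 * math.sqrt(var / n)
    return round(mean - d, 2), round(mean + d, 2)

for n in (15, 25, 40, 50):
    print(n, interval(1.349, 1.566, n), interval(1.099, 3.060, n))
# n = 25 gives (0.97, 1.72) and (0.57, 1.62), matching [.975, 1.725]
# and [.6, 1.6] up to rounding.
```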

6. The Cauchy case

The Cauchy case is always an interesting one to address because of its lack of moments; the correlation between $s$ and either $W$ or the interquartile range is not even defined. Still, one would certainly expect some positive dependence. We explore this in this final section by use of graphics (Figs. 1–6) based on simulations for three different sample sizes: $n = 10$, 25 and 100. The $s$ versus IQR scatterplots are fundamentally different from the $s$ versus $W$ ones: they show a massive concentration of points close to the vertical axis, together with a small fraction of stray points. We believe this is the connection, previously seen in Section 5, between the strength of the association of $s$ and IQR and the thickness of the tail of the population, now appearing in an almost extreme form in the Cauchy case. The graphics show two interesting phenomena in the $W$ versus $s$ scatterplots: there is always an outlier, and there is also an interesting clustering, which gets somewhat blurred as $n$ increases. However, an obvious positive dependence is seen in each scatterplot. It would be interesting to quantify this mathematically by using some measure of dependence other than correlation.
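For readers who want to regenerate pictures of this kind, here is a minimal simulation sketch (ours; the number of replications and the plot layout are assumptions, not the authors' exact setup):

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatterplots of s against W and against IQR for Cauchy samples,
# n = 10, 25, 100 (500 simulated samples per panel).
rng = np.random.default_rng(3)
fig, axes = plt.subplots(3, 2, figsize=(8, 10))
for row, n in zip(axes, (10, 25, 100)):
    x = rng.standard_cauchy((500, n))
    s = x.std(axis=1, ddof=1)
    w = x.max(axis=1) - x.min(axis=1)
    iqr = np.quantile(x, 0.75, axis=1) - np.quantile(x, 0.25, axis=1)
    row[0].scatter(w, s, s=5)
    row[0].set(xlabel="W", ylabel="s", title=f"n = {n}")
    row[1].scatter(iqr, s, s=5)
    row[1].set(xlabel="IQR", ylabel="s", title=f"n = {n}")
plt.tight_layout()
plt.show()
```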

[Figs. 1–6: simulated scatterplots of $s$ against $W$ and of $s$ against IQR for Cauchy samples, $n = 10$, 25, and 100; not reproduced here.]

References

Arnold, B., Balakrishnan, N., Nagaraja, H., 1992. A First Course in Order Statistics. Wiley, New York.
Basu, D., 1955. On statistics independent of a complete sufficient statistic. Sankhyā 15, 377–380.
David, H.A., 1970. Order Statistics. Wiley, New York.
David, H.A., Hartley, H.O., Pearson, E.S., 1954. The distribution of the ratio, in a single normal sample, of range to standard deviation. Biometrika 41, 482–493.
Plackett, R.L., 1947. Limits of the ratio of mean range to standard deviation. Biometrika 34, 120–122.
Serfling, R., 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Thomson, G.W., 1955. Bounds for the ratio of range to standard deviation. Biometrika 42, 268–269.
van der Vaart, A.W., 1998. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.