Statistical implications of selectively reported inferential results


Statistics & Probability Letters 56 (2002) 13 – 22

Nicola Loperfido ∗
Università di Urbino, via Saffi 2, 61029, Urbino (PS), Italy
Received July 2000; received in revised form November 2000

Abstract

Researchers sometimes report only the largest estimate, the most significant test statistic and the most shifted confidence interval. The following result quantifies the statistical implications of this behavior when the choice is restricted to two inferential procedures: the minimum and the maximum of two standardized random variables whose joint distribution is normal are each skew-normal. More generally, the minimum and the maximum of two random variables whose distribution is bivariate normal and centered at the origin are each distributed as an equal-weight mixture of scaled skew-normal distributions. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Bivariate normal; Skew-normal; Order statistics; Selection bias

1. Introduction

Less than scrupulous researchers apply several statistical procedures to the same dataset, but report only the most convenient result (Lehmann, 1999). Jiang (1997) focuses on the selection between two statistical tests and gives bounds for the critical value of the selected procedure. This paper shows that this topic is closely related to the skew-normal distribution, defined by the density

$$ f(z; \lambda) = 2\,\phi(z)\,\Phi(\lambda z), \qquad -\infty < z,\ \lambda < +\infty, $$

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal density function and distribution function, respectively (Azzalini, 1985). When a random variable $Z$ has this density we write $Z \sim SN(\lambda)$. The parameter $\lambda$ controls skewness, which is positive when $\lambda > 0$ and negative when $\lambda < 0$. Despite skewness, these distributions resemble the normal ones in several ways: they are unimodal, their support is the real line, and the square of a skew-normal variable follows a chi-square distribution with one degree of freedom.

∗ Via Giotto 8/A, Fermignano, 61033 Urbino, Italy. E-mail address: [email protected] (N. Loperfido).

0167-7152/02/\$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-7152(01)00052-9


The standard normal distribution is a skew-normal distribution with $\lambda = 0$. The class of distributions generated from the skew-normal ones through location-scale transformations therefore includes the class of normal distributions. This generalization of the normal law has the advantage of being mathematically tractable (see Azzalini and Della Valle, 1996) and easily interpretable (Arnold et al., 1993). Applications of the skew-normal distribution include econometrics (Aigner et al., 1977), graph theory (Azzalini and Capitanio, 1999) and time series (Bartolucci et al., 2000).

Section 2 relates the skew-normal distribution to the theory of order statistics. Section 3 applies the results of Section 2 to point estimation, confidence intervals and hypothesis testing. Proofs are reported in Appendix A.
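The skew-normal density introduced above is easy to evaluate numerically. The following is a minimal sketch, not part of the original paper, that uses only the Python standard library; the function names (`sn_pdf`, `integrate`, `sn_summary`) and the integration range are illustrative assumptions. It checks by quadrature that the density integrates to one, that the sign of the mean follows the sign of the parameter, and that the second moment equals one, as expected when the square of the variable is chi-square with one degree of freedom.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # pdf plays the role of phi, cdf the role of Phi


def sn_pdf(z: float, lam: float) -> float:
    """Skew-normal density f(z; lam) = 2 * phi(z) * Phi(lam * z)."""
    return 2.0 * STD_NORMAL.pdf(z) * STD_NORMAL.cdf(lam * z)


def integrate(g, lo: float = -10.0, hi: float = 10.0, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def sn_summary(lam: float) -> tuple:
    """Return (total mass, mean, second moment) of SN(lam) by quadrature."""
    mass = integrate(lambda z: sn_pdf(z, lam))
    mean = integrate(lambda z: z * sn_pdf(z, lam))
    second = integrate(lambda z: z * z * sn_pdf(z, lam))
    return mass, mean, second


if __name__ == "__main__":
    for lam in (-2.0, 0.0, 1.0, 5.0):
        mass, mean, second = sn_summary(lam)
        print(f"lambda={lam:+.1f}  mass={mass:.6f}  "
              f"mean={mean:+.6f}  E[Z^2]={second:.6f}")
```

The truncation to [-10, 10] is an assumption that is harmless here because the standard normal factor makes the tails negligible.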

2. Main results

David (1981) computed the expected value and the variance of the maximum of two standardized random variables whose joint distribution is bivariate normal. Theorem 2.3 shows that its distribution is in fact skew-normal. This result has several implications.

In the first place, it hints at new applications of the skew-normal distribution, which add to those discussed by O’Hagan and Leonard (1976), Andel et al. (1984) and Azzalini and Capitanio (1999). Consider the following example: a sample of size $n$ is drawn from a bivariate normal population with known covariance matrix. Under the null hypothesis both means are greater than a value $\mu_0$. Then a reasonable test statistic is $\sqrt{n}\,\max[(\bar{X} - \mu_0)/\sigma_X,\ (\bar{Y} - \mu_0)/\sigma_Y]$, and its sampling distribution is skew-normal.

In the second place, Theorem 2.3 leads to a new algorithm for generating random numbers from a skew-normal distribution with parameter $\lambda$, in addition to the one proposed by Azzalini and Della Valle (1996). First we generate two numbers from a bivariate normal distribution with correlation $\rho = (1 - \lambda^2)/(1 + \lambda^2)$. Then we standardize them and choose the maximum of the standardized values (if $\lambda$ is positive) or their minimum (if $\lambda$ is negative).

When the variances are not assumed to be equal, the distribution of the maximum is a mixture, with equal weights, of scaled skew-normal distributions (Theorem 2.1). A similar result holds for the minimum (Theorem 2.2).

Theorem 2.1. Let the random variables $X_1, X_2, Y_1, Y_2$ be distributed as follows:

$$ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right), \tag{1} $$

$$ \frac{Y_1}{\sqrt{\sigma_{11}}} \sim SN\left( \frac{\sigma_{11} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \qquad \frac{Y_2}{\sqrt{\sigma_{22}}} \sim SN\left( \frac{\sigma_{22} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{2} $$

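Then the distribution of $\max(X_1, X_2)$ is the mixture, with equal weights, of the distributions of $Y_1$ and $Y_2$.

Theorem 2.2. Let the random variables $X_1, X_2, Z_1, Z_2$ be distributed as follows:

$$ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right), \tag{3} $$

$$ \frac{Z_1}{\sqrt{\sigma_{11}}} \sim SN\left( \frac{\sigma_{12} - \sigma_{11}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \qquad \frac{Z_2}{\sqrt{\sigma_{22}}} \sim SN\left( \frac{\sigma_{12} - \sigma_{22}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{4} $$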

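Then the distribution of $\min(X_1, X_2)$ is the mixture, with equal weights, of the distributions of $Z_1$ and $Z_2$.

Theorem 2.3. Let $X_1, X_2$ be two standardized random variables whose distribution is jointly normal:

$$ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \tag{5} $$

Then the distributions of the random variables $\max(X_1, X_2)$ and $\min(X_1, X_2)$ are skew-normal:

$$ \max(X_1, X_2) \sim SN\left( \sqrt{\frac{1 - \rho}{1 + \rho}} \right), \qquad \min(X_1, X_2) \sim SN\left( -\sqrt{\frac{1 - \rho}{1 + \rho}} \right). \tag{6} $$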
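The random number generation scheme suggested by Theorem 2.3 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the author's code: the names `rsn_minmax` and `sn_mean` are hypothetical, only the Python standard library is used, and the correlated pair is built directly in standardized form, so the standardization step is implicit. The sample mean is compared with the skew-normal mean $\sqrt{2/\pi}\,\lambda/\sqrt{1+\lambda^2}$ given by Azzalini (1985).

```python
import math
import random


def rsn_minmax(lam: float, rng: random.Random) -> float:
    """One SN(lam) draw: max (lam >= 0) or min (lam < 0) of a standardized
    bivariate normal pair with correlation rho = (1 - lam^2) / (1 + lam^2)."""
    rho = (1.0 - lam * lam) / (1.0 + lam * lam)
    # Correlated standard normal pair built from two independent draws.
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    x1 = e1
    x2 = rho * e1 + math.sqrt(max(0.0, 1.0 - rho * rho)) * e2
    return max(x1, x2) if lam >= 0.0 else min(x1, x2)


def sn_mean(lam: float) -> float:
    """Mean of SN(lam): sqrt(2/pi) * lam / sqrt(1 + lam^2)."""
    return math.sqrt(2.0 / math.pi) * lam / math.sqrt(1.0 + lam * lam)


if __name__ == "__main__":
    rng = random.Random(2002)  # arbitrary seed, for reproducibility only
    for lam in (-3.0, 0.5, 2.0):
        draws = [rsn_minmax(lam, rng) for _ in range(100_000)]
        print(f"lambda={lam:+.1f}  "
              f"sample mean={sum(draws) / len(draws):+.4f}  "
              f"theoretical mean={sn_mean(lam):+.4f}")
```

With $\lambda = 0$ the correlation equals one, both coordinates coincide, and the draw reduces to a standard normal variate, in agreement with $SN(0)$.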

3. Statistical applications

This section applies the results of Section 2 to the problems raised by selectively reported inferences. Theorems 3.1, 3.2 and 3.3 deal with point estimation, confidence intervals and hypothesis testing, respectively. They all rely on the normality assumption for the relevant statistics, which is often reasonable when the sample size is large enough. To keep the notation simple, the theorems are presented without any reference to convergence, sequences or limits. They can be restated as large-sample results.

3.1. Point estimation

Sometimes a researcher would like his estimate to be as high (low) as possible. He might therefore compute two available estimates from the data and report only the largest (smallest) one. Whenever the assumptions of Theorem 3.1 are satisfied, this behavior has two implications. The first is rather trivial: the reported estimator is positively (negatively) biased. Theorem 3.1 gives the exact value of the bias. The second is more interesting: the MSE of the reported estimator is exactly the same as the MSE of an estimator chosen at random from the pair. That is, the researcher who chooses the estimate that better suits his aims attains, on average, the same efficiency as the researcher who gives both estimates an equal chance of being chosen.

Theorem 3.1. Let $M$ be the maximum of two unbiased estimators $T_1, T_2$ of the parameter $\theta$ whose joint distribution is bivariate normal:

$$ \begin{pmatrix} T_1 \\ T_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \theta \\ \theta \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right). \tag{7} $$

Then the following hold, where $\rho = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}$:

• $M$ is positively biased for $\theta$: $E(M) = \theta + 0.5\,(\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}})\sqrt{(1 - \rho)/\pi}$.
• The mean square error of $M$ with respect to $\theta$ is the average of the estimators' variances: $MSE_M(\theta) = (\sigma_{11} + \sigma_{22})/2$.
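The two claims of Theorem 3.1 can be probed by simulation. The sketch below is an illustration under stated assumptions, not the author's code: the function names, parameter values and seeds are hypothetical, and only the Python standard library is used. It estimates the bias and the MSE of $M = \max(T_1, T_2)$ by Monte Carlo, compares the MSE with $(\sigma_{11} + \sigma_{22})/2$ for unequal variances, and compares the bias with the closed form in an equal-variance configuration, where the expression of Theorem 3.1 reduces to $\sqrt{\sigma_{11}}\sqrt{(1 - \rho)/\pi}$.

```python
import math
import random


def simulate_max_estimator(s11, s22, rho, theta=0.0, n=200_000, seed=0):
    """Monte Carlo (bias, MSE) of M = max(T1, T2), where (T1, T2) is bivariate
    normal with common mean theta, variances s11, s22 and correlation rho."""
    rng = random.Random(seed)
    sd1, sd2 = math.sqrt(s11), math.sqrt(s22)
    c = math.sqrt(1.0 - rho * rho)
    sum_err = 0.0
    sum_sq = 0.0
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        t1 = theta + sd1 * e1
        t2 = theta + sd2 * (rho * e1 + c * e2)
        err = max(t1, t2) - theta  # error of the selectively reported estimate
        sum_err += err
        sum_sq += err * err
    return sum_err / n, sum_sq / n


def equal_variance_bias(s, rho):
    """Closed-form bias when s11 = s22 = s: sqrt(s) * sqrt((1 - rho) / pi)."""
    return math.sqrt(s) * math.sqrt((1.0 - rho) / math.pi)


if __name__ == "__main__":
    bias, mse = simulate_max_estimator(1.0, 4.0, 0.3, theta=2.0, seed=1)
    print(f"unequal variances: bias={bias:.4f}, "
          f"MSE={mse:.4f}, (s11 + s22)/2={(1.0 + 4.0) / 2:.4f}")
    bias, mse = simulate_max_estimator(1.0, 1.0, 0.5, seed=2)
    print(f"equal variances:   bias={bias:.4f}, "
          f"closed form={equal_variance_bias(1.0, 0.5):.4f}, MSE={mse:.4f}")
```

The MSE comparison is the more informative one: selecting the larger estimate changes the bias but leaves the mean square error equal to that of a coin-flip choice between the two estimators.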


3.2. Confidence intervals

A researcher might look for the confidence interval which is more shifted to the right (left), in order to cover higher (lower) parameter values. He might therefore compute two level $\alpha$ confidence intervals and select the one whose center is higher. A more scrupulous researcher would choose his level $\alpha$ confidence interval regardless of the values it actually includes. Theorem 3.2 shows that the two strategies lead to exactly the same confidence level. Moreover, the expected length of the former interval is not necessarily greater than the latter's. In fact it equals the expected length of an interval chosen at random between the two.

Theorem 3.2. Let $\theta$ be an unknown parameter and let $T_1$ and $T_2$ be two statistics whose joint distribution is

$$ \begin{pmatrix} T_1 \\ T_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \theta \\ \theta \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right). \tag{8} $$

Moreover, let $I = [T - a\sigma,\ T + a\sigma]$ be a confidence interval for $\theta$, where

$$ T = \max(T_1, T_2), \qquad \sigma = \begin{cases} \sqrt{\sigma_{11}}, & T_1 > T_2, \\ \sqrt{\sigma_{22}}, & T_1 \le T_2, \end{cases} \qquad a > 0. \tag{9} $$

Then the following hold:

• The interval $I$ contains $\theta$ with probability $2\Phi(a) - 1$.
• The expected length of $I$ is $a(\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}})$.

3.3. Hypothesis testing

Consider the problem of testing $H_0: \theta = 0$ against $H_1: \theta > 0$, where $\theta$ is the mean (median) of a symmetric distribution. The significance level can be computed from either the t-test or the sign test, both of which enjoy very desirable properties. A researcher might apply both tests to his data, each at level $\alpha$, and then report only the most significant one. Jiang (1997) gives an upper bound for the significance level under this circumstance. Theorem 3.3 both improves and generalizes Jiang's result. It is an improvement since it sharpens Jiang's inequality under Jiang's assumptions. It is a generalization since it also holds for other tests and does not require the test statistics to be positively correlated or the cutoff value to be positive.

Theorem 3.3. Let $\theta$ be an unknown parameter and let $T_1$ and $T_2$ be two statistics whose joint distribution is

$$ \begin{pmatrix} T_1 \\ T_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \theta \\ \theta \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \tag{10} $$

Then the size of the rejection region $R = \{\max(T_1, T_2) > a\}$ for the null hypothesis $H_0: \theta \le 0$ satisfies the following inequalities:

$$ P[R \mid H_0] \le \begin{cases} \Phi(-a) + \pi^{-1}\arctan\sqrt{\dfrac{1 - \rho}{1 + \rho}}, & a \in \mathbb{R}, \\[2ex] \min\left\{ 2\Phi(-a) - \Phi^2(-a),\ \Phi(-a) + \pi^{-1}\arctan\sqrt{\dfrac{1 - \rho}{1 + \rho}} \right\}, & a > 0. \end{cases} \tag{11} $$


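Acknowledgements

The author would like to thank Professor Azzalini for his advice and his encouragement.

Appendix A.

Proof of Theorem 2.1. Let the random variables $U_1, U_2, V_1, V_2$ and the constants $\lambda_1, \lambda_2$ be defined as follows, where $\rho = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}$:

$$ U_1 = \frac{X_1}{\sqrt{\sigma_{11}}}, \qquad V_1 = \frac{X_2 - (\sigma_{12} X_1/\sigma_{11})}{\sqrt{\sigma_{22}(1 - \rho^2)}}, \tag{A.1} $$

$$ U_2 = \frac{X_2}{\sqrt{\sigma_{22}}}, \qquad V_2 = \frac{X_1 - (\sigma_{12} X_2/\sigma_{22})}{\sqrt{\sigma_{11}(1 - \rho^2)}}, \tag{A.2} $$

$$ \lambda_1 = \frac{\sigma_{11} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}}, \qquad \lambda_2 = \frac{\sigma_{22} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}}. \tag{A.3} $$

We shall first prove that $X_1 > X_2 \Leftrightarrow \lambda_1 U_1 > V_1$ (the proof of $X_2 > X_1 \Leftrightarrow \lambda_2 U_2 > V_2$ is similar):

$$ X_1 > X_2 \Leftrightarrow X_1 - (\sigma_{12} X_1/\sigma_{11}) > X_2 - (\sigma_{12} X_1/\sigma_{11}) \tag{A.4} $$

$$ \Leftrightarrow [1 - (\sigma_{12}/\sigma_{11})]\, X_1 > X_2 - (\sigma_{12} X_1/\sigma_{11}) \tag{A.5} $$

$$ \Leftrightarrow \frac{1 - (\sigma_{12}/\sigma_{11})}{\sqrt{\sigma_{22}(1 - \rho^2)}}\, X_1 > \frac{X_2 - (\sigma_{12} X_1/\sigma_{11})}{\sqrt{\sigma_{22}(1 - \rho^2)}} \tag{A.6} $$

$$ \Leftrightarrow \lambda_1 U_1 > V_1. \tag{A.7} $$

By definition $Y = \max(X_1, X_2)$, so that

$$ Y = \begin{cases} X_1, & X_1 > X_2, \\ X_2, & X_1 < X_2 \end{cases} \quad \Leftrightarrow \quad Y = \begin{cases} \sqrt{\sigma_{11}}\, U_1, & \lambda_1 U_1 > V_1, \\ \sqrt{\sigma_{22}}\, U_2, & \lambda_2 U_2 > V_2. \end{cases} \tag{A.8} $$

We can then write

$$ P(Y \le y) = P(\sqrt{\sigma_{11}}\, U_1 \le y \mid \lambda_1 U_1 > V_1)\, P(\lambda_1 U_1 > V_1) + P(\sqrt{\sigma_{22}}\, U_2 \le y \mid \lambda_2 U_2 > V_2)\, P(\lambda_2 U_2 > V_2). \tag{A.9} $$

By assumption the distribution of $(X_1, X_2)$ is normal and centered at the origin, so that

$$ P(\lambda_1 U_1 > V_1) = P(\lambda_2 U_2 > V_2) = 0.5. \tag{A.10} $$

The random variables $(U_1, V_1)$ are independent, standardized and jointly normal (the same holds for $(U_2, V_2)$). We can then apply a result by Azzalini and Della Valle (1996):

$$ U_1 \mid (\lambda_1 U_1 > V_1) \sim SN(\lambda_1), \qquad U_2 \mid (\lambda_2 U_2 > V_2) \sim SN(\lambda_2). \tag{A.11} $$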


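Then the conditional distribution of $Y$ equals the distribution of $Y_1$ when $\lambda_1 U_1 > V_1$ and the distribution of $Y_2$ in the opposite case. The two events occur with equal probability and the proof is complete.

Proof of Theorem 2.2. By assumption, $X_1$ and $X_2$ are jointly normal and centered at the origin, so that their distribution is invariant under the orthogonal transformation $(W_1, W_2) = (-X_1, -X_2)$, and we can write

$$ \begin{pmatrix} W_1 \\ W_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right). \tag{A.12} $$

The random variables $W_1$ and $W_2$ are therefore centered and jointly normal, and we can apply Theorem 2.1:

$$ \frac{\max(W_1, W_2)}{\sqrt{\sigma_{11}}} \,\Big|\, W_1 > W_2 \sim SN\left( \frac{\sigma_{11} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \tag{A.13} $$

$$ \frac{\max(W_1, W_2)}{\sqrt{\sigma_{22}}} \,\Big|\, W_1 < W_2 \sim SN\left( \frac{\sigma_{22} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{A.14} $$

Simple algebra shows that the equality $\max(-r_1, -r_2) = -\min(r_1, r_2)$ holds for any two real numbers $r_1, r_2$. Since $W_1 > W_2 \Leftrightarrow X_1 < X_2$, it follows that

$$ -\frac{\min(X_1, X_2)}{\sqrt{\sigma_{11}}} \,\Big|\, X_1 < X_2 \sim SN\left( \frac{\sigma_{11} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \tag{A.15} $$

$$ -\frac{\min(X_1, X_2)}{\sqrt{\sigma_{22}}} \,\Big|\, X_1 > X_2 \sim SN\left( \frac{\sigma_{22} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{A.16} $$

Azzalini (1985) shows that if the distribution of a random variable $U$ is skew-normal with parameter $\lambda$ then the distribution of $-U$ is skew-normal with parameter $-\lambda$, so that

$$ \frac{\min(X_1, X_2)}{\sqrt{\sigma_{11}}} \,\Big|\, X_1 < X_2 \sim SN\left( \frac{\sigma_{12} - \sigma_{11}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \tag{A.17} $$

$$ \frac{\min(X_1, X_2)}{\sqrt{\sigma_{22}}} \,\Big|\, X_1 > X_2 \sim SN\left( \frac{\sigma_{12} - \sigma_{22}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{A.18} $$

By assumption, $Z_1$ and $Z_2$ are distributed as follows:

$$ \frac{Z_1}{\sqrt{\sigma_{11}}} \sim SN\left( \frac{\sigma_{12} - \sigma_{11}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \qquad \frac{Z_2}{\sqrt{\sigma_{22}}} \sim SN\left( \frac{\sigma_{12} - \sigma_{22}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{A.19} $$

We can then write: $\min(X_1, X_2) \mid X_1 < X_2$ has the same distribution as $Z_1$; $\min(X_1, X_2) \mid X_1 > X_2$ has the same distribution as $Z_2$.

$X_1$ and $X_2$ are centered at the origin by assumption, so that

$$ P(X_1 > X_2) = P(X_1 < X_2) = 0.5 \tag{A.20} $$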

and the proof is complete.

Proof of Theorem 2.3. We shall follow the notation of the previous theorems and prove only that $Y = \max(X_1, X_2)$ is skew-normal with parameter $\lambda = \sqrt{(1 - \rho)/(1 + \rho)}$. The proof that $Z = \min(X_1, X_2)$ is skew-normal with parameter $-\lambda$ is similar. By assumption, $X_1$ and $X_2$ are two standardized random variables whose distribution is jointly normal, so that we can apply Theorem 2.1 with $Y_1$ and $Y_2$ distributed as follows:

$$ \frac{Y_1}{\sqrt{\sigma_{11}}} \sim SN\left( \frac{\sigma_{11} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \qquad \frac{Y_2}{\sqrt{\sigma_{22}}} \sim SN\left( \frac{\sigma_{22} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{A.21} $$

By assumption, $\sigma_{11} = \sigma_{22} = 1$, so that $\sigma_{12} = \rho$ and

$$ Y_i \sim SN\left( \frac{1 - \rho}{\sqrt{1 - \rho^2}} \right), \qquad i = 1, 2. \tag{A.22} $$

From Theorem 2.1 we know that the distribution of $Y$ is a mixture of the distributions of $Y_1$ and $Y_2$, which are identical. Then $Y_1$, $Y_2$ and $Y$ are identically distributed and we can write

$$ Y \sim SN\left( \frac{1 - \rho}{\sqrt{1 - \rho^2}} \right). \tag{A.23} $$

Simple algebra shows that (A.23) is equivalent to

$$ Y \sim SN\left( \sqrt{\frac{1 - \rho}{1 + \rho}} \right) \tag{A.24} $$

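and the proof is complete.

Proof of Theorem 3.1.

1. We shall first prove that $E(M) = \theta + 0.5\,(\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}})\sqrt{(1 - \rho)/\pi}$. Azzalini (1985) showed the following:

$$ Z \sim SN(\lambda) \Rightarrow E(Z) = \sqrt{\frac{2}{\pi}} \cdot \frac{\lambda}{\sqrt{1 + \lambda^2}}. \tag{A.25} $$

From basic properties of the skew-normal we get

$$ E(T_1 \mid T_1 > T_2) = \theta + \sqrt{\sigma_{11}}\sqrt{\frac{1 - \rho}{\pi}}, \qquad E(T_2 \mid T_1 \le T_2) = \theta + \sqrt{\sigma_{22}}\sqrt{\frac{1 - \rho}{\pi}}. \tag{A.26} $$

Now recall that $P[T_1 > T_2] = 0.5$ and apply the total probability theorem:

$$ \begin{aligned} E(M) &= 0.5\, E(T_1 \mid T_1 > T_2) + 0.5\, E(T_2 \mid T_1 \le T_2) \\ &= \frac{1}{2}\left[ \theta + \sqrt{\sigma_{11}}\sqrt{\frac{1 - \rho}{\pi}} \right] + \frac{1}{2}\left[ \theta + \sqrt{\sigma_{22}}\sqrt{\frac{1 - \rho}{\pi}} \right] \\ &= \theta + \frac{\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}}}{2}\sqrt{\frac{1 - \rho}{\pi}} \end{aligned} \tag{A.27} $$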

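and this proof is complete.

2. Since the assumptions of Theorem 2.1 hold, we can write

$$ \frac{M - \theta}{\sqrt{\sigma_{11}}} \,\Big|\, T_1 > T_2 \sim SN\left( \frac{\sigma_{11} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right), \qquad \frac{M - \theta}{\sqrt{\sigma_{22}}} \,\Big|\, T_1 \le T_2 \sim SN\left( \frac{\sigma_{22} - \sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22} - \sigma_{12}^2}} \right). \tag{A.28} $$

Azzalini (1985) shows that the square of a random variable whose distribution is $SN(\lambda)$ follows a chi-square distribution with one degree of freedom. Moreover, by assumption, $E(T_1) = E(T_2) = \theta$. Then

$$ \begin{aligned} \left( \frac{M - \theta}{\sqrt{\sigma_{11}}} \right)^2 \Big|\, T_1 > T_2 \sim \chi^2_1 &\Rightarrow E[(M - \theta)^2 \mid T_1 > T_2] = \sigma_{11}, \\ \left( \frac{M - \theta}{\sqrt{\sigma_{22}}} \right)^2 \Big|\, T_1 \le T_2 \sim \chi^2_1 &\Rightarrow E[(M - \theta)^2 \mid T_1 \le T_2] = \sigma_{22}. \end{aligned} \tag{A.29} $$

Apply now the total probability theorem:

$$ MSE_M(\theta) = E[(M - \theta)^2] \tag{A.30} $$

$$ = \sigma_{11} \cdot P(T_1 > T_2) + \sigma_{22} \cdot P(T_1 \le T_2) \tag{A.31} $$

$$ = (\sigma_{11} + \sigma_{22})/2 \tag{A.32} $$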

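and the proof is complete.

Proof of Theorem 3.2.

1. $P[T - a\sigma \le \theta \le T + a\sigma] = 2\Phi(a) - 1$. From the definitions of $T$ and $\sigma$ we obtain

$$ P[T - a\sigma \le \theta \le T + a\sigma] = \begin{cases} P\left( \left| \dfrac{T_1 - \theta}{\sqrt{\sigma_{11}}} \right| \le a \right), & T_1 > T_2, \\[2ex] P\left( \left| \dfrac{T_2 - \theta}{\sqrt{\sigma_{22}}} \right| \le a \right), & T_1 \le T_2. \end{cases} \tag{A.33} $$

The conditional distributions of $(T_1 - \theta)/\sqrt{\sigma_{11}}$ given $T_1 > T_2$ and of $(T_2 - \theta)/\sqrt{\sigma_{22}}$ given $T_1 \le T_2$ are both skew-normal. Moreover Azzalini (1985) shows that $Z^2 \sim \chi^2_1$ if $Z \sim SN(\lambda)$. An equivalent statement is the following: if $Z \sim SN(\lambda)$ then the distribution of $|Z|$ is half-normal, that is, its c.d.f. is $2\Phi(a) - 1$ for $a > 0$. Then

$$ P[T - a\sigma \le \theta \le T + a\sigma] \tag{A.34} $$

$$ = \frac{1}{2}\, P\left( \left| \frac{T_1 - \theta}{\sqrt{\sigma_{11}}} \right| \le a \,\Big|\, T_1 > T_2 \right) + \frac{1}{2}\, P\left( \left| \frac{T_2 - \theta}{\sqrt{\sigma_{22}}} \right| \le a \,\Big|\, T_1 \le T_2 \right) \tag{A.35} $$

$$ = \frac{2\Phi(a) - 1}{2} + \frac{2\Phi(a) - 1}{2} = 2\Phi(a) - 1 \tag{A.36} $$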

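and this proof is complete.

2. $E(2a\sigma) = a(\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}})$. By definition $\sigma = \sqrt{\sigma_{11}}$ if $T_1 > T_2$ and $\sigma = \sqrt{\sigma_{22}}$ if $T_1 \le T_2$. We also know that $P[T_1 > T_2] = 0.5$. Then

$$ E(2a\sigma) = 2a\,[\sqrt{\sigma_{11}} \cdot 0.5 + \sqrt{\sigma_{22}} \cdot 0.5] = a(\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}}) \tag{A.37} $$

and this proof is complete.

Proof of Theorem 3.3. Let $\Phi(\cdot \mid \lambda)$ denote the c.d.f. of a distribution $SN(\lambda)$. Jiang (1997) proved the following inequality under the assumption that $(T_1, T_2)$ are standardized and positively correlated random variables whose joint distribution is bivariate normal:

$$ P(\max(T_1, T_2) > a) \le 2\Phi(-a) - \Phi^2(-a), \qquad a > 0. \tag{A.38} $$

Therefore we have only to prove that

$$ P[\max(T_1, T_2) > a] \le \Phi(-a) + \frac{1}{\pi}\arctan\sqrt{\frac{1 - \rho}{1 + \rho}}. \tag{A.39} $$

The following result appears in Azzalini (1985):

$$ \sup_a |\Phi(a) - \Phi(a \mid \lambda)| = \frac{1}{\pi}\arctan|\lambda| \ \Rightarrow\ |\Phi(a) - \Phi(a \mid \lambda)| \le \frac{1}{\pi}\arctan|\lambda|. \tag{A.40} $$

Apply now Theorem 2.3 with $\lambda = \sqrt{(1 - \rho)/(1 + \rho)}$:

$$ \max(T_1, T_2) \sim SN\left( \sqrt{\frac{1 - \rho}{1 + \rho}} \right) \ \Rightarrow\ P[\max(T_1, T_2) \le a] = \Phi(a \mid \lambda). \tag{A.41} $$

From (A.40) and (A.41) we get

$$ |\Phi(a) - P[\max(T_1, T_2) \le a]| \le \frac{1}{\pi}\arctan\sqrt{\frac{1 - \rho}{1 + \rho}}. \tag{A.42} $$

The following implication stems from ordinary properties of the c.d.f.:

$$ \max(T_1, T_2) \ge T_1 \ \Rightarrow\ \Phi(a) \ge \Phi(a \mid \lambda). \tag{A.43} $$

Then inequality (A.42) can be written as follows:

$$ \begin{aligned} & \Phi(a) - P[\max(T_1, T_2) \le a] \le \frac{1}{\pi}\arctan\sqrt{\frac{1 - \rho}{1 + \rho}} \\ \Rightarrow\ & P[\max(T_1, T_2) \le a] \ge \Phi(a) - \frac{1}{\pi}\arctan\sqrt{\frac{1 - \rho}{1 + \rho}} \\ \Rightarrow\ & P[\max(T_1, T_2) > a] \le \Phi(-a) + \frac{1}{\pi}\arctan\sqrt{\frac{1 - \rho}{1 + \rho}} \end{aligned} \tag{A.44} $$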

and this proof is complete.

References

Aigner, D.J., Lovell, C.A.K., Schmidt, P., 1977. Formulation and estimation of stochastic frontier production function model. J. Economet. 12, 21–37.
Andel, J., Netuka, I., Zvara, K., 1984. On threshold autoregressive processes. Kybernetika 20, 89–106.
Arnold, B.C., Beaver, R.J., Groeneveld, R.A., Meeker, W.Q., 1993. The nontruncated marginal of a truncated bivariate normal distribution. Psychometrika 58, 471–478.
Azzalini, A., 1985. A class of distributions which includes the normal ones. Scand. J. Stat. 12, 171–178.
Azzalini, A., Capitanio, A., 1999. Statistical applications of the multivariate skew-normal distribution. JRSS B 61, 579–602.
Azzalini, A., Della Valle, A., 1996. The multivariate skew-normal distribution. Biometrika 83, 715–726.
Bartolucci, F., de Luca, G., Loperfido, N., 2000. A generalization for skewness of the basic stochastic volatility model. Proceedings of the 15th International Workshop in Statistical Modelling. Bilbao, Spain.
David, H.A., 1981. Order Statistics. Wiley, New York.
Jiang, J., 1997. Sharp upper and lower bounds for asymptotic levels of some statistical tests. Statist. Probab. Lett. 35, 395–400.
Lehmann, E.L., 1999. Elements of Large-Sample Theory. Springer, New York.
O’Hagan, A., Leonard, T., 1976. Bayes estimation subject to uncertainty about parameters constraints. Biometrika 63, 201–203.