Mathematical and Computer Modelling 34 (2001) 921-936
www.elsevier.nl/locate/mcm
The Mean and Median Absolute Deviations

T. PHAM-GIA
Département de Mathématiques et de Statistique, Université de Moncton
Moncton, New Brunswick, Canada E1A 3E9

T. L. HUNG
Hue University, Viet Nam

(Received and accepted May 2000)
Abstract—In this article, we present a survey of important results related to the mean and median absolute deviations of a distribution, both denoted by MAD in the statistical modelling literature, a coincidence that has created some confusion. Some up-to-date published results, and some original ones of our own, are also included, along with discussions on several controversial issues. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords—Mean absolute deviation, Median absolute deviation, Standard deviation, Contamination, Robustness, Sampling distributions, Estimation, Skewness, Gini index, Peters formula, Asymptotics.
1. INTRODUCTION

The L²-norm is very well present in the probabilistic and statistical literatures, most evidently in the basic measure of dispersion (variance) in statistics: the standard deviation of a real random variable X with finite mean μ = ∫_{−∞}^{∞} x dF(x),

  σ(X) = ‖X − μ‖₂ = [E(X − μ)²]^{1/2} = (∫_{−∞}^{∞} (x − μ)² dF(x))^{1/2}.

Also, the validity of the orthogonal relation ‖X + Y‖₂² = ‖X‖₂² + ‖Y‖₂², for X and Y independent centered variables, is the prime reason for the widespread use of this norm in sampling theory and in an important topic of statistics: the analysis of variance. Due to its convenience in differentiation, and hence in optimization, the L²-norm is also preponderant in estimation, for example in the method of least squares, and for the same reason the squared-error loss function is widely used in statistical decision theory. This seemingly very useful character of the L²-norm has several drawbacks, however, for example, when the design ceases to be orthogonal in the analysis of variance.

On the other hand, although playing a dominant role in mathematical functional analysis, the L¹-norm has seen relatively few applications in statistics and statistical modelling.

(Research partially supported by NSERC Grant 41-505 (Canada). The authors wish to warmly thank several colleagues who have helped to improve the content and the presentation of this paper. Thanks are also due to Professor N. Turkkan for setting up several complex computer programs.)
Here, the presence of the absolute value complicates computations, and difficulties encountered in deriving sampling distributions with this norm have further reduced its role. Recent research efforts on robust statistical modelling and inference have given the L¹-norm, and the median of a distribution, some renewed popularity (see, e.g., articles in [1]).

First, in curve fitting and regression, the L¹-norm, when applied to a discrete set of points, leads to the minimization of a sum of absolute distances, in an approach known under a variety of names, the most frequently used ones being the least absolute deviations (LAD) or the least absolute errors (LAE). A groundbreaking result by Bassett and Koenker [2] states that, for linear models, the LAE estimator has strictly smaller asymptotic confidence ellipsoids than the least squares estimator, for any distribution for which the sample median is a more efficient location estimator than the sample mean. Reference [3] gives a comprehensive and fairly up-to-date coverage of topics related to this domain, and Portnoy and Koenker [4] give a comparative overview of both the L¹ and L² approaches. On the other hand, Bai et al. [5] obtained some encouraging results on analysis of variance based on LAD.

Second, for a continuous or discrete random variable, the counterpart of the standard deviation in the L¹-norm is the mean absolute deviation (MAD), denoted by δ₁(X) = ‖X − μ‖₁ = ∫_{−∞}^{∞} |x − μ| dF(x) = E(|X − μ|). The same concept can be used when Md, the median of the distribution, is considered instead of μ, and we have the mean absolute deviation about the population median, δ₂(X) = E(|X − Md|), which has several applications in statistics and other related domains.

Third, while the above measures are still based on expectations (or "averages"), another concept based completely on the median (or "middle value") gives rise to the median absolute deviation, also denoted by MAD. In this article, we set λ(X) = Median(|X − Md(X)|); λ has applications in a growing number of domains, mostly in issues related to robustness.

There is considerable confusion in the literature on the abbreviation MAD, with several distinct statistical measures all called MAD. In this article, we will focus on the above three dispersion measures and their related sampling statistics, and will use a distinct notation and a distinct name for each of them in the study of their behaviours and applications.

In Section 2, we review the main properties of the mean absolute deviation, and in Section 3, basic results on the sampling distribution of d_n, the sample mean absolute deviation about the sample mean, are presented, together with a comparative study of the advantages of using d_n compared to S_n. In Section 4, we look into the estimation of the population mean absolute deviations, from both point and interval viewpoints. Some uses of the mean absolute deviations, in statistics, in experimental physics, and in other domains are presented in Section 5. In Section 6, we look into the asymptotic expressions of some statistics associated with different mean absolute deviations. The dispersion function, as a generalization of the mean absolute deviation, is presented in Section 7, and finally, Section 8 deals with symmetrization of a random variable and the median absolute deviation.
2. THE MEAN ABSOLUTE DEVIATIONS
2.1. Inequalities

A basic inequality in probability theory states that if X is a random variable and g is a nonnegative Borel function on R, which is even and nondecreasing on (0, ∞), then for a > 0, we have [6]

  [E g(X) − g(a)] / (a.s. sup g(X)) ≤ P(|X| ≥ a) ≤ E g(X) / g(a)

(a.s. means almost surely). Considering X − c, where c is a constant, instead of X, and taking g(x) = x^r, where r is a positive integer, we have the classical Markov inequality P(|X − c| ≥ ε) ≤ E(|X − c|^r)/ε^r. For c = μ and ε = k V_r^{1/r}, where V_r is the rth absolute central moment of X, i.e., V_r = E(|X − μ|^r), we obtain the Pearson inequality [7, p. 110]

  P(|X − μ| ≥ k V_r^{1/r}) ≤ 1/k^r.

For r = 2, we then have the well-known Chebyshev inequality

  P(|X − μ| ≥ kσ) ≤ 1/k².

For c = μ, r = 1, and ε = k δ₁(X), we have the lesser-known case concerning the L¹-norm,

  P(|X − μ| ≥ k δ₁(X)) ≤ 1/k,    (1)

where δ₁ = δ₁(X) = ‖X − μ‖₁ = E(|X − μ|) is called the mean absolute deviation (MAD) of X (about its mean). Similarly, if Md is the median of the distribution of X, we have P(|X − Md| ≥ k δ₂(X)) ≤ 1/k, where δ₂ = δ₂(X) = ‖X − Md‖₁ is the mean absolute deviation of X about its median. Since the median minimizes the average absolute distance [8, p. 232], we have δ₂ ≤ δ₁. By Lyapunov's inequality [9, p. 103], we also have δ₁ ≤ σ and hence, we always have δ₂ ≤ δ₁ ≤ σ for any distribution.
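A quick numerical check of the ordering δ₂ ≤ δ₁ ≤ σ and of the first-order bound (1) can be sketched as follows (Python with numpy; the exponential population is an illustrative choice, and the population quantities are approximated from a large sample):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)      # mu = sigma = 1 in the limit

mu, sigma = x.mean(), x.std()
delta1 = np.mean(np.abs(x - mu))                    # about 2/e = 0.736
delta2 = np.mean(np.abs(x - np.median(x)))          # about ln 2 = 0.693
assert delta2 <= delta1 <= sigma                    # delta_2 <= delta_1 <= sigma

for k in (1.5, 2.0, 3.0):
    tail = np.mean(np.abs(x - mu) >= k * delta1)    # empirical P(|X - mu| >= k*delta_1)
    print(k, tail, 1.0 / k)                         # the bound 1/k of (1) always holds
```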
2.2. The Basic Meaning of the Mean Absolute Deviations

We can see that δ₁ offers a direct measure of the dispersion of X about its mean, of the first degree in |X|, unlike the standard deviation which, according to [7, p. 54], is obtained by using "the device of squaring and then taking the square root of the resultant sum, which may appear a little artificial". We have δ₁(X + a) = δ₁(X) and δ₁(aX) = |a| δ₁(X) for any real a. Interpretation-wise, δ₁ can easily be seen to be "the average of the variation of X about its mean, irrespective of the sign", while the practical meaning of the standard deviation, as the quadratic mean of the variation of X about μ, remains unclear. For X uniform on [0,1], δ₁ has the very natural value of 1/4, whereas σ = 1/(2√3) does not lend itself to any interpretation. Furthermore, we then have the very logical equality P(|X − μ| ≤ δ₁) = 1/2, while P(|X − μ| ≤ σ) = 1/√3 is, again, difficult to interpret. The first equality can be intuitively understood as follows: for the uniform distribution, there is a 50% probability that X is within the "average distance" about that same mean.

For most common distributions, δ₁ can be obtained in closed form [10, Table 1]. Indeed, for the binomial distribution, Diaconis and Zabell [11] reported four equivalent forms for δ₁. For the Pearson family, defined by f′(x)/f(x) = (x + a)/(b₀ + b₁x + b₂x²), with mean μ and variance σ², we have

  δ₁ = 2Cσ²f(μ),  where C = (4β₂ − 3β₁) / (6(β₂ − β₁ − 1)),

with β₁ and β₂ being the usual coefficients of skewness and kurtosis [12]. For the Pearson Type III, or gamma distributions, we have C = 1 and δ₁ is "twice the variance times the value of the density at its mean". This property also holds for the normal and for several discrete distributions, like the Poisson, the binomial, and the negative binomial, where the mean is replaced by its integral part. A similar relation holds for the beta distribution [13], and some other closed form summations for common distributions, of which δ₁ is a special case, are found to have interesting relations with classical orthogonal polynomials [11]. For the normal N(μ, σ²), we have δ₁ = δ₂ = σ√(2/π) ≈ 0.80σ. For the Student distribution t_n, with density

  f(t; n) = [√n B(1/2, n/2)]⁻¹ (1 + t²/n)^{−(n+1)/2},  −∞ < t < ∞,
we have δ₁(t_n) = (n/π)^{1/2} Γ((n − 1)/2)/Γ(n/2), which converges toward √(2/π) as n → ∞, as expected. For several common distributions, although δ₁ does not come in closed form, δ₁/σ does, and we have to limit ourselves to the study of this ratio.

The population median Md is the location parameter such that P(X ≥ Md) ≥ 1/2 and P(X ≤ Md) ≥ 1/2. For simplicity, we will suppose Md unique here. Then, we have Md = F⁻¹(1/2), where F⁻¹ is the quantile function defined by F⁻¹(t) = inf{x : F(x) ≥ t}, t ∈ [0,1]. For symmetrical distributions, we have δ₁ = δ₂. Since for asymmetrical distributions the median is more representative of the "center" than the mean, δ₂ readily provides a meaningful dispersion measure related to that center, while in the L²-norm very few applications have been found for its counterpart [E(X − Md)²]^{1/2}, whose meaning is also unclear. Since the median, usually, is not available in closed form, no analytic expression of δ₂ can be given for skewed distributions, except for the chi-square, where Causey [14] reported that, for X ~ χ²_n, where n is the number of degrees of freedom, we have δ₂ = 2n a_{k+1} if n = 2k + 1 and δ₂ = 2n b_{k+1} if n = 2k + 2, where

  a_j = (Md/2)^{j−1/2} exp(−Md/2) / Γ(j + 1/2)  and  b_j = (Md/2)^j exp(−Md/2) / j!.

Using the quantile function F⁻¹, δ₂ can also be represented as ∫₀^{1/2} [F⁻¹(1 − α) − F⁻¹(α)] dα, and most of the discussion on δ₁ above can be carried over to δ₂, which has applications in issues related to skewed distributions. For example, a measure of skewness proposed by Groeneveld and Meeden [15] is b = (μ − Md)/δ₂.
3. THE SAMPLE MEAN ABSOLUTE DEVIATIONS AND THEIR DISTRIBUTIONS

Because of the widespread use of the standard deviation, it is informative to have a comparative study of the respective distributions of the sample standard deviation and of the sample mean absolute deviations about the mean and about the median.

3.1. Distribution of the Sample Standard Deviation
For a sample of size n taken from a population with mean μ and variance σ², let S_n² = Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²/n be the sample variance, where X̄ₙ = Σᵢ₌₁ⁿ Xᵢ/n is the sample mean. We know that E(S_n²) = σ²(n − 1)/n and, if X is normal, we have nS_n²/σ² ~ χ²₍ₙ₋₁₎ and hence Var(S_n²) = 2σ⁴(n − 1)/n², with the density of S_n² given by a gamma distribution,

  f(s²) = K(n) (s²)^{(n−3)/2} e^{−ns²/2σ²},  s² > 0,

where K(n) is the normalizing constant depending on n, K(n) = [(2σ²/n)^{(n−1)/2} Γ((n − 1)/2)]⁻¹. Hence, for the sample standard deviation S_n, we have f(s) = 2K(n) s^{n−2} e^{−ns²/2σ²}, s > 0, since √n S_n/σ has a chi-distribution. Then, S_n is a biased estimator of σ, with

  E(S_n) = A(n) σ    (2)

and

  Var(S_n) = B(n) σ²,    (3)

where

  A(n) = (2/n)^{1/2} Γ(n/2)/Γ((n − 1)/2),  n ≥ 2,  and  B(n) = 1 − 1/n − [A(n)]².

Asymptotically, we have the following convenient expression:

  B(n) = 1/(2n) − 1/(8n²) − ⋯ − 157/(1024n⁴) − ⋯ .

Using the relation

  Γ(p)/Γ(p + 1/2) = Γ(2p) Γ(1/2) / (2^{2p−1} [Γ(p + 1/2)]²)    (4)

and Γ(1/2) = √π, we have obtained the following expression of A(n), n ≥ 2, in terms of n, which does not seem to be available anywhere in the literature. We have A(2) = 1/√π; for n even, n > 2,

  A(n) = √(2/(nπ)) · [(n − 2)(n − 4)⋯4·2] / [(n − 3)(n − 5)⋯3·1];

and for n odd, n ≥ 3,

  A(n) = √(π/(2n)) · [(n − 2)(n − 4)⋯3·1] / [(n − 3)(n − 5)⋯4·2].
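The double-factorial expressions can be checked against the gamma-function form of A(n) with a few lines of standard-library Python (a small verification sketch, not part of any derivation):

```python
import math

def A_gamma(n: int) -> float:
    return math.sqrt(2.0 / n) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def A_factorial(n: int) -> float:
    num, den = 1.0, 1.0
    for k in range(n - 2, 0, -2):   # (n-2)(n-4)... down to 2 or 1
        num *= k
    for k in range(n - 3, 0, -2):   # (n-3)(n-5)... down to 1 or 2
        den *= k
    if n % 2 == 0:
        return math.sqrt(2.0 / (n * math.pi)) * num / den
    return math.sqrt(math.pi / (2.0 * n)) * num / den

for n in range(3, 11):
    b = 1.0 - 1.0 / n - A_gamma(n) ** 2     # B(n) = 1 - 1/n - A(n)^2
    print(n, A_gamma(n), A_factorial(n), b) # the two A(n) forms coincide
```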
3.2. Distribution of the Sample Mean Absolute Deviations About the Mean
Let us consider first the case when μ is known, and let us define d_n* = Σᵢ₌₁ⁿ |Xᵢ − μ|/n. We then have E(d_n*) = δ₁ and Var(d_n*) = (σ² − δ₁²)/n.

If μ is unknown, let us define the sample mean absolute deviation (about the sample mean) by d_n = (Σᵢ₌₁ⁿ |Xᵢ − X̄ₙ|)/n. Only for the normal case can the distribution of d_n be computed precisely, and then only by a recurrence relation. As established by Godwin [16,17] (see also the Appendix), this relation is quite complex. However, we have, for the mean and variance of d_n,

  E(d_n) = δ₁ √((n − 1)/n)    (5)

and

  Var(d_n) = [2σ²(n − 1)/(n²π)] J(n),  where J(n) = π/2 + √(n(n − 2)) − n + arcsin(1/(n − 1)).    (6)

We know that J(n) converges toward (π − 2)/2 as n → ∞. Also, E(d_n/C(n)) = σ, where C(n) = √(2(n − 1)/(nπ)), and

  Var(d_n/C(n)) = (σ²/n) J(n),    (7)

which converges toward (σ²/n)[(π − 2)/2].

If the population median, Md = Md(X), is known, let d_n° = Σᵢ₌₁ⁿ |Xᵢ − Md|/n. Then, we have

  E(d_n°) = δ₂  and  Var(d_n°) = [σ² + (Md − μ)² − δ₂²]/n.

However, if Md is unknown, let md_n be the sample median (a precise definition of md_n is given in Section 8) and let d_n′ = Σᵢ₌₁ⁿ |Xᵢ − md_n|/n. The distribution of d_n′, however, is untractable, even for the normal case, and we have to make asymptotic considerations to derive its approximate distribution when n is large.

A priori, we can consider the following ratios relating the above statistics: Geary's ratio W_n = d_n/S_n and its analogue W_n′ = d_n′/S_n, the Studentized ratios h_n = √n (X̄ₙ − μ)/d_n and h_n′ = √n (X̄ₙ − μ)/d_n′, and

  Y_n = √n (md_n − μ)/d_n′.

Only two of the above statistics have been studied in detail: W_n, by Geary [18], and h_n, by Herrey [19].
3.3. Using μ, δ₁ as Location-Scale Parameters for the Normal

In the L¹-norm approach, δ₁ should be used as scale parameter instead of the standard deviation σ. Using μ, δ₁ as location-scale parameters for the normal distribution, X ~ N(μ, δ₁) has as density

  f(x) = (1/(πδ₁)) exp{−(1/π) [(x − μ)/δ₁]²},  −∞ < x < ∞,    (8)

a much more symmetrical and simpler form than the one with σ². Here, δ₁ = √(2/π) corresponds to the common standard normal case σ = 1. The standard normal variate (in the L¹-norm) is then V = (X − μ)/δ₁, with E(V) = 0 and δ₁(V) = 1, and has as density f(v) = e^{−(v/√π)²}/π. Percentiles of V correspond to those of the common normal tables with μ = 0 and Var(X) = π/2, via a linear transformation on the variable.

However, although δ₁ has its origin back to de Moivre [11], its applications in statistical inference are quite limited, due to mathematical difficulties encountered in exact sampling theory. Classical inference is anchored around the variance, which enjoys the property Var(X + Y) = Var(X) + Var(Y) when X and Y are independent, a property certainly not shared by δ₁; hence, unlike σ(X̄ₙ) = σ/√n concerning the mean of a sample of size n, we do not have δ₁(X̄ₙ) = δ₁/√n, except for the normal case. The exact expression of δ₁(X̄ₙ) is, in general, highly complex, except in the case where the distribution is regenerative. This is the case of the gamma(α, β) distribution, for example, with density f(x; α, β) = x^{α−1} exp(−x/β)/(β^α Γ(α)), x > 0. We then have X̄ₙ ~ Γ(nα, β/n) and, since δ₁(X) = 2α^α β/[e^α Γ(α)], we have δ₁(X̄ₙ) = 2(nα)^{nα} β/[n e^{nα} Γ(nα)].
3.4. Independence of X̄ₙ and d_n

The independence of X̄ₙ and S_n² for a normal distribution is a basic result in statistics. In fact, it can be proven that X̄ₙ and g(X₁ − X̄ₙ, …, Xₙ − X̄ₙ) are independent for any function g. The proof can be obtained directly, or as a consequence of Basu's theorem, since X̄ₙ is complete sufficient and (X₁ − X̄ₙ, …, Xₙ − X̄ₙ) is ancillary, as pointed out by Boos and Hughes-Oliver [20]. Hence, X̄ₙ and d_n are independent, and whereas the ratio (X̄ₙ − μ)/(S_n/√(n − 1)) can be shown to have a t_{n−1} distribution independent of S_n, the ratio H_n = (X̄ₙ − μ)/(d_n/√n) can also be shown to have a distribution independent of d_n [19]. However, the precise density of H_n can only be obtained by a complex recurrence integral relation, which uses the distribution of d_n obtained by Godwin [16]; see the Appendix. This density is symmetrical with respect to the vertical axis and converges toward the standard form of the density, f(v), considered above. Distributions of H_n and their critical values at different confidence levels can be obtained by using a computer program available from the authors, or by consulting the tabulated values given by Herrey [21]. To our knowledge, the precise sampling distributions of W_n′, h_n′, Y_n, and Y_n′, for n finite, have not been determined, due to the difficulty in establishing the value of the sample median, but asymptotic expansions for most of them are given by Babu and Rao [22].

3.5. Other Results Concerning d_n

To compare the relative magnitudes of S_n and d_n for the normal distribution, Geary [18] studied the distribution of the Studentized d_n, or Geary's ratio W_n = d_n/S_n, which also has a highly complex expression. Basically, the density of W_n is centered about 0.798, the value of δ₁/σ, and gets tighter about that value as n increases. Geary gave approximate percentiles of W_n for various values of n. Due to the complexity of the exact distribution of d_n in the normal case, some authors have given approximations to this distribution. Cadwell [23], for example, gave an approximate density to d_n/σ, using (χ²_ν/C)^α, where ν, C, and α are determined by matching the first three moments. He also used a chi-square distribution to give an approximate distribution to the average of k independent sample mean deviations, each of size n from the same normal population, denoted d̄(k, n). He found that c{d̄(k, n)/σ}^{1.8} ~ χ²_ν, where ν and c are given by two formulae. As consequences, we have, approximately, Σᵢ₌₁ᵏ cᵢ(d̄ᵢ/σ)^{1.8} ~ χ²_p, with p = Σᵢ₌₁ᵏ νᵢ, and the ratio of two such normalized sums, for two independent sets of k and k′ independent sample mean deviations, approximately follows an F distribution.
3.6. Advantages of Using d_n over S_n

Although the use of S_n seems to have overwhelming advantages, and S_n appears to be the sample scale statistic most used in the literature (among others such as the range, the interquartile range, etc.), there are still several merits to d_n.

(a) A simpler and more robust sample measure of dispersion. Containing only first degree terms, d_n gives less importance to extreme values, and is hence more robust, and is also simpler to compute than S_n. Population-wise, it is often suggested that a multiple of δ₁ can provide a more robust measure of spread than the standard deviation [7, p. 59].

(b) A better estimator of scale. Comparing (2) and (3) with (5) and (6), respectively, we can see that, although d_n and S_n are asymptotically unbiased estimators of δ₁ and of σ, respectively, E(d_n) converges toward δ₁ faster than E(S_n) does toward σ. Similarly, Var(d_n) goes to zero faster than Var(S_n) does. Hence, d_n performs better as a point estimator for δ₁ than S_n does for σ. This point seems to have been missed by several authors, who used d_n to estimate σ instead of δ₁. For example, Davies and Pearson [24] provided correction coefficients for d_n for the estimation of σ. On the other hand, Fisher (see discussion in [19]) used (3) and (7) to conclude that the asymptotic relative efficiency (ARE) of d_n/C(n) compared to S_n, in estimating σ, is only 1/(π − 2), i.e., ARE(d_n/C(n) | S_n) = lim_{n→∞} Var(S_n)/Var(d_n/C(n)) = 1/(π − 2). Naturally, in the estimation of σ, S_n is the square root of a second degree expression and should perform better than d_n, which contains only first degree terms. However, as respective unbiased estimators of δ₁ and of σ separately, the statistics d_n/√((n − 1)/n) and S_n/A(n) give

  ARE(d_n/√((n − 1)/n) | S_n/A(n)) = (B(n)/(2[A(n)]²)) · (nπ/J(n)) > 1.

Therefore, it is important to notice here that, in order to benefit from the fact that d_n is a better estimator of scale, one has to use δ₁, instead of σ, in all related analysis (see Section 5).

(c) A better estimator in the event of contamination. In general, there is contamination when the observations, instead of coming from a single distribution, come from a mixture of two or several distributions having the same mean, but with different variances. For mathematical tractability, we usually consider normal distributions only. For example, Teichroew [25] considered the case when these variances are gamma distributed, derived the marginal distribution of X, and showed that the variance of the latter increases as the shape parameter of the gamma distribution decreases. In the presence of contamination, the estimation of the population scale parameters changes drastically. Let us consider the case where the population distribution is αN(μ, σ²) + (1 − α)N(μ, 9σ²), with 0 ≤ α ≤ 1. In an exhaustive investigation of the above distribution, Tukey [26] found that, although for α = 1 (no contamination) the ARE of the mean absolute deviation to the standard deviation, ARE(d_n | S_n) = lim_{n→∞} Var(S_n)/Var(d_n), is 0.88, this measure quickly rises to 1 for α = 0.992
and becomes much larger as α decreases. Hence, for any significantly contaminated population (1 − α > 0.008), d_n can be a much better estimator of scale in large samples than S_n. Since a similar conclusion can be reached when X̄ₙ is used to estimate the mean μ, Tukey concluded that nearly imperceptible nonnormalities, of the kind commonly occurring in practice, may make conventional relative efficiencies of scale and location estimates "nearly useless", and that, concerning the rejection of doubtful observations, "for some contaminated distributions, the mean absolute error is a safer criterion of accuracy than the mean square error". Tukey thus supported Eddington's conclusion rather than Fisher's, because of two reasons:
(1) Fisher's analysis did not seem to be carried out far enough;
(2) his assumption was that the sole purpose of scaling is to estimate σ.
Huber [27], in his preliminary discussion about robustness, also cited the above example and provided numerical values for the ARE, according to different values of α.

More generally, by considering the contaminated population distribution αN(μ, σ²) + (1 − α)N(μ, k²σ²), we can show [9, p. 584] that, if k(1 − α) ≈ 0 (very small contamination), then E[(S_n′ − σ)²] ≈ (σ²/2n)·A(k, α), where S_n′ = S_n(n/(n − 1))^{1/2} and A(k, α) = 1 + 1.5(1 − α)k⁴, while the unbiased estimator d_n/C(n) has approximately (σ²/2n)(π − 2) as its variance, when n is sufficiently large. Hence, the second unbiased estimator of σ is indeed the better one in this case.
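Tukey's contamination effect is easy to reproduce by simulation. The sketch below (Python with numpy) compares the squared coefficients of variation of S_n and d_n — a convenient proxy for the efficiency ratio of the corresponding normalized scale estimators — under αN(0,1) + (1 − α)N(0,9); n = 20 and the replication count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def cv2(stat, alpha, n=20, reps=40_000):
    heavy = rng.random((reps, n)) > alpha            # contaminated points, prob. 1 - alpha
    x = rng.normal(0.0, np.where(heavy, 3.0, 1.0))   # alpha*N(0,1) + (1-alpha)*N(0,9)
    s = stat(x)
    return s.var() / s.mean() ** 2                   # squared coefficient of variation

def d_stat(x):
    return np.mean(np.abs(x - x.mean(axis=1, keepdims=True)), axis=1)

def s_stat(x):
    return x.std(axis=1)

for alpha in (1.0, 0.992, 0.95):
    # about 0.88-0.89 at alpha = 1; near 1 at alpha = 0.992; above 1 beyond that
    print(alpha, cv2(s_stat, alpha) / cv2(d_stat, alpha))
```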
Concerning d_n′, at the present time no distribution of this statistic has been derived, and consequently, no result on its behavior can be presented.

4. ESTIMATION OF THE MEAN ABSOLUTE DEVIATIONS
When δ₁ is a scale parameter of a distribution, on the same basis as σ, the question related to its direct estimation from data is of interest. However, when δ₁ is a function of the parameters of the distribution, its estimation can be obtained through the estimation of these parameters.
(a) Point estimation of δ₁. For the Pearson system, Suzuki [28] has developed a general formula to estimate δ₁, based on the first four sample central moments mᵢ, i = 1, …, 4. The function H(m) = H(m₁, m₂, m₃, m₄) gives efficient estimators of δ₁, with √n(H(m) − H(μ)) asymptotically normal, where μ = (μ₁, μ₂, μ₃, μ₄). More specifically, √(2m₂/π), and two analogous but more elaborate functions of m₂, m₃, and m₄, are efficient estimators of δ₁ for the three distributions N(μ, σ²), χ²_θ, and t_θ, respectively. Suzuki [28] also gave the variances of these estimators, but their expressions are very complex.

(b) Interval estimation. In classical interval estimation of a distribution parameter, the (1 − α)100% confidence interval is often obtained via the distribution of a "pivotal quantity", whose percentage points are used in the computation of the confidence interval end points. For the normal case, we have the following result on δ₁, which is similar to the one stating that nS_n²/σ² ~ χ²₍ₙ₋₁₎.
THEOREM 1. Let X ~ N(μ, δ₁) and d_n = Σᵢ₌₁ⁿ |Xᵢ − X̄ₙ|/n. Then, nd_n/δ₁ ~ g_n, with g_n obtained from Godwin's density.

PROOF. See the Appendix.

Hence, for an observed value of d_n, the 100(1 − α)% confidence interval for δ₁ is obtained from the double inequality g_{n,α/2} ≤ nd_n/δ₁ ≤ g_{n,1−α/2}, with the values of the percentage points of g_n to be computed using the expression of g_n given in the Appendix (a computer program is available from the authors). For example, if six observations are taken from a normal population with MAD δ₁, then, for an observed value of d₆, we have the 90% confidence interval 6d₆/8.705 < δ₁ < 6d₆/2.708, where 2.708 and 8.705 are, respectively, the 5th and 95th percentiles of g₆, obtained from our computer program. For nonnormal distributions, we can apply the above result to obtain an approximate confidence interval for δ₁ when n is large.
5. SOME USES OF THE MEAN ABSOLUTE DEVIATIONS

In this section, we will present some uses of δ₁ in current statistical issues.

5.1. Sample Size Determination

Since there is a simple relation between δ₁ and σ for the normal case, the sample size problem when using δ₁, regarding mean estimation and hypothesis testing with control over the two types of error, can be solved in the usual way, and some advantages can even be gained when using δ₁. Let us consider the case where we have to compute the sample size required to satisfy a bound condition on the length of a confidence interval of the mean, in a normal distribution.

(a) Case when δ₁ is known. In order for the (1 − α)100% confidence interval for μ to be shorter than L, it suffices to take n > (2V_{α/2}δ₁/L)² + 1, where V_{α/2} is the upper (α/2)th percentile of the standard variate V considered above.

(b) Case when δ₁ is unknown. If the ratio δ₁/(L/2) can be given, the above formula can still be applied. This is the case where the meaning of δ₁ can be effectively used and presents a clear advantage over σ/(L/2), since δ₁/(L/2) clearly means the "ratio of the average variation over half of the desired variation", which is easier to interpret and evaluate. Working with σ, the ratio σ/(L/2), often used in textbooks, does not have that clear meaning, and, in fact, any meaning at all in the case of a skewed population distribution, even in the large sample case.
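Rule (a) takes only a few lines to implement (a sketch in Python, scipy assumed; note that V_{α/2} = z_{α/2}√(π/2), since Var(V) = π/2, and the inputs δ₁ = 2, L = 1 are illustrative):

```python
import math
from scipy import stats

def sample_size(delta1: float, L: float, alpha: float = 0.05) -> int:
    # upper alpha/2 percentile of V = (X - mu)/delta_1, which has Var(V) = pi/2
    v = stats.norm.ppf(1.0 - alpha / 2.0, scale=math.sqrt(math.pi / 2.0))
    # smallest integer n exceeding the bound (2*V*delta_1/L)^2 + 1 of rule (a)
    return math.floor((2.0 * v * delta1 / L) ** 2 + 1.0) + 1

print(sample_size(delta1=2.0, L=1.0))   # here delta_1/(L/2) = 4
```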
5.2. Peters' Formula in Applied Physics

In experimental physics, when the number of observations is small and extreme observations may be willingly deleted, it is justified to use d_n. First, as pointed out by Brown [29, p. 178], the aim of most physical experiments is to obtain a numerical value for some quantity, under the form A ± b, with b called either precision or error, although the last term can lead to confusion. There are several types of errors that can contribute to the value of b. They are mistakes, gross errors, constant errors, systematic errors, and random errors. When only errors of random nature are considered, they are supposed to have a normal distribution. Four types of precision, which are related to various confidence levels or to various multiples of the standard deviation, are frequently considered:
(1) the precision index h (or [σ√2]⁻¹);
(2) the standard deviation (or root mean square error) σ;
(3) the average error δ₁, or "true average error", which is zero if account is taken of signs;
(4) the probable error r, or quartile deviation, which is defined as being such that the probability of making an error in the range −r to +r is the same as of making any numerically larger error.

We then have, taking σ and the normal distribution as reference,

  δ₁ = 0.798σ,  1/h = σ√2,  and  r = 0.674σ.

The precision most frequently used in experimental physics is the probable error r. With a 50% confidence level, we have the classical Peters formula,

  r = 0.8453 d_n/√(n − 1),

which was obtained via a very simple argument by Peters (see [19]) and is used mostly for small values of n. However, this formula is inaccurate, since the coefficient 0.8453 does not change with the sample size; it comes from the characteristic value of the standard normal distribution, z_{0.25} = 0.67449 (case n = ∞), multiplied by √(π/2). The precise formula is X̄ₙ ± h_{n,α/2}(d_n/√n), where h_{n,α/2} can be obtained from the tabulated values given by Herrey [21] (or the computer program obtainable from the authors). For example, for n = 8, the 50% Peters formula is X̄₈ ± 0.921(d₈/√8), whereas the 95% formula is X̄₈ ± 3.933(d₈/√8) and gives an interval about 4.27 times larger. Herrey [21] has found that the two confidence intervals, based, respectively, on d_n and on S_n (we then have X̄ₙ ± t_{n−1,α/2}(S_n/√(n − 1))), give practically the same answers.
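The following sketch contrasts Peters' classical formula with Herrey's exact multipliers for n = 8 (Python with numpy; the eight readings are illustrative, and the multipliers 0.921 and 3.933 are the tabulated values quoted above):

```python
import math
import numpy as np

x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.4, 9.7, 10.0])  # n = 8 illustrative readings
n = len(x)
d_n = np.mean(np.abs(x - x.mean()))

r_peters = 0.8453 * d_n / math.sqrt(n - 1)            # classical probable error of the mean
half_50 = 0.921 * d_n / math.sqrt(n)                  # Herrey's exact 50% half-width, n = 8
half_95 = 3.933 * d_n / math.sqrt(n)                  # Herrey's exact 95% half-width, n = 8
print(r_peters, half_50, half_95, half_95 / half_50)  # the last ratio is about 4.27
```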
5.3. Other Applications in Fields Associated with Statistics

There are several applications of δ₁ and δ₂ in fields closely associated with statistics. For example, in econometrics, δ₁ is closely associated with the Lorenz curve, where δ₁/2μ is the maximum vertical distance between that curve and the first diagonal, occurring at the abscissa F(μ). It is closely related to other measures like the Gini index and the Pietra ratio used in measuring income distributions [30]. A similar conclusion applies for δ₂ on the Lorenz curve, where δ₂/2μ is the vertical distance between the diagonal and the curve, occurring at the abscissa F(Md) = 1/2. Since the Lorenz curve has close relations with the total time on test transform curve in reliability theory [10], corresponding results for this curve in relation with δ₁ can also be established. In engineering statistics, the main use of δ₁ is to characterize extreme value distributions [31], and in Bayesian statistics, Pham-Gia et al. [13] have used it to elicit the beta prior distribution.

Concerning δ₂, in descriptive statistics it has been used effectively to measure skewness and kurtosis and to partially order distributions [15]. For example, the Bowley coefficient of skewness of a distribution is b₁ = (Q₃ + Q₁ − 2Q₂)/(Q₃ − Q₁), where Qᵢ, 1 ≤ i ≤ 3, are the quartiles of the distribution. It can be extended to b₂ = {F⁻¹(1 − α) + F⁻¹(α) − 2Md}/[F⁻¹(1 − α) − F⁻¹(α)], where F⁻¹ is the quantile function. Noticing that δ₂ = ∫₀^{1/2} [F⁻¹(1 − α) − F⁻¹(α)] dα, another measure of skewness can be defined very simply by b₃ = (μ − Md)/δ₂, obtained by integrating the numerator and denominator of b₂ from 0 to 1/2. This measure, b₃, maintains the ≺_c ordering, i.e., if F_X ≺_c F_Y, then b₃(X) ≤ b₃(Y), where, for two distributions F and G, we write F ≺_c G (i.e., F c-precedes G) if G⁻¹(F(x)) is convex, i.e., G(x) is at least as skewed to the right as F(x). On the other hand, Bickel and Lehmann [32] found that δ₂ outperforms the trimmed deviations for thin-tailed distributions, when the standard asymptotic variance is taken as comparison criterion. In decision theory, δ₂ is closely associated with the absolute error loss function [33], in either a two-person or a multiperson game, and in economics (income distribution and Lorenz curve), results similar to those related to δ₁ are valid for δ₂, when the median, instead of the mean, is considered in heavily skewed distributions.
6. ASYMPTOTIC CONSIDERATIONS

Although several sampling distributions considered here cannot be given in a closed form for n finite, where n is the sample size, we have the following results for n → ∞.

THEOREM 2 [22]. If F is differentiable in a neighborhood of μ with derivative f(μ) > 0 and variance σ², then:
(1) if F(μ) ≠ 1/2, √n(d_n − d_n*)/[2F(μ) − 1] → N(0, σ²) in distribution;
(2) if F(μ) = 1/2, √n(d_n − d_n*) → f(μ)U² − UV in distribution, where (U, V) is bivariate normal with mean zero and covariance matrix

  Σ = ( σ²  δ₁
        δ₁  1 ).

Concerning the sample median md_n and d_n′, we have another result, subject, however, to a Lipschitz condition on F about the median Md.

THEOREM 3. Let |F(x) − F(Md)| ≤ c|x − Md|^ρ for some c > 0 and 0 < ρ ≤ 1, and let F have a unique median Md. If √n(md_n − Md) is bounded in probability, then √n(d_n′ − δ₂) → N(0, τ²) in distribution, where τ² = E(X − Md)² − δ₂².

Concerning d_n′ and d_n°, we have the following theorem.

THEOREM 3′. If F has a continuous derivative f in a neighborhood of Md and f(Md) > 0, then n(d_n′ − d_n°) = n(md_n − Md)²(f(Md) + o(1)) + O(n^{−1/4}(log n)^{3/4}). As a consequence, 4nf(Md)(d_n′ − d_n°) → χ²₁ in distribution.

Proofs of the above theorems can be found in [22], where Edgeworth expansions for the distributions of W_n, W_n′, h_n, and h_n′ are also given.
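Theorem 2(1) lends itself to a simple Monte Carlo check. In the sketch below (Python with numpy; the exponential population, with mean 1, is an illustrative choice satisfying F(μ) = 1 − e⁻¹ ≠ 1/2 and σ² = 1):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1_000, 5_000
x = rng.exponential(1.0, size=(reps, n))          # mu = sigma = 1, F(mu) = 1 - 1/e

d_n    = np.mean(np.abs(x - x.mean(axis=1, keepdims=True)), axis=1)
d_star = np.mean(np.abs(x - 1.0), axis=1)         # mu known
z = math.sqrt(n) * (d_n - d_star) / (2.0 * (1.0 - math.exp(-1.0)) - 1.0)
print(z.mean(), z.var())                          # approximately 0 and sigma^2 = 1
```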
7. THE DISPERSION FUNCTION: A GENERALIZATION OF THE MEAN ABSOLUTE DEVIATION

More generally, the mean absolute deviation of a random variable about an arbitrary point a, denoted δ_a(X), has been studied in [13], where it was related to the first moment distribution of F, denoted F⁽¹⁾, and used for the elicitation of the prior distribution, in a Bayesian context. More generally, Munoz-Perez and Sanchez-Gomez [34] considered the following case. For X ∈ L¹, let D_X(u) = E(|X − u|) be a function of u ∈ R, called the dispersion function of X. We have a very simple relation between F and the derivative of D_X, at points of continuity of F: F(u) = (1 + D_X′(u))/2.

THEOREM 4. D_X has the following properties:
(1) D_X is differentiable and convex;
(2) lim_{u→∞} D_X′(u) = 1 and lim_{u→−∞} D_X′(u) = −1;
(3) lim_{u→∞} (D_X(u) − u) = −E(X) and lim_{u→−∞} (D_X(u) + u) = E(X).

Conversely, a function D having the above properties is the dispersion function of a unique distribution. For a proof, see [34]. Considering the degenerate distribution G_u(x) = 1_{[u,∞)}(x), it can be established that D_X(u) = ∫_{−∞}^{∞} |F_X(x) − G_u(x)| dx = ∫₀¹ |F_X⁻¹(t) − u| dt, where F_X⁻¹(t) is the quantile function, as defined previously. Hence, D_X(u) is the L¹ distance between F_X and G_u at the point u, or, equivalently, between F_X⁻¹ and G_u⁻¹ ≡ u. The variance σ², in turn, is the L¹-distance between D_X(u) and G_μ(u) = |μ − u|, i.e., we have ∫_{−∞}^{∞} [D_X(u) − G_μ(u)] du = σ².
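Both the derivative relation F(u) = (1 + D_X′(u))/2 and the variance identity can be checked numerically. The sketch below (Python, scipy assumed) uses the closed form D_X(u) = 2φ(u) + u(2Φ(u) − 1) for a standard normal X, with φ and Φ the standard normal density and cdf:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

X = stats.norm()

def D(u):
    # dispersion function E|X - u| of a standard normal: 2*phi(u) + u*(2*Phi(u) - 1)
    return 2.0 * X.pdf(u) + u * (2.0 * X.cdf(u) - 1.0)

u0, h = 0.7, 1e-5
print((D(u0 + h) - D(u0 - h)) / (2 * h), 2.0 * X.cdf(u0) - 1.0)   # D' = 2F - 1

area = (quad(lambda u: D(u) - abs(u), -np.inf, 0)[0]
        + quad(lambda u: D(u) - abs(u), 0, np.inf)[0])
print(area)                                                       # Var(X) = 1
```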
8. THE MEDIAN ABSOLUTE DEVIATION
For the median absolute deviation (about the median), results presented in the literature are mostly asymptotic in nature. Here, we only discuss the most important ones. Let Md be the median of the distribution F. For simplicity, let us consider the case where F is strictly increasing and there is a unique value for its median. For symmetric distributions, we naturally have Md = μ. For finite populations, or a sample, we define md_n = median(X₁, …, X_n) = X_{k+1:n} if n = 2k + 1, and md_n = aX_{k:n} + (1 − a)X_{k+1:n}, 0 ≤ a ≤ 1, if n = 2k (in general, we take a = 1/2), where X_{i:n}, i = 1, …, n, is the ordered sample.
8.1. The Population Median (Md) and the Population Median Absolute Deviation (or MAD), Denoted by λ

Since the median cannot usually be given in closed form, there are relatively few analytical results concerning the median. For positively skewed distributions, we have mo ≤ Md ≤ μ, where mo is the mode. The reverse order applies if the distribution is negatively skewed. In general, we can say that, for an arbitrary population, the median is a highly robust measure of location.

An important application of the median is in centering and symmetrization of random variables. In the statistical literature, a centered variable Y is usually centered at the population mean, i.e., Y = X − μ_X, and hence E(Y) = 0. Using the same concept with the median Md, we have median-centered random variables, an idea that, according to [6, p. 255], "not only completes the first one, but also tends to replace it altogether". A random variable W is said to be symmetric if, ∀w, P(W ≤ w) = P(W ≥ −w), or equivalently, F(−w + 0) = 1 − F(w). Then, Md(W) = 0 and, if the density f(w) exists, it is symmetrical w.r.t. the vertical axis.

In symmetrization, we assign to any r.v. X, with median Md(X), the r.v. Xˢ = X − Y, where Y is independent of X but has the same distribution. We then have Xˢ symmetric, with E(Xˢ) = Md(Xˢ) = 0 and E(|Xˢ|) = δ₁(Xˢ) = δ₂(Xˢ) = Δ, where Δ is the classical mean difference of X, which is "the average of the differences between all possible pairs of values taken by X, regardless of sign". Moreover, the characteristic function of Xˢ is |φ|², where φ is the characteristic function of X. Without knowing the value of Md(X), probabilities of quantities relating X to Md(X) can still be related to those of Xˢ itself, and to those of X − a, ∀a ∈ R. For example, we have P(|X − Md(X)| ≥ ε) ≤ 2P(|Xˢ| ≥ ε) ≤ 4P(|X − a| ≥ ε/2) and E(|X − Md(X)|^r) ≤ 2E(|Xˢ|^r) ≤ 4C_r E(|X − a|^r), where C₁ = 1 and C_r = 2^{r−1} for r > 1, which gives, for r = 1,

  δ₂(X) ≤ 2Δ ≤ 4δ_a(X),  ∀a ∈ R

(see [6, p. 257]). In the theory of Lorenz curves, we know that Δ = 4μA, where A is the area between the Lorenz curve of X and the first diagonal.

We define the population median absolute deviation (MAD) by λ = Median(|X − Md|), which reduces to σΦ⁻¹(3/4) for the normal case, where Φ is the cumulative standard normal distribution.
8.2. The Sample Median md_n and Sample Median Absolute Deviation (or mad_n), Denoted by ℓ_n

Let X₁, …, X_n be a random sample from a distribution F, and let us define md_n as in the case of a finite population. For normal F, the efficiency is Var(X̄ₙ)/Var(md_n) = 2/π = (δ₁/σ)² ≈ 0.637. However, it is found that this efficiency is higher in sampling from other distributions, reaching 2 for the Laplace and ∞ for the Cauchy. Under the imperceptible contamination raised by Tukey [26], this efficiency is practically one, and it is higher for scale-contaminated normal distributions and other heavy-tailed distributions. For a general distribution F with density f continuous at Md, md_n is asymptotically normal with mean Md and variance [4nf²(Md)]⁻¹, which reduces to (σ²/n)(π/2) for the normal case. ARE(md_n | X̄ₙ) is hence 4σ²f²(Md). These efficiencies can be checked by simulation, as sketched below.

The estimation of the population median from an ordered sample of observations is an important topic of statistics, about which several results have been reported in [35]. In general, the sample median is an estimator of location which is less sensitive to heavy tails than the sample mean.
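A simulation sketch of these efficiencies (Python with numpy; n = 101 and the replication count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 101, 50_000

for name, x in [("normal", rng.normal(0.0, 1.0, (reps, n))),
                ("laplace", rng.laplace(0.0, 1.0, (reps, n)))]:
    eff = x.mean(axis=1).var() / np.median(x, axis=1).var()
    print(name, eff)          # about 2/pi = 0.637, and about 2, respectively
```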
If X is normal, then X̄ₙ and md_n − X̄ₙ = median(X₁ − X̄ₙ, …, Xₙ − X̄ₙ) are independent, with Var(md_n − X̄ₙ) ≤ Var(md_n). This result is the counterpart of the one already mentioned, on the independence between X̄ₙ and g(X₁ − X̄ₙ, …, Xₙ − X̄ₙ).
The sample median absolute deviation (from the sample median) is mad_n, denoted by ℓ_n = median(|Xᵢ − md_n|, i = 1, …, n). According to Hampel [36], ℓ_n is the natural nonparametric estimator of the "probable error" r of a single observation, already encountered in the section on experimental physics. It is also the scale equivalent of the median and is the most robust estimate of scale. He also proved that ℓ_n is superior to the sample interquartile range and that the two are equivalent when the population distribution is symmetric. Possible uses of ℓ_n include a rough but fast scale estimate in cases where no higher accuracy is required (see [36]).
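As an illustration of ℓ_n as a fast robust scale estimate, the following sketch (Python, scipy assumed; ten gross errors are planted in an otherwise normal sample) compares ℓ_n/Φ⁻¹(3/4) with the sample standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(10.0, 2.0, size=1_000)
x[:10] = 1e6                                    # ten gross errors

ell_n = np.median(np.abs(x - np.median(x)))
print(ell_n / stats.norm.ppf(0.75), np.std(x))  # about 2, versus a ruined standard deviation
```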
Let us consider the case where there is a unique median Md for the distribution F, let us consider the r.v. W = |X − Md|, and let us suppose that W, too, has a unique median, denoted λ as previously (we have G(λ) = 1/2, where G(x) = P(W ≤ x)). Hall and Welsh [37] then proved that ℓ_n converges almost surely toward λ if F is continuous at Md ± λ. Furthermore, a central limit-type result is valid for ℓ_n.

THEOREM 5 (Hall and Welsh). Suppose that F′(Md) exists and is positive, and that F′(Md ± x) exists for x in a neighborhood of the origin and is continuous at x = 0. If g(x, Md) = F′(Md + x) + F′(Md − x) is positive on a neighbourhood of λ, then √n(ℓ_n − λ) → N(0, e²) in distribution, where e² = S + V, with S and V being functions of F, F′, λ, and Md.

As a consequence, considering the population median Md, if g(λ, Md) is positive, then √n(ℓ_n* − λ) converges in distribution toward N(0, [4g²(λ, Md)]⁻¹), where ℓ_n* = median(|Xᵢ − Md|, i = 1, …, n). Furthermore, if F(Md + λ) = 1 − F(Md − λ) and F′(Md + λ) = F′(Md − λ), then √n(ℓ_n − ℓ_n*) → 0 in probability and √n(ℓ_n − Q_n) → 0 in probability, where Q_n = (ξ̂_{3/4} − ξ̂_{1/4})/2 is the semi-interquartile range, with ξ̂_p being the sample pth quantile, 0 < p < 1. The asymptotic equivalence of ℓ_n and the sample semi-interquartile range for n → ∞ is hence established. For a normal population, we also have the following results:
(1) X̄ₙ − md_n → 0, a.s.;
(2) ℓ_n − S_n Φ⁻¹(3/4) → 0, a.s., for n → ∞, where Φ is the standard normal cdf.

Furthermore, there are interesting asymptotic properties of the distribution of the couple (md_n, ℓ_n), which is a robust counterpart of (X̄ₙ, S_n*), where S_n* = √(n/(n − 1)) S_n. Some of these results parallel those of (X̄ₙ, S_n*). For example, let us consider the following known property of the latter couple.

THEOREM 6. Let X be such that E(X⁴) is finite. Then, (X̄ₙ, S_n) is asymptotically binormal, i.e., √n(X̄ₙ − μ, S_n − σ) → N(0, σ₁², σ₂², σ₁₂) in distribution, where σ₁² = σ², σ₂² = [E(X − μ)⁴ − σ⁴]/4σ², and σ₁₂ = E[(X − μ)³]/2σ. Hence, X̄ₙ and S_n are asymptotically independent iff E(X − μ)³ = 0, i.e., for symmetrical distributions.

Correspondingly, for (md_n, ℓ_n), we have the following theorem.

THEOREM 7 (Falk). Suppose F is continuous, differentiable at F⁻¹(1/2) and at F⁻¹(1/2) ± λ, and that f(F⁻¹(1/2)) > 0. Then, √n(md_n − F⁻¹(1/2), ℓ_n − λ) → N(0, ρ₁², ρ₂², ρ₁₂) in distribution, where ρ₁, ρ₂, and ρ₁₂ are functions of f, F, F⁻¹, and λ.

PROOF. See [38], where complex expressions of ρ₁, ρ₂, and ρ₁₂ are given.

From the above theorem, we have the asymptotic independence of md_n and ℓ_n if a certain condition is verified, a condition which is satisfied when X is symmetric w.r.t. its unique median F⁻¹(1/2). A similar result can be proved for the sample interquartile range X_{⌈3n/4⌉:n} − X_{⌈n/4⌉:n}, where ⌈x⌉ = inf{k ∈ N, k ≥ x} is the right integral neighbour of x. The above theorem can also be seen as an extension of Theorem 5 by Hall and Welsh [37].
9. CONCLUSION

The purpose of this article is to give an up-to-date and unified presentation of results related to several dispersion measures and sampling statistics, all denoted by MAD in the literature. Difficulties in establishing analytic expressions for the former and exact sampling distributions for the latter have prevented the full development of these topics for the past several decades. Hopefully, with the renewed interest in robustness and in the L¹-approach to statistics, new progress will be made.
APPENDIX

Godwin [16] gave the following expression for the density of the sample mean absolute deviation d_n, denoted m here, for a standard normal N(0,1) population (mean zero and variance unity). Let G₀(z) = 1 and

  G_r(z) = ∫₀^z exp[−(r + 1)t²/(2r)] G_{r−1}(t) dt,  r = 1, 2, … .

Setting

  g_r(x) = G_r(x) exp[−(r + 1)x²/(2r)],

we have

(a)
  f_n(m) = [n^{3/2} / (2^{(n+1)/2} π^{(n−1)/2})] Σ_{k=1}^{n−1} (n choose k) g_{n−k−1}(y),    (A1)

where n is the sample size. Percentage points were computed by Hartley [39] for 1 ≤ r ≤ 10, and Godwin [17] provided another proof of the above expression, using geometric considerations. He also provided expressions for the moments of m, based on a recurrence relation, and proposed some approximations for (A1). For large n, the distribution of m approaches normality.

(b) The density of H_n = (X̄ₙ − μ)/m, as given by Herrey [19], is a recurrence integral expression (A2) built from the f_n of (A1); critical values for this density, f_n(h), are given by Herrey [21].

(c) Let X ~ N(0, δ₁) and let W = X/(δ₁√(π/2)). We then have

  d_n(X) = Σᵢ₌₁ⁿ |Xᵢ − X̄ₙ|/n = δ₁√(π/2) d_n(W).

Let Y = n d_n(X)/δ₁. Then, Y = n√(π/2) d_n(W). Hence, by (A1), we have the density of Y as

  g_n(y) = (√(2/π)/n) f_n(√(2/π) y/n),  y > 0.    (A3)
(A3)
Absolute
Deviations
935
REFERENCES 1. Y. Dodge,
Editor,
Statisticnl Data Analysis Based on the L’-Norm
and Related Methals,
North-Holland,
Amsterdam, (1987). 2. J.G. Bassett and R. Koenker, Asymptotic theory of least absolute error regression, Joum. Amer. Stat. Assoc. 73, 618-622, (1978). 3. P. Bloomfield and W.S. Steiger, Least Absolute Deviations: Theory, Applications and Algorithms, Birkhiiuser, Boston, (1983). 4. S. Portnoy and R. Koenker, The GaUssian hare and the Laplacian tortoise: Computability of square-error versus absolute-error estimators, Statistical Science 2 (4), 279-300, (1997). 5. Z.D. Bai, C.R. Rae and Y.Q. Yin, Least absolute deviations analysis of variance, Sankhya, Series A 52, 166-177, (1990). 6. M. Lo&e, Probability Theory I, 4 th edition, Springer-Verlag, New York, (1977). 7. A. Stuart and J.K. Ord, Kendall’s Advanced Theory of Statistics, Vol. I, Oxford University Press, New York, (1987). 8. M. De Groot, Optimal Statistical Decisions, McGraw-Hill, New York, (1970). 9. V.K. Rothagi, An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, New York, (1976). 10. T. Pham-Gia and N. Turkkan, The Lorenz and T4-curves: A unified approach, IEEE i’+ans. on Reliability, 76-84, (1994). 11. P. Diaconis and S. ZabeII, Closed form summation for classical distributions: Variations on theme of de Moivre, Statistical Science 6 (3), 284-302, (1991). 12. A.R. Kamat, A property of the mean deviation of the Pearson-type distributions, Biometrika 53, 287-289,
(1966). 13. T. Pham-Gia, N. Turkkan and Q.P. Duong, Using the mean deviation to determine the prior distribution, Stat. and Prob. Letters 13, 373-381, (1992). 14. B.D. Causey, Expected absolute departure of chi-square from its median, Commun. Statist.-Simula. 15 (l), 181-183, (1986). 15. R.A. Groeneveld and G. Meeden, Measuring skewness and kurtosis, The Statistician 33, 391-399, (1984). 16. H.J. Godwin, On the distribution of the estimate of mean deviation obtained from samples from a normal population, Biometrika 33, 254-257, (1943-1946). 17. H.J. Godwin, A further note on the mean deviation, Biometrika 34, 304-309, (1947). 18. R.C. Geary, Moments of the ratio of the mean deviation to the standard deviation for normal samples, Biometrika 28, 295-305, (1936). 19. E.M. Herrey, Confidence intervals based on the mean deviation of a normal sample, Journ. Amer. Stat. Assoc., 257-269, (1965). 20. D.D. Boos and Hughes-Oliver, Applications of Bssu’s theorem, The American Statistician 52 (3), 218-221, (1998). 21. E.M. Herrey, Percentage points of H-distributions for computing confidence limits or performing t-tests by way of the mean absolute deviation, Journ. Amer. Stat. Assoc. 66, 187-188, (1971). 22. G.J. Babu and C.R. Rae, Expansions for statistics involving the mean absolute deviation, Ann. Inst. Statist. Math. 44, 387-403, (1992). 23. J.H. Cadwell, The statistical treatment of mean deviation, Biometrika 41, 12-18, (1954). 24. O.L. Davies and ES. Pearson, Methods of estimating from samples the population standard deviation, Journ.
Royal Stat. Sot., Suppl. 1, 76-93, (1934). 25. D. Teichroew, The mixture of normal distributions with different variances, Ann. of Math. Stat. 28, 51&513, (1957). 26. J.W. Tukey, A survey of sampling from contaminated distributions, In Contributions to Probability and Statistics, (Edited by I. Olkin), Paper 39, pp. 448-485, The Stanford Press, Stanford, CA, (1960). 27. P.J. Huber, Robust Statistical Procedures, Sot. Ind. & Appl. Math., (1979). 28. G. Suzuki, A consistent estimator for the mean deviation of the Pearson-type distribution, Annals Inst. Stat. Math. 17, 271-285, (1966). 29. A.F. Brown, Statistical Physics, American Elsevier, New York, (1970). 30. T. Pham-Gia and N. Turkkan, Determination of the beta distribution from its Lorenz curve, Mathl. Cornput. Modelling 16 (2), 73-84, (1992). 31. E.J. Gumbel, Statistics of Extremes, Columbia University Press, (1958). 32. P.J. Bickel and E.L. Lehman, Descriptive statistics for nonparametric models III. Dispersion, Ann. Statist. 6, 1139-1158, (1976). 33. T. Pham-Gia, Some applications of the Lorenz curve in decision analysis, Amer. Journ. of Math. and Manag. Sciences 15, l-34, (1995). 34. J. Munoz-Perez and Sanchez-Gomez, A characterization of the distribution function: The dispersion function, Stat. and Prob. Letters 10, 235-239, (1990). 35. D.F. Andrews, P.J. Bickel, F.R. HampeI, P.J. Huber, W.H. Rogers and J.W. Tukey, Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton, NJ, (1972). 36. F.R. Hampel, The influence curve and its role in robust estimation, Joum. Amer. Stat. Assoc. 69, 383-393, (1974).
37. P. Hall and A.H. Welsh, Limit theorems for the median deviation, Ann. Inst. Statist. Math., Part A 37, 27-36, (1985).
38. M. Falk, Asymptotic independence of median and MAD, Stat. and Prob. Letters 34, 341-345, (1997).
39. H.O. Hartley, Note on the calculation of the distribution of the estimate of mean deviation in normal samples, Biometrika 33, 257-265, (1943-1946).