Computational Statistics and Data Analysis 56 (2012) 4327–4337
Nonparametric estimation of location and scale parameters

C.J. Potgieter a,b,∗, F. Lombard c

a Institute for Applied Mathematics and Computational Science, Texas A&M University, United States
b Department of Statistics, University of Johannesburg, South Africa
c Centre for Business Mathematics and Informatics, North-West University, Potchefstroom, South Africa

∗ Corresponding author at: Institute for Applied Mathematics and Computational Science, Texas A&M University, United States. E-mail addresses: [email protected] (C.J. Potgieter), [email protected] (F. Lombard).
Article history: Received 29 April 2011; received in revised form 27 March 2012; accepted 28 March 2012; available online 27 April 2012.

Keywords: Location-scale families; Asymptotic likelihood; Nonparametric estimation
Abstract

Two random variables X and Y belong to the same location-scale family if there are constants µ and σ such that Y and µ + σX have the same distribution. In this paper we consider non-parametric estimation of the parameters µ and σ under minimal assumptions regarding the form of the distribution functions of X and Y. We discuss an approach to the estimation problem that is based on asymptotic likelihood considerations. Our results enable us to provide a methodology that can be implemented easily and which yields estimators that are often near optimal when compared to fully parametric methods. We evaluate the performance of the estimators in a series of Monte Carlo simulations.
1. Introduction

Let X and Y be independent random variables and µ and σ > 0 be constants such that Y has the same distribution as µ + σX. Given independent observations X_1, . . . , X_n on X and Y_1, . . . , Y_m on Y, is it possible to estimate µ and σ efficiently without knowing the form of the density of X? Here, "efficiently" has its usual meaning, namely that the asymptotic variance of the estimator equals the asymptotic variance of the maximum likelihood estimator that would have been used had the parametric form of the underlying density been known.

It has long been known that the answer to the above question is "yes". Weiss and Wolfowitz (1970), Wolfowitz (1974) and Park (1990), amongst others, constructed asymptotically efficient estimators of µ and σ under various sets of assumptions regarding the smoothness of the density of X without specification of its parametric form. These estimators are not readily susceptible to practical application for a number of reasons, one of which is that the derivative(s) of the underlying density functions must be estimated. The latter estimation involves particular choices of smoothing parameters of which only the asymptotic behavior is specified. While it is in principle possible to make optimal choices of these parameters by numerical means (see, e.g., Park (1990, Section 2)), the extent of numerical work involved in producing estimates of µ and σ would tend to preclude routine application of the methodology.

Parzen (1979, Section 10) discovered a formulation of the problem which implied that µ and σ could be estimated as the intercept and slope respectively of a straight line in a heteroscedastic regression setup. In essence, Parzen's (1979) idea was the following. If F and G denote the (continuous, strictly increasing) distribution functions of X and Y, then the corresponding inverse functions satisfy the relation

G^{-1}(t) = µ + σ F^{-1}(t),  0 < t < 1.   (1)
If F^{-1} and G^{-1} are replaced by their empirical counterparts, the quantile functions F̃^{-1} and G̃^{-1}, then the relationship

G̃^{-1}(t) = µ + σ F̃^{-1}(t) + ϵ(t) + o_p(n^{-1/2}),

where ϵ(t) is an error term, follows from (1). Since t is a free parameter, one can now estimate µ and σ by regressing G̃^{-1}(t_j) upon F̃^{-1}(t_j) with a suitable choice of t_j, j = 1, . . . , k.
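To make the regression idea concrete, the following sketch (simulated data; the values µ = 2 and σ = 1.5 are made up for illustration) recovers intercept and slope by ordinary least squares on a few empirical quantiles, ignoring the heteroscedastic error structure that the method of Section 2 accounts for. Note that numpy's default linear interpolation in np.quantile coincides with the continuous empirical quantile functions defined in the next paragraph.

```python
import numpy as np

# Toy illustration of Parzen's regression idea: the Y-quantiles are
# approximately linear in the X-quantiles, with intercept mu and slope sigma.
rng = np.random.default_rng(1)
x = rng.standard_t(3, size=200)              # sample from F
y = 2.0 + 1.5 * rng.standard_t(3, size=300)  # Y distributed as mu + sigma*X

k = 8
t = np.arange(1, k + 1) / (k + 1)            # t_j = j/(k+1)
A = np.column_stack([np.ones(k), np.quantile(x, t)])
# Ordinary (unweighted) least squares fit of G~^{-1}(t_j) on F~^{-1}(t_j):
mu_hat, sigma_hat = np.linalg.lstsq(A, np.quantile(y, t), rcond=None)[0]
```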
The empirical distribution functions of the X and Y data are denoted by F̃ and G̃ respectively, namely F̃(x) = 0 for x < X(1), F̃(x) = 1 for x > X(n) and F̃(x) defined by linear interpolation between X(1), . . . , X(n),

F̃(X(j)) = (j − 1)/(n − 1),  j = 1, . . . , n,

with a similar definition for G̃(x). F̃^{-1} and G̃^{-1} will denote the corresponding continuous empirical quantile functions, which are uniquely defined by the relations F̃(F̃^{-1}(x)) = G̃(G̃^{-1}(x)) = x.

We set λ̃ = n/(n + m) and assume throughout in the asymptotic theory that λ̃ → λ, 0 < λ < 1. The dependence of various quantities upon sample sizes is not indicated by subscripts but rather by using a tilde, as in F̃ and λ̃ above. The same symbol without a tilde will denote the limit of the quantity when the sample sizes tend to infinity. The transpose operation on vectors and matrices is indicated by the symbol ⊤.

2. The asymptotic likelihood estimator
Define ϵ_n(t) = F̃^{-1}(t) − F^{-1}(t) and δ_m(t) = G̃^{-1}(t) − G^{-1}(t). From David (1981, Theorem 9.2), for fixed 0 < t_1 < · · · < t_k < 1, the random vectors {n^{1/2} f(F^{-1}(t_j)) ϵ_n(t_j), j = 1, . . . , k} and {m^{1/2} g(G^{-1}(t_j)) δ_m(t_j), j = 1, . . . , k} converge in distribution as m, n → ∞ to multivariate normal distributions with common covariance function

Σ_ij = min{t_i, t_j} − t_i t_j.   (2)
Thus, the order statistics X(1) < · · · < X(n) and Y(1) < · · · < Y(m) are essentially the true quantiles with added noise:

X(j+1) = F̃^{-1}(j/(n − 1)) = F^{-1}(j/(n − 1)) + ϵ_n(j/(n − 1)),
Y(k+1) = G̃^{-1}(k/(m − 1)) = G^{-1}(k/(m − 1)) + δ_m(k/(m − 1))
       = µ + σ F^{-1}(k/(m − 1)) + δ_m(k/(m − 1))

for j = 0, . . . , n − 1 and k = 0, . . . , m − 1. We see that there are m + n observations and m + n − 2 unknown parameters

F^{-1}(j/(n − 1)), j = 1, . . . , n − 2;  G^{-1}(k/(m − 1)), k = 1, . . . , m − 2;  µ, σ,

m + n − 4 of which depend on the respective sample sizes. Clearly, one can expect consistent estimation of µ and σ in such a context to be impeded because of the surfeit of parameters. However, suppose we fix a small (relative to m and n) positive integer k and numbers n^{-1} < t_1 < · · · < t_k < 1 − n^{-1} and regard

G̃^{-1}(t_1), . . . , G̃^{-1}(t_k), F̃^{-1}(t_1), . . . , F̃^{-1}(t_k)   (3)
as the data. Then there are k + 2 parameters

θ = (F^{-1}(t_1), . . . , F^{-1}(t_k), µ, σ)^⊤,

none of which depend on m or n, so that consistent estimation is now possible. While some information will certainly have been lost in reducing the full data set to that in (3), our results will show that the loss can be relatively small asymptotically and in finite samples. In what follows we set

φ_j = F^{-1}(t_j),  j = 1, . . . , k,

to emphasize that the F^{-1}(t_j) are regarded as unknown parameters. Accordingly, we write φ = (φ_1, . . . , φ_k)^⊤ and θ = (φ^⊤, µ, σ)^⊤. Let f̃ and g̃ denote consistent estimators of the densities f and g and set

W_1(θ) = [g̃(G̃^{-1}(t_j))(G̃^{-1}(t_j) − µ − σφ_j), j = 1, . . . , k]^⊤,
W_2(θ) = [f̃(F̃^{-1}(t_j))(F̃^{-1}(t_j) − φ_j), j = 1, . . . , k]^⊤

and V(θ) = [W_1(θ)^⊤, W_2(θ)^⊤]^⊤.
Then, using the convergence result stated above, we see that (mλ̃)^{1/2} V(θ) converges in distribution as m → ∞ to a zero-mean, 2k-variate normal distribution with covariance matrix

Ω = [ λΣ      0
      0       (1 − λ)Σ ]   (4)

where Σ is given in (2). The component of the asymptotic log-likelihood of V(θ) involving the k + 2 parameters is

Q(θ) = V(θ)^⊤ Ω^{-1} V(θ)/2   (5)
and we propose to estimate θ, hence µ and σ, by minimizing

Q̃(θ) = V(θ)^⊤ Ω̃^{-1} V(θ),

where Ω̃ is Ω with λ replaced by λ̃. The resulting estimator is invariant with respect to a relabeling of the X and Y data because Q in (5) is invariant under such a relabeling. The minimizer θ̂ of Q̃(θ) cannot be expressed in closed form but can be found easily using generally available optimization software. Alternatively one can use the nonlinear least squares algorithm outlined in Section A.1 of the Appendix. Some details of this algorithm are required to prove the result in (6). A Matlab program that implements the algorithm can be obtained upon request from the corresponding author.

Denote the components of θ̂ by φ̂_1, . . . , φ̂_k, µ̂, σ̂. We call µ̂ and σ̂ the AL (asymptotic likelihood) estimators of µ and σ and show in Section A.2 of the Appendix that for fixed k,

(mλ̃)^{1/2} (µ̂ − µ, σ̂ − σ)^⊤ → N_2(0, σ² J^{-1})   (6)
in distribution as n, m → ∞ where the elements of the matrix J are
(fi − fi−1 )2 ; t − ti−1 1≤i≤k+1 i F −1 (ti )fi − F −1 (ti−1 )fi−1 (fi − fi−1 ) J12 = ti − ti−1 1≤i≤k+1 −1 − F (ti )fi − F 1 (ti−1 )fi−1 2 J22 = , ti − ti−1 1≤i≤k+1 with fi = f F −1 (ti ) , t0 = 0, tk+1 = 1 and f0 = fk+1 = 0. J11 =
(7)
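For concreteness, here is a minimal sketch (not the authors' Matlab implementation) of the AL estimator obtained by direct numerical minimization of Q̃(θ). It uses scipy's Gaussian kernel density estimator with its default bandwidth rather than the Silverman rule adopted in Section 4, and a general-purpose optimizer in place of the nonlinear least squares recursion of the Appendix; the function name al_estimate is illustrative only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

def al_estimate(x, y, k=9):
    """Estimate (mu, sigma) by minimizing Q~(theta) at t_j = j/(k+1)."""
    n, m = len(x), len(y)
    lam = n / (n + m)                                # lambda-tilde
    t = np.arange(1, k + 1) / (k + 1)
    Fq, Gq = np.quantile(x, t), np.quantile(y, t)    # empirical quantiles
    f_hat = gaussian_kde(x)(Fq)                      # f~ at F~^{-1}(t_j)
    g_hat = gaussian_kde(y)(Gq)                      # g~ at G~^{-1}(t_j)
    S = np.minimum.outer(t, t) - np.outer(t, t)      # Sigma from (2)
    Z = np.zeros((k, k))
    Om_inv = np.linalg.inv(np.block([[lam * S, Z], [Z, (1 - lam) * S]]))

    def Q(theta):                                    # Q~(theta) = V' Om~^{-1} V
        phi, mu, sigma = theta[:k], theta[k], theta[k + 1]
        V = np.concatenate([g_hat * (Gq - mu - sigma * phi),
                            f_hat * (Fq - phi)])
        return V @ Om_inv @ V

    # Starting values as in Appendix A.1 (iqr/median based):
    s0 = (np.quantile(y, 0.75) - np.quantile(y, 0.25)) / \
         (np.quantile(x, 0.75) - np.quantile(x, 0.25))
    m0 = np.median(y) - s0 * np.median(x)
    phi0 = (1 - lam) * (Gq - m0) / s0 + lam * Fq
    res = minimize(Q, np.concatenate([phi0, [m0, s0]]), method="Nelder-Mead")
    return res.x[k], res.x[k + 1]                    # (mu-hat, sigma-hat)
```

With samples x and y, al_estimate(x, y, k=9) returns the AL(9) estimates; the choice of optimizer is a convenience and any routine that minimizes Q̃ may be substituted.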
If max{|t_{i+1} − t_i|} → 0 as k → ∞ and if f is sufficiently smooth, then J → I with elements

I11 = ∫_0^1 [∂/∂t f(F^{-1}(t))]² dt;
I12 = ∫_0^1 [∂/∂t f(F^{-1}(t))] [∂/∂t F^{-1}(t)f(F^{-1}(t))] dt;
I22 = ∫_0^1 [∂/∂t F^{-1}(t)f(F^{-1}(t))]² dt,   (8)
which are the Fisher information bounds when the underlying distribution is known up to location and scale parameters. This result suggests that near-optimal estimation may be possible by making an appropriate choice of k. However, in view of the facts stated in the first paragraph of this section, the most appropriate choice of k is often not k = m or k = n. For those choices of k the corresponding estimators can have larger standard errors than when k is chosen much smaller. This is illustrated by the following results.

Consider independent samples of sizes n = m = 100 from an (unbeknown to the analyst) Student t_3-distribution, so that µ = 0 and σ = 1, and two schemes for finding AL estimators of µ and σ. The first scheme uses 100 equally spaced t-values {1/101, . . . , 100/101} while the second scheme uses just 8 equally spaced values {1/9, . . . , 8/9}. Ten thousand estimates of µ and σ were obtained by Monte Carlo simulation from each of the two schemes. The sample standard deviations of the statistic

(mλ̃)^{1/2} µ̂ = √50 × µ̂   (9)

were 1.44 and 1.29 respectively, while those of

(mλ̃)^{1/2} (σ̂ − 1) = √50 × (σ̂ − 1)   (10)

were 1.90 and 1.14. Thus, in this particular instance, as in many others involving distributions with heavier tails than the normal, "more is not better". On the other hand, if the underlying distribution is normal, say, then the optimal procedure would use all the data. Nonetheless, using k = 8 rather than k = 100 equally spaced values in this situation is not overly detrimental to estimation precision: the standard errors were found to be 1.07 and 0.96 respectively for µ̂, and 0.90 and 0.80 respectively for σ̂. Returning to the first example (the Student t_3 distribution), suppose that instead of using 100 equally spaced values over the interval [0, 1] one spreads them equally over the interval [1/9, 8/9]. Then the sample standard deviations of the statistics (9) and (10) were found to be 1.40 and 1.51 respectively, showing that just 8 points spread equally over [1/9, 8/9] gave a better result than 100 points spread equally over the same interval.

3. Asymptotic efficiencies

We computed, for a number of estimators and a number of underlying distributions, the asymptotic efficiencies

e = lim_{n→∞} SE_ML / SE_EST,
where SE_EST and SE_ML denote the asymptotic standard errors of the specific estimator and of the maximum likelihood estimator of µ or of σ respectively. Since all estimators considered here are asymptotically unbiased, e is an appropriate measure of asymptotic efficiency. The normal and Student t-distributions with ν = 5, 3, 2 and 1 degrees of freedom, denoted by t_ν, which have ∞, 4, 2, 1 and 0 finite integer moments respectively, were chosen to represent increasing degrees of tail thickness among symmetric distributions, while the corresponding skew Student t-distributions (Azzalini, 2005, p. 187), denoted by st_ν, were chosen to represent asymmetric distributions. When ν = ∞, st_ν is the skew normal distribution with skewness parameter δ = 0.95 (Azzalini, 2005, p. 180). Three types of estimator were considered:

(i) The method of moments (MOM) estimators σ̂ = S_Y/S_X and µ̂ = Ȳ − σ̂X̄, where S_Y, S_X, Ȳ and X̄ denote sample standard deviations and means.

(ii) A modification of the MOM estimators in (i), obtained by replacing S_Y, S_X, Ȳ and X̄ by their trimmed equivalents S_Y^α, S_X^α, Ȳ^α and X̄^α, using α = 25%, 10% and 5% trimming at both ends of the ordered data. Thus,

Ȳ^α = (m − 2⌊mα⌋)^{-1} ∑_{j=⌊mα⌋+1}^{m−⌊mα⌋} Y(j)   (11)

(S_Y^α)² = (m − 2⌊mα⌋)^{-1} ∑_{j=⌊mα⌋+1}^{m−⌊mα⌋} (Y(j) − Ȳ^α)²

with similar definitions of X̄^α and S_X^α, and where ⌊x⌋ denotes the integer part of x. These estimators are denoted by MOM(α).
Table 3.1
Asymptotic efficiency (%) of estimators for symmetric distributions.

          |      Estimation of µ      |      Estimation of σ
          |  t∞   t5   t3   t2   t1   |  t∞   t5   t3   t2   t1
MOM(25)   |  91   97   98   97   88   |  54   68   74   81   93
MOM(10)   |  96   98   95   90   64   |  79   91   95   97   88
MOM(5)    |  98   97   91   81   47   |  88   96   96   93   70
MOM       | 100   89    0    0    0   | 100   17    0    0    0
AL(3)     |  92   96   96   91   90   |  60   72   78   84   90
AL(9)     |  97   98   98   98   97   |  84   93   95   96   97
AL(19)    |  98  100  100   98  100   |  91   97   98   98  100
Table 3.2
Asymptotic efficiency (%) of estimators for asymmetric distributions.

          |      Estimation of µ      |      Estimation of σ
          | st∞  st5  st3  st2  st1   | st∞  st5  st3  st2  st1
MOM(25)   |  88   83   78   70   47   |  60   55   81   61   69
MOM(10)   |  93   84   75   63   31   |  79   86   78   75   54
MOM(5)    |  94   82   70   56   22   |  81   85   76   64   43
MOM       |  91   42    0    0    0   |  91   43    0    0    0
AL(3)     |  70   75   76   76   74   |  60   69   72   75   75
AL(9)     |  90   95   95   96   95   |  82   91   92   94   95
AL(19)    |  95   97   98   97   96   |  91   95   96   97   98
(iii) AL estimators using k = 3, 9 and 19 points, denoted by AL(k). For these estimators we used t_j = j/(k + 1), j = 1, . . . , k in the asymptotic information matrix (7). With the latter choice, these estimators can be thought of as being based on the data remaining after trimming 25%, 10% and 5% at both ends of the ordered data. Because of this, it is reasonable to compare the AL(k) estimator to the trimmed method of moments estimator with α_k = 100/(k + 1)% trimming at both ends.

The results for the symmetric distributions are given in Table 3.1 and those for the asymmetric distributions in Table 3.2. For the symmetric distributions considered, the AL(19) and AL(9) estimators of µ consistently have a higher efficiency ratio than the corresponding MOM(5) and MOM(10) estimators. On the other hand, the MOM(25) estimator of µ tends to have higher efficiency than the AL(3) estimator, with the exception occurring for an underlying normal distribution. When considering estimation of the parameter σ, there are only two instances in which the MOM(α_k) estimator outperforms the AL(k) estimator. For an underlying t_2 distribution, the MOM(10) estimator has efficiency 96% compared to 94% for the AL(9) estimator. There is also an interesting phenomenon that is notable for the underlying Cauchy distribution. Here, the efficiency of the trimmed method of moments estimators decreases as the proportion of trimming decreases, as one would expect. However, the efficiency of the asymptotic likelihood estimators increases with k. Although a larger k includes more tail-values from the underlying distribution, the inherent weighting employed by the AL estimation method results in more efficient estimators.

The picture for asymmetric distributions is very similar to the symmetric case. For most of the distributions considered here, the AL(3) estimator of µ does not perform as well as the MOM(25) estimator, while the AL(9) and AL(19) perform better, respectively, than the MOM(10) and MOM(25) estimators. When considering the parameter σ, there are only two instances in which the trimmed method of moments estimator shows better performance than the asymptotic likelihood estimator, namely the st∞ and st3 distributions with the AL(3) and MOM(25) estimators. As was to be expected, the AL(19) estimator shows the best overall asymptotic efficiency among the estimators considered here. However, the Monte Carlo simulation results in Section 5 will show that this conclusion can be misleading when finite sample sizes are involved. The poor asymptotic performance (Table 3.2) and finite sample performance (Tables 5.3a, 5.3b, 5.4a and 5.4b) of the trimmed method of moments estimators for asymmetric distributions is very noticeable.

4. Practical considerations

The approach to be followed here, regardless of the properties of the underlying distributions, is to use equally spaced t-values and Gaussian kernel estimates of the densities f and g with bandwidths h computed according to the standard formula

h = 1.06 × min{iqr/1.349, s} × l^{-1/5},

where l (= m or n) is the sample size, iqr is the sample interquartile range and s is the sample standard deviation; see Silverman (1986). Since neither of these choices is "optimal" in any sense, one then has to be content with some (unknown) loss of efficiency in exchange for the increased simplicity and practicability of the method. However, the asymptotic and finite sample results shown in Sections 3 and 5 suggest that the loss of efficiency is relatively minor and constitutes a small price to pay for having estimators that are easily calculable, robust and near-optimal asymptotically and in finite samples.

The only indeterminacy in the proposed method involves the choice of k. This is analogous to the indeterminacy involved in choosing trimming proportions when trimmed means and variances are used as robust estimators. The asymptotic results suggest that little is to be gained by choosing k in excess of 9 when the underlying distribution has heavier tails than the normal distribution. The finite sample results reported in Section 5 lend support to this view.

In response to a suggestion by a referee, we give an outline of a somewhat more elaborate procedure that could produce more efficient AL estimators. Assuming that equally spaced values t_j = j/(k + 1) are to be used, we wish to choose an "optimal" value of k from a set K = {k_1, . . . , k_p} of integers. Asymptotically (i.e. when n → ∞) the largest value of k ∈ K will always be chosen. However, it is clear from the last paragraph in Section 2 (and also from Section 5) that this is not necessarily the best choice in a finite sample setting. Let (µ̂_j, σ̂_j) denote the estimators of location and scale obtained for k = k_j, j = 1, . . . , p. One can use standard bootstrap resampling to estimate the covariance matrix of (µ̂_j, σ̂_j), say Ĉ_j, and then choose the value of k that minimizes the determinant of Ĉ_j; a code sketch of this selection rule is given at the end of this section.

A Monte Carlo study was done to examine the distribution of k values obtained in this manner. N = 1,000 independent samples {X_1, . . . , X_n} and {Y_1, . . . , Y_m} of size n = m ∈ {50, 100, 250, 500} were drawn from t-distributions with ν ∈ {1, 3, 5, ∞} degrees of freedom. The set K of possible k-values was taken to be all odd integers from k = 3 through k = 19. Thus, p = 9. Then B = 100 bootstrap samples were drawn from each of {X_1, . . . , X_n} and {Y_1, . . . , Y_m}, the estimated covariance matrices Ĉ_j, j = 1, . . . , 9 computed and the k that minimizes det(Ĉ_j) determined. Table 4 shows, for each of the four underlying distributions, the three values of k most often selected by the bootstrap procedure.

Table 4
Most frequently selected values of k ∈ K using a bootstrap procedure.

Distribution |              Sample size (n = m)
             | 50           100           250          500
Normal       | {9, 17, 19}  {15, 17, 19}  {9, 17, 19}  {11, 17, 19}
t5           | {3, 5, 7}    {5, 7, 9}     {5, 7, 9}    {9, 11, 13}
t3           | {3, 5, 7}    {3, 5, 7}     {5, 7, 9}    {7, 9, 11}
t1           | {3, 5, 7}    {3, 5, 7}     {5, 7, 9}    {5, 7, 9}

As expected, at each sample size the "optimal" k value decreases as the tail thickness of the underlying distribution increases.
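The following sketch implements the bootstrap selection rule. It reuses the hypothetical al_estimate function from the Section 2 example, with B = 100 and the candidate set K as in the study above; the name select_k is illustrative only.

```python
import numpy as np

def select_k(x, y, ks=range(3, 21, 2), B=100, seed=None):
    """Choose k in K by minimizing det of the bootstrap covariance
    matrix C-hat_j of (mu-hat_j, sigma-hat_j)."""
    rng = np.random.default_rng(seed)
    best_k, best_det = None, np.inf
    for k in ks:
        est = np.array([al_estimate(rng.choice(x, size=len(x), replace=True),
                                    rng.choice(y, size=len(y), replace=True), k)
                        for _ in range(B)])           # B bootstrap (mu, sigma) pairs
        d = np.linalg.det(np.cov(est, rowvar=False))  # det of estimated C-hat_j
        if d < best_det:
            best_k, best_det = k, d
    return best_k
```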
5. Monte Carlo simulation results

We conducted a Monte Carlo study to gauge the performance of the estimators in finite samples. N = 10,000 independent samples X_1, . . . , X_n and Y_1, . . . , Y_m (thus, µ = 0 and σ = 1) with n = m ∈ {50, 250, 500} were generated from each of the ten distributions specified in Section 3. For each sample, estimates of µ and σ were made by each of the methods described in Section 3. Letting ψ̂ below denote either µ̂ or σ̂, the standard errors of ψ̂ were estimated by

ŜE = [N^{-1} ∑_{1≤j≤N} (ψ̂_j − ψ̄)²]^{1/2}   (12)

where ψ̄ is the sample mean of the ψ̂_j values. Tables 5.1a and 5.2a show the ratios

r̂ = ŜE_ML / ŜE_EST

obtained when estimating µ and σ respectively for underlying symmetric distributions, while Tables 5.3a and 5.4a show the corresponding ratios for underlying asymmetric distributions. When the sample size was small (m = 50), there were instances where the asymptotic likelihood estimation algorithm did not converge for heavy-tailed distributions. This is indicated by n/c in the tables. Since the estimation biases were negligible in all instances, we report in the tables ratios of standard errors rather than of root mean square errors.

In Table 5.1a, the MOM(α) estimators are already very close to their asymptotic efficiencies at m = 250 and, with the exception of the MOM(25) estimator of µ, show rapid deterioration as the tail thickness of the underlying distribution increases. For the lighter-tailed distributions, the efficiencies of the AL(9) and AL(19) estimators are on average a little less than those of the corresponding trimmed method of moments estimators. However, the AL estimators show relatively little deterioration with increasing tail thickness, whereas the trimmed method of moments with α = 10% and 5% performs very poorly. Perhaps surprisingly, the MOM(25) estimator fares the best overall, followed by the AL(9) and AL(19) estimators. However, it is interesting to note that the MOM(25) estimator fares worst among all the estimators when the underlying distribution is normal.

In Table 5.2a, the MOM(5) and MOM(10) estimators of σ generally fare better than their AL(19) and AL(9) counterparts at sample size 50, while at the larger sample sizes n = 250 and n = 500, the MOM(10) and AL(9) estimators fare the best and about equally well when tail thickness is large, while MOM(5) and AL(9) are best when tails are thinner.
Table 5.1a
Finite sample efficiency of estimators of µ (symmetric distributions).

          |       m = n = 50          |       m = n = 250         |       m = n = 500
          |  t∞   t5   t3   t2   t1   |  t∞   t5   t3   t2   t1   |  t∞   t5   t3   t2   t1
MOM(25)   |  78   92   96   94   81   |  83   96   98   98   78   |  84   96   98   97   78
MOM(10)   |  92   98   92   78   37   |  94  100   94   81   41   |  95   99   93   82   41
MOM(5)    |  96   94   86   66   18   |  96   96   86   68   22   |  98   96   85   68   22
MOM       | 100   81   n/a  n/a  n/a  | 100   83   n/a  n/a  n/a  | 100   80   n/a  n/a  n/a
AL(19)    |  91   89   87   83   n/c  |  94   96   94   92   87   |  95   96   95   95   92
AL(9)     |  89   92   91   87   77   |  92   96   94   94   89   |  94   97   96   95   91
AL(3)     |  83   92   92   87   69   |  84   94   94   89   73   |  85   94   93   90   75
Table 5.1b
√n × bias of estimator of µ (symmetric distributions).

m = n = 50
          |  t∞       t5       t3       t2       t1
MOM(25)   |  0.042    0.046    0.028    0.027   −0.053
MOM(10)   |  0.028    0.032    0.017    0.022   −0.158
MOM(5)    |  0.025    0.025    0.019    0.017   −0.307
MOM       |  0.016    0.019    n/a      n/a      n/a
AL(19)    |  0.017    0.035    0.012    0.009    n/c
AL(9)     |  0.017    0.035    0.006    0.012   −0.000
AL(3)     |  0.031    0.040    0.028    0.030   −0.068

m = n = 250
          |  t∞       t5       t3       t2       t1
MOM(25)   |  0.057   −0.020    0.050   −0.016   −0.018
MOM(10)   |  0.051   −0.014    0.063   −0.010   −0.042
MOM(5)    |  0.047   −0.015    0.076   −0.010   −0.051
MOM       |  0.043   −0.033    n/a      n/a      n/a
AL(19)    |  0.055   −0.021    0.056   −0.018   −0.012
AL(9)     |  0.051   −0.018    0.055   −0.012   −0.027
AL(3)     |  0.061   −0.022    0.050   −0.005   −0.021

m = n = 500
          |  t∞       t5       t3       t2       t1
MOM(25)   | −0.069    0.055    0.036   −0.024   −0.027
MOM(10)   | −0.065    0.046    0.022   −0.001   −0.026
MOM(5)    | −0.066    0.044   −0.053    0.018   −0.027
MOM       | −0.078    0.043    n/a      n/a      n/a
AL(19)    | −0.069    0.046    0.056   −0.009   −0.026
AL(9)     | −0.073    0.048    0.034   −0.012   −0.025
AL(3)     | −0.069    0.045    0.051   −0.014   −0.037
Table 5.2a
Finite sample efficiency of estimators of σ (symmetric distributions).

          |       m = n = 50          |       m = n = 250         |       m = n = 500
          |  t∞   t5   t3   t2   t1   |  t∞   t5   t3   t2   t1   |  t∞   t5   t3   t2   t1
MOM(25)   |  30   45   53   64   83   |  30   45   55   64   86   |  31   46   55   65   87
MOM(10)   |  68   86   94   89   53   |  63   86   91   94   77   |  65   84   91   95   77
MOM(5)    |  80   91   91   71   16   |  78   94   94   86   44   |  80   95   95   87   47
MOM       | 100   61   n/a  n/a  n/a  | 100   n/a  n/a  n/a  n/a  | 100   n/a  n/a  n/a  n/a
AL(19)    |  72   77   77   61   n/c  |  78   89   91   89   77   |  83   91   92   94   88
AL(9)     |  64   76   81   80   55   |  66   83   87   91   91   |  71   84   89   92   92
AL(3)     |  35   50   58   66   77   |  35   52   60   68   81   |  37   53   60   70   82
Table 5.2b
√n × bias of estimator of σ (symmetric distributions).

m = n = 50
          |  t∞       t5       t3       t2       t1
MOM(25)   |  0.214    0.276    0.261    0.276    0.384
MOM(10)   |  0.100    0.162    0.172    0.253    0.536
MOM(5)    |  0.088    0.147    0.178    0.322    0.921
MOM       |  0.067    0.206    n/a      n/a      n/a
AL(19)    |  0.093    0.178    0.217    0.328    n/c
AL(9)     |  0.126    0.188    0.187    0.264    0.546
AL(3)     |  0.165    0.267    0.256    0.278    0.358

m = n = 250
          |  t∞       t5       t3       t2       t1
MOM(25)   |  0.048    0.185    0.106    0.098    0.123
MOM(10)   |  0.011    0.085    0.070    0.086    0.240
MOM(5)    | −0.003    0.090    0.063    0.094    0.393
MOM       | −0.002    0.157    n/a      n/a      n/a
AL(19)    | −0.005    0.102    0.076    0.087    0.194
AL(9)     | −0.009    0.086    0.076    0.082    0.157
AL(3)     |  0.074    0.148    0.093    0.093    0.142

m = n = 500
          |  t∞       t5       t3       t2       t1
MOM(25)   | −0.014    0.107    0.060    0.040    0.136
MOM(10)   | −0.003    0.029    0.055   −0.005    0.159
MOM(5)    | −0.007    0.033    0.064    0.002    0.241
MOM       |  0.003    0.036    n/a      n/a      n/a
AL(19)    |  0.002    0.032    0.062    0.018    0.150
AL(9)     |  0.003    0.034    0.057   −0.002    0.151
AL(3)     |  0.013    0.080    0.065    0.042    0.135
For the asymmetric underlying distributions, for which results are reported in Tables 5.3a (estimation of µ) and 5.4a (estimation of σ), the asymptotic likelihood estimators perform consistently better than their trimmed method of moments counterparts. Clearly, the AL(9) and AL(19) estimators fare best at all three sample sizes, with the AL(9) holding a slight edge overall when the underlying distribution has heavier tails. While it has been shown that the AL(19) estimators are superior to the AL(9) estimators in an asymptotic sense, that is not always the case in finite samples. Even at samples of size m = 500, there are instances where the AL(9) estimators have better finite-sample efficiency than the AL(19) estimators.

In Tables 5.1b, 5.2b, 5.3b and 5.4b, the estimated biases of the estimators are reported. From the simulation results, the bias in the AL estimator of σ is definitely a bit more than that in the estimation of µ, but nevertheless negligible at the larger (n = 250, 500) sample sizes used in the simulations. Furthermore, the biases of the AL estimators and the corresponding MOM estimators are of the same order of magnitude.
Table 5.3a
Finite sample efficiency of estimators of µ (asymmetric distributions).

          |       m = n = 50          |       m = n = 250         |       m = n = 500
          | st∞  st5  st3  st2  st1   | st∞  st5  st3  st2  st1   | st∞  st5  st3  st2  st1
MOM(25)   |  41   43   43   43   29   |  40   43   43   44   30   |  64   45   46   43   31
MOM(10)   |  73   76   71   60   15   |  68   72   67   58   19   |  82   73   69   63   20
MOM(5)    |  81   80   69   49    6   |  78   77   66   47    8   |  88   77   65   51    8
MOM       |  86   43   n/a  n/a  n/a  |  84   29   n/a  n/a  n/a  |  92   28   n/a  n/a  n/a
AL(19)    |  83   87   86   77   39   |  87   91   89   84   58   |  95   92   90   86   59
AL(9)     |  74   86   87   86   61   |  80   87   91   91   83   |  89   89   91   92   86
AL(3)     |  46   53   53   53   42   |  49   55   57   59   50   |  70   56   63   58   55
Table 5.3b
√n × bias of estimator of µ (asymmetric distributions).

m = n = 50
          |  st∞      st5      st3      st2      st1
MOM(25)   | −0.123   −0.043   −0.071   −0.096   −0.117
MOM(10)   | −0.042   −0.006   −0.651   −0.036   −0.100
MOM(5)    | −0.033   −0.006   −0.141   −1.129   −0.029
MOM       | −0.022   −0.086    n/a      n/a      n/a
AL(19)    | −0.029   −0.194   −0.034    0.004   −0.071
AL(9)     | −0.041   −0.195   −0.046   −0.194   −0.025
AL(3)     | −0.089   −0.278    0.001   −0.072   −0.057

m = n = 250
          |  st∞      st5      st3      st2      st1
MOM(25)   | −0.142   −0.073   −0.043   −0.016   −0.024
MOM(10)   | −0.032   −0.045   −0.113   −0.010   −0.020
MOM(5)    | −0.021   −0.019   −0.038   −0.395   −0.009
MOM       | −0.157   −0.050    n/a      n/a      n/a
AL(19)    | −0.170   −0.004   −0.020    0.009   −0.006
AL(9)     | −0.229   −0.013   −0.016   −0.005    0.005
AL(3)     | −0.079   −0.050   −0.005    0.010    0.000

m = n = 500
          |  st∞      st5      st3      st2      st1
MOM(25)   | −0.066   −0.066   −0.021   −0.094   −0.038
MOM(10)   | −0.006   −0.043   −0.051   −0.027   −0.026
MOM(5)    | −0.097   −0.047   −0.139   −0.084   −0.049
MOM       |  0.040   −0.074    n/a      n/a      n/a
AL(19)    | −0.024   −0.031   −0.001   −0.018   −0.042
AL(9)     |  0.067   −0.034   −0.016   −0.018   −0.039
AL(3)     | −0.043   −0.055   −0.067   −0.030   −0.045
Table 5.4a
Finite sample efficiency of estimators of σ (asymmetric distributions).

          |       m = n = 50          |       m = n = 250         |       m = n = 500
          | st∞  st5  st3  st2  st1   | st∞  st5  st3  st2  st1   | st∞  st5  st3  st2  st1
MOM(25)   |  28   40   44   45   41   |  29   40   45   48   44   |  55   41   45   48   43
MOM(10)   |  61   69   65   54   24   |  57   66   62   54   30   |  77   66   63   58   31
MOM(5)    |  72   72   62   44   13   |  69   71   59   43   18   |  85   69   60   45   19
MOM       |  84   38   n/a  n/a  n/a  |  83   27   n/a  n/a  n/a  |  92   26   n/a  n/a  n/a
AL(19)    |  71   76   74   58   22   |  78   86   86   86   66   |  90   88   94   89   77
AL(9)     |  60   73   76   71   35   |  65   78   81   84   81   |  83   81   85   86   80
AL(3)     |  33   45   50   50   46   |  35   48   53   57   56   |  61   48   55   59   54
Table 5.4b
√n × bias of estimator of σ (asymmetric distributions).

m = n = 50
          |  st∞      st5      st3      st2      st1
MOM(25)   |  0.247    0.191    0.314    0.403    0.643
MOM(10)   |  0.118    0.141    0.178    0.298    1.128
MOM(5)    |  0.103    0.143    0.192    0.336    1.644
MOM       |  0.087    0.260    n/a      n/a      n/a
AL(19)    |  0.103    0.141    0.160    0.224    1.016
AL(9)     |  0.129    0.142    0.160    0.237    0.647
AL(3)     |  0.197    0.182    0.278    0.378    0.608

m = n = 250
          |  st∞      st5      st3      st2      st1
MOM(25)   |  0.133    0.143    0.195    0.122    0.191
MOM(10)   |  0.062    0.107    0.122    0.138    0.211
MOM(5)    |  0.052    0.071    0.132    0.150    0.476
MOM       |  0.062    0.106    n/a      n/a      n/a
AL(19)    |  0.052    0.054    0.111    0.098    0.130
AL(9)     |  0.055    0.063    0.112    0.101    0.064
AL(3)     |  0.129    0.123    0.164    0.148    0.145

m = n = 500
          |  st∞      st5      st3      st2      st1
MOM(25)   |  0.106    0.114    0.176    0.068    0.190
MOM(10)   |  0.056    0.083    0.103    0.066    0.147
MOM(5)    |  0.053    0.091    0.082    0.101    0.241
MOM       |  0.073    0.130    n/a      n/a      n/a
AL(19)    |  0.048    0.074    0.053    0.051    0.121
AL(9)     |  0.041    0.077    0.074    0.053    0.135
AL(3)     |  0.108    0.104    0.145    0.071    0.145
6. Estimation of the standard error

In practice, standard errors for the estimates can be obtained from the last step in the recursion (14); see the Appendix. The covariance matrix of θ̂ is estimated by (mλ̃ J̃(θ̂))^{-1}, and the square roots of its last two diagonal elements provide the required estimated standard errors. We evaluated this approach using N = 10,000 samples generated by Monte Carlo simulation from each of the five symmetric underlying distributions specified in Section 3. Denote by SE the standard deviation of the N = 10,000 individual AL(9) point estimates of µ or σ, and by ŜE_1, . . . , ŜE_N the N = 10,000 estimated standard errors obtained from (mλ̃ J̃(θ̂))^{-1}. In this sampling experiment SE can be looked upon as the "true" standard error of the AL(9) estimator and ŜE_j as an estimate of SE. Tables 6.1 (estimation of µ) and 6.2 (estimation of σ) show the average of the ratios

ŜE_1/SE, . . . , ŜE_N/SE,

together with their standard deviations (in brackets), when sampling from three of the distributions.
Table 6.1
Ratio of average standard error estimate of µ̂ to (estimated) true standard error of µ̂.

          |  t5           t3           t1
m = 50    |  0.97 (0.14)  0.98 (0.15)  1.08 (0.27)
m = 250   |  0.99 (0.06)  1.03 (0.07)  1.09 (0.11)
Table 6.2
Ratio of average standard error estimate of σ̂ to (estimated) true standard error of σ̂.

          |  t5           t3           t1
m = 50    |  0.92 (0.21)  0.88 (0.23)  0.71 (0.31)
m = 250   |  0.96 (0.10)  0.93 (0.10)  0.91 (0.13)
It is evident from Table 6.1 that the suggested method works quite well when estimating the standard error of µ̂. It is equally clear from Table 6.2 that the method works somewhat less well when estimating the standard error of σ̂, the latter being generally underestimated, especially in small samples and particularly when the underlying distribution has heavy tails. Nonetheless, the interval defined by the average plus or minus two standard deviations contains the unit ratio everywhere in Table 6.2, suggesting that it may not be inappropriate in large samples to also estimate the standard error of σ̂ in this manner.

7. Summary

We have developed in this paper near-efficient nonparametric estimation of location and scale parameters in two-sample location-scale problems. Earlier theoretical work on the problem has been reviewed and issues hindering routine application of the methods have been identified. The proposed methodology, which is based on the joint asymptotic normality of sample quantiles, is straightforward to implement and yields estimators of location and scale parameters that are asymptotically near-efficient and naturally robust against the influence of outliers in the data. Competing estimators, which use sample moments after trimming the observed sample, were also considered.

In this paper, the two methods of estimation were compared for a selection of symmetric and asymmetric distributions. The symmetric family considered is Student's t-distribution with ν (= 1, 2, 3, 5, ∞) degrees of freedom. The asymmetric family is the skew-t distribution, which is a generalization of Student's t-distribution. While there are individual instances in which the trimmed method of moments estimator shows better asymptotic performance than the asymptotic likelihood method, the latter performs consistently well, even when the trimmed method of moments estimator has low asymptotic efficiency.

The small sample behavior of both types of estimators has been compared via Monte Carlo simulation, using maximum likelihood estimators as a reference. The results are encouraging, indicating that the asymptotic likelihood estimator performs very well overall, even when the sample sizes are as small as 50 observations. For underlying symmetric distributions, there are instances in which the trimmed method of moments approach performs better than the asymptotic likelihood approach. However, for the asymmetric distributions considered, the asymptotic likelihood method shows overall better performance. In the instances in which the trimmed method of moments approach performs better than the asymptotic likelihood approach, the observed difference in small-sample efficiency is typically less than 3%.

Acknowledgments

The first author's work was supported by Award No. KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST). The second author's work was supported by the National Research Foundation of South Africa. The authors thank two referees for comments that led to an improved exposition of the work.

Appendix

A.1. Nonlinear least squares implementation

Denote by D the diagonal matrix with entries
[g̃(G̃^{-1}(t_j)), j = 1, . . . , k; f̃(F̃^{-1}(t_j)), j = 1, . . . , k], denote by Z the data

Z = [G̃^{-1}(t_j), j = 1, . . . , k; F̃^{-1}(t_j), j = 1, . . . , k]^⊤

and set

γ(θ) = [µ + σφ_j, j = 1, . . . , k; φ_j, j = 1, . . . , k]^⊤.
Then we can write V(θ) = D(Z − γ(θ)) and

Q̃(θ) = (Z − γ(θ))^⊤ D^⊤ Ω̃^{-1} D (Z − γ(θ)).

Thus, we see that minimizing Q̃(θ) is equivalent to estimating by least squares the parameter vector θ in the model

DZ = Dγ(θ) + Ω̃^{1/2} ϵ   (13)

where ϵ is a vector of i.i.d. random variables with a distribution that does not depend upon θ. Premultiplying throughout by Ω̃^{-1/2} gives the nonlinear regression model

Y* = H(θ) + ϵ

where Y* = Ω̃^{-1/2} D Z and H(θ) = Ω̃^{-1/2} D γ(θ).
Least squares estimates are easily found using, for instance, the linearization method (Draper and Smith, 1998, p. 508). To obtain the linear approximation to the nonlinear model, note that

∂H(θ)/∂θ = Ω̃^{-1/2} [ σg̃    g̃x
                        f̃     0  ] := Ω̃^{-1/2} Ã(θ)

with g̃ = diag{g̃(G̃^{-1}(t_j)), j = 1, . . . , k}, f̃ = diag{f̃(F̃^{-1}(t_j)), j = 1, . . . , k} and x = [1_k φ], 1_k denoting a column vector of ones. As our initial estimate θ̂^(0) of θ we use the vector with components

σ̂^(0) = iqr{Y_j}/iqr{X_i},
µ̂^(0) = median{Y_j} − σ̂^(0) × median{X_i},
φ̂_j^(0) = (1 − λ̃)(G̃^{-1}(t_j) − µ̂^(0))/σ̂^(0) + λ̃ F̃^{-1}(t_j),  j = 1, . . . , k,
where iqr is short for "interquartile range". Note that this estimator is equivariant with respect to a relabeling of the X and Y samples. The normal equations of the linearized model then define the recursion

θ̂^(N+1) = θ̂^(N) − J̃(θ̂^(N))^{-1} Ψ̃(θ̂^(N))   (14)

where

J̃(θ) = Ã(θ)^⊤ Ω̃^{-1} Ã(θ)   (15)

and Ψ̃(θ) = −Ã(θ)^⊤ Ω̃^{-1} V(θ). One then sets θ̂ = θ̂^(M), where M is the first index N at which |θ̂^(N+1) − θ̂^(N)| is less than a specified small number ϵ. We use ϵ = 10^{-6} in all our numerical work. Note that J̃(θ̂)^{-1} is the usual estimate of the covariance matrix of (mλ̃)^{1/2}(θ̂ − θ).
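A minimal sketch of recursion (14) follows. It is hypothetical code, with Fq, Gq, f_hat, g_hat, Om_inv and the starting value constructed as in the earlier sketches; the update is written as a Gauss-Newton step, and the final J̃ can be reused for the standard errors of Section 6.

```python
import numpy as np

def al_recursion(Fq, Gq, f_hat, g_hat, Om_inv, theta0, eps=1e-6, max_iter=100):
    """Gauss-Newton iteration for the linearized model; cf. (14) and (15)."""
    k = len(Fq)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iter):
        phi, mu, sigma = theta[:k], theta[k], theta[k + 1]
        x = np.column_stack([np.ones(k), phi])       # x = [1_k, phi]
        A = np.block([[sigma * np.diag(g_hat), np.diag(g_hat) @ x],
                      [np.diag(f_hat), np.zeros((k, 2))]])   # A~(theta)
        V = np.concatenate([g_hat * (Gq - mu - sigma * phi),
                            f_hat * (Fq - phi)])     # V(theta)
        J = A.T @ Om_inv @ A                         # J~(theta), eq. (15)
        step = np.linalg.solve(J, A.T @ Om_inv @ V)  # -J~^{-1} Psi~, cf. (14)
        theta = theta + step
        if np.max(np.abs(step)) < eps:
            break
    # Square roots of the last two diagonals of inv(m*lam*J) estimate the
    # standard errors of (mu-hat, sigma-hat), as described in Section 6.
    return theta, J
```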
A.2. Asymptotic distribution of the AL estimators

It follows from Lemma 21.7 in van der Vaart (2000) that (mλ̃)^{1/2}(θ̂^(0) − θ) = O_P(1). Hence, by (14) and Theorem 5.45 in van der Vaart (2000), (mλ̃)^{1/2}(θ̂ − θ) has the same limiting distribution as

T = (mλ̃)^{1/2} (A(θ)^⊤ Ω^{-1} A(θ))^{-1} A(θ)^⊤ Ω^{-1} V(θ).

Recall from Section 2 that (mλ̃)^{1/2} V(θ) converges in distribution to a zero-mean, 2k-variate normal distribution with covariance matrix Ω. Thus the limiting distribution of T is (k + 2)-variate normal with zero mean and covariance matrix

M = (A(θ)^⊤ Ω^{-1} A(θ))^{-1}.
Some calculation shows that

M^{-1} = [ σ² g Σ^{-1} g / {λ(1 − λ)}    σ g Σ^{-1} g x / λ
           σ x^⊤ g Σ^{-1} g / λ          x^⊤ g Σ^{-1} g x / λ ]

whence (Rao, 1965, p. 29),

M = [ B^{-1} + F E^{-1} F^⊤    −F E^{-1}
      −E^{-1} F^⊤              E^{-1}   ]

with

B = σ² g Σ^{-1} g / {λ(1 − λ)},  C = σ g Σ^{-1} g x / λ,  F = B^{-1} C

and

E = x^⊤ g Σ^{-1} g x / λ − C^⊤ B^{-1} C = x^⊤ g Σ^{-1} g x.

The matrix E^{-1} is the asymptotic covariance matrix of (mλ̃)^{1/2}(µ̂ − µ, σ̂ − σ)^⊤. Further calculation using the representation for Σ^{-1} given in Lemma 2 of Hsieh (1995), together with the fact that g(G^{-1}(t)) = σ^{-1} f(F^{-1}(t)), shows that

x^⊤ g Σ^{-1} g x = σ^{-2} [ J11  J12
                            J12  J22 ]

where the elements of the matrix are those in (7).

References

Azzalini, A., 2005. The skew-normal distribution and related multivariate families. Scand. J. Statist. 32, 159–188.
David, H.A., 1981. Order Statistics. John Wiley and Sons, New York.
Draper, N.R., Smith, H., 1998. Applied Regression Analysis, third ed. John Wiley and Sons, New York.
Hsieh, F., 1995. The empirical process approach for semiparametric two-sample models with heterogeneous treatment effect. J. R. Stat. Soc. Ser. B 57, 735–748.
Park, B.U., 1990. Efficient estimation in the two-sample semiparametric location-scale models. Probab. Theory Related Fields 86, 21–39.
Parzen, E., 1979. Nonparametric statistical data modeling. J. Amer. Statist. Assoc. 74, 105–121.
Rao, C.R., 1965. Linear Statistical Inference and its Applications. John Wiley and Sons, New York.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
van der Vaart, A.W., 2000. Asymptotic Statistics. Cambridge University Press, Cambridge.
Weiss, L., Wolfowitz, J., 1970. Asymptotically efficient nonparametric estimators of location and scale parameters. Z. Wahrscheinlichkeitstheor. Verwandte Geb. 16, 134–150.
Wolfowitz, J., 1974. Asymptotically efficient nonparametric estimators of location and scale parameters. II. Z. Wahrscheinlichkeitstheor. Verwandte Geb. 30, 117–128.