Inference on Distribution Functions and Quantiles

36

Sample Surveys: Inference and Analysis, Vol. 29B, ISSN: 0169-7161. © 2009 Published by Elsevier B.V. DOI: 10.1016/S0169-7161(09)00236-3

Alan H. Dorfman

1. Introduction

The focus of this chapter is the estimation of the finite population distribution function on the basis of a sample taken. The subject is important: the distribution function is a basic statistic underlying many others (Serfling, 1980, Section 2.1); for purposes of assessing and comparing finite populations it can be more revealing than means and totals (Sedransk and Sedransk, 1979). Intimately tied to distribution functions are quantiles, typically estimated by inverting estimates of the distribution function, and we shall be concerned with their estimation as well. Note: the distribution function is also known as the cumulative distribution or the cumulative distribution function. For shorthand purposes, we shall adopt cdf, lest anyone arriving in the middle of the chapter think we are talking about degrees of freedom. In some important respects, research on the estimation of finite population cdfs from survey samples is relatively young, and a good deal remains uncertain. We will suggest some avenues for further investigation explicitly in the last section, and implicitly in the course of the chapter with phrases such as "it is not clear that…" or "it would be of interest to see…".

1.1. Definitions

The distribution function, also known as the empirical distribution function, of a variable y for a finite population U having N units is defined by

F_N(t) = N^{-1} Σ_{i=1}^{N} I(y_i ≤ t),    −∞ < t < ∞,

where I(u) is the indicator function (or "truth function") defined by I(u) = 1 if u is true, I(u) = 0 if not. In words, for a given value of t, F_N(t) is the proportion of y in the population not exceeding t. An alternate symbolism sometimes employed in the cdf literature is F_N(t) = N^{-1} Σ_{i=1}^{N} H(t − y_i), where H(u) is the Heaviside function (sometimes written with a different symbol), given by H(u) = 1 for u ≥ 0 and H(u) = 0 for u < 0. In this chapter we will stick to the indicator function notation. It will occasionally be convenient to use the shorthand z_i = I(y_i ≤ t).


From the definition immediately follow the basic properties: (a) F_N(t) is monotonic nondecreasing in t, (b) F_N(t) is a step function with step size N^{-1}, and (c) 0 ≤ F_N(t) ≤ 1. If the y_i's are regarded as realizations of independent random variables Y_i, each generated from a probability distribution having distribution function F(t), then F_N(t) is an unbiased and strongly consistent estimator for F(t) (Serfling, 1980, pp. 56–57). For 0 ≤ p ≤ 1, the pth quantile is defined by Q(p) = min{t : F(t) ≥ p}. In the case where F is monotonic strictly increasing, Q(p) = F^{-1}(p), in the ordinary sense of function inversion. We shall use this notation even when we do not have strict monotonicity, when we wish to identify the distribution function from which the quantile is derived. Estimation of the distribution function leads to estimation of the quantiles Q(p) by Q_N(p) = min{t : F_N(t) ≥ p}. In this chapter, we shall be concerned with the transition from sample to finite population, and ignore the further question of estimating the underlying generating distribution function. It can be safely assumed, however, that a good job of estimating F_N(t) typically implies a good job of estimating F(t), and vice versa. From this point on, for notational convenience, we will write F(t) ≡ F_N(t), dropping the subscript N, with, we hope, no confusion. Estimating the finite population distribution function F(t) is in some respects easier and in others more difficult than estimating a population total or mean. On the one hand, for fixed t, F(t) is simply a mean of 0's and 1's. As such it should show no more complication than estimating a mean, and estimation can sometimes be eased by using a transformed version of y (see Remark 1.1).
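The definitions above are simple to compute. The following is a minimal sketch of F_N(t) and of Q_N(p) = min{t : F_N(t) ≥ p} using NumPy (the function names are ours, for illustration only):

```python
import numpy as np

def cdf_fn(y, t):
    """Finite population cdf: F_N(t) = N^-1 * sum_i I(y_i <= t)."""
    y = np.asarray(y, dtype=float)
    return np.mean(y <= t)

def quantile_qn(y, p):
    """Q_N(p) = min{t : F_N(t) >= p}; attained at an order statistic."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = len(y_sorted)
    k = int(np.ceil(p * n)) - 1          # smallest i with i/n >= p (0-based)
    return y_sorted[max(k, 0)]

population = [3.0, 1.0, 4.0, 1.0, 5.0]
print(cdf_fn(population, 3.0))        # 3 of 5 values <= 3 -> 0.6
print(quantile_qn(population, 0.5))   # min t with F_N(t) >= 0.5 -> 3.0
```

Note that Q_N(p) is always one of the population values, reflecting the step-function character of F_N.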
On the other hand, (a) we typically want to estimate F(t) for more than one value of t, and these estimates need to be coordinated, especially if we have a further interest in estimating quantiles; and (b) where y is related to an auxiliary variable x, it becomes a question of how to use this information, since now we are concerned with z ≡ I(y ≤ t) and not y itself, which is what usually gets modeled on x.

Remark 1.1. If u = g(y) is a strictly monotone increasing function of y, then one readily sees that estimating the cdf for u is equivalent to estimating the cdf for y: F_Y(t) = P(Y ≤ t) = P(g(Y) ≤ g(t)) = F_U(g(t)). In some cases, it may help to use the relation to auxiliary information x of a transformed version of y rather than y itself. Compare, for example, Section 3.4.

1.2. The design-based and model-based perspectives

In this chapter, we will be referring to "design-based" and "model-based" estimators, alluding to two different perspectives on sampling inference. Most basic sampling texts adhere to the design-based (or "probability sampling") approach, for example, Cochran (1977) and Särndal et al. (1992). Valliant et al. (2000) describe the model-based (or "prediction") approach. See also especially Chapters 1 and 23 of this volume. In brief, the design-based approach bases inference on the probability distribution generated by the activity of the survey sampler; the model-based approach, on a hypothesized relation of the data in the sample to that in the population. The design-based approach treats the finite population U = {(x_i, y_i), i = 1, 2, …, N} as a fixed large set of parameters, where y is the variable of interest and x is the auxiliary variable, embodying extra information possibly relevant to y. The


random variable on which inference rests is the indicator variable I_i, i ∈ U, which is 1 if unit i is chosen for the sample s, and 0 if i is in the "remainder" r = U − s. The "inclusion probability" π_i = prob(I_i = 1) plays a fundamental role in design-based estimators, with the inverse of π_i the default "weight" placed on the ith sample point. Likewise, the "second-order inclusion probabilities" π_ij = prob(I_i = 1 and I_j = 1) play a role in variance calculations and in some estimators of the distribution function (see Section 3.2). The model-based approach regards the population values of y as realizations of random variables that are generated according to some probability model, the "working model." Estimators are grounded in the model, and the π_i (and even the random selection of s) are not typically regarded as necessary ingredients for inference. At the same time, efforts are made to protect inference against model misspecification (see Chapter 23). This turns out to be intrinsically more difficult in the case of estimating distribution functions than for totals and means (see Section 3.1). Thus there is more emphasis in the case of distribution functions on model diagnostics (Section 3.5), on estimators that seek to automatically adjust for model failure (Section 3.6.2), and on weakening the model (Section 3.7). Model-assisted estimation is a form of design-based estimation that incorporates both the inclusion probabilities π_i and a model, attempting "to provide valid conditional inferences under the assumed model and [to protect] against model misspecification in the sense of providing valid design-based inferences irrespective of the population y-values" (Rao, 1994). These estimators often bear a strong resemblance to the class of model-based estimators that seek to automatically adjust for model failure (Section 3.6.2).
Most simulation studies, even those studying the behavior of model-based estimators, are essentially constructed from a design-based perspective, looking at biases and mean square errors over repeated sampling from a single population. They may therefore be regarded as slightly favoring design-based estimators.

1.3. Desirable properties of estimators of the distribution function

Many authors, notably Nascimento Silva and Skinner (1995), have advocated that a sample-based cdf estimator F̂(t) should possess some or all of the following properties:

(1) F̂(t) is itself a distribution function, satisfying (a) the boundary condition, 0 ≤ F̂(t) ≤ 1, and (b) the monotonicity condition t < t′ ⇒ F̂(t) ≤ F̂(t′). Of these, the boundary condition (a) is the more fundamental, because values of F̂(t) outside these bounds are necessarily off target. To achieve a quantile estimate Q̂(p) by inverting F̂(t) requires (b).
(2) F̂(t) is simple, (a) in form, for example, having a single set of weights w_i applied to z_i ≡ I(y_i ≤ t) for all t, (b) to calculate, and (c) to understand.


(3) F̂(t) is readily invertible to get quantiles. This is perhaps a particular aspect of 2(a).
(4) F̂(t) is automatically constructed in all its details; that is, it does not require choices (of, for example, a particular model or bandwidths) or diagnostic acuity on the part of the analyst.
(5) F̂(t) is calibrated with respect to any auxiliary variables x that are correlated with y. Typically, this means that if y is replaced by x, then F̂(t) = F(t).
(6) F̂(t) is efficient: its mean square error with respect to F(t) is smaller than that of competing estimators; the mean square error is usually calculated across samples from the same population, that is, with respect to the sample design. It makes full use of available relevant auxiliary information.
(7) F̂(t) is unbiased, or nearly so, usually taken with respect to the sample design.
(8) F̂(t) is consistent: F̂(t) tends to be closer and closer to the target F(t) as the sample size n gets larger and larger.
(9) F̂(t) is robust: it stands up well against competing estimators under a variety of conditions. If not invariably the most efficient, it is not much less efficient than the best estimator for given circumstances.
(10) F̂(t) has a readily formulated variance, and a readily calculated and simple variance estimator.

These properties are not entirely compatible. For example, aiming at simplicity we may pay a price in efficiency, and vice versa. They are also not equally important; for example, it may suffice that an estimator be efficient and robust, without requiring unbiasedness as well. Also, standard algorithms of isotonic regression, for example, the pool-adjacent-violators algorithm, can be used to make minor adjustments to a potentially wayward F̂(t) to guarantee Property 1, albeit at some sacrifice with respect to Property 2. Which properties to emphasize most will depend on circumstances. Large repetitive government surveys will value Property 4. A one-time survey, with a statistically savvy analyst who has time to explore the data, will perhaps put efficiency and robustness considerations uppermost. Different priorities (cdf or quantiles), different emphases on the aforementioned properties, the different forms auxiliary information can take, as well as the tension between model-based and design-based sampling ideas, perhaps account for the great variety of cdf estimators that have been put forth in recent years.
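The pool-adjacent-violators adjustment just mentioned can be sketched as follows (a minimal, unweighted version; the name `pava` is ours). It replaces each decreasing run in a wayward cdf estimate, evaluated on an increasing grid of t, by the pooled mean of the offending block:

```python
def pava(values):
    """Pool-adjacent-violators: least-squares monotone nondecreasing fit."""
    blocks = []                          # list of [block mean, block size]
    for v in values:
        blocks.append([float(v), 1.0])
        # merge backwards while adjacent blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for m, w in blocks:
        out.extend([m] * int(w))         # expand each block back to full length
    return out

# A wayward cdf estimate on an increasing grid of t:
f_hat = [0.1, 0.3, 0.25, 0.6, 0.55, 0.9]
print(pava(f_hat))   # monotone: [0.1, 0.275, 0.275, 0.575, 0.575, 0.9]
```

Clipping the output to [0, 1] afterward enforces the boundary condition 1(a) as well.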

2. Estimating the distribution function with no auxiliary information

We assume a sample s of size n, perhaps selected by probability sampling with inclusion probabilities π_i. Although the inclusion probabilities inevitably rest on some auxiliary information, that information may not be available to the data analyst. In this section, we consider estimation lacking any data on an auxiliary variable x. All we have is the sample y and possibly also the sample design weights d_i ≡ π_i^{-1}. For the moment we do not worry about second-order inclusion probabilities, which are rarely available except in simple random or stratified sampling.


2.1. The Hájek estimator

A convenient estimator has the form

F̂_w(t) = Σ_{i∈s} w_i I(y_i ≤ t),    (1)

where the weights w_i satisfy 0 ≤ w_i ≤ 1 and Σ_{i∈s} w_i = 1. This estimator in general satisfies Properties 1, 2, and 3 mentioned earlier. Assume the y_i are listed in ascending order. Then the corresponding quantile estimator, readily calculated, is

Q̂(p) = min{ y_i : Σ_{i′=1}^{i} w_{i′} ≥ p }.

If w_i = d_i / Σ_{i′∈s} d_{i′}, we get the Hájek estimator, the "customary design-based estimator"

F̂_π(t) = Σ_{i∈s} π_i^{-1} I(y_i ≤ t) / Σ_{i∈s} π_i^{-1} = Σ_{i∈s} d_i z_i / Σ_{i∈s} d_i,    (2)

which is design-consistent and approximately design-unbiased. This tends to be the estimator against which other estimators of F(t) are (usually successfully) compared. In the case of simple random sampling, F̂_π(t) reduces to the sample empirical distribution function or "naïve estimator"

F̂_n(t) = n^{-1} Σ_{i∈s} I(y_i ≤ t) = n^{-1} Σ_{i∈s} z_i.    (3)
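A minimal sketch of the Hájek estimator (2) and its inversion for quantiles, assuming known design weights d_i = 1/π_i (function names are ours):

```python
import numpy as np

def hajek_cdf(y, d, t):
    """Hajek estimator (2): sum_s d_i I(y_i <= t) / sum_s d_i, with d_i = 1/pi_i."""
    y, d = np.asarray(y, float), np.asarray(d, float)
    return np.sum(d * (y <= t)) / np.sum(d)

def hajek_quantile(y, d, p):
    """Q_hat(p): smallest order statistic whose cumulative normalized weight >= p."""
    order = np.argsort(y)
    y_sorted = np.asarray(y, float)[order]
    w = np.asarray(d, float)[order]
    cum = np.cumsum(w / w.sum())
    return y_sorted[np.searchsorted(cum, p)]

y = [2.0, 5.0, 1.0, 4.0]
d = [1.0, 2.0, 1.0, 4.0]          # design weights 1/pi_i
print(hajek_cdf(y, d, 2.0))       # (1 + 1)/8 = 0.25
print(hajek_quantile(y, d, 0.5))  # cumulative weights 1/8, 2/8, 6/8, 8/8 -> 4.0
```

With equal weights the two functions reduce to the naïve estimator (3) and its inversion.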

When there are no preconditions on the values of the population y_i, then F̂_n is admissible with respect to four standard loss functions for all sample designs (Cohen and Kuo, 1985a) and is minimax under simple random sampling for one of these loss functions (Cohen and Kuo, 1985b). Another special case of F̂_π(t) is the stratification-based estimator studied by Sedransk and Sedransk (1979). Design-based variances and variance estimators for F̂_π(t) use standard design-based formulas with z_i the variable of interest.

2.2. Alternatives to Hájek

Kuk (1988) compared three estimators: the Hájek estimator as mentioned above, the Horvitz–Thompson estimator F̂_HT(t) = N^{-1} Σ_{i∈s} d_i I(y_i ≤ t), and the "complementary proportion" estimator F̂_R(t) = 1 − N^{-1} Σ_{i∈s} d_i I(y_i > t). Neither F̂_HT(t) nor F̂_R(t) is a cdf. He gives theoretical reasons for preferring F̂_R(t) or F̂_π(t) to F̂_HT(t). In an empirical study on several populations, F̂_R(t) showed best as a basis for the construction of medians. He also considered a fourth estimator F̂_λ̂(t) = λ̂ F̂_HT(t) + (1 − λ̂) F̂_R(t), a weighted average aimed at giving minimal mean square error. The weights are estimates based on the population of x values; thus, properly speaking, this estimator belongs to Section 3 below. In simulations, it is often best but, surprisingly, sometimes not as good as F̂_R(t), which does not require the auxiliary information. For the sake of quantile estimation, Hyndman and Fan (1996) (henceforth HF) consider linearly interpolated alternatives to the naïve estimator F̂_n(t) that are of the form


F̂_{α,β}(t) = λ F̃(y_k) + (1 − λ) F̃(y_{k+1}), for y_k ≤ t ≤ y_{k+1}, where F̃(y_k) = (k − α)/(n − α − β + 1) and λ = (y_{k+1} − t)/(y_{k+1} − y_k), the y_i being taken in increasing order. This estimator is continuous over the range of sample y values, and gives a readily calculable quantile estimator Q̂_{α,β}(p) = F̂_{α,β}^{-1}(p) for F̃(y_1) ≤ p ≤ F̃(y_n).

There is some controversy as to the best values of α and β. The pair α = 0, β = 1 gives the standard naïve estimator F̂_n above, interpolated. Other values yield estimators with the property that F̂_{α,β}(y_i) < 1 for all sample y_i, which may be desirable, because it is unlikely in srs that the largest point in the sample is the largest possible. HF suggest a list of desirable properties of quantiles and indicate which different values of α and β satisfy them. We shall not go into detail here, except to note that HF's Property P2, p. 361, should be corrected to say that the count of occurrences of the variable that are less than or equal to Q̂(p) is no less than the floor of pn, the "floor" of a quantity being the largest integer not exceeding it. This revised criterion is met if α ≥ 0 and β ≤ 1, which replaces a statement in HF, p. 363 (Rob Hyndman, personal communication). It is an interesting question what the appropriate generalization of HF is when we seek an alternative to F̂_w in general. Shah and Vaish (2006) give one approach to this question. Shuster (1973) and Modarres (2002) consider the case where the targeted cdf has a known point of symmetry, so that F(θ + t) + F(θ − t) = 1 for all t. Transforming so that the point of symmetry is at zero, an estimator that puts weight 1/(2n) at ±y_i is a good deal more efficient than the naïve estimator F̂_n.

3. Estimating the distribution function with complete auxiliary information

The groundbreaking paper on using population information on an auxiliary variable x is that of Chambers and Dunstan (1986), described below.
The papers that followed, which we describe in this section, most immediately and notably the paper by Rao et al. (1990), can be thought of as responses to the fundamental property of the Chambers–Dunstan estimator: when the model it assumes is correct, it tends to be far and away better than other estimators; when the model is incorrect, the estimator has an inevitable bias, and it can do worse than even the naïve estimator.

3.1. The Chambers–Dunstan estimator

Chambers and Dunstan (1986) (henceforth CD) suppose a regression model holds such as

Y_i = x_i^T β + v_i^{1/2} ε_i,    (4)

where the errors ε_i ~ G(0, σ²) are independent, having some (typically unknown) distribution function G, with mean 0 and variance σ². The auxiliary variables x_i and v_i are assumed known for all units in the population. Then the central idea of CD is to estimate z_j = I(Y_j ≤ t), for j in the nonsample r, by an estimate of its expectation under the model. From (4) we get that this expectation can be written

E(z_j) = G( (t − x_j^T β) / v_j^{1/2} ).


Estimating this involves two steps using the sample data: (a) get a regression estimate β̂ of β, and (b) use the (standardized) residuals

ε̂_i = (y_i − x_i^T β̂) / v_i^{1/2}    (5)

to estimate G(u) by

Ĝ(u) = n^{-1} Σ_{i∈s} I(ε̂_i ≤ u).    (6)

The result is the classic Chambers–Dunstan estimator

F̂_CD(t) = N^{-1} { Σ_{i∈s} I(y_i ≤ t) + Σ_{j∈r} Ĝ( (t − x_j^T β̂) / v_j^{1/2} ) }    (7)

         = N^{-1} { Σ_{i∈s} I(y_i ≤ t) + n^{-1} Σ_{j∈r} Σ_{i∈s} I( ε̂_i ≤ (t − x_j^T β̂) / v_j^{1/2} ) }.    (8)

We note:

(1) F̂_CD is a distribution function, fulfilling both parts of Property 1.
(2) It is calibrated: if y_i is any of the components x_ki of x_i, then β̂ is 1 in the kth place and 0 elsewhere, the residuals ε̂_i are all zero, and the estimator reduces to F(t).
(3) The model that CD focused on was the through-the-origin model

Y_i = βx_i + v_i^{1/2} ε_i.    (9)

However, they make it clear that their approach suits the more general model (4).
(4) In the case of no auxiliary information, that is, in the case of the simple model Y_i = μ + ε_i, F̂_CD reduces to the naïve estimator F̂_n. However, if the model is expanded slightly to Y_i = μ + v_i^{1/2} ε_i, the result is not an estimator of the form (1).
(5) If the working model (4) holds, F̂_CD has a strong tendency to be much more efficient than F̂_π and other estimators of the cdf. This has been verified in a great variety of simulation studies. However, under unusual circumstances, unlikely, we believe, to be encountered in practice, F̂_CD can have a falling off of efficiency even if the model is correct (Chambers et al., 1992); see the discussion in Section 3.3. From a practical point of view, this is probably not of much consequence, but from a theoretical viewpoint it is quite intriguing. There is no analogous property among standard model-based estimators of means and totals. This suggests that indirect estimates of means using the cdf are unlikely always to be competitive.
(6) More serious is the robustness question: what happens to F̂_CD if the specific regression model adopted as the "working model" is mistaken? CD show that even if only the variance structure is misconstrued, biases can arise. This is a more serious consideration from both the theoretical and practical standpoints, and is, in large measure, the motivating force behind the development of a large number of alternative estimators.
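A minimal sketch of F̂_CD under the through-the-origin model (9), with variance function v(x) = x as default; the function name and the simple weighted least squares fit are ours, for illustration only:

```python
import numpy as np

def cd_estimator(y_s, x_s, x_r, t, v=lambda x: x):
    """Chambers-Dunstan estimator (7) under model (9), Y_i = beta*x_i + v_i^{1/2} eps_i.

    y_s, x_s: sample y and x; x_r: nonsample x (auxiliary info known population-wide)."""
    y_s, x_s, x_r = (np.asarray(a, float) for a in (y_s, x_s, x_r))
    v_s, v_r = v(x_s), v(x_r)
    # (a) weighted least squares estimate of beta, error variance proportional to v_i
    beta = np.sum(x_s * y_s / v_s) / np.sum(x_s ** 2 / v_s)
    # (b) standardized residuals, eq. (5)
    eps = (y_s - beta * x_s) / np.sqrt(v_s)
    # G_hat of eq. (6) evaluated at u_j = (t - beta*x_j)/v_j^{1/2} for each nonsample j
    u = (t - beta * x_r) / np.sqrt(v_r)
    g_hat = np.mean(eps[None, :] <= u[:, None], axis=1)
    N = len(y_s) + len(x_r)
    return (np.sum(y_s <= t) + np.sum(g_hat)) / N

# Perfectly fitting data (y = 2x): the estimator returns the exact population cdf,
# illustrating the calibration property of note (2).
print(cd_estimator([2.0, 4.0], [1.0, 2.0], [3.0], 5.0))   # population y = {2,4,6} -> 2/3
```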


3.2. The Rao–Kovar–Mantel estimator

An estimator that is design-consistent, and which, in effect, compensates for model misspecification, is the difference estimator of Rao et al. (1990) (henceforth RKM), also constructed with reference to some specific linear regression model represented generally by (4):

F̂_RKM(t) = N^{-1} [ Σ_{i∈s} d_i I(y_i ≤ t) + Σ_{k∈U} Ĝ_π( (t − x_k^T β̂_π) / v_k^{1/2} )
            − Σ_{i∈s} d_i Ĝ_πC( (t − x_i^T β̂_π) / v_i^{1/2} ) ],    (10)

where β̂_π is the weighted least squares estimate of β using weights d_i = π_i^{-1},

Ĝ_π(u) = Σ_{i∈s} π_i^{-1} I(ê_πi ≤ u) / Σ_{i∈s} π_i^{-1},  with  ê_πi = (y_i − x_i^T β̂_π) / v_i^{1/2},    (11)

and

Ĝ_πC(u_i) = Σ_{i′∈s} π_{ii′} π_{i′}^{-1} I(ê_πi′ ≤ u_i) / Σ_{i′∈s} π_{ii′} π_{i′}^{-1},  for  u_i = (t − x_i^T β̂_π) / v_i^{1/2}.    (12)

We note:

(1) Because of the differencing, F̂_RKM is not necessarily a distribution function: it can fail both the boundary and monotonicity aspects of Property 1. We do not regard this as a serious concern; algorithms such as pool-adjacent-violators exist to make rectifications where needed.
(2) The estimator is calibrated (fulfills Property 5).
(3) The estimator is complicated. In particular, it incorporates second-order inclusion probabilities, a fact that causes no problems in simple random or stratified sampling, but in general is a nuisance.
(4) The estimator is design and model unbiased and, if the model is even roughly correct, will tend to do better than F̂_π. F̂_RKM and simplified variants of F̂_RKM will do better than F̂_CD under severe misspecification of the model.
(5) Nonetheless, if the working model holds true, then F̂_RKM can be considerably less efficient than F̂_CD. Often, for a given not-well-modeled population, F̂_RKM will be considerably better than F̂_CD for some values of t, and the reverse for other values of t; for example, see Table 2 in RKM.

3.3. Asymptotic variances for CD and RKM estimators

Under the working model, both F̂_CD and F̂_RKM will have negligible bias. CD gave an analytic expression for the variance of F̂_CD under the model (9). Chambers et al. (1992) (henceforth CDH) give expressions for the asymptotic variances of both F̂_CD and F̂_RKM.


They assume a simple homoscedastic linear model Y_i = a + bx_i + ε_i, where the errors ε_i are independent and have distribution function G with mean 0 and variance σ². The density function g = G′ is assumed to exist. Sample and nonsample x are assumed to have a common asymptotic density d(x), with μ and τ² the corresponding mean and variance of x. They define four integrals:

I_1 = ∫ (x − μ) g{t − (a + bx)} d(x) dx

I_2 = ∫∫ G[{t − (a + bx)} ∧ {t − (a + bx*)}] d(x) d(x*) dx dx*  (where a ∧ b = min(a, b))

I_3 = ∫ G{t − (a + bx)} d(x) dx    (13)

I_4 = ∫ [ G{t − (a + bx)} − G{t − (a + bx)}² ] d(x) dx,

and prove

var{F̂_CD(t) − F(t)} = n^{-1}(1 − π)(τ^{-2} σ² I_1² + I_2 − I_3²) + N^{-1}(1 − π) I_4 + o(n^{-1})    (14)

var{F̂_RKM(t) − F(t)} = n^{-1}(1 − π) I_4 + N^{-1}(1 − π) I_4 + o(n^{-1}),    (15)

where π = lim(n/N) as n, N → ∞, assumed to exist. Note that I_1, etc., are more properly I_1(t), etc., functions of t.

Comments

(1) The term N^{-1}(1 − π)I_4, common to the two variances, is the limit of the variance of the unknown, nonsample component N^{-1} Σ_{j∈r} I(y_j ≤ t) of the target F(t). As in the case of estimation of totals (see Chapter 23), the variance corresponding to the nonsample term tends to be an order of magnitude less than the variance of the sample component.
(2) CDH note that I_2 ≤ I_3² + I_4, so that, except for the term with I_1 in it, var{F̂_CD(t) − F(t)} ≤ var{F̂_RKM(t) − F(t)}. This term arises out of the fact that we only have estimates of the parameters a and b, and leads to the surprising fact that even when the working model is correct, F̂_CD can do relatively badly for some t. CDH constructed a population ("the CDH population") to illustrate by simulation such a situation. The population had a rather unusual configuration of error and x distributions, which made I_1 large for a particular value of t.

3.4. The weighted average estimator

On the basis of the asymptotic variances (14) and (15), Wang and Dorfman (1996) (henceforth WD) construct a weighted average of the CD and RKM estimators: F̂_WA(t) = wF̂_CD + (1 − w)F̂_RKM. The weight w is a function of I_1, …, I_4 above, calculated to


achieve minimal (asymptotic) mean square error of the resulting estimator, and is estimable from the available data. Note that w, like the I's, depends on t. In simulation studies on both real and artificial populations, F̂_WA performs well compared to F̂_CD and F̂_RKM. Table 1 gives results on root mean square error for samples of size 60 from the "Beef Population," abstracted from Tables 3 and 4 in WD. Two sets of estimators are considered, one based on a linear model of y on x, which fails to adequately capture the curvature in the data, the other modeling log(y) on log(log(x)), which fits the data well (see Remark 1.1). The winner within each set is in bold italics. F̂_WA is never worse than second place, and is usually the winner or nose-to-nose with the winner. This suggests it adapts well, even if the working model is incorrect. There is another lesson to be gleaned from these results, noted in the next-to-last paragraph of the next section.

3.5. The fundamental issue of diagnostics

One of the effects of the dominance of the design-based approach to survey sampling has been the isolation of finite population estimation from mainstream statistics, which is dominated by modeling and data exploration. One striking aspect of this is the unwillingness of survey samplers to pay attention to the modeling process. This has special importance for the estimation of the finite population cdf. In the case of estimating totals or means, there are interesting workarounds, combining balanced designs and weighting, that mitigate the effects of model failure (see Chapter 26). This appears to be impossible for the model-based CD estimator of a cdf. Thus the question of model diagnostics, which plays such a prominent role in regression theory and practice and in statistical practice generally, may in the case of estimating the cdf be very important.

As noted earlier, the RKM estimator and its variants automatically adjust to model misspecification, and can do a good deal better than CD if the working model is wrong (or very peculiar, as in the CDH population). On the other hand, if the model is reasonably correct, CD will, as a rule, far outperform RKM. Dorfman (1993) examines the issue and concludes that under circumstances, of, say, time constraints or analyst inexperience, where careful diagnostics and modeling do not occur, the RKM estimator is the safer bet. On the other hand, it pays to do such analysis, and great efficiencies tend to come from using the proper model-based estimator F̂_CD. One can go a bit further. The degree of correctness of the model has an impact on the RKM estimator itself. One sees this in Table 1, where RKM under the transformed model does much better than RKM under the untransformed model; for example, in estimating the median, transformed RKM is 64% more efficient than RKM based on the untransformed model. The same holds also for F̂_WA. Thus even apart from the issue of which estimator to use, diagnostics can be important for getting the best out of the estimator of choice. One open question is to what extent the diagnostic process can be automated, replacing the analyst's observations and decisions by a well-designed sequence of testing. It is not clear that this would improve on the results of automatic model adaptation of nonparametric regression, the use of which for estimating the cdf is described in Section 3.8.


Table 1
Root mean square error × 10,000, 500 random samples (n = 60) from the beef population (N = 410), from Wang and Dorfman (1996)

p:                               0.10   0.25   0.50   0.75   0.90

Untransformed
  F̂_π                           335    529    609    523    383
  F̂_RKM                         313    447    447    370    322
  F̂_CD                          286    618    800    433    311
  F̂_WA                          267    440    541    409    306

log(y + 1) versus log(log(x))
  F̂_π                           358    505    604    533    371
  F̂_RKM                         303    383    349    330    279
  F̂_CD                          214    293    227    204    177
  F̂_WA                          222    296    227    209    191

3.6. RKM-like and CD-like estimators

There exist several estimators that are variants on the CD and RKM estimators.

3.6.1. CD-like estimators

Instead of estimating G(u) using the sample residuals as in (5), Mak and Kuk (1993) take G to be a normal (0, σ²) distribution, replacing Ĝ( (t − x_j^T β̂) / v_j^{1/2} ) in the CD estimator (7) mentioned earlier by Φ( (t − x_j^T β̂) / (σ̂ v_j^{1/2}) ), where Φ is the standard normal cdf and σ̂ is the estimate of standard deviation from the weighted regression fit. The main advantage of this is relative ease of calculation. It is not clear how well this alternative, F̂_CD,Φ, performs against CD calculated as at (7), (8). CD had noted (see Section 3.1 mentioned earlier) that, unlike estimators of the mean or total, F̂_CD is sensitive to departures from the working model for the variances of the errors. Lombardía et al. (2005) construct a CD-like estimator based on a nonspecific variance structure, replacing the prespecified v_j = v(x_j) in (4) by nonparametric regression estimates of the variances at the sample and nonsample points. They take v̂(x_j) = Σ_{i∈s} w_ij(b) r_i², where the r_i are the residuals from the fit of sample y on sample x using the working model, the w_ij(b) are nonparametric regression weights, larger for sample units i with x_i closer to x_j, and b is a "bandwidth" which controls how flat the weights are. Nonparametric regression is discussed further in Section 3.7; see also Chapter 27. In a simulation study on several well-behaved populations, using the standard linear model Y_i = α + βx_i + v_i^{1/2} ε_i, their estimator F̂_CD,v̂ does consistently better than a standard CD estimator based on the homoscedastic assumption v_i = 1.

Welsh and Ronchetti (1998) robustify the CD estimator against the presence of outliers (outliers in the sense of sharp local deviations of y from the working model) in two ways: (1) they replace the standard least squares estimator of β by a robust estimator (they choose the biweight estimator β̂_R; for a discussion of outlier-robust estimation see Chapter 11), and (2) they use robust estimation of the distribution function G, replacing the residuals ε̂_i in (5) by the cσ̂-scaled Huber ψ-function max{−cσ̂, min(ε̂_i, cσ̂)}, which puts bounds on how effectively large the contributing residuals can be. Here σ̂ is a robust estimate of standard deviation equal to 1.4826 times the median absolute deviation of


the ε̂_i from their median value, and c is a positive tuning constant at the discretion of the user. They suggest varying c with the quantile at which the cdf is being estimated and give some tentative guidelines, recommending smaller values of c for t at the low end of the y range and larger values at the high end. In a simulation study on the Beef Population (which has almost become the standard messy population in the finite population cdf literature), the resulting estimators at the 0.5, 0.75, and 0.90 quantiles improve considerably on the CD estimator with respect to mean square error. A nonparametric version of the CD estimator is discussed in Section 3.7.3.

3.6.2. RKM-like estimators

Although F̂_RKM originated as a design-based alternative to F̂_CD, it can also be viewed from a model-based perspective as an estimator that self-corrects for model failure. Dorfman (1993) strips away much of the π structure to get a variant of F̂_RKM, which may be termed the residual-corrected estimator

F̂_rc(t) = N^{-1} { Σ_{i∈s} I(y_i ≤ t) + Σ_{j∈r} Ĝ( (t − x_j^T β̂) / v_j^{1/2} ) + Σ_{i∈s} (π_i^{-1} − 1) R_i }

         = F̂_CD(t) + N^{-1} Σ_{i∈s} (π_i^{-1} − 1) R_i,    (16)

where the upper level residuals R_i = I(y_i ≤ t) − Ĝ( (t − x_i^T β̂) / v_i^{1/2} ), and G and β are estimated as for the CD estimator. This is of the same form as Eq. (10), but avoids the design-based components (11) and (12). A similar estimator can be found in Godambe (1989). If the factors π_i^{-1} − 1 are a reasonable reflection of the number of nonsample units that are like the ith sample unit, then the residual adjustment term will give a measure of the difference between what the working model is yielding and what it should yield, blown up to nonsample size, yielding a kind of model-robustness. F̂_RKM basically does the same thing, but F̂_rc has the advantage of avoiding extra π calculations, in particular the second-order probabilities. In practice, F̂_RKM and F̂_rc will behave quite similarly (see Tables 3 and 4 of Dorfman (1993)). Of course, if the model is correct, the third term adds noise, and these estimators will not do as well as F̂_CD.

Chambers et al. (1993) (henceforth CDW) take (16) one step further, replacing the "a priori" factors π_i^{-1} − 1 by weights derived from the relation of units in the particular chosen sample to those in the nonsample. Suppose we had the upper level residuals R_i for all units in the population. Then F̂_adj(t) = F̂_CD(t) + N^{-1} Σ_{j∈r} R_j equals the target cdf F(t), by the definition of the R_i. The nonsample R_j can be estimated using nonparametric regression: R̂_j = Σ_{i∈s} w_ij(b) R_i, where the weights w_ij(b) are larger the nearer x_i is to x_j. The bandwidth b measures how close an x_i need be to x_j to give much weight to the ith sample unit. CDW describe one method for selecting b, but bandwidth selection is always a difficult issue. Section 3.7 and also Chapter 27 further discuss nonparametric regression. The resulting estimator takes the form

F̂_CDW(t) = F̂_CD(t) + N^{-1} Σ_{i∈s} w_i R_i,    (17)

Inference on Distribution Functions and Quantiles

383

 where wi = j∈r wij (b) represents the total contribution the ith sample point makes to estimating the different nonsample Rj . This estimator appears to have some asymptotic advantage over Fˆ RKM or Fˆ rc (see discussion, Section 3.7.3), but it is more complicated to calculate, and a distinct advantage has not yet been shown in simulation studies. Like Fˆ RKM , it is not necessarily a proper distribution function. As with the CD estimator, there is a nonparametric regression version of the RKM estimator, given in Section 3.7.3. Additionally, there is the family of calibration estimators, described in Section 3.8, which bear a strong kinship to RKM.
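The residual correction in (16) is simple to compute once the CD ingredients (the fitted slope β̂ and the empirical residual cdf Ĝ) are at hand. The sketch below is ours, not Dorfman's code: the data, the variable names, and the working model y = βx + x^{1/2}ε (so v_i = x_i) are illustrative assumptions, with constant inclusion probabilities as under srs.

```python
import numpy as np

def f_cd_and_rc(t, y_s, x_s, x_r, pi_s):
    """CD estimator and its residual-corrected variant (Eq. 16),
    under the working model y = beta*x + sqrt(x)*eps (so v_i = x_i)."""
    N = len(x_s) + len(x_r)
    beta = np.sum(y_s) / np.sum(x_s)          # WLS slope when v(x) = x
    eps = (y_s - beta * x_s) / np.sqrt(x_s)   # standardized sample residuals
    def Ghat(u):                              # empirical cdf of the residuals
        return np.mean(eps[None, :] <= np.asarray(u)[:, None], axis=1)
    # CD: exact sample part + model prediction for each nonsample unit
    F_cd = (np.sum(y_s <= t) + np.sum(Ghat((t - beta * x_r) / np.sqrt(x_r)))) / N
    # residual correction, blown up to nonsample size by (1/pi_i - 1)
    R = (y_s <= t) - Ghat((t - beta * x_s) / np.sqrt(x_s))
    F_rc = F_cd + np.sum((1.0 / pi_s - 1.0) * R) / N
    return F_cd, F_rc

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 500)
y = 2.0 * x + np.sqrt(x) * rng.normal(size=500)
mask = np.zeros(500, bool)
mask[rng.choice(500, 100, replace=False)] = True
F_cd, F_rc = f_cd_and_rc(np.median(y), y[mask], x[mask], x[~mask],
                         np.full(100, 100 / 500))  # srs: pi_i = n/N
print(F_cd, F_rc)  # both should be near 0.5
```

With equal π_i = n/N the correction term reduces to ((N − n)/N) times the mean sample residual, so under a correct working model F̂_rc stays close to F̂_CD, as noted above.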

3.7. Nonparametric regression-based estimators

It is an interesting fact that the first use of nonparametric regression in survey sampling (discussed more broadly in Chapter 27) was for the purpose of estimating the distribution function. This is not entirely a historical accident. In estimating totals and means, robustness against model failure can be achieved by adroit sampling with concomitant weighting (see Chapter 23). This is not the case for the CD estimator of the cdf. A natural alternative to the model diagnosis of Section 3.5 is to weaken the model so that its failure is unlikely. Nonparametric regression assumes only that the expected value of the target variable given x is a smooth function. Dorfman and Hall (1993) (henceforth DH) describe three models in which nonparametric regression can play a role:

Model 1. Conditional on x, y obeys a parametric model, such as the linear model (4).

Model 2. The expectation of y given x is a smooth function of unspecified form: y_i = m(x_i) + ε_i, with the ε_i independent with a common distribution function G.

Model 3. The expectation of z_i ≡ h(y_i) ≡ I(y_i ≤ t) given x is a smooth function of unspecified form: E(z_i) = H(x_i). It may be worth noting that in this case there are as many models implied as there are values of t of interest.

They do not consider a fourth case:

Model 4. Conditional on x, z_i ≡ h(y_i) ≡ I(y_i ≤ t) obeys a parametric model, such as the logistic. To our knowledge, the possible use of this model for estimating the cdf has not been studied.

We have already seen a use of nonparametric regression in Model 1 by CDW (Section 3.6.2) and Lombardía et al. (2005) (Section 3.6.1). The original nonparametric regression estimator, due to Kuo (1988), relies on Model 3. The nonparametric CD and RKM estimators, described later, rely on Model 2. Consider Model 2.
There are many methods for estimating m(x_i), all coming under the heading of nonparametric regression, and typically coming down to an estimate of the form $\hat m(x_j) = \sum_{i\in s} w_{ij}(b)\, y_i$. Here x_j may be a sample or nonsample point, and the w_ij(b) are weights that depend on the type of nonparametric regression and on a tuning parameter b that needs to be carefully selected. For example, we might take the w_ij(b) to be kernel weights

$$w_{ij}(b) = \frac{k\left(b^{-1}(x_i - x_j)\right)}{\sum_{i'\in s} k\left(b^{-1}(x_{i'} - x_j)\right)}, \qquad (18)$$

where k(·) is typically a symmetric density function with finite support, such as the Epanechnikov kernel k(u) = 0.75(1 − u²)I(|u| ≤ 1), and b is the bandwidth that controls how flat and extensive the weights will be. In using nonparametric regression, the question arises as to the role of the sampling weights. Traditional sampling practice would still seem to require their presence. The other view, as emphasized in the CDW study, is that nonparametric weights give an alternative to sampling weights for tying the given sample units to the population units they represent. We expect that as a rule adding in the sampling weights will not affect results too much, for better or worse.

3.7.1. The Kuo estimator

Kuo (1988) suggested the following estimator

$$\hat F_{Kuo}(t) = N^{-1}\left\{\sum_{i\in s} I(y_i\le t) + \sum_{j\in r}\hat P\left(y\le t\mid x_j\right)\right\},$$

with $\hat P(y\le t\mid x_j) = \sum_{i\in s} w_{ij}(b)\, I(y_i\le t)$. Kuo actually used nearest neighbor weights: the k units with x values nearest x_j have equal weight, the remainder weight zero. The implied model is Model 3, with H(x) = E(I(y ≤ t)|x) = P(y ≤ t|x). F̂_Kuo is readily seen to be of the form (1), with weights w_i = Σ_{j∈r} w_ij(b) that are necessarily positive, and so readily yields quantile estimates. It is not calibrated with respect to x. Dorfman and Hall (1993) (DH) give the asymptotic bias and variance of F̂_Kuo; see Section 3.7.3.

3.7.2. Kuo alternatives

Kuk (1993) likewise employs nonparametric regression estimates of the conditional distribution function P(Y ≤ t|x) and takes $\hat F_{kuk,np}(t) = N^{-1}\sum_{k\in U}\hat P(Y\le t\mid x_k)$, with

$$\hat P(Y\le t\mid x) = \frac{\sum_{i\in s} d_i\, k\!\left(b^{-1}[x - x_i]\right) K\!\left(b^{-1}[t - y_i]\right)}{\sum_{i\in s} d_i\, k\!\left(b^{-1}[x - x_i]\right)},$$

where k is a selected density function and K the corresponding distribution function. This estimator is conceived in the spirit of design-based sampling, and we note three differences from the Kuo: (1) it uses the design weights d_i, (2) it estimates the known sample portion of F(t), and (3) it replaces the indicator variables I by K. For the bandwidth, Kuk uses b = R_x/n, where R_x is the population range of the x-values; also, Y is rescaled to match its range to that of x, in order that b apply to both variables. The estimator is a cdf. In simulations, it outperforms RKM for several modes of sampling on two populations. We would expect it and F̂_Kuo(t) to behave similarly. (Note: Kuk uses the logistic distribution for K (and k). It might be safer to use a symmetric distribution, because a nonsymmetric K can shift the estimate of the mean entailed by F̂_kuk,np by an additive term dependent on the bandwidth.) In the case where x is a vector, it may be appropriate to use a smoothing device which appears in Durrant and Skinner (2006), namely to fit the sample y on the sample x and use the resulting fitted values ŷ_k, k ∈ U, in place of the x to form the weights w_ij.
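As a concrete illustration, a Kuo-type estimator with the kernel weights (18) takes only a few lines. This sketch is ours: it uses the Epanechnikov kernel and a fixed, arbitrarily chosen bandwidth rather than Kuo's nearest neighbor weights, and the data and names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    # k(u) = 0.75 (1 - u^2) I(|u| <= 1)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def f_kuo(t, y_s, x_s, x_r, b):
    """Kuo-type estimator: exact sample part plus kernel-weighted
    estimates of P(y <= t | x_j) at each nonsample point x_j."""
    N = len(x_s) + len(x_r)
    K = epanechnikov((x_s[:, None] - x_r[None, :]) / b)  # n x (N - n)
    W = K / K.sum(axis=0)                 # w_ij(b), normalized over i in s
    p_hat = W.T @ (y_s <= t)              # Phat(y <= t | x_j), j in r
    return (np.sum(y_s <= t) + np.sum(p_hat)) / N

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 400)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=400)  # smooth, nonlinear m(x)
mask = np.zeros(400, bool)
mask[rng.choice(400, 80, replace=False)] = True
est = f_kuo(np.median(y), y[mask], x[mask], x[~mask], b=0.25)
print(est)  # roughly 0.5, since t is the population median
```

Because the implied weights w_i = Σ_{j∈r} w_ij(b) are positive, an estimate of this form inverts directly to quantile estimates.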


DH consider a design-adjusted version of the Kuo, an analogue for Model 3 of the RKM estimator:

$$\hat F_{Kuo,adj}(t) = N^{-1}\left\{\sum_{i\in s} I(y_i\le t) + \sum_{j\in r}\hat P\left(y\le t\mid x_j\right) + \sum_{i\in s}\left(\pi_i^{-1}-1\right)R_i\right\},$$

where $R_i = I(y_i\le t) - \hat P(y_i\le t\mid x_i)$. Theorem 3 of DH suggests a subtle difference between the Kuo and Kuo-adjusted estimators: when the sample and nonsample x densities are the same, their "I₇" term drops out of the Kuo-adjusted. In empirical work on several artificial populations, using nonparametric local linear regression, Johnson et al. (2004) find the design-adjusted Kuo doing better than the Kuo. Model 3 aforementioned actually fails in only one of the seven populations they use, the "Jump" population. Thus "twicing" (estimating and adding on an estimate of the error of one's estimate) seems to be helpful even when the assumed model is correct.

Nascimento Silva and Skinner (1995) suggest a poststratified estimator of F(t). The range of x is partitioned into intervals of adjacent x, each interval corresponding to a poststratum g, g = 1, . . . , G. Then

$$\hat F_{ps}(t) = \sum_{g=1}^{G}\frac{N_g}{N}\,\hat F_{\pi g}(t), \qquad \hat F_{\pi g}(t) = \frac{\sum_{i\in g\cap s} d_i\, I(y_i\le t)}{\sum_{i\in g\cap s} d_i},$$

where N_g is the number of population units in the gth poststratum. The basic model here is the same as for the Kuo and Kuk estimators, except that the underlying continuous P(Y ≤ t|x) is being estimated by a step function. The choice of the length and number of the intervals defining the poststrata is analogous to the choice of bandwidth. In simulations, F̂_ps(t) with the best stratification did well compared to F̂_Kuo(t) with a fixed b, but it is not known how they would compare under optimal choice of bandwidth for the Kuo. F̂_Kuo(t) can in general be expected to do somewhat better, because the boundaries of the intervals in F̂_ps(t) can prevent points with x near to x_i from being used for estimation of P(y ≤ t|x_j). An exception would be the case where the underlying P(y ≤ t|x_j) really is a step function and the selected poststratification matches it. On the other hand, the poststratified estimator has the advantage of simplicity and familiarity to survey samplers.
Modarres (2002) suggests a similar idea and offers a modification in the special circumstance that F is known to have a point of symmetry. (See discussion of cdf symmetry in Section 2.2.)
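The poststratified estimator asks only for the poststratum counts N_g and is correspondingly easy to code. A sketch follows, with equal-width intervals and srs design weights as our illustrative choices; the data and names are hypothetical.

```python
import numpy as np

def f_ps(t, y_s, x_s, d_s, edges, N_g):
    """Poststratified cdf estimator: population-share-weighted
    Hajek cdf estimates within each x-interval (poststratum)."""
    g_s = np.digitize(x_s, edges[1:-1])     # poststratum index of each sample unit
    N = N_g.sum()
    F = 0.0
    for g in range(len(N_g)):
        m = g_s == g
        if m.any():
            F_g = np.sum(d_s[m] * (y_s[m] <= t)) / np.sum(d_s[m])
            F += (N_g[g] / N) * F_g
    return F

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 1000)
y = 2.0 * x + 0.2 * rng.normal(size=1000)
s = rng.choice(1000, 100, replace=False)
edges = np.linspace(0, 1, 6)                # G = 5 equal-width poststrata
N_g = np.histogram(x, bins=edges)[0]
d = np.full(100, 1000 / 100)                # srs design weights d_i = N/n
est = f_ps(np.median(y), y[s], x[s], d, edges, N_g)
print(est)  # should be near 0.5
```

Choosing `edges` here plays the same role as choosing the bandwidth b for the Kuo estimator, as noted above.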

3.7.3. Nonparametric CD and RKM

Based on Model 2, DH derive analogues of the CD and RKM estimators (their F̂₁ and F̃₁, respectively). Let $\hat m(x_k) = \sum_{i\in s} w_{ik}\, y_i$ be a nonparametric fit of y on x. They define

$$\hat F_{np,CD}(t) = N^{-1}\left[\sum_{i\in s} I(y_i\le t) + \sum_{j\in r}\hat G\left(t - \hat m(x_j)\right)\right],$$

$$\hat F_{np,RKM}(t) = N^{-1}\left[\sum_{i\in s} I(y_i\le t) + \sum_{j\in r}\hat G\left(t - \hat m(x_j)\right) + \sum_{i\in s}\left(\pi_i^{-1}-1\right)R_{np,i}\right],$$

where $R_{np,i} = I(y_i\le t) - \hat G\left(t - \hat m(x_i)\right)$, with $\hat G(u) = n^{-1}\sum_{i\in s} I(\hat\varepsilon_i\le u)$ and $\hat\varepsilon_i = y_i - \hat m(x_i)$.


DH analyze eight estimators which, aside from the naïve estimator F̂_π, belong to two classes: three for which a given chosen parametric (in fact, linear) model plays a role, namely F̂_CD, F̂_RKM, and F̂_CDW, and four nonparametric estimators ("the np group") for which it does not: F̂_Kuo, F̂_Kuo,adj, F̂_np,CD, and F̂_np,RKM, described earlier. In the following description of variances and biases, we consider just the case where the sample and nonsample densities for x are the same, as would arise under srs or simple balanced sampling (see Chapter 23). Several formulas then simplify.

In terms of variance, the estimators fall into three categories: (a) estimators that have a variance asymptotically equivalent to that of $var(\hat F_\pi) \approx n^{-1}(1-\pi)^2 I_4$, with I₄ as in (13) aforementioned (and referred to as I₆ in DH); (b) var(F̂_CD), as at (14) abovementioned; and (c) var(F̂_np,CD), which is similar in structure to var(F̂_CD), except that τ⁻²σ²I₁² is replaced by $\sigma^2 I_1^{(np)}$, with

$$I_1^{(np)} = \int\left\{g\left(t - m(x)\right) - \int g\left(t - m(x)\right) d_s(x)\, dx\right\}^2 d_s(x)\, dx$$

(notated I₃ in DH), where d_s(x) is the density of x among sampled units. It is not clear whether this term makes var(F̂_np,CD) more or less vulnerable to peculiar population structure than τ⁻²σ²I₁² makes var(F̂_CD). In any case this result suggests that F̂_np,CD will have efficiency close to that of F̂_CD without the same vulnerability to model failure. Provided the $I_1^{(np)}$ term is not large, we would expect F̂_np,CD to be more efficient than the remaining estimators that DH consider, because of the inequality I₂ ≤ I₃² + I₄ noted earlier.

Bias (compare DH, Table 1): Under suitable bandwidth choice, the np group will have bias that is o(n^{-1/2}), lower than the order of their standard deviation. The naïve estimator F̂_π has bias O(n^{-1/2}), the same order as its standard deviation. If the working model is correct, then the model-dependent estimators F̂_CD, F̂_RKM, and F̂_CDW all have bias of order O(n⁻¹). If the model fails, they have bias O(c), O(n^{-1/2}), and o(n^{-1/2}), respectively. DH note the relative strength of the CDW estimator compared to RKM. It is not clear from simulations under what circumstances their difference will be important. CDW did not find there to be much difference between them in their empirical study.

DH note that empirical comparison of the np estimators requires a practical means of selecting bandwidths. Lombardía et al. (2004) use a bootstrap methodology to estimate the optimal bandwidth b for F̂_np,CD. It would be of interest to apply their approach to estimating bandwidths of other estimators as well, and to compare the resulting estimates among themselves and to the parametric-based estimators. Model 2 underlying F̂_np,CD and F̂_np,RKM is a homoscedastic model. We may expect that in the nonparametric case, deviations from homoscedasticity will have minimal impact; nonetheless, this has not been investigated. We have noted the use of np regression by Lombardía et al. (2005) to estimate the variance function for the CD estimator. It would be of interest to carry out something parallel to this for F̂_np,CD and see the effect.

3.8. Calibration estimators

Here we consider estimators that calibrate the estimate of the cdf with respect to the auxiliary x in a variety of ways, with the twin goals of design consistency and some degree of efficiency under a chosen model: protecting against the worst in case of model failure and improving on the usual design-based estimator in the case of model correctness. Despite the name, not all members of this class of estimators are calibrated in the sense of fulfilling Property 5 of Section 1. The estimators do bear a kinship to the RKM estimator, often being asymptotically equivalent to it.

One set of methods for calibrating relies on the difference between a population total and a weighted sample sum whose terms are some function g(·), possibly vector valued, of x:

$$\Delta_w(x_U) = \sum_{k\in U} g(x_k) - \sum_{i\in s} w_i\, g(x_i);$$

g can depend on the sample y, and on x other than the explicit x_k, as well. There are at least three methodological schemes for doing this: (a) explicit regression models (GREG) (see Chapter 25), (b) the use of calibration weights (also Chapter 25), and (c) the use of pseudo empirical likelihood (see Chapter 30). For estimating the distribution function, all take the dependent variable as z_i = I(y_i ≤ t).

(a) In the case of GREG, w_i = d_i, and Δ_d(x_U) is an explicit component of the estimator. The estimator can be written in the form

$$\hat F_{GREG}(t) = N^{-1}\left[\sum_{i\in s} d_i\, I(y_i\le t) + \left\{\sum_{k\in U} g(x_k) - \sum_{i\in s} d_i\, g(x_i)\right\}\hat B_\pi\right],$$

where $\hat B_\pi = \sum_{i\in s} d_i\left(g(x_i)-\bar g_\pi\right) I(y_i\le t)\big/\sum_{i\in s} d_i\left(g(x_i)-\bar g_\pi\right)^2$ and $\bar g_\pi = \sum_{i\in s} d_i\, g(x_i)\big/\sum_{i\in s} d_i$.

(b) In the case of calibration weights, the w_i = p_i are such as to satisfy (i) Δ_p(x_U) = 0 and (ii) Σ_{i∈s} p_i = N, and at the same time minimize a chosen function measuring the distance between the sample w_i and the d_i. The estimator is of the form F̂_cal(t) = N⁻¹ Σ_{i∈s} p_i I(y_i ≤ t). In the particular case of the chi-square distance function D(w, d) = Σ_{i∈s}(w_i − d_i)²/(q_i d_i) (where the q_i are some user-controlled weights, usually just a constant), the calibration estimator reduces to the GREG mentioned earlier. The p_i need not be positive in this case.
(c) The pseudo empirical likelihood estimator can be regarded as exactly like the calibration situation, except now the p_i are such as to maximize the empirical likelihood $l(p) = \sum_{i\in s} d_i \log p_i$, and there is an additional condition (iii) 0 < p_i. (This follows by taking the standard formulation of pseudo empirical likelihood as in Chapter 30 and multiplying the p_i in that formulation by N.) The pseudo empirical likelihood estimators are, again, of the form $\hat F_{pl}(t) = N^{-1}\sum_{i\in s} p_i\, I(y_i\le t)$, but now the p_i are conveniently positive and satisfy Property 1a of Section 1.3. For some choices of g(x) the p_i will depend on t, so Property 1b is not necessarily satisfied in general. For a given choice of g, the calibration and pseudolikelihood estimators are asymptotically equivalent, and they tend to behave similarly to each other in simulations.

Wu and Sitter (2001a) (henceforth WS) suggest taking $g(x_i) = \hat G_\pi\left((t - x_i^T\hat\beta_\pi)/v_i^{1/2}\right)$, as in the RKM estimator. In that case we get

$$\hat F_{GREG}(t) = N^{-1}\left[\sum_{i\in s} d_i\, I(y_i\le t) + \left\{\sum_{k\in U}\hat G_\pi\!\left(\frac{t - x_k^T\hat\beta_\pi}{v_k^{1/2}}\right) - \sum_{i\in s} d_i\, \hat G_\pi\!\left(\frac{t - x_i^T\hat\beta_\pi}{v_i^{1/2}}\right)\right\}\hat B_\pi\right],$$

where $\hat B_\pi = \sum_{i\in s} d_i\left(\hat G_i - \bar G_\pi\right) I(y_i\le t)\big/\sum_{i\in s} d_i\left(\hat G_i - \bar G_\pi\right)^2$, with Ĝ_i shorthand for $g(x_i) = \hat G_\pi\left((t - x_i^T\hat\beta_\pi)/v_i^{1/2}\right)$ and $\bar G_\pi = \sum_{i\in s} d_i\,\hat G_i\big/\sum_{i\in s} d_i$ [compare WS, Eq. (19)]. The major difference between this and RKM is the B̂_π [compare Eqs. (10–12)].

Kovacevic (1997) suggests two calibration estimators. The first, F̂_KOV1, uses g(x) = x with a distance function D(w, d) other than the chi-square distance, designed to


guarantee that the p_i are nonnegative. This estimator does not require that all x in the population be known, but only the population totals. A second estimator F̂_KOV2 modifies F̂_KOV1 by including a bias correction term, which does require full population information: $\hat F_{Kov2}(t) = N^{-1}\sum_{i\in s} p_i z_i + N^{-1}\left(\sum_{k\in U}\hat G_k - \sum_{i\in s} p_i\hat G_i\right)$, with the Ĝ_i defined as earlier. This estimator behaves very similarly to RKM.

Chen and Wu (2002) (henceforth CW) suggest three pseudolikelihood estimators. The first relies on a model

$$y_i = \mu(x_i, \beta) + v_i^{1/2}\varepsilon_i, \qquad (19)$$

which, of course, includes the linear model (4), except that they posit that the cdf of the errors is a normal distribution (cf. Kuk (1993), discussed in Section 3.7.2). They take $g(x_i) = \Phi\left(\left(t_0 - \mu(x_i, \hat\beta_\pi)\right)\big/\left(v_i^{1/2}\hat\sigma_\pi\right)\right)$, where Φ is the standard normal distribution, for some suitable fixed preselected value t₀. The resulting estimator $\hat F_{pl,CW1}(t) = N^{-1}\sum_{i\in s} p_i(t_0)\, I(y_i\le t)$ is asymptotically equivalent to a calibration estimator, satisfies Property 1b, and so is a distribution function. The estimator is design consistent for all t. It is model unbiased for F(t₀), but not necessarily for F(t), t ≠ t₀, so there might be some loss of efficiency if t is far from t₀.

CW's second estimator F̂_pl,CW2(t) relies on a model for z_i = I(y_i ≤ t₀). CW adopt in particular a logistic model and take $g(x_i) = \exp\left(x_i^T\hat\beta_\pi\right)\big/\left[1 + \exp\left(x_i^T\hat\beta_\pi\right)\right]$. F̂_pl,CW2(t) has similar characteristics to F̂_pl,CW1(t). Their third estimator F̂_pl,CW3(t) returns to the model (19) and takes $g(x_i) = I\left(\hat y_{\pi i}\le t_0\right)$, where $\hat y_{\pi i} = \mu(x_i, \hat\beta_\pi)$ are the fitted values from the regression. In a simulation study on artificial populations CW find F̂_pl,CW1(t) and F̂_pl,CW2(t) outperforming F̂_pl,CW3(t).

Rueda et al. (2007a) (hereafter RMMA) develop a calibration estimator, also based on ŷ_πi and a fixed vector t* = (t₁, t₂, …, t_P)ᵀ, with t₁ < t₂ < ⋯ < t_P, leading to a vector-valued $g(x_i) = I\left(\hat y_{\pi i}\le t^*\right) = \left(I(\hat y_{\pi i}\le t_1), \ldots, I(\hat y_{\pi i}\le t_P)\right)^T$. RMMA use the chi-square distance and show that, if q_i = c, a constant, then the weights p_k take a surprisingly simple form and are positive (see RMMA, Section 4). If t_P is taken large enough, then F̂_cal,RMMA(t) → 1 as t → ∞. This estimator is a cdf. In simulation studies on some actual populations, the estimator performs on a level with the CD estimator, and better than RKM. It has not been established how robust it is to model failure. Rueda et al. (2007b) replace $\hat y_{\pi i} = \mu(x_i, \hat\beta_\pi)$ in the above by a nonparametric fit $\hat y_{\pi i} = \hat m(x_i)$. See Section 3.7 for a discussion of nonparametric regression.

Harms and Duchesne (2006) suggest a calibration estimator based on a population x-quantile, as an intermediate step to estimating quantiles.
Their estimator differs from the above in replacing z_i = I(y_i ≤ t) by

$$\varphi(y_k; t) = \begin{cases}\dfrac{t - y_{k-1}}{y_k - y_{k-1}}, & y_{k-1}\le t < y_k,\\[4pt] I(y_k\le t), & \text{otherwise.}\end{cases}$$

The resulting estimator

$$\hat F_{HD}(t) = N^{-1}\sum_{k\in s} w_k\,\varphi(y_k; t) \qquad (20)$$


is an interpolated, continuous, piecewise-linear function of t. As in the above estimators, the weights minimize a distance function D(w, d). The constraints are Σ_{i∈s} w_i = N and Q̂_HD,x(p₀) = Q_x(p₀), where Q_x(p₀) is a chosen (or available) population quantile of the x-variables. They allow x to be vectorial, so in general Q_x(p₀) is a vector of length J ≥ 1. The sample estimates are defined in terms of a sample cdf of the form (20): $\hat Q_{HD,x}(p_0) = \hat F_{HD,x}^{-1}(p_0)$. Their primary focus is on estimating quantiles; see Section 5.2 below for further discussion.

Kuk and Mak (1994) (henceforth KM) replace the difference form of the abovementioned calibration estimators by a relation between distribution functions. They offer two estimators, one which does not rely on an explicit model, and another which, like F̂_RKM, does. The first, the simple transformation model, is

$$\hat F_{KM1}(t) = G\left(\hat G_\pi^{-1}\left(\hat F_\pi(t)\right)\right),$$

where G is the population distribution function for x, and Ĝ_π is its Hájek estimator. They suggest making both F̂_π and Ĝ_π continuous by linear interpolation before applying this transformation. KM point out that there is a tacit weak model here, namely that y is some monotone increasing function of x. Where this condition is met, but an explicit working model is false (e.g., the working model assumes linearity, but there is strong curvature), this estimator can do considerably better than estimators like F̂_RKM which incorporate the model (cf. KM, Figure 1, and Table 1, Factory Population). KM's second estimator is

$$\hat F_{KM2}(t) = H\left(\hat H_\pi^{-1}\left(\hat F_\pi(t)\right)\right),$$

where $H(t) = N^{-1}\sum_{k\in U}\hat G_\pi\left(\left(t - x_k^T\hat\beta_\pi\right)/v_k^{1/2}\right)$ and $\hat H_\pi(t) = \sum_{i\in s} d_i\,\hat G_\pi\left(\left(t - x_i^T\hat\beta_\pi\right)/v_i^{1/2}\right)\big/\sum_{i\in s} d_i$. Especially when the model is correct or nearly correct, this estimator behaves much like F̂_RKM, with the added advantage of being a strict distribution function.
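To illustrate, the simple transformation estimator F̂_KM1 can be coded in a few lines. The sketch below assumes equal weights (srs), so that Ĝ_π is the unweighted sample x-cdf, and skips the linear interpolation step KM suggest; the data and names are illustrative.

```python
import numpy as np

def f_km1(t, y_s, x_s, x_U):
    """Kuk-Mak simple transformation estimator under equal weights:
    F_KM1(t) = G(Ghat^{-1}(Fhat(t))), G being the population x-cdf."""
    n = len(y_s)
    p = np.mean(y_s <= t)                      # Fhat_pi(t)
    if p <= 0.0:
        return 0.0
    idx = min(int(np.ceil(p * n)) - 1, n - 1)
    xq = np.sort(x_s)[idx]                     # Ghat_pi^{-1}(p)
    return np.mean(x_U <= xq)                  # G at that x-quantile

rng = np.random.default_rng(5)
x = rng.lognormal(size=2000)
y = np.sqrt(x) + 0.05 * rng.normal(size=2000)  # y roughly monotone in x
s = rng.choice(2000, 200, replace=False)
est = f_km1(np.quantile(y, 0.7), y[s], x[s], x)
print(est)  # should be near F_N(t) = 0.7
```

Only the monotone relation between y and x is exploited; no parametric working model is fitted, which is the source of the robustness to curvature noted above.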

4. Estimating the distribution function using partial auxiliary information

Several estimators of Section 3 can be applied with a limited amount of population information on the auxiliary. The poststratified estimator F̂_ps of Nascimento Silva and Skinner (1995) requires only interval information. The calibration estimator F̂_Kov1 of Kovacevic (1997) requires only means or totals. The calibration estimator F̂_HD of Harms and Duchesne (2006) requires only knowledge of population quantiles.

Dunstan and Chambers (1989) adapt the CD estimator to the situation in which grouped information is available on x. Assumed known are the number N_h of x in each of H intervals [x_hL, x_hU], h = 1, …, H, the boundaries themselves, as well as the within-interval means x̄_h. Positing a within-interval cdf C_h(x) for the x, and conditioning on the known residuals from the sample fit, they derive an expression for the expectation of the double sum in (8) as $\sum_h (N_h - n_h)\left\{1 - n^{-1}\sum_{i\in s}\Phi_{ht}(\hat\varepsilon_i)\right\}$, where Φ_ht is the cdf of the transformed variable $\left(t - \hat\beta x_{hi}\right)/v_{hi}^{1/2}$, it being assumed that v_hi is a well-defined function of x_hi. They work out details for the case where v_hi = x_hi and C_h


is a split uniform distribution on [x_hL, x_hU] with mean x̄_h. They also derive expressions for variance estimation. In simulations on a population where the linear model holds pretty well, the behavior of the resulting estimator F̂_CD,lim closely approximates that of F̂_CD. It would be interesting to see how their estimator compares to F̂_ps in the case where just interval boundary information is available.

Kuk and Mak (1989) suggest an estimator of the population cdf which relies only on knowing the population median Q_x(1/2) of an auxiliary x. Break the sample into two components: s₁, those with x_i ≤ Q_x(1/2), and s₂, those with x_i > Q_x(1/2). (Under unequal probability sampling, these groups could be quite different in size.) Let n_x1 be the number of units in the first group and n_x2 the number in the second, and define $\hat F_{KM,Q(1/2)}(t) = \frac{1}{2}\left(\hat F_1(t) + \hat F_2(t)\right)$, where $\hat F_l(t) = n_{xl}^{-1}\sum_{i\in s_l} I(y_i\le t)$. This is a well-defined cdf. Their primary concern is estimating the population median for y, which can readily be gotten by inverting F̂_KM,Q(1/2)(t). In simulations it does not do quite as well as their direct "position" estimate of Q_y(1/2) (see Section 5.2).

5. Quantile estimation

The primary way to get estimates of quantiles is by inverting estimates of distribution functions, as we briefly describe in Section 5.1. In addition to estimates by inversion, there are a number of direct estimates of quantiles suggested in the literature, described in Section 5.2. Also, there is a class of what we shall here call "hinge estimates," which avoid extensive repeated calculations on a complicated cdf estimator by making use of F̂_π or another simply calculated estimator (Section 5.3).

5.1. Inversion of estimates of the distribution function

Estimates of quantiles are most readily achieved through inverting an estimate of the distribution function.
Thus all of the estimates F̂*(t) of the distribution function described earlier yield a corresponding quantile estimate

$$\hat Q_*(p) = \min\left\{t : \hat F_*(t)\ge p\right\}. \qquad (21)$$

Basically, one computes a grid of values F̂*(t_ν), t₁ < t₂ < ⋯ < t_ν < ⋯ < t_{ν*}, so that the values F̂*(t_ν) surround p, and takes the smallest of the t_ν satisfying (21). It is desirable that the estimator have the monotonicity property: p < p′ ⇒ Q̂*(p) ≤ Q̂*(p′). This requires that F̂*(t) be a proper distribution function; in some cases this may require isotonization. The estimators F̂_KM,Q(1/2) of Kuk and Mak (1989) and F̂_HD of Harms and Duchesne (2006) are specifically intended for quantile estimation (see comment in Section 5.2).

5.2. Direct estimates

By a direct estimate of a quantile, we mean one that does not require an explicit expression for the cdf, Q̂_w(p) in the next paragraph being the prime example. We have already noted that estimators of the form (1) with fixed positive w_i are particularly easy to invert: we order the sample y by size and take Q̂_w(p) to be the smallest y_i such that $\sum_{1}^{i} w_i\ge p$. Q̂_n(p), the sample quantile when the weights are all


equal, is especially simple. In the case of small samples, it may be worthwhile to interpolate between the sample y values, effectively linearizing the estimate of the cdf. Additionally, one might want to modify the weights themselves slightly, to achieve analogues of the quantile estimates available in standard software packages (compare Hyndman and Fan (1996) and Section 2.2 earlier).

Rao et al. (1990) suggest readily calculated ratio and difference estimators:

$$\hat Q_{rat}(p) = \frac{\hat Q_{y\pi}(p)}{\hat Q_{x\pi}(p)}\, Q_x(p), \qquad \hat Q_{dif}(p) = \hat Q_{y\pi}(p) + \hat\beta_\pi\left(Q_x(p) - \hat Q_{x\pi}(p)\right),$$

with $\hat\beta_\pi = \sum_s d_i y_i\big/\sum_s d_i x_i$. The ratio estimator requires that x be a scalar, but the difference estimator readily generalizes to vector x (Kovacevic, 1997). In either case the only information required beyond the sample is the appropriate population quantile for x. It may be noted that in neither is monotonicity guaranteed; in rare cases, one might have to make adjustments. Empirical work suggests both do better than Q̂_π(p). Harms and Duchesne (2006) carried out an empirical study on several populations and under two sampling plans, comparing Q̂_π, Q̂_rat, Q̂_dif, $\hat Q_{HD} = \hat F_{HD}^{-1}$, and $\hat Q_{CD} = \hat F_{CD}^{-1}$. Typically, apart from the CD estimator, which could strongly lead or lag, the winners with respect to mean square error were Q̂_rat or Q̂_HD. There were instances where Q̂_rat was a good deal worse than Q̂_HD, but not the reverse, so overall Q̂_HD performed well.

Kuk and Mak (1989) suggest a position estimator Q̂_pos(1/2) of the population median of y that relies on knowing only the population median Q_x(1/2) of an auxiliary x. The basic idea is this: there exists some p between 0 and 1 such that the sample distribution function evaluated at the unknown population median is p: F̂_n(Q(1/2)) = p. Thus, a good estimate of p will yield a good estimate of Q(1/2).
This is achieved as follows. Break the sample into two components: s₁, those with x_i ≤ Q̂_x(1/2), the x sample median, and s₂, those with x_i > Q̂_x(1/2). Let n_x1 be the number of units in s₁ and n_x2 the number in s₂. Let p̂_{1|1} be the proportion of points in the first group s₁ for which y is less than the sample median, and p̂_{1|2} the like proportion of the points in s₂. Then $\hat p = n^{-1}\left(n_{x1}\hat p_{1|1} + n_{x2}\hat p_{1|2}\right)$ is an estimate of the fraction p of points i in the sample having y_i less than or equal to Q_y(1/2), and we take the position estimator $\hat Q_{pos}(1/2) = \hat Q_n(\hat p)$, the p̂th sample quantile. In simulations the position estimator does better than the straight sample median, than a ratio-type estimator, and than an estimate based on F̂_KM,Q(1/2). Note that it makes no use of selection probabilities. It would be of interest to compare it to the Hájek estimator Q̂_π(1/2).

Meeden (1995) uses an urn model simulating a posterior distribution of the nonsample y_i that rests on the idea of exchangeability of the ratios r_i = y_i/x_i, as would hold under the model

$$y_i = \beta x_i + x_i\varepsilon_i. \qquad (22)$$

The method assumes availability of the nonsample x_j, which we can assume set out in order x_{n+1}, …, x_N. A sample unit i is selected at random, $r^*_{n+1} = r_i$ is recorded, the unit is cloned, and both it and its clone are returned to the urn. The process is repeated N − n times to get $r^*_{n+j}$, j = 1, …, N − n (the eventual size of the urn is N), and nonsample values $y^*_{n+j} = r^*_{n+j}\, x_{n+j}$. These, combined with the known sample values of y, give one realization of a population of y compatible with the given data and assumed


structure. The whole process is repeated R times and the average of the medians of the R generated populations is taken as the estimated Polya posterior median Q̂_pp(1/2). (Clearly, the same process supplies estimates of other quantiles as well, and these would fulfill the monotonicity property.) In simulation studies on three real and several artificial populations, Q̂_pp(1/2) appears to be robust to failure of the model (22) and to perform on a par with the inverse of the CD estimator.

5.3. Hinge estimates of quantiles

The following idea arises out of Mak and Kuk (1993). Suppose we have an estimator F̂*, typically based on the population values of an auxiliary x, and so presumed accurate, but complicated enough that we might prefer not to calculate it repeatedly on an extensive grid {t_ν} to get F̂*⁻¹(p). Suppose available also a simple estimator F̂_w, say F̂_π. Then, as in the rationale for the position estimator Q̂_pos described earlier, it would be enough to determine q so that F̂_w(Q(p)) = q, for then we could take Q̂_adj(p) = F̂_w⁻¹(q). The difference Δ ≡ q − p can be written Δ = F̂_w(Q(p)) − F(Q(p)), the difference between the value of the simple estimator and the actual population distribution function evaluated at the population quantile. It thus suffices to estimate Δ. This is done by replacing Q(p) by an initial estimate such as Q̂_w(p) and F by F̂*, to yield a one-step adjusted estimator

$$\hat Q^{(1)}_{adj}(p) = \hat F_w^{-1}\left(p + \hat F_w\left(\hat Q_w(p)\right) - \hat F_*\left(\hat Q_w(p)\right)\right) = \hat F_w^{-1}\left(2p - \hat F_*\left(\hat Q_w(p)\right)\right). \qquad (23)$$

This requires calculation of F̂* at only one value of t. The process can be iterated, replacing Q̂_w(p) by Q̂^(1)_adj(p) on the right in expression (23); this will require a calculation of F̂* at a second value; and further iteration is possible. Mak and Kuk focus on F̂* = F̂_RKM, but there is no reason to limit the idea to this case, and indeed in a slightly variant version of this estimator they take F̂* = F̂_CD (see Section 3.6.1). It is not clear what the impact is of the choice of F̂* or of the degree of iteration, nor how close the result would be to F̂*⁻¹(p).

6. Variance estimation and confidence intervals for distribution functions

We shall not in this chapter pursue the possibilities for variance estimation in great detail. Each of the estimators described earlier will require some particular corresponding variance estimator, and it seems wisest to refer the reader to the originating papers. Instead, we shall make some general remarks, focus on the CD and RKM estimators, and point to one or two papers whose primary concern is variance estimation. In general, there are three basic approaches: (1) plug-in model based, estimating asymptotic variances, (2) design-based, and (3) replication methods, which are widely regarded as falling into both the model-based and design-based camps. All three are considered in Wu and Sitter (2001b) (henceforth WS). The model assumption is the simple linear model y_i = α + βx_i + ε_i, and they get an estimate $v_m = \hat{var}\left(\hat F_{CD}(t) - F(t)\right)$ by plugging in sample estimates of the I terms in (14), replacing the separate expressions I₂ and I₃ by an expression for I₂₃ ≡ I₂ − I₃² and estimating that directly, for the sake of stability and positivity.


WS suggest several jackknife estimates for the CD estimator, but best appears to be a hybrid jackknife estimator which estimates the (less important) var(F(t)) component by a simple plug-in estimate of I_4, and var( F̂_CD(t) ) by the jackknife

v_J1(t) = ((n − 1)/n) Σ_{i∈s} ( F̂_CD,−i(t) − F̄(t) )²,

where F̂_CD,−i(t) is F̂_CD(t) computed without making any use of the ith sample point, and F̄(t) = n^{-1} Σ_i F̂_CD,−i(t) (note that F̂_CD(t) itself would probably do as well). We note that leave-one-out estimates of the necessary parameters and residuals are readily available without needing complete recalculation (Miller, 1974).

For the design-based RKM estimator, WS take the design-based variance estimator

v_d = v̂ar( F̂_RKM ) = N^{-2} Σ_{i∈s} Σ_{j>i} ((π_i π_j − π_ij)/π_ij) ( u_i(j)/π_i − u_j(i)/π_j )²,

where the residuals u_i(j) = I( y_i ≤ t ) − Ĝ_ic(j), with

Ĝ_ic(j) = [ Σ_{k∈s} (π_ij/π_ijk) I{ y_k ≤ t − β̂_π(x_i − x_k) } ] / [ Σ_{k∈s} π_ij/π_ijk ].

Typically, simplifying assumptions will be necessary to calculate the higher order inclusion probabilities.

WS give two variants of a jackknife estimator of var( F̂_RKM(t) ): v_Jd2, based on F̂_RKM,−i(t), as v_J1(t) was based on F̂_CD,−i(t), and v_Jd1, slightly easier to calculate, based on the full-sample quantities Ĝ^{πC}( (t − x_k^T β̂_π)/v_k^{1/2} ) and Ĝ^{π}( (t − x_i^T β̂_π)/v_i^{1/2} ) of (10), and using deleted versions only of the weights.

In testing on artificial data, the plug-in estimator v_m for CD was consistently less variable than the hybrid jackknife (INST, WS, Table 1), and tended to be biased lower (RB, WS, Table 1). This could lead, we suspect, to somewhat closer to nominal coverage for the jackknife. For RKM, v_Jd1 and the design-based estimator v_d were indistinguishable; v_Jd2 was consistently less variable than they were, and tended to be biased higher (WS, Table 2). This should lead to better coverage for v_Jd2. It would be of interest to assess coverage properties. WS avoided considering a model-based plug-in estimate for var( F̂_RKM(t) − F(t) ), but it would be interesting to see how the one given in Wang and Dorfman (1996) compares empirically to v_d and v_Jd2.

Lombardía et al. (2003) suggest a bootstrap methodology for generating bias and variance estimates and confidence intervals for the CD estimator. The method involves Monte Carlo generation of B bootstrap populations based on the fitted model and residual distribution from the sample, and R samples from each population, and assumes homoscedasticity and that the working model is correct. It is not clear how robust the methodology is to deviations from those conditions. Lombardía et al. (2004) use a similar bootstrap methodology for generating estimates of bias, variance, and mean square error for the nonparametric CD estimator F̂_np,CD, as ingredients in a comprehensive bandwidth selection procedure (Section 3.7.3). Again, homoscedasticity is assumed. It would be of interest to see how their methodology can be adapted to other estimators, like the Kuo. It is very computer intensive.

Empirical likelihood confidence intervals for the distribution function are discussed in Chapter 30.

7. Confidence intervals and variance estimates for quantiles

Woodruff (1952) describes a straightforward method for transforming estimates of precision for a cdf to estimates of precision for the corresponding quantiles. Assume a
cdf estimator F̂(t) is itself a cdf and that (1 − α)100% two-sided confidence intervals [ F̂_L(t), F̂_U(t) ] are available, based say on the normal assumption and an estimate of var( F̂(t) − F(t) ). Suppose we want a (1 − α)100% two-sided confidence interval for t_p = Q(p), which we have estimated by t̂_p = F̂^{-1}(p). Let t_L = F̂^{-1}( F̂_L(t̂_p) ) and t_U = F̂^{-1}( F̂_U(t̂_p) ). Then Woodruff argues that [t_L, t_U] is an approximate (1 − α)100% confidence interval for Q(p). It will not in general be symmetric about Q̂(p).

Francisco and Fuller (1991) (henceforth FF) offered an alternative construction, taking t_L = F̂_U^{-1}( F̂(t̂_p) ) and t_U = F̂_L^{-1}( F̂(t̂_p) ). The resulting interval is basically a standard calibration confidence interval in the sense of Carroll and Ruppert (1988), Section 2.9.3. It is more complicated to calculate than the Woodruff, because it requires values of the upper and lower bounds over a range of values of t near t_p, as well as various transformations and smoothing operations. The Woodruff has held up well in empirical studies, and there is little evidence of an advantage to the FF.

Sitter and Wu (2001) note that, especially at low or high p, the actual coverage of the Woodruff interval for Q(p) can be better than that of the corresponding interval for F[Q(p)]. (Consider small p; by the nature of a binomial distribution, p̂ = F̂(Q(p)) is more likely than not to lie below p, and because the variance estimator of a binomial is monotonically increasing for small p, the variance estimate associated with p̂ will be lower than it would have been had p̂ equaled p. This can lead to low coverage of the corresponding confidence interval for the cdf. On the other hand, on the same premise, working backwards, one can anticipate that Q̂(p) is likely to lie above Q(p), with F̂( Q̂(p) ) ≥ F̂( Q(p) ), and the corresponding variance estimate of the cdf, from which the Woodruff is generated, is larger than what it would have been if Q̂(p) = Q(p).)
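To make the construction concrete, here is a minimal sketch of the Woodruff inversion for a simple weighted ecdf. The binomial-form variance and all names used are illustrative assumptions, not part of Woodruff's proposal:

```python
import numpy as np

def ecdf(y, w, t):
    """Hajek-weighted cdf estimate at a point t."""
    return np.sum(w * (y <= t)) / np.sum(w)

def quantile_est(y, w, p):
    """Invert the weighted ecdf: smallest observed y with Fhat(y) >= p."""
    for t in np.sort(y):
        if ecdf(y, w, t) >= p:
            return t
    return np.max(y)

def woodruff_interval(y, w, p, z=1.96):
    """Woodruff (1952): form a normal interval for F at t_p = Qhat(p),
    then carry each endpoint back through the inverse of Fhat.
    The binomial-form variance below is a placeholder assumption."""
    n = len(y)
    t_p = quantile_est(y, w, p)
    F_p = ecdf(y, w, t_p)
    se = np.sqrt(F_p * (1.0 - F_p) / n)   # placeholder, not a survey variance
    p_lo = max(F_p - z * se, 0.0)
    p_hi = min(F_p + z * se, 1.0)
    return quantile_est(y, w, p_lo), quantile_est(y, w, p_hi)
```

Note that the resulting interval need not be symmetric about the point estimate: any asymmetry of F̂ near t̂_p carries through the inversion.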
For low or high p, Shah and Vaish (2006) make adjustments to F̂(t) and adopt modifications of the variance estimate, due to Korn and Graubard (1998c), prior to applying the Woodruff. Dorfman and Valliant (1993) (DV), in the context of quantile estimation for domains, some of them quite sparse, compare Woodruff and FF confidence intervals, as well as confidence intervals derived from balanced repeated replication (BRR) variance estimation applied directly to the quantile estimates. In terms both of bias for the actual mean square error and of coverage, BRR was better than Woodruff, which was better than FF.

Making use of the well known Bahadur (1966) representation

Q̂_n(p) − Q(p) ≈ [ F(Q(p)) − F̂(Q(p)) ] / f(Q(p)),

DV also base a variance estimator on the variance estimate for the cdf:

v̂ar{ Q̂_n(p) − Q(p) } ≈ v̂ar{ F̂( Q̂(p) ) − F( Q̂(p) ) } / f̂( Q̂(p) )²,

where f̂( Q̂(p) ) is estimated using nonparametric density estimation. The overall behavior of the resulting estimator is somewhere between BRR and Woodruff. Chen and Wu (2002) have further discussion of the Bahadur representation. Wheeless and Shah (1988) give an alternate approach to estimating the density f.

Using Woodruff 1 − α confidence intervals, Rao et al. (1990) suggest variance estimates ( L_α/(2 z_{α/2}) )², where L_α is the length of the interval and z_{α/2} is the upper α/2 point of the standard normal distribution. Variance estimates for ratios, etc. can be built up from these. DV offer a few modifications of this.
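Converted to code, the Rao et al. (1990) length-to-variance rule is a one-liner; the function name and the 95% default are illustrative:

```python
def quantile_var_from_interval(t_lower, t_upper, z=1.96):
    """Variance estimate for a quantile from a Woodruff (1 - alpha)
    confidence interval of length L = t_upper - t_lower:
        v = (L / (2 * z))**2,
    where z is the upper alpha/2 point of the standard normal."""
    L = t_upper - t_lower
    return (L / (2.0 * z)) ** 2
```

Since it requires only the two interval endpoints, the rule applies equally to Woodruff, FF, or replication-based intervals.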


8. Further results and questions

The CD and other estimators we have discussed assume independence of errors. This is not always a safe assumption, particularly in cluster sampling. Mayor (2002) has made a suggestion on estimation of the cdf for cluster sampling, using a Horvitz–Thompson-type estimator, which, however, violates Property 1a. It is not clear how, in general, to do cdf estimation in the presence of correlated errors.

Singh et al. (2001) discuss generic methods for estimating quantiles based on double sampling. It might be expected that most of the methods of this chapter can be extended to the double sampling situation, with the larger sample replacing the population in the various formulas. Ranked set sampling is an approach to double sampling designed to give an even spread of the y among the sampled x. Lam et al. (2002) give an estimator of the cdf appropriate to ranked set sampling, patterned after the kernel method of Kuk (1993). Stefanski and Bay (1996) consider the interesting situation of estimating the cdf when there is measurement error in the variable of interest.

It would be interesting to combine the nonparametric regression estimation of variance structure (Section 3.7.1; Lombardía et al., 2005) with nonparametric estimation of the mean, for a completely parameter-free estimator of the cdf. It would be of interest to compare the resulting estimator to the CD estimator with mean and variance structure chosen by regression diagnostics.

There has been no comprehensive comparison of the many available alternatives for cdf and quantile estimation. Such a study, breaking up estimators into comparison groups based on the auxiliary information used, carried out on the many real and artificial populations found in the cdf literature, and tracking relative efficiencies and computing times, would be very useful.