Mitigating the effect of measurement errors in quantile estimation


Statistics & Probability Letters 77 (2007) 514–524 www.elsevier.com/locate/stapro

E. Schechtman^a, C. Spiegelman^b

^a Department of Industrial Engineering and Management, Ben Gurion University, Beer Sheva 84105, Israel ([email protected])
^b Department of Statistics, Texas A&M University, The Texas Transportation Institute, College Station, TX 77843, USA ([email protected])


Received 8 February 2006; received in revised form 24 July 2006; accepted 7 August 2006. Available online 28 September 2006.

Abstract

Quantiles are frequently used as descriptive measures. When data contain measurement errors, using the contaminated data to estimate the quantiles results in biased estimates. In this paper, we suggest two methods for reducing the effect of measurement errors on quantile estimates and compare them, via an extensive simulation study, to the estimates obtained by the naive method, that is, the estimates obtained from the observed (contaminated) data. The method we recommend is based on a method in a paper by Cook and Stefanski; however, we suggest using a combination of bootstrap and jackknifing to replace their extrapolation step.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Bootstrap; Jackknife; Percentiles

1. Introduction

Quantiles (or percentiles) are among the most widely used descriptions of measurements; see Gilchrist (2000) and Parzen (2003). The difference between the 25% and 75% quantiles is known as the interquartile range (IQR), a well-known measure of dispersion. Colleges use quantiles for admission criteria, state highway departments use them for setting highway speeds (Texas Department of Transportation (TxDOT), 1997; Hamada et al., 2003), and they are widely used throughout the science and engineering community. In practice, it is not uncommon for data to be obtained with measurement errors in addition to other errors such as sampling variability. Usually, the quantiles are constructed directly from the data using the empirical distribution function, or a parametric model is fitted and quantiles are taken from the fitted model. In either case, measurement errors are ignored, and their effect is to obscure the true quantiles. This paper gives two methods for reducing the effect of measurement errors on quantiles and compares them to the naive method, which bases the estimates on the observed data. Both new methods use the idea in Cook and Stefanski (1994) and Stefanski and Cook (1995) of adding (additional) measurement error to the data and then extrapolating to the zero-error solution. The more effective method for doing the extrapolation uses bootstrap samples and a jackknife approach rather than the extrapolation approach found in Cook and Stefanski (1994).

While Cook and Stefanski do not implement a bootstrap approach, they motivate their SIMEX method by discussing its similarities to the jackknife. We use a special implementation of a jackknife approach that they discuss. In addition, for the simulation step, we use bootstrapping from residuals rather than normal distributions for the error, as does SIMEX. The structure of the paper is as follows: in Section 2 we introduce the notation needed and describe the methods. Section 3 is devoted to the simulation study, and Section 4 details the results. A real example is analyzed in Section 5, and Section 6 concludes the paper.

2. Notation and methods

Let X be a random variable with cumulative distribution function $F_X(x)$. The objective is to estimate the quantiles of $F_X(x)$. That is, let the pth quantile be $q_p = F_X^{-1}(p)$, assuming that $F_X(x)$ is strictly increasing. We are interested in estimating $q_p$ for various choices of p. As stated in the introduction, in many cases X is unobservable; instead, we assume that the observed data, Y, follow the model

$$Y = X + \sigma\varepsilon, \qquad (2.1)$$

where $\varepsilon$ is a random variable, independent of X, with mean zero and variance one, and $\sigma^2$ is the known measurement error variance (see the comment below). In this study, we introduce four estimators of $q_p$: the estimator based on the unobserved X sample, denoted by $\hat{q}_{p,\text{unobserved}}$; the naive estimator, based on the observed Y sample, denoted by $\hat{q}_{p,\text{naive}}$; the estimator based on the Cook and Stefanski (1994) method, denoted by $\hat{q}_{p,\text{SIMEX}}$; and the bootstrap estimator, $\hat{q}_{p,\text{BOOT}}$. The first estimator, $\hat{q}_{p,\text{unobserved}}$, is only theoretical and typically cannot be obtained in practice, since it is based on the unobserved, error-free data. However, it will serve as the benchmark against which the other estimators are compared. It is defined as the pth quantile (in our study we use the MATLAB function prctile for such calculations) of the unobserved sample X, and is given by

$$\hat{q}_{p,\text{unobserved}} = F^{-1}_{n,X}(p), \qquad (2.2)$$

the pth quantile of the empirical distribution function of the unobserved X's. The second estimator, $\hat{q}_{p,\text{naive}}$, is based on the observed data, i.e., the data containing measurement errors. It is called "naive" since it treats the observed values as the "true" values. It is defined as the pth quantile of the observed sample Y, easily obtained by sorting the observed sample, and is given by

$$\hat{q}_{p,\text{naive}} = F^{-1}_{n,Y}(p). \qquad (2.3)$$
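To make the bias of the naive estimator concrete, the following short sketch (ours, for illustration; the code the paper refers to is MATLAB, but we use Python here) simulates model (2.1) with normal X and normal error and compares the empirical quantiles of X and Y. The names `n`, `sigma`, and `p` are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 100_000, 0.5          # large n so sampling noise is negligible
p = [5, 15, 50, 85, 95]          # percentile levels used in the paper

x = rng.standard_normal(n)               # "true", unobservable values (sigma_X = 1)
y = x + sigma * rng.standard_normal(n)   # observed values, model (2.1)

# Naive quantiles are stretched away from the median: for normal X and
# normal error, the bias at level p is (sqrt(1 + sigma**2) - 1) * z_p,
# where z_p is the standard normal quantile (cf. Section 3).
print("true :", np.percentile(x, p).round(3))
print("naive:", np.percentile(y, p).round(3))
```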

The third estimator, $\hat{q}_{p,\text{SIMEX}}$, is based on the method introduced by Cook and Stefanski (1994). Their method is a two-step procedure that combines a parametric bootstrap with a method-of-moments estimator. The first step is the simulation part ("SIM") and the second is an extrapolation part ("EX"), which together give the name SIMEX. The basic idea of this method is to first add additional measurement errors to the data, then model the bias as a function of the variance of the added measurement error, and finally use the model to extrapolate to the no-error case. The estimator obtained by this method is approximately consistent; that is (following Cook and Stefanski, 1994), it converges in probability to a constant that is approximately equal to the parameter being estimated. The (hopefully small) bias in the estimate is due to inaccuracies in the extrapolating model. The procedure can be described as follows.

For the simulation step:

(S.1) Generate independent random errors $\varepsilon^{(b)}$ (called pseudo-errors) from one of the error distributions and define $Y_b(\lambda_j) = Y + \sqrt{\lambda_j}\,\sigma\,\varepsilon^{(b)}$, $b = 1,\ldots,B$; $j = 1,\ldots,k$; $\lambda_j \ge 0$, where B is the number of Monte Carlo samples. Find the quantiles, denoted by $\hat{q}_{p,Y_b(\lambda_j)}$.

(S.2) Fit a second-degree polynomial regression of each quantile on $\lambda$.

Repeat steps (S.1) and (S.2) B times, obtaining B regressions for each quantile. For the extrapolation step:

(S.3) Evaluate the B regression equations obtained in (S.2) at $\lambda = -1$, obtaining B estimates, one per Monte Carlo sample.

(S.4) Average the B extrapolated values over the B Monte Carlo samples to get $\hat{q}_{p,\text{SIMEX}}$.

It is worthwhile to note that if $\sigma^2 = 0$, i.e., there is no measurement error, the naive estimate coincides with the unobserved one. Also, the measurement error variance of $Y_b(\lambda_j)$ is $\sigma^2(1+\lambda_j)$, hence the idea to extrapolate to $\lambda = -1$. A code sketch of the full procedure follows.
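The following is a minimal Python sketch of steps (S.1)–(S.4), ours rather than the authors' MATLAB implementation. It assumes normal pseudo-errors and the grid $\lambda_j = 0, 0.1, \ldots, 1.1$ used in Section 3; the function name `simex_quantile` and the default B are our choices, with B kept small here for speed (the paper uses B = 20,000).

```python
import numpy as np

def simex_quantile(y, sigma, p, B=200, lambdas=np.arange(0.0, 1.2, 0.1), rng=None):
    """SIMEX estimate of the p-th percentile (p in [0, 100]) of the
    error-free variable X, given y = x + sigma*eps with known sigma.
    A sketch of steps (S.1)-(S.4), assuming normal pseudo-errors."""
    rng = rng or np.random.default_rng()
    extrapolated = np.empty(B)
    for b in range(B):
        # (S.1) add extra error with variance lambda_j * sigma**2 and
        # record the percentile of each perturbed sample.
        qs = [np.percentile(y + np.sqrt(lam) * sigma * rng.standard_normal(len(y)), p)
              for lam in lambdas]
        # (S.2) fit a second-degree polynomial of the percentile on lambda.
        coef = np.polyfit(lambdas, qs, deg=2)
        # (S.3) extrapolate to lambda = -1, where the total error variance
        # sigma**2 * (1 + lambda) would be zero.
        extrapolated[b] = np.polyval(coef, -1.0)
    # (S.4) average the B extrapolated values.
    return extrapolated.mean()
```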


The fourth estimator is $\hat{q}_{p,\text{BOOT}}$. It uses a combination of three well-understood ideas. The first is using the efficient bootstrap to obtain bootstrap samples (Davison et al., 1986). The second is adding error to the observations (perturbing the observations), and the last is extrapolation using the ideas behind the jackknife method of bias correction, culminating in the application of formula (2.5) below. The motivations for the bootstrap estimator are similar to those used by Cook and Stefanski (1994). The advantage of using a bootstrap is that we are less tied to a parametric model for the distribution in the simulation step. Our estimator is also simple and is based upon the same ideas behind jackknife estimators; the connection of jackknife estimators to SIMEX is explained in Stefanski and Cook (1995). To understand the motivation more fully, let

$$Y_i(\sigma) = X_i + \sigma\varepsilon_i, \qquad i = 1,\ldots,n, \qquad (2.4)$$

be the observed sample. We assume that $\sigma$, the measurement error standard deviation, is known. (In practice, one has to estimate it from replications in the data.) In this approach, we first construct pseudo-observations using a standard bootstrap approach, by resampling with replacement from $Y_1(\sigma), Y_2(\sigma), \ldots, Y_n(\sigma)$. We denote the T bootstrap samples by $(Y^{*}_{1t}(\sigma), \ldots, Y^{*}_{nt}(\sigma))$, $t = 1,\ldots,T$. We then add additional random error, $\varepsilon^{+}$, having a normal distribution, to form new observations

$$Y^{+}_{it}(\sigma + \sigma_{\text{BOOT}}) = Y^{*}_{it}(\sigma) + \sigma_{\text{BOOT}}\,\varepsilon^{+}_{i}.$$

Here $\sigma$ is the standard deviation of the original measurement errors $\varepsilon$ and $\sigma_{\text{BOOT}}$ is the standard deviation of our added simulation error $\varepsilon^{+}$. We assume that the bias in estimating the pth quantile is proportional to $\sigma^2$. Let $\hat{Q}^{+}_{p}(\sigma + \sigma_{\text{BOOT}})$ be the average of the estimates obtained from the bootstrap samples $\{Y^{+}_{it}(\sigma + \sigma_{\text{BOOT}})\}$, $t = 1,\ldots,T$, and let $\hat{Q}_{p}(\sigma)$ be the corresponding average without the added error. Using the typical jackknife approach, our estimate of the pth quantile is

$$\hat{q}_{p,\text{BOOT}} = \frac{\sigma^2(\sigma^2 + \sigma^2_{\text{BOOT}})}{\sigma^2_{\text{BOOT}}}\left[\frac{\hat{Q}_{p}(\sigma)}{\sigma^2} - \frac{\hat{Q}^{+}_{p}(\sigma + \sigma_{\text{BOOT}})}{\sigma^2 + \sigma^2_{\text{BOOT}}}\right]. \qquad (2.5)$$

The idea behind the jackknife approach is to use observations with and without added error and then model the bias as a function of the measurement error variance. We now give the detailed steps of the algorithm used to calculate $\hat{q}_{p,\text{BOOT}}$; a code sketch follows the list.

1. Given the sample $Y_1(\sigma), Y_2(\sigma), \ldots, Y_n(\sigma)$, we create a string of length nT by concatenating T copies of the original sample. We then apply a random permutation to the new string and read off the T bootstrap samples $(Y^{*}_{1t}(\sigma), \ldots, Y^{*}_{nt}(\sigma))$, $t = 1,\ldots,T$, as successive blocks in the permuted string. This method is called the efficient bootstrap (Davison et al., 1986). It is based on the idea of balancing the simulated samples: each datum is guaranteed to occur equally often in the aggregate of all T bootstrap samples.

2. An additional normal error, denoted by $\varepsilon^{+}_{i}$ and multiplied by the bootstrap-added standard deviation $\sigma_{\text{BOOT}}$, is added to the bootstrap samples. The T bootstrap samples (after adding the additional error) are denoted by $Y^{+}_{1t}(\sigma + \sigma_{\text{BOOT}}), \ldots, Y^{+}_{nt}(\sigma + \sigma_{\text{BOOT}})$, $t = 1,\ldots,T$.

3. Percentiles are computed from the bootstrap samples, with and without added error, and averaged to give $\hat{Q}^{+}_{p}(\sigma + \sigma_{\text{BOOT}})$ and $\hat{Q}_{p}(\sigma)$, respectively.

4. Finally, the bootstrap estimate $\hat{q}_{p,\text{BOOT}}$ is obtained by applying formula (2.5).
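A minimal Python sketch of steps 1–4 and formula (2.5), under the assumptions stated above; the function name `boot_quantile` and its defaults are ours, not the authors'.

```python
import numpy as np

def boot_quantile(y, sigma, p, sigma_boot=None, T=500, rng=None):
    """Bootstrap/jackknife estimate of the p-th percentile of X
    (a sketch of steps 1-4 and formula (2.5))."""
    rng = rng or np.random.default_rng()
    # Section 3 sets sigma_boot equal to the measurement error sigma.
    sigma_boot = sigma if sigma_boot is None else sigma_boot
    n = len(y)

    # Step 1: "efficient" (balanced) bootstrap -- permute T concatenated
    # copies of the sample and read off T blocks of size n, so each datum
    # occurs exactly T times over all bootstrap samples.
    blocks = rng.permutation(np.tile(y, T)).reshape(T, n)

    # Step 2: perturb each bootstrap sample with extra normal error.
    perturbed = blocks + sigma_boot * rng.standard_normal((T, n))

    # Step 3: average percentile with and without the added error.
    q_plain = np.percentile(blocks, p, axis=1).mean()
    q_plus = np.percentile(perturbed, p, axis=1).mean()

    # Step 4: jackknife-type extrapolation to zero error variance, (2.5).
    s2, b2 = sigma**2, sigma_boot**2
    return (s2 * (s2 + b2) / b2) * (q_plain / s2 - q_plus / (s2 + b2))
```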


3. Simulation study

An extensive simulation study was performed in order to compare the performance of the various methods. The simulation experiment has a factorial structure with four factors: the distribution of the true X's (3 levels), the distribution of the measurement error (3 levels), the sample size (5 levels), and the measurement error variance (3 levels). The steps of the simulation study, for each independent sample, were the following:

1. A random sample (of size n) of "true" X's was drawn from one of three underlying distributions: Normal(0,1), which represents the typical well-behaved model; t(4), which represents a symmetric, heavy-tailed distribution; and lognormal, which represents a skewed distribution. The samples were standardized to have mean zero and variance one.

2. A random sample (of size n) of errors, $\varepsilon$, was drawn from one of the three distributions.

3. Samples (of size n) of naive measurements, Y, were created according to the model $Y = X + \sigma\varepsilon$ for selected values of $\sigma$, detailed below.

4. For each Y sample, the following percentiles were calculated: 5, 15, 25, 35, 45, 50, 55, 65, 75, 85, and 95, according to the naive, SIMEX, and bootstrap methods. Denote these values by $\hat{q}_{p,\text{naive}}$, $\hat{q}_{p,\text{SIMEX}}$, and $\hat{q}_{p,\text{BOOT}}$. In addition, the "true" quantiles were calculated from the "unobserved" X sample; denote these by $\hat{q}_{p,\text{unobserved}}$.

For the SIMEX method, steps (S.1)–(S.4) were taken with B = 20,000, k = 12, and $\lambda_j = 0, 0.1, \ldots, 1.1$. For the bootstrap method we used T = 500 resampled sets. The above procedure was repeated M = 100 times; that is, one hundred independent samples were drawn for each choice of parameters, detailed below. The data were summarized by two measures:

a. A measure of offset, defined as the average difference from the "true", unobserved (sample) quantile. It is given by, for example,

$$\text{offset}(\hat{q}_{p,\text{SIMEX}}) = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{q}_{p,\text{SIMEX},i} - \hat{q}_{p,\text{unobserved},i}\right).$$

The reason that offset is important is that adding measurement error to a random variable of interest typically biases the quantiles of the contaminated data when they are used in a naive way. Typically, the bigger the error variance, the bigger the bias; this is easily seen by considering a normally distributed true X with a normally distributed measurement error. For all quantiles except the median, the bias is $\left(\sqrt{\sigma_X^2 + \sigma^2} - \sigma_X\right)$ times the corresponding standard normal quantile.

b. A measure of dispersion, defined as the average of the squared deviations from the "true", unobserved quantiles. Note that it is NOT the commonly used MSE, as each "true" sample has its own "true" quantile. The mean square term (MST) is given by, for example,

$$\text{MST}(\hat{q}_{p,\text{SIMEX}}) = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{q}_{p,\text{SIMEX},i} - \hat{q}_{p,\text{unobserved},i}\right)^2.$$

Both summary measures are straightforward to compute from the M replications, as sketched below.
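A short sketch of the two summary measures, given the M paired estimates from the replications; the function name is ours.

```python
import numpy as np

def offset_and_mst(q_method, q_unobserved):
    """Summary measures over M replications: offset is the mean difference
    from the per-sample 'true' quantile, MST the mean squared difference.
    MST is not an MSE, since the target varies from sample to sample."""
    d = np.asarray(q_method, dtype=float) - np.asarray(q_unobserved, dtype=float)
    return d.mean(), (d**2).mean()

# The plots in Section 4 report the ratio MST(naive) / MST(method);
# values above one favor the bias-corrected method.
```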

The choice of parameters:

a. Sample sizes: n = 100, 200, 300, 400, and 500.

b. The standard deviation, $\sigma$, of the measurement error: 0.1, 0.3, and 0.5 (note that the "true" X's have variance one, so a $\sigma$ of 0.1 means 10% of the true standard deviation).

c. The standard deviation, $\sigma_{\text{BOOT}}$, of the error added in the bootstrap step: 0.1, 0.3, and 0.5 (set equal to the $\sigma$ of the measurement error). In order to present comparisons among the naive, SIMEX, and bootstrap


estimators, we use the ratio MST(·, naive)/MST(·, method), where the method in the denominator is either SIMEX or the bootstrap. Ratios smaller than one favor the naive method, while ratios bigger than one favor the method used to calculate the term in the denominator.

4. Results

In this section, we describe the overall findings of our simulation experiment. The datasets are complex and are posted on our web site (http://www.stat.tamu.edu/cliff/mitigation) so that readers can obtain typical results for particular combinations of parameters that interest them. In addition, a user-friendly MATLAB program is posted on the web site. The interested user can insert his/her own vector of observations and the estimated standard deviation of the measurement error, and the program will produce estimates obtained by the three methods: naive, SIMEX, and bootstrap.

The most important parameters are the sample size and the ratio of the measurement error variance to the variance of the "true" X. The reasons for this can be quickly understood. The methods given in this paper typically reduce the bias but increase the variance of the quantile estimator. The variance of the estimator typically decreases as the sample size increases, so for very big sample sizes one would expect a procedure that reduces bias to reduce MST. On the other hand, one would expect that in small samples a bias-reducing formula would increase MST. The simulations below show that a minimum sample size of about 100 is needed to have much hope that the procedures presented here are helpful when the measurement error variance is about 9 percent of the variance of the "true" X. Overall, the stated trends with error variance and sample size are apparent in the figures that we now present.

In what follows, we present three sets of plots showing main effects: the first set for different underlying X-distributions, the second for different sample sizes, and the third for different error distributions. The results shown are the relative MSTs: when the ratio is less than one the naive estimator is preferred, and when the ratio is greater than one the naive estimator is not the best.

In Figs. 1a–e, the error variance was set at 25% of the "true" X's variance. These plots show the main effects of the "true" X distribution (that is, all sample sizes and error distributions used in the study are combined to show the effects of the methods as the distribution of the "true" X varies). It can be seen that in over 75% of cases, the bootstrap method has smaller MST than the naive method when the 5th, 15th, 85th, or 95th percent quantiles are being estimated. The naive estimator is almost always better for estimating central quantiles. In Fig. 1c the naive estimator is clearly the best.

In Figs. 2a–e, the error variance was set at 9% of the "true" X's variance. These plots show the main effects of sample size. Again, in the vast majority of cases the bootstrap estimator is superior to both the naive and SIMEX estimators for non-central quantiles, in particular the 5th and 95th percent quantiles. When the sample size is 200 or bigger, the SIMEX estimator is also better than the naive one in most cases when estimating tail quantiles.

In Figs. 3a–e, the error variance is only one percent of the variance of the "true" X. At this error level the naive estimator is competitive with the bootstrap estimator. The bootstrap estimator is a bit better overall for lognormally shaped error and is slightly better more than half of the time for the other error distributions. The naive estimator is considerably better than the SIMEX estimator here.
Finally, we note that both the SIMEX and bootstrap estimators reduce the bias caused by measurement error. The general shape of the bias reduction is much less sensitive to the error standard deviation and sample size than are the MST calculations. Fig. 4 shows a typical plot of bias reduction for error variance at nine percent of the variance of the "true" X. A trend that can be seen from the figures is that in most cases, when estimating quantiles in the tails of a distribution, the bootstrap procedure is superior to both the SIMEX procedure and the naive method. When estimating quantiles around the median, unless the sample size is large or the measurement error variance is huge, the naive method is best. The complete dataset of experimental results is available at http://www.stat.tamu.edu/cliff/mitigation for detailed analysis. The factorial experiment results are available for the 5th percentile to the 95th percentile in steps of 10%, plus the median.

Fig. 1. Relative MST by X-distribution for error variance = 25%.

Fig. 2. Relative MST by sample size for error variance = 9%.

Fig. 3. Relative MST by error distribution for error variance = 1%.

Fig. 4. Plots of median bias by method for error variance = 9%. (Three panels, Naive, SIMEX, and Bootstrap, each showing bias on a scale of -0.05 to 0.05 against quantile levels 0.05 through 0.95.)

5. A real example: traffic speed

The following example is taken from Hamada et al. (2003). The data consist of n = 95 vehicles, each with a true speed X and a measured speed Y, listed here as (X, Y) pairs:

(69, 70) (70, 69) (74, 74) (70, 73) (71, 73) (73, 69) (75, 76) (63, 65) (57, 54) (71, 74)
(47, 48) (71, 73) (71, 76) (70, 70) (66, 66) (65, 65) (62, 61) (71, 70) (70, 73) (69, 70)
(64, 65) (64, 65) (64, 65) (66, 66) (68, 69) (64, 65) (68, 65) (64, 63)
(66, 66) (66, 69) (78, 80) (74, 78) (72, 76) (63, 65) (66, 65) (72, 70)
(58, 58) (71, 73) (60, 63) (61, 61) (65, 69) (69, 70) (68, 69) (68, 70)
(60, 60) (66, 66) (61, 63) (63, 65) (68, 69) (61, 63) (62, 60) (64, 69)
(61, 61) (69, 70) (66, 70) (69, 70) (62, 60) (69, 70) (69, 69) (66, 66)
(73, 74) (63, 65) (64, 66) (65, 65) (63, 63) (63, 63) (57, 58)
(69, 70) (68, 69) (68, 69) (58, 60) (64, 63) (64, 65) (65, 66)
(64, 65) (63, 65) (69, 69) (64, 65) (69, 73) (61, 60) (64, 65)
(52, 51) (67, 69) (64, 65) (60, 58) (60, 61) (75, 78) (48, 48)
(63, 63) (61, 63) (71, 70) (68, 69) (68, 69) (58, 60) (60, 61)

Summary statistics for the data: n = 95; average X = 65.5; standard deviation of X = 5.347. In order to estimate the measurement error variance, we used a one-way ANOVA (the within-sample variation represents the measurement error variability). The value of the MSE is 2.81668, which yields an estimated $\sigma$ of 1.678. Note that $\sigma$ is about 31% of the true-X standard deviation. Using the above data, the following estimates of the 85% quantile were obtained (these numbers can be reproduced as sketched below):

$\hat{q}_{.85,\text{unobserved}} = 71$;
$\hat{q}_{.85,\text{naive}} = 73$, BIAS = 2.000;
$\hat{q}_{.85,\text{SIMEX}} = 73.771$, BIAS = 2.771;
$\hat{q}_{.85,\text{BOOT}} = 71.338$, BIAS = 0.338.
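A sketch of the measurement-error variance estimate, assuming (our reading of the one-way ANOVA here) that each vehicle is treated as a group with two measurements, X and Y, so that the within-group mean square estimates $\sigma^2$. The name `sigma_from_pairs` is ours, and `simex_quantile`/`boot_quantile` refer to the Section 2 sketches.

```python
import numpy as np

def sigma_from_pairs(x, y):
    """One-way ANOVA with each vehicle as a group of two measurements:
    the within-group sum of squares for a pair is (y - x)**2 / 2, on one
    degree of freedom per pair, so MSE = sum(d**2) / (2n)."""
    d = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    mse = (d ** 2).sum() / (2 * len(d))
    return mse, np.sqrt(mse)

# With the 95 (X, Y) pairs tabulated above this reproduces the paper's
# MSE = 2.81668 and sigma_hat = 1.678.  The 85th percentiles can then be
# compared (Monte Carlo results will vary slightly from run to run):
#   np.percentile(x, 85)            -> 71      (unobserved)
#   np.percentile(y, 85)            -> 73      (naive)
#   simex_quantile(y, 1.678, 85)    -> ~73.77  (SIMEX, as reported)
#   boot_quantile(y, 1.678, 85)     -> ~71.34  (bootstrap, as reported)
```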

Comment: A referee has pointed out that the improvement of our implementation of the jackknife approach over the SIMEX approach could be due to the jackknife extrapolation, to the use of bootstrap residuals in our simulation step, or to a combination of the two. We do not try to isolate the effects of these two factors. Our intuition is that a two-point statistical design is the efficient design for estimating a straight line, and that the bias correction should be done in a region where the response is well modeled by a straight line. In this case, modeling approaches that spread points out in linear regions lose efficiency. However, in the case of nonlinearities, spreading points out helps to detect them, and if the nonlinearities were large for small additions of error, it would be important to include higher-order terms in the jackknife approach. We also feel that convoluting normal errors with naturally occurring nonnormal errors may cause nonlinear bias. The two variants on the SIMEX approach that we implemented have not been evaluated separately in this study; rather, they have been evaluated as a combined package. We leave it to others to evaluate their individual effects.

6. Summary and conclusions

This paper gives two methods for mitigating the effects of measurement error in the estimation of quantiles. These methods are shown to lead to improved quantile estimation in the distribution tails but are typically ineffective when the quantile of interest is in the center of the distribution. Several situations where tail quantiles are important are cited, and a transportation example is given. It is clear that as the measurement error gets bigger or the sample size gets bigger, the new procedures given in this paper give more improvement, and the more extreme the quantile of interest, the more the new procedures help.


The bootstrap approach to bias reduction was the better of the two new methods. MATLAB code implementing both the SIMEX and bootstrap procedures, along with the simulation experiment results in Excel spreadsheets, is available at http://www.stat.tamu.edu/cliff/mitigation.

References

Cook, J.R., Stefanski, L.A., 1994. Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 89, 1314–1328.

Davison, A.C., Hinkley, D.V., Schechtman, E., 1986. The efficient bootstrap. Biometrika 73 (3), 555–566.

Gilchrist, W., 2000. Statistical Modeling with Quantile Functions. Chapman and Hall/CRC Press, Boca Raton.

Hamada, M., Pohl, A., Spiegelman, C., Wendelberger, J., 2003. A Bayesian approach to calibration intervals and properly calibrated tolerance intervals. J. Quality Technol. 35, 194–205.

Parzen, E., 2003. Quantile/quantile plots, conditional quantiles, comparison distributions. In: Bodt, B.A., Wegman, E.J. (Eds.), Proceedings of the Sixth US Army Conference on Applied Statistics.

Stefanski, L.A., Cook, J., 1995. Simulation extrapolation: the measurement error jackknife. J. Amer. Statist. Assoc. 90, 1247–1256.

Texas Department of Transportation (TxDOT), 1997. Procedures for Establishing Speed Zones.