Journal of Clinical Epidemiology 66 (2013) 319-329
Avoid lost discoveries, because of violations of standard assumptions, by using modern robust statistical methods

Rand Wilcox a,*, Mike Carlson b, Stan Azen c, Florence Clark b

a Department of Psychology, University of Southern California, 618 Seeley Mudd Building, University Park Campus, Los Angeles, CA 90089-1061, USA
b Division of Occupational Science and Occupational Therapy, University of Southern California, 1540 Alcazar Street, CHP 133, Los Angeles, CA 90089-9003, USA
c Department of Preventive Medicine, University of Southern California, 1441 Eastlake Ave., USC/Norris Comprehensive Cancer Center, NOR-4435, Los Angeles, CA 90089-9175, USA

Accepted 2 September 2012; Published online 26 November 2012
Abstract

Objectives: Recently, there have been major advances in statistical techniques for assessing central tendency and measures of association. The practical utility of modern methods has been documented extensively in the statistics literature, but they remain underused and relatively unknown in clinical trials. Our objective was to address this issue.

Study Design and Purpose: The first purpose was to review common problems associated with standard methodologies (low power, lack of control over type I errors, and incorrect assessments of the strength of the association). The second purpose was to summarize some modern methods that can be used to circumvent such problems. The third purpose was to illustrate the practical utility of modern robust methods using data from the Well Elderly 2 randomized controlled trial.

Results: In multiple instances, robust methods uncovered differences among groups and associations among variables that were not detected by classic techniques. In particular, the results demonstrated that details of the nature and strength of the association were sometimes overlooked when using ordinary least squares regression and Pearson's correlation.

Conclusion: Modern robust methods can make a practical difference in detecting and describing differences between groups and associations between variables. Such procedures should be applied more frequently when analyzing trial-based data. © 2013 Elsevier Inc. All rights reserved.

Keywords: Robust methods; Violation of assumptions; Outliers; Heteroscedasticity; Occupational therapy; Well Elderly study
1. Introduction
1.1. Problems with standard methodology
During the last quarter of a century, many new and improved methods for comparing groups and studying associations have been derived that have the potential of documenting important statistical relationships that are likely to be missed when using standard techniques [1-7]. Although awareness of these modern procedures is growing, it is evident that their practical utility remains relatively unknown and that they remain underused. The goal of this article was to briefly review why the new methods are important and to illustrate their value using data from a recently conducted randomized controlled trial (RCT).
Appreciating the importance of recently developed robust statistical procedures requires an understanding of modern insights into when and why traditional methods based on means can be highly unsatisfactory. In the case of detecting differences among groups, there are three fundamental problematic issues that have important implications in controlled trials. The first goal of this article was to briefly review these issues. After this overview, we will then examine the viability of various robust procedures as a means of solving the problems associated with standard procedures.
This work was supported by the National Institute on Aging (R01 AG021108). There are no conflicts of interest.
* Corresponding author. Tel.: +213-740-2258; fax: +213-746-9082. E-mail address: [email protected] (R. Wilcox).

0895-4356/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jclinepi.2012.09.003
2. Problem 1: Heavy-tailed distributions

The first insight has to do with the realization that small departures from normality toward a more heavy-tailed distribution can have an undesirable impact on the estimation of the population variance and can greatly affect power [1,7]. Moreover, in a seminal article, Tukey [8] argued that
What is new?

Key findings
During the last quarter of a century there have been many advances and insights regarding the robustness of standard methods for comparing groups and studying associations. But clinical trials rarely take advantage of these advances.

What this adds to what was known?
Practical concerns about standard methods plus the advantages of modern methods are reviewed and illustrated.

What is the implication, what should change?
Modern statistical techniques, aimed at dealing with violations of standard assumptions, have considerable practical value in terms of both power and gaining a deeper understanding of data.
heavier than normal tails are to be expected. Random samples from such distributions are prone to having outliers, and experience with modern outlier detection methods supports Tukey's claim. This is not to suggest that heavy-tailed distributions always occur but to document that it is ill advised to ignore problems that can result from heavy-tailed distributions.

The classic illustration of the practical consequences of heavy-tailed distributions is based on a mixed (contaminated) normal distribution, in which an observation is sampled from a standard normal distribution with a probability of 0.9; otherwise, an observation is sampled from a normal distribution with a mean of 0 and an SD of 10. Despite the apparent similarity between the two distributions, as shown in Fig. 1, the variance of the mixed normal is 10.9, which greatly reduces power. As an illustration, consider two normal distributions with variance 1 and means 0 and 1, which are shown in the left panel of Fig. 2. When testing at the 0.05 level with a sample size of 25 per group, Student's t-test has power = 0.9. The right panel shows the mixed normal and another mixed normal distribution that has been shifted to have a mean of 1. Despite the similarity with the left panel, Student's t-test now has power = 0.28. The power is relatively low in the right panel because the variances are relatively large. This illustrates that arbitrarily small changes in a distribution can result in relatively poor power when using the mean.

For two distributions with means μ1 and μ2 and common variance σ², a well-known measure of effect size, popularized by Cohen [9], is given by:

δ = (μ1 − μ2) / σ.
The usual estimate of δ, d, is obtained by estimating the means and the assumed common variance in the standard way. Although the importance of a given value for δ depends on the situation, Cohen suggested that, as a general guide, a large effect size is one that is visible to the naked eye. Based on this criterion, he concluded that for normal distributions, δ = 0.2, 0.5, and 0.8 correspond to small, medium, and large effect sizes, respectively. In the left panel of Fig. 2, δ = 1.0, which from Cohen's perspective would be judged to be quite large. In the right panel, δ = 0.28, which would be judged a small effect size from a numerical point of view. But from a graphical point of view, the difference between the means in the right panel corresponds to what Cohen describes as a large effect size. This illustrates that δ can be highly sensitive to small changes in the tails of a given distribution and that it has the potential of characterizing an effect size as small when in fact it is large. The upshot of this result is that in the context of analyzing RCTs, the use of Cohen's d can often lead to a spuriously low estimate of effect magnitude.

Similar anomalies can result when using Pearson's correlation. The left panel of Fig. 3 shows a bivariate normal distribution with correlation ρ = 0.8, and the middle panel shows a bivariate normal distribution with correlation ρ = 0.2. However, in the right panel, although ρ = 0.2, the bivariate distribution appears to be similar to that of a normal distribution with ρ = 0.8. One of the marginal distributions in the third panel in Fig. 3 is normal, but the other is a mixed normal, which illustrates that even a small change in one of the marginal distributions can have a large impact on ρ.
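The contaminated normal is easy to simulate. The following Python sketch (our illustration, not part of the article's analyses) draws from N(0, 1) with probability 0.9 and from N(0, 10) otherwise, showing both the inflated variance and the correspondingly deflated Cohen's d for a 1-unit shift:

```python
import random
import statistics

random.seed(1)

def mixed_normal():
    # Contaminated normal: N(0, 1) with probability 0.9, else N(0, 10).
    if random.random() < 0.9:
        return random.gauss(0, 1)
    return random.gauss(0, 10)

sample = [mixed_normal() for _ in range(100_000)]

# Theoretical variance of the mixture: 0.9 * 1 + 0.1 * 100 = 10.9,
# even though 90% of the observations come from a standard normal.
print(round(statistics.variance(sample), 1))

# A 1-unit mean shift, large to the naked eye, yields a "small" d:
d = 1 / statistics.stdev(sample)
print(round(d, 2))  # near 1 / sqrt(10.9), i.e., about 0.3
```

With 100,000 draws, the sample variance lands close to 10.9 rather than 1, which is exactly the mechanism that erodes the power of mean-based methods.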
3. Problem 2: Assuming normality via the central limit theorem

Many introductory statistics books claim that with a sample size of 30 or more, when testing H0: μ = μ0, μ0 given, normality can be assumed [10-12]. Although this claim is not based on wild speculation, two things were missed. First, when studying the sampling distribution of the mean, X̄, early efforts focused on relatively light-tailed distributions for which outliers are relatively rare. However, in moving from skewed light-tailed distributions to skewed heavy-tailed distributions, larger than anticipated sample sizes are needed for the sampling distribution of X̄ to be approximately normal. Second, and seemingly more serious, is the implicit assumption that if the distribution of X̄ is approximately normal, then

T = √n (X̄ − μ) / s

will have, approximately, a Student's t distribution with n − 1 df, where n is the sample size and s is the standard deviation. To illustrate that this is not necessarily the case, imagine that 25 observations are randomly sampled from
Fig. 1. Despite the obvious similarity between the standard normal and contaminated normal distributions, the variance of the standard normal is 1 and that of the contaminated normal is 10.9.
a lognormal distribution. In such a case, the sampling distribution of the mean is approximately normal, but the distribution of T is poorly approximated by a Student's t distribution with 24 df. Fig. 4 shows an estimate of the actual distribution of T based on a simulation with 10,000 replications. If T has a Student's t distribution, then P(T ≤ −2.086) = 0.025. When sampling from a lognormal distribution, the actual probability is approximately 0.12. Moreover, P(T ≥ 2.086) is approximately 0.001 and E(T) = −0.54. As a result, Student's t is severely biased. If the goal is to have an actual type I error probability between 0.025 and 0.075 when testing at the 0.05 level, a sample size of 200 or more is required. For skewed distributions having heavier tails, roughly meaning that the expected proportion of outliers is higher, a sample size in excess of 300 can be required.

A possible criticism of the illustrations just given is that they are based on hypothetical distributions and that in actual studies such concerns may be less severe. But at least in some situations, problems with achieving accurate probability coverage in actual studies have been found to be worse [1]. In particular, based on bootstrap samples, the true distribution of T can differ from an assumed Student's t distribution even more than indicated here [1].
Fig. 2. Left panel, d = 1 and power = 0.96. Right panel, d = 0.3 and power = 0.28.
Fig. 3. Illustration of the sensitivity of Pearson correlation to contaminated (heavy-tailed) distributions.
Turning to the two-sample case, it should be noted that early simulation studies dealing with nonnormality focused on identical distributions. If the sample sizes are equal, the sampling distribution of the difference between the sample means will be symmetric even when the individual distributions are asymmetric but have the same amount of skewness. In the one-sample case, when sampling from a symmetric heavy-tailed distribution, the actual probability level of Student's t is generally less than the nominal level, which helps to explain why early simulation studies regarding the two-sample case concluded that Student's t is robust in terms of controlling the probability of a type I error. But more recent studies have indicated that when the two distributions being compared differ in skewness, the actual type I error probability can exceed the nominal level by a substantial amount [1,13].
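The one-sample lognormal illustration in Section 3 is straightforward to reproduce. The following Python sketch (our own code, not the software behind the article's simulations) approximates the tail probabilities of T for n = 25:

```python
import math
import random
import statistics

random.seed(2)

def t_statistic(n=25):
    # Draw n observations from a standard lognormal, whose true mean
    # is exp(0.5), and form the usual one-sample T statistic.
    xs = [math.exp(random.gauss(0, 1)) for _ in range(n)]
    return math.sqrt(n) * (statistics.mean(xs) - math.exp(0.5)) / statistics.stdev(xs)

ts = [t_statistic() for _ in range(10_000)]
lower_tail = sum(t <= -2.086 for t in ts) / len(ts)
upper_tail = sum(t >= 2.086 for t in ts) / len(ts)

# Under a Student's t distribution each tail probability would be about
# 0.025; instead the lower tail is grossly inflated and the upper tail
# nearly vanishes, reflecting the skewness of T.
print(lower_tail, upper_tail)
```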
4. Problem 3: Heteroscedasticity
The third fundamental insight is that violating the usual homoscedasticity assumption (i.e., the assumption that all groups have a common variance) is much more serious than once thought. Both relatively poor power and inaccurate confidence intervals can result. Cressie and Whitford [14] established general conditions under which Student's t is not even asymptotically correct under random sampling. When comparing the means of two independent groups, some software packages now default to the heteroscedastic method derived by Welch [15], which is asymptotically correct. Welch's method can produce more accurate confidence intervals than Student's t, but serious concerns persist in terms of both type I errors and power. Indeed, all methods based on means can have relatively low power
Fig. 4. The distribution of Student's t, n = 25, when sampling from a (standard) lognormal distribution. The dashed line is the distribution under normality.
under arbitrarily small departures from normality [1]. Also, when comparing more than two groups, commonly used software uses the homoscedastic analysis of variance F test, and typically, no heteroscedastic option is provided. This is a concern because with more than two groups, problems are exacerbated in terms of both type I errors and power [1]. Similar concerns arise when dealing with regression.

4.1. Robust solutions

As suggested previously, the application of standard methodology for comparing means can seriously bias the conclusions drawn from clinical trial data. Fortunately, a wide array of procedures has been developed that produce more accurate results when the assumptions that underlie standard procedures are violated. We describe some of these methods in the following.

5. Alternate measures of location

One way of dealing with outliers is to replace the mean with the median. Two centuries ago, Laplace was aware that for heavy-tailed distributions, the usual sample median can have a smaller standard error than the mean [16]. But a criticism is that under normality, the median is rather unsatisfactory in terms of power or achieving a relatively short confidence interval. Note that the median belongs to the class of trimmed means. By definition, a trimmed mean is the average after a prespecified proportion of the largest and smallest values is deleted. A crude description of why methods based on the median can have unsatisfactory power or poor precision is that they trim all but one or two values. A way of dealing with this problem is to trim less, and Rosenberger and Gasko [17] conclude that a 20% trimmed mean is a good choice for general use. More precisely, its standard error compares well with that of the mean under normality, but unlike the mean, it provides a reasonable degree of protection against the deleterious effects of outliers.
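As a minimal sketch (our own, not the WRS implementation, which also supplies standard errors and hypothesis tests), a 20% trimmed mean simply averages what remains after deleting the smallest and largest 20% of the sorted values:

```python
def trimmed_mean(values, trim=0.2):
    """Average after deleting the trim proportion from each tail."""
    ordered = sorted(values)
    g = int(trim * len(ordered))  # observations cut from each tail
    kept = ordered[g:len(ordered) - g]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one gross outlier
print(trimmed_mean(data))     # 5.5: barely affected by the outlier
print(sum(data) / len(data))  # 14.5: the ordinary mean is dragged upward
```

Setting trim = 0 recovers the ordinary mean, and larger trim values move the estimator toward the median.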
Heteroscedastic methods for comparing trimmed means, when dealing with a range of commonly used designs, have been derived [1]. (The population trimmed mean is the mean of a distribution when the tails of the distribution are ignored. For example, the 20% trimmed mean ignores the values less than the 0.2 quantile and greater than the 0.8 quantile. Ignoring the tails is desirable from a robustness point of view.)

Another general approach when dealing with outliers is to use a measure of location that first empirically identifies values that are extreme, after which these extreme values are downweighted or deleted. Robust M-estimators are based in part on this strategy [1,2,5,7]. Note that the mean is a least squares estimator. M-estimators replace squared error with a function aimed at yielding an estimator that performs well under normality but simultaneously deals with the deleterious effects of outliers. Methods for testing
hypotheses based on robust M-estimators and trimmed means have been studied extensively. The strategy of applying standard hypothesis testing methods after outliers are removed is technically unsound and can yield highly inaccurate results because an invalid estimate of the standard error is used. The reason is that even under random sampling from a normal distribution, when extreme values are removed, the remaining observations are correlated. Consequently, the usual derivation of the standard error of the mean is incorrect. Theoretically sound methods have been derived [1-7]. In both theoretical and simulation studies, when testing hypotheses about skewed distributions, it seems that a 20% trimmed mean has an advantage over a robust M-estimator [1], although no single method dominates.

The Hodges-Lehmann estimator is perhaps better known than M-estimators and trimmed means; so, for completeness, some comments about it should be made. Huber and Ronchetti [5] describe concerns about the robustness of this estimator, and Wilcox [1] summarizes other concerns suggesting that trimmed means and M-estimators are preferable.

Yet another approach is to use rank-based methods when comparing groups. Generally, these techniques answer different questions compared with methods aimed at comparing measures of location, particularly when dealing with two-way and higher designs. That is, they provide a different and useful perspective. It should be stressed, however, that important improvements on classic rank-based methods have been derived in recent years [1]. For example, the Wilcoxon-Mann-Whitney test uses a correct expression for the standard error only when comparing identical distributions. When distributions differ, an incorrect estimate of the standard error is used. This issue can be addressed with the Cliff and Brunner-Munzel methods described by Wilcox [1]. In particular, these newer methods provide improved confidence intervals for the probability that a randomly sampled observation from the first group is less than one from the second, what is sometimes called a probabilistic measure of effect size. In terms of power, all indications are that methods based on robust measures of location typically, but certainly not always, have an advantage. However, when trying to gain a good understanding of how groups differ and by how much, modern rank-based methods can be very useful.

6. Transformations

A simple way of dealing with skewness is to transform the data. Well-known possibilities are taking logarithms and using a Box-Cox transformation [18,19]. One serious limitation is that simple transformations do not deal effectively with outliers, leading Doksum and Wong [20] to conclude that some amount of trimming remains beneficial even when transformations are used. Another concern is that after data are transformed, the resulting distribution can remain highly skewed. However, both theory and
simulations indicate that trimmed means reduce the practical problems associated with skewness, particularly when used in conjunction with certain bootstrap methods [1,13].

7. Nonparametric regression

There are well-known parametric methods for dealing with regression lines that exhibit curvature. But parametric models do not always provide a sufficiently flexible approach, particularly when dealing with more than one predictor. There is a vast literature on nonparametric regression estimators, sometimes called smoothers, for addressing this problem. Robust versions have been developed (Wilcox [1], section 11.5). No single estimator is always best, but two that seem to perform relatively well are Cleveland's LOESS method and a so-called running interval smoother. Splines (roughly, piecewise polynomials) have received considerable attention, but in terms of mean squared error and bias, certain variations appear to be less satisfactory than the other smoothers that have been proposed [1].

To provide a crude indication of the strategy used by smoothers, imagine that in a regression situation, the goal is to estimate the mean of Y, given that X = 6, based on n pairs of observations. The strategy is to focus on the observed X values close to 6 and use the corresponding Y values to estimate the mean of Y. Typically, smoothers give more weight to Y values for which the corresponding X values are close to 6. For pairs of points for which the X value is far from 6, the corresponding Y values are ignored. The general problem, then, is determining which values of X are close to 6 and how much weight the corresponding Y values should be given.
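The strategy just described can be sketched in a few lines of Python (our simplified version; Wilcox's running interval smoother instead uses a data-driven span and a 20% trimmed mean rather than the plain average used here):

```python
def running_interval_smoother(x, y, x0, h=1.0):
    """Estimate the typical Y at x0 from the Y values whose X values
    fall within h of x0 (points farther away are ignored)."""
    near = [yi for xi, yi in zip(x, y) if abs(xi - x0) <= h]
    if not near:
        return None  # no observations close enough to x0
    return sum(near) / len(near)

x = list(range(9))
y = [xi ** 2 for xi in x]  # a curved association, y = x**2
print(running_interval_smoother(x, y, x0=4.0))  # mean of y at x in {3, 4, 5}
```

Evaluating the smoother over a grid of x0 values traces out an estimate of the regression curve without assuming any parametric form.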
8. Robust measures of association and effect size

There are many robust measures of association beyond Spearman's rho and Kendall's tau [1], but the relative merits of these methods are difficult to explain quickly. Here, we merely outline two general approaches to robustly measuring the strength of the association between two variables. The first is to use some analog of Pearson's correlation that removes or downweights outliers. The other is to fit a regression line and measure the strength of the association based on this fit. One approach based on this latter strategy uses what is called explanatory power (e.g., Wilcox [1], section 11.9). Let Y be some outcome variable of interest, X be some predictor variable, and Ŷ be the predicted value of Y based on some fit, which might be based on either a robust linear model or a nonparametric regression estimator that provides a flexible approach to curvature. Then, explanatory power is given by:

ρe² = σ²(Ŷ) / σ²(Y),

where σ²(Y) is the variance of Y. When Ŷ is obtained via ordinary least squares, ρe² is equal to ρ², where ρ is the
Pearson correlation. Explanatory power reflects the proportion of variation explained by the regression model that has been fitted to the data. A robust version of explanatory power is obtained simply by replacing the usual variance with some robust measure of variation. Here, the biweight midvariance (Wilcox [1], p. 90) is used with the understanding that several other choices are reasonable. Indeed, there are many choices, including the median absolute deviation estimator and Gini's mean difference. However, in terms of achieving a relatively small standard error while simultaneously dealing effectively with outliers, the biweight midvariance performs relatively well and is readily applied with the extant software described later in this article. Roughly, the biweight midvariance empirically determines whether any values among the marginal distributions are unusually large or small. If such values are found, they are downweighted. It should be noted, however, that the choice of smoother matters when estimating explanatory power, with various smoothers performing rather poorly in terms of mean squared error and bias, at least with small-to-moderate sample sizes [21]. One that performs relatively well in simulations is the robust version of Cleveland's (LOESS) smoother [22].

When comparing measures of location corresponding to two independent groups, Algina et al. [23] suggest a robust version of Cohen's measure of effect size that replaces the means and assumed common variance with a 20% trimmed mean and 20% Winsorized variance. They also rescale their measure of effect size so that under normality, it is equal to δ. The resulting estimate is denoted by dt. Here, however, we focus on a robust measure of effect size stemming from the notion of explanatory power that takes into account heteroscedasticity [24].
Briefly, it reflects the variation of the measures of location divided by some robust measure of variation of the pooled data, which has been rescaled so that it estimates the population variance under normality. We label this version of explanatory power ξ², and ξ is called a heteroscedastic measure of effect size. More broadly, the square root of explanatory power represents a robust generalization of Pearson's correlation. It can be shown that for normal distributions with a common variance, δ = 0.2, 0.5, and 0.8 correspond to ξ = 0.15, 0.35, and 0.5, respectively. But perhaps it should be stressed that ξ is more a robust measure of association than a robust analog of δ. Also, ξ is readily extended to more than two groups (Wilcox [1], section 7.1.1).

Consider again the two mixed normal distributions in the right panel of Fig. 2, only now the second distribution is shifted to have a mean equal to 0.8. Then, Cohen's d is 0.24, suggesting a small effect size, although from a graphical perspective we have what Cohen characterized as a large effect size. The measure of effect size is ξ = 0.46, approximately, which suggests a large effect size. It is not being suggested that only ξ be used to measure effect size. Other robust techniques are available and provide alternative perspectives that have practical value [1].
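The claim that explanatory power reduces to the squared Pearson correlation under ordinary least squares can be checked numerically with a short self-contained Python sketch (our own; a robust version would substitute the biweight midvariance and a robust fit):

```python
def variance(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / (len(v) - 1)

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def explanatory_power(x, y):
    """var(Y-hat) / var(Y) for an ordinary least squares fit."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    yhat = [my + slope * (a - mx) for a in x]
    return variance(yhat) / variance(y)

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.3]
print(explanatory_power(x, y))  # close to 1 for this nearly linear data
print(pearson_r(x, y) ** 2)     # identical, up to rounding error
```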
It seems that no single measure of effect size is always satisfactory and that often more than one measure of effect size might be needed to gain a good understanding of how groups compare, but complete details go beyond the scope of this article. Our suggestion is that these alternative measures be given serious consideration.

9. Comments on software

A practical issue is applying robust methods. This can be done using one of the most important software developments of the last 30 years: the free software R, which can be downloaded from http://www.r-project.org/. The illustrations used here are based on a package of more than 950 R functions [1], which are available at http://www-rcf.usc.edu/~rwilcox/. An R package, WRS, is available as well and can be installed with the R command install.packages("WRS", repos="http://R-Forge.R-project.org"), assuming that the most recent version of R is being used.
10. Practical illustration of robust methods: analysis of a lifestyle intervention for older adults

10.1. Design
To illustrate the practical benefits of using recently developed robust methods, we report some results stemming from a recent RCT of a lifestyle intervention designed to promote the health and well-being of community-dwelling older adults. Details regarding the wider RCT are summarized by Jackson et al. [25]. Briefly, this trial was conducted to compare a 6-month lifestyle intervention with a no-treatment control condition. The design used a crossover component, such that the intervention was administered to participants during the first 6 months of the study and to control participants during the second 6 months. The results reported here are based on a pooled sample of 364 participants who completed assessments both before and after the receipt of the intervention.

In assessing the potential importance of using robust statistical methods, our goal was to assess the association between the number of hours of treatment and various outcome variables. Outcome variables included the following: (1) eight indices of health-related quality of life, based on the SF-36 (physical function, bodily pain, general health, vitality, social function, mental health, physical component scores, and mental component scores) [26]; (2) depression, based on the Center for Epidemiologic Studies Depression Scale [27]; and (3) life satisfaction, based on the Life Satisfaction Index, Z Scale [21]. In each instance, the outcome variable consisted of signed posttreatment minus pretreatment change scores. Preliminary analyses revealed that the distributions are skewed. In addition, outcome variables were found to have outliers based on boxplots, raising the possibility that modern robust methods might make a practical difference in the conclusions reached.

Before discussing the primary results, it is informative to first examine the plots of the regression lines estimated via the nonparametric quantile regression method derived by Koenker and Ng [28] (their method is based in part on splines). Fig. 5 shows the estimated regression line (based on the R function qsmcobs) when predicting the median value of the variable physical function given the number
Fig. 5. The median regression line for predicting physical function based on the number of session hours (based on the R function qsmcobs).
of hours of individual sessions [28]. Notice that the regression line is approximately horizontal from 0 to 5 hours, but then an association appears. Pearson's correlation over all sessions is 0.178 (P = 0.001), indicating an association. However, Pearson's correlation is potentially misleading because the association appears to be nonlinear. The assumption of a straight regression line can be tested with the R function qrchk, which yields P = 0.006. The method used by this function stems from results derived by He and Zhu [29] and is based on a nonparametric method for estimating the median of a given outcome variable of interest. As can be seen, it appears that with 5 hours or less, the treatment has little or no association with physical function, but at or above 5 hours, an association appears. In fitting a straight regression line via the Theil-Sen estimator, for at and above 5 hours, the robust explanatory measure of association is 0.347. And the hypothesis of a zero slope, tested with the R function regci, yields P = 0.015. For hours less than 5, the Theil-Sen estimate of the slope is 0. In summary, Pearson's correlation indicates an association, but it is misleading in the sense that for less than 5 hours, there appears to be little or no association, and for more than 5 hours, the association appears to be much stronger than indicated by Pearson's correlation.

Note also that after about 5 hours, Fig. 5 suggests that the association is strictly increasing. However, the use of a running interval smoother with bootstrap bagging, which was applied with the R function rplotsm, yields the plot in Fig. 6. (Roughly, bootstrap bagging means that bootstrap samples are generated, the regression line is estimated for each bootstrap sample, and the results are averaged to obtain a final estimate of the regression line. See Wilcox [1], p. 566, for details.) Again, there is little or no association initially, and around 5-7 hours, an association appears, but unlike Fig. 5, Fig. 6 suggests that the regression line levels off at about 9 or 10 hours. The main point here is that smoothers can play an important role in terms of detecting and describing an association. Indeed, no association is found after 9 hours, although this might be because of low power or precision. Similar issues regarding nonlinear associations occurred when analyzing other outcome variables.

Fig. 7 shows the median regression line when predicting physical composite based on individual session hours. Pearson's correlation rejects the null hypothesis of no association (r = 0.2, P = 0.015), but again, a smoother (LOESS) indicates that there is little or no association from about 0-5 hours, after which an association appears. For 0-5 hours, r = 0.071 (P = 0.257), and for 5 hours or more, r = 0.25 (P = 0.045). But the 20% Winsorized correlation for 5 hours or more is rw = 0.358 (P = 0.0045). (Winsorizing means that the more extreme values are "pulled in," which reduces the deleterious effects of outliers.) The explanatory measure of effect size, re, is estimated to be 0.47.

A cautionary remark should be made. When using a smoother, the ends of the regression line can be misleading. In Fig. 8, as the number of session hours gets close to the maximum observed value, the regression appears to be monotonic decreasing. But this could be because of an inherent bias when dealing with predictor values that are close to the maximum value. For example, if the goal is to predict some outcome variable Y when X = 8, the method gives more weight to pairs of observations for which the X values are close to 8. But if there are many X values less than 8 and only a few larger than 8, this can have a negative impact on the accuracy of the predicted
0
5
10
15
Session Hours
Fig. 6. The running interval smooth based on bootstrap bagging using the same data in Fig. 5 (The R function rplotsm was used).
327
0 −10 −30
−20
Physical Composite
10
20
R. Wilcox et al. / Journal of Clinical Epidemiology 66 (2013) 319e329
0
5
10
15
Session Hours
Fig. 7. The median regression line for predicting physical composite based on the number of session hours (based on the R function qsmcobs).
of the estimators used. One concern is that there are very few points close to the maximum number of session hours, 16.5. Consequently, the precision of the estimates of physical composite, given that the number of session hours is close to 16.5, would be expected to be relatively low. Table 1 summarizes the measures of association between the session hours and 10 outcome variables. Some degree of curvature was found for all the variables listed. Shown are the Pearson r; Winsorized correlation rw , which measures
0 −10 −30
−20
Physical Composite
10
20
Y value. Also, it is often beneficial to consider the impact of removing outliers among the independent variable. This is easily done with the R function lplot by setting the argument xout 5 T. Fig. 8 shows the estimate of the regression line based on a running interval smoother with bootstrap bagging. When the number of session hours is relatively high, again there is some indication of a monotonic decreasing association, but it is less pronounced, suggesting that it might be an artifact
0
5
10
15
Session Hours
Fig. 8. The estimated regression line for predicting physical composite based on the number of session hours. The estimate is based on the running interval smoother with bootstrap bagging (The R function rplotsm was used).
the strength of a linear association while guarding against outliers among the marginal distributions; and the robust explanatory measure of the strength of the association based on the Cleveland smoother, re. (Note that no P-value is reported for re; methods for determining a P-value are available, but their efficacy, in terms of control over type I error probabilities, has not been investigated.)

Table 1. Measures of association between the hours of treatment and the variables listed in column 1 (n = 364)

Variables             Pearson r    P        rw^a     P        re^b
Physical function     0.178        0.001    0.135    0.016    0.048
Bodily pain           0.170        0.002    0.156    0.005    0.198
General health        0.209        0.0001   0.130    0.012    0.111
Vitality              0.099        0.075    0.139    0.012    0.241
Social function       0.112        0.043    0.157    0.005    0.228
Mental health         0.141        0.011    0.167    0.003    0.071
Physical composite    0.200        0.0002   0.136    0.015    0.255
Mental composite      0.095        0.087    0.149    0.007    0.028
Depression            0.022        0.694    0.132    0.018    0.134
Life satisfaction     0.086        0.125    0.118    0.035    0.119

^a rw = 20% Winsorized correlation.
^b re = robust explanatory measure of association (no P-values are provided for this method).

As can be seen, when testing at the 0.05 level, whenever Pearson correlation rejects, the Winsorized correlation rejects as well. There are, however, four instances in which the Winsorized correlation rejects and Pearson correlation does not: vitality, mental composite, depression, and life satisfaction. Particularly striking are the results for depression: Pearson correlation is 0.022 (P = 0.694), whereas the Winsorized correlation is 0.132 (P = 0.018). This illustrates the basic principle that outliers can mask a strong association among the bulk of the points. Note that the re values are always positive. This is because re measures the strength of the association without assuming that the regression line is monotonic; that is, it allows for the possibility that the regression line is increasing over some region of the predictor values but decreasing otherwise. Of course, other possibilities are Spearman rho and Kendall tau. Often these more familiar measures of association give results similar to the Winsorized correlation used here, but exceptions are encountered. Spearman rho, for example, gives very similar results for the situations in Table 1 except for the Center for Epidemiologic Studies Depression Scale (CES-D): Spearman rho is 0.096, with a P-value of 0.086.
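To make the Winsorized correlation concrete, here is a minimal Python sketch. The article's own analyses used R functions from Wilcox's WRS library; this analog, its toy data, and the helper name wincor are ours, for illustration only. Each margin is Winsorized by pulling the g = floor(0.2n) smallest values up and the g largest down before computing Pearson's r; the P-value uses the adjusted degrees of freedom n - 2g - 2 described by Wilcox [1].

```python
# Illustrative sketch (not the paper's code): a 20% Winsorized correlation can
# reveal an association that two "properly placed" outliers mask when using
# Pearson correlation. The data below are simulated.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def wincor(x, y, tr=0.2):
    """Winsorized correlation and a test of a zero Winsorized correlation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xw = np.asarray(winsorize(x, limits=(tr, tr)))  # pull in each tail of x
    yw = np.asarray(winsorize(y, limits=(tr, tr)))  # pull in each tail of y
    rw = np.corrcoef(xw, yw)[0, 1]
    g = int(np.floor(tr * n))
    tstat = rw * np.sqrt((n - 2) / (1 - rw ** 2))
    pval = 2 * stats.t.sf(abs(tstat), n - 2 * g - 2)  # adjusted df per Wilcox [1]
    return rw, pval

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = x + rng.normal(scale=0.5, size=30)  # a genuine linear association
x[:2] = [6.0, 6.5]                      # two outliers, placed so that they...
y[:2] = [-6.0, -6.5]                    # ...mask the association

r, p = stats.pearsonr(x, y)  # Pearson r, dragged down by the two outliers
rw, pw = wincor(x, y)        # Winsorized r, largely unaffected by them
```

With these toy data, the two outliers drag the ordinary Pearson correlation down, whereas the Winsorized correlation recovers the positive association among the bulk of the points, mirroring the depression results above.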
A positive feature of Spearman rho, Kendall tau, and the Winsorized correlation is that they guard against the deleterious effects of outliers among the marginal distributions. A criticism, however, is that two outliers, properly placed, can mask an association among the bulk of the points [1]. What is needed is a measure of association that takes into account the overall structure of the data when dealing with outliers. Such measures of association are available and are summarized in chapter 9 of Wilcox [1]. One that performs relatively well is the outlier projection (OP) correlation, which uses a projection method for detecting outliers; a variation based in part on Spearman correlation is included. Another portion of the study dealt with comparing the change scores of two groups of participants. The variables were measures of physical function, bodily pain, physical composite, and a cognitive score. For the first group, there was an ethnic match between the participant and the therapist; in the other group, there was no ethnic match. The total sample size was 205. Table 2 summarizes the results of comparing means with Welch test; 20% trimmed means with Yuen test, applied with the R function yuenv2; and Cliff method, a heteroscedastic analog of the Wilcoxon–Mann–Whitney test, applied with the R function cidv2. (When comparing 20% trimmed means, a theoretically correct estimate of the standard error of a trimmed mean is based in part on the 20% Winsorized variance, which was used here.) As can be seen, both Welch test and Cliff method reject at the 0.05 level for three of the four outcome variables. In contrast, Yuen test rejects in all four situations, the point being that the choice of method can make a practical difference. Also shown are two measures of effect size.
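The group comparisons just described used R functions from Wilcox's WRS library (yuenv2 and cidv2). As a rough, illustrative Python analog with simulated data (the data and all variable names below are ours, not the trial's), SciPy's ttest_ind performs Welch test and, via its trim argument, Yuen test on 20% trimmed means; a simple plug-in estimate of P(X < Y) is also shown. Cliff confidence-interval method and the robust effect size are not reimplemented here.

```python
# Illustrative sketch only: simulated, heavy-tailed "change scores" for two
# groups, compared with Welch test (means) and Yuen test (20% trimmed means).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.standard_t(df=3, size=100)        # heavy-tailed group 1
g2 = rng.standard_t(df=3, size=105) + 1.0  # group 2, shifted upward

welch = stats.ttest_ind(g1, g2, equal_var=False)           # Welch test on means
yuen = stats.ttest_ind(g1, g2, equal_var=False, trim=0.2)  # Yuen test, 20% trimming

# Plug-in estimate of P(X < Y): the probability that a randomly sampled
# observation from group 1 is less than one from group 2.
p_hat = np.mean(g1[:, None] < g2[None, :])
```

Because the simulated distributions are heavy tailed, the trimmed means typically have smaller standard errors than the means, illustrating why Yuen test can reject when Welch test does not.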
The first is P, the probability that a randomly sampled observation from the first group is less than one from the second, and the other is ξ, which was described previously. Although Cohen d is not recommended, it is instructive to note that for the four situations in Table 2 it ranges between 0.21 and 0.59, values that are often interpreted as small to medium effect sizes. In contrast, ξ indicates medium to large effect sizes.

Table 2. P-values and measures of effect size comparing ethnically matched patients with a nonmatched group

Variables             Welch test: P-value   Yuen test: P-value   Cliff P-value   P        ξ
Physical function     0.1445                0.0469               0.059           0.588    0.252
Bodily pain           0.01397               <0.0001              0.001           0.705    0.501
Physical composite    <0.0001               0.0002               0.001           0.657    0.391
Cognition             0.0332                0.0091               0.031           0.608    0.308

Cliff method is a heteroscedastic analog of the Wilcoxon–Mann–Whitney test; P = P(X < Y) is the probability that an observation in the first group is less than one from the second group; and ξ is a robust measure of effect size that allows heteroscedasticity and is roughly a robust analog of Pearson correlation.

11. Discussion

There are many important applications of robust statistical methodology that extend beyond those described in this article. For example, the violation of assumptions in more complex experimental designs is associated with especially dire consequences, and robust methods for dealing with these problems have been derived [1]. A point worth stressing is that no single method is always best. The optimal method in a given situation depends in part on the magnitude of the true differences or associations and on the nature of the distributions being studied, which are often unknown. This raises an important question: how many methods are required to get a good understanding of what the data are trying to tell us? Robust methods are designed to perform relatively well under normality and homoscedasticity, while continuing to perform well when these assumptions are violated. But it is possible that classic methods offer an advantage in some situations. For example, for skewed distributions, the difference between the means might be larger than the difference between the 20% trimmed means, suggesting that power might be higher using means. However, this is no guarantee that means have more power, because it is often the case that a 20% trimmed mean has a substantially smaller standard error. In general, complete reliance on classic techniques seems highly questionable, in that hundreds of articles published during the last quarter century underscore the strong practical value of modern procedures. All indications are that routinely used classic methods can be highly unsatisfactory under general conditions. Moreover, modern methods are now accessible via a vast array of R functions. The main message here is that modern technology offers the opportunity to analyze data in a much more flexible and informative manner, enabling researchers to learn more from their data and thereby augmenting the accuracy and practical utility of clinical trial results.

References

[1] Wilcox RR. Introduction to robust estimation and hypothesis testing. 3rd ed. Burlington, MA: Elsevier Academic Press; 2012.
[2] Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust statistics: the approach based on influence functions.
New York: Wiley; 1986.
[3] Hoaglin DC, Mosteller F, Tukey JW. Understanding robust and exploratory data analysis. New York: Wiley; 1983.
[4] Hoaglin DC, Mosteller F, Tukey JW. Exploring data tables, trends, and shapes. New York: Wiley; 1985.
[5] Huber P, Ronchetti EM. Robust statistics. 2nd ed. New York: Wiley; 2009.
[6] Maronna RA, Martin MA, Yohai VJ. Robust statistics: theory and methods. New York: Wiley; 2006.
[7] Staudte RG, Sheather SJ. Robust estimation and testing. New York: Wiley; 1990.
[8] Tukey JW. A survey of sampling from contaminated normal distributions. In: Olkin I, et al., editors. Contributions to probability and statistics. Stanford, CA: Stanford University Press; 1960:448–85.
[9] Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. New York: Academic Press; 1988.
[10] Shavelson RJ. Statistical reasoning for the behavioral sciences. Needham Heights, MA: Allyn and Bacon; 1988. p. 266.
[11] Goldman RN, Weinberg JS. Statistics: an introduction. Englewood Cliffs, NJ: Prentice-Hall; 1985. p. 252.
[12] Huntsberger DV, Billingsley P. Elements of statistical inference. 5th ed. Boston, MA: Allyn and Bacon; 1981.
[13] Cribbie RA, Fiksenbaum L, Keselman HJ, Wilcox RR. Effects of nonnormality on test statistics for one-way independent groups designs. Br J Math Stat Psychol 2012;65:56–73.
[14] Cressie NA, Whitford HJ. How to use the two sample t-test. Biometrical J 1986;28:131–48.
[15] Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika 1938;29:350–62.
[16] Hald A. A history of mathematical statistics from 1750 to 1930. New York: Wiley; 1998.
[17] Rosenberger JL, Gasko M. Comparing location estimators: trimmed means, medians, and trimean. In: Hoaglin D, Mosteller F, Tukey J, editors. Understanding robust and exploratory data analysis. New York: Wiley; 1983:297–336.
[18] Box GE, Cox DR. An analysis of transformations. J Roy Stat Soc B 1964;26:211–52.
[19] Sakia RM. The Box–Cox transformation: a review. The Statistician 1992;41:169–78.
[20] Doksum KA, Wong C-W. Statistical tests based on transformed data. J Am Stat Assoc 1983;78:411–7.
[21] Wood V, Wylie ML, Sheafor B. An analysis of a short self-report measure of life satisfaction: correlation with rater judgments. J Gerontol 1969;24:465–9.
[22] Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 1979;74:829–36.
[23] Algina J, Keselman HJ, Penfield RD.
An alternative to Cohen's standardized mean difference effect size: a robust parameter and confidence interval in the two independent groups case. Psychol Methods 2005;10:317–28.
[24] Wilcox RR, Tian T. Measuring effect size: a robust heteroscedastic approach for two or more groups. J Appl Stat 2011;38:1359–68.
[25] Jackson J, Mandel D, Blanchard J, Carlson M, Cherry B, Azen S, et al. Confronting challenges in intervention research with ethnically diverse older adults: the USC Well Elderly II trial. Clin Trials 2009;6:90–101.
[26] Ware JE, Kosinski M, Dewey JE. How to score version two of the SF-36 Health Survey. Lincoln, RI: QualityMetric; 2000.
[27] Radloff LS. The CES-D Scale: a self-report depression scale for research in the general population. Appl Psychol Meas 1977;1:385–401.
[28] Koenker R, Ng P. Inequality constrained quantile regression. Sankhya Indian J Stat 2005;67:418–40.
[29] He X, Zhu L-X. A lack-of-fit test for quantile regression. J Am Stat Assoc 2003;98:1013–22.