A note on normality

To the Editor: In their letter¹ regarding a prior investigation evaluating factors associated with serum interleukin (IL)-33 levels among patients with atopic dermatitis,² Lee and Park suggested, among other points, that the linear regression modeling performed in this study was incorrect, in that such modeling “has a premise of a normally distributed dependent variable,” and the “IL-33 serum levels showed a nonnormal distribution.” They recommended a log transformation of this dependent variable to resolve this perceived problem. In response,³ the original study authors stated that the “issue raised by Lee and Park is appropriate” and repeated their analysis accordingly.

Because ordinary least squares (OLS) linear regression is a widely used inferential technique in health care research, it is important to understand what normality means in this context. The normality assumption does not apply to the distribution of the dependent variable; rather, it requires that the residuals (the differences between observed and predicted values of the dependent variable) be normally distributed.⁴ Alternatively stated, with Y outcome
and X independent variable, the distribution of Y conditional on X must be normal, not the distribution of Y itself.⁴ While skewed dependent variables and outliers often produce nonnormal residuals, this is not necessarily the case, and researchers might needlessly apply mathematical transformations to their outcomes that subsequently obscure the clinical interpretability of their results.

Consider, for example, a hypothetical outcome that is normally distributed (mean 1, standard deviation 1) for 50 males and normally distributed (mean 10, standard deviation 1) for 50 females. The aggregate distribution of this outcome, ignoring sex, appears bimodal (Fig 1, A). However, after conditioning on sex and calculating residuals following bivariate OLS linear regression, the normality assumption appears, by visual inspection, to be satisfied (Fig 1, B).

Fig 1. Distribution of hypothetical outcome and residuals from ordinary least squares (OLS) linear regression. A, Hypothetical outcome whose values have been randomly drawn from a normal distribution (mean 1, standard deviation 1) for 50 males and from a normal distribution (mean 10, standard deviation 1) for 50 females, overlaid with the associated normal density curve. In aggregate, the distribution of the outcome appears bimodal. B, Distribution of the residuals (differences between observed and predicted outcome values) following bivariate OLS linear regression, with the outcome as the dependent variable and sex as the independent variable. The distribution of residuals approximates the associated normal density curve, in contrast to A.

J AM ACAD DERMATOL, JUNE 2015, e169-e170

Normality can be assessed in various ways. In addition to histograms overlaid with normal density functions, residuals can be evaluated graphically using normal probability-probability plots, quantile-quantile plots, and density probability plots, among other approaches. Statistical tests of normality include the Kolmogorov-Smirnov test, the Anderson-Darling test, the Shapiro-Wilk test, and others, all of which are easily executed in most modern statistical software packages but should be employed with appropriate thoughtfulness.

Researchers confronted with nonnormal residuals may attempt to apply nonlinear transformations to their data, exclude outliers, and/or fit a different model. Even when normality is violated, the Gauss-Markov theorem⁵ states that OLS estimation still yields the best linear unbiased estimator of the regression coefficients, so long as the errors have expectation zero, are uncorrelated, and have equal variance. This fact alone does not guarantee accurate P-value and confidence interval calculations in the face of nonnormality, but this is often a nonissue in large datasets⁴ owing to the central limit theorem and the ability to use nonparametric methods, such as bootstrapping, to assess statistical significance.

As dermatologists continue to pursue research to advance care for patients with cutaneous disease, it is hoped this small note might prove useful to those leveraging linear regression in their future scientific endeavors.

Jason P. Lott, MD, MHS, MSHP
Section of Dermatology, Department of Veterans Affairs Connecticut Healthcare System, West Haven; Department of Internal Medicine and Robert Wood Johnson Foundation Clinical
Scholars Program, Yale University School of Medicine, New Haven, CT

Funding sources: None.

Conflicts of interest: None declared.

Correspondence to: Jason P. Lott, MD, MHS, MSHP, 333 Cedar Street, SHM IE-61, PO Box 208088, New Haven, CT 06520

E-mail: [email protected]; [email protected]
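
For readers who wish to reproduce the hypothetical example described in this letter (50 males drawn from a normal distribution with mean 1 and standard deviation 1, and 50 females with mean 10 and standard deviation 1), a minimal Python sketch follows. It is illustrative only and is not the code used to produce Fig 1. The random seed, the 0/1 coding of sex, and the helper names `simulate`, `ols_fit`, and `excess_kurtosis` are assumptions introduced here. Excess kurtosis (0 for a normal distribution) serves as a crude descriptive stand-in for the graphical checks and formal tests named above; where SciPy is available, `scipy.stats.shapiro` could be applied to the residuals instead.

```python
import random
import statistics


def simulate(seed=2015, n_per_group=50):
    """Draw the hypothetical outcome: N(1, 1) for males, N(10, 1) for females."""
    rng = random.Random(seed)  # seed is arbitrary (assumption)
    # Coding sex as 0 = male, 1 = female is an assumption made for illustration.
    sex = [0] * n_per_group + [1] * n_per_group
    y = [rng.gauss(1.0, 1.0) for _ in range(n_per_group)]
    y += [rng.gauss(10.0, 1.0) for _ in range(n_per_group)]
    return sex, y


def ols_fit(x, y):
    """Closed-form bivariate OLS; returns (intercept, slope)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    n = len(values)
    m = statistics.fmean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


sex, y = simulate()
intercept, slope = ols_fit(sex, y)

# Residuals are observed minus predicted values, i.e. Y after conditioning on sex.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(sex, y)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"excess kurtosis of outcome:   {excess_kurtosis(y):.2f}")
print(f"excess kurtosis of residuals: {excess_kurtosis(residuals):.2f}")
```

Under these assumptions, the pooled outcome should show strongly negative excess kurtosis, as expected for a bimodal distribution, whereas the residuals should sit much closer to zero, consistent with the contrast between panels A and B of Fig 1.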
REFERENCES
1. Lee YB, Park YG. Statistical comments on “Increased serum levels of interleukin 33 in patients with atopic dermatitis.” J Am Acad Dermatol. 2015;72:199.
2. Tamagawa-Mineoka R, Okuzawa Y, Masuda K, Katoh N. Increased serum levels of interleukin 33 in patients with atopic dermatitis. J Am Acad Dermatol. 2014;70:882-888.
3. Tamagawa-Mineoka R, Teramukai S, Katoh N. Response to “Statistical comments on ‘Increased serum levels of interleukin 33 in patients with atopic dermatitis.’” J Am Acad Dermatol. 2015;72:200.
4. Lumley T, Diehr P, Emerson S, Chen L. The importance of the normality assumption in large public health data sets. Annu Rev Public Health. 2002;23:151-169.
5. Puntanen S, Styan GP. The equality of the ordinary least squares estimator and the best linear unbiased estimator. Am Stat. 1989;43:153-171.

http://dx.doi.org/10.1016/j.jaad.2015.01.051
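
As a supplementary illustration of the letter's remark that nonparametric methods such as bootstrapping can be used when residuals are nonnormal, the following Python sketch computes a percentile bootstrap confidence interval for an OLS slope by resampling observation pairs with replacement. It is one common bootstrap variant among several, not a method prescribed by the letter. The data are entirely hypothetical (true slope 2, with skewed errors drawn from an exponential distribution centered to mean zero), and the function names, seeds, and number of resamples are assumptions introduced here.

```python
import random
import statistics


def ols_slope(x, y):
    """Closed-form slope from bivariate OLS."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the OLS slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        # Draw n observation indices with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    slopes.sort()
    lower = slopes[int((alpha / 2) * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical data: y = 1 + 2x + error, where the error is exponential(1)
# shifted to have mean zero, so it is skewed and clearly nonnormal.
rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

slope_hat = ols_slope(x, y)
lower, upper = bootstrap_slope_ci(x, y)
print(f"OLS slope = {slope_hat:.3f}, 95% bootstrap CI = ({lower:.3f}, {upper:.3f})")
```

Because the interval is built from the empirical distribution of refitted slopes, it does not rely on normally distributed residuals. It does still assume independent observations.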