Analysis of sparse data

J Clin Epidemiol Vol. 43, No. 8, pp. 755-756, 1990
Printed in Great Britain. All rights reserved

0895-4356/90 $3.00 + 0.00
Copyright © 1990 Pergamon Press plc

Variance and Dissent: Response

ANALYSIS OF SPARSE DATA

T. WILCOSKY

Department of Epidemiology and Collaborative Studies Coordinating Center, Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, NC 27514, U.S.A. (Received for publication 23 January 1990)

Dr Rimm’s Dissent [1] apparently reflects an uneasiness that many of us feel when working with multiple regression models. Most would agree that stratified analysis is, in many respects, the best single tool for epidemiologic investigations. (Dr Rimm’s original reviewer comments alluded to “a plot of raw data appropriately stratified”, although his Dissent does not explicitly mention stratified analysis.) In some instances, however, stratified analysis has little value. If the data are very sparse, one can examine only a few broad strata, or the rates will be too unstable to be informative.

An analysis of obesity and mortality among men in the Framingham Heart Study cohort [2] illustrates the problem of sparse data. The investigators plotted the obesity-mortality association across six levels of relative body weight for each of six age- and smoking-specific subgroups. Even though the study included 26 years of follow-up, the cumulative mortality rates were unstable (i.e. few monotonically increasing or decreasing trends) across relative weight categories, and mortality estimates for the leanest nonsmokers were excluded due to small numbers. Because age and smoking are such strong confounders of obesity-mortality associations, any informative analysis should control for at least these variables. The analysis by Garrison et al. [2] is the crudest that one can reasonably present. However, even with 26 years of follow-up, their results at best give only a vague idea about the shape of the obesity-mortality association, especially since the leanest (and most interesting) relative weight category was excluded from half of the six plots.

The exact sample size and number of deaths are not given in the Garrison et al. paper [2], but their analysis included at least 679 deaths. In contrast, the analysis of the Lipid Research Clinics (LRC) Program cohort included only 163 deaths among men and 66 deaths among women [3]. A stratified analysis of the LRC data using a sufficient number of obesity categories to reveal the detailed shape of the obesity-mortality curve yields noise, because the tails of the distribution, where the rates are highest, are by definition very sparsely populated. If wider strata are used to increase the stability of the rates, the shapes of the curves become obscured by the wide strata.

Although the paper was somewhat long, we added some descriptive tables at Dr Rimm’s request because we agreed that they would improve the paper. We felt, however, that adding rates that we considered too unstable to be meaningful was inappropriate. Dr Rimm did not accept our explanation when we first responded to his comments, and I am sure he does not accept it now. If we had truly intended, even unknowingly, to mislead the readers concerning the nature of the obesity-mortality associations in our data, we would have been clever enough to conceal the results of the quartic model analysis that indicates a lack of fit in the models that we used.

At this point, Dr Rimm leaves us little choice but to present some rates. Table 1 of this Response presents the risk of death during the first 7 years of follow-up by age and body mass index (BMI) for males. The population and death counts differ from those in our paper because the oldest age groups are excluded due to small numbers, censored persons (i.e. those alive with less than 7 years of follow-up) are excluded, and deaths occurring after 7 years of follow-up are excluded.

Table 1. Seven-year cumulative mortality rate from all causes by age and body mass index for males

            Age 30-39              Age 40-49              Age 50-59              Age 60-69
BMI     Cum.     No. of        Cum.     No. of        Cum.     No. of        Cum.     No. of
        mort.    deaths        mort.    deaths        mort.    deaths        mort.    deaths
<2.3    0.021       1          0.000*      0          0.125       4          0.417       5
2.3     0.015       6          0.019       5          0.055      11          0.098       9
2.6     0.003       2          0.019      10          0.053      16          0.057       5
2.9     0.004       1          0.026       6          0.076      10          0.133       4
≥3.2    0.010       1          0.052       4          0.053       2          0.111       1

*n = 33.

As Table 1 shows, mortality is highest in the leanest category for three of the four age groups, and the middle category has the minimum mortality for the same three age groups. Only two of these rates are based on more than 10 deaths, so that fluctuation by even a single death in the remaining rates would cause at least a 10% change in the rate. Although the rates are roughly consistent with a quadratic model, they are very unstable. Rates for women (not presented) are even worse. Dr Rimm assumed that the data did not support the model. The data he requested do not support anything.

I would like briefly to clarify some less important issues raised in Dr Rimm’s Dissent. (1) Dr Rimm needs to find another example of the “biostatistical tail wagging the epidemiologic dog”: no LRC statistician ever attempted to convince me or my coauthors that we should use or present regression analyses instead of stratified analyses. The epidemiologist that Dr Rimm approached was probably unfamiliar with the details of the analysis, so that he or she had little reason to question the statisticians’ judgment. (2) The quadratic model is appropriate for addressing the three basic study questions outlined in our paper, because modest deviations from the quadratic model should not affect the conclusions from this analysis. (3) The coefficients are admittedly difficult to interpret.

For interested readers, the tables contain the data necessary to generate the type of plots presented in Fig. 1 of our paper [3], by using the following formula:

$$\text{Relative hazard} = \exp(\beta_1 X + \beta_2 X^2 - K)$$

where $\beta_1$ is the linear regression coefficient, $\beta_2$ is the quadratic regression coefficient, $K = \beta_1(\text{median of } X) + \beta_2(\text{median of } X)^2$, and the $X$s are the values of interest for the obesity indices. Persons with the median value will be the reference group with a relative hazard of 1.0.

In closing, let me comment on Table 5 from our paper [3] with its offending array of regression coefficients. Each model included obesity terms plus 9 covariates, and the 2 obesity regression coefficients per analysis give an extremely compact presentation of the adjusted relationship between obesity and mortality. If each of the 10 variables were dichotomized (the absolute minimum number of strata possible), a stratified analysis would require $2^{10} = 1024$ strata. The simplest way to present such an analysis would be a plot. To paraphrase a prominent politician, a plot from each analysis would produce “a thousand points of ink”.

REFERENCES

1. Rimm AA. A reveal-conceal test for manuscript review: its application in the obesity mortality study. J Clin Epidemiol 1990; 43: 753-754.
2. Garrison RJ, Feinleib M, Castelli WP, McNamara PM. Cigarette smoking as a confounder of the relationship between relative weight and long-term mortality. The Framingham Heart Study. JAMA 1983; 249: 2199-2203.
3. Wilcosky T, Hyde J, Anderson JJB, Bangdiwala S, Duncan B. Obesity and mortality in the Lipid Research Clinics Program Follow-up Study. J Clin Epidemiol 1990; 43: 743-752.
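The relative hazard formula given in this Response can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the analysis code used for the paper: the function name `relative_hazard` is hypothetical, and the coefficients and median below are made-up placeholders, not values from Table 5 of the paper [3].

```python
import math


def relative_hazard(x, beta1, beta2, median_x):
    """Relative hazard at obesity index value x, with the median as reference.

    Implements exp(beta1*x + beta2*x^2 - K), where
    K = beta1*median_x + beta2*median_x^2.
    """
    k = beta1 * median_x + beta2 * median_x ** 2
    return math.exp(beta1 * x + beta2 * x ** 2 - k)


# HYPOTHETICAL inputs for illustration only; substitute the linear and
# quadratic coefficients and the median obesity index from the paper's tables.
beta1 = -1.2
beta2 = 0.22
median_x = 2.6

for x in (2.0, 2.3, 2.6, 2.9, 3.2):
    rh = relative_hazard(x, beta1, beta2, median_x)
    print(f"X = {x:.1f}  relative hazard = {rh:.2f}")
```

Because K is evaluated at the median, the exponent is zero there, so the median group always has a relative hazard of 1.0 whatever coefficients are supplied. Evaluating the function over a grid of X values gives the points needed for a curve of the kind described for Fig. 1.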