The role of the biostatistician in cancer research

Biomed Pharmacother 2001 ; 55 : 502-9 © 2001 Éditions scientifiques et médicales Elsevier SAS. All rights reserved S0753332201001342/FLA

Presentation

The role of the biostatistician in cancer research E.A. Gehan* Lombardi Cancer Center, Georgetown University Medical Center, Washington, DC 20007, USA

Summary – This article considers triumphs and challenges for biostatisticians working in oncology at the beginning of the 21st century. The impact of three major articles in biostatistics in the 20th century is considered: Cornfield’s 1951 paper on estimating comparative rates from clinical data; Mantel and Haenszel’s 1959 paper on obtaining summary measures of relative risk, adjusting for stratification factors in epidemiological studies; and D. R. Cox’s 1972 paper, which developed the proportional hazards model for evaluating the effect of covariates on survival time outcomes. Biostatistical challenges for the 21st century are considered for the areas of clinical trials, survival analysis, and statistical genetics. © 2001 Éditions scientifiques et médicales Elsevier SAS

biostatistics / oncology / survival analysis

It is a distinct pleasure to return to the M. D. Anderson Cancer Center and participate in a symposium concerning triumphs and challenges in oncology. Since I have been asked to speak to the role of the biostatistician in cancer research, my theme will be the triumphs and challenges for biostatisticians working in oncology. I will consider some past triumphs in biostatistics that have had important applications in oncology, and also challenges for the 21st century.

*Correspondence and reprints. E-mail address: [email protected] (E.A. Gehan).

THE EVOLUTION OF STATISTICS FROM THE 19TH TO THE 21ST CENTURY

The evolution of statistics as a subject is painted with a very broad brush in table I. In the 19th century and earlier, the main interest was in the statistics of the state, population, births and deaths, and statistics of commerce. In the first half of the 20th century, when R. A. Fisher was at his peak of productivity, the basic theories of statistics were developed, including probability distributions, ideas of randomization, the

Table I. Evolution of statistics from the 19th century to the 21st century.
• 19th century: statistics of the state; vital statistics
• 20th century (1900–1950): branch of mathematics; statistical theory; probability distributions
• 20th century (1950–1999): branch of science; biostatistics; survival analysis; clinical trials
• 21st century: biostatistics in the new millennium; branch of information science; data, biostatistics, information

Neyman-Pearson theory of testing hypotheses and so on. The second half of the 20th century saw statistics develop in a broader way as a branch of science, with an especial flowering of work in biostatistics, survival analysis and clinical trials. There was much progress by biostatisticians working in oncology and this is the first area that I will consider. Table II gives some of the important breakthroughs in biostatistics from 1950 to 1999, beginning with Cornfield’s work on obtaining estimates of relative risk in epidemiological studies [2], to the important


Table II. Some breakthroughs in biostatistics (1950–1999).
• Cornfield J. (1951). “A Method of Estimating Comparative Rates from Clinical Data. Applications to Cancer of the Lung, Breast and Cervix.”
• Kaplan E.L., Meier P. (1958). “Non-parametric Estimation from Incomplete Observations.”
• Mantel N., Haenszel W. (1959). “Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease.”
• Tukey J.W. (1962). “The Future of Data Analysis.”
• Cox D.R. (1972). “Regression Models and Life Tables.”
• Nelder J.A., Wedderburn R.W.M. (1972). “Generalized Linear Models.”
• Efron B. (1979). “Bootstrap Methods: Another Look at the Jackknife.”
• Liang K.Y., Zeger S.L. (1986). “Longitudinal Data Analysis Using Generalized Linear Models.”

Figure 1. Biostatisticians at NIH: 1950s.


Kaplan and Meier paper [8], which provided a nonparametric method for estimating survival curves, and then Mantel’s work on developing χ2 tests for epidemiological studies [10]. Tukey foresaw many of the areas of application for data analysis [14], and Cox wrote perhaps the most influential paper of the century on regression models and life tables [4]. Other important work was by Nelder and Wedderburn [13], Efron [6], and Liang and Zeger [9]. In the first part of this paper, I will highlight some of the triumphs of Cornfield, Mantel, and Cox.

I would indeed be remiss if I didn’t mention the important role that the National Institutes of Health (NIH) has played in the evolution of statistics, both in the development of biostatistics and of biostatisticians. Figure 1 shows the group of biostatisticians at NIH

recruited by Harold Dorn in the early 1950s. He deserves much credit for recruiting Cornfield, Schneiderman, Mantel, Greenhouse, and Jack Lieberman. I will discuss Cornfield’s work further and also that of Mantel, but Marvin Schneiderman was the biostatistician working with Gordon Zubrod, who recruited the first biostatisticians working with the cancer cooperative groups beginning about 1955. Dr. Zubrod, Clinical Director of the National Cancer Institute, deserves much credit for believing in the importance of biostatistics and for making sure that there were biostatisticians working actively in cancer clinical trials, a practice that continues to this day.

BREAKTHROUGH IN BIOSTATISTICS: CORNFIELD’S ESTIMATE OF RELATIVE RISK

Cornfield’s paper on estimating comparative rates from clinical data was published in 1951 [2]. Jerry Cornfield was known either as Jerry or Mr. Cornfield (figure 2). In addition to being highly intellectual and having a diversity of interests, Jerry was a man of action and threw himself into his work with tremendous vigor. Cornfield never received a doctorate, though he had a long list of publications. When asked about this, he would say, “I would rather have someone wonder why I didn’t have a doctorate than why I did.”

The following points were made by Cornfield in his Presidential Address to the American Statistical Association in 1975 [3]. He began by commenting that other subjects appear self-contained, but statistics does not. Or, as Jerry wrote, “. . . statistics appears concerned only with the methods of accumulating knowledge, but not with what is accumulated.” The point is not precisely correct, but all successful statisticians must be deeply concerned with the methodology of drawing proper conclusions. Secondly, Cornfield said, “Although I enjoyed mathematics, I enjoyed many other subjects just as much.
This interest in many phenomena, and the sense that mathematics by itself is intellectually confining, is characteristic of many statisticians.” He also posed the question, “Why should any person of spirit, of ambition, of high intellectual standards receive any stimulation from serving an auxiliary role on someone else’s problem?” He answered this by considering several examples, one on the toxicity of amino acids, another on the computerized interpretation of electrocardiograms, and demonstrated that the role of the statistician was critical in solving the problem. He further pointed out that statisticians tend to be hybrids, and this was certainly true of him: he majored in history at New York University and later became an amateur historian on such subjects as baseball and the American Civil War.

Figure 2. Jerome Cornfield, B.A.

In a recent paper on case-control studies, Norman Breslow refers to Jerry’s 1951 paper as “launching the modern era of case control studies” [1]. He stated the problem of the paper as follows: “A frequent problem in epidemiological research is the attempt to determine whether the probability of having or incurring a stated disease, such as cancer of the lung, during a specified interval of time is related to the possession of a certain characteristic, such as smoking.” One of the studies available at the time of Cornfield’s paper was the Doll and Hill matched control study [5] of 649 patients with lung cancer and 649 matched male control patients (table III). As shown in the table, 99.7% of the males with lung cancer were smokers compared with 95.8% of the male controls, and the difference between the smoking rates was statistically significant. However, of more importance would be an estimate of the relative risk of developing lung cancer for smokers versus nonsmokers. Cornfield considered the odds of being a smoker for the males with lung cancer and for the controls, and demonstrated, via an application of Bayes’ theorem, that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed, and that the latter in turn approximates the ratio of disease rates provided that the disease is rare. As shown in table III, the relative risk of a smoker developing lung cancer is 14.6 times that of a nonsmoker, assuming the frequency of lung cancer in the male population is low. It was data such as these that led, in 1964, to the report on ‘Smoking and Health’ of the Surgeon General of the United States Public Health Service (USPHS), and the USPHS has promoted smoking cessation efforts ever since.

Table III. Cornfield’s estimate of Relative Risk (RR) in epidemiological studies (Doll & Hill, 1950).
• % smokers: males with lung cancer, 99.7% (647/649); male controls, 95.8% (622/649)
• Odds of being a smoker: males with lung cancer, 99.7%/0.3% = 332.3; male controls, 95.8%/4.2% = 22.8
• Estimate of Relative Risk (RR) of lung cancer, smokers vs nonsmokers: RR ≅ ratio of odds = 332.3/22.8 = 14.6, assuming the frequency of lung cancer in the general population is low
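Cornfield’s rare-disease argument can be made concrete with a minimal sketch (illustrative, not from the original paper) that recomputes the odds ratio from the Doll and Hill counts in table III. Note that the raw counts give about 14.0; the article’s 14.6 arises from the rounded percentages.

```python
# Sketch of Cornfield's odds-ratio approximation to relative risk (RR)
# from a 2x2 case-control table; valid as an RR estimate when the
# disease is rare in the population.

def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Cross-product (odds) ratio: (a/c) / (b/d)."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls

# Doll & Hill (1950): 647 of 649 lung-cancer cases were smokers,
# 622 of 649 matched controls were smokers.
rr_counts = odds_ratio(647, 2, 622, 27)        # raw counts: ~14.0
rr_percent = odds_ratio(99.7, 0.3, 95.8, 4.2)  # rounded percentages: ~14.6
print(round(rr_counts, 1), round(rr_percent, 1))
```

Either way, the exposure odds ratio stands in for the unobservable ratio of disease rates, which is the essence of Cornfield’s 1951 argument.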

BREAKTHROUGH IN BIOSTATISTICS: MANTEL AND HAENSZEL’S PAPER ON SUMMARY MEASURES OF RELATIVE RISK

I now wish to speak about the paper on the statistical aspects of data from retrospective studies of disease published in 1959 [10]. There were two parts to this paper: the first part, concerning epidemiological studies, was written by William Haenszel, and the second part, written by Nathan Mantel, gave methods of obtaining summary estimates of relative risk, adjusting for various types of stratification. Nathan Mantel is the only one of the original group of NIH biostatisticians who is still alive. His eyesight is not very good, but his mind is as alert as ever. Nathan is indeed an unforgettable character (figure 3).

Figure 3. Nathan Mantel, M.S.

Once, I introduced him as a speaker and said that he began his employment in the U.S. government as a GS-1, the lowest-level employee, and completed his career as a GS-16, at that time the highest non-administrative grade that one could attain. However, Nathan later corrected me by saying that he actually began as a messenger boy in a position below GS-1 and was only later promoted to GS-1. So, one lesson that I learned from Nathan is that he always likes to have the last word; he is also the most frugal person that I know.

Mantel delivered a lecture some years ago in which he gave his philosophy on various aspects of experimentation [12]. A major theme was that much can be learned from observational studies, basically because they are quasi-experiments. The observational approach applies whenever we seek to determine whether there is some interrelationship between specified characteristics in a population. Observational data
can be of crucial importance in identifying factors related to disease occurrence, and the same type of data has also been used by some in evaluating treatments. An old saw among statisticians is that they should be consulted prior to an investigation. Nathan believed the opposite: that statisticians need not be consulted at the start of an investigation. He claimed that being called in after the fact was generally not a deterrent to finding some way of helping the investigator. Some interesting theoretical problems arose in this way, and whereas the investigator might have considered the data experimental, Mantel considered the experiment as leading to observational data. He also believed that retrospective studies could be as persuasive as prospective studies, and he had examples of dependent and independent variables being confused.

A final point he made was, “Write it down.” Once, Mantel was brimming with ideas that he wished to express to Cornfield, but Cornfield replied: “Write it down.” Mantel found that having to put everything down in a systematic way that could not be misunderstood was helpful throughout his career.

Mantel’s bibliography lists 379 publications. His most famous publication is “Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease” (1959). This paper is important from several points of view. First, it gives an adjusted χ2 test, subsequently known as the Mantel-Haenszel test, which is applicable to retrospective and prospective studies of disease. The test controls for possible confounding (e.g., by occupation) by stratification of the data into a series of 2 × 2 tables. Secondly, the paper provided a simple summary relative risk estimator that weights individual odds ratios by both precision and importance. Thirdly, the same approach as taken in the Mantel-Haenszel paper can be used in many other types of applications, one example being in survival studies [11].
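The summary estimator just described can be sketched in a few lines of Python; the strata and counts below are hypothetical, but the weighting is the standard Mantel-Haenszel form.

```python
# Sketch of the Mantel-Haenszel summary odds ratio across strata.
# Each stratum is a 2x2 table (a, b, c, d): a = exposed cases,
# b = exposed controls, c = unexposed cases, d = unexposed controls.
# Each stratum's odds ratio a*d/(b*c) is weighted by b*c/n, so large,
# informative strata contribute more to the pooled estimate.

def mantel_haenszel_or(strata):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Two hypothetical strata (e.g., two occupation groups).
strata = [(40, 10, 20, 30), (30, 15, 10, 25)]
print(round(mantel_haenszel_or(strata), 2))
```

With a single stratum the estimator reduces to the ordinary cross-product odds ratio; stratifying guards against confounding by the stratification factor.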
In the simple situation of two groups of individuals in which survival is to be compared, a 2 × 2 table can be formed at the time of each death, indicating whether the patients remaining at risk in each treatment group died or survived at that particular death time. Appropriately weighting the odds ratios of death over all death times leads to a log rank type test, which many have called the Cox-Mantel test. By the end of 1994, this paper had received over 4000 citations and is one of the 200 most cited papers in the scientific literature.

BREAKTHROUGH IN BIOSTATISTICS: COX’S PAPER ON THE PROPORTIONAL HAZARDS MODEL

I’d like to consider now Sir David Cox’s paper, written in 1972, on regression models and life tables [4]. For this work, Cox received the Charles F. Kettering prize in 1990, with the citation that it was for the “development of the proportional hazards regression model,” now perhaps more often called the Cox model. It is indeed a remarkable paper and one of the 100 papers most frequently quoted in the scientific literature. Sir David Cox was knighted in 1985 and is certainly among the best statisticians in the world, especially for his theoretical contributions that have had major applications (figure 4). I had the pleasure of working collaboratively with Sir David earlier in his career. He is an exceedingly shy and modest individual. In bringing problems to him, I always found that he had a better understanding of the problem than I did, even after my imperfect description, and all his suggestions were good ones for working on a problem. He has written more than ten books, but perhaps his best known work is on the proportional hazards regression model.

Figure 4. Sir David R. Cox.

Table IV. Impact of Cox’s paper on regression methods and life tables.
• Proportional hazards regression (Cox) model: λ(t) = λ0(t) exp{β1x1 + … + βkxk}. The hazard function λ(t) for an individual at time t is the product of a baseline hazard function λ0(t) and an exponential function of the covariates (the x’s) and regression parameters (the β’s).
• Applications: generalizes the Kaplan-Meier estimate of survival; with no covariates except treatment, leads to the log rank test; with covariates and treatment, tests differences between treatments, adjusting for prognostic factors; with time-dependent covariates, asks whether a treatment or covariate effect changes with time. Other: graphical procedures, regression diagnostics, residuals.
• Theory: partial likelihood; multivariate failure times; modeling of epidemiological data (e.g., case-control, cohort).

The impact on biostatistics of Cox’s paper on regression methods has indeed been extraordinary (table IV). The proportional hazards regression model relates the hazard function at time t, λ(t), to a baseline hazard function, λ0(t), and an exponential function of covariates, the x’s, which may designate treatment or characteristics of patients, such as age, white blood count, and gender, with regression parameters (the β’s) to be estimated. Cox allowed the baseline hazard to be completely general, which made the model applicable to a wide variety of situations. By considering the hazard function rather than some other function of survival time, the model can also deal with right-censored data, the main complication in survival data analysis. Cox used a partial likelihood approach, originally based mainly on intuition but later theoretically justified, to obtain estimates of the regression parameters. When there are no covariates, the model leads to the Kaplan-Meier estimate of the survival function. If treatment is the only covariate, the model leads to the log rank test or the Cox-Mantel test. A major application is when both covariates and treatments are available; then the model provides a test of differences between treatments, adjusting for differences in the prognostic features of the patients in the various groups. The model is also applicable when there are time-dependent covariates. There have been many theoretical developments deriving from the Cox model, including partial likelihood, multivariate failure times, and modeling.

BIOSTATISTICS IN THE NEW MILLENNIUM

I’d like now to consider the future of biostatistics in the new millennium [7]. The first question, however, is whether there will be biostatistics in the new millennium at all. Some research colleagues have remarked that biostatisticians only calculate sample size and statistical power for studies, and that straightforward computer programs can carry out nearly all the calculations that research investigators need to make. However, I believe that there will be a future for biostatistics in the new millennium and would like to consider briefly those areas that should prosper.

One reason for optimism about the future is that uncertainty in the data will not disappear. Individuals will vary in their chance of developing breast cancer, or in their chance of responding to certain types of cancer treatment, even after statistical refinements have improved predictions. Variability of experimental data will continue in the 21st century and beyond. The domain of statisticians is the understanding of uncertainty through statistics and probability, and this will continue to be important. The tools that we need may change, but understanding and utilizing the concepts should continue to be the ‘bread and butter’ of statisticians. Techniques of randomization, likelihood methods, confidence intervals and significance tests will be in the tool kit of statisticians for some time to come. However, space should be left in this tool kit for newer methods, such as Bayesian rather than frequentist methods, neural networks for problems previously solved by regression analysis, and data mining for problems previously solved by multivariate analysis.
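To make the partial likelihood at the heart of the Cox model concrete, here is a toy Python sketch; the data and the crude grid search are illustrative only, not a production fitting routine. Each death contributes the ratio of its own relative hazard to the total over the risk set, and β is chosen to maximize the product.

```python
import math

# Toy partial likelihood for a Cox model with a single covariate x
# and no tied death times. Each death at time t_i contributes
# exp(b*x_i) / sum_{j in risk set} exp(b*x_j); censored subjects
# appear only in risk sets. The baseline hazard cancels out.

def log_partial_likelihood(b, subjects):
    """subjects: list of (time, event, x); event = 1 death, 0 censored."""
    ll = 0.0
    for t_i, event, x_i in subjects:
        if event:
            risk = [x for t, _, x in subjects if t >= t_i]
            ll += b * x_i - math.log(sum(math.exp(b * x) for x in risk))
    return ll

# Hypothetical data: x = 1 treated, 0 control.
data = [(2, 1, 1), (3, 1, 0), (5, 0, 1), (7, 1, 0), (9, 0, 1)]
grid = [i / 100 for i in range(-300, 301)]
b_hat = max(grid, key=lambda b: log_partial_likelihood(b, data))
print(b_hat)
```

In practice the maximization is done by Newton-type methods rather than a grid, but the cancellation of the baseline hazard from every term is exactly what lets the model leave λ0(t) completely general.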
Several areas seem likely to develop positively in the 21st century: clinical trials, survival analysis, and statistical genetics.

Clinical trials in the 21st century

There will be improved methods of stratifying patients based on more refined methods of predicting patient outcomes, such as logistic and Cox regression, and adjustments for stratification factors in analyses. In the conduct of trials, transmission of paper through the mail will disappear. Medical records will be electronic, and interim analyses and stopping rules can in principle be conducted in real time. With the proliferation of new therapies requiring evaluation, there will be emphasis on timely results. There will be more emphasis on Bayesian than frequentist methods, especially since databases of patients will be available for determining prior distributions.

Survival analysis in the 21st century

I turn now to a topic that has had explosive growth in the last half of the 20th century: survival analysis. I predict that survival analysis will survive, and that the Kaplan-Meier method of estimating survival curves and the Cox regression model will continue to be much referenced in the 21st century. However, even now questions are raised about the assumptions of the proportional hazards model, so I predict that a challenge will be to find alternatives to the Cox model in specific applications. For example, hazard functions in two groups, rather than being proportional, sometimes converge after the start of treatment; accelerated failure time models can deal with this problem. Also, as more is known about particular diseases and their survival or disease-free survival times, there may be increased emphasis on parametric models. Poisson regression models are more flexible to apply than the Cox model, so increased utilization of these models is to be expected. Finally, some of the extensions that may be expected are in the fields of longitudinal data, multiplicity of types of failure, and multiple failures per individual.

Statistical genetics in the 21st century

The genome revolution is underway and, to paraphrase a statement by P. H. Abelson, a former editor of Science, ‘The genome revolution will have as great an impact on society as the industrial and computer revolutions.’ Table V lists some of the topics that will experience substantial growth.

Table V. Statistical genetics in the 21st century.
• Challenges in genomics (Human Genome Project): genetic maps to trace inheritance of disease; identifying genes in DNA sequences
• Challenges in molecular medicine: molecular diagnosis; drug discovery; drug development
• Challenges in the analysis of DNA and RNA microarrays: differential expression of genes in tissues; clusters of genes relating to disease; gene expression profiles of prognostic importance

The Human Genome Project is an international effort to analyze the structure of the human genome as well as the genomes of certain experimental organisms. This project is near completion, and challenges in this area include creating genetic maps to locate polymorphisms that can be used to trace the inheritance of disease. Also, identifying genes in DNA sequences using such methods as Markov models, neural networks, and decision trees will be important in determining those sequences related to disease. Challenges in molecular medicine include work related to diagnosis, drug discovery, and drug development. A topic of particular interest is developing approaches to the analysis of DNA and RNA microarrays. The diagnostic, prognostic, and therapeutic importance of candidate cancer genes is yet to be fully evaluated.

Evolution of statistics in the 21st century

I have reviewed some of the major triumphs in biostatistics in the second half of the 20th century and spoken very briefly about areas that may be expected to have a bright future in the 21st century. In the 21st century, the challenge for biostatisticians in oncology will be to collaborate with biomedical scientists in broader and more diverse areas than before. Biostatistics will progress more towards a branch of information science than of mathematics or mathematical statistics. A challenge for oncologists is to cure certain forms of cancer, and I expect that successful biostatisticians in the 21st century will be working closely with oncologists. I only hope that I am around for enough of the century to see some of these predictions come true.

REFERENCES

1 Breslow NE. Statistics in epidemiology: the case-control study. JASA 1996 ; 91 : 14-28.


2 Cornfield J. A method of estimating comparative rates from clinical data. JNCI 1951 ; 11 : 1269-75.
3 Cornfield J. A statistician’s apology. JASA 1975 ; 70 : 7-14.
4 Cox DR. Regression models and life tables (with discussion). J R Stat Soc (B) 1972 ; 34 : 187-220.
5 Doll R, Hill AB. Smoking and carcinoma of the lung. Br Med J 1950 ; 2 : 739-48.
6 Efron B. Bootstrap methods: another look at the jackknife. Ann Stat 1979 ; 7 : 1-26.
7 Gehan EA. Biostatistics in the new millennium: the consulting statistician’s perspective. Stat Meth Med Res 2000 ; 9 : 3-16.
8 Kaplan EL, Meier P. Non-parametric estimation from incomplete observations. JASA 1958 ; 53 : 457-81.
9 Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986 ; 73 : 13-22.


10 Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. JNCI 1959 ; 22 : 719-48.
11 Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 1966 ; 50 : 163-70.
12 Mantel N. A personal perspective on statistical techniques for quasi-experiments. In: Owen DB, Ed. On the history of statistics and probability. New York: Marcel Dekker, Inc.; 1976. p. 103-29.
13 Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc (A) 1972 ; 135 : 370-84.
14 Tukey JW. The future of data analysis. Ann Math Stat 1962 ; 33 : 1-67.