Pergamon Press Ltd
J Chron Dis Vol. 34, pp. 159 to 171, 1981. Printed in Great Britain
CONCEPTS AND PROBLEMS IN THE EVALUATION OF SCREENING PROGRAMS
PHILIP C. PROROK*, BENJAMIN F. HANKEY* and BRIAN N. BUNDY†
*Biometry Branch, National Cancer Institute, Bethesda, MD 20205, U.S.A.
†Roswell Park Memorial Institute, Buffalo, NY, U.S.A.
(Received in revised form 25 June 1980)
Abstract: Relationships between the clinical trial to evaluate therapy and the randomized controlled trial to determine the effectiveness of screening for the early detection of disease are examined. The concepts of covariate, response and pseudo variable are defined in relation to the time at which a variable is determined relative to the time of randomization. Pseudo variables are those which are determined after randomization and are therefore at least potentially affected by treatment. These are found to be inappropriate for most retrospective stratification and response measurement purposes in the screening setting. Specific screening problems addressed within this context include the uses of various definitions of age, the age at which screening is effective, and the marginal contribution of a single screening modality.
1. INTRODUCTION
As part of the current thrust toward an increasing emphasis on prevention in medicine, there is considerable interest in screening for the early detection of disease as a secondary prevention methodology. Accompanying this interest there has been, on the part of some individuals, an awareness that the potential benefits and costs of such preventive methodology demand careful and rigorous evaluation. Attempts at performing such evaluation, and some of the controversy that has ensued, have served to clarify some of the concepts and problems central to such an evaluation. It is these concepts and problems which are discussed in this paper, mainly within the context of breast cancer screening, as it is there that the best available data exist.
The primary source of our current knowledge of the impact on breast cancer mortality of screening asymptomatic women for breast cancer is a study begun in 1963 and carried out by the Health Insurance Plan of Greater New York (HIP). Approximately 62,000 women were randomly assigned to either a Study Group or a Control Group. The Study Group was offered four annual screens including physical examination and mammography, and the Control Group followed their usual practices of medical care. Details of the Study design and participation in the screening program can be found in Shapiro [1]. The HIP Study is one of the few scientifically rigorous studies ever done to evaluate the effectiveness of screening in reducing mortality from disease. Based on early results of the HIP Study, which showed a one-third reduction in breast cancer mortality in the Study Group, the Breast Cancer Detection Demonstration Projects (BCDDP) were begun in 1972 and involved the annual screening for a 5 yr period of some 270,000 women at 26 locations throughout the United States.
Subsequent to the creation of the BCDDP, Bailar [2] raised the question of the possibility of radiation-induced cancers being caused by the use of mammography in screening, and considerable controversy has ensued concerning the risks, costs and benefits of mammography in screening asymptomatic women. Three scientific working groups on Epidemiology, Radiation Risk, and Pathology were commissioned by the National Cancer Institute in 1975 to review the findings and data of the HIP Study and other studies in an attempt to resolve some of the controversy surrounding the use of mammography. Later a fourth working group was commissioned to review the BCDDP. The primary purpose of this paper is to consider and reconsider some of the statistical problems encountered by the
Pathology Working Group (PWG). As such, the discussion is in part an extension, modification or generalization of some of the results presented in the Pathology Working Group Report (PWGR) [3] based upon further research in the intervening period. Attention is focused on the relationship between clinical trials to evaluate therapy of clinically overt disease and randomized controlled trials to evaluate the effectiveness of early detection and treatment of preclinical disease. In particular, the appropriateness or inappropriateness for stratification and data analysis purposes of quantities that are based upon variables which are defined or measured after randomization is addressed. Within this context, more specific issues considered include the proper uses of various definitions of age, the age at which screening becomes effective or should be implemented, and the independent or marginal contribution of one screening modality when more than one modality is used. Since the PWG was involved in reviewing the HIP breast cancer screening study, the discussion and examples are in the context of breast cancer screening. However, many of the arguments are applicable to screening for other diseases. Most of the data used in this paper were generated by the PWG or supplied to the PWG by the HIP staff. They are not the latest available data from the HIP Study and they represent a view of the Study at only one point in time. The analyses of these data contained in this paper serve only to illustrate the problems and attempted solutions being considered.
2. COVARIATES, RESPONSES AND PSEUDO VARIABLES
2.1 Framework of therapy clinical trial
In attempting to clarify some of the issues involved in screening evaluation, it is useful to consider the analogies that exist with the more familiar concepts of the clinical trial. The term clinical trial as used in this paper is taken to mean a randomized controlled study to evaluate therapy for clinically diagnosed disease. This involves a more or less standard sequence of steps. A hypothesis to be tested is carefully formulated and a study design and protocol are developed. Patients who are eligible according to the protocol are entered into the study and randomized to one of the treatment groups. It is important to emphasize for later comparison that the point of entry into a clinical trial for a given individual is at or very near the time of diagnosis of primary or recurrent disease for that person. At the time of entry and/or randomization an initial collection and recording of data takes place. These data may be termed covariates, prognostic factors or baseline variables. They possibly affect or are related to therapeutic outcome. Many are characteristic of the disease at the time of diagnosis. They may be classified into six major categories [4]: demographic factors, history of illness, current state of disease (includes stage, symptoms, etc.), histology, physical condition of patient, and institution (may also be geographic region or related variable). Following randomization, therapy is administered according to protocol and then follow-up and monitoring commence and are continued for some specified time period. A second data collection takes place during follow-up. The data collected here are related to what might be termed responses, endpoint variables or outcome measures. It is these responses which should indicate or be correlated with a treatment effect, be it toxicity or therapeutic enhancement.
Assuming that care has been taken to assure equality and completeness of follow-up in all randomized groups, the data are then analyzed when follow-up has ended (possibly intermittently as well) and the results of the trial are reported. Given this general scenario, it is of interest to focus on the uses and implications of each of the two types of information collected in a clinical trial, covariates and responses. Brown [5] lists the three important uses of covariates in the design and analysis of clinical trials. First, they are used to define patient eligibility. Second, they are used to group patients into homogeneous strata or blocks for the purpose of stratified randomization. This is prospective stratification. Third, they are used in data analysis to assure that comparisons of results in different treatment or randomized groups are adjusted to account for variations in prognosis at the time of randomization. This use is termed retrospective stratification.
The advantages of randomization have been discussed by several authors [6, 7]. They are essentially three. First, bias is eliminated in the assignment of treatments so that treatment comparisons will not be invalidated by selection of patients of a particular kind, whether consciously or not, to receive a particular form of treatment. This means that the expectation of response variables is equal in the study groups under the null hypothesis being tested. Second, the treatment groups tend to be balanced in covariates whether or not these variables are known. Thus the treatment groups being compared will tend to be truly comparable. Third, the validity of statistical tests of significance is guaranteed. Non-randomized designs are deficient in these respects. While it is true that randomization ensures that, on the average, the treatment groups are comparable with respect to all factors that affect prognosis, this need not necessarily occur in any single trial. This does not affect the validity of significance tests based on randomization in a given trial. Nonetheless, one should not ignore important covariates in attempting to make the groups as comparable as possible. Two approaches for accomplishing this, prospective and retrospective stratification, and the current controversy surrounding their relative use are discussed by Byar [6] and Peto [7] and summarized by Brown [5]. Prospective stratification employs covariates to define strata and then randomization is performed within the strata. This gives additional assurance that the patients assigned to the randomized groups are balanced as closely as possible with respect to prognostic factors at baseline in any single trial.
Provided the stratification covariate is correlated with the response which is being used to indicate treatment effect in the trial, prospective stratification may further ensure that the expectation and indeed the distribution of the response variable is equal in the treatment groups at the start of the study. The statistical effect of the extra balance provided by prospective stratification is to provide a more precise comparison, a more sensitive experiment. The added precision and increased power of statistical tests result from smaller standard errors that come about through more equal allocation of patients within strata to the treatments under investigation. Even if a prognostically important covariate was not used to stratify prospectively, one should, during data analysis, consider the effects of the variable separately and/or employ retrospective stratification using the variable with some appropriate adjustment technique. In fact, some statisticians recommend only retrospective stratification during analysis after unrestricted randomization, except, perhaps, in very small trials [7]. In any case, the use of prospective stratification does not obviate the need to take covariates into account in the analysis. If prospective stratification is employed, but the covariates are then ignored in the analysis, the computed significance level is likely to be invalid. It is ordinarily larger than the p value calculated on the basis of the stratified randomization procedure that was actually used, and the sensitivity of the study is diminished. Green and Byar [8] present some results on this subject. Thus, if a stratified randomization procedure is adopted, the stratification covariates must be used in the analysis or the advantage of the more complex design is compromised.
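As an illustration of prospective stratification, the following Python fragment sketches one common way of randomizing within strata. It is a minimal sketch under stated assumptions, not the procedure of any trial discussed here: the use of permuted blocks, the block size of 4, the 5 yr age-at-entry strata, the artificial cohort and all function names are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def stratified_block_randomize(subjects, stratum_of, arms=("study", "control"),
                               block_size=4, seed=0):
    """Assign subjects to arms using permuted blocks within each stratum."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    pending = defaultdict(list)  # stratum -> assignments left in the current block
    assignment = {}
    for subject in subjects:
        stratum = stratum_of(subject)
        if not pending[stratum]:
            # Start a new block containing each arm equally often, in random order.
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
            pending[stratum] = block
        assignment[subject] = pending[stratum].pop()
    return assignment


# Hypothetical cohort: subject id -> age at entry (artificial values, 40 to 64).
age_at_entry = {i: 40 + (i * 7) % 25 for i in range(200)}


def five_year_band(subject_id):
    """Stratum is the 5 yr age band at entry, a covariate fixed at randomization."""
    return (age_at_entry[subject_id] // 5) * 5


arms = stratified_block_randomize(list(age_at_entry), five_year_band)

# Tabulate arm sizes within each stratum to inspect the balance achieved.
counts = defaultdict(lambda: {"study": 0, "control": 0})
for subject_id, arm in arms.items():
    counts[five_year_band(subject_id)][arm] += 1

for band in sorted(counts):
    print(band, counts[band])
```

With blocks of size 4 and two arms, the arm sizes within any stratum can differ by at most 2, which is the extra balance that prospective stratification is meant to provide. The stratifying variable must still be carried into the analysis, as noted above.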
2.2 Criteria for valid vs pseudo variables
Regardless of the philosophy one espouses regarding prospective and retrospective stratification, it is emphasized that a covariate ‘must be specified at the start of the study, so that it can be carefully defined in the study protocol, and reliably and accurately observed and recorded at baseline for each patient’ [5]. Since it is defined at entry, the covariate must necessarily be independent of treatment effect since it is established before treatment is instituted. To be a valid covariate, then, the variable must be specified (that is collected or measured) at the time of entry (randomization) for all individuals in the study. If the variable is not specified at randomization, it may still be a valid covariate if the basis for its definition is time invariant in both the presence and absence of the
treatment under study, so that its value at randomization can be unequivocally established. Of course, responses should also be carefully defined in the study protocol and reliably and accurately observed. Since they are recorded during or after treatment, responses are not independent of, but rather indicative of, treatment effect. Note, however, that to be a valid response variable, the basis for the definition of the variable (such as the beginning of a time interval or the denominator of a rate) should be both established at randomization and time invariant. Furthermore, a valid response variable must have equal expectation under the null hypothesis of no treatment effect in all randomized groups. The determination of the validity of a response variable is thus tied to the null hypothesis being considered. This is not an issue with valid covariates because they are specified at randomization, and by the very nature of the randomization process, they have equal expectation under the null hypothesis as well as under the alternative hypothesis. (Note that this is not the case for non-randomized studies.) Examples of valid covariates which are recorded at randomization in a clinical trial are age at diagnosis, stage or extent of disease at diagnosis, histologic type, and interval from first symptoms to diagnosis, the length of which is known at randomization. An example of a valid covariate which may not necessarily be recorded at randomization but which is time invariant, and therefore the value of which at randomization can be determined, is eye color. Typical valid response variables in a clinical trial, depending upon the hypothesis being tested, are disease free interval or time to recurrence, survival time measured from diagnosis, the survival rate at some time point such as 5 yr, and the proportion of cases exhibiting a certain toxicity level. 
For the latter two examples, the denominator is established at randomization, while for the first two, the recording of duration begins at randomization. It may happen in certain circumstances that covariate-like information or response-like information is collected during follow-up and an attempt is then made to use the data as if a true covariate or response had been observed. Serious problems can arise when this is done. It is necessary then to consider which kinds of variables are appropriate for use as covariates in prospective and retrospective stratification and as responses to measure treatment effect. If the definition of the variable is based on a quantity which is not time invariant or is determined during follow-up (after randomization), then the variable is potentially affected by and confounded with the treatment being tested. It can usually be shown that such a variable does not satisfy the necessary properties of a valid covariate or response stated above. The variable will then be called a pseudo variable, either a pseudo covariate or a pseudo response. It is invalid to use a pseudo covariate for stratification purposes or to use a pseudo response to measure the impact of treatment in the absence of methods to adjust the variable definition for the accompanying confounding. To our knowledge, no such methods exist. A pseudo covariate is not determined until after randomization and so presents no problem with regard to prospective stratification since it is never available to be used for that purpose. However, since a pseudo covariate may be influenced by treatment, any retrospective strata created by its use may be less comparable rather than more comparable between randomized groups, and the goal of improving the comparison of like with like will not be reached.
Moreover, since the pseudo covariate is presumably correlated with the endpoint of interest as well as with treatment, the probability or expected value of the endpoint under the null hypothesis may then be quite different in the randomized groups within strata in the absence of a treatment effect, and consequently the validity of statistical tests is compromised. Any adjustment for the pseudo covariate during analysis will alter the treatment effect [6]. Analysis using pseudo response variables can lead to similar difficulties and incorrect inferences. The very definition of response is changed in one randomized group compared to another. Any effect that is observed is thus a combination of a true treatment effect (if there is one) plus the confounding of treatment with pseudo response, yet it is not possible to separate the two or determine their relative magnitudes with any degree of precision. Byar [6] gives an example where this type of problem could arise when the rates of
adherence to two therapies differ. In this situation, it would be improper to use adherence to therapy as a covariate since a difference in the side effects of the two treatments could be causing the difference in adherence. Armitage and Gehan [9] discuss the same potential pitfall and give an example from a series of trials of tetanus antitoxin where the investigators used a prognostic classification of patients based on the period of onset; i.e. the time between first symptoms and reflex spasms. However, some patients were admitted to the hospital and received specific treatment before spasms occurred. For these patients, the period of onset was not measurable until after treatment had started, and its value may have been influenced by treatment. The data were also examined another way, based upon time from first symptoms to hospital admission. Here the category into which a patient fell was defined at the time of admission, before treatment started. The result was a quite different prognostic classification. An interesting potential pitfall occurs when an attempt is made to use a response in the role of a covariate, so that a variable which may be a valid response becomes a pseudo covariate. Time to recurrence is a valid response variable if the null hypothesis is that there is no difference in recurrence time between groups, but suppose in an analysis one retrospectively stratifies by recurrence time, reasonably believing it to be correlated with survival, while using the 5 yr survival rate as the response. The null hypothesis then is that there is no difference in survival between groups. Suppose further that the treatment is correlated (confounded) with recurrence time but has little or no impact on survival. In this setting recurrence time, although measured from randomization, is not determined until some point during follow-up which is intermediate between randomization and the final outcome which is survival or death.
If one then stratifies on recurrence time to facilitate the comparison of treatment and control groups, recurrence time becomes a pseudo covariate. If recurrence time is affected by treatment, the disease characteristics and survival of patients in recurrence time strata in the treatment group may be more dissimilar than, rather than more similar to, those of patients in the corresponding strata in the control group. Hence the purpose of retrospective stratification is defeated and the validity of the analysis is open to question.
2.3 Randomized controlled trial of screening and pseudo variables
The companion medical experiment in the screening setting to the clinical trial of therapy will be termed the randomized controlled trial (RCT). The purpose of the RCT is to provide a medium for the rigorous scientific evaluation of the benefit (and cost if so designed) of a screening test or program. The general framework of the RCT is, on the face of it, very similar to that of the clinical trial. There are, however, some key differences which affect the definition of covariates, responses and pseudo variables and ultimately have a bearing on the proper methods for analyzing such a study. Most particularly, treatment (that is, early diagnosis plus therapy) does not occur at or near the time of randomization (entry) for all, nor necessarily even for most, of the patients in randomized groups of a RCT. Thus, unlike in the clinical trial, all covariates or responses defined according to some variable measured at diagnosis are potentially and very likely pseudo variables and caution should be exercised in using them for analysis. In fact, it is the lead time and length biases which are inherent in a screening program that are mainly responsible for creating pseudo variables out of variables which are valid covariates or responses in a clinical trial.
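The way in which lead time and length bias turn quantities measured at diagnosis into pseudo variables can be illustrated with a toy simulation. The sketch below is purely illustrative and rests on assumptions that are not taken from any study: the onset and sojourn time distributions, a perfectly sensitive screen at 0, 1, 2 and 3 yr, a 5 yr case-accrual cut-off, a 10 yr mortality horizon, and a time of death that screening does not change (the null hypothesis). As a simplification standing in for two large randomized groups, the same simulated natural histories are evaluated under both detection rules.

```python
import random


def simulate(n=20000, p_disease=0.05, seed=1):
    """Toy screening trial under the null: screening never changes time of death."""
    rng = random.Random(seed)
    screens = (0.0, 1.0, 2.0, 3.0)       # assumed screening times, years from entry
    accrual_cutoff, horizon = 5.0, 10.0   # assumed case cut-off and mortality follow-up
    out = {arm: {"cases": 0, "alive_5y_after_dx": 0, "deaths": 0}
           for arm in ("screened", "control")}
    for _ in range(n):
        if rng.random() >= p_disease:
            continue                       # never develops the disease
        onset = rng.uniform(-10.0, 15.0)   # start of detectable preclinical phase
        sojourn = rng.expovariate(1 / 2.0) # preclinical duration, mean 2 yr
        t_clinical = onset + sojourn       # diagnosis under usual care
        if t_clinical <= 0:
            continue                       # diagnosed before entry: not eligible
        # Slow-growing disease (long sojourn) also has long post-diagnosis survival.
        t_death = t_clinical + 3.0 * sojourn
        hits = [s for s in screens if onset <= s < t_clinical]
        t_dx = {"screened": hits[0] if hits else t_clinical,
                "control": t_clinical}
        for arm in out:
            # Valid response: disease deaths counted from entry, denominator n.
            if t_death <= horizon:
                out[arm]["deaths"] += 1
            # Pseudo variables: anything defined at or measured from diagnosis.
            if t_dx[arm] <= accrual_cutoff:
                out[arm]["cases"] += 1
                if t_death - t_dx[arm] > 5.0:
                    out[arm]["alive_5y_after_dx"] += 1
    return out


result = simulate()
for arm, r in result.items():
    print(arm, r, "5 yr survival from diagnosis:",
          round(r["alive_5y_after_dx"] / r["cases"], 3))
```

In this sketch the number of disease deaths counted from entry is identical under both rules by construction, whereas the number of cases found by the cut-off and the 5 yr survival measured from diagnosis are not, even though screening confers no benefit. Quantities anchored at randomization behave as valid responses; those anchored at diagnosis do not.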
To clarify, the study groups of the RCT are created by (possibly stratified) randomization of a large population of eligible, asymptomatic, presumably healthy individuals. Thus the diagnosis of disease is not a prerequisite for entry into the study; in fact, it is a contraindication. The covariates are therefore determined at entry (randomization) possibly several months or even years before the diagnosis of the disease under investigation. Of course only a small fraction of the individuals in the study ever get the disease, but the covariates are determined for everyone. Similarly, the response or endpoint (the proper one being life or death from the disease in question as discussed further below) is determined for everyone who enters the study, although only a few are directly affected
by the treatment which is being evaluated. During the follow-up period between randomization and the final endpoint, numerous quantities can be observed or measured and numerous variables defined. All such variables are likely to be pseudo variables. These include the potential covariates associated with diagnosis, such as age at diagnosis and stage of disease at diagnosis, and potential responses such as case finding rate, stage of disease at diagnosis, and survival rate. Age at diagnosis and survival rate are considered in more detail in the sequel. A brief discussion of the others follows. Ideas related to the discussion here appear in [10] and [11]. Stage of disease at diagnosis is likely to be, in fact potentially very strongly, influenced by early detection. In fact, one would hope that a screening program is having a positive effect and is indeed finding cases at an earlier stage than would be the case in the absence of screening. However, because of lead time and length bias effects (see for example Zelen [12]), the definition of stage can be altered in a screened group as compared to a control group. That is, since screening tends to detect slower growing disease, the distribution of disease natural histories in a given stage in the screened group may be different from that in the corresponding stage in the control group. The problem is likely to be most pronounced for stage 1 or localized cases which tend to be detected in greater proportion in a screened population relative to a control population. A RCT has a cut-off point, perhaps 5 or 10 yr after the study starts, after which new cases of disease are no longer accrued. Consequently, lead time and length bias can result in cases of slow growing, even nonprogressive, disease being detected in stage 1 or stage 0 in the screened group to a greater extent than in the control group.
Their counterparts in the control group will not be as likely to surface by the cut-off point, if they ever surface, so that the screened group will contain a higher proportion of Stage 1 or 0 cases even if screening has no effect on the natural history of the disease [13]. Thus, an enhanced comparison of like with like using retrospective stratification by stage will not be realized. Stage of disease at diagnosis is clearly a pseudo covariate, both because of the effects of the screening biases, and because it cannot be measured for all individuals in the study. With regard to responses and pseudo responses, consider the case finding rate (or incidence rate). This can be defined as the number of cases found in a population over a given period of time divided by the number of individuals in the population at the start of the study. For reasons cited in the preceding paragraph, screening by its very nature will tend to find more disease than the usual methods of incidence recording in the absence of screening, at least in the early years of a screening program. If this is the extent of one’s interest; that is, if the hypothesis being tested is concerned only with whether or not screening is finding more cases than the usual practices, then the case finding rate satisfies the requirements of a valid response variable. However, the most important and usual endpoint of interest is mortality, and the hypothesis of interest is most often related to whether or not screening can effect a reduction in disease mortality. Cases found in both a screened group and a control group are found, for the most part if not entirely, during follow-up, not at randomization. Since the screening biases tend to increase the number of cases found, there is obviously confounding with treatment. This can occur whether or not the screening program has any effect on mortality.
Thus under the null hypothesis of no effect of screening on mortality, the case finding rate cannot be guaranteed to have equal expectation in all randomized groups. Consequently, in the usual situation, the case finding rate is a pseudo response. It may be an indicator to suggest that screening might be doing something, but it tells one nothing about the impact which screening may or may not have on the natural history of the cases discovered. In a similar vein, if screening results in a shift to an earlier stage distribution in the screened group compared to the control group, one can interpret this as an indication that screening might be beneficial in reducing mortality. If one is interested only in whether or not such a shift takes place, then stage of disease at diagnosis may be a valid response in that sense. However, even here one must be cautious. If a shift in stage distribution does occur, there are at least two explanations. One is the obvious, that
screening has detected cases earlier than usual in the screened group so that the stages of such cases are earlier than those of their control group counterparts. The other is that the ‘extra’ cases found in the screened group because of lead time and length bias, which are usually found in Stage 1 or 0, and which have no control group counterparts (see above), are responsible for the shift in distribution, whereas the ‘usual’ cases are altered in stage very little if at all. It could be that both factors operate simultaneously, so that if screening does favorably shift the stage distribution, the magnitude of the effect would be exaggerated by ignoring the second explanation. Alternatively, suppose that the null hypothesis being tested is that early detection plus therapy has no effect on mortality from the disease in question. As noted above, length bias will tend to increase the proportion of Stage 1 or 0 cases even if screening has little impact on the ultimate outcome of the disease. Stage is clearly affected by screening and does not have equal expectation in the study and control groups under the null hypothesis. Stage is a pseudo response in this situation. In this regard, it should be made clear that the relationship of stage to mortality in the screening setting has not been established [14]. The magnitude to which a given shift in stage distribution of cases as a result of screening affects the mortality rate of the disease is relatively unknown at present. One cannot simply use the relationship established in the clinical setting because of length bias. This is an area requiring further research. Hence, although stage of disease at diagnosis may be a measure of response over the short term, it is of little use in accurately establishing the effect of screening in altering the natural history of disease.
3. SOME SPECIFIC PROBLEMS IN SCREENING INVOLVING COVARIATES, RESPONSES AND PSEUDO VARIABLES
The discussion in this section centers on three of the many possible problems in the evaluation of a screening program where the designation of a variable as a covariate, response or pseudo variable has implications for the method and interpretation of analysis that is performed. The problems considered are the appropriate age definition, the age at which screening is effective, and the marginal contribution of a screening modality.
3.1 Definition of age
Three definitions of age have been used in reported analyses of screening programs: age at entry, age at diagnosis and age at death [2, 3, 15, 16, 17]. In the appropriate circumstances, analyses based on any of these ages can likely yield useful information. However, potential problems can arise with certain age definitions, and these should be kept in mind in the design and analysis of a screening program. The focus here is on such problems with respect to a prospective randomized controlled trial with mortality endpoint (RCTM) with stratified randomization by age as part of the study design, where the aim is to determine the benefit of a screening program in terms of the magnitude of a reduction in mortality from the disease in question. Some of the applications and implications of each age definition with regard to the analysis of a RCTM are considered. There are several possible uses of age in the analysis of a screening program. For example, comparisons of ages at diagnosis and death in various study subgroups with the same ages in populations from other studies are of interest. Caution must be exercised, however, as the results of such comparisons are only suggestive due to noncomparability of populations. Within a screening trial, one might compare age at diagnosis with age at entry within various subgroups of the study population in an effort to produce information on whether screening is being done at those ages at which a substantial number of cases can be found. Of primary interest in a RCTM is the comparison of a study group with a control group for the purpose of ascertaining the benefit, if any, to be derived from screening. In making such comparisons on an age-specific basis, the appropriate age definition(s) to use can be established by referring to the scheme developed in Section 2 for classifying
variables as either covariates, responses or pseudo variables. With reference to this scheme, one notes that age at diagnosis and age at death are both established during follow-up, not at entry. Relative to usual medical care, age at diagnosis is presumably altered by screening, while age at death could be changed by early diagnosis and therapy. Thus both are confounded with treatment and are clearly pseudo covariates. Only age at entry is a valid covariate. Use of age at entry as a covariate would be particularly appropriate when stratification by age is part of the study design. Age at diagnosis and age at death, while pseudo covariates, may be valid response variables, depending upon the situation. If the null hypothesis is concerned only with a shift to younger ages at diagnosis or only with a shift to older ages at death as a result of screening, then age at diagnosis or age at death, respectively, is a valid response. The single most important test of a cancer screening program is whether it can effect a significant reduction in the mortality rate of the disease in question. The mortality rate of a given disease in a given population is defined as the ratio of the total number of deaths from the disease in a given period of time, say y years, to the total number of persons or person-years at risk during the y year time period in the population. The y year time period begins at some fixed point in chronological time which is the same for everyone in the population. Date of entry into a study is a typical starting point. Some function of mortality or a related measure which conveys the same information may be suitable as well. The mortality rate is then a valid response variable since the basis for its definition, namely the denominator, is established at entry and it clearly has equal expectation in all randomized groups under the null hypothesis that screening has no impact on mortality.
Only the numerator changes with treatment, as one would hope for a response variable. Strictly speaking, this holds when the denominator is the number of persons at risk at the start of the study. Person-years could conceivably be affected by treatment, although the confounding would probably be minimal.

Whether or not a mortality reduction has occurred can only be decided by comparison of the mortality rate in a study population (at least part of which is screened) with that in a comparable comparison or control population. The only proper way to accomplish this comparison is through a prospective randomized controlled trial, such as the HIP Study. Other approaches involve biases which render the comparison suspect [6, 7, 14, 18]. Another important endpoint that can also be considered is the survival rate (equivalently the case fatality rate), but this is subject to certain biases for which no completely satisfactory adjustment procedure is available [3, 12]. In fact, survival is a pseudo response (see Section 3.3). Consequently, the discussion in this Section will focus on the mortality rate and its relationship with age.

The primary use of the mortality rate in screening program evaluation is in the comparison of the number of deaths per unit population in a total study group with the number of deaths per unit population in the totality of an appropriate comparison group, where appropriateness is established by using such techniques as stratification or matching together with randomization in the design of the study. A valid mortality rate comparison can then be made between the two groups. A second valid use of mortality is in the comparison of the number of deaths in a subset of a study group with the number of deaths in an appropriately comparable subset of a control group. The comparable subsets can be generated by using either prospective or retrospective stratification along with randomization.
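The mortality rate defined above, with the number of persons at risk at entry as the denominator, can be illustrated with a short Python sketch. The `Group` container and the `mortality_rate` function are hypothetical names introduced only for this illustration; the counts are the Study and Control Group totals reported in Table 1. This is a minimal sketch of the definition, not the analysis procedure used in the HIP Study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """Counts for one randomized group (hypothetical container for illustration)."""
    entered: int  # persons at risk at the start of follow-up, fixed at randomization
    deaths: int   # deaths from the disease within the y-year follow-up period


def mortality_rate(group: Group, per: int = 10_000) -> float:
    """Deaths from the disease per `per` persons at risk at entry."""
    if group.entered <= 0:
        raise ValueError("number of persons at risk must be positive")
    return per * group.deaths / group.entered


# Totals reported in Table 1 (7 yr of follow-up from date of entry).
study = Group(entered=30426, deaths=68)
control = Group(entered=31007, deaths=105)

study_rate = mortality_rate(study)
control_rate = mortality_rate(control)

print(f"Study:   {study_rate:.1f} breast cancer deaths per 10,000 entered")
print(f"Control: {control_rate:.1f} breast cancer deaths per 10,000 entered")
```

Because both denominators are fixed at randomization, any difference between the two rates arises only through the numerators, which is the property that makes the mortality rate a valid response variable.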
For example, in the HIP Study, the population was stratified by five year age groups [19] and then randomization was carried out within these age groups to create the Study and Control Groups. As a result, not only are the overall Groups comparable, but subsets defined by 5 yr age groups are presumably comparable as well, although some degree of non-comparability could exist because of smaller subgroup sample sizes and the resulting larger chance fluctuations. Of course, the subset sample size may not be large enough to make a reasonable or statistically valid comparison. It is important to establish the basis for defining age groups in such an age-specific analysis. Given that prospective stratification was done at the time of randomization and
given the above discussion, it is evident that age categories should be defined according to the study design using valid covariates. Comparison groups are thereby made as comparable as possible and potential bias eliminated as completely as possible for data analysis purposes. In the HIP Study, for example, women were stratified by their age at the time they entered the Study. The appropriate age covariate for viewing HIP data for the purpose of comparative mortality analyses is therefore age at entry, so that comparisons are made between cohorts of women defined at the time of randomization. Certain difficulties must be considered when using other age definitions for defining comparison groups in a screening program. Age-specific subgroups defined by age at diagnosis or age at death can have certain ‘contamination’ problems which render them less comparable than subgroups defined by age at entry. For example, counts of deaths in older age at death categories include cases which were randomized into the study in earlier age groups. To the extent that the cases which are so shifted in the study group differ in disease characteristics from those so shifted in the control group, the comparability of the subgroups is diminished for the purpose of comparing mortality. The tabulation of person-years at risk for the denominator of a mortality rate encounters the same potential ‘contamination’ problem. Subgroups defined by age at diagnosis are plagued by a similar shifting problem. In addition, the cases detected by screening in a given age at diagnosis subgroup of a screened group are subject to lead time and length biases. The result may be a shifting of cases among age subgroups in the study group such that the prognoses of cases in a given age subgroup are more dissimilar to those of the control group cases in the same age subgroup than would be the case in the absence of screening. 
Another source of bias is thereby introduced which can further reduce the comparability of an age at diagnosis subgroup of a study group relative to its control group counterpart. These are further reasons why age at death and age at diagnosis are pseudo covariates whose use in retrospective stratification for mortality comparisons may lead to invalid inferences.

3.2 Age at which screening is effective: proportion dead
Closely related to the question of age definition is the problem of determining the ages at which a screening program is effective, where effectiveness is here considered to be synonymous with the ability to produce a reduction in mortality from the disease. The definitive way to establish age-specific effectiveness is by deliberately designing a screening trial to answer this question. Where this is feasible in terms of recruiting a large enough population within given age categories, it can be accomplished, in principle, by using age stratified randomization of a population into study and control groups. Alternatively, one may be faced with trying to establish age-specific effectiveness using data from a study not specifically designed to answer this question. Difficult problems arise in this context; some of them are addressed in this section.

One approach involves consideration of the proportion of individuals who die from the disease in various age-specific subsets and comparison of these between study and control groups. Such an analysis is discussed here, where the proportion dead in a given group is defined as the number of deaths from the disease in the group divided by the number of individuals who enter the study in the group. The proportion dead is clearly a valid response since the denominator is determined at randomization and the variable has equal expectation under the primary null hypothesis concerning the effect of screening on mortality. Equivalent follow-up in the two groups is assumed.

An illustration using data from the HIP Study is presented. It is important to note first that the HIP Study was not specifically designed to determine the effect of screening on age-specific mortality. Rather, it was set up to assess the impact of screening with physical examination and mammography on breast cancer mortality in the total Study Group.
The HIP Study did demonstrate that screening with the combined modalities was associated with a short and medium term reduction in breast cancer mortality of about 34%. Several authors and reports have concluded that
this reduction was concentrated in women 50 yr of age and older, but recent improvements in screening technology have prompted many to declare that screening is now effective at younger ages as well. Thus methods for investigating the age effect of screening are of interest. One can first exhibit the data in the smallest reasonable age groupings, say 5 yr age groups. The numbers of cases within groups may be small, but large effects may be apparent, or patterns may be discernible. Relevant data are displayed in Table 1. See Shapiro [17] for a more detailed description of the data. (The numbers in the Table differ slightly from those in [17] because of minor corrections made recently by the HIP staff.)

TABLE 1. AGE-SPECIFIC BREAST CANCER MORTALITY IN THE HIP STUDY BY 5 yr AGE GROUPS: 7 yr OF FOLLOWUP FROM DATE OF ENTRY

| Age at entry stratum | Study: individuals entered | Study: deaths | Control: individuals entered | Control: deaths | Percentage reduction in proportion dead | p value | Odds ratio | 99% confidence limits for odds ratio |
|---|---|---|---|---|---|---|---|---|
| 40-44 | 6516 | 16 | 6455 | 13 | -22 | 0.64 | 0.82 | 0.30, 2.22 |
| 45-49 | 7311 | 16 | 7317 | 25 | 36 | 0.11 | 1.56 | 0.67, 3.70 |
| 50-54 | 6921 | 16 | 6985 | 31 | 48 | 0.022 | 1.92 | 0.85, 4.41 |
| 55-59 | 5889 | 13 | 6151 | 21 | 35 | 0.14 | 1.55 | 0.61, 4.01 |
| 60-64 | 3789 | 7 | 4099 | 15 | 50 | 0.095 | 1.98 | 0.59, 6.92 |
| Total | 30426 | 68 | 31007 | 105 | 34 | 0.0045 | 1.52 | 1.01, 2.29 |
The initial columns of Table 1 show the numbers of individuals entering and numbers of breast cancer deaths in 5 yr age at entry categories for both the Study and Control Groups from the HIP Study using data through 7 yr of follow-up from date of entry. Stratification is by the covariate age at entry. The pseudo covariates age at diagnosis and age at death are not used. The Table also contains the percentage reduction in proportion dead in the Study Group by cohort and the associated p value for testing the main effect of a difference in the proportion dead between the Study and Control Groups. The significance tests were performed using the asymptotic approximation to the appropriate exact conditional test for 2 × 2 contingency tables as in Thomas [20]. The last two columns of the Table display the estimated odds ratio and the 99% confidence limits for the odds ratio as calculated in Thomas [20]. The odds ratio is the odds of dying of breast cancer in the Control Group divided by the odds of dying of breast cancer in the Study Group.

An examination of Table 1 reveals that a mortality reduction appears to have occurred in all cohorts except the youngest; however, the only difference that approaches statistical significance is in the 50-54 cohort, given the multiple comparison setting (see for example Tukey [21]). In fact, the confidence intervals for all the odds ratios overlap. The oldest cohort seems to show an effect similar in magnitude to the 50-54 cohort, but the sample size is much smaller. The 45-49 and 55-59 strata are similar in effect as well. The comparison of Total Study Group with Total Control Group reveals a significant reduction in mortality of 34%. The 99% confidence interval for the odds ratio is completely above one. In combining several strata (several 2 × 2 tables) to test the main effect for the total Groups, the methodology assumes no interaction; that is, that the odds ratio is constant over all strata being combined.
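The percentage reduction in proportion dead and the odds ratio described above can be recomputed from the counts in Table 1. The Python sketch below does so under two stated assumptions: the function names are hypothetical, and the odds ratio is the crude (unconditional) cross-product ratio, not the conditional estimate of Thomas [20]. The crude and conditional estimates need not agree in general, and the sketch does not attempt the p values or confidence limits.

```python
# (stratum, study entered, study deaths, control entered, control deaths),
# transcribed from Table 1.
TABLE_1 = [
    ("40-44", 6516, 16, 6455, 13),
    ("45-49", 7311, 16, 7317, 25),
    ("50-54", 6921, 16, 6985, 31),
    ("55-59", 5889, 13, 6151, 21),
    ("60-64", 3789, 7, 4099, 15),
]


def percent_reduction(s_n, s_d, c_n, c_d):
    """Percentage reduction in proportion dead, Study relative to Control."""
    return 100.0 * (1.0 - (s_d / s_n) / (c_d / c_n))


def crude_odds_ratio(s_n, s_d, c_n, c_d):
    """Odds of death in Control divided by odds of death in Study (crude estimate)."""
    return (c_d / (c_n - c_d)) / (s_d / (s_n - s_d))


def pool(rows, label="Total"):
    """Collapse several strata into one 2 x 2 table by summing the counts.

    This ignores the stratification, so it is only a crude summary.
    """
    sums = [sum(col) for col in zip(*(row[1:] for row in rows))]
    return (label, *sums)


for row in TABLE_1 + [pool(TABLE_1)]:
    label, counts = row[0], row[1:]
    print(f"{label:>5}  reduction {percent_reduction(*counts):6.1f}%  "
          f"odds ratio {crude_odds_ratio(*counts):.2f}")
```

The negative reduction and the odds ratio below one in the 40-44 stratum both express the same fact: slightly more deaths per individual entered occurred in the Study Group than in the Control Group in that cohort.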
The test for interaction was performed using an updated version of the Thomas program [20] (or equivalently the approach developed in Halperin [22]). The resulting p value was 0.44, indicating no difference in odds ratio among the five cohorts. (Alternatively, one could assume a Poisson model and analyze the data in that way [23, 24, 25]. The results are virtually identical.)

Despite the lack of statistical significance, there is the suggestion in the Table of an effect in the 50-54 cohort and possibly in the 45-49, 55-59 and 60-64 cohorts as well. The 40-44 stratum shows no hint of an effect. One might then consider combining neighboring strata to obtain larger sample sizes which might yield significant results,
using the methodology referenced above for combining 2 × 2 contingency tables. Caution is advised here, however, as different groupings can lead to different conclusions. For example, when the 40-44 and 45-49 strata are combined, one obtains a 16% reduction in proportion dead with a p value of 0.27, an interaction p value of 0.28, an estimated odds ratio of 1.19 and 99% odds ratio confidence limits of 0.63 to 2.26. Any hint of a mortality effect is minimal and clearly not significant. However, if the 45-49 and 50-54 cohorts are grouped, one observes a mortality impact which is of borderline significance. The reduction in proportion dead is 43% with p = 0.0075. The interaction p value is 0.81, suggesting that the odds ratios in these two strata are more alike than those in the 40-44 and 45-49 strata. The estimated odds ratio is 1.74 with a 99% confidence interval of 0.97-3.15. It would appear from this brief analysis that the 45-49 cohort is more like the older cohorts than the 40-44 group with regard to the effect of screening on mortality.

Unfortunately, this approach is not a complete answer to the problem of determining the age-specific effectiveness of screening because of what one might term the crossover problem. That is, some of the cases from the 45-49 cohort were detected and treated, as a result of screening or otherwise, before age 50 and some at age 50 and over. Consequently, some fraction of the observed mortality reduction in the 45-49 group probably resulted from the impact of screening at age 50 and above, but the magnitude of that fraction is difficult to establish. Also, this analysis views the data at only one point in time. The optimal time point for analysis is not known. It is not clear that analyses at different time points would yield the same results as found here. The determination of the ages at which screening is effective appears to be a very difficult and subtle problem deserving of further research.

3.3 The independent or marginal contribution of one screening modality
Evaluation trials for two cancer sites, breast and lung, have included the following rationale [19, 26]: the screening procedure used in the group randomized to screening was a combination of the best available screening modalities applied at the shortest feasible intervals. This was done in an attempt to apply the maximum possible effort to the problem to see if some reduction in mortality could be realized. While this design can answer the primary question of benefit, it is not suitable for determining the independent contribution of a single screening modality. This problem has been evident recently with regard to the contribution of mammography in breast cancer screening. Neither the value of mammography alone nor the value of adding mammography to physical examination in screening for breast cancer has ever been evaluated in a properly designed study, for any age group of women. For that matter, physical examination alone has never been properly evaluated either. The HIP Study was not designed to answer these questions. This problem may also arise in the future in lung cancer screening where it may be of interest to determine the marginal contribution of sputum cytology or X-ray to a reduction in lung cancer mortality if a reduction is observed in current studies.

Consequently, this section is devoted to a discussion of some aspects of the problem of determining the independent or marginal contribution of a single screening modality to the benefit resulting from a screening program which uses more than one modality. The difficulties which arise with the attempted use of pseudo variables in this context are addressed.

The primary measure of effectiveness in a cancer screening program is the extent to which mortality from the disease in question is reduced. However, mortality reduction is not suitable as a criterion for determining the independent contribution of one screening modality unless the study is properly designed to address this issue.
The reasons for this were discussed in the PWGR [3]. Essentially the reasons can be summarized as the inability to define a mortality rate in this context and the impossibility of finding a comparable control group with which to make a comparison. A procedure one might turn to then would be a comparison of the survival rates of cases in the subgroups of interest. That is, if one wants to determine the marginal contribution of mammography in breast cancer screening where both mammography
and palpation are used, one might simply compare the survival rate at, say, 5 yr for the cases detected by mammography only with the 5-yr survival rate in the control group or some other subgroup. Unfortunately, this leads to an erroneous comparison because of the problems of lead time bias and length bias (see [3, 12] for example). Indeed, the survival rate is clearly a pseudo response variable because the denominator, the number of cases diagnosed in a particular subgroup of interest, is determined not at entry but during the course of the screening program and is almost certain to be affected by treatment. Not only is the number of cases diagnosed likely to be different in a screened group compared to a control group because of lead time bias, but the natural history of some of the screen detected cases may be different as a result of length bias. In essence, the problem being confronted has been stated in Peto [7]: ‘If a group of patients treated one way does better than another group which was treated in another way, there are two possible explanations for this: either the first group got better treatment, or the first group contained disproportionately many patients who would have done well anyway, even if they had been treated in exactly the same way as the second group.’ This quote was in the context of random assignment of subjects to treatment groups. In screening, length bias tends to make even more likely the reality of disproportionate assignment of good prognosis patients to certain subgroups, namely those detected by screening, and perhaps those detected by certain modalities of screening.

Note that the confounding created by the lead time and length biases in this context can occur when one is comparing randomized groups as well as nonrandomized groups. When comparing a study group with a randomized control group, even if the null hypothesis is related to mortality, the survival rate is still a pseudo response because of the screening biases.
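The effect of lead time on a survival comparison can be illustrated with simple arithmetic. The Python sketch below follows a single hypothetical case whose age at death is assumed to be unaffected by screening; all ages, the 3 yr lead time and the function names are invented for illustration and are not taken from the HIP data. It addresses lead time only and does not model length bias.

```python
def survival_years(age_at_diagnosis: float, age_at_death: float) -> float:
    """Survival time measured from diagnosis, as in a case survival rate."""
    return age_at_death - age_at_diagnosis


def is_five_year_survivor(age_at_diagnosis: float, age_at_death: float) -> bool:
    """True if the case is still alive 5 yr after diagnosis."""
    return survival_years(age_at_diagnosis, age_at_death) >= 5.0


# Hypothetical case: screening advances diagnosis but does not change
# the age at death (the null situation for mortality).
AGE_AT_DEATH = 60.0
AGE_AT_CLINICAL_DIAGNOSIS = 57.0   # diagnosis under usual medical care
LEAD_TIME = 3.0                    # assumed advance in diagnosis due to screening
age_at_screen_detection = AGE_AT_CLINICAL_DIAGNOSIS - LEAD_TIME

unscreened_survival = survival_years(AGE_AT_CLINICAL_DIAGNOSIS, AGE_AT_DEATH)
screened_survival = survival_years(age_at_screen_detection, AGE_AT_DEATH)

unscreened_5yr = is_five_year_survivor(AGE_AT_CLINICAL_DIAGNOSIS, AGE_AT_DEATH)
screened_5yr = is_five_year_survivor(age_at_screen_detection, AGE_AT_DEATH)

print(f"Usual care:      survival {unscreened_survival:.0f} yr, 5-yr survivor: {unscreened_5yr}")
print(f"Screen detected: survival {screened_survival:.0f} yr, 5-yr survivor: {screened_5yr}")
```

In this hypothetical case the death is counted once in the mortality numerator under either scenario, yet the case moves from a 5 yr failure to a 5 yr survivor solely because the date of diagnosis was advanced.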
However, in examining the marginal contribution of a screening modality, if one compares survival rates among subgroups of cases defined by modality of detection, modality is the retrospective stratification variable. This is clearly a pseudo covariate since it is defined after randomization, is affected by the screening biases, and is not measured for all individuals in the study. Furthermore, the subgroups are not created by randomization. Survival is still a pseudo response, and additional noncomparability problems are introduced as well.

Despite the several difficulties just mentioned, attempts have been made to estimate the marginal contribution of a single screening modality, namely mammography in breast cancer screening, when more than one modality was used [2, 3, 17]. The point to be made regarding these estimates is that they are all suggestive at best but not precise. Each makes somewhat different, perhaps reasonable, assumptions and arrives at a somewhat different answer from the others. No method, however, is able to adjust for length bias, and the direction of this bias for all the methods results in a tendency to overestimate the contribution of mammography. The extent of this overestimate cannot be determined. Until methodology is devised for analyzing survival data as desired to answer the marginal contribution question, the only rigorous alternative is a properly designed trial.

Acknowledgements-The authors are indebted to a number of individuals for their assistance in this research. The HIP staff provided data to the Pathology Working Group. Sam Shapiro, Nick Day and Judy Chen supplied very helpful suggestions on an early draft of the paper. We held many valuable discussions during the Pathology Review with our pathology colleagues Lou Thomas, Lauren Ackerman, Tom Hanson and Bob McDivitt. The referees made several useful suggestions, and Carol Ball and Betty Hennigan typed the manuscript.

REFERENCES

1. Shapiro S, Strax P, Venet L et al.: Changes in 5-year breast cancer mortality in a breast cancer screening program. Seventh National Cancer Conference Proceedings. New York: American Cancer Society, 1973, pp. 663-678
2. Bailar JC: Mammography: A contrary view. Ann Internal Med 84: 77-84, 1976
3. Thomas LB, Ackerman LV, McDivitt RW et al.: Report of NCI Ad Hoc Pathology Working Group to review the gross and microscopic findings of breast cancer cases in the HIP Study. J Nat Canc Inst 59: 495-541, 1977
4. Zelen M: Importance of prognostic factors in planning therapeutic trials. Cancer Therapy: Prognostic Factors and Criteria of Response, Staquet MJ (Ed.). New York: Raven Press, 1975, pp. 1-6
5. Brown BW: Designing for cancer clinical trials: selection of prognostic factors, presented at the Fifth Symposium on Coordinating Clinical Trials. New Orleans, Louisiana, 1978
6. Byar DP, Simon RM, Friedewald WT et al.: Randomized clinical trials. Perspectives on some recent ideas. New Engl J Med 295: 74-80, 1976
7. Peto R, Pike MC, Armitage P et al.: Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 34: 585-612, 1976 and II. Analysis and examples. Br J Cancer 35: 1-39, 1977
8. Green SB, Byar DP: The effect of stratified randomization on size and power of statistical tests in clinical trials. J Chron Dis 31: 445-454, 1978
9. Armitage P, Gehan EP: Statistical methods for the identification and use of prognostic factors. Int J Cancer 13: 16-36, 1974
10. Cole P, Morrison AS: Basic issues in population screening for cancer. J Nat Canc Inst 64: 1263-1272, 1980
11. Prorok PC: Evaluation of screening programs for the early detection of cancer. Statistical Methods for Cancer Studies, Cornell RG (Ed.). New York: Marcel Dekker, in press
12. Zelen M: Theory of early detection of breast cancer in the general population. Breast Cancer: Trends in Research and Treatment, Heuson JC, Mattheiem WH, Rosencweig M (Eds). New York: Raven Press, 1976, pp. 287-300
13. Shwartz M: Personal communication, 1977
14. Beahrs OH, Shapiro S, Smart C et al.: Report of the working group to review the National Cancer Institute/American Cancer Society Breast Cancer Detection Demonstration Projects. J Nat Canc Inst 62: 639-709, 1979
15. Breslow L, Henderson B, Massey F et al.: Report of NCI ad hoc working group on the gross and net benefits of mammography in mass screening for the detection of breast cancer. J Nat Canc Inst 59: 473-478, 1977
16. Dales LG, Friedman GD, Ramcharan S et al.: Multiphasic checkup evaluation study. 3. Outpatient clinic utilization, hospitalization, and mortality experience after seven years. Prev Med 2: 221-235, 1973
17. Shapiro S: Statistical evidence for mass screening for breast cancer and some remaining issues. Canc Detec Prev 1: 347-363, 1976
18. Sackett D: Periodic examination of patients at risk. Cancer Epidemiology and Prevention, Schottenfeld D (Ed.). Springfield: Charles C Thomas, 1975, pp. 437-454
19. Shapiro S: Personal communication, 1977
20. Thomas DG: Exact and asymptotic methods for the combination of 2 × 2 tables. Comp Biomed Res 8: 423-446, 1975
21. Tukey JW: Some thoughts on clinical trials, especially problems of multiplicity. Science 198: 679-684, 1977
22. Halperin M, Ware JH, Byar DP et al.: Testing for interaction in an I × J × K contingency table. Biometrika 64: 271-275, 1977
23. Armitage P: The chi-square test for heterogeneity of proportions, after adjustment for stratification. J Roy Stat Soc B28: 150-163, 1966
24. Gart JJ: The analysis of ratios and cross-product ratios of Poisson variates with application to incidence rates. Comm Stat A7: 917-937, 1978
25. Gail M: The analysis of heterogeneity for indirect standardized mortality ratios. J Roy Stat Soc A141: 224-234, 1978
26. Taylor WF, Fontana RS: Biometric design of the Mayo Lung Project for early detection and localization of bronchogenic carcinoma. Cancer 30: 1344-1347, 1972