The effect of misclassification in screening trials: A simulation study


Contemporary Clinical Trials 29 (2008) 125 – 135 www.elsevier.com/locate/conclintrial

Nancy A. Obuchowski ⁎, Michael L. Lieber

Department of Quantitative Health Sciences/Wb4 and Division of Radiology, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195, United States

Received 13 December 2006; accepted 30 May 2007

Abstract

Background: Misclassification of study endpoints in randomized clinical trials of screening tests has been well documented, yet its effect on study power, type I error rate, and risk ratio estimate has not been studied in depth.

Methods: We constructed a Markov model to depict the natural history of disease and the effect of screening on it. Using this model we simulated subjects in a two-arm RCT. We varied the type and amount of misclassification, and studied the effect on two endpoints: disease-specific mortality and the incidence of disease-specific symptoms.

Results: Failure to identify disease-specific events in a RCT of screening has a small effect on the risk ratio estimate and study power. In contrast, the false identification of events as being attributable to the target disease greatly reduces study power.

Conclusions: Investigators of RCTs of screening tests should carefully consider the potential for misclassification and the type of misclassification that their study is at risk for. Studies should be designed to minimize misclassification. The effect of misclassification on power should be considered in sample size calculations.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Misclassification; Statistical power; Screening trials; Disease-specific mortality

1. Introduction

Over the last 20 years there have been remarkable advances in imaging technology, allowing detection of early disease, or disease risk factors, before the onset of signs and symptoms. Randomized clinical trials (RCTs) are considered the gold standard for evaluating the efficacy of screening tests with patient outcomes, i.e. morbidity and mortality, as the primary endpoints. These RCTs typically require large sample sizes (i.e. thousands or even tens of thousands of subjects) and are extremely expensive due to the low prevalence of disease, even when screening is targeted at high-risk groups.

Misclassification is a common problem in RCTs [1–8]. It occurs when study subjects' outcomes of success or failure, event or no event, are assigned incorrectly. Black et al. [1] described two types of misclassification bias that can

⁎ Corresponding author. Tel.: +1 216 445 9549; fax: +1 216 444 9307. E-mail address: [email protected] (N.A. Obuchowski). 1551-7144/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.cct.2007.05.009


occur in RCTs of screening tests. "Sticky-diagnosis bias" occurs in the screened group when a subject experiences an event (e.g. death) due to another cause but is known to have the target disease; bias occurs when the subject's event is wrongfully attributed to the target disease (we refer to this as over-diagnosis misclassification). In the control group a subject could experience an event from the target disease; yet, because the subject is scrutinized less for indications of the target disease, an alternative diagnosis may stick to this control subject (i.e. under-diagnosis misclassification). "Slippery-linkage bias" [1] occurs when subjects suffer complications from treatment or screening; these events, which are attributable to the target disease or screening for it, are not identified as disease-specific events (under-diagnosis misclassification).

Misclassification of events can lead to distortion of study results, particularly distortion of the power, type I error rate, and estimated risk ratio. In this paper we examine the effect of misclassification in RCTs of screening tests. We use a Markov model to depict the natural history of disease and the effect of screening on study subjects' outcomes. We consider two clinical endpoints: disease-specific mortality and the incidence of disease-specific symptoms. In our baseline model without misclassification, screening is mildly beneficial; we then introduce different types and amounts of misclassification and examine the effects on the studies' power and estimated risk ratio.

2. Methods

2.1. Simulation model

We simulated a two-arm RCT of asymptomatic people. Subjects in the screening arm undergo a form of screening (e.g. chest computed tomography, mammography, or calcium scoring) for a particular target disease or risk factor (e.g. lung cancer, breast cancer, or atherosclerosis).
If the screening test detects an abnormality, then additional follow-up tests are performed, and early treatment or risk reduction is begun, as appropriate [9]. Subjects in the control arm do not undergo this form of screening. We examine the hypothesis that this form of screening is beneficial to subjects in detecting early disease, and consequently preventing advanced disease.

We constructed a 10-stage Markov model to depict the natural history of a generic disease, the screening intervention, and the effect of misclassification on this process. The model describes disease not as binary states of present or absent, but rather as a dynamic process with the potential to both progress and regress in the presence of competing diseases. In our model all subjects begin in the well health state and, over time, transition to one of three terminal health states: dead from the target disease, dead from treatment for the target disease, or dead from a competing disease. The transitions occur at monthly intervals. A random number generator was used to determine, for each month, whether a subject would transition to a different health state or remain in the existing health state.

Fig. 1 illustrates the model. Health states (labeled 0–9) are represented by rectangles, and circles depict intermediary events. Beginning at the far left, well subjects (health state #0) may remain in their current health state or, with time, transition into the preclinical phase of the target disease (health state #1) [9]. In the case of cancer, this can be thought of as a transition from healthy cells to initiated or intermediate cells. There may be several different stages within this preclinical phase, but they are all bundled together in the model because they are undifferentiated and undetectable by the screening test. Screening at this point in the natural history of the disease offers no benefit. The initiated or intermediary cells may give rise to detectable lesions.
In the case of cancer, the lesions would be in a premalignant state. We label this as the first part of the detectable preclinical phase (DPCP) prior to the critical point (health state #2) [9]. Subjects in health state 2 at the time of screening have the potential to realize benefit from early detection. From health state 2, the disease may progress further and reach a critical point (CP) [9]. In the case of cancer, the premalignant cells progress into an invasive tumor with metastases. We refer to this as the DPCP after the critical point (health state #3). Screening does not offer benefit to subjects in this health state because it is too late to interrupt the progression of disease. Subsequently, the subject may transition into a clinical phase with symptoms of disease (health state #5), at which time treatment would occur.

Alternatively, from health state 2, some subjects will progress into a clinical phase (health state 4) through incidental findings (e.g. detection of preclinical disease as part of a work-up prior to an unrelated surgery) or the development of symptoms or signs of disease that occur before the critical point. We assume that 50% of such subjects develop symptoms/signs of disease and the remaining subjects are diagnosed through incidental findings. Subjects with known disease undergo treatment. If the subject survives treatment (probability 1 − g), the outcome may be successful, partially successful, or unsuccessful. If fully successful, the subject returns to a well state (#0); if partially successful,


the subject returns to an undetectable preclinical phase (#1); and if unsuccessful, the subject remains in a state of clinical disease (#4). Note that in the first part of the DPCP, the disease may regress, as indicated by the left arrows between health states 1 and 2 and between 1 and 0. This regression has been called "pseudodisease type I" [9–11]. The screening test, although able to detect disease in the DPCP, does not benefit a subject whose disease naturally regresses.

Throughout this natural history, the subject may succumb to other competing diseases or events (e.g. accidents) (top of Fig. 1). This competing health state is denoted as health state #8. Although not shown explicitly in Fig. 1, subjects in non-terminal health states transition to health state 8 with probability r. Pseudodisease type II [9–11] occurs when transitions to health state #8 occur so often that subjects, even with detectable target disease, do not die from the target disease (health state #6) but rather succumb to competing causes (health state #9).

We simulated cohorts of control (unscreened) subjects who followed the algorithm in Fig. 1. We also simulated cohorts of screened subjects who followed the algorithm in Fig. 1 except at the time of screening. Subjects in the screening arm underwent screening at entry into the study (i.e. baseline screen), and, if eligible, underwent two incident screens at 12 and 24 months (eligible subjects were those who were asymptomatic (health state 0, 1, 2, or 3) and had no previously detected target disease). Before the DPCP, screening does not benefit the subject; thus the test results are either true negative (TN) or false positive (FP). During the DPCP, screening can detect disease. For subjects in the DPCP the test results are either true positive (TP) or false negative (FN) (see Fig. 2).

To simulate subjects in a RCT we made assumptions about the study population, target disease, screening test, treatment, and study design.
These assumptions affect the calculated risk ratios and observed power but do not alter the general effects of misclassification on the RCT. The assumptions specific to our baseline model follow. The transition probabilities for our baseline model are summarized in Table 1. At entry into the study 89% of subjects were in the well health state, 5% in the undetectable preclinical phase, 4% in the DPCP prior to the critical point, and 2% in the DPCP after the critical point. Thus, the observed prevalence of preclinical disease at the baseline screen (i.e. for a test with 100% sensitivity) was 6%.
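To make the monthly transition mechanics concrete, a minimal sketch follows. The entry-state distribution comes from the text above; the `monthly_step` function and the transition rates in the example are toy placeholders for illustration, not the authors' code or the values in Table 1.

```python
import random

# Health states at study entry (Section 2.1): 0 = well,
# 1 = undetectable preclinical phase, 2 = DPCP before the critical
# point, 3 = DPCP after the critical point.
BASELINE_DIST = [(0, 0.89), (1, 0.05), (2, 0.04), (3, 0.02)]

def draw_initial_state(rng):
    """Sample a subject's health state at entry into the study."""
    u = rng.random()
    cum = 0.0
    for state, p in BASELINE_DIST:
        cum += p
        if u < cum:
            return state
    return BASELINE_DIST[-1][0]

def monthly_step(state, transitions, rng):
    """One monthly cycle: move to a neighboring state or stay put.

    `transitions` maps state -> [(next_state, monthly_prob), ...].
    The rates used below are toy placeholders, not Table 1 values.
    """
    u = rng.random()
    cum = 0.0
    for nxt, p in transitions.get(state, []):
        cum += p
        if u < cum:
            return nxt
    return state  # remain in the current health state

# Example: follow one subject for 36 monthly cycles.
rng = random.Random(42)
toy = {0: [(1, 0.01)], 1: [(2, 0.05), (0, 0.01)], 2: [(3, 0.04), (1, 0.01)]}
state = draw_initial_state(rng)
for _ in range(36):
    state = monthly_step(state, toy, rng)
```

A full simulation would extend the transition map to all 10 health states and stop a subject's trajectory on entering a terminal state.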

Fig. 1. Illustration of the template for the natural history model. Each month, subjects are in one of the 10 health states, indicated by the rectangles. The blue rectangles denote asymptomatic states; the green rectangles indicate clinical states with the target disease; the purple rectangle indicates a clinical state with a competing disease; and the red rectangles indicate death. The letters on the arrows indicate the transition rates to different health states, summarized in Table 1. Not shown are the transition arrows to health state #8 (symptoms from competing diseases); at each monthly cycle subjects in any health state (except death) are at risk of transitioning to health state #8, with transition probability r.


We assumed good sensitivity and specificity of the screening test for detection of the DPCP (sensitivity = 0.93, specificity = 0.81). We assumed that 100% of subjects randomized to the screening arm are compliant with the first screen, and 85% of those eligible are compliant with the second and third screens. We assumed the annual loss to follow-up is 10% for screened subjects and 15% for controls. For each analysis we simulated 500 RCTs. We considered sample sizes of 3000, 5000, and 10,000 subjects per study arm. As occurs in most RCTs, there was staggered entry into the study, with uniform accrual over the first 2 years of the study.

2.2. Primary endpoints

For each RCT we calculated two endpoints: the disease-specific mortality and the incidence of disease-specific symptoms. The disease-specific mortality rate is determined at a particular follow-up time t (for example, t = 3 years from the start of the study). It is determined by counting the number of subjects who die from the target disease before time t [12]. Subjects who die as a result of a complication from treatment for the disease (i.e. health state #7) are also included (see Appendix for details of the calculations).

The incidence rate of disease-specific symptoms is similarly defined. It is calculated by determining the number of subjects who, before time t, develop one or more signs/symptoms that, upon diagnostic testing, are attributable to the target disease. The set of disease-specific symptoms/signs must be determined prior to the start of the study. For example, in a lung cancer screening study, subjects with radiographic evidence of lung cancer (i.e. from diagnostic testing following a positive screening test or diagnostic testing triggered by signs/symptoms) would be identified.
These cases and their reported history of symptoms (based on standardized questionnaires administered quarterly since the start of the study) would be reviewed by an outcome review panel, blinded to the subjects' study arm allocation. The panel would determine whether symptoms of advanced lung cancer had developed.

2.3. Misclassification

For disease-specific mortality, we simulated two types of misclassification. Type O ("over-diagnosis misclassification") occurs when a study subject is classified as dying from the target disease when s/he truly died from another cause. We simulated this misclassification by first identifying subjects who died from other causes, and then altering the subject's cause of death from "9" ("other death") to "6" ("target-disease death") with a certain pre-specified

Fig. 2. Illustration of the additional pathways for screened subjects who are in the detectable preclinical phase prior to the critical point at the time of screening (left) or the detectable preclinical phase after the critical point (right). Screening can potentially benefit subjects prior to the critical point, but not after it.
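The scoring of a single screen described above (TP/FN for subjects in the DPCP, TN/FP otherwise) can be sketched with the stated operating characteristics; the function name is illustrative, not the authors' code.

```python
import random

SENSITIVITY = 0.93  # P(positive screen | subject in the DPCP)
SPECIFICITY = 0.81  # P(negative screen | subject not in the DPCP)

def screen_result(in_dpcp, rng):
    """Score one screening test, per Fig. 2: TP/FN when detectable
    preclinical disease is present, TN/FP when it is not."""
    if in_dpcp:
        return "TP" if rng.random() < SENSITIVITY else "FN"
    return "TN" if rng.random() < SPECIFICITY else "FP"
```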


Table 1
Yearly transition probabilities⁎⁎

  Transition probability   Description                                                      Baseline model
  a                        Incidence of new target disease                                  0.0043
  b                        Regression back to well state                                    0.01
  c                        Progression from initial phase to DPCP                           0.6/0.3
  d                        Regression from DPCP to initial phase                            0.01
  e                        Progression to Critical Point (CP)                               0.9/0.5
  f                        Development of signs or symptoms before CP                       0.25
  g                        Risk of death after treatment                                    0.01
  h                        No benefit of treatment when treatment occurs before CP          0.10
  i                        Successful treatment when treatment occurs before CP             0.7
  j                        Partially successful treatment when treatment occurs before CP   0.2
  k                        Development of symptoms after CP                                 0.9/0.5
  l                        Partially successful treatment when treatment occurs after CP    0.2
  m                        No benefit of treatment when treatment occurs after CP           0.63
  n                        Successful treatment when treatment occurs after CP              0.07
  o                        Nearly successful treatment when treatment occurs after CP       0.10
  p                        Progression to CP during clinical phase                          0.95/0.80
  q                        Mortality from target disease after CP                           1.0
  r                        Incidence of competing diseases (until age 70)                   0.025 + 0.0001 × age
  r                        Incidence of competing diseases (after age 70)                   0.025 + 0.002 × age
  s                        Mortality from competing diseases                                0.3

⁎⁎For transition rates labeled c, e, k, and p, we have listed two rates: the first for subjects with rapidly advancing disease, and the second for subjects with slower progressing disease. We assumed that 15% of subjects with the target disease have a rapidly advancing form of the disease.
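Because Table 1 lists yearly probabilities while the model cycles monthly, a conversion is needed. The paper does not state its conversion method; one standard choice, assuming a constant hazard within the year, is p_month = 1 − (1 − p_year)^(1/12), sketched below.

```python
def monthly_from_yearly(p_year):
    """Monthly transition probability equivalent to a yearly one,
    assuming a constant hazard within the year (one standard choice;
    the paper does not state its conversion method)."""
    return 1.0 - (1.0 - p_year) ** (1.0 / 12.0)

# Example: yearly incidence of new target disease, a = 0.0043.
p_monthly = monthly_from_yearly(0.0043)
```

Applying the converted probability for 12 consecutive monthly cycles reproduces the yearly probability exactly.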

probability. Type U ("under-diagnosis misclassification") occurs when a study subject is classified as not dying from the target disease when he or she truly did die from the target disease. We simulated this misclassification by first identifying subjects who died from the target disease, and then altering the subject's cause of death from "6" to "9" with a certain probability. The type and amount of misclassification for the two study arms is summarized in Table 2.

For the incidence of disease-specific symptoms, we similarly defined two types of misclassification. Type O occurred when a study subject in states 2, 3, or 4 was classified as having developed symptoms of advanced target disease when in truth symptoms from the target disease had not yet occurred. This misclassification would likely occur when subjects with known disease were overly scrutinized for the development of symptoms. We simulated this misclassification by first identifying asymptomatic subjects with the target disease who were undergoing early treatment. Note that both asymptomatic screened and asymptomatic control subjects can undergo early treatment in our model, i.e. following a positive screening test (screened arm) or following a work-up prior to an unrelated surgery (control or screened arm). We classified these subjects as having symptoms of advanced target disease at the time of treatment with a certain probability. Type U misclassification occurred when a study subject was classified as not having developed symptoms of the target disease when he or she truly did have symptoms of advanced target disease. We simulated this misclassification by first identifying subjects who experienced symptoms of the target disease; then, with a certain probability, we changed their record from the time of first symptoms of advanced disease to no symptoms.

In addition, we considered two types of misclassification bias: quantitative bias and qualitative bias.
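The label-flipping scheme described above (code "6" = target-disease death, "9" = other death) can be sketched as a single helper; this is an illustration, not the authors' code.

```python
import random

def misclassify(cause, flip_from, flip_to, rate, rng):
    """Relabel a recorded cause of death with probability `rate`.

    Type O (over-diagnosis): flip_from = "9" (other death),
    flip_to = "6" (target-disease death). Type U (under-diagnosis)
    reverses the two codes. Subjects whose recorded cause does not
    match `flip_from` are left untouched.
    """
    if cause == flip_from and rng.random() < rate:
        return flip_to
    return cause
```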
Quantitative misclassification bias occurred when both the screened and control subjects had a similar type of misclassification (i.e. either over- or under-diagnosis) but the amount of misclassification differed between the two arms. We assumed that screened subjects would have more over-diagnosis misclassification than control subjects. Similarly, we assumed that control subjects would have more under-diagnosis misclassification than screened subjects. Qualitative misclassification bias occurred when the screened and control subjects had different types of misclassification. We assumed that in the same RCT control subjects would be under-diagnosed and screened subjects would be over-diagnosed (see Table 2).

Table 2
Rates of misclassification used in simulation model

                                                        Type O misclassification            Type U misclassification
  Model                                                 Screened arm      Control arm       Screened arm      Control arm
  Baseline                                              0%                0%                0%                0%
  Constant over misclassification                       1–15%             Same as screened  0%                0%
  Constant under misclassification                      0%                0%                Same as control   1–15%
  Quantitative bias–over misclassification              5–15%             1%                0%                0%
  Quantitative bias–under misclassification             0%                0%                1%                5–15%
  Qualitative bias–over and under misclassification     5–10%             0%                0%                5–10%

For each simulated RCT, we computed the disease-specific mortality rate and the incidence rate of disease-specific symptoms. We compared the rates for the screened and control study arms. The null hypothesis was that the event rate in the screened subjects is greater than or equal to the event rate in the control group. The alternative hypothesis was that the event rate is lower in the screened group. The statistical analyses were performed at 3, 5, and 7 years after the start of the RCT. Details of the statistical analyses are given in the Appendix.

3. Results

3.1. Constant rate of over-diagnosis misclassification in screened and control arms

Fig. 3. Constant over-misclassification of screened and control arms for the disease-specific mortality endpoint with N = 10,000 per study arm. The blue line indicates the power with no misclassification, whereas the red lines indicate the decreased power with 5%, 10%, and 15% over-misclassification.

Fig. 3 summarizes the results of constant over-diagnosis for the disease-specific mortality endpoint. The power decreases with just 1% misclassification (not shown), and decreases further as the misclassification rate increases. For example, with 1% misclassification error, at 7 years with N = 10,000, the power decreases from 97% to 96% (1% reduction); the power decreases from 90% to 88% (2% reduction) at 5 years; there is no effect on power at 3 years. Similarly, with 5% misclassification error, at 7 years with N = 10,000, the power decreases from 97% to 91% (6% reduction); the power decreases from 90% to 78% (13% reduction) at 5 years; there is little effect on power at 3 years.

For the incidence of disease-specific symptoms, there is a large effect on power for both short and long studies (see Fig. 4). For example, at 7 years with N = 3000, the power decreases from 96% to 86% (10% reduction) with 5% error; at 5 years the power decreases more: 91% to 71% (22% reduction); and at 3 years the power decreases from 62% to 26% (58% reduction).

Fig. 4. Constant over-misclassification of screened and control arms for the incidence of disease-specific symptoms endpoint with N = 3000 per study arm. The blue line indicates the power with no misclassification, whereas the red lines indicate the decreased power with 5%, 10%, and 15% over-misclassification.

The estimated risk ratio becomes slightly biased against screening in the presence of over-diagnosis misclassification. For the mortality endpoint, the estimated risk ratio increases from a true value of 0.76 at 7 years to 0.77 (1% increase) with 1% misclassification and to 0.82 (8% increase) with 5% misclassification. For the symptoms endpoint, the risk ratio increases from a true value of 0.66 at 7 years to 0.67 (1.5% increase) with 1% misclassification and to 0.72 (9% increase) with 5% misclassification.

3.2. Constant rate of under-diagnosis misclassification in screened and control arms

For both endpoints, constant under-diagnosis misclassification has a very small effect on power. For example, for mortality at 7 years with N = 10,000, the power decreases from 97% to 95% with 15% misclassification (see Fig. 5); for the incidence of symptoms, the power decreases from 96% to 95% at 7 years with N = 3000 with 15% misclassification. The estimated risk ratios, even with 15% misclassification, are nearly identical to the true values.

3.3. Quantitative misclassification bias

When there is quantitative over-diagnosis misclassification bias (i.e.
both screened and control arms have over-diagnosis misclassification, but the rate is higher in the screened arm), the effect on power for both endpoints is dramatic. For example, with 1% over-diagnosis misclassification for controls and just 5% over-diagnosis misclassification for screened, the power for the mortality endpoint decreases from 97% to 9% (91% reduction) at 7 years with N = 10,000. Similarly, the estimated risk ratio increases from 0.76 to 0.98 (29% increase). For the incidence of symptoms, the power drop is not as severe: from 96% to 85% (11% reduction) at 7 years with N = 3000, and from 91% to 68% (25% reduction) at 5 years. The risk ratio estimates increase 10–13% (from a true value of 0.66 to 0.73 at 7 years, and from 0.67 to 0.76 at 5 years).

Fig. 5. Constant under-misclassification of screened and control arms for the disease-specific mortality endpoint with N = 10,000 per study arm. The blue line indicates the power with no misclassification, whereas the red lines indicate the slightly decreased power with 5%, 10%, and 15% under-misclassification.

For under-diagnosis misclassification bias, even with a large bias (e.g. 15% for controls and 1% for screened), the power reduction for the mortality endpoint is small: 97% to 96% at 7 years (N = 10,000) and 90% to 86% at 5 years. For the incidence of disease-specific symptoms, the power increases slightly from 96% to 99% at 7 years (N = 3000) and from 91% to 97% at 5 years; the risk ratio estimates become slightly biased in favor of screening.

3.4. Qualitative misclassification bias

A small qualitative bias (i.e. over-diagnosis misclassification for the screened arm and under-diagnosis misclassification for controls) has a dramatic effect on power (see Figs. 6 and 7). For example, for the mortality endpoint with 5% over-diagnosis misclassification for screened and 5% under-diagnosis misclassification of the controls, the power drops from 97% to 1% (99% reduction) at 7 years, and from 90% to 10% (89% reduction) at 5 years (N = 10,000). The estimated risk ratio increases dramatically from a true value of 0.76 favoring screening to values near or even greater than one.

Fig. 6. Qualitative misclassification bias for the disease-specific mortality endpoint with N = 10,000 per study arm. The blue line indicates the power with no misclassification. The red lines indicate the decreased power with 5% over-misclassification of the screened subjects and 1% under-misclassification of the controls (middle line), and 10% over-misclassification of the screened subjects and 1% under-misclassification of the controls (bottom line).

For the incidence of symptoms endpoint, with 5% over-diagnosis misclassification for screened and 5% under-diagnosis misclassification of controls, the power drops from 96% to 90% (6% reduction) at 7 years with N = 3000; from 91% to 74% (19% reduction) at 5 years; and from 62% to 27% (56% reduction) at 3 years. The estimated risk ratio becomes biased against screening, from a true value of 0.67 to 0.71 (6% increase) at 7 years, 0.67 to 0.73 (9% increase) at 5 years, and 0.71 to 0.85 (20% increase) at 3 years.

Fig. 7. Qualitative misclassification bias for the incidence of disease-specific symptoms endpoint with N = 3000 per study arm. The blue line indicates the power with no misclassification. The red lines indicate the decreased power with 5% over-misclassification of the screened subjects and 1% under-misclassification of the controls (middle line), and 10% over-misclassification of the screened subjects and 1% under-misclassification of the controls (bottom line).

4. Discussion

It is well-recognized that misclassification affects the estimation of study parameters and study power [1]. Furthermore, there is ample evidence to suggest that misclassification occurs often in RCTs [2–8]. Our study found that when screening does offer a modest benefit to the subject, there is little effect on the risk ratio estimate and the study power if there is similar under-diagnosis misclassification of screened and control subjects; in contrast, there is a large effect on the risk ratio and power for over-diagnosis misclassification of events. These contrasting effects can be explained as follows.

In the presence of over-diagnosis misclassification of the disease-specific mortality endpoint, even when the rate of misclassification is the same for the two study arms, the misclassification error in the two arms does not balance out, and the power of the RCT decreases (and the risk ratio estimate becomes closer to one). The explanation is that in the screened arm there are more people who actually died from other causes than in the control arm. Over-diagnosis misclassification involves labeling these non-target-disease deaths falsely as being attributable to the target disease. Since there is a greater proportion of people to be misclassified in the screened arm, the power decreases. For example, suppose that of 1000 people screened there are 5 disease-specific deaths, compared with 8 disease-specific deaths among 1000 controls (risk ratio of 0.625, i.e. 5/8). With a 10% over-diagnosis misclassification rate, the 995 screened people without the target disease are at risk of misclassification, compared with the 992 control people. The risk ratio with this constant 10% over-diagnosis misclassification rate increases to 0.975 (i.e.
104.5/107.2), thus greatly reducing the study's power.

Similarly, for the incidence of disease-specific symptoms endpoint, there are many more subjects in the screened arm than in the control arm who have radiographic evidence of disease and undergo treatment while asymptomatic. These subjects are under great scrutiny and are at risk of over-diagnosis. Even if subjects in both arms are over-diagnosed at a constant rate, the screened arm will have more cases of over-diagnosis misclassification than the control arm. Thus, there will be a loss in study power.

The effect of over-diagnosis misclassification on power is most serious for longer studies (i.e. 7 years) using mortality as the endpoint, because there are few deaths of any kind in the first few years of the study, so there is little opportunity to misclassify. For studies using the incidence of disease-specific symptoms, the effect can be large for both short and long studies, but it tends to be most serious for short studies. This is because screened patients are more likely to undergo treatment early in the study; thus, the risk of misclassification occurs early, so we see the drop in power early in the trial.

Misclassifying disease-specific events as non-disease-specific events (i.e. under-diagnosis misclassification) has a small effect on the risk ratio and on study power when the rate of misclassification is similar for screened and control subjects, and even when the rate of misclassification is higher for controls. This small effect is attributable to the fact that there are relatively few disease-specific events in RCTs of screening; thus, there are few opportunities to under-diagnose. For example, of 1000 people screened there may be 5 true disease-specific events, compared with 8 true disease-specific events among 1000 control people (risk ratio of 0.625, i.e. 5/8). With 10% under-diagnosis misclassification in both arms, the risk ratio is unchanged (i.e. 4.5/7.2 = 0.625).
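The arithmetic in the two worked examples above can be checked directly; `risk_ratio` is an illustrative helper, not the paper's code.

```python
def risk_ratio(events_screened, n_screened, events_control, n_control):
    """Risk ratio of the screened arm relative to the control arm."""
    return (events_screened / n_screened) / (events_control / n_control)

# No misclassification: 5 vs. 8 true events per 1000 subjects.
rr_true = risk_ratio(5, 1000, 8, 1000)                    # 0.625

# Constant 10% over-diagnosis: 0.10 * 995 and 0.10 * 992 false
# events are added to the screened and control arms, respectively.
rr_over = risk_ratio(5 + 0.10 * 995, 1000, 8 + 0.10 * 992, 1000)  # ≈ 0.975

# Constant 10% under-diagnosis removes 10% of true events per arm.
rr_under = risk_ratio(0.9 * 5, 1000, 0.9 * 8, 1000)       # 0.625
```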
The study power gradually decreases as the absolute number of events decreases, but the reduction in power is small.

We expect that the most common type of misclassification in RCTs of screening occurs when the screened subjects are over-diagnosed (events are falsely attributed to the target disease) and the control subjects are under-diagnosed (disease-specific events are not recognized as attributable to the disease). We found that this combination has a particularly detrimental effect on study power. Interestingly, the reduction in power seems to be more severe for the disease-specific mortality endpoint compared with the incidence of disease-specific symptoms endpoint (see Fig. 6 vs. Fig. 7).

Our results confirm some of the conclusions of Black et al. [1]. We have shown that in the presence of over-diagnosis misclassification for the screened group and simultaneous under-diagnosis misclassification in the control arm (so-called "sticky-diagnosis bias"), both disease-specific mortality and the incidence of disease-specific symptoms will have large power losses, and the risk ratio will be biased such that screening will appear less favorable. In the presence of "slippery-linkage bias", where screening and treatment complications are not attributed to the target disease, our results show that screening will appear more beneficial than it should (i.e. the risk ratio is artificially low).
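The paper's exact analysis is given in its Appendix; as a generic illustration, the recipe used throughout the Results (estimate power as the fraction of simulated trials rejecting the one-sided null) can be sketched with a standard pooled two-proportion z-test. All names and rates below are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def one_sided_z(e1, n1, e0, n0):
    """Pooled two-proportion z statistic for H1: screened rate < control rate."""
    p1, p0 = e1 / n1, e0 / n0
    p = (e1 + e0) / (n1 + n0)  # pooled event proportion
    se = math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n0))
    return (p1 - p0) / se

def empirical_power(rate_s, rate_c, n, n_trials=500, crit=-1.645, seed=1):
    """Fraction of simulated trials rejecting the one-sided null
    (crit = -1.645 corresponds to a one-sided alpha of 0.05)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Draw event counts for each arm as sums of Bernoulli trials.
        e_s = sum(rng.random() < rate_s for _ in range(n))
        e_c = sum(rng.random() < rate_c for _ in range(n))
        if one_sided_z(e_s, n, e_c, n) < crit:
            rejections += 1
    return rejections / n_trials
```

Misclassification would be introduced by perturbing the event counts before testing, which is how the simulations above translate label flips into power loss.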

134

N.A. Obuchowski, M.L. Lieber / Contemporary Clinical Trials 29 (2008) 125–135

This discussion of power and estimated risk ratio has focused on RCTs where screening offers some benefit. Consider now the effects if screening is neither beneficial nor harmful. If the number of disease-specific events is similar in the two study arms, and the number of non-target-disease events is similar in the two study arms, then constant rates of either over- or under-diagnosis will not affect the point estimate of the risk ratio. Thus, the type I error rate of the study will not be compromised. On the other hand, if the screened subjects tend to be over-diagnosed and the controls under-diagnosed, then the estimated risk ratio will be greater than one, and screening will appear to be harmful.

Our study has limitations. We built a generic model to study the effect of misclassification on the estimated risk ratio and the study power. Our model was not meant to describe any particular disease or screening test but rather to represent a generalization of typical disease, screening test, and research study characteristics. Thus, we are not able to provide specific quantification of the power reduction expected in any particular RCT; rather, we provide generalizations of the effects of different types of misclassification.

5. Conclusion

The conclusions from the study can be summarized as follows:

1. Failure to identify disease-specific events in an RCT of screening (i.e. under-diagnosis misclassification) has the following effects on study results:
   i. When the rate of misclassification is similar for screened subjects and controls, and even when the rate for controls is slightly higher than for the screened subjects, there is little effect on the risk ratio and study power.
   ii. When complications due to screening or treatment are not properly attributed to the target disease, the benefit of screening will be over-stated (i.e. the power of the study will be artificially high).

2. The false identification of events as being attributable to the target disease (i.e. over-diagnosis misclassification) has the following effects on study results:
   i. When screening is truly beneficial, the power to detect the benefit is reduced. The greater the misclassification rate, the greater the reduction in power. The power reduction can be substantial.
   ii. When screening is not beneficial, preferential over-diagnosis misclassification in the screened arm will make screening appear to be harmful.

3. Investigators of RCTs of screening tests should carefully consider the potential for misclassification and the type of misclassification that their study is at risk for. Studies should be designed to minimize misclassification. The effect of misclassification on power should be considered in sample size calculations.

Appendix A. Mortality rate

The disease-specific mortality rate was calculated as follows:

Mortality rate at time t = [# of events by t] / [# of person-years at risk until t]

where “event” is death due to the disease or due to treatment for the disease. We assumed that all events would be identified; that is, if a subject was lost to follow-up and an event occurred after the loss to follow-up, the event would be determined via public death records. The number of person-years at risk is equal to t* − e, where t* is the lesser of t and the time at which the subject died, and e is the number of study months before the subject was enrolled into the study (i.e. e = 0 at the start of the study).

We tested the following null hypothesis: the disease-specific mortality rate of the screened subjects is greater than or equal to the disease-specific mortality rate of the controls. The alternative hypothesis is that the mortality rate of the screened subjects is less than the mortality rate of the controls (one-sided test).
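The person-years rule above (t* − e) can be sketched as follows. The function name and the example values are ours, introduced only for illustration; times are in study months, as in the paper.

```python
def person_years_at_risk(t, enrollment_offset, death_time=None):
    """Time at risk by analysis time t, following the t* - e rule.

    t* is the lesser of t and the subject's death time; the enrollment
    offset e is the number of study months before the subject enrolled
    (e = 0 for subjects enrolled at the start of the study).
    """
    t_star = t if death_time is None else min(t, death_time)
    return t_star - enrollment_offset

# A subject enrolled 6 months into the study and still alive at a
# 60-month analysis contributes 54 person-months at risk.
print(person_years_at_risk(60, 6))                  # 54
# A subject enrolled at the start who died at month 42 contributes 42.
print(person_years_at_risk(60, 0, death_time=42))   # 42
```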
The following statistic [12] was used:

Y = (MRwo − MRw) / sqrt(variance)    (1)

where MRwo is the estimated disease-specific mortality rate for the control arm and MRw is the estimated disease-specific mortality rate for the screened arm. The variance is

variance = [PYwo × MRwo + PYw × MRw] / [PYwo × PYw]


where PYwo is the number of person-years at risk for the control arm and PYw is the number of person-years at risk for the screened arm. Y is distributed as a standard normal random variable under the null hypothesis. We reported the frequency with which Y exceeded 1.645 (one-tailed test, with a significance level of 5%).

Incidence rate of disease-specific symptoms

We estimated the incidence rate of symptoms from the target disease as follows:

Incidence rate at time t = [# of events by t] / [# of person-years at risk until t]

where “event” is the development of signs/symptoms due to the disease, death due to treatment for the target disease, or disease-specific mortality (whichever is observed first). Not all signs and symptoms are observed, because of loss to follow-up. The number of person-years at risk is equal to t′ − e, where t′ is the least of i) the time at which the subject experienced symptoms of the target disease, ii) the time at which the subject was lost to follow-up, if the subject did not die from the target disease or its treatment before t, iii) the time at which the subject died from the target disease or its treatment, and iv) t.

A statistic similar to that in Eq. (1) was used to compare incidence rates of disease-specific symptoms in screened and control subjects. We reported the frequency with which the test statistic exceeded 1.645 (one-tailed test, with a significance level of 5%).

Risk ratio

The risk ratio estimator for disease-specific mortality was defined as the disease-specific mortality rate for the screened subjects divided by that for the controls. For the incidence of disease-specific symptoms, the estimator was defined analogously, as the incidence rate for the screened subjects divided by that for the controls. Risk ratios less than one indicate that screening is beneficial, whereas risk ratios greater than one indicate that screening is detrimental.
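The appendix calculations can be sketched together in a few lines. This is an illustrative implementation under stated assumptions: the function names and the example event counts are ours, and the variance expression is transcribed from the appendix as printed.

```python
import math

def z_statistic(events_ctrl, py_ctrl, events_scr, py_scr):
    """Y = (MRwo - MRw) / sqrt(variance), per Eq. (1) of Appendix A."""
    mr_wo = events_ctrl / py_ctrl   # control-arm mortality rate
    mr_w = events_scr / py_scr      # screened-arm mortality rate
    # Variance expression as given in the appendix.
    variance = (py_ctrl * mr_wo + py_scr * mr_w) / (py_ctrl * py_scr)
    return (mr_wo - mr_w) / math.sqrt(variance)

def risk_ratio(rate_screened, rate_control):
    """< 1 indicates screening is beneficial; > 1 indicates harm."""
    return rate_screened / rate_control

# Hypothetical trial: 80 control-arm deaths vs. 50 screened-arm deaths,
# each arm contributing 50,000 person-years at risk.
y = z_statistic(80, 50_000, 50, 50_000)
reject_null = y > 1.645   # one-tailed test at the 5% significance level
rr = risk_ratio(50 / 50_000, 80 / 50_000)   # 0.625, favors screening
```

With equal person-years in the two arms, as in this example, Y is about 2.63, so the null hypothesis of no benefit would be rejected.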
References

[1] Black WC, Haggstrom DA, Welch HG. All-cause mortality in randomized trials of cancer screening. J Natl Cancer Inst 2002;94:167–73.
[2] Hoel DG, Ron E, Carter R, Mabuchi K. Influence of death certificate errors on cancer mortality trends. J Natl Cancer Inst 1993;85:1063–8.
[3] Lee PN. Comparison of autopsy, clinical and death certificate diagnosis with particular reference to lung cancer: a review of the published data. APMIS Suppl 1994;45:1–42.
[4] Messite J, Stellman SD. Accuracy of death certificate completion: the need for formalized physician training. JAMA 1996;275:794–6.
[5] Maudsley G, Williams EM. ‘Inaccuracy’ in death certification — where are we now? J Public Health Med 1996;18:59–66.
[6] Lloyd-Jones DM, Martin DO, Larson MG, Levy D. Accuracy of death certificates for coding coronary heart disease as the cause of death. Ann Intern Med 1998;129:1020–6.
[7] Newschaffer CJ, Otani K, McDonald MK, Penberthy LT. Causes of death in elderly prostate cancer patients and in a comparison nonprostate cancer cohort. J Natl Cancer Inst 2000;92:613–21.
[8] Albertsen P. When is a death from prostate cancer not a death from prostate cancer? J Natl Cancer Inst 2000;92:590–1 [editorial].
[9] Morrison AS. Screening in Chronic Disease. 2nd ed. New York: Oxford University Press; 1992.
[10] Black WC, Welch HG. Screening for disease. AJR 1997;168:3–11.
[11] Obuchowski NA, Graham RJ, Baker ME, Powell KA. Ten criteria for effective screening: their application to multislice CT screening for pulmonary and colorectal cancer. AJR 2001;176:1357–62.
[12] Esteve J, Benhamou E, Raymond L. Descriptive Epidemiology: Statistical Methods in Cancer Research, vol. IV. Lyon, France: International Agency for Research on Cancer; 1994.