Clinical Radiology (1991) 43, 77-80
Editorial
Breast Cancer Screening by Mammography: An Overview
Population screening for disease has an intuitive appeal. Because it is carried out on large numbers of apparently healthy individuals there is the potential to make substantial inroads into the burden of disease. Screening has become an intrinsic part of normal medical practice in some fields. Biochemical and ultrasound screening in pregnancy are largely responsible for the 80% fall in the birth prevalence of neural tube defects in England and Wales over the last 15 years. PKU and hypothyroidism screening in the neonatal period are successful in preventing childhood metabolic disorders. With the recent progress made in molecular biology it has become feasible to consider population screening for the carriers of common recessive disorders such as cystic fibrosis.
In contrast with these exciting developments in antenatal and neonatal diseases the value of screening for adult diseases has been slow to be demonstrated. Among cancers the value of screening has only been established for cervix and breast cancer, and it is instructive to compare the two. In the case of cervical cancer large scale screening programmes were introduced in the mid-1960s in several countries, largely on the basis of pathological evidence that the dysplastic changes seen in the smear were precancerous. Only now, after more than 20 years of data have accumulated, is epidemiological evidence of efficacy available. The position with breast cancer is quite different. The proof of efficacy has been provided over the last 20 years in careful studies and it is that which has made it possible now to confidently launch a national screening programme.
Recently, additional published studies of breast cancer screening with relatively poor results have been interpreted by some as showing that efficacy may not in fact have been established.
However, there is a need to see these results in the context of the totality of available data taking into account the different kinds of study designs used. It is the purpose of this review to summarize all the evidence and provide an overview.
Types of Evidence
In screening for cancer, the principal question is whether the identification of individuals at an earlier stage in the natural history of their disease and consequent treatment reduce mortality. It is a commonly held belief that this is necessarily the case for cancers where the survival rate varies substantially with the stage at diagnosis. However, the argument here is fallacious. For example, in lung cancer there is a strong negative correlation between stage and survival, but although large studies have demonstrated that chest X-rays and sputum sampling lead to earlier diagnosis they have found no effect of screening on the mortality rate. This apparent paradox is explained by statistical bias.
The first and most obvious bias is called the 'lead time' bias. Suppose that a cancer was diagnosed incidentally whilst investigating some completely unrelated problem, and the patient refused treatment. The time from diagnosis to death would necessarily be longer than had the cancer been diagnosed clinically when the patient would have presented with symptoms, yet the date of death would have been unchanged. Thus, the extra time the cancer was observed (the 'lead time') increased the survival time but did not change the outcome.
The second bias is called the 'length' bias. This arises because of the variability in rate of progression of the disease, for a given site of cancer, so that for example some cancers are aggressive and will lead to death just a few years after initiation whereas other cancers of the same site are indolent and may take decades. Since the latter spend more time in the preclinical stages there are more opportunities for incidental diagnosis. It follows that in a group of cancers diagnosed early there will be a disproportionate number of indolent cases with good survival.
These biases mean that to evaluate screening for cancer the mortality rate and not the survival rate has to be used, comparing it in those who are screened and similar individuals who are not. This comparison can be achieved by randomized clinical trials and by population studies. It is important to consider the advantages and shortcomings of the methods so that evidence from the different studies can be judged together.
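The two biases described above can be illustrated with a small sketch. Everything numerical here is an assumption chosen for illustration and not taken from any study: the ages in the lead-time example, the equal split between indolent and aggressive cancers, their mean preclinical durations (10 and 2 years), and the single screen in the middle of a 50-year window. The function names are likewise hypothetical.

```python
import random


def survival_years(age_at_diagnosis, age_at_death):
    """Survival time measured from diagnosis to death."""
    return age_at_death - age_at_diagnosis


def indolent_share_after_one_screen(n_cases=100_000, seed=0):
    """Return (share of indolent cancers among all cases,
    share of indolent cancers among cases detectable at one screen)."""
    rng = random.Random(seed)
    window = 50.0       # years over which preclinical onsets are spread (assumed)
    screen_time = 25.0  # a single screen placed mid-window (assumed)
    indolent_all = 0
    indolent_detected = 0
    detected = 0
    for _ in range(n_cases):
        indolent = rng.random() < 0.5                 # half of cases indolent (assumed)
        mean_preclinical = 10.0 if indolent else 2.0  # years (assumed)
        duration = rng.expovariate(1.0 / mean_preclinical)
        onset = rng.uniform(0.0, window)
        indolent_all += indolent
        # A case is detectable only if the screen falls in its preclinical phase.
        if onset <= screen_time < onset + duration:
            detected += 1
            indolent_detected += indolent
    return indolent_all / n_cases, indolent_detected / detected


# Lead time: the age at death is identical, only the age at diagnosis moves.
clinical = survival_years(age_at_diagnosis=60, age_at_death=65)
screened = survival_years(age_at_diagnosis=57, age_at_death=65)
print("survival after clinical diagnosis:", clinical, "years")
print("survival after earlier diagnosis:", screened, "years")

# Length bias: indolent cancers are over-represented among screen-detectable cases.
share_all, share_detected = indolent_share_after_one_screen()
print("indolent share, all cases:", round(share_all, 2))
print("indolent share, screen-detectable cases:", round(share_detected, 2))
```

In the first part survival lengthens although the outcome is unchanged; in the second the screen-detectable group is enriched for slowly progressing cancers even though no treatment is modelled at all. Both effects inflate survival figures without any change in mortality.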
Randomized Trials
The randomized trial of screening is by far the most reliable source of data since it is unbiased. In the absence of an effect of screening, the cumulative breast cancer mortality rate in those allocated to the screening arm is in expectation the same as that in the control arm. Any differences that emerge will be either due to chance or to the effect of screening. The extent to which chance may either cloud a true effect or lead to an apparent effect that does not exist is influenced by:
(1) size of the trial;
(2) magnitude of the true effect;
(3) compliance in the screening arm;
(4) screening in the control arm ('contamination');
(5) duration of follow-up, and
(6) method of randomization.
As with any epidemiological study the ability to show an effect is dependent on the size of the trial and the magnitude of the true effect. The more deaths from breast cancer that accumulate the smaller the influence of chance, which will be particularly important if the true effect is small. The effective size of a screening trial may be only a fraction of its actual size if compliance is low and contamination is high (see Table 1). The ability to demonstrate an effect is dependent on the duration of follow-up because the deaths prevented by screening are likely to have occurred many years after it was done. Those who die of breast cancer in the first years after screening will be predominantly women with advanced disease at the time of screening who will have derived little benefit from it. Thus the cumulative numbers of deaths in the two arms of the trial are likely to be small and similar in the first, say, 5 years and only begin to differ after then.
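The article does not state how the effective sizes in Table 1 were calculated. As a sketch only, the assumed formula below (total number randomized multiplied by the square of the difference between compliance and contamination, then rounded down to the nearest thousand) happens to reproduce every entry in that table. Both the formula and the rounding rule are inferences, and the function names are hypothetical.

```python
def effective_size(total_women, compliance_pct, contamination_pct):
    """Assumed formula: N * (compliance - contamination)^2.

    Percentages are whole numbers so that the arithmetic stays exact.
    """
    net = compliance_pct - contamination_pct
    return total_women * net * net // 10_000


def round_down_to_thousand(n):
    """Assumed rounding convention for the published table."""
    return n // 1000 * 1000


# Effective sizes as published in Table 1, keyed by
# (compliance %, contamination %), for 50 000 women randomized in total.
TABLE_1 = {
    (100, 0): 50_000, (100, 10): 40_000, (100, 20): 32_000,
    (80, 0): 32_000, (80, 10): 24_000, (80, 20): 18_000,
    (60, 0): 18_000, (60, 10): 12_000, (60, 20): 8_000,
}

for (compliance, contamination), published in TABLE_1.items():
    computed = effective_size(50_000, compliance, contamination)
    print(compliance, contamination, computed, published)
```

One way to read the assumed formula is that non-attendance and contamination shrink the observable difference between the arms in proportion to (compliance minus contamination), and the number of subjects needed to detect a difference grows with the inverse square of its size.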
It is even possible that the numbers of early deaths in the screening arm are greater than in the control arm if some women who would have died with metastatic disease of unknown origin were correctly diagnosed as having breast cancer as a result of being screened.
It is sometimes not practical to allocate individuals randomly in trials, and group randomization is used instead, say based on place of residence. Provided the groups are small enough this is not a problem, but when only a few large groups are used there is clearly the additional possibility for chance to play a role.
Population Studies
Although it is only the randomized trials that are specifically designed to provide a completely comparable control group, population studies may be able to do so to some extent. If a population offered screening can be identified systematically, a comparable 'geographical' control group living in a different area might be chosen. Such a geographically controlled study is always open to the criticism that the comparability cannot be proved. Even if the mortality rate prior to screening was the same in both populations, it is possible that this would not have been maintained during the study period.
Another approach is to study mortality in screening attenders and compare this with either the expected mortality rate using national statistics (say) or with the rate in women within the same population who have either not attended for screening or not yet been invited. In these studies non-compliance does not affect the result but self-selection biases may introduce non-comparability. The extent to which the comparison group is truly comparable in such a study will relate to the site of cancer and the population studied. For example, the uptake of cervical screening is less in the lower social classes who have the highest risk of dying from the disease, so that the mortality rate might be expected to be lower in those accepting screening than in refusers. The social class effect is likely to have the reverse effect in breast cancer screening but there are other possible biases. There is the 'healthy screenee' effect, whereby those who have already been diagnosed with breast cancer will tend to refuse an invitation to be screened. This can be overcome by identifying such women and excluding them from the analysis. Another bias might arise if women with symptoms who had delayed seeing a doctor tended to also refuse an offer of screening. This would lead to a relatively higher mortality rate in the non-attenders and so an apparent increase in the protective effect of screening.
Table 1 - Effective size of a randomized trial of breast screening with 25 000 women randomized to screening and 25 000 controls according to the level of compliance and contamination
Compliance* (%)   Contamination† (%)   Effective size
100               0                    50 000
100               10                   40 000
100               20                   32 000
80                0                    32 000
80                10                   24 000
80                20                   18 000
60                0                    18 000
60                10                   12 000
60                20                   8 000
* Proportion of women allocated to the screening arm who attend.
† Proportion of women allocated to the control arm who are, in fact, screened.
Table 2 - Breast cancer mortality results according to study of mammographic screening together with statistical features of each study
Study                                 No. studied (1000s)                     Compliance*  Contamination†  Follow-up  Relative risk‡
                                      Screening  Control group                (%)          (%)             (years)    (95% CI)
Randomized Control Trial
HIP (Shapiro et al., 1988)            30         31                           65           NK              18         0.79 (0.62-0.99)
Two-counties (Tabar et al., 1989)     77         56                           91           13              8          0.70 (0.55-0.87)
Malmö (Andersson et al., 1988)        21         21                           74           24              11         0.83 (0.60-1.14)§
Edinburgh (Roberts et al., 1990)      23         22                           61           NK              7          0.84 (0.58-1.18)
Geographical Control Study
TEDBC (1988)                          23         127                          72           NK              7          0.78 (0.58-1.04)
Studies of Screening Attenders
BCDDP (Morrison et al., 1988)         55         SEER                         -            -               9          0.80 (0.72-0.87)
Nijmegen (Verbeek et al., 1984)       20         non-attenders or uninvited   -            -               7          0.48 (0.23-1.00)
Utrecht (Collette et al., 1984)       15         non-attenders or uninvited   -            -               9          0.30 (0.13-0.70)
Florence (Palli et al., 1986)         15         non-attenders or uninvited   -            -               8          0.53 (0.29-0.95)
* Proportion of screening group attending at first screening round.
† Proportion of controls screened at least once during the study period.
‡ Risk of dying of breast cancer in screening group relative to risk in controls.
§ Derived from figure in publication: at the planned end of the trial it was only 0.96 (0.68-1.35).
HIP, Health Insurance Plan of New York Study. The analysis relates to breast cancer deaths in cases diagnosed within 5 years of entry to the study as only four annual screens were offered.
TEDBC, Trial of Early Detection of Breast Cancer after excluding Edinburgh to avoid double counting.
BCDDP, Breast Cancer Detection Demonstration Project. This included 283 000 women but only 55 000 have been subject to mortality analysis.
SEER, Surveillance, Epidemiology and End Results. The BCDDP did not have a control group but data from the national SEER programme were used to estimate the expected mortality from breast cancer to compare with the screening group.
NK, not known. CI, confidence interval.
Results of the Published Studies
There are now nine published studies and Table 2 summarizes their statistical features together with the relative risk of dying from breast cancer attributable to screening using the most recent follow-up information. Whilst only five of the studies have relative risks statistically significantly less than unity, they are all consistent with screening having a protective effect as large as 40%. A simple way to assess the statistical significance of the combined data is to consider the probability, if screening had no effect, of nine such studies yielding results in the same direction. This is the same as the probability that an unbiased coin lands on 'tails' nine times in a row, which is (1/2)^9 or about 0.002. It is possible to carry out a more formal meta-analysis which takes account of the magnitude of the effect found in each study as well as the direction. This would clearly lead to a more extreme level of statistical significance.
The reason why some studies have by themselves found statistically significant results whilst others have not can be seen by examining the individual study details. The HIP study, although handicapped by the insensitive mammography available in the 1960s and the low compliance rate (64%), yielded a similar result to more recent studies (Shapiro et al., 1988). The 18 year follow-up period means that the results are less likely to be due to chance (there are now 289 breast cancer deaths). It may also provide a more realistic estimate of the long-term effects than other studies.
The Swedish 'two-counties' study with its large study population and extremely high compliance rate was able to demonstrate an effect at an early phase of follow-up (Tabar et al., 1989). The cumulative breast cancer mortality rate in the screening arm was lower than that in the control arm after 2-4 years of follow-up and the gap between them has continued to widen thereafter.
The other Swedish study, from Malmö, was much smaller and although the follow-up was longer yielded substantially poorer results (Andersson et al., 1988). The cumulative breast cancer mortality rate was actually higher in the screening arm than the control arm for 8 years; the rates were equal in the ninth year, slightly lower in the screening arm in the tenth (the scheduled time of the first statistical analysis) and continued to separate in the eleventh year. Non-compliance and contamination together mean that the effective size of this trial is only a fraction of the actual size and this may have contributed to the poor results. However, it is also likely that at a technical level performance was poorer than in more successful studies. This can be seen by observing the proportion of breast cancers that surfaced between successive screening rounds as 'interval cancers'. Among women in the screening arm who attended for screening and died of breast cancer, 62% (20/32) of the cancers were interval cancers, compared with 44% (30/68) in the two-counties study (Tabar et al., 1987), despite a shorter average interval between screening rounds. The higher proportion would tend to be observed in centres whose screening sensitivity was relatively low, with cases 'missed' at screening being found in the interval between screens.
The Edinburgh trial had a fairly low compliance (61%) and in the early screening rounds also had technical problems leading to low screening sensitivity, although it improved, as indicated by the interval cancer rate, as the trial progressed (Roberts et al., 1990). In the first four biennial mammographic rounds the proportions of breast
cancers surfacing in the next year were 28%, 17%, 5% and 5% respectively. In addition the results of the trial may have been affected by the method of randomization. Women were allocated to the two arms according to the general practice where they were registered. Since there were only 84 practices there was room for the play of chance to cause an imbalance. The fact that the total mortality rate from all causes in the screening arm was 20% lower than the control arm (a statistically significant difference) suggests that this has happened. The difference would be explained if the practices allocated to screening had had patients with a higher socio-economic status and this was found to be the case based on a 20% sample of each practice. Since social class information was not available on individuals it could only be allowed for in the analysis by stratifying the practices and this did not substantially alter the results.
The TEDBC study compared the breast cancer mortality rate in a population invited for screening (Guildford and Edinburgh, but only the former is considered here because the Edinburgh data are also represented in the randomized trial) with that in four control towns combined (Trial of Early Detection of Breast Cancer, 1988). The control population was comparable to Guildford since the breast cancer mortality rate in the period before the study began was similar in Guildford to that in the control towns after allowing for age. In the statistical analysis a small disparity was allowed for by adjusting the results for the pre-study rates. It is possible, however, that the similarities prior to the study may not reflect differences that would have occurred in the study period had screening not been offered and to the extent that this cannot be examined the results, though similar to those of the randomized trials, carry less weight.
The four remaining studies compare the mortality rate in screened women with that in controls.
Because in the absence of selection bias they are not influenced by the compliance rate, the magnitude of the observed effect should be greater than in the randomized trials or the TEDBC. That is what is found in the two Dutch studies from Nijmegen (Verbeek et al., 1984) and Utrecht (Collette et al., 1984) and in the Florence study (Palli et al., 1986), but not in the multi-centre BCDDP (Morrison et al., 1988) from the USA. However, all four studies are likely to have been affected by selection bias although this may have been materially different in the BCDDP for two reasons. Firstly, the BCDDP centres provided open access to mammographic services and women attended in response to general publicity about screening whereas in the three other studies women attended in response to a specific invitation. Secondly, the BCDDP used national breast cancer mortality rates to compare with the mortality in attenders, whereas the other studies used rates largely in non-attenders. Although the influences on attendance are unknown it is likely that the factors influencing non-attendance following a specific invitation may be greater than those influencing attendance following general publicity. If so, more weight should be given to the BCDDP than to the other studies.
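The coin-tossing argument and the more formal meta-analysis mentioned at the start of this section can be sketched as follows. The relative risks and confidence intervals are those listed in Table 2. The pooling method is an assumption made for illustration, not the author's analysis: a fixed-effect, inverse-variance combination on the log scale, with each standard error recovered from the width of the published 95% confidence interval. Pooling studies of different design in this way ignores the selection biases discussed above.

```python
import math

# (relative risk, lower 95% limit, upper 95% limit) from Table 2
STUDIES = {
    "HIP": (0.79, 0.62, 0.99),
    "Two-counties": (0.70, 0.55, 0.87),
    "Malmö": (0.83, 0.60, 1.14),
    "Edinburgh": (0.84, 0.58, 1.18),
    "TEDBC": (0.78, 0.58, 1.04),
    "BCDDP": (0.80, 0.72, 0.87),
    "Nijmegen": (0.48, 0.23, 1.00),
    "Utrecht": (0.30, 0.13, 0.70),
    "Florence": (0.53, 0.29, 0.95),
}


def same_direction_probability(n_studies):
    """Chance that n fair coins all land the same named way, e.g. all 'tails'."""
    return 0.5 ** n_studies


def pooled_relative_risk(studies, z95=1.96):
    """Fixed-effect inverse-variance pooling on the log scale (assumed method)."""
    total_weight = 0.0
    weighted_sum = 0.0
    for rr, lower, upper in studies.values():
        # Standard error of log(RR), recovered from the confidence interval width.
        se = (math.log(upper) - math.log(lower)) / (2 * z95)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * math.log(rr)
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z95 * pooled_se),
        math.exp(pooled_log + z95 * pooled_se),
    )


p_sign = same_direction_probability(len(STUDIES))
pooled, pooled_lower, pooled_upper = pooled_relative_risk(STUDIES)
print("P(nine results in the same direction) =", p_sign)
print("pooled RR = %.2f (%.2f-%.2f)" % (pooled, pooled_lower, pooled_upper))
```

Because the weights are inverse variances, the large BCDDP dominates the pooled estimate; a random-effects model or a restriction to the randomized trials would be reasonable alternatives.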
Summary
Given the differences between the nine studies in design and execution alone, it is remarkable that they all support the same general conclusion that mammographic screening for breast cancer is capable of reducing mortality from the disease. There are also differences between them in the screening technique (age range, interval between screens, number of mammographic views) and treatment but this does not provide a simple guide to what we can expect in the National Health Service Breast Cancer Screening Programme which is now underway in this country.
The new national programme recruits women in the age range 50-64 and Table 3 shows the result from five of the studies in which separate data for a 'middle-age' group can be derived: the total Utrecht study group was aged 50-69 and the other three cannot be divided in this way. In each study, apart from the HIP study, the magnitude of the benefit is greater than in the total group. It is possible that these differences will decrease with longer follow-up (it took longer for an effect to emerge in the youngest age group in the HIP study) but the relative risks in Table 3 give a better guide than those in Table 2 in the short term.
The specific effects of different intervals between screens and mammographic views are currently being investigated in randomized clinical trials. In the meantime our best estimate of what is likely to happen in the national programme is probably provided by the two-counties study which, like us, used single-view mammography and in the 50-64 age group had an average screening interval of 33 months, close to the 3 years we have adopted.
Table 3 - Breast cancer mortality results in women of similar age range to 50-64 from five studies
Study                                 Age range (years)  Relative risk*  (95% CI)
HIP (Shapiro et al., 1988)            50-64              0.79            (0.58-1.06)
Two-counties (Tabar et al., 1989)     50-59              0.60            (0.40-0.90)
Malmö (Andersson et al., 1988)        55-69              0.73            (0.49-1.10)
Edinburgh (Roberts et al., 1990)      50-64              0.80            (0.52-1.25)
BCDDP (Morrison et al., 1988)         50-59              0.76            (0.64-0.87)
* Risk of dying of breast cancer in screening group relative to risk in controls.
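The comparison between the two tables can be checked with a few lines of arithmetic. This sketch simply converts the published relative risks into percentage reductions in mortality and lists the studies whose 'middle-age' result in Table 3 is more favourable than the whole-group result in Table 2; the variable and function names are hypothetical.

```python
# Relative risks for the whole study group (Table 2)
WHOLE_GROUP = {
    "HIP": 0.79, "Two-counties": 0.70, "Malmö": 0.83,
    "Edinburgh": 0.84, "BCDDP": 0.80,
}
# Relative risks for the age range closest to 50-64 (Table 3)
MIDDLE_AGE = {
    "HIP": 0.79, "Two-counties": 0.60, "Malmö": 0.73,
    "Edinburgh": 0.80, "BCDDP": 0.76,
}


def percent_reduction(relative_risk):
    """Mortality reduction implied by a relative risk, as a whole percentage."""
    return round((1 - relative_risk) * 100)


# Studies in which the middle-age group shows a larger benefit (lower RR).
larger_benefit = [
    study for study in WHOLE_GROUP if MIDDLE_AGE[study] < WHOLE_GROUP[study]
]

for study in WHOLE_GROUP:
    print(
        study,
        percent_reduction(WHOLE_GROUP[study]),
        percent_reduction(MIDDLE_AGE[study]),
    )
print("larger benefit in middle-age group:", larger_benefit)
```

The point estimates alone are compared here; the confidence intervals in Table 3 are wide, so the differences between the two tables should be read with caution.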
Conclusion
When all nine published studies on breast cancer mortality in relation to screening are taken together a clear and consistent effect is seen. Differences have arisen because of chance factors which are in turn influenced by statistical features of the studies. The best estimate of the magnitude of reduction in mortality achievable by the new national screening programme is obtained from the two-counties study. At the latest follow-up for women who were aged 50-59 at the time of invitation the reduction was 40%.
REFERENCES
Andersson, I, Aspegren, K, Janzon, L, Landberg, T, Lindholm, K, Linell, F et al. (1988). Mammographic screening and mortality from breast cancer: the Malmö mammographic screening trial. British Medical Journal, 297, 943-948.
Collette, HJA, Day, NE, Rombach, JJ & de Waard, F (1984). Evaluation of screening for breast cancer in a non-randomised study (the DOM project) by means of a case-control study. Lancet, i, 1224-1226.
Morrison, AS, Brisson, J & Khalid, N (1988). Breast cancer incidence and mortality in the Breast Cancer Detection Demonstration Project. Journal of the National Cancer Institute, 80, 1540-1547.
Palli, D, Rosselli del Turco, M, Buiatti, E, Carli, S, Ciatto, S, Toscani, L et al. (1986). A case-control study of the efficacy of a non-randomised breast cancer screening program in Florence (Italy). International Journal of Cancer, 38, 501-504.
Roberts, MM, Alexander, FE, Anderson, TJ, Chetty, U, Donnan, PT, Forrest, P et al. (1990). Edinburgh trial of screening for breast cancer: mortality at seven years. Lancet, i, 241-246.
Shapiro, S, Venet, W, Strax, P & Venet, L (1988). Periodic Screening for Breast Cancer: The Health Insurance Plan Project and Its Sequelae, 1963-86. Johns Hopkins University Press, London.
Tabar, L, Fagerberg, G, Day, NE & Holmberg, L (1987). What is the optimum interval between mammographic screening examinations? An analysis based on the latest results of the Swedish two-county breast cancer screening trial. British Journal of Cancer, 55, 547-551.
Tabar, L, Fagerberg, G, Duffy, SW & Day, NE (1989). The Swedish two county trial of mammographic screening for breast cancer: recent results and calculation of cost benefit. Journal of Epidemiology and Community Health, 43, 107-114.
Trial of Early Detection of Breast Cancer (1988). First results on mortality reduction in the UK trial of early detection of breast cancer. Lancet, ii, 411-416.
Verbeek, ALM, Hendriks, JHCL, Holland, R, Mravunac, M, Sturmans, F & Day, NE (1984). Reduction of breast cancer mortality through mass screening with modern mammography: first results of the Nijmegen Project, 1975-1981. Lancet, i, 1222-1224.
H. CUCKLE
Department of Environmental and Preventive Medicine
Medical College of St Bartholomew's Hospital
Charterhouse Square
London EC1M 6BQ