Clinical Radiology (2003) 58: 311–314 doi:10.1016/S0009-9260(02)00577-9, available online at www.sciencedirect.com
Tumour Size Measurement in an Oncology Clinical Trial: Comparison Between Off-site and On-site Measurements

A. L. BELTON*, S. SAINI†, K. LIEBERMANN‡, G. W. BOLAND†, E. F. HALPERN†

*Radiology Associates of Wausau, Wausau, Wisconsin, USA, †Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA and ‡WorldCare Inc., Cambridge, Massachusetts, USA

Received: 25 January 2002; Revised: 31 October 2002; Accepted: 7 November 2002
AIM: To evaluate the degree of variability between lesion measurements obtained by a single observer compared with multiple observers, and in selected cases to evaluate which of the two measurements more accurately represented the lesion size.

MATERIALS AND METHODS: In this study we compared the performance of a single off-site observer with that of multiple on-site observers during measurement of 300 abdominal and thoracic lesions. Lesion measurements that were larger than 1 cm² and differed by more than 50% but by less than 100% were compared by a single adjudicator, who was blinded to the measurement source (n = 46).

RESULTS: Measurements of the 300 lesions differed by an average of 109% (SD 251%). Of 266 lesions larger than 1 cm², results of the single observer compared with multiple observers differed by more than 10% for 249 lesions, more than 30% for 169 lesions, more than 50% for 126 lesions, and more than 100% for 66 lesions. Forty-six lesions were compared by the adjudicator. The adjudicator selected the measurement of the single observer for 37 lesions (80.4%), and the measurement determined by one of the multiple observers for nine lesions (19.6%; p = 0.00002).

CONCLUSION: Measurement of lesion size by a single observer compared with multiple observers reveals a high degree of variability. An adjudicator selected the measurement of the single observer more frequently than that of multiple observers, with statistical significance. These findings suggest that studies designed to quantify imaging features should limit the number of observers. Belton A. L. et al. (2003). Clinical Radiology 58, 311–314. © 2003 Published by Elsevier Science Ltd on behalf of The Royal College of Radiologists.

Key words: diagnostic radiology, observer performance, images, interpretation, efficacy study.
INTRODUCTION
Medical imaging is an increasingly important means of assessing the efficacy of therapeutic strategies. For example, in cancer chemotherapy, objective documentation of reductions in tumour burden has prognostic importance. Reductions of measurable tumour burden by 50% or more constitute a partial response that is predictive of symptom palliation and modest improvements in survival [1,2]. This important criterion is based on lesion size measurements and is used by the World Health Organization (WHO) and the Southwest Oncology Group. Therefore, fundamental to the evaluation of patient response and therapeutic effectiveness is the accuracy of lesion measurement.

Author for correspondence and guarantor of study: Austin L. Belton, Radiology Associates of Wausau, Suite 106, 2800 Westhill Drive, Wausau, WI 54401, USA. Tel: +1-715-847-2283; Fax: +1-715-847-2154; E-mail: [email protected]

In the United States, the Food and Drug Administration Modernization Act, signed into law on November 21, 1997, established a "fast track" designation for certain drugs, allowing approval on the basis of surrogate end points in phase II studies [3]. In oncology trials, tumour size, as measured on medical images such as computed tomography (CT) or magnetic resonance imaging (MRI), is a commonly employed surrogate marker. The benefit of employing surrogate markers is that, compared with traditional endpoints, they help expedite the clinical evaluation of new drugs.

Most phase II and phase III clinical trials in oncology are performed at multiple institutions where, if required, lesion size is measured "on-site". This approach aggregates the variability of multiple observers at multiple institutions. In addition, bias from knowledge of a patient's clinical status may influence
image analysis, particularly when oncologists perform the quantitative measurements. An alternative approach involves image analysis at an "off-site" location, where fewer independent, unbiased observers analyse the images in a standardized fashion. We hypothesized that off-site quantitative analysis using a single observer would improve the accuracy of lesion measurement. The purpose of this study was to compare on-site versus off-site tumour size measurements in a large multicentre oncology trial.
MATERIALS AND METHODS
Medical images were acquired as part of an international multi-centre phase II trial of capecitabine (Xeloda; Hoffmann-La Roche, Nutley, NJ, U.S.A.) at 19 sites worldwide. Capecitabine is a 5-fluorouracil precursor that is used to treat metastatic breast carcinoma [4]. The clinical trial was performed in patients who were refractory to traditional chemotherapy [5]. A total of 300 lesions in 58 patients formed the data set for this analysis. Metastatic lesions involved a variety of organ systems, including liver, lung and lymph nodes, and were evaluated with a variety of imaging techniques, including CT, MRI, ultrasound and radiographs.

During the trial, the lesions were identified and labelled on-site. A cross-product technique using bi-dimensional measurements was employed for evaluating tumour size [6]. Using WHO criteria, the cross-product was obtained by measuring the greatest diameter and multiplying it by the greatest perpendicular diameter. On-site, the lesions were measured by one of multiple observers using manual measurement techniques or electronic calipers. Manual measurements were performed by measuring the lesion on the hard-copy image; using a standard reference distance printed on the hard-copy image, the size of the lesion could then be determined. Electronic measurements were obtained by placing electronic calipers on the borders of the lesion; a computer application then calculated the lesion size based on the relationship between image and patient size. Window settings were organ specific, i.e. soft-tissue windows for abdominal lesions and lung windows for lung lesions.

Hard-copy images were sent to the off-site review institute, where they were digitized to an electronic format. Window settings were not adjusted during this process, and the same window settings were applied for all measurements of a given lesion. No permanent record of the measurement was made on the images.
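The WHO bi-dimensional cross-product described above can be sketched in code. The sketch below is illustrative only; the function name and the representation of a lesion outline as (x, y) points in centimetres are our own assumptions, not part of the trial protocol. It approximates the greatest perpendicular diameter by the lesion's extent perpendicular to the greatest diameter, which coincides with the longest perpendicular chord for convex outlines.

```python
from itertools import combinations
import math

def who_cross_product(boundary):
    """WHO bi-dimensional cross-product for a lesion outline given as
    (x, y) points in cm: the greatest diameter multiplied by the
    lesion's extent perpendicular to it. Returns cm^2."""
    # Greatest diameter: the most distant pair of boundary points.
    a, b = max(combinations(boundary, 2),
               key=lambda pair: math.dist(pair[0], pair[1]))
    d_max = math.dist(a, b)
    # Unit vector along the greatest diameter, then its perpendicular.
    ux, uy = (b[0] - a[0]) / d_max, (b[1] - a[1]) / d_max
    px, py = -uy, ux
    # Greatest perpendicular extent: spread of projections onto (px, py).
    proj = [x * px + y * py for (x, y) in boundary]
    d_perp = max(proj) - min(proj)
    return d_max * d_perp
```

For a 2 × 2 cm square outline, for example, the greatest diameter is one diagonal (2√2 cm) and the perpendicular extent is the other diagonal, giving a cross-product of 8 cm².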
The images were labelled with a unique identifier to specify the patient and lesion. The group of multiple on-site observers included technologists, radiologists and oncologists who followed a written tumour measurement protocol. The hard-copy images were then forwarded to a single off-site location (WorldCare Inc., Cambridge, MA, U.S.A.). At the off-site institution, a single observer measured each of the 300 lesions using electronic calipers on a soft copy workstation. The off-site observer was a board-certified radiologist with subspecialty training in abdominal imaging and was blinded to the measurements of the on-site observers. The bi-dimensional cross-products obtained by the single off-site observer and multiple on-site observers were compared.
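The study compared cross-products via a percentage difference taken relative to the smaller measurement, and reserved adjudication for lesions larger than 1 cm² whose measurements disagreed by at least 50% but by less than 100%. A minimal sketch of that screening arithmetic (helper names are ours; we also assume the 1 cm² threshold applies to the smaller of the two cross-products):

```python
def percent_difference(cp_a, cp_b):
    """Percentage difference between two cross-products (cm^2),
    expressed relative to the smaller of the two."""
    lo, hi = sorted([cp_a, cp_b])
    return 100.0 * (hi - lo) / lo

def needs_adjudication(cp_onsite, cp_offsite):
    """Screening used to select lesions for adjudication: lesion larger
    than 1 cm^2 (assumed to apply to the smaller value), with the two
    measurements disagreeing by at least 50% but less than 100%."""
    if min(cp_onsite, cp_offsite) <= 1.0:
        return False  # small lesions are difficult to measure reliably
    return 50.0 <= percent_difference(cp_onsite, cp_offsite) < 100.0
```

For instance, cross-products of 4.0 and 6.4 cm² differ by 60% and would go to adjudication, whereas 4.0 and 9.0 cm² differ by 125% and would be excluded as a possible error in lesion identification.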
Based on the smaller of the two measurements, a percentage difference (average, median, and standard deviation) between them was determined. A single adjudicator who did not participate in the on-site or off-site assessments compared the two measurements. The purpose of the adjudicator was to select the measurement that most accurately reflected the size of the lesion. The adjudicator was blinded to the source of the two measurements.

Adjudication was performed on lesions that were larger than 1 cm² (to exclude small lesions, which are difficult to measure reliably), differed by at least 50% (to exclude lesions where small differences in measurements may be due to inherent ambiguity in lesion boundaries), and differed by less than 100% (to exclude lesions with potential errors in lesion identification). Of the 300 lesions, 46 were larger than 1 cm² and had a disagreement in measured size between 50 and 100% (Fig. 1). The adjudicator independently remeasured the 46 lesions and selected the cross-product measurement that was closer to the adjudicator's own cross-product measurement of the lesion. The statistical significance of the adjudicator's selection of the off-site or on-site measurements was determined using a binomial distribution, assuming equal probability of selecting the single off-site or multiple on-site observers.

RESULTS
Measurements of the 300 lesions differed by an average of 109%, with a median of 40% and a standard deviation of 251%. Of the 266 lesions larger than 1 cm², results of the off-site observer differed from those of the multiple observers by more than 10% for 249 lesions, by more than 30% for 169 lesions, by more than 50% for 126 lesions, and by more than 100% for 66 lesions (Table 1). Fig. 2a is a scatter plot of lesion measurements by the on-site observers versus the off-site observer. Fig. 2b shows the subset of lesions smaller than 50 cm² to further demonstrate the degree of variability in lesion measurements.

Of the 46 lesions that met all three criteria for adjudication, the adjudicator selected the measurement of the off-site observer for 37 lesions (80.4%) and that of the on-site observer for nine lesions (19.6%). Assuming an equal likelihood of selecting each of the measurements, the difference was statistically significant with a p-value of 0.00002.

DISCUSSION
Fig. 1 – Lesions for adjudication.

Table 1 – Percentage difference in measurements between the off-site observer and the on-site observers

Difference in measurements    Number of lesions
>10%                          249
>30%                          169
>50%                          126
>100%                         66

Qualitative and quantitative assessments of medical images are becoming increasingly important as a means of rapidly and accurately determining the efficacy of novel therapeutic drugs. This application is particularly valuable where clinical efficacy is either difficult to measure or takes many years to demonstrate. Indeed, the first successful application of medical imaging to obtain regulatory approval was in patients with multiple sclerosis, in whom drug efficacy was demonstrated with MRI [7]. More recently, in an attempt to give early access to drugs that treat life-threatening or debilitating conditions, the FDA has developed expedited approval procedures in which surrogate endpoints (such as tumour size) are used to show drug efficacy [3]. Hence the need for accurate measurements and independent verification is paramount.

Our study demonstrates that measurement of lesion size by a single unbiased observer compared with multiple observers reveals a high degree of variability. An adjudicator selected the measurement of the single observer more frequently than that of multiple observers, with statistical significance. These observations suggest that measurements made by a limited number of experienced observers should increase the accuracy of quantitative analysis of medical images in clinical trials. An improvement in accuracy has the important advantage of reducing the sample size needed to demonstrate a therapeutic difference between test drug and placebo.

These findings confirm earlier studies. For example, in 1984 Henderson et al. compared on-site versus off-site tumour measurements in a multi-centre cancer trial [8]. They reported that on-site and off-site examiners agreed 75% of the time on the change in disease burden. Of those patients with a remission, there was only 42% agreement between local and central examiners. Inter-examiner agreement was slightly worse than intra-observer agreement.
In a review by Robinson [9], assessments of interobserver variability during the interpretation of imaging studies yielded differences large enough to outweigh the differences between radiological techniques. There are several explanations for the observed difference between the two models in our study. First, a single observer provided a standardized and reproducible format for lesion analysis. Second, the off-site observer was a highly qualified radiologist with subspecialty radiology training who may inherently have been more accurate than a range of different observers at on-site institutions. Finally, more accurate measurements may have been obtained with soft-copy image analysis, which permits image manipulation (greyscale manipulation and magnification) and the use of electronic cursors.

There are several limitations to our study. Most importantly, the lesion measurement procedure used by the off-site reader (digital images with electronic calipers) was less frequently used for on-site measurements. Thus an inherent bias in the measurement technique may have aligned the measurements of the adjudicator with those of the off-site observer. In addition, lesion measurement is, by its very nature, somewhat subjective. For example, a lesion with lobulated or ill-defined margins will produce substantial variability in measurements between observers. As no gold standard or objective measurement of tumour size was available, measurement accuracy could not be established with certainty. Our results do, however, document greater reproducibility of the measurements by the single off-site observer than by the multiple on-site observers when assessed by an adjudicator.

Fig. 2 – (a) Scatter plot of lesion measurement cross-products of the multiple on-site observers versus the single off-site observer. (b) Scatter plot of lesion measurement cross-products of the multiple on-site observers versus the single off-site observer, using the data subset of lesions smaller than 50 cm².

With rapid advances in imaging technology, the use of surrogate markers to demonstrate the efficacy of novel drugs is likely to increase. Our findings suggest that studies designed to quantify imaging features should limit the number of observers and perform the analysis off-site using a standardized, reproducible format. Our results underscore the importance of employing a well-trained observer working in a facility with capabilities for soft-copy measurements. Rapid adoption of the DICOM (Digital Imaging and Communications in Medicine) image format standard will facilitate efficient, low-cost centralized image analysis.
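The p-value of 0.00002 reported above can be checked with a one-sided binomial tail under the study's equal-probability assumption, given that the adjudicator preferred the off-site measurement for 37 of 46 lesions. A minimal sketch using only the Python standard library (the function name is ours):

```python
from math import comb

def binomial_tail(successes, n):
    """One-sided tail probability under a fair binomial: the chance of
    observing `successes` or more out of `n` trials if each selection
    were equally likely to favour either measurement source."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2**n

# 37 of 46 adjudications favoured the off-site observer.
p = binomial_tail(37, 46)  # ~2.03e-05, i.e. the reported p = 0.00002
```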
REFERENCES
1 Buyse M. The relationship between response to treatment and survival in patients with measurable solid tumors. Sixth International Congress on Anti-Cancer Treatment, February 6–9, 1996, Paris, France, 1996; 35.
2 Hamblin TJ. Clinical parameters for evaluating biological response modifier therapy. Eur J Cancer Clin Oncol, 1989;25(Suppl. 3):S7–S9.
3 Morris L. FDA Modernization Act: implications for oncology. Oncology (Huntingt), 1998;12:139–141.
4 Dooley M, Goa KL. Capecitabine. Drugs, 1999;58:69–76; discussion 77–78.
5 Blum JL, Jones SE, Buzdar AU, et al. Multicenter phase II study of capecitabine in paclitaxel-refractory metastatic breast cancer. J Clin Oncol, 1999;17:485–493.
6 Moertel CG, Hanley JA. The effect of measuring error on the results of therapeutic trials in advanced cancer. Cancer, 1976;38:388–394.
7 Kappos L, Stadt D, Keil W, et al. An attempt to quantify magnetic resonance imaging in multiple sclerosis—correlation with clinical parameters. Neurosurg Rev, 1987;10:133–135.
8 Henderson WG, Zacharski LR, Spiegel PK, et al. Comparison of local versus central tumor measurements in a multicenter cancer trial. Am J Clin Oncol, 1984;7:705–712.
9 Robinson PJ. Radiology's Achilles' heel: error and variation in the interpretation of the roentgen image. Br J Radiol, 1997;70:1085–1098.