Int. J. Radiation Oncology Biol. Phys., Vol. 77, No. 3, pp. 959–966, 2010
Copyright © 2010 Elsevier Inc. Printed in the USA. All rights reserved
0360-3016/$–see front matter
doi:10.1016/j.ijrobp.2009.09.023
PHYSICS CONTRIBUTION
EVALUATION OF AUTOMATIC ATLAS-BASED LYMPH NODE SEGMENTATION FOR HEAD-AND-NECK CANCER

LIZA J. STAPLEFORD, M.D.,* JOSHUA D. LAWSON, M.D.,*† CHARLES PERKINS, M.D., PH.D.,* SCOTT EDELMAN, M.D.,* LAWRENCE DAVIS, M.D., M.B.A., F.A.C.R.,* MARK W. MCDONALD, M.D.,*‡ ANTHONY WALLER,* EDUARD SCHREIBMANN, PH.D.,* AND TIM FOX, PH.D.*

From the *Department of Radiation Oncology, Emory University School of Medicine and Winship Cancer Institute of Emory University, Atlanta, GA. Current addresses: †Department of Radiation Oncology, University of California, San Diego, School of Medicine, La Jolla, CA; ‡Department of Radiation Oncology, Indiana University, Melvin and Bren Simon Cancer Center, Indianapolis, IN.

Purpose: To evaluate whether automatic atlas-based lymph node segmentation (LNS) improves efficiency and decreases interobserver variability while maintaining accuracy.

Methods and Materials: Five physicians with head-and-neck IMRT experience used computed tomography (CT) data from 5 patients to create bilateral neck clinical target volumes covering specified nodal levels. A second contour set was automatically generated using a commercially available atlas. Physicians modified the automatic contours to make them acceptable for treatment planning. To assess contour variability, the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm was used to take collections of contours and calculate a probabilistic estimate of the "true" segmentation. Differences between the manual, automatic, and automatic-modified (AM) contours were analyzed using multiple metrics.

Results: Compared with the "true" segmentation created from manual contours, the automatic contours had a high degree of accuracy, with sensitivity, Dice similarity coefficient, and mean/max surface disagreement values comparable to the average manual contour (86%, 76%, 3.3/17.4 mm automatic vs. 73%, 79%, 2.8/17 mm manual). The AM group was more consistent than the manual group for multiple metrics, most notably reducing the range of contour volume (106–430 mL manual vs. 176–347 mL AM) and percent false positivity (1–37% manual vs. 1–7% AM). Average contouring time savings with the automatic segmentation was 11.5 min per patient, a 35% reduction.

Conclusions: Using the STAPLE algorithm to generate "true" contours from multiple physician contours, we demonstrated that, in comparison with manual segmentation, atlas-based automatic LNS for head-and-neck cancer is accurate, efficient, and reduces interobserver variability. © 2010 Elsevier Inc.

Auto-segmentation, Intensity-modulated radiotherapy, Deformable image registration, Interobserver variability, Neck nodal volumes.
INTRODUCTION
Intensity-modulated radiation therapy (IMRT) has allowed for multiple advances in the treatment of head-and-neck cancer (HNC), including improved parotid gland sparing (1, 2) and higher radiation doses for tumors located near critical structures. To fully exploit the advantages of IMRT, all target volumes and critical structures must be contoured before treatment planning. This time-consuming process may be repeated multiple times during a treatment course because of tumor response or changes in patient weight or anatomy. Automatic segmentation can reduce physician contouring time, with time reductions of up to 30–40% seen in studies of HNC and breast contouring (3, 4). Another potential advantage of automatic segmentation is reduction of intra- and interobserver variability in anatomical volume delineation. Variability of contouring among physicians has been noted in a number of studies (5–8). The impact of such inconsistencies may be especially evident in HNC radiotherapy, where the range of interobserver variability is somewhat larger and may exceed the errors due to position uncertainty and organ motion (9). Interobserver variability may not affect an individual radiation oncologist's contours, but it
Reprint requests to: Tim Fox, Ph.D., Emory University School of Medicine, Winship Cancer Institute, Department of Radiation Oncology, 1365 Clifton Road, Building C, Atlanta, GA 30322. Tel: (404) 778-3473; Fax: (404) 778-4139; E-mail: [email protected].
Dr. Fox is entitled to royalties derived from Velocity Medical Solutions' sale of products. The terms of this agreement have been reviewed and approved by Emory University in accordance with its conflict of interest policies.
Received June 5, 2009, and in revised form Sept 7, 2009. Accepted for publication Sept 15, 2009.
does impact the field as a whole with regard to the interpretation of clinical trial results and consistency across the specialty. Automatic segmentation has been shown to reduce variability of contours among physicians and improve efficiency for multiple disease sites (3, 4). The gains in efficiency and consistency are valuable only if accuracy is not compromised. Assessment of accuracy is a complex issue, because there is no objective volume for comparison. A standard approach in the evaluation of automatic segmentation for radiotherapy planning has been to use individual expert physician segmentations for comparison. The shortcoming of this approach is that it does not address interobserver variability. The Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm is a widely accepted tool that adjusts for intra- and interobserver variability in image segmentation (10). It takes a collection of segmentations and calculates a probabilistic estimate of the true segmentation. Using this algorithm, we took a collection of physician manual contours and generated an estimate of the true segmentation, to use as the "reference standard" for contour comparisons. We compared manual, automatic, and automatic-modified (AM) contours to this standard. This study focused on the target volume of lymph node regions for HNC patients, as manually contouring these volumes is a time-intensive task. The goal of this study was to use multiple assessment tools, including the STAPLE algorithm, to evaluate whether automatic segmentation could decrease interphysician variability while maintaining accuracy. Using these same methods, we analyzed how physicians modify automatic anatomical segmentations in terms of size, shape, and position.

MATERIALS AND METHODS

Study overview
We selected 5 adult patients with non-bulky neck nodes who were treated with IMRT for HNC of either the oropharynx or nasopharynx. For each patient, a three-step process was performed: physicians manually contoured designated regions of interest (ROIs) on the planning CT scans; the HNC atlas was automatically registered to the planning CT to delineate atlas-based ROIs; and physicians reviewed and modified the atlas-based ROIs.
Creation of automatic contours
A commercially available HNC atlas (Velocity Medical Systems, Atlanta, GA) was used for this study. The atlas included bilateral retropharyngeal and cervical lymph node levels I through VI and was created on a model patient CT closely following the Radiation Therapy Oncology Group/European Organization for Research and Treatment of Cancer/Danish Head and Neck Cancer Group consensus guidelines (11). For each of the 5 case patients, the Velocity software deformable image registration algorithm was used to adapt the HNC atlas to each patient's unique CT image set (12). Ipsilateral lymph node levels were joined in a Boolean operation to create bilateral neck clinical target volumes (CTVs) covering selected lymph node levels. Lymph node level coverage was determined following the Radiation Therapy Oncology Group 0022 protocol and Chao IMRT guidelines (13) based on the individual patient disease characteristics. Software post-processing features were used to smooth the contours and crop them 3 mm from the outer body contour.
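For intuition, the Boolean join of nodal levels and the 3-mm crop inside the body surface can be expressed as operations on binary voxel masks. The sketch below uses NumPy/SciPy with anisotropic voxel spacing; it is an illustrative reimplementation under our own assumptions, not the Velocity software's code, and all names and arguments are ours.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def neck_ctv(level_masks, body_mask, spacing_mm=(3.0, 1.0, 1.0), margin_mm=3.0):
    """Union selected nodal-level masks into one CTV, then crop the result so
    it stays at least margin_mm inside the external body contour.
    (Illustrative sketch only; not the vendor's post-processing code.)"""
    ctv = np.zeros_like(body_mask, dtype=bool)
    for mask in level_masks:            # Boolean OR of the selected levels
        ctv |= mask
    # depth of each voxel inside the body, in mm (honors voxel spacing)
    depth = distance_transform_edt(body_mask, sampling=spacing_mm)
    return ctv & (depth >= margin_mm)
```

The contour-smoothing step mentioned above could be approximated with a morphological opening/closing; it is omitted here for brevity.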
Physician contouring process
Five radiation oncologists with HNC IMRT experience generated individual contours for each patient using the planning CT images. The data sets were anonymized, and physicians were blinded to the contours created by their colleagues. For each of the five cases, physicians were given limited clinical information, consisting of stage and disease site, and were instructed to create bilateral neck CTVs covering the specified lymph node levels, using the Radiation Therapy Oncology Group/European Organization for Research and Treatment of Cancer/Danish Head and Neck Cancer Group consensus guidelines for reference. These CTV contours are referred to as manual contours. After completing their manual contours, the physicians were given the automatic CTVs with instructions to review the contours to determine if they were acceptable for treatment planning as created and, if they were not, to modify them until deemed suitable for treatment planning. These CTV contours are referred to as AM contours. Physicians were asked to record the elapsed time spent both creating the manual contours and modifying the automatic contours for each patient.
Analysis strategy
To adjust for the interobserver variability among physician contours, the STAPLE algorithm was implemented as an automatic software tool that took the collections of individual physician segmentations and calculated probabilistic estimates of the true segmentations (Fig. 1). As right and left CTVs were considered discrete structures for each of the five patients, there were a total of 10 CTVs. Two STAPLE volumes were created for each of the 10 CTVs: STAPLE-manual, created from the set of five manual contours, and STAPLE-AM, created from the set of five AM contours. The STAPLE-manual served as the "reference standard" for the comparison of manual vs. automatic contours, whereas the STAPLE-AM served as the "reference standard" for the comparison of AM vs. automatic contours. Quantitative analysis was performed using our software tool and the STAPLE "reference standards" to assess the differences among the manual, automatic, and AM contours with regard to size, position, and shape. Five metrics were used for comparisons: volume, sensitivity, false positive (FP), Dice similarity coefficient (DSC), and mean/max surface disagreement. Sensitivity and FP were defined with the STAPLE volumes considered the "reference." For two structures where one is defined as the "reference" and the other is a "test" structure, sensitivity and specificity are based on the "true positive" (TP), "false positive," "true negative," and "false negative" (FN) values. For binary image maps of structures, each pixel has a value of 1 or 0 based on its inclusion in or exclusion from the "test" and "reference" structures. Pixels present in the test structure (1) are positive, whereas pixels not included in the test structure (0) are negative. Positive and negative categories are divided into true and false based on the pixel's inclusion (1) in or exclusion (0) from the STAPLE "reference." The corresponding (reference, test) pixel values are: TP (1, 1); FP (0, 1); true negative (0, 0); FN (1, 0). Sensitivity is defined as:
Auto-segmentation for head-and-neck cancer lymph nodes • L. J. Stapleford et al.

Sensitivity = TP / (TP + FN) × 100%

False positive was used as a surrogate for specificity, which could not be accurately quantified because the true negative volume was too large (all pixels in the CT scan that were not included in either structure). To make the FP value more quantifiable, it was divided by the volume of the test structure to create a percent false positive (%FP):

%FP = FP / (volume of test structure) × 100%

The spatial overlap between any two contours was calculated using the Dice similarity coefficient (DSC) (14):

DSC(A, B) = 2|A ∩ B| / (|A| + |B|)

where A is the volume of a test contour, B is the volume of a reference contour, and A ∩ B is the volume that A and B have in common. When the test and reference structures overlap perfectly, the DSC is one.

Fig. 1. Simultaneous Truth and Performance Level Estimation (STAPLE) probability map generated from a set of physician contours (black lines). The probability map values indicate the likelihood that the "true" segmentation is at that location.

Accuracy
Accuracy was measured by comparing the degree of overlap of the automatic and manual contours, respectively, with the STAPLE-manual, using the metrics listed previously and calculating the mean, range, and standard deviation (SD) for each. The automatic contours were assumed to be accurate if their degree of overlap with the STAPLE-manual was comparable to that of the manual contours. If the range and SD for the automatic contours were less than or equal to those of the manual contours for each of the comparison metrics, then the automatic segmentation was assumed to perform equally well for all CTVs. Physicians also qualitatively assessed the automatic contours prior to making their modifications with a yes/no response to indicate if the automatic contours were acceptable for treatment planning without modifications and, if not, to briefly indicate the major deficiency.

Interoperator variability
To measure interoperator variability, we compared the individual manual contours to the STAPLE-manual and calculated the range, maximum, minimum, mean, median, and standard deviation for all five metrics, for each CTV individually and for the set of 10 CTVs as a whole. The same comparisons were then performed between the AM contours and the STAPLE-AM. If all the physicians drew identical manual contours for a given CTV, the STAPLE-manual would be identical to these contours, the sensitivity and DSC would both be 100%, and the %FP and surface disagreements would be zero. Similarly, if physicians identically modified the automatic contour for a given CTV, then the AM contours and the STAPLE-AM would all be identical. Comparing the individual manual contours for a given CTV to the STAPLE-manual reveals the degree of consistency among the set of manual contours; likewise, comparing the AM contours to the STAPLE-AM reveals the consistency among the AM contours. Therefore, comparing the consistency of the manual group with that of the AM group indicates whether automatic segmentation reduces interoperator variability.
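The comparison metrics described above map directly onto operations on binary voxel masks. The sketch below (NumPy/SciPy; the function and variable names are ours, not from the study's software) computes sensitivity, %FP, DSC, and mean/max surface disagreement for a test mask against a reference mask:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def overlap_metrics(test, ref):
    """Sensitivity, %FP, and DSC of a binary 'test' mask vs. a 'reference' mask."""
    tp = np.logical_and(test, ref).sum()
    fn = np.logical_and(np.logical_not(test), ref).sum()
    fp = np.logical_and(test, np.logical_not(ref)).sum()
    sensitivity = 100.0 * tp / (tp + fn)        # TP / (TP + FN) x 100%
    pct_fp = 100.0 * fp / test.sum()            # FP normalized by test volume
    dsc = 2.0 * tp / (test.sum() + ref.sum())   # Dice similarity coefficient
    return sensitivity, pct_fp, dsc

def surface_disagreement(test, ref, spacing_mm=(1.0, 1.0, 1.0)):
    """Mean and max distance (mm) from the test surface to the reference surface."""
    ref_surf = ref ^ binary_erosion(ref)        # boundary voxels of each mask
    test_surf = test ^ binary_erosion(test)
    # distance of every voxel to the nearest reference-surface voxel
    dist = distance_transform_edt(np.logical_not(ref_surf), sampling=spacing_mm)
    d = dist[test_surf]
    return d.mean(), d.max()
```

Note that the surface-distance definition here (test surface to reference surface, not symmetrized) is one plausible reading of the paper's "surface disagreement"; the published work does not spell out the exact formulation.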
Utilization of automatic contours If physicians were completely unbiased by the automatic contours, then they each should have modified the automatic contours to make them identical to their manual contours, thus making the STAPLE-manual and STAPLE-AM contours identical. For each structure, these two STAPLE volumes were directly compared to assess their similarity. To further analyze bias, the automatic and AM contours were also compared with the STAPLE-AM. If physicians were completely biased by the automatic contours and made no modifications, then the STAPLE-AM would be identical to the automatic contours. Finally, potential time savings were addressed by comparing the time physicians spent creating manual contours to the time spent modifying the automatic contours for each patient.
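For intuition about how the STAPLE reference volumes compared here are produced, the estimate can be sketched as an expectation-maximization loop over binary rater masks. This is a deliberately simplified version of the Warfield et al. algorithm (10) with a fixed global prior, not the implementation used in the study:

```python
import numpy as np

def staple_binary(D, prior=None, n_iter=50):
    """Simplified binary STAPLE: D is an (R, N) 0/1 array of R raters' decisions
    over N voxels. Returns W (N,), the posterior probability that each voxel
    belongs to the 'true' segmentation, plus per-rater sensitivity p and
    specificity q."""
    R, N = D.shape
    g = D.mean() if prior is None else prior    # global foreground prior
    p = np.full(R, 0.9)                         # initial rater sensitivities
    q = np.full(R, 0.9)                         # initial rater specificities
    for _ in range(n_iter):
        # E-step: posterior weight of the 'true' segmentation at each voxel
        a = g * np.prod(np.where(D == 1, p[:, None], 1.0 - p[:, None]), axis=0)
        b = (1.0 - g) * np.prod(np.where(D == 0, q[:, None], 1.0 - q[:, None]), axis=0)
        W = a / (a + b)
        # M-step: re-estimate each rater's performance against the soft truth
        p = (D * W).sum(axis=1) / W.sum()
        q = ((1 - D) * (1.0 - W)).sum(axis=1) / (1.0 - W).sum()
    return W, p, q
```

A STAPLE-manual-style consensus contour would then be obtained by thresholding W (e.g., W >= 0.5).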
Statistics
The variations in volume, sensitivity, FP, DSC, mean/max surface disagreement, and time between the manual contour group and the AM group were compared using Wilcoxon's signed-rank test for paired nonparametric data. For all measurements except volume and time, the metrics were generated using the reference standard for each group (STAPLE-manual for manual contours and STAPLE-AM for AM contours). For each metric, paired data for all CTVs were considered together in order to maximize sample size and power.
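As an illustration of this statistical comparison, Wilcoxon's signed-rank test on paired values is a one-liner in SciPy. The timing numbers below are invented for demonstration only, not data from the study:

```python
import numpy as np
from scipy.stats import wilcoxon

# hypothetical paired contouring times (minutes) per physician-patient pair
manual_min = np.array([30.0, 35.0, 28.0, 40.0, 33.0, 36.0, 31.0, 29.0, 38.0, 34.0])
am_min = np.array([25.0, 29.0, 21.0, 32.0, 24.0, 26.0, 20.0, 17.0, 25.0, 20.0])

# paired, nonparametric test on the per-pair differences
stat, p_value = wilcoxon(manual_min, am_min)
print(f"W = {stat}, p = {p_value:.4f}")
```

With small samples and no ties, SciPy uses the exact distribution of the signed-rank statistic, which matches the paired-nonparametric design described above.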
RESULTS

Accuracy
The physicians' qualitative assessments of the automatic contours deemed 32% of the contours acceptable for treatment planning without modification. Four of the five physicians answered consistently, giving the same yes or no response for all five patients. All of the physicians who answered no indicated that the CTVs were too large. Figure 2(a, b) and Table 1 show the comparisons of the automatic and manual contours, respectively, with the STAPLE-manual. The average sensitivity and DSC for the manual and automatic contours were 73% versus 86% and 79% versus 76%, respectively. The average maximum and mean surface disagreements were 17.0 mm and 2.8 mm (manual) versus 17.4 mm and 3.3 mm (automatic). The automatic contours on average had a higher %FP (32%) than the manual contours (9%). As seen in Table 2, the manual contours were on average smaller (206 mL) than the automatic contours (309 mL) and the STAPLE-manual contours (244 mL). The ranges of values for sensitivity, DSC, and %FP for the automatic contours were all smaller than the respective ranges for the manual contours (Table 1). The SD for mean surface disagreement was smaller for the automatic contours, whereas the SD for maximum surface disagreement was smaller for the manual contours.
Interoperator variability Figure 3(a, b) illustrates sensitivity, DSC, %FP, and surface disagreement for the manual to STAPLE-manual and AM to STAPLE-AM groups. Sensitivity had a mean, median, and range of 73%, 68%, and 47–99% for the manual group versus 82%, 85%, and 54–95% for the AM. The average DSC was higher for the AM group than the manual group: 89% (range, 68–97%) versus 79% (range, 62–93%). The most dramatic difference was seen in regards to volume and %FP. For volume, the AM contours had a range and SD of 176–347 mL and 47 mL compared with 106– 429 mL and 80 mL for the manual contours (Table 2). The %FP was reduced from a mean of 9% (range, 1–37%) to 3% (range, 1–7%) with the introduction of the automatic segmentation. Examples of the manual and AM contours along with the STAPLE-manual are seen in Fig. 4. The AM contours on average had smaller mean and max surface disagreements from the STAPLE-AM (mean, 1.8 mm; max, 14.0 mm) than the manual contours from the STAPLE-manual (mean, 2.8 mm; max, 17.0 mm). For max surface disagreement, the SD and range were smaller for the AM group (3.4 mm and 7.3–21.7 mm) compared with the manual group (4 mm and 10.7–27.5 mm). For mean surface disagreement, the AM group had a larger SD and range (0.91 mm and 0.2–4.2 mm) than the manual group (0.59 mm and 1.3–4.2 mm).
Fig. 2. Accuracy was measured by comparing the automatic and manual contours to the Simultaneous Truth and Performance Level Estimation (STAPLE)-manual. Panels show (a) sensitivity, DSC, and %FP and (b) mean and maximum surface disagreement for the automatic and manual contours. Horizontal lines indicate the mean value, whereas vertical bars indicate the absolute range in (a) and the 99% range (mean ± 3 SD) in (b). Abbreviations: Auto = automatic; %FP = percent false positive; DSC = Dice similarity coefficient.
For each of the metrics (sensitivity, %FP, DSC, volume, and mean/max surface disagreement), Wilcoxon's signed-rank test for paired nonparametric data was performed. For all measurements except volume, the comparisons were performed against the true segmentation for each group (STAPLE-manual for manual contours and STAPLE-AM for AM contours). For each metric, paired data for all structures were considered together in order to maximize sample size and power. The p values were <0.005 for all comparisons.

Utilization of automatic contours
In the comparison of the two STAPLE volumes, the mean DSC was 80% (range, 78–82%) and the mean sensitivity was 89% (range, 86–91%). The mean surface disagreement was 2.5 ± 0.4 mm (mean ± SD), whereas the max surface disagreement was 13.8 ± 3.7 mm. Table 1 and Figure 5(a, b) show the comparisons of the automatic and AM contours, respectively, with the STAPLE-AM. Across all CTVs, the mean DSC and sensitivity were 95% and 96% (automatic) versus 89% and 82% (AM). The %FP had a mean of 7% (range, 4–14%) for the
Table 1. Comparison metrics for manual and automatic contours vs. the STAPLE-manual contours, and for automatic-modified and automatic contours vs. the STAPLE-AM contours

Reference standard   Contour type   Sensitivity (Mean/Max/Min)   DSC (Mean/Max/Min)   %FP (Mean/Max/Min)   Mean surface disagreement, mm (Mean ± SD)   Max surface disagreement, mm (Mean ± SD)
STAPLE-manual        Manual         73% / 99% / 47%              79% / 93% / 62%      9% / 37% / 1%        2.77 ± 0.59                                 17.04 ± 4.00
STAPLE-manual        Automatic      86% / 90% / 78%              76% / 78% / 73%      32% / 37% / 29%      3.30 ± 0.43                                 17.37 ± 5.58
STAPLE-AM            AM             82% / 95% / 54%              89% / 97% / 68%      3% / 7% / 1%         1.77 ± 0.91                                 13.99 ± 3.44
STAPLE-AM            Automatic      96% / 98% / 89%              95% / 97% / 92%      7% / 14% / 4%        1.00 ± 0.21                                 12.09 ± 2.75

Abbreviations: AM = automatic-modified; DSC = Dice similarity coefficient; %FP = percent false positive; SD = standard deviation.
automatic contours and of 3% (range, 1–7%) for the AM contours. The average mean surface disagreements for the automatic and AM contours were 1 mm and 1.8 mm, respectively, whereas the values for max surface disagreement were 12.1 mm and 14 mm, respectively. On average, physician-modified contours were smaller than the automatic contours (254 mL AM vs. 309 mL automatic), although not as small as the manual contours (206 mL) (Table 2). Per patient, physicians spent an average of 33.1 minutes generating manual contours and 21.6 minutes modifying automatic contours, for an average time savings of 11.5 minutes per patient, a 35% reduction in time.

DISCUSSION
By creating "true" contours from multiple experienced physicians' manual contours, we have demonstrated that the use of atlas-based automatic lymph node segmentation can improve efficiency and decrease interobserver variability while maintaining accuracy. Variability is one of the most challenging issues in the IMRT era, and recognition of this fact has motivated recent efforts to quantify variability and develop systematic approaches to improve consistency (15). The ability to accurately and precisely identify target volumes impacts dose delivery to both tumor and normal structures. Planning target volumes are, by definition, expansions of the CTV to account for organ motion and setup uncertainty. There is considerable interest in reducing the planning target volume margin to decrease dose to normal tissues, and in HNC IMRT the adoption of daily on-board imaging has allowed for reductions in margins down to 3 mm in some cases (16). Given the feasibility of achieving highly accurate setup in HNC radiotherapy, variability in target volume delineation may impact the accurate delivery of dose more than organ motion or setup errors (9). Variability is influenced by a number of factors, including imaging
Table 2. Comparison of contour volume (mL) by contour type

Contour type         Mean   Max   Min   SD
Manual               206    429   106   80
Automatic            309    369   263   34
Automatic-modified   254    347   176   47
STAPLE-AM            299    368   253   12
STAPLE-manual        244    293   197   34

Abbreviations: STAPLE-AM = STAPLE automatic-modified; SD = standard deviation. All values in mL.
Fig. 3. Contour variability within the manual and automatic-modified sets, measured against their respective Simultaneous Truth and Performance Level Estimation (STAPLE) reference contours. Panels show (a) sensitivity, DSC, and %FP and (b) mean and maximum surface disagreement. *Comparison of AM contours to the STAPLE-AM. †Comparison of manual contours to the STAPLE-manual. Horizontal lines indicate the mean value, whereas vertical bars indicate the absolute range in (a) and the 99% range (mean ± 3 SD) in (b). Abbreviations: AM = automatic-modified; %FP = percent false positive; DSC = Dice similarity coefficient.
Fig. 4. Reductions in contour variability are seen after introduction of the automatic contours. Physician manual contours are seen in blue, automatic contours modified by physicians are in purple, and Simultaneous Truth and Performance Level Estimation (STAPLE)-manual contour is shown in brown.
modality, imaging technique (slice thickness, contrast administration), and practitioner experience (17). Previous studies have demonstrated that the use of automatic segmentation can reduce interoperator variability in delineating organs at risk and CTVs in HNC (4). In this study, we chose to minimize variability by selecting a group of fairly homogeneous patients and examining prophylactic nodal contours created by physicians who worked at a single institution, were given instructions regarding the nodal level coverage, and were directed toward the consensus guidelines for reference. The IMRT practice at this institution has been to prescribe to three to four volumes: a single volume containing all gross disease, separate volumes for each side of the neck, and possibly a separate high-risk nodal area. One criticism of this study may be that an atlas based on guidelines for an N0 neck should not be used for any N+ patients. We would argue that the presence of positive nodes does not preclude the use of an atlas in constructing prophylactic nodal volumes. Gross nodes should be contoured and included in a separate, higher-dose volume that can be removed from the prophylactic neck volume using a subtraction function in the treatment planning system. Similarly, any region of the automatic atlas-based nodal volumes deemed by the physician to be at higher risk could be planned to a higher dose and the contours modified to include adjacent at-risk areas. An atlas does not replace the judgment of a physician and, as such, will always require some modification. Although the degree of variability among the manual contours in this study was not dramatic, it was improved with introduction of the automatic segmentation in regard to sensitivity, %FP, mean/max surface disagreement, and volume.
The gains in consistency were especially notable for volume, where utilization of the automatic segmentation led to a reduction in the average SD from 80 mL to 47 mL, a decrease in the range of volumes from 106–429 mL to 176–347 mL, and a drop in %FP from 9% to 3%. The mean and max surface disagreement were both reduced after introducing the automatic segmentation, with the mean surface disagreement decreasing by 1 mm on average and the max by 3 mm. Although the AM group
did show a larger SD than the manual group for mean surface disagreement (0.9 mm vs. 0.6 mm), comparing the range of values for both groups (0.2–4.2 mm AM vs. 1.3–4.2 mm manual) reveals that the increased spread is due to a reduction in the minimum value for the AM group. Taken together with the reduction in the average value, this indicates that the
Fig. 5. The degree of automatic segmentation modification was assessed by using the Simultaneous Truth and Performance Level Estimation (STAPLE)-AM as a reference to compare the automatic and automatic-modified contours. Panels show (a) sensitivity, DSC, and %FP and (b) mean and maximum surface disagreement. Horizontal lines indicate the mean value, whereas vertical bars indicate the absolute range in (a) and the 99% range (mean ± 3 SD) in (b). Abbreviations: Auto = automatic; AM = automatic-modified; %FP = percent false positive; DSC = Dice similarity coefficient.
mean surface disagreement was reduced when physicians used the automatic contours as a template. A potential explanation for the discrepancies between manual and automatic contours is that physicians were influenced by the presence of positive lymph nodes, whereas the automatic contours were based on guidelines for an N0 neck. The literature supports modifications to prophylactic neck contours based on the presence of positive nodes, such as starting ipsilateral level II contours at the base of skull (18). If physicians were taking this literature into account, one would expect the manual contours to be larger than the automatic contours. In fact, the manual contours on average were more than 100 mL smaller than the automatic contours and were almost 50 mL smaller than the AM contours. The variability in contour volume was the most dramatically affected by introduction of the automatic segmentation, because volume is the only metric that was calculated without the use of STAPLE. Because STAPLE is designed to reduce the influence of outliers, the variability for the other metrics is somewhat blunted by using the STAPLE volume for comparison. Overall, the introduction of the automatic segmentations seemed to reduce the extremes in volume, thereby producing more uniformly sized contours with less extreme areas of disagreement from the average contours. The creation of the STAPLE volumes also facilitated the study of how physicians' contours are influenced by starting with a template versus starting from scratch. Overall, physicians made relatively minor modifications to the automatic contours. Given the similarity of the automatic contours and the STAPLE-manual, the lack of modification may be attributed to the fact that the automatic contours closely resembled most physicians' definition of the truth. However, this does not exclude the possibility that physicians' perceptions of the truth were influenced by the automatic segmentations.
The introduction of bias is confirmed by the discrepancy between physicians’ qualitative assessments and their contour
modifications. Two-thirds of the time, physicians deemed that the automatic contours would require modification before treatment planning because the volumes were too generous; however, on average physicians removed only 7% of the automatic contour volume and created modified contours that were almost 50 mL larger than their manual contours. There are limitations to this analysis, including the small number of patients and physicians, the recruitment of physicians from a single institution, and the lack of assessment of intraobserver variability. The performance of the automatic segmentations is highly dependent on the quality of deformable registration. The results may not be reproducible with less robust image registration tools or in patients with altered anatomy from bulky neck disease. Beyond its function in this project, the STAPLE algorithm has potential for multiple applications in the field. As we have demonstrated, it is an effective tool for assessing variability in physician contouring and may help expand the knowledge of factors that contribute to contour variability. Through these means, the STAPLE approach has the ability to improve quality assurance and uniformity across the specialty in the arena of IMRT contouring. Additionally, the STAPLE algorithm could be used to create learning tools for residents and other physicians to test their accuracy against expert contours.

CONCLUSION
By creating a ground truth from multiple segmentations, the STAPLE algorithm provides a unique tool to assess variability in contouring. With the application of STAPLE, we have shown that atlas-based automatic LNS in HNC is accurate, efficient, and reduces interobserver variability. Further analysis of the variability in IMRT contouring may help improve consistency across the field and augment the education process for physicians learning IMRT.
REFERENCES

1. Fang FM, Chien CH, Tsai WL, et al. Quality of life and survival outcome for patients with nasopharyngeal carcinoma receiving three-dimensional conformal radiotherapy vs. intensity-modulated radiotherapy—A longitudinal study. Int J Radiat Oncol Biol Phys 2008;72:356–364.
2. Chao KS, Majhail N, Huang CJ, et al. Intensity-modulated radiation therapy reduces late salivary toxicity without compromising tumor control in patients with oropharyngeal carcinoma: A comparison with conventional techniques. Radiother Oncol 2001;61:275–280.
3. Reed VK, Woodward WA, Zhang L, et al. Automatic segmentation of whole-breast using atlas approach and deformable image registration. Int J Radiat Oncol Biol Phys 2009;73:1493–1500.
4. Chao KSC, Bhide S, Chen H, et al. Reduce in variation and improve efficiency of target volume delineation by a computer-assisted system using a deformable image registration approach. Int J Radiat Oncol Biol Phys 2007;68:1512–1521.
5. Cooper JS, Mukherji SK, Toledano AY, et al. An evaluation of the tumor-shape definition by experienced observers from CT images of supraglottic carcinomas (ACRIN Protocol 6658). Int J Radiat Oncol Biol Phys 2007;67:972–975.
6. Hong TS, Chappell RJ, Harari PM. Variations in target delineation for head and neck IMRT: An international multi-institutional study. Int J Radiat Oncol Biol Phys 2004;60:S157.
7. Hermans R, Feron M, Bellon E, et al. Laryngeal tumor volume measurements determined with CT: A study on intra- and interobserver variation. Int J Radiat Oncol Biol Phys 1998;40:553–557.
8. Rasch C, Keus R, Pameijer FA, et al. The potential impact of CT-MRI matching on tumor volume delineation in advanced head and neck cancer. Int J Radiat Oncol Biol Phys 1997;39:841–848.
9. Weiss E, Hess CF. The impact of gross tumor volume (GTV) and clinical target volume (CTV) definition on the total accuracy of radiotherapy. Strahlenther Onkol 2003;179:21–30.
10. Warfield SK, Zou KH, Wells WM. Simultaneous Truth and Performance Level Estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Trans Med Imaging 2004;23:903–921.
11. Gregoire V, Levendag P, Ang K, et al. CT-based delineation of lymph node levels and related CTVs in the node-negative neck: DAHANCA, EORTC, GORTEC, NCIC, RTOG consensus guidelines. Radiother Oncol 2003;69:227–236.
12. Lawson JD, Schreibmann E, Jani AB, et al. Quantitative evaluation of a cone-beam computed tomography–planning computed tomography deformable image registration method for adaptive radiation therapy. J Appl Clin Med Phys 2007;8:96–113. 13. Chao KS, Wippold F, Ozyigit G, et al. Determination and delineation of nodal target volumes for head-and-neck cancer based on patterns of failure in patients receiving definitive and postoperative IMRT. Int J Radiat Oncol Biol Phys 2002;53:1174–1184. 14. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26:297–302. 15. Li XA, Tai A, Arthur DW, et al. Variability of target and normal structure delineation for breast cancer radiotherapy: An RTOG
multi-institutional and multi-observer study. Int J Radiat Oncol Biol Phys 2009;73:944–951. 16. Eisbruch A, Feng M. Future issues in highly conformal radiotherapy for head and neck cancer. J Clin Oncol 2007;25: 1009–1013. 17. O’Daniel JC, Rosenthal DI, Barker JL, et al. Inter-observer contour variations of head-and-neck anatomy. Int J Radiat Oncol Biol Phys 2005;63S:S370. 18. Eisbruch A, Foote RL, O’Sullivan B, et al. Intensity-modulated radiation therapy for head and neck cancer: Emphasis on the selection and delineation of the targets. Semin Radiat Oncol 2002; 12:238–249.