International Journal of Radiation Oncology Biology Physics
www.redjournal.org
Physics Contribution
A System for Continual Quality Improvement of Normal Tissue Delineation for Radiation Therapy Treatment Planning
Jennifer Breunig, CMD, MS, Sophy Hernandez, MD, Jeffrey Lin, BS, Stacy Alsager, CMD, Christine Dumstorf, CMD, Jennifer Price, CMD, Jennifer Steber, CMD, Richard Garza, MD, Suneel Nagda, MD, Edward Melian, MD, Bahman Emami, MD, and John C. Roeske, PhD
Department of Radiation Oncology, Loyola University Medical Center, Maywood, Illinois
Received Jul 27, 2011, and in revised form Feb 3, 2012. Accepted for publication Feb 3, 2012
Summary
Normal tissue delineation plays an important role in radiation therapy. However, numerous studies have demonstrated significant variations in normal tissue contours across multiple observers. In this study, the variation in normal tissue contouring among dosimetrists at our institution was evaluated. Subsequently, the “plan-do-check-act” cycle of continual quality improvement was implemented. Using this approach, we were able to reduce the variability of normal tissue contouring and to identify organs that required further improvement.

Purpose: To implement the “plan-do-check-act” (PDCA) cycle for the continual quality improvement of normal tissue contours used for radiation therapy treatment planning.
Methods and Materials: The CT scans of patients treated for tumors of the brain, head and neck, thorax, pancreas, and prostate were selected for this study. For each scan, a radiation oncologist and a diagnostic radiologist outlined the normal tissues (“gold” contours) using Radiation Therapy Oncology Group (RTOG) guidelines. A total of 30 organs were delineated. Independently, 5 board-certified dosimetrists and 1 trainee then outlined the same organs. Metrics used to compare the agreement between the dosimetrists’ contours and the gold contours included the Dice Similarity Coefficient (DSC) and a penalty function using distance to agreement. Based on these scores, dosimetrists were re-trained on those organs for which they did not receive a passing score, and they were subsequently re-tested.
Results: Passing scores were achieved on 19 of 30 organs evaluated. These scores were correlated with organ volume. For organ volumes <8 cc, the average DSC was 0.61, vs 0.91 for organ volumes ≥8 cc (P = .005). Normal tissues that had the lowest scores included the lenses, optic nerves, chiasm, cochlea, and esophagus. Of the 11 organs that were considered for re-testing, 10 showed improvement in the average score, and statistically significant improvement was noted in more than half of these organs after education and re-assessment.
Conclusions: The results of this study indicate the feasibility of applying the PDCA cycle to assess competence in the delineation of individual organs, and to identify areas for improvement. With testing, guidance, and re-evaluation, contouring consistency can be obtained across multiple dosimetrists. Our expectation is that continual quality improvement using the PDCA approach will ensure more accurate treatments and dose assessment in radiotherapy treatment planning and delivery. © 2012 Elsevier Inc.
Keywords: Contouring, Normal tissue delineation, Quality assurance

Reprint requests to: John C. Roeske, PhD, Department of Radiation Oncology, Loyola University Medical Center, 2160 S. First Ave, Maywood, IL 60153. Tel: (708) 216-2596; Fax: (708) 216-6076; E-mail: [email protected]
Conflict of interest: StructSure software used for this analysis was provided by Standard Imaging.
Int J Radiation Oncol Biol Phys, Vol. 83, No. 5, pp. e703-e708, 2012
0360-3016/$ - see front matter © 2012 Elsevier Inc. All rights reserved. doi:10.1016/j.ijrobp.2012.02.003
Acknowledgments: The authors are grateful to Standard Imaging for providing a research version of the StructSure Software used in this study. The authors also thank Ben Nelms, PhD, and Neal Miller for helpful discussions during this project, and Nasir Iqbal, MD, for reviewing contours.
Introduction
Treatment planning has often been described as a process with many components. One of these components is normal tissue delineation. These delineated structures serve as digital representations of the patient that the planning system uses to provide a quantitative estimate of the dose deposited within individual organs. Through the use of dose-volume histograms (DVH) and isodose overlays, physicians can determine whether a plan is acceptable or whether it needs modification to meet the treatment planning goals (1, 2). In addition, with the widespread use of image guided radiation therapy (IGRT), normal tissue contours may aid in patient positioning or the assessment of dose through adaptive schemes (3, 4).
Although the accuracy of normal tissue delineation is important for the success of radiotherapy, several studies have documented the interobserver variation in normal tissue contouring (5-16). In one of the larger studies to assess these variations in the head-and-neck region, Nelms et al provided a treatment planning computed tomography (CT) scan to 32 clinical sites around the world (16). In this dataset, the clinical target volume (CTV) was previously outlined, and participants were asked to outline the organs at risk (OAR) and to create a treatment plan based on the provided prescription. Several OAR, including the spinal cord and parotids, showed significant variation based on volume and contouring accuracy. Compared with the “gold” standard contours overlaid on the submitted plans, the variations in the OAR-DVH metrics across multiple clinicians resulted in dosimetric deviations in the mean dose ranging from −289% to +56%, and deviations in the maximum dose ranging from −22% to +35%. These results indicate a significant variation in normal tissue delineation, and the need for a system to reduce this variation.
In manufacturing and business, quality control along every step of a process is essential to ensure that the final product is free of defect. As such, we can use the methods developed in industry to potentially reduce the variability and improve the accuracy of normal tissue contouring. One such method is known as the plan-do-check-act (PDCA) cycle, popularized by W. Edwards Deming (17). Briefly, one begins by identifying the objectives of a given process (Plan) which, in this case, is to obtain high quality normal tissue contours with minimal variation. Next, we implement the process (Do) by obtaining normal tissue contours to assess. The results of these contours are then compared with gold contours (Check) to determine how well they agree. Finally, based on the results, action is taken to improve contouring and to reduce variability (Act). It should be noted that the PDCA cycle is iterative; that is, based on the results of one pass through the cycle, the Plan (and subsequent elements) may be modified or improved in the next cycle.
The goal of the current study is to apply the PDCA approach to assess the treatment planners’ accuracy in normal tissue delineation in sites throughout the body (Plan, Do). Using these results, treatment planners will receive individualized feedback and education on those organs with clinically unacceptable deviations (Check). The treatment planners are then re-evaluated to determine whether, with training, they can reduce their contouring variability as well as identify areas where on-going training is required (Act).
Methods and Materials
Patient population
In this institutional review board-approved study, the CT scans of 5 previously treated patients were selected. The scans were obtained for tumors located in the following anatomical sites: brain, head and neck, thorax, abdomen, and pelvis (male). These scans were obtained on our departmental scanner (BigBore, Philips Medical Systems, Andover, MA). All scans were obtained using our departmental protocols that consisted of a 3-mm slice thickness/table index. To minimize the effect that the tumor size and location would have on normal structures, the CT scan of a whole brain study was used for the brain case. Likewise, the CT scan of a patient with a T1 larynx tumor was used for the head-and-neck case. As such, contrast was not used on these scans. However, intravenous contrast was used for the thorax and abdominal scans. A retrograde urethrogram was used for the pelvis scan.
Assessment of contouring variability
We applied the PDCA cycle to assess and improve the quality of normal tissue contouring in our department (Fig. 1). The first step of the process involves the establishment of a set of “gold” contours. These contours were selected based on normal structures routinely outlined in our clinic as well as those defined by Radiation Therapy Oncology Group (RTOG) studies. The OAR included in this study are listed in Table 1. Using the appropriate CT scan for the anatomical site, individual OAR were first outlined by a senior resident and subsequently reviewed and modified by a diagnostic radiologist. Thereafter, the contours were reviewed and approved by the individual radiation oncology attending specializing in each particular disease site.
Five board-certified dosimetrists and 1 trainee participated in this study. The dosimetrists had an average of 7 years of experience (range, 4-11 years), whereas the trainee had completed 10 months of his year-long training program. After the establishment of the gold contours, each dosimetrist received an individual copy of the CT scans without contours. The dosimetrists were then given a list of normal structures to contour from Table 1. All contouring was performed on an XiO workstation (Elekta, Maryland Heights, MO). No time limits were given, and the dosimetrists were not permitted to access anatomy books or on-line guides. In addition, the dosimetrists could not consult other dosimetrists or professionals within the department; rather, the dosimetrists were instructed to contour each OAR as they would for any clinical case.

Fig. 1. Schematic diagram illustrating the process flow for the clinical contouring system. A set of “gold” contours is established based on agreement between multiple physicians. Next, dosimetrists are evaluated by contouring individual structures on CT scans. These structures are compared qualitatively and quantitatively using the StructSure Software. Feedback is provided and re-testing is performed on those structures on which individuals initially did not perform satisfactorily.
Table 1. Organs evaluated for contouring
Brain, Brainstem, Left lens, Right lens, Left eye, Right eye, Left optic nerve, Right optic nerve, Left cochlea, Right cochlea, Chiasm, Mandible, Left parotid, Right parotid, Spinal cord (cervical), Right lung, Left lung, Heart, Spinal cord (thoracic), Esophagus, Carina, Right kidney, Left kidney, Liver, Bowel, Spinal cord (abdominal), Bladder, Right femur, Left femur, Rectum

Contour evaluation
After completion of the study, the individual contours were analyzed using the StructSure Software (Standard Imaging, Madison, WI). StructSure uses the DICOM RT structure sets from the gold (primary) and the individual dosimetrist (secondary) contours. The program overlays the contours and provides both a visual and quantitative evaluation of the agreement between the two sets. Regions that overlap are shown in green, whereas regions that are within or extend beyond the primary contour are shown in blue and red, respectively. A screen capture of the program interface is shown in Figure 2.

Fig. 2. Example of the StructSure interface. Scores are listed quantitatively either through the penalty metric, or from the Dice Similarity Coefficient (DSC). In addition, a qualitative measure of success is also shown. Regions that overlap are shown in green, while regions that are within or extend beyond the primary contour are shown in blue and red, respectively.

Quantitative measures of the individual contours’ accuracy are provided using two measures. The first is the Dice Similarity Coefficient (DSC), which is given by:

$$\mathrm{DSC} = \frac{\mathrm{Vol}_{\mathrm{Gold}} \cap \mathrm{Vol}_{\mathrm{Dosim}}}{\mathrm{Vol}_{\mathrm{Gold}}} \tag{1}$$

where Vol_Dosim and Vol_Gold are the volumes of a particular structure based on the individual dosimetrist contours and the gold contours, respectively. Thus, the DSC provides the degree of overlap between the gold and dosimetrist delineated contours (18). In general, DSC values closer to 1 indicate good agreement, whereas values closer to 0 indicate poor agreement.
Second, a weighted penalty function was also used based on the distance to agreement (DTA). This function provided a “forgiveness” if the contour was within 1 mm (approximately the size of a single pixel), and thereafter produced a linear penalty as a function of distance. The functional form of this penalty is given by:

$$\mathrm{Penalty} = 0 \quad \text{for } -1\ \mathrm{mm} \le \mathrm{DTA} \le 1\ \mathrm{mm} \tag{2}$$

$$\mathrm{Penalty} = 0.5\,\lvert \mathrm{DTA} - 1 \rvert \quad \text{for } \mathrm{DTA} < -1\ \mathrm{mm} \text{ or } \mathrm{DTA} > 1\ \mathrm{mm}, \tag{3}$$

where a negative DTA denotes a “missing” OAR volume (volume not contoured by dosimetrist, but present in the gold contour volume) and a positive DTA is an “extra” OAR volume (volume contoured outside of gold contour) (16). These penalties were incorporated into a metric to provide a score indicating the agreement between contours:

$$\mathrm{Metric} = 100 \times \frac{\text{Number of organ voxels} - \operatorname{Sum}(\text{voxel penalties})}{\text{Number of organ voxels}} \tag{4}$$

where the voxel penalties are obtained from equations 2 and 3. Thus, metric scores closer to 100 indicate a high degree of agreement.
Based on the above parameters, the dosimetrists were provided an individualized report on how they performed for each OAR. Using a cutoff of the penalty metric <70, the dosimetrists were provided an opportunity to review their individual contours in relation to the gold contours. This cutoff is based on the previous study by Nelms et al that showed significant dosimetric variation for metric values <70 (16). In addition, feedback on individual OAR was also provided by the physician group. After review, the dosimetrists then re-contoured, on a new scan, those OAR that did
not meet the minimum cutoff score. The goal of this portion of the study was to determine whether re-education resulted in improved OAR agreement, and to determine whether additional training might be required. It should be noted that new scans were used for this portion of the study so that the dosimetrists would not be tempted to simply memorize the gold contours.
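The two agreement measures described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the StructSure implementation: contours are represented as hypothetical sets of voxel indices, the per-voxel DTA values are supplied directly rather than computed from geometry, and Equation 3 is read as a penalty of 0.5 per millimetre beyond the 1-mm tolerance on either side (one interpretation of the printed form). The conventional Dice coefficient, 2|A∩B|/(|A|+|B|), is shown next to the overlap ratio of Equation 1 for comparison.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def overlap_fraction(gold: Set[Voxel], dosim: Set[Voxel]) -> float:
    """Overlap volume divided by the gold volume (Equation 1 as printed)."""
    return len(gold & dosim) / len(gold)


def dice(gold: Set[Voxel], dosim: Set[Voxel]) -> float:
    """Conventional Dice coefficient: 2|A n B| / (|A| + |B|)."""
    return 2 * len(gold & dosim) / (len(gold) + len(dosim))


def voxel_penalty(dta_mm: float) -> float:
    """Zero within +/-1 mm, then 0.5 per mm of excess distance.

    Assumption: the penalty is symmetric in the sign of the DTA.
    """
    excess = abs(dta_mm) - 1.0
    return 0.5 * excess if excess > 0 else 0.0


def penalty_metric(n_organ_voxels: int, dtas_mm: Iterable[float]) -> float:
    """Equation 4: 100 * (N - sum of voxel penalties) / N."""
    total_penalty = sum(voxel_penalty(d) for d in dtas_mm)
    return 100.0 * (n_organ_voxels - total_penalty) / n_organ_voxels


# Hypothetical toy masks on a single slice.
gold = {(x, y, 0) for x in range(4) for y in range(4)}      # 16 voxels
dosim = {(x, y, 0) for x in range(1, 6) for y in range(4)}  # 20 voxels, 12 shared

print(overlap_fraction(gold, dosim))                  # 0.75
print(dice(gold, dosim))                              # about 0.667
print(penalty_metric(16, [0.5, -0.8, 3.0, -5.0]))     # 81.25
print(penalty_metric(4, [10.0, -10.0, 10.0, -10.0]))  # -350.0
```

The last call shows that, because the penalty is unbounded, the metric of Equation 4 can fall well below zero when a small structure is drawn far from the gold contour.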
Statistical analysis
Statistical analysis of these parameters was performed using a paired t test within Excel (Microsoft, Redmond, WA). Specifically, the t test was used to determine statistically significant differences between contouring accuracy based on organ volume, as well as to determine whether scores obtained after re-training were significant relative to initial scores. A value of P<.05 was considered statistically significant.

Results
Based on our criterion that a DTA penalty metric >70 indicates reasonable agreement, 19 of 30 OAR (63%) achieved passing scores on the first attempt. Table 2 shows a summary of the dosimetrists’ performance for the OAR that did not receive a passing score. The organs on which the dosimetrists had the lowest scores (based on penalty metrics) included those of the visual pathways and the cochlea. Of interest, the organs outside the head and neck that had relatively low scores included the esophagus, bowel, and abdominal spinal cord. For both the abdominal spinal cord and esophagus, the largest disparities occurred where the dosimetrists chose to start and stop contouring these structures in the superior/inferior direction. In the case of the bowel, differences were observed because of contouring individual loops of bowel vs the entire region.
Structures on which the dosimetrists performed well (metric >90) included the brain, lungs, kidneys, thoracic cord, heart, carina, bladder, and liver. These OAR have larger volumes as well as good contrast at the boundaries of these organs. In addition, the dosimetrists also performed consistently well on OAR in which they used the system’s autosegmentation tools, such as the femurs. The autosegmentation features allow the user to place a seed point within a structure and to adjust the window/level to make the structure appear with a uniform value. The contour is then drawn using a simple threshold.
Table 3 illustrates a strong relationship between the organ volume, DSC values, and penalty metrics. OAR with volumes <8 cc had an average DSC of 0.61, whereas OAR with volumes ≥8 cc had an average DSC of 0.91 (P = .005). Likewise, when considering the penalty metric, the average value for volumes <8 cc was 51.5, while the average value for OAR volumes ≥8 cc was 80.2 (P = .026). It is not surprising that OAR with smaller volumes would have lower scores. Based on size alone, minor variations in contouring will have a larger relative effect on a smaller OAR than on a larger organ. Moreover, the complex anatomy of the smaller organs in this study makes them particularly difficult to contour, even for experts.
The relationship between the penalty metric and DSC is also shown in Table 3. It is noted that DSC is often the quantity used to evaluate the accuracy of contouring, particularly in studies evaluating automated contouring systems. Our study shows a good correlation between these values, with an average DSC of 0.64 for metric values <70 and an average DSC of 0.93 for metric values ≥70 (P<.001). It should be noted that several OAR with metric values <70 had high DSC values, including the left parotid, esophagus, bowel, and abdominal spinal cord.
Based on the results of this analysis, the dosimetrists recontoured those organs for which they received metric scores of <70. Before recontouring, all dosimetrists were able to compare their OAR to the corresponding gold contour using the StructSure Software. The dosimetrists were then given a new scan to recontour to ensure that they were not merely memorizing the anatomy. For the chiasm and optic nerves, the dosimetrists were provided a magnetic resonance (MR) scan for the second attempt. Many centers, including our own, routinely fuse MR images to the CT to delineate these structures. Previous studies have validated the role of MR imaging in treatment planning for tumors in the brain (19, 20).
The results of this re-assessment are shown in Figure 3, illustrating the average score after initial testing and after re-testing. Overall, 10 of the 11 OAR showed improvement in the mean penalty metric after re-testing. Statistically significant improvement was noted in 6 of the 11 OAR considered for re-evaluation. Organs that did not have a statistically significant improvement included the lenses, left parotid, esophagus, and abdominal spinal cord. For the esophagus, most deviations occurred because the dosimetrists over-contoured the structure. Likewise, the poor score on the abdominal spinal cord resulted from the dosimetrists contouring too far inferiorly. Based on removal of the contours on these additional slices, the average metric for the spinal cord was 90.0.
Table 2. Summary of contouring metrics for select organs

| Organ | Average metric (range) | Average DSC (range) |
| --- | --- | --- |
| Left lens | −5.9 (−527.1 to 100) | 0.67 (0.72 to 0.88) |
| Right lens | −13.6 (−569.2 to 100) | 0.74 (0.71 to 0.84) |
| Left optic nerve | 49.1 (33.1 to 69.2) | 0.67 (0.64 to 0.72) |
| Right optic nerve | 42.2 (17.0 to 59.5) | 0.65 (0.60 to 0.68) |
| Left cochlea | −218.0 (−489.3 to 88.1) | 0.38 (0.33 to 0.73) |
| Right cochlea | −262.0 (−478.3 to 47.8) | 0.38 (0.35 to 0.63) |
| Chiasm | −235.3 (−513.0 to 83.5) | 0.25 (0.10 to 0.67) |
| Left parotid | 51.4 (38.1 to 72.5) | 0.84 (0.82 to 0.85) |
| Esophagus | 2.2 (−20.9 to 28.1) | 0.75 (0.72 to 0.79) |
| Spinal cord (abdomen) | 14.5 (−161.5 to 95.7) | 0.83 (0.73 to 0.90) |
| Bowel | 44.4 (9.7 to 57.9) | 0.86 (0.77 to 0.88) |

Abbreviation: DSC = Dice Similarity Coefficient.
Table 3. Dice Similarity Coefficient vs organ volume and contouring metric

| | Organ volume <8 cc | Organ volume ≥8 cc | P value |
| --- | --- | --- | --- |
| Average DSC | 0.61 (n = 9) | 0.91 (n = 21) | .005 |
| Average metric | 51.5 (n = 9) | 80.2 (n = 21) | .026 |

| | Contouring metric <70 | Contouring metric ≥70 | P value |
| --- | --- | --- | --- |
| Average DSC | 0.64 (n = 11) | 0.93 (n = 19) | <.001 |

Abbreviation: DSC = Dice Similarity Coefficient.
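As a rough illustration of the paired t test named in the statistical analysis, the sketch below compares initial and re-test scores for one organ across six observers. The scores are hypothetical and are not the study data; the function name is illustrative; and, because the Python standard library has no t distribution, the statistic is compared with the tabulated two-sided critical value for 5 degrees of freedom at the .05 level (2.571) instead of computing a P value.

```python
import math
from statistics import mean, stdev

# Two-sided critical value of Student's t for df = 5 at alpha = .05.
T_CRIT_DF5 = 2.571


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical penalty-metric scores for six observers on one organ.
initial = [40.0, 55.0, 35.0, 60.0, 45.0, 50.0]
retest = [70.0, 75.0, 65.0, 80.0, 72.0, 78.0]

t_stat, df = paired_t(initial, retest)
significant = abs(t_stat) > T_CRIT_DF5
print(f"t = {t_stat:.2f}, df = {df}, significant at .05: {significant}")
```

Pairing by observer is natural for the initial vs re-test comparison, since each dosimetrist contributes one score to each condition.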
It is important to note that the PDCA cycle calls for ongoing evaluation to achieve continual quality improvement. Hence, this study represents a first step in that direction. Based on our results, there are several initiatives that we plan to implement. First, there are a number of OAR on which the dosimetrists obtained relatively low scores, even after re-training. To address these OAR, we plan to implement written guidelines with pictorial examples. It is our hope that subsequent re-evaluation will increase the metric to >70 for these remaining organs. In addition, we have shared these results with our physician group. As such, knowing the OAR that are most difficult to contour, they can also check and provide feedback on a case-by-case basis. Finally, once we are able to achieve a high level of contouring accuracy (metric >70) for all organs, we hope to increase our threshold to perhaps 80 or 90 to determine whether, as a group, we can achieve a high level of accuracy with minimal variability.
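The Check and Act steps, and the effect of raising the threshold, can be sketched as a simple triage over per-organ scores. This is an illustrative sketch only: the function name is hypothetical, the four low scores are the Table 2 averages for those organs, and the brain and rectum values are made-up placeholders chosen to show how a stricter cutoff flags additional organs.

```python
def organs_needing_retraining(scores, cutoff=70.0):
    """Return the organs whose penalty metric falls below the cutoff."""
    return sorted(organ for organ, score in scores.items() if score < cutoff)


scores = {
    "left optic nerve": 49.1,   # Table 2 average
    "right optic nerve": 42.2,  # Table 2 average
    "left parotid": 51.4,       # Table 2 average
    "bowel": 44.4,              # Table 2 average
    "brain": 95.0,              # hypothetical placeholder
    "rectum": 82.0,             # hypothetical placeholder
}

print(organs_needing_retraining(scores, cutoff=70.0))
print(organs_needing_retraining(scores, cutoff=90.0))
```

With the cutoff at 70, only the four Table 2 organs are flagged; at 90 the placeholder rectum score is flagged as well, which is the kind of shift a higher threshold would be expected to produce.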
Fig. 3. Plot of the average organ metrics based on initial evaluation, and subsequent re-testing. Statistically significant improvement was noted for the right and left optic nerves, right and left cochlea, optic chiasm, and bowel.

In both cases, these contours would result in a conservative dose estimate to these OAR.
Discussion
In this study, we applied Deming’s PDCA cycle for continuous quality improvement to normal tissue contouring. We noted that the dosimetrists achieved a high degree of contouring accuracy for 19 of the 30 organs evaluated. Of the 11 remaining organs, 10 showed improvement in contouring accuracy following re-education and re-evaluation.
An important aspect of the PDCA approach is the determination of objective metrics and of the level of accuracy that is required. In this study, we used a cutoff of 70 for the penalty metric based on the previous study by Nelms et al (16). On visual comparison between the contours, we believed that this value represented a good demarcation line. A more rigorous method would involve combining the score with the resultant impact on the DVH of an OAR. However, this impact will depend on the nature of the plan and the geometrical relationship between the planning target volume (PTV) and the OAR. For example, if the OAR is in a high-dose/high-gradient region, such as the rectum in a prostate cancer patient, then a higher score may be necessary. However, if the OAR is in a low-dose/low-gradient region, such as the lens for a typical head-and-neck cancer patient, the accuracy requirement may not be as high. The relation of the OAR to these regions will vary from patient to patient. Clearly, this is an area that will require further research.
With respect to the type of metric that should be used, we observed good agreement between the DTA penalty metric and DSC. Many studies, particularly related to autosegmentation, use the DSC to quantify the fidelity of contours. Although we observed a strong relationship between the DSC and penalty metric, there were several organs that had high DSC values but low metric scores. These organs included the abdominal spinal cord, esophagus, and bowel. For these organs, which tend to have larger volumes, the DSC parameter may not be sensitive enough to detect deviations in the contour outlines.
Moreover, the metric of choice may vary throughout the body dependent upon the organ’s volume and location.
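The dependence of DSC on organ size can be seen with a small idealized calculation. The sketch below assumes a cubic gold structure and a drawn contour that extends a fixed one-voxel margin beyond it on every side; the geometry and the function name are assumptions made purely for illustration, and the conventional Dice form 2|A∩B|/(|A|+|B|) is used.

```python
def dice_for_cube_with_margin(side_voxels: int, margin_voxels: int = 1) -> float:
    """Dice coefficient when a cubic structure is over-contoured by a uniform margin."""
    gold = side_voxels ** 3
    drawn = (side_voxels + 2 * margin_voxels) ** 3
    overlap = gold  # the drawn contour fully contains the gold cube
    return 2 * overlap / (gold + drawn)


for side in (2, 5, 10, 30):
    print(side, round(dice_for_cube_with_margin(side), 3))
```

The same one-voxel boundary error gives a Dice value of about 0.22 for the 2-voxel cube but above 0.90 for the 30-voxel cube, which is consistent with the observation that large structures can retain high DSC values despite visible deviations in their outlines.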
Conclusion
In summary, we have implemented the PDCA cycle to provide continual quality improvement of normal tissue contouring. We found this approach straightforward to implement, with only a modest time burden. Given the important role that OAR contours play in treatment planning and IGRT, we hope that other clinics will adopt this approach to potentially improve patient outcomes.
References
1. Mell LK, Tiryaki H, Ahn KH, et al. Dosimetric comparison of bone marrow-sparing intensity-modulated radiotherapy versus conventional techniques for treatment of cervical cancer. Int J Radiat Oncol Biol Phys 2008;71:1504-1510.
2. Takam R, Bezak E, Yeoh EE, et al. Assessment of normal tissue complications following prostate cancer irradiation: comparison of radiation treatment modalities using NTCP models. Med Phys 2010;37:5126-5137.
3. Dawson LA, Jaffray DA. Advances in image-guided radiation therapy. J Clin Oncol 2007;25:938-946.
4. Harsolia A, Hugo GD, Kestin LL, et al. Dosimetric advantages of four-dimensional adaptive image-guided radiotherapy for lung tumors using online cone-beam computed tomography. Int J Radiat Oncol Biol Phys 2008;70:582-589.
5. Dowsett RJ, Galvin JM, Cheng E, et al. Contouring structures for 3-dimensional treatment planning. Int J Radiat Oncol Biol Phys 1992;22:1083-1088.
6. Gual-Arnau X, Ibanez-Gual MV, Lliso F, et al. Organ contouring for prostate cancer: interobserver and internal organ motion variability. Comput Med Imaging Graph 2005;29:639-647.
7. Wong EK, Truong PT, Kader HA, et al. Consistency in seroma contouring for partial breast radiotherapy: impact of guidelines. Int J Radiat Oncol Biol Phys 2006;66:372-376.
8. Berger D, Kauer-Dorner D, Seitz W, et al. Concepts for critical organ dosimetry in three-dimensional image-based breast brachytherapy. Brachytherapy 2008;7:320-326.
9. Li XA, Tai A, Arthur DW, et al. Variability of target and normal tissue structure delineation for breast cancer radiotherapy: an RTOG multi-institutional and multiobserver study. Int J Radiat Oncol Biol Phys 2009;73:944-951.
10. Mitchell DM, Perry L, Smith S, et al. Assessing the effect of a contouring protocol on postprostatectomy radiotherapy clinical target volumes and interphysician variation. Int J Radiat Oncol Biol Phys 2009;75:990-993.
11. Szumacher E, Harnett N, Warner S, et al. Effectiveness of educational intervention on the congruence of prostate and rectal contouring as compared with gold standard in three-dimensional radiotherapy for prostate. Int J Radiat Oncol Biol Phys 2010;76:379-385.
12. Weiss E, Wu J, Sleeman W, et al. Clinical evaluation of soft tissue organ boundary visualization on cone-beam computed tomographic imaging. Int J Radiat Oncol Biol Phys 2010;78:929-936.
13. Fiorino C, Vavassori V, Sanguineti G, et al. Rectum contouring variability in patients treated for prostate cancer: impact on rectum dose-volume histograms and normal tissue complication probability. Radiother Oncol 2002;63:249-255.
14. Collier DC, Burnett SS, Amin M, et al. Assessment of consistency in contouring of normal-tissue anatomic structures. J Appl Clin Med Phys 2003;4:17-24.
15. Foppiano F, Fiorino C, Frezza G, et al. The impact of contouring uncertainty on rectal 3D dose-volume data: results of a dummy run in a multicenter trial (AIROPROS01-02). Int J Radiat Oncol Biol Phys 2003;57:573-579.
16. Nelms BE, Tome WA, Robinson G. On the accuracy of anatomy contouring for head/neck: quantifying the variation and estimating the resulting dosimetric impact for IMRT plans. Int J Radiat Oncol Biol Phys 2012;82:368-378.
17. Deming WE. Out of the crisis. Cambridge, MA: MIT Press; 1986.
18. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26:297-302.
19. Prabhakar R, Julka PK, Ganesh T, et al. Feasibility of using MRI alone for 3D radiation treatment planning in brain tumors. Jpn J Clin Oncol 2007;37:405-411.
20. Karlsson M, Karlsson MG, Nyholm T, et al. Dedicated magnetic resonance imaging in the radiotherapy clinic. Int J Radiat Oncol Biol Phys 2009;74:644-651.