Thorac Surg Clin 17 (2007) 413–424
The Comparison of Performance Between Thoracic Surgical Units
Alessandro Brunelli, MD (a,*), Gaetano Rocco, MD, FRCS (Ed), FECTS (b)
(a) Unit of Thoracic Surgery, Umberto I Regional Hospital, Via S. Margherita 23, Ancona 60129, Italy
(b) Division of Thoracic Surgery, National Cancer Institute Pascale Foundation, Via M Semmola 81, 60129 Naples, Italy
* Corresponding author. E-mail address: [email protected] (A. Brunelli).
Managed care systems demand an increased effort in quality monitoring through the use of reliable instruments of performance evaluation. Comparison between units requires the definition and selection of proper indicators of quality. Outcome indicators are widely used in our practice, but they need to be risk adjusted to prevent misleading information and unethical risk-averse behaviors. Regression analyses are commonly used for developing risk models. Bootstrap analysis can formalize model building, removing much of the human bias associated with regression analysis and introducing a concrete measure of the reliability of the risk factors. Recently published examples of comparisons of performance between different thoracic surgery units are summarized with the intent of providing a methodological template for future comparative audit investigations.
In the present era of managed care systems, increasing pressure is exerted by payers and the public on the profession to assume greater responsibility for delivering high-quality care. Surgeons are made fully accountable for their results, which are increasingly published through several media. Furthermore, the increasing emphasis on cost containment and the limited resources may make it difficult to implement a high-quality program of care. In addition, the proliferation of World Wide Web–based public or private sites advertising health care providers and ranking hospitals or units through often inaccurate or totally uncontrolled measures of performance demands an increased effort in quality monitoring and improvement processes through the analysis of the performance of health care providers.
The final objective of quality monitoring is always to improve the quality of care delivered to the patients. The instruments through which the level of performance can be assessed may, however, vary depending on the targeted audience. For practicing clinicians, it may be more useful to assess the degree of agreement of their practice with recommended standardized critical pathways of care, which are known to be linked with outcome. This gives immediately actionable information to improve their practice. However, for patients and financing institutions, outcome measures still represent the indicator of major interest. When selecting a provider, a patient is still more interested in knowing the provider's outcome rates than in knowing how its patients were treated according to standardized clinical pathways. Patients and payers are more inclined to choose those providers with the lowest morbidity and mortality rates.
However, if this information is not reported in the correct way, it may become a dangerous and unethical instrument. In fact, because patients are not randomly allocated to different physicians or hospitals and their outcome may be influenced by severity of illness, treatment effectiveness, or even mere chance [1,2], the comparison of outcomes appears challenging from the outset. Indeed, different case-mixes at different institutions may make the comparison of crude outcome rates misleading.
1547-4127/07/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.thorsurg.2007.07.006
thoracic.theclinics.com
BRUNELLI & ROCCO
Units treating older patients with increased rates of comorbidities are likely to have higher morbidity and mortality rates compared with those treating younger patients with less severe underlying diseases. Comparing crude rates may therefore lead to unethical risk-averse behaviors and gaming of data to make a unit look like a better performer. To obviate this problem, the selection of quality end points must account for the differences in the prevalence of risk factors, and clinical risk modeling must become the logical and necessary approach for provider profiling. Nevertheless, risk modeling for audit purposes is still in its embryonic phase in thoracic surgery, and the specific literature on this issue is scant and mainly limited to internal audit [3–5]. In those reports, morbidity and mortality models were developed within the same unit to create a temporal benchmark for monitoring internal performance during different periods of activity. However, when dealing with comparative audit between different units, several unique issues need to be taken into account and fully understood before starting the analysis. This article is intended to provide an overview of the principles, methods, and existing examples of comparisons of performance between different thoracic surgical units.
The selection of performance indicators for comparative audit
One of the most important and critical issues regarding the comparison of performance between surgical units is the selection of the proper measure or measures of audit. An ideal indicator of performance should meet the following characteristics: it should be based on agreed definitions and described exhaustively; it should be highly specific and sensitive; it should have high validity and reliability; it should have a high discriminative capability; it should be relevant to clinical practice; it should allow a fair comparison between providers; and it should be evidence-based. As explained in detail in another article in this issue of The Clinics, indicators may be related to outcomes or processes of care [6]. It is simplistic to view outcome and process indicators as being in competition with each other, but there are circumstances where one type of measure would be more useful than the other. Because the quality of care is by definition multidimensional, more than one indicator may be used and, ideally, outcome- and process-based indicators should be used in a complementary fashion. Certainly, process measures can be used whenever we have strong evidence or beliefs that particular processes affect or are linked to important outcomes.
Process data are particularly useful for comparisons when one or more of the following apply:
• The final goal is to improve the delivery of care.
• We want to understand why specific providers achieve particular outcomes. For example, if we compare providers on outcomes to identify the worst outcomes, then we must study their processes of care to see how they can improve. Even if we find that all providers have similar outcomes, we will still need to study processes. What if they all have similar outcomes but are falling short in giving effective care to appropriate patients?
• Short time frames are necessary.
• The process of interest affects long-term outcomes.
• Provider groups or individual providers who contribute only part of the care received by a patient are compared. Because the thoracic surgical patient is the subject of a multidisciplinary approach, when we look at outcomes only, we often cannot tell which providers to credit for a good outcome, or whether all should share the credit.
• Small-volume providers, for which outcome studies are not possible, are compared, because detecting differences in outcomes requires large sample sizes.
• Risk-adjustment or risk-stratification methods are lacking.
• Providers are being compared in a coercive or competitive situation, in which gaming of data and results to ensure better-looking outcomes can produce disastrous effects for the patients.
On the other hand, outcomes may be preferred in some circumstances and whenever we have tools to adjust or stratify for patient factors that affect the health outcome.
Outcomes data are especially useful when one or more of the following apply:
• We want to prioritize areas for quality improvement and have evidence about the processes of care that improve outcomes for a clinical condition. In this situation, comparison of outcomes helps us to detect whether the outcomes achieved are not as good as they should be.
• We need safe implementation of processes of interest. Comparison of outcomes may be the best way to uncover mistakes or oversights in implementing care. Whenever outcomes fall short of the level known to be possible, there is a place for spending resources in quality management and seeking opportunities for big improvements.
• Long time frames are possible.
• We want to compare the performances of high-volume providers.
• We want to assess the performance of whole systems or programs of care rather than of individual providers.
• Providers are being compared in a cooperative, voluntary, non-punitive situation.
In practice, outcome measures are the ones most widely reported in the existing scant literature addressing the subject of quality audit in thoracic surgery [3–5]. Although some circumstances are less than ideal, the assessment of the internal performance or the comparison between different units in terms of complications and mortality (typical outcome measures) is still the preferred approach by most of the practicing thoracic surgeons. The biggest hurdle to the use of process-based performance indicators is the lack of well-defined evidence-based standardized critical pathways of care in our specialty and the lack of widely available information systems capable of simplifying the unbiased and objective recording of the processes of care of interest. Until these conditions are realized, the thoracic surgery community will continue to rely on outcomes indicators to monitor and improve our performance. Nevertheless, outcome measures have several limitations that need to be understood and taken into account in the interpretation of outcome data for comparative audit.
Limitations of outcome measures in comparative audit
The two most commonly used outcome indicators in surgical analysis of performance are morbidity and mortality; however, when used for
performance analysis these measures have several critical inherent problems. Morbidity events have the statistical advantage of being more frequent than deaths, and for this reason morbidity rates are more useful in developing risk models and in comparative audits between different units. However, they may have important limitations. First, complications may be defined variably in different settings or even in the same hospital at different times. This may produce different rates of morbidity that do not reflect the quality of care delivered to the patients but simply the quality of the outcome variable definition. As an example in our specialty, among the many published papers dealing with complication rates after lung resection, the list of complications included in the analysis was largely inconsistent from one investigation to another, and even the definition of the same type of complication was variable. Some studies focused only on major cardiopulmonary morbidity such as respiratory failure, pneumonia, myocardial ischemia, pulmonary edema, pulmonary embolism, arrhythmia, cardiac failure, atelectasis, stroke, and acute renal insufficiency [7–11]. Others added to this list the so-called minor complications, such as urinary tract infections, acute urinary retention, deep venous thrombosis, ileus/bowel obstruction, and wound complications [12]. Others also included surgical complications such as recurrent nerve injury, chylothorax, hemothorax requiring reoperation, bronchopleural fistula, and prolonged air leak [3]. From a quality management point of view, aggregating all the different kinds of complications into one single outcome measure would be less than ideal. In fact, if the objective is to improve performance by preventing the occurrence and improving the management of complications, these should be grouped in different categories and dealt with separately, as they are likely to have different predisposing factors.
Even from a statistical point of view, major cardiopulmonary, technical, and minor complications may be associated with different risk factors and, therefore, different risk models for each of these groups should be derived for the sake of improving risk stratification. In addition to this categorization problem, the same complication may be variably defined from one center to another. For example, respiratory failure could be variably considered by some investigators as the need to be reintubated at
any time after operation [11–13], or the need for mechanical ventilation for more than 24 [14] or 48 hours [7,9,12,15]. The second inherent problem of morbidity is the possible variability in the recording of complications between different centers and even within the same center at different times. For example, if all hospitals have the same complication rate but more complications are recorded at a teaching hospital because more observers write notes, this practice might explain why hospitals with residency training programs have higher observed complication rates. The same problem may occur within the same unit whenever the system for recording complications has changed (ie, from written notes in the chart to electronic standardized data systems) or the person responsible for recording complications has changed over the years. The same problem of variability in definition and recording also applies to risk factors (preoperative or postoperative variables that can be associated with outcome and that can be used in model building for risk adjustment). This is crucial when dealing with risk adjustment, as will be discussed in the next section. In fact, if more risk factors are recorded in one center than in another as a consequence of different recording methods, or different numbers or qualifications of the personnel involved in data recording, an apparently better risk-adjusted outcome would be expected in that center. This result would be misleading inasmuch as it would not be the consequence of a better performance but only of a more complete and more frequent recording of risk factors in that center compared with the other. Whenever a comparative audit is planned between 2 or more centers, it is crucial that risk factors and outcome variables be defined by strict criteria before starting the data collection. In particular, morbidity must be exhaustively described and a list of complications of interest must be defined.
The criteria for inclusion in each of the complications selected for the analysis must be described in detail. For the purpose of quality management and comparative audit, the type and number of complications selected for the analysis are less important than their meaning from a managed care perspective (major complications that add complexity to the postoperative management and increase the postoperative use of resources), their membership in a category consistent with possible common risk
factors or causal effects, or the ultimate objective of the audit analysis (ie, assessment of the quality of perioperative care delivered by a surgical unit or of the surgical skill of individual surgeons). Finally, the other problem with morbidity is that in most of the studies dealing with this measure of outcome, complications are not weighted, meaning that an atelectasis requiring bronchoscopy may be considered on an equal footing with a respiratory insufficiency requiring more than 2 days of mechanically assisted ventilation. This is clearly improper from a clinical and management point of view, given the different impact in terms of expenses, resource use, and clinical complexity of the case. Unfortunately, an accurate weighting system for complications that could stratify morbidity and make comparative audit more reliable has not yet been found. Mortality is certainly a less ambiguous measure of performance than morbidity. However, even for a seemingly unambiguous end point there are important statistical and policy implications of using different time frames for its definition (in-hospital, 30-day, 60-day mortality). Whenever a comparative audit between different centers is planned, this time frame must be defined before the start of the project. From a quality management perspective, in-hospital mortality may be more closely linked to the quality of care of the provider. Mortality outside the hospital (particularly more than 30 days after the operation) may be associated with other variables unrelated to surgery, such as structural, organizational, geographical, family, or primary care factors, for which the audited center cannot be held accountable. The other problem with mortality is the number of cases. Fortunately, after lung resection or other thoracic procedures, mortality rates are low, around 2% to 4%.
Small numbers mean that a large number of cases need to be aggregated to reach a reliable number of events and obtain meaningful statistical comparisons. From a strict statistical perspective, when dealing with risk adjustment, it is recommended that, to prevent overfitting of the model, the number of events (mortality cases) should not be less than 10 for each predictor included in a model [16]. This means that a predictive equation containing two independent predictors (ie, age and ppoFEV1) could not be derived from a population experiencing fewer than 20 deaths. Accumulating these events may require a considerably long time frame for the analysis, which may not be very advantageous in terms of
quality management. It may take as long as 4 to 5 years to aggregate the mortality cases needed to develop a reliable model for a single moderate-size unit before obtaining the desired feedback on performance. However, waiting 5 years to know how well we are working is less than ideal and implies a considerable waste of time and perhaps preventable loss of life. Mortality is probably not the most appropriate measure of performance in our specialty and may be quite insensitive to quality as well. It tells very little about the way we work. Even if a patient undergoing lung resection receives the worst possible perioperative management, he or she will likely survive the operation anyway. A possible solution to the numerical problem of mortality may be to aggregate more centers (multi-institutional datasets) to build a reliable mortality model that could be used for comparative audit. However, this would solve only one aspect of the problem. In fact, once a valid and reliable model is available, it still has to be applied to the individual units or, worse, to the individual surgeons, and one still has to cope with small numbers and long periods of time to aggregate sufficient events to detect differences. Morbidity and mortality reflect different aspects of quality of care, the first being more associated with the physiological characteristics of the patients operated on and the second linked both to the patients’ characteristics and to the structural and organizational characteristics of the hospital [17]. Since quality of care is multidimensional, in the absence of more accurate methods to evaluate it in its entirety, multiple outcome measures should be used when planning a comparative audit on quality of care [7]. Units may rank differently according to different measures of performance [17], and this may have important policy and managerial implications.
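The arithmetic behind the events-per-predictor rule cited above [16] can be sketched as follows. This is only an illustration and not part of any published analysis: the function names and the figure of 150 resections per year are assumptions chosen to show how a 3% mortality rate translates into years of data collection.

```python
import math


def max_predictors(n_events: int, events_per_predictor: int = 10) -> int:
    """Largest number of predictors the observed events can support."""
    return n_events // events_per_predictor


def patients_needed(n_predictors: int, event_rate: float,
                    events_per_predictor: int = 10) -> int:
    """Patients required to accumulate enough events for the model."""
    required_events = n_predictors * events_per_predictor
    # Round before taking the ceiling to avoid floating-point artifacts.
    return math.ceil(round(required_events / event_rate, 9))


def years_needed(n_predictors: int, event_rate: float, cases_per_year: int,
                 events_per_predictor: int = 10) -> float:
    """Years of activity required at a given (assumed) annual case volume."""
    n = patients_needed(n_predictors, event_rate, events_per_predictor)
    return n / cases_per_year


# Hypothetical unit: 150 major lung resections per year, 3% mortality,
# two-predictor model (for example age and ppoFEV1).
print(max_predictors(20))             # predictors supported by 20 deaths
print(patients_needed(2, 0.03))       # patients needed for 20 deaths
print(years_needed(2, 0.03, 150))     # years of data collection
```

Under these assumed figures, a two-predictor model needs 20 deaths, which corresponds to about 667 patients or roughly four and a half years of activity, in line with the 4 to 5 years mentioned above.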
Finally, the single most important limitation of outcome measures is that they are all influenced by the characteristics of the patients who are being treated. Thus, to perform a fair comparison of outcomes, these must be adjusted for the case-mix of the population and the severity of the operation performed. An unadjusted comparison of morbidity and mortality would favor those units operating on younger and healthier patients, which predictably will have better outcome rates. This would promote dangerous and unethical risk-averse behaviors, which would ultimately deny surgery to the oldest and sickest patients.
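A small numerical sketch may help to show why crude rates can mislead. It uses indirect standardization (observed versus expected deaths) on two invented units; the strata, the reference risks, and all patient counts are hypothetical values chosen for illustration and are not taken from any of the cited studies.

```python
# Hypothetical reference mortality risk for each case-mix stratum.
REFERENCE_RISK = {"low": 0.01, "high": 0.08}

# Each unit maps a stratum to (patients treated, observed deaths).
unit_a = {"low": (180, 2), "high": (20, 2)}    # mostly low-risk patients
unit_b = {"low": (100, 1), "high": (100, 7)}   # many high-risk patients


def crude_rate(unit):
    """Unadjusted mortality: total deaths divided by total patients."""
    patients = sum(n for n, _ in unit.values())
    deaths = sum(d for _, d in unit.values())
    return deaths / patients


def expected_deaths(unit):
    """Deaths expected if the unit performed at the reference risk."""
    return sum(n * REFERENCE_RISK[stratum] for stratum, (n, _) in unit.items())


def oe_ratio(unit):
    """Observed/expected ratio; values below 1 mean fewer deaths than expected."""
    observed = sum(d for _, d in unit.values())
    return observed / expected_deaths(unit)


for name, unit in (("A", unit_a), ("B", unit_b)):
    print(name, round(crude_rate(unit), 3), round(oe_ratio(unit), 2))
```

In this invented example, unit B has twice the crude mortality of unit A (4% versus 2%), yet its observed-to-expected ratio is below 1 while that of unit A is above 1. The ranking reverses once case-mix is taken into account, which is the situation the paragraph above warns about.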
Risk adjustment
Because of different geographical or referral patterns, the case-mix at different institutions may vary, making the comparison of crude outcome rates misleading. In fact, patients’ outcome may be influenced by severity of illness, treatment effectiveness, or even mere chance [1,2], and differences in mortality and morbidity rates may reflect differences in patients’ characteristics more than differences in performance. Therefore, the selection of quality end points must account for the differences in the prevalence of risk factors, and clinical risk modeling must become the logical and necessary approach for provider profiling. Risk adjustment is the instrument through which outcomes from a given intervention are predicted by arranging the patients according to the severity of their illness (risk stratification). It is realized through the construction of risk models by applying sophisticated statistical techniques, of which the most commonly used in our specialty is regression analysis (logistic regression for discrete variables of interest such as mortality, or linear multivariate regression for continuous variables of interest such as postoperative FEV1). A detailed description of these techniques is not the objective of this article; however, we think some issues regarding risk modeling deserve to be emphasized. The most important phase in risk modeling is data collection. No model can be better than the data from which it is derived. The quality of data definition and recording and the importance of the data source cannot be overemphasized. Therefore, the first step of any quality improvement activity is the construction of a clinical, prospective, specialty-specific, procedure-specific, periodically audited database. Whenever possible, administrative datasets should be avoided if the aim of the analysis is to assess clinical performance.
In fact, although administrative data have the advantage of being easily available and retrievable, they were not meant for clinical purposes but for billing purposes, and they suffer from several problems. They usually lack important critical variables, such as pulmonary function measures, oxygen consumption, or detailed intraoperative variables that can be strongly associated with outcome. Furthermore, differentiation of comorbidities from complications may be problematic and this may exaggerate the
predictive ability of risk models by inappropriately including preterminal complications that are highly correlated with mortality. Administrative databases may also exclude important variables that are not billable diagnoses or limit the number of secondary diagnoses. Finally, they are not flexible enough to classify certain comorbidities properly. All of these problems limit the accuracy of risk models derived from them. As mentioned above, strict definition and standardization of predictors and outcome variables are fundamental before starting a quality management activity. Particularly for a multi-institutional comparative audit, all participating centers should standardize the variables included in the dataset, in terms of both definition and recording method. Dataset managers at each center should be selected and made responsible for the accuracy and consistency of the data. Another important concept in risk modeling is that no model is better than the one derived from the data at hand, and it has been shown that ready-made models applied to external populations perform less well than internally derived models [2,18–20]. Therefore, whenever possible, models should be derived from internal data and the use of existing external models should be limited. Furthermore, it is known that regression models perform better when applied retrospectively to evaluate past performance [21]. From a total quality management perspective, they are not meant to foretell the future, but to analyze past data to avoid repeating problems encountered in the past. In this regard, the retrospective application of a model as a diagnostic quality instrument to the data from which it was developed seems justified, provided a cross-sample validation (such as bootstrap) has been performed to measure its reliability [22–24]. Unfortunately, risk models have only limited individual discrimination, and this may have significant implications for their use in individual patient counseling.
Risk models may accurately predict that 3 of 100 patients with a given set of risk factors will die postoperatively, but they cannot precisely identify which 3. No severity tool will ever perfectly describe patients’ risks for death or complications, and the most important reason that risk-adjustment methods fail to completely predict outcomes is that the data set used to derive the risk model comes from retrospective, observational data that contain inherent selection bias.
Moreover, the moderate discrimination of surgical risk models may also be explained by other reasons, such as the presence of as yet unknown predictors, the difficulty of representing complex clinical conditions or pathways of care with one or more variables, and the occurrence of catastrophic random events that are rare in the population but important for single patients [25]. In logistic regression models in which the overall mortality rate ranges from 2% to 5%, the coefficient of multiple determination, or R², is almost always less than 0.2, meaning that less than 20% of the variability of the outcome is explained by the model [21]. In addition to the reasons mentioned above, this limitation arises from the nature of logistic regression, in which the dependent variable must have one of only two values (ie, survival or death). When the difference between actual and predicted mortality is calculated for each person (as part of the calculation of R²), no matter how accurate the prediction is, the difference between the predicted value and the observed value for mortality will be large, since the observed mortality must be either 0 or 1 and the prediction is a proportion between 0 and 1. While awaiting a quantum improvement in risk prediction through the development of new statistical techniques or the refinement of end point measures [26], the use of risk models should be limited to population-based analyses. Another critical issue in risk modeling is the number of variables that can be included in the model. It has been shown that increasing the number of variables beyond a limited set of core predictors does not improve the performance of the model [27]. Conversely, keeping the models as parsimonious as possible may be attractive since it can prevent many problems: cost of data collection; errors and imprecision in data recording; missing values; and instability of the model.
The ideal model should be based on clinical, high-quality, prospectively compiled, periodically audited, specialty-specific, procedure-specific (lung resection) databases that should contain, at the minimum, a core set of variables that have been demonstrated to be associated with outcome.
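The low R² values discussed in this section can be reproduced with a small worked example. The sketch below builds a hypothetical cohort of 10,000 patients (the group sizes and risks are assumptions, not data from any study) in which the model knows each patient's true risk and deaths occur exactly at that rate. Even this ideal model explains only a small share of the outcome variability when overall mortality is 3%.

```python
def build_cohort(groups):
    """Build outcomes and predictions from (n_patients, true_risk) groups.

    Deaths are assigned exactly at the true rate, so the predictions are
    perfectly calibrated by construction.
    """
    observed, predicted = [], []
    for n, risk in groups:
        deaths = round(n * risk)
        observed += [1] * deaths + [0] * (n - deaths)
        predicted += [risk] * n
    return observed, predicted


def r_squared(observed, predicted):
    """R² as 1 minus residual sum of squares over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical case-mix: 9000 patients at 1% risk, 1000 patients at 21% risk.
observed, predicted = build_cohort([(9000, 0.01), (1000, 0.21)])

print(sum(observed) / len(observed))             # overall mortality
print(round(r_squared(observed, predicted), 3))  # R² of the "perfect" model
```

In this constructed cohort the overall mortality is 3% and the R² is about 0.12, well below 0.2, although the predicted risk of every patient is exactly right. The residual for each individual is large simply because the observed value can only be 0 or 1.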
The construction of risk models: bootstrap versus training and test methods
Regression analyses are the analytic techniques most commonly used for risk modeling. However,
the resultant models are useful only if they reliably predict outcomes for patients by determining significant risk factors associated with the outcome of interest. A problem may arise from this dependence on risk-factor analysis. Different investigators evaluating the same predictors through regression analysis may obtain heterogeneous results because of methodological discrepancies and inadvertent biases introduced in the statistical analysis [28]. In the early 1980s computer-intensive computational techniques, termed "bootstrap methods," were popularized [29–33]. Bootstrap analysis is a simulation method for statistical inference, which, if applied to regression analysis, can provide variables that have a high degree of reproducibility and reliability as independent risk factors for the given outcome. In fact, the predictive validity of a model can be assessed not just in one randomly split set of patients, as in the traditional "training and test" method, but in perhaps hundreds or, typically, 1000 new different samples of the same number of patients as the original database, obtained by resampling with replacement. Bootstrap analysis was recently proposed as a breakthrough method for internal validation of surgical regression models [22,23]. The main advantage of this technique is that the entire dataset can be used for model building, which would yield more robust models, especially in moderate-size databases and for rare outcomes (such as mortality after major lung resection) [34]. In a recent paper [24], we demonstrated that the traditional "training and test" method for model building, which consists of randomly splitting the database into a derivation set, from which the model is constructed, and a test set, in which its calibration and discrimination are assessed, was subject to sampling noise. In fact, we showed that 7 of 10 different mortality models obtained by repeating 10 "training and test" sessions included different combinations of variables.
The performance of each of these models was assessed 1000 times, each time using a bootstrap sample of 224 patients drawn from an external dataset. The distribution of the c-statistics varied widely from one model to another and, in general, the performance of the models in external samples was only modest. The development of risk-adjusted models by the "training and test" method therefore appeared completely unreliable. In contrast, by using bootstrap, we constructed and validated a mortality model, which, when
assessed in 1000 bootstrapped external samples drawn from another unit, performed better than the majority of the models developed by the "training and test" method. This shows that the bootstrap procedure can yield stable models across multiple populations. This technique can formalize model building, removing much of the human bias associated with regression analysis, providing a balance between selecting risk factors that are not reliable (type I error) and overlooking variables that are reliable (type II error), and introducing a concrete measure of the reliability of the risk factors [22]. For this reason, and in view of the unreliability of the "training and test" method, previously published reports using it should be interpreted with caution, and the use of bootstrap analysis is recommended for every future surgical model-building process.
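The resampling step described in this section can be sketched in a few lines. The example below is a simplified illustration and not the procedure used in the cited studies: it takes a fixed set of predicted risks, draws bootstrap samples of the same size as the cohort by resampling with replacement, and collects the c-statistic of each sample. The cohort is synthetic, and the function names, the seeds, and the cohort size of 200 are assumptions.

```python
import random


def c_statistic(risks, outcomes):
    """Share of (death, survivor) pairs in which the death had the higher risk.

    Ties count as half. Returns None if the sample lacks either outcome.
    """
    died = [r for r, y in zip(risks, outcomes) if y == 1]
    survived = [r for r, y in zip(risks, outcomes) if y == 0]
    if not died or not survived:
        return None
    concordant = sum((d > s) + 0.5 * (d == s) for d in died for s in survived)
    return concordant / (len(died) * len(survived))


def bootstrap_c_statistics(risks, outcomes, n_samples=1000, seed=0):
    """c-statistics from bootstrap samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(risks)
    stats = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # same size as the cohort
        c = c_statistic([risks[i] for i in idx], [outcomes[i] for i in idx])
        if c is not None:
            stats.append(c)
    return stats


# Synthetic cohort (illustrative only): predicted risks and simulated outcomes.
_rng = random.Random(42)
risks = [_rng.uniform(0.01, 0.30) for _ in range(200)]
outcomes = [1 if _rng.random() < r else 0 for r in risks]

stats = sorted(bootstrap_c_statistics(risks, outcomes, n_samples=200, seed=1))
print(len(stats), stats[0], stats[len(stats) // 2], stats[-1])
```

The spread between the smallest and largest c-statistic gives a direct impression of how much the apparent discrimination of a model depends on the particular sample in which it is assessed. Bootstrapping the variable selection itself, as described in the text, would require refitting the regression within each sample and is not shown here.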
Examples of comparison of performance between units
Comparative audit of performance between different centers is still in its embryonic phase in thoracic surgery. A recent cooperative effort between European centers (Ancona and Sheffield) was implemented with the aim of providing a methodological example of the development and application of comparative quality management strategies [7,35]. In this section, we outline the principal steps planned and implemented in that multi-institutional quality activity.
Defining the objectives
In the absence of more precise instruments to evaluate the quality of care in its entirety, multiple end points should be analyzed as a surrogate [17], since each end point may be associated with a different aspect of the quality of care. In this cooperative effort, the authors made a preliminary scrutiny of possible quality end points that could be consistent in definition and recording between the two centers and could be subjected to risk adjustment. The analysis focused on morbidity, mortality, and emergency unplanned ICU admission after major lung resection, with the aim of providing a methodological template for the development and application of multiple risk-adjusted outcome measures for comparative audit in our specialty.
Data management and characteristics of the participating centers

The analyses were performed by using the prospective, periodically audited, electronic databases of the two dedicated thoracic surgery units, located in two different European countries. Data were entered prospectively in each database by a trained staff physician and were periodically audited by a designated audit lead, who was responsible for the accuracy and completeness of the database. Both databases were in use as continuous quality improvement instruments at the two participating units. Data were first scrutinized to assess the quality of the variables and their consistency in definition and recording between the two units. For this purpose, each database was independently audited by the clinical audit lead of the other unit. Only those variables and end points that were deemed of high quality and consistent across the two units were included in this analysis. The databases were then made anonymous for both patients and surgeons and were merged for analysis. Complications were prospectively and independently recorded at the two centers according to strict criteria defined beforehand. The authors' priority was to assess the consistency of these definitions between the two centers, and only those complications that were judged to be reliably consistent in definition and recording were used for the analysis.
As for the consistency of ICU indications, the two units participating in this study had similar postoperative management policies: routine thoracic patients were initially managed in a dedicated general thoracic surgery ward with similar characteristics and nurse-to-patient ratio, and were transferred to the ICU only in the case of severe cardiopulmonary complications requiring assisted ventilation for acute respiratory failure, continuous invasive monitoring (including central venous and arterial lines or pulmonary artery catheter), or acute cardiac failure requiring multiple inotropic support. Because of the scarce availability of ICU beds at both hospitals, only patients meeting these strict criteria went to the ICU. Both centers had dedicated general thoracic surgery wards staffed, 24 hours a day, 365 days a year, with dedicated nurses and chest physiotherapists. In the early postoperative period the patients are monitored by means of an ECG monitor and a pulse oximeter. Noninvasive systemic blood pressure, respiratory rate, and body temperature are recorded every other hour (or more frequently if indicated) on a special chart. Every bed has oxygen and aspiration points, and Flo-guard infusion administration and syringe pumps are available. The nurse-to-patient ratio is 1:4 in every section of the ward. The surgical team includes certified thoracic surgeons and residents. At least one thoracic surgeon is always present in the ward during the day, and one is always on call at home during the night.
Populations and model-building process

The 743 patients who underwent lobectomy/bilobectomy (611) or pneumonectomy (132) in two European centers (519 in center A and 224 in center B) from January 2000 through December 2004 were analyzed. All patients were initially used to develop the predictive logistic models. For each measure of outcome (morbidity, mortality, and emergency ICU admission rates), variables were initially screened by univariate analyses. The univariate comparisons of outcomes were performed by means of the unpaired Student t-test for numerical variables with a normal distribution and by means of the Mann-Whitney U test for numerical variables without a normal distribution. The Shapiro-Wilk normality test was used to assess normal distribution. Categorical variables were compared by means of the chi-square test or Fisher's exact test, as appropriate. Variables with P < .10 at univariate analysis were then used as independent variables in the stepwise logistic regression analyses, which were then validated by bootstrap analyses. The presence or absence of one or more complications, of mortality, or of emergency ICU admission was used as the dependent variable in each respective model. All data were complete with the exception of carbon monoxide lung diffusion capacity (DLCO), which was 95% complete. Missing data were imputed by averaging the nonmissing values. To avoid multicollinearity, only one variable in a set of variables with a correlation coefficient greater than 0.5 was selected (by bootstrap procedure) and used in the regression model. The logistic models were then used to predict morbidity, mortality, or emergency ICU admissions in the patients operated on in the two different units. Predicted and observed outcomes
rates in each unit were then compared by means of the z-test for comparison of proportions. The following variables were initially screened for a possible association with postoperative morbidity, mortality, or emergency ICU admission: age, gender, body mass index (BMI, kg/m²), forced expiratory volume in 1 second (FEV1), DLCO, predicted postoperative FEV1 (ppoFEV1), predicted postoperative DLCO (ppoDLCO), cardiac comorbidity, type of disease (benign versus malignant), type of operation (lobectomy versus pneumonectomy), and neoadjuvant chemotherapy. Morbidity and mortality were considered as those occurring within 30 days from surgery or during a longer period if the patients were still hospitalized. The following complications were included: respiratory failure requiring mechanical ventilation for more than 48 hours; pneumonia (chest radiograph infiltrates, increased white blood cell count, fever); atelectasis requiring bronchoscopy; adult respiratory distress syndrome (ARDS); pulmonary edema; pulmonary embolism; myocardial infarction (suggestive ECG findings and increased myocardial enzymes); hemodynamically unstable arrhythmia requiring medical treatment; cardiac failure (suggestive chest radiographs, physical examination, and symptoms); acute renal failure (change in serum creatinine greater than 2 mg/dL compared with preoperative values); and stroke.

Outcome models and risk-adjusted comparison

Stepwise logistic regression analysis showed that significant and reliable predictors of morbidity were age (P = .005, bootstrap 80%), ppoFEV1 (P = .003, bootstrap 88%), and cardiac comorbidity (P = .002, bootstrap 89%). The following equation predicting morbidity was developed:

$$\ln\frac{R}{1-R} = -2.4 + 0.03 \times \text{age} - 0.02 \times \text{ppoFEV1} + 0.6 \times \text{cardiac comorbidity}$$

(Hosmer-Lemeshow statistic = 6.1 [P = .6], c-index = 0.65).

Stepwise logistic regression analysis showed that significant and reliable predictors of mortality were age (P = .0002, bootstrap 100%) and ppoFEV1 (P = .0004, bootstrap 98%). The following equation predicting mortality was developed:

$$\ln\frac{R}{1-R} = -6.97 + 0.095 \times \text{age} - 0.042 \times \text{ppoFEV1}$$

(Hosmer-Lemeshow statistic = 2.99 [P = .9], c-index = 0.77).

Stepwise logistic regression analysis showed that an emergency ICU admission was associated with older age (P = .01, bootstrap 71%), reduced ppoFEV1 (P = .02, bootstrap 55%), and pneumonectomy (P = .01, bootstrap 83%). Accordingly, the following regression equation estimating the risk of emergency ICU admission after major lung resection was developed:

$$\ln\frac{R}{1-R} = -4.48 + 0.05 \times \text{age} - 0.03 \times \text{ppoFEV1\%} + 0.95 \times \text{pneumonectomy}$$

(Hosmer-Lemeshow goodness-of-fit statistic = 6.3 [P = .6], c-index = 0.72).

The comparison of predicted and observed outcome rates is shown in Table 1. Despite a higher observed morbidity rate in unit A (P = .07), no differences were noted between observed and predicted outcome rates in each unit. No differences were noted when comparing the predicted to the observed ICU admission rates within each unit (unit A: 6.5% versus 4.8%, P = .3, and unit B: 6.4% versus 10.3%, P = .2, respectively), despite a higher observed ICU admission rate in unit B (10.3% versus 4.8%, P = .008). The risk-adjusted observed ICU (RAO-ICU) admission rates of the two units participating in the study were then calculated by multiplying the observed-to-expected ICU admission ratio of each unit by the observed ICU admission rate in the total population (0.065) [27]. This rate can be interpreted as the ICU admission rate a unit would have if its case-mix were similar to the average case-mix in the study. The RAO-ICU admission rate of unit B was
Table 1. Comparison of predicted and observed morbidity and mortality rates within each unit

| Unit | Morbidity, observed | Morbidity, predicted | P value | Mortality, observed | Mortality, predicted | P value |
|---|---|---|---|---|---|---|
| Unit A | 23% | 22.7% (22–23.5) | .9 | 4.8% | 4.9% (4.5–5.3) | .9 |
| Unit B | 17%* | 18.2% (17–19) | .7 | 4.4%** | 4.2% (3.6–4.9) | .9 |

Expected outcomes are presented with 95% confidence limits (in parentheses).
* P = .07, unit A versus unit B.
** P = .8, unit A versus unit B.
more than twice as high as that of unit A (10.5% versus 4.8%, respectively; P = .006).

Quality management implications

The use of risk-adjusted models of morbidity and mortality prevented the misleading conclusions that an unadjusted analysis of performance would have produced. In fact, although unit A had a higher observed morbidity rate than unit B, the observed outcome rates were in line with the predicted ones in each unit. The increased observed morbidity in unit A could be explained by a worse physiological state of the patients at the time of surgery rather than by a poorer performance, as shown also by the higher frequency of patients with higher predicted morbidity and mortality risks in unit A compared with unit B. Without the use of risk adjustment, unit A would have been erroneously regarded as underperforming relative to unit B. As for morbidity and mortality, the frequency of admissions to the ICU, with its attendant increase in costs and resource use, should also be analyzed by taking into consideration the characteristics of the patients operated on and the type of operations performed. An emergency ICU admission rate higher than predicted may be a marker of poor quality of care. In this case, the entire process of care should be reviewed to identify possible factors associated with a performance worse than expected. An emergency ICU admission rate lower than predicted may be a marker of a successful process of care. Clinical and structural factors that contributed to the good performance may be generalized to other units. As an example of this type of comparative audit analysis, we compared the risk-adjusted observed ICU admission rates of the two units and found that unit B had a significantly higher RAO-ICU rate than unit A. From a total-quality management perspective, these findings should prompt an in-depth analysis of the entire process of care in the two units to identify the factors responsible for these differences.
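The arithmetic behind this risk-adjusted comparison can be sketched in a few lines of Python. The sketch rests on several assumptions: the coefficient signs follow the stated direction of effect (older age and pneumonectomy increase risk, higher ppoFEV1 reduces it) and the intercept is taken as negative; the example patient is hypothetical; and the pooled two-sample z-test is only one common variant, since the exact form of the test used in the study is not specified. Function names are illustrative.

```python
import math

# Emergency ICU admission model, coefficients as reported in the text.
# Assumption: the intercept and the ppoFEV1 coefficient are negative.
ICU_INTERCEPT = -4.48
ICU_COEFS = {"age": 0.05, "ppoFEV1": -0.03, "pneumonectomy": 0.95}

def logistic_risk(intercept, coefs, patient):
    """Turn ln[R/(1-R)] = intercept + sum(b_i * x_i) into a probability R."""
    logit = intercept + sum(b * patient[name] for name, b in coefs.items())
    return 1.0 / (1.0 + math.exp(-logit))

def risk_adjusted_rate(observed, expected, overall):
    """Observed-to-expected ratio multiplied by the overall observed rate."""
    return observed / expected * overall

def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-sample z-test for proportions; returns (z, two-sided P)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical patient: 70 years old, ppoFEV1 60%, undergoing pneumonectomy.
patient = {"age": 70, "ppoFEV1": 60, "pneumonectomy": 1}
risk = logistic_risk(ICU_INTERCEPT, ICU_COEFS, patient)
print(f"Predicted emergency ICU admission risk: {risk:.3f}")

# Risk-adjusted observed ICU admission rates (%), using the rates in the text:
# observed 4.8 / expected 6.5 in unit A, observed 10.3 / expected 6.4 in unit B,
# overall observed rate 6.5.
rao_a = risk_adjusted_rate(4.8, 6.5, 6.5)
rao_b = risk_adjusted_rate(10.3, 6.4, 6.5)
print(f"RAO-ICU unit A: {rao_a:.1f}%, unit B: {rao_b:.1f}%")

# Compare the two risk-adjusted rates (224 patients in unit B, 519 in unit A).
z, p = two_proportion_z(rao_b / 100, 224, rao_a / 100, 519)
print(f"z = {z:.2f}, two-sided P = {p:.4f}")
```

The risk-adjusted rates obtained this way (4.8% for unit A and about 10.5% for unit B) agree with the values reported above. The P value from this pooled z-test need not match the published one exactly, because a different variant of the test (for example, one with a continuity correction) may have been used.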
The future of comparative audit

Quality management needs actionable information and rapid feedback. In this regard, outcome measures of performance have inherent problems that can limit their usefulness. Quality of care should be assessed as a whole, through an in-depth analysis of its processes. Even though
processes are ultimately linked to outcomes, they give much more information on how well the care has been delivered to the patients and on what we could correct to improve our practice. The use of process indicators will also allow a more reliable comparison of performance between individual surgeons by using end points that are more precisely attributable to the surgeon's activity. It is our belief that one of the most urgent priorities of our international scientific organizations (such as the American Association for Thoracic Surgery [AATS], the Society of Thoracic Surgeons [STS], the European Association for Cardiothoracic Surgery [EACTS], and the European Society of Thoracic Surgeons [ESTS]) should be to set up multi-institutional studies, meta-analyses, or consensus conferences between panels of experts to produce guidelines or suggested pathways of care that are evidence-based and linked to the best possible outcomes. These should be made public to our thoracic surgery community. Once we have a set of standardized pathways of care (which should be periodically revised, because processes are strongly influenced by technology), we would need an unbiased electronic tool to record, rapidly and inexpensively in our daily practice, the agreement of our clinical actions with the suggested processes. Once that kind of system is widely available, the analysis would be simple and straightforward: the calculation of the proportion of eligible patients receiving or not receiving a given process of care. Until then, multiple outcome measures need to be used in the attempt to define the quality of surgical care in a comprehensive manner. In this regard, risk modeling continues to be essential for provider profiling and must be used for a fair comparison of performance between different centers, with the ultimate goal of improving the quality of surgical care.
However, the comparison of outcomes should not be made in a coercive or punitive fashion, but in a cooperative and constructive spirit, with the ultimate aim to generalize successful quality factors from one unit to the others. Outcome comparative analyses should trigger subsequent internal processes of care evaluation to assess their consistency with current international standards of care. Regardless of the type of end point that will be used in the future for quality management purposes, data collection will always be of paramount importance. The costs for implementing and managing international multicentric databases (such as the European Thoracic Surgery Database
or the STS Thoracic database) seem, therefore, justified by the benefits that could be derived from the quality improvement processes that will be based on them. Even though start-up costs may be daunting, improved quality will ultimately be cost-efficient, and part of any cost savings realized by improved quality may even be factored into the total costs of gathering and maintaining risk-adjusted data. We think that important international cooperative processes for monitoring and standardizing the pathways of surgical care and for accrediting structures cannot disregard the use of reliable risk-adjustment models developed by thoracic surgeons.

Summary

Managed care systems demand an increased effort in quality monitoring and improvement processes through the analysis of performance of health care providers. Performance can be assessed by analyzing the outcomes or the clinical processes. To date, outcome indicators are still the most widely used quality end points in our specialty. However, because patients are not randomly allocated to different physicians or hospitals, different case-mixes at different institutions may make the comparison of crude outcome rates misleading. Therefore, the selection of quality end points must account for the differences in the prevalence of risk factors, and clinical risk modeling must become the logical and necessary approach for provider profiling to prevent inappropriate clinical and administrative decisions and unethical risk-averse behaviors. Regression analyses are the analytic techniques most commonly used for risk modeling. However, the resultant models are useful only if they reliably predict outcomes for patients by determining significant risk factors associated with the outcome of interest.
Recent evidence has shown that bootstrap resampling analysis can formalize model building: it removes much of the human bias associated with regression analysis, provides a balance between selecting risk factors that are not reliable (type I error) and overlooking variables that are reliable (type II error), and introduces a concrete measure of the reliability of the risk factors. Regardless of the type of end point that will be used in the future for quality management purposes, data collection will always be of paramount importance. The costs for implementing and managing international multicentric
databases (such as the European Thoracic Surgery Database or the STS Thoracic database) seem, therefore, justified by the benefits that could be derived from the quality improvement processes that will be based on them.

References

[1] Iezzoni LI. The risks of risk adjustment. JAMA 1997;278:1600–7.
[2] Luft HS, Romano PS. Chance, continuity, and change in hospital mortality rates. Coronary artery bypass graft patients in California hospitals, 1983 to 1989. JAMA 1993;270:331–7.
[3] Brunelli A, Fianchini A, Al Refai M, et al. Internal comparative audit in a thoracic surgery unit using the physiological and operative severity score for the enumeration of mortality and morbidity (POSSUM). Eur J Cardiothorac Surg 2001;19:924–8.
[4] Brunelli A, Fianchini A, Al Refai M, et al. A model for the internal evaluation of the quality of care after lung resection in the elderly. Eur J Cardiothorac Surg 2004;25:884–9.
[5] Brunelli A, Xiumé F, Al Refai M, et al. Risk-adjusted morbidity, mortality and failure-to-rescue models for internal provider profiling after major lung resection. Interact Cardiovasc Thorac Surg 2006;5:92–6.
[6] Donabedian A. The quality of care. How can it be assessed? JAMA 1988;260:1743–8.
[7] Brunelli A, Morgan-Hughes NJ, Refai M, et al. Risk-adjusted morbidity and mortality models to compare the performance of two units after major lung resections. J Thorac Cardiovasc Surg 2007;133:88–96.
[8] Ferguson MK, Durkin AE. A comparison of three scoring systems for predicting complications after major lung resection. Eur J Cardiothorac Surg 2003;23:35–42.
[9] Bolliger CT, Jordan P, Soler M, et al. Exercise capacity as a predictor of postoperative complications in lung resection candidates. Am J Respir Crit Care Med 1995;151:1472–80.
[10] Markos J, Mullan BP, Hillman DR, et al. Preoperative assessment as a predictor of mortality and morbidity after lung resection. Am Rev Respir Dis 1989;139:902–10.
[11] Licker M, Spiliopoulos A, Frey JG, et al. Risk factors for early mortality and major complications following pneumonectomy for non-small cell carcinoma of the lung. Chest 2002;121:1890–7.
[12] Harpole DH, DeCamp MM, Daley J, et al. Prognostic models of thirty-day mortality and morbidity after major pulmonary resection. J Thorac Cardiovasc Surg 1999;117:969–79.
[13] Varela G, Cordovilla R, Jiménez MF, et al. Utility of standardized exercise oximetry to predict cardiopulmonary morbidity after lung resection. Eur J Cardiothorac Surg 2001;19:351–4.
[14] Wang J, Olak J, Ultmann RE, et al. Assessment of pulmonary complications after lung resection. Ann Thorac Surg 1999;67:1444–7.
[15] Kearney DJ, Lee TH, Reilly JJ, et al. Assessment of operative risk in patients undergoing lung resection. Importance of predicted pulmonary function. Chest 1994;105:753–9.
[16] Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87.
[17] Silber JH, Rosenbaum PR, Schwartz JS, et al. Evaluation of the complication rate as a measure of quality of care in coronary artery bypass graft surgery. JAMA 1995;274:317–23.
[18] Ivanov J, Tu JV, Naylor CD. Ready-made, recalibrated or remodeled? Issues in the use of risk indexes for assessing mortality after coronary artery bypass graft surgery. Circulation 1999;99:2098–104.
[19] DeLong ER, Peterson ED, DeLong DM, et al. Comparing risk-adjustment methods for provider profiling. Stat Med 1997;16:2645–64.
[20] Orr RK, Naini BS, Sottile FD, et al. A comparison of four severity-adjusted models to predict mortality after coronary artery bypass graft surgery. Arch Surg 1995;130:301–6.
[21] Chassin MK, Hannan EL, DeBuono BA. Benefits and hazards of reporting medical outcome publicly. N Engl J Med 1996;334:394–8.
[22] Blackstone EH. Breaking down barriers: helpful breakthrough statistical methods you need to understand better. J Thorac Cardiovasc Surg 2001;122:430–9.
[23] Grunkemeier GL, Wu YX. Bootstrap resampling method: something for nothing? Ann Thorac Surg 2004;77:1142–4.
[24] Brunelli A, Rocco G. Internal validation of risk models in lung resection surgery: bootstrap versus training and test sampling. J Thorac Cardiovasc Surg 2006;131:1243–7.
[25] Shahian DM, Blackstone EH, Edwards FH, et al. Cardiac surgery risk models: a position article. Report from the STS workforce on evidence-based surgery. Ann Thorac Surg 2004;78:1868–77.
[26] Steen PM. Approaches to predictive modelling. Ann Thorac Surg 1994;58:1836–40.
[27] Tu JV, Sykora K, Naylor CD. Assessing the outcome of coronary artery bypass graft surgery: how many risk factors are enough? Steering Committee of the Cardiac Care Network of Ontario. J Am Coll Cardiol 1997;30:1317–23.
[28] Naftel DC. Do different investigators sometimes produce different multivariable equations from the same data? J Thorac Cardiovasc Surg 1994;107:1528–9.
[29] Diaconis P, Efron B. Computer-intensive methods in statistics. Sci Am 1983;248:96–109.
[30] Efron B, Gong G. A leisurely look at the bootstrap, the jackknife and cross-validation. Am Stat 1983;37:36–48.
[31] Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
[32] Breiman L. Bagging predictors. Machine Learning 1996;26:123–40.
[33] Efron B. The bootstrap and modern statistics. J Am Stat Assoc 2000;95:1293–6.
[34] Harrell FE. Regression modelling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag; 2001.
[35] Brunelli A, Refai M, Salati M, et al. Modelling the risk of emergency ICU admission following initial recovery after major lung resection [abstract P74]. Presented at the Programs and Abstracts of the 42nd Annual Meeting Society of Thoracic Surgeons. Chicago, January 30–February 1, 2006.