Current Problems in Surgery
Volume 34, Number 7
July 1997
Predicting Survival in the Intensive Care Unit

John P. Hunt, MD
Trauma and Surgical Critical Care Fellow
University of North Carolina at Chapel Hill
Chapel Hill, North Carolina

Anthony A. Meyer, MD, PhD
Professor and Vice-Chairman, Department of Surgery
Chief, Division of General Surgery
Medical Director of Critical Care
University of North Carolina at Chapel Hill
Chapel Hill, North Carolina
Predicting Survival in the Intensive Care Unit
Foreword 529
In Brief 530
Introduction 535
Prediction Instruments 537
  Basic Mathematic Concepts 537
  Designing and Implementing Prediction Models 542
    Hypothesis or Premise 542
    Patient Selection 543
    Outcome Selection 544
    Predictor Variable Selection 545
    Data Collection and Building a Database 546
    Developing the Model 547
    Validation of the Instrument 549
    Evaluating and Updating the Model 555
  A Review of Different Systems 557
    Acute Physiology and Chronic Health Evaluation (APACHE) 557
    Simplified Acute Physiology Score (SAPS) 569
    Mortality Prediction Model (MPM) 571
    Other Predictive Instruments 575
  Problems with Predictive Instruments 577
    Inherent Errors or Bias 578
    Errors in Application 582
  The Role of Physicians in Gauging Illness Severity 583
  Uses of Illness Severity Indexes 585
    Individual Patient Predictions 586
    Resource Use 587
    Quality Assurance 588
    Research 590
  New Ideas 590
  References 591
Foreword

Many of our most severely ill patients require admission to the surgical intensive care unit. The successful management of the illness depends on the efforts of a highly skilled team of surgeons, nurses, and associated medical personnel. Over the last 2 decades, Surgical Critical Care has become a recognized specialty. Ten years ago, the American Board of Surgery began offering a Certificate of Added Qualifications in Surgical Critical Care for surgeons who completed the requisite educational experience and training. This decision reflected the marked increase in the body of knowledge relating to the treatment of the critically ill patient and the need to train surgeons as experts in this discipline. In this issue of Current Problems in Surgery, Dr. John Hunt and Dr. Anthony Meyer of the University of North Carolina at Chapel Hill have written an excellent monograph on "Predicting Survival in the Intensive Care Unit." Their thorough review of this topic provides a splendid reference source for practicing surgeons and surgical house officers who treat patients with high-intensity illnesses.
Samuel A. Wells, Jr., MD
Editor-in-Chief
In Brief

Predicting the outcome and estimating the severity of illness in patients has been the physician's responsibility since the inception of medicine. For eons, this task has been performed with the heuristic powers of the practitioner himself. This method is very subjective and variable, however, because of differences in ability between physicians. Several factors have influenced our interest in the development of an objective illness severity score that could be used in a variety of patient populations. The first factor was the development of intensive care units (ICUs) in the 1950s. Physicians were able to better care for severely ill patients and had more influence, through interventions and advanced monitoring techniques, over the outcome. With this expanded armamentarium against life-threatening illnesses, a new problem developed. It became plainly evident that a section of the ICU population did not ultimately benefit from it. These patients died despite the application of the newest and most promising medical technologies. All too often, these patients died after an extended time in the ICU and the expenditure of finite and precious resources that might better have been applied elsewhere. A segment of the medical and the economic community sought a way of predicting who would benefit from the new technologies and large outlay of limited resources. The next factor in the development of an objective predicting system came not from those interested in exploring the options of resource allocation but out of a concern for quality control. In 1979 the Health Care Financing Administration (HCFA) noted that there were hospitals to which it paid Medicare dividends that had far greater death rates than comparable hospitals. It realized that mortality rates could not be used as the sole basis on which the quality of care could be judged. With this in mind, the goal was to form an instrument that could stratify the illness severity of patients with a wide variety of disease processes so that individual institutions and their practices could be evaluated fairly. The Acute Physiology and Chronic Health Evaluation (APACHE) system was the first large-scale system that attempted to stratify illness severity using predicted mortality. The original project, funded by the Health Care Financing Administration, yielded the APACHE I system by Knaus and colleagues at the George Washington University in 1981. APACHE I was updated to the APACHE II in 1985 and to the APACHE III in 1991. Many other models have been developed since this first endeavor. Notable
examples are the Mortality Prediction Model I (MPM I) in 1985 and the Mortality Probability Model II (MPM II) in 1993, both by Lemeshow and Teres from the Baystate Medical Center in Massachusetts. Also noteworthy are the Simplified Acute Physiology Score I (SAPS I) developed in 1984 and the Simplified Acute Physiology Score II (SAPS II) developed in 1993 by LeGall and colleagues in France. Despite the seemingly large number of predictive models developed, all of the currently available techniques are based on common precepts and similar designs. The cornerstone of predictive modeling is the multiple logistic regression technique. This technique allows models (or equations) to be constructed that take an existing database and relate the independent variables (predictors) to a single dependent variable (outcome). This equation, which explains the association of predictors and outcome in a large database, can then be applied to smaller groups or individuals. In logistic regression the outcome is always dichotomous (in our case, alive or dead). The logit (y) is expressed as y = β0 + β1x1 + β2x2 + ... + βqxq, where β0 is a constant and β1 is the coefficient for predictor x1, etc. The logit (y) can then be substituted into the equation p = e^y / (1 + e^y). Solving for p yields the probability of mortality for the individual patient to whom the model was applied. Just as a set of mathematic axioms dictates the use of multiple logistic regression, there are maxims that are applied in the construction of all models. Each model is built on a premise or hypothesis that guides the overall process of selecting the population to be studied and the data to be collected. Most models use the premise that the outcome for patients in the ICU is related to a combination of physiological derangement, age, and underlying disease states. Once the hypothesis is established, the patient population to be studied is selected. It must be determined whether the model is to be broadly inclusive and applicable to the entire ICU population, or whether it is to be specific for a selected subpopulation. This subdivision may be defined by disease process, but other criteria are also acceptable. After the study population is defined, the patients can be recruited. The outcomes must then be defined. Most of the major models use mortality as the outcome variable, but this is not absolutely necessary. It is entirely plausible that as other issues such as efficiency and quality of life come to the forefront, outcome variables such as length of hospital stay and quality of life will be used. Next, a set of independent or predictor variables is selected. There are two methods of predictor variable selection, subjective and objective. The subjective method involves the selection of predictors by a panel of experts
in the chosen field. Through consensus, these experts choose the variables that they believe contribute most to the outcome and assign them weights that reflect their relative contribution. The major advantage of this method is that a preexisting database is not required. The major disadvantage is that decisions on predictors are subjective. The objective method involves the selection of variables from an existing database. Univariate and multivariate analyses are used to identify the variables that best explain the outcome. The major advantage of this method is that it is data-driven and yields predictors that better explain the outcome. The major disadvantage of this approach is the need for an existing database, which can be expensive and time-consuming to obtain. Most designers of predictive instruments will simply collect data on a large number of data points and then eliminate the points that do not contribute to the outcome. One of the cornerstones of model building for outcome prediction is reliable and consistent data collection. Any model constructed from poor data or data collected in a nonreproducible manner is apt to misrepresent the population from which it is derived or to which it is applied. Indices of concordance are the standard methods of assessing data collection techniques. These methods involve taking a fraction of the total patient population and collecting the same data on the same patients at different times and by different data collectors. Comparisons are then made between the two observations. The kappa statistic is used to assess dichotomous data, and the intraclass correlation coefficient is used for continuous data. Once the database has been constructed, pertinent predictors must be identified and related to the selected outcome with the prescribed multiple logistic regression modeling techniques. There is an abundance of multiple logistic regression methods, which include linear discriminant analysis, backward elimination of variables, forward regression, stepwise regression, and other regression procedures. The objective is to develop a model having the most explanatory power with the fewest predictors. After the model has been constructed, it must be validated on the patient population from which it was derived to assess its internal validity. The model is then used in patient populations distinctly different from the original data set to see whether it can be generalized. Validity is usually assessed by two characteristics, discrimination and calibration. Discrimination describes the ability of the model to predict individual outcomes and is best determined by the area under the receiver operating characteristic curve. Calibration describes how the model performs across the entire range of risk. This is usually analyzed with a goodness-of-fit test such as the Hosmer-Lemeshow statistic.
Lastly, the model must be assessed continually and reformed to reflect changes in disease processes and treatment modalities. As these two parameters change, the applicability of the model will wane. New models must be built, or older models must be modified. Many problems plague predictive instruments. These include errors within the model itself, known as biases. The other major source of error is in the actual use of the model; these are known as errors in application. Biases that are evident in predictive models include selection bias, lead-time bias, detection bias, and diagnosis bias. Selection bias can occur as a result of model admission criteria in which a group of patients may be under- or overrepresented in the model. Selection of particular variables can also be a source of selection bias if they do not accurately reflect the population that is used to build the model. Lead-time bias reflects therapeutic interventions performed before the patient's admission to the ICU that may affect the outcome but not be accounted for by the model. This may occur in the patient population from which the model is constructed or to which it is applied. Detection bias refers to steps taken when data needed to build or apply the model are missing. Diagnosis bias pertains to placing patients in diagnostic categories that may not accurately reflect their disease state. Errors in application occur almost exclusively in the collection of data. Poor data collection techniques or a complex model with data points that are difficult to collect are the main causes of this problem. Errors in application can make a useful model act in an unpredictable fashion. There are numerous uses for predictive instruments. They may be applicable to areas of resource allocation, quality control, stratification of patient populations for research, and possibly aiding end-of-life decisions. They should definitely not be used as sole criteria for withdrawal of care.
John P. Hunt, MD, is a Trauma and Surgical Critical Care Fellow at The University of North Carolina at Chapel Hill. Dr. Hunt obtained his medical degree at Albany Medical College and completed his general surgery residency at Louisiana State University Medical College at Charity Hospital in New Orleans. He then completed a clinical fellowship in Trauma and Surgical Critical Care at The University of North Carolina at Chapel Hill. Dr. Hunt is currently serving a National Institutes of Health-sponsored research fellowship. He is also involved in the Robert Wood Johnson Core Curriculum and is working toward a Master's Degree in Public Health. His research interests include the immunology of injury and burns and the public health issues and epidemiology of trauma.

Anthony Meyer received his PhD and MD degrees from the University of Chicago in 1976 and 1977, respectively. He then entered general surgical training at the University of California at San Francisco. After completing his general surgery training, he joined the faculty at the University of California in 1982. At San Francisco General Hospital, he was Director of the Burn Center and Associate Director of the Medical/Surgical Intensive Care Unit. In 1985, Dr. Meyer accepted a position at the University of North Carolina as Director of the Surgical Intensive Care Unit, Associate Program Director, and Associate Director of the Burn Center. In 1986 he became Medical Director of all Critical Care Units at the University of North Carolina Hospitals and now serves as Chief of General Surgery and the General Surgery Program Director. He is currently a Director of the American Board of Surgery.
Predicting Survival in the Intensive Care Unit
Recently, there has been an increasing interest in predicting the course of patients in the intensive care unit (ICU). This interest has been fostered by several events. One of the foremost factors fueling the search for a predictor of ICU mortality is the cost of ICU care. It has been estimated that ICUs account for nearly 10% of all hospital beds,1 and the cost for caring for critically ill patients consumes 1% of the Gross National Product.2 This burgeoning cost is largely related to the progressive technology involved in caring for the critically ill and, in part, to the success these technologic advances have reaped in prolonging life. ICUs were first conceived and developed in the 1940s and 1950s. Several factors contributed to their rapid construction and implementation. These factors included an influx of money provided by the Hill-Burton Act of 1946, an improved understanding of the organization of facilities provided by the military medical experience of World War II, and the immediate need to treat victims of the polio epidemics of the time.3 Forty years ago, patients in the ICU were usually young, had a disease with a well-known outcome such as tetanus or poliomyelitis, and rarely had more than one organ in dysfunction. If the patient and the physician could persevere, cure was often achieved and a normal life resumed.4 However, the demographics of the ICU population have changed markedly. An increasing segment of the patients in the ICU are individuals with advanced disease. It has been held by some that there is a separate and identifiable group of patients in the ICU who have irreversible processes.5 In one study, 65% of deaths in a given ICU over the course of 1 year were considered retrospectively to be "inevitable" despite all measures taken.6 Approximately 15% to 25% of patients in the ICU do not survive their hospital course.7 Many of these patients consume significantly more resources before dying compared with the average patient in the ICU.8 In one study of hospital resource use, Zook and Moore9 found that 13% of the general patient population consumed as many resources as the remaining 87%. It was noted that most of these costs were incurred by recurrent
hospitalizations for patients with chronic disease and that a substantial proportion of the high cost was incurred by lengthy ICU stays. When Oye and Bellamy10 specifically studied ICU use, it was noted that there was a "high cost" group; 8% of the ICU patient population used the same amount of resources as the remaining 92% of patients. Seventy percent of the "high cost" group died.10 This phenomenon has been described in other ICU populations as well.11 Reports that a large share of ICU expenditure goes to patients with a poor outcome have stimulated an interest in identifying this segment of the ICU population early and possibly limiting their care. Many methods have been developed for this purpose and fall under the headings of predictive instruments, illness severity indexes, and mortality scoring systems. However, this idea must be tempered by studies that have demonstrated that severely ill patients who survive a prolonged ICU experience have a generally satisfactory quality of life12 and that most patients and their families would undergo ICU therapy again for even small gains in life expectancy.13 Still, attempts at limiting futile care and increasing cost effectiveness in the ICU will only be intensified as the trend toward competitive managed care plans continues. It has been estimated that by the end of the 1990s as many as 60% of the total population of the United States could be part of managed care programs.14 There are other possible benefits that an effective mortality scoring system could yield aside from cost savings. These include a quantitative risk stratification system for a broad-based quality control program. The advent of continuous quality improvement in medical care is now here. The omnipresent nature of this issue, as manifested in the literature, is plainly evident and not bound by specialization or disease process.15-18 One of the chief criticisms of current systems of quality assurance is the need for a "level playing field."19 The ability to measure this quality fairly and accurately will depend on a standardized system to judge the severity of illness. This is requisite to compare quality of care in populations with and without the same disease in different institutions and even in different countries. Illness severity indexes will be used by both the private and public sector to help assess the quality of care received by its clients. It was actually the Health Care Financing Administration that funded the original Acute Physiology and Chronic Health Evaluation (APACHE) project with this idea in mind. There is also the possibility that an illness severity index and predictive model could provide an objective assessment for end-of-life decision making. Settlement of issues involving do-not-resuscitate (DNR) orders and withdrawal of care has always been difficult and fraught with uncertainty.
The input from another source could help to put at ease the mind of the physician, the patient, and the patient's family. Finally, illness severity indexes and predictive instruments may prove indispensable as a stratification standard for research in the ICU and in other portions of the hospital. More research is being undertaken in a multiinstitutional setting, and there is a need to be able to group and compare patients with complex medical problems fairly. Scoring systems are already being used for this purpose in both medicine and surgery. This is demonstrated by their application in recent studies on treatments of such diverse health problems as chronic obstructive pulmonary disease20 and intraabdominal solid organ injury.21 This monograph on outcome prediction in the ICU summarizes the mathematic concepts behind the illness severity scores, critically evaluates the existing predictive instruments, examines some of the problems inherent in these systems, and considers their appropriate uses.
Prediction Instruments

There are many predictive devices, including the APACHE systems, the Mortality Prediction Model (MPM), the Simplified Acute Physiology Score (SAPS), and others. All of these are examples of multiple logistic regression analysis. Although multiple logistic regression is not the only method used in predictive instruments, it is one of the most frequently encountered. Despite the seeming "black box" appearance of this method, there is a clear-cut set of steps and guidelines in designing, testing, and using predictive instruments of this type.
Basic Mathematic Concepts

Linear Regression and Multiple Linear Regression. There are many multivariate techniques, but all have similar underlying concepts. Most clinicians are familiar with the concept of simple linear regression, which uses an equation to describe the relationship between two variables. The concept is usually presented mathematically as y = mx + b. When applied to predictive modeling the equation is usually written as:

y = β0 + β1x1
(Equation 1)
This describes a two-dimensional relationship between y and x as represented by a straight line with a y-intercept, β0, and a slope of β1. The coefficient β1 describes the change in y for every one-unit change in x. Multiple regression describes the relationship between a desired dependent variable and more than one independent variable. The model will
contain as many dimensions as the number of independent variables plus the dependent variable. Thus, if a multiple regression equation has two independent variables, it can be plotted in three dimensions. More than two independent variables requires some imagination because this situation exceeds three-dimensional space. Models of multiple regression follow a basic equation of the form:

y = β0 + β1x1 + β2x2 + ... + βqxq

(Equation 2)
where y is the desired dependent variable; x1, x2, ..., xq are the independent variables; β1, β2, ..., βq are the coefficients for the corresponding independent variables; and β0 is a constant. It should be noted that the coefficient βi is the change in y for a one-unit increase in xi, given that all other independent variables are held constant.22 When constructing prediction models the dependent variable is usually referred to as the outcome variable, and independent variables are usually referred to as the predictors. One of the distinct advantages of the multiple regression model is that it allows an estimation of measures of association between predictors and outcome variables while controlling for modifying or confounding factors.23 Stratified analysis is the traditional method for controlling for these confounding factors when assessing for an association between cause and effect. With this technique, one or more variables are held constant while the effect of a particular variable on the dependent variable is assessed. For instance, this method could be used if one wanted to assess retrospectively the possible effect of alcohol consumption on the development of oral cancer. It is well known that smoking can cause oral cancer and that many people who consume alcohol excessively also have a tendency to smoke. Thus, tobacco use could be a confounding factor in trying to determine whether alcohol contributes to the development of oral cancer. If a population of patients with oral cancer is available, one could separate the cohort into those who smoked and those who did not. A comparison of the incidence of alcohol consumption in the nonsmoking population with oral cancer to alcohol consumption in a control population without the cancer could provide some information on the relative risk of alcohol. Because segments of the cohort are pared off to prevent confounding, this technique requires much larger study populations and is much more time-intensive than multivariate analysis. Multivariate analysis is able to factor confounding into the actual equation that describes the relationship between the independent variables and the dependent variable. In essence, multivariate analysis or the multiple regression model is used to control for the effects of many variables to assess the independent effects of one.24
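To make Equation 2 and the control of confounding concrete, the following sketch (Python, simulated data, invented coefficients) fits a multiple regression by least squares; note that the fitted coefficient for x1 approximates its true effect even though x1 and x2 are correlated:

    import numpy as np

    # Simulated data: two correlated predictors and a continuous outcome.
    # All variable names and effect sizes are hypothetical.
    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(0, 1, n)
    x2 = 0.8 * x1 + rng.normal(0, 1, n)          # x2 tracks x1 (confounding)
    y = 1.5 * x1 + 3.0 * x2 + rng.normal(0, 1, n)

    # Design matrix with a leading column of ones for the constant beta0
    X = np.column_stack([np.ones(n), x1, x2])

    # Least-squares fit of y = beta0 + beta1*x1 + beta2*x2
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)  # beta[1] estimates the effect of x1 with x2 held constant

Each coefficient is interpreted exactly as the text describes: the change in y per one-unit change in its predictor, the other predictor held fixed.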
Both linear regression and multiple regression require available data sets. Each process calculates the best-fit relationship between the outcome variable and the chosen predictor (in the case of linear regression) or predictors (in the case of multiple linear regression). The relationship between a particular predictor and the outcome is dictated by β, the coefficient of the independent variable. This coefficient is determined by one of several methods for constructing a straight line through the given data points. In the process of linear regression, the method most often used is the sum of least squares. This method constructs a line that minimizes the sum of the squares of the deviations between each observation and the line being constructed.22 The method of least squares is well suited for the simple and multiple linear regression models but does not have the desired properties for designing an equation to describe the relationship between a set of predictors and a dichotomous outcome variable such as mortality. The methods for deriving the coefficients for multiple logistic regression are far more complicated and fall under the heading of maximum likelihood estimators. In general, these methods involve the construction of a separate likelihood function for the variable in question and then the application of one of several methods to maximize the likelihood function.25,26 This yields the coefficients that best fit the available data.

Multiple Logistic Regression. There are two major assumptions in the multiple linear regression model. The first is that the outcome variable is continuous, and the second is that there is a linear relationship between the dependent and independent variables.23 This is not particularly well suited for the prediction of mortality, which is strictly a dichotomous variable. The outcome in a mortality prediction model has only two possible values, death or survival. Because of the dichotomous nature of mortality, the plot of probability of death against a given predictor is often sigmoidal, not linear. To fit the multivariate model to these conditions, a mathematic transformation must be used. The transform most consistently used for predictive modeling with a binary outcome is the logit transformation. This approach has both the ability to yield an outcome between 0 and 1 and to convert a linear function such as a multivariate equation to a more sigmoid shape consistent with the distribution of mortality (Fig. 1).26-28 The logistic transformation is described mathematically as:

p = e^y / (1 + e^y)
(Equation 3)
where p is the probability of the dichotomous outcome, which will range from 0 to 1, and y is the original multivariate equation, which will range from −∞ to +∞.
FIG. 1. Graphic description of the logit transformation. The linear relationship y = β0 + β1x1 + ..., in which y ranges from −∞ to +∞, is transformed by p = e^y / (1 + e^y) to a sigmoidal plot in which p ranges from 0 to 1.
This formula is derived by taking the natural logarithm (ln) of the odds of the outcome. For this reason, multiple logistic regression has also been referred to as the log-odds method. The odds can be expressed in terms of probability by:

odds = p / (1 − p)
(Equation 4)
where p is the probability of the outcome, which will always fall between 0 and 1. A simple way of understanding the concept of odds is to take the probability of an event, for instance 0.66, and substitute this value into Equation 4. This yields 0.66/0.33 = 2, often known in the vernacular as 2-to-1 odds. One of the distinct advantages of this step is that we have entered the term of probability into the equation. This provides us a variable (p) that fits the criterion of being between 0 and 1. The original multivariate equation (Equation 2) is then set equal to the natural logarithm of the odds:

y = ln(p / (1 − p)).
(Equation 5)
There is still a linear relationship between the original multivariate equation (y) and the log odds.22 The equation can then be solved for p after taking the inverse natural logarithm of both sides of the equation. This manipulation yields:

p = 1 / (1 + e^−y).
(Equation 6)
When the numerator and denominator of this equation are both multiplied by e^y, we obtain the more familiar form in Equation 3. The original multiple regression equation (Equation 2) can then be substituted for y. When the original multivariate equation is manipulated in this fashion, it is referred to as the logit. The substitution of the logit for y yields:
p = e^(β0 + β1x1 + β2x2 + ... + βqxq) / (1 + e^(β0 + β1x1 + β2x2 + ... + βqxq))

(Equation 7)
where p is still the probability of the outcome, e^y is the inverse natural logarithm, and the original multivariate equation, or logit, has been substituted for y. We have now fulfilled the criterion of having an outcome variable, p, between 0 and 1. We have also taken the linear multiple regression plot and given it a more sigmoidal shape. This is more consistent with the mortality probability distribution of the ICU population. At the less severe end of a predictor's spectrum, the probability of death will approach 0, and at the severe end of the predictor's spectrum, it will approach 1 (Fig. 1). The logistic transformation also allows for the use of many different types of predictor variables (xi). These include nominal, categoric, ordinal, or continuous variables with a normal distribution.29,30 These characteristics and other techniques31 make the model robust, or relatively insensitive to data outliers, and thus make it applicable to relatively small sample populations.32 Another distinct advantage of the multiple logistic regression modeling technique is the ability to convert the coefficients of the logit into easily understood and commonly used epidemiologic parameters. If we recall, the logistic transformation was constructed by taking the log of the odds. Therefore, if we take the inverse natural logarithm of the coefficient βi, we obtain the odds ratio or relative risk for the independent variable xi:

odds ratio or relative risk = e^βi
(Equation 8)
If the independent variable is dichotomous, the antilogarithm of the coefficient will yield the odds ratio. The odds ratio is defined as the risk of achieving the outcome in the presence of the dichotomous predictor divided by the risk of achieving the outcome when the predictor is not present.33 For example, if a predictive model had a variable for shock with the coefficient 0.4, the odds ratio would be e^0.4, which equals 1.49. This means that with all other parameters being equal, for every one person without shock who dies, 1.49 patients with shock will die. In the case where the independent variable is either continuous or ordinal, an estimate of the relative risk is obtained.23,26,34 In other words, for every 1-unit change in xi, the log-odds changes by βi.23 A practical example of this concept is provided by the APACHE II score. The coefficient for the APACHE II score in the multivariate equation for that model is 0.146. Therefore, the relative risk for a patient with an APACHE II score of 1 is e^0.146(1), which equals 1.16. From this it is evident that for every patient with an APACHE II score of 0 who dies, 1.16 patients with an APACHE II score of 1 will die. Once again, this is with all other components of the predictive equation held constant. Note that the relative risk does not change by a factor of 0.146, but the log of the relative risk does. One of the additional benefits of this modeling process is that the association of a particular predictor with the outcome can be assessed with the calculated odds ratio while the other variables in the model are controlled to some extent.35

TABLE 1. Guidelines for developing a prediction model

Hypothesis or premise
Patient selection
Outcome selection
Predictor selection
Data collection
Developing the model
Validation
Calibration and discrimination
Evaluating and updating the model
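Returning to the worked examples above, the following sketch computes the logit, the probability of death, and the odds ratios. The 0.146 APACHE II coefficient and the 0.4 shock coefficient are the values quoted in the text; the constant term is invented for illustration and belongs to no published model:

    import math

    beta0 = -3.0          # hypothetical constant, for illustration only
    beta_shock = 0.4      # dichotomous predictor coefficient quoted above
    beta_apache = 0.146   # APACHE II coefficient quoted above

    def probability_of_death(apache_score, shock):
        y = beta0 + beta_apache * apache_score + beta_shock * shock  # logit (Equation 2)
        return math.exp(y) / (1 + math.exp(y))                       # Equation 3

    # Odds ratios / relative risks from the coefficients (Equation 8)
    print(round(math.exp(beta_shock), 2))    # 1.49, the shock example above
    print(round(math.exp(beta_apache), 2))   # 1.16, per APACHE II point
    print(probability_of_death(apache_score=25, shock=1))

The two print statements reproduce the 1.49 and 1.16 figures derived in the text.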
Designing and Implementing Prediction Models

Like the basic mathematic concepts of multiple logistic regression analysis, there are well-established methods and guidelines for the design and implementation of a prediction model.35,36 These guidelines are summarized in Table 1.
Hypothesis or Premise

Before a predictive instrument is developed, there must first be a hypothesis or premise on which to base the design. Once this hypothesis has been established, the next three steps, patient selection, outcome selection, and predictor selection, can be undertaken with some direction. All of the predictive models are grounded by a basic idea. The concept behind the original APACHE was that a patient's outcome in the ICU could be predicted by the specific disease process, the patient's physiologic reserve as determined by age and chronic health, and the degree of physiologic derangement rendered by the disease.37 This idea (or variants on it) has been the premise for many of the other prediction models as well.
The Pediatric Risk of Mortality score uses similar parameters to predict mortality, including physiologic derangement, age, and operative intervention.38,39 The SAPS40 and the MPM41 also have physiologic derangement as the cornerstone of the prediction model. All of these scoring systems to some extent have used laboratory values, vital signs, or both to quantify the level of derangement. The extent of physiologic derangement is not the only possible underlying premise that can be used to ground a predictive model. The Injury Severity Score (ISS) was a scoring system designed from and for a specific patient population, the patient with trauma. This method attempts to associate the degree of anatomic damage with the prognosis.42 Another example of a totally different approach is the Therapeutic Intervention Scoring System (TISS).43 This method was not specifically designed to predict mortality but rather to assess the severity of illness through therapeutic interventions. A group of interventions was constructed and weighted on a scale of 1 to 4, which was then added to yield a score. The score was then translated into a severity of illness category that could then be used for allocating resources.
Patient Selection

One of the primary concerns in the selection of the patient population will be whether the predictive instrument is to be used in the general ICU population or only under a special set of circumstances. When referring to a specific set of circumstances, we are usually categorizing a group of patients by a specific diagnosis. Examples of scoring systems tailored to a particular patient population are abundant. Ranson and colleagues'44,45 set of criteria for acute pancreatitis is a common example of a very specific predictive model. It is a simple system designed retrospectively from the observations of a patient population with the particular diagnosis of acute pancreatitis. The authors were able to identify 11 predictors, 5 measured at admission and 6 within the next 48 hours. These predictors were then added together in an unweighted fashion, and the sum was correlated with prognosis. There are many other specific prognostic models ranging from the omnipresent cancer staging systems, with their attendant 5-year survival rates, to prediction models applicable specifically to patients with coronary artery disease.46 A prediction instrument designed from and for use in patients with pancreatitis would obviously not work very well for a patient in the ICU with an exacerbation of congestive heart failure. It takes a sizable patient population and thus a significant amount of time and effort to develop a database for a specific disease process. In response to this dilemma, there have been multiple efforts to develop predictive systems (e.g., APACHE, MPM,
SAPS) that would be applicable to a wide range of patients. These systems have all used consecutive ICU admissions to gather a variety of patients for the databases used in the construction of the respective models. It would be easy to conceive that more powerful predictive models could be derived from specific patient populations and then later applied to patients with those distinct disease processes.36 Yet, there is evidence that would challenge this assumption. One study of 290 patients with acute pancreatitis demonstrated that APACHE II predicted the correct outcome in 88% of cases compared with 69% for Ranson's criteria.47 Another study involved patients admitted to the ICU with rheumatologic disorders.48 Although this study was retrospective, it showed that there was a significant difference in the APACHE II scores of the segment of the study population that died compared with the subset that survived. The user of a particular predictive model must be aware of the demographics of the patient population from which the model was created. The need for attention to this detail stems from the possibility that the model population may differ significantly from the population to which the model is applied. For instance, the original APACHE model population excluded patients with myocardial infarction and burns.49 The original MPM excluded patients requiring a coronary care unit, patients who underwent cardiac surgery, patients with burns, and patients younger than 14 years of age.41 In contrast, the original SAPS had no specific disqualification criteria, taking consecutive admissions for the sample population.40 Before a particular predictive model is used, it is imperative to delineate any stipulations in the recruitment of the sample population that would make it significantly different from the population to which the instrument is to be applied.
Outcome Selection

In most models trying to gauge illness severity in the ICU, the outcome variable selected is mortality. Most researchers have used this end point not only in light of its gravity, but because as a data point it fits all the criteria of a successful outcome variable. The outcome of any model or prediction rule must be clearly defined, easily understood, and easily collected.50,51 Mortality satisfies all of these criteria. At what point in time this datum should be collected is a matter of debate. Some models use vital status at discharge from the ICU, whereas others use vital status at the time of discharge from the hospital as the outcome. The latter practice will place those dying in the hospital after a term in the ICU in the positive mortality class, whereas they would be termed survivors under the former scheme. Death during hospitalization rather than just in the ICU is probably a more accurate reflection of the true outcome but is more difficult to collect, because
it requires additional data collection outside of the ICU. Still more difficult to collect, but probably a more accurate reflection of the true utility of ICU care, would be mortality at 6 months after discharge. It should be noted that mortality need not be the only outcome considered when designing a model. There are problems in health care for which mortality may not be the best end point to examine. Other outcomes such as length of hospital stay, man-hours of care consumed, or quality of life indexes may be better suited as end points. This may be especially true when examining issues such as quality of life, efficiency, and quality improvement, all of which have received increasing attention in the field of health care.52-54
Predictor Variable Selection

The number of variables that could be collected and analyzed in the ICU is endless. There is a clear-cut method, however, that can be applied to the task of organizing and evaluating these variables. The method for constructing a scale as outlined by Feinstein55 is applicable and includes initial selection of candidate variables, elimination or retention of component variables based on their association with the dependent variable, demarcation of scales, assignment of weights, and combination of component variables. Like the outcome variable, the predictor should be quickly available, easy to interpret, and well defined. Biologic variables, as opposed to sociologic and behavioral variables, better fit these qualifications.56 This does not preclude the use of nonbiologic data, so long as the variable is defined precisely and its recording is standardized.57 It is these measurable values that, when correlated with outcome by a modeling method, will yield the elusive and unmeasurable illness severity. There are two methods by which predictors can be selected: subjective and objective. The subjective method involves gathering a panel of experts who, by review of the literature and their collective experiences, put together an assortment of variables that they believe best reflect the severity of illness. The weights that are assigned to these variables are also agreed on by consensus of the expert panel. One of the distinct advantages of this method is that it does not require a preexisting patient population or database to generate the predictors. As a result, the sample patient population or database can be used for validating the model.58 This is less expensive and allows the model to be used sooner. The predictors of APACHE I, APACHE II, and SAPS I were selected in this fashion. Problems with this method include selecting predictors that do not correlate well with actual outcomes59,60 and difficulty in adjusting the predictors when confronted with these errors.61
In light of the flaws in assigning variables and weights subjectively, most
later models feature predictors that were assembled with objective methods. The APACHE III, MPM I, MPM II, and SAPS II predictive models all used objective methods for selecting variables and assigning weights. One essential prerequisite for using objective methods is the existence of a database that contains many possible variables and their associated outcomes. This approach can escalate costs, because one group of patients will be needed to generate the model, and a separate group will be required for validation testing. The larger number of patients will also take considerably longer to accumulate. Most efforts have circumvented this time factor by including patients from many institutions. This practice is acceptable so long as the collection of data is performed in a standardized and consistent fashion. The impact of different methods of care on a multiinstitutional database has yet to be evaluated. There is no precise way to select and weight variables objectively for a model. Instead, the process is more of an art than a science, and in fact there is some subjective input during the process. Univariate methods are generally used in the preliminary exclusion of variables. This is sometimes referred to as exploratory data analysis and involves comparing and correlating individual predictor variables with the outcome. In cases of dichotomous predictor variables, the chi-squared test may be used to note significance. For continuous data, the Student's t test may be used. Exploratory data analysis can also involve correlation (r) or simply observing distributions when a single predictor is plotted against the outcome. The problem with the univariate selection processes is that they account for only direct influences on the outcome and do not explain indirect influences and interactions with other variables. For this reason, strict p values of 0.05 are rarely adhered to for fear of eliminating a significant variable. Most designers will not eliminate variables unless the p value is larger than 0.25.28 Once this initial elimination is performed, the remaining variables can be included or excluded by linear discriminant analysis or logistic regression techniques.
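The univariate screening step described above can be sketched as follows. The data, variable names, and effect sizes are simulated; the p ≤ 0.25 retention rule is the one quoted in the text:

    import numpy as np
    from scipy import stats

    # Simulated ICU data: a dichotomous outcome and two candidate predictors.
    rng = np.random.default_rng(1)
    died = rng.integers(0, 2, 300).astype(bool)
    shock = rng.integers(0, 2, 300).astype(bool)   # dichotomous predictor
    age = rng.normal(60, 15, 300)                  # continuous predictor

    def screen_dichotomous(predictor, outcome):
        # Chi-squared test on the 2 x 2 table of predictor vs. outcome
        table = np.array([[np.sum(predictor & outcome), np.sum(predictor & ~outcome)],
                          [np.sum(~predictor & outcome), np.sum(~predictor & ~outcome)]])
        _, p, _, _ = stats.chi2_contingency(table)
        return p

    def screen_continuous(predictor, outcome):
        # Student's t test comparing the predictor between outcome groups
        _, p = stats.ttest_ind(predictor[outcome], predictor[~outcome])
        return p

    for name, p in [("shock", screen_dichotomous(shock, died)),
                    ("age", screen_continuous(age, died))]:
        print(f"{name}: p = {p:.3f} ->", "retain" if p <= 0.25 else "drop")

As the text notes, this generous threshold only prunes clearly uninformative candidates; multivariate methods make the final selection.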
Data Collection and Building a Database

The process of data collection and building a database is one of the most frequently overlooked aspects of the entire model building process. Without accurate, consistent, and easily accessible data, all efforts to build or validate a prediction instrument are futile. The acquisition of a computer and software suitable for the task is therefore very important. The many hardware and software options available cannot be discussed at length here and are often a matter of personal preference. Several basic guidelines should be followed to maximize the usable data. Once variables and outcomes are chosen, it is necessary to code all the
variables in a standard fashion for entry into the computer. Every variable, once coded, is assigned a set of permitted responses. If one of these permitted responses is not entered into the computer, the person entering the data will be notified. This limits wrong entries and missed entries. Internal logic checks can also be constructed to detect inappropriate entries based on the data already entered. This is another way of minimizing wrong or missed entries.62 Once the computer aspects of the data collection have been organized, it is necessary to educate all data collectors in the procedures and protocols for entering the data. This process in itself can be an enormous undertaking, especially in the setting of a large multicenter study. Failure to correct these errors at the time of data entry will lead to the "garbage in-garbage out" phenomenon that can plague studies involving large databases. In terms of data management, reliability refers to the extent to which repeated measurements of the same phenomenon yield the same result.63 When referring to reliability there are two major considerations, interobserver and intraobserver consistency. Interobserver consistency refers to the ability of more than one observer to make the same observation of the variable being recorded. Intraobserver consistency refers to the ability of a single observer to make the same observation on the same variable.64 Statistical measures of these phenomena are known as indices of concordance. The kappa statistic is used to measure consistency in the observation of dichotomous variables and also takes chance into account. The intraclass correlation coefficient is used to assess consistency in the collection of continuous data. Both indexes range from −1 to 1. A value of 1 denotes perfect agreement between observers, whereas a value of 0 indicates agreement between observers no better than chance. Values between 0 and −1 imply agreement less than chance.65 The predictors for most validated databases have kappa statistics or intraclass correlation coefficients between 0.7 and 1. Models that do not mention indexes of concordance probably have not been assessed for the reliability of data entry and lack a valuable indicator of the strength of the model. Five percent of the patients in a quality database are usually reassessed for reliability with indexes of concordance.
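A minimal sketch of the kappa computation for a dichotomous variable recorded by two observers on the same 10 patients; the observations are hypothetical (the intraclass correlation coefficient for continuous data is analogous in spirit):

    # Kappa: observed agreement corrected for the agreement expected by chance.
    def kappa(obs1, obs2):
        n = len(obs1)
        observed = sum(a == b for a, b in zip(obs1, obs2)) / n  # observed agreement
        p1 = sum(obs1) / n                                      # observer 1 "yes" rate
        p2 = sum(obs2) / n                                      # observer 2 "yes" rate
        chance = p1 * p2 + (1 - p1) * (1 - p2)                  # chance agreement
        return (observed - chance) / (1 - chance)

    observer1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
    observer2 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
    print(kappa(observer1, observer2))  # about 0.58 here; validated databases aim for 0.7-1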
Developing the Model

During the next step of model building, development of the model, the predictor variables are related to the outcome. The type of model used will vary with the characteristics of the predictors and the outcome. Because the outcome of mortality is dichotomous and the possible predictors are numerous, almost all models for predicting survival use the multiple logistic regression format. Once the original list of variables has been pruned by
exploratory data analysis, the final association must be made. One of the prime directives during this step is to find the model that gives the best explanatory value while using the fewest variables. This might also be explained as constructing a model that eliminates the variables that do not significantly influence the outcome. This process can be accomplished by a variety of methods, all of which are computer-intensive. One method is backward elimination. In this method, a regression model is designed from all of the variables selected from the exploratory data analysis. Each variable will be assigned what is known as a partial F-statistic. The partial F-statistic gauges, with the sums of the squares of the error, the contribution by that variable to the outcome over and above the contribution of all the other variables in the equation. The closer to zero the partial F-statistic is, the smaller the contribution of that variable to the model. The variable with the lowest partial F-statistic is then removed from the equation, and all of the other variables remaining have their partial F-statistics recalculated. The designers define a critical partial F-statistic. Once this critical value is reached, no more variables will be removed, and the "best" model remains. An alternative approach is the forward selection procedure. In this method, one variable at a time is entered into the model. The variable with the highest correlation (r) to the outcome is added first, and a simple linear regression equation is constructed. The partial F-statistics for the remaining variables are calculated. The variable with the largest partial F-statistic is added next. Variables are added and the partial F-statistic is recalculated for the remaining variables until a variable that does not meet the criteria for entry into the model is encountered. Stepwise regression is a modification of the forward selection procedure. Variables are added to the model as in the usual forward selection procedure, but at each step the partial F-statistic is recalculated for every variable in the model as if it were the last variable entered. Variables that have been rendered inconsequential by the addition of new variables can then be reevaluated and possibly expelled from the model.66 Before the advent of newer high-speed computers and because of limits on computer time, intermediate steps were used to reduce the number of variables from the univariate analysis before they were applied to logistic multivariate analysis. The two most prominent intermediate steps are best subsets67 and linear discriminant function analysis.68 These complicated measures are mentioned because they have been used in some well-known models, including the SAPS and MPM. There are now software packages that have all-possible-regression procedures. In other words, all of the possible permutations of variables would be run and compared by the
computer, which would then select the most explanatory model. This will not only allow more combinations to be considered but also more variables to be analyzed. In light of this and a new generation of computers with improved computational power, these intermediate steps may no longer be required.66
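The following sketch illustrates the logic of forward selection on simulated data. It uses likelihood-based p-values (via the statsmodels package) in place of the partial F-statistic, which plays the equivalent role for logistic models; the 0.05 entry threshold is an assumption for illustration, not a value prescribed by the models discussed here:

    import numpy as np
    import statsmodels.api as sm

    # Simulated data: four candidate predictors, only two of which
    # (columns 0 and 2) actually influence the dichotomous outcome.
    rng = np.random.default_rng(2)
    n = 500
    X = rng.normal(size=(n, 4))
    true_logit = -1 + 1.2 * X[:, 0] + 0.8 * X[:, 2]
    died = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

    selected, remaining = [], [0, 1, 2, 3]
    while remaining:
        pvals = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            fit = sm.Logit(died, design).fit(disp=0)
            pvals[j] = fit.pvalues[-1]          # significance of the newly added term
        best = min(pvals, key=pvals.get)
        if pvals[best] > 0.05:                  # entry criterion not met: stop
            break
        selected.append(best)
        remaining.remove(best)
    print("selected predictors:", selected)     # typically [0, 2]

Backward elimination and stepwise regression differ only in direction and in whether already-entered variables are retested for removal at each step.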
Validation of the Instrument

Once a model has been designed and developed, it is necessary to test the model for validity. This is generally a twofold process. The first portion of the testing is done by the designers and involves using the instrument to evaluate a segment of the test population that was not used to develop the predictive model. This is known as a split sample technique. This is essential because testing the model on the population used to design the instrument could lead to a better result than would be expected from the general population. Once the internal validity of the instrument has been ascertained, the model is tested on other patient populations. It is generally accepted that the instrument will not perform as well as in the internal validation studies. This is thought to be due to differences in surveillance strategies, definition of parameters, and methodology.69 Regardless of whether the validation is being performed on part of the test population or on a separate patient population, there are some concepts that must be used. The performance of the model can be evaluated by several parameters, including explanatory power, discrimination, and calibration. Explanatory power is defined as the proportion of the predicted outcome that can be attributed to the model as opposed to variation. This is usually measured by the coefficient of determination (r^2). In multiple linear regression, the coefficient of determination is defined as the square of the correlation (r).70 The correlation is a value between −1 and 1 that describes how well a line fits the scatterplot of the data from which it was derived. A value of zero shows no correlation between the line and the data. As the r value approaches either −1 or 1, the line has increasing correlation with the data from which it is derived. The positive or negative sign simply indicates whether the line constructed from a plot of the independent and dependent variables has a positive or negative slope. If the r value for a particular model is 0.9, then the r^2, or coefficient of determination, for the model will be 0.81. In this example, 81% of the variation in the outcome can be explained by the predictors' variation.36 The remaining 19% of variation must be accounted for by sources outside of the model.71 This concept with some modification can also be applied to logistic regression models. Discrimination is generally thought of as the ability of the model to
FIG. 2. A standard 2 × 2 sensitivity/specificity table.

                      Actual outcome
                      Positive              Negative
Predicted outcome
  Positive            A (true positive)     B (false positive)
  Negative            C (false negative)    D (true negative)
predict whether a particular patient will live or die. This parameter is closely related to accuracy, also known as the total correct classification rate. To best understand this concept it is necessary to use the well-known 2 × 2 sensitivity and specificity table (Fig. 2). Sensitivity is defined as the proportion of those patients with the disease or outcome who were identified correctly by the test or model. With reference to Fig. 2, the sensitivity is given by a/(a + c). The sensitivity can also be given by the true positives divided by the total number of people with the disease or outcome in question. Specificity is the proportion of people who do not have the disease or outcome who are correctly identified by the test.72 The specificity is given by the true negatives divided by the total without the disease or outcome. In Fig. 2, the specificity is given by d/(b + d). Sensitivity and specificity are fairly common and well-understood concepts in the medical literature. There are peculiarities, however, especially in predictive models, that must be understood. Most predictive models will yield a probability of mortality. Yet when predicting mortality, life or death is the only possible answer. As a result, a cutoff point in the probability distribution must be selected. The cutoff point has also been termed the decision threshold. Where this cutoff is placed will affect the sensitivity and specificity of the test. This concept is illustrated in Figs. 3 and 4, which both show scattergrams and the corresponding 2 × 2 tables for a given ICU population. The scattergram in both figures is the same and illustrates an ICU population with its actual outcome across the top of the diagram and its predicted outcome, by a given model, down the side. To assess the sensitivity and specificity, all that is needed is to count the number of stars in each box of the scattergram and transpose them into the 2 × 2 table. As the mortality cutoff point is changed from 50% to 60%, the sensitivity is changed from 80% (20 of 25) to 60% (15 of 25), and the specificity
FIG. 3. Scattergram for sensitivity/specificity. The mortality cutoff is at the 50% probability mark with its corresponding 2 x 2 table. The sensitivity is 20/25 (80%) and the specificity is 40/50 (80%).
Because the sensitivity and specificity can be manipulated by adjusting the decision threshold, there is a need to evaluate the accuracy and discrimination of a model over a wide range of cutoff points. This evaluation is accomplished with the receiver-operating characteristic (ROC) curve, which plots the sensitivity (y-axis) against 1 - specificity (x-axis) (Fig. 5). Each point on the ROC curve is a sensitivity/specificity pair for a given cutoff point.73 Accuracy or discrimination is assessed by the area under the ROC curve. Accuracy no better than random guessing will yield an area under the ROC curve of 0.5, the situation in which the curve is a straight line arising from the origin at a 45-degree angle. Perfect discrimination is represented by an area under the ROC curve of 1.0.74
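The effect of moving the decision threshold can be reproduced in a few lines of Python. The probability lists below are hypothetical, constructed only so that the counts match the population of Figs. 3 and 4 (25 deaths, 50 survivors).

def sens_spec(p_deaths, p_survivors, cutoff):
    # Classify "predicted to die" when the model probability >= cutoff.
    tp = sum(p >= cutoff for p in p_deaths)     # deaths correctly predicted
    fn = len(p_deaths) - tp
    fp = sum(p >= cutoff for p in p_survivors)  # survivors predicted to die
    tn = len(p_survivors) - fp
    return tp / (tp + fn), tn / (fp + tn)

# Hypothetical predicted mortality probabilities mirroring Figs. 3 and 4.
deaths = [0.70] * 15 + [0.55] * 5 + [0.30] * 5
survivors = [0.70] * 5 + [0.55] * 5 + [0.30] * 40

for cutoff in (0.50, 0.60):
    sens, spec = sens_spec(deaths, survivors, cutoff)
    print(f"cutoff {cutoff:.0%}: sensitivity {sens:.0%}, specificity {spec:.0%}")
# cutoff 50%: sensitivity 80%, specificity 80%
# cutoff 60%: sensitivity 60%, specificity 90%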
FIG. 4. Scattergram for sensitivity/specificity. Notice how moving the mortality cutoff in the scattergram to the 60% probability mark changes the corresponding 2 x 2 table. The sensitivity is now 15/25 (60%) and the specificity is 45/50 (90%) for the same population as described in Fig. 3.
Thus, as the curve moves upward and to the left, the accuracy and the discriminatory power of the model improve.75,76 An area under the curve of 0.8 describes a situation in which a randomly selected individual from the group of patients who have died would have a predictive score higher than that of a randomly chosen patient from the survivor group 80% of the time. ROC curves can be used to assess a given model or to compare different models used in the same patient population.77 The ROC curve has benefits other than an easily visualized gauge of accuracy over the entire range of decision thresholds: it is not affected by the prevalence of the outcome, and no grouping of data is required.78 Another useful attribute of the ROC curve is that the likelihood ratio for a given threshold is the slope of the ROC curve at that threshold. The likelihood ratio is the true-positive fraction divided by the false-positive fraction and is a common epidemiologic parameter used to assess tests.23
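Because the area under the ROC curve equals the probability that a randomly chosen death outscores a randomly chosen survivor, it can be estimated directly by comparing all death/survivor pairs. The sketch below does exactly that, reusing the same hypothetical population as above; the function name is an invention.

def roc_area(p_deaths, p_survivors):
    # Probability that a randomly chosen death has a higher predicted
    # score than a randomly chosen survivor; ties count one half.
    wins = 0.0
    for d in p_deaths:
        for s in p_survivors:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(p_deaths) * len(p_survivors))

deaths = [0.70] * 15 + [0.55] * 5 + [0.30] * 5
survivors = [0.70] * 5 + [0.55] * 5 + [0.30] * 40
print(f"area under the ROC curve = {roc_area(deaths, survivors):.2f}")  # 0.82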
FIG. 5. ROC curve, plotting sensitivity against 1 - specificity. The closer a curve bows to the upper left-hand corner, the more discriminatory power a given model has. Model A has better discrimination than model B. An area under the ROC curve of 1 denotes perfect discrimination.
The other aspect of the validation process is calibration. Calibration involves the comparison of the true outcomes to the model-estimated outcomes over the entire range of risk.27,74 This aspect is best represented by a comparative bar histogram, or a plot of the observed mortality rate versus the predicted rate. A set of comparative histograms for two models tested in a given ICU population is shown in Fig. 6. This graph compares the actual mortality with the predicted mortality (y-axis) over the entire range of risk as dictated by deciles of risk (x-axis). In the part of the figure labeled good calibration, the predicted and actual mortality rates are similar. In the histogram labeled poor calibration, however, there is a great disparity between the predicted and observed rates of death, indicating poor calibration of the model in the tested population.

The other method of demonstrating calibration involves graphically plotting the actual mortality (y-axis) against the predicted mortality (x-axis) (Fig. 7). Similar to the comparative histogram, the predicted mortality is often grouped by 0.1 probability intervals, or deciles when using percentages. The predicted mortality is then plotted against the average observed mortality for that group. The perfectly calibrated model would yield a straight line at a 45-degree angle from the origin, and this line is used as a reference. The more a model deviates from the reference line, the poorer the calibration. In Fig. 7, the "perfect" line represents perfect calibration. Line A deviates from perfect more than line B, demonstrating worse calibration.
FIG. 6. Comparative histogram of predicted and actual mortality (y-axis) by decile of risk (x-axis), with panels showing good calibration and poor calibration. As opposed to the well-calibrated model, the poorly calibrated model shows discrepancies between predicted mortality and actual mortality in various deciles of risk.
FIG. 7. Predicted mortality (x-axis, upper deciles of risk) versus observed mortality (y-axis). The more poorly calibrated model (A) deviates from the 45-degree reference line more than the better calibrated model (B).
The statistical analysis of calibration is complicated and is still a matter of debate. Several methods all fall under the general heading of goodness-of-fit testing. These methods involve chi-square type analysis, with modifications, of the observed and predicted outcomes at given deciles of risk.78,79 The most commonly used goodness-of-fit test is the Hosmer-Lemeshow statistic. When the goodness-of-fit test is performed, a p value is generated. The object is to evaluate how closely the model's predictions match an actual patient population, not how different they are. Thus, a higher p value is desired and is indicative of a good fit. Values in the range of p = 0.2 to p = 0.8 are considered acceptable. Absolute values for the Hosmer-Lemeshow statistic of less than 15.5 are believed to reflect a good fit.
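A minimal sketch of the Hosmer-Lemeshow computation is shown below, assuming equal-sized risk groups; the simulated cohort is invented for illustration, and real implementations differ in how they form the groups and in the reference chi-square distribution.

import random

def hosmer_lemeshow(pairs, n_groups=10):
    # Sort (predicted risk, died 0/1) pairs into equal-sized risk groups
    # and accumulate (observed - expected)^2 / (n * p_mean * (1 - p_mean)).
    pairs = sorted(pairs)
    size = len(pairs) // n_groups
    h = 0.0
    for g in range(n_groups):
        lo = g * size
        hi = lo + size if g < n_groups - 1 else len(pairs)
        chunk = pairs[lo:hi]
        n = len(chunk)
        expected = sum(p for p, _ in chunk)   # sum of predicted risks
        observed = sum(d for _, d in chunk)   # actual deaths
        p_mean = expected / n
        h += (observed - expected) ** 2 / (n * p_mean * (1 - p_mean))
    return h

# A simulated, well-calibrated cohort of 500 patients.
random.seed(1)
risks = [random.uniform(0.05, 0.95) for _ in range(500)]
cohort = [(p, 1 if random.random() < p else 0) for p in risks]
print(f"H = {hosmer_lemeshow(cohort):.1f}")
# H is referred to a chi-square distribution (groups - 2 df); per the
# text, values below roughly 15.5 were taken to indicate a good fit.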
Evaluating and Updating the Model

Predictive instruments, like many other tools, can become outdated. Patient populations change, new therapies and technologies arrive, and new disease processes surface. All of these factors can contribute to a decrease in the calibration of a model for a given patient population. For this reason, the model must be checked periodically for validity and calibration in a given ICU population. This is accomplished in the same way that the validation and calibration were performed when the model was first introduced.
Marked changes in either one of these parameters indicate an influence within the workings of the ICU that should be investigated. The introduction of new disease processes, or those that were not included in the original analysis, must be evaluated in terms of the predictive power of the model. Patients with human immunodeficiency virus (HIV) were not recognized as a specific group in the APACHE II study. It is estimated, however, that 60,000 to 70,000 people will contract this disease yearly, and more than 1 million Americans are already infected.80 In the future, patients with HIV will comprise an increasing proportion of admissions to medical and surgical ICUs. The ICU mortality for patients with Pneumocystis carinii pneumonia has decreased from almost 90% in the early 1980s to approximately 55% today.81 This problem demonstrates two factors that can make a model invalid: the introduction of a disease process not in the sample population and a rapid change in ICU mortality over a relatively short time due to improved therapy. Indeed, studies are starting to emerge that question the validity of the APACHE II instrument in the HIV population,82,83 especially in patients with a total lymphocyte count less than 200.84

Malaria is one of the most serious health problems throughout the rest of the world. Yet, because of the demographics of the sample population, this disease was also not a part of the original APACHE II patient population. As a result, the instrument may have limited applicability in developing countries. One relatively small study from Thailand, applying APACHE II to patients with malaria, demonstrated an area under the ROC curve of greater than 0.9.85 Nonetheless, there should be further evaluation of predictive instruments applied to this particular disease.

Once a discrepancy in discrimination or calibration is uncovered, the cause must be identified. If the root of the problem lies in a new disease process or a change in therapy, the best course of action remains uncertain. Several possibilities are still being explored. The simplest solution is to exclude patients with that disease process from evaluation by the particular predictive instrument. Yet this hardly seems to be as much a solution as an avoidance of the issue. Another possibility would be to attempt to add these unusual patients to the model. Methods to retrofit a new or changing patient population into the original sample population of an existing model are scarce and still being evaluated. This approach is rarely taken, but it is an alternative that should be explored. Another prospect is to construct a new model with a sample population that includes the disease process in question. Drawbacks to this approach are obvious and include enormous expense and investment of time. It should also be considered that in a disease process such as HIV, new treatment options that increasingly prolong the life of those afflicted are being developed at a rapid pace.
There is the possibility that a model including this population could be outdated even before it was validated and applied to a test population.

Predictive instruments may also be found to have errors in design that remain unrecognized during the validation process. These errors are usually discovered by outside investigators who apply the model to their own ICU population. One noteworthy example was the lead-time bias detected in the APACHE I instrument. Lead-time bias reflects treatment a patient received before admission to the ICU. The authors subsequently corrected for this flaw by including a variable in the APACHE II model that accounted for this issue.
A Review of Different Systems

Acute Physiology and Chronic Health Evaluation (APACHE)

Acute Physiology and Chronic Health Evaluation I (APACHE I). The APACHE systems are probably the most well known and widely used predictive models available today. These methods were developed by Knaus and colleagues49 at George Washington University, and their initial study was published in 1981. The effort was funded by a grant from the Health Care Financing Administration. The impetus for the search for a uniform illness severity scoring system was the presentation of data in the late 1970s and early 1980s suggesting that patients receiving Medicare had large differences in hospital length of stay and cost per beneficiary in various geographic areas.86,87 There were also data from other populations to suggest that patients treated at high-volume centers had better outcomes.88,89 The ability to compare and evaluate these patient populations was hampered by the lack of a uniform system to stratify patients by their severity of illness.

The group at George Washington University sought to develop a stratification system that would yield a probability of mortality from a selection of data points commonly collected in the ICU. The premise of the model links the probability of mortality with the patient's physiologic state and preadmission health status. The physiology portion of the model is broken down according to the body's seven major physiologic systems: cardiovascular, respiratory, renal, gastrointestinal, hematologic, metabolic, and neurologic. Additional variables believed to be markers of sepsis were also included. The predictor variables to evaluate these systems were selected subjectively by seven critical care specialists. All of the variables were weighted for severity on a 0 to 4 scale (Table 2); some parameters were not believed to warrant a full 4-point weight (e.g., aspartate transaminase [SGOT]). There are a total of 34 variables. The value most removed from normal for a particular variable during the initial 32 hours is the one used. Values that are not recorded or obtained are assumed normal and given a value of zero.
TABLE 2. Acute Physiology and Chronic Health Evaluation I (APACHE I) - acute physiology score. Each variable is weighted from 0 (normal) to 4 (most abnormal). Variables: heart rate; mean BP; CVP; MI (CPK or EKG); EKG arrhythmias; lactate; pH; respiratory rate; P(A-a)O2 (100%) or PaCO2; urine output (ml/day); BUN; creatinine; amylase; albumin; bilirubin; alkaline phosphatase; SGOT; anergy (skin test); hematocrit; WBC; platelets; prothrombin time (sec > control); CSF-positive culture; blood-positive culture; fungal-positive culture; temperature (degrees C); calcium; glucose; sodium; potassium; HCO3; serum osmolarity; Glasgow Coma Score. (The point assignments for each variable's value ranges are given in the source publication.) BP, blood pressure; CVP, central venous pressure; MI, myocardial infarction; EKG, electrocardiogram; BUN, blood urea nitrogen; WBC, white blood cell; CSF, cerebrospinal fluid; PVC, premature ventricular contraction. (From Knaus WA, Zimmerman JE, Wagner DP, Draper EA, Lawrence DE. APACHE - Acute Physiology and Chronic Health Evaluation: a physiologically based classification system. Crit Care Med 1981;9:591-7.)
The Acute Physiologic Score (APS) is the sum of the weighted physiologic variables. The preadmission health status is verified by answers to questions about the patient's medical history (Table 3). The questions cover many chronic health problems, including chronic renal failure, cancer, and diabetes, among others. This nominal scale is divided into four categories from A to D, with group A having the best health and group D having the worst. The final APACHE score is a combination of the APS and the chronic health evaluation.
For example, 23-B could be the score for a patient with an APS of 23 and a history of diabetes mellitus.

The sample population from which the model was derived included 582 ICU admissions from the George Washington University Medical Center over an 8-month period. Excluded from evaluation were patients with burns, acute myocardial infarction, or an ICU stay of less than 16 hours. Both medical and surgical patients were included. Two research associates performed all of the data collection, with no significant interobserver variation.
TABLE 3. Acute Physiology and Chronic Health Evaluation I (APACHE I) - Chronic Health Evaluation

Category  Brief description and qualifying questions
D  Severe restriction of activity due to disease; includes persons bedridden or institutionalized because of illness. (Did the patient have weekly visits to the physician? Was the patient unable to work because of illness? Was the patient bedridden or institutionalized because of illness? Had the patient suffered a relapse after systemic treatment for carcinoma?)
C  Chronic disease producing serious but not incapacitating restrictions of activity. (Was the patient's usual daily activity limited? Did symptoms occur with mild exertion? Had the patient received treatment for neoplasm with remission or uncomplicated hemodialysis?)
B  Mild to moderate limitation of activity because of a chronic medical problem. (Did the patient see a physician monthly? Did he take medication chronically? Was he mildly limited in his activity level because of illness? Had the patient had diabetes mellitus, chronic renal failure, a bleeding disorder, or chronic anemia?)
A  Previous good health; no functional limitations (negative response to the above questions).

(From Knaus WA, Zimmerman JE, Wagner DP, Draper EA, Lawrence DE. APACHE - Acute Physiology and Chronic Health Evaluation: a physiologically based classification system. Crit Care Med 1981;9:591-7.)
The APACHE I model was validated by comparing the predictive score with the amount of therapy received, as dictated by the TISS, and with short-term mortality rates (alive or dead at discharge). The model was also applied to 223 patients from the ICU of a local community hospital. In both the university and community hospitals, the APS appeared to correlate with the TISS (r = 0.59, p < 0.01 and r = 0.47, p < 0.01, respectively). With regard to the chronic health evaluation, there appeared to be a significant difference between groups A and B compared with groups C and D in terms of the survival rate. The chronic health evaluation's predictive power was found to be improved by the addition of the physiologic score, but the addition of the chronic health evaluation to the APS did not improve its predictive powers.
These validation techniques are crude but are nonetheless forerunners of the methods used for later models. With a decision threshold for predicting death at 50%, the sensitivity of the model was found to be 97%, and the specificity was 49%. The total misclassification rate was 11%. No split sample validation, ROC analysis, or goodness-of-fit testing was performed. The original APACHE was validated in several other patient populations but rapidly gave way to the newer APACHE II system and is rarely used today.
Acute Physiology and Chronic Health Evaluation II (APACHE II). This model was constructed by the same group at George Washington University and was a response to several flaws found in the first model. The APACHE II model was published in 1985.90 One of the major changes in the newer APACHE II was the pruning of the cumbersome APS from 34 to 12 physiologic variables, accomplished with a combination of clinical judgment by the authors and multivariate analysis. Variables were singularly eliminated, and the r^2 of the new model was compared with that of the old model. Essentially, the 22 variables dropped from the APS did not enhance the original instrument's performance: there was no significant reduction in the new model's explanatory power compared with APACHE I unless one of the 12 physiologic variables found to be essential to APACHE II was omitted. Although some of the variable weighting classifications were changed, the original weighting system of 0 to 4 still applied (Table 4). The value selected for a specific variable is the worst score for the initial 24 hours after admission. The Glasgow coma score is subtracted from 15 and the result added directly into the score.

Another noteworthy feature of this particular model was the cumulative APACHE II score, which included the APS, age, and a chronic health score. Age was believed to affect mortality regardless of physiologic scoring or previous health status and was thus incorporated into the APACHE II as a separate variable. The chronic health portion of the original instrument was consolidated, better defined, and also added to the APACHE II score. Five chronic health points were given to patients who met two criteria: (1) a medical history that included one of several well-defined chronic health problems and (2) admission to a medicine service or an emergency operative procedure. Two points were given to patients with one of the chronic health problems who were operated on electively. The APACHE II score can range from 0 to 71. The APACHE II instrument also accounted for the specific disease process that led to the ICU admission by assigning weights to specific diagnostic categories (Table 5). There were 42 specific disease designations that were believed to represent the bulk of patients admitted to most ICUs.
TABLE 4. Acute Physiology and Chronic Health Evaluation II (APACHE II) score

[A] Acute Physiology Score (APS): the sum of the points for the 12 variables below, using the worst value in the first 24 hours.
Temperature (degrees C): >=41 or <=29.9, +4; 39-40.9 or 30-31.9, +3; 32-33.9, +2; 38.5-38.9 or 34-35.9, +1; 36-38.4, 0
Mean arterial pressure (mm Hg): >=160 or <=49, +4; 130-159, +3; 110-129 or 50-69, +2; 70-109, 0
Heart rate: >=180 or <=39, +4; 140-179 or 40-54, +3; 110-139 or 55-69, +2; 70-109, 0
Respiratory rate (ventilated or nonventilated): >=50 or <=5, +4; 35-49, +3; 6-9, +2; 25-34 or 10-11, +1; 12-24, 0
Oxygenation: if FIO2 >=0.5, record A-aDO2: >=500, +4; 350-499, +3; 200-349, +2; <200, 0. If FIO2 <0.5, record PaO2 (mm Hg): <55, +4; 55-60, +3; 61-70, +1; >70, 0
Arterial pH: >=7.7 or <7.15, +4; 7.6-7.69 or 7.15-7.24, +3; 7.25-7.32, +2; 7.5-7.59, +1; 7.33-7.49, 0
Serum sodium (mMol/L): >=180 or <=110, +4; 160-179 or 111-119, +3; 155-159 or 120-129, +2; 150-154, +1; 130-149, 0
Serum potassium (mMol/L): >=7 or <2.5, +4; 6-6.9, +3; 2.5-2.9, +2; 5.5-5.9 or 3-3.4, +1; 3.5-5.4, 0
Serum creatinine (mg/100 ml; double the points for acute renal failure): >=3.5, +4; 2-3.4, +3; 1.5-1.9 or <0.6, +2; 0.6-1.4, 0
Hematocrit (%): >=60 or <20, +4; 50-59.9 or 20-29.9, +2; 46-49.9, +1; 30-45.9, 0
White blood cell count (/mm3, in 1000s): >=40 or <1, +4; 20-39.9 or 1-2.9, +2; 15-19.9, +1; 3-14.9, 0
Glasgow Coma Scale: score = 15 - GCS
Serum HCO3 (mMol/L; use only if no arterial blood gas): >=52 or <15, +4; 41-51.9 or 15-17.9, +3; 18-21.9, +2; 32-40.9, +1; 22-31.9, 0

[B] Age points: <=44, 0; 45-54, 2; 55-64, 3; 65-74, 5; >=75, 6.

[C] Chronic health points: if the patient has a history of severe organ system insufficiency or is immunocompromised, assign 5 points for nonoperative or emergency postoperative patients, or 2 points for elective postoperative patients. Organ insufficiency or an immunocompromised state must have been evident before hospital admission and conform to the following definitions. Liver: biopsy-proven cirrhosis and documented portal hypertension; episodes of past upper gastrointestinal bleeding attributed to portal hypertension; or prior episodes of hepatic failure/encephalopathy/coma. Cardiovascular: New York Heart Association class IV. Respiratory: chronic restrictive, obstructive, or vascular disease resulting in severe exercise restriction (i.e., unable to climb stairs or perform household duties); or documented chronic hypoxia, hypercapnia, secondary polycythemia, severe pulmonary hypertension (>40 mm Hg), or respirator dependency. Renal: receiving chronic dialysis. Immunocompromised: the patient has received therapy that suppresses resistance to infection (e.g., immunosuppression, chemotherapy, radiation, long-term or recent steroids) or has a disease that is sufficiently advanced to suppress resistance to infection (e.g., leukemia, lymphoma, AIDS).

APACHE II score = [A] + [B] + [C].

The logit for the APACHE II predictive model is:

y = -3.517 + 0.146(APACHE II score) + 0.603(emergency surgery) + (diagnostic category weight). (Equation 9)
The logit can then be placed into the probability equation, p = e^y/(1 + e^y), which yields the probability of death. This expression can also be written as:

ln[p/(1 - p)] = -3.517 + 0.146(APACHE II) + 0.603(S) + D, (Equation 10)

where p is the probability of dying, APACHE II is the APACHE II score, S is the weight for emergency surgery (1 for surgery, 0 for no surgery), and D is the weight of the particular disease category.91
TABLE 5. APACHE II diagnostic categories

Nonoperative patients
Respiratory failure or insufficiency from: asthma/allergy, -2.108; COPD, -0.367; pulmonary edema (noncardiogenic), -0.251; postrespiratory arrest, -0.168; aspiration/poisoning/toxic, -0.142; pulmonary embolus, -0.128; infection, 0; neoplasm, 0.891
Cardiovascular insufficiency from: hypertension, -1.798; rhythm disturbance, -1.368; congestive heart failure, -0.424; hemorrhagic shock/hypovolemia, 0.493; coronary artery disease, -0.191; sepsis, 0.113; postcardiac arrest, 0.393; cardiogenic shock, -0.259; dissecting thoracic/abdominal aneurysm, 0.731
Trauma: multiple trauma, -1.228; head trauma, -0.517
Neurologic: seizure disorder, -0.584; ICH/SDH/SAH, 0.723
Other: drug overdose, -3.353; diabetic ketoacidosis, -1.507; GI bleeding, 0.334
If none of the above, which organ system: metabolic/renal, -0.885; respiratory, -0.890; neurologic, -0.759; cardiovascular, 0.470; gastrointestinal, 0.501

Postoperative patients
Multiple trauma, -1.684; chronic cardiovascular disease, -1.376; peripheral vascular surgery, -1.315; heart valve surgery, -1.261; craniotomy for neoplasm, -1.245; renal surgery for neoplasm, -1.204; renal transplant, -1.042; head trauma, -0.995; thoracic surgery for neoplasm, -0.802; craniotomy for ICH/SDH/SAH, -0.788; spinal cord surgery, -0.699; hemorrhagic shock, -0.682; GI bleeding, -0.617; GI surgery for neoplasm, -0.248; respiratory insufficiency during surgery, -0.140; GI perforation/obstruction, 0.060
For postoperative patients admitted to the ICU for sepsis or postarrest, use the corresponding weights for nonoperative patients. If not one of the above, which major organ system led to postoperative ICU admission: neurologic, -1.150; cardiovascular, -0.797; respiratory, -0.610; gastrointestinal, -0.613; metabolic/renal, -0.196

COPD, chronic obstructive pulmonary disease; ICH, intracerebral hemorrhage; SDH, subdural hematoma; SAH, subarachnoid hemorrhage; GI, gastrointestinal. (From Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med 1985;13:818-29.)
For example, suppose a 55-year-old man is admitted to the ICU with the diagnosis of pulmonary embolism. He is a patient with end-stage renal disease who is receiving chronic dialysis. He is normothermic, has a mean arterial pressure of 67, a heart rate of 120, and a respiratory rate of 38. He is receiving no supplemental oxygen and has a PaO2 of 58 with a pH of 7.57. His serum creatinine is 7.3. The remainder of his laboratory values
are within normal limits. In trying to assess his risk, it is necessary first to calculate his APACHE II score by consulting the score sheet. For the APS, this patient would receive 2 points for mean arterial blood pressure, 2 points for heart rate, 3 points for respiratory rate, 3 points for oxygenation, 1 point for arterial pH, and 4 points for serum creatinine. The remainder of his parameters are normal and would be weighted as zero. Therefore, this patient's APS would be 15. Adding 3 points for his age and 5 points for his chronic health points, the patient's APACHE II score would be 23. The coefficient for pulmonary embolism is -0.128, and the patient will not be undergoing emergency surgery. Putting these values into the predictive equation yields ln[p/(1 - p)] = -3.517 + 0.146(23) + 0.603(0) + (-0.128), which reduces to ln[p/(1 - p)] = -0.287. Solving for the probability (p) by taking the inverse natural logarithm of both sides of the equation yields p = e^(-0.287)/(1 + e^(-0.287)) = 0.43. Thus the probability of mortality for this particular patient during this intensive care admission is 43%.
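The worked example translates directly into code. The function below simply evaluates Equation 10; the function name and arguments are illustrative inventions.

import math

def apache2_mortality(score, emergency_surgery, diagnostic_weight):
    # Equation 10: ln(p/(1 - p)) = -3.517 + 0.146*score + 0.603*S + D
    y = -3.517 + 0.146 * score + 0.603 * (1 if emergency_surgery else 0) \
        + diagnostic_weight
    return math.exp(y) / (1 + math.exp(y))

# The patient above: APACHE II score 23, no emergency surgery,
# nonoperative pulmonary embolus weight -0.128.
print(f"{apache2_mortality(23, False, -0.128):.2f}")  # 0.43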
The hospital population used to test the instrument consisted of 5815 consecutive ICU admissions at 13 different hospitals and included both surgical and medical admissions. Patients undergoing coronary artery bypass grafting were excluded from analysis because their physiologic derangement was believed to be markedly different from that of the average patient in the ICU; this group had relatively high APACHE II scores on admission but very low death rates. The hospitals were chosen because of their willingness to participate. Interobserver variability showed a 96% agreement for all physiologic data collected. The model was then applied to the sample population with a decision threshold at a predicted mortality of 50% or greater. The overall correct classification rate was 85.5%. The area under the ROC curve was 0.863, and the r^2 value was 0.319. No goodness-of-fit testing was performed.

The APACHE II predictive instrument has probably been applied to more distinct patient populations than any other model. The model has been applied to populations throughout the world.92-95 It has also been applied to many diverse populations,96 including that of a cardiothoracic ICU,97 and has been shown to perform consistently. The model has also been validated in a cardiac care/acute myocardial infarction patient population, where the area under the ROC curve was 0.83 and the total correct classification rate was 87%.98 Both of these populations were originally excluded in the construction of the instrument. When the APACHE II model is applied to various ICU populations, the area under the ROC curve is generally between 0.83 and 0.89. There have been some reports, however, that indicate that the model does not perform well in certain subpopulations of the ICU.99
Acute Physiology and Chronic Health Evaluation III (APACHE III). The APACHE III system is the most recent modification by the group at George Washington University to their well-known illness severity scoring system. This project was a combined effort that also included the Northwestern University School of Management, the Johns Hopkins University School of Public Health, and APACHE Medical Systems, a private organization. The study had two phases. The first phase involved the collection of physiologic data, diagnostic categories, chronic health information, and demographics, which would then be analyzed in terms of outcome. The second phase involved trying to relate variations in practice to eventual outcome.100

Improvements in all aspects of the APACHE system were undertaken. The 12 physiologic variables of APACHE II were reevaluated, and six additional variables were examined to see whether they improved the new model's predictive powers. The way the previous variables were weighted and the times at which they were collected were also evaluated. More explicit grading of the chronic health evaluation and assessment of activities of daily living were also analyzed, and it was also a goal to see whether the diagnostic and operative-type categories could be expanded on.101
FIG. 8. APACHE III - Acute Physiology Score. BP, Blood pressure; Hct, hematocrit; WBC, white blood cell; RF, renal failure; ARF, acute renal failure; BUN, blood urea nitrogen. (From Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991 ;100:1619-36.)
The work initially describing APACHE III was published in 1991.102 The APACHE III population database comprised data collected from 17,744 patients at 40 hospitals throughout the United States; these included 14 tertiary care centers and 26 other hospitals picked randomly and thought to represent the spectrum of ICU services available. Data were collected over the course of 18 months in 1988 and 1989. Originally, 20 candidate physiologic variables were considered; however, only 17, including a modification of the Glasgow Coma Scale, were found to contribute significantly to outcome prediction by statistical modeling techniques. The weighting of these variables was also altered from previous models to increase the explanatory power of APACHE III (Fig. 8). Missing values were assigned a weight of zero.

The chronic health and age portion of the APACHE III score was reevaluated and resulted in a scale quite different from that of APACHE II. New weights were assigned to the age section. Chronic health was analyzed, and only seven conditions were found to have a significant impact: AIDS, hepatic failure, lymphoma, metastatic cancer, leukemia/multiple myeloma, immunosuppression, and cirrhosis (Table 6). If more than one chronic health problem is present, the one with the highest score is used. These conditions were also found to be significant predictors only during emergency surgery or nonoperative admissions.
TABLE 6. APACHE III age and chronic health evaluation

Age (yr): <=44, 0 points; 45-59, 5; 60-64, 11; 65-69, 13; 70-74, 16; 75-84, 17; >85, 24
Comorbid condition*: AIDS, 23 points; hepatic failure, 16; lymphoma, 13; metastatic cancer, 11; leukemia/multiple myeloma, 10; immunosuppression, 10; cirrhosis, 4

*Excluded for patients undergoing elective surgery. AIDS, acquired immunodeficiency syndrome. (From Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991;100:1619-36.)
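As one illustration, the Table 6 lookups can be encoded directly; the function and dictionary names are inventions, and boundary ages follow the bracket limits as printed.

import bisect

# Upper age limits of the Table 6 brackets and their points.
AGE_LIMITS = [44, 59, 64, 69, 74, 84]
AGE_POINTS = [0, 5, 11, 13, 16, 17, 24]
COMORBID_POINTS = {"AIDS": 23, "hepatic failure": 16, "lymphoma": 13,
                   "metastatic cancer": 11, "leukemia/multiple myeloma": 10,
                   "immunosuppression": 10, "cirrhosis": 4}

def age_chronic_points(age, comorbidities, elective_surgery=False):
    # Age points plus the single highest-scoring comorbid condition;
    # comorbidity points are excluded for elective surgical admissions.
    points = AGE_POINTS[bisect.bisect_left(AGE_LIMITS, age)]
    if comorbidities and not elective_surgery:
        points += max(COMORBID_POINTS[c] for c in comorbidities)
    return points

print(age_chronic_points(72, ["cirrhosis", "lymphoma"]))  # 16 + 13 = 29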
It was found that the worst variables collected within the first 24 hours of admission were the most explanatory. The final APACHE III score can vary between 0 and 300. Of 212 major disease categories explored, 78 were included in the final model. All of these admission disease categories were analyzed individually by logistic regression analysis, and an odds ratio for each was calculated; this ratio indicates the relative risk of dying when admitted to the ICU with a particular diagnosis. There are also features built into this model that account for where a patient was before being admitted to the ICU, as well as a method to analyze the APACHE III score as it changes over time. There is a logistic regression equation for each diagnostic category and several that account for a patient's location before the ICU admission.

APACHE III was validated with a split sample technique applied to the original patient population. The total r^2 for the model was 0.41. The area under the ROC curve was 0.90, and the total correct classification rate was 88.2% at a decision threshold of predicted risk = 50%. When the scoring system was evaluated for its performance over time, it was found that initial and latest-day scores yielded the best explanatory power. No goodness-of-fit testing was performed.

The APACHE III system has not been widely accepted or used by most ICUs in this country. This is due to a variety of factors, most notably that the APACHE III logistic regression coefficients and equations were proprietary information and not available for public scrutiny.
This inability for independent verification and the "black box" characteristics of the instrument have given rise to doubts by some investigators.103 The constructors of the model maintain that the logistic regression equations with all coefficients have been made available on request.104 Another possible reason for the slow acceptance of APACHE III may be that in some studies it does not perform much better than APACHE II,105 and the cost to participate in the APACHE III system is quite high.
Simplified Acute Physiology Score (SAPS)

Simplified Acute Physiology Score I (SAPS I). The SAPS was developed by LeGall and colleagues40 in France and was published in 1984. Their efforts were undertaken because it was believed that portions of the original APACHE were too complex. It was noted that values that were not available, possibly because of the complexities of data collection, were automatically assigned a value of zero and that this could introduce bias in the results. The authors wanted to demonstrate that illness severity could be gauged by a few variables that are collected commonly in all patients in the ICU.

The study population was composed of 679 consecutive admissions to eight different ICUs in France over an unspecified time period. Both surgical and medical admissions were included. Thirteen routinely collected physiologic parameters and the patient's age were collected for analysis. The worst value for the initial 24 hours in the ICU was used. All of the variables in the model were selected subjectively by the authors, and the variables were weighted subjectively on a 0 to 4 scale similar to that of the APACHE I system. The simplicity of the system allows the necessary data to be collected by an ICU nurse in less than 1 minute, and there are significantly fewer omissions than with the APACHE I model. The relationship of the variables to outcome was established not by logistic regression analysis but rather by simple observation.

The model was initially validated on the original design population. The area under the ROC curve was similar to that of the APACHE I model (~0.85). No goodness-of-fit testing was performed. The sensitivity and specificity of the model were also compared with those of APACHE I. Instead of using an arbitrary cutoff point, the point giving the best Youden index106 was determined for each model. This technique allows determination of the decision threshold that yields the maximum true-positives with a minimum of false-positives. The sensitivity and specificity for the APACHE I at the appropriate Youden index were 0.56 and 0.82, respectively, compared with 0.69 and 0.69 for the SAPS. However, it was pointed out that in the 126 patients in
whom the two systems differed in their predictions, the SAPS predicted 81 correctly, whereas APACHE I predicted 45 correctly. It should be noted that this method of comparing instruments is unusual. This model is used more frequently in Europe than in the United States.

Simplified Acute Physiology Score II (SAPS II). The update of the original SAPS was carried out by LeGall and colleagues107 and was first published in 1993. This model has a much larger patient base and more sophisticated construction and validation than its predecessor. The sample population includes 13,152 consecutive admissions to 137 medical and surgical ICUs in North America and Europe. The data were collected over an 18-month period during 1991 and 1992. Data collection appeared to be consistent, with interclass correlation ranging from 0.81 to 0.95 and a kappa statistic ranging from 0.67 to 1.0. Exclusion criteria eliminated all pediatric patients, patients with burns, patients undergoing coronary care, and patients undergoing cardiac surgery.

The variables for this model and their respective weights were determined objectively with multiple logistic regression techniques. The original pool of 37 predictors was narrowed to 17 variables with univariate analysis and then logistic regression analysis. Twelve of these variables were physiologic, two were demographic, and three were related to chronic disease. The weighting of the variables ranged from 0 to 3 up to as high as 0 to 26. The developmental set consisted of 8369 patients (65%) from the sample population. Applying multiple logistic regression analysis to the development set, the following logit equation was retrieved:

y = -7.7631 + 0.0737(SAPS II score) + 0.9971(ln[SAPS II score + 1]). (Equation 11)

The application of this model is similar to that of APACHE II, whereby a score is first calculated and then put into an equation to yield the logit (y). The logit is then substituted into the logistic transformation e^y/(1 + e^y) to yield a mortality probability. The logarithmic term (ln[SAPS II score + 1]) in the logit is known as a shrinking power transformation.108 This is a provision to account for the distribution of the SAPS II score, which is highly skewed.

Validation of the model was performed on the developmental set (65% of the total sample) and the validation set (35% of the sample population). The area under the ROC curve was 0.88 in the developmental set and 0.86 in the validation population. This was considerably better than SAPS I, which had an area under the ROC curve of 0.80 when applied to this population. The correlation coefficient (r) between the old and new SAPS scores was 0.79; thus r^2 was 0.62, indicating that 62% of the variability in SAPS II could be explained by SAPS I. Goodness-of-fit testing was performed on both sets and yielded a p value of 0.883 for the developmental set and 0.104 for the validation set.
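Equation 11 likewise reduces to a few lines of code. Only the coefficients come from the published model; the example score of 40 and the function name are hypothetical.

import math

def saps2_mortality(score):
    # Equation 11: y = -7.7631 + 0.0737*score + 0.9971*ln(score + 1)
    y = -7.7631 + 0.0737 * score + 0.9971 * math.log(score + 1)
    return math.exp(y) / (1 + math.exp(y))

print(f"{saps2_mortality(40):.2f}")  # a score of 40 -> about 0.25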
TABLE 7. Mortality Prediction Model (MPM I)

MPM I at admission (MPM0): variable, beta (odds ratio)
Constant, -3.000
Coma or deep stupor, 2.630 (13.87)
Emergency admission, 1.630 (5.10)
Cancer, 1.490 (4.44)
Infection, 0.677 (1.97)
No. of organ systems failed, 0.595 (one-system odds ratio 1.81)
Age, 0.038 (10-year odds ratio 1.46)
SBP, -0.048
SBP squared, 0.0001

MPM I at 24 hours (MPM24): variable, beta (odds ratio)
Constant, -5.930
Coma or stupor, 4.530 (92.76)
Infection, 1.310 (3.71)
Intubated or FIO2 >50%, 1.170 (3.22)
Shock, 0.998 (2.71)
Emergency admission, 0.928 (2.53)
Age, 0.038 (10-year odds ratio 1.46)
No. of organ system failures at admission, 0.336 (one-system odds ratio 1.40)

SBP, systolic blood pressure. (From Lemeshow S, Teres D, Pastides H, Spitz-Avrunin J, Steingrub JS. A method for predicting survival and mortality of ICU patients using objectively derived weights. Crit Care Med 1985;13:519-25.)
Mortality Prediction Model (MPM)

Mortality Prediction Model I (MPM I). The MPM was developed by Lemeshow and colleagues41 at the Baystate Medical Center in Springfield, Massachusetts. The original report describing this method appeared in 1985. The study population included 755 consecutive admissions to the medical and surgical ICUs over a 6-month period in 1983. All patients undergoing coronary care, patients undergoing cardiac surgery, patients with burns, and patients younger than 14 years of age were excluded from the analysis.

The MPM I was the first model to use predictor variables derived objectively. Data were collected by trained ICU nurses on 137 possible predictor variables at five different times during the patient's hospital stay. Three different classes of predictor variables were collected: demographic, condition, and treatment parameters. Interrater reliability was
believed to be high, with a kappa statistic ranging from 0.7 to 1.0, depending on the variable. This enormous set of variables was first reduced by testing the association with mortality by univariate analysis; a chi-square test was used for categoric predictors, and Student's t test was used for all predictors with a continuous distribution. The threshold p value for consideration for inclusion in the multiple logistic regression analysis was not provided in the original publication. These maneuvers yielded 26 possible admission variables and 44 possible 24-hour variables. These variables were then pared down by application of a forward stepwise multiple logistic regression model, which allowed each variable to be considered sequentially with other potentially significant variables. This manipulation produced a model in which seven admission variables and seven 24-hour variables were found to be significant. Once the significant variables were identified, their coefficients were calculated by maximum likelihood equations, and odds ratios (or relative risks) were determined for each variable (Table 7).

The MPM I was first internally validated with a split sample technique. With 50% used for the mortality prediction cutoff, the total correct classification rate was 87% at admission and 85% at 24 hours. The area under the ROC curve was approximately 0.85. Goodness-of-fit testing was performed with deciles of risk and was found to have a high degree of fit, with p = 0.38 and p = 0.56 for admission and 24-hour predictions, respectively. It has also been demonstrated that the model's predictive accuracy can be increased by applying it in a serial fashion over time;109 these serial applications improved the goodness-of-fit to p = 0.8. The MPM has been validated in other patient populations110 and compared with other systems on the same patient cohort.111 The MPM was compared with APACHE II in an ICU population of 322 patients. The two models were found to be highly correlated, with r^2 = 0.51 and p = 0.0001. Both models had areas under the ROC curve of approximately 0.86. The APACHE II had much better goodness-of-fit testing than the MPM (p > 0.5 vs p < 0.025). The MPM consistently overestimated expected deaths and underestimated survivors in all deciles of risk when compared with APACHE II.112

Mortality Probability Model II (MPM II). The revision of the MPM system was completed and published in 1993.113 The database used to develop this model was very diverse: a total of 19,442 patients from 139 ICUs in 12 different countries were available for analysis. Consecutive medical and surgical ICU admissions at the various hospitals were examined, with the exception of patients with burns, patients undergoing cardiac surgery, and patients with an ailment requiring a coronary care unit. No patients younger than 18 years of age were included in the analysis. The hospitals included were a variety of teaching and community hospitals.
TABLE 8. Mortality Probability Model (MPM II)

At admission (MPM0): variable, beta (odds ratio)
Constant, -5.468
Coma or deep stupor, 1.486 (4.4)
Heart rate >=150, 0.456 (1.6)
Systolic BP <=90 mm Hg, 1.061 (2.9)
Chronic renal insufficiency, 0.919 (2.5)
Cirrhosis, 1.137 (3.1)
Metastatic neoplasm, 1.200 (3.3)
Acute renal failure, 1.482 (4.4)
Cardiac dysrhythmia, 0.281 (1.3)
Cerebrovascular accident, 0.213 (1.2)
Gastrointestinal bleeding, 0.397 (1.5)
Intracranial mass effect, 0.865 (2.4)
Age, 0.031 (10-year odds ratio 1.4)
CPR before admission, 0.570 (1.8)
Mechanical ventilation, 0.791 (2.2)
Nonelective surgery, 1.191 (3.3)

At 24 hours (MPM24): variable, beta (odds ratio)
Constant, -5.646
Age, 0.033 (10-year odds ratio 1.4)
Cirrhosis, 1.087 (3.0)
Intracranial mass effect, 0.913 (2.5)
Medical or emergency surgical admission, 0.834 (2.3)
Coma or deep stupor at 24 hours, 1.688 (5.4)
Creatinine >2.0 mg/dl, 0.723 (2.1)
Confirmed infection, 0.497 (1.6)
Mechanical ventilation, 0.808 (2.2)
PaO2 <60 mm Hg, 0.467 (1.6)
Prothrombin time >3 sec above standard, 0.554 (1.7)
Urine output <150 ml in 8 hours, 0.823 (2.3)
Vasoactive drugs >1 hour, 0.716 (2.0)

BP, blood pressure; CPR, cardiopulmonary resuscitation. (From Lemeshow S, Teres D, Klar J, Spitz-Avrunin J, Gehlbach SH, Rapoport J. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA 1993;270:2478-86.)
The data were collected during an 18-month period in 1989 and 1990. A split sample technique was used: the developmental data set was made up of 12,610 patients, the remainder being reserved for validation purposes. As in the original MPM, a large set of predictors was pruned with univariate analysis; chi-square analysis was used for categoric variables, and Student's t test was used for continuous variables. For initial inclusion in the model, a variable had to have a p value of no more than 0.1 and be present in at least 2% of the patient population. Coefficients were determined with maximum likelihood techniques.
TABLE 9. Example of calculation of MPM II*

Variable: beta x Xi = product
Constant: -5.468 x 1 = -5.468
Coma or deep stupor: 1.486 x 0 = 0
Heart rate >=150: 0.456 x 1 = 0.456
Systolic BP <=90 mm Hg: 1.061 x 1 = 1.061
Chronic renal insufficiency: 0.919 x 1 = 0.919
Cirrhosis: 1.137 x 1 = 1.137
Metastatic neoplasm: 1.200 x 0 = 0
Acute renal failure: 1.482 x 0 = 0
Cardiac dysrhythmia: 0.281 x 0 = 0
Cerebrovascular accident: 0.213 x 0 = 0
Gastrointestinal bleeding: 0.397 x 1 = 0.397
Intracranial mass effect: 0.865 x 0 = 0
Age: 0.031 x 60 = 1.860
CPR before admission: 0.570 x 0 = 0
Mechanical ventilation: 0.791 x 0 = 0
Medical or nonelective surgery: 1.191 x 1 = 1.191
Total (logit): 1.553

BP, blood pressure; CPR, cardiopulmonary resuscitation.
*The admission calculation for a 60-year-old man with chronic renal failure and cirrhosis presenting with tachycardia, hypotension, and gastrointestinal bleeding.
Variables were assessed one at a time in a backward stepwise regression and were eliminated if they did not add to the predictive power of the model. Odds ratios for each variable could then be calculated by taking the antilog of the respective coefficient. The final model contained 15 variables, all of which, except age, were dichotomous (Table 8). The predictors were a variety of physiologic parameters, chronic and acute disease states, and treatment variables. The patients were assessed at ICU admission and at 24 hours; the 24-hour model contained eight additional variables that require collection. The model was validated on the portion of the population not used to create it. The admission model had an area under the ROC curve of 0.82 and a goodness-of-fit p value of 0.33. The 24-hour model had an area under the ROC curve of 0.84 and a goodness-of-fit p value of 0.231.

An example of using the MPM II is provided in Table 9. The patient is a 60-year-old man with a medical history that includes chronic renal failure and cirrhosis. At admission, the patient is found to have hypotension and tachycardia. After esophagogastroduodenoscopy is performed, he is found to have a bleeding duodenal ulcer, which is controlled. The variables that were pertinent to this patient receive a value of 1 in the Xi column and are multiplied by the corresponding coefficient, beta-i; the age is simply multiplied by its coefficient. The constant and the products mentioned previously are added to yield the logit, which in this case
is 1.553. The logit is then inserted into the probability equation, probability = e^y/(1 + e^y), to yield 0.83. This patient has, per the MPM II model, an approximately 83% chance of dying during this ICU admission.
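The Table 9 calculation can be checked mechanically. Only the subset of Table 8 admission coefficients needed for this patient is included, and the variable names are invented labels.

import math

# The subset of Table 8 admission coefficients needed for this patient.
BETA = {"constant": -5.468, "age_per_year": 0.031,
        "heart_rate_ge_150": 0.456, "sbp_le_90_mm_hg": 1.061,
        "chronic_renal_insufficiency": 0.919, "cirrhosis": 1.137,
        "gi_bleeding": 0.397, "medical_or_nonelective_surgery": 1.191}

def mpm2_admission(age, risk_factors):
    # Logit = constant + 0.031*age + coefficients of the factors present.
    y = BETA["constant"] + BETA["age_per_year"] * age
    y += sum(BETA[f] for f in risk_factors)
    return math.exp(y) / (1 + math.exp(y))

p = mpm2_admission(60, ["heart_rate_ge_150", "sbp_le_90_mm_hg",
                        "chronic_renal_insufficiency", "cirrhosis",
                        "gi_bleeding", "medical_or_nonelective_surgery"])
print(f"{p:.2f}")  # logit 1.553 -> predicted mortality 0.83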
Other Predictive Instruments

A multitude of other predictive models have been constructed for a variety of situations. Some are designed for the general ICU population.114 Many, however, are designed from and intended to be applied to specific patient populations, ranging from patients with coma115 to patients with sepsis116 and multisystem organ failure.117 The different subpopulations to which modeling has been and can be applied are virtually endless.

There have been some endeavors in the surgical community to develop models to be applied to patients undergoing surgical procedures. Most of these models have been constructed for the evaluation of the patient with trauma. As with the ICU models, these methods have sought to facilitate grouping of patients with similar mortality rates despite widely varied injuries and physiologic states.118

The first noteworthy model is the Trauma and Injury Severity Score (TRISS) method. This model was first introduced by Champion and colleagues119 in 1984. The hypothesis was that the outcome for injured patients could be predicted by assessing physiologic derangement, anatomic disturbance from trauma, and age. This evolving scoring system was derived from the database of the Major Trauma Outcome Study (MTOS). This project was coordinated through the American College of Surgeons and has tabulated the data from 160,000 patients with trauma from more than 150 level I and level II trauma centers. This model uses the trauma score, age, and injury severity score in a multiple logistic regression format to predict survival after injury. The original model used the trauma score, a physiologic scoring system with weighted values for respiratory rate, respiratory effort, systolic blood pressure, capillary refill, and the Glasgow Coma Scale; the trauma score can range between 2 (poor) and 15 (good).120 The newer TRISS uses the revised trauma score,121 which is a function of systolic blood pressure, respiratory rate, and Glasgow Coma Scale. This score ranges between 0 and 8 and was derived from a regression analysis of 25,000 patients in the MTOS.

The Injury Severity Score (ISS) accounts for the anatomic derangement caused by the trauma. This score divides the body into separate and distinct regions (abdomen, thorax, etc.) and assigns weights between 1 and 6 to specific injuries. The injury with the most weight from each region is identified, and this weight is known as the Abbreviated Injury Score for that region. All of the Abbreviated Injury Scores are squared.
The three highest squares are then added together to yield the ISS,42,122 which can vary between 0 and 75; an Abbreviated Injury Score of 6, reserved for lethal injuries, automatically yields an ISS of 75. There are two different logit equations, one for blunt trauma and one for penetrating trauma. The logit contains the variables of ISS, revised trauma score, and age. Once the logit is determined, it is placed into the logistic transformation to yield a probability of survival for the particular patient.

The model has been applied to many populations, including both pediatric123,124 and adult patients,125 but there have been various criticisms of this system. Several subpopulations have been identified that are not well predicted by the model,126 especially those with underlying medical conditions.127 Modifications have had to be made for certain subsets of patients, most particularly those who undergo intubation.128 It has also been noted that the age portion of the model is dichotomous, with a weight of zero for patients younger than 54 years and a weight of 1 for those older than 54 years. Other criticisms include very wide variations in the recording of the ISS. Several studies have detected major intraobserver variability in assigning ISS scores, with concordance ranging from 20% to 40%.129,130 The ability of the ISS to measure the actual severity of illness of a particular patient with trauma has also been called into question.131 It has been suggested that the predictive model may work better for patients with blunt trauma than for those with penetrating trauma.132

Originally, TRISS did not use the established methods of measuring discrimination and calibration, the area under the ROC curve and goodness-of-fit testing, respectively. Instead, an arbitrary cutoff probability of survival (50%) was selected, and those with a probability of dying above this point were said to be "predicted to die." A statistical parameter known as the Z-statistic was then applied; a sketch of both the ISS and this statistic follows this paragraph. The Z-statistic is equal to the number of observed deaths minus the predicted number of deaths, divided by the square root of the sum of the products of the probability of death and the probability of survival. This quantifies the difference between the number of actual deaths and the number of predicted deaths. A Z value of 0 would indicate perfect agreement. When mortality is analyzed, a negative Z value would attest that the number of predicted deaths exceeds the number of observed deaths, and is preferred; a positive Z value indicates the opposite. A Z value of 1.96 indicates that the actual outcome is 2 standard deviations from the expected outcome.133
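A sketch of the ISS computation and the Z-statistic as described above; the region names and all patient values are hypothetical.

import math

def injury_severity_score(region_ais):
    # Highest Abbreviated Injury Score per body region; sum the squares
    # of the three highest. An AIS of 6 (lethal) forces the maximum, 75.
    if 6 in region_ais.values():
        return 75
    top_three = sorted(region_ais.values(), reverse=True)[:3]
    return sum(a * a for a in top_three)

def z_statistic(deaths_observed, death_probs):
    # (observed deaths - predicted deaths) / sqrt(sum of p * (1 - p)).
    expected = sum(death_probs)
    spread = math.sqrt(sum(p * (1 - p) for p in death_probs))
    return (deaths_observed - expected) / spread

# Hypothetical blunt-trauma patient: worst AIS in each region.
ais = {"head": 4, "thorax": 3, "abdomen": 2, "extremities": 2, "external": 1}
print(injury_severity_score(ais))             # 16 + 9 + 4 = 29
print(f"{z_statistic(30, [0.2] * 140):.2f}")  # 30 deaths vs 28 expected -> 0.42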
The more recent model developed for the evaluation of the patient with trauma is A Severity Characterization of Trauma (ASCOT).134 This model was first available in 1990 and was also developed from the MTOS database, but with several modifications. The age component is interval scaled instead of dichotomous as in the TRISS model. Attempts to eliminate the intraobserver variability of the ISS yielded the Anatomic Profile, a summary score for anatomic derangement from injury with four categories, A through D. Category A is a summation of injuries to the central nervous system, B evaluates injury to the thorax and neck, C summarizes all serious injuries to the abdomen, pelvis, other bony structures, and the vascular system, and D describes all other nonserious injuries (and was actually excluded from the final model). ASCOT also evaluates physiologic derangement by including the Glasgow Coma Scale, systolic blood pressure, and respiratory rate in the logit. There are separate logit equations for blunt and penetrating trauma victims.

The ASCOT was designed to be an improvement over the TRISS method for the evaluation of patients with trauma. Nonetheless, even in the original report the adequacy of this model is questionable. Both the TRISS and ASCOT methods had poor goodness-of-fit testing for both penetrating and blunt trauma victims. Of the two models, only the ASCOT model showed close to a good fit, and that was only for patients with penetrating injuries; the Hosmer-Lemeshow goodness-of-fit statistic was 12.65 for this subgroup and was more than 15.5 for all other groups examined. These models have also shown questionable performance in non-MTOS populations.135 When compared with TRISS, ASCOT may have a small increase in accuracy; however, some investigators believe this is offset by its increased complexity.136 It is also questionable whether chronic illness is adequately accounted for in these models; there is no provision in the TRISS or ASCOT methods to account for this potentially confounding influence.137
Problems with Predictive Instruments
There are many problems with predictive instruments, and the user should be aware of them. Some of the obstacles are inherent to the design of a particular model and are known as biases. Others relate to the actual application of a model to the population being tested. Many of these difficulties were discovered only after a model was examined by investigators other than the original constructors. The identification of these idiosyncrasies does not mean that the model must be rejected or abandoned. Instead, the peculiarity must be compensated for, or at least understood, in the operation of the model, and it is important that these problems be accounted for or corrected in subsequent endeavors to build new predictive instruments.
Inherent Errors or Bias
Selection bias. This form of bias strikes at the core of a predictive instrument in its design, data collection, and construction phases. This bias can be introduced in two ways: (1) in the selection of the population to be studied and on which the instrument will be based and (2) in the selection of variables to be included for evaluation and eventually incorporated into the model. Selection bias in the study patient population occurs when the test sample does not adequately represent the population to which the model will be applied. Several possible deficits in the original database may lead to selection bias. The first possibility is that the database is simply too small to contain all of the data required to make predictions for a wide range of disease processes. This type of bias was very evident in the earlier models. All three of the original models (APACHE I, SAPS I, and MPM I) were constructed from sample populations of fewer than 1000 patients. Even APACHE II, which was tested on more than 5000 patients, has been implicated as having selection bias. This is not so much from a small sample size as from a patient population that lacks breadth with regard to a variety of disease processes. This flaw is usually uncovered when the model is shown to be inaccurate in certain subgroups of patients. This was demonstrated for the APACHE II model in patients with acute myelogenous leukemia,138 low serum albumin levels,139 and many other conditions. Another example of selection bias in the original sample population may occur when all of the patients are acquired from a single source. The MPM I model not only had a small test population but consisted entirely of patients at one particular hospital. Any peculiarities of that specific patient population are then incorporated into the model undiluted by patients from other institutions. As a result, if that institution cared for a disproportionate share of a distinct illness, this disease process would be overrepresented in the predictive instrument. In addition, any discrepancy in care that deviated from the norm, for good or bad, and had an influence on the outcome would be magnified in the model. The MPM II attempted to correct these potential biases by increasing the sample population to almost 20,000 patients from more than 100 different ICUs in 12 countries. The hope was for an extremely diverse population in which all disease processes would be represented in an unbiased fashion. Another factor that will probably never be compensated for is the selection of patients who are actually admitted to the ICU.140 This, after all, is subject to the judgment of the admitting physician and beyond the control of the designers of the model. The selection of variables and their weightings can also be a potential source of selection bias.
There can be several flaws simply in the type of variables chosen. It is well known that variables that rely on interpretation by a clinician, either physical examination or observation, are not as reproducible as those that are generated by the laboratory.141 This has not been a significant problem with the ICU predictive indexes, whose authors have taken extensive measures to eliminate subjective variables. However, there are still examples of variables that require the interpretation of a physician. The "coma or stupor" variable in the MPM II model and the "intracranial mass effect" variable are both subject to the interpretation of the observer. The potential for bias in variables selected by a panel of "experts" is clearly evident. The tendency of physicians to give undue weight to overwhelming or recently experienced events has been described previously.142 The earlier models, most notably the original APACHE, were subject to this type of bias. The APACHE I method is rarely used as a predictive model today. All of the larger and later predictive instruments have used objective computer-driven techniques such as discriminant analysis and regression methods in the variable selection process. Not only does this approach avoid selection bias, but the variables chosen also more accurately reflect illness severity. The weighting of these variables can also be a source of error. This is especially apparent when continuous data are ordinalized, or grouped. Ordinalization of continuous data takes a parameter with a distribution that is normally unrestricted and divides it into segments. This process undoubtedly decreases the explanatory value of that particular variable. Nonetheless, almost all of the physiological variables found in most predictive instruments are categorized in this fashion. An example of how ordinalization can misrepresent data is the serum sodium in the SAPS system. A serum sodium of 125 would be considered normal and receive a weight of zero, whereas a value of 124 would receive a weight of five, a substantially different weight for such a small difference in the actual value.
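A minimal sketch of this kind of ordinalization, keeping only the 124/125 boundary described above; the full SAPS table also penalizes high sodium values, which are deliberately ignored here, so this is an illustration rather than the actual scoring rule:

    def sodium_weight(na: float) -> int:
        """Illustrative SAPS-style ordinalization of serum sodium.

        Only the 124/125 boundary comes from the text; the treatment of
        high values is omitted in this sketch.
        """
        if na >= 125:
            return 0   # treated as "normal" in this sketch
        return 5       # one unit lower and the weight jumps to five

    assert sodium_weight(125) == 0
    assert sodium_weight(124) == 5

The cliff between two nearly identical laboratory values is exactly the loss of explanatory power that ordinalization introduces.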
Lead-time Bias. Lead-time bias refers to bias introduced before the actual measurement of the value in question. This problem plagued studies in the ICU long before the advent of scoring systems and remains a major problem for predictive models.143 In the case of predictive instruments, lead-time bias refers to treatments or interventions that a patient may have received before admission to the ICU, where most instruments make their measurements. For instance, a patient with sepsis and diabetic ketoacidosis may receive very aggressive therapy in the emergency department before being transferred
to the ICU, where the measurements for the predictive instruments will be recorded. The patient may have a far better score than when he or she originally presented to the emergency department and possibly a better predicted mortality than the actual disease state merits. A similar effect may be encountered in patients with chronic illnesses. For example, consider the patient with chronic emphysema who has been treated on the ward and does not respond to the therapies available there. His predicted mortality, partially because of the variables pertaining to the arterial blood gas, may be similar to that of an asthmatic patient entering from the emergency department. However, this patient's outcome will be worse than expected, because he has already proven refractory to therapy. The location where a patient has received care before his or her ICU admission has been demonstrated in and of itself to be a fairly accurate predictor of mortality.144 This particular type of bias has led to criticism of the APACHE II model. The authors have acknowledged this flaw, and in the APACHE III model they have attempted to take measures to correct the deficit. One of the new variables is the source of the patient's referral. Each source (e.g., emergency department, nursing ward) has been assigned a weight that is supposed to reflect and counterbalance the lead-time bias. The problem with this correction is that it is an average across an entire source of patients and in individual circumstances may not accurately reflect the magnitude of the bias. Lead-time bias is especially treacherous when predictive instruments are used to stratify or compare performance among different ICUs. Predictive models with lead-time bias may discriminate against ICUs with particular referral patterns to the critical care areas. Tertiary care centers typically receive chronic patients transferred from other institutions who have not improved with therapy provided at the referring institution. These patients, because of their already proven recalcitrance to therapy, will do worse than predicted when compared with the average patient with a similar score. This is one of the main reasons these instruments should be used with caution for this purpose. Scoring systems have been used to compare ICUs at home and abroad.145 One study actually implicated the interaction and coordination of the ICU team for the higher than expected mortality at a specific hospital.146 In actuality, however, lead-time bias may have been the confounding influence that produced these results.

Detection Bias. Detection bias is found not only in predictive instruments but also in any study where data are collected. Detection bias refers to the issue of missing data and how this issue is addressed. This has been noted to be a potential problem with the APACHE models. The
way in which this problem has been rectified by the authors is to assume the missing variable is normal and to assign it a weight of zero. However, the assumption of normality for parameters that are not routinely obtained or are missing from the patient's chart can lead to bias. For instance, patients with chronic renal failure, who have an abnormally high creatinine level, may not have this laboratory parameter checked every day because of the lack of utility in doing so. This premise applies to other laboratory values needed for the APACHE models such as the serum bilirubin, the white blood cell count, and the blood urea nitrogen. Another source of detection bias arises in mental status examinations such as the Glasgow Coma Scale, which are an important component of most models. Sedated or chemically paralyzed patients have error introduced into this measure by treatment. Although there have been attempts to correct for this error, it remains a concern. There have been several attempts to reduce this potential source of bias. One of the most frequently used techniques is to design models that use as few variables as possible and yet still obtain an adequate relationship between the predictors and the outcome. This method has been used in all of the later APACHE, MPM, and SAPS models. Missing values will never be totally eliminated from any research endeavor, however. As a result, alternative methods for dealing with them must be explored. Current methods under investigation include the construction of a data hierarchy, which would decide which variables could be left out without damaging the predictive ability, or actually weighting missing variables.140
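The assume-normal convention is easy to express in code. In the sketch below the weight table and its cut points are invented for illustration, but the key behavior, a missing value silently contributing zero points, follows the convention just described:

    # Hypothetical APACHE-style scoring of a few variables. A variable
    # missing from the chart is assumed normal and scored as zero, which
    # is exactly where detection bias creeps in.
    def score(values: dict, weight_fns: dict) -> int:
        total = 0
        for name, fn in weight_fns.items():
            if name not in values or values[name] is None:
                total += 0          # missing -> assumed normal
            else:
                total += fn(values[name])
        return total

    weight_fns = {
        # invented cut points, for illustration only
        "creatinine": lambda x: 4 if x >= 3.5 else (2 if x >= 2.0 else 0),
        "wbc": lambda x: 2 if (x >= 20 or x < 3) else 0,
    }

    # A dialysis patient whose creatinine was simply not drawn that day
    # scores the same as a patient with a normal creatinine.
    print(score({"wbc": 8.0}, weight_fns))                      # 0
    print(score({"wbc": 8.0, "creatinine": 6.2}, weight_fns))   # 4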
Diagnosis Bias. The diagnosis that a patient receives on admission to the ICU is important for assessing illness severity with the APACHE II or APACHE III scoring systems. This is a latent source of bias if the diagnosis of the patient is not clear-cut. Most patients who are admitted to the ICU have many medical problems. Placing some of the more complicated patients into a single disease category is at best difficult and in some cases requires subjective decision-making. The APACHE II and APACHE III require that this be done despite the lack of concrete diagnostic definitions and priorities. This can lead to marked discrepancies in the final predicted mortality rate for complicated patients. For example, consider a hypothetical patient who has had a drug overdose and a respiratory arrest. Both of these diagnoses are considered a primary diagnostic category in the APACHE II scoring system. Without the need for a surgical procedure and with an APACHE II score of 20, the predicted mortality would differ depending on the primary diagnostic category selected. If respiratory arrest is selected as the primary diagnosis, the predicted mortality for
this patient would be 32%. However, when drug overdose is selected as the primary diagnosis, the estimated probability of dying for this patient is only 2%. Three deciles of risk is a wide discrepancy in the estimated mortality for the same patient.

Ongoing Prediction Bias. The APACHE III and the MPM models are equipped to make ongoing predictions over the course of a patient's stay in the ICU. This feature has the potential to bias outcomes, especially in patients who are at the poor prognosis end of the spectrum. If the predictive instrument is used as part of the decision-making process, there is the possibility that the instrument becomes a self-fulfilling prophecy. For example, consider a patient who is admitted and has, by whichever model is chosen, a predicted mortality that increases over several days. This could be construed by physicians, other health care workers, and the patient's family as an indication for a less aggressive approach. Less aggressive care, or even withdrawal of care, would bias the outcome in favor of mortality for the patient with continuously updated predictions. However, the same patient treated without the aid of a predictive instrument might receive a more aggressive approach and possibly achieve a better outcome. Because continuous prediction is a relatively recent innovation, this is a somewhat new bias. Methods for treating these effects are still in the theoretic phase and have not been applied to any models.
Errors in Application
There are difficulties in the application of all the predictive models. Most of these difficulties are related to the actual data collection when the model is being applied. One of the primary flaws in the original APACHE model was the sheer number of variables that needed to be collected for each patient. Recording 34 physiological variables for a single patient in a 24-hour period is not an easy task. Shortly after its introduction it was noted that certain variables within the APACHE instrument were rarely collected.147,148 Prime examples included testing for skin anergy, serum lactate determinations, and serum osmolarity measurements. Reasons mentioned for not collecting these variables ranged from the test not being available at the particular hospital to physicians being unwilling to order it. These variables, which were all collected with extreme efficiency during the model development and construction, were an important part of the multivariate equation. Their absence had the potential to bias the model severely. It became clear to future model designers that the more variables required and the more complex they were to obtain, the less likely they were to be available for entry into the instrument.
One of the major reasons cited for the development of the SAPS predictive instrument was the confusion and lack of simplicity afforded by the original APACHE. The authors of the APACHE models realized the magnitude of this problem and trimmed the number of variables in the subsequent model. The SAPS I has 14 easily measured variables, and the APACHE II pared the APS to 13 physiological variables. Another source of error is in the accuracy of data collected. The consistency of the data collected for the construction of APACHE, SAPS, and MPM is impeccable. The kappa statistics and intraclass correlation coefficients are consistently between 0.8 and 1. However, this is in the setting of data collectors who were extensively trained and monitored and who had data collection as their sole function. In the actual clinical setting in which the models are used, data are collected by nurses, residents, and medical records personnel. These individuals are not necessarily as concerned with the accurate collection of data as with other issues such as patient care. It has been demonstrated that up to 20% of patients have a significant error in the collection of the data required for the APACHE II APS score.149 This effect is also related to the complexity of the data to be collected. One of the advantages of the MPM models over the APACHE instruments is that the MPM collected data at admission, a very distinct point in time. The APACHE systems use the worst value for a variable over the first 24 hours in the ICU. This can be very complex when a single variable such as temperature may have been measured 15 times in the first 24 hours of an ICU stay. This method also opens the APACHE model to the criticism that therapy may have affected the illness severity, because it is being measured during treatment. Another difficulty in the collection of the APS for APACHE II is that some of the variables require calculations. The A-a gradient on arterial blood gases and the mean arterial blood pressure are infrequently presented as such and demand numeric manipulation of the data usually logged on the ICU flowsheet. Although the calculations are not overwhelming, they can be another source of error. In general, fewer and simpler variables that still give adequate predictive power are desired.
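Both derived variables are standard bedside formulas rather than anything specific to APACHE; a sketch, using the usual sea-level barometric pressure and respiratory quotient assumptions:

    def mean_arterial_pressure(sbp: float, dbp: float) -> float:
        """MAP estimated from cuff pressures: diastolic plus one third of the pulse pressure."""
        return dbp + (sbp - dbp) / 3.0

    def a_a_gradient(pao2: float, paco2: float, fio2: float,
                     patm: float = 760.0, ph2o: float = 47.0, rq: float = 0.8) -> float:
        """Alveolar-arterial oxygen gradient via the alveolar gas equation.

        patm and ph2o are sea-level defaults (mm Hg); rq is the usual
        respiratory quotient assumption of 0.8.
        """
        pAo2 = fio2 * (patm - ph2o) - paco2 / rq   # alveolar PO2
        return pAo2 - pao2

    print(mean_arterial_pressure(120, 80))              # about 93 mm Hg
    print(a_a_gradient(pao2=90, paco2=40, fio2=0.21))   # about 10 mm Hg on room air

Neither computation is difficult, but each is one more manual step between the flowsheet and the score, and therefore one more opportunity for error.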
The Role of Physicians in Gauging Illness Severity
In acute diseases it is not quite safe to prognosticate either death or recovery.
Hippocrates150
Patients and their families will forgive you for wrong diagnoses, but will rarely forgive you for wrong prognoses; the older you grow in
medicine, the more chary you get about offering ironclad prognoses, good or bad.
Albert R. Lamb151
As demonstrated by the first quote, by Hippocrates, physicians have had the task of estimating and assigning prognoses since the inception of medicine. Both quotes demonstrate the physician's perceived difficulty in discharging this responsibility. This action is necessary and expected of the physician, however, so that the patient and his or her family can plan for the future. Physicians have tried to avoid this responsibility by "playing it safe" and offering prognoses that overstate the severity of the patient's illness. However, honest prognostication is believed, by most, to be superior to the nihilism of the "hanging crepe" strategy adopted by many physicians when dealing with patients' families and the tenuous clinical situations of their loved ones.152 It has been assumed for years and suggested in some studies that the physician's clinical judgment affords the ability to determine illness severity adequately.153 However, there have been recent studies that contradict this supposition. Numerous studies in various subspecialties have demonstrated the possibility that computer-aided models may have better predictive powers than physicians. For instance, in the field of cardiology, physicians were asked to assess the 3-year survival rates and infarct-free survival rates for a cohort of patients with significant coronary artery disease. When compared with a predictive model, physician predictions were less accurate than those of the computer. The rank correlation for survival was 0.61 for the model compared with 0.49 for the physicians. The rank correlations for infarct-free survival were 0.48 for the model and 0.29 for the physicians.154 This has also been demonstrated in the ICU. In one study with more than 200 patients, experienced physicians and nurses were asked to predict the survival of critically ill patients. The physician and nurse predictions had a false-positive rate of between 7.7% and 16.7%. The predictive instrument in this study had no false predictions of death and a greater sensitivity than the physicians.155 Several studies have suggested the physician to be better than predictive instruments at assigning risk. Meyer and colleagues156 reported that the overall accuracy for physician predictions was 95.2% compared with a 90.9% accuracy demonstrated by the APACHE II model on the same cohort. This study is flawed, however, because it uses the APACHE II score instead of the predictive instrument, has an inordinately low ICU mortality, and uses arbitrary values of sensitivity and specificity to support its conclusions. Other studies have also reported equal or better prognostications by physicians. Kruse and colleagues157 found no differences in the
ROC curves of the APACHE II model, physicians, and nurses. In contrast, McClish and Powell158 found an area under the ROC curve of 0.89 for physicians and 0.83 for the APACHE II model, and Brannen and colleagues159 found an area under the ROC curve of 0.87 for physicians and 0.80 for the APACHE II model. All of the studies that claim better or equal efficacy of assigned risk by the physician when compared with predictive instruments suffer from a common design flaw. They assessed the discrimination of the predictions of the physicians and the models but did not evaluate the calibration of either modality. McClish and Powell tried to illustrate calibration with a two-space plot, and Kruse and colleagues tried to illustrate calibration with a comparative calibration histogram, but neither performed goodness-of-fit testing. Thus, there was no statistical evaluation of this parameter, which is an essential part of the appraisal of the utility of a model. Visual assessment of the graphic representations clearly demonstrates better calibration by the predictive indexes. These studies indicate something that has been observed in many other studies. Physicians may have discriminatory power that equals or even exceeds that of the predictive models. Yet clinicians consistently show significantly poorer calibration when compared with predictive instruments.160 In other words, over the entire range of risk the physician could not predict as well as the regression models. This is especially true in the lower and middle ranges of risk. This could be due to a noted trend in which physicians consistently overestimated mortality probabilities.161 The lack of adequate calibration certainly limits physicians in the field of illness severity assessment. There are also other traits that make physicians less than desirable in assigning risk, including poor reproducibility of a physician's individual decisions, large differences in predictive accuracy related to experience, and the bias of recent events.
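The calibration testing these studies omitted is the decile-based Hosmer-Lemeshow calculation referred to throughout this monograph. A minimal sketch of that calculation, with synthetic data and the conventional grouping into ten risk-ordered groups:

    import numpy as np

    def hosmer_lemeshow(p: np.ndarray, died: np.ndarray, groups: int = 10) -> float:
        """Hosmer-Lemeshow statistic over risk-ordered groups.

        p    : predicted probabilities of death
        died : 1 if the patient died, else 0
        Compared against a chi-square distribution with groups - 2 df.
        """
        order = np.argsort(p)
        h = 0.0
        for idx in np.array_split(order, groups):
            n = len(idx)
            observed = died[idx].sum()
            expected = p[idx].sum()
            denom = expected * (1.0 - expected / n)
            if denom > 0:   # guard against degenerate groups
                h += (observed - expected) ** 2 / denom
        return h

    rng = np.random.default_rng(0)
    p = rng.uniform(0.01, 0.9, 500)
    died = (rng.uniform(size=500) < p).astype(int)  # perfectly calibrated by construction
    print(hosmer_lemeshow(p, died))                 # small values indicate good fit

Discrimination (ROC area) asks only whether higher predicted risks go to patients who die; calibration asks whether a predicted 20% mortality really corresponds to one death in five, which is the property the physician studies never tested.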
Uses of Illness Severity Indexes
The wide range of the uses of predictive instruments has been described by Hyzy162 and by Fakhry and colleagues.163 A combination of these possible applications is summarized in Table 10.

TABLE 10. Possible uses of illness severity scoring systems
Patient prediction
Resource use
Quality assurance
Research
Reimbursement
Epidemiology and health policy
Individual Patient Predictions
The problem with these scoring systems is that they have absolutely nothing to do with my patients.
Steven Jurisich (personal communication with John Hunt, 1996)
As noted above by an astute twentieth-century surgeon, the applicability of these systems to the prediction of mortality in the individual patient is limited. In fact, making decisions for individual patients based solely on a predictive instrument is mentioned in this text only to be condemned. It is intuitively obvious that predictive indexes will never be 100% specific.164 Even studies that claim to be supportive of individual decision-making have patients predicted to die who live and are discharged from the hospital to lead satisfactory lives.165 This realization, coupled with studies suggesting that scoring systems are not useful for individual predictions in certain populations, has soured enthusiasm for using predictive instruments in this fashion. Schafer and colleagues166 conducted a study analyzing APACHE II, SAPS, and MPM in a population of almost 1000 patients in an ICU. These investigators found areas under the ROC curve in the range of 0.75 and poor results in goodness-of-fit testing for all of the major modern predictive models. Another, more recent study examined the use of daily APACHE II scores and of high decision thresholds to avoid a false prediction of death. Not only were there patients who were predicted to die who survived, but the sensitivity of the instrument was so reduced that the model was rendered practically useless.167 Proponents of using scoring systems to make individual decisions will often do so in the guise of improving allocation of resources. The reader should be aware of this feint, which skirts the issue of individual decision-making by predictive models. In one retrospective study, the authors analyzed the records of 146 patients who received dialysis in the ICU at a university hospital and its satellite hospitals. The authors applied the APACHE II model to the patient population and found that there were no survivors in the deciles of assigned risk greater than 70%. This segment of the study population included 33 patients, or approximately 23% of the study sample. The authors concluded there was a clear and distinct segment of the ICU populace that would not benefit from hemodialysis and alluded to a cost savings of $4500 per patient.168 Another study was conducted by Chang and colleagues169 regarding the use of total parenteral nutrition in patients in an ICU. Using a similar study design, these authors concluded that there was a discrete portion of the ICU population that would not benefit from the initiation of
total parenteral nutrition. Although these studies were conducted in specific patient populations and all but advocated the denial of certain therapies, they used a predictive model to select patients who would not receive standard treatment. When the standard of care is not delivered to a given patient, the decision not to treat has been made by default. The decision to withdraw care from a patient will never be made easily or with absolute certainty. This is true with or without the presence of predictive instruments. Currently, these decisions are made by the physician, the patient, and the patient's family. The thought of allocating this decision solely to a computer is cold and repugnant to most physicians. Many physicians and a large majority of lay people are not familiar with the intricacies and nuances of predictive instruments and are suspicious of their use. Whether illness severity scores will have a role in buttressing the resolve behind a very difficult decision remains to be seen.
Resource Use
Although the use of predictive instruments in decisions about withdrawal of care is highly charged, they may be applicable to other resource use issues. One such situation is the admission of patients to the ICU. There is definitely a group of patients that enters the ICU and receives very little in the way of "intensive care." Many of these patients are placed in the ICU solely for monitoring purposes and as such consume resources that could be better allocated. There have been numerous studies documenting the existence of this subset of patients.170,171 With the APACHE II model, Wagner and colleagues172 were able to identify this segment of the ICU population as defined by their respective predicted mortality. In a study of almost 2000 patients receiving only monitoring in the ICU, it was noted that approximately 70% of the patients had an APACHE II predicted mortality of less than 10%. Of almost 1400 patients, only 58 (less than 5%) received any kind of active therapy. Although the total cost savings was thought to amount to less than 5% of the hospital costs, it was noted that the APACHE II could be applied to triage patients into the ICU when there were more requests for beds than beds available. There has also been work to determine whether patients who are doing well can be transferred out of the ICU earlier than they had been previously. Just as there is a definite set of patients who are admitted for monitoring and receive no active therapy, there are patients who are admitted for legitimate purposes but who simply stay longer than required. Becker and colleagues173 were able to develop a regression model
with the APACHE III system for patients who underwent coronary artery bypass grafting. The outcome for one of the arms of the model was the usual mortality variable. However, the authors had two other arms of the model that used the outcomes of ICU length of stay and resource use as determined by TISS. The study was conducted in six different hospitals. One of these hospitals showed significantly higher resource use compared with estimated lengths of stay and TISS scores. Another institution showed considerably less resource use compared with the predicted outcomes. It was estimated that the hospital with the lower level of resource use saved $8400 per patient. There is currently a drive in medicine to improve the efficiency with which medical care is delivered. This drive is being supported by both the government and the private sector. Predictive instruments may be used to allocate certain resources such as admission to the ICU and length of stay in ICUs and other monitored settings. These models may allow health care providers to gauge whether they are using ICU resources in step with the institutions from which the models were constructed or with other institutions using the instrument. Efforts then could be made to correct deficiencies that may be leading to poor appropriation of resources.
Quality Assurance
The application of predictive instruments may be most felt by the ordinary physician in the area of quality assurance. Quality assurance in the United States has a long and rich history that predates the attention the issue currently receives. The American College of Surgeons was the first to standardize quality control principles with the publication of Minimum Standards for Hospitals174 in 1924. The American College of Surgeons monitored hospital quality until 1951, when the Joint Commission on Accreditation of Hospitals was formed and assumed the responsibility.175 The assessment by both of these agencies relied on sending representatives to the hospital to be analyzed. These individuals would then survey the hospital's physical plant and assess the process by which patients were cared for. The process for verification by the current agency is not that different from that of its predecessor. An attractive proposition would be the availability of a tool that could standardize variations in hospital care and be applied across a great variety of patients to compare outcomes and, possibly, the quality of care. Predictive instruments are already being applied to comparisons of different hospitals in an effort to assess the quality of care.176
Results have been mixed. Care in the ICU was compared between hospitals in Japan and in the United States. Mortality rates from 13 hospitals in the United States were compared with those from six Japanese hospitals after being standardized by the APACHE II predictive model. The study showed that there were fewer adjusted deaths in the Japanese ICUs than in the American ICUs. On further review, this result was found to be explained by differences in the ICU populations and probably by poor calibration of the model in Japanese ICUs.177 Differences in the patients admitted to the Japanese ICUs and lead-time bias were believed to be responsible. It is important to note this possibility, because differences in patient selection and process can masquerade as differences in quality of care, and differences in predicted outcomes cannot be used as the sole basis for quality control. Similar studies attempting to compare the quality of care at various institutions have been reported. In one study, the Pediatric Risk of Mortality instrument was used to assess possible differences in the care of pediatric patients in tertiary and nontertiary care centers. The model allowed the patient populations to be assigned in a uniform fashion to different levels of risk. Comparisons could then be made among the centers with regard to the care of patients, stratified by risk. It was found that the tertiary care centers were better equipped to care for children in the higher illness severity strata. There were no significant interhospital differences in the outcomes of less severely or moderately ill children.178 The reimbursement for services may also be tied to this effort at quality control. Hospitals that did not meet certain standards of care might be excluded from third party payments. For this en masse quality control to be conducted in a fair manner, there is a need for a unified system that would appropriately stratify patients by risk. This would protect hospitals that cared for the extremely ill, such as many tertiary care centers. Unusually high complication or mortality rates may be justified if an objective scoring system can demonstrate a more severely ill patient population. There is still much work to be done in the area of quality control and scoring systems. Civetta and colleagues179 observed that in surgical patients in an ICU, the actual mortality rates in a particular severity stratum could differ from those predicted by APACHE II because of case mix and differences in length of stay. It has also been noted by Boyd180 that 2 of 13 hospitals in the original APACHE II database had significantly different actual mortality rates compared with the APACHE II predicted mortality rates.
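The arithmetic that usually underlies such interhospital comparisons is a ratio of observed deaths to the deaths the model predicts for that unit's case mix, commonly called a standardized mortality ratio; the term and the patient-level numbers in this sketch are illustrative, not drawn from the studies cited above:

    # Standardized mortality ratio: observed deaths divided by the sum of
    # model-predicted death probabilities for that unit's patients.
    # The probabilities below are invented for illustration.
    def smr(predicted_probs, deaths_observed):
        expected = sum(predicted_probs)
        return deaths_observed / expected

    icu_a = [0.10, 0.25, 0.40, 0.05, 0.20]   # hypothetical APACHE-style predictions
    icu_b = [0.50, 0.60, 0.30, 0.45, 0.15]

    print(round(smr(icu_a, 2), 2))  # 2 deaths vs 1.0 expected -> 2.0
    print(round(smr(icu_b, 2), 2))  # 2 deaths vs 2.0 expected -> 1.0

The ratio only adjusts for what the model measures; lead-time bias and patient selection, as noted above, can still masquerade as differences in quality of care.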
Research
Increasingly, research efforts are conducted in a multiinstitutional setting.181 It is rapidly becoming apparent that there is a need to be able to categorize patients so that trials and studies can be conducted in an unbiased fashion. Differences in institutions and practice patterns can both have confounding effects on study results. An illness severity scoring system is one way to construct comparable study groups from two different populations, so that a treatment modality can be applied and studied unbiased by the case mix. Scoring systems can be used to stratify patients into equivalent groups for prospective trials. They can also be used to identify flaws in retrospective studies in which the conclusions are suspect. Differences in baseline populations can be analyzed and examined for confounding influences. This technique has been used in a number of retrospective studies. In a retrospective analysis of thoracic aorta injuries, Hunt and colleagues182 concluded that partial bypass techniques may be beneficial in preventing paraplegia in cases where the aortic cross-clamp time was greater than 35 minutes. Illness severity scores were used to ensure that there were no differences in the baseline states of the patients who underwent bypass and those who did not.
New Ideas
It has been suggested by some that the whole concept of trying to develop a predictive system based on the association of individual predictors with outcome is misguided. Civetta183 has indicated that the multiple logistic regression model is based on a modification of linear principles and believes that this model is too simple for the complex clinical situations found in the ICU. He is supported by Feinstein,184 who is a proponent of set theory for characterizing clinical situations. They both believe that arithmetic models are inferior to models based on Boolean algebra in describing the associations between predictors and outcomes. This thought process may be reflected in some of the newer predictive models. The neural network is a mathematic construct that mimics biologic neural systems and is capable of solving complex problems. The neural network is able to identify complex patterns in input data and associate them with outcomes. The computer can "learn" to identify these patterns and combinations of these patterns and predict outcome based on its past "experience."185 A model based on this theory applied to patients with trauma performed adequately. When compared with TRISS and ASCOT, the neural network model had similar goodness-of-fit testing, specificity, sensitivity, and misclassification rates.186
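As an illustration of the idea only, and not of the cited trauma model, a single hidden layer trained by gradient descent on synthetic data might look like the following; the three inputs stand in for predictors such as age, ISS, and RTS:

    import numpy as np

    # Minimal one-hidden-layer network for a binary (died/survived) outcome,
    # trained by gradient descent on the cross-entropy loss. All data and
    # the architecture are illustrative placeholders.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))                 # 200 patients, 3 predictors
    true_w = np.array([1.5, -2.0, 0.8])
    y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

    hidden = 5
    W1 = rng.normal(scale=0.5, size=(3, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=hidden);      b2 = 0.0

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    lr = 0.1
    for _ in range(2000):
        h = np.tanh(X @ W1 + b1)                  # hidden-layer activations
        p = sigmoid(h @ W2 + b2)                  # predicted death probability
        dz2 = (p - y) / len(y)                    # cross-entropy gradient at the output
        dW2 = h.T @ dz2; db2 = dz2.sum()
        dh = np.outer(dz2, W2) * (1 - h ** 2)     # backpropagate through tanh
        dW1 = X.T @ dh; db1 = dh.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print("training accuracy:", ((p > 0.5) == y).mean())

The hidden layer is what lets the model represent interactions among predictors that a single logistic equation cannot, which is the appeal claimed for this approach.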
There may also be some modifications in the actual data used. Although there are those who insist that extremely accurate data are necessary for the construction of a reliable model,187 there are those who maintain this may not be mandatory. Rutledge and colleagues188 have demonstrated that models for patients with trauma constructed with the ISS scoring system may be no better than models designed with the International Classification of Diseases, Version Nine (ICD-9) codings from billing information compiled by medical records personnel. If this is true, efforts and resources dedicated to developing databases specifically for predictive modeling may be applied elsewhere. The avenues available for further investigation about outcome prediction are numerous. Regardless of the directions taken, there will be a place for predictive instruments in the future. How much the accuracy can be improved and whether these tools will be used beyond the realm of quality assurance, resource use, and research remains to be seen.
ACKNOWLEDGEMENT
The authors thank Suzan deSerres, Peter Townshend, and Lyndie Bracey, who provided great help and inspiration through this endeavor.
REFERENCES
1. Hoyt JW, Leisifer DJ, Rafkin HS. Critical care units. In: Wenzel RP, editor. Assessing quality health care: perspectives for clinicians. Baltimore: Williams & Wilkins; 1992. p. 267-96.
2. Hospital statistics 1988. 1989-1990 ed. Chicago: American Hospital Association.
3. Reiser SJ. The intensive care unit: the unfolding and ambiguities of survival therapy. Int J Technol Assess Health Care 1992;8:382-94.
4. Rapin M. The ethics of intensive care. Intensive Care Med 1987;13:300-3.
5. Robin ED. A critical look at critical care. Crit Care Med 1983;11:144-8.
6. Vincent JL, Parquier JN, Preiser JC. Terminal events in the intensive care unit: review of 258 fatal cases in one year. Crit Care Med 1989;17:530-3.
7. Raffin TA. Intensive care unit survival of patients with systemic illness. Am Rev Respir Dis 1989;140:S28-35.
8. Detsky AS, Stricker SC, Mulley AG. Prognosis, survival and the expenditure of hospital resources for patients in the intensive care unit. N Engl J Med 1981;305:667-72.
9. Zook CJ, Moore FD. High cost users of medical care. N Engl J Med 1980;302:996-1002.
10. Oye RK, Bellamy PE. Patterns of resource consumption in medical intensive care. Chest 1991;99:685-9.
11. Cullen DJ, Ferrara LC, Briggs BA, Walker PF, Gilbert J. Survival, hospitalization charges and follow-up results in critically ill patients. N Engl J Med 1976;294:982-7.
12. Goins WA, Reynolds HN, Nyanjom D, Dunham CM. Outcome following prolonged intensive care unit stay in multiple trauma patients. Crit Care Med 1991;19:339-45.
13. Danis M, Patrick DL, Southerland LI, Green ML. Patients' and families' preferences for medical intensive care. JAMA 1988;260:797-802.
14. Abramowitz KS. The future of health care delivery in America. 1st ed. New York: Sanford C Bernstein & Co; 1986. p. 1-181.
15. Gordon TA, Burleyson GP, Tielsch JM, Cameron JL. The effects of regionalization on cost and outcome for one general high risk surgical procedure. Ann Surg 1995;221:43-9.
16. Carey TS, Garrett J, Jackman A, McLaughlin C, Fryer J, Smucker DR. The outcomes and costs of care for acute low back pain among patients seen by primary care practitioners, chiropractors, and orthopedic surgeons. N Engl J Med 1995;333:913-7.
17. Annas GJ. Women and children first. N Engl J Med 1995;333:1647-51.
18. Shore MF, Beigel A. The challenges posed by managed behavioral health care. N Engl J Med 1996;334:116-8.
19. Richardson JD, Polk HC. Monitoring quality of care: how about a level playing field. Bull Am Coll Surg 1996;81:28-31.
20. Brochard L, Mancebo J, Wysocki M, et al. Noninvasive ventilation for acute exacerbations of chronic obstructive pulmonary disease. N Engl J Med 1995;333:817-22.
21. Rutledge R, Hunt JP, Lentz CW, et al. A statewide, population-based time-series analysis of the increasing frequency of nonoperative management of abdominal solid organ injury. Ann Surg 1995;222:311-26.
22. Pagano M, Gauvreau K. Principles of biostatistics. 1st ed. Belmont, CA: Duxbury Press; 1992. p. 409-37.
23. Hennekens CH, Buring JE. Epidemiology in medicine. 1st ed. Boston: Little, Brown & Co; 1987. p. 317-9.
24. Fletcher RH, Fletcher SW, Wagner EH. Clinical epidemiology: the essentials. 2nd ed. Baltimore: Williams & Wilkins; 1988. p. 125-6.
25. Hosmer DW, Lemeshow S. Applied logistic regression. 1st ed. New York: Wiley & Sons; 1989. p. 8-11.
26. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic research: principles and quantitative methods. Belmont, CA: Lifetime Learning Publications; 1982. p. 419-43.
27. Kollef MH, Schuster DP. Predicting intensive care unit outcome with scoring systems. Crit Care Clin 1994;10:1-18.
28. Ruttiman UE. Statistical approaches to development and validation of predictive instruments. Crit Care Clin 1994;10:19-35.
29. Titterington DM, Murray GD, Spiegelhalter DJ. Comparison of discrimination techniques applied to a complex data set of head injured patients. Journal of the Royal Statistical Society (Series A) 1981;144:145-56.
30. Anderson JA. Separate sample logistic discrimination. Biometrika 1972;59:19-25.
31. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 1979;74:829-36.
32. Halperin M, Blackwelder WC, Verter JI. Estimation of the multivariate logistic risk function: a comparison of the discriminant function and likelihood approaches. J Chronic Dis 1971;24:125-58.
33. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston: Little, Brown & Co; 1991. p. 293.
34. Rothman KJ, Boice JD. Epidemiologic analysis with a programmable calculator. USDHEW NIH publication no. 79-1649; 1979.
35. Truett J, Cornfield J, Kannel W. A multivariate analysis of the risk of coronary heart disease in Framingham. J Chronic Dis 1967;20:511-24.
36. Schuster DP. Predicting outcome after ICU admission: the art and science of assessing risk. Chest 1992;102:1861-70.
37. Knaus W, Wagner D, Draper E. Chapter 2: Development of APACHE. Crit Care Med 1989;17:S181-5.
38. Pollack MM, Ruttiman UE, Getson PR. Accurate prediction of the outcome of pediatric intensive care: a new quantitative method. N Engl J Med 1987;316:134-9.
39. Pollack MM, Ruttiman UE, Getson PR. The Pediatric Risk of Mortality (PRISM) score. Crit Care Med 1988;16:1110-6.
40. LeGall JR, Loirat P, Alperovitch A, Glaser P, Granthil C, Mathieu D. A simplified acute physiology score for ICU patients. Crit Care Med 1984;12:975-7.
41. Lemeshow S, Teres D, Pastides H, Spitz-Avrunin J, Steingrub JS. A method for predicting survival and mortality of ICU patients using objectively derived weights. Crit Care Med 1985;13:519-25.
42. Baker SP, O'Neill B, Haddon W, Long WB. The Injury Severity Score: a method for describing patients with multiple injuries and evaluating emergency care. J Trauma 1974;14:187-96.
43. Cullen DJ, Civetta JM, Briggs BA, Ferrara LC. Therapeutic Intervention Scoring System: a method of quantitative comparison of patient care. Crit Care Med 1974;2:57-60.
44. Ranson JH, Rifkind KM, Turner JW. Prognostic signs and nonoperative peritoneal lavage in acute pancreatitis. Surg Gynecol Obstet 1976;143:209-19.
45. Ranson JH, Spencer FC. The role of peritoneal lavage in severe acute pancreatitis. Ann Surg 1978;187:565-73.
46. Norris RM, Brandt PW, Caughy DE. A new coronary prognostic index. Lancet 1969;1:274-8.
47. Larvin M, McMahon MJ. APACHE II score for assessment and monitoring of acute pancreatitis. Lancet 1989;2:201-4.
48. Kollef MH, Enzenauer RJ. Predicting outcome from intensive care for patients with rheumatologic diseases. J Rheumatol 1992;19:1260-2.
49. Knaus WA, Zimmerman JE, Wagner DP, Draper EA, Lawrence DE. APACHE-acute physiology and chronic health evaluation: a physiologically based classification system. Crit Care Med 1981;9:591-7.
50. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-30.
51. Feinstein AR. Clinical biostatistics. 1st ed. St Louis: CV Mosby; 1977. p. 38-53.
52. Zaren B, Hedstrand U. Quality of life among long term survivors of intensive care. Crit Care Med 1987;15:743-7.
53. Frank BS, Pollack MM. Quantitative quality assurance in a community hospital pediatric intensive care unit. West J Med 1992;157:149-51.
54. Lentz CW, Baker CC, Fakhry SM, Hunt JP, Rutledge RR. A prospective analysis of the impact of opening an intermediate care unit on resource utilization in a surgical intensive care unit. Crit Care Med 1995;23:A24.
55. Feinstein AR. Clinimetrics. 1st ed. New Haven: Yale University Press; 1987. p. 22-43.
56. Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules: applications and methodological standards. N Engl J Med 1985;313:793-9.
57. Feinstein AR. Clinical biostatistics. XLI. Hard science, soft data, and the challenges of choosing clinical variables in research. Clin Pharmacol Ther 1977;22:485-96.
58. Gustafson DH, Fryback DG, Rose JH. An evaluation of multiple trauma severity indices created by different index development strategies. Med Care 1983;21:674-91.
59. Perkins HS, Jonsen AR, Epstein WV. Providers as predictors: using outcome predictions in intensive care. Crit Care Med 1986;14:105-10.
60. Poses RM, Cebul RD, Centor RM. Evaluating physicians' probabilistic judgments. Med Decis Making 1988;8:233-40.
61. Tversky A, Kahneman D. Judgment under uncertainty: heuristics and biases. Science 1974;185:1124-9.
62. Feigal D, Black D, Grady D, et al. Planning for data management and analysis. In: Hulley SB, Cummings SR, editors. Designing clinical research: an epidemiological approach. 1st ed. Baltimore: Williams & Wilkins; 1988. p. 159-71.
63. Kahn MG. Clinical databases and critical care research. Crit Care Clin 1994;10:37-51.
64. Hulley SB, Cummings SR. Planning the measurements: precision and accuracy. In: Hulley SB, Cummings SR, editors. Designing clinical research: an epidemiological approach. 1st ed. Baltimore: Williams & Wilkins; 1988. p. 31-41.
65. Feinstein AR. Clinical epidemiology: the architecture of clinical research. 1st ed. Philadelphia: WB Saunders Company; 1985. p. 182-6.
66. Kleinbaum DG, Kupper LL, Muller KE. Applied regression analysis and other multivariable methods. 2nd ed. Belmont, CA: Duxbury Press; 1988. p. 320-8.
67. Hosmer DW, Jovanovic B, Lemeshow S. Best subsets logistic regression. Biometrics 1989;45:1265-73.
68. Jenrich RI. Stepwise discriminant analysis. In: Enslein K, Ralston A, Wilf H, editors. Statistical methods for digital computers. New York: John Wiley & Sons; 1977. p. 72-95.
69. Charlson ME, Ales KL, Simon R, MacKenzie CR. Why predictive indexes perform less well in validation studies. Arch Intern Med 1987;147:2155-61.
70. Colton T. Statistics in medicine. 1st ed. Boston: Little, Brown & Co; 1974. p. 207-14.
71. Zar JH. Biostatistical analysis. 1st ed. Englewood Cliffs, NJ: Prentice-Hall; 1974. p. 259-61.
72. Riegelman RK. Studying a study and testing a test. Boston: Little, Brown & Co; 1981. p. 119-30.
73. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561-77.
74. Berwick DM, Thibodeau LA. Receiver operating characteristic analysis of diagnostic skill. Med Care 1983;21:876-85.
75. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver-operating characteristic (ROC) curve. Radiology 1982;143:29-36.
76. McNeil BJ, Keeler E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975;293:211-5.
77. Hanley JA, McNeil BJ. A method of comparing the areas under receiver-operating characteristic curves derived from the same cases. Radiology 1983;148:839-43.
78. Lemeshow S, Hosmer DW. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol 1982;115:92-106.
79. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics A 1980;A9:1043-69.
80. Centers for Disease Control and Prevention. Projections of the number of persons diagnosed with AIDS and the number of immunosuppressed HIV-infected persons, United States, 1992-1994. MMWR Morb Mortal Wkly Rep 1992;41:1-19.
81. Rosen MJ, DePalo VA. Outcome of intensive care for patients with AIDS. Crit Care Clin 1993;9:107-14.
82. Smith RL, Levine SM, Lewis ML. Prognosis of patients with AIDS requiring intensive care. Chest 1989;96:857-61.
83. Chu DY. Predicting survival in AIDS patients with respiratory failure: application of the APACHE II scoring system. Crit Care Clin 1993;9:89-105.
84. Brown MC, Crede WB. Predictive ability of Acute Physiology and Chronic Health Evaluation II scoring applied to human immunodeficiency virus-positive patients. Crit Care Med 1995;23:848-53.
85. Wilairatana P, Looareesuwan S. APACHE II scoring for predicting outcome in cerebral malaria. J Trop Med Hyg 1995;98:256-60.
86. McClure W, Shaller D. Variations in Medicare expenditures per elder. Health Aff 1984;3:120-9.
87. Knaus W, Draper E, Wagner D. Chapter 1: Introduction. Crit Care Med 1989;17:S176-80.
88. Luft H, Bunker J, Enthoven AC. Should operations be regionalized? The empirical relation between surgical volume and mortality. N Engl J Med 1979;301:1364-9.
89. Luft H. The relation between surgical volume and mortality: an exploration of causal factors and alternative models. Med Care 1980;18:940-59.
90. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med 1985;13:818-29.
91. Wong DT, Knaus WA. Predicting outcome in critical care: the current status of the APACHE prognostic scoring system. Can J Anaesth 1991;38:374-83.
92. Oh TE, Hutchinson R, Short S, Buckley T, Lin E, Leung D. Verification of the Acute Physiology and Chronic Health Evaluation scoring system in a Hong Kong intensive care unit. Crit Care Med 1993;21:698-705.
93. Giangiuliani G, Mancini A, Gui D. Validation of a severity of illness score (APACHE II) in a surgical intensive care unit. Intensive Care Med 1989;15:519-22.
94. Marsh HM, Krishan I, Naessens JM, et al. Assessment of prediction of mortality by using the APACHE II scoring system in intensive care units. Mayo Clin Proc 1990;65:1549-57.
95. Chang RWS, Jacobs S, Lee B, Pace N. Predicting deaths among intensive care unit patients. Crit Care Med 1988;16:34-42.
96. Van Le L, Fakhry S, Walton LA, Moore DH, Fowler WC, Rutledge R. Use of the APACHE II scoring system to determine mortality of gynecologic oncology patients in the intensive care unit. Obstet Gynecol 1995;85:53-6.
97. Turner JS, Mudaliar YM, Chang RWS, Morgan CJ. Acute Physiology and Chronic Health Evaluation (APACHE II) scoring in a cardiothoracic intensive care unit. Crit Care Med 1991;19:1266-9.
98. Ludwigs U, Hulting J. Acute Physiology and Chronic Health Evaluation II scoring system in acute myocardial infarction: a prospective validation study. Crit Care Med 1995;23:854-9.
99. Cerra FB, Negro F, Abrams J. APACHE II score does not predict multiple organ failure or mortality in postoperative surgical patients. Arch Surg 1990;125:519-22.
100. Draper E, Wagner D, Russo M, et al. Chapter 3: Study design-data collection. Crit Care Med 1989;17:S186-93.
101. Wagner D, Draper E, Knaus W. Chapter 5: Development of APACHE III. Crit Care Med 1989;17:S199-203.
102. Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991;100:1619-36.
103. Teres D. Comment on "The case for using objective scoring systems to predict intensive care unit outcome." Crit Care Clin 1994;10:91-2.
104. Watts CM, Knaus WA. Comment on "Why severity models should be used with caution." Crit Care Clin 1994;10:111-6.
105. Barie PS, Hydo LJ, Fischer E. Comparison of APACHE II and III scoring systems for mortality prediction in critical surgical illness. Arch Surg 1995;130:77-82.
106. Youden WJ. Index for rating diagnostic tests. Cancer 1950;3:32-6.
107. LeGall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA 1993;270:2957-63.
108. Guerrero VM, Johnson RA. Use of the Box-Cox transformation with binary response models. Biometrika 1982;69:309-14.
109. Lemeshow S, Teres D, Spitz-Avrunin J, Gage RW. Refining intensive care unit outcome prediction by using changing probabilities of mortality. Crit Care Med 1988;16:470-7.
110. Teres D, Lemeshow S, Avrunin J. Validation of the mortality prediction model for ICU patients. Crit Care Med 1987;15:208-13.
111. Lemeshow S, Teres D, Spitz-Avrunin J, Pastides H. A comparison of methods to predict mortality of intensive care unit patients. Crit Care Med 1987;15:715-22.
112. Castella X, Gilabert J, Torner F, Torres C. Mortality prediction models in intensive care: Acute Physiology and Chronic Health Evaluation II and Mortality Prediction Model compared. Crit Care Med 1991;19:191-7.
113. Lemeshow S, Teres D, Klar J, Spitz-Avrunin J, Gehlbach SH, Rapoport J. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA 1993;270:2478-86.
114. Snyder JV, McGuirk M, Grenvik A, Stickler D. Outcome of intensive care: an application of a predictive model. Crit Care Med 1981;9:598-603.
115. Hamel MB, Goldman L, Teno J, Lynn J, Davis RB, Harrell FE. Identification of comatose patients at high risk for death or severe disability. JAMA 1995;273:1842-8.
116. Fagon JY, Chastre J, Novara A, Medioni P, Gibert C. Characterization of intensive care unit patients using a model based on the presence of organ dysfunctions and/or infection: the ODIN model. Intensive Care Med 1993;19:137-44.
117. Hebert PC, Drummond AJ, Singer J, Bernard GR, Russell JA. A simple multiple system organ failure scoring system predicts mortality of patients who have sepsis syndrome. Chest 1993;104:230-5.
118. O'Neill B, Zador P, Baker SP. Indexes of severity: underlying concepts-a reply. Health Serv Res 1979;14:68-76.
119. Champion HR, Frey CF, Sacco WJ. Determination of national normative outcomes for trauma. J Trauma 1984;24:651-2.
120. Champion HR, Sacco WJ, Carnazzo AJ, Copes W, Fouty WJ. Trauma score. Crit Care Med 1981;9:672-6.
121. Champion HR, Sacco W, Copes WS. A revision of the trauma score. J Trauma 1989;29:623-9.
122. Baker SP, O'Neill B. The Injury Severity Score: an update. J Trauma 1976;16:882-5.
123. Eichelberger MR, Champion HR, Sacco WJ, Gotschall, Copes WS, Bowman LM. Pediatric coefficients for TRISS analysis. J Trauma 1993;34:319-22.
124. Kaufmann CR, Maier RV, Kaufmann EJ, Rivara FP, Carrico CJ. Validity of applying adult TRISS analysis to injured children. J Trauma 1991;31:691-8.
125. Guirguis EM, Hong C, Liu D, Watters JM, Baille F, McIntyre RW. Trauma outcome analysis of two Canadian centers using the TRISS method. J Trauma 1990;30:426-9.
126. Tinkoff G, Rhodes M, Diamond D, Lucke J. Cirrhosis in the trauma victim. Ann Surg 1990;211:172-7.
127. Pories SE, Gamelli RL, Pilcher DB, et al. Practical evaluation of trauma deaths. J Trauma 1989;29:1607-10.
128. Offner PJ, Jurkovich GJ, Gurney J, Rivara FP. Revision of TRISS for intubated patients. J Trauma 1992;32:32-5.
129. Cushing BM, Teitelbaum SD, Burman W, Karges D, Bame W. Better data through direct physician entry of anatomic injuries. Med Decis Making 1991;11:S45-8.
130. Zoltie N, de Dombal FT. The hit and miss of ISS and TRISS. Br Med J 1993;307:906-9.
131. Baxt WG, Upenieks V. The lack of full correlation between the Injury Severity Score and the resource needs of injured patients. Ann Emerg Med 1990;19:1396-400.
132. Pillgram-Larsen J, Marcus M, Svennevig JL. Assessment of probability of survival in penetrating injuries using the TRISS methodology. Injury 1989;20:10-2.
133. Boyd CR, Tolson MA, Copes WS. Evaluating trauma care: the TRISS method. J Trauma 1987;27:370-8.
134. Champion HR, Copes WS, Sacco WJ, et al. A new characterization of injury severity. J Trauma 1990;30:539-45.
135. Hannan EL, Mendeloff J, Farrell LS, Cayten CG, Murphy JG. Validation of TRISS and ASCOT using a non-MTOS trauma registry. J Trauma 1995;38:83-8.
136. Markle J, Cayten CG, Byrne DW, Moy F, Murphy JG. Comparison between TRISS and ASCOT methods in controlling for injury severity. J Trauma 1992;33:326-32.
137. Sacco WJ, Copes WS, Bain LW, et al. Effect of preinjury illness on trauma patient survival outcome. J Trauma 1993;35:538-43.
138. Tremblay LN, Hyland RH, Schouten DB, Hanly PJ. Survival of acute myelogenous leukemia patients requiring intubation/ventilatory support. Clin Invest Med 1995;18:19-24.
139. Pollack AJ, Strong RM, Gribbon P, Shah H. Lack of predictive value of APACHE II score in hypoalbuminemic patients. JPEN J Parenter Enteral Nutr 1991;15:313-5.
140. Cowen JS, Kelley MA. Errors and bias in using predictive scoring systems. Crit Care Clin 1994;10:53-71.
141. Fitzgerald FT. Physical diagnosis versus modern technology: a review. West J Med 1990;152:377-82.
142. Silverstein MD. Prediction instruments and clinical judgment in critical care. JAMA 1988;260:1758-9.
143. Dragsted L, Jorgensen J, Jensen NH, Bonsing E, Jacobson E, Knaus WA. Interhospital comparisons of patient outcome from intensive care: importance of lead-time bias. Crit Care Med 1989;17:418-22.
144. Escarce JJ, Kelley MA. Admission source to the medical intensive care unit predicts hospital death independent of APACHE II score. JAMA 1990;264:2389-94.
145. Wong DT, Crofts SL, Gomez M, McGuire GP, Byrick RJ. Evaluation of the predictive ability of APACHE II system and hospital outcome in Canadian intensive care unit patients. Crit Care Med 1995;23:1177-83.
146. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. An evaluation of outcome from intensive care in major medical centers. Ann Intern Med 1986;104:410-8.
147. Champion HR, Sacco WJ. Measurement of patient illness severity. Crit Care Med 1982;10:522-3.
148. Li TL, Phillips MC, Shaw L, Cook EF, Natanson C, Goldman L. On-site physician staffing in a community hospital intensive care unit: impact on test and procedure use and on patient outcome. JAMA 1984;252:2023-7.
149. Holt AW, Bury LK, Bersten AD, Skowronski GA, Vedig AE. Prospective evaluation of residents and nurses as severity score collectors. Crit Care Med 1992;20:1688-91.
150. Strauss MB. Familiar medical quotations. 1st ed. Boston: Little, Brown & Company; 1968. p. 460.
151. Strauss MB. Familiar medical quotations. 1st ed. Boston: Little, Brown & Company; 1968. p. 461.
152. Siegler M. Pascal's wager and the hanging of crepe. N Engl J Med 1975;293:853-7.
153. Charlson ME, Sax FL, MacKenzie R, Fields SD, Braham RL, Douglas RG. Assessing illness severity: does clinical judgment work? J Chronic Dis 1986;39:439-52.
154. Lee KL, Pryor DB, Harrell FE, et al. Predicting outcome in coronary artery disease: statistical models versus expert clinicians. Am J Med 1986;80:553-60.
155. Chang RWS, Lee B, Jacobs S. Accuracy of decisions to withdraw therapy in critically ill patients: clinical judgment versus a computer model. Crit Care Med 1989;17:1091-7.
156. Meyer AA, Messick WJ, Young P, et al. Prospective comparison of clinical judgment and APACHE II score in predicting the outcome in critically ill surgery patients. J Trauma 1992;32:747-55.
157. Kruse JA, Thill-Baharozian MC, Carlson RW. Comparison of clinical assessment with APACHE II for predicting mortality risk in patients admitted to a medical intensive care unit. JAMA 1988;260:1739-42.
158. McClish DK, Powell SH. How well can physicians estimate mortality in a medical intensive care unit? Med Decis Making 1989;9:125-32.
159. Brannen AL, Godfrey LJ, Goetter WE. Prediction of outcome from critical illness: a comparison of clinical judgment with a prediction rule. Arch Intern Med 1989;149:1083-6.
160. Knaus WA, Wagner DP, Lynn J. Short-term mortality predictions for critically ill hospitalized adults: science and ethics. Science 1991;253:389-94.
161. Tierney WM, Fitzgerald J, McHenry R, et al. Physicians' estimates of the probability of myocardial infarction in emergency room patients with chest pain. Med Decis Making 1986;6:12-7.
162. Hyzy RC. ICU scoring and clinical decision making. Chest 1995;107:1482-3.
163. Fakhry SM, Rutledge R, Meyer AA. Severity of illness indices. In: Weigelt JA, Lewis FR, editors. Surgical critical care. Philadelphia: WB Saunders Company; 1996. p. 7-21.
164. Watts CM, Knaus WA. The case for using objective scoring systems to predict intensive care unit outcome. Crit Care Clin 1994;10:73-90.
165. Atkinson S, Bihari D, Smithies M, Daly K, Mason R, McColl I. Identification of futility in intensive care. Lancet 1994;344:1203-6.
166. Schafer JH, Maurer A, Jochimsen F, et al. Outcome prediction models on admission in a medical intensive care unit: do they predict individual outcome? Crit Care Med 1990;18:1111-7.
167. Rogers J, Fuller HD. Use of daily Acute Physiology and Chronic Health Evaluation (APACHE) II scores to predict individual patient survival rate. Crit Care Med 1994;22:1402-5.
168. Dobkin JE, Cutler RE. Use of APACHE II classification to evaluate outcome of patients receiving hemodialysis in an intensive care unit. West J Med 1988;149:547-50.
169. Chang RWS, Jacobs S, Lee B. Use of APACHE II severity of disease classification to identify intensive care unit patients who would not benefit from total parenteral nutrition. Lancet 1986;1:1483-6.
170. Thibault GS, Mulley AG, Barnett GO, Goldstein RL, Reder VA, Sherman EL. Medical intensive care: indications, interventions, and outcomes. N Engl J Med 1980;302:938-42.
171. Knaus WA, Wagner DP, Draper EA, Lawrence DE, Zimmerman JE. The range of intensive care services today. JAMA 1981;246:2711-6.
172. Wagner DP, Knaus WA, Draper EA. Identification of low-risk monitor admissions to medical-surgical ICUs. Chest 1987;92:423-8.
173. Becker RB, Zimmerman JE, Knaus WA, Wagner DP, Seneff MG, Draper EA. The use of APACHE III to evaluate ICU length of stay, resource use, and mortality after coronary artery by-pass surgery. J Cardiovasc Surg 1995;36:1-11.
174. The American College of Surgeons: The Minimum Standard. Bull Am Coll Surg 1924;8:4-12.
175. Osler T, Home L. Quality assurance in the surgical intensive care unit: where it came from and where it's going. Surg Clin North Am 1991;71:887-905.
176. Civetta JM, Hudson-Civetta JA. Maintaining quality of care while reducing charges in the ICU. Ann Surg 1985;202:524-30.
177. Sirio CA, Tajimi K, Tase C, Knaus WA, Wagner DP, Hirasawa H. An initial comparison of intensive care in Japan and the United States. Crit Care Med 1992;20:1207-15.
178. Pollack MM, Alexander SR, Clarke N, Ruttimann UE, Tesselaar HM, Bachulis AC. Improved outcomes from tertiary center pediatric intensive care: a statewide comparison of tertiary and nontertiary care facilities. Crit Care Med 1991;19:150-9.
179. Civetta JM, Hudson-Civetta JA, Nelson LD. Evaluation of APACHE II for cost containment and quality assurance. Ann Surg 1990;212:266-75.
180. Boyd O. Can standardized mortality ratio be used to compare quality of intensive care unit performance? Crit Care Med 1994;22:1706-9.
181. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. Prognosis in acute organ-system failure. Ann Surg 1985;202:685-93.
182. Hunt JP, Baker CC, Lentz CW, et al. Thoracic aorta injuries: management and outcome of 144 patients. J Trauma 1996;40:547-56.
183. Civetta JM. Prediction and definition of outcome. In: Civetta JM, Taylor RW, Kirby RR, editors. Critical care. 2nd ed. Philadelphia: JB Lippincott Co; 1992. p. 1873-98.
184. Feinstein AR. Conceptual barriers to clinical science. In: Feinstein AR, editor. Clinical judgment. Baltimore: Williams & Wilkins; 1967. p. 254-63.
185. Wasserman PD. Neural computing: theory and practice. New York: Van Nostrand Reinhold; 1989. p. 1-27.
186. McGonigal MD, Cole J, Schwab CW, Kauder DR, Rotondo MF, Angood PB. A new approach to probability of survival scoring for trauma quality assurance. J Trauma 1993;34:863-9.
187. Vassar MJ, Holcroft JW. The case against using the APACHE system to predict intensive care unit outcome in trauma patients. Crit Care Clin 1994;10:117-26.
188. Rutledge R, Fakhry S, Baker C, Oller D. Injury severity grading in trauma patients: a simplified technique based upon ICD-9 coding. J Trauma 1993;35:497-507.