Journal of Statistical Planning and Inference 160 (2015) 51–59
Inference for identifying outlying health care providers

Michael J. Racz a,∗, J. Sedransk b

a Department of Basic and Social Sciences, Albany College of Pharmacy and Health Sciences, Albany, NY 12208, United States
b Department of Statistics, Case Western Reserve University, Cleveland, OH 44106-7054, United States

Article history: Received 5 June 2013; Received in revised form 3 October 2014; Accepted 5 December 2014; Available online 19 December 2014.

Keywords: Coronary artery bypass graft surgery; Credible regions; Posterior predictive p-value
Abstract

Provider profiling is the evaluation of the performance of hospitals, doctors, and other medical practitioners to enhance the quality of care. Many jurisdictions have released public report cards comparing hospital or physician-specific outcomes. Given the importance of provider profiling, studying the methodology used and providing enhancements are essential. Ohlssen, Sharples and Spiegelhalter (2007) give a thoughtful evaluation of provider profiling methodology. In particular, they are concerned about whether a putative outlier is really an outlier or an observation in the tail of the common distribution for all practitioners, and present methodology to address this issue. In this paper we suggest as an alternative the use of a 100(1 − α)% credible region for provider fixed effects to identify outliers. Using both New York State bypass surgery data and simulated data of the same type as that used in Racz and Sedransk (2010), we compare the Ohlssen et al. (2007) approach with standard methodology, and with use of the credible region.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

Provider profiling is the evaluation of the performance of hospitals, doctors, and other medical practitioners to enhance the quality of care. Many jurisdictions have released public report cards comparing hospital or physician-specific outcomes. The objectives of provider profiling may be to seek corrective measures when a provider's performance is deemed unsatisfactory, or to increase public awareness so that individuals can make an informed decision about selection of a medical practitioner.

Given the importance of provider profiling, studying the methodology used and providing enhancements are essential. There is a large literature on this subject: the articles most relevant to our work are those by Christiansen and Morris (1997), Goldstein and Spiegelhalter (1996), Normand et al. (1997), Ohlssen et al. (2007), Thomas et al. (1994), Austin et al. (2001), Austin (2002), Austin et al. (2003), Austin (2005) and Shahian et al. (2005). The last five articles provide comparisons of alternative methods.

The New York State Department of Health (NYS DOH) is a leader in provider profiling. Since 1989 the NYS DOH has evaluated the performance of the hospitals in New York that are licensed to perform coronary artery bypass graft surgery (CABG) (New York State Department of Health, 2009). Racz and Sedransk (2010) presented a Bayesian analogue of the NYS DOH methodology. In both approaches a hospital's actual measure of performance is compared with its expected performance, assuming the same model for all hospitals but using this hospital's own patient case mix to obtain the expected performance measure. Authors like Normand et al. (1997) and Shahian et al. (2005) have advocated adding hospital random effects to the models used by the NYS DOH (see (1) and (2) below). However, with a thoroughly risk-adjusted data set such as
∗ Correspondence to: Albany College of Pharmacy and Health Sciences, 106 New Scotland Avenue, Albany, NY 12208, United States. Tel.: +1 518 694 7182; fax: +1 518 694 7382.
E-mail addresses: [email protected] (M.J. Racz), [email protected], [email protected] (J. Sedransk).
http://dx.doi.org/10.1016/j.jspi.2014.12.003
that used by the NYS DOH, using a random effects model rather than (2) is not advantageous. Depending on the criterion used to detect an outlier, we will either identify many fewer outliers or somewhat fewer outliers by assuming a random effects model (Racz and Sedransk, 2010). This occurs, mainly, because of inappropriate shrinkage. Section 4 has a brief discussion of methodology when the data set is not thoroughly risk-adjusted.

Ohlssen et al. (2007) are concerned about whether a putative outlier is really an outlier or an observation in the tail of the distribution, and present methodology to address this issue. This is an important concern for applications such as provider profiling where decisions about outlying status may have serious practical consequences. While a potentially useful methodology, it does not provide a way to determine the number and identity of the providers to flag as outliers. Thus, we present an alternative method, i.e., one that provides a simultaneous 100(1 − α)% credible region for the ensemble of provider effects, thus addressing the multiple comparisons problem directly.

In this paper we use both Cardiac Surgery Reporting System (CSRS) data and simulated data of the same type as that used in Racz and Sedransk (2010) to evaluate the Ohlssen et al. (2007) (hereafter OSS) methodology together with the Bayesian analogue of the NYS DOH method, and use of a credible region. Note that this is also the first systematic evaluation of the OSS approach. In Section 2 we present the Bayesian NYS DOH methodology, the OSS technique and the procedure to obtain the credible regions. Section 3 contains a description of the CSRS data and our simulation procedure followed by the results of our investigation. Additional discussion and a brief summary are provided in Section 4.

2. Methodology

The first phase of the analysis by the NYS DOH is to perform stepwise logistic regression to find significant predictors of in-hospital mortality from approximately 40 pre-operative risk factors from CSRS. In detail, the data base is split into a development and a validation group with roughly equal mortality rates. The risk factors are entered as candidate independent variables in a stepwise logistic regression model with in-hospital mortality as the binary dependent variable. Significant variables in this model are then validated in a stepwise model on the validation group. Significant variables from the validation model are then used to develop a stepwise model with the entire data set. The fit of the final logistic regression model is measured in terms of its discrimination (c-statistic). This method has been used since 1992 and the c-statistic is about 0.80 in each year. For additional details, see Hannan et al. (2006).
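To make the development/validation selection step concrete, the following is a minimal sketch of a forward-selection logistic regression with a c-statistic check. It is not the NYS DOH implementation: the entry rule (smallest Wald p-value below a threshold), the simple random half-split, and the simulated risk factors are assumptions made only for illustration.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def forward_stepwise_logit(X, y, alpha_enter=0.05):
    """Forward stepwise selection for a logistic model: at each step add the
    candidate risk factor with the smallest Wald p-value, if it is < alpha_enter."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            fit = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0)
            pvals[j] = fit.pvalues[-1]          # p-value of the newly entered factor
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Development / validation split, here approximated by a simple random half split
# of hypothetical data; the true split balances the mortality rates.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                 # hypothetical pre-operative risk factors
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] - 3))))
dev = rng.permutation(len(y)) < len(y) // 2

vars_dev = forward_stepwise_logit(X[dev], y[dev])
vars_val = forward_stepwise_logit(X[~dev][:, vars_dev], y[~dev])
final_vars = [vars_dev[k] for k in vars_val]    # factors surviving both stages
final_fit = sm.Logit(y, sm.add_constant(X[:, final_vars])).fit(disp=0)
print("c-statistic:", roc_auc_score(y, final_fit.predict()))  # discrimination of the final model
```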
Assuming there are r hospitals, with n_i patients in hospital i, define Y_ij = 1 if the jth patient in the ith hospital died and Y_ij = 0 otherwise. Also let X′_ij = (1, X_ij1, . . . , X_ijk) and β′ = (β_0, β_1, . . . , β_k) denote the regression structure. Let

p_ij = Pr(Y_ij = 1),   i = 1, . . . , r, j = 1, . . . , n_i,    (1)

and assume that

logit(p_ij) = X′_ij β.    (2)

It is assumed throughout that, conditional on the p_ij, the Y_ij are mutually independent. The pre-operative risk factors, X, are described in NYS DOH (2009).

For hospital i we use Bayesian inference to obtain the predictive distribution of ∑_{j=1}^{n_i} Y_ij, assuming the statewide model in (1) and (2). We then obtain a 100(1 − α)% predictive interval for ∑_{j=1}^{n_i} Y_ij. Hospital i is considered to be an outlier if, and only if, ∑_{j=1}^{n_i} y_ij^obs is outside this interval. Define Y^obs = {Y_ij : i = 1, . . . , r, j = 1, . . . , n_i} and p = {p_ij : i = 1, . . . , r, j = 1, . . . , n_i}, and assume the locally uniform prior distribution on β, β ∼ N(0, 10^4 I).
Let Y_ij^Pred denote a predicted value of Y for patient j in hospital i. In applications of this type the posterior predictive distribution of the total number of deaths at hospital i under (2) is typically taken as

f(∑_{j=1}^{n_i} Y_ij^Pred | y_{-i}^obs) = ∫ g(∑_{j=1}^{n_i} Y_ij^Pred | p) h(p | y_{-i}^obs) dp,    (3)

where y_{-i}^obs is the set of observed values except for those in hospital i. To reduce computations, (3) is often approximated by

f(∑_{j=1}^{n_i} Y_ij^Pred | y^obs) = ∫ g(∑_{j=1}^{n_i} Y_ij^Pred | p) h(p | y^obs) dp.

Using the Gibbs sampler it is straightforward to make draws from f(∑_{j=1}^{n_i} Y_ij^Pred | y_{-i}^obs) or f(∑_{j=1}^{n_i} Y_ij^Pred | y^obs) and obtain a 100(1 − α)% interval for ∑_{j=1}^{n_i} Y_ij^Pred using the (α/2)100th and (1 − (α/2))100th percentiles of the draws. Hospital i is considered a high outlier if ∑_{j=1}^{n_i} y_ij^obs is greater than the (1 − (α/2))100th percentile of the predictive distribution of ∑_{j=1}^{n_i} Y_ij^Pred.
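As an illustration of the interval-based rule just described, the sketch below computes draws of ∑_j Y_ij^Pred and the 100(1 − α)% predictive interval from a matrix of posterior draws of β. The posterior draws, design matrix and observed death count used here are placeholders; in the paper these quantities come from the WinBUGS fit of (1) and (2).

```python
import numpy as np

def predictive_death_draws(beta_draws, X_hosp, rng):
    """For each posterior draw of beta, simulate the total number of deaths
    among this hospital's own patients under the statewide model (2)."""
    logits = X_hosp @ beta_draws.T                 # (n_i patients) x (S draws)
    p = 1.0 / (1.0 + np.exp(-logits))
    return rng.binomial(1, p).sum(axis=0)          # S draws of sum_j Y_ij^Pred

def flag_high_outlier(beta_draws, X_hosp, observed_deaths, alpha=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    draws = predictive_death_draws(beta_draws, X_hosp, rng)
    lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return observed_deaths > hi, (lo, hi)          # high outlier flag and interval

# Illustration with made-up posterior draws and a made-up design matrix for hospital i.
rng = np.random.default_rng(1)
S, n_i, k = 10000, 300, 6
beta_draws = rng.normal([-3.0, 0.5, 0.3, 0.2, 0.1, 0.4], 0.05, size=(S, k))
X_hosp = np.column_stack([np.ones(n_i), rng.normal(size=(n_i, k - 1))])
print(flag_high_outlier(beta_draws, X_hosp, observed_deaths=25))
```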
For the computations we have used WinBUGS version 1.4. To examine convergence we ran three chains with different starting values. Each chain had a burn-in period of 1000, and a further 10,000 iterations were monitored. Convergence was determined using the Brooks–Gelman–Rubin multiple-chain diagnostic (Brooks and Gelman, 1998) with the (default) 80% credible intervals.

In their Section 3.4.2 OSS describe a sampling model for outlying observations. (Here, we replace their word "best" with "worst", which is more appropriate for our application.) "Suppose that we have identified a provider m as being, say, the 'observed worst' out of r providers, and we wish to assess whether such an extreme observation could have arisen by chance. Then we can consider a set of r hypothetical providers, all with the same characteristics as m and arising from the null model, and simulate the distribution of the 'observed worst', which can then be used as a reference distribution to obtain a [Bayesian predictive] p-value". Following OSS we use the Bayesian predictive p-value
Pr(∑_{j=1}^{n_i} Y_ij^Pred ≥ ∑_{j=1}^{n_i} y_ij^obs | y_{-i}^obs)    (4)
to determine the worst provider, m.

The OSS procedure starts by deleting the data from the worst performer, m. Then make draws from the joint posterior distribution g(p | y_{-m}^obs) under the null (statewide) model. Defining y_mi^Pred = ∑_{j=1}^{n_m} y_mij^Pred, i = 1, . . . , r, make draws, y_m1^Pred, . . . , y_mr^Pred, from the joint posterior predictive distribution of the Y_ij,

f(Y^Pred | y_{-m}^obs) = ∫ s(Y^Pred | p) t(p | y_{-m}^obs) dp,

where each of these r draws is for a provider with an identical patient structure to that of provider m. Next, evaluate Max_{m;r}^Pred = max(y_m1^Pred, . . . , y_mr^Pred). Repeating this process B − 1 times yields Max_{m;r}^Pred(1), . . . , Max_{m;r}^Pred(B). One may compare ∑_{j=1}^{n_m} y_mj^obs with this set of B maxima to obtain the OSS p-value, which can be used to assess whether the extreme observation for provider m could have arisen by chance. (A brief numerical sketch of this step is given after (5) below.)

The objective of the OSS procedure is to determine whether the observed worst, observed next-to-worst, . . . providers' values could have arisen by chance, a desirable goal. They do this by comparing "the ith ranked provider with outcomes that were expected from the ith true [worst] provider" (page 882), yielding a set of r p-values, corresponding to the r providers. However, this does not provide a suitable way of determining the number (and identity) of providers to flag as outliers.

To provide a procedure that has specified coverage we add provider fixed effects to the logit specification in (2) and then use the method introduced by Besag et al. (1995) to obtain simultaneous credible regions for the set of fixed effects. Start by defining λ = {λ_1, . . . , λ_{r−1}} and W as a ∑_{i=1}^{r} n_i × (r − 1) matrix with any row corresponding to a patient in the first r − 1 hospitals having a 1 in the column referring to the patient's hospital and zeros elsewhere. For any patient from the rth hospital the corresponding row in W has −1 in each column. Then (2) is replaced by

logit(p) = Xβ + Wλ.    (5)
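Returning to the OSS reference distribution described above, the sketch below simulates, for each of B replicates, r hypothetical providers with provider m's patient mix under the null model and records the largest predicted death count; the observed count for m is then compared with these B maxima. Whether a fresh posterior draw of β is used for each replicate, as done here, is an implementation detail assumed for illustration, as are the example inputs.

```python
import numpy as np

def oss_max_p_value(beta_draws_minus_m, X_m, observed_deaths_m, r, B=2000, rng=None):
    """Ohlssen-Sharples-Spiegelhalter style check for the 'observed worst' provider m:
    simulate r hypothetical providers, all with provider m's patient mix, under the
    null statewide model, record the worst (maximum) predicted death count, repeat
    B times, and compare the observed count with these B maxima."""
    if rng is None:
        rng = np.random.default_rng(0)
    maxima = np.empty(B)
    for b in range(B):
        beta = beta_draws_minus_m[rng.integers(len(beta_draws_minus_m))]    # one posterior draw
        p = 1.0 / (1.0 + np.exp(-(X_m @ beta)))                             # null-model risks for m's patients
        deaths = rng.binomial(1, p[:, None], size=(len(p), r)).sum(axis=0)  # r hypothetical providers
        maxima[b] = deaths.max()
    return np.mean(maxima >= observed_deaths_m)                             # OSS p-value

# Illustration with made-up posterior draws of beta and a made-up design matrix for provider m.
rng = np.random.default_rng(4)
beta_draws = rng.normal([-3.0, 0.4, 0.3], 0.05, size=(5000, 3))
X_m = np.column_stack([np.ones(250), rng.normal(size=(250, 2))])
print(oss_max_p_value(beta_draws, X_m, observed_deaths_m=25, r=35))
```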
The Besag et al. (1995) procedure to construct simultaneous credible regions for an entire vector x of dimension n is as follows. Denoting the stored sample from s MCMC replicates by {x_i^(t) : i = 1, . . . , n; t = 1, . . . , s}, order {x_i^(t) : t = 1, . . . , s} separately for each component i to obtain order statistics x_i^[t] and ranks r_i^(t), t = 1, . . . , s. For fixed k ∈ {1, . . . , s} let t* be the smallest integer such that x_i^[s+1−t*] ≤ x_i^(t) ≤ x_i^[t*], for all i, for at least k values of t. (That is, t* is equal to the kth order statistic of the set {max(max_i r_i^(t), s + 1 − min_i r_i^(t)) : t = 1, . . . , s}.) Then, by this construction, {[x_i^[s+1−t*], x_i^[t*]] : i = 1, . . . , n} is a set of simultaneous credible intervals containing at least 100k/s% of the empirical distribution of x = (x_1, . . . , x_n).
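A minimal numpy sketch of this construction is given below; applied to the MCMC draws of the hospital effects λ from (5), a hospital is flagged when 0 lies outside its simultaneous interval. The example draws are artificial and stand in for the stored WinBUGS replicates.

```python
import numpy as np

def besag_simultaneous_intervals(draws, level=0.95):
    """Besag et al. (1995) simultaneous credible intervals from an MCMC sample.
    draws: array of shape (s, n), i.e. s stored iterates of an n-vector (here the
    hospital effects lambda); returns an (n, 2) array of interval endpoints."""
    s, n = draws.shape
    k = int(np.ceil(level * s))
    order = np.sort(draws, axis=0)                      # order statistics x_i^[t], column-wise
    ranks = draws.argsort(axis=0).argsort(axis=0) + 1   # ranks r_i^(t) in 1..s
    M = np.maximum(ranks.max(axis=1), s + 1 - ranks.min(axis=1))  # one value per iterate t
    t_star = int(np.sort(M)[k - 1])                     # kth smallest M_t
    lower = order[s - t_star, :]                        # x_i^[s+1-t*]
    upper = order[t_star - 1, :]                        # x_i^[t*]
    return np.column_stack([lower, upper])

# A hospital is flagged when 0 falls outside its simultaneous interval for lambda_i.
rng = np.random.default_rng(2)
lam_draws = rng.normal(0, 0.2, size=(10000, 35))        # artificial posterior draws of lambda
lam_draws[:, 3] += 0.9                                  # one clearly elevated hospital effect
region = besag_simultaneous_intervals(lam_draws, level=0.80)
outliers = np.where((region[:, 0] > 0) | (region[:, 1] < 0))[0]
print(outliers)
```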
The stored MCMC replicates were loaded into SAS version 9.3 (SAS Institute, Cary, North Carolina), from which the credible regions were calculated.

In Section 3 we compare the OSS procedure with the Bayesian analogue of the NYS DOH method and the Besag et al. (1995) method applied to (1) and (5). In our application hospital i is declared to be an outlier if λ_i = 0 is outside the credible region.

3. Comparisons

3.1. CSRS data

We first use the NYS DOH Cardiac Surgery Reporting System (CSRS) data from 1992 to 2006 to compare the Bayesian analogue of the NYS DOH method, the OSS method, and the Besag technique, all described in Section 2.
Table 1
Posterior predictive p-values from NYS DOH and Ohlssen et al. (2007) methods and outlier indicators from Besag et al. (1995) credible regions evaluated for the 'observed worst' outliers for CABG in New York State (1992–2006).

Year   No. hospitals   "Worst" performer    n      D    Exp     Posterior predictive p-value   Ohlssen p-value   Besag 80%   Besag 90%   Besag 95%
1992   31              St. Peters            484   22   10.58   0.0005                         0.0124            1           1           1
1993   31              Buffalo General      1191   46   28.72   0.0007                         0.0198            1           1           1
1994   31              Bellevue               93    6    2.11   0.0148                         0.3369            0           0           0
1995   31              Millard Fillmore      708   30   16.13   0.0006                         0.0159            1           1           1
1996   32              St. Vincent           522   23   13.00   0.0044                         0.1260            1           0           0
1997   33              Buffalo General      1217   47   29.79   0.0004                         0.0217            1           1           0
1998   33              St. Vincent           551   25    9.99   <0.0001                        0.0008            1           1           1
1999   33              Brooklyn              284   22    7.05   <0.0001                        <0.0001           1           1           1
2000   34              Ellis Hospital        433   19    8.61   0.0009                         0.0273            1           1           1
2001   35              Presbyterian          486   18    9.99   0.0102                         0.2944            1           1           0
2002   36              Buffalo General       663   26   12.65   0.0003                         0.0099            1           1           1
2003   37              Brooklyn               96    6    2.16   0.0079                         0.2525            1           0           0
2004   39              No high outliers
2005   39              Upstate               238   11    4.29   0.0021                         0.1011            1           1           0
2006   39              Einstein              131    8    1.85   0.0001                         0.0123            1           1           1

Notes. "Worst" performer: the hospital that is a high outlier using the NYS DOH method and has the smallest posterior predictive p-value. n: number of cases. D: number of deaths. Exp: expected number of deaths from the null model. Posterior predictive p-value: Bayesian posterior predictive p-value from Eq. (4). Ohlssen p-value: p-value computed applying the Ohlssen et al. (2007) method. Besag 80%, 90%, 95%: outlier indicators (1 = outlier) using 80%, 90% and 95% credible regions.
This data set contains detailed information about every patient who had CABG surgery in New York since 1 January 1989, including demographic data; clinical risk factors and complications; dates of admission, surgery, and discharge; and discharge status.

Table 1 has, in parallel columns, the year, total number of hospitals, the worst performer (based on the Bayesian posterior predictive p-value), the number of patients (n), the number of deaths (D) and the expected number of deaths (Exp). The seventh column gives the results for the Bayesian posterior predictive p-value from (4). The eighth column gives the p-values for the OSS method for a single putative outlier. Note that in their Appendix A.3 Ohlssen et al. (2007) present a justification for treating the p-values as if they were classical p-values (with uniform (0, 1) distributions). Finally, the last three columns indicate the outlier status of each hospital corresponding to the 80%, 90% and 95% Besag credible regions. Here, 1 denotes an outlier at the specified level.

We start with the comparisons in Table 1 because of the emphasis in Ohlssen et al. (2007) on assessing whether the observed worst value could have arisen by chance. Comparing columns 7 and 8, the results from using the OSS method vis-à-vis the usual Bayesian method are similar in many cases, but very different in others. Notably, there is a large increase in the p-values for hospitals with small patient volumes (1994, 2003). We investigate this further using constructed data sets (Section 3.2).

The OSS and Besag methods identify the same outliers in most cases. First, consider the Besag 80% credible region. Then, if the OSS criterion for identifying an outlier is a p-value of 0.10 or less, there is agreement in 10 of the 14 cases (exceptions 1996, 2001, 2003, 2005). For a Besag 90% region and OSS criterion of 0.05 or less, they agree in 12 cases (exceptions 2001, 2005). Finally, for a Besag 95% region and OSS criterion of 0.025 or less, they agree in 12 cases (exceptions 1997, 2000).

We next consider the common situation for these CABG data sets that two hospitals were flagged as plausible outliers by the NYS DOH method. In these CSRS data (1992–2006) this occurred in 1993, 1995–1996, 1998 and 2000–2002. In Table 2 we present in parallel columns the year, hospital ID, patient volume, number of deaths, two types of p-value, and indicators of the outlier status (outlier = 1) of each hospital corresponding to the 80%, 90% and 95% Besag credible regions. The OSS p-values (sixth column) follow the procedure described in Section 2. That is, for each year the p-value for the first listed hospital is a comparison of the observed mortality for that hospital with the set of maximal predicted mortalities. For the second listed hospital the observed mortality is compared with the set of second-largest predicted mortalities.

Considering the two worst performers for each year (Table 2), the OSS and Besag methods identify the same outliers in most cases. For the Besag 80% credible region there is agreement between the two methods in 11 of the 14 cases if the OSS criterion is 0.10 or less. For the Besag 90% credible region there is agreement in 9 cases if the OSS criterion is 0.05 or less. Finally, for the Besag 95% credible region there is agreement in 11 cases if the OSS criterion is 0.025 or less.

As described in Section 2, the Besag method can be used to identify the set of outliers present in any year, controlling the overall "error rate". For these CSRS data sets the numbers of outliers identified using the 80% and 90% Besag regions are:

Year        1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006
Besag 80%      1     2     0     1     1     1     2     1     2     2     2     1     0     1     1
Besag 90%      1     1     0     1     0     1     1     1     2     1     2     0     0     1     1
Table 2
Posterior predictive p-values from NYS DOH and Ohlssen et al. (2007) methods and outlier indicators from Besag et al. (1995) credible regions evaluated for the 'observed two worst' outliers for CABG in New York State (1993, 1995, 1996, 1998, 2000, 2001 and 2002).

Year   2 "Worst" performers   n      D    Posterior predictive p-value   Ohlssen p-value(a)   Besag 80%   Besag 90%   Besag 95%
1993   Buffalo General        1191   46   0.0015                         0.0114               1           1           1
       Strong                  386   20   0.0069                         0.0039               1           0           0
1995   Millard Fillmore        708   30   0.0006                         0.0123               1           1           1
       Upstate                 450   15   0.0087                         0.0157               0           0           0
1996   St. Vincent             522   23   0.0049                         0.0962               1           0           0
       Lenox Hill              860   32   0.0148                         0.0353               0           0           0
1998   St. Vincent             551   25   <0.0001                        0.0005               1           1           1
       Bellevue                 55    4   0.0047                         0.0053               1           0           0
2000   Ellis Hospital          433   19   0.0005                         <0.0001              1           1           1
       Mount Sinai             380   19   0.0021                         0.0145               1           1           1
2001   Presbyterian            486   18   0.0127                         0.2714               1           1           0
       Westchester             732   30   0.0186                         0.0541               1           0           0
2002   Buffalo General         663   26   0.0004                         0.0061               1           1           1
       NYU                     307   16   0.0104                         0.0322               1           1           0

(a) Ohlssen et al. (2007) p-values. For each year the p-value for the first listed hospital is a comparison of its observed mortality with the set of maximal predicted mortalities. For the second listed hospital its observed mortality is compared with the set of second-largest predicted mortalities.
Thus, the analyses of the results in Tables 1 and 2, i.e., for one or two putative outliers, represent typical behavior for CABG in New York State.

3.2. Constructed data sets

To compare the Bayesian analogue of the NYS DOH method, the OSS method and the Besag technique, we carried out two studies using constructed data sets based on the patient records from the 2002 CSRS. In the first study each constructed data set contains a specified hospital, i*, that is designated as a high outlier. We generated the data for our simulation study by modifying (2) as

logit(p_ij) = X′_ij β + η_i,    (6)
where η_i = 0 for i ≠ i* and η_{i*} was chosen so that this hospital has a log-odds of mortality c% higher than average for an average patient. We generated {Y_ij : i = 1, . . . , r, j = 1, . . . , n_i} where the Y_ij are mutually independent with Y_ij = 1 with probability (1 + exp(−Z_ij))^{−1} and Y_ij = 0 otherwise. Here Z_ij = X′_ij β̂ + η_i, where β̂ is the maximum likelihood estimator of β. (A brief illustrative sketch of this generation step is given just before Table 3.) We asked several experts in CABG outcomes research to suggest appropriate values for c. Given somewhat different assessments, we present results for c = 25, 50 and 75. From the 36 hospitals that performed CABG surgery in 2002 we use three to represent different patient volumes and characteristics. For each designated high outlier hospital and each value of c we generated 100 data sets.

The results for hospital 1 (low volume), hospital 2 (medium volume) and hospital 3 (large volume) are presented in Table 3. The first column identifies the hospital and its volume. The second column gives the values of c while the third column gives the number of times, m1 (out of 100 replicates for c > 0, 500 replicates for c = 0), that the hospital was flagged as an outlier (0.05 level) using the Bayesian analogue of the NYS DOH method. The fourth column shows the number of times, m2 (out of m1), that the hospital was deemed the worst performer. The fifth column gives the number of times (out of 100 replicates) that the hospital was declared an outlier (0.10, 0.05, 0.025 levels) using two different OSS methods while the sixth column gives the results (80%, 90%, 95% regions) for the Besag method. Throughout, our comparisons are based on observed proportions (from the 100 replicates) rather than formal tests.

First, consider the low volume hospital, H1, and c = 25. It would be detected as an outlier 58 times (out of 100) by the NYS method (0.05 level), but would be the worst performer in only 48 data sets. Since the OSS procedure focuses on the worst performer, not necessarily H1, we have modified their method in two ways to permit an assessment for all 100 data sets. To obtain the results for the first modification (labelled A in the fifth column) we delete H1's record for each data set and compare its observed value with the largest predicted one, even if H1 is not the worst performer for that data set. We call this the conservative modification because this is a stringent criterion for flagging an outlier. Doing this, the OSS procedure would detect H1 as an outlier only 17 times (0.025 level). (Recall that the OSS p-values are one-sided.) For H1 and c = 75, the NYS method would identify it as an outlier in 90 data sets while the OSS method would do so only 71 times. Comparing the values in the third column (for the NYS method) with those in the fifth column (OSS, 0.025 level), it is apparent that this OSS method is quite conservative for the low volume hospital (all values of c) and the medium volume hospital (small c).
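The sketch below illustrates the generation step in (6) for one replicate. The way η_{i*} is tied to the target of a log-odds of mortality c% higher than average for an average patient is an assumption made for illustration (here, c% of the absolute log-odds at the mean covariate vector), as are the design matrix, hospital labels and the value standing in for β̂.

```python
import numpy as np

def simulate_outcomes(X, hospital, beta_hat, outlier_id, c, rng):
    """Simulate in-hospital deaths under model (6): Z_ij = X_ij' beta_hat + eta_i,
    with eta_i = 0 except at the designated outlier hospital. How eta is tied to the
    'c% higher log-odds for an average patient' target is an assumption here."""
    eta = np.zeros(int(hospital.max()) + 1)
    baseline = X.mean(axis=0) @ beta_hat                # log-odds for an 'average' patient
    eta[outlier_id] = (c / 100.0) * abs(baseline)       # assumed interpretation of c
    z = X @ beta_hat + eta[hospital]
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-z)))

# One replicate: hypothetical design matrix, hospital labels and stand-in for beta_hat.
rng = np.random.default_rng(3)
n, k, r = 16000, 6, 36
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
hospital = rng.integers(0, r, size=n)
beta_hat = np.array([-3.2, 0.5, 0.3, 0.2, 0.1, 0.4])    # stands in for the MLE from (2)
y = simulate_outcomes(X, hospital, beta_hat, outlier_id=1, c=50, rng=rng)
```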
Table 3
Using 100 randomly generated data sets (500 for c = 0), the number of times a low, medium or high volume hospital is classified as an outlier using the NYS DOH method and, of the outlier cases, the number of times the hospital is the "worst performer". The last two columns have the number of outliers detected using the Ohlssen and Besag methods for p-value cutoffs 0.10, 0.05, 0.025 (Ohlssen A and B) and 80%, 90%, 95% credible regions (Besag).

Hospital (Volume)   c(a)   Outlier via       Worst performer   Ohlssen A              Ohlssen B              Besag
                           NYS DOH method    via p-value       0.10   0.05   0.025    0.10   0.05   0.025    80%   90%   95%
H1. Low (113)         0    3 of 500            3                  0      0      0        0      0      0       2     0     0
                     25    58 of 100           48                32     24     17       41     30     21      38    33    25
                     50    75 of 100           72                59     54     49       62     56     50      67    62    55
                     75    90 of 100           89                83     73     71       83     74     72      87    84    77
H2. Medium (301)      0    5 of 500            4                  0      0      0        1      0      0       4     0     0
                     25    97 of 100           97                85     80     74       85     80     74      93    88    83
                     50    99 of 100           98                95     94     91       96     95     92      96    96    95
                     75    100 of 100         100               100    100    100      100    100    100     100   100   100
H3. High (743)        0    6 of 500            5                  1      0      0        1      0      0       5     1     0
                     25    100 of 100         100               100    100    100      100    100    100     100   100   100
                     50    100 of 100         100               100    100    100      100    100    100     100   100   100
                     75    100 of 100         100               100    100    100      100    100    100     100   100   100

A: Conservative method. Compare the hospital's observed value with the distribution of the maximal predicted ones.
B: Less conservative method. Compare the hospital's observed value with the distribution of the ith ranked predicted values when the hospital is the ith ranked.
(a) Data sets were constructed such that the log-odds of mortality of patients at the specified hospital was c% higher than average for an average patient.
The results for the OSS and Besag methods are similar, although the Besag method always detects more outliers. Let (o, b) denote the number of outlier detections for OSS and Besag, respectively. Then for H1 (o, b) = (32, 38), (59, 67) and (83, 87) for c = 25, 50 and 75 using a 0.10 p-value cutpoint and 80% credible region (since the OSS p-values are one-sided). If we use a 0.05 p-value cutpoint and 90% credible region, (o, b) = (24, 33), (54, 62) and (73, 84) for H1. For a 0.025 cutpoint and 95% credible region, (o, b) = (17, 25), (49, 55) and (71, 77) for H1. The results for the medium volume hospital and c = 25 are similar to the ones given above, while o and b are the same, or nearly the same, for all values of c for the large volume hospital and for c = 50 and 75 for the medium volume hospital. Thus, neither the conservative OSS method nor the Besag method is effective in detecting outliers when the hospital patient volume is low.

Next, consider a less conservative approach for the OSS method. Here, we compare a specified hospital H's observed value with the distribution of the maximum predicted values when H is the worst performer, but compare H's observed value to the distribution of the ith ranked predicted values when H is the ith ranked performer. (This is the method that Ohlssen et al. (2007) discuss in their Section 5.) The results are given in column 5B of Table 3. Comparing the results in columns 5A and 5B, it is clear that the two OSS methods are very similar except for the low volume hospital H1 and c = 25. In the latter case, this OSS method identifies more outliers, but (still) many fewer than those identified by the NYS method, i.e., 58 by the latter against 21 for this OSS method (0.025 level). Thus, this OSS method, like the first OSS method, is quite conservative for the low volume hospital (all values of c) and the medium volume hospital (small c). The results for this less conservative OSS method and the Besag method are similar (compare columns 5B and 6), although the Besag method still detects more outliers (exception: H1, c = 25, OSS (0.10) vs. Besag (80%)).

We have also investigated the frequency with which the methods yield false positives by taking c = 0 in the simulations, i.e., there are no outliers. The results corresponding to c = 0, presented in Table 3, use 500 data sets rather than the 100 data sets used for the other values of c. Clearly, the two OSS methods and the Besag method yield very few false positives.
Table 4
Using 100 randomly generated data sets, the number of times a hospital was classified as an outlier using the NYS DOH method and, of the outlier cases, the number of times the hospital is the "worst performer". The last two columns have the number of outliers detected using the Ohlssen and Besag methods for p-value cutoffs 0.10, 0.05, 0.025 (Ohlssen A and B) and 80%, 90%, 95% credible regions (Besag). H1 is the small volume hospital and H2 the medium volume hospital.

Hospital (Volume)   c(a)   Outlier via       Worst performer   Ohlssen A              Ohlssen B              Besag
                           NYS DOH method    via p-value       0.10   0.05   0.025    0.10   0.05   0.025    80%   90%   95%
H1. Low (113)        25    50                20                 31     27     22       46     41     37      37    33    29
                     50    74                19                 60     51     49       73     70     68      68    60    52
                     75    95                14                 87     80     77       95     92     91      92    87    83
H2. Medium (301)     25    90                75                 76     73     65       83     81     70      83    76    74
                     50    100               81                 93     91     86       98     95     91      94    92    89
                     75    100               86                 99     98     98      100    100    100     100   100    98

A: Conservative method. Compare the hospital's observed value with the distribution of the maximal predicted ones.
B: Less conservative method. Compare the hospital's observed value with the distribution of the ith ranked predicted values when the hospital is the ith ranked.
(a) Data sets were constructed such that the log-odds of mortality of patients at Hospital 1 and Hospital 2 was c% higher than average for an average patient.
For example, with a nominal p-value cutoff of 0.025, neither OSS method detects any outliers, i.e., 0 out of 500. Similarly, with a nominal 95% Besag credible region, there are no outliers. There are modest increases in the rates for the Besag method for the nominal 80% region for the larger volume hospitals (e.g., for H2, 0.008 for Besag vs. 0.002 for OSS-B). For the 90% and 95% nominal credible regions and corresponding p-value cutoffs the three methods have, with one exception, the same false positive rates.

The second study designates two specified hospitals, i* and i**, as high outliers. We did this to mimic the CSRS experience as much as possible, i.e., approximately two high outliers per year are identified using the NYS DOH methodology. We generated the data for our simulation study by using (6) but with η_i = 0 for i ∉ {i*, i**}, and η_{i*} and η_{i**} were chosen so that these hospitals have a log-odds of mortality c% higher than average for an average patient. We generated {Y_ij : i = 1, . . . , r, j = 1, . . . , n_i} as described above. We took the low (H1) and medium (H2) volume hospitals in Table 3 as the designated high outliers. Table 4 has the results for H1 and H2, and has the same structure as Table 3.

We study H1 first because the literature on provider profiling clearly indicates that making appropriate inferences for low volume hospitals is difficult. First, consider c = 25. H1 would be detected as an outlier 50 times (out of 100) by the NYS method (0.05 level), but would be the worst performer in only 20 data sets. Since the OSS procedure focuses on the worst performer, not necessarily H1, we have modified their method in two ways to permit an assessment for all 100 data sets. To obtain the results for the first, conservative, method (labelled A in the fourth column) we delete the records of H1 and H2 for each data set and compare H1's observed value with the distribution of the maximal predicted ones, even if H1 is not the worst performer for that data set. Doing this, the OSS procedure would detect H1 as an outlier only 22 times (0.025 level). For H1 and c = 75, the NYS method would identify H1 as an outlier in 95 data sets while the OSS method would do so only 77 times. Comparing the values in the second column (for the NYS method) with those in the fourth column (OSS, A, 0.025 level), it is clear that this OSS method is conservative.

The results for the OSS and Besag methods are, again, quite similar. Let (o, b) denote the number of outlier detections corresponding to the OSS and Besag methods, respectively. Then (o, b) = (31, 37), (60, 68) and (87, 92) for c = 25, 50 and 75 using a 0.10 p-value cutoff and 80% credible region. The corresponding values of (o, b) for a 0.05 p-value cutoff and 90% credible region are (27, 33), (51, 60) and (80, 87), while they are (22, 29), (49, 52) and (77, 83) for a 0.025 p-value cutoff and 95% region.

A less conservative approach for the OSS method is to compare H1's observed value with the distribution of the maximal predicted values when H1 is the worst performer, but compare H1's observed value to the distribution of the ith ranked predicted values when H1 is the ith ranked performer. These results are given in column 4B. For a given p-value cutoff it is clear that the number of flagged outliers is much larger when using this less-conservative approach. Looking at a p-value
Table 5
Detailed comparison of Ohlssen B and Besag methods for outliers detected by the NYS DOH method: H1 in Table 4 with c = 25, p-value cutoff = 0.10, 80% Besag credible region. Numbers of outliers detected by each method, classified by whether H1 was the worst, second worst or third worst performer.

                                      Besag: Yes   Besag: No
Worst performer          OSS: Yes     17            0
                         OSS: No       2            1
Second worst performer   OSS: Yes     18            8
                         OSS: No       0            1
Third worst performer    OSS: Yes      0            3
                         OSS: No       0            0
cutoff of 0.025, the OSS method flags almost as many outliers as the NYS DOH method does, except for c = 25 (compare column 4B with column 2). On the other hand, this less-conservative OSS procedure flags more outliers than the Besag method.

The results for H2 are similar to those for H1. The results for the conservative OSS method and the Besag method are similar (compare columns 4A and 5), whereas the less conservative OSS method is similar to the NYS DOH method except for c = 25 (compare column 4B with cutpoint 0.025 with column 2). For all methods the probabilities of identifying an outlier are much larger for H2 than for H1, so the differences among the methods are much less dramatic for H2 than for H1.

Comparing the two OSS methods, the less conservative one is more likely to identify a true outlier, H, when: (a) there is more than one true outlier because, then, H may be the second or third or . . . worst performer, or (b) it is the only outlier but c is small because, then, there may be hospitals with attributes similar to H and one of these may be the worst performer. To make point (a) clearer, consider the specification summarized in Table 4, and the results for OSS method B for H1 when c = 25 and the cutoff is α = 0.10. Of the 50 cases where the NYS DOH method identified an outlier, OSS-B declared an outlier in 46 cases. Of these 50 cases, H1 was the worst performer in 20 cases, the second worst performer in 27 cases and the third worst performer in 3 cases. Table 5 compares the OSS-B method with α = 0.10 to the Besag 80% credible region. Among the 20 cases where H1 is the worst performer, the methods are comparable (17 outliers detected by OSS, 19 by Besag). It is in the 30 cases where H1 is the second or third worst performer that OSS-B detects more outliers, i.e., 29 for OSS and 18 for Besag. This outcome, typical of our results, is not surprising because OSS-B compares the ith ranked provider with outcomes expected from the ith true provider, as we have noted in Section 2.

Finally, as expected, it is true throughout our study that for comparable decision criteria none of the alternative methods identifies an outlier when the NYS DOH method does not.

4. Extensions and summary

When the data set being analyzed does not have the same level of risk-adjustment as that for CABG in New York, one may wish to assume that the hospital effects are random as, e.g., in Shahian et al. (2005). The model and notation are the same as in Section 2 except that (5) is replaced by

logit(p) = Xβ + Wδ,

where δ = (δ_1, . . . , δ_r) and W is a ∑_{i=1}^{r} n_i × r matrix with each row in the ith block (of n_i rows) having a 1 in the ith column and 0 otherwise, and, independently, δ_1, . . . , δ_r ∼ N(0, γ^2). Typically, independent locally uniform prior distributions are
chosen for β and γ^{−2}. For example, Shahian et al. (2005) take β ∼ N(0, 10^4 I) and γ^{−2} ∼ gamma(0.001, 0.001). One may use the (α/2)100th and (1 − (α/2))100th percentile values of the posterior distribution of the δ_i to make assessments, as in Shahian et al. (2005). That is, the ith hospital is regarded as a high outlier if the (α/2)100th percentile value of δ_i is greater than 0. Alternatively, as described in Section 2, one may apply the Besag method to the set of δ_i to yield simultaneous credible regions.

Given the increasing use of public report cards for medical practitioners and the consequences of incorrect identification of outlying health care providers, it is essential to have sound statistical methodology. Here, we review and investigate the methodology in Ohlssen et al. (2007), who are concerned whether a putative outlier is really an outlier or an observation in the tail of the common distribution for all practitioners. Since this approach does not address the question of trying to determine the number and identity of the set of providers to flag as outliers, we present an alternative method based on methodology in Besag et al. (1995), which does this by providing a credible region with specified coverage probability. We also compare these two approaches with the more standard New York State Department of Health methodology using both New York State CABG data and constructed data sets where there are known outliers.

As expected, it is difficult to make sound inferences for hospitals with small patient volumes. Comparing the OSS and Besag methods, both have essentially the same false positive rates. For the large volume hospital (H3) and the medium volume hospital (H2) with c > 25, the two methods identify approximately the same number of
outliers, somewhat fewer than the NYS DOH method. For the low volume hospital (H1) and H2 with c = 25, both the OSS and Besag methods detect many fewer outliers than the NYS DOH method. Looking at H1, the hospital where inferences are typically the most difficult, Besag is better when there is a single outlier, while OSS-B is preferable when there are two outliers.

Assuming that one fits the usual logit models as in (2) or (5), our results do not show a clear winner between OSS-B and Besag. Since the Besag method does determine the number and identity of the set of providers to flag as outliers and has a known "error rate", we prefer to use it. Moreover, as shown in Section 3, the cases for hospital H where the OSS-B method detects an outlier while the Besag method does not occur when H is the second, third, . . . worst performer among the set of hospitals being investigated. For some, this may not be an appealing way to decide whether H is an outlier. Racz and Sedransk (2010) were able to document that most of the hospitals deemed outliers using the NYS DOH methodology were, in fact, outliers. Thus, one should look carefully at (possible) outliers identified by the NYS DOH method but not by the OSS or Besag methods.

Acknowledgements

The authors are grateful to the associate editor and 3 reviewers for their extensive, helpful comments.

References

Austin, P.C., 2002. A comparison of Bayesian methods for profiling hospital performance. Med. Decis. Making 22, 160–169.
Austin, P.C., 2005. The reliability and validity of Bayesian measures for hospital profiling: a Monte Carlo assessment. J. Statist. Plann. Inference 128, 109–122.
Austin, P.C., Alter, D.A., Tu, J.V., 2003. The use of fixed- and random-effects models for classifying hospitals as mortality outliers: a Monte Carlo assessment. Med. Decis. Making 23, 526–539.
Austin, P.C., Naylor, C.D., Tu, J.V., 2001. A comparison of a Bayesian vs. a frequentist method for profiling hospital performance. J. Eval. Clin. Pract. 7, 35–45.
Besag, J., Green, P., Higdon, D., Mengersen, K., 1995. Bayesian computation and stochastic systems. Statist. Sci. 10, 3–41.
Brooks, S., Gelman, A., 1998. General methods for monitoring convergence of iterative simulations. J. Comput. Graph. Statist. 7, 434–455.
Christiansen, C.L., Morris, C.N., 1997. Improving the statistical approach to health care provider profiling. Ann. Intern. Med. 127, 764–768.
Goldstein, H., Spiegelhalter, D.J., 1996. League tables and their limitations: statistical comparisons of institutional performance. J. Roy. Statist. Soc. Ser. A 159, 385–409.
Hannan, E.L., Wu, C., Bennett, E.V., Carlson, R.E., Culliford, A.T., Gold, J.P., Higgins, R.S., Isom, O.W., Smith, C.R., Jones, R.H., 2006. Risk stratification of in-hospital mortality for coronary artery bypass graft surgery. J. Am. Coll. Cardiol. 47, 661–668.
New York State Department of Health, 2009. Adult Cardiac Surgery in New York State 2004–2006. Albany, New York.
Normand, S-L.T., Glickman, M.E., Gatsonis, C.A., 1997. Statistical methods for profiling providers of medical care: issues and applications. J. Amer. Statist. Assoc. 92, 803–814.
Ohlssen, D.I., Sharples, L.D., Spiegelhalter, D.J., 2007. A hierarchical modelling framework for identifying unusual performance in health care providers. J. Roy. Statist. Soc. Ser. A 170, 865–890.
Racz, M.J., Sedransk, J., 2010. Bayesian and frequentist methods for provider profiling using risk-adjusted assessments of medical outcomes. J. Amer. Statist. Assoc. 105, 48–58.
Shahian, D.M., Torchiana, D.F., Shemin, R.J., Rawn, J.D., Normand, S-L.T., 2005. Massachusetts cardiac surgery report card: implications of statistical methodology. Ann. Thorac. Surg. 80, 2106–2113.
Thomas, N., Longford, N.T., Rolph, J.E., 1994. Empirical Bayes methods for estimating hospital-specific mortality rates. Stat. Med. 13, 889–903.