Paying physician group practices for quality: A statewide quasi-experiment




Healthcare 1 (2013) 108–116


Delivery Science

Douglas A. Conrad a,*, David Grembowski a, Lisa Perry b, Charles Maynard c, Hector Rodriguez d, Diane Martin a

a Health Services, University of Washington, Seattle, WA 98195, USA
b Department of Economics, University of Washington, Seattle, WA 98195, USA
c Health Services, University of Washington, and VA Health Services Research and Development, Seattle, WA 98195, USA
d Health Policy and Management, University of California, Los Angeles, CA 90055-1772, USA

* Corresponding author. Tel.: +1 206 616 2923; fax: +1 206 543 3964. E-mail addresses: [email protected], [email protected] (D.A. Conrad).

Article info

Article history: Received 8 March 2013; Received in revised form 16 April 2013; Accepted 22 April 2013; Available online 28 May 2013.

Keywords: Pay-for-performance programs; Incentive effects

Abstract

This article presents the results of a unique quasi-experiment of the effects of a large-scale pay-for-performance (P4P) program implemented by a leading health insurer in Washington State during 2001–2007. The authors received external funding to provide an objective impact evaluation of the program. The program was unique in several respects: (1) It was designed dynamically, with two discrete intervention periods—one in which payment incentives were based on relative performance (the "contest" period) and a second in which payment incentives were based on absolute performance compared to achievable benchmarks. (2) The program was designed in collaboration with large multispecialty group practices, with an explicit run-in period to test the quality metrics. Public reporting of the quality scorecard for all participating medical groups was introduced 1 year before the quality incentive payment program's inception, and continued throughout 2002–2007. (3) The program was implemented in stages with distinct medical groups. A control group of comparable group practices also was assembled, and difference-in-differences methodology was applied to estimate program effects. Case mix measures were included in all multivariate analyses. The regression design permitted a contrast of intervention effects between the "contest" approach in the sub-period of 2003–2004 and the absolute standard, "achievable benchmarks of care" approach in the sub-period 2005–2007. Most of the statistically significant quality incentive program coefficients were small and negative (opposite to program intent). A consistent pattern of differential intervention impact in the sub-periods did not emerge. Cumulatively, the probit regression estimates indicate that neither the quality scorecard nor the quality incentive payment program had a significant positive effect on general clinical quality. Based on key informant interviews with medical leaders, practicing physicians, and administrators of the participating groups, the authors conclude that several factors likely combined to dampen program effects: (1) modest size of the incentive; (2) use of rewards only, rather than a balance of rewards and penalties; (3) targeting incentive payments to the group, thus potentially weakening incentive effects at the individual level.

1. Introduction

Increasingly, healthcare payment reform is moving to center stage in the deliberations of healthcare policymakers, payers, providers, and purchasers.11,5,2 There is general consensus regarding the shortfalls of fee-for-service reimbursement, given its tendency to encourage over-use and its non-alignment with cost-containment and quality objectives. However, the prominent alternative payment method, capitation, suffers from limitations of its own. Namely, capitation implicitly creates incentives to under-serve, and most small- and medium-sized provider organizations are unable to bear the actuarial risks implicit in receiving a fixed lump-sum payment per person.6 The rise of pay-for-performance (P4P) programs in the late 1990s and early 2000s was a predictable counter-reaction to these problems.4,22,24 By design, P4P incorporates rewards for providing guideline-based services that mitigate the tendency toward underuse inherent in capitation while simultaneously discouraging overuse of expensive services, for example through incentives for generic drug prescribing, appropriate use of antibiotics, and use of asthma controller medications.8 P4P programs are flexible, in that they can be implemented within health plan payment by fee-for-service, full-risk capitation, or hybrid mechanisms.15,5,8


1.1. Objective

We assess the impact of a "quasi-experiment" in the form of a P4P program for medical group practices developed by a major health insurer in Washington State and implemented during 2003–2007 for its commercial preferred provider organization (PPO) plan products. Medicaid, Medicare fee-for-service, and Medicare Advantage covered lives were not part of the program. The clinical quality performance of three sets of medical groups is examined: (1) those participating only in a quality scorecard (QSC) and public reporting program, (2) those participating in the P4P program in addition to the quality scorecard and reporting, and (3) a "control" group of roughly comparable practice organizations not participating in either the QSC or P4P program.

1.2. Conceptual background

Considerable peer-reviewed research has been published regarding P4P in the preceding decade.26,7,16,20,23,4 The design of this paper's quasi-experiment was informed not only by that literature, but also by the theory of principal–agent relationships,21,6 cognitive psychology,12,13 and behavioral economics.17,19 The literature's implications for the design of this Washington State quasi-experiment can be summarized briefly in the following propositions:

(1) Incremental, continuous, and more frequent rewards for improved clinical quality will produce greater quality gains than all-or-none rewards for exceeding a fixed quality threshold.3,9,19 As a corollary, quality targets based on achievable benchmarks will lead to greater improvement than fixed standards.

(2) Downside risk, in the form of withholds or retrospective penalties, will generate greater improvement per dollar than "upside potential," or bonuses.17 This prediction derives from what behavioral economists refer to as "loss aversion," but is also consistent with traditional concepts of diminishing marginal utility of income. However, the use of penalties or withholds could produce negative responses if perceived as unfair by participants.19 Also, rewards may be necessary to cover the initial marginal costs of improvements in quality infrastructure and to catalyze behavioral change.3 Thus, the P4P incentive structure should balance penalties and rewards.

(3) Stakeholder involvement in incentive design is expected to enhance the strength of performance effects: first, by enhancing the credibility and perceived fairness of the quality target and, second, by enhancing communication and awareness of the incentive program.8,19

(4) Performance incentives based on the subject's own performance (whether at the organization, team, or individual level) will lead to greater improvement than those tied to performance relative to one's peers, or "contests."19,8,7,16 However, since patient care is a "team sport," the potential "free-rider" problem attendant to group rewards might, in practice, be offset by internal control mechanisms.

(5) Because providers have more control over process quality indicators than over patient health outcomes, performance incentives geared to care processes are more likely to achieve significant improvement in clinical quality than outcome-based incentives of equivalent size.8,7,16

2. The Washington State P4P quasi-experiment

2.1. Phase-in of the quality programs

The preceding provides context for the Washington State quasi-experiment conducted by one of the major private health insurers in the Northwest. The quality scorecard for major medical groups and public reporting of quality were designed and pre-tested with an initial cohort between July 2001 and June 2002. The first public scorecard appeared in 2003, reporting on calendar year 2002 quality measures for the initial scorecard cohort of medical groups. After the first 3 medical groups were introduced to the scorecard in measurement year 2002, 4 additional groups were added to the scorecard and public reporting between 2003 and 2007. In 2004, the same first 3 medical groups became eligible for incentive payments based on their performance on the quality scorecard metrics. Four other medical groups were added to the P4P incentive program during measurement years 2004–2007. Participating medical groups' quality metrics were reported annually on the insurer's public website (available to members and the general public).

2.2. Quality metrics

The health plan medical leadership carefully vetted the quality measures with the participating medical groups and selected a set of well-established metrics for the quality scorecard and provider incentive payment program:

- Breast cancer screening (mammogram) for women age 52–69 in the year prior to or during the measurement year.a
- Cervical cancer screening (Pap test) for women age 21–64 in the 2 years prior to or during the measurement year.
- Well-child visits: 6 or more by age 15 months.
- Use of optimal medications for asthma: ages 5–56.b
- Diabetes: 2 Hemoglobin A1c (HbA1c) tests during the measurement year.
- Diabetes: LDL-cholesterol screening during the measurement year.
- Diabetes: ACE-inhibitor or ARB medication prescribed during the measurement year.c
- Coronary artery disease: LDL screening during the measurement year.

These quality process measures were derived from incurred claims data for each measurement year and were reported roughly 7 months after the end of the measurement year, by which time all claims incurred in that year had been reported and paid.d

a In measurement year 2006 the age range for breast cancer screening was expanded from 52–69 years to 42–69 years.
b In measurement year 2006 the population eligibility changed from asthma in either the current or prior year to asthma in both years.
c ARBs were added as an option to this indicator in measurement year 2004. Beginning in measurement year 2005 this metric was restricted to those with diabetes who also had either hypertension or coronary artery disease.
d Late in the study period, a few outcome measures requiring clinical data were added to the measurement set, but those metrics are not included in the dataset of this study.
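The measures above are claims-derived process indicators. As a rough illustration of how such an indicator can be computed from an administrative claims extract, the sketch below flags diabetic members with at least two HbA1c tests in a measurement year. The column names, code lists, and data layout are hypothetical and do not reproduce the health plan's actual measure specification.

```python
# Illustrative sketch (not the study's actual algorithm): flag whether each
# diabetic member received >= 2 HbA1c tests in the measurement year, using
# simplified, hypothetical claims fields.
import pandas as pd

HBA1C_CPT = {"83036", "83037"}   # illustrative HbA1c test procedure codes
DIABETES_DX_PREFIX = "250"       # ICD-9-CM diabetes diagnosis codes begin with 250

def hba1c_indicator(claims: pd.DataFrame, year: int) -> pd.DataFrame:
    """claims columns assumed: member_id, service_date (datetime), dx_code, cpt_code."""
    yr = claims[claims["service_date"].dt.year == year]

    # Denominator: members with any diabetes diagnosis during the measurement year.
    diabetics = yr.loc[yr["dx_code"].str.startswith(DIABETES_DX_PREFIX), "member_id"].unique()

    # Numerator: members with two or more HbA1c test dates during the year.
    tests = (yr[yr["cpt_code"].isin(HBA1C_CPT)]
             .groupby("member_id")["service_date"].nunique())

    out = pd.DataFrame({"member_id": diabetics})
    out["met_criterion"] = out["member_id"].map(tests).fillna(0).ge(2).astype(int)
    return out  # one row per eligible patient: 1 if the indicated care was received
```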



2.3. Quality incentive payment structure

The quality payment incentive structure was based on points for each measure and incorporated both the level of achievement and the degree of improvement from the prior year. In the first two measurement years (2003 and 2004) the incentive payments were provided only to the top-scoring groups; in that sense, the incentive structure was a "contest" based on relative performance among participating medical groups. Each of the three groups participating in the QIP incentive program in 2003 performed well enough relative to all participants in the QSC scorecard to earn bonuses in the first 2 years of the QIP (2003–2004), but the percentage incentive earned by those three groups varied from 0.2% to 1.1% of contracted revenue with the sponsor. The initial incentive dollars at risk from the sponsor in 2003 were $680,000.

From 2005–2007, incentive payments were based on approaching and actually reaching achievable benchmarks of care. The specific incentive formula and weighting of components are proprietary to the insurer; but, in general, the incentive formula in 2005–2007 blended two elements—percentage attainment of the achievable performance benchmark and extent of performance improvement over the prior year (a purely illustrative, non-proprietary sketch of such a blend appears below). By 2007, the sponsor's potential incentive pool for the seven groups participating in the QIP exceeded $1.7 million, and the incentive payments varied among the groups from 0.5% to 2.0% of contracted revenue. While the incentive structure was uniform across groups, the potential reward varied across medical groups, depending on their respective contractual agreements with the plan. The incentive payments derived entirely from "new money"; the medical groups did not bear downside risk in the form of withholds or retrospective penalties. The potential annual incentive payments per group varied from 1.1% to a high of 2.0% of total contracted revenue from the sponsoring insurer during 2003–2007 and were increased in later years. Like the percentage of revenue at risk, the size and trend of the increase was not uniform over time or across groups. The modest percentage of revenue at risk was squarely within the range of P4P programs in the mid-2000s.23

2.4. Program participant selection

The process of medical group selection was not random: health plan leadership approached a particular set of large medical groups for potential participation in the quality scorecard and public reporting program and, later, the quality incentive program. Thus, the phase-in over time of quality scoring and public reporting, followed by quality incentive payment for a subset of those groups, is best characterized as a "quasi-experiment."
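The insurer's actual 2005–2007 formula and weights are proprietary. The following is a purely illustrative sketch of how a score blending benchmark attainment with year-over-year improvement might be computed and scaled to a capped share of contracted revenue; the 70/30 weighting, the capping rules, and the payout scaling are invented for illustration only.

```python
# Purely illustrative sketch of a blended quality incentive score of the kind
# described in Section 2.3. The insurer's actual formula and weights are
# proprietary; the blend weight, benchmark handling, and payout cap are invented.
def incentive_score(current_rate: float, prior_rate: float, benchmark: float,
                    attainment_weight: float = 0.7) -> float:
    """Blend attainment of an achievable benchmark with year-over-year improvement."""
    attainment = min(current_rate / benchmark, 1.0)              # share of benchmark reached
    headroom = max(benchmark - prior_rate, 1e-9)                 # room left to improve
    improvement = max(min((current_rate - prior_rate) / headroom, 1.0), 0.0)
    return attainment_weight * attainment + (1 - attainment_weight) * improvement

def incentive_payment(scores: list, contracted_revenue: float,
                      max_pct_at_risk: float = 0.02) -> float:
    """Scale the average measure score to a capped share of contracted revenue;
    the program's reported range was roughly 0.5-2.0% of contracted revenue."""
    avg = sum(scores) / len(scores)
    return avg * max_pct_at_risk * contracted_revenue

# Example: a measure at 78% against an 85% benchmark, up from 72% the prior year.
# score = incentive_score(0.78, 0.72, 0.85)
```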

In measuring quality for each participating medical group, all eligible patients for each quality indicator in each group were scored. Each patient was attributed to a unique medical group in each year, based on the practice in which the plurality of his or her health care was received (a simplified sketch of this attribution rule appears below).

2.5. Quasi-experimental study design

Because of the phased implementation of the quality program components (scorecard/reporting and incentives, respectively), we chose a modified difference-in-differences methodology, using a comparison set of medical groups (the "control group") selected by the university researchers in collaboration with the health plan. The control group organizations were chosen on the basis of their perceived leadership and quality improvement sophistication, drawing on the experience and expertise of the health plan staff working most closely with each medical group.

All medical groups participating in either the scorecard only (QSC) or the quality incentive program (QIP) during the 2001–2007 period were included in the analysis sample: 7 QSC (only) medical groups, 7 QIP medical groups (which, of necessity, also participated in the quality scorecard program), and 5 "control" medical groups, the latter of which were never part of either the QSC or QIP. Table 1 displays the baseline (2001) characteristics of the three cohorts. Both intervention cohorts are larger than control group practices in terms of number of primary care physicians. For each cohort, group practices are predominantly physician-owned. Total group revenues based on quality are 1% or less, and primary care physician compensation is predominantly production-based for all cohorts. The cohorts varied in their use of electronically integrated clinical information, but in no consistent pattern.

2.6. Addressing sample selection bias

In planning the study, the research team anticipated the potential for selection bias in our estimates, particularly in light of the health plan's systematic approach to recruiting medical groups and the absence of a large number of control group organizations of comparable size and characteristics. Also, given the relatively small total number of groups, the use of instrumental variables or other similar techniques to correct for selection was not feasible. Thus, we employed difference-in-differences regression, comparing the changes over time in quality scores between the control group and the QSC and QIP groups, to mitigate potential selection bias due to any unobservable variables that are constant over time.
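As an illustration of the plurality-of-care attribution rule described in Section 2.4, the sketch below assigns each member-year to the medical group accounting for the largest number of that member's visits. The input layout and column names are hypothetical.

```python
# Sketch of the plurality-of-care attribution rule in Section 2.4: each patient
# is assigned, for each year, to the medical group where he or she received the
# largest share of visits. Column names are hypothetical.
import pandas as pd

def attribute_patients(visits: pd.DataFrame) -> pd.DataFrame:
    """visits columns assumed: member_id, year, group_id (one row per visit)."""
    counts = (visits.groupby(["member_id", "year", "group_id"])
                    .size()
                    .rename("n_visits")
                    .reset_index())
    # Keep, for each member-year, the group with the most visits (ties broken arbitrarily).
    counts = counts.sort_values("n_visits", ascending=False)
    return counts.drop_duplicates(["member_id", "year"])[["member_id", "year", "group_id"]]
```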

Table 1
Medical group practice characteristics by cohort (mean values).

Characteristic | Control group | QSC only | QSC/QIP
Number of locations | 3 | 6 | 10
Number of family practitioners | 6 | 12 | 31
Number of internists | 6 | 15 | 25
Number of pediatricians | 2 | 2 | 9
% of groups physician-owned | 80 | 60 | 60
% of total group revenues based on quality | 0 | 1 | 1
% of individual primary care physician compensation based on production | 97 | 75 | 89
% of group practices using electronically integrated information in 2001 for:
  Standardized problem lists (%) | 20 | 20 | 20
  Ambulatory care encounter data (%) | 40 | 20 | 100
  Lab findings (%) | 60 | 20 | 50
  Meds prescribed (%) | 20 | 0 | 17
  Clinical guidelines and protocols (%) | 0 | 0 | 0
% of group revenues from insurer sponsoring QSC and QSC/QIP | 23 | 16 | 20


In addition, we tested the sensitivity of our effect estimates to the inclusion of medical group size as a covariate, since the number of providers appears highly correlated with other group-specific factors that are likely to affect clinical quality (e.g., extent of electronic medical record use, chronic disease quality improvement activities). These techniques should attenuate selection bias, but it is not possible to rule out time-varying unobservables that might still confound our estimates.

2.7. Statistical model

The study team developed a regression model for estimating the effects on quality of the two health plan quality program interventions: the QIP and the QSC. Since each intervention was implemented in different years for different sets of medical groups, a binary variable was set equal to 1 during each year in which the group participated in the scorecard program (QSCt = 1) and a second was set equal to 1 during each year in which the group participated in the P4P incentive program (QIPt = 1). Two additional intervention regressors were included, one for each quality program: a binary variable equal to 1 if the group ever participated in the program, namely (a) EverQSC_coh for scorecard and reporting program participation and (b) EverQIP_coh for quality incentive program participation, respectively. The latter dummy variables were included to capture underlying differences between the three sets of groups: QSC, QIP, and controls. Year-specific dummy variables were included to reflect secular trends over time, independent of quality program participation status. Patient-level variables for age, gender, and case mix were included to control for patient factors that might affect provider achievement of the quality criterion. The modified Charlson comorbidity score, based on a set of diagnoses and, in some cases, surgical procedures,14 was employed to measure case mix. This measure was originally validated as predictive of 1-year mortality on a medical service (the relevant clinical domain for the quality measures of the current study). The model was estimated in probit regression.e The unit of observation was the patient, with the dependent variable equal to 1 or 0 depending on whether or not the patient received the care indicated by the individual quality standard, and with robust standard errors accounting for the clustering of patients within medical groups.

To evaluate the sensitivity of our estimates of quality program effect to model specification, the regression was varied in four ways: (1) Group size was included as a regressor in some models to assess whether estimates of QSC and QIP effects differed when that potential confounder was taken into account. (2) Program dummy variables were replaced by program "dose" variables (QSCdoset, QIPdoset, plus the square of each of those variables) to account for non-linear effects of the length of time each intervention had been in place. (3) In certain specifications, the program variables were distinguished by two time periods: the "contest" years 2003–2004 (QSC2003–2004t and QIP2003–2004t) and the "achievable benchmark" years (QSC2005–2007t and QIP2005–2007t), to differentiate quality program impacts in those two distinct time frames. In keeping with the design propositions, the researchers expected that larger program effects would occur in the achievable benchmark period. (4) In order to control for baseline (pre-quality program intervention) differences among the groups, in the final set of regression analyses the baseline (2002) medical group-level average value for each dependent quality variable (e.g., the proportion of the group's eligible patients receiving the indicated care) was included as a regressor.

e We chose probit on a priori grounds, reasoning that the cumulative normal distribution function was likely a better fit for binary quality measures than logit (which has thinner upper and lower tails) and noting that the results of probit and logit are generally quite similar. See, for example, Aldrich and Nelson.1
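The preferred specification described above can be approximated with standard software. The sketch below, using the statsmodels formula interface, is an illustration of the kind of patient-level probit difference-in-differences model described in Section 2.7, not the authors' actual code; the data frame `df` (one row per eligible patient-year) and all variable names are assumptions.

```python
# Sketch of a patient-level probit difference-in-differences specification of the
# kind described in Section 2.7. Variable names are hypothetical; `df` is assumed
# to contain the indicated columns.
import statsmodels.formula.api as smf

formula = (
    "met_criterion ~ female + age + deyo_score + clinic_size_above_median"
    " + C(year)"                      # secular year dummies
    " + ever_qsc + ever_qip"          # cohort fixed effects (ever participated)
    " + qsc_0304 + qsc_0507"          # scorecard 'treatment on' by sub-period
    " + qip_0304 + qip_0507"          # incentive 'treatment on' by sub-period
)

model = smf.probit(formula, data=df)
result = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": df["medical_group_id"]},  # robust SEs clustered by medical group
)
print(result.summary())

# The 'dose' variant described in the text would replace the binary treatment
# terms with the number of years each program had been in place plus its square,
# e.g. " + qsc_dose + I(qsc_dose**2) + qip_dose + I(qip_dose**2)".
```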


3. Results

3.1. Descriptive plots

To inspect visually for any differential patterns in quality performance over time, we constructed a time series plot for each quality measure, distinguishing three cohorts: (1) Ever QSC only (i.e., QSC, but never QIP), (2) Ever QSC/QIP (i.e., QIP and QSC), and (3) the control medical groups. Each plot was split to distinguish the 2003–2004 and 2005–2007 sub-periods.f Those plots are presented as Figs. 1–8 and are discussed briefly here. These descriptive figures, unadjusted for covariates, are included only as potentially suggestive of program effects and as a prelude to the controlled regression results.

Generally, there are not marked differential changes over time among cohorts on the quality indicators; for some metrics, there are baseline differences. For example, breast cancer screening levels (Fig. 1) are highest at baseline for the Ever-QSC only cohort, next highest for the controls, and lowest for the Ever-QSC/QIP cohort. Neither the scorecard nor the incentive program cohort shows the gain over time achieved by the controls. Examining the cervical cancer screening plots (Fig. 2), both the controls and the scorecard cohort show some decrement over time, rather than improvement, as does the incentive program cohort, except in the 2005–2007 sub-period, when the QSC/QIP cohort displays a dramatic upturn in cervical cancer screening. On inspection of Fig. 6, both the scorecard and incentive program cohorts show considerably greater improvement over time in LDL cholesterol screening among diabetes patients than the controls. The quality performance improvement in ACE-inhibitor (and ARB) use among diabetics also is somewhat greater among the intervention cohorts than the controls (Fig. 7), especially for the incentive program (Ever QSC/QIP) groups. Notably, both intervention cohorts started at lower baseline levels, so those medical groups had more room for improvement. Finally, the descriptive comparisons for LDL cholesterol screening among patients with cardiac disease (Fig. 8) are limited to the period 2004–2007 because of the later addition of that quality indicator within the program. This descriptive plot is suggestive of a positive effect of the scorecard and public reporting on LDL screening for patients with cardiac disease, but not of the incentive program. In general, the descriptive time patterns displayed in Figs. 1–8 do not suggest added benefit of program participation above and beyond sentinel effects.

f A second set of plots was examined by "synthetic cohort," in which the cohorts were defined by whether the intervention (QSC only or QSC plus QIP) was in place for the medical group during the sub-period of interest or not in place (Control). In the study team's view, these synthetic cohort plots did not add useful information on differential time trends, compared to the regular cohort plots, because of the overlap of medical groups between cohorts over time. Thus, a medical group could be in the QSC cohort in 2003–2004, but in QSC plus QIP in 2005–2007, in which case the plot over time of differences in "synthetic" intervention cohort quality performance would commingle two factors: any underlying differences in quality between those who were in a particular intervention cohort (or the control) and the impact of the intervention itself during each sub-period. Those plots are available from the authors, but are not presented here.
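Figures 1–8 are cohort-level time-series plots of this kind. As a hypothetical illustration of how such descriptive plots can be produced from patient-level indicator flags, assuming a data frame `df` with cohort, year, and met_criterion columns (the published figures were produced by the study team, not by this code):

```python
# Sketch of the descriptive plots in Section 3.1: annual mean of a quality
# indicator by cohort (Control, Ever QSC only, Ever QSC/QIP). Column names
# are hypothetical.
import matplotlib.pyplot as plt

rates = (df.groupby(["cohort", "year"])["met_criterion"]
           .mean()
           .unstack("cohort"))          # rows = year, columns = cohort

ax = rates.plot(marker="o")
ax.axvline(2004.5, linestyle="--")      # divide the 2003-2004 and 2005-2007 sub-periods
ax.set_ylabel("Proportion receiving indicated care")
ax.set_title("Quality indicator by cohort (illustrative)")
plt.show()
```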



Fig. 1. Breast cancer screening.

Fig. 2. Cervical cancer screening.

Fig. 3. Well child visits.

3.2. Multivariate regression results

Table 2 summarizes the probit regression results for the preferred specification. This model includes binary variables for the QSC (scorecard and reporting) and QIP (incentive) programs during each year to denote participation in those interventions ("control group" is the reference category); a dummy variable for clinic size (above median = 1); a split of the sub-periods into "contest" (2003–2004) and "absolute benchmark" (2005–2007) years; and no separate adjustment for the baseline value of the dependent variable. The Ever QSC and Ever QSC/QIP cohort dummy variables are included to capture fixed effects for the two intervention cohorts relative to the control medical groups. The coefficient on the QIP treatment dummy for each sub-period estimates the incremental effect of the P4P incentive over and above the scorecard and reporting (QSC) program, because the QSC treatment variable is switched on for any year in which the medical group participated in the quality scorecard and public reporting program.



Fig. 4. Asthma medications optimal use.

Fig. 5. Diabetes HbA1c testing.

Fig. 6. Diabetes LDL-C screening.





Fig. 7. Diabetes ACE-inhibitor Meds.

Fig. 8. Coronary artery disease LDL.

Table 2
Regression results: binary intervention, covariate for size, no baseline adjustment, two time periods (probit models with covariates for patient age, gender, and case mix). Cells show coefficients with robust standard errors in brackets. The regressors in each column are Female, Age, Deyo score, Clinic size dummy, year dummies (2003–2007), Ever QSC cohort, Ever QSC/QIP cohort, QSC treatment '03–'04, QSC treatment '05–'07, QIP treatment '03–'04, and QIP treatment '05–'07.
* Significant at p < 0.10. ** Significant at p < 0.05. *** Significant at p < 0.01.

Asthma (N = 6700; pseudo R² = 0.05): Female −0.01 [0.010]; Age 0.001 [0.000]***; Deyo score 0.047 [0.013]***; Clinic size 0.007 [0.012]; 2003 0.006 [0.036]; 2004 −0.21 [0.062]***; 2005 −0.118 [0.037]***; 2006 −0.125 [0.035]***; 2007 0.037 [0.028]; Ever QSC −0.085 [0.038]**; Ever QSC/QIP −0.073 [0.030]**; QSC '03–'04 0.047 [0.032]; QSC '05–'07 0.05 [0.030]; QIP '03–'04 −0.037 [0.021]*; QIP '05–'07 −0.037 [0.018]**.

DM_ACE (N = 19,962; pseudo R² = 0.07): Female −0.056 [0.012]***; Age 0.008 [0.001]***; Deyo score 0.012 [0.003]***; Clinic size −0.007 [0.019]; 2003 −0.009 [0.021]; 2004 0.065 [0.024]***; 2005 0.17 [0.022]***; 2006 0.2 [0.021]***; 2007 0.327 [0.016]***; Ever QSC −0.007 [0.022]; Ever QSC/QIP −0.033 [0.033]; QSC '03–'04 0.003 [0.027]; QSC '05–'07 0.01 [0.027]; QIP '03–'04 0.03 [0.018]*; QIP '05–'07 −0.032 [0.028].

DM_HbA1c (N = 21,365; pseudo R² = 0.01): Female −0.024 [0.008]***; Age 0 [0.000]; Deyo score 0.01 [0.005]**; Clinic size 0.003 [0.019]; 2003 0.013 [0.021]; 2004 0.041 [0.023]*; 2005 0.105 [0.037]***; 2006 0.121 [0.037]***; 2007 0.144 [0.036]***; Ever QSC 0.033 [0.025]; Ever QSC/QIP 0.066 [0.037]*; QSC '03–'04 −0.019 [0.027]; QSC '05–'07 −0.004 [0.040]; QIP '03–'04 −0.001 [0.022]; QIP '05–'07 −0.04 [0.020]*.

DM_LDL (N = 21,365; pseudo R² = 0.04): Female −0.039 [0.006]***; Age 0.001 [0.001]**; Deyo score −0.01 [0.003]***; Clinic size 0.036 [0.014]**; 2003 −0.009 [0.014]; 2004 0.048 [0.017]***; 2005 0.103 [0.017]***; 2006 0.132 [0.018]***; 2007 0.155 [0.015]***; Ever QSC −0.02 [0.029]; Ever QSC/QIP 0.006 [0.029]; QSC '03–'04 0.046 [0.015]***; QSC '05–'07 0.039 [0.024]; QIP '03–'04 −0.029 [0.018]; QIP '05–'07 −0.054 [0.012]***.

Well_Child (N = 6338; pseudo R² = 0.08): Female 0.004 [0.008]; Age −0.013 [0.009]; Deyo score −0.041 [0.013]***; Clinic size 0.158 [0.068]**; 2003 0.135 [0.046]***; 2004 0.215 [0.039]***; 2005 0.261 [0.034]***; 2006 0.246 [0.040]***; 2007 0.332 [0.039]***; Ever QSC −0.036 [0.070]; Ever QSC/QIP −0.166 [0.084]**; QSC '03–'04 0.046 [0.058]; QSC '05–'07 0.012 [0.047]; QIP '03–'04 −0.018 [0.061]; QIP '05–'07 0.055 [0.063].

LDL (coronary artery disease; N = 6270; pseudo R² = 0.02): Female −0.045 [0.014]***; Age −0.001 [0.001]; Deyo score −0.007 [0.003]**; Clinic size 0.007 [0.007]; 2003 −0.017 [0.008]**; 2004 0.045 [0.014]***; 2005 0.105 [0.012]***; 2006 0.12 [0.013]***; 2007 −0.036 [0.025]; Ever QSC −0.011 [0.027]; Ever QSC/QIP 0.018 [0.027]; QSC '03–'04 0.01 [0.014]; QSC '05–'07 −0.069 [0.041]*; QIP '03–'04 −0.033 [0.015]**; QIP '05–'07 ….

BCS (N = 81,142; pseudo R² = 0.03): Age 0.008 [0.000]***; Deyo score −0.015 [0.002]***; Clinic size 0.021 [0.021]; 2003 −0.019 [0.011]*; 2004 −0.002 [0.027]; 2005 −0.008 [0.024]; 2006 0.006 [0.023]; 2007 −0.038 [0.030]; Ever QSC 0.039 [0.035]; Ever QSC/QIP −0.009 [0.010]; QSC '03–'04 −0.008 [0.026]; QSC '05–'07 −0.019 [0.011]; QIP '03–'04 −0.012 [0.014]; QIP '05–'07 ….

CCS (N = 96,949; pseudo R² = 0.03): Age −0.005 [0.000]***; Deyo score −0.016 [0.002]***; Clinic size 0.004 [0.013]; 2003 −0.012 [0.013]; 2004 −0.022 [0.015]; 2005 −0.029 [0.014]**; 2006 −0.026 [0.015]*; 2007 −0.003 [0.015]; Ever QSC −0.011 [0.029]; Ever QSC/QIP 0.006 [0.024]; QSC '03–'04 0.01 [0.013]; QSC '05–'07 0.003 [0.017]; QIP '03–'04 −0.033 [0.016]**; QIP '05–'07 −0.012 [0.018].


3.3. Quality program effects

The focus of this study is on the QIP treatment coefficients in the two sub-periods, which estimate the incremental effect for each measure of the QIP incentive program in addition to any effect of the quality scorecard and reporting on clinical quality (the latter effect being reflected in the QSC coefficients in each sub-period). Of the 16 QIP coefficients in Table 2 (8 quality measures by 2 sub-periods), 7 are statistically significant; and, among those, 6 are negative, implying that, relative to the scorecard and reporting alone, adding the QIP incentive actually was associated with a reduction in quality, opposite to the intent of the payment incentive program. The one positive QIP coefficient, on use of ACE-inhibitors among patients with diabetes in the contest period (2003–2004), contrasts with the other, significantly negative, QIP coefficients for the quality indicators of asthma medication use, cervical cancer screening, HbA1c and LDL cholesterol testing among diabetics, and LDL testing among cardiac patients.

The regression design permits a contrast of intervention effects between the "contest" approach in the 2003–2004 sub-period and the absolute-standard, "achievable benchmarks of care" approach in the 2005–2007 sub-period. Because most of the statistically significant QIP coefficients are negative (opposite to program intent) and small, explicit tests of differences in intervention effect between the two sub-periods were not performed. However, in comparing the coefficients, one does not observe a consistent pattern of differential intervention impact in the sub-periods.

A secondary interest of this study pertains to the QSC coefficients, which estimate the main effect of the quality scorecard and public reporting intervention. Of those coefficients, only one is significant, for LDL cholesterol testing among diabetics, and that effect is positive, as intended by the program designers. Cumulatively, these regression estimates indicate that neither the QSC scorecard nor the QIP pay-for-performance incentive had a positive effect on general clinical quality. This overarching conclusion is robust across alternative regression specifications.g

3.4. Covariate effects

This study is one of very few analyses of pay-for-performance programs that have controlled for patient-level covariates (cf. 25). One does observe systematic differences in quality indicators by gender, age, and case mix (the Deyo score). For brevity, we focus on case mix differences. The modified Deyo score is significant for all eight quality indicators, suggesting that patient case mix is an important factor, even when assessing process indicators of quality. Higher levels of comorbidity, as measured by the Deyo score, are positively related to three quality measures: optimal use of asthma medications, and use of ACE-inhibitors and HbA1c testing for patients with diabetes. In contrast, the Deyo score is negatively related to the other five quality indicators (LDL cholesterol screening for patients with diabetes and for those with coronary artery disease, both cancer screening measures, and well-child visits).

g We sensitivity-tested our findings by making several changes to the preferred specification: (1) excluding or including baseline values of the quality dependent variables; (2) excluding or including group (clinic) size as a covariate; (3) estimating the quality programs' intervention effects for the two distinct sub-periods or over the entire observation period (2002–2007); and (4) modeling intervention effects over the entire observation period as a "dose," i.e., by measuring the duration of the intervention in place (QSC or QIP) and the square of that duration (the latter to capture non-linearities, i.e., either marginally increasing or decreasing clinical quality improvement over time as a function of the intervention), as against measuring the intervention in binary form, i.e., in place or not in a given year.
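Because probit coefficients are expressed on the latent-index scale, their magnitude in probability terms is easier to judge as average marginal effects. As a sketch, assuming a fitted statsmodels probit result such as `result` from the earlier illustration (not the authors' actual output):

```python
# Average marginal effects translate the small probit coefficients reported in
# Table 2 into changes in the probability of receiving the indicated care.
margins = result.get_margeff(at="overall")   # average marginal effects
print(margins.summary())
```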

4. Implications for policy and practice

The overall null findings of this study with respect to intervention effects pose certain puzzles for the designers and implementers of quality measurement and reporting programs (i.e., QSC-like interventions) and for P4P programs as well.


The use of medical advisory groups composed of leaders from practices around the State could have encouraged the spread of quality and incentive innovation to those not participating in the health plan's scorecard and public reporting or the incentive program. In attempting to draw lessons from this quasi-experiment, we consider the design propositions introduced in this study, the results of other comparable studies, and the perceptions of the medical group administrative and clinical leaders who participated in key informant interviews for this study.h

The modest size of the potential incentive payment was clearly a factor. While there is no clearly defined "tipping point" for incentive payment effect, empirical evidence (cf. 10) indicates that incentive size matters. The vast majority of medical directors felt that the quality incentives were too small, with several remarking that the payoffs needed to be much larger. Interestingly, key informant interviews suggested that "treating to the test" and diversion of effort from non-incentivized to incentivized quality domains were not issues for these measurement and incentive programs in Washington State.

Changing behavior based on reporting and payment for the patients of one health plan (even one accounting for 10–15% of group revenue) may be one reason that the plan's effort to improve quality through P4P had limited impact. That said, the failure of even the largest P4P program in the country10 – the IHA program in California, involving the major health plans accounting for roughly 60% of total group revenue – to achieve substantial clinical improvement implies that broad multi-payer involvement is not a sufficient condition for attaining large clinical quality gains. Moreover, most medical directors remarked that their groups' individual physician compensation methods were production-based and not well aligned with health plan payment incentives for quality.

Overall, the medical directors serving as key informants for this study were evenly divided in their qualitative judgments of the effectiveness of this incentive program.i Among those medical directors viewing the QIP as effective, several noted that the program's group-level quality incentive payments were useful in focusing team attention on quality and also provided a source of funding that could be used to place more individual physician compensation "at risk" for quality performance. However, medical directors generally felt that the group (rather than individual) nature of the quality incentive was a significant limitation.

The health plan program designers intentionally transitioned the QIP incentive mechanism from a relative performance model to an absolute performance standard (achievable benchmark), in which the incentive was tied directly to achievement and improvement of one's own quality performance. In spite of the theoretically superior incentive properties of the absolute standard, our findings show no evidence that this shift in approach produced favorable results.

The absence of downside risk may have played a role in the failure to significantly "move the needle." While penalties and withholds are never popular among clinicians, behavioral economics17 and previous research18 do reveal that "loss aversion" plays a strong role in modifying behavior. It seems clear that the modest amount of new money on the table (at least when measured as potential new income per provider) was not sufficient to stimulate significant improvement.

h The key informant interviews are the subject of another manuscript and are reviewed only briefly here to assist in interpreting the results of this quantitative analysis.
i Key informants were interviewed prior to the private release of the current study's quantitative findings to their medical groups, so their comments do not reflect any interpretation of the results of the impact evaluation.



Penalties for underperformance might have strengthened motivation, but would have to be weighed against the negative perceptions of potential income reductions.19 We acknowledge that, in the absence of an additional control group of practices using penalties and rewards, we cannot specifically determine whether the absence of penalties in the P4P program was a reason for the lack of a significant effect on quality.

The QSC and QIP programs both targeted the medical group, rather than individual practice sites or providers. The preponderance of prior empirical research on P4P programs tends to support the superior incentive power of individual rather than group inducements.26,7,16 The initiative's focus on process indicators of quality in the scorecard, reporting, and incentive structure, rather than patient health outcomes, seems to have been well aligned with the state of the art prevailing in health plan-based quality measurement, reporting, and assessment systems in the 2002–2007 period. The health plan sponsor moved to include outcome targets in the later years of the intervention period, but the impact of this effort probably was mitigated by the small sample size represented by one plan's covered lives in any one group.

This study offers several specific contributions to the larger P4P literature: (1) Our results reinforce the extant evidence of mixed and relatively small impacts of P4P on quality. (2) To our knowledge, this study is the first to attempt to contrast the effects of quality incentives based on relative performance (the "tournament" or contest approach) versus absolute performance based on achievable benchmark standards within the same study sample. (3) The phased-in nature of the publicly reported quality scorecard, followed by implementation of the quality incentive program, allows one to disentangle the effects of the scorecard from those of the scorecard plus explicit quality incentives. (4) The current study combines key informant interview data with the quantitative results to provide a richer interpretation of the findings. (5) This research is also one of the few P4P studies to explicitly control for case mix differences.

In keeping with the modest success of other first-generation P4P programs in the U.S., quality incentive programs are tending toward multi-payer collaboration and efforts to couple financial incentives with programs geared to both practice redesign and shared savings between payers and providers.10,11 A full-court press on quality and efficiency, based on common and broadly defined clinical and economic metrics among multiple payers and providers, seems to be the logical next step in payment reform and health delivery system transformation.

Acknowledgment

This publication was supported by Grant # 63214 from the Robert Wood Johnson Foundation Health Care Financing and Organization Initiative (RWJF-HCFO). The authors wish to thank their RWJF-HCFO Project Officer, Bonnie Austin, for her outstanding support of this work.

References

1. Aldrich JH, Nelson FD. Linear probability, logit, and probit models. Series: Quantitative applications in the social sciences. Newbury Park, CA: Sage Publications, Inc; 1984.

2. Arrow K, Auerbach A, Bertko J, Brownlee MS, et al. Toward a 21st century health care system: recommendations for health care reform. Annals of Internal Medicine. 2009;150:493–495.
3. Avery G, Schultz J. Regulation, financial incentives, and the production of quality. American Journal of Medical Quality. 2007;22(4):265–273.
4. Chaix-Couturier C, Durand-Zaleski I, Jolly D, Durieux P. Effects of financial incentives on medical practice: results from a systematic review of the literature and methodological issues. International Journal for Quality in Health Care. 2000;12(2):133–142.
5. Chernew M. Editorial: bundled payment systems: can they be more successful this time? Health Services Research. 2010;45(5, Part I):1141–1147.
6. Christianson JB, Leatherman S, Sutherland K. Lessons from evaluations of purchaser pay-for-performance programs: a review of the evidence. Medical Care Research and Review. 2008;65(6 Suppl):5S–35S.
7. Conrad DA, Perry L. Quality-based financial incentives in health care: can we improve quality by paying for it? Annual Review of Public Health. 2009;30:357–371.
8. Conrad DA. Incentives for health care improvement. In: Smith PC, Mossialos E, Papanicolas I, Leatherman S, editors. Performance measurement for health system improvement: experience, challenges, and prospects. The Cambridge Health Economics, Policy, and Management Series. London: Cambridge University Press; 2009. p. 582–612 [Chapter 5.4].
9. Conrad DA, Saver BG, Court B, Health S. Paying physicians for quality: evidence and themes from the field. Joint Commission Journal on Quality and Patient Safety. 2006;32(8):443–451.
10. Damberg CL, Raube K, Teleki SS, dela Cruz E. Taking stock of pay-for-performance: a candid assessment from the front lines. Health Affairs. 2009;28(2):517–525.
11. Davis K, Guterman S. Rewarding excellence and efficiency in Medicare payments. The Milbank Quarterly. 2007;85(3):449–468.
12. Deci EL, Koestner R, Ryan RM. A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation. Psychological Bulletin. 1999;125(6):627–668.
13. Deci EL, Ryan RM. Intrinsic motivation and self-determination in human behavior. New York: Plenum; 1985.
14. Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. Journal of Clinical Epidemiology. 1992;45:613–619.
15. Gosfield AG. The Prometheus Payment™ program: a legal blueprint. In: Gosfield AG, editor. The Health Law Handbook. 2007 ed. Eagan, MN: West, A Thomson Company; 2007. p. 79–130 [Chapter 3].
16. Frolich A, Talavera JA, Broadhead P, Dudley RA. A behavioral model of clinician responses to incentives to improve quality. Health Policy. 2007;80:179–193.
17. Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica. 1979;47:263–292.
18. McNeil BJ, Pauker SJ, Sox Jr HC, Tversky A. On the elicitation of preferences for alternative therapies. New England Journal of Medicine. 1982;306(21):1259–1262.
19. Mehrotra A, Sorbero MES, Damberg CL. Using the lessons of behavioral economics to design more effective pay-for-performance programs. American Journal of Managed Care. 2010;16(7):497–503.
20. Petersen LA, Woodard LD, Urech T, Daw C, Sookanan S. Does pay-for-performance improve the quality of health care? Annals of Internal Medicine. 2006;145(4):265–272.
21. Prendergast CR. The provision of incentives in firms. Journal of Economic Literature. 1999;37:7–63.
22. Rosenthal MB, Landon BE, Howitt K, Song HR, Epstein AM. Climbing up the pay-for-performance learning curve: where are the early adopters now? Health Affairs. 2007;26(6):1674–1682.
23. Rosenthal MB, Frank RG. What is the empirical basis for paying for quality in health care? Medical Care Research and Review. 2006;63(2):135–157.
24. Rosenthal MB, Frank RG, Li Z, Epstein AM. Early experience with pay-for-performance: from concept to practice. Journal of the American Medical Association. 2005;294(14):1788–1793.
25. Sutton M, Elder R, Guthrie B, Watt G. Record rewards: the effects of targeted quality incentives on the recording of risk factors by primary care providers. Health Economics. 2010;1:1–13.
26. Van Herck P, De Smedt D, Annemans L, Remmen R, Rosenthal MB, Sermeus W. Systematic review: effects, design choices, and context of pay-for-performance in health care. BMC Health Services Research. 2010;10:247.