Reliability of utilization measures for primary care physician profiling




Healthcare 1 (2013) 22–29



Hao Yu a,*, Ateev Mehrotra b,1, John Adams c,2

a RAND Corporation, 4570 Fifth Avenue, Pittsburgh, PA 15213, USA
b RAND Pittsburgh, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
c RAND Corporation, 1776 Main Street, Santa Monica, CA 90401-3208, USA

Article history: Received 15 January 2013; received in revised form 12 April 2013; accepted 15 April 2013; available online 9 May 2013.

Abstract

Background: Given rising health care costs, there has been a renewed interest in using utilization measures to profile physicians. Despite the measures' common use, few studies have examined their reliability and whether they capture true differences among physicians.

Methods: A local health improvement organization in New York State used 2008–2010 claims data to create 11 utilization measures for feedback to primary care physicians (PCPs). The sample consists of 2938 PCPs in 1546 practices who serve 853,187 patients. We used these data to measure the reliability of these utilization measures using two methods (hierarchical model versus test–retest). For each PCP and each practice, we estimate each utilization measure's reliability, ranging from 0 to 1, with 0 indicating that all differences in utilization are due to random noise and 1 indicating that all differences are due to real variation among physicians.

Results: Reliability varies significantly across the measures. For 4 utilization measures (PCP visits, specialty visits, PCP lab tests (blood and urine), and PCP radiology and other tests), reliability was high (mean > 0.85) at both the physician and the practice level. For the other 7 measures (professional therapeutic visits, emergency room visits, hospital admissions, readmissions, skilled nursing facility days, skilled home care visits, and custodial home care services), reliability was lower, indicating more substantial measurement error.

Conclusions: The results illustrate that some utilization measures are suitable for PCP and practice profiling, while caution should be used when applying other utilization measures in efforts such as public reporting or pay-for-performance incentives.

© 2013 Elsevier Inc. All rights reserved.

Keywords: Physician profiling; Utilization measures; Reliability; Claims data

1. Introduction

Given the growing concerns about rising health care costs, there is a push to profile providers on costs of care. Utilization measures are commonly used to reflect costs. Utilization measures capture the use of health services within a given patient population and, together with medical prices, drive total costs. Common utilization measures include the number of specialty visits referred among a primary care physician's (PCP) patient panel or the number of emergency room visits among all the patients at a practice. Since these types of measures are readily obtained from health plan claims,1 they have been used in many ways, including providing confidential feedback to physicians,2,3 serving as a basis for financial incentives for physicians,4,5 and publicly profiling physicians.6 For example, the Centers for Medicare and Medicaid Services (CMS) has two new physician profiling efforts. The Quality and Resource Use Reports will provide confidential

* Corresponding author. Tel.: +1 412 683 2300x4460; fax: +1 412 683 2800. E-mail addresses: [email protected] (H. Yu), [email protected] (A. Mehrotra).
1 Tel.: +1 412 683 2300x4894; fax: +1 412 802 4972.
2 Tel.: +1 310 393 0411; fax: +1 310 393 4818.
2213-0764/$ - see front matter © 2013 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.hjdsi.2013.04.002

feedback reports to tens of thousands of individual physicians on utilization measures such as emergency visits, lab tests, and days in a skilled nursing facility.7 The Physician Value-Based Payment Modifier, which was called for by the Affordable Care Act to link physician performance on quality and cost with Medicare Part B payments,8 is expected to expand over the next 2 years to cover all physicians for performance-based payment. There are also state and local initiatives, such as the California Quality Collaborative, which publicly profiles over 35,000 physicians on utilization measures (e.g. emergency room visits).6 Despite the common use of utilization measures, there is little research on their reliability.9 Reliability indicates a measure's capability to distinguish true differences in providers' performance from random variation.10,11 Prior studies have examined the reliability of health care quality measures, such as HEDIS,12 cost measures,13 and composite measures.14 Many of these studies have found inadequate reliability: the measure captures significant random noise. However, to our knowledge, there is no published research on the reliability of utilization measures at the individual physician and practice level. A critical issue for any such evaluation is how to measure the reliability of utilization measures. Conceptually, reliability is the


squared correlation between a measure and the true value.15 Because the true value is rarely known, researchers often estimate the lower bound of reliability.16 For example, test–retest reliability is a common method for estimating the reliability of survey instruments. In prior work, Adams and colleagues used simple hierarchical models to estimate reliability in cross-sectional data.13 In such models, the observed variability is divided into two components: variability of scores across all providers and variability of scores for individual providers. The provider-to-provider variance is combined with the provider-specific error variance to calculate the reliability of the profile for an individual provider. In this study we use these methods to assess the reliability of utilization measures. Our data are from a local consortium that created the utilization measures for physician feedback. As discussed below, our data come from multiple health plans, reflecting ongoing efforts at the state and local level to create multiple-payer claims databases for health care improvement.17 We examine two different ways of measuring reliability (hierarchical model versus test–retest) and discuss the differences between them.

2. Methods

2.1. Data

We obtained data from THINC, Inc., a nonprofit health improvement organization serving 9 counties in the Hudson Valley of New York State (Putnam, Rockland, Westchester, Dutchess, Orange, Ulster, Sullivan, Columbia, and Greene). In an effort to improve the value of the health care provided in the region, THINC, Inc. has taken a series of steps to construct and monitor utilization measures for PCPs. First, it pooled enrollment and claims data from five major health plans operating in the Hudson Valley, including four commercial plans (Aetna, United Healthcare, MVP Healthcare, and Capital District Physicians' Health Plan) and one Medicaid HMO (Hudson Health Plan), which together cover approximately 60% of the insured population in the Hudson Valley. Second, it used the methods depicted in Fig. A1 in Appendix A to attribute the insured people to PCPs, defined as physicians holding an MD or DO and practicing in one of the specialties of Family Practice, General Practice, Internal Medicine, and Pediatrics. A patient can be attributed to one PCP only. PCPs were aggregated into practices, defined as the individual physicians at a single geographic site. Third, it created annual utilization measures for each PCP during the 3-year period of 2008–2010. The 11 utilization measures (see Appendix B for detailed definitions) include annual PCP visits, specialty visits, PCP lab tests (blood and urine), PCP radiology and other tests, professional therapeutic visits, emergency room visits, hospital admissions, readmissions, skilled nursing facility days, skilled home care visits, and custodial home care services; together they provide a comprehensive assessment of the health services provided to a PCP's patient panel.

In addition to data on the utilization measures, we also obtained the following data from THINC, Inc.: (1) physicians' characteristics, such as degree, age, gender, and clinical specialty; (2) characteristics of each PCP's patient panel, including average age and proportion of male patients; (3) linkage files that can be used to link practices, physicians, and patients; (4) the average co-morbidity score for a PCP's patient panel; and (5) the average co-morbidity score for patients enrolled in a practice. Co-morbidities were captured using a diagnosis-based model developed by DxCG, Inc. The patient co-morbidity score available to us was a single value for the entire patient panel; we did not have access to co-morbidity data or risk scores for individual patients. It is common for health plans to profile physicians and practices using just the average risk scores across a patient panel. The study sample consists of 2938 PCPs in 1546 practices, serving 853,187 patients in the Hudson Valley in 2008–2010.

2.2. Two methods of assessing reliability

We use two methods for assessing reliability. Our primary method builds on the hierarchical model method used by Adams and colleagues.13 We used the formula below to calculate reliability at the individual physician level:

Reliability = Variance_across_physicians / (Variance_across_physicians + Variance_within_physicians)

For a given patient population, the utilization by patients is best viewed as count data. To calculate the variance in a given utilization measure among physicians, we first estimated a Poisson model with a random effect for each of the utilization measures. Each model controls for both physician and patient characteristics, such as the physician's age, gender, and degree, the patients' average age and average risk adjustment scores, and the proportion of male patients. Then, the variance among physicians is calculated as the product of the model's covariance and the average utilization rate among physicians for a given utilization measure. For example, the rate of specialist visits by a physician's patient panel is the panel's total number of specialist visits divided by the panel size, with the average rate of specialist visits being the mean rate across all the physicians. This calculation is based on the delta method to transform the log scale of the Poisson model into a rate. To calculate the variance of a given utilization measure within a physician's patient panel, we divided the panel's rate of that utilization measure by the panel size. For example, the variance in specialist visits within a physician's patient panel is the panel's rate of specialist visits (described above) divided by the panel size. The calculated reliability ranges from 0 to 1, with 1 suggesting that all the variation results from real differences in performance and 0 meaning that all the variation in utilization is caused by measurement error. There is no gold standard for what constitutes adequate reliability for a measure of a physician's performance, although cut-offs of 0.7 and 0.9 have been used in prior studies.18,19

As a second method, we applied the simple test–retest method, which, shown conceptually, is

Test_Retest_Reliability = Variance_across_physicians / (Variance_across_physicians + Variance_of_time_change + Variance_within_physicians)

For each physician and practice, we calculated the number of annual visits by type of service utilization per 100 patients and estimated the correlation between the 2008, 2009, and 2010 utilization measures. The calculation is based on the assumption that there is no difference from year to year in a physician's practice pattern. Given that there is likely some shift in practice across years,20 we must consider the test–retest reliability as the lower bound of any reliability estimate. In comparison, the hierarchical model above reflects cross-sectional reliability. We will examine how the results differ between the two methods.

2.3. Sensitivity analyses

To test the effect of different model specifications for calculating reliability, we conducted a number of sensitivity analyses, including (1) one analysis that included only physicians'

characteristics in the Poisson model, (2) one analysis that included only patients' characteristics in the model, (3) one analysis that examined reliability by physicians' patient panel size, (4) one analysis that Winsorized extreme values, and (5) one analysis that looked at test–retest reliability using different combinations of years (2008 vs. 2009, 2009 vs. 2010, and 2008 vs. 2010).

3. Results

3.1. Description of PCPs and their practices

As shown in Table 1, the mean age of the 2938 study PCPs is 51 years. Males make up a greater proportion of the PCPs (61%). The vast majority of PCPs hold an M.D. (92%). The average patient panel size per PCP is 290. The mean patient age is 38 years, and 45% of the patients are male. The mean risk adjustment score among patients is 4.25. Most PCPs are solo practitioners (66.3%). About one-fifth of the practices have 50 patients or fewer; over 45% of the practices have 250 patients or more.

Table 1
Characteristics of primary care physicians and their practices. Values are percentages unless noted; standard deviations in parentheses.

PCP level (N = 2938)
  Age of physician: 35 years old or lower, 5.5%; 36–50 years old, 60.7%; 51 years old and over, 33.8%
  Gender of physician: male, 61.1%; female, 38.9%
  Degree of physician: DO, 8.2%; MD, 91.8%
  Number of patients in physician panel: 0–50, 16.5%; 51–100, 9.9%; 101–250, 29.1%; 250+, 44.4%
  Average age of physician's panel: 37.9 years (19.4)
  Average proportion of male patients in physician's panel: 45.2% (14.7)
  Average patient risk score in physician's panel:a 4.25 (5.07)

Practice level (N = 1546)
  Number of providers in the practice: 1, 66.3%; 2, 12.5%; 3–5, 9.4%; 6–10, 5.1%; 11–15, 2.1%; 16+, 4.6%
  Number of patients in the practice: 0–50, 21.7%; 51–100, 11.0%; 101–250, 22.1%; 250+, 45.2%
  Average age of patients in the practice: 40.6 years (18.7)
  Average proportion of male patients in the practice: 47.7% (11.9)
  Average patient risk score in the practice:a 4.31 (3.91)

a As described in the text, we do not have risk scores on individual patients; rather, we were provided a single mean score for each physician's panel and each practice's panel.

3.2. Variation in performance on utilization measures across PCPs

There is substantial variation in mean utilization across the measures (Table 2). The highest mean utilization was for PCP lab tests (1186.6 lab tests per 100 patients per year), while the lowest was for custodial home visits (1.0 custodial home visits per 100 patients per year). On average there were 321.1 PCP visits and 289.2 specialty visits per 100 patients per year. Across PCPs, there was wide variation in utilization (Table 2). For example, the 5th percentile of PCP visits per 100 patients in a year is 150.0, about half of the median and less than one-third of the 95th percentile.

3.3. Comparison of two methods of measuring reliability at PCP level

Table 3 summarizes results from the two methods. Using the hierarchical model, we found that (1) there is large variation in mean reliability among the 11 types of visits, with the highest mean for PCP lab visits (0.98) and the lowest for custodial home care services (0.24); (2) six of the 11 types of visits had a mean reliability over 0.70 (PCP visits, specialist visits, PCP lab visits, PCP other visits, emergency visits, and skilled home care visits), while the other utilization measures had a mean reliability of 0.70 or below; and (3) most utilization measures had wide variation in reliability across PCPs. For example, across the PCPs in the sample, the reliability for skilled nursing facility days ranged from 0.06 (5th percentile) to 0.99 (95th percentile). Some measures had less variation across PCPs; for example, the reliability for PCP lab visits ranged from 0.92 (5th percentile) to 0.99 (95th percentile).

Using the test–retest method, most utilization measures have lower reliability. The reduction in reliability between the two methods is substantial for four utilization measures: professional therapeutic visits, emergency room visits, hospital admissions, and skilled home care visits (Table 3). In comparison, there are two utilization measures, PCP lab visits and readmissions, whose reliability does not change as much between the two methods. We also found that the test–retest method had similar results across the different combinations of years we examined (2008 vs. 2009, 2009 vs. 2010, and 2008 vs. 2010) with two exceptions: for professional therapeutic visits and custodial home care services, the test–retest results did vary across years (Table 3). Overall, we found that with either method, four types of utilization had relatively high reliability (> 0.7): PCP visits, specialty visits, PCP lab visits, and PCP other visits.

3.4. Sensitivity analyses at PCP level

One sensitivity analysis showed that the reliability of utilization measures varies with the PCPs' patient panel size. Fig. 1A shows that the reliability of PCP visits increases with panel size: it increases sharply for panel sizes below 20 and flattens for panel sizes over 60. A similar pattern appears for two other utilization measures, specialist visits and PCP lab visits. In other sensitivity analyses at the PCP level, we included different covariates in the Poisson model. If the model includes only PCP characteristics, the reliability of most utilization measures is slightly higher than with the model including only patient characteristics. When the model includes both PCP and patient characteristics, the results are virtually the same as the model using only patient characteristics. When we Winsorized extreme values, there was no substantial difference in reliability (results not shown).

3.5. Variation in performance on utilization measures at practice level

The lower part of Table 2 summarizes utilization measures across practices. The highest mean utilization was for PCP labs


Table 2
Variation in PCP and practice performance on annual number of visits per 100 patients.

PCP level
Type of visits                      Mean     5th    25th   Median   75th    95th
PCP visits                         321.1   150.0   243.8   301.3   368.9   515.1
Specialty visits                   289.2    62.2   130.6   266.1   373.5   589.8
PCP lab tests (blood and urine)   1186.6   326.1   625.4  1114.7  1533.3  2294.3
PCP radiology and other tests      198.6    34.9    67.7   183.3   249.0   403.1
Professional therapeutic visits      0.9     0.0     0.0     0.0     0.6     3.1
Emergency room visits               25.7     2.3    10.8    17.1    29.3    72.7
Hospital admissions                  9.4     0.0     3.4     6.3    10.0    23.8
Readmissions                         2.2     0.0     0.0     0.6     1.7     6.1
Skilled nursing facility days       44.5     0.0     0.0     0.0     4.6    65.3
Skilled home care visits            22.5     0.0     3.4    13.0    25.3    74.1
Custodial home care services         1.03    0       0       0       0       1.04

Practice level
Type of visits                      Mean     5th    25th   Median   75th    95th
PCP visits                         317.7    12.0   251.7   319.8   387.7   549.1
Specialty visits                   324.8    73.2   182.6   289.8   404.4   691.1
PCP lab tests (blood and urine)   1306.4   343.8   750.6  1183.7  1599.2  2501.0
PCP radiology and other tests      208.2    38.5   105.7   200.0   268.3   434.5
Professional therapeutic visits      0.8     0.0     0.0     0.0     0.5     3.0
Emergency room visits               24.7     0.0    10.7    17.9    31.1    63.2
Hospital admissions                  9.4     0.0     3.3     6.5    10.6    25.7
Readmissions                         2.0     0.0     0.0     0.7     1.9     8.2
Skilled nursing facility days       23.0     0.0     0.0     0.0     6.2    96.7
Skilled home care visits            24.1     0.0     3.8    13.5    25.8    74.5
Custodial home care services         0.8     0.0     0.0     0.0     0.0     1.1

Table 3
Provider-level reliability estimates by two methods.a

                                   Reliability, hierarchical model           Test–retest correlation
Utilization measure                Mean   5th   25th  Median  75th  95th    2008–09  2008–10  2009–10
PCP visits                         0.94  0.63   0.96   0.98   0.99  1.00     0.68     0.70     0.83
Specialty visits                   0.93  0.64   0.95   0.98   0.99  1.00     0.75     0.68     0.84
PCP lab tests (blood and urine)    0.98  0.92   0.99   1.00   1.00  1.00     0.93     0.84     0.90
PCP radiology and other tests      0.88  0.41   0.89   0.96   0.98  0.99     0.79     0.55     0.76
Professional therapeutic visits    0.50  0.04   0.35   0.55   0.68  0.81     0.04     0.06     0.32
Emergency room visits              0.87  0.37   0.88   0.94   0.97  0.98     0.23     0.31     0.39
Hospital admissions                0.69  0.12   0.61   0.78   0.86  0.92     0.25     0.41     0.31
Readmissions                       0.49  0.04   0.30   0.53   0.68  0.82     0.53     0.52     0.48
Skilled nursing facility days      0.70  0.06   0.37   0.92   0.98  0.99     0.48     0.52     0.85
Skilled home care visits           0.92  0.55   0.94   0.97   0.99  0.99     0.12     0.11     0.45
Custodial home care services       0.24  0.00   0.04   0.19   0.41  0.65     0.71     0.37     0.36

a See Section 2 for details of the two methods.

(1306.4 tests per 100 patients per year), while both professional therapeutic visits and custodial home care services had the smallest mean (0.8 visits per 100 patients per year). Table 2 also shows a wide range for each of the utilization measures at the practice level. For example, the 5th percentile of PCP visits was 12 visits per 100 patients per year, less than 5% of the median and about 2% of the 95th percentile.

3.6. Comparison of two methods of measuring reliability at practice level

Table 4 compares practice-level reliability estimates from the two methods. Using the hierarchical model, only 3 out of the 11 types of utilization had a mean reliability below 0.70: professional therapeutic visits (0.59), readmissions (0.59), and custodial home visits (0.33). These three utilization measures also had larger variation in reliability than the other measures. Using the test–retest method, we found that (1) reliability was reduced markedly for most types of visits and (2) the reliability estimates were consistent across the years with one exception, custodial home visits, which had unusually high reliability between 2008 and 2009. Putting together results from the two methods, we found that four utilization measures had relatively high reliability (> 0.7) across the two methods: PCP visits, specialty visits, PCP lab visits, and PCP other visits.

We also conducted sensitivity analyses at the practice level. As Fig. 1B shows, there is a positive relationship between the reliability of PCP visits and a practice's patient panel size. The relationship is strong for panel size below 20, and the reliability


flattens for panel size over 60. A similar pattern appears for three other utilization measures: specialist visits, PCP lab visits, and PCP other visits.

There are few differences among the three specifications of the Poisson model: the model with PCP characteristics only, the model with patient characteristics only, and the model with both PCP and patient characteristics.

4. Conclusion and discussion

Fig. 1. (A) Relationship between panel size and physician-level reliability estimate of PCP visits. (B) Relationship between panel size and practice-level reliability estimate of PCP visits.

In an effort to reduce costs, commercial health plans and government payers use utilization measures to profile individual physicians. Such efforts are only useful if these utilization measures are reliable. If they are unreliable and therefore primarily capture random noise, then these efforts may be misguided. In this study we used two methods of estimating reliability to examine utilization measures. We found variation in reliability across the measures. For 4 of the 11 measures (PCP visits, specialty visits, PCP lab visits, and PCP other visits), both methods demonstrated high reliability at both the physician and the practice level. This implies that the differences observed in these measures reflect, to a large extent, real differences between physicians. For the other 7 measures, there were concerns about reliability, since the two methods generated discrepant results: the test–retest method generated low reliability for each of the 7 measures, while some of these measures, such as readmissions, skilled nursing facility days, and hospital admissions, still had relatively high reliability in the hierarchical model. We also found that most utilization measures had wide variation in reliability across PCPs. Overall, our findings illustrate the importance of measuring reliability on a regular basis in physician profiling efforts.

The high reliability we found for these 4 measures was contrary to our expectations, which were based on the prior literature on the reliability of other measures. In the one prior study looking at the reliability of utilization measures, Hofer and colleagues (1999) reported that the median reliability of physician visits by diabetic patients was 0.41.1 In related work, we examined the reliability of physician cost profiles and found that 59% of physicians had cost-profile scores with reliabilities of less than 0.70, a commonly used cut-off.13 Most other work has focused on the reliability of physician quality measures. One recent study found that a composite measure of diabetes care had a physician-level reliability of 0.70

Table 4
Practice-level reliability estimates by two methods.a

                                   Reliability, hierarchical model           Test–retest correlation
Utilization measure                Mean   5th   25th  Median  75th  95th    2008–09  2008–10  2009–10
PCP visits                         0.97  0.90   0.98   0.99   0.99  0.99     0.69     0.64     0.67
Specialty visits                   0.98  0.94   0.99   0.99   1.00  0.99     0.79     0.79     0.90
PCP lab tests (blood and urine)    0.99  0.98   1.00   1.00   1.00  0.99     0.86     0.80     0.91
PCP radiology and other tests      0.97  0.89   0.97   0.99   0.99  0.99     0.78     0.67     0.74
Professional therapeutic visits    0.59  0.13   0.43   0.62   0.77  0.92     0.04     0.09     0.53
Emergency room visits              0.91  0.68   0.91   0.95   0.98  0.99     0.38     0.33     0.77
Hospital admissions                0.76  0.33   0.68   0.81   0.89  0.96     0.18     0.33     0.33
Readmissions                       0.58  0.10   0.42   0.61   0.77  0.91     0.11     0.30     0.28
Skilled nursing facility days      0.89  0.44   0.89   0.96   0.99  0.99     0.20     0.13     0.27
Skilled home care visits           0.94  0.79   0.95   0.97   0.99  0.99     0.47     0.26     0.60
Custodial home care services       0.33  0.02   0.14   0.30   0.51  0.77     0.94     0.24     0.22

a See Section 2 for details of the two methods.


with a sample of 25 patients.14 More recently, a study of primary care performance found that physician-level reliability varied widely, from 0.42 for HbA1c testing to 0.95 for cervical cancer screening, and that reliability was below 0.80 for 10 of the 13 study measures.12 The high reliability we found for these four measures might be due to our much larger sample size. We focus on general utilization measures, while the published studies have focused on specific medical procedures for certain medical conditions, such as diabetes, and therefore their sample sizes are much smaller.

While our findings of high or low reliability are based on the full patient panel size, not surprisingly our sensitivity analysis shows that the reliabilities are closely related to patient panel size. For example, although the utilization measure of PCP visits, on average, has high reliability, physicians with a small panel size (e.g., below 20) typically still have low reliability estimates, as demonstrated in Fig. 1A. The sensitivity analysis also indicates that reliability flattens for panel sizes over 60, suggesting that the 7 utilization measures with low reliability are not likely to reach higher reliability at larger panel sizes.

It is not surprising to see the discrepancy between the two methods in our reliability estimates for the 7 measures. This is inherent to some degree, given the different formulas embedded in the two methods. While the reliability formulation for the hierarchical model helps one understand the extent to which we can distinguish physicians' performance within a particular year or aggregation of years, the test–retest reliability, a simple correlation coefficient between years, represents the lower bound of the reliability estimate; since it accounts for possible changes over time, test–retest reliability should never be larger than cross-sectional reliability. Contrasting the two reliabilities shows that they have different meanings and cannot be substituted for each other. The results from the two methods indicate that a measure can be reliable for detecting differences in physician performance in historical data while being far less reliable for predicting future physician performance.

One important consideration when assessing reliability is the relationship between reliability and validity. Validity indicates how well a measure represents the phenomenon of interest; reliability indicates the proportion of variability in a measure that is due to real differences in performance. Classically, validity and reliability are considered two separate characteristics of a measure with little relationship between them. If a physician has high reliability on a given measure, it does not mean that the physician performed well on the measure or that the measure is a valid measure of physician performance. Since reliability is difficult to interpret in the absence of validity, it is also important to consider the possible effects of the most common threat to validity: inadequate case-mix adjustment. Although a precise quantification of the effects of failed case-mix adjustment is beyond the scope of this paper, the intuition can be seen in a conceptual extension of the reliability formula:

Calculated_Reliability = (Variance_across_physicians + Squared_bias) / (Variance_across_physicians + Variance_within_physicians + Squared_bias)

We distinguish here between calculated reliability and the reliability one would obtain if the bias were removed from the performance measures. The presence of the squared bias in both the numerator and the denominator increases the calculated reliability, but not in a way that is relevant for physician profiling. Consider a physician whose panel is healthy beyond what the measured case-mix adjusters can capture. He or she has a structural advantage when compared to peers. Indeed, this advantage will likely persist over time and influence calculated test–retest reliability in a similar way.
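The inflation argument can be made concrete with a toy calculation. Here `squared_bias` is an invented number standing for a persistent, unadjusted case-mix advantage; the function simply evaluates the extended formula above:

```python
# Illustrative only: how squared bias inflates calculated reliability.
def calculated_reliability(var_across, var_within, squared_bias=0.0):
    """Evaluate the extended reliability formula with an optional bias term."""
    return (var_across + squared_bias) / (var_across + var_within + squared_bias)

# Invented variances: true signal 0.2, noise 0.3, unadjusted bias 0.5.
unbiased = calculated_reliability(var_across=0.2, var_within=0.3)                    # 0.4
biased = calculated_reliability(var_across=0.2, var_within=0.3, squared_bias=0.5)    # 0.7
# The bias raises "reliability" from 0.4 to 0.7, yet the extra signal is
# the physician's structural case-mix advantage, not performance.
```

The same persistence argument explains why test–retest reliability is inflated too: the advantage recurs every year and therefore survives the between-year correlation.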


Careful consideration of the adequacy of case-mix adjustment should inform the interpretation of reliability. Unfortunately, there is often little analysis that can be done with the available data to address this important question. This study was not able to address the issue because we had access only to health plans' administrative data and did not have information about chronic conditions or risk-adjustment scores at the individual patient level. Although it is common for health plans to profile physicians and practices using just the average risk score across a patient panel, incorporating more patient-level information for risk adjustment to improve reliability analysis is an avenue for future inquiry.

Our study has some limitations in terms of generalizability. Our physician sample was drawn from the five major health plans in the Hudson Valley, four of which are commercial health plans. The results may not generalize to other regions or to physicians who primarily serve Medicare patients. Our results may also not be generalizable because over 66% of the study physicians were solo practitioners, a proportion higher than the national average. Our study could be extended using representative samples or data from large-scale policy interventions. For example, it remains an interesting topic for future studies to assess the reliability of the measures included in the CMS QRUR system, which contains performance information on 28 quality indicators used to provide confidential feedback reports to physicians in nine states (California, Illinois, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, and Wisconsin).7

Our study findings have important policy implications. The high and stable reliability of the four utilization measures suggests that these measures may be used reliably for PCP profiling by health plans and other health care organizations. In fact, some of the measures (e.g. PCP prescribed radiology visits) have already been used by organizations such as the California Quality Collaborative to monitor resource use by PCPs.6 Our results concerning the other utilization measures suggest that they should be used with caution for provider profiling. Our finding that reliability increases with patient panel size before flattening out for panel sizes over 60 suggests that physician profiling programs should not include physicians with relatively small panels (e.g. below 60).

Our results also highlight the importance of assessing reliability as a standard aspect of any profiling effort. Utilization measures have been in use for many years, yet this is one of the first studies to assess their reliability. Hospital readmission measures have become a critical aspect of the health care system, yet, to our knowledge, their reliability has not been assessed before. Given the substantial variation we found across utilization measures, we encourage organizations that use utilization measures to profile physicians to evaluate the reliability of those measures. We also emphasize that reliability varies across providers. Even if a given measure has high reliability on average, it does not follow that all providers should be profiled on that measure. One option is to profile only those providers for whom the reliability of the measure meets a threshold (for example, 0.70). We believe this would be fairer to providers than simply using volume cut-offs (for example, a patient panel of 60), because our results underscore that even above a volume cut-off (as demonstrated in our analysis of the full patient panel), reliability may not be sufficiently high for some utilization measures.
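The per-provider screening option described above can be sketched as follows. The variance components and panel sizes are hypothetical, and `provider_reliability` is an illustrative helper, not part of any actual profiling system:

```python
# Sketch of per-provider reliability screening. Variance components and
# panel sizes are hypothetical; the 0.70 threshold follows the example
# given in the text.

def provider_reliability(between_var, within_var, panel_size):
    """Reliability of a provider's measure given that provider's panel size."""
    return between_var / (between_var + within_var / panel_size)

def profile_eligible(panel_sizes, between_var, within_var, threshold=0.70):
    """Return only the providers whose measure reliability meets the threshold."""
    return {pid: n for pid, n in panel_sizes.items()
            if provider_reliability(between_var, within_var, n) >= threshold}

panels = {"A": 30, "B": 75, "C": 200}
eligible = profile_eligible(panels, between_var=1.0, within_var=30.0)
print(sorted(eligible))  # ['B', 'C'] -- provider A's panel is too small
```

Because reliability depends on panel size through the sampling-noise term, a single volume cut-off cannot guarantee the same reliability across measures with different variance components.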

Acknowledgment

This study was funded by THINC, Inc.

Appendix A

See Fig. A1.


H. Yu et al. / Healthcare 1 (2013) 22–29

Fig. A1. Methods of attributing a patient to a PCP. The flowchart's logic, rendered as text:

1. Did the patient specify a primary care physician (PCP) to the health plan upon enrollment during the measurement year? If yes, select the PCP specified most recently by the patient in the measurement year and go to step 2. If no, go to step 3.

2. Enrollment Method: Is that PCP included in the study sample? If no, exclude the patient. If yes, did the patient see that PCP for at least one preventive care or E&M care visit in the measurement year? If no, exclude the patient; if yes, attribute the patient to that PCP (Enrollment Method).

3. Imputation Method: Did the patient see any PCP from the study sample at least once in the measurement year? If no, exclude the patient. If yes, select the PCP with the most visits for that patient during the measurement year (if there is a tie between two PCPs, select the physician with the most recent visit). Did the patient see that PCP for at least one visit (preventive care and E&M) in the measurement year? If no, exclude the patient; if yes, attribute the patient to that PCP (Imputation Method).
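For illustration, the attribution rules of Fig. A1 can be expressed in code. The record layout (`enrollment_pcp`, `visit_counts`, `last_visit`) is an assumed toy structure, not the study's actual claims schema:

```python
# Illustrative sketch of the Fig. A1 attribution rules; field names are
# hypothetical. visit_counts is assumed to count preventive/E&M visits only.

def attribute_patient(patient, study_pcps):
    """Return (pcp_id, method) or (None, 'excluded')."""
    pcp = patient.get("enrollment_pcp")  # PCP specified most recently at enrollment
    if pcp is not None:
        # Enrollment Method branch
        if pcp not in study_pcps:
            return None, "excluded"
        if patient["visit_counts"].get(pcp, 0) >= 1:
            return pcp, "enrollment"
        return None, "excluded"
    # Imputation Method branch: pick the study-sample PCP with the most
    # visits; break ties by the most recent visit date.
    visits = {p: n for p, n in patient["visit_counts"].items()
              if p in study_pcps and n >= 1}
    if not visits:
        return None, "excluded"
    best = max(visits, key=lambda p: (visits[p], patient["last_visit"][p]))
    return best, "imputation"

study = {"pcp1", "pcp2"}
pt = {"enrollment_pcp": None,
      "visit_counts": {"pcp1": 3, "pcp2": 3},
      "last_visit": {"pcp1": "2010-06-01", "pcp2": "2010-09-15"}}
print(attribute_patient(pt, study))  # ('pcp2', 'imputation') via the tie-break
```

The tie-break compares ISO-formatted date strings, which sort chronologically, so the physician with the most recent visit wins when visit counts are equal.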

Appendix B. Definitions of utilization measures.

PRIMARY CARE PHYSICIAN (PCP) VISITS: Count of patient–provider encounters for primary care providers. A visit is defined by a distinct Consistent Member ID, Rendering Provider ID, and Date of Service where the Rendering Provider is a PCP (pediatrician, family practice, internal medicine, or general practice, as determined by submitted taxonomy codes).

SPECIALIST VISITS: Count of patient–provider encounters for specialist providers. A visit is defined by a distinct Consistent Member ID, Rendering Provider ID, and Date of Service where the Rendering Provider meets all of the following criteria: is not a pediatrician, family practice, internal medicine, or general practice provider (as determined by submitted taxonomy codes), and is a physician.

PCP LABORATORY TESTS (BLOOD AND URINE): Count of lines for any claim where the service code has a BETOS code for a lab test or the revenue code is for a lab test.

PCP RADIOLOGY AND OTHER TESTS: Count of lines for any claim where the service code has a BETOS code for a radiology or diagnostic test or the revenue code is for a radiology or diagnostic test.

PROFESSIONAL THERAPEUTIC SERVICES: Count of occupational, physical, and speech therapy services, identified using the corresponding service codes.

EMERGENCY DEPARTMENT VISITS: Count of claims where the CPT code, revenue code, or CPT/Place of Service code indicates an emergency department (ED) visit. Multiple ED visits on the same date of service are counted as one visit. ED visits that also have Room and Board are not counted (this means the patient was admitted to the hospital).

HOSPITAL ADMISSIONS: Based on fields such as claim ID, admit and discharge dates, Type of Bill, revenue code, and DRG code, all facility claims are aggregated into hospital admissions. This measure is a count of those admissions.

READMISSIONS: An admission is counted as a readmission if all of the following conditions are met: the number of days between the current admission and a previous admission is 30 days or less; the admission has the same Member ID and Product ID as the previous admission; and, if an admission is within 30 days of more than one preceding admission, it is counted only once.

SKILLED NURSING FACILITY DAYS: Sum of the lengths of stay (LOS) from admissions to a skilled nursing facility (SNF). An SNF admission meets the following criteria: the Type of Bill (TOB) code is "Skilled Nursing facility—Inpatient" or "Skilled Nursing facility—Swing Bed," or the revenue code on the admission indicates SNF.

SKILLED HOME CARE VISITS: Count of distinct patient, provider, and service dates where the Place of Service code or BETOS code indicates a home care visit.

CUSTODIAL HOME CARE SERVICES: Count of claim lines where the service code indicates home care services.
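The 30-day readmission definition can be sketched in code. The admission tuples below are a hypothetical layout (member ID, product ID, admission date); real processing would work from aggregated facility claims as described above:

```python
# Sketch of the 30-day readmission counting rule; admission records are
# hypothetical (member_id, product_id, admit_date) tuples, not the
# study's claims layout.
from datetime import date

def count_readmissions(admissions):
    """Count admissions within 30 days of a prior admission for the same
    (member ID, product ID); each admission is counted at most once even
    if it falls within 30 days of several preceding admissions."""
    count = 0
    by_key = {}
    for member, product, admit in sorted(admissions, key=lambda a: a[2]):
        prior = by_key.setdefault((member, product), [])
        if any((admit - p).days <= 30 for p in prior):
            count += 1
        prior.append(admit)
    return count

admits = [("m1", "p1", date(2009, 1, 1)),
          ("m1", "p1", date(2009, 1, 20)),  # 19 days after prior: readmission
          ("m1", "p1", date(2009, 2, 5)),   # within 30 days of Jan 20: counted once
          ("m2", "p1", date(2009, 3, 1))]   # no prior admission for m2
print(count_readmissions(admits))  # 2
```

Sorting by admission date ensures each admission is compared only against genuinely preceding admissions for the same member and product.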


References

1. Hofer TP, Hayward RA, Greenfield S, Wagner EH, Kaplan SH, Manning WG. The unreliability of individual physician report cards for assessing the costs and quality of care of a chronic disease. JAMA. 1999;281(22):2098–2105.
2. Fung CH, Lim YW, Mattke S, Damberg C, Shekelle PG. Systematic review: the evidence that publishing patient care performance data improves quality of care. Annals of Internal Medicine. 2008;148(2):111–123.
3. Faber M, Bosch M, Wollersheim H, Leatherman S, Grol R. Public reporting in health care: how do consumers use quality-of-care information? A systematic review. Medical Care. 2009;47(1):1–8.
4. Rosenthal MB, Landon BE, Normand S-LT, Frank RG, Epstein AM. Pay for performance in commercial HMOs. New England Journal of Medicine. 2006;355(18):1895–1902.
5. Milstein A, Lee TH. Comparing physicians on efficiency. New England Journal of Medicine. 2007;357(26):2649–2652.
6. California Quality Collaborative. Actionable reports that show physician level performance compared to peer group for specified measures. 〈http://www.calquality.org/programs/costefficiency/resources/〉 and 〈http://www.calquality.org/about/〉; 2012. Accessed 18.05.12.
7. Centers for Medicare & Medicaid Services. QRUR template for individual physicians. 〈http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/PhysicianFeedbackProgram/Downloads/QRURs_for_Individual_Physicians.pdf〉; 2012. Accessed 18.05.12.
8. VanLare JM, Blum JD, Conway PH. Linking performance with payment: implementing the physician value-based payment modifier. JAMA. 2012;308(20):2089–2090.
9. Fischer E. Paying for performance—risks and recommendations. New England Journal of Medicine. 2006;355:1845–1847.
10. Shahian DM, Normand S-L, Torchiana DF, et al. Cardiac surgery report cards: comprehensive review and statistical critique. The Annals of Thoracic Surgery. 2001;72(6):2155–2168.
11. Landon B, Normand S, Blumenthal D, Daley J. Physician clinical performance assessment: prospects and barriers. JAMA. 2003;290(9):1183–1189.
12. Sequist TD, Schneider EC, Li A, Rogers WH, Safran DG. Reliability of medical group and physician performance measurement in the primary care setting. Medical Care. 2011;49(2):126–131.
13. Adams JL, Mehrotra A, Thomas JW, McGlynn EA. Physician cost profiling—reliability and risk of misclassification. New England Journal of Medicine. 2010;362(11):1014–1021.
14. Kaplan SH, Griffith JL, Price LL, Pawlson LG, Greenfield S. Improving the reliability of physician performance assessment: identifying the physician effect on quality and creating composite measures. Medical Care. 2009;47(4):378–387.
15. Bland J, Altman D. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–310.
16. Fleiss J, Levin B, Paik M. Statistical Methods for Rates & Proportions. Indianapolis, IN: Wiley-Interscience; 2003.
17. Love D, Custer W, Miller P. All-Payer Claims Databases: State Initiatives to Improve Health Care Transparency. Commonwealth Fund Publication No. 1439, vol. 99. New York: Commonwealth Fund; 2010.
18. Safran DG, Karp M, Coltin K, et al. Measuring patients' experiences with individual primary care physicians: results of a statewide demonstration project. Journal of General Internal Medicine. 2006;21(1):13–21.
19. Hays RD, Revicki D. Reliability and validity (including responsiveness). In: Fayers P, Hays R, editors. Assessing Quality of Life in Clinical Trials. New York: Oxford University Press; 2005.
20. Phelps C. Health Economics. 3rd ed. Boston: Addison-Wesley Educational Publishers; 2008.