RESEARCH

Inter- and Intrarater Reliability of Retrospective Drug Utilization Reviewers

Ilene H. Zuckerman, Deborah M. Mulhearn, and Colleen J. Metge

Objective: To assess inter- and intrarater reliability among 23 pharmacist and physician retrospective drug utilization reviewers and to assess interrater reliability after a reviewer training session. Design: Exploratory study. Setting: Maryland Medicaid's retrospective drug utilization review (DUR) program. Participants: 23 physician and pharmacist retrospective drug utilization reviewers. Interventions: None. Main Outcome Measures: Profiles rated as "intervention indicated" or "intervention not indicated." Cochran's Q test, overall percent agreement, and the unweighted kappa statistic were used in the analysis of review consistency. Results: Intrarater reliability showed substantial consistency among the 23 reviewers; the percent agreement was 82.9% with κ = 0.66. Interrater reliability, however, was poor, with an overall agreement of 69.6% and κ = 0.16. Interrater reliability was also poor after a one-hour reviewer training session (agreement 81.8%, κ = -0.19). Conclusion: The implicit review process used in the retrospective DUR program that we evaluated was unreliable. Since reliability is a necessary but not sufficient condition for validity of an indicator of inappropriate drug use, the validity of the DUR implicit review process is in question.

J Am Pharm Assoc. 1999;39:45-9.

The implementation of the Omnibus Budget Reconciliation Act of 1990 (OBRA '90) has profound implications for the future of pharmaceutical care. According to OBRA '90, the purpose of drug utilization review (DUR) is "to improve the quality of pharmaceutical care by insuring that prescriptions are appropriate, medically necessary, and... (un)likely to result in adverse medical events."1 Toward that end, OBRA '90 mandates that state Medicaid programs conduct a DUR program that consists of prospective DUR; retrospective DUR; the application of explicit, predetermined criteria and standards for assessment of drug use; and an educational program for practitioners on drug therapy problems.1 The criteria elements used in Medicaid DUR programs when evaluating prescribing include dose, duration of therapy, drug-drug interactions, drug-disease interactions, and duplicative treatment with similar therapeutic entities. Screening criteria, applied in prospective and retrospective DUR, are used with

Received October 1, 1997, and in revised form December 24, 1997. Accepted for publication January 16, 1998.

Ilene H. Zuckerman, PharmD, is associate professor, School of Pharmacy, University of Maryland, Baltimore. Deborah M. Mulhearn, PharmD, is consultant pharmacist, Pharmerica, Beltsville, Md. Colleen J. Metge, PhD, is assistant professor, Faculty of Pharmacy, University of Manitoba.

Correspondence: Ilene H. Zuckerman, PharmD, School of Pharmacy, University of Maryland, Baltimore, 100 N. Greene St., 5th Floor, Baltimore, MD 21201. Fax: (410) 706-5394. E-mail: zuckerma@pharmacy.ab.umd.edu


explicit computer-based rules to permit evaluation of large numbers of prescriptions rapidly and efficiently. The overriding assumption of DUR is that prescriptions that pass the criteria are more likely to result in favorable patient outcomes than prescriptions that do not.

Drug orders that pass prospective DUR and are dispensed end up in a Medicaid paid claims file. In some retrospective DUR programs, these claims are transformed into computer-generated patient profiles and subjected to retrospective DUR. Typically, panels composed of pharmacists and physicians review individual profiles using implicit judgment based on DUR screening criteria and determine whether follow-up interventions (e.g., an advisory letter to the physician or pharmacist) are necessary. Implicit review relies on the professional judgment of peers as to whether drug therapy is appropriate. The intent of retrospective review is to identify patterns of potentially inappropriate drug therapy and implement corrective interventions.

However, the process of retrospective DUR has been questioned.2 Concerns about DUR include lack of empirical evidence to support its claimed benefit and lack of valid screening criteria.2 Retrospective DUR depends on valid screening criteria and reliable peer review of patient profiles that fail the screening criteria. Interrater reliability of judgments based on existing data describes the extent to which two reviewers reach similar conclusions from review of the same profile. Intrarater reliability measures the consistency of responses from the same reviewer on two different examinations of the same profile.



Table 1. Selected Characteristics of DUR Reviewers (n = 23)

Mean years since graduation (SD): 17.8 (13.6)
Mean years as a DUR reviewer (SD): 4.1 (2.4)
Degrees held:
  Bachelor of science in pharmacy: 19
  Doctor of pharmacy: 2
  Doctor of medicine: 2

SD = standard deviation.

Interrater reliability has been found to be poor in evaluations involving implicit review. Hayward et al.3 conducted a retrospective review of patient charts to evaluate the reliability of implicit review by trained physicians in measuring various aspects of care in a general medicine inpatient service. They found that for assessment of overall quality of care and preventable deaths of inpatients, implicit review by peers had moderate degrees of reliability, but for most other aspects of care (e.g., laboratory testing, radiology testing, pharmaceutical agent ordering), physician reviewers could not agree. Research has shown that training and the use of specialist peer reviewers enhance reliability and validity of judgments about general medicine inpatient care.4,5 There is a paucity of published empiric evidence about the reliability of peer judgments as applied specifically to a retrospective DUR program.

The decisions made by retrospective DUR peer reviewers have the potential to affect patient outcomes; therefore, reviewers must be a reliable source of information for the patient's physician and/or pharmacist. Reviewer reliability is important in ensuring dependability, predictability, consistency, and accuracy in the decision-making process of the reviewers regarding patient, physician, or pharmacist interventions.

Objectives

Using patient medication profiles identified through retrospective DUR as potentially inappropriate, we evaluated implicit review by panelists participating in Maryland Medicaid's retrospective DUR program. The primary objective of this study was to assess inter- and intrarater reliability among a panel of pharmacist and physician profile reviewers. A secondary objective was to assess whether interrater reliability increased after a reviewer training session.

Methods

Retrospective DUR Process

Maryland Medicaid's retrospective DUR program at the time of this study was contracted to a private vendor and managed by a vendor-appointed DUR director. The program used 23 peer


reviewers. These same 23 reviewers were used in this reliability study (see Table 1). The reviewers' instructions for all profiles (including those used in the present study) did not change from the usual and customary procedures. That is, each month the reviewer was given approximately 100 profiles, each with six months of claims history, to review. They had one month within which to make recommendations for intervention (advisory letter to the physician and/or pharmacist) or no intervention to the DUR director. Once a month the reviewers met and had an opportunity to review one or more patient profiles with their fellow peer reviewers. Minimum qualifications for each reviewer included a working knowledge of current therapeutics and status as an active provider or affiliation with a provider institution for the state Medicaid program. In addition, reviewers received a copy of the Drug Utilization Review Reviewers Handbook,6 which contained their position description, tips for reading the patient profile, and sample letters sent to pharmacist and physician providers identified for follow-up intervention.

Six-month patient medication profiles were generated monthly by the vendor for retrospective DUR from prescription, physician, and hospital Maryland Medicaid claims billing data. All billing data, and consequently all patient medication profiles, were updated by the vendor monthly. The vendor then applied explicit computer-based rules written from screening criteria approved by Maryland's DUR Board to the updated medication profiles. This process generated profiles with DUR screening criteria exceptions, to be reviewed by the retrospective DUR reviewers. The DUR director then distributed the profiles among the 23 physician and pharmacist peer reviewers. Each profile was reviewed by only one reviewer, and, with the exception of this study, there was no ongoing assessment of interrater reliability in the retrospective DUR program. The reviewers used an unstructured implicit review process to evaluate approximately 100 six-month patient profiles consisting of prescription claims data and diagnoses generated from physician and hospital billing data. According to the screening criteria exception and his or her own clinical judgment, the reviewer decided whether the profile contained an indication of potentially inappropriate drug therapy and whether the DUR director should contact the physician or pharmacist using one of a series of prototypical intervention advisory letters. As previously mentioned, the DUR director scheduled monthly meetings for the 23 peer reviewers to discuss each other's profiles. These discussions allowed reviewers with specific areas of expertise the opportunity to comment on the most clinically significant of the potential problems identified by fellow reviewers.
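The vendor's actual rule logic is not described in this article; the following is a minimal, hypothetical sketch of how an explicit, computer-based screening rule of this kind might flag a six-month profile for peer review. The field names, drug-class label, and duration threshold are invented for illustration only.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Claim:
    drug_class: str      # hypothetical therapeutic class label on the claim
    days_supply: int     # days' supply dispensed
    fill_date: date

def duration_exception(profile, drug_class, max_days):
    """Flag a six-month profile when cumulative days' supply of a drug class
    exceeds an explicit, predetermined duration-of-therapy criterion."""
    total = sum(c.days_supply for c in profile if c.drug_class == drug_class)
    return total > max_days

# A profile with three 90-day fills of the same (hypothetical) class exceeds an
# invented 180-day criterion and would be routed to a peer reviewer.
profile = [
    Claim("H2RA", 90, date(1993, 7, 1)),
    Claim("H2RA", 90, date(1993, 9, 29)),
    Claim("H2RA", 90, date(1993, 12, 27)),
]
print(duration_exception(profile, "H2RA", max_days=180))  # True -> criteria exception
```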

Primary Study Objective-Assessment of Inter- and Intrarater Reviewer Reliability

Patient medication profiles from December 1993 that were screened by using computer-based explicit DUR criteria and


indicated as having criteria exceptions (or potentially inappropriate drug therapy use) were placed in the usual and customary profile pool for review by one of the 23 peer reviewers. At Time 1 (December 1993), 15 profiles indicating potentially inappropriate drug therapy were randomly selected from the December 1993 profile pool and inserted into each set of profiles sent to the 23 reviewers. So that reviewers were not aware of the reliability testing being conducted, they were blinded to the fact that all of them received 15 of the same patient medication profiles, twice (Time 1 and Time 2). Each reviewer designated each profile as either "intervention indicated" (advisory letter sent) or "intervention not indicated" (no advisory letter sent). Interrater reliability among the 23 reviewers was then calculated from these designations using the unweighted κ (kappa) statistic. The κ statistic determines the extent of agreement between two or more judges exceeding that which would be expected by chance.7

At Time 2 (January 1994), the original set of 15 profiles was recopied and then inserted into each set of reviewer profiles. Each of the 15 original profiles was re-marked with a January date to decrease the likelihood that the reviewers would recognize the profiles as duplicates from the previous month. Again, using the unweighted κ statistic, intrarater reliability was calculated for each of the 23 reviewers based on the evaluation of the same set of profiles (i.e., Time 1 versus Time 2).
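As an illustration only, the sketch below shows how agreement data of this shape might be summarized with kappa statistics in Python. The article does not specify its exact multi-rater formulation, so Fleiss' kappa is assumed for the 23-reviewer interrater comparison and Cohen's kappa for each reviewer's Time 1 versus Time 2 comparison; the rating matrices here are random placeholders, not study data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
time1 = rng.integers(0, 2, size=(23, 15))  # placeholder Time 1 decisions (1 = intervene)
time2 = rng.integers(0, 2, size=(23, 15))  # placeholder Time 2 decisions

# Interrater reliability at Time 1: agreement among the 23 reviewers.
# aggregate_raters expects subjects (profiles) in rows and raters in columns.
table, _ = aggregate_raters(time1.T)
print("interrater (Fleiss) kappa:", fleiss_kappa(table))

# Intrarater reliability: each reviewer's Time 1 vs. Time 2 decisions on the
# same 15 profiles, summarized with the unweighted (Cohen's) kappa.
intra = [cohen_kappa_score(time1[r], time2[r]) for r in range(23)]
print("mean intrarater (Cohen) kappa:", float(np.mean(intra)))
```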

Secondary Objective-Assessment of Interrater Reliability after Training

A reviewer training session, separate from the monthly meeting, was offered to all 23 reviewers to assess the effect of education on reviewer reliability at Time 3 (February 1994). Eleven reviewers chose to participate. One of the authors (DMM) conducted the one-hour session, which included a summary of the process of retrospective drug use profile review and a review of explicit drug use criteria for calcium channel blockers (CCBs) and angiotensin-converting enzyme inhibitors (ACEIs), two frequently prescribed drug classes among Maryland Medicaid recipients. Each participant was given a copy of the explicit DUR screening criteria and additional reading material on CCBs and ACEIs. A second and different set of 15 profiles containing CCB and ACEI screening criteria exceptions was selected from the February 1994 profile pool and inserted into each reviewer's package of profiles. Interrater reliability among the 11 reviewers who participated in the training session was determined using the κ statistic.

Analysis

Because the reviewer response was either yes (send advisory letter) or no (do not send advisory letter), the Cochran Q test was used at Time 1 to test whether the reviewers' responses were related to the profile reviewed. The Cochran Q test can be used to evaluate the relationship between two nominal variables, in this case, the reviewer and his or her response to each profile.8
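A minimal sketch of such a Cochran Q test follows, assuming the 23 x 15 matrix of binary decisions is arranged with reviewers as blocks (rows) and profiles as the compared conditions (columns); the data shown are placeholders, and the original analysis may have arranged the test differently.

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

rng = np.random.default_rng(1)
time1 = rng.integers(0, 2, size=(23, 15))  # rows = reviewers (blocks), columns = profiles

# cochrans_q expects a (blocks x conditions) array of binary responses; a small
# p value suggests the proportion of "yes" responses differs across profiles.
result = cochrans_q(time1, return_object=True)
print("Q =", result.statistic, "p =", result.pvalue)
```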


Overall percent agreement is reported for both inter- and intrarater reliability studies. Analysis of review consistency (inter- and intrarater reliability) among the 23 retrospective DUR reviewers was determined using the unweighted κ statistic. Again, the κ statistic determines the extent of agreement between two or more judges exceeding that which would be expected by chance.7 The standard error of κ was used to generate 95% confidence intervals (CIs).9 If the agreement is totally by chance, then the expected value of κ = 0. If agreement is less than that expected by chance, then κ < 0 can occur. As a general rule, κ > 0.75 indicates excellent reliability, 0.4 ≤ κ ≤ 0.75 indicates good reliability, and κ < 0.4 indicates marginal reliability.10-12
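For illustration, the sketch below computes the unweighted kappa and an approximate 95% CI for a single pair of binary rating series. The simple large-sample standard error used here is a common approximation and is not necessarily the exact expression of reference 9; the example ratings are hypothetical.

```python
import numpy as np

def kappa_with_ci(x, y, z=1.96):
    """Unweighted (Cohen's) kappa for two binary rating series, with an
    approximate 95% CI based on a simple large-sample standard error."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    p_o = float(np.mean(x == y))                                # observed agreement
    p_e = sum(float(np.mean(x == c)) * float(np.mean(y == c))   # chance agreement
              for c in np.union1d(x, y))
    kappa = (p_o - p_e) / (1 - p_e)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, (kappa - z * se, kappa + z * se)

# Hypothetical example: one reviewer's Time 1 vs. Time 2 decisions on 15 profiles.
t1 = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]
t2 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
print(kappa_with_ci(t1, t2))
```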

Results

Table 2 summarizes the rater reliability results. Overall, pharmacist and physician reviewers achieved substantial intrarater reliability for consistency of review recommendations. When data were pooled among the 23 reviewers, the percent agreement was 82.9%, κ = 0.66 (95% CI, 0.56 to 0.76). Interrater reliability, however, was poor, with an overall agreement of 69.6% and κ = 0.16 (95% CI, 0.12 to 0.20). Interrater reliability after a one-hour training session was also poor (agreement, 61.8%; κ = -0.19 [95% CI, -0.25 to -0.13]).

The Cochran Q test was statistically significant when applied to all 23 reviewers, indicating heterogeneity (or lack of consistency) among the reviewers' responses. This suggests that the reviewers' responses were dependent on the individual medication profile reviewed. Figure 1 illustrates the responses per profile during Time 1. Agreement of greater than 75% was reached for only three profiles (numbers 10, 13, and 14). In no case did all reviewers agree to either send a letter or not send a letter.

Discussion

Retrospective DUR is used to decide what interventions, if any, are needed when screening criteria indicate potentially inappropriate drug use. The implicit review process we evaluated was unreliable; that is, the extent to which the review process yields consistent results on a decision to intervene is in doubt from the results of this study. Because reliability is a necessary, but not sufficient, condition for validity of an indicator of inappropriate drug use, in this case a medication profile singled out through retrospective DUR for implicit review, the validity of the retrospective DUR process is in question.

So, what can be done to improve rater reliability? Several studies suggest that the use of explicit criteria leads to increased interrater reliability.13 Raters with varying clinical experience can achieve good interrater reliability if they are given clear definitions and practice sessions in which potential differences in interpretation are discussed.14


Table 2. Inter- and Intrarater Reliability Results

Time 1 (interrater, 23 reviewers): % agreement, 69.6; κ = 0.16 (95% CI, 0.12 to 0.20)
Time 1 vs. Time 2 (intrarater, 23 reviewers): % agreement, 82.9; κ = 0.66 (95% CI, 0.56 to 0.76)
Time 3 (interrater after training, 11 reviewers): % agreement, 61.8; κ = -0.19 (95% CI, -0.25 to -0.13)

CI = confidence interval.

Profile review tends to be very subjective unless the reviewer has a specific data collection tool focusing on specific, objective parameters, so that any conclusions are based on the data obtained. Hanlon et al. devised a method for assessing the appropriateness of drug therapy using the Medication Appropriateness Index (MAI).15 A clinical pharmacist and an internist-geriatrician used this index to make independent assessments of medication use in elderly male ambulatory patients. The MAI provided a reliable method for clinical pharmacists and physicians to assess drug therapy appropriateness. Incorporation of a previously tested index, such as the MAI, into the review process may also improve reliability.

Selection and training of reviewers may be a critical step in enhancing the reliability and validity of implicit review.

[Figure 1. Results of Interrater Reliability of Reviewer Responses (n = 23). For each of the 15 profiles, the percentage of reviewers responding "intervention indicated" versus "intervention not indicated" at Time 1.]

In their study of physician-to-physician and physician-to-nurse interrater reliability of the Comprehensive Psychopathological Rating Scale,4 Kasa and Hitomi found that education and training may improve rater reliability. Coryell et al.5 studied the clinical assessment of psychiatric inpatients by physicians and nonphysicians using a comprehensive structured psychiatric interview suitable for use by nonphysicians. The interviewers, whose backgrounds were in social work, psychology, or theology, received 40 hours of training in the form of lectures, case demonstrations, text, and seminars on interview technique. Neither practice nor prior medical training appeared to influence reliability, whereas training of nonphysicians seemed to correlate with reliability. The nonphysician interviewers achieved ratings comparable to physicians for those items thought to require the greatest degree of clinical and medical judgment. The training in this study was much more intensive than our one-hour training session, which did not produce acceptable interrater reliability (see Limitations section).

The following suggestions are proposed to increase interrater reliability in DUR programs in general:

1. Define a standard set of objective criteria as a basis for DUR decisions. Although publicly sanctioned clinical practice guidelines (e.g., those published by the Agency for Health Care Policy and Research) should be used as the basis for explicit criteria, clinical practice guidelines usually lack explicit definitions and specificity. As a consequence, one should take the following into consideration when drafting explicit criteria from practice guidelines: (a) criteria rules should be in a simple "if-then-else" format, with all rules strictly defined using data that are routinely collected in the DUR program (Medicaid or otherwise), as sketched at the end of this section; and (b) evaluate the criteria before using them in a DUR program by applying them to real patients and testing for physician and pharmacist response.16

2. Intensively train reviewers about the specific content of the objective criteria and the process by which reviews should be conducted. If the DUR program reviews individual patient profiles, then use indexes that have demonstrable reliability for assessing drug therapy appropriateness (e.g., the MAI). Also, it is essential that reviewers be periodically retrained; clinical guidelines do change based on new research and practice findings.

3. Incorporate assessment of rater reliability as part of the DUR program's quality assurance activities. As illustrated by Mullins et al., Donabedian's Structure-Process-Outcome conceptual framework of program quality can be applied to DUR


programs.17 This framework suggests that all three aspects (structure, process, outcome) are interrelated; that is, appropriate outcomes are directly related to the presence of appropriate structures and processes. Knowledge about the process of DUR, in this case its reliability, will provide direct feedback to policy makers about the quality of the program.

Finally, the quality of DUR programs is becoming increasingly important, especially when one considers the extent to which Medicaid DUR programs are now provided by managed care organizations. Enrollment in Medicaid managed care plans has increased more than fivefold in this decade, but states are hard pressed to effectively monitor and evaluate the care provided in a closed environment such as a managed care contract.18 As Medicaid managed care expands to include more patients with disabling illnesses, pharmacists involved in or responsible for generating contracts and making purchasing decisions for the provision of medication by managed care companies need to take heed of the necessity for quality evaluation. We have proposed that such process evaluations as rater reliability occur as a prerequisite for any outcomes attributed to DUR interventions. Otherwise, outcomes from DUR programs that identify and purport to resolve inappropriate drug therapy will be based on unreliable processes and subject to question.
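As a hypothetical illustration of the "if-then-else" criteria format proposed in suggestion 1 above, the sketch below encodes an invented drug-disease rule using only routinely collected claims elements; the drug class, diagnosis, data fields, and advisory wording are not drawn from any actual DUR criteria.

```python
def review_profile(profile):
    """Apply one invented 'if-then-else' criterion to a screened profile.

    `profile` is assumed to carry routinely collected DUR data elements, e.g.
    dispensed drug classes, billed diagnoses, and concurrent therapy flags.
    """
    if ("NSAID" in profile["drug_classes"]
            and "peptic ulcer disease" in profile["diagnoses"]
            and not profile["gastroprotective_agent"]):
        return "intervention indicated: drug-disease advisory letter"
    else:
        return "intervention not indicated"

print(review_profile({"drug_classes": {"NSAID"},
                      "diagnoses": {"peptic ulcer disease"},
                      "gastroprotective_agent": False}))
```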

Limitations

There are several limitations to our study. It is possible that the reviewers recognized the duplicated profiles during Time 2, which would have influenced intrarater reliability results. However, we believe that the likelihood of recognition was low, because the reviewers received 100 profiles per month to review, we inserted the duplicated profiles into the packet randomly, and we altered the date on the duplicated profiles to conform with the rest of the Time 2 profiles.

The reviewers receiving the training session were self-selected, which introduces selection bias. The training session itself was limited to what could be logistically provided in an existing DUR program beyond the researchers' control. Thus, the session was limited to one hour, attendance was optional, and only brief discussions of the process of retrospective DUR and the criteria on CCBs and ACEIs were included. Criteria were clearly defined; however, a practice session reviewing profiles and a discussion of differences in interpretation were not part of the training session. Because of these limitations, we cannot draw any conclusions about the effectiveness of our limited training session or a more extensive training session. Therefore, the Time 3 interrater reliability results are provided for descriptive purposes only.

Our results and conclusions cannot be generalized to all DUR programs. However, this study provides a framework for assessing rater reliability in DUR programs that involve profile review and peer judgments.


Conclusion

In summary, there was strong intrarater reliability among retrospective DUR reviewers in a Medicaid DUR program. However, interrater reliability was found to be poor. Interrater reliability after a one-hour training session was also poor. The goal of DUR is to improve the quality of patient care and increase the likelihood of positive therapeutic outcomes by identifying, preventing, and solving drug-related problems. DUR programs, whether in traditional Medicaid settings or managed care environments, should demonstrate reliability in an effort to assure the validity and integrity of results. Pharmacists involved in DUR program implementation and management have the ability to improve the state of the art for DUR by instituting ongoing assessments of DUR processes and outcomes as suggested above.

References

1. Omnibus Budget Reconciliation Act of 1990, Section 4401. Federal Register. 1992;57(212):49397-412.

2. Soumerai SB, Lipton HL. Computer-based drug-utilization review: risk, benefit or boondoggle? N Engl J Med. 1995;332:1641-5.
3. Hayward RA, McMahon LF, Bernard AM. Evaluating the care of general medicine inpatients: how good is implicit review? Ann Intern Med. 1993;118:550-6.
4. Kasa M, Hitomi K. Inter-rater reliability of the comprehensive psychopathological rating scale in Japan. Acta Psychiatr Scand. 1985;71:388-91.
5. Coryell W, Cloninger CR, Reich T. Clinical assessment: use of nonphysician interviewers. J Nerv Ment Dis. 1978;166:599-606.
6. Parker K. Drug Utilization Review Reviewer Handbook. Baltimore, Md: Mid-Atlantic DUR, Inc; 1992.
7. Brennan PF, Hays BJ. The kappa statistic for establishing interrater reliability in the secondary analysis of qualitative clinical data. Res Nurs Health. 1992;15:153-8.
8. Siegel S. Nonparametric Statistics for the Behavioral Sciences. New York, NY: McGraw-Hill Book Co; 1956.
9. Fleiss JL, Cohen J, Everitt BS. Large sample standard errors of kappa and weighted kappa. Psychol Bull. 1969;72:323-7.
10. Dawson-Saunders B, Trapp RG. Basic and Clinical Biostatistics. 2nd ed. Norwalk, CT: Appleton and Lange; 1994.
11. Waltz CF, Strickland OL, Lenz ER. Measurement in Nursing Research. 2nd ed. Philadelphia: F.A. Davis Company; 1992.
12. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-74.
13. Spitzer RL, Endicott J, Robins E. Research diagnostic criteria: rationale and reliability. Arch Gen Psychiatr. 1978;35:773-82.
14. Pfohl B. Effect of clinical experience on rating DSM III symptoms of schizophrenia. Compr Psychiatr. 1980;21:233-5.
15. Hanlon JT, Schmader KE, Samsa GP, et al. A method for assessing drug therapy appropriateness. J Clin Epidemiol. 1992;45:1045-51.
16. Tierney WM, Overhage JM, Takesue BY, et al. Computerizing guidelines to improve care and patient outcomes: the example of heart failure. J Am Med Inform Assoc. 1995;2:316-22.
17. Mullins CD, Baldwin R, Perfetto EM. What are outcomes? J Am Pharm Assoc. 1996;NS36:39-49.
18. Landon BE, Tobias C, Epstein AM. Quality management by state Medicaid agencies converting to managed care. JAMA. 1998;279:211-6.
