Outpatient process quality evaluation and the Hawthorne Effect


Social Science & Medicine 63 (2006) 2330–2340 www.elsevier.com/locate/socscimed

Kenneth Leonard a,*, Melkiory C. Masatu b

a Department of Agricultural Economics, University of Maryland, 2200 Symons Hall, College Park, MD 20742, USA
b Centre for Educational Development in Health, Arusha, Tanzania

Available online 2 August 2006

Abstract

We examine the evidence that the behavior of clinicians is affected by the fact that they are being observed by a research team. Data on the quality of care provided by clinicians in the Arusha region of Tanzania show a marked fall in quality over time as new patients are consulted. By conducting detailed interviews with patients who consulted both before and after our research team arrived, we are able to show strong evidence of the Hawthorne effect. Patient-reported quality is steady before we arrive, rises significantly (by 13 percentage points) at the moment we arrive, and then falls steadily thereafter. We show that quality after we arrive begins to look similar to quality before we arrived between the 10th and 15th consultations. Implications for quality measurement and policy are discussed.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Tanzania; Hawthorne effect; Outpatient department quality evaluation; Audit and feedback

Introduction

When researchers record the behavior of individuals there is always a concern that people will alter their behavior if they know they are being observed. Health care research in developing countries is moving from measuring fixed structural characteristics of health facilities to measuring process quality, such as the quality of care provided in consultations.¹ However, unlike a leaky roof or missing medical supplies, doctors can alter their behavior when they know they are being observed. Research in developing countries has shown significant variation in consultation quality (Das & Hammer, 2005; Leonard & Masatu, 2005); can researchers have confidence that this measured variation reflects the real variation in quality that would exist in the absence of a research team? Even when doctors are observed in their own environment consulting their regular patients, and care is taken to assure the doctors that they are insulated from any impact of being observed, it is possible that the mere presence of another person can alter their behavior. This paper examines data on the behavior of clinicians in Tanzania that have been used to measure quality and asks whether the presence of a member of the research team changed the behavior of clinicians, showing evidence of the Hawthorne effect.

We examine the use of one instrument for measuring quality (direct consultation observation, or DCO), in which the researcher fills in a checklist of expected procedures during a clinician's regular consultations. Previous work with this instrument has shown a marked fall in quality beginning shortly after the team arrives, consistent with a diminishing Hawthorne effect. We pair this instrument with interviews of patients conducted immediately after the consultation. The interview attempts to fill in the same checklist as the DCO instrument based on the patient's recall of the consultation. This second checklist (retrospective consultation review, or RCR) is a reasonable measure of the same quality as observed by the research team with DCO. Importantly, we can measure pre-observation quality by interviewing patients who had their consultations before the research team arrived, allowing us to quantify the Hawthorne effect.

The Hawthorne effect

The Hawthorne effect refers to a situation in which an individual's behavior is altered by the observation itself. Mayo (1933) documented a temporary increase in employee productivity in response to a change in lighting intensity and other interventions. Because these interventions varied without the intention to increase workers' performance, Mayo argued that the increase in productivity was the result of personal attention and of the newness of a program (Benson, 2000). Both the methodology of the original experiment and the description of the Hawthorne effect have since been called into question (Jones, 1992; Kolata, 1998; Wickstrom & Bendix, 2000), but the original understanding of this effect has survived these debates.

* Corresponding author. Tel.: +1 301 405 8589. E-mail addresses: [email protected] (K. Leonard), [email protected] (M.C. Masatu).
¹ New tools being used in developing country contexts include vignettes (Das & Hammer, 2005; De Geyndt, 1995; Leonard & Masatu, 2005; Peabody et al., 1994, 1998, 2000; Thaver, Harpham, McPake, & Garner, 1998; Thomas et al., 1997), the simulated client method (Madden et al., 1997), the tracer approach (Evans et al., 2001), chart abstraction (Peabody et al., 2000), standardized patients (Dresselhaus et al., 2000), participatory evaluation (Bradley et al., 2002) and direct clinician observation (Leonard, Mliga, & Mariam, 2002; Mliga, 2000).

doi:10.1016/j.socscimed.2006.06.003
The Hawthorne effect has been described in many different settings, primarily in education but also in health care (Campbell, Maxey, & Watson, 1995; Ely, Oshero, Chambliss, Ebell, & Rosenbaum, 2005; Gimotty, 2002; Leung, Lam, Lam, To, & Lee, 2003; Verstappen et al., 2004). The Hawthorne effect is characterized by a positive and temporary change in some measurable behavior in a situation where the observer had no intention of actually affecting the other individual's behavior. The Hawthorne effect is distinct from an incentive effect, in which clinicians alter their behavior because they suspect


they may be penalized or rewarded for their behavior. Though the literature suggests that the Hawthorne effect may require a "perceived demand for performance" (Campbell et al., 1995), the pure incentive effect can be differentiated from the Hawthorne effect by the fact that, under the Hawthorne effect, individuals eventually return to their pre-observation level of activity even when they remain under observation.

Data and methods

Data collection

Over the course of the past 3 years, Dr. Masatu has led a team examining OPD consultation quality in a sample of 45 facilities in the urban and rural Arusha region of Tanzania, including parts of Arusha municipal district, Arumeru district, and Monduli district. The quality evaluation focused exclusively on outpatient clinics and studied the behavior of clinicians for common illnesses in the region. We have examined 107 clinicians using the DCO instrument, observing the consultations of over 1100 patients. In 2005 Dr. Masatu examined a small subsample of this dataset using both the DCO and RCR instruments. In this smaller sample there are 12 clinicians (at 11 facilities). The RCR questionnaire was administered to 320 patients, of whom 136 also had their consultations observed with the DCO instrument. Some of the RCR questionnaires were administered to patients who had visited other clinicians in the same facility, so we have only 211 total interviews with patients who visited the clinicians we are examining. Thus, on average, for each clinician we have RCR data on 6 consultations before we arrive and RCR and DCO data for 11 consultations after we arrive.

The DCO instrument

Following the terminology outlined in Foy et al. (2005), the DCO instrument is a type of process audit in which data are collected on the compliance of clinician activities, undertaken in the presence of the researcher, with practice recommendations.
Clinical practice recommendations are drawn from national protocols for the examination of outpatients suffering from conditions presenting with fever, cough or diarrhea as primary symptoms. For each of these categories there is a list of expected history taking questions as well as expected physical examination procedures as derived from national


protocols; we also include standard items on manner, general health education and referrals. The items included on the DCO and the full instruments are available on request. In total, there are 68 possible items on the checklist. Only a portion of the items will be relevant to a given consultation; relevance is determined by the researcher and depends on whether the patient presents with a fever, cough or diarrhea, the patient's age and whether they were referred. The number of relevant items ranges from 5 to 40, with an average consultation containing 25 relevant items.
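The scoring logic just described (a 68-item checklist filtered down to the items relevant for a given presentation, then scored as a percentage) can be sketched in a few lines. The item names and relevance rules below are invented for illustration; they are not the actual instrument:

```python
# Hypothetical sketch of DCO-style scoring. The real instrument has 68
# items across manner, history taking, physical examination and health
# education, with relevance depending on symptoms, age and referral.
CHECKLIST = {
    "asked_fever_duration":  {"relevant_for": {"fever"}},
    "asked_cough_duration":  {"relevant_for": {"cough"}},
    "took_temperature":      {"relevant_for": {"fever", "cough", "diarrhea"}},
    "observed_breathing":    {"relevant_for": {"cough"}},
    "gave_health_education": {"relevant_for": {"fever", "cough", "diarrhea"}},
}

def dco_score(symptom, observed_items):
    """Percentage of relevant checklist items the clinician complied with."""
    relevant = [k for k, v in CHECKLIST.items() if symptom in v["relevant_for"]]
    done = sum(1 for k in relevant if k in observed_items)
    return 100.0 * done / len(relevant)

# A fever consultation where two of the three relevant items were performed:
score = dco_score("fever", {"asked_fever_duration", "took_temperature"})
```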

The RCR instrument

The RCR instrument is similar to the DCO instrument except that data on clinician activities are collected from patients immediately following the consultation. There may or may not have been a member of the research team present during the consultation. The instrument follows the same format as the DCO, except that some questions are eliminated. Thus, if the DCO checklist includes the item "did the clinician check the patient's temperature?", the RCR includes the item "did the clinician check your temperature?" In some cases, such as "does the clinician observe the patient's breathing pattern?", we concluded that there was no way a patient could know whether or not a clinician had paused to observe their breathing. Thus the set of relevant questions is reduced to 43 possible questions, with a minimum of 3, a maximum of 40 and an average of 20 questions.

Each patient who is consulted after we arrive is handed a card identifying them by a number, and this number is recorded on both the DCO and RCR forms. Thus we can match RCR to DCO data, and it is straightforward to sort patients according to their order of consultation. For patients who were consulted before we arrived, we ask how long it has been since they left the doctor's office. Using this information and the time of the interview, we sort these patients according to the order in which they were consulted. The items included on the RCR and the full instruments are available on request.
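The matching and ordering steps just described can be sketched as follows; the card numbers, checklists and waiting times below are invented for illustration:

```python
# Hypothetical sketch. Patients seen after the team arrives carry a card
# number recorded on both the DCO and RCR forms; patients seen before
# are ordered by how long ago they left the consultation (longest ago =
# earliest consultation).
dco = {7: {"took_temperature": 1}, 8: {"took_temperature": 0}}        # card -> DCO items
rcr_after = {7: {"took_temperature": 1}, 8: {"took_temperature": 1}}  # card -> RCR items
rcr_before = [("A", 95), ("B", 20), ("C", 55)]  # (patient, minutes since consultation)

# Match post-arrival RCR records to DCO records on the card number:
matched = {card: (dco[card], rcr_after[card]) for card in dco if card in rcr_after}

# Recover consultation order for pre-arrival patients (earliest first):
pre_order = [p for p, mins in sorted(rcr_before, key=lambda x: -x[1])]
```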

Methodology

In order to document the presence, size and duration of the Hawthorne effect, we examine the behavior of clinicians in the whole sample of 107 clinicians to validate that our small subsample is representative of this larger set. Fig. 1 is a semi-parametric representation of the quality of care for both samples of clinicians. The horizontal axis

[Fig. 1 appears here: percentage of items correct (45-65%) plotted against the order of patients in consultation (0-15), for the full sample and the subsample, each with 90% confidence bounds.]
Fig. 1. Changes in quality with the order of patients in consultation for full sample and subsample. Source: The subsample is from 136 observations of patients at 12 clinicians in 11 facilities using the DCO instrument in Arusha Municipal District (2005). This sub sample is combined with data from observations of 939 patients reporting at 95 clinicians at 39 facilities from Arusha Municipal District, Arumeru district and Monduli district (2002 and 2003). Lines represent a semi-parametric smoothed regression and confidence intervals following DiNardo and Tobias (2001), controlling for illness, patient and item characteristics. See Appendix A.
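The smoothed curves in Fig. 1 can be approximated by a local kernel regression of quality on consultation order. The following is a minimal Nadaraya-Watson sketch on toy data, not the DiNardo and Tobias (2001) procedure with covariate controls used in the paper:

```python
import math

def kernel_smooth(x, y, grid, bandwidth=2.0):
    """Nadaraya-Watson local average of y over x, evaluated at each grid point."""
    out = []
    for g in grid:
        w = [math.exp(-0.5 * ((xi - g) / bandwidth) ** 2) for xi in x]
        out.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return out

# Toy data mimicking quality that declines with consultation order, plus noise:
order = list(range(1, 16))
quality = [65 - 1.2 * o + (1 if o % 2 else -1) for o in order]
smooth = kernel_smooth(order, quality, grid=[1, 8, 15])
```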


shows the order of consultations, and the figure shows quite clearly that our measure of quality falls as clinicians are being observed. In addition, the smaller subsample, though it has wider confidence intervals (consistent with the smaller sample size), is not significantly different from the larger sample. A second stage of the analysis requires validating that the RCR instrument is reasonably representative of the data collected by the DCO instrument. We examine the correlation coefficient as well as a regression analysis using random effects at the level of the survey item (also referred to as the variance components or error components model).² Fig. 3 shows graphically that the changes in performance over time recorded in both the RCR and the DCO instruments come from changes in the reports of patients who agree with our observers, not from mismatched items. Table 3 examines the patterns of changing performance by item type across RCR and DCO to check that the instruments agree on the source of changing performance. Once we have validated the RCR instrument, we can examine the patterns in behavior before and after our team arrives, allowing us to test for the Hawthorne effect.

Fig. 1 shows that quality falls steadily after we arrive. However, the fall in quality over time is consistent with at least two patterns other than a diminishing Hawthorne effect. It may be the case that clinicians always get tired over the course of the day and that we are observing some portion of this pattern. On the other hand, it may be that the characteristics of patients vary in a manner that we do not observe and that quality changes because of this pattern. This could be a characteristic of the illness, such as severity within illness category. Characteristics of patients may also vary systematically; for example, it could be that educated people come early (or late).
All of the clinics sampled sort patients on a first-come, first-served basis without regard to illness or severity, but patients choose when to arrive and may differ in their ability to 'push' their way forward in a queue. Each of these alternative patterns can be distinguished from the Hawthorne effect by looking at the behavior of clinicians before we arrive. In the case of clinicians who tire over the course of their day, quality should decline steadily before and after

² The item-by-item compliance across the DCO and RCR instruments and the percentage of the time the two instruments agree are available on request from the authors and as Tables 5 and 6 at www.arec.umd.edu/people/faculty/kleonard/research.html.


our arrival, with no major changes in the pattern when we arrive. Any smooth pattern of behavior is consistent with the hypothesis that patients vary in unobservable characteristics. Quality may increase at some point in the day and then begin to decrease at another time of the day, but there is no reason to expect quality to change at any one point, much less at the exact time we arrive at a clinic. The features that distinguish the Hawthorne effect are a structural break in quality (in our case an increase) and a gradual return to the pre-arrival quality. The literature on the Hawthorne effect does not propose a duration for this effect, but if we observe this return during the time we are present, this is strong evidence in support of the Hawthorne effect. In particular, it helps to rule out the pure incentive effect, since changes in behavior due to a perceived change in incentives should remain for the duration of our stay. Thus we will attempt to show that after a reasonable length of observation, quality has returned to its pre-arrival level. This task is greatly simplified if it is true that neither of the hypothesized alternative sources of the fall in quality (the clinician becoming tired or patient sorting) is operating simultaneously with the Hawthorne effect. For example, if we have clinicians getting tired and a Hawthorne effect, we may observe the structural break, but would have a difficult time showing that quality had returned to pre-arrival levels because, in the absence of our team, quality later in the day could never be equal to quality earlier in the day. Thus it will be important to establish not just a structural break in the quality of care at the moment we arrive, but also that quality is steady before we arrive. Fig. 2 is a semi-parametric representation of the data that we explore in the following section. The methodology for these semi-parametric regressions and their confidence intervals is discussed in Appendix A.
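The identifying argument just described can be expressed as a single regression: a pre-arrival slope, a post-arrival slope, and an intercept shift at the moment the team arrives. A minimal sketch on simulated data follows (ordinary least squares; the paper's item fixed effects and clinician random effects are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
order = np.arange(-6, 12)                    # consultation order; team arrives at 1
present = (order >= 1).astype(float)         # 1 if an observer is in the room

# Simulate a Hawthorne pattern: flat quality before arrival, a 13-point
# jump when the team arrives, then a decline of 1 point per consultation.
quality = (50 + 13 * present - 1.0 * present * (order - 1)
           + rng.normal(0, 0.5, order.size))

# Design matrix: intercept, arrival jump, pre-arrival slope, post-arrival slope.
X = np.column_stack([
    np.ones(order.size),
    present,
    (1 - present) * order,           # slope while not observed
    present * (order - 1),           # slope while observed (recentred at arrival)
])
beta, *_ = np.linalg.lstsq(X, quality, rcond=None)
jump, slope_after = beta[1], beta[3]   # expect roughly +13 and -1
```

Under the Hawthorne hypothesis the pre-arrival slope is near zero, the jump is positive, and the post-arrival slope is negative; a tiring clinician would instead produce a common negative slope on both sides with no jump.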
Each of the lines, representing local-area regression smoothing, allows for the impact of possible covariates, including age and symptom category, and a fixed effect for each clinician observed. Shown are the quality of care and 95% confidence intervals measured using both the RCR (dashed line) and DCO (solid line) instruments. The DCO and RCR instruments have the same overall pattern, though they are not perfectly matched. Importantly, the RCR instrument shows a large structural break in quality at the moment the research team arrives at the facility. In addition, there is no evidence of any pattern in quality in the


[Fig. 2 appears here: percentage of items correct (40-70%) plotted against the order of patients in consultation (-10 to 15; the team arrives at 1), showing the team report (DCO) and the patient report (RCR), each with 90% confidence bounds.]

Fig. 2. Changes in quality with the order of patients in consultation and whether or not the research team is present. Source: The team report is based on 136 observations of 12 clinicians in 11 facilities. The patient report is based on 320 patients at these same facilities, including 136 who were also observed by the team. Data collected by the authors in Arusha Municipal District (2005) using RCR and DCO. Only DCO items with corresponding RCR items are included. 0 represents the point where the research team arrives at the facility. Lines represent a semi-parametric smoothed regression and confidence intervals following DiNardo and Tobias (2001), controlling for illness, patient and item characteristics. See Appendix A.

period before we arrive. These patterns are tested more explicitly and confirmed in the analysis below.

Findings

The coefficient of correlation between each item on the RCR instrument and the corresponding item on the DCO instrument is 0.28 (1645 observations, p-value < 0.001). Table 1 shows that, even after including an item-level random effect, the two scores are well correlated. Patients appear to be more generous or forgiving than the clinicians on our research team, but the probability of a patient reporting that a doctor did something increases when the research team says the doctor did that same thing. The average item-by-item compliance in manner, history taking, physical examination and health education for both RCR and DCO and the agreement between the two instruments are shown in Table 2.³ The two instruments agree 65% of the time on average: 77% of the time for clinician manner, 63% of the time for history taking, 69% of the time for physical examination and 57% of the

³ The values for and descriptions of all individual items on the instruments are available on request from the authors and as Tables 5 and 6 at www.arec.umd.edu/people/faculty/kleonard/research.html.

Table 1
Correlation between RCR and DCO instruments for each item

Dependent variable: item correct on RCR (0 or 1)

                         Coef.      Std. Err.
Item correct on DCO      0.170*     (0.024)
Constant                 0.460*     (0.040)
Random effects           Included

Source: 1645 observations from 136 consultations observed with both the RCR and DCO instruments, collected by the authors in Arusha Municipal District (2005). * Indicates significance at the 95% level. R² within groups is 0.03; R² between groups is 0.16.
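The first validation step, the raw item-by-item correlation, can be computed directly from matched binary scores. A sketch on toy data follows (the random-effects regression in Table 1 additionally absorbs item-level heterogeneity, which this simple correlation does not):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Matched item scores (1 = instrument records compliance). The toy pattern
# mimics the paper's finding: patients largely agree with the observer but
# are somewhat more generous.
dco = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
rcr = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]
r = pearson(dco, rcr)
```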

time for health education items. On average, patients are more likely than the researcher to say that the clinician complied for manner, physical examination and health education, but less likely for history taking. Although the agreement in scores is reasonable, it is not perfect. In examining changes in behavior that come from the RCR and DCO instruments, it is possible that RCR and DCO agree overall, but for the wrong reasons: patients may report that clinicians are changing compliance on some items while the researcher reports that they are changing compliance on other items. Fig. 3 shows that the


Table 2
DCO and RCR item descriptions and average responses: Part I

                                  Compliance                      Correspondence in matched items
                            DCO            RCR            Agreement        Yes/yes  Yes/no  No/no  No/yes
Item                        %     Obs      %     Obs      %      Obs       (%)      (%)     (%)    (%)
All items: total            55    3158     55    5125     65     1645      41       16      24     19
Clinician manner: total     76    779      99    878      77     327       77       0       0      23
History taking: total       59    1096     47    1844     63     573       39       21      24     16
Physical examination: total 33    568      41    944      69     246       25       14      44     17
Health education: total     43    709      49    1454     57     499       27       22      30     21

Source: 1136 and 320 consultations observed with the DCO and RCR instruments, respectively, collected by the authors in Arusha Municipal District (2005). % is the percentage of the time that an item is provided when it was required. Agreement is when both instruments record 1/1 (yes/yes) or 0/0 (no/no).
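The correspondence columns of Table 2 are simply the four cells of a 2x2 cross-tabulation of matched item scores. A sketch on invented data:

```python
from collections import Counter

def match_types(dco, rcr):
    """Percentage of matched items in each (DCO, RCR) agreement cell."""
    n = len(dco)
    cells = Counter(zip(dco, rcr))
    pct = lambda c: 100.0 * cells[c] / n
    return {
        "yes/yes": pct((1, 1)), "yes/no": pct((1, 0)),
        "no/no": pct((0, 0)), "no/yes": pct((0, 1)),
    }

# Toy matched scores for eight items (1 = compliance recorded):
dco = [1, 1, 0, 0, 1, 0, 1, 0]
rcr = [1, 0, 0, 1, 1, 0, 1, 1]
cells = match_types(dco, rcr)
agreement = cells["yes/yes"] + cells["no/no"]   # both instruments say the same thing
```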

[Fig. 3 appears here: percentage of all matched items (0-50%) plotted against the order of patients in consultation (0-20), for the four match types: DCO yes/RCR yes, DCO no/RCR no, DCO no/RCR yes, DCO yes/RCR no.]

Fig. 3. Changes in the types of matches between the RCR and DCO instruments with the order of patients in consultation. Source: Based on 136 observations of 12 clinicians in 11 facilities evaluated with both RCR and DCO, collected by the authors in Arusha Municipal District (2005). Lines represent a smoothed regression. See Appendix A.

patterns observed come from items on which RCR and DCO agree. The proportions of yes/no and no/yes matches are high (16% and 19%, respectively) but largely unchanging over time. Patients and researchers who agree with one another, and whose reports change from compliance to non-compliance over time, are driving the patterns seen in Fig. 2.

Table 3 shows the change in compliance over the whole sample (without taking into account any covariates) for those items that match on both RCR and DCO, for consultations observed with both instruments. Compliance for the first three consultations is compared to compliance for all consultations after the fifth consultation. Changes in compliance come from all five types of activities, except that patients do not observe the same changes in clinician manner that are observed by the research team. The largest changes in compliance come from health education, but for the DCO instrument each type of item shows decreased compliance. This table also suggests the magnitude of the effect, given the importance of each type of item on the instrument. Overall, for the first 3 observations, clinicians comply with 2.56 more


Table 3
Patterns of input provision by input type

                                 First 3 cons     All cons > 5     Difference
Sample                           %      Obs       %      Obs       %      Items
DCO matched with RCR
  All items: total               71     259       54     1021      16     2.56
  Clinician manner: total        86     57        72     195       14     0.42
  Physical examination: total    52     31        41     157       11     0.29
  History taking: total          67     86        57     359       10     0.54
  Health education: total        71     85        46     310       24     1.20
RCR matched with DCO
  All items: total               69     259       55     1021      14     2.24
  Clinician manner: total        100    57        99     195       1      0.00
  Physical examination: total    52     31        37     157       15     0.39
  History taking: total          65     86        49     359       16     0.87
  Health education: total        58     85        42     310       16     0.80

Source: 136 consultations observed with both the RCR and DCO instruments, collected by the authors in Arusha Municipal District (2005). Difference is the difference between the compliance percentage for the first three observations and the percentage for all consultations after the 5th consultation.
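The Table 3 comparison pools item-level compliance for the first three consultations against all consultations after the fifth. A sketch with invented per-consultation records:

```python
# Hypothetical per-consultation records:
# (consultation order, items complied with, items relevant)
consults = [
    (1, 18, 25), (2, 17, 24), (3, 16, 22),                  # early: observer effect strong
    (6, 13, 25), (7, 12, 23), (8, 11, 21), (9, 12, 24),     # later consultations
]

def pooled_pct(rows):
    """Pooled compliance: total items done over total items relevant."""
    done = sum(d for _, d, _ in rows)
    rel = sum(r for _, _, r in rows)
    return 100.0 * done / rel

first3 = pooled_pct([c for c in consults if c[0] <= 3])
after5 = pooled_pct([c for c in consults if c[0] > 5])
difference = first3 - after5   # percentage-point drop, as in Table 3
```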

Table 4
Patterns of quality over the order of consultations for RCR and DCO instruments

Dependent variable          DCO item(a),      RCR item(b),      RCR item(b),      RCR item(c),
                            clinician         clinician         clinician         clinician
                            observed          observed          observed          not observed
Consultation order
  Order, overall            -0.012 (0.006)
  Order, present                              -0.046 (0.007)                      -0.035 (0.031)
  Order, not present                          0.001 (0.012)                       0.012 (0.010)
  Present (yes/no)                            0.479 (0.096)                       0.136 (0.181)
Consultation blocks
  n ≤ 0                                                         0.424 (0.183)
  0 < n ≤ 5                                                     0.783 (0.183)
  5 < n ≤ 10                                                    0.577 (0.183)
  10 < n ≤ 15                                                   0.284 (0.185)
Item fixed effects          Included          Included          Included          Included
Clinician random effects    Included          Included          Included          Included
Observations                3159              3343              3343              1782

Source: 136 and 320 consultations observed with the DCO and RCR instruments, respectively, collected by the authors in Arusha Municipal District (2005). Each regression is a random-effects probit with random effects at the clinician level and dummy variables included for each item. Standard errors in parentheses.
(a) Dependent variable is all relevant items for each of 136 consultations observed with the DCO instrument.
(b) Dependent variable is all relevant items for each of 211 consultations observed with the RCR instrument for clinicians who were also observed with the DCO instrument.
(c) Dependent variable is all relevant items for each of 109 consultations observed with the RCR instrument for clinicians who were not observed with the DCO instrument (but who practiced at the same facilities).
* Indicates significance at the 95% level; ** at the 90% level.

items on the DCO instrument and 2.24 more items on the RCR instrument. Table 4 shows the results of four probit regressions of the probability of a clinician correctly

implementing a relevant item as a function of the order of consultation. Each of the four regressions is a probit regression with clinician random effects and dummy variables (fixed effects) for each item.
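The regression just described can be written out explicitly; the following is a sketch of the random-effects probit specification (notation ours, condensed from the description in the text):

```latex
\Pr\!\left(y_{ijk}=1 \mid u_j\right)
  = \Phi\!\Big(\alpha_k
      + \beta_1\,\mathrm{order}_{ij}\,\mathrm{present}_{ij}
      + \beta_2\,\mathrm{order}_{ij}\,\bigl(1-\mathrm{present}_{ij}\bigr)
      + \gamma\,\mathrm{present}_{ij}
      + u_j\Big),
  \qquad u_j \sim N\!\left(0,\sigma_u^2\right),
```

where y_ijk indicates whether clinician j performed relevant item k in consultation i, the alpha_k are the item dummies (fixed effects), u_j is the clinician random effect, and Phi is the standard normal distribution function. The first column of Table 4 replaces the two slopes and the jump with a single overall-order term; the third column replaces them with block dummies.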


The random-effects probit model is designed to fit panel data in which we expect significant correlation in the behavior of each clinician over time. The first column shows the pattern of quality using the DCO instrument. By construction, all data used in this regression were collected with the researcher present. The negative coefficient for overall order (order, overall) in the first column shows the same pattern as was shown with the DCO instrument in the full sample of clinicians: quality declines with the order of observation. The second column of Table 4 shows the pattern of quality using the RCR instrument. This can be differentiated by whether or not a researcher was present. We include three possible effects: a slope with consultation order before the researcher arrives (order, not present), a slope with consultation order after the researcher arrives (order, present) and a change in the value of the intercept when the researcher arrives (present (yes/no)). The slope with consultation order before the researcher arrives is very small and not significantly different from zero. When the researcher arrives, there is a large and significant jump in quality, and after that point quality declines. The combined size of the fixed change in quality when the team arrives (0.479) and the subsequent decline with consultation order (0.046) suggests the two effects have equal magnitude after about 10 consultations. The marginal effect of the variable present (yes/no) on the probability that a clinician will provide a required item is 0.132,⁴ suggesting that compliance for the average clinician and average illness increases by 13 percentage points the moment the team arrives. Assuming baseline compliance is about 50% (see Fig. 2), this is a 26% relative increase in compliance. This coefficient estimate for the size of the Hawthorne effect is consistent with the overall changes in compliance (without controlling for covariates) of 16 percentage points for DCO and 14 percentage points for RCR, as shown in Table 3.
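The marginal-effect calculation, and the percentage-point versus relative distinction, can be sketched as follows. The coefficient 0.479 is from Table 4, but the baseline linear index of zero used here is a hypothetical reference value; the paper evaluates the other regressors at their mean values, which is why it reports 0.132 rather than the number this toy calculation produces:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dummy_marginal_effect(base_index, coef):
    """Marginal effect of a 0/1 regressor in a probit: the change in the
    predicted probability from switching the dummy on, other terms fixed."""
    return norm_cdf(base_index + coef) - norm_cdf(base_index)

# Hypothetical baseline index of 0; 0.479 is the arrival coefficient from Table 4.
effect = dummy_marginal_effect(base_index=0.0, coef=0.479)

# Percentage-point vs. relative change, as in the text: a 13-point jump
# on a 50% baseline is a 26% relative increase.
baseline, jump = 0.50, 0.13
relative = jump / baseline   # 0.26
```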
The third column of Table 4 examines the same dependent variable as the second column (quality as seen by the patient) but examines the order of consultation in four blocks rather than as a continuous variable.⁵ Both the first five consultations after the team arrives and the next five after

⁴ The marginal effect of changing from 0 to 1 is calculated with both order, present and order, not present set to zero and all other variables held at their mean values.
⁵ The omitted category is consultations more than 15 consultations after we arrived. There are only five such observations: three at one facility and one each at two other facilities.

2337

that are significantly different from all consultations before the team arrives (p-values < 0.001 and 0.04, respectively). The consultations between 10 and 15 observations after the team arrives are not significantly different from those before the team arrives (p-value = 0.11). The fourth column of Table 4 examines the pattern for clinicians who are at the same facility as clinicians being observed but who are never observed themselves. For this column the variables order, present; order, not present; and present (yes/no) represent the research team's presence at the facility. If clinicians react to the fact that a team is present at the facility, rather than to the fact that a researcher is in the room, we should expect a Hawthorne effect. Although there is a slight positive effect when we arrive at the facility and a slight drop-off over time from that point, neither impact is significant. There is no evidence of a 'spill-over' Hawthorne effect in the facilities that we visited. The fact that there is no evidence of a simultaneous process that causes a pattern that looks like the Hawthorne effect supports our contention that the behavior is caused by the Hawthorne effect.

Discussion and conclusion

The patterns shown in Fig. 2 and validated with regression analysis in the previous section show that there is a large discontinuity in the quality of care provided when the research team arrives. Fig. 1 shows that the clinicians in our smaller sample increase their compliance from just over 50% to just over 60%, a 10-percentage-point increase in quality (a 20% relative gain). The regression analysis shown in the second column of Table 4 suggests a gain of 13 percentage points using the RCR instrument and controlling for clinician and illness effects. The DCO data from the larger sample of clinicians shown in Fig. 1 are consistent with a more modest 5-percentage-point gain in compliance.
This gain in quality dissipates slowly over time, so that quality returns to its original level after between 10 and 15 observations. This pattern demonstrates a classic Hawthorne effect: short-lived impacts due to observation. The study for which the DCO instrument was designed sought to characterize the determinants of quality in this area and was deliberately designed not to alter quality. However, at least initially, the presence of our team had an impact similar to that of interventions designed to improve quality. Our


data collection is similar in appearances to an audit and feedback study with the following components: the data on consultations with patients before our arrival are an audit; the introduction of our team is feedback; and the subsequent observations of consultations are the evaluation of the impact of feedback. The difference between our study and conventional audit and feedback studies is that our feedback contained no actual information. This effective design is similar to the methodology employed by Verstappen et al. (2004) in which each clinician was evaluated on procedures for which he received feedback as well as procedures for which he had not received feedback. Clinicians improved their performance even when the feedback contained no information for that procedure. The role of feedback in our study is simply to inform the clinician that his activities are being observed. Even uninformative feedback produced a large, but temporary effect. Reviews of audit and feedback studies in industrialized countries show that it is not a particularly effective form of intervention. Multi study reviews suggest a mean impact of 7% reduction in noncompliant behavior when audit and feedback is compared to no intervention (Jamtvedt et al., 2003, pp. 8). Grimshaw et al. (2001) suggest that the intensity of interaction can improve the gains in performance even when the feedback only provides information, not any mechanism to alter behavior. Though few studies look at the duration of the benefits, Jamtvedt et al. (2003, p. 9) suggest that the benefits may not be permanent. For example, Jones et al. (1996) showed that when feedback was withdrawn, performance gains deteriorated within 12 months. Campbell et al. (1995) explicitly examine the temporary gain in performance and conclude that, in their study, gains can be characterized as a Hawthorne effect. 
Their study further suggests that the effect requires a "perceived demand for performance," though not necessarily direct observation or feedback. Van den Hombergh, Grol, van den Hoogen, and van den Bosch (1999b) compare the change in performance when the audit (visit) is done by a physician and by a non-physician and find more significant changes when the visit is by a physician. This may be because practitioners were more comfortable being visited by non-physicians than by physicians (Van den Hombergh, Grol, van den Hoogen, & van den Bosch, 1999a); it is not hard to see how changes in behavior can be motivated by discomfort.

Our study suggests that a large component of the gains that would come from an audit and feedback study in this type of setting may come from the Hawthorne effect. These other studies, in turn, suggest that if we had provided real feedback we might have observed a longer duration for performance changes, but there is no guarantee that they would have been permanent. In addition, it is likely that the change in behavior would have been smaller if the observers had not also been clinicians. The organization with which our study was affiliated is an independent government institute, though one known to be associated with research rather than with regulation or funding. This fact, and the fact that our team was drawn from the ranks of public sector clinicians and nurses, may have had some effect on the magnitude of the Hawthorne effect. Leonard, Masatu, and Vialou (2005) examine the determinants of the Hawthorne effect in the larger sample of clinicians and find that some clinicians show no deterioration in performance even after a long period of time. Those data are not matched with the RCR instrument, so the authors cannot know what clinicians were doing before the team arrived, but they surmise that some clinicians are practicing very near to their abilities and have less scope to adjust their behavior when they are being observed. Leonard et al. (2005) suggest that proper incentives can encourage most clinicians to practice close to their abilities. This study shows that the average clinician can improve his or her practice without additional training; it remains to be shown how to encourage clinicians to do so all the time.

Limitations of this study

Twelve clinicians were observed over a total of 320 patients. The clinicians were drawn from the public, private and NGO sectors of the Arusha municipal district, and the results discussed are strong enough to be significant despite the small sample size. In addition, the overall pattern in quality is seen in the much larger study of 107 clinicians and over 1100 patients. Though the larger sample shows a smaller magnitude for the Hawthorne effect, it confirms its existence. The rural and urban areas of Arusha district are similar to districts throughout Tanzania and potentially similar to most developing countries with weak public sector regulation. Weak regulation is important because it might allow clinicians to practice at levels of quality significantly below their capacity, thereby allowing them to quickly improve quality. Thus our study suggests the possibility of a Hawthorne effect in similar circumstances, though it cannot claim to measure the magnitude with precision. Where clinicians practice close to their capacity, uninformative feedback of the type used in this study cannot have a strong impact on compliance. Given the nature of our study, no conclusions should be drawn about the Hawthorne effect in countries with strong systems of regulatory oversight.

Conclusion

When doctors in Tanzania are observed in the course of their normal outpatient consultations they alter their behavior so as to appear more conscientious and careful. The same effect is not observed in other clinicians in the same facility who are not being observed. The Hawthorne effect increases quality by about 13 percentage points immediately, but this impact diminishes with further consultations. Most doctors exhibit the same behavior after 10–15 patients as they did before the team arrived. The Hawthorne effect shows that quality can change even without any change in the structural characteristics of a facility. In addition, the measurement of quality can be significantly biased by the Hawthorne effect if researchers do not adjust for its impact or do not stay long enough to observe the regression to type. The fact that clinicians can change their practices in such a short period of time calls into question the validity of process quality instruments in which the researcher is present during evaluation. In such a setting, it is possible that observed gains in quality are temporary and not indicative of real changes, or that differences in observed quality across clinicians are due to different reactions to observation, not real differences in quality.

Appendix A. Semi-parametric analysis

The semi-parametric regression lines are derived following the technique outlined below, based on DiNardo and Tobias (2001). Each appropriate consultation item is either correct (1) or incorrect (0).
These scores are sorted, item by item, by consultation order. A first-differenced score for each item is created by subtracting, for the same clinician, the score on the same item for the previous patient. This first-differenced (de-trended) score is then regressed on observable characteristics:


the age of the patient, the symptoms presented, the number of symptoms, clinician dummy variables and item dummy variables. The predicted difference is subtracted from the actual score for each item, thereby removing the impact of observable covariates. This net score is then graphed using kernel regression (local area regression), with a bin width of 3 observations and a Gaussian kernel. The confidence intervals are obtained by bootstrap estimation: drawing 100 samples with replacement of the net score, repeating the kernel regression, and finding the 5th and 95th percentiles of the resulting distribution of points at each patient order point. Figs. 1 and 2 are semi-parametric local area regressions and Fig. 3 is a local area regression (no covariates).

References

Benson, P. G. (2000). The Hawthorne effect. In W. E. Craighead, & C. B. Nemeroff (Eds.), The Corsini encyclopedia of psychology and behavioral science (3rd ed., Vol. 2). New York: Wiley.
Bradley, J. E., et al. (2002). Participatory evaluation of reproductive health care quality in developing countries. Social Science and Medicine, 55(2), 269–282.
Campbell, J. P., Maxey, V. A., & Watson, W. A. (1995). Hawthorne Effect: Implications for prehospital research. Annals of Emergency Medicine, 26(5), 590–594.
Das, J., & Hammer, J. (2005). Which doctor? Combining vignettes and item response to measure doctor quality. Journal of Development Economics, 78, 348–383.
De Geyndt, W. (1995). Managing the quality of health care in developing countries. World Bank technical paper 258. Washington, DC: The World Bank.
DiNardo, J., & Tobias, J. L. (2001). Nonparametric density and regression estimation. Journal of Economic Perspectives, 15(4), 11–28.
Dresselhaus, T. R., Peabody, J. W., et al. (2000). Measuring compliance with preventive care guidelines: Standardized patients, clinical vignettes, and the medical record. Journal of General Internal Medicine, 15(11), 782–788.
Ely, J. W., Osheroff, J. A., Chambliss, M. L., Ebell, M. H., & Rosenbaum, M. E. (2005). Answering physicians' clinical questions: Obstacles and potential solutions. Journal of the American Medical Informatics Association, 12, 217–224.
Evans, P. J., et al. (2001). Evaluation of community-based rehabilitation for disabled persons in developing countries. Social Science and Medicine, 53(3), 333–348.
Foy, R., Eccles, M. P., Jamtvedt, G., Young, J., Grimshaw, J. M., & Baker, R. (2005). What do we know about how to do audit and feedback? Pitfalls in applying evidence from a systematic review. BMC Health Services Research, 5(50), 1–7.
Gimotty, P. A. (2002). Delivery of preventive health services for breast cancer control: A longitudinal view of a randomized controlled trial. Health Services Research, 37(1), 65–85.
Grimshaw, J. M., Shirran, L., Thomas, R., Mowatt, G., Fraser, C., Bero, L., et al. (2001). Changing provider behavior: An overview of systematic reviews of interventions. Medical Care, 39(8), II-2–II-45 (Suppl. LC2).
Jamtvedt, G., Young, J. M., Kristoffersen, D. T., Thomson, L., O'Brien, M. A., & Oxman, A. D. (2003). Audit and feedback: Effects on professional practice and health care outcomes (Review). The Cochrane Database of Systematic Reviews, (3).
Jones, H. E., Cleave, B., Zinman, B., Szalai, J. P., Nichol, H. L., & Hoffman, B. R. (1996). Efficacy of feedback from quarterly laboratory comparison in maintaining quality of a hospital capillary blood glucose monitoring program. Diabetes Care, 19(2), 168–170.
Jones, S. R. G. (1992). Was there a Hawthorne effect? The American Journal of Sociology, 98(3).
Kolata, G. (1998). Scientific myths that are too good to die. The New York Times.
Leonard, K. L., & Masatu, M. C. (2005). Comparing vignettes and direct clinician observation in a developing country context. Social Science and Medicine, 61(9), 1944–1951.
Leonard, K. L., Gilbert, M. K., & Damen, H. M. D. (2002). Bypassing health facilities in Tanzania: Revealed preferences for observable and unobservable quality. Journal of African Economies, 11(4), 441–471.
Leonard, K. L., Masatu, M. C., & Vialou, A. (2005). Getting doctors to do their best: Ability, altruism and incentives. Mimeo: University of Maryland.
Leung, W. C., Lam, H. S. W., Lam, K. W., To, M., & Lee, C. P. (2003). Unexpected reduction in the incidence of birth trauma and birth asphyxia related to instrumental deliveries during the study period: Was this the Hawthorne effect? BJOG: An International Journal of Obstetrics and Gynaecology, 110(3).
Madden, J. M., et al. (1997). Undercover careseekers: Simulated clients in the study of health provider behavior in developing countries. Social Science and Medicine, 45(10), 1465–1482.
Mayo, E. (1933). The human problems of an industrial civilization. New York: Macmillan.
Mliga, G. R. (2000). Decentralization and the quality of health care. In D. K. Leonard (Ed.), Africa's changing markets for human and animal health services (Chapter 8). London: Macmillan. Also available at http://repositories.cdlib.org/uciaspubs/editedvolumes/5/.
Peabody, J. W., et al. (1994). Quality of care in public and private primary health care facilities: Structural comparisons in Jamaica. Bulletin of the Pan American Health Organization, 28, 122–141.
Peabody, J. W., et al. (1998). The effects of structure and process of medical care on birth outcomes in Jamaica. Health Policy, 43(1), 1–13.
Peabody, J. W., et al. (2000). Comparison of vignettes, standardized patients, and chart abstraction: A prospective validation study of 3 methods for measuring quality. Journal of the American Medical Association, 283, 1715–1722.
Thaver, I. H., Harpham, T., McPake, B., & Garner, P. (1998). Private practitioners in the slums of Karachi: What quality of care do they offer? Social Science and Medicine, 46, 1441–1449.
Thomas, D., et al. (1997). Indonesian family life survey, survey instruments.
Van den Hombergh, P., Grol, R., van den Hoogen, H. J. M., & van den Bosch, W. J. H. M. (1999a). Practice visits as a tool in quality improvement: Acceptance and feasibility. Quality in Health Care, 8, 161–171.
Van den Hombergh, P., Grol, R., van den Hoogen, H. J. M., & van den Bosch, W. J. H. M. (1999b). Practice visits as a tool in quality improvement: Mutual visits and feedback by peers compared with visits and feedback by non-physician observers. Quality in Health Care, 8, 161–166.
Verstappen, W. H. J. M., van der Weijden, T., ter Riet, G., Grimshaw, J., Winkens, R., & Grol, R. P. T. M. (2004). Block design allowed for control of the Hawthorne effect in a randomized controlled trial of test ordering. Journal of Clinical Epidemiology, 57, 1119–1123.
Wickstrom, G., & Bendix, T. (2000). The Hawthorne effect? What did the original Hawthorne studies actually show? Scandinavian Journal of Work Environment & Health, 26(4), 363–367.