An investigation into expectation-led interviewer effects in health surveys

Paul Clarke, Kerry Sproston, Roger Thomas
National Centre for Social Research, 35 Northampton Square, London EC1V 0AX, UK

Social Science & Medicine 56 (2003) 2221–2228

Abstract

Many large-scale health surveys use interviewers to obtain standardised information about the health of the general population. To improve response rates and data quality, the researchers/designers usually brief the interviewers to familiarise them with the survey procedures and stimulate their interest in the survey. However, it is possible that interviewers, having been exposed to researchers' expectations, may inadvertently influence respondents to produce outcomes consistent with those expectations. Such effects are referred to here as 'expectation-led interviewer effects'. In this paper, the design and results from an experiment to test for expectation-led interviewer effects are described. The experiment involved conducting two health surveys, called the 'experimental' and the 'control', which were identical in every way except that researchers made a reference to a supposed link between childhood and adult health at the experimental survey briefing. The testing procedure was designed prior to data collection to preclude accusations of data dredging and to ensure that the type I error probability was less than 5 percent. No consistent evidence of expectation-led interviewer effects was found, bar a statistically significant effect for health questions requiring the recall of detailed quantitative information. This effect was small, however.

© 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Non-sampling error; Sample surveys; Interviews; Ordinal log-linear model; Experimental design

Introduction

Health surveys of the general population are an important source of information for the description and analysis of social variation in health (for example, Marmot & Wilkinson, 1999). Such surveys are often conducted using interview methods. In the United Kingdom, current examples are the Health Survey for England (Prescott-Clarke & Primatesta, 1998) and the Scottish Health Survey (Erens & Dong, 1997). For such large-scale surveys it is necessary to use trained interviewers. The versatile skills of the interviewer in making contact, introducing the survey, securing cooperation, steering and encouraging the respondent through the questionnaire, and carrying out measuring

procedures, make it possible to administer long and demanding instruments without an unacceptable fall-off in response. The use of interviewers also results in improved data quality, requiring less editing than alternatives such as self-completion questionnaires (see De Leeuw and Collins (1997) for a review of the advantages of using interviewers). Interviewer errors are thought to arise in a variety of circumstances. For example, when asking a question or probing for an answer, interviewers might subconsciously send verbal or non-verbal signals suggesting that some answers are particularly interesting or ‘what we want to hear’ (Lyberg & Kasprzyk, 1991). Alternatively, respondents might consciously or subconsciously tailor their answers to fit in with what they perceive to be social or behavioural norms. The impact of interviewer effects of this type has been reviewed by Wynder (1994) and Eisenhower, Mathiowetz, and Morganstein (1991). Commonly, survey organisations try to reduce the size of interviewer effects by recruiting specialised


field-forces and training them to follow standardised interview procedures. Standardisation involves giving interviewers strict guidelines on what they may say while maintaining rapport with the respondent, and training them to ask questions verbatim from the questionnaire. Previous studies have found standardisation to be an effective way of minimising interviewer effects (for example, Fowler, 1991).

Fowler (1991) and others have found that the performance of interviewers is likely to improve if they are motivated, not only by performance targets for remuneration but also by an interest in the research background to the survey. Hence, in complex surveys it is now common practice for researchers to brief interviewers face-to-face. Survey briefings focus partly on explaining special survey procedures and why the survey is being conducted, but also partly on allowing the researchers to motivate the interviewers by communicating their enthusiasm for the survey and the research. However, this too carries a risk of importing bias if the researchers imply that a particular relationship is expected between answers to survey questions. For example, suppose that the researchers say they believe poor childhood health leads to poor adult health. Once an interviewer has recorded a respondent's childhood health, he or she may inadvertently behave in a way that induces the expected relationship by influencing the respondent's later answers. The interviewer effect caused by interviewers being influenced by the researchers' expectations is referred to here as an 'expectation-led' interviewer effect.

The presence of expectation-led interviewer effects could distort the observed correlation between two questions, A and B, and lead to misleading conclusions being drawn in the subsequent analysis. This is particularly true if the association being studied is weak (for example, Wynder, 1987). At present, the existence of expectation-led interviewer effects is entirely hypothetical, but there are many studies whose results could potentially be affected. For example, recent studies into the relationship between job strain and cardiovascular disease (Sacker, Bartley, Firth, Fitzpatrick, & Marmot, 2001) and between income and health (Ecob & Davey Smith, 1999) have used data from interview surveys. This paper investigates whether expectation-led effects exist in practice and, if so, the magnitude of the effects.

As was discussed above, the use of interviewers to administer complex surveys is well grounded, and the existence of large expectation-led interviewer effects would have serious repercussions for their use. For this reason, it was considered appropriate to require a high standard of evidence. Hence the experiment was designed to minimise the chance of incorrectly finding expectation-led effects, even at the expense of some power to detect genuine effects. The design and results of this experiment are described below.

Methods

Subjects and design

The interviewer subjects are the highly trained field-force employed by the National Centre for Social Research, an established social research organisation based in the United Kingdom. At any one time, there are over 1000 interviewers working for the National Centre for Social Research, a large proportion of whom have some years' field experience.

The aim of this experiment was to test for the existence of expectation-led interviewer effects in a typical health survey. To make the results as applicable as possible to non-experimental settings, it was carried out under the guise of two substantive health surveys called 'experimental' and 'control', which were replicates in every respect except for a particular paragraph in the interviewer briefing script. This design allowed the expectation-led interviewer effect to be measured by simply comparing inferences from the experimental and control surveys. The design, analysis and findings of this experiment are documented below.

Sampling and randomisation

To begin, a sample of individuals was taken from the general adult population of England using a three-stage cluster design (for example, Cochran, 1977). At stage one, a number of postal sectors were randomly selected from the population of postal sectors that partition England into geographical areas. Stage two involved a sample of addresses being selected at random from within the selected postal sectors. Finally, at stage three, one adult from each address was selected at random. Within each postal sector the selected individuals were allocated to the experimental or control surveys at random. Thus, the two survey samples were identical in terms of potential interviewees, aside from differences attributable to sampling variation.

Interviewers were then allocated to the two subsurveys, as sketched below. First, for each of the selected postal sectors, two interviewers were chosen using the standard criteria of availability and residential proximity to the sector. Second, one of the two interviewers was randomly assigned to the experimental survey and the other to the control survey. The interviewers invited to work on the project were told only that they were participating in a standard health survey. In order to avoid any interviewers becoming aware that an experiment was taking place, the interviewer teams for the experimental and control surveys were kept segregated.
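As a concrete illustration of the design just described, the sketch below shows the three sampling stages and the two randomisations. It is not the survey's actual code; the frame structure and helper names are assumptions made for the example, and the per-respondent allocation shown is a simplification of a balanced within-sector split.

```python
# Illustrative sketch of the three-stage cluster sample and the random
# allocation of respondents and interviewers to the two subsurveys.
import random

def draw_sample(frame, n_sectors, n_addresses_per_sector):
    """frame: dict mapping postal sector -> dict mapping address -> list of adults."""
    allocation = []  # (sector, address, adult, survey) tuples
    for sector in random.sample(list(frame), n_sectors):                       # stage 1: sectors
        for address in random.sample(list(frame[sector]), n_addresses_per_sector):  # stage 2: addresses
            adult = random.choice(frame[sector][address])                      # stage 3: one adult
            survey = random.choice(["experimental", "control"])                # respondent allocation
            allocation.append((sector, address, adult, survey))
    return allocation

def assign_interviewers(sector_interviewers):
    """sector_interviewers: dict mapping sector -> the two locally available interviewers."""
    roles = {}
    for sector, pair in sector_interviewers.items():
        experimental, control = random.sample(pair, 2)  # random split within the sector
        roles[sector] = {"experimental": experimental, "control": control}
    return roles
```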


Interviewer briefings

The nature of the hypothesis, or expectation, to be introduced to the interviewers was devised carefully. It needed to be plausible in a health research context, but not trivial or self-evidently true, and it needed to be something that researchers conducting briefings might naturally mention. The usual procedure for briefings is to give a short account of the purpose of the survey and then to go through the main points to do with contacting respondents, interviewing and other survey procedures. This standard plan was followed in both surveys, but for the experimental survey a scripted paragraph was added towards the start of the briefing. It stated that the survey was being carried out to investigate a strong finding from other research; namely, that the dominant influence on whether people had better or poorer health as adults is their health as children. For the control survey the corresponding paragraph merely stated that the survey covered various aspects of respondents' health experience over time.

The childhood–adult health relationship is thus suitable for this experiment. It is believable (there is in fact a degree of correlation in the population between some aspects of childhood and adult health) but not trivial, as this correlation is not necessarily strong and not necessarily between those aspects of health covered by the experimental questions. The strength of the true correlation should have little effect on the experiment, because the experiment is concerned with ascertaining whether the association measured in the experimental survey is greater than that measured in the control survey.

Questionnaire design

The survey procedures and the specific questions used in testing the hypotheses were of a general type familiar to interviewers from other health surveys of the adult population. A range of questions about health in childhood and adulthood was embedded in a questionnaire that took about 30 minutes to administer, with additional questions covering aspects of current health not germane to the experiment and some other information relating to standard demographic classifications. The aim was to produce a health questionnaire of a style, length and content that would be familiar to the interviewers.

There are many ways of measuring adult and childhood health, so a number of different questionnaire measures were included. The respondents were asked two types of question relating to their adult health; namely, their general level of health and whether they had experienced a long-term illness. The former was measured on a five-point ordinal scale from 'very good' to 'very poor', coded respectively from 1
to 5, and the latter was a binary measure with outcomes 'yes' and 'no' coded as 1 and 0. There were seven questions about the respondents' childhood health, given in Table 1, representing a mixture of quantitative and qualitative variables.

Analysis protocol: modelling adult health and childhood health

The experiment described above was designed to test the substantive null hypothesis that expectation-led interviewer effects do not exist. To avoid accusations of data dredging, the protocol for testing the null hypothesis was decided prior to the experiment. The chosen analysis protocol is described here and in the next section: in this section, the models relating adult and childhood health are described, followed in the next section by the statistical tests used.

In order to assess the evidence against the substantive null hypothesis, it was necessary to translate it into a form that could be tested using statistical procedures. The chosen approach was to simultaneously model the association between two variables, one relating to adult health and the other to childhood health, along with the difference between this association in the two subsurveys. Separate models were specified for each pair of adult and childhood health variables, the general form of which is described below.

Consider a pair of adult and childhood health variables, AH and CH. The outcomes of AH are ordinal, with levels numbered in ascending order from 1 to I with associated scores $u_1 > u_2 > \cdots > u_I$. Similarly, CH has J levels and associated scores $v_1 > v_2 > \cdots > v_J$. The scores for the general adult health variable run from 1 for 'very good' to 5 for 'very poor', with the binary outcomes for long-term adult illnesses scored as 1 for 'yes' and 2 for 'no'; the complete set of childhood health variable outcomes is given in Table 1. The two survey groups are denoted by the indicator K, taking the values k = 1 for the control survey and k = 2 for the experimental survey. Therefore, the class of models best suited to facilitate conversion of the substantive hypothesis to a statistical one was the ordinal log-linear model.

The ordinal log-linear model for AH, CH and K can be written, using the above definitions, as

$$\log F_{ijk} = \mu + \lambda^{AH}_i + \lambda^{CH}_j + \lambda^{K}_k + \beta_1 u_i v_j + \beta_2 u_i k + \beta_3 v_j k + \beta_4 u_i v_j k,$$

where $F_{ijk}$ is the cell count in a cross-tabulation of adult health, childhood health and interviewer group. The parameters correspond to the main effects ($\lambda^{AH}_i$, $\lambda^{CH}_j$, $\lambda^{K}_k$), the two-way interactions ($\beta_1$, $\beta_2$, $\beta_3$), and the three-way interaction ($\beta_4$) between adult health, childhood health and interviewer group.

Table 1
List of childhood health measures used in analysis

Genhelf
  Question: How is your health in general? Would you say it is ...
  Outcomes: very good; good; fair; poor; very poor. Coded: 1–5. Ordinal scores: 1–5.

Longill
  Question: Do you have any long-standing illness, disability or infirmity?
  Outcomes: yes; no. Coded: 1–0. Ordinal scores: 1–2.

GenhelfC
  Question: Thinking back to when you were a child, how would you rate your health, in general, at that time?
  Outcomes: very good; good; fair; poor; very poor. Coded: 1–5. Ordinal scores: 1–5.

Illprone
  Question: Compared to other people of your generation, would you say that as a child you were ...
  Outcomes: more prone; about as prone; less prone than others. Coded: 1–3. Ordinal scores: 1–3.

Offsick
  Question: Thinking back to the time when you were at school, would you say that, compared to other people in your class, you spent ...
  Outcomes: more time; about the same time; less time absent from school than others. Coded: 1–3. Ordinal scores: 1–3.

Totalill
  Question: Total number of specific illnesses experienced (a)
  Outcomes: count. Coded: 0–7. Ordinal scores: 0–5 as 1–6; 6–7 as 7.

Otherill
  Question: Did you have any major illnesses (b) apart from those already mentioned?
  Outcomes: yes; no. Coded: 1–0. Ordinal scores: 1–2.

Numstays
  Question: When you were a child, how many separate stays in hospital as an inpatient did you have?
  Outcomes: count. Coded: 0+. Ordinal scores: 0–7 as 1–8; 8–14 as 9; 20+ as 10.

Numnights
  Question: About how many nights altogether did you stay in hospital as a child?
  Outcomes: count. Coded: 0+. Ordinal scores: 0–7 as 1–8; 8–10 as 9; 11–14 as 10; 15–19 as 11; 20–25 as 12; 26–30 as 13; 31–40 as 14; 41–50 as 15; 51–99 as 16; 100–199 as 17; 200+ as 18.

(a) Total number of 'yes' answers to "Did you have any of the following illnesses as a child?" (polio; asthma; measles; whooping cough; chicken pox; anemia; bronchitis; German measles; scarlet fever).
(b) Illnesses other than those mentioned in the previous question.

A full explanation of how to interpret these coefficients is given in Ishii-Kuntz (1994). However, for the purposes of this paper it need only be understood that $\beta_4$ measures the difference in the adult/childhood health association between the experimental and control surveys. Moreover, if $|\beta_1 + \beta_4| > |\beta_1|$ then the adult/childhood health association is stronger in the experimental survey than in the control survey. From the above, it follows that assessing an interviewer effect requires testing whether the coefficient $\beta_4$ is significantly different from 0. In other words, for a given model, the hypothesis is of the form $H_0: \beta_4 = 0$ versus $H_1: \beta_4 \neq 0$, which is tested here by calculating the t-statistic for $\beta_4$: the ratio of its point estimate to its estimated standard error.
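For readers who want to reproduce this kind of model, the sketch below shows one standard way of fitting an ordinal log-linear model of the above form, as a Poisson regression on the cell counts. It is not the authors' code; the data-frame layout and column names are assumptions made for illustration, and the scores are assumed to identify the levels uniquely.

```python
# Minimal sketch: ordinal log-linear model fitted as a Poisson GLM on cell counts.
import pandas as pd
import statsmodels.api as sm

def fit_ordinal_loglinear(cells: pd.DataFrame):
    """cells: one row per (AH, CH, K) cell with columns
    'u' (adult-health score), 'v' (childhood-health score),
    'k' (1 = control, 2 = experimental) and 'count'."""
    # Main effects entered as categorical factors (one level identifies one score).
    X = pd.get_dummies(
        cells[["u", "v", "k"]].astype("category"), drop_first=True
    ).astype(float)
    # Linear-by-linear association terms using the ordinal scores.
    X["b1_uv"] = cells["u"] * cells["v"]                  # AH x CH association
    X["b2_uk"] = cells["u"] * cells["k"]                  # AH x survey group
    X["b3_vk"] = cells["v"] * cells["k"]                  # CH x survey group
    X["b4_uvk"] = cells["u"] * cells["v"] * cells["k"]    # three-way term of interest
    X = sm.add_constant(X)
    return sm.GLM(cells["count"], X, family=sm.families.Poisson()).fit()

# res = fit_ordinal_loglinear(cells)
# print(res.params["b4_uvk"], res.bse["b4_uvk"])  # naive (iid) SE; see jackknife sketch below
```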

As the sampling was performed using a three-stage design, standard errors that assume independent, identically distributed observations are incorrect. Thus, the 'jackknife' procedure was used to calculate the appropriate standard errors for the model parameters (Skinner, Holt, & Smith, 1989).
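The paper does not give implementation details for the jackknife, but a delete-one-cluster version, treating the postal sector as the primary sampling unit, would look roughly like the sketch below; the helper fit_beta4 and the column names are assumptions, and a stratified design would call for the corresponding stratified variant.

```python
# Rough sketch of a delete-one-cluster (postal sector) jackknife standard error
# for the beta4 estimate; 'fit_beta4' refits the model on a subset of respondents
# and returns the beta4 point estimate.
import numpy as np
import pandas as pd

def jackknife_se_beta4(respondents: pd.DataFrame, fit_beta4) -> float:
    psus = respondents["psu"].unique()
    n = len(psus)
    # Re-estimate beta4 with each postal sector left out in turn.
    leave_one_out = np.array(
        [fit_beta4(respondents[respondents["psu"] != p]) for p in psus]
    )
    theta_bar = leave_one_out.mean()
    # Standard delete-one-group jackknife variance estimator.
    var = (n - 1) / n * np.sum((leave_one_out - theta_bar) ** 2)
    return float(np.sqrt(var))
```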

Table 2
Grouping of models in analysis

Group  Type                Model  Adult health  Childhood health
1      General health      1      Genhelf       GenhelfC
                           2      Genhelf       Illprone
                           3      Genhelf       Offsick
                           4      Longill       GenhelfC
                           5      Longill       Illprone
                           6      Longill       Offsick
2      Specific illnesses  7      Genhelf       Totalill
                           8      Genhelf       Otherill
                           9      Longill       Totalill
                           10     Longill       Otherill
3      Hospitalization     11     Genhelf       Numstays
                           12     Genhelf       Numnights
                           13     Longill       Numstays
                           14     Longill       Numnights

Analysis protocol: hypothesis testing

As one test was to be performed for each pair of variables, it was crucial for the testing procedure to be such that the overall probability of a type I error occurring was at most 0.05. For this, it is first necessary to define an 'overall' type I error, which is taken as the event where one or more tests result in the null hypothesis being incorrectly rejected. It is required that the probability of this event should be no more than 5 percent.

There are 14 pairings of the adult and childhood health variables, shown in Table 2, each of which was modelled using an ordinal log-linear model. The null hypothesis can be tested for each distinct pair, and the type I error probability set individually for each test by choosing the test significance level, $\alpha$. However, concern is focused on testing for an overall difference in association between the experimental and control surveys across all 14 pairs of variables, while ensuring that the overall type I error probability is 0.05. In general, fixing the type I error probability for a number of related tests is not trivial, but it is possible to ensure that the overall probability is not greater than 5 percent. A test procedure that satisfies this criterion was devised from first principles and is described below.

The 14 pairs of variables were arranged into three groups, based on the aspect of health measured by the childhood variable. As shown in Table 2, group 1 relates to general assessments of childhood health; group 2 to specific childhood illnesses; and group 3 to questions about hospitalisation during childhood. Tests based on models within a particular group are highly correlated because the adult and childhood health measures within each group are themselves highly correlated. As a result, if each test has a type I error probability fixed at $\alpha = 0.05$, then the probability that one or more tests within the group will produce a type I error will be slightly
larger than, but approximately equal to, 0.05. To ensure that the group's type I error probability is strictly less than 0.05, a group is said to be significant if the null hypothesis is rejected for at least two of its constituent tests. If overall significance is defined as at least two of groups 1–3 being significant, as defined above, then by a similar argument the probability of an overall type I error can also be shown to be at most 5 percent.

Cost considerations meant that a sequential sampling approach was used in the experiment. A sequential sample involves dividing the sampling for the experiment into a series of tranches, and using a stopping rule to decide whether a further tranche is required (Armitage, 1975). If the evidence for rejecting the null hypothesis at tranche 1 is sufficient, the experiment is terminated; otherwise, the experiment proceeds to tranche 2, with the data from tranches 1 and 2 combined, and so on until the stopping rule is invoked. The total number of tranches must be decided in advance, otherwise the experiment could carry on indefinitely until the stopping rule was invoked by chance. It was decided that a maximum of two tranches would be needed, because the combined sample size after the second tranche was estimated to be sufficient to detect an interviewer effect of small magnitude, should one exist.

Using a sequential sample design causes another problem. Prior to the experiment, it is unknown whether one or two tranches will be required. Hence, setting the overall type I error probability at 5 percent for each tranche will not result in the overall type I error probability for the experiment being 5 percent. To see this, note that there are two possible outcomes that could result in a type I error:

1. the null hypothesis is rejected at tranche 1 and the experiment is terminated, or
2. the null hypothesis is rejected at tranche 2, given that it was not rejected at tranche 1.

The probability of outcome 1 is simply the fixed type I error probability to be used for each tranche, call it $\alpha_1$; it is this probability that needs to be determined in advance. The probability of outcome 2 equals the probability of a type I error at tranche 2, given that the null hypothesis was not rejected at tranche 1. The probability of not rejecting at tranche 1 is $(1 - \alpha_1)$, and the probability of a type I error at tranche 2 is some probability $h(\alpha_1)$ that depends on $\alpha_1$. Thus, the overall probability of a type I error is

$$\alpha = \alpha_1 + (1 - \alpha_1)\,h(\alpha_1).$$

In general, $h(\cdot)$ is determined using numerical integration, but published tables of $\alpha_1$ are available. For example, in Table 15.6 of Armitage and Berry (1994), $\alpha_1 = 0.029$ for a two-tranche design with $\alpha = 0.05$. Therefore, the p-value threshold for statistical significance was set to 0.029 at both tranches, ensuring that the overall type I error probability is not greater than 0.05.
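The pre-specified decision rule can be summarised in a few lines of code. The sketch below is illustrative only and assumes, as described above, that the 0.029 level applies to each individual test at each tranche.

```python
# Illustrative implementation of the pre-specified decision rule: a group is
# significant if at least two of its tests reject H0, and the experiment rejects
# overall if at least two of the three groups are significant.
from typing import Dict, List

ALPHA_TRANCHE = 0.029  # per-tranche test level giving an overall level of about 0.05

def group_significant(p_values: List[float], alpha: float = ALPHA_TRANCHE) -> bool:
    # At least two constituent tests must reject the null hypothesis.
    return sum(p < alpha for p in p_values) >= 2

def overall_significant(groups: Dict[str, List[float]]) -> bool:
    # At least two of the three groups must themselves be significant.
    return sum(group_significant(ps) for ps in groups.values()) >= 2

# Stopping rule at tranche 1 (hypothetical p-values for the three groups):
# stop = overall_significant({"general": [...], "illness": [...], "hospital": [...]})
```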


Results

Tranche 1

The achieved sample at tranche 1 consisted of 1799 respondents selected from 50 postal sectors. The 14 ordinal log-linear models were then fitted to the sample data, the results of which are displayed in the rows of Table 3 corresponding to tranche 1. It can be seen that none of the group 2 models was significant, and that only one of the group 1 models was significant. Every model in group 3 was significant, although models 11 and 12 estimate the association between adult and childhood health variables to be smaller for the experimental survey than the control survey. According to the stopping rule, there was insufficient evidence to reject the null hypothesis because neither group 1 nor group 2 was significant. Thus a second tranche was performed.

Tranche 2

The sample in tranche 2 consisted of an additional 1717 respondents from another 50 postal sectors, making a combined total of 3516 respondents. The analysis was repeated for the combined data set and the results are displayed in the rows of Table 3 corresponding to tranche 2. In terms of model significance, the pattern of the results is identical to that at tranche 1, with the exception of model 13. Thus, the same conclusion is drawn as before: there is no evidence to reject the null hypothesis, and thus no evidence that interviewers are influencing the observed relationship between adult and childhood health due to differential interviewer briefings. An interesting feature of the results is that group 3 is again significant, giving evidence of a systematic effect for the comparison of adult health and childhood hospitalisation.

Table 3
Test results from tranches 1 and 2

Group  Model  Tranche  Adult    Childhood  Est. $\beta_4$  Std. error  p-value (a)
1      1      2        Genhelf  GenhelfC   0.0156          0.0384
              1        Genhelf  GenhelfC   0.0149          0.0447
       2      2        Genhelf  Illprone   0.0495          0.0596
              1        Genhelf  Illprone   0.1095          0.0889
       3      2        Genhelf  Offsick    0.0387          0.0578
              1        Genhelf  Offsick    0.0972          0.0800
       4      2        Longill  GenhelfC   0.1861          0.0406      ***
              1        Longill  GenhelfC   0.1745          0.0581      ***
       5      2        Longill  Illprone   0.0716          0.1177
              1        Longill  Illprone   0.1199          0.1905
       6      2        Longill  Offsick    0.0638          0.1256
              1        Longill  Offsick    0.3295          0.1749
2      7      2        Genhelf  Totalill   0.0202          0.0193
              1        Genhelf  Totalill   0.0338          0.0218
       8      2        Genhelf  Otherill   0.0937          0.0835
              1        Genhelf  Otherill   0.0827          0.1195
       9      2        Longill  Totalill   0.0470          0.0532
              1        Longill  Totalill   0.0401          0.0736
       10     2        Longill  Otherill   0.1895          0.1701
              1        Longill  Otherill   0.1016          0.2514
3      11     2        Genhelf  Numstays   0.0711          0.0185      ***
              1        Genhelf  Numstays   0.0709          0.0247      ***
       12     2        Genhelf  Numnights  0.0170          0.0058      ***
              1        Genhelf  Numnights  0.0189          0.0061      ***
       13     2        Longill  Numstays   0.0071          0.0592
              1        Longill  Numstays   0.1955          0.0381      ***
       14     2        Longill  Numnights  0.0238          0.0096      **
              1        Longill  Numnights  0.0336          0.0119      ***

(a) p-values: * p < 0.05; ** p < 0.025; *** p < 0.01.


Model 13 in particular gives a curious result: the estimate of $\beta_4$ at tranche 2 is 0.007, which lies outside the 95 percent confidence interval for $\beta_4$ at tranche 1 (0.203 to 0.121). Furthermore, the standard error at tranche 2 is larger than at tranche 1, which is counter-intuitive because the standard error would be expected to decrease with a greater sample size.

It was decided to investigate the anomalous results for model 13 further. The ordinal log-linear model assumes that the conditional log-odds of answering 'yes' rather than 'no' to the long-term adult illness question is linearly related to the number of hospital stays. Plotting the observed log-odds against the number of hospital stays for the experimental and control surveys revealed this relationship to be quadratic rather than linear; in other words, model 13 is mis-specified, which explains the increase in standard error between tranches 1 and 2. It was therefore necessary to recalculate the scores for the number of hospital stays so that the assumed linear relationship held: a new set of scores ('Newstays') was constructed by combining the original scores 7–10 into a single score of 7. The childhood health variable in model 11 also uses the number of hospital stays, so both models 11 and 13 were refitted using the new scores. The results of fitting these models at tranches 1 and 2 are shown in Table 4. It can be seen that neither model is significant, which gives a different conclusion to that arrived at using the original number-of-hospital-stays scores. However, the overall conclusions are unaffected because group 3 is still significant: two out of its four models are significant.
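A sketch of the diagnostic and recoding described above is given below. It assumes a respondent-level data frame with the indicated column names and is illustrative rather than the authors' code.

```python
# Sketch: check linearity of the observed log-odds of long-term adult illness
# against the number-of-hospital-stays score, then recode the top scores.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def observed_log_odds(df: pd.DataFrame, score_col: str = "numstays_score") -> pd.Series:
    # Proportion answering 'yes' (longill coded 1 = yes, 0 = no) at each score,
    # expressed on the log-odds scale.
    p = df.groupby(score_col)["longill"].mean().clip(1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def recode_numstays(score: int) -> int:
    # Combine the original scores 7-10 into a single score of 7 ('Newstays').
    return min(score, 7)

# for survey, sub in df.groupby("survey"):
#     observed_log_odds(sub).plot(marker="o", label=survey)
# plt.xlabel("Number of hospital stays (score)")
# plt.ylabel("Log-odds of long-term adult illness")
# plt.legend(); plt.show()
```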

Table 4
Results from refitting models 11 and 13 using 'Newstays'

Model  Tranche  Estimate  Std. error  Significant
11     2        0.0265    0.0642      No
       1        0.0294    0.0831      No
13     2        0.0274    0.1416      No
       1        0.0856    0.1933      No

Discussion

The sequential design eventually required two tranches before any firm conclusions could be drawn. Ultimately, the null hypothesis was rejected only for the tests in group 3 involving the adult health and childhood hospitalisation variables. No evidence was found to reject the null hypothesis in the tests involving either adult health and general assessments of childhood health, or adult health and specific childhood illnesses. Hence, by the rules set out prior to
the experiment, it is concluded that there is insufficient evidence to reject the null hypothesis that there was no interviewer effect.

As noted in the previous section, it was decided after inspection of the results to recode the original scores for the number of hospital stays variable (which measured the number of separate occasions on which the respondent stayed overnight in hospital as a child). This was done in order to ensure that the log-odds of having a long-term adult illness changed linearly with the number of hospital stays. Although this was a divergence from the pre-specified analysis protocol, it was necessary because following the protocol would have involved fitting an incorrectly specified model. Furthermore, it was the anomalous increase in the standard error of the $\beta_4$ estimate at tranche 2 that led to the change, and not an attempt to influence the experimental findings.

Despite the null hypothesis not being rejected overall, there was a suggestion of a systematic interviewer effect for the childhood hospitalisation questions, because models 12 and 14 were significant at both tranches. The occurrence of expectation-led interviewer effects is plausible in this case because the childhood health measure for both these models is the total number of nights spent in hospital. For a retrospective question of this nature, it is easy to believe that the answers will be subject to considerable measurement bias; individuals may have had to guess the answer, and thus become susceptible to the interviewer's expectation. This may warrant further investigation. However, even if the effect were real, its magnitude was small. The Spearman rank correlation between the general adult health and number of nights in hospital variables shows the significant association in model 12 to be small: 0.005 for the control survey, and 0.022 for the experimental survey. Similarly, in model 14, for the long-term adult illness and number of nights spent in hospital variables, the control survey correlation is 0.035 and the experimental survey correlation is 0.039.
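The correlation comparison reported here can be computed along the following lines; the column names are assumptions made for the example.

```python
# Sketch: compare the Spearman rank correlation between an adult and a childhood
# health measure in the control and experimental surveys.
import pandas as pd
from scipy.stats import spearmanr

def spearman_by_survey(df: pd.DataFrame, adult: str, child: str) -> pd.Series:
    return df.groupby("survey").apply(
        lambda g: spearmanr(g[adult], g[child])[0]  # [0] is the correlation coefficient
    )

# e.g. spearman_by_survey(df, "genhelf", "numnights_score")
```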

Conclusions

The experiment revealed no evidence for a general increase in the association between childhood and adult health outcomes for interviewers with preconceptions about such a link. The results are somewhat reassuring in two ways. First, they are consistent with the findings on 'experimenter expectations' by Rosenthal (1976). Secondly, they suggest that the practice of personally briefing interviewers, for which there are many positive arguments, need not bias performance and results. However, it would be quite wrong to assume that all health survey interviewing and interviewers are immune


to expectation-led effects. Interviewers from the National Centre for Social Research are highly selected and trained to be aware of the need to maintain strict objectivity and to avoid sending out unwanted signals when face-to-face with respondents; applicants who cannot do this are rejected. Indeed, researchers doing their own interviewing with hypotheses in mind could be more subject to such subconscious tendencies than trained survey interviewers. A small, consistent effect was found for hospitalisation, but it is difficult to imagine it being of any practical significance unless the interviewer caseloads were large (as noted by Davis & Scott, 1995). Further research may thus be necessary to investigate the impact of expectation-led effects on such questions.

Acknowledgements

Supported by Fabriques de Tabac Réunies SA (Philip Morris). The authors wish to thank Dr Susan Purdon for her invaluable help and advice on all aspects of the design and analysis of this experiment, and Dr Myron S Weinberg, Dr Rosanne B McTyre, Joseph Huggard, Cynthia Fink, Elisa Foltz, and Michelle Arness for their contributions to the manuscript. The authors are also grateful to the health policy editor Peter Davis and to two anonymous referees for their comments.

References

Armitage, P. (1975). Sequential medical trials. Oxford: Blackwell.
Armitage, P., & Berry, G. (1994). Statistical methods in medical research. Oxford: Blackwell.
Cochran, W. (1977). Sampling techniques. New York: Wiley.
Davis, P., & Scott, A. (1995). The effect of interviewer variance on domain comparisons. Survey Methodology, 21, 99–106.
De Leeuw, E., & Collins, M. (1997). Data collection methods and survey quality: An overview. In L. E. Lyberg, P. Biemer, M. Collins, E. De Leeuw, C. Dippo, N. Schwarz, & D. Trewin (Eds.), Survey measurement and process quality (pp. 199–220). New York: Wiley.
Ecob, R., & Davey Smith, G. (1999). Income and health: What is the nature of the relationship? Social Science & Medicine, 48(5), 693–705.
Eisenhower, E., Mathiowetz, N. A., & Morganstein, D. A. (1991). Recall error: Sources and bias reduction techniques. In P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 127–144). New York: Wiley.
Erens, B., & Dong, W. (1997). Scottish health survey 1995. Edinburgh: The Stationery Office.
Fowler, F. J. (1991). Reducing interviewer-related error through interviewer training, supervision and other means. In P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 259–278). New York: Wiley.
Ishii-Kuntz, M. (1994). Ordinal log-linear models. Thousand Oaks, CA: Sage.
Lyberg, L., & Kasprzyk, D. (1991). Data collection methods and measurement error: An overview. In P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 237–258). New York: Wiley.
Marmot, M., & Wilkinson, R. G. (Eds.). (1999). Social determinants of health. Oxford: Oxford University Press.
Prescott-Clarke, P., & Primatesta, P. (1998). The health survey for England 1997. London: The Stationery Office.
Rosenthal, R. (1976). Experimenter effects in behavioral research (enlarged edition). New York: Wiley.
Sacker, A., Bartley, M. J., Firth, D., Fitzpatrick, R. M., & Marmot, M. G. (2001). The relationship between job strain and coronary heart disease: Evidence from an English sample of the working male population. Psychological Medicine, 31(2), 279–290.
Skinner, C. J., Holt, D., & Smith, T. M. F. (Eds.). (1989). Analysis of complex surveys. Chichester: Wiley.
Wynder, E. (1987). Workshop on guidelines to the epidemiology of weak associations. Preventive Medicine, 16(2), 139–141.
Wynder, E. (1994). Investigator bias and interviewer bias: The problem of reporting systematic error in epidemiology. Journal of Clinical Epidemiology, 47(8), 825–827.