Soc. Sci. Med. Vol. 28, No. 1, pp. 53-58, 1989
Printed in Great Britain. All rights reserved
0277-9536/89 $3.00 + 0.00
Copyright © 1989 Pergamon Press plc
MEASURING SATISFACTION WITH HEALTH CARE: A COMPARISON OF SINGLE WITH PAIRED RATING STRATEGIES

H. J. SUTHERLAND, G. A. LOCKWOOD, S. MINKIN, D. L. TRITCHLER, J. E. TILL and H. A. LLEWELLYN-THOMAS*

The Ontario Cancer Institute, Toronto, Ontario M4X 1K9, Canada

Abstract: One of the central problems in studies of patient satisfaction with health care is the development of reliable and valid methods to determine the relative importance of different aspects of health care. Two techniques, paired comparisons and rating on a visual analogue scale, were compared in terms of their consistency with logical assumptions, test-retest reliability, and convergent validity. Thirty women with breast cancer were asked to assess brief hypothetical scenarios describing out-patient clinic visits to a tertiary cancer care centre. Each scenario incorporated three variables related to satisfaction with care: staff attitude, control over treatment decisions, and continuity of medical supervision. The paired choice method showed marginally better reliability and logical consistency than the rating method. Of the three variables assessed, continuity of medical supervision was consistently ranked highest in importance, and control over treatment decisions lowest. These preference assessment techniques appear to be suitable for use in the development of patient satisfaction indices, and for studies designed to examine variations in the priority given to different aspects of satisfaction with care.

Key words: paired comparisons, rating scales, health care, patient satisfaction

*Address correspondence to: Dr H. A. Llewellyn-Thomas, Ontario Cancer Institute, 500 Sherbourne Street, Toronto, Ontario M4X 1K9, Canada.
INTRODUCTION

The need to assess patient satisfaction is an increasingly prominent aspect of health care evaluative research [1-3]. However, the construction of measures of patient satisfaction involves addressing a number of complex theoretical and methodologic problems. These include the identification of the salient dimensions of satisfaction, the selection of appropriate items and scaling methods to use in quantifying attitudes within each dimension, and the determination of the relative or weighted contribution each dimension makes to an overall measure of satisfaction. There are a number of reports in the literature identifying various dimensions of satisfaction and describing the reliability and validity of scales designed to assess the degree of satisfaction within dimensions [3, 4]. There has been comparatively little examination of the performance of different methods for making across-dimensional assessments of relative importance [5]. In this study, we compared two methods for eliciting weights for different aspects of patient satisfaction. This work, therefore, makes a methodologic contribution to the effort to develop patient satisfaction indices.

The methods for eliciting dimension weights can be grouped according to whether they use a direct or an indirect approach. In the first strategy, the methods involve directly assigning a value or weight to each dimension. Our experience with this approach indicates that, because the task does not require them to make mental trade-offs among the dimensions, raters tend arbitrarily to assign equal weights to each dimension [6]. Accordingly, we chose to examine methods within the second strategy. This alternate approach involves presenting the rater with a set of scenarios in which the dimensions appear together in various combinations. Different methods can be used to derive a rater's judgement of the relative overall desirability of each scenario; regression techniques are then applied to the full set of results in order to estimate the weight the rater was assigning to each dimension while making judgements.

In one of these methods, the subject provides ratings on visual analogue scales (RA). Our experience with this method [6, 7] indicates that the technique is potentially useful in health care applications. However, the simultaneous assessment of several attributes in order to arrive at an overall evaluative rating or score for a single scenario is not an easy cognitive task [8]. Paired comparisons of scenarios (PC) is an alternate method that appears less cognitively demanding, since the judgement task at a particular point in time only requires the rater to choose the preferred of two scenarios. This technique has been applied in market research [9] and psychological studies [10], but not extensively in the health care field.

Classically, evaluation of the psychometric properties of methods like these involves determining their reliability and validity under various circumstances. Reliability refers to the extent to which a measure renders repeatable results, while validity refers to the extent to which an instrument measures what it was intended to measure [11]. Our earlier work [12, 13] has indicated that it is also important to assess the
degree to which these methods provide results that are consistent with logical assumptions; that is, results which demonstrate the 'logical consistency' of the method.

Thus there were two primary purposes to this study. The first was to examine the properties of the methods. We wished to assess the logical consistency and test-retest reliability of the RA and PC methods, and to determine whether one method performed better than the other on the basis of these criteria. Then, given that these results were acceptable, we wished to compare the RA and PC methods to assess their convergent validity. By convergent validity, we mean that, if both methods are valid indicators of the underlying construct of relative importance, then they ought to assign similar overall values to the scenarios themselves, as well as comparable weights to their constituent attributes or dimensions [14]. The second purpose was to explore the usefulness of these methods. As an initial test, we assessed the relative importance assigned by women with breast cancer to three care-oriented variables: staff attitude, control over treatment decisions, and continuity of medical supervision.
METHODS

General plan

Brief paragraphs ('scenarios'), incorporating three dimensions of health care, were assessed on two occasions by a group of women with breast cancer. On both occasions, two methods were used to elicit the patients' judgements about the relative desirability of the scenarios.

Subjects

Thirty women with primary breast cancer were interviewed on two occasions separated by 1 or 2 days at the beginning of post-surgical radiation treatment in the out-patient department of the Princess Margaret Hospital. Inclusion criteria required an ability to read and understand English and to give informed consent. All patients were interviewed by the same interviewer on the two occasions. The patients were not familiar with the interviewer, and were aware that their responses would be confidential.

We selected the study subjects from a particular diagnostic group of cancer patients, in order to reduce the possibility that differences in sex, disease severity, and experience with clinic attendance could interfere with the assessment of the methods' reliability and validity. In the past, we have been able to carry out satisfactory psychometric assessments of similar methods with samples of this size [6, 15, 16].

The selection of the appropriate time interval between interviews is problematic. An interval of a week would be preferable, because this would reduce the danger of spuriously high reliability coefficients due to subjects being able to remember their earlier responses at retest time. However, the health status of cancer patients under treatment is changeable, and such alterations could undermine the accuracy of the assessment of test-retest reliability. A briefer interval was used to reduce this possibility; we assume that the complexity of the judgement tasks offsets the likelihood that subjects can recall their earlier responses.

Scenario construction

Each scenario was presented as a description of a clinic visit. The clinic visit was selected as the health care episode of interest, and was conceptualized as an encounter between the clinic staff and the patient within the context of the agency. Thus the clinic visit could be described and evaluated in terms of a number of staff-, patient-, and agency-specific characteristics [17]. One characteristic was selected from each of these evaluative areas, yielding three descriptive dimensions. These three care-oriented dimensions were staff attitude (ATTITUDE), control over treatment decisions (CONTROL), and continuity of medical supervision (CONTINUITY). Their positive and negative definitions appear in Table 1.

The selection of dimensions and the development of their definitions was guided by the literature, the characteristics of the study setting, and our assumptions about the study sample. The perceived humanness of staff attitude [1, 3] and the continuity of care [18] have been frequently cited as significant components of patient satisfaction. Furthermore, the significance of these factors may be heightened in the minds of cancer patients who are attending clinics at an active-treatment, tertiary-level cancer centre in which the commitment to the teaching of oncology residents reduces the patient's chances of seeing the same physician at each visit. The assumption that patients prefer control over treatment decision-making has been argued in the health care evaluative [4], ethical [19], psychological [20], and cancer [21] literature.
Table 1. Dimension definitions

Evaluative area | Dimension | Dimension definition
Staff | Attitude (ATTITUDE) | Positive definition: being treated with as much kindness as wanted. Negative definition: not being treated with as much kindness as wanted, but as another case.
Patient | Control over treatment (CONTROL) | Positive definition: having as much control over treatment decisions as wanted. Negative definition: not having as much control over treatment decisions as wanted.
Agency | Continuity of care (CONTINUITY) | Positive definition: seeing the same preferred physician at each clinic visit. Negative definition: not seeing the same preferred physician at each clinic visit.
Table 2. Sample scenario

Scenario A: "I am not treated with as much kindness as I want, but as 'another case of breast cancer'. I always see the doctor I want when I come to the hospital. I do not have as much control over treatment decisions as I want."
The scenarios incorporated the three dimensions at dichotomous levels, giving eight possible situations. An example appears in Table 2. The two extreme situations, in which all dimensions were either negatively or positively defined, served as the 'worst' and 'best' reference points for judging the remaining six scenarios.
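For concreteness, the design is small enough to reproduce in a few lines of code. This is a minimal sketch (the variable names are ours, not the authors'), enumerating the eight dichotomous scenarios and setting aside the two extreme cases as anchors:

```python
from itertools import product

DIMENSIONS = ("ATTITUDE", "CONTROL", "CONTINUITY")

# All 2^3 = 8 combinations of dimension levels (1 = positive, 0 = negative).
scenarios = [dict(zip(DIMENSIONS, levels)) for levels in product((0, 1), repeat=3)]

# The all-negative and all-positive cases anchor the 'worst' and 'best' ends
# of the scale; the remaining six scenarios are the ones actually judged.
anchors = [s for s in scenarios if len(set(s.values())) == 1]
judged = [s for s in scenarios if len(set(s.values())) == 2]
assert len(anchors) == 2 and len(judged) == 6
```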
Judgement procedures

Two different procedures were used to elicit judgements of scenario desirability. In the rating method (RA), the patient was asked to read each scenario and to indicate, by making a vertical mark on a 10 cm line, how desirable it was relative to the extreme scenarios which anchored the line. An 'RA' score for each scenario was determined by measuring in millimetres from the end of the line anchored by the 'worst' case to the indicated vertical mark.

In the paired comparisons (PC) method, the subject was presented with the scenarios in pairs and asked to select the one she preferred. The results were represented as a set of binary values, where a one indicated 'preferred' and a zero indicated 'not preferred'. All possible pairs were presented once, giving 15 choices; thus each scenario appeared in five choice situations. The 'PC' score for each scenario was the number of times, in the five times it appeared, that the scenario was preferred to the other scenarios presented.

Whether a respondent began with the RA or PC procedure was randomly determined; half the sample began with one procedure and half with the other.
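The PC scoring rule is simple enough to state in code. The sketch below is ours, with a hypothetical `prefer(i, j)` callback standing in for the respondent's recorded choice in a pair; the RA score, by contrast, needs no computation beyond reading off the millimetre position of the mark:

```python
from itertools import combinations

def pc_scores(n_scenarios, prefer):
    """PC score for each scenario: the number of pairs in which it was chosen.

    `prefer(i, j)` returns the index (i or j) of the scenario the respondent
    preferred. With six scenarios there are C(6, 2) = 15 pairs, so each
    scenario appears in five choice situations and scores range from 0 to 5.
    """
    wins = [0] * n_scenarios
    for i, j in combinations(range(n_scenarios), 2):
        wins[prefer(i, j)] += 1
    return wins
```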
Analysis

The data analysis plan was designed to assess the methods' logical consistency, test-retest reliability, and convergent validity. Two forms of the data were analysed: raw scores and modelled dimension weights.

Raw scores. Six logical consistency checks were inherent in both the RA and PC judgement tasks. A scenario was considered logically preferable to another if the first had two dimensions (A and B) at the high level and the second had only one of these dimensions (A or B) at the high level, while both cases had the third dimension at the same level. Thus, the logical consistency checks were based on the assumption that, other things being equal, the high level of a dimension is preferable. For the RA method, an inconsistency was counted when one scenario that was logically preferable to a second had an RA score which was lower. For the PC method, an inconsistency was counted when one scenario that was logically preferable to a second was not preferred by the respondent. The total number of inconsistencies made by all respondents is an inverse measure of the logical consistency of that task.
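These six checks reduce to a dominance test between scenario pairs. Here is a sketch of the RA count, under the paper's assumption that the high level of a dimension is preferable when all else is equal; scenarios are represented as 0/1 tuples, and the function names are ours:

```python
def dominates(x, y):
    """True if scenario x is logically preferable to scenario y: x is at
    least as high as y on every dimension and strictly higher on at least one."""
    return all(a >= b for a, b in zip(x, y)) and x != y

def ra_inconsistencies(scenarios, score):
    """Count one respondent's RA inconsistencies: pairs in which the
    logically preferable scenario received the lower rating."""
    return sum(
        1
        for i, x in enumerate(scenarios)
        for j, y in enumerate(scenarios)
        if dominates(x, y) and score[i] < score[j]
    )
```

Applied to the six judged scenarios, `dominates` holds for exactly six ordered pairs, matching the six checks described above; the PC count is analogous, with an inconsistency recorded whenever the dominated member of a pair was the one chosen.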
To assess the test-retest reliability of each method, a Spearman correlation coefficient was calculated for each subject's set of scores. This gave a reliability coefficient for each subject on each method. The median of the subjects' reliability coefficients is an index of the overall test-retest reliability of the method. The Wilcoxon signed rank test was used to test for differences between the reliability coefficients of the RA and PC methods.

Convergent validity, that is, the extent to which the results of the two methods agreed, was examined in two ways. First, the average rank vectors for the two procedures were compared. The average rank vector for a method was computed by ranking the scenarios for a subject, then averaging the ranks for a scenario across all subjects, and finally ranking the average ranks. Second, a between-method Spearman correlation coefficient between a respondent's sets of RA and PC scores was calculated. This gave a coefficient of agreement for each subject. The median of these agreement coefficients indicates the method agreement in the whole sample.

Modelled dimension weights. A set of importance weights for the three underlying dimensions (ATTITUDE, CONTROL, and CONTINUITY) was derived from the raw data generated by each method. Ordinary least-squares regression was applied to summarize the RA scores by estimating a main effect for each dimension. For each subject, the RA scores were regressed on the levels of the three dimensions used in the cases to get estimates of the corresponding parameters. These estimates, normalized to sum to one, were interpreted as the importance weights the subject assigned to the three dimensions that constituted the scenarios. Logistic regression was applied to summarize the binary data from the PC procedure, assuming the Bradley-Terry model [22]. Estimates of the parameters were obtained using maximum likelihood estimation and were interpreted as importance weights for the three dimensions. Both regression strategies assume that the derived dimension weights are additive.

Pearson correlation coefficients on the weights were calculated: first, within method and across test time, to assess the test-retest reliability of the weights from each method; and second, within test time and across method, to assess the agreement of weights between methods. The average rank vectors for the weights were calculated and compared in the same manner as for the raw scores (see the section on Raw scores, above).
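To make the modelling steps concrete, here is a sketch of the two regressions and of the average rank vector computation. It is our own reconstruction under the stated assumptions, not the authors' code: NumPy least squares stands in for the OLS fit, and a scikit-learn logistic regression with a very weak penalty stands in for the maximum likelihood fit of the Bradley-Terry model (the weak penalty also keeps the fit finite for a perfectly consistent respondent):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression

def ra_weights(levels, ra_scores):
    """RA weights for one subject: regress the RA scores on the 0/1 dimension
    levels (one row of `levels` per scenario), then normalize the three
    main-effect slopes to sum to one."""
    levels = np.asarray(levels, dtype=float)
    X = np.column_stack([np.ones(len(levels)), levels])   # intercept + 3 dims
    beta, *_ = np.linalg.lstsq(X, np.asarray(ra_scores, dtype=float), rcond=None)
    w = beta[1:]                                          # drop the intercept
    return w / w.sum()

def pc_weights(levels, choices):
    """PC weights for one subject under the Bradley-Terry model: logistic
    regression, with no intercept, of each binary choice on the difference
    between the paired scenarios' dimension levels. `choices` holds triples
    (i, j, won), with won = 1 if scenario i was preferred to scenario j."""
    levels = np.asarray(levels, dtype=float)
    X = np.array([levels[i] - levels[j] for i, j, _ in choices])
    y = np.array([won for _, _, won in choices])
    fit = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)  # ~MLE
    w = fit.coef_.ravel()
    return w / w.sum()

def average_rank_vector(scores):
    """Average rank vector: rank each subject's scores (rows of `scores`),
    average the ranks per scenario across subjects, then rank the averages."""
    ranks = np.apply_along_axis(rankdata, 1, np.asarray(scores, dtype=float))
    return rankdata(ranks.mean(axis=0))
```

The remaining raw-score statistics map onto standard routines: `scipy.stats.spearmanr` for the per-subject reliability and agreement coefficients, and `scipy.stats.wilcoxon` for the between-method comparison of reliability coefficients.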
RESULTS

Raw scores

Logical consistency. The RA method yielded more internal inconsistencies at test and retest (22 and 22) than the PC method (13 and 5). The majority of the inconsistencies, except for the PC method at retest, resulted from the two choices between scenarios that differed only in terms of the level of the CONTROL dimension. Difficulty with this dimension accounted for 13 and 14 of the RA inconsistencies at test and retest respectively, and for 10 and 2 of the PC inconsistencies at test and retest respectively. The decrease in the number of PC inconsistencies from test to retest is due almost entirely to a decrease in the number of inconsistencies arising from the CONTROL dimension.

The distributions of the inconsistencies among respondents are given in Table 3. The percentage of respondents who had inconsistencies was higher for the RA procedure than the PC procedure at both test and retest (53% for RA vs 33% for PC at test; 47% for RA vs 17% for PC at retest).

Table 3. Distribution of inconsistencies among respondents*

Method | 0 | 1 | 2 | 3 | Total
RA test | 14 | 10 | 6 | 0 | 30
RA retest | 16 | 9 | 2 | 3 | 30
PC test | 20 | 7 | 3 | 0 | 30
PC retest | 25 | 5 | 0 | 0 | 30

*Table values give the number of respondents with the indicated number of inconsistencies in their responses.

Test-retest reliability. The median reliability coefficient for the RA procedure was 0.83. The median reliability coefficient for the PC procedure was 0.94, significantly higher (P = 0.013) than the RA median reliability.

Method agreement. The average rank vectors of the scenarios, obtained from both methods at both test and retest, were the same. This indicates consistent agreement of the results of the two methods. The average rank order of the scenarios from best to worst is given in Table 4. As one would expect on the basis of logical consistency, the three scenarios with two dimensions at the high level had higher average ranks than the three scenarios with only one dimension at the high level. The median of the agreement coefficients at test is 0.86 and at retest is 0.88. Again, this indicates a good agreement between the methods at both times.

Table 4. Scenarios in order of average rank*

Case label | ATTITUDE | CONTINUITY | CONTROL
D | + | + | -
F | - | + | +
B | + | - | +
A | - | + | -
E | + | - | -
C | - | - | +

*+ indicates dimension is at the high level; - indicates dimension is at the low level.

Modelled dimension weights

Test-retest reliability. The test-retest Pearson correlation coefficients for the importance weights are given in Table 5. The correlations for the weights from the RA procedure ranged from 0.65 to 0.77, and those from the PC procedure from 0.62 to 0.81. Since inconsistencies suggested that the subjects had difficulties with the tasks, the dimension weights' test-retest reliabilities were recalculated for the subset of subjects who had no inconsistencies. The reliabilities for the weights from the PC data were higher for all three dimensions (0.75-0.89), while the reliabilities for the weights from the RA data were lower for ATTITUDE and CONTINUITY (0.59, 0.66) and higher for CONTROL (0.82).

Table 5. Reliability and agreement coefficients for scenario dimensions

Dimension | Test-retest reliability: RA | Test-retest reliability: PC | Method agreement: Test | Method agreement: Retest
ATTITUDE | 0.65 | 0.74 | 0.63 | 0.72
CONTINUITY | 0.77 | 0.81 | 0.50 | 0.60
CONTROL | 0.68 | 0.62 | 0.67 | 0.43
Method agreement. The between-method Pearson correlation coefficients are also included in Table 5. Five of the six correlations are between 0.50 and 0.72; CONTROL at retest is less than 0.50. It is also noteworthy that the average rank vectors of the weights were the same at both test and retest for both methods. On average, CONTINUITY was ranked first, ATTITUDE second, and CONTROL last. Taken together, these results provide evidence for the convergent validity of the two methods for preference assessment.

Relative importance of care-oriented variables. The mean weights for the three care-oriented variables obtained from both methods, at both times, are given in Table 6. The results show considerable accordance. The average weight for CONTINUITY is consistently largest, ATTITUDE intermediate, and CONTROL smallest. CONTINUITY received the highest weight in 66% of the responses, ATTITUDE in 24%, and CONTROL in 10%.
DISCUSSION
In the present study, the logical consistency and test-retest reliability of both methods were good, and the results of the two methods agreed quite closely. The observation that the PC scores are more consistent with our logical assumptions than the RA scores suggests that the subjects had less difficulty with the PC method. Presenting the scenarios in pairs may encourage the subject to compare the scenarios by dimension, and thus be less likely to make logically inconsistent choices.

The majority of inconsistencies for the RA method, at both test and retest, were due to difficulty with the (patient-specific) CONTROL dimension, suggesting that a portion of the subjects tended to ignore that dimension when assessing the scenarios. This is consonant with Slovic and Lichtenstein's [23] conclusion that holistic evaluations tend to be based on a limited number of attributes. The majority of inconsistencies for the PC method at test were also due to difficulty with the CONTROL dimension. However, the number of inconsistencies decreased from test to retest.

The PC raw scores yielded a significantly higher median reliability coefficient than the RA raw scores. At least for the respondents who gave consistent answers, the PC data with logistic regression also appeared to provide more reliable estimates of the dimensional weights. This greater reliability may be because the PC task requires less cognitive effort. Alternatively, it may have been easier for the subject to remember her previous choices and thus exhibit greater test-retest reliability.
Table 6. Dimension weights for the two methods*

Dimension | RA test | RA retest | PC test | PC retest
CONTINUITY | 46.3 ± 4.0 (21.9) | 48.5 ± 4.1 (22.5) | 47.0 ± 3.1 (17.0) | 47.5 ± 3.3 (18.1)
ATTITUDE | 32.2 ± 3.7 (20.3) | 31.7 ± 3.3 (18.1) | 31.5 ± 3.6 (19.7) | 31.1 ± 2.8 (15.3)
CONTROL | 21.5 ± 2.9 (15.9) | 19.9 ± 2.5 (13.7) | 21.4 ± 2.9 (15.9) | 21.4 ± 2.9 (15.9)

*Table values give means ± SE of the weights; SD appear in parentheses.

Good evidence of convergent validity was observed. The high median agreement coefficients and identical average rankings of the scenarios indicate that, to a considerable degree, the two techniques are tapping the same underlying construct of relative importance. The agreement between the dimension weights was also satisfactory, even though, due to the different types of data, they were calculated under two different models. Although we cannot draw broad conclusions from this study, within this sample both methods performed well, with the PC method performing perhaps slightly better than the RA method.

Of the three care-oriented dimensions studied, continuity of medical supervision was, on the average, given the highest rank. In contrast, control over treatment decisions was given the lowest average rank, and only 10% of the respondents considered it the most important of the three variables. The setting and sample size used in this study preclude inferences about how breast cancer patients in general perceive the relative importance of these dimensions of clinic care. For example, it is possible that these results were influenced significantly by the setting within which the study was done; continuity of care may be perceived as important by patients in a tertiary care setting, and control over treatment decisions may be perceived as unimportant by patients who know that their treatment choice has already been made.

Whatever the reason for the particular rank order of weights reported by the participants in this study, the results imply that various dimensions of health care may be of differential importance to patients, that these rankings of importance should be taken into consideration when developing patient satisfaction scales, and that these methods can yield consistent rankings even with a small sample size.

Methods of the kind described here could be used not only in the development of patient satisfaction scales, but also more directly as a means to explore variations in the importance patients assign to different dimensions of satisfaction with care. For example, do patients give increasingly high priority to continuity of care as they proceed from the care of a single primary care physician to the care of a larger health care team? Do patients give much greater importance to control over treatment decisions at a time immediately prior to treatment choice than at other times? The issue of the stability of patients' priorities is an important one; if they are labile, then which set of priorities should be used in evaluations of satisfaction with care? The stability of patients' preferences has been examined in other contexts [6, 24, 25]. The results indicate that the relative importance ascribed to the dimensions of interest may remain stable [6, 25], or may vary [24]. Factors that
influence the stability or lability of patients' priorities merit more attention than they have received in the past; methods like those described in this paper could be useful in exploring these broader research issues.

Acknowledgements: This work was supported by the National Cancer Institute of Canada and the Ontario Cancer Treatment and Research Foundation.
REFERENCES
1. Greene J. Y., Weinberger M. and Mamlin J. J. Patient attitudes toward health care: expectations of primary care in a clinic setting. Soc. Sci. Med. 14A, 133, 1980.
2. Locker D. and Dunt D. Theoretical and methodological issues in sociological studies of consumer satisfaction with medical care. Soc. Sci. Med. 12, 283, 1978.
3. Ware J. E. Jr and Snyder M. K. Dimensions of patient attitudes regarding doctors and medical care services. Med. Care 13, 669, 1975.
4. McCusker J. Development of scales to measure satisfaction and preferences regarding long-term and terminal care. Med. Care 22, 476, 1984.
5. Matthews D. A., Feinstein A. R. and Joyce C. K. A new instrument for patients' perceptions of physicians' performance. Med. Decis. Making 5, 369 (abstr.), 1985.
6. Llewellyn-Thomas H. A., Sutherland H. J., Ciampi A., Etezadi-Amoli J., Boyd N. F. and Till J. E. The assessment of values in laryngeal cancer: reliability of measurement methods. J. chron. Dis. 37, 283, 1984.
7. Langley G. R., Sutherland H. J., Wong S., Minkin S., Llewellyn-Thomas H. A. and Till J. E. Why are (or are not) patients given the option to enter clinical trials? Cont. Clin. Trials 8, 49, 1987.
8. Miller G. A. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol. Rev. 63, 81, 1956.
9. Horsky D. and Rao M. R. Estimation of attribute weights from preference comparisons. Mgmt Sci. 30, 801, 1984.
10. Penner L., Homant R. and Rokeach M. Comparison of rank-order and paired-comparison methods for measuring value systems. Percept. Motor Skills 27, 417, 1968.
11. Nunnally J. C. Psychometric Theory, 2nd edn. McGraw-Hill, New York, 1978.
12. Llewellyn-Thomas H., Sutherland H. J., Tibshirani R., Ciampi A., Till J. E. and Boyd N. F. The measurement of patients' values in medicine. Med. Decis. Making 2, 449, 1982.
13. Sutherland H., Dunn V. and Boyd N. The measurement of values for states of health with linear analogue scales. Med. Decis. Making 3, 477, 1983.
14. Fischer G. W. Experimental applications of multiattribute utility models. In Utility, Probability, and Human Decision Making (Edited by Wendt D. and Vlek C.). Reidel, Dordrecht, 1975.
15. Llewellyn-Thomas H. A. and Sutherland H. J. Procedures for value assessment. In Recent Advances in Nursing: Research Methodology (Edited by Cahoon M.). Churchill-Livingstone, Edinburgh, 1987.
16. Llewellyn-Thomas H. A., Sutherland H. J., Tibshirani R., Ciampi A., Till J. E. and Boyd N. F. Describing health states: methodologic issues in obtaining values for health states. Med. Care 22, 543, 1984.
17. Ware J. E., Davies A. A. and Stewart A. L. The measurement and meaning of patient satisfaction. Hlth Med. Care Serv. Rev. 1, 2, 1978.
18. Fletcher R. H., O'Malley M. S., Fletcher S. W., Earp J. L. and Alexander J. P. Measuring the continuity and coordination of medical care in a system involving multiple providers. Med. Care 22, 403, 1984.
19. Thomasma D. C. Autonomy in the doctor-patient relationship. Theoret. Med. 5, 1, 1984.
20. Ray C., Fisher J. and Lindop J. The surgeon-patient relationship in the context of breast cancer. Int. Rev. Appl. Psychol. 33, 531, 1984.
21. Cassileth B. R., Zupkis R. V., Sutton-Smith K. and March V. Information and participation preferences among cancer patients. Ann. intern. Med. 92, 832, 1980.
22. David H. A. The Method of Paired Comparisons. Hafner, New York, 1963.
23. Slovic P. and Lichtenstein S. Comparison of Bayesian and regression approaches to the study of information processing in judgment. Org. Behav. Hum. Perform. 6, 649, 1971.
24. Christensen-Szalanski J. J. Discount functions and the measurement of patients' values: women's decisions during childbirth. Med. Decis. Making 4, 47, 1984.
25. O'Connor A. M., Boyd N. F., Warde P., Stolbach L. and Till J. E. Eliciting preferences for alternative drug therapies in oncology: influence of elicitation technique and treatment experience on preferences. J. chron. Dis. 40, 811, 1987.