J Chron Dis Vol. 40, Suppl. 1, pp. 23S-26S, 1987
0021-9681/87 $3.00 + 0.00
Printed in Great Britain. All rights reserved
Copyright © 1987 Pergamon Journals Ltd
EVALUATING HEALTH MEASURES. COMMENTARY: MEASURING OVERALL HEALTH: AN EVALUATION OF THREE IMPORTANT APPROACHES
MARILYN BERGNER,1 ROBERT M. KAPLAN2 and JOHN E. WARE JR3
1Division of Health Policy and Management, Johns Hopkins School of Hygiene and Public Health, Baltimore, MD 21218, 2Department of Community Medicine, University of California, San Diego, California and 3Health Sciences Program, The RAND Corporation, Santa Monica, California, U.S.A.
Dr Bergner is Professor, Dr Kaplan is Professor, and Dr Ware is Senior Research Psychologist.
Health status measurement
Functional status
Validity
Quality of life
I. INTRODUCTION
In their paper entitled "Measuring Overall Health: An Evaluation of Three Important Approaches," Read et al. [1] present results from a systematic evaluation of three widely used health measures: the General Health Rating Index (GHRI), the Quality of Wellbeing Scale (QWB), and the Sickness Impact Profile (SIP). Their presentation is of considerable importance for several reasons. First, each of us [Drs Ware, Kaplan, and Bergner, respectively] was actively involved in the development and validation of one of the three measures compared and thus has a major interest in what this empirical evaluation by an independent investigator can tell us about the relative performance of our measures. Second, to the best of our knowledge this is the first time all three have been administered in the same study. Finally, the comparisons and results reported by Read et al. can provide an opportunity to discuss the standards that should be used in comparing health measures and to offer suggestions for future studies. In some respects, this paper is a good example of how one should proceed with this kind of measurement evaluation. In other respects,
there is room for improvement.
First, let us examine some of the strengths of the approach. Read and his colleagues begin with a conceptualization of health and health-related factors, which can serve as a blueprint in reviewing the content of candidate measures. This is a strength of their approach. We highly recommend giving careful thought to the health concepts of interest so that the content of candidate measures can be evaluated in light of the information about health that is desired. The list of major concepts included in their model looks similar to the one we would have suggested, as Bergner discussed elsewhere [2]. Both include: (1) health-related genetic, environmental, and other factors not under personal control; (2) health behaviors and attitudes under personal control; (3) physical functioning; (4) mental functioning; and (5) social functioning.
Another strength of the evaluation by Read and his colleagues is their emphasis on practical considerations involved in using a health measure, which included training, administration methods, and respondent burden. These considerations are too often ignored when measures are compared. Further, Read and co-authors collected their data with careful attention to factors that might influence results, including order of administration of instruments, which was randomized to minimize any bias due to this factor. Tests of validity employed multiple methods, and the authors were careful to define technical terms when presenting methods and results.
Measurement requirements vary from study to study and, therefore, conclusions drawn from a review of measures will vary depending on the particular application one has in mind. Regardless of how well a measure performs in general, its performance in relation to the requirements of a particular study is of utmost importance. Read, Quinn, and Hoefer did not ignore this issue. Their stated goal was to compare the practicality and validity of the three measures for purposes of evaluation research and, specifically, for a prospective clinical trial of a health care intervention. Perhaps we differ with them most in terms of how general or specific one need be in this regard. The choice of a measure depends on the goal and design of the trial and particularly on the specific health outcomes that are expected. For example, a trial investigating alternative approaches to treating mental health disorders requires measures sensitive to differences in psychological distress and well-being. An investigation of tradeoffs involved in treatments affecting the length and quality of life requires a measure that integrates both. Whether measures are selected to concentrate on the negative or the positive end of a health continuum depends on whether one is studying a relatively sick or healthy population. The context in which Read and his colleagues compared the three measures seems too broad to permit a focused evaluation.
II. THE GOALS OF MEASUREMENT
The first and foremost consideration in choosing a measure is the relationship between what one wants to know about health and the information provided by a particular health measure. Studies of health outcomes vary greatly in their focus, including such policy issues as the benefits produced by competing systems of care or such clinical concerns as the benefits produced by specific medical treatments. Studies of very heterogeneous populations require measures that are sensitive to changes in a wide range of health concepts. A clinical trial of a new treatment may focus on biomedical or functional parameters that are disease- or treatment-specific. A goal common to the developers of the three measures in question was that of supplementing the information about health that is available from clinical and other more highly focused measures. Read and his colleagues offer good arguments in favor of this strategy. There is more to life than its length. Life also varies along dimensions of quality. There is more to the quality of life than the parameters of disease and treatment outcome that have most often been measured, for example, in clinical trials.
All three of the measures in question were based on a broad perspective of health, although the approach taken by each is very different. They were developed with the goal of broadening the focus of health assessment in the areas of behavioral functioning and general well-being.
Although a review of content is no substitute for empirical tests of validity, a content review is a good first step in anticipating likely differences in validity. To assess the content of a measure in relation to study objectives, it is a good idea to (1) specify the hypothesis to be tested and the specific health concepts required, (2) review what the developers say their instruments measure, and (3) examine the content of items to scrutinize these claims. Read and his colleagues present a thorough discussion of the similarities and differences in the content of the GHRI, QWB, and SIP. We take this opportunity to share our perspectives.
Items in the QWB explicitly tap physical and social functioning as well as physiological status, the latter by way of its symptom problem list. Its most noteworthy advantage over the other two instruments is that it offers a method for incorporating both length and quality of life in a single index score, making it possible to include both survivors and nonsurvivors in the same analysis. In other words, programs that produce benefit or harm in terms of either mortality or morbidity or both can be compared using this index. Neither of the other two measures was constructed with this goal in mind.
It is thus reasonable to ask the following about the QWB: How well does the QWB make adjustments for differences in the quality of days lived? How well does it capture the full range of important benefits that might be produced by competing health programs? Is there value in enhancing the content of the QWB, for example, in the area of mental health? Read and his colleagues provide one test of whether the inclusion or exclusion of mental health items is likely to make a difference in empirical terms.
SIP items ask about the three major health concepts recommended by the World Health Organization [3]: physical, mental, and social
functioning. Further, it measures behavioral dysfunction in each of these conceptual areas with considerable depth. In contrast to the QWB index, it has a number of scoring options: a total score, summary scores for physical and psychosocial functioning, and 12 specific subscale scores. Thus, an examination of its content suggests that it is the richer source of health information from a clinical perspective.
But is there more to the health equation than behavioral dysfunction? The WHO definition of health also stresses "well-being." The GHRI attempts to assess this concept by asking for personal evaluations of "health." This is based on the assumption that well-being is in the "eyes of the beholder." In contrast to the SIP, the GHRI is not suitable for contrasting differences in outcomes for specific health concepts such as physical vs mental health. Its items fall entirely within one category in the diagram of health offered by Read and his colleagues, namely, patient evaluation of overall health. The GHRI distinguishes evaluations of health in the past, present, and future and offers the option of aggregating these concepts or interpreting them separately. The most distinguishing feature of the GHRI is its subjectivity. It is the only measure of the three that asks individual respondents personally to assess their own health status.
III. PRACTICAL CONSIDERATIONS
The most reliable and valid health measure has only limited value to the field if its use is impractical in most studies. Practical considerations include the real dollar and time costs involved in gathering health data and the acceptance of a health measure to those involved, including respondents and the audience to be addressed when results are published. Included are the amount of training required, mode of administration (personal or telephone interview, self-administration), length in terms of number of items and response time, as well as the steps involved in data processing and interpretation. These and other practical considerations are discussed elsewhere [4].
Application of these criteria is well illustrated by Read and his colleagues. They note differences among the measures in terms of training required, respondent burden, and the complexities of scoring. Despite these differences, all three instruments were well accepted by study participants.
IV. SENSITIVITY TO DIFFERENCES IN MENTAL HEALTH
Findings reported by Read and his colleagues offer the opportunity to explore whether the differences in the content of the three measures, as discussed above, can be linked to differences in empirical results. Their analysis of content revealed a large number of SIP items appearing to ask about mental health. In contrast, the QWB contains only one item assigned to that category, and the GHRI asks only about health in general. Thus, according to the standard of content validity, the SIP appears to be the most valid of the three for purposes of testing hypotheses about differences in mental health.
Empirical tests of the validity of summary scores for the three measures in assessing mental health were performed by correlating each with a short form of the Mental Health Inventory (MHI), a general measure of psychological distress and well-being. What did Read and his colleagues find? Among the three measures, correlations with the mental health criterion variable (the MHI) ranged from a high of 0.57 for the SIP to a low of 0.33 for the QWB; the GHRI correlated 0.46 with the MHI. (These differences in correlations appear to be significant at the 0.05 level.) Thus, consistent with expectations from the content analysis, the SIP clearly correlated highest with the mental health criterion variable. It is interesting that personal ratings of health in general, as measured by the GHRI, also correlated more highly than the QWB with the mental health criterion variable.
Let us leave this issue with the conclusion that, from the perspective of content validity (i.e. what is and is not well represented in a battery of health items), the SIP appears to be a more valid measure of mental health than the QWB. This conclusion is also supported by empirical tests of validity reported by Read and his colleagues.
V. THE PROS AND CONS OF AGGREGATION
Finally, we take this opportunity to share some suggestions regarding additional worthwhile analysis possible within the framework of this rich data set. This analysis and others like it would inform the debate regarding the pros and cons of relying on aggregated vs disaggregated measures. Two of the measures compared (SIP and GHRI) were constructed to permit analysis of subscales focusing on specific health
concepts in addition to analysis of a summary score. Are there advantages associated with this approach? Are some subscales more valid than others for purposes of testing specific hypotheses? For example, an interesting question is whether the mental health criterion variable correlates more highly with the SIP Psychosocial Scale than with other SIP subscales or the SIP summary score discussed above. If so, reliance on the Psychosocial subscale in a clinical trial would provide a more powerful test of hypotheses regarding mental health outcomes than analysis of other subscales or the summary score. Further, it would be interesting to test whether predictions based on a set of disaggregated subscales from the SIP or GHRI are more accurate than predictions based on a summary score.
VI. FINAL COMMENTS
So, what is the best health measure? Read, Quinn, and Hoefer did not answer this question and, in the abstract, neither will we. The answer depends on study objectives. Comparisons among candidate measures should be made in terms of criteria established in advance to parallel study requirements as closely as possible. In several respects Read et al. have set a good example of how one might proceed in making such comparisons. They stressed the importance of practical considerations and called attention to noteworthy differences among the three measures in this regard. They offered a specific conceptual framework suitable for judging the content of health measures and presented numerous empirical tests of validity. The extensive correlation matrix they have appended (see Appendix A in [1]) is a rich source of information about the validity of GHRI, QWB, and SIP scores. Inclusion of the validity variables in this matrix would have made the table even more useful. Hopefully, other investigators who compare health assessment instruments will follow the example set by Read and his colleagues in reporting their results. If they do, choices among available measures will be better informed, which will be of great benefit to the field.
For those who are planning such comparative studies, we would suggest other criteria that can be applied in evaluating measures of health status, including: (1) how well they discriminate between groups of persons with and without serious medical conditions, (2) how well they discriminate persons at different levels of disease severity within a diagnostic group, (3) whether changes in health scores over time mirror known changes in disease severity, and (4) their predictive validity in forecasting utilization of health care resources as well as length of survival. A valid measure of health would be expected to perform well in all of these analyses. Again, the importance of each analysis depends on specific study objectives.
REFERENCES
1. Read JL, Quinn R, Hoefer MA: Measuring overall health: an evaluation of three important approaches. J Chron Dis 40: 7S-21S, 1987
2. Bergner M: Measurement of health status. Med Care 23: 696-704, 1985
3. World Health Organization: Constitution of the World Health Organization. In Basic Documents. Geneva: WHO, 1948
4. Ware JE Jr: Methodological considerations in the selection of health status assessment procedures. In Assessment of Quality of Life in Clinical Trials of Cardiovascular Therapies, Wenger NK, Mattson ME, Furberg CD, Elinson J (Eds). New York: LeJacq Publishing Inc., 1984. pp. 87-111