
J Chron Dis Vol. 38, No. 1, pp. 27-36, 1985. Printed in Great Britain. All rights reserved.
0021-9681/85 $3.00 + 0.00. Copyright © 1985 Pergamon Press Ltd

A METHODOLOGICAL FRAMEWORK FOR ASSESSING HEALTH INDICES

BRAM KIRSHNER and GORDON GUYATT

McMaster University Health Sciences Centre, 1200 Main St West, Hamilton, Ontario, Canada L8N 3Z5

(Received in revised form 13 August 1984)

Abstract - Tests or measures in clinical medicine or the social sciences can be used for three purposes: discriminating between subjects, predicting either prognosis or the results of some other test, and evaluating change over time. The choices made at each stage of constructing a quality of life index will differ depending on the purpose of the instrument. We explore the implications of index purpose for each stage of instrument development: selection of the item pool, item scaling, item reduction, and determination of reliability, of validity, and of responsiveness. At many of these stages, not only are the requirements for discriminative, predictive, and evaluative instruments not complementary, they are actually competing. Attention to instrument purpose will clarify the choices both for those developing quality of life measures and for those selecting an appropriate instrument for clinical studies.

INTRODUCTION

A HOST of quality of life indexes have recently been developed to measure complex domains like emotional and social function, well-being, disability, and overall health status. Although methodological standards for quality of life indexes exist [1-4], their usefulness is limited by failure to clearly distinguish between the possible uses of these instruments in clinical practice and in research. The result is confusion as to the best way to construct and validate quality of life measures. In response to this problem, we present in the following discussion a methodological framework for the development and assessment of health status measures which emphasizes the specific purpose for which the instrument will be used. Whilst this framework can be applied to any measuring instrument, whether of hemoglobin, blood pressure, or emotional function, we believe it will be most useful for clarifying issues related to the construction and validation of quality of life measures.

THREE PURPOSES OF HEALTH STATUS MEASURES

The potential applications of health status measures can be divided into three broad categories: discrimination, prediction and evaluation.

1. Discriminative indexes

A discriminative index is used to distinguish between individuals or groups on an underlying dimension when no external criterion or gold standard is available for validating these measures. Intelligence tests, for example, are used to distinguish between children's learning abilities. The Minnesota Multiphasic Personality Inventory was developed in order to distinguish those with emotional and psychological disorders from the general population [5]. One of the most promising uses of discriminative functional status measures is in surveys which attempt to quantify the burden of illness across different communities [6, 7].

2. Predictive indexes

A predictive index is used to classify individuals into a set of predefined measurement categories when a gold standard is available, either concurrently or prospectively, to determine whether individuals have been classified correctly. This type of index is generally used as a screening or diagnostic instrument to identify which specific individuals have or will develop a target condition or outcome. Given that a gold standard is available, one might ask why an investigator would go to the trouble of developing a new measuring instrument which, at best, could only be as good as one which already exists. The answer, hopefully, is that the index has something to offer which the gold standard does not. The index may be less risky, uncomfortable or costly, or applicable earlier in the course of disease. We may, for example, use a simple functional status measure to determine which elderly people require additional social services in the home, the gold standard being a detailed assessment of need or the response to the intervention. Simple functional status measures have been shown to be among the most accurate predictors of prognosis in cancer patients [8]. The Denver Developmental Screening Test is designed to identify children who are likely to have learning problems in the future [9].

3. Evaluative indexes

An evaluative index is used to measure the magnitude of longitudinal change in an individual or group on the dimension of interest. The development of evaluative instruments has provided the main focus for those interested in measurement of quality of life. Such instruments are needed for quantitating the treatment benefit in clinical trials, and for measuring quality-adjusted life years in cost-utility analyses. A number of investigators have developed indexes designed for general populations [2, 10-14]. More recently, the need for disease-specific quality of life questionnaires for use in clinical trials has been recognized, and has led to the construction of instruments for studies in cancer [15] and in rheumatology [16, 17].

PURPOSE-SPECIFIC INSTRUMENTS: IMPLICATIONS FOR INDEX DEVELOPMENT

In clinical practice, the introduction of new tests or measures often occurs without scientists feeling the need, when assessing an instrument's usefulness, to focus on the specific purpose of the test. The reason is that the process of development of most clinical tests, and the requirements for successful application, are identical for discrimination, prediction and evaluation. For example, the measurement of cardiac output can be used to tell us who has normal cardiac function and who doesn't; who is likely to experience heart failure in the future; and whether someone's heart is working better or worse than it did in the past. The same is true for most of the wide variety of laboratory or physiological tests upon which clinicians rely. It is not true, however, when we look at instruments for measuring quality of life. In fact, as we shall point out, the requirements for maximizing one of the functions of discrimination, prediction, or evaluation may actually impede the others. As a result, validating the index for one of these purposes does not necessarily ensure that it can be used for the remaining ones.

The following steps are involved in constructing an index to measure quality of life:

-selection of the item pool
-item scaling
-item reduction
-determination of reliability
-determination of validity
-determination of responsiveness

To illustrate the different methodological issues involved in developing a quality of life index for each of these three purposes, we shall use an example pertaining to the measurement of functional status (by which we mean the way a person feels, and how he or she functions in day-to-day activities) in patients with disabling chronic cardiorespiratory disease. We shall look at three investigators who wish to construct an index to measure functional status in patients with chronic cardiorespiratory disease. The first, whom we shall call D (for discrimination), has the impression that functional status is related as much to personality variables, such as motivation and outlook on life, as it is to physiological dysfunction. To test this hypothesis requires a measure of functional status that would discriminate between individuals; D has taken on the task of constructing this instrument. The second investigator, P (for prediction), wishes to predict mortality in his patient population, and thinks that at least some aspects of functional status may be powerful clues to prognosis. The third investigator, E (for evaluation), wants to know if the treatments being used improve patients' functional status, and needs to develop an instrument which will evaluate the ability of an intervention to change the functional status of patients with chronic cardiorespiratory disease. We shall examine the steps involved in index construction for each of the three investigators in order to see how their approaches might differ. A summary of the major issues in index construction for each step is listed, according to the purpose of the instrument, in Table 1.

Table 1. Major issues in index construction, by purpose of the instrument

Item selection
  Discriminative: tap important components of the domain; universal applicability to respondents; stability over time
  Predictive: statistical association with the criterion measure
  Evaluative: tap areas related to change in health status; responsiveness to clinically significant change

Item scaling
  Discriminative: short response sets which facilitate uniform interpretation
  Predictive: response sets which maximize correlations with the criterion measure
  Evaluative: response sets with sufficient gradations to register change

Item reduction
  Discriminative: internal scaling or consistency; comprehensiveness and reduction of random error vs respondent burden
  Predictive: power to predict vs respondent burden
  Evaluative: responsiveness vs respondent burden

Reliability
  Discriminative: large and stable intersubject variation; correlation between replicate measures
  Predictive: stable inter- and intrasubject variation; chance-corrected agreement between replicate measures
  Evaluative: stable intrasubject variation; insignificant variation between replicate measures

Validity
  Discriminative: cross-sectional construct validity; relationship between index and external measures at a single point in time
  Predictive: criterion validity; relationship between index and the gold standard
  Evaluative: longitudinal construct validity; relationship between changes in index and external measures over time

Responsiveness
  Discriminative: not relevant
  Predictive: not relevant
  Evaluative: power of the test to detect a clinically important difference

SELECTION OF THE ITEM POOL

When constructing a questionnaire, the first task is to bring together a set of items which might plausibly be included in the final instrument. These items can be selected in a number of ways: the investigator may use personal judgement, tap the clinical experience of his colleagues, consult the relevant literature, or ask patients with the disease how it affects their lives. Regardless of which route is taken, the investigator must begin with some criteria for inclusion of individual items in the initial pool. Investigator D, whose goal is to discriminate between patients with cardiorespiratory disease according to their functional status, will want to focus on features of daily life in which disability related to underlying disease has a substantial impact. Items which are important to patients, and which are stable over at least short periods of time, will be chosen. This latter criterion is necessary if the final index is going to be consistent in how it distinguishes between subjects according to their functional status. Selection will also focus on items that are performed by virtually all of the subjects at a particular functional level. For example, ability to climb stairs would not be a desirable item if 50% of the respondents never had occasion to negotiate a flight of stairs: these patients would be left without a rating, and could not be compared with the rest of the population.

Investigator P would need only a single criterion for item selection: the relation between a particular item and subsequent mortality.

Unsure about which items fulfill this criterion, he may begin with a large pool of varied items chosen because they are common or important to patients. It is, however, items that can predict mortality that are being sought, regardless of whether or not they are causally related to this outcome.

The third researcher, E, must apply a unique criterion: the likelihood that patient status on a particular item will change with application of one of the available interventions. For example, if the questionnaire were designed for use in trials of bronchodilators, emotional aspects of functional status would have little place in the index, because variability in emotional function would show up as noise which might obscure a treatment effect. On the other hand, if an investigator wished to measure the effect of a comprehensive rehabilitation program, including counselling and group therapy, on functional status, emotional items would be important. The degree of dependency in living accommodations (e.g. community, group home, or institution), which could be a crucial element of the discriminative instrument, and may well be important to the predictive index, would be rejected by investigator E, because therapy is unlikely to change the living arrangements of such patients.

The procedure for achieving comprehensiveness is different when selecting an item pool for an evaluative instrument than for either a discriminative or predictive tool. We mentioned that one would want each item to be answered by most respondents who complete a discriminative questionnaire; the same need not be true of an evaluative index. To cite our previous example, while climbing stairs may be involved in the day-to-day activities of only 50% of the population, it may be a crucial chore for those who live in a two-storey home with bedroom and bathroom on the upper level. If an intervention made it substantially easier for this subsample to climb stairs (even if it had no other major effect), then it may be considered worthwhile. Evaluative instruments must, therefore, include a strategy for measuring all clinically important treatment effects.

ITEM SCALING

By scaling, we mean the options available for patients in answering each question; that is, the scale associated with each item. The simplest scale possible is one in which the respondent is offered two choices for each item: he either does perform a particular activity, or he doesn't. Such a scale would look very attractive to D. First, this strategy of response options is necessary if one wishes to use cumulative scaling. Second, different patients may mean different things by each of, say, "very mild, mild, moderate, severe, or extremely severe shortness of breath." When one is trying to discriminate between people, as D is, this variability in interpretation is problematic. Such interpretive problems are likely to be substantially less if the response options are "no problem with shortness of breath going up stairs" vs "problems with shortness of breath going up stairs."

In P's case, the issue at hand is to identify item scales which will maximize correlations with the criterion measure. In this instance, it is difficult to generalize as to the best scaling procedure, since the number of response categories may affect the correlation differently from item to item, or instrument to instrument.

The requirements for investigator E's index are quite different from those of D and P. He must ensure that his instrument is responsive (sensitive to change), i.e. individual items must show changes in score when clinically important improvement or deterioration occurs. D's two-option scale suffers badly from lack of responsiveness. If a patient has difficulty with shortness of breath on climbing stairs, the only way an intervention will change his score is if it eliminates shortness of breath altogether. However, if E's index is to be a success, it must register any clinically important decrease in distress. Up to a point (and no one is yet sure what that point is), increasing the response options on a scale will increase item responsiveness. Therefore, E may choose 5-, 7- or 9-point scales, or perhaps opt for a visual analogue scale; the investigator will certainly steer clear of yes-no questions. It should also be noted that E is not concerned with between-person differences, but with measuring within-person change over time.
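
The practical consequence of these scaling choices can be illustrated with a small sketch (the scoring functions and severity grades below are hypothetical, introduced only for illustration): a patient whose breathlessness improves from severe to moderate registers no change on D's two-option item, but a clear change on a graded item of the kind E would favour.

```python
# Hypothetical illustration: a patient's dyspnea on stairs improves from
# "severe" to "moderate" after treatment (grades 0-6 on a 7-point scale).

def two_option_item(severity: int) -> int:
    """D's dichotomous item: 1 = any shortness of breath on stairs, 0 = none."""
    return 1 if severity > 0 else 0

def seven_point_item(severity: int) -> int:
    """E's graded item: the score is the severity grade itself (0-6)."""
    return severity

before, after = 5, 3  # severe -> moderate (hypothetical grades)

print(two_option_item(before) - two_option_item(after))    # 0: improvement invisible
print(seven_point_item(before) - seven_point_item(after))  # 2: improvement registered
```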

ITEM REDUCTION AND INTERNAL CONSISTENCY

An investigator may initially choose items from the total item pool for the first version of the questionnaire based on item frequency and importance (as assessed by patients or health providers), but the content of the definitive instrument must depend on item performance in the setting in which the index will ultimately find its use. Therefore, performance criteria must be established so that items which do not contribute to, or actually detract from, the usefulness of the instrument can be deleted.

Investigator D will be satisfied with the questionnaire if it discriminates between people according to their functional status. If the total questionnaire is to achieve this purpose, the individual questions must have this discriminative ability. Therefore, questions to which most or all of the respondents give similar or identical answers are of no use. Idiosyncratic items, in which patients who by other criteria have a low functional status perform well and vice versa, must be excluded. Finally, the researcher must identify the items in which most of the between-person variance is accounted for by factors other than cardiorespiratory functional status. For example, suppose it is found that while people respond quite differently when asked about the difficulty they have climbing stairs, the amount of difficulty they report bears little relation to their functional status judged by other criteria. Perhaps their answer is related more to the grade of the stairs they encounter, the speed at which they attempt to mount stairs, or what they generally find at the top, than to the severity of their disease. If this were the case, the question should be removed from a discriminative instrument intended to distinguish between patients according to functional status.

There are a number of ways of ensuring that items retained for the final discriminative index conform to these standards. One very powerful method is to ensure that the final instrument meets cumulative scaling criteria. Cumulative scales are those that order items by the degree of severity of limitation they describe, and for which only one pattern of responses (across items) is associated with each scale level [18]. For example, suppose a questionnaire has three items asking about ability to run a mile, walk a mile, and walk a block. If this questionnaire fulfills cumulative scaling criteria, then anyone who said they could run a mile would also report being able to walk both a mile and a block. If only one affirmative answer was given, it would always be to the question about walking a block, and a second positive response would always be to the query about ability to walk a mile. Thus, in a cumulative scale the patient's score, a single number, tells you how he answered every question on the test. While often difficult to construct, and for some areas (such as emotional function) virtually impossible, satisfaction of cumulative scaling criteria solves a number of problems for discriminative instruments. It ensures that each question is in fact able to discriminate between individuals, that there are no idiosyncratic items in the questionnaire, and that items bear a fixed relation to one another. Further, satisfaction of cumulative scaling criteria eliminates the need for weighting the importance of each question, and provides very strong evidence that the between-person variability in responses is related to status on a uniform underlying dimension.
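
The walk-a-block / walk-a-mile / run-a-mile example can be expressed as a simple computational check. The sketch below (item ordering and response patterns are hypothetical) tests whether a response pattern satisfies the cumulative property, and shows how a single score recovers the full pattern.

```python
# Cumulative (Guttman) scaling check for three mobility items, ordered from
# least to most demanding; 1 = can do, 0 = cannot do. In a perfect cumulative
# scale, a "yes" on a harder item implies "yes" on every easier one, so the
# only admissible patterns are [0,0,0], [1,0,0], [1,1,0] and [1,1,1].

ITEMS = ["walk a block", "walk a mile", "run a mile"]  # easiest -> hardest

def is_cumulative(pattern: list) -> bool:
    """True if no 1 follows a 0 in difficulty order."""
    return sorted(pattern, reverse=True) == pattern

def scale_score(pattern: list) -> int:
    """On a cumulative scale this single number recovers the whole pattern."""
    return sum(pattern)

print(is_cumulative([1, 1, 0]))  # True: walks a block and a mile, cannot run
print(is_cumulative([1, 0, 1]))  # False: claims to run a mile but not walk one
print(scale_score([1, 1, 0]))    # 2 -> the pattern can only be [1, 1, 0]
```
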
Where cumulative scaling is not possible, investigators can employ other statistical techniques to achieve item reduction. Procedures designed to measure internal consistency, such as those of Kuder and Richardson [19] and Cronbach [20], can be used to select the group of items which will maximize the precision of the instrument to measure a given construct. These internal consistency measures are based on the number of items included in the index, as well as the set of correlations among them; hence, they provide good estimates of measurement error attributable to inappropriate and/or inadequate sampling of the content domain. Another advantage of these measures is that they can be used to show the point at which the inclusion of additional items in the index will not add substantially to the instrument's precision. Investigator D, who wishes only to be able to discriminate between people according to functional status, may well proceed with item reduction by deleting items which fail to meet the requirements of either cumulative scaling or internal consistency, since inclusion of these items would decrease the power of the index to discriminate between patients with cardiorespiratory disease according to their functional status.
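
Cronbach's alpha [20] can be computed directly from its usual variance decomposition. The sketch below uses made-up responses from six patients on four items to show how strongly covarying items yield a high coefficient.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score),
    for an (n respondents) x (k items) matrix of scores."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Made-up data: six patients answering four functional-status items (0-5).
responses = np.array([[5, 4, 5, 4],
                      [4, 4, 3, 4],
                      [3, 2, 3, 3],
                      [2, 3, 2, 2],
                      [1, 1, 2, 1],
                      [0, 1, 0, 1]])
print(round(cronbach_alpha(responses), 2))  # high alpha: items covary strongly
```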

In considering how many items to delete, investigator D will want to take respondent burden (the time and effort required by the subject to complete the questionnaire) into consideration.

While procedures for measuring internal consistency are often used to refine and test evaluative indexes, one can question the rationale for doing so. Measures of internal consistency such as the KR-20 and Cronbach's alpha are based on the assumptions that the precision of the index will increase incrementally with the covariance of the items and the number of items included. While this is true of a discriminative index, it is not true of an evaluative one. To begin with, items in an evaluative index need not be correlated at a single point in time but rather must be consistent in the way they measure change in health status over two points in time. Second, maximizing the number of items in the index, regardless of whether they are correlated, can increase the likelihood of including items that are insensitive to efficacious treatment. Besides unnecessarily increasing the respondent burden, any variability in items in which change in score is not related to therapy will contribute to the instrument's random error, and may thus obscure any important treatment effects which do occur.

For E, it is imperative that his index be responsive to all clinically important differences in functional status. This requires identification and deletion of unresponsive items. One way in which E could proceed with this task is to administer the preliminary questionnaire to a group of patients with chronic cardiorespiratory disease before and after an intervention with known efficacy in improving functional status. Responsive items could be retained for subsequent use, and items whose score failed to change could be expunged. E must also be careful to ensure that item responsiveness is not idiosyncratic to a specific intervention or subpopulation, and as a result may wish (before deleting any items) to repeat this experiment in more than one subgroup with more than one intervention.

Investigator P needn't be concerned with internal consistency, cumulative scaling, or item responsiveness. All that is needed is patience, for the questionnaire must be administered to a group of people who are then followed over time to see who dies and who remains alive. Items that fail to discriminate between the two groups can be discarded, and those with predictive ability retained. Further, to maximize the power of the instrument, answers to each of the items can be regressed against mortality, and weights derived for each item that will allow for the best prediction of ultimate outcome.

RELIABILITY

Reliability of a single application of a measurement instrument can be defined as the ratio of the variance attributable to true differences among subjects to the total variance [21, 22]. Reliability can be measured by serial administration of a test to a group of subjects whose status on the dimension being examined is believed to be stable. The issue of reliability, however, is different for each of the three investigators.

Investigator D is concerned with demonstrating that the differences observed between individuals are reproducible over two or more points in time; that is, that between-person variation will remain constant. This can usually be assessed with correlation coefficients. While these coefficients do not take account of systematic changes in the scores, such change may not be important (for example, if everyone's score rose by half a standard deviation). This is because scores from discriminative indexes often do not have meaning in themselves. Rather, they are usually interpreted in terms of referent values (e.g. the average and standard deviation) or their place in a distribution. For example, it is common practice to standardize the scores from intelligence tests on a given population and to interpret individual scores in terms of percentiles.

In the case of predictive indexes, the scores or measurement categories are already meaningful in that they relate to one or more target conditions. In this instance, reliability pertains to agreement between replicate observations: whether the index classifies each individual into the same category at time 1 and time 2. From investigator P's viewpoint, the usefulness of his index is based on its ability to give predictions which are more reliable and accurate than those which would be obtained by chance alone. Therefore, he concludes that reliability should be demonstrated with a chance-corrected measure of agreement, which takes into account both systematic change and random error.

As far as reliability is concerned, investigator E is only interested in whether replicate observations on each individual remain stable over time; that is, that the magnitude of the within-person variance is small. It is entirely possible that small changes in the within-person variation may result in poor between-person reliability. This would not be of concern to E, provided the degree of within-person variation remained insignificant from both a statistical and a clinical standpoint.

To illustrate this point, consider how the three investigators would react if the data depicted in Table 2 represented the results obtained from test-retest procedures for each of their instruments.

Table 2. Test-retest reliability

Patient    Time 1*    Time 2
   1         15         14
   2         14         15
   3         15         14
   4         14         15
   5         15         14
   6         14         15
   7         15         14
   8         14         15

*Score is out of 20.

Although within-person variation is small, the reliability of the instrument is very poor. Because the between-person variance is zero, the intraclass correlation coefficient (which describes reliability as the ratio of between-subject variance to total variance, and can be used as a parametric analogue of the chance-corrected measure of agreement, kappa [23]) has a value of 0, and the product moment correlation coefficient fares no better. These coefficients reflect the uselessness of the instrument depicted from the viewpoints of both investigators D and P. D is trying to discriminate between subjects according to their present functional status, P according to their risk of subsequent death. The index whose results are depicted in Table 2 won't be able to discriminate between patients at all, because there isn't any between-patient variability.

The results of the reliability study may not appear so bleak to E, who is not interested in measuring between-person variation, but only within-person changes over time. Let us say that E subsequently presents evidence that his index can accurately measure the degree of intra-subject variation in functional status across different points in time when change does occur. Because the data in Table 2 demonstrate small intra-subject variance at a time when functional status is not changing, it is then likely that the index will be a useful outcome measure for clinical trials.

The message here is that a discriminative index, in order to be useful, must have a large and stable between-subject variation. Predictive indexes must meet an additional criterion: there must not be any systematic change in subject scores over time. For evaluative indexes, small within-subject variance in stable subjects and a large change in score when functional status improves or deteriorates are the factors of importance.
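
Applying the definition of reliability given above (the ratio of between-subject variance to total variance) to the Table 2 data makes the point concrete; a minimal sketch:

```python
import numpy as np

# Table 2 data: eight patients, two administrations (scores out of 20).
scores = np.array([[15, 14], [14, 15], [15, 14], [14, 15],
                   [15, 14], [14, 15], [15, 14], [14, 15]], dtype=float)

between = scores.mean(axis=1).var(ddof=1)   # 0: every patient averages 14.5
within = scores.var(axis=1, ddof=1).mean()  # 0.5: replicates differ by 1 point

print(between / (between + within))  # 0.0: useless to D and P, yet the
                                     # within-person drift E cares about is small
```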

VALIDITY

Criterion validity refers to the extent to which a measuring instrument produces the same results as a gold standard, or criterion measure [24-26]. Investigator P can make use of criterion validity, for mortality is a relatively easy outcome to measure. If his questionnaire is indeed able to forecast mortality, its validity is established. Investigators D and E have a more difficult task, for a gold standard to measure functional status in cardiorespiratory patients is not available.

Content validity refers to the completeness with which an index covers the important areas of the domain which it is attempting to represent [24-26]. While we will not dwell on methods for demonstrating content validity, it is worth mentioning that the domains that D and E are trying to sample are different. D is attempting to sample all important, relatively stable aspects of functional status common to most members of each functional class, while E's domain is restricted to salient activities and feelings which are subject to clinically important change with treatment.

Construct validity is concerned with the extent to which a particular measure relates to other measures in a manner which is consistent with theoretically derived hypotheses concerning the concepts (or constructs) that are being measured [24-26]. To demonstrate construct validity, D might want to show that patients with poorer cardiac and respiratory function by physiological measurement score lower in functional status; that questionnaire score is related directly to exercise capacity; and that global ratings of quality of life by patient, relative, and health worker bear a close relation to results of the new index. All these measurements would be undertaken at a single point in time, and it is the cross-sectional difference between subjects on each of these tests that would be correlated. Investigator E might want to use the same physiological, exercise, and global rating methods to validate the index, but it would not be between-subject differences at a single point in time that would be examined. Rather, it would be important to show that longitudinal within-subject changes in index scores with an intervention bore the expected relation to changes in the other variables measured.

This point has been neglected in the health status measurement literature to date. To cite a number of examples: Bush and colleagues [3, 10], Parkerson and associates [2], and Spitzer et al. [15] constructed quality of life measures for, respectively, the general population, family practice populations, and cancer patients. Each investigative team indicated that their instrument was intended for use as an outcome measure in clinical trials. The first two groups demonstrated expected correlations between questionnaire results and variables such as age, socio-economic status, and a number of chronic medical conditions. Spitzer and colleagues provided evidence of the validity of their QL-Index by demonstrating substantial correlations between a number of ratings of quality of life by patients, health workers and relatives, and scores on their new questionnaire. They also showed that the QL-Index could discriminate between groups of healthy people, those with cancer or other chronic diseases, and the seriously ill. These data can appropriately lead to the conclusion that the QL-Index has been validated for distinguishing between groups of healthy people, those with chronic disease, and the seriously ill; however, not one of the three instruments has been validated for measuring within-person change in the setting of a clinical trial. It could be argued that instruments that adequately discriminate between people are likely to be valid for measuring within-person change over time as well. While this may be true, it is not necessarily so. Therefore, for the most convincing, or definitive, demonstration of the validity of an evaluative instrument, its relation to other measures must be examined prospectively in a setting in which change over time is measured.
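
The operational difference between D's and E's construct validation is simply which correlation is computed: between-subject correlation at one point in time, or correlation of within-subject change scores. A sketch with simulated index and exercise-capacity data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50  # hypothetical patients

# Simulated index scores and an external measure (exercise capacity) at two
# time points, with a treatment effect that varies across patients.
index_t1 = rng.normal(50, 10, n)
exercise_t1 = 0.8 * index_t1 + rng.normal(0, 5, n)
effect = rng.normal(5, 3, n)
index_t2 = index_t1 + effect
exercise_t2 = exercise_t1 + 0.8 * effect + rng.normal(0, 2, n)

# D's cross-sectional validation: between-subject correlation at one time.
r_cross = np.corrcoef(index_t1, exercise_t1)[0, 1]

# E's longitudinal validation: correlation of within-subject change scores.
r_long = np.corrcoef(index_t2 - index_t1, exercise_t2 - exercise_t1)[0, 1]

print(f"cross-sectional r = {r_cross:.2f}; longitudinal r = {r_long:.2f}")
```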

RESPONSIVENESS

While demonstration of reliability and validity is sufficient for concluding that an instrument is useful for discriminative or predictive purposes, we must also have information about an evaluative instrument's responsiveness before we can confidently use it as an outcome measure in a clinical trial. This issue concerns the power of the index to detect a difference when one is present; that is, the sample size required to observe a small, medium, or large change or effect size in the population.


There are a number of strategies for assessing responsiveness. These include ensuring that scores improve with application of a treatment of known efficacy, and use of the index in a clinical trial followed by examination of change scores in those who by other criteria improved or deteriorated. Regardless of which method is chosen, the investigator must come to grips with the effect of random and systematic error on the power of the test. When large sources of measurement error are present, then beta error is increased and consequently the power of the test is decreased. If the power of the test turns out to be too low, requiring an unattainable N to observe a desired effect size at a given level of alpha, then the index is not useful as an evaluative instrument.
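
The relation between responsiveness and required sample size can be made concrete with the standard normal-approximation formula for comparing two group means, n per group of roughly 2((z_{1-alpha/2} + z_{1-beta})/delta)^2, where delta is the effect size in standard-deviation units; this textbook formula is not given in the paper, and the sketch below simply evaluates it.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group to detect a standardized difference `effect_size`
    between two means: n = 2 * ((z_{1-alpha/2} + z_{power}) / effect_size)^2."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

# Measurement error inflates the within-group standard deviation, shrinking the
# standardized effect size and driving up the N required at the same alpha.
for delta in (0.8, 0.5, 0.2):  # Cohen's conventional large, medium, small effects
    print(f"effect size {delta}: about {n_per_group(delta)} patients per group")
```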

SUMMARY

The tests we use in clinical practice and research have three basic purposes: to discriminate between individuals along a continuum of health, illness or disability; to predict outcome or prognosis; and to evaluate within-person change over time. While for most testing procedures the prerequisites for each role are complementary, in the case of quality of life measures the requirements may be not only independent but competing. Therefore, those who wish to develop new instruments in this area should have their specific goal clearly in mind and tailor approaches to item selection, scaling, reduction, and assessment of reliability and validity to their primary purpose. Likewise, those who wish to use an existing index in clinical practice or research should ensure that the instrument has confirmed reliability, validity, and, if necessary, responsiveness, for the specific use they intend to make of it.

Acknowledgement - This work was partially supported by the Medical Research Council of Canada and the Ontario Ministry of Health.

REFERENCES

1. Sackett DL, Chambers LW, McPherson AS et al: The development and application of indices of health: general methods and a summary of results. Am J Publ Health 67: 423-428, 1977
2. Parkerson GR, Gehlbach SH, Wagner EH et al: The Duke-UNC Health Profile. Med Care 19: 806-828, 1981
3. Kaplan RM, Bush JW, Berry CC: Health status: types of validity and the index of well-being. Health Serv Res 11: 478-507, 1976
4. Bombardier C, Tugwell P: A methodological framework to develop and select indices for clinical trials: statistical and judgmental approaches. J Rheumatol 9 (Suppl 8): 169-172, 1982
5. Hathaway S, McKinley C: Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation, 1951
6. Maddocks CL, Kerasic RB: Planning Services for Older People: Translating National Objectives into Effective Programs. Durham, NC: Center for the Study of Aging and Development, 1976
7. National Center for Health Statistics: Health Interview Survey Procedures, 1957-1974. Rockville, Md: National Center for Health Statistics, 1975 (Vital and Health Statistics, Series 1, No. 11; DHEW publication no. (HRA) 75-1311)
8. Sarna GP, Holmes EC, Petrovich Z: Lung cancer. In: Haskell CM (Ed) Cancer Treatment. Philadelphia: WB Saunders, 1980, p. 197
9. Frankenburg WK, Goldstein AD, Camp BW: The revised Denver Developmental Screening Test: its accuracy as a screening instrument. J Pediatr 79: 988-995, 1971
10. Kaplan RM, Bush JW: Health-related quality of life measurement for evaluation research and policy analysis. Health Psychol 1: 61-80, 1982
11. Spitzer WO, Sackett DL, Sibley JC et al: The Burlington randomized trial of the nurse practitioner. N Engl J Med 290: 251-256, 1974
12. Bergner M, Bobbitt RA, Carter WB et al: The Sickness Impact Profile: development and final revision of a health status measure. Med Care 19: 787-805, 1981
13. Ware JE, Brook RH, Davies-Avery A et al: Conceptualization and Measurement of Health for Adults in the Health Insurance Study: Model of Health and Methodology. Rand Corporation, May 1980, p. 1
14. Hunt SM, McKenna SP, McEwen J et al: A quantitative approach to perceived health status: a validation study. J Epidemiol Community Health 34: 281-286, 1980
15. Spitzer WO, Dobson AJ, Hall J et al: Measuring the quality of life of cancer patients. J Chron Dis 34: 585-597, 1981
16. Meenan RF, Gertman PM, Mason JH: Measuring health status in arthritis: the Arthritis Impact Measurement Scales. Arthritis Rheum 23: 146-152, 1980
17. Fries JF, Spitz P, Kraines RG et al: Measurement of patient outcome in arthritis. Arthritis Rheum 23: 137-145, 1980
18. Guttman L: A basis for scaling qualitative data. Am Sociol Rev 9: 139-150, 1944
19. Kuder GF, Richardson MW: The theory and estimation of test reliability. Psychometrika 2: 151-160, 1937
20. Cronbach LJ: Coefficient alpha and the internal structure of tests. Psychometrika 16: 297-334, 1951
21. Winer BJ: Statistical Principles in Experimental Design, 2nd edn. Tokyo: McGraw-Hill Kogakusha, 1971, pp. 283-287
22. Guilford JP: Fundamental Statistics in Psychology and Education, 4th edn. Toronto: McGraw-Hill, 1965, pp. 438-442
23. Kramer MS, Feinstein AR: Clinical biostatistics LIV: the biostatistics of concordance. Clin Pharmacol Ther 29: 111-123, 1981
24. Nunnally JC: Psychometric Theory. New York: McGraw-Hill, 1978
25. American Psychological Association: Standards for Educational and Psychological Tests. Washington, DC: American Psychological Association
26. Carmines EG, Zeller RA: Reliability and Validity Assessment. Beverly Hills, Calif: Sage Publications, 1979