Testing the Measurement Properties of the Short Form-36 Health Survey in a Frail Elderly Population


J Clin Epidemiol Vol. 51, No. 10, pp. 827–835, 1998. Copyright © 1998 Elsevier Science Inc. All rights reserved.

0895-4356/98/$-see front matter PII S0895-4356(98)00061-4

Karen Stadnyk, Jane Calder, and Kenneth Rockwood*

Division of Geriatric Medicine, Dalhousie University, Halifax, Nova Scotia, Canada

ABSTRACT. The Short Form-36 Health Survey (SF-36) is a widely used measure of health-related quality of life; however, its suitability for frail older persons is not well documented. This study examines the measurement properties of the SF-36 in a frail older patient population. Patients consecutively admitted to two geriatric services (n = 146) were administered the SF-36 and comparative measures on admission and discharge. Internal consistency (0.75–0.91) and test-retest reliability (0.24–0.80) did not meet standards for clinical application of the tool. Four subscales were moderately correlated with comparative measures (Physical Function 0.53 to −0.76; Bodily Pain −0.61; Vitality −0.58; Mental Health −0.63). The results of effect size, standardized response mean, and relative efficiency statistics were consistent in documenting only minimal change for the SF-36 subscales. The SF-36 appears to be reliable and valid, although its ability to monitor clinical change for frail older patients is questionable. J Clin Epidemiol 51;10:827–835, 1998. © 1998 Elsevier Science Inc.

KEY WORDS. Health status indicators, quality of life, elderly, validity, reliability, responsiveness

* Address for correspondence: Dr. K. Rockwood, Centre for Health Care of the Elderly, Queen Elizabeth II Health Sciences Centre, 5955 Jubilee Road, Halifax NS, B3H 2E1, Canada. Accepted for publication on 28 April 1998.

INTRODUCTION

The evaluation of interventions targeted to the frail elderly is challenging, given that their problems are multiple and complex, and that few health outcomes are important for all patients [1–3]. The traditional health outcomes of mortality and morbidity reduction often fail to capture all of the issues that are important for preserving dignity and maintaining independence of this group [3]. Consequently, measures that address these health-related factors are needed when evaluating specialized geriatric interventions [4,5]. Although a number of instruments are available to tap health-related quality of life (HRQL) in general [6,7], to address specific health dimensions [8–10], and to examine HRQL in distinct populations [11,12], no “gold standard” exists. The generic Short Form-36 Health Survey (SF-36) [13] has, however, gained rapid acceptance as an HRQL measure because of its brevity and favorable psychometric properties [14].

While there is extensive experience with the SF-36 in general populations across age groups and health conditions [15–17], there is a paucity of data to support its use with the frail elderly. Many studies incorporate the elderly but fail to describe the instrument’s measurement properties for the oldest old (>75 years) [16–19], in whom frailty is a common occurrence [3]. Studies often exclude persons with multiple health problems or chronic conditions [20–22], and when older persons are included it is often not evident that they are frail [20,23–26].

It has been suggested that the SF-36 may be the optimum outcome measure [27] and that it has surpassed the Nottingham Health Profile [7], a long-standing generic assessment of HRQL, as the instrument of choice for outcomes research [14]. This project was designed to test these claims by assessing the feasibility, scaling properties, reliability, validity, and responsiveness of the SF-36 in a frail elderly patient population, using standardized instructions for administration and scoring.

METHODS

Design, Setting, and Subjects

This study was conducted between September 1995 and November 1996 on the Geriatric Restorative Care (GRC) and the Geriatric Day Hospital (GDH) services at the Queen Elizabeth II Health Sciences Centre (QEII HSC), Halifax, Nova Scotia, Canada. The GRC and GDH programs provide inpatient and outpatient rehabilitation, respectively, to elderly people who require rehabilitation following surgery, acute illness, or chronic health problems. All persons 65 years and older were eligible for inclusion; as in earlier studies [1,3], frailty was defined as “slow to mobilize” following hospitalization or any premorbid dependence (2 weeks prior to hospitalization) in activities of daily living [Barthel Index (BI) [8]; feeding, dressing, bathing, toileting, transfers] or instrumental activities of daily living [Older Americans Resources and Services Instrumental Activities


of Daily Living (OARS-IADL) [9]; telephone usage, travel outside, shopping, housework, meal preparation, medication usage, money management]. Subjects with moderate to severe cognitive impairment [Mini-Mental State Examination (MMSE) [28] ≤17/30], those with severe hearing impairment or receptive aphasia, and those unable to speak English were excluded from administration of the self-report measures. The QEII HSC Research and Ethics Committee approved this project. Informed consent was obtained from all subjects who completed interviewer-administered assessments.

Measures and Protocol

The SF-36 is a brief 36-item scale that measures physical function (10 items), role limitations due to physical (four items) and emotional health (three items), bodily pain (two items), social functioning (two items), mental health (five items), vitality (four items), and general health perceptions (five items) [13]. For each subscale, items are coded, summed, and reweighted to form a 0 (worst health) to 100 (best health) point scale. Together the subscales present a profile of physical and mental well-being. We administered the acute (items related to the past week) North American version of the SF-36 using the standardized instructions and scoring protocol [15].

Instruments were collected at admission and discharge. Trained interviewers administered the SF-36 and the Nottingham Health Profile (NHP) [7], in random order, to all consecutively admitted patients during the period of study. A modified version [29] of the Spitzer Quality of Life Index (SQLI) [11] was administered to the GRC and GDH multidisciplinary teams by the interviewer. Instruments that were routinely collected by the GRC/GDH staff using standardized protocols (BI [8], OARS-IADL [30], MMSE [31]) were extracted from health records. For a random subset of the GRC/GDH admissions, the SF-36 was readministered at baseline in order to assess retest reliability.
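The 0 to 100 transformation described above can be sketched roughly as follows. This is an illustration only, not the official SF-36 scoring algorithm: the function name, the assumption that all items in a subscale share one response range and are already recoded so that higher values mean better health, and the example responses are all hypothetical. Missing items (`None`) are replaced by the mean of the answered items when at least half of the subscale was answered, which is the rule the paper attributes to the scoring manual [15].

```python
from typing import Optional, Sequence


def score_subscale(
    responses: Sequence[Optional[float]],
    item_min: float,
    item_max: float,
) -> Optional[float]:
    """Return a 0-100 subscale score, or None if too many items are missing.

    Hypothetical setup: every item uses the same response range and has
    already been recoded so that a higher value means better health.
    """
    answered = [r for r in responses if r is not None]
    # Half-scale rule: score only if at least 50% of the items were answered.
    if not answered or len(answered) * 2 < len(responses):
        return None
    item_mean = sum(answered) / len(answered)
    # Substitute the mean of the answered items for each missing item.
    filled = [item_mean if r is None else r for r in responses]
    raw = sum(filled)
    lowest = item_min * len(responses)
    highest = item_max * len(responses)
    # Linear transformation of the raw sum onto the 0-100 scale.
    return 100.0 * (raw - lowest) / (highest - lowest)


# Ten hypothetical 3-point items (1 = limited a lot, 3 = not limited at all),
# with one unanswered item.
physical_function = [1, 1, 2, 1, None, 1, 2, 1, 1, 3]
print(score_subscale(physical_function, item_min=1, item_max=3))
```

A two-item subscale with one missing answer still satisfies the 50% rule in this sketch, so the single answered item would determine the whole score.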
Based on our earlier experience, we modified the SQLI so that it conforms to current terminology for use with frail elderly people [29]. These modifications were intended to reflect Lawton’s [4] conceptualization of quality of life for the frail elderly. Two domains were added to provide assessment of cognition and the individual’s usual environment, and the scoring was modified to incorporate the addition of these domains. The “Activity” and “Daily Living” domains were also changed to reflect the content of geriatric assessments for instrumental activities of daily living and activities of daily living, respectively.

Scoring the SF-36

The SF-36 was examined for violations of randomness of missing items, according to gender, age, marital status, cognitive status, years of education, and presenting problem,

using nonparametric (χ²) and parametric (t-tests, ANOVA) statistics. The eight SF-36 subscales were then scored using a standardized method that includes substitution of missing items with the mean value of non-missing items, provided that ≥50% of the subscale items were complete [15]. All statistical analyses were performed using the SPSS and SAS for Windows software packages.

Feasibility

We examined the proportion of individuals who were unable to complete the SF-36, the time to administer the questionnaire, and the proportion of missing data per item and per subscale. Data quality was assessed using the SF-36 Response Consistency Index [15]. The proportion of inconsistent responses was recorded for 15 item pairs (e.g., “walking one block” versus “walking more than one mile”; “felt calm and peaceful” versus “been a very nervous person”).

Scaling Properties and Score Distributions

The method of summated ratings is used to score the SF-36 subscales [15]. A valid scale score is computed by summing responses when assumptions of equivalent item means, equivalent variances, and an appreciable linear relationship between items and their corresponding scale scores are met. These scaling assumptions were tested using methods described by McHorney et al. [32]. Item means and variance estimates were visually examined for equality. An item analysis was performed to identify the frequency of substantial correlations (r ≥ 0.40) between each item and its corresponding subscale, after correction for overlap. The proportion of floor (lowest scores) and ceiling (highest scores) effects was then examined for each item and each subscale.

Reliability

Cronbach’s alpha was used to measure internal consistency, or the extent to which the SF-36 subscale items addressed the same attribute [33]. An intraclass correlation coefficient (ICC) was used to describe the retest performance of the tool [14].
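For readers unfamiliar with Cronbach’s alpha, a minimal sketch of the usual formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), is given below. The function name and the five-respondent, three-item data set are hypothetical, and the sketch assumes complete data; it is not the SPSS or SAS procedure used in the study.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete data.

    rows: one sequence per respondent, one value per item.
    """
    k = len(rows[0])                       # number of items in the subscale
    items = list(zip(*rows))               # column-wise item responses
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical responses from five people to a three-item subscale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Under the standards cited in the text, a coefficient between 0.80 and 0.89 would be acceptable for population surveys, and 0.90 would be the minimum for clinical application.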
Repeated measures ANOVA was used to estimate variance components for $\mathrm{ICC} = \sigma_s^2 / (\sigma_s^2 + \sigma_t^2 + \sigma_e^2)$ (s = subjects, t = test, e = error) [34]. Reliability coefficients between 0.80 and 0.89 were considered acceptable for population surveys [33], while 0.90 was the minimum accepted standard for clinical application of the tool [35].

Construct Validity

Factorial validity and correlational evidence were used to examine the extent to which the SF-36 measured the intended constructs of physical and mental well-being. This project duplicated the methods of McHorney et al. [36] (who studied general populations) and attempted to corroborate their observations in a frail elderly patient population.


Two factors were extracted from the eight SF-36 subscales using principal components analysis. An eigenvalue of ≥1 was identified as the criterion for retaining a factor. The contribution of the SF-36 subscales to a general health construct was evaluated by examining the proportion of variance explained by the first unrotated factor and the number of positive correlations exceeding 0.29. The two principal components were rotated using the varimax method to simplify interpretation of the factor values. Following rotation, the pattern of correlations among the eight subscales was examined according to predefined hypotheses [36].

A correlation matrix documented convergent and divergent validation of the SF-36 subscales with previously validated generic (NHP, SQLI) and specific (BI, OARS-IADL, MMSE) assessments of health status. Spearman correlations of 0.60 or more were identified as evidence of convergent validity between the SF-36 subscales and corresponding measures.
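The convergent validity criterion can be illustrated with a small, self-contained Spearman correlation. The sketch below ranks both variables (ties receive the mean of their ranks) and applies Pearson’s formula to the ranks. The function names and the six paired scores are hypothetical, and comparing the absolute value with 0.60 is an assumption made here because NHP scores run in the opposite direction to SF-36 scores.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1        # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired scores for six patients.
sf36_physical_function = [10, 25, 40, 55, 70, 85]
barthel_index = [45, 60, 55, 80, 90, 100]

rho = spearman(sf36_physical_function, barthel_index)
convergent = abs(rho) >= 0.60              # criterion stated in the text
print(round(rho, 3), convergent)
```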


Responsiveness

Responsiveness addresses the ability of an instrument to detect important change over time. The responsiveness of the SF-36 and each comparative measure, from admission to discharge, was assessed using effect size [37], standardized response mean [38], and relative efficiency [39] statistics. The effect size statistic estimates the magnitude of change by expressing the mean change score in terms of the variability of scores in stable situations (standard deviation at baseline). The standardized response mean also estimates the magnitude of change, but expresses change as the ratio of mean change from admission to discharge to the standard deviation of the change scores. Relative efficiency expresses the change score as a ratio of paired t-tests that compare the assessed instrument to a standard. A score of 1 indicates that the comparative instrument has the same efficiency as the standard, <1 indicates less efficiency, and >1 represents greater efficiency. The SQLI was arbitrarily selected as the standard for comparison in this study, given that there is no standard assessment of quality of life for the frail elderly.

RESULTS

Patients

Two hundred patients with first admissions were screened for the study. Eight persons (4.0%) refused to participate, 41 (20.5%) were ineligible, and five (2.5%) could not be contacted in time (four missed, one quarantined). The patients who completed the interview battery differed from those who did not complete the tests. The ineligible/refused group had fewer years of education, were less likely to reside in their own home, were more cognitively impaired, and were more limited in IADL prior to admission to these services. Admission data were available for 146 patients. At discharge, five individuals (3.4%) refused the interview, two (1.4%) left the services and could not be traced, five (3.4%) were transferred to other hospital units before data could be collected, one (0.7%) was quarantined, one person died (0.7%), and one left against medical advice (0.7%). In total, 131 patients completed both the admission and discharge assessments.

The majority of patients were female (64%), unmarried (66%), over 80 years of age (57%), cognitively intact (74%), and had attained 10 or more years of education (70%). The most common presenting problems were medical (57%), followed by trauma (hip or limb fractures) (29%), and elective orthopedic surgery (hip or knee replacements) (13%). Despite these problems, 52% of the patients rated their health as “good,” “very good,” or “excellent.” Subjects had on average 6.4 ± 2.1 presenting problems. Ninety-four percent of the cohort had an ADL or mobility limitation on admission, as assessed by the BI. Most subjects (96.6%) were limited in one or more admission IADLs. All patients were therefore considered frail.

Feasibility

The median time to complete the SF-36 was 20 minutes. Response consistencies could not be assessed for 34 (23.3%) item pairs due to missing data. For the remaining 112 item pairs, consistent responses were reported for 74.1% of patients, 21.4% committed one inconsistency, and 4.5% were inconsistent on two or more occasions. Most inconsistencies were confined to the SF-36 two-item Social Function scale (7.4%). The Physical Function questions comparing the ability to “walk one block” and “moderate activities” (5.7%) and the Vitality scale’s items related to “having a lot of energy” or “feeling worn out” (5.1%) also produced a number of inconsistencies.

On admission, 66% of the subjects completed all 36 items. In general, less than 5% of the responses per item were missing, and only one question, pertaining to “bathing and dressing,” was answered by everyone. There were no apparent violations in the assumption of randomness of missing data; therefore, SF-36 subscale scores were computed. Only Bodily Pain and Social Function, the two subscales composed of only two items, had more than 5% of the items missing (Table 1). The SF-36 method of replacing missing data was considered inappropriate for data imputation with these subscales.

Scaling Properties and Score Distributions

In Table 2, the ranges in individual item means and estimates of variability are presented for each SF-36 subscale. Item means and standard deviations appear uniform for the Role-Physical, Bodily Pain, Social Function, and Role-Emotional subscales. The item means and standard deviations are lowest for the Physical Function questions addressing “vigorous activities” and “walking more than a mile.” In contrast, the only activities of daily living item


(“bathing/dressing”) in the Physical Function subscale had the greatest distribution of item responses. Item-scale correlations within each subscale were similar and greater in magnitude than correlations with scales addressing other health concepts (Table 2). All item-scale correlations, corrected for overlap, met or exceeded the criterion of r ≥ 0.40 with their corresponding subscale, suggesting item convergent validity.

The distribution of total scores was skewed for five of the eight subscales (Table 1). Floor effects were common for the Physical Function (24.7%), Role-Physical (50%), Social Function (11.0%), and Role-Emotional (7.5%) scales, while ceiling effects were observed for the Role-Emotional (81.5%), Social Function (28.1%), Bodily Pain (17.1%), and Role-Physical (12.3%) scales.

Reliability

The majority of the homogeneity estimates (α) fell below acceptable standards for clinical application of the tool (Table 2). The lowest coefficients were observed for the Social Function (0.72) and General Health (0.75) subscales. Intermediate values (0.80–0.81) were obtained for the subscales addressing Vitality, Role-Physical, Bodily Pain, and Mental Health, while the highest estimates were observed for the Role-Emotional (0.89) and Physical Function (0.91) subscales. Only the latter subscale met minimum criteria (α ≥ 0.90) for reliability.

The SF-36 was readministered to 41 subjects (20 GDH, 21 GRC). The median time between tests was 6 days (range 2–21 days). None of the SF-36 subscales met the minimum standard (ICC ≥ 0.90) for retest reliability (Table 2).

Construct Validity

The first unrotated factor, described by the eight SF-36 subscales, met the criteria to support a general health construct (Table 3), with subscale correlations ranging from 0.42 (Role-Emotional) to 0.77 (Vitality). The rotated two-factor solution supported the interpretation that the instrument broadly measures mental and physical well-being, accounting for 34% and 19% of the total variance, respectively. Only the Bodily Pain, General Health, and Social Function subscales deviated from the hypothesized ranges in correlation. Bodily Pain did not exhibit a strong association with physical well-being, General Health was only weakly associated with physical health, and the Social Function subscale was moderately, instead of strongly, associated with mental

TABLE 1. Descriptive statistics for admission SF-36 subscales, missing items replaced

Subscale           Mean   Median  Range  SD     P10   P25   P50   P75    P90  Floor (%)  Ceiling (%)  Nonresponse (%)
Physical Function  23.43  15      0–90   25.67  0     3.75  15    35.16  65   24.7       0            0
Role-Physical      29.24  0       0–100  36.26  0     0     0     50     100  50.0       12.3         1.4
Bodily Pain        57.05  61      0–100  28.46  21.9  32    61    74     100  1.4        17.1         5.5
General Health     55.85  55      5–100  24.74  20    40    55    76     92   0          3.4          0.7
Vitality           41.40  40      0–100  23.78  10    25    40    55     74   2.1        1.4          0.7
Social Function    58.80  62.5    0–100  35.98  0     25    62.5  100    100  11.0       28.1         7.5
Role-Emotional     88.85  100     0–100  28.40  55    100   100   100    100  7.5        81.5         2.8
Mental Health      75.35  80      4–100  21.08  46.4  64    80    92     96   0          8.2          0.7

P10 to P90 = 10th to 90th percentiles.

TABLE 2. Summary of scaling properties, internal consistency, and test-retest reliability of the SF-36 subscales

Subscale           Range of item means  Range of item SDs  Convergent item-scale correlations (a)  Divergent item-scale correlations (b)  Internal consistency (α)  Test-retest reliability
Physical Function  1.15–2.03            0.39–0.83          0.56–0.82                               −0.25 to 0.43                          0.91                      0.76
Role-Physical      1.24–1.35            0.43–0.48          0.56–0.70                               −0.08 to 0.46                          0.81                      0.72
Bodily Pain        3.85–3.86            1.46–1.64          0.69                                    0.26 to 0.46                           0.81                      0.80
General Health     2.46–4.06            1.19–1.60          0.40–0.65                               −0.24 to 0.42                          0.75                      0.77
Vitality           2.31–4.13            1.46–1.59          0.53–0.67                               −0.23 to 0.42                          0.80                      0.61
Social Function    3.26–3.44            1.61–1.65          0.56                                    −0.10 to 0.45                          0.72                      0.64
Role-Emotional     1.88–1.90            0.30–0.33          0.74–0.81                               −0.10 to 0.29                          0.89                      0.24
Mental Health      4.19–5.34            1.21–1.63          0.53–0.70                               −0.23 to 0.49                          0.81                      0.76

(a) Correlations corrected for item overlap. (b) Correlations with alternate scales.
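The “corrected for item overlap” convention in Table 2 can be sketched as follows: each item is correlated with the sum of the other items in its subscale, so that the item does not contribute to the score it is compared against. The function names and data are hypothetical; this is a generic item analysis sketch, not the authors’ software.

```python
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_scale_correlations(rows: Sequence[Sequence[float]]) -> List[float]:
    """Correlate each item with the sum of the remaining items in the subscale."""
    n_items = len(rows[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in rows]
        # Scale score with the item itself removed (the overlap correction).
        rest = [sum(row) - row[i] for row in rows]
        result.append(pearson(item, rest))
    return result


# Hypothetical responses from five people to a three-item subscale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
corrected = corrected_item_scale_correlations(responses)
# Criterion used in the paper for item convergent validity: r >= 0.40.
print([round(r, 2) for r in corrected], all(r >= 0.40 for r in corrected))
```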


TABLE 3. Factorial validity of the SF-36 subscales

                   Hypothesized association (a)  Unrotated                 Rotated principal components
Subscale           Physical  Mental              Factor 1 correlation (r)  Physical correlation (r)  Mental correlation (r)  Communality (b)
Physical Function  +         −                   0.44                      0.80                      −0.10                   0.65
Role-Physical      +         −                   0.51                      0.81                      −0.02                   0.66
Bodily Pain        +         −                   0.62                      0.42                      0.46                    0.38
General Health     *         *                   0.57                      0.06                      0.69                    0.49
Vitality           *         *                   0.77                      0.39                      0.68                    0.62
Social Function    *         +                   0.67                      0.66                      0.32                    0.54
Role-Emotional     −         +                   0.42                      −0.07                     0.60                    0.31
Mental Health      −         +                   0.61                      0.01                      0.79                    0.63

(a) From McHorney et al. [36]. (b) Proportion of total variance explained by the two-factor solution. + Strong relationship (r ≥ 0.70). * Moderate relationship (0.30 < r < 0.70). − Weak relationship (r ≤ 0.30).
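Two quantities in Table 3 are simple to reproduce from the rotated correlations. The sketch below computes a communality as the sum of squared correlations with the two rotated components (the standard definition for an orthogonal solution, assumed here to be what the authors used) and labels a correlation with the strong, moderate, and weak cut-offs from the table footnote. The function names are hypothetical, and classifying on the absolute value is an assumption.

```python
from typing import Sequence


def communality(loadings: Sequence[float]) -> float:
    """Sum of squared correlations with the retained orthogonal components."""
    return sum(value * value for value in loadings)


def strength(r: float) -> str:
    """Label a correlation using the cut-offs in the Table 3 footnote."""
    size = abs(r)          # assumption: sign is ignored when judging strength
    if size >= 0.70:
        return "strong"
    if size > 0.30:
        return "moderate"
    return "weak"


# Physical Function row of Table 3: physical 0.80, mental -0.10.
print(round(communality((0.80, -0.10)), 2))

# Bodily Pain was hypothesized to relate strongly to the physical component,
# but its rotated physical correlation in Table 3 is 0.42.
print(strength(0.42))
```

The first value agrees with the communality reported for Physical Function, and the second illustrates why Bodily Pain is described as deviating from the hypothesized pattern.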

well-being. Overall, the proportion of total variance per subscale (communality), explained by the constructs of physical and mental well-being, ranged from 37% to 66%.

Table 4 presents a correlation matrix between each SF-36 subscale and the comparative measures. Negative correlations with the NHP were an artifact of scoring, in which lower NHP scores are associated with better health states. In comparison, high SF-36, SQLI, BI, and OARS-IADL scores represent better health status. Bivariate correlations exceeded the criterion (r ≥ 0.60)

TABLE 4. Multitrait-multimethod correlation matrix to assess convergent (grey shaded) and divergent validity of the SF-36 subscales

Instrument rows, grouped by SF-36 subscale: Physical Function (SF-36, BI, OARS-IADL, SQLI-ADL, SQLI-IADL, NHP-Mobility a); Role-Physical (SF-36); Bodily Pain (SF-36, NHP-Pain a); General Health (SF-36, SQLI-Health); Vitality (SF-36, NHP-Energy a); Social Function (SF-36, SQLI-Support, NHP-Social Isolation a); Role-Emotional (SF-36); Mental Health (SF-36, SQLI-Outlook, NHP-Emotional a, MMSE).

PF

RP

BP

GH

VT

SF

0.22 ⫺0.33

0.25 ⫺0.11

⫺0.60

0.10 ⫺0.09

0.12 ⫺0.08

0.26 0.19

0.27

0.34 ⫺0.37

0.23 ⫺0.19

0.43 ⫺0.29

0.41 ⫺0.41

⫺0.58

0.32 ⫺0.17 ⫺0.20

0.43 ⫺0.06 ⫺0.09

0.30 ⫺0.01 ⫺0.09

0.07 0.09 ⫺0.29

0.41 ⫺0.02 ⫺0.28

0.02 ⫺0.25

⫺0.13

⫺0.01

0.21

0.22

0.19

0.10

0.07 ⫺0.06 ⫺0.08 0.00

0.09 ⫺0.07 0.02 0.01

0.19 0.09 ⫺0.19 ⫺0.20

0.46 0.42 ⫺0.41 0.00

0.50 0.24 ⫺0.34 ⫺0.06

0.33 0.05 ⫺0.22 ⫺0.03

RE

MH

0.26 0.07 ⫺0.23 0.02

0.45 ⫺0.63 0.00

0.69 0.61 0.53 0.60 ⫺0.76 0.48

a SF-36 high scores and NHP low scores are equated with better health. Abbreviations: PF = Physical Function; RP = Role-Physical; BP = Bodily Pain; GH = General Health; VT = Vitality; SF = Social Function; RE = Role-Emotional; MH = Mental Health.


TABLE 5. Responsiveness of the SF-36 subscales and comparative instruments

Instrument                Admission (mean ± SD)  Discharge (mean ± SD)  ES     SRM    RE
SF-36
  Physical Function **    22.81 ± 25.40          28.46 ± 23.40          0.22   0.24   0.10
  Role-Physical *         27.71 ± 35.44          35.25 ± 35.31          0.21   0.20   0.07
  Bodily Pain **          57.92 ± 28.88          66.71 ± 28.20          0.30   0.31   0.14
  General Health          56.73 ± 24.64          58.18 ± 24.38          0.10   0.10   0.01
  Vitality *              42.37 ± 23.40          48.40 ± 24.42          0.26   0.25   0.10
  Social Function *       58.85 ± 35.58          71.46 ± 30.96          0.35   0.31   0.13
  Role-Emotional          89.92 ± 27.37          91.13 ± 25.19          0.00   0.00   0.03
  Mental Health †         76.31 ± 20.38          78.91 ± 18.85          0.13   0.14   0.03
NHP
  Physical Mobility *     45.56 ± 23.85          39.41 ± 24.66          −0.30  −0.30  0.09
  Pain                    23.03 ± 26.18          20.58 ± 23.97          0.00   −0.10  0.01
  Energy *                39.03 ± 33.72          29.13 ± 30.63          −0.30  −0.30  0.12
  Social Isolation †      19.99 ± 22.24          16.18 ± 21.56          −0.20  −0.20  0.06
  Emotional Reactions **  17.86 ± 22.85          9.57 ± 16.47           −0.40  −0.50  0.26
SQLI **                   9.18 ± 1.90            10.56 ± 2.09           0.73   0.78   1.00
BI **                     64.08 ± 22.12          79.22 ± 16.52          0.68   0.94   1.46
OARS-IADL **              5.00 ± 3.58            7.20 ± 2.99            0.61   0.74   0.91
MMSE                      25.73 ± 3.35           26.10 ± 3.49           0.11   0.12   0.02

Abbreviations: ES = effect size; SRM = standardized response mean; RE = relative efficiency. Significant difference between admission and discharge scores: * P ≤ .001, ** P ≤ .01 (non-parametric), † P ≤ .05.
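The three responsiveness statistics in Table 5 can be sketched from their verbal definitions in the Methods. The function names below are hypothetical. The effect size example uses the Physical Function means and baseline SD from Table 5; the change scores passed to the standardized response mean are invented for illustration, since patient-level data are not reported. Relative efficiency is written here as the squared ratio of paired t statistics, which is an assumption on our part; the paper says only “a ratio of paired t-tests,” although the squared form appears consistent with the tabulated values.

```python
from statistics import mean, stdev
from typing import Sequence


def effect_size(mean_admission: float, mean_discharge: float,
                sd_admission: float) -> float:
    """Mean change divided by the standard deviation at baseline."""
    return (mean_discharge - mean_admission) / sd_admission


def standardized_response_mean(changes: Sequence[float]) -> float:
    """Mean change divided by the standard deviation of the change scores."""
    return mean(changes) / stdev(changes)


def relative_efficiency(t_instrument: float, t_standard: float) -> float:
    """Squared ratio of paired t statistics (assumed form, see lead-in)."""
    return (t_instrument / t_standard) ** 2


# SF-36 Physical Function, using the Table 5 means and admission SD.
print(round(effect_size(22.81, 28.46, 25.40), 2))

# Hypothetical discharge-minus-admission change scores for five patients.
print(round(standardized_response_mean([5, 10, 0, -5, 15]), 2))

# With the same patients in both comparisons, a paired t statistic equals
# SRM * sqrt(n), so the ratio of t statistics equals the ratio of SRMs.
# BI (SRM 0.94) against the SQLI standard (SRM 0.78):
print(round(relative_efficiency(0.94, 0.78), 2))
```

Because the tabulated SRMs are rounded and the instruments may have slightly different numbers of paired observations, the last value only approximates the RE of 1.46 reported for the BI.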

for confirmation of convergent construct validity for the SF-36 Physical Function, Bodily Pain, and Mental Health subscales and instruments that measured similar health constructs. The SF-36 Vitality subscale correlated closely with the NHP Energy scale (−0.58). Convergent validation was not observed for the SF-36 Social Function or General Health scales and was not assessed for the SF-36 Role-Physical and Role-Emotional subscales due to lack of comparative assessments.

Responsiveness

Statistically significant changes in mean scores were documented for all scales except the SF-36 General Health and Role-Emotional subscales, the NHP Pain subscale, and the MMSE (Table 5). Nevertheless, all eight SF-36 subscales were less responsive than specific health assessments using statistical evaluation of the magnitude of change (effect size, standardized response mean). The SQLI (relative efficiency standard: mean = 9.03 ± 1.92, median = 9.0, range 4.0–13.0), BI, and IADL measures were consistently more responsive, using all three indicators, than any subscale of the multi-item SF-36 or the NHP.

DISCUSSION

This study examined the measurement properties of the SF-36 for frail elderly individuals admitted to an inpatient rehabilitative service and outpatient hospital program. Although the measurement properties of the SF-36 are well documented in general populations [19,32,36,40], this is the

first study to evaluate comprehensively the performance of the instrument in a cohort of frail elderly people.

Recruitment of a well-defined cohort of frail elderly persons represents an important strength of this study. Previous reports described the SF-36’s measurement properties in persons over 65 years of age [25,26,32], but it is not clear that these studies incorporated frail persons. McHorney et al. [32,36] presented abundant data detailing psychometric properties of the SF-36 for elderly persons by age group, diagnosis, and disease severity, but these subcategories did not overlap. One cannot determine if persons with complicated medical diagnoses or specific health conditions are young or old, or if the subgroups of elderly are well or ill.

Only face-to-face interviews were employed, which represents both a strength and a weakness of this project. This method of administration ensured that persons who could not otherwise complete a self-report version were able to participate. Missing data are known to be higher with self-report questionnaires, and specifically with use of the SF-36 in the elderly [41]. Interviewer-administered questionnaires have the advantage of improving completion rates, but compromise generalizability of the study results by excluding persons with severe hearing impairment.

Administration of the SF-36 was feasible in this population, with a noted exception. The median time to complete the instrument was longer than, but within the range of, previous reports [17,24,26]. The proportion of missing items was slightly lower than in a comparable Medical Outcomes Study (MOS) using self-report [32], which is probably attributable to our use of face-to-face interviews and the acute version.


Missing data were substantially reduced using the SF-36’s ascribed imputation methods, although missing items for the scales incorporating only two items, Bodily Pain and Social Function, were not imputed. Failure to impute these items likely explains the higher missing data rates for these two scales compared to the MOS report [32].

Finally, no published reports are available to describe response inconsistencies in the elderly. Approximately one in four of our patients gave one or more inconsistent responses. This estimate is likely low, given the large proportion of item pairs (23%) that could not be examined due to missing responses. In comparison to published norms for the MOS (5.5%) and U.S. general populations (9.7%) [15], our estimates are high. Discrepancies could be interpreted as misunderstanding the questions or, more likely for hospitalized frail patients, as an inability to discriminate among presented items and judge current ability. Replication of the response inconsistency data is required. Future research should determine if the observed inconsistencies are limited to frail populations in rehabilitative settings or whether this observation generalizes to other populations. Future research should also document the impact of inconsistent replies on the SF-36 scoring algorithm and assess the subsequent effect of inconsistent responses on interpretation of the health profiles.

Lack of variability of item responses is evident for many of the SF-36 items in this population. A compounding effect of reduced item variability is poor discriminative ability of the corresponding scale scores. Floor and ceiling effects predominate for the physical and mental functioning scales, but this is not a new observation in the elderly [32,42]. Despite problems with the distribution of some SF-36 items, all items correlated higher with hypothesized scales than with alternate subscales.
This finding counters previous observations that scaling success decreases with advancing age and disease severity [32].

The internal consistency coefficients observed for each SF-36 subscale were within the ranges reported for other studies incorporating the elderly [25,26,32,43]. Internal consistency estimates met criteria for discriminating among population subgroups and lend credence to the use of the SF-36 as a tool for population surveys and program evaluation. The homogeneity of the measure did not, however, reach internal consistency values that would be considered essential for clinical application of the instrument with frail elderly persons [43].

Limited assessment of retest reliability of the self-report [19] and interview format [44] suggests that most SF-36 subscales are reasonably stable. In this study, none of the SF-36 subscales met minimum standards for retest reliability when applied in a clinical setting. Readministration of the SF-36 over an extended time frame (2 to 21 days) may account for the lack of reliability. The extended period of data collection reflects the logistic difficulties of conducting studies with frail elderly people who may not return for outpatient visits on their scheduled dates.


Assessment of factorial validity of the SF-36 produced results that are comparable to the MOS general population for the Physical Function, Role-Physical, Vitality, and Mental Health subscales [36]. In contrast, the Bodily Pain, General Health, Social Function, and Role-Emotional scales deviated from the predefined hypotheses. The few discrepancies between the factor structure of the MOS data and this report could be attributed to genuine differences in the interpretation of the SF-36 questions between younger populations and the frail elderly and/or the application of the acute instrument in a rehabilitative setting.

Poor convergent validation of the SF-36 Social Function subscale may reflect conceptual differences between the SF-36 and the NHP. The SF-36 questions pertain to the amount of time and the extent that health problems have interfered with normal social activities, while the NHP emphasizes the subject’s perception of distress with limited social activities. Frail elderly people may correctly report that their social network is restricted for the period of hospitalization, but this observation may not be reflected in psychological distress for these individuals.

In this study, statistically significant changes in health status were observed for six of the SF-36 subscales between hospital admission and discharge. Yet assessment of the magnitude of change produced nearly identical results across all response measures, suggesting that the SF-36 cannot capture clinically important changes in health status for frail elderly persons in a rehabilitative setting. Indeed, both subject-centered measures of HRQL (SF-36, NHP) were less effective in documenting health status change than professional reports (modified SQLI, BI, OARS-IADL). Professional assessment of health status is known to differ from patient reports [11,45]; however, this observation cannot exclusively account for the poor performance of the SF-36 in documenting health status change.
Statistically significant change did occur. This change may represent regression to the mean for subscales with responses clustered at extreme values. Alternatively, clinically important change may not have been captured by the SF-36 because of floor effects. For example, individuals may have made significant gains in activities of daily living (e.g., grooming, transfers) that were not addressed in the questionnaire. The responsiveness scores from the physical performance measures that are specific to tasks relevant to the frail (BI, OARS-IADL) support this alternative explanation. To date, selected SF-36 subscales have demonstrated responsiveness in several surgical studies [46,47], but the results from these studies cannot be generalized to the frail elderly. Future research should follow frail persons into the community to determine if the SF-36 is capable of documenting health status for broader program evaluation.

CONCLUSION

This study set out to address two assertions: that the SF-36 represents the “optimum outcome measure” [27] and that the instrument displays superior measurement properties for subsequent application with the frail elderly. The results of this study do not support these broad claims. Although the SF-36 can be administered to frail elderly people in a timely fashion and with limited missing data, a substantial proportion of the responses to items are inconsistent. It is not clear, at this time, what significance or impact these item inconsistencies have for the interpretation of the questionnaire. These inconsistencies do not appear to affect the scaling properties or validity of the tool, but reliability and responsiveness may be compromised. Is lack of responsiveness in part a function of inconsistent responses? Or is unresponsiveness merely a consequence of omitting health issues that are relevant to the frail? Experts note that the SF-36 fails to tap important health-promoting and health-maintaining issues for the elderly [13,43]. Their recommendation, to supplement the SF-36 with outcome measures that address health issues specific to frail persons, is reiterated in this report. This recommendation does not negate use of the SF-36 in the frail elderly; it simply points to the conclusion that health outcome selection should be targeted to the population of interest, the purpose of the inquiry, and the context of application.

The SF-36’s limitations in measurement properties and application are a consequence of the theoretical foundation of the tool [13,15]. The SF-36 was developed as a generic assessment of important health constructs, with application in “. . . clinical practice and research, health policy evaluations, and general population surveys” [13], an ambitious undertaking for any generic outcome measure. Generic instruments are known to be insensitive to clinically important change and are often restricted to use in population studies or evaluation of health policy [48].
Future investigations of the SF-36 in community populations of frail elderly may support the adoption of the SF-36 as the primary outcome for health policy or program evaluation; however, current information suggests that in clinical applications the tool should be used in conjunction with other outcome measures that are developed specifically for frail elderly people who become acutely ill.

This study was supported with a grant from the Camp Hill Medical Centre Research Fund. The National Health Research Development Program (NHRDP) supported this research through a National Health Scholar award to Kenneth Rockwood and through a National Health graduate award to Karen Stadnyk.

References
1. Jarrett PG, Rockwood K, Carver D, Stolee P, Cosway S. Illness presentation in elderly patients. Arch Intern Med 1995; 155: 1060–1064.
2. Buchner DM, Wagner EH. Preventing frail health. Clin Geriatr Med 1992; 8: 1–17.
3. Rockwood KR, Fox RA, Stolee P, Robertson D, Beattie BL. Frailty in elderly people: An evolving concept. Can Med Assoc J 1994; 150: 489–495.

4. Lawton MP. A multidimensional view of quality of life in frail elders. In: Birren JE, Lubben JE, Rowe JC, Deutchman DE, Eds. The Concept and Measurement of Quality of Life in the Frail Elderly. San Diego, CA: Academic Press; 1991: 3–27.
5. Arnold SB. Measurement of quality of life in the frail elderly. In: Birren JE, Lubben JE, Rowe JC, Deutchman DE, Eds. The Concept and Measurement of Quality of Life in the Frail Elderly. San Diego, CA: Academic Press; 1991: 50–73.
6. Bergner M, Bobbitt RA, Carter WB, Gilson BS. The Sickness Impact Profile: Development and final revision of a health status measure. Med Care 1981; 19: 787–805.
7. Hunt SM, McEwan J, McKenna SP. Measuring Health Status. London: Croom Helm; 1986.
8. Granger CV, Albrecht GL, Hamilton BB. Outcome of comprehensive medical rehabilitation: Measurement by PULSES Profile and the Barthel Index. Arch Phys Med Rehabil 1979; 60: 145–154.
9. Fillenbaum GG, Smyer MA. The development, validity, and reliability of the OARS multidimensional functional assessment questionnaire. J Gerontol 1981; 36: 428–434.
10. Melzack R. The McGill Pain Questionnaire: Major properties and scoring methods. Pain 1975; 1: 277–299.
11. Spitzer WO, Dobson AJ, Hall J, Chesterman E, Levi J, Shepherd R, et al. Measuring the quality of life of cancer patients: A concise QL-index for use by physicians. J Chronic Dis 1981; 34: 585–597.
12. Parfrey PS, Vavasour H, Bullock M, Henery S, Harnett JD, Gault MH. Development of a health questionnaire specific for end-stage renal disease. Nephron 1989; 52: 20–28.
13. Ware JEJ, Sherbourne CD. The MOS 36-item short form health survey (SF-36). I. Conceptual framework and item selection. Med Care 1992; 30: 473–483.
14. McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. New York: Oxford University Press; 1996.
15. Ware JEJ, Snow KK, Kosinski M, Gandek B. SF-36 Health Survey Manual and Interpretation Guide. Boston, MA: Nimrod Press; 1993.
16. Jenkinson C, Coulter A, Wright L. Short form 36 (SF36) health survey questionnaire: Normative data for adults of working age. Br Med J 1993; 306: 1437–1440.
17. Kurtin PS, Davies AR, Meyer KB, Degiacomo JM, Kantz ME. Patient-based health status measures in outpatient dialysis. Med Care 1992; 30: MS136–MS149.
18. Nerenz DR, Repasky DP, Whitehouse FW, Kahkonen DM. Ongoing assessment of health status in patients with diabetes mellitus. Med Care 1992; 30: MS112–MS124.
19. Brazier JE, Harper R, Jones NMB, O’Cathain A, Thomas KJ, Usherwood T, Westlake L. Validating the SF-36 health survey questionnaire: New outcome measure for primary care. Br Med J 1992; 305: 160–164.
20. Kantz ME, Harris WJ, Levitsky K, Ware JEJ, Davies AR. Methods for assessing condition-specific and generic functional status outcomes after total knee replacement. Med Care 1992; 30: MS240–MS252.
21. Mahler DA, Mackowiak JI. Evaluation of the short-form 36-item questionnaire to measure health-related quality of life in patients with COPD. Chest 1995; 107: 1585–1589.
22. Gliklich RE, Hilinski JM. Longitudinal sensitivity of generic and specific health measures in chronic sinusitis. Qual Life Res 1995; 4: 27–32.
23. Reuben D, Valle L, Hays R, Siu A. Measuring physical function in community-dwelling older persons: A comparison of self-administered, interview-administered, and performance-based measures. J Am Geriatr Soc 1995; 43: 17–23.

24. Weinberger M, Samsa GP, Hanlon JT, Schmader K, Doyle ME, Cowper PA, Uttech KM, et al. An evaluation of a brief health status measure in elderly veterans. J Am Geriatr Soc 1991; 39: 691–694.
25. Lyons RA, Perry HM, Littlepage BNC. Evidence for the validity of the Short-Form 36 Questionnaire (SF-36) in an elderly population. Age Ageing 1994; 23: 182–184.
26. Weinberger M, Nagle B, Hanlon JT, Samsa GP, Schmader K, Landsman PB, et al. Assessing health-related quality of life in elderly outpatients: Telephone versus face-to-face administration. J Am Geriatr Soc 1994; 42: 1295–1299.
27. Ware JE. Measuring patients’ views: The optimum outcome measure. SF 36: A valid, reliable assessment of health from the patient’s point of view. Br Med J 1993; 306: 1429–1430.
28. Folstein MF, Folstein SE, McHugh PR. “Mini-Mental State”: A practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res 1975; 12: 189–198.
29. Stolee P, Stadnyk K, Myers AM, Rockwood K. An individualized approach to outcome measurement in geriatric rehabilitation. Gerontologist 1996; 36: 97. (Abstract)
30. Fillenbaum GG. Multidimensional Functional Assessment of Older Adults: The Duke Older Americans Resources and Services Procedures. Hillsdale, NJ: Lawrence Erlbaum Assoc.; 1988.
31. Molloy DW, Alemayehu E, Roberts R. Reliability of a standardized Mini-Mental State Examination compared with the traditional Mini-Mental State Examination. Am J Psychiatr 1991; 148: 102–105.
32. McHorney CA, Ware JEJ, Lu JFR, Sherbourne CD. The MOS 36-item short-form health survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994; 32: 40–66.
33. Nunnally JC, Bernstein IH. Psychometric Theory. New York: McGraw-Hill; 1994.
34. Streiner DL, Norman GR. Health Measurement Scales. A Practical Guide to their Development and Use, 2nd Edition. Oxford: Oxford University Press; 1995.
35. McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: Are available health status surveys adequate? Qual Life Res 1995; 4: 293–307.
36. McHorney CA, Ware JEJ, Raczek AE. The MOS 36-item short-form health survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993; 31: 247–263.
37. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989; 27: S178–S189.
38. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care 1990; 28: 632–642.
39. Liang MH, Larson MG, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. 1985; 28: 542–547.
40. Garratt AM, Ruta DA, Abdalla MI, Buckingham JK, Russell IT. The SF36 health survey questionnaire: An outcome measure suitable for routine use within the NHS? Br Med J 1993; 306: 1440–1444.
41. Hayes V, Morris J, Wolfe C, Morgan M. The SF-36 health survey questionnaire: Is it suitable for use with older adults? Age Ageing 1995; 24: 120–125.
42. Bindman AB, Keane D, Lurie N. Measuring health changes among severely ill patients. The floor phenomenon. Med Care 1990; 28: 1142–1152.
43. McHorney CA. Measuring and monitoring general health status in elderly persons: Practical and methodological issues in using the SF-36 health survey. Gerontologist 1996; 36: 571–583.
44. O’Brien B, Viramontes JL. Willingness to pay: A valid and reliable measure of health state preference? Med Decis Making 1994; 14: 289–297.
45. Slevin ML, Plant H, Lynch D, Drinkwater J, Gregory WM. Who should measure quality of life, the doctor or the patient? Br J Cancer 1988; 57: 109–112.
46. Katz JN, Larson MG, Phillips CB, Fossel AH, Liang MH. Comparative measurement sensitivity of short and longer health status instruments. Med Care 1992; 30: 917–925.
47. Patrick DL, Deyo RA, Atlas SJ, Singer DE, Chapin A, Keller RB. Assessing health-related quality of life in patients with sciatica. 1995; 20: 1899–1909.
48. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Med Care 1989; 27: S217–S232.