Is the Berg Balance Scale an Internally Valid and Reliable Measure of Balance Across Different Etiologies in Neurorehabilitation? A Revisited Rasch Analysis Study


ORIGINAL ARTICLE

Fabio La Porta, MD, Serena Caselli, PT, Sonia Susassi, PT, Paola Cavallini, PT, Alan Tennant, PhD, Marco Franceschini, MD

ABSTRACT. La Porta F, Caselli S, Susassi S, Cavallini P, Tennant A, Franceschini M. Is the Berg Balance Scale an internally valid and reliable measure of balance across different etiologies in neurorehabilitation? A revisited Rasch analysis study. Arch Phys Med Rehabil 2012;93:1209-16.

Objectives: To assess, within the context of Rasch analysis, (1) the internal validity and reliability of the Berg Balance Scale (BBS) in a sample of rehabilitation patients with varied balance abilities; and (2) the comparability of the BBS measures across different neurologic diseases.
Design: Observational prospective study.
Setting: Rehabilitation ward of an Italian district hospital.
Participants: Consecutively admitted inpatients and outpatients (N=217); for 85 participants, data were collected both on admission and discharge, giving a total sample of 302 observations.
Intervention: Not applicable.
Main Outcome Measure: BBS.
Results: Most of the BBS items had to be rescored, and 2 items (static sitting and standing balance) had to be deleted, to attain adequate internal construct validity (χ²(24)=35.68; P=.059). The reliability of the Rasch-modified BBS (BBS-12) (total score, 0–35) was high (.957), indicating precision of measurement at the individual level. The analysis of differential item functioning (DIF) showed invariance of the item calibrations across patients’ sex, age, and etiology. After adjusting for the possible effect of repeated measurements on person estimates, the analysis of DIF by timing of assessment confirmed the stability of the item hierarchy across time. A practical ruler was provided to convert item raw scores into Rasch estimates of balance ability.
Conclusions: This study supports the internal validity and reliability of the BBS-12 as a measurement tool independent of the etiology of the neurologic disease causing the balance impairment. In view of some sample-related issues and the fact that not all possible etiologies encountered in neurorehabilitation settings were tested, a larger multicenter study is warranted to confirm these findings.
Key Words: Neurologic disorders; Outcome measures; Postural balance; Psychometrics; Rehabilitation.

© 2012 by the American Congress of Rehabilitation Medicine

From the Rehabilitation Medicine Unit, Azienda Unità Sanitaria Locale Modena, Modena, Italy (La Porta, Caselli, Susassi, Cavallini); PhD School in Advanced Sciences in Rehabilitation Medicine and Sports, Tor Vergata University, Rome, Italy (La Porta); Department of Rehabilitation Medicine, Faculty of Medicine and Health, University of Leeds, Leeds, United Kingdom (Tennant); and Neuro-Rehabilitation Units, Istituto di Ricovero e Cura a carattere Scientifico San Raffaele Pisana, Rome, Italy (Franceschini). Preliminary study results presented as an oral presentation to the 7th Mediterranean Congress of Physical and Rehabilitation Medicine, September 18–21, 2008, Portorož, Slovenia. No commercial party having a direct financial interest in the results of the research supporting this article has or will confer a benefit on the authors or on any organization with which the authors are associated. Reprint requests to Fabio La Porta, MD, Unità Operativa di Medicina Riabilitativa, Nuovo Ospedale Civile “S. Agostino Estense,” Via Giardini 1355 – 41100 – Modena, Italy, e-mail: [email protected]. In-press corrected proof published online on Apr 21, 2012, at www.archives-pmr.org. 0003-9993/12/9307-01176$36.00/0 doi:10.1016/j.apmr.2012.02.020

With more than 480 MEDLINE titles, abstract citations, or both as of December 2011, the Berg Balance Scale (BBS)1 is the most widely used clinical balance assessment tool.2 Although the BBS was originally proposed for the assessment of balance in the elderly,1 it has been used extensively in a variety of neurologic conditions such as stroke,3 Parkinson’s disease,4 multiple sclerosis,5 traumatic brain injury,6 and peripheral neuropathies.7 The psychometric properties of the BBS have been extensively examined, producing convincing evidence for its reliability and external validity under the classical test theory (CTT) framework.2 However, new psychometric methods have been introduced that supplement the validity and reliability data provided by the CTT approach.8 Among these new methods, the Rasch model is particularly useful in view of its operationalization of the formal axioms of additive conjoint measurement.9 Fitting data to the Rasch model, a process known as Rasch analysis, is an iterative analytical process that examines 3 key aspects of conformity to the measurement criteria of the Rasch model8: local independence, unidimensionality, and invariance. Further aspects that are also routinely evaluated8 are the rating scale properties (ie, the correct ordering of response categories), the targeting of persons to items, and, finally, the reliability of the scale. When data adequately fit the Rasch model, the scale’s total score can be transformed into linear measures of ability that allow correct interpretation of change scores and proper access to parametric statistics, as required in clinical trials.10 Furthermore, the demonstration of invariance across relevant key groups, such as age, sex, etiology, or time (ie, absence of differential item functioning [DIF]), ensures comparability of scores across subjects and samples.11 The

List of Abbreviations

BBS: Berg Balance Scale
BBS-12: Rasch-modified Berg Balance Scale
BCI: binomial confidence interval for proportions
CFA: confirmatory factor analysis
CTT: classical test theory
DIF: differential item functioning
PSI: person separation index
PST: proportion of significant t test

Arch Phys Med Rehabil Vol 93, July 2012


RASCH ANALYSIS OF THE BERG BALANCE SCALE, La Porta

possibility of testing the invariance across different etiologies is particularly interesting, given the widespread use of the scale in different neurorehabilitation settings. To date, there is only 1 published study12 in which the BBS was evaluated using Rasch analysis. This study focused mainly on the rating scale properties of the BBS, demonstrating that all 14 BBS items had disordered thresholds. However, this study had 3 main shortcomings: (1) it did not report on the other measurement requirements of the Rasch model, as outlined above; (2) the sample consisted of outpatient male veterans with balance problems, whose average ability was not reported; and (3) the sample was small (100 persons). Thus, the current study had 2 main goals: (1) to fully appraise, using Rasch analysis, the internal validity and reliability of the BBS in a sample of rehabilitation patients with varied balance abilities; and (2) to examine the presence of DIF by etiology, thus providing information about the comparability of BBS scores across different neurologic diseases.

METHODS

Participants, Setting, and Instruments

Full details of participants and setting are given elsewhere.13 In brief, data were collected within a rehabilitation unit in an Italian general hospital from April 2007 to June 2009. All patients with a neurologic disease requiring rehabilitation who were admitted to the unit as inpatients or outpatients were included in the study. For inclusion, patients needed to be able to sit unsupported for 30 seconds without using the upper limbs or to participate, even minimally, in transfers. Where possible, inpatients were assessed twice (on admission or as soon as the inclusion criteria were satisfied, and on discharge from the unit). The assessments were carried out with the BBS,1,14 a 14-item summative ordinal scale evaluating postural changes from sitting to standing and vice versa, transfers, sitting balance, and a variety of other standing balance tasks. Each item provides a score ranging from 0 to 4, where 0 implies the absence of balance ability and 4 the best possible performance in the observed activity. Thus, the total score indicates increasing balance ability from the lowest (0) to the highest score (56). As this study was part of a project aiming at building item banks for balance and mobility, the BBS was administered within a larger protocol including other instruments, as outlined elsewhere.13,15 All patients gave their informed consent to take part in this study, which was undertaken in compliance with the ethical principles set forth in the Helsinki Declaration.16

Statistical Analyses

We carried out the psychometric analyses of the BBS using both classical methods, for the preliminary assessment of unidimensionality, and Rasch analysis. Because the data collected included both admission and discharge assessments for some patients, a further aspect to consider is how we dealt with repeated measures.

Preliminary assessment of unidimensionality. As the Rasch model is a unidimensional measurement model, it is based on the assumption that all items measure a single underlying dimension.17 As a consequence, it is sometimes useful to test whether the scale items are acceptably unidimensional before the Rasch analysis.17,18 In this study, we did so by using a variety of classical methods, including the analysis of total score–item correlations and of internal consistency reliability, as well as a confirmatory factor analysis (CFA), as detailed elsewhere.13 Within the context of CFA, we also examined the

strength of the relationship between each BBS item and the latent variable being measured (ie, balance) in terms of standardized factor loadings (range, 0–1), where the lower the factor loading for a given item, the higher the likelihood that the item measures something else. We adjusted for local dependency of items (which is a common finding in clinical outcome scales) by allowing correlations of error terms. Should these methods suggest multidimensionality of the item set, we would perform an exploratory factor analysis to explore the dimensionality of the BBS.17

Rasch analysis. After the above analyses, data from the identified unidimensional item set were fitted to the Rasch model.19 This mathematical model postulates that a subject with a certain ability on the latent variable (ie, balance) is expected to affirm (pass) items representing tasks associated with less ability, and not affirm (fail) items representing a higher level, in this case, of balance. The process of iteratively testing whether the data meet the requirements of the Rasch model is widely known as Rasch analysis (readers may want to refer to the article by Tennant and Conaghan8 for a didactic introduction to the subject). Within the context of Rasch analysis, here based on the partial credit parameterization of the model, we tested the following assumptions:

● Local independence, which assumes that no significant association among item responses should be found once the dominant factor (balance) influencing a person’s response to those items has been conditioned out.20 This important assumption was tested by an examination of the residual correlations for each pair of items, where values above 0.3 indicated local dependence within the item pair.19
● The stochastic ordering of the items, which was tested by fit to the model. We considered it achieved when (1) the summary chi-square interaction statistic was nonsignificant, showing no deviation from model expectation; (2) item and person summary fit statistics approached a mean of zero and an SD of 1; (3) individual items showed nonsignificant chi-square fit statistics (Bonferroni adjusted); and (4) individual item and person residuals were within the range of ±2.5, which represents the 99% confidence interval.
● Unidimensionality, which was tested with a t test on the estimates for each respondent derived from the residuals of item sets that loaded, respectively, positively (>0.3) and negatively (<−0.3) on the first component of the principal component analysis of residuals.21 To achieve strict unidimensionality, the proportion of significant t tests (PSTs) should be less than 5%, or the lower bound of the binomial confidence interval for proportions (BCI) should be below 5%.21

Where these assumptions failed, we undertook an iterative phase of item modifications aimed at finding a solution that satisfied both the model expectations and assumptions, followed by a reassessment of the model fit after each cycle of modification, as described in detail elsewhere.13,19,22,23 Finally, if data satisfy the Rasch model expectations, the raw score derived from the scale can be transformed to an interval scale, whose unit of measurement is the logit.19

Dealing with repeated measures. We conditioned out any possible dependency between persons’ measures collected at different time points24 by following the procedure proposed by Mallinson.25 Specifically, for each patient who had repeated observations, we randomly selected either the admission or the discharge assessment. These unique observations, along with the unique observations from patients assessed only once, were


subjected to the preliminary assessment of unidimensionality and to a first Rasch analysis. After obtaining a fitting solution, we anchored these item difficulty estimates to the whole observation sample that included both admission and discharge assessments. Should the anchoring procedure demonstrate adequate fit to the Rasch model also for the whole observation sample (indicative of lack of person dependency across time points), we would assess the stability of the item hierarchy across time points by means of DIF by assessment timing for each individual item.26-28

Statistical Notes, Software, and Sample Size Issues

We used SPSSa to perform descriptive statistics, whereas we used the Mplus softwareb to carry out factor analyses for categorical data, on complete data only. We estimated that 220 observations (ratio of subjects to items, 16:1) would be a suitable sample for these analyses.29 We performed the Rasch analyses using the RUMM2030 software.c Within the context of Rasch analysis, a sample size of 300 observations would be sufficient to estimate item difficulty, with α of .01, to within ±0.3 logits, irrespective of the targeting of persons to the items.30 A significance value of .05 was used throughout and adjusted for the number of tests by Bonferroni correction.31

RESULTS

Participants Recruited and Procedures

All observations were collected by 4 raters on a convenience sample of 217 patients. The mean age ± SD of the patients was 59.5±16.3 years, and 60.8% were men. Ischemic stroke was the most common etiology (48.8% of cases), followed by intracerebral hemorrhage (18.0%) and traumatic brain injury (11.1%). A range of other rarer etiologies included subarachnoid hemorrhage (6.9%), central nervous system neoplasms (5.1%), myelopathies (3.7%), and peripheral neuropathies (3.2%). Further sample descriptive statistics have been reported elsewhere.13

Psychometric Analyses

For 132 patients only 1 assessment was available, whereas 85 patients had both admission and discharge assessments. Thus, we assembled a sample of 217 observations by adding to the 132 single observations only 1 randomly selected assessment (either admission or discharge) for each of the 85 patients with repeated observations.

Preliminary assessment of unidimensionality (N=217). Although the median total score–item correlation index was high at .908 (range, .485–.927), the analysis showed that BBS03 (sitting unsupported) and BBS02 (standing unsupported) were the only 2 items showing a lower correlation with the BBS total score than the other items (.485 and .725, respectively). The analysis of internal consistency reliability on complete data showed a high Cronbach α value (.974). After allowing correlation of errors between BBS05 (transfers) and BBS02 (standing unsupported), and between the latter and BBS03 (sitting unsupported), the CFA suggested a “good” fit to a unidimensional model (root mean square error of approximation, .063; Comparative Fit Index, .999; Tucker-Lewis Index, .999), although BBS03 was the only item with a significantly lower factor loading (.487) than the other items (median factor loading, .921), thus indicating a weaker relationship of this item with the latent variable being measured (ie, balance). Considering these results, we brought forward the whole 14-item set, although BBS03 and BBS02 were flagged as possibly problematic items by these analyses.

Rasch analysis (N=217). The initial Rasch analysis performed on the original BBS (table 1, analysis 1) showed serious misfit to the model, failing both the assumptions of stochastic ordering of the items (χ²(28)=113.40; P<.001) and of unidimensionality (PST=10.0%; BCI, 7.1%–12.9%), but not that of local independence (there were no pairs of items with residual correlations equal to or above 0.3).8,19 The item analysis showed that 1 item overfitted the model (its response pattern was too predictable), 4 items had highly significant

Table 1: Rasch Analysis Summary for the BBS

Analysis No. | Description of Analysis | N | Item Residual | Person Residual | Item-Trait Interaction χ² (df) | P | PSI | Cronbach α* | Unidimensionality t Test PST¶ (%) | BCI# (%)
1 | Initial solution | 217 | −0.928±1.257 | −.254±.760 | 113.40 (28) | .000 | .956 | .956 | 10 | 7.1–12.9
2a | Rescoring according to Kornetti et al12 | 217 | −0.928±1.059 | −.169±.265 | 57.42 (28) | .001 | .956 | .961 | 5.1 | 2.2–8.0
2b | Individual item rescoring | 217 | −1.215±1.398 | −.211±.726 | 127.44 (28) | .000 | .957 | .968 | 8.3 | 5.4–11.2
3 | After deleting BBS03 | 217 | −0.935±1.125 | −.245±.610 | 56.96 (26) | .000 | .960 | .972 | 4.6 | 1.7–7.5
4 | After deleting BBS02 (final solution) | 217 | −0.681±0.958 | −.272±.789 | 35.68 (24) | .059 | .957 | .972 | 4.6 | 1.7–7.5
5 | Anchored analysis | 302 | −0.671±1.018 | −.265±.795 | 68.10 (48) | .030 | .957 | .971 | 6.3 | 3.8–8.7
Recommended values | | | 0.0±1.0 | 0.0±1.0 | NA | >.004† | >.850‡ | >.850‡ | <5§ | Lower BCI<5§

NOTE. Values are mean ± SD or as otherwise indicated. Analysis no. 2b was conducted on analysis no. 1, as analysis no. 2a was abandoned.
Abbreviations: df, degrees of freedom; NA, not applicable.
*Cronbach α values were calculated on complete data only.
†Bonferroni-corrected value of .05 will vary by analysis; this value refers to the final solution.
‡A value of >.850 indicates precision of measurement at the individual level, whereas a value of >.700 indicates precision at the group level.
§Strict unidimensionality is considered achieved either when PST is <5% or, alternatively, when the lower bound of its confidence interval is <5%.
¶PST is the proportion of significant t tests carried out on the estimates that, within a principal component analysis of residuals, loaded positively and negatively (factor loading ±0.3) on the first component.
#BCI is the binomial confidence interval for the proportion of significant t tests.
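The unidimensionality criterion in the table note (PST below 5%, or lower bound of the BCI below 5%) can be illustrated with a short sketch. This is not the RUMM2030 implementation: the function names are hypothetical, and the choice to compute the standard error from the nominal 5% rate rather than from the observed proportion is an assumption, since the article does not state the formula. Under that assumption the sketch gives intervals close to those reported in table 1.

```python
import math


def pst_confidence_interval(pst, n, nominal=0.05, z=1.96):
    """Approximate 95% binomial CI around a proportion of significant t tests.

    Assumption (not stated in the article): the standard error is based on
    the nominal 5% rate, not on the observed proportion.
    """
    half_width = z * math.sqrt(nominal * (1.0 - nominal) / n)
    return pst - half_width, pst + half_width


def strictly_unidimensional(pst, n, cutoff=0.05):
    """Criterion from the table note: PST < 5% or lower CI bound < 5%."""
    lower, _ = pst_confidence_interval(pst, n)
    return pst < cutoff or lower < cutoff


# Analysis 1 (initial solution): PST = 10%, N = 217
low, high = pst_confidence_interval(0.10, 217)
print(f"{low:.3f} to {high:.3f}")
print(strictly_unidimensional(0.10, 217))   # initial solution
print(strictly_unidimensional(0.063, 302))  # anchored analysis
```

With these inputs the initial solution fails the criterion, whereas the anchored analysis passes it through the lower bound of the interval, in line with the table.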


chi-square values (signaling a lack of the expected stochastic ordering), and 11 items (78.6%) had disordered thresholds. The next stage of the analysis (see table 1, analysis 2a) involved the rescoring of items with disordered thresholds according to the rescoring pattern suggested by Kornetti et al.12 However, after this, 4 items still had disordered thresholds, 2 items overfitted the model’s expectations, and 1 item (BBS02) had a significant chi-square value. Furthermore, the scale as a whole still failed the assumption of stochastic ordering (see table 1, analysis 2a). Considering these results, and that this rescoring pattern also entailed the rescoring of 3 items without disordered thresholds, we decided to rescore the items according to clinically meaningful criteria specific to each individual item, limiting the rescoring procedure to the 11 items with disordered thresholds. However, after this limited rescoring (see table 1, analysis 2b), the scale still failed the assumptions of stochastic ordering (χ²(28)=127.44; P<.001) and of unidimensionality (PST=8.3%; BCI, 5.4%–11.2%). The item analysis showed that 3 items overfitted the model’s expectations, and 1 item (BBS03) had a highly significant chi-square value (χ²(2)=60.87; P<.001). Considering the serious misfit of BBS03 and the results of the classical item analysis and CFA, we deleted this item. Although the remaining 13 items (see table 1, analysis 3) appeared to satisfy the unidimensionality assumption (PST=4.6%; BCI, 1.7%–7.5%), the scale as a whole still failed the assumption of stochastic ordering (χ²(26)=56.96; P<.001), in view of the misfit to the model of BBS02 (χ²(2)=15.83; P<.001). Considering the latter finding and the results of the previous analyses, we also deleted BBS02. After this (see table 1, analysis 4), the remaining 12 items (Rasch-modified BBS [BBS-12]) finally satisfied the model’s expectations in terms of local independence, strict unidimensionality (PST=4.6%; BCI, 1.7%–7.5%), and invariance (χ²(24)=35.68; P=.059). All items fitted the model individually. At this stage, we performed a DIF analysis for the following factors: sex, age (≤55y, 56–68y, ≥69y), days since lesion (≤20d, 21–43d, ≥44d), and etiology (ischemic stroke, hemorrhagic stroke, traumatic brain injury, other). We found no DIF for any of the tested group factors. All persons fitted the model individually. Person separation reliability, expressed as person separation index (PSI), was high at .957.

Rasch analysis (N=302). As detailed in the Methods section, we exported the item difficulty estimates and the Rasch-Andrich thresholds of the final solution of the previous Rasch analysis and anchored them to the data collected for the whole observation sample (N=302), which included repeated observations on admission and discharge (see table 1, analysis 5). This new analysis confirmed the findings of the previous analysis in terms of adherence to the Rasch model’s assumptions of local independence, strict unidimensionality (PST=6.3%; BCI, 3.8%–8.7%), and stochastic ordering of items (χ²(48)=68.10; P=.030). All BBS-12 items fitted the model individually. After this, we tested again for the presence of DIF for the previously tested factors, and the absence of DIF was confirmed. We also tested for the presence of DIF by timing of assessment (admission vs discharge assessment), and no significant DIF was found for this factor either, thus confirming the stability of the item hierarchy across time. Regarding persons, 301 (99.7%) fitted the model individually, whereas just 1 person showed a mild model overfit (fit residual, −2.945). The targeting graph of the BBS-12 (fig 1) showed that persons were evenly spread across 13 logits, although 20 persons and 22 persons were located, respectively, at the floor (floor effect, 6.6%) and at the ceiling of the scale (ceiling effect, 7.3%). The mean person ability of −.373 logits indicated that, on average, the ability of the sample was well matched to the average difficulty of the BBS-12, set by default to 0 logits. The person reliability coefficients, expressed as PSI and Cronbach α (the latter calculated on complete data only), were,

Fig 1. Targeting (person-item threshold distribution) graph of the BBS-12. Observations (N=302) and item thresholds are displayed, respectively, in the upper and the lower part of the graph, separated by the logit scale. Grouping set to interval length of .20 making 65 groups.


Table 2: Items’ Parameters, Fit Statistics, and Scoring Model for the BBS-12 (N=302)

Final BBS-12 Items | Loc | SE | FR | χ² | P* | Scoring Model (original scores 0 1 2 3 4)
BBS05: Transfers | −2.449 | .133 | 1.026 | 3.565 | .168 | 0 1 2 3 4
BBS04: From standing to sitting | −2.380 | .132 | 0.613 | 2.600 | .273 | 0 1 2 3 4
BBS01: From sitting to standing | −1.990 | .152 | −1.371 | 4.094 | .129 | 0 1 1 2 3
BBS06: Standing with eyes closed | −1.088 | .181 | −2.018 | 1.081 | .582 | 0 1 1 1 2
BBS10: Turning trunk (feet fixed) | −0.599 | .129 | −1.138 | 0.359 | .836 | 0 1 2 3 4
BBS07: Standing with feet together | −0.485 | .149 | −1.340 | 3.391 | .184 | 0 1 2 2 3
BBS08: Reaching forward while standing | 0.178 | .151 | −1.720 | 5.144 | .076 | 0 1 1 2 3
BBS09: Retrieving object from floor | 0.227 | .179 | −1.056 | 5.936 | .051 | 0 0 1 1 2
BBS13: Tandem standing | 1.808 | .144 | .358 | 2.826 | .243 | 0 1 1 2 3
BBS14: Standing on 1 leg | 1.994 | .148 | −.420 | 2.650 | .266 | 0 1 2 2 3
BBS11: Turning 360° | 2.041 | .182 | −.859 | 2.866 | .239 | 0 0 1 1 2
BBS12: Placing alternate foot on stool | 2.743 | .181 | −.243 | 1.167 | .558 | 0 0 1 1 2

Deleted Items | Reason for Deletion | Scoring Model (original scores 0 1 2 3 4)
BBS03: Sitting unsupported | Lack of invariance | 0 1 1 1 2
BBS02: Standing unsupported | Lack of invariance | 0 1 1 1 2

NOTE. The degrees of freedom for each χ² value were 2 for all items. In the upper part of the table, the final 12 BBS-12 items are displayed, ordered by progressive difficulty from top to bottom. For each item the rescoring pattern is also presented. For instance, for item BBS01, the 1st, 4th, and 5th original categories remained unchanged, whereas the 2nd and 3rd ones were collapsed together into the 2nd category (01123). In the lower part of the table, the 2 deleted items are displayed in order of deletion.
Abbreviations: FR, fit residual; Loc, item difficulty expressed in logits; SE, standard error of the difficulty estimate.
*A Bonferroni-corrected chi-square P value of .004 was applied.
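The rescoring patterns in table 2 can be applied mechanically, as in the following minimal sketch. The dictionary is transcribed from the scoring model columns of the table; the names `BBS12_RESCORING` and `bbs12_total` are hypothetical and introduced here only for illustration.

```python
# Rescoring keys transcribed from table 2: position = original BBS score (0-4),
# value = BBS-12 score. BBS02 and BBS03 are omitted because they were deleted.
BBS12_RESCORING = {
    "BBS05": (0, 1, 2, 3, 4),
    "BBS04": (0, 1, 2, 3, 4),
    "BBS01": (0, 1, 1, 2, 3),
    "BBS06": (0, 1, 1, 1, 2),
    "BBS10": (0, 1, 2, 3, 4),
    "BBS07": (0, 1, 2, 2, 3),
    "BBS08": (0, 1, 1, 2, 3),
    "BBS09": (0, 0, 1, 1, 2),
    "BBS13": (0, 1, 1, 2, 3),
    "BBS14": (0, 1, 2, 2, 3),
    "BBS11": (0, 0, 1, 1, 2),
    "BBS12": (0, 0, 1, 1, 2),
}


def bbs12_total(original_scores):
    """Convert original 0-4 item scores to BBS-12 scores and sum them.

    `original_scores` maps each of the 12 retained item codes to the score
    obtained with the original BBS scoring criteria.
    """
    missing = set(BBS12_RESCORING) - set(original_scores)
    if missing:
        raise ValueError(f"missing items: {sorted(missing)}")
    return sum(
        BBS12_RESCORING[item][original_scores[item]] for item in BBS12_RESCORING
    )


# A patient with the best original score on every retained item
best = {item: 4 for item in BBS12_RESCORING}
print(bbs12_total(best))
```

The maximum obtained this way is 35, which agrees with the 0 to 35 range reported for the BBS-12 total score.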

respectively, at .957 and .971, both indicating precision of measurement at the individual level.20 Given the PSI, persons could be separated into 7.11 strata (ie, the statistically distinct levels of balance ability that the BBS-12 was able to reliably distinguish32). The total raw score of the BBS-12 ranged from 0 to 35. Its calibrations and the rescoring pattern, as well as all the deleted items, are displayed in table 2. The item hierarchy (see table 2) was consistent with clinical expectations, as the easiest items were related, respectively, to transfers (BBS05), and to postural changes from standing to sitting (BBS04) and vice versa (BBS01), whereas the most difficult items were, respectively, standing on 1 leg (BBS14), turning 360° (BBS11), and placing the alternate foot on a stool (BBS12). On the basis of the item calibrations, it was possible to construct both a conversion table to transform the raw scores of the BBS-12 into linear estimates of balance ability (table 3) and a nomogram (the BBS-12 ruler, fig 2) that performs the following operations: (1) conversion of the original BBS item scores into the BBS-12 item scores; (2) calculation of the BBS-12 total score; (3) conversion of the BBS-12 total score into linear measures of ability with their related 95% confidence interval; and (4) assessment of the quality of the measures on the basis of the displayed item response pattern. A working example of the BBS-12 ruler is presented in figure 2, and a blank version of it can be obtained from the corresponding author.

DISCUSSION

To the best of our knowledge, this is the first study that fully appraised the internal construct validity and reliability of the BBS within the framework of Rasch analysis. We conducted this study on a sample of neurologic patients with impairment of balance admitted to rehabilitation, either as inpatients or outpatients.
Our results suggested the need to modify the BBS structure, not only by modifying the scoring model of most items but also by deleting 2 items (sitting and standing balance), to attain adequate internal construct validity. As a consequence, the BBS-12 that we obtained holds interesting measurement properties independently of the neurologic disease causing the balance impairment. As in the previous Rasch analysis study12 of the BBS, most BBS items showed disordered thresholds and, hence, were rescored. However, our rescoring was less extensive than that undertaken by Kornetti,12 considering that their modified BBS had a total score ranging from 0 to 37 across 14 items (mean thresholds per item, 2.6), whereas the total score of the BBS-12 ranged from 0 to 35 across 12 items (mean thresholds per item, 2.9). We were unable to

Table 3: Raw Score to Logit Conversion Table for the BBS-12

Raw Score | Logit | Raw Score | Logit
0 | −6.826 | 18 | −0.180
1 | −5.716 | 19 | 0.028
2 | −4.896 | 20 | 0.241
3 | −4.290 | 21 | 0.461
4 | −3.804 | 22 | 0.689
5 | −3.392 | 23 | 0.926
6 | −3.029 | 24 | 1.173
7 | −2.703 | 25 | 1.431
8 | −2.405 | 26 | 1.701
9 | −2.130 | 27 | 1.985
10 | −1.875 | 28 | 2.282
11 | −1.637 | 29 | 2.596
12 | −1.412 | 30 | 2.930
13 | −1.197 | 31 | 3.295
14 | −0.990 | 32 | 3.711
15 | −0.787 | 33 | 4.217
16 | −0.586 | 34 | 4.903
17 | −0.385 | 35 | 5.835

NOTE. This conversion table can be used only if patients are assessed on all the 12 BBS-12 items and if the modified rating scales for the BBS-12 items are used, as detailed in table 2 and figure 2.
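The conversion in table 3 amounts to a lookup from the BBS-12 total raw score to a logit estimate. The sketch below transcribes the table; the names `BBS12_LOGITS`, `raw_to_logit`, and `change_in_logits` are hypothetical, and, as the table note states, the lookup is meaningful only when all 12 items have been scored with the modified rating scales.

```python
# Logit estimates transcribed from table 3; list index = BBS-12 raw score (0-35).
BBS12_LOGITS = [
    -6.826, -5.716, -4.896, -4.290, -3.804, -3.392,
    -3.029, -2.703, -2.405, -2.130, -1.875, -1.637,
    -1.412, -1.197, -0.990, -0.787, -0.586, -0.385,
    -0.180, 0.028, 0.241, 0.461, 0.689, 0.926,
    1.173, 1.431, 1.701, 1.985, 2.282, 2.596,
    2.930, 3.295, 3.711, 4.217, 4.903, 5.835,
]


def raw_to_logit(raw_score):
    """Return the logit estimate for a complete BBS-12 raw score (0-35)."""
    if not isinstance(raw_score, int) or not 0 <= raw_score <= 35:
        raise ValueError("BBS-12 raw score must be an integer from 0 to 35")
    return BBS12_LOGITS[raw_score]


def change_in_logits(first_raw, second_raw):
    """Difference between two assessments on the logit scale."""
    return raw_to_logit(second_raw) - raw_to_logit(first_raw)


print(raw_to_logit(21))
print(round(change_in_logits(0, 5), 3))
print(round(change_in_logits(15, 20), 3))
```

The last two lines show why the transformation matters for change scores: the same 5-point raw gain corresponds to 3.434 logits at the bottom of the scale but to 1.028 logits near its middle.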


Fig 2. The BBS-12 ruler. In order to use the ruler, the BBS is administered according to the original BBS scoring criteria, and the original score for each and every BBS item is reported in the corresponding blank square on the left hand side in the [A] area. The 2 items excluded from BBS-12 (BBS02–standing unsupported and BBS03–sitting unsupported) are not administered. The BBS items are ordered by progressively increasing difficulty from top to bottom. The numbered horizontal bars for each item [B] represent the range of ability flagged by each score, and the actual affirmed score for each item is indicated by a black bar. Thus, for instance, an affirmed score of 4 in BBS05 (transfers) flags an ability range from around ⴙ0.5 logits to infinite, whereas an affirmed score of 3 in BBS04 (from standing to sitting) covers an ability range from around ⴚ2 to ⴙ0.5 logits. For the rescored items, the rescoring key (showed also in table 2) is presented as a number in square brackets. Thus, for instance, no rescoring key is provided for the BBS05 item, and hence, a score of 4 remains unchanged for this item. On the other hand, the rescoring key for the BBS06 item suggests that either a BBS original score of 1, 2, or 3 flags the same ability range indicated by a BBS-12 score of [1]. In the [C] area, either the modified BBS-12 scores or the unmodified BBS scores are reported as required, and their sum is then reported in the blank square in the [D] area. Thus, for this individual patient, the BBS-12 total score is 21. This score is marked both on the top and bottom score rulers [E]. The 2 markers are then conjoined with a vertical dotted line [F] that is the measurement line. This line crosses the measure ruler [G], giving the corresponding Rasch measure estimate expressed in logits (in this case, ⴙ0.5 logits). In the area below the measurement ruler, there is a graph [H] plotting the confidence interval due to measurement error around each BBS-12 total score. 
Then, 2 lines parallel to the measurement line are plotted at the lower and upper limits of the 95% confidence interval around the person’s ability [I], thus defining a measurement area [L] that contains the true measure estimate with 95% confidence. In this way, the true ability estimate for this person lies in the range from −0.4 to +1.4 logits (or 0.5±0.9 logits if expressed as a 95% confidence interval). By examining the relationship between the measurement line and measurement area and the individual responses to each item, it is possible to observe several patterns that give useful clinical information. In some cases (a), the actual response to the item spans the whole measurement area, whereas in other cases the item response crosses the measurement line, either from the right-hand side (b) or from the left-hand side (c), although it only partially overlaps the measurement area. These patterns can be considered “best” responses, as the range of ability flagged by the item score is the closest possible to the individual’s range of ability flagged by the measure. However, for other items, less expected responses could be observed. For instance, for some items (d), the score overlaps the measurement area just partially without crossing the measurement line. For this item (BBS13–tandem standing), the best response would have been a modified BBS13 score of [1] (that would cross the measurement line), thus indicating that in this activity, this patient somewhat underperformed given his ability level. A similar consideration may be applied to the responses to some items that are completely outside the measurement area (e): in this specific item (BBS14) and in BBS12, the patient performed better than expected on the basis of his ability level. The (d) and (e) responses, although unexpected, are still compatible with the model’s expectations because the range of ability flagged by the actual item score is still quite close to the measurement area. However, the same consideration may not apply to the response to the BBS01 item (sitting to standing), for which the lowest score is reported (f). Indeed, this is a rather unexpected response (as for this item the best response would have been a modified BBS01 score of [3]) that is rather distant from the measurement area. This represents a significant departure from the model’s expectations: although this patient’s balance ability is in the moderate to high range, he nonetheless fails a rather easy item. Given this unexpected response, a careful inspection of this patient’s record is mandatory in order to find an explanation for this unexpected item behavior (transcription error? correct response due to specific individual factors?).
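The reading of the ruler described in the caption can be illustrated with a minimal Python sketch. It is not part of the published ruler: the names `Interval`, `measurement_area`, and `classify` are hypothetical, the half-width of the confidence interval is passed in directly (0.9 logits in the caption's example), and the item score ranges used in the usage note are invented for illustration rather than read from the figure.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Ability range in logits; use float('inf') for an open-ended top score."""
    low: float
    high: float


def measurement_area(measure: float, half_width: float) -> Interval:
    """Confidence band around the person measure (the area between the 2 parallel lines)."""
    return Interval(measure - half_width, measure + half_width)


def classify(score_range: Interval, measure: float, area: Interval) -> str:
    """Label an item response relative to the measurement line and measurement area."""
    covers_area = score_range.low <= area.low and score_range.high >= area.high
    crosses_line = score_range.low <= measure <= score_range.high
    overlaps = score_range.low <= area.high and score_range.high >= area.low
    if covers_area:
        return "spans_area"       # pattern (a) in the caption
    if crosses_line:
        return "crosses_line"     # patterns (b) and (c)
    if overlaps:
        return "partial_overlap"  # pattern (d)
    return "outside"              # patterns (e) and (f)


# Person from the caption: measure +0.5 logits, 95% CI of +/-0.9 logits.
area = measurement_area(0.5, 0.9)
print(classify(Interval(1.0, 3.0), 0.5, area))  # hypothetical item score range
```

The sketch does not separate pattern (e) from pattern (f), because the caption distinguishes them only qualitatively, by how far the score range lies from the measurement area.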

make further comparisons with the previous study12 because it did not report on the other aspects of fit to the model. After obtaining a stable scoring structure for all items, the analysis demonstrated that the BBS still did not fit the Rasch model’s measurement requirements. In particular, both the preliminary analyses (classical item analysis and CFA) and the Rasch analysis flagged BBS03 and BBS02 (balance in sitting and standing, respectively) as problematic items, which were therefore deleted within the context of Rasch analysis. The lack of fit of these 2 items to the Rasch model could be explained by the fact that both evaluate static balance, whereas the other BBS items assess dynamic balance, as recently suggested.13 Indeed, static postural control is likely to be heavily influenced, especially in low-ability patients with hemiparesis, by trunk control, which is an allied but separate construct from dynamic balance.13 After the elimination of the static balance items, we observed an increase in the floor effect (although still at an acceptable level), whereas the moderate ceiling effect observed on the first analysis was almost unaffected by the rescoring procedure; both floor and ceiling effects are well-known shortcomings of the BBS.3,33 Interestingly, the elimination of 2 items and the loss of 21 score points did not affect the reliability of the scale at all, as it remained practically unchanged from the initial to the final solution. This indicates that the 2 items were not internally consistent with the other items, and further confirms the underutilization of some BBS scoring categories, as previously suggested.12,34 Eighty-five of the 217 enrolled patients had repeated measurements on admission and discharge. However, including repeated measurements from the same patients at different time points may violate the assumption of statistical independence among the observations.
On the other hand, including all the available observations (including measurements at 2 time points from the same patients) would allow the sample to be much more representative of the actual population admitted to our rehabilitation unit, as it would include patients with the full spectrum of balance abilities, ranging from the lower ones typically seen in patients just admitted to rehabilitation, to the higher balance abilities that are more likely to be achieved by patients on discharge. The latter is a common occurrence in rehabilitation research and clinical practice,25 where the main focus is measuring the change that occurs in the patient’s functioning as an effect of the rehabilitation treatment. We resolved this dilemma, thus eliminating a possible source of time-series dependence, by following the procedure suggested by Mallinson25 that allowed us to measure persons at different time points within the same frame of measurement. Once possible time-series dependence on person estimates had been adjusted for, we were able not only to include the repeated observations, but also to test for the stability of item difficulty estimates across the 2 different time points. The lack of stability of the item hierarchy across time, sometimes referred to as “item drift,” is just a special case of DIF26,27 and, provided that there are few time points (as in this case), it can be assessed with standard DIF statistical techniques.27,28 The analysis of DIF demonstrated that the BBS-12 was invariant across patients’ sex, age, and time, as we demonstrated stability of item calibrations in terms of time elapsed from the onset and of time between pretreatment and posttreatment observations. These data suggest that the BBS-12 may be used at various stages of the rehabilitation process, with younger or older patients of both sexes, affected by acute or chronic balance problems. On the other hand, the lack of DIF for etiology is particularly interesting, as it may ensure comparability of the assessment of patients across a variety of balance problems with different neurologic causes. The rescoring table presented in this article provides a simple method to transform the total BBS-12 raw scores into linear measures of balance. The BBS-12 ruler is a useful addition to this because it not only provides a simple and quick method to assess the quality of the data collected for the individual patient, but can also highlight individual areas of weakness or strength in balance activities that may provide guidance for setting up individualized treatment plans aimed at improving balance. Since the logit measure estimates provided by the table and the ruler satisfy the requirements for interval-level measurement, clinicians and researchers may want to use these estimates rather than raw scores both for the correct interpretation of change scores8 and for the possibility of making use of parametric statistics (eg, analysis of variance).8,10 It should be noted, however, that both the table and the ruler can be used only with complete data, because missing item responses would lead to systematic underestimation of the patient’s ability.

Study Limitations
There are a number of sample-related issues in this study.
First, ours was a convenience sample representing a cross-section of adults drawn from a single rehabilitation center. This may limit the generalizability of these findings to other samples. Second, the sample size, although large enough for definitive calibrations,35 was too small to set aside a validation sample that would have enabled us to further validate the final scale. There is thus a risk that the solution we have obtained has capitalized on chance with respect to fit to the Rasch model. Thus, considering the latter point, the raw score-to-interval scale transformation and the BBS-12 ruler should be considered provisional at this time, and a further validation of the proposed rescoring pattern is warranted. Further limitations include that not all possible etiologies encountered in all rehabilitation settings were tested (eg, in the sample there were no patients with extrapyramidal diseases), and that the relationship between the original cutoff scores of the BBS for fall risk estimation and the scores of the BBS-12 was not addressed. Given these limitations, our findings require replication in the context of a large multicenter study aiming at further validating the rescoring pattern proposed, at testing the stability of item calibration across further etiologies, and at exploring the predictive validity of the BBS-12.
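The raw score-to-measure conversion discussed above can be illustrated with a minimal Python sketch under the partial credit form of the Rasch model, in which the total raw score is a sufficient statistic for ability. The sketch finds, by bisection, the ability at which the model-expected total score equals the observed raw score. The function names and the item thresholds are hypothetical and are not the BBS-12 calibrations, and the estimator used by RUMM2030 may differ from this plain maximum-likelihood solution.

```python
import math


def category_probs(theta: float, thresholds: list[float]) -> list[float]:
    """Partial credit model: P(X = k), k = 0..m, for ability theta (logits)."""
    # Score k has log-numerator sum_{j<=k} (theta - tau_j); score 0 has an empty sum.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_total(theta: float, items: list[list[float]]) -> float:
    """Model-expected total raw score over all items at ability theta."""
    return sum(k * p for th in items
               for k, p in enumerate(category_probs(theta, th)))


def measure_for_raw_score(raw: int, items: list[list[float]],
                          lo: float = -10.0, hi: float = 10.0,
                          tol: float = 1e-6) -> float:
    """Ability (logits) whose expected total equals the raw score (bisection)."""
    max_score = sum(len(th) for th in items)
    if not 0 < raw < max_score:
        raise ValueError("extreme scores have no finite maximum-likelihood estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_total(mid, items) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical 2-item scale, each item scored 0-2 (thresholds in logits).
items = [[-1.0, 0.0], [0.0, 1.0]]
for raw in (1, 2, 3):
    print(raw, round(measure_for_raw_score(raw, items), 2))
```

The conversion is defined on the total over all items, so a missing item response lowers the raw total and shifts the estimate downward; this is consistent with the caveat that the table and the ruler require complete data.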

CONCLUSIONS
This study supports the internal validity and reliability of the BBS-12 as a measurement tool independent of the etiology of the neurologic disease causing the balance impairment. In view of some sample-related issues and the fact that not all possible etiologies encountered in neurorehabilitation settings were tested, a larger multicenter study is needed to confirm these findings.


Acknowledgments: We thank Stefano Gualdi, PT, Chiara Bosi, PT, and Matteo Maria Mariani, PT, for data collection.


References
1. Berg KO, Wood-Dauphinee SL, Williams JI, Maki B. Measuring balance in the elderly: validation of an instrument. Can J Public Health 1992;83(Suppl 2):S7-11.
2. Tyson SF, Connell LA. How to measure balance in clinical practice. A systematic review of the psychometrics and clinical utility of measures of balance activity for neurological conditions. Clin Rehabil 2009;23:824-40.
3. Blum L, Korner-Bitensky N. Usefulness of the Berg Balance Scale in stroke rehabilitation: a systematic review. Phys Ther 2008;88:559-66.
4. Qutubuddin AA, Pegg PO, Cifu DX, Brown R, McNamee S, Carne W. Validating the Berg Balance Scale for patients with Parkinson’s disease: a key to rehabilitation evaluation. Arch Phys Med Rehabil 2005;86:789-92.
5. Cattaneo D, Regola A, Meotti M. Validity of six balance disorders scales in persons with multiple sclerosis. Disabil Rehabil 2006;28:789-95.
6. Juneja G, Czyrny JJ, Linn RT. Admission balance and outcomes of patients admitted for acute inpatient rehabilitation. Am J Phys Med Rehabil 1998;77:388-93.
7. Brotherton SS, Williams HG, Gossard JL, Hussey JR, McClenaghan BA, Eleazer P. Are measures employed in the assessment of balance useful for detecting differences among groups that vary by age and disease state? J Geriatr Phys Ther 2005;28:14-9.
8. Tennant A, Conaghan PG. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Rheum 2007;57:1358-62.
9. Perline R, Wright BD. The Rasch models as additive conjoint measurement. Appl Psychol Meas 1979;3:237-55.
10. Svensson E. Guidelines to statistical evaluation of data from rating scales and questionnaire. J Rehabil Med 2001;33:47-8.
11. Cantagallo A, Carli S, Simone A, Tesio L. MINDFIM: a measure of disability in high-functioning traumatic brain injury outpatients. Brain Inj 2006;20:913-25.
12. Kornetti DL, Fritz SL, Chiu YP, Light KE, Velozo CA. Rating scale analysis of the Berg Balance Scale. Arch Phys Med Rehabil 2004;85:1128-35.
13. La Porta F, Franceschini M, Caselli S, Cavallini P, Susassi S, Tennant A. Unified Balance Scale: an activity-based, bed to community, and aetiology-independent measure of balance calibrated with Rasch analysis. J Rehabil Med 2011;43:435-44.
14. Berg K, Wood-Dauphinée S, Williams J, Gayton D. Measuring balance in the elderly: preliminary development of an instrument. Physiother Can 1989;41:304-11.
15. La Porta F, Franceschini M, Caselli S, Cavallini P, Susassi S, Tennant A. Unified Balance Scale: classic psychometric and clinical properties. J Rehabil Med 2011;43:445-53.
16. 59th World Medical Association General Assembly, Declaration of Helsinki: ethical principles for medical research involving human subjects. 2008. World Medical Association. Available at: http://www.wma.net/en/30publications/10policies/b3/. Accessed January 31, 2009.
17. Elhan AH, Öztuna D, Kutlay S, Küçükdeveci AA, Tennant A. An initial application of computerized adaptive testing (CAT) for measuring disability in patients with low back pain. BMC Musculoskelet Disord 2008;9:166.
18. Franchignoni F, Horak F, Godi M, Nardone A, Giordano A. Using psychometric techniques to improve the Balance Evaluation Systems Test: the mini-BESTest. J Rehabil Med 2010;42:323-31.
19. Andrich D. Rasch models for measurement. London: Sage Publications; 1988.
20. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care 2007;45(5 Suppl 1):S22-31.
21. Smith E. Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. J Appl Meas 2002;3:205-31.
22. Tennant A, Penta M, Tesio L, et al. Assessing and adjusting for cross-cultural validity of impairment and activity limitation scales through differential item functioning within the framework of the Rasch model: the PRO-ESOR project. Med Care 2004;42(1 Suppl):I37-48.
23. Pallant JF, Tennant A. An introduction to the Rasch measurement model: an example using the Hospital Anxiety and Depression Scale (HADS). Br J Clin Psychol 2007;46(Pt 1):1-18.
24. Wright BD. Rack and stack: time 1 vs. time 2. Rasch Meas Trans 2003;17:905-6.
25. Mallinson T. Rasch analysis of repeated measures. Rasch Meas Trans 2011;251:1317.
26. Angoff WH. Proposals for theoretical and applied development in measurement. Appl Meas Educ 1988;1:215-22.
27. Li X. An investigation of the item parameter drift in the Examination for the Certificate of Proficiency in English (ECPE). Spaan Fellow Working Papers in Second or Foreign Language Assessment 2008;6:1-28. Available at: http://www.cambridgemichigan.org/sites/default/files/resources/SpaanPapers/Spaan_V6_Li.pdf. Accessed April 1, 2012.
28. Specht K, Leonhardt JS, Revald P, et al. No evidence of a clinically important effect of adding local infusion analgesia administrated through a catheter in pain treatment after total hip arthroplasty. Acta Orthop 2011;82:315-20.
29. Tabachnick BG, Fidell LS. Using multivariate statistics. 4th ed. New York: HarperCollins; 2001.
30. Linacre JM. Sample size and item calibration stability. Rasch Meas Trans 1994;7:328.
31. Bland J, Altman D. Multiple significance tests: the Bonferroni method. BMJ 1995;310:170.
32. Wright BD, Masters GN. Rating scale analysis. Chicago: MESA Pr; 1982.
33. Rose DJ, Lucchese N, Wiersma LD. Development of a multidimensional balance scale for use with functionally independent older adults. Arch Phys Med Rehabil 2006;87:1478-85.
34. Wang CH, Hsueh IP, Sheu CF, Yao G, Hsieh CL. Psychometric properties of 2 simplified 3-level balance scales used for patients with stroke. Phys Ther 2004;84:430-8.
35. Linacre JM. Sample size and item calibration [or person measure] stability. Rasch Meas Trans 1994;7:328.

Suppliers
a. SPSS version 13 for Windows, Release 13.0.1; SPSS Inc, 233 S Wacker Dr, 11th Fl, Chicago, IL 60606.
b. Mplus version 6.0; Muthén & Muthén, 3463 Stoner Ave, Los Angeles, CA 90066.
c. RUMM2030 version 5.1 for Windows; RUMM Laboratory Pty Ltd, 14 Dodonaea Ct, Duncraig, WA, Australia 6023.