ORIGINAL ARTICLE
Reliability of the Performance and Safety Scores of the Wheelchair Skills Test Version 4.1 for Manual Wheelchair Users
Noelle J. Lindquist, MScPT, Patricia E. Loudon, MScPT, Trent F. Magis, MScPT, Jessica E. Rispin, MScPT, R. Lee Kirby, MD, FRCPC, Patricia J. Manns, PhD
ABSTRACT. Lindquist NJ, Loudon PE, Magis TF, Rispin JE, Kirby RL, Manns PJ. Reliability of the performance and safety scores of the Wheelchair Skills Test Version 4.1 for manual wheelchair users. Arch Phys Med Rehabil 2010;91:1752-7.
Objective: To evaluate the interrater, intrarater, and test-retest reliability of the total performance and safety scores of the Wheelchair Skills Test version 4.1 (WST 4.1) for manual wheelchairs operated by adult wheelchair users.
Design: Cohort study.
Setting: University research setting.
Participants: People (N=11) who used manual wheelchairs for community locomotion.
Interventions: Not applicable.
Main Outcome Measure: Participants were videotaped as they completed the WST 4.1 (30 skills) on 2 separate occasions 1 to 2 weeks apart. Subsequently, raters scored the WST 4.1 from the video recordings and each participant received a total score for performance and safety. Using those scores, interrater, intrarater, and test-retest reliability were determined by using intraclass correlation coefficients (ICCs). Percentages of agreement between raters for individual skills also were calculated.
Results: Mean ± SD overall WST 4.1 scores for performance and safety were 80.1%±8.5% and 98.0%±2.8%. ICCs for the interrater, intrarater, and test-retest reliability of the performance component were .855, .950, and .901 (P<.001). Safety component ICC scores were .061 (P=.243), .228 (P=.048), and .254 (P=.041). Percentages of agreement between raters for each test item for both the performance and safety scales ranged from 68% to 100%.
Conclusions: Reliability of the performance component of the WST 4.1 was excellent, whereas ICCs for the safety component indicated only slight to fair agreement, probably because of the low variability in safety scores. Additional study is needed to further evaluate the reliability of the safety component with a larger and more diverse sample group. Key Words: Outcome assessment (health care); Rehabilitation; Reproducibility of results; Wheelchairs.
From the Department of Physical Therapy (Lindquist, Loudon, Magis, Rispin, Manns), Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, Alberta; and the Division of Physical Medicine and Rehabilitation, Dalhousie University, Halifax, Nova Scotia, Canada (Kirby). Supported by the Endowment Fund for the Future: Support for Advancement of Scholarship Small Faculties Research Grant Program at the University of Alberta. No commercial party having a direct financial interest in the results of the research supporting this article has or will confer a benefit on the authors or on any organization with which the authors are associated. Correspondence to Patricia J. Manns, PhD, Dept of Physical Therapy, University of Alberta, 2-50 Corbett Hall, Edmonton, Alberta T6G 2G4, e-mail: trish.
[email protected]. Reprints are not available from the author. 0003-9993/10/9111-00218$36.00/0 doi:10.1016/j.apmr.2010.07.226
Arch Phys Med Rehabil Vol 91, November 2010
© 2010 by the American Congress of Rehabilitation Medicine
There were 2.7 million noninstitutionalized users of wheeled mobility devices in the United States in 2002. Using a conservative rate of growth (5.2% a year), this number
was estimated to have increased to 3.86 million by 2009.1 Up to 36% of wheelchair users reported that obstacles such as curbs, uneven terrain (eg, grass, mud, ice), door handles, flooring surfaces, and thresholds were barriers to mobility.2 Specific training of wheelchair skills may help overcome some or all of these barriers for selected persons. In rehabilitation or community settings, in which the goal is to improve wheelchair skills, the WST3 can be used to identify skill deficiencies and design interventions that appropriately target those deficiencies. The WST then can be used to assess the results of training or other interventions. Several studies have shown that assessment and training of wheelchair skills leads to improvements in those skills.4-7 Since its inception in 1996, the WST has evolved, with 4 versions (1.0, 2.4, 3.2, and 4.1) released for general use.3,8-10 The evolution of the WST has been based on clinical and research experience, feedback from users, and assessments of its measurement properties. The most recent 3 versions include dichotomous grading (pass/fail) of the performance of a set of wheelchair skills necessary for successful wheelchair locomotion in the community. The skill set has evolved by deletions, additions, and combinations of skills. The number of skills assessed has decreased to the current 32. The most recent version (4.1) includes more difficult tasks, such as getting up off the floor and ascending and descending stairs. These were added to better assess advanced wheelchair users and avoid a ceiling effect. However, the most notable difference in version 4.1 compared with previous versions is the inclusion of a safety component. Wheelchair users now receive both a performance and a safety score for each skill, and the number of skills passed for performance and safety are totaled separately to provide 2 total percentage scores. 
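As a rough illustration of the dual scoring just described, the sketch below totals pass and safe grades separately and converts each total to a percentage of the skills assessed. The function name `total_percentage_scores`, the tuple layout, and the sample grades are assumptions made for illustration only; the WST 4.1 manual remains the authority on how untested skills and testing errors are handled.

```python
from typing import List, Tuple


def total_percentage_scores(grades: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (performance %, safety %) for one WST score sheet.

    Each entry is (passed, safe) for one assessed skill. Skills that were
    not assessed are simply left out of the list (an assumption here).
    """
    if not grades:
        raise ValueError("at least one assessed skill is required")
    n = len(grades)
    passed = sum(1 for was_passed, _ in grades if was_passed)
    safe = sum(1 for _, was_safe in grades if was_safe)
    return 100.0 * passed / n, 100.0 * safe / n


# Hypothetical sheet: 30 skills, 24 passed safely, 5 failed safely,
# and 1 failed unsafely (for example, the spotter had to intervene).
sheet = [(True, True)] * 24 + [(False, True)] * 5 + [(False, False)]
performance, safety = total_percentage_scores(sheet)
print(f"performance {performance:.1f}%, safety {safety:.1f}%")
```

Keeping the two totals separate is what lets a safe failure (declining a skill) lower the performance score without lowering the safety score.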
Inclusion of a safety component for the assessment of wheelchair skills is largely without precedent in the wheelchair
List of Abbreviations
ICC      intraclass correlation coefficients
R1       first viewing of a trial (rating 1)
R2       second viewing of a trial (rating 2)
T1       first trial
T2       second trial
WSP      Wheelchair Skills Program
WST      Wheelchair Skills Test
WST 4.1  Wheelchair Skills Test version 4.1
RELIABILITY OF WHEELCHAIR SKILLS TEST VERSION 4.1, Lindquist
literature. Aspects of safety, such as physical strain measured by means of heart rate, perceived task difficulty, and physical cost, have been discussed previously in a systematic review regarding outcome parameters of WSTs.11 However, no test has included a score specific to the safety of the performance of individual tasks. The safety component was added to the WST to reward wheelchair users for improvements in safety or good judgment related to safety, even when the skill was not successfully performed. The addition of the safety component allows a distinction to be made between failure because of unsafe performance and failure because of inadequate yet safe performance. For example, as a result of training, a wheelchair user might progress from an unsafe failure (by attempting the curb descent skill and needing to be caught by the spotter) to a safe failure (by declining to attempt the skill) by learning that he/she was not capable of performing the task without tipping over.
Previous versions of the WST have been found to be valid, reliable, safe, and practical tools to assess functional skills of manual wheelchair mobility.8,9 However, the reliability of the WST 4.1, with the slightly revised skill set and the additional safety component, has not been tested. The primary purpose of this study was to determine the interrater, intrarater, and test-retest reliability of the total performance and safety scores of the WST 4.1 for manual wheelchairs operated by adult wheelchair users. Our secondary objectives were to assess the percentage of agreement among raters for individual skills and identify any sources of unreliability.
METHODS
Participants
We studied 11 manual wheelchair users, a sample of convenience. The estimated sample size was determined by means of a power analysis using intrarater ICC values (ICC=.959 and ICC=.950) from previous WST reliability studies,8,9 α level of .05, and target power of .80.
This analysis showed that a sample size of 9 was recommended to ensure adequate power.12
Recruitment and Screening
Participants were recruited through an exercise center for individuals with disabilities and by contacting wheelchair users who had participated in previous research studies through the Physical Therapy Department at the University of Alberta. Inclusion criteria were age older than 16 years, use of a manual wheelchair for more than 1 year and for most personal transport, use of the present wheelchair for more than 1 week, medically stable, and able to perform the WST on 2 separate occasions. Demographic and clinical data were recorded for participants (sex, age, height, weight, diagnosis accounting for wheelchair use) and their wheelchairs and wheelchair-use patterns.
Ethical Issues
The study was approved by the Health Research Ethics Board of the University of Alberta. Each participant was informed of the purpose of the study and signed an informed consent form.
Safety of Testing
To ensure the safety of participants and testers, we took measures included in the spotter procedures of the WSP. A spotter strap was used to prevent backward tips of the wheelchair user during skills with a high risk for tipping. Although these measures decrease the likelihood of injury, they do not
interfere with the ability to assess safety because the spotter does not intervene until a safety criterion has been violated. Adverse incidents were recorded.
Administering the WST 4.1
Each of 4 testers administering and scoring the WST 4.1 received training and certification from the WSP before testing. Training involved a study of the WST 4.1 manual and completion of scoring of 3 sample test videos. Scoring of these videos was discussed later during a teleconference with test developers at Dalhousie University to provide clarification about scoring rules. Certification was received after each individual completed an examination on the WSP administered by the WSP team at Dalhousie. The WST for manual wheelchair users was administered according to the WST 4.1 manual. There were at least 3 testers on site to perform the testing: 1 to conduct the WST, a second to videorecord the WST, and a third to manage equipment, provide moving obstacles, and provide spotting help as necessary. Testing was conducted at 2 public locations at the University of Alberta, the location dependent on space availability. Of note was the absence of 5° and 10° ramps within the facilities; therefore, 4 skills were omitted (ascent and descent of each). Instead, ascent and descent of a 7.5° ramp were substituted. With this adjustment, we tested 30 skills (listed later) instead of the usual 32 skills in the WST 4.1 for manual wheelchair users. In addition, a minimum of 2 steps was substituted for the minimum of 3 steps outlined in the WST 4.1 manual to conform to available equipment at the 2 facilities. T1 and T2 were scheduled a minimum of 1 week and a maximum of 2 weeks apart for each participant to minimize learning effects and the effects of natural skill improvement. Each participant completed both WST trials in normal attire and with his/her normal wheelchair configuration. Participants used their own wheelchairs during testing and the same wheelchair for both trials.
Tire pressure was assessed at each session and changed if necessary to ensure the same tire pressure for both trials. All trials were videotaped using camera positions that were as consistent as possible. Skills were performed in an order that minimized location changes and increased efficiency. Before the performance of more difficult tasks (eg, 15-cm curb), screening questions were asked according to the protocol set out in the WST 4.1 manual3 to determine whether the task should be attempted. For example, participants were asked “Can you get your wheelchair down a 15-cm curb? How?”
WST Scoring by Different Raters
After completion of all WST trials, copies of the video recordings were made and distributed to each of 4 raters (A, B, C, D). Video recordings were scored individually according to the WST 4.1 manual, which includes general and specific criteria for performance and safety. For skills for which screening questions were asked, if a participant stated that he/she could not perform a skill, he/she was given a fail grade for performance of that task, but a safe grade for safety. If he/she described a method of performing the skill that the tester deemed unsafe on the basis of criteria set out in the WST 4.1 manual, he/she was given a fail grade for performance of that task and an unsafe grade for safety. Each trial was rated twice (R1 and R2). Each rater viewed the recordings in order from participant 1 to 11. Viewing was permitted during multiple sessions, and raters were allowed to stop and rewind the recordings as necessary to make appropriate judgments. All 4 raters initially viewed T1 of all participants (T1-R1). After a
minimum of 2 weeks, raters A and B then viewed T1 a second time (T1-R2), whereas raters C and D viewed T2 (T2-R1). The different number of raters was to permit the different types of reliability to be assessed (see Data Analysis section). After scoring all individual skills (ie, pass/fail or safe/unsafe), total percentage scores were calculated separately for performance and safety according to WST procedures.3 The numerators were the number of skills awarded pass or safe scores, respectively. The denominators were the number of skills assessed (normally 30 for this study). Each score sheet thus provided information about individual skill success and a total percentage score for both performance and safety.
Data Analysis
All statistical analyses were conducted using SPSS for Macintosh.a Descriptive statistics were calculated for all quantitative data. ICCs were calculated to statistically quantify the reliability of the total percentage WST scores for both performance and safety. Interrater reliability was calculated using the scores obtained by each of the 4 raters for T1-R1. Intrarater reliability was calculated using scores for raters A and B from T1-R1, as well as their scores from T1-R2. For test-retest reliability, calculations were made using the scores of raters C and D (T1-R1 and T2-R1). Total percentage scores were treated as continuous data for this purpose. ICC values were interpreted according to Shrout and Fleiss13 as follows: less than .00 indicates poor; .00 to .10, virtually none; .11 to .40, slight; .41 to .60, fair; .61 to .80, moderate; and .81 to 1.00, substantial agreement. We also assessed whether there was a significant difference in scores for T1 and T2 by using a paired t test. A Bonferroni-adjusted α level of .01 (.05/4) was used for all correlations and comparisons. We calculated κ statistics for each individual skill using the data obtained from the 4 raters’ scoring of all participants in T1-R1.
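The interpretation bands quoted above can be written as a small lookup, sketched here for illustration. The function name `interpret_icc` is hypothetical, and the handling of values that fall between the printed bands (for example, between .10 and .11) is an assumption: each band is taken to extend up to and including its stated upper limit.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC value to the descriptive bands quoted in the text.

    Assumption: each band runs up to and including its printed upper
    limit, so a value such as .105 is labeled "slight".
    """
    if icc > 1.0:
        raise ValueError("an ICC cannot exceed 1.0")
    if icc < 0.0:
        return "poor"
    if icc <= 0.10:
        return "virtually none"
    if icc <= 0.40:
        return "slight"
    if icc <= 0.60:
        return "fair"
    if icc <= 0.80:
        return "moderate"
    return "substantial"


# ICC values reported in the abstract (performance, then safety).
for value in (0.855, 0.950, 0.901, 0.061, 0.228, 0.254):
    print(f"{value:.3f}: {interpret_icc(value)}")
```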
Because of the known effect of variability on κ,14 percentage of rater agreement also was determined for each individual skill. Calculations were performed using SPSS software and a script developed by Dates,15 which allows for the inclusion of 4 raters when calculating κ and percentage of rater agreement (unlike the standard SPSS analysis, which accommodates 2 raters). Results for percentage of rater agreement were interpreted qualitatively, using a threshold of at least 85% as a target indicating acceptable agreement. This threshold was based on our judgment of the magnitude of a minimum clinically significant difference. When there was not 100% rater agreement regarding a skill, analysis of the discrepancy was conducted and discrepancies were assigned to 1 of 4 categories. Video errors were errors of video playback. These involved disc errors, deletion of portions of the video on some digital copies of the original video recording, and inconsistent scene playback between the first and second viewing of a skill. A performance interpretation discrepancy was the result of a disagreement among raters about whether a skill was successfully performed. A safety interpretation discrepancy was a difference of opinion about whether performance of a skill was completed in a safe or unsafe manner. Last, a discrepancy involving a scoring rule interpretation resulted from a rater misunderstanding or overlooking a scoring rule present in the WST 4.1 manual.
RESULTS
Participant Demographic, Clinical, and Wheelchair Data
Demographic, clinical, and wheelchair-use data for participants are listed in table 1. Age, time of wheelchair use, and
Table 1: Participant Demographics and Wheelchair Use
Parameter                            Value
Age (y)                              42.1±16.2
Sex (male/female)                    9/2
Height (m)                           1.70±0.12
Weight (kg)                          74.5±19.4
Diagnosis
  Spinal cord injury                 9
  Stroke                             1
  Arteriovenous malformation         1
Wheelchair use (overall, y)          9.7±9.6 (1–37)
Wheelchair use (current chair, y)    5.6±3.8 (2wk–10y)
Propulsion method
  2 hands                            10
  2 hands and 2 feet                 1
Location of wheelchair use
  Home only                          1
  Community only                     1
  Both home and community            9
NOTE. Values expressed as mean ± SD (range) or n.
diagnosis varied greatly among participants. Most participants had conventional wheelchairs, although 4 had lightweight wheelchairs and 1 had an ultralight chair. Seven participants used wheelchairs with a rigid frame, and all participants’ wheelchairs had tire diameters of 61 or 64cm.
WST 4.1 Data
All 11 participants completed T1, whereas only 10 completed T2. There were 21 performances of the WST 4.1 and 86 separate scorings conducted by the 4 raters. The overall mean ± SD performance score was 80.1%±8.5% (range, 63.3%–100.0%), and mean safety score was 98.0%±2.8% (range, 89.7%–100.0%). Performance scores, but not safety scores, improved from T1 to T2 (P=.041). Several skills were performed successfully by all participants (table 2). The stairs task was not attempted by most participants, although 1 person successfully descended stairs. Other than stairs, the next most difficult task was the 15-cm curb, with only 26% of participants able to successfully ascend a 15-cm curb. The lowest safety scores also were related to the 15-cm curb, with 16% of participants scored as unsafe on the descent of the 15-cm curb.
Reliability
ICC results are listed in table 3. ICC values for the performance component for all types of reliability were in the substantial agreement range. ICC scores for the safety component indicated virtually none to slight agreement. In the analysis of individual skills (table 2), κ scores for performance generally were higher than for safety. Percentage of rater agreement scores ranged from 68% to 100% for both the performance and safety components. For the performance component, 25 (83%) skills had rater agreement greater than 85%, and 27 (90%) skills for the safety component had rater agreement greater than 85%. If we used a lower threshold (80%), only 1 skill (rolls backward 5m) for performance and 2 for safety (rolls backward 5m and transfers) had rater agreement less than that threshold (see table 2).
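The threshold counts just reported can be checked directly against the percentage of rater agreement columns of table 2. The short sketch below does so; the list values are transcribed from table 2 in item order (1 to 30), and the helper names `count_above` and `count_below` are illustrative assumptions rather than part of the original analysis.

```python
from typing import Sequence

# Percentage of rater agreement per skill, items 1-30, transcribed from table 2.
performance_ra = [100, 100, 68, 100, 100, 95, 85, 88, 100, 100,
                  91, 82, 92, 100, 100, 95, 100, 100, 100, 95,
                  95, 95, 100, 91, 95, 94, 91, 83, 91, 82]
safety_ra = [100, 100, 68, 100, 100, 100, 89, 88, 100, 100,
             91, 76, 92, 100, 100, 95, 100, 100, 100, 95,
             95, 100, 100, 92, 95, 94, 86, 86, 91, 85]


def count_above(values: Sequence[int], threshold: int) -> int:
    """Number of skills with rater agreement strictly greater than threshold."""
    return sum(1 for v in values if v > threshold)


def count_below(values: Sequence[int], threshold: int) -> int:
    """Number of skills with rater agreement strictly less than threshold."""
    return sum(1 for v in values if v < threshold)


for label, values in (("performance", performance_ra), ("safety", safety_ra)):
    above = count_above(values, 85)
    share = round(100 * above / len(values))
    print(f"{label}: {above} of {len(values)} skills ({share}%) above 85%, "
          f"{count_below(values, 80)} below 80%")
```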
The skills found to be most problematic according to percentage of rater agreement for performance were rolls backward 5m, maneuvers sideways, transfers from wheelchair to bench and back, gets from ground into wheelchair, and descends stairs. For safety, the problematic skills were rolls backward 5m, transfers from wheelchair to bench and back,
Table 2: Success, κ, and Percentage of Rater Agreement for T1 Performance and Safety Scores
                                           Performance                    Safety
No.   Skill                                n   % Success  κ        % RA   n   % Safe  κ        % RA
1     Rolls forward 10m                    44  100        NC       100    44  100     NC       100
2     Rolls forward 10m in 30s             44  100        NC       100    44  100     NC       100
3     Rolls backward 5m                    44  84         -.1892   68     44  84      -.1892   68
4     Turns 90° moving forward             44  100        NC       100    44  100     NC       100
5     Turns 90° moving backward            44  100        NC       100    44  100     NC       100
6     Turns 180° in place                  44  98         -.0233   95     44  100     NC       100
7     Maneuvers sideways                   41  93         .3812    85     41  100     .1653    89
8     Gets through hinged door             42  98         .0688    88     42  98      .0688    88
9     Reaches 1.5-m object                 44  100        NC       100    44  100     NC       100
10    Picks object up from floor           44  100        NC       100    44  100     NC       100
11    Relieves weight from buttocks        44  91         .4500    91     44  95      -.0476   91
12    Transfers from WC to bench and back  42  69         .6190    82     42  88      .1309    76
13    Folds and unfolds WC                 42  43         .8582    92     42  100     .1373    92
14    Rolls 100m                           44  100        NC       100    44  100     NC       100
15    Avoids moving obstacles              44  100        NC       100    44  100     NC       100
16    Ascends 7.5° incline                 44  98         -.0233   95     44  98      -.0233   95
17    Descends 7.5° incline                44  100        NC       100    44  100     NC       100
18    Rolls 2m across 5° side slope        44  100        NC       100    44  100     NC       100
19    Rolls 2m on soft surface             44  100        NC       100    44  100     NC       100
20    Gets over 15-cm pothole              44  75         .8788    95     44  98      -.0233   95
21    Gets over 2-cm threshold             43  100        -.0233   95     43  100     -.0233   95
22    Ascends 5-cm level change            44  93         .6423    95     44  100     NC       100
23    Descends 5-cm level change           44  100        NC       100    44  100     NC       100
24    Ascends 15-cm curb                   43  26         .7772    91     43  98      .1373    92
25    Descends 15-cm curb                  44  57         .9074    95     44  84      .8301    95
26    Performs 30-s stationary wheelie     44  50         .8788    94     44  95      .3016    94
27    Turns 180° in place in wheelie       43  42         .6735    91     41  100     -.0560   86
28    Gets from ground into WC             41  39         .6915    83     41  100     -.0732   86
29    Ascends stairs                       40  0          .4601    91     40  100     .4601    91
30    Descends stairs                      40  10         .4267    82     40  92      .3863    85
Abbreviations: n, number of times skill was tested (most skills were attempted 44 times [11 participants rated by 4 raters] with skills for which a participant received a rating of not tested or testing error having <44 attempts); NC, not calculated (κ could not be calculated because of lack of variability; for all instances for which κ was not calculated, participants were 100% successful in performance and safety of that skill); RA, rater agreement among 4 raters; WC, wheelchair.
and descends stairs (see table 2). Most (54%) performance and safety discrepancies occurred because of scoring rule interpretations. The remaining discrepancies were distributed among video errors (14.7%), performance interpretations (14.7%), and safety interpretations (16.4%).
Safety and Other Observations
No serious adverse events occurred during administration of the WST, and there was only 1 minor adverse event: 1 participant fell without injury after declining spotter intervention during the transfer task. The testers perceived that conducting the WST 4.1 was simple and the training received was adequate to properly and safely administer the WST 4.1. The test was well tolerated by participants. Mean testing times were 34.5±14.2 minutes for T1 and 27.9±8.5 minutes for T2 (P=.356).
DISCUSSION
This study was the first assessment of the reliability of WST 4.1 for manual wheelchair users. Our findings for intrarater and test-retest performance reliability were similar to those reported for WST 2.48 and only slightly lower than those for WST 3.2.10 Our ICC value for interrater reliability was lower than those previously reported for both WST 2.48 and 3.2,10 but
Table 3: Performance and Safety ICCs for Interrater, Intrarater, and Test-Retest Reliability
Variable        Participant n  Rater n  ICC    95% Confidence Limit  P
Performance
  Interrater    11             4        .855   .683 to .953          <.001
  Intrarater    11             2        .950   .880 to .984          <.001
  Test-retest   10             2        .901   .768 to .971          <.001
Safety
  Interrater    11             4        .061   -.086 to .284         .243
  Intrarater    11             2        .228   -.034 to .609         .048
  Test-retest   10             2        .254   -.026 to .651         .041
it still showed substantial agreement. WST 4.1 included for the first time an additional safety component score. ICC and κ values related to scores of the safety component were substantially lower than for performance scores. This finding, related to agreement between raters for total scores (ICC) and individual skill success (κ), seemingly indicated slight to fair reliability for safety scores. However, percentage of rater agreement scores related to the safety component generally were high. To resolve the apparent discrepancy between κ and percentage of rater agreement safety scores, we looked more closely at the percentage of rater agreement scores and found fewer rater discrepancies in the safety component than in the performance component, again, seemingly in disagreement with the ICC and κ values obtained during statistical analysis. To determine whether skills with lower rater percentage of agreement could be contributing to the low safety reliability, the ICC safety values were recalculated excluding the 4 skills with the lowest rater agreement. This increased the 3 safety-component ICC values slightly to .282, .362, and .334 (for interrater, intrarater, and test-retest reliability), but all remained in the slight-agreement range. There was limited variability in safety scores (ie, most people were safe for most skills; see table 2), and the small improvement in ICC values suggests that the low ICC values were caused less by rater disagreement than by lack of variability in the WST safety scores.16 Our finding that κ was higher for skills for which percentage of success was lower (ie, greater variability in the data) supports this argument. Our results related to safety were similar to those of a recent study that tested the reliability of a Walking Safety Scale for older adults.17 Dube et al17 also reported low variability in safety scores, which resulted in a generally high percentage of rater agreement values despite low κ values.
The apparent paradox between low κ values and high rater agreement has been reported in the literature.14,16 Our findings related to percentage of agreement between raters suggest that safety can be assessed reliably. Future reliability studies with participants on the lower end of the performance and safety spectrum will expand our knowledge regarding the reliability of the WST 4.1 safety component. The results related to safety provide the opportunity for reflection about the purpose of the inclusion of the safety score in WST 4.1. Assessment of safety allows the tester to document whether the skill was failed because it was not completed, the participant made an unsafe attempt and showed poor judgment, or the participant declined to attempt it, thereby showing safety and good judgment. Detecting a wheelchair user’s unsafe skill performance has the potential to improve overall safety by providing the stimulus for intervention (eg, wheelchair modification or training). However, a challenging aspect of the interpretation of the safety scores may be the use of the screening questions for the more difficult skills. Participants may state that they are unable to complete a task because they either are not confident or know from experience that they cannot complete the task. However, at the point and time when the WST 4.1 is completed, the distinction between confidence and ability may not be that important because the consequence is the same for all practical purposes. In either case, the person has still made a safe decision not to complete the task. The issue of confidence or self-efficacy with respect to wheelchair skills is an interesting topic,18 but is not addressed by the WST 4.1. Our analysis of individual skill scoring discrepancies may provide suggestions about how the scoring instructions and criteria can be clarified. As noted, most (54%) performance and
safety discrepancies were caused by scoring rule interpretation errors. In these instances, although guidelines were in place for scoring, raters scored the skills differently because an individual rater misinterpreted or overlooked specific rules in the WST 4.1 manual. For example, the skill with the lowest percentage of rater agreement (68%) was rolls backward 5m. Percentage of rater agreement was low for this skill because 1 rater consistently failed participants for not looking over their shoulders while rolling backward, whereas the other raters did not. The correct interpretation of this scoring rule is that failure to shoulder-check should be included as a comment, but does not constitute failure of the skill. This led us to conclude that reliability could be improved with increased tester training and more practice in scoring the WST 4.1. Despite some difficulties in scoring, conducting the WST 4.1 was relatively simple. The testers perceived that the training received was adequate to properly and safely administer the WST 4.1. Required equipment should be easy to find within most facilities, and when equipment is not available, the WST 4.1 is adaptable to different environments. Equipment deficiencies, such as ramps and stairs, were discussed with the WST 4.1 developers. Clarification was received that documented changes to equipment are permissible to allow for testing in different settings and locations. Although the 4 raters were novices, the reliability obtained for the performance component was substantial, indicating that the WST 4.1 and WST training program are adequate to produce acceptable reliability even for users with little experience. The reliability of previous versions of the WST was assessed by administrators with a moderate amount of experience administering and rating the test.
These reliability studies had results similar to our study.8,10
Study Limitations
There were limitations to this study, some of which have already been noted. Sample size was small, but power was adequate to identify significant correlations. Although participants were heterogeneous with respect to some characteristics, they were all experienced wheelchair users, 9 of 11 used their wheelchairs because of spinal cord injury, and it is likely that they were reasonably fit (given that they were recruited from an exercise center). It is likely that their performance and safety scores were higher than they would have been when they were first learning to use their wheelchairs. The testers who administered the test were also the raters who scored the video recordings. A more robust study design would have had separate testers and raters to eliminate the potential for bias in scoring. In the present study design, it was possible for a rater to base the scoring on both the video recording and the rater’s memory of the participant attempting the skill. This likely affected results only minimally because the time between test administration and test scoring averaged 18 weeks. It also is possible that there was a learning effect between T1 and T2 for both participants and raters. Using the combined performance scores for the 2 raters who scored T1 and T2, performance scores significantly improved from T1 to T2 (P=.041). However, this improvement was only from 78.8% to 81.2% and as such does not clearly indicate that there was a learning effect. There was no difference in safety scores from T1 to T2. For testers/raters, it is possible that they became better at administering the WST as the study progressed or improved their scoring as they progressed through the video recordings. Testing time was lower for T2 than T1, but not significantly different.
An alternative explanation is that the first WST was a stimulus for participants to think about and improve their wheelchair skills (intrinsic learning). This explanation is consistent with improvements seen in the control groups of wheelchair users in randomized controlled studies evaluating the efficacy of skills training.4,6 One difficulty encountered in the study design was the use of video recordings. In several instances, skills were absent from copies of the original disc. In addition, problems with positioning of the video camera and the available lighting resulted in difficulties seeing boundaries on the ground or knowing if an intervention had occurred to prevent a fall. These problems could be avoided if scoring was done at the same time as testing, as usually is the case clinically. In a real-life situation, if a tester is unclear about how the wheelchair user performs a skill (eg, because of being unseen), the skill attempt can be repeated. However, this was not practical in this study. Videotaping was necessary for the study design and in some ways may be better than viewing and scoring the WST live. It may allow more accurate scoring because skills can be replayed when the rater is unsure of task performance or safety. A final limitation was the limited scope of the study, focusing as it did on reliability. There are other important dimensions (eg, validity, responsiveness) to the issue of measurement properties of a tool like the WST. Future studies should be performed to address these limitations, replicate our findings, and extend the assessment of measurement properties.
CONCLUSIONS
WST 4.1 is a reliable tool to evaluate a manual wheelchair user’s skill performance. This study was the first to report the reliability of scoring the performance and safety component of WST 4.1. Although ICC values for the safety component suggest that raters were less reliable when assessing safety, rater agreement was high and our findings likely show the paradox of the low κ and high rater agreement that is shown when variability in scores is low.16 Further study is warranted with a larger and more diverse group of subjects.
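The paradox of low κ with high rater agreement can be reproduced with a small worked example. The sketch below is a simplification and not the analysis used in the study: it computes Cohen's κ for only 2 raters (the study used a script that accommodates 4), and the ratings, function names, and the choice to return `None` when κ is undefined are all assumptions made for illustration.

```python
from typing import List, Optional


def percent_agreement(a: List[str], b: List[str]) -> float:
    """Percentage of items on which two raters gave the same grade."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must grade the same non-empty item set")
    return 100.0 * sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cohen_kappa(a: List[str], b: List[str]) -> Optional[float]:
    """Cohen's kappa for two raters; None when there is no variability."""
    n = len(a)
    p_observed = percent_agreement(a, b) / 100.0
    # Chance agreement from each rater's marginal proportions per category.
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if abs(1.0 - p_expected) < 1e-12:
        return None  # every grade identical: kappa is undefined ("NC")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical safety grades for 20 attempts: each rater marks one
# attempt unsafe, but not the same attempt.
rater_a = ["safe"] * 18 + ["unsafe", "safe"]
rater_b = ["safe"] * 18 + ["safe", "unsafe"]

agreement = percent_agreement(rater_a, rater_b)  # 90.0
kappa = cohen_kappa(rater_a, rater_b)            # about -0.05
print(f"agreement {agreement:.0f}%, kappa {kappa:.3f}")
```

Because nearly every grade is "safe", chance agreement is already about 90%, so 2 disagreements in 20 attempts are enough to push κ slightly below zero even though the raters agree on 90% of attempts. When every grade is "safe", chance agreement equals 1 and κ cannot be computed, which corresponds to the NC entries in table 2.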
References
1. Flagg J. Wheeled mobility demographics. Buffalo: Rehabilitation Engineering Center on Technology Transfer; 2009. p 7-29.
2. Meyers AR, Anderson JJ, Miller DR, Shipp K, Hoenig H. Barriers, facilitators, and access for wheelchair users: substantive and methodologic lessons from a pilot study of environmental effects. Soc Sci Med 2002;55:1435-46.
3. Dalhousie University. Wheelchair Skills Test, version 4.1. Available at: http://www.wheelchairskillsprogram.ca/eng/4.1/WST_Manual_Version4.1.51.pdf. Accessed March 5, 2010.
4. MacPhee AH, Kirby RL, Coolen AL, Smith C, MacLeod DA, Dupuis DJ. Wheelchair skills training program: a randomized clinical trial of wheelchair users undergoing initial rehabilitation. Arch Phys Med Rehabil 2004;85:41-50.
5. Coolen AL, Kirby RL, Landry J, et al. Wheelchair skills training program for clinicians: a randomized controlled trial with occupational therapy students. Arch Phys Med Rehabil 2004;85:1160-7.
6. Best KL, Kirby RL, Smith C, MacLeod DA. Wheelchair skills training for community-based manual wheelchair users: a randomized controlled trial. Arch Phys Med Rehabil 2005;86:2316-23.
7. Mountain AD, Smith C, Kirby RL. Are wheelchair-skills assessment and training relevant for long-standing wheelchair users? Two case reports. Disabil Rehabil Assist Technol 2010;5:230-33.
8. Kirby RL, Dupuis DJ, Macphee AH, et al. The Wheelchair Skills Test (version 2.4): measurement properties. Arch Phys Med Rehabil 2004;85:794-804.
9. Kirby RL, Swuste J, Dupuis DJ, MacLeod DA, Monroe R. The Wheelchair Skills Test: a pilot study of a new outcome measure. Arch Phys Med Rehabil 2002;83:10-8.
10. Routhier F, Demers L, Kirby RL, et al. Inter-rater and test-retest reliability of the French Canadian Wheelchair Skills Test (version 3.2): preliminary findings. Proceedings of the Annual Meeting of RESNA; Phoenix, AZ; June 15-19, 2007.
11. Kilkens OJ, Post MW, Dallmeijer AJ, Seelen HA, van der Woude LH. Wheelchair skills tests: a systematic review. Clin Rehabil 2003;17:418-30.
12. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 2nd ed. Upper Saddle River: Prentice Hall Health; 2000.
13. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420-8.
14. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-8.
15. Dates B. Syntax for Cohen’s Kappa. Available at: http://www.spsstools.net/Syntax/Matrix/CohensKappa.txt. Accessed January 15, 2010.
16. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543-9.
17. Dube F, Rousseau J, Kaegi C, Boudreault R, Nadeau S. Development of a walking-safety scale for older adults, part II: interrater and test-retest agreement of the GEM Scale. Physiother Can 2008;60:274-82.
18. Rushton PW, Miller WC, Kirby RL, Eng JJ, Yip J. Development and content validation of the Wheelchair Confidence Scale: a mixed-methods study. Disabil Rehabil Assist Technol 2010 Early online.
Supplier
a. SPSS Inc, 233 S Wacker Dr, 11th Fl, Chicago, IL 60606.
Arch Phys Med Rehabil Vol 91, November 2010