Accepted Manuscript
Reliability and responsiveness of the Activities of Daily Living Computerized Adaptive Testing system in patients with stroke
Ya-Chen Lee, MS; Wan-Hui Yu, MS; Yu-Fen Lin, MS; I-Ping Hsueh, MA; Hung-Chia Wu, MS; Ching-Lin Hsieh, PhD
PII: S0003-9993(14)00347-5
DOI: 10.1016/j.apmr.2014.04.025
Reference: YAPMR 55832
To appear in: ARCHIVES OF PHYSICAL MEDICINE AND REHABILITATION
Received Date: 19 December 2013
Revised Date: 11 March 2014
Accepted Date: 24 April 2014
Please cite this article as: Lee Y-C, Yu W-H, Lin Y-F, Hsueh I-P, Wu H-C, Hsieh C-L, Reliability and responsiveness of the Activities of Daily Living Computerized Adaptive Testing system in patients with stroke, ARCHIVES OF PHYSICAL MEDICINE AND REHABILITATION (2014), doi: 10.1016/j.apmr.2014.04.025.
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Title Page
Running head: Reliability and responsiveness of the Activities of Daily Living Computerized Adaptive Testing system
Title: Reliability and responsiveness of the Activities of Daily Living Computerized Adaptive Testing system in patients with stroke
Authors: Ya-Chen Lee, MS; Wan-Hui Yu, MS; Yu-Fen Lin, MS; I-Ping Hsueh, MA; Hung-Chia Wu, MS; Ching-Lin Hsieh, PhD
Authors’ affiliations: Ya-Chen Lee, Wan-Hui Yu, Yu-Fen Lin, I-Ping Hsueh & Ching-Lin Hsieh: School of Occupational Therapy, College of Medicine, National Taiwan University and Department of Physical Medicine and Rehabilitation, National Taiwan University Hospital, Taipei, Taiwan. Hung-Chia Wu: Department of Physical Medicine and Rehabilitation, E-Da Hospital, Kaohsiung, Taiwan.
Acknowledgment of financial support:
This study was supported by a research grant from the National Health Research Institutes (NHRI-EX102-1007PI) and the E-Da Hospital (EDAHT 102008 & EDAHT 103024).
Financial Disclosure Statement:
No party having a direct interest in the results of the research supporting this article has or will confer a benefit on us or on any organization with which we are associated. We certify that all financial and material support for this research (eg, NHRI or EDAHT grants in Taiwan) is clearly identified in the title page of the manuscript.
Address for correspondence and reprints: I-Ping Hsueh, School of Occupational Therapy, College of Medicine, National Taiwan University, 4F, No 17, Xuzhou Rd, Taipei 100, Taiwan.
Fax: +886-2-23511331
Tel: +886-2-33668174
E-mail: [email protected]
Reliability and responsiveness of the Activities of Daily Living Computerized Adaptive Testing system in patients with stroke
Abstract
Objective: To examine the intra-rater reliability, inter-rater reliability, and responsiveness of the Activities of Daily Living Computerized Adaptive Testing system (ADL CAT) in patients with stroke.
Design: A repeated-measures design (with a 7-day interval) was used to examine the intra-rater reliability and inter-rater reliability of the ADL CAT. For the responsiveness study, participants were assessed with the ADL CAT at admission to the rehabilitation ward and at discharge from the hospital.
Setting: Eight rehabilitation units.
Participants: Three different (non-overlapping) patient groups were recruited. Fifty-five and 42 outpatients with chronic stroke participated in the intra-rater and inter-rater reliability studies, respectively; 60 inpatients who had recently had a stroke participated in the responsiveness study.
Interventions: Not applicable.
Main Outcome Measure: The ADL CAT.
Results: The intraclass correlation coefficient (ICC) values were 0.94 and 0.80 for the ADL CAT in the intra-rater reliability and inter-rater reliability studies, respectively. The Classical Test Theory-based minimal detectable change (MDCCTT) values were 6.5 and 9.5 for the ADL CAT in the intra-rater reliability and inter-rater reliability studies, respectively. The Kazis’ effect size and standardized response mean of the ADL CAT were moderate (0.62-0.73).
Conclusions: The results of this study showed that the ADL CAT has good intra-rater reliability and inter-rater reliability in outpatients with chronic stroke and sufficient responsiveness in inpatients with stroke undergoing inpatient rehabilitation. Further investigations on the responsiveness of the ADL CAT in outpatients are needed to obtain more evidence on the utility of the ADL CAT.
Key Words: reliability; responsiveness; activities of daily living; computerized adaptive testing; stroke.
List of abbreviations:
ADL activities of daily living
ADL CAT ADL Computerized Adaptive Testing system
BADL basic ADL
BI Barthel Index
CI confidence interval
CTT classical test theory
FAI Frenchay Activities Index
IADL instrumental ADL
ICC intraclass correlation coefficient
IRT item response theory
LOA limits of agreement
MDC minimal detectable change
SEE standard error of estimate
SEM standard error of measurement
SRM standardized response mean
Stroke is a major cause of disability in activities of daily living (ADL) among the elderly.1-3 Assessing ADL is important for clinicians in planning ADL intervention, estimating care requirements, and monitoring outcomes.4 To be clinically useful, a short and precise ADL measure is preferred to improve the administrative efficiency and reduce assessment burden.1
The ADL Computerized Adaptive Testing system (ADL CAT) was developed to achieve both efficiency and precision of ADL assessments.1 The ADL CAT has three advantages. First, the ADL CAT is quick to complete. The ADL CAT chooses only items tailored to a patient and skips items that are either too easy or too difficult for that patient; it requires an average of only 88 seconds to complete.1 Such efficiency is unlikely to be achieved with traditional measures, such as the Functional Independence Measure and Frenchay Activities Index (FAI). Thus, the ADL CAT can enhance the efficiency of administration and reduce the assessment burden on patients and raters.5 Second, the ADL CAT assesses a broad spectrum of ADL function. Commonly, ADL refers to basic ADL (BADL).6, 7 However, assessing BADL does not capture the information on higher levels of ADL functions that are necessary for independence in the home and community (i.e., instrumental ADL, IADL).3 The ADL CAT combines both the BADL and IADL items into one item bank to comprehensively assess patients’ ADL functions.1 Third, the ADL CAT takes into account gender differences in performing some IADL items (i.e., domestic chores) and thus assigns different weights to these IADL items to prevent underestimation of male patients’ ADL function in performing domestic chore items.1 Due to the aforementioned advantages, the ADL CAT demonstrates great potential for use in clinical and research settings.
Validity of the ADL CAT has been well examined.1 The ADL CAT has high concurrent validity (Pearson’s r=0.82) with the combined Barthel Index (BI, assessing BADL) and FAI (assessing IADL) in patients with stroke.1 In addition, the 34 BADL and IADL items of the ADL CAT item bank are unidimensional.1 Thus, the construct validity of the item bank of the ADL CAT is supported.
However, some other important psychometric properties, such as intra-rater reliability, inter-rater reliability, and responsiveness of the ADL CAT, are still unknown, thus limiting the utility of the measure. Intra-rater reliability reflects the extent of consistency between repeated assessments administered by the same rater.8 Inter-rater reliability indicates whether different raters give consistent scores when administering a measure to the same group of patients.8 Responsiveness refers to a measure’s ability to detect change that occurs as a result of therapy or disease progression.9 It is critical for the ADL CAT to have sufficient intra-rater reliability, inter-rater reliability, and responsiveness to ensure its utility in clinical and research settings. Thus, the aims of this study were (1) to examine the intra-rater reliability and inter-rater reliability of the ADL CAT and (2) to investigate the responsiveness of the ADL CAT in patients with stroke.
METHODS
Participants
Intra-rater and inter-rater reliability
We recruited two convenience samples of outpatients with chronic stroke from the Department of Physical Medicine and Rehabilitation at seven hospitals between March 2011 and May 2012. One convenience sample was for examining intra-rater reliability; the other was for examining inter-rater reliability. All participants met the following criteria: (1) diagnosis of cerebral hemorrhage or cerebral infarction; (2) stroke onset ≥6 months earlier. In addition, all participants received traditional rehabilitation (e.g., occupational therapy, physical therapy, or speech and language pathology where needed, 1 to 3 times per week for each therapy). The traditional rehabilitation provided training in ADL, mobility, endurance and strength, balance, communication, or language-based skills. We excluded patients with major comorbidities (e.g., dementia or rheumatoid arthritis) or recurrent stroke during the study period that might influence ADL functioning.
Responsiveness
We recruited a consecutive sample of patients undergoing inpatient rehabilitation at one hospital from May 2011 to January 2013. The inclusion criterion for selecting participants was diagnosis of cerebral hemorrhage or cerebral infarction. All participants received inpatient rehabilitation services (e.g., occupational therapy, physical therapy, or speech and language pathology therapy, 3-5 times a week for each therapy). The inpatient rehabilitation services focused on training in ADL, mobility, motor recovery, endurance and strength, balance, chewing, or swallowing, where appropriate. Patients with major comorbidities were excluded. Moreover, we excluded patients who stayed in the ward for less than 7 days because their ADL functions tended to be stable, as indicated by the short hospital stay. The whole study was approved by the local institutional review boards.
Procedure
Prior to the study, the raters (raters A and B, both occupational therapists) received at least 2 hours of training from the first author (a very experienced ADL CAT user) on the administration of the ADL CAT. During the training session, the raters had to familiarize themselves with the items, response categories, interview procedures, and scoring. At the end of the training session, both raters individually interviewed 4 to 6 patients while the first author observed and scored at the same time. Then the raters’ interview procedures and scoring results were checked by the first author to ensure that the procedures and results were satisfactory.
During the study, the raters interviewed the patient and his/her primary caregiver, if available, to assess the patient’s level of independence in daily life. The raters asked the patient whether he or she had done a specific ADL task in the pre-specified time frame (e.g., “whether or not the patient actually put on pants or shorts him/herself in the previous 1-2 days before assessment”). If the patient had done the task, the rater asked whether the patient had done it by himself/herself or with assistance. If it was the latter, the rater further asked about the level of assistance during the task. If responses were obtained from both the patient and his/her primary caregiver and there was a discrepancy, the raters clarified the discrepancy with the patient and his/her primary caregiver. If the discrepancy persisted after this clarification, the rater checked with the patient and his/her caregiver simultaneously to determine how the patient actually performed the ADL task within the time frame. If a patient had difficulty responding to the interview (e.g., a patient with aphasia or cognitive-perceptual deficits), the patient’s primary caregiver was interviewed instead.
Intra-rater reliability
The ADL CAT was administered to the participants twice by rater A, 7 days apart.
Inter-rater reliability
Raters A and B independently interviewed and scored the participants, 7 days apart, and did not communicate with each other or the first author throughout the study. Such an independent interview and scoring design was intended to simulate the actual context of ADL assessments in both clinical and research settings.
Responsiveness
The ADL CAT was administered to the patients by rater B at admission to the rehabilitation ward and at discharge from the hospital.
Measures
ADL CAT. The ADL CAT, containing an item bank with 11 BADL tasks and 23 IADL tasks, can be administered using a personal digital device (e.g., smart phone, tablet) via the Internet.1 The ADL CAT assesses a patient’s actual performance (in terms of level of assistance) in his/her daily tasks, which provides information on patients’ ADL functions regarding the integration of the physical environment, personal assistance, and the use (or nonuse) of assistive technology. Thus, the results of the ADL CAT indicate the level of dependence/disability of a patient in real life and can be viewed as an outcome indicator and as an indicator of the level of burden of long-term care.3, 10
In the initial development of the ADL CAT, a total of 4 response categories (“totally dependent,” “partially dependent,” “sometimes independent, but not every time,” and “totally independent, every time”) were proposed. However, after estimation of the item parameters, a few of the items showed response category reversal and their response categories were collapsed.1 The revised version of the ADL CAT therefore has items with 3 kinds of response categories (2, 3, and 4 categories). However, such variation in the response categories can be confusing for prospective users. To achieve ease of scoring, the developers of the ADL CAT kept the 4-category design of the response categories of the 34 items.1
The ADL CAT presents subsequent items on the basis of the responses of the patients. For example, if a female patient was totally dependent on someone to complete the task of putting on pants or shorts (lowest level of independence), the patient and/or her primary caregiver was asked whether or not the patient went to the toilet to eliminate urine herself and, after elimination, arranged her clothes and cleaned herself (an easier task than the “putting on pants or shorts” task). On the other hand, if the patient independently completed the “putting on pants or shorts” task herself every time (highest level of independence), the patient and/or her primary caregiver was asked whether the patient had washed dishes during the past week (a harder task than the “putting on pants or shorts” task).
The stopping rule of the ADL CAT for each patient is either reliability (estimated by item response theory (IRT)) > 0.90 or a maximum test length of 7 items.1 The original ADL CAT scores are standardized scores ranging from -2.65 to 2.56. For easier interpretation, the standardized scores are transformed to T scores (mean=50, SD=10). The T score of the ADL CAT ranges from 22.0 to 77.2.1 When a patient independently performs all BADL tasks but none of the IADL tasks presented by the CAT, the highest possible T score is 55. Because IADL tasks are usually more difficult than BADL tasks for patients, a T score of 55 can be generally used to indicate whether a patient can independently perform all BADL tasks.
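The stopping rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ADL CAT's actual implementation: the function name `should_stop` and its parameters are hypothetical, and it assumes the SEE is expressed on the standardized score scale, so that IRT reliability can be taken as 1 - SEE^2 (the relation given in the Analysis section).

```python
def should_stop(see, n_items, reliability_target=0.90, max_items=7):
    """Return True when a CAT session should end (hypothetical helper).

    see     : standard error of estimate of the current ability estimate,
              assumed to be on the standardized (not T-score) scale
    n_items : number of items administered so far
    """
    reliability = 1.0 - see ** 2  # IRT reliability derived from the SEE
    # Stop as soon as either condition of the stopping rule is met.
    return reliability > reliability_target or n_items >= max_items
```

Under these assumptions, a session would end early for a patient whose estimate is already precise and would otherwise run until the seventh item.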
Analysis
Intra-rater and inter-rater reliability. Intraclass correlation coefficients (ICC(2,1)) were employed to examine the extent of agreement between repeated assessments of the ADL CAT administered twice by the same rater (intra-rater reliability) or by two raters individually (inter-rater reliability). ICC values from 0.90-0.99 indicate high reliability; those from 0.80-0.89, good reliability; those from 0.70-0.79, fair reliability; those below 0.69, poor reliability.11
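As a rough illustration of the agreement index named above, the following Python sketch computes ICC(2,1) (two-way random effects, absolute agreement, single measure) from the usual ANOVA mean squares. The function names and the data layout are assumptions made for this example, and the band edges in `classify_icc` simply close the small gaps between the ranges quoted in the text; the study's own software and settings are not described in the manuscript.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table with one row per participant and one
    column per assessment (e.g., [first_score, second_score])."""
    n = len(scores)        # number of participants
    k = len(scores[0])     # number of assessments per participant
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-participant mean square
    msc = ss_cols / (k - 1)                # between-assessment mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def classify_icc(icc):
    """Label an ICC using the bands quoted in the text (edges assumed)."""
    if icc >= 0.90:
        return "high"
    if icc >= 0.80:
        return "good"
    if icc >= 0.70:
        return "fair"
    return "poor"
```

With the values reported in this study, `classify_icc(0.94)` gives "high" and `classify_icc(0.80)` gives "good".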
We estimated the minimal detectable change (MDC) of the ADL CAT administered by a single rater (intra-rater) and by different raters (inter-rater). The MDC is the smallest change score that is detectable beyond random error at a certain level of confidence (usually 95%). The value of the MDC can be used as a threshold to determine whether the change score of an individual patient on a measure reflects a real improvement (or deterioration) or is due to measurement error.12, 13 The MDC values were estimated using Classical Test Theory (CTT) and IRT. The MDCCTT uses a single estimate of random measurement error for the measure across the whole score continuum.14 That is, a single value of MDCCTT can be used for all levels of ADL function (all scores of the ADL CAT in this study). On the other hand, the MDCIRT can be estimated for each assessment and takes into account that a measure does not assess a characteristic equally well and precisely over the whole score continuum.15, 16 The value of MDCIRT of each score of a measure depends on the items tested.17
In CTT, both MDC values were calculated on the basis of the standard error of measurement (SEM) using the following formulas:12
$$\mathrm{MDC}_{\mathrm{CTT}} = z\text{-score}_{\text{level of confidence}} \times \sqrt{2} \times \mathrm{SEM} \tag{1}$$
$$\mathrm{SEM} = \mathrm{SD}_{\text{all testing scores}} \times \sqrt{1 - r} \tag{2}$$
The z-score corresponds to the chosen level of confidence in a standard normal distribution (i.e., 1.96 for 95% confidence in this study). The SD of all testing scores is the SD of all scores from the two assessments, and r is the coefficient of the intra-rater or inter-rater reliability (ICC value). The multiplier of √2 accounts for the additional uncertainty introduced by inclusion of scores from two separate assessments.12 Using CTT, the MDC values are invariant for every ability estimate along the scale, because the SEM depends on the population distribution and is assumed to be the same for each person.18
The MDCCTT value was considered acceptable when this value was <20% of the mean of all scores in a measure.19
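The CTT-based calculation can be written out as a short Python sketch. The function names are hypothetical, and the inputs (the SD of all scores from both assessments, the ICC used as r, and the mean of all scores) are assumed to be supplied by the user; this is an illustration of Equations 1 and 2 and of the 20% acceptability criterion, not the authors' analysis script.

```python
import math


def sem(sd_all_scores, reliability):
    """Standard error of measurement: SD x sqrt(1 - r)."""
    return sd_all_scores * math.sqrt(1.0 - reliability)


def mdc_ctt(sd_all_scores, reliability, z=1.96):
    """CTT-based minimal detectable change: z x sqrt(2) x SEM."""
    return z * math.sqrt(2.0) * sem(sd_all_scores, reliability)


def mdc_percent(mdc, mean_all_scores):
    """MDC expressed as a percentage of the mean of all scores."""
    return 100.0 * mdc / mean_all_scores


def is_acceptable(mdc, mean_all_scores, threshold_percent=20.0):
    """Acceptability criterion from the text: MDC below 20% of the mean."""
    return mdc_percent(mdc, mean_all_scores) < threshold_percent
```

As a plausibility check, plugging in rounded figures quoted elsewhere in the manuscript for the intra-rater study (SD=9.4, ICC=0.94) gives `mdc_ctt(9.4, 0.94)` of about 6.4, close to the reported 6.5; the small gap is presumably due to rounding of the inputs.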
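The IRT-based MDC (MDCIRT) was calculated on the basis of the standard error of estimate (SEE) using the following formula, where SEE_1 and SEE_2 denote the SEEs of a patient's first and second test scores:
$$\mathrm{MDC}_{\mathrm{IRT}} = z\text{-score}\,(1.96) \times \sqrt{\mathrm{SEE}_1^{2} + \mathrm{SEE}_2^{2}} \tag{3}$$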
In IRT, one can generate an SEE for each individual score. The SEE allows users to estimate the precision (or reliability) of a particular measurement.10 We calculated the MDCIRT for each patient using the SEE for each patient’s first and second test scores in the intra-rater and inter-rater reliability studies. The MDCIRT was also plotted for each patient.
We also used Bland-Altman plots with 95% limits of agreement (LOA) to visually examine the agreement between two repeated assessments.20 In these plots, the difference (d) between each pair of observations was plotted against the average value for that pair. Assuming that the differences follow a normal distribution, 95% of the differences will lie within the mean difference ±1.96 SDdiff (i.e., the LOA), where SDdiff represents the standard deviation of the differences. Moreover, these plots were used to check for heteroscedasticity, that is, a tendency for the differences between repeated assessments to increase as the average score of the assessments increases. The possibility of heteroscedasticity was evaluated according to the association (i.e., Pearson’s r) between the average and the absolute change in each pair of assessments. The data were considered heteroscedastic when r > 0.3.21
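The numerical side of the Bland-Altman analysis can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, differences are taken as second minus first assessment, and the sample standard deviation (n - 1 denominator) is used for SDdiff; the manuscript does not specify these details.

```python
import math
import statistics


def limits_of_agreement(first, second, z=1.96):
    """95% limits of agreement: mean difference +/- z x SD of differences."""
    diffs = [b - a for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    return mean_diff - z * sd_diff, mean_diff + z * sd_diff


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def heteroscedasticity_r(first, second):
    """Correlation between pair averages and absolute differences.

    A value above 0.3 is the criterion the text uses to flag
    heteroscedasticity.
    """
    averages = [(a + b) / 2 for a, b in zip(first, second)]
    abs_diffs = [abs(b - a) for a, b in zip(first, second)]
    return pearson_r(averages, abs_diffs)
```

Each function takes the two lists of scores in the same patient order, so the same inputs can feed both the LOA and the heteroscedasticity check.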
A paired t test was also performed to determine whether there were significant differences between repeated assessments in the intra-rater or inter-rater condition.
IRT reliability and test length in both reliability studies. We calculated IRT reliability for each patient in both assessments in the intra-rater and inter-rater reliability studies on the basis of the SEE using the following formula:
$$\text{IRT reliability} = 1 - \mathrm{SEE}^{2} \tag{4}$$
In addition, we calculated the test length by counting the number of items presented in each interview.
A paired t test was performed to determine whether there were significant differences in IRT reliability/test length between repeated assessments in the intra-rater and inter-rater conditions.
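The two IRT-based quantities used in this study can be illustrated with a short Python sketch. The function names are hypothetical. `irt_reliability` follows Equation 4 and assumes the SEE is on the standardized score scale. `mdc_irt` assumes that the SEEs of the first and second test scores are combined as a root sum of squares, which is the usual standard error of a difference between two independent estimates; that combination is an assumption of this sketch, and the result is in the same units as the SEEs supplied.

```python
import math


def irt_reliability(see):
    """IRT reliability of a single score: 1 - SEE^2 (standardized scale)."""
    return 1.0 - see ** 2


def mdc_irt(see_first, see_second, z=1.96):
    """IRT-based MDC for one patient from the SEEs of two assessments.

    Assumes the two SEEs are combined as sqrt(SEE_1^2 + SEE_2^2).
    """
    return z * math.sqrt(see_first ** 2 + see_second ** 2)
```

Because the SEE varies from score to score, these functions would be applied per patient rather than once per sample, in contrast to the single MDCCTT value.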
Responsiveness. The responsiveness of the ADL CAT was examined using two types of effect size. First, Kazis’ effect size22 was calculated by dividing the mean change in score between admission and discharge measurements by the SD of the admission score. Second, the standardized response mean (SRM)23 was obtained by dividing the mean change by the SD of the change scores between admission and discharge. An effect size greater than 0.80 was considered large, one of 0.50 to 0.80, moderate, and one of 0.20-0.49, small.24 We expected that the responsiveness of the ADL CAT would be moderate to high. In addition, we used a paired t test to determine the statistical significance of the changes in scores on the ADL CAT.
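Both responsiveness indices reduce to simple ratios, sketched below in Python. The function names are hypothetical, sample standard deviations (n - 1 denominator) are assumed, and the label returned for values under 0.20 is an assumption of this sketch because the text does not name that band.

```python
import statistics


def kazis_effect_size(admission, discharge):
    """Mean change divided by the SD of the admission scores."""
    changes = [d - a for a, d in zip(admission, discharge)]
    return statistics.mean(changes) / statistics.stdev(admission)


def standardized_response_mean(admission, discharge):
    """Mean change divided by the SD of the change scores."""
    changes = [d - a for a, d in zip(admission, discharge)]
    return statistics.mean(changes) / statistics.stdev(changes)


def classify_effect(value):
    """Label an effect size using the cutoffs quoted in the text."""
    size = abs(value)
    if size > 0.80:
        return "large"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "small"
    return "below small"  # band not named in the text (assumption)
```

Applied to the values reported in this study, `classify_effect(0.62)` and `classify_effect(0.73)` both return "moderate".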
RESULTS
Intra-rater reliability
A total of 55 patients participated in the study. Of these, 2 patients had severe aphasia and 1 had severe cognitive-perceptual deficits. Their mean age was 74.3 years, and 73% of the patients were male (Table 1). Table 2 shows that the mean scores of the ADL CAT at the first and second assessments were 50.9 and 51.3, respectively, indicating that, on average, our participants independently performed most BADL tasks but none of the IADL tasks.
In the intra-rater reliability study, the ICC and MDCCTT values for the ADL CAT were 0.94 and 6.5, respectively (Table 2). The MDCCTT was 12.8% of the mean of all scores of the intra-rater assessments.
Figure 1 shows that the LOA of the ADL CAT ranged from -6.3 to 6.9. The association between the average and the absolute change scores in each pair of assessments was less than 0.3 (Pearson’s r=0.12). In addition, there was no significant difference between intra-rater assessments (p>0.05, Table 2).
Inter-rater reliability
A total of 42 patients participated in the study. Three of the 42 patients had severe cognitive-perceptual deficits. Their mean age was 63.6 years, and 57.0% of the patients were male (Table 1). Table 2 shows that the mean scores at the first and second assessments were 49.4 and 50.6, respectively.
In the inter-rater reliability study, the ICC and MDCCTT values for the ADL CAT were 0.80 and 9.5, respectively (Table 2). The MDCCTT was 18.9% of the mean of all scores of the inter-rater assessments.
Figure 2 shows that the LOA of the ADL CAT ranged from -8.2 to 10.5. The association between the average and the absolute change scores in each pair of assessments was less than 0.3 (Pearson’s r=0.13). Moreover, no significant differences were found between inter-rater assessments (p>0.05, Table 2).
Figure 3 shows the MDCIRT plots. The MDCIRT, as expected, was much smaller in the middle of the ADL continuum in our intra-rater and inter-rater reliability samples (e.g., a patient with a T score of 50.3 had the lowest MDCIRT of 5.4). The MDCIRT was much higher at the very low or very high end of the ADL continuum (e.g., a patient with the lowest T score (22.0) had the highest MDCIRT of 16.2).
IRT reliability and test length in both reliability studies
Table 3 shows that the mean IRT reliability was very high for the patients’ ADL CAT scores (about 0.93 for each of the two studies). The mean test length was short (about 4 items for the two studies). There was no significant difference in terms of the IRT reliability and test length in the two studies (p≥0.362).
Responsiveness
A total of 71 patients were originally recruited. Eleven patients withdrew from the study because they were discharged early without notice. Sixty patients completed both baseline and follow-up assessments. These patients were not significantly different from those who withdrew from the study in terms of demographic characteristics (i.e., age and gender) or ADL function (i.e., the ADL CAT baseline scores) (p>0.11). The median number of days from onset to initial evaluation for these patients was 20 (range, 9-50), indicating that most of the patients were in the subacute stage (Table 4). Table 2 shows that the mean admission and discharge scores of the ADL CAT were 39.8 and 45.9, respectively, indicating an increase in scores from admission to discharge.
The Kazis’ effect size and SRM of the ADL CAT were moderate (0.62-0.73) (Table 2). The change in scores on the ADL CAT was significant (p<0.001).
DISCUSSION
Establishing intra-rater reliability, inter-rater reliability, and responsiveness is important for ensuring the utility of the ADL CAT. Our findings provide empirical evidence on these important properties of the ADL CAT in patients with stroke for clinicians and researchers.
Our results showed that the ICC value (0.94) for the intra-rater agreement of the ADL CAT was high. The ICC value (0.80) for the inter-rater agreement of the ADL CAT was good. The inter-rater agreement appeared slightly lower than the intra-rater agreement. There are two possible reasons. First, the between-participant variability of the ADL CAT scores in the inter-rater reliability study (SD=7.8) was smaller than that in the intra-rater reliability study (SD=9.4).25 Because ICC reflects the ratio of variance between participants to total variance (between- and within-participant variances), this value becomes smaller when the between-participant variance decreases.25 Second, training of only two hours on the administration of the ADL CAT might not be sufficient to ensure that the raters attain equally high proficiency in interview and scoring skills over time. Although the inter-rater agreement of the ADL CAT was considered good (ICC=0.80), rigorous rater training is recommended to ensure stable assessments within or between raters.
The MDCCTT values were 6.5 points and 9.5 points for the ADL CAT in the intra-rater reliability and inter-rater reliability studies, respectively. As expected, the MDCCTT obtained from different raters was higher than that obtained from an individual rater. Both MDCCTT values were <20% of the mean of all scores and <1 SD of the T scores (i.e., 10 points) of the ADL CAT, indicating an acceptable level of random measurement error. More importantly, these MDCCTT values are useful for users when judging whether the change scores in consecutive ADL CAT assessments signify real changes or random variations.26 For example, a change score greater than 9.5 points between consecutive assessments administered by different raters can be interpreted as a real change (i.e., beyond random measurement error) with 95% confidence.
We calculated both the MDCCTT and the MDCIRT because each has its own merits. The MDCCTT takes into account the effects of raters’ inconsistency in testing (in either intra-rater or inter-rater interviewing skills and judgment). Furthermore, the MDCCTT, a single value, is applied to all patients regardless of their level of function. Because there was apparent variability between inter-rater assessments of the ADL CAT (ICC=0.80), we recommend that prospective users employ the inter-rater MDCCTT (9.5) when the ADL CAT is administered by different raters.
In contrast, the MDCIRT, generated using the SEE for each individual score, does not take into account the rater’s effect. The value of MDCIRT depends on the quality of the item bank of a CAT.27 That is, a better item bank will generate a lower MDCIRT. The ADL CAT can calculate the MDCIRT automatically with each patient’s ADL score. Particularly, the ICC value (0.94) for the intra-rater reliability was so high that the effect of raters’ inconsistency may be negligible for the ADL CAT. Thus, such MDCIRT information, which is automatically reported by the ADL CAT, can be very useful for users to determine whether a patient’s change score is beyond random measurement error.
However, the MDCIRT is usually large for patients with an extremely low or high level of functioning. For example, the patient who had the lowest ADL CAT score (22.0) had a very high MDCIRT, 16.2. The results indicate that the ADL CAT did not provide reliable information to describe the ADL function of patients with an extremely low level of ADL functioning. Therefore, the item bank for the ADL CAT should be extended in the future at both ends of ADL function by the addition of some very easy and very hard items, so that better reliability (lower random measurement error) can be achieved for patients with extreme ADL functions.
The Bland-Altman plots show that the mean scores of repeated assessments were scattered throughout almost the entire range of possible scores (i.e., from 22.0 to 66.6 for the intra-rater reliability study) of the ADL CAT, implying that our participants had a wide range of ADL function. The associations between the average and the absolute change in each pair of assessments in both the intra-rater reliability and the inter-rater reliability studies were less than 0.3, indicating that heteroscedasticity did not exist. That is, the differences in repeated assessments did not increase as the average score of the assessments increased. Moreover, the results of the paired t tests showed no systematic bias in either the intra-rater or the inter-rater assessments. Our findings demonstrate that the ADL CAT is reliable in assessing a wide range of ADL function in patients with stroke over time.
Moreover, the ADL CAT had high IRT reliability (≥.93) and a short test length (4 items, on average) in both reliability studies. In addition, the differences in IRT reliability and test length in the intra-rater assessments and inter-rater assessments were not significant. These observations further support that the ADL CAT is reliable and efficient when used repeatedly to assess patients’ ADL functions.
We found that the responsiveness of the ADL CAT was moderate (Kazis’ effect
22
size=0.62 and SRM=0.73) in detecting change in patients with stroke undergoing inpatient
23
rehabilitation. Such a result partially supports our original hypothesis that the responsiveness
17
ACCEPTED MANUSCRIPT
of the ADL CAT was moderate to high. A possible reason for the moderate responsiveness of
2
the ADL CAT is that our participants in the responsiveness study were all staying in
3
hospitals. IADL is not commonly performed by inpatients. Therefore, their scores on the
4
ADL CAT were generally limited within the score range of the BADL function, which might
5
have compromised the responsiveness. Nevertheless, the results support the value of the
6
ADL CAT in detecting the change of ADL function in inpatients with stroke.
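The two responsiveness indices can be reproduced from the summary statistics in Table 2. The sketch below assumes the conventional definitions (Kazis' effect size = mean change divided by the SD of the baseline scores; SRM = mean change divided by the SD of the change scores). The function names are hypothetical and this is not the authors' code.

```python
from statistics import mean, stdev


def change_scores(baseline, follow_up):
    """Per-patient change: follow-up score minus baseline score."""
    return [after - before for before, after in zip(baseline, follow_up)]


def kazis_effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores."""
    return mean(change_scores(baseline, follow_up)) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / stdev(changes)


def effect_size_from_summary(mean_change, sd_baseline):
    """Kazis' effect size computed from published summary statistics."""
    return mean_change / sd_baseline


def srm_from_summary(mean_change, sd_change):
    """SRM computed from published summary statistics."""
    return mean_change / sd_change


# Summary values from the responsiveness row of Table 2:
# mean change 6.1, SD of first test 9.8, SD of change 8.3.
es = effect_size_from_summary(6.1, 9.8)
srm = srm_from_summary(6.1, 8.3)
print(round(es, 2), round(srm, 2))
```

Rounded to two decimals, these give 0.62 and 0.73, matching the reported values. The SRM exceeds the effect size here because the SD of the change scores (8.3) is smaller than the SD of the baseline scores (9.8).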
The ADL CAT not only eases the administration burden on both clinicians and patients but also improves the efficiency of patient management. The ADL CAT provides instant outcome reports and automatic storage of results, which further increase the efficiency of data collection and management.28 Furthermore, the data of the ADL CAT are collected directly from patients themselves and/or their caregivers, which is in line with the patient-reported outcomes approach. Because the information on ADL function is gathered from the patients’ perspectives, the ADL CAT is useful in assisting clinicians to develop treatment plans and monitor outcomes toward patient-centered care. In the future, the ADL CAT can be combined with other CATs (e.g., the Balance CAT, the CAT-Fugl-Meyer motor scale)28-30 to further extend the utility of CATs in validation of treatment effectiveness, decision-making, and data management.
Study Limitations
This study has five limitations. First, our samples for the intra-rater and inter-rater reliability studies were convenience samples (with a few patients having severe aphasia and cognitive-perceptual deficits). Thus, our results cannot be generalized to all stroke populations. Second, we did not record whether the responses were from patients, primary caregivers, or both. Some primary caregivers were not present, in which case only the patients were interviewed. The inconsistency of informants between the first and second interviews might have caused underestimation of the intra-rater and inter-rater reliability in this study. Third, we examined the responsiveness of the ADL CAT only in inpatients with stroke receiving rehabilitation. Because inpatients are not likely to perform IADL, the generalizability of our results is limited. Fourth, the item bank of the ADL CAT has not been validated exclusively in inpatients. Thus, an investigation of validity in inpatients is needed to further validate the ADL CAT. Fifth, whether the item parameters of the ADL item bank are stable over time (i.e., differential item functioning [DIF] due to time or other factors, except for gender1) has not been examined. It is important that ADL items used as an outcome indicator over time not have time-DIF. Future studies are needed to examine the time-DIF of the ADL item bank to ensure that the item parameters are stable over time and that our findings of responsiveness are robust.
CONCLUSIONS
The results of this study showed that the ADL CAT has good intra-rater and inter-rater reliability in outpatients with chronic stroke and moderate responsiveness in inpatients with stroke undergoing rehabilitation. The ADL CAT is very efficient, needing only about 4 items, on average, to complete an assessment; this efficiency is unlikely to be achieved with traditional measures (e.g., the 10-item BI or the 15-item FAI). Further investigations of the responsiveness of the ADL CAT in outpatients are needed before it is used as an outcome measure in outpatients with stroke.
References
1. Hsueh IP, Chen JH, Wang CH, Hou WH, Hsieh CL. Development of a computerized adaptive test for assessing activities of daily living in outpatients with stroke. Phys Ther 2013;93:681-93.
2. Stineman MG, Maislin G, Fiedler RC, Granger CV. A prediction model for functional recovery in stroke. Stroke 1997;28:550-6.
3. Hsieh CL, Hoffmann T, Gustafsson L, Lee YC. The diverse constructs use of activities of daily living measures in stroke randomized controlled trials in the years 2005-2009. J Rehabil Med 2012;44:720-6.
4. Hsieh CL, Sheu CF, Hsueh IP, Wang CH. Trunk control as an early predictor of comprehensive activities of daily living function in stroke patients. Stroke 2002;33:2626-30.
5. Howard W. Computerized adaptive testing: a primer. Mahwah: Lawrence Erlbaum Associates; 2000.
6. Hsueh IP, Wang WC, Sheu CF, Hsieh CL. Rasch analysis of combining two indices to assess comprehensive ADL function in stroke patients. Stroke 2004;35:721-6.
7. Kelly-Hayes M, Robertson JT, Broderick JP, Duncan PW, Hershey LA, Roth EJ, et al. The American Heart Association Stroke Outcome Classification. Stroke 1998;29:1274-80.
8. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. Upper Saddle River: Pearson Prentice Hall; 2009.
9. Tamanini JT, Dambros M, D'Ancona CA, Palma PC, Rodrigues-Netto N, Jr. Responsiveness to the Portuguese version of the International Consultation on Incontinence Questionnaire-Short Form (ICIQ-SF) after stress urinary incontinence surgery. Int Braz J Urol 2005;31:482-9; discussion 90.
10. Jette AM, Tao W, Norweg A, Haley S. Interpreting rehabilitation outcome measurements. J Rehabil Med 2007;39:585-90.
11. Arnall FA, Koumantakis GA, Oldham JA, Cooper RG. Between-days reliability of electromyographic measures of paraspinal muscle fatigue at 40, 50 and 60% levels of maximal voluntary contractile force. Clin Rehabil 2002;16:761-71.
12. Haley SM, Fragala-Pinkham MA. Interpreting change scores of tests and measures used in physical therapy. Phys Ther 2006;86:735-43.
13. Lu WS, Wang CH, Lin JH, Sheu CF, Hsieh CL. The minimal detectable change of the simplified stroke rehabilitation assessment of movement measure. J Rehabil Med 2008;40:615-9.
14. Chakravarty EF, Bjorner JB, Fries JF. Improving patient reported outcomes using item response theory and computerized adaptive testing. J Rheumatol 2007;34:1426-31.
15. Wang YC, Hart DL, Werneke M, Stratford PW, Mioduski JE. Clinical interpretation of outcome measures generated from a lumbar computerized adaptive test. Phys Ther 2010;90:1323-35.
16. Hart DL, Wang YC, Cook KF, Mioduski JE. A computerized adaptive test for patients with shoulder impairments produced responsive measures of function. Phys Ther 2010;90:928-38.
17. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000;38:II28-42.
18. Embretson S, Reise S. Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates; 2000.
19. Flansbjer UB, Holmback AM, Downham D, Patten C, Lexell J. Reliability of gait performance tests in men and women with hemiparesis after stroke. J Rehabil Med 2005;37:75-82.
20. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
21. Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med 1998;26:217-38.
22. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989;27:S178-89.
23. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care 1990;28:632-42.
24. Cohen J. Statistical power analysis for the behavioral sciences. Lawrence Erlbaum; 1988.
25. Bruton A, Conway JH, Holgate ST. Reliability: what is it, and how is it measured? Physiotherapy 2000;86:94-9.
26. Hsueh IP, Wang CH, Liou TH, Lin CH, Hsieh CL. Test-retest reliability and validity of the comprehensive activities of daily living measure in patients with stroke. J Rehabil Med 2012;44:637-41.
27. Cook KF, O'Malley KJ, Roddey TS. Dynamic assessment of health outcomes: time to let the CAT out of the bag? Health Serv Res 2005;40:1694-711.
28. Hou WH, Shih CL, Chou YT, Sheu CF, Lin JH, Wu HC, et al. Development of a computerized adaptive testing system of the Fugl-Meyer motor scale in stroke patients. Arch Phys Med Rehabil 2012;93:1014-20.
29. Hsueh IP, Chen JH, Wang CH, Chen CT, Sheu CF, Wang WC, et al. Development of a computerized adaptive test for assessing balance function in patients with stroke. Phys Ther 2010;90:1336-44.
30. Yu WH, Hsueh IP, Hou WH, Wang YH, Hsieh CL. A comparison of responsiveness and predictive validity of two balance measures in patients with stroke. J Rehabil Med 2012;44:176-80.
Figure legends:
Fig 1. The Bland-Altman plots show the agreement of the intra-rater assessments of the ADL CAT. The solid line represents the mean of the difference. The limits of agreement (mean of the difference ± 1.96 × SDdiff = 0.3 ± 1.96 × 3.4) are presented as dotted lines.
Fig 2. The Bland-Altman plots show the agreement of the inter-rater assessments of the ADL CAT. The solid line represents the mean of the difference. The limits of agreement (mean of the difference ± 1.96 × SDdiff = 1.2 ± 1.96 × 4.8) are presented as dotted lines.
Fig 3. The plots show each value of MDCIRT for each patient in the ADL CAT intra-rater and inter-rater reliability studies.
Table 1: Demographic and clinical characteristics of the participants from the reliability study

Characteristic                                 Intra-rater study (n=55)   Inter-rater study (n=42)
Sex, n (%)
  Male                                         40 (72.7%)                 24 (57.1%)
  Female                                       15 (27.3%)                 18 (42.9%)
Age, years, mean (SD)                          74.3 (16.9)                63.6 (12.4)
Stroke type, n (%)
  Cerebral hemorrhage                          21 (38.2%)                 17 (40.5%)
  Cerebral infarction                          34 (61.8%)                 25 (59.5%)
Side of hemiplegia, n (%)
  Right                                        32 (58.2%)                 18 (42.9%)
  Left                                         22 (40.0%)                 24 (57.1%)
  Bilateral                                    1 (1.8%)                   0 (0%)
Time since onset to initial evaluation,
  months, median (minimum~maximum)             21.7 (6.1~182.8)           23.4 (6.3~107.1)
Table 2: Reliability and responsiveness indices of the activities of daily living computerized adaptive test (ADL CAT)

Index                                   Intra-rater          Inter-rater          Responsiveness
                                        reliability (n=55)   reliability (n=42)   (n=60)
First test, mean (SD)                   50.9 (9.6)*          49.4 (7.3)‡          39.8 (9.8)
Second test, mean (SD)                  51.3 (9.4)†          50.6 (7.8)§          45.9 (7.4)
Difference (second-first), mean (SD)    0.3 (3.4)            1.2 (4.8)            6.1 (8.3)
ICC (95% CI)                            0.94 (0.90-0.96)     0.80 (0.65-0.89)     -
SEM                                     2.4                  3.4                  -
MDC (MDC%)                              6.5 (12.8%||)        9.5 (18.9%¶)         -
Paired t test, t value (p)              0.7 (0.494)          1.6 (0.114)          5.7 (<0.001)
Kazis’ effect size                      -                    -                    0.62
SRM                                     -                    -                    0.73

* The first assessment scores in the intra-rater reliability study ranged from 22.0 to 66.6.
† The second assessment scores in the intra-rater reliability study ranged from 22.0 to 66.6.
‡ The first assessment scores in the inter-rater reliability study ranged from 36.4 to 62.1.
§ The second assessment scores in the inter-rater reliability study ranged from 36.5 to 66.6.
|| MDC%=MDC/mean of all scores of the intra-rater assessments of the ADL CAT (51.1)
¶ MDC%=MDC/mean of all scores of the inter-rater assessments of the ADL CAT (50.0)
SD: standard deviation; ICC: intraclass correlation coefficient; CI: confidence interval; SEM: standard error of measurement; MDC: minimal detectable change; SRM: standardized response mean.
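Several entries in Table 2 can be checked from the other summary values in the same table. The sketch below is a hedged illustration, not the authors' code. The function names are hypothetical. MDC% follows the table footnotes, the paired t statistic uses the standard formula (mean difference divided by its standard error), and the SEM-to-MDC conversion assumes the conventional 95% formula MDC = 1.96 × √2 × SEM, which is not stated in this section.

```python
import math


def mdc_percent(mdc, mean_score):
    """MDC% as defined in the Table 2 footnotes: MDC divided by the mean score, in percent."""
    return 100.0 * mdc / mean_score


def paired_t_from_summary(mean_diff, sd_diff, n):
    """Paired t statistic from the mean and SD of the paired differences."""
    return mean_diff / (sd_diff / math.sqrt(n))


def mdc95_from_sem(sem):
    """Assumed conventional formula: MDC at the 95% level = 1.96 x sqrt(2) x SEM."""
    return 1.96 * math.sqrt(2) * sem


# Paired t statistics from the "Difference" row and the sample sizes.
for label, mean_diff, sd_diff, n in [
    ("intra-rater", 0.3, 3.4, 55),
    ("inter-rater", 1.2, 4.8, 42),
    ("responsiveness", 6.1, 8.3, 60),
]:
    print(label, round(paired_t_from_summary(mean_diff, sd_diff, n), 1))

# MDC% from the tabulated MDC values and the footnoted mean scores.
print(round(mdc_percent(6.5, 51.1), 1), round(mdc_percent(9.5, 50.0), 1))

# MDC recomputed from the rounded SEM values.
print(round(mdc95_from_sem(2.4), 2), round(mdc95_from_sem(3.4), 2))
```

The t statistics round to the tabulated 0.7, 1.6, and 5.7. The MDC% and MDC values computed this way differ from the tabulated ones by small amounts, which is expected because the inputs used here are already rounded to one decimal place.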
Table 3: The mean IRT reliability and test length of the ADL CAT in the intra-rater and inter-rater studies

                            Mean IRT reliability*                        Test length (number of items)
Study                       First test    Second test   Paired t test    First test    Second test   Paired t test
                            Mean (SD)     Mean (SD)     P value          Mean (SD)     Mean (SD)     P value
Intra-rater study (n=55)    0.93 (0.05)   0.93 (0.05)   0.728            4.5 (1.8)     4.4 (1.7)     0.362
Inter-rater study (n=42)    0.94 (0.02)   0.94 (0.02)   0.855            3.9 (1.6)     4.1 (1.7)     0.483

*IRT reliability = 1 - SEE²
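The Table 3 footnote defines IRT reliability as 1 - SEE². The short sketch below shows how that relation maps between a standard error and a reliability value. It assumes that SEE denotes the standard error of the ability estimate on a scale with unit variance, which the footnote does not state. The function names are hypothetical.

```python
import math


def irt_reliability(see):
    """IRT reliability from the standard error of the estimate (SEE): 1 - SEE^2."""
    return 1.0 - see ** 2


def see_for_reliability(reliability):
    """Inverse relation: the SEE that corresponds to a given IRT reliability."""
    return math.sqrt(1.0 - reliability)


# SEE implied by the mean IRT reliabilities reported in Table 3.
for rel in (0.93, 0.94):
    print(rel, round(see_for_reliability(rel), 3))
```

Under this assumption, a reliability of 0.93 corresponds to an SEE of about 0.265 on the unit-variance scale. A smaller SEE for a given patient therefore translates directly into a higher person-level reliability.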
Table 4: Demographic and clinical characteristics of the participants from the responsiveness study

Characteristic                                 Participants who           Participants who
                                               completed the study        withdrew from the study
                                               (n=60)                     (n=11)
Sex, n (%)
  Male                                         39 (65.0%)                 6 (54.5%)
  Female                                       21 (35.0%)                 5 (45.5%)
Age, years, mean (SD)                          68.5 (11.5)                74.6 (11.0)
Stroke type, n (%)
  Cerebral hemorrhage                          16 (26.7%)                 2 (18.2%)
  Cerebral infarction                          44 (73.3%)                 9 (81.8%)
Side of hemiplegia, n (%)
  Right                                        25 (41.7%)                 4 (36.4%)
  Left                                         33 (55.0%)                 7 (63.6%)
  Bilateral                                    2 (3.3%)                   0 (0%)
Time since onset to initial evaluation,
  days, median (minimum~maximum)               20 (9~50)                  14 (10~37)
ADL CAT baseline scores,
  median (minimum~maximum)                     40.2 (22.0~69.4)           36.5 (22.0~51.6)
[Figure 1. Bland-Altman plot. x-axis: Mean scores of the intra-rater assessments of the ADL CAT (20 to 70); y-axis: Difference between the intra-rater assessments of the ADL CAT (-20 to 20).]
[Figure 2. Bland-Altman plot. x-axis: Mean scores of the inter-rater assessments of the ADL CAT (30 to 70); y-axis: Difference between the inter-rater assessments of the ADL CAT (-20 to 20).]
[Figure 3. Scatter plot. x-axis: Baseline scores for each patient in two reliability studies (20 to 70); y-axis: MDC values obtained using IRT (2 to 18).]