Archives of Clinical Neuropsychology 21 (2006) 413–420
The California Verbal Learning Test – second edition: Test-retest reliability, practice effects, and reliable change indices for the standard and alternate forms

Steven Paul Woods a,∗, Dean C. Delis b, J. Cobb Scott c, Joel H. Kramer d, James A. Holdnack e

a Department of Psychiatry (0847), University of California at San Diego, 150 W. Washington Street, 2nd Floor, San Diego, CA 92103-2005, USA
b Department of Psychiatry (0847), University of California at San Diego, and Psychology Service (116B), VA San Diego Healthcare System, La Jolla, CA, USA
c Joint Doctoral Program in Clinical Psychology, San Diego State University and University of California at San Diego, San Diego, CA, USA
d Department of Neurology, University of California at San Francisco, San Francisco, CA, USA
e Harcourt Assessment, Inc., San Antonio, TX, USA

Accepted 15 June 2006

∗ Corresponding author. Tel.: +1 619 543 5004; fax: +1 619 543 1235. E-mail address: [email protected] (S.P. Woods).
Abstract

The California Verbal Learning Test – second edition (CVLT-II) is one of the most widely used neuropsychological tests in North America. The present study evaluated the 1-month test-retest reliability and practice effects associated with the standard and alternate forms of the CVLT-II in a sample of 195 healthy adults. Eighty participants underwent repeat assessment using the standard form of the CVLT-II on both occasions, whereas the remaining 115 individuals received the standard form at baseline and the alternate form at follow-up. Consistent with prior research, results revealed generally large test-retest correlation coefficients for the primary CVLT-II measures in both the standard/standard (range = 0.80–0.84) and standard/alternate (range = 0.61–0.73) cohorts. Despite exhibiting slightly lower test-retest reliability coefficients, participants in the alternate form group displayed notably smaller practice effects (Cohen’s d range = −0.01 to 0.18) on the primary indices relative to individuals who received the standard form on both occasions (Cohen’s d range = 0.27–0.61). Reliable change indices were also generated and applied to the primary CVLT-II variables to determine the base rates of significant improvements (range = 2–10%), declines (range = 0–7%), and stability (range = 85–97%) in performance over time. Overall, findings from this study support the test-retest reliability of the standard and alternate forms of the CVLT-II in healthy adults and may enhance the usefulness of this test in longitudinal neuropsychological evaluations.

© 2006 National Academy of Neuropsychology. Published by Elsevier Ltd. All rights reserved.

Keywords: Episodic memory; Verbal learning; Test reliability; Practice
The California Verbal Learning Test (CVLT and CVLT-II; Delis, Kramer, Kaplan, & Ober, 1987; Delis, Kramer, Kaplan, & Ober, 2000) is among the five most common assessment instruments used by clinical neuropsychologists in North America (Rabin, Barr, & Burton, 2005). The construct validity of the CVLT as a measure of episodic verbal learning and memory has garnered considerable support in the neuropsychological literature (e.g., Alexander, Stuss,
& Fansabedian, 2003; Baldo, Delis, Kramer, & Shimamura, 2002; Crosson, Novack, Trenerry, & Craig, 1988; Kibby, Schmitter-Edgecombe, & Long, 1998). Prior studies also support the test-retest reliability of the original CVLT (e.g., Paolo, Tröster, & Ryan, 1997), with the traditional primary variables (e.g., total trials 1–5 and long-delay free recall) demonstrating particularly robust temporal stability in healthy adults (Delis et al., 1991).

In the only study published to date on the test-retest reliability of the second edition of the CVLT, Benedict (2005) reported data on 34 participants with multiple sclerosis who were randomly assigned to receive either: (1) the CVLT-II standard form at baseline and the alternate form at 1-week follow-up; or (2) the standard form at both baseline and 1-week follow-up. Although test-retest reliability coefficients were broadly comparable for the standard (M r = 0.62, range = 0.50–0.72) and alternate form (M r = 0.75, range = 0.54–0.89) groups, participants who received the alternate form at retest exhibited notably smaller practice effects across the CVLT-II summary measures (alternate form M d = 0.0, range = −0.1 to 0.1; standard form M d = 0.76, range = 0.5–1.0). These findings were interpreted to suggest that use of the CVLT-II alternate form might diminish the confounding effects of practice across repeated administrations without adversely affecting reliability. However, no peer-reviewed studies have been published on the test-retest reliability of the CVLT-II in a nonclinical group, which may yield different results than a disease sample (Delis, Jacobson, Bondi, Hamilton, & Salmon, 2003) and would provide more broadly applicable psychometric data for clinical and research use. The present study therefore aimed to examine the test-retest reliability and practice effects of the standard and alternate forms of the CVLT-II in healthy adults, as well as to generate reliable change indices (RCIs; Jacobson & Truax, 1991) that provide statistical guidelines for detecting significant changes in individual CVLT-II profiles.

1. Method

Participants were 195 healthy adults who underwent repeat testing with the CVLT-II over at least a 1-week test-retest interval. The average test-retest interval was 29 days (S.D. = 13, range = 9–74). All potential study participants were screened for histories of medical, neurological, or psychiatric conditions known to adversely affect neurocognitive functions (see the CVLT-II technical manual for further details). Eighty participants underwent repeat assessment using the standard form of the CVLT-II on both occasions (standard/standard), whereas the other 115 individuals received the standard form at baseline and the alternate form at follow-up (standard/alternate). The demographic characteristics of the study groups and their test-retest interval data are displayed in Table 1.

Raw scores were used for all analyses. Wilcoxon Signed Rank tests and Spearman’s rho (ρ) correlation coefficients (or their parametric counterparts, as determined by results from the Shapiro-Wilk W test of normality) were conducted to assess practice effects and test-retest reliability, respectively. The critical alpha level was set at 0.001 for these analyses in an effort to limit Type I error due to multiple comparisons.

Table 1
Demographic composition of the test-retest study samples
Variable | Standard/standard forms (N = 80) | Standard/alternate forms (N = 115)
Test-retest interval (days), mean (S.D.) [range] | 25.8 (13.1) [9–74] | 30.9 (13.1) [10–74]
Age (years), mean (S.D.) [range] | 49.5 (22.7) [16–88] | 47.7 (22.0) [16–88]
Education (%)
  ≤8 years | 7.5% | 5.2%
  9–11 years | 8.8% | 15.7%
  12 years | 37.5% | 37.4%
  13–15 years | 33.8% | 21.7%
  ≥16 years | 12.5% | 20.0%
Sex (%)
  Female | 51.3% | 53.9%
  Male | 48.8% | 46.1%
Ethnicity (%)
  Caucasian | 77.5% | 83.5%
  Hispanic | 10.0% | 7.8%
  African-American | 10.0% | 8.7%
  Other | 2.5% | 0.0%
Effect sizes for the practice effects analyses were measured with the unbiased Cohen’s d (Hedges & Olkin, 1985). Means and standard deviations of the CVLT-II difference scores (Time 2 − Time 1) were then calculated in order to generate RCIs with 68%, 90%, and 95% confidence intervals (Chelune, Naugle, Lüders, Sedlack, & Awad, 1993; Temkin, Heaton, Grant, & Dikmen, 1999). For example, the 90% confidence interval RCIs were generated using the following formula: Mpractice effect ± (S.E.diff × 1.645), where Mpractice effect is the mean difference score and S.E.diff is the standard deviation of the Time 2 − Time 1 difference scores.

2. Results

CVLT-II test variable means, standard deviations, effect sizes, practice effects, reliability coefficients, and RCI confidence intervals are displayed in Tables 2 and 3.
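To make these computational steps concrete, a minimal Python sketch is given below. It is an illustration rather than the original analysis code: the function and variable names are assumptions, numpy and scipy are assumed to be available, the 0.05 normality cutoff is assumed, and the Hedges small-sample correction shown is one common formulation.

import numpy as np
from scipy import stats

def retest_statistics(time1, time2):
    # Test-retest statistics for one CVLT-II variable from paired Time 1/Time 2 raw scores.
    t1 = np.asarray(time1, dtype=float)
    t2 = np.asarray(time2, dtype=float)
    diff = t2 - t1  # practice effect (Time 2 - Time 1)

    # Shapiro-Wilk W test of normality decides between the nonparametric tests
    # and their parametric counterparts (0.05 cutoff assumed here).
    if stats.shapiro(diff)[1] < 0.05:
        practice_test = stats.wilcoxon(t1, t2)      # Wilcoxon Signed Rank test
        reliability = stats.spearmanr(t1, t2)[0]    # Spearman's rho
    else:
        practice_test = stats.ttest_rel(t2, t1)     # paired-samples t-test
        reliability = stats.pearsonr(t1, t2)[0]     # Pearson's r

    # Unbiased (Hedges-corrected) Cohen's d, pooling the S.D.s of the two occasions.
    n = len(t1)
    pooled_sd = np.sqrt((t1.std(ddof=1) ** 2 + t2.std(ddof=1) ** 2) / 2.0)
    d = (diff.mean() / pooled_sd) * (1.0 - 3.0 / (4.0 * (2 * n - 2) - 1.0))

    # RCI confidence intervals: Mpractice effect +/- (S.E.diff x z), where S.E.diff is
    # the standard deviation of the difference scores.
    m_diff, sd_diff = diff.mean(), diff.std(ddof=1)
    rci = {level: (m_diff - z * sd_diff, m_diff + z * sd_diff)
           for level, z in [("68%", 1.000), ("90%", 1.645), ("95%", 1.960)]}

    return {"practice_test": practice_test, "reliability": reliability, "d": d,
            "m_diff": m_diff, "sd_diff": sd_diff, "rci": rci}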
Table 2
Test-retest data for the CVLT-II standard/standard form study sample (N = 80)

CVLT-II variable | Time 1 | Time 2 | Wilcoxon z | d | ρ | Mdiff | S.D.a | 90% CI | 95% CI

Primary measures
Total trials 1–5 | 48.05 (11.83) | 55.85 (13.52) | 1250.0** | 0.61 | 0.80** | 7.80 | 8.01 | −5.38, 20.98 | −7.90, 23.50
Short-delay free recall | 10.04 (3.57) | 11.40 (3.52) | 727.0** | 0.44 | 0.80** | 1.36 | 2.25 | −2.34, 5.06 | −3.05, 5.77
Short-delay cued recall | 11.39 (3.06) | 12.75 (2.98) | 767.0** | 0.45 | 0.80** | 1.36 | 1.92 | −1.80, 4.52 | −2.40, 5.12
Long-delay free recall | 10.26 (3.87) | 11.70 (3.69) | 800.5** | 0.38 | 0.83** | 1.44 | 1.98 | −1.82, 4.70 | −2.44, 5.32
Long-delay cued recall | 11.46 (3.46) | 12.80 (3.05) | 833.0** | 0.41 | 0.80** | 1.34 | 2.01 | −1.97, 4.65 | −2.60, 5.28
Total recognition discrim. | 2.94 (0.93) | 3.18 (0.87) | 505.0** | 0.27 | 0.84** | 0.24 | 0.48 | −0.55, 1.03 | −0.70, 1.18

Process measures
Trial 1 | 6.00 (1.97) | 8.35 (2.91) | 1173.0** | 0.95 | 0.58** | 2.35 | 2.43 | −1.65, 6.35 | −2.41, 7.11
Trial 5 | 11.75 (2.88) | 12.33 (2.95) | 364.5* | 0.20 | 0.75** | 0.58 | 2.00 | −2.71, 3.87 | −3.34, 4.50
Trial B | 5.10 (2.15) | 5.74 (2.15) | 366.0** | 0.15 | 0.56** | 0.64 | 1.90 | −2.49, 3.77 | −3.08, 4.36
Semantic clustering | 1.09 (1.86) | 2.04 (2.58) | 863.5** | 0.42 | 0.61** | 0.95 | 1.80 | −2.01, 3.91 | −2.58, 4.48
Serial clustering forward | 0.51 (0.84) | 0.46 (1.09) | −231.0 | −0.05 | 0.40** | −0.05 | 0.99 | −1.68, 1.58 | −1.99, 1.89
Serial clustering bidirectional | 0.62 (1.05) | 0.50 (1.32) | −311.5 | −0.09 | 0.54** | −0.11 | 1.07 | −1.87, 1.65 | −2.21, 1.99
Subjective clustering | 0.78 (0.82) | 1.37 (1.13) | 892.5** | 0.59 | 0.45** | 0.59 | 0.98 | −1.02, 2.20 | −1.33, 2.51
Total learning slope | 1.41 (0.56) | 0.92 (0.57) | −6.4b,** | −0.35 | 0.28c | −0.49 | 0.68 | −1.61, 0.63 | −1.82, 0.84
Immediate recall discrim. | 2.10 (0.52) | 2.40 (0.60) | 987.5** | 0.53 | 0.76** | 0.30 | 0.40 | −0.36, 0.96 | −0.48, 1.08
Free recall discrim. | 1.99 (0.56) | 2.27 (0.62) | 7.2b,** | 0.47 | 0.83c,** | 0.28 | 0.35 | −0.30, 0.86 | −0.41, 0.97
Cued recall discrim. | 2.20 (1.00) | 5.49 (1.01) | 663.0** | 0.29 | 0.78** | 0.30 | 0.67 | −0.80, 1.40 | −1.01, 1.61
Delayed recall discrim. | 2.16 (0.91) | 2.43 (0.96) | 725.0** | 0.28 | 0.82** | 0.28 | 0.57 | −0.66, 1.22 | −0.86, 1.38
Total recall discrim. | 2.04 (0.62) | 2.31 (0.69) | 961.5** | 0.41 | 0.83** | 0.27 | 0.38 | −0.35, 0.90 | −0.48, 1.02
Across-trial consistency | 79.89 (11.54) | 85.00 (10.17) | 813.0** | 0.47 | 0.59** | 5.11 | 9.91 | −11.19, 21.41 | −14.31, 24.53
Percent recall primacy | 28.41 (5.81) | 26.91 (6.75) | −424.5* | −0.24 | 0.51** | −1.50 | 5.93 | −11.25, 8.25 | −13.12, 10.12
Percent recall middle | 43.23 (7.54) | 46.36 (6.72) | 743.0** | 0.44 | 0.43** | 3.14 | 7.57 | −9.31, 15.59 | −11.70, 17.98
Percent recall recency | 28.49 (6.56) | 26.84 (5.37) | −336.0 | −0.27 | 0.36* | −1.65 | 6.87 | −12.95, 9.65 | −15.12, 11.82
Immediate recall intrusions | 1.33 (2.18) | 1.49 (2.62) | 40.5 | 0.07 | 0.19 | 0.16 | 2.38 | −3.76, 4.08 | −4.50, 4.82
Free recall intrusions | 2.70 (3.29) | 3.20 (4.54) | 91.5 | 0.13 | 0.44** | 0.50 | 3.68 | −5.55, 6.55 | −6.71, 7.71
Cued recall intrusions | 2.24 (3.40) | 2.36 (3.65) | −27.0 | 0.05 | 0.57** | 0.13 | 3.16 | −5.07, 5.33 | −6.06, 6.32
Delayed recall intrusions | 3.34 (4.24) | 3.78 (5.26) | 46.5 | 0.09 | 0.59** | 0.44 | 4.46 | −6.90, 7.78 | −8.30, 9.18
Total intrusions | 4.94 (5.76) | 5.56 (7.43) | 47.0 | 0.09 | 0.57** | 0.62 | 5.92 | −9.12, 10.36 | −10.98, 12.22
Immediate recall repetitions | 3.48 (4.29) | 4.08 (3.60) | 418.5* | 0.15 | 0.41** | 0.60 | 4.77 | −7.27, 8.45 | −8.75, 9.95
Total repetitions | 4.51 (5.14) | 5.45 (5.02) | 406.5* | 0.18 | 0.36* | 0.94 | 5.97 | −8.88, 10.76 | −10.76, 12.64
Source recognition discrim. | 2.76 (0.90) | 3.02 (0.81) | 486.5** | 0.30 | 0.83** | 0.26 | 0.48 | −0.53, 1.05 | −0.68, 1.20
Novel recognition discrim. | 2.86 (0.77) | 3.05 (0.74) | 344.0** | 0.25 | 0.79** | 0.19 | 0.48 | −0.60, 0.98 | −0.75, 1.13
Semantic recognition discrim. | 2.63 (0.96) | 2.85 (0.92) | 396.5** | 0.23 | 0.81** | 0.22 | 0.55 | −0.68, 1.12 | −0.86, 1.30
Total response bias | 0.12 (0.31) | 0.03 (0.34) | −253.0* | −0.28 | 0.57** | −0.09 | 0.30 | −0.58, 0.40 | −0.68, 0.50
Forced-choice recognition | 98.99 (5.68) | 99.69 (2.22) | 5.0 | 0.16 | 0.34* | 0.70 | 5.08 | −7.66, 9.06 | −9.26, 10.66

Note: ρ = Spearman’s rho; CI = confidence interval; d = Cohen’s d; Mdiff = mean difference (Time 2 − Time 1); Discrim. = discriminability.
a Mdiff ± S.D. can also serve as the 68% RCI confidence interval.
b Data are presented as paired-samples t-tests.
c Pearson’s product-moment correlation coefficients.
* p ≤ 0.01 (statistical trends).
** p < 0.001.
Table 3
Test-retest data for the CVLT-II standard/alternate form study sample (N = 115)

CVLT-II variable | Time 1 | Time 2 | Wilcoxon z | d | ρ | Mdiff | S.D.a | 90% CI | 95% CI

Primary measures
Total trials 1–5 | 46.83 (11.27) | 48.16 (12.25) | −1.6b | 0.15 | 0.73c,** | 1.33 | 8.71 | −12.99, 15.66 | −15.74, 18.40
Short-delay free recall | 9.76 (3.54) | 10.43 (3.71) | −696.0* | 0.18 | 0.67** | 0.67 | 2.86 | −4.03, 5.37 | −4.94, 6.28
Short-delay cued recall | 11.04 (2.82) | 11.53 (3.17) | −627.5 | 0.16 | 0.61** | 0.49 | 2.54 | −3.69, 4.67 | −4.49, 5.47
Long-delay free recall | 10.24 (3.48) | 10.71 (3.67) | −496.0 | 0.13 | 0.69** | 0.47 | 2.71 | −3.99, 4.93 | −4.84, 5.78
Long-delay cued recall | 11.11 (3.06) | 11.50 (3.30) | −451.5 | 0.12 | 0.61** | 0.39 | 2.62 | −3.92, 4.70 | −4.75, 5.53
Total recognition discrim. | 2.97 (0.85) | 2.96 (0.89) | 108.5 | −0.01 | 0.62** | −0.01 | 0.69 | −1.15, 1.13 | −1.36, 1.34

Process measures
Trial 1 | 5.97 (1.68) | 6.18 (2.01) | −263.5 | 0.11 | 0.40** | 0.21 | 2.06 | −3.18, 3.60 | −3.83, 4.25
Trial 5 | 11.38 (2.93) | 11.73 (3.19) | −361.0 | 0.11 | 0.63** | 0.35 | 2.47 | −3.71, 4.41 | −4.49, 5.19
Trial B | 5.37 (2.18) | 5.10 (2.02) | 371.0 | −0.13 | 0.48** | 0.26 | 2.10 | −3.19, 3.71 | −3.86, 4.38
Semantic clustering | 0.73 (1.52) | 1.58 (2.00) | −1816.0** | 0.48 | 0.48** | 0.85 | 1.70 | −1.95, 3.65 | −2.48, 4.18
Serial clustering forward | 0.59 (0.91) | 0.39 (0.76) | 839.5 | −0.24 | 0.36** | −0.20 | 0.88 | −1.65, 1.25 | −1.93, 1.53
Serial clustering bidirectional | 0.68 (0.94) | 0.50 (0.90) | 641.0 | −0.19 | 0.42** | −0.18 | 0.95 | −1.74, 1.38 | −2.04, 1.68
Subjective clustering | 0.73 (0.82) | 0.94 (0.85) | −687.5 | 0.25 | 0.17 | 0.17 | 1.08 | −1.61, 1.95 | −1.95, 2.29
Total learning slope | 1.31 (0.63) | 1.34 (0.58) | −225.0 | 0.05 | 0.36** | 0.03 | 0.68 | −1.09, 1.15 | −1.30, 1.36
Immediate recall discrim. | 2.04 (0.45) | 2.08 (0.51) | −1.0b | 0.08 | 0.64c,** | 0.04 | 0.41 | −0.63, 0.71 | −0.76, 0.84
Free recall discrim. | 1.95 (0.50) | 2.00 (0.54) | −1.2b | 0.10 | 0.73c,** | 0.04 | 0.39 | −0.60, 0.68 | −0.72, 0.80
Cued recall discrim. | 2.10 (0.92) | 2.30 (0.91) | −1035.0* | 0.22 | 0.60** | 0.20 | 0.74 | −1.02, 1.42 | −1.25, 1.65
Delayed recall discrim. | 2.09 (0.83) | 2.25 (0.88) | −914.0* | 0.19 | 0.65** | 0.16 | 0.66 | −0.93, 1.25 | −1.13, 1.45
Total recall discrim. | 1.99 (0.56) | 2.06 (0.60) | −1.8b | 0.12 | 0.74c,** | 0.07 | 0.42 | −0.62, 0.76 | −0.75, 0.89
Across-trial consistency | 77.65 (14.46) | 81.77 (13.70) | −1185.0** | 0.29 | 0.43** | 4.12 | 11.90 | −15.46, 23.70 | −23.20, 27.44
Percent recall primacy | 28.15 (6.47) | 30.95 (6.44) | −1095.5* | 0.43 | 0.25* | 2.80 | 7.83 | −10.08, 15.68 | −12.55, 18.15
Percent recall middle | 44.30 (7.43) | 40.42 (7.40) | 1251.0** | −0.52 | 0.21 | −3.88 | 9.40 | −19.34, 11.58 | −22.30, 14.54
Percent recall recency | 27.70 (7.43) | 28.77 (6.75) | −507.5 | 0.10 | 0.30* | 1.06 | 8.43 | −12.81, 14.93 | −15.46, 17.58
Immediate recall intrusions | 1.28 (1.94) | 1.49 (2.05) | −182.0 | 0.10 | 0.19 | 0.21 | 2.47 | −3.85, 4.27 | −4.63, 5.05
Free recall intrusions | 2.66 (3.30) | 3.05 (3.74) | −276.0 | 0.11 | 0.31** | 0.39 | 3.84 | −5.93, 6.71 | −7.14, 7.92
Cued recall intrusions | 2.30 (3.43) | 1.67 (2.95) | 393.0 | −0.20 | 0.45** | −0.63 | 2.70 | −5.07, 3.81 | −5.92, 4.66
Delayed recall intrusions | 3.38 (4.56) | 2.77 (4.46) | 372.5 | −0.13 | 0.41** | −0.61 | 3.99 | −7.17, 5.95 | −8.43, 7.21
Total intrusions | 4.97 (5.83) | 4.72 (6.09) | 181.0 | −0.04 | 0.39** | −0.25 | 5.51 | −9.31, 8.81 | −11.05, 10.55
Immediate recall repetitions | 3.18 (3.36) | 3.09 (3.02) | 81.0 | −0.03 | 0.30** | −0.09 | 3.76 | −6.28, 6.10 | −7.46, 7.28
Total repetitions | 4.34 (4.45) | 3.92 (3.87) | 315.0 | −0.10 | 0.27* | −0.42 | 4.92 | −8.51, 7.67 | −10.06, 9.22
Source recognition discrim. | 2.81 (0.81) | 2.82 (0.88) | −37.0 | 0.01 | 0.63** | 0.01 | 0.65 | −1.06, 1.08 | −1.26, 1.28
Novel recognition discrim. | 2.87 (0.74) | 2.84 (0.76) | 145.5 | −0.04 | 0.53** | −0.03 | 0.68 | −1.15, 1.09 | −1.36, 1.30
Semantic recognition discrim. | 2.68 (0.86) | 2.66 (0.91) | 197.5 | −0.02 | 0.59** | −0.02 | 0.77 | −1.29, 1.25 | −1.53, 1.49
Total response bias | 0.07 (0.35) | 0.11 (0.35) | −396.0 | 0.11 | 0.43** | 0.04 | 0.34 | −0.52, 0.60 | −0.63, 0.71
Forced-choice recognition | 99.48 (1.87) | 99.63 (1.44) | −10.5 | 0.09 | 0.06 | 0.15 | 2.31 | −3.65, 3.95 | −4.38, 4.68

Note: ρ = Spearman’s rho; CI = confidence interval; d = Cohen’s d; Mdiff = mean difference (Time 2 − Time 1); Discrim. = discriminability.
a Mdiff ± S.D. can also serve as the 68% RCI confidence interval.
b Data are presented as paired-samples t-tests.
c Pearson’s product-moment correlation coefficients.
* p ≤ 0.01 (statistical trends).
** p < 0.001.
In the standard/standard group, reliability coefficients for the primary measures ranged from 0.80 (e.g., total trials 1–5) to 0.84 (recognition discriminability). Somewhat greater variability in reliability was observed for the process measures, with coefficients ranging from 0.19 (immediate recall intrusions) to 0.83 (e.g., total recall discriminability). Significant practice effects emerged on 60% of the CVLT-II measures in this cohort, including all of the primary measures, with associated effect size values (d) ranging from 0.05 (e.g., serial clustering forward) to 0.95 (trial 1).

Table 3 shows that reliability coefficients for the primary CVLT-II measures in the standard/alternate group ranged from 0.61 (short-delay cued recall) to 0.73 (total trials 1–5). Similar to results from the standard/standard sample, the secondary process measures displayed more variability in reliability coefficients than the primary measures, with coefficients falling between 0.06 (forced-choice recognition) and 0.74 (total recall discriminability). Statistically significant effects of practice were present on only 11% of the CVLT-II measures studied, and on none of the primary measures, with the exception of a small, trend-level practice effect on short-delay free recall (d = 0.18). The effect sizes (d) associated with practice in the standard/alternate group ranged from 0.01 (e.g., total recognition discriminability) to 0.52 (percent recall middle).
Table 4
Base rates of reliable change on the primary CVLT-II indices

CVLT-II variable | 90% RCI: Declined (%) | 90% RCI: Stable (%) | 90% RCI: Improved (%) | 95% RCI: Declined (%) | 95% RCI: Stable (%) | 95% RCI: Improved (%)

Standard/standard (N = 80)
Total trials 1–5 | 2.5 | 90.0 | 7.5 | 2.5 | 93.8 | 3.8
Short-delay free recall | 3.8 | 92.5 | 3.8 | 1.3 | 95.0 | 3.8
Short-delay cued recall | 3.8 | 90.0 | 6.3 | 1.3 | 96.3 | 2.5
Long-delay free recall | 5.0 | 85.0 | 10.0 | 1.3 | 96.3 | 2.5
Long-delay cued recall | 5.0 | 87.5 | 7.5 | 1.3 | 95.0 | 3.8
Total recognition discrim. | 6.3 | 87.5 | 6.3 | 1.3 | 92.5 | 6.3

Standard/alternate (N = 115)
Total trials 1–5 | 5.2 | 93.0 | 1.7 | 2.6 | 95.7 | 1.7
Short-delay free recall | 4.4 | 92.2 | 3.5 | 4.4 | 93.9 | 1.7
Short-delay cued recall | 6.1 | 92.2 | 1.7 | 4.4 | 93.9 | 1.7
Long-delay free recall | 7.0 | 87.0 | 6.1 | 3.5 | 93.9 | 2.6
Long-delay cued recall | 6.1 | 88.7 | 5.2 | 4.4 | 93.0 | 2.6
Total recognition discrim. | 3.5 | 90.4 | 6.1 | 0.0 | 96.5 | 3.5

Note: Due to rounding error, the summed proportions of improvements, declines, and stability do not equal 100 for certain variables.
Next, the RCI methodology was used to determine the base rates of significant improvements, declines, and stability on the primary CVLT-II variables, including total trials 1–5, short-delay free recall, short-delay cued recall, long-delay free recall, long-delay cued recall, and total recognition discriminability. Specifically, the 90% and 95% RCI confidence intervals displayed in Tables 2 and 3 were applied to each individual participant’s difference score (Time 2 − Time 1) on these six variables. Difference scores falling within the predetermined confidence interval were classified as reflecting normal variability in performance (i.e., “stable”), whereas scores outside the confidence interval were designated as significantly improved or declined, as appropriate. Results for both the standard/standard and standard/alternate groups are displayed in Table 4.

3. Discussion

Professionals in research and clinical settings commonly use the CVLT-II, a test whose construct validity as a measure of episodic verbal learning and memory is supported by a considerable body of research. Few studies to date have examined the temporal stability of the CVLT-II, which is an essential step in determining its potential usefulness for measuring cognitive change (McCaffrey & Westervelt, 1995). The current study supports the 1-month test-retest reliability of the CVLT-II in a sample of 195 healthy adults. Results revealed generally large test-retest correlation coefficients for the primary measures on both the standard and alternate forms of the CVLT-II. Overall, the observed test-retest reliability and practice effects were generally consistent with those documented for the original CVLT (e.g., Paolo et al., 1997), as well as those of other list-learning and memory tasks (e.g., the Hopkins Verbal Learning Test – Revised [HVLT-R]; Benedict, Schretlen, Groninger, & Brandt, 1998).

Data from this study may therefore enhance the applicability of the CVLT-II for the purpose of longitudinal neuropsychological evaluation. In particular, the RCIs (Tables 2 and 3) provide user-friendly, statistically based guidelines for neuropsychologists who are faced with the task of deciding whether observed changes in an individual’s CVLT-II profile are meaningful. Such complex determinations are commonly requested of both clinicians and researchers, for example in the context of diagnostic decision-making (e.g., incident dementia) or assessing treatment outcomes (e.g., pharmacological and surgical trials) (Chelune, 2002). Although the RCI methodology is not without its critics or controversies (e.g., Maassen, 2000), it nevertheless performs comparably to more complex regression formulas in detecting cognitive changes in healthy and clinical populations (Heaton et al., 2001; Temkin et al., 1999).
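Before turning to the worked example below, the classification rule can be made concrete with a short Python sketch. The function name is illustrative rather than part of the original materials; the default Mdiff and S.D. values are simply those reported in Table 3 for total trials 1–5 (standard/alternate form), and z = 1.96 yields the 95% RCI confidence interval.

def classify_change(time1_score, time2_score, m_diff=1.33, sd_diff=8.71, z=1.96):
    # Classify a retest difference score against a published RCI confidence interval.
    lower = m_diff - z * sd_diff   # 95% RCI lower bound (-15.74 for total trials 1-5)
    upper = m_diff + z * sd_diff   # 95% RCI upper bound (+18.40 for total trials 1-5)
    diff = time2_score - time1_score
    if diff > upper:
        return "improved"
    if diff < lower:
        return "declined"
    return "stable"

# Example: a baseline raw score of 40 and an alternate-form retest score of 49
classify_change(40, 49)   # returns "stable"; +9 words falls within -15.74 to +18.40

Applying the same rule to each participant’s difference score, at either confidence level, is what produces the base rates summarized in Table 4.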
To apply the CVLT-II RCIs in practice, research consumers should examine either Table 2 or Table 3, depending on whether or not the alternate form was used at retest, to determine if retest difference scores (Time 2 − Time 1) reflect statistically meaningful changes in performance. To illustrate, let us assume that a particular client achieves a CVLT-II standard form total trials 1–5 raw score of 40 at baseline and a retest score of 49 on the alternate form. Has this individual demonstrated a significant improvement in learning? The RCI data displayed in Table 3 allow the clinician to assert with 95% confidence that the observed improvement at retest (+9 words) is not statistically unusual, but instead likely reflects expected practice effects and normal variability. Indeed, an improvement of more than 18 words (or a decline of 16 or more words) would be necessary to classify the change in performance as significant using this approach. The availability of multiple confidence interval widths allows users to modify the risk of Type I error as needed (Chelune, 2002). Of course, the RCI methodology will be most effective when supplemented with other relevant clinical data regarding individual (e.g., fatigue and effort) and external (e.g., changes in the testing environment) factors that might influence the interpretation of changes in CVLT-II scores over time.

In general, the temporal stability of the primary learning, recall, and recognition measures was better than that of the more detailed process measures (e.g., errors, contrast scores, and ratios). Although analysis of the quantified process features of the CVLT-II is useful in delineating the underlying cognitive mechanisms of general memory deficits, enhancing diagnosis, and informing remedial efforts in neuropsychiatric disorders (Delis et al., 2003; Poreh, 2000), caution is nevertheless encouraged in interpreting changes in these component process measures over time in healthy adults. Such findings are commensurate with prior test-retest data on component process indices from the original CVLT (Delis et al., 1991; Paolo et al., 1997), as well as the HVLT-R (Woods et al., 2005). The relatively lower reliability of these measures may be a psychometric artifact: small differences between raw scores with restricted ranges can adversely affect reliability coefficients, particularly in nonclinical samples. As aptly noted by Anastasi (1988), reliability coefficients are highly dependent upon the heterogeneity and ability level of the samples from which they are derived. Accordingly, it is possible that the reliability of the CVLT-II process measures may be more robust in clinical populations, where consideration of the strategies and processes involved in learning and remembering is most fruitful (Delis, Kaplan, & Kramer, 2001). For instance, intrusion errors displayed only modest test-retest reliability in the current study; however, intrusions are considerably more prevalent in individuals with Alzheimer’s disease (AD) than in healthy controls (e.g., Davis, Price, Kaplan, & Libon, 2002), which may therefore yield a broader range and distribution of test scores in AD with which to evaluate their reliability and general psychometric properties.

Reliability coefficients for participants who received the standard form at both assessments were consistent with Benedict (2005), with the exception of the coefficient for delayed recall, which was somewhat higher in our nonclinical group.
In contrast to Benedict (2005), however, we observed slightly lower test-retest reliability in participants who received the alternate form at retest. This finding is consistent with the long-standing hypothesis that alternate test forms may introduce additional test-retest error (i.e., decrease reliability) as a function of variability in both the individual and the test content (Benedict, 2005; Groth-Marnat, 1997). Nevertheless, participants who received the alternate form of the CVLT-II displayed notably smaller practice effects, despite exhibiting modestly lower reliability than participants who received the standard form on both occasions. For example, total trials 1–5 yielded a reliability coefficient of 0.80 and a practice effect size of 0.61 in the standard/standard group, whereas the effects of practice on this same measure were negligible in the standard/alternate cohort (d = 0.15), even though the test-retest reliability coefficient was only slightly lower (ρ = 0.73). Thus, results from this study support previous recommendations to use alternate forms of the CVLT-II – and of episodic memory tests in general – whenever possible so as to minimize the confounding effects of practice (e.g., Benedict, 2005; Bird, Papadopoulou, Ricciardelli, Rossor, & Cipolotti, 2003).

A few limitations to the interpretation and application of these data merit discussion. Most notably, the average test-retest interval (approximately 1 month) was considerably shorter than the intervals typically used in many clinical settings (e.g., 6–12 months), and the generalizability of the present reliability and practice-effect estimates to longer intervals has yet to be determined. The study sample nevertheless had a reasonable range of test-retest intervals (9–74 days), and post hoc analyses revealed no correlation between test-retest interval length and change (i.e., Time 2 − Time 1 difference scores) on any primary CVLT-II measure (all ps > 0.10). It also deserves mention that the test-retest interval used in this study closely approximates the design of certain types of studies, including those on the cognitive effects (and safety) of surgical interventions (e.g., coronary artery bypass graft surgery) and pharmacological agents (e.g., atypical antipsychotics). Finally, the external validity of these data may be constrained by the demographic and health characteristics of the study sample.
It is conceivable that CVLT-II RCIs derived from healthy adults may not generalize to clinical samples, whose baseline cognitive impairment, natural disease progression, and demographic differences might alter statistical definitions of unusual changes in list-learning and memory. To this end, controversy remains as to whether the absence of expected practice effects, judged against RCIs derived from controls, can signify cognitive decline in a clinical sample (Fields & Tröster, 2000). Nevertheless, recent data from individuals with schizophrenia-spectrum disorders (Heaton et al., 2001) and HIV-1 infection (Woods et al., 2006) support the specificity of RCIs in neurologically stable clinical samples. Evaluation of the construct validity of the CVLT-II RCIs in classifying central nervous system insults and recovery (e.g., sensitivity to incident traumatic brain injury) across longer test-retest intervals will be an important target for future research.

References

Alexander, M. P., Stuss, D. T., & Fansabedian, N. (2003). California Verbal Learning Test: Performance by patients with focal frontal and non-frontal lesions. Brain, 126, 1493–1503.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan Publishing Co., Inc.
Baldo, J. V., Delis, D., Kramer, J., & Shimamura, A. P. (2002). Memory performance on the California Verbal Learning Test – II: Findings from patients with focal frontal lesions. Journal of the International Neuropsychological Society, 8, 539–546.
Benedict, R. H. B. (2005). Effects of using same versus alternate-form memory tests during short-interval repeated assessments in multiple sclerosis. Journal of the International Neuropsychological Society, 11, 727–736.
Benedict, R. H. B., Schretlen, D., Groninger, L., & Brandt, J. (1998). Hopkins Verbal Learning Test – Revised: Normative data and analysis of inter-form and test-retest reliability. The Clinical Neuropsychologist, 12, 43–55.
Bird, C. M., Papadopoulou, K., Ricciardelli, P., Rossor, M. N., & Cipolotti, L. (2003). Test-retest reliability, practice effects and reliable change indices for the recognition memory test. British Journal of Clinical Psychology, 42, 407–425.
Chelune, G. J. (2002). Making neuropsychological outcomes research consumer friendly: A commentary on Keith et al. (2002). Neuropsychology, 16, 422–425.
Chelune, G. J., Naugle, R. I., Lüders, H., Sedlack, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology, 7, 41–52.
Crosson, B., Novack, T. A., Trenerry, M. R., & Craig, P. L. (1988). California Verbal Learning Test (CVLT) performance in severely head-injured and neurologically normal adult males. Journal of Clinical and Experimental Neuropsychology, 10, 754–768.
Davis, K. L., Price, C. C., Kaplan, E., & Libon, D. J. (2002). Error analysis of the nine-word California Verbal Learning Test (CVLT-9) among older adults with and without dementia. The Clinical Neuropsychologist, 16, 81–89.
Delis, D. C., Jacobson, M., Bondi, M. W., Hamilton, J. M., & Salmon, D. P. (2003). The myth of testing construct validity using factor analysis or correlations with normal or mixed clinical populations: Lessons from memory assessment. Journal of the International Neuropsychological Society, 9, 936–946.
Delis, D. C., Kaplan, E., & Kramer, J. H. (2001). Delis-Kaplan Executive Function System: Technical manual. San Antonio, TX: Psychological Corporation.
Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (1987). California Verbal Learning Test: Adult version. Manual. San Antonio, TX: Psychological Corporation.
Delis, D. C., Kramer, J. H., Kaplan, E., & Ober, B. A. (2000). California Verbal Learning Test – second edition: Adult version. Manual. San Antonio, TX: Psychological Corporation.
Delis, D. C., McKee, R., Massman, P. J., Kramer, J. H., Kaplan, E., & Gettman, D. (1991). Alternate form of the California Verbal Learning Test: Development and reliability. The Clinical Neuropsychologist, 5, 154–162.
Fields, J. A., & Tröster, A. I. (2000). Cognitive outcomes after deep brain stimulation for Parkinson’s disease: A review of initial studies and recommendations for future research. Brain and Cognition, 42, 268–293.
Groth-Marnat, G. (1997). Handbook of psychological assessment. New York: John Wiley & Sons, Inc.
Heaton, R. K., Temkin, N., Dikmen, S., Avitable, N., Taylor, M. J., Marcotte, T. D., & Grant, I. (2001). Detecting change: A comparison of three neuropsychological methods, using normal and clinical samples. Archives of Clinical Neuropsychology, 16, 75–91.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
Kibby, M. Y., Schmitter-Edgecombe, M., & Long, C. J. (1998). Ecological validity of neuropsychological tests: Focus on the California Verbal Learning Test and the Wisconsin Card Sorting Test. Archives of Clinical Neuropsychology, 13, 523–534.
Maassen, G. H. (2000). Principles of defining reliable change indices. Journal of Clinical and Experimental Neuropsychology, 22, 622–632.
McCaffrey, R. J., & Westervelt, H. J. (1995). Issues associated with repeated neuropsychological assessments. Neuropsychology Review, 5, 203–221.
Paolo, A. M., Tröster, A. I., & Ryan, J. J. (1997). Test-retest stability of the California Verbal Learning Test in older persons. Neuropsychology, 11, 613–616.
Poreh, A. M. (2000). The quantified process approach: An emerging methodology to neuropsychological assessment. The Clinical Neuropsychologist, 14, 212–222.
Rabin, L. A., Barr, W. B., & Burton, L. A. (2005). Assessment practices of clinical neuropsychologists in the United States and Canada: A survey of INS, NAN, and APA Division 40 members. Archives of Clinical Neuropsychology, 20, 33–65.
Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant change in neuropsychological test performance: A comparison of four models. Journal of the International Neuropsychological Society, 5, 357–369.
Woods, S. P., Childers, M., Ellis, R. J., Guaman, S., Grant, I., Heaton, R. K., & The HNRC Group. (2006). A battery approach for measuring neuropsychological change. Archives of Clinical Neuropsychology, 21, 83–89.
Woods, S. P., Scott, J. C., Conover, E., Marcotte, T. D., Heaton, R. K., Grant, I., & The HNRC Group. (2005). Test-retest reliability of component process variables within the Hopkins Verbal Learning Test – Revised. Assessment, 12, 96–100.