Behavioural Processes 54 (2001) 137–154
www.elsevier.com/locate/behavproc
Evaluating data from behavioral analysis: visual inspection or statistical models?

Gene S. Fisch
Department of Epidemiology and Public Health and the Child Study Center, Division of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA

Accepted 5 January 2001
Abstract

Traditional behavior analysis relies upon single-subject study designs and visual inspection of graphed data to evaluate the efficacy of experimental manipulations. Attempts to apply statistical inferential procedures to analyze such data have been successfully opposed for many decades, despite problems with visual inspection and increasingly cogent arguments for utilizing inferential statistics. In a series of experiments, we show that trained behavior analysts often identify level shifts in responding during intervention phases ('treatment effect') in modestly autocorrelated data, but trends are either misconstrued as level treatment effects or go completely unnoticed. Errors in trend detection illustrate the liabilities of using visual inspection as the sole means by which to analyze behavioral data. Meanwhile, because of greatly increased computer power and advanced mathematical techniques, previously undeveloped or underutilized statistical methods have become far more sophisticated and have been brought to bear on a variety of problems associated with repeated-measures data. I present several nonparametric procedures and other statistical techniques to evaluate traditional behavioral data, to augment, not replace, visual inspection procedures. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Visual inspection; Behavior analysis; Statistical inference; Randomization tests; Signal detection analysis
"A graph is nothing but a visual metaphor. To be truly evocative it must correspond closely to the phenomena it depicts... If a graphic depiction of data does not faithfully follow this notion it is almost sure to be misleading" — Howard Wainer, Visual Revelations, 1996.
E-mail address: [email protected] (G.S. Fisch).
1. Studies of responses to the visual inspection of data

Historically, the experimental analysis of behavior (EAB) has, for various reasons, eschewed the use of inferential statistics in favor of visual inspection of data. Initially, Skinner (1938) objected to inferential statistics on the grounds that
it was a poor substitute for well-defined methods of measurement. Similarly, Sidman (1960) argued that inferential statistics were used to bolster data in which experimental control was lacking. Skinner (1956) later modified his original remarks to reflect the then climate of (over) use of statistics in experimental psychology. My interest in the use of inferential statistics — or lack thereof — in EAB arose from debates in the applied behavior analysis literature of the 1970s and 1980s concerning the problems associated with visual inspection of data (Kazdin, 1977; Yeltman et al., 1977; DeProspero and Cohen, 1977; Wampold and Furlong, 1981; Furlong and Wampold, 1982; Fisch and Schneider, 1993). As a result, we began a series of experiments to examine the parameters that might affect visual inspection. Over a period of several years, sets of graphed data exhibiting point-to-point functions in different experimental phases were shown to various groups of graduate students and faculty trained in behavioral analysis and single-subject
design. In the five studies we fashioned, the materials were graphs displaying point-to-point functions generated by an autoregressive model originally employed by Matyas and Greenwood (1990):

Y_t = aY_(t−1) + b + d + e,

where Y_t is the current response value, Y_(t−1) the response value from the previous session, a the autoregressive coefficient, b a constant placing the function in the middle of the graph, d a level intervention (or treatment) effect, and e = ks, where k is a scale parameter and s is a randomly selected z-score from a standard normal distribution. A sample graph used in our experiments is shown in Fig. 1.
Fig. 1. Sample of a graph used to present a point-to-point function generated by an autoregressive equation. The three sections of the graph represent responses in the baseline, intervention, and return-to-baseline conditions of an experiment.
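For readers who want to reproduce such stimuli, a minimal sketch of the generating model follows, assuming the parameter values described in the next paragraph; the starting value and the random seed are illustrative choices of mine, not specified in the original studies.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen arbitrarily, for reproducibility

def ar_phase(n_sessions, a=0.30, b=50.0, d=0.0, k=2.0, y0=50.0):
    """One phase of Y_t = a*Y_(t-1) + b + d + e, with e = k*s and
    s a randomly selected z-score from a standard normal distribution."""
    y = []
    prev = y0
    for _ in range(n_sessions):
        prev = a * prev + b + d + k * rng.standard_normal()
        y.append(prev)
    return y

# An AB graph: 10 baseline sessions (d = 0), then 10 intervention sessions (d = 6).
series = ar_phase(10) + ar_phase(10, d=6.0)
```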
In four of the five studies, the parameters used in the autoregressive equations were as follows: a = (0, 0.30), i.e. either no autoregressive component or a modest autoregressive value; b = 50, placing the function in the middle of the graph; d = (0, 6), i.e. no treatment effect or a treatment effect; e = (2, 4), i.e. lesser or greater variability. When a treatment effect was present, the effect size, d/e, ranged from 1.5 to 3. Cohen (1962) considers an effect size of 1 to be large. In Studies 1 (Greenspan and Fisch, 1992) and 2 (Fisch and Schneider, 1993), 48 different graphs were presented to each of two groups of graduate students enrolled in separate courses covering research design. In Study 3 (Fisch and Porto, 1994), 32 different graphs were presented to nine graduate students and the instructor, first at the beginning, then at the end of a one-semester course in single-subject design. Study 4 (Lee and Fisch, unpublished data) enlisted five faculty trained as behaviorists to examine 96 graphs. Study 5 (Fisch and Crosbie, unpublished data) asked 10 graduate students from a different university to evaluate 48 graphs. The parameters manipulated in the graphs for Study 1 (Greenspan and Fisch, 1992) were study design (AB, or baseline/intervention, vs. ABA, or baseline/intervention/return to baseline) and the number of points, i.e. sessions, in each phase of the design (5 or 10). In Study 2 (Fisch and Schneider, 1993), labelling of the ordinate (rate of response vs. proportion of responses) and positioning of the point-to-point functions in the figure (close to the x-axis, in the middle, close to the top) were manipulated. In Study 3 (Fisch and Porto, 1994), participants were asked to draw the best-fitting line through the point-to-point functions in graphs of the study designs. In Study 4 (Lee and Fisch, unpublished data), a subset of the graphs used in Studies 1 and 2 was presented, with and without statistical process control (SPC) guidelines, to experienced (faculty) and novice (graduate student) behavior analysts. SPC guidelines are parallel lines, typically constructed three standard deviations above and below the mean, used to identify outlying values in temporal data. In Study 5 (Fisch and Crosbie, unpublished data), responses to point-to-point functions generated by autoregressive equations were compared to those generated by ordinary linear regressions. Slopes of the regression functions were comparable to those generated by the autoregressive equations. For each graph, participants were asked to specify, compared to values observed in a
line’ phase (condition ‘A’), if they observed: [1] a level ‘treatment’ effect only had appeared during the ‘intervention’ phase (condition ‘B’); [2] a trend only had appeared during intervention; [3] both a level treatment effect and trend appeared during intervention; [4] neither treatment effect nor trend appeared during intervention; or [5] some other systematic change occurred. Subjects were shown figures of examples of each effect used by Furlong and Wampold, 1982, (Fig. 1). In addition to correctly identified graphs in which a level treatment effect alone, trend alone or both treatment effect and trend were presented (‘correct hits’); or, those in which neither treatment effect nor trend were generated (‘correct rejections’), many different error types were produced. For example, a ‘Type I’ error response was recorded when a graph in which neither a treatment effect nor trend was generated but the subject reported a level treatment effect. Or, a ‘Type II’ error occurred when a treatment effect was generated but the subject reported nothing had occurred. Another (unclassified) error generated was one in which the autoregressive equation produced both level treatment effect and trend, but only the treatment effect component was reported (see Table 1). The results from four studies (Greenspan and Fisch, 1992; Fisch and Schneider, 1993; Lee and Fisch, unpublished data; Fisch and Crosbie, unpublished data) are shown in Tables 2a and 2b. The overall impression of the results in Table 2a is that level treatment effects are not as well detected as perhaps they should be. More remarkably, trends are hardly noticed at all. Although it appears that participants can more readily detect treatment effects in ABA compared AB designs, some outcomes are at variance with what is generally thought to be true about visually inspected data. Specifically, increasing the number of data points in a given condition does not necessarily translate into improved visibility. For example, Greenspan and Fisch (1992) show that in ABA designs, when there are 10 points presented in the baseline condition and 5 points in the intervention, a markedly higher proportion of treatment effects are correctly identified (72%) than when 10 points are shown in both baseline and intervention conditions (6%) (Table 2a).
Table 1
Response types given the presentation of a graph type (X = error not classified as Type I or Type II)

                               Graph type presented
Response given                 Treatment       Trend only      Treatment       No effect
                               effect only                     and trend
Treatment effect only          Correct R       X               X               Type I error
Trend only                     X               Correct R       X               Type I error
Treatment and trend            X               X               Correct R       Type I error
Neither treatment nor trend    Type II error   Type II error   Type II error   Correct R
Other systematic change        X               X               X               Type I error
The most striking feature across these studies is the widespread inability of subjects to detect trends, either when they are presented as the only change generated during the intervention condition; or, when they are generated concurrently with level treatment effects (Table 2b). Trends alone are not easily identified as such irrespective of design (Greenspan and Fisch, 1992) or experience (Lee and Fisch, unpublished data). Trends alone do seem to be noticed more readily when point-to-point functions are placed near the bottom of the figure (Fisch and Schneider, 1993), although SPC guidelines do not seem to facilitate their discrimination (Lee and Fisch, unpublished data). However, if trends are generated by ordinary linear equations, a higher proportion are identified (Fisch and Crosbie, unpublished data). Also striking is the high error rate of detecting the treatment effect component only when both treatment effects and trends were generated. This high error rate is sustained across all studies and parameters manipulated (Table 2b).
2. Signal detection analysis of responses to visually inspected data

In an effort to characterize in greater detail the types of error subjects made, Karp and Fisch (1996) applied signal detection analysis (SDA) to the responses by subjects in Study 1 (Greenspan and Fisch, 1992), Study 2 (Fisch and Schneider, 1993) and Study 5 (Fisch and Crosbie, unpublished data). For the purposes of SDA, treatment effects and trends were treated as separate signals. A response was designated a 'hit' when a graph
containing the signal (treatment or trend) was presented and identified as such. Hit rates and 'false alarm' rates (Type I errors) were calculated and graphed to determine how readily signals were detected (sensitivity) and how cautious the observer was (response bias). For those participants whose responses could be evaluated, results from Study 1 indicated that most subjects could identify treatment effects above chance level. About half the observers were either too cautious or too liberal in their response bias measures. SDA for trends paints a very different picture. Only one of eight subjects exhibited above-chance sensitivity. Trend detection was also accompanied by a conservative bias, indicating a willingness to identify a trend as present only when it was strongly suspected. As in Study 1, participants in Study 2 (Fisch and Schneider, 1993) were generally able to detect treatment effects, although they were moderately conservative in their identification pattern. Also as in Study 1, most respondents were unable to identify trends, and their response bias towards trends was conservative. In Study 5, subjects from another university were generally better at detecting both treatment effects and trends than those in Studies 1 and 2, and showed little or no response bias to treatment effects. In all three studies, however, response bias to trends tended to be conservative.
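Karp and Fisch (1996) report sensitivity and response-bias measures, but their formulas are not reproduced here. The sketch below shows the standard equal-variance Gaussian SDA computation; the hit and false-alarm rates are purely hypothetical values chosen for illustration, not the observers' actual data.

```python
from scipy.stats import norm

hit_rate, fa_rate = 0.80, 0.15  # hypothetical observer, for illustration only

d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)             # sensitivity: approx 1.88
criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))  # response bias: approx 0.10
```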
3. Statistical analysis of single-subject behavioral data

Previously, attempts were made to introduce standard parametric tests of statistical inference,
e.g. t-tests and analysis of variance (ANOVA) into behavioral analysis. Both statistical procedures were subsequently rejected for appropriate reasons. For the t-test, one cannot assume that samples of observations comparing baseline responses to those noted during intervention were obtained at random, or that the observations themselves were independent of one another. As for ANOVA, one cannot assume that the sample
of responses was drawn from a population with known mean and variance, whose error term is normally distributed. If there is a relatively small number of subjects, the smallness of the sample may not permit application of the Central Limit Theorem, wherein the sampling distribution of means is normally distributed. Thirdly, the subject is not randomly assigned to one condition or another (baseline; intervention; return-to-baseline) in the experiment. Finally, and most importantly from the view of inferential statistics, the assumption of independence of one response measure from the next is violated. That is, there exists serial dependence or autocorrelation in successive response measures. Attempts to resolve the issues raised by autocorrelation involved implementation of time-series analysis (TSA; Box and Jenkins, 1970) or interrupted TSA (Hartmann et al., 1980; Crosbie, 1993). These procedures have proved equally vexing, as the number of data points in each condition required to perform a proper statistical analysis varies from 40 to 75 (Busk and Marascuilo, 1988). In addition, autocorrelated time lags need to be specified and constant, observations must be equally spaced in time, and missing data are not permitted.
Table 2a
Summary of correct responses (in %) from four studies in which observers examined graphed point-to-point functions for level treatment effects and/or trends, and the parameters manipulated in each study

                                               Percent correct responses
Study                  Parameter               Level treatment  Trend  Treatment effect  Neither treatment
                                               effect only      only   and trend         nor trend
Greenspan and Fisch    Design
(1992)                   ABA                   37               11     24                67
                         AB                    26                9     22                59
                       Sessions/phase
                         10:5:10               72               17     28                56
                         10:5                  44               17     39                39
                         10:10:10               6                6     17                72
                         10:10                  0                6     11                67
Fisch and Schneider    Ordinate label
(1993)                   Proportion            32               18     13                57
                         Rate                  36               19     14                65
                       Placement in graph
                         Low                   54               31     29                48
                         Middle                 8               15      8                77
                         High                  42               10      2                58
Lee and Fisch          Experience
(unpublished)            Students              34               15     19                62
                         Faculty               55                1      7                90
                       SPC guidelines (faculty only)
                         SPC guidelines        55                –      8                90
                         No guidelines         53                2      5                92
Fisch and Crosbie      Equation model
(unpublished)            Autoregressive        47               18     28                85
                         Linear function       50               25     40                93
Table 2b
Summary of errors (in %) in which graphs displayed both level treatment effects and trends, but only treatment effects were reported^a

Parameter manipulated            Error rate (%)
Design
  ABA                            61
  AB                             72
Sessions per phase
  10:5:10                        56
  10:5                           56
  10:10:10                       61
  10:10                          78
Ordinate label
  Proportion                     49
  Rate                           49
Graph placement
  Low                            56
  Middle                         83
  High                            4
Experience
  Grad students                  55
  Faculty                        63
SPC guidelines (faculty only)
  Yes guidelines                 62
  No guidelines                  63
Equation model
  Autoregressive                 57
  Linear equation                55

a Data are from the studies cited in Table 2a.
4. Proposed statistical alternatives

One solution to the problems posed by parametric procedures is to use non-parametric (NP) statistical inference in single-subject designs. This paper would not be the first to propose such a statistical alternative (Edgington, 1992; Davison, 1999). In general, NP tests make use of the data collected and, if sample sizes are small, construct exact distributions based on those data. The manner in which the distributions are constructed is based on counting techniques (permutations) that exhaust all possible outcomes that could have been generated by the data. Most of the procedures that follow are well known, but some novel techniques are also presented. Although many refer to comparisons in AB designs, with some slight modifications and other statistical adjustments they can be used with more elaborate single-subject layouts.
5. Analyzing level treatment effects
5.1. The Wilcoxon signed rank test (WSRT)

The WSRT can be used to evaluate pre-treatment (baseline) versus post-treatment (intervention) responding when only a single measure, or a representative of a single measure (e.g. the mean), is used to depict the baseline response (X_i), and a single measure is used to depict the intervention response (Y_i), for any number of subjects i. The WSRT is obtained by computing the differences between intervention and baseline measures for each subject, i.e. Z_i = Y_i − X_i. It is assumed that each Z_i comes from a continuous population (not necessarily the same one) symmetric about a common median, θ, where θ is referred to as the treatment effect. The statistical hypotheses are: [1] there is no treatment effect (H0: θ = 0); [2] there is a treatment effect (H1: θ > 0 {or θ < 0; or θ ≠ 0}). The statistic, T+, is obtained by forming the absolute values |Z_1|, ..., |Z_n| and ordering them from smallest to largest. These values are then assigned ranks R_i ranging from 1 to n. T+ is the sum of the ranks for which Z_i > 0.
A fictitious example employing baseline and intervention evaluations of depressed patients, based on a study in Hollander and Wolfe (1999), follows. In this example, nine patients who meet DSM-IV criteria for clinical depression were evaluated using the Hamilton depression scales before and after treatment with an antidepressant. Results (and analysis) are presented in Table 3.

Table 3
Fictitious data for nine subjects evaluated by the Hamilton depression scales (see Hollander and Wolfe, 1999)

Patient i   X_i (baseline)   Y_i (treatment)   Z_i   |Z_i|   R_i   R_i where Z_i > 0
1           18                9                −9     9      8     0
2            5                7                 2     2      3     3
3           16                6               −10    10      9     0
4           25               21                −4     4      4     0
5           17               11                −6     6      6     0
6           19               12                −7     7      7     0
7           16               11                −5     5      5     0
8           30               31                 1     1      2     2
9           13               13                 0     0      1     0

In our example, T+ = 5. To determine whether this value is unusual, we compare T+ = 5 for a sample size of n = 9 against an exact table for the Wilcoxon statistic (Hollander and Wolfe, 1999; Table A4) and obtain P0(T+ ≤ 5) = 0.020, which indicates strong evidence that the antidepressant leads to improvement in patients diagnosed as depressed, as measured by the Hamilton scale.
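For readers who prefer software to exact tables, the analysis can be approximated with scipy's Wilcoxon routine, as sketched below. Note that its default handling of the zero difference (patient 9) drops that observation rather than ranking it as in Table 3, so the reported statistic differs from the hand-computed T+ = 5 while leading to the same conclusion.

```python
from scipy import stats

baseline = [18, 5, 16, 25, 17, 19, 16, 30, 13]   # X_i from Table 3
treatment = [9, 7, 6, 21, 11, 12, 11, 31, 13]    # Y_i from Table 3

# One-sided test that treatment scores are shifted below baseline (improvement).
stat, p = stats.wilcoxon(treatment, baseline, alternative='less')
print(stat, p)  # statistic 3.0 with the zero dropped; p approx 0.02
```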
5.2. Repeated-measures analysis of variance (RM-ANOVA)

Repeated-measures analysis of variance (RM-ANOVA) has been available for many years for examining successive within-subject evaluations (cf. Winer, 1971). Strictly speaking, RM-ANOVA is a parametric test, and a major drawback in using it has been the assumption of homogeneity of the variance–covariance matrix. That is, the variances and covariances should be about the same across sessions and between subjects. The example that follows has been adapted from McCall and Appelbaum (1973), in which five subjects were tested over four consecutive sessions. The question is whether there has been
an effect of session on the behavioral measure. The data are presented in Table 4.

Table 4
Response measures for five subjects evaluated over four consecutive sessions (adapted from McCall and Appelbaum, 1973)

Subject   Session 1   Session 2   Session 3   Session 4
1         5           3           4           7
2         9           11          14          13
3         4           5           7           8
4         1           3           1           14
5         5           9           10          9
Means     4.8         6.2         7.2         10.2

RM-ANOVA can be used to assess whether there have been significant changes within a given phase of an experiment (baseline, intervention, return-to-baseline) or, in a more complex statistical analysis, whether there were changes across experimental conditions. For these data, the variance–covariance matrix (not presented here; cf. McCall and Appelbaum, 1973) appears to show wide differences in values. To determine whether the variances and covariances differ significantly from one another, one can compute a quick test of homogeneity, e.g. Fmax (cf. Winer, 1971), where Fmax = s²max/s²min. In our example, Fmax = 25.7/8.2 = 3.12 (P ≈ 0.07), which is sufficiently large to caution against using a simple RM-ANOVA without some adjustment. A standard RM-ANOVA would produce F(3,12) = (78.6/3)/(86.4/12) = 3.64 for these data, while the tabled value for F(3,12) at α = 0.05 is F = 3.49. That is, a simple RM-ANOVA would show significant differences within subjects across sessions. McCall and Appelbaum (1973) note that the inhomogeneity of the variance–covariance matrix has an adverse effect on the matrix of correlations from one session to the next. To produce a more conservative statistical analysis, they suggest the Greenhouse–Geisser correction be used to reduce the degrees of freedom (df) in the F-test. The Greenhouse–Geisser correction is generally available in statistical software packages, e.g. SPSS, SAS. It can also be computed from values generated by the variance–covariance
matrix, using the formulae provided by McCall and Appelbaum (1973). After completing their computations, the original df (3, 12) are reduced to (1, 5). If we assume a Type I error rate of α = 0.05, the tabled F-value becomes F(1,5) = 6.61. When this adjusted tabled F-value is compared with the obtained F = 3.64, the obtained F is no longer significant. Rather than ask whether there are differences across individual sessions within a condition, one might ask (as indicated earlier) whether the sessions in the baseline condition differ significantly from the sessions in the intervention condition. For such an analysis, one would employ a similar approach using a multivariate RM-ANOVA; that analysis is beyond the scope of this presentation, and the reader is referred to Winer (1971) and McCall and Appelbaum (1973).
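The unadjusted F-ratio reported above can be reproduced from the Table 4 data with a few lines of numpy, as in the sketch below; the Greenhouse–Geisser ε would then be used to shrink the degrees of freedom, a step not shown here.

```python
import numpy as np

# Table 4: five subjects (rows) by four sessions (columns).
y = np.array([[5, 3, 4, 7],
              [9, 11, 14, 13],
              [4, 5, 7, 8],
              [1, 3, 1, 14],
              [5, 9, 10, 9]], dtype=float)

n, k = y.shape
grand = y.mean()
ss_sessions = n * ((y.mean(axis=0) - grand) ** 2).sum()          # 78.6
ss_subjects = k * ((y.mean(axis=1) - grand) ** 2).sum()
ss_error = ((y - grand) ** 2).sum() - ss_sessions - ss_subjects  # 86.4
f = (ss_sessions / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))
print(f)  # approx 3.64, with (3, 12) df before any correction
```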
5.3. Friedman two-way analysis by ranks (FTWAR)
An alternative NP approach to RM-ANOVA that requires no adjustment to the computed statistic is the Friedman two-way analysis by ranks (FTWAR). The general layout for the FTWAR is the same as that for the repeated-measures ANOVA in Table 4. The FTWAR makes several assumptions, all of which fall within the purview of behavioral analysis. Specifically, all n subjects are different from one another, and their responses are observed in each of k sessions. Each obtained response X_ij comes from a continuous population whose parameters are not necessarily known. Response measures are independent of one another. For each subject, the k response measures can be ranked 1, ..., k, and this is true for all n subjects. There are two statistical hypotheses: there are no additive effects across sessions (H0: τ1 = τ2 = ... = τk); the alternative is that not all [τ1, ..., τk] are equal. The statistic employed for the FTWAR is a χ² goodness-of-fit statistic, computed by first ranking the responses within each subject, then summing the ranks in each of the j sessions. The Friedman χ² with (k − 1) df is obtained as follows:

χ² = [12/(nk(k+1))] Σj R_j² − 3n(k+1).

To correct for ties, the obtained χ² is divided by

1 − Σi T_i /[nk(k² − 1)],

where T_i = t_i³ − t_i and t_i is the number of tied observations for a given rank within subject i. To test the statistical hypotheses, the obtained χ² is compared to the tabled χ²_(k−1,α); if the obtained χ² ≥ χ²_(k−1,α), H0 is rejected. As an illustration, we use the previous example from McCall and Appelbaum (1973). Responses for each subject were ranked and then summed for each session, as shown in Table 5.

Table 5
Data obtained from Table 4, ranked within each subject

Subject    Session 1   Session 2   Session 3   Session 4
1          3           1           2           4
2          1           2           4           3
3          1           2           3           4
4          1.5         3           1.5         4
5          1           2.5         4           2.5
Rank sum   7.5         10.5        14.5        17.5

In our example, n = 5, k = 4, and there are two pairs of tied observations. The obtained χ² = 6.96; dividing by the correction factor, 0.96, produces an adjusted χ² = 7.25. The tabled χ²_(3,0.05) = 7.82; therefore, the obtained χ² is not significant.
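scipy's Friedman routine applies the tie correction automatically, so it reproduces the adjusted χ² from this example, as the sketch below shows.

```python
from scipy import stats

# Columns of Table 4: one list per session, across the five subjects.
s1 = [5, 9, 4, 1, 5]
s2 = [3, 11, 5, 3, 9]
s3 = [4, 14, 7, 1, 10]
s4 = [7, 13, 8, 14, 9]

chi2, p = stats.friedmanchisquare(s1, s2, s3, s4)
print(chi2, p)  # tie-corrected chi-square approx 7.25, p approx 0.06
```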
5.4. A distribution-free test for data from an incomplete block design (Skillings–Mack)

The RM-ANOVA with the Greenhouse–Geisser correction and the FTWAR are useful statistics for evaluating complete block designs. By definition, however, they permit no missing data, nor can the number of sessions vary across subjects. Since single-subject studies are generally designed with differing numbers of sessions for each subject, RM-ANOVA and FTWAR will not be appropriate statistical procedures in those cases. Nonetheless, there is a test that may prove useful: the Skillings–Mack (SM) test (Skillings and Mack, 1981). It is not a well-known statistic, and the summary here is intended only to outline its application briefly; a more complete description can be found in Hollander and Wolfe (1999), from which the example presented has been adapted. The SM statistic makes three assumptions: s_i denotes the number of sessions for which an observation is present for a given subject i; there are at least two subjects; and there may be an arbitrary number of missing observations for any session j. To compute the statistic, one ranks the observed values from smallest to largest within each subject. At this point, the SM resembles the FTWAR in layout. Unlike the FTWAR, if an observation is missing, the value r_ij = (s_i + 1)/2, the average rank for subject i, is assigned to the missing value(s). Then a vector, A, is constructed based on the following computation:

A_j = Σi [12/(s_i + 1)]^(1/2) [r_ij − (s_i + 1)/2],
for each session j, j = 1, ..., k. In effect, A_j is the weighted sum of centered ranks for observations in the jth session. The SM statistic then requires that a covariance matrix, Σ0, and its inverse, Σ0⁻¹, be computed. The statistic is obtained from SM = A Σ0⁻¹ Aᵗ. The statistical hypotheses are as follows: the response measures obtained during baseline sessions are no different from those observed during intervention sessions, i.e. H0: [τ1 = τ2 = ... = τk]; alternatively, H1: [τ1, ..., τk] are not all the same. The null hypothesis is rejected if the obtained SM ≥ sm_α (the tabled value). The following example is adapted from Hollander and Wolfe (1999). Suppose eight children with mild-to-moderate mental retardation were asked to perform a task that required attending to the presentation of 25 individual tasks. The number of correct responses was recorded under three conditions: baseline; intervention with guided compliance; and intervention with guided compliance plus medication. The data appear in Table 6.

Table 6
Number of correct responses to 25 tasks performed during baseline, intervention with guided compliance, and intervention with guided compliance plus medication^a

Subject   Baseline   Guided compliance   GC + medication
1         3 (1)      5 (2)               15 (3)
2         1 (1)      3 (2)               18 (3)
3         5 (2)      4 (1)               21 (3)
4         2 (1)      –                    6 (2)
5         0 (1)      2 (2)               17 (3)
6         1 (1)      2 (2)               10 (3)
7         2 (1)      3 (2)                8 (3)
8         0 (1)      2 (2)               13 (3)

a Individual subject ranks are in parentheses.

For each condition — baseline, intervention using guided compliance alone, intervention with guided compliance plus medication — the obtained values of A_j were: A1 = −11.392; A2 = −1.732; A3 = 13.124. To maintain linear independence, the rank of the A vector must be one less than the number of sessions; in this example, its rank is 2. Thus A needs only two values, and A1 and A2 were chosen for computational ease.
Thus A = [−11.392, −1.732]. The covariance matrix Σ0 and its inverse Σ0⁻¹ for this example were obtained as:

Σ0 = [ 15  −7 ]
     [ −7  14 ]

Σ0⁻¹ = [ 0.0870  0.0435 ]
       [ 0.0435  0.0932 ]

Finally,

SM = A Σ0⁻¹ Aᵗ = [−11.392, −1.732] Σ0⁻¹ [−11.392, −1.732]ᵗ = 13.287.

Comparing the obtained SM value to the one found in the table (Hollander and Wolfe, 1999; Table A27), we note that the critical SM for eight subjects at α = 0.01 is 10.242. Thus, the effect of intervention on baseline responses is significant. A more complete description of the procedure can be found in Hollander and Wolfe (1999).
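The Skillings–Mack statistic is not packaged in scipy, but the A vector and the quadratic form are short to compute by hand. The sketch below reproduces the example, taking the covariance matrix Σ0 as given in the text rather than deriving it; there are no tied observations within subjects here, so simple ranking suffices.

```python
import numpy as np

# Table 6; None marks the missing guided-compliance observation for subject 4.
data = [[3, 5, 15], [1, 3, 18], [5, 4, 21], [2, None, 6],
        [0, 2, 17], [1, 2, 10], [2, 3, 8], [0, 2, 13]]

A = np.zeros(3)
for row in data:
    observed = sorted(x for x in row if x is not None)
    s = len(observed)
    weight = (12.0 / (s + 1)) ** 0.5
    for j, x in enumerate(row):
        # Missing values receive the average rank (s + 1)/2, contributing zero.
        r = observed.index(x) + 1 if x is not None else (s + 1) / 2.0
        A[j] += weight * (r - (s + 1) / 2.0)
print(A)  # approx [-11.392, -1.732, 13.124]

a = A[:2]  # the A vector has rank k - 1 = 2
sigma0 = np.array([[15.0, -7.0], [-7.0, 14.0]])  # covariance matrix from the text
print(a @ np.linalg.inv(sigma0) @ a)  # approx 13.28, the text's 13.287 up to rounding
```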
5.5. Formal methods for evaluating abrupt changes — the Kruskal-Wallis (KW) test

Although the formal procedures that follow were developed for studies in ethology (Haccou and Meelis, 1992), they are readily adapted to the experimental analysis of behavior and applied behavior analysis when durations of behavior, e.g. bout lengths, rather than punctate target behaviors are examined. They can be particularly useful in studies of choice behavior in which bouts of responding occur for one behavior, then another, and in which the contingencies of reinforcement change during a single session. The example that follows was adapted from Haccou and Meelis (1992). Bout lengths for a given behavior were recorded in each of four successive conditions, as in Table 7.

Table 7
Bout lengths for a given behavior under each of four conditions (adapted from Haccou and Meelis, 1992)

Condition        Bout length (s)   Rank
Baseline         4.7               2
                 17.4              5
                 57.7              11
                 92.3              15
                 41.7              8
Intervention 1   1.6               1
                 19.5              6
                 46.1              9
                 51.1              10
                 179.8             16
                 11.5              4
                 234.4             20
Intervention 2   223.5             18
                 29.4              7
                 7.7               3
                 84.6              14
Intervention 3   67.5              13
                 224.3             19
                 59.4              12
                 182.0             17

To examine whether bout lengths differ among the four conditions, Haccou and Meelis (1992) employed a NP multiple-change-point test based on the Kruskal-Wallis (KW) statistic. The data for the KW test are the entire sample of behavior durations; in our example, N = 20 bouts. To make proper use of the KW, the observations of bout lengths x1 to xN must be independent of one another, and the expected value of each observation should be its condition mean, i.e. E(x_i) = μ_i, with x_i = μ_i + ε_i, where ε_i comes from a continuous distribution. The statistical hypotheses for the KW are that mean bout lengths do not differ across conditions (H0: μ1 = μ2 = ... = μm); alternatively, not all μ_i are equal. As shown in Table 7, bout lengths from all conditions are first pooled into a single sample and ranked from shortest to longest (last column). Ranks are then reallocated to their respective conditions (or groups). The KW statistic, K, is obtained by computing the sums of the ranks, R_j, for each of the m conditions (groups). If there are no ties in the ranks, the KW statistic for testing the equality of the m mean rank sums is:
K = [12/(N(N+1))] Σm (R_j²/n_j) − 3(N+1).

When there are ties, K′ is:

K′ = K / {1 − Σg (t_g³ − t_g)/(N³ − N)},
where t_g is the number of ties within a given rank group. K (or K′) is computed for all m − 1 change points and continues to be calculated algorithmically, assuming successively fewer change points, for all (m − 2), (m − 3), ..., possible transitions until only two groups and one transition point remain. Finally, maximum likelihood estimates (MLEs), L = max K, for all contiguous pairs of numbers of change points {(m − 1, m − 2), (m − 2, m − 3), ..., (2, 1)} are then obtained. Details of the computations and statistical analyses can be found in Haccou and Meelis (1992). For the example presented in Table 7, the question is whether bout lengths differ among any of the conditions. For the four groups as observed, K4 = [12/(20(20+1))](2329.74) − 3(21) = 66.56 − 63 = 3.56. For four conditions, there are a maximum of three transition points. Given N bout lengths divided among four groups, there are a total of (1/3!)(N−1)(N−2)(N−3) possible groupings of various sizes; when N = 20, there are 969 such groupings. Haccou and Meelis (1992) calculated that the MLE, L3 = max K4, is attained when the four groups have sizes 11, 2, 2 and 5. Using the data from Table 7, along with these sample sizes, we find L3 = max K4 = [12/(20(21))](2585.09) − 3(21) =
10.86. The MLE for j = 3 conditions with two transition points is calculated next. For three conditions and two transition points, L2 = max K3 is attained for group sizes 11, 2 and 7 (2+5): L2 = max K3 = [12/(20(21))](2442.23) − 3(21) = 6.78. For j = 2 groups and one transition point, L1 = max K2 is attained for group sizes 11 and 9 (2+2+5): L1 = max K2 = [12/(20(21))](2369.09) − 3(21) = 4.69. Having calculated the KW for the original data set and the MLEs for 4, 3 and 2 groups, the statistical hypotheses are: [1] Is there a significant difference in bout lengths among the four groups as presented in the original data? [2] Using the MLEs, could there be significant differences at any transition point(s) in the set of 20 bout lengths when bout lengths are assigned to (a) four groups, (b) three groups, or (c) two groups? Our statistical analyses proceed as follows. For the first hypothesis, we observed that K4 = 3.56. Since the sample sizes in the four conditions (groups) exceed those provided in the tabled values, we make use of the large-sample χ² approximation and compare the obtained K4 with the tabled χ²_(k−1,α); if the obtained K4 ≥ the tabled χ²_(k−1,α), reject H0. Since the tabled χ²_(3,0.05) = 7.81, we cannot reject H0. Our second set of hypotheses asks: are there three significant transition points? Two? One? The MLE-based tests for the second hypothesis are summarized in Table 8. Since none of the tests reached the indicated levels of significance, we conclude there were no significant change points in bout lengths across any of the four conditions in the example.

Table 8
Statistical tests comparing numbers of change points in the bout-length data from Table 7 (adapted from Haccou and Meelis, 1992)

Statistical test         Obtained MLE   Tabled critical value (Bonferroni adjusted)^a   Decision
3 points vs. 2 points    4.08           α = 0.025; 7.86                                 Do not reject H0
2 points vs. 1 point     2.09           α = 0.05; 5.14                                  Do not reject H0
1 point only             4.69           α = 0.10; 5.76                                  Do not reject H0

a Values are from Haccou and Meelis (1992), Tables A10–A12 (pp. 352–354).
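The omnibus KW statistic for the four conditions (the first hypothesis) is available directly in scipy, as sketched below; the change-point search over all 969 groupings would have to be coded separately.

```python
from scipy import stats

baseline = [4.7, 17.4, 57.7, 92.3, 41.7]
int1 = [1.6, 19.5, 46.1, 51.1, 179.8, 11.5, 234.4]
int2 = [223.5, 29.4, 7.7, 84.6]
int3 = [67.5, 224.3, 59.4, 182.0]

k4, p = stats.kruskal(baseline, int1, int2, int3)
print(k4, p)  # approx 3.56 and 0.31: no significant difference among conditions
```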
5.6. Formal methods for evaluating trends

5.6.1. Kendall's τ

Kendall's τ is usually used as a distribution-free test of the independence of two variables, based on signs. However, it can also be used to test for temporal trend (Mann, 1945), as follows. If we take X_i = i for the n successive sessions within an experimental condition, and Y_i as the response measure for session i, Kendall's statistic for trend can be written as

K = Σi Σ(j=i+1,...,n) c(Y_j − Y_i),

where c(a) = 1 if a > 0; 0 if a = 0; and −1 if a < 0. We construct an example in which response measures from nine consecutive baseline sessions are shown; we wish to determine whether there is a trend across sessions in the baseline. The data appear in Table 9.

Table 9
Response measures for nine consecutive baseline sessions

Session   Response score
1         44.4
2         44.7
3         41.9
4         44.1
5         45.2
6         50.7
7         53.3
8         45.9
9         60.1

Table 10 shows the signs c(Y_j − Y_i) for all pairs i < j.

Table 10
Calculations for Kendall's τ using K = Σi Σ(j>i) c(Y_j − Y_i); data from Table 9

j \ i    1     2     3     4     5     6     7     8
2        1
3       −1    −1
4       −1    −1     1
5        1     1     1     1
6        1     1     1     1     1
7        1     1     1     1     1     1
8        1     1     1     1     1    −1    −1
9        1     1     1     1     1     1     1     1
+/−     6–2   5–2   6–0   5–0   4–0   2–1   1–1   1–0

Given the analysis presented in Table 10, we calculate K = 24. From Table A30 in Hollander and Wolfe (1999), for K = 24 and n = 9, P = 0.006. Therefore, there is evidence of a significant increasing trend in this experimental condition.
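scipy reports Kendall's statistic as the normalized τ = K/[n(n−1)/2] rather than K itself, but its one-sided P-value matches the exact-table result, as sketched below.

```python
from scipy import stats

scores = [44.4, 44.7, 41.9, 44.1, 45.2, 50.7, 53.3, 45.9, 60.1]  # Table 9
sessions = list(range(1, 10))

tau, p = stats.kendalltau(sessions, scores, alternative='greater')
print(tau, p)  # tau = 24/36 approx 0.67; one-sided p approx 0.006
```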
5.7. Randomization tests for behavioral analysis using AB designs
Several authors have suggested that randomization tests be used to compare mean differences between baseline and intervention responses (Busk and Marascuilo, 1988; Wampold and Worsham, 1986; Edgington, 1992). Randomization (or permutation) tests are essentially exact distributions based on permutations of the empirical data, and they are the basis for most of the NP tests we examined earlier. The following example should make clear how an exact distribution is constructed, and how obtained data are compared to it. Consider a study in which there are a total of five baseline and intervention sessions, where the sequence of baseline and intervention conditions can be imposed in any order. The numbers of responses in the experiment are recorded in Table 11.

Table 11
Layout for a permutation test with three baseline sessions and two intervention sessions

              Baseline (X)       Intervention (Y)
Session       1     2     3      4     5
Responses     108   96    120    150   126

The mean number of responses in the baseline condition is X̄ = 108, while the mean number
of responses during intervention is Ȳ = 138. The mean difference in responding between intervention and baseline is Ȳ − X̄ = 30. The (statistical) question is: is this difference larger than would be expected by chance?
5.7.1. A permutation test

For this example, we performed a permutation test by computing all possible ways five values could be assigned to three baseline sessions and two intervention sessions. In other words, given a sample size equal to the total number of sessions (N = 5), the number of sessions in the baseline (n1 = 3) and the number of sessions during intervention (n2 = 2), there are a total of n = N!/(n1! n2!) = 5!/(3! 2!) = 10 different ways of assigning five values to two samples, one containing three values and the other two. Once all possible assignments are made, the 10 mean values are calculated for the baseline (X̄) and intervention (Ȳ) conditions, as are their mean differences (Ȳ − X̄). The difference obtained in the actual study is then compared to all possible outcomes. This is generally referred to as the descriptive level of significance (Mosteller and Rourke, 1973): the probability of obtaining a result as unusual as, or more unusual than, the observed outcome. The reader should be able to calculate the 10 permutation outcomes and find that only two of them (+30 and −30) are as extreme as the observed difference, so P(|observed difference| ≥ 30) = 2/10 = 0.20. Thus, in our example, the descriptive level of significance (0.20) is not statistically significant.
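The full permutation distribution for this small example can be enumerated directly; the sketch below counts assignments whose mean difference is at least as extreme, in absolute value, as the observed 30, reproducing the 2/10 descriptive level.

```python
from itertools import combinations

responses = [108, 96, 120, 150, 126]   # the five sessions from Table 11
observed = 138 - 108                    # intervention mean minus baseline mean

extreme = 0
for base in combinations(range(5), 3):  # all 10 choices of 3 baseline sessions
    treat = [i for i in range(5) if i not in base]
    diff = (sum(responses[i] for i in treat) / 2
            - sum(responses[i] for i in base) / 3)
    if abs(diff) >= observed:           # count both tails, per the definition above
        extreme += 1

print(extreme / 10)  # 0.20
```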
Another example of the use of a randomization test is provided by Edgington (1992). On each of 6 days of what appears to be a multi-element design, a subject is randomly assigned to a placebo baseline (A) or medication intervention (B) condition in the morning or afternoon, in paired blocks of days, for a total of 12 sessions: baseline (A), 6 sessions; intervention (B), 6 sessions. The number of responses in each condition is recorded in Table 12.

Table 12
Example of a randomization test in a multi-element experimental design

Session     1     2     3     4     5     6     7     8     9     10    11    12
A or B      A     B     B     A     B     A     A     B     A     B     A     B
Responses   17    14    15    17    14    19    21    16    20    16    20    18
Rank        6.5   1.5   3     6.5   1.5   9     12    4.5   10.5  4.5   10.5  8

According to Edgington (1992), there are

n = N!/(nA! nB!) = 12!/(6! 6!) = 924

possible outcomes.
The baseline mean is X̄A = 19.0 and the intervention mean is ȲB = 15.5; therefore, X̄A − ȲB = 3.5. To test whether the intervention has had a positive effect on behavior, i.e. reduced the number of inappropriate responses, we compare the obtained difference (3.5) with the 923 other possible outcomes. According to Edgington (1992), there are only five random samples for which X̄A − ȲB ≥ 3.5. Therefore, at the descriptive level of significance, P(observed difference ≥ 3.5) = 5/924 ≈ 0.005.
5.8. A non-parametric alternative to the randomization test: the Mann-Whitney–Wilcoxon

Alternatively, Edgington (1992) suggests that the Mann-Whitney–Wilcoxon (MWW) test for independent samples be applied to demonstrate a similar outcome. To use the MWW properly, the two samples of observations must be randomly selected (as in the randomization test above), with the samples (X1, ..., Xm) and (Y1, ..., Yn) obtained from two different populations. The samples are
independent of one another, and the observations are measures of a continuous variable. The distribution functions, although unknown, are the same for both populations except for their means. The statistical hypotheses are: there is no difference between the sample means (H0: θ = 0); the alternative hypothesis is that there is a difference (H1: θ > 0 [or θ < 0; or θ ≠ 0]). The test statistic, T, is computed as

T = S − n(n+1)/2,

where S is the sum of the ranks assigned to the sample from either population. For a one-tailed test, the decision is to reject the null hypothesis at some level of significance, α, if T exceeds the appropriate critical value. In our example, the obtained sum of the 'A' ranks is SA = 55, where the smallest possible sum is n(n+1)/2 = 21. Given the sample sizes NA = NB = 6, T = 34 yields P = 0.004 for a one-tailed test, which is remarkably similar to the value obtained by the randomization test.
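The MWW computation can be checked against scipy, which reports the same U = 34; because the data contain ties, its P-value comes from an approximation and may differ slightly from the exact 0.004.

```python
from scipy import stats

a = [17, 17, 19, 21, 20, 20]  # placebo baseline (A) responses from Table 12
b = [14, 15, 14, 16, 16, 18]  # medication (B) responses

u, p = stats.mannwhitneyu(a, b, alternative='greater')
print(u, p)  # U = 34; one-sided p approx 0.006 under the tie-corrected approximation
```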
It should be noted that the randomization test and the MWW test recommended by Edgington (1992), and used in this example, both assume that permutations are constructed by randomly assigning A and B to any of the 12 available sessions. The research strategy could be interpreted differently given the study's multi-element experimental design, in which randomization occurs on a daily basis: the subject is assigned at random to A or B on each of 6 consecutive days. Under that interpretation, the total number of events should not be obtained by counting permutations of N = 12 sessions, but by the binomial-like counting procedure in which the total number of assignments is 2^N, where N = 6 days: 2^6 = 64 possible outcomes. If the obtained difference, 3.5, is the largest mean difference (Z) between the A and B conditions, then the descriptive level of significance is P(Z ≥ 3.5) = 1/64 ≈ 0.02. The difference would still be significant, though the descriptive level rises from about 0.005 to about 0.02. If there were as many as five outcomes for which the mean difference equalled or exceeded 3.5 (as suggested by Edgington, 1992), the descriptive level of significance would be approximately 0.08, which is not significant. The issue of interpretation of design is revisited when we explore the following example, generated by Wampold and Worsham (1986) and reconsidered by Busk and Marascuilo (1988).
5.9. Randomization tests with multiple baseline designs

The applied behavior analyst is most likely to be involved in studies in which several subjects are tested over many sessions, with intervention phased in at different points or sessions. The following example is provided by Wampold and Worsham (1986). Consider three subjects in a study who are tested over 12 sessions, as in Table 13. The asterisked values represent the sessions in which intervention was implemented; subjects were randomly assigned to begin the intervention condition at the 4th, 7th or 10th session.

Table 13
Responses by three subjects over 12 experimental sessions (adapted from Wampold and Worsham, 1986); asterisks mark intervention sessions

Session     1    2    3    4     5     6     7     8     9     10    11    12
Subject A   5    7    4    5     3     7     5     8     10    8*    15*   10*
Subject B   8    9    7    7*    10*   11*   10*   13*   14*   10*   13*   11*
Subject C   6    6    7    5     4     8     7*    9*    10*   11*   13*   10*

Compared to their mean responses during baseline, subjects' mean responses during intervention are as follows:
Subject   Baseline mean   Intervention mean   Difference
A         6               11                  5
B         8               11                  3
C         6               10                  4
The sum of their differences is 12. Given that interventions beginning at the 4th, 7th and 10th sessions could have been assigned randomly to any of the three subjects, there are a total of 3! = 6 different session assignments. P(observed difference ≥ 12) = 1/6 ≈ 0.17, which is not significant.
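This randomization test amounts to permuting the three intervention onsets across the three subjects; a sketch using the Table 13 data follows, where onset denotes the first intervention session (1-indexed).

```python
from itertools import permutations
import numpy as np

series = {
    'A': [5, 7, 4, 5, 3, 7, 5, 8, 10, 8, 15, 10],
    'B': [8, 9, 7, 7, 10, 11, 10, 13, 14, 10, 13, 11],
    'C': [6, 6, 7, 5, 4, 8, 7, 9, 10, 11, 13, 10],
}

def diff(y, onset):
    """Intervention mean minus baseline mean for a given onset session."""
    y = np.asarray(y, dtype=float)
    return y[onset - 1:].mean() - y[:onset - 1].mean()

observed = sum(diff(series[s], o) for s, o in zip('ABC', (10, 4, 7)))  # 12.0

dist = [sum(diff(series[s], o) for s, o in zip('ABC', onsets))
        for onsets in permutations((4, 7, 10))]
print(sum(d >= observed for d in dist) / len(dist))  # 1/6 approx 0.17
```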
It should be noted, however, that in the example provided by Wampold and Worsham, there were four subjects assigned randomly to begin intervention at four different treatment sessions. Hence, the total number of different random assignments is 4! = 24 and P = 1/24 ≈ 0.042, which is significant. A second example of a randomization test based on single-subject multiple baseline designs comes from Busk and Marascuilo (1988). The data they use are from the same Wampold and Worsham (1986) study, and are presented in Table 14.

Table 14
Responses from four subjects during 20 experimental sessions (adapted from Wampold and Worsham, 1986); asterisks mark intervention sessions

Session   'B'   'A'   'C'   'D'
1         8     6     5     8
2         7     7     5     6
3         6     8     4     7
4         7     7     4     7
5         4*    5     4     8
6         5*    7     5     5
7         6*    6     6     7
8         5*    8     7     8
9         4*    6*    4     7
10        4*    5*    5     6
11        5*    4*    6     7
12        2*    4*    5     8
13        4*    4*    2*    5
14        3*    3*    3*    6
15        4*    2*    2*    8
16        5*    5*    4*    8
17        4*    3*    1*    6*
18        3*    4*    0*    4*
19        2*    3*    2*    4*
20        2*    6*    3*    5*

Busk and Marascuilo (1988) make an assumption that differs from that of Wampold and Worsham (1986). Instead of randomly assigning the start of the intervention phase to the 5th, 9th, 13th or 17th session, as described by Wampold and Worsham (1986), Busk and Marascuilo (1988) assume that intervention is imposed after baseline has stabilized, although not before the 4th session nor after the 17th. That is, intervention could have been imposed at the 5th, or 6th, ..., or 17th session for each of the four subjects. These new criteria alter the original assumptions of the experimental design and permit a more generous (in terms of the number of different possible outcomes) exact distribution to be constructed. Under these circumstances, the randomization distribution for the first subject ('B') appears as in Table 15. Session 5, marked with a double asterisk, is the session at which intervention actually occurred.
Table 15
Mean baseline and intervention responses for subject 'B', assuming intervention could occur at any of the 5th through 17th sessions

Intervention session   Baseline mean   Intervention mean   Mean difference
5**                    7.00            3.88                3.12
6                      6.40            3.87                2.53
7                      6.17            3.79                2.38
8                      6.14            3.62                2.52
9                      6.00            3.50                2.50
10                     5.78            3.45                2.33
11                     5.60            3.40                2.20
12                     5.55            3.22                2.33
13                     5.25            3.38                1.87
14                     5.15            3.29                1.86
15                     5.00            3.33                1.67
16                     4.93            3.20                1.73
17                     4.94            2.75                2.19
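Table 15's column of mean differences is easy to regenerate from subject 'B''s 20 sessions, which also makes clear how the full 13⁴-point distribution discussed below would be built.

```python
import numpy as np

# Subject 'B' from Table 14; intervention actually began at session 5.
b = np.array([8, 7, 6, 7, 4, 5, 6, 5, 4, 4, 5, 2, 4, 3, 4, 5, 4, 3, 2, 2], float)

for onset in range(5, 18):  # admissible onsets: sessions 5 through 17
    d = b[:onset - 1].mean() - b[onset - 1:].mean()
    print(onset, round(d, 2))  # onset 5 gives 3.12; values match Table 15
                               # up to rounding of the intermediate means
```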
For the obtained mean difference Z1 = 3.12, the descriptive level of significance is P(Z1 ≥ 3.12) = 1/13 ≈ 0.08. This is not significant if one considers only the single subject, 'B'. There are, however, three other subjects, and their combined differences are obtained by adding:

ΣZi = Z1 + Z2 + Z3 + Z4 = 3.12 + 2.67 + 3.04 + 2.19 = 11.02.

The exact distribution is constructed by finding all possible combinations of the mean differences obtained for the four subjects over all 13 possible intervention onsets: 13⁴ = (13)(13)(13)(13) = 28,561 possible sums of mean differences (see Busk and Marascuilo, 1988, Table 4, p. 11). However, as this is computationally tedious, and the number of differences is sufficiently large to invoke the normal distribution, an alternative is to examine the sum of the expected values of the mean differences for all subjects, together with the sum of the variances of the differences, and to compute the Z-score for the obtained sum of differences, ΣZi. Based on the calculations provided by Busk and Marascuilo (1988), the expected value of the mean difference for each subject, E(Zi), summed over all subjects, is:

ΣE(Zi) = E(Z1) + E(Z2) + E(Z3) + E(Z4) = 2.45 + 2.18 + 2.22 + 0.99 = 7.84.

The variance of the difference for each subject, var(Zi), summed over all subjects, is:

Σvar(Zi) = 0.1469 + 0.1932 + 0.4423 + 0.1860 = 0.9684.

The large-sample Z-score is obtained from

Z = {ΣZi − ΣE(Zi)} / √(Σvar(Zi)) = (11.02 − 7.84)/√0.9684 = 3.23.

The probability of obtaining Z = 3.23 is less than 0.001. Therefore, we reject the null hypothesis of no intervention effect. It is important to note that the analysis performed by Busk and Marascuilo (1988), while
useful in thinking about types of randomization procedures to perform, is not appropriate for the experimental design proposed by Wampold and Worsham (1986). Wampold and Worsham state specifically that randomization in the experimental design was based on random assignment of the first intervention period to the 5th, 9th, 13th or 17th session; whereas, Marascuilo and Busk change the assumptions about the design and assert that intervention could have occurred on any of the 5th through 17th sessions. One should take care not to confound experimental design with statistical analysis.
6. Power considerations

In implementing any statistical procedure, it is important to note not only the significance level of the test (the probability of rejecting the null hypothesis if it is true), but the power of the test as well (the probability of rejecting the null hypothesis when the alternative hypothesis is true). Cohen (1962) suggested that satisfactory power ranges between 70 and 80%, but researchers generally use 80% or higher as the most appropriate criterion. The power of non-parametric tests is often compared to the power of parametric tests, usually those based on the t- or normal distributions. Comparisons are based on what is generally referred to as asymptotic relative efficiency, or ARE. ARE refers to the proportion of subjects needed for a study using a nonparametric procedure compared to the same study if one could assume the parameters of the distribution were known. For example, if a specific NP test has 80% ARE relative to the normal distribution, and 40 subjects were needed to obtain 75% power under the assumption of normality, one would need 50 subjects for that NP test to attain 75% power. Table 16 compares the NP and randomization tests presented in this paper, and their power (if known) or AREs.

Table 16
Power and efficiency estimates for the non-parametric and randomization tests presented

Test                                                        Power/ARE
Wilcoxon signed-ranks                                       Power ≈ that of the t-test (sometimes greater)
Repeated-measures ANOVA                                     Adjust df with Greenhouse–Geisser ε to obtain comparable power
Friedman two-way ANOVA                                      ARE = k/(k+1), where k = number of sessions
Skillings–Mack                                              Unknown
Kendall's τ                                                 ARE ≈ 91% compared with Pearson's r
Mann-Whitney                                                Power ≈ that of the t-test (sometimes greater)
Kruskal-Wallis                                              ARE never <84%; 95.5% if normally distributed
Permutation tests (Ferron and Ware, 1995)
  Single subject, AB design, large effect size,
  modest autocorrelation                                    17–25%
  Single subject, ABAB design, large effect size,
  modest autocorrelation                                    25–32%
  Multiple baseline, AB design, four subjects, 15
  sessions, large effect size, modest autocorrelation       30–49%
7. Summary

Experiments that use single-subject designs and rely only on visual inspection to evaluate the
efficacy of interventions often miss treatment effects and, to a much greater extent, trends in the data. Other types of errors, e.g. those in which a treatment effect and trend occur concurrently but only the treatment effect is reported, illustrate the weakness of visual inspection as the sole means by which to evaluate graphed functions. In this tutorial, I have presented several statistical procedures, nearly all of which do not depend on parametric assumptions. I strongly recommend that they be considered to augment, not replace, visual inspection in assessing the effectiveness of behavioral interventions, especially those employed in an applied behavioral setting. As with any technique, statistical inference should be used thoughtfully and carefully, and properly integrated with experimental design.
References

Box, G.E.P., Jenkins, G.M., 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
Busk, P.L., Marascuilo, L.A., 1988. Autocorrelation in single-subject research: a counterargument to the myth of no autocorrelation. Behavioral Assessment 10, 229–242.
Cohen, J., 1962. The statistical power of abnormal-social psychological research: a review. Journal of Abnormal and Social Psychology 65, 145–163.
Crosbie, J., 1993. Interrupted time-series analysis with brief single-subject data. Journal of Consulting and Clinical Psychology 61, 966–974.
Davison, M., 1999. Statistical inference in behavior analysis: having my cake and eating it? The Behavior Analyst 22, 99–103.
DeProspero, A., Cohen, S., 1977. Inconsistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis 12, 573–579.
Edgington, E.S., 1992. Nonparametric tests for single-case designs. In: Kratochwill, T.R., Levin, J.R. (Eds.), Single-Case Research Design and Analysis: New Directions for Psychology and Education. Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 133–157.
Ferron, J., Ware, W., 1995. Analyzing single-case data: the power of randomization tests. Journal of Experimental Education 63, 167–178.
Fisch, G.S., Porto, A.F., 1994. Visual inspection of data: does the eyeball fit the trend? In: Rogowitz, B.E., Allebach, J.P. (Eds.), Human Vision, Visual Processing, and Digital Display V. Proc. SPIE 2179, San Jose, CA, pp. 268–276.
Fisch, G.S., Schneider, R., 1993. Visual inspection of data: placement of point-to-point graphs affects discrimination of trends and treatment effects. American Statistical Association 1993 Proceedings, Section on Statistical Graphics, pp. 55–58.
Furlong, M.J., Wampold, B.E., 1982. Intervention effects and relative variation as dimensions in experts' use of visual inference. Journal of Applied Behavior Analysis 15, 415–421.
Greenspan, P., Fisch, G.S., 1992. Visual inspection of data: a statistical analysis of behavior. American Statistical Association 1992 Proceedings, Section on Statistical Graphics, pp. 79–82.
Haccou, P., Meelis, E., 1992. Statistical Analysis of Behavioural Data. Oxford University Press, Oxford, pp. 75–111.
Hartmann, D.P., Gottmann, J.M., Jones, R.R., Gardner, W., Kazdin, A.E., Vaught, R.S., 1980. Interrupted time-series analysis and its application to behavioral data. Journal of Applied Behavior Analysis 13, 543–559.
Hollander, M., Wolfe, D.A., 1999. Nonparametric Statistical Methods, 2nd ed. Wiley, New York, pp. 36–50, 106–124, 319–328, 363–382.
Karp, H., Fisch, G.S., 1996. A signal detection analysis of visual inspection of graphs. Presented at a Symposium on the Experimental Analysis of Visually Inspected Data, San Francisco, CA, May 1996.
Kazdin, A.E., 1977. Artifact, bias, and complexity of assessment: the ABCs of reliability. Journal of Applied Behavior Analysis 10, 141–150.
Mann, H.B., 1945. Nonparametric tests against trend. Econometrica 13, 245–259.
Matyas, T.A., Greenwood, K.M., 1990. Visual analysis of single-case time series: effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis 24, 341–351.
McCall, R.B., Appelbaum, M.I., 1973. Bias in the analysis of repeated-measures designs: some alternatives. Child Development 44, 401–415.
Mosteller, F., Rourke, R.E.K., 1973. Sturdy Statistics: Nonparametrics and Order Statistics. Addison-Wesley, Reading, MA, pp. 1–19.
Sidman, M., 1960. Tactics of Scientific Research: Evaluation of Experimental Data in Psychology. Basic Books, New York.
Skillings, J.H., Mack, G.A., 1981. On the use of a Friedman-type statistic in balanced and unbalanced block designs. Communications in Statistics — Theory and Methods 6, 1453–1463.
Skinner, B.F., 1938. The Behavior of Organisms: An Experimental Analysis. Appleton-Century, New York.
Skinner, B.F., 1956. A case history in scientific method. American Psychologist 11, 221–233.
Wampold, B., Furlong, M., 1981. The heuristics of visual inference. Behavioral Assessment 3, 79–92.
Wampold, B.E., Worsham, N.L., 1986. Randomization tests for multiple-baseline designs. Behavioral Assessment 8, 135–143.
Winer, B.J., 1971. Statistical Principles in Experimental Design, 2nd ed. McGraw-Hill, New York, pp. 514–603.
Yeltman, A.R., Wildman, B.G., Erickson, M.T., 1977. A probability-based formula for calculating interobserver agreement. Journal of Applied Behavior Analysis, 127–131.