CASE HISTORIES AND SHORTER COMMUNICATIONS
Measurement error in direct observations: a comparison of common recording methods*
(Received 31 October 1979)
Summary: Videotapes of three brief-duration, three medium-duration and three long-duration types of stereotyped behaviour (of eight severely retarded children) were analysed to provide a criterion record of the true percentage duration of the behaviour. The criterion record was compared with the records produced by four time-sampling methods: a whole-interval method, two partial-interval methods and a momentary time-sampling method. As predicted, the whole-interval method grossly underestimated and the partial-interval methods grossly overestimated the true percentage duration of the behaviour, except when the duration of individual responses was much longer than the observation interval. Momentary time-sampling was not an errorless method but was consistently superior to the other methods. The implications of these findings for the detection of treatment effects by direct observations are discussed.
In applied behaviour analysis, the vast majority of experimental data are obtained through the direct observation of behaviour. According to Kelly (1977), interval recording and time-sampling account for 41% of the data collection methods of studies reported in a prominent behavioural journal. Given that these methods are so commonly used, it is important to ask whether the resulting data accurately reflect the observed behavioural events. It is only recently, however, that any consistent experimental evaluation of the unavoidable sampling errors involved in interval recording and time-sampling has appeared (Powell et al., 1975; Repp et al., 1976b; Powell et al., 1977; Green and Alverson, 1978; Powell and Rockinson, 1978). Studies so far have yielded conflicting results, presumably partly because they differ in the duration of the behaviour observed and in the numbers and lengths of the samples of behaviour observed (compare, for example, Powell et al., 1975, 1977, with Repp et al., 1976b). This paper therefore sets out to examine the accuracy of whole-interval recording, partial-interval recording and momentary time-sampling (these terms are defined as in Powell et al., 1975) for behaviour of different average response durations, using common sample lengths and taking a maximum number of samples in each observation period. The behaviour observed was real unscheduled behaviour (the stereotyped mannerisms of severely retarded children), in contrast to the artificial 'behaviour' of most previous studies (Powell et al., 1977; Repp et al., 1976b; Green and Alverson, 1978; Powell and Rockinson, 1978). Predictions about the errors produced by the different recording methods can be made from a consideration of how the errors arise.
If the measure of interest is the percentage of the observation time for which the behaviour occurred (per cent duration), then partial-interval sampling will always overestimate the true per cent duration unless the length of each bout of behaviour is much longer than the interval sample, because any appearance of the behaviour in a sample is counted as though the behaviour were occurring throughout the interval. By a similar argument, it can be predicted that whole-interval sampling will always underestimate the true per cent duration of behaviour, unless the bouts of behaviour are much longer than the interval length. Momentary time-sampling, on the other hand, should produce no consistent errors, although it may result in occasional chance errors in either direction, particularly if few samples are taken in an observation session.
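A worked illustration of this argument may help. The figures are invented for the example and are not data from the study, and the symbols d and T are introduced here only for the illustration. Suppose one bout of length d seconds falls wholly inside each sampling interval of length T seconds, with d < T. The three kinds of record would then give, in per cent duration:

```latex
\[
\text{true} = 100\,\frac{d}{T}, \qquad
\text{partial-interval} = 100, \qquad
\text{whole-interval} = 0 .
\]
```

With d = 1 and T = 10 the true figure is 10 per cent, yet partial-interval sampling reports 100 per cent and whole-interval sampling reports 0 per cent. A momentary sample taken at an instant unrelated to the timing of the bouts catches the behaviour with probability d/T, so its expected estimate is 100 d/T. This is why only chance variation, and no consistent bias, is predicted for momentary time-sampling.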
METHOD
Videotapes (5 min duration) were made of eight profoundly retarded children, all of whom showed particular stereotyped behaviour of sufficiently high frequency for each behaviour to occur at least several times on the videotape. Each child was seated on a chair and was given no toys with which to occupy himself. The only experimenter intervention was to return the child to his seat if he rose and moved out of the camera's sight. If the child left his seat but remained in the camera's view, he was not returned to his seat but was allowed to remain standing if he so wished. Children were filmed if their bouts of stereotyped behaviour lasted on average for short periods (2 sec or less), medium periods (between 2 and 20 sec) or long periods (between 35 sec and 2 min). Three samples of each duration were videotaped and each videotape was viewed by two raters. One rater recorded the onset and cessation of all bouts of the target stereotyped behaviour using a Rustrak event recorder. The second rater recorded 10-sec interval marks onto the event recorder chart (using an audiotape). The two raters then repeated this, reversing their tasks so as to provide inter-observer reliability figures for all videotapes.
*This research was supported by the Mental Health Foundation and the Bethlem Royal Hospital Research Fund.
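[Table 1: brief descriptions of the stereotyped behaviour, the number of responses and mean response duration, and the percentage duration given by each recording method and by the criterion record. Values not recoverable.]

The event recorder charts were then analysed to provide the following data on each stereotyped behaviour for each rater: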
(i) Number of responses (or bouts of responses where individual responses could not be separated).
(ii) The mean duration of each response or bout of responses.
(iii) The criterion record of the total duration of stereotyped behaviour, quoted as percentage duration over the 5 min (see below for details of this calculation).
(iv) A whole-interval record of the behaviour, using 10-sec intervals (simulating a '10-sec observe, 10-sec record' method).
(v) A partial-interval record of the behaviour, using 10-sec intervals (again simulating a '10-sec observe, 10-sec record' method).
(vi) A partial-interval record of the behaviour using the shorter interval of 2.5 sec (a '2.5-sec observe, 7.5-sec record' method).*
(vii) A momentary time-sampling record of the behaviour (effectively '1-sec observe, 9-sec record').*

The criterion record (iii) of per cent duration was calculated by the following formula:

\[ \text{per cent duration} = \frac{x}{y} \times 100 \]

where x was the sum total length of target behaviour recorded (by pen deflection on the event record chart) in millimetres and y was the total length of the event record chart in millimetres (representing the 5 min duration of the videotape).
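The criterion calculation and the four sampling records (iv) to (vii) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' procedure. The event record is represented as a list of non-overlapping (onset, offset) times in seconds rather than chart lengths in millimetres, which gives the same ratio at constant chart speed. Each 'observe' window is assumed to sit at the start of its observe-and-record cycle. An interval counts for the whole-interval rule only if the behaviour fills the window, and for the partial-interval rule if the behaviour appears in it at all. The function names and the example bouts are hypothetical.

```python
from typing import List, Tuple

Bout = Tuple[float, float]  # (onset, offset) of one bout, in seconds


def criterion_percent(bouts: List[Bout], session: float = 300.0) -> float:
    """Per cent duration: time spent in bouts / session length x 100."""
    return 100.0 * sum(off - on for on, off in bouts) / session


def overlap(bouts: List[Bout], start: float, end: float) -> float:
    """Seconds of behaviour falling inside the window [start, end)."""
    return sum(max(0.0, min(off, end) - max(on, start)) for on, off in bouts)


def interval_percent(bouts: List[Bout], observe: float, cycle: float,
                     rule: str, session: float = 300.0) -> float:
    """Percentage of 'observe' windows scored as containing the behaviour.

    One window of length `observe` opens at the start of every `cycle`
    seconds (assumed placement). rule="whole" scores a window only if the
    behaviour fills it; rule="partial" scores it if any behaviour occurs.
    """
    n_windows = int(session // cycle)
    scored = 0
    for k in range(n_windows):
        start = k * cycle
        seen = overlap(bouts, start, start + observe)
        if rule == "whole":
            if seen >= observe - 1e-9:  # tolerance for float rounding
                scored += 1
        elif seen > 0:
            scored += 1
    return 100.0 * scored / n_windows


# Hypothetical brief and medium bouts in a 5-min (300-sec) session.
bouts = [(3.0, 5.0), (12.0, 13.5), (20.5, 24.0), (47.0, 49.0), (61.0, 62.0),
         (85.0, 91.0), (100.0, 100.5), (142.0, 150.0), (205.0, 207.0),
         (263.0, 266.0)]

print(round(criterion_percent(bouts), 1))              # 9.8  criterion (iii)
print(interval_percent(bouts, 10.0, 20.0, "whole"))    # 0.0  method (iv)
print(interval_percent(bouts, 10.0, 20.0, "partial"))  # 60.0 method (v)
print(interval_percent(bouts, 2.5, 10.0, "partial"))   # 20.0 method (vi)
print(interval_percent(bouts, 1.0, 10.0, "partial"))   # 10.0 method (vii)
```

With these invented bouts the whole-interval record falls below the criterion, both partial-interval records fall above it, and the 1-sec momentary record lands closest. The example was chosen to show the mechanics, and other bout patterns will give different errors.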
The whole-interval record (iv) was constructed by counting the behaviour as present if it occurred throughout a 10-sec 'observation' interval; the partial-interval records were constructed by counting the behaviour as present if it occurred at all during the 10-sec (v) or 2.5-sec (vi) 'observation' interval; the momentary time-sample record was constructed by counting the behaviour as present if it occurred at all during the moment (approximately 1 sec) of 'observation'.

For subject 6, who showed medium duration hand postures, a second analysis was done after the event record chart had been analysed as above. On this second occasion, alternate responses on the record chart were treated as though absent, but otherwise the analysis was identical to that described above. The intent was to discover what would happen to the various per cent duration measures if the true per cent duration halved but the response length remained constant.

The extent of the error produced by the various recording methods (the measurement error) was calculated by subtracting the estimate of per cent duration produced by each of the recording methods from the criterion record of true per cent duration.

INTER-OBSERVER RELIABILITY
Inter-observer agreement measures were calculated for each of the nine observational sessions for each observational method. The measures used were R_tot (total per cent agreement), R_occ (the per cent occurrence agreement) and R_nonocc (the per cent non-occurrence agreement); see Hartmann (1977). The chance levels of R_tot, R_occ and R_nonocc were also calculated (Hopkins and Hermann, 1977). The mean reliabilities were high: 95% (R_tot), 81% (R_occ) and 88% (R_nonocc), and invariably exceeded chance levels.* The inter-observer reliability levels for the criterion record were calculated by the whole-session method (Repp et al., 1976a). The mean reliability was 93% (range 73% to 100%).

RESULTS

Brief descriptions of the stereotyped behaviour recorded are given in Table 1, together with the number of responses or bouts of responses in the five minutes and the mean duration of each response or bout of responses.
*Thirty samples were possible for each of these two methods in the 5-min session. Fifteen samples were possible for the other methods. This was a result of choosing sampling methods that a real observer would find practicable during a 5-min session, assuming he was trying to maximise his number of samples without leaving himself too little time in which to record.
*Further details of inter-observer reliability levels are available on request.
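The agreement measures named under INTER-OBSERVER RELIABILITY can be sketched as follows. The paper defers to Hartmann (1977) and Repp et al. (1976a) for the exact formulas, so this sketch assumes the interval-by-interval definitions in common use. Total agreement is taken over all intervals, occurrence agreement over intervals that either rater scored as an occurrence, and non-occurrence agreement over intervals that either rater scored as a non-occurrence. The whole-session figure is taken as the smaller session total divided by the larger. The function names and the two example records are hypothetical.

```python
from typing import Dict, Optional, Sequence


def interval_agreement(a: Sequence[bool],
                       b: Sequence[bool]) -> Dict[str, Optional[float]]:
    """Total, occurrence and non-occurrence per cent agreement for two
    raters' interval records (True = behaviour scored in that interval)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("records must be non-empty and of equal length")
    n = len(a)
    both = sum(1 for x, y in zip(a, b) if x and y)
    neither = sum(1 for x, y in zip(a, b) if not x and not y)
    any_occurrence = n - neither   # at least one rater scored an occurrence
    any_nonoccurrence = n - both   # at least one rater scored a non-occurrence
    return {
        "total": 100.0 * (both + neither) / n,
        # A measure is undefined (None) when its denominator is zero.
        "occurrence": 100.0 * both / any_occurrence if any_occurrence else None,
        "nonoccurrence": (100.0 * neither / any_nonoccurrence
                          if any_nonoccurrence else None),
    }


def whole_session_agreement(total_1: float, total_2: float) -> float:
    """Smaller session total as a percentage of the larger (assumed form
    of the whole-session method)."""
    if total_1 == total_2:
        return 100.0
    return 100.0 * min(total_1, total_2) / max(total_1, total_2)


# Hypothetical ten-interval records from two raters.
rater_1 = [True, True, False, False, True, False, False, False, True, False]
rater_2 = [True, False, False, False, True, False, False, True, True, False]

result = interval_agreement(rater_1, rater_2)
print(result["total"])                    # 80.0
print(result["occurrence"])               # 60.0
print(round(result["nonoccurrence"], 1))  # 71.4
print(whole_session_agreement(40.0, 50.0))  # 80.0
```

Occurrence and non-occurrence agreement are reported separately because total agreement can look high purely through agreement on the more frequent category. The chance levels mentioned in the text are not computed here.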
[Fig. 1. The measurement error produced by the four sampling methods (whole-interval; partial-interval, 10 sec; partial-interval, 2.5 sec; momentary time-sampling) for brief (B), medium (M) and long (L) duration behaviour.]
Table 1 also shows the percentage duration of each stereotyped behaviour as calculated from the various recording methods, together with the true percentage duration (the criterion record). The error made by each recording method is plotted against length of response in Fig. 1. As predicted, momentary time-sampling results in consistently less error than the other methods, with whole-interval recording underestimating and partial-interval recording overestimating until the mean response length greatly exceeds the observation interval length. The measurement error produced by whole-interval recording for brief duration behaviour is relatively small because of 'floor' effects, but as a percentage of the true duration this error is of course quite large. Momentary time-sampling makes a surprisingly large error in estimating the duration of medium length responses, but a major part of this error was contributed by one subject's behaviour (subject 5, 17% error) and seems to have been the result of the chance coincidence of more of the momentary time-samples with the subject's responses than would usually be expected. The result of eliminating alternate responses of subject 6 (medium duration hand postures and hand taps) is shown in Table 2. The mean response duration has been changed very little and all measures of per cent duration reflect the fact that a reduction in (true) per cent duration has taken place, momentary time-sampling coming closest to the true figures.

DISCUSSION
The results clearly show that whole-interval recording underestimates and partial-interval recording overestimates the true percentage duration of the behaviour being observed. The error is large in both cases but reduces when the mean response length is very much longer than the observation interval. The dependence of the extent of error on the response duration and observation interval length means that partial-interval recording using a 2.5-sec observation period is more accurate than when using a 10-sec observation period, and momentary time-sampling (which is effectively partial-interval recording using an observation interval of roughly one second) is more accurate still.

If a recording method were to make a standard level of error of over- or under-estimation in all cases, then it would be a simple matter to subtract or add a constant figure to the obtained recordings to provide the accurate measure. Unfortunately, as Powell et al. (1975, 1977) comment, the level of error is not constant, so that genuine treatment effects could be obliterated by the data recording system. In this regard it is interesting to consider what would happen if, as a result of a treatment intervention, the percentage duration of the recorded behaviour was reduced. With subject 6, where response duration was unchanged but the per cent duration of behaviour was halved (see Table 2), all the measures of amount of behaviour shown reflected the fact that a reduction had taken place, with momentary time-sampling giving the best estimate. If, however, the response length alters when the percentage duration of a behaviour is reduced (as would happen if an undesirable behaviour were reduced by treatment from the level of, for example, subject 9's hand postures to the level of subject 6's hand postures), only momentary time-sampling can be relied upon to reflect the true extent of reduction, whole-interval and partial-interval recording sometimes over- and sometimes under-estimating the 'treatment' effect. Further perusal of the data in Table 1 shows that true reductions in duration can also be reflected as increases at times (as, for example, if partial-interval (10 sec) recording had been used to estimate a change in body-rocking from the level of subject 7 to subject 5).

[Table 2. Per cent duration of hand postures (subject 6) as estimated by the different recording methods (whole-interval recording; partial-interval recording, 10 sec; partial-interval recording, 2.5 sec; momentary time-sampling) and by the criterion record (true per cent duration), using all responses and using alternate responses only. Values not recoverable.]

In conclusion, it must be said that where it is necessary to use a sampling method (as opposed to a continuous method) of recording and where the response measure required is per cent duration, momentary time-sampling should be the method of choice, unless it is certain that the minimum response duration will be much longer than the observation interval (when partial-interval or whole-interval recording will be satisfactory). Furthermore, because it cannot be assumed that treatment will not produce alterations in response length as well as in per cent duration of behaviour, interval recording methods will not only provide incorrect estimates of per cent duration but may also mask or exaggerate any treatment effects. For most applied behaviourists who are attempting to record several types of behaviour at a time and who may be uncertain quite how treatment will affect response lengths, this means, as Powell et al. (1977) pointed out, that momentary time-sampling is almost obligatory.

Acknowledgements: The authors would like to acknowledge the encouragement and helpful comments of Dr. Janet Carr, Maria Callias and Dr. John Corbett.

GLYNIS MURPHY
ELIZABETH GOODALL
REFERENCES

GREEN S. B. and ALVERSON L. G. (1978). A comparison of indirect measures for long duration behaviours. J. appl. Behav. Anal. 11, 530.
HARTMANN D. P. (1977). Considerations in the choice of inter-observer reliability estimates. J. appl. Behav. Anal. 10, 103-116.
HOPKINS B. L. and HERMANN J. A. (1977). Evaluating inter-observer reliability of interval data. J. appl. Behav. Anal. 10, 121-126.
KELLY M. B. (1977). A review of the observational data-collection and reliability procedures reported in the Journal of Applied Behaviour Analysis. J. appl. Behav. Anal. 10, 97-101.
POWELL J., MARTINDALE A. and KULP S. (1975). An evaluation of time-sample measures of behaviour. J. appl. Behav. Anal. 8, 463-469.
POWELL J., MARTINDALE B., KULP S., MARTINDALE A. and BAUMAN R. (1977). Taking a closer look: time-sampling and measurement error. J. appl. Behav. Anal. 10, 325-332.
POWELL J. and ROCKINSON R. (1978). On the inability of interval time-sampling to reflect frequency of occurrence data. J. appl. Behav. Anal. 11, 531-532.
REPP A. C., DEITZ D. E. D., BOLES S. M., DEITZ S. M. and REPP C. F. (1976a). Differences among common methods for calculating inter-observer agreement. J. appl. Behav. Anal. 9, 109-113.
REPP A. C., ROBERTS D. M., SLACK D. J., REPP C. F. and BERKLER M. S. (1976b). A comparison of frequency, interval and time-sampling methods of data collection. J. appl. Behav. Anal. 9, 501-508.
Maintenance of improvement in agoraphobic patients treated by behavioural methods: a four-year follow-up
Summary
Fifty-six agoraphobic patients who had shown clinical improvement when treated by behavioural methods were followed up between 3.0 and 6.3 yr (mean 4.3 yr) later. Comparison of pre- and post-treatment and follow-up self-assessment data showed that improvement had been maintained on all the variables assessed: main symptom, other phobias, depression, social relationships and disruption at work. Only one patient reported the emergence of new psychological symptoms. However, only 10 (18%) of the sample described themselves as completely symptom-free, although most of the remainder reported that their symptoms caused them only slight distress and little disruption in their lives.
Despite the widespread use, and the evidence of the short-term effectiveness, of behavioural methods in the treatment of agoraphobia (e.g. Marks, 1978), there have been very few reports of long-term follow-up of treated patients. The studies that there have been, although not strictly comparable, have produced divergent results.