The Veterinary Journal 171 (2006) 281–286 www.elsevier.com/locate/tvjl
The intra- and inter-assessor reliability of measurement of functional outcome by lameness scoring in horses

Catherine J. Fuller a,*, Bruce M. Bladon b, Adam J. Driver c, Alistair R.S. Barr d

a Department of Anatomy, Equine Centre, University of Bristol, Langford House, Langford, Bristol BS40 5DU, UK
b Donnington Grove Veterinary Surgery, Oxford Road, Newbury, RG20 8SH, UK
c Dubai Equine Hospital, P.O. Box 9373, Dubai, UAE
d Department of Clinical Veterinary Science, Equine Centre, University of Bristol, Langford House, Langford, Bristol BS40 5DU, UK

Accepted 8 October 2004
Abstract

The objective of this study was to assess the reliability of lameness scoring in horses. One veterinary surgeon examined nineteen lame horses on four occasions. Gait was recorded by camcorder, and scored from 0 to 10 ranging from sound to non-weight bearing lameness. A global score of overall change in lameness during the study was also determined for each horse. To measure intra-assessor reliability of the scoring systems, one veterinary surgeon scored videotapes of the horses' gaits on two occasions. To measure inter-assessor reliability, three veterinary surgeons viewed the videotapes, assigning individual lameness scores plus global scores to each horse. Reliability of individual lameness scoring was good intra-assessor, but only just within our acceptable limit inter-assessor. However, global scoring of change in lameness throughout the study was found to be reliable overall. Since clinician scoring is commonly used to assess lameness in horses, this is an important finding, fundamental to future clinical studies.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Horse; Lameness; Scoring; Reliability; Clinical study
* Corresponding author. Tel.: +44 117 928 9621; fax: +44 117 928 9622. E-mail address: [email protected] (C.J. Fuller).
1090-0233/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tvjl.2004.10.012

1. Introduction

With the current drive for evidence based medicine in equine research there is a need for high quality randomized clinical trials to be carried out. Conducting such studies can pose several difficulties, including the requirement for large numbers of cases to provide adequate power, and the selection of valid and reliable outcome measures. In any clinical trial, assessing change or response to therapy requires consistent scoring of outcome at different time points. To allow for the recruitment of adequate numbers of cases, it is often necessary to conduct the study at a number of centres. This then introduces the need to involve a number of assessors. The outcome measure chosen must be reliable intra- and inter-assessor, i.e., able to produce the same results consistently, both by the same assessor and by different assessors.

In equine trials relating to locomotor system function the most commonly used outcome measure is an index of clinician assessed pain, indicated by lameness (Aviad et al., 1988; Gaustad and Larsen, 1995; Verschooten and Desmet, 1997; White et al., 1994), and although this is a well established and relevant method, it remains subjective, and there are few reports investigating the reliability of this method in the literature (Keegan et al., 1998). Intra-assessor variation (repeatability) and inter-assessor variation (reproducibility) (Welsh et al., 1993) of the scoring system is best
C.J. Fuller et al. / The Veterinary Journal 171 (2006) 281–286
calculated by test and retest at one to two week intervals (Deyo et al., 1991) and the agreement, between categorical variables, should be calculated using the κ statistic (Fleiss, 1981).

There are various methods of scoring lameness. Commonly used methods in equine practice include numerical rating systems that score the degree of lameness from 0–5 or from 0–10 (Wyn-Jones, 1988). Verbal rating systems are similar, but descriptors of the stage of lameness are added to each grade (Stashak, 1987). A visual analogue scale is a 10 cm line on which a mark is drawn at an appropriate level between the two extremes of no lameness and non-weight bearing lameness (Dixon and Bird, 1981).

These various systems have been used and evaluated in a number of studies. The repeatability of a numerical rating system for scoring lameness was found to be good to excellent when evaluating moderate to severe lameness (Keegan et al., 1998), but it was suggested that a global scoring system may improve reliability. The descriptors in verbal rating systems may help to make the allocation of grade more accurate, but may also force artificial placement into categories. Pleasant et al. (1997) used a five-point verbal rating system (Stashak, 1987) with excellent repeatability; however, reproducibility was not reported. Visual analogue scales have been used to measure lameness in a number of studies (Olby et al., 2001; Fuller, 1998; Hewetson, 2003; Welsh et al., 1993) with variable results. In our study, a numerical rating system was chosen since it is commonly used by equine veterinary practitioners.

A global assessment is a simple composite index, in this case reflecting an overall change in pain and/or disability. Global assessments are not as specific as signal measurements, i.e. a score of a particular clinical sign or disease characteristic, but where there is no gold standard for the measurement of disease activity they remain the "next best thing" (Symmons, 1994).
In our study, a global score of change in lameness over the course of the study was used. The definitions of the scores are given in Table 1.

The purpose of this study was to address the following questions. Is one assessor able to score lameness consistently over a longitudinal trial? If not, then we need to use a reliable method of recording gait which will allow us to review data at one sitting. Can lameness be consistently scored by a number of different assessors? We hypothesised that this is not reliable and therefore individual lameness scores cannot be compared in multi-centre trials, or by different assessors at different time points, i.e., the same assessor must see the same horse throughout for reliable results. We hypothesised that a global score of change over time is more reliable and is therefore the method of choice in multi-centre trials.

Table 1
Global scoring system for change in lameness

Lameness compared to beginning of trial    Global score
Worse                                      −1
Same                                       0
Improvement                                1
Sound all gaits                            2
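The categories in Table 1 can be written down as a small lookup. The sketch below is illustrative only: it assumes that "Worse" is scored −1, and the function `global_score_from_scores` is a hypothetical rule for deriving a global score from the first and last 0–10 lameness scores. The paper does not state how such a derivation was made.

```python
from enum import IntEnum


class GlobalScore(IntEnum):
    """Global score categories as listed in Table 1 (Worse assumed to be -1)."""
    WORSE = -1
    SAME = 0
    IMPROVEMENT = 1
    SOUND_ALL_GAITS = 2


def global_score_from_scores(first: int, last: int) -> GlobalScore:
    """Hypothetical rule: compare the last 0-10 lameness score with the first."""
    for score in (first, last):
        if not 0 <= score <= 10:
            raise ValueError("lameness scores must lie between 0 and 10")
    if last == 0:
        return GlobalScore.SOUND_ALL_GAITS  # no lameness seen at the end
    if last < first:
        return GlobalScore.IMPROVEMENT
    if last == first:
        return GlobalScore.SAME
    return GlobalScore.WORSE


# Illustrative values only, not data from the study.
example = global_score_from_scores(first=3, last=2)
```

A single numerical score cannot by itself confirm "sound all gaits", so in practice the last branch would need scores from every gait and surface; the sketch ignores that detail.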
2. Methods

2.1. Clinical study

Horses included in the study were from a population of riding horses which were diagnosed with osteoarthritis (OA) after referral to the Department of Clinical Veterinary Science, University of Bristol for lameness evaluation. During the study each horse was examined by the same assessor (CF) on four occasions at three monthly intervals over a nine month period (i.e., at 0, 3, 6, and 9 months post-diagnosis). At the first consultation, before inclusion in the study, each horse was given a full clinical examination.

At each consultation the gait of each horse was observed and recorded with a camcorder (Sanyo VMD66P) first at the walk and then at the trot in a straight line on a level concrete surface, both moving away and then returning towards the investigator. The gait of each horse was also recorded while trotting on the lunge in a circle, on both reins, first on a soft and then on a hard surface. Lameness was assessed by the same observer (CF), on each occasion, based on evaluation of head movement, and symmetry of stride and pelvic movement, and scored, at walk and trot, and on each surface, according to a semiquantitative numerical rating system from 0 to 10 ranging from sound to non-weight bearing lameness (Wyn-Jones, 1988).

2.2. Repeatability

2.2.1. Individual lameness scores

At the completion of the study a videotape (V1) was made on which the recordings of lameness were blinded to date order for each horse. For each horse, on each of up to four occasions, there was a series of recordings which included locomotion at the walk, trot in a straight line and trot on the lunge, both on soft and on hard surfaces. A lameness score was assigned to each horse for each available recording. This video was used to score lameness again by the primary assessor (CF), on two occasions, 14 days apart, and the level of agreement between the two sets of scores was calculated.

The agreement between scores given at the time of the examination and those given when viewing V1 was also calculated to ascertain whether the assessment of lameness at three month intervals by one assessor was repeatable.
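Blinding recordings to date order, as described for V1, amounts to presenting the clips in a random order with the visit date hidden while keeping a key for later unblinding. The following is a minimal sketch under stated assumptions: the tuple layout, the clip labels and the function `blind_recordings` are all hypothetical, and the paper does not describe how the tape was actually assembled.

```python
import random


def blind_recordings(recordings, seed=2004):
    """Shuffle clips and hide their visit dates.

    recordings: list of (horse_id, visit_month, gait) tuples (hypothetical layout).
    Returns (blinded, key): blinded holds (clip_label, horse_id, gait) in random
    order; key maps each clip_label back to the full original tuple.
    """
    rng = random.Random(seed)  # fixed seed so the order can be reproduced
    order = list(range(len(recordings)))
    rng.shuffle(order)
    blinded, key = [], {}
    for position, original_index in enumerate(order, start=1):
        horse_id, _visit_month, gait = recordings[original_index]
        label = f"clip-{position:03d}"
        blinded.append((label, horse_id, gait))  # visit month is withheld
        key[label] = recordings[original_index]
    return blinded, key


# Illustrative clips only, not data from the study.
recordings = [
    ("horse-01", 0, "trot, straight line"),
    ("horse-01", 3, "trot, straight line"),
    ("horse-01", 6, "trot, straight line"),
    ("horse-01", 9, "trot, straight line"),
]
blinded, key = blind_recordings(recordings)
```

Scores entered against the clip labels can then be joined back to visit dates through `key` once scoring is complete.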
2.2.2. Global scores

On a second videotape (V2), the recordings for each horse at each examination were kept together, in order. This was viewed, and a global score of improvement over the duration of the trial (Table 1) was assigned for each horse by the primary assessor, on two occasions 14 days apart. A global score was also given for each horse based on both lameness scores given at the time of examination, and those derived from the date blinded video recordings (V1).

2.3. Reproducibility

2.3.1. Individual lameness scores

The videotape V1 was viewed, in single sittings, by three independent experienced veterinary clinicians, each of whom assigned lameness scores to a subset of 33 recordings of eight horses moving at the trot in a straight line and on the lunge.

2.3.2. Global scores

The videotape V2 was also viewed, in single sittings, by three independent veterinary clinicians and a global score of change during the trial was given to each horse.

2.3.2.1. Statistics

The level of agreement between scores was measured using the kappa statistic (κ). κ measures the proportion of agreement between two scores, made by different assessors, or by the same assessor but at different times, over and above that which could occur by chance (Cohen, 1960). Where there is complete agreement κ = 1, and where there is no more agreement than would be expected by chance, κ = 0. Weighted kappa values (κw) were calculated when disagreements between scores were occasionally more than one category (Cohen, 1968). A κ value below 0.4 was considered to indicate poor, and for this study unacceptable, agreement; values between 0.4 and 0.75 indicated fair to good agreement; and a value of >0.75 was considered excellent (Fleiss, 1981).

3. Results

3.1. Demographic data

Twenty horses from those referred to the Department of Clinical Veterinary Science, University of Bristol for persistent lameness, which were diagnosed to be suffering from OA, were recruited to the study. These were of mixed breed and mixed age, the mean age being 10.2 years and ranging from 4 to 21 years. The sex ratio was 13 male:7 female. The mean duration of lameness in the population was 6.4 months. All horses had unilateral lameness. There were five cases of tarsometatarsal OA, one in the left and four in the right hind, one case of midcarpal OA in the right fore, one case of metacarpophalangeal OA in the right fore, six cases of proximal interphalangeal OA – two left fore, two right fore, one in the left and one in the right hind, and seven cases of distal interphalangeal joint OA – all in the right forelimb. Lameness scores given by all assessors ranged from 0 to 4 out of a maximum 10, so the level of lameness seen in this study was mild to moderate.

3.2. Video recordings

Because of some breakdowns in compliance, video recordings were available for 17 horses on every examination. Some horses were examined at the trot in a straight line only; some were also examined and their gait recorded on the lunge. Therefore, V1 consisted of a total of 101 recordings from 19 horses whereas V2 consisted of 99 recordings of 17 horses.

3.3. Repeatability

3.3.1. Individual lameness scores

The agreement between lameness scores given at two readings of V1 for one assessor (CF) was good: κw = 0.68 (95% CI = 0.57–0.79) (Fig. 1a). This means that individual lameness scoring using a numerical rating system was repeatable. The agreement between lameness scores given at the time of examination with those from the time blinded video for one assessor (CF) was also good: κw = 0.61 (95% CI = 0.51–0.71), indicating that one assessor could score lameness consistently at three month intervals.

3.3.2. Global scores

The agreement between one assessor's global scores taken from two readings of V2, 14 days apart, was considered good: κw = 0.6 (95% CI = 0.42–0.78) (Fig. 1b); therefore the global scoring system was repeatable.

[Figure 1: weighted kappa values (κw) plus 95% confidence intervals, vertical axis 0 to 0.9, for (a) repeatability of individual score, (b) repeatability of global score, (c) reproducibility of individual score, (d) reproducibility of global score.]

Fig. 1. Repeatability and reproducibility of individual and global lameness scoring. κw < 0.4 = unacceptable variation, κw 0.4–0.75 = fair to good agreement, κw > 0.75 = excellent agreement.
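The kappa statistic described in the Statistics section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' computation: the paper cites Cohen (1968) for weighted kappa but does not say which weighting scheme was used, so the `scheme` argument ("unweighted", "linear" or "quadratic") is a choice made here, and the function name `weighted_kappa` is hypothetical. It uses the disagreement form, kappa = 1 − (weighted observed disagreement) / (weighted chance disagreement).

```python
from collections import Counter


def _disagreement_weight(i, j, scheme):
    """Penalty for a pair of ordered categories at positions i and j."""
    if scheme == "unweighted":
        return 0.0 if i == j else 1.0
    if scheme == "linear":
        return float(abs(i - j))
    if scheme == "quadratic":
        return float((i - j) ** 2)
    raise ValueError(f"unknown weighting scheme: {scheme}")


def weighted_kappa(scores_a, scores_b, categories, scheme="linear"):
    """Cohen's kappa for two sets of scores on the same ordered categories."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    index = {category: i for i, category in enumerate(categories)}
    if any(s not in index for s in list(scores_a) + list(scores_b)):
        raise ValueError("every score must be one of the listed categories")
    n = len(scores_a)
    pairs = Counter(zip(scores_a, scores_b))  # observed joint counts
    margin_a = Counter(scores_a)
    margin_b = Counter(scores_b)
    observed = 0.0
    expected = 0.0
    for a in categories:
        for b in categories:
            w = _disagreement_weight(index[a], index[b], scheme)
            observed += w * pairs[(a, b)] / n
            expected += w * (margin_a[a] / n) * (margin_b[b] / n)
    if expected == 0.0:
        raise ValueError("kappa is undefined when only one category is used")
    return 1.0 - observed / expected


# Illustrative scores only, not data from the study.
first_reading = [0, 1, 2, 2, 3]
second_reading = [0, 1, 2, 3, 3]
kappa_w = weighted_kappa(first_reading, second_reading, categories=range(5))
```

With linear weights a one-grade disagreement is penalised half as much as a two-grade disagreement, which matches the idea of giving partial credit for near misses on an ordinal lameness scale.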
Table 2
Levels of agreement in individual lameness scores between the three assessors

Assessors           Occasions (of 33) where     Occasions (of 33) where      κw     95% CI
                    scores differed by 1 grade  scores differed by 2 grades
1 vs. 2             11                          0                            0.58   0.35–0.82
2 vs. 3             13                          2                            0.35   0.1–0.6
1 vs. 3             11                          2                            0.30   0.04–0.55
Overall agreement                                                            0.41   0.36–0.46
Table 3
Levels of agreement in global lameness scores between the three assessors

Assessors           Occasions (of 17) where     Occasions (of 17) where      κw     95% CI
                    scores differed by 1 grade  scores differed by 2 grades
1 vs. 2             4                           0                            0.74   0.50–0.97
2 vs. 3             7                           1                            0.45   0.11–0.79
1 vs. 3             5                           1                            0.61   0.31–0.89
Overall agreement                                                            0.60   0.37–0.83
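The disagreement counts in Tables 2 and 3 can be turned into raw proportions of exact agreement, which is a useful contrast with κw because raw agreement is not corrected for chance. The sketch below assumes that no pair of scores differed by more than two grades (the tables list only one- and two-grade differences); the dictionary and function names are hypothetical.

```python
# (differed by 1 grade, differed by 2 grades), copied from Tables 2 and 3.
TABLE_2_INDIVIDUAL = {"1 vs. 2": (11, 0), "2 vs. 3": (13, 2), "1 vs. 3": (11, 2)}
TABLE_3_GLOBAL = {"1 vs. 2": (4, 0), "2 vs. 3": (7, 1), "1 vs. 3": (5, 1)}


def exact_agreement(total, differed_by_1, differed_by_2):
    """Return (count, proportion) of occasions where both scores were equal.

    Assumes every disagreement was of one or two grades.
    """
    agreed = total - differed_by_1 - differed_by_2
    if agreed < 0:
        raise ValueError("disagreements exceed the number of occasions")
    return agreed, agreed / total


individual = {pair: exact_agreement(33, *counts)
              for pair, counts in TABLE_2_INDIVIDUAL.items()}
global_scores = {pair: exact_agreement(17, *counts)
                 for pair, counts in TABLE_3_GLOBAL.items()}
```

For assessors 1 vs. 2, for example, this gives 22 of 33 individual scores and 13 of 17 global scores in exact agreement.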
The agreements (κw) between the global scores assigned while viewing V2 and (i) global scores derived from lameness scores at the time of examination, and (ii) global scores derived from V1 lameness scores, were 0.58 (95% CI = 0.51–0.65) and 0.78 (95% CI = 0.64–0.82) respectively, which were considered good and excellent. These results imply that the individual scores assigned by one assessor at different time points did reflect change in lameness over the whole trial period, and gave further support to the repeatability of lameness scoring by a single assessor. There was a significant positive association between the change in lameness score given at the time of examination (change in lameness between examination 1 and 4) and the global score given from the video of lameness: rs = 0.79 (95% CI = 0.51–0.92, P = 0.0001), reinforcing the view that the individual lameness scores given by one assessor were repeatable and could reflect change.

3.4. Reproducibility

3.4.1. Individual lameness scores

Although the agreement between two of the assessors was good (Table 2), the overall level of agreement in lameness scoring between different experienced assessors was considered to be only just above the acceptable limit: κw = 0.41 (95% CI = 0.36–0.46) (Fig. 1c).

3.4.2. Global scoring

The agreement between two assessors was almost excellent (Table 3) and the overall agreement in global scoring between all three assessors was considered good: κw = 0.6 (95% CI = 0.37–0.83) (Fig. 1d). The system of global scoring was also judged to be reproducible and therefore a reliable method of measuring change in lameness.
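The association reported above as rs is a Spearman rank correlation. A minimal sketch of that calculation follows, assuming the usual definition (Pearson correlation of the ranks, with tied values given their average rank); the function names and the example values are hypothetical and are not data from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: fall in 0-10 score between examinations 1 and 4,
# paired with a global score for the same hypothetical horses.
score_reduction = [2, 1, 0, -1, 1, 3]
global_score = [2, 1, 0, -1, 1, 2]
rs = spearman_rs(score_reduction, global_score)
```

Ranking before correlating is what makes the measure suitable for ordinal scores such as these, where the distance between grades is not assumed to be equal.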
4. Discussion

This study has shown that the clinical assessment of change in lameness is reliable both intra- and inter-assessor when a global scoring system is used. Lameness scoring, using a numerical rating system, is reliable only when carried out by a single assessor, and should not be used as an outcome measure when more than one assessor is involved. We have also shown that one assessor can score lameness consistently at different time points over a nine month period and that change in mild to moderate lameness can be reliably detected. Since clinician scoring of lameness is the most common method of assessing limb pain and disability in the horse, this is an extremely important finding.

Inter-assessor agreement was only just within our acceptable limit when individual lameness scores were given. This is a very important finding since in long term studies, particularly when held in more than one centre, it is likely that more than one clinician will be involved in observing lameness. However, what is most important in clinical studies is the degree of change in lameness over the trial period. We have shown here that the method of recording the gait of the horse with a video camera at each examination, and subsequently assigning a score of change over the whole study period, is a reliable method of assessing that change.

The particular guidelines for interpreting the kappa values must be considered when comparing assessment of reliability from different studies. Landis and Koch (1977) suggest κ > 0.8 indicates excellent agreement, but also suggest that κ > 0.2 = fair and κ > 0.4 = moderate agreement. Martin and Bonnett (1987) recommend that for clinical purposes κ values between 0.3 and 0.5 are acceptable, 0.5–0.7 = good and κ > 0.7 = excellent agreement. In the study reported here we chose to use the guidelines suggested by Fleiss (1981) to describe the quality of
the agreement, since for the purposes of conducting good quality clinical studies, we considered that the higher limit, κ = 0.4, of acceptable variability was appropriate.

The horses in our study were all mildly to moderately lame. When assessing mild lameness, it has been reported (Keegan et al., 1998) that using videotape to enable concurrent lameness examinations to be recorded for viewing at a single sitting is a more reliable way of detecting change than comparing individual scores assigned at the time of lameness examination, especially when more than one assessor is involved. However, in our study, for one assessor, we demonstrated good agreement between change in scores given at the time of examination and global scores assigned at a single videotape viewing.

Other studies (Keegan et al., 1998; Pleasant et al., 1997) have used video recordings to score lameness subjectively, and the possibility that viewing gait on a videotape could affect the assessor's ability to detect lameness has been raised. In the study by Keegan et al. (1998), the lateral view only was used to view the horse, and the recordings had no sound. In the study reported here, horses were viewed from the front and back as well as laterally, and sound was included. Also, the good agreement in our study between scores assigned at the time of lameness examination and those assigned after viewing the time-blinded video supports the view that videotapes can be a reliable method of storing data.

Our individual scoring system ranged from 0 to 10, while our global scoring system ranged from −1 to 2 only. It may be that if we had used a 0–5 scoring system for individual scores the reliability of that system would have improved. However, because all cases were only mildly to moderately lame, all cases were scored between 0 and 4, by all assessors, so only part of the scale was used. Some may say that it is not surprising that there was disagreement between assessors when assigning absolute lameness scores.
With experience, and despite training, different clinicians tend to develop their own scoring system within the 0–10 framework, some scoring higher than others. Indeed, there was good agreement between two of the three assessors in this study for individual scores. Also, in this study no descriptors of lameness degree were used, and the scoring system relied purely on a ranking system. However, a recent study by Hewetson (2003) showed that use of a numerical rating system by experienced clinicians was more reliable than the use of a verbal rating system. This may be because no studies have investigated the validity of assigning descriptors to each rank.
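The interpretation guidelines compared earlier in this Discussion can be set side by side in code, which makes it easy to see how the same κ value earns a different label under each scheme. This is a sketch under stated assumptions: the bands are encoded only as they are summarised in this paper (not from the original sources), the handling of values exactly on a boundary is a choice made here, and the function names are hypothetical.

```python
def fleiss_1981(kappa):
    """Bands used in this study: <0.4 poor, 0.4-0.75 fair to good, >0.75 excellent."""
    if kappa < 0.4:
        return "poor"
    if kappa <= 0.75:
        return "fair to good"
    return "excellent"


def martin_bonnett_1987(kappa):
    """Bands as summarised in the text; 0.5 and 0.7 are assigned to 'good' here."""
    if kappa > 0.7:
        return "excellent"
    if kappa >= 0.5:
        return "good"
    if kappa >= 0.3:
        return "acceptable"
    return "below the acceptable range"


def landis_koch_1977(kappa):
    """Only the three thresholds quoted in the text are encoded."""
    if kappa > 0.8:
        return "excellent"
    if kappa > 0.4:
        return "moderate"
    if kappa > 0.2:
        return "fair"
    return "not labelled in the text"


def interpret(kappa):
    """Label one kappa value under all three guidelines."""
    return {
        "Fleiss (1981)": fleiss_1981(kappa),
        "Martin and Bonnett (1987)": martin_bonnett_1987(kappa),
        "Landis and Koch (1977)": landis_koch_1977(kappa),
    }


# Overall inter-assessor values reported in the Results.
individual_overall = interpret(0.41)
global_overall = interpret(0.60)
```

Under this encoding the overall individual-score value of 0.41 is "fair to good" by Fleiss, "acceptable" by Martin and Bonnett, and "moderate" by Landis and Koch, which illustrates why the chosen guideline should be stated when studies are compared.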
5. Conclusion

We have investigated the reliability of a commonly used and reported method of assessing lameness in horses. The results have established that although the
degree of change in mild lameness over time can be detected reliably by one assessor, when more than one assessor is involved in a clinical trial, a global scoring system would be the preferred outcome measure.

Acknowledgement

The funds for this study were provided by the Horse Race Levy Betting Board.

References

Aviad, A.D., Arthur, R.M., Brencick, V.A., Ferguson, H.O., Teigland, M.B., 1988. Synacid vs Hylartin V in equine joint disease. Journal of Equine Veterinary Science 8, 112–116.
Cohen, J., 1960. A co-efficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–47.
Cohen, J., 1968. Weighted kappa; nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70, 213–220.
Deyo, R., Diehr, P., Patrick, D., 1991. Reproducibility and responsiveness of health status measures. Controlled Clinical Trials 12 (Suppl.), 142–158.
Dixon, J.S., Bird, H.A., 1981. Reproducibility along a 10-cm vertical visual analogue scale. Annals of the Rheumatic Diseases 40, 87–89.
Fuller, C.J., 1998. Measures of Osteoarthritis in the Horse. PhD Thesis, University of Bristol.
Fleiss, J., 1981. The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions. Wiley, New York.
Gaustad, G., Larsen, S., 1995. Comparison of polysulphated glycosaminoglycan and sodium hyaluronate with placebo in treatment of traumatic arthritis in horses. Equine Veterinary Journal 27, 356–362.
Hewetson, M., 2003. Reproducibility of a visual analogue scale, numerical rating scale and verbal rating scale for the assessment of lameness in the horse. In: Proceedings of the 42nd British Equine Veterinary Association Congress Birmingham, p. 295.
Keegan, K.G., Wilson, D.A., Wilson, D.J., Smith, B., Gaughan, E.M., Pleasant, R.S., Lillich, J.D., Kramer, J., Howard, R.D., Bacon-Miller, C., Davis, E.G., May, K.A., Cheramie, H.S., Valentino, P.D., van Harreveld, P.D., 1998. Evaluation of mild lameness in horses trotting on a treadmill by clinicians and interns or residents and correlation of their assessments with kinematic gait analysis. American Journal of Veterinary Research 59 (11), 1370–1377.
Landis, J.R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 33, 159–174.
Martin, S.W., Bonnett, B., 1987. Clinical epidemiology. Canadian Vet Journal 28, 318–325.
Olby, N.J., De Risio, L., Munana, K.R., Wosar, M.A., Skeen, T.M., Sharp, N.J.H., Keene, B.W., 2001. Development of a functional scoring system in dogs with acute spinal cord injuries. American Journal of Veterinary Research 62 (10), 1624–1628.
Pleasant, R.S., Moll, H.D., Ley, W.B., Lessard, P., Warnick, L.D., 1997. Intra-articular anesthesia of the distal interphalangeal joint alleviates lameness associated with the navicular bursa in horses. Veterinary Surgery 26 (2), 137–140.
Stashak, T.S., 1987. Diagnosis of lameness. In: Stashak, T.S. (Ed.), Adams Lameness in Horses. Lea & Febiger, Philadelphia.
Symmons, D., 1994. Disease assessment indices: activity, damage and severity. Baillière's Clinical Rheumatology 8 (3), 267–285.
Verschooten, F., Desmet, P., 1997. Effect of intra-articular sodium hyaluronate (Hyonate(R)) in equine joint disease: A clinical trial. Vlaams Diergeneeskundig Tijdschrift 66, 21–27.
Welsh, E., Gettinby, G., Nolan, A., 1993. Comparison of a visual analogue scale for assessment of lameness, using sheep as a model. American Journal of Veterinary Research 54, 976–983.
White, G.W., Jones, E.W., Hamm, J., Sanders, T., 1994. The efficacy of orally-administered sulfated glycosaminoglycan in chemically-induced equine synovitis and degenerative joint disease. Journal of Equine Veterinary Science 14, 350–353.
Wyn-Jones, G., 1988. Equine Lameness. Blackwell Scientific, Oxford.