SCHIZOPHRENIA RESEARCH ELSEVIER
Schizophrenia Research 29 (1998) 287-292
Inter-rater reliability of the neurological examination in schizophrenia

Richard D. Sanders a, Steven D. Forman b,c, Joseph N. Pierri c, Robert W. Baker d, Mary E. Kelley b, Daniel P. Van Kammen b,c, Matcheri S. Keshavan c,*

a Wright State University School of Medicine, Department of Psychiatry and Dayton Mental Health Center, 2611 Wayne Avenue, Dayton, OH 45420, USA
b Pittsburgh VA Health Care System, Highland Drive Division, Pittsburgh, PA 15206, USA
c Western Psychiatric Institute and Clinic, 3811 O'Hara Street, Pittsburgh, PA 15213, USA
d University of Mississippi Medical Center, 2500 N. State Street, Jackson, MS 39216-4505, USA

Received 21 July 1997; accepted 26 July 1997
Abstract
Neurological Examination Abnormalities (NEA) are prevalent in schizophrenia, but their significance is obscured by methodological problems. The Neurological Evaluation Scale (NES), the most widely used structured neurological examination in schizophrenia research, has had limited study of its inter-rater reliability (IRR). An augmented version of the NES was jointly administered (one examiner-rater and one observer-rater) by three pairs of psychiatrists to two populations of patients with idiopathic psychotic disorders. In addition to the ordinal and categorical data yielded by the original NES, continuous data were recorded in one of the series. Reliability analyses of our populations and a previously published study reveal consistently adequate IRR in 12 of the 26 items assessed, and inconsistently adequate IRR in an additional 11. Consistent with studies using other NEA schedules, IRR was unacceptably low for some items that rely on subjective severity ratings. Certain rare abnormalities, which posed difficulties for the estimation of IRR, are probably not generally useful in the study of schizophrenia. Reliability estimates of continuous, ordinal and dichotomous data were comparable in most cases. We recommend that certain items from the NES be deleted, and that other studies of NEA in psychiatry follow similar procedures before undertaking further analyses. © 1998 Elsevier Science B.V.

Keywords: Schizophrenia; Neurologic examination; Psychotic disorders; Neuropsychology; Reproducibility of results; Psychomotor performance
1. Introduction
* Corresponding author. Tel: (412) 624-0814; Fax: (412) 624-9464; e-mail: keshavan+@pitt.edu
0920-9964/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0920-9964(97)00103-5

The past 20 years have seen an explosion of research into neurological function in schizophrenia. Most of this research has employed methods of assessment that would be prohibitively expensive to utilize in routine clinical practice. Because the bedside neurological examination is the most readily available and inexpensive means of evaluating neurological function, research into neurological examination abnormalities (NEA) in
schizophrenia has particular potential for direct application to clinical practice. Many of the abnormalities found in psychiatric patients are referred to as 'soft signs', reflecting in part the perception that they are irreproducible. In fact, establishing inter-rater reliability (IRR) for such examinations can pose difficulties (Vitiello et al., 1989; Vreeling et al., 1993). Further confusion in this field arises from the inconsistent composition and execution of the examination. Thus, there is a need for a reliable and generally accepted battery of neurological exam items for use in the study of schizophrenia.

The ideal neurological battery would be limited to items that can be rated reliably by most clinicians and researchers, and that are abnormal in a significant proportion of the patient population. Many NEA are lacking on both counts, and often these two deficiencies are related. Aside from the obvious limitations of a measure that rarely varies within or between study populations, estimating agreement between raters on invariable measures poses difficulties of its own. Percentage agreement (PA) is inflated by chance agreement when applied to invariable measures (Fleiss, 1975). Chance-corrected measures of agreement, such as kappa (κ; Fleiss, 1981) and intraclass correlation (ICC; Shrout and Fleiss, 1979), suffer from deflation when little variance in the measure is observed (Shrout et al., 1987).

Individual NEA are sometimes combined into groups, and reliability addressed with summary scores (Buchanan and Heinrichs, 1989). Although this is expedient, there is little empirical basis yet for deciding how to combine the elements of the psychiatric neurological exam. Unreliable and invariable measures are best singled out and eliminated from neurological exam schedules.
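The contrast between percentage agreement and chance-corrected agreement for a rarely observed sign can be illustrated with a small calculation. The following is a minimal sketch, not the authors' analysis code: the function name `percent_agreement_and_kappa` and the ratings are hypothetical, and kappa is computed in its usual two-rater (Cohen) form for dichotomous ratings.

```python
def percent_agreement_and_kappa(rater1, rater2):
    """Return (PA, kappa) for two equal-length lists of 0/1 ratings."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater1)
    # Observed agreement: proportion of subjects given the same rating.
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement expected from each rater's own rate of positives.
    p1 = sum(rater1) / n
    p2 = sum(rater2) / n
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    if p_chance == 1.0:
        return p_obs, None  # no variance at all: kappa is undefined
    kappa = (p_obs - p_chance) / (1 - p_chance)
    return p_obs, kappa


# Hypothetical rare sign: 20 subjects, each rater scores one
# (different) subject as abnormal and all others as normal.
rater1 = [1] + [0] * 19
rater2 = [0, 1] + [0] * 18
pa, kappa = percent_agreement_and_kappa(rater1, rater2)
print(f"PA = {pa:.0%}, kappa = {kappa:.3f}")
```

In this made-up example both raters call 18 of 20 subjects normal, so PA is 90%, yet kappa is slightly negative because the observed agreement is no better than what the raters' own base rates would produce by chance.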
The Neurological Evaluation Scale (NES), a widely used instrument with published instructions and scoring guidelines, was developed using NEA items culled from the schizophrenia literature (Buchanan and Heinrichs, 1989). Item scores are ordinally represented as 0, 1 or 2, or dichotomously as 0 or 2 whenever findings can only be described as present or absent. This approach enhances scale homogeneity, but information may be lost when broad variation in performance of a given task is distilled into one of two or three possible scores. Although others have examined the IRR of NES summary scores, the original paper (Buchanan and Heinrichs, 1989) includes the only item-by-item IRR data published for this scale, using intraclass correlation (ICC).

To enhance the efficiency and utility of the NES, we examined the IRR of a modified NES in acute and chronic schizophrenia, through analysis of continuous data, as well as the nominal and ordinal data generated by the original NES (Buchanan and Heinrichs, 1989). Based on the agreement among psychiatrist examiners in different populations of psychotic patients, we propose eliminating at least three items from the NES.
2. Method
This report includes data from two separate populations of psychotic patients: a younger, more acute group of patients from an urban psychiatric hospital (Group A), and an older, more chronically psychotic group of male military veterans (Group B). We had the opportunity to perform two reliability studies on Group A patients, each with its own pair of raters (n = 20 and n = 15), and one reliability study was performed on Group B (n = 16). All subjects had signed informed consent. Characteristics of these patient groups are presented in Table 1. We made no attempt to incorporate or control for such factors as medication status, side-effects or intelligence, which may influence neurologic performance but would not be expected to directly influence reliability.

The neurological examination used in this study was modified from the Neurological Evaluation Scale (NES) (Buchanan and Heinrichs, 1989). The procedures were clarified with the assistance of the authors of the NES. Modifications included the addition of the palmomental reflex and, in Group B, the go-no-go task. With Group B, continuous data were recorded for those items in which such data could be collected without substantially altering the procedure. This included noting the number of errors made, the time to completion of sequential motor tasks, the duration of persistence in a visual fixation task and during the Romberg, the
Table 1
Characteristics of two subject populations

                        A1              A2              B
Number                  20              15              16
Age a                   24.7 (7.4)      24.1 (8.3)      37.8 (8.2)
% Male                  75              86.6            100
% Right-handed          80              93.3            93.8
% Positive signs a      33.5 (16.1)     35.4 (13.7)     45.9 (17.0)
NES total score a       0.419 (0.304)   0.425 (0.218)   0.592 (0.341)

NES, Neurological Evaluation Scale; A1 and A2, acute psychotic patients at an urban acute psychiatric setting; B, chronic psychotic patients at a VA psychiatric facility.
a Mean (standard deviation).
number of blinks observed during the glabellar tap and during visual fixation, and the number of responses observed during the palmomental reflex procedure. Detailed instructions for the modified NES, as well as the examination scoring form used to record continuous data, are available upon request.

All NES evaluations were performed by psychiatrists, none of whom had received additional training in neurology. In the assessment of IRR, one examiner performed the exam while the other watched; results reflect the variance in observation of a single examination. In studies of acute psychotic subjects (Group A), this was done using the exclusively ordinal/categorical format of Buchanan and Heinrichs (1989). In the study of chronic patients (Group B), the continuous data form was used, and results later coded into the ordinal/categorical format. We did not include dominance of hand, foot or eye in this reliability study, but have found no instances of disagreement in smaller samples.

Data from methodologically identical subdivisions of particular exam items (right and left, memory at 5 and 10 min) were combined, yielding two data points per subject for the memory task and for bilaterally administered items. Categorical observations (ratings of 0 vs 1 or 2) were compared using the kappa statistic (κ) (Fleiss, 1981). In sorting these exam items in terms of reliability, we applied the conventional standard (Fleiss, 1981) of κ > 0.40, representing a minimum threshold for acceptable agreement. Continuous and ordinal variables were compared using intraclass correlation (ICC) (Shrout and Fleiss, 1979), using the standard threshold of ICC > 0.50 (Bartko, 1976). Statistical significance of ICC and κ values was not assessed, as the goal was not to exceed chance levels of agreement, but to exceed acceptable levels of agreement (MacLure and Willett, 1987). The continuous data from Group B yield multiple variables in some items, generally number of errors and time to completion. In these cases, ICC was also computed for time to completion.
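The ICC threshold described above can be made concrete with a short sketch. The text cites Shrout and Fleiss (1979) without naming a specific ICC form, so this example assumes the two-way random-effects, single-rater form, ICC(2,1); the function name `icc_2_1` and the example scores are hypothetical and are not taken from the study data.

```python
ICC_THRESHOLD = 0.50  # threshold cited in the text (Bartko, 1976)


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, single rater, absolute agreement.

    `scores` holds one row per subject and one column per rater.
    Returns None when every rating is identical (nothing to estimate).
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_raters

    bms = ss_subjects / (n - 1)            # between-subjects mean square
    jms = ss_raters / (k - 1)              # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    denom = bms + (k - 1) * ems + k * (jms - ems) / n
    if denom == 0:
        return None
    return (bms - ems) / denom


# Hypothetical ordinal scores (0, 1 or 2) from an examiner and an observer.
pairs = [[0, 0], [1, 1], [2, 2], [0, 1], [1, 1], [2, 1]]
icc = icc_2_1(pairs)
print(f"ICC(2,1) = {icc:.2f}; adequate: {icc > ICC_THRESHOLD}")
```

Returning `None` when all ratings are identical mirrors the situation in which an ICC cannot be computed for an invariable item; other ICC forms from Shrout and Fleiss (1979) would give different values on the same data.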
3. Results
Inter-rater reliability results are presented in Table 2. Kappa values are presented for all items, but subsequent discussion refers to κ only for those items (convergence, snout and suck reflexes) that can only be rated dichotomously. The reliability of specific items can be seen to vary substantially between substudies. When the consensus score was zero in more than 90% of a subsample (as was the case with stereognosis, face-hand test, convergence, and snout and suck reflexes), κ and ICC tended to indicate poorer agreement than did PA. Three NEA (stereognosis, face-hand test, snout reflex) were too rare for ICC to be computed in all series. In Group A, we found that 13 items fell below minimal IRR in at least one rater pair. When raters were pooled (n = 35), we found subthreshold IRR in tremor, stereognosis, finger-nose, convergence and grasp reflex. In Group B, only tremor and snout reflex had inadequate IRR. To assess the impact of dichotomous vs ordinal
Table 2
Agreement among three rater combinations on ordinally rated neurological exam items, in two populations of patients with schizophrenia (A and B)

                            RS/MK (A1)              MK/JP (A2)              RS/SF (B)
Exam item                   κ       ICCo    PA (%)  κ       ICCo    PA (%)  κ       ICCo    PA (%)
1. Tandem gait              0.52    0.79    94      0.41    0.56    73      1       1       100
2. Romberg                  1       0.99    95      0       0.60    0.87    0.74    0.62    84
3. Adventitious overflow    0.42    0.44    69      0.31    0.69    71      0.49    0.55    80
4. Tremor                   0.34    0.24    72      0.14    0.10    69      0.23    0.48    53
8. Audiovisual integration  0.8     0.82    81      0.84    0.98    93      1       1       100
9. Stereognosis             0       -0.02   95a     NC      NC      100a    0.89    0.89    96
10. Graphesthesia           0.88    0.83    84      0.93    0.98    97      1       1       100
11. Fist-ring               0.78    0.84    72      0.72    0.80    87      0.80    0.92    88
12. Fist-edge-palm          0.77    0.81    0.83    0.40    0.66    70      0.87    0.92    96
13. Alternating fist-palm   0.88    0.91    88      0.70    0.70    67      0.93    0.97    92
14. Memory                  0.94    0.89    97      0.86    0.99    96      1       1       100
15. Tap reproduction        0.88    0.88    82      0.72    0.92    86      0.59    0.93    85
16. Tap production          1       0.97    87      0       0.31    72      0.66    0.65    87
17. Diadochokinesis         0.87    0.91    97      0.40    0.56    75      0.52    0.77    73
18. Finger-thumb            0.85    0.89    88      0.17    0.41    59      0.47    0.65    71
19. Mirror movement         0.67    0.64    86      0.02    0.20    62      0.38    0.54    63
20. Face-hand test          0.82    0.84    94      NC      NC      100a    0.60    0.88    85
21. Right-left orientation  1       0.99    94      0.27    0.65    72      0.87    0.93    92
22. Synkinesis              0.11    0.10    78      0.62    0.77    79      0.67    0.73    88
23. Convergence             0       NC      94a     0.42    NC      77      0.75    NC      92
24. Gaze impersistence      0.65    0.65    88      0.29    0.42    72      0.45    0.60    77
25. Finger-nose             -0.07   -0.07   87      0       -0.16   72      0.41    0.51    77
26. Glabellar reflex        0.84    0.80    75      0.28    0.28    64      0.73    0.87    85
27. Snout reflex            1       NC      100a    NC      NC      100a    0       NC      85
28. Grasp reflex            0.43    0.46    82      0       0.32    73      0.57    0.62    85
29. Suck reflex             1       NC      100a    0       NC      87      0.43    NC      85
30. Palmomental reflex      0.87    0.91    88      0.20    0.58    60      0.46    0.78    80
31. Go-no-go                NE      NE      NE      NE      NE      NE      0.73    0.89    92

Group (A) = first-episode psychotics; Group (B) = chronic schizophrenics; NC = not computed (all ratings = 0, or ICC not computed for dichotomous data); NE = not examined; κ = kappa of dichotomous data (0 vs 1 or 2); ICCo = intraclass correlation of ordinal data; PA = percent agreement.
a Indicates rare finding, rated 0 by both examiners with greater than 90% frequency.
data on IRR estimations, we computed both κ and ICC for items that yield ordinal data, and found little discrepancy (Table 2); the two methods disagreed in 7 of 80 cases as to whether an acceptable degree of IRR had been obtained. We also compared continuous vs ordinal data for items that could be assessed in terms of continuous data, in Group B (Table 3). These comparisons reveal little disagreement, although estimated agreement based on continuous data was markedly higher in the case of gaze impersistence.

Table 4 reviews the reliability of individual NES items in three psychotic populations: state hospital inpatients (Buchanan and Heinrichs, 1989), acute psychotic patients (Group A) and more chronically ill veterans (Group B). Group A, displaying a lower overall prevalence of NEA (Table 1), had lower IRR on several measures. Of the 26 items assessed in all three populations, 12 showed consistently adequate (at least fair) reliability in all analyses, 11 showed inconsistently adequate reliability (fair or better in most analyses), and 3 showed consistently inadequate reliability (poor reliability in most analyses). Pooling the Group A data yields consistently inadequate IRR in the same three items (tremor, snout and grasp reflexes) and inconsistently adequate IRR in three (stereognosis, finger-nose, convergence).

Table 3
Inter-rater reliability of continuous and ordinal measures of neurological performance in one group of chronically schizophrenic patients (n = 16)

Exam item                               ICCo    ICCc
1. Tandem gait (errors)                 1       0.98
8. Audiovisual integration (errors)     1       1
9. Stereognosis (errors)                0.89    0.99
10. Graphesthesia (errors)              1       1
11. Fist-ring (errors)                  0.92    0.98
    Time (s)                                    0.93
12. Fist-edge-palm (errors)             0.92    0.96
    Time (s)                                    0.93
13. Ozeretski (errors)                  0.97    0.97
    Time (s)                                    0.79
14. Memory (errors)                     1       1
15. Tap reproduction (errors)           0.93    0.79
16. Tap production (errors)             0.65    0.76
17. Diadochokinesis (errors)            0.77    0.84
    Time (s)                                    0.96
18. Finger-thumb opposition (errors)    0.65    0.67
    Time (s)                                    0.86
20. Face-hand test (errors)             0.88    0.99
21. Right-left orientation (errors)     0.93    0.94
24. Gaze impersistence (s)              0.60    0.99
26. Glabellar reflex (blinks)           0.87    0.89
30. Palmomental reflex (responses)      0.78    0.84
31. Go-no-go (errors)                   0.89    0.95

ICCo, intraclass correlation of ordinal data; ICCc, intraclass correlation of continuous data.
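The comparison of κ-based and ICC-based adequacy decisions reported in the Results can be sketched as follows. This is an illustrative reconstruction only: the helper name `count_disagreements` is hypothetical, only a handful of (κ, ICC) pairs from Table 2 are included, and the sketch assumes that pairs with a value that was not computed are simply skipped.

```python
KAPPA_MIN = 0.40  # kappa threshold given in the Method (Fleiss, 1981)
ICC_MIN = 0.50    # ICC threshold given in the Method (Bartko, 1976)


def count_disagreements(pairs):
    """Count (kappa, ICC) pairs whose adequacy decisions differ.

    Pairs containing None (not computed) are skipped.
    Returns (number of disagreements, number of pairs compared).
    """
    compared = 0
    disagreements = 0
    for kappa, icc in pairs:
        if kappa is None or icc is None:
            continue
        compared += 1
        if (kappa > KAPPA_MIN) != (icc > ICC_MIN):
            disagreements += 1
    return disagreements, compared


# A few (kappa, ICC) pairs transcribed from Table 2 for illustration.
rows = {
    "Tandem gait (A1)": (0.52, 0.79),
    "Adventitious overflow (A1)": (0.42, 0.44),
    "Romberg (A2)": (0.0, 0.60),
    "Tremor (B)": (0.23, 0.48),
    "Stereognosis (A2)": (None, None),  # NC in Table 2
}
d, n = count_disagreements(rows.values())
print(f"{d} of {n} compared pairs disagree")
```

In this small subset the two statistics lead to different decisions for adventitious overflow (A1) and Romberg (A2); applying the same rule to every item and rater pair is one way the 7-of-80 figure could be checked.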
4. Discussion

To establish which NEA are reliably assessed in adult patients with idiopathic psychoses, we studied the IRR of the NES with three pairs of raters in two populations of psychotic patients. As represented in Table 2, IRR estimates varied considerably among individual items, and also among populations and rater combinations. IRR estimates using continuous, dichotomous and ordinal data were generally similar.

Table 4
Inter-rater reliability of the Neurological Evaluation Scale in three sets of schizophrenic subjects

Exam item                     B and H   A         B     Summary reliability
1. Tandem gait                G         G,F       E     +
2. Romberg                    G*        E,F       F     +
3. Adventitious overflow      G,G       P,F       F     ±
4. Tremor                     F,F       P,P       P     -
8. Audiovisual integration    E         G,E       E     +
9. Stereognosis               E*,E      P*,NC*    G     ±
10. Graphesthesia             E,E       G,E       E     +
11. Fist-ring                 G,F       G,G       E     +
12. Fist-edge-palm            G,F       G,F       E     +
13. Alternating fist-palm     E         G,E       E     +
14. Memory                    E         G,E       E     +
15. Tap reproduction          E         G,E       E     +
16. Tap production            G         E,P       F     ±
17. Diadochokinesis           E*,F      E,F       G     +
18. Finger-thumb opposition   G*,G      G,P       F     ±
19. Mirror movement           G,F       F,P       F     ±
20. Face-hand test            E         G,NC*     G     +
21. Right-left orientation    E         E,F       E     +
22. Synkinesis                G,E       P,F       G     ±
23. Convergence               G,F       P*,F      G     ±
24. Gaze impersistence        E,G       F,P       F     ±
25. Finger-nose               G,G       P,P       F     ±
26. Glabellar reflex          G         G,P       G     ±
27. Snout reflex              P*        E*,NC*    P     -
28. Grasp reflex              P*,P*     P,P       F     -
29. Suck reflex               F         E*,P      F     ±
30. Palmomental reflex        NE        E,F       G
31. Go-no-go                  NE        NE        G

Based on intraclass correlation (ICC), except for current data on items 23, 27 and 29, which are based on kappa (κ). B and H = Buchanan and Heinrichs (1989); multiple ratings represent items reported separately for right- and left-handed executions; Group (A) = acute psychotics; multiple ratings represent different combinations of raters; Group (B) = chronic schizophrenics; E (Excellent) denotes ICC > 0.9 or κ > 0.75; G (Good) denotes ICC > 0.7; F (Fair) denotes ICC > 0.5 or κ > 0.4; P (Poor) denotes ICC < 0.5 or κ < 0.4; NE, not examined; NC, not computed. * = Item rated 0 in >90% of instances; + = consistently adequate; ± = inconsistently adequate; - = consistently inadequate.
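The grading scheme in the footnote to Table 4 can be written out as a small sketch. The function names `grade_icc` and `summarize` are hypothetical, and several choices are assumptions rather than rules stated in the paper: only the ICC cut-offs are covered (the footnote gives no κ cut-off for Good), a value exactly on a cut-off takes the lower grade, "most analyses" is read as more than half, and ties are counted as inconsistently adequate.

```python
def grade_icc(icc):
    """Grade an ICC with the Table 4 cut-offs.

    None means not computed (NC). Assumption: a value exactly on a
    cut-off (0.5, 0.7, 0.9) falls into the lower grade.
    """
    if icc is None:
        return "NC"
    if icc > 0.9:
        return "E"
    if icc > 0.7:
        return "G"
    if icc > 0.5:
        return "F"
    return "P"


def summarize(grades):
    """Summarize grades across analyses as '+', '±' or '-'."""
    rated = [g for g in grades if g in ("E", "G", "F", "P")]
    if not rated:
        return "NC"
    poor = rated.count("P")
    if poor == 0:
        return "+"   # at least fair in every analysis
    if poor > len(rated) / 2:
        return "-"   # poor in most analyses
    return "±"       # assumption: ties count as inconsistently adequate


# Grades for three items as listed in Table 4 (B and H, Group A, Group B).
table4_grades = {
    "Tandem gait": ["G", "G", "F", "E"],
    "Finger-nose": ["G", "G", "P", "P", "F"],
    "Tremor": ["F", "F", "P", "P", "P"],
}
for item, grades in table4_grades.items():
    print(item, summarize(grades))

print(grade_icc(0.79), grade_icc(0.48))
```

Under these assumptions tandem gait comes out consistently adequate, finger-nose inconsistently adequate and tremor consistently inadequate, in line with how these items are described in the text.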
We observed two groups of unreliable NEA items. First, those NEA that were rarely observed in these populations tended to have low chance-corrected IRR estimates, often in spite of high percentage agreement. Stereognosis and some of the primitive reflexes were noteworthy for their lack of positive findings. Although some of these NEA may be more common in other psychiatric populations, we discourage their inclusion in studies of patients with primary psychoses. Among NEA observed more frequently, reliability is often compromised in items relying more on subjective assessment, due to difficulties quantifying (Rutter et al., 1970; Vitiello et al., 1989), particularly when discriminations between normal and slightly abnormal are required (Rutter et al., 1970; Richards et al., 1991). Tremor, mirror movements and synkinesis are present examples of this problem. Some NEA, such as rigidity (not in NES) and the grasp reflex, are inherently difficult to assess visually, leading to underestimation of IRR when determined by the current method.

Because the low-reliability NES items in this study are generally the same as those found elsewhere in the psychiatric and neurologic literature, we recommend against including them in future adaptations of the NES and in other NEA protocols. In the study of psychotic patients generally, we suggest eliminating tremor and the snout and grasp reflexes. Several other items, including adventitious overflow, mirror movements, synkinesis, convergence, the finger-nose test and the suck reflex, have limited potential for reliable measurement, and should be avoided, particularly in less impaired populations.

Although improved IRR in some of these items may result from strenuous effort (Vreeling et al., 1993) or greater expertise (Hansen et al., 1994), the general clinical applicability of any results thus obtained would be suspect. Owing to infrequent positive findings, stereognosis might have to be abandoned, or made more challenging.

Psychiatrically significant NEA comprise a very heterogeneous group, having in common only that they are simple, bedside measures of neurologic function. The NES may have little to add when used only to generate a single index of neurological impairment, but represents a convenient collection of measures that share potential usefulness in the study of schizophrenia and other psychoses. It will be necessary to tailor one's choice of items to the question at hand. We suggest that these choices be limited to items with demonstrated potential for satisfactory IRR.
Acknowledgment

This study was supported in part by NIMH grants MH 45203, 01180 and 46614 (Dr Keshavan). Dr Nina Schooler helped in the study design, Tamera McLaughlin provided secretarial support, and Drs Robert Stowe, John Gurklis, and Kathy Pajer offered helpful comments.
References

Bartko, J.J., 1976. On various intraclass correlation coefficients. Psychol. Bull. 83, 762-765.
Buchanan, R.W., Heinrichs, D.W., 1989. The neurological evaluation scale (NES): a structured instrument for the assessment of neurological signs in schizophrenia. Psychiatry Res. 27, 335-350.
Fleiss, J.L., 1975. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 31, 651-659.
Fleiss, J.L., 1981. Statistical methods for rates and proportions. Wiley, New York.
Hansen, M., Sindrup, S.H., Christensen, P.B., Olsen, N.K., Kristensen, O., 1994. Interobserver variation in the evaluation of neurological signs: observer dependent factors. Acta Neurol. Scand. 90, 145-149.
MacLure, M., Willett, W.C., 1987. Misinterpretation and misuse of the kappa statistic. Am. J. Epidemiol. 126, 161-169.
Richards, M., Marder, K., Bell, K., Dooneief, G., Mayeux, R., Stern, Y., 1991. Interrater reliability of extrapyramidal signs in a group assessed for dementia. Arch. Neurol. 48, 1147-1149.
Rutter, M., Graham, P., Yule, W., 1970. A neuropsychiatric study in childhood. Clin. Devel. Med. Volume 35/36.
Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420-428.
Shrout, P.E., Spitzer, R.L., Fleiss, J.L., 1987. Quantification of agreement in psychiatric diagnosis revisited. Arch. Gen. Psychiatry 44, 172-177.
Vitiello, B., Ricciuti, A.J., Stoff, D.M., 1989. Reliability of subtle (soft) neurological signs in children. J. Am. Acad. Child Adolesc. Psychiatry 28, 749-753.
Vreeling, F.W., Jolles, J., Verhey, F.R.J., Houx, P.J., 1993. Primitive reflexes in healthy, adult volunteers and neurological patients: methodological issues. J. Neurol. 240, 495-504.