Gait & Posture 36 (2012) 495–499
Functional limits of agreement: A method for assessing agreement between measurements of gait curves

J. Røislien a,b,c,*, L. Rennie b, I. Skaaret a

a Rikshospitalet University Hospital, Oslo, Norway
b Sunnaas Rehabilitation Hospital, Nesoddtangen, Norway
c Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
ARTICLE INFO

Article history: Received 5 October 2011. Received in revised form 20 April 2012. Accepted 3 May 2012.

Keywords: 3D gait analysis; Gait curves; Reliability; Limits of agreement; Functional data analysis

ABSTRACT

Three-dimensional measurement of gait is a widely used tool in clinical gait analysis, and the evaluation of the reliability and reproducibility of the method is a recurring topic in the literature. The reliability of gait curve measurements is often assessed by extracting single points from the gait curves before applying traditional reliability measures for scalars. This approach does not, however, explore the entire gait curves as continuous functions of time. In order to assess agreement between gait curves measured by different measurement methods, or measurers, we propose an extension of the concept of limits of agreement (LoA) to curve data. The LoA represent the estimated variation in the actual observations, and must be accompanied by an evaluation of whether this observed variation is within clinically acceptable limits. The generalization of the methodology from scalars to continuous functions, e.g. gait curves, can be done using functional data analysis (FDA), a statistical methodology developed specifically for analyzing functional data. The resulting functional limits of agreement (FLoA) are continuous functions from 0 to 100% of the gait cycle, representing the difference in gait curves as measured by different measurement methods. The FLoA are presented in actual degrees for each joint and plane under study. The proposed methodology is demonstrated on real data from an inter-rater repeatability study. © 2012 Elsevier B.V. All rights reserved.
* Corresponding author at: Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Boks 1122 Blindern, 0317 Oslo, Norway. Tel.: +47 22 85 14 05; fax: +47 22 85 13 13. E-mail address: [email protected] (J. Røislien).
0966-6362/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.gaitpost.2012.05.001

1. Introduction

Three-dimensional gait analysis (3DGA) has become a widely used tool in gait analysis, and kinematic gait curves serve as important outcome measures in both gait research and clinical practice and decision making. While instruments for measuring gait are becoming increasingly sophisticated, measurements of gait are nonetheless prone to both natural and experimental error, often of surprisingly large magnitude [1,2]. For any measurement method the issues of reliability and reproducibility are essential, and they even arise in the all-important question of validity: whenever no gold standard exists, or the true value is impossible to measure for practical or ethical reasons, the validation task becomes an assessment of reliability and reproducibility [3].

Central to the question of reliability is the concept of agreement. Due to both systematic and random error, no two measurements of the same trait, be it body temperature or human gait, will be identical. When comparing measurement methods the central question is thus whether the different methods generate measurements that for all practical purposes can be considered similar.

Several analytical methods have been proposed for assessing similarity between measurements of gait. Standard reliability coefficients such as the ICC [4,5] do not take into account that gait curves are not a selection of points or summary statistics, such as range of motion (ROM), but continuous functions of time. The frequently used coefficient of multiple correlation (CMC) for quantification of curve similarity [6] considers the whole curve but has several shortcomings [7], such as not adjusting for the correlation between points belonging to the same gait curve. Further, it has been suggested that a proper evaluation of agreement between curve measurements should include some measure of variability, such as SD or SEM [7], so that the result can be evaluated in an absolute measure, e.g. degrees.

In a seminal paper, Bland and Altman [8] pointed out that agreement is a stronger claim than mere correlation, and that correlation is not sufficient to assess agreement. The paper introduced the concept of limits of agreement (LoA), highlighting that agreement can only be assessed by comparing the estimated variation in observed data to a clinical evaluation of what is an
acceptable quantification of "not different". This methodology has been applied in gait analysis numerous times, including the assessment of reliability of gait speed measurements [9], agreement between temporospatial gait variables in stroke populations [10] and validity and inter-rater reliability of the Lindop Parkinson's Disease Mobility Assessment [11]. In these studies, however, the outcome measures were scalars, not gait curves. When assessing the similarity of gait curves, the data tend to be reduced to scalars, i.e. summary statistics such as ROM, before the LoA methodology is applied. This approach has been taken in the analysis of the reliability of 3DGA in cervical spondylotic myelopathy [12] and of between-day repeatability of knee kinematics [13]. Reducing gait curves to a set of characteristic scalars does not fully appreciate the functional nature of gait curves.

In the present work we extend the ideas of Bland and Altman [8] to functional data, e.g. gait curves, by introducing functional limits of agreement (FLoA). The FLoA can be applied for assessing agreement between gait curves measured by different methods, e.g. the same measurer on different days or different measurers on the same day, and the results are presented in absolute measures of the variable under study, such as degrees. The methodology is demonstrated on data from an inter-rater study on healthy adults.

2. Statistical methodology

We briefly review the concept of limits of agreement (LoA) for scalars [8], before extending the methodology to functional data.

2.1. Limits of agreement

Fig. 1A shows tibial torsion measured by physical examination of bimalleolar axis (BM) and by knee rotation angle at initial contact (kIC) on the left side in 26 children. When the differences between pairs of measurements (d) are plotted against the corresponding averages (A), and the data are assumed to be normally distributed, 95% of the observed differences will lie between the limits of agreement (LoA), calculated as $\bar{d} \pm 1.96\,\mathrm{SD}(d)$, with $\bar{d}$ denoting the mean difference (Fig. 1B). Whether this observed variation in the data, quantified by the LoA, is acceptably small must then be decided by clinical evaluation.

The main assumption for the above approach to be valid is that d and A must be uncorrelated. In order to assess this assumption one can perform a linear regression of d on A [14]. In the tibial torsion data we find that the regression coefficient is not statistically significantly different from zero (0.12 (95% CI 0.49, 0.40), p = 0.85), implying that d and A are indeed uncorrelated and justifying the use of the standard LoA approach. When the assumption holds, the differences d hold all the quantitative information needed in order to assess agreement, and d and the accompanying LoA can be summarized in a simple histogram without loss of information (Fig. 1C).

2.2. Functional data analysis

Gait is a continuous process. In 3DGA, gait is sampled at a given frequency, e.g. 100 Hz, and the resulting sequences of discrete values are usually presented as gait curves from 0 to 100% of the gait cycle. These gait curves are representations of the underlying time-continuous process under study: gait. The functional nature of gait data is naturally handled using functional data analysis (FDA) [15]. In FDA the temporally ordered sequence of values that 3DGA software returns to represent gait is turned into a single functional object. Given a set of basis functions $\phi_k$, $k = 1, \ldots, K$, a time-continuous signal $g(t)$ can be represented as a mathematical function in the time domain by a linear combination of these basis functions:

$$g(t) = \sum_{k=1}^{K} c_k \phi_k(t),$$

with $c_k$, $k = 1, \ldots, K$, a set of weights. For periodic data such as gait, a Fourier series representation yields good fits, i.e. using as the set of basis functions $1$, $\sin(\omega t)$, $\cos(\omega t)$, $\sin(2\omega t)$, $\cos(2\omega t)$ and so on. This data transformation, from a set of temporally ordered discrete values to a continuous mathematical function with respect to time, using a basis function representation, makes most standard statistical methods applicable, with proper modifications.

Note that the process of fitting a continuous mathematical function to the temporally ordered data points can be used as a data reduction procedure; often relatively few basis functions $\phi_k$ and corresponding weights $c_k$ are needed in order to represent the data sequence with great precision. Here, however, the main concern is the actual fitting of the continuous mathematical function for further statistical analysis, and we have followed the general advice of using as many basis functions as the data allow for [15]. In our analyses, discretely represented gait curves were converted into functional objects and analyzed using functionality from the fda package [15,16] in the freeware R 2.12 [17].
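The Fourier basis representation described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the fda package's implementation: it assumes samples evenly spaced over exactly one gait cycle, with time expressed as a fraction of the cycle (so that ω = 2π), in which case the least-squares weights reduce to simple discrete projections. The function names `fit_fourier` and `evaluate` and the synthetic curve are hypothetical and serve only as illustration.

```python
import math


def fit_fourier(samples, n_harmonics):
    """Least-squares weights for the basis 1, sin(wt), cos(wt), sin(2wt), ...

    Assumes the samples are evenly spaced over one full cycle, so the
    discrete sines and cosines are orthogonal and each weight is a projection.
    """
    n = len(samples)
    if 2 * n_harmonics >= n:
        raise ValueError("n_harmonics must be smaller than len(samples) / 2")
    constant = sum(samples) / n  # weight of the constant basis function
    weights = []
    for h in range(1, n_harmonics + 1):
        sin_w = 2.0 / n * sum(
            y * math.sin(2.0 * math.pi * h * j / n) for j, y in enumerate(samples))
        cos_w = 2.0 / n * sum(
            y * math.cos(2.0 * math.pi * h * j / n) for j, y in enumerate(samples))
        weights.append((sin_w, cos_w))
    return constant, weights


def evaluate(constant, weights, t):
    """Evaluate the fitted curve g(t) at t, a fraction of the gait cycle."""
    value = constant
    for h, (sin_w, cos_w) in enumerate(weights, start=1):
        angle = 2.0 * math.pi * h * t
        value += sin_w * math.sin(angle) + cos_w * math.cos(angle)
    return value


# Synthetic (hypothetical) joint-angle curve in degrees, 50 samples per cycle.
n_samples = 50
samples = [30.0 + 25.0 * math.sin(2.0 * math.pi * j / n_samples)
           + 10.0 * math.cos(4.0 * math.pi * j / n_samples)
           for j in range(n_samples)]

constant, weights = fit_fourier(samples, n_harmonics=3)
max_error = max(abs(evaluate(constant, weights, j / n_samples) - y)
                for j, y in enumerate(samples))
```

Once fitted, the curve can be evaluated at any point of the cycle, not only at the sampled instants, which is what allows differences between two curves to be treated as functions of time.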
Fig. 1. (A) Tibial torsion measured by physical examination of bimalleolar axis (BM) and by knee rotation angle at initial contact (kIC) on the left side in 26 children; (B) Bland Altman plot of differences vs averages, mean difference and limits of agreement (LoA); and (C) histogram of differences and LoA.
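The quantities shown in Fig. 1B, i.e. the differences, averages, mean difference and LoA, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the paired measurements below are made up and are not the tibial torsion data, the function names are hypothetical, and the regression step only returns the slope of d on A, without the confidence interval and p-value used in the text.

```python
import statistics


def limits_of_agreement(method_1, method_2, z=1.96):
    """Return differences, averages, mean difference and (lower, upper) LoA."""
    d = [a - b for a, b in zip(method_1, method_2)]
    avg = [(a + b) / 2.0 for a, b in zip(method_1, method_2)]
    mean_d = statistics.mean(d)
    sd_d = statistics.stdev(d)  # sample standard deviation of the differences
    return d, avg, mean_d, (mean_d - z * sd_d, mean_d + z * sd_d)


def slope_of_d_on_a(d, avg):
    """Ordinary least-squares slope from regressing d on A."""
    mean_a = statistics.mean(avg)
    mean_d = statistics.mean(d)
    sxx = sum((a - mean_a) ** 2 for a in avg)
    sxy = sum((a - mean_a) * (v - mean_d) for a, v in zip(avg, d))
    return sxy / sxx


# Hypothetical paired measurements in degrees (illustration only).
method_1 = [10.0, 12.0, 15.0, 11.0, 14.0]
method_2 = [9.0, 13.0, 14.0, 12.0, 13.0]

d, avg, mean_d, (lower, upper) = limits_of_agreement(method_1, method_2)
slope = slope_of_d_on_a(d, avg)
```

A slope close to zero is consistent with d and A being unrelated, but whether the resulting limits are acceptably narrow remains a clinical judgment rather than something the code can decide.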
2.3. Functional limits of agreement
In the discrete case, such as the tibial torsion example above, we note that whenever the difference d between pairs of measurements collected by different measurement methods is uncorrelated with the corresponding average A, a mere histogram of the differences holds all the quantitative information needed in order to assess agreement by LoA. This fact allows for a straightforward generalization of the concept of LoA to functional data, e.g. gait curves as continuous functions of time, $g(t)$. Differences between gait curves measured by two different methods will be continuous functions with respect to time, $d(t)$, and these differences can be visualized in a diagram. If the differences $d(t)$ are uncorrelated with the corresponding time-continuous averages, $A(t)$, the $d(t)$ hold all quantitative information necessary to establish agreement for the functional data under study. Whether this assumption holds can be assessed using a functional generalization of linear regression [15], and, if the assumption holds, the corresponding functional limits of agreement (FLoA) are calculated as $\bar{d}(t) \pm 1.96\,\mathrm{SD}(d(t))$, with $\bar{d}(t)$ denoting the time-continuous mean difference between the two measurement methods under study. The FLoA are presented in actual degrees, representing the observed time-continuous variation in the measurements of the gait data, $g(t)$. These FLoA must then be accompanied by a clinical evaluation of whether this observed variation is within clinically acceptable limits.

2.4. Estimation uncertainty

As the FLoA are estimated from data, they have some estimation uncertainty associated with them. As studies of human gait often include relatively few test subjects, estimation uncertainties tend to be relatively large. It is advisable to quantify this uncertainty in some manner, for example using general statistical re-sampling techniques such as bootstrapping [18,19].

3. Real data example: inter-rater reliability study

The aim of the following study was to investigate inter-rater reliability related to marker placement.

3.1. Data material

Seven healthy volunteers (4 females) with mean (SD) age 38 (13) years and weight 72 (9) kg, with no past history of neurological or musculoskeletal pathology, gave informed written consent to take part in the study. The subjects were tested once by each of two tester teams on consecutive days. 3DGA data were recorded during barefoot, level walking along a 10 m walkway at self-selected, comfortable walking speed, using six infrared cameras (Vicon MX13, 100 Hz, Vicon Motion Systems, Oxford, UK), two force platforms (AMTI OR6-7, Advanced Mechanical Technology, Inc., Watertown, USA) and two digital video cameras. The standard Plug-in-gait marker protocol and Plug-in-gait model processing (Vicon Motion Systems, Oxford, UK) were used. The same anthropometric measurements were used for both test sessions and one person undertook all data processing. For each subject, one left cycle was selected from each test session, based on inter-session similarity in gait velocity.

The pelvic, hip, knee and ankle kinematic curves in all planes were analyzed. Here results for the pelvic, hip and knee curves in the sagittal and transverse planes are presented. Fig. 2A shows the 14 gait curves for each of the six combinations of segments and planes. The Fourier basis function representation gave an excellent fit to the data (not shown).

3.2. Assessing independence between the difference and the average

In order to assess whether the time-continuous differences between measurement methods $d(t)$ were uncorrelated with the corresponding averages $A(t)$, we performed concurrent functional regression analyses (Fig. 2B and C), as well as functional permutation tests to calculate corresponding time-continuous p-values, $p(t)$ [15,16] (Fig. 2D). Note that in a functional regression analysis, the regression coefficients and corresponding 95% confidence intervals (CI) will also be functions of time. For all joints and planes the 95% CI for the regression coefficient includes the zero line throughout the full gait cycle (Fig. 2C), indicating that $d(t)$ and $A(t)$ are unrelated. Further, the corresponding p-value function, $p(t)$, indicates no statistically significant association between $d(t)$ and $A(t)$ for any joints or planes, with overall p-values well above the standard significance level of 5% (Fig. 2D). The overall p-values were calculated using a functional version of a permutation F-test [16].

Fig. 2. (A) Gait curves for seven normal adults without impairments, measured by two different tester teams on consecutive days (gray vs black); (B) functional intercept (regression function 1); (C) functional slope (regression function 2); and (D) functional p-value for assessing statistically significant associations between differences between pairs of measurements and corresponding averages. Overall p-value calculated using a functional permutation F-test superimposed.

Fig. 3. (A) Gait curves for seven normal adults without impairments, measured by two different tester teams on consecutive days (gray vs black); (B) time continuous differences for each individual (gray solid lines), mean difference (black solid line), and functional limits of agreement (FLoA) (black dotted lines); and (C) mean difference, FLoA, and 95% CI for FLoA calculated using bootstrapping.
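The FLoA of Section 2.3 and the bootstrap confidence bands of Section 2.4 can be sketched as follows. This is a simplified illustration and not the analysis carried out with the fda package: it assumes that both sets of curves have already been evaluated on a common grid of time points, it computes the mean difference and SD pointwise on that grid, and it resamples subjects (pairs of curves) with replacement, which is one plausible reading of "bootstrapping of the gait curves". All function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def floa(curves_a, curves_b, z=1.96):
    """Pointwise mean difference and FLoA; curves_x[i][j] is subject i, time j."""
    diffs = [[a - b for a, b in zip(ca, cb)] for ca, cb in zip(curves_a, curves_b)]
    mean_d, lower, upper = [], [], []
    for column in zip(*diffs):  # all subjects' differences at one time point
        m = statistics.mean(column)
        s = statistics.stdev(column)
        mean_d.append(m)
        lower.append(m - z * s)
        upper.append(m + z * s)
    return mean_d, lower, upper


def bootstrap_floa_ci(curves_a, curves_b, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap bands for the lower and upper FLoA (assumed scheme)."""
    rng = random.Random(seed)
    n = len(curves_a)
    boot_lower, boot_upper = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample subjects
        _, lo, up = floa([curves_a[i] for i in idx], [curves_b[i] for i in idx])
        boot_lower.append(lo)
        boot_upper.append(up)

    def band(replicates):
        tail = (1.0 - level) / 2.0
        out = []
        for column in zip(*replicates):  # bootstrap values at one time point
            ordered = sorted(column)
            lo_i = int(math.floor(tail * (len(ordered) - 1)))
            hi_i = int(math.ceil((1.0 - tail) * (len(ordered) - 1)))
            out.append((ordered[lo_i], ordered[hi_i]))
        return out

    return band(boot_lower), band(boot_upper)


# Synthetic (hypothetical) data: 7 subjects, 21 time points, angles in degrees.
data_rng = random.Random(1)
n_subjects, n_points = 7, 21
team_a = [[20.0 * math.sin(2.0 * math.pi * j / (n_points - 1)) + data_rng.gauss(0.0, 3.0)
           for j in range(n_points)] for _ in range(n_subjects)]
team_b = [[value + data_rng.gauss(0.0, 2.0) for value in curve] for curve in team_a]

mean_d, lower, upper = floa(team_a, team_b)
lower_ci, upper_ci = bootstrap_floa_ci(team_a, team_b, n_boot=200)
```

With as few as seven subjects the bootstrap bands are typically wide, and each resample contains repeated subjects, so the bands should be read as a rough indication of uncertainty rather than as exact coverage intervals.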
3.3. Functional limits of agreement

Measurements from the two tester teams for each of the seven subjects are shown in Fig. 3A, while Fig. 3B shows the corresponding time-continuous differences $d(t)$ for each individual, the mean difference $\bar{d}(t)$ and the corresponding FLoA. For pelvic tilt and rotation and hip flexion/extension, the mean difference $\bar{d}(t)$ follows the zero line closely, and the width of the FLoA is relatively small, indicating good agreement between measurements from the two tester teams for these gait variables. Markedly poorer agreement between the two tester teams can be seen for knee flexion/extension and hip rotation, with their considerably wider FLoA. This suggests a systematic inconsistency between the teams in how they position the thigh wand marker. Whether these wider FLoA correspond to an unacceptably high variation in the observed data is not a statistical question but a matter for clinical evaluation, which we do not pursue here.

3.4. Estimation uncertainty

Calculating 95% CI for the FLoA by bootstrapping of the gait curves shows that the estimation uncertainty in the FLoA is large (Fig. 3C). With only seven test subjects, this is not unexpected.

4. Discussion

This study presents a novel method for assessing agreement between measurements of gait curves, by generalizing the method of limits of agreement (LoA) introduced by Bland and Altman [8] to functional data, here termed functional limits of agreement (FLoA). The calculations can be carried out using existing functionality in freely available software, and the result is a graphical display of the actual variation in the observed data, which is then to be compared to an evaluation of what is a clinically acceptable quantification of "not different".

The common practice of extracting single data values when representing gait curves, e.g. range of motion (ROM), does not fully explore the functional nature of gait curves. While this approach might render standard methods of reliability, such as the ICC, applicable, a lot of the detail in the gait curve is thrown away in the process. Other methods, such as the coefficient of multiple correlation (CMC), acknowledge that a gait curve is indeed a curve, but suffer from the same limitation as all correlation-based methods, namely that agreement is a stronger claim than mere correlation [7,8]. It has been suggested that evaluation of agreement between measurements of gait curves should include some measure of variability, such as SD or SEM [7], that is, a measure of difference on the scale of the data under study. This is exactly what the proposed method of FLoA provides.

In our real data example the FLoA varied across the gait cycle, and across segments and planes. Some of these variations are related to one another, making clinical evaluation essential, as there does not yet exist an easily available method for assessing agreement across all segments and planes simultaneously. A typical example is the cross talk seen between knee flexion/extension and knee varus/valgus in swing as a result of thigh wand marker misplacement.

We further note that in our real data example the confidence intervals for the estimated FLoA were rather wide. For some segments and planes the lower confidence limit for the upper FLoA even fell below the upper confidence limit for the lower FLoA. This is a natural consequence of too limited an amount of data. It has been pointed out previously that increasing the number of gait trial recordings maximizes intra-rater reliability [20]. The collection of gait data is time consuming, but gait analysis is still not exempt from the mathematical laws of statistical analysis: few measurements result in large estimation uncertainty and wide confidence intervals, making it more difficult to draw conclusions based on the study. Information is power, and less information means less power.

A central assumption when applying LoA is that the differences d must be unrelated to the corresponding averages A. There are generally two main patterns to look for: whether d is associated with A, and whether the variance of d is associated with A. Performing a linear regression of d on A will uncover a direct association between these transformed variables, and is an advised statistical modeling strategy [14]. A more general additive model, e.g. a functional parallel to generalized additive models (GAM) [21], is a potential improvement, but would also increase mathematical complexity dramatically, even beyond the existing framework for agreement of scalar measurements. An association between the variance of d and A is encountered when measurement precision changes with magnitude and the range of magnitudes is wide; in such situations it is not uncommon that the distance d between pairs of measurements increases with increasing average A. In gait analysis, however, it might be reasonable to assume that precision does not change according to joints and planes, and we have thus not explored this in the present analysis.

Being a generalization of Bland and Altman's original methodology [8], our proposed methodology of FLoA is constructed for comparing two methods of functional measurement, i.e. one pair. For comparing more than two methods, a straightforward generalization that parallels the presentation of correlation between more than two continuous variables can be applied. In the latter situation a matrix of correlations is often constructed, with each cell containing the correlation between a given pair of variables.
The same approach can be taken in a LoA setting [22], and is directly transferable to the present situation, by the construction of a matrix of graphical FLoA displays rather than a matrix of scalar correlations.

An important feature of agreement is that it is not solely a statistical question, but also contains clinical evaluation as an integral part of the analysis [8]: is the actual observed difference in measurements between two or more methods small enough for the measurements to be considered "not different"? The FLoA visualize the observed differences in functional data, but the actual decision of whether this observed difference is within acceptable limits is left to clinical evaluation.

Conflicts of interest

There are no conflicts of interest.

Acknowledgement

The research was supported by Sophies Minde Ortopedi's Research Trust.

References

[1] Gorton III GE, Hebert DA, Gannotti ME. Assessment of the kinematic variability among 12 motion analysis laboratories. Gait & Posture 2009;29:398–402.
[2] Schwartz MH, Trost JP, Wervey RA. Measurement and management of errors in quantitative gait data. Gait & Posture 2004;20:196–203.
[3] Zou KH, Warfield SK, Bharatha A, Tempany CMC, Kaus MR, Haker SJ, et al. Statistical validation of image segmentation quality based on a spatial overlap index: scientific reports. Academic Radiology 2004;11:178–89.
[4] McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychological Methods 1996;1:30–46.
[5] Barnhart HX, Haber MJ, Lin LI. An overview on assessing agreement with continuous measurements. Journal of Biopharmaceutical Statistics 2007;17:529–69.
[6] Kadaba MP, Ramakrishnan HK, Wootten ME, Gainey J, Gorton G, Cochran GVB. Repeatability of kinematic, kinetic, and electromyographic data in normal adult gait. Journal of Orthopaedic Research 1989;7:849–60.
[7] McGinley JL, Baker R, Wolfe R, Morris ME. The reliability of three-dimensional kinematic gait measurements: a systematic review. Gait & Posture 2009;29:360–9.
[8] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet 1986;327:307–10.
[9] Green J, Forster A, Young J. Reliability of gait speed measured by a timed walking test in patients one year after stroke. Clinical Rehabilitation 2002;16:306–14.
[10] Stokic DS, Horn TS, Ramshur JM, Chow JW. Agreement between temporospatial gait parameters of an electronic walkway and a motion capture system in healthy and chronic stroke populations. American Journal of Physical Medicine & Rehabilitation 2009;88.
[11] Pearson MJT, Lindop FA, Mockett SP, Saunders L. Validity and inter-rater reliability of the Lindop Parkinson's disease mobility assessment: a preliminary study. Physiotherapy 2009;95:126–33.
[12] McDermott A, Bolger C, Keating L, McEvoy L, Meldrum D. Reliability of three-dimensional gait analysis in cervical spondylotic myelopathy. Gait & Posture 2010;32:552–8.
[13] van der Linden ML, Rowe PJ, Nutton RW. Between-day repeatability of knee kinematics during functional tasks recorded using flexible electrogoniometry. Gait & Posture 2008;28:292–6.
[14] Carstensen B. Comparing methods of measurement: extending the LoA by regression. Statistics in Medicine 2010;29:401–10.
[15] Ramsay JO, Silverman BW. Functional data analysis. Springer; 2005.
[16] Ramsay JO, Hooker G, Graves S. Functional data analysis with R and MATLAB. Springer; 2009.
[17] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2008.
[18] Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman & Hall; 1993.
[19] Lenhoff MW, Santner TJ, Otis JC, Peterson MGE, Williams BJ, Backus SI. Bootstrap prediction and confidence bands: a superior statistical method for analysis of gait data. Gait & Posture 1999;9:10–7.
[20] Monaghan K, Delahunt E, Caulfield B. Increasing the number of gait trial recordings maximises intra-rater reliability of the CODA motion analysis system. Gait & Posture 2007;25:303–15.
[21] Wood SN. Generalized additive models: an introduction with R. Chapman & Hall/CRC; 2006.
[22] Carstensen B. Comparing and predicting between several methods of measurement. Biostatistics 2004;5:399–413.