Journal Pre-proof
Determining the Validity, Reliability and Utility of the Forgotten Joint Score: A Systematic Review
Marco Adriani, MD, Michael-Alexander Malahias, MD, PhD, Alex Gu, BS, Cynthia A. Kahlenberg, MD, Michael P. Ast, MD, Peter K. Sculco, MD
PII: S0883-5403(19)31035-6
DOI: https://doi.org/10.1016/j.arth.2019.10.058
Reference: YARTH 57613
To appear in: The Journal of Arthroplasty
Received Date: 20 August 2019
Revised Date: 11 October 2019
Accepted Date: 28 October 2019
Please cite this article as: Adriani M, Malahias M-A, Gu A, Kahlenberg CA, Ast MP, Sculco PK, Determining the Validity, Reliability and Utility of the Forgotten Joint Score: A Systematic Review, The Journal of Arthroplasty (2019), doi: https://doi.org/10.1016/j.arth.2019.10.058. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Elsevier Inc. All rights reserved.
Running title: Forgotten Joint Score: A systematic review
Title: Determining the Validity, Reliability and Utility of the Forgotten Joint Score: A Systematic Review
ABSTRACT
Background: With improving patient outcomes after total hip and total knee arthroplasty (THA and TKA), patient-reported outcome measures (PROMs) have seen a parallel rise in average scores and ceiling effects. The Forgotten Joint Score (FJS) is a PROM that was proposed to reduce this observed ceiling effect. However, the validity and reliability of the FJS have not been well analyzed.
Methods: The US National Library of Medicine (PubMed/MEDLINE), EMBASE, and the Cochrane Database of Systematic Reviews were queried utilizing keywords pertinent to FJS, validity, reliability, measurement properties, and PROM. The methodological quality of measurement properties was evaluated using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist.
Results: In total, 13 articles met the inclusion criteria and were included in this analysis. Internal consistency was consistently high (Cronbach’s alpha >0.9). Test-retest reliability was good or excellent (intraclass correlation coefficient ≥0.8) in all studies. As for construct validity, all the articles reported a positive rating. Floor and ceiling effects were low overall (<15%). Conflicting results were found for responsiveness and measurement error.
Conclusion: There is strong evidence of good construct validity and test-retest reliability for the FJS, with moderate evidence of good internal consistency. Ceiling and floor effects were very low, indicating very promising discriminatory power between patients with a good outcome and patients with an excellent outcome. Therefore, especially in patients expected to achieve high levels of function after total joint replacement, we highly recommend the use of the FJS for the long-term assessment of their treatment.
Introduction
Total hip arthroplasty (THA) and total knee arthroplasty (TKA) have long been considered the gold standard treatment for end-stage degenerative joint disease. There are different ways to assess outcomes after total joint arthroplasty: radiographic evaluation, implant survivorship, clinician assessment, and patient-reported outcome measures (PROMs). While the first three options are based on objective surgeons’ ratings, PROMs provide a more patient-centered perspective on treatment outcome [1].
PROMs are validated questionnaires completed by patients to generate a score that can be tracked over time to observe change in a specific patient and compared with scores from other patients. The use of PROMs has become a “standard of care” at multiple institutions across the globe: PROM data from individuals are used to guide treatment decisions, while aggregate PROM data are used to keep health providers accountable and to measure their performance [2].
A wide range of PROMs has been developed in the past few decades, largely categorized into three groups: 1) generic health-related, 2) disease-specific, and 3) system-specific [3]. PROMs of the first group report the patient’s overall well-being and functionality and can be applied to multiple medical etiologies and across patients with different cultural and educational backgrounds [3]. While useful and informative, generic health PROMs are not used as primary end points in most orthopaedic research because, when used in isolation, they lack the responsiveness needed to assess the true impact of an orthopaedic intervention. For this reason, they should always be accompanied by a disease- or system-specific PROM that measures a condition’s effect on the patient.
System-specific questionnaires give clinicians the ability to assess specific changes in an outcome tied to elements of a body region. While they provide a more precise outcome measure for a single body system, they may lack the granularity needed to identify differences among patients with specific diagnoses [4].
Disease-specific measures focus on a subgroup of patients affected by a condition and can measure the effect of changes in that condition. They generally have better sensitivity than system-specific measures and provide clinicians with the information needed to assess changes in a patient’s specific disease [4].
Even though a number of studies [1, 5, 6] have analyzed and compared the measurement properties of PROMs after THA or TKA, there is no consensus on which score should be used. Researchers usually refer to the most commonly reported scores, including the Oxford Knee Score (OKS), the Oxford Hip Score (OHS), the Knee injury and Osteoarthritis Outcome Score (KOOS), the Hip disability and Osteoarthritis Outcome Score (HOOS), the Harris Hip Score (HHS), the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), the EQ-5D, the Short Form-36 (SF-36) and the Short Form-12 (SF-12) [5]. The majority of them were developed more than 20 years ago, when patients undergoing THA and TKA had different demographics and expectations of ‘reasonable’ post-operative function [7]. With the improvement of surgical techniques, implant materials, and prosthesis design, surgical outcomes have improved considerably over the last decades and PROMs have seen a parallel rise in their average scores [8]. As a result, some of the commonly used questionnaires (WOMAC, OKS and OHS) demonstrated a ceiling effect following TKA and THA; that is, many patients received the maximum score (or close to the maximum) on the scale [8]. In an attempt to improve the discriminatory ability of PROMs (specifically the ability to distinguish between good and excellent results) and subsequently reduce the ceiling effect, Behrend et al. proposed a disease-specific PROM known as the Forgotten Joint Score (FJS) [9].
The FJS is a questionnaire based on the assumption that the ability to forget the artificial joint in everyday life can be regarded as the ultimate goal following joint arthroplasty, resulting in the greatest possible patient satisfaction [9]. The authors of this score suggested that the new construct would be more responsive to higher-level functional outcomes after joint arthroplasty [10]. The FJS consists of 12 equally weighted questions with a 5-point Likert response format, each measuring the awareness of the artificial joint during daily activities (total score range 0-100 points). Recently, a variety of studies have been conducted to determine the validity, reliability and psychometric properties of the FJS in the THA or TKA population. In order to establish the efficacy of the FJS, we performed a systematic review focusing on the utility of the FJS among THA and TKA patients. We aimed to answer four questions: 1) what was the reliability of the FJS in patients treated with total joint arthroplasty (TJA), 2) what was the validity of the FJS in patients treated with TJA, 3) what was the responsiveness of the FJS in patients treated with TJA, and 4) what were the floor and ceiling effects of the FJS in patients treated with TJA?
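To make the scoring description concrete, the following Python sketch shows one way a 0-100 score could be derived from 12 equally weighted 5-point items. The text above states only the number of items, the response format and the score range. The 0-4 item coding, the reversal so that a higher score means less joint awareness, and the rule for missing items (`max_missing`) are assumptions made for illustration and should be checked against the original scoring instructions [9].

```python
from typing import Optional, Sequence

N_ITEMS = 12   # stated in the text: 12 equally weighted questions
MAX_ITEM = 4   # assumption: items coded 0 (never aware) to 4 (mostly aware)


def fjs_score(responses: Sequence[Optional[int]],
              max_missing: int = 4) -> Optional[float]:
    """Return a 0-100 score (higher = less joint awareness), or None.

    `responses` holds one entry per item; None marks a missing answer.
    `max_missing` is a hypothetical cut-off for too many missing items.
    """
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item responses")
    answered = [r for r in responses if r is not None]
    if any(not 0 <= r <= MAX_ITEM for r in answered):
        raise ValueError("item responses must lie between 0 and 4")
    if N_ITEMS - len(answered) > max_missing:
        return None  # too many missing items to compute a score
    # Equal weighting: average the answered items, rescale to 0-100,
    # then reverse so that 100 means the joint is fully "forgotten".
    mean_awareness = sum(answered) / len(answered)
    return 100.0 - (mean_awareness / MAX_ITEM) * 100.0
```

Under these assumptions, a patient who is never aware of the joint in any activity scores 100 and a patient who is mostly aware in every activity scores 0.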
Methods
Search strategy and Selection Criteria
This study was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [11]. The US National Library of Medicine (PubMed/MEDLINE), EMBASE, and the Cochrane Database of Systematic Reviews were queried for publications from January 1980 to April 2019 utilizing keywords pertinent to total hip or knee arthroplasty, validity and reliability, and the Forgotten Joint Score. The specific search terms are shown in Table 1.
The inclusion criteria were: 1) studies on human subjects of any age and gender; 2) studies that included a population of at least 15 patients who underwent primary or revision TKA/THA; 3) studies that measured reliability, validity or responsiveness of the Forgotten Joint Score relative to TKA/THA. The exclusion criteria were: 1) case reports; 2) review articles; 3) letters to the editor; 4) technical papers; 5) abstracts; 6) book chapters; 7) in vitro studies; 8) non-English language publications. For articles that met these criteria, the reference lists were screened for additional studies not captured using the initial search terms.
Two authors independently conducted the search. All authors compiled a list of articles not excluded after application of the inclusion and exclusion criteria. Discrepancies between the authors were resolved by discussion. During initial review of the data, the following information was collected for each study: title, author, year published, study design, number of patients, number of joints, gender, anatomic location, internal consistency, test-retest reliability, measurement error, construct validity, responsiveness, and ceiling and floor effects.
Measurement properties
The measurement properties of health status questionnaires can be divided into three main domains: validity, reliability and responsiveness [12].
Validity is the extent to which a questionnaire measures the construct it is supposed to measure and comprises the following measurement properties: content validity, criterion validity and construct validity. Content validity refers to the extent to which the domain of interest is comprehensively sampled by the items in the questionnaire [13] and is typically assessed during the developmental phase of the tool. Criterion validity is the extent to which scores on a tool are a proper reflection of a gold standard. Construct validity refers to the extent to which scores on a particular questionnaire relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured [13]. Good construct validity means that the questionnaire correlates well with tools measuring the same construct (convergent validity) while correlating poorly with tools measuring a different construct (divergent validity) [14].
Reliability is a domain that contains three important measurement properties: internal consistency, reproducibility and measurement error [12]. Internal consistency refers to the extent to which the items in a questionnaire are correlated and therefore ‘unidimensional’. Reproducibility refers to the stability of a questionnaire over time and is assessed by means of test-retest reliability. As proposed by Landis and Koch [15], we considered reliability good when the intraclass correlation coefficient (ICC) was >0.8 and excellent when it was >0.9. Lastly, measurement error refers to how close the scores on repeated measures are to one another. A small measurement error enhances the evaluative purpose of the questionnaire, as it allows clinically important changes to be distinguished from measurement error.
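As an illustration of the reliability statistics described above, the sketch below computes Cronbach's alpha from an item-by-respondent score matrix using the standard textbook formula, and applies the ICC thresholds used in this review (>0.8 good, >0.9 excellent). The function names and the "below good" label are ours and hypothetical; the included studies may have computed these statistics with other software and options.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    Standard formula: alpha = k/(k-1) * (1 - sum(item variances) / variance
    of the total score), with k the number of items.
    """
    n_items = len(scores[0])
    if n_items < 2 or any(len(row) != n_items for row in scores):
        raise ValueError("need at least 2 items and equal-length rows")
    item_variances = [pvariance([row[i] for row in scores])
                      for i in range(n_items)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def classify_icc(icc):
    """Apply the thresholds used in this review to a test-retest ICC."""
    if icc > 0.9:
        return "excellent"
    if icc > 0.8:
        return "good"
    return "below good"  # label not defined in the text; used here for completeness
```

With perfectly correlated items alpha equals 1, which also illustrates why very high values can indicate item redundancy rather than a better scale.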
Responsiveness has been defined as the ability of a questionnaire to detect clinically important changes over time, even if these changes are small [13]. By analogy with construct validity, it should be assessed by testing predefined hypotheses about expected correlations between changes in measures, or expected differences in changes between known groups [13].
Other important characteristics of a questionnaire are floor and ceiling effects, which refer, respectively, to the proportion of respondents who achieved the lowest or highest possible score. According to Terwee et al. [13], floor and ceiling effects are considered to be present if more than 15% of respondents achieved the lowest or highest possible score.
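The 15% criterion can be expressed in a few lines of Python. This is a minimal sketch: the function name, the dictionary keys and the default 0-100 score range are illustrative choices of ours, not part of the cited criteria.

```python
def floor_ceiling_effects(scores, min_score=0.0, max_score=100.0, threshold=0.15):
    """Fraction of respondents at the lowest/highest possible score.

    An effect is flagged as present when MORE than `threshold` (15%) of
    respondents achieve the lowest or highest possible score.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_fraction = sum(1 for s in scores if s <= min_score) / n
    ceiling_fraction = sum(1 for s in scores if s >= max_score) / n
    return {
        "floor_fraction": floor_fraction,
        "ceiling_fraction": ceiling_fraction,
        "floor_present": floor_fraction > threshold,
        "ceiling_present": ceiling_fraction > threshold,
    }
```

For example, in a hypothetical sample where 2 of 10 patients reach the maximum score, the ceiling fraction is 20% and a ceiling effect would be flagged; with 1 of 10 it would not.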
Assessing the Methodological quality
We evaluated the methodological quality of the studies as defined by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist [12]. The methodological quality of the following measurement properties was assessed: 1) internal consistency, 2) test-retest reliability, 3) measurement error, 4) construct validity, and 5) responsiveness. Criterion validity was not included due to the lack of an established gold standard PROM for patients undergoing THA or TKA. Floor and ceiling effects are not included in the COSMIN checklist and therefore we could not rate the papers’ quality on these characteristics. We used the updated scoring method developed in 2012 for the COSMIN checklist [13] to rate each paper’s quality. This tool contains 4 possible response options (i.e. excellent, good, fair, poor) for each item (i.e. individual question) of each measurement property. The final rating of each study for each property is given by the lowest rating among the items within that property (“worst score counts”). Two reviewers assessed the methodological quality of the articles separately and independently using this updated COSMIN scoring system. Disagreements between them were resolved by consensus. If consensus was not reached after discussion, a third party was consulted to resolve the disagreement.
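The “worst score counts” rule described above amounts to taking the minimum over an ordered rating scale. A minimal Python sketch follows; the function name and the list representation of item ratings are our own illustrative choices.

```python
# Ordered from worst to best, as in the 4-point COSMIN scoring system
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def cosmin_property_rating(item_ratings):
    """Return the lowest item rating within one measurement property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    unknown = [r for r in item_ratings if r not in RATING_ORDER]
    if unknown:
        raise ValueError(f"unknown ratings: {unknown}")
    # "Worst score counts": the property rating is the minimum on the scale
    return min(item_ratings, key=RATING_ORDER.index)
```

A study with item ratings of excellent, good and fair for a given property would therefore be rated fair for that property.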
Assessing the quality of Psychometric properties
We utilized the quality criteria established by Terwee et al. [13] to assess the psychometric evidence/properties of the Forgotten Joint Score (FJS) in the included articles. Table 1 describes the definition of each assessed psychometric property, as well as the quality of each property. The quality of each property was rated as positive (+), indeterminate (?), or negative (-). When no information was reported, a rating of zero (0) was given.
Synthesizing the level of evidence
To synthesize the level of evidence for the overall quality of each measurement property, we adopted the method of Schellingerhout et al. [16]. This method has been previously used in the literature [1, 17]. The overall synthesized score combines the consistency of the psychometric evidence with the methodological quality of the included studies and the levels of evidence proposed by the Cochrane Back Review Group [18]. With this system, the level of evidence can be described as unknown (only studies of poor methodological quality), conflicting (conflicting findings), limited (one study of fair methodological quality), moderate (consistent findings in multiple studies of fair methodological quality or in one study of good methodological quality) and strong (consistent findings in multiple studies of good methodological quality or in one study of excellent methodological quality).
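The synthesis rules listed above can be written as a small decision function. The sketch below is one possible reading of those rules. Two points are assumptions of ours rather than statements from the cited method: studies of poor quality are ignored when better studies exist, and mixed cases (for example one good and one fair study) are resolved by the highest category that applies.

```python
def level_of_evidence(study_qualities, findings_consistent=True):
    """Synthesize a level of evidence for one measurement property.

    `study_qualities` lists the COSMIN rating of each contributing study
    ("poor", "fair", "good" or "excellent").
    """
    allowed = {"poor", "fair", "good", "excellent"}
    if not study_qualities or any(q not in allowed for q in study_qualities):
        raise ValueError("ratings must be poor, fair, good or excellent")
    usable = [q for q in study_qualities if q != "poor"]
    if not usable:
        return "unknown"        # only studies of poor methodological quality
    if not findings_consistent:
        return "conflicting"    # conflicting findings
    n_fair = usable.count("fair")
    n_good = usable.count("good")
    n_excellent = usable.count("excellent")
    if n_excellent >= 1 or n_good >= 2:
        return "strong"
    if n_good == 1 or n_fair >= 2:
        return "moderate"
    return "limited"            # a single study of fair quality
```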
Results
Search Results
The literature search resulted in 123 abstracts (Figure 1). Following elimination of duplicate articles, predetermined inclusion and exclusion criteria were applied. Twenty-four articles were screened, with 13 articles meeting the inclusion criteria (Table 2).
Demographics
In total, FJS data for 2,217 joints in 2,212 patients (TKA: 61.6%; unicompartmental knee arthroplasty (UKA): 0.01%; THA: 38.3%) were included in this review. Further details regarding the baseline characteristics of these patients can be found in Table 3.
Methodological quality of the studies
All methodological scores are summarized in Table 4. Overall, 11 studies (92%) [27, 19, 3, 5, 6, 16, 17, 21, 32, 31, 30] received a “poor” or “fair” COSMIN rating for internal consistency, while one study [14] received a “good” COSMIN rating. Five out of 9 studies (56%) [16, 32, 31, 30, 19] that analyzed reliability received a “fair” COSMIN rating, whereas 4 papers (44%) [3, 6, 17, 27] received a “good” COSMIN rating. Three out of 5 studies (60%) [16, 32, 27] reporting data on measurement error received a “fair” COSMIN rating, while the remaining 2 articles [6, 3] received a “good” COSMIN rating. Five studies (50%) [5, 16, 21, 27, 31] received a “fair” or “poor” COSMIN rating for hypothesis testing, whereas 5 studies (50%) received a “good” COSMIN rating [3, 14, 32, 6, 17]. The content validity of the FJS was explored during its development with a “good” COSMIN rating, and no other articles were found that examined the content validity of the tool. Two out of four studies (50%) [3, 14] reporting on responsiveness received a “good” COSMIN rating, whereas two studies (50%) [30, 12] received only a “fair” COSMIN rating.
Statistical methods used to determine measurement properties
Ten of the studies (77%) [7, 9, 14, 19, 20, 21, 22, 23, 24, 29] used Cronbach’s alpha to assess internal consistency reliability. In addition, nine studies (69%) assessed test-retest reliability by calculating intraclass correlation coefficient (ICC) values [14, 19, 20, 21, 23, 24, 29, 31, 32]. Five papers (38%) reported data on measurement error [14, 19, 20, 23, 24], with four of those [19, 20, 23, 24] expressing measurement error as the standard error of measurement (SEM), and one study [14] using Bland and Altman plot analysis. The majority of the studies (77%) [7, 9, 14, 19, 20, 21, 23, 24, 31] in this review assessed validity by defining construct validity, formulating a priori hypotheses involving correlations or mean differences with other, previously validated PROM tools.
Internal consistency reliability
Among the studies that assessed internal consistency [7, 9, 14, 19, 20, 21, 23, 24, 29, 31], the mean Cronbach’s alpha was 0.95, ranging from 0.91 [14] to 0.98 [7]. Overall, the included studies demonstrated good internal consistency according to the criterion proposed by Nunnally and Bernstein [33]. Median internal consistency reliability statistics were ≥0.9 overall and by language, surgery, diagnosis and sample size (Table 5). Only one article [7] performed a factor analysis, reporting that the Forgotten Joint Score is unidimensional. Overall, the level of evidence was moderate (Table 4).
Test-retest reliability
Median test-retest reliability statistics were >0.8 overall and by language, surgery, diagnosis and sample size (Table 5). All of the articles that performed test-retest evaluation concurred that the FJS is reliable (Table 4), with a moderate overall level of evidence.
Measurement error
Among the papers that reported data on measurement error, four (80%) [14, 20, 23, 24] had an indeterminate evidence rating as they did not report a Minimal Important Change (MIC) value [13], whereas one (20%) [19] had a negative rating, reporting a MIC smaller than the Smallest Detectable Change (SDC) (Table 4). However, the overall level of evidence was only limited.
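For readers less familiar with these quantities, the sketch below shows how the SEM and the SDC are commonly derived and how the rating described above follows from comparing the SDC with the MIC. The formulas SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM are the conventional ones from the measurement literature and are not stated in this review, so they should be read as an assumption about how the included studies computed them. The handling of the case SDC = MIC is likewise our choice.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM from the score standard deviation and the test-retest ICC."""
    if sd < 0 or not 0 <= icc <= 1:
        raise ValueError("sd must be >= 0 and icc must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sem, z=1.96):
    """SDC at the individual level for a 95% confidence level (z = 1.96)."""
    return z * math.sqrt(2.0) * sem


def rate_measurement_error(mic, sdc):
    """Return '+', '?' or '-' following the rating logic described above."""
    if mic is None:
        return "?"   # no MIC reported: indeterminate
    # Positive only if a clinically important change exceeds the SDC;
    # treating SDC == MIC as negative is an assumption.
    return "+" if sdc < mic else "-"
```

With a hypothetical standard deviation of 10 points and an ICC of 0.91, the SEM is about 3 points and the SDC about 8.3 points, so a MIC of 5 points would receive a negative rating.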
Construct validity
Almost all of the studies (90%) [7, 9, 14, 19, 20, 21, 23, 24, 31] that evaluated construct validity with the hypothesis testing method had a positive rating (Table 4), meaning that the correlation with an instrument measuring the same construct was >0.5 or at least 75% of the results were in accordance with the hypotheses, and the correlation with related constructs was higher than with unrelated constructs [13]. The overall level of evidence according to the rating system of Schellingerhout et al. [16] was strong.
Responsiveness
Among the four studies that looked at the responsiveness of the FJS [7, 19, 30, 32], two [7, 30] (50%) had a positive rating, one [19] (25%) had a negative rating and one [32] (25%) had an indeterminate rating (Table 4). Overall, the evidence regarding the responsiveness of the FJS was conflicting.
Floor and ceiling effects
All of the articles reported data on ceiling effects of the questionnaire and 11 papers (85%) analyzed floor effects. The large majority of articles (85%) reported ceiling effects of less than 15% at 1-year follow-up, with a mean value of 8.9%. Regarding floor effects, all of the articles that analyzed them (100%) found a floor effect of less than 15% at 1-year follow-up, with a mean value of 1.8%. In addition, six articles [9, 21, 22, 23, 24, 29] compared ceiling effects between the FJS and other frequently used PROMs, such as the WOMAC and the Oxford Knee or Hip Score, with five of them [9, 22, 23, 24, 29] (83%) reporting significantly lower ceiling effects with the use of the FJS.
Discussion
This review summarized 13 studies that analyzed the utility of the FJS. The FJS is a relatively new PROM used to assess patients’ outcomes after TJA. The main finding of this review was that the FJS showed excellent psychometric properties in terms of reliability and validity. In addition, the FJS showed low ceiling and floor effects, which makes it useful for evaluating patient groups that traditionally perform very well.
Internal Consistency
When assessing internal consistency, the FJS received a score of excellent (Cronbach’s alpha=0.95). However, we should consider that high internal consistency scores do not necessarily mean that the scale is unidimensional [34]. In addition, there is evidence [34] that when the value of alpha is too high (over 0.95 or so) it may reflect unnecessary duplication of content across items and point more to redundancy than to homogeneity. When evaluating internal consistency, most studies [9, 14, 19, 20, 21, 22, 23, 24, 32] did not use factor analysis or item response theory to verify the unidimensionality of the tool, thus compromising the methodology. Only Hamilton et al. took this into consideration, performing a factor analysis showing that the FJS is unidimensional [7].
Reliability
This systematic review showed excellent test-retest reliability across the included studies (ICC=0.91). High test-retest reliability reflects the stability of a questionnaire over time and is important for discriminative purposes, such as distinguishing patients with less or more disease. When evaluating measurement error, the majority of the papers (75%) did not report data on Minimal Important Change (MIC) or Smallest Detectable Change (SDC). With only one paper [19] reporting data in a proper way according to the COSMIN checklist, additional studies are required to determine the measurement error of the FJS.
Validity
Validity was assessed by examining content and construct validity. To achieve proper content validity, authors have to provide a clear description of the measurement aim of the questionnaire, the target population, the concepts that the questionnaire is intended to measure and the item selection. Behrend et al. adequately described these aspects and therefore the Forgotten Joint Score has good content validity. Concerning construct validity, overall the primary hypotheses of all studies were confirmed, thus showing good construct validity. Furthermore, the majority of papers received a good COSMIN rating and therefore we feel that there is strong evidence of adequate construct validity of the tool.
Responsiveness
Different results were found when analyzing responsiveness. In our systematic review, only two papers [7, 19] evaluated responsiveness with a good COSMIN methodology, and they reported conflicting results. Based on this finding, we suggest that further studies are needed in order to reach safe conclusions regarding the responsiveness of the FJS.
Discriminatory Power
When developing the FJS, Behrend et al. believed that their tool would be able to distinguish patients with good outcomes from patients with excellent outcomes, thus resulting in better discriminatory power compared with other PROM tools [9]. To achieve this goal, the FJS should have significantly lower ceiling effects than other PROMs. This review showed low ceiling effects (<15%) in 11 out of the 13 studies at 1-year follow-up. These findings suggest that the FJS might have improved discriminatory power compared to other PROMs and might detect small differences between patients.
A particular limitation of the FJS was related to question number 12 (“are you aware of your artificial knee when doing your favorite sport?”). We found that this question had a substantial percentage of missing responses (>10%) in all of the articles that reported this percentage [7, 14, 20, 21, 24, 32]. A possible explanation could be that this question might not be applicable to less active individuals, such as elderly patients [24, 32]. A shorter, more patient-adapted or age-adapted FJS might therefore be more appropriate.
Another potential limitation was that the COSMIN checklist used in this review is a novel quality-assessment tool with relatively untested inter-rater reliability. Finally, the follow-up period varied widely among the studies, with five of them not reporting it (Table 3).
Future directions
The Forgotten Joint Score introduced a new outcome parameter: the patient’s awareness of their knee or hip joint during activities of daily living. Although the FJS was originally introduced for patients undergoing THA or TKA [9], we feel that the assessment of joint awareness in daily activities might also be useful in other types of orthopaedic operations. For example, since the FJS has very promising discriminatory power between patients with a good outcome and patients with an excellent outcome, it could be particularly useful in the postoperative assessment of patients with sports injuries who are treated with arthroscopic procedures [35].
Conclusions
This review showed that there is strong evidence of good construct validity and test-retest reliability for the FJS, with only moderate evidence of good internal consistency. Ceiling and floor effects were very low, indicating very promising discriminatory power between patients with a good outcome and patients with an excellent outcome. Therefore, especially in patients expected to achieve high levels of function after total joint replacement, we highly recommend the use of the FJS for the long-term assessment of their treatment.
References
1. Gagnier JJ, Mullins M, Huang H, Marinac-Dabic D, Ghambaryan A, Eloff B, et al. A Systematic Review of Measurement Properties of Patient-Reported Outcome Measures Used in Patients Undergoing Total Knee Arthroplasty. J Arthroplasty 2017 May;32(5):1688-1697.
2. Vega JF, Spindler KP. Types of Scoring Instruments Available. In: Musahl V et al., editors. Basic Methods Handbook for Clinical Orthopaedic Research. Berlin, Heidelberg: Springer; 2019.
3. Irrgang JJ, Lubowitz JH. Measuring arthroscopic outcome. Arthroscopy 2008;24(6):718-22.
4. MOTION Group. Patient-Reported Outcomes in Orthopaedics. J Bone Joint Surg Am 2018 Mar 7;100(5):436-442.
5. Alviar MJ, Olver J, Brand C, Tropea J, Hale T, Pirpiris M, et al. Do patient-reported outcome measures in hip and knee arthroplasty rehabilitation have robust measurement attributes? A systematic review. J Rehabil Med 2011 Jun;43(7):572-83.
6. Ramkumar PN, Harris JD, Noble PC. Patient-reported outcome measures after total knee arthroplasty: a systematic review. Bone Joint Res 2015 Jul;4(7):120-7.
7. Hamilton DF, Loth FL, Giesinger JM, Giesinger K, MacDonald DJ, Patton JT, et al. Validation of the English language Forgotten Joint Score-12 as an outcome measure for total hip and knee arthroplasty in a British population. Bone Joint J 2017 Feb;99-B(2):218-24.
8. Marx RG, Jones EC, Atwan NC, Closkey RF, Salvati EA, Sculco TP. Measuring improvement following total hip and knee arthroplasty using patient-based measures of outcome. J Bone Joint Surg Am 2005 Sep;87(9):1999-2005.
9. Behrend H, Giesinger K, Giesinger JM, Kuster MS. The "forgotten joint" as the ultimate goal
385
in joint arthroplasty: validation of a new patient-reported outcome measure. J Arthroplasty 2012
386
Mar;27(3):430,436.e1.
387
10. Giesinger K, Hamilton DF, Jost B, Holzner B, Giesinger JM. Comparative responsiveness of
388
outcome measures for total knee arthroplasty. Osteoarthritis Cartilage 2014 Feb;22(2):184-9.
389
11. Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group (2009). Preferred
390
Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. Open
391
Med 2009;3(3):123-130
392
12. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN
393
checklist for assessing the methodological quality of studies on measurement properties of health
394
status measurement instruments: an international Delphi study. Qual Life Res 2010
395
May;19(4):539-49.
396
13. Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the
397
methodological quality in systematic reviews of studies on measurement properties: a scoring
398
system for the COSMIN checklist. Qual Life Res 2012 May;21(4):651-7.
399
14. Cao S, Liu N, Han W, Zi Y, Peng F, Li L, et al. Simplified Chinese version of the Forgotten Joint Score (FJS) for patients who underwent joint arthroplasty: cross-cultural adaptation and validation. J Orthop Surg Res 2017 Jan 14;12(1):6,016-0508-5.
15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33(1):159-74.
16. Schellingerhout JM, Heymans MW, Verhagen AP, de Vet HC, Koes BW, Terwee CB. Measurement properties of translated versions of neck-specific questionnaires: a systematic review. BMC Med Res Methodol 2011 Jun;6(11):87.
17. Huang H, Grant JA, Miller BS, Mirza FM, Gagnier JJ. A Systematic Review of the Psychometric Properties of Patient-Reported Outcome Instruments for Use in Patients With Rotator Cuff Disease. Am J Sports Med 2015 Oct;43(10):2572-82.
18. van Tulder M, Furlan A, Bombardier C, et al. Updated method guidelines for systematic reviews in the Cochrane collaboration back review group. Spine 2003 Jun;28(12):1290-1299.
19. Baumann F, Ernstberger T, Loibl M, Zeman F, Nerlich M, Tibesku C. Validation of the German Forgotten Joint Score (G-FJS) according to the COSMIN checklist: does a reduction in joint awareness indicate clinical improvement after arthroplasty of the knee? Arch Orthop Trauma Surg 2016 Feb;136(2):257-64.
20. Kinikli GI, Güney Deniz H, Karahan S, Yüksel E, Kalkan S, Dönder Kara D, et al. Validity and reliability of Turkish version of the Forgotten Joint Score-12. Journal of Exercise Therapy and Rehabilitation 2017;4(1):18-25.
21. Klouche S, Giesinger JM, Sariali EH. Translation, cross-cultural adaption and validation of the French version of the Forgotten Joint Score in total hip arthroplasty. Orthop Traumatol Surg Res 2018 Sep;104(5):657-61.
22. Matsumoto M, Baba T, Homma Y, Kobayashi H, Ochi H, Yuasa T, et al. Validation study of the Forgotten Joint Score-12 as a universal patient-reported outcome measure. Eur J Orthop Surg Traumatol 2015 Oct;25(7):1141-5.
23. Thomsen MG, Latifi R, Kallemose T, Barfod KW, Husted H, Troelsen A. Good validity and reliability of the forgotten joint score in evaluating the outcome of total knee arthroplasty. Acta Orthop 2016 Jun;87(3):280-5.
24. Shadid MB, Vinken NS, Marting LN, Wolterbeek N. The Dutch version of the Forgotten Joint Score: test-retesting reliability and validation. Acta Orthop Belg 2016 Mar;82(1):112-8.
25. Beaton DE, Bombardier C, Guillemin F, Ferraz MB. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine (Phila Pa 1976) 2000 Dec 15;25(24):3186-91.
26. Guillemin F, Bombardier C, Beaton D. Cross-cultural adaptation of health-related quality of life measures: literature review and proposed guidelines. J Clin Epidemiol 1993 Dec;46(12):1417-32.
27. Ware JE Jr, Keller SD, Gandek B, Brazier JE, Sullivan M. Evaluating translations of health status questionnaires. Methods from the IQOLA project. International Quality of Life Assessment. Int J Technol Assess Health Care 1995 Summer;11(3):525-51.
28. Ganestam A, Barfod K, Klit J, Troelsen A. Validity and reliability of the Achilles tendon total rupture score. J Foot Ankle Surg 2013 Nov-Dec;52(6):736-9.
29. Larsson A, Rolfson O, Karrholm J. Evaluation of Forgotten Joint Score in total hip arthroplasty with Oxford Hip Score as reference standard. Acta Orthop 2019 Apr 1:1-8.
30. Giesinger JM, Loth FL, Howie C, Giesinger K, Hamilton DF. Validation of the English Version of the Forgotten Joint Score - 12 in Patients Undergoing Total Knee or Hip Arthroplasty. Value Health 2015 Nov;18(7):A652-3.
31. Thompson SM, Salmon LJ, Webb JM, Pinczewski LA, Roe JP. Construct Validity and Test Re-Test Reliability of the Forgotten Joint Score. J Arthroplasty 2015 Nov;30(11):1902-5.
32. Thienpont E, Vanden Berghe A, Schwab PE, Forthomme JP, Cornu O. Joint awareness in osteoarthritis of the hip and knee evaluated with the 'Forgotten Joint' Score before and after joint replacement. Knee Surg Sports Traumatol Arthrosc 2016 Oct;24(10):3346-51.
33. Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.
34. Tavakol M, Dennick R. Making sense of Cronbach's alpha. Int J Med Educ. 2011 Jun;27(2):53-55.
35. Behrend H, Zdravkovic V, Giesinger JM, Giesinger K. Joint awareness after ACL reconstruction: patient-reported outcomes measured with the Forgotten Joint Score-12. Knee Surg Sports Traumatol Arthrosc 2017 May;25(5):1454-1460.
Table 1. Quality criteria for measurement properties

| Property | Description | Rating | Quality criteria (a, b) |
| --- | --- | --- | --- |
| 1. Content validity | The extent to which the domain of interest is comprehensively sampled by the items in the questionnaire | + | A clear description is provided of the measurement aim, the target population, the concepts that are being measured, and the item selection AND target population and (investigators OR experts) were involved in item selection |
| | | ? | A clear description of above-mentioned aspects is lacking OR only target population involved OR doubtful design or method |
| | | - | |
| 2. Internal consistency | The extent to which items in a (sub)scale are intercorrelated, thus measuring the same construct | + | Factor analyses performed on adequate sample size (7 * # items and >100) AND Cronbach's alpha(s) calculated per dimension AND Cronbach's alpha(s) between 0.70 and 0.95 |
| | | ? | No factor analysis OR doubtful design or method |
| | | - | Cronbach's alpha(s) <0.70 or >0.95, despite adequate design and method |
| 3. Reliability | The extent to which patients can be distinguished from each other, despite measurement error (relative measurement error) | + | ICC/weighted Kappa ≥0.70 OR Pearson's r ≥0.80 |
| | | ? | Neither ICC/weighted Kappa nor Pearson's r determined |
| | | - | ICC/weighted Kappa <0.70 OR Pearson's r <0.80 |
| 4. Responsiveness | The ability of a questionnaire to detect clinically important changes over time | + | SDC or SDC < MIC OR MIC outside the LOA OR RR >1.96 OR AUC >0.70 |
| | | ? | Doubtful design or method |
| | | - | SDC or SDC > MIC OR MIC equals or inside LOA OR RR <1.96 OR AUC <0.70, despite adequate design and methods |
| 5. Measurement error | The extent to which the scores on repeated measures are close to each other (absolute measurement error) | + | MIC > SDC OR MIC outside the LOA OR convincing arguments that agreement is acceptable |
| | | ? | Doubtful design or method OR (MIC not defined AND no convincing arguments that agreement is acceptable) |
| | | - | MIC ≤ SDC OR MIC equals or inside LOA, despite adequate design and method |
| 6. Hypothesis testing | The extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured | + | Correlation with an instrument measuring the same construct ≥0.50 OR at least 75% of the results are in accordance with the hypotheses, AND correlation with related constructs is higher than with unrelated constructs |
| | | ? | Solely correlations determined with unrelated constructs |
| | | - | Correlation with an instrument measuring the same construct <0.50 OR <75% of the results are in accordance with the hypotheses, OR correlation with related constructs is lower than with unrelated constructs |

MIC = minimal important change; SDC = smallest detectable change; LOA = limits of agreement; ICC = intraclass correlation coefficient; SD = standard deviation.
a: + = positive rating; ? = indeterminate rating; - = negative rating; 0 = no information available.
b: Doubtful design or method = lack of a clear description of the design or methods of the study, sample size smaller than 50 subjects, or any important methodological weakness in the design or execution of the study.
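
The internal consistency and reliability rows of Table 1 can be read as simple decision rules. The following Python sketch illustrates that reading; it is not the authors' procedure. The function names and parameters are hypothetical, and the reliability cut-offs are assumed to mean "at least 0.70" for ICC/weighted Kappa and "at least 0.80" for Pearson's r.

```python
from typing import Optional


def rate_internal_consistency(alpha: float,
                              factor_analysis_adequate: bool,
                              doubtful_design: bool = False) -> str:
    """Rate internal consistency as '+', '?' or '-' (assumed reading of Table 1)."""
    # No adequate factor analysis, or a doubtful design, gives an indeterminate rating.
    if doubtful_design or not factor_analysis_adequate:
        return "?"
    # Positive only when Cronbach's alpha lies between 0.70 and 0.95 inclusive.
    if 0.70 <= alpha <= 0.95:
        return "+"
    return "-"


def rate_reliability(icc: Optional[float] = None,
                     pearson_r: Optional[float] = None) -> str:
    """Rate test-retest reliability as '+', '?' or '-' (assumed reading of Table 1)."""
    # Indeterminate when neither coefficient was reported.
    if icc is None and pearson_r is None:
        return "?"
    icc_ok = icc is not None and icc >= 0.70
    r_ok = pearson_r is not None and pearson_r >= 0.80
    return "+" if (icc_ok or r_ok) else "-"
```

Under these rules an alpha of 0.97 obtained with an adequate design would be rated negative, because values above 0.95 fall outside the accepted band.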
Table 5. Internal consistency reliability and test-retest reliability per study

| Subgroup | Internal consistency: number of studies | Mean Cronbach's alpha | Range of Cronbach's alpha | Test-retest reliability: number of studies | Mean intraclass correlation coefficient (ICC) | Range of ICC |
| --- | --- | --- | --- | --- | --- | --- |
| All | 10 | 0.95 | (0.91-0.98) | 9 | 0.91 | (0.8-0.97) |
| English | 3 | 0.97 | (0.95-0.98) | 3 | 0.92 | (0.87-0.97) |
| Non-English | 7 | 0.95 | (0.91-0.97) | 6 | 0.91 | (0.8-0.97) |
| THA only | 4 | 0.97 | (0.96-0.98) | 2 | 0.9 | (0.86-0.93) |
| TKA only | 4 | 0.95 | (0.91-0.97) | 5 | 0.9 | (0.8-0.97) |
| Smaller sample size (<100) | 2 | 0.97 | (0.96-0.97) | 2 | 0.9 | (0.86-0.93) |
| Bigger sample size (>100) | 8 | 0.95 | (0.91-0.97) | 7 | 0.92 | (0.8-0.97) |
Table 2. Levels of evidence for the overall quality of the measurement property

| Level | Rating | Criteria |
| --- | --- | --- |
| Strong | +++ or --- | Consistent findings in multiple studies of good methodological quality OR in one study of excellent methodological quality |
| Moderate | ++ or -- | Consistent findings in multiple studies of fair methodological quality OR in one study of good methodological quality |
| Limited | + or - | One study of fair methodological quality |
| Conflicting | ± | Conflicting findings |
| Unknown | ? | Only studies of poor methodological quality |
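
The levels in Table 2 can be sketched as a small synthesis rule. The Python function below is an illustrative sketch only, with several assumptions that the table does not settle. Studies of poor quality or with an indeterminate rating are ignored. Mixed good and fair studies are resolved by counting the good ones. The name `level_of_evidence` and its input format are hypothetical. The sketch is not guaranteed to reproduce the overall ratings the authors assigned by judgment.

```python
def level_of_evidence(studies):
    """Combine per-study results into an overall rating (assumed reading of Table 2).

    studies: list of (quality, rating) pairs, where quality is one of
    'excellent', 'good', 'fair', 'poor' and rating is '+', '-', '?' or '0'.
    """
    # Assumption: only non-poor studies with a definite rating carry evidence.
    usable = [(q, r) for q, r in studies if q != "poor" and r in ("+", "-")]
    if not usable:
        return "?"  # unknown: only poor-quality or indeterminate studies
    signs = {r for _, r in usable}
    if len(signs) > 1:
        return "±"  # conflicting findings
    sign = signs.pop()
    qualities = [q for q, _ in usable]
    n_good = sum(q in ("good", "excellent") for q in qualities)
    if "excellent" in qualities or n_good >= 2:
        return sign * 3  # strong
    if n_good == 1 or len(qualities) >= 2:
        return sign * 2  # moderate
    return sign  # limited: a single fair-quality study
```

For example, `level_of_evidence([("good", "+"), ("fair", "-")])` returns `"±"`, since the two usable studies disagree.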
Table 3. Characteristics of the included studies

| Study | Language | Oxford level of evidence | Number of patients | Female (%) | Number of TKA | Number of UKA | Number of THA | Age | Time since surgery | Follow-up |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baumann et al. | German | I | 105 | 54.3 | 86 | 19 | - | 65.2±9.3 | 7.2±12.5 | 1 year |
| Behrend et al. | English | II | 243 | 49.4 | 86 | - | 157 | 70.6±11.3 | 31.1±12.3 | - |
| Cao et al. | Chinese | II | 150 | 78.7 | 150 | - | - | 68.1±7.4 | 28±9.7 | 1 year |
| Giesinger et al. | English | II | 98 | 49.0 | 98 | - | - | 68.1±8.6 | - | 2 years |
| Hamilton et al. | English | II | 436 | 56.9 | 231 | - | 205 | 69.9 (THA), 67.6 (TKA) | 0, 6 and 12 months | 1 year |
| Kinikli et al. | Turkish | II | 132 | 77.3 | 90 | - | 42 | 63.9±12.7 | 30.8±16 | - |
| Klouche et al. | French | II | 58 | 37.9 | - | - | 63 | 62.7±15.2 | At least 1 year | 1-6 years |
| Larsson et al. | English | II | 111 | 52.0 | - | - | 111 | 69 | At least 1 year | At least 1 year |
| Matsumoto et al. | Japanese | II | 108 | 81.5 | - | - | 108 | 65.7±11.6 | 29.5±38.7 | - |
| Shadid et al. | Dutch | II | 159 | 64.0 | 84 | - | 75 | 68.6 | 15 months | 2 years |
| Thomsen et al. | Danish | III | 315 | 59.4 | 315 | - | - | 65 | - | 1-4 years |
| Thompson et al. | English | III | 147 | 46.3 | 147 | - | - | 67 | 39 (range 18-72) | - |
| Thienpont et al. | English | II | 150 | 56 | 75 | - | 75 | 66±17 (THA) and 69±10 (TKA) | - | - |
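Table 4. Methodological quality of each study per measurement property

| Study | Internal consistency: COSMIN score | Internal consistency: evidence rating | Reliability: COSMIN score | Reliability: evidence rating | Measurement error: COSMIN score | Measurement error: evidence rating | Hypothesis testing: COSMIN score | Hypothesis testing: evidence rating | Responsiveness: COSMIN score | Responsiveness: evidence rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baumann et al. | Poor | ? | Good | + | Fair | - | Good | + | Good | - |
| Behrend et al. | Poor | ? | 0 | 0 | 0 | 0 | Fair | + | 0 | 0 |
| Cao et al. | Poor | ? | Good | + | Fair | ? | Good | + | 0 | 0 |
| Giesinger et al. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fair | + |
| Hamilton et al. | Good | + | 0 | 0 | 0 | 0 | Good | + | Good | + |
| Kinikli et al. | Poor | ? | Fair | + | Fair | ? | Poor | + | 0 | 0 |
| Klouche et al. | Poor | ? | Good | + | 0 | 0 | Good | + | 0 | 0 |
| Larsson et al. | Poor | ? | Fair | + | 0 | 0 | 0 | 0 | 0 | 0 |
| Matsumoto et al. | Poor | ? | 0 | 0 | 0 | 0 | Poor | ? | 0 | 0 |
| Shadid et al. | Poor | ? | Good | + | Fair | ? | Fair | + | 0 | 0 |
| Thomsen et al. | Fair | + | Fair | + | Fair | ? | Good | + | 0 | 0 |
| Thompson et al. | 0 | 0 | Fair | + | 0 | 0 | Fair | + | 0 | 0 |
| Thienpont et al. | 0 | 0 | Fair | + | 0 | 0 | 0 | 0 | 0 | 0 |

| OVERALL LEVEL OF EVIDENCE | Internal consistency | Reliability | Measurement error | Hypothesis testing | Responsiveness |
| --- | --- | --- | --- | --- | --- |
| | ++ | ++ | + | +++ | ± |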
Table 6. Cross-cultural and construct validity per study

| Study | Cross-cultural validity | Construct validity |
| --- | --- | --- |
| Baumann et al. | German | Pearson correlation coefficient. Moderate correlation of FJS with OKS (r=0.37) and EQ-5D (r=0.56); poor correlation between FJS and TAS (r=0.29); all hypotheses confirmed |
| Behrend et al. | Original | Pearson correlation coefficient. High correlation with the WOMAC scales (r=-0.79 total, r=-0.75 pain, r=-0.69 stiffness, r=-0.78 function) |
| Cao et al. | Chinese | Pearson correlation coefficient. Good correlation with the symptoms (r=0.67) and pain (r=0.6) domains of KOOS and the social functioning (r=0.66) domain of SF-36; moderate correlation with the function in daily living (r=0.53) and function in sport and recreation (r=0.4) domains of KOOS and the physical subscale of SF-36 (r=0.51); weak correlation with the mental subscale of SF-36; all hypotheses confirmed |
| Hamilton et al. | Original | Pearson correlation coefficient. In TKA patients, high correlation with the OKS (r=0.85) and the SF-12 PCS (r=0.7); in THA patients, slightly lower (r=0.79 for OHS and r=0.67 for SF-12 PCS); fair correlation with SF-12 MCS in the TKA group (r=0.23) and in the THA group (r=0.36); hypotheses confirmed |
| Kinikli et al. | Turkish version | Pearson correlation coefficient. Moderate to high correlation with WOMAC, KOOS-PS, HOOS-PS, TKS and SF-12 PCS; no correlation with SF-12 MCS |
| Klouche et al. | French version | Pearson correlation coefficient. Correlated positively with modified HHS and negatively with OHS-12; hypotheses confirmed |
| Matsumoto et al. | Japanese | Pearson correlation coefficient. Moderately correlated with total WOMAC score (r=0.52) and its subscale scores for 'stiffness' (r=0.4) and 'function' (r=0.54); weakly correlated with the subscore for 'pain' (r=0.29); favorably correlated with total JHEQ score (r=0.69) and its subscale score for 'movement' (r=0.64), and moderately with the scores for 'pain' (r=0.55) and 'mental' (r=0.53) |
| Shadid et al. | Dutch | Spearman correlation coefficient. Significant positive correlation between FJS and WOMAC total score (and also subscales) (r=0.75); hypotheses confirmed |
| Thomsen et al. | Danish | Pearson correlation coefficient. Strong correlation between the FJS and OKS (r=0.81); hypotheses confirmed |
| Thompson et al. | Original | Spearman correlation coefficient. Positive correlation with total WOMAC score (r=0.7) and its subscales Pain (r=0.67), Stiffness (r=0.52) and Function (r=0.66); positive correlation with KOOS 'Quality of life' (r=0.63), 'Pain' (r=0.68) and 'ADL' (r=0.66), whereas weak correlation with KOOS 'symptoms' (r=0.33) |
Fig. 1: Systematic review flow diagram.
Identification: Records identified through database searching (n = 123); duplicates removed (n = 25)
Screening: Records after duplicates removed (n = 98); records excluded (n = 74)
Eligibility: Full-text articles assessed for eligibility (n = 24); full-text articles excluded, with reasons (n = 11)
Included: Studies included in qualitative synthesis (n = 13)
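
The counts in the flow diagram can be checked with simple subtraction. The short Python sketch below does that using the numbers reported in Fig. 1; the function name `prisma_flow` and the dictionary keys are hypothetical and chosen only for illustration.

```python
def prisma_flow(identified, duplicates, records_excluded, full_text_excluded):
    """Derive the count at each stage of the flow from the exclusions at each step."""
    screened = identified - duplicates                # records after duplicates removed
    full_text = screened - records_excluded           # full-text articles assessed
    included = full_text - full_text_excluded         # studies in qualitative synthesis
    return {"screened": screened, "full_text": full_text, "included": included}


# Counts as reported in Fig. 1.
flow = prisma_flow(identified=123, duplicates=25,
                   records_excluded=74, full_text_excluded=11)
```

With the reported exclusions, the derived stage counts are 98, 24 and 13, which agree with the values shown in the diagram.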