Determining the Validity, Reliability, and Utility of the Forgotten Joint Score: A Systematic Review

Journal Pre-proof

Marco Adriani, MD, Michael-Alexander Malahias, MD, PhD, Alex Gu, BS, Cynthia A. Kahlenberg, MD, Michael P. Ast, MD, Peter K. Sculco, MD

PII: S0883-5403(19)31035-6
DOI: https://doi.org/10.1016/j.arth.2019.10.058
Reference: YARTH 57613
To appear in: The Journal of Arthroplasty
Received Date: 20 August 2019
Revised Date: 11 October 2019
Accepted Date: 28 October 2019

Please cite this article as: Adriani M, Malahias M-A, Gu A, Kahlenberg CA, Ast MP, Sculco PK, Determining the Validity, Reliability and Utility of the Forgotten Joint Score: A Systematic Review, The Journal of Arthroplasty (2019), doi: https://doi.org/10.1016/j.arth.2019.10.058. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Elsevier Inc. All rights reserved.

Running title: Forgotten Joint Score: A systematic review

ABSTRACT

Background: With improving patient outcomes after total hip and total knee arthroplasty (THA and TKA), patient-reported outcome measures (PROMs) have seen a parallel rise in average scores and ceiling effects. The Forgotten Joint Score (FJS) is a PROM that was proposed to reduce this observed ceiling effect. However, the validity and reliability of the FJS have not been well analyzed.

Methods: The US National Library of Medicine (PubMed/MEDLINE), EMBASE, and the Cochrane Database of Systematic Reviews were queried utilizing keywords pertinent to FJS, validity, reliability, measurement properties, and PROM. The methodological quality of measurement properties was evaluated using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist.

Results: In total, 13 articles met the inclusion criteria and were included in this analysis. Internal consistency was consistently high (Cronbach alpha >0.9). Test-retest reliability was good or excellent (Intraclass Correlation Coefficient ≥0.8) in all studies. As for construct validity, all the articles reported a positive rating. Floor and ceiling effects overall were low (<15%). Conflicting results were found for responsiveness and measurement error.

Conclusion: There is strong evidence of good construct validity and test-retest reliability for the FJS, with moderate evidence of good internal consistency. Ceiling and floor effects were very low, showing very promising discriminatory power between patients with a good outcome and patients with an excellent outcome. Therefore, especially in patients expected to achieve high levels of function after total joint replacement, we highly recommend the use of the FJS for the long-term assessment of their treatment.

Introduction

Total hip (THA) and total knee arthroplasty (TKA) have long been considered the gold standard treatment for end-stage degenerative joint disease. There are different ways to assess outcomes after total joint arthroplasty: radiographic evaluation, implant survivorship, clinician assessment, and patient-reported outcome measures (PROMs). While the first three options are based on objective surgeons’ ratings, PROMs provide a more patient-centered perception of treatment outcome [1].

PROMs are validated questionnaires completed by patients to generate a score that can be tracked over time to observe change in a specific patient and compared with scores from other patients. The use of PROMs has become a “standard of care” at multiple institutions across the globe: PROM data from individuals are used to guide treatment decisions, while aggregate PROM data are used to keep health providers accountable and to measure their performance [2].

A wide range of PROMs have been developed in the past few decades, largely categorized into three groups: 1) generic health-related, 2) disease-specific, and 3) system-specific [3]. PROMs of the first group report the patient’s overall well-being and functionality, and they can be applied to multiple medical etiologies and across patients with different cultural and educational backgrounds [3]. While useful and informative, generic health PROMs are not used as primary end points in most orthopaedic research because, when used in isolation, they lack the responsiveness needed to assess the true impact of an orthopaedic intervention. For this reason, they should always be accompanied by a disease- or system-specific PROM that measures a condition’s effect on the patient.

System-specific questionnaires can provide clinicians the ability to assess specific changes in an outcome tied to elements of a body region. While they provide a more precise outcome measure on a single body system, they may lack the granularity needed to identify differences among patients with specific diagnoses [4].

Disease-specific measurements focus on a subgroup of patients affected by a condition and can measure the effect of changes in it. They generally have better sensitivity than system-specific measures and provide clinicians with the information needed to assess changes in a patient’s specific disease [4].

Even though a number of studies [1, 5, 6] have analyzed and compared the measurement properties of PROMs after THA or TKA, there is no consensus on which score should be used. Researchers usually refer to the most commonly reported scores, including the OKS, the OHS, the Knee injury and Osteoarthritis Outcome Score (KOOS), the Hip disability and Osteoarthritis Outcome Score (HOOS), the HHS, the WOMAC score, the EQ-5D, the SF-36 and the Short Form-12 (SF-12) [5]. The majority of them were developed more than 20 years ago, when patients undergoing THA and TKA had different demographics and expectations of ‘reasonable’ post-operative function [7]. With the improvement of surgical techniques, implant materials, and prosthesis design, surgical outcomes have improved considerably over the last decades and PROMs have seen a parallel rise in their average scores [8]. As a result, some of the commonly used questionnaires (WOMAC, OKS and OHS) demonstrated a ceiling effect following TKA and THA; that is, many patients received the maximum score (or close to the maximum) on the scale [8]. In an attempt to improve the discriminatory ability of PROMs (specifically the ability to distinguish between good and excellent results) and subsequently reduce the ceiling effect, Behrend et al. proposed a disease-specific PROM known as the Forgotten Joint Score (FJS) [9].

The FJS is a questionnaire based on the assumption that the ability to forget the artificial joint in everyday life can be regarded as the ultimate goal following joint arthroplasty, resulting in the greatest possible patient satisfaction [9]. The authors of this score suggested that the new construct would be more responsive to higher-level functional outcomes after joint arthroplasty [10]. The FJS consists of 12 equally weighted questions with a 5-point Likert response format, each measuring the awareness of the artificial joint in several daily activities (raw score range 0-100 points). Recently, a variety of studies have been conducted to determine the validity, reliability and psychometric properties of the FJS in the THA or TKA population. In order to establish the efficacy of the FJS, we performed a systematic review focusing on its utility among THA and TKA patients. We aimed to answer four questions about the FJS in patients treated with total joint arthroplasty (TJA): 1) what was its reliability, 2) what was its validity, 3) what was its responsiveness, and 4) what were its floor and ceiling effects?

Methods

Search strategy and Selection Criteria

This study was done in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [11]. The US National Library of Medicine (PubMed/MEDLINE), EMBASE, and the Cochrane Database of Systematic Reviews were queried for publications from January 1980 to April 2019 utilizing keywords pertinent to total hip or knee arthroplasty, validity and reliability, and forgotten joint score. The specific search terms are shown in Table 1.

The inclusion criteria were: 1) studies on human subjects of any age and gender; 2) studies that included a population of at least 15 patients who underwent primary or revision TKA/THA; 3) studies that measured reliability, validity or responsiveness of the forgotten joint score relative to TKA/THA. The exclusion criteria were: 1) case reports; 2) review articles; 3) letters to the editor; 4) technical papers; 5) abstracts; 6) book chapters; 7) in vitro studies; 8) non-English language publications. For articles that met these criteria, the reference lists were screened for additional studies not captured using the initial search terms.

Two authors independently conducted the search. All authors compiled a list of articles not excluded after application of the inclusion and exclusion criteria. Discrepancies between the authors were resolved by discussion. During initial review of the data, the following information was collected for each study: title, author, year published, study design, number of patients, number of joints, gender, anatomic location, internal consistency, test-retest reliability, measurement error, construct validity, responsiveness, and ceiling and floor effects.

Measurement properties

The measurement properties of health status questionnaires can be divided into three main domains: validity, reliability and responsiveness [12]. Validity is the extent to which a questionnaire measures the construct it is supposed to measure and contains the following measurement properties: content validity, criterion validity and construct validity. Content validity refers to the extent to which the domain of interest is comprehensively sampled by items in the questionnaire [13] and is typically assessed during the developmental phase of the tool. Criterion validity is the extent to which scores on a tool are a proper reflection of a gold standard. Construct validity refers to the extent to which scores on a particular questionnaire relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured [13]. Good construct validity means that the questionnaire correlates well with tools of the same construct (convergent validity) while correlating poorly with tools of a different construct (divergent validity) [14].

Reliability is a domain of a PROM instrument which contains three important measurement properties: internal consistency, reproducibility and measurement error [12]. Internal consistency refers to the extent to which the items in a questionnaire are correlated and therefore ‘unidimensional’. Reproducibility refers to the stability of a questionnaire over time and is assessed through test-retest reliability. As proposed by Landis and Koch [15], we considered reliability good when the ICC was >0.8 and excellent when it was >0.9. Lastly, measurement error refers to how close the scores on repeated measures are to one another. A small value enhances the evaluative purpose of the questionnaire as it distinguishes clinically important changes from measurement error.
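
As an illustration of the two reliability statistics named above, the following minimal Python sketch computes Cronbach's alpha with the standard formula (k/(k-1) times one minus the ratio of summed item variances to total-score variance) and labels an ICC using the >0.8 and >0.9 cut-offs stated in this section. The function names and the label "below good" are assumptions made for this example; the sketch is not taken from any of the included studies.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    # Sample variance of each item across respondents
    item_vars = [variance([r[i] for r in responses]) for i in range(n_items)]
    # Sample variance of the respondents' total scores
    total_var = variance([sum(r) for r in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def classify_icc(icc):
    """Label an ICC with the cut-offs used in this review (>0.8 good, >0.9 excellent)."""
    if icc > 0.9:
        return "excellent"
    if icc > 0.8:
        return "good"
    return "below good"  # hypothetical label, not defined in the review
```

With perfectly correlated items the total-score variance is as large as it can be relative to the item variances, so alpha reaches 1; values this high are what the Discussion later flags as a possible sign of item redundancy.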

Responsiveness has been defined as the ability of a questionnaire to detect clinically important changes over time, even if these changes are small [13]. In analogy with construct validity, it should be assessed by testing predefined hypotheses about expected correlations between changes in measures, or expected differences in changes between known groups [13]. Other important characteristics of a questionnaire are floor and ceiling effects, which refer, respectively, to the number of respondents who achieved the lowest or highest possible score. According to Terwee et al. [13], floor and ceiling effects are considered to be present if more than 15% of respondents achieved the lowest or highest possible score.
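
The 15% rule described above can be sketched in a few lines of Python. The function name, the default 0-100 score range, and the returned dictionary keys are assumptions chosen for this example; only the 15% threshold comes from the text.

```python
def floor_ceiling_effects(scores, min_score=0, max_score=100, threshold=0.15):
    """Share of respondents at the lowest/highest possible score, with the 15% rule."""
    n = len(scores)
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    return {
        "floor": floor,
        "ceiling": ceiling,
        # An effect is considered present when MORE than 15% hit the extreme score
        "floor_present": floor > threshold,
        "ceiling_present": ceiling > threshold,
    }
```

For a hypothetical sample of ten scores in which two patients score 100 and one scores 0, the ceiling share is 20% (effect present) and the floor share is 10% (effect absent).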

Assessing the Methodological quality

We evaluated the methodological quality of studies as defined by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist [12]. The methodological quality of the following measurement properties was assessed: 1) internal consistency, 2) test-retest reliability, 3) measurement error, 4) construct validity, 5) responsiveness. Criterion validity was not included due to the lack of an established gold standard PROM for patients undergoing THA or TKA. Floor and ceiling effects are not included in the COSMIN list and therefore we could not rate the papers’ quality on these characteristics. We used the updated scoring method developed in 2012 for the COSMIN checklist [13] to rate each paper’s quality. This tool contains 4 possible response options (i.e. excellent, good, fair, poor) for each item (i.e. individual question) for each measurement property. The final rating of each study for each property is given by the lowest rating among the items within that property (“worst score counts”). Two reviewers assessed the methodological quality of the articles separately and independently using this updated COSMIN scoring system. When there was disagreement between them, it was resolved by consensus. If consensus was not reached after discussion, a third party was consulted to resolve the disagreement.
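
The “worst score counts” rule can be expressed as taking the minimum over an ordered scale. The short Python sketch below assumes the four response options are ordered poor < fair < good < excellent; the names `RATING_ORDER` and `property_rating` are hypothetical and used only for illustration.

```python
# Response options ordered from worst to best
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def property_rating(item_ratings):
    """Return the lowest item rating, which becomes the rating for the property."""
    return min(item_ratings, key=RATING_ORDER.index)
```

Under this rule a study whose items are rated excellent, good and fair for a given property is rated fair for that property, so a single weak item caps the overall rating.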

Assessing the quality of Psychometric properties

We utilized the quality criteria established by Terwee et al. [13] to assess the psychometric evidence/properties of the Forgotten Joint Score (FJS) for the included articles. Table 1 describes the definition of each assessed psychometric property, as well as the quality of each property. The quality of each property was rated as positive (+), indeterminate (?), or negative (-). When no information was reported, a rating of zero (0) was given.

Synthesizing the level of evidence

In order to synthesize the level of evidence for the overall Quality of the Measurement Property, we adopted the method by Schellingerhout et al. [16]. This method has been previously used in the literature [1, 17]. The overall synthesized score combines the consistency of the psychometric evidence with the methodological quality of the included studies and the level of evidence proposed by the Cochrane Back Review Group [18]. With this system, the level of evidence can be described as unknown (only studies of poor methodological quality), conflicting (conflicting findings), limited (one study of fair methodological quality), moderate (consistent findings in multiple studies of fair methodological quality or in one study of good methodological quality) and strong (consistent findings in multiple studies of good methodological quality or in one study of excellent methodological quality).
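
One possible reading of these synthesis rules is sketched below in Python. Several choices are assumptions that the text does not spell out and are flagged in the comments: studies of poor quality are ignored when better studies exist, any disagreement in rating among the remaining studies counts as "conflicting", and one good study combined with fair studies is treated as "moderate". The function name and input format are hypothetical.

```python
def level_of_evidence(studies):
    """studies: list of (methodological_quality, rating) tuples,
    e.g. ("good", "+"). Returns the synthesized level of evidence."""
    # Assumption: poor-quality studies do not contribute when better ones exist
    usable = [(q, r) for q, r in studies if q != "poor"]
    if not usable:
        return "unknown"
    # Assumption: any disagreement in rating counts as conflicting findings
    if len({r for _, r in usable}) > 1:
        return "conflicting"
    qualities = [q for q, _ in usable]
    good_or_better = sum(q in ("good", "excellent") for q in qualities)
    if "excellent" in qualities or good_or_better >= 2:
        return "strong"
    # Assumption: one good study, or several fair ones, gives moderate evidence
    if good_or_better == 1 or len(qualities) >= 2:
        return "moderate"
    return "limited"
```

For example, two consistent studies of good quality would yield "strong", a single study of fair quality "limited", and a positive study alongside a negative one "conflicting".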

Results

Search Results

The literature search resulted in 123 abstracts (Figure 1). Following elimination of duplicate articles, predetermined inclusion and exclusion criteria were applied. Twenty-four articles were screened, with 13 articles meeting the inclusion criteria (Table 2).

Demographics

In total, the FJS for 2,217 joints among 2,212 patients (TKA: 61.6%; unicompartmental knee arthroplasty (UKA): 0.01%; THA: 38.3%) were included in this review. Further details regarding the baseline characteristics of these patients can be found in Table 3.

Methodological quality of the studies

All methodological scores are summarized in Table 4. Overall, 11 studies (92%) [27, 19, 3, 5, 6, 16, 17, 21, 32, 31, 30] received a “poor” or “fair” COSMIN rating for internal consistency, while one study [14] received a “good” COSMIN rating. Five out of 9 studies (56%) [16, 32, 31, 30, 19] that analyzed reliability received a “fair” COSMIN rating, whereas 4 papers (44%) [3, 6, 17, 27] received a “good” COSMIN rating. Three out of 5 studies (60%) [16, 32, 27] reporting data on measurement error received a “fair” COSMIN rating, while the remaining 2 articles [6, 3] received a “good” COSMIN rating. Five studies (50%) [5, 16, 21, 27, 31] received a “fair” or “poor” COSMIN rating for hypothesis testing, whereas 5 studies (50%) received a “good” COSMIN rating [3, 14, 32, 6, 17]. The content validity of the FJS was explored during its development with a “good” COSMIN rating, and no other articles were found looking at content validity of the tool. Two out of four studies (50%) [3, 14] reporting on responsiveness received a “good” COSMIN rating, whereas two studies (50%) [30, 12] received only a “fair” COSMIN rating.

Statistical methods used to determine measurement properties

Ten of the studies (77%) [7, 9, 14, 19, 20, 21, 22, 23, 24, 29] used Cronbach’s alpha to assess internal consistency reliability. In addition, nine studies (69%) assessed test-retest reliability by calculating Intraclass Correlation Coefficient (ICC) values [14, 19, 20, 21, 23, 24, 29, 31, 32]. Five papers (38%) reported data on measurement error [14, 19, 20, 23, 24], with four of those [19, 20, 23, 24] expressing measurement error as the standard error of measurement (SEM), and one study [14] preferring to use Bland and Altman plot analysis. The majority of the studies (77%) [7, 9, 14, 19, 20, 21, 23, 24, 31] in this review assessed validity by defining construct validity, formulating a priori hypotheses involving correlations or mean differences with other, already validated PROM tools.

Internal consistency reliability

Among the studies that observed internal consistency [7, 9, 14, 19, 20, 21, 23, 24, 29, 31], the mean Cronbach’s coefficient alpha was 0.95, ranging from 0.91 [14] to 0.98 [7]. Overall, the included studies demonstrated good internal consistency according to the criterion proposed by Nunnally and Bernstein [33]. Median internal consistency reliability statistics were ≥0.9 overall and by language, surgery, diagnosis and sample size (Table 5). Only one article [7] performed a factor analysis, reporting that the Forgotten Joint Score is unidimensional. Overall, the level of evidence was moderate (Table 4).

Test-retest reliability

Median test-retest reliability statistics were >0.8 overall and by language, surgery, diagnosis and sample size (Table 5). Overall, all of the articles that performed test-retest evaluation concurred that the FJS is reliable (Table 4) with a global moderate level of evidence.

Measurement error

Among the papers that reported data on measurement error, four (80%) [14, 20, 23, 24] had an indeterminate evidence rating as they did not report a Minimal Important Change (MIC) value [13], whereas one (20%) [19] had a negative rating, reporting a MIC smaller than the Smallest Detectable Change (SDC) (Table 4). However, the overall level of evidence was only limited.
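
To make the relationship between SEM, SDC and MIC concrete, the Python sketch below uses the commonly cited formulas SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM; these formulas are standard in the measurement literature but are not stated in this review, so their use here is an assumption. The rating function mirrors the rule described above (indeterminate when no MIC is reported, negative when the MIC is smaller than the SDC); treating a MIC equal to the SDC as positive is a further assumption, as are all function names.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from the score SD and the ICC (assumed formula)."""
    return sd * math.sqrt(1 - icc)


def smallest_detectable_change(sem):
    """SDC at the individual level with 95% confidence (assumed formula)."""
    return 1.96 * math.sqrt(2) * sem


def measurement_error_rating(mic, sdc):
    """'?' if no MIC is reported, '-' if MIC < SDC, otherwise '+'."""
    if mic is None:
        return "?"
    return "-" if mic < sdc else "+"
```

With a hypothetical score SD of 10 points and an ICC of 0.91, the SEM is about 3.0 points and the SDC about 8.3 points, so a MIC of 5 points would be rated negative because true change of that size could not be separated from measurement error.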

Construct validity

Almost all of the studies (90%) [7, 9, 14, 19, 20, 21, 23, 24, 31] that evaluated construct validity with the hypothesis testing method had a positive rating (Table 4), meaning that correlation with an instrument measuring the same construct was >0.5 or at least 75% of the results were in accordance with the hypotheses, and correlation with related constructs was higher than with unrelated constructs [13]. The overall level of evidence according to the rating system of Schellingerhout et al. [16] was strong.

Responsiveness

Among the four studies that looked at responsiveness of the FJS [7, 19, 30, 32], two [7, 30] (50%) had a positive rating, one [19] (25%) had a negative rating and one [32] (25%) had an indeterminate rating (Table 4). Overall, the evidence regarding the responsiveness of the FJS was conflicting.

Floor and ceiling effects

All of the articles reported data on ceiling effects of the questionnaire and 11 papers (85%) analyzed floor effects. The large majority of articles (85%) reported ceiling effects below 15% at 1-year follow-up, with a mean value of 8.9%. For floor effects, all of the articles (100%) that analyzed them found a floor effect below 15% at 1-year follow-up, with a mean value of 1.8%. In addition, six articles [9, 21, 22, 23, 24, 29] compared ceiling effects between the FJS and other frequently used PROMs, such as the WOMAC and the Oxford knee or hip score, with five of them [9, 22, 23, 24, 29] (83%) finding significantly lower ceiling effects with the FJS.

Discussion

This review summarized 13 studies that analyzed the utility of the FJS. The FJS is a relatively new PROM used to assess patients’ outcomes after TJA. The main finding of this review was that the FJS showed excellent psychometric properties in terms of reliability and validity. In addition, the FJS showed low ceiling and floor effects, which makes it useful when evaluating patient groups that traditionally perform very well.

Internal Consistency

When assessing internal consistency, the FJS received a score of excellent (Cronbach alpha=0.95). However, we should consider that high internal consistency scores do not necessarily mean that the scale is unidimensional [34]. In addition, there is evidence [34] that when the value of alpha is too high (over 0.95 or so) it may reflect unnecessary duplication of content across items and point more to redundancy than to homogeneity. When evaluating internal consistency, most studies [9, 14, 19, 20, 21, 22, 23, 24, 32] did not use factor analysis or item response theory to verify unidimensionality of the tool, thus compromising the methodology. Only Hamilton et al. took this into consideration, performing a factor analysis showing that the FJS is unidimensional [7].

Reliability

This systematic review showed excellent test-retest reliability across included studies (ICC=0.91). High test-retest reliability reflects the stability of a questionnaire over time and is important for discriminative purposes, such as distinguishing patients with less or more disease. When evaluating measurement error, the majority of the papers (75%) did not report data on Minimal Important Change (MIC) or Smallest Detectable Change (SDC). With only one paper [19] reporting data properly according to the COSMIN list, additional studies are required to determine the measurement error of the FJS.

Validity

Validity was assessed by examining content and construct validity. In order to achieve proper content validity, authors have to provide a clear description of the measurement aim of the questionnaire, the target population, the concepts that the questionnaire is intended to measure and the item selection. Behrend et al. adequately described these aspects and therefore the Forgotten Joint Score has good content validity. Concerning construct validity, overall the primary hypotheses of all studies were confirmed, thus showing good construct validity. Furthermore, the majority of papers had a good COSMIN score and therefore we feel that there is strong evidence to show adequate construct validity of the tool.

Responsiveness

Different results were found when analyzing responsiveness. In our systematic review, only two papers [7, 19] evaluated responsiveness with a good COSMIN methodology, and among them we found conflicting results. Based on this finding, we suggest that further studies are needed in order to reach safe conclusions regarding the responsiveness of the FJS.

Discriminatory Power

When developing the FJS, Behrend et al. believed that their tool would be able to distinguish patients with good outcomes from patients with excellent outcomes, thus resulting in better discriminatory power when compared with other PROM tools [9]. In order to achieve this goal, the FJS should have significantly lower ceiling effects compared to other PROMs. This review showed low ceiling effects (<15%) in 11 out of the 13 studies at 1-year follow-up. These findings suggest that the FJS might have better discriminatory power than other PROMs, which might allow it to recognize small differences between patients.

A particular limitation of the FJS was related to question number 12 (“are you aware of your artificial knee when doing your favorite sport?”) of the questionnaire. We found that this question had a significant percentage of missing responses (>10%) in all those articles that reported this percentage [7, 14, 20, 21, 24, 32]. A possible explanation could be that this question might not be applicable in less active individuals, such as elderly patients [24, 32]. A shorter, more patient-adapted or age-adapted FJS might therefore be more adequate. Another potential limitation was that the COSMIN checklist used in this review is a novel quality-assessment tool with relatively untested inter-rater reliability. Finally, the follow-up period varied widely amongst the studies, with five of them not reporting it (Table 3).

Future directions

The Forgotten Joint Score introduced a new outcome parameter: the patient’s awareness of their knee or hip joint during activities of daily living. Although the FJS was originally introduced for patients undergoing THA or TKA [9], we feel that the assessment of joint awareness in daily activities might also be useful in other types of orthopaedic operations. For example, since the FJS has very promising discriminatory power between patients with a good outcome and patients with an excellent outcome, it could be particularly useful in the postoperative assessment of patients with sports injuries who are treated with arthroscopic procedures [35].

Conclusions

This review showed that there is strong evidence of good construct validity and test-retest reliability for the FJS, with only moderate evidence of good internal consistency. Ceiling and floor effects were very low, showing very promising discriminatory power between patients with a good outcome and patients with an excellent outcome. Therefore, especially in patients expected to achieve high levels of function after total joint replacement, we highly recommend the use of the FJS for the long-term assessment of their treatment.

362 363

References

17

Running title: Forgotten Joint Score: A systematic review 364

1. Gagnier JJ, Mullins M, Huang H, Marinac-Dabic D, Ghambaryan A, Eloff B, et al. A

365

Systematic Review of Measurement Properties of Patient-Reported Outcome Measures Used in

366

Patients Undergoing Total Knee Arthroplasty. J Arthroplasty 2017 May;32(5):1688-1697.

367

2. Vega JF, Spindler KP. Types of Scoring Instruments Available. 2019. In: Musahl V et al.,

368

editors. Basic Methods Handbook for Clinical Orthopaedic Research. Berlin, Heidelberg,

369

Springer; 2019.

370

3. Irrgang JJ, Lubowitz JH. Measuring arthroscopic outcome. Arthroscopy. 2008;24(6):718–22.

371

4. MOTION Group. Patient-Reported Outcomes in Orthopaedics. J Bone Joint Surgery Am.

372

2018 Mar 7;100(5):436-442.

373

5. Alviar MJ, Olver J, Brand C, Tropea J, Hale T, Pirpiris M, et al. Do patient-reported outcome

374

measures in hip and knee arthroplasty rehabilitation have robust measurement attributes? A

375

systematic review. J Rehabil Med 2011 Jun;43(7):572-83.

376

6. Ramkumar PN, Harris JD, Noble PC. Patient-reported outcome measures after total knee

377

arthroplasty: a systematic review. Bone Joint Res 2015 Jul;4(7):120-7.

378

7. Hamilton DF, Loth FL, Giesinger JM, Giesinger K, MacDonald DJ, Patton JT, et al.

379

Validation of the English language Forgotten Joint Score-12 as an outcome measure for total hip

380

and knee arthroplasty in a British population. Bone Joint J 2017 Feb;99-B(2):218-24.

381

8. Marx RG, Jones EC, Atwan NC, Closkey RF, Salvati EA, Sculco TP. Measuring improvement

382

following total hip and knee arthroplasty using patient-based measures of outcome. J Bone Joint

383

Surg Am 2005 Sep;87(9):1999-2005.

18

Running title: Forgotten Joint Score: A systematic review 384

9. Behrend H, Giesinger K, Giesinger JM, Kuster MS. The "forgotten joint" as the ultimate goal in joint arthroplasty: validation of a new patient-reported outcome measure. J Arthroplasty 2012 Mar;27(3):430-436.e1.

10. Giesinger K, Hamilton DF, Jost B, Holzner B, Giesinger JM. Comparative responsiveness of outcome measures for total knee arthroplasty. Osteoarthritis Cartilage 2014 Feb;22(2):184-9.

11. Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. Open Med 2009;3(3):123-130.

12. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res 2010 May;19(4):539-49.

13. Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res 2012 May;21(4):651-7.

14. Cao S, Liu N, Han W, Zi Y, Peng F, Li L, et al. Simplified Chinese version of the Forgotten Joint Score (FJS) for patients who underwent joint arthroplasty: cross-cultural adaptation and validation. J Orthop Surg Res 2017 Jan 14;12(1):6.

15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33(1):159-74.


16. Schellingerhout JM, Heymans MW, Verhagen AP, de Vet HC, Koes BW, Terwee CB. Measurement properties of translated versions of neck-specific questionnaires: a systematic review. BMC Med Res Methodol 2011 Jun 6;11:87.

17. Huang H, Grant JA, Miller BS, Mirza FM, Gagnier JJ. A Systematic Review of the Psychometric Properties of Patient-Reported Outcome Instruments for Use in Patients With Rotator Cuff Disease. Am J Sports Med 2015 Oct;43(10):2572-82.

18. van Tulder M, Furlan A, Bombardier C, et al. Updated method guidelines for systematic reviews in the Cochrane collaboration back review group. Spine 2003 Jun;28(12):1290-1299.

19. Baumann F, Ernstberger T, Loibl M, Zeman F, Nerlich M, Tibesku C. Validation of the German Forgotten Joint Score (G-FJS) according to the COSMIN checklist: does a reduction in joint awareness indicate clinical improvement after arthroplasty of the knee? Arch Orthop Trauma Surg 2016 Feb;136(2):257-64.

20. Kinikli GI, Güney Deniz H, Karahan S, Yüksel E, Kalkan S, Dönder Kara D, et al. Validity and reliability of Turkish version of the Forgotten Joint Score-12. Journal of Exercise Therapy and Rehabilitation 2017;4(1):18-25.

21. Klouche S, Giesinger JM, Sariali EH. Translation, cross-cultural adaption and validation of the French version of the Forgotten Joint Score in total hip arthroplasty. Orthop Traumatol Surg Res 2018 Sep;104(5):657-61.


22. Matsumoto M, Baba T, Homma Y, Kobayashi H, Ochi H, Yuasa T, et al. Validation study of the Forgotten Joint Score-12 as a universal patient-reported outcome measure. Eur J Orthop Surg Traumatol 2015 Oct;25(7):1141-5.

23. Thomsen MG, Latifi R, Kallemose T, Barfod KW, Husted H, Troelsen A. Good validity and reliability of the forgotten joint score in evaluating the outcome of total knee arthroplasty. Acta Orthop 2016 Jun;87(3):280-5.

24. Shadid MB, Vinken NS, Marting LN, Wolterbeek N. The Dutch version of the Forgotten Joint Score: test-retesting reliability and validation. Acta Orthop Belg 2016 Mar;82(1):112-8.

25. Beaton DE, Bombardier C, Guillemin F, Ferraz MB. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine (Phila Pa 1976) 2000 Dec 15;25(24):3186-91.

26. Guillemin F, Bombardier C, Beaton D. Cross-cultural adaptation of health-related quality of life measures: literature review and proposed guidelines. J Clin Epidemiol 1993 Dec;46(12):1417-32.

27. Ware JE Jr, Keller SD, Gandek B, Brazier JE, Sullivan M. Evaluating translations of health status questionnaires. Methods from the IQOLA project. International Quality of Life Assessment. Int J Technol Assess Health Care 1995 Summer;11(3):525-51.

28. Ganestam A, Barfod K, Klit J, Troelsen A. Validity and reliability of the Achilles tendon total rupture score. J Foot Ankle Surg 2013 Nov-Dec;52(6):736-9.

29. Larsson A, Rolfson O, Karrholm J. Evaluation of Forgotten Joint Score in total hip arthroplasty with Oxford Hip Score as reference standard. Acta Orthop 2019 Apr 1:1-8.


30. Giesinger JM, Loth FL, Howie C, Giesinger K, Hamilton DF. Validation of the English Version of the Forgotten Joint Score-12 in Patients Undergoing Total Knee or Hip Arthroplasty. Value Health 2015 Nov;18(7):A652-3.

31. Thompson SM, Salmon LJ, Webb JM, Pinczewski LA, Roe JP. Construct Validity and Test Re-Test Reliability of the Forgotten Joint Score. J Arthroplasty 2015 Nov;30(11):1902-5.

32. Thienpont E, Vanden Berghe A, Schwab PE, Forthomme JP, Cornu O. Joint awareness in osteoarthritis of the hip and knee evaluated with the 'Forgotten Joint' Score before and after joint replacement. Knee Surg Sports Traumatol Arthrosc 2016 Oct;24(10):3346-51.

33. Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.

34. Tavakol M, Dennick R. Making sense of Cronbach's alpha. Int J Med Educ 2011 Jun 27;2:53-55.

35. Behrend H, Zdravkovic V, Giesinger JM, Giesinger K. Joint awareness after ACL reconstruction: patient-reported outcomes measured with the Forgotten Joint Score-12. Knee Surg Sports Traumatol Arthrosc 2017 May;25(5):1454-1460.


Table 1. Quality criteria for measurement properties

| Property | Description | Rating | Quality criteria (a, b) |
|---|---|---|---|
| 1. Content validity | The extent to which the domain of interest is comprehensively sampled by the items in the questionnaire | + | A clear description is provided of the measurement aim, the target population, the concepts that are being measured, and the item selection AND target population and (investigators OR experts) were involved in item selection; |
| | | ? | A clear description of above-mentioned aspects is lacking OR only target population involved OR doubtful design or method; |
| | | - | |
| 2. Internal consistency | The extent to which items in a (sub)scale are intercorrelated, thus measuring the same construct | + | Factor analyses performed on adequate sample size (7 * # items and >100) AND Cronbach's alpha(s) calculated per dimension AND Cronbach's alpha(s) between 0.70 and 0.95; |
| | | ? | No factor analysis OR doubtful design or method; |
| | | - | Cronbach's alpha(s) <0.70 or >0.95, despite adequate design and method; |
| 3. Reliability | The extent to which patients can be distinguished from each other, despite measurement error (relative measurement error) | + | ICC/weighted Kappa ≥0.70 OR Pearson's r ≥0.80; |
| | | ? | Neither ICC/weighted Kappa nor Pearson's r determined; |
| | | - | ICC/weighted Kappa <0.70 OR Pearson's r <0.80; |
| 4. Responsiveness | The ability of a questionnaire to detect clinically important changes over time | + | SDC or SDC < MIC OR MIC outside the LOA OR RR >1.96 OR AUC >0.70; |
| | | ? | Doubtful design or method; |
| | | - | SDC or SDC > MIC OR MIC equals or inside LOA OR RR <1.96 OR AUC <0.70, despite adequate design and methods; |
| 5. Measurement error | The extent to which the scores on repeated measures are close to each other (absolute measurement error) | + | MIC > SDC OR MIC outside the LOA OR convincing arguments that agreement is acceptable; |
| | | ? | Doubtful design or method OR (MIC not defined AND no convincing arguments that agreement is acceptable); |
| | | - | MIC ≤ SDC OR MIC equals or inside LOA, despite adequate design and method; |
| 6. Hypothesis testing | The extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured | + | Correlation with an instrument measuring the same construct ≥0.50 OR at least 75% of the results are in accordance with the hypotheses, AND correlation with related constructs is higher than with unrelated constructs; |
| | | ? | Solely correlations determined with unrelated constructs; |
| | | - | Correlation with an instrument measuring the same construct <0.50 OR <75% of the results are in accordance with the hypotheses, OR correlation with related constructs is lower than with unrelated constructs |

MIC = minimal important change; SDC = smallest detectable change; LOA = limits of agreement; ICC = intraclass correlation coefficient; SD = standard deviation.
a: + = positive rating; ? = indeterminate rating; - = negative rating; 0 = no information available.
b: Doubtful design or method = lack of a clear description of the design or methods of the study, sample size smaller than 50 subjects, or any important methodological weakness in the design or execution of the study.
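To illustrate how the criteria in Table 1 might be applied in practice, the following Python sketch encodes the internal consistency and reliability rows. It is not taken from the review: the function and argument names are hypothetical, an inadequate factor-analysis sample is assumed to count as a doubtful design, and the reliability thresholds are assumed to be inclusive (at least 0.70 for ICC/weighted Kappa, at least 0.80 for Pearson's r).

```python
from typing import Optional


def rate_internal_consistency(alpha: float, n_items: int, n_subjects: int,
                              factor_analysis: bool,
                              doubtful_design: bool = False) -> str:
    """Return '+', '?' or '-' for the internal consistency row of Table 1."""
    # Adequate sample: at least 7 subjects per item and more than 100 subjects.
    adequate_sample = n_subjects >= 7 * n_items and n_subjects > 100
    # Assumption: an inadequate sample is treated like a doubtful design.
    if not factor_analysis or doubtful_design or not adequate_sample:
        return "?"
    return "+" if 0.70 <= alpha <= 0.95 else "-"


def rate_reliability(icc: Optional[float] = None,
                     pearson_r: Optional[float] = None) -> str:
    """Return '+', '?' or '-' for the reliability row of Table 1."""
    if icc is None and pearson_r is None:
        return "?"  # neither statistic was determined
    # Assumption: thresholds are inclusive (>= 0.70 for ICC, >= 0.80 for r).
    icc_ok = icc is not None and icc >= 0.70
    r_ok = pearson_r is not None and pearson_r >= 0.80
    return "+" if icc_ok or r_ok else "-"
```

With hypothetical inputs, a 12-item questionnaire tested in 150 subjects with a factor analysis and an alpha of 0.97 would be rated "-" under this sketch, because alpha exceeds 0.95, whereas an ICC of 0.91 would be rated "+".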

Table 5. Internal consistency reliability and test-retest reliability per study

| Group | Internal consistency: number of studies | Mean Cronbach alpha | Range Cronbach alpha | Test-retest: number of studies | Mean intraclass correlation coefficient (ICC) | Range ICC |
|---|---|---|---|---|---|---|
| All | 10 | 0.95 | 0.91-0.98 | 9 | 0.91 | 0.8-0.97 |
| English | 3 | 0.97 | 0.95-0.98 | 3 | 0.92 | 0.87-0.97 |
| Non-English | 7 | 0.95 | 0.91-0.97 | 6 | 0.91 | 0.8-0.97 |
| THA only | 4 | 0.97 | 0.96-0.98 | 2 | 0.9 | 0.86-0.93 |
| TKA only | 4 | 0.95 | 0.91-0.97 | 5 | 0.9 | 0.8-0.97 |
| Smaller sample size (<100) | 2 | 0.97 | 0.96-0.97 | 2 | 0.9 | 0.86-0.93 |
| Bigger sample size (>100) | 8 | 0.95 | 0.91-0.97 | 7 | 0.92 | 0.8-0.97 |

Table 2. Levels of evidence for the overall quality of the measurement property

| Level | Rating | Criteria |
|---|---|---|
| Strong | +++ or --- | Consistent findings in multiple studies of good methodological quality OR in one study of excellent methodological quality |
| Moderate | ++ or -- | Consistent findings in multiple studies of fair methodological quality OR in one study of good methodological quality |
| Limited | + or - | One study of fair methodological quality |
| Conflicting | ± | Conflicting findings |
| Unknown | ? | Only studies of poor methodological quality |

Table 3. Characteristics of the included studies

| Study | Language | Oxford level of evidence | Number of patients | Female (%) | Number of TKA | Number of UKA | Number of THA | Age (years) | Time since surgery | Follow-up |
|---|---|---|---|---|---|---|---|---|---|---|
| Baumann et al. | German | I | 105 | 54.3 | 86 | 19 | - | 65.2±9.3 | 7.2±12.5 | 1 year |
| Behrend et al. | English | II | 243 | 49.4 | 86 | - | 157 | 70.6±11.3 | 31.1±12.3 | - |
| Cao et al. | Chinese | II | 150 | 78.7 | 150 | - | - | 68.1±7.4 | 28±9.7 | 1 year |
| Giesinger et al. | English | II | 98 | 49.0 | 98 | - | - | 68.1±8.6 | - | 2 years |
| Hamilton et al. | English | II | 436 | 56.9 | 231 | - | 205 | 69.9 (THA), 67.6 (TKA) | 0, 6 and 12 months | 1 year |
| Kinikli et al. | Turkish | II | 132 | 77.3 | 90 | - | 42 | 63.9±12.7 | 30.8±16 | - |
| Klouche et al. | French | II | 58 | 37.9 | - | - | 63 | 62.7±15.2 | At least 1 year | 1-6 years |
| Larsson et al. | English | II | 111 | 52.0 | - | - | 111 | 69 | At least 1 year | At least 1 year |
| Matsumoto et al. | Japanese | II | 108 | 81.5 | - | - | 108 | 65.7±11.6 | 29.5±38.7 | - |
| Shadid et al. | Dutch | II | 159 | 64.0 | 84 | - | 75 | 68.6 | 15 months | 2 years |
| Thomsen et al. | Danish | III | 315 | 59.4 | 315 | - | - | 65 | - | 1-4 years |
| Thompson et al. | English | III | 147 | 46.3 | 147 | - | - | 67 | 39 (range 18-72) | - |
| Thienpont et al. | English | II | 150 | 56 | 75 | - | 75 | 66±17 (THA) and 69±10 (TKA) | - | - |

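Table 4. Methodological quality of each study per measurement property

| Study | Internal consistency (COSMIN score) | Internal consistency (evidence rating) | Reliability (COSMIN score) | Reliability (evidence rating) | Measurement error (COSMIN score) | Measurement error (evidence rating) | Hypothesis testing (COSMIN score) | Hypothesis testing (evidence rating) | Responsiveness (COSMIN score) | Responsiveness (evidence rating) |
|---|---|---|---|---|---|---|---|---|---|---|
| Baumann et al. | Poor | ? | Good | + | Fair | - | Good | + | Good | - |
| Behrend et al. | Poor | ? | 0 | 0 | 0 | 0 | Fair | + | 0 | 0 |
| Cao et al. | Poor | ? | Good | + | Fair | ? | Good | + | 0 | 0 |
| Giesinger et al. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fair | + |
| Hamilton et al. | Good | + | 0 | 0 | 0 | 0 | Good | + | Good | + |
| Kinikli et al. | Poor | ? | Fair | + | Fair | ? | Poor | + | 0 | 0 |
| Klouche et al. | Poor | ? | Good | + | 0 | 0 | Good | + | 0 | 0 |
| Larsson et al. | Poor | ? | Fair | + | 0 | 0 | 0 | 0 | 0 | 0 |
| Matsumoto et al. | Poor | ? | 0 | 0 | 0 | 0 | Poor | ? | 0 | 0 |
| Shadid et al. | Poor | ? | Good | + | Fair | ? | Fair | + | 0 | 0 |
| Thomsen et al. | Fair | + | Fair | + | Fair | ? | Good | + | 0 | 0 |
| Thompson et al. | 0 | 0 | Fair | + | 0 | 0 | Fair | + | 0 | 0 |
| Thienpont et al. | 0 | 0 | Fair | + | 0 | 0 | 0 | 0 | 0 | 0 |
| OVERALL LEVEL OF EVIDENCE | | ++ | | ++ | | + | | +++ | | ± |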

Table 6. Cross-cultural and construct validity per study

| Study | Cross-cultural validity | Construct validity |
|---|---|---|
| Baumann et al. | German | Pearson correlation coefficient. Moderate correlation of FJS with OKS (r=0.37) and EQ-5D (r=0.56). Poor correlation between FJS and TAS (r=0.29). All hypotheses confirmed. |
| Behrend et al. | Original | Pearson correlation coefficient. High correlation with the WOMAC scales (r=-0.79 total, r=-0.75 pain, r=-0.69 stiffness, r=-0.78 function). |
| Cao et al. | Chinese | Pearson correlation coefficient. Good correlation with symptoms (r=0.67) and pain (r=0.6) domains of KOOS and social functioning (r=0.66) domain of SF-36. Moderate correlation with function in daily living (r=0.53) and function in sport and recreation (r=0.4) domains of KOOS and physical subscale of SF-36 (r=0.51). Weak correlation with mental subscale of SF-36. All hypotheses confirmed. |
| Hamilton et al. | Original | Pearson correlation coefficient. In TKA patients, high correlation with the OKS (r=0.85) and the SF-12 PCS (r=0.7); in THA patients slightly lower (r=0.79 for OHS and r=0.67 for SF-12 PCS). Fair correlation with SF-12 MCS in the TKA group (r=0.23) and in the THA group (r=0.36). Hypotheses confirmed. |
| Kinikli et al. | Turkish version | Pearson correlation coefficient. Moderate to high correlation with WOMAC, KOOS-PS, HOOS-PS, TKS and SF-12 PCS. No correlation with SF-12 MCS. |
| Klouche et al. | French version | Pearson correlation coefficient. Positive correlation with modified HHS and negative correlation with OHS-12. Hypotheses were confirmed. |
| Matsumoto et al. | Japanese | Pearson correlation coefficient. Moderately correlated with total WOMAC score (r=0.52) and its subscale scores for 'stiffness' (r=0.4) and 'function' (r=0.54). Weakly correlated with the subscore for 'pain' (r=0.29). Favorably correlated with total JHEQ score (r=0.69) and its subscale score for 'movement' (r=0.64), and moderately with the scores for 'pain' (r=0.55) and 'mental' (r=0.53). |
| Shadid et al. | Dutch | Spearman correlation coefficient. Significant positive correlation between FJS and WOMAC total score (and also subscales) (r=0.75). Hypothesis confirmed. |
| Thomsen et al. | Danish | Pearson correlation coefficient. Strong correlation between the FJS and OKS (r=0.81). Hypothesis confirmed. |
| Thompson et al. | Original | Spearman correlation coefficient. Positive correlation with total WOMAC score (r=0.7) and its subscales Pain (r=0.67), Stiffness (r=0.52) and Function (r=0.66). Positive correlation with KOOS 'Quality of life' (r=0.63), 'Pain' (r=0.68) and 'ADL' (r=0.66), whereas weak correlation with KOOS 'Symptoms' (r=0.33). |

Fig. 1: Systematic review flow diagram.

Identification: Records identified through database searching (n = 123); duplicates removed (n = 25)

Screening: Records after duplicates removed (n = 98); records excluded (n = 74)

Eligibility: Full-text articles assessed for eligibility (n = 24); full-text articles excluded, with reasons (n = 11)

Included: Studies included in qualitative synthesis (n = 13)
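The counts reported in the flow diagram can be checked with simple subtraction. The short Python sketch below is only an illustrative consistency check; the variable names are hypothetical and the numbers are those stated in Fig. 1.

```python
# Counts as reported in the flow diagram (Fig. 1).
identified = 123
duplicates_removed = 25
records_excluded = 74
full_text_excluded = 11

# Each stage is the previous stage minus the records dropped at that step.
screened = identified - duplicates_removed          # records after duplicates removed
full_text_assessed = screened - records_excluded    # full-text articles assessed
included = full_text_assessed - full_text_excluded  # studies in qualitative synthesis

print(screened, full_text_assessed, included)  # 98 24 13
```

The derived values (98, 24 and 13) match the numbers given at the screening, eligibility and inclusion stages of the diagram.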