Measurement Properties of Visual Analogue Scale, Numeric Rating Scale, and Pain Severity Subscale of the Brief Pain Inventory in Patients With Low Back Pain: A Systematic Review

Measurement Properties of Visual Analogue Scale, Numeric Rating Scale, and Pain Severity Subscale of the Brief Pain Inventory in Patients With Low Back Pain: A Systematic Review

Accepted Manuscript Measurement properties of Visual Analogue Scale, Numeric Rating Scale and Pain Severity subscale of the Brief Pain Inventory in p...

1MB Sizes 0 Downloads 91 Views

Accepted Manuscript

Measurement properties of Visual Analogue Scale, Numeric Rating Scale and Pain Severity subscale of the Brief Pain Inventory in patients with low back pain: a systematic review Alessandro Chiarotto , Lara J. Maxwell , Raymond W. Ostelo , Maarten Boers , Peter Tugwell , Caroline B. Terwee PII: DOI: Reference:

S1526-5900(18)30400-0 https://doi.org/10.1016/j.jpain.2018.07.009 YJPAI 3616

To appear in:

Journal of Pain

Received date: Revised date: Accepted date:

3 May 2018 14 June 2018 12 July 2018

Please cite this article as: Alessandro Chiarotto , Lara J. Maxwell , Raymond W. Ostelo , Maarten Boers , Peter Tugwell , Caroline B. Terwee , Measurement properties of Visual Analogue Scale, Numeric Rating Scale and Pain Severity subscale of the Brief Pain Inventory in patients with low back pain: a systematic review, Journal of Pain (2018), doi: https://doi.org/10.1016/j.jpain.2018.07.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

HIGHLIGHTS

Evidence quality on content validity of VAS, NRS and BPI-PS in LBP is (very) low



High quality evidence shows that NRS measurement error is too large



Head-to-head measurement comparisons of VAS, NRS and BPI-PS are missing in LBP



Content validity, reliability and responsiveness should be further assessed

AC

CE

PT

ED

M

AN US

CR IP T



1

ACCEPTED MANUSCRIPT

Measurement properties of Visual Analogue Scale, Numeric Rating Scale and Pain Severity subscale of the Brief Pain Inventory in patients with low back pain: a systematic review

CR IP T

Alessandro Chiarotto1,2, Lara J. Maxwell3, Raymond W. Ostelo1,2, Maarten Boers1,4, Peter Tugwell3,5, Caroline B. Terwee1

1. Department of Epidemiology and Biostatistics, Amsterdam Public Health research institute, VU University Medical Center, Amsterdam, Netherlands 2. Department of Health Sciences, Amsterdam Movement Sciences research institute, Vrije Universiteit, Amsterdam, Netherlands

AN US

3. Centre for Practice-Changing Research, Ottawa Hospital Research Institute, University of Ottawa, Ottawa, Canada 4. Amsterdam Rheumatology and Immunology Center, VU University Medical Center, Amsterdam, Netherlands

M

5. Department of Medicine, Faculty of Medicine, University of Ottawa, Ottawa, Canada

ED

Running title: Measurement properties VAS, NRS and BPI-PS in LBP

CE

PT

Disclosures: This work was supported by the Task Force for Research of the Spine Society of Europe (EUROSPINE) [grant number: EUROSPINE TFR 5-2015]. These funding bodies did not have any role in in designing the study, in collecting, analyzing and interpreting the data, in writing this manuscript, and in deciding to submit it for publication. The authors of this manuscript declare that they do not have any conflicts of interest regarding its content.

AC

Corresponding author:

Alessandro Chiarotto, PT, MSc Department of Epidemiology and Biostatistics Amsterdam Public Health research institute Amsterdam Movement Sciences research institute VU University Medical Center de Boelelaan 1085, Medical Faculty F-vleugel 1081HV, Amsterdam, the Netherlands +31 (0) 622164491 Email: [email protected]

2

ACCEPTED MANUSCRIPT

Measurement properties of Visual Analogue Scale, Numeric Rating Scale and Pain Severity subscale of the Brief Pain Inventory in patients with low back pain: a systematic review

CR IP T

ABSTRACT

ED

M

AN US

Visual Analogue Scale (VAS), Numeric Rating Scale (NRS), and Pain Severity subscale of the Brief Pain Inventory (BPI-PS)] are the most frequently used instruments to measure pain intensity in low back pain (LBP). However, their measurement properties in this population have not been systematically reviewed. The goal of this study was to provide such systematic evidence synthesis. Six electronic sources (MEDLINE, EMBASE, CINAHL, PsycINFO, SportDiscus, Google Scholar) were searched (July 2017). Studies assessing any measurement property in patients with non-specific LBP were included. Two reviewers independently screened articles and assessed risk of bias using the COSMIN checklist. For each measurement property: evidence quality was rated as high, moderate, low, or very low (GRADE approach); results were classified as sufficient, insufficient or inconsistent. Ten studies assessed the VAS, 13 the NRS, four the BPI-PS. The three instruments displayed low or very low quality evidence for content validity. High quality evidence was only available for NRS insufficient measurement error. Moderate evidence was available for: NRS inconsistent responsiveness, BPI-PS sufficient structural validity and internal consistency, and BPI-PS inconsistent construct validity. All VAS measurement properties were underpinned by no, low or very low quality evidence, likewise the other measurement properties of NRS and BPI-PS. PERSPECTIVE

AC

CE

PT

Despite their broad use, there is no evidence clearly suggesting that one among VAS, NRS and BPI-PS has superior measurement properties in LBP. Future adequate quality head-to-head comparisons are needed, and priority should be given to assessing content validity, test-retest reliability, measurement error and responsiveness.

KEY WORDS

Low back pain; pain intensity; Visual Analogue Scale; Numeric Rating Scale; Brief Pain Inventory.

3

ACCEPTED MANUSCRIPT

INTRODUCTION

CR IP T

Low back pain (LBP) is the most disabling health condition worldwide 33. Measuring the impact of LBP on patients’ lives is fundamental to monitor clinical management and to study the (cost-) effectiveness of treatments 4. Patients with LBP have indicated that the most important domains to be measured are: physical functional activities, pain reduction, quality of life, enjoyment of life, emotional well-being, and fatigue 9, 43, 103. A core outcome set initiative (involving patients) aimed at standardizing measurement for LBP identified four core outcome domains for clinical trials: physical functioning, pain intensity, health-related quality of life, and number of deaths 9. Among these domains, pain intensity is the most frequently assessed in LBP clinical trials 31.

M

AN US

Pain intensity, defined as ‘how much a patient hurts, reflecting the overall magnitude of the pain experience’ 102, is the pain domain that ranked the highest among various pain domains (e.g. pain quality, temporal aspects of pain, pain behaviour, pain interference) in consensus exercises to establish core outcome domains for LBP 9 and other pain conditions 53, 77. The Visual Analogue Scale (VAS) is the patient-reported outcome measure (PROM) most frequently used to measure pain intensity in LBP trials, followed by the Numeric Rating Scale (NRS) and the Pain Severity subscale of the Brief Pain Inventory (BPI-PS) 7, 31. Recent consensus-based studies have shown that researchers and clinicians prefer the NRS over other instruments to measure pain intensity in LBP 13, 17, 22, 24. However, this choice has not been explicitly based on its measurement properties and feasibility 5, 85.

AC

CE

PT

ED

NRS, VAS and BPI-PS are highly feasible for clinical research and practice, providing very little burden to professionals and patients 39. Various reviews have attempted to synthesize their measurement properties in samples of pain patients 6, 42, 47, 52, 86, 94, 108. All these reviews focused on chronic pain broadly and two of them solely focused in children and adolescents 6, 94. In recent years, the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative has developed tools that allow researchers to conduct high quality systematic reviews on measurement properties of PROMs 72, 73, 84, 100. Given that these existing reviews pre-dated the COSMIN guidance 42, 47, 52, 86, 108, key methodological steps (e.g. quality assessment of the studies, formulation of evidence synthesis and findings taking the quality of the studies into account, definition of the methods to combine study results 90, 105) could not be included. Therefore, it is timely to adopt the most recent methodological advancements in a systematic review on PROMs for pain intensity. The objective of this study was to systematically synthesize the evidence on the measurement properties of VAS, NRS and BPI-PS in adult patients with LBP. This review was conducted within an international collaboration aimed at developing a core outcome measurement set for LBP 12, and informed a Delphi study to reach consensus on which core outcome measurement instrument(s) to endorse for pain intensity in LBP clinical trials 8. For this reason, in contrast with previous reviews that had a more generic focus on various pain conditions 6, 42, 47, 52, 86, 94, 108, it focused solely on studies in patients with LBP, following the approach adopted in Cochrane reviews of randomized clinical trials on the effectiveness of interventions in patients with LBP 32.

4

ACCEPTED MANUSCRIPT

METHODS

CR IP T

This systematic review was conducted according to COSMIN guidance 84 and reported according to the Preferred Reporting Items for Systematic Reviews and meta-Analysis (PRISMA) statement 71. Its protocol was registered in the international prospective register of systematic reviews (http://www.crd.york.ac.uk/PROSPERO/), registration number: CRD42015020006.

Measurement instruments

AN US

The VAS is a self-reported scale consisting of a horizontal or vertical line, usually 10 centimetres long (100 mm) anchored at the extremes by two verbal descriptors referring to the pain status 45. An introductory question (with or without a time recall period) asks the patient to tick the line on the point that best refers to his/her pain. The introductory question, the recall period, and the content of the external verbal descriptors vary in the literature 39. The NRS is a numbered version of the VAS in which the patient can select one number that best describes the pain 23. Like in the VAS, the NRS introductory question, time recall period and verbal descriptors can vary; the most frequently used version is the 11-point (0-10) NRS 39.

Literature search

PT

ED

M

The BPI-PS consists of four 11-point NRSs, two of which asking the patient to rate the pain at its ‘worst’ and ‘least in the last 24 hours’, and the other two asking about pain ‘on the average’ and ‘right now’ 15. For each NRS, the verbal descriptors are ‘no pain’ and ‘pain as bad as you can imagine’, and this questionnaire is usually administered as part of the BPI, which includes other 11 pain-related questions (seven of which belonging to the pain interference subscale) 15.

CE

Data sources and searches

AC

MEDLINE (through the interface PubMed), EMBASE (Embase.com), CINAHL (EBSCOhost), PsycINFO (EBSCOhost), and SportDiscus (EBSCOhost) were last searched on July 25, 2017. The search strategy consisted of three groups of search terms combined with the Boolean operator ‘AND’: 1) PROMs names, 2) LBP, 3) measurement properties. A validated search filter for retrieving studies on measurement properties in PubMed was used 98; the same filter was adapted for all the other databases (Appendix 1). No restrictions for language or time were adopted in the search strategies. Google Scholar was also searched (last on July 28, 2017) with the full names of the PROMs, and the first 100 hits for each PROM were screened for inclusion. Citation tracking of the eligible studies was carried out by consulting the database Web of Science and by checking their references. Study selection

5

ACCEPTED MANUSCRIPT

CR IP T

Any study on one or more of the three instruments was included if it assessed at least one of the nine measurement properties identified by the COSMIN taxonomy: internal consistency, test-retest reliability, measurement error, content validity, structural validity, construct validity/hypotheses testing, cross-cultural validity, criterion validity, and responsiveness 73. Studies presenting the development of the PROMs were included for the assessment of content validity 100. Other studies were considered eligible for the assessment of content validity if they were full-text original articles, including adult patients (> 18 years) with non-specific LBP 67 and/or professionals (e.g. researchers, clinicians) to assess the relevance, comprehensiveness or comprehensibility of the content of at least one of the three PROMs 100. Studies on all the other measurement properties were included if they were full-text articles presenting results for adult patients with non-specific LBP. Studies in populations that also included patients with specific LBP or patients with pain locations different from the lower back were included only if at least 75% of the total sample was classified as having non-specific LBP or if results were presented separately for the non-specific LBP group 54. Studies that used the PROMs as outcome measurement instruments, or in which the PROMs were used in a validation studies of other instruments were excluded 84.

M

Evaluation of the measurement properties

AN US

Inclusion criteria were applied by two reviewers (AC and LM) independently to titles and abstracts of the hits retrieved with the searches. Potentially eligible full-texts were screened independently by the same two reviewers. Consensus on inclusion was sought between reviewers and, in case of disagreement, a third reviewer (RO) made decisions.

CE

PT

ED

After retrieving the available evidence, COSMIN guidance for systematic reviews of PROMs recommends assessment of measurement properties in the following order: 1) content validity, 2) internal structure (i.e. structural validity, internal consistency, and cross-cultural validity), 3) the remaining properties (i.e. test-retest reliability, measurement error, criterion validity, construct validity, responsiveness) 84. For each measurement property, three phases are included in the assessment. First, the risk of bias of each single study on a measurement property is assessed. Second, the results of each single study on a measurement property are rated against criteria for sufficient measurement properties. Third, the results from all studies on a measurement property are summarized and the quality of evidence is graded. Each phase is described in more detail in the following sections.

AC

Risk of bias assessment and data extraction The risk of bias of the included studies was assessed with the COSMIN Risk of Bias checklist 72. Risk of bias refers to the methodological quality of the studies. The COSMIN checklist contains a box for each measurement property and boxes to assess the PROM development quality 100. Each box is rated on a 4-point rating scale: ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’. For the development study, total quality scores were determined separately for the two main parts of the study: concept elicitation study and cognitive interview(s) with patients. For the content validity studies, the study quality for the three main aspects of content validity (i.e. relevance, comprehensiveness, comprehensibility) was assessed separately. A total rating was obtained for each part by taking the lowest rating among the standards (i.e. worst score counts) 99. Two reviewers 6

ACCEPTED MANUSCRIPT (AC and CT) assessed PROM development quality and risk of bias of original content validity studies independently and achieved consensus in a face-to-face meeting. A similar 4-point rating scale and worst score counts method were also used for assessing the risk of bias for studies on the other measurement properties 72, and a total quality rating was determined for the studies on each measurement property in each study. Two reviewers (AC and LM) assessed the risk independently and achieved consensus in a video conference. For every study, data was extracted on patient characteristics and results by one reviewer (AC) and checked for accuracy by a second reviewer (LM).

CR IP T

Evidence synthesis

M

AN US

Evidence synthesis was performed separately for each measurement property 84, 100. For content validity, the results of the studies (including PROM development) were rated by two reviewers (AC and CT) independently according to ten established criteria: five on relevance, one on comprehensiveness, and four on comprehensibility 100. Each criterion could be rated as sufficient (+), insufficient (–), or indeterminate (?). The same criteria were also applied by two reviewers (AC and CT) to the content of the PROM itself 100; a specific version of VAS and NRS was used for this assessment, with the introductory question, recall period and external descriptors as recommended in a recent consensus-study (Appendix 2) 8. An overall sufficient (+), insufficient (–), or inconsistent (±) rating was determined for relevance, comprehensiveness and comprehensibility of each PROM, by jointly assessing all results and reviewers’ ratings on the same PROM. More detailed information on this assessment can be found in the COSMIN user manual on assessing the content validity of PROMs, available online at: www.cosmin.nl.

PT

ED

For the other measurement properties, the results were rated according to the consensus-based criteria proposed by Prinsen et al. (Appendix 3) 85. For measurement error, consensus-based minimal important change (MIC) values 75 were used to judge the relative magnitude of the smallest detectable change. For construct validity and responsiveness, the review team formulated a set of a priori hypotheses against which to evaluate the results of studies. For both properties, correlations were expected to be: ≥ 0.60 with other pain intensity instruments;

-

< 0.60 and ≥ 0.30 with instruments measuring related but dissimilar constructs (e.g. pain behaviour, physical functioning);

AC

CE

-

-

< 0.30 with instruments measuring unrelated constructs.

These hypotheses were based on the results of a systematic review on physical functioning PROMs for LBP 10. Two additional hypotheses were formulated for responsiveness: -

the area under the curve (AUC) to discriminate between improved and not improved/deteriorated patients had to be ≥ 0.70;

-

effect sizes (ES) and standardized response means (SRM) for improved patients had to be at least 0.50 larger than those for not improved/deteriorated patients; ES referred to mean

7

ACCEPTED MANUSCRIPT difference divided by the baseline standard deviation, while SRM referred to mean differences divided by the standard deviation of the difference 20. For construct validity and responsiveness, an overall sufficient (+), insufficient (–), or inconsistent (±) rating was determined by counting the number of results that met the hypotheses across all studies 84 . For the other measurement properties, an overall rating was determined by lumping together the scoring of each individual study; if ≥ 75% of the studies displayed the same scoring, that scoring became the overall rating (+ or –), while if < 75% of studies displayed the same scoring, the overall rating became inconsistent (±) 84.

M

AN US

CR IP T

The quality of evidence for each measurement property was rated according to the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach 37, adapted for this type of review, into: ‘high’, ‘moderate’, ‘low’, or ’very low’ 84, 100. ‘High’ quality evidence indicates that further research is very unlikely to change the confidence in study results; ‘moderate’ indicates that is likely that further research will have an important impact on study results and may change them; ‘low’ suggests that further research is very likely to have an important impact on study results and is likely to change them; ‘very low’ means that any result is very uncertain 37. For content validity, the evidence quality could be downgraded because of risk of bias, inconsistency of results and indirectness, as outlined elsewhere 100. For the other measurement properties, risk of bias, imprecision, inconsistency, and indirectness were taken into account to rate evidence quality 84. The concepts of risk of bias, imprecision, inconsistency and indirectness were taken from the GRADE approach 37. Risk of bias refers to limitations in the methodological quality of the eligible studies; imprecision refers to a low total number of patients included in the studies; inconsistency refers to unexplained heterogeneity of studies’ results; indirectness refers to the extent to which the study characteristics met the review inclusion criteria 32.

AC

CE

PT

ED

Rating the quality of evidence for content validity was performed by giving more weight to original content validity studies over PROM development and reviewers’ rating, as explained elsewhere (Appendix 4) 100. Thus, if there were no content validity and no PROM development studies (or if the PROM development was of ‘inadequate’ quality), the overall rating corresponded to the reviewers’ rating and quality of evidence was labelled as ‘very low’ 100. For the other measurement properties downgrading was done for: risk of bias of one level if there was only one ‘adequate’ quality study, two levels if there were only ‘doubtful’ or ‘inadequate’ studies; imprecision of one level if the total patient sample was < 100, two levels if < 50; inconsistency of one level if ≥ 75% of studies results were not all sufficient (+), insufficient (–) or inconsistent (±); indirectness of one level if at least one study did not specifically address the construct (pain intensity) or the target population (adult patients with non-specific LBP) of this review (Appendix 4) 11.

8

ACCEPTED MANUSCRIPT

RESULTS

AN US

CR IP T

Among 10.719 records retrieved, 23 full-text articles were included, 5 of which retrieved through citation tracking (Figure 1). Of 45 potentially eligible articles retrieved in the databases, 27 were excluded: 5 did not present results separately for patients with non-specific LBP 1, 57, 58, 93, 101; 9 did not aim to assess any measurement property 18, 27, 28, 34, 35, 40, 48, 49, 92; 8 did not report clearly if patients with non-specific LBP were included 25, 26, 38, 61, 66, 78, 83, 89; one each was excluded for the following reasons: VAS administered through the phone 46; VAS completed by a tester 74; assessed patients with experimental pain 82; assessed only patients with specific LBP 87; and focus on other instruments 109 . ***Figure 1 about here***

M

Three of the included full-text articles reported information on the BPI-PS development 15, 16, 19, while the other 20 included 22 original studies (two articles included two studies each 36, 59) on the measurement properties of the three PROMs. The VAS was assessed in 10 studies, the NRS in 13, and the BPI-PS in four. Four studies assessed more than one PROM in the same patient group 36, 88, 95 (Table 1).

PT

Visual Analogue Scale

ED

***Table 1 about here***

CE

A 100 mm VAS was used in all 10 studies; introductory statement, time recall period and external verbal descriptors varied (Table 1). One study assessed content validity 88, two test-retest reliability 64, 80 , two measurement error 76, 80, two construct validity 29, 95, and four responsiveness 3, 36, 91. Patients’ characteristics of each study are presented in Table 1, their results in Tables 2 to 4.

AC

Content validity

None of the studies retrieved described the development of the VAS as a pain intensity instrument. Robinson-Paap et al. 88 assessed VAS relevance and comprehensiveness with adequate quality; the same study also assessed NRS and BPI-PS. Three main themes were identified by patients with LBP on the instruments: 1) perception that it may not even be possible to measure pain in a meaningful way, 2) difficulty in finding appropriate experiences as referents, 3) difficulty on averaging pain. A few specifications for each theme are presented here. 1) E.g. “At the end of the day a single line is really not going to tell what I’ m actually feeling”. Three more specific subthemes were identified:

9

ACCEPTED MANUSCRIPT a. Pain measurement is influenced by other things other than pain. b. The numbers used to rate pain do not have an absolute meaning. c. Preference for pain intensity ratings ‘in the middle’ of the scale. 2) This theme included two subthemes:

CR IP T

a. Some patients used their prior LBP episodes as comparators; others did not use a comparator experience at all, rather thought of pain based on how much medication they took in a particular day. b. Several patients thought that anchoring the lower end to ‘no pain’ was not appropriate because they always experience some pain. Some patients expressed that they would not use the highest numbers on the scale because doing so would indicate a lack of ability to cope with the pain. The suggestions of ‘average’, ‘normal’, or ‘usual’ as alternative anchors also emerged.

AN US

3) Generating a number to represent ‘average pain’ over a given time period was not an intuitive task. The longer the time period over which to average, the more difficulty participants had.

ED

M

Relevance and comprehensiveness were rated as insufficient based on these results; the reviewers rated relevance, comprehensiveness, and comprehensibility of the VAS as sufficient. Low quality evidence was found for inconsistent findings for relevance and comprehensiveness, due to inconsistency and indirectness, as the only eligible study did not specifically focus on the pain intensity construct but on pain in general without referring to a specific aspect such as intensity (Table 5). Very low quality evidence was found for sufficient comprehensibility (Table 5). Internal structure

PT

Structural validity and internal consistency are not applicable to the VAS and NRS since these are single-item instruments. No studies were found on cross-cultural validity.

CE

Other measurement properties

AC

Only one study 80 presented results that could be rated for test-retest reliability (Table 2), providing low quality evidence (due to risk of bias and imprecision) of sufficient reliability (Table 5). Due to risk of bias and inconsistency of results across studies (Table 2), very low quality evidence of inconsistent findings was found for measurement error (Table 5). ***Table 2 about here***

Results on hypotheses testing for construct validity were inconsistent across studies (Table 3), providing low quality evidence (due to risk of bias and inconsistency) on this measurement property (Table 5). The results of four studies were tested against our hypotheses for responsiveness (Table 4), providing low quality evidence (due to risk of bias and inconsistency of results) of inconsistent results for this measurement property (Table 5). ***Table 3 about here*** 10

ACCEPTED MANUSCRIPT ***Table 4 about here***

Numeric Rating Scale

CR IP T

The 11-point NRS was used in all 13 studies; external descriptors varied slightly, while construct and recall period in the introductory statement varied more widely (Table 1). One study 14 administered three NRSs referring to current, best and worst pain over the last 24 hours and took the average of the three scores in the analyses. Two studies evaluated content validity 44, 88, two test-retest reliability 14, 69, five measurement error 14, 56, 60, 69, 104, one construct validity 95, and eight responsiveness 14, 36, 56, 59, 69, 81; four studies assessed the NRS in conjunction with other pain intensity instruments 36, 88, 95. Content validity

ED

M

AN US

No studies presenting the NRS development were found. Robinson-Paap et al. 88 analysed the NRS together with VAS and BPI-PS displaying the same results for all the instruments, as summarized for the VAS results. Hush et al. 44 assessed relevance and comprehensiveness of four NRS versions in a study of adequate quality. The majority of patients included in this study (i.e. > 50%) expressed the opinion that the NRS does not adequately capture the complexity of their personal experience of pain. Two themes emerged: 1) the meaning attributed to the pain score, 2) the time-frame of measurement. Regarding the first theme, participants reported that their score reflect many aspects of the pain experience, other than the sensory component of pain; another common view was that NRS scores are highly dependent on individual experiences of pain that can determine the benchmark used by a patient to rate the pain. Regarding the second theme, a majority believed that the NRS versions assessing pain ‘in the past 24 hours’ or ‘right now’ were unlikely to capture improvements because of symptom fluctuation.

***Table 5 about here***

CE

PT

These results, taken together with the reviewers’ ratings on the NRS to measure pain intensity in LBP, provided inconsistent results based on low quality evidence (due to inconsistency and indirectness) (Table 5).

Internal structure

AC

Structural validity and internal consistency are not applicable to the NRS since it is a single-item scale, and no studies assessing cross-cultural validity were retrieved. Other measurement properties Low quality evidence (due to inconsistency and imprecision) was found for inconsistent findings for test-retest reliability (Tables 2 and 5). High quality evidence was found for insufficient measurement error (Table 5) because smallest detectable change values in four adequate quality studies were larger than the proposed 2-point MIC (Table 2) 75. Very low quality evidence from one study of inadequate quality was found for inconsistent results on construct validity (Tables 3 and 5). Seven of the eight responsiveness studies provided results to be 11

ACCEPTED MANUSCRIPT rated against our hypotheses (Table 4), resulting in inconsistent results based on moderate quality evidence (due to inconsistency) (Table 5).

Pain Severity subscale of the Brief Pain Inventory Three studies presented information on the BPI-PS development 15, 16, 19. Among the other four studies (Table 1): one assessed content validity 88, two internal consistency 55, 97, one structural validity 97, two construct validity 55, 97, and two responsiveness 55, 106.

CR IP T

Content validity

AN US

The development of the BPI was rated as of doubtful quality because it was unclear if the included patients were representative of the target population 11. One content validity study assessed relevance and comprehensiveness in a study of adequate quality 88. This study also assessed VAS and NRS, providing the same results for all three instruments, as previously outlined. It was considered to provide indirect evidence because the pain intensity construct was not clearly specified, and its negative results were in contrast with reviewers’ ratings; this resulted in low quality evidence for inconsistent findings (Table 5). Internal structure

ED

M

One study 97 assessed the BPI-PS structural validity in a study of adequate quality performing an exploratory factor analysis on the whole BPI. The four BPI-PS items loaded on the same factor explaining 12% of the total variance and with eigenvalue equal to 1.38. The factor loadings on this factor ranged from 0.61 (pain worst) to 0.82 (pain least), while factor loadings on the first pain interference factor were very low (between -0.07 and 0.16). This resulted in sufficient unidimensionality based on moderate quality evidence (Table 5).

CE

PT

Two studies of adequate quality investigated the internal consistency, exhibiting Cronbach’s alpha of 0.82 55 and 0.85 97. According to the latest COSMIN guidance 84, these results provide moderate quality evidence for sufficient internal consistency (Table 5). No studies on cross-cultural validity were retrieved. Other measurement properties

AC

Test-retest reliability and measurement error of the BPI-PS were not assessed in any study. Moderate quality evidence (due to inconsistent results across studies) (Table 3) was found for inconsistent results on construct validity (Table 5). Responsiveness was assessed in two studies of inadequate quality (Table 4), providing very low quality evidence (due to risk of bias and inconsistency) of inconsistent results for this measurement property (Table 5).

12

ACCEPTED MANUSCRIPT

CR IP T

DISCUSSION

AN US

This systematic review illustrates that the quality of evidence on the measurement properties of VAS, NRS and BPI-PS in patients with LBP is clearly suboptimal (Table 5). The quality of evidence on content validity of all three instruments is low to very low. For the other measurement properties, high quality evidence was only found on the insufficient measurement error of the NRS. Moderate quality evidence was found for inconsistent results on the NRS responsiveness, sufficient results for BPI-PS structural validity and internal consistency, inconsistent construct validity of the BPI-PS. For all other assessed measurement properties the quality of evidence was low or very low (Table 5).

PT

ED

M

The NRS is most often recommended to measure pain intensity in patients with LBP 13, 17, 22, and in chronic pain more generally 24. Apparently, only practical aspects have dictated NRS recommendations in LBP so far. In a recent international Delphi survey, researchers, clinicians and patients clearly preferred the NRS over VAS and BPI-PS to measure pain intensity in LBP clinical trials 8 . Several Delphi participants highlighted the VAS to be: less understandable for patients (elderly in particular) than the NRS, time-consuming to score if the line is not exactly 100 mm long, and difficult to administer with digital devices 8. Meanwhile, the BPI-PS was less often chosen because it has a fee for administration, and it is less easy to administer than the other instruments 8. A previous review on a broader pain population also concluded that the NRS was preferred over the VAS for feasibility reasons 42. Despite these preference towards the NRS, the VAS has been the most frequently used pain instrument in LBP clinical trials so far 31, therefore it is important to monitor if this pattern of use will change in the (near) future.

AC

CE

Content validity is considered the first measurement property to consider when selecting a PROM 85. Evidence on this property could be generated by a head-to-head comparison studies where all 3 instruments are administered and patients are asked to rate their relevance, comprehensiveness and comprehensibility 100. Two studies included in this review 44, 88 raised issues regarding the content validity of NRS and VAS, in line with the results of a previous study in a chronic pain population 107. If these results are replicated in future studies in patients with LBP, the use of these instruments should be seriously reconsidered. Since these PROMs are usually intended to measure pain intensity, future clinimetric studies should consider NRS and VAS versions that specifically refer to pain intensity in the introductory question, as displayed in Appendix 2. Structural validity and internal consistency of the BPI-PS were found to be sufficient (Table 5) which is not surprising considering that the BPI-PS items share very similar content; this could artificially inflate its unidimensionality and Cronbach’s alpha. This systematic review clearly showed that the NRS measurement error is larger than the 2-point MIC value commonly proposed for this instrument in LBP (Table 3) 75. This implies that this PROM 13

ACCEPTED MANUSCRIPT may not be able to distinguish smallest detectable changes from “real” changes in the measured construct 21, which represents a serious limitation. Whether or not VAS and BPI-PS share this problem is not able to be determined as direct comparisons are lacking. The measurement error of an instrument can be decreased by increasing the number of repeated measurements or items 20, as recently shown in mixed chronic pain populations: multi-item tools displayed slightly more reliable scores than single-item tools 50, 51; therefore, the BPI-PS may also have a smaller measurement error than the other two PROMs in patients with LBP, but this has to be tested.

AN US

CR IP T

The cross-cultural validity of VAS, NRS and BPI-PS has not been evaluated in patients with LBP, nor in broader pain populations. Since data in patients with LBP from different cultures are routinely pooled in systematic reviews of clinical trials 54, 65, 68, 79 and observational studies 30, 62, it is essential to exclude substantial differential item functioning across countries and languages. The evidence quality on construct validity and responsiveness is low (Table 5) to determine if any instrument outperforms the others. The only study directly comparing construct validity of VAS and NRS is of inadequate quality 95. Two studies (of doubtful quality) comparing VAS and NRS responsiveness showed that the NRS has larger effect sizes (therefore better ability to capture pain intensity changes) in patients with acute and chronic LBP 36, but this finding requires replication. There is evidence that pain multiple-item PROMs do not display substantially larger effect sizes than singleitem ones in more heterogeneous pain conditions 48, 50, but these studies did not specifically include the BPI-PS and did not specifically assess a range of responsiveness aspects, such as the AUC and correlations with other instruments.

AC

CE

PT

ED

M

Recently, the use of pain intensity scales in patients with chronic pain has been criticized 2, 63, 96. More specifically, these instruments have been advocated as potential contributors to the opioid epidemic in some countries; patients who display high pain intensity ratings are those who, despite the presence of comorbidities such as mental health disorders, are frequently prescribed opioids resulting in subsequent addiction 2, 96. Additionally, it has been proposed that “zero pain is not the (only) goal” in patients with chronic pain, rather the main goal should be to improve (physical and psychological) functioning 2, 63. This view against the use of pain intensity scales and on the unimportance of pain intensity is in contrast with various studies clearly showing that pain intensity reduction is a crucial goal for patients living with chronic pain 9, 41, 43, 70. Therefore, considering the importance of pain intensity as a core outcome domain in LBP 9 and considering that the instruments included in this review have been widely used for decades 7, 31, the lack of robust evidence supporting the measurement properties of the most frequently used instruments for this domain is worrisome. Nevertheless, it should be underlined that the GRADE approach in systematic reviews on measurement properties of instruments has only recently been introduced 85, 100 and this is the first systematic review to adopt such approach for all measurement properties, therefore reaching the high quality evidence level will be the goal of future research. There is a need for adequate quality head-to-head comparison studies on pain intensity instruments in patients with LBP. The instruments assessed in this review may be included in these studies alongside other pain intensity instruments, such as Verbal Rating Scales, the bodily pain subscale of the Short Form 36 (which combines pain intensity measurement with pain interference), or other pain items or subscales of other generic- or disease-specific instruments. Additionally, other methods to assess pain intensity in patients with pain may be considered, and investigators with

14

ACCEPTED MANUSCRIPT innovative and creative ideas on how to better measure pain intensity are certainly welcome in this field.

CR IP T

The main strength of this first systematic review on the measurement properties of the three most frequently used pain intensity PROMs in LBP 7, 31 is the use of the most up-to-date methodology 72, 73, 84, 100 . In contrast with previous reviews on the measurement properties of pain intensity instruments 6, 42, 47, 52, 86, 94, 108, this systematic review focused on patients with LBP only; this decision was guided by the focus of the core outcome measurement set for which this review was performed 8, 9 , and by the fact that there is evidence clearly suggesting what is the best method to synthesize the evidence on measurement properties of instruments (i.e. whether it should be synthesized in specific or generic populations). A potential limitation is that the evidence synthesis lumps together studies from different languages and countries, and including instruments with (slightly) different pain constructs and high external anchors. However, this approach is routine for pain intensity scales in systematic reviews for LBP, splitting studies may be equally contentious, and there is no evidence on the best approach. For detailed scrutiny, language, country and instruments’ characteristics of each study are specified in the results (Tables 1 to 4).

M

AN US

In conclusion, there is currently no evidence to claim superior measurement properties for any of the three commonly used instruments to measure pain in LBP. In our opinion, such evidence should preferably come from sound head-to-head comparison clinimetric studies, with priority to be given to the assessment of content validity, test-retest reliability, measurement error and responsiveness.

3. 4.

AC

5.

PT

2.

Angst F, Verra ML, Lehmann S, Aeschlimann A. Responsiveness of five condition-specific and generic outcome assessment instruments for chronic pain. BMC Med Res Methodol. 8:26, 2008 Ballantyne JC, Sullivan MD. Intensity of Chronic Pain--The Wrong Metric? N Engl J Med. 373:2098-2099, 2015 Beurskens AJ, de Vet HC, Koke AJ. Responsiveness of functional status in low back pain: a comparison of different instruments. Pain. 65:71-76, 1996 Black N. Patient reported outcome measures could help transform healthcare. BMJ. 346:f167, 2013 Boers M, Brooks P, Strand CV, Tugwell P. The OMERACT filter for Outcome Measures in Rheumatology. J Rheumatol. 25:198-199, 1998 Castarlenas E, Jensen MP, von Baeyer CL, Miro J. Psychometric Properties of the Numerical Rating Scale to Assess Self-Reported Pain Intensity in Children and Adolescents: A Systematic Review. Clin J Pain. 33:376-383, 2017 Chapman JR, Norvell DC, Hermsmeyer JT, Bransford RJ, DeVine J, McGirt MJ, Lee MJ. Evaluating common outcomes for measuring treatment success for chronic low back pain. Spine. 36:S54-S68, 2011 Chiarotto A, Boers M, Deyo RA, Buchbinder R, Corbin TP, Costa LO, Foster NE, Grotle M, Koes BW, Kovacs FM, Lin CC, Maher CG, Pearson AM, Peul WC, Schoene ML, Turk DC, van Tulder MW, Terwee CB, Ostelo RW. Core outcome measurement instruments for clinical trials in non-specific low back pain. Pain. 159:481-495, 2018

CE

1.

ED

REFERENCES

6.

7.

8.

15

ACCEPTED MANUSCRIPT

11.

12.

13. 14. 15.

16.

20. 21.

AC

22.

PT

19.

CE

18.

ED

M

17.

CR IP T

10.

Chiarotto A, Deyo RA, Terwee CB, Boers M, Buchbinder R, Corbin TP, Costa LO, Foster NE, Grotle M, Koes BW, Kovacs FM, Lin CC, Maher CG, Pearson AM, Peul WC, Schoene ML, Turk DC, van Tulder MW, Ostelo RW. Core outcome domains for clinical trials in non-specific low back pain. Eur Spine J. 24:1127-1142, 2015 Chiarotto A, Maxwell LJ, Terwee CB, Wells GA, Tugwell P, Ostelo RW. Roland-Morris Disability Questionnaire and Oswestry Disability Index: Which Has Better Measurement Properties for Measuring Physical Functioning in Nonspecific Low Back Pain? Systematic Review and Meta-Analysis. Phys Ther. 96:1620-1637, 2016 Chiarotto A, Ostelo RW, Boers M, Terwee CB. A systematic review highlights the need to investigate the content validity of patient-reported outcome measures for physical functioning in low back pain. J Clin Epidemiol. 95:73-93, 2018 Chiarotto A, Terwee CB, Deyo RA, Boers M, Lin C-WC, Buchbinder R, Corbin TP, Costa LO, Foster NE, Grotle M, Koes BW, Kovacs FM, Maher CG, Pearson AM, Peul WC, Schoene ML, Turk DC, van Tulder MW, Ostelo RW. A core outcome set for clinical trials on non-specific low back pain: study protocol for the development of a core domain set. Trials. 15:511, 2014 Chiarotto A, Terwee CB, Ostelo RW. Choosing the right outcome measurement instruments for patients with low back pain. Best Pract Res Clin Rheumatol. 30:1003-1020, 2016 Childs JD, Piva SR, Fritz JM. Responsiveness of the numeric pain rating scale in patients with low back pain. Spine. 30:1331-1334, 2005 Cleeland CS: The Brief Pain Inventory, Available at: https://www.mdanderson.org/documents/Departments-and-Divisions/SymptomResearch/BPI_UserGuide.pdf, 2009. Cleeland CS, Ryan K. Pain assessment: global used of the Brief Pain Inventory. Ann Acad Med Singapore. 23:129-138, 1994 Clement RC, Welander A, Stowell C, Cha TD, Chen JL, Davies M, Fairbank JC, Foley KT, Gehrchen M, Hagg O, Jacobs WC, Kahler R, Khan SN, Lieberman IH, Morisson B, Ohnmeiss DD, Peul WC, Shonnard NH, Smuck MW, Solberg TK, Stromqvist BH, Hooff ML, Wasan AD, Willems PC, Yeo W, Fritzell P. A proposed set of metrics for standardized outcome reporting in the management of low back pain. Acta Orthop. 86:523-533, 2015 Co YY, Eaton S, Maxwell MW. The relationship between the St. Thomas and Oswestry disability scores and the severity of low back pain. J Manipulative Physiol Ther. 16:14-18, 1993 Daut RL, Cleeland CS, Flanery RC. Development of the Wisconsin Brief Pain Questionnaire to assess pain in cancer and other diseases. Pain. 17:197-210, 1983 De Vet HC, Terwee CB, Mokkink LB, Knol DL: Measurement in medicine: a practical guide, Cambridge University Press, 2011. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes. 4:54, 2006 Deyo RA, Dworkin SF, Amtmann D, Andersson G, Borenstein D, Carragee E, Carrino J, Chou R, Cook K, DeLitto A, Goertz C, Khalsa P, Loeser J, Mackey S, Panagis J, Rainville J, Tosteson T, Turk D, Von Korff M, Weiner DK. Report of the NIH Task Force on research standards for chronic low back pain. J Pain. 15:569-585, 2014 Downie W, Leatham P, Rhind V, Wright V, Branco J, Anderson J. Studies with pain rating scales. Ann Rheum Dis. 37:378-381, 1978 Dworkin RH, Turk DC, Farrar JT, Haythornthwaite JA, Jensen MP, Katz NP, Kerns RD, Stucki G, Allen RR, Bellamy N, Carr DB, Chandler J, Cowan P, Dionne R, Galer BS, Hertz S, Jadad AR, Kramer LD, Manning DC, Martin S, McCormick CG, McDermott MP, McGrath P, Quessy S, Rappaport BA, Robbins W, Robinson JP, Rothman M, Royal MA, Simon L, Stauffer JW, Stein W, Tollett J, Wernicke J, Witter J. Core outcome measures for chronic pain clinical trials: IMMPACT recommendations. Pain. 113:9-19, 2005

AN US

9.

23. 24.

16

ACCEPTED MANUSCRIPT

31.

32.

33.

34.

35. 36.

AC

37.

CR IP T

30.

AN US

29.

M

28.

ED

27.

PT

26.

Elfving B, Lund I, C LB, Bostrom C. Ratings of pain and activity limitation on the visual analogue scale and global impression of change in multimodal rehabilitation of back pain analyses at group and individual level. Disabil Rehabil. 38:2206-2216, 2016 Farrar JT, Young JP, Jr., LaMoreaux L, Werth JL, Poole RM. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain. 94:149158, 2001 Filho IT, Simmonds MJ, Protas EJ, Jones S. Back pain, physical function, and estimates of aerobic capacity: what are the relationships among methods and measures? Am J Phys Med Rehabil. 81:913-920, 2002 Fishbain DA, Gao J, Lewis JE, Zhang L. At Completion of a Multidisciplinary Treatment Program, Are Psychophysical Variables Associated with a VAS Improvement of 30% or More, a Minimal Clinically Important Difference, or an Absolute VAS Score Improvement of 1.5 cm or More? Pain Med. 17:781-789, 2016 Fishbain DA, Lewis JE, Gao J. Is there significant correlation between self-reported low back pain visual analogue scores and low back pain scores determined by pressure pain induction matching? Pain Pract. 13:358-363, 2013 Fritsch CG, Ferreira ML, Maher CG, Herbert RD, Pinto RZ, Koes B, Ferreira PH. The clinical course of pain and disability following surgery for spinal stenosis: a systematic review and meta-analysis of cohort studies. Eur Spine J. 26:324-335, 2017 Froud R, Patel S, Rajendran D, Bright P, Bjorkli T, Buchbinder R, Eldridge S, Underwood M. A Systematic Review of Outcome Measures Use, Analytical Approaches, Reporting Methods, and Publication Volume by Year in Low Back Pain Trials Published between 1980 and 2012: Respice, adspice, et prospice. PLoS One. 11:e0164573, 2016 Furlan AD, Malmivaara A, Chou R, Maher CG, Deyo RA, Schoene M, Bronfort G, Van Tulder MW. 2015 updated method guideline for systematic reviews in the Cochrane Back and Neck Group. Spine. 40:1660-1673, 2015 GBD. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990-2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet. 388:1545-1602, 2016 Gronblad M, Hurri H, Kouri JP. Relationships between spinal mobility, physical performance tests, pain intensity and disability assessments in chronic low back pain patients. Scand J Rehabil Med. 29:17-24, 1997 Gronblad M, Lukinmaa A, Konttinen YT. Chronic low-back pain: intercorrelation of repeated measures for pain and disability. Scand J Rehabil Med. 22:73-77, 1990 Grotle M, Brox JI, Vollestad NK. Concurrent comparison of responsiveness in pain and functional status measurements used for patients with low back pain. Spine. 29:E492-501, 2004 Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, Schunemann HJ. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 336:924, 2008 Hagino C, Thompson M, Advent J, Rivet L. Agreement between 2 pain visual analogue scales, by age and area of complaint in neck and low back pain subjects: the standard pen and paper VAS versus plastic mechanical sliderule VAS. J Can Chiropr Assoc. 40:220, 1996 Hawker GA, Mian S, Kendzerska T, French M. Measures of adult pain: Visual Analog Scale for Pain (VAS Pain), Numeric Rating Scale for Pain (NRS Pain), McGill Pain Questionnaire (MPQ), Short-Form McGill Pain Questionnaire (SF-MPQ), Chronic Pain Grade Scale (CPGS), Short Form-36 Bodily Pain Scale (SF-36 BPS), and Measure of Intermittent and Constant Osteoarthritis Pain (ICOAP). Arthritis Care Res. 63 Suppl 11:S240-252, 2011 Hazard RG, Haugh LD, Green PA, Jones PL. Chronic low back pain: The relationship between patient satisfaction and pain, impairment, and disability outcomes. Spine. 19:881-887, 1994

CE

25.

38.

39.

40.

17

ACCEPTED MANUSCRIPT

48. 49.

50.

51. 52. 53.

54.

AC

55.

CR IP T

47.

AN US

45. 46.

M

44.

ED

43.

PT

42.

Henry SG, Bell RA, Fenton JJ, Kravitz RL. Goals of Chronic Pain Management: Do Patients and Primary Care Physicians Agree and Does it Matter? Clin J Pain. 33:955-961, 2017 Hjermstad MJ, Fayers PM, Haugen DF, Caraceni A, Hanks GW, Loge JH, Fainsinger R, Aass N, Kaasa S. Studies comparing Numerical Rating Scales, Verbal Rating Scales, and Visual Analogue Scales for assessment of pain intensity in adults: a systematic literature review. J Pain Symptom Manage. 41:1073-1093, 2011 Hush JM, Refshauge K, Sullivan G, De Souza L, Maher CG, McAuley JH. Recovery: what does this mean to patients with low back pain? Arthritis Rheum. 61:124-131, 2009 Hush JM, Refshauge KM, Sullivan G, De Souza L, McAuley JH. Do numerical rating scales and the Roland-Morris Disability Questionnaire capture changes that are meaningful to patients with persistent back pain? Clin Rehabil. 24:648-657, 2010 Huskisson E. Measurement of pain. Lancet. 304:1127-1131, 1974 Jamison RN, Raymond SA, Slawsby EA, McHugo GJ, Baird JC. Pain assessment in patients with low back pain: comparison of weekly recall and momentary electronic data. J Pain. 7:192-199, 2006 Jensen MP: The validity and reliability of pain measures for use in clinical trials in adults: review paper written for the Intiative on Methods, Measurements, and Pain Assessment in Clinical Trials (IMMPACT) meeting, April 12–13 (2003), IMMPACT-II, http://www.immpact.org/static/meetings/Immpact2/background/Jensen_review.pdf, 2003. Jensen MP, Hu X, Potts SL, Gould EM. Single vs composite measures of pain intensity: relative sensitivity for detecting treatment effects. Pain. 154:534-538, 2013 Jensen MP, Schnitzer TJ, Wang H, Smugar SS, Peloso PM, Gammaitoni A. Sensitivity of singledomain versus multiple-domain outcome measures to identify responders in chronic lowback pain: pooled analysis of 2 placebo-controlled trials of etoricoxib. Clin J Pain. 28:1-7, 2012 Jensen MP, Tome-Pires C, Sole E, Racine M, Castarlenas E, de la Vega R, Miro J. Assessment of pain intensity in clinical trials: individual ratings vs composite scores. Pain Med. 16:141148, 2015 Jensen MP, Turner JA, Romano JM, Fisher LD. Comparative reliability and validity of chronic pain intensity measures. Pain. 83:157-162, 1999 Kahl C, Cleland JA. Visual analogue scale, numeric pain rating scale and the McGill Pain Questionnaire: an overview of psychometric properties. Phys Ther Rev. 10:123-128, 2005 Kaiser U, Kopkow C, Deckert S, Neustadt K, Jacobi L, Cameron P, De VA, Apfelbacher C, Arnold B, Birch J. Developing a core outcome-domain set to assessing effectiveness of interdisciplinary multimodal pain therapy: the VAPAIN consensus statement on core outcome-domains. Pain. 159:673-683, 2018 Kamper SJ, Apeldoorn AT, Chiarotto A, Smeets RJ, Ostelo RW, Guzman J, van Tulder MW. Multidisciplinary biopsychosocial rehabilitation for chronic low back pain. Cochrane Database Syst Rev.Cd000963, 2014 Keller S, Bann CM, Dodd SL, Schein J, Mendoza TR, Cleeland CS. Validity of the brief pain inventory for use in documenting the outcomes of patients with noncancer pain. Clin J Pain. 20:309-318, 2004 Kovacs FM, Abraira V, Royuela A, Corcoll J, Alegre L, Cano A, Muriel A, Zamora J, del Real MT, Gestoso M, Mufraggi N. Minimal clinically important change for pain intensity and disability in patients with nonspecific low back pain. Spine. 32:2915-2920, 2007 Krebs EE, Bair MJ, Damush TM, Tu W, Wu J, Kroenke K. Comparative responsiveness of pain outcome measures among primary care patients with musculoskeletal pain. Med Care. 48:1007-1014, 2010 Lapane KL, Quilliam BJ, Benson C, Chow W, Kim M. One, two, or three? Constructs of the brief pain inventory among patients with non-cancer pain in the outpatient setting. J Pain Symtpom Manage. 47:325-333, 2014

CE

41.

56.

57.

58.

18

ACCEPTED MANUSCRIPT

66. 67. 68.

69. 70.

71. 72.

73.

AC

74.

CR IP T

65.

AN US

63. 64.

M

62.

ED

61.

PT

60.

Lauridsen HH, Hartvigsen J, Manniche C, Korsholm L, Grunnet-Nilsson N. Responsiveness and minimal clinically important difference for pain and disability instruments in low back pain patients. BMC Musculoskelet Disord. 7:82, 2006 Lauridsen HH, Manniche C, Korsholm L, Grunnet-Nilsson N, Hartvigsen J. What is an acceptable outcome of treatment before it begins? Methodological considerations and implications for patients with chronic low back pain. Eur Spine J. 18:1858-1866, 2009 Law M, McIntosh J, Morrison L, Baptiste S. A comparison of two pain measurement scales: Their clinical value. Can J Rehabil. 1987 Lee H, Hübscher M, Moseley GL, Kamper SJ, Traeger AC, Mansell G, McAuley JH. How does pain lead to disability? A systematic review and meta-analysis of mediation studies in people with back and neck pain. Pain. 156:988-997, 2015 Lee TH. Zero Pain Is Not the Goal. JAMA. 315:1575-1577, 2016 Love A, Leboeuf C, Crisp TC. Chiropractic chronic low back pain sufferers and self-report assessment methods. Part I. A reliability study of the Visual Analogue Scale, the Pain Drawing and the McGill Pain Questionnaire. J Manipulative Physiol Ther. 12:21-25, 1989 Machado GC, Maher CG, Ferreira PH, Day RO, Pinheiro MB, Ferreira ML. Non-steroidal antiinflammatory drugs for spinal pain: a systematic review and meta-analysis. Ann Rheum Dis. 76:1269-1278, 2017 Machin D, Lewith GT, Wylson S. Pain Measurement in Randomized Clinical Trials: A Comparison of Two Pain Scales. Clin J Pain. 4:161-168, 1988 Maher C, Underwood M, Buchbinder R. Non-specific low back pain. Lancet. 389:736-747, 2017 Marin TJ, Van Eerd D, Irvin E, Couban R, Koes BW, Malmivaara A, van Tulder MW, Kamper SJ. Multidisciplinary biopsychosocial rehabilitation for subacute low back pain. Cochrane Database Syst Rev. 6:CD002193, 2017 Maughan EF, Lewis JS. Outcome measures in chronic low back pain. Eur Spine J. 19:14841494, 2010 McRae M, Hancock MJ. Adults attending private physiotherapy practices seek diagnosis, pain relief, improved function, education and prevention: a survey. J Physiother. 63:250-256, 2017 Moher D, Liberati A, Tetzlaff J, Altman DG, Group P. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 6:e1000097, 2009 Mokkink LB, De Vet HC, Prinsen CA, Patrick DL, Alonso J, Bouter LM, Terwee CB. COSMIN Risk of Bias checklist for systematic reviews of Paitent-Reported Outcome Measures. Qual Life Res. 27:1171-1179, 2018 Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HC. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 63:737-745, 2010 Olaogun MO, Adedoyin RA, Ikem IC, Anifaloba OR. Reliability of rating low back pain with a visual analogue scale and a semantic differential scale. Physiother Theory Pract. 20:135-142, 2004 Ostelo RW, Deyo RA, Stratford P, Waddell G, Croft P, Von Korff M, Bouter LM, de Vet HC. Interpreting change scores for pain and functional status in low back pain: towards international consensus regarding minimal important change. Spine. 33:90-94, 2008 Ostelo RW, Swinkels-Meewisse IJ, Knol DL, Vlaeyen JW, de Vet HC. Assessing pain and painrelated fear in acute low back pain: what is the smallest detectable change? Int J Behav Med. 14:242-248, 2007 Page MJ, Huang H, Verhagen AP, Buchbinder R, Gagnier JJ. Identifying a core set of outcome domains to measure in clinical trials for shoulder disorders: a modified Delphi study. RMD Open. 2:e000380, 2016

CE

59.

75.

76.

77.

19

ACCEPTED MANUSCRIPT

84.

85.

86.

87. 88. 89. 90.

91.

92.

AC

93.

CR IP T

83.

AN US

82.

M

81.

ED

80.

PT

79.

Park KB, Shin JS, Lee J, Lee YJ, Kim MR, Lee JH, Shin KM, Shin BC, Cho JH, Ha IH. Minimum Clinically Important Difference and Substantial Clinical Benefit in Pain, Functional, and Quality of Life Scales in Failed Back Surgery Syndrome Patients. Spine. 42:E474-e481, 2017 Parreira P, Heymans MW, van Tulder MW, Esmail R, Koes BW, Poquet N, Lin CWC, Maher CG. Back schools for chronic non‐specific low back pain. Cochrane Database Syst Rev. 8:CD011674, 2017 Paungmali A, Sitilertpisan P, Taneyhill K, Pirunsan U, Uthaikhup S. Intrarater reliability of pain intensity, tissue blood flow, thermal pain threshold, pressure pain threshold and lumbopelvic stability tests in subjects with low back pain. Asian J Sports Med. 3:8-14, 2012 Pengel LH, Refshauge KM, Maher CG. Responsiveness of pain, disability, and physical impairment outcomes in patients with low back pain. Spine. 29:879-883, 2004 Price DD, Harkins SW. Combined Use of Experimental Pain and Visual Analogue Scales in Providing Standardized Measurement of Clinical Pain. Clin J Pain. 3:1-8, 1987 Price DD, McGrath PA, Rafii A, Buckingham B. The validation of visual analogue scales as ratio scale measures for chronic and experimental pain. Pain. 17:45-56, 1983 Prinsen CA, Mokkink LB, Bouter LM, Alonso J, Patrick DL, De Vet HC, Terwee CB. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 27:1147-1157, 2018 Prinsen CA, Vohra S, Rose MR, Boers M, Tugwell P, Clarke M, Williamson PR, Terwee CB. How to select outcome measurement instruments for outcomes included in a “Core Outcome Set”–a practical guideline. Trials. 17:449, 2016 Ramasamy A, Martin ML, Blum SI, Liedgens H, Argoff C, Freynhagen R, Wallace M, McCarrier KP, Bushnell DM, Hatley NV, Patrick DL. Assessment of Patient-Reported Outcome Instruments to Assess Chronic Low Back Pain. Pain Med. 18:1098-1110, 2017 Roach KE, Brown MD, Dunigan KM, Kusek CL, Walas M. Test-retest reliability of patient reports of low back pain. J Orthop Sports Phys Ther. 26:253-259, 1997 Robinson‐Papp J, George MC, Dorfman D, Simpson DM. Barriers to chronic pain measurement: A qualitative study of patient perspectives. Pain Med. 16:1256-1264, 2015 Scrimshaw SV, Maher C. Responsiveness of visual analogue and McGill pain scale measures. J Manipulative Physiol Ther. 24:501-504, 2001 Shea BJ, Grimshaw JM, Wells GA, Boers M, Andersson N, Hamel C, Porter AC, Tugwell P, Moher D, Bouter LM. Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews. BMC Med Res Methodol. 7:10, 2007 Sheldon EA, Bird SR, Smugar SS, Tershakovec AM. Correlation of measures of pain, function, and overall response: results pooled from two identical studies of etoricoxib in chronic low back pain. Spine. 33:533-538, 2008 Solomon D, Roopchand-Martin S, Swaminathan N, Heymans MW. How well do pain scales correlate with each other and with the Oswestry Disability Questionnaire? Int J Ther Rehabil. 18, 2011 Song CY, Lin SF, Huang CY, Wu HC, Chen CH, Hsieh CL. Validation of the Brief Pain Inventory in Patients With Low Back Pain. Spine. 41:E937-942, 2016 Stinson JN, Kavanagh T, Yamada J, Gill N, Stevens B. Systematic review of the psychometric properties, interpretability and feasibility of self-report pain intensity measures for use in clinical trials in children and adolescents. Pain. 125:143-157, 2006 Strong J, Ashton R, Chant D. Pain intensity measurement in chronic low back pain. Clin J Pain. 7:209-218, 1991 Sullivan MD, Ballantyne JC. Must we reduce pain intensity to treat chronic pain? Pain. 157:65-69, 2016 Tan G, Jensen MP, Thornby JI, Shanti BF. Validation of the Brief Pain Inventory for chronic nonmalignant pain. J Pain. 5:133-137, 2004

CE

78.

94.

95. 96. 97.

20

ACCEPTED MANUSCRIPT

103.

104.

105.

106.

107. 108.

CE

109.

CR IP T

102.

AN US

101.

M

100.

ED

99.

Terwee CB, Jansma EP, Riphagen II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 18:1115-1123, 2009 Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 21:651-657, 2012 Terwee CB, Prinsen CA, Chiarotto A, De Vet HC, Westerman MJ, Patrick DL, Alonso J, Bouter LM, Mokkink LB. COSMIN methodology for evaluating the content validity of patientreported outcome measures: a Delphi study. Qual Life Res. 27:1159-1170, 2018 Triano JJ, McGregor M, Cramer GD, Emde DL. A comparison of outcome measures for use with back pain patients: results of a feasibility study. J Manipulative Physiol Ther. 16:67-73, 1993 Turk DC, Dworkin RH, Allen RR, Bellamy N, Brandenburg N, Carr DB, Cleeland C, Dionne R, Farrar JT, Galer BS, Hewitt DJ, Jadad AR, Katz NP, Kramer LD, Manning DC, McCormick CG, McDermott MP, McGrath P, Quessy S, Rappaport BA, Robinson JP, Royal MA, Simon L, Stauffer JW, Stein W, Tollett J, Witter J. Core outcome domains for chronic pain clinical trials: IMMPACT recommendations. Pain. 106:337-345, 2003 Turk DC, Dworkin RH, Revicki D, Harding G, Burke LB, Cella D, Cleeland CS, Cowan P, Farrar JT, Hertz S, Max MB, Rappaport BA. Identifying important outcome domains for chronic pain clinical trials: an IMMPACT survey of people with pain. Pain. 137:276-285, 2008 van der Roer N, Ostelo RW, Bekkering GE, van Tulder MW, de Vet HC. Minimal clinically important change for pain intensity, functional status, and general health status in patients with nonspecific low back pain. Spine. 31:578-582, 2006 Whiting P, Savovic J, Higgins JP, Caldwell DM, Reeves BC, Shea B, Davies P, Kleijnen J, Churchill R. ROBIS: A new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 69:225-234, 2016 Whynes DK, McCahon RA, Ravenscroft A, Hodgkinson V, Evley R, Hardman JG. Responsiveness of the EQ-5D health-related quality-of-life instrument in assessing low back pain. Value Health. 16:124-132, 2013 Williams ACdC, Davies HTO, Chadury Y. Simple pain rating scales hide complex idiosyncratic meanings. Pain. 85:457-463, 2000 Williamson A, Hoggart B. Pain: a review of three commonly used pain rating scales. J Clin Nurs. 14:798-804, 2005 Wright AA, Cook CE. Criterion validation of the rate of recovery, single alphanumeric measure, in patients with low back pain. Physiother Res Int. 18:124-129, 2013

PT

98.

AC

FIGURE 1. Flow chart of results of search strategy and selection of records

Records retrieved PubMed EMBASE

10719 5097 3985

CINAHL1. Characteristics of the studies included in this systematic review TABLE 829 SportDiscus PRO M(s)

Refer ence

288 Lang uage

Title/abstract after duplicates removal

Study desig

LBP Charact

Measure ment

PROM (s)

PR OM

Pain constr

High anch

7896

21 Title/abstract excluded

Full-text assessed

Patients’ Characteristics

7851

ACCEPTED MANUSCRIPT

Grotle 2004 [36]

properti es

descri ption

Engli sh (US)

Focus groups and individ ual intervi ews

>2 months ± leg pain

Content validity

Crosssectio nal

Chronic

10 cm VAS 11point NRS BPIPS 100 mm VAS 100 mm vVAS 11point NRS 100 mm VAS

Engli sh (Austr alia)

Norw egian

Longit udinal

<3 weeks

Construc t validity

Responsi veness

Longit udinal

Love 1989 [64]

VAS

Beurs kens 1996 [3]

Engli sh (Austr alia)

Crosssectio nal

Dutch

RCT

AC

Thre e VAS s

ED

Norw egian

>3 months

Responsi veness

PT

Grotle 2004 [36]

CE

VAS, NRS

>6 months

>6 weeks

sco res (µ± SD)

60± 24 61± 24

Testretest reliability

Responsi veness

6.3 ±2. 3 39± 23

11point NRS

6.8 ±1. 8

100 mm VAS

34± 23

11point NRS

6.1 ±2. 4

M

VAS, NRS

Stron g 1991 [95]

eristics

uct

or*

Avera ge past 24 h

Worst pain

na

na

10 cm VAS 10 cm VAS 10 cm VAS 100 mm VAS

n

Fe mal e (%)

Ag e (ye ars, µ± SD)

13

54

45

CR IP T

Two VAS s, NRS

Robin sonPapp 2015 [88]

n

Intensi ty

For the time being During the last week For the time being During the last week Experi enced now At its worst At its best Avera ge severit y during last

Pain durati on (µ±S D)

Pain as bad as it could be

92

49

46 ±1 3

10±1 0 years

Pain as bad as it could be

54

73

38 ±1 0

10±7 days

Pain as bad as it could be

50

62

40 ±9

2±2 years

Intoler able pain

63

46

41 ±1 0

24 week s (medi an)

AN US

VAS, NRS, BPIPS

(Cou ntry)

81

22

ACCEPTED MANUSCRIPT PRO M(s)

Refer ence

Lang uage (Cou ntry)

Study desig n

LBP Charact eristics

Measure ment properti es

PROM (s) descri ption

PR OM sco res (µ± SD)

Pain constr uct

High anch or*

Patients’ Characteristics n

Fe mal e (%)

Ag e (ye ars, µ± SD)

Pain durati on (µ±S D)

Ostel o 2007 [76]

Dutch

Crosssectio nal

<4 weeks ± radiation (no pain ≥3 months before)

Measure ment error

VAS

Sheld on 2008 [91]

Engli sh (US)

Two RCTs

Responsi veness

VAS

Paun gmali 2012 [80]

Thai

Crosssectio nal

>3 months ± leg pain analgesi c intake ≥24 days/mo nth >3 months VAS score = 2-7

VAS

Fishb ain 2013 [29]

Engli sh (US)

Longit udinal

Hush 2010 [44]

Engli sh (Austr alia)

Focus groups

Curren t intensi ty

Worst imagi nable pain

100 mm VAS

17 6

40

43 ±1 2

77± 14

Intensi ty

Extre me pain

63 9

62

53 ±1 3

1/3 each: <1 week , 1-2 week s, 2-4 week s 11±1 1 years

Avera ge over the lumbosacral area Curren t

Extre me pain

13

69

26 ±6

1±1 years

Unbe arable pain

23 6

At its worst in the past 24 h At its least

Pain as bad as you can imagi

36

42

42 ±6

69% persi stent /recur rent, 31% recov

M

ED

PT

CE

AC Four NRS s

>6 months as primary complain t Persiste nt or recurrent LBP, or recovery from previous

100 mm VAS

AN US

VAS

CR IP T

week

Testretest reliability, Measure ment error

10 cm VAS

39± 9

Construc t validity

100 mm vVAS

62± 32

Content validity

11point NRS

23

ACCEPTED MANUSCRIPT Refer ence

Lang uage (Cou ntry)

Study desig n

LBP Charact eristics

Measure ment properti es

PROM (s) descri ption

PR OM sco res (µ± SD)

LBP

Engli sh (US)

RCT

NRS

Kova cs 2007 [56]

Spani sh (Spai n)

Longit udinal

NRS

Peng el 2004 [81]

Engli sh (Austr alia)

RCT

NRS †

Lauri dsen 2006 [59]

Danis h

NRS ††

Lauri dsen 2006

Danis h

± leg symptom s, ODI ≥ 30%

Testretest reliability, Measure ment error, Responsi veness

11point NRS

5.8 ±2. 0

ED

11point NRS

7.5 ±2. 0

Responsi veness

11point NRS

5.5 ±2. 1

Responsi veness

11point NRS

4.3 ±2. 3

Longit udinal

± leg pain

Responsi veness

11point NRS

4.9 ±2. 5

PT

Measure ment error, Responsi veness

Longit udinal

>14 days, ± leg pain NRS ≥ 3/10 >6 weeks and <3 months ± leg pain

CE

AC

High anch or*

in the past 24 h On the averag e Right now Curren t level during last 24 h Best level during last 24 h Worst level during last 24 h Lower back

ne

Avera ge over past week Back ± leg pain over past week Back ± leg pain

Patients’ Characteristics n

Fe mal e (%)

Ag e (ye ars, µ± SD)

Pain durati on (µ±S D) ery

Worst imagi nable pain

13 1

42

34 ±1 1

66% <6 week s

Worst imagi nable pain

13 49

68

54 ±1 5

9±8 years

Worst pain possib le

15 6

56

49 ±1 6

Worst possib le pain

94

53

44

Worst possib le

97

54

47

AN US

Child s 2005 [14]

M

Thre e NRS s**

Pain constr uct

CR IP T

PRO M(s)

73% ≤30 days, rest >30 days 12% ≤30 days,

24

ACCEPTED MANUSCRIPT Refer ence

Lang uage (Cou ntry)

Study desig n

LBP Charact eristics

Measure ment properti es

PROM (s) descri ption

PR OM sco res (µ± SD)

[59]

Longit udinal

Engli sh (UK)

Longit udinal

BPIPS

Keller 2004 [55]

Engli sh (US)

Longit udinal

BPIPS

Tan 2004 [97]

Engli sh (US)

AC

NRS

BPIPS

Whyn es 2013 [106]

11point NRS

6.4 ±1. 8

± leg pain

Measure ment error

11point NRS

6.2

>3 months ± leg pain

Testretest reliability, Measure ment error, Responsi veness Internal consiste ncy, Construc t validity, Responsi veness Internal consiste ncy, Structura l validity, Construc t validity Responsi veness

11point NRS

5.0 ±2. 6

Crosssectio nal

Engli sh (UK)

RCT

Chronic

over past week Intensi ty

High anch or*

BPIPS

Patients’ Characteristics n

Fe mal e (%)

Ag e (ye ars, µ± SD)

Pain durati on (µ±S D) rest 30 days

Very sever e pain

114

Intensi ty over past week Intensi ty

Worst possib le pain Worst imagi nable pain

14 7

66

46

48

67

52

na

na

13 1

50

46 ±1 4

8

55

AN US

Danis h

NRS

Measure ment error

M

RCT

ED

Dutch

PT

Van der Roer 2006 [104] Lauri dsen 2009 [60] Maug han 2010 [69]

CE

NRS ***

Pain constr uct

CR IP T

PRO M(s)

BPIPS

7.0 ±1. 8

na

na

44 0

BPIPS

8.1 ±3. 0

na

na

37

37% ≤6 mont hs 6 years (mea n)

10±7 days

Empty cells reflect data not assessed; na = not applicable. PROM = patient-reported outcome measure; n = sample size; μ = mean; SD = standard deviation; NRS = numeric rating scale; VAS = visual analogue scale; BPI-PS = Pain Severity subscale of the Brief Pain Inventory. v-VAS = vertical VAS; RCT = randomized controlled trial; ODI = Oswestry Disability Index. * = The low anchor was always „no pain‟; ** = The average of the 3 ratings was used to represent the patient‟s overall pain intensity; *** = Measurement error was calculated on unchanged patients but characteristics of those patients alone were not presented.

25

ACCEPTED MANUSCRIPT PRO M(s)

Refer ence

Lang uage (Cou ntry)

Study desig n

LBP Charact eristics

Measure ment properti es

PROM (s) descri ption

PR OM sco res (µ± SD)

Pain constr uct

High anch or*

Patients’ Characteristics n

Fe mal e (%)

Ag e (ye ars, µ± SD)

Pain durati on (µ±S D)

CR IP T

† = This study refers to primary care patients; †† = This study refers to secondary care patients. ‡ = These are scores were the same for patients with (sub)acute LBP or chronic LBP.

Table 2. Test-retest reliability and measurement error of pain intensity instruments in patients with low back pain

Love 1989 [64]

Experienc ed now At its worst At its best

VAS

Ostelo 2007 [76]

Current intensity

VAS

Paungm ali 2012 [80]

CE Childs 2005 [14]

AC

Three NRSs*

Average over the lumbosacral area Current, best, and worst level during last 24 h Lower back

NRS

Kovacs 2007 [56]

NRS

van der Roer

Intensity

Measurement error

n

Study quality

Time interval( s)

ICC (95 % CI)

6 3

DOUBTF UL

Some days

0.77 † 0.49 † 0.57 †

n

Study quality

Time interval( s)

SEM (95% CI, % scale rang e)

SDC* (95% CI, % scale rang e)

176

DOUBTFUL

Maximu m 24 hours

13 (1215, 13) ‡ 0.1 (- , 1) ‡

36 (3241, 36) ‡ 0.3 (-, 3) ††

AN US

Three VASs

Test- retest reliability

M

Pain construct

ED

Referen ce

PT

PROM( s)

1 3

DOUBTF UL

48 hours

0.90 ‡

13

INADEQUA TE

48 hours

4 1

ADEQUA TE

1 week

0.61 (0.3 00.77 )‡

41

ADEQUAT E

1 week

1.0 (-, 10) ‡

2.8 (-, 28) ††

209 #

ADEQUAT E

12 weeks

52 ##

DOUBTFUL

12 weeks

1.3 (-, 13) †† 1.7 (-,

3.5 (3.23.8, 35) 4.7 (3.3-

26

ACCEPTED MANUSCRIPT 2006 [104] 62 ##

Lauridse n 2009 [60]

Intensity over past week

NRS

Maugha n 2010 [69]

Intensity

2 5

ADEQUA TE

5 weeks

0.92 †

55

ADEQUAT E

1 week

25

ADEQUAT E

5 weeks

8.0, 47)

1.6 (- , 16) †† 1.0 (-, 10) †† 0.9 (-, 9) ‡

4.5 (3.46.7, 45) 2.8 (-, 28)

CR IP T

NRS

17) ††

2.4 (-, 24) ‡

AN US

Empty cells represent aspects not assessed. n = number of patients; ICC = Intraclass Correlation Coefficient; 95% CI = 95% confidence interval; SEM = Standard Error of Measurement; SDC = Smallest Detectable Change. *= the average of the 3 ratings was used to represent the patient‟s overall pain intensity. † = this value represents a Pearson product-moment correlation and not an intraclass correlation coefficient; ‡ = unclear if ICCconsistency SEMconsistency or ICCagreement, SEMagreement was used; †† = this SEM or SDC was not reported in the article but it was calculated from the available data (SDC was calculated as SEM x √2 x 1.96); # = the sample size for the measurement error of the NRS for LBP was not reported in the article, therefore this number includes also patients with leg pain; ## = 52 were patients with (sub)acute LBP, while 62 were patients with chronic LBP.

Refere nce

Pain constr uct

n

Two VAS, NRS

Strong 1991 [95]

Intensit y

92

BPIPS

BPIPS

Tan 2004 [97]

INADEQU ATE

PT

CE

Fishbai n 2013 [29] Keller 2004 [55]

AC

VAS

Study quality

Current lower back Intensit y

Intensit y

Correlations with other measurement instruments measuring similar, related or unrelated constructs

ED

PROM (s)

M

TABLE 3. Construct validity (hypotheses testing) of pain intensity instruments in patients with low back pain

23 6

ADEQUA TE

13 1

ADEQUA TE

44 0

VERY GOOD

NRS

VAS

NRS

0.81

VAS

0.81

vVAS 0.70

BRS

VRS

0.53

0.81

v0.70 0.71 VAS 0.17 with pain thresholds 0.29 with pain tolerance CPG IS

DS

0.6 0

0.4 9

RMD Q 0.57

PPI

PRI

0.71

NRS -101 0.85

0.51

0.20

0.50

0.64

0.81

0.48

0.22

0.43

0.54

0.73

0.45

0.25

SF-36 BP

PF

RP

GH

V

SF

RE

0.6 1

0.6 3

0.5 4

0.3 7

0.4 7

0.5 1

0.4 1

M H 0.4 1

0.40 with Roland Morris Disability Questionnaire

PROM = patient-reported outcome measure; VAS = Visual Analogue Scale; NRS = Numeric Rating Scale; BPI-PS = Pain Severity subscale of the Brief Pain Inventory.

27

ACCEPTED MANUSCRIPT VAS = horizontal Visual Analogue Scale; v-VAS = vertical Visual Analogue Scale; BRS = Behavioral Rating Scale; VRS = Verbal Rating Scale with four response options (e.g. “no pain”, “some pain”); NRS-101 = Numeric Rating Scale in which the patient should choose a number between 0 and 100 that indicates his/her level of pain; PPI = Present Pain Intensity ranging from 1 (“mild) to 5 (“excruciating”); PRI = Pain Rating Index of the McGill Pain Questionnaire; BPI-PS = Pain Severity subscale of the Brief Pain Inventory; CPG-IS = Intensity Scale of the Chronic Pain Grade; CPG-DS = Disability Scale of the Chronic Pain Grade; SF36-BP = Bodily Pain subscale of the Short Form 36; SF36-PF = Physical functioning subscale of the Short Form 36; SF36-RP = Role Physical subscale of the Short Form 36; SF36-GH = General Health subscale of the Short Form 36; SF36-V = Vitality subscale for the Short Form 36; SF36-SF = Social Functioning subscale of the Short Form 36; SF36-RE = Role Emotional subscale of the Short Form 36; SF36-MH = Mental Health subscale of the Short Form 36.

Study quality

Time inter val

Criteri on

PR OM

VAS, NRS

Grotle 2004 [36]

DOUBTF UL

4 week s

6-point GPES from “worse” to “compl etely recover ed”

VAS

Pain constr uct

n

Better , same, worse (%)

Correla tion with criterio n

AU C % (95 % CI)

ESs* or SRMs** (95% CI)

Correlat ions with change s in other instrum ents

42

74 better 26 same

0.59

91 (83 ; 10 0)

0.7 (0.4;1.0) SRM overall 1.6 (1.1;2.0) SRM better -0.5 (0.8;0.5) SRM same

0.64 to RMDQ 0.59 to ODI 0.49 to DRI 0.67 to SF36PF 0.65 to NRS

45

76 better 24 same

0.76

93 (86 ; 10 0)

1.1 (0.8;1.5) SRM overall 2.0 (1.4;2.6) SRM better 1.0 (0.6;1.7) SRM same

0.68 to RMDQ 0.58 to ODI 0.58 to DRI 0.38 to SF36PF 0.65 to VAS

AN US

Ref

For the time being

AC

CE

PT

ED

M

PRO M(s)

CR IP T

TABLE 4. Responsiveness (hypotheses testing) of pain intensity instruments in patients with low back pain

NR S

During last week

28

ACCEPTED MANUSCRIPT Ref

Study quality

Time inter val

Criteri on

PR OM

Pain constr uct

n

Better , same, worse (%)

Correla tion with criterio n

AU C % (95 % CI)

ESs* or SRMs** (95% CI)

Correlat ions with change s in other instrum ents

VAS, NRS

Grotle 2004 [36]

DOUBTF UL

3 mont hs

6-point GPES from “worse” to “compl etely recover ed”

VAS

For the time being

33

48 better 52 same

0.24

71 (54 ; 88)

-0.1 (0.4; 1.0) SRM overall 0.4 (0.2; 0.9) SRM better 0.1 (1.1; 0.3) SRM same 0.3 (0.0; 0.6) SRM overall 1.1 (0.4; 1.7) SRM better -0.2 (0.6; 0.4) SRM same 1.6 SRM better 0.1 SRM same

0.40 to RMDQ 0.35 to ODI 0.13 to DRI -0.08 to SF36PF 0.30 to NRS

1.8-2.6 ES overall ♯♯

0.660.70 to RMDQ ♯♯

AN US During last week

39

49 better 51 same

47 better 48 same 6 worse

0.52

82 (67 ; 96)

ADEQUA TE

5 week s

AC

7-point GPES from “compl etely recover ed” to “vastly worsen ed” 5-point PGAR T from “excell ent” to “none” 11point GPES

PT

Beursk ens 1996 [3]

CE

VAS

ED

M

NR S

CR IP T

PRO M(s)

VAS

Sheldo n 2008 [91]

DOUBTF UL

12 week s

NRS

Pengel 2004 [81]

DOUBTF UL

6 week s

VAS

Avera ge severit y during last week

81†

91

VAS

Lower back intensi ty

639

0.680.74 ♯♯

NR S

Avera ge over

156

0.50

88 (85 ; 90)

0.52 to RMDQ 0.42 to ODI 0.16 to DRI 0.13 to SF36PF 0.30 to VAS

1.3 (1.2; 1.4) ES overall †

29

ACCEPTED MANUSCRIPT

DOUBTF UL

Time inter val

1 week

from “vastly worse” to “compl etely recover ed” 15point RS from “a great deal worse” to “a very great deal better” ♯

PR OM

Pain constr uct

n

Better , same, worse (%)

131 ††

65 better 33 same 2 worse

PT

Laurid sen 2006 [59]

ADEQUA TE

8 week s

AC

CE

NRS

NRS

Laurid sen 2006

ADEQUA TE

8 week s

7-point GPES from “much better” to “much worse”, and NRS to score pain change importa nce 7-point GPES from

Correla tion with criterio n

AU C % (95 % CI)

ESs* or SRMs** (95% CI)

Correlat ions with change s in other instrum ents

past week

NR S

Curren t, best, and worst level during last 24 h

ED

4 week s

Criteri on

CR IP T

Childs 2005 [14]

Study quality

AN US

Three NRSs ***

Ref

M

PRO M(s)

72 (62 ; 81)

82 better 13 same 4 worse

92 (86 ; 97)

NR S

Back and/or leg over past week

85 ‡

73 better 27 same

65 in LB P onl y

NR S

Back and/or leg

59 ‡‡

31 better 69

62 in LB

0.9 SRM overall 1.4 SRM better 0.5 SRM same 1.2 SRM overall 1.5 SRM better 0.6 SRM same 1.5 (1.2; 1.8) SRM better 0.8 (0.3; 1.3) SRM same

0.9 (0.4; 1.3) SRM

30

ACCEPTED MANUSCRIPT Time inter val

[59]

Kovac s 2007 [56]

DOUBTF UL

NRS

Maugh an 2010 [69]

DOUBTF UL

12 week s

“much better” to “much worse”, and NRS to score pain change importa nce 4-point RS from “compl etely recover ed” to “worse ned”

PR OM

Pain constr uct

n

over past week

Better , same, worse (%)

Correla tion with criterio n

same

AU C % (95 % CI)

ESs* or SRMs** (95% CI)

P onl y

better 0.2 (0.1; 0.5) SRM same

NR S

Lower back

134 9

33 recov ered 50 better 16 same 1 worse

95 (93 ; 97)

48

48 better 52 same

50

131

34 better 50 same 16 worse

AC

CE

PT

ED

NRS

Criteri on

BPIPS

Keller 2004 [55]

INADEQ UATE

5 week s

7-poing GPES from “compl etely recover ed” to “vastly worsen ed” RMDQ

Correlat ions with change s in other instrum ents

CR IP T

Study quality

AN US

Ref

M

PRO M(s)

NR S

Intensi ty

BPIPS

na

3.2 SRM recovere d# -2.0 SRM improve d# -0.5 SRM unchang ed# 1.6 SRM deteriora ted#

-1.1 SRM improve d -0.4 SRM

31

ACCEPTED MANUSCRIPT PRO M(s)

Ref

Study quality

Time inter val

Criteri on

PR OM

Pain constr uct

n

Better , same, worse (%)

Correla tion with criterio n

AU C % (95 % CI)

ESs* or SRMs** (95% CI)

Correlat ions with change s in other instrum ents

Whyne s 2013 [106]

INADEQ UATE

12 week s

BPIPS

na

37

0.9 (0.8-1.0) SRM overall

0.66 with BPI-PI 0.70 with ODI -0.57 with EQ5DUS -0.56 with EQ5DVAS

M

AN US

BPIPS

CR IP T

unchang ed 0.3 SRM deteriora ted

CE

PT

ED

Empty cells indicate not available or not assessed data; na = not applicable. PROM = patient-reported outcome measure; AUC = area under the ROC curve; 95% CI = 95% confidence interval; ES = effect size; SRM = standardized response mean. NRS = numeric rating scale; VAS = visual analogue scale; BPI-PS = Pain Severity subscale of the Brief Pain Inventory. GPES = global perceived effect scale; RMDQ = Roland Morris Disability Questionnaire; ODI = Oswestry Disability Index; DRI = Disability Rating Index; SF36-PF = physical functioning subscale of the Short Form 36; PGART = patient global assessment of response to therapy; RS = rating scale; BPI-PI = pain interference subscale of the Brief Pain Inventory; EQ5D-US = utility score of the EuroQol-5D; EQ5D-VAS = visual analogue scale of the EuroQol-5D. * ESs were calculated by dividing the mean change by the baseline standard deviation; ** SRMs were calculated by dividing the mean change by its standard deviation; *** The average of the 3 ratings was used to represent the patient‟s overall pain intensity. † = in this case an 84% CI was presented; †† = 125 patients completed the 1-week follow-up, 119 patients the 4-week follow-up; ‡ = primary care patients; ‡‡ = secondary care patients; ♯ = these ESs or SRMs were not reported in the article but calculated from the available data; ♯♯ = this is the range of correlations or ESs found in the three separate arms of this study (i.e. etoricoxib 60 mg, etoricoxib 90 mg, placebo).

AC

Table 5. Evidence synthesis on measurement properties of pain intensity instruments in patients with low back pain Measurement properties

VAS

NRS

BPI-PS

Content validity

± Low

± Low

± Low

± Low

± Low

± Low

+

+

+

Relevance

Rating Quality of evidence Comprehensiveness Rating Quality of evidence Comprehensibility Rating

32

ACCEPTED MANUSCRIPT

Structural validity

Internal consistency

Test-retest reliability

Measurement error

Construct validity

Very low

Very low

na

na

+ Moderate

na

na

+ Moderate

+ Very Low ± Very Low ± Low

± Low

± Low

‒ High

± Very Low

± Moderate

± Moderate

± Very Low

AN US

Responsiveness

Very low

CR IP T

Quality of evidence Rating Quality of evidence Rating Quality of evidence Rating Quality of evidence Rating Quality of evidence Rating Quality of evidence Rating Quality of evidence

ED

M

Empty cells represent measurement properties not assessed in any study. The cross-cultural validity row is not displayed because it was not assessed in any study. VAS = Visual Analogue Scale; NRS = Numeric Rating Scale; BPI-PS = Pain Severity subscale of the Brief Pain Inventory. “+” = sufficient results; “‒” = insufficient results; “±” = inconsistent results; na = measurement property not applicable.

PT

Search Strategy in MEDLINE

AC

CE

#1 "Visual Analog Scale"[Mesh] OR (visual analog*[tiab] AND (scale[tiab] OR score[tiab])) OR vas[tw] OR (numeric*[tiab] AND rating [tiab] AND (scale[tiab] OR score[tiab])) OR numeric scale[tiab] OR nrs[tw] OR nprs[tw] OR (brief pain[tiab] AND (inventory[tiab] OR index[tiab] OR scale[tiab] OR questionnaire[tiab])) OR bpi[tw] #2 "Back Pain"[Mesh] OR back pain[tiab] OR lbp[tw] OR "Back Injuries"[Mesh] OR back injury[tiab] OR "Lumbar Vertebrae"[Mesh] OR lumbar spine[tiab] OR "Sacroiliac Joint"[Mesh] OR sacroiliac pain[tiab] OR "Zygapophyseal Joint"[Mesh] OR facet pain[tiab] OR "Sciatica"[Mesh] OR sciatica[tiab] OR "Spinal Stenosis"[Mesh] OR lumbar stenosis[tiab]OR "Intervertebral Disc Displacement"[Mesh] OR herniated dis*[tiab] OR disc pain[tiab] OR disk pain[tiab] OR "Spondylolisthesis"[Mesh] OR spondylolisthesis[tiab] #3 33

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

(instrumentation[sh] OR methods[sh] OR "Validation Studies"[pt] OR "Comparative Study"[pt] OR "psychometrics"[MeSH] OR psychometr*[tiab] OR clinimetr*[tw] OR clinometr*[tw] OR "outcome assessment (health care)"[MeSH] OR "outcome assessment"[tiab] OR "outcome measure*"[tw] OR "observer variation"[MeSH] OR "observer variation"[tiab] OR "Health Status Indicators"[Mesh] OR "reproducibility of results"[MeSH] OR reproducib*[tiab] OR "discriminant analysis"[MeSH] OR reliab*[tiab] OR unreliab*[tiab] OR valid*[tiab] OR "coefficient of variation"[tiab] OR coefficient[tiab] OR homogeneity[tiab] OR homogeneous[tiab] OR "internal consistency"[tiab] OR (cronbach*[tiab] AND (alpha[tiab] OR alphas[tiab])) OR (item[tiab] AND (correlation*[tiab] OR selection*[tiab] OR reduction*[tiab])) OR agreement[tw] OR precision[tw] OR imprecision[tw] OR "precise values"[tw] OR test-retest[tiab] OR (test[tiab] AND retest[tiab]) OR (reliab*[tiab] AND (test[tiab] OR retest[tiab])) OR stability[tiab] OR interrater[tiab] OR inter-rater[tiab] OR intrarater[tiab] OR intra-rater[tiab] OR intertester[tiab] OR inter-tester[tiab] OR intratester[tiab] OR intra-tester[tiab] OR interobserver[tiab] OR inter-observer[tiab] OR intraobserver[tiab] OR intra-observer[tiab] OR intertechnician[tiab] OR inter-technician[tiab] OR intratechnician[tiab] OR intra-technician[tiab] OR interexaminer[tiab] OR inter-examiner[tiab] OR intraexaminer[tiab] OR intra-examiner[tiab] OR interassay[tiab] OR interassay[tiab] OR intraassay[tiab] OR intra-assay[tiab] OR interindividual[tiab] OR inter-individual[tiab] OR intraindividual[tiab] OR intra-individual[tiab] OR interparticipant[tiab] OR inter-participant[tiab] OR intraparticipant[tiab] OR intra-participant[tiab] OR kappa[tiab] OR kappa's[tiab] OR kappas[tiab] OR repeatab*[tw] OR ((replicab*[tw] OR repeated[tw]) AND (measure[tw] OR measures[tw] OR findings[tw] OR result[tw] OR results[tw] OR test[tw] OR tests[tw])) OR generaliza*[tiab] OR generalisa*[tiab] OR concordance[tiab] OR (intraclass[tiab] AND correlation*[tiab]) OR discriminative[tiab] OR "known group"[tiab] OR "factor analysis"[tiab] OR "factor analyses"[tiab] OR "factor structure"[tiab] OR "factor structures"[tiab] OR dimension*[tiab] OR subscale*[tiab] OR (multitrait[tiab] AND scaling[tiab] AND (analysis[tiab] OR analyses[tiab])) OR "item discriminant"[tiab] OR "interscale correlation*"[tiab] OR error[tiab] OR errors[tiab] OR "individual variability"[tiab] OR "interval variability"[tiab] OR "rate variability"[tiab] OR (variability[tiab] AND (analysis[tiab] OR values[tiab])) OR (uncertainty[tiab] AND (measurement[tiab] OR measuring[tiab])) OR "standard error of measurement"[tiab] OR sensitiv*[tiab] OR responsive*[tiab] OR (limit[tiab] AND detection[tiab]) OR "minimal detectable concentration"[tiab] OR interpretab*[tiab] OR ((minimal[tiab] OR minimally[tiab] OR clinical[tiab] OR clinically[tiab]) AND (important[tiab] OR significant[tiab] OR detectable[tiab]) AND (change[tiab] OR difference[tiab])) OR (small*[tiab] AND (real[tiab] OR detectable[tiab]) AND (change[tiab] OR difference[tiab])) OR "meaningful change"[tiab] OR "ceiling effect"[tiab] OR "floor effect"[tiab] OR "Item response model"[tiab] OR IRT[tiab] OR Rasch[tiab] OR "Differential item functioning"[tiab] OR DIF[tiab] OR "computer adaptive testing"[tiab] OR "item bank"[tiab] OR "cross-cultural equivalence"[tiab]) #4 ("addresses"[Publication Type] OR "biography"[Publication Type] OR "case reports"[Publication Type] OR "comment"[Publication Type] OR "directory"[Publication Type] OR "editorial"[Publication Type] OR "festschrift"[Publication Type] OR "interview"[Publication Type] OR "lectures"[Publication Type] OR "legal cases"[Publication Type] OR "legislation"[Publication Type] OR "letter"[Publication Type] OR "news"[Publication Type] OR "newspaper article"[Publication Type] OR "patient education handout"[Publication Type] OR "popular works"[Publication Type] OR "congresses"[Publication Type] OR "consensus development conference"[Publication Type] OR "consensus development

34

ACCEPTED MANUSCRIPT conference, nih"[Publication Type] OR "practice guideline"[Publication Type]) NOT ("animals"[MeSH Terms] NOT "humans"[MeSH Terms]) #5 #1 AND #2 AND #3

CR IP T

#6 #5 NOT #4

Search Strategy in EMBASE

AN US

#1 ‘visual analog scale’/exp OR (‘visual’:ab,ti AND (‘analogue’:ab,ti OR ‘analog’:ab,ti) AND (‘scale’:ab,ti OR ‘score’:ab,ti)) OR vas OR ((‘numeric’:ab,ti OR ‘numerical’:ab,ti) AND ‘rating’:ab,ti AND (‘scale’:ab,ti OR ‘score’:ab,ti)) OR ‘numeric scale’:ab,ti OR nrs OR nprs OR (‘brief pain’:ab,ti AND (‘inventory’:ab,ti OR ‘index’:ab,ti OR ‘scale’:ab,ti OR ‘questionnaire’:ab,ti)) OR bpi AND [embase]/lim

M

#2 'backache'/exp OR ‘back pain’:ab,ti OR lbp OR 'lumbar spine'/exp OR ‘lumbar spine’:ab,ti OR ‘lumbar spine’:ab,ti OR ‘sacroiliac joint’/exp OR ‘sacroiliac pain’:ab,ti OR ‘zygapophyseal joint’/exp OR ‘zygapophyseal pain’:ab,ti OR ‘facet pain’:ab,ti OR 'sciatica'/exp OR ‘sciatica’:ab,ti OR 'lumbar spinal stenosis'/exp OR ‘lumbar stenosis’:ab,ti OR 'lumbar disk hernia'/exp OR ‘lumbar hernia’:ab,ti OR ‘disk pain’:ab,ti OR 'spondylolisthesis'/exp OR ‘spondylolisthesis’:ab,ti AND *embase+/lim

AC

CE

PT

ED

#3 'intermethod comparison'/exp OR 'data collection method'/exp OR 'validation study'/exp OR 'feasibility study'/exp OR 'pilot study'/exp OR 'psychometry'/exp OR 'reproducibility'/exp OR reproducib*:ab,ti OR 'audit':ab,ti OR psychometr*:ab,ti OR clinimetr*:ab,ti OR clinometr*:ab,ti OR 'observer variation'/exp OR 'observer variation':ab,ti OR 'discriminant analysis'/exp OR 'validity'/exp OR reliab*:ab,ti OR valid*:ab,ti OR 'coefficient':ab,ti OR 'internal consistency':ab,ti OR (cronbach*:ab,ti AND ('alpha':ab,ti OR 'alphas':ab,ti)) OR 'item correlation':ab,ti OR 'item correlations':ab,ti OR 'item selection':ab,ti OR 'item selections':ab,ti OR 'item reduction':ab,ti OR 'item reductions':ab,ti OR 'agreement':ab,ti OR 'precision':ab,ti OR 'imprecision':ab,ti OR 'precise values':ab,ti OR 'test-retest':ab,ti OR ('test':ab,ti AND 'retest':ab,ti) OR (reliab*:ab,ti AND ('test':ab,ti OR 'retest':ab,ti)) OR 'stability':ab,ti OR 'interrater':ab,ti OR 'inter-rater':ab,ti OR 'intrarater':ab,ti OR 'intra-rater':ab,ti OR 'intertester':ab,ti OR 'inter-tester':ab,ti OR 'intratester':ab,ti OR 'intratester':ab,ti OR 'interobeserver':ab,ti OR 'inter-observer':ab,ti OR 'intraobserver':ab,ti OR 'intraobserver':ab,ti OR 'intertechnician':ab,ti OR 'inter-technician':ab,ti OR 'intratechnician':ab,ti OR 'intra-technician':ab,ti OR 'interexaminer':ab,ti OR 'inter-examiner':ab,ti OR 'intraexaminer':ab,ti OR 'intra-examiner':ab,ti OR 'interassay':ab,ti OR 'inter-assay':ab,ti OR 'intraassay':ab,ti OR 'intraassay':ab,ti OR 'interindividual':ab,ti OR 'inter-individual':ab,ti OR 'intraindividual':ab,ti OR 'intraindividual':ab,ti OR 'interparticipant':ab,ti OR 'inter-participant':ab,ti OR 'intraparticipant':ab,ti OR 'intra-participant':ab,ti OR 'kappa':ab,ti OR 'kappas':ab,ti OR 'coefficient of variation':ab,ti OR repeatab*:ab,ti OR (replicab*:ab,ti OR 'repeated':ab,ti AND ('measure':ab,ti OR 'measures':ab,ti OR 35

ACCEPTED MANUSCRIPT

AN US

CR IP T

'findings':ab,ti OR 'result':ab,ti OR 'results':ab,ti OR 'test':ab,ti OR 'tests':ab,ti)) OR generaliza*:ab,ti OR generalisa*:ab,ti OR 'concordance':ab,ti OR ('intraclass':ab,ti AND correlation*:ab,ti) OR 'discriminative':ab,ti OR 'known group':ab,ti OR 'factor analysis':ab,ti OR 'factor analyses':ab,ti OR 'factor structure':ab,ti OR 'factor structures':ab,ti OR 'dimensionality':ab,ti OR subscale*:ab,ti OR 'multitrait scaling analysis':ab,ti OR 'multitrait scaling analyses':ab,ti OR 'item discriminant':ab,ti OR 'interscale correlation':ab,ti OR 'interscale correlations':ab,ti OR ('error':ab,ti OR 'errors':ab,ti AND (measure*:ab,ti OR correlat*:ab,ti OR evaluat*:ab,ti OR 'accuracy':ab,ti OR 'accurate':ab,ti OR 'precision':ab,ti OR 'mean':ab,ti)) OR 'individual variability':ab,ti OR 'interval variability':ab,ti OR 'rate variability':ab,ti OR 'variability analysis':ab,ti OR ('uncertainty':ab,ti AND ('measurement':ab,ti OR 'measuring':ab,ti)) OR 'standard error of measurement':ab,ti OR sensitiv*:ab,ti OR responsive*:ab,ti OR ('limit':ab,ti AND 'detection':ab,ti) OR 'minimal detectable concentration':ab,ti OR interpretab*:ab,ti OR (small*:ab,ti AND ('real':ab,ti OR 'detectable':ab,ti) AND ('change':ab,ti OR 'difference':ab,ti)) OR 'meaningful change':ab,ti OR 'minimal important change':ab,ti OR 'minimal important difference':ab,ti OR 'minimally important change':ab,ti OR 'minimally important difference':ab,ti OR 'minimal detectable change':ab,ti OR 'minimal detectable difference':ab,ti OR 'minimally detectable change':ab,ti OR 'minimally detectable difference':ab,ti OR 'minimal real change':ab,ti OR 'minimal real difference':ab,ti OR 'minimally real change':ab,ti OR 'minimally real difference':ab,ti OR 'ceiling effect':ab,ti OR 'floor effect':ab,ti OR 'item response model':ab,ti OR 'irt':ab,ti OR 'rasch':ab,ti OR 'differential item functioning':ab,ti OR 'dif':ab,ti OR 'computer adaptive testing':ab,ti OR 'item bank':ab,ti OR 'cross-cultural equivalence':ab,ti AND [embase]/lim

ED

M

#4 #1 AND #2 AND #3

Search Strategy in CINAHL

AC

CE

PT

#1 ((TI visual OR AB visual) AND (TI analog* OR AB analog*) AND (TI scale OR AB scale OR TI score OR AB score)) OR vas OR ((TI numeric* OR AB numeric*) AND (TI rating OR AB rating) AND (TI scale OR AB scale OR TI score OR AB score)) OR (TI numeric scale OR AB numeric scale) OR nrs OR nprs OR ((TI brief pain OR AB brief pain) AND (TI inventory OR AB inventory OR TI index OR AB index OR TI scale OR AB scale OR TI questionnaire OR AB questionnaire)) OR bpi #2 (MH “Back Pain+”) OR TI back pain OR AB back pain OR lbp OR (MH “Back Injuries+”) OR TI back injury OR AB back injury OR (MH “Lumbar Vertebrae”) OR TI lumbar spine OR AB lumbar spine OR (MH “Sacroiliac Joint”) OR (MH “Sacroiliac Joint Dysfunction”) OR TI sacroiliac pain OR AB sacroiliac pain OR (MH “Zygapophyseal Joint”) OR TI facet pain OR AB facet pain OR (MH “Sciatica”) OR TI sciatica OR AB sciatica OR (MH “Spinal Stenosis”) OR TI lumbar stenosis OR AB lumbar stenosis OR (MH “Intervertebral Disk Displacement”) OR TI herniated disk OR AB herniated disk OR TI disk pain OR AB disk pain OR (MH “Spondylosis+”) OR TI spondylolisthesis OR AB spondylolisthesis

36

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

#3 TI psychometr* OR TI observer variation OR TI reproducib* OR TI reliab* OR TI unreliab* OR TI valid* OR TI coefficient OR TI homogeneity OR TI homogeneous OR TI “internal consistency” OR AB psychometr* OR AB observer variation OR AB reproducib* OR AB reliab* OR AB unreliab* OR AB valid* OR AB coefficient OR AB homogeneity OR AB homogeneous OR AB “internal consistency” OR (TI cronbach* OR AB cronbach* AND (TI alpha OR AB alpha OR TI alphas OR AB alphas)) OR (TI item OR AB item AND (TI correlation* OR AB correlation* OR TI selection* OR AB selection* OR TI reduction* OR AB reduction*)) OR TI agreement OR TI precision OR TI imprecision OR TI “precise values” OR TI test-retest OR AB agreement OR AB precision OR AB imprecision OR AB “precise values” OR AB test-retest OR (TI test OR AB test AND TI retest OR AB retest) OR (TI reliab* OR AB reliab* AND (TI test OR AB test OR TI retest or AB retest)) OR TI stability OR TI interrater OR TI interrater OR TI intrarater OR TI intra-rater OR TI intertester OR TI inter-tester OR TI intratester OR TI intra-tester OR TI interobserver OR TI inter-observer OR TI intraobserver OR TI intra-observer OR TI intertechnician OR TI inter-technician OR TI intratechnician OR TI intra-technician OR TI interexaminer OR TI inter-examiner OR TI intraexaminer OR TI intra-examiner OR TI interassay OR TI inter-assay OR TI intraassay OR TI intra-assay OR TI interindividual OR TI inter-individual OR TI intraindividual OR TI intra-individual OR TI interparticipant OR TI inter-participant OR TI intraparticipant OR TI intra-participant OR TI kappa OR TI kappa’s OR TI kappas OR TI repeatab* OR AB stability OR AB interrater OR AB inter-rater OR AB intrarater OR AB intra-rater OR AB intertester OR AB inter-tester OR AB intratester OR AB intra-tester OR AB interobserver OR AB inter-observer OR AB intraobserver OR AB intra-observer OR AB intertechnician OR AB inter-technician OR AB intratechnician OR AB intra-technician OR AB interexaminer OR AB inter-examiner OR AB intraexaminer OR AB intra-examiner OR AB interassay OR AB inter-assay OR AB intraassay OR AB intra-assay OR AB interindividual OR AB inter-individual OR AB intraindividual OR AB intra-individual OR AB interparticipant OR AB inter-participant OR AB intraparticipant OR AB intra-participant OR AB kappa OR AB kappa’s OR AB kappas OR AB repeatab* OR ((TI replicab* OR AB replicab* OR TI repeated OR AB repeated) AND (TI measure OR AB measure OR TI measures OR AB measures OR TI findings OR AB findings OR TI result OR AB result OR TI results OR AB results OR TI test OR AB test OR TI tests OR AB tests)) OR TI generaliza* OR TI generalisa* OR TI concordance OR AB generaliza* OR AB generalisa* OR AB concordance OR (TI intraclass OR AB intraclass AND TI correlation* or AB correlation*) OR TI discriminative OR TI “known group” OR TI factor analysis OR TI factor analyses OR TI dimension* OR TI subscale* OR AB discriminative OR AB “known group” OR AB factor analysis OR AB factor analyses OR AB dimension* OR AB subscale* OR (TI multitrait OR AB multitrait AND TI scaling OR AB scaling AND (TI analysis OR AB analysis OR TI analyses OR AB analyses)) OR TI item discriminant OR TI interscale correlation* OR TI error OR TI errors OR TI “individual variability” OR AB item discriminant OR AB interscale correlation* OR AB error OR AB errors OR AB “individual variability” OR (TI variability OR AB variability AND (TI analysis OR AB analysis OR TI values OR AB values)) OR (TI uncertainty OR AB uncertainty AND (TI measurement OR AB measurement OR TI measuring OR AB measuring)) OR TI “standard error of measurement” OR TI sensitiv* OR TI responsive* OR AB “standard error of measurement” OR AB sensitiv* OR AB responsive* OR ((TI minimal OR TI minimally OR TI clinical OR TI clinically OR AB minimal OR AB minimally OR AB clinical OR AB clinically) AND (TI important OR TI significant OR TI detectable OR AB important OR AB significant OR AB detectable) AND (TI change OR AB change OR TI difference OR AB difference)) OR (TI small* OR AB small* AND (TI real OR AB real OR TI detectable OR AB detectable) AND (TI change OR AB change OR TI difference OR AB difference)) OR TI meaningful change OR TI “ceiling effect” OR 37

ACCEPTED MANUSCRIPT TI “floor effect” OR TI “Item response model” OR TI IRT OR TI Rasch OR TI “Differential item functioning” OR TI DIF OR TI “computer adaptive testing” OR TI “item bank” OR TI “cross-cultural equivalence” OR TI outcome assessment OR AB meaningful change OR AB “ceiling effect” OR AB “floor effect” OR AB “Item response model” OR AB IRT OR AB Rasch OR AB “Differential item functioning” OR AB DIF OR AB “computer adaptive testing” OR AB “item bank” OR AB “cross-cultural equivalence” OR AB outcome assessment

CR IP T

#4 #1 AND #2 AND #3

Search Strategy in SportDiscus

AN US

#1 ((TI visual OR AB visual) AND (TI analog* OR AB analog*) AND (TI scale OR AB scale OR TI score OR AB score)) OR vas OR ((TI numeric* OR AB numeric*) AND (TI rating OR AB rating) AND (TI scale OR AB scale OR TI score OR AB score)) OR (TI numeric scale OR AB numeric scale) OR nrs OR nprs OR ((TI brief pain OR AB brief pain) AND (TI inventory OR AB inventory OR TI index OR AB index OR TI scale OR AB scale OR TI questionnaire OR AB questionnaire)) OR bpi

ED

M

#2 DE "BACKACHE" OR TI back pain OR AB back pain OR lbp OR TI back injury OR AB back injury OR DE "LUMBAR vertebrae" OR TI lumbar spine OR AB lumbar spine OR DE "SACROILIAC joint" OR TI sacroiliac pain OR AB sacroiliac pain OR DE "ZYGAPOPHYSEAL joint" OR TI facet pain OR AB facet pain OR DE "SCIATICA" OR TI sciatica OR AB sciatica OR (DE "SPINAL canal -- Stenosis") OR TI lumbar stenosis OR AB lumbar stenosis OR DE "INTERVERTEBRAL disk displacement" OR TI herniated disk OR AB herniated disk OR TI disk pain OR AB disk pain OR DE "SPONDYLOLISTHESIS" OR TI spondylolisthesis OR AB spondylolisthesis

AC

CE

PT

#3 TI psychometr* OR TI observer variation OR TI reproducib* OR TI reliab* OR TI unreliab* OR TI valid* OR TI coefficient OR TI homogeneity OR TI homogeneous OR TI “internal consistency” OR AB psychometr* OR AB observer variation OR AB reproducib* OR AB reliab* OR AB unreliab* OR AB valid* OR AB coefficient OR AB homogeneity OR AB homogeneous OR AB “internal consistency” OR (TI cronbach* OR AB cronbach* AND (TI alpha OR AB alpha OR TI alphas OR AB alphas)) OR (TI item OR AB item AND (TI correlation* OR AB correlation* OR TI selection* OR AB selection* OR TI reduction* OR AB reduction*)) OR TI agreement OR TI precision OR TI imprecision OR TI “precise values” OR TI test-retest OR AB agreement OR AB precision OR AB imprecision OR AB “precise values” OR AB test-retest OR (TI test OR AB test AND TI retest OR AB retest) OR (TI reliab* OR AB reliab* AND (TI test OR AB test OR TI retest or AB retest)) OR TI stability OR TI interrater OR TI interrater OR TI intrarater OR TI intra-rater OR TI intertester OR TI inter-tester OR TI intratester OR TI intra-tester OR TI interobserver OR TI inter-observer OR TI intraobserver OR TI intra-observer OR TI intertechnician OR TI inter-technician OR TI intratechnician OR TI intra-technician OR TI interexaminer OR TI inter-examiner OR TI intraexaminer OR TI intra-examiner OR TI interassay OR TI inter-assay OR TI intraassay OR TI intra-assay OR TI interindividual OR TI inter-individual OR TI intraindividual OR TI intra-individual OR TI interparticipant OR TI inter-participant OR TI 38

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

intraparticipant OR TI intra-participant OR TI kappa OR TI kappa’s OR TI kappas OR TI repeatab* OR AB stability OR AB interrater OR AB inter-rater OR AB intrarater OR AB intra-rater OR AB intertester OR AB inter-tester OR AB intratester OR AB intra-tester OR AB interobserver OR AB inter-observer OR AB intraobserver OR AB intra-observer OR AB intertechnician OR AB inter-technician OR AB intratechnician OR AB intra-technician OR AB interexaminer OR AB inter-examiner OR AB intraexaminer OR AB intra-examiner OR AB interassay OR AB inter-assay OR AB intraassay OR AB intra-assay OR AB interindividual OR AB inter-individual OR AB intraindividual OR AB intra-individual OR AB interparticipant OR AB inter-participant OR AB intraparticipant OR AB intra-participant OR AB kappa OR AB kappa’s OR AB kappas OR AB repeatab* OR ((TI replicab* OR AB replicab* OR TI repeated OR AB repeated) AND (TI measure OR AB measure OR TI measures OR AB measures OR TI findings OR AB findings OR TI result OR AB result OR TI results OR AB results OR TI test OR AB test OR TI tests OR AB tests)) OR TI generaliza* OR TI generalisa* OR TI concordance OR AB generaliza* OR AB generalisa* OR AB concordance OR (TI intraclass OR AB intraclass AND TI correlation* or AB correlation*) OR TI discriminative OR TI “known group” OR TI factor analysis OR TI factor analyses OR TI dimension* OR TI subscale* OR AB discriminative OR AB “known group” OR AB factor analysis OR AB factor analyses OR AB dimension* OR AB subscale* OR (TI multitrait OR AB multitrait AND TI scaling OR AB scaling AND (TI analysis OR AB analysis OR TI analyses OR AB analyses)) OR TI item discriminant OR TI interscale correlation* OR TI error OR TI errors OR TI “individual variability” OR AB item discriminant OR AB interscale correlation* OR AB error OR AB errors OR AB “individual variability” OR (TI variability OR AB variability AND (TI analysis OR AB analysis OR TI values OR AB values)) OR (TI uncertainty OR AB uncertainty AND (TI measurement OR AB measurement OR TI measuring OR AB measuring)) OR TI “standard error of measurement” OR TI sensitiv* OR TI responsive* OR AB “standard error of measurement” OR AB sensitiv* OR AB responsive* OR ((TI minimal OR TI minimally OR TI clinical OR TI clinically OR AB minimal OR AB minimally OR AB clinical OR AB clinically) AND (TI important OR TI significant OR TI detectable OR AB important OR AB significant OR AB detectable) AND (TI change OR AB change OR TI difference OR AB difference)) OR (TI small* OR AB small* AND (TI real OR AB real OR TI detectable OR AB detectable) AND (TI change OR AB change OR TI difference OR AB difference)) OR TI meaningful change OR TI “ceiling effect” OR TI “floor effect” OR TI “Item response model” OR TI IRT OR TI Rasch OR TI “Differential item functioning” OR TI DIF OR TI “computer adaptive testing” OR TI “item bank” OR TI “cross-cultural equivalence” OR TI outcome assessment OR AB meaningful change OR AB “ceiling effect” OR AB “floor effect” OR AB “Item response model” OR AB IRT OR AB Rasch OR AB “Differential item functioning” OR AB DIF OR AB “computer adaptive testing” OR AB “item bank” OR AB “cross-cultural equivalence” OR AB outcome assessment #4 #1 AND #2 AND #3

Search Strategy in PsycINFO #1 ((TI visual OR AB visual) AND (TI analog* OR AB analog*) AND (TI scale OR AB scale OR TI score OR 39

ACCEPTED MANUSCRIPT AB score)) OR vas OR ((TI numeric* OR AB numeric*) AND (TI rating OR AB rating) AND (TI scale OR AB scale OR TI score OR AB score)) OR (TI numeric scale OR AB numeric scale) OR nrs OR nprs OR ((TI brief pain OR AB brief pain) AND (TI inventory OR AB inventory OR TI index OR AB index OR TI scale OR AB scale OR TI questionnaire OR AB questionnaire)) OR bpi

CR IP T

#2 DE "BACKACHE" OR TI back pain OR AB back pain OR lbp OR TI back injury OR AB back injury OR DE "LUMBAR vertebrae" OR TI lumbar spine OR AB lumbar spine OR DE "SACROILIAC joint" OR TI sacroiliac pain OR AB sacroiliac pain OR DE "ZYGAPOPHYSEAL joint" OR TI facet pain OR AB facet pain OR DE "SCIATICA" OR TI sciatica OR AB sciatica OR (DE "SPINAL canal -- Stenosis") OR TI lumbar stenosis OR AB lumbar stenosis OR DE "INTERVERTEBRAL disk displacement" OR TI herniated disk OR AB herniated disk OR TI disk pain OR AB disk pain OR DE "SPONDYLOLISTHESIS" OR TI spondylolisthesis OR AB spondylolisthesis

AC

CE

PT

ED

M

AN US

#3 TI psychometr* OR TI observer variation OR TI reproducib* OR TI reliab* OR TI unreliab* OR TI valid* OR TI coefficient OR TI homogeneity OR TI homogeneous OR TI “internal consistency” OR AB psychometr* OR AB observer variation OR AB reproducib* OR AB reliab* OR AB unreliab* OR AB valid* OR AB coefficient OR AB homogeneity OR AB homogeneous OR AB “internal consistency” OR (TI cronbach* OR AB cronbach* AND (TI alpha OR AB alpha OR TI alphas OR AB alphas)) OR (TI item OR AB item AND (TI correlation* OR AB correlation* OR TI selection* OR AB selection* OR TI reduction* OR AB reduction*)) OR TI agreement OR TI precision OR TI imprecision OR TI “precise values” OR TI test-retest OR AB agreement OR AB precision OR AB imprecision OR AB “precise values” OR AB test-retest OR (TI test OR AB test AND TI retest OR AB retest) OR (TI reliab* OR AB reliab* AND (TI test OR AB test OR TI retest or AB retest)) OR TI stability OR TI interrater OR TI interrater OR TI intrarater OR TI intra-rater OR TI intertester OR TI inter-tester OR TI intratester OR TI intra-tester OR TI interobserver OR TI inter-observer OR TI intraobserver OR TI intra-observer OR TI intertechnician OR TI inter-technician OR TI intratechnician OR TI intra-technician OR TI interexaminer OR TI inter-examiner OR TI intraexaminer OR TI intra-examiner OR TI interassay OR TI inter-assay OR TI intraassay OR TI intra-assay OR TI interindividual OR TI inter-individual OR TI intraindividual OR TI intra-individual OR TI interparticipant OR TI inter-participant OR TI intraparticipant OR TI intra-participant OR TI kappa OR TI kappa’s OR TI kappas OR TI repeatab* OR AB stability OR AB interrater OR AB inter-rater OR AB intrarater OR AB intra-rater OR AB intertester OR AB inter-tester OR AB intratester OR AB intra-tester OR AB interobserver OR AB inter-observer OR AB intraobserver OR AB intra-observer OR AB intertechnician OR AB inter-technician OR AB intratechnician OR AB intra-technician OR AB interexaminer OR AB inter-examiner OR AB intraexaminer OR AB intra-examiner OR AB interassay OR AB inter-assay OR AB intraassay OR AB intra-assay OR AB interindividual OR AB inter-individual OR AB intraindividual OR AB intra-individual OR AB interparticipant OR AB inter-participant OR AB intraparticipant OR AB intra-participant OR AB kappa OR AB kappa’s OR AB kappas OR AB repeatab* OR ((TI replicab* OR AB replicab* OR TI repeated OR AB repeated) AND (TI measure OR AB measure OR TI measures OR AB measures OR TI findings OR AB findings OR TI result OR AB result OR TI results OR AB results OR TI test OR AB test OR TI tests OR AB tests)) OR TI generaliza* OR TI generalisa* OR TI concordance OR AB generaliza* OR AB generalisa* OR AB concordance OR (TI intraclass OR AB intraclass AND TI correlation* or AB correlation*) OR TI discriminative OR TI “known group” OR TI factor analysis OR TI factor analyses OR TI dimension* OR TI subscale* OR AB discriminative OR AB “known group” OR AB factor analysis OR 40

ACCEPTED MANUSCRIPT

AN US

CR IP T

AB factor analyses OR AB dimension* OR AB subscale* OR (TI multitrait OR AB multitrait AND TI scaling OR AB scaling AND (TI analysis OR AB analysis OR TI analyses OR AB analyses)) OR TI item discriminant OR TI interscale correlation* OR TI error OR TI errors OR TI “individual variability” OR AB item discriminant OR AB interscale correlation* OR AB error OR AB errors OR AB “individual variability” OR (TI variability OR AB variability AND (TI analysis OR AB analysis OR TI values OR AB values)) OR (TI uncertainty OR AB uncertainty AND (TI measurement OR AB measurement OR TI measuring OR AB measuring)) OR TI “standard error of measurement” OR TI sensitiv* OR TI responsive* OR AB “standard error of measurement” OR AB sensitiv* OR AB responsive* OR ((TI minimal OR TI minimally OR TI clinical OR TI clinically OR AB minimal OR AB minimally OR AB clinical OR AB clinically) AND (TI important OR TI significant OR TI detectable OR AB important OR AB significant OR AB detectable) AND (TI change OR AB change OR TI difference OR AB difference)) OR (TI small* OR AB small* AND (TI real OR AB real OR TI detectable OR AB detectable) AND (TI change OR AB change OR TI difference OR AB difference)) OR TI meaningful change OR TI “ceiling effect” OR TI “floor effect” OR TI “Item response model” OR TI IRT OR TI Rasch OR TI “Differential item functioning” OR TI DIF OR TI “computer adaptive testing” OR TI “item bank” OR TI “cross-cultural equivalence” OR TI outcome assessment OR AB meaningful change OR AB “ceiling effect” OR AB “floor effect” OR AB “Item response model” OR AB IRT OR AB Rasch OR AB “Differential item functioning” OR AB DIF OR AB “computer adaptive testing” OR AB “item bank” OR AB “cross-cultural equivalence” OR AB outcome assessment

ED

M

#4 #1 AND #2 AND #3

PT

APPENDIX 2. Visual Analogue Scale and Numeric Rating Scale used for the reviewers’ assessment of content validity in this systematic review.

AC

CE

How would you rate your average low back pain intensity over the last week?

No pain

Worst imaginable pain

How would you rate your average low back pain intensity over the last week?

41

ACCEPTED MANUSCRIPT 0

1

2

3

4

5

6

7

No pain

8

9

10 Worst imaginable pain

APPENDIX 3. Consensus-based criteria for measurement properties (Prinsen et al. 85) Criteria

+

All items refer to relevant aspects of the construct to be measured AND are relevant for the target population AND are relevant for the context of use AND together comprehensively reflect the construct to be measured Criteria for ‘+’ not met Not all information for ‘+’ reported CTT Unidimensionality EFA: first factor accounts for at least 20% of the variability AND ratio of the variance explained by the first to the second factor > 4 CFA: CFI or TLI or comparable measure > 0.95 AND (RMSEA < 0.06 OR SRMR < 0.08) Bi-factor model: standardized loadings on a common factor > 0.30 AND correlation between individual scores under a bi-factor and unidimensional model > 0.90 Structural validity EFA: factors with eigenvalue > 1 account for at least 50% of the variability CFA: CFI or TLI or comparable measure > 0.95 AND (RMSEA < 0.06 OR SRMR < 0.08)

− ? +

ED

M

AN US

Structural validity

Rating

CR IP T

Measurement property Content validity (including face validity)

AC

CE

PT

Rasch/IRT At least limited evidence for unidimensionality or positive structural validity AND no evidence for violation of local independence: Rasch: standardized item-person fit residuals between -2.5 and 2.5; OR IRT: residual correlations among the items after controlling for the dominant factor < 0.20 OR Q3's < 0.37 AND no evidence for violation of monotonicity: adequate looking graphs OR item scalability >0.30 AND adequate model fit: Rasch: infit and outfit mean squares ≥ 0.5 and ≤ 1.5 OR Zstandardized values > -2 and 0.01; OR IRT: G2 > 0.01

Internal consistency

− ? + − ?

Optional additional evidence: Adequate targeting; Rasch: adequate person-item threshold distribution; IRT: adequate threshold range No important DIF for relevant subject characteristics (such as age, gender, education), McFadden's R2 < 0.02 Criteria for ‘+’ not met Not all information for ‘+’ reported Evidence for unidimensionality or sufficient structural validity AND Cronbach’s alpha ≥ 0.70 and ≤ 0.95 Criteria for ‘+’ not met Not all information for ‘+’ reported OR conflicting evidence for 42

ACCEPTED MANUSCRIPT

Measurement error Construct validity – hypotheses testing Cross-cultural validity

+ − ? + − ? + − ?

CR IP T

Reliability

unidimensionality or structural validity OR evidence for lack of unidimensionality or negative structural validity ICC or weighted kappa ≥ 0.70 Criteria for ‘+’ not met ICC or weighted kappa not reported SDC or LoA < MIC Criteria for ‘+’ not met MIC not defined At least 75% of the results are in accordance with the hypotheses Criteria for ‘+’ not met No correlations with instrument(s) measuring related construct(s) AND no differences between relevant groups reported No important differences found between language versions in multiple group factor analysis or DIF analysis Criteria for ‘+’ not met Multiple group factor analysis AND DIF analysis not performed Convincing arguments that gold standard is “gold” AND correlation with gold standard ≥ 0.70 Criteria for ‘+’ not met Not all information for ‘+’ reported At least 75% of the results are in accordance with the hypotheses Criteria for ‘+’ not met No correlations with changes in instrument(s) measuring related construct(s) AND no differences between changes in relevant groups reported

+ − ? +

Criterion validity

AN US

− ? Responsiveness + − ?

ED

M

+ = sufficient results; − = insufficient results; ? = unknown CTT = classical test theory; EFA = exploratory factor analysis; CFA = confirmatory factor analysis; CFI = comparative fit index; TLI = Tucker Lewis Index; RMSEA = root mean square error of approximation; SRMR = standardized root mean residuals; IRT = item response theory; DIF = differential item functioning; ICC = intraclass correlation coefficient; SDC = smallest detectable change; LoA = limits of agreement; MIC = minimal important change.

PT

APPENDIX 4. Grading the quality of evidence on measurement properties

CE

a. Modified GRADE approach to rate the quality of evidence on content validity (Terwee et al.100)

AC

Study design At least 1 content validity study No content validity studies

Quality of evidence High Moderate Low Very low

Lower if Risk of bias -1 Serious -2 Very serious Inconsistency -1 Serious -2 Very serious Indirecteness -1 Serious -2 Very serious

43

ACCEPTED MANUSCRIPT b. Modified GRADE approach to rate the quality of evidence on the other measurement properties (Prinsen et al.84) Lower if Risk of bias -1 Serious (only one study of adequate quality) -2 Very serious (only studies of doubtful or inadequate quality) Imprecision -1 total (n = 50-100) -2 total (n < 50)

CR IP T

Quality of evidence High Moderate Low Very low

Inconsistency -1 Serious (< 75% of studies did not display the same results)

AC

CE

PT

ED

M

AN US

Indirecteness -1 Serious (at least one study not addressing construct or target population of the review)

44