Reliability and validity of non-radiographic methods of thoracic kyphosis measurement: A systematic review


Manual Therapy 19 (2014) 10–17


Systematic review

Eva Barrett a,*, Karen McCreesh a, Jeremy Lewis b,c

a Department of Clinical Therapies, Faculty of Education and Health Sciences, University of Limerick, Limerick, Ireland
b Musculoskeletal Services, Health at the Stowe, Central London Community Healthcare, NHS Trust, 260 Harrow Road, London W2 5ES, UK
c Department of Allied Health Professions and Midwifery, School of Health and Social Work, Wright Building, College Lane Campus, University of Hertfordshire, Hatfield, AL10 9AB, Hertfordshire, UK

Article info

Article history: Received 7 January 2013; received in revised form 29 April 2013; accepted 16 September 2013

Keywords: Reliability; Validity; Thoracic kyphosis; Measurement

Abstract

Background: A wide array of instruments is available for non-invasive thoracic kyphosis measurement. Guidelines for selecting outcome measures for use in clinical and research practice recommend that properties such as validity and reliability are considered. This systematic review reports on the reliability and validity of non-invasive methods for measuring thoracic kyphosis.

Methods: A systematic search of 11 electronic databases located studies assessing reliability and/or validity of non-invasive thoracic kyphosis measurement techniques. Two independent reviewers used a critical appraisal tool to assess the quality of retrieved studies. Data were extracted by the primary reviewer. The results were synthesized qualitatively using a level of evidence approach.

Results: 27 studies satisfied the eligibility criteria and were included in the review. Sixteen studies investigated reliability only, two investigated validity only and nine investigated both. 17/27 studies were deemed to be of high quality. In total, 15 methods of thoracic kyphosis measurement were evaluated in retrieved studies. All investigated methods showed high (ICC ≥ .7) to very high (ICC ≥ .9) levels of reliability. The validity of the methods ranged from low to very high.

Conclusion: The strongest levels of evidence for reliability exist in support of the Debrunner kyphometer, Spinal Mouse and Flexicurve index, and the strongest levels of evidence for validity support the arcometer and Flexicurve index. Further reliability and validity studies are required to strengthen the level of evidence for the remaining methods of measurement.

© 2013 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +353 (0)61 234232; fax: +353 (0)61 234251 (E. Barrett).
1356-689X/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.math.2013.09.003

1. Introduction

Thoracic kyphosis is the sagittal plane curvature between the T1 and T12 vertebral bodies (Perriman et al., 2010). Normal kyphosis ranges from 20° to 50° when assessed radiographically (Willner, 1981). Excessive thoracic kyphosis, defined as a kyphosis >50° (Willner, 1981; Teixeira and Carvalho, 2007), has been linked with a range of negative consequences. The postural effects of excessive kyphosis include musculoskeletal complaints such as shoulder pain (Gray and Grimsby, 2004) and cervical pain (Horter, 1978; Callet, 1991; Ayub, 1991), and can affect any age group (Gray and Grimsby, 2004). In osteoporotic samples, excessive thoracic kyphosis can lead to physiological adaptations such as impaired respiratory function (Murray et al., 1993; Di Bari et al., 2004) and can have functional influences such as decreased mobility (Lydick et al., 1997), injurious falls (Kado et al., 2007) and loss of independence (Lydick et al., 1997). The measurement of thoracic kyphosis is therefore an essential aspect of musculoskeletal assessment, helping clinicians to adequately screen for excessive kyphosis, determine baseline data, monitor progress and guide appropriate implementation of treatment strategies (Chaise et al., 2011).

The current gold standard for the quantification of thoracic kyphosis is the lateral radiograph, a method which provides a Cobb angle (Harrison et al., 2001; Briggs et al., 2007). While this is routinely used for the diagnosis and monitoring of conditions such as idiopathic scoliosis and hyperkyphosis (Saad et al., 2012), it has significant limitations. Radiographic methods are generally inconvenient in a clinical setting, involve high costs and expose the patient to high doses of potentially harmful radiation (Korovessis et al., 2001; Kellis et al., 2008). Furthermore, the validity of the Cobb angle has been criticized, particularly in osteoporotic individuals, as it predominantly reflects endplate tilt of vertebrae between selected limits of the curve and fails to represent the full contour of the thoracic spine (Goh et al., 1999; Harrison et al., 2001; Briggs et al., 2007).


Alternatively, several non-invasive, skin-surface methods have been adopted for clinical use, including the Debrunner kyphometer (Öhlén et al., 1989), the Flexicurve (Milne and Williamson, 1983) and the Spinal Mouse (Mannion et al., 2004), as well as technology-based methods including rasterstereography (Melvin et al., 2010) and 3D ultrasound (Folsch et al., 2012). Guidelines for selecting measurement tools for use in clinical and research practice recommend that validity and reliability are amongst the essential properties to be considered (Lohr, 2002; Terwee et al., 2007). Validity is an evaluation of whether an instrument measures the construct or variable that it is intended to measure (Carmines and Zeller, 1979; van de Ven-Stevens et al., 2009). For a non-invasive tool to be considered accurate enough to measure thoracic kyphosis in practice and research, it must display adequate criterion validity when compared to the gold standard, i.e. the radiographic Cobb angle. Reliability is defined as the extent to which a measurement is consistent and free from error, when used by the same rater (intra-rater reliability) or by different raters (inter-rater reliability) (Portney and Watkins, 2000). In practice, to state that a patient's clinical status has changed since the last measurement, the measured change is required to be larger than the error associated with the measurement (Wright and Feinstein, 1992). Therefore, the reporting of the Standard Error of Measurement is an important element of reliability studies as it aids clinical interpretability of results (van de Ven-Stevens et al., 2009).

Since numerous studies on the psychometric properties of these instruments have been published, an evaluation of the literature is required. Therefore, the purpose of this systematic review is to report on the reliability and validity of methods of non-invasive thoracic kyphosis measurement.

2. Methods

2.1. Search strategy

A systematic search was performed on 1st October 2012 by the primary investigator. Searches of the following databases were performed: MEDLINE, AMED, CINAHL, Pubmed, Biomedical Reference Collection: Expanded, SportDiscus, ScienceDirect, Cochrane Library, Web of Science (1960–Oct 2012). The search was conducted using search terms from 3 subject areas: thoracic kyphosis ("thoracic kyphosis", "spinal curvature", "thoracic curvature", kyphosis), psychometric properties (reliability, validity, sensitivity, responsiveness, properties) and physical tests (instrument, tool, test, measure*, inclinometer, flexicurve, kyphometer, radiograph, Cobb). The Boolean operators "Or" and "And" were used to combine the search terms within and between each of the 3 subject areas respectively. A word from each area was required to be in the Title or Abstract of the study. An additional search of the Google Scholar search engine was also performed. These searches were supplemented by hand-searching the reference lists of the final articles found from the above searches.

2.1.1. Eligibility criteria

A meeting between the two reviewers was convened to decide on selection criteria.

2.1.2. Inclusion criteria

• Articles available in full text
• Articles available in English
• A neutral thoracic kyphosis angle was recorded
• Measurement of validity and/or reliability was the primary aim of the study
• Studies on human participants were included for review. No restrictions were made with regard to populations.

2.1.3. Exclusion criteria

• Full text in English could not be located
• Thoracic kyphosis angle reported in thoracic flexion or extension only
• Radiographic measurement techniques only

Initially, article titles and abstracts were screened by the primary reviewer. Any title and abstract which was not clearly investigating a psychometric property of a thoracic kyphosis measurement method was discarded as not relevant. In cases of uncertainty about the eligibility of a study title/abstract, the full text was explored. When the original search was narrowed down to relevant articles only, a second reviewer independently applied the selection criteria to the chosen articles to ensure all articles were suitable for review. There were no disagreements between reviewers regarding the eligibility of chosen articles.

2.2. Quality assessment

The critical appraisal tool used was a relatively new checklist (Brink and Louw, 2011) which was designed for appraising combined reliability and validity studies, or validity and reliability studies on their own. The checklist, which comprises 13 items, does not report a quality score. This tool was developed from two existing tools, the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) and the Quality Appraisal of Diagnostic Reliability Studies (QAREL). As some of the included studies assess both reliability and validity of the instrument, this checklist was more convenient than using the QUADAS or QAREL separately. The criteria are provided as a footnote to Table 2. The studies were considered of high quality if they scored ≥60%, as done previously (van der Wurff et al., 2000; May et al., 2006, 2010; Adhia et al., 2012). Quality assessment was performed independently by two reviewers on each paper.
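Because the checklist itself does not report a score, the percentage behind the 60% threshold has to be derived from the item ratings. The following is a minimal sketch of one way to do this, not the reviewers' documented procedure. It assumes that "O" marks a criterion as met and "X" as not met, that items rated "n/a" are left out of the denominator, and that a score of exactly 60% counts as high quality; the function names are illustrative only. The two example rows are the ratings as read from Table 2.

```python
def quality_score(items):
    """Percentage of applicable checklist items that were met.

    items: list of ratings, each "O" (met), "X" (not met) or "n/a".
    Assumption: "n/a" items are excluded from the denominator.
    """
    applicable = [rating for rating in items if rating != "n/a"]
    if not applicable:
        raise ValueError("no applicable items to score")
    met = sum(1 for rating in applicable if rating == "O")
    return 100.0 * met / len(applicable)


def is_high_quality(items, threshold=60.0):
    """Assumption: a score at or above the threshold counts as high quality."""
    return quality_score(items) >= threshold


# Ratings for items 1-13 as read from Table 2.
chaise_2011 = ["O", "X", "O", "X", "X", "X", "O", "O", "O", "O", "O", "O", "O"]
melvin_2010 = ["O", "O", "n/a", "X", "X", "X", "n/a", "O", "n/a", "O", "n/a", "O", "X"]

print(round(quality_score(chaise_2011), 1), is_high_quality(chaise_2011))
print(round(quality_score(melvin_2010), 1), is_high_quality(melvin_2010))
```

Under these assumptions the first study scores 9 of 13 applicable items and the second 5 of 9, which agrees with the "Yes" and "No" entries in the high quality column of Table 2.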
In the pilot stage, each reviewer independently rated two non-included articles using the checklist, in order to identify any differences in interpretation of the items. This process recorded a kappa score of .92, which was regarded as acceptable to continue. Disagreements were resolved by discussion and all items were clarified.

2.3. Data analysis

Meta-analysis was not attempted due to the heterogeneity of tests, participants and analyses. Also, a subgroup analysis could not be performed due to the limited number of studies evaluating the same thoracic kyphosis measurement technique. Hence a descriptive analysis was conducted and data were synthesized using a level of evidence approach (van Tulder et al., 2003), displayed in Table 1. The Intraclass Correlation Coefficient (ICC) and Pearson's Correlation Coefficient were interpreted as follows: .00–.29 as very low correlation, .30–.49 as low correlation, .50–.69 as moderate correlation, .70–.89 as high correlation, .90–1.00 as very high correlation (Munro and Visintainer, 2005).

Table 1 Levels of evidence approach (van Tulder et al., 2003).

| Level of evidence | Criteria |
|---|---|
| Strong | Consistent findings from 3 high quality studies |
| Moderate | Consistent findings from at least 1 high quality and one or more low quality studies |
| Limited | Consistent findings in 1 low quality studies or only 1 study available |
| Conflicting | Inconsistent evidence in multiple studies irrespective of study quality |
| No evidence | No studies found |

Table 2

| Study | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | High quality? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chaise et al., 2011 | O | X | O | X | X | X | O | O | O | O | O | O | O | Yes |
| Czaprowski et al., 2012 | O | O | n/a | O | O | X | n/a | O | n/a | O | n/a | O | X | Yes |
| de Oliveira et al., 2012 | O | X | O | X | X | X | O | O | O | O | O | O | X | Yes |
| D'Osualdo et al., 1997 | O | X | O | X | X | X | O | X | O | X | X | O | X | No |
| Dunk et al., 2004 | O | X | n/a | n/a | X | X | n/a | X | n/a | O | n/a | O | X | No |
| Dunk et al., 2005 | O | X | n/a | n/a | X | X | n/a | X | n/a | O | n/a | O | O | No |
| Folsch et al., 2012 | O | X | n/a | n/a | X | X | n/a | O | n/a | O | n/a | O | O | Yes |
| Goh et al., 1999 | X | X | n/a | n/a | X | X | n/a | O | n/a | X | n/a | O | X | No |
| Gravina et al., 2012 | O | X | O | n/a | n/a | n/a | X | n/a | O | O | X | O | O | Yes |
| Greendale et al., 2011 | O | O | O | O | X | X | O | O | O | O | O | O | X | Yes |
| Hinman 2004 | O | O | n/a | X | n/a | X | n/a | O | n/a | O | n/a | O | O | Yes |
| Kellis et al., 2008 | O | O | n/a | O | O | X | n/a | O | n/a | O | n/a | O | O | Yes |
| Korovessis et al., 2001 | O | X | O | O | X | X | X | X | O | O | O | O | X | No |
| Leroux et al., 2000 | O | X | O | n/a | n/a | n/a | X | n/a | O | O | O | O | O | Yes |
| Lewis and Valentine 2010 | O | X | n/a | n/a | O | O | n/a | X | n/a | O | n/a | O | O | Yes |
| Lundon et al., 1998 | O | O | n/a | X | O | O | n/a | O | n/a | O | n/a | O | X | Yes |
| Mannion et al., 2004 | O | O | n/a | X | X | O | n/a | O | n/a | O | n/a | O | X | Yes |
| Melvin et al., 2010 | O | O | n/a | X | X | X | n/a | O | n/a | O | n/a | O | X | No |
| Öhlén et al., 1989 | O | X | n/a | O | O | O | n/a | X | n/a | O | n/a | O | X | Yes |
| Perriman et al., 2010 | O | X | O | n/a | X | X | O | O | X | O | O | O | X | No |
| Purser et al., 1999 | O | X | n/a | n/a | X | X | n/a | O | n/a | O | n/a | O | O | Yes |
| Ripani et al., 2008 | O | O | O | X | X | X | O | O | O | O | O | O | X | Yes |
| Saad et al., 2012 | O | O | n/a | O | X | X | n/a | X | n/a | O | n/a | X | X | No |
| Sheeran et al., 2010 | O | O | n/a | X | X | O | n/a | O | n/a | O | n/a | O | O | Yes |
| Teixeira and Carvalho 2007 | O | O | O | X | X | X | O | O | O | O | O | O | X | Yes |
| Van Blommestein et al., 2012 | O | X | n/a | n/a | O | O | n/a | O | n/a | O | n/a | O | O | Yes |
| Willner 1981 | X | X | O | X | X | X | X | X | O | O | O | O | X | No |
| Yanagawa et al., 2000 | O | X | n/a | O | O | X | n/a | X | n/a | O | n/a | O | X | No |

Criteria: 1. Adequate description of study population; 2. Adequate description of raters; 3. Adequate explanation of reference standard; 4. Between-rater blinding; 5. Within-rater blinding; 6. Variation of testing order; 7. Time period between index test and reference standard; 8. Time period between repeated measures; 9. Independency of reference standard from index test; 10. Adequate description of index test procedure; 11. Adequate description of reference standard procedure; 12. Explanation of any withdrawals; 13. Appropriate statistical methods.

3. Results

3.1. Selection of studies

Fig. 1 presents a flow diagram, based on the PRISMA guidelines (Liberati et al., 2009), which details the movement of articles through the review process. Twenty-seven articles were included for review under the outlined selection criteria. Of these studies, 2 investigated validity only, 16 investigated reliability only and 9 investigated both reliability and validity. Of the 16 included reliability studies, 1 investigated inter-rater reliability, 7 investigated intra-rater reliability and 8 investigated both intra- and inter-rater reliability.

3.2. Methodological quality

Eighteen out of twenty-eight studies were deemed to be of high quality (score ≥60%). The full scoring process is displayed in Table 2. The two reviewers initially disagreed on 12 items across all studies (kappa score .94). The disagreement between the two reviewers was then resolved by discussion. A third reviewer was available to moderate disagreement but was not required. Both of the included validity studies were of high quality (Leroux et al., 2000; Gravina et al., 2012). Five out of nine combined reliability and validity studies were of high quality (Teixeira and Carvalho, 2007; Ripani et al., 2008; Chaise et al., 2011; Greendale et al., 2011; de Oliveira et al., 2012). Eleven out of seventeen reliability studies were of high quality (Öhlén et al., 1989; Lundon et al., 1998; Purser et al., 1999; Hinman, 2004; Mannion et al., 2004; Kellis et al., 2008; Lewis and Valentine, 2010; Melvin et al., 2010; Sheeran et al., 2010; Czaprowski et al., 2012; Folsch et al., 2012; van Blommestein et al., 2012). The main areas of weakness found were inadequate description of the raters, insufficient between-rater and within-rater blinding, lack of variation in testing order and inappropriate or insufficiently described statistical analyses.
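The kappa scores quoted for reviewer agreement correct the raw proportion of agreement for the agreement expected by chance. As an illustration only, the sketch below computes an unweighted Cohen's kappa for two raters. The review does not state which kappa variant was used, so the unweighted form is an assumption, and the ratings shown are made-up example data rather than the reviewers' actual item scores.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: proportion of items given the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of ten checklist items by two reviewers.
reviewer_1 = ["O", "O", "X", "O", "X", "n/a", "O", "X", "O", "O"]
reviewer_2 = ["O", "O", "X", "O", "O", "n/a", "O", "X", "O", "O"]

print(round(cohen_kappa(reviewer_1, reviewer_2), 3))
```

In this made-up example the reviewers agree on 9 of 10 items (observed agreement .90) while chance agreement is .49, so kappa is noticeably lower than the raw agreement.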

3.3. Study characteristics

A total of 15 methods for thoracic kyphosis measurement were found within reviewed articles. The Flexicurve index and the Debrunner kyphometer were the most commonly studied, in terms of both reliability and validity. A list of all methods is below.

• Arcometer (D'Osualdo et al., 1997; Chaise et al., 2011)
• Flexicurve index (Yanagawa et al., 2000; Hinman, 2004; Teixeira and Carvalho, 2007; Greendale et al., 2011)
• Flexicurve angle (Greendale et al., 2011; de Oliveira et al., 2012)
• Debrunner's kyphometer (Öhlén et al., 1989; Purser et al., 1999; Korovessis et al., 2001; Greendale et al., 2011)
• Spinal Mouse (Mannion et al., 2004; Kellis et al., 2008; Ripani et al., 2008)
• Manual inclinometer (Lewis and Valentine, 2010; van Blommestein et al., 2012)
• Digital inclinometer (Czaprowski et al., 2012)
• 3D ultrasound (Folsch et al., 2012)
• Rasterstereography (Goh et al., 1999; Melvin et al., 2010)
• Stereovideography (Leroux et al., 2000)
• Goniometer (Gravina et al., 2012)
• Electrogoniometer (Perriman et al., 2010)
• Spinal wheel (Sheeran et al., 2010)
• Pantograph (Willner, 1981)
• Photogrammetry (Dunk et al., 2004, 2005; Saad et al., 2012)


Fig. 1. PRISMA flow diagram.

3.4. Types of participants

A healthy sample of participants was used in 21/28 studies. Only 7 studies included subjects with any degree of pathology, with 4 involving postmenopausal, osteoporotic women (Lundon et al., 1998; Purser et al., 1999; Yanagawa et al., 2000; Greendale et al., 2011) and 3 involving subjects with scoliosis (Willner, 1981; Leroux et al., 2000; Saad et al., 2012). The subject BMI was unreported in 15 studies, while 5 studies reported an average BMI ≥25 (Mannion et al., 2004; Sheeran et al., 2010; Greendale et al., 2011; Chaise et al., 2011; de Oliveira et al., 2012) and 3 studies reported an average BMI <25 (Ripani et al., 2008; Melvin et al., 2010; Saad et al., 2012). The majority of studies used subjects with a mean age between 20 and 65 years. Six studies used subjects with a mean age between 10 and 19 years (D'Osualdo et al., 1997; Willner, 1981; Leroux et al., 2000; Korovessis et al., 2001; Kellis et al., 2008; Gravina et al., 2012) and 1 study compared pre- and post-menopausal women (Hinman, 2004). Only 5 studies used a population with mean age ≥65 years (Lundon et al., 1998; Purser et al., 1999; Yanagawa et al., 2000; Teixeira and Carvalho, 2007; Greendale et al., 2011).

3.5. Reliability and validity

All reliability studies showed high to very high levels of reliability. The validity of the methods ranged from low to very high. However, only 11 out of 27 studies assessed validity. This is shown in more detail in Table 3.

3.6. Level of evidence

Table 4 details the accumulated level of evidence found for all methods. For the majority, there is a limited or inconsistent level of evidence for the reliability and validity of methods. Strong and moderate levels of evidence have been found for a small selection of methods.

4. Discussion

4.1. Main findings

This review highlighted 15 methods for the non-invasive measurement of thoracic kyphosis, ranging from simple, skin-surface measures to computerized postural analysis systems. In general, high to very high levels of reliability were found for all investigated measurement techniques. The validity of these techniques was less commonly studied and ranges from low to very high. On observation of the data, the more technological methods (e.g. rasterstereography, 3D ultrasound, stereovideography) do not appear to offer greater reliability or validity than the simpler methods. In fact, the strongest level of evidence was in support of the high to very high levels of reliability of the Flexicurve index, Debrunner kyphometer and Spinal Mouse, which are simple, hand-held tools.
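The labels used throughout the results (low, moderate, high, very high) follow the correlation bands given in the Data analysis section (Munro and Visintainer, 2005). A minimal sketch of that mapping is shown below. The published bands are stated to two decimals (.00–.29, .30–.49 and so on), so treating each lower bound as a cut-point for values that fall between bands, such as .295, is an assumption, as is the function name.

```python
def interpret_correlation(coefficient):
    """Map an ICC or Pearson coefficient in [0, 1] to a descriptive label.

    Cut-points are the lower bounds of the published bands; how values
    between two-decimal bands should be handled is an assumption.
    """
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient must lie between 0 and 1")
    if coefficient < 0.30:
        return "very low"
    if coefficient < 0.50:
        return "low"
    if coefficient < 0.70:
        return "moderate"
    if coefficient < 0.90:
        return "high"
    return "very high"


# Example coefficients of the kind reported in Table 3.
for value in (0.385, 0.622, 0.7, 0.94):
    print(value, interpret_correlation(value))
```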


Table 3 Reliability and validity data for all methods.

Table 4 Level of evidence.

Reference

High Reliability quality? (ICC/Cronbach alpha)

SEM

Validity (correlation coefficient)

Folsch et al, 2012 Chaise et al., 2011 D’Osualdo et al., 1997 Korovessis et al., 2001 Öhlén et al., 1989 Purser et al., 1999 Lundon et al., 1998 Greendale et al., 2011 Czaprowski et al., 2012 Perriman et al., 2010 Greendale et al., 2011 Lundon et al., 1998 de Oliveira et al., 2012 Greendale et al., 2011 Yanagawa et al., 2000 Teixeira and Carvalho, 2007 Hinman, 2004 Gravina et al., 2012 Lewis and Valentine, 2010 Van Blommestein et al., 2012 Willner, 1981 Dunk et al., 2004 Dunk et al., 2005 Saad et al., 2012 Goh et al., 1999 Melvin et al., 2010 Mannion et al., 2004 Kellis et al., 2008

Yes

.95 (intra)

3.7

N/A

Yes

.98 (inter), .99 (intra) .99 (intra þ inter)

Ripani et al., 2008 Sheeran et al., 2010 Leroux et al., 2000

No No Yes Yes Yes Yes

.94 .98

.84 (inter), .92 (intra) .92, .93 (intra), .91, .94 (inter) .95e.97 (intra)

.759 N/A N/A

.88 (inter), .89e.99 (intra) .98 (inter þ intra)

N/A .622 

3.8 (intra)

N/A

Yes

.83 (intra)

No

.9e.95 (intra)

.538e.876

Yes

.96 (intra þ inter)

.656e.758

Yes

N/A

Yes

.89e.98 (intra), .87 (inter) .94 (inter), .82 (intra)

.7

Yes

.96 (intra þ inter)

.686e.756

No

.93 (intra)

N/A

Yes

.87 (intra), .94 (inter)

.528e.906

Yes Yes

.93 and .94 (inter) N/A

N/A .897

Yes

.93e.97 (intra)

1 , 1.7

Yes

.92e.96 (intra)

1.7, 2.3

No No

ICC not reported .351e.691 (intra)

.94 N/A

Yes

.310e.727 (intra)

N/A

No

.93e.95 (intra), .97 (inter) .95 (intra)

N/A

No No Yes Yes

Yes Yes

Yes

.921e.992 (intra), .979 (inter) .73e.88 (intra), .83e.87 (inter) .81e.87 (intra), .88e.89 (inter)

N/A

.6 e.8

N/A N/A





N/A 4.2 , 2.8 (intra) N/A 2.3 , 2.7 (intra), 1.4 , 2.1 (inter) .385e.467

.828e.991 (intra),666e.991 (inter) .833e.98 1.7 , 5.5 (intra), .986 (inter) (intra), 2 (inter) N/A

N/A

.89

| Level of evidence | Method | Reliability | Validity |
|---|---|---|---|
| Strong | Debrunner kyphometer | Very high intra-rater reliability | |
| Strong | Flexicurve index | Very high inter-rater reliability | |
| Strong | Spinal Mouse | Very high intra + inter-rater reliability | |
| Moderate | Arcometer | Very high intra + inter-rater reliability | Very high validity |
| Moderate | Flexicurve index | | Moderate validity |
| Moderate | Manual inclinometer | Very high intra-rater reliability | |
| Limited | Goniometer | | High validity |
| Limited | Stereovideographic | | High validity |
| Limited | Pantograph | | Very high validity |
| Limited | Digital inclinometer | High intra + inter-rater reliability | |
| Limited | Spinal Wheel | Very high intra + inter-rater reliability | |
| Limited | Rasterstereography | Very high intra + inter-rater reliability | |
| Limited | Electrogoniometry | Very high intra-rater reliability | High validity |
| Limited | Photogrammetry | Very high inter-rater reliability | |
| Limited | Spinal Mouse | | Low validity |
| Conflicting | Photogrammetry | Low-very high intra-rater reliability | |
| Conflicting | Debrunner kyphometer | High-very high inter-rater reliability | Moderate-high validity |
| Conflicting | Flexicurve index | High-very high intra-rater reliability | |
| Conflicting | Flexicurve angle | High-very high inter + intra-rater reliability | Moderate-high validity |

4.2. Validity

Significant barriers to validity testing are the limited accessibility and the ethical issues regarding the use of spinal X-rays (Greendale et al., 2011). This is likely to be a large contributing factor to the retrieval of only two studies which exclusively examined the validity of a non-invasive instrument for thoracic kyphosis measurement (Leroux et al., 2000; Gravina et al., 2012). Other methods have been suggested as alternatives to the Cobb angle, such as the centroid method (Chen, 1999) and the posterior tangent method (Harrison et al., 2001). However, these are all still radiographically based.

There are several reasons why skin-surface devices may falter in validity. Firstly, skin-surface techniques follow the line of the spinous processes and not that of the vertebral bodies, as done radiographically (Mannion et al., 2004). Secondly, the varying distribution of adipose tissue overlying the spine affects the accuracy obtained (Mannion et al., 2004). This may have been influential as only 1 retrieved validity study reported a BMI <25 (Ripani et al., 2008), whereas 2 studies reported BMI >25 (Greendale et al., 2011; Chaise et al., 2011) and the remaining 8 validity studies had unreported BMI. As detailed by reviewed studies, other sources which were likely to lower the validity scores included incorrect landmark palpation (Leroux et al., 2000; Greendale et al., 2011), measurement error in calculating the Cobb angle (Ripani et al., 2008; Greendale et al., 2011) and the operation of the device (Korovessis et al., 2001; Ripani et al., 2008).

The Debrunner kyphometer, arcometer, inclinometry, goniometry and electrogoniometry attain a kyphosis value from placing the instrument on selected limits of the curve, a method which is similar to the calculation of the Cobb angle. Alternatively, the Flexicurve, Spinal Mouse, Spinal Wheel and pantograph provide a representation of spinal curvature continuously throughout the thoracic spine. By observing the validity data, there appears to be no obvious trend in higher or lower validity scores by using either method. However, over time, relying on selected limits of the curve may fail to reveal changes regionally within the thoracic curvature. This may create a discrepancy for populations with osteoporosis or Scheuermann's disease where single vertebral wedging is common (Briggs et al., 2007; Czaprowski et al., 2012; Gravina et al., 2012). Therefore, the latter techniques may be more sensitive and robust over time (Greendale et al., 2011), and thus may be more appropriate for on-going assessment.

4.3. Reliability

The variable nature of thoracic kyphosis poses a potential challenge to reliability studies (D'Osualdo et al., 1997). Included studies have acknowledged postural variance from session to session as a result of sporting, vocational or routine activity (D'Osualdo et al., 1997; Lewis and Valentine, 2010), fatigue from repeated measures (Hinman, 2004; Kellis et al., 2008) and repositioning errors (Sheeran et al., 2010; van Blommestein et al., 2012). However, some studies attempted to control this by re-testing within the same session (D'Osualdo et al., 1997; Goh et al., 1999; Leroux et al., 2000; Teixeira and Carvalho, 2007; Melvin et al., 2010; Lewis and Valentine, 2010; Chaise et al., 2011; Greendale et al., 2011; de Oliveira et al., 2012) or re-testing at the same time of day (Korovessis et al., 2001; Mannion et al., 2004; Kellis et al., 2008; Saad et al., 2012; Folsch et al., 2012; de Oliveira et al., 2012). Others used techniques such as using the same light and temperature conditions (Saad et al., 2012) and restricting sporting activities between measurement days (Folsch et al., 2012). As reliability data were largely positive, these controls appeared to be sufficient to stabilize the thoracic kyphosis between measurements.

Further potential challenges to taking reliable measures included the accurate palpation of spinal landmarks. The validity of palpation of spinal landmarks has been reported to be poor throughout the spine (O'Haire and Gibbons, 2000; French et al., 2000; Billis et al., 2003).
Difficulty in palpating landmarks was frequently discussed by the studies under review (D'Osualdo et al., 1997; Lundon et al., 1998; Leroux et al., 2000; Dunk et al., 2004; Hinman, 2004; Mannion et al., 2004; Kellis et al., 2008; Lewis and Valentine, 2010; Sheeran et al., 2010; Chaise et al., 2011; Greendale et al., 2011; de Oliveira et al., 2012; van Blommestein et al., 2012). It has been noted that the accuracy of palpation can depend on the skill of the examiner (Billis et al., 2003; Haneline and Young, 2009). Testers under review included physiotherapists/physical therapists (Lundon et al., 1998; Purser et al., 1999; Teixeira and Carvalho, 2007; Melvin et al., 2010; Sheeran et al., 2010; Saad et al., 2012; Czaprowski et al., 2012), physicians (Korovessis et al., 2001; Ripani et al., 2008) and researchers (Kellis et al., 2008; Greendale et al., 2011). Level of experience with the instrument ranged from novice (Hinman, 2004; Greendale et al., 2011) to experienced (Mannion et al., 2004) but was largely undescribed. Therefore, it is unclear if the level of experience of the tester contributed to the reliability obtained. Several studies did not remove markers of palpated landmarks between raters (Lundon et al., 1998; Mannion et al., 2004; Sheeran et al., 2010; Chaise et al., 2011; de Oliveira et al., 2012). The use of the same marked points is likely to have increased the reliability of measurements between raters. Furthermore, variation in the amount of pressure applied with the instrument (D'Osualdo et al., 1997; Hinman, 2004; Mannion et al., 2004; Kellis et al., 2008; Ripani et al., 2008; Sheeran et al., 2010; de Oliveira et al., 2012), unstandardized instructions (Öhlén et al., 1989; Lundon et al., 1998; Hinman, 2004; Mannion et al., 2004) and variations in subject positioning (Dunk et al., 2004; Folsch et al., 2012; de Oliveira et al., 2012) were other commonly discussed challenging factors.
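A high ICC on its own does not say how many degrees of change are needed before a difference exceeds measurement error, which is why the Introduction stresses the Standard Error of Measurement (SEM). The sketch below uses two commonly cited formulas that are not given in this review and are therefore assumptions here: SEM = SD × √(1 − ICC), and a minimal detectable change at 95% confidence of 1.96 × √2 × SEM. The sample standard deviation of 10° is a hypothetical value chosen only to make the arithmetic easy to follow.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM from the between-subject SD and a reliability coefficient.

    Assumed formula: SEM = SD * sqrt(1 - ICC).
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change(sem, z=1.96):
    """Smallest change exceeding measurement error at the chosen confidence.

    Assumed formula: MDC = z * sqrt(2) * SEM (z = 1.96 for 95%).
    """
    return z * math.sqrt(2.0) * sem


# Hypothetical sample: kyphosis SD of 10 degrees, ICC of .96.
sem = standard_error_of_measurement(sd=10.0, icc=0.96)
mdc = minimal_detectable_change(sem)
print(round(sem, 2), round(mdc, 2))
```

Under these assumptions an instrument with an ICC of .96 would still need a change of roughly 5.5° in an individual before it could be distinguished from measurement error, which illustrates why reporting SEM alongside the ICC aids clinical interpretation.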


4.4. Methodological considerations

There were some methodological limitations of the reviewed studies. Firstly, the majority of studies investigated a healthy sample, of mean age between 20 and 65 years and of unreported BMI. A healthy population of this age bracket is not necessarily representative of a clinical population (Whiting et al., 2003) and so the results cannot be generalised to the clinical population. However, both studies which contributed to the strong level of evidence for the very high inter-rater reliability of the Flexicurve index used a postmenopausal, osteoporotic sample (Yanagawa et al., 2000; Greendale et al., 2011), which increases the clinical applicability of the Flexicurve index. BMI is an important sample characteristic as, in reality, the bony landmarks may be more difficult to palpate in obese people, leading to higher measurement errors (Langendefer et al., 2009; Greendale et al., 2011). Secondly, the raters were only sometimes described, further limiting the generalizability of the results. Thirdly, some studies did not describe their statistical methods sufficiently and others used inappropriate analyses. The lack of measures of precision in some studies limits the clinical applicability of their results. Lastly, some studies did not perform (Dunk et al., 2004, 2005; Perriman et al., 2010; Sheeran et al., 2010; Chaise et al., 2011; Folsch et al., 2012; de Oliveira et al., 2012) or did not detail (Willner, 1981; Goh et al., 1999; Hinman, 2004; Mannion et al., 2004; Teixeira and Carvalho, 2007; Saad et al., 2012) controls to ensure between-rater and within-rater blinding. The lack of blinding in inter-rater and intra-rater reliability studies may have inflated the agreement between raters or between measures respectively.

4.5. Limitations of review

The strengths of the present review are its systematic nature, the comprehensive search strategy based on PRISMA guidelines, its use of multiple reviewers and its inclusion of all populations. However, only articles in English were included. During the title/abstract screening, 9 articles were excluded due to their unavailability in English. As there were so few studies found investigating each thoracic kyphosis measurement technique, these articles could have made a significant difference to the overall conclusions of the study. Secondly, the two reviewers assessing the methodological quality of the studies were not blinded to the results of the studies. While this may have produced an opportunity for reviewer bias (Stochkendahl et al., 2006), the stringent criteria of the critical appraisal tool and the use of multiple reviewers reduced the likelihood of bias. Thirdly, the high level of heterogeneity amongst study populations, procedures and testers indicates that the external validity of this review is low.

4.6. Clinical and research implications

The 15 methods highlighted in this systematic review indicate that clinicians have a wide scope of options for thoracic kyphosis measurement. For the present, clinicians must choose a method using their best judgment of the reliability and validity data presented in this review. The Flexicurve index, Debrunner kyphometer and the Spinal Mouse have the strongest evidence base in terms of their reliability, and the Flexicurve index and arcometer have the strongest level of evidence in terms of validity. Factors such as low cost, ease of use for entry level clinicians, and short measurement time have been previously considered to argue for the use of the Flexicurve (Greendale et al., 2011). However, the absence of evidence does not mean an outcome measure is not suitable, only that no data have yet been published to verify validity and reliability. Clinicians must also be mindful of the populations in which these measures were tested and the expertise of the raters testing them.

This systematic review identified the strong need for further research into the psychometric properties of thoracic kyphosis measurement methods, especially methods with limited and inconsistent levels of evidence. As responsiveness to change is an important property to be considered, future research should also consider this. The early research appears promising, but a true representation of the reliability and validity cannot be made until further studies emerge. It is recommended that future research should include representative samples of patients, incorporate adequate measures to ensure subject and examiner blinding, and consider the use of clinically relevant statistical analyses accompanied by estimates of precision.

5. Conclusion

A wide range of thoracic kyphosis measurement techniques have been reviewed. However, there are few studies investigating each technique. Overall, reliability data for investigated techniques are very positive but generally remain limited. The validity of the techniques was lower than their reliability, but information on validity is lacking for many measures. The strongest levels of evidence for reliability exist in support of the Debrunner kyphometer, Spinal Mouse and Flexicurve index, and the strongest levels of evidence for validity support the arcometer and Flexicurve index. The Flexicurve may be the most feasible as it is inexpensive, easy to use and has high levels of both reliability and validity. Future research should concentrate on methods with limited and inconsistent levels of evidence as identified by this review.

References

Adhia DB, Bussey MD, Ribeiro DC, Tumilty S, Milosavljevic S. Validity and reliability of palpation-digitization for non-invasive kinematic measurement, a systematic review. Man Ther 2012:1–9.
Ayub E. Posture and the upper quarter. In: Physical therapy of the shoulder.
Melbourne: Churchill Livingstone; 1991. p. 81e90. Billis EV, Foster NE, Wright CC. Reproducibility and repeatability: errors of three groups of physiotherapists in locating spinal levels by palpation. Man Ther 2003;8(4):223e32. Briggs A, Wrigley T, Tully E, Adams P, Greig A, Bennell K. Radiographic measures of thoracic kyphosis in osteoporosis: Cobb and vertebral centroid angles. Skeletal Radiol 2007;36(8):761e7. Brink Y, Louw QA. Clinical instruments: reliability and validity critical appraisal. J Eval Clin Pract 2011:1e7. Chaise FO, Candotti CT, Torre ML, Furlanetto TS, Pelinson PP, Loss JF. Validation, repeatability and reproducibility of a non-invasive instrument for measuring thoracic and lumbar curvature of the spine in the sagittal plane. Revista Brasileira de Fisioterapia 2011;15(6):511e7. Calliet R. Shoulder pain. 3. Philadelphia: F.A. Davis Company; 1991. Carmines E, Zeller R. Reliability and validity assessment. Beverley Hills: Sage Publications; 1979. Chen Y. Vertebral centroid measurement of lumbar lordosis compared with the Cobb technique. Spine 1999;24(17):1786e90. Czaprowski D, Pawlowska P, Gebicka A, Sitarski D, Kotwicki T. Intra- and interobserver repeatability of the assessment of anteroposterior curvatures of the spine using Saunders digital inclinometer. Ortopaedic Traumatol Rehabil 2012;14(2):145e53. de Oliveira TS, Candotti CT, La Torre M, Pelinson PPT, Furlanetto TS, Kutchak FM, et al. Validity and reproducibility of the measurements obtained using the flexicurve instrument to evaluate the angles of thoracic and lumbar curvatures of the spine in the sagittal plane. Rehabil Res Pract 2012:1e9. Di Bari M, Chiarlone M, Matteuzzi D, Zacchei S, Pozzi C, Bellia V, et al. Thoracic kyphosis and ventilator dysfunction in unselected older persons: an epidemiological study in Dicomano, Italy. J Am Geriatr Soc 2004;52(6):909e15. D’Osualdo F, Schierano S, Iannis M. Validation of clinical measurement of kyphosis with a simple instrument, the arcometer. 
Spine 1997;22(4):408. Dunk NM, Lalonde J, Callaghan JP. Implications for the use of postural analysis as a clinical diagnostic tool: reliability of quantifying upright standing spinal postures from photographic images. J Manipulative Physiol Ther 2005;28(6): 386e92. Dunk NM, Chung YY, Compton DS, Callaghan JP. The reliability of quantifying upright standing postures as a baseline diagnostic clinical tool. J Manipulative Physiol Ther 2004;27(2):91e6.

Fölsch C, Schlögel S, Lakemeier S, Wolf U, Timmesfeld N, Skwara A. Test-retest reliability of 3D ultrasound measurements of the thoracic spine. J Inj Funct Rehabil 2012;4(5):335–41.
French SD, Green S, Forbes A. Reliability of chiropractic methods commonly used to detect manipulable lesions in patients with chronic low-back pain. J Manipulative Physiol Ther 2000;23(4):231–8.
Goh S, Price RI, Leedman PJ, Singer KP. Rasterstereographic analysis of the thoracic sagittal curvature: a reliability study. J Musculoskelet Res 1999;3(2):137.
Gravina AR, Ferraro C, Frizziero A, Ferraro M, Masiero S. Goniometer evaluation of thoracic kyphosis and lumbar lordosis in subjects during growth age: a validity study. Stud Health Technol Inform 2012;176:247–51.
Gray J, Grimsby O. Interrelationship of the spine, rib cage, and shoulder. In: Physical therapy of the shoulder. Edinburgh: Churchill Livingstone; 2004. p. 133–85.
Greendale G, Nili N, Huang MH, Seeger L, Karlamangla A. The reliability and validity of three non-radiological measures of thoracic kyphosis and their relations to the standing radiological Cobb angle. Osteoporos Int 2011;22(6):1897–905.
Haneline MT, Young M. A review of intraexaminer and interexaminer reliability of static spinal palpation: a literature synthesis. J Manipulative Physiol Ther 2009;32(5):379–86.
Harrison DE, Cailliet R, Harrison DD, Janik TJ, Holland B. Reliability of centroid, Cobb, and Harrison posterior tangent methods: which to choose for analysis of thoracic kyphosis. Spine 2001;26(11):227–34.
Hinman MR. Interrater reliability of flexicurve postural measures among novice users. J Back Musculoskelet Rehabil 2004;17(1):33.
Horter TS. How to care for your neck. Phys Ther 1978;52(2):184–5.
Kado DM, Huang MH, Nguyen, Barrett-Connor E, Greendale GA. Hyperkyphotic posture and risk of injurious falls in older persons: the Rancho Bernardo Study. J Gerontol Biol Sci 2007;62(6):682–7.
Kellis E, Adamou G, Tzilios G, Emmanouilidou M. Reliability of spinal range of motion in healthy boys using a skin-surface device. J Manipulative Physiol Ther 2008;31(8):570–6.
Korovessis P, Petsinis G, Papazisis Z, Baikousis A. Prediction of thoracic kyphosis using the Debrunner kyphometer. J Spinal Disord 2001;14(1):67–72.
Langenderfer JE, Rullkoetter PJ, Mell AG, Laz PJ. A multi-subject evaluation of uncertainty in anatomical landmark location on shoulder kinematic description. Comput Methods Biomech Biomed Eng 2009;12(2):211–6.
Leroux MA, Zabjek K, Simard G, Badeaux J, Coillard C, Rivard CH. A non-invasive anthropometric technique for measuring kyphosis and lordosis, an application for idiopathic scoliosis. Spine 2000;25(13):1689–94.
Lewis JS, Valentine RE. Clinical measurement of the thoracic kyphosis. A study of the intra-rater reliability in subjects with and without shoulder pain. BMC Musculoskelet Disord 2010;11:39.
Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Med 2009;6:e1000100.
Lohr KN. Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res 2002;11(3):193.
Lundon KMA, Li A, Bibershtein S. Interrater and intrarater reliability in the measurement of kyphosis in postmenopausal women with osteoporosis. Spine 1998;23(18):1978–85.
Lydick E, Zimmerman SI, Yawn B, Love B, Kleerekoper M, Ross P, et al. Development and validation of a discriminative quality of life questionnaire for osteoporosis. J Bone Miner Res 1997;12:456–63.
Mannion AF, Knecht K, Balaban G, Dvorak J, Grob D. A new skin-surface device for measuring the curvature and global and segmental ranges of motion of the spine: reliability of measurements and comparison with data reviewed from the literature. Eur Spine J 2004;13(2):122–36.
May S, Chance-Larsen K, Littlewood C, Lomas D, Saad M. Reliability of physical examination tests used in the assessment of patients with shoulder problems: a systematic review. Physiotherapy 2010;96(3):179–90.
May S, Littlewood C, Bishop A. Reliability of procedures used in the physical examination of non-specific low back pain: a systematic review. Aust J Physiother 2006;52:91–102.
Melvin M, Sylvia M, Udo W, Helmut S, Paletta JR, Adrian S. Reproducibility of rasterstereography for kyphotic and lordotic angles, trunk length and trunk inclination, a reliability study. Spine 2010;35(14):1353–8.
Milne JS, Williamson J. A longitudinal study of kyphosis in older people. Age Ageing 1983;12:225–33.
Munro BH, Visintainer MA. Statistical methods for health care research. Philadelphia: Lippincott Williams and Wilkins; 2005. p. 239–58.
Murray PM, Weinstein SL, Spratt KF. The natural history and long-term follow-up of Scheuermann kyphosis. J Bone Joint Surg 1993;75(2):236–48.
O’Haire C, Gibbons P. Inter-examiner and intra-examiner agreement for assessing sacroiliac anatomical landmarks using palpation and observation: pilot study. Man Ther 2000;5(1):13–20.
Öhlén G, Spangfort E, Tingvall C. Measurement of spinal sagittal configuration and mobility with Debrunner’s kyphometer. Spine 1989;14:580–3.
Perriman DM, Scarvell JM, Hughes AR, Ashman B, Lueck CJ, Smith PN. Validation of the flexible electrogoniometer for measuring thoracic kyphosis. Spine 2010;35(14):633–40.
Portney LG, Watkins MP. Foundations of clinical research. Applications to practice. Upper Saddle River, New Jersey: Prentice Hall Health; 2000.

Purser JL, Pieper CF, Branch LG, Duncan PW, Gold DT, McConnell ES, et al. Reliability of physical performance tests in four different randomized clinical trials. Arch Phys Med Rehabil 1999;80:557–61.
Ripani M, Di Cesare A, Giombini A, Agnello L, Fagnani F, Pigozzi F. Spinal curvature: comparison of frontal measurements with the spinal mouse and radiographic assessment. J Sports Med Phys Fitness 2008;48(4):488–94.
Saad KR, Colombo AS, Ribeiro AP, Joao SM. Reliability of photogrammetry in the evaluation of the postural aspects of individuals with structural scoliosis. J Bodyw Mov Ther 2012;16(2):210–6.
Sheeran L, Sparkes V, Busse M, van Deursen R. Preliminary study: reliability of the spinal wheel. A novel device to measure spinal postures applied to sitting and standing. Eur Spine J 2010;19(6):995–1003.
Stochkendahl MJ, Christensen HW, Hartvigsen J, Vach W, Haas M, Hestbaek L, et al. Manual examination of the spine: a systematic critical literature review of reproducibility. J Manipulative Physiol Ther 2006;29(6):475–85.
Teixeira F, Carvalho G. Reliability and validity of thoracic kyphosis measurements using flexicurve method. Revista Brasileira de Fisioterapia 2007;11(3):199–204.
Terwee CB, Bot SDM, de Boer MR, van der Windt DA, Knol DL, Dekker J, et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol 2007;60(1):34–42.


Van Blommestein AS, Lewis AS, Morrissey MC, MacRae S. Reliability of measuring thoracic kyphosis angle, lumbar lordosis angle and straight leg raise with an inclinometer. Open Spine J 2012;4:10–5.
van de Ven-Stevens LA, Munneke M, Terwee CB. Clinimetric properties of instruments to assess activities in patients with hand injury: a systematic review of the literature. Arch Phys Med Rehabil 2009;90:151–69.
van der Wurff P, Hagmeijer RHM, Meyne W. Clinical tests of the sacroiliac joint. A systematic methodological review. Part 1: reliability. Man Ther 2000;5:30–6.
van Tulder M, Furlan A, Bombardier C, Bouter L. Updated method guidelines for systematic reviews in the Cochrane Collaboration Back Review Group. In: The Cochrane library. Oxford: Update Software; 2003. 4.
Whiting P, Rutjes AWS, Reitsma JB, Bossuyt PMM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3(25):25.
Willner S. Spinal pantograph: a non-invasive technique for describing kyphosis and lordosis in the thoraco-lumbar spine. Acta Orthop Scand 1981;52:525–9.
Wright JG, Feinstein AR. Improving the reliability of orthopaedic measurements. J Bone Joint Surg 1992;74:287–91.
Yanagawa TL, Maitland ME, Burgess K, Young L, Hanley D. Assessment of thoracic kyphosis using the flexicurve for individuals with osteoporosis. Hong Kong Physiother J 2000;18(2):53–7.