ISAKOS Scientific Committee Report
Scoring Systems for the Functional Assessment of the Shoulder
Alexandra Kirkley,† M.D., M.Sc., F.R.C.S.C., Sharon Griffin, C.S.S., and Katie Dainty, M.Sc., C.R.P.C.
Abstract: A number of instruments have been developed to measure the quality of life in patients with various conditions of the shoulder. Older instruments appear to have been developed at a time when little information was available on the appropriate methodology for instrument development. Much progress has been made in this area, and currently an appropriate instrument exists for each of the main conditions of the shoulder. Investigators planning clinical trials should select modern instruments that have been developed with appropriate patient input for item generation and reduction, and established validity and reliability. Among the other factors discussed in this review, responsiveness of an instrument is an important consideration as it can serve to minimize the sample size for a proposed study. The shoulder instruments reviewed include the Rating Sheet for Bankart Repair (Rowe), ASES Shoulder Evaluation Form, UCLA Shoulder Score, The Constant Score, Disabilities of the Arm, Shoulder and Hand (DASH), the Shoulder Rating Questionnaire, the Simple Shoulder Test (SST), the Western Ontario Osteoarthritis of the Shoulder Index (WOOS), the Western Ontario Rotator Cuff Index (WORC), the Western Ontario Shoulder Instability Index (WOSI), Rotator Cuff Quality of Life (RC-QOL), and the Oxford Shoulder Scores (OSS). Key Words: Quality of life—Shoulder outcomes—Outcome development.
In a previous article in this series, the methodology for the development and evaluation of a disease-specific quality of life instrument was described. We will now discuss each of the most commonly used shoulder scoring systems, commenting on their strengths and weaknesses. The shoulder instruments reviewed in this article include the Rating Sheet for Bankart Repair (Rowe), UCLA Shoulder Score, The Shoulder Pain and Disability Index (SPADI), ASES Shoulder Evaluation Form, The Constant Score, Disabilities of the Arm, Shoulder and Hand (DASH), the
From the Fowler Kennedy Sport Medicine Clinic, London, Ontario, Canada. †Deceased. Address correspondence and reprint requests to Katie Dainty, M.Sc., C.R.P.C., Fowler Kennedy Sport Medicine Clinic, 3M Centre, University of Western Ontario, London, Ontario N6A 3K7, Canada. E-mail:
[email protected] © 2003 by the Arthroscopy Association of North America 0749-8063/03/1910-3893$30.00/0 doi:10.1016/j.arthro.2003.10.030
Shoulder Rating Questionnaire, the Western Ontario Osteoarthritis of the Shoulder Index (WOOS), the Western Ontario Rotator Cuff Index (WORC), the Western Ontario Shoulder Instability Index (WOSI), Rotator Cuff Quality of Life (RC-QOL), and the Oxford Shoulder Scores (OSS).

THE RATING SHEET FOR BANKART REPAIR

In 1978, Carter Rowe published a classic article evaluating the long-term results of the Bankart repair.1 It was in this article that he introduced a new rating system for the postoperative assessment of patients undergoing anterior stabilization. This system scores patients based on 3 separate areas (stability, motion, and function), with 1 item for each of these areas. The weighting is such that stability accounts for 50 points, motion for 20 points, and function for 30 points, giving a total possible score of 100 points. Unfortunately, there are no published reports on the
Arthroscopy: The Journal of Arthroscopic and Related Surgery, Vol 19, No 10 (December), 2003: pp 1109-1120
A. KIRKLEY ET AL.
development or testing of this instrument. It is likely that the items used in the questionnaire were selected without direct patient input. There are a number of problems that can be identified with this instrument. Each of the 3 domains in this instrument contains “double-barreled” questions, i.e., the subject is asked to consider more than 1 question at the same time, each of which may be answered differently. For example, in responding to the stability domain, the subject must choose the best response considering dislocations, subluxations, and apprehension. The motion domain includes 3 different motions (external rotation, forward elevation, and internal rotation) and the function domain includes both functional limitation and pain. Some subjects may choose the response option only if all conditions are met while others may choose based on the 1 condition they think is the most important. It is unknown why the developers of this instrument assigned the various weights to the 3 items (stability accounts for 50%, motion 20%, function 30%). While not necessarily incorrect, it is unsupported. Similarly, it is unknown why the items have been assigned what appear to be random scores. For instance, a total of 30 points are assigned to the function domain. The difference in score from no limitation to mild limitation is 5 points, whereas the difference from mild to moderate limitation is 15 points. Although this is not necessarily incorrect, it is unconventional and is fundamentally arbitrary. It is not clear whether apprehension is to be measured by asking the patient whether they have apprehension or by examining the patient and doing an apprehension test (putting the arm in a position of extreme abduction and external rotation and monitoring the sensation of instability). This is important to define because many patients will deny apprehension for day-to-day activities but if put in a provocative position will feel apprehensive. 
The evaluation of motion is not defined as active or passive nor does the instrument describe whether the scapulothoracic joint is to be stabilized. Further, since for this instrument, motion is based on a percentage of the normal shoulder, it is not clear how one would score a patient who does not have a contralateral normal shoulder. This instrument combines 2 items of subjective evaluation with 1 item of physical examination. As these items are measuring fundamentally different attributes it is probably not meaningful to combine them for a total score as is meant to be done in this instrument.
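The weighting criticism above can be made concrete with a short sketch. It uses only the point values quoted in this section. The function-domain levels (30, 25, 10) are inferred from the 5-point and 15-point differences mentioned in the text, and the helper names are hypothetical, not part of the Rowe instrument.

```python
# Maximum points per domain of the Rowe rating sheet, as stated in the text.
ROWE_MAX_POINTS = {"stability": 50, "motion": 20, "function": 30}


def domain_weights(max_points):
    """Return each domain's share of the total possible score, in percent."""
    total = sum(max_points.values())
    return {name: 100.0 * pts / total for name, pts in max_points.items()}


def step_sizes(points_by_level):
    """Point drop between consecutive response levels (best level first)."""
    return [a - b for a, b in zip(points_by_level, points_by_level[1:])]


rowe_weights = domain_weights(ROWE_MAX_POINTS)

# Function domain: no limitation, mild limitation, moderate limitation.
# These three values are inferred from the differences quoted in the text.
rowe_function_steps = step_sizes([30, 25, 10])

print(rowe_weights)
print(rowe_function_steps)
```

The unequal steps between adjacent response levels are what the text describes as unconventional: a one-category change is worth three times as much between mild and moderate limitation as between no limitation and mild limitation.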
THE UCLA SHOULDER SCORE

The University of California at Los Angeles Shoulder Rating scale was first published in 1981 in a paper by H. C. Amstutz et al.2 The instrument was intended to be used in studies of patients undergoing total shoulder arthroplasty for arthritis of the shoulder. Since then, however, it has been used for patients with other shoulder conditions including rotator cuff disease3 and shoulder instability.4 This instrument assigns a score to patients based on 5 separate domains: pain, function, active forward flexion, strength of forward flexion, and overall satisfaction. There is 1 item for each of these areas. The weighting is such that pain accounts for 10 points, function for 10 points, forward flexion for 5 points, strength for 5 points, and overall satisfaction for 5 points, giving a total of 35 points. There are no publications available on the development or testing of this instrument. It is likely that the items on this instrument were also selected by the authors without direct patient input, similar to the Rowe instrument.

Several problems can be identified with this tool. The items in the pain and function domains are “double-barreled.” As an example, when measuring pain, the patient is asked to comment on frequency, severity, and the type and amount of medication that is required to relieve the pain. This certainly presents difficulties in choosing an appropriate response when the patients will be unlikely to find a perfect match from the response options available. Again similar to the Rowe, it is unknown why the developers of this instrument assigned the various weights to the 5 domains (Pain 28.6%, Function 28.6%, Range of Motion 14.3%, Strength 14.3%, Satisfaction 14.3%). While not necessarily incorrect, it is unsupported. The overall satisfaction item only allows for the instrument to be used after an intervention and not before and after as would be ideal in most clinical trials.
In addition, it is not clear how a subject would respond if his or her condition were unchanged. This instrument also combines 2 items of subjective evaluation of function with 1 item of physical examination. As these are measuring fundamentally different attributes it is probably not meaningful to combine them for a total score. Clearly both of these first 2 instruments were developed before the advent of modern measurement development methodology. The problems identified with these instruments may lead to poor reliability,
REVIEW OF SHOULDER OUTCOME TOOLS
validity, and responsiveness, and therefore they may not be ideal choices for evaluating patients in the research or clinical environment.

THE SHOULDER PAIN AND DISABILITY INDEX (SPADI)

In 1991, Roach et al. published the development and evaluation of the SPADI.5 The authors state that “the SPADI was developed to provide a self-administered instrument that would reflect the disability and pain associated with the clinical syndrome of painful shoulder.” It was designed as both a discriminative and evaluative instrument. The majority of the item generation and reduction was carried out by a panel of 3 rheumatologists and a physical therapist without direct patient input. Further items were eliminated based on poor test-retest reliability or a low correlation with shoulder range of motion (ROM) on the involved side. Eliminating items based on poor reliability is logical for a discriminative instrument but not necessarily ideal for an evaluative instrument, as an item can have poor reliability but be important to patients and therefore be highly responsive. Eliminating items based on poor correlation with range of motion may have a negative impact on the final tool make-up, as range of motion has been found to rarely correlate more than modestly with patients’ estimation of their subjective functioning. There is no report of formal pretesting.

The instrument has 13 items divided into 2 subscales: pain (5 items) and disability (8 items). The response format selected for the instrument was the 10-cm VAS anchored verbally at each end. However, in distinction to the usual method of scoring a VAS in which the slash on the line is measured from the left anchor in millimeters, the authors describe dividing the horizontal line into 12 segments of equal length. A number ranging from 0 to 11 is attached to the segment to produce a score for each item.
The scores for the individual items are given equal weight within their domain and the domain scores are reported by converting to a score out of 100, with a score of 0 being perfect and a score of 100 being the worst score possible. The total score for the instrument is determined by averaging the scores for the two domains of pain and disability.

The reliability of the instrument has been evaluated by measuring test-retest reliability over several days in 23 subjects who represented a subset of 37 male patients presenting to an ambulatory care clinic with a complaint of shoulder pain and who were subsequently enrolled in a randomized clinical trial. The ages of the subjects ranged from 23 to 76 years, with a mean age of 58 years. Of the patients included, 27 had shoulder pain of musculoskeletal origin. The majority of the remaining subjects had shoulder pain of neurogenic or undetermined origin. The intraclass correlation coefficient (ICC) for the SPADI total score (.65) and for the pain and disability subscales (.64 and .64) may be falsely elevated by the short time interval of several days, which may not be long enough for subjects to have forgotten their original score. The diverse population on which it was tested may also have an effect on the overall outcome. Despite this, the scores of reliability are only modest. The authors explain that the subjects may have actually improved over the short time frame as most were started on an analgesic at the original visit. As a global rating of change score was not administered concurrently, it is unknown if this is the case or not. The authors also report on the internal consistency of the instrument using Cronbach’s alpha (total score .95, pain subscale .86, disability subscale .93). Further, factor analysis found that most of the items loaded onto 1 factor, supporting the conclusion that the SPADI measures 1 construct. Varimax rotation provided limited support for 2 subscales.

Limited validation testing has been described. Construct validation consisted of testing the hypotheses that the SPADI would correlate with baseline shoulder active ROM (flexion, abduction, extension, external rotation) measurements as an indicator of discriminative function and over time using change scores after an intervention (as a measure of evaluative function). The tool was administered to all 37 of the previously described subjects. Correlations with baseline range of motion ranged from .55 to .80. Correlations with range of motion change scores ranged from .50 to .70. The responsiveness has not been formally tested.
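The SPADI scoring arithmetic described in this section can be sketched as follows. This is a minimal illustration and not the developers' published scoring code: it assumes each item is scored on the 12-segment scale numbered 0 to 11, and the function names are hypothetical.

```python
ITEM_MAX = 11  # assumption: segments are numbered 0 to 11, so 11 is the worst item score


def domain_score(items, item_max=ITEM_MAX):
    """Convert equally weighted item scores to a 0-100 domain score (0 = perfect)."""
    if not items:
        raise ValueError("at least one item score is required")
    for value in items:
        if not 0 <= value <= item_max:
            raise ValueError(f"item score {value} is outside 0..{item_max}")
    return 100.0 * sum(items) / (item_max * len(items))


def spadi_total(pain_items, disability_items):
    """Total score: the average of the pain and disability domain scores."""
    return (domain_score(pain_items) + domain_score(disability_items)) / 2.0


# Hypothetical patient: no pain on any of the 5 pain items,
# worst possible response on all 8 disability items.
example_total = spadi_total([0] * 5, [11] * 8)
print(example_total)
```

Because the total is the mean of the two domain scores, the 5 pain items together carry the same weight as the 8 disability items, so a single pain item counts for more than a single disability item.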
No report of the minimally important difference has been provided.

THE AMERICAN SHOULDER AND ELBOW SURGEONS EVALUATION FORM (ASES)

In 1993, the Society of the American Shoulder and Elbow Surgeons developed a standardized form for the assessment of shoulder function.6 The purpose was to facilitate communication between investigators and to permit and encourage multicenter trials. The members felt that the required attributes of any new tool were ease of use, a method of assessing activities of daily living and inclusion of a patient self-evaluation
section. The research committee of the ASES reviewed all published forms available at the time and based on those and their own ideas developed a prototype instrument. It is not stated how the committee selected the items for this instrument. The prototype instrument was distributed to the members, who were encouraged to use the instrument and then offer constructive criticism. More than 70 suggestions to improve the instrument were made after distribution of the prototype. Following review by the research committee a second prototype was distributed in the summer of 1992. A further 15 suggestions were made and further revisions resulted in the final instrument. The instrument consists of a physician assessment section and a patient self-evaluation section. The physician assessment section includes physical examination and documentation of range of motion, strength, and instability, and demonstration of specific physical signs. No score is derived for this section of the instrument. The patient self-evaluation section has 11 items that can be used to generate a score. These are divided into 2 areas: pain (1 item) and function (10 items). The response to the single pain question is marked on a 10-cm visual analog scale (VAS), which is divided into 1-cm increments and anchored with verbal descriptors at 0 and 10 cm. The 10 items in the function area of the ASES include activities of daily living such as managing toileting, putting on a coat, etc. There are more demanding activities such as lifting 10 pounds above shoulder height and throwing a ball overhand. Finally, there are 2 general items: doing usual work and doing usual sport. There are 4 categories for response options from 0 (unable to do) to 3 (not difficult). Because of this, the responsiveness of the individual items is likely poor, especially in higher functioning patients. 
As an example, if a patient found an activity somewhat difficult prior to treatment he or she would have to have no difficulty whatsoever after treatment to improve by 1 category. The final score is tabulated by multiplying the pain score (maximum 10) by 5 (therefore total possible 50) and the cumulative activity score (maximum 30) by 5/3 (therefore, a total possible 50) for a total of 100. No rationale has been presented for the weighting scheme of this instrument. While not necessarily incorrect, it is unsupported. No published data is available on the testing of this instrument.

Three of these instruments, the Rowe, UCLA score, and ASES score, have been compared in a group of 52 patients with shoulder instability undergoing surgical stabilization.4 The 3 scales provided remarkably different categorization of patients and correlated poorly with each other. The authors of this study concluded, “the most commonly used scoring systems for shoulder conditions yield varying results when used to evaluate shoulder instability outcomes in our patient population. We urgently need a well-accepted shoulder system based on the patient’s functional status to critically assess our management of various shoulder conditions.”4

THE CONSTANT SCORE

The Constant Score7 has become the most widely used shoulder evaluation instrument in Europe. This scoring system combines physical examination tests with subjective evaluations by the patients. The subjective assessment consists of 35 points and the remaining 65 points are assigned for the physical examination assessment. The subjective assessment includes a single item for pain (15 points) and 4 items for activities of daily living (work 4, sport 4, sleep 2, and positioning the hand in space 10 points). The objective assessment includes: range of motion (forward elevation, 10 points; lateral elevation, 10 points; internal rotation, 10 points; external rotation, 10 points) and power (scoring based on the number of pounds of pull the patient can resist in abduction to a maximum of 25 points). The total possible score is therefore 100 points.

The publication by Constant7 in which he describes the instrument does not include methodology for how it was developed and more specifically, the rationale for item selection and relative weighting of the items. The strength of this instrument is that the method for administering the tool is quite clearly described, which is an improvement on pre-existing tools. It is unknown why the developers of this instrument assigned various weights to the items (pain 15%, function 20%, range of motion 40%, strength 25%). While not necessarily incorrect, it is unsupported. This instrument combines 4 items of function with 5 items of physical examination.
As these are measuring fundamentally different attributes, they should be measured separately as opposed to being combined for a total score. This instrument is weighted heavily on range of motion (40%) and strength (25%). Although this may be useful for discriminating between patients with significant rotator cuff disease or osteoarthritis, it is not useful for patients with instability. In fact, in one study all the patients with instability of the shoulder
scored nearly perfectly (95-100) despite having problems of sufficient magnitude to request surgical intervention. The reliability of this measurement tool has been evaluated on a limited basis.8 Although the methodology is not described in detail, Constant7 states that when the instrument was used to assess 100 abnormal shoulders by 3 different observers, the interobserver error was an average of 3% ranging from 0% to 8%. Conboy et al.8 measured the reliability on 25 patients with varying diagnoses of shoulder syndromes. They demonstrated that the 95% confidence limit between observers was 27.7 points and within observers was 16 points. No data on formal testing of either the validity or the responsiveness of this instrument have been published.

THE DISABILITIES OF THE ARM, SHOULDER AND HAND (DASH)

Recently, the American Academy of Orthopaedic Surgeons (AAOS) along with the Institute for Work & Health (Toronto, Ontario, Canada) developed an outcome tool to be used for patients with any condition of any joint of the upper extremity. This instrument, called the Disabilities of the Arm, Shoulder and Hand Measurement tool or DASH, is made available by the AAOS. A brief description of the methodology for the item generation and the initial item reduction phases has been published.9 In 1999, the AAOS and Institute for Work & Health developed and published a User’s Manual for the DASH outcome measure.10 The complete development and testing of the instrument is detailed in this manual.
The DASH is a 30-item questionnaire designed to evaluate “upper extremity-related symptoms and measure functional status at the level of disability.” Disability is defined as “difficulty doing activities in any domain of life (the domains typical for one’s age/sex group) due to a health or physical problem.”11 Concepts covered by the DASH include symptoms (pain, weakness, stiffness, and tingling/numbness), physical function (daily activities, house/yard chores, shopping, errands, recreational activities, self-care, dressing, eating, sexual activities, sleep, and sport/performing art), social function (family care, occupation, socializing with friends/family) and psychological function (self-image). Item generation was carried out by first reviewing the literature. Thirteen scales were combined to produce an initial pool of 821 items. Item reduction was carried out in 2 steps. Three members of the collaborative development group reviewed the original items.
The items were stripped of scaling and attribution to a specific disorder. Items that were repetitive or obviously unrelated to the upper extremity were eliminated. The reduced list was then sent to clinician “content experts” for their input as to content/face validity and the importance of the items (5-point scale: 2 = definitely yes, to −2 = definitely no). This allowed for reduction of 821 potential items to a 67-item questionnaire. The 67 items were reformatted into a questionnaire suitable for field testing. This questionnaire was pretested on 20 patients with upper limb problems to ensure readability, absence of ambiguity, and understanding of scaling and content, as well as to confirm that an adequate number/type of response items were available. In this publication the authors state, “. . . further item reduction will be carried out after field testing of the questionnaire on 420 patients in Canada, Australia, and the United States. Frequency of endorsement and internal consistency will be assessed using the data generated by the field testing. Items with a very high or low endorsement rate or excessively high correlations with other items in the same scale will be eliminated. Factor analysis will also be used to empirically validate the aggregation of items into subscales.” This testing was actually completed by Marx et al. in 1996.

The major criticism of this tool is that the item-generation phase did not include interviews with patients with the conditions of interest. It has been well documented that physicians are poor judges of patient status12,13 and likely are poor judges of what is important to patients. The initial item reduction was done by clinicians, although it has been reported that item impact, as determined from patient input, was used for the remainder of the item reduction. There are several examples in the DASH where one item is a more specific version of another item.
For example, item #24 “pain in the arm, shoulder, or hand when performing any specific activity” is a more specific version of item #23, which asks about arm, shoulder, or hand pain in general. It is unclear why they would choose 2 items where the more specific one would make up part or all of the response to the more general one. Similarly, the 4 questions relating to sports or playing an instrument would appear to have considerable overlap. Item #32, “difficulty playing your musical instrument or sport as well as you would like” would have a large contribution from item #31 “difficulty playing your musical instrument or sport because of pain.” Although it is not technically incorrect, it builds considerable redundancy into the
tool, which has the effect of attributing more weight or value to these items. This instrument is intended for patients with any condition of any joint of the upper extremity. This makes it attractive for use in the clinical setting where patients present in an undifferentiated fashion. The patients can complete the questionnaire before a diagnosis is established. There is also much more information currently available on scoring of the DASH now that the DASH User’s Manual is available. This is a very useful resource for clinicians interested in properly implementing the DASH as an outcome tool in their practice. Unfortunately the broader scope of this instrument makes it less attractive for use in a clinical trial. Many of the items may seem irrelevant to patients with specific conditions. In addition, this instrument has been shown to be less responsive than other shoulder specific and shoulder condition specific instruments making it less efficient as a research tool.14-16

THE SHOULDER RATING QUESTIONNAIRE

In 1997, L’Insalata et al. published the Shoulder Rating Questionnaire “a self-administered questionnaire for the assessment of symptoms and function of the shoulder.”18 It is unknown how the items on the instrument were generated or selected. It is simply stated that “A preliminary questionnaire was developed.” The preliminary questionnaire was administered to 30 patients and a subset of those patients were interviewed to identify clinical relevance, relative importance, and ease of completion and grading. This allowed for modifications to be made to produce a revised questionnaire.
An “assessment” of the questionnaire was said to have been completed, after which “questions that had poor reliability, substantially reduced the total or subset internal consistency, or contributed little to the clinical sensitivity of the over-all instrument were eliminated to produce the final questionnaire.” The final instrument includes 6 separately scored domains: global assessment, pain, daily activities, recreational and athletic activities, work, and satisfaction. A final, nongraded domain allows the patient to select 2 areas in which he or she believes improvement is most important. The global assessment domain consists of a single VAS. Each of the other scored domains consists of a series of multiple-choice questions with 5 response categories from 1 (poorest) to 5 (best). Each domain is scored separately by averaging the
scores of the completed questions and multiplying by two. Thus, the possible score for each domain ranges from 2 (poorest) to 10 (best). Further, the investigators suggest a weighting scheme based on “consultation with several shoulder surgeons and patients regarding the relative importance of each of the domains.” The weighting is as follows: global assessment 15%, pain 40%, daily activities 20%, recreational and athletic activities 15%, work 10%. Therefore, the total possible score ranges from 17 to 100.

Testing of this instrument has been described by the developers. Test-retest reliability was evaluated in 40 patients with a wide variation of characteristics (age, gender, shoulder disease, and severity) at variable time intervals within 1 week of initial administration (mean of 3 days, range 1 to 7 days). They reported the Spearman Rank Correlation Coefficient for the overall instrument (.96) and each of the domains (range .81 to .96). A criticism of this approach is that the values may have been falsely elevated for 2 reasons. First, 3 days is unlikely to be long enough for patients to forget their previous responses, making it more likely that they could reproduce their original score. Second, because reliability is a measure of the ratio of the between-person variance to the total variance, testing reliability in such a diverse population increases the numerator, giving a higher reliability than one might get in a population more representative of a typical study population where all the patients have only 1 condition. To date, the responsiveness for this tool has not been compared with any other existing shoulder instruments. The investigators indicate that a difference of 12 points for the total score and 2 points for each domain score compared with pretreatment scores is clinically important, although the rationale for the selection of these values is not described.
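The scoring and weighting scheme just described can be sketched in a few lines. This is an illustration of the arithmetic only. It assumes the global assessment VAS is scored from 0 to 10 (an assumption that reproduces the stated minimum of 17), and the function names, domain keys, and response lists are hypothetical.

```python
# Suggested domain weights in percent, as listed in the text.
WEIGHTS = {
    "global": 15,
    "pain": 40,
    "daily_activities": 20,
    "recreation_athletics": 15,
    "work": 10,
}


def domain_score(responses):
    """Average the completed 1-to-5 responses and multiply by two (range 2 to 10).

    Unanswered questions are passed as None and are skipped.
    """
    completed = [r for r in responses if r is not None]
    if not completed:
        raise ValueError("no completed questions in this domain")
    return 2.0 * sum(completed) / len(completed)


def total_score(domain_scores):
    """Weighted total on a 100-point scale from domain scores on a 10-point scale."""
    return sum(WEIGHTS[name] * domain_scores[name] for name in WEIGHTS) / 10.0


# Worst case: global VAS of 0 (assumed minimum) and every question answered 1.
worst = {name: domain_score([1, 1, 1]) for name in WEIGHTS if name != "global"}
worst["global"] = 0.0

# Best case: global VAS of 10 and every question answered 5.
best = {name: domain_score([5, 5, 5]) for name in WEIGHTS if name != "global"}
best["global"] = 10.0

print(total_score(worst), total_score(best))
```

Under these assumptions the sketch reproduces the 17-to-100 range quoted in the text, and it shows that pain alone contributes up to 40 of the 100 points.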
The validation described consisted of correlating scores on the Shoulder Rating Questionnaire with comparable domains of the Arthritis Impact Measurement Scales 2. No a priori predictions were made and no interpretation of the observed correlations (ranging from .56 to .89) is described. A second construct was tested: that patients who selected a particular domain as an important area for improvement would score lower on that domain than patients who did not select it as an important area. A significant difference was found for each of the 4 domains tested (pain, daily activities, recreation/athletic activities, and work). Construct validity, as assessed through correlations between this instrument and other measures of shoulder function, has not been determined.
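The earlier criticism, that testing reliability in a diverse population inflates the coefficient, follows directly from treating reliability as between-person variance divided by total variance. The sketch below illustrates this with invented variance values. It is not a reanalysis of the published data, and the simple two-component variance model is an assumption.

```python
def reliability(between_person_var, error_var):
    """Reliability as the share of total variance that is between-person variance."""
    total_var = between_person_var + error_var
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    return between_person_var / total_var


# Hypothetical variances: measurement error is identical in both samples,
# but the mixed-diagnosis sample has far more spread between patients.
single_condition = reliability(between_person_var=50.0, error_var=50.0)
mixed_conditions = reliability(between_person_var=450.0, error_var=50.0)

print(single_condition, mixed_conditions)
```

With the same measurement error, the coefficient rises simply because the patients differ more from each other, which is why a value obtained in a heterogeneous sample may overstate the reliability to be expected in a trial of patients with a single condition.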
THE SIMPLE SHOULDER TEST (SST)

In 1992, Lippitt, Harryman, and Matsen reported on the development and testing of the Simple Shoulder Test (SST).19 The purpose of the instrument is stated to be a means of documenting the functional improvement resulting from a specified procedure performed by a specific surgeon in response to a given diagnosis and to characterize the severity of the condition. The SST consists of 12 questions with “yes or no” response options. The instrument combines subjective items and items that actually require the patient to perform a physical function. For example, the patient is asked “Does your shoulder allow you to sleep comfortably?” which is subjective and “Can you lift 8 pounds to the level of your shoulder without bending your elbow?” which requires the patient to perform the maneuver. Item generation and reduction was based on Neer’s evaluation,20 the ASES evaluation,21 and observation of complaints of patients by the instrument developers. It is not clear how the final 12 items were actually selected.

The tool was administered to 49 subjects between the ages of 60 and 70 with (1) no history of shoulder disease, injury, or surgery, (2) no shoulder symptoms, and (3) a normal shoulder ultrasound to rule out silent rotator cuff tears. Essentially, all patients obtained a perfect score (3% unable to place 8 lb at head level, 2% unable to carry 20 lb at the side, 5% incapable of throwing 20 yards). The tool has been administered to 250 patients with different diagnoses (osteoarthritis, rheumatoid arthritis, avascular necrosis, subacromial impingement, rotator cuff tears, frozen shoulder, traumatic anterior instability, and multidirectional instability). The instrument is able to distinguish between patients with these conditions and normal shoulder function.
The authors noted distinct patterns between groups of patients with the different conditions, indicating that the instrument might be helpful in establishing a diagnosis. Some data on the SST following patients after rotator cuff repair indicates that the instrument can be used to determine what functional improvement the average patient obtains post treatment. The authors provide no report of formal testing of reliability of this instrument. The responsiveness has not been evaluated nor compared with other measures of shoulder function. The SST is unlikely to be sensitive to small but clinically important changes in patient function because of the dichotomous response options (yes or no). For the same reason, the instrument is likely to have poor discriminative function to differentiate between patients with varying severity of the same condition.

THE WESTERN ONTARIO SHOULDER TOOLS

In 1998, Kirkley et al. published the first in a series of disease-specific quality of life measurement tools for the shoulder, The Western Ontario Shoulder Instability Index (WOSI).14 This instrument was developed and evaluated using the methodology as described by Kirshner and Guyatt.22 The stated purpose of the instrument was for use as the primary outcome measure in clinical trials evaluating treatments for patients with shoulder instability. In 2001, the second in the series of disease-specific quality of life instruments for the shoulder, The Western Ontario Osteoarthritis of the Shoulder Index (WOOS), was published.23 The authors state that the instrument was developed and evaluated using similar methodology as was used in the development of the WOSI.14 The WOOS is meant for use as the primary outcome measure in clinical trials evaluating patients with symptomatic primary osteoarthritis of the shoulder. Most recently, in 2003, the third instrument in the series, the Western Ontario Rotator Cuff Index (WORC), was accepted for publication as a primary outcome measure in clinical trials evaluating treatments for patients with degeneration of the rotator cuff.15 The WORC was also developed and evaluated using the methodology as described by Kirshner and Guyatt.22

Item generation was carried out in 3 steps for all 3 of the tools, which included a review of the literature and existing instruments, interviews with clinician experts, and interviews with 33 patients (sampled to redundancy), representing the full spectrum of patient characteristics. Item reduction was carried out using the frequency importance product (impact) from a survey of 100 patients representing the full spectrum of patient characteristics and a correlation matrix to eliminate redundant items.
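As a rough sketch of the frequency importance product mentioned above: an item's impact is taken here as the proportion of patients who report the item multiplied by the mean importance those patients assign to it. The 1-to-5 importance scale, the item names, and the survey data are all hypothetical, and the developers' exact survey format is not described in this review.

```python
def impact(ratings):
    """Frequency-importance product for one candidate item.

    `ratings` holds one entry per surveyed patient: None if the patient does
    not experience the item, otherwise an importance rating (assumed 1 to 5).
    """
    endorsed = [r for r in ratings if r is not None]
    if not endorsed:
        return 0.0
    frequency = len(endorsed) / len(ratings)
    mean_importance = sum(endorsed) / len(endorsed)
    return frequency * mean_importance


def rank_items(survey):
    """Order candidate items from highest to lowest impact."""
    return sorted(survey, key=lambda item: impact(survey[item]), reverse=True)


# Hypothetical survey of 4 patients and 3 candidate items.
survey = {
    "pain_with_overhead_activity": [5, 4, 5, None],
    "clicking_or_clunking": [2, None, None, None],
    "difficulty_sleeping": [4, 4, None, None],
}

ranking = rank_items(survey)
print(ranking)
```

Items at the bottom of such a ranking are candidates for removal. The correlation matrix described in the text would then be used separately to drop items that duplicate one another.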
The response format selected for the instrument was the 10-cm VAS anchored verbally at each end. The prototype instrument was pretested on 2 consecutive groups of 10 patients. The items were assigned equal weight based on the uniformly high impact scores. A database of patients meeting the inclusion/exclusion criteria for symptomatic shoulder instability from all the clinically relevant categories with the exception of fixed dislocations was established. A database of patients meeting the inclusion/exclusion criteria for symptomatic rotator cuff disease including rotator cuff tendinitis, rotator cuff tendinosis with no tear, partial-thickness rotator cuff tears, full-thickness rotator cuff tears (small to massive) and rotator cuff arthropathy was established. Similarly, a database of patients of all ages with a diagnosis of primary osteoarthritis of the shoulder was defined and established.

The Western Ontario instruments are constructed as shown in Table 1. Each instrument includes instructions to the patients, a supplement with an explanation of each item, and detailed instructions for the clinician on scoring.

TABLE 1. The Western Ontario Instruments
WOSI (21 items): Physical Symptoms (10 items); Sport/Recreation/Work Function (4 items); Lifestyle Function (4 items); Emotional Function (3 items)
WORC (21 items): Physical Symptoms (6 items); Sport/Recreation (4 items); Work Function (4 items); Lifestyle Function (4 items); Emotional Function (3 items)
WOOS (19 items): Physical Symptoms (6 items); Sport/Recreation/Work Function (5 items); Lifestyle Function (5 items); Emotional Function (3 items)

The authors recommend using the total score for the primary outcome in clinical trials but also recommend reporting individual domain scores. The scores can be presented in their raw form or converted to a percent score. The best possible total score is 100% (raw score = 0) and signifies that the patient has no decrease in shoulder-related quality of life. The worst possible score is 0% (raw score = 2,100 in the WOSI and the WORC and 1,900 in the WOOS) and signifies that the patient has an extreme decrease in shoulder-related quality of life.

Validity has been assessed through construct validation by making a priori predictions of how the instrument would correlate with other measures of health status at 1 time point, as an indicator of discriminative function, and over time using change scores after an intervention of known effectiveness, as a measure of evaluative function (Table 2).
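The conversion from raw total to percent score for the Western Ontario instruments can be sketched as follows. The maximum raw totals are those stated in the text (2,100 for the WOSI and WORC, 1,900 for the WOOS), which is consistent with each item being measured on a 100-mm VAS. The linear formula and the function name are an illustrative reading of the stated endpoints, not a reproduction of the instruments' official scoring instructions.

```python
# Worst possible raw totals as stated in the text (items x 100 mm per item).
MAX_RAW = {"WOSI": 2100, "WORC": 2100, "WOOS": 1900}


def percent_score(instrument, raw_total):
    """Convert a raw total (0 = best) to a percent score (100 = best).

    Assumes a linear mapping between the two endpoints given in the text.
    """
    max_raw = MAX_RAW[instrument]
    if not 0 <= raw_total <= max_raw:
        raise ValueError(f"raw total must lie between 0 and {max_raw}")
    return (max_raw - raw_total) / max_raw * 100


best = percent_score("WOSI", 0)         # no decrease in quality of life
worst = percent_score("WOOS", 1900)     # extreme decrease in quality of life
midpoint = percent_score("WORC", 1050)  # halfway along the raw scale
```

Reporting the percent form makes totals comparable across the 21-item and 19-item instruments, since the raw maxima differ.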
TABLE 2. The Western Ontario Indexes
Western Ontario Index: Correlated Measures
WOSI: ASES, UCLA, Constant, DASH, Rowe, SF-12 (Physical & Mental domains), and range of motion
WORC: ASES, UCLA, Constant, DASH, Global Rating of Change, Sickness Impact Profile Total Scale, SF-36 (Bodily Pain/Physical, Social Function/Lifestyle, Physical Role Limitation/Work and Mental Health domains), and range of motion
WOOS: ASES, UCLA, Constant, Global Rating of Change, McGill Pain Questionnaire, McGill VAS, SF-12 (Physical & Mental domains), and range of motion

THE WESTERN ONTARIO SHOULDER INSTABILITY INDEX (WOSI)

The reliability of the WOSI has been evaluated in 51 stable patients at 2 weeks and 3 months in conjunction with a global rating of change score. The patient population tested was only briefly described as patients with shoulder instability who were stable and it is not clear how diverse a population this was. The ICCs at 2 weeks and 3 months for the total score (.95 and .91) and individual domain scores (range .72 to .94) are reported.

The instrument was administered to 47 patients undergoing surgical repair for anterior instability. All correlations were within .2 of the predicted values. As predicted, the WOSI correlated best with the DASH as both a discriminative and evaluative instrument (r = .77, r = .76) and showed poor correlations with the SF-12 mental score (r = .115 discriminative; r = .12 evaluative). The responsiveness has been evaluated using the Standardized Response Mean and compared with the other measures of shoulder function in the same 47 patients used for the validation testing. The WOSI was more responsive than the others tested (in order of responsiveness: WOSI, Rowe, DASH, Constant Score, ASES, Range of Motion, UCLA, SF-12 physical, and SF-12 mental).

The minimally important difference was estimated in the same group of 47 patients.25 The patients were administered the WOSI concurrently with a 5-point global rating of change score. Patients were asked whether, after treatment, they were better, worse, or the same. If they indicated that they were better or
worse, they were asked to quantify their change on a 5-point scale (1 to 5, very little different to a great deal different). Patients with 1 or 2 points change were considered minimally different, 3 or 4 points change moderately different, and those with 5 points change a great deal different. The estimates were as follows: MID change in total score of 220 (10.4%), moderate difference change in total score of 469 (22.3%), and large difference change in total score of 527.46 (25%). The confidence intervals around these estimates were large because of the small number of patients involved in the determination. Further testing is needed to make more accurate estimates.

The WOSI is more responsive than other tools for shoulder instability. Richards et al.16 reported on the results of treatment of posterior shoulder instability and determined that the WOSI was more responsive than the SPADI, DASH, Constant, and ASES. The results of a randomized clinical trial evaluating the treatment of patients with a first anterior dislocation of the shoulder showed that the WOSI was more responsive than other instruments tested26 (in order of responsiveness: WOSI, Rating Sheet for Bankart Repair, DASH, Constant, ASES, ROM, UCLA, SF-12 physical score, and SF-12 mental score).

THE WESTERN ONTARIO OSTEOARTHRITIS OF THE SHOULDER INDEX (WOOS)

The reliability of the WOOS instrument has been evaluated in 58 stable patients at 3 months in conjunction with a global rating of change score. The patient population tested was described as meeting the inclusion criteria of primary osteoarthritis of the shoulder. The ICC was calculated based on the 22 subjects who remained stable over the 3 months. The ICC for the total score was .96 and for each of the domains ranged from .87 to .95. This number may be falsely decreased by the long test-retest interval of 3 months.
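The responsiveness comparisons for the Western Ontario instruments rely on the Standardized Response Mean (SRM). A minimal sketch follows, assuming the usual definition of the SRM as the mean change score divided by the standard deviation of the change scores. The patient data and the function name are hypothetical and serve only to show the calculation.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """Mean of paired change scores divided by their sample standard deviation.

    Needs at least 2 patients, and the changes must not all be identical
    (otherwise the standard deviation is zero and the ratio is undefined).
    """
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must be paired")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical percent scores for 5 patients before and after treatment.
baseline_scores = [40, 55, 35, 60, 50]
follow_up_scores = [70, 80, 75, 85, 70]
srm = standardized_response_mean(baseline_scores, follow_up_scores)
```

Because the SRM is unitless, it allows instruments scored on different scales to be ranked against each other in the same patient sample, which is how the orderings of responsiveness in this review are presented.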
The instrument was administered to 41 patients selected from the database undergoing treatment for osteoarthritis of the shoulder. All correlations were within .2 of the predicted values. As predicted, the WOOS correlated best with the Constant Score as both a discriminative and evaluative instrument (r = .69, r = .73). The responsiveness was evaluated using the Standardized Response Mean and compared with the other measures of shoulder function in 41 patients involved in a randomized clinical trial of hemiarthroplasty versus total shoulder arthroplasty.15 The WOOS was
more responsive than the others tested (in order of responsiveness: WOOS, McGill VAS, UCLA, ASES, McGill Pain, Constant Score, SF-12 physical, ROM, and SF-12 mental). No estimate of the minimally important difference or the responsiveness data has been reported for the WOOS. This instrument has been translated and validated in French, Spanish, and German.

THE WESTERN ONTARIO ROTATOR CUFF INDEX (WORC)

The reliability and validity of the WORC were assessed in patients who were being treated for rotator cuff tendinosis with no or a small full-thickness cuff tear. Patients completed the WORC and other measures of health as well as a global rating of change score. Those who indicated they had not changed at 2 weeks were used for the analysis of reliability. The ICC was calculated based on the 50 subjects who remained stable over the 2 weeks. The ICC for the total score was .96 and for each of the domains ranged from .63 for the emotional well-being domain to .91 for the physical symptoms domain.

The instrument was administered to 110 patients with rotator cuff tendinopathy or small full-thickness cuff tears who were undergoing active treatment (injections, physiotherapy, or arthroscopy and subacromial decompression). All correlations were within .2 of the predicted values. The WORC correlated best with the ASES and DASH as a discriminative instrument (r = .73, r = .69) and with the ASES and UCLA as an evaluative instrument (r = .75, r = .65). Data on the responsiveness of the WORC tool have not been reported.

The minimally important difference was calculated using 44 patients meeting specific inclusion/exclusion criteria for chronic cuff tendinosis without tear undergoing treatment with subacromial injection. They were prospectively evaluated at baseline and 3 months after injection using a global rating of change and the WORC Index. Patients were asked whether, after treatment, they were better, worse, or the same.
If they indicated that they were better or worse, they were asked to quantify their change on a 5-point scale (1 to 5, very little different to a great deal different). Patients with 1 or 2 points change were considered minimally different, 3 or 4 points change moderately different, and those with 5 points change a great deal different. The estimates were as follows: MID change in total score of 245.26 (11.7%), moderate difference change in total score of 371.3 (17.68%), and large
difference change in total score of 773.4 (36.82%). The confidence intervals around these estimates were large because of the small number of patients involved in the determination. Further testing is needed to make more accurate estimates. This instrument has also been translated into French and German.

THE ROTATOR CUFF QUALITY-OF-LIFE MEASURE (RC-QOL)

In October 2000, Hollinshead et al. published a paper reporting on the 6-year follow-up of large and massive rotator cuff tears.27 In the article, they introduced a new disease-specific quality of life instrument for patients with rotator cuff disease. The instrument was developed and tested using similar methodology to that described by Guyatt et al.28 This instrument is indicated for use as an outcome tool in patients with the “full spectrum of rotator cuff disease.”

Item generation was carried out in 3 steps, including a review of the literature and existing outcome tools, discussions with clinician experts, and “direct input from a set of patients with a full spectrum of rotator cuff disease ranging from primary impingement tendinopathy to massive rotator cuff defects.” It is not stated how many patients were interviewed nor how many items were generated at this phase. A preliminary questionnaire was formulated using a 10-cm VAS response format. The preliminary questionnaire was pretested on 20 patients with documented rotator cuff disease. Patients underwent a structured interview consisting of 5 questions pertaining to whether the items were semantically appropriate, whether the patient considered the items important to his or her quality of life, whether the patient could comprehend the question, and whether the patient would suggest any modifications to the questionnaire. A revised 55-item questionnaire was then developed. The authors describe further item reduction, from 55 to 34 items, but do not provide details on the methodology.
They state “On the basis of qualitative and quantitative criteria, reduction of this 55-item instrument to a smaller, more manageable questionnaire was considered.” The qualitative criteria included the importance of each item in demonstrating a quality-of-life issue, the importance of each item to patients, and the elimination of redundancy or ambiguity. The quantitative criterion was based on reliability testing. Items that had an average difference score of 15% or greater were eliminated from the tool. Although eliminating items based on poor reliability is logical for a discriminative instrument, it is not necessarily ideal for an
evaluative instrument, as an item can have poor reliability but be important to patients (high impact) and be highly responsive.

The instrument has 34 items with 5 domains: Symptoms and Physical Complaints (16 items), Sport/Recreation (4 items), Work-Related Concerns (4 items), Lifestyle Issues (5 items), and Social and Emotional Issues (5 items). The instrument does provide instructions to the patients. It asks the patients to consider the last 3 months when answering questions, which may be too long for most patients’ recall. Some of the items are double barreled as they ask the subject to consider pain and difficulty at the same time. The response options are written such that the best score is 100 mm and the worst score is 0 mm. However, because the items are asking about symptoms, this requires the patient to consider the amount of the symptom from right to left as opposed to the traditional left to right. It is unknown if this presents any difficulty to patients.

The reliability of the instrument was evaluated in 30 consecutive patients with an interval of 2 weeks. The patient population tested was not described other than that they had documented rotator cuff disease. The authors report “average difference in score” as a measure of reliability. The average difference for the total score was 5.05%. The reliability of each of the domains is not reported. The ICC values are not reported.

Some validation of the discriminative function of the RC-QOL has been performed. The RC-QOL has been correlated with other measures of shoulder function and measures of health status (Functional Shoulder Elevation Test [FSET], ASES, and SF-36) at final follow-up (average, 42 months; range, 25 to 71 months) in 70 patients undergoing surgical treatment for large and massive rotator cuff tears. The authors do not comment on the surprisingly high correlations between it and the generic health profile, the global shoulder tool, and the functional test.
The RC-QOL correlated very highly with the SF-36 (.78), the ASES (.84), and the FSET (.84). In addition, the authors tested the hypothesis that the RC-QOL should be able to distinguish between patients with large and massive rotator cuff tears, as a further indication of its discriminative function. The RC-QOL, ASES, and the FSET were all able to distinguish between patients with large and massive cuff tears in this sample of 73 shoulders (17 large and 56 massive cuff tears) at final follow-up. No validation of the instrument’s evaluative function has been reported. The responsiveness and determination of the minimally important difference have also not been reported.
An example item from the RC-QOL illustrates the response format:

Question: With any prolonged activity how much pain or discomfort do you experience in your shoulder?
0 (Severe pain) to 100 (No pain at all)

The authors recommend converting the raw scores (0 to 3,400; 0 = worst score, 3,400 = best score) to a percentage score, i.e., presenting scores out of 100.

OXFORD SHOULDER SCORES (OSS)

Similar to Kirkley et al., Dawson, Fitzpatrick and Carr29 have published 2 questionnaires that deal with the perceptions of patients about shoulder surgery. The first, the Oxford Shoulder Score (OSS), was published in 1996 and is for patients having shoulder operations other than stabilization. The second questionnaire was published in 1999 and is meant for the group of patients who had been excluded from the original questionnaire, those presenting with shoulder instability.30 Both are 12-item questionnaires with each item scored from 1 to 5, from least to most difficulty or severity, combined to produce a single score ranging from 12 (best score) to 60 (worst score).

The Oxford Shoulder Instability Questionnaire was developed by interviewing 20 patients referred to an outpatient clinic with shoulder instability. It is unknown whether these patients represented all categories of shoulder instability, age, gender, and treatment experience. Based on the interviews, an 18-item instrument was drafted and then pared down to 12 questions following pretesting on a further 2 groups of 20 patients. It is not stated by what method the items were selected or discarded.

The instrument has been tested for test-retest reliability in 34 patients at a 24-hour recall period. The ICC was not calculated; however, it is likely that the Pearson Correlation Coefficient closely approximates it. The r value was reported as .97. The recall period was very short for this type of assessment, raising the possibility that patients were able to remember their previous responses, which would artificially increase the r value.
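The scoring rule described for the two Oxford questionnaires (12 items, each scored 1 to 5, summed to a total between 12 and 60) can be sketched as below. The function name and the validation checks are illustrative assumptions. The sketch follows the scoring direction as described in this review, where 12 is the best score and 60 the worst.

```python
def oxford_total(item_scores):
    """Sum 12 item scores (1 = least, 5 = most difficulty or severity).

    Returns a total from 12 (best) to 60 (worst), as described in the text.
    """
    if len(item_scores) != 12:
        raise ValueError("exactly 12 item scores are required")
    if any(score not in (1, 2, 3, 4, 5) for score in item_scores):
        raise ValueError("each item must be scored as an integer from 1 to 5")
    return sum(item_scores)


best_total = oxford_total([1] * 12)   # least difficulty on every item
worst_total = oxford_total([5] * 12)  # most difficulty on every item
```

Rejecting incomplete questionnaires outright is only one option. How missing items should be handled is not addressed in the text, so any imputation rule would be a further assumption.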
Construct validity has been determined through prospective studies in which both instruments have been compared to other outcome tools as discriminative instruments (1 point in time both before treatment and at 6 months after treatment). Although predictions as to how the instruments should correlate were not made, the results, which show modest correlation with the other
shoulder instruments and the appropriate domains of the global tools, seem appropriate. Finally, responsiveness or sensitivity to change was measured by comparing the effect sizes of the new questionnaires and the SF-36 scores, as well as the HAQ, Constant, and Rowe scores in patients undergoing surgical stabilization. The results show that the instruments were more sensitive than the generic instruments. In addition, the new questionnaire was compared with the other instruments for the ability to distinguish between patients who reported the most positive change in their shoulder from all other patients on 3 separate questions of patient perception of overall success of treatment, room for improvement, and perception of improvement in shoulder problems following treatment.

Medium-term results have also been reported for the OSS.31 Once again comparisons were made with the SF-36 and the Constant Shoulder score. In addition to these measures, patients were also asked to assess the success of their surgery and to judge the degree of change in the symptoms arising from their shoulder. Ninety-three patients were assessed preoperatively, and at 6 months and 4 years postoperatively. The correlation coefficients between the absolute scores of the OSS, the Constant assessment and the relevant dimensions of the SF-36 were generally high (r > .5) and highly significant. Comparisons between mean change scores, grouped responses, and the patient satisfaction question further strengthened support for the OSS questionnaire. Patients reported considerable differences in mean change scores 6 months postoperatively on the Constant, OSS, and relevant domains of the SF-36. Similar results at the 4-year assessment were shown for the OSS and the pain dimension of the SF-36. Interestingly, differences at this stage for the Constant barely approached significance, mean differences being significantly reduced between the 6-month and 4-year assessment points.
This suggests that the reliability and sensitivity of the Constant Score relative to the OSS were significantly reduced over the long term. However, in reporting all of these medium-term results, the authors acknowledge that only 66% of the original sample underwent a clinical assessment at the 4-year mark and that this variation in the period of follow-up may have affected the clinical validity of the investigation. It would seem from the publications that these questionnaires have been tested and should provide reliable, valid, and responsive information. The authors of the present article have no experience with these particular tools and perhaps in the future, more information regarding their effectiveness will be available.
CONCLUSION
In summary, older instruments designed for evaluating shoulder conditions were developed at a time when little information was available or little attention was paid to the appropriate methodology for such endeavors. However, there now exist a number of instruments that are excellent for specific conditions of the shoulder. Much work remains to be done to evaluate these instruments in specific patient populations, to determine values for the minimally clinically important difference for each of these tools, and to develop valid translations such that they can be used internationally. It is clear that much progress has already been made in this area of orthopaedic surgery and that currently there exists an appropriate instrument for each of the main conditions of the shoulder. Investigators planning clinical trials should select a modern instrument that has been developed with appropriate patient input for item generation and reduction and that has established validity and reliability. All things being equal, the most responsive instrument available should be used in order to minimize the sample size for the proposed study.
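The link between responsiveness and sample size can be made concrete with a minimal sketch. It assumes a design that is not specified in this review: a two-arm parallel trial with equal group sizes, a two-sided test, and the normal-approximation formula n = 2(z_alpha + z_beta)^2 / d^2, where d is the standardized difference between groups on the chosen instrument. The function name and the example values of d are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(standardized_difference, alpha=0.05, power=0.80):
    """Approximate patients needed per arm (normal approximation).

    standardized_difference: expected between-group difference divided by
    the standard deviation of the outcome on the chosen instrument.
    """
    if standardized_difference <= 0:
        raise ValueError("standardized difference must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)           # 1 - type II error
    return ceil(2 * (z_alpha + z_beta) ** 2 / standardized_difference ** 2)


# The same treatment effect seen through a less and a more responsive
# instrument (hypothetical standardized differences of 0.5 and 1.0).
n_less_responsive = patients_per_group(0.5)
n_more_responsive = patients_per_group(1.0)
```

Under these assumptions the required sample size falls with the square of the standardized difference, so an instrument that doubles the detectable standardized difference cuts the required enrollment to roughly a quarter.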
REFERENCES

1. Rowe CR, Patel D, Southmard WW. The Bankart procedure: A study of late results. J Bone Joint Surg Am 1977;59:122.
2. Amstutz HC, Sew Hoy AL, Clarke IC. UCLA anatomic total shoulder arthroplasty. Clin Orthop 1981;155:7-20.
3. Ellman H, Hanker G, Bayer M. Repair of rotator cuff. Factors influencing reconstruction. J Bone Joint Surg Am 1986;68:1136-1144.
4. Romeo AA, Bach BR Jr, O’Halloran KL. Scoring systems for shoulder conditions. Am J Sports Med 1996;24:472-476.
5. Roach KE, Budiman-Mak E, Songsiridej N, Lertratanakul Y. Development of a shoulder pain and disability index. Arthritis Care Res 1991;4:143-149.
6. Richards RR, An K-N, Bigliani LU, Friedman RJ, Gartsman GM, Gristina AG, Iannotti JP, Mow VC, Sidles JA, Zuckerman JD. A standardized method for the assessment of shoulder function. J Shoulder Elbow Surg 1994;3:347-352.
7. Constant CR, Murley AHG. A clinical method of functional assessment of the shoulder. Clin Orthop 1987;214:160-164.
8. Conboy VB, Morris RW, Kiss J, Carr AJ. An evaluation of the Constant-Murley Shoulder Assessment. J Bone Joint Surg Br 1996;78:229-232.
9. Hudak PL, Amadio PC, Bombardier C. Development of an upper extremity outcome measure: The DASH (disabilities of the arm, shoulder and hand) [corrected]. The Upper Extremity Collaborative Group (UECG). Am J Ind Med 1996;29:602-608.
10. Solway S, Beaton DE, McConnell S, Bombardier C. The DASH outcome measure user’s manual. Toronto, Ontario: Institute for Work & Health, 2002.
11. Verbrugge LM, Jette AM. The disablement process. Soc Sci Med 1994;38:1-14.
12. Haworth RJ, Hopkins J, Ells P, Ackroyd CE, Mowat AG. Expectations and outcome of total hip replacement. Rheumatol Rehabil 1981;20:65-70.
13. Lieberman JR, Dorey F, Shekelle P, Schumacher L, Thomas BJ, Kilgus DJ, Finerman GA. Differences between patients’ and physicians’ evaluations of outcome after total hip arthroplasty. J Bone Joint Surg Am 1996;80:835-838.
14. Kirkley A, Griffin S, McLintock H, Ng L. The development and evaluation of a disease-specific quality of life measurement tool for shoulder instability: The Western Ontario Shoulder Instability Index (WOSI). Am J Sports Med 1998;26:764-772.
15. Kirkley A, Griffin S, Alvarez C. The development and evaluation of a disease-specific quality of life measurement tool for rotator cuff disease: The Western Ontario Rotator Cuff Index (WORC). Clin J Sport Med 2003;13:84-92.
16. Richards RR, Harniman E. A long-term follow-up of posterior shoulder stabilizations for recurrent posterior glenohumeral instability. London, Ontario: Canadian Orthopaedic Association, #74, 2001.
17. Barrack RL, Skinner HB. The sensory function of knee ligaments. In: Daniel DM, Akeson WH, O’Connor JJ, eds. Knee ligaments: Structure, function, injury and repair. New York: Raven, 1990.
18. L’Insalata JC, Warren RF, Cohen SB, Altchek DW, Peterson MG. A self-administered questionnaire for assessment of symptoms and function of the shoulder. J Bone Joint Surg Am 1997;79:738-748.
19. Lippitt SB, Harryman DT II, Matsen FA III. A practical tool for evaluating function: The Simple Shoulder Test. In: Matsen FA, Fu FH, Hawkins RJ, eds. The shoulder: A balance of mobility and stability. Rosemont, IL: American Academy of Orthopaedic Surgeons, 1992;501-518.
20. Rowe CR. Evaluation of the shoulder. In: The shoulder. New York: Churchill-Livingstone, 1988;631-637.
21. Barrett WP, Franklin JL, Jackins SE, Wyss CR, Matsen FA III. Total shoulder arthroplasty. J Bone Joint Surg Am 1987;69:865-872.
22. Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chronic Dis 1985;38:27-35.
23. Lo IKY, Griffin S, Kirkley A. The development and evaluation of a disease-specific quality of life measurement tool for osteoarthritis of the shoulder: The Western Ontario Osteoarthritis of the Shoulder Index (WOOS). Osteoarthritis Cartilage 2001;9:771-778.
24. Juniper EF, Guyatt GH, Jaeschke R. How to develop and validate a new health-related quality of life instrument. In: Spilker B, ed. Quality of life and pharmacoeconomics in clinical trials. Philadelphia: Lippincott-Raven, 1996;49-56.
25. Kirkley A. The development and evaluation of a disease-specific quality of life measurement tool for shoulder instability: The Western Ontario Shoulder Instability Index (WOSI). 1-89. Thesis/Dissertation, McMaster University, Hamilton, Ontario, Canada, 2001.
26. Kirkley A, Griffin S, Richards C, Miniaci A, Mohtadi N. Prospective randomized clinical trial comparing the effectiveness of immediate arthroscopic stabilization versus immobilization and rehabilitation in first traumatic anterior dislocation of the shoulder. Arthroscopy 1998;15:507-514.
27. Hollinshead RM, Mohtadi NG, Vande Guchte RA, Wadey VM. Two 6-year follow-up studies of large and massive rotator cuff tears: Comparison of outcome measures. J Shoulder Elbow Surg 2000;9:373-381.
28. Guyatt GH, Townsend M, Berman LB, Keller JL. A comparison of Likert and visual analogue scales for measuring change in function. J Chronic Dis 1987;40:1129-1133.
29. Dawson J, Fitzpatrick R, Carr A. Questionnaire on the perceptions of patients about shoulder surgery. J Bone Joint Surg Br 1996;78:593-600.
30. Dawson J, Fitzpatrick R, Carr A. The assessment of shoulder instability. J Bone Joint Surg Br 1999;81:420-426.
31. Dawson J, Hill G, Fitzpatrick R, Carr A. The benefits of using patient-based methods of assessment: medium-term results of an observational study of shoulder surgery. J Bone Joint Surg Br 2001;83:877-882.