Issues in the Development of an Item Bank

Rita K. Bode, PhD, Jin-Shei Lai, PhD, David Cella, PhD, Allen W. Heinemann, PhD

ABSTRACT. Bode RK, Lai J-S, Cella D, Heinemann AW. Issues in the development of an item bank. Arch Phys Med Rehabil 2003;84 Suppl 2:S52-60.

Objective: To describe and illustrate 2 issues involved in the development of an item bank that can be used to improve measurement across settings and over time.
Design: Secondary (psychometric) analysis of data collected on existing quality of life (QOL) instruments.
Setting: Five cancer clinics in hospital settings in various parts of the United States; 523 solo or group practices in 3 major US cities; and an inpatient rehabilitation hospital in a large metropolitan area.
Participants: Illustration 1: 399 persons being treated for or having a history of cancer, 170 persons being treated for human immunodeficiency virus (HIV), 328 persons with stroke assessed during and after acute rehabilitation, and 433 persons being treated for multiple sclerosis. Illustration 2: 1714 persons with cancer and/or HIV participating in a large-scale multisite study, 3429 persons with prevalent treatable chronic health conditions, and 125 persons with stage IV metastatic breast cancer.
Interventions: Not applicable.
Main Outcome Measures: QOL as measured by 10 different instruments.
Results: The illustrations show that (1) core items, which functioned similarly across 4 diagnostic groups, can be identified and used to construct instruments measuring physical function that are tailored to each of these groups, and (2) items from 3 separate datasets can be linked to create a dataset that can serve as an initial pain item bank.
Conclusion: The methodology exists to develop item banks that support better measures of QOL.
Key Words: Quality of life; Rehabilitation.
© 2003 by the American Congress of Rehabilitation Medicine
From the Rehabilitation Institute of Chicago, Chicago, IL (Bode, Heinemann); Institute of Health Services Research & Policy Studies, Northwestern University, Chicago, IL (Bode, Lai, Cella, Heinemann); and Evanston Northwestern Healthcare, Evanston, IL (Lai, Cella). Supported by the Early Detection and Community Oncology Program, Community Oncology and Rehabilitation Branch, National Cancer Institute (grant no. CA60068). No commercial party having a direct financial interest in the results of the research supporting this article has or will confer a benefit upon the author(s) or upon any organization with which the author(s) is/are associated. Reprint requests to Rita K. Bode, PhD, Center for Rehabilitation Outcomes Research, Rehabilitation Institute of Chicago, Ste 1396, 345 E Superior St, Chicago, IL 60611, e-mail:
[email protected]. doi:10.1053/apmr.2003.50247

INCREASINGLY, REHABILITATION therapy is being provided in a variety of settings—from acute care units, to acute rehabilitation, to postacute rehabilitation, to home health—and different measures are needed to examine not only the short-term but also the long-term outcomes of rehabilitation. Short-term outcomes typically focus on reduction of impairment and improvement of function, whereas long-term outcomes typically focus on quality of life (QOL). Monitoring
improvement is difficult because of the lack of instruments that can accurately measure status at each of these rehabilitation stages. If one cannot accurately measure status at each stage, one cannot accurately determine whether status changes over time. The instruments currently used to monitor improvement were developed to measure status at a particular point in recovery and contain items covering a relatively narrow range of task difficulty. Research using these instruments outside the settings for which they were developed has encountered floor or ceiling effects, which potentially underestimated the improvement made.1,2

A solution to this problem is the development of an item bank that contains items that measure a single construct but that cover a wide range of difficulty appropriate for individuals all along the continuum. The usefulness of such an item bank makes its development desirable, and technology now available makes its development possible. This article describes methodologic advances that can enhance the ability of rehabilitation researchers to answer questions of effectiveness, functional gain, and QOL. It focuses on the development of a QOL item bank.

The process of developing QOL item banks begins with equating studies of existing instruments to create a common metric. In the area of physical function, numerous equating studies1,3-6 have been conducted combining the responses to individual items from various instruments. More recently, individual QOL instruments have been equated to a common scale.7,8 After creating a common scale, the next step is to create banks that contain items that define meaningful hierarchies in terms of QOL.

Two differences exist between QOL item banks and previously developed banks that contained knowledge items, such as those used in professional certification tests. First, in QOL item banks, blueprints for the content areas (constructs) to be included have not been as clearly articulated. Preliminary research has defined various aspects of health-related QOL (HRQOL), but the various frameworks need to be validated.9 Second, little research has been conducted on item functioning across diverse samples, and the research that does exist has only examined differences by demographic group.

Stewart and Napoles-Springer10 reviewed studies involving adult samples with more than 50 cases and included both English and translated versions of these instruments. This review examined 9 psychometric characteristics: content validity, analysis of missing data, variability (score distribution, floor and ceiling effects), reliability (mainly internal consistency), measurement of factor structures (mainly with item-total score correlations), differential item functioning (DIF), item weights, use of response scales, and construct validity (including response bias and sensitivity to change). The diverse populations examined in these studies consisted almost exclusively of demographic groups: age, education, race and ethnicity, and gender. Of the studies on English-language versions of HRQOL instruments such as the Medical Outcomes Study 36-Item Short-Form Health Survey9 (SF-36), none addressed content validity, DIF, or use of response scales or item weights, but a number addressed missing data. Different patterns in missing data were found by age,11 poverty level, and education,12 but the results were inconsistent regarding ethnicity.12,13 Variability13 and internal
consistency12-14 were for the most part adequate and comparable across groups, whereas results for the factor structure were mixed.12,13,15 The available studies compared item-subscale correlations across groups, but few conducted confirmatory factor analysis on subgroups. Finally, construct validity was addressed in 1 study16 that found a distinction between healthy and unhealthy subjects. However, none of these studies examined differences in these characteristics by diagnostic group.

Much of the research has concentrated on merging data from different QOL instruments. When subsets of items are administered to different portions of the merged sample, linking strategies are needed to combine data from multiple sources.17 Unlike knowledge tests, whose items are scored as correct or incorrect, QOL instruments differ in item format and rating scale, and these differences also need to be accommodated when combining data into a single item bank.18 In this article, we focus on the issues involved in building a common core of items across diagnostic groups and linking existing databases.

DEVELOPING A COMMON CORE OF ITEMS ACROSS GROUPS

The field of QOL measurement is quickly moving beyond fixed measures of QOL to measures with content that is tailored to the individual subject. With the introduction of item banks and, more specifically, computerized adaptive testing, we are no longer tied to instruments with fixed content presented in paper-and-pencil format, in which each individual must respond to all items in order to obtain an interpretable score.

Within a paper-and-pencil format, our current choices in measuring QOL are between generic and disease-specific instruments. Generic instruments (eg, SF-369) are designed for use across various diagnostic groups or conditions and are expected to provide overall estimates of individuals' QOL that can be compared across groups. Norms may be provided for specific conditions, but the same set of items is administered to all respondents. These instruments are well suited to studies that compare QOL across groups, but, although the sets of items may work well enough across groups, individual items may not be appropriate for a particular condition or impairment. For instance, items that ask about difficulty in walking or climbing stairs would be inappropriate for persons with spinal cord injuries who use wheelchairs for locomotion.

For this reason, disease-specific instruments have been developed to measure QOL more precisely within specific groups. These instruments (eg, various Functional Assessment of Cancer Therapy scales19) contain items that are responsive to particular issues within an individual group and more completely describe QOL levels within that group. As such, they are well suited to studies that compare QOL within a group over time but not for making comparisons across groups. Thus, the dilemma of determining which type of instrument to use lies in having to choose between sensitivity and generalizability.

Ideally, what is needed is a common set of items, appropriate across groups, that can be administered along with a unique set that is specific to a particular group. The common core of items allows for generalizability across groups while the unique items allow for better descriptions of QOL within each specific group; in other words, tailoring the instrument to the characteristics of the respondents should produce the "best" instrument.
Creating an instrument tailored to the characteristics of respondents involves the same issues whether the instrument is a paper-and-pencil form or a set of items administered adaptively by computer from an item bank.
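To make the idea concrete, the following minimal sketch (in Python, with hypothetical item names and logit calibrations; it is not the authors' implementation) shows one way a bank might store a common core alongside group-specific items and assemble a tailored instrument for a given diagnostic group.

# Hypothetical core items share a single calibration across groups; items that
# behave differently are stored with group-specific calibrations.
CORE_ITEMS = {
    "walking_several_blocks": 0.4,
    "climbing_one_flight": -0.2,
}

GROUP_ITEMS = {
    "stroke": {"vigorous_activities": 2.8, "bathing_dressing": -1.1},
    "cancer": {"vigorous_activities": 2.1},
}

def tailored_instrument(group):
    """Assemble the item set (and difficulty calibrations) for one diagnostic group."""
    items = dict(CORE_ITEMS)                  # common core keeps measures comparable across groups
    items.update(GROUP_ITEMS.get(group, {}))  # unique items sharpen measurement within the group
    return items

print(tailored_instrument("stroke"))

Because every tailored instrument draws its core items, and their calibrations, from the same table, measures produced for different groups remain on a common scale.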
Tailoring an instrument to a specific group requires identification of a common core of items that (1) measure the same construct across groups, (2) function well together to describe the construct within each individual group, and (3) function similarly across groups.

Because item functioning within and across groups depends on a rational grouping of items, the first step in developing a common core of items is to determine whether the items measure the same construct(s) across groups. Within a classical test theory framework, factor analytic methods have been used to describe similarities in responses to items that form different factors or constructs. Exploratory and confirmatory analyses with rotations that maximize distinctions across factors have typically been used15; when the factors are hypothesized to correlate, oblique rotation methods are used instead. More recently, the measurement model portion of structural equation modeling has also been used.20,21 These classical test theory methods use the intercorrelations among the raw responses to the items to form the factors; unfortunately, these correlations are sample-specific and ordinal in nature. More recently, item response theory methods22-24 have been used to examine patterns of responses to items; these methods are less prone to sample dependence and to the nonlinearity of item responses.25,26 An example of such an analysis is the Rasch residual analysis, a principal components analysis that examines patterns in the residuals of the responses obtained from those predicted by the measurement model.27

To determine whether the items in an instrument are measuring the same constructs across groups, the analyses need to be conducted separately for each group and the item loadings on each factor compared across groups. If items load significantly on the same factors across groups, one would conclude that the same constructs are being measured across groups. If items load on different factors across groups, one would conclude that the items are measuring different constructs across groups. For example, if items measuring pain load strongly on physical factors in a cancer sample and on mental factors in a stroke sample, one would conclude that pain has different meanings for persons in these 2 groups. For an item to be included in the common core, it must measure the same construct across groups.

The second step in developing a common core of items is to determine whether, within each group, the set of items that is assumed to measure the same construct does indeed measure that construct. When using the Rasch measurement model, if all the items measure the same construct, there should be only random variation from the responses predicted by the model; that is, the items should "fit" the model, and the differences between expected and obtained responses, summed across persons, should be nonsignificant. If an individual item "misfits" the model, the unexpectedness of a substantial number of responses to that item suggests that the set of items might not be unidimensional. A likely explanation is that the misfitting item measures a different construct. An alternative explanation is that a bad question produced unexpected responses. There may also be some nuance in the item to which only members of certain groups are sensitive. Whatever the cause, the item functions differently than the other items in the set as a whole, and its deletion would improve the reliability of the set of items.
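The fit assessment just described compares observed responses with those the model predicts. The sketch below is a minimal illustration with synthetic data, simplified to a dichotomous Rasch model (the analyses reported in this article used a polytomous rating-scale model); it shows how an information-weighted (infit) mean square could be computed and used to flag items against the 1.3 criterion applied in the illustration that follows.

import numpy as np

# Synthetic person measures (theta) and item calibrations (b), both in logits.
rng = np.random.default_rng(0)
theta = rng.normal(0, 1, size=200)
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])

# Model-expected probability of endorsement and simulated observed responses.
P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.uniform(size=P.shape) < P).astype(float)

# Infit mean square: squared residuals summed over persons, weighted by the
# model variance of each response, P * (1 - P).
W = P * (1 - P)
infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)

for item, mnsq in enumerate(infit, start=1):
    flag = "misfit" if mnsq > 1.3 else "fit"
    print(f"item {item}: infit MnSq = {mnsq:.2f} ({flag})")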
To determine if items are unidimensional within a group, the set of items is calibrated separately for each group. For an item to be included in the common core, it must measure the same construct as the other items in the set within each group examined.

The third step in developing a common core of items is to determine whether the items function essentially the same across groups. The Rasch model describes the hierarchy of
items, or the ordering of items from most to least commonly endorsed. The construct described by this hierarchy should conceptually make sense or conform to the theoretical order described in the literature or on the basis of clinical experience. Items that are more commonly endorsed should be those characteristics that are more commonly observed or statements that even respondents possessing little of the trait would endorse. Conversely, items that are less commonly endorsed should be those characteristics that are less commonly observed or statements that only respondents possessing a great deal of the trait would endorse.

To determine if items function essentially the same across groups, estimates of item "difficulty" are compared across samples after controlling for the structure of the rating scale. For an item to be included in the common core, even though the absolute level of endorsement might vary across groups, its position within the hierarchy should be the same from one group to the next.

Given the opportunity provided by computerized adaptive testing and the substantial and consistent noncomparability of standard measures of QOL, now may be the time to modify original measures.28 This could be accomplished once a common core of items has been developed. Unique items that meet the criteria of measuring the same construct and exhibiting fit within each group could be added to the bank and used to create an instrument tailored to the respondents within that group. Items that function differentially across groups could be treated as unique items and individually included in the bank at the calibrated difficulty level for each group. The composition of items within each instrument would therefore vary across groups and be sensitive to particular issues in each group. Items that are meaningless in one group could still be included in the bank and used in the development of instruments for other groups. Items that are more or less likely to be endorsed in each group do not have to assume a single position in the hierarchy across groups; their positions can vary depending on the group to which the respondent belongs. For example, items measuring limitations in physical functioning are more likely to be endorsed by persons with stroke than by persons with cancer. Combining data from these 2 groups and producing 1 estimate of the positions of these items in the hierarchy would result in item misfit for persons in each group and more error in the estimation of their QOL. These issues are particularly important in the development of computerized adaptive tests, in which the response to 1 item is used to select the next item to be administered.

An Illustration

The following illustration shows how data from a generic QOL instrument were used to develop a common core of items. We used the SF-36,9 a QOL instrument that contains 9 rating scales and has a single scoring algorithm that converts raw item scores to a scale that ranges from 0 to 100. Transformed item scores are then averaged across 8 scales (physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, mental health) and 2 summary dimensions (physical health, mental health, summarizing the first 4 and the last 4 scales, respectively).9 This instrument was administered to samples from 4 diagnostic groups: cancer, multiple sclerosis (MS), human immunodeficiency virus (HIV), and stroke after treatment or hospitalization.
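As a concrete sketch of the first criterion, the code below factors each group's responses separately and reports the factor on which each item loads most strongly. The data are synthetic, and a generic factor-analysis routine with varimax rotation is used as a stand-in for the principal components and Rasch residual analyses applied in this illustration.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

def synthetic_responses(n_persons, n_items=10):
    # One dominant latent trait plus noise; real analyses would use item responses.
    trait = rng.normal(size=(n_persons, 1))
    return trait + rng.normal(scale=0.8, size=(n_persons, n_items))

groups = {"cancer": synthetic_responses(400), "stroke": synthetic_responses(330)}

loadings = {}
for name, responses in groups.items():
    fa = FactorAnalysis(n_components=2, rotation="varimax").fit(responses)
    loadings[name] = fa.components_.T  # items x factors

# Items whose dominant factor is the same in every group are candidates for the core.
for item in range(10):
    dominant = {g: int(np.argmax(np.abs(L[item]))) for g, L in loadings.items()}
    print(f"item {item + 1}: dominant factor by group -> {dominant}")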
Because the cancer sample was larger than the other diagnostic groups, a subsample of approximately the same number of cases was randomly selected for analysis. The HIV sample was the youngest (mean ± standard deviation, 37.1±7.7y), the MS sample was slightly older (mean, 45.0±10.0y), and the stroke sample was slightly younger (mean,
53.6±15.0y) than the cancer sample (mean, 57.0±14.0y). The HIV sample was predominantly male (91.2%), the MS sample was predominantly female (69.4%), and the gender distributions in the stroke and cancer samples were similar, at 52.0% and 52.4%, respectively.

The procedures for developing a common core of physical function QOL items consist of 3 steps: determining whether (1) the same constructs were measured across groups, (2) all the items measured a single construct within each group, and (3) the items functioned the same across groups.

Same constructs across groups. A number of methods can be used to determine whether the same constructs are measured by a set of items. In this example, the typical principal components analysis with varimax rotation was used to identify factors and to examine factor loadings across groups. In addition, Rasch residual analysis was used to examine patterns in the residuals to confirm these results. Using a criterion of eigenvalues greater than 1 to identify the significant factors, the factor analysis identified 3 factors each for the cancer and HIV samples (systemic conditions) and 7 each for the MS and stroke samples (neurologic conditions). The results of the factor analysis showed that, across groups, the physical function items all loaded on the first factor, but the other items identified as part of the physical health component loaded on different factors across groups. The impact of physical health appears to be related to physical function only for persons with systemic conditions, whereas general health appears to be unrelated to physical function for those with neurologic conditions. Pain appears to be intermingled with other aspects of physical health only for those with systemic conditions but to be a unique aspect for those with neurologic conditions. The results for items classified as part of the mental health component showed that none of the items loaded on the same factor across diagnostic groups. For most groups, mental health appears to represent a separate and unique factor, but, for persons with MS, it seems to be related to the impact of emotional health and social functioning. More generally, for persons with systemic conditions, the impact of emotional health appears to be related to physical function, and social functioning and vitality appear to be related to general health. The Rasch residual analysis was also used to examine the factor structure of SF-36 items across diagnostic groups; these results confirm that the physical function items (PF-10) load consistently on the first factor across groups, with the other items loading differentially on other factors. Thus, using the results of both analyses, it appears that only the PF-10 items meet the criterion of measuring the same construct across groups.

Same construct within groups. To examine the functioning of the PF-10 items within and across diagnostic groups, responses were calibrated separately by group. Because the PF-10 uses a rating scale for its responses, a moderate criterion (an infit mean square >1.30) was used to identify misfitting items. The results of this analysis showed that 2 of the PF-10 items misfit within specific samples: "vigorous activities" misfit in the HIV and stroke samples, and "bathing and dressing" misfit in the cancer and stroke samples. Thus, the numbers of items meeting the second criterion by diagnostic group were 9 for the cancer sample, 10 for the MS sample, 9 for the HIV sample, and 8 for the stroke sample.
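The third criterion, examined next, asks whether an item's calibration is invariant across groups. The following minimal sketch makes a pairwise comparison of hypothetical calibrations and standard errors; the SEs are pooled as a root sum of squares (one common convention), and the flag uses a 2-standard-error criterion of the kind applied in the DIF comparison described below.

import math

# Hypothetical item calibrations (logits) and standard errors by group.
calibrations = {
    "vigorous_activities": {"MS": (2.10, 0.08), "stroke": (2.65, 0.09)},
    "walking_one_block":   {"MS": (-0.90, 0.07), "stroke": (-0.80, 0.08)},
}

def dif_flag(item, group_a, group_b, table):
    (d_a, se_a), (d_b, se_b) = table[item][group_a], table[item][group_b]
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)  # pooled SE of the 2 estimates
    return abs(d_a - d_b) > 2 * combined_se         # flag differences beyond 2 combined SEs

for item in calibrations:
    print(item, "DIF" if dif_flag(item, "MS", "stroke", calibrations) else "no DIF")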
Comparable item functioning across groups. To examine DIF in the PF-10 across diagnostic groups, we compared the item difficulty estimates for each pair of diagnostic groups. The DIF criterion was a difference between estimates that was greater than 2 times the combined standard error (SE) of measurement for the groups. In this example, 4 items exhibited
DIF when MS estimates were compared with those from the other groups: "vigorous activities," "walking one block," "walking one mile," and "bathing/dressing." Thus, these 4 items would be deleted from the common core of PF-10 items, leaving 6 that met the third criterion. In this illustration, 5 of the PF-10 items met both item functioning criteria and could be used as common core items in the development of a tailored physical functioning instrument or item bank. The item calibrations for the core items were fixed at their combined level and used to link the data across diagnostic groups. Items that exhibited DIF were treated as separate items for each group and allowed to have different difficulty levels in the bank. An illustration of the hierarchies in the common core and the unique items by diagnostic group is found in figure 1.

Fig 1. PF-10 item hierarchies for core and by diagnostic group.

In this figure, the position in the hierarchy of the 5 core items is shown at the left; they range in difficulty from approximately -1 to +1 on the scale. These positions represent the average "difficulty" estimates used in the bank, regardless of the diagnostic group to which a person belongs. Depending on the number of categories in the rating scale, additional difficulty estimates would be included for each item, but they are not represented here. The other items in this figure have separate positions in the hierarchy, corresponding to their average difficulty level within each group. For example, depending on diagnostic group, item PF1 (vigorous activities) has 4 separate positions corresponding to average difficulty levels from approximately +2 to +3, and item PF10 (bathing, dressing) has 4 separate positions corresponding to average difficulty levels from approximately -1 to -3. These different positions would be the average difficulty estimates used in the item bank, depending on which diagnostic group was being assessed.

In developing an instrument tailored to a particular group, items that misfit in that group would not be included in the set of items for that sample. For example, item PF10 (bathing, dressing) misfits for the cancer sample (not shown in fig 1) and would be excluded from an instrument developed for that sample. Other items that measure the physical health construct could then be added to the item bank, a few items at a time so as not to challenge the definition of the construct, and subjected to the
same item functioning criteria. An illustration of such tailored instruments is found in figure 2. In this figure, the items selected for each diagnostic group met the criteria established previously: they loaded positively with the physical health items in the original factor analyses for that group, and they were unidimensional within that group. When used to produce a paper-and-pencil instrument, conversion tables could be developed to report the results on a common scale for the items appropriate for that specific diagnostic group. When used in computerized adaptive testing, the algorithm would include the diagnostic group, and only the data appropriate for that group would be used to select items for administration.

This illustration is an example of a modification of a generic QOL measure in the face of evidence of its noncomparability across diagnostic groups. Although using generic instruments with standard item classifications allows us to compare scores within groups, modifying them by tailoring the items to the diagnostic group allows us not only to measure QOL within each group but also, because of the common core of items, to compare QOL across groups. In developing a common core of items, the first criterion that should be met is fit to the construct. Then, if the bank is to be used across diagnostic groups and there is evidence that some items function differently in these groups, misfitting items should be deleted separately by group and difficulty estimates for items should be allowed to vary across groups.

Dataset Linking Strategies

To develop an item bank, each item would need to be administered to at least a subset of persons for whom the bank is intended. Before common items can be developed, however, it may be necessary to combine data from multiple sources. An "item bank" does not refer to just any collection of items but to a composition of coordinated questions that develop, define, and quantify a common theme and thus provide an operational definition of a variable.29-31 Wright and Bell31 outlined the basic steps necessary to build an item bank. However, these steps apply to banks in which all the items are developed and calibrated in 1 step. In practice, most item banks are developed and expanded over time. As such, they require that items be
added in stages and that data be collected from multiple samples. Instead of collecting data on new or existing items from new samples, many investigators use existing datasets to create an item bank. Issues involved in linking datasets include (1) the selection of an equating method that fits the available data, (2) the choice of the number of linking items, and (3) the validation of the quality of the linking process and of the resulting item bank.

Fig 2. Physical health item hierarchies by diagnostic group. NOTE: Core items are underlined and in italics.

The 2 commonly used methods to equate tests (sometimes referred to as "linking datasets") are common-item and common-person equating. Common-item equating is the more widely used approach. Figure 3 depicts the basic concept of common-item equating. It uses items common to more than 1 dataset to remove disparities in the scores that can be related to the common items. In other words, the common items (items A, B, C in datasets 1 and 2; fig 3) across tests are fixed, or anchored, at specific locations on the continuum, thereby
allowing other items, which are unique to 1 dataset (items D, E, F in dataset 1; items G, H, I in dataset 2), to be positioned along a common continuum. This approach is also called anchor-test design.17 In this approach, linkage between test forms is ensured by including common items on the original forms, which all subjects complete. Relative item difficulties and dispersions can then be determined by holding the item calibrations for the common items constant across test forms. The quality of the linking items themselves can be evaluated separately using standard test statistics.

Fig 3. Illustration of linking 2 datasets by common items.

The common-person approach is similar to the common-item approach. However, instead of common items, common persons are used to equate the different test forms that these persons completed. These common persons may complete the test forms simultaneously or sequentially. The differences between the ability estimates for these persons are constant across test forms. A drawback of using common-person equating is that person abilities are generally less stable than item difficulties and, consequently, the obtained item calibrations are less reliable.32 Also, this approach is of limited utility when equating existing datasets.

Given the development of computer programs with missing data features (eg, MFORMS,33 Winsteps34), it is no longer necessary to link separately calibrated forms using either of these approaches. Instead, tests can be equated within a single calibration by combining the data from a network of multiple forms into a single data matrix and treating items not common across forms as missing data. This approach is called "one-step item banking"35 or single calibration equating.36

There is no fixed rule regarding the number of linking items. Angoff37 suggested that the set of linking items should contain the larger
of 20 items or 20% of the total number of items. Other investigators31 recommend from 5 to 10 items to form the link. In general, a larger number of common items will result in more precise and stable item calibrations in the bank. A single item can also form a valid link between datasets36; however, the risk of instability warrants that such banks be tested in new subjects before implementation of computerized adaptive testing or development of short forms.

Scores obtained from equating procedures are less precise than scores obtained from tests in which every subject responds to every item. Given this disadvantage, item banks, and especially the linking process, need to be validated before the computerized adaptive testing technique can be applied. Wolfe17 proposed 4 fit indices to evaluate the quality of the equating: (1) item within link fit, which focuses on the degree to which linking items exhibit adequate fit within the various forms in which they appear; (2) item between link fit, which focuses on the stability of the item calibrations between datasets; (3) link within bank fit, which examines the agreement among links with respect to test form difficulties; and (4) form within bank fit, which examines the stability of the link across all test forms in the bank.

The last step is to validate the initial item bank by comparing the item hierarchy before and after the datasets were linked. Because the item hierarchy defines the construct, a stable item hierarchy before and after linking indicates that the construct definition has not changed as a result of the linking and equating processes. In terms of figure 3, this could require that the construct of the lowest line (containing all 9 items) be the same as the construct of the upper 2 lines (containing 6 items each).

Most datasets that could be used to create an item bank are not designed for banking; therefore, the methodologies used to create an item bank will vary based on the characteristics of the available datasets. In the following example, we illustrate these procedures.

An Illustration

An initial pain item bank serves to show the procedures for creating an item bank from existing datasets. The first step in linking datasets consists of examining unidimensionality (ie, the extent to which the items work together to define a single construct) within each dataset. Under the Rasch measurement model, an indicator of the unidimensionality of a set of items is the ratio of the observed to the expected variance (called the mean square [MnSq]) in subject responses to each item. An ideal MnSq value is 1.0; however, we allow a certain level of unexpected random variance in the data (in this study, 30%). Therefore, an item with a MnSq value ≤1.3 is considered to fit the measurement model. A misfitting item is not necessarily deleted from the set of items but is investigated further to identify the cause of the misfit. It may be retained in the item pool if it captures an important clinical condition and does not demonstrate excessive misfit (MnSq values >2.00). This step statistically ensures the construct validity of the items in each individual dataset.

After the unidimensionality of the individual datasets is determined, the second step is to examine the unidimensionality of the item pool using the single calibration equating approach. During this step, item locations on the continuum (ie, the item hierarchy) are simultaneously calibrated for the combined item pool.
If any item misfits the combined item pool, the nature of the misfit is diagnosed (ie, does the error variance exist in the individual dataset, or was it produced by the linking process?). The result of this step is an initial item bank.
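In the single calibration equating approach described above, the practical first step is simply to stack the datasets into one person-by-item matrix in which items a person was never administered are left missing. The sketch below shows that data preparation step with hypothetical item names and responses; the calibration itself was run in Winsteps in this study.

import pandas as pd

# Responses from 2 hypothetical datasets; "interfere_with_work" is the shared (linking) item.
dataset1 = pd.DataFrame({"pain_intensity": [2, 3], "interfere_with_work": [1, 4]})
dataset2 = pd.DataFrame({"interfere_with_work": [3, 2], "pain_interferes_sleep": [4, 1]})

# Stacking the datasets aligns the shared item in one column and leaves items a person
# never saw as NaN, so all items can then be calibrated together in a single run.
combined = pd.concat([dataset1, dataset2], ignore_index=True, sort=False)
print(combined)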
Fig 4. Data structure of the pain item pool used to create an initial pain item bank. NOTE. Although the SF-36 was produced from the MOS study, 1 item, "how much bodily pain," was altered from its original wording, "had pain or discomfort." We did not consider these 2 items to be equivalent. Therefore, the only common item serving as a link between the MOS and Q-score datasets was "interfere with normal work."
Finally, the quality of the linking process is evaluated by determining the fit of the linked items as well as the stability, or invariance, of item location estimates across the equating steps. If an item exhibits poor fit or has parameter estimates that are not stable across the equating process, then the comparability of parameter estimates within an item bank is suspect.17 In addition to evaluating the quality of the linking items, the stability of the item parameters (all items) is tested by comparing the difficulty levels before and after linking the items (DIF) and by examining the discrepancy between expected and observed item calibrations (displacement) before and after the linking. DIF was computed by comparing the differences in item calibrations to their combined SEs; displacement was computed by comparing the observed item calibration after linking to the value that would have been expected if the linking had not occurred.34 (The criterion for determining a significant displacement is affected by sample size. Because sample sizes exceeded 1500 for 2 datasets, we used a 0.5 logit criterion. We used a criterion of 2 SEs to define a significant displacement for items from the Breast Cancer Pain Study [BCPS] because of its moderate sample size.)

Datasets collected from 3 different studies (fig 4) were used for the analysis: Q-score,7 Medical Outcomes Study9 (MOS), and BCPS (an ongoing study). The purpose of the Q-score project (dataset 1) was to equate the scores from 5 QOL instruments and derive a universal scale for evaluating results across cancer studies. A total of 1714 cancer and/or HIV patients from 5 different clinics located in hospital settings in various parts of the United States were recruited. The average age was 55.1±14.8 years; 56.2% were male and 80.9% were white. The MOS (dataset 2) was a large-scale multiyear survey of patients with prevalent treatable chronic health conditions. The study samples (N=3429) were drawn from patients receiving health care services from 523 solo or group practices in Boston, MA, Chicago, IL, and Los Angeles, CA. The average age was 53.5±16.7 years; 38.0% were male and 76.4% were white. The purpose of the BCPS project (dataset 3) is to investigate the relationship between pain levels and medication effects for patients with stage IV metastatic breast cancer. One hundred twenty-five women were recruited, with an average age of 55.3±12.5 years (range, 29–85y) at the time of the study.

We first selected pain items from these 3 datasets based on item content (ie, face validity). The initial item pool contained 6 items from dataset 1, 6 from dataset 2, and 36 from
dataset 3. Within dataset 1, 2 items were from the European Organization for Research and Treatment of Cancer's Core Quality of Life Questionnaire,38,39 1 from the Functional Assessment of Cancer Therapy–General Scale40 (FACT-G), 2 from the MOS,41 and 1 from the Cancer Rehabilitation Evaluation System–Short Form.42 Within dataset 2, all of the items were from the MOS long form. Within dataset 3, 1 was from the FACT-G, 26 from an expanded Brief Pain Inventory,43 2 from the Symptom Distress Scale,44 and 7 from the Pain Disability Index.45 Before the data were analyzed, all negatively worded items were reverse scored so that a higher rating always represented less pain (ie, better QOL) and a lower rating always represented more pain (ie, lower QOL). There was no common item across all 3 datasets; however, 1 item was common between dataset 1 and each of the other datasets.

In this illustration, we implemented the single calibration equating approach to calibrate pain-related items from the existing datasets to build an initial pain item bank. We used the Winsteps34 computer programa for this calibration. Figure 5 highlights the data analysis procedures.

Fig 5. Steps to create an initial pain item bank.

We validated this equating result by calculating fit indices proposed by Wright and Bell31 and Wolfe.17 Because there was only 1 linking item and consequently no variance, we calculated only the item-within-link (IWL) MnSq fit index. In addition to evaluating the quality of the linking items, we tested the stability of the item parameters (all items) across the equating processes by comparing the difficulty levels (DIF) and the discrepancy between expected and observed item calibrations (displacement) before and after the linking method was implemented.

Data analysis for individual datasets. The results of examining the unidimensionality of the individual datasets revealed that none of the pain-related items from Q-score misfit, whereas 2 of 8 pain-related MOS items and 3 of 36 pain-related BCPS items misfit. After removing these misfitting items, all the remaining items met the fit criterion. These 43 fitting items (6 from Q-score, 6 from MOS, 33 from BCPS; 2 items common across datasets) were retained in the item pool.

Data analysis for combining all 3 datasets. A pain item pool (n=43 items) was created by combining the items that fit in their individual datasets. These 43 items were calibrated by using the single calibration equating method,36 which estimated item calibrations directly from the total data matrix by treating unadministered items as missing data. Three items had MnSq values greater than 1.30 but less than 2.00. Two items with borderline misfit (level of disability in occupation because of pain; pain on average) were retained in the bank for 3 reasons:
(1) although the MnSq values suggested borderline misfit, they were still acceptable; (2) these 2 items assessed meaningful clinical features; and (3) there was no other empirical evidence (eg, improved item MnSq values) to support removal of these items. The item with greater misfit (pain interfered with sleep) was retained because sleep is a major concern for people who suffer from pain. Because these items fit the unidimensional measurement model and were well calibrated on the pain continuum, we refer to this item pool as an item bank.

Validation of the linking process. The IWL fit analysis showed, with an expected value of the IWL equal to 1, acceptable IWL fit indices between Q-score and MOS (.95) and between Q-score and BCPS (.81). We did not calculate the IWL between MOS and BCPS because there were no linking items between these 2 datasets. The results of the displacement analysis showed that 2 items (2/43=4.7%) had significant discrepancies between the item calibrations from each dataset and those in the linked database, and the DIF analysis showed that only 1 item (1/43=2.3%) showed significant DIF. (If a significant discrepancy between item calibrations under the 2 conditions [ie, linked and unlinked] was detected, we could conclude that the corresponding item showed DIF. The criterion for determining a significant DIF is affected by sample size. Because Q-score and MOS had sample sizes greater than 1500, we used a 0.5 logit criterion. Because BCPS had a moderate sample size, we used a 95% confidence interval criterion.) Both the displacement and DIF results provided evidence that the single calibration equating we used was valid.

We have summarized the issues involved in creating an initial item bank from existing datasets and have shown these procedures in the development of an initial pain item bank. Because the 3 datasets possessed only 1 common item between pairs of datasets, these results are illustrative. The stability of the calibrations in this initial item bank must be tested further. This initial item bank provided information on the interrelationships of these items and identified potential gaps where new items might be added. This information makes it possible to improve and expand this initial item bank into a fully operational item bank. This pain item bank is being field tested in a study in which all subjects answer all items, thereby providing data on the stability of the item calibrations. A final pain item bank will be produced after the completion of field testing.

The optimal equating method depends on a variety of sample and item characteristics. We offer an example of 1 method. Although the results from common-person equating are less reliable because people's responses constantly change, there is no evidence that the results using this method would vary significantly from those obtained using another equating method. Smith36 compared the results of 3 equating techniques (weighted common-item, unweighted common-item, single calibration equating) and concluded that all the equating procedures were robust enough not to violate the assumptions of the Rasch model. Although there may be no significant differences in results, single calibration equating offers advantages in terms of ease of use and lower labor intensity.

CONCLUSION

We have shown the procedures involved in developing a common core of items that are appropriate across a number of diagnostic groups.
Conceivably, adaptations to item banks could be made to reflect differences in any number of groups or of characteristics within groups. How far one goes in tailoring an instrument or item bank depends on the level of precision required. In the sample data reported herein, the cancer and HIV samples and the MS and stroke samples could be combined to produce only 2 versions of
selected items to be included in the bank. A reasonable criterion would be whether significant differences exist in the estimates of reliability of the measures obtained using the single or separate estimates of item difficulty for all groups. Another criterion would be differences in the measures themselves resulting from treating items as common or as specific to a diagnostic group. As development of item banks continues, these issues will need to be explored further.

We have also shown the procedures involved in linking existing datasets. Methods of test equating have now advanced to the point where available computer programs are quite adequate. The selection of method should be based on the nature of the data. An initial item bank, whether developed by a common-item, common-person, or single calibration equating approach, must be field tested to ensure the stability and accuracy of the item calibrations.

Implications and Future Directions

Of what utility might such an item bank be to rehabilitation researchers? One use would be to develop, from this item bank, short instruments designed for a particular setting or subgroup of individuals. Because all the items in the bank are scaled together, different subsets of items can be selected to produce separate instruments that accurately assess status at a particular time or setting. Because the measures from these instruments are on a common scale, they can be used to continuously monitor improvement across time or settings. Although these instruments would be tailored to groups of individuals, a natural extension of an item bank would be the development of computerized adaptive tests that can be used to tailor the instrument to the individuals being measured. These instruments could be more clinically driven and focus on aspects of recovery that are most important to the individual whose QOL is being assessed.

References
1. Segal ME, Heinemann AW, Schall RR, Wright BD. Rasch analysis of a brief physical ability scale for long-term outcomes of stroke. In: Smith RM, editor. Physical medicine and rehabilitation: state of the art reviews: outcome measurement. Philadelphia: Hanley & Belfus; 1997. p 385-96.
2. Doninger N, Bode RK, Heinemann AW, Ambrose C. Rating scale analysis of the Neurobehavioral Cognitive Screening Exam (NCSE). J Head Trauma Rehabil 2000;15:683-95.
3. Fisher WP Jr. Physical disability construct convergence across instruments: towards a universal metric. J Outcome Meas 1997;1(2):87-113.
4. Cella DF, Lloyd SR, Wright BD. Cross-cultural instrument equating: current research and future directions. In: Spilker D, editor. Quality of life and pharmacoeconomics in clinical trials. 2nd ed. New York: Lippincott-Raven; 1996. p 707-15.
5. Fisher WP Jr, Harvey RF, Taylor P, Kilgore KM, Kelly CK. Rehabits: a common language of functional assessment. Arch Phys Med Rehabil 1995;76:113-21.
6. Fisher WP Jr, Eubanks RL, Marier RL, Hunter SM. Equating the LSU HSI and the SF-36 using a probabilistic measurement model. Measurement Research Reports. New Orleans (LA): LSU Medical Center, Dept Preventive Medicine and Public Health; 1996.
7. Chang CH, Cella D. Equating health-related quality of life instruments in applied oncology settings. In: Smith RM, editor. Physical medicine and rehabilitation: state of the art reviews: outcome measurement. Philadelphia: Hanley & Belfus; 1997. p 397-406.
8. Leplege A, Ecosse E. Methodological issues in using the Rasch model to select cross culturally equivalent items in order to develop a quality of life index: the analysis of four WHOQOL-100 data sets (Argentina, France, Hong Kong, United Kingdom). J Appl Meas 2000;1:372-93.
9. Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I: Conceptual framework and item selection. Med Care 1992;30:473-83.
10. Stewart AL, Napoles-Springer A. Health-related quality of life assessments in diverse population groups in the United States. Med Care 2000;38(9 Suppl):II102-24.
11. Sherbourne CD, Meredith LS. Quality of self-report data: a comparison of older and younger chronically ill patients. J Gerontol 1992;47:S204-11.
12. McHorney CA, Ware JE Jr, Lu JF, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36). III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994;32:40-66.
13. Meredith LS, Siu AL. Variation and quality of self-report health data: Asians and Pacific Islanders compared with other ethnic groups. Med Care 1995;33:1120-31.
14. Johnson PA, Goldman L, Orav EJ, Garcia T, Pearson SD, Lee TH. Comparison of the Medical Outcomes Study Short-Form 36-Item Health Survey in black patients and white patients with acute chest pain. Med Care 1995;33:145-60.
15. Wolinsky FD, Stump TE. A measurement model of the Medical Outcomes Study 36-Item Short Form Health Survey in a clinical sample of disadvantaged, older, black, and white men and women. Med Care 1996;34:537-48.
16. Arocho R, McMillan CA. Discriminant and criterion validation of the US-Spanish version of the SF-36 Health Survey in a Cuban-American population with benign prostatic hyperplasia. Med Care 1998;36:766-72.
17. Wolfe EW. Understanding Rasch measurement: equating and item banking with the Rasch model. J Appl Meas 2000;1:409-34.
18. Bode RK. Understanding Rasch measurement: partial credit model and pivot anchoring. J Appl Meas 2001;2(1):78-95.
19. Cella DF, Tulsky DS, Bonomi A, Lee-Riordan D, Silberman M, Purl S. The Functional Assessment of Cancer Therapy (FACT) scales: incorporating disease-specificity and subjectivity into quality of life (QL) assessment [abstract]. Proc Am Soc Clin Oncol 1990;9:307.
20. Whitelaw NA, Liang J. The structure of the OARS physical health measures. Med Care 1991;29:332-47.
21. Liang J, Bennett J, Whitelaw N, Maeda D. The structure of self-reported physical health among the aged in the United States and Japan. Med Care 1991;29:1161-80.
22. Rasch G. Probabilistic models for some intelligence and attainment tests. Rev ed. Chicago: Mesa Pr; 1980.
23. Wright BD, Masters GN. Rating scale analysis: Rasch measurement. Chicago: Mesa Pr; 1982.
24. Hambleton RK. Principles and selected applications of item response theory. In: Linn RL, editor. Educational measurement. 3rd ed. New York: Macmillan; 1989. p 147-200.
25. Wright BD. Comparing Rasch measurement and factor analysis. Structural Equation Modeling 1996;3(1):3-24.
26. Smith RM. A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling 1996;3(1):25-40.
27. Linacre JM. Structure in Rasch residuals: why principal components analysis? Rasch Meas Trans 1998;12:636.
28. Rogler LH. Methodological sources of cultural insensitivity in mental health research. Am Psychol 1999;54:424-33.
29. Choppin B. An item bank using sample-free calibration. Nature 1968;219:870-2.
30. Millman J, Arter JA. Issues in item banking. J Educ Meas 1984;21:315-30.
31. Wright BD, Bell SR. Item banks: what, why, how. J Educ Meas 1984;21:331-45.
32. Bergstrom B. Diagnosing noisy anchors. Rasch Meas Trans 1993;3:327.
33. Wright BD, Schulz EM. MFORMS [a FORTRAN computer program for one-step item banking of dichotomous and partial credit data from multiple forms]. Chicago: Mesa Pr; 1987.
34. Linacre JM. WINSTEPS computer program. Chicago: Mesa Pr; 2001.
35. Schulz M. One-step item banking computer program. Rasch Meas Trans 1987;1:10.
36. Smith RM. Applications of Rasch measurement. Chicago: Mesa Pr; 1992.
37. Angoff WH. Scales, norming, and equivalent scores. In: Thorndike RL, editor. Educational measurement. 2nd ed. Washington (DC): American Council on Education; 1971. p 508-600.
38. Aaronson NK, Ahmedzai S, Bergman B, et al. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst 1993;85:365-76.
39. Fayers P, Aaronson N, Bjordal K, Curran D, Groenvold M. EORTC QLQ-C30 scoring manual. 2nd ed. Brussels: European Organizational Fund for Research and Treatment of Cancer; 1999.
40. Cella D. Manual of the Functional Assessment of Chronic Illness Therapy (FACIT) measurement system. Evanston (IL): Center on Outcomes, Research and Education (CORE), Evanston Northwestern Healthcare and Northwestern University; 1997.
41. Hays RD, Sherbourne CD, Mazel RM. User's manual for the Medical Outcomes Study (MOS): core measures of health-related quality of life. Santa Monica (CA): RAND; 1995.
42. Schag CA, Ganz P, Heinrich RL. Cancer rehabilitation evaluation system-short form: a cancer specific rehabilitation and quality of life instrument. Cancer 1991;68:1406-13.
43. Cleeland CS. Assessment of pain in cancer: measurement issues. In: Foley KM, editor. Advances in pain research and therapy. Vol 16. New York: Raven Pr; 1990. p 47-55.
44. McCorkle R, Jepson C, Malone D, et al. The impact of posthospital home care on patients with cancer. Res Nurs Health 1994;17:243-51.
45. Tait RC, Pollard CA, Margolis RB, Duckro PN, Krause SJ. The Pain Disability Index: psychometric and validity data. Arch Phys Med Rehabil 1987;68:438-41.

Supplier
a. Winsteps, PO Box 811322, Chicago, IL 60681-1322.