
Evaluating Quality-of-Life and Health Status Instruments: Development of Scientific Review Criteria*

Kathleen N. Lohr, PhD,1 Neil K. Aaronson, PhD,2 Jordi Alonso, MD, PhD,3 M. Audrey Burnam, PhD,4 Donald L. Patrick, PhD, MSPH,5 Edward B. Perrin, PhD,5 and James S. Roberts, MD6

1Health Services and Policy Research Program, Research Triangle Institute, Research Triangle Park, North Carolina, 2Division of Psychosocial Research and Epidemiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands, 3Department of Epidemiology and Public Health, Institut Municipal d'Investigacio Medica, Barcelona, Spain, 4The RAND Corporation, Santa Monica, California, 5Department of Health Services, University of Washington, Seattle, Washington, and 6Voluntary Hospitals of America, Irving, Texas

ABSTRACT

The Medical Outcomes Trust is a depository and distributor of high-quality, standardized, health outcomes measurement instruments to national and international health communities. Every instrument in the Trust library is reviewed by the Scientific Advisory Committee against a rigorous set of eight attributes: (1) conceptual and measurement model; (2) reliability; (3) validity; (4) responsiveness; (5) interpretability; (6) respondent and administrative burden; (7) alternative forms; and (8) cultural and language adaptations. In addition to a full description of each attribute, we discuss uses of these criteria beyond evaluation of existing instruments and lessons learned in the first few rounds of instrument review against these criteria.

*Presented at the first annual international meeting of the Association for Pharmacoeconomics and Outcomes Research, May 12-15, 1996, Philadelphia, Pennsylvania.

INTRODUCTION

This paper reports on work from the Medical Outcomes Trust and its Scientific Advisory Committee (SAC).1,2 The Trust is a nonprofit public service organization, established in 1994 to serve as a depository and distributor of high-quality, standardized, health outcomes measurement instruments to national and international health communities. The goal of the Trust is to achieve universal adoption of health outcomes measurement to improve the value of health services. It believes that strategies for measuring, monitoring, and managing functional outcomes, and for integrating those approaches into quality improvement programs, offer the best hope of achieving value from the nation's investments in health care. Thus the Trust was established to ensure that advanced patient-based instruments and related support materials are easily available to all interested parties. The Trust does this through a variety of services intended to support the application of health outcomes assessment in everyday health care practice; those services include providing timely information and answers to questions, technical support, and ongoing educational activities.3

To ensure that the instruments accepted into the Trust meet rigorous scientific standards, it established the SAC to evaluate all instruments submitted and make recommendations to the Trust's board of trustees. The SAC evaluates these instruments in the context of their intended applications as well as their general significance and contributions to the field, and also in relation to explicit, public criteria that ensure consistent and defensible recommendations and decisions. To that end, the SAC in 1995 made public its current set of eight attributes used to review and evaluate instruments. These are briefly presented in the remainder of this paper in eight "pairs": first a brief definition of the attribute and then a short outline of the assessment criteria. The appendix gives the fully detailed set of attributes and criteria.

ATTRIBUTES AND RELATED CRITERIA

Conceptual and Measurement Model

The first attribute is the instrument's conceptual and measurement model. The conceptual model comprises the underlying rationale for and description of the concepts that the measure is intended to assess and the relationships between or among those concepts. The measurement model is reflected in the instrument's scale and subscale structure and the procedures that are used to create the scale scores. At least four major questions must be answered about the conceptual and measurement model. First, what is the basis for combining items into scales? Second, what descriptive statistics for scales can be provided, and do the scales demonstrate adequate variability? Third, what evidence describes or supports the intended level of measurement (eg, ordinal, interval, or ratio scales) or the scaling assumptions for preference-weighted measures? Fourth, are the procedures for deriving scale scores from raw scores adequately specified and justified?
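To make the fourth question concrete, the following sketch shows one common way of deriving a scale score from raw item responses: summing the items and linearly rescaling the sum to a 0-to-100 range. It is a minimal illustration in Python under our own naming conventions, not the scoring algorithm of the Trust or of any particular instrument.

    def scale_score(item_responses, item_min=1, item_max=5):
        """Sum raw item responses, then linearly transform the sum to 0-100."""
        raw = sum(item_responses)
        lowest = item_min * len(item_responses)   # worst possible raw score
        highest = item_max * len(item_responses)  # best possible raw score
        return 100.0 * (raw - lowest) / (highest - lowest)

    # Hypothetical 4-item scale with 1-to-5 response choices.
    print(scale_score([4, 5, 3, 4]))  # prints 75.0

A developer submitting an instrument would be expected to state and justify exactly such a transformation, including any weighting of items.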

Reliability

Reliability is the degree to which the instrument is free from random error. This aspect of an instrument is examined in two ways: for internal consistency reliability, that is, high correlations among test items, and for reproducibility, that is, the stability over time in test-retest circumstances or the consistency of scores across raters at a point in time.

The SAC uses several criteria to evaluate reliability. These include reliability estimates and standard errors of measurement for all elements of an instrument; a clear, complete, and detailed description of how the reliability data are collected; reliability estimates for subpopulations (eg, persons with a specific chronic disease or groups with different languages or ethnic backgrounds); and the application of minimal reliability coefficients, depending on whether the comparisons are to be with groups or for individuals. In addition, specific information about test and retest scores is expected, including various coefficients appropriate for instruments yielding interval-level or nominal or ordinal scale values.
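As a concrete illustration of the internal consistency side of this attribute, the sketch below computes Cronbach's alpha from a matrix of item responses. The function and the sample data are our own minimal example, not code supplied by the SAC.

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for an (n_respondents x k_items) matrix of item scores."""
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)      # variance of each item
        total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
        return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

    # Hypothetical data: 5 respondents answering a 4-item scale.
    scores = np.array([
        [3, 4, 3, 4],
        [2, 2, 3, 2],
        [4, 5, 4, 5],
        [1, 2, 1, 2],
        [3, 3, 4, 3],
    ], dtype=float)
    print(round(cronbach_alpha(scores), 2))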

Validity

In many arenas, validity is the most important attribute. The SAC uses a classic definition, namely, the degree to which an instrument measures what it purports to measure. We are concerned with three aspects of validity: content, construct, and criterion. The standards used by the SAC to evaluate validity include the methods used to collect evidence of content and construct validity, as well as a defense of any criterion measures when the instrument developers make a claim for criterion validity. SAC reviewers also look for validity evidence for every proposed use of the instrument and for different populations if it is believed that validity may differ for such groups.

Responsiveness

Responsiveness is an instrument's ability to detect change in outcomes that matter to persons with a health condition, to their significant others, or to their providers. It sometimes is referred to as sensitivity to change, and it is also sometimes regarded as an element of construct validation. Some experts think of it as the ratio of real change over time to variability in scores over time that is not associated with true change in status. More colloquially, responsiveness is the signal-to-noise ratio. The SAC assesses this attribute using longitudinal data, ideally from field tests that compare a group that is expected to change with one that is expected to stay the same, in which change scores are expressed as effect sizes. SAC reviewers look for clear descriptions of those populations and of the approaches used to calculate change and effect size.

Interpretability

Interpretability is defined as the degree to which one can assign qualitative meaning, that is, clinical or commonly understood connotations, to quantitative scores. This can be done with various types of information intended to aid in interpreting scores on the instrument. Examples include comparative data on the distribution of scores from other groups, including the general public, and information on the relationship of scores to disease conditions or health care interventions, or to various events such as losing a job, graduating from college, or needing institutional care. Basically, in evaluating interpretability, the SAC seeks clear descriptions of the comparison populations and the means by which the relevant data (eg, on clinical conditions or life events) were amassed, recorded, interpreted, and displayed.
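One common device for the comparative-data approach just described is the norm-based T score, which expresses an individual's score relative to a reference-population mean of 50 with a standard deviation of 10. The sketch below is a minimal illustration; the norm values are invented for the example.

    def t_score(raw_score, norm_mean, norm_sd):
        """Express a raw score as a T score relative to a reference population."""
        return 50.0 + 10.0 * (raw_score - norm_mean) / norm_sd

    # Hypothetical general-population norm: mean 72, SD 18 on a 0-100 scale.
    print(t_score(54.0, 72.0, 18.0))  # prints 40.0, ie, 1 SD below the population mean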

Respondent and Administrative Burden

The sixth attribute is really a pair of properties: respondent burden and administrative burden. These involve the time, energy, financial resources, personnel, or other resources required of respondents or of those administering the instrument. With respect to respondent burden, the SAC is concerned that the instrument place no undue physical or emotional strain on the respondent. Among the important items of information are the following: the time needed to complete the form or interview; the levels of reading or comprehension assumed; any special requirements that might be placed on respondents, such as the need to consult health or financial records; and the acceptability of the instrument, such as the level of missing data or refusal rates. Regarding administrative burden, the SAC requires information about the time needed for a trained interviewer to administer the instrument, the assumptions about the level of training or expertise needed to administer the instrument, and any special resources needed, such as specific computer equipment to administer, score, or analyze the instrument.

Alternative Forms

Alternative forms, the seventh attribute, refers to all the ways in which the instrument might be administered other than the original way; this particular characteristic can refer to many different modes of applying the instrument. In addition, alternative forms can include proxy versions of the original instrument. Two aspects of evaluating alternative forms of an instrument are important. The first is to evaluate these forms using most of the criteria already mentioned, that is, evidence of reliability, validity, responsiveness, interpretability, and burden. In addition, SAC reviewers want to see information indicating how comparable each alternative form is to the original document.

Cultural and Language Adaptations

Finally, the eighth attribute involves cultural and language adaptations. Here the focus is on conceptual and linguistic equivalence between the original instrument and its adaptations, that is, equivalence in relevance and meaning of the various concepts in an instrument as well as equivalence of the wording and meaning of items and response choices. In addition, the independent psychometric properties of the adaptations are significant aspects of the review, so again the SAC requires information to be provided about reliability, validity, responsiveness, interpretability, and burden. In judging this attribute, SAC reviewers seek material about how the developers sought to achieve conceptual equivalence, ideally in terms of content validity. Also, common rules concerning translations are applied: at least two forward translations and a pooled forward translation; at least one (more would be better) backward translation and another pooled translation; review of these versions by both lay and expert panels, with revisions as necessary; and field tests to gather evidence of comparability. Certainly any significant differences between the original and translated versions should be pointed out and explained.

LESSONS LEARNED IN EARLY ROUNDS OF INSTRUMENT REVIEW

By early 1996, the SAC had reviewed perhaps two dozen instruments using these criteria and a wide array of information supplied by the developers or their representatives. The main purpose has been to make determinations about recommendations to the Trust's board concerning instruments to be included in its library. (For instruments in the Trust's library in September 1996, see the table.)

This effort has provided several lessons. First, the psychometric analyses may well have been done for much of the work reviewed, but they often are not fully reported. The SAC often has to defer decisions and ask specifically for the missing information in a second round. In particular, two key pieces of information are often lacking altogether or are only partially available for review. One area involves data on the responsiveness of instruments to changes in health status over time, and the other concerns data that will facilitate (or even enable) the interpretation of scores in terms that are meaningful for clinicians or for lay persons and policy makers. This gap appears particularly worrisome for large-scale clinical trials, where statistically significant differences in scores between groups (over time) may be found but may not be of any real importance to either the patients or the clinicians. Furthermore, data related to the clinical interpretation of scores are critical when health status or quality-of-life instruments are to be introduced into clinical practice settings, again because of the central importance of the patients' and the clinicians' perspectives.

A second lesson is that, notwithstanding these gaps, developers tend to be focused on the conceptual and statistical properties of their measures, and issues of the practicality of the instrument get short shrift. Even when information on reliability and validity is available, for instance, facts and figures on administrative or respondent burden are not given. For users wanting good guidance from the Trust about applications of different instruments, this is an unfortunate omission.

A third point is an inevitable tension between insisting on scientific rigor and passing instruments that may not achieve the level of scientific validity that we would like to see but that appear to have real promise or to fill an important gap in the measurement armamentarium. Generally, the SAC has had the primary objective of applying stringent standards to all instruments as a means of ensuring the quality of those contained in the Trust's library. When the information needed may well be available or obtainable with relative ease, the SAC asks those submitting instruments to provide or collect and forward it, as a means of encouraging the full development of their work and this field.

Table. Medical Outcomes Trust-approved instruments (as of September 1996).

London Handicap Scale
Quality of Well-Being Scale
Seattle Angina Questionnaire
SF-12 Health Survey
SF-36 Health Survey (standard and acute versions)
  Languages available: Australia/New Zealand (English), Canada (English), Germany (German), Spain (Spanish), Sweden (Swedish), United Kingdom (English), United States (English)
Sickness Impact Profile

CONCLUSIONS

By consistently applying publicly known criteria, the SAC can contribute to the Trust's objective of assuring widespread availability of a comprehensive library of high-quality health outcomes instruments. By making the criteria well known, we believe that others may use them for evaluating and revising their own instruments, for developing new measures, and for choosing functional outcomes instruments for both clinical trials and health services research. Finally, the criteria may prove valuable for others to use in assessing the adequacy of research, in evaluating claims for new diagnostic or therapeutic modalities, and in reviewing publications in this field. The SAC and the Trust welcome both submissions of health status and quality-of-life instruments and comments and suggestions about the instrument review criteria and their applications.

Address correspondence to: Kathleen N. Lohr, PhD, Director, Health Services and Policy Research Program, Research Triangle Institute, PO Box 12194, 3040 Cornwallis Road, Research Triangle Park, NC 27709-2194.

REFERENCES

1. Perrin EB. SAC instrument review process. Med Outcomes Trust Bull. 1995;3:1.
2. Scientific Advisory Committee. Instrument review criteria. Med Outcomes Trust Bull. 1995;3:I-IV.
3. SourcePages. Vol. 1, No. 1. Boston: Medical Outcomes Trust; 1996.


Appendix

Scientific Advisory Committee

INSTRUMENT REVIEW CRITERIA

The Scientific Advisory Committee of the Medical Outcomes Trust has established criteria by which to evaluate instruments that are submitted to the Trust for inclusion in the library. The Committee identified eight instrument attributes to serve as the principal foci of its review. The relative importance of criteria for instrument review may differ depending on the intended use(s) and application(s) of the instrument. Instruments may be intended to distinguish between two or more groups, assess change over time, or predict future status. Instruments will be reviewed in the context of documented applications as stated in the Instrument Submission Form, available by contacting the Trust.

The properties of an instrument are context specific. An instrument that works well for one purpose or in one setting or population may not do so when applied for another purpose or in another setting or population. Ideally, one would want to have evidence of the measurement properties of an instrument for each of its intended applications.

I. CONCEPTUAL AND MEASUREMENT MODEL

DEFINITION

A conceptual model is a rationale for and description of the concept(s) that the measure is intended to assess and the relationship between those concepts. A measurement model is defined as an instrument's scale and subscale structure, and the procedures followed to create scale and subscale scores. The adequacy of the measurement model can be evaluated by examining evidence: (1) that a scale measures a single conceptual domain or construct; (2) that multiple scales measure distinct domains; (3) that variability in the domain is adequately represented by the scale; and (4) that the intended level of measurement of the scale and its scoring procedures are well justified.

REVIEW CRITERIA

(1) The conceptual and empirical basis for combining multiple items into a single scale score and/or multiple scale scores should be provided. This might include information on factor structure, distinctiveness of multiple scales, and internal consistency of scales as generated by methods such as factor analysis and Rasch or structural equation modeling.

(2) Descriptive statistics for each scale should be provided, including information on central tendency and dispersion, skewness, and frequency of missing data. Any other evidence that the scale demonstrates adequate variability in a range that is relevant to its intended use should be provided.
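As a companion to criterion (2), the sketch below computes the kinds of descriptive statistics named there for a single scale: central tendency, dispersion, skewness, frequency of missing data, and, because they bear on adequate variability, floor and ceiling rates. The data are invented for the example.

    import numpy as np

    scores = np.array([0, 25, 50, 50, 75, 75, 100, 100, 100, np.nan])  # hypothetical 0-100 scale
    valid = scores[~np.isnan(scores)]

    mean = valid.mean()                                # central tendency
    sd = valid.std(ddof=1)                             # dispersion
    skewness = ((valid - mean) ** 3).mean() / sd ** 3  # simple moment-based skewness
    missing_rate = np.isnan(scores).mean()             # frequency of missing data
    floor_rate = (valid == 0).mean()                   # share at the lowest possible score
    ceiling_rate = (valid == 100).mean()               # share at the highest possible score

    print(mean, sd, skewness, missing_rate, floor_rate, ceiling_rate)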

(3) A description of the intended level of measurement, for example, ordinal, interval, or ratio scales, should be provided, along with available supportive evidence. For preference-weighted measures, a rationale and evidence supporting scaling assumptions should be provided.

(4) Procedures for deriving scale scores from raw scores should be clearly specified, and a clear description of and rationale for transformations (such as weighting and standardization) should be provided.

II. RELIABILITY

DEFINITION

The principal definition of test reliability is the degree to which an instrument is free from random error. This succinct definition implies homogeneity of content on multi-item tests and internal consistency (ie, high correlations) among test items. The two approaches recommended for examining test reliability are coefficient α (Cronbach's alpha) and alternative-form correlations. Because the latter approach is seldom used in health status assessment, coefficient α can be considered the most relevant approach to reliability estimation.

A second definition of reliability is reproducibility, or the stability of an instrument over time (test-retest) and interrater agreement at one point in time. The two definitions are largely independent of one another.

A. Internal Consistency

Coefficient α provides an estimate of reliability based on all possible correlations between two sets of items within a test. For instruments employing dichotomous response choices, an alternative formula, the Kuder-Richardson formula 20 (KR-20), is available.
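For reference, the two coefficients named above have standard closed forms. The notation here is ours: k is the number of items, \sigma_i^2 the variance of item i, \sigma_X^2 the variance of the total score, p_i the proportion of respondents endorsing dichotomous item i, and q_i = 1 - p_i.

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right),
\qquad
\mathrm{KR\text{-}20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}p_i\,q_i}{\sigma_X^{2}}\right).
\]

KR-20 is coefficient alpha specialized to dichotomous items, for which the item variance reduces to p_i q_i.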

B. Reproducibility

Test-retest reproducibility: Test-retest reproducibility is the degree to which an instrument yields stable scores over time among respondents who are assumed not to have changed on the domains being assessed. The influence of the first test administration on the second administration may overestimate reliability. Conversely, variations in health, learning, reaction, or regression to the mean may yield test-retest data underestimating reliability. Despite these cautions, information on test-retest reproducibility is important for the evaluation of the instrument.

Inter-observer (interviewer) reproducibility: For instruments administered by an interviewer, test-retest reproducibility may refer to both intra-observer and inter-observer agreement.

REVIEW CRITERIA

(1) Reliability estimates and standard errors should be reported for all elements of an instrument, including both the total score and subscale scores, where appropriate.

(2) A clear description should be provided of the methods employed to collect reliability data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (eg, sociodemographics, clinical characteristics if drawn from a patient population, etc.); (c) the testing conditions (ie, where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (ie, means, standard deviations, floor and ceiling effects, etc.).

(3) Where there are reasons to believe that reliability estimates or standard errors of measurement will differ substantially for the various populations in which an instrument is to be used, these data should be presented for each major population of interest (eg, different chronic disease populations, different language or cultural groups, etc.).

(4) Test-retest reproducibility information should be provided as a complement to, not as a substitute for, internal consistency. Reproducibility is more important when repeated measures with the instrument are proposed.

(5) A well-argued rationale should support the time elapsed between first and second administration, as well as the design of the study to ensure that changes in health status were minimal, for example, by including transitional questions on general and specific health or functional status.

(6) Information about test and retest scores should include the appropriate central tendency and dispersion measures of both test and retest administrations.

(7) For instruments yielding interval-level data, information on test-retest reliability (reproducibility) and interrater reliability should include intraclass correlation coefficients (ICCs). For nominal or ordinal scale values, kappa and weighted kappa, respectively, are recommended.

(8) Commonly accepted minimal standards for reliability coefficients are 0.70 for group comparisons and 0.90 to 0.95 for individual comparisons.
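As an illustration of criterion (7), the sketch below computes a one-way intraclass correlation coefficient for test-retest scores and unweighted Cohen's kappa for nominal ratings (weighted kappa, recommended for ordinal data, is the analogous extension). The formulas are the standard ones; the function names and data are ours.

    import numpy as np

    def icc_oneway(scores):
        """One-way ICC for an (n_subjects x k_administrations) score matrix."""
        n, k = scores.shape
        subject_means = scores.mean(axis=1)
        grand_mean = scores.mean()
        msb = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)         # between-subjects mean square
        msw = ((scores - subject_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-subjects mean square
        return (msb - msw) / (msb + (k - 1) * msw)

    def cohen_kappa(rater1, rater2):
        """Unweighted kappa for two raters assigning nominal categories."""
        r1, r2 = list(rater1), list(rater2)
        n = len(r1)
        observed = sum(a == b for a, b in zip(r1, r2)) / n
        categories = set(r1) | set(r2)
        expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
        return (observed - expected) / (1 - expected)

    test_retest = np.array([[40.0, 42.0], [55.0, 53.0], [62.0, 65.0], [48.0, 47.0]])
    print(round(icc_oneway(test_retest), 2))
    print(round(cohen_kappa(["mild", "severe", "mild"], ["mild", "severe", "severe"]), 2))  # 0.4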

III. VALIDITY

DEFINITION

The validity of an instrument is defined as the degree to which the instrument measures what it purports to measure. There are three ways of accumulating evidence for the validity of an instrument:

A. Content-related: evidence that the content domain of an instrument is appropriate relative to its intended use. Methods commonly used to obtain evidence about content-related validity include the use of lay and expert panel judgments of the clarity, comprehensiveness, and redundancy of items and scales of an instrument.

B. Construct-related: evidence that supports a proposed interpretation of scores on the instrument based on theoretical implications associated with the constructs. Common methods to obtain construct-related validity include an examination of the logical relations that should exist with other measures and/or patterns of scores across groups of individuals.

C. Criterion-related: evidence that shows the extent to which scores of the instrument are related to a criterion measure. Criterion measures are measures of the target construct that are widely accepted valid measures of that construct. In the area of health status assessment, criterion-related validity is rarely tested because of the absence of widely accepted criterion measures.

REVIEW CRITERIA

(1) Evidence of content validity should be presented for every major proposed use of the instrument. Information about the methods for developing the content of the instrument should be provided.

(2) Evidence of construct validity should be presented for every major proposed use of the instrument. When data related to criterion validity are presented, a clear rationale and support for the choice of criteria measures should be stated. A rationale should be provided to support the particular mix of evidence presented for the intended uses.

(3) A clear description should be provided of the methods employed to collect validity data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (eg, sociodemographics, clinical characteristics if drawn from a patient population, etc.); (c) the testing conditions (ie, where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (ie, means, standard deviations, floor and ceiling effects, etc.).

(4) The composition of the validation sample should be described in sufficient detail to make clear the population(s) to which it applies. Available data on selective factors that might reasonably be expected to influence validity, such as gender, age, ethnicity, and language, should be described.

(5) Where there are reasons to believe that validity will differ substantially for the various populations in which an instrument is to be used, these data should be presented for each major population of interest (eg, different chronic disease populations, different language or cultural groups, etc.).
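As a small illustration of construct-related evidence, the sketch below checks the expected pattern of relations: scores on a new physical functioning scale should correlate strongly with a related measure (convergent evidence) and only weakly with a measure of a distinct domain (discriminant evidence). All names and data here are invented for the example.

    import numpy as np

    physical_scale = np.array([55, 60, 42, 70, 38, 65])       # new instrument's physical scale
    walk_distance = np.array([300, 340, 200, 420, 180, 380])  # related measure (convergent)
    mental_scale = np.array([50, 47, 53, 49, 46, 52])         # distinct domain (discriminant)

    def pearson_r(x, y):
        """Pearson correlation between two score vectors."""
        return float(np.corrcoef(x, y)[0, 1])

    # The logical pattern of relations should hold: high convergent, low discriminant.
    print(round(pearson_r(physical_scale, walk_distance), 2))  # expected to be high
    print(round(pearson_r(physical_scale, mental_scale), 2))   # expected to be near zero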

IV. RESPONSIVENESS

DEFINITION

Sometimes referred to as sensitivity to change, responsiveness is viewed as an important part of the construct validation process. Responsiveness refers to an instrument's ability to detect change, often defined as the minimal change considered to be important by the persons with the health condition, their significant others, or their providers. The criterion of responsiveness requires asking whether the measure can detect differences in outcomes that are important, even if those differences are small.

Responsiveness can be conceptualized also as the ratio of a signal (the real change over time that has occurred) to the noise (the variability in scores seen over time that is not associated with true change in status). Common methods of evaluating responsiveness include comparing scale scores before and after an intervention that is expected to affect the construct, and comparing changes in scale scores with changes in other related measures that are assumed to move in the same direction as the target measure.

Assessment of responsiveness often involves estimation of the effect size. Effect size is an estimate of the magnitude of change in health status. Effect size translates the before-and-after changes into a standard unit of measurement. Different methods may be used to calculate effect size. Interpretation of effects is discussed in Section V.

REVIEW CRITERIA

(1) For any claim that an instrument is responsive, evidence should be provided on the change scores found in field tests of the instrument. These change scores should be expressed ideally as effect sizes, with information provided on the methods used to calculate the effect size.

(2) Claims for an instrument's responsiveness should be derived from longitudinal data, preferably comparing a group that is expected to change with a group that is expected to remain stable.

(3) The population(s) on which responsiveness has been tested should be clearly identified, including the time intervals of assessment, the interventions or measures involved in evaluating change, and the populations assumed to be stable.
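Criterion (1) asks that change scores be expressed ideally as effect sizes, with the method of calculation reported. Two calculations in common use are sketched below: an effect size computed as mean change divided by the baseline standard deviation, and the standardized response mean, which divides mean change by the standard deviation of change (the signal-to-noise ratio described in the definition above). The data are invented for the example.

    import numpy as np

    # Hypothetical before-and-after scores for a group expected to change.
    baseline = np.array([40.0, 55.0, 48.0, 62.0, 45.0, 50.0])
    follow_up = np.array([46.0, 60.0, 55.0, 66.0, 52.0, 58.0])
    change = follow_up - baseline

    effect_size = change.mean() / baseline.std(ddof=1)  # mean change in baseline SD units
    srm = change.mean() / change.std(ddof=1)            # standardized response mean

    print(round(effect_size, 2), round(srm, 2))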


V. INTERPRETABILITY

DEFINITION

Interpretability is defined as the degree to which one can assign qualitative meaning to an instrument's quantitative scores. Interpretability of a measure is facilitated by information that translates a quantitative score or change in scores to a qualitative category that has clinical or commonly understood meaning.

There are several types of information that can aid in the interpretation of scores:

(1) comparative data on the distribution of scores derived from a variety of defined population groups, including, when possible, a representative sample of the general population;

(2) information on the relationship of scores to clinically recognized conditions or need for specific treatments;

(3) information on the relationship of scores or changes in scores to commonly recognized life events (such as the impact of losing a job); and

(4) information on how well scores predict known relevant events (such as death or need for institutional care).

REVIEW CRITERIA

(1) A clear description should be provided of the rationale for selection of populations for purposes of comparison and interpretability of data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (eg, sociodemographics, clinical characteristics if drawn from a patient population, etc.); (c) the testing conditions (ie, where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (ie, means, standard deviations, floor and ceiling effects, etc.).

(2) Provide any information regarding various ways in which the data have been reported and displayed in order to facilitate interpretation.

VI. BURDEN

DEFINITION

Respondent burden is defined as the time, energy, and other demands placed on those to whom the instrument is administered. Administrative burden is defined as the demands placed on those who administer the instrument.

REVIEW CRITERIA: RESPONDENT BURDEN

(1) Evidence should be provided that the instrument places no undue physical or emotional strain on the respondent. Developers should indicate when or under what circumstances their instrument is not suitable for respondents.

(2) Information should be provided on the average time and range of time needed to complete the instrument, on a self-administered basis or as an interviewer-administered instrument, for all population groups for which the instrument is intended.

(3) Information should be provided about the reading and comprehension level assumed for all population groups for which the instrument is intended.

(4) Information should be provided about any special requirements or requests that might be placed on respondents, such as the need to consult health care records or copy information about medications used.

(5) Information should be provided on the acceptability of the instrument, for example, by indicating the level of missing data and refusal rates and the reasons for both.

REVIEW CRITERIA: ADMINISTRATIVE BURDEN

(1) For interviewer-administered instruments, information should be provided on the average time and range of time required of a trained interviewer to administer the instrument. If appropriate (eg, if the times differ significantly), the information should be given for face-to-face interview, telephone, and computer-assisted formats/applications.

(2) Information should be provided on the amount of training and level of education or professional expertise and experience needed by administrative staff to administer, score, or otherwise use the instrument.

(3) Information should be provided about any resources required for administration of the instrument, such as the need for special or specific computer hardware or software to administer, score, or analyze the instrument.

VII. ALTERNATIVE FORMS

DEFINITION

Alternative forms of an instrument include all modes of administration other than the original source instrument. Depending on the nature of the original source instrument, alternative forms can include self-administered self-report, interviewer-administered self-report, trained observer rating, computer-assisted self-report, computer-assisted interviewer-administered report, and performance-based measures. In addition, alternative forms may include proxy versions of the original source instrument, such as self-administered proxy report and interviewer-administered proxy report.

REVIEW CRITERIA

Alternative forms of an instrument will be evaluated employing the same criteria used for the original source instrument. This will include evidence of reliability, validity, responsiveness, interpretability, and burden. An additional criterion for evaluation of alternative forms will be comparability with the original instrument.

VIII. CULTURAL AND LANGUAGE ADAPTATIONS

DEFINITION

The cross-cultural adaptation of an instrument involves two primary steps: (1) assessment of conceptual and linguistic equivalence, and (2) evaluation of measurement properties. Conceptual equivalence refers to equivalence in relevance and meaning of the same concepts being measured in different cultures and/or languages. Linguistic equivalence refers to equivalence of question wording and meaning in the formulation of items, response choices, and all aspects of the instrument and its applications. For evaluation of measurement properties, each cultural and/or language adaptation will be reviewed separately by the Scientific Advisory Committee for evidence of reliability, validity, responsiveness, interpretability, and burden.

REVIEW CRITERIA

(1) Information about methods to achieve conceptual equivalence should be provided. It is commonly recommended that the content validity of the instrument be assessed in each cultural or language group to which the instrument is to be applied.

(2) Information about methods to achieve linguistic equivalence should be provided. It is commonly recommended that there should be: (a) at least two forward translations from the source language, preferably by persons experienced in translations in health status research, resulting in a pooled forward translation; (b) at least one, preferably more, backward translations to the source language, resulting in another pooled translation; (c) a review of translated versions by lay and expert panels, with revisions; and (d) field tests to provide evidence of comparability.

(3) Any significant differences between the original and translated versions should be identified and explained.

Appendix. Scientific Advisory Committee instrument review criteria. Reprinted, with permission, from the Medical Outcomes Trust, Boston, Massachusetts.