OsteoArthritis and Cartilage (2006) 14, A10–A13 © 2006 Published by Elsevier Ltd on behalf of OsteoArthritis Research Society International. doi:10.1016/j.joca.2006.02.021
A user's guide to measurement in medicine
M. N. Lassere, M.B.B.S. (Hons), Dip Grad Epi, Ph.D., F.R.A.C.P., F.A.F.P.H.M.

Department of Rheumatology, St George Hospital, Gray Street, Kogarah, NSW 2217, Australia, and University of New South Wales, Australia

The author is Staff Specialist Rheumatologist at St George Hospital and Associate Professor in Medicine at the University of New South Wales. Address correspondence and reprint requests to: A/Prof. Marissa N. Lassere, Department of Rheumatology, St George Hospital, Gray Street, Kogarah, NSW 2217, Australia. Tel: 61-2-9350-2330; Fax: 61-2-9588-1156; E-mail: [email protected]. Received 24 September 2004; revision accepted 26 February 2006.

Summary

Measurement is fundamental to science. In medicine, measurement underpins most clinical decisions. Outcome Measures in Rheumatoid Arthritis Clinical Trials (OMERACT) is an informal collaborative group of professionals dedicated to improving outcome measurement in the rheumatic diseases. The methodologic hallmark of the OMERACT process is captured in the OMERACT filter: truth, discrimination, and feasibility. Using the key elements of the OMERACT filter, a comprehensive checklist for evaluating reported measures is provided. The checklist guides the potential user through a series of questions. It is also an important resource for researchers working in the field of measurement.

Key words: Measurement, Validity, Reliability, Responsiveness, Diagnostic test, Surrogate marker.
Why measure?

Measurement is fundamental to science. In medicine, measurement underpins most clinical decisions. The development of disease diagnostic criteria, the evaluation of change in patients' condition over time, and the determination of prognosis are clinical processes that require some form of observation and measurement, either ad hoc or as part of a formal procedure. Measurement has often been antecedent to disease definition itself: the concept of diabetes was obscure at best until we were able to measure urinary, and later blood, glucose. Similarly, the definition of osteoarthritis may in turn be modified to reflect our capacity to measure articular cartilage dimensions.

Outcome Measures in Rheumatoid Arthritis Clinical Trials (OMERACT) is an informal collaborative group of professionals dedicated to improving outcome measurement in the rheumatic diseases1 (www.omeract.org). The OMERACT process is data-driven, iterative and consensus-building. It aims to examine existing measures and explore new measures in order to make the selection of outcome measures explicit and justifiable. Where consensus does not or cannot be reached, OMERACT formulates an appropriate research agenda. Its recent work in developing a magnetic resonance imaging (MRI) score of disease activity and damage in rheumatoid arthritis (RA) is an example2. The methodologic hallmark of this process is captured in the OMERACT filter: truth, discrimination, and feasibility3. What follows is an explanation of these three components of the filter, to provide guidance in assessing past work and designing future investigations in the development and validation of measures of structural outcome in osteoarthritis. This introduction will also briefly describe how the historical development of measurement in the medical sciences, adopted from conventions arising from either psychology4 or the natural sciences5, has led to a nonintuitive, ambiguous nomenclature that can confuse the clinical research process and its application. The OMERACT process encourages outcome measurement to be transparent and useful.

What to measure?

What to measure depends on our intent, and from our limited knowledge of the pathobiology of osteoarthritis we have little a priori guidance here. Measurement content can reflect anatomy: different joints, different tissues within each joint (cartilage, bone, synovium, tendon, meniscus), and different features of each tissue (e.g., cartilage thickness, volume, stiffness, composition). Ultimately, however, we are more interested in how a patient feels, functions or survives than in anatomical measurement. Sometimes a distinction is made between outcomes, such as symptoms, irreversible morbid events and survival, whose importance is unquestioned, and anatomical or laboratory outcomes that may be easy to measure but whose importance to the patient has not been demonstrated. When the latter are substituted for clinical outcomes they are called surrogate outcomes. Although the use of surrogate outcomes in clinical trials reduces sample size requirements and trial duration, they can only be justified if there is strong evidence that therapeutic targeting of the surrogate will translate into a beneficial patient outcome.
How to measure?
Data can be qualitative, quantitative, or a combination of the two. Qualitative data can be described in all their aspects but may not be easily measurable, for example, the consistency (soft or firm) of a lump or the quality (blowing or harsh) of a cardiac murmur. Quantitative data can easily be described using a dimensional scale.
Table I
Checklist for evaluating a measure

1. What is its purpose? How will it be used?
1.1 Is the purpose described in sufficient detail?
1.2 Is the subject group defined?
1.3 Are comparisons to be made?
1.4 If so, will these comparisons be on individuals, on groups or both?
1.5 If so, will these comparisons be concurrent, longitudinal or both?
1.6 If applied to individuals, was it standardized?
1.7 Are reference or normative values provided? How were they developed? Are they stratified by relevant cofactors?
2. Were all aspects described in sufficient detail so that others can reproduce it?

Truth
3. Is it valid for your purpose?
3.1 Are all types of validity addressed?
3.1.1 Is it credible (face validity)?
3.1.2 Is it comprehensive (content validity)?
3.1.3 Does it agree (by independent and blind comparison) with other measures that reflect the same concept (construct validity)?
3.1.4 Does it agree (by independent and blind comparison) with a 'gold standard' (concurrent criterion validity)?
3.1.5 Does it predict (by independent and blind comparison) a future 'gold standard' (predictive criterion validity)?
3.2 Were the study design and analysis satisfactory?
3.2.1 Was an appropriate selection (and spectrum) of subjects studied? How were they selected? Was a control group included?
3.2.2 Was an independent and blind comparison performed for determination of construct and criterion validity?
3.2.3 Were the statistical methods adequately described and appropriately chosen? If both measures are continuous, are the intraclass correlation coefficient (1.0 is perfect agreement), the 95% limits of agreement (the smaller the better), the mean difference (i.e., paired t test; again, the smaller the better) and a mean vs difference or mean vs variance plot (to examine important trends) reported? If both measures are categorical, are the weighted and unweighted kappa (1.0 is perfect agreement), sensitivity, specificity and likelihood ratio reported? If one measure is categorical and the other continuous, is an ROC curve reported? If both measures are continuous but their units differ, are the units rescaled or standardized to allow reporting using the methods described above, or are correlation and linear regression methods of analysis reported? (A worked sketch of several of these statistics follows this table.)
3.2.4 Was the study large enough to draw confident conclusions?

Discrimination: reliability and responsiveness
4. Is it sufficiently reliable for your purpose?
4.1 Were all facets of reliability evaluated (between occasions, procedural, within observers, between observers, other sources of variability)?
4.2 Were the study design and analysis satisfactory?
4.2.1 Did the subjects reflect the extremes of the spectrum, where inconsistency is more common?
4.2.2 Was reliability tested independently and blind to previous results?
4.2.3 Were the statistical methods adequately described and appropriately chosen? For continuous measures, are the intraclass correlation coefficient, the limits of agreement (smallest detectable difference), the mean difference (i.e., paired t test), the coefficient of variation and a mean vs difference or mean vs variance plot all reported? For categorical measures, were percentage agreement (weighted and unweighted) and the kappa statistic (weighted and unweighted) all reported?
4.2.4 Was the study large enough to draw confident conclusions?
5. Is it sufficiently responsive for your purpose?
5.1 Was more than one aspect of responsiveness evaluated (different time frames)?
5.2 Were the study design and analysis satisfactory?
5.2.1 Did the subjects reflect the middle of the spectrum, to optimize discrimination?
5.2.2 Was responsiveness tested independently and blind to previous results?
5.2.3 Were the statistical methods adequately described and appropriately chosen? Is the relative efficiency reported? This is the square of the ratio of two paired t statistics (or two paired z statistics); for within-subject change, a large paired t statistic indicates good responsiveness. Is the 'standardized response mean' (SRM) reported? This is the mean change in scores from time zero to time one divided by the standard deviation of these changes; a large SRM indicates good responsiveness. Is the smallest detectable difference reported? This is a measurement-error-based definition of responsiveness: the absolute value of the 95% limits (1.96 times the standard deviation) of the difference scores from a test–retest reliability study.
5.2.4 Was the study large enough to draw confident conclusions?
6. Are sub-components to be used separately?
6.1 If so, does each component meet the filter requirements of validity, reliability and responsiveness?

Feasibility
7. Is it sufficiently feasible?
7.1 Are there procedural problems in administration?
7.2 Is training required?
7.3 Are standard instructions available? Are they clear?
7.4 Are reference or normative values provided? How were they developed? Are they stratified by relevant cofactors?
7.5 What are the costs?
7.6 Is it easily available?
7.7 What are the time issues?
7.8 Is it acceptable to all users?
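Several of the statistics named in items 3.2.3, 4.2.3 and 5.2.3 are simple to compute from paired data. Below is a minimal sketch in Python (my choice of language; the subject data, variable names and the test–retest design are invented for illustration, not taken from the article) showing the mean difference and paired t test, the 95% limits of agreement of Bland and Altman17, the smallest detectable difference, and the standardized response mean as defined above.

```python
# Minimal sketch of the agreement and responsiveness statistics named in
# Table I. The data, names and design below are illustrative assumptions.
import numpy as np
from scipy import stats

# Hypothetical paired measurements: the same quantity measured twice
# (e.g., test-retest cartilage volume in mm^3 on 8 subjects).
test = np.array([1510., 1620., 1480., 1700., 1555., 1640., 1490., 1580.])
retest = np.array([1525., 1600., 1470., 1720., 1560., 1625., 1505., 1570.])

diff = test - retest
bias = diff.mean()             # mean difference ("accuracy")
sd_diff = diff.std(ddof=1)     # SD of the differences ("precision")

# 95% limits of agreement (Bland & Altman, ref. 17): bias +/- 1.96 SD.
loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)

# Paired t test on the differences (is the mean difference zero?).
t_stat, p_value = stats.ttest_rel(test, retest)

# Smallest detectable difference: an observed change must exceed the 95%
# limits of the test-retest difference scores before it can be trusted.
sdd = 1.96 * sd_diff

# Standardized response mean for a change score (time zero to time one):
# mean change divided by the SD of the changes; illustrated here with
# hypothetical follow-up values after an intervention.
followup = np.array([1450., 1540., 1400., 1650., 1500., 1560., 1420., 1510.])
change = followup - test
srm = change.mean() / change.std(ddof=1)  # sign reflects direction of change

print(f"bias={bias:.1f}, LoA=({loa[0]:.1f}, {loa[1]:.1f}), "
      f"paired t={t_stat:.2f} (p={p_value:.3f}), SDD={sdd:.1f}, SRM={srm:.2f}")
```

Relative efficiency, used to compare two candidate measures applied to the same subjects, is then simply the square of the ratio of their paired t statistics.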
Often qualitative data are used to develop graded scales, such as the degree of softening of arthroscopically probed articular cartilage, so the distinction between the two may be subtle.

Levels of measurement have been defined in the psychosocial disciplines as nominal, ordinal, interval and ratio6. The distinction is important because the level of measurement determines the statistical method required for analysis, and some statistical methods are more efficient, and therefore more powerful, than others. In medicine we are more familiar with the level of measurement of variables being described as discrete or continuous. We also use the terms categorical, binary, dichotomous, graded and ranked variables. How do they all compare? Nominal ('naming') variables are categorical variables such as race, diseases, etc. They have no intrinsic dimensional scale. Some categorical variables can take on only two values; these are called dichotomous or binary variables. Gender is an example of a non-directional binary variable; responder vs non-responder is an example of a directional binary variable. Nominal, categorical, binary and dichotomous variables are all discrete variables. Ordinal measures are equivalent to ranked or graded measures. They take on more than two values and have direction, but sit uncomfortably between categorical and continuous measures because the intervals between rankings or grades may not be equal. This has implications for the statistical methods that should be used to analyse ordinal variables and is the subject of considerable debate. Interval variables are continuous measures that have equal intervals (such as temperature), and ratio variables are interval variables that have a true zero point (such as height).

A look at the osteoarthritis imaging literature illustrates the considerable choice available for the numerical process of measurement. Clinical researchers new to the field of measurement must be aware of these alternatives. For example, osteophytes can be (1) counted (a discrete variable); (2) graded on a scale of size, number or position (an ordinal or ranked variable); (3) counted and graded, combining (1) and (2) to give a composite osteophyte score; (4) directly measured in one dimension (such as height) in millimeters; or (5) measured in more than one dimension to give total osteophyte area or volume. As a general principle, one should avoid grading or simple counts if one can use a dimensional measurement, as the latter minimizes information loss. However, grading is often quicker, may be more reliable, and its results often agree closely enough with the dimensional measurement; sometimes dimensional measurement is not technically possible. The methods used to construct a composite of different kinds of measurements also differ7,8, the most common being arithmetic, as in the Disease Activity Score9, and Boolean, as in the ACR20 response criteria10 and the RA remission criteria11; the contrast is sketched below. The development and validation of a measure can be driven by expert opinion, be statistically driven with cross-sectional data, or be explicitly, prospectively data-driven; the last of the three is a type of predictive validity12. The studies that evaluate the performance of new and existing measures are often called method-comparison studies.
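The contrast between arithmetic and Boolean composites mentioned above can be made concrete with a short sketch. The components, weights and the 20% threshold below are invented for illustration; they are deliberately not the published Disease Activity Score or ACR20 formulas.

```python
# Illustrative contrast between an arithmetic composite (a weighted sum
# of component measures) and a Boolean composite (a responder rule).
# Weights and thresholds are invented, not the published formulas.

def arithmetic_composite(tender, swollen, pain, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of components: DAS-like in spirit only."""
    return weights[0] * tender + weights[1] * swollen + weights[2] * pain

def boolean_responder(baseline, followup, threshold=0.20):
    """ACR20-like in spirit only: responder if every component improved
    by at least `threshold` (20%) relative to baseline."""
    return all((b - f) / b >= threshold for b, f in zip(baseline, followup))

baseline = (12, 10, 60)   # (tender joints, swollen joints, pain 0-100)
followup = (8, 7, 45)

print(arithmetic_composite(*baseline), arithmetic_composite(*followup))
print(boolean_responder(baseline, followup))  # True: all improved >= 20%
```

The arithmetic composite preserves graded information on a continuous scale, whereas the Boolean composite collapses the components into a directional binary (responder/non-responder) variable, with the attendant information loss discussed above.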
OMERACT filter

The OMERACT process improves outcome measurement in rheumatology because it is data-driven, iterative and consensus-building1. The OMERACT filter – truth, discrimination, feasibility – defines the essential components of a measure.

TRUTH is fundamental. Does it measure what is intended? Is the measure unbiased? Is the measure relevant?13 Truth captures all aspects of validity: face, content, construct and criterion (see Table I), terminology derived from the psychosocial literature. Accuracy and precision, terms commonly used in the biomedical literature, also capture truth, but they are not synonymous. For example, in a method-comparison study of continuous measures, to determine how well the measures agree we calculate the difference between the two measurements for each subject. Accuracy is the mean of these differences, whereas precision is the standard deviation of the differences14. However, precision has also been used in medicine to describe the property of reliability or consistency, and this confusion in terminology has caused problems with both the design of method-comparison studies and their analysis.

DISCRIMINATION includes the related concepts of reliability (the degree to which a result obtained by a measure can be replicated15) and sensitivity to change, also known as responsiveness or discriminant validity. All else being equal, a measure that has poor reliability (because of random error) will also be less responsive. The study design determines whether the objective is to assess the stability of a measure (reliability) or its ability to respond to external change (responsiveness). Reliability should be reported using both relative and absolute statistics, because they provide different information and incorporate certain biases depending on the study design and subjects16–18. Responsiveness can also be reported using several methods19–21.

FEASIBILITY is conceptually straightforward. Can the measure be applied easily, given constraints of time, money and interpretability? It can be divided into relevant component parts such as personnel, time, costs, acceptability, ethics, adverse events, and so forth.

There are overlapping aspects to this filter; the well-known trade-off between truth and feasibility applies across all measures. The OMERACT filter summarizes key criteria of outcome measurement. In Table I, I provide a user's guide to measurement in medicine: a checklist of questions that should be used when developing new measures or when evaluating existing ones.
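The point that relative and absolute reliability statistics carry different information, and are differently affected by the subjects studied, can be illustrated with a small simulation. The sketch below is an assumption-laden illustration (the one-way ICC(1,1) of Shrout and Fleiss16 serves as the relative statistic; the data, spreads and variable names are invented): with identical measurement error, the ICC is high in a heterogeneous sample and low in a homogeneous one, while the smallest detectable difference, an absolute statistic, is unchanged.

```python
# Relative vs absolute reliability: a one-way ICC(1,1) (Shrout & Fleiss,
# ref. 16) depends on the between-subject spread; the SDD does not.
# The data and scenario are invented for illustration.
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC for an (n subjects x k raters) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    subj_means = ratings.mean(axis=1)
    msb = k * ((subj_means - grand) ** 2).sum() / (n - 1)   # between subjects
    msw = ((ratings - subj_means[:, None]) ** 2).sum() / (n * (k - 1))  # within
    return (msb - msw) / (msb + (k - 1) * msw)

rng = np.random.default_rng(0)
error = rng.normal(0, 5, size=(20, 2))             # same measurement error
wide = rng.normal(100, 30, size=(20, 1)) + error    # heterogeneous subjects
narrow = rng.normal(100, 5, size=(20, 1)) + error   # homogeneous subjects

for name, data in (("wide spectrum", wide), ("narrow spectrum", narrow)):
    sdd = 1.96 * np.std(data[:, 0] - data[:, 1], ddof=1)  # absolute statistic
    print(f"{name}: ICC={icc_1_1(data):.2f}, SDD={sdd:.1f}")
```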
References

1. Tugwell P, Boers M. OMERACT conference on outcome measures in rheumatoid arthritis clinical trials – conclusion. J Rheumatol 1993;20:590.
2. Conaghan P, Edmonds J, Emery P, Genant H, Gibbon W, Klarlund M, et al. Magnetic resonance imaging in rheumatoid arthritis: summary of OMERACT activities, current status, and plans. J Rheumatol 2001;28(5):1158–62.
3. Boers M, Brooks P, Strand V, Tugwell P. The OMERACT filter for outcome measures in rheumatology. J Rheumatol 1998;25:198–9.
4. Allen M, Yen W. Introduction to Measurement Theory. Monterey: Brooks/Cole; 1979.
5. Harris EK, Boyd JC. Statistical Bases of Reference Values in Laboratory Medicine. New York: Marcel Dekker; 1995.
6. Streiner D, Norman G. Health Measurement Scales: A Practical Guide to their Development and Use. Oxford: Oxford University Press; 1989.
7. Bombardier C, Tugwell P. Methodological framework to develop and select indices for clinical trials. J Rheumatol 1982;9(2):753–7.
8. Boers M, Tugwell P. The validity of pooled outcome measures (indices) in rheumatoid arthritis clinical trials. J Rheumatol 1993;20(3):568–74.
9. van der Heijde D, van't Hof M, van Riel PL, van de Putte LB. Development of a disease activity score based on judgement in clinical practice by rheumatologists. J Rheumatol 1993;20:579–81.
10. Felson DT, Anderson JJ, Boers M, Bombardier C, Furst D, Goldsmith C, et al. American College of Rheumatology. Preliminary definition of improvement in rheumatoid arthritis. Arthritis Rheum 1995;38(6):727–35.
11. Pinals RS, Masi AT, Larsen RA. Preliminary criteria for clinical remission in rheumatoid arthritis. Arthritis Rheum 1981;24(10):1308–15.
12. Lassere MN, van der Heijde D, Johnson KR. Foundations of the minimal clinically important difference for imaging. J Rheumatol 2001;28(4):890–1.
13. Tugwell P, Bombardier C. A methodologic framework for developing and selecting endpoints in clinical trials. J Rheumatol 1982;9(2):758–62.
14. Altman DG. Practical Statistics for Medical Research. London: Chapman and Hall; 1991.
15. Last J. A Dictionary of Epidemiology. New York: Oxford University Press; 1988.
16. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–8.
17. Bland M, Altman D. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
18. Lassere M, McQueen F, Østergaard M, Conaghan P, Shnier R, Peterfy C. OMERACT Rheumatoid Arthritis Magnetic Resonance Imaging Studies. Exercise 3: an international multicenter reliability study using the RA-MRI Score (RAMRIS). J Rheumatol 2003;30:1366–75.
19. Liang MH. Evaluating measurement responsiveness. J Rheumatol 1995;22:1191–2.
20. Anderson JJ, Chernoff MC. Sensitivity to change of rheumatoid arthritis clinical trial outcome measures. J Rheumatol 1993;20(3):535–7.
21. Liang M, Larson M, Cullen K, Schwartz J. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 1985;28(5):542–7.