Guidelines for Evaluating Assessment Instruments

Elaine Ewing Fess, MS, OTR, FAOTA, CHT
Hand Research, Zionsville, Indiana

Correspondence and reprint requests to Elaine Ewing Fess, MS, OTR, FAOTA, CHT, Hand Research, 635 Eagle Creek Court, Zionsville, IN 46077.
While obvious differences exist between the fields of medicine and engineering, each field can learn from the other. Some of the most important contributions engineers bring to the specialty of hand rehabilitation are instrumentation concepts. For engineers, instrumentation theory is introduced early in undergraduate training and remains fundamental to all their professional endeavors, resulting in an almost instinctive and constant questioning of instrument veracity. In contrast, most therapists and surgeons receive almost no training in this area and as a consequence trustingly accept at face value the accuracy of their measurement instruments. Exhibiting diverse needs, some medical specialty areas have evolved highly sophisticated instrumentation by working closely with biomedical engineers, while other specialties have not required such high technology to assess, monitor, and treat patients and therefore have not had much interaction with engineers.

When instrumentation concepts are not fully understood by a professional group, a predictable evolution occurs that leads to development and subsequent use of assessment tools that may not meet basic testing and measurement criteria. Practice dictated by personality, tradition, outdated technology, or opinion, rather than by science, is perilously susceptible to bias and errors. When this occurs, ensuing advancement of knowledge within the profession is slowed needlessly by an inability to clearly identify baseline pathology, assess the efficacy of treatment interventions, and define final outcomes. Lack of measurement sophistication eventually inhibits growth of a profession because the ability of its members to communicate with accuracy and truthfulness is diverted and undermined.

Unfortunately, very few assessment tools in hand rehabilitation meet even the most basic measurement criteria. Placing blame for this predicament is counterproductive, but it is important that all involved understand that problems do exist. A difficult-to-break cycle currently exists in which vendors sell what clinicians will buy, and clinicians, unaware of instrumentation concepts, do not know to demand higher product standards. As a result, the cycle revolves, ever
feeding upon itself. While there are exceptions, even those responsible for developing new technology often seem uncommitted to meeting instrumentation standards in the rush to market evaluation products. Eventually, in the present situation, everyone loses: patients, therapists, surgeons, vendors, and the profession. Education is needed to break this seemingly impenetrable cycle.

Escalating the urgency of this situation, individual professional responsibility is being usurped by social and legal mandates in the current health care environment. Hand rehabilitation specialists are facing ever-growing legislation to produce high-quantity care at low cost while maintaining high levels of quality. Ironically, clinicians are suddenly faced with proving their professional efficacy with measurement tools that are frighteningly inadequate. The very instruments that diffusely measure patient progress are now being relied upon to substantiate professional existence. Large databases are being established by third-party payers, HMOs, and professional societies. In addition, outcome studies are becoming increasingly important. Use of improper or poorly researched evaluation tools will seriously impair the ability to survive in an ever more demanding and accountable health care system.

If there ever was a time for hand rehabilitation professionals to work together to educate about basic instrumentation concepts, to identify problems with current assessment tools, to triage the reliable and the valid from the inaccurate, and to stand firm on a commitment to high instrumentation standards, it is now. There may not be another chance.

The purposes of this article are to review basic testing and measurement concepts, to identify instruments that best meet these requirements, to analyze common instrumentation misconceptions, and to provide guidelines that will assist practitioners when purchasing evaluation equipment. This information is helpful not only for those purchasing and using assessment equipment, but also for consumers of professional literature, because research assessment equipment must meet the same standards.
MEASUREMENT CRITERIA

All measurement instruments must meet basic criteria regardless of cost or seeming level of sophistication.1-3
Reliability

First and foremost, an instrument must be accurate within its measurement unit. While this may seem obvious, it is the very area in which many current hand assessment tools are deficient. Termed instrument reliability or bench testing in most testing and measurement arenas, this fundamental instrumentation criterion is referred to as calibration or repeatability by engineers. Any assessment tool whose measurement unit is governed by the National Institute of Standards and Technology (NIST) must be tested against this national standard. Reliability is usually reported as a correlation coefficient, indicating how closely instrument output relates to NIST criteria. In the past, clinicians assumed that hand evaluation instruments met these standards, but, in many cases, they did not. Once instrument reliability is established, rater reliability is documented, first for one examiner and then for multiple examiners.

Although frequently used, trial-to-trial testing of normal subjects to define reliability without prior comparison with NIST standards is fraught with pitfalls4 and must be condemned. In the past, this practice led to adoption of instruments that sometimes had mediocre to poor accuracy. It makes no sense to test an instrument against a standard that has greater propensity for variation than does the instrument itself,5 especially if the instrument repeatedly makes consistent measurement errors. If NIST standards do not exist for the entity being measured, as is the case with patient satisfaction questionnaires, trial-to-trial reliability statements are appropriate. It is important to remember that paper and pencil test instruments are not exempt from reliability requirements.
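As a concrete sketch of how an instrument reliability statement might be computed, the following compares hypothetical dynamometer readings against certified loads spanning the instrument's range. All numbers are invented, and the `pearson_r` helper is written here only for illustration.

```python
# Hypothetical illustration: checking a dynamometer against
# NIST-traceable certified weights and reporting a Pearson
# correlation coefficient as the instrument reliability statement.
import statistics

def pearson_r(x, y):
    """Pearson correlation between paired measurements."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Certified loads (lb) spanning the instrument's full range,
# and the readings the instrument displayed under each load.
certified = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
readings  = [10.4, 20.3, 30.6, 40.2, 50.5, 60.7, 70.4, 80.6, 90.3, 100.5]

r = pearson_r(certified, readings)
mean_error = statistics.mean(rd - c for rd, c in zip(readings, certified))
print(f"correlation vs. standard: r = {r:.4f}")
print(f"mean error: {mean_error:+.2f} lb")
```

Note that these invented readings correlate almost perfectly with the standard yet run consistently high, which is why a mean-error (calibration offset) check belongs alongside the correlation coefficient.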
Validity

Validity defines a test's ability to measure the phenomenon for which it was designed, e.g., strength, volume, or sensibility. It is statistically defined as a correlation coefficient. New assessment instruments are compared with similar instruments with previously established reliability and validity standards. When a test is devised that measures in an entirely new manner, other tests with which it may be compared are not available, and a group of experts may endow the test with face validity. Subsequent tools that are developed and that measure in a similar manner are then compared with the originally sanctioned test. Reliability is a prerequisite to establishment of validity. A test cannot be valid if it is inaccurate.
Equipment, Administration, Scoring, and Interpretation Standards

Equipment standards must be carefully adhered to in order to maintain test integrity. Once reliability and validity have been defined, the instrument or group of similar instruments cannot be altered without affecting inherent reliability and validity levels. If equipment standards are not adhered to, the ability of the test to measure accurately is jeopardized and, as a result, the test is automatically invalidated.

Protocols for administration must also be developed and stringently followed. When this is not done, uncontrolled variables are introduced that may influence the ability of the instrument to provide consistent and valid data. Scoring and interpretation standards that strictly define the meaning and use of test results are also required.
Normative Data and Patient-specific Data

When all of the above criteria have been met and carefully defined, normative data may be collected for an instrument. It is interesting to think about the large number of assessment instruments currently on the market that have reported normative data but lack critical statements of reliability and validity. If an instrument has not been proven to measure accurately and consistently, and if it has not been confirmed to measure the entity for which it was designed, normative data, no matter how great the number of subjects, are useless. Once norms have been established, patient-specific data according to diagnosis may be investigated.
Statement of Purpose and Related Bibliography

Although not an essential requirement, a clearly defined statement of purpose that provides focused boundaries for use is helpful for an instrument. A related bibliography is also worthwhile for those who use the test and want to know more about its development.

Aside from a few hand function tests, only four hand assessment instruments meet even the most basic requirement of instrument reliability.1,2 These include the volumeter, the goniometer, the Jamar dynamometer, and the Semmes-Weinstein monofilaments.
PROBLEMS WITH CURRENT INSTRUMENTS

Lack of Adequate Calibration

Calibration, or instrument reliability, is the area in which most problems with current assessment instruments may be found. Some deficiencies have been identified by researchers and have been corrected or improved by concerned manufacturers, while others have remained unchanged. A factor in not identifying measurement tool inaccuracies has been poor understanding of how best to determine reliability, and this has led to a false sense of security. In general, problems may be divided into four categories: (1) failure to compare with NIST criteria; (2) failure to periodically recheck calibration; (3) failure to understand physics principles; and (4) use of inappropriate methods to check calibration.
Too often instruments have not been compared with NIST standards. Any instrument that measures in units defined by federal regulation must first meet these standards before it may be used for patient assessment. In the past, the assumption was made that all newly purchased instruments met these criteria, but for one reason or another this was not always true. For example, in 1989 it was reported that the Jamar dynamometer had very high reliability. The same study found that of 53 dynamometers tested, 23.8% of the new dynamometers and 53.1% of the used dynamometers needed to be returned to the manufacturer for recalibration.6,7 As a direct result of the study, the manufacturer increased its quality control, and subsequent testing since 1990 has indicated that, of 38 new dynamometers tested, only 5.3% needed recalibrating. Of 112 used dynamometers tested from 1987 to 1995, 50% needed recalibration.8 Although Jamar dynamometers had been on the market since 1954,9 no one had checked them against NIST criteria for over three decades, assuming that calibration was accurate.

Other examples of lack of comparison with NIST criteria may be found in the plethora of two-point discrimination instruments. Although reliability studies have been published for some of these, the studies do not compare the instruments with NIST criteria. It was not until 1988 that Bell and Buford10 demonstrated the inconsistency of stimulus application of currently used two-point discrimination instruments. This was done by measuring the amount of force application in grams, a unit defined by NIST criteria.

Even when instruments have been proven to be reliable according to NIST standards, they must be reevaluated periodically. Age, use, and inadvertent damage may alter an instrument's ability to measure accurately. In an ongoing study of Jamar dynamometers,8 several dynamometers have been periodically retested with interesting results. One dynamometer maintained its high correlation and excellent mean accuracy over 7 years, while another, over 4 1/2 years, maintained an excellent correlation but required annual adjustments of the faceplate screw to rectify mean accuracy. One year the dynamometer readings were heavy by 2 lb and the dynamometer was adjusted accordingly. The next year readings were heavy by 4 lb and adjustments were again made. The most recent check indicated that the dynamometer readings were light by 5 lb, requiring yet another adjustment. These annual checks and adjustments are important in providing measurement consistency. It is unrealistic to expect assessment instruments to remain in calibration forever. They must be checked periodically for calibration accuracy.

Failure to fully understand and implement physics concepts also has inhibited definition of instrument reliability. An example of this may be found in many of the vibrometers currently on the market. Few, if any, actually control force application. Force is defined as mass multiplied by acceleration. This means that the speed at which the vibratory probe is applied must be computed and controlled in addition to controlling probe size, vibration frequency, and amplitude. Although speed is defined by NIST criteria in terms of time and distance, most vibrometers do not measure or control the speed of application; therefore, the force stimulus is unregulated and the reliability or accuracy of the instrument is unknown. There are a few highly specialized vibrometers that control speed of stimulus application, but their great expense and bulk take them beyond most clinic situations.
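The force relationship can be made concrete with a small numeric sketch. The probe mass, contact interval, and application speeds below are assumed values, not data from any actual vibrometer, and the model is deliberately simplified (it ignores vibration amplitude and probe geometry).

```python
# Simplified sketch of F = m * a for a vibrometer probe.
# Probe mass, contact interval, and speeds are assumed values
# chosen only to illustrate the physics, not measured data.

probe_mass_kg = 0.05     # assumed 50-g probe
contact_time_s = 0.01    # assumed interval over which the probe decelerates to rest

for speed_m_per_s in (0.05, 0.10, 0.20):    # slow, moderate, fast application
    accel = speed_m_per_s / contact_time_s  # average deceleration at skin contact
    force_n = probe_mass_kg * accel         # F = m * a
    print(f"applied at {speed_m_per_s:.2f} m/s -> contact force ~ {force_n:.2f} N")
```

In this simplified model, doubling the application speed doubles the transient contact force, which is why controlling probe size, frequency, and amplitude alone does not determine the stimulus.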
Another example, this one involving instrument validity, was also identified by Bell and Buford in their investigation of force-time relationships of sensibility testing instruments.10 Prior to their study, it was widely believed that certain sensibility instruments activated slowly adapting sensory end organs, while others stimulated quickly adapting end organs. Using frequency signals and a spectrum analyzer, it was found that currently used handheld sensibility assessment tools are not selective to specific types of sensory end organs. In both of the above examples, knowledge of basic physics is fundamental to understanding how measurement instruments really perform.

Use of inadequate or inappropriate methods of checking calibration is another reason instrument reliability may not be defined. Calibration monitoring must be appropriately matched to the function being analyzed. If the phenomenon being measured is relatively static, then so too may be the method of checking calibration. If, however, dynamic motion is involved, then calibration must also incorporate dynamic motion.11 For example, the Baltimore Therapeutic Equipment (BTE) Work Simulator uses a static technique of weight suspension to monitor calibration. This static technique is appropriate for the static mode, which has been shown to be very accurate.12 However, the static calibration recommended by the manufacturer is insufficient to define the accuracy of measurement of the dynamic mode, which, when measured with a more appropriate method, has been found to be inconsistent both within and between simulators.11-16

Another problem with calibration is that the method of defining instrument accuracy must be at least as sensitive as the instrument being tested. When the standard with which an instrument is compared is more gross than the instrument being tested, the resulting reliability statement may be extremely misleading and inaccurate. Examples of this may be found in studies where human subjects are used in trial-to-trial testing to define reliability without prior comparison of the instrument with NIST criteria. Unfortunately, unsuspecting clinicians accept the erroneously reported high reliability statements; in turn, these instruments, whose true accuracy is unknown, are trustingly used in further research, creating progressively increasing amounts of misinformation.
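The hazard described above can be illustrated with a small simulation; all numbers are invented. An instrument that consistently reads 10% high shows near-perfect trial-to-trial correlation across repeated testing sessions, yet every reading is wrong, and only comparison against a known standard reveals the error.

```python
# Invented illustration: trial-to-trial testing cannot detect a
# consistent measurement error. A simulated dynamometer reads 10%
# high; two testing sessions correlate almost perfectly with each
# other even though every value is inaccurate.
import random
import statistics

random.seed(1)

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

true_grip = [30, 45, 60, 75, 90, 105, 120]  # hypothetical true loads (lb)
bias = 1.10                                  # instrument reads 10% high

# Two sessions with small random trial-to-trial variation.
trial1 = [t * bias + random.gauss(0, 0.5) for t in true_grip]
trial2 = [t * bias + random.gauss(0, 0.5) for t in true_grip]

print(f"trial-to-trial r: {pearson_r(trial1, trial2):.4f}  (looks 'reliable')")
print(f"mean error vs. true load: "
      f"{statistics.mean(a - t for a, t in zip(trial1, true_grip)):+.1f} lb")
```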
COMPUTERIZED ASSESSMENT TOOLS

Computerized assessment tools must meet the same standards as noncomputerized instruments. Unfortunately, computers inherently have high levels of believability and, because of their complexities, inaccuracies may be more difficult to identify. To
most patients and clinicians, computer evaluation systems seem more sophisticated than traditional assessment instruments and therefore, it is assumed, must be more accurate. In truth, because computers tend to be "black boxes" about which the average rehabilitation specialist knows little, it is far easier to be misled.

In regard to computer evaluation systems, the greatest source of misplaced confidence often is centered on the self-assessment program. The uninformed or naive practitioner often assumes that simply by pushing a button or initiating a program, calibration of the respective instruments is confirmed. This assumption is analogous to testing sensibility by asking the patient to do addition in his or her head without ever touching his or her hand. The computerized dynamometer or goniometer, or whatever the instrument may be, is the distalmost link in an integrated system. The only way to check the accuracy of the instrument is to compare it with known NIST standards as though it were not computerized. Further, the entire system needs to be in place as it would be in everyday use. It is not sufficient to send the instrument to a remote-from-the-clinic location for testing. If this is done, only one end of the system is known.
BASIC GUIDELINES FOR EVALUATING ASSESSMENT EQUIPMENT

While not all-inclusive, the following guidelines may be used in evaluating the status of evaluation equipment:

1. Ask for instrument reliability correlation coefficients as defined by NIST standards. Normative data are not a substitute for reliability.
2. Ensure that the instrument's full range was tested, not just spot checked, in determining reliability.
3. Check for validity correlation coefficients. Remember, a new test cannot be validated through comparison with an invalid test.
4. If reliability and validity are sufficiently established, request administration protocols and interpretation instructions.
5. Ensure that the instrument meets the equipment standards used in the reliability and validity process.
6. Identify the true resolution of the instrument if a digital readout is used. The number of digits displayed is not indicative of the resolution or accuracy of the instrument.
7. Analyze the method of checking calibration. This is one of the most difficult areas to evaluate. Manufacturers' recommendations may camouflage inherent problems in the instrument. If questions arise about the appropriateness of the calibration check, consult an independent bioengineer or engineer.
8. Ensure that the scope of the testing range is appropriate for the phenomenon being measured by matching physiologic thresholds and end points to instrument range and specificity. An instrument that tests only partial aspects of physiologic range is not of much value.
9. Beware of instruments that report average values without providing specific high-low scores. Single value range (high minus low), mean, variance, and coefficient of variation statistical data may be helpful, but they do not identify specific peaks and valleys of performance. Average values may camouflage performance inconsistencies within the instrument itself17 (see the sketch following these guidelines).
10. Read related literature. Look for human trial-to-trial reliability statements without NIST comparisons. Also, have physics concepts been completely and appropriately assessed and controlled? If unsure, consult an independent bioengineer or engineer.
11. Beware when someone says that the computer self-check is all that is needed.
12. Beware when someone says that a computerized instrument may be sent away for a calibration check without checking the full computer system as it is used in the clinic.

If every hand rehabilitation specialist would follow these minimal guidelines when purchasing new assessment equipment or when assessing established evaluation tools, reliance on inaccurate and invalid instruments would be curtailed considerably.
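As an illustration of guideline 9, the following sketch uses invented readings: two sets of repeated measurements share the same mean, but the high-low values, variance, and coefficient of variation expose the inconsistency the average conceals.

```python
# Hypothetical illustration for guideline 9: identical averages can
# hide very different performance. All readings are invented.
import statistics

def summarize(name, readings):
    mean = statistics.mean(readings)
    var = statistics.pvariance(readings)
    cv = (statistics.pstdev(readings) / mean) * 100  # coefficient of variation, %
    print(f"{name}: mean={mean:.1f}  range={max(readings) - min(readings):.1f}  "
          f"high={max(readings):.1f}  low={min(readings):.1f}  "
          f"variance={var:.2f}  CV={cv:.1f}%")

consistent   = [49.8, 50.1, 50.0, 49.9, 50.2]  # steady performance
inconsistent = [41.0, 58.5, 50.0, 62.0, 38.5]  # same mean, wide peaks and valleys

summarize("consistent  ", consistent)
summarize("inconsistent", inconsistent)
```

Both sets average exactly 50.0; only the high-low scores and dispersion statistics reveal that the second set swings more than 20 units between trials.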
FUTURE CONSIDERATIONS

It is interesting to peruse evaluation equipment advertisements and related literature. Presently, very few manufacturers provide precise engineering specifications regarding instrument accuracy tolerances. This information is critical to selection of measurement instruments and yet is conspicuously absent. It would be helpful if professional societies related to the field of hand rehabilitation would develop accuracy criteria for all hand assessment equipment. This would be a major positive step toward instrument regulation, which is very much needed.
CONCLUSION

As instrument sophistication increases, the role of the independent engineer becomes more and more critical. Clinicians and researchers alike can learn and can enhance their respective professional endeavors by consulting with engineers and by implementing sound instrumentation principles. Until measurement problems are identified and corrected, patient intervention is needlessly hampered and the profession cannot advance.
REFERENCES

1. Fess EE: The need for reliability and validity in hand assessment instruments. J Hand Surg [Am] 11:621-623, 1986.
2. Fess EE: Documentation: Essential elements of an upper extremity assessment battery. In Hunter JM, Schneider LH, Mackin EJ, Callahan AD (eds): Rehabilitation of the Hand: Surgery and Therapy, 3rd ed. St. Louis, C. V. Mosby, 1990, pp. 53-81.
3. Payton OD: Research: The Validation of Clinical Practice, 3rd ed. Philadelphia, F. A. Davis, 1994, pp. 55-83.
4. Fess EE: Why trial-to-trial reliability is not enough. J Hand Ther 7:28, 1994.
5. Dunipace KR: Reliability of the BTE Work Simulator dynamic mode (letter). J Hand Ther 8:42-43, 1995.
6. Fess EE: Reliability of new and used Jamar dynamometers under laboratory conditions. J Hand Ther 3:35, 1990.
7. Fess EE: A method for checking Jamar dynamometer calibration. J Hand Ther 1:28-32, 1988.
8. Fess EE: Instrument reliability of new and used Jamar dynamometers (unpublished study). 1995.
9. Bechtol CD: Grip test: Use of a dynamometer with adjustable handle spacing. J Bone Joint Surg Am 36:820-832, 1954.
10. Bell-Krotoski JA, Buford WL Jr: The force/time relationship of clinically used sensory testing instruments. J Hand Ther 1:76-85, 1988.
11. Dunipace KR: Reliability of the BTE Work Simulator dynamic mode (abstract). J Hand Ther 8:52-53, 1995.
12. Fess EE: Instrument reliability of the BTE Work Simulator: A preliminary study. J Hand Ther 6:59-60, 1993.
13. Cetinok EM, Renfro RR, Coleman EF: Final Report of CRC Design Team, EE492 Senior Design Class. Indianapolis, IN, Department of Electrical Engineering, Purdue University School of Engineering and Technology, 1993.
14. Cetinok EM, Renfro RR, Coleman EF: A study of the dynamic mode of a BTE Work Simulator. Submitted for publication.
15. Cetinok EM, Coleman EF, Fess EE, Dunipace KR, Renfro R: Reliability of the BTE Work Simulator dynamic mode (abstract). J Hand Ther 8:52-53, 1995.
16. Dunipace KR: Reliability of the BTE Work Simulator dynamic mode (letter). J Hand Ther 8:42-43, 1995.
17. Fess EE: How to avoid being misled by statements of average. J Hand Ther 7:193-194, 1994.