J. Chron. Dis. Vol. 15, pp. 381-387. Pergamon Press Ltd. Printed in Great Britain

MEASUREMENT IN MEDICINE AND PUBLIC HEALTH*

CHARLES M. WYLIE, M.D., DR. P.H.†
School of Hygiene and Public Health, The Johns Hopkins University, Baltimore, Maryland

(Received 15 September 1961)

“As the births of living creatures at first are ill-shapen, so are all innovations, which are the births of time.”
FRANCIS BACON, 1625

It is often held that organized bodies of knowledge, such as medicine and public health, become sciences when precise and accurate methods of measurement are developed. In each of these areas, at least rudimentary measuring sticks have existed for some time to guide practitioners in their actions. In wide areas, however, measuring techniques are poorly developed. For this reason, the practice of medicine or public health is often classified as “an art based on a science”, suggesting that skill, intelligence, and sometimes intuition are needed to apply successfully the bodies of knowledge. When effective measuring sticks become available in both fields, art should play less part in their practice. At present, however, one cannot ignore the fact that both areas often approach the opposite extreme, when even the best practitioners disagree on the precise diagnosis and optimum course of action for particular situations. Clearly, better measuring devices are needed to improve medical and public health practice.

HISTORY OF MEASUREMENT PROBLEMS

When speaking of measurement in simple terms, one often imagines a measuring stick, such as a ruler, being placed beside the object whose quality, in this instance length, one wishes to measure. Using a modicum of judgement, we can record the number on the measuring scale that most nearly reflects the size of that quality. Such a process of measurement goes back far into history. Ancient civilizations varied in their range and skill of measurement. Common to almost all, however, were means for measuring length, volume, weight and time [1]. Frequently there were no standard scales in widespread use; a local unit of length, for example, might be the forearm of the man supervising the construction of a building. Embarrassing situations occurred when the measuring scale died and when the succeeding supervisor was of entirely different stature. In addition to involuntary slips in standardizing the measuring unit, deliberate variations were frequent, particularly in commerce. Merchants were rare who did not have two sets of weights: the heavier set used in buying goods from others, and the lighter set, with identical names, used for retailing goods in the market. In Biblical times, standardized measuring scales were so rare that Solomon wrote: “A false balance is abomination to the Lord, but a just weight is His delight” (Proverbs 11: 1).

Great advances have occurred since then, both in developing techniques of measurement and in standardizing units. In some fields of medicine and public health, however, situations prevail like those of Biblical times. Results vary widely, for instance, in biochemical estimations carried out by different laboratories, or by the same laboratory at different times. Even more unhappily, the practitioner using informed judgement to assess change in patients or communities may unknowingly vary his units of measurement, depending on what he expects to see. For example, in evaluating programs which the practitioner intuitively expects to produce improvement, his judgement is often more sensitive to seeing improvement than deterioration. Measuring methods which give less scope to judgement will help overcome such unconscious biases.

*This study was supported in part by a grant from the National Institutes of Health.
†Assistant Professor, Public Health Administration, Johns Hopkins University.

SCALES OF MEASUREMENT

We have already mentioned the simple measurement of length with a ruler. However, the process of measurement may be even more basic than this. In its simplest sense, measurement is the assigning of numbers to aspects of objects or events according to a consistent rule [2]. Using this definition, one can describe four types of scales, each a development of, and an improvement over, the preceding one [2].

The most primitive measuring procedures involve a nominal scale. Names as well as numbers describe the points on a nominal scale, and the numbers used have no quantitative meaning. With such a scale, one can sort into the same class objects with equivalent properties. The numbers are used merely to identify the groups to which the measured entity belongs; all cases identified by one number differ from cases identified by another. A good example is the process of diagnosing patients and giving them numbers from the International Statistical Classification of Diseases, Injuries, and Causes of Death. Persons are placed in their most likely diagnostic class by measuring certain manifestations against those typically found with each disease. Like all nominal scales, the International Statistical Classification uses an operation for determining equality; thus, all cases in class 260 (diabetes mellitus) have some degree of equality with each other, and differ in some respects from those in other classes. Nominal scales employ no operations for determining greater or less; persons in class 260 are not necessarily more or less ill than those in lower or higher classes. The process of measurement with a nominal scale is often described as “classifying” or “categorizing”, rather than “measuring”.

More highly developed than the nominal scale is the ordinal scale, in which numbers also have a quantitative meaning. Thus, the ordinal scale employs a means for determining both equality and greater or less.
For example, heart disease patients are often placed in Classes I through IV, following criteria set by the New York Heart Association. The Class I patient is least disabled by his disease, while Class II disability is less than Class III, and so on. However, the distances between classes in ordinal scales are not specified. Thus, the contrast between those in Class I and II is not necessarily equal to that between Class II and III patients. Measuring with an ordinal scale is often termed “ranking” or “rating”.

The interval scale goes one step further by having equal distances between classes. The Fahrenheit thermometer is a typical example; two patients whose temperatures are 99°F and 100°F, respectively, differ by the same amount as those whose temperatures are 98°F and 99°F. However, the zero point of such a scale does not mean complete absence of the measured quality. A temperature of 0°F, for instance, does not indicate complete absence of heat, since temperatures go considerably below this point. Moreover, the measured quality is not doubled when the interval scale measurement rises from 1 to 2. For example, ice at 2°F is not twice as hot as ice at 1°F.

The ratio scale overcomes these defects and approaches the peak of perfection. Its zero point indicates complete absence of the quality being measured, and the measured entity has doubled when the measurement rises from 1 to 2. In the field of thermometry, for example, the Kelvin scale has its zero point where heat is absent, and objects at 2°K are twice as hot as those at 1°K. Despite these improvements, ratio scales are no more complicated than less perfect ones. Height and weight are two characteristics of patients measured by such scales. Thus, the 200 lb patient is twice as heavy as one weighing 100 lb, while weight is completely absent at 0 lb.

DEVELOPMENT OF SCALES

In developing a scale for measurement, we can sometimes aim at perfection right from the start. When simply-used ratio scales are available for height and weight, there is no need for less perfect methods. However, in measuring less clearly defined entities, we must sometimes be content with ordinal and interval scales until we learn more about each entity.

To understand how measuring scales have evolved, we might study the history of measuring temperature. Until the sixteenth century, objects were classed as “cold” or “hot”, according to the impression made on the human senses [3]. It was conceived that heat had an intensity, but there was no satisfactory definition of the characteristic to be measured. However, even without definition or complete understanding, thermometers were developed in the early 1500’s. By using such primitive instruments, much was learned about heat. Most important was the discovery that liquids froze and boiled at constant temperatures, which allowed the setting of fixed points on the temperature scale. In 1714, Fahrenheit established an early version of his scale, using a salt and ice mixture and the temperature of a normal man as his lower and upper limits respectively [4]. Celsius later suggested dividing into 100 parts the interval between the freezing and boiling points of water, thus producing the Centigrade scale. Like Fahrenheit’s scale, this was also an interval scale, but more convenient to use. Later, Kelvin developed a ratio scale using Centigrade degrees for his unit of measurement, but establishing the zero point where heat was completely absent [1]. Thus the developing of primitive measuring scales accelerated man’s gathering of knowledge about heat. Throughout the four centuries of measurement, however, man has still measured heat by related entities: volume of a liquid, length of a metal column, pressure of a gas, electrical resistance of a wire, and so on.
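
The difference between Fahrenheit's interval scale and Kelvin's ratio scale can be illustrated numerically. The following Python sketch is an added illustration, not part of the historical account; the function name `fahrenheit_to_kelvin` is hypothetical, and the conversion assumes the standard relation K = (°F - 32) x 5/9 + 273.15.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert a Fahrenheit reading to kelvins (standard conversion, assumed here)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Differences are meaningful on an interval scale: the step from 98 to 99 F
# is the same size as the step from 99 to 100 F.
step_low = fahrenheit_to_kelvin(99.0) - fahrenheit_to_kelvin(98.0)
step_high = fahrenheit_to_kelvin(100.0) - fahrenheit_to_kelvin(99.0)

# Ratios are meaningful only on a ratio scale: the reading 2 F is numerically
# twice the reading 1 F, but the absolute temperatures are almost identical.
naive_ratio = 2.0 / 1.0
absolute_ratio = fahrenheit_to_kelvin(2.0) / fahrenheit_to_kelvin(1.0)

print(f"98 to 99 F step: {step_low:.4f} K; 99 to 100 F step: {step_high:.4f} K")
print(f"ratio of readings: {naive_ratio:.3f}; ratio of absolute temperatures: {absolute_ratio:.5f}")
```

On the absolute scale, ice at 2°F is hotter than ice at 1°F by only a fraction of one per cent, which is why ratios of Fahrenheit readings carry no physical meaning.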
The history of thermometry guides us in developing measurements for entities which, so far, are assessed only by intuitive methods. Thus, there seem to be four reasonable steps in developing new scales.

Step 1 is to define, if possible, what is to be measured. The definition should use words too plain to need further clarification. However, this may not be possible, and definition need not precede measurement. For example, time has not yet been clearly defined or understood, but it can be measured by satisfactory, indirect methods. Since measurement allows the more intensive observation and study of phenomena, it eventually makes definitions possible.

Step 2 is to select two fixed points at opposite ends of the scale of measurement. The zero and upper points need not be at the extreme of change, although this is an advantage.

Step 3 is to find some consistent means of subdividing the distance between the fixed points. Preferably the subdivisions should be equal; however, this desirable equality may be impossible to achieve in the early stage.

Step 4 is to determine the validity and repeatability of the resulting measurements. The scale is valid when it measures what it is intended to measure. If one has not been able to define in advance the quality to be measured, validity may be determined only by prolonged experience with the scale. Ideally, the results should be checked against a valid independent criterion; but it is when such criteria are absent that the measuring instruments are so much needed. Repeatability is more practical to determine. All properly trained users of the scale should obtain comparable results when measuring the same subjects. The same user should obtain similar results when measuring an unchanged subject at different times.

THE PRESENT SITUATION IN PUBLIC HEALTH

How far have we gone in developing effective measures in public health? At present, public health measurements are based on crude scales used on individuals making up the public. Thus, death rates are based on classifying each person as to whether he is living or dead at the end of the year; this forms a crude nominal scale applied to each individual. The more sophisticated use of this procedure results in age-, disease-, and other specific death rates. A further development is Swaroop’s Proportional Mortality Ratio [5], which is the percentage of all deaths occurring at age 50 and over; this ratio depends on classifying each death as above or below age 50, a very crude ordinal scale.

Morbidity rates are based on classifying each living individual as to whether or not he is diseased. Disease-specific rates and other more sophisticated developments also depend on this relatively crude nominal scale. However, classifying living individuals as “diseased” or “well” is less reliable than the living-dead classification. Thus, morbidity rates are less comparable than death rates for different areas.

Obviously, the effectiveness of public health programs must continue to be measured by their beneficial effect on those served. Nevertheless, further marked reductions in death or disease incidence rates are becoming more difficult to achieve. For instance, in certain areas of the developed countries, it is no longer possible to show that maternal and child health programs are increasing their effectiveness by falling infant mortality rates [5]. Clearly, new and more sensitive measures of benefit must be developed to help show the value of future programs.

POSSIBLE FUTURE DEVELOPMENTS

One area of improvement will be the creation of more ingenious ways of handling disease and death figures. Developments in this area, however, will continue to bear the burden of the crude measurements made on individuals making up the study populations.
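
How the familiar rates rest on crude classifications of individuals can be made concrete with a short Python sketch. It is an added illustration under stated assumptions: the record layout `Person`, the function names, and the population figures are all hypothetical and invented for the example; only the definitions (deaths per 1,000 persons, and the percentage of all deaths occurring at age 50 and over) follow the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    """One member of the study population (hypothetical record layout)."""
    alive_at_year_end: bool
    age_at_death: Optional[int] = None  # None while the person is living


def crude_death_rate(population: List[Person]) -> float:
    """Deaths per 1,000 persons, from the living/dead nominal classification."""
    deaths = sum(1 for p in population if not p.alive_at_year_end)
    return 1000.0 * deaths / len(population)


def proportional_mortality_ratio(population: List[Person]) -> float:
    """Percentage of all deaths occurring at age 50 and over."""
    ages = [p.age_at_death for p in population
            if not p.alive_at_year_end and p.age_at_death is not None]
    if not ages:
        raise ValueError("no deaths with a recorded age")
    return 100.0 * sum(1 for a in ages if a >= 50) / len(ages)


# Invented illustrative data: 996 survivors and 4 deaths at the ages shown.
population = [Person(True)] * 996 + [Person(False, age) for age in (3, 47, 68, 81)]

print(f"crude death rate: {crude_death_rate(population):.1f} per 1,000")
print(f"proportional mortality ratio: {proportional_mortality_ratio(population):.1f} per cent")
```

Both summaries discard everything about the individual except one or two class memberships, which is the limitation the paragraph above describes.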
The more significant improvements may occur in methods of measuring the health status of the individual. At present, individuals are measured for public health purposes on a simple ordinal scale: healthy, diseased, dead. The most immediate new development may be the addition of a fourth class, to form a healthy, asymptomatically diseased, symptomatically diseased, dead scale; the asymptomatically diseased would be detected by multiple screening procedures.

The second new development in individual health measurement may be the evolution of valid methods of measuring severity of disease or disability, so that the “symptomatically diseased” class will be subdivided into groups of increasing severity. Rudimentary scales already exist for some disease groups, such as the previously mentioned New York Heart Association classes for cardiac patients [6], but much improvement is needed. Such scales of severity will be particularly useful for rehabilitation programs, in which the patient remains diseased but his disability may be reduced.

We have been studying one index, used in the Maryland Chronic Disease Hospitals, to describe the disability status of patients whose disease interferes with independent movement [7]. The Maryland Disability Index gives points for eleven activities of daily living. The instructions to help standardize the scoring procedure clearly define the conditions under which each score is used. The patient who lies completely helpless in bed scores 0 points; one able to deal completely with all usual activities scores a total of 100 points; while patients at intermediate stages of recovery gain scores in between these extremes. Thus, the lower the score, the more severe is the disability.

TABLE 1. IMPROVEMENT IN DISABILITY INDEX SCORE OF CONSECUTIVE STROKE PATIENTS DISCHARGED WITH THE CLINICAL IMPRESSION OF IMPROVED OR UNCHANGED

Improvement in       Total          Physician’s impression
disability score     discharged     Improved     Unchanged
All improvements     194            168          26
0                     34             20          14
5                      6              3           3
10                    12             10           2
15                    12             10           2
20                    15             12           3
25                    16             14           2
30+                   99             99           0

Since there is no valid, independent means of measuring disability, we cannot immediately establish the validity of this procedure. However, two phenomena can be tested. First, the more a patient’s score rises, the more often should he be assessed as “clinically improved” by examining physicians. In Table 1, we compare the improvement in disability score in 194 stroke patients, discharged as “improved” or “unchanged” from Montebello Chronic Disease Hospital in Baltimore, with the discharging physician’s independent assessment of the patients. Disability scores correlate fairly well with the clinical impressions. When the score improves by 30 or more points, no patient is classified as “unchanged”. We shall shortly explain why the correlation is not perfect.
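
The agreement between score gain and clinical impression can be summarized as the percentage of patients judged “improved” at each level of gain. The Python sketch below is an added illustration using the counts as read from Table 1; the variable and function names are hypothetical, and the percentages are a convenience for the reader rather than figures reported in the study.

```python
# Counts as read from Table 1: score gain -> (judged "improved", judged "unchanged").
table_1 = {
    "0": (20, 14),
    "5": (3, 3),
    "10": (10, 2),
    "15": (10, 2),
    "20": (12, 3),
    "25": (14, 2),
    "30+": (99, 0),
}


def percent_improved(improved: int, unchanged: int) -> float:
    """Percentage of patients at one gain level whom the physician judged improved."""
    return 100.0 * improved / (improved + unchanged)


for gain, (improved, unchanged) in table_1.items():
    total = improved + unchanged
    share = percent_improved(improved, unchanged)
    print(f"gain {gain:>3}: {improved:3d} of {total:3d} judged improved ({share:5.1f} per cent)")

# Column totals, for checking against the "All improvements" row of the table.
total_improved = sum(improved for improved, _ in table_1.values())
total_unchanged = sum(unchanged for _, unchanged in table_1.values())
print(f"totals: {total_improved} improved, {total_unchanged} unchanged")
```

In these figures the proportion judged improved is lowest at gains of 0 or 5 points and reaches 100 per cent at gains of 30 points or more.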
TABLE 2. TWO YEAR MORTALITY AFTER ADMISSION OF CONSECUTIVE STROKE PATIENTS, BY DISABILITY INDEX SCORE ON ADMISSION*

Disability score     Total admitted*     Number died     Deaths per 1,000
All scores           251                 64              255
0-20                  47                 24              511
25-45                 81                 22              272
50-70                 76                 11              145
75-100                47                  7              149

*Omitted from this table: 43 patients from whom a disability index was not available, usually because of early death.
The second phenomenon, which is simply tested, is that death rates increase as disease and disability become more severe. This has been well established for the New York Heart Association Classes, where the Class I, non-disabled patients have death rates much lower than Class IV, severely disabled cases [8]. Table 2 compares the death rates for 251 successive stroke patients, admitted to Montebello Chronic Disease Hospital in Baltimore, with their disability score on admission. This table shows that patients with scores of 45 or less have death rates significantly higher than those scoring 50 or more.

This disability scoring method has two known defects. Since it is based entirely on physical activities, patients can show considerable physical improvement, but be classified clinically as “unimproved” because of a deterioration in mental or other aspects. Seven patients in Table 1, for example, increased by 15 points or more but were regarded as “unchanged” by their discharging physicians. Furthermore, since this procedure takes no account of the presence or absence of aphasia, right-sided hemiplegic patients obtain the same score as left-sided, although the former are usually more severely disabled. We must overcome these and other defects before the index can be suggested for widespread use. Even in its present crude state, however, it shows considerable promise, and being an ordinal scale, is an improvement over the current nominal categories of “greatly improved”, “slightly improved”, and so on.
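
The death rates in Table 2, and the contrast between patients scoring 45 or less and those scoring 50 or more, can be recomputed from the counts. The Python sketch below is an added illustration under two labeled assumptions. First, the 50-70 group is taken as 76 admissions, the figure consistent with the printed rate of 145 per 1,000 and with the total of 251. Second, the text does not say which significance test was used, so the Pearson chi-square shown here is only one plausible choice; the function names are hypothetical.

```python
# Counts from Table 2: score band -> (admitted, died within two years).
# Assumption: 76 admissions in the 50-70 band (consistent with 145 per 1,000).
table_2 = {
    "0-20": (47, 24),
    "25-45": (81, 22),
    "50-70": (76, 11),
    "75-100": (47, 7),
}


def deaths_per_1000(admitted: int, died: int) -> int:
    """Two-year deaths per 1,000 admissions, rounded to a whole number."""
    return round(1000 * died / admitted)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square, without continuity correction, for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


rates = {band: deaths_per_1000(adm, died) for band, (adm, died) in table_2.items()}
for band, rate in rates.items():
    print(f"score {band:>6}: {rate} deaths per 1,000")

# Pool the bands into "45 or less" and "50 or more", as in the text.
low_admitted = table_2["0-20"][0] + table_2["25-45"][0]
low_died = table_2["0-20"][1] + table_2["25-45"][1]
high_admitted = table_2["50-70"][0] + table_2["75-100"][0]
high_died = table_2["50-70"][1] + table_2["75-100"][1]

chi2 = chi_square_2x2(low_died, low_admitted - low_died,
                      high_died, high_admitted - high_died)
print(f"chi-square = {chi2:.2f} on 1 degree of freedom "
      f"(3.84 is the 5 per cent critical value)")
```

Under these assumptions the statistic lies well above the 5 per cent critical value, which agrees with the statement that the lower-scoring patients have significantly higher death rates.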
CONCLUSIONS
Newly graduated physicians, when supplied with information on blood glucose levels, will find more new cases of diabetes mellitus than experienced physicians working without such measurements. Thus, medical and public health practitioners will perform more effectively when supplied with sensitive and valid measuring devices. Such new measuring scales need not aim at perfection from the beginning; from the history of measurement in other fields of science, we have seen that crude instruments help greatly in defining the problems to be faced. Improvements in ways of measuring the health status of the individual provide the most promising field for future development. The performance of one ordinal scale for measuring disability has been briefly described. With further refinement, such scales will increase the scientific basis for medical and public health practice.
REFERENCES

1. PERRY, J.: The Story of Standards. Funk and Wagnalls, New York, 1955.
2. STEVENS, S. S.: Measurement, Psychophysics and Utility. Pp. 18-33. In Measurement: Definitions and Theories. (CHURCHMAN, C. W. and RATOOSH, P., Editors.) John Wiley, New York, 1959.
3. CROMBIE, A. C.: Medieval and Early Modern Medicine. Doubleday, New York, 1959.
4. ROLLER, D.: The Early Development of the Concepts of Temperature and Heat. Harvard University Press, Cambridge, Mass., 1950.
5. SWAROOP, S.: Introduction to Health Statistics. Williams and Wilkins, Baltimore, Md., 1960.
6. Criteria Committee of the New York Heart Association, Inc.: Nomenclature and Criteria for Diagnosis of Diseases of the Heart and Blood Vessels. New York Heart Association, New York, 1953 (First Edition).
7. Mimeographed criteria for scoring this index may be obtained from the author.
8. (a) FORD, A. M.: Rehabilitation of the Cardiac Patients, Minn. Med. 42, 1203, 1959.
   (b) TURELL, D. J. and HELLERSTEIN, H. R.: Evaluation of Cardiac Function in Relation to Specific Physical Activities following Recovery from Acute Myocardial Infarction. Progr. in Cardiovasc. Dis. 1, 237, 1958.