Clinical Data Representation in Multidimensional Space*
A number of measured and binary categorical variables were observed for several days in a group of patients hospitalized for suspected myocardial infarction: temperature, systolic and diastolic blood pressure, pulse, respiratory rate, P-R interval (electrocardiogram), white blood count, serum enzyme levels (LDH, SGOT, SGPT), along with the presence or absence of chest pain, abnormal rhythm, ventricular gallop, rales, cardiac arrest, external cardiac pacing, assisted respiration, and the administration of digitalis, diuretics, antiarrhythmic agents, and vasopressors. Volumes of data of this magnitude are often not comprehensible in graphical or numerical form. In order to aid in the compression and interpretation of the data, each patient on a given day was represented as a single point in a multidimensional space. Computed distances between points are measures of clinical dissimilarity. Trajectories in the space are indicative of the clinical course of the patient’s illness. A computer program was used to connect points to their nearest neighbors so as to yield a “minimum coverage tree,” of which the connections, branches, etc., provide information concerning relationships between points. A two-dimensional graphical representation of the points was generated, locating the points by minimizing the sum of the squared differences between n-dimensional and two-dimensional squared distances. The presence of an important nonobserved variable may be signaled by a long series of minimum coverage connections between two apparent neighbors. These multidimensional spatial techniques offer promise of usefulness in a variety of other types of clinical research studies.
* Supported in part by a National Heart Institute contract (PH-43-67-1440) and three USPHS grants (HE-11309-03, HE-07563-07 and GM-16725-01). Some computation time was supported by Duke University under a computation grant (AD-0576).
Rather frequently in clinical investigation large groups of patients are observed with respect to multiple variables. If an important objective of the research is to classify the patients into two or more subgroups on the basis of the values for the observed variables, some criterion must be employed by the investigator for defining the subgroups. No standard criterion exists for classifying patients, and usually the grouping is made on the basis of one or two variables of special interest to the investigator. Intuitively, it would seem reasonable to expect that individual patients should be classifiable by means of some “natural” or inherent qualities, which would reveal that certain individuals tend to be more similar to certain others, while unlike still others. Although no classification scheme which is completely natural in this sense is generally accepted, the purpose of the present report is to outline an approach which does tend to group together patients with similar responses and to separate those which respond differently from one another. The resultant grouping is made solely on the basis of patients’ responses to observed variables, and is, to a certain extent, independent of any preconceived notions of the investigators.
The proposed approach was developed in connection with the analysis of a very large amount of data generated by the Myocardial Infarction Research Unit of the Duke University Medical Center. Although it is immediately apparent that the technique can also be applied in a wide variety of other situations, the data from the myocardial infarction studies will be used to illustrate the application of the method. The technique appears to provide a powerful tool for examining the interrelationships among individual patients. It may in the future yield a means of predicting the clinical trend of a given patient from the transitions through which the patient has gone in the immediate past and from his clinical state at a given point in time.
THE PROBLEM
In the Duke Myocardial Infarction Research Unit (MIRU) patients are admitted to an intensive coronary care unit with presumptive diagnosis of acute myocardial infarction. The first 69 patients who were admitted to the unit form the basis of the present report. A data collection form was employed which provided for the daily entry of 40 different measured and categorical variables. For further study, an arbitrary selection was made of those variables for which there were the smallest numbers of missing observations. In the case of six variables (temperature, pulse rate, respiratory rate, systolic and diastolic blood pressure, and the P-R interval of the electrocardiogram), several readings were available in each eight-hour interval; for each of these variables, the extreme values during each eight-hour interval were recorded (i.e., highest temperature, highest pulse, lowest systolic blood pressure, etc.). Fifteen other variables were recorded once each day (white count, lactic dehydrogenase, serum oxaloacetic and pyruvic transaminase, plus the presence or absence of chest pain, rales, ventricular diastolic gallop, arrhythmia, cardiac arrest, and whether the treatment of the day included electrical pacing, assisted respiration, digitalis, diuretics, vasopressors, or antiarrhythmic agents). The selection omitted some variables of considerable interest (e.g., plasma catecholamines, renin, growth hormones, etc.) [...]
[...] If no data had been missing, the theoretical maximum number of recorded responses for the 73 admissions would have been 14,454. [...]
THE ANALYTIC TECHNIQUE
Each patient at a given point in time was represented as a point in a multidimensional space. One dimension was allocated for each variable
included in the study. It is not possible or necessary to try to visualize such a higher dimensional space in order to understand the concept of representing patients by points in a higher space. However, it is worth commenting that, just as we can compute the distance between two points in a two-dimensional plane or in a three-dimensional space, we can compute distances in any higher dimensional space. The distance in an n-dimensional Euclidean space between two points whose coordinates are $(x_{11}, x_{12}, x_{13}, \ldots, x_{1n})$ and $(x_{21}, x_{22}, x_{23}, \ldots, x_{2n})$ can be calculated from the well-known formula:

$$D = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + (x_{13} - x_{23})^2 + \cdots + (x_{1n} - x_{2n})^2} \tag{1}$$

This distance is a number which tends to be small if the values of the corresponding coordinates of the two points are similar in magnitude, and large if the points differ significantly in several coordinates. Therefore, the computed distance between two points can be used as a measure of the similarity or dissimilarity between the clinical states of two patients at two points in time. If a series of points represents the sequence of clinical states of a single patient in time, the path or trajectory connecting the points is indicative of the patient’s clinical course.
Usually, the different variables being studied will be measured in different units and will be characterized by widely differing means and ranges. If each variable is initially transformed by dividing its value by the standard deviation of that variable multiplied by the square root of two, the resultant transformed variable will be dimensionless. Because the expected value of the squared difference between two values of any one of the untransformed variables is equal to twice its variance, the expected value of the squared difference between two values of a transformed variable is unity. When several squared differences between variables, transformed in this manner, are summed in the computation of a distance between two points, the average contribution from each of the component variables to the distance will be the same. If n different variables are observed and an n-dimensional distance is to be computed, its expected value will be equal to $\sqrt{n}$.† No special problem arises in reference to the binary categorical variables, for which raw scores of zero and one can be assigned and the same transformation applied.
Using formula (1), above, on the transformed data, distances can be computed between each point and all other available points in the n-dimensional space. Next, these distances are ranked in order.
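The scaling and ranking steps just described can be illustrated with a minimal sketch. The function names, the choice of the sample standard deviation, and the four made-up rows (temperature, pulse, chest pain coded 0/1) are assumptions for illustration only; they are not MIRU data and do not reproduce the authors' FORTRAN program.

```python
import math
import statistics
from itertools import combinations


def scale_columns(rows):
    """Divide every variable by its standard deviation times sqrt(2).

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the text does not say which estimate the authors chose.
    """
    columns = list(zip(*rows))
    divisors = [statistics.stdev(col) * math.sqrt(2) for col in columns]
    return [[value / d for value, d in zip(row, divisors)] for row in rows]


def distance(p, q):
    """Formula (1): Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ranked_distances(points):
    """All pairwise distances, smallest first, as (distance, i, j)."""
    pairs = combinations(range(len(points)), 2)
    return sorted((distance(points[i], points[j]), i, j) for i, j in pairs)


# Hypothetical patient-days: temperature, pulse, chest pain (0 = absent, 1 = present).
rows = [
    [98.6, 72, 0],
    [101.2, 96, 1],
    [99.4, 80, 0],
    [100.1, 110, 1],
]
scaled = scale_columns(rows)
ranked = ranked_distances(scaled)
mean_squared = sum(d * d for d, _, _ in ranked) / len(ranked)
print(round(mean_squared, 6))  # average squared distance over all pairs
```

With the sample standard deviation, the average squared difference between two values of a transformed variable, taken over all distinct pairs, is exactly one, so the average squared distance printed here equals the number of variables (three in this toy example) up to rounding.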
† More correctly, the statement in the text is only approximately true, because $E(\sqrt{x}) \neq \sqrt{E(x)}$ for a random variable $x$. The expected value of the square of the distance will be equal to $n$.
A minimum coverage algorithm is employed to connect each point to at least one nearest neighbor, and then to construct a branching network which ties together all of the available points. If only points are connected which have not in some previous step been connected to the same network, the final network can also be described as a branching “tree.” It is not really possible to display accurately an n-dimensional network on a two-dimensional surface, but by stretching, bending, and twisting the arcs connecting adjacent points, it is usually possible to locate the points on a plane, so that most of the near neighbors of each point are close to it, while points which are not its neighbors tend to be farther away. The resulting “road map” is at best an approximation, but it may contain a surprising amount of useful information in a highly compressed form.
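The tree construction described above resembles what is now usually called a minimum spanning tree. The following minimal sketch treats the “minimum coverage” procedure as Kruskal's algorithm applied to the ranked distances; that identification, the function name, and the toy points are assumptions of this sketch, not the authors' program.

```python
import math
from itertools import combinations


def minimum_coverage_tree(points):
    """Join points by the shortest links that never close a loop.

    Returns a list of (i, j, distance) connections; for n points that are
    all distinct components at the start, n - 1 connections result.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest: one network per point

    def find(i):
        # Follow parents to the representative of i's network.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    ranked = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    tree = []
    for d, i, j in ranked:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # not yet connected to the same network
            parent[root_i] = root_j
            tree.append((i, j, d))
    return tree


# Hypothetical points on a line, for illustration only.
for i, j, d in minimum_coverage_tree([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]):
    print(i, j, d)
```

Skipping any pair whose two points already belong to the same network is what keeps the result a tree rather than a general graph.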
There are numerous more or less satisfactory ways to construct a planar graph from an n-dimensional tree, but care must be taken to avoid a placement of the points on the plane which reflects to a considerable extent the investigator’s preconceived ideas of existing similarities among patients. In such a situation, any interpretation of the relationships found in the map would be of limited usefulness due to investigator bias. To avoid this criticism, and also to assist in the arduous task of constructing planar maps, a FORTRAN program has been designed which generates a planar map from the raw clinical data. A least squares criterion was used for locating the points on the map: x and y coordinates were found to minimize the following sum of squares, that is, the sum of the squared differences between the n-dimensional and two-dimensional squared distances:

$$\sum_{i<j} \left[ D_{ij}^2 - (x_i - x_j)^2 - (y_i - y_j)^2 \right]^2$$

where $D_{ij}^2$ represents the squared n-dimensional distance between the i-th and j-th points. To accomplish the least squares solution for multiple pairs of x and y coordinates, an iterative (Gauss-Seidel) version of Newton’s method was employed. Examination of the planar maps constructed in this way shows a highly satisfactory aggregation of points which are near neighbors in the multidimensional space.
DATA ANALYSIS
Except for the vital signs and the P-R interval, the variables selected for analysis were recorded only once each day for the MIRU patients on given days, rather than at given points in time. This distinction is a significant feature of this particular study which limits our ability to interpret continuous changes with time in the clinical states of patients. Mathematically, this is equivalent to saying that the model is discrete with respect to time. (In many studies, complete sets of data are not obtainable at single instants of time, but when instantaneous sets of data are available, there would be no theoretical objection to considering time as a continuous variable.) Rather than select a single value for each of the vital sign and P-R interval variables, we treated each individual eight-hour observation as a value for a different variable. However, the contribution of each individual eight-hour observation to the distance was diminished to one-third by dividing the squared differences for the corresponding variables by three. By this device, each of the 21 distinct variables designated above was weighted equally in computing the distances. Because three values were associated with each of the first six variables, a single point in a 33-dimensional space was used to represent each “patient-day” (18 dimensions for the vital signs and P-R interval and 15 dimensions for the white count, enzymes, and the remaining categorical variables).
When interrelationships among multiple variables are of interest, as in the case of the myocardial infarction studies described here, the missing data problem is an important and troublesome one. In the initial description of the application of the technique proposed here, it will probably be advantageous to avoid, wherever possible, complications which result primarily from imperfect data collection. For this reason, the whole problem of missing data has been sidestepped by omitting from the analysis any incomplete sets of data (i.e., any patient-days for which less than 33 items of data are available). When this very demanding requirement was applied to the data from 73 admissions of 69 MIRU patients, 21 patients were immediately eliminated from consideration, as they did not yield even a single complete set of data. The remaining 48 patients (52 admissions) provided for the analysis 137 complete patient-days, representing a total of 4,521 responses (about one-third of the theoretical maximum mentioned above for 73 admissions). The distances between all pairs of the 137 available points were computed and ranked in order with the help of a FORTRAN program (there are 9,316 such distances). Using the least squares criterion described above, a two-dimensional planar graph was generated (Fig. 1).‡ To each point there was assigned an abbreviated four-character name.
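As an aside on the computation just described, the equal weighting and the least squares placement can be sketched as follows. Plain gradient descent with step halving stands in here for the authors' iterative (Gauss-Seidel) version of Newton's method, which is not reproduced; the function names, the starting step size, and the four toy points are likewise assumptions for illustration.

```python
import random


def weighted_squared_distance(p, q, weights):
    """Squared distance with a weight on each squared difference."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(p, q, weights))


def criterion(coords, d2):
    """Sum over pairs of (D2_ij - squared planar distance)^2."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            total += (d2[i][j] - dx * dx - dy * dy) ** 2
    return total


def fit_planar_map(d2, iterations=500, step=0.05, seed=0):
    """Place points on a plane by gradient descent with step halving.

    d2 is a full symmetric matrix of squared n-dimensional distances.
    Returns the planar coordinates and the final criterion value.
    """
    rng = random.Random(seed)
    n = len(d2)
    coords = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    best = criterion(coords, d2)
    for _ in range(iterations):
        grads = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = coords[i][0] - coords[j][0]
                dy = coords[i][1] - coords[j][1]
                err = d2[i][j] - dx * dx - dy * dy
                gx -= 4.0 * err * dx  # partial derivative with respect to x_i
                gy -= 4.0 * err * dy  # partial derivative with respect to y_i
            grads.append((gx, gy))
        trial = [[x - step * gx, y - step * gy]
                 for (x, y), (gx, gy) in zip(coords, grads)]
        value = criterion(trial, d2)
        if value < best:
            coords, best = trial, value
        else:
            step /= 2.0  # overshoot: retry with a smaller step
    return coords, best


# Weights for the 33 coordinates: 18 eight-hour readings at 1/3, 15 daily items at 1.
weights = [1.0 / 3.0] * 18 + [1.0] * 15

# Hypothetical points (not MIRU data), compared here with unit weights.
toy_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
toy_d2 = [[weighted_squared_distance(p, q, (1, 1, 1)) for q in toy_points]
          for p in toy_points]
coords, value = fit_planar_map(toy_d2)
print(round(sum(weights), 6), len(coords))
```

The weights sum to 21, one unit for each of the 21 distinct variables, and the counts quoted in the text are mutually consistent: 137 × 33 = 4,521 responses and 137 × 136 / 2 = 9,316 distances. A trial step is accepted only if it lowers the criterion, so the value never increases, although a local rather than global minimum may be reached.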
The first and second characters of the name represented the initials of the patient’s first and last names; in the third character position was placed a 1 or a 2 for the number of hospitalization for those few patients who were admitted twice during the period of study (otherwise, the second letter of the patient’s last name was inserted in that position); the fourth character position was used for the day of hospitalization. For example, “HO15” would refer to the fifth day of patient H.O.’s first hospital admission. When two or more points were superimposed upon one another, a footnote (e.g., “12*”) was substituted for the character name on the graph, to be decoded in another part of the computer output. Even though Fig. 1 represents the data from only 52 patients and the inclusion of a larger sample of points would further complicate the picture, the set of points has already resulted in an elaborate map. However, the figure is considerably easier to examine than are the original 4,521 items of data from which it was derived.
‡ A planar map derived from these data was presented previously in the Proceedings of the DECUS Symposium, May, 1969. Differences between that map and Fig. 1 are attributed to the more recent finding of significant errors in the input data.
FIG. 1. Two-dimensional representation of 33-dimensional data. In order to provide a larger map, the computer print-out utilized two sheets of wide paper which are spliced together. For display purposes, ellipses were drawn around each 4-character label on the computer-generated map. For greater detail, see Fig. 6 below.
The minimum coverage algorithm was utilized to indicate the nearest neighbor of each point and then to identify successively the shortest possible connections between pairs of points until all points were incorporated in a single network. Figure 2 shows the result of adding the minimum coverage connections to the planar map of Fig. 1. The complexity of the multidimensional interrelationships is immediately evident from this figure. It would, indeed, be possible to display on a plane a much less intricate representation of the minimum coverage tree without crossing lines, but such a representation would completely obscure the spatial relationships other than those between adjacent closest neighbors; by such a device, clusters or aggregations of points would become widely dispersed over remote portions of the tree.
FIG. 2. Minimum coverage graph. To the same points shown in Fig. 1 have been added the minimum coverage connections. Points close to one another in the two-dimensional projection need not necessarily be neighbors in 33-D. This type of diagram adds some information concerning the higher dimensional relationships.
In Figures 3(a)-3(d) are shown the distributions on the planar map of four abnormal findings: the presence of chest pain, lactic dehydrogenase values in excess of 400 units, a ventricular “gallop” sound heard on auscultation, and a systolic blood pressure of 80 or below (“shock”) at any time during a twenty-four hour period. Figures 3(b)-3(d) show a clear tendency for abnormal values of the variables to be found in identifiable regions of the map. The very elevated lactic dehydrogenase values are located at the left on the map, the gallops at the lower right, and the low blood pressure values at the lower left, respectively. Since each figure contains information about only one of 21 different variables and a two-dimensional plot is used to locate the points from a 33-dimensional space, it is interesting that these identifiable regions of the map can be found. Those variables which show the closest correlation with other observed variables would be expected to show the greatest tendency to separate into distinct regions, and vice versa. For example, the presence of a ventricular gallop [Fig. 3(c)] was significantly correlated with the presence of rales (r = .59) and with the administration of diuretics (r = .27).
FIG. 3(a-d). Disposition of individual abnormalities on the two-dimensional map. See text.
The 137 points shown in Figs. 1-3 form a highly branching network, which suggests a continuous and complex spectrum of clinical states, rather than several easily identifiable and distinct states. There is a group of points in the upper left portion of the diagram which are all very close to one another. From Fig. 3(a) it can be seen that many of these points are characterized by the presence of chest pain. The closeness of these points is especially interesting when it is remembered that chest pain is only one of 21 equally-weighted variables which determine the positions of these points. In general, the sicker patients were represented by points in the lower portion of the map, especially toward the periphery. It seems unlikely that further similar data will result in the appearance of several easily distinguishable subgroups of clinical states, but it does seem reasonable to anticipate that further experience may permit us to divide the space into several contiguous regions of differing significance. The path of a given patient through the space with time may take him through one or more important transitions from one region to another. If certain pathways through the space are characterized by heavy traffic, and if movement along one of them is of special prognostic significance, then identification of these patients would be especially important.
The data at hand do include instances of several points on subsequent days from the same patients. Figure 4 shows the points in the same locations as in the previous figures, but the interconnections in this diagram are directional arrows between points from the same patients at different times. The more usual tendency of patients with this illness to show clinical improvement is reflected in this figure by a predominance of arrows pointing from the lower and more peripheral regions of the map to the more “normal” region of densely aggregated points in the upper left portion of the map.
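The directional arrows of Fig. 4 can be derived from the four-character names alone. The following is a minimal sketch under stated assumptions: the first three characters are taken to identify one patient admission, the fourth is taken to be a single-digit hospital day, and the labels in the example are made up.

```python
from collections import defaultdict


def trajectories(labels):
    """Return (earlier, later) label pairs for consecutive recorded days.

    Assumption: label[:3] identifies a patient admission and label[3] is a
    single-digit day of hospitalization, as in "HO15".
    """
    by_admission = defaultdict(list)
    for label in labels:
        by_admission[label[:3]].append((int(label[3]), label))
    arrows = []
    for key in sorted(by_admission):
        days = sorted(by_admission[key])
        # Link each recorded day to the next recorded day of the same admission.
        arrows.extend((a[1], b[1]) for a, b in zip(days, days[1:]))
    return arrows


# Hypothetical labels, for illustration only.
for earlier, later in trajectories(["HO15", "HO13", "HO14", "ABC2", "ABC5"]):
    print(earlier, "->", later)
```

Days with incomplete data simply do not appear among the labels, so an arrow may span more than one calendar day, as with the second hypothetical patient here.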
FIG. 4. Changes in clinical status with time. The sparse trajectory data shown here suggest a tendency for paths to converge in the dense cluster at the upper left, while avoiding a central zone in which there are almost no points to be found. See text.
DISCUSSION
The representation of observations on multiple variables as points in multidimensional space is, from a mathematical standpoint, a standard and well-known device. It is, therefore, not surprising to find that the distance formula (1) has been applied repeatedly by scientists to their data in the past. Since Karl Pearson used such a formula to develop his “coefficient of racial likeness” in 1926 (1), there were occasional scattered references to the use of distance measures (Ref. 1, pp. 284ff). But in the 1950’s, several psychologists (4-8) and biologists (2, 3) began to emphasize their use. Sokal and Sneath (2) make distance measures the basis of “numerical taxonomy.” The techniques developed by these workers were applied to a group of cardiac patients in 1966 by Manning and Watson (9) and in 1969 by Neurath et al. (10) in the preoperative assessment of patients undergoing pelvic surgery. Distance measures as an adjunct to cluster analysis have also been used by Bonner (11) in connection with medical research data, but the techniques have, nevertheless, been employed infrequently in clinical investigations to date.
Feinstein (12) has shown how one can select several variables of interest and, by means of overlapping circles in a Venn diagram, summarize schematically the frequency of patterns of findings in a given disease entity. There are four characteristics of this type of grouping procedure which can result in important losses of information. (1) Each patient must be characterized as belonging to some discrete subgroup, whereas he might more reasonably be described as in a transitional group or as a unique “outlyer.” (2) Similarities and differences between individual patients become obscured as soon as they are categorized as belonging to the same or different groups. (3) Unless a Venn diagram is designed in which a circle is used for virtually every available measured variable (which tends to make an unduly intricate Venn diagram), these techniques result in groupings which strongly reflect a bias inherent in the investigator’s selection of the “major variables of interest.” (4) This type of grouping of patients does not lend itself well to demonstrating changes in a patient’s responses with time.
The approach described here differs markedly from the approach of classical univariate or multivariate statistics. In statistics, individual measurements are lost sight of in favor of estimated means and variances of samples of populations, whereas in the present approach the identity of each point (i.e., each observation vector) and its relationship to all other points are preserved and are central to the analysis. In situations where there is no generally accepted criterion for classifying sets of data into subgroups, there is a considerable advantage in being able to discern the interrelationships among the individual items of the data (the so-called “structure” of the data) without having first to divide the data into arbitrarily chosen subgroups. This is likely to be the situation in studies where experimental data are collected with the primary aim to provide objective information about an insufficiently understood disease entity, experimental system, or population, rather than to test one or more specific scientific hypotheses. In this type of investigation, the researcher expects to be able to examine some of his data before formulating detailed questions about his system. The MIRU studies described above are an excellent example of this sort of investigation.
There are differences between the approach proposed here and so-called “cluster analysis.” If clustering techniques are applied to data of the sort described here, clustering algorithms will permit one to identify any clusters which may be present. These techniques are not too helpful, however, when applied to a series of points such as those in Fig. 2, where a highly branched structure, rather than a group of several clusters, provides a more realistic description of the data. The minimum coverage algorithm used here should be expected to permit detection of clusters if they are present (such as, for example, the group of points at the upper left of Fig. 2, characterized by chest pain with no other complicating abnormal findings), but it also brings out relationships between individual points even when no clustering is present. Most significantly, we visualize patients as continuously changing in a variety of ways with time, rather than “jumping” from one to another of a few discrete clinical states.
Although the use of standard scores for each of the multiple variables eliminates the problem of combining, in a single distance formula, quantities measured in different units, it leaves unsolved the more complex problem of assigning weights to the contributions from each of the different variables. When observations are made on multiple variables, some of them are likely to be highly correlated with one another. In a sense, the information contributed by two highly correlated variables would be the same as that contributed by either of the variables alone (for example, we would expect highly significant overlaps in the information provided by the systolic and diastolic blood pressures or by the lactic dehydrogenase and the glutamic oxaloacetic transaminase levels). In the present study, no attempt was made to weight each variable in proportion to the amount of independent information provided by that variable. In discussing the problem, Sokal and Sneath (2) recommend that, at least in taxonomy, equal weights be used for all variables. Overall (8) suggested the use of Mahalanobis’ generalized distance formula to take into account correlations among variables, but his argument for the use of this formula has been criticized (6). Others have suggested that the techniques of factor analysis or principal-components analysis be applied initially to the raw data, and that the “factors” be used to determine the dimensions of a space and the location of the points in that space. Experience with these methods of handling the problem of correlated variables has not been [...]
FIG. 5(a-b). Squashing as a result of missing dimensions. In Fig. 5(a), and also again in the background of Fig. 5(b), is drawn in two dimensions a hypothetical circuit of points open at the upper left. This would seem to imply similarity between the points opposite one another at the open ends of the ring. Fig. 5(b) shows the same points drawn in a three-dimensional space, where information about a third variable is included. In the second figure, the apparent “circuit” has disappeared, and the points which seemed close to one another in Fig. 5(a) are now found to lie maximally far apart. See text.
FIG. 6. An open “circuit” found in the myocardial infarction map. The connecting lines shown here are extracted from the minimum coverage graph of Fig. 2. The apparent circuit may represent a real example of the phenomenon illustrated diagrammatically in Fig. 5.
[...] between two points, WB03 to EC04, which are fairly near neighbors of each other in 33-D (distance = 2.43 units). This figure was obtained from an enlargement of the central portion of Fig. 1. The raw data associated with the two points, as well as for each of the points making up the circuit between them, are shown in Table 1. Reading the table from left to right corresponds to moving counterclockwise around the circuit from WB03 to EC04. Comparison of this circuit with Fig. 2, on which all of the minimum coverage connections have been drawn, shows that there is, indeed, a central region within the circuit which is virtually devoid of points. In Fig. 4 only one trajectory crosses this central zone. This finding lends support to the idea that this empty area represents a very unlikely clinical state, and that the circuit from WB03 to EC04 probably results from the absence of an important additional dimension. In other words, WB03 and EC04, although similar to one another in terms of the variables observed, would probably differ importantly if other clinical information were available. This is just the situation discussed in connection with Fig. 5.
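The signal described here, two points that are close in 33-D yet joined only by a long chain of minimum coverage connections, can be screened for mechanically. The sketch below is illustrative only: the function names, the thresholds, and the labels A through E are hypothetical, and the tree is assumed to be given as a list of label pairs.

```python
from collections import defaultdict, deque


def tree_path(edges, start, goal):
    """Return the nodes on the tree path from start to goal ([] if none)."""
    neighbors = defaultdict(list)
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    previous = {start: None}
    queue = deque([start])
    while queue:  # breadth-first search outward from start
        node = queue.popleft()
        if node == goal:
            break
        for nxt in neighbors[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if goal not in previous:
        return []
    path = []
    node = goal
    while node is not None:  # walk back from goal to start
        path.append(node)
        node = previous[node]
    return path[::-1]


def suspicious_pairs(edges, distances, max_distance, min_links):
    """Pairs that are close in n dimensions but far apart along the tree."""
    flagged = []
    for (a, b), d in sorted(distances.items()):
        links = len(tree_path(edges, a, b)) - 1
        if d <= max_distance and links >= min_links:
            flagged.append((a, b, links))
    return flagged


# Hypothetical open circuit: A and E are near neighbors joined by four links.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
distances = {("A", "E"): 2.4, ("A", "B"): 1.0}
print(suspicious_pairs(edges, distances, max_distance=2.5, min_links=3))
```

What counts as a short distance or a long chain is a judgment left to the investigator; the two thresholds here are placeholders.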
TABLE 1. Raw data for the points making up the circuit from WB03 to EC04 [values not recoverable]. Rows: systolic blood pressure, diastolic blood pressure, respiratory rate, and P-R interval (each for the three eight-hour intervals 10-6, 6-2, and 2-10), WBC, LDH, SGOT, SGPT, rhythm, rales, gallop, arrest, digitalis, diuretic, antiarrhythmic agent, paced, respiratory assistance, and vasopressors.
SUMMARY AND CONCLUSION
Techniques for quantitating similarity and dissimilarity between individuals have been studied extensively by researchers in taxonomy and psychology, but as yet only rarely by medical investigators. These methods are well suited to the analysis of a large group of patients, where information is available about multiple variables. The interrelationships among individuals and groups of patients can be displayed by means of a two-dimensional road map. These techniques were applied to 21 different variables observed in 52 patients with presumptive diagnosis of myocardial infarction. Because each patient was observed from one to six times on different hospital days, a total of 137 complete sets of data were available. By representing each set of data as a point in a multidimensional space and connecting all points to their nearest neighbors, a road map was devised to indicate which patients were similar to one another. Certain points tended to form tight clusters (especially, patients with chest pain and no other evident abnormalities). It is anticipated that, as more points are added to the map, it will become possible to identify major pathways between important clinical states, and to quantitate the likelihood of a patient’s going from one clinical state to another in time.
REFERENCES
1. PEARSON, K. On the coefficient of racial likeness. Biometrika 18, 105 (1926).
2. [...]
3. ROGERS, D. J., AND TANIMOTO, T. T. A computer program for classifying plants. Science 132, 1115 (1960).
4. CRONBACH, L. J., AND GLESER, G. C. Assessing similarity between profiles. Psychol. Bull. 50, 456 (1953).
5. GLESER, G. C. Quantifying similarity between people. In The Role and Methodology of Classification in Psychiatry and Psychopathology. Proc. of Conference, Washington, 1965. KATZ, M. M., COLE, J. O., AND BARTON, W. E. (Eds.). USPHS Publication No. 1584, pp. 201ff, 1968.
6. HEERMANN, E. F. Comments on Overall’s multivariate methods for profile analysis. Psychol. Bull. 63, 128 (1965).
7. NUNNALLY, J. The analysis of profile data. Psychol. Bull. 59, 311 (1962).
8. OVERALL, J. E. Note on multivariate methods for profile analysis. Psychol. Bull. 61, 195 (1964).
9. MANNING, R. T., AND WATSON, L. Signs, symptoms, and systematics. JAMA 198, 1180 (1966).
10. NEURATH, P. W., ENSLEIN, K., AND MITCHELL, G. W., Jr. Design of a computer system to assist in differential preoperative diagnosis for pelvic surgery. N. Engl. J. Med. 280, 745 (1969).
11. BONNER, R. E. Cluster analysis. Ann. N.Y. Acad. Sci. 128, 973 (1966).
12. FEINSTEIN, A. R. Clinical Judgment. Baltimore, Williams & Wilkins, 1967.