Clinical data representation in multidimensional space


A number of measured and binary categorical variables were observed for several days in a group of patients hospitalized for suspected myocardial infarction: temperature, systolic and diastolic blood pressure, pulse, respiratory rate, P-R interval (electrocardiogram), white blood count, serum enzyme levels (LDH, SGOT, SGPT), along with the presence or absence of chest pain, abnormal rhythm, ventricular gallop, rales, cardiac arrest, external cardiac pacing, assisted respiration, and the administration of digitalis, diuretics, antiarrhythmic agents, and vasopressors. Volumes of data of this magnitude are often not comprehensible in graphical or numerical form. In order to aid in the compression and interpretation of the data, each patient on a given day was represented as a single point in a multidimensional space. Computed distances between points are measures of clinical dissimilarity. Trajectories in the space are indicative of the clinical course of the patient’s illness. A computer program was used to connect points to their nearest neighbors so as to yield a “minimum coverage tree,” of which the connections, branches, etc., provide information concerning relationships between points. A two-dimensional graphical representation of the points was generated, locating the points by minimizing the sum of the squared differences between n-dimensional and two-dimensional squared distances. The presence of an important nonobserved variable may be signaled by a long series of minimum coverage connections between two apparent neighbors. These multidimensional spatial techniques offer promise of usefulness in a variety of other types of clinical research studies.

Rather frequently in clinical investigation large groups of patients are observed with respect to multiple variables. If an important objective of the research is to classify the patients into two or more subgroups on the basis of the values for the observed variables, some criterion must be employed by the investigator for defining the subgroups. No standard criterion exists for classifying patients, and usually the grouping is made on the basis of one or two variables of special interest to the investigator. Intuitively, it would seem reasonable to expect that individual patients should be classifiable by means of some “natural” or inherent qualities, which would reveal that certain individuals tend to be more similar to certain others, while unlike still others. Although no classification scheme which is completely natural in this sense is generally accepted, the purpose of the present report is to outline an approach which does tend to group together patients with similar responses and to separate those which respond differently from one another. The resultant grouping is made solely on the basis of patients’ responses to observed variables, and is, to a certain extent, independent of any preconceived notions of the investigators.

The proposed approach was developed in connection with the analysis of a very large amount of data generated by the Myocardial Infarction Research Unit of the Duke University Medical Center. Although it is immediately apparent that the technique can also be applied in a wide variety of other situations, the data from the myocardial infarction studies will be used to illustrate the application of the method. The technique appears to provide a powerful tool for examining the interrelationships among individual patients. It may in the future yield a means of predicting the clinical trend of a given patient from the transitions through which the patient has gone in the immediate past and from his clinical state at a given point in time.

* Supported in part by a National Heart Institute contract (PH-43-67-1440) and three USPHS grants (HE-11309-03, HE-07563-07 and GM-16725-01). Some computation time was supported by Duke University under a computation grant (AD-0576).

THE PROBLEM

In the Duke Myocardial Infarction Research Unit (MIRU) patients are admitted to an intensive coronary care unit with presumptive diagnosis of acute myocardial infarction. The first 69 patients who were admitted to the unit form the basis of the present report. A data collection form was employed which provided for the daily entry of 40 different measured and categorical variables. For further study, an arbitrary selection was made of those variables for which there were the smallest numbers of missing observations. In the case of six variables (temperature, pulse rate, respiratory rate, systolic and diastolic blood pressure, and the P-R interval of the electrocardiogram), several readings were available in each eight-hour interval; for each of these variables, the extreme values during each eight-hour interval were recorded (i.e., highest temperature, highest pulse, lowest systolic blood pressure, etc.). Fifteen other variables were recorded once each day (white count, lactic dehydrogenase, serum oxaloacetic and pyruvic transaminase, plus the presence or absence of chest pain, rales, ventricular diastolic gallop, arrhythmia, cardiac arrest, and whether the treatment of the day included electrical pacing, assisted respiration, digitalis, diuretics, vasopressors, or antiarrhythmic agents). The selection omitted some

variables of considerable interest (e.g., plasma catecholamines, renin, growth hormones)

[...]

THE ANALYTIC TECHNIQUE

Each patient at a given point in time was represented as a point in a multidimensional space, one dimension being allocated to each variable

included in the study. It is not possible or necessary to try to visualize such a higher dimensional space in order to understand the concept of representing patients by points in a higher space. However, it is worth commenting that, just as we can compute the distance between two points in a two-dimensional plane or in a three-dimensional space, we can compute distances in any higher dimensional space. The distance in an n-dimensional Euclidean space between two points whose coordinates are $(x_{11}, x_{12}, x_{13}, \ldots, x_{1n})$ and $(x_{21}, x_{22}, x_{23}, \ldots, x_{2n})$ can be calculated from the well-known formula:

$$D = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + (x_{13} - x_{23})^2 + \cdots + (x_{1n} - x_{2n})^2}. \tag{1}$$
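As a minimal sketch of formula (1), assuming each patient-day is held as a plain sequence of numbers, the distance can be computed as follows. The function name `euclidean_distance` is hypothetical and is not taken from the original program.

```python
import math
from typing import Sequence


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Distance between two points of equal dimension, as in formula (1)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

The same function works for any number of dimensions, which is the point made in the text: nothing in the formula depends on being able to visualize the space.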

This distance is a number which tends to be small if the values of the corresponding coordinates of the two points are similar in magnitude, and large if the points differ significantly in several coordinates. Therefore, the computed distance between two points can be used as a measure of the similarity or dissimilarity between the clinical states of two patients at two points in time. If a series of points represent the sequence of clinical states of a single patient in time, the path or trajectory connecting the points is indicative of the patient’s clinical course.

Usually, the different variables being studied will be measured in different units and will be characterized by widely differing means and ranges. If each variable is initially transformed by dividing its value by the standard deviation of that variable multiplied by the square root of two, the resultant transformed variable will be dimensionless. Because the expected value of the squared difference between two values of any one of the untransformed variables is equal to twice its variance, the expected value of the squared difference between two values of a transformed variable is unity. When several squared differences between variables, transformed in this manner, are summed in the computation of a distance between two points, the average contribution from each of the component variables to the distance will be the same. If n different variables are observed and an n-dimensional distance is to be computed, its expected value will be equal to $\sqrt{n}$.† No special problem arises in reference to the binary categorical variables, for which raw scores of zero and one can be assigned and the same transformation applied.

Using formula (1), above, on the transformed data, distances can be computed between each point and all other available points in the n-dimensional space. Next, these distances are ranked in order.
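The scaling argument above can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and it assumes the sample variance (divisor n - 1) is used, which the text does not specify. With that choice, the mean squared difference over all pairs of transformed values is exactly one.

```python
import math
from typing import List, Sequence


def scale_variable(values: Sequence[float]) -> List[float]:
    """Divide each value by (standard deviation * sqrt(2)).

    Assumption: the sample variance (divisor n - 1) is used.
    Requires at least two values that are not all equal.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    scale = math.sqrt(2.0 * variance)
    return [v / scale for v in values]


def mean_squared_pair_difference(values: Sequence[float]) -> float:
    """Average of (a - b)^2 over all distinct pairs of values."""
    n = len(values)
    total = sum(
        (values[i] - values[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)
```

The same transformation applies unchanged to a binary variable scored as zero and one, for example `scale_variable([0, 1, 1, 0, 0, 1])`.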
† More correctly, the expected value of the square of the distance will be equal to n. The statement in the text is only approximately true, because $E(\sqrt{x}) \neq \sqrt{E(x)}$ for a random variable x.

A minimum coverage algorithm is employed to connect each point to at least one nearest neighbor, and then to construct a branching network which ties together all of the available points. If only points are connected which have not in some previous step been connected to the same network, the final network can also be described as a branching “tree.”

It is not really possible to display accurately an n-dimensional network on a two-dimensional surface, but by stretching, bending, and twisting the arcs connecting adjacent points, it is usually possible to locate the points on a plane, so that most of the near neighbors of each point are close to it, while points which are not its neighbors tend to be farther away. The resulting “road map” is at best an approximation, but it may contain a surprising amount of useful information in a highly compressed form.
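The construction described above (rank all distances, then accept the shortest connections that join points not already in the same network) reads like what is now usually called a minimum spanning tree. The sketch below rests on that assumption and uses Kruskal's procedure with a simple union-find; it is not the original program, and the name `minimum_coverage_tree` is hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def minimum_coverage_tree(
    points: Sequence[Sequence[float]],
) -> List[Tuple[int, int, float]]:
    """Return tree edges (i, j, distance), shortest connections first."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Follow parent links to the representative of i's network.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Rank all pairwise distances in increasing order.
    ranked = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for d, i, j in ranked:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # not yet connected to the same network
            parent[root_i] = root_j
            tree.append((i, j, d))
    return tree
```

For n points the result has n - 1 connections and no closed loops, which is why the network can be described as a tree.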

There are numerous more or less satisfactory ways to construct a planar graph from an n-dimensional tree, but care must be taken to avoid a placement of the points on the plane which reflects to a considerable extent the investigator’s preconceived ideas of existing similarities among patients. In such a situation, any interpretation of the relationships found in the map would be of limited usefulness due to investigator bias. To avoid this criticism, and also to assist in the arduous task of constructing planar maps, a FORTRAN program has been designed which generates a planar map from the raw clinical data. A least squares criterion was used for locating the points on the map: x and y coordinates were found to minimize the following sum of squares,

$$\sum_{i<j} \left[ D_{ij}^2 - (x_i - x_j)^2 - (y_i - y_j)^2 \right]^2,$$

where $D_{ij}^2$ represents the squared n-dimensional distance between the i-th and j-th points. To accomplish the least squares solution for multiple pairs of x and y coordinates, an iterative (Gauss-Seidel) version of Newton’s method was employed. Examination of the planar maps constructed in this way shows a highly satisfactory aggregation of points which are near neighbors in the multidimensional space.

DATA ANALYSIS
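The least squares placement of points on the planar map, described just before this section, can be sketched in a few lines. This is an illustration under stated assumptions, not the original FORTRAN program: it minimizes the sum of squared differences between the squared n-dimensional distances and the squared planar distances, but by plain gradient descent with step halving rather than by the Gauss-Seidel version of Newton’s method. The names `stress`, `gradient`, and `fit_planar_map` are hypothetical.

```python
from typing import List, Sequence, Tuple

Coords = List[Tuple[float, float]]


def stress(sq_dist: Sequence[Sequence[float]], coords: Coords) -> float:
    """Sum over pairs of (squared n-D distance - squared planar distance)^2."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            total += (sq_dist[i][j] - (dx * dx + dy * dy)) ** 2
    return total


def gradient(sq_dist: Sequence[Sequence[float]], coords: Coords) -> Coords:
    """Partial derivatives of the stress with respect to each x and y."""
    grad = []
    for i in range(len(coords)):
        gx = gy = 0.0
        for j in range(len(coords)):
            if i == j:
                continue
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            residual = sq_dist[i][j] - (dx * dx + dy * dy)
            gx -= 4.0 * residual * dx
            gy -= 4.0 * residual * dy
        grad.append((gx, gy))
    return grad


def fit_planar_map(sq_dist, start: Coords, iterations: int = 200,
                   step: float = 0.01) -> Coords:
    """Descend the stress; a step is accepted only if it lowers the stress."""
    coords = list(start)
    current = stress(sq_dist, coords)
    for _ in range(iterations):
        grad = gradient(sq_dist, coords)
        trial = [(x - step * gx, y - step * gy)
                 for (x, y), (gx, gy) in zip(coords, grad)]
        trial_stress = stress(sq_dist, trial)
        if trial_stress < current:
            coords, current = trial, trial_stress
        else:
            step *= 0.5  # too large a step: halve it and try again
    return coords
```

The matrix `sq_dist` is assumed symmetric. Because a step is accepted only when it lowers the stress, the fitted map is never worse than the starting configuration, although it may settle in a local minimum.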

Except for the vital signs and the P-R interval, the variables selected for analysis were recorded only once each day for the MIRU patients, on given days rather than at given points in time. This distinction is a significant feature of this particular study which limits our ability to interpret continuous changes with time in the clinical states of patients. Mathematically, this is equivalent to saying that the model is discrete with respect to time. (In many studies, complete sets of data are not obtainable at single instants of time, but when instantaneous sets of data are available, there would be no theoretical objection to considering time as a continuous variable.) Rather than select a single value for each of the vital sign and P-R interval variables, we treated each individual eight-hour observation as a value for a different variable. However, the contribution of each individual eight-hour observation to the distance was diminished to one-third by dividing the squared differences for the corresponding variables by three. By this device, each of the 21 distinct variables designated above was weighted equally in computing the distances. Because three values were associated with each of the first six variables, a single point in a 33-dimensional space was used to represent each “patient-day” (18 dimensions for the vital signs and P-R interval and 15 dimensions for the white count, enzymes, and the remaining categorical variables).

When interrelationships among multiple variables are of interest, as in the case of the myocardial infarction studies described here, the missing data problem is an important and troublesome one. In the initial description of the application of the technique proposed here, it will probably be advantageous to avoid, wherever possible, complications which result primarily from imperfect data collection. For this reason, the whole problem of missing data has been sidestepped by omitting from the analysis any incomplete sets of data (i.e., any patient-days for which fewer than 33 items of data are available). When this very demanding requirement was applied to the data from 73 admissions of 69 MIRU patients, 21 patients were immediately eliminated from consideration, as they did not yield even a single complete set of data. The remaining 48 patients (52 admissions) provided for the analysis 137 complete patient-days, representing a total of 4,521 responses (about one-third of the theoretical maximum mentioned above for 73 admissions).

The distances between all pairs of the 137 available points were computed and ranked in order with the help of a FORTRAN program (there are 9,316 such distances). Using the least squares criterion described above, a two-dimensional planar graph was generated (Fig. 1).§ To each point there was assigned an abbreviated four-character name.
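The equal-weighting device and the counts quoted above can be checked with a short sketch. It assumes that each patient-day is a sequence of 33 transformed values with the 18 eight-hour readings placed first; that ordering, the constant names, and the function name `patient_day_distance` are assumptions made for illustration.

```python
import math
from typing import Sequence

N_REPEATED = 6         # vital signs and P-R interval
READINGS_PER_DAY = 3   # one value per eight-hour interval
N_DAILY = 15           # variables recorded once each day


def patient_day_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance in which each of the 21 variables carries equal weight."""
    n_split = N_REPEATED * READINGS_PER_DAY  # 18 eight-hour dimensions
    if len(a) != n_split + N_DAILY or len(b) != len(a):
        raise ValueError("expected 33 values per patient-day")
    total = 0.0
    for k, (u, v) in enumerate(zip(a, b)):
        # Squared differences of eight-hour readings are divided by three.
        weight = 1.0 / READINGS_PER_DAY if k < n_split else 1.0
        total += weight * (u - v) ** 2
    return math.sqrt(total)
```

Two patient-days that differ by one unit in every coordinate are then separated by the square root of 21 rather than of 33. The counts in the text follow the same arithmetic: 137 × 33 = 4,521 responses and 137 × 136 / 2 = 9,316 distances.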
The first and second characters of the name represented the initials of the patient’s first and last names; in the third character position was placed a 1 or a 2 for the number of the hospitalization for those few patients who were admitted twice during the period of study (otherwise, the second letter of the patient’s last name was inserted in that position); the fourth character position was used for the day of hospitalization. For example, “HO15” would refer to the fifth day of patient H.O.’s first hospital admission. When two or more points were superimposed upon one another, a footnote (e.g., “12*”) was substituted for the character name on the graph, to be decoded in another part of the computer output. Even though Fig. 1 represents the data from only 52 patients and the inclusion of a larger sample of points would further complicate the picture, the set of points has already resulted in an elaborate map. However, the figure is considerably easier to examine than are the original 4,521 items of data from which it was derived.

§ A planar map derived from these data was presented previously in the Proceedings of the DECUS Symposium, May, 1969. Differences between that map and Fig. 1 are attributed to the more recent finding of significant errors in the input data.

FIG. 1. Two-dimensional representation of 33-dimensional data. In order to provide a larger map, the computer print-out utilized two sheets of wide paper which are spliced together. For display purposes, ellipses were drawn around each 4-character label on the computer-generated map. For greater detail, see Fig. 6 below.

The minimum coverage algorithm was utilized to indicate the nearest neighbor of each point and then to identify successively the shortest possible connections between pairs of points until all points were incorporated in a single network. Figure 2 shows the result of adding the minimum coverage connections to the planar map of Fig. 1. The complexity of the multidimensional interrelationships is immediately evident from this figure. It would, indeed, be possible to display on a plane a much less intricate representation of the minimum coverage tree without crossing lines, but such a representation would completely obscure the spatial relationships other than those between adjacent closest neighbors; by such a device, clusters or aggregations of points would become widely dispersed over remote portions of the tree.

FIG. 2. Minimum coverage graph. To the same points shown in Fig. 1 have been added the minimum coverage connections. Points close to one another in the two-dimensional projection need not necessarily be neighbors in 33-D. This type of diagram adds some information concerning the higher dimensional relationships.

In Figures 3(a)-3(d) are shown the distributions on the planar map of four abnormal findings: the presence of chest pain, lactic dehydrogenase values in excess of 400 units, a ventricular “gallop” sound heard on auscultation, and a systolic blood pressure of 80 or below (“shock”) at any time during a twenty-four hour period. Figures 3(b)-3(d) show a clear tendency for abnormal values of the variables to be found in identifiable regions of the map. The very elevated lactic dehydrogenase values are located at the left on the map, the gallops at the lower right, and the low blood pressure values at the lower left, respectively. Since each figure contains information about only one of 21 different variables and a two-dimensional plot is used to locate the points from a 33-dimensional space, it is interesting that these identifiable regions of the map can be found. Those variables which show the closest correlation with other observed variables would be expected to show the greatest tendency to separate into distinct regions, and vice versa. For example, the presence of a ventricular gallop [Fig. 3(c)] was significantly correlated with the presence of rales (r = .59) and with the administration of diuretics (r = .27).

FIG. 3(a-d). Disposition of individual abnormalities on the two-dimensional map. See text.

The 137 points shown in Figs. 1-3 form a highly branching network, which suggests a continuous and complex spectrum of clinical states, rather than several easily identifiable and distinct states. There is a group of points in the upper left portion of the diagram which are all very close to one another. From Fig. 3(a) it can be seen that many of these points are characterized by the presence of chest pain. The closeness of these points is especially interesting when it is remembered that chest pain is only one of 21 equally-weighted variables which determine the positions of these points. In general, the sicker patients were represented by points in the lower portion of the map, especially toward the periphery. It seems unlikely that further similar data will result in the appearance of several easily distinguishable subgroups of clinical states, but it does seem reasonable to anticipate that further experience may permit us to divide the space into several contiguous regions of differing significance. The path of a given patient through the space with time may take him through one or more important transitions from one region to another. If certain pathways through the space are characterized by heavy traffic, and if movement along one of them is of special prognostic significance, then identification of these patients would be especially important.

The data at hand do include instances of several points on subsequent days from the same patients. Figure 4 shows the points in the same locations as in the previous figures, but the interconnections in this diagram are directional arrows between points from the same patients at different times. The more usual tendency of patients with this illness to show clinical improvement is reflected in this figure by a predominance of arrows pointing from the lower and more peripheral regions of the map to the more “normal” region of densely aggregated points in the upper left portion of the map.
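The directional arrows between points from the same patient can be derived from the four-character names alone. The sketch below assumes the naming scheme described earlier (first three characters identify the patient and admission, fourth character is the day of hospitalization); the function name `trajectories` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def trajectories(labels: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group labels by patient and admission, then link successive days."""
    by_admission = defaultdict(list)
    for label in labels:
        if len(label) != 4 or not label[3].isdigit():
            raise ValueError(f"unexpected label format: {label!r}")
        by_admission[label[:3]].append(label)
    arrows = {}
    for key, group in by_admission.items():
        ordered = sorted(group, key=lambda name: int(name[3]))
        if len(ordered) > 1:
            # One arrow from each available day to the next available day.
            arrows[key] = list(zip(ordered, ordered[1:]))
    return arrows
```

Because incomplete patient-days were omitted from the analysis, an arrow may join days that are not consecutive.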

FIG. 4. Changes in clinical status with time. The sparse trajectory data shown here suggest a tendency for paths to converge in the dense cluster at the upper left, while avoiding a central zone in which there are almost no points to be found. See text.

DISCUSSION

The representation of observations on multiple variables as points in multidimensional space is, from a mathematical standpoint, a standard and well-known device. It is, therefore, not surprising to find that the distance formula (1) has been applied repeatedly by scientists to their data in the past. Since Karl Pearson used such a formula to develop his “coefficient of racial likeness” in 1926, there have been occasional scattered references to the use of distance measures (Ref. 1, pp. 284ff). But in the 1950’s, several psychologists and biologists began to emphasize their use. Sokal and Sneath¹ make distance measures the basis of “numerical taxonomy.” The techniques developed by these workers were applied to a group of cardiac patients in 1966 by Manning and Watson⁹ and in 1969 by Neurath et al.¹⁰ in the preoperative assessment of patients undergoing pelvic surgery. Distance measures as an adjunct to cluster analysis have also been used by Bonner¹¹ in connection with medical research data, but the techniques have, nevertheless, been employed infrequently in clinical investigations to date.

Feinstein¹² has shown how one can select several variables of interest and, by means of overlapping circles in a Venn diagram, summarize schematically the frequency of patterns of findings in a given disease entity. There are four characteristics of this type of grouping procedure which can result in important losses of information. (1) Each patient must be characterized as belonging to some discrete subgroup, whereas he might more reasonably be described as in a transitional group or as a unique “outlyer.” (2) Similarities and differences between individual patients become obscured as soon as they are categorized as belonging to the same or different groups. (3) Unless a Venn diagram is designed in which a circle is used for virtually every available measured variable (which tends to make an unduly intricate Venn diagram), these techniques result in groupings which strongly reflect a bias inherent in the investigator’s selection of the “major variables of interest.” (4) This type of grouping of patients does not lend itself well to demonstrating changes in a patient’s responses with time.

The approach described here differs markedly from the approach of classical univariate or multivariate statistics. In statistics, individual measurements are lost sight of in favor of estimated means and variances of samples of populations, whereas in the present approach the identity of each point (i.e., each observation vector) and its relationship to all other points are preserved and are central to the analysis. In situations where there is no generally accepted criterion for classifying sets of data into subgroups, there is a considerable advantage in being able to discern the interrelationships among the individual items of the data (the so-called “structure” of the data) without having first to divide the data into arbitrarily chosen subgroups. This is likely to be the situation in studies where experimental data are collected with the primary aim to provide objective information about an insufficiently understood disease entity, experimental system, or population, rather than to test one or more specific scientific hypotheses. In this type of investigation, the researcher expects to be able to examine some of his data before formulating detailed questions about his system. The MIRU studies described above are an excellent example of this sort of investigation.

There are differences between the approach proposed here and so-called “cluster analysis.” If clustering techniques are applied to data of the sort described here, clustering algorithms will permit one to identify any clusters which may be present. These techniques are not too helpful, however, when applied to a series of points such as those in Fig. 2, where a highly branched structure, rather than a group of several clusters, provides a more realistic description of the data. The minimum coverage algorithm used here should be expected to permit detection of clusters if they are present (such as, for example, the group of points at the upper left of Fig. 2, characterized by chest pain with no other complicating abnormal findings), but it also brings out relationships between individual points even when no clustering is present. Most significantly, we visualize patients as continuously changing in a variety of ways with time, rather than “jumping” from one to another of a few discrete clinical states.

Although the use of standard scores for each of the multiple variables eliminates the problem of combining, in a single distance formula, quantities measured in different units, it leaves unsolved the more complex problem of assigning weights to the contributions from each of the different variables. When observations are made on multiple variables, some of them are likely to be highly correlated with one another. In a sense, the information contributed by two highly correlated variables would be the same as that contributed by either of the variables alone (for example, we would expect highly significant overlaps in the information provided by the systolic and diastolic blood pressures or by the lactic dehydrogenase and the glutamic oxaloacetic transaminase levels). In the present study, no attempt was made to weight each variable in proportion to the amount of independent information provided by that variable. In discussing the problem, Sokal and Sneath¹ recommend that, at least in taxonomy, equal weights be used for all variables. Overall suggested the use of Mahalanobis’ generalized distance formula to take into account correlations among variables, but his argument for the use of this formula has been criticized. Others have suggested that the techniques of factor analysis or principal-components analysis be applied initially to the raw data, and that the “factors” be used to determine the dimensions of a space and the location of the points in that space. Experience with these methods of handling the problem of correlated variables has not been [...]

FIG. 5(a-b). Squashing as a result of missing dimensions. In Fig. 5(a), and also again in the background of Fig. 5(b), is drawn in two dimensions a hypothetical circuit of points open at the upper left. This would seem to imply similarity between the points opposite one another at the open ends of the ring. Fig. 5(b) shows the same points drawn in a three-dimensional space, where information about a third variable is included. In the second figure, the apparent “circuit” has disappeared, and the points which seemed close to one another in Fig. 5(a) are now found to lie maximally far apart. See text.

FIG. 6. An open “circuit” found in the myocardial infarction map. The connecting lines shown here are extracted from the minimum coverage graph of Fig. 2. The apparent circuit may represent a real example of the phenomenon illustrated diagrammatically in Fig. 5.

[...] between two points, WB03 to EC04, which are fairly near neighbors of each other in 33-D (distance = 2.43 units). This figure was obtained from an enlargement of the central portion of Fig. 1. The raw data associated with the two points, as well as for each of the points making up the circuit between them, are shown in Table 1. Reading the table from left to right corresponds to moving counterclockwise around the circuit from WB03 to EC04. Comparison of this circuit with Fig. 2, on which all of the minimum coverage connections have been drawn, shows that there is, indeed, a central region within the circuit which is virtually devoid of points. In Fig. 4 only one trajectory crosses this central zone. This finding lends support to the idea that this empty area represents a very unlikely clinical state, and that the circuit from WB03 to EC04 probably results from the absence of an important additional dimension. In other words, WB03 and EC04, although similar to one another in terms of the variables observed, would probably differ importantly if other clinical information were available. This is just the situation discussed in connection with Fig. 5.

TABLE 1. Raw data for WB03, EC04, and each point on the circuit between them, read from left to right (counterclockwise around the circuit). The point labels and cell values are illegible in this copy. Legible row labels: systolic BP, diastolic BP, respiratory rate, P-R interval, WBC, LDH, SGOT, SGPT, rhythm, rales, gallop, arrest, digitalis, diuretic, antiarrhythmic agent, paced, respiratory assistance, vasopressors.

SUMMARY AND CONCLUSION

Techniques for quantitating similarity and dissimilarity between individuals have been studied extensively by researchers in taxonomy and psychology, but as yet only rarely by medical investigators. These methods were well suited to the analysis of a large group of patients, where information is available about multiple variables. The interrelationships among individuals and groups of patients can be displayed by means of a two-dimensional road map. These techniques were applied to 21 different variables observed in 52 patients with presumptive diagnosis of myocardial infarction. Because each patient was observed from one to six times on different hospital days, a total of 1?7 complete sets of data were available. By representing each set of data as a point in a multidimensional space and connecting all points to their nearest neighbors, a road map was devised to indicate which patients were similar to one another. Certain points tended to form tight clusters (especially patients with chest pain and no other evident abnormalities). It is anticipated that, as more points are added to the map, it will become possible to identify major pathways between important clinical states, and to quantitate the likelihood of a patient's going from one clinical state to another in time.

REFERENCES

1. PEARSON, K. On the coefficient of racial likeness. Biometrika 18, 105 (1936).
2. ROGERS, D. J., AND TANIMOTO, T. T. A computer program for classifying plants. Science 132, 1115 (1960).
3. CRONBACH, L. J., AND GLESER, G. C. Assessing similarity between profiles. Psychol. Bull. 50, 456 (1953).
4. GLESER, G. C. Quantifying similarity between people. In The Role and Methodology of Classification in Psychiatry and Psychopathology. Proc. of Conference, Washington 1965. KATZ, M. M., COLE, J. O., AND BARTON, W. E. (Eds.). USPHS Publication No. 1584, pp. 201ff, 1968.
5. HEERMANN, E. F. Comments on Overall's Multivariate methods for profile analysis. Psychol. Bull. 63, 128 (1965).
6. NUNNALLY, J. The analysis of profile data. Psychol. Bull. 59, 311 (1961).
7. OVERALL, J. E. Note on multivariate methods for profile analysis. Psychol. Bull. 61, 195 (1964).
8. MANNING, R. T., AND WATSON, L. Signs, symptoms, and systematics. JAMA 198, 1180 (1966).
9. NEURATH, P. W., ENSLEIN, K., AND MITCHELL, G. W., Jr. Design of a computer system to assist in differential preoperative diagnosis for pelvic surgery. New Engl. J. of Med. 280, 745 (1969).
10. BONNER, R. E. Cluster analysis. Ann. N.Y. Acad. Sci. 128, 973 (1966).
11. FEINSTEIN, A. R. Clinical Judgment. Baltimore, Williams & Wilkins, 1967.
12. CRONBACH,