Pattern Recognition Letters ELSEVIER
Pattern RecognitionLetters16 (1995) 703-709
Rapid compound pattern classification by recursive partitioning of feature space. An application in flow cytometry Richard Dybowski *, Vanya A. Gant, Peter A. Riley, Ian Phillips Department of Microbiology, United Medical and Dental Schools (St Thomas" Campus), Lambeth Palace Road, London SE1 7EH, United Kingdom
Received9 June 1994;revised2 February1995
Abstract A method is described for rapidly classifying a set of points in real space. A set is mapped to a low-dimensional vector via a discriminating, recursive partition of feature space obtained pragmatically by the CART algorithm. Keywords: Compounddecisionproblem; Feature space partition; CART; Microbiology;Flow cytograms;Nearest neighbourclassification
1. Background: the classification of flow cytograms Flow cytometry consists of a method for measuring the light scattering and fluorescence characteristics of large numbers of individual microscopic particles in liquid suspension such as plankton, bacteria, and blood cells (Allman et al., 1993, pp. 27-47). In a flow cytometer the particles under consideration are individually carried in a laminar stream of fluid through a focused beam of light. At this point each particle will not only scatter light at both narrow and large angles to the original beam to a degree which relates to the particle's physical characteristics, but also emit fluorescent light whose wavelength and intensity depends on the type and amount of dye it contains. These pulses of light, both scattered and emitted, can be collected, differentiated, and delivered to photomultiplier tubes by means of lenses. * Correspondingauthor. Email:
[email protected]
The pulses of light are converted to pulses o f electricity, whose amplitude is proportional to the size of the original signal. The information is digitised and stored on a computer as list mode data for each light channel. Data acquisition is very rapid (up to 5000 particles per second); a typical flow cytometric analysis will contain data relating to at least 10 4 partides, with each particle having its own "signature" of, for example, four values for narrow and large angle light scatter, as well as fluorescence at two wavelengths, respectively. The analysis is traditionally displayed as 2-dimensional plots ( f l o w cytograms), where each particle is represented by a point. Typical cytograms are shown in Fig. 1. One application of this technique concerns the potential ability of flow cytometry to differentiate between different types of bacteria on the basis of their light scattering characteristics. This raises the question whether a classification method can be established which is able to map a flow cytogram to the constituent bacteria.
0167-8655/95/$09.50 © 1995 ElsevierScienceB.V. All fights reserved SSD! 0167-8655(95)00016-X
R. Dybowski et al. / Pattern Recognition Letters 16 (1995) 703-709
704
From Eq. (1),
2. Standard group classification
Suppose we have a set X of N vectors {x 1. . . . . xs} obtained simultaneously, for example, a flow cytogram. To which of C possible classes (to1 . . . . . toc) should X be assigned? This problem can be regarded as a type of non-sequential compound decision problem (Fu and Yu, 1980) in which the objective is to assign X to the class toi for which the overall risk of misclassification is minimised. In standard compound decision theory the compound risk R i ( x ) in assigning X to toi is taken to be the average of the resulting risks in the component decision problems: 1 iv R i ( { X l , . . . , xs}) = ~ ]~ li(xk), (1) k=l
where P(x) is the conditional risk (expected loss) of assigning x to toi, c
li(x)
-- E A ( t o i [ t o j ) P ( t ° j l x ) ,
y=1 with ,~(toi1%) being the loss (cost) in assigning a vector to toi which is actually associated with toy. The observations Xl . . . . . x s are assigned to that class toy for which
RJ(X) =
man
R'((xl,. ,xN}) 1 N =~
c
Y'~ ] ~ A ( t o / l t o j ) P ( t o j ) k=l
where f( x k) = EC= l P( toh)f( xk [ toh) and f( xk Lwh) is the multivariate probability density function (pdf) with respect to wh. Thus, in order to determine Ri(X) (i = 1. . . . . C), we require f ( x k I Wh) for each of the NC possible xk-to h pairs. Because of the multimodal nature of flow cytograms, a nonparametric method for estimating f(xklto h) for Eq. (2) is preferred to a parametric one. Various methods exist including nearestneighbour estimators (Loftsgaarden and Quesenberry, 1965), kernel estimators (Rosenblatt, 1956), and maximum penalized likelihood estimators (Good and Gaskins, 1971). In the nearest-neighbour method, f(xklto h) is estimated by the volume V(x k, ~, ton) of feature space centered at x k which just contains the vectors of (,Oh nearest to xk:
f(Xkltoh) = Iy(toh) lV(xk ' ~, tOh)'
Class 2
Class 1
400
- j:v..'.':..,
600
~.~s ~ : .':'" :'" '~ ~:." ,.; . . . .
800
1000
¢,,IOO ~ """~'~:'¢' '
400
600
"
800
LS1
LS1
Class 3
Class 4 .....
400
(2)
j=l
where Y(toh) is the set of those vectors present in feature space known to belong to toh.
{Ri(X)}.
iE{1 . . . . . C}
•
f(xk[toj) f(xk ) ,
600
800 LS1
1200
-
1000
400
600
800
1000
LS1
Fig. 1. Flow cytograms obtained from Classes 1, 2, 3 and 4 (defined in Section 6) when incubation time was 4 hours. LS1, forward angle light scatter; LS2, wide angle light scatter.
R. Dybowski et al. / Pattern Recognition Letters 16 (1995) 703-709
With the kernel method, f(xkl toh) is estimated by centering the same unimodal symmetric function K (the kernel) at each element of Y(to h) in D-dimensional space and summing the kernels: 1 f ( x , I OJh) ---- ]V(tOh) lbO
~
K
y~V(o,h)
,
(3) where b is the bandwidth of the kernel. An example of a kernel is
K( z) = ( 30~r-l(1- zrz) 2
if ZTZ< I, otherwise,
Maximum penalized likelihood estimation adopts a statistical approach to estimating the pdf which gave rise to the set of observed v e c t o r s Y(tOh). Using a set S of candidate pdfs for cah, the penalized log likelihood for curve g in S is given by
l~(glV(tOh))=
]~
In
g(y)--ot.p(g),
yE Y( ogh)
where a is a constant and p(g) is a measure of the degree of local variablity for curve g. The chosen pdf f for ~oh is that for which
l~( f l Y( tOh)) = min 1~(g lY(tOh)). g~S
(Tapia and Thompson (1978) give a detailed analysis of the method.)
3. Compound-pattern feature vectors
In this paper a compoundpattern will refer to a set of points in real space that can take any form, for example, the set of points can be evenly distributed or it can be a composite of clusters. We wish to use a collection of compound patterns (namely, flow cytograms) as the derivation set for a pattern recogniser that will enable us to classify a new compound pattern. One approach to this problem is to use compound decision theory in conjunction with density estimation as described in Section 2 above. However, our interest is in classifying flow cytograms within a working medical environment. This requires a real-time classification procedure which is very rapid. Furthermore, any amendments to the example-vector sets I1(o)1). . . . . Y(toc) need to be
705
incorporated readily. One drawback of kernel estimators is that determination of Eq. (3) can be time consuming (Silverman, 1986, pp. 88-89), particularly in the context of flow cytograms since these typically consist of o v e r 10 4 points. Although calculation of f(xk[ toh) by maximum penalized likelihood estimation may be rapid once f is defined, applying this technique to the estimation of multivariate pdfs can be computationally very costly (Silverman, 1986, p. 100). As regards the nearestneighbour approach, although identifying the ~ vectors of t% nearest to x k can be time consuming, Friedman et al. (1977) describe a method in which these vectors can be located rapidly; however, the nearest-neighbour approach is susceptible to local noise (Silverman, 1986, p. 97). Because o f these problems, we have been seeking an alternative to the compound decision-theoretic approach. One alternative is to map each compound pattern t o a single feature vector (a compound-patternfeature vector) and then perform pattern recognition with respect to the distribution of these vectors in real space. Let all the compound patterns of interest (the example sets and the sets to be classified) be just contained within a subset F 0 of D-space (the feature space for the points) having the shape of a D-dimensional rectangular hyper-parallelepiped. The faces of F 0 are aligned with the Cartesian axes of D-space. One way of obtaining a compound-pattern feature vector is by partitioning F 0 into a !large set F 1 of rectangular hyper-parallelepipeds of equal volume. Using F 1, a compound pattern X is mapped to a [ F ll-dimensional compound-pattern feature vector (~bI. . . . ,~blrll), where ~i is the relative frequency of the number of points of X which reside within the ith cell of F 1. Since the elements of the compound-pattern feature vector sum to unity, the compound-pattern feature vectors will be distributed on the [ F l I-dimensional hyperplane which passes through the points (1, 0 . . . . ,0), (0, 1, 0 , . . . , 0 ) . . . . . (0 . . . . . 0, I1). This distribution could then be the basis for a nonparametric pattern recogniser such as k-nearest neighbour for which Hand (1981, pp. 31-33) I gives a statistical justification. Although, for a given cardinality of X, the sample space of the possible compound-pattern feature vectors will be finite, Hand's argument is approximated when I X I is large.
706
R. Dybowski et al. /Pattern Recognition Letters 16 (1995) 703-709
Because of the increase in misclassification rate that tends to accompany an increase in dimensionality (Hand, 1981, pp. 122-131), a more parsimonious partitioning of F 0 is preferred: one which results in fewer cells than F 1 but which provides a high discrimination between compound patterns corresponding to different classes. In principle, this could be done by establishing a series of linear discriminant functions 81,..., 8 Ir11_1 which partition F 0 in a recursive manner. For example, 81 partitions F 0 into A 1 and A2, 8 2 partitions A 1 into B 1 and BE, 83 partitions B 2 into C~ and C 2, and so on. Rather than pursue this possibly complex approach, we will use a more pragmatic method for achieving a recursive partition that accentuates differences between compound patterns from different classes. We will describe how this can be done using the CART algorithm but first we will describe CART for those not familiar with this technique.
4. The CART methodology CART (Classification and Regression Trees) is a statistical procedure for tree-structured data analysis (Breiman et al., 1984). From a derivation set, the CART algorithm produces a set of decision rules represented in the form of a binary decision tree in which the non-leaf nodes are binary decision nodes. In the CART algorithm, a classification tree initially consists of only the root node z I which contains all the examples in the derivation set. For each attribute a i recorded for the examples, a partitioning of the set of possible values for a/ into two subsets is sought which maximises some index of class purity (see below). Continuous variables are split by order and categorical variables by a partition of the possible values. Node "/'1 is then partitioned into two descendent nodes, 7 L and ~'g, with respect to that attribute aj which, out of all the attributes, gave rise to the largest maximisation of the class-purity index. This partitioning is applied recursively to each leaf node. The intention of the above tree-building algorithm is to generate a tree for which each node is class-wise purer than its parent node. l e t the impurity of node ~" be defined by ,.(~-) = h( 4~(11 ~') .... ,4,(C I~'))
where h is an impurity function, ~b(i[~-) is the proportion of class i at node ~', and C is the number of classes. A partition (or split) S of node ~" into z L and z R, results in a proportion thL of cases in z going to z L and a proportion ~bR of cases going to ~'R" The decrease in impurity resulting from split S is defined by A,~(S, ~) = ~(z) - [~L" ~(~L) + ~ " " ( ~ ) ] . Various impurity functions have been proposed. (For example, the information gain measure of Quinlan (1986).) For CART, Brieman et al. (1984) used the Gini index of diversity: h ( 4 , ( l [ ~ - ) , . . . , ~b(C[~-))= ~ ~b(il~-) • ~b(jl~-). j-#i
When an unclassified object is passed down a decision tree from the root node to a leaf node along the path which is consistent with the object's attribute values, it is assigned to that class which is the most frequent class amongst those examples present in the leaf node. It is likely that a tree generated from the recursive partitioning of leaf nodes in the manner described above will overfit the derivation set, therefore, a search is made for that subtree (obtainable by backward pruning) of the final tree which minimises classification error rates either by means of a separate test set or through the use of cross-validation. The CART procedure has been discussed by Crawford (1989) and Buntine (1991). CART software is available from the SYSTAT CART statistical package (Steinberg and Colla, 1992) and is part of the S-PLUS statistical package (Statistical Sciences, 1993).
5. Classification of compound patterns via CART In this paper the CART method will not be used to perform classifications directly. Instead, it will be used to partition feature space in a way that discriminates between compound patterns from different classes. The partition is then used to map a set of points to a single point. The steps are as follows. From each of the example compound patterns of known class, m points are selected randomly where m is the lowest cardinality of all the compound
R. Dybowski et al. / Pattern Recognition Letters 16 (1995) 703-709
707
problem of classifying a vector with respect to a set of example vectors. However, because r 0 was partitioned using CART, which attempted to discriminate between the class-representative compound ipatterns, compound-pattern feature vectors correspqnding to the same class are expected to be clustered)together.
patterns. The resulting subsets of points for each class are pooled together to form a class-representative compound pattern. This represents the range of vectors associated with the corresponding class. The CART algorithm is run over the class-representative compound patterns (the data set) using v-fold cross validation to produce a tree T. Cross validation is employed to prevent overfitting. A leaf node of T corresponds to a subset of points in feature space F 0 which have attribute values in agreement with the path from the root node of T to the leaf node. Because of the manner in which the tree was constructed, the subsets of F 0 corresponding to the leaf nodes of T collectively constitute a recursive partition /'ca, of F 0. Fig. 2 is an example of such a partition. Tree T is then used to map a set of points to a compound-pattern feature vector by "dropping" the points down the tree from the root node and counting the number of times each leaf node has been reached by the points. This is equivalent to counting the number of points in a compound pattern which reside in a cell of F~r t, this being done for every cell. The compound-pattern feature vector for a compound pattern is the vector of counts for all the cells in Feast expressed as relative frequencies. By mapping all the compound patterns of interest to compound-pattern feature vectors, the problem of classifying a compound pattern is reduced to the
6. Experiment In the following experiment we demons!rate that, by first translating flow cytograms into compoundpattern feature vectors though the use of CART as described above, it is possible to classify cytograms correctly. Paired 2 ml aliquots of filtered (0.22 p,m) Isosensitest broth (Oxoid, Basingstoke, UK) Were inoculated with either an antibiotic-sensitive strain of the bacterium Escherichia coli (NCTC 10418) or an antibiotic-resistant clinical isolate of Escherichia coli (U726). The antibiotic amoxycillin was addbd to one of each pair of broths at a concentration expected to kill the sensitive but not the resistant strain (8 mg/1). This resulted in four different cultures designated as follows: Class 1: Sensitive E. coli in the presence of the antibiotic Class 2: Sensitive E. coli with no antibiotic
m,m o •
63
o
roD= []
[]
[] O
[]
D
[]
>-
O
[] []
ODD
o []
[] o
Q
5
l
(
L
10
15
20
x Fig. 2. Partition of 2-space obtained by running the CART algorithm over a synthetic data set consisting of two classes. D and • are members of the two classes respectively.
708
R. Dybowskiet al. /Pattern RecognitionLetters 16 (1995) 703-709
Class 3: Resistant E. coli in the presence of the
antibiotic Class 4: Resistant E. coli with no antibiotic
Each culture was prepared in triplicate and incubated at 37°C for 5 hours. At 0, 1, 2, 3, 4, 5 hours, 150 ~1 aliquots of each culture were removed for staining with DiBAC 4 (Molecular Probes Inc, Eugene, USA) at a concentration of 0.5 mg/1. (DiBAC 4 is a lipophilic anion that has a low affinity for intact polarized bacterial cell membranes. Depolarization of the membrane allows the dye to enter the bacterium, resulting in fluorescence of the cell.) Bacteria were stained for 5 minutes before flow cytometric analysis. Flow cytograms recording forward angle light scatter, wide angle light scatter and fluorescence were obtained for each of the 4 X 6 X 3 cultures using a Bryte HS flow cytometer (Bio-Rad). No gating was used either at acquisition or before processing the data apart from a forward scatter threshold applied to exclude particles significantly smaller than bacteria. Thirty-three points were randomly subsampled from each of the 72 cytograms and the subsamples corresponding to each of the 4 X 6 class-time combinations were pooled together to give rise to 24 time-conditioned class-representative compound patterns. Using 2-fold cross validation, the CART algo-
rithm provided by the SYSTAT CART package was run over the 33 × 72 samples using forward angle light scatter, wide angle light scatter, fluorescence and incubation time as the candidate attributes for splitting. Each of the 72 flow cytograms was mapped to a compound-pattern feature vector by dropping all the points of a cytogram down a 13-terminal node tree induced by the above CART algorithm. 6.1. Results
Fig. 3 displays a principal component plot of the 12 compound-pattern feature vectors corresponding to an incubation time of 4 hours. Each compoundpattern feature vector was classified by the 1nearest-neighbour method (standardised Euclidean distance) using the remaining 11 vectors as the test set (the leave-one-out method). There were no misclassifications.
7. Discussion In our approach, each point x in a flow cytogram X is dropped down a single CART-tree. This is in contrast to Eq. (2) of the compound decision-theo-
Q
g E o 0
"7
"c O-
8
o~ 2 2 i
I
i
i
-4
-2
0
2
First Principal
Component
Fig. 3. Principalcomponentplot of the compoundpattern featurevectorscorrespondingto flowcytogramsfrom Classes 1, 2, 3 and 4 when incubation time was 4 hours. The points are labelledwith the class numbers.
R. Dybowski et al. / Pattern Recognition Letters 16 (1995) 703-709
retic approach in which each f(xlto) has to be computed for every class to of interest. Consequently, the former approach is expected to be generally faster. The partition of feature space F 0 by the CART algorithm gave rise to a clustering of compound-pattern feature vectors that were successfully classified by the 1-nearest-neighbour method. This suggests that the method has the potential of classifying a given bacterium as being sensitive or resistant to a given antibiotic (Fig. 3; Classes 1 and 3). Furthermore, it may also have the potential to discriminate between different strains of the same bacterium (Fig. 3; Classes 2 and 4). Conventional antibiotic sensitivity tests take 18-24 hours but, by using flow cytometry in conjunction with CART, the susceptibility of a pathogen to an antibiotic can be determined in under 5 hours. A problem with the approach is that a partition of F 0 by CART is done to discriminate between those class-representative compound patterns presented to the algorithm. Therefore, it is possible to have a compound pattern which is very different from all the given class-representative compound patterns but which is not differentiated from any of these by /"cart" However, given the promising results obtained so far by our method, we will continue to assess its efficacy with a variety of bacteria and antibiotics.
Acknowledgement The project was partly funded by the Department of Health (file reference CTJ/523).
709
References Allman, R., R. Manchee and D. Lloyd (1993). Flow cytometric analysis of heterogeneous bacterial populations. In: D. Lloyd, Ed., Flow Cytometry in Microbiology. Springer, LOndon. Breiman, L., J.H. Freidman, R.A. Olshen and C.J. Stone (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA. Buntine, W. (1991). A Theory of Learning Classification Rules. PhD thesis, University of Technology, Sydney. Crawford, S. (1989). Extensions to the CART algorithm. Internat. J. Man-Machine Studies 31, 197-217. Friedman, J.H., J.L. Bently and R.A. Finkel (1977). An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 209-26. Fu, K.-S. and T.S. Yu (1980). Statistical Pattern Classification Using Contextual Information. Wiley, Chichester. Good, IJ. and R.A. Gaskins (1971). Nonparametric roughness penalties for probability densities. Biometrika 58, 255-77. Hand, D.J. (1981). Discrimination and Classification. Wiley, Chichester. Loftsgaarden, D.O. and C.P. Quesenberry (1965). A n0nparametric estimate of a multivariate density function. Ann. Math. Statist. 36, 1049-51. Quinlan, J. (1986). Induction of decision trees. Machine Learning 1, 81-106. Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832435. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London. Statistical Sciences Inc. (1993). S-PLUS for Windows User's Manual, Version 3.1. Statistical Sciences Inc, Seattle. Steinberg, D. and P. Colla (1992). CART. SYSTAT Inc; Evanston, IL. Tapia, R.A. and J.R. Thompson (1978). Nonparametric Probability Density Estimation. John Hopkins University Press, Baltimore.