Classification of chromosomes constrained by expected class size


Pattern Recognition Letters 4 (1986) 391-395

October 1986

North-Holland

Jim PIPER
Medical Research Council Clinical and Population Cytogenetics Unit, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, Scotland

Received October 1985
Revised 23 January 1986

Abstract: If a chromosome classifier is constrained to produce fixed class sizes specified a priori, then classification accuracy may be improved compared with a simple maximum likelihood approach applied to each chromosome without reference to the other chromosomes in the cell. Several classifiers which take account of the expected class size are compared using a representative set of banded chromosome data.

Key words: Maximum likelihood classification, model based classification.

0167-8655/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)

Introduction

The distribution of chromosome classes in human cells is extremely regular. In about 99% of people, almost all cells will contain two chromosomes in each of the 22 autosomal classes plus two sex chromosomes, either two Xs in females or an X and a Y in males. Moreover, even in the majority of abnormal chromosome constitutions, there will still be exactly two normal chromosomes in each of at least 21 classes in each cell (for example, Down's syndrome, in which the only abnormality is that there are three number 21 chromosomes).

Automation of chromosome analysis can make use of this knowledge. The usual procedure is to measure features and perform a classification on each chromosome independently, which results in the likelihoods of a particular chromosome belonging to each of the 24 classes. If the maximum likelihood is used for a preliminary classification, in a complete cell the result will usually not be compatible with the model, in that some classes will have an excessive number of chromosomes while other classes will have a deficit. A rearrangement procedure is then entered which uses the alternative class-likelihoods of the chromosomes and attempts to find a best fit compatible with the model.

Such a rearrangement procedure has two benefits. Firstly, the classification accuracy should be improved. Secondly, even if the overall accuracy is only slightly improved, in an interactive chromosome analysis system the effect of removing most instances where the computer presents chromosome classes which clearly have too many or too few members will make the system appear more capable.

The joint likelihood of the set of individual chromosome class assignments in a cell, which we call the 'cell classification likelihood', is the product of the individual likelihoods. It is of course maximum when each individual chromosome is classified independently by maximum likelihood. Slot (1979) analysed the possible improvement in classification accuracy that could be obtained by attempting to maximise the cell classification likelihood in a situation where each class was constrained to be of predetermined size. It is possible but computationally very expensive to obtain an optimal solution to this problem, and Slot proposed a fast but sub-optimal method. In a simulation experiment with a classifier using two features normally distributed within each class, he showed that worthwhile improvement in classification accuracy was possible compared with just using each individual chromosome's maximum likelihood class assignment.

Practical chromosome analysis systems differ from Slot's model, firstly in that ten or more features are used (Lundsteen et al., 1986; Piper, 1986), and secondly in that the goal of the analysis is to detect abnormality of some sort; thus only the expected class sizes are known. Additionally, in the case of ante-natal diagnosis the sex is unknown a priori and so only upper and lower bounds of the expected sex chromosome class sizes are known. Apart from genuine abnormality in the chromosome constitution, a practical system also has to be able to handle the usual range of bad data problems, for example poorly stained material, incomplete cells.

Tso and Graham (1983) used a linear programming algorithm to obtain the constrained global maximum cell classification likelihood, where the constraints permitted missing or additional chromosomes or initially unknown sex. They demonstrated its performance in a ten class, two feature experiment. Unfortunately, the algorithm at its present stage of development is computationally too expensive for routine use. Several computationally feasible techniques for such model-based rearrangement have been proposed, and Granum (1982) reported the improvement obtained by a method similar to that of Lundsteen et al. (1981) on a selected data set of straight, well separated banded chromosomes.

Here we compare the performances of the simple 'benchmark' method proposed by Tso and Graham (1983) (which is also similar to the Lundsteen et al. (1981) method), the 'cascade' method proposed by Rutovitz (1977), a new variant of this method, and a new relaxation method, on a set of reasonably representative banded chromosome data, classified into 24 classes, using up to 28 features.
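As an illustration of the mismatch described in the Introduction, the following minimal Python sketch classifies each chromosome independently by maximum likelihood, reports the classes whose size disagrees with the model, and computes the logarithm of the cell classification likelihood. The function names and the toy likelihood values are hypothetical and are not taken from the paper; only two classes are used to keep the example small.

```python
import math
from collections import Counter


def ml_classify(likelihoods):
    """Assign each chromosome to its maximum likelihood class.

    likelihoods[c][k] is the likelihood that chromosome c is of class k.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in likelihoods]


def cell_log_likelihood(likelihoods, assignment):
    """Log of the product of the individual likelihoods of an assignment."""
    return sum(math.log(likelihoods[c][k]) for c, k in enumerate(assignment))


def class_size_errors(assignment, expected):
    """Return {class: observed - expected} for classes that violate the model."""
    counts = Counter(assignment)
    return {
        k: counts.get(k, 0) - e
        for k, e in enumerate(expected)
        if counts.get(k, 0) != e
    }


# Hypothetical toy cell: 4 chromosomes, 2 classes, 2 chromosomes expected per class.
TOY = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
independent = ml_classify(TOY)
errors = class_size_errors(independent, [2, 2])
constrained = [0, 0, 1, 1]  # an assignment that satisfies the class sizes

print(independent, errors)
print(cell_log_likelihood(TOY, independent), cell_log_likelihood(TOY, constrained))
```

In this toy cell the independent assignment overfills class 0, and the assignment that satisfies the class sizes has a lower cell classification likelihood. A rearrangement classifier accepts such a decrease and tries to keep it small.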

Rearrangement classifiers

The original classification against which the rearrangement strategies were tested was that in which each chromosome was independently assigned to its maximum likelihood class. This classifier is described in more detail by Piper (1986). Four rearrangement classifiers were tested, as follows:

RC1: The first classifier is described by Tso and Graham (1983) as the 'benchmark' method, which is a slight variant of the method proposed by Lundsteen et al. (1981) and used in the Joyce-Loebl Magiscan 2 chromosome analysis system. We have chosen to test Tso and Graham's version as an example of this type of algorithm, firstly because they make explicit the order in which chromosomes are allocated to classes, and secondly because, unlike Lundsteen et al. (1981, 1986) and Granum (1982), they do not 'reject' assignments made with a relative likelihood below a rejection threshold, and this makes their method more comparable with the other rearrangement classifiers tested here, which also have no reject condition. Classified chromosomes are assigned to classes in decreasing order of likelihood (over all chromosomes and all classes), so that the overall most likely assignments are made first. If a chromosome is to be assigned to a class which has already been filled by application of this rule, however, then the assignment is not permitted, and the chromosome must subsequently be assigned to another class for which its likelihood is in fact less. Note that in this method, no class will end up with more than two chromosomes.

RC2: In the rearrangement classifier proposed by Rutovitz (1977) chromosomes are first of all allocated to their maximum likelihood class. Then a rearrangement is implemented as a 'cascade' of moves through a set of classes $G_1, G_2, \ldots, G_n$, starting at class $G_1$ which has an apparent excess of chromosomes and ending at class $G_n$ which has an apparent deficit. In each stage of the cascade one chromosome is moved from class $G_i$ to class $G_{i+1}$. Of course, frequently $n$ is equal to 2 and the cascade simplifies to the movement of a single chromosome between two classes. If $L_i$ is the likelihood that a chromosome is of class $G_i$, Rutovitz (1977) defines the cost of moving a chromosome from class $G_i$ to $G_{i+1}$ to be $C_i = -L_{i+1}$. Then the cost of a cascade is defined to be $CC = \max(C_i)$. Plausibility constraints restrict the permissible rearrangements with the intention that such a classifier should be able to classify correctly those samples where, for example, there really are three chromosomes in a class, e.g. in Down's syndrome. For this study the constraints were (i) that the reassignment of a chromosome to a class $G_j$ was permitted only if $L_j > k \cdot \max(L_i)$, where $k$ is a constant determined by experience, and is 0.15 in this experiment, and $\max(L_i)$ is the maximum likelihood for that chromosome, and (ii) that the length of a cascade should not exceed 4 classes, i.e. three moves. The rearrangement procedure computes all plausible cascades, the moves in the minimum cost cascade are then implemented, and the whole procedure is iterated until no further plausible cascades can be found.

RC3: This is similar to RC2 but with the alternative cost functions $C_i = L_i / L_{i+1}$ and $CC = \prod_i C_i$. With these cost functions, the minimum cost cascade is the one which minimises the decrease of cell classification likelihood (the product of the individual likelihoods). This method is therefore a sub-optimal algorithm for choosing the maximum constrained cell classification likelihood.

RC4: The previous three classifiers allocate or rearrange chromosomes sequentially, making one move at a time and then reassessing the situation.
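The sequential classifiers can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the implementation used in the study: `benchmark_assign` follows the RC1 allocation rule, `cascade_cost` evaluates the RC2 and RC3 cost functions for one given cascade, and `plausible` is constraint (i) with k = 0.15. The function names, the data layout and the toy values are hypothetical, and the search over all plausible cascades is omitted.

```python
def benchmark_assign(likelihoods, capacity):
    """RC1-style allocation in decreasing order of likelihood.

    likelihoods[c][k]: likelihood that chromosome c is of class k.
    capacity[k]: number of chromosomes class k may hold (2 for an autosome).
    Returns the class of each chromosome, or None if every class was full.
    """
    pairs = sorted(
        ((lik, c, k) for c, row in enumerate(likelihoods) for k, lik in enumerate(row)),
        key=lambda t: -t[0],
    )
    room = list(capacity)
    assigned = [None] * len(likelihoods)
    for _lik, c, k in pairs:
        if assigned[c] is None and room[k] > 0:  # skip filled classes
            assigned[c] = k
            room[k] -= 1
    return assigned


def cascade_cost(moves, method):
    """Cost of one cascade.

    moves: one (L_i, L_{i+1}) pair per chromosome moved, i.e. its likelihood
    for the class it leaves and for the class it enters.
    """
    if method == "RC2":  # C_i = -L_{i+1}, CC = max(C_i)
        return max(-l_to for _l_from, l_to in moves)
    if method == "RC3":  # C_i = L_i / L_{i+1}, CC = product of C_i
        cost = 1.0
        for l_from, l_to in moves:
            cost *= l_from / l_to
        return cost
    raise ValueError("unknown method: " + method)


def plausible(l_to, l_max, k=0.15):
    """Constraint (i): the target likelihood must exceed k times the maximum."""
    return l_to > k * l_max


# Hypothetical toy values.
TOY = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
print(benchmark_assign(TOY, [2, 2]))  # [0, 0, 1, 1]

single_move = [(0.6, 0.4)]
two_moves = [(0.9, 0.5), (0.9, 0.5)]
for method in ("RC2", "RC3"):
    print(method, cascade_cost(single_move, method), cascade_cost(two_moves, method))
```

With these toy values the two cost functions rank the candidate cascades differently. RC2 looks only at the likelihoods of the target classes and prefers the two-move cascade, whereas RC3 prefers the single move because it loses less cell classification likelihood.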
A parallel rearrangement classifier can be achieved by a relaxation process in which class likelihood values are modified by a constraint function derived from the actual number of chromosomes present in each class. Thus if a class is oversubscribed, the likelihoods of all chromosomes belonging to that class are simultaneously reduced, and similarly if a class is deficient, all likelihood values for that class are increased. The method was implemented as follows:

Let $E_k$ be the expected size of class $G_k$, and $L_{c,k,i}$ be the 'likelihood' in iteration $i$ that chromosome $c$ is of class $G_k$. At each iteration $i$ of the relaxation, the chromosomes are individually classified according to their current maximum 'likelihood'. The resulting size of each class is $N_{k,i}$. The difference $E_k - N_{k,i}$ provides a compatibility constraint used to modify the class likelihoods for the next iteration, as follows:

$$L_{c,k,i+1} = \bigl(1 + C_i \cdot \operatorname{sign}(E_k - N_{k,i})\bigr) \cdot L_{c,k,i}, \qquad (1)$$

where $C_i$ is a weighting factor and the value of the function sign() is +1 or -1 as appropriate. While developing the method, the constraint function

$$\bigl(1 + C_i \cdot (E_k - N_{k,i})\bigr) \qquad (2)$$

was rejected in favour of the function in (1) above, which gave a better overall performance on the data set.

The weighting factor $C_i$ clearly also affects the performance of the algorithm. If it is too small, then the compatibility constraint will be weak in that the alteration of likelihoods will be small at each iteration, with the result that convergence to a better overall classification will be slow. On the other hand, if $C_i$ is large then the algorithm may tend to oscillate between two possible solutions. Several fixed values of $C_i$ were tried, but it turned out to be better to make $C_i$ variable, starting with quite a large value and decaying as follows:

$$C_{i+1} = D \cdot C_i. \qquad (3)$$

A number of values of $C_0$ and $D$ were tested experimentally. The difference in overall performance measured over the entire data set was not great, and the final choice was $C_0 = 0.4$ and $D = 0.85$. The iteration is stopped if either it converges so that $N_{k,i}$ equals $E_k$ for all classes $G_k$, or all but one class (which is the relevant convergence condition in the relatively common cases of cells with either 45 or 47 chromosomes), or after ten iterations (which provides an inherent plausibility constraint on the rearrangement).
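The relaxation of equations (1) and (3) can be sketched as follows. This is a minimal illustration rather than the implementation used in the study. The function and parameter names and the toy data are hypothetical. One point is an assumption because the text does not specify it: the likelihoods of a class whose size already equals the expected size are left unchanged, i.e. sign(0) is taken as 0.

```python
def classify(likelihoods):
    """Independent classification by current maximum 'likelihood'."""
    return [max(range(len(row)), key=row.__getitem__) for row in likelihoods]


def relax(likelihoods, expected, c0=0.4, decay=0.85, max_iter=10):
    """RC4-style relaxation.

    likelihoods[c][k]: initial likelihood that chromosome c is of class k.
    expected[k]: expected size E_k of class k.
    c0 and decay are C_0 and D of equation (3).
    """
    current = [list(row) for row in likelihoods]  # do not modify the input
    weight = c0
    for _ in range(max_iter):
        assignment = classify(current)
        sizes = [assignment.count(k) for k in range(len(expected))]
        diffs = [e - n for e, n in zip(expected, sizes)]
        # Stop when all classes, or all but one, have the expected size.
        if sum(1 for d in diffs if d != 0) <= 1:
            break
        for row in current:
            for k, d in enumerate(diffs):
                sign = (d > 0) - (d < 0)  # assumption: 0 when the class is exact
                row[k] *= 1.0 + weight * sign  # equation (1)
        weight *= decay  # equation (3)
    return classify(current)


# Hypothetical toy cell: 4 chromosomes, 2 classes, 2 expected per class.
TOY = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
print(relax(TOY, [2, 2]))  # [0, 0, 1, 1]
```

In the toy cell the first pass puts three chromosomes in class 0. One relaxation step scales the class 0 likelihoods by 0.6 and the class 1 likelihoods by 1.4, after which the third chromosome moves to class 1 and the iteration stops.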


Data and experimental procedure

The database was derived from about 120 human peripheral blood cells, containing about 5500 chromosomes, digitised with a Bosch Chalnicon camera directly from microscope slides. The images were automatically segmented by thresholding and then the segmentation corrected by manual interaction. A vector of 28 feature values was measured totally automatically on each chromosome. The database was divided into two subsets A and B, and the experimental procedure was to train a classifier with set A and test with set B, and then train with set B and test with set A, and take the mean of the classifier success rates so obtained. Correct chromosome class for training the classifier was provided by an experienced cytogeneticist.

In all cases, the initial computation of class likelihoods was performed by a likelihood classifier on the assumption that feature values are normally distributed within classes (Piper, 1986). These initial class likelihoods were then used as input to the rearrangement classifiers. It has been shown that with our set of chromosome features, the maximum likelihood classification obtained when only the trace of the covariance matrix is used is almost as accurate and a great deal cheaper computationally than when the entire covariance matrix is used (Piper, 1985). The rearrangement classifiers were therefore applied to the results of both versions of the likelihood classifier.

The number of features selected from the panel of 28 for use in the initial classifier was varied, using an automatic feature selection method (Piper, 1986), in order to determine the effect, if any, of an improved initial classification (which results from an increase in the number of features used (Piper, 1986)) on the classification accuracy after rearrangement.

Some preliminary experiments were performed to try to optimise parameters in the cascade and relaxation methods (RC2, RC3, RC4), and only the best of these classifiers are compared here. The classification success rates before and after rearrangement are presented in Table 1.
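The train-on-A, test-on-B procedure with a normal-distribution likelihood classifier can be sketched as below. This is a hedged illustration only. It assumes that using "only the trace of the covariance matrix" means keeping the per-feature variances and ignoring the covariances between features, and it adds a small variance floor for numerical safety. All function names and the toy data are hypothetical, and the full-covariance variant is not shown.

```python
import math


def fit(samples, labels):
    """Per-class mean and variance of each feature (covariances ignored)."""
    params = {}
    for k in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == k]
        n, dim = len(rows), len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(dim)]
        # 1e-9 is an assumed floor to avoid division by zero.
        var = [sum((r[j] - mean[j]) ** 2 for r in rows) / n + 1e-9 for j in range(dim)]
        params[k] = (mean, var)
    return params


def log_likelihood(x, mean, var):
    """Log density of x under independent normal features."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xj - m) ** 2 / v)
        for xj, m, v in zip(x, mean, var)
    )


def accuracy(params, samples, labels):
    """Fraction of samples whose maximum likelihood class is correct."""
    hits = 0
    for x, y in zip(samples, labels):
        predicted = max(params, key=lambda k: log_likelihood(x, *params[k]))
        hits += predicted == y
    return hits / len(labels)


def swap_validate(set_a, set_b):
    """Train on A and test on B, then the reverse, and return the mean rate."""
    (xa, ya), (xb, yb) = set_a, set_b
    return 0.5 * (accuracy(fit(xa, ya), xb, yb) + accuracy(fit(xb, yb), xa, ya))


# Hypothetical toy data: two well separated classes, two features.
SET_A = ([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]], [0, 0, 1, 1])
SET_B = ([[0.1, 0.2], [0.3, 0.1], [4.9, 5.0], [5.1, 5.2]], [0, 0, 1, 1])
print(swap_validate(SET_A, SET_B))
```

Exponentiating the per-class values returned by `log_likelihood` gives class likelihoods of the kind passed to the rearrangement classifiers.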


Table 1
Classification accuracy of the original maximum likelihood classifier, and the improvement in classification accuracy resulting from rearrangement, both per cent, tabulated against the number of features used in the classifier. Table 1a is for the maximum likelihood classifier using full within-class covariance matrices, while Table 1b is for the classifier using the trace of the covariance matrices only.

(a)
Number of features in likelihood classifier      6      9     12     16     20     24
Maximum likelihood classification             74.9   78.9   80.5   81.5   80.5   79.7
RC1                                            2.4    2.6    1.8    0.7    1.5    1.6
RC2                                            0.1    1.7    1.4    2.2    2.5    2.9
RC3                                            4.0    3.5    2.6    2.2    3.1    3.0
RC4                                            3.7    2.9    2.5    1.3    2.3    2.7

(b)
Number of features in likelihood classifier      6      9     12     16     20     24
Maximum likelihood classification             74.4   78.2   79.7   81.0   80.2   81.2
RC1                                            2.5    2.4    2.0    1.7    1.6    1.2
RC2                                            0.8    1.4    1.7    2.3    2.8    2.7
RC3                                            3.6    3.4    3.2    3.0    3.5    3.0
RC4                                            3.6    3.2    3.2    2.9    3.3    2.8

Conclusions

All rearrangement strategies result in small improvements to the accuracy of chromosome classification. Furthermore, all improve the 'appearance' of a classified cell by reducing the number of classes which contain an unexpected number of chromosomes. The method RC2 (Rutovitz, 1977) can be improved by using an alternative cost function, resulting in method RC3 which has the best overall performance of the methods tested. It is computationally acceptable, taking about 1 second to execute on an M68000 processor. The relaxation method RC4 is also very promising; further work on an acceptable convergence criterion is in hand.

From Table 1 it can be seen that the improvement in classification accuracy from methods RC3 and RC4 was relatively independent of the number of features used in the original classifier (which corresponds to some extent to the accuracy of the initial classification). RC1 appears to give less improvement, while conversely RC2 performs better, as the number of features used increased. All methods performed approximately the same whether the entire covariance matrix was used or only the trace.
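The ranking of the methods can be checked by averaging the improvements in Table 1 over the six feature counts. The short Python sketch below does this arithmetic; the numbers are copied from Table 1, while the variable and function names are hypothetical and the unweighted mean is only one possible summary of "overall performance".

```python
FEATURES = [6, 9, 12, 16, 20, 24]

# Improvement in classification accuracy (per cent), copied from Table 1.
IMPROVEMENT = {
    "full covariance": {
        "RC1": [2.4, 2.6, 1.8, 0.7, 1.5, 1.6],
        "RC2": [0.1, 1.7, 1.4, 2.2, 2.5, 2.9],
        "RC3": [4.0, 3.5, 2.6, 2.2, 3.1, 3.0],
        "RC4": [3.7, 2.9, 2.5, 1.3, 2.3, 2.7],
    },
    "trace only": {
        "RC1": [2.5, 2.4, 2.0, 1.7, 1.6, 1.2],
        "RC2": [0.8, 1.4, 1.7, 2.3, 2.8, 2.7],
        "RC3": [3.6, 3.4, 3.2, 3.0, 3.5, 3.0],
        "RC4": [3.6, 3.2, 3.2, 2.9, 3.3, 2.8],
    },
}


def mean_improvement(table):
    """Unweighted mean improvement of each method over all feature counts."""
    return {method: sum(v) / len(v) for method, v in table.items()}


def best_method(table):
    """Method with the largest mean improvement."""
    means = mean_improvement(table)
    return max(means, key=means.get)


for name, table in IMPROVEMENT.items():
    rounded = {m: round(v, 2) for m, v in mean_improvement(table).items()}
    print(name, rounded, "best:", best_method(table))
```

On this summary RC3 has the largest mean improvement for both versions of the likelihood classifier, with RC4 second, which agrees with the conclusions stated above.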

Acknowledgement

The work reported here was supported entirely by the UK Medical Research Council.

References

Granum, E. (1982). Application of statistical and syntactical methods of analysis and classification to chromosome data. In: J. Kittler, K.S. Fu and L.F. Pau, Eds., NATO ASI Series no. C. 81: Pattern Recognition Theory and Applications, D. Reidel, Dordrecht, 373-398.
Lundsteen, C., T. Gerdes, E. Granum, J. Philip and K. Philip (1981). Automatic chromosome analysis II: Karyotyping of banded human chromosomes using band transition sequences. Clinical Genetics 19, 26-36.
Lundsteen, C., T. Gerdes and J. Maahr (1986). Automatic classification of chromosomes as part of a routine system for clinical analysis. Cytometry 7, 1-7.
Piper, J. (1986). The effect of zero feature correlation assumption on maximum likelihood based classification of chromosomes. Signal Processing 12(1), in preparation.
Rutovitz, D. (1977). Chromosome classification and segmentation as exercises in knowing what to expect. In: E.W. Elcock and D. Michie, Eds., Machine Intelligence 8, Ellis Horwood, London, 455-472.
Slot, R.E. (1979). On the profit of taking into account the known number of objects per class in classification methods. IEEE Trans. Inform. Theory 25, 484-488.
Tso, M.K.S. and J. Graham (1983). The transportation algorithm as an aid to chromosome classification. Pattern Recognition Letters 1, 489-496.
