Pattern Recognition Letters 18 (1997) 1373–1377

Strategies for combining classifiers employing shared and distinct pattern representations ¹

J. Kittler *, A. Hojjatoleslami, T. Windeatt

University of Surrey, Guildford, Surrey, GU2 5XH, UK

* Corresponding author.
¹ This work has been supported by EPSRC Grant GR/J89255.

Abstract

The problem of combining multiple classifiers which employ mixed mode representations consisting of some shared and some distinct features is studied. Two combination strategies are developed and experimentally compared on mammographic data to demonstrate their effectiveness. © 1997 Elsevier Science B.V.

Keywords: Classification; Multiple expert fusion

1. Introduction

Combination of classifiers has received considerable attention in the pattern recognition literature during the last quinquennium. For a number of years various strategies for combining different classifier designs have been used to improve the performance of a pattern recognition system. The advocated classifier combination methodology has been largely heuristic, although some attempts to provide a common underlying framework for at least some combination rules have recently been reported. The emerging classification of approaches involves four categories: multiple classifiers using an identical pattern representation; multiple classifiers using distinct pattern representations; data dependent combination strategies; and multistage classifiers.


For the first category it has been shown by Tumer and Ghosh (1996) for discriminant function classifiers, and by Kittler (1997) for classifiers approximating a Bayes decision rule by computing the a posteriori class probabilities, that any improvement in performance derives from the well-known principles of Bayesian estimation theory, i.e. reducing estimation errors by virtue of using a larger number of samples. For the second category, Kittler et al. (1996) showed that many existing combination schemes can be developed from a common Bayesian framework. This has recently been extended to take into account the confidence of the individual experts in the computed a posteriori probabilities. The last two categories of approaches are exemplified by the data-dependent multiple expert scheme (Woods et al., in review), where the decision about the class membership of each unknown pattern is made by the locally most reliable expert, and by multistage classifiers (Ho et al., 1994; Xu et al., 1992), respectively.


In this paper we shall focus on the first two categories of multiple expert decision making and address the problem of combining classifiers where the representation used does not fall neatly into either the shared or the distinct representation case. Specifically, we shall consider the case where the features employed by each classifier are partly shared with the other classifiers and partly distinct.

The paper is organised as follows. In the next section we introduce the necessary formalism and develop a combination scheme appropriate for the mixed mode data. In Section 3 we apply the combination scheme to the problem of automatic microcalcification detection in digital mammograms, where the opinions of four different machine experts are drawn on to improve the system performance. Finally, in Section 4 we offer some conclusions.

2. Theoretical framework

Consider a pattern recognition problem where pattern Z is to be assigned to one of the m possible classes {ω_1, …, ω_m}. Let us assume that we have R classifiers, each representing the given pattern by a measurement vector x_i, i = 1, …, R. The components of each pattern vector x_i can be divided into two groups, forming vectors y and ξ_i, i.e. x_i = [yᵀ, ξ_iᵀ]ᵀ, where the vector of measurements y is shared by all R classifiers, whereas ξ_i is specific to the ith classifier. We shall assume that, given a class identity, the classifier specific part of the pattern representation ξ_i is conditionally independent of ξ_j, j ≠ i. In the measurement space each class ω_k is modelled by the probability density function p(x_i | ω_k) and its a priori probability of occurrence is denoted by P(ω_k). We shall consider the models to be mutually exclusive, which means that only one model can be associated with each pattern.

Now, according to Bayesian theory, given measurements x_i, i = 1, …, R, the pattern Z should be assigned to class ω_j, i.e. its label θ should assume the value θ = ω_j, provided the a posteriori probability of that interpretation is maximum, i.e.

assign θ → ω_j  if

P(\theta = \omega_j \mid x_1, \ldots, x_R) = \max_{k} P(\theta = \omega_k \mid x_1, \ldots, x_R).    (1)

Let us rewrite the a posteriori probability P(θ = ω_k | x_1, …, x_R) using the Bayes theorem. We have

P(\theta = \omega_k \mid x_1, \ldots, x_R) = \frac{p(x_1, \ldots, x_R \mid \theta = \omega_k)\, P(\omega_k)}{p(x_1, \ldots, x_R)},    (2)

where p(x_1, …, x_R | θ = ω_k) and p(x_1, …, x_R) are the conditional and unconditional joint measurement probability densities. The latter can be expressed in terms of the conditional measurement distributions as p(x_1, …, x_R) = Σ_{j=1}^{m} p(x_1, …, x_R | θ = ω_j) P(ω_j), and therefore, in the following, we can concentrate only on the numerator of (2). We commence by expressing p(x_1, …, x_R | θ = ω_k) as

p(x_1, \ldots, x_R \mid \theta = \omega_k) = p(\xi_1, \ldots, \xi_R \mid y, \theta = \omega_k)\, p(y \mid \theta = \omega_k).    (3)
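To make the notation concrete, the following minimal Python sketch (an editorial illustration, not the authors' implementation) assembles mixed mode pattern vectors x_i = [yᵀ, ξ_iᵀ]ᵀ from a shared part and a classifier-specific part and applies the Bayes decision rule (1)–(2) to a single classifier's representation; the density functions, priors and numbers are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def make_mixed_vectors(y, xis):
    """Build x_i = [y^T, xi_i^T]^T for each classifier from the shared part y
    and the list of classifier-specific parts xi_i (illustrative helper)."""
    return [np.concatenate([y, xi]) for xi in xis]

def bayes_assign(x, class_densities, priors):
    """Decision rule (1)-(2) for one representation x:
    posterior(k) is proportional to p(x | omega_k) P(omega_k).
    `class_densities` is a list of callables p(x | omega_k); `priors` lists P(omega_k)."""
    scores = np.array([p(x) * P for p, P in zip(class_densities, priors)])
    posteriors = scores / scores.sum()          # normalisation by p(x), as in Eq. (2)
    return int(np.argmax(posteriors)), posteriors

# Toy usage with two Gaussian class-conditional densities (made-up numbers)
y = np.array([0.2, 1.1])                           # shared measurements
xis = [np.array([0.5]), np.array([1.5, -0.3])]     # classifier-specific parts
x1, x2 = make_mixed_vectors(y, xis)
densities_1 = [mvn(mean=np.zeros(3)).pdf, mvn(mean=np.ones(3)).pdf]
label, post = bayes_assign(x1, densities_1, priors=[0.5, 0.5])
```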

Recalling our assumption that the classifier specific representations ξ_i, i = 1, …, R, are conditionally statistically independent, we can write

p(x_1, \ldots, x_R \mid \theta = \omega_k) = \prod_{i=1}^{R} p(\xi_i \mid y, \theta = \omega_k)\, p(y \mid \theta = \omega_k),    (4)

which, assuming that the shared measurements are conditionally independent of the classifier specific ones, can be expressed as

p(x_1, \ldots, x_R \mid \theta = \omega_k) = \Biggl[ \prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid y, \xi_i)\, p(y, \xi_i)}{P(\omega_k \mid y)\, p(y)} \Biggr] \frac{P(\omega_k \mid y)\, p(y)}{P(\omega_k)},    (5)
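The step from (4) to (5) rests on applying Bayes' theorem to each factor; spelled out (an editorial expansion, not text from the original), it reads

```latex
% Each factor of (4) rewritten with Bayes' theorem (editorial expansion):
\begin{align*}
p(\xi_i \mid y, \theta=\omega_k)
  &= \frac{p(y, \xi_i \mid \theta=\omega_k)}{p(y \mid \theta=\omega_k)}
   = \frac{P(\theta=\omega_k \mid y, \xi_i)\, p(y, \xi_i)\,/\,P(\omega_k)}
          {P(\omega_k \mid y)\, p(y)\,/\,P(\omega_k)}
   = \frac{P(\theta=\omega_k \mid y, \xi_i)\, p(y, \xi_i)}{P(\omega_k \mid y)\, p(y)}, \\[4pt]
p(y \mid \theta=\omega_k)
  &= \frac{P(\omega_k \mid y)\, p(y)}{P(\omega_k)}.
\end{align*}
% Substituting these into (4), and noting that (y, \xi_i) is simply x_i, gives (5) and (6).
```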

and finally

p(x_1, \ldots, x_R \mid \theta = \omega_k) = \Biggl[ \prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid x_i)\, p(x_i)}{P(\omega_k \mid y)\, p(y)} \Biggr] \frac{P(\omega_k \mid y)\, p(y)}{P(\omega_k)}.    (6)

Let us pause to look at the meaning of the terms defining p(x_1, …, x_R | θ = ω_k). First of all, P(θ = ω_k | x_i) is the kth class a posteriori probability computed by each of the R classifiers, whereas P(ω_k | y) is the kth class probability based on the shared features.
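As a sanity check on (6), the following short Python sketch (an editorial illustration with made-up numbers, not from the paper) constructs a toy two-class problem whose classifier-specific features satisfy the stated conditional independence assumptions and verifies numerically that the product form (6) reproduces the joint conditional density.

```python
import itertools

# Toy two-class, binary-feature check of the factorisation (6); all numbers are illustrative.
priors = {0: 0.6, 1: 0.4}                                   # P(omega_k)
p_y    = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}         # p(y | omega_k)
p_xi1  = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},    # p(xi_1 | y, omega_k)
          1: {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}}
p_xi2  = {0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.6, 1: 0.4}},    # p(xi_2 | y, omega_k)
          1: {0: {0: 0.1, 1: 0.9}, 1: {0: 0.7, 1: 0.3}}}

def p_x_given_k(k, y, xi, p_xi):        # p(x_i | omega_k) with x_i = (y, xi_i)
    return p_y[k][y] * p_xi[k][y][xi]

def posterior(y, xi, p_xi, k):          # returns P(omega_k | x_i) and p(x_i), by Bayes' theorem
    num = p_x_given_k(k, y, xi, p_xi) * priors[k]
    den = sum(p_x_given_k(j, y, xi, p_xi) * priors[j] for j in priors)
    return num / den, den

for k, y, xi1, xi2 in itertools.product([0, 1], repeat=4):
    # Left-hand side of (6): the joint conditional density under the independence assumptions
    lhs = p_y[k][y] * p_xi1[k][y][xi1] * p_xi2[k][y][xi2]
    # Right-hand side of (6), built from classifier posteriors and the shared-feature posterior
    p_y_marg = sum(p_y[j][y] * priors[j] for j in priors)          # p(y)
    p_k_given_y = p_y[k][y] * priors[k] / p_y_marg                 # P(omega_k | y)
    rhs = p_k_given_y * p_y_marg / priors[k]
    for xi, p_xi in ((xi1, p_xi1), (xi2, p_xi2)):
        post, p_x = posterior(y, xi, p_xi, k)
        rhs *= post * p_x / (p_k_given_y * p_y_marg)
    assert abs(lhs - rhs) < 1e-12, (k, y, xi1, xi2)
print("Factorisation (6) verified on the toy example.")
```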


p(x_i) and p(y) are the mixture measurement densities of the representations used for decision making by each of the experts. Since the measurement densities are independent of the class labels, they can be cancelled out by the normalising term in the expression for the a posteriori probability in (2), and we obtain the decision rule

assign θ → ω_j  if

P(\theta = \omega_j \mid y) \prod_{i=1}^{R} \frac{P(\theta = \omega_j \mid x_i)}{P(\theta = \omega_j \mid y)} = \max_{k=1}^{m} \; P(\theta = \omega_k \mid y) \prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid x_i)}{P(\theta = \omega_k \mid y)},    (7)

which combines the individual classifier outputs in terms of a product. Each factor in the product for class ω_k is normalised by the a posteriori probability of the class given the shared representation.

Now let us consider the ratio P(θ = ω_k | x_i) / P(θ = ω_k | y) and suppose it is close to one. We can then write P(θ = ω_k | x_i) = P(ω_k | y)(1 + Δ_ki). Substituting into (7) and linearising the product by expanding it and neglecting all terms of second order and higher, the decision rule becomes

assign θ → ω_j  if

(1 - R)\, P(\theta = \omega_j \mid y) + \sum_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1}^{m} \Biggl[ (1 - R)\, P(\theta = \omega_k \mid y) + \sum_{i=1}^{R} P(\theta = \omega_k \mid x_i) \Biggr].    (8)

Note that the classifier combination rules (7) and (8) are expressed in terms of the a posteriori class probabilities returned by the individual classifiers using mixed representations and the a posteriori class probability based on the shared representation. Each classifier provides an independent estimate of the latter. It is therefore sensible to average these values to obtain a more reliable estimate P̂(θ = ω_k | y), i.e.

\hat{P}(\theta = \omega_k \mid y) = \frac{1}{R} \sum_{i=1}^{R} P_i(\theta = \omega_k \mid y),    (9)

where P_i(θ = ω_k | y) is the a posteriori probability computed by the ith classifier.

Further, it is worth noting that when the shared features are noninformative, the a posteriori probabilities P̂(θ = ω_k | y), ∀k, will be comparable and therefore the term (1 − R) P̂(θ = ω_k | y) can be omitted from both sides of the decision rule (8), giving the combination rule

assign θ → ω_j  if

\sum_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1}^{m} \sum_{i=1}^{R} P(\theta = \omega_k \mid x_i).    (10)

Even if the shared features are informative, it may be beneficial to ignore this term if the estimation errors on P̂(θ = ω_k | y), ∀k, are non-negligible, as the effect of these errors on the decision rule will be amplified by the factor (1 − R). Thus decision rule (10) is an important alternative for the combination of multiple experts employing mixed mode representations.
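The expressions (7)–(10) translate directly into a few lines of code. The sketch below (a minimal Python illustration written for this edited version, not the authors' software) assumes each classifier supplies its posterior P(ω_k | x_i) and its shared-feature posterior P_i(ω_k | y) as arrays over the m classes, and uses the averaged estimate (9) wherever the shared-feature posterior appears.

```python
import numpy as np

def combine_mixed_experts(post_x, post_y, rule="linearised"):
    """Combine R classifiers that use shared + distinct features.

    post_x : (R, m) array, P(omega_k | x_i) from each classifier i.
    post_y : (R, m) array, P_i(omega_k | y) from each classifier, shared features only.
    rule   : "product"    -> rule (7),
             "linearised" -> rule (8),
             "sum"        -> rule (10), shared-feature term omitted.
    Returns the index of the winning class.
    """
    R, m = post_x.shape
    p_hat_y = post_y.mean(axis=0)                      # averaged estimate, Eq. (9)

    if rule == "product":                              # Eq. (7)
        scores = p_hat_y * np.prod(post_x / p_hat_y, axis=0)
    elif rule == "linearised":                         # Eq. (8)
        scores = (1 - R) * p_hat_y + post_x.sum(axis=0)
    elif rule == "sum":                                # Eq. (10)
        scores = post_x.sum(axis=0)
    else:
        raise ValueError(rule)
    return int(np.argmax(scores))

# Hypothetical outputs of R = 3 classifiers for an m = 2 class problem
post_x = np.array([[0.70, 0.30], [0.60, 0.40], [0.55, 0.45]])
post_y = np.array([[0.52, 0.48], [0.50, 0.50], [0.49, 0.51]])
for r in ("product", "linearised", "sum"):
    print(r, combine_mixed_experts(post_x, post_y, rule=r))
```

With these particular numbers all three rules select the same class; they can disagree when the shared-feature posterior estimates are poor, which is precisely the situation rule (10) is intended to sidestep.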

3. Experimental results

The proposed combination strategies have been applied to the problem of detecting microcalcifications in digital mammograms. The problem is to label suspected regions as normal or abnormal. A data set of 227 mammograms is used in the study. It includes 22 images with microcalcifications and 205 normal mammograms. A preprocessing step outputs 39 measurements, which are used as a basis for designing four different classifiers: a Radial Basis Function (RBF) classifier, a Multilayer Perceptron (MLP), a K-Nearest Neighbour (K-NN) method, and a Gaussian classifier.

Table 1
The performance of the classifiers and of their combination when only the four shared features are used, on the independent test sets A and B.

Classifier   | No. of features | Set A Error 1 | Set A Error 2 (%) | Set B Error 1 | Set B Error 2 (%)
RBF          | 4               | 1.0           | 40                | 1.86          | 31
MLP          | 4               | 0.62          | 36                | 1.44          | 34
K-NN         | 4               | 0.70          | 31                | 1.8           | 32
Gaussian     | 4               | 0.90          | 32                | 7.2           | 46
Mean Comb.   | 4               | 0.55          | 25                | 2.7           | 27


Table 2
The number of features used by each classifier and the errors produced on the independent test sets A and B.

Classifier      | No. of features | Set A Error 1 | Set A Error 2 (%) | Set B Error 1 | Set B Error 2 (%)
FS1/RBF         | 7               | 0.44          | 12                | 0.98          | 15
FS2/MLP         | 13              | 0.40          | 21                | 0.62          | 19
FS3/K-NN        | 11              | 0.62          | 21                | 1.15          | 22
FS4/Gaussian    | 17              | 0.49          | 24                | 0.69          | 32
Mean Comb. (8)  | —               | 0.37          | 20                | 0.61          | 21
Mean Comb. (10) | —               | 0.30          | 12                | 0.78          | 18

The feature sets used by these classifiers are not identical, but they include four features shared by all four experts. The classifiers were trained on 320 microcalcifications, which were extracted from three abnormal images, and on 960 blob-like regions extracted from five normal images. The eight images used for training were excluded from the database, and the remaining images were divided into two sets, Set A and Set B. Set A has been used to identify the prior probabilities that would guarantee the detection of all microcalcifications in the set. The performance measures are Error 1, which measures the rate of false positive clusters (at least three microcalcifications within a prespecified radius) per image, and Error 2, which gives the percentage of normal images misclassified as abnormal. The results are obtained on Set B, and then the roles of the two sets are interchanged. The classification rates achieved are presented in Tables 1 and 2.

The results in Table 1 show the improvements gained by averaging the posterior class probabilities based on the shared representation. Table 2 presents the performance rates of the individual experts and the effect of multiple expert fusion using the combination rules (8) and (10). It is apparent that both combiners improve the classification rates, but rule (10) outperforms rule (8), which suggests that the class a posteriori probabilities based on the shared features are subject to substantial estimation errors.
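For concreteness, the two error measures described above might be computed along the following lines. This is an editorial Python sketch; in particular, the cluster criterion (at least three detections within a prespecified radius) and the abnormality criterion for whole images are implemented with assumed interpretations that the paper does not spell out.

```python
import numpy as np

def count_fp_clusters(detections, truth, radius):
    """Error 1 helper: clusters of >= 3 false-positive detections in one image.
    `detections` and `truth` are (N, 2) arrays of image coordinates; a detection is a
    false positive if no true microcalcification lies within `radius` of it.
    The grouping below is an assumed interpretation of the clustering criterion."""
    if len(detections) == 0:
        return 0
    if len(truth) == 0:
        fps = detections
    else:
        d = np.linalg.norm(detections[:, None, :] - truth[None, :, :], axis=2)
        fps = detections[d.min(axis=1) > radius]
    clusters, used = 0, np.zeros(len(fps), dtype=bool)
    for i in range(len(fps)):
        if used[i]:
            continue
        near = np.linalg.norm(fps - fps[i], axis=1) <= radius
        if near.sum() >= 3:            # at least three false positives within the radius
            clusters += 1
            used |= near
    return clusters

def error_measures(per_image_detections, per_image_truth, normal_flags, radius=10.0):
    """Error 1: mean number of false-positive clusters per image.
    Error 2: percentage of normal images labelled abnormal
    (assumed criterion: any detection in a normal image counts as abnormal)."""
    e1 = np.mean([count_fp_clusters(d, t, radius)
                  for d, t in zip(per_image_detections, per_image_truth)])
    normals = [d for d, n in zip(per_image_detections, normal_flags) if n]
    e2 = 100.0 * (np.mean([len(d) > 0 for d in normals]) if normals else 0.0)
    return e1, e2
```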

4. Conclusions

The problem of combining multiple classifiers which employ mixed mode representations consisting of some shared and some distinct features was addressed. Two combination strategies were developed and experimentally compared on mammographic data to demonstrate their effectiveness.

Discussion

Roli: My first question is the following: your combination method assumes the independence condition. I agree with you that using shared features and separate features makes it likely that this condition is satisfied. Have you verified that assumption? The second question is: is it possible, and do you think it would be useful, to investigate which set of features is best shared in order to obtain the most independent classifiers, so that the condition you are assuming is satisfied?

Kittler: The answer to the first question is that we looked at the correlations between the distinct sets of features used by each classifier. There was definitely a block structure in the covariance matrix. There were no correlations to worry about, so certainly the assumption was more or less satisfied. As to the second question: of course you can investigate anything, and it would be a very interesting study to find out what the relative benefit would be of using many features as shared features and adding some distinct features afterwards, or whether you are better off just using distinct features. But that is a big study.
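Kittler's remark about the block structure of the covariance matrix can be checked mechanically. The sketch below (an editorial Python illustration, not code from the study) estimates the correlation matrix of the pooled classifier-specific features and reports the largest absolute correlation between features belonging to different blocks; the data and block layout are hypothetical.

```python
import numpy as np

def max_cross_block_correlation(X, blocks):
    """X: (n_samples, n_features) matrix of the pooled classifier-specific features.
    blocks: list of index lists, one per classifier's distinct feature set.
    Returns the largest |correlation| between features from different blocks; a small
    value informally supports the independence assumption.
    Strictly, the assumption is class-conditional, so one would apply this per class."""
    C = np.corrcoef(X, rowvar=False)
    worst = 0.0
    for a, ia in enumerate(blocks):
        for b, ib in enumerate(blocks):
            if a < b:
                worst = max(worst, np.abs(C[np.ix_(ia, ib)]).max())
    return worst

# Hypothetical usage: 3 classifiers with 2, 3 and 2 distinct features each
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
print(max_cross_block_correlation(X, blocks=[[0, 1], [2, 3, 4], [5, 6]]))
```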

References

Ho, T.K., Hull, J.J., Srihari, S.N., 1994. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Machine Intell. 16 (1), 66–75.
Kittler, J., 1997. Improving recognition rates by classifier combination: A theoretical framework. In: Downton, A., Impedovo, S. (Eds.), Handwriting Recognition. World Scientific, Singapore.
Kittler, J., Hatef, M., Duin, R.P.W., 1996. Combining classifiers. In: Proc. 13th Internat. Conf. on Pattern Recognition, Vienna, Vol. II, Track B, pp. 897–901.
Tumer, K., Ghosh, J., 1996. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29, 341–348.
Woods, K.S., Bowyer, K., Kegelmeyer, W.P., in review. Combination of multiple classifiers using local accuracy estimates. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 391–396.
Xu, L., Krzyzak, A., Suen, C.Y., 1992. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems Man Cybernet. 22 (3), 418–435.