A note on some feature selection criteria


Pattern Recognition Letters 10 (1989) 155-158, North-Holland

September 1989

C.E. QUEIROS, Dept. Comp. Science, FCT/UNL, 2825 Monte Caparica, Portugal

E.S. GELSEMA, Dept. Med. Inf., Erasmus University, Rotterdam, The Netherlands

Received 19 January 1989

Abstract: In this correspondence, results are presented which show the need for further studies concerning some selection criteria used in feature selection algorithms.

Key words: Feature selection, selection criteria, error estimation.

1. Introduction

Feature selection algorithms are characterized by a search procedure, a selection criterion and a stopping rule. These may be roughly defined as, respectively, the strategy for moving in the domain of all possible combinations of features, the quantity used to measure the quality of a given set of features, and the set of rules that, if met, will cause the algorithm to stop. For each candidate set of features, a value of the selection criterion is computed. This value governs the selection process, i.e. a feature set is selected in accordance with the value of the selection criterion. Normally, the selection criterion is an explicit (or implicit) function of the error rate and has to be estimated on the basis of a given set of samples. Without loss of generality, let us assume that the optimization of the selection criterion corresponds to a maximization. The maximum of the criterion function over a set of values is then a random variable, which may be represented by

    S = max(s_1, s_2, ..., s_n)    (1)

where each random variable s_j is the value of the criterion associated with a given feature set, and n is the number of feature sets tested. The distribution of the variable S is extremely difficult to obtain for the most popular selection criteria. Therefore, it is not surprising that whenever a new criterion is proposed, the properties presented (e.g. bias and variance) are the properties of s_j. The aim of this correspondence is to show that the differences between S and s_j may be significant. This is done by presenting the results of some sampling experiments. The next section introduces the experimental setup; Section 3 contains the results of the experiments and final remarks, and concludes this correspondence.
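The gap between S and s_j is easy to observe in a small simulation. The sketch below (all sizes and the true criterion value are assumptions made for illustration, not taken from the experiments that follow) draws n unbiased estimates s_j of the same underlying quantity and compares the mean of a single s_j with the mean of S:

```python
import numpy as np

rng = np.random.default_rng(0)

n_feature_sets = 45   # hypothetical number of candidate feature sets (n)
n_trials = 10_000     # Monte Carlo repetitions
n_test = 64           # hypothetical sample size behind each estimate
true_value = 0.70     # every candidate is assumed to be equally good

# Each s_j is an unbiased binomial estimate of the same true criterion value.
s = rng.binomial(n_test, true_value, size=(n_trials, n_feature_sets)) / n_test

print("mean of a single s_j:  %.3f" % s[:, 0].mean())        # ~0.70, unbiased
print("mean of S = max_j s_j: %.3f" % s.max(axis=1).mean())  # clearly above 0.70
```

Even though each s_j is unbiased, the maximization step makes S optimistically biased, which is the effect studied in the experiments below.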

2. Experimental setup

Three feature spaces defined by a number of binary features were established. For each space, two classes were defined, with equal a priori probabilities assumed to be known. The experiments consisted of the execution of various feature selection algorithms in these spaces. The selection stopped after one feature had been selected.


For each feature space and each feature selection algorithm, 30 different sample sets with 32 samples per class were generated, sampling independently per class. For each sample set, the best feature was selected and the error associated with that feature was estimated. The results of the experiments are two sets of values. For each combination of a feature space and a selection algorithm, two values were computed, both averages over the 30 different sample sets: (1) the mean value of the estimated error associated with the feature selected by the selection algorithm; (2) the mean value of the estimated error associated with the true best feature. The differences between these values were used to analyze the results of the experiment.
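A minimal re-creation of this protocol, under assumptions of our own (independent binary features, arbitrary class-conditional probabilities, and a simple plug-in error estimate per feature standing in for the rotation estimators of Section 2.2), might look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 10                                  # binary features, as in the H space
p0 = rng.uniform(0.2, 0.8, size=d)      # P(x_i = 1 | class 0), assumed
p1 = rng.uniform(0.2, 0.8, size=d)      # P(x_i = 1 | class 1), assumed

# Bayes error of a single binary feature under equal priors.
bayes = 0.5 * (1.0 - np.abs(p0 - p1))
true_best = bayes.argmin()

sel_err, best_err = [], []
for _ in range(30):                     # 30 independent sample sets
    x0 = rng.random((32, d)) < p0       # 32 samples per class
    x1 = rng.random((32, d)) < p1
    q0, q1 = x0.mean(axis=0), x1.mean(axis=0)
    err = 0.5 * (1.0 - np.abs(q0 - q1)) # plug-in error estimate per feature
    j = err.argmin()                    # the 'best' feature on this sample set
    sel_err.append(err[j])
    best_err.append(err[true_best])

print("mean estimated error, selected feature:  %.3f" % np.mean(sel_err))
print("mean estimated error, true best feature: %.3f" % np.mean(best_err))
```

On most seeds the first mean comes out below the second: the estimate attached to the winning feature is optimistic, which is the effect analyzed in Section 3.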

2.1. Feature spaces

The first feature space used was defined by 10 binary features. Given the class, these were assumed to be independent. A log-linear model (see [2]) was used to represent each class-conditional probability function. This space will be referred to as the H (highly structured) space. The Bayes error in the subspaces defined by one feature varied between 0.27 and 0.46.

The second feature space was also defined by 10 binary features. Given the class, six features and the set composed of the remaining four were assumed to be independent. For the four features assumed to be non-independent, all but the fourth-order factor were non-zero. This space will be referred to as the M (moderately structured) space. The Bayes error in the subspaces defined by one feature varied between 0.27 and 0.47.

The third feature space used was defined by 5 binary features. A pseudo-random generator was used to obtain preliminary values for the probabilities of each individual element in the 5-dimensional space. Final values were obtained by applying the constraints that the probabilities have to sum to one and that the Bayes error for the full dimensionality should be equal to 0.1. This space is highly non-structured and will be referred to as the L (low structured) space. The Bayes error in the subspaces defined by one feature varied between 0.36 and 0.50. A sketch of the Bayes error computation is given below.
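The Bayes errors quoted in this subsection follow from the cell probabilities by the usual rule for equal priors: error = 0.5 * sum over cells of min(p0, p1). The sketch below builds a random 5-feature space in the spirit of the L space (the cell probabilities are random assumptions; the constraint that the full-dimensional Bayes error equal 0.1 is not enforced here) and evaluates the full-space and one-feature Bayes errors:

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(2)

cells = np.array(list(product([0, 1], repeat=5)))  # the 32 cells of the space
p0 = rng.random(32); p0 /= p0.sum()                # random class-conditional
p1 = rng.random(32); p1 /= p1.sum()                # cell probabilities

# Bayes error under equal priors: 0.5 * sum_x min(p0(x), p1(x)).
print("full space: %.3f" % (0.5 * np.minimum(p0, p1).sum()))

# Bayes error in each one-feature subspace: marginalize, then the same rule.
for i in range(5):
    m0 = np.array([p0[cells[:, i] == v].sum() for v in (0, 1)])
    m1 = np.array([p1[cells[:, i] == v].sum() for v in (0, 1)])
    print("feature %d: %.3f" % (i, 0.5 * np.minimum(m0, m1).sum()))
```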

2.2. The selection criteria

Three different selection criteria were defined. All are estimates of the error rate and require the use of classifiers; each selection criterion is one of three different error estimators. All error estimators belong to the set of the so-called rotation methods. A general framework for this kind of technique is as follows. Given a set of objects, the entire data is partitioned a number of times into a training and a test set. For each partition, a classifier is designed using the objects in the training set and applied to the objects in the test set. The error is then estimated by counting the number of misclassifications and averaging over all partitions.

In order to apply rotation methods to a given set of data, both the size of one set (e.g. the test set) and the number of partitions to be used have to be defined. If n objects are available and test sets of size k (k < n) are used, the number of ways to partition the data equals the number of combinations of k elements in a set containing n elements. For reasons of reliability, the three estimators adopted all require that more than one partition be used.

If k is equal to 1 and all possible partitions are explored, we have the first rotation technique used: the leave-one-out method [4]. It yields an almost unbiased estimate [4], although for discrete feature spaces it has a large variance [3]. Other partitioning ratios may be used, although most likely not all possible partitions can be explored; indeed, their number may be enormous and the computer time required absurd. Still, a given number of partitions may be randomly selected, the number being such that computer time remains manageable. The second estimator used test sets containing approximately 10% of the number of samples available; twenty-five different partitions were randomly generated. The third estimator used test sets with 50% of the number of samples available and 7 different partitions, again randomly generated. Toussaint [6] called this estimator the modified hold-out method. A sketch of such an estimator follows.
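Below is a minimal sketch of such a rotation estimator for a single binary feature, under assumptions of our own. The function name and the plug-in classifier are illustrative; partitions are simply drawn at random (exhaustively enumerating all n singleton test sets would recover leave-one-out), and the equal-class-representation constraint mentioned in the next paragraph is not enforced:

```python
import numpy as np

def rotation_error(x, y, n_test, n_partitions, rng):
    """Rotation estimate of the error rate of a one-feature classifier.

    x: binary feature values, y: class labels (0 or 1). With n_test=1 and
    all n singleton test sets enumerated this would be leave-one-out; here
    partitions are drawn at random.
    """
    n = len(y)
    mistakes = []
    for _ in range(n_partitions):
        test = rng.choice(n, size=n_test, replace=False)
        train = np.setdiff1d(np.arange(n), test)
        # Maximum-likelihood (frequency) estimates on the training set.
        p = [x[train][y[train] == c].mean() for c in (0, 1)]
        for i in test:
            # Assign the class whose estimated P(x_i | class) is larger.
            like = [p[c] if x[i] == 1 else 1.0 - p[c] for c in (0, 1)]
            mistakes.append(int(np.argmax(like) != y[i]))
    return float(np.mean(mistakes))

rng = np.random.default_rng(3)
x = rng.integers(0, 2, size=64)   # a pure-noise binary feature
y = np.repeat([0, 1], 32)         # 32 objects per class
print(rotation_error(x, y, n_test=6, n_partitions=25, rng=rng))  # about 0.5
```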

Since a priori probabilities were assumed to be known and an equal number of objects per class was available, the generation of each training set was subject to the condition that it contain an equal number of objects from both classes.

The properties of such estimators (the second and the third), to our knowledge, have not yet been studied. However, some conclusions may be drawn from qualitative reasoning about their properties. Each partition in each estimator yields an error estimate that is known to be pessimistically biased, and this bias increases as the number of objects assigned to each training set decreases. It may therefore be concluded that both estimators are pessimistically biased, and that the bias is likely to be more pronounced for the third estimator, since fewer objects are used to design the classifiers. As for the variance, two effects have to be considered. First, the number of different values that the estimated error may assume in each partition, within the interval [0.0, 1.0], is larger for the third estimator; following a reasoning similar to Glick's (see [3]) on the variance of the leave-one-out method, this leads us to expect that the third estimator has the smaller variance. Second, more partitions are used by the second estimator, which leads us to expect that the second estimator has the smaller variance. Which of these effects is dominant, and how the variances compare to that of the leave-one-out method, is not clear.

All these estimators require that classifiers be estimated; maximum likelihood estimation was used. Ties in the selection were broken by adopting a technique proposed by Ben-Bassat [1]: in case of equal error estimates, the feature selected was the one for which the variance of the error over the candidate space was largest.

3. Results and final remarks

Table 1 compiles the results of these experiments. Each entry contains an estimate of a mean value and an estimate of a standard deviation (between parentheses). Both these statistics relate to a given selection criterion. Since all three criteria used are estimates of the error rate, the mean values are mean error rates. Column-wise, the results relate to the three error estimators used. Row-wise, the results relate to the three feature spaces and to the selected 'best' feature and the actual best feature: the first three rows contain estimates for the 'best' feature as selected by the feature selection algorithm, and the last three rows contain estimates for the actual (i.e. true) best feature.


Table 1
Each entry in this table shows the estimated mean and standard deviation (between parentheses) of the error rate, in percent. Column-wise, the results relate to the three error estimators used. Row-wise, the results relate to the three feature spaces (H, M, L). The upper part of the table contains error estimates for the 'best' feature as indicated by the selection criteria; the lower part contains error estimates for the actual best feature.

                          Error estimator 1   Error estimator 2   Error estimator 3

Selected 'best' feature
  H                       22.9 (2.9)          22.3 (3.3)          21.9 (3.5)
  M                       24.4 (3.5)          21.8 (4.1)          23.4 (4.0)
  L                       34.3 (5.6)          32.6 (6.0)          34.7 (6.6)

Actual best feature
  H                       25.5 (4.3)          27.4 (6.1)          25.5 (4.5)
  M                       27.1 (4.6)          27.0 (6.7)          26.9 (5.8)
  L                       35.8 (6.7)          36.0 (6.6)          37.0 (8.6)

An analysis of the last three rows reveals that the estimated errors of the actual best feature are close to the true value of the error associated with the actual best feature. This is as expected in view of the relatively large number of objects available (32 per class) and the small number of parameters (one) required by the classifier; in these situations, the pessimistic bias associated with the estimators is negligible. The upper part of Table 1 shows the estimated errors of the 'best' features as selected by the selection criteria. They are optimistically biased, in spite of the relatively large number of objects used.¹

¹ In order to assess the magnitude of the bias, it should be recalled that the mean of the actual error is larger than the error of the best feature, because the actual best feature was not always selected.

Volume 10, Number 3

PATTERN RECOGNITION LETTERS

Therefore, the following question has to be asked: how can a pessimistic bias be transformed into an optimistic one? The question can be answered in the following manner. The distribution of S (see expression (1)) is not the distribution of s_j, and therefore the properties of the latter are not necessarily the properties of the former. The optimization process introduces the bias in S. The size of this bias depends on the distributions of the various s_j and on their relations. The bias may be large even if the ratio between the number of objects used and the number of parameters required by the classifier is large (although, as the results show, not large enough for the minimum error over a set). The possibility of occurrence of these large biases indicates that the results (i.e. the selected features) may strongly suit the data at hand, and not necessarily the underlying distributions of the feature space. Therefore, in order to achieve better performance with feature selection, it is important that efforts be directed towards the study of the properties of the various available criterion functions (be it the error rate or other ones) when the circumstances of their application involve an optimization procedure.

Another aspect of pattern recognition design that might be affected by feature selection is the evaluation (i.e. error estimation) of a classifier. Assume that this classifier uses a number of features which were selected by a feature selection algorithm, and that a set of objects, divided into a training and a test set, has been used. Consider the following scenarios and their possible consequences for the evaluation of the classifier:

(1) All the data is used for feature selection and classifier design. If the leave-one-out technique is applied to the entire set of objects, in certain cases an optimistically biased estimate of the error rate of the final classifier is to be expected.


(2) The training set is used for feature selection and classifier design. If the leave-one-out technique is applied to the entire set of objects, in certain cases an optimistically biased estimate of the error rate of the final classifier is again to be expected.

The justification for these behaviours follows immediately from the considerations above. A technique which is known to be unbiased yields a biased estimate if the data to which it is applied is 'biased' data, i.e. if it is not representative of the feature space; such 'biased' data may be a consequence of the feature selection process. More scenarios, similar to these, could be constructed with other popular error estimators, yielding similar conclusions. Therefore, estimated error rates should be carefully analyzed. An unbiased estimate is not automatically guaranteed by the fact that the estimator used is an unbiased estimator: the data to which it is applied might already be 'biased'. A sketch of scenario (1) is given below.
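Scenario (1) can be reproduced directly. In the sketch below (all sizes are hypothetical), every feature is pure noise, so the true error rate of any single-feature classifier is 0.5. Selecting the 'best' feature on all the data and then applying the leave-one-out technique to the same data nevertheless produces a clearly optimistic estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 64, 200                       # hypothetical sizes; d noise features
y = np.repeat([0, 1], n // 2)
x = rng.integers(0, 2, size=(n, d))  # every feature is pure noise

# Select the single 'best' feature on ALL the data ...
agree = (x == y[:, None]).mean(axis=0)
j = np.abs(agree - 0.5).argmax()

# ... then estimate its error by leave-one-out on the same data.
mistakes = []
for i in range(n):
    train = np.arange(n) != i
    p = [x[train & (y == c), j].mean() for c in (0, 1)]
    like = [p[c] if x[i, j] == 1 else 1.0 - p[c] for c in (0, 1)]
    mistakes.append(int(np.argmax(like) != y[i]))

# The true error is 0.5, yet the estimate is clearly lower: the unbiased
# estimator was applied to data already 'biased' by the selection step.
print("leave-one-out estimate for the selected feature: %.3f" % np.mean(mistakes))
```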

References

[1] Ben-Bassat, M. (1980). On the sensitivity of the probability of error rule for feature selection. IEEE Trans. Pattern Anal. Machine Intell. 2, 57-60.
[2] Bishop, Y.M.M., S.E. Fienberg and P.W. Holland (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.
[3] Glick, N. (1978). Additive estimators for probabilities of correct classification. Pattern Recognition 10, 211-222.
[4] Lachenbruch, P.A. (1967). An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. Biometrics, 639-645.
[5] Queiros, C.E. (1988). Pattern recognition with discrete and mixed data: theory and practice. PhD Dissertation, Rotterdam.
[6] Toussaint, G.T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Information Theory 20, 472-479.