Expert Systems with Applications 39 (2012) 1698–1707
Combination of multiple diverse classifiers using belief functions for handling data with imperfect labels

Mahdi Tabassian, Reza Ghaderi, Reza Ebrahimpour

Faculty of Electrical and Computer Engineering, Babol University of Technology, Shariatee St., P.O. Box 484, Babol, Iran
School of Cognitive Sciences, Institute for Research in Fundamental Sciences (IPM), Niavaran Sq., P.O. Box 19395-5746, Tehran, Iran
Brain and Intelligent Systems Research Lab, Department of Electrical and Computer Engineering, Shahid Rajaee Teacher Training University, P.O. Box 16785-163, Tehran, Iran
Keywords: Data with imperfect labels; Ensemble learning; Belief functions framework; Classifier selection; Neural network
Abstract

This paper addresses supervised learning in which the class memberships of the training data are subject to ambiguity. The problem is tackled within the ensemble learning and Dempster–Shafer theory of evidence frameworks. The initial labels of the training data are ignored and, using the prototypes of the main classes, each training pattern is reassigned to one class or to a subset of the main classes according to the level of ambiguity concerning its class label. A multilayer perceptron neural network is employed to learn the characteristics of the data with the new labels, and for a given test pattern its outputs are treated as a basic belief assignment. Experiments with artificial and real data demonstrate that taking into account the ambiguity in the labels of the learning data can yield better classification results than single and ensemble classifiers that solve the classification problem using the data with their initial imperfect labels.

© 2011 Elsevier Ltd. All rights reserved.
1. Introduction

In the classical supervised classification framework, a classifier is trained on a learning set of labeled patterns. In some applications, however, unambiguous label assignment may be difficult, imprecise or expensive. Such situations occur when differentiating between two or more classes is hard, either because the information required to assign certain labels is lacking or because the data of the problem at hand are complicated to label. Combining the decisions of multiple classifiers is a promising way to attain acceptable classification performance (Ebrahimpour, Kabir, & Yousefi, 2007; Kittler, Ghaderi, Windeatt, & Matas, 2003). With a proper strategy for constructing the ensemble, it can be applied successfully to classification problems with imprecise and uncertain information.

The Dempster–Shafer (D–S) theory of evidence (Shafer, 1976), also referred to as the theory of belief functions, is a well-suited framework for representing partial knowledge and has proved appropriate for improving the performance of an ensemble model that deals with unreliable information (Tabassian, Ghaderi, & Ebrahimpour, 2011). Compared to the Bayesian method,
the D–S framework provides a more flexible mathematical tool for dealing with imperfect information. It offers various tools for combining several items of evidence and, as understood in the transferable belief model (Smets & Kennes, 1994), allows a decision about the class of a given pattern to be made by transforming a belief function into a probability function. Thanks to its flexibility in representing different kinds of knowledge, the D–S theory provides a suitable theoretical framework for combining classifiers, especially those learned from imprecise and/or uncertain data.

Rogova (1994) introduced an approach for combining neural network classifiers based on the D–S theory, using the proximity of a given sample to reference vectors to define mass functions. Each reference vector was the mean of the neural network outputs over the training samples of one of the main classes. The final decision about the class of a test pattern was reached by combining the mass functions with Dempster's rule of combination and merging the evidence from all neural networks with the same rule. Denoeux (2000) proposed an evidential neural network for pattern classification that uses the evidential K-nearest neighbor algorithm (Denoeux, 1995) to assign a pattern to a class by computing distances to a limited number of prototypes; in a sensor fusion application, this approach was shown to provide an optimal solution in case of sensor failures. A supervised learning framework based on the D–S theory was proposed by Basir, Karray, and Zhu (2005), addressing the construction of evidence structures and the dependence among information sources with neural networks. It has been shown that
this approach could improve on the D–S evidence theory and the majority voting method and could provide satisfactory results in terms of the speed of learning convergence. A pairwise classifier combination based on the D–S theory was proposed by Quost, Denoeux, and Masson (2007), where it was argued that the belief functions framework is an appropriate tool for handling the partial information provided by each pairwise classifier about the class of the object under consideration. Quost and Denoeux (2009) proposed a variant of the AdaBoost algorithm within the evidential framework, in which the imperfect knowledge produced by a weak learner was interpreted as a belief function, permitting the processing of data with uncertain labels. The authors showed experimentally that accounting for the uncertainty of the training patterns may increase robustness and accuracy.

In this paper we propose a new approach, within the belief functions and classifier combination frameworks, for dealing with supervised classification problems in which the labels of the learning data are imperfect. Several representations of the data are used, and an approach based on the supervised information is suggested for detecting inconsistencies in the labels of the learning data in each feature space and for assigning crisp and soft labels to them. A multilayer perceptron (MLP) neural network is used as the base classifier, and its outputs are interpreted as a basic belief assignment (BBA). The final decision about the class of a test pattern is made by combining the BBAs produced by the base classifiers using Dempster's rule of combination. To provide complementary sources of information, a classifier selection approach based on a forward search algorithm and a diversity measure is adopted.

The paper is organized as follows. Section 2 reviews the basic concepts of belief functions. Section 3 describes the proposed method.
Section 4 presents the classifier selection methodology. Section 5 reports experimental results and discussion on artificial and real data, and Section 6 concludes the paper.
2. Belief functions

The theory of belief functions (Shafer, 1976) is a framework for reasoning with uncertain and partial information and can be considered a flexible generalization of Bayesian probability. Several models for uncertain reasoning have been proposed on its basis; an example is the transferable belief model (TBM) (Smets & Kennes, 1994).

2.1. Basic concepts

Let Ω = {ω_1, …, ω_M} be a finite set of mutually exclusive and exhaustive hypotheses called the frame of discernment. A basic belief assignment (BBA) or mass function is a function m : 2^Ω → [0, 1] satisfying the two following conditions:
m(∅) = 0    (1)

Σ_{A⊆Ω} m(A) = 1    (2)
where ∅ is the empty set; a BBA satisfying m(∅) = 0 is called normal. The subsets A of Ω with nonzero mass are called the focal elements of m, and m(A) indicates the degree of belief assigned to the exact set A and not to any of its subsets. Two other functions are defined in the theory of evidence: the belief and plausibility functions associated with a BBA, defined respectively as
Bel(A) = Σ_{B⊆A} m(B)    (3)

Pl(A) = Σ_{A∩B≠∅} m(B)    (4)
Bel(A) represents the total amount of probability allocated to A, while Pl(A) can be interpreted as the maximum amount of support that could be given to A. Note that the three functions m, Bel and Pl are in one-to-one correspondence: knowing one of them, the other two can be derived.

2.2. Combination of BBAs

Let m_1 and m_2 be two BBAs induced by two independent items of evidence. These pieces of evidence can be combined using Dempster's rule of combination (also called the orthogonal sum), defined as
m(C) = Σ_{A∩B=C} m_1(A) m_2(B) / (1 − Σ_{A∩B=∅} m_1(A) m_2(B))    (5)
Combining BBAs with Dempster's rule is possible only if the sources of belief are not totally contradictory, i.e., there exist subsets A ⊆ Ω and B ⊆ Ω with A ∩ B ≠ ∅ such that m_1(A) > 0 and m_2(B) > 0.

2.3. Decision-making

After combining all pieces of evidence, a decision has to be made from the final belief structure. The approach adopted in this paper is based on the pignistic transformation defined in the TBM. By distributing the mass of belief m(A) uniformly among the elements of A for all A ⊆ Ω, a pignistic probability distribution is defined as
BetP(ω) = Σ_{A⊆Ω, ω∈A} (1/|A|) · m(A)/(1 − m(∅)),  ∀ω ∈ Ω    (6)
where |A| is the cardinality of the subset A; for normal BBAs (with m(∅) = 0), m(A)/(1 − m(∅)) reduces to m(A). Note that in the TBM the pignistic probability is employed for decision making based on Bayes decision theory.

3. The proposed evidence-based ensemble model

Fig. 1 shows the architecture of the proposed method. Its implementation involves two main phases: re-labeling the learning data, and classifying an input test sample by combining the decisions of neural networks trained on the re-labeled data. The procedure of label assignment is reconsidered: based on the level of ambiguity concerning the class membership of each training pattern, the pattern is allowed to have a crisp class label or a soft label comprising any subset of the predefined classes. The accepted uncertainties are reduced by using several complementary representations of the data and merging the evidence raised from these representations within the belief functions framework. Note that the proposed method is advantageous when applied to classification problems with imperfect class labels; otherwise, utilizing complementary sources of information in a classifier combination framework can already provide satisfactory results.

3.1. Re-labeling

Let Ω = {ω_1, …, ω_M} be the set of M classes and let x be a data point in the training set, described by n features and associated with
Fig. 1. Architecture of the proposed classification scheme. In the training phase, crisp and soft labels are assigned to the learning data and MLPs are trained on the data with the new labels. In the test phase, the outputs of the MLPs are treated as measures of confidence associated with different decisions and are converted into BBAs by means of the softmax operator. The final decision on a given test sample is made by combining the experts' beliefs with Dempster's rule of combination and applying the pignistic transformation.
one class in Ω with certainty. The goals of this stage are (i) to detect inconsistencies in the labels of the training data and (ii) to reassign each training sample to just one main class or to a subset of the main classes, based on the level of ambiguity concerning the class membership of that sample.

Let P be an M × n matrix containing the prototype vectors of the main classes and let D_i = [d_i1, …, d_iM] be the set of distances between training sample x_i and the M prototypes according to some distance measure (e.g. the Euclidean one). The initial label of x_i is ignored and, using the information provided by the vector D_i, uncertainty detection and class reassignment for this sample are performed in a three-step procedure:

Step 1: The minimum distance between x_i and the class prototypes is taken from the vector D_i,
d_ij = min_l (d_il),  l = 1, …, M    (7)
and d_ij is denoted d_min.

Step 2: A value 0 < μ_l ≤ 1 is calculated for each of the M classes using the following function:
μ_l(x_i) = (d_min + β)/(d_il + β),  l = 1, …, M    (8)
in which 0 < β < 1 is a small constant ensuring that the function allocates a value greater than zero to each of the M classes even if d_min = 0. μ_l is a decreasing function of the difference between d_min and d_il and takes values close to 1 for small differences.

Step 3: A threshold 0 < τ < 1 is defined and, based on the level of ambiguity regarding the class membership of training sample x_i, the sample is assigned to (i) a set of classes whose corresponding values of μ_l are greater than
or equal to τ (soft labeling), or (ii) just the one main class whose prototype is closest to x_i (so that μ_l = 1), when x_i is far from the prototypes of the other main classes (crisp labeling).

Close distances between a training pattern and several class prototypes can be interpreted as an indication of ambiguity in the pattern's label, and in such cases a soft label is assigned to that pattern. The above procedure is repeated for all training samples and all feature spaces. Note that by assigning a soft label to a set of training samples, a new class is added to the problem and these samples are no longer members of their initial classes. For instance, consider a 2-class problem in which a training sample x belongs to class ω_1. If, after the re-labeling procedure, x is assigned to the soft class ω_{1,2}, then one of the samples of class ω_1 is removed and added to the new class ω_{1,2}.

Since several representations of the data are employed to provide complementary information, it can be expected that if a soft label has been assigned to a training sample in one feature space, the sample may belong to one class or a subset of the main classes with less uncertainty in the other feature spaces. In this way, the negative effect of crisp but imperfect labels of some learning samples on the classification performance can be reduced, and satisfactory classification results can be achieved by combining evidence from the multiple feature spaces.

Two approaches for calculating the prototype vectors of the main classes are adopted in this study. The first computes the mean vectors of the main classes as a global estimate of the class prototypes and uses these vectors for re-labeling all training samples, while the second uses the information contained in a local region surrounding each training pattern to estimate the prototype vectors. The local method can give a better estimate of the main classes' prototypes and, as
shown in the next experiments, it is able to provide robustness against erroneous labels. The local strategy for computing class prototypes is discussed in Section 3.1.1.

3.1.1. Local prototype estimation based on K-nearest neighbor information

For training sample x_i, let X^j = {x^j_t | t = 1, …, K_j} denote the set of training samples from class ω_j among the K-nearest neighbors (K-NN) of x_i. The mean vector of class ω_j is computed from its samples, {x^j_1, x^j_2, …, x^j_{K_j}}, in the K-NN set and used as the prototype of this class:
p_j = (1/K_j) Σ_{t=1}^{K_j} x^j_t    (9)
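As a concrete sketch, the three-step re-labeling of Section 3.1 can be written as follows, assuming Euclidean distances. The function name `relabel`, the default parameter values and the toy prototypes are illustrative, not taken from the paper; the `prototypes` argument may hold either the global class means or the local K-NN means of Eq. (9).

```python
import numpy as np

def relabel(x, prototypes, beta=0.001, tau=0.8):
    """Sketch of the three-step re-labeling of Section 3.1.

    x          : (n,) feature vector of one training sample
    prototypes : (M, n) matrix of class prototypes (global means, or
                 local K-NN means as in Eq. (9))
    Returns the set of class indices forming the new crisp or soft label.
    """
    d = np.linalg.norm(prototypes - x, axis=1)          # distances to the M prototypes
    d_min = d.min()                                     # Step 1: Eq. (7)
    mu = (d_min + beta) / (d + beta)                    # Step 2: Eq. (8), values in (0, 1]
    label = {l for l in range(len(d)) if mu[l] >= tau}  # Step 3: threshold at tau
    return frozenset(label)                             # singleton -> crisp, larger -> soft

# Hypothetical 2-D example with two well-separated prototypes
protos = np.array([[0.0, 0.0], [10.0, 0.0]])
print(relabel(np.array([0.5, 0.1]), protos))   # crisp: close to class 0 only
print(relabel(np.array([5.0, 0.0]), protos))   # soft: ambiguous between both classes
```

A sample near one prototype keeps a crisp label, while a sample equidistant from several prototypes receives a soft label over those classes.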
Naturally, the re-labeling of training sample x_i is performed using the prototype vectors of only those classes that have at least one sample among the K nearest neighbors of x_i. The distance of sample x_i to the jth class prototype, d_ij, is discounted by a weight 0 < w_j < 1, computed as
w_j = (1 − K_j/K) / Σ_{j=1}^{M} (1 − K_j/K)    (10)
where K_j is the number of nearest neighbors from class ω_j among the K nearest neighbors. This equation takes into account the contributions of the classes present in the K-NN set during re-labeling. Before being used in Eq. (8), the distances d_ij are multiplied by their corresponding weights; in this fashion, the chance of assigning training sample x_i to a class ω_j with a large number of samples in the K-NN set is increased. Note that there are some exceptions in which the weight factor w_j has no impact on the re-labeling procedure: (i) when K = 1, or K > 1 but all samples in the K-NN set belong to a single class ω_j, in which case training sample x_i is assigned to class ω_j with certainty; and (ii) when all classes present in the K-NN set have the same number of patterns.

3.2. Basic belief assignment and classification

An MLP neural network, which is capable of forming a global approximation to the classification boundary, is used as the base classifier. The learning set of each feature space, consisting of data with new crisp or soft labels, is employed to train the base classifier. Since different types of features are extracted from the data, the training samples in each feature space have their own level of uncertainty in the class labels. As a result, after the re-labeling procedure the number and type of classes in each feature space may differ from one space to another, and base classifiers with different models (different numbers of output nodes) are trained on these feature spaces. The diversity among the base classifiers arises from this procedure. In the test phase, the same types of features as in the training stage are extracted from the test data and fed to the corresponding base classifiers. The output values of each base classifier can be interpreted as measures of confidence associated with different decisions.
To combine the evidence induced by the feature spaces using Dempster's rule of combination, the decisions of the base classifiers must be converted into BBAs. This is done by normalizing the outputs of each base classifier with a softmax function:
m_i(ω_j) = exp(O_ji) / Σ_{j=1}^{C} exp(O_ji),  j = 1, …, C    (11)
where O_ji is the jth output value of the ith base classifier, C is the number of classes in the ith feature space after the re-labeling stage, and m_i(ω_j) is the mass of belief given to class ω_j. Note that, although the method allows a BBA m_i with any focal set over the set of the main classes, the BBA contains only those states that have corresponding classes in the ith feature space after the re-labeling stage. In other words, the frame of discernment is composed of the initial classes and is the same for all classifiers, but each base classifier makes a decision about a subset of this frame. Since after re-labeling every training sample has a crisp or soft class label, the quantity m(∅) is zero and the BBAs are normal.

To decide on the class of a given test sample, the opinions of the evidence sources are merged using Dempster's rule of combination. By applying the pignistic transformation to the resulting BBA, the pignistic probabilities are obtained and the test sample is assigned to the class with the largest pignistic probability. Although the main contribution of the method is classifying data with imperfect labels, its application extends to classification problems involving heavily overlapping class distributions and nonlinear class boundaries. The example below demonstrates this capability.

Example. Consider a 3-class problem in which the classes generated from two feature spaces after the re-labeling stage, and the BBAs produced by the neural networks for a given test sample, are as shown in Fig. 2. After combining BBA m_1 and BBA m_2 using Dempster's rule of combination, a BBA with the three main classes and a soft class comprising main classes 1 and 3 is obtained.
Using the pignistic transformation, the belief assigned to the soft class ω_{1,3} is distributed equally among its elements, and the test sample is finally assigned to class 3, which has the highest probability value.
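The decision process in this example can be sketched in code. Dempster's rule (Eq. (5)) and the pignistic transformation (Eq. (6)) are implemented over BBAs stored as dictionaries keyed by frozensets of class labels; the mass values below are illustrative and are not the exact numbers of Fig. 2.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule (Eq. (5)); BBAs are dicts keyed by frozensets of labels."""
    combined, conflict = {}, 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            combined[C] = combined.get(C, 0.0) + a * b
        else:
            conflict += a * b                       # mass lost to the empty set
    assert conflict < 1.0, "totally conflicting evidence"
    return {C: v / (1.0 - conflict) for C, v in combined.items()}

def pignistic(m):
    """Pignistic transformation (Eq. (6)) for a normal BBA."""
    betp = {}
    for A, v in m.items():
        for w in A:                                 # spread m(A) uniformly over A
            betp[w] = betp.get(w, 0.0) + v / len(A)
    return betp

# Illustrative masses over the frame {1, 2, 3} (not the figure's exact values)
m1 = {frozenset({1}): 0.2, frozenset({2}): 0.1, frozenset({1, 3}): 0.7}
m2 = {frozenset({2}): 0.2, frozenset({3}): 0.5, frozenset({1, 3}): 0.3}

m = dempster_combine(m1, m2)
betp = pignistic(m)
print(max(betp, key=betp.get))
```

With these masses the combined BBA has focal sets {1}, {2}, {3} and the soft class {1, 3}, and the pignistic maximum falls on class 3, matching the narrative above.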
4. Classifier selection

A necessary condition for the success of the proposed method, as for other ensemble networks, is an appropriate choice of the base classifiers. Assume that, by applying different feature extraction procedures to the data or by sampling several subsets of features from the original feature space, different representations of the data are available. By training a classifier on each of these feature spaces, a pool of classifiers is obtained. To choose an optimal set of base classifiers from the pool, a classifier selection methodology is needed, which requires a search algorithm and a selection criterion. Fig. 3 shows the overall procedure of a typical classifier selection algorithm.

In this paper the forward search algorithm, the most intuitive greedy approach, is used. Ruta and Gabrys (2005) demonstrated its optimality in extensive experiments, showing that forward search can outperform the exhaustive search algorithm, which suffers from overfitting. The quality of the selected classifiers is then assessed with a diversity-based selection criterion: the interrater agreement κ, a non-pairwise diversity measure. This criterion measures the level of agreement, or disagreement, between classifiers and picks the most diverse subset of classifiers (Kuncheva & Whitaker, 2003). The κ measure is briefly explained below.

Let B = {B_1, …, B_S} be a system of S classifiers and let {x_i}, i = 1, …, N, be a data set containing N labeled samples. Let y_i = [y_i1, …, y_iS] denote the joint output of the system for sample x_i, where y_ij is the output of the jth classifier for the ith input sample, defined to be 1 if B_j recognizes x_i correctly and 0 otherwise. Let s(x_i) denote
Fig. 2. The process of making decision on a given test sample for a 3-class problem. The generated classes from two feature spaces after the re-labeling stage are shown. The value of belief produced by the neural networks for each crisp or soft class as well as the final BBA obtained by merging the two bodies of evidence using Dempster’s rule of combination are presented. Decision on the test sample is made by applying pignistic transformation to the final BBA.
Fig. 3. Overall stages involved in a typical classifier selection procedure.
the number of classifiers from B that correctly recognize the input sample xi. It can be expressed by
s(x_i) = Σ_{j=1}^{S} y_ij    (12)
Denoting by a_j = (1/N) Σ_{i=1}^{N} y_ij the classification rate of the jth classifier, the average accuracy of the ensemble is defined by
ā = (1/S) Σ_{j=1}^{S} a_j    (13)
Using the notation above, κ can be expressed as

κ = 1 − [Σ_{i=1}^{N} s(x_i)(S − s(x_i))] / [N S (S − 1) ā (1 − ā)]    (14)
where κ = 0 indicates that the classifiers are independent. Smaller values of κ indicate better diversity, and a negative κ shows negative dependency (high diversity) among the classifiers (Kuncheva & Whitaker, 2003).

5. Experimental results and discussion

In this section, we report experimental results on artificial and real datasets to highlight the main aspects of the proposed method. We focus on supervised classification problems in which the labels of the learning data are only partially known or are erroneous. For all datasets, a value of β = 0.001 was adopted in the re-labeling stage.
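Before turning to the experiments, the interrater agreement κ (Eqs. (12)–(14)) and the greedy forward search of Section 4 can be sketched as follows; the oracle matrix `Y`, the seeding rule and the function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def interrater_kappa(Y):
    """Interrater agreement (Eqs. (12)-(14)) on an oracle matrix Y,
    where Y[i, j] = 1 if classifier j labels sample i correctly."""
    N, S = Y.shape
    s = Y.sum(axis=1)                  # Eq. (12): correct votes per sample
    a_bar = Y.mean()                   # Eq. (13): average individual accuracy
    return 1.0 - s.dot(S - s) / (N * S * (S - 1) * a_bar * (1.0 - a_bar))

def forward_select(Y, n_select):
    """Greedy forward search: start from the most accurate classifier and
    repeatedly add the candidate giving the smallest (most diverse) kappa."""
    selected = [int(Y.mean(axis=0).argmax())]
    pool = [j for j in range(Y.shape[1]) if j != selected[0]]
    while len(selected) < n_select and pool:
        best = min(pool, key=lambda j: interrater_kappa(Y[:, selected + [j]]))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy oracle matrix: 5 samples, 4 classifiers (hypothetical)
Y = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 1, 1, 1]])
print(forward_select(Y, 2))
```

Smaller (more negative) κ signals higher diversity, so each greedy step favors the candidate that disagrees most with the classifiers already chosen.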
5.1. Artificial data

We used two-dimensional data so that the results could be easily represented and interpreted. The dataset consisted of three classes with equal sample sizes, Gaussian distributions and a common identity covariance matrix. We generated 150, 300 and 1500 independent samples for training, validation and testing, respectively. The center of each class was located at one of the vertices of an equilateral triangle. To obtain different representations of the data, the classes were shifted to their neighboring vertices in the clockwise direction, generating two further feature spaces. In each feature space, a distinct subset of each class overlapped with the data of the other classes; thus the high uncertainty concerning the class membership of a pattern in one feature space can be reduced, because the pattern may lie in non-overlapping or less ambiguous areas of the other feature spaces. To evaluate the performance of the proposed method under different levels of label ambiguity, the side length of the equilateral triangle (the distance between neighboring classes) was varied in {1, 2, 3}, yielding three cases from strongly overlapping to almost separated classes. Fig. 4 represents the data generation procedure graphically.

Since controlled data with known and similar class characteristics were used in this experiment, a global estimate of the class prototypes was adopted: the center of each main class was taken as the average of its learning data, and the resulting vectors were used as the main classes' prototypes. The
Fig. 4. Generic representation of the artificial data. For generating new feature spaces, the classes are transferred to their near corners in the clockwise direction. To demonstrate how different parts of a class overlap with other classes in different feature spaces, Class 1 is divided into four parts and the positions of these parts in the three feature spaces are shown. (a) First feature space, (b) second feature space, (c) third feature space.
Fig. 5. Representation of partitioning the first learning set into crisp and soft subsets by the proposed re-labeling approach with τ = 0.8. (a) First feature space, (b) second feature space, (c) third feature space.
Fig. 6. Classification error rates as a function of the number of hidden nodes for the single MLPs trained on the individual feature spaces, the ensemble networks constructed by combining the outputs of the single MLPs with fixed combining methods (Average, Product, Maximum) and the proposed method. (a) First training set (τ = 0.8 for the proposed method), (b) second training set (τ = 0.75), (c) third training set (τ = 0.7).
Table 3
The range of examined numbers of features for the UCI datasets and their selected cardinalities.

Database                | Range of examined features | # Selected features
Wine                    | 5–11                       | 11
Waveform                | 5–15                       | 15
Ionosphere              | 5–20                       | 5
Breast Cancer Wisconsin | 5–16                       | 8
– Ensemble networks constructed by merging the decisions of the single MLPs using three fixed combining methods (Average, Product and Maximum rules).
Fig. 7. Samples of the employed texture data.
Table 1
Feature space cardinality of the texture datasets employed in the classifier selection stage.

Dataset | # Selected features
Brick   | 20
Glass   | 5
Carpet  | 20
Fabric  | 10
optimality of the proposed local approach for computing the prototype vectors will be investigated later using the real data.

5.1.1. Soft and crisp label generation

To examine qualitatively how training samples with different levels of label uncertainty are treated by the re-labeling procedure, the result of partitioning the first training set with τ = 0.8 is shown in Fig. 5. Each partition is represented by a convex hull, and its class label indicates the subset of the main classes to which the samples of that partition were assigned. It can be seen that soft labels were assigned to samples situated on the boundaries of the main classes or located in ambiguous regions.

5.1.2. Performance comparison

The performance of the proposed method was compared to the following classifiers:

– Three MLP neural networks, each trained separately on one of the feature spaces.
The above classification approaches discard the possible uncertainties in the class labels and employ the data with their initial labels. The MLPs had one hidden layer and were trained with the Levenberg–Marquardt algorithm using default parameters and 80 training epochs. Each neural network was trained 50 times with random initializations. To evaluate the proposed method for different threshold values, we generated 19 training sets from each original training set by varying τ from 0.05 to 0.95 in steps of 0.05.

The average test error rates of the employed classifiers on the three datasets, for different numbers of hidden nodes, are shown in Fig. 6; for the proposed method, the results with the best τ for each dataset are reported. The improvements of the ensemble networks over the single MLPs are very significant, indicating that these networks were built from base classifiers trained on complementary feature spaces. The proposed method yields considerably better classification results than the other classifiers on the first dataset, the most difficult set with the highest level of label uncertainty (Fig. 6(a)). As shown in Fig. 6(b) and (c), the differences between the test performances of the proposed scheme and the three ensemble networks on the second and third datasets are small. It can therefore be concluded that there is not much benefit to be gained from considering uncertainty in perfectly labeled data or in a classification problem with well-separated classes. It is also expected that as the uncertainty in the class labels decreases, the threshold values that yield satisfactory classification results decrease as well. This expectation was confirmed by the results on the three datasets, where the best performances were achieved with τ = 0.8, 0.75 and 0.7 for the first, second and third training sets, respectively.
5.2. Real data

To determine whether the results obtained on the artificial data hold for higher-dimensional real data, the proposed method was applied to a subset of the UIUCTex database (Lazebnik, Schmid, & Ponce, 2005) and to four datasets from the well-known UCI repository (Blake & Merz, 1998).

5.2.1. Experimental settings

The Random Subspace Method (RSM) (Kam Ho, 1998) was used to create the classifier pools. In this method, different subsets of
Table 2
Specifications of the four UCI datasets employed in this study.

Database                | # Features | # Train samples | # Validation samples | # Test samples | # Classes
Wine                    | 13         | 109             | 35                   | 34             | 3
Waveform                | 40         | 90              | 330                  | 630            | 3
Ionosphere              | 34         | 50              | 160                  | 141            | 2
Breast Cancer Wisconsin | 30         | 82              | 258                  | 229            | 2
Fig. 8. Average classification error rates (as a function of label corruption level (%)) and standard deviations for the proposed method, ensemble networks constructed by fixed combination rules (average, product and maximum) and Rogova's method on the UIUCTex datasets. (a) Brick, (b) Glass, (c) Carpet, (d) Fabric.
features are randomly sampled from the original feature space and multiple classifiers are constructed in the resulting low-dimensional feature spaces. In the experiments presented in this section, the size of the classifier pool was fixed to 50 for all databases. Each database was randomly partitioned into training, validation and test sets, and the experiments were repeated five times for each random partition. In the classifier selection stage, a subset of 10 classifiers was picked for each dataset from the pool generated by the RSM, and the feature spaces corresponding to the selected classifiers were used in the re-labeling phase of the proposed method. In the re-labeling procedure for the real data, the local method proposed for computing the class prototypes was employed and, by examining 10 values of K for the K-NNs (K = 1, ..., 10) and 13 values of s (s = 0.3, ..., 0.9 with a step size of 0.05), new training sets with crisp and soft labels were generated. These training sets were used in the proposed ensemble network and, based on the performances evaluated on the validation data, a small set of suitable training sets was selected for assessing the proposed method in the test phase. In order to study the robustness of the proposed method against erroneous labels, the initial labels of the employed datasets were artificially corrupted: following a Bernoulli distribution, the label of each sample was changed to another class, with the mislabeling probability taking values from {0.1, 0.15, 0.2, ..., 0.5}.
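The two preprocessing steps just described, drawing random feature subsets for the RSM pool and Bernoulli label corruption, can be sketched as follows. The helper names are ours, and the subspace dimension shown is merely illustrative; the pool size (50), the 34-feature Ionosphere space and the 30% corruption level are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subspaces(n_features, subspace_dim, n_classifiers):
    """Draw one random feature subset per base classifier (RSM)."""
    return [rng.choice(n_features, size=subspace_dim, replace=False)
            for _ in range(n_classifiers)]

def corrupt_labels(y, n_classes, p):
    """Flip each label to a different, uniformly chosen class with prob. p."""
    y = y.copy()
    flip = rng.random(len(y)) < p             # one Bernoulli draw per sample
    for i in np.flatnonzero(flip):
        y[i] = rng.choice([c for c in range(n_classes) if c != y[i]])
    return y

# Pool of 50 classifiers on random subspaces of a 34-feature space
# (e.g. the Ionosphere setting); labels corrupted at the 30% level.
subspaces = random_subspaces(n_features=34, subspace_dim=15, n_classifiers=50)
y_noisy = corrupt_labels(np.zeros(1000, dtype=int), n_classes=2, p=0.3)
print(len(subspaces), y_noisy.mean())         # 50 subsets; roughly 30% flipped
```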
5.2.1.1. UIUCTex data. The database is composed of 25 classes and 14 types of texture surfaces. In our experiments, four types of textures, namely brick, carpet, glass and fabric, were selected from the original database and each one was considered as a separate classification problem. The first three problems have two classes of texture images, while there are four types of textures for the fabric data. Fig. 7 shows samples of the employed texture datasets. There are 40 images of size 640 × 480 for each class. The original images were rescaled to 84 × 70, and 24, 6 and 10 samples of each class were used for the training, validation and test sets, respectively. Principal Component Analysis (PCA) (Turk & Pentland, 1991) was used for reducing the size of the original feature spaces to the maximum allowed dimension (number of training patterns − 1), and the RSM was applied to the obtained subspaces. In order to determine the appropriate cardinalities of the feature spaces, subspaces with six different dimensions (5, 10, 15, 20, 25 and 30) were examined and the best dimensions were selected based on the performance on the validation data. Table 1 lists the feature space cardinality of each dataset employed in the classifier selection stage.

5.2.1.2. UCI data. Table 2 gives the characteristics of the UCI datasets employed in this research. All datasets were used with their original sample sizes, except for the Waveform data, where a subset of 1050 samples randomly selected from the 5000 original samples was employed. Using the validation data and by varying the sizes of the feature spaces within the ranges given in Table 3, the appropriate number of dimensions for each dataset was determined. Note that for the Wine and Waveform problems, only a pair of classes was used for the artificial label corruption.

5.2.2. Performance comparison

Similar to the experiments carried out on the artificial data, we compared the performance of our proposed method on the real data with those of ensemble neural networks constructed by pooling together the results of the single MLPs with fixed combining
Fig. 9. Average classification error rates (as a function of label corruption level (%)) and standard deviations for the proposed method, ensemble networks constructed by fixed combination rules (average, product and maximum) and Rogova's method on the UCI datasets. (a) Wine, (b) Waveform, (c) Ionosphere, (d) Breast Cancer Wisconsin.
methods. The results of our method were also compared with those of the evidence-based neural network ensemble proposed by Rogova (1994). Note that the same feature spaces as those used in our method were incorporated into these ensemble networks. All MLPs in the proposed classification scheme and in the other ensemble structures were trained with the resilient back-propagation algorithm, using default parameters and 50 training epochs. They had one hidden layer and were trained 10 times with random initial weights; the number of hidden neurons was selected based on the results obtained on the validation set. Average classification error rates (as a function of label corruption level) of the employed ensemble structures on the UIUCTex and UCI datasets are illustrated in Figs. 8 and 9, respectively. For the texture datasets, the proposed method outperforms the other ensemble networks at label corruption levels greater than 10% and is more robust to erroneous labels. On the UCI datasets, however, with the exception of the Waveform data (Fig. 9(b)), our method improves over the other classification schemes only at high corruption levels. These results on the texture and UCI datasets can be interpreted in light of the principal assumption behind the proposed method. As mentioned earlier, our method is expected to be helpful when applied to a supervised classification problem with imperfect labels and/or heavily overlapping class distributions. In the case of the texture data, the characteristics of the classes are similar to one another, which means that the extracted features form overlapping regions in the feature space. As a result, discriminating between the classes is difficult, and even at low levels of label noise the proposed method yields better results than the other classification schemes.
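Both evidence-based schemes fuse the base classifiers' BBAs with Dempster's rule of combination. A minimal sketch for two BBAs over a hypothetical two-class frame {w1, w2} follows; in a Rogova-style BBA only the singleton classes and the whole frame receive mass, while the proposed method may also assign mass to other subsets, and the function below handles both cases.

```python
def dempster(m1, m2):
    """Dempster's rule of combination for two BBAs.

    Each BBA maps frozenset focal elements to masses.  Masses of
    intersecting focal elements are multiplied and accumulated on the
    intersection; products falling on the empty set form the conflict,
    which is discarded by renormalization.
    """
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb                 # mass on the empty set
    return {f: v / (1.0 - conflict) for f, v in combined.items()}

# Hypothetical BBAs from two base classifiers over the frame {w1, w2}.
Omega = frozenset({"w1", "w2"})
m1 = {frozenset({"w1"}): 0.6, Omega: 0.4}
m2 = {frozenset({"w1"}): 0.5, frozenset({"w2"}): 0.3, Omega: 0.2}
m = dempster(m1, m2)
print(m)    # conflict 0.18 discarded; remaining masses renormalized
```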
From a different viewpoint, the results of the proposed method can be compared with those of the evidence-based ensemble structure proposed by Rogova. Both classification schemes make use of similar information sources and merge the decisions of their base classifiers via Dempster's rule of combination. However, the proposed model provides better classification performances than Rogova's method, which can be explained by the role of the re-labeling algorithm in our method. With this approach, the ambiguity in the supervised information is taken into account in the learning phase and, by allowing each training pattern to have a soft label comprising any subset of the main classes, more sophisticated information can be encoded by the BBAs. Rogova's method, on the other hand, tries to handle the uncertainty in the labels of the training data only at the decision level and uses BBAs composed only of the main classes and Ω. As mentioned before, based on the average performance of the proposed method on the validation data, a small subset of appropriate training sets obtained by the suggested re-labeling approach is selected to evaluate the performance of the proposed classification scheme in the test phase. If the parameters that influence the efficiency of the re-labeling process are selected accurately, assigning a proper class label to a training pattern is not very difficult. In order to investigate the suitability of the proposed re-labeling algorithm quantitatively, the results of our method on the real data are considered. In the re-labeling procedure for each dataset, the proposed local approach was used for computing the class prototypes, and a set of learning data with different crisp and soft labels was obtained by varying K and s. Table 4 lists the minimum and the second minimum classification
Table 4
Optimal classification error rates of the proposed method on the real data for three different corruption levels. For each reported error rate, the corresponding parameters K and s of the re-labeling stage are shown.

Dataset / Corruption level               30%                        40%                        50%

Brick
  Minimum classification error (%)       32.90 (K = 10, s = 0.65)   35.40 (K = 10, s = 0.75)   39.40 (K = 10, s = 0.9)
  Second minimum error (%)               33.50 (K = 7, s = 0.75)    37.00 (K = 7, s = 0.85)    40.10 (K = 7, s = 0.85)
Glass
  Minimum classification error (%)       30.80 (K = 8, s = 0.35)    35.80 (K = 10, s = 0.35)   35.10 (K = 2, s = 0.8)
  Second minimum error (%)               33.80 (K = 10, s = 0.6)    37.50 (K = 8, s = 0.45)    41.50 (K = 5, s = 0.3)
Carpet
  Minimum classification error (%)       6.00 (K = 9, s = 0.60)     24.80 (K = 9, s = 0.8)     35.70 (K = 3, s = 0.7)
  Second minimum error (%)               6.10 (K = 10, s = 0.75)    25.80 (K = 5, s = 0.65)    37.50 (K = 2, s = 0.9)
Fabric
  Minimum classification error (%)       24.95 (K = 9, s = 0.75)    36.35 (K = 3, s = 0.35)    51.10 (K = 7, s = 0.30)
  Second minimum error (%)               25.15 (K = 10, s = 0.70)   40.10 (K = 9, s = 0.75)    51.55 (K = 9, s = 0.35)
Wine
  Minimum classification error (%)       11.52 (K = 9, s = 0.55)    19.23 (K = 9, s = 0.60)    33.49 (K = 2, s = 0.85)
  Second minimum error (%)               11.83 (K = 10, s = 0.40)   21.15 (K = 8, s = 0.65)    41.60 (K = 6, s = 0.45)
Waveform
  Minimum classification error (%)       30.11 (K = 10, s = 0.60)   36.11 (K = 9, s = 0.85)    43.19 (K = 8, s = 0.70)
  Second minimum error (%)               30.98 (K = 9, s = 0.55)    36.23 (K = 7, s = 0.80)    43.31 (K = 9, s = 0.85)
Ionosphere
  Minimum classification error (%)       27.38 (K = 1)              31.81 (K = 1)              43.51 (K = 3, s = 0.90)
  Second minimum error (%)               29.79 (K = 5, s = 0.90)    34.39 (K = 3, s = 0.70)    43.54 (K = 1)
Breast Cancer Wisconsin
  Minimum classification error (%)       12.26 (K = 10, s = 0.75)   29.48 (K = 6, s = 0.75)    44.27 (K = 10, s = 0.30)
  Second minimum error (%)               13.44 (K = 8, s = 0.30)    31.88 (K = 7, s = 0.90)    45.88 (K = 8, s = 0.40)
error rates, along with their corresponding values of K and s, for corruption levels of 30%, 40% and 50%. In order to examine whether our method can provide favorable results for relatively different sets of parameters, if the first two minimum error rates were achieved with the same value of K, only the best error rate was retained and the second one was picked from among the other values of K. As can be seen from Table 4, the error rates are very close to each other and were obtained with different combinations of K and s, which indicates that the proposed method does not hinge on a single optimal parameter setting.

6. Conclusion

In this paper, an ensemble method for handling imperfect labels using belief functions has been presented. By extracting different types of features from the data, the proposed method takes advantage of the redundancy and complementarity of the information sources. In each feature space, the proposed re-labeling technique ignores the initial labels of the learning data and reassigns each training pattern a crisp or soft label based on its closeness to the prototypes of the main classes. Since different representations of the data are employed, our method can provide an appropriate estimate of the class labels. An MLP neural network is used as the base classifier and its outputs are interpreted as a BBA, thereby encoding partial knowledge about the class of a test pattern. The BBAs are then pooled using Dempster's rule of combination. In order to ensure that the ensemble network is constructed from a set of complementary sources of information, a classifier selection approach composed of a forward search algorithm and a non-pairwise diversity measure was adopted in our method.

Acknowledgments

We are grateful for the support provided by the Sociedad Mexicana de Inteligencia Artificial (SMIA) and the 9th Mexican International
Conference on Artificial Intelligence (MICAI-2010) in order to enhance, improve, and publish this work. We also thank Dr. Ali Hojjat, from the School of Biosciences, University of Kent, for his careful reading of this paper.

References

Basir, O., Karray, F., & Zhu, H. (2005). Connectionist-based Dempster–Shafer evidential reasoning for data fusion. IEEE Transactions on Neural Networks, 16(6), 1513–1530.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Available from http://www.ics.uci.edu./~mlearn/MLReporsitory.html.
Denoeux, T. (1995). A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Transactions on Systems, Man, and Cybernetics, 25(5), 804–813.
Denoeux, T. (2000). A neural network classifier based on Dempster–Shafer theory. IEEE Transactions on Systems, Man, and Cybernetics, 30(2), 131–150.
Ebrahimpour, R., Kabir, E., & Yousefi, M. R. (2007). Face detection using mixture of MLP experts. Neural Processing Letters, 26, 69–82.
Kam Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
Kittler, J., Ghaderi, R., Windeatt, T., & Matas, J. (2003). Face verification via error correcting output codes. Image and Vision Computing, 21, 1163–1169.
Kuncheva, L. I., & Whitaker, C. J. (2003). Measures of diversity in classifier ensembles. Machine Learning, 51, 181–207.
Lazebnik, S., Schmid, C., & Ponce, J. (2005). A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1265–1278.
Quost, B., Denoeux, T., & Masson, M.-H. (2007). Pairwise classifier combination using belief functions. Pattern Recognition Letters, 28, 644–653.
Quost, B., & Denoeux, T. (2009). Learning from data with uncertain labels by boosting credal classifiers. In Proc. of the first ACM SIGKDD int. workshop on knowledge discovery from uncertain data, Paris, France (pp. 38–47).
Rogova, G. (1994). Combining the results of several neural network classifiers. Neural Networks, 7, 777–781.
Ruta, D., & Gabrys, B. (2005). Classifier selection for majority voting. Information Fusion, 6, 63–81.
Shafer, G. (1976). A mathematical theory of evidence. Princeton, NJ: Princeton University Press.
Smets, P., & Kennes, R. (1994). The transferable belief model. Artificial Intelligence, 66, 191–234.
Tabassian, M., Ghaderi, R., & Ebrahimpour, R. (2011). Knitted fabric defect classification for uncertain labels based on Dempster–Shafer theory of evidence. Expert Systems with Applications, 38, 5259–5267.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.