Random subspace evidence classifier




Neurocomputing 110 (2013) 62–69



Haisheng Li a, Guihua Wen a,*, Zhiwen Yu a, Tiangang Zhou b

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
b State Key Laboratory of Brain and Cognitive Science, Beijing 100101, China


Abstract

Article history: Received 29 March 2012; received in revised form 8 October 2012; accepted 15 November 2012; available online 7 January 2013. Communicated by Weifeng Liu.

Although there exist many k-nearest neighbor approaches and variants, few of them consider how to make use of the information in both the whole feature space and its subspaces. To address this limitation, we propose a new classifier named the random subspace evidence classifier (RSEC). Specifically, RSEC first calculates the local hyperplane distance for each class as evidence, not only in the whole feature space but also in randomly generated feature subspaces. Then, a basic belief assignment is computed from these distances for the evidence of each class. Next, all the evidence, represented by basic belief assignments, is pooled together by Dempster's rule. Finally, RSEC assigns a class label to each test sample based on the combined belief assignment. Experiments on datasets from the UCI machine learning repository, artificial data and a face image database show that, on average, the proposed approach yields lower classification error than seven existing k-nearest neighbor approaches and variants. In addition, RSEC performs well on average on high dimensional data and on the minority class of imbalanced data.

Keywords: Evidence theory; Nearest neighbors; Local hyperplane; Random subspace

1. Introduction

Local classifiers such as the nearest neighbor (NN) and k-nearest neighbor (k-NN) rules have been studied extensively and used heavily for many years because of their many advantages. For example, they are simple in conception, suitable for problems where one cannot assume that training and test samples are drawn from the same distribution [1], and, best of all, their accuracy is often observed to match or surpass that of more sophisticated classifiers. One of the main drawbacks of the classical voting k-NN classifier is that each labeled sample is given equal importance in deciding the class membership of the sample to be classified. This often causes difficulty when the sample sets overlap. Moreover, the equal importance assumption only fits the situation in which the k nearest neighbors of the test sample are contained in a relatively small region. In practice, however, the distance information to the nearest neighbors is not always negligible and can become very large outside regions of high density. To deal with this issue, Keller et al. [2] present a fuzzy version of the k-NN classifier based on fuzzy set theory. They assign fuzzy memberships to the labeled samples, and these memberships provide a measure for the classification decision.


* Corresponding author. Tel.: +86 18998384808. E-mail address: [email protected] (G. Wen).


The theory of evidence is another excellent framework for coping with uncertainty, and Denoeux [3] has combined it with k-NN. Denoeux classifies a test sample by viewing each neighboring training sample as an item of evidence that supports the classification. The degree of support is defined according to the distance between the test sample and its neighbors, and the belief assignments generated by the evidence are then combined using Dempster's rule. Additionally, several classification methods based on the evidential formalism for handling uncertainty have been presented [4,5,6]. Other disadvantages of local classifiers are that (1) they become less effective in high dimensional spaces due to the curse of dimensionality; (2) they cannot overcome the class imbalance problem; and (3) they are easily influenced by outliers, particularly in small training sample size situations. The latter two problems may be alleviated by paying more attention to the training samples distributed on the class boundaries. Many local classifiers based on this idea have been proposed, such as the local nearest mean classifier (LMC) [7,8], the local probability center classifier (LPC) [9] and the k-local hyperplane nearest neighbor classifier (HKNN) [10]. LMC first selects the r training samples nearest to the test sample for each class. The local mean vector is computed using only the selected r training samples of each class. The test sample is then assigned to the class that gives the minimum Euclidean distance between the test sample and its local mean vector. LPC performs classification based on the local probabilistic centers of each class. The method works by reducing the number of negatively contributing points, which are the known samples falling on the


wrong side of the ideal decision boundary. The HKNN method approximates the local manifold of each class by a local hyperplane through training samples. The classifier first selects r training samples of each class near a test sample to construct local hyperplanes, which approximate the local manifold of each class, and the class label of the test sample is then assigned according to the distance between the test sample and each local hyperplane. HKNN and its adaptive version, the adaptive local hyperplane algorithm (ALH) [11], have been shown to perform very well. In this paper, unlike Denoeux's way of collecting evidence only in the neighborhood of the test sample, we accumulate evidence in both the whole feature space and its subspaces. Moreover, we use the local hyperplane of each class and its distance to the test sample as the evidence for generating basic belief assignments. We then combine these basic belief assignments by Dempster's rule to reassign class beliefs for classification. The subspaces are generated by randomly dividing the feature space, which differs from the random subspace method (RSM) [12,13,14] in ensemble learning, where features are randomly selected from the feature space many times. The advantage of randomly dividing the feature space, rather than randomly selecting some features, is that no features are lost. In addition, because different features are randomly recombined, correlated features are reduced within each subspace, and hence the evidence from a subspace is less affected by the curse of dimensionality. We also generate subspaces of different sizes, and thereby obtain enough distinct evidence for the classification decision.

2. Elementary theory and method

2.1. Evidence theory for classification

Evidence theory, also called Dempster–Shafer (D–S) theory, is a formal framework for representing and reasoning with uncertain and imprecise information. It is also known as the theory of belief functions. The theory is based on two ideas: the idea of obtaining degrees of belief for questions, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. In this section, the main concepts and some basic notation underlying the D–S theory of evidence are briefly recalled, and Denoeux's k-NN classification rule based on evidence theory is introduced.

Let $\Theta$ be a finite set of mutually exclusive and exhaustive hypotheses about some problem domain, called the frame of discernment. A belief is assigned to each subset of $\Theta$. A basic belief assignment (BBA) is a function $m: 2^{\Theta} \to [0,1]$ verifying:

$$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1 \qquad (1)$$

The quantity $m(A)$ is called a basic belief number. It represents the measure of belief that one is willing to commit exactly to $A$, given a certain piece of evidence. Using the so-called Dempster's rule of combination, two BBAs $m_1$ and $m_2$, coming from two independent items of evidence $E_1$ and $E_2$, can be combined to form a new BBA $m = m_1 \oplus m_2$, defined as

$$m(\emptyset) = 0, \qquad m(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{\sum_{B \cap C \neq \emptyset} m_1(B)\, m_2(C)} \qquad (2)$$

Similarly, if more evidence related to $\Theta$ can be acquired, more BBAs can be combined:

$$m = m_1 \oplus \cdots \oplus m_N \qquad (3)$$


The sum of all basic belief numbers over the subsets of $A$ is called a belief function:

$$Bel(A) = \sum_{B \subseteq A} m(B) \qquad (4)$$

$Bel(A)$, also called the credibility of $A$, is interpreted as a measure of the total belief committed to $A$. The quantity $Pl(A) = 1 - Bel(\bar{A})$, called the plausibility of $A$, is interpreted as the extent to which one fails to doubt $A$. It is defined as:

$$Pl(A) = \sum_{B \cap A \neq \emptyset} m(B) \qquad (5)$$

$Bel$ and $Pl$ are also called the lower and upper probabilities respectively, satisfying:

$$0 \leq Bel(A) \leq Pl(A) \leq 1 \qquad (6)$$
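To make Eqs. (1)–(3) concrete, the following minimal Python sketch (our own illustration, not part of the original paper; the frame and the masses are made up) combines two BBAs over a small frame of discernment with Dempster's rule:

```python
from itertools import chain, combinations

def powerset(frame):
    """All subsets of the frame, as frozensets."""
    return [frozenset(s) for s in chain.from_iterable(
        combinations(frame, r) for r in range(len(frame) + 1))]

def dempster_combine(m1, m2, frame):
    """Combine two BBAs (dicts: frozenset -> mass) by Dempster's rule, Eq. (2)."""
    combined = {A: 0.0 for A in powerset(frame)}
    conflict = 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            inter = B & C
            if inter:
                combined[inter] += mB * mC
            else:
                conflict += mB * mC          # mass falling on the empty set
    norm = 1.0 - conflict                    # denominator of Eq. (2)
    return {A: v / norm for A, v in combined.items() if v > 0}

# Two items of evidence over the frame {w1, w2}, each committing part of its
# belief to a singleton and the rest to the whole frame (cf. Eq. (7) below).
frame = {"w1", "w2"}
m1 = {frozenset({"w1"}): 0.6, frozenset(frame): 0.4}
m2 = {frozenset({"w2"}): 0.3, frozenset(frame): 0.7}
print(dempster_combine(m1, m2, frame))
```

In this toy example the conflicting mass 0.6 x 0.3 = 0.18 is discarded by the normalization, exactly as prescribed by the denominator of Eq. (2).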

In [3], the k-NN classifier has been improved by the theory of evidence. In this method, each training sample $x_i$ lying in the neighborhood of the test sample $x$ is considered as an item of evidence, and the frame of discernment is composed of all the classes, i.e. $\Theta = \{\omega_1, \ldots, \omega_i, \ldots, \omega_M\}$, where $\omega_i$ is a class label. The key assumption of this approach is that each instance belongs to one and only one class. Under this assumption, if training sample $x_i$ has class label $\omega_q$, then $m_i$ distributes some belief to the singleton hypothesis $\{\omega_q\}$. However, this piece of evidence does not by itself provide 100% certainty, so only part of the belief is committed to $\{\omega_q\}$; the rest is distributed to $\Theta$, the whole frame of discernment. The portion of belief committed to $\{\omega_q\}$ is defined as a decreasing function of the distance $d_i$ between $x_i$ and $x$, so $m_i$ takes the following form:

$$m_i(\{\omega_q\}) = \alpha\, \varphi(d_i), \qquad m_i(\Theta) = 1 - m_i(\{\omega_q\}), \qquad m_i(A) = 0, \; \forall A \in 2^{\Theta} \setminus \{\Theta, \{\omega_q\}\} \qquad (7)$$

where $\varphi$ is a decreasing function varying between 1 and 0, and $0 < \alpha < 1$ is a parameter. The function $\varphi$ can be defined as

$$\varphi(d_i) = \exp(-\gamma d_i^2) \qquad (8)$$

where $\gamma > 0$ is a parameter. For a test sample, the pieces of evidence provided by the training samples lying in its k-nearest neighborhood can be combined by Dempster's rule, for example

$$m_{i \oplus j}(\{\omega_q\}) = \frac{1}{K}\left[ m_i(\{\omega_q\})\, m_j(\{\omega_q\}) + m_i(\{\omega_q\})\, m_j(\Theta) + m_i(\Theta)\, m_j(\{\omega_q\}) \right]$$

$$m_{i \oplus j}(\Theta) = \frac{1}{K}\, m_i(\Theta)\, m_j(\Theta)$$

$$K = m_i(\{\omega_q\})\, m_j(\{\omega_q\}) + m_i(\{\omega_q\})\, m_j(\Theta) + m_i(\Theta)\, m_j(\{\omega_q\}) + m_i(\Theta)\, m_j(\Theta) \qquad (9)$$

In this way, combining all the basic belief assignments yields a total, updated belief assignment $m$. Consequently, the credibility and the plausibility of each class $\omega_q$ are respectively

$$Bel(\{\omega_q\}) = m(\{\omega_q\}) \qquad (10)$$

$$Pl(\{\omega_q\}) = m(\{\omega_q\}) + m(\Theta) \qquad (11)$$

Assume that $\alpha_q$ is the action of assigning $x$ to class $\omega_q$, that the loss of a misclassification is 1, and that the loss of a correct classification is 0. Then the lower and upper expected losses associated with $Bel(\{\omega_q\})$ and $Pl(\{\omega_q\})$ are, respectively,

$$R_*(\alpha_q \mid x) = 1 - m(\{\omega_q\}) - m(\Theta) \qquad (12)$$

$$R^*(\alpha_q \mid x) = 1 - m(\{\omega_q\}) \qquad (13)$$

Given the particular form of $m$, $R_*$ and $R^*$ differ only by a constant additive term; consequently, these two expectations



induce the same ranking of actions $a_1, \ldots, a_M$ and lead to the same decision rule: the pattern is assigned to the class $\omega_t$ with maximum belief assignment [15,16]:

$$m(\{\omega_t\}) = \max_q m(\{\omega_q\}) \qquad (14)$$
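As an illustration of Denoeux's rule (Eqs. (7)–(14)), the sketch below is our own simplified rendering, not the authors' code; the defaults alpha = 0.95 and gamma = 1.0 are illustrative. Because every focal element is either a singleton or the whole frame, the pairwise combination of Eq. (9) can be implemented with a small dictionary of masses:

```python
import numpy as np

THETA = "THETA"  # stands for the whole frame of discernment

def neighbor_bba(label, dist, alpha=0.95, gamma=1.0):
    """BBA induced by one neighbor, Eqs. (7)-(8): mass on its class and on THETA."""
    m = alpha * np.exp(-gamma * dist ** 2)
    return {label: m, THETA: 1.0 - m}

def combine(m1, m2):
    """Dempster's rule when all focal elements are singletons or THETA, cf. Eq. (9)."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            if A == THETA:
                key = B
            elif B == THETA or A == B:
                key = A
            else:                          # two different singletons: empty intersection
                conflict += a * b
                continue
            out[key] = out.get(key, 0.0) + a * b
    return {k: v / (1.0 - conflict) for k, v in out.items()}

def eknn_predict(x, X_train, y_train, k=5, alpha=0.95, gamma=1.0):
    """Evidential k-NN decision rule, Eq. (14): class with maximum combined mass."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    m = {THETA: 1.0}                                   # vacuous initial BBA
    for i in idx:
        m = combine(m, neighbor_bba(y_train[i], d[i], alpha, gamma))
    return max((c for c in m if c != THETA), key=lambda c: m[c])
```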

2.2. K-local hyperplane nearest neighbor algorithm

The K-local hyperplane nearest neighbor (HKNN) algorithm is a manifold-related nearest neighbor classifier proposed to improve the k-nearest neighbor (KNN) classifier. The basic idea of HKNN is to approximate the local manifold of each class by a local hyperplane through training samples [8]. In this method, r training samples of each class nearest to a test sample are first selected, and then a local hyperplane is constructed from these selected training samples to approximate the local manifold of each class. The class label of the test sample is then assigned according to the distance between the test sample and the local hyperplane of each class. The method aims to raise the classification performance of conventional KNN to the level of SVM [17], and it has been shown to perform very well in some applications [18,19]. The HKNN algorithm can be stated formally as Algorithm 1.

Algorithm 1. K-Local Hyperplane Nearest Neighbor HKNN(x, X, k, λ) /* x is the test sample, X the training samples, λ a penalty term, and k the neighborhood size */

Step 1. Select the k nearest neighbors of the test sample x from each class $\omega_j$, denoted $X_k(x, \omega_j)$.

Step 2. For each $X_k(x, \omega_j) = \{x_1, \ldots, x_i, \ldots, x_k\}$, define the local hyperplane as

$$H_k(x, \omega_j) = \left\{ p \;\middle|\; p = \bar{x} + \sum_{i=1}^{k} \alpha_i V_i, \; \alpha_i \in \mathbb{R} \right\} \qquad (15)$$

where $\bar{x} = \sum_{i=1}^{k} x_i / k$ and $V_i = x_i - \bar{x}$.

Step 3. Compute the distance between the test sample and the local hyperplane by

$$d_h(x, H_k(x, \omega_j)) = \min_{p \in H_k(x, \omega_j)} \| x - p \| = \min_{\alpha_i \in \mathbb{R}} \left\| x - \bar{x} - \sum_{i=1}^{k} \alpha_i V_i \right\| \qquad (16)$$

where the $\alpha_i$ can be obtained by solving a linear system that can be expressed in matrix form as

$$(V' \cdot V) \cdot \alpha = V' \cdot (x - \bar{x}) \qquad (17)$$

where $x$ and $\bar{x}$ are n-dimensional column vectors, $\alpha = (\alpha_1, \ldots, \alpha_k)'$, and $V$ is an $n \times k$ matrix whose columns are the vectors $V_i$. To penalize large values of $\alpha_i$, a penalty term $\lambda$ is introduced. The distance is then redefined as

$$d_h(x, H_k(x, \omega_j)) = \min_{\alpha_i \in \mathbb{R}} \left\{ \left\| x - \bar{x} - \sum_{i=1}^{k} \alpha_i V_i \right\|^2 + \lambda \sum_{i=1}^{k} \alpha_i^2 \right\} \qquad (18)$$

$$d_j = d_h(x, H_k(x, \omega_j)) \qquad (19)$$

Step 4. According to the local hyperplane distance $d_j$ of each class $\omega_j$, i.e. $\{(\omega_j, d_j)\}$, the test sample is assigned to the class $\omega_t$ with minimum distance $d_{\min}$.
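A compact sketch of this procedure (our own reading of Algorithm 1, not the authors' code; the regularized solve is one standard way to realize Eqs. (17)–(18), and k = 5, lam = 1.0 are illustrative defaults):

```python
import numpy as np

def local_hyperplane_distance(x, neighbors, lam=1.0):
    """Penalized distance from x to the local hyperplane spanned by the k
    selected neighbors of one class, following Eqs. (15)-(18).
    `neighbors` is a (k, n) array, `lam` the penalty term lambda."""
    x_bar = neighbors.mean(axis=0)                     # local mean in Eq. (15)
    V = (neighbors - x_bar).T                          # n x k matrix of V_i = x_i - x_bar
    # Minimizing Eq. (18) yields the regularized normal equations
    # (V'V + lam*I) alpha = V'(x - x_bar), the penalized form of Eq. (17).
    alpha = np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ (x - x_bar))
    r = x - x_bar - V @ alpha
    return float(r @ r + lam * (alpha @ alpha))        # value of the objective in Eq. (18)

def hknn_predict(x, X_train, y_train, k=5, lam=1.0):
    """Step 4: assign x to the class with the smallest local hyperplane distance."""
    dists = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        nearest = Xc[np.argsort(np.linalg.norm(Xc - x, axis=1))[:k]]
        dists[c] = local_hyperplane_distance(x, nearest, lam)
    return min(dists, key=dists.get)
```

With lam = 0 the solve reduces to the plain hyperplane distance of Eqs. (16)–(17); the penalty keeps the coefficients small when the neighbor vectors are nearly collinear.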

2.3. Random subspace method

The random subspace method (RSM) is based on selecting a random feature subset when training each of the component classifiers [20]. It is an ensemble learning method, first presented by T. K. Ho for constructing decision forests. The method relies on an autonomous procedure to randomly select a small number of dimensions from a given feature space. In each pass such a selection is made and a subspace is fixed; all samples are then projected onto this subspace, and a classifier is trained using the projected training samples. During classification, a sample of unknown class is also projected onto the same subspace and classified with the corresponding classifier. For a feature space of n dimensions there are $2^n$ possible selections, and with each selection a classifier can be trained; randomization in selecting the dimensions is merely a convenient way to explore the possibilities. The random subspace method is a parallel learning algorithm, that is, each subspace is generated independently. This makes it suitable for parallel implementation and fast learning, which is desirable in some practical applications [9].
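As a minimal sketch of RSM (our illustration, not from the paper; a plain 1-NN is used as a hypothetical base learner, and the subspace size is assumed to be smaller than the number of features), each component classifier sees a random subset of the features and predictions are combined by majority vote:

```python
import numpy as np

def train_random_subspaces(X, y, n_classifiers=10, subspace_dim=5, seed=0):
    """Random subspace method: each component classifier is trained on a
    random subset of the features (here the 'training' is just storing the
    projected data for a 1-NN base learner)."""
    rng = np.random.default_rng(seed)
    subspaces = [rng.choice(X.shape[1], size=subspace_dim, replace=False)
                 for _ in range(n_classifiers)]
    return [(feats, X[:, feats], y) for feats in subspaces]

def predict_random_subspaces(x, ensemble):
    """Project x onto each stored subspace, classify with 1-NN, majority-vote."""
    votes = []
    for feats, Xs, y in ensemble:
        nearest = np.argmin(np.linalg.norm(Xs - x[feats], axis=1))
        votes.append(y[nearest])
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```

Note that, in contrast to this sampling scheme, the proposed RSEC of Section 3 partitions the feature space, so no feature is ever discarded.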

3. Proposed classifier

According to evidence theory, the belief assignment can be reassigned by combining all the basic belief assignments generated by the individual pieces of evidence, so a much better decision can be made by collecting more reliable evidence. The excellent performance of HKNN shows that the local hyperplane of each class and its distance to the test sample can be viewed as reliable and effective evidence. However, in the original whole feature space the number of local hyperplanes we can construct is very limited. Assume that the set of classes is denoted by $\Theta = \{\omega_1, \ldots, \omega_i, \ldots, \omega_M\}$; then the number of local hyperplanes is M, i.e. the number of different classes. To obtain enough distinct evidence, the proposed algorithm combines evidence collected from randomly divided feature subspaces with the evidence from the whole feature space. Moreover, we divide the whole feature space several times, and in each pass the feature space is divided into almost equal-sized subspaces. In this way, randomly generated subspaces of different sizes are obtained; because subspaces of different sizes are explored, more information about the data can be acquired. The proposed algorithm RSEC is described as Algorithm 2.

Algorithm 2. RSEC(x, X, k, N) /* x is the test sample, X the training samples, k the set of neighborhood sizes used for creating evidence, and N = {n_1, ..., n_i, ..., n_T}, where n_i is a division number and T the number of times the feature space is divided */

Step 1. According to each division number $n_i$, the feature space S is randomly divided into $n_i$ equal-sized subspaces in each pass, so in total we obtain $P = \sum_{i=1}^{T} n_i$ subspaces. All the feature spaces in which evidence will be accumulated can be expressed as

$$S = \{S_1, \ldots, S_i, \ldots, S_P\}$$

Step 2. By randomly dividing the feature space, the training and test samples are projected as $X_i$ and $x_i$ in each divided space $S_i$. In each generated space $S_i$, the local hyperplane of each class and its distance to the test sample $x_i$ are computed by Algorithm 1 as the evidence. We thus obtain all the evidence as

$$E = \{E_1, \ldots, E_i, \ldots, E_P\}, \qquad E_i = \{(\omega_1^i, d_1^i), \ldots, (\omega_j^i, d_j^i), \ldots, (\omega_M^i, d_M^i)\}$$



where $d_j^i = d_h(x_i, H_k(x_i, \omega_j^i))$, $E_i \in E$ is the set of evidence obtained in subspace $S_i \in S$, $(\omega_j^i, d_j^i) \in E_i$ is the j-th piece of evidence obtained in the i-th subspace, and $1 \leq i \leq P$, $1 \leq j \leq M$.

Step 3. The information provided by each piece of evidence $(\omega_j^i, d_j^i) \in E_i$ is represented by a BBA $m_{ij}$ as follows:

$$m_{ij}(\{\omega_q\}) = \alpha \cdot \exp(-(d_j^i)^2), \qquad m_{ij}(\Theta) = 1 - m_{ij}(\{\omega_q\}), \qquad m_{ij}(A) = 0, \; \forall A \in 2^{\Theta} \setminus \{\Theta, \{\omega_q\}\} \qquad (20)$$

where $\alpha$ is a constant parameter.

Step 4. All BBAs $m_{ij}$ are combined using Dempster's rule:

$$m = m_{11} \oplus \cdots \oplus m_{ij} \oplus \cdots \oplus m_{PM} \qquad (21)$$

Step 5. The test sample is assigned to the class $\omega_t$ with maximum belief assignment:

$$m(\{\omega_t\}) = \max_q m(\{\omega_q\}) \qquad (22)$$

It is necessary to explain the setup of the parameter N = {n_1, ..., n_i, ..., n_T}. In fact, each division number $n_i$ can be set to any number less than the dimensionality of the data, and the number of divisions T can in principle be arbitrary. However, if we intend to acquire information in both the whole feature space and its subspaces, one of the division numbers $n_i$ should be set to 1, so that the whole feature space is preserved, i.e. the feature space is not divided. On the other hand, we also intend to make the randomly generated subspaces differ in size from the whole feature space. Based on these two ideas, we recommend setting N = {1, 2, 3} for low dimensional data and N = {1, 3, 5} for high dimensional data. That is, the feature space is randomly divided three times, and in each pass the resulting subspaces have a different size. In this way we obtain enough distinct feature spaces and evidence for classification. For high dimensional data we want the divided feature spaces to be as different from one another as possible, which is why the parameter setup differs from the low dimensional case.
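To tie the pieces together, here is a self-contained sketch of RSEC as we read Algorithm 2 (our own illustrative code, not the authors' implementation; the neighbor selection, the scaling of the distance inside Eq. (20) and defaults such as k = 5, a = 0.95, lam = 1.0 are assumptions):

```python
import numpy as np

THETA = "THETA"

def hyperplane_distance(x, neighbors, lam=1.0):
    """Penalized local hyperplane distance of Algorithm 1 (Eq. (18))."""
    x_bar = neighbors.mean(axis=0)
    V = (neighbors - x_bar).T
    alpha = np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ (x - x_bar))
    r = x - x_bar - V @ alpha
    return float(r @ r + lam * (alpha @ alpha))

def combine(m1, m2):
    """Dempster's rule when all focal elements are singletons or THETA (Eq. (21))."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            if A == THETA:
                key = B
            elif B == THETA or A == B:
                key = A
            else:
                conflict += a * b
                continue
            out[key] = out.get(key, 0.0) + a * b
    return {k: v / (1.0 - conflict) for k, v in out.items()}

def rsec_predict(x, X, y, k=5, N=(1, 2, 3), a=0.95, lam=1.0, seed=0):
    """Random subspace evidence classifier (Algorithm 2), simplified sketch."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Step 1: for each n_i in N, randomly partition the features into n_i
    # roughly equal-sized subspaces (n_i = 1 keeps the whole feature space).
    subspaces = []
    for n_i in N:
        subspaces.extend(np.array_split(rng.permutation(X.shape[1]), n_i))
    # Steps 2-4: local hyperplane distance per class in every subspace, turned
    # into a BBA by Eq. (20) and combined with Dempster's rule.
    m = {THETA: 1.0}
    for feats in subspaces:
        xs = x[feats]
        for c in classes:
            Xc = X[y == c][:, feats]
            nearest = Xc[np.argsort(np.linalg.norm(Xc - xs, axis=1))[:k]]
            d = hyperplane_distance(xs, nearest, lam)
            mass = a * np.exp(-d ** 2)                 # belief committed to {c}, Eq. (20)
            m = combine(m, {c: mass, THETA: 1.0 - mass})
    # Step 5: class with maximum combined belief assignment, Eq. (22).
    return max(classes, key=lambda c: m.get(c, 0.0))
```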

4. Experimental results

4.1. Experimental setup

To validate our approach, we compare it with several approaches through experiments on benchmark real data sets, artificial data sets and face image data sets. These approaches are KNN, FKNN, EKNN, LMC, LPC, HKNN, ALH and RSEC. The reason for choosing these approaches is that they all belong to the same family, called local classifiers, lazy learners or nearest neighbor methods. Brief descriptions of these methods are given as follows:

1) KNN: the conventional rule, widely used for pattern classification.
2) FKNN: it classifies an unseen pattern using a fuzzy decision rule over the selected k nearest neighbors, where membership values are assigned to each neighbor [2].
3) EKNN: it classifies an unseen pattern on the basis of its nearest neighbors from the point of view of evidence theory [3].
4) LMC: it classifies a test sample to the class with minimal distance between the test sample and its categorical centers [7].
5) LPC: it classifies a test sample by measuring the distances between the test pattern and its categorical probability centers. The test sample is labeled with the class having the minimal distance [9].
6) HKNN: the manifold-based k-local hyperplane nearest neighbor classifier, described in Section 2.2 [10].
7) ALH: the adaptive version of HKNN [11].
8) RSEC: the method proposed in this paper.

In the experiments, the error rate is taken as the measure of performance. For imbalanced data, sensitivity and specificity are tested separately. The Euclidean distance is used in all compared classifiers. For each data set we performed five-fold cross validation five times. All compared approaches depend on some parameters. The neighborhood size k is common to all compared classifiers and takes values over the range {3, 6, 9, ..., 30}. The kernel parameter g for LPC takes values from {0.1, 0.3, ..., 0.9}. These parameters are determined for each compared classifier through five-fold cross validation on the training samples and then applied to perform the classification on the testing samples. In RSEC, N takes the values N = {1, 2, 3} for the real and artificial data and N = {1, 3, 5} for the high dimensional face image data, and α = 0.95 according to [16].

4.2. On real data sets

In this section we examine the performance of the competing classification methods on real world data. One advantage of real data is that it is generated without any knowledge of the classification procedures it will be used to test. In our experiments we used ten different real data sets, all taken from the UCI Machine Learning Repository [21] and summarized in Table 1; records with missing values and non-numeric attributes were removed. All attributes are scaled to the [0,1] range during preprocessing, which ensures that attributes with larger values do not overwhelm attributes with smaller values and consequently helps to reduce errors.

It can be observed from Table 2 that RSEC outperforms all compared classifiers on 7 of the 10 real data sets. Generally, every method has its strengths and weaknesses, so a measure is needed to evaluate the robustness of the different methods. We use the commonly adopted measure that quantifies robustness by the ratio $b_m$ of the error rate $e_m$ of method m to the smallest error rate over all compared methods on a particular data set: $b_m = e_m / \min_{1 \leq k \leq n} e_k$ [22]. Thus the best method m* has $b_{m^*} = 1$, and all other methods have $b_m > 1$; the larger the value of $b_m$, the worse the performance of the method. The distribution of $b_m$ is therefore a good indicator of robustness. Fig. 1 shows the distribution of $b_m$ for each method over the ten real data sets. Clearly the spread for RSEC is much narrower and closer to one.

Table 1
Data sets used in experiments.

No.  Dataset       Size  Attributes  Classes
1    Wine          178   13          3
2    Diabetes      768   8           2
3    Ionosphere    351   34          2
4    Segmentation  210   19          7
5    Iris          150   4           3
6    Sonar         208   60          2
7    Monks         432   6           2
8    SPECT         267   22          2
9    SPECTF        267   44          2
10   Hayes_roth    160   4           3



Table 2
Average classification errors for real data.

Data          KNN          FKNN         EKNN         LMC          LPC          HKNN         ALH          RSEC
Wine          3.13±0.75    3.02±0.49    2.57±0.75    2.67±0.72    2.27±0.41    2.22±0.68    1.12±0.39    2.35±0.90
Diabetes      25.26±0.47   25.43±1.10   24.48±0.47   24.21±0.13   24.30±0.39   24.73±0.33   27.81±0.27   23.30±0.59
Ionosphere    14.92±0.68   14.47±0.31   10.42±0.71   10.88±1.06   11.05±1.14   9.56±0.71    8.65±0.82    7.01±0.68
Segmentation  14.00±1.60   13.52±1.28   13.52±1.37   12.00±1.03   11.33±1.97   11.33±1.48   10.76±0.79   10.28±0.86
Iris          4.80±0.55    4.40±0.59    4.40±0.59    4.53±0.86    4.00±0.47    4.40±0.36    4.66±0.00    4.53±0.29
Sonar         18.72±0.50   17.58±0.75   18.72±0.50   14.88±3.78   15.35±3.04   13.90±1.32   14.56±0.25   13.58±1.03
Monks         15.81±1.63   19.13±2.66   18.35±2.27   18.98±2.66   20.76±2.73   14.05±0.85   10.58±0.92   9.88±1.71
SPECT         18.23±1.13   18.87±0.56   16.87±1.12   16.62±0.57   17.99±0.35   17.12±0.85   18.37±0.74   16.12±0.38
SPECTF        23.84±0.78   24.34±1.35   20.84±0.42   21.86±1.21   21.72±1.00   22.72±1.15   22.99±2.45   21.34±1.00
Hayes_roth    43.10±1.00   59.8±0.33    30.00±3.31   35.12±6.60   33.51±5.30   30.22±4.02   27.38±7.14   22.50±3.27

Fig. 1. Average performance distribution of different classifiers.

This result demonstrates that RSEC obtains the most robust performance over the ten data sets.

4.3. About sensitivity and specificity

Accuracy or error rate is not a useful measure for imbalanced data. For example, with a ratio of 1:100 between the minority and majority classes, a classifier that assigns all instances to the majority class has 99% accuracy. Several measures [23] have been developed to deal with this problem. Given the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), several measures can be defined. Among them, the most commonly used are the true positive rate (TPrate), also called recall or sensitivity, and the true negative rate (TNrate), or specificity:

$$TP_{rate} = recall = sensitivity = \frac{TP}{TP + FN}, \qquad TN_{rate} = specificity = \frac{TN}{TN + FP}$$

In this step of our experiment we test the sensitivity and specificity of each classifier. The classifiers are applied to three binary imbalanced data sets: Iris, Diabetes and Ionosphere. The well-known Iris data contain three classes: Iris-setosa, Iris-versicolor and Iris-virginica. We take the class Iris-setosa as the positive examples and the remainder as the negative, so the Iris data are transformed into a binary imbalanced data set [24]. The Diabetes and Ionosphere data are inherently imbalanced. Table 3 summarizes the data selected in this study.

Table 3
Description of the imbalanced data sets.

Data        Positive examples  Negative examples  Class (min., maj.)                  % class (min., maj.)
Iris        50                 100                (Iris-setosa, remainder)            (33.33, 66.67)
Diabetes    268                500                (Tested-positive, tested-negative)  (34.84, 65.16)
Ionosphere  126                225                (Bad, good)                         (35.90, 64.10)

The complete table of results for all the algorithms is shown in Table 4. We can observe that the transformed Iris data seem too easy for all classifiers: every classifier achieves 100% sensitivity and specificity. Over all the data, our proposed method has good average performance on sensitivity, while its specificity is not as strong. However, in many applications it is more important to detect the minority positive class, i.e. to achieve a higher sensitivity rate. So if we are mainly interested in the performance on the positive minority class, RSEC may be one of the choices.

4.4. The effect of evidence from subspaces

In RSEC we collect evidence from subspaces of the feature space and combine it with the evidence from the whole feature space, so it is reasonable to test whether the evidence from subspaces improves the performance of the classifier. In this section we conduct another experiment to examine this effect. First, the experiment is repeated for our proposed classifier while collecting evidence only from the original whole feature space, i.e. with the parameter set to N = {1}. For convenience, we call this classifier the local hyperplane evidence classifier (LHEC). LHEC is very similar to HKNN; the difference between the two methods lies in the classification decision. HKNN classifies the test sample directly according to its minimum distance to the local hyperplane of each class, whereas LHEC calculates a belief assignment by treating each local hyperplane as evidence. We then compare the average classification errors of LHEC and RSEC. The results are shown in Table 5. Compared with LHEC, RSEC performs better on 9 of the 10 data sets. The results confirm that the evidence from subspaces can often, though not always, improve the classification performance.

4.5. On artificial data sets

It is well known that the curse of dimensionality is a hard issue for pattern recognition, as high dimensional data may contain redundant dimensions, a high degree of correlation among dimensions, or data that are sparsely



Table 4
Test of sensitivity and specificity.

Sensitivity
Data        KNN           FKNN          EKNN          LMC           LPC           HKNN          ALH           RSEC
Iris        100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00
Diabetes    49.72±2.16    51.58±1.05    64.56±1.70    64.78±1.02    65.16±0.81    64.41±1.67    56.64±2.57    63.42±1.79
Ionosphere  62.40±2.22    62.08±1.98    71.60±1.41    76.67±2.62    74.91±3.17    86.01±1.65    80.78±2.33    87.80±2.70
Average     70.70±2.19    71.22±1.52    78.72±1.55    80.48±1.82    80.02±1.99    83.47±1.66    79.14±2.45    83.74±2.25

Specificity
Data        KNN           FKNN          EKNN          LMC           LPC           HKNN          ALH           RSEC
Iris        100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00    100.0±0.00
Diabetes    86.96±1.72    86.48±1.72    79.92±0.54    81.04±0.81    80.96±0.68    80.24±0.58    80.08±1.67    83.36±1.39
Ionosphere  97.77±0.54    98.22±0.70    97.68±0.37    97.86±0.73    97.95±0.39    94.04±1.02    97.86±0.85    95.28±0.80
Average     94.91±1.13    94.90±1.21    92.53±0.45    92.96±0.77    92.97±0.54    91.42±0.80    92.64±1.26    92.88±1.10

Table 5
Average classification errors of LHEC and RSEC.

Data          LHEC          RSEC
Wine          2.43±0.31     2.35±0.90
Diabetes      24.39±1.16    23.38±0.36
Ionosphere    9.30±2.02     7.01±0.68
Segmentation  11.42±0.82    10.28±0.86
Iris          4.66±0.66     4.53±0.29
Sonar         14.08±0.30    13.58±1.03
Monks         16.50±1.36    9.88±1.71
SPECT         16.11±1.35    16.12±0.38
SPECTF        21.60±1.31    21.34±1.00
Hayes_roth    33.55±4.18    22.50±3.27

distributed. To validate that RSEC deals with this problem better, we conduct experiments on ring norm data [25] and p-dimensional data [6], as they can be generated with different dimensionalities. It can be observed from Fig. 2 that, as the dimensionality increases, RSEC performs best on these data, while the errors of KNN and FKNN rise quickly on both data sets. These experimental results suggest that RSEC is more robust to the dimensionality and behaves favorably in high dimensional data spaces, which can be expected to widen its applications.

Fig. 2. Classification errors of the compared approaches on data with 200 points and dimensions taken over {5, 10, ..., 50}.

4.6. Application to face images

In recent years face recognition has received substantial attention. However, face recognition is confronted by a most challenging problem, the small sample size problem, i.e. the number of training samples is far smaller than the dimensionality of the samples. It becomes even more difficult when the testing samples are subject to severe facial variations such as expression, illumination and occlusion. To validate that RSEC deals with these issues better, we conduct experiments on the Yale and ORL face databases, using the Euclidean distance. The Yale database contains 165 grayscale images in GIF format of 15 individuals. This data set, in MATLAB format with 32 × 32 images, can be downloaded from http://www.cs.uiuc.edu/dengcai2/Data/Yale/images.html. Example images are shown in Fig. 3. This data set is a typical small sample size problem, whose dimensionality is much higher than the number of training samples. The ORL database provides 10 sample images of each of 40 subjects, 400 sample images in total. Example images are shown in Fig. 4. The different images of each subject provide variation in views of the individual, such as lighting, facial features and slight changes in head orientation. This face database has been a standard set of test images in much of the literature on face recognition. This data set, in


MATLAB format with 32 × 32 images, can be downloaded from http://www.cs.uiuc.edu/dengcai2/Data/ORL/images.html. The average classification errors of all compared classifiers on these two image data sets are shown in Table 6. It can be observed that RSEC is superior to all the other algorithms on both data sets. As these data are high dimensional, RSEC may be well suited to high dimensional data.



Fig. 3. Example images in Yale faces database.

Fig. 4. Example images in ORL faces database.

Table 6
Average classification errors for face images.

Classifier  Yale data     ORL data
KNN         38.08±2.99    8.55±0.71
FKNN        37.98±2.11    6.90±0.91
EKNN        37.12±3.10    8.90±0.96
LMC         29.85±1.54    5.25±0.79
LPC         33.41±2.85    10.35±0.99
HKNN        28.91±0.97    3.35±0.57
ALH         29.08±1.13    6.05±0.99
RSEC        27.26±1.15    3.16±0.14

5. Conclusions

We have presented a classification method based on evidence theory that uses the local hyperplane of each class and its distance to the test sample as evidence. To obtain enough discriminative evidence, evidence is collected in randomly divided subspaces of the feature space and then combined with the evidence from the whole feature space. The method also takes into account the effect of subspaces of different sizes when forming the evidence. This method can help to cope with the curse of dimensionality: when the feature space is randomly divided, the random recombination of different features reduces the feature correlation within each subspace, and hence the evidence collected from a subspace is less affected by the curse of dimensionality. Our experiments on artificial data and high dimensional face image data confirm this. Moreover, randomly dividing the feature space gives the method a parallel mechanism and makes it suitable for parallelization, which is desirable in some practical applications. For the proposed method it is worth discussing how many subspaces should be generated; since different data sets have different dimensionalities and may require different numbers of subspaces, the problem is complicated. In addition, selecting more reliable evidence may further improve the classification accuracy, so looking for more reliable evidence is also worthwhile. These problems will be explored in future work.

Acknowledgments

The authors thank the anonymous reviewers and editors for their valuable suggestions and comments on improving this paper. This work was supported by the China National Science Foundation under Grants 60973083, 61273363 and 61003174, the State Key Laboratory of Brain and Cognitive Science under Grant 08B12, and the Fundamental Research Funds for the Central Universities, SCUT.

References

[1] E.K. Garcia, S. Feldman, M.R. Gupta, S. Srivastava, Completely lazy learning, IEEE Trans. Knowl. Data Eng. 22 (2010) 1274–1285.
[2] J.M. Keller, M.R. Gray, J.A. Givens, A fuzzy k-nearest neighbor algorithm, IEEE Trans. Syst. Man Cybern. 15 (4) (1985) 580–585.
[3] T. Denoeux, A k-nearest neighbor classification rule based on Dempster–Shafer theory, IEEE Trans. Syst. Man Cybern. 25 (1995) 804–813.
[4] T. Denoeux, A neural network classifier based on Dempster–Shafer theory, IEEE Trans. Syst. Man Cybern. A 30 (2000) 131–150.
[5] E. Come, L. Oukhellou, T. Denoeux, P. Aknin, Learning from partially supervised data using mixture models and belief functions, Pattern Recognition 42 (2009) 334–348.
[6] Z. Younes, F. Abdallah, T. Denoeux, Evidential multi-label classification approach to learning from data with imprecise labels, in: Proceedings of IPMU, Dortmund, Germany, vol. 7, 2010.
[7] Y. Mitani, Y. Hamamoto, A local mean-based nonparametric classifier, Pattern Recognition Lett. 27 (2006) 1151–1159.
[8] Y. Mitani, Y. Hamamoto, Classifier design based on the use of nearest neighbor samples, in: Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, vol. 2, 2000, pp. 773–776.
[9] B. Li, Y.W. Chen, Y.Q. Chen, The nearest neighbor algorithm of local probability centers, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 38 (2008) 141–154.
[10] P. Vincent, Y. Bengio, K-local hyperplane and convex distance nearest neighbor algorithms, in: Advances in Neural Information Processing Systems (NIPS) 14, MIT Press, Cambridge, MA, 2002, pp. 985–992.
[11] T. Yang, V. Kecman, Adaptive local hyperplane classification, Neurocomputing 71 (2008) 3001–3004.
[12] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 832–844.
[13] T.K. Ho, A. Amin, D. Dori, P. Pudil, H. Freeman, Nearest neighbors in random subspaces, Lecture Notes in Computer Science, Springer, Germany, 1998, pp. 640–648.
[14] T.K. Ho, Random decision forests, in: Proceedings of the Third International Conference on Document Analysis and Recognition, Montreal, Canada, vol. 8, 1995, pp. 278–282.
[15] T. Denoeux, Analysis of evidence-theoretic decision rules for pattern classification, Pattern Recognition 30 (7) (1997) 1095–1107.
[16] L.M. Zouhal, T. Denoeux, An evidence-theoretic k-NN rule with parameter optimization, IEEE Trans. Syst. Man Cybern. C 28 (5) (1998) 263–271.


[17] Q. Ni, Z. Wang, X. Wang, Kernel K-local hyperplanes for predicting protein–protein interactions, in: Proceedings of the Fourth International Conference on Natural Computation, 2008, pp. 66–69.
[18] L. Nanni, A novel ensemble of classifiers for protein fold recognition, Neurocomputing 69 (2006) 2434–2437.
[19] L. Nanni, Hyperplanes for predicting protein–protein interactions, Neurocomputing 69 (2005) 257–263.
[20] H. Altincay, Ensembling evidential k-nearest neighbor classifiers through multi-modal perturbation, Appl. Soft Comput. 7 (6) (2007) 1072–1083.
[21] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2007.
[22] C. Domeniconi, J. Peng, D. Gunopulos, Locally adaptive metric nearest-neighbor classification, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 1281–1285.
[23] N. García-Pedrajas, J. Pérez-Rodríguez, M. García-Pedrajas, D. Ortiz-Boyer, C. Fyfe, Class imbalance methods for translation initiation site recognition in DNA sequences, Knowl.-Based Syst. 25 (2012) 22–34.
[24] S. Garcia, J. Derrac, I. Triguero, C.J. Carmona, F. Herrera, Evolutionary-based selection of generalized instances for imbalanced classification, Knowl.-Based Syst. (2011).
[25] L. Breiman, Arcing classifiers, Ann. Stat. 26 (1998) 801–849.

Haisheng Li is a Ph.D. candidate at the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China. He received his M.Sc. degree in Applied Mathematics from Guangzhou University in 2010. His current research interests include pattern recognition, machine learning and artificial intelligence.

Guihua Wen, born in 1968, Ph.D., is a professor and doctoral supervisor. In 2005–2006 he did visiting research on machine learning and the semantic web in the School of Electronics and Computer Science, University of Southampton, UK. His main research interests are computational creativity, data mining and knowledge discovery, machine learning, and cognitive geometry. Since 2006 he has proposed original methods based on the computation of cognitive laws, which can effectively solve difficult problems in information science. The research results have been published in international journals, including Pattern Recognition, Neurocomputing, Journal of Software, and Journal of Computer


Research and Development. He has also published papers at international conferences such as IJCAI. Since 2006 he has directed projects funded by the China National Natural Science Foundation, the State Key Laboratory of Brain and Cognitive Science, the Ministry of Education Scientific Research Foundation for Returned Overseas Students, Guangdong Provincial Science and Technology research projects, and the Fundamental Research Funds for the Central Universities, SCUT. He has also directed many projects from enterprises, applying his research results to practical problems. He has been a Council Member of the Chinese Association for Artificial Intelligence and a program committee member of many international conferences. He is also a reviewer for the China National Natural Science Foundation.

Zhiwen Yu is a professor in the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China. He received the B.Sc. and M.Phil. degrees from Sun Yat-Sen University, China, in 2001 and 2004 respectively, and the Ph.D. degree in Computer Science from the City University of Hong Kong in 2008. His research interests include bioinformatics, machine learning, pattern recognition, multimedia, intelligent computing and data mining. He has published more than 70 technical articles in refereed journals and conference proceedings in the areas of machine learning, data mining, bioinformatics, artificial intelligence, pattern recognition and multimedia.

Tiangang Zhou, born in 1972, is an associate professor at the Institute of Biophysics of the Chinese Academy of Sciences (IBP CAS). He received the B.Sc. degree in Computer Science and the M.Phil. degree in Biophysics from the University of Science and Technology of China (USTC) in 1993 and 1996 respectively. In 2000 he did collaborative research on fMRI at Freiburg University, Germany. His research interests include visual perceptual organization and attention, cognitive processes and their brain mechanisms, and brain cognitive imaging methodology.