
A Weighted One-class Support Vector Machine

Fa Zhu a, Jian Yang a, Member, IEEE, Cong Gao b, Sheng Xu c, Ning Ye d, Tongming Yin e

a School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, P.R. China
b Department of Computer Science, University of Regina, Canada
c Department of Geomatics Engineering, University of Calgary, Canada
d College of Information Science and Technology, Nanjing Forestry University, Nanjing, P.R. China
e College of Forest Resources and Environment, Nanjing Forestry University, Nanjing, P.R. China

[email protected]

Abstract: The standard One-class Support Vector Machine (OC-SVM) is sensitive to noise, since every instance is treated equally. To address this problem, the weighted one-class support vector machine (WOC-SVM) was introduced; it weakens the impact of noise by assigning noisy instances lower weights. In this paper, a novel instance-weighted strategy is proposed for WOC-SVM. The weight depends only on the neighbors' distribution knowledge, which is determined by the instance's k-nearest neighbors. The closer an instance lies to the boundary of the data distribution, the lower its weight. The experimental results demonstrate that WOC-SVM outperforms the standard OC-SVM when the proposed instance-weighted strategy is used, and that the proposed instance-weighted method performs better than previous ones.

Keywords: weighted one-class support vector machine, one-class classification, neighbors’ distribution knowledge, instance weights

1. Introduction

Support Vector Machines (SVMs) were first proposed by Vladimir Vapnik and his colleagues in the mid-1990s [1-2]. The original SVM focuses only on two-class classification problems, and its performance deteriorates when the classes are imbalanced. One strategy to handle imbalanced problems is to adopt a one-class classifier instead. The one-class version of SVMs was proposed in [3] and has been applied successfully to handwritten signature verification, information retrieval, disease diagnosis, remote sensing, image retrieval, document classification and so on [4-10]. Inspired by the one-class support vector machine (OC-SVM), Tax et al. proposed a similar algorithm, named support vector data description (SVDD), whose task is to find a hyper-sphere that encloses all targets [11]. In this paper, we focus only on OC-SVM.

OC-SVM inherits the merits of SVMs: it implements structural risk minimization (SRM), has a unique global solution, and yields a sparse solution [1-2,12]. However, it also retains the disadvantages of SVMs: it does not consider the prior knowledge of the training set, it needs a good kernel function for mapping the data into a linearly separable space, and it requires solving a quadratic programming (QP) problem, which is time consuming. In order to avoid solving the QP problem, least-squares support vector machine (LS-SVM) related algorithms were proposed [13-15]. In LS-SVMs, only a set of linear equations needs to be solved rather than a QP problem. However, LS-SVMs lose the sparsity of the solution: the solution is decided by all instances rather than by a few support vectors. In this paper, we focus only on how to utilize prior knowledge. One way to utilize prior knowledge is to assign the training instances different weights, which leads to the weighted one-class support vector machine (WOC-SVM). We propose a novel instance-weighted strategy for WOC-SVM. The weight depends only on the neighbors' distribution knowledge, which was first proposed in [16-17]. The neighbors' distribution knowledge is decided only by the instance's k-nearest neighbors and is therefore very simple to calculate.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. A basic review of the WOC-SVM is given in Section 3. A new instance-weighted strategy is introduced in Section 4. The experiments on artificial synthetic problems and benchmark datasets are reported in Section 5. The discussion and conclusions are provided in the last section.

2. Related work

How to improve the performance of SVMs by introducing prior knowledge is a hot topic in the machine learning and pattern recognition community. One way to incorporate prior knowledge into SVMs is to assign the training data different weights: the side effect of noise can be weakened by assigning noisy instances lower weights, making SVMs more robust to noise. The previous related work mainly focuses on two-class classification problems [18-21]. The weighted versions of OC-SVM and SVDD were first proposed by Bicego and Figueiredo in [22], who used WOC-SVM for soft clustering; in their algorithm, the WOC-SVM output was regarded as the similarity between a sample and a cluster center. Zhang et al. proposed to use kernel possibilistic c-means clustering (KPCM) to assign weights for the weighted support vector data description (WSVDD) [23]; since SVDD can include negative instances, they also assigned weights to the negative instances. In [10], Cyganek proposed to use the fuzzy c-means algorithm to assign weights for an ensemble of one-class support vector machines; the weights were obtained from the cluster membership values calculated by the fuzzy c-means algorithm, and the method can be regarded as an extension of ensemble one-class support vector machines [9]. Furthermore, Cyganek also extended the ensemble OC-SVM to solve multi-class classification problems [24-25]. All of the above work relies on clustering algorithms, so the problems of clustering algorithms also arise: the number of clusters is difficult to choose and the results are affected by the initialization. In this paper, a simple instance-weighted strategy is proposed. It depends only on the neighbors' distribution knowledge, which is decided only by the instance's k-nearest neighbors, and therefore avoids the above problems. Additionally, there is some other work that does not use clustering algorithms: Lee et al. provided a density-induced distance and used it to replace the Euclidean distance in SVDD [26], and Li et al. applied the relative density degree as the instance weights for WSVDD [27].

3. Basic review of weighted one-class support vector machine

3.1 Problem description

In some real applications, sufficient training samples are available for only one class. In disease diagnosis, for instance, there are plenty of records of healthy persons but few of patients. A good way to deal with such problems is to use a one-class classifier instead. In one-class classification, abundant data from one class (usually called targets) and only a few data from the other classes (usually called outliers) are available. The aim of one-class classification is to find a data description that encloses all or most of the targets and excludes all outliers. A graphical illustration is shown in Fig. 1: the pluses represent the targets and the solid closed line is the data description, which encloses all targets.



Fig. 1 A graphical illustration of the one-class classification problem. The pluses are targets and the solid line is the data description.

Let $X$ denote the training set consisting of $l$ targets $x_i$, $i = 1, \dots, l$, with $x_i \in R^n$ ($n$ is the dimension of $x_i$), and let $x \in R^n$ be a new instance. The aim of one-class classification is to find a function $f(x)$ that returns $+1$ when $x$ is a target and $-1$ otherwise.

3.2 Basic review of OC-SVM

Let $\phi(x_i)$ denote the image of $x_i$ in the feature space. The mapping $\phi$ is unknown, but the inner products of the images $\phi(x_i)$ can be computed easily via a kernel function, such as the Gaussian kernel

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$$

where the indices $i$ and $j$ range over $1, \dots, l$.

In OC-SVM, the training instances are mapped into the feature space via the kernel trick and separated from the origin by the hyperplane with the maximum margin [3]. The following convex program then needs to be solved:

$$\min_{w,\xi,\rho}\;\; \frac{1}{2} w^T w + \frac{1}{vl}\sum_{i=1}^{l}\xi_i - \rho \qquad \text{s.t.}\;\; w^T\phi(x_i) \ge \rho - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, 2, \dots, l \tag{1}$$

Here, $\xi_i$ is the non-negative slack variable of $x_i$; $v \in (0, 1]$ controls the fraction of outliers (the training data outside the estimated region) and that of support vectors: $v$ is an upper bound on the fraction of outliers as well as a lower bound on the fraction of support vectors. The superscript $T$ denotes the transpose of a matrix or vector. Introducing multipliers $\alpha_i, \beta_i \ge 0$ for the constraints $w^T\phi(x_i) - \rho + \xi_i \ge 0$ and $\xi_i \ge 0$, the Lagrangian function of (1) can be written as

$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2} w^T w + \frac{1}{vl}\sum_{i=1}^{l}\xi_i - \rho - \sum_{i=1}^{l}\alpha_i\left(w^T\phi(x_i) - \rho + \xi_i\right) - \sum_{i=1}^{l}\beta_i\xi_i \tag{2}$$

Setting the derivatives with respect to $w$, $\xi$ and $\rho$ to zero, we obtain

$$w = \sum_{i=1}^{l}\alpha_i\phi(x_i), \qquad \alpha_i = \frac{1}{vl} - \beta_i, \tag{3}$$

$$\sum_{i=1}^{l}\alpha_i = 1. \tag{4}$$

Substituting (3) and (4) into (2), the dual form of (1) can be written as

$$\min_{\alpha}\;\; \alpha^T Q \alpha \qquad \text{s.t.}\;\; 0 \le \alpha_i \le \frac{1}{vl},\;\; \sum_{i=1}^{l}\alpha_i = 1 \tag{5}$$

Here, $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_l)$ is the vector of Lagrange multipliers of the constraints and $Q$ is the kernel matrix of the training set, $Q(i, j) = K(x_i, x_j)$. The instances $x_i$ with $\alpha_i > 0$, $i = 1, \dots, l$, are called support vectors. The function $f(x)$ can be written as

$$f(x) = \operatorname{sgn}\left(\sum_{i \in SVs}\alpha_i K(x_i, x) - \rho\right) \tag{6}$$

Here, $SVs$ denotes the index set of the support vectors, and the variable $\rho$ in (6) can be recovered by

$$\rho = w^T\phi(x_i) = \sum_{j:\,\alpha_j > 0}\alpha_j K(x_i, x_j) \tag{7}$$

where $x_i$ is an instance with a nonzero Lagrange multiplier, $\alpha_i > 0$. For a new instance $x$, $f(x)$ returns $+1$ if $x$ is a target and $-1$ if $x$ is an outlier. Obviously, $f(x)$ is decided only by a small number of support vectors in OC-SVM.
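As a point of reference (the experiments in Section 5 use LIBSVM), this standard OC-SVM formulation is also available in common libraries; the following is a minimal scikit-learn sketch, which is only an illustration and not the implementation used in this paper. Here nu plays the role of $v$ in (1), and gamma corresponds to $1/(2\sigma^2)$ of the Gaussian kernel above; the data are random stand-ins.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))                      # targets only
X_test = np.vstack([rng.normal(size=(50, 2)),            # targets
                    rng.uniform(-6.0, 6.0, (50, 2))])    # outliers

clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5)       # gamma = 1 / (2 * sigma^2)
clf.fit(X_train)
labels = clf.predict(X_test)             # +1 for predicted targets, -1 for outliers, as in (6)
scores = clf.decision_function(X_test)   # signed scores usable for ROC analysis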

3.3 Weighted One-class Support Vector Machine

Let $W_i$ denote the weight of instance $x_i$; then the training set can be represented as $\{(x_i, W_i)\}_{i=1}^{l}$, where the index $i$ ranges over $1, \dots, l$ and $W_i \in (0, 1]$. The WOC-SVM needs to solve the following optimization problem:

$$\min_{w,\xi,\rho}\;\; \frac{1}{2} w^T w + \frac{1}{vl}\sum_{i=1}^{l}W_i\xi_i - \rho \qquad \text{s.t.}\;\; w^T\phi(x_i) \ge \rho - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, 2, \dots, l \tag{8}$$

Introducing multipliers $\alpha_i, \beta_i \ge 0$ for the constraints $w^T\phi(x_i) - \rho + \xi_i \ge 0$ and $\xi_i \ge 0$, the Lagrangian function of (8) can be written as

$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2} w^T w + \frac{1}{vl}\sum_{i=1}^{l}W_i\xi_i - \rho - \sum_{i=1}^{l}\alpha_i\left(w^T\phi(x_i) - \rho + \xi_i\right) - \sum_{i=1}^{l}\beta_i\xi_i \tag{9}$$

Setting the derivatives with respect to $w$, $\xi$ and $\rho$ to zero, we obtain

$$w = \sum_{i=1}^{l}\alpha_i\phi(x_i), \tag{10}$$

$$\alpha_i = \frac{1}{vl}W_i - \beta_i, \qquad \sum_{i=1}^{l}\alpha_i = 1. \tag{11}$$

Substituting (10) and (11) into (9), the dual form of (8) can be written as

$$\min_{\alpha}\;\; \alpha^T Q \alpha \qquad \text{s.t.}\;\; 0 \le \alpha_i \le \frac{1}{vl}W_i,\;\; \sum_{i=1}^{l}\alpha_i = 1 \tag{12}$$

Here, $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_l)$ is the vector of Lagrange multipliers of the constraints and $Q$ is the kernel matrix of the training set, $Q(i, j) = K(x_i, x_j)$. Obviously, the only difference between the dual forms of OC-SVM and WOC-SVM is the upper bound of the Lagrange multipliers: the upper bound in (12) is $\frac{W_i}{vl}$. When $W_i = 1$ for every instance, WOC-SVM is identical to the standard OC-SVM. The decision function of WOC-SVM is the same as that of the standard OC-SVM. In WOC-SVM, we intend to assign possible noises lower weights before training, so that the side effect of noise can be weakened. A novel instance-weighted strategy, which is very simple but effective, is proposed in the following section.
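Since the only change relative to the standard dual (5) is the per-instance upper bound $W_i/(vl)$, the dual (12) can be solved with any off-the-shelf box-constrained QP solver. The following is a minimal sketch, not the authors' implementation (which uses LIBSVM); it assumes numpy and cvxopt, and gaussian_kernel is a hypothetical helper implementing the Gaussian kernel of Section 3.2.

import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X, Y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for all pairs of rows of X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def wocsvm_dual(Q, W, v):
    """Solve  min_a a^T Q a   s.t.  0 <= a_i <= W_i/(v*l),  sum_i a_i = 1   (dual (12))."""
    l = Q.shape[0]
    P = matrix(2.0 * Q)                              # cvxopt minimizes (1/2) a^T P a + q^T a
    q = matrix(np.zeros(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))   # encodes -a_i <= 0 and a_i <= W_i/(v*l)
    h = matrix(np.hstack([np.zeros(l), W / (v * l)]))
    A = matrix(np.ones((1, l)))                      # equality constraint sum_i a_i = 1
    b = matrix(1.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()

With $W_i \equiv 1$ this reduces to the standard OC-SVM dual (5); once $\alpha$ is obtained, $\rho$ and the decision function follow from (7) and (6) unchanged.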

4. An instance-weighted strategy for WOC-SVM

In OC-SVM, whether a new instance belongs to the targets is determined only by a few support vectors, and the support vectors mostly lie near the boundary of the training set. If we want to assign higher weights to the instances that would become support vectors, we only need to find the instances near the boundary of the training set before training. The support vectors can be divided into two kinds, $SV_{nb} = \{\, i \mid 0 < \alpha_i < 1/(vl) \,\}$ and $SV_b = \{\, i \mid \alpha_i = 1/(vl) \,\}$, where $SV_{nb}$ and $SV_b$ denote the indices of the two kinds of support vectors, respectively [3]. The former kind of support vectors are not at the bound and lie strictly on the hyperplane (i.e., $w^T\phi(x_i) = \rho$). When $\alpha_i = 1/(vl)$, the instance $x_i$ may no longer lie on the hyperplane; furthermore, $x_i$ may lie far away from the main part of the training set and induce a deviation of the description. If we assigned higher weights to the instances near the boundary of the training set, the instances that would induce a deviation of the description would also receive higher weights. In order to weaken the side effect of these instances, we instead assign higher weights to the instances in the interior of the training set and lower weights to the instances near the boundary of the training set. This strategy is consistent with that used in [10,20,22-25]. In the following, we first introduce a novel method to find the instances near the boundary of the training set, and then propose an instance-weighted strategy for WOC-SVM.

4.1 The nearest neighbors' distribution knowledge

Let $x_i$, $i = 1, \dots, l$, denote an instance in $X$, and let $x_{i,j} \in kNN(x_i)$ ($j = 1, 2, \dots, k$) be the k-nearest neighbors of $x_i$. The set $D$ consists of the $x_{i,j}$ ($j = 1, 2, \dots, k$) and can be regarded as a neighborhood sphere, as shown in Fig. 2. The center of $D$ is $x_i$ and the radius is $d(x_i, x_{i,k})$ ($x_{i,k}$ is the k-th nearest neighbor of $x_i$ and $d(x_i, x_{i,k})$ is the distance between $x_i$ and $x_{i,k}$). The neighborhood sphere $D$ is divided into two parts, $D_1$ and $D_2$, by the hyperplane AB, which is perpendicular to $x_i - \bar{x}_i$ ($\bar{x}_i$ is defined as the mean of the k nearest neighbors, $\bar{x}_i = \frac{1}{k}\sum_{j=1}^{k} x_{i,j}$). The part that contains $\bar{x}_i$ is defined as $D_1$ and the other as $D_2$ (a graphical illustration of $D_1$ and $D_2$ is shown in Fig. 2).



Fig. 2 The definitions of $D_1$ and $D_2$. $D$ is the neighborhood sphere, divided into two parts, $D_1$ and $D_2$, by AB, which is perpendicular to $x_i - \bar{x}_i$. $x_i$ is denoted by an asterisk, $\bar{x}_i$ by a plus, and the neighbors of $x_i$ by diamonds.

Since $\bar{x}_i$ lies in $D_1$, there are more neighbors in $D_1$ than in $D_2$. The difference between the number of neighbors in $D_1$ and the number in $D_2$ depends on the location of $x_i$ in the training set: the closer $x_i$ is to the boundary of the training set, the larger the difference is. To reflect this information, the authors in [16-17] defined the angle between $x_i - \bar{x}_i$ and $x_i - x_{i,j}$ ($x_{i,j} \in kNN(x_i)$) as $\theta_{i,j}$ ($j = 1, 2, \dots, k$). Obviously, when $x_{i,j}$ lies in $D_1$, $\theta_{i,j} \le \pi/2$; when $x_{i,j}$ lies in $D_2$, $\pi/2 < \theta_{i,j} \le \pi$. Graphical illustrations are shown in Fig. 3 ($x_{i,j}$ lies in $D_1$) and Fig. 4 ($x_{i,j}$ lies in $D_2$).

Fig. 3 When $x_{i,j} \in D_1$, $\theta_{i,j} \le \pi/2$. $x_{i,j}$ is one of the k-nearest neighbors of $x_i$ and lies in $D_1$. $\bar{x}_i$ is defined as the mean of the k-nearest neighbors. $\theta_{i,j}$ is defined as the angle between $x_i - \bar{x}_i$ and $x_i - x_{i,j}$. $x_i$ is denoted by an asterisk, $\bar{x}_i$ by a plus, and the neighbors of $x_i$ by diamonds.


Fig. 4 When $x_{i,j} \in D_2$, $\pi/2 < \theta_{i,j} \le \pi$. $x_{i,j}$ is one of the k-nearest neighbors of $x_i$ and lies in $D_2$. $\bar{x}_i$ is defined as the mean of the k-nearest neighbors. $\theta_{i,j}$ is defined as the angle between $x_i - \bar{x}_i$ and $x_i - x_{i,j}$. $x_i$ is denoted by an asterisk, $\bar{x}_i$ by a plus, and the neighbors of $x_i$ by diamonds.

In order to reflect the neighbors' distribution knowledge, the cosines of $\theta_{i,j}$, $j = 1, 2, \dots, k$, are summed, which yields the following quantity (called the cosine sum for short in the following) [16-17]:

$$c^{sum}(x_i) = \sum_{j=1}^{k}\cos\theta_{i,j} = \sum_{j=1}^{k}\frac{\langle x_i - \bar{x}_i,\; x_i - x_{i,j}\rangle}{\|x_i - \bar{x}_i\|\,\|x_i - x_{i,j}\|}, \qquad x_{i,j} \in kNN(x_i),\; j = 1, \dots, k \tag{13}$$

The cosine sum can be rewritten as $\sum_{x_{i,j}\in D_1}\cos\theta_{i,j} + \sum_{x_{i,j}\in D_2}\cos\theta_{i,j}$. When $x_{i,j}$ lies in $D_1$, $\theta_{i,j} \le \pi/2$ and $0 \le \cos\theta_{i,j} \le 1$; when $x_{i,j}$ lies in $D_2$, $\pi/2 < \theta_{i,j} \le \pi$ and $-1 \le \cos\theta_{i,j} < 0$. Additionally, there are more neighbors in $D_1$ than in $D_2$. Therefore, the cosine sum generally ranges in $(0, k]$. Substituting $\bar{x}_i = \frac{1}{k}\sum_{j=1}^{k}x_{i,j}$ into (13), the cosine sum can be rewritten as

$$c^{sum}(x_i) = \sum_{j=1}^{k}\frac{\langle x_i, x_i\rangle - \langle x_i, x_{i,j}\rangle - \frac{1}{k}\sum_{l=1}^{k}\langle x_i, x_{i,l}\rangle + \frac{1}{k}\sum_{l=1}^{k}\langle x_{i,j}, x_{i,l}\rangle}{\sqrt{\langle x_i, x_i\rangle - \frac{2}{k}\sum_{l=1}^{k}\langle x_i, x_{i,l}\rangle + \frac{1}{k^2}\sum_{l,m=1}^{k}\langle x_{i,l}, x_{i,m}\rangle}\;\sqrt{\langle x_i, x_i\rangle - 2\langle x_i, x_{i,j}\rangle + \langle x_{i,j}, x_{i,j}\rangle}} \tag{14}$$

Since it involves only inner products of vectors, the cosine sum can easily be put into kernel form as follows:

$$c^{sum}(x_i) = \sum_{j=1}^{k}\frac{K(x_i, x_i) - K(x_i, x_{i,j}) - \frac{1}{k}\sum_{l=1}^{k}K(x_i, x_{i,l}) + \frac{1}{k}\sum_{l=1}^{k}K(x_{i,j}, x_{i,l})}{\sqrt{K(x_i, x_i) - \frac{2}{k}\sum_{l=1}^{k}K(x_i, x_{i,l}) + \frac{1}{k^2}\sum_{l,m=1}^{k}K(x_{i,l}, x_{i,m})}\;\sqrt{K(x_i, x_i) - 2K(x_i, x_{i,j}) + K(x_{i,j}, x_{i,j})}} \tag{15}$$

4.2 A new instance-weighted strategy

Obviously, the values of (13) or (15) cannot be used directly as the weights in WOC-SVM. We define $W_i$ via a slight modification of (13) or (15) as follows:

$$W_i = 1 - \frac{1}{k}\left|c^{sum}(x_i)\right| \tag{16}$$

Here, $c^{sum}(x_i)$ is the cosine sum of $x_i$ ($x_i \in X$), which can be calculated through (13) or (15). The absolute value ensures that the cosine sum enters (16) as a positive quantity, and the coefficient $1/k$ ensures that the second term of (16) lies in $[0, 1]$, so that the weights satisfy the condition required by WOC-SVM. The leading constant 1 ensures that instances in the interior of the training set receive higher weights and instances near the boundary of the training set receive lower weights. This strategy is very simple: the weight depends only on the instance's k-nearest neighbors. Moreover, the range of the weights is not sensitive to the number of nearest neighbors, so little time needs to be spent tuning the parameter $k$ in (16). In Fig. 5, the data descriptions learnt by the standard OC-SVM and by WOC-SVM are shown in (a) and (b), respectively. The Gaussian RBF kernel is used as the kernel function, the width $\sigma$ of the Gaussian RBF kernel is set to $2^{-4}$, the parameter $v$ is set to 0.1, and the number of nearest neighbors in (16) is set to 10.
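As a concrete illustration (not the authors' MATLAB/MEX implementation), the Euclidean-case weights of Eqs. (13) and (16) can be computed as in the following sketch; numpy and scikit-learn's NearestNeighbors are assumed for the k-NN search, and the query point itself is excluded from its neighbor set.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def nd_weights(X, k=10):
    """Instance weights W_i = 1 - |c_sum(x_i)| / k based on the neighbors' distribution."""
    l = X.shape[0]
    # k+1 neighbors are queried because each point is returned as its own nearest neighbor.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    W = np.empty(l)
    for i in range(l):
        neighbors = X[idx[i, 1:]]              # the k nearest neighbors of x_i
        mean_nb = neighbors.mean(axis=0)       # \bar{x}_i in Eq. (13)
        u = X[i] - mean_nb                     # x_i - \bar{x}_i
        v = X[i] - neighbors                   # rows: x_i - x_{i,j}
        cos = (v @ u) / (np.linalg.norm(v, axis=1) * np.linalg.norm(u) + 1e-12)
        c_sum = cos.sum()                      # Eq. (13)
        W[i] = 1.0 - abs(c_sum) / k            # Eq. (16)
    return W

The resulting vector W can then be used directly as the per-instance upper bounds of the dual (12).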


Fig. 5 The data descriptions learnt by standard OC-SVM and WOC-SVM, respectively. The solid lines are data descriptions. The squares are the training samples. In (b), the bold squares represent the samples with weights over 0.5.

It can be found that the data description learnt by WOC-SVM is more compact than that learnt by the standard OC-SVM. In Fig. 5(b), most of the instances with higher weights lie in the interior of the training set, and most of the instances with lower weights lie near the boundary of the training set. In Fig. 6, we add 10% Gaussian noise and use the same parameters as in Fig. 5.


Fig. 6 The data descriptions learnt by standard OC-SVM and WOC-SVM, respectively. 10% Gaussian noises are added. The solid lines are data descriptions. The squares are the training samples. In (b), the bold squares represent the samples with weights over 0.5. The diamonds are the noises with Gaussian distribution. The number at bottom right of the noise represents its weight.

In Fig. 6, the data description learnt by the standard OC-SVM is affected by the noise, whereas the data description learnt by WOC-SVM can still describe the targets well; WOC-SVM is more robust than the standard OC-SVM. Additionally, it can be found that the noise points outside the training set are assigned lower weights, so their side effect is weakened.
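For completeness, a description of the kind shown in Fig. 5(b) and Fig. 6(b) can be reproduced approximately by chaining the sketches given earlier (gaussian_kernel and wocsvm_dual from Section 3.3, nd_weights from above); the data below are random stand-ins rather than the training set used in the figures, so this is only an illustrative pipeline.

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))            # stand-in training targets
X_test = rng.normal(size=(20, 2))              # stand-in test points

sigma, v, k = 2.0**-4, 0.1, 10                 # the parameters used for Fig. 5 and Fig. 6
W = nd_weights(X_train, k=k)                   # weights of Eq. (16)
Q = gaussian_kernel(X_train, X_train, sigma)   # kernel matrix
alpha = wocsvm_dual(Q, W, v)                   # Lagrange multipliers from dual (12)

ub = W / (v * X_train.shape[0])
i = int(np.argmax((alpha > 1e-8) & (alpha < ub - 1e-8)))                 # an unbounded support vector
rho = float(gaussian_kernel(X_train[i:i + 1], X_train, sigma) @ alpha)   # Eq. (7)
scores = gaussian_kernel(X_test, X_train, sigma) @ alpha - rho           # Eq. (6): sign gives +1 / -1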

4.3 Connection to other instance-weighted strategies

One way to weight instances is the KPCM algorithm [10,20,22-23]. In KPCM, every instance is assigned a different membership value, which can be interpreted as the degree of possibility that the instance belongs to a cluster [28-29]. In general, the membership values can be used as instance weights, for example in the two-class SVM and in SVDD with negative instances. Since the standard SVM contains two classes, and SVDD with negative instances in effect also contains two classes, the number of clusters can be set to 2. However, OC-SVM contains only one class, so the number of clusters is difficult to choose. In [10], Cyganek proposed an ensemble of OC-SVMs in which the training set is partitioned into M clusters and an OC-SVM is trained on each cluster; each OC-SVM has two parameters to tune, so 2M parameters have to be tuned in all. Another way to weight instances is based on the relative density degree. In [27], the relative density degree was used as the instance weight for SVDD. The relative density degree $\gamma_i$ of $x_i$ can be represented as

$$\gamma_i = \exp\left(\delta\,\frac{E_k}{d(x_i, x_{i,k})}\right), \qquad i = 1, \dots, l \tag{17}$$

Here, $x_{i,k}$ is the k-th nearest neighbor of $x_i$, $E_k = \frac{1}{l}\sum_{i=1}^{l}d(x_i, x_{i,k})$, and $0 \le \delta \le 1$ is a weighting factor. A sample with a higher relative density degree lies in a higher-density region, i.e., in the interior of the training set. Obviously, both the proposed method and the relative density degree need to find the k-nearest neighbors, so the time complexity of the proposed method is the same as that of the relative density degree. The time complexity of finding the k-nearest neighbors is $O(l^2)$ without any speedup strategy. In [30], Friedman et al. proposed an algorithm that finds the best matches in $O(l\log l)$ expected time; a recent review of nearest neighbor searching can be found in [31]. In this paper, we only use the naive method to solve the k-nearest neighbors problem for both the proposed method and the relative density degree; a faster (approximate) k-nearest neighbor algorithm could easily be incorporated into our method. Additionally, we only perform this pre-processing once when tuning the parameters via grid search, so even $O(l^2)$ is acceptable. Since the proposed instance-weighted strategy is used in a single WOC-SVM in this paper, we only compare the proposed method with the relative density degree in the following experiments.
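For comparison, a corresponding sketch of the relative density degree of Eq. (17) is given below (again with a k-NN search via scikit-learn; the parameter delta is the weighting factor δ, and its default value here is only illustrative).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def rdd_weights(X, k=10, delta=1.0):
    """Relative density degree of Eq. (17): gamma_i = exp(delta * E_k / d(x_i, x_{i,k}))."""
    # k+1 neighbors are queried because each point is returned as its own nearest neighbor.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nbrs.kneighbors(X)
    d_k = dist[:, k]                 # distance of every x_i to its k-th nearest neighbor
    E_k = d_k.mean()                 # E_k = (1/l) * sum_i d(x_i, x_{i,k})
    return np.exp(delta * E_k / (d_k + 1e-12))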

5. Experiments and Simulations

In this section, we evaluate the performance of WOC-SVM in comparison with OC-SVM. Both WOC-SVM and OC-SVM are solved with LIBSVM [32]. The code for calculating the weights is implemented via the MEX interface in the MATLAB environment and compiled with VC++ 2010. All the experiments are performed on a Windows 7 machine with an Intel Core i5-4690 CPU @ 3.50 GHz and 8 GB DRAM. The weights are generated by the proposed method (based on the neighbors' distribution, ND) and by the relative density degree (RDD), respectively; we denote the two variants WOC-SVM (ND) and WOC-SVM (RDD) for short in the following experiments. First, we evaluate the performance on two artificial synthetic problems: the Sine problem and the Spiral problem. Then, we evaluate the performance on 10 UCI [33] benchmark datasets. Lastly, we verify the proposed method on the web problem, which is large-scale and imbalanced. Similar to other one-class related papers, the experimental results are compared in terms of the receiver operating characteristic (ROC) curve [34] as well as the area under the ROC curve (AUC value) [35]; a higher AUC value means a better one-class classifier.
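For reference, the AUC value and ROC curve can be computed from a one-class model's decision values with standard tooling; the following is a minimal sketch assuming scikit-learn (not the evaluation code used in the paper), with toy labels and scores standing in for real outputs.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Labels: +1 for targets, -1 for outliers. Scores: larger means "more target-like",
# e.g. the decision values  sum_i alpha_i K(x_i, x) - rho  from Eq. (6).
y_true = np.array([+1, +1, +1, -1, -1, -1])
scores = np.array([0.8, 0.5, 0.1, 0.2, -0.3, -0.7])

auc = roc_auc_score(y_true, scores)       # area under the ROC curve
fpr, tpr, _ = roc_curve(y_true, scores)   # operating points of the ROC curve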

5.1 Experiments on artificial synthetic datasets

In this experiment, we compare WOC-SVM with the standard OC-SVM on two two-dimensional artificial synthetic problems. The first one is the Sine problem. The targets are generated by

$$X = \left\{ x \;\middle|\; x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} t \\ \sin(t) \end{bmatrix} + N \right\}$$

Here, $t$ ranges in $[0, 2\pi]$ with a step of 0.01, and $N$ is Gaussian noise with mean $[0\;0]^T$ and covariance matrix $0.04I$ ($I$ is the two-dimensional identity matrix). The outliers are generated from a uniform distribution. The second one is the Spiral problem. The targets are generated by

$$X_1 = \left\{ x \;\middle|\; x = 0.5\begin{bmatrix} \sqrt{2t}\cos(0.2t) \\ \sqrt{2t}\sin(0.2t) \end{bmatrix} + N_1 \right\}$$

and the outliers are generated by

$$X_2 = \left\{ x \;\middle|\; x = -0.5\begin{bmatrix} \sqrt{2t}\cos(0.2t) \\ \sqrt{2t}\sin(0.2t) \end{bmatrix} + N_2 \right\}$$

Here, $t$ ranges in $[0, 50]$ with a step of 0.5, and $N_1$ and $N_2$ are Gaussian noise with mean $[0\;0]^T$ and covariance matrix $I$ ($I$ is the two-dimensional identity matrix). We generate 50, 100 and 200 targets in the training set for each problem. In order to evaluate the impact of noise, we add 10%, 20% and 30% Gaussian noise and 10%, 20% and 30% uniform noise to the training sets. Each test set contains 1000 targets and 1000 outliers. We perform 50 trials on randomly sampled training and test sets. The AUC values of the Sine problem are reported in Table 1 and those of the Spiral problem in Table 2; the results are reported as the mean ± standard deviation (std.) over the 50 trials. The experimental results are obtained with the best parameters found through grid search, except for the parameter $k$. In our previous experience, the width $\sigma$ of the Gaussian RBF kernel obtained by grid search tends toward the lower range, so we select $\sigma$ from $\{2^{-10}, 2^{-9}, 2^{-8}, 2^{-7}, 2^{-6}, 2^{-5}\}$. The parameter $v$ determines the upper bound on the fraction of outliers: a large $v$ means that more instances in the training set are deemed outliers, so $v$ should not be set too large. We select $v$ from $\{0.01, 0.06, 0.11, 0.16, 0.21\}$. In order to reduce the cost of tuning an extra parameter, we set $k$ to 10 directly.
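As an illustration of the data-generation protocol, the Sine targets can be produced as follows (numpy assumed; the box used for the uniform outliers is not specified in the text and is an illustrative choice of this sketch).

import numpy as np

rng = np.random.default_rng(0)

# Targets of the Sine problem: x = [t, sin(t)]^T + N, with N ~ N(0, 0.04 I), i.e. std 0.2.
t = np.arange(0.0, 2.0 * np.pi, 0.01)
targets = np.column_stack([t, np.sin(t)]) + rng.normal(scale=0.2, size=(t.size, 2))

# Outliers drawn from a uniform distribution over a box covering the target region
# (the exact box is an assumption of this illustration).
outliers = rng.uniform(low=[-1.0, -2.0], high=[2.0 * np.pi + 1.0, 2.0], size=(1000, 2))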


Table 1 The AUC values of the Sine problem. The best one is bolded in each row. The first column is the number of targets in the training set and the second column is the percentage of Gaussian or uniform noise. The AUC values are reported as mean±std.

#Training set  Noise          OC-SVM            WOC-SVM (ND)      WOC-SVM (RDD)
50             none           0.9461±0.00020    0.9455±0.00020    0.9472±0.00018
50             Gaussian 10%   0.9392±0.00027    0.9437±0.00022    0.9442±0.00019
50             Gaussian 20%   0.9352±0.00031    0.9384±0.00024    0.9410±0.00023
50             Gaussian 30%   0.9327±0.00031    0.9361±0.00028    0.9377±0.00025
50             Uniform 10%    0.9210±0.00047    0.9347±0.00033    0.9341±0.00022
50             Uniform 20%    0.8954±0.00072    0.9211±0.00031    0.9177±0.00037
50             Uniform 30%    0.8686±0.00066    0.8991±0.00065    0.8928±0.00055
100            none           0.9576±5.20867    0.9589±6.12796    0.9587±4.72740
100            Gaussian 10%   0.9497±8.69514    0.9565±5.68384    0.9573±4.35098
100            Gaussian 20%   0.9435±0.00010    0.9516±8.32205    0.9549±5.58645
100            Gaussian 30%   0.9404±0.00014    0.9466±0.00010    0.9522±6.78750
100            Uniform 10%    0.9359±0.00019    0.9482±0.00011    0.9493±8.75914
100            Uniform 20%    0.9107±0.00040    0.9297±0.00037    0.9385±0.00017
100            Uniform 30%    0.8862±0.00092    0.9090±0.00062    0.9255±0.00024
200            none           0.9651±4.19424    0.9652±4.60269    0.9657±3.91849
200            Gaussian 10%   0.9595±5.30629    0.9633±5.25532    0.9651±3.56369
200            Gaussian 20%   0.9543±5.67785    0.9598±7.48328    0.9644±4.09743
200            Gaussian 30%   0.9523±9.22664    0.9575±0.00009    0.9629±4.43037
200            Uniform 10%    0.9506±0.00010    0.9577±8.06985    0.9606±4.76459
200            Uniform 20%    0.9220±0.00032    0.9360±0.00029    0.9543±7.85449
200            Uniform 30%    0.8968±0.00039    0.9122±0.00039    0.9442±0.00012

Table 2 The AUC values of the Spiral problem. The best one is bolded in each row. The first column is the number of targets in the training set and the second column is the percentage of Gaussian or uniform noise. The AUC values are reported as mean±std.

#Training set  Noise          OC-SVM            WOC-SVM (ND)      WOC-SVM (RDD)
50             none           0.9590±0.00025    0.9614±0.00024    0.9588±0.00025
50             Gaussian 10%   0.9364±0.00043    0.9388±0.00044    0.9361±0.00043
50             Gaussian 20%   0.9174±0.00074    0.9193±0.00074    0.9170±0.00076
50             Gaussian 30%   0.9034±0.00089    0.9059±0.00085    0.9032±0.00089
50             Uniform 10%    0.9242±0.00059    0.9283±0.00064    0.9241±0.00059
50             Uniform 20%    0.9020±0.00072    0.9056±0.00075    0.9018±0.00072
50             Uniform 30%    0.8763±0.00125    0.8802±0.00129    0.8767±0.00124
100            none           0.9744±0.00008    0.9797±7.13187    0.9744±7.99582
100            Gaussian 10%   0.9459±0.00029    0.9527±0.00034    0.9477±0.00034
100            Gaussian 20%   0.9347±0.00048    0.9424±0.00049    0.9363±0.00047
100            Gaussian 30%   0.9094±0.00062    0.9142±0.00064    0.9106±0.00066
100            Uniform 10%    0.9397±0.00060    0.9499±0.00058    0.9410±0.00065
100            Uniform 20%    0.9117±0.00056    0.9215±0.00063    0.9139±0.00059
100            Uniform 30%    0.8874±0.00040    0.8948±0.00047    0.8878±0.00042
200            none           0.9821±3.07776    0.9867±0.00002    0.9823±3.00549
200            Gaussian 10%   0.9623±0.00019    0.9711±0.00021    0.9743±0.00009
200            Gaussian 20%   0.9443±0.00023    0.9549±0.00028    0.9609±0.00014
200            Gaussian 30%   0.9218±0.00039    0.9277±0.00048    0.9374±0.00045
200            Uniform 10%    0.9489±0.00019    0.9597±0.00029    0.9626±0.00019
200            Uniform 20%    0.9184±0.00031    0.9247±0.00045    0.9306±0.00050
200            Uniform 30%    0.8993±0.00031    0.9039±0.00039    0.9147±0.00046

The results in Table 1 and Table 2 demonstrate that WOC-SVM performs better than the standard OC-SVM on these two artificial synthetic problems, and it remains better when noise is added to the training set. Additionally, the weights generated from the neighbors' distribution perform worse than those generated from the relative density degree on the Sine problem, but outperform them on the Spiral problem.

5.2 Experiments on benchmark datasets

In this experiment, we compare WOC-SVM with the standard OC-SVM on 10 benchmark datasets selected from the University of California at Irvine (UCI) machine learning repository [33]. The description of these datasets is reported in Table 3. The second column gives the number of dimensions; the third column and the last column give the size of the training set and of the test set, respectively. A dash in the last column means that there is no independent identically distributed (i.i.d.) test set. The numbers in parentheses are the size of each class. The number of dimensions ranges from 4 to 100 and the size of the training set ranges from 345 to 4435.

Table 3 The description of the benchmark datasets. The number of dimensions is reported in the second column. The size of the training set and of the test set is reported in the third column and the last column, respectively. A dash in the last column represents no independent identically distributed (i.i.d.) test set.

Datasets                 #Dimensions   #Training Samples (classes)   #Test Samples (classes)
Yeast                    8             892 (463, 429)                -
Pima Indians Diabetes    8             768 (500, 268)                -
Liver Disorders          6             345 (145, 200)                -
Blood Transfusion        4             748 (178, 570)                -
Wdbc                     30            569 (357, 212)                -
Winequality Red          11            1599 (681, 918)               -
Abalone                  8             4177 (1307, 1342, 2649)       -
Svmguide 1               4             3098 (2000, 1089)             4000 (2000, 2000)
Satimage                 36            4435 (1072, 3363)             2000 (461, 1549)
Hill Valley              100           606 (305, 301)                606 (295, 311)

In order to evaluate the one-class classifiers, these datasets are reorganized. Except for Abalone, the class containing the largest number of instances is used as the targets and the other class(es) as the outliers. The task of the Abalone dataset is to predict the age of the abalone; it contains 3 classes: M, F and I. We convert it into three one-class problems: Abalone (FvsIM), Abalone (IvsFM) and Abalone (MvsFI). In each problem, one class is used as the targets and the others as the outliers. There is no independent identically distributed (i.i.d.) test set for the former seven datasets; there, the largest class (used as targets) is divided equally into two parts, one part is used for training, and the other part together with the outliers is used for testing. The latter three datasets contain a training set and an i.i.d. test set; the largest class in the training set is used for training, and the other classes in the training set together with all instances in the test set are used for testing. The Gaussian RBF kernel is used as the kernel function in the standard OC-SVM as well as in WOC-SVM. Similar to other one-class classifier related papers, we evaluate the standard OC-SVM and WOC-SVM via the AUC value and the ROC curve. The AUC values are reported in Table 4 and the ROC curves in Fig. 7. The results are obtained with the best parameters found through grid search: the width $\sigma$ of the Gaussian RBF kernel is chosen from $\{2^{-10}, 2^{-9}, 2^{-8}, 2^{-7}, 2^{-6}, 2^{-5}\}$ and the parameter $v$ in OC-SVM and WOC-SVM is chosen from $\{0.01, 0.06, 0.11, 0.16, 0.21\}$. The number of nearest neighbors is set to 10 in both instance-weighted strategies. The reasons for selecting these parameter ranges are the same as in experiment 1.

Table 4 The AUC values comparison on the benchmark datasets. The best one is bolded in each row. The Win-Loss-Tie (W-L-T) summary based on the AUC values is attached in the last row; WOC-SVM (ND) is used as the control algorithm.

Datasets                   OC-SVM   WOC-SVM (ND)   WOC-SVM (RDD)
Yeast                      0.5642   0.5865         0.5790
Pima Indians Diabetes      0.6427   0.6523         0.6596
Liver Disorders            0.5181   0.5357         0.5127
Blood Transfusion          0.6077   0.6463         0.7058
Wdbc                       0.9748   0.9762         0.9747
Winequality Red            0.5555   0.5660         0.5476
Abalone (FvsIM)            0.6667   0.6645         0.6428
Abalone (IvsFM)            0.8281   0.8369         0.8206
Abalone (MvsIF)            0.6137   0.6241         0.6214
Svmguide 1                 0.9343   0.9346         0.9232
Satimage                   0.9607   0.9643         0.9683
Hill Valley                0.9343   0.9569         0.9232
W-L-T (vs. WOC-SVM (ND))   1-11-0   -              3-9-0

The Win-Loss-Tie (W-L-T) summary is attached in the last row of Table 4, with the proposed method used as the control algorithm. The results in Table 4 demonstrate that the proposed method performs better than OC-SVM on all benchmark datasets except Abalone (FvsIM), and better than the relative density degree on 9 datasets. The ROC curves in Fig. 7 also demonstrate that the proposed method performs best on most benchmark datasets.

[Fig. 7 here: twelve ROC panels (true positive rate vs. false positive rate) comparing OC-SVM, WOC-SVM (ND) and WOC-SVM (RDD) on Yeast, Pima Indians Diabetes, Liver Disorders, Blood Transfusion, Wdbc, Winequality Red, Abalone (FvsIM), Abalone (IvsFM), Abalone (MvsFI), Svmguide 1, Satimage and Hill Valley.]

Fig. 7 ROC curves of benchmark datasets. The dashed line is the ROC curve of OC-SVM. The dashdot line is the ROC curve of WOC-SVM (ND). The dotted line is the ROC curve of WOC-SVM (RDD).

Furthermore, Table 5 provides the Friedman aligned ranks comparison of the AUC values reported in Table 4. WOC-SVM (ND) has the minimum average rank, 10.75, which means that it performs better than OC-SVM as well as WOC-SVM (RDD). The aligned Friedman statistic is 8.985887 (distributed according to chi-square with 2 degrees of freedom), and the p-value computed by the aligned Friedman test is 0.011188, smaller than the significance level of 0.05. The null hypothesis that the measured sums of the aligned ranks do not differ from the expected mean rank is therefore rejected at a high level of significance.

Table 5 Average ranking of the three algorithms (Aligned Friedman).

                 OC-SVM    WOC-SVM (ND)    WOC-SVM (RDD)
Average ranks    23.1667   10.75           21.5833

The Finner test is used as the post hoc test [36]. The p-values obtained by applying post hoc methods to the results of the Friedman aligned test are reported in Table 6, and the adjusted p-values obtained through the application of the Finner test are reported in Table 7. In both Table 6 and Table 7, WOC-SVM (ND) is used as the control algorithm. The null hypothesis is rejected more strongly for OC-SVM than for WOC-SVM (RDD). The results in Tables 5, 6 and 7 are obtained with the KEEL tool [37-38].

Table 6 Post hoc comparison for α = 0.05 (Friedman aligned).

Algorithm        z = (R_0 - R_i)/SE    p          Finner
OC-SVM           2.886816              0.003892   0.025321
WOC-SVM (RDD)    2.518699              0.011779   0.05

Table 7 Adjusted p-values.

Algorithm        unadjusted p    p_Finner
OC-SVM           0.003892        0.007768
WOC-SVM (RDD)    0.011779        0.011779

The results of the statistical comparison in Tables 5, 6 and 7 also demonstrate that WOC-SVM (ND) performs better than OC-SVM as well as WOC-SVM (RDD).

5.3 Experiments on the web problem

The benchmark task of the web problem is to predict whether a web page belongs to a category based on the presence of 300 selected keywords on the page. There are 8 training files, each with a different number of training examples: web-1a, web-2a, web-3a, web-4a, web-5a, web-6a, web-7a and web-a. These files are converted into MATLAB format. The details are reported in Table 8: the second column is the size of the training set and the last column is the size of the test set, with the numbers in parentheses giving the size of each class. The web problem is imbalanced; the number of dimensions is 300 and the size of the training sets ranges from 2477 to 49749.

Table 8 The description of the web problem. The size of the training set and of the test set is reported in the second column and the last column, respectively. The numbers in parentheses are the size of each class.

Dataset   #Training samples (classes)   #Test samples (classes)
web-1a    2477 (2405, 72)               47272 (45865, 1407)
web-2a    3470 (3363, 107)              46279 (44907, 1372)
web-3a    4912 (4769, 143)              44837 (43501, 1336)
web-4a    7366 (7150, 216)              42383 (41120, 1263)
web-5a    9888 (9607, 281)              39861 (38663, 1198)
web-6a    17188 (16663, 525)            32561 (31607, 954)
web-7a    34692 (23952, 740)            25057 (24318, 739)
web-a     49749 (48270, 1479)           21489 (20896, 593)

There are two classes in the web problem: the former is used as the targets and the latter as the outliers. The targets in the training set are used for training; the outliers in the training set and all the samples in the test set are used for testing. The experimental results are reported as AUC values with the best parameters found via grid search. The width $\sigma$ of the Gaussian RBF kernel is chosen from $\{2^{-10}, 2^{-9}, 2^{-8}, 2^{-7}, 2^{-6}, 2^{-5}\}$ and the parameter $v$ in WOC-SVM and OC-SVM is chosen from $\{0.01, 0.06, 0.11, 0.16, 0.21\}$. The number of nearest neighbors is set to 10 in both instance-weighted strategies. The reasons for selecting these parameter ranges are the same as in experiment 1. The AUC values are reported in Table 9; the best one is bolded in each row.

Table 9 The AUC values comparison on the web problem. The bolded one is the best in each row.

Dataset   OC-SVM   WOC-SVM (ND)   WOC-SVM (RDD)
web-1a    0.6706   0.7001         0.6300
web-2a    0.6341   0.6303         0.6219
web-3a    0.6383   0.6422         0.6182
web-4a    0.6318   0.6502         0.6061
web-5a    0.6407   0.6540         0.6152
web-6a    0.6525   0.6672         0.6211
web-7a    0.6605   0.6757         0.6234
web-a     0.6700   0.6946         0.6177

The results in Table 9 demonstrate that WOC-SVM (ND) performs better than OC-SVM as well as WOC-SVM (RDD). In Fig. 8, we also compare WOC-SVM (ND) with WOC-SVM (RDD) using different numbers of nearest neighbors on web-a, the largest set in the web problem. The number of nearest neighbors ranges over {10, 30, 50, 70, 90, 110, 130, 150}.

[Fig. 8 here: AUC value vs. number of nearest neighbors (10 to 150) on web-a for WOC-SVM (ND), WOC-SVM (RDD) and OC-SVM.]

Fig. 8 The AUC values with different numbers of nearest neighbors. The horizontal axis is the number of nearest neighbors and the vertical axis is the corresponding AUC value. The dashdot line shows the AUC values of WOC-SVM (ND), the dotted line those of WOC-SVM (RDD), and the dashed line the AUC value of OC-SVM.

In Fig. 8, the horizontal axis represents the number of nearest neighbors and the vertical axis the corresponding AUC value. It can be found that WOC-SVM (ND) obtains a better result even when the number of nearest neighbors is set to 10, and it also performs better than OC-SVM for the other settings. When the number of nearest neighbors is set to 70, WOC-SVM (RDD) performs the best within the range but is still worse than WOC-SVM (ND) and OC-SVM.

6. Discussion and Conclusions

In this paper, we proposed a novel instance-weighted strategy for the one-class support vector machine. The strategy is very simple but effective: the weight is decided only by the k-nearest neighbors. A sample lying in the interior of the training set is assigned a higher weight, while a sample lying near the boundary of the training set is assigned a lower weight; thus the influence of noise can be weakened. The results of the experiments, performed on both artificial synthetic problems and UCI benchmark datasets, demonstrate that the new weighted one-class support vector machine performs better than previous ones. The results on the artificial synthetic problems also demonstrate that the new weighted one-class support vector machine is more robust to noise than the one-class support vector machine. In this paper, the proposed instance-weighted strategy focuses only on a single one-class classifier (OCC). However, the strategy can also be used wherever the one-class support vector machine can be applied, such as in the framework of ensemble one-class support vector machines [9-10,24-25]. A theoretical analysis should be provided in future work, and an online version of WOC-SVM is also worth researching.

ACKNOWLEDGMENTS The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was partially supported by the National Science Fund for Distinguished Young Scholars under Grant Nos. 61125305, 61472187, 61233011 and 61373063, the Key Project of Chinese Ministry of Education under Grant No. 313030, the 973 Program (No. 2014CB349303), Fundamental Research Funds for the Central Universities (No. 30920140121005), Program for Changjiang Scholars and Innovative Research Team in University No. IRT13072, National Basic Research Program of China (973 Program) (2012CB114505), China National Funds for Distinguished Young Scientists (31125008), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 13KJB520014).

References

[1] Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 144-152.
[2] Vapnik V. The nature of statistical learning theory[M]. Springer-Verlag New York, Inc., 1995.
[3] Schölkopf B, Platt J C, Shawe-Taylor J, et al. Estimating the support of a high-dimensional distribution[J]. Neural Computation, 2001, 13(7): 1443-1471.
[4] Guerbai Y, Chibani Y, Hadjadji B. The effective use of the one-class SVM classifier for handwritten signature verification based on writer-independent parameters[J]. Pattern Recognition, 2015, 48(1): 103-113.
[5] Lipka N, Stein B, Anderka M. Cluster-based one-class ensemble for classification problems in information retrieval[C]//Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012: 1041-1042.
[6] Cabral G G, de Oliveira A L I. One-class classification for heart disease diagnosis[C]//Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on. IEEE, 2014: 2551-2556.
[7] Guo L, Zhao L, Wu Y, et al. Tumor detection in MR images using one-class immune feature weighted SVMs[J]. Magnetics, IEEE Transactions on, 2011, 47(10): 3849-3852.
[8] Li W, Guo Q, Elkan C. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data[J]. Geoscience and Remote Sensing, IEEE Transactions on, 2011, 49(2): 717-725.
[9] Cyganek B. Image segmentation with a hybrid ensemble of one-class support vector machines[M]//Hybrid Artificial Intelligence Systems. Springer Berlin Heidelberg, 2010: 254-261.
[10] Cyganek B. One-class support vector ensembles for image segmentation and classification[J]. Journal of Mathematical Imaging and Vision, 2012, 42(2-3): 103-117.
[11] Tax D M J, Duin R P W. Support vector domain description[J]. Pattern Recognition Letters, 1999, 20(11): 1191-1199.
[12] Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods[M]. Cambridge University Press, 2000.
[13] Suykens J A K, Vandewalle J. Least squares support vector machine classifiers[J]. Neural Processing Letters, 1999, 9(3): 293-300.
[14] Suykens J A K, Van Gestel T, De Moor B, et al. Basic Methods of Least Squares Support Vector Machines[M]//Least Squares Support Vector Machines. World Scientific Publishing Co. Pte. Ltd, 2002.
[15] Choi Y S. Least squares one-class support vector machine[J]. Pattern Recognition Letters, 2009, 30(13): 1236-1240.
[16] Zhu F, et al. Boundary detection and sample reduction for one-class support vector machines[J]. Neurocomputing, 2014, 123: 166-173.
[17] Zhu F, et al. Neighbors' distribution property and sample reduction for support vector machines[J]. Applied Soft Computing, 2014, 16: 201-209.
[18] Lin C-F, Wang S-D. Fuzzy support vector machines[J]. IEEE Transactions on Neural Networks, 2002, 13(2): 464-471.
[19] Wu X, Srihari R. Incorporating prior knowledge with weighted margin support vector machines[C]//Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004: 326-333.
[20] Yang X, Song Q, Wang Y. A weighted support vector machine for data classification[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2007, 21(05): 961-976.
[21] Karasuyama M, Harada N, Sugiyama M, et al. Multi-parametric solution-path algorithm for instance-weighted support vector machines[J]. Machine Learning, 2012, 88(3): 297-330.
[22] Bicego M, Figueiredo M A T. Soft clustering using weighted one-class support vector machines[J]. Pattern Recognition, 2009, 42(1): 27-32.
[23] Zhang Y, Liu X D, Xie F D, et al. Fault classifier of rotating machinery based on weighted support vector data description[J]. Expert Systems with Applications, 2009, 36(4): 7928-7932.
[24] Krawczyk B, Woźniak M. Diversity measures for one-class classifier ensembles[J]. Neurocomputing, 2014, 126: 36-44.
[25] Krawczyk B, Woźniak M, Cyganek B. Clustering-based ensembles for one-class classification[J]. Information Sciences, 2014, 264: 182-195.
[26] Lee K Y, Kim D W, Lee D, et al. Improving support vector data description using local density degree[J]. Pattern Recognition, 2005, 38(10): 1768-1771.
[27] Zhang L, Zhang H, Zhou W, Lin Y, Li F. Density-punished support vector data description[J]. Pattern Recognition and Artificial Intelligence (in Chinese), 2014, 27(2): 160-165.
[28] Krishnapuram R, Keller J M. A possibilistic approach to clustering[J]. Fuzzy Systems, IEEE Transactions on, 1993, 1(2): 98-110.
[29] Krishnapuram R, Keller J M. The possibilistic c-means algorithm: insights and recommendations[J]. Fuzzy Systems, IEEE Transactions on, 1996, 4(3): 385-393.
[30] Friedman J H, Bentley J L, Finkel R A. An algorithm for finding best matches in logarithmic expected time[J]. ACM Transactions on Mathematical Software (TOMS), 1977, 3(3): 209-226.
[31] Wang J, Shen H T, Song J, et al. Hashing for similarity search: A survey[J]. arXiv preprint arXiv:1408.2927, 2014.
[32] Chang C C, Lin C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 27.
[33] Lichman M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[34] Metz C E. Basic principles of ROC analysis[C]//Seminars in Nuclear Medicine. WB Saunders, 1978, 8(4): 283-298.
[35] Bradley A P. The use of the area under the ROC curve in the evaluation of machine learning algorithms[J]. Pattern Recognition, 1997, 30(7): 1145-1159.
[36] García S, Fernández A, Luengo J, et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power[J]. Information Sciences, 2010, 180(10): 2044-2064.
[37] Alcala-Fdez J, Sanchez L, Garcia S, et al. KEEL: a software tool to assess evolutionary algorithms for data mining problems[J]. Soft Computing, 2009, 13(3): 307-318.
[38] Alcalá J, Fernández A, Luengo J, et al. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework[J]. Journal of Multiple-Valued Logic and Soft Computing, 2010, 17(2-3): 255-287.

Fa Zhu is currently pursuing Ph. D. degree from the School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, P. R. China. His current research interests include pattern recognition, machine learning.

Jian Yang received the B.S. degree in mathematics from Xuzhou Normal University in 1995. He received the M.S. degree in applied mathematics from Changsha Railway University in 1998 and the Ph.D. degree from the Nanjing University of Science and Technology (NUST), on the subject of pattern recognition and intelligence systems, in 2002. In 2003, he was a postdoctoral researcher at the University of Zaragoza. From 2004 to 2006, he was a Postdoctoral Fellow at the Biometrics Centre of Hong Kong Polytechnic University. From 2006 to 2007, he was a Postdoctoral Fellow at the Department of Computer Science of the New Jersey Institute of Technology. Now, he is a professor in the School of Computer Science and Technology of NUST. He is the author of more than 80 scientific papers in pattern recognition and computer vision. His journal papers have been cited more than 3000 times in the ISI Web of Science, and 6000 times in Google Scholar. His research interests include pattern recognition, computer vision and machine learning. Currently, he is an associate editor of Pattern Recognition Letters and IEEE Trans. Neural Networks and Learning Systems, respectively.

Cong Gao received the MS degree in computer application technology from Soochow University, Soochow, P.R. China, in 2012. He is currently pursuing Ph. D. degree from the Department of Computer Science, University of Regina, Saskatchewan, Canada S4S 0A2. His current research interests include Rough Set, Machine Learning and Lie Group Machine Learning.

Sheng Xu received the B.S. in computer science and technology from Nanjing Forestry University, Nanjing, China, in 2010. He is currently pursuing Ph. D. degree from the Department of Geomatics Engineering, University of Calgary, Canada. His current research interests include computer vision, pattern recognition, machine learning.

Ning Ye received M.S. degree in test measurement technology and instruments from Nanjing University of Aeronautics and Astronautics, China in 2006, he completed his Ph.D. degree in computer application technology from Southeast University, China. He is full-time professor at the School of Information technology of Nanjing Forestry University, Nanjing, China. His research interests include machine learning, bioinformatics, and data mining.

Tongming Yin received the B.S. degree in forestry and the Ph.D. degree in genetics and molecular biology from Nanjing Forestry University, Jiangsu, China. Dr. Yin's main research interests focus on genomics, gene function and molecular breeding of woody plants. His representative achievements include: (1) contribution towards the construction of genetic platforms for tree genomic studies; (2) mapping and cloning of genes underlying important traits in woody plants; (3) development of genetic tools and marker resources for applying the sequenced poplar genome to studies of alternate poplar genotypes and species; (4) discovery of the genetic mechanism triggering the evolution from hermaphroditic to dioecious plants and genomic proofs for parapatric speciation.

