Generalized support vector data description for anomaly detection

HIGHLIGHTS

• In this paper, a generalized SVDD procedure called GSVDD is introduced for multi-class data. The proposed procedure finds n hyperspheres, one for each class. GSVDD uses the information from every class, so each hypersphere keeps as many of its corresponding observations as possible inside its boundary while attempting to keep the observations of the other classes outside the hypersphere.

• In addition, a Bayesian generalized SVDD procedure is proposed that models the probabilistic behavior of the parameters by taking 'prior knowledge' into account. This procedure provides probabilistic information for each observation, and these probabilities are used for classification.

• Although some multi-class SVDD procedures exist, they do not consider the information from each class. Moreover, none of these procedures provides a probabilistic interpretation.


GENERALIZED SUPPORT VECTOR DATA DESCRIPTION FOR ANOMALY DETECTION

Mehmet Turkoz (1), Sangahn Kim (2), Youngdoo Son (3), Myong K. Jeong (4), Elsayed A. Elsayed (4)

(1) Department of Management and Information Sciences, Rutgers University, Piscataway, NJ, U.S.A.
(2) Department of Quantitative Business Analysis, Siena College, Loudonville, NY, U.S.A.
(3) Department of Industrial and Systems Engineering, Dongguk University-Seoul, Seoul, South Korea
(4) Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ, U.S.A.

Abstract

Traditional anomaly detection procedures assume that normal observations are obtained from a single distribution. However, due to the complexity of modern industrial processes, the observations may belong to multiple operating modes with different distributions. In such cases, traditional anomaly detection procedures may trigger false alarms while the process is in fact in another normal operating mode. We propose a generalized support vector-based anomaly detection procedure, called generalized support vector data description, which can be used to determine the anomalies in multimodal processes. The proposed procedure constructs a hypersphere for each class in order to include as many of that class's observations as possible while keeping the observations of the other classes as far apart as possible. In addition, we introduce a generalized Bayesian framework which not only considers the prior information from each mode but also highlights the relationships among the modes. The effectiveness of the proposed procedure is demonstrated through various simulation studies and real-life applications.

Key Words: Anomaly Detection; Bayesian Statistics; Multimode Process; Support Vector Data Description.


1 Introduction

Identifying patterns that do not conform to normal behavior is important since, in a variety of applications, such patterns carry critical and significant information that can be used to take actions to improve the processes. These patterns are called anomalies or outliers. Several procedures based on classification, nearest neighbors, clustering, and statistical methods have been developed to detect anomalies [1,2]. One of the classification-based procedures, the support vector machine (SVM) proposed by Vapnik [3], is used effectively for the detection of anomalies. Schölkopf et al. [4] extend SVM-based anomaly detection by transforming the features into a high-dimensional space via a kernel function; this procedure adapts the SVM methodology to the anomaly detection problem by separating the normal observations from the origin with maximal margin. Building on the SVM, several further procedures have been introduced for anomaly detection [5-9]. In addition to the anomaly detection procedures, multi-class SVM procedures have gained significant importance, and some of them are introduced in [10-12]. However, these procedures are not suitable for anomaly detection in a multi-class environment without modification.

A particular case of the classification-based procedures, known as one-class classification, is commonly cited in the literature. In one-class classification, it is assumed that all training instances have a single class label, i.e., all training observations are obtained from a certain distribution. Thus, one-class classification algorithms obtain a discriminative boundary around the normal data. One of the promising one-class classification procedures is support vector data description (SVDD), proposed by Tax and Duin [13]. It finds a hypersphere that describes the

data with minimal volume by transforming the original observations into a new space through the use of a kernel function. The kernel transformation improves the power of SVDD, especially when the original data are complex. SVDD has attracted great interest in a wide range of applications, mainly focusing on the detection of anomalies as well as other real-life problems such as face recognition, image processing, pattern detection and quality control [14-27].

Numerous studies of SVDD have been introduced for multi-class classification. Lee and Lee [28] introduce a classification procedure in which the decision boundaries are based on the posterior probability distribution obtained from the SVDDs of each class. Mu and Nandi [29] introduce a multistage multi-class classification procedure that formulates the decision boundaries by combining linear discriminant analysis with a nearest-neighbor rule obtained from the negative SVDDs of each class. Both of these procedures construct the decision boundaries for each class while ignoring the interaction among the classes. In contrast, Kang and Cho [30] propose a binary classification algorithm named support vector class description (SVCD). Unlike the above classification procedures, SVCD penalizes the observations of the other class if they are not classified into their corresponding classes, emphasizing the importance of information flow among the classes. However, it is important to note that these classification approaches do not identify anomalies.

On the other hand, many studies based on SVDD have been conducted to improve the traditional SVDD for anomaly detection. Lee et al. [31] improve SVDD by introducing a weight for each observation using local densities. In addition, Lee et al. [32] present a similar procedure that obtains the local densities using nearest-neighborhood and Parzen window approaches [33]. Recently, Ghasemi et al. [34] introduce an SVDD with a Bayesian approach

(BSVDD) by assuming that the transformed data in the higher-dimensional space follow a Gaussian distribution; the superiority of BSVDD is shown by assuming that the dual variables of the SVDD follow a Gaussian distribution. While the procedures mentioned above only use the normal observations to describe the data, Tax and Duin [35] utilize the negative samples (anomalies) by introducing an additional constraint that keeps the negative samples outside of the boundary.

In addition, most of the SVDD procedures for anomaly detection mentioned above assume that the target data contain only one class of normal data. However, in many real-life problems, the normal data typically consist of multiple normal classes. In such cases, applying the traditional SVDD procedures results in a description of the normal data that does not take its different classes into account. To describe and distinguish each normal class simultaneously, Huang et al. [36] introduce an anomaly detection procedure in which the normal data consist of two classes, called two-class SVDD (TC-SVDD). However, the existing one-class and TC-SVDD procedures have the following limitations. First, the existing SVDD procedures are based on the assumption that the normal data consist of one or two classes; thus, they may not distinguish the classes, which can result in poor performance in detecting anomalies. Second, the existing SVDDs cannot reflect prior or domain-specific knowledge when they are applied to real-world problems; since they find the decision boundary of a given dataset in a deterministic way, they may provide limited results.

In this paper, we propose a generalized SVDD procedure which simultaneously finds, for each class, a hypersphere that includes as many of its observations as possible. Regardless of the


number of classes, the proposed procedure identifies the anomalies based on the relative distance to the center of each class's hypersphere. Thus, the proposed procedure describes multi-class normal data more accurately by distinguishing each normal class. In addition, we introduce a Bayesian framework for the multi-class SVDD procedure. The Bayesian procedure models the probabilistic behavior of the parameters by taking 'prior knowledge' from each class into account. The prior and posterior probabilities are key factors in determining the description of each class. Therefore, it is expected that more accurate boundaries for multi-class normal data can be obtained by considering the prior knowledge of each class.

This paper is organized as follows. In Section 2, we propose the generalized SVDD procedure with a Bayesian framework. Numerical analysis with simulations is conducted in Section 3. In Section 4, a case study of a continuous stirred tank heater (CSTH) is demonstrated, followed by the conclusions in Section 5.

2. Proposed Procedure

In real-life problems, normal data can be formed in more than one [13] or two classes [36]. In this case, existing SVDD procedures do not differentiate between classes and obtain a single boundary. Therefore, in this paper, we introduce a generalized SVDD procedure with a Bayesian framework which overcomes the above-mentioned drawbacks.

2.1 Generalized Multi-Class SVDD

In this section, we introduce a multi-class SVDD procedure, applicable to any number of classes, called generalized SVDD (GSVDD). It is based on the assumption that the target data set contains n classes of objects. For given n classes, we define the data sets as

$$D_1 = \{x_1^{(1)}, \ldots, x_{N_1}^{(1)}\}, \quad D_2 = \{x_1^{(2)}, \ldots, x_{N_2}^{(2)}\}, \quad \ldots, \quad D_n = \{x_1^{(n)}, \ldots, x_{N_n}^{(n)}\},$$

where $N_k$ is the size of class $k$.

The GSVDD constructs n closed spherical boundaries (hyperspheres) around the normal classes such that each hypersphere includes as many observations of its own class as possible while keeping the observations of the other classes outside its boundary. Observations are then declared anomalies if they do not resemble any of the normal data sets. With centers $a_1, a_2, \ldots, a_n$ and radii $R_1, R_2, \ldots, R_n$ of the data sets, the primal formulation of the GSVDD is constructed as follows:

$$\min \; \sum_{k=1}^{n} R_k^2$$
$$\text{s.t.}\quad \|x_i^{(k)} - a_k\|^2 \le R_k^2, \quad k = 1, \ldots, n, \; i = 1, \ldots, N_k,$$
$$\qquad \|x_i^{(m)} - a_k\|^2 \ge R_k^2, \quad m \ne k, \; m, k = 1, \ldots, n, \; i = 1, \ldots, N_m. \tag{1}$$

The optimization problem in (1) ensures that every observation in class k falls inside hypersphere k (first constraint) and outside the other hyperspheres (second constraint). However, in order to allow an observation to fall outside its own class's boundary or inside another class's boundary, we introduce penalty terms $\xi$ and $\delta$ in each constraint. The primal formulation of the optimization problem is then rewritten as:







min  k 1 Rk2   k 1 Ck  i k1  i( k )   m1  m k 1 B( m,k )  i m1  i( m,k ) n

n

N

s.t. xi( k )  a k

2

 Rk2   i( k ) k  1,

 ak

2

 R 

x

(m) i

2 k

( m,k ) i

n

n

, n , i  1,

N

Nk

m  1...n, k  1...n, m  k , i  1,

 (2)

, Nm

 i( k ) ,  i( m,k )  0 i, k , m where Ck and B( m ,k ) are the regularization parameters that control the volume of the hyperspheres. The constraints ensure that the objective function increases by Ck  i( k ) if an observation of class k, xi( k ) falls outside of the hypersphere k, and the objective function

7

increases by B( m,k ) i( m,k ) if an observation of class m, xi( m) falls inside the hypersphere k. The most important symbols used to obtain GSVDD are summarized in Table 1. Table 1 Mathematical Symbols of GSVDD Rk ak

: radius of class k , (k=1,..,n)

 i( k )  i( m,k )

: penalty of training samples of class k which lies outside the hypersphere k.

: center of class k , (k=1,..,n)

: penalty of training samples of class m which lies inside the hypersphere k.

Dual formulation of GSVDD is obtained by introducing the following Lagrangian function.







L  Rn , a n ,  i( k ) ,  i( m,k )    k 1 Rk2   k 1 Ck  i k1  i( k )   m 1  m  k 1 B( m ,k )  i m1  i( m,k ) n

n

N

n

n

2 n N   k 1  i k1 i( k )  Rk2   i( k )   xi( k )  a k    

  m 1  m  k 1  i 1  n

n

Nm

( m,k ) i

 x ( m )  a  2  R 2   ( m , k )  k k i  i 

N

 (3)

  k 1  i k1  i( k ) i( k )   m 1  m  k 1  i m1  i( m,k ) i( m,k ) n

N

n

n

N

where  i( k )  0, i( m ,k )  0,  i( k )  0 and  i( m ,k )  0 are Lagrangian dual variables to the corresponding constraints in (2). By taking partial derivatives of L  Rn , a n ,  i( k ) ,  i( m,k )  , we obtain (4). L N n N  0   i k1 i( k )   k  m1  i m1 i( m,k )  1 Rk L N n N  0  a k   i k1 i( k ) xi( k )   k  m 1  i m1 i( m,k ) xi( m ) a k L  0  Ck   i( k )   i( k )  0  0   i( k )  Ck (k )  i L 

( m,k ) i

 0  B( m,k )   i( m,k )  i( m ,k )  0  0  i( m ,k )  B( m,k )

m, k

8

(4)

By substituting the solutions obtained in (4) into (3), we obtain the dual formulation as follows. (See Appendix A for detailed derivation of (5)).

$$\begin{aligned} \max \;& \sum_{m=1}^{n} \sum_{i=1}^{N_m} \alpha_i^{(m)} \big( x_i^{(m)} \cdot x_i^{(m)} \big) - \sum_{m=1}^{n} \sum_{k \ne m} \sum_{i=1}^{N_m} \beta_i^{(m,k)} \big( x_i^{(m)} \cdot x_i^{(m)} \big) - \sum_{k=1}^{n} \Big\| \sum_{i=1}^{N_k} \alpha_i^{(k)} x_i^{(k)} - \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} x_i^{(m)} \Big\|^2 \\ \text{s.t.}\;& \sum_{i=1}^{N_k} \alpha_i^{(k)} - \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} = 1, \quad k = 1, \ldots, n, \\ & 0 \le \alpha_i^{(k)} \le C_k \quad \forall i, k, \qquad 0 \le \beta_i^{(m,k)} \le B_{(m,k)} \quad \forall m, k, \; m \ne k. \end{aligned} \tag{5}$$
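Because the objective and constraints in (5) separate over k, the dual can be solved as one box-constrained quadratic program per hypersphere, with the class-k points entering through their α coefficients (sign +1 in the center expansion) and all other points through their β coefficients (sign -1). The following is a minimal sketch of this idea in Python, assuming the Gaussian kernel of Section 3.1 and the third-party cvxpy solver; the function names, the single scalars C and B, and the solver choice are ours, not the paper's.

```python
import numpy as np
import cvxpy as cp

def gaussian_kernel(X, Y, w=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 w^2)), the kernel used in Section 3.1
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * w ** 2))

def gsvdd_fit(X, y, C=1.0, B=1.0, w=1.0):
    """Solve the GSVDD dual (5) as one QP per hypersphere.
    Returns the signed dual coefficients of each center a_k (alpha with sign +1,
    beta with sign -1) and the kernel matrix."""
    K = gaussian_kernel(X, X, w)
    # Cholesky factor so that the quadratic term w' K w becomes a sum of squares
    Lchol = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
    coefs = {}
    for k in np.unique(y):
        s = np.where(y == k, 1.0, -1.0)        # +1 for alpha_i^(k), -1 for beta_i^(m,k)
        ub = np.where(y == k, C, B)            # 0 <= alpha <= C_k, 0 <= beta <= B_(m,k)
        lam = cp.Variable(len(y), nonneg=True)
        w_k = cp.multiply(s, lam)              # signed expansion coefficients of a_k
        objective = cp.Maximize(cp.sum(cp.multiply(s * np.diag(K), lam))
                                - cp.sum_squares(Lchol.T @ w_k))
        constraints = [lam <= ub, cp.sum(w_k) == 1]   # sum alpha - sum beta = 1
        cp.Problem(objective, constraints).solve()
        coefs[k] = np.asarray(w_k.value).ravel()
    return coefs, K
```

With a single class and no negative examples this sketch reduces to the standard SVDD dual of Tax and Duin [13,35].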

The vectors $x_i^{(k)}$ with $\alpha_i^{(k)} > 0$ and the vectors $x_i^{(m)}$ with $\beta_i^{(m,k)} > 0$ $(m \ne k)$ are called the support vectors of class k. A support vector $x_i^{(k)}$ with non-zero $\alpha_i^{(k)}$ is located on or outside hypersphere k. In addition, a normal observation of class k lies inside hypersphere k and on hypersphere m $(m \ne k)$ if $\alpha_i^{(k)} = 0$ and $0 < \beta_i^{(k,m)} < B_{(k,m)}$.

The radius $R_k$ of hypersphere k is obtained by calculating the distance from the center $a_k$ to any of the support vectors of class k, except the support vectors with $\alpha_i^{(k)} = C_k$ or $\beta_i^{(m,k)} = B_{(m,k)}$. Let $x_s$ be a support vector of class k satisfying $0 < \alpha_s^{(k)} < C_k$ and $0 < \beta_s^{(m,k)} < B_{(m,k)}$. Then, for $k = 1, \ldots, n$,

$$\begin{aligned} R_k^2 = \|x_s - a_k\|^2 ={}& (x_s \cdot x_s) - 2 \sum_{i=1}^{N_k} \alpha_i^{(k)} \big(x_s \cdot x_i^{(k)}\big) + 2 \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} \big(x_s \cdot x_i^{(m)}\big) + \sum_{i=1}^{N_k} \sum_{j=1}^{N_k} \alpha_i^{(k)} \alpha_j^{(k)} \big(x_i^{(k)} \cdot x_j^{(k)}\big) \\ & + \sum_{m \ne k} \sum_{i=1}^{N_m} \sum_{t \ne k} \sum_{j=1}^{N_t} \beta_i^{(m,k)} \beta_j^{(t,k)} \big(x_i^{(m)} \cdot x_j^{(t)}\big) - 2 \sum_{m \ne k} \sum_{i=1}^{N_m} \sum_{j=1}^{N_k} \beta_i^{(m,k)} \alpha_j^{(k)} \big(x_i^{(m)} \cdot x_j^{(k)}\big). \end{aligned} \tag{6}$$

In order to determine whether a new observation z is an anomaly, we introduce a decision rule based on the knowledge obtained from the n hyperspheres, namely $R_k$ and $a_k$. The decision rule is based on the distances from the new observation z to the center $a_k$ of each hypersphere: z is classified as an anomaly if it is not placed in any of the hyperspheres, as shown in (7).

$$\|z - a_k\|^2 > R_k^2, \quad k = 1, \ldots, n. \tag{7}$$

By replacing the inner products $x_i \cdot x_j$ in (5) with any kernel function $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ satisfying Mercer's theorem, more flexible boundaries can be obtained [3]. In addition, GSVDD reduces to the original SVDD [13,35] and to TC-SVDD [36] when n equals 1 and 2, respectively.
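In kernel form, the squared distance of any point z to the center $a_k$ is $K(z,z) - 2\sum_j w_j K(z, x_j) + w^T K w$, where w collects the signed dual coefficients of hypersphere k, so (6) and (7) can be evaluated without the explicit feature map. The sketch below continues the previous one; the anomaly label -1 and the tie-breaking by smallest kernel distance are our conventions, not the paper's.

```python
import numpy as np

def dist2_to_center(k_zx, k_zz, w_k, K):
    # ||phi(z) - a_k||^2 = K(z,z) - 2 sum_j w_j K(z, x_j) + w_k' K w_k, cf. Eq. (6)
    return k_zz - 2.0 * k_zx @ w_k + w_k @ K @ w_k

def gsvdd_predict(Z, X, y, coefs, K, C=1.0, w=1.0, tol=1e-6):
    """Decision rule (7): a point is an anomaly if it lies outside every hypersphere."""
    radii = {}
    for k, w_k in coefs.items():
        # radius from an unbounded support vector of class k (0 < alpha_s < C_k), Eq. (6)
        sv = np.where((y == k) & (w_k > tol) & (w_k < C - tol))[0]
        s_idx = sv[0] if len(sv) else int(np.argmax(w_k))
        radii[k] = dist2_to_center(K[s_idx], K[s_idx, s_idx], w_k, K)
    K_zx = gaussian_kernel(Z, X, w)          # helper from the previous sketch
    labels = []
    for i in range(len(Z)):
        # K(z, z) = 1 for the Gaussian kernel
        d2 = {k: dist2_to_center(K_zx[i], 1.0, w_k, K) for k, w_k in coefs.items()}
        inside = [k for k in d2 if d2[k] <= radii[k]]
        labels.append(min(inside, key=d2.get) if inside else -1)  # -1 marks an anomaly
    return labels
```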

2.2 Bayesian Generalized Multi-Class SVDD

Although the parameters of the proposed GSVDD procedure are assumed to be fixed constants to be determined, the resulting center $a_k$ of a class k may be a random vector with inherent randomness induced by the given training data set. Therefore, in this section, we introduce a Bayesian framework for the generalized SVDD, called Bayesian generalized SVDD (BGSVDD). With the underlying solution of GSVDD in (4), the mean $a_k$ is determined as a weighted average of the training data. Using this information, we introduce a Bayesian approach in GSVDD by assuming that a normal observation $x_j^{(k)}$ $(k = 1, \ldots, n)$ is transformed into a higher-dimensional space through $\phi(\cdot)$ and that the transformed data $\phi(x_j^{(k)})$ follow a Gaussian distribution with mean $a_k = \sum_{i=1}^{N_k} \alpha_i^{(k)} \phi(x_i^{(k)}) - \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} \phi(x_i^{(m)})$ and covariance matrix $\sigma_{kk}^2 I$:

$$\phi(x_j^{(k)}) \sim N\!\left( \sum_{i=1}^{N_k} \alpha_i^{(k)} \phi(x_i^{(k)}) - \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} \phi(x_i^{(m)}), \; \sigma_{kk}^2 I \right). \tag{8}$$

The difference between the estimation in (8) and traditional Gaussian density estimation lies in the estimation of the mean in (8), which is a weighted average of the training target points. Thus, this model is called the weighted Gaussian model. Since the distance of a point to the center of hypersphere k is inversely proportional to the likelihood under the weighted Gaussian model, GSVDD is a special case of the weighted Gaussian model. Thus, in this paper, we propose a Bayesian framework for the generalized SVDD, which is more realistic and is based on the dual variables of GSVDD. The unknown dual variables $\boldsymbol{\alpha} = (\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(n)}, \boldsymbol{\beta}^{(1,2)}, \ldots, \boldsymbol{\beta}^{(n,n-1)})$ can be estimated through a Bayesian approach with proper prior distributions for $\alpha_i^{(k)}$ and $\beta_i^{(m,k)}$, where $\boldsymbol{\alpha}^{(k)} = (\alpha_1^{(k)}, \ldots, \alpha_{N_k}^{(k)})^T$ and $\boldsymbol{\beta}^{(m,k)} = (\beta_1^{(m,k)}, \ldots, \beta_{N_m}^{(m,k)})^T$ are the dual variables of GSVDD.

The BGSVDD approach assumes that the prior distributions of the dual variables $\alpha_i^{(k)}$ and $\beta_i^{(m,k)}$ are independent Gaussian distributions, which are the conjugate prior for the likelihood, defined as follows:

$$p\big(\boldsymbol{\alpha}^{(k)}\big) = \frac{1}{(2\pi)^{N_k/2} \sigma^{N_k}} \, e^{-\frac{1}{2\sigma^2} \|\boldsymbol{\alpha}^{(k)} - \mathbf{m}^{(k)}\|^2}, \qquad p\big(\boldsymbol{\beta}^{(m,k)}\big) = \frac{1}{(2\pi)^{N_m/2} \sigma^{N_m}} \, e^{-\frac{1}{2\sigma^2} \|\boldsymbol{\beta}^{(m,k)} - \mathbf{m}^{(m,k)}\|^2}. \tag{9}$$

Since it is assumed that training data in each class mapped into a higher dimensional kernel space follow a Gaussian distribution, the likelihood probability given parameter α becomes

$$p(D \mid \boldsymbol{\alpha}) = \prod_{k=1}^{n} \prod_{j=1}^{N_k} \frac{1}{(2\pi\sigma_{kk}^2)^{p/2}} \exp\!\left( -\frac{ \big\| \phi(x_j^{(k)}) - \sum_{i=1}^{N_k} \alpha_i^{(k)} \phi(x_i^{(k)}) + \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} \phi(x_i^{(m)}) \big\|^2 }{2\sigma_{kk}^2} \right), \tag{10}$$

where $D = D_1 \cup \cdots \cup D_n$ is the set of training data. By the usual Bayesian rule,

$$p(\boldsymbol{\alpha} \mid D) = \frac{p(D \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha})}{p(D)}.$$

Since $p(D)$ is a normalizing constant independent of $\boldsymbol{\alpha}$, the posterior distribution of $\boldsymbol{\alpha}$ can be written, ignoring $p(D)$, as

$$p(\boldsymbol{\alpha} \mid D) \propto p(D \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha}). \tag{11}$$

Then, the maximum a posteriori (MAP) solution is given by

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \; p(\boldsymbol{\alpha} \mid D), \tag{12}$$

and is obtained as in (13) by taking the logarithm of (12) and using the relationships in (11):

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \left\{ -\sum_{k=1}^{n} \frac{1}{2\sigma_{kk}^2} \sum_{j=1}^{N_k} \Big\| \phi(x_j^{(k)}) - \sum_{i=1}^{N_k} \alpha_i^{(k)} \phi(x_i^{(k)}) + \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} \phi(x_i^{(m)}) \Big\|^2 - \frac{1}{2\sigma^2} \left[ \sum_{k=1}^{n} \big\| \boldsymbol{\alpha}^{(k)} - \mathbf{m}^{(k)} \big\|^2 + \sum_{m=1}^{n} \sum_{k \ne m} \big\| \boldsymbol{\beta}^{(m,k)} - \mathbf{m}^{(m,k)} \big\|^2 \right] \right\}. \tag{13}$$

Expanding the norms in (13) through the kernel matrix $\mathbf{K}$ yields a quadratic function of the dual variables, in which $B_{ii}^{(m)} = \sum_j K_{ij}^{(m)}$, $B_{ii}^{(m,k)} = \sum_j K_{ij}^{(m,k)}$, and $\mathbf{1}^{(m)}$ and $\mathbf{1}^{(m,k)}$ are $1 \times N_m$ vectors of ones (see Appendix B for the detailed derivation and the fully expanded form of (13)). The kernel matrix $\mathbf{K}$ for the n classes is

$$\mathbf{K} = \begin{pmatrix} K^{(1)} & K^{(1,2)} & \cdots & K^{(1,n)} \\ K^{(2,1)} & K^{(2)} & \cdots & K^{(2,n)} \\ \vdots & \vdots & \ddots & \vdots \\ K^{(n,1)} & K^{(n,2)} & \cdots & K^{(n,n)} \end{pmatrix},$$

where $K^{(k)}$ is an $N_k \times N_k$ matrix whose elements are the kernel values between the elements of class k:

$$K_{ij}^{(k)} = K\big(x_i^{(k)}, y_j^{(k)}\big) = \big\langle \phi(x_i^{(k)}), \phi(y_j^{(k)}) \big\rangle, \quad x_i^{(k)}, y_j^{(k)} \in D_k, \tag{14}$$
$$K_{ij}^{(m,k)} = K\big(x_i^{(m)}, y_j^{(k)}\big) = \big\langle \phi(x_i^{(m)}), \phi(y_j^{(k)}) \big\rangle, \quad x_i^{(m)} \in D_m, \; y_j^{(k)} \in D_k, \; m \ne k = 1, \ldots, n. \tag{15}$$
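A small sketch of assembling the block kernel matrix K in (14)-(15) from the per-class data sets; the kernel argument is any callable returning pairwise kernel values (for example the Gaussian kernel above), and the function name is ours.

```python
import numpy as np

def block_kernel(class_data, kernel):
    """class_data: list [D_1, ..., D_n] of (N_k, d) arrays.
    Returns the N x N matrix K whose diagonal blocks are K^(k) and whose
    off-diagonal (m, k) blocks are K^(m,k)."""
    return np.vstack([np.hstack([kernel(X_m, X_k) for X_k in class_data])
                      for X_m in class_data])
```

Equivalently, K equals kernel(X, X) for the stacked data X when the rows and columns are ordered class by class.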

Note that the optimization problem in (13) must satisfy the same constraints as the original optimization problem in (5). Thus, BGSVDD satisfies the following constraints:

$$\sum_{i=1}^{N_k} \alpha_i^{(k)} - \sum_{m \ne k} \sum_{i=1}^{N_m} \beta_i^{(m,k)} = 1, \quad k = 1, \ldots, n,$$
$$0 \le \alpha_i^{(k)} \le C_k, \quad i = 1, \ldots, N_k, \; k = 1, \ldots, n, \tag{16}$$
$$0 \le \beta_i^{(m,k)} \le B_{(m,k)}, \quad i = 1, \ldots, N_m, \; m \ne k = 1, \ldots, n.$$

The optimization problem of BGSVDD is a quadratic program, which implies that BGSVDD has an optimal solution. Therefore, BGSVDD is similar to the previous versions of SVDD in terms of complexity, with a computational complexity of $O(N^3)$, where $N = N_1 + \cdots + N_n$ is the total number of normal observations. The BGSVDD process can be implemented based on the following pseudo code:

Algorithm: Bayesian Generalized SVDD

Input: $D_1 = \{x_1^{(1)}, \ldots, x_{N_1}^{(1)}\}$, $D_2 = \{x_1^{(2)}, \ldots, x_{N_2}^{(2)}\}$, ..., $D_n = \{x_1^{(n)}, \ldots, x_{N_n}^{(n)}\}$

Step 1. Initialize the model parameters and the kernel matrix $\mathbf{K}$.
Step 2. Calculate the dual solution $\hat{\boldsymbol{\alpha}}$ from the MAP solution by solving the optimization problem in (13) with the constraints in (16).
Step 3. Obtain the radius and center of each class ($R_k$ and $a_k$) from Equations (6) and (4), respectively.
Step 4. Identify the anomalies. For a new observation z, compute $\|z - a_k\|^2$ for k = 1, ..., n. If z lies outside every hypersphere, it is classified as an anomaly; otherwise it is classified into the class of a hypersphere that contains it.
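Step 2 maximizes the log-posterior, i.e., the kernelized Gaussian log-likelihood (10) plus the log-priors (9), subject to the constraints (16). The sketch below evaluates the negative log-posterior (up to additive constants) in kernel form, using one non-negative coefficient per training point and hypersphere (alpha for the hypersphere's own class, beta for the others), and hands it to a generic SLSQP solver. This is only an illustration of the objective under our compact reconstruction of (13): all names and default parameter values are ours, a single shared sigma_kk^2 is used for brevity, and a practical implementation would instead exploit the quadratic structure of (13) with a QP solver.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(lam_flat, K, y, classes, prior_mean, sigma2, sigma2_kk):
    """Negative of the MAP objective (13), up to additive constants.
    prior_mean[i][j] holds m_j^(k) if y[j] == classes[i], else m_j^(y[j], k)."""
    N = len(y)
    lam = lam_flat.reshape(len(classes), N)
    val = 0.0
    for i, k in enumerate(classes):
        s = np.where(y == k, 1.0, -1.0)
        w_k = s * lam[i]                                   # signed coefficients of a_k, Eq. (4)
        own = np.where(y == k)[0]
        # likelihood term of (10): ||phi(x_j^(k)) - a_k||^2 in kernel form
        d2 = np.diag(K)[own] - 2.0 * K[own] @ w_k + w_k @ K @ w_k
        val += d2.sum() / (2.0 * sigma2_kk)
        # Gaussian prior (9) on the coefficients of this hypersphere
        val += ((lam[i] - prior_mean[i]) ** 2).sum() / (2.0 * sigma2)
    return val

def bgsvdd_map(K, y, classes, prior_mean, C=1.0, B=1.0, sigma2=1.0, sigma2_kk=1.0):
    N, nc = len(y), len(classes)
    bounds = [(0.0, C if y[j] == k else B) for k in classes for j in range(N)]
    cons = [{'type': 'eq',
             'fun': lambda x, i=i, k=k: np.where(y == k, 1.0, -1.0) @ x.reshape(nc, N)[i] - 1.0}
            for i, k in enumerate(classes)]                # sum alpha - sum beta = 1, Eq. (16)
    x0 = np.full(nc * N, 1.0 / N)
    res = minimize(neg_log_posterior, x0,
                   args=(K, y, classes, prior_mean, sigma2, sigma2_kk),
                   bounds=bounds, constraints=cons, method='SLSQP')
    return res.x.reshape(nc, N)
```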

2.2.1 Parameter Settings of the Prior Distribution

It is assumed that the prior distributions of the dual variables of GSVDD, $\boldsymbol{\alpha}^{(k)}$ and $\boldsymbol{\beta}^{(k,m)}$ $(k \ne m = 1, \ldots, n)$, follow Gaussian distributions. Thus, the parameters of the prior distributions ($\mathbf{m}^{(k)}$ and $\mathbf{m}^{(k,m)}$) need to be estimated beforehand in order to solve the optimization problem. It was shown above that if the vector $x_i^{(k)}$ is inside the boundary of class k, the corresponding $\alpha_i^{(k)}$ is zero. Since the boundary covers the densest area, such an $x_i^{(k)}$ (with $\alpha_i^{(k)} = 0$) is close to most of the other vectors, and the sum of the kernel distances between $x_i^{(k)}$ and the other vectors in class k is small. However, if $x_i^{(k)}$ is outside the boundary of class k, then $\alpha_i^{(k)}$ equals $C_k$; since the region outside the boundary is expected to be less dense than the region around the center, the sum of the kernel distances between $x_i^{(k)}$ and the other vectors in class k is large. In addition, if $x_i^{(k)}$ is on the boundary, the corresponding $\alpha_i^{(k)}$ satisfies $0 < \alpha_i^{(k)} < C_k$; since this is also a less dense area, the sum of the kernel distances would also be large. Therefore, it is reasonable to adjust $\boldsymbol{\alpha}^{(k)}$ by assigning a smaller prior weight to vectors inside the boundary of class k and a larger weight to the others. We can then obtain $\mathbf{m}^{(k)}$ as

$$m_i^{(k)} = \sum_{j \in D_k} K_{i,j}^{(k)}, \tag{17}$$

where $\mathbf{m}^{(k)} = (m_1^{(k)}, \ldots, m_{N_k}^{(k)})$. Accordingly, it is expected that the support vectors receive larger prior values, so a sparse and accurate solution boundary is obtained. To control the sparsity of the solution, $m_i^{(k)}$ is chosen as

$$m_i^{(k)} = \left( \sum_{j=1}^{N_k} K_{i,j}^{(k)} \right)^{v}, \quad \text{where } 0 < v < 1.$$

In addition, class-k observations that satisfy $\beta_i^{(k,m)} > 0$ are also support vectors for class m. These support vectors lie on the boundary of class m if $x_i^{(k)}$ satisfies $0 < \beta_i^{(k,m)} < B_{(k,m)}$, or inside the boundary of class m if $\beta_i^{(k,m)} = B_{(k,m)}$. Thus, these support vectors are closer to the class-m observations than to the class-k observations. Therefore, we determine $\mathbf{m}^{(k,m)}$ as

$$m_i^{(k,m)} = \sum_{j \in D_m} K_{i,j}^{(k,m)}, \tag{18}$$

and the sparsity is again controlled by the parameter v as

$$m_i^{(k,m)} = \left( \sum_{j=1}^{N_m} K_{i,j}^{(k,m)} \right)^{v}, \quad k \ne m = 1, 2, \ldots, n, \; 0 < v < 1. \tag{19}$$
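A short sketch of these prior-mean settings: following (17)-(19), the prior mean of a coefficient is the row sum of the relevant kernel block raised to the sparsity exponent v. The dictionary layout and the function name are ours.

```python
import numpy as np

def prior_means(K, y, classes, v=0.5):
    """means[(k, k)][i] is m_i^(k) from (17); means[(k, m)][i], m != k, is m_i^(k,m)
    from (18)-(19), both raised to the sparsity exponent 0 < v < 1."""
    means = {}
    for k in classes:
        rows = np.where(y == k)[0]
        for m in classes:
            cols = np.where(y == m)[0]
            means[(k, m)] = K[np.ix_(rows, cols)].sum(axis=1) ** v
    return means
```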

2.3. Probabilistic Interpretation Obtained from BGSVDD

We propose a Bayesian framework for the predictive distribution with a finite-dimensional mapping function. Assuming that a new input instance z belongs to class k, the predictive distribution of z can be found as

$$\begin{aligned} P(z \mid D, \mathrm{Class}_k) &= \int P(z \mid \boldsymbol{\alpha}, \mathrm{Class}_k)\, P(\boldsymbol{\alpha} \mid D)\, d\boldsymbol{\alpha} \\ &= \int \frac{1}{(2\pi\sigma_{kk}^2)^{p/2}} \exp\!\left( -\frac{ \big\| \phi(z) - \sum_{j=1}^{N_k} \alpha_j^{(k)} \phi(x_j^{(k)}) + \sum_{m \ne k} \sum_{j=1}^{N_m} \beta_j^{(m,k)} \phi(x_j^{(m)}) \big\|^2 }{2\sigma_{kk}^2} \right) P(\boldsymbol{\alpha} \mid D)\, d\boldsymbol{\alpha}, \end{aligned} \tag{20}$$

where $\mathrm{Class}_k$ refers to the event that $z \in \mathrm{Class}_k$ and p is the dimension of the feature map $\phi(\cdot)$. The integral in (20) is often intractable for general posterior distributions $P(\boldsymbol{\alpha} \mid D)$. Therefore, we approximate (20) by the MAP approximation, and the predictive distribution becomes

$$P(z \mid D, \mathrm{Class}_k) \approx P(z \mid \hat{\boldsymbol{\alpha}}, \mathrm{Class}_k) = \frac{1}{(2\pi\sigma_{kk}^2)^{p/2}} \exp\!\left( -\frac{ \big\| \phi(z) - \sum_{j=1}^{N_k} \hat{\alpha}_j^{(k)} \phi(x_j^{(k)}) + \sum_{m \ne k} \sum_{j=1}^{N_m} \hat{\beta}_j^{(m,k)} \phi(x_j^{(m)}) \big\|^2 }{2\sigma_{kk}^2} \right). \tag{21}$$

After obtaining the predictive distribution, the probability that a new instance belongs to each class can be obtained as follows:

$$P(\mathrm{Class}_k \mid z, D) = \frac{P(z \mid D, \mathrm{Class}_k)\, P(\mathrm{Class}_k)}{P(z \mid D)}, \tag{22}$$

where $P(\mathrm{Class}_k)$ is the prior probability of class k. For example, $P(\mathrm{Class}_k)$ can be chosen as $N_k / \sum_{m=1}^{n} N_m$ if appropriate prior knowledge of the probability of each class is unavailable. Also note that $\sum_{k=1}^{n} P(\mathrm{Class}_k \mid z, D) = 1$.

The suggested Bayesian framework also determines the outliers by defining the outlier as a class other than the pre-existing classes, with predictive distribution

$$P(\mathrm{Class}_{out} \mid z, D) = \frac{P(z \mid D, \mathrm{Class}_{out})\, P(\mathrm{Class}_{out})}{P(z \mid D)}. \tag{23}$$

However, this expression requires the pre-definition of the following quantities:

$P(\mathrm{Class}_{out})$: the proportion of outliers among all data points, usually referred to as v in SVDD and its variants.

$P(z \mid D, \mathrm{Class}_{out})$: the probability density of the outliers. In this paper, this density is assumed to be uniform over the region where the outliers, which do not belong to a specific class, may lie.

The prior probabilities of the other classes are modified accordingly as

$$P(\mathrm{Class}_k) = \frac{N_k}{\sum_{m=1}^{n} N_m} \big( 1 - P(\mathrm{Class}_{out}) \big). \tag{24}$$

We extend the same concept to kernel functions with infinite-dimensional feature mappings and consider a domain with volume $b\,(2\pi)^{p/2}\sigma_{kk}^{p}$, e.g. a hypercube with edge length $b^{1/p}(2\pi)^{1/2}\sigma_{kk}$. Then, the probabilities for each class are obtained as follows:

$$P(\mathrm{Class}_k \mid z, D) = \frac{\kappa_k\, P(\mathrm{Class}_k)}{\sum_{k'=1}^{n} \kappa_{k'}\, P(\mathrm{Class}_{k'}) + b^{-1} P(\mathrm{Class}_{out})}, \tag{25}$$

$$P(\mathrm{Class}_{out} \mid z, D) = \frac{b^{-1} P(\mathrm{Class}_{out})}{\sum_{k=1}^{n} \kappa_k\, P(\mathrm{Class}_k) + b^{-1} P(\mathrm{Class}_{out})}, \tag{26}$$

where

$$\kappa_k = \exp\!\left( -\frac{1}{2\sigma_{kk}^2} \Big\| \phi(z) - \sum_{j=1}^{N_k} \hat{\alpha}_j^{(k)} \phi(x_j^{(k)}) + \sum_{m \ne k} \sum_{j=1}^{N_m} \hat{\beta}_j^{(m,k)} \phi(x_j^{(m)}) \Big\|^2 \right). \tag{27}$$
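Once the kernel distance of z to each center is available (as in the decision rule above), the probabilities in (25)-(27) reduce to a normalization over the terms kappa_k P(Class_k) plus the uniform outlier term. A sketch with our own function and variable names; the prior P(Class_k) follows (24), and a single sigma_kk^2 is shared across classes for brevity.

```python
import numpy as np

def bgsvdd_probabilities(d2_to_centers, class_sizes, sigma2_kk, p_out, b):
    """Eqs. (24)-(27): d2_to_centers[k] = ||phi(z) - a_k||^2, class_sizes[k] = N_k,
    p_out = P(Class_out), b = outlier-domain scale so that the outlier density is 1/b."""
    total = float(sum(class_sizes.values()))
    prior = {k: n / total * (1.0 - p_out) for k, n in class_sizes.items()}        # Eq. (24)
    kappa = {k: np.exp(-d2 / (2.0 * sigma2_kk)) for k, d2 in d2_to_centers.items()}  # Eq. (27)
    denom = sum(kappa[k] * prior[k] for k in kappa) + p_out / b
    probs = {k: kappa[k] * prior[k] / denom for k in kappa}                        # Eq. (25)
    probs['outlier'] = (p_out / b) / denom                                         # Eq. (26)
    return probs
```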

2.3.1. Probabilistic Interpretation

Fig. 1 shows an example of the proposed probabilistic interpretation. We generate 100 points for each class from two-dimensional normal distributions. The blue circles denote data points that belong to Class 1, and the red squares denote points that belong to Class 2. Five black cross-shaped test points are also shown in Fig. 1, with the probabilities obtained by the proposed procedure after training on the given points of Classes 1 and 2. The three probability values refer to the probabilities of belonging to Class 1, Class 2, and the outlier class, respectively. It is noteworthy that the probability of belonging to Class 1 (or 2) is highest when a test point is located near Class 1 (or 2), and the outlier probability is highest when a test point is located far from the training points.

Fig. 1. Example of probabilistic interpretation

3. Experimental Results

Experiments are conducted on simulated and real-life data sets to compare the proposed BGSVDD and GSVDD procedures with existing SVDD procedures, namely TC-SVDD, the traditional SVDD, and BSVDD. In addition, GE-SVDD [21] and GOC-ELM [22], which were recently proposed using variance criteria and/or graphs describing properties of the class, are also compared. Moreover, SVM is used for comparison. However, since SVM does not provide any outlier detection criterion, we define the probability used for classification by applying a logistic sigmoid function to the distance to the decision boundary of the classifier as follows:

$$p(x \mid C_1) = \frac{1}{1 + e^{-f(x)}} \quad \text{and} \quad p(x \mid C_2) = 1 - p(x \mid C_1), \quad \text{where } f(x) = \sum_{i \in SV} \lambda_i\, k(x, x_i) + b.$$

We then label an observation as an outlier when the difference between $p(x \mid C_1)$ and $p(x \mid C_2)$ is small, specifically less than a threshold $\varepsilon$. In other words, we classify the instances as follows:

$$x \in C_1 \; \text{ if } \; p(x \mid C_1) - p(x \mid C_2) > \varepsilon,$$
$$x \in C_2 \; \text{ if } \; p(x \mid C_2) - p(x \mid C_1) > \varepsilon,$$
$$x \in C_0 \; \text{ if } \; \big| p(x \mid C_1) - p(x \mid C_2) \big| \le \varepsilon,$$

where $C_0$ denotes the outlier class (class 0).
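A minimal sketch of this baseline rule: squash the SVM decision values f(x) through the logistic sigmoid and reject (label 0) any point whose two class probabilities differ by less than epsilon. The function name and the default epsilon are ours.

```python
import numpy as np

def svm_with_reject(decision_values, eps=0.1):
    """Returns 1, 2, or 0 (outlier) for each SVM decision value f(x)."""
    p1 = 1.0 / (1.0 + np.exp(-np.asarray(decision_values)))   # p(x | C1)
    p2 = 1.0 - p1                                              # p(x | C2)
    return np.where(p1 - p2 > eps, 1, np.where(p2 - p1 > eps, 2, 0))
```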

3.1. Experimental Setting

The kernel function is chosen as the Gaussian kernel, defined as

$$K(\mathbf{x}, \mathbf{y}) = \exp\!\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2w^2} \right),$$

where w is the width parameter of the Gaussian kernel. The parameters of the proposed SVDD procedures and of the benchmark procedures, including the kernel parameters, are obtained by grid search using 10-fold cross-validation as described in Kang and Cho [30].

In our experiments, three performance measures are used for comparison: the total classification accuracy rate (TCAR), the overall classification accuracy rate (OCAR), and the F-measure. TCAR measures the classification accuracy rate for the testing data set and is defined as follows:

$$\mathrm{TCAR} = 0.5 \left( \frac{\sum_{i=1}^{n} \#\text{ of true classifications in class } i}{\text{total number of normal observations}} + \frac{\#\text{ of true identifications of anomalies}}{\text{number of anomalies}} \right) \tag{28}$$

However, TCAR is not suitable for comparing against the traditional SVDD and the Bayesian SVDD, since TCAR is designed for multi-class data. Therefore, we also use OCAR, which treats the combination of all normal classes as a single class:

$$\mathrm{OCAR} = 0.5 \left( \frac{\#\text{ of true classifications for the combined class}}{\text{total number of normal observations}} + \frac{\#\text{ of true identifications of anomalies}}{\text{number of anomalies}} \right) \tag{29}$$

The F-measure is also used to provide a better comparison with the other procedures and is defined as

$$F\text{-measure} = 2\, \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \tag{30}$$

where precision and recall are defined as

$$\text{precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \quad \text{and} \quad \text{recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}, \tag{31}$$

where True Positive, False Positive and False Negative are obtained by considering each class individually, so that classification and anomaly detection performance are compared simultaneously. For each data set, we report the average performance of each algorithm as a percentage over the 10-fold cross-validation procedure; the standard deviation is shown in parentheses, also as a percentage. The experiments are conducted with MATLAB 2018b on a Windows 10 (64-bit) desktop with an Intel Core i7-6700 CPU at 3.4 GHz (quad core) and 16 GB RAM.
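The three measures above can be computed directly from the predicted labels. In the sketch below, label 0 marks a (predicted or true) anomaly and 1, ..., n mark the normal classes; this labeling convention and the function names are ours.

```python
import numpy as np

def tcar(y_true, y_pred):
    """Eq. (28): mean of per-class accuracy on the normals and accuracy on the anomalies."""
    normal, anomaly = y_true != 0, y_true == 0
    acc_normal = np.sum((y_pred == y_true) & normal) / normal.sum()
    acc_anomaly = np.sum((y_pred == 0) & anomaly) / anomaly.sum()
    return 0.5 * (acc_normal + acc_anomaly)

def ocar(y_true, y_pred):
    """Eq. (29): all normal classes merged into a single class before scoring."""
    return tcar((y_true != 0).astype(int), (y_pred != 0).astype(int))

def f_measure(y_true, y_pred, k):
    """Eqs. (30)-(31) with class k treated as the positive class."""
    tp = np.sum((y_pred == k) & (y_true == k))
    fp = np.sum((y_pred == k) & (y_true != k))
    fn = np.sum((y_pred != k) & (y_true == k))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```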

3.2 Performance Analysis with Simulated Data

In this section, the proposed procedures are compared with the existing procedures using the banana-shaped [37] and multivariate skew normal (MSN) [38,39] distributions illustrated in Fig. 2 (a) and (b), respectively.

Fig. 2. (a) Banana-shaped data (b) MSN data

In the first case, two classes of two-dimensional banana-shaped data of size 70 each are generated and considered as the target data. Anomaly data are obtained from a two-class multivariate Gaussian data set of size 50 per class, with means (5, -5) and (-5, -4) and identity covariance matrices. Table 2 reports the performance of BGSVDD against the benchmark procedures on the banana-shaped data. The results show that BGSVDD outperforms the other SVDD procedures based on OCAR and is comparable with TC-SVDD based on the TCAR performance measure. These results show that the proposed BGSVDD procedure performs quite well on banana-shaped data.

Table 2 The average classification accuracy of 10-fold cross-validation on banana-shaped data

Performance Measure   SVDD       BSVDD      GOC-ELM    GE-SVDD    SVM        TC-SVDD    BGSVDD
TCAR                  NA         NA         NA         NA         87(3.8)    92(5.3)    92(5.5)
OCAR                  90(6.4)    88(6.9)    92(7.1)    90(7.9)    87(3.8)    92(5.6)    93(5.7)
F-Measure             77(11.1)   61(17.1)   83(5.6)    83(7.7)    74(7.0)    85(9.0)    86(7.7)

In the second case, we use the MSN distribution, which provides a general set-up for comparisons on non-normal data by controlling the skewness [38,39]. Two target data sets of size 100 each are obtained from bivariate skew normal distributions with skewness parameters (0, 30) and (0, -30), respectively. Three anomaly data sets are obtained by shifting the mean of the first variable of each target data set with shift sizes 0.5, 1 and 1.5. Tables 3, 4 and 5 report the TCAR, OCAR and F-measure of the proposed and existing procedures according to the shift size. We observe that the proposed BGSVDD shows good performance in all cases. In addition, the TCAR and OCAR measures are the same because these data sets are well separated.

Table 3 TCAR of 10-fold cross-validation on two-dimensional MSN data

Shift Size   SVDD   BSVDD   GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
0.5          NA     NA      NA        NA        52(5.2)   56(4.3)   57(4.2)
1            NA     NA      NA        NA        60(4.9)   65(4.3)   68(4.2)
1.5          NA     NA      NA        NA        69(4.9)   76(4.4)   78(5.1)

Table 4 OCAR of 10-fold cross-validation on two-dimensional MSN data

Shift Size   SVDD      BSVDD     GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
0.5          55(3.9)   55(3.6)   53(2.8)   56(6.6)   52(5.2)   56(4.3)   57(4.2)
1            66(5.4)   66(5.0)   63(2.6)   66(5.1)   60(4.9)   65(4.3)   68(4.2)
1.5          75(3.1)   76(3.9)   77(3.9)   78(4.7)   69(4.9)   76(4.4)   78(5.1)

Table 5 F-Measure of 10-fold cross-validation on two-dimensional MSN data

Shift Size   SVDD      BSVDD     GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
0.5          18(3.5)   18(3.6)   17(0.0)   19(7.5)   17(2.5)   18(1.9)   21(2.5)
1            31(5.4)   32(5.4)   21(1.4)   33(5.8)   21(3.0)   27(3.9)   35(4.5)
1.5          50(7.7)   51(9.5)   35(3.1)   51(8.7)   31(4.1)   35(2.7)   53(7.2)

In Tables 5, 6 and 7 we increase the dimension to five and report the TCAR, OCAR and F-measure performance measures. The MSN data are obtained from two five-dimensional MSN distributions with sizes of 100 and skewness parameters (-2, 1, -2, 1, -2) and (2, -1, 2, -1, 2), respectively. To obtain the anomaly data, the first three variables (variables 1, 2 and 3) of each target data set are shifted with sizes 0.5, 1 and 1.5, which provides three anomaly data sets. Tables 5, 6 and 7 show that BGSVDD outperforms the other procedures over the cases considered based on the TCAR and OCAR measures.

Table 5 TCAR of 10-fold cross-validation on five-dimensional MSN data

Shift Size   SVDD   BSVDD   GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
0.50         NA     NA      NA        NA        56(5.6)   58(6.1)   60(5.2)
1.00         NA     NA      NA        NA        69(6.2)   72(4.0)   74(5.8)
1.50         NA     NA      NA        NA        80(5.8)   81(4.9)   82(4.8)

Table 6 OCAR of 10-fold cross-validation on five-dimensional MSN data

Shift Size   SVDD      BSVDD     GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
0.50         61(4.8)   61(5.1)   57(5.2)   62(4.7)   58(5.7)   61(4.9)   64(4.9)
1.00         73(4.9)   74(4.9)   69(3.5)   74(7.1)   71(6.2)   75(4.3)   78(5.6)
1.50         83(4.8)   84(5.3)   86(3.6)   86(4.0)   82(5.8)   85(5.6)   87(3.4)

Table 7 F-Measure of 10-fold cross-validation on five-dimensional MSN data

Shift Size   SVDD      BSVDD     GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
0.50         23(2.1)   23(3.2)   20(4.5)   23(2.1)   18(2.7)   21(4.9)   24(3.8)
1.00         41(8.8)   39(8.7)   36(7.8)   40(6.7)   29(4.8)   41(9.0)   42(4.5)
1.50         65(9.0)   68(6.6)   58(6.4)   69(11.5)  53(7.0)   68(5.4)   67(8.3)

Next, we show that the proposed method is not limited to uni-modal classes but is also applicable to multi-modal classes. Multi-modality can simply be transferred to the multi-class setting with multiple single modes; even when each class is assumed to be multi-modal, the proposed method provides superior performance. To show how the proposed procedure applies to a multi-modal data set, we employ and modify the two-dimensional toy data set introduced in [40]. In Fig. 3, the blue and red points indicate the multi-modal data set and the black points represent the anomaly points.

Fig. 3. Multi-modal data

Table 8 shows that the proposed procedure outperforms the existing procedures based on the OCAR performance.

Table 8 OCAR of 10-fold cross-validation on multi-modal data

SVDD      BSVDD     GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
86(3.2)   85(3.0)   92(1.7)   89(9.1)   91(1.0)   89(4.6)   93(2.5)

3.3 Performance Analysis with Real Data

In this section, we analyze the performance of the proposed procedure using data sets obtained from the UCI machine-learning repository: the Balance, Ecoli, Glass, Iris, Seeds and Wine data sets. Each data set is summarized in Table 9. Since the original Glass data set has seven classes, we only use its first two classes and its last class in our experiments. In each case, one of the classes is considered as the anomaly class; thus, we consider three cases for each data set.

Table 9 Description of the data sets used in the experiments

Data Set   Number of Classes   Sample Size   Number of Attributes
Balance    3                   288,288,49    4
Ecoli      3                   143,77,35     7
Glass      3                   70,76,29      9
Iris       3                   50,50,50      4
Seeds      3                   70,70,70      7
Wine       3                   59,71,48      13

Comparisons of BGSVDD with the other methods in terms of TCAR, OCAR and F-measure are presented in Tables 10, 11 and 12, respectively, and show the superiority of the proposed algorithm over the others in most cases. Although the proposed procedure achieves better performance on most data sets, it does not perform as well on some data sets, such as the Balance and Seeds data sets. This occurs when the difference between the numbers of instances in the normal classes is small. For example, the numbers of data points in the normal classes of Case 1 of the Balance data are identical (288 each), and the proposed method is inferior to the compared methods in that case.


Table 10 TCAR performance of 10-fold cross-validation on the real data sets

Data      Cases    Data Size    SVDD   BSVDD   GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
Balance   Case 1   288,288,49   NA     NA      NA        NA        87(3.0)   76(1.7)   78(2.7)
Balance   Case 2   288,288,49   NA     NA      NA        NA        81(3.3)   84(2.1)   86(2.2)
Balance   Case 3   288,288,49   NA     NA      NA        NA        81(4.2)   84(3.7)   85(4.0)
Ecoli     Case 1   143,77,35    NA     NA      NA        NA        56(5.2)   77(5.4)   81(3.5)
Ecoli     Case 2   143,77,35    NA     NA      NA        NA        77(5.3)   86(4.2)   87(4.1)
Ecoli     Case 3   143,77,35    NA     NA      NA        NA        59(5.5)   79(4.5)   79(4.8)
Glass     Case 1   70,76,29     NA     NA      NA        NA        64(6.0)   79(6.7)   75(7.3)
Glass     Case 2   70,76,29     NA     NA      NA        NA        63(5.5)   65(9.4)   71(8.1)
Glass     Case 3   70,76,29     NA     NA      NA        NA        45(9.0)   57(6.8)   64(5.8)
Iris      Case 1   50,50,50     NA     NA      NA        NA        86(8.5)   90(4.9)   93(5.6)
Iris      Case 2   50,50,50     NA     NA      NA        NA        86(6.8)   91(4.8)   90(5.6)
Iris      Case 3   50,50,50     NA     NA      NA        NA        80(9.1)   94(5.9)   94(4.3)
Seeds     Case 1   70,70,70     NA     NA      NA        NA        76(8.2)   88(4.4)   88(4.7)
Seeds     Case 2   70,70,70     NA     NA      NA        NA        72(8.2)   89(4.1)   88(4.3)
Seeds     Case 3   70,70,70     NA     NA      NA        NA        87(6.0)   87(3.8)   88(4.4)
Wine      Case 1   59,71,48     NA     NA      NA        NA        83(4.7)   90(7.2)   89(8.6)
Wine      Case 2   59,71,48     NA     NA      NA        NA        64(25)    89(6.1)   90(6.4)
Wine      Case 3   59,71,48     NA     NA      NA        NA        51(10)    77(6.7)   79(8.1)

Table 11 OCAR performance of 10-fold cross-validation on the real data sets

Data      Cases    Data Size    SVDD      BSVDD     GOC-ELM   GE-SVDD   SVM       TC-SVDD   BGSVDD
Balance   Case 1   288,288,49   57(4.6)   55(0.0)   62(1.9)   57(5.2)   87(3.0)   76(1.7)   78(2.7)
Balance   Case 2   288,288,49   88(2.5)   80(2.9)   86(1.4)   88(3.2)   82(3.4)   85(2.8)   89(2.4)
Balance   Case 3   288,288,49   87(3.7)   80(2.8)   86(1.6)   87(3.6)   82(4.7)   85(4.2)   88(4.8)
Ecoli     Case 1   143,77,35    71(4.0)   80(4.8)   67(4.8)   76(7.0)   56(5.0)   78(5.0)   82(3.2)
Ecoli     Case 2   143,77,35    82(4.6)   85(3.7)   76(6.6)   84(3.6)   77(5.3)   86(4.2)   87(4.1)
Ecoli     Case 3   143,77,35    87(3.9)   91(7.9)   88(3.7)   90(2.6)   62(6.8)   90(4.0)   89(4.1)
Glass     Case 1   70,76,29     93(4.2)   92(4.0)   93(4.5)   94(3.4)   65(5.9)   93(2.5)   93(1.5)
Glass     Case 2   70,76,29     62(3.9)   59(4.3)   56(7.9)   62(3.9)   63(5.5)   66(9.0)   71(8.5)
Glass     Case 3   70,76,29     56(9.1)   56(7.3)   58(8.2)   59(8.6)   45(9.0)   57(6.8)   64(5.8)
Iris      Case 1   50,50,50     92(4.6)   93(5.6)   91(7.2)   92(5.7)   86(8.5)   90(4.9)   93(5.6)
Iris      Case 2   50,50,50     91(5.8)   90(4.6)   88(7.3)   91(5.5)   86(6.8)   91(4.8)   91(5.2)
Iris      Case 3   50,50,50     98(3.3)   98(4.8)   98(4.2)   98(3.3)   80(9.1)   98(3.5)   98(2.5)
Seeds     Case 1   70,70,70     89(5.7)   91(5.1)   89(4.3)   90(4.7)   76(8.2)   91(4.8)   92(5.3)
Seeds     Case 2   70,70,70     91(1.8)   91(2.2)   92(5.4)   91(3.2)   72(8.2)   92(2.3)   91(3.0)
Seeds     Case 3   70,70,70     83(4.3)   80(4.3)   78(3.4)   83(4.3)   87(6.0)   87(3.8)   88(4.4)
Wine      Case 1   59,71,48     91(6.4)   90(6.1)   85(5.5)   91(7.2)   84(4.9)   91(5.6)   91(8.5)
Wine      Case 2   59,71,48     86(3.6)   82(3.3)   81(6.4)   87(4.8)   65(24)    89(6.1)   90(6.4)
Wine      Case 3   59,71,48     71(2.0)   76(4.4)   74(5.7)   75(4.5)   52(9.4)   78(7.9)   80(9.3)

Table 12 F-Measure performance of 10-fold cross-validation on the real data sets

Data      Cases    Data Size    SVDD       BSVDD      GOC-ELM    GE-SVDD    SVM       TC-SVDD    BGSVDD
Balance   Case 1   288,288,49   71(0.7)    72(0.5)    74(1.8)    72(0.5)    85(4.0)   80(1.8)    88(3.2)
Balance   Case 2   288,288,49   83(4.9)    67(6.2)    64(1.4)    80(4.9)    69(4.6)   81(3.1)    81(3.3)
Balance   Case 3   288,288,49   81(5.7)    68(7.3)    65(4.6)    80(5.7)    68(5.1)   81(5.5)    80(7.1)
Ecoli     Case 1   143,77,35    74(9.6)    77(7.1)    60(6.0)    81(4.7)    50(6.4)   70(4.1)    82(5.2)
Ecoli     Case 2   143,77,35    77(4.9)    78(4.7)    56(10.6)   78(4.8)    61(6.2)   72(4.1)    79(4.9)
Ecoli     Case 3   143,77,35    81(6.5)    84(7.9)    67(9.8)    85(5.1)    30(11)    65(8.9)    77(7.9)
Glass     Case 1   70,76,29     90(4.4)    89(3.6)    91(5.9)    91(5.9)    43(15)    69(14.0)   74(11.1)
Glass     Case 2   70,76,29     31(5.3)    30(11.1)   23(8.2)    31(5.0)    30(6.0)   34(10.8)   54(15.7)
Glass     Case 3   70,76,29     25(12.0)   19(6.9)    27(10.4)   30(20.3)   18(6.0)   25(8.4)    36(10.6)
Iris      Case 1   50,50,50     86(16.7)   89(13.2)   84(12.8)   87(11.3)   68(1.1)   85(7.5)    91(6.0)
Iris      Case 2   50,50,50     85(8.4)    80(7.4)    78(13.3)   84(8.2)    67(6.9)   85(7.6)    90(12.8)
Iris      Case 3   50,50,50     98(3.7)    97(5.6)    97(4.6)    98(3.7)    73(14)    93(7.0)    95(4.0)
Seeds     Case 1   70,70,70     70(9.2)    83(10.2)   77(7.7)    86(8.9)    63(13)    84(6.0)    86(6.0)
Seeds     Case 2   70,70,70     70(9.2)    84(7.6)    87(6.7)    85(6.2)    60(15)    80(6.6)    86(7.1)
Seeds     Case 3   70,70,70     67(5.7)    66(5.1)    59(3.9)    71(7.5)    78(7.0)   71(5.6)    73(11.3)
Wine      Case 1   59,71,48     87(10.6)   86(9.7)    78(8.0)    89(9.6)    80(7.2)   87(9.3)    87(10.0)
Wine      Case 2   59,71,48     71(10.2)   63(4.1)    61(16.0)   67(4.6)    48(28)    82(8.6)    85(7.6)
Wine      Case 3   59,71,48     48(14.0)   60(6.4)    54(8.1)    60(6.4)    31(12)    46(16.4)   65(14.2)

In addition, the Wilcoxon signed-rank test [41], one of the most popular non-parametric tests of statistical significance, is used to determine whether the improvement in anomaly detection performance obtained with BGSVDD is statistically significant. Tables 13 and 14 present the results of the Wilcoxon test based on the OCAR and F-measure performance measures, respectively. A significance level of 0.05 is used for the analysis. The symbol '√' indicates that BGSVDD is significantly better than the corresponding procedure in terms of OCAR or F-measure.

These results indicate that the outcome of BGSVDD is meaningfully distinguishable from the other procedures.

Table 13 Result of the Wilcoxon signed-rank test based on OCAR

BGSVDD vs.             SVDD       BSVDD      GOC-ELM    GE-SVDD    SVM        TC-SVDD
z-value                -4.13020   -3.95390   -4.20000   -3.53830   -4.29170   -3.80000
p-value                0.00001    0.00008    0.00001    0.00040    0.00001    0.00014
Wilcoxon test result   √          √          √          √          √          √

Table 14 Result of the Wilcoxon signed-rank test based on F-measure

BGSVDD vs.             SVDD       BSVDD      GOC-ELM    GE-SVDD    SVM        TC-SVDD
z-value                -2.89250   -3.18850   -3.75350   -2.20000   -3.98220   -4.10330
p-value                0.00386    0.00142    0.00018    0.02780    0.00006    0.00001
Wilcoxon test result   √          √          √          √          √          √

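The paired test behind Tables 13 and 14 can be reproduced with scipy by pairing the per-case scores of BGSVDD with those of a competitor. The sketch below uses, for illustration only, the BGSVDD and TC-SVDD OCAR values of the first nine cases of Table 11 (Balance, Ecoli, Glass), so the resulting p-value is not the one reported in Table 13, which uses all cases.

```python
from scipy.stats import wilcoxon

# OCAR values (as proportions) for the Balance, Ecoli and Glass cases of Table 11
bgsvdd  = [0.78, 0.89, 0.88, 0.82, 0.87, 0.89, 0.93, 0.71, 0.64]
tc_svdd = [0.76, 0.85, 0.85, 0.78, 0.86, 0.90, 0.93, 0.66, 0.57]

stat, p_value = wilcoxon(bgsvdd, tc_svdd)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.4f}")
```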
Moreover, the average computational times are reported in Table 15 to show the efficiency of the proposed procedure for data sets of different sizes. We report the training and testing times of TC-SVDD and BGSVDD, since both procedures are designed for multi-class data sets. The average running time of TC-SVDD is larger than that of BGSVDD for the smaller data sets. Although BGSVDD and TC-SVDD have similar computation times for large data sets, the proposed procedure achieves higher performance, as confirmed by the Wilcoxon signed-rank test.

Table 15 Average running time of the compared algorithms in seconds

                      Number of data points
Algorithms    50       100      250      500
TC-SVDD       0.1532   0.3734   6.6411   46.6702
BGSVDD        0.1061   0.2892   5.8086   46.4448

In order to show the sensitivity of the proposed BGSVDD procedure to the choice of kernel function, we consider two popular kernels: the Hermite kernel introduced by Moghaddam and Hamidzadeh [42] and the Gaussian kernel. Table 16 presents the F-measure of the proposed procedure with the Hermite and Gaussian kernels on five-dimensional normally distributed data. The results are reported for different entry parameters of the Hermite kernel and different values of the width parameter w of the Gaussian kernel, and they show that the Gaussian kernel generally performs better than the Hermite kernel. A cross-validation-based approach can be used to optimize the kernel type and its parameters.

Table 16 The effect of the kernel functions according to different parameters

Hermite entry parameter                                  |   Gaussian parameter w
0        1        2        3        4        5           |   10       15       20       25
0.1538   0.6500   0.5143   0.4533   0.4533   0.5939      |   0.5329   0.6400   0.6962   0.6581

3.4 Impact of Noise on the Performance

In this section we analyze the sensitivity of the proposed procedure to noise. We create normal and anomaly data sets from five-dimensional Gaussian distributions. Since the normal data sets do not contain noise, we obtain noisy data by adding class noise to each class and attribute noise to each attribute, as explained in Hamidzadeh and Namaei [25] and Zhu and Wu [43]. Fig. 4 shows the effect of noise on BGSVDD.

Fig. 4. Noise effect on BGSVDD

Fig. 4 covers noise levels in the interval 0-30%; the noise level is indicated on the horizontal axis, and the TCAR, OCAR and F-measure performances (in percent) are indicated on the vertical axes. As seen in Fig. 4, the proposed procedure is robust to noise, although the TCAR, OCAR and F-measure performances deteriorate as the noise level increases.

4. Case Studies

4.1 Continuous Stirred Tank Heater (CSTH)

The continuous stirred tank heater plant is used in the Department of Chemical and Materials Engineering at the University of Alberta [44]. In this plant, hot and cold water are mixed and heated by steam through a heating coil, and the mixed water is drained from the tank through a long pipe. The CSTH plant is shown in Fig. 5. The hot and cold water are well stirred and mixed in the tank, so the temperature in the tank can be assumed to be the same as the outflow temperature while the water level in the tank is kept at its desired value. The tank has a circular cross section with a volume of 8l cm3 and a height of 50 cm.

Fig. 5. Continuous stirred tank heater [44]

The steam temperature (TC), the thermocouple temperature (TT), the cold water flow (FC), the heating flow (FT), the level controller (LC), and the heating level (LT) are the instruments used to operate the plant. The cold and hot water (CW and HW) enter the plant at a pressure of 60-80 psi, and the hot water boiler is heated by the steam supply. The control valves in the CSTH plant are operated by air pressure using a 3-15 psi compressed air supply. The flow instruments provide signals with a nominal 4-20 mA output, and the signals of the level instruments are reported on the same mA scale.

The inputs of the process are the CW, HW and steam valve demands, and the outputs consist of the electronic measurements from the level, cold and hot water flow, and temperature

instruments. The disturbances of the experiment include a deterministic oscillatory disturbance to the cold water flow rate, a random disturbance to the level, and temperature measurement noise.

The aim of this study is to determine the dynamic responses of the outputs to specified inputs. For given input values, it is desired to keep the volume and the outflow temperature in steady-state conditions. In other words, the system is in a normal operating condition if the desired volume and outflow temperature are obtained with the adjusted input values (volume, temperature and HW valve). It is not always possible to reach the desired volume and outflow temperature even when the input values are adjusted; when the desired values are obtained for the adjusted input values, the process operates under normal conditions, and such an operating condition is called a normal operating mode. Two normal operating conditions of the CSTH are shown in Table 17.

Table 17 Two normal operating conditions of the CSTH plant

Input Variables       Mode 1   Mode 2
Desired Volume        12       12
Desired Temperature   10.5     10.5
HW Valve              0        5.5

The CSTH plant achieves steady-state conditions in which the desired variables are satisfied if the adjustments are chosen as one of the modes in Table 17. In this study, we obtain data from three controlled variables: the level, the temperature, and the CW flow. The two modes of the CSTH are obtained under normal operating conditions with 100 observations per mode. In addition, 100 abnormal observations are obtained from each mode by introducing a sudden step change of -0.25, -0.50 or -0.75 into the level measurement at steps 101 to 200. Plots of the normal and abnormal data (for the -0.75 level shift) are given in Fig. 6.

Fig. 6. CSTH normal and abnormal data

The three observable variables (the level, CW flow and temperature) are measured under the normal and abnormal operating processes for each mode. The 3D plot of the normal observations is shown in Fig. 7.

Fig. 7. Mode 1 (*) and Mode 2 (+)

Tables 18 and 19 show the TCAR and OCAR performance measures of TC-SVDD and the proposed BGSVDD, and indicate that the proposed procedure outperforms TC-SVDD. In addition, these data are not well separated; for well-separated data (such as the banana-shaped data), both procedures identify the normal and abnormal observations correctly.

Table 18 TCAR performance of 10-fold cross-validation on CSTH data

Shift Size   TC-SVDD   BGSVDD
-0.25        0.8293    0.8470
-0.50        0.8822    0.9015
-0.75        0.9102    0.9400

Table 19 OCAR performance of 10-fold cross-validation on CSTH data

Shift Size   TC-SVDD   BGSVDD
-0.25        0.8293    0.8470
-0.50        0.8822    0.9015
-0.75        0.9102    0.9400

5 Conclusions

Identifying anomalies is important because they may carry critical and significant information in many applications, such as machine fault detection and the detection of abnormal objects and behaviors in video and in chemical processes. Several anomaly detection procedures have been introduced and, among them, SVDD has gained considerable attention. Most SVDD-based procedures are built on the assumption that the normal data consist of only one or two classes. However, in modern industry, it is common for the target data to contain multiple classes. In this paper, we propose a generalized SVDD procedure called GSVDD for multi-class data. The proposed procedure finds n hyperspheres; each hypersphere keeps as many of its corresponding observations as possible inside its boundary while attempting to keep the observations of the other classes outside. In addition, we propose a Bayesian generalized SVDD procedure that models the probabilistic behavior of the parameters by taking 'prior knowledge' into account. Experiments with simulated and real data sets, including a case study, show that BGSVDD is superior to the existing SVDD procedures.

The proposed method can also be applied when each class is multi-modal. When the class multi-modality is unclear, novel unsupervised learning techniques such as kernel k-means can be adopted to recognize the modes of each class before identifying the anomalies

[45]. This research can be also extended to obtain a semi-supervised generalized multi-class SVDD formulation to apply in a wider range of data sets. References [1] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Computing Surveys (CSUR). 41 (2009) 15. [2] V. Hodge, J. Austin, A survey of outlier detection methodologies, Artificial Intelligence Review. 22 (2004) 85–126. [3] V. Vapnik, The nature of statistical learning theory, Springer Science & Business Media, (2013). [4] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation. 13 (2001) 1443–1471. [5] K. L. Li, H. K. Huang, S. F. Tian, W. Xu, Improving one-class SVM for anomaly detection, Machine Learning and Cybernetics, 2003 International Conference On, IEEE, (2003) 3077– 3081. [6] V.A. Sotiris, W.T. Peter, M.G. Pecht, Anomaly detection through a bayesian support vector machine, IEEE Transactions on Reliability. 59 (2010) 277–286. [7] M. Amer, M. Goldstein, S. Abdennadher, Enhancing one-class support vector machines for unsupervised anomaly detection, Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, ACM, (2013) 8–15. [8] S.M. Erfani, S. Rajasegarar, S. Karunasekera, C. Leckie, High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning, Pattern Recognition. 58 (2016) 121–134.

37

[9]

V. Mygdalis, A. Iosifidis, A. Tefas, I. Pitas, Exploiting subclass information in one-class support vector machine for video summarization, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, (2015) 2259-2263.

[10] J. Weston, C. Watkins, Multi-class support vector machines, in: European Symposium on Artificial Neural Networks. (1999) 1–10. [11] I. Kotsia, S. Zafeiriou, I. Pitas, A novel class of multiclass classifiers based on the minimization of within-class-variance, IEEE Trans. NeuralNetw.20(2009) 14–34. [12] A. Iosifidis, M. Gabbouj, Multi-class support vector machine classifiers using intrinsic and penalty graphs, Pattern Recognition., 55 (2016), 231-246. [13] D.M. Tax, R.P. Duin, Support vector domain description, Pattern Recognition Letters. 20 (1999) 1191–1199. [14] S. W. Lee, J. Park, S. W. Lee, Low resolution face recognition based on support vector data description, Pattern Recognition. 39 (2006) 1809–1812. [15] F. Bovolo, G. Camps-Valls, L. Bruzzone, A support vector domain method for change detection in multitemporal images, Pattern Recognition Letters. 31 (2010) 1148–1154. [16] X. Ning, F. Tsung, Improved design of kernel distance–based charts using support vector methods, IIE Transactions. 45 (2013) 464–476. [17] C. D. Wang, J. Lai, Position regularized support vector domain description, Pattern Recognition. 46 (2013) 875–884. [18] S. Zheng, Smoothly approximated support vector domain description, Pattern Recognition. 49 (2016) 55–64.

38

[19] M. Turkoz, S. Kim, Y.-S. Jeong, K.N. Al-Khalifa, A.M. Hamouda, Distribution-free adaptive step-down procedure for fault identification, Quality and Reliability Engineering International. 32 (2016) 2701–2716. [20] Y. Zhang, H. Lu, L. Zhang, X. Ruan, Combining motion and appearance cues for anomaly detection, Pattern Recognition. 51 (2016) 443–452. [21] A. Iosifidis, V. Mygdalis, A. Tefas, I. Pitas, Graph embedded one-class classifiers for media data classification Pattern Recognition. 60 (C) (2016), 585-595. [22] V. Mygdalis, A. Iosifidis, A. Tefas, I. Pitas, One-class classification based on extreme learning and geometric class information. Neural Processing Letters. 45(2) (2017), 577592. [23] V. Mygdalis, A. Iosifidis, A. Tefas, I. Pitas, Kernel subclass support vector description for face and human action recognition. In2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE). (2016), 1-5. [24] J. Hamidzadeh, M Moradi, Improved one-class classification using filled function. Applied Intelligence. 48(10) (2018), 3263-3279. [25] J. Hamidzadeh, N. Namaei, Belief-based chaotic algorithm for support vector data description. Soft Computing, 23(12), (2019), 4289-4314. [26] R. Sadeghi, J. Hamidzadeh, Automatic support vector data description. Soft Computing. 22(1) (2018), 147-158. [27] J. Hamidzadeh, R. Sadeghi, N. Namaei, Weighted support vector data description based on chaotic bat algorithm. Applied Soft Computing. 60 (2017), 540-51.


[28] D. Lee, J. Lee, Domain described support vector classifier for multi-classification problems, Pattern Recognition. 40 (2007) 41–51.
[29] T. Mu, A.K. Nandi, Multiclass classification based on extended support vector data description, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 39 (2009) 1206–1216.
[30] P. Kang, S. Cho, Support vector class description (SVCD): Classification in kernel space, Intelligent Data Analysis. 16 (2012) 351–364.
[31] K. Lee, D.W. Kim, D. Lee, K.H. Lee, Improving support vector data description using local density degree, Pattern Recognition. 38 (2005) 1768–1771.
[32] K. Lee, D.W. Kim, K.H. Lee, D. Lee, Density-induced support vector data description, IEEE Transactions on Neural Networks. 18 (2007) 284–289.
[33] E. Parzen, On estimation of a probability density function and mode, The Annals of Mathematical Statistics. 33 (1962) 1065–1076.
[34] A. Ghasemi, H.R. Rabiee, M.T. Manzuri, M.H. Rohban, A Bayesian approach to the data description problem, AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, July 22. (2012).
[35] D.M. Tax, R.P. Duin, Support vector data description, Machine Learning. 54 (2004) 45–66.
[36] G. Huang, H. Chen, Z. Zhou, F. Yin, K. Guo, Two-class support vector data description, Pattern Recognition. 44 (2011) 320–329.
[37] R.P. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. De Ridder, D. Tax, S. Verzakov, PRTools 4 – A MATLAB toolbox for pattern recognition, Version 4.1, Delft University of Technology. (2007).


[38] A. Azzalini, A.D. Valle, The multivariate skew-normal distribution, Biometrika. 83 (1996) 715–726.
[39] A. Azzalini, A. Capitanio, Statistical applications of the multivariate skew normal distribution, Journal of the Royal Statistical Society: Series B (Statistical Methodology). 61 (1999) 579–602.
[40] H.-C. Kim, J. Lee, Clustering based on Gaussian processes, Neural Computation. 19 (2007) 3088–3107.
[41] J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research. 7 (2006) 1–30.
[42] V.H. Moghaddam, J. Hamidzadeh, New Hermite orthogonal polynomial kernel and combined kernels in support vector machine classifier, Pattern Recognition. 60 (2016) 921–935.
[43] X. Zhu, X. Wu, Class noise vs. attribute noise: A quantitative study, Artificial Intelligence Review. 22(3) (2004) 177–210.
[44] N.F. Thornhill, S.C. Patwardhan, S.L. Shah, A continuous stirred tank heater simulation model with applications, Journal of Process Control. 18 (2008) 347–360.
[45] I. Ben Khediri, C. Weihs, M. Limam, Kernel k-means clustering based local support vector domain description fault detection of multimodal processes, Expert Systems with Applications. 39(2) (2012) 2166–2171.

Appendix A: Derivation of Equation (5)

By reorganizing (3), the following equation is obtained:

$$
\begin{aligned}
L\big(R_n,\mathbf{a}_n,\alpha_i^{(k)},\beta_i^{(m,k)}\big)
&= \sum_{k=1}^{n} R_k^{2} + \sum_{k=1}^{n}\sum_{i=1}^{N_k} C_k\,\xi_i^{(k)}
 + \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m} B^{(m,k)}\xi_i^{(m,k)} \\
&\quad - \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}R_k^{2}
 - \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\xi_i^{(k)}
 + \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\big\|\mathbf{x}_i^{(k)}-\mathbf{a}_k\big\|^{2} \\
&\quad - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\big\|\mathbf{x}_i^{(m)}-\mathbf{a}_k\big\|^{2}
 + \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}R_k^{2}
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\xi_i^{(m,k)} \\
&\quad - \sum_{k=1}^{n}\sum_{i=1}^{N_k}\gamma_i^{(k)}\xi_i^{(k)}
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\gamma_i^{(m,k)}\xi_i^{(m,k)}
\end{aligned}
\tag{A.1}
$$

By simplifying (A.1), the following equation is obtained:

$$
\begin{aligned}
L\big(R_n,\mathbf{a}_n,\alpha_i^{(k)},\beta_i^{(m,k)}\big)
&= \sum_{k=1}^{n} R_k^{2}
 - \sum_{k=1}^{n} R_k^{2}\Bigg(\sum_{i=1}^{N_k}\alpha_i^{(k)} - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\Bigg) \\
&\quad + \sum_{k=1}^{n}\sum_{i=1}^{N_k}\xi_i^{(k)}\Big(C_k-\alpha_i^{(k)}-\gamma_i^{(k)}\Big)
 + \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\xi_i^{(m,k)}\Big(B^{(m,k)}-\beta_i^{(m,k)}-\gamma_i^{(m,k)}\Big) \\
&\quad + \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\big\|\mathbf{x}_i^{(k)}-\mathbf{a}_k\big\|^{2}
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\big\|\mathbf{x}_i^{(m)}-\mathbf{a}_k\big\|^{2}
\end{aligned}
$$

Maximizing $L\big(R_n,\mathbf{a}_n,\alpha_i^{(k)},\beta_i^{(m,k)}\big)$ by using constraints 1, 2, and 4 in (4), the following optimization problem is obtained:

$$
\begin{aligned}
\max\;& \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\big\|\mathbf{x}_i^{(k)}-\mathbf{a}_k\big\|^{2}
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\big\|\mathbf{x}_i^{(m)}-\mathbf{a}_k\big\|^{2} \\
\text{s.t.}\;& \sum_{i=1}^{N_k}\alpha_i^{(k)} - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)} = 1,\qquad k=1,\dots,n, \\
& 0\le\alpha_i^{(k)}\le C_k,\;\forall\, i,k,\qquad
  0\le\beta_i^{(m,k)}\le B^{(m,k)},\;\forall\, m,k,\; m\neq k.
\end{aligned}
\tag{A.2}
$$

In (A.2), the objective function is expanded as follows:

$$
\begin{aligned}
\max\;& \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)\top}\mathbf{x}_i^{(k)}
 - 2\sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)\top}\mathbf{a}_k
 + \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{a}_k^{\top}\mathbf{a}_k \\
&\; - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)\top}\mathbf{x}_i^{(m)}
 + 2\sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)\top}\mathbf{a}_k
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{a}_k^{\top}\mathbf{a}_k
\end{aligned}
\tag{A.3}
$$

subject to the same constraints as in (A.2). By collecting the terms in $\mathbf{a}_k$, Equation (A.3) can be simplified as follows:

$$
\begin{aligned}
\max\;& \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)\top}\mathbf{x}_i^{(k)}
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)\top}\mathbf{x}_i^{(m)} \\
&\; - 2\sum_{k=1}^{n}\mathbf{a}_k^{\top}\Bigg(\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)}
 - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)}\Bigg)
 + \sum_{k=1}^{n}\mathbf{a}_k^{\top}\mathbf{a}_k\Bigg(\sum_{i=1}^{N_k}\alpha_i^{(k)}
 - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\Bigg)
\end{aligned}
\tag{A.4}
$$

subject to the same constraints as in (A.2). From (4), we have the following equations:

$$
\sum_{i=1}^{N_k}\alpha_i^{(k)} - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)} = 1,
\qquad
\mathbf{a}_k = \sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)}
 - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)}.
$$

If these equations are substituted in (A.4), the following optimization problem is obtained:

$$
\max\; \sum_{k=1}^{n}\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)\top}\mathbf{x}_i^{(k)}
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)\top}\mathbf{x}_i^{(m)}
 - 2\sum_{k=1}^{n}\mathbf{a}_k^{\top}\mathbf{a}_k + \sum_{k=1}^{n}\mathbf{a}_k^{\top}\mathbf{a}_k
\tag{A.5}
$$

subject to the same constraints as in (A.2). In (A.5), we can substitute $\mathbf{a}_k$ with $\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)} - \sum_{m\neq k}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)}$. Therefore, Eq. (5) is obtained as follows:

$$
\begin{aligned}
\max\;& \sum_{m=1}^{n}\sum_{i=1}^{N_m}\alpha_i^{(m)}\big\langle\mathbf{x}_i^{(m)},\mathbf{x}_i^{(m)}\big\rangle
 - \sum_{m=1}^{n}\sum_{\substack{k=1 \\ k\neq m}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\big\langle\mathbf{x}_i^{(m)},\mathbf{x}_i^{(m)}\big\rangle
 - \sum_{k=1}^{n}\Bigg\|\sum_{i=1}^{N_k}\alpha_i^{(k)}\mathbf{x}_i^{(k)}
 - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)}\mathbf{x}_i^{(m)}\Bigg\|^{2} \\
\text{s.t.}\;& \sum_{i=1}^{N_k}\alpha_i^{(k)} - \sum_{\substack{m=1 \\ m\neq k}}^{n}\sum_{i=1}^{N_m}\beta_i^{(m,k)} = 1,\qquad k=1,\dots,n, \\
& 0\le\alpha_i^{(k)}\le C_k,\;\forall\, i,k,\qquad
  0\le\beta_i^{(m,k)}\le B^{(m,k)},\;\forall\, m,k,\; m\neq k.
\end{aligned}
$$
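Because the objective and constraints in Eq. (5) couple $\alpha_i^{(k)}$ and $\beta_i^{(m,k)}$ only within the same hypersphere $k$, the dual can be solved one hypersphere at a time. The sketch below illustrates this for a single hypersphere with a linear kernel. It is a minimal illustration rather than the authors' implementation: the toy data, the values chosen for $C_k$ and $B^{(m,k)}$, and the use of SciPy's SLSQP solver are assumptions made only for this example.

# Minimal sketch (not the authors' code): solving the per-hypersphere dual of
# Eq. (5) with a linear kernel, using SciPy's SLSQP solver on toy data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_own = rng.normal(loc=0.0, scale=1.0, size=(30, 2))    # observations of class k
X_other = rng.normal(loc=4.0, scale=1.0, size=(20, 2))   # observations of the other classes
C, B = 0.5, 0.5                                          # box-constraint parameters (assumed values)

def solve_sphere(X_own, X_other, C, B):
    """Maximize  sum_i a_i<x_i,x_i> - sum_l b_l<z_l,z_l> - ||sum_i a_i x_i - sum_l b_l z_l||^2
    subject to  sum_i a_i - sum_l b_l = 1,  0 <= a_i <= C,  0 <= b_l <= B  (linear kernel)."""
    n1, n2 = len(X_own), len(X_other)
    Z = np.vstack([X_own, -X_other])          # signed stack: center a_k = Z.T @ u at the optimum
    diag = np.concatenate([np.sum(X_own**2, axis=1), -np.sum(X_other**2, axis=1)])

    def neg_dual(u):
        center = Z.T @ u                       # sum_i a_i x_i - sum_l b_l z_l
        return -(diag @ u - center @ center)   # negate because SciPy minimizes

    cons = {"type": "eq", "fun": lambda u: np.sum(u[:n1]) - np.sum(u[n1:]) - 1.0}
    bounds = [(0.0, C)] * n1 + [(0.0, B)] * n2
    u0 = np.concatenate([np.full(n1, 1.0 / n1), np.zeros(n2)])  # feasible starting point
    res = minimize(neg_dual, u0, method="SLSQP", bounds=bounds, constraints=cons)
    alpha, beta = res.x[:n1], res.x[n1:]
    center = X_own.T @ alpha - X_other.T @ beta
    # Radius from an unbounded support vector of the own class (0 < alpha_i < C).
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    radius = np.linalg.norm(X_own[sv[0]] - center) if len(sv) else np.nan
    return alpha, beta, center, radius

alpha, beta, center, radius = solve_sphere(X_own, X_other, C, B)
print("center:", center, "radius:", radius)

For a positive semidefinite kernel the dual is concave, so a general-purpose solver such as SLSQP suffices for small illustrations; in practice a dedicated quadratic programming routine would typically be used.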

Appendix B: Derivation of Equation (13)

Starting with (11),

$$
\begin{aligned}
\hat{\boldsymbol{\alpha}} &= \arg\max_{\boldsymbol{\alpha}}\,\ln\big[p(\boldsymbol{\alpha}\mid D)\big] \\
&= \arg\max_{\boldsymbol{\alpha}}\,\ln\big[p(D\mid\boldsymbol{\alpha})\,p(\boldsymbol{\alpha})\big] \\
&= \arg\max_{\boldsymbol{\alpha}}\,\Big\{\ln\big[p(D\mid\boldsymbol{\alpha})\big]+\ln\big[p(\boldsymbol{\alpha})\big]\Big\}.
\end{aligned}
\tag{B.1}
$$

The prior term in (B.1) can be written as

$$
p(\boldsymbol{\alpha})=\Bigg[\prod_{i=1}^{n}p\big(\boldsymbol{\alpha}^{(i)}\big)\Bigg]
\Bigg[\prod_{i=1}^{n}\prod_{\substack{j=1 \\ j\neq i}}^{n}p\big(\boldsymbol{\beta}^{(i,j)}\big)\Bigg]
\propto
\exp\!\Bigg(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\big\|\boldsymbol{\alpha}^{(i)}-\mathbf{m}^{(i)}\big\|^{2}\Bigg)
\exp\!\Bigg(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j\neq i}}^{n}\big\|\boldsymbol{\beta}^{(i,j)}-\mathbf{m}^{(i,j)}\big\|^{2}\Bigg),
$$

so that, up to an additive constant that does not depend on $\boldsymbol{\alpha}$,

$$
\ln\big[p(\boldsymbol{\alpha})\big]=
-\frac{1}{2\sigma^{2}}\Bigg[\sum_{i=1}^{n}\Big(\boldsymbol{\alpha}^{(i)\top}\boldsymbol{\alpha}^{(i)}
-2\,\boldsymbol{\alpha}^{(i)\top}\mathbf{m}^{(i)}+\mathbf{m}^{(i)\top}\mathbf{m}^{(i)}\Big)
+\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j\neq i}}^{n}\Big(\boldsymbol{\beta}^{(i,j)\top}\boldsymbol{\beta}^{(i,j)}
-2\,\boldsymbol{\beta}^{(i,j)\top}\mathbf{m}^{(i,j)}+\mathbf{m}^{(i,j)\top}\mathbf{m}^{(i,j)}\Big)\Bigg].
$$


Substituting the expressions for $\ln[p(D\mid\boldsymbol{\alpha})]$ and $\ln[p(\boldsymbol{\alpha})]$ into (B.1) and dropping the terms that do not depend on $\boldsymbol{\alpha}$, Equation (13) is obtained:

$$
\hat{\boldsymbol{\alpha}}=\arg\max_{\boldsymbol{\alpha}}
\Bigg\{\ln\big[p(D\mid\boldsymbol{\alpha})\big]
-\frac{1}{2\sigma^{2}}\Bigg[\sum_{i=1}^{n}\Big(\boldsymbol{\alpha}^{(i)\top}\boldsymbol{\alpha}^{(i)}
-2\,\boldsymbol{\alpha}^{(i)\top}\mathbf{m}^{(i)}\Big)
+\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j\neq i}}^{n}\Big(\boldsymbol{\beta}^{(i,j)\top}\boldsymbol{\beta}^{(i,j)}
-2\,\boldsymbol{\beta}^{(i,j)\top}\mathbf{m}^{(i,j)}\Big)\Bigg]\Bigg\},
$$

where $\ln[p(D\mid\boldsymbol{\alpha})]$ is the log-likelihood term expressed through the kernel matrices $\mathbf{K}^{(k)}$ and $\mathbf{K}^{(m,k)}$ as in the main text.
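To make the structure of (B.1) concrete, the fragment below sketches a numerical MAP estimate obtained by maximizing a log-likelihood plus the Gaussian log-prior derived above. It is only a schematic under stated assumptions: log_likelihood is a hypothetical placeholder standing in for $\ln p(D\mid\boldsymbol{\alpha})$, and the prior means, variance, and box bounds are illustrative values rather than those used in the paper.

# Sketch of the MAP step in (B.1): maximize ln p(D|alpha) + ln p(alpha).
# `log_likelihood` is a hypothetical placeholder for the GSVDD likelihood term;
# only the Gaussian log-prior penalty follows directly from Appendix B.
import numpy as np
from scipy.optimize import minimize

def log_prior(u, m, sigma2):
    """Gaussian log-prior up to a constant: -||u - m||^2 / (2 sigma^2)."""
    diff = u - m
    return -0.5 * (diff @ diff) / sigma2

def log_likelihood(u):                   # placeholder: replace with ln p(D|alpha)
    return -np.sum((u - 0.3) ** 4)       # any smooth surrogate bounded above

m = np.full(10, 0.1)                     # stacked prior means m^(i), m^(i,j) (assumed)
sigma2 = 0.5                             # prior variance (assumed)

neg_posterior = lambda u: -(log_likelihood(u) + log_prior(u, m, sigma2))
res = minimize(neg_posterior, x0=m.copy(), method="L-BFGS-B",
               bounds=[(0.0, 1.0)] * len(m))   # keep dual variables in their boxes
alpha_map = res.x
print("MAP estimate:", alpha_map)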

BIOGRAPHIES

Mehmet Turkoz is currently an Assistant Professor of Professional Practice in the Department of Management Science and Information Systems, Rutgers University, USA. He received his PhD from the Department of Industrial and Systems Engineering, Rutgers University, USA, in 2018, and an MS degree in Operations Research from Rutgers University, USA, in 2012. His research areas include data mining, machine learning, process modeling and monitoring, and operations research.

Sangahn Kim is currently an Assistant Professor in the Department of Business Analytics and Actuarial Science, Siena College, USA. He received his Ph.D. from the Department of Industrial and Systems Engineering, Rutgers University, USA. He received the Richard A. Freund International Scholarship from the American Society for Quality (ASQ) in 2016, and he also won the Best PhD Student Award and the Tayfur Altiok Memorial Scholarship from Rutgers University. His research interests include data analytics, process modeling and monitoring, stochastic processes, reliability engineering, statistical learning, and machine learning.

Youngdoo Son is an Assistant Professor in the Department of Industrial and Systems Engineering, Dongguk University (Seoul campus), Seoul, South Korea. He received a B.S. in Physics and an M.S. in Industrial and Management Engineering from Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2010 and 2012, respectively, and his Ph.D. in Industrial Engineering from Seoul National University, South Korea, in 2015. His research interests include machine learning, neural networks, Bayesian methods, and their industrial and business applications.

Myong K. (MK) Jeong is a Professor in the Department of Industrial and Systems Engineering and RUTCOR (Rutgers Center for Operations Research), Rutgers University, New Brunswick, New Jersey. His research interests include machine learning, data mining, stochastic processes, sensor data analysis, statistical learning, and big data. He received the prestigious Richard A. Freund International Scholarship from ASQ and the National Science Foundation (NSF) CAREER Award in 2002 and 2007, respectively. His research has been funded by the NSF, the United States Department of Agriculture (USDA), the National Transportation Research Center, Inc. (NTRCI), and industry. He has been a consultant for Samsung Electronics, Intel, ETRI, and other companies, and has published more than 90 refereed journal articles. He has served as an Associate Editor of several journals, including the IEEE Transactions on Automation Science and Engineering, the International Journal of Advanced Manufacturing Technology, and the International Journal of Quality, Statistics and Reliability.

Elsayed A. Elsayed is a Distinguished Professor in the Department of Industrial and Systems Engineering, Rutgers, The State University of New Jersey. His research interests are in the areas of quality and reliability engineering. He is the author of Reliability Engineering, John Wiley & Sons, 2012, and the author and coauthor of work published in IIE Transactions, IEEE Transactions, and the International Journal of Production Research. His research has been funded by the DoD, FAA, NSF, and industry. Dr. Elsayed has been a consultant for the DoD, AT&T Bell Laboratories, Ingersoll-Rand, Johnson & Johnson, Personal Products, AT&T Communications, Ethicon, and other companies. Dr. Elsayed was the Editor-in-Chief of IIE Transactions and the Editor of IIE Transactions on Quality and Reliability Engineering. He is also an Editor for the International Journal of Reliability, Quality and Safety Engineering.
