Pattern Recognition Letters 100 (2017) 117–123
Transfer learning for one class SVM adaptation to limited data distribution change

Yongjian Xue∗, Pierre Beauseroy
Institut Charles Delaunay (ICD)/LM2S - CNRS, Université de Champagne, Université de Technologie de Troyes, 12, rue Marie Curie, CS42060, Troyes, 10000, France
∗ Corresponding author. E-mail address: [email protected] (Y. Xue).
Article info
Article history: Received 22 February 2017; Available online 20 October 2017.
Keywords: Transfer learning; Fault detection; Distribution change; Kernel adaptation; One class classification.
Abstract
Data based one class classification rules are widely used in system monitoring. Due to maintenance, for example, we may come across a change of the data distribution with respect to the training data. While lacking representative samples of the new data set, one can try to adapt the formerly learned detection rule to the new data set instead of retraining a new rule, which would require gathering a significant amount of data. Based on the above, a multi-task learning detection rule approach is proposed to deal with the training of the updated system as new data become available. The key feature of the new approach is the introduction of a parameter that controls how much we rely on the former model. This parameter has to be set and changed as the amount of new data coming from the system increases. We define the new detection model as a classical one class SVM with a specific kernel matrix that depends on the introduced parameter. A kernel adaptation method for the C-one class SVM is developed in order to get the solution path along that parameter, and a criterion is established to select a good value. Experiments conducted on toy data and on a real data set show that the proposed method adapts to the data change and gives a good transition from the old detection rule to the new one, which is built on the new data set only once the number of samples gathered from it is large enough.
1. Introduction

Training a one class classification rule from data can be cast in a kind of unsupervised or semi-supervised learning framework, which differs from binary or multi-class classification. The decision rule aims to detect the samples that do not resemble the majority of the original samples. Such rules are often applied to novelty or outlier detection. Compared to two class classification, according to [5], such an approach is used when there are no or few representative samples of the negative class in the training procedure, which is very frequent in many monitoring applications. Based on support vector machines (SVM) and using only normal samples, two typical methodologies have been proposed as one class support vector machines (OCSVM). The first one, developed by Tax and Duin [9] and named support vector domain description (SVDD), aims to find a hypersphere of minimal volume that encloses the data samples in feature space; the amount of data within the hypersphere is tuned by a parameter C. The second one, introduced by Schölkopf et al. [8], is known as the ν one class support vector machine (ν-OCSVM). It finds an optimal hyperplane
in feature space that separates a selected proportion of the data samples from the origin; the selection parameter is ν. These two approaches were proved by Chang and Lin [1] to lead to the same solution, under the condition that the same Gaussian kernel is used and that a suitable connection is made between the parameters ν and C.
In real applications, however, we may come across a few problems with the traditional one class methods. One of them is to cope with new features added in a new detection task; this issue was studied in our previous work [11]. Another problem is that the data seen by the detection system can experience a change due to system operations such as maintenance, for example the update of a sensor of the detection system. In such a situation, we would like to reduce the performance loss and avoid waiting a long time until enough new data samples are available to train a trustworthy model. So we want to take advantage of the former detection system and of a limited amount of new data to train a new detection system.
In order to solve the problems mentioned above, multi-task learning seems to be an ideal means. It relies on the idea that related tasks share some useful information, such as a common structure, similar model parameters or common representative features. Previous related research shows that learning multiple related tasks simultaneously leads to better performance than learning
them independently [3,4,12]. For example, Yang et al. [12] proposed a multi-task learning framework for the ν-SVM by upper-bounding the L2 difference between each pair of parameters in order to obtain similar solutions for related tasks. Later, He et al. [4] proposed another multi-task learning method for the one-class SVM under the assumption that the related tasks' models or model parameters are close to a certain mean function, motivated by Evgeniou and Pontil [3]. Following that work, in [11] we proposed a multi-task learning model to deal with one class classification problems with additional new features. Multi-task learning is usually not meant to transfer from a source task to a target task, but rather to learn a common representation for multiple tasks and to share knowledge among them. Different from the usual multi-task learning setting, in this paper we only care about the performance of the new detection system under a limited data distribution change, assuming that the former system and the new one share the same feature space. Using that approach, we want to control the drift from the former task to the new one as data are gathered. To implement this, we introduce a modified multi-task learning model that makes use of a new parameter μ to balance the weight of the training data coming from the former system and those coming from the new one. It turns out that, for a given value of μ, the solution of that problem can be obtained using the classical one class SVM technique, except with a different kernel matrix. The new kernel is parameterized by μ. Searching for a good solution over the whole μ domain would be computationally too demanding, so a new kernel adaptation approach, following Le and Beauseroy [6], is proposed for the C-one class SVM; it enables all the solutions for any value of μ to be computed at once. A criterion to select μ is then proposed to find a reasonable solution for a given number of samples from the new system.
The paper is organised as follows. Section 2 describes the details of the proposed multi-task learning method, which solves the problem of one class classification adaptation to a data set change. The details of the kernel adaptation method for the C-one class SVM are developed in Section 3. In Section 4 we introduce and justify the criterion used to select the value of μ, which tunes the compromise between the common part and the specific part of the former and new data. Experiments on toy data and on the wine quality data are conducted to test the performance of the method in Section 5. The concluding section summarizes the main novelty of the paper and discusses further work.
2. Multi-task learning for one class SVM

In this section, we first introduce the proposed multi-task learning model and then we apply this model to the transfer learning problem when the data distribution experiences a change.

2.1. Proposed multi-task learning method

Consider the case of T learning tasks in the same space X, where X ⊆ R^p. For each task t, we have n_t samples X_t = {x_1t, x_2t, ..., x_{n_t}t}. Intuitively, we may either solve the problem as T independent separated tasks or treat them together as one single learning task. Inspired by references [3] and [4], the idea here is to balance between these two extreme cases by introducing a parameter μ. The decision function for each task t is:

f_t(x) = sign(⟨w_t, φ(x)⟩ − 1)   (1)

where w_t is the normal vector and φ(x) is the non-linear feature mapping. In the chosen multi-task learning approach, the vector w_t of each task is divided into two parts: a common mean vector w_0 shared among all the learning tasks and a specific vector v_t for each task:

w_t = μ w_0 + (1 − μ) v_t   (2)

where μ ∈ [0, 1]. When μ = 0, w_t = v_t, which corresponds to T separated learning tasks; when μ = 1, w_t = w_0, which corresponds to one single global task.

2.1.1. Primal problem

Based on this setting, the primal one class problem could be formulated as:

min_{w_0, v_t, ξ_it}  (1/2) μ ‖w_0‖² + (1/2) (1 − μ) Σ_{t=1}^{T} ‖v_t‖² + C Σ_{t=1}^{T} Σ_{i=1}^{n_t} ξ_it   (3)

for t ∈ {1, 2, ..., T} and i ∈ {1, 2, ..., n_t} in each task, subject to the constraints:

⟨μ w_0 + (1 − μ) v_t, φ(x_it)⟩ ≥ 1 − ξ_it,   ξ_it ≥ 0   (4)

where ξ_it is the slack variable associated with each sample and C is the penalty parameter.

2.1.2. Dual problem

Introducing the Lagrange multipliers α_it, β_it ≥ 0, the Lagrangian of this problem could be expressed as:

L(w_0, v_t, ξ_it, α_it, β_it) = (1/2) μ ‖w_0‖² + (1/2) (1 − μ) Σ_{t=1}^{T} ‖v_t‖² + C Σ_{t=1}^{T} Σ_{i=1}^{n_t} ξ_it − Σ_{t=1}^{T} Σ_{i=1}^{n_t} α_it (⟨μ w_0 + (1 − μ) v_t, φ(x_it)⟩ − 1 + ξ_it) − Σ_{t=1}^{T} Σ_{i=1}^{n_t} β_it ξ_it   (5)

Setting the partial derivatives of the Lagrangian to zero leads to the following relations:

w_0 = Σ_{t=1}^{T} Σ_{i=1}^{n_t} α_it φ(x_it)   (6)

v_t = Σ_{i=1}^{n_t} α_it φ(x_it)   (7)

α_it ∈ [0, C]   (8)

Substituting (6), (7) and (8) into (5), the Lagrangian dual form could be given as:

max_α  −(1/2) α^T K^μ α + α^T 1,   s.t.  0 ≤ α ≤ C 1   (9)

where α^T = [α_11, ..., α_{n_1 1}, ..., α_1T, ..., α_{n_T T}], and K^μ is a block matrix with T × T blocks corresponding to all task pairs. Let K^μ_rt denote the block corresponding to tasks r and t, which is defined as:

K^μ_rt = (μ + (1 − μ) δ_rt) ⟨φ(X_t), φ(X_r)⟩   (10)

where δ_rt is the Kronecker delta:

δ_rt = 1 if r = t, and δ_rt = 0 if r ≠ t   (11)

Notice that (9) can be solved by the classical one class SVM with the specific modified Gram matrix K^μ. The decision function for task t is then:

f_t(x) = sign( Σ_{r=1}^{T} Σ_{j=1}^{n_r} α_jr (μ + (1 − μ) δ_rt) ⟨φ(x_jr), φ(x)⟩ − 1 )   (12)
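To make the construction of (9) and (10) concrete, the following minimal sketch assembles the block Gram matrix K^μ for two tasks and feeds it to an off-the-shelf one class SVM through a precomputed kernel. The helper names are ours, and scikit-learn's OneClassSVM (a ν parameterised solver) is used only as a stand-in for the C-one class SVM discussed in the paper, relying on the ν/C correspondence recalled in Section 5 from Lee and Scott [7]; this is an illustrative assumption, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import OneClassSVM  # assumption: any one class solver accepting a precomputed kernel

def gaussian_gram(Xa, Xb, sigma):
    """Gaussian kernel matrix k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def k_mu(X_tasks, mu, sigma):
    """Block kernel of Eq. (10): K^mu_rt = (mu + (1 - mu) * delta_rt) * <phi(X_t), phi(X_r)>."""
    T = len(X_tasks)
    blocks = [[(mu + (1.0 - mu) * float(r == t)) * gaussian_gram(X_tasks[r], X_tasks[t], sigma)
               for t in range(T)] for r in range(T)]
    return np.block(blocks)

# toy usage: two related tasks sharing the same feature space
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(400, 2)), rng.normal(0.3, 1.0, size=(50, 2))
K = k_mu([X1, X2], mu=0.7, sigma=1.26)
ocsvm = OneClassSVM(kernel="precomputed", nu=0.1).fit(K)  # solves a dual of the form (9) on K^mu
```

The point of the sketch is simply that, once K^μ is formed, any standard one class SVM solver can be reused unchanged; only the Gram matrix carries the coupling between tasks.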
2.2. Application to transfer learning

Here we consider the source task T1 with data set X1, and the target task T2 related to the changed data set X2 (t ∈ {1, 2}). According to (12), the decision function f1(x) corresponding to T1 is:

f1(x) = sign(g1(x) − 1)   (13)

where the SVM function g1(x) is defined as:

g1(x) = α^T [ k(X1, x) ; μ k(X2, x) ]   (14)

where k(·,·) is the kernel function, k(xi, xj) = ⟨φ(xi), φ(xj)⟩ (we use a Gaussian kernel in this paper), and k(Xt, x) denotes the vector of kernel evaluations between the samples of Xt and x. If the test data x come from T2, the decision function f2(x) corresponding to T2 is:

f2(x) = sign(g2(x) − 1)   (15)

where the SVM function g2(x) is defined as:

g2(x) = α^T [ μ k(X1, x) ; k(X2, x) ]   (16)

3. Kernel adaptation for all the solutions along μ

To solve problem (9), the parameter μ must be chosen in order to get a kernel matrix. However, we do not know in advance which μ suits the current situation best. Moreover, it is time consuming to explore the whole solution space by solving the one class SVM model for a large set of possible values of μ. One appealing approach is to determine the entire solution path of the Lagrange multipliers for μ going from 0 to 1, and then to pick the desired solution. A kernel adaptive one class ν-SVM method was proposed in [6] to deduce one solution from another when the kernel changes to a kernel that is not too different. Following similar ideas, a C version of the kernel adaptive one class SVM is developed here, with an improved method for the detection of the breakpoints.

The aim of the kernel adaptation method is to find the path of the Lagrange multipliers α^μ along the parameter μ, that is, the solution path from K^0 to K^1. Let:

K^μ = (1 − μ) K^0 + μ K^1   (17)

Consider any value of μ; f^μ(x) is the corresponding decision function, and the following KKT conditions must be satisfied:

f^μ(xi) < 0 ⇒ α_i^μ = C,
f^μ(xi) > 0 ⇒ α_i^μ = 0,
f^μ(xi) = 0 ⇒ α_i^μ ∈ [0, C].

Based on the above conditions, 3 groups of Lagrange multipliers M, E and C can be defined:

• M^μ = {i : f^μ(xi) = 0, α_i^μ ∈ [0, C]} for samples on the margin,
• E^μ = {i : f^μ(xi) < 0, α_i^μ = C} for samples outside the decision region,
• C^μ = {i : f^μ(xi) > 0, α_i^μ = 0} for samples inside the decision region.

Assuming a solution α^{μ−}, M^{μ−}, E^{μ−} and C^{μ−} is known for a given parameter μ−, we want to find another solution α^μ where μ is close enough to μ− so that the group sets do not change. That means:

α^μ_{E^{μ−}} = α^{μ−}_{E^{μ−}} = C 1   (18)

and

α^μ_{C^{μ−}} = α^{μ−}_{C^{μ−}} = 0   (19)

while α^μ_{M^{μ−}} changes with μ such that all xi ∈ M^{μ−} satisfy:

f^μ(xi) = f^{μ−}(xi) = 0   (20)

From Eqs. (18), (19) and (20), we can get:

α^μ_{M^{μ−}} = [K^μ_{(M^{μ−},M^{μ−})}]^{−1} [ K^{μ−}_{(M^{μ−},M^{μ−})} α^{μ−}_{M^{μ−}} + C (μ − μ−) ΔK_{(M^{μ−},E^{μ−})} 1 ]   (21)

where K^μ_{(M^{μ−},M^{μ−})} is the kernel matrix with entries (l, j) of K^μ such that the indices l ∈ M^{μ−} and j ∈ M^{μ−}, and

ΔK = K^0 − K^1   (22)

The following events affect the composition of the 3 groups M, E, C as μ increases from one value μ− to another:

1. xi leaves the border M to E: α_i^{μ−} ∈ [0, C] → α_i^μ = C;
2. xi changes from M to C: α_i^{μ−} ∈ [0, C] → α_i^μ = 0;
3. xi leaves C to join M: f^{μ−}(xi) > 0 → f^μ(xi) = 0;
4. xi leaves the outlier set E to join the border M: f^{μ−}(xi) < 0 → f^μ(xi) = 0.

Considering that we study the value of α^μ around a chosen value μc (μc ≥ μ−), let:

μ = μc + Δμc   (23)

What we want to find is Δμc such that events 1 to 4 happen. According to (17), (22) and (23):

K^μ = K^{μc} − Δμc ΔK   (24)

For events 1 and 2, we just need to monitor the evolution of α_i^μ, which may either reach 0 or C. From (21), all the terms are known except the inverse of K^μ_{(M^{μ−},M^{μ−})}. The 2nd order Taylor expansion of the inverse is:

(X + Y)^{−1} = X^{−1} − X^{−1} Y X^{−1} + X^{−1} Y X^{−1} Y X^{−1} + ε   (25)

where ε is the error term. Let Z^μ = K^μ_{(M^{μ−},M^{μ−})}, and take X = Z^{μc} and Y = −Δμc ΔK_{(M^{μ−},M^{μ−})}. Then, according to (25), the inverse of K^μ_{(M^{μ−},M^{μ−})} can be written as:

[Z^μ]^{−1} = [Z^{μc}]^{−1} + Δμc [Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})} [Z^{μc}]^{−1} + Δμc² ([Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})})² [Z^{μc}]^{−1} + ε   (26)

Neglecting ε, an approximation of Eq. (21) can be written as:

α̃^μ_{M^{μ−}} = C0 + Δμc C1 + Δμc² C2   (27)

where:

C0 = [Z^{μc}]^{−1} Z^{μ−} α^{μ−}_{M^{μ−}} + C (μc − μ−) [Z^{μc}]^{−1} ΔK_{(M^{μ−},E^{μ−})} 1   (28)

C1 = [Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})} [Z^{μc}]^{−1} Z^{μ−} α^{μ−}_{M^{μ−}} + C (μc − μ−) [Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})} [Z^{μc}]^{−1} ΔK_{(M^{μ−},E^{μ−})} 1 + C [Z^{μc}]^{−1} ΔK_{(M^{μ−},E^{μ−})} 1   (29)

C2 = C [Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})} [Z^{μc}]^{−1} ΔK_{(M^{μ−},E^{μ−})} 1 + ([Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})})² [Z^{μc}]^{−1} Z^{μ−} α^{μ−}_{M^{μ−}} + C (μc − μ−) ([Z^{μc}]^{−1} ΔK_{(M^{μ−},M^{μ−})})² [Z^{μc}]^{−1} ΔK_{(M^{μ−},E^{μ−})} 1   (30)

Events 1 and 2 can therefore be approximately detected by solving a 2nd order polynomial based on (27).

For samples xi ∈ E^{μ−} ∪ C^{μ−}, the values of μ such that f^μ(xi) = 0 need to be detected:

K^μ_{(E^{μ−}∪C^{μ−}, M^{μ−})} α^μ_{M^{μ−}} + C K^μ_{(E^{μ−}∪C^{μ−}, E^{μ−})} 1 − 1 = 0   (31)

Substituting the approximation (27) into (31) leads to:

D0 + Δμc D1 + Δμc² D2 = 0   (32)

where:

D0 = −1 + C K^{μc}_{(E^{μ−}∪C^{μ−}, E^{μ−})} 1 + K^{μc}_{(E^{μ−}∪C^{μ−}, M^{μ−})} C0   (33)

D1 = −C ΔK_{(E^{μ−}∪C^{μ−}, E^{μ−})} 1 − ΔK_{(E^{μ−}∪C^{μ−}, M^{μ−})} C0 + K^{μc}_{(E^{μ−}∪C^{μ−}, M^{μ−})} C1   (34)

D2 = −ΔK_{(E^{μ−}∪C^{μ−}, M^{μ−})} C1 + K^{μc}_{(E^{μ−}∪C^{μ−}, M^{μ−})} C2   (35)

Again, a 2nd order polynomial can be solved to find the breakpoints corresponding to events 3 and 4.

The process of the kernel adaptation for the C-one class SVM is shown in Algorithm 1.

Algorithm 1 Kernel adaptation for C-one class SVM.
 1: Initialisation: let μ− = μc = 0, then compute α^{μ−}, f^{μ−}, M, E, C and ΔK.
 2: while μ < 1 do
 3:   compute min Δμc such that events 1, 2, 3, 4 happen,
 4:   then μc = μ− + min Δμc.
 5:   while convergence = False do
 6:     compute min Δμc such that events 1, 2, 3, 4 happen,
 7:     then μ = μc + min Δμc.
 8:     if |μ − μc| < ε then
 9:       convergence = True
10:     end if
11:     μc = μ
12:   end while
13:   μ− = μ
14:   update α^{μ−}, f^{μ−}, M, E, C and ΔK.
15: end while

We set μ− = μc = 0 for the initial point, and α^{μ−}, f^{μ−}, M, E, C and ΔK are then obtained accordingly. For the next breakpoint, we first compute μ = μ− + min Δμc, which is a first approximation of the next breakpoint, where min Δμc is the minimum value of Δμc such that events 1, 2, 3, 4 happen according to (27) and (32). Next, in order to find a more precise value of the breakpoint, which is near μc, we compute the minimum Δμc such that events 1, 2, 3, 4 happen based on μc and μ− according to (27) and (32), then let μ = μc + min Δμc and μc = μ. We iterate these steps until the change of μ is smaller than a given threshold, which means that the next breakpoint has been found within a given error. Finally we set μ− = μ, and update α^{μ−}, f^{μ−}, M, E, C and ΔK accordingly. These steps are repeated until μ = 1. Then, for a given value of μ, we can find the two nearest breakpoints μL and μR such that μL ≤ μ ≤ μR, and α^μ is obtained by (21) with μ− = μL.
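As an illustration of the closed-form step (21) that underlies the path following, the sketch below recomputes the margin multipliers at a nearby μ from a known solution at μ−, assuming the index sets do not change between the two values. The function name and the choice of an exact linear solve (instead of the Taylor approximation (25) to (27)) are ours.

```python
import numpy as np

def alpha_on_margin(K0, K1, mu_minus, mu, alpha_minus, M, E, C_pen):
    """Closed-form update of Eq. (21) for the margin multipliers, valid as long as the
    index sets M (margin) and E (outliers, alpha = C) do not change between mu_minus and mu."""
    K_mu = (1.0 - mu) * K0 + mu * K1                    # Eq. (17)
    K_mu_minus = (1.0 - mu_minus) * K0 + mu_minus * K1
    dK = K0 - K1                                        # Eq. (22)
    rhs = (K_mu_minus[np.ix_(M, M)] @ alpha_minus[M]
           + C_pen * (mu - mu_minus) * dK[np.ix_(M, E)] @ np.ones(len(E)))
    alpha = np.zeros(K0.shape[0])                       # inside-set samples stay at 0, Eq. (19)
    alpha[E] = C_pen                                    # Eq. (18)
    alpha[M] = np.linalg.solve(K_mu[np.ix_(M, M)], rhs)  # Eq. (21)
    return alpha
```

In the actual algorithm the inverse is expanded as in (26), so that the candidate breakpoints of events 1 to 4 are obtained as roots of the quadratics (27) and (32) rather than by repeated linear solves.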
Fig. 1. View of the banana data; the black dots and the blue plus signs correspond to the source and target samples.
4. Criteria for choosing the optimal solution

Once all the solutions corresponding to μ are available, one needs to choose an optimal value to build the decision function, based only on the training data. A good solution for the model should have the following properties:
1. it should adapt to the change of the decision function due to the data distribution change;
2. at the same time, it should keep the false alarm rate as close as possible to the desired one.
Based on these properties, we first select values of μ corresponding to a detection rule that makes use of the new samples in its definition. That means one would like a large number of margin support vectors belonging to X2, so we consider candidate values of μ that define solutions with a good proportion of margin support vectors in X2. In practice, however, selecting the largest number of margin support vectors may lead to overfitting, so we use the Nth largest number of margin support vectors instead, noted max(#SV2)Nth. A second rule is to keep the probability of false alarm as close as possible to the desired value. In terms of the empirical estimation of the false alarm rate, we want the decision boundary of the multi-task model of T2 to enclose a given proportion of samples from X2, noted min |#{f2(X2, μ) > 0} − n2(1 − p)|; for example, we set p = 0.1, which means we want to keep the proportion of outliers for X2 close to 0.1. Besides that, from the multi-task learning point of view, among the remaining possible values of μ we want to select the one which preserves the detection for the initial task T1 (solution when μ = 0), noted min A(μ), where:
A(μ) = ‖g1(X1, μ) − g1(X1, μ = 0)‖   (36)
The proposed criteria for choosing μ are summarized in Algorithm 2.

Algorithm 2 Choosing the optimal μ∗.
1: Choose a list L1 of μa, s.t. #SV2(μa) ≥ max(#SV2)Nth, where max(#SV2)Nth is the Nth largest value of #SV2.
2: Choose a list L2 of μb from L1, s.t. μb = arg min_{μ∈L1} |#{f2(X2, μ) > 0} − n2(1 − p)|.
3: Choose μ∗ from L2, s.t. μ∗ = arg min_{μ∈L2} A(μ).
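A compact sketch of Algorithm 2 follows, assuming the quantities #SV2, #{f2(X2, μ) > 0} and A(μ) of Eq. (36) have already been evaluated on a grid of μ values taken from the solution path; the helper name, the grid-based evaluation and the default N = 5 are illustrative assumptions.

```python
import numpy as np

def choose_mu(mu_grid, n_sv2, n_in2, A, n2, p=0.1, N=5):
    """Algorithm 2 sketch.
    n_sv2[k] : number of margin support vectors belonging to X2 at mu_grid[k]
    n_in2[k] : #{f2(X2, mu) > 0}, samples of X2 enclosed by the decision boundary
    A[k]     : ||g1(X1, mu) - g1(X1, 0)||, Eq. (36)"""
    mu_grid, n_sv2, n_in2, A = map(np.asarray, (mu_grid, n_sv2, n_in2, A))
    # step 1: keep candidates with at least the N-th largest #SV2
    L1 = np.where(n_sv2 >= np.sort(n_sv2)[-N])[0]
    # step 2: among L1, keep those whose enclosed count is closest to n2 * (1 - p)
    gap = np.abs(n_in2[L1] - n2 * (1.0 - p))
    L2 = L1[gap == gap.min()]
    # step 3: among L2, pick the mu that best preserves the initial task
    return mu_grid[L2[np.argmin(A[L2])]]
```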
Fig. 2. Solution path for n1 = 400 and n2 = 50.
5. Experiments
5.1. Toy data

The proposed test data set is named the banana data set; it is defined as:

x(1) = 2 sin θ + N(0, 0.25)   (37)
x(2) = 2 cos θ + N(0, 0.25)   (38)

where θ follows the uniform distribution U(π/8, 11π/8) and N(0, 0.25) is a Gaussian distribution. X1 = (x(1), x(2)) ∈ R² is the source data set, and the target data set X2 follows the same relationship but with a rotation and a translation. A two dimensional view of 400 source samples and 1000 target samples with a π/12 rotation and a (−0.3, 0.5) translation is shown in Fig. 1 (a small generation sketch is given below, after the list of compared methods).

We use n1 = 400 source samples, and we add n2 target samples (from 10 to 1000). For testing, 10,000 target data samples are generated as positive samples, and 10,000 negative samples following a uniform distribution that covers the range of the test data set are also generated to estimate the probability of missed alarm. Results are averaged over 10 repetitions. Since we have enough training data samples in the source task, the Gaussian kernel parameter σ and the constraint parameter C can be tuned based on the source training samples. According to Lee and Scott [7], if C = 1/(nνρ), the same decision function is obtained for the C-SVM and the ν-SVM. Besides that, the performance of the ν-SVM is relatively stable as the number of samples increases if the same value of ν is kept, so we use the ν-SVM for the initialisation of the kernel adaptation method each time. Here we choose σ = 1.26 and ν = 0.1, which makes the false alarm rate around 0.1 in the source task.

With the kernel adaptation algorithm, all the path solutions can be obtained while μ varies from 0 to 1. For example, Fig. 2 shows the solution path of α with n1 = 400 and n2 = 50. After that, the search strategy is used to choose a good value for μ. In order to see whether the selected μ∗ is a proper one, we minimize a criterion Gref(μ) to define a reference value μref. We solve:

min_{μ∈[0,1]} Gref(μ) = ‖g2(x, μ) − g2,ref(x, μ = 0)‖   (39)

where g2,ref(x, μ = 0) = (1/10) Σ_{i=1}^{10} αi^T k(X2, x) is a reference border defined by averaging 10 realisations of the SVM function for training task two alone, with 400 samples for each realisation. So we want to find μref such that the defined decision function is as close as possible to the reference one. The results of the multi-task learning method, noted MTL(T2, μ∗), are compared with:
• the independent training on X1 (noted MTL(T1, μ = 0)),
• the method combining X1 and X2 together as a single big task (TBig),
• the method training on the target data X2 alone (T2).
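For reference, here is a short sketch that generates the banana data of Eqs. (37) and (38) and the rotated and translated target set shown in Fig. 1; the function name is ours, and N(0, 0.25) is read here as a Gaussian with variance 0.25, which is an assumption about the notation.

```python
import numpy as np

def banana(n, rotation=0.0, translation=(0.0, 0.0), seed=None):
    """Samples (x1, x2) = (2 sin(theta), 2 cos(theta)) + N(0, 0.25), theta ~ U(pi/8, 11*pi/8),
    optionally rotated and translated to play the role of the target distribution X2."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(np.pi / 8, 11 * np.pi / 8, size=n)
    X = np.stack([2 * np.sin(theta), 2 * np.cos(theta)], axis=1)
    X += rng.normal(0.0, np.sqrt(0.25), size=X.shape)   # additive Gaussian noise, variance 0.25
    c, s = np.cos(rotation), np.sin(rotation)
    X = X @ np.array([[c, -s], [s, c]]).T + np.asarray(translation)
    return X

X1 = banana(400, seed=0)                                               # source task samples
X2 = banana(1000, rotation=np.pi / 12, translation=(-0.3, 0.5), seed=1)  # shifted target task
```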
Fig. 3. Decision boundaries for different methods as n2 increases.
Fig. 4. Results on banana data set: (a) selected μ∗, (b) false alarm rate, (c) miss alarm rate.
Fig. 5. Results on wine quality data set: (a) selected μ∗, (b) false alarm rate, (c) miss alarm rate.
Fig. 3 shows the boundaries of the different decision functions as n2 increases from 10 to 100. The blue one (T2(n2 = 400)) is the boundary of the targeted decision function (trained with X2 alone and n2 = 400). When n2 is small, the decision boundaries of T2 are very sensitive to the parameter C; here we draw several boundaries of T2 with C ∈ [0.02, 1], going from small values yielding exclusive boundaries to large values yielding boundaries that include all of the data set X2. When n2 = 10, the decision boundary of MTL(T2) is almost the same as that of TBig, because in that case μ∗ = 0.99 is chosen according to the criteria. As n2 increases, μ∗ tends to decrease slightly, which is coherent with the increase of the information weight given to X2 and the decrease of that given to X1. MTL(T2) detects the boundaries of the new data set for all values of n2. In contrast, MTL(T1) with μ = 0 is the same as the model based on X1 only, and it cannot detect the new boundaries of the data set. The TBig method tends to enclose all the data of X1 and X2, which increases the probability of missed alarm. The T2 method with varying C tends to cover all of the data set X2 without taking into account the known structure or the information of the former data set when n2 is small; moreover, it is then hard to tune the parameter C for T2, while the proposed approach defines a good boundary.
Fig. 4(a) shows a comparison of μ∗ and μref as n2 changes. They clearly have the same trend as n2 increases from 10 to 1000, and the corresponding false alarm rate (Fig. 4(b)) and missed alarm rate (Fig. 4(c)) of MTL(T2, μ∗) are close to those of
MTL(T2, μref). If we use the old detection system (MTL(T2, μ = 0), i.e. independent training based on X1), it cannot follow the distribution change. When n2 ≥ 300, the two types of error of MTL(T2, μ∗) and MTL(T2, μref) are close to those of T2, which means that when n2 is large, enough representative samples have been introduced and a good decision boundary can be obtained from the new data set only. From the above analysis, we can say that the method for choosing μ is reasonable. Compared to TBig and to T2 alone, MTL(T2, μ∗) gives a good transition from the old detection system to the new one when n2 changes from 10 to 300. When n2 ≥ 300, the selected μ∗ is very small and little information is needed from the former task, which means we can abandon the old system and just use the new one instead.

5.2. Wine quality data set

Experiments are also conducted on a real data set: the wine quality data set of Cortez et al. [2]. We consider the data set restricted to red wine as X1, and the added samples, the white wine data set, as X2. We choose the 6 most important features of the red wine to train the SVM model, namely sulphates, pH, total sulfur dioxide, alcohol, volatile acidity and free sulfur dioxide; their distributions differ for white wine, which simulates a detection system facing a changed data distribution. For the purpose of testing the two types of error, we use the wines classified by wine experts as 3 or 4 as negative samples and those classified as 6, 7 or 8 as positive samples. We set the parameter σ = 1.75, and C is chosen by fixing ν = 0.1 for every
initialisation of the kernel adaptation algorithm. The indices of the data set are shuffled, then we use the whole red wine data set for T1, and n2 goes from 10 to 1000; the remaining samples are held out for testing. Experiments are repeated until all the white wine data have taken part in the training set, and the results are averaged.
Fig. 5(a) shows the μ∗ selected by the proposed criteria (Algorithm 2). It decreases from a large value to a small one as n2 increases from 10 to 1000, which means that the multi-task model changes from the one big task model TBig to a much more independent model, and that the information needed from the former task decreases as n2 increases. From Fig. 5(b) and (c), we can see that if we use the old detection system MTL(T2, μ = 0), we get a large proportion of false alarms. If we use TBig, we have a low false alarm rate but we increase the missed alarm rate too much, which means we do not detect the change of the data set. If we use T2, we have a lower missed alarm rate but we increase the false alarm rate too much, which means the detection is too restrictive. The MTL(T2, μ∗) method reduces the missed alarm rate compared to TBig without increasing the false alarm rate too much compared to that of T2 when n2 is small. Finally, the two types of error of MTL(T2, μ∗) come close to those of T2 when many more representative samples of X2 are available. So, when we do not have enough samples of the new task T2, we can still obtain a smooth and acceptable transition from the former detection system to the new one.

6. Conclusion

In this paper, we proposed a multi-task learning model to solve the one class classification problem under a limited data distribution change. In this model, a parameter μ is introduced to control the amount of information that is taken into account from the former task T1. With the kernel adaptation method for the C-one class SVM, we can get the whole solution path along μ, and a criterion is then proposed to choose the proper solution μ∗. The experiments conducted on toy data and on the wine quality data set show that the proposed method adapts to the change of the decision function of the data set and gives a good transition from the former task, with data set X1, to the new task based only on the new data set X2
as n2 increases gradually. The method can be used to manage a smooth transition between the two tasks and to define the new detection rule while keeping the benefit of the existing detection system, so as to limit the performance loss during the transition. For large-scale problems, one could reduce the dimension of the former solution with an approximation approach as in [10] before transferring the detection to the new data; however, in large scale problems data gathering is not usually a critical issue, and the proposed approach is most useful when the sampling rate is slow.

References

[1] C.-C. Chang, C.-J. Lin, Training ν-support vector classifiers: theory and algorithms, Neural Comput. 13 (9) (2001) 2119–2147.
[2] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decis. Support Syst. 47 (4) (2009) 547–553.
[3] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 109–117.
[4] X. He, G. Mourot, D. Maquin, J. Ragot, P. Beauseroy, A. Smolarz, E. Grall-Maës, Multi-task learning with one-class SVM, Neurocomputing 133 (2014) 416–426.
[5] S.S. Khan, M.G. Madden, A survey of recent trends in one class classification, in: Artificial Intelligence and Cognitive Science, Springer, 2009, pp. 188–197.
[6] V. Le, P. Beauseroy, Path for kernel adaptive one-class support vector machine, in: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), IEEE, 2015, pp. 503–508.
[7] G. Lee, C.D. Scott, The one class support vector machine solution path, in: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, IEEE, 2007, pp. II-521.
[8] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) 1443–1471.
[9] D.M. Tax, R.P. Duin, Support vector domain description, Pattern Recognit. Lett. 20 (11) (1999) 1191–1199.
[10] T. Wang, J. Chen, Y. Zhou, H. Snoussi, Online least squares one-class support vector machines-based abnormal visual event detection, Sensors 13 (12) (2013) 17130–17155.
[11] Y. Xue, P. Beauseroy, Multi-task learning for one-class SVM with additional new features, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 1571–1576.
[12] H. Yang, I. King, M.R. Lyu, Multi-task learning for one-class classification, in: The 2010 International Joint Conference on Neural Networks (IJCNN), IEEE, 2010, pp. 1–8.