Knowledge-Based Systems (2016) 1–16
Learning with label proportions based on nonparallel support vector machines
Zhensong Chen a,c, Zhiquan Qi b,c,∗, Bo Wang b,c, Limeng Cui c,d, Fan Meng a,c, Yong Shi b,c,∗
a School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
b Research Center on Fictitious Economy and Data Science, CAS, Beijing 100190, China
c Key Laboratory of Big Data Mining and Knowledge Management, CAS, Beijing 100190, China
d School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
Article info
Article history: Received 1 April 2016; Revised 1 December 2016; Accepted 3 December 2016; Available online xxx.
Keywords: Learning with label proportion; Nonparallel SVM; Proportion-NPSVM; Large-margin
Abstract: Learning a classifier from groups of unlabeled data, knowing only, for each group, the proportions of data with particular labels, is an important branch of classification tasks that arises in many practical applications. In this paper, we propose a novel solution for the problem of learning with label proportions (LLP) based on nonparallel support vector machines, termed proportion-NPSVM, which learns a pair of nonparallel classification hyperplanes. A unique property of our method is that it only needs to solve a pair of smaller quadratic programming problems. Moreover, it can efficiently incorporate the known group label proportions and the latent unknown observation labels into one optimization model under a large-margin framework. Compared to existing approaches, it has several advantages: 1) it does not need to make restrictive assumptions on the training data; 2) nonparallel classifiers can be obtained without computing large inverse matrices; 3) the optimization model can be effectively solved by an alternating strategy with the SMO technique or the SOR method; 4) proportion-NPSVM has better generalization ability. Extensive experimental results on both binary-class and multi-class data sets show the effectiveness of our proposed method in classification accuracy, demonstrating state-of-the-art performance for LLP problems compared with competing algorithms. © 2016 Published by Elsevier B.V.
1. Introduction
The problem of learning with label proportions (LLP) has been intensively studied in recent years. In this learning setting, the training instances are groups of unlabeled observations, and only the proportion of observations with a particular label is known for each group. The task is to learn a mapping from individual observations to their labels. Fig. 1 shows the differences between LLP and other learning problems. Solutions of LLP have real-world applications; in fact, many modern intelligent applications can be abstracted as LLP problems. For example: 1) Political elections [1]. In a political election, voters can be divided into always-favorable voters and swing voters. Politicians want to use their limited resources to achieve the largest gains, so they want to identify which category each voter belongs to according to
∗ Corresponding authors at: Research Center on Fictitious Economy and Data Science, CAS, Beijing 100190, China. E-mail addresses: [email protected] (Z. Qi), [email protected] (Y. Shi).
the voter profiles and the proportions of favorable voters directly revealed by previous elections; 2) Spam filtering [2]. In spam filtering, the emails consist of almost pure spam (listed in the spam box) and a mix of spam and non-spam (shown in the inboxes). We would like to improve the estimation of spam based on the proportions of spam and non-spam in a user's inbox, which are much cheaper to obtain than the actual labels; 3) Quality control [3]. Suppose there is a steel factory in which charges of steel sticks are processed sequentially at several production stations. We would like to assess the quality of each stick before it reaches the final production stage, according to information related to charges of sticks (not each single stick) collected during processing. Since sticks that do not reach the desired quality can be locked out, this helps to save resources; and there are many more examples [4–7]. Here, to give a clear understanding of LLP problems, let us consider an example of image classification, which is a basic component of large-scale image annotation and image retrieval. As is known, image classification refers to the procedure of classifying images into different categories based on their contained
Fig. 1. Different types of learning problems (colors denote the class labels). (a) supervised learning: all the training instances are labeled; (b) semi-supervised learning: parts of the training instances are labeled and the others are unlabeled; (c) unsupervised learning: all the training instances are unlabeled and (d) learning with label proportions: only knowing the proportion of instances with a particular label for each group. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
objects. Traditionally, supervised learning algorithms, such as support vector machines (SVMs) [8–10], random forests [11], twin support vector machines (TWSVMs) [12–14], spatial pyramid matching (SPM) and its variants (e.g., LrrSPM) [15], and other methods [16,17], are the preferred approaches for learning classifiers from labeled training samples. However, these supervised learning processes need a large amount of labeled training data, which is quite inefficient and labor-intensive to obtain, especially in the big data era with the rapid growth of data generation. Compared with strong supervision, in which all the training instances are accurately labeled, label proportions can be obtained accurately and efficiently. Therefore, this classification problem can be solved by transforming it into an LLP process, and Fig. 2 gives an illustration of this procedure. As shown there, an unlabeled image data set is given and then split into K disjoint groups. The proportions of images belonging to different categories are the only information provided for each group. The target is to estimate a classifier that can predict the class label for each individual image, including both observed samples and new ones. To address this learning scenario, in this paper we contribute a large-margin approach for LLP problems based on the nonparallel support vector machine, termed proportion-NPSVM, which efficiently models the unknown labels as latent variables by incorporating label proportions into the classification algorithm. Compared to the existing methods, which include MeanMap [2],
Fig. 2. An example of the LLP process for image classification. The original image data set is given without any labels. Then, these image samples are divided into K disjoint groups. Only the proportions of images with a particular label are known for each group. The target is to estimate a classifier to predict the class label for each individual image.
Inverse Calibration [18], ∝SVM [19] and Twin alter-∝SVM [20], proportion-NPSVM has some unique properties:
– It does not need to make restrictive assumptions on the training data.
– It only needs to solve two smaller problems.
– It does not need to compute large inverse matrices before training.
– It achieves nonparallel classification hyperplanes.
– It can be solved by an alternating strategy with the successive overrelaxation (SOR) method or the sequential minimal optimization (SMO) technique.
To the best of our knowledge, no other method exists yet which shares all of these properties. In summary, by incorporating all these properties, the advantages of our method are: 1) better generalization ability compared with ∝SVM [19], which makes proportion-NPSVM achieve better classification accuracy; 2) it is faster than Twin alter-∝SVM [20], since our method does not need to compute large inverse matrices before training and can be solved by the alternating strategy with the SOR method or the SMO technique. The main contributions of this paper are summarized as follows: 1) we develop two algorithms based on the Nonparallel Support Vector Machine (NPSVM) to learn SVM classifiers for LLP problems, which share all the properties mentioned above; 2) we introduce a novel solving strategy from the perspective of hyperplane clustering (Algorithm 2), which is highly competitive with the solving tactic (Algorithm 1) usually used in other label-proportion learning methods based on SVMs; 3) we enrich and improve the framework for the application of SVMs to LLP problems.
The rest of this paper is structured as follows. In Section 2, we review the existing works for LLP problems proposed in recent years and the nonparallel SVM model for classification tasks, respectively. The proportion-NPSVM algorithm and its solving strategy are presented in Section 3, and a discussion is given in Section 4. Finally, experimental results and concluding remarks are presented in Section 5 and Section 6, respectively.
2. Preliminaries
During the past decade, a number of approaches for the LLP problem have been developed. In the following, we first review existing methods such as MeanMap, inverse calibration (InvCal) and proportion-SVM (∝SVM). Then, we introduce the nonparallel support vector machine (NPSVM), which was developed for binary classification tasks.
2.1. Related work
The LLP problem has only recently attracted great attention in the fields of machine learning and computer vision. Chen et al. [21] introduced the problem of learning from aggregate views, and Musicant et al. [22] formally described this new learning problem for classification and regression tasks; they then applied k-nearest neighbors, neural networks, and SVMs to solve it. Inspired by the multiple-instance (MI) learning process, Kuck and de Freitas [23] designed a principled probabilistic model to learn the relationship between the properties of individual instances and their binary labels. Quadrianto et al. [2] proposed the MeanMap approach, which can reconstruct the correct labels with high probability in a uniform convergence sense; however, this algorithm relies on the restrictive assumption that the instances are conditionally independent given the label. Rüping [18] presented a large-margin approach which learns a classifier from group probabilities based on support vector regression (SVR) and inverse classifier calibration. This method has the basic assumption that the mean instance of each bag should have a soft label corresponding to the label proportion. Stolpe et al. [3] defined the problem of LLP and contributed a solution based on clustering with label proportions. Hernández [1] proposed learning Bayesian network classifiers, based on the structural expectation maximization (EM) strategy, to handle LLP classification tasks. In addition, for the semi-supervised learning scenario, Mann [24] and Bellare et al. [25] proposed an effective and easy-to-implement method that matches the given proportions by using an expectation regularization term to encourage model predictions on the unlabeled data. Similarly, Gillenwater et al. [26] explored a generalized regularization method, and Li et al. [27] developed Semi-Supervised Support Vector Machines (S3VMs), which are typically used to directly estimate the labels of the unlabeled instances, by incorporating the label means of the unlabeled instances. Following this line of work, a recent paper [19] proposed a method called proportion-SVM (∝SVM), which explicitly incorporates the known group label proportions and the latent unknown instance labels into the same model under a large-margin framework.
2.2. Nonparallel SVM algorithm
In this section, we present the nonparallel support vector machine model introduced in [28]. Consider the binary classification task with the training set

T = {(x1, +1), . . . , (xp, +1), (x_{p+1}, −1), . . . , (x_{p+q}, −1)},    (1)

where xi ∈ R^n, i = 1, . . . , p + q. To solve this problem, many approaches have been proposed, among which SVMs and their variants are the most popular and powerful tools. In particular, Tian et al. [28] proposed the nonparallel support vector machine (NPSVM), built on the twin support vector machine (TWSVM) [29] and the twin bounded support vector machine (TBSVM) [30], which has achieved great performance. This NPSVM algorithm inherits the essence of the SVMs and has several advantages compared with TWSVM and TBSVM. For the linear case, NPSVM seeks a pair of nonparallel hyperplanes

(w+ · x) + b+ = 0   and   (w− · x) + b− = 0,    (2)

such that each hyperplane is proximal to the data points of one class and as far as possible from the other, where w+ ∈ R^n, w− ∈ R^n, b+ ∈ R, b− ∈ R are obtained by solving the following pair of quadratic programming problems (QPPs),
min_{w+, b+, η+, ξ−}   (1/2) c3 (||w+||^2 + b+^2) + (1/2) η+^T η+ + c1 e−^T ξ−
s.t.   A w+ + e+ b+ = η+,
       −(B w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,    (3)

and

min_{w−, b−, η−, ξ+}   (1/2) c4 (||w−||^2 + b−^2) + (1/2) η−^T η− + c2 e+^T ξ+
s.t.   B w− + e− b− = η−,
       (A w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,    (4)

in which A = (x1, . . . , xp)^T ∈ R^{p×n} and B = (x_{p+1}, . . . , x_{p+q})^T ∈ R^{q×n} denote the positive and negative data matrices respectively, ci (i = 1, 2, 3, 4) are penalty parameters, e+ and e− are vectors of ones of appropriate dimensions, ξ+ and ξ− are slack vectors of appropriate dimensions, and η+ and η− are vectors of appropriate dimensions.
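For concreteness, the following is a minimal sketch of how problem (3) can be set up and solved with a generic quadratic programming solver (cvxopt is our choice here, not the paper's; the paper solves these problems through their duals with SOR or SMO). The variable η+ is eliminated via the equality constraint, and all function and variable names are ours.

```python
import numpy as np
from cvxopt import matrix, solvers

def npsvm_plus_plane(A, B, c1=1.0, c3=1.0):
    """Sketch of problem (3), the '+' hyperplane of linear NPSVM.
    A: (p, n) positive-class matrix, B: (q, n) negative-class matrix.
    eta_+ is eliminated through eta_+ = A w + e b; the QP variable is
    z = [w, b, xi]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    p, n = A.shape
    q_neg = B.shape[0]
    m = n + 1 + q_neg
    M = np.hstack([A, np.ones((p, 1))])             # [A, e_+]
    P = np.zeros((m, m))
    P[:n + 1, :n + 1] = M.T @ M + c3 * np.eye(n + 1)
    qv = np.hstack([np.zeros(n + 1), c1 * np.ones(q_neg)])
    # constraints: (B w + e b) - xi <= -e   and   -xi <= 0
    G = np.zeros((2 * q_neg, m))
    G[:q_neg, :n] = B
    G[:q_neg, n] = 1.0
    G[:q_neg, n + 1:] = -np.eye(q_neg)
    G[q_neg:, n + 1:] = -np.eye(q_neg)
    h = np.hstack([-np.ones(q_neg), np.zeros(q_neg)])
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(qv.reshape(-1, 1)),
                     matrix(G), matrix(h.reshape(-1, 1)))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]                              # w_+, b_+
```

Problem (4) is handled symmetrically by exchanging the roles of A and B (and of c1, c3 with c2, c4).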
Fig. 3. The illustration of standard SVM and linear NPSVM on two-dimensional data. (a) The standard SVM searches for two parallel classification hyperplanes with the maximal width between them. (b) The NPSVM seeks two nonparallel proximal hyperplanes, each of which is closer to one class and as far as possible from the other.

The corresponding Wolfe dual problems of (3) and (4) are

max_{λ, α}   −(1/2) (λ^T α^T) Q̃ (λ^T α^T)^T + c3 e−^T α
s.t.   0 ≤ α ≤ c1 e−,    (5)

and

max_{θ, γ}   −(1/2) (θ^T γ^T) Q (θ^T γ^T)^T + c4 e+^T γ
s.t.   0 ≤ γ ≤ c2 e+,    (6)

respectively, where

Q̃ = [ AA^T + c3 I    AB^T
       BA^T           BB^T ] + E,
Q = [ BB^T + c4 I    BA^T
      AB^T           AA^T ] + E,    (7)

I is the identity matrix of appropriate dimension, and E is the l × l matrix with all entries equal to 1. After solving the dual problems (5) and (6), the solutions of the primal problems (3) and (4) can be obtained by

w+ = −(A^T λ* + B^T α*),   b+ = −(e+^T λ* + e−^T α*),    (8)

and

w− = −(B^T θ* + A^T γ*),   b− = −(e−^T θ* + e+^T γ*).    (9)

Thus an unknown point x ∈ R^n is assigned to a class through the following rule:

Class = arg min_{k = −, +} |(wk · x) + bk|,    (10)

where | · | denotes the perpendicular distance of point x from the planes (wk · x) + bk = 0, k = −, +. Fig. 3 shows an illustration of the difference between the standard SVM and linear NPSVM on two-dimensional data.

3. The proportion-NPSVM algorithm
According to the descriptions above, we propose a novel classification method for LLP problems, termed proportion-NPSVM, which learns a pair of nonparallel classification hyperplanes. In this section, we first formulate the proportion-NPSVM framework and then introduce the optimization strategies for solving this classification model.

3.1. The proportion-NPSVM framework
Let us consider the following binary learning problem, where the training data set {xi}_{i=1}^N is given in the form of K disjoint bags:

{xi | i ∈ Bk}_{k=1}^K,   ∪_{k=1}^K Bk = {1, . . . , N},   Bk ∩ Bl = ∅ (∀k ≠ l).    (11)

For each bag, only the proportion of instances belonging to a particular class is known. Let pk denote the label proportion of the positive class in the k-th bag; we have

pk = |{i | i ∈ Bk, y*_i = 1}| / |Bk|,   ∀k ∈ {1, . . . , K},    (12)

where y*_i ∈ {−1, +1} is the unknown ground-truth label of xi. This is the mathematical description of LLP, and the task is to learn a model to predict the class labels of the individual instances. To address this classification problem, many approaches have been proposed, as stated in Section 2.1, among which ∝SVM outperforms the others in experimental results. According to [19], the ∝SVM model is formulated as follows:

min_{y, w, b}   (1/2) w^T w + C Σ_{i=1}^N L(yi, w^T φ(xi)) + Cp Σ_{k=1}^K Lp(p̃k(y), pk)
s.t.   ∀_{i=1}^N,   yi (w^T · xi) ≥ 1,   yi ∈ {−1, 1},    (13)

in which L( · ) ≥ 0 is the loss function for traditional supervised learning and Lp( · ) ≥ 0 is a function to penalize the difference between the approximately calculated label proportions p̃k(y) and the real label proportions. Furthermore, in order to provide theoretical support, a general framework called Empirical Proportion Risk Minimization (EPRM) was also introduced in [4]. As elaborated in that work, EPRM selects the instance label hypothesis h ∈ H that minimizes the empirical bag proportion loss on the training set S. It can be expressed as follows:

arg min_{h ∈ H}   Σ_{(x̄, f(ȳ)) ∈ S} L(φ_r^f(h)(x̄), f(ȳ)).    (14)

Here, L is a loss function to compute the error between the predicted proportion φ_r^f(h)(x̄) (see Definition 1) and the given proportion f(ȳ). To prove whether the instance labels can be learned by EPRM, that work bounded the generalization error of bag proportions based on the empirical bag proportions and showed that the sample complexity of learning bag proportions is only mildly sensitive to the bag size. It then showed that an instance hypothesis h that achieves a low error on bag proportions with high probability is guaranteed to achieve a low error on instance labels with high probability.

Definition 1. For h ∈ H, define an operator that predicts the bag proportion from the instances, φ_r^f(h) : X^r → R, φ_r^f(h)(x̄) := f(h(x1), . . . , h(xr)), x̄ ∈ X^r, x̄ = (x1, . . . , xr). The hypothesis class on the bags is therefore φ_r^f(H) := {φ_r^f(h) | h ∈ H}.

Even though the validity of the ∝SVM model has been proved, some issues still exist. As shown above, the ∝SVM model needs to solve one large QPP, which has all data points in its constraints. However, in practice, there are often situations in which patterns belonging to one particular class have a more significant influence on the classification task. Traditionally, these biased classification problems can be solved by fuzzy SVMs [31,32], where patterns of the more important class are assigned higher membership values, or by TWSVMs [12–14,29], which handle them by solving only two smaller-sized QPPs. For example, [20] proposed the Twin alter-∝SVM based on the linear TWSVM to solve the LLP problem. Yet, these approaches still have several drawbacks: 1) fuzzy SVMs still have to solve a large QPP, which is quite time-consuming in practice; 2) TWSVMs need to compute inverses of matrices, which is intractable or even impossible for large-scale data sets in real applications; 3) the
primal problems of TWSVMs only consider the empirical risk and do not enjoy the significant advantage of SVMs obtained by implementing the structural risk minimization principle; 4) TWSVMs lose sparseness because they use two loss functions for each class, with the result that all the points from one class and some points from the other contribute to each final decision function (called semi-sparseness in [28]).
Based on the elaboration above, we formulate the proportion-NPSVM model for LLP problems under the large-margin framework, which takes advantage of the nonparallel SVM. The formulations of proportion-NPSVM are as follows:
min_{y, w+, b+, η+, ξ−}   (1/2) c3 (||w+||^2 + b+^2) + (1/2) η+^T η+ + c1 e−^T ξ− + (cp/2) e^T Lp(y)
s.t.   A w+ + e+ b+ = η+,
       −(B w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (15)

and

min_{y, w−, b−, η−, ξ+}   (1/2) c4 (||w−||^2 + b−^2) + (1/2) η−^T η− + c2 e+^T ξ+ + (cp/2) e^T Lp(y)
s.t.   B w− + e− b− = η−,
       (A w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (16)

where e+, e− and e are vectors of ones of appropriate dimension, ξ+ and ξ− are slack vectors of appropriate dimension, and Lp( · ) ≥ 0 is a function to penalize the difference between the true label proportions and the estimated label proportions based on the selection of y ∈ Y, where Y = {y | yi ∈ {−1, +1}}. The aim is to optimize the model parameters w+, w−, b+, b− and the labels y simultaneously. Since kernel functions can be directly applied in the proportion-NPSVM problems, the model is easily extended to the nonlinear case. Introducing the kernel function K(x, x') = (φ(x) · φ(x')), the corresponding transformation is

x̃ = φ(x),    (17)

where x̃ ∈ H and H is the Hilbert space. Then, the corresponding two primal problems in the Hilbert space H can be obtained as

min_{y, w+, b+, η+, ξ−}   (1/2) c3 (||w+||^2 + b+^2) + (1/2) η+^T η+ + c1 e−^T ξ− + (cp/2) e^T Lp(y)
s.t.   Φ(A) w+ + e+ b+ = η+,
       −(Φ(B) w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (18)

and

min_{y, w−, b−, η−, ξ+}   (1/2) c4 (||w−||^2 + b−^2) + (1/2) η−^T η− + c2 e+^T ξ+ + (cp/2) e^T Lp(y)
s.t.   Φ(B) w− + e− b− = η−,
       (Φ(A) w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (19)

in which e+, e− and e are vectors of ones of appropriate dimension, ξ+ and ξ− are slack vectors of appropriate dimension, and Lp( · ) ≥ 0 is the penalty function for the label proportions. Obviously, the linear proportion-NPSVM problems (15) and (16) are recovered from the nonlinear cases (18) and (19) when the linear kernel is applied.

3.2. How to solve proportion-NPSVM
As shown above, the proportion-NPSVM formulation is fairly straightforward and intuitive. However, it is a non-convex integer programming problem, which is NP-hard. Therefore, one key issue of this paper is to find effective strategies to solve these optimization problems. In this paper, we provide two kinds of solving strategies to optimize our classification model.
Firstly, in proportion-NPSVM, similarly to the rules taken in [19,20], the unknown labels y can be seen as a bridge connecting the hinge loss and the label proportion loss. Hence, the first alternating optimization method, shown as follows, can be applied to solve the proportion-NPSVM model.
• For a fixed y, the optimizations (18) and (19) become a primal NPSVM problem. It can be easily solved by the SOR method or the SMO technique.
• When w+, w−, b+, b− are solved, a pair of nonparallel classification hyperplanes can be obtained as (w+ · φ(x)) + b+ = 0 and (w− · φ(x)) + b− = 0.
• After w+, b+ and w−, b− are fixed, the problem can be transformed as follows (see details in the supplementary materials):

min_y   Σ_{i=1}^N LF(yi, w+^T φ(xi) + b+, w−^T φ(xi) + b−) + (cp/c) Σ_{k=1}^K Lp(y)
s.t.   ∀_{i=1}^N,   yi ∈ {−1, +1},    (20)

where LF = max(1 − yi · Fi, 0) and Fi = fi− − fi+ = (w−^T φ(xi) + b−) − (w+^T φ(xi) + b+) for each point xi (i = 1, 2, . . . , N). Because each bag {yi | i ∈ Bk}, ∀_{k=1}^K, is independent, we can efficiently solve the optimization problem (20) on each individual bag separately. Specifically, the optimization problem on each bag can be presented as follows:

min_{yi | i ∈ Bk}   Σ_{i ∈ Bk} LF(yi, w+^T φ(xi) + b+, w−^T φ(xi) + b−) + (cp/c) Lp(y)
s.t.   ∀i ∈ Bk,   yi ∈ {−1, +1}.    (21)
To find the solution of problem (21), we adopt the following optimization strategy:
• Initialize yi = −1, ∀i ∈ Bk.
• Let δi denote the reduction of the first term in (21) obtained by flipping the sign of yi, ∀i ∈ Bk.
• Sort δi, ∀i ∈ Bk, and then choose the top-M yi, which have the highest reduction, to flip their signs.
• For each bag Bk, we only need to sort the δi, ∀i ∈ Bk, once. Then, we can incrementally flip the signs and pick the solution with the smallest objective value, which can be seen as the optimal solution of (21).
The solution of the proportion-NPSVM model can be obtained by alternately solving problems (18), (19) and (20) through the strategies introduced above. The details of this solving approach for proportion-NPSVM can be found in Algorithm 1.
Proposition 1. For a fixed p̃k(y), problem (21) can be optimized by applying the solving strategy mentioned above.
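A minimal sketch of this per-bag update (the first strategy) is given below; it assumes the absolute proportion loss |p̃k(y) − pk| used later in the implementation, and the decision values Fi = f−(xi) − f+(xi) are taken as given. Names are ours, not from the paper's code.

```python
import numpy as np

def update_bag_labels(F, p_k, cp_over_c):
    """Per-bag label update for problem (21), following the flipping strategy
    above. F[i] = f_-(x_i) - f_+(x_i) for the i-th point of the bag, and the
    proportion loss is the absolute loss |p_tilde(y) - p_k|."""
    r = len(F)
    hinge_neg = np.maximum(1.0 + F, 0.0)      # L_F when y_i = -1
    hinge_pos = np.maximum(1.0 - F, 0.0)      # L_F when y_i = +1
    delta = hinge_neg - hinge_pos             # reduction gained by flipping to +1
    order = np.argsort(-delta)                # sort once, largest reduction first
    y = -np.ones(r)
    best_y, best_obj = y.copy(), np.inf
    for m in range(r + 1):                    # flip the top-m signs incrementally
        if m > 0:
            y[order[m - 1]] = 1.0
        obj = (hinge_pos[y > 0].sum() + hinge_neg[y < 0].sum()
               + cp_over_c * abs(m / r - p_k))
        if obj < best_obj:
            best_obj, best_y = obj, y.copy()
    return best_y
```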
Algorithm 1 proportion-NPSVM1.
Input: Randomly initialized labels yi ∈ {−1, +1} (∀i ∈ {1, . . . , N}), parameters ci (i = 1, 2, 3, 4) and cp; c1* = 10^{-5} c1 and c2* = 10^{-5} c2.
while c1* < c1 or c2* < c2 do
    c1* = min{(1 + Δ) c1*, c1},   c2* = min{(1 + Δ) c2*, c2}
    repeat
        • Fix y and optimize problems (18) and (19) to obtain (w+, w−, b+, b−);
        • Fix w+, w− and b+, b− and solve problem (20) to get y, using the strategies introduced above;
    until the decrease of the objective function value is smaller than a threshold (e.g. ε = 10^{-4}).
end while

Algorithm 2 proportion-NPSVM2.
Input: Randomly initialized labels yi ∈ {−1, +1} (∀i ∈ {1, . . . , N}), label proportions pk (k ∈ {1, . . . , K}), parameters ci (i = 1, 2, 3, 4) and cp; c1* = 10^{-5} c1 and c2* = 10^{-5} c2.
while c1* < c1 or c2* < c2 do
    c1* = min{(1 + Δ) c1*, c1},   c2* = min{(1 + Δ) c2*, c2}
    repeat
        • Fix y and optimize problems (18) and (19) to obtain (w+, w−, b+, b−);
        • Fix w+, w−, b+, b− and renovate the label y under the proportion constraint through the tactics presented before;
    until the decrease of the objective function value is smaller than a threshold (e.g. ε = 10^{-4}).
end while
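The outer annealing loop of Algorithms 1 and 2 can be written compactly as a generator over the (c1*, c2*) schedule; this is a sketch of that bookkeeping only, with the inner alternating optimization of (18)/(19) and the label update assumed to be run for each yielded pair.

```python
def annealing_schedule(c1, c2, delta=0.5):
    """Penalty annealing of Algorithms 1 and 2: c1* and c2* start at 1e-5 of
    their target values and grow by a factor (1 + delta) per outer loop."""
    c1_s, c2_s = 1e-5 * c1, 1e-5 * c2
    while c1_s < c1 or c2_s < c2:
        c1_s = min((1.0 + delta) * c1_s, c1)
        c2_s = min((1.0 + delta) * c2_s, c2)
        yield c1_s, c2_s

# For each yielded (c1*, c2*) pair, the inner loop alternates between solving
# problems (18)-(19) with y fixed and updating y, until the objective decrease
# falls below the threshold (1e-4 in the paper).
print(len(list(annealing_schedule(10.0, 10.0))))  # number of outer iterations
```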
Secondly, the label proportions pk can be seen as a strong constraint during the solving process of the proportion-NPSVM model, which means the label proportions are kept fixed in each iterative loop. The details of this solution method are as follows.
• Randomly initialize y; then the optimizations (18) and (19) become a primal NPSVM problem. Similarly, the SOR method or the SMO technique can be used to solve it effectively.
• When w+, w−, b+, b− are solved and fixed, the primal optimization problems can be transformed into Eq. (22) with the constraint that the label proportion pk is constant:

min_y   Σ_{i=1}^N LF(yi, w+^T φ(xi) + b+, w−^T φ(xi) + b−)
s.t.   ∀_{i=1}^N,   yi ∈ {−1, +1}.    (22)

• Since each bag is independent, we can renovate the initialized labels y on each individual bag Bk by adopting the following tactics:
  – Assume that all the points in bag Bk belong to the negative class, namely yi = −1 for ∀i ∈ Bk; then compute the loss function values based on LF.
  – Suppose instead that the labels of the points in bag Bk are +1, and obtain another set of loss function values through LF.
  – Let δi indicate the difference between these two losses. Then sort δi, ∀i ∈ Bk, and choose the appropriate number of points with large δi to be positive, according to the given label proportion pk.
• For each bag Bk, we only need to sort the δi, ∀i ∈ Bk, once. Then, we can correctly select the appropriate number of positive points, which can be seen as the optimal solution of (22).
The solution process of the proportion-NPSVM model can thus be divided into two alternating steps: solving the NPSVM optimization problems and renovating the initialized labels y. The details of this solving approach for proportion-NPSVM are shown in Algorithm 2.
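A minimal sketch of this proportion-constrained label renovation (the per-bag step of Algorithm 2) is shown below; rounding pk·|Bk| to the nearest integer is our choice for fixing the number of positives.

```python
import numpy as np

def renovate_bag_labels(F, p_k):
    """Per-bag label renovation of Algorithm 2: the number of positives is
    fixed by the given proportion p_k, and the points with the largest loss
    difference delta_i are labelled positive."""
    r = len(F)
    delta = np.maximum(1.0 + F, 0.0) - np.maximum(1.0 - F, 0.0)
    n_pos = int(round(p_k * r))     # positives dictated by the bag proportion
    order = np.argsort(-delta)      # sort the delta_i once per bag
    y = -np.ones(r)
    y[order[:n_pos]] = 1.0
    return y
```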
As elaborated above, we provide two types of solving strategies for the proposed proportion-NPSVM algorithm. Since the objective function is monotone non-increasing and lower bounded, its convergence can be guaranteed (this will be discussed in the next section). Besides, in practice, the algorithm procedure is terminated when the objective function value is no longer declining (or when the reduction of the objective is smaller than a threshold). Furthermore, in order to alleviate the problem of local solutions, similarly to T-SVM [33] and ∝SVM [19], the newly proposed proportion-NPSVM algorithm (see Algorithms 1 and 2) also takes an additional annealing loop to gradually increase c1* and c2*. In the implementation of proportion-NPSVM, we set the annealing increment Δ = 0.5 and take Lp(·) to be the absolute loss: Lp(y) = Lp(p̃k(y), pk) = |p̃k(y) − pk|. In practice, we run proportion-NPSVM several times with randomly initialized y and take the solution with the smallest objective value as the final result of our algorithm.

4. Discussion
To further elaborate the novel algorithm, in this section we first discuss the relationship between proportion-NPSVM and ∝SVM. Then, an analysis of the validity of our model is introduced in the following part.

4.1. Relationship between proportion-NPSVM and ∝SVM
According to the statements above, proportion-NPSVM aims at generating two nonparallel classification hyperplanes, while ∝SVM searches for parallel ones. Actually, ∝SVM is a special case of proportion-NPSVM. Let us combine problems (18) and (19) into the following problem:

min   (1/2) c3 (||w+||^2 + b+^2) + (1/2) c4 (||w−||^2 + b−^2) + (1/2)(η+^T η+ + η−^T η−) + c1 e−^T ξ− + c2 e+^T ξ+ + cp e^T Lp(y)
s.t.   Φ(A) w+ + e+ b+ = η+,   Φ(B) w− + e− b− = η−,
       −(Φ(B) w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       (Φ(A) w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1}.    (23)

Obviously, the solutions of problems (15) and (16) satisfy the above problem for the given training data sets and parameters. Furthermore, the third term of problem (23) can be ignored if we take ci (i = 1, 2, 3, 4) to be large enough. Then it degenerates to

min   (1/2) c3 (||w+||^2 + b+^2) + (1/2) c4 (||w−||^2 + b−^2) + c1 e−^T ξ− + c2 e+^T ξ+ + cp e^T Lp(y)
s.t.   −(Φ(B) w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       (Φ(A) w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1}.    (24)

If we want to get a solution satisfying w+ = w− and b+ = b−, we only need to solve the following special case of problem (23), i.e.,

min   (1/2) (||w||^2 + b^2) + C (e−^T ξ− + e+^T ξ+) + Cp e^T Lp(y)
s.t.   −(Φ(B) w + e− b) + ξ− ≥ e−,   ξ− ≥ 0,
       (Φ(A) w + e+ b) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (25)
which is obviously the ∝SVM model. In summary, ∝SVM with a parallel classification hyperplane is a special case of our proportion-NPSVM, and proportion-NPSVM is more flexible than ∝SVM and has better generalization ability.

Table 1. Data sets used for the experiments.
Data set        #Size   #Features   #Classes
Sonar           208     60          2
Heart           270     13          2
Ionosphere      351     34          2
Vote            435     16          2
Breast-cancer   683     10          2
Australian      690     14          2
Credit-a        690     15          2
Breast-w        699     9           2
Pima-Indian     768     8           2
Splice          1,000   60          2
Ala             1,605   119         2
Iris            150     4           3
Wine            178     13          3
Glass           214     9           7
Vowel           528     10          11
Dna             2,000   180         3
Satimage        4,435   36          6

Fig. 4. The artificial data set is shown in (a), while (b) presents the NPSVM result achieved with all the labels known.

4.2. Validity of proportion-NPSVM
In this section, we discuss the validity of the proposed proportion-NPSVM algorithm by applying it to an artificial classification data set. This artificial data set is generated from two Gaussians centered at (1, 4) and (4, 1) with standard deviation σ = 1, each representing a different class distribution. Fig. 4(a) shows the resulting data set after 80 instances are sampled (40 for each class); the instances are represented as points in a 2D feature space. Fig. 4(b) illustrates the classification result of the supervised NPSVM algorithm, where a pair of nonparallel classifiers is obtained. On this data set, we first describe the process of generating the classifiers for the proportion-NPSVM algorithm. Fig. 5 presents schematic diagrams of the process of our newly proposed algorithm in a way that is easy to visualize. Following the description of our algorithm, we first split these data points into different groups (shown in Fig. 5(a), where different colors denote different bags). Then, we randomly initialize the class labels, and let "." and "+" denote the negative and positive points respectively (as shown in Fig. 5(b)). After running our algorithm once, a pair of classifiers is obtained (presented in Fig. 5(c)). Based on the obtained classifiers, we can adjust the class label of each data point, and Fig. 5(d) shows the result after updating the labels. Continuing to run the proportion-NPSVM algorithm, we get the final classification result (details can be found in Fig. 5(f)). Comparing the result of our algorithm (Fig. 5(f)) with the supervised NPSVM result (Fig. 4(b)), we can see that our algorithm achieves a highly competitive classification result even though only the label proportion information is available.
Based on the illustrations above, we can clearly understand the working principle of the novel proportion-NPSVM algorithm. Next, let us consider the convergence of this algorithm, which further supports its validity. To show the convergence of its objective function in the iterative optimization procedure, we apply the proportion-NPSVM algorithm to a selected binary-class UCI data set, namely SONAR. Following the description of the proportion-NPSVM algorithm, we first fix the bag size to k points per bag and then split the data points into different groups. Running our algorithm on these grouped data points, we plot the objective function value within each outer iteration in Fig. 6(a)–(d) when using k = 2, 8, 16, 64 points in each bag, respectively. From these plots, it can be seen that the novel proportion-NPSVM method provides a satisfactory convergence rate in the optimization procedure.
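The synthetic setup described above, and the rule-(10) style prediction on it, can be sketched as follows (the bag size of 8 and the random seed are arbitrary choices of ours; the hyperplane parameters are assumed to come from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussians centred at (1, 4) and (4, 1) with sigma = 1, 40 points each,
# as in the validity experiment above; the true labels are used only to
# compute the bag proportions and to check the result.
X = np.vstack([rng.normal((1.0, 4.0), 1.0, size=(40, 2)),
               rng.normal((4.0, 1.0), 1.0, size=(40, 2))])
y_true = np.hstack([np.ones(40), -np.ones(40)])

# Split into bags (size 8 here, an arbitrary choice) and keep only proportions.
perm = rng.permutation(len(X))
bags = [perm[i:i + 8] for i in range(0, len(X), 8)]
proportions = [float((y_true[b] == 1).mean()) for b in bags]

def predict(X, w_plus, b_plus, w_minus, b_minus):
    """Rule (10), read as a perpendicular distance: assign each point to the
    class whose hyperplane is closer."""
    d_plus = np.abs(X @ w_plus + b_plus) / np.linalg.norm(w_plus)
    d_minus = np.abs(X @ w_minus + b_minus) / np.linalg.norm(w_minus)
    return np.where(d_plus <= d_minus, 1.0, -1.0)
```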
5. Experimental results
To validate the performance of proportion-NPSVM, in this section we design two sets of experiments: one shows its better accuracy on several binary-class data sets compared with state-of-the-art methods for LLP, which include MeanMap [2], Inverse Calibration (InvCal) [18] and alter proportion-SVM (alter-∝SVM) [19]; the other tests its ability on multi-class data sets. The Matlab code for alter-∝SVM, MeanMap and InvCal is provided with the work of [19].1 Our code for proportion-NPSVM is also available online: https://github.com/chenzhensong/proportion-NPSVM. The prediction performance (accuracy) of the different methods has been assessed on various data sets from the LibSVM collection2 and the UCI repository3 [34], shown in Table 1. Since this paper focuses on binary classification, we test the one-vs-rest binary classification performance for the multi-class data sets. Furthermore, in order to avoid scaling issues in the learning process, the features of each data set are scaled to [−1, +1]. To formulate the LLP classification problems, the training data is randomly partitioned into bags of fixed size σ. We tested various bag sizes σ: 2, 4, 8, 16, 32 and 64 (with the last bag smaller than σ, if necessary). Following the experimental design of [18,19], in each single experiment the accuracy has been assessed by tenfold cross-validation. Similarly, we repeat the above process ten times (negative instances are randomly selected from the multi-class data sets and the data is partitioned into disjoint bags) and report the mean accuracies. The parameters of each algorithm (both for the linear kernel and the RBF kernel) are tuned through the following rules. For MeanMap, the parameter is tuned from λ ∈ {0.1, 1, 10}. For InvCal, the parameters are tuned from Cp ∈ {0.1, 1, 10} and ε ∈ {0, 0.01, 0.1}. For alter-∝SVM, the parameters are tuned from C ∈ {0.1, 1, 10} and Cp ∈ {1, 10, 100}. For proportion-NPSVM, the parameters ci (i = 1, 2, 3, 4) are tuned for the best classification accuracy in the range 0.1 to 10, and cp ∈ {0.1, 1, 10}. In fact, there are only three parameters to be tuned in proportion-NPSVM: c1, c3 and cp in problem (18), and c2, c4 and cp in (19). Hence, the grid search method can be applied. The results of the numerical experiments are summarized in the following tables, where the best accuracy is shown in bold.
1 https://github.com/felixyu/pSVM.
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
3 http://archive.ics.uci.edu/ml/datasets.html.
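A minimal sketch of the experimental LLP protocol described above (feature scaling to [−1, +1], random partition into bags of fixed size σ, and computation of the per-bag positive proportions); function and variable names are ours.

```python
import numpy as np

def make_llp_bags(X, y, bag_size, rng):
    """Scale features to [-1, +1], randomly partition the training data into
    bags of fixed size sigma (the last bag may be smaller), and keep only the
    per-bag positive-label proportions."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = 2.0 * (X - lo) / np.maximum(hi - lo, 1e-12) - 1.0
    idx = rng.permutation(len(X))
    bags = [idx[i:i + bag_size] for i in range(0, len(X), bag_size)]
    proportions = [float((np.asarray(y)[b] == 1).mean()) for b in bags]
    return X, bags, proportions
```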
Fig. 5. The process of generating the classifiers: (a) Grouped data points (different colors denote different bags); (b) Initialized data points; (c)∼ (e) Achieving classifiers and updating data points’ labels in each iteration and (f) Final classification result. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 6. The value curve of the objective function at each iteration when using k = 2, 8, 16, 64 points in each bag respectively. From these plots, it can be seen that our novel method provides a satisfactory convergence rate in the optimization procedure.
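The grid search mentioned above, over the three effective parameters (c1 = c2, c3 = c4, cp), can be sketched as follows; the grid values {0.1, 1, 10} for the ci are our reading of "the range 0.1 to 10", and train_eval is a hypothetical callable returning the cross-validated accuracy of one parameter setting.

```python
import itertools
import numpy as np

def grid_search(train_eval, grid_c=(0.1, 1.0, 10.0), grid_cp=(0.1, 1.0, 10.0)):
    """Exhaustive search over (c1 = c2, c3 = c4, cp); train_eval is a
    hypothetical callable returning cross-validated accuracy for one setting."""
    best_params, best_acc = None, -np.inf
    for c12, c34, cp in itertools.product(grid_c, grid_c, grid_cp):
        acc = train_eval(c1=c12, c2=c12, c3=c34, c4=c34, cp=cp)
        if acc > best_acc:
            best_params, best_acc = (c12, c34, cp), acc
    return best_params, best_acc
```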
5.1. Experiments on binary-class data sets
In this section, we present the experimental results on the binary classification data sets. Table 2 shows the experimental results (classification accuracy and its standard deviation) on the binary-class scaled data sets with the linear kernel. From this table, we can see that proportion-NPSVM outperforms the MeanMap, InvCal and alter-∝SVM methods on most data sets. Specifically, proportion-NPSVM obtains the best performance for all bag sizes on 7 out of 11 data sets. On the other 4 data sets, including AUSTRALIAN, BREAST-W, BREAST-CANCER and HEART, proportion-NPSVM achieves accuracy highly competitive with alter-∝SVM, and better than the MeanMap and InvCal methods. In particular, the classification accuracy of proportion-NPSVM is higher on the first 3 of these data sets for small bag sizes and a little worse for large bag sizes. As for the HEART data set, the results of proportion-NPSVM are highly competitive with alter-∝SVM for some bag sizes and higher for the others.
In addition, to further evaluate the performance of our proposed method, we also applied the RBF kernel to proportion-NPSVM. The classification results on the binary-class data sets are shown in Table 3. As can be seen, the proportion-NPSVM algorithm obtains the best accuracy on most data sets, including
Fig. 7. The relative change of the RBF kernel for the average classification accuracy on the utilized data sets.
SONAR, HEART, VOTE, BREAST-W, PIMA-INDIAN, SPLICE and ALA. On the IONOSPHERE, BREAST-CANCER, AUSTRALIAN and CREDIT-A data sets, proportion-NPSVM achieves classification results highly competitive with alter-∝SVM. Moreover, Fig. 7 displays the average improvement in classification performance obtained by each algorithm with the RBF kernel. (Here, a value larger than zero indicates an improvement; otherwise, it is a reduction.) From Fig. 7, we can see that the improvement brought by the RBF kernel differs among the selected approaches. In detail, for
Table 2. Accuracy of different methods on binary-class data sets with the linear kernel and bag sizes 2, 4, 8, 16, 32, 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is presented in brackets.
Data set | Method | 2 | 4 | 8 | 16 | 32 | 64
Sonar
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
82.31(0.72) 71.63(2.84) 88.94(1.32) 97.59(0.82) 98.07(0.86)
82.02(0.60) 79.80(4.37) 83.17(2.58) 96.63(0.85) 96.63(0.86)
80.76(1.20) 75.69(3.08) 74.51(4.37) 94.71(1.33) 95.19(1.84)
78.37(1.75)) 77.58(1.53) 70.67(4.65) 91.82(1.40) 91.82(1.39)
77.25(1.39) 72.60(2.38) 66.35(5.76) 88.46(1.51) 91.34(0.76)
75.10(1.98) 65.38(4.10) 66.82(6.87) 82.98(0.85) 87.50(1.71)
Heart
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
81.85(1.28) 83.41(5.81) 87.51(0.88) 88.15(1.45) 89.62(2.42)
80.39(0.56)) 80.98(2.68) 86.29(2.45) 85.19(2.14) 88.51(3.24)
79.63(0.74) 79.45(6.19) 84.67(2.64) 85.07(1.99) 87.04(2.63)
79.46(1.31) 76.94(1.64) 84.07(2.47) 80.00(2.88) 84.07(4.35)
79.00(1.25) 73.76(1.75) 81.85(2.24) 79.63(2.43) 83.70(2.97)
76.06(1.42) 73.04(2.44) 79.74(1.62) 80.00(0.61) 80.00(0.81)
vote
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
87.76(0.24) 95.63(0.18) 96.62(0.23) 97.98(0.84) 96.83(0.70)
92.75(1.56) 95.35(0.42) 96.59(0.44) 96.32(1.31) 96.10(1.36)
91.83(2.24) 94.23(0.20) 96.16(0.40) 96.32(0.82) 96.45(0.82)
89.62(1.45) 94.01(0.66) 95.86(2.40) 96.22(0.48) 95.63(0.59)
87.69(0.27) 91.86(2.46) 92.97(1.81) 95.67(0.15) 95.63(0.41)
87.42(0.78) 91.10(1.09) 93.33(2.48) 95.72(0.19) 95.09(0.21)
Australian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
86.33(0.42) 85.39(0.25) 85.42(1.96) 92.72(2.17) 92.55(1.70)
85.67(0.27) 85.80(0.36) 85.61(1.37) 89.83(1.49) 89.56(1.41)
85.08(1.42) 84.99(0.77) 85.49(1.68) 86.64(1.28) 87.27(1.10)
83.74(1.57) 83.14(2.32) 84.96(3.54) 80.91(0.11) 80.75(1.21)
83.96(1.96) 80.28(3.96) 84.39(4.29) 76.33(1.92) 78.55(1.89)
82.19(1.95) 80.53(4.68) 82.46(6.18) 68.77(0.85) 68.41(1.94)
Breast-w
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.11(0.08) 95.88(0.38) 96.71(0.10) 98.13(0.50) 97.68(0.54)
95.95(0.35) 95.65(0.36) 96.77(0.36) 97.47(0.76) 97.48(0.54)
95.16(0.26) 95.53(0.28) 96.59(0.39) 96.62(0.39) 96.53(0.58)
96.03(0.33) 95.39(0.58) 96.41(0.38) 96.55(0.46) 96.42(0.16)
95.42(0.44) 95.23(0.52) 96.41(0.32) 96.05(0.25) 96.08(0.48)
95.80(0.93) 94.31(0.77) 96.25(0.56) 95.92(0.12) 95.85(0.27)
Ionosphere
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
84.32(0.58) 87.17(1.91) 94.02(0.83) 96.32(1.04) 96.29(0.89)
82.17(1.24) 81.23(7.14) 90.60(1.47) 95.53(1.54) 95.04(0.77)
81.91(1.33) 85.18(3.46) 86.32(4.43) 93.13(1.16) 93.10(0.11)
80.03(0.47) 83.47(4.26) 74.64(3.70) 91.28(0.85) 92.02(1.93)
79.96(1.25) 79.77(3.78) 72.36(3.89) 91.76(1.02) 90.88(0.76)
78.99(2.08) 78.06(4.09) 68.95(1.44) 92.33(0.71) 91.73(0.56)
Breast-Cancer
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.49(0.02) 96.02(0.40) 96.90(0.22) 97.96(0.41) 97.58(0.36)
96.38(0.18) 96.11(0.61) 96.87(0.61) 97.59(0.26) 97.53(0.11)
96.20(0.27) 95.81(0.23) 96.81(0.23) 96.78(0.46) 96.41(0.33)
96.02(0.43) 95.61(0.28) 96.76(0.29) 96.67(0.12) 96.25(0.56)
96.35(0.38) 95.61(0.14) 96.82(0.12) 96.50(0.26) 96.17(0.17)
95.66(0.55) 94.49(1.01) 96.49(1.00) 96.32(0.20) 95.98(0.25)
Credit-a
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.42(0.24) 85.51(0.50) 85.94(0.16) 94.52(1.29) 95.51(1.02)
84.70(0.80) 85.40(0.41) 87.68(0.82) 94.27(0.99) 93.10(1.78)
83.56(1.68) 84.52(0.67) 87.39(1.24) 91.41(0.86) 91.62(1.16)
82.33(1.08) 82.69(2.65) 86.09(1.07) 89.04(1.89) 89.30(0.60)
81.90(2.87) 79.23(3.96) 85.65(0.96) 87.96(0.61) 87.47(0.68)
79.78(3.62) 77.99(4.28) 85.22(1.61) 86.42(0.32) 85.76(0.50)
Pima-Indian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
75.27(1.22) 77.08(1.09) 81.51(0.67) 92.47(1.02) 93.54(0.66)
73.89(2.74) 73.95(3.58) 80.33(2.12) 90.40(1.11) 91.38(0.56)
73.42(1.87) 76.17(2.24) 75.78(2.79) 88.16(1.69) 86.85(1.66)
71.98(1.48) 73.82(0.78) 72.78(2.62) 78.11(2.57) 84.47(2.69)
70.87(1.25) 65.88(1.16) 69.27(1.55) 79.10(3.65) 80.03(1.90)
69.50(1.66) 65.23(1.44) 68.62(1.40) 74.74(3.21) 72.07(1.98)
Splice
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
75.29(0.24) 51.70(0.71) 87.60(0.66) 96.55(0.88) 96.33(0.25)
74.66(1.21) 55.50(3.84) 84.60(1.28) 94.76(1.49) 94.20(0.70)
72.54(0.76)) 53.80(2.72) 79.30(2.88) 91.52(1.28) 92.17(0.86)
72.19(0.66) 74.20(4.42) 71.40(2.51) 89.03(0.11) 88.76(1.58)
71.40(1.41)) 68.50(1.50) 72.60(2.72) 87.21(1.92) 85.83(1.19)
70.22(1.11) 70.60(4.04) 67.30(2.30) 85.69(0.85) 85.43(0.58)
Ala
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
81.76(0.89) 81.86(0.20) 82.63(0.20) 86.53(0.21) 85.47(1.17)
81.60(0.48) 81.35(0.70) 81.72(0.70) 84.74(1.05) 82.51(2.04)
80.02(0.59) 78.34(0.70) 80.00(0.70) 83.44(1.43) 80.08(1.78)
77.04(1.30) 77.69(1.36) 76.48(1.36) 81.22(1.00) 77.35(1.41)
73.19(2.48) 73.13(3.86) 76.38(4.86) 78.68(2.14) 75.42(2.54))
72.58(0.95) 73.30(1.71) 76.09(1.71) 76.57(2.58) 74.86(2.68)
MeanMap, the accuracy increase brought by the RBF kernel occurs on all data sets except AUSTRALIAN, CREDIT-A and ALA. For InvCal, the RBF kernel improves the classification accuracy on the HEART, BREAST-CANCER, AUSTRALIAN, CREDIT-A, BREAST-W, SPLICE and ALA data sets. For alter-∝SVM, the improvement appears on the HEART, IONOSPHERE, BREAST-CANCER, AUSTRALIAN, BREAST-W, PIMA-INDIAN and SPLICE data sets. As for the newly proposed proportion-NPSVM, the count of #improvements is 8 out of the 11 classification data sets, the exceptions being CREDIT-A, PIMA-INDIAN and ALA.
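One way to compute the per-data-set quantity plotted in Fig. 7 is the difference between the bag-size-averaged RBF accuracy and the bag-size-averaged linear-kernel accuracy; this aggregation is our assumption, not a formula stated in the paper.

```python
import numpy as np

def rbf_gain(acc_rbf, acc_linear):
    """Average accuracy with the RBF kernel minus average accuracy with the
    linear kernel over the six bag sizes; positive values mean the RBF kernel
    helps on that data set."""
    return float(np.mean(acc_rbf) - np.mean(acc_linear))
```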
Furthermore, to show the comparison of the classification performance of these selected algorithms clearly, we present the average classification error rates on the binary-class data sets in Figs. 8 and 9. From them, we can see that the average error rates of proportion-NPSVM are smaller than those of the alter-∝SVM, MeanMap and InvCal methods. Specifically, the count of #wins for proportion-NPSVM with the linear kernel is 9 out of the 11 classification data sets, while the count of #wins for proportion-NPSVM with the RBF kernel is 10. In conclusion, the novel proportion-NPSVM algorithm proposed for solving LLP problems is effective on the binary classification data sets.
Table 3. Accuracy of different methods on binary-class data sets with the RBF kernel and bag sizes 2, 4, 8, 16, 32, 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is presented in brackets.
Data set | Method | 2 | 4 | 8 | 16 | 32 | 64
Sonar
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
84.57(0.27) 85.57(0.66) 94.23(2.33) 98.39(0.73) 97.78(0.54)
83.99(0.56) 78.84(1.23) 70.67(3.12) 97.43(0.56) 95.10(1.29)
83.54(0.19) 64.90(2.58) 67.31(2.82) 96.79(1.11) 93.75(1.20)
79.23(1.05) 53.37(3.66) 64.42(3.57) 96.95(0.28) 92.32(0.48)
78.78(1.47) 53.37(3.79) 62.02(2.70) 95.83(0.73) 91.53(1.00)
75.31(2.53) 53.37(3.79) 62.98(2.83) 96.63(0.48) 90.57(0.53)
Heart
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
83.69(1.28) 90.37(0.55) 96.29(2.12) 97.77(0.57) 96.96(0.61)
81.87(0.47) 85.18(1.35) 89.62(2.22) 95.92(1.61) 95.56(1.34)
80.47(0.83) 80.26(2.77) 85.18(1.78) 95.18(0.64) 93.48(1.01)
79.22(1.46) 79.61(3.26) 85.55(1.73) 93.95(0.21) 92.59(1.81)
79.84(1.42) 76.36(2.69) 82.22(2.81) 89.01(1.40) 91.26(1.28)
77.26(1.25) 73.90(3.58) 80.00(1.47) 89.25(1.11) 89.62(1.08)
Vote
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
91.15(0.21) 95.68(0.12) 95.80(0.33) 98.29(0.13) 98.52(0.45)
90.52(1.76) 94.77(0.47) 95.54(0.42) 97.79(0.35) 96.82(1.15)
91.54(2.22) 93.95(0.27) 94.88(0.57) 97.14(0.42) 96.18(0.88)
90.28(1.45) 93.03(0.57) 92.44(1.28) 97.01(0.43) 96.55(0.96)
89.58(0.21) 87.79(2.42) 90.72(1.47) 97.10(0.21) 95.62(0.70)
89.33(0.82) 86.63(1.11) 90.93(1.35) 96.82(0.10) 96.09(0.21)
Australian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.92(0.72) 86.06(0.30) 85.74(1.86) 95.65(0.65) 95.56(1.20)
84.98(0.34) 86.11(0.26) 85.71(1.71) 93.67(1.18) 93.15(0.91)
85.53(1.21) 86.32(0.56) 86.26(1.31) 90.96(0.33) 89.30(1.00)
84.07(2.04) 84.13(1.57) 85.65(0.98) 85.36(0.25) 86.84(1.28)
83.12(1.47) 82.73(1.12) 83.63(1.52) 79.90(1.03) 81.21(1.11)
80.70(3.96) 81.87(2.35) 83.62(1.61) 76.23(0.94) 74.78(1.17)
Breast-w
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.42(0.18) 96.85(0.25) 96.97(0.29) 98.04(0.54) 97.99(0.34)
96.27(0.31) 96.91(0.16) 97.00(0.25) 97.47(0.60) 97.62(0.21)
96.20(0.27) 96.77(0.22) 96.97(0.30) 97.18(0.58) 96.99(0.68)
96.14(0.39) 95.39(0.25) 96.87(0.66) 97.13(0.25) 96.82(0.47)
94.92(1.02) 95.23(0.29) 96.88(0.25) 96.47(0.22) 96.48(0.12)
94.24(1.22) 94.58(1.57) 96.07(0.61) 96.32(0.17) 96.45(0.21)
Ionosphere
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.21(0.31) 95.44(0.58) 99.14(0.04) 95.44(2.06) 96.35(0.84)
84.79(0.18) 84.33(1.23) 98.86(0.59) 96.98(0.74) 93.67(0.89)
82.65(0.96) 65.24(2.96) 94.58(1.64) 94.92(1.05) 93.16(1.65)
81.24(1.21) 64.10(3.05) 93.19(1.87) 93.56(1.43) 92.59(1.33)
80.08(0.74) 64.10(3.05) 92.02(1.72) 94.18(0.53) 93.03(1.78)
79.14(0.56) 64.10(3.05) 86.32(2.53) 94.07(0.16) 92.43(0.82)
Breast-Cancer
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.69(0.17) 97.07(0.18) 97.16(0.20) 97.31(0.08) 97.45(0.20)
96.72(0.20) 97.10(0.31) 97.12(0.13) 97.47(0.14) 97.36(0.17)
96.84(0.32) 97.02(0.15) 97.19(0.37) 97.26(0.08) 97.24(0.19)
96.60(0.21) 97.08(0.25) 97.35(0.30) 97.31(0.45) 97.07(0.17)
96.67(0.18) 96.51(0.30) 97.09(0.48) 97.12(0.22) 96.89(0.30)
96.78(0.11) 96.09(0.72) 97.23(0.42) 97.07(0.25) 96.72(0.45)
Credit-a
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.86(0.81) 86.26(0.65) 86.23(0.12) 88.02(0.17) 87.51(0.76)
85.04(0.73) 85.62(0.12) 86.09(0.35) 87.53(1.23) 87.42(0.91)
84.96(1.22) 85.41(0.42) 85.55(0.37) 77.58(1.76) 84.63(1.99)
83.26(1.55) 83.79(0.64) 84.73(2.52) 76.97(0.66) 79.56(0.80)
81.14(2.96) 82.21(3.72) 82.39(2.63) 79.32(0.17) 79.42(0.59)
76.00(4.58) 76.90(4.02) 80.75(3.66) 79.95(0.75) 75.36(0.93)
Pima-Indian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
76.21(1.22) 79.94(0.63) 88.28(1.40) 94.14(0.71) 92.94(0.21)
75.75(1.31) 76.95(1.07) 81.25(1.28) 90.83(0.82) 91.56(1.25)
74.32(0.84) 72.65(1.36) 72.53(1.32) 87.68(1.18) 87.83(2.87))
72.87(1.72) 67.06(2.57) 70.70(3.40) 80.73(1.11) 83.33(2.56)
71.20(0.96) 65.10(3.64) 69.92(1.66) 68.49(2.58) 69.82(3.07)
70.44(2.57) 65.10(3.83) 67.71(0.78) 65.10(3.64) 67.08(3.84)
Splice
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
77.24(0.47) 89.30(0.52) 96.10(0.74) 89.24(0.17) 89.66(0.16)
75.71(0.58) 83.50(1.23) 88.80(1.77) 89.30(0.26) 89.60(0.11)
74.92(0.79) 77.40(2.22) 77.72(1.74) 89.26(0.06) 89.43(1.00)
72.47(1.22) 73.20(2.75) 73.00(2.11) 89.46(0.12) 89.30(1.28)
72.16(1.02) 55.70(4.58) 71.21(2.37) 89.06(0.26) 89.23(1.11)
71.08(1.35) 51.80(6.73) 64.80(1.07) 89.50(0.06) 89.43(1.17)
Ala
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
76.16(0.33) 82.31(0.41) 82.22(0.36) 85.78(0.27) 85.33(0.35)
75.86(0.28) 81.49(0.52) 81.80(0.58) 83.25(0.66) 84.06(0.47)
76.44(1.26) 81.12(1.23) 79.16(1.32) 80.14(1.23) 81.92(1.12)
76.48(0.55) 78.66(1.33) 75.67(1.17) 78.35(1.43) 78.08(1.25)
75.93(1.06) 75.47(1.92) 75.73(1.41) 76.88(1.96) 77.02(0.98)
74.95(1.71) 74.57(2.11) 75.48(1.02) 75.04(1.84) 74.97(2.05)
5.2. Experiments on multi-class data sets
Based on the analysis of the experimental results above, we know that the proposed proportion-NPSVM algorithm achieves highly competitive, and often better, classification performance on binary-class data sets. In this section, we test its classification ability on multi-class data sets. As mentioned before, this paper focuses on the binary classification setting. Therefore, we test the one-vs-rest binary classification performance on the selected multi-class data sets, including IRIS, WINE, GLASS, VOWEL, DNA and SATIMAGE. That is
to say, in these experiments, we treat the data from one class as positive and randomly choose the same amount of data as negative from the remaining classes. In this way, we generate thirteen one-vs-rest data sets that can be used to test our proposed proportion-NPSVM algorithm. The experimental results on the multi-class data sets are presented in Table 4. As shown there, the proposed proportion-NPSVM is highly competitive with, and often better than, alter-∝SVM, and both outperform the MeanMap and InvCal methods. In detail, proportion-NPSVM improves on alter-∝SVM on most data sets, including IRIS, GLASS, VOWEL, WINE, by introducing the non-
Table 4. Accuracy of different methods on multi-class data sets with the linear kernel and bag sizes 2, 4, 8, 16, 32, 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is presented in brackets.
Data set | Method | 2 | 4 | 8 | 16 | 32 | 64
Iris-1
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
88.63(0.23) 79.00(2.19) 74.00(0.11) 93.33(0.52) 96.38(1.22)
88.02(0.42) 76.00(2.39) 74.00(0.24) 91.00(1.09) 92.65(2.30)
87.51(0.65) 73.00(0.95) 73.58(2.30) 85.80(5.16) 89.84(4.32)
84.96(1.21) 73.00(1.28) 71.24(2.66) 75.80(3.87) 72.25(3.86)
81.25(1.34) 71.00(1.21) 70.40(1.66) 77.00(2.65) 71.68(3.63)
80.20(2.58) 69.00(3.72) 69.34(4.21) 74.00(4.89) 74.62(2.86)
Glass-1
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
74.52(1.26) 72.86(1.78) 74.29(0.73) 89.00(1.08) 94.85(2.55)
73.28(0.97) 73.57(1.79) 72.14(1.86) 88.86(1.31) 92.86(2.02)
72.22(1.06) 74.28(1.69) 71.43(1.66) 84.50(2.13) 90.71(1.68)
71.79(1.44) 74.28(0.96) 67.86(1.32) 83.79(2.02) 86.00(3.17)
71.17(0.85) 72.14(1.74) 67.14(1.84) 76.35(2.63) 79.14(0.60)
70.96(0.74) 72.14(0.53) 66.43(1.32) 74.14(3.34) 76.14(1.86)
Glass-2
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
74.68(0.98) 72.37(1.69) 71.53(4.14) 90.71(1.01) 94.47(0.86)
72.47(1.21) 66.48(1.58) 69.73(2.99) 89.33(1.35) 93.42(2.18)
70.83(1.44) 71.71(1.46) 71.05(1.88) 86.59(1.72) 89.73(1.36)
71.25(0.77) 63.16(1.96) 70.39(3.86) 83.88(2.83) 85.13(2.53)
68.37(1.76) 69.73(1.74) 69.34(2.02) 77.12(2.42) 80.92(2.96)
66.54(2.32) 59.21(2.74) 65.31(2.11) 71.05(1.97) 76.31(3.62)
Glass-3
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
75.96(0.62) 79.41(1.35) 79.41(3.61) 88.24(0.17) 91.17(0.00)
74.33(1.11) 76.47(1.98) 73.53(1.73) 87.25(2.07) 90.58(2.46)
71.05(1.55) 67.65(1.55) 70.59(4.10) 85.62(1.42) 89.41(2.63)
72.33(1.76) 67.65(1.46) 70.59(2.06) 82.35(4.87) 88.23(3.83)
69.47(2.02) 70.59(1.23) 69.95(1.63) 78.10(5.32) 78.82(3.83)
67.34(3.14) 55.10(3.39) 64.71(1.17) 69.41(4.84) 75.29(2.63)
Vowel-1
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
93.86(0.22) 95.83(0.53) 97.18(1.86) 96.67(1.68) 96.45(1.14)
92.17(0.53) 93.75(0.47) 94.58(2.89) 95.63(1.68) 94.80(2.03)
90.88(1.11) 92.71(0.69) 93.33(2.56) 95.21(1.57) 93.75(2.81)
90.15(0.83) 91.67(2.73) 90.62(3.11) 93.75(0.49) 91.67(3.65)
89.86(1.20) 92.70(1.66) 89.58(2.92) 93.85(0.33) 90.20(1.71)
87.68(2.22) 86.45(2.29) 87.50(3.82) 93.75(0.11) 89.58(0.93)
Vowel-2
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
83.45(1.25) 85.42(2.89) 89.58(1.15) 96.87(1.39) 96.87(1.19)
81.63(1.37) 83.33(2.38) 85.41(2.27) 96.35(2.03) 96.25(2.39)
79.36(1.59) 80.21(1.20) 82.29(4.71) 94.79(2.31) 94.58(0.87)
77.01(2.22) 78.13(1.94) 81.25(2.57) 91.97(2.69) 90.41(3.85)
76.54(2.54) 79.16(1.13) 80.20(2.79) 89.58(1.72) 88.96(2.28)
75.37(3.26) 82.29(0.39) 79.16 (2.91) 86.45(2.14) 85.63(1.35)
Vowel-3
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.77(0.87) 88.54(3.98) 92.25(1.57) 97.71(1.17) 96.87(1.42)
82.58(1.22) 85.42(0.89) 89.06(1.80) 96.67(1.82) 96.04(2.69)
81.36(0.79) 84.38(1.43) 81.60(1.04) 94.38(2.21) 92.50(1.71)
79.87(1.34) 85.41(1.64) 79.16(3.81) 90.94(2.03) 90.63(1.04)
80.25(1.11) 84.37(1.46) 71.87(3.73) 86.46(2.14) 89.79(2.48)
78.63(2.76) 80.21(2.18) 72.91(2.84) 85.52(1.25) 84.58(2.60)
Wine-1
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
95.60(0.21) 99.15(0.00) 99.72(0.05) 99.72(0.02) 99.15(0.05)
94.87(0.16) 99.15(0.05) 99.89(0.01) 99.89(0.00) 99.15(0.02)
94.02(0.73) 99.14(0.02) 99.59(0.02) 99.79(0.01) 99.15(0.07)
93.11(0.57) 99.15(0.05) 99.49(0.01) 99.79(0.01) 98.64(0.02)
92.36(1.02) 98.31(0.35) 99.23(0.06) 99.74(0.00) 98.13(0.39)
89.07(2.21) 93.22(0.99) 98.68(0.57) 99.57(0.00) 95.76(0.20)
Wine-2
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
95.98(0.24) 99.29(0.01) 98.59(0.22) 98.61(0.17) 99.29(0.00)
95.44(0.35) 97.89(0.96) 98.52(0.59) 98.59(0.15) 99.01(0.23)
94.11(0.65) 96.48(1.74) 95.35(0.74) 98.59(0.20) 98.08(0.37)
93.78(0.22) 95.07(1.43) 94.43(1.41) 97.39(0.48) 96.61(1.88)
91.85(0.98) 94.37(1.35) 95.74(1.18) 96.19(0.95) 95.77(2.38)
90.02(1.74) 89.44(0.99) 91.54(2.01) 95.21(0.56) 95.47(1.07)
Dna-1
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
86.78(1.33) 93.25(3.43) 99.36(0.09) 97.95(0.33) 97.88(0.06)
83.62(1.25) 91.05(2.81) 98.53(0.58) 97.30(0.40) 97.19(0.66)
80.21(1.64) 90.19(2.77) 97.16(0.72) 96.44(0.46) 95.40(0.98)
79.46(0.58) 89.33(2.11) 93.87(2.30) 94.39(0.41) 93.71(0.44)
78.20(1.22) 85.13(2.72) 91.81(2.28) 93.59(0.53) 92.46(0.19)
77.73(1.65) 81.04(1.55) 90.67(2.25) 93.23(0.35) 91.91(0.16)
Dna-2
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
88.34(0.68) 92.98(1.94) 99.10(0.06) 96.90(0.20) 97.73(0.39)
84.54(1.57) 90.28(2.08) 97.95(0.64) 96.59(0.21) 96.49(1.68)
82.11(2.02) 88.35(2.03) 95.67(0.83) 96.80(0.35) 95.15(0.60)
79.89(3.36) 87.73(2.25) 92.81(1.31) 95.25(1.20) 93.33(0.88)
78.46(2.25) 87.62(2.37) 90.28(1.87) 92.98(0.74) 92.89(0.99)
76.73(3.04) 83.09(1.58) 89.75(2.25) 91.95(0.32) 91.23(0.27)
Satimage-1
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
96.93(0.25) 94.89(1.89) 98.34(0.07) 97.54(0.02) 97.70(0.00)
96.27(0.32) 94.30(2.73) 98.32(0.09) 97.49(0.01) 97.59(0.00)
95.65(0.51) 92.88(0.74) 98.30(0.20) 97.49 (0.64) 97.49(0.55)
94.36(1.76) 86.61(2.33) 98.21(0.26) 96.81(0.87) 96.38(0.73)
94.25(0.22) 84.03(1.61) 97.99(0.26) 96.03(0.99) 96.45(1.04)
93.76(0.50) 82.09(1.40) 97.79(0.23) 96.65(1.22) 95.72(1.33)
Satimage-2
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
94.42(0.38) 94.64(1.62) 96.91(0.17) 97.29(0.20) 97.21(0.26)
93.60(0.64) 93.96(1.34) 96.77(0.20) 96.87(0.33) 95.20(1.53)
92.47(1.11) 88.09(1.47) 96.52(0.28) 95.94(0.65) 95.90(0.51)
92.36(0.34) 86.91(1.64) 96.08(0.45) 95.47(0.24) 95.62(0.05)
90.77(0.38) 83.75(1.45) 95.74(0.30) 95.36(0.41) 95.57(0.09)
89.26(0.29) 84.11(1.33) 95.25(0.32) 95.31(0.22) 95.36(0.10)
Fig. 8. The average error rates of the classification methods with the linear kernel on the 11 binary-class data sets.
Fig. 9. The average error rates of the classification methods with the RBF kernel on the 11 binary-class data sets.
Fig. 10. The average error rates of the classification methods with the linear kernel on the six multi-class data sets.
Fig. 11. The average error rates of the classification methods with the RBF kernel on the six multi-class data sets.
As for the SATIMAGE and DNA data sets, this novel method is highly competitive with alter-∝SVM. Similarly, we also applied the RBF kernel to proportion-NPSVM to further assess its performance on the multi-class data sets. The classification results are shown in Table 5. As can be seen, the proportion-NPSVM algorithm obtains classification accuracy that is highly competitive with alter-∝SVM on almost all the selected data sets, including VOWEL, WINE and SATIMAGE, and better than the results of MeanMap and InvCal. On the GLASS and DNA data sets, the proportion-NPSVM method achieves the best classification performance compared with alter-∝SVM, MeanMap and InvCal. In addition, to present the comparison among the selected approaches clearly, we introduce Figs. 10 and 11 to show the average classification error on these six multi-class data sets. From them, we can see that the average error rate of proportion-NPSVM is smaller than those of the alter-∝SVM, MeanMap and InvCal methods. Specifically, the count of #wins for proportion-NPSVM with the linear kernel is 5 out of the 6 data sets, and the count of #wins with the RBF kernel is also 5. In summary, the proposed proportion-NPSVM algorithm is also effective for the classification of multi-class data sets.
Finally, to compare the classification performance of these LLP algorithms more easily, we not only report the standard deviations shown in brackets in the above four tables but also apply the Wilcoxon signed ranks test to the achieved accuracies. The Wilcoxon signed ranks test results are listed in Table 6 (for binary classification) and Table 7 (for multi-class classification). As shown there, the proposed proportion-NPSVM improves over the compared LLP algorithms, including InvCal, MeanMap and alter-∝SVM (a minimal sketch of this statistical comparison follows). Based on the experimental analysis of the above two sections, we conclude that the proposed proportion-NPSVM algorithm for the LLP problem achieves highly competitive, often the best, classification performance on both the binary-class and the multi-class data sets. Therefore, this novel method can be regarded as a state-of-the-art algorithm for solving the LLP problem.
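As a concrete illustration (not the authors' evaluation script), the Wilcoxon signed ranks comparison of two methods over paired accuracy results can be carried out as below. The accuracy arrays are placeholders, not the values reported in the tables, and the manual R+/R− computation ignores ties; scipy.stats.wilcoxon is assumed to be available.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired accuracies (%) of two methods over the same (data set, bag size) settings.
acc_pnpsvm = np.array([93.3, 91.0, 85.8, 89.0, 88.9, 84.5, 90.7, 89.3, 86.6])
acc_alter  = np.array([74.0, 74.0, 73.6, 74.3, 72.1, 71.4, 71.5, 69.7, 71.1])

diff = acc_pnpsvm - acc_alter
ranks = np.argsort(np.argsort(np.abs(diff))) + 1   # ranks of |differences| (ties ignored)
r_plus = ranks[diff > 0].sum()                     # R+: rank sum where the first method wins
r_minus = ranks[diff < 0].sum()                    # R-: rank sum where the second method wins

stat, p_value = wilcoxon(acc_pnpsvm, acc_alter)    # two-sided Wilcoxon signed ranks test
print(r_plus, r_minus, p_value)
```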
6. Concluding remarks

In this paper, we study the problem of learning with label proportions. To address this learning scenario, we contribute a novel solution based on nonparallel support vector machines, termed proportion-NPSVM. It incorporates the label proportions into the NPSVM optimization model under a large-margin framework, and improvements are obtained both in accuracy and in generalization ability. The proportion-NPSVM model can be solved efficiently by optimizing two smaller problems with an alternating strategy using the SOR method or the SMO technique. Furthermore, it does not need to compute large inverse matrices before training. Extensive experimental results show the efficiency of proportion-NPSVM in classification accuracy and its good prediction performance. In conclusion, the proportion-NPSVM algorithm is a feasible classification method for learning from label proportions, a task that arises in many practical applications. Although the proportion-NPSVM method yields strong classification performance, some aspects can still be improved. The algorithm initializes the labels randomly, which may affect the classification performance and the optimization time. In future work, we plan to explore how to initialize the labels under the constraint of the label proportions. Besides, we would like to develop data grouping strategies based on the data's own features instead of random tactics.

Acknowledgments

The authors would like to express their sincere thanks to the associate editor and the reviewers who made great contributions to the improvement of this paper. This work was partially supported by the major project of the National Natural Science Foundation of China (Grant nos. 71331005 and 91546201), the international (regional) cooperation project of the National Natural Science Foundation of China (Grant no. 71110107026) and the grants from the National Natural Science Foundation of China (Grant no. 61402429).
Table 5
Accuracy of different methods on the multi-class data sets with the RBF kernel and bag sizes 2, 4, 8, 16, 32 and 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is given in brackets. Each block corresponds to one data set, with one row per bag size and one column per method.

Iris-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          91.66(1.24)    97.20(0.42)    98.91(0.00)    84.80(1.62)    96.01(1.09)
4          90.85(1.37)    95.70(1.94)    98.00(0.95)    83.22(1.30)    91.60(2.79)
8          88.27(2.04)    90.30(2.11)    97.80(1.40)    80.70(1.70)    91.00(2.82)
16         87.64(1.65)    82.51(1.91)    94.76(1.52)    78.87(1.76)    87.60(4.77)
32         87.09(1.74)    78.14(0.92)    89.88(3.66)    74.70(1.18)    87.31(3.13)
64         86.53(2.11)    74.87(1.72)    83.00(3.06)    73.37(0.95)    85.22(2.96)

Glass-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          78.97(0.96)    81.21(2.17)    87.07(4.31)    81.42(1.21)    93.28(1.30)
4          77.63(1.32)    77.85(2.06)    85.00(5.49)    80.71(0.68)    92.57(2.05)
8          75.88(1.55)    79.01(1.87)    75.57(3.35)    78.78(2.28)    90.57(1.92)
16         75.06(2.01)    76.57(1.25)    72.14(4.08)    75.50(3.03)    84.85(2.44)
32         73.86(1.66)    74.28(1.11)    67.85(3.94)    72.85(1.14)    77.85(2.18)
64         73.27(1.67)    73.57(2.05)    68.57(4.43)    69.35(0.41)    74.43(2.33)

Glass-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          79.73(0.44)    87.89(1.65)    85.52(3.53)    86.71(1.32)    87.50(2.44)
4          73.57(1.32)    81.58(1.45)    83.03(4.12)    82.89(1.65)    87.36(2.11)
8          71.42(1.21)    76.97(1.65)    81.71(2.68)    80.92(2.28)    84.86(3.65)
16         70.02(1.47)    66.45(2.93)    71.13(3.43)    73.68(1.51)    82.89(3.03)
32         68.68(2.32)    69.07(2.75)    66.44(2.98)    73.02(1.39)    79.47(1.18)
64         67.53(3.13)    60.53(2.63)    63.82(4.04)    65.79(1.56)    75.39(1.65)

Glass-3
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          76.57(1.22)    75.36(2.69)    78.10(4.67)    85.29(0.17)    88.23(0.24)
4          74.22(0.97)    76.47(1.76)    79.41(3.22)    84.87(1.11)    85.88(2.63)
8          70.53(1.32)    74.71(1.87)    73.53(4.26)    83.24(3.11)    85.29(1.61)
16         69.96(1.75)    67.64(2.58)    70.59(3.71)    78.92(4.32)    79.41(3.83)
32         67.95(2.01)    70.58(1.67)    70.58(2.86)    76.47(3.16)    77.64(4.92)
64         66.32(3.43)    65.49(2.02)    70.06(4.17)    73.52(4.11)    71.76(6.69)

Vowel-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          95.68(1.11)    99.15(0.02)    99.15(0.02)    98.02(1.22)    99.15(0.02)
4          94.37(0.96)    98.96(0.20)    99.86(0.02)    97.81(1.73)    98.95(0.32)
8          93.26(1.44)    87.50(1.86)    97.81(0.61)    97.19(1.10)    96.67(1.36)
16         90.83(0.98)    79.17(2.06)    93.75(2.38)    97.71(1.08)    95.83(2.70)
32         87.32(1.23)    68.75(1.71)    87.50(2.98)    98.02(0.33)    96.54(0.87)
64         85.27(0.76)    68.75(1.53)    87.25(2.88)    97.09(0.16)    94.79(1.04)

Vowel-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          93.76(0.77)    98.85(0.42)    99.68(0.02)    97.60(1.10)    98.75(0.87)
4          91.25(1.22)    91.67(1.12)    93.85(2.57)    96.67(2.29)    98.68(0.36)
8          89.33(1.54)    87.50(0.82)    85.42(4.36)    92.39(1.70)    96.87(1.27)
16         85.74(2.73)    75.00(1.56)    83.33(5.10)    91.04(1.98)    93.33(2.51)
32         83.65(2.88)    79.17(1.43)    81.25(2.42)    90.00(1.32)    89.79(2.00)
64         82.71(3.65)    71.88(1.79)    77.08(3.45)    90.62(0.08)    89.83(1.42)

Vowel-3
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          93.54(0.56)    98.95(0.03)    99.89(0.01)    97.91(0.08)    98.58(0.93)
4          92.37(0.78)    95.83(0.35)    97.81(0.88)    95.83(1.85)    97.08(1.14)
8          90.62(1.17)    88.54(0.64)    93.75(2.67)    92.71(2.33)    95.42(2.26)
16         87.35(1.68)    75.00(1.69)    87.50(4.21)    89.68(2.32)    92.03(1.19)
32         84.74(1.97)    79.17(1.42)    83.33(2.38)    87.08(1.32)    88.54(1.71)
64         83.17(2.63)    69.79(1.94)    82.91(4.49)    86.56(0.59)    86.25(0.85)

Wine-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          96.86(0.21)    99.89(0.01)    99.89(0.01)    99.15(0.02)    99.85(0.00)
4          96.47(0.30)    99.40(0.05)    99.87(0.02)    99.15(0.03)    99.85(0.02)
8          95.23(0.46)    96.69(1.76)    99.85(0.02)    99.15(0.02)    99.79(0.00)
16         94.76(1.00)    91.28(1.98)    99.49(0.01)    99.15(0.02)    98.89(0.01)
32         92.07(1.74)    93.22(1.69)    98.66(0.08)    99.15(0.02)    98.87(0.03)
64         90.42(2.02)    90.68(1.73)    99.15(0.02)    99.15(0.02)    99.02(0.02)

Wine-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          96.43(0.43)    99.64(0.05)    99.92(0.01)    98.73(0.45)    99.92(0.01)
4          95.22(0.58)    95.84(1.11)    99.91(0.01)    98.59(0.16)    99.71(0.00)
8          93.76(1.21)    90.21(1.33)    99.88(0.01)    98.45(0.33)    99.29(0.00)
16         90.20(0.88)    86.19(0.94)    98.73(0.40)    98.38(0.48)    97.88(0.39)
32         89.35(1.28)    79.57(0.95)    96.19(0.94)    97.74(0.88)    97.74(0.62)
64         88.90(1.96)    76.76(1.56)    93.66(1.10)    97.18(0.59)    97.18(0.06)

Dna-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          92.06(0.25)    98.09(0.48)    99.34(0.06)    99.13(0.06)    99.03(0.16)
4          91.58(0.38)    94.01(0.96)    98.22(0.42)    99.03(0.20)    99.01(0.28)
8          88.04(1.20)    88.47(1.63)    95.34(1.10)    98.81(0.12)    98.90(0.12)
16         86.75(2.77)    82.22(1.86)    92.65(1.21)    98.81(0.11)    98.74(0.38)
32         77.35(3.65)    75.21(1.37)    90.28(1.95)    98.38(0.23)    98.70(0.31)
64         69.46(4.58)    71.98(2.72)    88.82(1.80)    98.49(0.06)    98.41(0.29)

Dna-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          91.54(1.55)    97.26(0.42)    98.71(0.37)    99.27(0.04)    99.14(0.21)
4          91.03(0.72)    94.64(0.84)    96.73(0.87)    99.27(0.04)    99.07(0.37)
8          88.64(1.67)    91.55(1.47)    94.62(0.82)    98.86(0.08)    99.04(0.16)
16         85.37(3.12)    86.80(2.17)    91.71(1.08)    98.36(0.11)    98.69(0.21)
32         76.84(4.44)    79.89(0.67)    89.82(1.67)    98.24(0.26)    98.48(0.42)
64         73.65(5.28)    75.36(2.99)    88.56(1.95)    98.65(0.17)    98.45(0.21)

Satimage-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          97.12(0.52)    98.48(0.10)    99.56(0.07)    98.43(0.22)    98.78(0.00)
4          96.83(0.35)    98.69(0.11)    98.55(0.15)    98.43(0.21)    98.74(0.06)
8          96.54(0.43)    97.86(0.48)    98.59(0.15)    98.43(0.21)    98.57(0.06)
16         95.87(1.21)    96.04(0.88)    98.50(0.26)    98.43(0.22)    97.84(0.93)
32         93.26(0.81)    94.17(0.85)    98.43(0.27)    98.43(0.22)    96.86(1.53)
64         90.14(0.22)    92.50(1.36)    98.25(0.51)    98.43(0.22)    96.20(2.24)

Satimage-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          94.44(0.16)    95.79(0.09)    98.12(0.11)    97.81(0.02)    97.02(0.03)
4          93.90(0.25)    96.36(0.08)    97.92(0.19)    97.45(0.10)    96.99(0.08)
8          93.66(0.22)    95.60(0.18)    97.33(0.27)    96.82(0.25)    96.89(0.13)
16         92.54(0.52)    95.03(0.28)    96.27(0.39)    96.82(0.37)    96.16(0.34)
32         90.32(1.95)    94.71(0.24)    95.44(0.26)    96.87(0.40)    96.06(2.31)
64         89.77(1.73)    94.20(0.85)    95.12(0.29)    96.30(0.76)    95.31(1.30)
Table 6
Wilcoxon signed ranks test results for binary classification.

proportion-NPSVM1        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2105     106      0.000000           2046     165      0.000000
vs. InvCal               2124     87       0.000000           1912.5   298.5    0.000000
vs. alter-∝SVM           1947     264      0.000001           1780     431      0.000017

proportion-NPSVM2        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2106     105      0.000000           2066.5   144.5    0.000000
vs. InvCal               2120     91       0.000000           2048     163      0.000000
vs. alter-∝SVM           1882     329      0.000001           1762.5   448.5    0.000027
Table 7
Wilcoxon signed ranks test results for multi-class classification.

proportion-NPSVM1        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2975     106      0.000000           2698.5   382.5    0.000000
vs. InvCal               3077     4        0.000000           2581.5   499.5    0.000000
vs. alter-∝SVM           2635.5   445.5    0.000001           1949     1132     0.011900

proportion-NPSVM2        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2956     125      0.000000           3075     6        0.0000000
vs. InvCal               2750     331      0.000000           2935     146      0.0000000
vs. alter-∝SVM           2525.5   555.5    0.000001           2324     757      0.0000008
Appendix A. Supplementary material

A.1. The Wolfe dual problems of NPSVM

In this section, the derivation of the Wolfe dual problems of NPSVM is presented, following [28]. As mentioned in Section 2, the primal problems of NPSVM are

$$
\begin{aligned}
\min_{w_+,\,b_+,\,\eta_+,\,\xi_-}\quad & \frac{1}{2}c_3\left(\|w_+\|^2+b_+^2\right)+\frac{1}{2}\eta_+^T\eta_+ + c_1 e_-^T\xi_-\\
\text{s.t.}\quad & A w_+ + e_+ b_+ = \eta_+,\\
& -(B w_+ + e_- b_+) + \xi_- \ge e_-,\ \ \xi_- \ge 0
\end{aligned}
\tag{26}
$$

and

$$
\begin{aligned}
\min_{w_-,\,b_-,\,\eta_-,\,\xi_+}\quad & \frac{1}{2}c_4\left(\|w_-\|^2+b_-^2\right)+\frac{1}{2}\eta_-^T\eta_- + c_2 e_+^T\xi_+\\
\text{s.t.}\quad & B w_- + e_- b_- = \eta_-,\\
& (A w_- + e_+ b_-) + \xi_+ \ge e_+,\ \ \xi_+ \ge 0
\end{aligned}
\tag{27}
$$

where $A=(x_1,\ldots,x_p)^T\in R^{p\times n}$ and $B=(x_{p+1},\ldots,x_{p+q})^T\in R^{q\times n}$ denote the positive and negative data matrices respectively, $c_i\ (i=1,2,3,4)$ are the penalty parameters, $e_+$ and $e_-$ are vectors of ones of appropriate dimensions, $\xi_+$ and $\xi_-$ are slack vectors of appropriate dimensions, and $\eta_+$ and $\eta_-$ are vectors of appropriate dimensions. To form the dual problems, we introduce the Lagrangian function of problem (26):

$$
\begin{aligned}
L(w_+,b_+,\eta_+,\xi_-,\alpha,\beta,\lambda)=\ & \frac{1}{2}c_3\left(\|w_+\|^2+b_+^2\right)+\frac{1}{2}\eta_+^T\eta_+ + c_1 e_-^T\xi_-\\
& +\lambda^T(A w_+ + e_+ b_+ - \eta_+)+\alpha^T(B w_+ + e_- b_+ - \xi_- + e_-)-\beta^T\xi_-
\end{aligned}
\tag{28}
$$

in which the vectors $\alpha=(\alpha_1,\ldots,\alpha_q)^T$, $\beta=(\beta_1,\ldots,\beta_q)^T$ and $\lambda=(\lambda_1,\ldots,\lambda_p)^T$ are the Lagrange multipliers. Then the Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions of problem (26) are given by

$$ c_3 w_+ + A^T\lambda + B^T\alpha = 0, \tag{29} $$
$$ c_3 b_+ + e_+^T\lambda + e_-^T\alpha = 0, \tag{30} $$
$$ \lambda - \eta_+ = 0, \tag{31} $$
$$ c_1 e_- - \alpha - \beta = 0, \tag{32} $$
$$ A w_+ + e_+ b_+ = \eta_+, \tag{33} $$
$$ -(B w_+ + e_- b_+) + \xi_- \ge e_-,\quad \xi_- \ge 0, \tag{34} $$
$$ \alpha^T(B w_+ + e_- b_+ - \xi_- + e_-) = 0,\quad \beta^T\xi_- = 0, \tag{35} $$
$$ \alpha \ge 0,\quad \beta \ge 0. \tag{36} $$

Since $\beta \ge 0$, condition (32) implies

$$ 0 \le \alpha \le c_1 e_-. \tag{37} $$

We can also obtain $w_+$ and $b_+$ by solving (29) and (30), respectively:

$$ w_+ = -\frac{1}{c_3}\left(A^T\lambda + B^T\alpha\right),\qquad b_+ = -\frac{1}{c_3}\left(e_+^T\lambda + e_-^T\alpha\right). \tag{38} $$
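To make the step to the dual easier to follow, the substitution of (31), (32) and (38) into the Lagrangian (28) can be written out explicitly. This intermediate computation is our own addition; it simply fills in the algebra behind the dual problem stated next:

$$
\begin{aligned}
L \;&=\; \tfrac{1}{2}c_3\!\left(\|w_+\|^2+b_+^2\right) + \tfrac{1}{2}\eta_+^T\eta_+
      + \lambda^T(Aw_+ + e_+b_+ - \eta_+) + \alpha^T(Bw_+ + e_-b_+ + e_-)
      + (c_1e_- - \alpha - \beta)^T\xi_- \\
  &\overset{(31),(32),(38)}{=}\;
      -\tfrac{1}{2c_3}\big\|A^T\lambda + B^T\alpha\big\|^2
      -\tfrac{1}{2c_3}\big(e_+^T\lambda + e_-^T\alpha\big)^2
      -\tfrac{1}{2}\lambda^T\lambda + e_-^T\alpha .
\end{aligned}
$$

Multiplying by $c_3$, which does not change the maximizer, yields the objective of (39) with $Q$ as given in (40).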
Then, by using conditions (31) and (38), the Wolfe dual of the primal problem (26) is obtained as

$$
\begin{aligned}
\max_{\lambda,\,\alpha}\quad & -\frac{1}{2}\,(\lambda^T\ \alpha^T)\,Q\,(\lambda^T\ \alpha^T)^T + c_3\, e_-^T\alpha\\
\text{s.t.}\quad & 0 \le \alpha \le c_1 e_-
\end{aligned}
\tag{39}
$$

where

$$
Q=\begin{pmatrix} AA^T + c_3 I & AB^T\\ BA^T & BB^T \end{pmatrix} + E
\tag{40}
$$

and $I\in R^{p\times p}$ is an identity matrix and $E\in R^{l\times l}$ is the matrix with all elements equal to 1. Similarly, the Wolfe dual problem of (27) is

$$
\begin{aligned}
\max_{\theta,\,\gamma}\quad & -\frac{1}{2}\,(\theta^T\ \gamma^T)\,\hat{Q}\,(\theta^T\ \gamma^T)^T + c_4\, e_+^T\gamma\\
\text{s.t.}\quad & 0 \le \gamma \le c_2 e_+
\end{aligned}
\tag{41}
$$

where

$$
\hat{Q}=\begin{pmatrix} BB^T + c_4 I & BA^T\\ AB^T & AA^T \end{pmatrix} + E
\tag{42}
$$

and $I\in R^{q\times q}$ is an identity matrix and $E\in R^{l\times l}$ is the matrix with all elements equal to 1. In conclusion, Eqs. (39) and (41) are the Wolfe dual problems of the primal NPSVM.
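As a numerical illustration of how the box-constrained dual (39) and the recovery formula (38) can be used, the following sketch builds $Q$ as in (40) and runs plain projected gradient ascent. This is our own stand-in for the SOR/SMO solvers referred to in the paper, with hypothetical names such as solve_dual_39; it is not the authors' implementation.

```python
import numpy as np

def solve_dual_39(A, B, c1, c3, steps=2000, lr=None):
    """Projected gradient ascent on the Wolfe dual (39):
       max_{lambda, alpha} -0.5 z^T Q z + c3 * e_-^T alpha,
       with z = [lambda; alpha], lambda free and 0 <= alpha <= c1,
       and Q built as in (40)."""
    p, q = A.shape[0], B.shape[0]
    l = p + q
    Q = np.block([[A @ A.T + c3 * np.eye(p), A @ B.T],
                  [B @ A.T,                  B @ B.T]]) + np.ones((l, l))
    grad_lin = np.concatenate([np.zeros(p), c3 * np.ones(q)])  # gradient of the linear term
    z = np.zeros(l)
    if lr is None:
        lr = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)               # safe step size
    for _ in range(steps):
        z = z + lr * (grad_lin - Q @ z)                          # ascent step on the concave objective
        z[p:] = np.clip(z[p:], 0.0, c1)                          # project alpha onto the box [0, c1]
    lam, alpha = z[:p], z[p:]
    # Recover the primal variables of the first hyperplane via (38).
    w_plus = -(A.T @ lam + B.T @ alpha) / c3
    b_plus = -(lam.sum() + alpha.sum()) / c3
    return w_plus, b_plus

rng = np.random.default_rng(0)
A = rng.normal(loc=+1.0, size=(30, 2))   # positive-class data matrix (toy example)
B = rng.normal(loc=-1.0, size=(30, 2))   # negative-class data matrix (toy example)
w_plus, b_plus = solve_dual_39(A, B, c1=1.0, c3=1.0)
```

The dual of (27) can be handled the same way after swapping the roles of $A$ and $B$ and of the penalty parameters.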
A.2. The transformation of the problems in the alternating solving stage

When $w_+$, $w_-$, $b_+$, $b_-$ are fixed, the primal problems (18) and (19) become

$$
\begin{aligned}
\min_{y,\,\eta_+,\,\xi_-}\quad & \frac{1}{2}\eta_+^T\eta_+ + c_1 e_-^T\xi_- + \frac{c_p}{2}\, e^T L_p(y)\\
\text{s.t.}\quad & A_y w_+ + e_+ b_+ = \eta_+,\\
& -\left(B_y w_+ + e_- b_+\right) + \xi_- \ge e_-,\ \ \xi_- \ge 0,\\
& y_i \in \{-1,+1\},\ i=1,\ldots,N
\end{aligned}
\tag{43}
$$

and

$$
\begin{aligned}
\min_{y,\,\eta_-,\,\xi_+}\quad & \frac{1}{2}\eta_-^T\eta_- + c_2 e_+^T\xi_+ + \frac{c_p}{2}\, e^T L_p(y)\\
\text{s.t.}\quad & B_y w_- + e_- b_- = \eta_-,\\
& \left(A_y w_- + e_+ b_-\right) + \xi_+ \ge e_+,\ \ \xi_+ \ge 0,\\
& y_i \in \{-1,+1\},\ i=1,\ldots,N
\end{aligned}
\tag{44}
$$

According to the relationship between (43) and (44), for each point $x_i$ two decision values $f_i^+$ and $f_i^-$ can be calculated from the model obtained before:

$$ f_i^+ = w_+^T\phi(x_i) + b_+,\qquad f_i^- = w_-^T\phi(x_i) + b_-. \tag{45} $$

Then the loss function for problems (43) and (44) is

$$
L_i=\begin{cases} \max(1-f_i^+,\,0), & \text{if } y_i=+1,\\[2pt] \max(1+f_i^-,\,0), & \text{if } y_i=-1. \end{cases}
\tag{46}
$$

Actually, solving (43) and (44) amounts to searching for the label vector $y$ that yields the least loss. Therefore, without changing the results, the two decision values $f_i^+$ and $f_i^-$ can be integrated into a single value $F_i=f_i^- - f_i^+$. Based on the analysis above, the optimization problem mentioned above can be transformed into

$$
\begin{aligned}
\min_{y}\quad & \sum_{i=1}^{N} L_F\!\left(y_i,\ w_+^T\phi(x_i)+b_+,\ w_-^T\phi(x_i)+b_-\right) + \frac{c_p}{c}\sum_{k=1}^{K} L_p(y)\\
\text{s.t.}\quad & y_i \in \{-1,+1\},\ i=1,\ldots,N
\end{aligned}
\tag{47}
$$

where $L_F=\max(1-y_i F_i,\,0)$ and $F_i=f_i^- - f_i^+ = \left(w_-^T\phi(x_i)+b_-\right) - \left(w_+^T\phi(x_i)+b_+\right)$.

A.3. Proof of Proposition 1

Proposition 1. For a fixed $p_k(y)$, problem (21) can be optimized by the strategy mentioned above.

Proof. Note that $y_i\ (\forall i\in B_k)$ has an independent influence on the first term of the objective function of (21). Without loss of generality, assume $B_k=\{1,2,\ldots,|B_k|\}$. Let $\sigma=p_k$, and let $\delta_i^{*}$ denote the sorted $\delta_i$, satisfying $\delta_1^{*}\ge\delta_2^{*}\ge\ldots\ge\delta_{|B_k|}^{*}$. The label proportion can be satisfied by flipping $\sigma|B_k|$ of the $y_i$'s signs. For (21), we want to solve the following problem:

$$
\min_{B_k^{+}}\quad \sum_{i\in B_k^{-}}\delta_i-\sum_{i\in B_k^{+}}\delta_i
\qquad \text{s.t.}\quad |B_k^{+}|=\sigma|B_k|,
\tag{48}
$$

where $B_k^{+}=\{i\,|\,y_i=1,\ i\in B_k\}$ and $B_k^{-}=\{i\,|\,y_i=-1,\ i\in B_k\}$.

It can be easily verified, by contradiction, that $B_k^{+}=\{1,2,\ldots,\sigma|B_k|\}$ is the optimal solution of problem (48). Indeed, assume that there exist $B_k^{+*}$ and $B_k^{-*}$ satisfying the constraints $|B_k^{+*}|=\sigma|B_k|$, $B_k^{+*}\cup B_k^{-*}=B_k$ and $B_k^{+*}\cap B_k^{-*}=\emptyset$, but $B_k^{+*}\neq\{1,2,\ldots,\sigma|B_k|\}$ with a strictly smaller objective value. Then we would have

$$
\sum_{i\in B_k^{-*}}\delta_i^{*}-\sum_{i\in B_k^{+*}}\delta_i^{*}
-\left(\sum_{i=\sigma|B_k|+1}^{|B_k|}\delta_i^{*}-\sum_{i=1}^{\sigma|B_k|}\delta_i^{*}\right)<0.
\tag{49}
$$

However, since $\sum_{i\in B_k^{+*}}\delta_i^{*}-\sum_{i=1}^{\sigma|B_k|}\delta_i^{*}\le 0$ and $\sum_{i=\sigma|B_k|+1}^{|B_k|}\delta_i^{*}-\sum_{i\in B_k^{-*}}\delta_i^{*}\le 0$, inequality (49) cannot hold. That is to say, $B_k^{+}=\{1,2,\ldots,\sigma|B_k|\}$ is the optimal solution of problem (48). The proposition is proved.
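The label-update step described in A.2 and justified by Proposition 1 can be sketched as follows. This is our own illustration, not the authors' code: it assumes linear decision values for brevity and takes $\delta_i = L_i(-1) - L_i(+1)$ (the per-sample preference for the $+1$ label implied by (46)) as the concrete form of $\delta_i$, whose definition is given earlier in the paper; the function name assign_bag_labels is hypothetical.

```python
import numpy as np

def assign_bag_labels(X, bags, proportions, w_plus, b_plus, w_minus, b_minus):
    """One y-step of the alternating scheme: with both hyperplanes fixed,
    choose labels inside each bag so that the bag's positive proportion is
    respected while the hinge losses in (46) stay as small as possible."""
    f_plus = X @ w_plus + b_plus                 # decision values of the '+' hyperplane
    f_minus = X @ w_minus + b_minus              # decision values of the '-' hyperplane
    loss_pos = np.maximum(1.0 - f_plus, 0.0)     # L_i if y_i = +1
    loss_neg = np.maximum(1.0 + f_minus, 0.0)    # L_i if y_i = -1
    delta = loss_neg - loss_pos                  # assumed delta_i: gain of choosing +1
    y = -np.ones(len(X))
    for bag, p_k in zip(bags, proportions):
        bag = np.asarray(bag)
        n_pos = int(round(p_k * len(bag)))       # sigma * |B_k| positives in this bag
        order = np.argsort(-delta[bag])          # sort delta_i in descending order
        y[bag[order[:n_pos]]] = +1.0             # the largest delta_i receive label +1
    return y

# Example call, reusing quantities produced elsewhere (bags, proportions and the
# two fixed hyperplanes): y_new = assign_bag_labels(Xb, bags, props,
#                                                   w_plus, b_plus, w_minus, b_minus)
```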
References

[1] J. Hernández-González, I. Inza, J.A. Lozano, Learning Bayesian network classifiers from label proportions, Pattern Recognit. 46 (12) (2013) 3425–3440.
[2] N. Quadrianto, A.J. Smola, T.S. Caetano, Q.V. Le, Estimating labels from label proportions, J. Mach. Learn. Res. 10 (2009) 2349–2374.
[3] M. Stolpe, K. Morik, Learning from label proportions by optimizing cluster model selection, in: G. Dimitrios, H. Thomas, M. Donato, V. Michalis (Eds.), Proceedings of Machine Learning and Knowledge Discovery in Databases, 2011, pp. 349–364.
[4] F.X. Yu, K. Choromanski, S. Kumar, T. Jebara, S.-F. Chang, On learning with label proportions, Statistics 1050 (2014) 24–36.
[5] T. Chen, F.X. Yu, J. Chen, Y. Cui, Y.-Y. Chen, S.-F. Chang, Object-based visual sentiment concept analysis and application, in: K. Hua, Y. Rui, R. Steinmetz, et al. (Eds.), Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 367–376.
[6] F.X. Yu, L. Cao, M. Merler, N. Codella, T. Chen, J.R. Smith, S.-F. Chang, Modeling attributes from category-attribute proportions, in: K. Hua, Y. Rui, R. Steinmetz, et al. (Eds.), Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 977–980.
[7] K.-T. Lai, X.Y. Felix, M.-S. Chen, S.-F. Chang, Video event detection by inferring temporal instance labels, in: S. Dickinson, D. Metaxas (Eds.), Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 2251–2258.
[8] C.-F. Tsai, Training support vector machines based on stacked generalization for image classification, Neurocomputing 64 (2005) 497–503.
[9] Z. Qi, Y. Tian, Y. Shi, Robust twin support vector machine for pattern classification, Pattern Recognit. 46 (1) (2013) 305–316.
[10] Z. Qi, Y. Tian, Y. Shi, Successive overrelaxation for Laplacian support vector machine, IEEE Trans. Neural Netw. Learn. Syst. 26 (4) (2015) 674–683.
[11] F. Moosmann, E. Nowak, F. Jurie, Randomized clustering forests for image classification, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1632–1646.
[12] X. Pan, Y. Luo, Y. Xu, K-nearest neighbor based structural twin support vector machine, Knowl. Based Syst. 88 (2015) 34–44.
[13] Y. Xu, J. Yu, Y. Zhang, KNN-based weighted rough ν-twin support vector machine, Knowl. Based Syst. 71 (2014) 303–313.
[14] Y. Xu, L. Wang, A weighted twin support vector regression, Knowl. Based Syst. 33 (2012) 92–101.
[15] X. Peng, R. Yan, B. Zhao, H. Tang, Z. Yi, Fast low rank representation based spatial pyramid matching for image classification, Knowl. Based Syst. 90 (2015) 14–22.
[16] M. Berthod, Z. Kato, S. Yu, J. Zerubia, Bayesian image classification using Markov random fields, Image Vis. Comput. 14 (4) (1996) 285–295.
[17] Y. Xu, Z. Yang, Y. Zhang, X. Pan, L. Wang, A maximum margin and minimum volume hyper-spheres machine with pinball loss for imbalanced data classification, Knowl. Based Syst. 95 (2016) 75–85.
[18] S. Rüping, SVM classifier estimation from group probabilities, in: S. Kim, E.P. Xing (Eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 911–918.
[19] F. Yu, D. Liu, S. Kumar, T. Jebara, S.-F. Chang, ∝SVM for learning with label proportions, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 504–512.
[20] B. Wang, Z. Chen, Z. Qi, Linear twin SVM for learning from label proportions, in: A. Tan, Y. Li (Eds.), Proceedings of the 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 3, IEEE, 2015, pp. 56–59.
[21] B.C. Chen, L. Chen, R. Ramakrishnan, D.R. Musicant, Learning from aggregate views, in: L. Liu, A. Reuter, K.Y. Whang, J. Zhang (Eds.), Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006, pp. 3–12.
[22] D.R. Musicant, J.M. Christensen, J.F. Olson, Supervised learning by training on aggregate outputs, in: N. Ramakrishnan, O.R. Zaïane, Y. Shi, C.W. Clifton, X. Wu (Eds.), Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 252–261.
[23] H. Kuck, N. de Freitas, Learning about individuals from group statistics, in: D. Braziunas, C. Boutilier (Eds.), Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005, pp. 332–339.
[24] G.S. Mann, A. McCallum, Simple, robust, scalable semi-supervised learning via expectation regularization, in: Z. Ghahramani (Ed.), Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 593–600.
[25] K. Bellare, G. Druck, A. McCallum, Alternating projections for learning with expectation constraints, in: J. Bilmes, A.Y. Ng (Eds.), Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 43–50.
[26] J. Gillenwater, K. Ganchev, J. Graça, F. Pereira, B. Taskar, Posterior sparsity in unsupervised dependency parsing, J. Mach. Learn. Res. 12 (2011) 455–490.
[27] Y.F. Li, J.T. Kwok, Z.-H. Zhou, Semi-supervised learning using label mean, in: L. Bottou, M. Littman (Eds.), Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 633–640.
[28] Y. Tian, Z. Qi, X. Ju, Y. Shi, X. Liu, Nonparallel support vector machines for pattern classification, IEEE Trans. Cybern. 44 (7) (2014) 1067–1079.
[29] R. Khemchandani, S. Chandra, et al., Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[30] Y.-H. Shao, C.-H. Zhang, X.-B. Wang, N.-Y. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968.
[31] C.-F. Lin, S.-D. Wang, Fuzzy support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 464–471.
[32] R. Khemchandani, S. Chandra, et al., Fast and robust learning through fuzzy linear proximal support vector machines, Neurocomputing 61 (2004) 401–411.
[33] O. Chapelle, V. Sindhwani, S.S. Keerthi, Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. 9 (2008) 203–233.
[34] K. Bache, M. Lichman, UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.