Knowledge-Based Systems (2016) 1–16
Learning with label proportions based on nonparallel support vector machines
Zhensong Chen a,c, Zhiquan Qi b,c,∗, Bo Wang b,c, Limeng Cui c,d, Fan Meng a,c, Yong Shi b,c,∗
a School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
b Research Center on Fictitious Economy and Data Science, CAS, Beijing 100190, China
c Key Laboratory of Big Data Mining and Knowledge Management, CAS, Beijing 100190, China
d School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
Article info
Article history: Received 1 April 2016; Revised 1 December 2016; Accepted 3 December 2016; Available online xxx.
Keywords: Learning with label proportion; Nonparallel SVM; Proportion-NPSVM; Large-margin
Abstract: Learning a classifier from groups of unlabeled data, knowing only, for each group, the proportions of data with particular labels, is an important branch of classification tasks that arises in many practical applications. In this paper, we propose a novel solution for the problem of learning with label proportions (LLP) based on nonparallel support vector machines, termed proportion-NPSVM, which learns a pair of nonparallel classification hyperplanes. A unique property of our method is that it only needs to solve a pair of smaller quadratic programming problems. Moreover, it can efficiently incorporate the known group label proportions and the latent unknown observation labels into one optimization model under a large-margin framework. Compared to existing approaches, it has several advantages: 1) it does not need to make restrictive assumptions on the training data; 2) nonparallel classifiers can be obtained without computing large inverse matrices; 3) the optimization model can be effectively solved by an alternating strategy with the SMO technique or the SOR method; 4) proportion-NPSVM has better generalization ability. Extensive experimental results on both binary-class and multi-class data sets show the effectiveness of our proposed method in classification accuracy, demonstrating state-of-the-art performance for LLP problems compared with competing algorithms. © 2016 Published by Elsevier B.V.
1. Introduction
The problem of learning with label proportions (LLP) has been intensively studied in recent years. In this learning setting, the training instances are groups of unlabeled observations, and only the proportion of observations with a particular label is known for each group. The task is to learn a mapping from individual observations to their labels. Fig. 1 shows the differences between LLP and other learning problems. Solutions of LLP have real-world applications; in fact, many modern intelligent applications can be abstracted as LLP problems. For example: 1) Political elections [1]. In a political election, voters can be divided into always-favorable voters and swing voters. Politicians want to use their limited resources to achieve the largest gains, so they want to identify which category each voter belongs to according to
∗ Corresponding authors at: Research Center on Fictitious Economy and Data Science, CAS, Beijing 100190, China. E-mail addresses: [email protected] (Z. Qi), [email protected] (Y. Shi).
the voter profiles and the proportions of favorable voters directly revealed by previous elections; 2) Spam filtering [2]. In spam filtering, the emails consist of almost pure spam (listed in the spam box) and a mix of spam and non-spam (shown in the inboxes). We would like to improve the estimation of spam based on the proportions of spam and non-spam in a user's inbox, which are much cheaper to obtain than the actual labels; 3) Quality control [3]. Suppose there is a steel factory in which charges of steel sticks are processed sequentially at several production stations. We would like to assess the quality of each stick before it reaches the final production stage, according to information related to charges of sticks (not each single stick) collected during processing. Since sticks that do not reach the desired quality can be locked out, this helps to save resources; and there are many more examples [4–7]. Here, to give a clear understanding of LLP problems, let us consider an example of image classification, which is a basic component of large-scale image annotation and image retrieval. As is known, image classification refers to the procedure of classifying images into different categories based on their contained
Fig. 1. Different types of learning problems (colors denote the class labels). (a) supervised learning: all the training instances are labeled; (b) semi-supervised learning: parts of the training instances are labeled and the others are unlabeled; (c) unsupervised learning: all the training instances are unlabeled and (d) learning with label proportions: only knowing the proportion of instances with a particular label for each group. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
objects. Traditionally, supervised learning algorithms, such as support vector machines (SVMs) [8–10], random forests [11], twin support vector machines (TWSVMs) [12–14], spatial pyramid matching (SPM) and its variants (e.g., LrrSPM) [15], and other methods [16,17], are the preferred approaches for learning classifiers from labeled training samples. However, these supervised learning processes need a large amount of labeled training data, which is quite inefficient and labor-intensive to obtain, especially in the big data era with the rapid growth of data generation. Compared with strong supervision, in which all the training instances are accurately labeled, label proportions can be obtained accurately and efficiently. Therefore, this classification problem can be solved by transforming it into an LLP process, and Fig. 2 gives an illustration of this procedure. As shown there, an unlabeled image data set is given and then split into K disjoint groups. The proportions of images belonging to different categories are the only information provided for each group. The target is to estimate a classifier that can predict the class label for each individual image, including both observed samples and new ones. To address this learning scenario, in this paper we contribute a large-margin approach for LLP problems based on the nonparallel support vector machine, termed proportion-NPSVM, which efficiently models the unknown labels as latent variables by incorporating label proportions into the classification algorithm. Compared to the existing methods, which include MeanMap [2],
Fig. 2. An example of the LLP process for image classification. The original image data set is given without any labels. Then, these image samples are divided into K disjoint groups. Only the proportions of images with a particular label are known for each group. The target is to estimate a classifier to predict the class label for each individual image.
Inverse Calibration [18], ∝SVM [19] and Twin alter-∝SVM [20], proportion-NPSVM has some unique properties:
– It does not need to make restrictive assumptions on the training data.
– It only needs to solve two smaller problems.
– It does not need to compute large inverse matrices before training.
– It achieves nonparallel classification hyperplanes.
– It can be solved by an alternating strategy with the successive overrelaxation (SOR) method or the sequential minimal optimization (SMO) technique.
To the best of our knowledge, no other method exists yet which shares all of these properties. In summary, by incorporating all these properties, the advantages of our method are: 1) better generalization ability compared with ∝SVM [19], which makes proportion-NPSVM achieve better classification accuracy; 2) it is faster than Twin alter-∝SVM [20], since our method does not need to compute large inverse matrices before training and can be solved by the alternating strategy with the SOR method or the SMO technique. The main contributions of this paper are summarized as follows: 1) we develop two algorithms based on the Nonparallel Support Vector Machine (NPSVM) to learn SVM classifiers for LLP problems, which share all the properties mentioned above; 2) we introduce a novel solving strategy from the perspective of hyperplane clustering (Algorithm 2), which is highly competitive with the solving tactic (Algorithm 1) usually used in other label-proportion learning methods based on SVMs; 3) we enrich and improve the framework for the application of SVMs to LLP problems.
The rest of this paper is structured as follows. In Section 2, we review the existing works for LLP problems proposed in recent years and the nonparallel SVM model for classification tasks, respectively. The proportion-NPSVM algorithm and its solving strategy are presented in Section 3, and a discussion is given in Section 4. Finally, experimental results and concluding remarks are presented in Section 5 and Section 6, respectively.
2. Preliminaries
During the past decade, a number of approaches for the LLP problem have been developed. In the following, we first review existing methods such as MeanMap, inverse calibration (InvCal) and proportion-SVM (∝SVM). Then, we introduce the nonparallel support vector machine (NPSVM), which was developed for binary classification tasks.
2.1. Related work
The LLP problem has only recently attracted great attention in the fields of machine learning and computer vision. Chen et al. [21] introduced the problem of learning from aggregate views, and Musicant et al. [22] formally described this new learning problem for classification and regression tasks; they then applied k-nearest neighbors, neural networks, and SVMs to solve it. Inspired by the multiple-instance (MI) learning process, Kuck and de Freitas [23] designed a principled probabilistic model to learn the relationship between the properties of individual instances and their binary labels. Quadrianto et al. [2] proposed the MeanMap approach, which can reconstruct the correct labels with high probability in a uniform convergence sense; however, this algorithm relies on the restrictive assumption that the instances are conditionally independent given the label. Rüping [18] presented a large-margin approach which learns a classifier from group probabilities based on support vector regression (SVR) and inverse classifier calibration. This method has the basic assumption that the mean instance of each bag should have a soft label corresponding to the label proportion. Stolpe et al. [3] defined the problem of LLP and contributed a solution based on clustering with label proportions. Hernández [1] proposed learning Bayesian network classifiers, based on the structural expectation maximization (EM) strategy, to handle LLP classification tasks. In addition, for the semi-supervised learning scenario, Mann [24] and Bellare et al. [25] proposed an effective and easy-to-implement method that matches the given proportions by using an expectation regularization term to encourage model predictions on the unlabeled data. Similarly, Gillenwater et al. [26] explored a generalized regularization method, and Li et al. [27] developed Semi-Supervised Support Vector Machines (S3VMs), which are typically used to directly estimate the labels of the unlabeled instances, by incorporating the label means of the unlabeled instances. Following this line of work, a recent paper [19] proposed a method called proportion-SVM (∝SVM), which explicitly incorporates the known group label proportions and the latent unknown instance labels into the same model under a large-margin framework.
2.2. Nonparallel SVM algorithm
In this section, we present the nonparallel support vector machine model introduced in [28]. Consider the binary classification task with the training set

T = {(x1, +1), . . . , (xp, +1), (x_{p+1}, −1), . . . , (x_{p+q}, −1)},    (1)

where xi ∈ R^n, i = 1, . . . , p + q. To solve this problem, many approaches have been proposed, among which SVMs and their variants are the most popular and powerful tools. In particular, Tian et al. [28] proposed the nonparallel support vector machine (NPSVM), built on the twin support vector machine (TWSVM) [29] and the twin bounded support vector machine (TBSVM) [30], which has achieved great performance. This NPSVM algorithm inherits the essence of the SVMs and has several advantages compared with TWSVM and TBSVM. For the linear case, NPSVM seeks a pair of nonparallel hyperplanes

(w+ · x) + b+ = 0   and   (w− · x) + b− = 0,    (2)

such that each hyperplane is proximal to the data points of one class and as far as possible from the other, where w+ ∈ R^n, w− ∈ R^n, b+ ∈ R, b− ∈ R are obtained by solving the following pair of quadratic programming problems (QPPs),
min_{w+, b+, η+, ξ−}   (1/2) c3 (||w+||^2 + b+^2) + (1/2) η+^T η+ + c1 e−^T ξ−
s.t.   A w+ + e+ b+ = η+,
       −(B w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,    (3)

and

min_{w−, b−, η−, ξ+}   (1/2) c4 (||w−||^2 + b−^2) + (1/2) η−^T η− + c2 e+^T ξ+
s.t.   B w− + e− b− = η−,
       (A w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,    (4)

in which A = (x1, . . . , xp)^T ∈ R^{p×n} and B = (x_{p+1}, . . . , x_{p+q})^T ∈ R^{q×n} denote the positive and negative data matrices respectively, ci (i = 1, 2, 3, 4) are penalty parameters, e+ and e− are vectors of ones of appropriate dimensions, ξ+ and ξ− are slack vectors of appropriate dimensions, and η+ and η− are vectors of appropriate dimensions.
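For concreteness, the following is a minimal sketch of how problem (3) can be set up and solved with a generic quadratic programming solver (cvxopt is our choice here, not the paper's; the paper solves these problems through their duals with SOR or SMO). The variable η+ is eliminated via the equality constraint, and all function and variable names are ours.

```python
import numpy as np
from cvxopt import matrix, solvers

def npsvm_plus_plane(A, B, c1=1.0, c3=1.0):
    """Sketch of problem (3), the '+' hyperplane of linear NPSVM.
    A: (p, n) positive-class matrix, B: (q, n) negative-class matrix.
    eta_+ is eliminated through eta_+ = A w + e b; the QP variable is
    z = [w, b, xi]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    p, n = A.shape
    q_neg = B.shape[0]
    m = n + 1 + q_neg
    M = np.hstack([A, np.ones((p, 1))])             # [A, e_+]
    P = np.zeros((m, m))
    P[:n + 1, :n + 1] = M.T @ M + c3 * np.eye(n + 1)
    qv = np.hstack([np.zeros(n + 1), c1 * np.ones(q_neg)])
    # constraints: (B w + e b) - xi <= -e   and   -xi <= 0
    G = np.zeros((2 * q_neg, m))
    G[:q_neg, :n] = B
    G[:q_neg, n] = 1.0
    G[:q_neg, n + 1:] = -np.eye(q_neg)
    G[q_neg:, n + 1:] = -np.eye(q_neg)
    h = np.hstack([-np.ones(q_neg), np.zeros(q_neg)])
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(qv.reshape(-1, 1)),
                     matrix(G), matrix(h.reshape(-1, 1)))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]                              # w_+, b_+
```

Problem (4) is handled symmetrically by exchanging the roles of A and B (and of c1, c3 with c2, c4).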
Fig. 3. The illustration of standard SVM and linear NPSVM on two-dimensional data. (a) The standard SVM searches for two parallel classification hyperplanes with the maximal width between them. (b) The NPSVM seeks two nonparallel proximal hyperplanes, each of which is closer to one class and as far as possible from the other.

The corresponding Wolfe dual problems of (3) and (4) are

max_{λ, α}   −(1/2) (λ^T α^T) Q̃ (λ^T α^T)^T + c3 e−^T α
s.t.   0 ≤ α ≤ c1 e−,    (5)

and

max_{θ, γ}   −(1/2) (θ^T γ^T) Q (θ^T γ^T)^T + c4 e+^T γ
s.t.   0 ≤ γ ≤ c2 e+,    (6)

respectively, where

Q̃ = [ AA^T + c3 I    AB^T
       BA^T           BB^T ] + E,
Q = [ BB^T + c4 I    BA^T
      AB^T           AA^T ] + E,    (7)

I is the identity matrix of appropriate dimension, and E is the l × l matrix with all entries equal to 1. After solving the dual problems (5) and (6), the solutions of the primal problems (3) and (4) can be obtained by

w+ = −(A^T λ* + B^T α*),   b+ = −(e+^T λ* + e−^T α*),    (8)

and

w− = −(B^T θ* + A^T γ*),   b− = −(e−^T θ* + e+^T γ*).    (9)

Thus an unknown point x ∈ R^n is assigned to a class through the following rule:

Class = arg min_{k = −, +} |(wk · x) + bk|,    (10)

where | · | denotes the perpendicular distance of point x from the planes (wk · x) + bk = 0, k = −, +. Fig. 3 shows an illustration of the difference between the standard SVM and linear NPSVM on two-dimensional data.

3. The proportion-NPSVM algorithm
According to the descriptions above, we propose a novel classification method for LLP problems, termed proportion-NPSVM, which learns a pair of nonparallel classification hyperplanes. In this section, we first formulate the proportion-NPSVM framework and then introduce the optimization strategies for solving this classification model.

3.1. The proportion-NPSVM framework
Let us consider the following binary learning problem, where the training data set {xi}_{i=1}^N is given in the form of K disjoint bags:

{xi | i ∈ Bk}_{k=1}^K,   ∪_{k=1}^K Bk = {1, . . . , N},   Bk ∩ Bl = ∅ (∀k ≠ l).    (11)

For each bag, only the proportion of instances belonging to a particular class is known. Let pk denote the label proportion of the positive class in the k-th bag; we have

pk = |{i | i ∈ Bk, y*_i = 1}| / |Bk|,   ∀k ∈ {1, . . . , K},    (12)

where y*_i ∈ {−1, +1} is the unknown ground-truth label of xi. This is the mathematical description of LLP, and the task is to learn a model to predict the class labels of the individual instances. To address this classification problem, many approaches have been proposed, as stated in Section 2.1, among which ∝SVM outperforms the others in experimental results. According to [19], the ∝SVM model is formulated as follows:

min_{y, w, b}   (1/2) w^T w + C Σ_{i=1}^N L(yi, w^T φ(xi)) + Cp Σ_{k=1}^K Lp(p̃k(y), pk)
s.t.   ∀_{i=1}^N,   yi (w^T · xi) ≥ 1,   yi ∈ {−1, 1},    (13)

in which L( · ) ≥ 0 is the loss function for traditional supervised learning and Lp( · ) ≥ 0 is a function to penalize the difference between the approximately calculated label proportions p̃k(y) and the real label proportions. Furthermore, in order to provide theoretical support, a general framework called Empirical Proportion Risk Minimization (EPRM) was also introduced in [4]. As elaborated in that work, EPRM selects the instance label hypothesis h ∈ H that minimizes the empirical bag proportion loss on the training set S. It can be expressed as follows:

arg min_{h ∈ H}   Σ_{(x̄, f(ȳ)) ∈ S} L(φ_r^f(h)(x̄), f(ȳ)).    (14)

Here, L is a loss function to compute the error between the predicted proportion φ_r^f(h)(x̄) (see Definition 1) and the given proportion f(ȳ). To prove whether the instance labels can be learned by EPRM, that work bounded the generalization error of bag proportions based on the empirical bag proportions and showed that the sample complexity of learning bag proportions is only mildly sensitive to the bag size. It then showed that an instance hypothesis h that achieves a low error on bag proportions with high probability is guaranteed to achieve a low error on instance labels with high probability.

Definition 1. For h ∈ H, define an operator that predicts the bag proportion from the instances, φ_r^f(h) : X^r → R, φ_r^f(h)(x̄) := f(h(x1), . . . , h(xr)), x̄ ∈ X^r, x̄ = (x1, . . . , xr). The hypothesis class on the bags is therefore φ_r^f(H) := {φ_r^f(h) | h ∈ H}.

Even though the validity of the ∝SVM model has been proved, some issues still exist. As shown above, the ∝SVM model needs to solve one large QPP, which has all data points in its constraints. However, in practice, there are often situations in which patterns belonging to one particular class have a more significant influence on the classification task. Traditionally, these biased classification problems can be solved by fuzzy SVMs [31,32], where patterns of the more important class are assigned higher membership values, or by TWSVMs [12–14,29], which handle them by solving only two smaller-sized QPPs. For example, [20] proposed the Twin alter-∝SVM based on the linear TWSVM to solve the LLP problem. Yet, these approaches still have several drawbacks: 1) fuzzy SVMs still have to solve a large QPP, which is quite time-consuming in practice; 2) TWSVMs need to compute inverses of matrices, which is intractable or even impossible for large-scale data sets in real applications; 3) the
primal problems of TWSVMs only consider the empirical risk and do not enjoy the significant advantage of SVMs obtained by implementing the structural risk minimization principle; 4) TWSVMs lose sparseness because they use two loss functions for each class, with the result that all the points from one class and some points from the other contribute to each final decision function (called semi-sparseness in [28]).
Based on the elaboration above, we formulate the proportion-NPSVM model for LLP problems under the large-margin framework, which takes advantage of the nonparallel SVM. The formulations of proportion-NPSVM are as follows:
min_{y, w+, b+, η+, ξ−}   (1/2) c3 (||w+||^2 + b+^2) + (1/2) η+^T η+ + c1 e−^T ξ− + (cp/2) e^T Lp(y)
s.t.   A w+ + e+ b+ = η+,
       −(B w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (15)

and

min_{y, w−, b−, η−, ξ+}   (1/2) c4 (||w−||^2 + b−^2) + (1/2) η−^T η− + c2 e+^T ξ+ + (cp/2) e^T Lp(y)
s.t.   B w− + e− b− = η−,
       (A w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (16)

where e+, e− and e are vectors of ones of appropriate dimension, ξ+ and ξ− are slack vectors of appropriate dimension, and Lp( · ) ≥ 0 is a function to penalize the difference between the true label proportions and the estimated label proportions based on the selection of y ∈ Y, where Y = {y | yi ∈ {−1, +1}}. The aim is to optimize the model parameters w+, w−, b+, b− and the labels y simultaneously. Since kernel functions can be directly applied in the proportion-NPSVM problems, the model is easily extended to the nonlinear case. Introducing the kernel function K(x, x') = (φ(x) · φ(x')), the corresponding transformation is

x̃ = φ(x),    (17)

where x̃ ∈ H and H is the Hilbert space. Then, the corresponding two primal problems in the Hilbert space H can be obtained as

min_{y, w+, b+, η+, ξ−}   (1/2) c3 (||w+||^2 + b+^2) + (1/2) η+^T η+ + c1 e−^T ξ− + (cp/2) e^T Lp(y)
s.t.   Φ(A) w+ + e+ b+ = η+,
       −(Φ(B) w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (18)

and

min_{y, w−, b−, η−, ξ+}   (1/2) c4 (||w−||^2 + b−^2) + (1/2) η−^T η− + c2 e+^T ξ+ + (cp/2) e^T Lp(y)
s.t.   Φ(B) w− + e− b− = η−,
       (Φ(A) w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (19)

in which e+, e− and e are vectors of ones of appropriate dimension, ξ+ and ξ− are slack vectors of appropriate dimension, and Lp( · ) ≥ 0 is the penalty function for the label proportions. Obviously, the linear proportion-NPSVM problems (15) and (16) are recovered from the nonlinear cases (18) and (19) when the linear kernel is applied.

3.2. How to solve proportion-NPSVM
As shown above, the proportion-NPSVM formulation is fairly straightforward and intuitive. However, it is a non-convex integer programming problem, which is NP-hard. Therefore, one key issue of this paper is to find effective strategies to solve these optimization problems. In this paper, we provide two kinds of solving strategies to optimize our classification model.
Firstly, in proportion-NPSVM, similarly to the rules taken in [19,20], the unknown labels y can be seen as a bridge connecting the hinge loss and the label proportion loss. Hence, the first alternating optimization method, shown as follows, can be applied to solve the proportion-NPSVM model.
• For a fixed y, the optimizations (18) and (19) become a primal NPSVM problem. It can be easily solved by the SOR method or the SMO technique.
• When w+, w−, b+, b− are solved, a pair of nonparallel classification hyperplanes can be obtained as (w+ · φ(x)) + b+ = 0 and (w− · φ(x)) + b− = 0.
• After w+, b+ and w−, b− are fixed, the problem can be transformed as follows (see details in the supplementary materials):

min_y   Σ_{i=1}^N LF(yi, w+^T φ(xi) + b+, w−^T φ(xi) + b−) + (cp/c) Σ_{k=1}^K Lp(y)
s.t.   ∀_{i=1}^N,   yi ∈ {−1, +1},    (20)

where LF = max(1 − yi · Fi, 0) and Fi = fi− − fi+ = (w−^T φ(xi) + b−) − (w+^T φ(xi) + b+) for each point xi (i = 1, 2, . . . , N). Because each bag {yi | i ∈ Bk}, ∀_{k=1}^K, is independent, we can efficiently solve the optimization problem (20) on each individual bag separately. Specifically, the optimization problem on each bag can be presented as follows:

min_{yi | i ∈ Bk}   Σ_{i ∈ Bk} LF(yi, w+^T φ(xi) + b+, w−^T φ(xi) + b−) + (cp/c) Lp(y)
s.t.   ∀i ∈ Bk,   yi ∈ {−1, +1}.    (21)
To find the solution of problem (21), we adopt the following optimization strategy:
• Initialize yi = −1, ∀i ∈ Bk.
• Let δi denote the reduction of the first term in (21) obtained by flipping the sign of yi, ∀i ∈ Bk.
• Sort δi, ∀i ∈ Bk, and then choose the top-M yi, which have the highest reduction, to flip their signs.
• For each bag Bk, we only need to sort the δi, ∀i ∈ Bk, once. Then, we can incrementally flip the signs and pick the solution with the smallest objective value, which can be seen as the optimal solution of (21).
The solution of the proportion-NPSVM model can be obtained by alternately solving problems (18), (19) and (20) through the strategies introduced above. The details of this solving approach for proportion-NPSVM can be found in Algorithm 1.
Proposition 1. For a fixed p̃k(y), problem (21) can be optimized by applying the solving strategy mentioned above.
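A minimal sketch of this per-bag update (the first strategy) is given below; it assumes the absolute proportion loss |p̃k(y) − pk| used later in the implementation, and the decision values Fi = f−(xi) − f+(xi) are taken as given. Names are ours, not from the paper's code.

```python
import numpy as np

def update_bag_labels(F, p_k, cp_over_c):
    """Per-bag label update for problem (21), following the flipping strategy
    above. F[i] = f_-(x_i) - f_+(x_i) for the i-th point of the bag, and the
    proportion loss is the absolute loss |p_tilde(y) - p_k|."""
    r = len(F)
    hinge_neg = np.maximum(1.0 + F, 0.0)      # L_F when y_i = -1
    hinge_pos = np.maximum(1.0 - F, 0.0)      # L_F when y_i = +1
    delta = hinge_neg - hinge_pos             # reduction gained by flipping to +1
    order = np.argsort(-delta)                # sort once, largest reduction first
    y = -np.ones(r)
    best_y, best_obj = y.copy(), np.inf
    for m in range(r + 1):                    # flip the top-m signs incrementally
        if m > 0:
            y[order[m - 1]] = 1.0
        obj = (hinge_pos[y > 0].sum() + hinge_neg[y < 0].sum()
               + cp_over_c * abs(m / r - p_k))
        if obj < best_obj:
            best_obj, best_y = obj, y.copy()
    return best_y
```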
Algorithm 1 proportion-NPSVM1.
Input: Randomly initialized labels yi ∈ {−1, +1} (∀i ∈ {1, . . . , N}), parameters ci (i = 1, 2, 3, 4) and cp; c1* = 10^{-5} c1 and c2* = 10^{-5} c2.
while c1* < c1 or c2* < c2 do
    c1* = min{(1 + Δ) c1*, c1},   c2* = min{(1 + Δ) c2*, c2}
    repeat
        • Fix y and optimize problems (18) and (19) to obtain (w+, w−, b+, b−);
        • Fix w+, w− and b+, b− and solve problem (20) to get y, using the strategies introduced above;
    until the decrease of the objective function value is smaller than a threshold (e.g. ε = 10^{-4}).
end while

Algorithm 2 proportion-NPSVM2.
Input: Randomly initialized labels yi ∈ {−1, +1} (∀i ∈ {1, . . . , N}), label proportions pk (k ∈ {1, . . . , K}), parameters ci (i = 1, 2, 3, 4) and cp; c1* = 10^{-5} c1 and c2* = 10^{-5} c2.
while c1* < c1 or c2* < c2 do
    c1* = min{(1 + Δ) c1*, c1},   c2* = min{(1 + Δ) c2*, c2}
    repeat
        • Fix y and optimize problems (18) and (19) to obtain (w+, w−, b+, b−);
        • Fix w+, w−, b+, b− and renovate the label y under the proportion constraint through the tactics presented before;
    until the decrease of the objective function value is smaller than a threshold (e.g. ε = 10^{-4}).
end while
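The outer annealing loop of Algorithms 1 and 2 can be written compactly as a generator over the (c1*, c2*) schedule; this is a sketch of that bookkeeping only, with the inner alternating optimization of (18)/(19) and the label update assumed to be run for each yielded pair.

```python
def annealing_schedule(c1, c2, delta=0.5):
    """Penalty annealing of Algorithms 1 and 2: c1* and c2* start at 1e-5 of
    their target values and grow by a factor (1 + delta) per outer loop."""
    c1_s, c2_s = 1e-5 * c1, 1e-5 * c2
    while c1_s < c1 or c2_s < c2:
        c1_s = min((1.0 + delta) * c1_s, c1)
        c2_s = min((1.0 + delta) * c2_s, c2)
        yield c1_s, c2_s

# For each yielded (c1*, c2*) pair, the inner loop alternates between solving
# problems (18)-(19) with y fixed and updating y, until the objective decrease
# falls below the threshold (1e-4 in the paper).
print(len(list(annealing_schedule(10.0, 10.0))))  # number of outer iterations
```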
Secondly, the label proportions pk can be seen as a strong constraint during the solving process of the proportion-NPSVM model, which means the label proportions are kept fixed in each iterative loop. The details of this solution method are as follows.
• Randomly initialize y; then the optimizations (18) and (19) become a primal NPSVM problem. Similarly, the SOR method or the SMO technique can be used to solve it effectively.
• When w+, w−, b+, b− are solved and fixed, the primal optimization problems can be transformed into Eq. (22) with the constraint that the label proportion pk is constant:

min_y   Σ_{i=1}^N LF(yi, w+^T φ(xi) + b+, w−^T φ(xi) + b−)
s.t.   ∀_{i=1}^N,   yi ∈ {−1, +1}.    (22)

• Since each bag is independent, we can renovate the initialized labels y on each individual bag Bk by adopting the following tactics:
  – Assume that all the points in bag Bk belong to the negative class, namely yi = −1 for ∀i ∈ Bk; then compute the loss function values based on LF.
  – Suppose instead that the labels of the points in bag Bk are +1, and obtain another set of loss function values through LF.
  – Let δi indicate the difference between these two losses. Then sort δi, ∀i ∈ Bk, and choose the appropriate number of points with large δi to be positive, according to the given label proportion pk.
• For each bag Bk, we only need to sort the δi, ∀i ∈ Bk, once. Then, we can correctly select the appropriate number of positive points, which can be seen as the optimal solution of (22).
The solution process of the proportion-NPSVM model can thus be divided into two alternating steps: solving the NPSVM optimization problems and renovating the initialized labels y. The details of this solving approach for proportion-NPSVM are shown in Algorithm 2.
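A minimal sketch of this proportion-constrained label renovation (the per-bag step of Algorithm 2) is shown below; rounding pk·|Bk| to the nearest integer is our choice for fixing the number of positives.

```python
import numpy as np

def renovate_bag_labels(F, p_k):
    """Per-bag label renovation of Algorithm 2: the number of positives is
    fixed by the given proportion p_k, and the points with the largest loss
    difference delta_i are labelled positive."""
    r = len(F)
    delta = np.maximum(1.0 + F, 0.0) - np.maximum(1.0 - F, 0.0)
    n_pos = int(round(p_k * r))     # positives dictated by the bag proportion
    order = np.argsort(-delta)      # sort the delta_i once per bag
    y = -np.ones(r)
    y[order[:n_pos]] = 1.0
    return y
```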
As elaborated above, we provide two types of solving strategies for the proposed proportion-NPSVM algorithm. Since the objective function is monotone non-increasing and lower bounded, its convergence can be guaranteed (this will be discussed in the next section). Besides, in practice, the algorithm procedure is terminated when the objective function value is no longer declining (or when the reduction of the objective is smaller than a threshold). Furthermore, in order to alleviate the problem of local solutions, similarly to T-SVM [33] and ∝SVM [19], the newly proposed proportion-NPSVM algorithm (see Algorithms 1 and 2) also takes an additional annealing loop to gradually increase c1* and c2*. In the implementation of proportion-NPSVM, we set the annealing increment Δ = 0.5 and take Lp(·) to be the absolute loss: Lp(y) = Lp(p̃k(y), pk) = |p̃k(y) − pk|. In practice, we run proportion-NPSVM several times with randomly initialized y and take the solution with the smallest objective value as the final result of our algorithm.

4. Discussion
To further elaborate the novel algorithm, in this section we first discuss the relationship between proportion-NPSVM and ∝SVM. Then, an analysis of the validity of our model is introduced in the following part.

4.1. Relationship between proportion-NPSVM and ∝SVM
According to the statements above, proportion-NPSVM aims at generating two nonparallel classification hyperplanes, while ∝SVM searches for parallel ones. Actually, ∝SVM is a special case of proportion-NPSVM. Let us combine problems (18) and (19) into the following problem:

min   (1/2) c3 (||w+||^2 + b+^2) + (1/2) c4 (||w−||^2 + b−^2) + (1/2)(η+^T η+ + η−^T η−) + c1 e−^T ξ− + c2 e+^T ξ+ + cp e^T Lp(y)
s.t.   Φ(A) w+ + e+ b+ = η+,   Φ(B) w− + e− b− = η−,
       −(Φ(B) w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       (Φ(A) w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1}.    (23)

Obviously, the solutions of problems (15) and (16) satisfy the above problem for the given training data sets and parameters. Furthermore, the third term of problem (23) can be ignored if we take ci (i = 1, 2, 3, 4) to be large enough. Then it degenerates to

min   (1/2) c3 (||w+||^2 + b+^2) + (1/2) c4 (||w−||^2 + b−^2) + c1 e−^T ξ− + c2 e+^T ξ+ + cp e^T Lp(y)
s.t.   −(Φ(B) w+ + e− b+) + ξ− ≥ e−,   ξ− ≥ 0,
       (Φ(A) w− + e+ b−) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1}.    (24)

If we want to get a solution satisfying w+ = w− and b+ = b−, we only need to solve the following special case of problem (23), i.e.,

min   (1/2) (||w||^2 + b^2) + C (e−^T ξ− + e+^T ξ+) + Cp e^T Lp(y)
s.t.   −(Φ(B) w + e− b) + ξ− ≥ e−,   ξ− ≥ 0,
       (Φ(A) w + e+ b) + ξ+ ≥ e+,   ξ+ ≥ 0,
       ∀_{i=1}^N,   yi ∈ {−1, +1},    (25)
which is obviously the ∝SVM model. In summary, ∝SVM with a parallel classification hyperplane is a special case of our proportion-NPSVM, and proportion-NPSVM is more flexible than ∝SVM and has better generalization ability.

Table 1. Data sets used for the experiments.
Data set        #Size   #Features   #Classes
Sonar           208     60          2
Heart           270     13          2
Ionosphere      351     34          2
Vote            435     16          2
Breast-cancer   683     10          2
Australian      690     14          2
Credit-a        690     15          2
Breast-w        699     9           2
Pima-Indian     768     8           2
Splice          1,000   60          2
Ala             1,605   119         2
Iris            150     4           3
Wine            178     13          3
Glass           214     9           7
Vowel           528     10          11
Dna             2,000   180         3
Satimage        4,435   36          6

Fig. 4. The artificial data set is shown in (a), while (b) presents the NPSVM result achieved with all the labels known.

4.2. Validity of proportion-NPSVM
In this section, we discuss the validity of the proposed proportion-NPSVM algorithm by applying it to an artificial classification data set. This artificial data set is generated from two Gaussians centered at (1, 4) and (4, 1) with standard deviation σ = 1, each representing a different class distribution. Fig. 4(a) shows the resulting data set after 80 instances are sampled (40 for each class); the instances are represented as points in a 2D feature space. Fig. 4(b) illustrates the classification result of the supervised NPSVM algorithm, where a pair of nonparallel classifiers is obtained. On this data set, we first describe the process of generating the classifiers for the proportion-NPSVM algorithm. Fig. 5 presents schematic diagrams of the process of our newly proposed algorithm in a way that is easy to visualize. Following the description of our algorithm, we first split these data points into different groups (shown in Fig. 5(a), where different colors denote different bags). Then, we randomly initialize the class labels, and let "." and "+" denote the negative and positive points respectively (as shown in Fig. 5(b)). After running our algorithm once, a pair of classifiers is obtained (presented in Fig. 5(c)). Based on the obtained classifiers, we can adjust the class label of each data point, and Fig. 5(d) shows the result after updating the labels. Continuing to run the proportion-NPSVM algorithm, we get the final classification result (details can be found in Fig. 5(f)). Comparing the result of our algorithm (Fig. 5(f)) with the supervised NPSVM result (Fig. 4(b)), we can see that our algorithm achieves a highly competitive classification result even though only the label proportion information is available.
Based on the illustrations above, we can clearly understand the working principle of the novel proportion-NPSVM algorithm. Next, let us consider the convergence of this algorithm, which further supports its validity. To show the convergence of its objective function in the iterative optimization procedure, we apply the proportion-NPSVM algorithm to a selected binary-class UCI data set, namely SONAR. Following the description of the proportion-NPSVM algorithm, we first fix the bag size to k points per bag and then split the data points into different groups. Running our algorithm on these grouped data points, we plot the objective function value within each outer iteration in Fig. 6(a)–(d) when using k = 2, 8, 16, 64 points in each bag, respectively. From these plots, it can be seen that the novel proportion-NPSVM method provides a satisfactory convergence rate in the optimization procedure.
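The synthetic setup described above, and the rule-(10) style prediction on it, can be sketched as follows (the bag size of 8 and the random seed are arbitrary choices of ours; the hyperplane parameters are assumed to come from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussians centred at (1, 4) and (4, 1) with sigma = 1, 40 points each,
# as in the validity experiment above; the true labels are used only to
# compute the bag proportions and to check the result.
X = np.vstack([rng.normal((1.0, 4.0), 1.0, size=(40, 2)),
               rng.normal((4.0, 1.0), 1.0, size=(40, 2))])
y_true = np.hstack([np.ones(40), -np.ones(40)])

# Split into bags (size 8 here, an arbitrary choice) and keep only proportions.
perm = rng.permutation(len(X))
bags = [perm[i:i + 8] for i in range(0, len(X), 8)]
proportions = [float((y_true[b] == 1).mean()) for b in bags]

def predict(X, w_plus, b_plus, w_minus, b_minus):
    """Rule (10), read as a perpendicular distance: assign each point to the
    class whose hyperplane is closer."""
    d_plus = np.abs(X @ w_plus + b_plus) / np.linalg.norm(w_plus)
    d_minus = np.abs(X @ w_minus + b_minus) / np.linalg.norm(w_minus)
    return np.where(d_plus <= d_minus, 1.0, -1.0)
```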
5. Experimental results
To validate the performance of proportion-NPSVM, in this section we design two sets of experiments: one shows its better accuracy on several binary-class data sets compared with state-of-the-art methods for LLP, which include MeanMap [2], Inverse Calibration (InvCal) [18] and alter proportion-SVM (alter-∝SVM) [19]; the other tests its ability on multi-class data sets. The Matlab code for alter-∝SVM, MeanMap and InvCal is provided with the work of [19].1 Our code for proportion-NPSVM is also available online: https://github.com/chenzhensong/proportion-NPSVM. The prediction performance (accuracy) of the different methods has been assessed on various data sets from the LibSVM collection2 and the UCI repository3 [34], shown in Table 1. Since this paper focuses on binary classification, we test the one-vs-rest binary classification performance for the multi-class data sets. Furthermore, in order to avoid scaling issues in the learning process, the features of each data set are scaled to [−1, +1]. To formulate the LLP classification problems, the training data is randomly partitioned into bags of fixed size σ. We tested various bag sizes σ: 2, 4, 8, 16, 32 and 64 (with the last bag smaller than σ, if necessary). Following the experimental design of [18,19], in each single experiment the accuracy has been assessed by tenfold cross-validation. Similarly, we repeat the above process ten times (negative instances are randomly selected from the multi-class data sets and the data is partitioned into disjoint bags) and report the mean accuracies. The parameters of each algorithm (both for the linear kernel and the RBF kernel) are tuned through the following rules. For MeanMap, the parameter is tuned from λ ∈ {0.1, 1, 10}. For InvCal, the parameters are tuned from Cp ∈ {0.1, 1, 10} and ε ∈ {0, 0.01, 0.1}. For alter-∝SVM, the parameters are tuned from C ∈ {0.1, 1, 10} and Cp ∈ {1, 10, 100}. For proportion-NPSVM, the parameters ci (i = 1, 2, 3, 4) are tuned for the best classification accuracy in the range 0.1 to 10, and cp ∈ {0.1, 1, 10}. In fact, there are only three parameters to be tuned in proportion-NPSVM: c1, c3 and cp in problem (18), and c2, c4 and cp in (19). Hence, the grid search method can be applied. The results of the numerical experiments are summarized in the following tables, where the best accuracy is shown in bold.
1 https://github.com/felixyu/pSVM.
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
3 http://archive.ics.uci.edu/ml/datasets.html.
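A minimal sketch of the experimental LLP protocol described above (feature scaling to [−1, +1], random partition into bags of fixed size σ, and computation of the per-bag positive proportions); function and variable names are ours.

```python
import numpy as np

def make_llp_bags(X, y, bag_size, rng):
    """Scale features to [-1, +1], randomly partition the training data into
    bags of fixed size sigma (the last bag may be smaller), and keep only the
    per-bag positive-label proportions."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = 2.0 * (X - lo) / np.maximum(hi - lo, 1e-12) - 1.0
    idx = rng.permutation(len(X))
    bags = [idx[i:i + bag_size] for i in range(0, len(X), bag_size)]
    proportions = [float((np.asarray(y)[b] == 1).mean()) for b in bags]
    return X, bags, proportions
```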
Fig. 5. The process of generating the classifiers: (a) Grouped data points (different colors denote different bags); (b) Initialized data points; (c)∼ (e) Achieving classifiers and updating data points’ labels in each iteration and (f) Final classification result. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 6. The value curve of the objective function at each iteration when using k = 2, 8, 16, 64 points in each bag respectively. From these plots, it can be seen that our novel method provides a satisfactory convergence rate in the optimization procedure.
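The grid search mentioned above, over the three effective parameters (c1 = c2, c3 = c4, cp), can be sketched as follows; the grid values {0.1, 1, 10} for the ci are our reading of "the range 0.1 to 10", and train_eval is a hypothetical callable returning the cross-validated accuracy of one parameter setting.

```python
import itertools
import numpy as np

def grid_search(train_eval, grid_c=(0.1, 1.0, 10.0), grid_cp=(0.1, 1.0, 10.0)):
    """Exhaustive search over (c1 = c2, c3 = c4, cp); train_eval is a
    hypothetical callable returning cross-validated accuracy for one setting."""
    best_params, best_acc = None, -np.inf
    for c12, c34, cp in itertools.product(grid_c, grid_c, grid_cp):
        acc = train_eval(c1=c12, c2=c12, c3=c34, c4=c34, cp=cp)
        if acc > best_acc:
            best_params, best_acc = (c12, c34, cp), acc
    return best_params, best_acc
```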
5.1. Experiments on binary-class data sets
In this section, we present the experimental results on the binary classification data sets. Table 2 shows the experimental results (classification accuracy and its standard deviation) on the binary-class scaled data sets with the linear kernel. From this table, we can see that proportion-NPSVM outperforms the MeanMap, InvCal and alter-∝SVM methods on most data sets. Specifically, proportion-NPSVM obtains the best performance for all bag sizes on 7 out of 11 data sets. On the other 4 data sets, including AUSTRALIAN, BREAST-W, BREAST-CANCER and HEART, proportion-NPSVM achieves accuracy highly competitive with alter-∝SVM, and better than the MeanMap and InvCal methods. In particular, the classification accuracy of proportion-NPSVM is higher on the first 3 of these data sets for small bag sizes and a little worse for large bag sizes. As for the HEART data set, the results of proportion-NPSVM are highly competitive with alter-∝SVM for some bag sizes and higher for the others.
In addition, to further evaluate the performance of our proposed method, we also applied the RBF kernel to proportion-NPSVM. The classification results on the binary-class data sets are shown in Table 3. As can be seen, the proportion-NPSVM algorithm obtains the best accuracy on most data sets, including
Fig. 7. The relative change of the RBF kernel for the average classification accuracy on the utilized data sets.
SONAR, HEART, VOTE, BREAST-W, PIMA-INDIAN, SPLICE and ALA. On the IONOSPHERE, BREAST-CANCER, AUSTRALIAN and CREDIT-A data sets, proportion-NPSVM achieves classification results highly competitive with alter-∝SVM. Moreover, Fig. 7 displays the average improvement in classification performance obtained by each algorithm with the RBF kernel. (Here, a value larger than zero indicates an improvement; otherwise, it is a reduction.) From Fig. 7, we can see that the improvement brought by the RBF kernel differs among the selected approaches. In detail, for
Table 2. Accuracy of different methods on binary-class data sets with the linear kernel and bag sizes 2, 4, 8, 16, 32, 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is presented in brackets.
Data set | Method | 2 | 4 | 8 | 16 | 32 | 64
Sonar
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
82.31(0.72) 71.63(2.84) 88.94(1.32) 97.59(0.82) 98.07(0.86)
82.02(0.60) 79.80(4.37) 83.17(2.58) 96.63(0.85) 96.63(0.86)
80.76(1.20) 75.69(3.08) 74.51(4.37) 94.71(1.33) 95.19(1.84)
78.37(1.75)) 77.58(1.53) 70.67(4.65) 91.82(1.40) 91.82(1.39)
77.25(1.39) 72.60(2.38) 66.35(5.76) 88.46(1.51) 91.34(0.76)
75.10(1.98) 65.38(4.10) 66.82(6.87) 82.98(0.85) 87.50(1.71)
Heart
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
81.85(1.28) 83.41(5.81) 87.51(0.88) 88.15(1.45) 89.62(2.42)
80.39(0.56)) 80.98(2.68) 86.29(2.45) 85.19(2.14) 88.51(3.24)
79.63(0.74) 79.45(6.19) 84.67(2.64) 85.07(1.99) 87.04(2.63)
79.46(1.31) 76.94(1.64) 84.07(2.47) 80.00(2.88) 84.07(4.35)
79.00(1.25) 73.76(1.75) 81.85(2.24) 79.63(2.43) 83.70(2.97)
76.06(1.42) 73.04(2.44) 79.74(1.62) 80.00(0.61) 80.00(0.81)
vote
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
87.76(0.24) 95.63(0.18) 96.62(0.23) 97.98(0.84) 96.83(0.70)
92.75(1.56) 95.35(0.42) 96.59(0.44) 96.32(1.31) 96.10(1.36)
91.83(2.24) 94.23(0.20) 96.16(0.40) 96.32(0.82) 96.45(0.82)
89.62(1.45) 94.01(0.66) 95.86(2.40) 96.22(0.48) 95.63(0.59)
87.69(0.27) 91.86(2.46) 92.97(1.81) 95.67(0.15) 95.63(0.41)
87.42(0.78) 91.10(1.09) 93.33(2.48) 95.72(0.19) 95.09(0.21)
Australian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
86.33(0.42) 85.39(0.25) 85.42(1.96) 92.72(2.17) 92.55(1.70)
85.67(0.27) 85.80(0.36) 85.61(1.37) 89.83(1.49) 89.56(1.41)
85.08(1.42) 84.99(0.77) 85.49(1.68) 86.64(1.28) 87.27(1.10)
83.74(1.57) 83.14(2.32) 84.96(3.54) 80.91(0.11) 80.75(1.21)
83.96(1.96) 80.28(3.96) 84.39(4.29) 76.33(1.92) 78.55(1.89)
82.19(1.95) 80.53(4.68) 82.46(6.18) 68.77(0.85) 68.41(1.94)
Breast-w
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.11(0.08) 95.88(0.38) 96.71(0.10) 98.13(0.50) 97.68(0.54)
95.95(0.35) 95.65(0.36) 96.77(0.36) 97.47(0.76) 97.48(0.54)
95.16(0.26) 95.53(0.28) 96.59(0.39) 96.62(0.39) 96.53(0.58)
96.03(0.33) 95.39(0.58) 96.41(0.38) 96.55(0.46) 96.42(0.16)
95.42(0.44) 95.23(0.52) 96.41(0.32) 96.05(0.25) 96.08(0.48)
95.80(0.93) 94.31(0.77) 96.25(0.56) 95.92(0.12) 95.85(0.27)
Ionosphere
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
84.32(0.58) 87.17(1.91) 94.02(0.83) 96.32(1.04) 96.29(0.89)
82.17(1.24) 81.23(7.14) 90.60(1.47) 95.53(1.54) 95.04(0.77)
81.91(1.33) 85.18(3.46) 86.32(4.43) 93.13(1.16) 93.10(0.11)
80.03(0.47) 83.47(4.26) 74.64(3.70) 91.28(0.85) 92.02(1.93)
79.96(1.25) 79.77(3.78) 72.36(3.89) 91.76(1.02) 90.88(0.76)
78.99(2.08) 78.06(4.09) 68.95(1.44) 92.33(0.71) 91.73(0.56)
Breast-Cancer
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.49(0.02) 96.02(0.40) 96.90(0.22) 97.96(0.41) 97.58(0.36)
96.38(0.18) 96.11(0.61) 96.87(0.61) 97.59(0.26) 97.53(0.11)
96.20(0.27) 95.81(0.23) 96.81(0.23) 96.78(0.46) 96.41(0.33)
96.02(0.43) 95.61(0.28) 96.76(0.29) 96.67(0.12) 96.25(0.56)
96.35(0.38) 95.61(0.14) 96.82(0.12) 96.50(0.26) 96.17(0.17)
95.66(0.55) 94.49(1.01) 96.49(1.00) 96.32(0.20) 95.98(0.25)
Credit-a
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.42(0.24) 85.51(0.50) 85.94(0.16) 94.52(1.29) 95.51(1.02)
84.70(0.80) 85.40(0.41) 87.68(0.82) 94.27(0.99) 93.10(1.78)
83.56(1.68) 84.52(0.67) 87.39(1.24) 91.41(0.86) 91.62(1.16)
82.33(1.08) 82.69(2.65) 86.09(1.07) 89.04(1.89) 89.30(0.60)
81.90(2.87) 79.23(3.96) 85.65(0.96) 87.96(0.61) 87.47(0.68)
79.78(3.62) 77.99(4.28) 85.22(1.61) 86.42(0.32) 85.76(0.50)
Pima-Indian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
75.27(1.22) 77.08(1.09) 81.51(0.67) 92.47(1.02) 93.54(0.66)
73.89(2.74) 73.95(3.58) 80.33(2.12) 90.40(1.11) 91.38(0.56)
73.42(1.87) 76.17(2.24) 75.78(2.79) 88.16(1.69) 86.85(1.66)
71.98(1.48) 73.82(0.78) 72.78(2.62) 78.11(2.57) 84.47(2.69)
70.87(1.25) 65.88(1.16) 69.27(1.55) 79.10(3.65) 80.03(1.90)
69.50(1.66) 65.23(1.44) 68.62(1.40) 74.74(3.21) 72.07(1.98)
Splice
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
75.29(0.24) 51.70(0.71) 87.60(0.66) 96.55(0.88) 96.33(0.25)
74.66(1.21) 55.50(3.84) 84.60(1.28) 94.76(1.49) 94.20(0.70)
72.54(0.76)) 53.80(2.72) 79.30(2.88) 91.52(1.28) 92.17(0.86)
72.19(0.66) 74.20(4.42) 71.40(2.51) 89.03(0.11) 88.76(1.58)
71.40(1.41)) 68.50(1.50) 72.60(2.72) 87.21(1.92) 85.83(1.19)
70.22(1.11) 70.60(4.04) 67.30(2.30) 85.69(0.85) 85.43(0.58)
Ala
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
81.76(0.89) 81.86(0.20) 82.63(0.20) 86.53(0.21) 85.47(1.17)
81.60(0.48) 81.35(0.70) 81.72(0.70) 84.74(1.05) 82.51(2.04)
80.02(0.59) 78.34(0.70) 80.00(0.70) 83.44(1.43) 80.08(1.78)
77.04(1.30) 77.69(1.36) 76.48(1.36) 81.22(1.00) 77.35(1.41)
73.19(2.48) 73.13(3.86) 76.38(4.86) 78.68(2.14) 75.42(2.54))
72.58(0.95) 73.30(1.71) 76.09(1.71) 76.57(2.58) 74.86(2.68)
MeanMap, the accuracy increase brought by the RBF kernel occurs on all data sets except AUSTRALIAN, CREDIT-A and ALA. For InvCal, the RBF kernel improves the classification accuracy on the HEART, BREAST-CANCER, AUSTRALIAN, CREDIT-A, BREAST-W, SPLICE and ALA data sets. For alter-∝SVM, the improvement appears on the HEART, IONOSPHERE, BREAST-CANCER, AUSTRALIAN, BREAST-W, PIMA-INDIAN and SPLICE data sets. As for the newly proposed proportion-NPSVM, the count of #improvements is 8 out of the 11 classification data sets, the exceptions being CREDIT-A, PIMA-INDIAN and ALA.
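One way to compute the per-data-set quantity plotted in Fig. 7 is the difference between the bag-size-averaged RBF accuracy and the bag-size-averaged linear-kernel accuracy; this aggregation is our assumption, not a formula stated in the paper.

```python
import numpy as np

def rbf_gain(acc_rbf, acc_linear):
    """Average accuracy with the RBF kernel minus average accuracy with the
    linear kernel over the six bag sizes; positive values mean the RBF kernel
    helps on that data set."""
    return float(np.mean(acc_rbf) - np.mean(acc_linear))
```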
Furthermore, to show the comparison of the classification performance of these selected algorithms clearly, we present the average classification error rates on the binary-class data sets in Figs. 8 and 9. From them, we can see that the average error rates of proportion-NPSVM are smaller than those of the alter-∝SVM, MeanMap and InvCal methods. Specifically, the count of #wins for proportion-NPSVM with the linear kernel is 9 out of the 11 classification data sets, while the count of #wins for proportion-NPSVM with the RBF kernel is 10. In conclusion, the novel proportion-NPSVM algorithm proposed for solving LLP problems is effective on the binary classification data sets.
Table 3. Accuracy of different methods on binary-class data sets with the RBF kernel and bag sizes 2, 4, 8, 16, 32, 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is presented in brackets.
Data set | Method | 2 | 4 | 8 | 16 | 32 | 64
Sonar
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
84.57(0.27) 85.57(0.66) 94.23(2.33) 98.39(0.73) 97.78(0.54)
83.99(0.56) 78.84(1.23) 70.67(3.12) 97.43(0.56) 95.10(1.29)
83.54(0.19) 64.90(2.58) 67.31(2.82) 96.79(1.11) 93.75(1.20)
79.23(1.05) 53.37(3.66) 64.42(3.57) 96.95(0.28) 92.32(0.48)
78.78(1.47) 53.37(3.79) 62.02(2.70) 95.83(0.73) 91.53(1.00)
75.31(2.53) 53.37(3.79) 62.98(2.83) 96.63(0.48) 90.57(0.53)
Heart
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
83.69(1.28) 90.37(0.55) 96.29(2.12) 97.77(0.57) 96.96(0.61)
81.87(0.47) 85.18(1.35) 89.62(2.22) 95.92(1.61) 95.56(1.34)
80.47(0.83) 80.26(2.77) 85.18(1.78) 95.18(0.64) 93.48(1.01)
79.22(1.46) 79.61(3.26) 85.55(1.73) 93.95(0.21) 92.59(1.81)
79.84(1.42) 76.36(2.69) 82.22(2.81) 89.01(1.40) 91.26(1.28)
77.26(1.25) 73.90(3.58) 80.00(1.47) 89.25(1.11) 89.62(1.08)
Vote
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
91.15(0.21) 95.68(0.12) 95.80(0.33) 98.29(0.13) 98.52(0.45)
90.52(1.76) 94.77(0.47) 95.54(0.42) 97.79(0.35) 96.82(1.15)
91.54(2.22) 93.95(0.27) 94.88(0.57) 97.14(0.42) 96.18(0.88)
90.28(1.45) 93.03(0.57) 92.44(1.28) 97.01(0.43) 96.55(0.96)
89.58(0.21) 87.79(2.42) 90.72(1.47) 97.10(0.21) 95.62(0.70)
89.33(0.82) 86.63(1.11) 90.93(1.35) 96.82(0.10) 96.09(0.21)
Australian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.92(0.72) 86.06(0.30) 85.74(1.86) 95.65(0.65) 95.56(1.20)
84.98(0.34) 86.11(0.26) 85.71(1.71) 93.67(1.18) 93.15(0.91)
85.53(1.21) 86.32(0.56) 86.26(1.31) 90.96(0.33) 89.30(1.00)
84.07(2.04) 84.13(1.57) 85.65(0.98) 85.36(0.25) 86.84(1.28)
83.12(1.47) 82.73(1.12) 83.63(1.52) 79.90(1.03) 81.21(1.11)
80.70(3.96) 81.87(2.35) 83.62(1.61) 76.23(0.94) 74.78(1.17)
Breast-w
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.42(0.18) 96.85(0.25) 96.97(0.29) 98.04(0.54) 97.99(0.34)
96.27(0.31) 96.91(0.16) 97.00(0.25) 97.47(0.60) 97.62(0.21)
96.20(0.27) 96.77(0.22) 96.97(0.30) 97.18(0.58) 96.99(0.68)
96.14(0.39) 95.39(0.25) 96.87(0.66) 97.13(0.25) 96.82(0.47)
94.92(1.02) 95.23(0.29) 96.88(0.25) 96.47(0.22) 96.48(0.12)
94.24(1.22) 94.58(1.57) 96.07(0.61) 96.32(0.17) 96.45(0.21)
Ionosphere
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.21(0.31) 95.44(0.58) 99.14(0.04) 95.44(2.06) 96.35(0.84)
84.79(0.18) 84.33(1.23) 98.86(0.59) 96.98(0.74) 93.67(0.89)
82.65(0.96) 65.24(2.96) 94.58(1.64) 94.92(1.05) 93.16(1.65)
81.24(1.21) 64.10(3.05) 93.19(1.87) 93.56(1.43) 92.59(1.33)
80.08(0.74) 64.10(3.05) 92.02(1.72) 94.18(0.53) 93.03(1.78)
79.14(0.56) 64.10(3.05) 86.32(2.53) 94.07(0.16) 92.43(0.82)
Breast-Cancer
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
96.69(0.17) 97.07(0.18) 97.16(0.20) 97.31(0.08) 97.45(0.20)
96.72(0.20) 97.10(0.31) 97.12(0.13) 97.47(0.14) 97.36(0.17)
96.84(0.32) 97.02(0.15) 97.19(0.37) 97.26(0.08) 97.24(0.19)
96.60(0.21) 97.08(0.25) 97.35(0.30) 97.31(0.45) 97.07(0.17)
96.67(0.18) 96.51(0.30) 97.09(0.48) 97.12(0.22) 96.89(0.30)
96.78(0.11) 96.09(0.72) 97.23(0.42) 97.07(0.25) 96.72(0.45)
Credit-a
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.86(0.81) 86.26(0.65) 86.23(0.12) 88.02(0.17) 87.51(0.76)
85.04(0.73) 85.62(0.12) 86.09(0.35) 87.53(1.23) 87.42(0.91)
84.96(1.22) 85.41(0.42) 85.55(0.37) 77.58(1.76) 84.63(1.99)
83.26(1.55) 83.79(0.64) 84.73(2.52) 76.97(0.66) 79.56(0.80)
81.14(2.96) 82.21(3.72) 82.39(2.63) 79.32(0.17) 79.42(0.59)
76.00(4.58) 76.90(4.02) 80.75(3.66) 79.95(0.75) 75.36(0.93)
Pima-Indian
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
76.21(1.22) 79.94(0.63) 88.28(1.40) 94.14(0.71) 92.94(0.21)
75.75(1.31) 76.95(1.07) 81.25(1.28) 90.83(0.82) 91.56(1.25)
74.32(0.84) 72.65(1.36) 72.53(1.32) 87.68(1.18) 87.83(2.87))
72.87(1.72) 67.06(2.57) 70.70(3.40) 80.73(1.11) 83.33(2.56)
71.20(0.96) 65.10(3.64) 69.92(1.66) 68.49(2.58) 69.82(3.07)
70.44(2.57) 65.10(3.83) 67.71(0.78) 65.10(3.64) 67.08(3.84)
Splice
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
77.24(0.47) 89.30(0.52) 96.10(0.74) 89.24(0.17) 89.66(0.16)
75.71(0.58) 83.50(1.23) 88.80(1.77) 89.30(0.26) 89.60(0.11)
74.92(0.79) 77.40(2.22) 77.72(1.74) 89.26(0.06) 89.43(1.00)
72.47(1.22) 73.20(2.75) 73.00(2.11) 89.46(0.12) 89.30(1.28)
72.16(1.02) 55.70(4.58) 71.21(2.37) 89.06(0.26) 89.23(1.11)
71.08(1.35) 51.80(6.73) 64.80(1.07) 89.50(0.06) 89.43(1.17)
Ala
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
76.16(0.33) 82.31(0.41) 82.22(0.36) 85.78(0.27) 85.33(0.35)
75.86(0.28) 81.49(0.52) 81.80(0.58) 83.25(0.66) 84.06(0.47)
76.44(1.26) 81.12(1.23) 79.16(1.32) 80.14(1.23) 81.92(1.12)
76.48(0.55) 78.66(1.33) 75.67(1.17) 78.35(1.43) 78.08(1.25)
75.93(1.06) 75.47(1.92) 75.73(1.41) 76.88(1.96) 77.02(0.98)
74.95(1.71) 74.57(2.11) 75.48(1.02) 75.04(1.84) 74.97(2.05)
5.2. Experiments on multi-class data sets
Based on the analysis of the experimental results above, we know that the proposed proportion-NPSVM algorithm achieves highly competitive, and often better, classification performance on binary-class data sets. In this section, we test its classification ability on multi-class data sets. As mentioned before, this paper focuses on the binary classification setting. Therefore, we test the one-vs-rest binary classification performance on the selected multi-class data sets, including IRIS, WINE, GLASS, VOWEL, DNA and SATIMAGE. That is
to say, in these experiments, we treat the data from one class as positive and randomly choose the same amount of data as negative from the remaining classes. In this way, we generate thirteen one-vs-rest data sets that can be used to test our proposed proportion-NPSVM algorithm. The experimental results on the multi-class data sets are presented in Table 4. As shown there, the proposed proportion-NPSVM is highly competitive with, and often better than, alter-∝SVM, and both outperform the MeanMap and InvCal methods. In detail, proportion-NPSVM improves on alter-∝SVM on most data sets, including IRIS, GLASS, VOWEL, WINE, by introducing the non-
Table 4. Accuracy of different methods on multi-class data sets with the linear kernel and bag sizes 2, 4, 8, 16, 32, 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is presented in brackets.
Data set | Method | 2 | 4 | 8 | 16 | 32 | 64
Iris-1
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
88.63(0.23) 79.00(2.19) 74.00(0.11) 93.33(0.52) 96.38(1.22)
88.02(0.42) 76.00(2.39) 74.00(0.24) 91.00(1.09) 92.65(2.30)
87.51(0.65) 73.00(0.95) 73.58(2.30) 85.80(5.16) 89.84(4.32)
84.96(1.21) 73.00(1.28) 71.24(2.66) 75.80(3.87) 72.25(3.86)
81.25(1.34) 71.00(1.21) 70.40(1.66) 77.00(2.65) 71.68(3.63)
80.20(2.58) 69.00(3.72) 69.34(4.21) 74.00(4.89) 74.62(2.86)
Glass-1
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
74.52(1.26) 72.86(1.78) 74.29(0.73) 89.00(1.08) 94.85(2.55)
73.28(0.97) 73.57(1.79) 72.14(1.86) 88.86(1.31) 92.86(2.02)
72.22(1.06) 74.28(1.69) 71.43(1.66) 84.50(2.13) 90.71(1.68)
71.79(1.44) 74.28(0.96) 67.86(1.32) 83.79(2.02) 86.00(3.17)
71.17(0.85) 72.14(1.74) 67.14(1.84) 76.35(2.63) 79.14(0.60)
70.96(0.74) 72.14(0.53) 66.43(1.32) 74.14(3.34) 76.14(1.86)
Glass-2
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
74.68(0.98) 72.37(1.69) 71.53(4.14) 90.71(1.01) 94.47(0.86)
72.47(1.21) 66.48(1.58) 69.73(2.99) 89.33(1.35) 93.42(2.18)
70.83(1.44) 71.71(1.46) 71.05(1.88) 86.59(1.72) 89.73(1.36)
71.25(0.77) 63.16(1.96) 70.39(3.86) 83.88(2.83) 85.13(2.53)
68.37(1.76) 69.73(1.74) 69.34(2.02) 77.12(2.42) 80.92(2.96)
66.54(2.32) 59.21(2.74) 65.31(2.11) 71.05(1.97) 76.31(3.62)
Glass-3
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
75.96(0.62) 79.41(1.35) 79.41(3.61) 88.24(0.17) 91.17(0.00)
74.33(1.11) 76.47(1.98) 73.53(1.73) 87.25(2.07) 90.58(2.46)
71.05(1.55) 67.65(1.55) 70.59(4.10) 85.62(1.42) 89.41(2.63)
72.33(1.76) 67.65(1.46) 70.59(2.06) 82.35(4.87) 88.23(3.83)
69.47(2.02) 70.59(1.23) 69.95(1.63) 78.10(5.32) 78.82(3.83)
67.34(3.14) 55.10(3.39) 64.71(1.17) 69.41(4.84) 75.29(2.63)
Vowel-1
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
93.86(0.22) 95.83(0.53) 97.18(1.86) 96.67(1.68) 96.45(1.14)
92.17(0.53) 93.75(0.47) 94.58(2.89) 95.63(1.68) 94.80(2.03)
90.88(1.11) 92.71(0.69) 93.33(2.56) 95.21(1.57) 93.75(2.81)
90.15(0.83) 91.67(2.73) 90.62(3.11) 93.75(0.49) 91.67(3.65)
89.86(1.20) 92.70(1.66) 89.58(2.92) 93.85(0.33) 90.20(1.71)
87.68(2.22) 86.45(2.29) 87.50(3.82) 93.75(0.11) 89.58(0.93)
Vowel-2
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
83.45(1.25) 85.42(2.89) 89.58(1.15) 96.87(1.39) 96.87(1.19)
81.63(1.37) 83.33(2.38) 85.41(2.27) 96.35(2.03) 96.25(2.39)
79.36(1.59) 80.21(1.20) 82.29(4.71) 94.79(2.31) 94.58(0.87)
77.01(2.22) 78.13(1.94) 81.25(2.57) 91.97(2.69) 90.41(3.85)
76.54(2.54) 79.16(1.13) 80.20(2.79) 89.58(1.72) 88.96(2.28)
75.37(3.26) 82.29(0.39) 79.16 (2.91) 86.45(2.14) 85.63(1.35)
Vowel-3
MeanMap InvCal alter-∝SVM p-NPSVM1 p-NPSVM2
85.77(0.87) 88.54(3.98) 92.25(1.57) 97.71(1.17) 96.87(1.42)
82.58(1.22) 85.42(0.89) 89.06(1.80) 96.67(1.82) 96.04(2.69)
81.36(0.79) 84.38(1.43) 81.60(1.04) 94.38(2.21) 92.50(1.71)
79.87(1.34) 85.41(1.64) 79.16(3.81) 90.94(2.03) 90.63(1.04)
80.25(1.11) 84.37(1.46) 71.87(3.73) 86.46(2.14) 89.79(2.48)
78.63(2.76) 80.21(2.18) 72.91(2.84) 85.52(1.25) 84.58(2.60)
Wine-1
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
95.60(0.21) 99.15(0.00) 99.72(0.05) 99.72(0.02) 99.15(0.05)
94.87(0.16) 99.15(0.05) 99.89(0.01) 99.89(0.00) 99.15(0.02)
94.02(0.73) 99.14(0.02) 99.59(0.02) 99.79(0.01) 99.15(0.07)
93.11(0.57) 99.15(0.05) 99.49(0.01) 99.79(0.01) 98.64(0.02)
92.36(1.02) 98.31(0.35) 99.23(0.06) 99.74(0.00) 98.13(0.39)
89.07(2.21) 93.22(0.99) 98.68(0.57) 99.57(0.00) 95.76(0.20)
Wine-2
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
95.98(0.24) 99.29(0.01) 98.59(0.22) 98.61(0.17) 99.29(0.00)
95.44(0.35) 97.89(0.96) 98.52(0.59) 98.59(0.15) 99.01(0.23)
94.11(0.65) 96.48(1.74) 95.35(0.74) 98.59(0.20) 98.08(0.37)
93.78(0.22) 95.07(1.43) 94.43(1.41) 97.39(0.48) 96.61(1.88)
91.85(0.98) 94.37(1.35) 95.74(1.18) 96.19(0.95) 95.77(2.38)
90.02(1.74) 89.44(0.99) 91.54(2.01) 95.21(0.56) 95.47(1.07)
Dna-1
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
86.78(1.33) 93.25(3.43) 99.36(0.09) 97.95(0.33) 97.88(0.06)
83.62(1.25) 91.05(2.81) 98.53(0.58) 97.30(0.40) 97.19(0.66)
80.21(1.64) 90.19(2.77) 97.16(0.72) 96.44(0.46) 95.40(0.98)
79.46(0.58) 89.33(2.11) 93.87(2.30) 94.39(0.41) 93.71(0.44)
78.20(1.22) 85.13(2.72) 91.81(2.28) 93.59(0.53) 92.46(0.19)
77.73(1.65) 81.04(1.55) 90.67(2.25) 93.23(0.35) 91.91(0.16)
Dna-2
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
88.34(0.68) 92.98(1.94) 99.10(0.06) 96.90(0.20) 97.73(0.39)
84.54(1.57) 90.28(2.08) 97.95(0.64) 96.59(0.21) 96.49(1.68)
82.11(2.02) 88.35(2.03) 95.67(0.83) 96.80(0.35) 95.15(0.60)
79.89(3.36) 87.73(2.25) 92.81(1.31) 95.25(1.20) 93.33(0.88)
78.46(2.25) 87.62(2.37) 90.28(1.87) 92.98(0.74) 92.89(0.99)
76.73(3.04) 83.09(1.58) 89.75(2.25) 91.95(0.32) 91.23(0.27)
Satimage-1
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
96.93(0.25) 94.89(1.89) 98.34(0.07) 97.54(0.02) 97.70(0.00)
96.27(0.32) 94.30(2.73) 98.32(0.09) 97.49(0.01) 97.59(0.00)
95.65(0.51) 92.88(0.74) 98.30(0.20) 97.49 (0.64) 97.49(0.55)
94.36(1.76) 86.61(2.33) 98.21(0.26) 96.81(0.87) 96.38(0.73)
94.25(0.22) 84.03(1.61) 97.99(0.26) 96.03(0.99) 96.45(1.04)
93.76(0.50) 82.09(1.40) 97.79(0.23) 96.65(1.22) 95.72(1.33)
Satimage-2
MeanMap InvCal alter- α SVM p-NPSVM1 p-NPSVM2
94.42(0.38) 94.64(1.62) 96.91(0.17) 97.29(0.20) 97.21(0.26)
93.60(0.64) 93.96(1.34) 96.77(0.20) 96.87(0.33) 95.20(1.53)
92.47(1.11) 88.09(1.47) 96.52(0.28) 95.94(0.65) 95.90(0.51)
92.36(0.34) 86.91(1.64) 96.08(0.45) 95.47(0.24) 95.62(0.05)
90.77(0.38) 83.75(1.45) 95.74(0.30) 95.36(0.41) 95.57(0.09)
89.26(0.29) 84.11(1.33) 95.25(0.32) 95.31(0.22) 95.36(0.10)
Fig. 8. The average error rates of the classification methods with the linear kernel on the 11 binary-class data sets.
Fig. 9. The average error rates of the classification methods with the RBF kernel on the 11 binary-class data sets.
Fig. 10. The average error rates of the classification methods with the linear kernel on the six multi-class data sets.
Fig. 11. The average error rates of the classification methods with the RBF kernel on the six multi-class data sets.
As for the SATIMAGE and DNA data sets, this novel method is highly competitive with alter-∝SVM. Similarly, we also applied the RBF kernel to proportion-NPSVM to further assess its performance on the multi-class data sets. The classification results are shown in Table 5. As can be seen, the proportion-NPSVM algorithm obtains classification accuracy that is highly competitive with alter-∝SVM on almost all the selected data sets, including VOWEL, WINE and SATIMAGE, and better than the results of MeanMap and InvCal. On the GLASS and DNA data sets, the proportion-NPSVM method achieves the best classification performance compared with alter-∝SVM, MeanMap and InvCal. In addition, to present the comparison among the selected approaches clearly, we introduce Figs. 10 and 11 to show the average classification error on these six multi-class data sets. From them, we can see that the average error rate of proportion-NPSVM is smaller than those of the alter-∝SVM, MeanMap and InvCal methods. Specifically, the count of #wins for proportion-NPSVM with the linear kernel is 5 out of the 6 data sets, and the count of #wins with the RBF kernel is also 5. In summary, the proposed proportion-NPSVM algorithm is also effective for the classification of multi-class data sets.
Finally, to compare the classification performance of these LLP algorithms more easily, we not only report the standard deviations shown in brackets in the above four tables but also apply the Wilcoxon signed ranks test to the achieved accuracies. The Wilcoxon signed ranks test results are listed in Table 6 (for binary classification) and Table 7 (for multi-class classification). As shown there, the proposed proportion-NPSVM improves over the compared LLP algorithms, including InvCal, MeanMap and alter-∝SVM (a minimal sketch of this statistical comparison follows). Based on the experimental analysis of the above two sections, we conclude that the proposed proportion-NPSVM algorithm for the LLP problem achieves highly competitive, often the best, classification performance on both the binary-class and the multi-class data sets. Therefore, this novel method can be regarded as a state-of-the-art algorithm for solving the LLP problem.
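As a concrete illustration (not the authors' evaluation script), the Wilcoxon signed ranks comparison of two methods over paired accuracy results can be carried out as below. The accuracy arrays are placeholders, not the values reported in the tables, and the manual R+/R− computation ignores ties; scipy.stats.wilcoxon is assumed to be available.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired accuracies (%) of two methods over the same (data set, bag size) settings.
acc_pnpsvm = np.array([93.3, 91.0, 85.8, 89.0, 88.9, 84.5, 90.7, 89.3, 86.6])
acc_alter  = np.array([74.0, 74.0, 73.6, 74.3, 72.1, 71.4, 71.5, 69.7, 71.1])

diff = acc_pnpsvm - acc_alter
ranks = np.argsort(np.argsort(np.abs(diff))) + 1   # ranks of |differences| (ties ignored)
r_plus = ranks[diff > 0].sum()                     # R+: rank sum where the first method wins
r_minus = ranks[diff < 0].sum()                    # R-: rank sum where the second method wins

stat, p_value = wilcoxon(acc_pnpsvm, acc_alter)    # two-sided Wilcoxon signed ranks test
print(r_plus, r_minus, p_value)
```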
6. Concluding remarks

In this paper, we study the problem of learning with label proportions. To address this learning scenario, we contribute a novel solution based on nonparallel support vector machines, termed proportion-NPSVM. It incorporates the label proportions into the NPSVM optimization model under a large-margin framework, and improvements are obtained both in accuracy and in generalization ability. The proportion-NPSVM model can be solved efficiently by optimizing two smaller problems with an alternating strategy using the SOR method or the SMO technique. Furthermore, it does not need to compute large inverse matrices before training. Extensive experimental results show the efficiency of proportion-NPSVM in classification accuracy and its good prediction performance. In conclusion, the proportion-NPSVM algorithm is a feasible classification method for learning from label proportions, a task that arises in many practical applications. Although the proportion-NPSVM method yields strong classification performance, some aspects can still be improved. The algorithm initializes the labels randomly, which may affect the classification performance and the optimization time. In future work, we plan to explore how to initialize the labels under the constraint of the label proportions. Besides, we would like to develop data grouping strategies based on the data's own features instead of random tactics.

Acknowledgments

The authors would like to express their sincere thanks to the associate editor and the reviewers who made great contributions to the improvement of this paper. This work was partially supported by the major project of the National Natural Science Foundation of China (Grant nos. 71331005 and 91546201), the international (regional) cooperation project of the National Natural Science Foundation of China (Grant no. 71110107026) and the grants from the National Natural Science Foundation of China (Grant no. 61402429).
Table 5
Accuracy of different methods on the multi-class data sets with the RBF kernel and bag sizes 2, 4, 8, 16, 32 and 64. In this table, p-NPSVM is short for proportion-NPSVM and the standard deviation is given in brackets. Each block corresponds to one data set, with one row per bag size and one column per method.

Iris-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          91.66(1.24)    97.20(0.42)    98.91(0.00)    84.80(1.62)    96.01(1.09)
4          90.85(1.37)    95.70(1.94)    98.00(0.95)    83.22(1.30)    91.60(2.79)
8          88.27(2.04)    90.30(2.11)    97.80(1.40)    80.70(1.70)    91.00(2.82)
16         87.64(1.65)    82.51(1.91)    94.76(1.52)    78.87(1.76)    87.60(4.77)
32         87.09(1.74)    78.14(0.92)    89.88(3.66)    74.70(1.18)    87.31(3.13)
64         86.53(2.11)    74.87(1.72)    83.00(3.06)    73.37(0.95)    85.22(2.96)

Glass-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          78.97(0.96)    81.21(2.17)    87.07(4.31)    81.42(1.21)    93.28(1.30)
4          77.63(1.32)    77.85(2.06)    85.00(5.49)    80.71(0.68)    92.57(2.05)
8          75.88(1.55)    79.01(1.87)    75.57(3.35)    78.78(2.28)    90.57(1.92)
16         75.06(2.01)    76.57(1.25)    72.14(4.08)    75.50(3.03)    84.85(2.44)
32         73.86(1.66)    74.28(1.11)    67.85(3.94)    72.85(1.14)    77.85(2.18)
64         73.27(1.67)    73.57(2.05)    68.57(4.43)    69.35(0.41)    74.43(2.33)

Glass-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          79.73(0.44)    87.89(1.65)    85.52(3.53)    86.71(1.32)    87.50(2.44)
4          73.57(1.32)    81.58(1.45)    83.03(4.12)    82.89(1.65)    87.36(2.11)
8          71.42(1.21)    76.97(1.65)    81.71(2.68)    80.92(2.28)    84.86(3.65)
16         70.02(1.47)    66.45(2.93)    71.13(3.43)    73.68(1.51)    82.89(3.03)
32         68.68(2.32)    69.07(2.75)    66.44(2.98)    73.02(1.39)    79.47(1.18)
64         67.53(3.13)    60.53(2.63)    63.82(4.04)    65.79(1.56)    75.39(1.65)

Glass-3
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          76.57(1.22)    75.36(2.69)    78.10(4.67)    85.29(0.17)    88.23(0.24)
4          74.22(0.97)    76.47(1.76)    79.41(3.22)    84.87(1.11)    85.88(2.63)
8          70.53(1.32)    74.71(1.87)    73.53(4.26)    83.24(3.11)    85.29(1.61)
16         69.96(1.75)    67.64(2.58)    70.59(3.71)    78.92(4.32)    79.41(3.83)
32         67.95(2.01)    70.58(1.67)    70.58(2.86)    76.47(3.16)    77.64(4.92)
64         66.32(3.43)    65.49(2.02)    70.06(4.17)    73.52(4.11)    71.76(6.69)

Vowel-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          95.68(1.11)    99.15(0.02)    99.15(0.02)    98.02(1.22)    99.15(0.02)
4          94.37(0.96)    98.96(0.20)    99.86(0.02)    97.81(1.73)    98.95(0.32)
8          93.26(1.44)    87.50(1.86)    97.81(0.61)    97.19(1.10)    96.67(1.36)
16         90.83(0.98)    79.17(2.06)    93.75(2.38)    97.71(1.08)    95.83(2.70)
32         87.32(1.23)    68.75(1.71)    87.50(2.98)    98.02(0.33)    96.54(0.87)
64         85.27(0.76)    68.75(1.53)    87.25(2.88)    97.09(0.16)    94.79(1.04)

Vowel-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          93.76(0.77)    98.85(0.42)    99.68(0.02)    97.60(1.10)    98.75(0.87)
4          91.25(1.22)    91.67(1.12)    93.85(2.57)    96.67(2.29)    98.68(0.36)
8          89.33(1.54)    87.50(0.82)    85.42(4.36)    92.39(1.70)    96.87(1.27)
16         85.74(2.73)    75.00(1.56)    83.33(5.10)    91.04(1.98)    93.33(2.51)
32         83.65(2.88)    79.17(1.43)    81.25(2.42)    90.00(1.32)    89.79(2.00)
64         82.71(3.65)    71.88(1.79)    77.08(3.45)    90.62(0.08)    89.83(1.42)

Vowel-3
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          93.54(0.56)    98.95(0.03)    99.89(0.01)    97.91(0.08)    98.58(0.93)
4          92.37(0.78)    95.83(0.35)    97.81(0.88)    95.83(1.85)    97.08(1.14)
8          90.62(1.17)    88.54(0.64)    93.75(2.67)    92.71(2.33)    95.42(2.26)
16         87.35(1.68)    75.00(1.69)    87.50(4.21)    89.68(2.32)    92.03(1.19)
32         84.74(1.97)    79.17(1.42)    83.33(2.38)    87.08(1.32)    88.54(1.71)
64         83.17(2.63)    69.79(1.94)    82.91(4.49)    86.56(0.59)    86.25(0.85)

Wine-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          96.86(0.21)    99.89(0.01)    99.89(0.01)    99.15(0.02)    99.85(0.00)
4          96.47(0.30)    99.40(0.05)    99.87(0.02)    99.15(0.03)    99.85(0.02)
8          95.23(0.46)    96.69(1.76)    99.85(0.02)    99.15(0.02)    99.79(0.00)
16         94.76(1.00)    91.28(1.98)    99.49(0.01)    99.15(0.02)    98.89(0.01)
32         92.07(1.74)    93.22(1.69)    98.66(0.08)    99.15(0.02)    98.87(0.03)
64         90.42(2.02)    90.68(1.73)    99.15(0.02)    99.15(0.02)    99.02(0.02)

Wine-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          96.43(0.43)    99.64(0.05)    99.92(0.01)    98.73(0.45)    99.92(0.01)
4          95.22(0.58)    95.84(1.11)    99.91(0.01)    98.59(0.16)    99.71(0.00)
8          93.76(1.21)    90.21(1.33)    99.88(0.01)    98.45(0.33)    99.29(0.00)
16         90.20(0.88)    86.19(0.94)    98.73(0.40)    98.38(0.48)    97.88(0.39)
32         89.35(1.28)    79.57(0.95)    96.19(0.94)    97.74(0.88)    97.74(0.62)
64         88.90(1.96)    76.76(1.56)    93.66(1.10)    97.18(0.59)    97.18(0.06)

Dna-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          92.06(0.25)    98.09(0.48)    99.34(0.06)    99.13(0.06)    99.03(0.16)
4          91.58(0.38)    94.01(0.96)    98.22(0.42)    99.03(0.20)    99.01(0.28)
8          88.04(1.20)    88.47(1.63)    95.34(1.10)    98.81(0.12)    98.90(0.12)
16         86.75(2.77)    82.22(1.86)    92.65(1.21)    98.81(0.11)    98.74(0.38)
32         77.35(3.65)    75.21(1.37)    90.28(1.95)    98.38(0.23)    98.70(0.31)
64         69.46(4.58)    71.98(2.72)    88.82(1.80)    98.49(0.06)    98.41(0.29)

Dna-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          91.54(1.55)    97.26(0.42)    98.71(0.37)    99.27(0.04)    99.14(0.21)
4          91.03(0.72)    94.64(0.84)    96.73(0.87)    99.27(0.04)    99.07(0.37)
8          88.64(1.67)    91.55(1.47)    94.62(0.82)    98.86(0.08)    99.04(0.16)
16         85.37(3.12)    86.80(2.17)    91.71(1.08)    98.36(0.11)    98.69(0.21)
32         76.84(4.44)    79.89(0.67)    89.82(1.67)    98.24(0.26)    98.48(0.42)
64         73.65(5.28)    75.36(2.99)    88.56(1.95)    98.65(0.17)    98.45(0.21)

Satimage-1
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          97.12(0.52)    98.48(0.10)    99.56(0.07)    98.43(0.22)    98.78(0.00)
4          96.83(0.35)    98.69(0.11)    98.55(0.15)    98.43(0.21)    98.74(0.06)
8          96.54(0.43)    97.86(0.48)    98.59(0.15)    98.43(0.21)    98.57(0.06)
16         95.87(1.21)    96.04(0.88)    98.50(0.26)    98.43(0.22)    97.84(0.93)
32         93.26(0.81)    94.17(0.85)    98.43(0.27)    98.43(0.22)    96.86(1.53)
64         90.14(0.22)    92.50(1.36)    98.25(0.51)    98.43(0.22)    96.20(2.24)

Satimage-2
Bag size   MeanMap        InvCal         alter-∝SVM     p-NPSVM1       p-NPSVM2
2          94.44(0.16)    95.79(0.09)    98.12(0.11)    97.81(0.02)    97.02(0.03)
4          93.90(0.25)    96.36(0.08)    97.92(0.19)    97.45(0.10)    96.99(0.08)
8          93.66(0.22)    95.60(0.18)    97.33(0.27)    96.82(0.25)    96.89(0.13)
16         92.54(0.52)    95.03(0.28)    96.27(0.39)    96.82(0.37)    96.16(0.34)
32         90.32(1.95)    94.71(0.24)    95.44(0.26)    96.87(0.40)    96.06(2.31)
64         89.77(1.73)    94.20(0.85)    95.12(0.29)    96.30(0.76)    95.31(1.30)
Table 6
Wilcoxon signed ranks test results for binary classification.

proportion-NPSVM1        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2105     106      0.000000           2046     165      0.000000
vs. InvCal               2124     87       0.000000           1912.5   298.5    0.000000
vs. alter-∝SVM           1947     264      0.000001           1780     431      0.000017

proportion-NPSVM2        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2106     105      0.000000           2066.5   144.5    0.000000
vs. InvCal               2120     91       0.000000           2048     163      0.000000
vs. alter-∝SVM           1882     329      0.000001           1762.5   448.5    0.000027
Table 7
Wilcoxon signed ranks test results for multi-class classification.

proportion-NPSVM1        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2975     106      0.000000           2698.5   382.5    0.000000
vs. InvCal               3077     4        0.000000           2581.5   499.5    0.000000
vs. alter-∝SVM           2635.5   445.5    0.000001           1949     1132     0.011900

proportion-NPSVM2        Linear kernel                        RBF kernel
                         R+       R−       p-value            R+       R−       p-value
vs. MeanMap              2956     125      0.000000           3075     6        0.0000000
vs. InvCal               2750     331      0.000000           2935     146      0.0000000
vs. alter-∝SVM           2525.5   555.5    0.000001           2324     757      0.0000008
Appendix A. Supplementary material

A.1. The Wolfe dual problems of NPSVM

In this section, the derivation of the Wolfe dual problems of NPSVM is presented, following [28]. As mentioned in Section 2, the primal problems of NPSVM are

$$
\begin{aligned}
\min_{w_+,\,b_+,\,\eta_+,\,\xi_-}\quad & \frac{1}{2}c_3\left(\|w_+\|^2+b_+^2\right)+\frac{1}{2}\eta_+^T\eta_+ + c_1 e_-^T\xi_-\\
\text{s.t.}\quad & A w_+ + e_+ b_+ = \eta_+,\\
& -(B w_+ + e_- b_+) + \xi_- \ge e_-,\ \ \xi_- \ge 0
\end{aligned}
\tag{26}
$$

and

$$
\begin{aligned}
\min_{w_-,\,b_-,\,\eta_-,\,\xi_+}\quad & \frac{1}{2}c_4\left(\|w_-\|^2+b_-^2\right)+\frac{1}{2}\eta_-^T\eta_- + c_2 e_+^T\xi_+\\
\text{s.t.}\quad & B w_- + e_- b_- = \eta_-,\\
& (A w_- + e_+ b_-) + \xi_+ \ge e_+,\ \ \xi_+ \ge 0
\end{aligned}
\tag{27}
$$

where $A=(x_1,\ldots,x_p)^T\in R^{p\times n}$ and $B=(x_{p+1},\ldots,x_{p+q})^T\in R^{q\times n}$ denote the positive and negative data matrices respectively, $c_i\ (i=1,2,3,4)$ are the penalty parameters, $e_+$ and $e_-$ are vectors of ones of appropriate dimensions, $\xi_+$ and $\xi_-$ are slack vectors of appropriate dimensions, and $\eta_+$ and $\eta_-$ are vectors of appropriate dimensions. To form the dual problems, we introduce the Lagrangian function of problem (26):

$$
\begin{aligned}
L(w_+,b_+,\eta_+,\xi_-,\alpha,\beta,\lambda)=\ & \frac{1}{2}c_3\left(\|w_+\|^2+b_+^2\right)+\frac{1}{2}\eta_+^T\eta_+ + c_1 e_-^T\xi_-\\
& +\lambda^T(A w_+ + e_+ b_+ - \eta_+)+\alpha^T(B w_+ + e_- b_+ - \xi_- + e_-)-\beta^T\xi_-
\end{aligned}
\tag{28}
$$

in which the vectors $\alpha=(\alpha_1,\ldots,\alpha_q)^T$, $\beta=(\beta_1,\ldots,\beta_q)^T$ and $\lambda=(\lambda_1,\ldots,\lambda_p)^T$ are the Lagrange multipliers. Then the Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions of problem (26) are given by

$$ c_3 w_+ + A^T\lambda + B^T\alpha = 0, \tag{29} $$
$$ c_3 b_+ + e_+^T\lambda + e_-^T\alpha = 0, \tag{30} $$
$$ \lambda - \eta_+ = 0, \tag{31} $$
$$ c_1 e_- - \alpha - \beta = 0, \tag{32} $$
$$ A w_+ + e_+ b_+ = \eta_+, \tag{33} $$
$$ -(B w_+ + e_- b_+) + \xi_- \ge e_-,\quad \xi_- \ge 0, \tag{34} $$
$$ \alpha^T(B w_+ + e_- b_+ - \xi_- + e_-) = 0,\quad \beta^T\xi_- = 0, \tag{35} $$
$$ \alpha \ge 0,\quad \beta \ge 0. \tag{36} $$

Since $\beta \ge 0$, condition (32) implies

$$ 0 \le \alpha \le c_1 e_-. \tag{37} $$

We can also obtain $w_+$ and $b_+$ by solving (29) and (30), respectively:

$$ w_+ = -\frac{1}{c_3}\left(A^T\lambda + B^T\alpha\right),\qquad b_+ = -\frac{1}{c_3}\left(e_+^T\lambda + e_-^T\alpha\right). \tag{38} $$
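To make the step to the dual easier to follow, the substitution of (31), (32) and (38) into the Lagrangian (28) can be written out explicitly. This intermediate computation is our own addition; it simply fills in the algebra behind the dual problem stated next:

$$
\begin{aligned}
L \;&=\; \tfrac{1}{2}c_3\!\left(\|w_+\|^2+b_+^2\right) + \tfrac{1}{2}\eta_+^T\eta_+
      + \lambda^T(Aw_+ + e_+b_+ - \eta_+) + \alpha^T(Bw_+ + e_-b_+ + e_-)
      + (c_1e_- - \alpha - \beta)^T\xi_- \\
  &\overset{(31),(32),(38)}{=}\;
      -\tfrac{1}{2c_3}\big\|A^T\lambda + B^T\alpha\big\|^2
      -\tfrac{1}{2c_3}\big(e_+^T\lambda + e_-^T\alpha\big)^2
      -\tfrac{1}{2}\lambda^T\lambda + e_-^T\alpha .
\end{aligned}
$$

Multiplying by $c_3$, which does not change the maximizer, yields the objective of (39) with $Q$ as given in (40).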
Then, by using conditions (31) and (38), the Wolfe dual of the primal problem (26) is obtained as

$$
\begin{aligned}
\max_{\lambda,\,\alpha}\quad & -\frac{1}{2}\,(\lambda^T\ \alpha^T)\,Q\,(\lambda^T\ \alpha^T)^T + c_3\, e_-^T\alpha\\
\text{s.t.}\quad & 0 \le \alpha \le c_1 e_-
\end{aligned}
\tag{39}
$$

where

$$
Q=\begin{pmatrix} AA^T + c_3 I & AB^T\\ BA^T & BB^T \end{pmatrix} + E
\tag{40}
$$

and $I\in R^{p\times p}$ is an identity matrix and $E\in R^{l\times l}$ is the matrix with all elements equal to 1. Similarly, the Wolfe dual problem of (27) is

$$
\begin{aligned}
\max_{\theta,\,\gamma}\quad & -\frac{1}{2}\,(\theta^T\ \gamma^T)\,\hat{Q}\,(\theta^T\ \gamma^T)^T + c_4\, e_+^T\gamma\\
\text{s.t.}\quad & 0 \le \gamma \le c_2 e_+
\end{aligned}
\tag{41}
$$

where

$$
\hat{Q}=\begin{pmatrix} BB^T + c_4 I & BA^T\\ AB^T & AA^T \end{pmatrix} + E
\tag{42}
$$

and $I\in R^{q\times q}$ is an identity matrix and $E\in R^{l\times l}$ is the matrix with all elements equal to 1. In conclusion, Eqs. (39) and (41) are the Wolfe dual problems of the primal NPSVM.
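As a numerical illustration of how the box-constrained dual (39) and the recovery formula (38) can be used, the following sketch builds $Q$ as in (40) and runs plain projected gradient ascent. This is our own stand-in for the SOR/SMO solvers referred to in the paper, with hypothetical names such as solve_dual_39; it is not the authors' implementation.

```python
import numpy as np

def solve_dual_39(A, B, c1, c3, steps=2000, lr=None):
    """Projected gradient ascent on the Wolfe dual (39):
       max_{lambda, alpha} -0.5 z^T Q z + c3 * e_-^T alpha,
       with z = [lambda; alpha], lambda free and 0 <= alpha <= c1,
       and Q built as in (40)."""
    p, q = A.shape[0], B.shape[0]
    l = p + q
    Q = np.block([[A @ A.T + c3 * np.eye(p), A @ B.T],
                  [B @ A.T,                  B @ B.T]]) + np.ones((l, l))
    grad_lin = np.concatenate([np.zeros(p), c3 * np.ones(q)])  # gradient of the linear term
    z = np.zeros(l)
    if lr is None:
        lr = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)               # safe step size
    for _ in range(steps):
        z = z + lr * (grad_lin - Q @ z)                          # ascent step on the concave objective
        z[p:] = np.clip(z[p:], 0.0, c1)                          # project alpha onto the box [0, c1]
    lam, alpha = z[:p], z[p:]
    # Recover the primal variables of the first hyperplane via (38).
    w_plus = -(A.T @ lam + B.T @ alpha) / c3
    b_plus = -(lam.sum() + alpha.sum()) / c3
    return w_plus, b_plus

rng = np.random.default_rng(0)
A = rng.normal(loc=+1.0, size=(30, 2))   # positive-class data matrix (toy example)
B = rng.normal(loc=-1.0, size=(30, 2))   # negative-class data matrix (toy example)
w_plus, b_plus = solve_dual_39(A, B, c1=1.0, c3=1.0)
```

The dual of (27) can be handled the same way after swapping the roles of $A$ and $B$ and of the penalty parameters.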
A.2. The transformation of the problems in the alternating solving stage

When $w_+$, $w_-$, $b_+$, $b_-$ are fixed, the primal problems (18) and (19) become

$$
\begin{aligned}
\min_{y,\,\eta_+,\,\xi_-}\quad & \frac{1}{2}\eta_+^T\eta_+ + c_1 e_-^T\xi_- + \frac{c_p}{2}\, e^T L_p(y)\\
\text{s.t.}\quad & A_y w_+ + e_+ b_+ = \eta_+,\\
& -\left(B_y w_+ + e_- b_+\right) + \xi_- \ge e_-,\ \ \xi_- \ge 0,\\
& y_i \in \{-1,+1\},\ i=1,\ldots,N
\end{aligned}
\tag{43}
$$

and

$$
\begin{aligned}
\min_{y,\,\eta_-,\,\xi_+}\quad & \frac{1}{2}\eta_-^T\eta_- + c_2 e_+^T\xi_+ + \frac{c_p}{2}\, e^T L_p(y)\\
\text{s.t.}\quad & B_y w_- + e_- b_- = \eta_-,\\
& \left(A_y w_- + e_+ b_-\right) + \xi_+ \ge e_+,\ \ \xi_+ \ge 0,\\
& y_i \in \{-1,+1\},\ i=1,\ldots,N
\end{aligned}
\tag{44}
$$

According to the relationship between (43) and (44), for each point $x_i$ two decision values $f_i^+$ and $f_i^-$ can be calculated from the model obtained before:

$$ f_i^+ = w_+^T\phi(x_i) + b_+,\qquad f_i^- = w_-^T\phi(x_i) + b_-. \tag{45} $$

Then the loss function for problems (43) and (44) is

$$
L_i=\begin{cases} \max(1-f_i^+,\,0), & \text{if } y_i=+1,\\[2pt] \max(1+f_i^-,\,0), & \text{if } y_i=-1. \end{cases}
\tag{46}
$$

Actually, solving (43) and (44) amounts to searching for the label vector $y$ that yields the least loss. Therefore, without changing the results, the two decision values $f_i^+$ and $f_i^-$ can be integrated into a single value $F_i=f_i^- - f_i^+$. Based on the analysis above, the optimization problem mentioned above can be transformed into

$$
\begin{aligned}
\min_{y}\quad & \sum_{i=1}^{N} L_F\!\left(y_i,\ w_+^T\phi(x_i)+b_+,\ w_-^T\phi(x_i)+b_-\right) + \frac{c_p}{c}\sum_{k=1}^{K} L_p(y)\\
\text{s.t.}\quad & y_i \in \{-1,+1\},\ i=1,\ldots,N
\end{aligned}
\tag{47}
$$

where $L_F=\max(1-y_i F_i,\,0)$ and $F_i=f_i^- - f_i^+ = \left(w_-^T\phi(x_i)+b_-\right) - \left(w_+^T\phi(x_i)+b_+\right)$.

A.3. Proof of Proposition 1

Proposition 1. For a fixed $p_k(y)$, problem (21) can be optimized by the strategy mentioned above.

Proof. Note that $y_i\ (\forall i\in B_k)$ has an independent influence on the first term of the objective function of (21). Without loss of generality, assume $B_k=\{1,2,\ldots,|B_k|\}$. Let $\sigma=p_k$, and let $\delta_i^{*}$ denote the sorted $\delta_i$, satisfying $\delta_1^{*}\ge\delta_2^{*}\ge\ldots\ge\delta_{|B_k|}^{*}$. The label proportion can be satisfied by flipping $\sigma|B_k|$ of the $y_i$'s signs. For (21), we want to solve the following problem:

$$
\min_{B_k^{+}}\quad \sum_{i\in B_k^{-}}\delta_i-\sum_{i\in B_k^{+}}\delta_i
\qquad \text{s.t.}\quad |B_k^{+}|=\sigma|B_k|,
\tag{48}
$$

where $B_k^{+}=\{i\,|\,y_i=1,\ i\in B_k\}$ and $B_k^{-}=\{i\,|\,y_i=-1,\ i\in B_k\}$.

It can be easily verified, by contradiction, that $B_k^{+}=\{1,2,\ldots,\sigma|B_k|\}$ is the optimal solution of problem (48). Indeed, assume that there exist $B_k^{+*}$ and $B_k^{-*}$ satisfying the constraints $|B_k^{+*}|=\sigma|B_k|$, $B_k^{+*}\cup B_k^{-*}=B_k$ and $B_k^{+*}\cap B_k^{-*}=\emptyset$, but $B_k^{+*}\neq\{1,2,\ldots,\sigma|B_k|\}$ with a strictly smaller objective value. Then we would have

$$
\sum_{i\in B_k^{-*}}\delta_i^{*}-\sum_{i\in B_k^{+*}}\delta_i^{*}
-\left(\sum_{i=\sigma|B_k|+1}^{|B_k|}\delta_i^{*}-\sum_{i=1}^{\sigma|B_k|}\delta_i^{*}\right)<0.
\tag{49}
$$

However, since $\sum_{i\in B_k^{+*}}\delta_i^{*}-\sum_{i=1}^{\sigma|B_k|}\delta_i^{*}\le 0$ and $\sum_{i=\sigma|B_k|+1}^{|B_k|}\delta_i^{*}-\sum_{i\in B_k^{-*}}\delta_i^{*}\le 0$, inequality (49) cannot hold. That is to say, $B_k^{+}=\{1,2,\ldots,\sigma|B_k|\}$ is the optimal solution of problem (48). The proposition is proved.
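The label-update step described in A.2 and justified by Proposition 1 can be sketched as follows. This is our own illustration, not the authors' code: it assumes linear decision values for brevity and takes $\delta_i = L_i(-1) - L_i(+1)$ (the per-sample preference for the $+1$ label implied by (46)) as the concrete form of $\delta_i$, whose definition is given earlier in the paper; the function name assign_bag_labels is hypothetical.

```python
import numpy as np

def assign_bag_labels(X, bags, proportions, w_plus, b_plus, w_minus, b_minus):
    """One y-step of the alternating scheme: with both hyperplanes fixed,
    choose labels inside each bag so that the bag's positive proportion is
    respected while the hinge losses in (46) stay as small as possible."""
    f_plus = X @ w_plus + b_plus                 # decision values of the '+' hyperplane
    f_minus = X @ w_minus + b_minus              # decision values of the '-' hyperplane
    loss_pos = np.maximum(1.0 - f_plus, 0.0)     # L_i if y_i = +1
    loss_neg = np.maximum(1.0 + f_minus, 0.0)    # L_i if y_i = -1
    delta = loss_neg - loss_pos                  # assumed delta_i: gain of choosing +1
    y = -np.ones(len(X))
    for bag, p_k in zip(bags, proportions):
        bag = np.asarray(bag)
        n_pos = int(round(p_k * len(bag)))       # sigma * |B_k| positives in this bag
        order = np.argsort(-delta[bag])          # sort delta_i in descending order
        y[bag[order[:n_pos]]] = +1.0             # the largest delta_i receive label +1
    return y

# Example call, reusing quantities produced elsewhere (bags, proportions and the
# two fixed hyperplanes): y_new = assign_bag_labels(Xb, bags, props,
#                                                   w_plus, b_plus, w_minus, b_minus)
```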
References

[1] J. Hernández-González, I. Inza, J.A. Lozano, Learning Bayesian network classifiers from label proportions, Pattern Recognit. 46 (12) (2013) 3425–3440.
[2] N. Quadrianto, A.J. Smola, T.S. Caetano, Q.V. Le, Estimating labels from label proportions, J. Mach. Learn. Res. 10 (2009) 2349–2374.
[3] M. Stolpe, K. Morik, Learning from label proportions by optimizing cluster model selection, in: G. Dimitrios, H. Thomas, M. Donato, V. Michalis (Eds.), Proceedings of Machine Learning and Knowledge Discovery in Databases, 2011, pp. 349–364.
[4] F.X. Yu, K. Choromanski, S. Kumar, T. Jebara, S.-F. Chang, On learning with label proportions, Statistics 1050 (2014) 24–36.
[5] T. Chen, F.X. Yu, J. Chen, Y. Cui, Y.-Y. Chen, S.-F. Chang, Object-based visual sentiment concept analysis and application, in: K. Hua, Y. Rui, R. Steinmetz, et al. (Eds.), Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 367–376.
[6] F.X. Yu, L. Cao, M. Merler, N. Codella, T. Chen, J.R. Smith, S.-F. Chang, Modeling attributes from category-attribute proportions, in: K. Hua, Y. Rui, R. Steinmetz, et al. (Eds.), Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 977–980.
[7] K.-T. Lai, X.Y. Felix, M.-S. Chen, S.-F. Chang, Video event detection by inferring temporal instance labels, in: S. Dickinson, D. Metaxas (Eds.), Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 2251–2258.
[8] C.-F. Tsai, Training support vector machines based on stacked generalization for image classification, Neurocomputing 64 (2005) 497–503.
[9] Z. Qi, Y. Tian, Y. Shi, Robust twin support vector machine for pattern classification, Pattern Recognit. 46 (1) (2013) 305–316.
[10] Z. Qi, Y. Tian, Y. Shi, Successive overrelaxation for Laplacian support vector machine, IEEE Trans. Neural Netw. Learn. Syst. 26 (4) (2015) 674–683.
[11] F. Moosmann, E. Nowak, F. Jurie, Randomized clustering forests for image classification, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1632–1646.
[12] X. Pan, Y. Luo, Y. Xu, K-nearest neighbor based structural twin support vector machine, Knowl. Based Syst. 88 (2015) 34–44.
[13] Y. Xu, J. Yu, Y. Zhang, KNN-based weighted rough ν-twin support vector machine, Knowl. Based Syst. 71 (2014) 303–313.
[14] Y. Xu, L. Wang, A weighted twin support vector regression, Knowl. Based Syst. 33 (2012) 92–101.
[15] X. Peng, R. Yan, B. Zhao, H. Tang, Z. Yi, Fast low rank representation based spatial pyramid matching for image classification, Knowl. Based Syst. 90 (2015) 14–22.
[16] M. Berthod, Z. Kato, S. Yu, J. Zerubia, Bayesian image classification using Markov random fields, Image Vis. Comput. 14 (4) (1996) 285–295.
[17] Y. Xu, Z. Yang, Y. Zhang, X. Pan, L. Wang, A maximum margin and minimum volume hyper-spheres machine with pinball loss for imbalanced data classification, Knowl. Based Syst. 95 (2016) 75–85.
[18] S. Rüping, SVM classifier estimation from group probabilities, in: S. Kim, E.P. Xing (Eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 911–918.
[19] F. Yu, D. Liu, S. Kumar, T. Jebara, S.-F. Chang, ∝SVM for learning with label proportions, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 504–512.
[20] B. Wang, Z. Chen, Z. Qi, Linear twin SVM for learning from label proportions, in: A. Tan, Y. Li (Eds.), Proceedings of the 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 3, IEEE, 2015, pp. 56–59.
[21] B.C. Chen, L. Chen, R. Ramakrishnan, D.R. Musicant, Learning from aggregate views, in: L. Liu, A. Reuter, K.Y. Whang, J. Zhang (Eds.), Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006, pp. 3–12.
[22] D.R. Musicant, J.M. Christensen, J.F. Olson, Supervised learning by training on aggregate outputs, in: N. Ramakrishnan, O.R. Zaïane, Y. Shi, C.W. Clifton, X. Wu (Eds.), Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 252–261.
[23] H. Kuck, N. de Freitas, Learning about individuals from group statistics, in: D. Braziunas, C. Boutilier (Eds.), Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005, pp. 332–339.
[24] G.S. Mann, A. McCallum, Simple, robust, scalable semi-supervised learning via expectation regularization, in: Z. Ghahramani (Ed.), Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 593–600.
[25] K. Bellare, G. Druck, A. McCallum, Alternating projections for learning with expectation constraints, in: J. Bilmes, A.Y. Ng (Eds.), Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 43–50.
[26] J. Gillenwater, K. Ganchev, J. Graça, F. Pereira, B. Taskar, Posterior sparsity in unsupervised dependency parsing, J. Mach. Learn. Res. 12 (2011) 455–490.
[27] Y.F. Li, J.T. Kwok, Z.-H. Zhou, Semi-supervised learning using label mean, in: L. Bottou, M. Littman (Eds.), Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 633–640.
[28] Y. Tian, Z. Qi, X. Ju, Y. Shi, X. Liu, Nonparallel support vector machines for pattern classification, IEEE Trans. Cybern. 44 (7) (2014) 1067–1079.
[29] R. Khemchandani, S. Chandra, et al., Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[30] Y.-H. Shao, C.-H. Zhang, X.-B. Wang, N.-Y. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968.
[31] C.-F. Lin, S.-D. Wang, Fuzzy support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 464–471.
[32] R. Khemchandani, S. Chandra, et al., Fast and robust learning through fuzzy linear proximal support vector machines, Neurocomputing 61 (2004) 401–411.
[33] O. Chapelle, V. Sindhwani, S.S. Keerthi, Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. 9 (2008) 203–233.
[34] K. Bache, M. Lichman, UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.