False positive rate control for positive unlabeled learning



Neurocomputing 367 (2019) 13–19


Shuchen Kong a, Weiwei Shen a, Yingbin Zheng b, Ao Zhang a, Jian Pu a,∗, Jun Wang a

a School of Computer Science and Technology, East China Normal University, Shanghai, China
b Shanghai Videt Information Technology Co., Ltd., China

Article info

Article history: Received 4 December 2018; Accepted 2 August 2019; Available online 8 August 2019. Communicated by Steven Hoi.

Keywords: Positive unlabeled learning; False positive rate control; Concave-convex procedure

Abstract

Learning classifiers with false positive rate control has drawn intensive attention in applications over the past years. While various supervised algorithms have been developed for obtaining low false positive rates, they commonly require both positive and negative samples in the data. However, the scenario studied in positive unlabeled (PU) learning is more pervasive in practice: at inception, most of the data may not have known labels, and the data with known labels may represent only one type of sample. To tackle this challenge, we propose a new positive unlabeled learning classifier with false positive rate control. In particular, we first prove that, in this context, employing oft-adopted convex surrogate loss functions, such as the hinge loss, introduces a redundant penalty into the false positive rate constraint. Then, we show that the non-convex ramp loss surrogate overcomes this barrier and that a concave-convex procedure can solve the associated non-convex optimization problem. Finally, we demonstrate the effectiveness of the proposed method through extensive experiments on multiple datasets.

© 2019 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (J. Pu).

https://doi.org/10.1016/j.neucom.2019.08.001 0925-2312/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Learning classifiers with false positive rate control has recently attracted broad attention in applications. For example, various supervised algorithms have been developed for disease diagnosis [1] and spam filtering [2]. Those methods usually require a considerable amount of training data with both positive and negative samples. However, due to the high cost of labeling, in common practice only a small portion of samples are labeled. In the more challenging scenario addressed by positive unlabeled (PU) learning, one type of sample is completely missing, i.e., the available data comprise only positive and unlabeled samples. In this situation, on the one hand, existing methods for false positive rate control cannot easily be applied, because the quantity they require, i.e., the false positive rate, cannot be computed directly without negative samples. On the other hand, existing PU learning methods on their own cannot fulfill the need either, as they are inherently blind in differentiating false positives from false negatives [3].

In general, cost-sensitive learning is one frequently used approach to false positive rate control, which assigns different costs to the classes. Because cost assignment requires both domain knowledge and heuristic tuning, quantifying the costs is difficult. As an alternative, the Neyman-Pearson (NP) classification approach maximizes the true positive rate given a specific false positive rate [4–6]. Specifically, the method of Lagrangian multipliers is often applied to search for a saddle point of the associated constrained optimization problem [7,8]. In addition, partial AUC optimization, as a variant of the NP classification method, aims at maximizing the partial AUC score between any two false positive rates on the ROC curve [9,10].

In this work, we address the difficulty of controlling the false positive rate in PU learning. Following existing work on PU learning, we assume the class prior is given or can be estimated by minimizing the Pearson divergence [11]. Based on this assumption, we derive a representation of the false positive rate in PU learning that uses only positive and unlabeled samples. Further, we show that any convex surrogate loss leads to a redundant penalty in the false positive rate constraint. To avoid this redundancy, we adopt a non-convex loss function, i.e., the ramp loss. Furthermore, we show that the associated non-convex optimization problem can be transformed and solved by the standard concave-convex procedure (CCCP) [12]. Finally, we demonstrate the effectiveness of the proposed method through extensive experiments on multiple datasets.

The rest of this paper is organized as follows. In Section 2, we review previous work on PU learning. In Section 3, we present our method for solving the PU learning problem with false positive rate control. Experiments on synthetic datasets, UCI datasets, and image datasets are reported in Section 4. Finally, we conclude our work in Section 5.


S. Kong, W. Shen and Y. Zheng et al. / Neurocomputing 367 (2019) 13–19

2. Related work

Typical PU learning methods can be roughly categorized into three classes: heuristic approaches, cost-sensitive learning, and sampling-based methods.

The heuristic approaches are usually two-step methods. The first step identifies reliable negative samples from the unlabeled data, and the second step applies a supervised learning method to the combination of the originally known positive samples and the induced negative samples [13–15].

The cost-sensitive learning approaches assign different costs to positive and unlabeled samples. PU learning is then cast as a problem of learning with label noise by labeling all unlabeled examples as negative, which is solved by weighted logistic regression [16]. In addition, a biased twin SVM constructs two nonparallel hyperplanes to classify samples [17]. A theoretical interpretation of positive unlabeled learning has also been given, together with a link to cost-sensitive learning with a non-convex loss function [18]. In this strand, a convex formulation is provided in [19].

For the last category, the sampling-based approaches usually apply bootstrap sampling to PU learning. They regard samples drawn from the unlabeled data as negatives and combine them with the positive samples to create a labeled dataset, on which a supervised classifier is trained; this process is executed iteratively. Also, the bagging framework is applied to convert the PU learning problem into a series of supervised binary classification problems [20,21]. Adaptive sampling uses prediction probabilities from a model to iteratively update the training data [22].

3. Methods

In this section, we first introduce the notations and then derive the formulation and optimization algorithm in detail.

3.1. Notations

Consider the binary classification problem, where the input is a feature vector $x \in \mathbb{R}^d$ and the output is a label $y \in \{+1, -1\}$. We are given two sets of training samples: a positive set of $n$ samples $X^+ = \{x_i^+\}_{i=1}^{n}$ and an unlabeled set of $m$ samples $X^u = \{x_j^u\}_{j=1}^{m}$. Suppose the positive and unlabeled samples are independent and identically distributed (i.i.d.) draws from the conditional distribution $p(x \mid y = 1)$ and the marginal distribution $p(x)$, respectively:

$$x_i^+ \sim p(x \mid y = 1), \quad i = 1, \dots, n; \qquad x_j^u \sim p(x), \quad j = 1, \dots, m.$$

Note that the unlabeled set $X^u$ is a mixture of both positive and negative samples:

$$p(x) = \pi\, p(x \mid y = 1) + (1 - \pi)\, p(x \mid y = -1),$$

where $\pi$ denotes the class prior $p(y = 1)$ on unlabeled data. In the rest of this paper, we assume the class prior $\pi$ is known or can be estimated [11]. For a binary prediction function $g: x \to y$, the false negative rate (FNR) and the false positive rate (FPR) are defined respectively as follows:

$$\mathrm{FNR} = \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))] \quad \text{and} \quad \mathrm{FPR} = \mathbb{E}_{-1}[\ell_{0\text{-}1}(-g(X))],$$

where $\mathbb{E}_{1}$ and $\mathbb{E}_{-1}$ denote expectation with respect to the distribution of positive and negative samples, respectively, and $\ell_{0\text{-}1}(z)$ represents the zero-one loss function. For a finite sample set, the empirical FNR and FPR can be estimated by:

$$\widehat{\mathrm{FNR}} = \frac{1}{n_+} \sum_{i=1}^{n_+} \ell_{0\text{-}1}(g(x_i^+)), \qquad \widehat{\mathrm{FPR}} = \frac{1}{n_-} \sum_{j=1}^{n_-} \ell_{0\text{-}1}(-g(x_j^-)), \tag{1}$$

where $n_+$ and $n_-$ respectively denote the true numbers of positive and negative samples, and $x_i^+$ and $x_j^-$ respectively denote the $i$th positive and the $j$th negative sample. Note that, because of the unlabeled samples, the observed positive set and the true positive set may differ ($n \neq n_+$), and the unlabeled set may contain positive samples ($m \neq n_-$).

3.2. Proposed method

The risk of binary classification consists of the false positive error and the false negative error. Unlike cost-sensitive methods, which minimize a weighted combination of the two types of error, the Neyman–Pearson classification approach aims at minimizing FNR while keeping FPR under control:

$$\min_{g}\ \mathrm{FNR} \quad \text{s.t.} \quad \mathrm{FPR} \le \tau, \tag{2}$$

where $\tau \in (0, 1)$ is a specified tolerance of FPR. In the traditional setting of binary classification, FNR and FPR can be easily formulated and approximated by Eq. (1). However, in positive unlabeled learning, without access to negative samples, defining FPR becomes nontrivial. One naive approach is to approximate FPR by simply regarding the unlabeled samples as negative samples. However, this approximation is both scaled and biased, as the proof of Theorem 1 shows, which is undesirable because FPR serves as the constraint in Eq. (2). The following theorem gives an unscaled and unbiased expression of FPR:

Theorem 1. For a given classifier $g: X \to y$, the false positive rate in PU learning can be expressed in terms of expectations over positive data and unlabeled data:

$$\mathbb{E}_{-1}[\ell_{0\text{-}1}(-g(X))] = \frac{\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))] + \pi\, \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))] - \pi}{1 - \pi}, \tag{3}$$

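where $X \sim p(x)$.

Proof. According to the total probability theorem, we can decompose the expectation over all samples $X$ into the sum of the expectations over positive samples $X \mid y = 1$ and negative samples $X \mid y = -1$:

$$\begin{aligned}
\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))] &= \int \ell_{0\text{-}1}(-g(x))\, p(x)\, dx \\
&= \int \ell_{0\text{-}1}(-g(x))\, p(y = 1)\, p(x \mid y = 1)\, dx + \int \ell_{0\text{-}1}(-g(x))\, p(y = -1)\, p(x \mid y = -1)\, dx \\
&= \pi \int \ell_{0\text{-}1}(-g(x))\, p(x \mid y = 1)\, dx + (1 - \pi) \int \ell_{0\text{-}1}(-g(x))\, p(x \mid y = -1)\, dx \\
&= \pi\, \mathbb{E}_{1}[\ell_{0\text{-}1}(-g(X))] + (1 - \pi)\, \mathbb{E}_{-1}[\ell_{0\text{-}1}(-g(X))].
\end{aligned}$$

Hence, $\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))]$ is a biased and scaled estimate of $\mathbb{E}_{-1}[\ell_{0\text{-}1}(-g(X))]$: it is biased by $\pi\, \mathbb{E}_{1}[\ell_{0\text{-}1}(-g(X))]$ and scaled by $1 - \pi$. Furthermore, for the 0–1 loss function, we have:

$$\mathbb{E}_{1}[\ell_{0\text{-}1}(-g(X))] = 1 - \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))]. \tag{4}$$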


Fig. 1. Illustration of various loss functions and the plots of $\ell_s(z) + \ell_s(-z)$: (a) loss functions; (b) plots of $\ell_s(z) + \ell_s(-z)$ for different loss functions.

Then, the above expectation $\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))]$ can be represented as a weighted form of FPR and FNR:

$$\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))] = \pi \left(1 - \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))]\right) + (1 - \pi)\, \mathbb{E}_{-1}[\ell_{0\text{-}1}(-g(X))].$$

Therefore, we can represent the false positive rate using the following weighted expectation over unlabeled data $X$ and positive data:

$$\mathbb{E}_{-1}[\ell_{0\text{-}1}(-g(X))] = \frac{\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))] - \pi \left(1 - \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))]\right)}{1 - \pi} = \frac{\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))] + \pi\, \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))] - \pi}{1 - \pi}. \qquad \square$$

Plugging the false positive rate formulation (3) into the original Neyman–Pearson classification framework (2), we have:

$$\min_{g}\ \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))] \quad \text{s.t.} \quad \frac{\mathbb{E}_{X}[\ell_{0\text{-}1}(-g(X))] + \pi\, \mathbb{E}_{1}[\ell_{0\text{-}1}(g(X))] - \pi}{1 - \pi} \le \tau. \tag{5}$$

Since solving the above optimization with the 0–1 loss is NP-hard, researchers relax the optimization problem with specific surrogate losses [23]. In particular, convex relaxation is commonly preferred for its global optimality property. Accordingly, we could also use a convex surrogate loss to transform the optimization problem (5) into a constrained convex optimization problem. However, we find that in the setup of optimization (5) such a convex relaxation leads to a superfluous penalty, thereby making the constraint biased. A better strategy is to choose a surrogate loss which satisfies the following corollary.

Corollary 1. A sufficient condition for preserving the constraint of optimization (5) under relaxation is to choose a surrogate loss function $\ell_s$ which satisfies:

$$\ell_s(z) + \ell_s(-z) = 1.$$

Proof. Using the total probability theorem, we have:

$$\begin{aligned}
\frac{\mathbb{E}_{X}[\ell_s(-g(X))] + \pi\, \mathbb{E}_{1}[\ell_s(g(X))] - \pi}{1 - \pi}
&= \frac{(1 - \pi)\, \mathbb{E}_{-1}[\ell_s(-g(X))] + \pi\, \mathbb{E}_{1}[\ell_s(-g(X))]}{1 - \pi} + \frac{\pi\, \mathbb{E}_{1}[\ell_s(g(X))] - \pi}{1 - \pi} \\
&= \mathbb{E}_{-1}[\ell_s(-g(X))] + \frac{\pi\, \mathbb{E}_{1}[\ell_s(g(X)) + \ell_s(-g(X))] - \pi}{1 - \pi}.
\end{aligned}$$

Note that the first term is the false positive rate we want to constrain, while the second term is an expectation of the surrogate function over positive data. When $\ell_s(z) + \ell_s(-z) = 1$, the second term equals $(\pi - \pi)/(1 - \pi) = 0$. Therefore, if the surrogate function is chosen to satisfy $\ell_s(z) + \ell_s(-z) = 1$, the relaxed constraint is identical to the constraint of optimization (5). $\square$

In Fig. 1, we illustrate some representative loss functions and the plots of $\ell_s(z) + \ell_s(-z)$. We note that, among them, only the ramp loss, the sigmoid loss, and the 0–1 loss satisfy the condition of Corollary 1.

3.3. Optimization

In the previous section, we formulated the problem of minimizing the false negative rate under a specified false positive rate control for PU learning with a surrogate loss function as

$$\min_{g}\ \mathbb{E}_{1}[\ell_s(g(X))] \quad \text{s.t.} \quad \frac{\mathbb{E}_{X}[\ell_s(-g(X))] + \pi\, \mathbb{E}_{1}[\ell_s(g(X))] - \pi}{1 - \pi} \le \tau, \tag{6}$$

and the empirical form is

$$\min_{g}\ \widehat{\mathrm{FNR}}(\ell_s, g) \quad \text{s.t.} \quad \widehat{\mathrm{FPR}}(\ell_s, g) \le \tau, \tag{7}$$

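where

$$\begin{aligned}
\widehat{\mathrm{FNR}}(\ell_s, g) &= \frac{1}{n} \sum_{i=1}^{n} \ell_s\!\left(g(x_i^+)\right), \\
\widehat{\mathrm{FPR}}(\ell_s, g) &= \frac{1}{1 - \pi} \left[ \frac{1}{m} \sum_{j=1}^{m} \ell_s\!\left(-g(x_j^u)\right) + \frac{\pi}{n} \sum_{i=1}^{n} \ell_s\!\left(g(x_i^+)\right) - \pi \right].
\end{aligned} \tag{8}$$

As mentioned before, the ramp loss, the sigmoid loss, and the 0–1 loss satisfy the condition of Corollary 1. In this work, we choose the ramp loss $\ell_{\mathrm{Ramp}}(z) = \min\{1, \tfrac{1}{2} \max\{0, 1 - z\}\}$ as the surrogate function for the 0–1 loss. The ramp loss can easily be decomposed into the difference of two convex parts:

$$\ell_{\mathrm{Ramp}}(z) = R_1(z) - R_2(z),$$

where

$$R_1(z) = \tfrac{1}{2} \max\{0, 1 - z\} \quad \text{and} \quad R_2(z) = \tfrac{1}{2} \max\{0, -1 - z\}.$$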
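The quantities defined in Eq. (8) and the ramp-loss decomposition can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names (`ramp_loss`, `r1`, `r2`, `hinge_loss`, `empirical_fnr`, `empirical_fpr`) are chosen here for illustration, and the inputs are assumed to be arrays of classifier scores $g(x)$.

```python
import numpy as np

def ramp_loss(z):
    """Ramp loss min{1, 0.5 * max(0, 1 - z)} used as the 0-1 surrogate."""
    return np.minimum(1.0, 0.5 * np.maximum(0.0, 1.0 - z))

def r1(z):
    """First convex part R1(z) = 0.5 * max(0, 1 - z)."""
    return 0.5 * np.maximum(0.0, 1.0 - z)

def r2(z):
    """Second convex part R2(z) = 0.5 * max(0, -1 - z)."""
    return 0.5 * np.maximum(0.0, -1.0 - z)

def hinge_loss(z):
    """Convex hinge loss, included only for comparison."""
    return np.maximum(0.0, 1.0 - z)

def empirical_fnr(loss, scores_pos):
    """Empirical FNR of Eq. (8): mean surrogate loss on positive scores."""
    return float(np.mean(loss(scores_pos)))

def empirical_fpr(loss, scores_pos, scores_unl, prior):
    """Empirical PU estimate of the FPR of Eq. (8) from positive and unlabeled scores."""
    unl_term = np.mean(loss(-scores_unl))
    pos_term = prior * np.mean(loss(scores_pos))
    return float((unl_term + pos_term - prior) / (1.0 - prior))

z = np.linspace(-3.0, 3.0, 13)
print(np.allclose(ramp_loss(z), r1(z) - r2(z)))           # True: ramp = R1 - R2
print(np.allclose(ramp_loss(z) + ramp_loss(-z), 1.0))     # True: Corollary 1 holds
print(np.allclose(hinge_loss(z) + hinge_loss(-z), 1.0))   # False: hinge violates it
```

The last check illustrates why a convex surrogate such as the hinge loss leaves a nonzero second term in the proof of Corollary 1, whereas the ramp loss does not.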

Applying the method of Lagrangian multipliers, plugging in the ramp loss function $\ell_{\mathrm{Ramp}}$, and choosing a linear mapping $g(x) = w^\top x + b$, we transform the optimization problem (7) into the following min-max optimization:

$$\min_{w, b} \max_{\lambda}\ L(w, b, \lambda) = \widehat{\mathrm{FNR}}(\ell_{\mathrm{Ramp}}, g) + \lambda \left( \widehat{\mathrm{FPR}}(\ell_{\mathrm{Ramp}}, g) - \tau \right), \tag{9}$$

where $\lambda$ is the Lagrangian multiplier. To solve the above min-max optimization problem, we apply alternating optimization over $w$ and $\lambda$. For the nonconvex optimization over $w$, we rewrite (9) as the difference of two convex parts, which is amenable to the concave-convex procedure (CCCP) [12]:

$$\min_{w}\ \{ L_1(w, b, \lambda) - L_2(w, b, \lambda) \}, \tag{10}$$

where

$$L_1(w, b, \lambda) = \widehat{\mathrm{FNR}}(R_1, g) + \lambda\, \widehat{\mathrm{FPR}}(R_1, g)$$

and

$$L_2(w, b, \lambda) = \widehat{\mathrm{FNR}}(R_2, g) + \lambda\, \widehat{\mathrm{FPR}}(R_2, g).$$

In each CCCP iteration of optimizing $w$ and $b$, we convert the nonconvex optimization (10) into a convex optimization problem by linearizing the second part $L_2(w, b, \lambda)$:

$$\min_{w, b}\ \Big\{ L_1(w, b, \lambda_t) - \langle \nabla_w L_2(w, b_t, \lambda_t),\ w - w_t \rangle - (b - b_t) \cdot \frac{\partial}{\partial b} L_2(w_t, b, \lambda_t) \Big\},$$

where $\nabla_w L_2(w, b_t, \lambda_t)$ denotes the derivative of $L_2(w, b_t, \lambda_t)$ with respect to $w$, evaluated at the current iterate $w_t$ (and likewise the derivative with respect to $b$ at $b_t$). This linearized minimization problem is convex and can be solved efficiently by gradient descent. The optimization over $\lambda$ is achieved by proximal gradient ascent:

$$\lambda_{t+1} = \max\{0,\ \lambda_t + \alpha \nabla_{\lambda} L(w_t, b_t, \lambda)\}, \tag{11}$$

where $\alpha$ is the step length chosen by the Armijo rule. The optimization method is summarized in Algorithm 1, which is organized as a nested loop. The inner loop solves the nonconvex optimization problem using CCCP, while the outer loop updates the dual variable $\lambda$ to penalize violations of the constraint.

Algorithm 1 Alternating optimization of the min-max problem.
Input: The set of positive samples, $X^+$; the set of unlabeled samples, $X^u$; initial values $w_0$, $b_0$, and $\lambda_0$.
Output: The optimal model $w^*$, $b^*$.
1: Initialize $w \leftarrow w_0$, $b \leftarrow b_0$, and $\lambda \leftarrow \lambda_0$;
2: repeat
3:   repeat
4:     Perform gradient descent for the linearized convex optimization;
5:   until convergence
6:   Dual variable ascent by Eq. (11);
7: until convergence
8: return $w_t$, $b_t$.

4. Experimental results

In this section, we investigate the behavior of the proposed method and evaluate its performance on three types of datasets: synthetic, UCI, and image datasets. The synthetic dataset is first used to verify the properties of our method; then we demonstrate the effectiveness and robustness of our method using four real-world datasets from UCI [24] and three image classification datasets.

4.1. Settings

We compare our method with two state-of-the-art PU learning approaches. The first is the convex formulation of cost-sensitive learning for PU learning (CS-PU) [19], and the other is PU learning via adaptive sampling (AdaSampling) [22]. To be fair, we use the linear-in-parameter classifier $g(x) = w^\top x + b$ for all three methods. Hyperparameters are selected via cross-validation. To evaluate the performance of a classifier under a specified false positive rate, we use the NP-score [25] as the evaluation metric:

$$\frac{1}{\tau} \max\{\mathrm{FPR}, \tau\} - \mathrm{TPR},$$

where FPR and TPR denote the false positive rate and the true positive rate, respectively, and $\tau$ is the target false positive rate. This measure penalizes a false positive rate larger than $\tau$ and also takes the TPR into consideration. The smaller the NP-score, the better the classifier.

4.2. Results on synthetic datasets

The synthetic data is used to analyse the behaviour of our method. We generate data using two-dimensional Gaussian distributions with random noise:

$$p(x \mid y = 1) \sim N(1, 1^2), \qquad p(x \mid y = -1) \sim N(-1, 1^2), \tag{12}$$

where $N(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. 1000 positive samples and 2000 unlabeled samples are generated independently from each distribution for training, and 1000 samples are randomly generated for testing. The class prior $\pi$ controls the ratio of positive samples in the unlabeled samples, and $\tau$ is the specified constraint on the false positive rate. We change $\pi$ from 0.1 to 0.7 with step 0.2 and set $\tau$ to 0.05 and 0.1. The results are shown in Table 1.

Table 1
Performance of the PU-NP model on synthetic datasets with different values of class prior π and tolerance τ.

| π | τ | FPR | TPR | NP-Score |
|---|---|---|---|---|
| 0.1 | 0.05 | 0.041 ± 0.007 | 0.796 ± 0.017 | 0.220 ± 0.037 |
| 0.1 | 0.1 | 0.094 ± 0.007 | 0.883 ± 0.102 | 0.125 ± 0.023 |
| 0.3 | 0.05 | 0.047 ± 0.008 | 0.810 ± 0.024 | 0.222 ± 0.032 |
| 0.3 | 0.1 | 0.095 ± 0.005 | 0.886 ± 0.015 | 0.122 ± 0.025 |
| 0.5 | 0.05 | 0.053 ± 0.005 | 0.826 ± 0.025 | 0.262 ± 0.062 |
| 0.5 | 0.1 | 0.099 ± 0.007 | 0.892 ± 0.016 | 0.132 ± 0.044 |
| 0.7 | 0.05 | 0.058 ± 0.005 | 0.845 ± 0.023 | 0.315 ± 0.091 |
| 0.7 | 0.1 | 0.104 ± 0.007 | 0.895 ± 0.018 | 0.165 ± 0.053 |

We first conduct the experiment on synthetic datasets with the class prior $\pi = 0.3$. In Fig. 2(a), the red points and blue points represent positive samples and negative samples, respectively. The red solid line shows the decision boundary defined by the model with $\tau = 0.05$, while the black dashed line separates the two classes with $\tau = 0.1$. We can see that, although there are only positive and unlabeled data in the training set, the proposed method works very well. We also notice that with a stricter false positive rate target $\tau$, the decision boundary becomes more biased towards positive samples, which shows how our method adapts to different false positive rate targets.

Fig. 2(b) shows the discrepancy between the actual FPR and the target $\tau$, i.e., $\mathrm{FPR} - \tau$, as the iterations proceed. The horizontal axis represents the number of iterations in our experiments, and the vertical axis is the value of $\mathrm{FPR} - \tau$. The blue curve represents the training process with $\tau = 0.1$, and the orange one represents the experiment with $\tau = 0.05$. For both settings, the value of $\mathrm{FPR} - \tau$ eventually converges to zero, which implies that the constraint on the false positive rate is eventually satisfied. Also, we can

see that the orange curve is longer than the blue curve, which can be explained by the fact that, with a stricter constraint, the model needs more iterations to converge.

Fig. 2. Illustration of experimental results on the synthetic dataset: (a) decision boundary of the linear model with different τ; (b) convergence curve of FPR − τ as a function of the number of iterations.

Then we conduct experiments by varying the class prior $\pi$ and the target false positive rate $\tau$. We record the false positive rate (FPR), the true positive rate (TPR), and the NP-score for these experiments with different $\pi$ and $\tau$ in Table 1. Each experiment is conducted ten times, and the mean and the standard deviation are reported. As the class prior $\pi$ increases, the false positive rate constraint becomes more difficult to satisfy, and the true positive rate grows. This is because the number of negative samples in the training data decreases as $\pi$ increases, so the classifier tends to classify negative samples as positive. Moreover, with the same class prior $\pi$, the true positive rate becomes lower under a stricter false positive rate target. This is also in line with what we observed in Fig. 2(a).

4.3. Results on UCI datasets

Next, we use four UCI datasets [24] for binary classification from different domains and of various sizes to compare the performance of the proposed model with the two competitors. These datasets include the credit card applications dataset (Australian), the MC-generated dataset simulating the registration of high-energy gamma particles (Magic), the gilled mushrooms data (Mushroom), and the Census Income dataset (a9a). The meta information for these datasets is shown in Table 2.

Table 2
Meta information of four UCI datasets.

| Dataset | Positive samples | Unlabeled samples | Dimensionality |
|---|---|---|---|
| Australian | 100 | 200 | 14 |
| Magic | 200 | 4000 | 10 |
| Mushroom | 500 | 3000 | 112 |
| a9a | 1000 | 10,000 | 123 |

We compare the performance of the proposed method and the two competitors in Fig. 3. The horizontal axis in each panel represents the class prior $\pi$, and the vertical axis is the value of the NP-score. All these experiments are conducted under the false positive rate target $\tau = 0.1$. As can be seen from Fig. 3, the NP-scores of CS-PU and AdaSampling are unstable and grow with the class prior. In contrast, our method (PU-NP) always maintains a low and stable NP-score, which means the false positive rate target is always satisfied with our method. The NP-score grows with the class prior because the number of negative samples decreases as the class prior increases, which raises the false positive rate and in turn the NP-score.

4.4. Analysis on image datasets

Finally, we conduct experiments on three image datasets: Mnist [26], Fashion-Mnist [27], and Cifar-10 [28]. The Mnist dataset is a commonly used handwritten digit image dataset containing the 10 digits from 0 to 9. Each image is given in a 28 × 28 gray-scale format. We choose the digit “0” as the positive class and the digit “6” as the negative class. The Fashion-Mnist dataset is a recently released image dataset with 7000 images for each of 10 fashion classes. Each image has the same format as the images in Mnist. We choose “T-shirt” as the positive class and “Pullover” as the negative class. The Cifar-10 dataset consists of 10 classes, with 5000 images in each class. Each image is given in a 32 × 32 × 3 RGB format. We consider “automobile” as the positive class and “ship” as the negative class. For each dataset, we randomly pick 400 positive samples and 800 unlabeled samples as training data and 1000 samples for testing. Each experiment is conducted ten times, and the mean and the standard deviation are reported. The FPR, TPR, and NP-scores for the various class priors π and target false positive rates τ on Mnist, Fashion-Mnist, and Cifar-10 are shown in Table 3. For each experiment, the best NP-score is shown in bold. As we can see from Table 3, our method obtains a lower NP-score than the other two methods in most cases. Moreover, as the class prior π increases, the false positive rate of the CS-PU method increases and that of the AdaSampling method stays stable, whereas our method keeps the false positive rate below the specified tolerance τ. However, to satisfy the constraint on the false positive rate, our method may sacrifice part of the true positive rate, which is tolerated in many real-world scenarios.
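The NP-score used throughout this section, $\frac{1}{\tau}\max\{\mathrm{FPR}, \tau\} - \mathrm{TPR}$, is simple to compute. The sketch below is illustrative only; the function name `np_score` is chosen here, and the example inputs are the mean FPR and TPR that Table 3 reports for PU-NP on Mnist with π = 0.3 and τ = 0.01. The table itself averages per-run scores, so plugging in mean rates will not reproduce every entry exactly.

```python
def np_score(fpr, tpr, tau):
    """NP-score (1 / tau) * max(fpr, tau) - tpr; smaller is better."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    return max(fpr, tau) / tau - tpr

# Constraint satisfied (FPR <= tau): the score reduces to 1 - TPR.
print(round(np_score(0.007, 0.964, 0.01), 3))  # 0.036

# Constraint violated (FPR = 2 * tau): the first term doubles.
print(round(np_score(0.02, 0.964, 0.01), 3))   # 1.036
```

Because the first term is divided by τ, even a small violation of a tight target dominates the score, which is why methods that overshoot τ = 0.01 receive scores well above 1 in Table 3.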
To sum up, the proposed PU-NP method is better suited to applications that have only positive and unlabeled data and need to keep the false positive rate under a specified tolerance.
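For readers who want to experiment with the procedure of Section 3.3, the nested loop of Algorithm 1 can be sketched in NumPy on data drawn as in Eq. (12). This is a rough sketch under stated assumptions, not the authors' code: the names `train_pu_np`, `slopes`, and `weighted_grad` are hypothetical, fixed step sizes and fixed iteration counts replace the Armijo rule and the convergence tests, and no hyperparameter has been tuned.

```python
import numpy as np

def ramp(z):
    return np.minimum(1.0, 0.5 * np.maximum(0.0, 1.0 - z))

def fpr_hat(w, b, x_pos, x_unl, prior):
    """Empirical PU estimate of the FPR with the ramp loss, as in Eq. (8)."""
    unl = ramp(-(x_unl @ w + b)).mean()
    pos = ramp(x_pos @ w + b).mean()
    return (unl + prior * pos - prior) / (1.0 - prior)

def slopes(x_pos, x_unl, w, b, knee):
    """Subgradients w.r.t. the score g of 0.5*max(0, knee - g) on positives
    and of 0.5*max(0, knee + g) on unlabeled points (knee=1: R1, knee=-1: R2)."""
    d_pos = np.where(x_pos @ w + b < knee, -0.5, 0.0)
    d_unl = np.where(x_unl @ w + b > -knee, 0.5, 0.0)
    return d_pos, d_unl

def weighted_grad(x_pos, x_unl, d_pos, d_unl, a, c):
    """Gradient w.r.t. (w, b) of a * mean(pos part) + c * mean(unl part)."""
    g_w = a * (d_pos @ x_pos) / len(x_pos) + c * (d_unl @ x_unl) / len(x_unl)
    g_b = a * d_pos.mean() + c * d_unl.mean()
    return g_w, g_b

def train_pu_np(x_pos, x_unl, prior, tau, outer=30, cccp=5, inner=50,
                lr=0.1, dual_lr=1.0):
    w, b, lam = np.zeros(x_pos.shape[1]), 0.0, 1.0
    for _ in range(outer):
        # Weights of the positive and unlabeled terms in the Lagrangian (9).
        a = 1.0 + lam * prior / (1.0 - prior)
        c = lam / (1.0 - prior)
        for _ in range(cccp):
            # Linearize the concave part -L2 at the current iterate.
            d_pos, d_unl = slopes(x_pos, x_unl, w, b, -1.0)
            g2_w, g2_b = weighted_grad(x_pos, x_unl, d_pos, d_unl, a, c)
            for _ in range(inner):
                # Subgradient step on the convex surrogate L1 - <grad L2, .>.
                d_pos, d_unl = slopes(x_pos, x_unl, w, b, 1.0)
                g1_w, g1_b = weighted_grad(x_pos, x_unl, d_pos, d_unl, a, c)
                w = w - lr * (g1_w - g2_w)
                b = b - lr * (g1_b - g2_b)
        # Projected dual ascent, Eq. (11), with a fixed step (assumption).
        lam = max(0.0, lam + dual_lr * (fpr_hat(w, b, x_pos, x_unl, prior) - tau))
    return w, b, lam

rng = np.random.default_rng(0)
prior, tau = 0.3, 0.1
x_pos = rng.normal(1.0, 1.0, size=(1000, 2))
is_pos = rng.random(2000) < prior
x_unl = np.where(is_pos[:, None],
                 rng.normal(1.0, 1.0, size=(2000, 2)),
                 rng.normal(-1.0, 1.0, size=(2000, 2)))
w, b, lam = train_pu_np(x_pos, x_unl, prior, tau)
x_neg = rng.normal(-1.0, 1.0, size=(1000, 2))
print("FPR on fresh negatives:", np.mean(x_neg @ w + b > 0))
print("TPR on training positives:", np.mean(x_pos @ w + b > 0))
```

The sketch folds the multiplier into two weights, `a` on the positive term and `c` on the unlabeled term, so the convex and concave parts share one gradient routine. With untuned fixed steps the iterates may oscillate around the saddle point, so the printed rates should be read as a sanity check rather than a reproduction of Table 1.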


Fig. 3. Performance comparison on four UCI datasets.

Table 3
Performance comparison on Mnist, Fashion-Mnist and Cifar-10. The task on Mnist is to separate “0” and “6”, the task on Fashion-Mnist is to separate “T-shirt” and “Pullover”, and the task on Cifar-10 is to separate “automobile” and “ship”. Each dataset uses 400 positive and 800 unlabeled training samples; d denotes the input dimensionality. The best NP-scores are shown in bold.

| Dataset | π | τ | CS-PU FPR | CS-PU TPR | CS-PU NP-Score | AdaSampling FPR | AdaSampling TPR | AdaSampling NP-Score | PU-NP FPR | PU-NP TPR | PU-NP NP-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mnist (d: 784) | 0.3 | 0.01 | 0.021 ± 0.046 | 0.972 ± 0.006 | 0.884 ± 0.036 | 0.006 ± 0.002 | 0.935 ± 0.014 | 0.065 ± 0.014 | 0.007 ± 0.001 | 0.964 ± 0.010 | **0.036 ± 0.010** |
| Mnist (d: 784) | 0.3 | 0.05 | 0.021 ± 0.046 | 0.972 ± 0.006 | **0.027 ± 0.006** | 0.006 ± 0.002 | 0.935 ± 0.014 | 0.065 ± 0.014 | 0.034 ± 0.004 | 0.972 ± 0.011 | 0.028 ± 0.017 |
| Mnist (d: 784) | 0.5 | 0.01 | 0.075 ± 0.005 | 0.988 ± 0.005 | 6.162 ± 0.046 | 0.003 ± 0.001 | 0.932 ± 0.014 | 0.079 ± 0.015 | 0.007 ± 0.003 | 0.964 ± 0.006 | **0.036 ± 0.006** |
| Mnist (d: 784) | 0.5 | 0.05 | 0.075 ± 0.005 | 0.988 ± 0.005 | 0.522 ± 0.010 | 0.003 ± 0.001 | 0.932 ± 0.014 | 0.079 ± 0.015 | 0.045 ± 0.001 | 0.983 ± 0.003 | **0.017 ± 0.003** |
| Mnist (d: 784) | 0.7 | 0.01 | 0.063 ± 0.013 | 0.994 ± 0.003 | 5.339 ± 0.130 | 0.003 ± 0.001 | 0.916 ± 0.011 | 0.084 ± 0.011 | 0.003 ± 0.002 | 0.974 ± 0.005 | **0.026 ± 0.005** |
| Mnist (d: 784) | 0.7 | 0.05 | 0.063 ± 0.013 | 0.994 ± 0.003 | 0.289 ± 0.024 | 0.003 ± 0.001 | 0.916 ± 0.011 | 0.084 ± 0.011 | 0.041 ± 0.021 | 0.982 ± 0.001 | **0.015 ± 0.019** |
| Fashion-Mnist (d: 784) | 0.3 | 0.01 | 0.018 ± 0.004 | 0.878 ± 0.010 | 0.873 ± 0.045 | 0.050 ± 0.009 | 0.954 ± 0.009 | 4.010 ± 0.885 | 0.008 ± 0.002 | 0.875 ± 0.023 | **0.137 ± 0.038** |
| Fashion-Mnist (d: 784) | 0.3 | 0.05 | 0.018 ± 0.004 | 0.878 ± 0.010 | 0.123 ± 0.010 | 0.050 ± 0.009 | 0.954 ± 0.009 | 0.110 ± 0.080 | 0.027 ± 0.004 | 0.915 ± 0.019 | **0.085 ± 0.019** |
| Fashion-Mnist (d: 784) | 0.5 | 0.01 | 0.029 ± 0.007 | 0.930 ± 0.004 | 1.971 ± 0.682 | 0.021 ± 0.005 | 0.879 ± 0.007 | 1.172 ± 0.537 | 0.005 ± 0.002 | 0.834 ± 0.013 | **0.167 ± 0.055** |
| Fashion-Mnist (d: 784) | 0.5 | 0.05 | 0.029 ± 0.007 | 0.930 ± 0.004 | **0.071 ± 0.004** | 0.021 ± 0.005 | 0.879 ± 0.007 | 0.122 ± 0.007 | 0.028 ± 0.006 | 0.923 ± 0.019 | 0.077 ± 0.019 |
| Fashion-Mnist (d: 784) | 0.7 | 0.01 | 0.077 ± 0.021 | 0.961 ± 0.004 | 6.705 ± 2.120 | 0.018 ± 0.011 | 0.859 ± 0.010 | 1.175 ± 0.758 | 0.008 ± 0.004 | 0.912 ± 0.021 | **0.188 ± 0.133** |
| Fashion-Mnist (d: 784) | 0.7 | 0.05 | 0.077 ± 0.021 | 0.961 ± 0.004 | 0.572 ± 0.043 | 0.018 ± 0.011 | 0.859 ± 0.010 | 0.141 ± 0.010 | 0.029 ± 0.005 | 0.919 ± 0.019 | **0.081 ± 0.019** |
| Cifar-10 (d: 3072) | 0.3 | 0.01 | 0.024 ± 0.006 | 0.864 ± 0.011 | 1.531 ± 0.122 | 0.037 ± 0.005 | 0.877 ± 0.008 | 2.834 ± 0.130 | 0.006 ± 0.001 | 0.893 ± 0.014 | **0.103 ± 0.015** |
| Cifar-10 (d: 3072) | 0.3 | 0.05 | 0.024 ± 0.006 | 0.864 ± 0.011 | 0.136 ± 0.018 | 0.037 ± 0.005 | 0.877 ± 0.008 | 0.123 ± 0.014 | 0.047 ± 0.005 | 0.901 ± 0.011 | **0.097 ± 0.010** |
| Cifar-10 (d: 3072) | 0.5 | 0.01 | 0.058 ± 0.010 | 0.907 ± 0.023 | 4.893 ± 0.437 | 0.052 ± 0.004 | 0.885 ± 0.010 | 4.315 ± 0.244 | 0.005 ± 0.003 | 0.902 ± 0.021 | **0.101 ± 0.027** |
| Cifar-10 (d: 3072) | 0.5 | 0.05 | 0.058 ± 0.010 | 0.907 ± 0.023 | 0.255 ± 0.026 | 0.052 ± 0.004 | 0.885 ± 0.010 | 0.117 ± 0.023 | 0.036 ± 0.003 | 0.904 ± 0.015 | **0.094 ± 0.019** |
| Cifar-10 (d: 3072) | 0.7 | 0.01 | 0.067 ± 0.015 | 0.928 ± 0.027 | 5.774 ± 0.544 | 0.057 ± 0.011 | 0.906 ± 0.007 | 4.794 ± 0.372 | 0.008 ± 0.001 | 0.907 ± 0.010 | **0.091 ± 0.013** |
| Cifar-10 (d: 3072) | 0.7 | 0.05 | 0.067 ± 0.015 | 0.928 ± 0.027 | 0.396 ± 0.041 | 0.057 ± 0.011 | 0.906 ± 0.007 | 0.231 ± 0.012 | 0.044 ± 0.005 | 0.915 ± 0.017 | **0.085 ± 0.024** |


5. Conclusion

In this paper, we have proposed a novel method for training a classifier under a specified false positive rate with only positive and unlabeled data. We use positive and unlabeled instances to express the false positive rate without knowing the negative instances. We also prove that a surrogate loss satisfying the condition of Corollary 1, such as the non-convex ramp loss, is sufficient to avoid redundancy in the false positive rate constraint. The final constrained non-convex optimization problem is solved by alternating optimization using the concave-convex procedure. Extensive experimental evaluations on various datasets demonstrate the effectiveness and robustness of our algorithm.


Declarations of interest

None.

Acknowledgments

This work was supported by the Natural Science Foundation of China (Nos. 61702186, 61672236, 61602459).

References

[1] H.-L. Chen, B. Yang, J. Liu, D. Liu, A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis, Expert Syst. Appl. 38 (7) (2011) 9014–9022.
[2] E. Blanzieri, A. Bryl, Instance-based spam filtering using SVM nearest neighbor classifier, in: D. Wilson, G. Sutcliffe (Eds.), Proceedings of the FLAIRS Conference, 2007, pp. 441–442.
[3] X.-Y. Liu, Z.-H. Zhou, Learning with cost intervals, in: Proceedings of the KDD, 2010, pp. 403–412.
[4] C.D. Scott, R.D. Nowak, A Neyman–Pearson approach to statistical learning, IEEE Trans. Inf. Theory 51 (11) (2005) 3806–3819.
[5] E.L. Lehmann, Testing Statistical Hypotheses, Springer-Verlag, Berlin, 1997.
[6] P. Rigollet, X. Tong, Neyman-Pearson classification, convexity and stochastic constraints, J. Mach. Learn. Res. 12 (2011) 2831–2855.
[7] M.C. Mozer, R.H. Dodier, M.D. Colagrosso, C. Guerra-Salcedo, R.H. Wolniewicz, Prodding the ROC curve: constrained optimization of classifier performance, in: Proceedings of the NIPS, 2001, pp. 1409–1415.
[8] G. Gasso, A. Pappaioannou, M. Spivak, L. Bottou, Batch and online learning algorithms for nonconvex Neyman–Pearson classification, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 28.
[9] H. Narasimhan, S. Agarwal, A structural SVM based approach for optimizing partial AUC, in: Proceedings of the ICML, 2013, pp. 516–524.
[10] H. Narasimhan, S. Agarwal, SVMpAUCtight: a new support vector method for optimizing partial AUC based on a tight convex upper bound, in: Proceedings of the KDD, 2013.
[11] M.C. du Plessis, M. Sugiyama, Class prior estimation from positive and unlabeled data, IEICE Trans. 97-D (5) (2014) 1358–1362.
[12] A.L. Yuille, A. Rangarajan, The concave-convex procedure, Neural Comput. 15 (4) (2003) 915–936.
[13] X. Li, B. Liu, Learning to classify texts using positive and unlabeled data, in: Proceedings of the IJCAI, 2003, pp. 587–594.
[14] B. Liu, W.S. Lee, P.S. Yu, X. Li, Partially supervised classification of text documents, in: Proceedings of the ICML, 2002, pp. 387–394.
[15] B. Liu, Y. Dai, X. Li, W.S. Lee, P.S. Yu, Building text classifiers using positive and unlabeled examples, in: Proceedings of the ICDM, 2003, pp. 179–188.
[16] W.S. Lee, B. Liu, Learning with positive and unlabeled examples using weighted logistic regression, in: Proceedings of the ICML, 2003, pp. 448–455.
[17] Z. Xu, Z. Qi, J. Zhang, Learning with positive and unlabeled examples using biased twin support vector machine, Neural Comput. Appl. 25 (6) (2014) 1303–1311.
[18] M.C. du Plessis, G. Niu, M. Sugiyama, Analysis of learning from positive and unlabeled data, in: Proceedings of the NIPS, 2014, pp. 703–711.
[19] M.C. du Plessis, G. Niu, M. Sugiyama, Convex formulation for learning from positive and unlabeled data, in: Proceedings of the ICML, 37, 2015, pp. 1386–1394.
[20] F. Mordelet, J.-P. Vert, A bagging SVM to learn from positive and unlabeled examples, Pattern Recognit. Lett. 37 (2014) 201–209.
[21] M. Claesen, F.D. Smet, J.A.K. Suykens, B.D. Moor, A robust ensemble approach to learn from positive and unlabeled data using SVM base models, CoRR abs/1402.3144 (2014).
[22] P. Yang, W. Liu, J. Yang, Positive unlabeled learning via wrapper-based adaptive sampling, in: Proceedings of the IJCAI, 2017, pp. 3273–3279.
[23] P.L. Bartlett, M.I. Jordan, J.D. McAuliffe, Convexity, classification, and risk bounds, J. Am. Stat. Assoc. 101 (473) (2006) 138–156.
[24] C. Blake, C. Merz, UCI repository of machine learning data sets, 1999, (http://www.ics.uci.edu/~mlearn/MLRepository.html).
[25] C. Scott, Performance measures for Neyman–Pearson classification, IEEE Trans. Inf. Theory 53 (8) (2007) 2852–2863.
[26] Y. LeCun, C. Cortes, MNIST handwritten digit database (2010) http://yann.lecun.com/exdb/mnist/.
[27] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, CoRR abs/1708.07747 (2017). https://github.com/zalandoresearch/fashion-mnist.
[28] A. Krizhevsky, Learning multiple layers of features from tiny images, University of Toronto, Toronto, Ontario, 2009.

Shuchen Kong received the M.S. degree in computer science from East China Normal University, Shanghai, China, in 2019. Currently, he is a senior algorithm engineer at Videt Tech Ltd., Shanghai, China. His research interests include machine learning and computer vision.

Weiwei Shen is a scientist at GE Research and an adjunct faculty member of East China Normal University. He received his B.S. from University of Science and Technology of China, and his M.S. and Ph.D. from Columbia University. He holds the Chartered Financial Analyst designation (CFA) and the Financial Risk Manager (FRM) certification. Dr. Shen is a past recipient of the Dushman Award at GE Research. His research focuses on problems in computational finance and machine learning.

Yingbin Zheng received the B.S. and Ph.D. degree in computer science from Fudan University, Shanghai, China. He was a research scientist with SAP Labs China and an associate professor with Shanghai Advanced Research Institute, Chinese Academy of Sciences. Currently he is with Videt Tech Ltd., Shanghai, China. His research interests are in computer vision, especially in scene text analysis and video understanding.

Ao Zhang is currently working as an algorithm engineer at Alibaba DAMO Academy, Hangzhou, China. He received the B.E. degree from Shanghai Jiao Tong University in 2015 and the M.E. degree from East China Normal University in 2018. His research interests lie in the areas of information retrieval and machine learning applications.

Jian Pu received the Ph.D. degree from Fudan University, Shanghai, China, in 2014. Currently, he is an associate professor at the School of Computer Science and Technology, East China Normal University, Shanghai, China. He was a postdoctoral researcher of Institute of Neuroscience, Chinese Academy of Sciences in China from 2014 to 2016. His current research interests include machine learning, computer vision, and medical image analysis.

Jun Wang received the Ph.D. degree in electrical engineering from Columbia University, New York, NY, USA, in 2011. Currently, he is a Professor at the School of Computer Science and Technology, East China Normal University and an adjunct faculty member of Columbia University. From 2010 to 2014, he was a Research Staff Member at IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. His research interests include machine learning, data mining, mobile intelligence, and computer vision. Dr. Wang has been the recipient of the “Thousand Talents Plan” in 2014.