Engineering Applications of Artificial Intelligence 35 (2014) 335–344
A convex relaxation framework for a class of semi-supervised learning methods and its application in pattern recognition

Liming Yang (a,*), Laisheng Wang (a), Yongping Gao (b), Qun Sun (c), Tengyang Zhao (d)

(a) College of Science, China Agricultural University, Beijing 100083, China
(b) Capital Normal University, Beijing 100048, China
(c) College of Biotechnology, China Agricultural University, Beijing 100193, China
(d) Imperial College London, London SW7 2AZ, United Kingdom
Article history: Received 27 June 2013; received in revised form 12 June 2014; accepted 15 June 2014.

Keywords: Semi-supervised learning; Support vector machine; Semi-definite programming; Purity of hybrid seeds

Abstract

Semi-supervised learning has been an attractive research tool for using unlabeled data in pattern recognition. Applying a novel semi-definite programming (SDP) relaxation strategy to a class of continuous semi-supervised support vector machines (S3VMs), a new convex relaxation framework for S3VMs is proposed based on SDP. Compared with other SDP relaxations for S3VMs, the proposed methods only require solving the primal problems and can implement L1-norm regularization. Furthermore, the proposed technique is applied directly to recognize the purity of hybrid maize seeds using near-infrared spectral data, from which we find that the proposed method achieves performance equivalent to the exact solution algorithm for solving the S3VM in different spectral regions. Experiments on several benchmark data sets demonstrate that the proposed convex technique is competitive with other SDP relaxation methods for solving semi-supervised SVMs in generalization.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

Machine learning (Wang et al., 2009; Wu et al., 2013) is a subfield of artificial intelligence concerned with the design and development of algorithms and techniques that allow computers to make inductions or deductions. Traditionally, there have been two fundamentally different types of tasks in machine learning. The first is unsupervised learning, the goal of which is to find interesting structure in a dataset $X = (x_1, \ldots, x_l)$, as in density estimation, clustering and dimensionality reduction. The second is supervised learning, such as support vector machines (SVMs) (Vapnik, 1998) and logistic regression (Bielza et al., 2011), the goal of which is to learn a mapping from $x$ to $y$ given a training set $(x_i, y_i)$, where $y_i$ is called the label or category of the sample $x_i$ $(i = 1, \ldots, l)$. In terms of the output, there are two learning tasks: classification, where the output is discrete, and regression, where the output is continuous. Supervised learning methods perform classification using only the labeled samples, and thus when insufficient labeled samples are available they are usually not satisfactory. Semi-supervised learning (Chapelle et al., 2008; Wang et al., 2009) lies between supervised and unsupervised learning.
* Corresponding author. Tel.: +86 10 62736511; fax: +86 10 62736777. E-mail address: [email protected] (L. Yang).

http://dx.doi.org/10.1016/j.engappai.2014.06.014
It occurs mainly in classification, where the dataset is usually divided into two parts: (1) a small number of labeled samples whose categories (labels) are known; and (2) a large number of unlabeled samples whose categories are unknown. Semi-supervised classification methods are mainly based on the cluster assumption (Chapelle et al., 2008; Wu et al., 2013) that sample points in the same data cluster should have the same label. When the labeled samples are few, semi-supervised learning methods usually achieve better performance than supervised and unsupervised methods. Recently, semi-supervised learning has become an attractive technique in machine learning and artificial intelligence because of the difficulty of manual labeling. One central issue in artificial intelligence is how to integrate human intelligence with machine processing capacity. This arises in many applications such as webpage categorization, medical diagnosis, bioinformatics and spam email detection (Wang et al., 2009). For such practical classification problems, the labeled samples may be very few or expensive to obtain, while unlabeled samples are easy to collect. In this setting, supervised learning methods are difficult to use owing to the lack of labeled samples. The main goal of semi-supervised learning is to employ the large collection of unlabeled samples together with a few labeled samples to improve generalization. Semi-supervised support vector machines (S3VMs) (Bennett and Demiriz, 1998; Fung and Mangasarian, 2001; Bie and Cristianini, 2004; Astorino and Fuduli,
2007; Chapelle et al., 2008; Reddy et al., 2011; Yang and Wang, 2013), which combine the powerful regularization of SVMs with a direct implementation of the cluster assumption, may seem to be the perfect semi-supervised learning approach, and their effectiveness has been demonstrated in machine learning and artificial intelligence. However, the main drawback of S3VMs is that the objective function is usually nonconvex and nonsmooth, which makes it difficult to find the optimal solution. Different techniques have therefore been applied to solve S3VMs, such as exact solution methods (mixed integer programming (MIP) (Bennett and Demiriz, 1998) and branch and bound (Chapelle et al., 2006)), iteration algorithms (Fung and Mangasarian, 2001; Astorino and Fuduli, 2007; Chapelle et al., 2008) and convex relaxation methods (Bie and Cristianini, 2004; Xu and Schuurmans, 2005; Xu et al., 2008). However, iteration algorithms are easily trapped in local optima, and the exact solution methods have high computational cost because they involve combinatorial optimization.

Efficient convex optimization techniques have had a profound impact on machine learning. Semi-definite programming (SDP) (Vandenberghe and Boyd, 1996) is a popular class of convex programs that can be solved by interior-point algorithms in polynomial time (Vandenberghe and Boyd, 1996). It is often used to compute effective lower bounds for the minimum objective value of nonconvex optimizations, and it extends the optimization toolbox beyond unconstrained, linear and quadratic programming techniques.

The aim of this investigation is to explore a new convex relaxation technique to solve a class of S3VMs. We first reformulate the S3VMs as nonconvex quadratic optimization problems (QOPs) by applying a series of mathematical transformations to the empirical risk variables. Then we relax these nonconvex QOPs into SDP models, which compute a lower bound of the minimal objective value of the S3VMs by removing the rank constraints. The advantages of doing so are as follows:
(1) The proposed SDP relaxation technique can implement L1-norm regularization, and thus is convenient to use in practical applications, whereas other SDP relaxations for S3VMs are constructed based on L2-norm regularization.
(2) The proposed methods need only solve the primal problems, instead of the dual problems as in other SDP relaxations in the literature (Bie and Cristianini, 2004; Xu and Schuurmans, 2005; Xu et al., 2008).
(3) As a typical and important application, the proposed approach is applied directly to identify the purity of maize seeds using near-infrared (NIR) spectroscopy technology (Bai and Huang, 2007).

The rest of this paper is organized as follows. Section 2 gives a short summary of the S3VM. In Section 3, we propose two SDP relaxation formulations for solving the S3VMs. Section 4 presents performance evaluation criteria and the optimization design for recognizing the purity of hybrid maize seeds. Experimental results for the proposed methods are shown in Section 5. The final section concludes this investigation.

2. Semi-supervised support vector machine (S3VM)

We consider the binary classification problem where all samples come from two classes, positive and negative. Semi-supervised classification techniques employ the small set of labeled samples while simultaneously assigning the large set of unlabeled samples to one of the two classes so as to maximize the margin (distance) between the two separating planes. Such a margin is a measure of model generalization and complexity. The general picture in semi-supervised classification is to determine a decision function able to classify the labeled data and to correctly predict the class of unlabeled samples while maximizing the margin. This investigation is motivated by the work of Bennett and Demiriz (1998).

The S3VM is an extension of the standard SVM: it can be viewed as the standard SVM with an additional regularization term on the unlabeled samples. More specifically, assume that the sample set consists of $m$ labeled samples $\{(x_i, y_i): x_i \in R^n, y_i = \pm 1, i = 1, 2, \ldots, m\}$ and $p$ unlabeled samples $\{x_j \in R^n: j = m+1, m+2, \ldots, m+p\}$, where $y_i = \pm 1$ $(i = 1, \ldots, m)$ represents the label of a labeled sample: $x_i$ is from the positive class if $y_i = 1$ and from the negative class otherwise. The labeled samples are collected in the matrix $A$ of size $m \times n$ and the unlabeled samples in the matrix $B$ of size $p \times n$; each row of $A$ (resp. $B$) denotes a labeled (resp. unlabeled) sample. The labels are given by the $m$-th order diagonal matrix $D$ with diagonal entries $y_i = \pm 1$ $(i = 1, \ldots, m)$. That is,

$$A = \begin{pmatrix} x_1^T \\ \vdots \\ x_m^T \end{pmatrix}, \qquad B = \begin{pmatrix} x_{m+1}^T \\ \vdots \\ x_{m+p}^T \end{pmatrix}, \qquad D = \begin{pmatrix} y_1 & 0 & \cdots & 0 \\ 0 & y_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & y_m \end{pmatrix}$$

For each unlabeled sample $x_j$ $(j = m+1, \ldots, m+p)$, we introduce two constraints to define the two possible misclassification errors. Let $r_j$ be the misclassification error if $x_j$ is wrongly classified into the positive class, and $s_j$ the misclassification error if $x_j$ is wrongly classified into the negative class. The final class of the unlabeled sample $x_j$ corresponds to the one with the smaller error, $\min\{r_j, s_j\}$. The problem of finding an optimal linear separating hyperplane $w^T x - b = 0$ with maximum classification margin in the Hilbert space $H$ can be formulated as (the q-norm S3VM, $q > 0$)

$$\min_{w,b,\xi,r,s} \; \|w\|_q^q + \nu e^T \xi + \mu e^T \min\{r, s\}$$
$$\text{s.t.} \quad D(Aw - eb) + \xi \ge e, \quad \xi \ge 0$$
$$\qquad\;\; Bw - eb + r \ge e, \quad r \ge 0$$
$$\qquad -Bw + eb + s \ge e, \quad s \ge 0 \tag{1}$$

with variables $w \in R^n$, $b \in R$, $\xi \in R^m$, $r, s \in R^p$. The componentwise minimum of two vectors $r$ and $s$ is denoted by $\min\{r, s\}$, with component $j$ being $\min\{r_j, s_j\}$ $(j = m+1, \ldots, m+p)$. The column vector of ones of arbitrary dimension is denoted by $e$. The variable $\xi_i$ measures the misclassification error of the labeled sample $x_i$ $(i = 1, \ldots, m)$. The two parameters $\nu > 0$ and $\mu > 0$ reflect the misclassification penalties for labeled and unlabeled samples respectively. The q-norm of $w$, $\|w\|_q^q = \sum_{i=1}^{n} |w_i|^q$ $(q > 0)$, is used as a regularization term. The first two terms in the objective function correspond to the standard SVM with q-norm. The last term in the objective function, together with the remaining constraints, assigns each unlabeled sample $x_j$ to the positive or the negative class, whichever generates the lower misclassification error $\min\{r_j, s_j\}$. The corresponding decision function is $f(x) = \operatorname{sign}(w^T x - b)$: a new sample $x$ is assigned to the positive class if $f(x) > 0$ and to the negative class if $f(x) < 0$.

Typically the L2-norm or the L1-norm of $w$ is adopted as the regularization term in standard SVMs. However, most convex relaxations for S3VMs (Bie and Cristianini, 2004; Xu and Schuurmans, 2005; Xu et al., 2008) are constructed based on L2-norm regularization. Unfortunately, formulation (1) is a nonconvex optimization owing to the last term in the objective function, which precludes the use of convex methods. In this investigation, we propose a new convex relaxation technique based on SDP. Moreover, two SDP relaxation formulations for solving the S3VMs are obtained by applying a series of mathematical transformations to the misclassification error variables $r$ and $s$.
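To make the notation concrete, the following minimal sketch builds the matrices $A$, $B$ and $D$ of problem (1) from synthetic data; all names and values are illustrative, not from the paper:

```python
import numpy as np

m, p, n = 4, 6, 3                  # labeled samples, unlabeled samples, features
rng = np.random.default_rng(0)

A = rng.normal(size=(m, n))        # rows are the labeled samples x_1, ..., x_m
y = np.array([1, 1, -1, -1])       # labels y_i = +/-1
B = rng.normal(size=(p, n))        # rows are the unlabeled samples x_{m+1}, ..., x_{m+p}
D = np.diag(y)                     # m-th order diagonal label matrix
e = np.ones(m)                     # vector of ones (dimension from context)

# The labeled-sample constraint of (1) reads D @ (A @ w - e * b) + xi >= e.
```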
We now use the following lemmas to reformulate the Lq-norm S3VM (1) ($q = 1$ and $2$) into nonconvex QOPs (Kim and Kojima, 2001).

Lemma 1. For arbitrary real numbers $a$ and $c$, the following formula holds:

$$\min\{a, c\} = \tfrac{1}{2}\left(a + c - |a - c|\right) \tag{2}$$

where $|\cdot|$ denotes the absolute value operation. The proof is obvious and thus omitted.

Let $\Omega$ be the set whose elements satisfy the constraint conditions of the optimization problem (1), that is,

$$\Omega = \left\{ (w, b, \xi, r, s) : \; D(Aw - eb) + \xi \ge e, \; \xi \ge 0; \;\; Bw - eb + r \ge e, \; r \ge 0; \;\; -Bw + eb + s \ge e, \; s \ge 0 \right\} \tag{3}$$

which implies that $\Omega$ is the feasible region of problem (1).
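Lemma 1 applies componentwise to the error vectors $r$ and $s$; a quick numerical check with illustrative values only:

```python
import numpy as np

r = np.array([0.3, 2.0, 1.1])
s = np.array([1.0, 0.5, 1.1])
# Componentwise form of (2): min{r, s} = (r + s - |r - s|) / 2
assert np.allclose(np.minimum(r, s), 0.5 * (r + s - np.abs(r - s)))
```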
3. SDP relaxations of the S3VM

In this section, we propose two SDP relaxation formulations of the nonconvex Lq-norm S3VM, with $q = 1$ and $q = 2$.

3.1. SDP relaxation of the L1-norm S3VM

Let $q = 1$ in problem (1). Then we get the linear L1-norm S3VM

$$\min \; \|w\|_1 + \nu e^T \xi + \mu e^T \min\{r, s\} \qquad \text{s.t.} \; (w, b, \xi, r, s) \in \Omega \tag{4}$$

which is equivalent to the S3VM formulation in the literature (Bennett and Demiriz, 1998). Minimizing $\|w\|_1$ corresponds to maximizing the classification margin measured by the infinity norm of $w$, namely $1/\|w\|_1$. Bennett and Demiriz formulated this problem as a mixed integer program (MIP) to obtain its exact solution. However, this method is impractical for large data sets, since it involves combinatorial optimization and thus has high computational cost.

Lemma 2. For the vectors $r$ and $s$ in problem (1), the following equivalence relation holds:

$$t = |r - s| \iff \begin{cases} (r-s)^T (r-s) = t^T t \\ |r - s| \le t \end{cases} \tag{5}$$

where $|r - s|$ denotes the componentwise absolute value of the difference between $r$ and $s$, with component $j$ being $|r_j - s_j|$.

Proof. It is shown by Mangasarian and Meyer (2006, Proposition 1) that

$$t = |r - s| \iff \begin{cases} (t - (r-s)) \perp (t + (r-s)) \\ t - (r-s) \ge 0, \;\; t + (r-s) \ge 0 \end{cases} \tag{6}$$

Note that the following equivalence relation also holds:

$$\begin{cases} (t - (r-s)) \perp (t + (r-s)) \\ t - (r-s) \ge 0, \;\; t + (r-s) \ge 0 \end{cases} \iff \begin{cases} [t - (r-s)]^T [t + (r-s)] = 0 \\ |r - s| \le t \end{cases} \tag{7}$$

which is equivalent to

$$(r-s)^T (r-s) = t^T t, \qquad |r - s| \le t \tag{8}$$

Therefore the equivalence relation (5) holds. □

By Lemmas 1 and 2, the L1-norm S3VM (4) can be reformulated as the following nonconvex QOP:

$$\min \; \|w\|_1 + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; (r-s)^T (r-s) = t^T t, \quad |r - s| \le t, \quad (w, b, \xi, r, s) \in \Omega \tag{9}$$

By introducing a variable $u \in R^n$ such that $|w| \le u$, the inequality $\|w\|_1 \le e^T u$ holds. Then problem (9) can be written as

$$\min_{w,b,\xi,r,s,t,u} \; e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; (r-s)^T (r-s) = t^T t, \quad |r - s| \le t, \quad (w, b, \xi, r, s) \in \Omega, \quad |w| \le u. \tag{10}$$

Lemma 3. Let $X = xx^T$ and $E = \begin{pmatrix} 1 & x^T \\ x & X \end{pmatrix}$, $x \in R^n$. Then we have

$$X = xx^T \iff E \succeq 0 \;\text{ and }\; \operatorname{rank}(E) = 1 \tag{11}$$

where $E \succeq 0$ denotes that the matrix $E$ is positive semi-definite, and $\operatorname{rank}(E)$ denotes the rank of $E$.

Proof. If $X = xx^T$ $(x \in R^n)$, then $X$ is symmetric and positive semi-definite, and $\operatorname{rank}(X) = 1$ when $X \ne 0$. Again, $E = \begin{pmatrix} 1 & x^T \\ x & X \end{pmatrix}$, so $E$ is also symmetric and positive semi-definite, with $\operatorname{rank}(E) = 1$.

Conversely, if $E = \begin{pmatrix} 1 & x^T \\ x & X \end{pmatrix}$ $(x \in R^n)$ is a positive semi-definite matrix, then $X$ is positive semi-definite as well. Since $\operatorname{rank}(E) = 1$, every $k$-order $(k \ge 2)$ principal subdeterminant of $E$ must equal zero. Therefore the equality $X = xx^T$ holds. □

Moreover, from matrix theory we know the relation

$$\begin{pmatrix} 1 & x^T \\ x & X \end{pmatrix} \succeq 0 \iff X - xx^T \succeq 0 \tag{12}$$

It should be noted that $X - xx^T$ is a smaller-scale matrix than $\begin{pmatrix} 1 & x^T \\ x & X \end{pmatrix}$.

In the remainder of this paper we focus on relaxing the L1-norm S3VM (10) into a semi-definite programming (SDP) formulation. Further, let

$$Z = tt^T, \quad Y = (r-s)(r-s)^T, \quad M = \begin{pmatrix} 1 & t^T \\ t & Z \end{pmatrix}, \quad N = \begin{pmatrix} 1 & (r-s)^T \\ (r-s) & Y \end{pmatrix} \tag{13}$$

According to Lemma 3, we obtain

$$Y = (r-s)(r-s)^T \iff \begin{cases} Y - (r-s)(r-s)^T \succeq 0 \\ \operatorname{rank}(N) = 1 \end{cases} \tag{14}$$

and

$$Z = tt^T \iff \begin{cases} Z - tt^T \succeq 0 \\ \operatorname{rank}(M) = 1 \end{cases} \tag{15}$$

Let $\operatorname{Tr}(Y)$ and $\operatorname{Tr}(Z)$ denote the traces of the square matrices $Y$ and $Z$ respectively. Then $t^T t = \operatorname{Tr}(Z)$ and $(r-s)^T (r-s) = \operatorname{Tr}(Y)$. From the above analysis, we obtain the following relation:

$$t^T t = (r-s)^T (r-s) \iff \begin{cases} Y - (r-s)(r-s)^T \succeq 0, \;\; Z - tt^T \succeq 0 \\ \operatorname{Tr}(Y) = \operatorname{Tr}(Z), \;\; \operatorname{rank}(M) = 1, \;\; \operatorname{rank}(N) = 1 \end{cases} \tag{16}$$

Finally, applying (16), we can convert the L1-norm S3VM (10) into the QOP

$$\min_{w,u,b,\xi,r,s,t,Y,Z,M,N} \; e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; |r - s| \le t, \quad |w| \le u$$
$$\qquad Y - (r-s)(r-s)^T \succeq 0, \quad Z - tt^T \succeq 0$$
$$\qquad \operatorname{Tr}(Y) = \operatorname{Tr}(Z), \quad \operatorname{rank}(M) = 1, \quad \operatorname{rank}(N) = 1$$
$$\qquad (w, b, \xi, r, s) \in \Omega \tag{17}$$
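The rank-one structure that (17) must certify is exactly that of Lemma 3; a quick numerical check with an arbitrary vector (illustrative values only):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
X = np.outer(x, x)                                   # X = x x^T
E = np.block([[np.ones((1, 1)), x[None, :]],
              [x[:, None], X]])                      # E = [1 x^T; x X]

print(np.all(np.linalg.eigvalsh(E) >= -1e-9))        # True: E is PSD
print(np.linalg.matrix_rank(E))                      # 1, as stated in (11)
```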
This is a nonconvex problem because of the rank constraints $\operatorname{rank}(M) = 1$ and $\operatorname{rank}(N) = 1$. Next, we demonstrate how to relax this nonconvex problem into a convex one. By dropping the rank constraints, we extend the feasible set of problem (17) to a convex set, which leads to the convex optimization problem

$$\min_{w,u,b,\xi,r,s,t,Y,Z} \; e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; |r - s| \le t, \quad |w| \le u$$
$$\qquad Y - (r-s)(r-s)^T \succeq 0, \quad Z - tt^T \succeq 0$$
$$\qquad \operatorname{Tr}(Y) = \operatorname{Tr}(Z), \quad (w, b, \xi, r, s) \in \Omega \tag{18}$$

More specifically, this is an SDP formulation (called SDP-S3VM). Therefore it has global optimal solutions, which can easily be obtained using the popular SeDuMi software (Sturm, 1999). Note that the modified problem (18) is simpler and easier to optimize than the nonconvex QOP (17), because two variables ($M$ and $N$) are removed and problem (18) is convex. Of course, the ranks of the matrices $M$ and $N$ resulting from (18) may no longer equal 1.

We now discuss the relationship between problems (17) and (18). To simplify the presentation, with the definitions (13), we denote the feasible sets of problems (17) and (18) by $\Omega_1$ and $\Omega_2$ respectively:

$$\Omega_1 = \left\{ (w, b, \xi, r, s) \in \Omega : \; |r - s| \le t, \; |w| \le u, \; Y - (r-s)(r-s)^T \succeq 0, \; Z - tt^T \succeq 0, \; \operatorname{Tr}(Y) = \operatorname{Tr}(Z), \; \operatorname{rank}(M) = 1, \; \operatorname{rank}(N) = 1 \right\} \tag{19}$$

and

$$\Omega_2 = \left\{ (w, b, \xi, r, s) \in \Omega : \; |r - s| \le t, \; |w| \le u, \; Y - (r-s)(r-s)^T \succeq 0, \; Z - tt^T \succeq 0, \; \operatorname{Tr}(Y) = \operatorname{Tr}(Z) \right\} \tag{20}$$

Then the nonconvex QOP (17) can be written as

$$\min \; e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t), \qquad (w, u, b, \xi, r, s, t, Y, Z, M, N) \in \Omega_1 \tag{21}$$

and problem (18) is expressed as

$$\min \; e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t), \qquad (w, u, b, \xi, r, s, t, Y, Z) \in \Omega_2 \tag{22}$$

The problem of minimizing the same linear objective function $e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$ over the convex set $\Omega_2$ serves as a convex relaxation of the nonconvex QOP (21). Thus the modified problem (22) is called the SDP relaxation of the original nonconvex QOP (21). From the above analysis, we reach the following conclusions.

Lemma 4. (1) The L1-norm S3VM (4) is equivalent to the nonconvex QOP (21). (2) If $(w, u, b, \xi, r, s, t, Y, Z, M, N)$ is a feasible solution of the nonconvex QOP (21), then it is also a feasible solution of the convex problem SDP-S3VM (22). (3) The optimal objective value of SDP-S3VM (22) is not greater than the minimum objective value of the nonconvex QOP (21).

Proof. The first conclusion is a direct consequence of the above analysis. It is easy to see that the feasible region of SDP-S3VM (22) contains the feasible region of the nonconvex QOP (21): $\Omega_1 \subseteq \Omega_2$. Thus a feasible solution of the nonconvex QOP (21) is also a feasible solution of SDP-S3VM (22). In addition, it is easy to see that SDP-S3VM (22) and the nonconvex QOP (21) have the same objective function. Thus we have

$$\min \left\{ e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t) : \; (w, u, b, \xi, r, s, t, Y, Z, M, N) \in \Omega_1 \right\} \tag{23}$$
$$\ge \min \left\{ e^T u + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t) : \; (w, u, b, \xi, r, s, t, Y, Z) \in \Omega_2 \right\} \tag{24}$$

That is, the optimal objective value of SDP-S3VM (22) is not greater than the minimum objective value of the nonconvex QOP (21). The proof is then complete. □

Furthermore, we have the following theorem showing the relationship between the nonconvex S3VM (21) and its convex relaxation SDP-S3VM (22).

Theorem 1. Suppose that $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$ is the optimal solution of the convex problem SDP-S3VM (22). Let

$$M^* = \begin{pmatrix} 1 & (t^*)^T \\ t^* & Z^* \end{pmatrix} \qquad \text{and} \qquad N^* = \begin{pmatrix} 1 & (r^* - s^*)^T \\ (r^* - s^*) & Y^* \end{pmatrix}.$$

If $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$ satisfies $\operatorname{rank}(M^*) = 1$ and $\operatorname{rank}(N^*) = 1$, then the point $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*, M^*, N^*)$ is the exact solution of the nonconvex QOP (21); otherwise it generates a lower bound of the objective value of the nonconvex QOP (21).

Proof. We note that the feasible region of SDP-S3VM (22) contains the feasible region of the nonconvex QOP (21), and that the two problems have the same objective function. Therefore, if the optimal solution $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$ of SDP-S3VM (22) satisfies the rank constraints $\operatorname{rank}(M^*) = 1$ and $\operatorname{rank}(N^*) = 1$, then the point $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*, M^*, N^*)$ is both a feasible solution and the exact solution of the nonconvex QOP (21); otherwise, by Lemma 4, the optimum value generated by $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$ is a lower bound of the minimal objective value of the nonconvex QOP (21). Thus the conclusions hold. □

We describe below the algorithm for solving SDP-S3VM (22).

Algorithm 1.

(1) Given a dataset consisting of labeled and unlabeled samples.
(2) Choose appropriate parameters $\nu, \mu > 0$ by 10-fold cross-validation to maximize testing accuracy.
(3) Construct the L1-norm S3VM formulations (4) and (21), and the corresponding convex relaxation problem SDP-S3VM (22).
(4) Solve the convex problem (22) to obtain $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$. The classification decision function is $f(x) = \operatorname{sign}((w^*)^T x - b^*)$: a future unlabeled sample $x$ is assigned to the positive class if $f(x) > 0$ or to the negative class if $f(x) < 0$.

According to the above analysis, an alternative approach for solving the nonconvex S3VM (21) is obtained by convex relaxation technology: the nonconvex S3VM (21) is relaxed to the SDP optimization (22). More specifically, the resulting SDP (22) has a global solution, which provides an approximation of the optimal labeling as well as a bound on the true optimum of the original S3VM (21) objective function.
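As an illustration of Algorithm 1, the following is a minimal CVXPY sketch of SDP-S3VM (22) on synthetic two-cluster data, together with the rank certificate of Theorem 1. This is a sketch under stated assumptions (solver choice, tolerances, and one added bounding inequality flagged in the comments), not the authors' SeDuMi implementation:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, p, n = 10, 20, 2                               # labeled / unlabeled / features
Xpos = rng.normal(loc=+1.5, size=((m + p) // 2, n))
Xneg = rng.normal(loc=-1.5, size=((m + p) // 2, n))
A = np.vstack([Xpos[: m // 2], Xneg[: m // 2]])   # labeled samples
ylab = np.hstack([np.ones(m // 2), -np.ones(m // 2)])
B = np.vstack([Xpos[m // 2:], Xneg[m // 2:]])     # unlabeled samples
D, em, ep = np.diag(ylab), np.ones(m), np.ones(p)
nu, mu = 1.0, 0.8                                 # penalty parameters

w, b = cp.Variable(n), cp.Variable()
u = cp.Variable(n)
xi = cp.Variable(m, nonneg=True)
r = cp.Variable(p, nonneg=True)
s = cp.Variable(p, nonneg=True)
t = cp.Variable(p)
# Lifted matrices M = [1 t^T; t Z] and N = [1 (r-s)^T; r-s Y] of (13),
# kept PSD but with the rank-1 constraints dropped, as in (18)/(22).
M = cp.Variable((p + 1, p + 1), symmetric=True)
N = cp.Variable((p + 1, p + 1), symmetric=True)

constraints = [
    D @ (A @ w - b * em) + xi >= em,              # labeled margins
    B @ w - b * ep + r >= ep,                     # unlabeled error r
    -B @ w + b * ep + s >= ep,                    # unlabeled error s
    cp.abs(r - s) <= t,                           # |r - s| <= t
    t <= r + s,                                   # valid for t = |r - s| with r, s >= 0;
                                                  # our addition (not in the paper) to
                                                  # keep the relaxed problem bounded
    cp.abs(w) <= u,                               # |w| <= u, so ||w||_1 <= e^T u
    M >> 0, M[0, 0] == 1, M[0, 1:] == t,          # Schur form of Z - t t^T >= 0
    N >> 0, N[0, 0] == 1, N[0, 1:] == r - s,      # Schur form of Y - (r-s)(r-s)^T >= 0
    cp.trace(N[1:, 1:]) == cp.trace(M[1:, 1:]),   # Tr(Y) = Tr(Z)
]
objective = cp.Minimize(cp.sum(u) + nu * cp.sum(xi) + 0.5 * mu * cp.sum(r + s - t))
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.SCS)                         # any SDP-capable solver works

# Theorem 1 certificate: if M* and N* are rank one, the relaxation is exact;
# otherwise prob.value is only a lower bound for the nonconvex QOP (21).
rank_M = np.linalg.matrix_rank(M.value, tol=1e-6)
rank_N = np.linalg.matrix_rank(N.value, tol=1e-6)
print(prob.value, rank_M, rank_N)
predict = lambda x: np.sign(w.value @ x - b.value)  # decision function f(x)
```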
Compared with other SDP relaxation formulations for semi-supervised SVMs (Bie and Cristianini, 2004; Xu and Schuurmans, 2005; Xu et al., 2008), which are constructed based on L2-norm regularization and require solving their dual problems, the proposed SDP-S3VM (22) is based on L1-norm regularization and involves only solving the primal problem. This makes it convenient to use in practical applications. Moreover, one major benefit of $\|w\|_1$ over $\|w\|_2$ is feature reduction (Fung and Mangasarian, 2001), since minimizing $\|w\|_1$ drives most elements of the vector $w$ to zero. When the $i$-th component of $w$ is zero, the $i$-th component of $x$ is irrelevant in deciding the class of $x$ with the linear decision function $f(x) = \operatorname{sign}(w^T x - b)$. Thus feature selection and classification can be conducted jointly through the L1-norm regularization formulation.

3.2. SDP relaxation of the L2-norm S3VM

Typically the L2-norm is adopted as the regularization term in SVMs, in which case the classification margin is measured by the L2-norm of $w$. Let $q = 2$ in problem (1). We then obtain the L2-norm S3VM:

$$\min_{w,b,\xi,r,s} \; \|w\|_2^2 + \nu e^T \xi + \mu e^T \min\{r, s\} \qquad \text{s.t.} \; (w, b, \xi, r, s) \in \Omega \tag{25}$$

We will see that this problem can also be relaxed to an SDP model. From (13) and Lemmas 1–3, an analysis similar to that of Section 3.1 leads to the following nonconvex QOP:

$$\min_{w,b,\xi,r,s,t,Y,Z,M,N} \; \|w\|_2^2 + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; Y - (r-s)(r-s)^T \succeq 0, \quad Z - tt^T \succeq 0$$
$$\qquad \operatorname{Tr}(Y) = \operatorname{Tr}(Z), \quad |r - s| \le t$$
$$\qquad \operatorname{rank}(M) = \operatorname{rank}(N) = 1, \quad (w, b, \xi, r, s) \in \Omega \tag{26}$$

Further, let $W = ww^T$ and $Q = \begin{pmatrix} 1 & w^T \\ w & W \end{pmatrix}$. It is not difficult to prove that $\|w\|_2^2 = \operatorname{Tr}(W)$. From Lemma 3, we get the following equivalence relation:

$$W = ww^T \iff \begin{cases} Q \succeq 0 \\ \operatorname{rank}(Q) = 1 \end{cases} \iff \begin{cases} W - ww^T \succeq 0 \\ \operatorname{rank}(Q) = 1 \end{cases} \tag{27}$$

Then the L2-norm S3VM (26) can be written as the following nonconvex QOP with rank constraints:

$$\min \; \operatorname{Tr}(W) + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; Y - (r-s)(r-s)^T \succeq 0, \quad Z - tt^T \succeq 0$$
$$\qquad \operatorname{Tr}(Y) = \operatorname{Tr}(Z), \quad W - ww^T \succeq 0, \quad |r - s| \le t$$
$$\qquad \operatorname{rank}(M) = 1, \quad \operatorname{rank}(N) = 1, \quad \operatorname{rank}(Q) = 1, \quad (w, b, \xi, r, s) \in \Omega \tag{28}$$

This implies that the L2-norm S3VM (25) is equivalent to the nonconvex QOP (28). Next, we demonstrate how to relax this nonconvex QOP into an SDP model. In the same way as before, we drop the rank constraints and obtain a convex relaxation of the nonconvex QOP (28):

$$\min_{w,b,\xi,r,s,t,Y,Z,W} \; \operatorname{Tr}(W) + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; |r - s| \le t, \quad \operatorname{Tr}(Y) = \operatorname{Tr}(Z)$$
$$\qquad Y - (r-s)(r-s)^T \succeq 0, \quad Z - tt^T \succeq 0, \quad W - ww^T \succeq 0$$
$$\qquad (w, b, \xi, r, s) \in \Omega \tag{29}$$

More precisely, this is an SDP formulation, called L2-SDP-S3VM. It has a globally optimal solution and can be solved by interior-point methods. To sum up, we have relaxed the nonconvex L2-norm S3VM (25) (or (28)) into the SDP model (29), whose solution is an approximation of the optimal labeling as well as a bound on the true optimum of the original L2-norm S3VM (25) objective function. Furthermore, we have the following theorem showing the relationship between the nonconvex S3VM (28) and its convex relaxation L2-SDP-S3VM (29).

Theorem 2. Suppose that $(w^*, W^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$ is the optimal solution of the convex optimization L2-SDP-S3VM (29). Let

$$M^* = \begin{pmatrix} 1 & (t^*)^T \\ t^* & Z^* \end{pmatrix}, \qquad N^* = \begin{pmatrix} 1 & (r^* - s^*)^T \\ (r^* - s^*) & Y^* \end{pmatrix} \qquad \text{and} \qquad Q^* = \begin{pmatrix} 1 & (w^*)^T \\ w^* & W^* \end{pmatrix}.$$

If $\operatorname{rank}(M^*) = 1$, $\operatorname{rank}(N^*) = 1$ and $\operatorname{rank}(Q^*) = 1$, then the point $(w^*, W^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*, M^*, N^*, Q^*)$ is the exact solution of the nonconvex QOP (28); otherwise it yields a lower bound of the minimal objective value of the nonconvex QOP (28).

Proof. From the above analysis, we know that the nonconvex QOP (28) and its convex relaxation L2-SDP-S3VM (29) have the same objective function. Moreover, the feasible region of the nonconvex QOP (28) is a subset of the feasible region of L2-SDP-S3VM (29). These imply that a feasible solution of the nonconvex QOP (28) is also a feasible solution of L2-SDP-S3VM (29), and that the minimal objective value of the convex relaxation (29) is not greater than the minimum objective value of the nonconvex QOP (28). Therefore, if the optimal solution $(w^*, W^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$ of L2-SDP-S3VM (29) satisfies the rank constraints $\operatorname{rank}(M^*) = 1$, $\operatorname{rank}(N^*) = 1$ and $\operatorname{rank}(Q^*) = 1$, then the point $(w^*, W^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*, M^*, N^*, Q^*)$ is both a feasible solution and the exact solution of the nonconvex QOP (28); otherwise the optimum value of L2-SDP-S3VM (29) is a lower bound of the minimal objective value of the nonconvex QOP (28). The proof is then complete. □

We then describe the algorithm for solving L2-SDP-S3VM (29).

Algorithm 2.

(1) Given a dataset consisting of labeled samples and unlabeled samples.
(2) Choose suitable parameters $\nu, \mu > 0$ by 10-fold cross-validation to maximize testing accuracy.
(3) Construct the L2-norm S3VM problems (25) and (28), and the corresponding convex relaxation problem L2-SDP-S3VM (29).
(4) Solve the SDP model L2-SDP-S3VM (29) to obtain $(w^*, W^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$. The classification decision function is $f(x) = \operatorname{sign}((w^*)^T x - b^*)$: a future unlabeled sample $x$ is assigned to the positive class if $f(x) > 0$ or to the negative class if $f(x) < 0$.

According to the above discussion, a convex relaxation formulation of the L2-norm S3VM (25) has been proposed based on SDP optimization. The resulting SDP (29) has a global solution, which is an approximation of the optimal labeling as well as a bound on the true optimum of the original L2-norm S3VM (25) objective function.

In addition, the proposed SDP relaxation for the L2-norm S3VM can be extended to design a nonlinear S3VM classifier, which separates the two classes with a nonlinear hypersurface when the sample set is not linearly separable. Let $\phi : X \to H$ denote a nonlinear mapping from the input space into a Hilbert space $H$. The nonlinear S3VM in the space $H$ tries to find a hyperplane $f(x) = w^T \phi(x) - b$ separating the positive points from the negative points as well as possible. To construct the optimal hyperplane for the S3VM, the L2-norm is usually employed in the nonlinear version, and one needs to solve the
following primal problem:

$$\min_{w,b,\xi,r,s} \; \|w\|_2^2 + \nu e^T \xi + \mu e^T \min\{r, s\}$$
$$\text{s.t.} \; y_i (w^T \phi(x_i) - b) + \xi_i \ge 1, \quad \xi_i \ge 0$$
$$\qquad w^T \phi(x_j) - b + r_j \ge 1, \quad r_j \ge 0$$
$$\qquad -w^T \phi(x_j) + b + s_j \ge 1, \quad s_j \ge 0$$
$$\qquad i = 1, 2, \ldots, m; \quad j = m+1, m+2, \ldots, m+p \tag{30}$$

Assuming that $w = \sum_{i=1}^{m+p} \alpha_i \phi(x_i)$ in terms of the representer theorem (Vapnik, 1998), we obtain a nonlinear S3VM of the form

$$\min_{\alpha,b,\xi,r,s} \; \alpha^T k \alpha + \nu e^T \xi + \mu e^T \min\{r, s\}$$
$$\text{s.t.} \; D(k_1 \alpha - eb) + \xi \ge e, \quad \xi \ge 0$$
$$\qquad k_2 \alpha - eb + r \ge e, \quad r \ge 0$$
$$\qquad -k_2 \alpha + eb + s \ge e, \quad s \ge 0 \tag{31}$$

Using Lemma 1, we formulate problem (31) as

$$\min_{\alpha,b,\xi,r,s,t} \; \alpha^T k \alpha + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s) - \tfrac{1}{2}\mu e^T |t|$$
$$\text{s.t.} \; D(k_1 \alpha - eb) + \xi \ge e, \quad \xi \ge 0$$
$$\qquad k_2 \alpha - eb + r \ge e$$
$$\qquad -k_2 \alpha + eb + s \ge e$$
$$\qquad t = r - s, \quad r, s \ge 0 \tag{32}$$

Here $k$, $k_1$, $k_2$ are kernel matrices (Vapnik, 1998); a construction sketch is given after the notes below. The corresponding nonlinear S3VM decision function has the form $f(x) = \operatorname{sign}\left(\sum_{i=1}^{m+p} \alpha_i k(x, x_i) + b\right)$. Let $\Omega_3$ be the feasible set of problem (32). Similarly to the above analysis, the nonlinear problem (32) can also be relaxed to an SDP formulation by using (13) and Lemmas 1–3:

$$\min_{\alpha,b,\xi,r,s,t,Y,Z} \; \alpha^T k \alpha + \nu e^T \xi + \tfrac{1}{2}\mu e^T (r + s - t)$$
$$\text{s.t.} \; (\alpha, b, \xi, r, s) \in \Omega_3$$
$$\qquad |r - s| \le t, \quad \operatorname{Tr}(Y) = \operatorname{Tr}(Z)$$
$$\qquad Y - (r-s)(r-s)^T \succeq 0, \quad Z - tt^T \succeq 0 \tag{33}$$

This is called NLSDP-S3VM, and it can easily be solved using the popular SeDuMi software (Sturm, 1999). The following should be noted:

(1) Compared to other SDP relaxations for semi-supervised learning methods, which require solving their dual problems, the proposed SDP relaxations need only solve their primal problems, which makes them simple to optimize in practical applications.
(2) The proposed SDP relaxation technique can implement both L2-norm and L1-norm regularization, whereas other SDP relaxations in the semi-supervised learning framework are constructed based only on L2-norm regularization. This makes it convenient to use in many applications.
(3) Moreover, the proposed L1-norm SDP outperforms the existing convex relaxation methods for S3VMs in terms of feature selection, since minimizing $\|w\|_1$ drives most elements of $w$ to zero in the L1-norm SDP model.
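As a hedged illustration of the nonlinear formulation (31)–(33), the sketch below builds the kernel matrices $k$, $k_1$, $k_2$ with a Gaussian (RBF) kernel; the bandwidth gamma and the data matrices are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rbf_gram(X1, X2, gamma=0.5):
    """Gram matrix G[i, j] = exp(-gamma * ||X1[i] - X2[j]||^2)."""
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq)

A = np.random.default_rng(0).normal(size=(10, 2))   # labeled samples (rows)
B = np.random.default_rng(1).normal(size=(20, 2))   # unlabeled samples (rows)
Xall = np.vstack([A, B])                            # all m + p samples

K  = rbf_gram(Xall, Xall)   # k:  used in the regularizer alpha^T k alpha
K1 = rbf_gram(A, Xall)      # k1: labeled rows against all samples
K2 = rbf_gram(B, Xall)      # k2: unlabeled rows against all samples
# In (31), D @ (K1 @ alpha - e*b) + xi >= e replaces the linear constraint
# D @ (A @ w - e*b) + xi >= e of problem (1).
```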
4. Performance evaluation

4.1. Optimization design for recognition of the purity of hybrid maize seeds

Maize is the main agricultural crop in China, and its yield is significantly related to seed purity. Therefore, recognizing the purity of hybrid maize seeds is an important part of seed testing. Seed purity is the percentage of seeds of a given variety in a seed lot. A seed lot will contain seeds of other varieties because of natural variation, natural hybridization, mechanical mixing and so on. Seed purity is thus an important indicator of seed quality, since it is directly correlated with field performance and yield. Because of heterosis, hybrid corn has a much higher yield than the inbred lines. During seed production, because of incomplete emasculation, the harvested hybrid seeds are mixed with some mother inbred seeds. The main goal of recognizing seed purity is to identify the mother seeds, remove them and so improve the seed purity. Recognition of the purity of hybrid seeds is therefore a pattern classification problem.

Some studies have focused on identifying the purity of seeds, such as seedling identification and field experiment technology (Kang and Hao, 2004; Bai and Huang, 2007). However, these methods require adequate training (labeled) samples to obtain satisfactory results, because they estimate the decision function using the labeled samples only. That is, these traditional identification methods are based on supervised learning. To validate the performance of the proposed method in recognizing the purity of hybrid seeds, we demonstrate how to classify hybrid maize seeds and mother maize seeds using the proposed SDP relaxation of the S3VM and near-infrared (NIR) spectral data. This will advance the study of the mechanism of the purity of crop seeds.

NIR spectroscopy has demonstrated great potential in the analysis of complex samples owing to its simplicity, rapidity and nondestructiveness. The NIR spectra of the maize seeds were acquired using an MPA spectrometer. The NIR spectral range of 4000–12 000 cm⁻¹ was recorded with a resolution of 4 cm⁻¹. Each sample spectrum was the average of 32 scans. This procedure was repeated four times for each sample: twice from the front at different locations and twice from the rear at different locations. The final spectrum was taken as the mean of these four spectra. The initial spectra of the maize seeds were digitized by the OPUS 5.5 software. After digitization, each spectrum was represented as a row vector whose length was defined by the number of spectral variables. Consequently, each spectral dataset in a given spectral region is represented by a matrix, each row of which denotes a seed sample.

Below, we summarize the main steps in recognizing hybrid seed purity using the proposed method and NIR spectral data:

(1) Harvest hybrid seed and mother seed training samples. Denote mother seed samples as positive class samples and hybrid seed samples as negative class samples.
(2) Digitize the initial spectra of the maize seeds with the OPUS 5.5 software to get the spectral datasets of the maize seeds (see the sketch after this list).
(3) Choose suitable parameters $\nu, \mu > 0$ by 10-fold cross-validation to maximize testing accuracy.
(4) Construct the L1-norm S3VM formulations (4) and (21), and the corresponding convex relaxation problem SDP-S3VM (22).
(5) Solve the convex problem (22) to obtain $(w^*, u^*, b^*, \xi^*, r^*, s^*, t^*, Y^*, Z^*)$. The classification decision function is $f(x) = \operatorname{sign}((w^*)^T x - b^*)$: a future unlabeled sample $x$ is assigned to the positive class if $f(x) > 0$ or to the negative class if $f(x) < 0$.
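A minimal sketch of the data assembly in step (2), assuming each seed contributes four scans that are averaged into one spectrum; the array names and placeholder data are hypothetical (the paper used OPUS 5.5 for digitization):

```python
import numpy as np

def seed_spectrum(four_scans):
    """four_scans: array of shape (4, n_wavelengths); returns the mean spectrum."""
    return np.asarray(four_scans).mean(axis=0)

# all_seed_scans: hypothetical list with one (4, n_wavelengths) array per seed.
all_seed_scans = [np.random.rand(4, 260) for _ in range(240)]  # placeholder data
spectra = np.vstack([seed_spectrum(s) for s in all_seed_scans])
print(spectra.shape)   # (240, 260): one row per seed sample, as in Table 3
```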
4.2. Performance evaluation criteria

The performances of the proposed methods are assessed as follows. The sensitivity (SE) is the identification rate of the positive class; the specificity (SP) is the identification rate of the negative class; and the accuracy (ACC) is the identification rate over all samples of both classes. These values can be obtained from the decision function and are defined as (Fawcett, 2006)

$$SE = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP} \tag{34}$$

$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \tag{35}$$

where TP and TN denote the true positives and true negatives respectively, and FN and FP denote the false negatives and false positives respectively. Matthew's correlation coefficient (MCC) is a comprehensive measure of the quality of a classification model. Its value is expressed as (Xu et al., 2009)

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{36}$$

Generally, the higher the value of the MCC, the better the model. In addition, the prediction validity of different classification models is often examined by the receiver operating characteristic (ROC) curve (Fawcett, 2006), which plots a series of true positive rates (TPRs) against the corresponding false positive rates (FPRs). ROC graphs are a very useful tool for visualizing and evaluating classifiers: they provide a richer measure of classification performance than scalar measures such as accuracy and error rate. Moreover, ROC curves have an attractive property: they are insensitive to changes in class distribution. If the proportion of positive to negative samples changes in a test set, the ROC curves do not change. This criterion is therefore popular for evaluating performance on imbalanced data. The larger the area under the ROC curve, the higher the sensitivity for a given specificity, and thus the better the model. In our experiments, some datasets, such as the Hepatitis data from the UCI Machine Learning Repository, have far fewer samples in one class than in the other.

Generally, given a classifier and a sample, there are four possible outcomes. If the sample is positive and is classified as positive, it is counted as a true positive; if it is classified as negative, as a false negative. If the sample is negative and is classified as negative, it is counted as a true negative; if it is classified as positive, as a false positive. The true positive rate (TPR) and the false positive rate (FPR) of a classifier are estimated as (Fawcett, 2006)

$$TPR = \frac{\text{positives correctly classified}}{\text{total positives}} \tag{37}$$

$$FPR = \frac{\text{negatives incorrectly classified}}{\text{total negatives}} \tag{38}$$
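Criteria (34)–(38) are straightforward to compute from the confusion counts; a minimal sketch, assuming ±1 labels as in the S3VM decision function:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return SE, SP, ACC and MCC of (34)-(36) from +/-1 label arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    se = tp / (tp + fn)                      # sensitivity, also the TPR of (37)
    sp = tn / (tn + fp)                      # specificity; the FPR of (38) is 1 - sp
    acc = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return se, sp, acc, mcc
```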
5. Numerical experiments

In this section we test the performance of the proposed methods on various datasets. The numerical experiments consist of two parts. In the first part, we test the performance of the proposed SDP schemes on five benchmark datasets from the UCI Machine Learning Repository. In the second part, the proposed methods are used directly to recognize the purity of hybrid seeds using near-infrared (NIR) spectroscopy data.

5.1. Experiment design

We choose the well-known exact solution method (Bennett and Demiriz, 1998) for solving the L1-norm S3VM (4) as our baseline. Specifically, by introducing an integer variable $d$ with component $d_j = 0$ or $1$ for each unlabeled sample $x_j$ $(j = m+1, \ldots, m+p)$, the L1-norm S3VM (4) was posed as the following mixed integer program (called MIP-S3VM) (Bennett and Demiriz, 1998):

$$\min \; e^T z + \nu e^T \xi + \mu e^T (r + s)$$
$$\text{s.t.} \; D(Aw - eb) + \xi \ge e, \quad \xi \ge 0$$
$$\qquad Bw - eb + r + M(e - d) \ge e, \quad r \ge 0$$
$$\qquad -Bw + eb + s + Md \ge e, \quad s \ge 0$$
$$\qquad |w| \le z, \quad d_j \in \{0, 1\} \tag{39}$$

where $M > 0$ is a sufficiently large constant such that if $d_j = 0$ then $r_j = 0$ is feasible for any optimal $w$ and $b$, which attempts to classify the unlabeled point $x_j$ into the negative class; likewise, if $d_j = 1$ then $s_j = 0$, which attempts to classify $x_j$ into the positive class. The global solution of (39) can be obtained.

All experiments are implemented in MATLAB 7.0 on a PC with a Genuine Intel(R) 1.6 GHz processor and 1 GB memory. In addition, the following toolboxes are used in the experiments: the MATLAB Statistics Toolbox; the MATLAB SeDuMi Toolbox for SDP (Sturm, 1999); and the MATLAB YALMIP Toolbox for mixed integer programming (http://control.ee.ethz.ch/~joloef/wiki/pmwiki.php). The SeDuMi Toolbox is employed to solve SDP-S3VM (18) and L2-SDP-S3VM (29), and the YALMIP mixed integer programming toolbox is used as the solver for MIP-S3VM (39). Standard programs of these toolboxes were modified.

For a comprehensive evaluation, we conduct 10 trials on each dataset. Specifically, in each trial each dataset is randomly divided into two parts: 20% of the samples for training and the remaining 80% for testing, with the exception of Spam, for which 100 samples are chosen randomly for the test set in each trial. We remove the labels of the test set and run the proposed semi-supervised algorithms to reclassify the test set using the learning results. This process is repeated 10 times, and the average over the 10 replications is used to evaluate the algorithms.

The S3VM parameters $\nu$ and $\mu$ denote the penalties for misclassification errors of labeled and unlabeled samples respectively; the generalization of the proposed method thus also depends on the choice of $\nu$ and $\mu$. In general, when they are small, margin maximization is emphasized, leading to a large margin; when they are large, error minimization predominates, leading to a smaller margin. These parameters should therefore be optimized beforehand. For SDP-S3VM (18), Fig. 1 shows how the accuracy varies with $\nu$ and $\mu$ on the Sonar data, illustrating the relationship between accuracy and the parameters. From Fig. 1, we find that SDP-S3VM yields larger ACC when $\mu$ and $\nu$ range from $10^{-2}$ to $10^{3}$, which guides the choice of parameters in our experiments. In this experiment we set $\mu \le \nu$ for the S3VM models, which assigns a greater weight to the labeled samples than to the unlabeled ones. The two SDP-S3VM parameters $\mu$ and $\nu$ are adjusted over the set $\{10^i : i = -2, -1, 0, 1, 2, 3\}$ by 10-fold cross-validation. More precisely:

(1) The whole dataset is split randomly into 10 equal-sized subsets.
(2) For each parameter setting, one of the 10 subsets is used for testing and the remaining nine for training. An SDP-S3VM classifier is trained and its classification accuracy recorded. This process is repeated 10 times.
(3) The average classification accuracy is computed to estimate the SDP-S3VM.
(4) The SDP-S3VM parameters with the maximum accuracy are adopted.

Once selected, the optimal parameters are used to learn the final decision function; the sketch below illustrates this grid search.
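A hedged sketch of the selection procedure; `solve_sdp_s3vm` is a hypothetical wrapper around the SDP solver (for example the CVXPY sketch of Section 3), and `X`, `y` stand for a generic dataset:

```python
import itertools
import numpy as np

def cv_select(X, y, solve_sdp_s3vm, n_folds=10, seed=0):
    """Grid-search (nu, mu) over {10^i : i = -2..3} with mu <= nu by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    grid = [10.0 ** i for i in range(-2, 4)]
    best_params, best_acc = None, -np.inf
    for nu, mu in itertools.product(grid, grid):
        if mu > nu:                       # the paper charges labeled errors more
            continue
        accs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.hstack([folds[j] for j in range(n_folds) if j != k])
            model = solve_sdp_s3vm(X[train], y[train], nu=nu, mu=mu)  # hypothetical
            accs.append(np.mean(model.predict(X[test]) == y[test]))
        if np.mean(accs) > best_acc:
            best_params, best_acc = (nu, mu), np.mean(accs)
    return best_params, best_acc
```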
[Fig. 1. Accuracy of SDP-S3VM (18) versus parameters on the Sonar dataset.]

[Fig. 2. Comparison of SDP-S3VM with MIP-S3VM in terms of ROC curves on the Sonar dataset.]

Table 1
Comparisons of the two methods for solving the L1-norm S3VM (SDP and MIP), with the ratio of labeled to unlabeled samples being 2:8, on UCI datasets.

Dataset (data scale; parameters)                 | Method      | SE (%) | SP (%) | ACC (%) | MCC (%)
Wine (119 × 13); (ν, μ) = (1.0, 0.8)             | SDP-S3VM    | 100    | 100    | 100     | 100
                                                 | MIP-S3VM    | 93.39  | 100    | 96.12   | 93.59
                                                 | L2-SDP-S3VM | 100    | 100    | 100     | 100
Ionosphere (351 × 34); (ν, μ) = (1.0, 0.8)       | SDP-S3VM    | 100    | 69.60  | 88.85   | 73.03
                                                 | MIP-S3VM    | 83.57  | 60.99  | 75.98   | 45.74
                                                 | L2-SDP-S3VM | 100    | 69.57  | 88.85   | 73.03
Hepatitis (155 × 19); (ν, μ) = (10, 1.0)         | SDP-S3VM    | 95.80  | 25.00  | 75.60   | 29.45
                                                 | MIP-S3VM    | 57.55  | 80.83  | 66.66   | 39.46
                                                 | L2-SDP-S3VM | 74.47  | 57.00  | 67.67   | 31.96
Sonar (208 × 60); (ν, μ) = (0.8, 0.6)            | SDP-S3VM    | 69.19  | 68.26  | 62.11   | 37.45
                                                 | MIP-S3VM    | 50.24  | 59.72  | 55.61   | 10.01
                                                 | L2-SDP-S3VM | 47.89  | 75.83  | 58.20   | 24.62
WBC (699 × 9); (ν, μ) = (1.0, 0.8)               | SDP-S3VM    | 99.60  | 77.45  | 98.18   | 79.01
                                                 | MIP-S3VM    | 96.18  | 97.32  | 97.15   | 93.51
                                                 | L2-SDP-S3VM | 99.56  | 96.40  | 98.43   | 96.01
5.2. Experiments on benchmark datasets

To validate the performance of the proposed convex relaxation algorithms on benchmark data, numerical experiments are carried out on five labeled datasets from the UCI repository. The Wine dataset has three classes; here we use only the class 1 and class 3 data. Thus the five datasets used in this experiment are all binary classification data; information about them is summarized in Table 1. According to the above analysis, the optimal model parameters ν = 1.0 and μ = 1.0 in SDP-S3VM (22) were chosen on four of the UCI datasets, except the Hepatitis dataset, for which ν = 0.8 and μ = 0.6 were chosen. The parameters in the other models are set to be the same as in SDP-S3VM (22).
5.2.1. Comparisons of the proposed SDP relaxations with MIP-S3VM

We compare the two proposed SDP formulations (SDP-S3VM and L2-SDP-S3VM) with MIP-S3VM. The evaluation measures (SE, SP, ACC and MCC) are computed for the three models, and the average results over the 10 trials are presented in Table 1.

(1) ACC and MCC analysis: On the MCC criterion, Table 1 shows that the proposed SDP-S3VM and L2-SDP-S3VM obtain better performance than MIP-S3VM on three of the five datasets; in terms of ACC, our SDP formulations are slightly superior to MIP-S3VM on all five datasets.

(2) ROC curve analysis: Figs. 2–4 illustrate the ROC curves of SDP-S3VM and MIP-S3VM on three datasets. The ROC curves of SDP-S3VM lie above those of MIP-S3VM on the three compared datasets. These results suggest that our SDP-S3VM achieves better performance than MIP-S3VM in terms of the ROC measurement for the three analyzed benchmark datasets.

[Fig. 3. Comparison of SDP-S3VM with MIP-S3VM in terms of ROC curves on the Ionosphere dataset.]

(3) Comparison of SDP-S3VM with L2-SDP-S3VM in terms of the percentage of selected features: To evaluate the feature selection ability of the L1-norm S3VM (i.e., SDP-S3VM (22)), the percentage of selected features (PSF) of SDP-S3VM (22) and L2-SDP-S3VM (29) is compared on the five UCI datasets. The average PSF over the 10 trials is shown in Fig. 5. We observe from Fig. 5 that SDP-S3VM selects fewer features than L2-SDP-S3VM on four of the five datasets. As discussed in Section 3, SDP-S3VM (22), without loss of generalization, suppresses many more features than L2-SDP-S3VM (29) by minimizing the L1-norm instead of the L2-norm.

5.2.2. Comparisons of the proposed SDP relaxations with other S3VM formulations

We compare the two proposed SDP algorithms with state-of-the-art S3VM methods:

- SVMlight: the SVMlight algorithm (Joachims, 1999).
- ∇S3VM: the gradient descent S3VM algorithm (Chapelle et al., 2008).
- CS3VM: the convex relaxation for the TSVM (Xu et al., 2008).
- CCCP: the concave convex procedure for the S3VM (Chapelle and Zien, 2005; Collobert et al., 2006).
The accuracies of these models are computed on three datasets: Ionosphere, Sonar and WBC. The average results over the 10 trials are reported in Table 2.

[Fig. 4. Comparison of SDP-S3VM with MIP-S3VM in terms of ROC curves on the Hepatitis dataset.]

[Fig. 5. Comparison of SDP-S3VM with L2-SDP-S3VM in terms of PSF.]

Table 2
Comparisons with the traditional methods (accuracy).

Data   | SVMlight | ∇S3VM  | CCCP   | CS3VM  | SDP-S3VM | L2-SDP-S3VM
Ionosp | 0.7825   | 0.8172 | 0.8211 | 0.8009 | 0.8885   | 0.8885
Sonar  | 0.5526   | 0.6936 | 0.5601 | 0.6739 | 0.6211   | 0.5820
Cancer | 0.9645   | 0.9717 | 0.9689 | 0.9779 | 0.9818   | 0.9843

We see from Table 2 that the performances of SDP-S3VM (22) and L2-SDP-S3VM (29) are very close on all three datasets. At the same time, the proposed SDP algorithms achieve better performance than SVMlight, and show no significant difference in generalization compared with the traditional approaches CCCP, ∇S3VM and CS3VM. In summary, compared with the other S3VM methods mentioned, the proposed SDP relaxation methods either improve generalization or show no significant difference on the majority of datasets.

5.3. Experiments on the hybrid seed dataset

In this section the proposed method is applied to a practical problem: we recognize the purity of hybrid maize seeds using the proposed SDP relaxation model SDP-S3VM (22) and NIR spectral data. The "NongDa108" maize hybrid seeds and "mother178" seeds are used in this experiment. These seeds were harvested in Beijing, China, in 2008. A total of 240 seed samples, including 120 hybrid seeds and 120 mother seeds, were selected, yielding 240 spectra: 120 from hybrid seeds and 120 from mother seeds. The goal of this experiment is to classify the "NongDa108" and "mother178" maize seeds using the proposed SDP-S3VM (22) and NIR spectral data.

The 240 NIR spectra of the maize samples are illustrated in Fig. 6. From Fig. 6, the noise level is relatively high in the spectral range of 8000–12 000 cm⁻¹; thus, the numerical experiments were performed in the spectral range of 5000–8000 cm⁻¹. To validate our method comprehensively, the experiments were carried out in three different spectral ranges: 5000–6000 cm⁻¹, 6000–7000 cm⁻¹ and 7000–8000 cm⁻¹, denoted as regions A, B and C respectively. The sizes of these spectral sets are summarized in Table 3.

[Fig. 6. The near-infrared spectra of the maize seed samples.]

Table 3
Near-infrared spectral sample regions of the maize seeds.

Dataset  | Spectral range (cm⁻¹) | Number of samples | Number of wavelengths
Region A | 5000–6000             | 240               | 260
Region B | 6000–7000             | 240               | 260
Region C | 7000–8000             | 240               | 260

In this experiment, the two parameters ν and μ were chosen to maximize the accuracy by 5-fold cross-validation in the spectral region 5000–6000 cm⁻¹. The best model parameters ν = 1000 and μ = 1000 were chosen for SDP-S3VM (22) and applied to the other spectral regions. The generalization measures SE, SP, ACC and MCC are computed in the three spectral regions, and the average results of SDP-S3VM (22) and MIP-S3VM (39) are reported in Table 4. The proposed SDP-S3VM (22) does not outperform MIP-S3VM (39) in all three spectral regions: SDP-S3VM (22) outperforms MIP-S3VM (39) in generalization in spectral region C, and achieves slightly better ACC and MCC in spectral region A, while in spectral region B, SDP-S3VM yields results comparable to the MIP-S3VM approach.
Table 4
Comparisons of the two methods for solving the L1-norm S3VM (SDP and MIP), with the ratio of labeled to unlabeled samples being 2:8, in three spectral ranges: 5000–6000 cm⁻¹, 6000–7000 cm⁻¹ and 7000–8000 cm⁻¹.

Spectral region          | Method   | SE (%) | SP (%) | ACC (%) | MCC (%)
Region A (ν = μ = 1000)  | MIP-S3VM | 69.06  | 70.12  | 69.57   | 41.83
                         | SDP-S3VM | 71.66  | 85.34  | 78.50   | 57.54
Region B (ν = μ = 1000)  | MIP-S3VM | 59.86  | 68.57  | 64.22   | 34.34
                         | SDP-S3VM | 47.37  | 63.39  | 55.38   | 10.90
Region C (ν = μ = 1000)  | MIP-S3VM | 16.92  | 88.03  | 52.73   | 7.07
                         | SDP-S3VM | 58.10  | 70.47  | 64.29   | 28.79
Recognizing the purity of hybrid seeds is an important part of seed testing. We rigorously validated the proposed method in different spectral regions for classifying mother seeds and hybrid seeds in terms of different measures. The experimental results show the feasibility and effectiveness of the proposed method using NIR spectroscopic data.

6. Conclusion

This paper presents a novel convex relaxation framework for a class of nonconvex S3VMs. In this framework, we first reformulate the S3VMs as nonconvex quadratic optimization problems (QOPs). We then relax these nonconvex QOPs into SDP models, which compute a lower bound of the minimal objective value of the S3VMs, by removing the rank constraints. Two SDP relaxation models of the S3VMs are obtained, based on L2-norm and L1-norm regularization. The resulting SDP models have global optimal solutions and require solving only the primal problems, instead of the dual problems as in other SDP relaxations. Moreover, the proposed SDP relaxation based on L1-norm regularization can compress the dimensionality of the input data. As a typical and important application, the proposed approach is applied directly to identify the purity of maize seeds using NIR spectroscopy technology, from which we find that the proposed method achieves performance equivalent to the exact solution algorithm for solving the S3VM in different spectral regions. Experiments on several benchmark datasets demonstrate that the proposed convex technique is competitive in generalization with other SDP relaxation methods for solving semi-supervised SVMs. In addition, the proposed SDP relaxation for the L2-norm S3VM is extended to design a nonlinear S3VM classifier.

Acknowledgments

This work is supported by the National Nature Science Foundation of China (11271367) and the National Program on Key Basic Research Project (973 Program 2012CB720805). Moreover, the authors thank the referees and the editor for their constructive comments; their suggestions improved the paper significantly.

References

Astorino, A., Fuduli, A., 2007. Nonsmooth optimization techniques for semisupervised classification. IEEE Trans. Pattern Anal. Machine Intell. 9 (12), 2135–2142.
Bai, O., Huang, R.D., 2007. Comparison of plant height, light distributing and yield in different purity populations of maize. J. Maize Sci. 15 (3), 59–61.
Bennett, K., Demiriz, A., 1998. Semi-supervised support vector machines. In: Advances in Neural Information Processing Systems, vol. 12, pp. 368–374.
Bie, T.D., Cristianini, N., 2004. Convex Methods for Transduction. 〈http://www.tijldebie.net/papers/TDB-NC-03.pdf〉.
Bielza, C., Robles, V., Larranaga, P., 2011. Regularized logistic regression without a penalty term: an application to cancer classification with microarray data. Expert Syst. Appl. 38 (9), 5110–5118.
Chapelle, O., Zien, A., 2005. Semi-supervised classification by low density separation. In: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, Barbados, pp. 57–64.
Chapelle, O., Sindhwani, V., Keerthi, S., 2006. Branch and bound for semi-supervised support vector machines. In: Advances in Neural Information Processing Systems, vol. 17, pp. 217–224.
Chapelle, O., Sindhwani, V., Keerthi, S., 2008. Optimization techniques for semi-supervised support vector machines. J. Mach. Learn. Res. 9, 203–233.
Collobert, R., Sinz, F., Weston, J., Bottou, L., 2006. Large scale transductive SVMs. J. Mach. Learn. Res. 7, 1687–1712.
Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874.
Fung, G., Mangasarian, O., 2001. Semi-supervised support vector machines for unlabeled data classification. Optim. Methods Softw. 15, 29–44.
Joachims, T., 1999. Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on Machine Learning.
Kang, Y.Q., Hao, F., 2004. The study of the determination of seed moisture and seed vigor with Fourier transform near-infrared spectroscopy. Seed 23 (7), 10–16.
Kim, S., Kojima, M., 2001. Second order cone programming relaxation of nonconvex quadratic optimization problems. Optim. Methods Softw. 15, 201–224.
Mangasarian, O., Meyer, R.R., 2006. Absolute value equations. Linear Algebra Appl. 419, 359–367.
Reddy, I.S., Shevade, S., Murty, M.N., 2011. A fast quasi-Newton method for semi-supervised SVM. Pattern Recognit. 44, 2305–2313.
Sturm, J., 1999. Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones. Optim. Methods Softw. 11–12, 625–653.
Vandenberghe, L., Boyd, S., 1996. Semidefinite programming. SIAM Rev. 38 (1), 49–95.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Wang, J.H., Shen, X.T., Pan, W., 2009. On efficient large margin semisupervised learning: method and theory. J. Mach. Learn. Res. 10, 719–742.
Wu, G.Ch., Li, Y.H., Yang, X.W., Xi, J.Q., 2013. Local learning integrating global structure for large scale semi-supervised classification. Comput. Math. Appl. 66, 1961–1970.
Xu, L.L., Schuurmans, D., 2005. Unsupervised and semi-supervised multi-class support vector machines. In: Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-05), pp. 904–910.
Xu, Z.L., Jin, R., Zhu, J.K., King, I., Lyu, M.R., 2008. Efficient convex relaxation for transductive support vector machine. In: Advances in Neural Information Processing Systems, vol. 21.
Xu, H., Liu, Z.C., Cai, W.S., Shao, X.G., 2009. A wavelength selection method based on randomization test for near-infrared spectral analysis. Chemometr. Intell. Lab. Syst. 97, 189–193.
Yang, L.M., Wang, L.S.H., 2013. A class of smooth semi-supervised SVM by difference of convex functions. Knowl. Based Syst. 41, 1–7.