Improved multi-view privileged support vector machine

Jingjing Tang a, Yingjie Tian b,c,d,*, Xiaohui Liu e, Dewei Li f, Jia Lv g, Gang Kou a

a School of Business Administration, Southwestern University of Finance and Economics, Chengdu 611130, China
b Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
c School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
d Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
e Department of Computer Science, Brunel University London, Uxbridge, Middlesex, UB8 3PH, UK
f School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
g College of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China

Highlights

• IPSVM-MV serves as a general model for the multi-view scenario.
• We employ the alternating direction method of multipliers to solve IPSVM-MV efficiently.
• We theoretically analyze the performance of IPSVM-MV from two aspects.
• Experimental results demonstrate the effectiveness of the proposed method.

Article history: Received 30 December 2017; received in revised form 24 May 2018; accepted 29 June 2018.

Keywords: Multi-view learning; Support vector machine; Privileged information; Consensus; Complementarity

Abstract

Multi-view learning (MVL) concentrates on the problem of learning from data represented by multiple distinct feature sets. The consensus and complementarity principles play key roles in multi-view modeling. By exploiting the consensus principle or the complementarity principle among different views, various successful support vector machine (SVM)-based multi-view learning models have been proposed for performance improvement. Recently, a framework of learning using privileged information (LUPI) has been proposed to model data with complementary information. By bridging connections between the LUPI paradigm and multi-view learning, we previously presented a privileged SVM-based two-view classification model, named PSVM-2V, satisfying both principles simultaneously. However, it can be further improved in three aspects: (1) fully unleashing the power of the complementary information among different views; (2) extending to the multi-view case; (3) constructing a more efficient optimization solver. Therefore, in this paper, we propose an improved privileged SVM-based model for multi-view learning, termed IPSVM-MV. It directly follows the standard LUPI model to fully utilize the multi-view complementary information; it is also a general model for the multi-view scenario, and an alternating direction method of multipliers (ADMM) is employed to solve the corresponding optimization problem efficiently. Furthermore, we theoretically analyze the performance of IPSVM-MV from the viewpoints of the consensus principle and the generalization error bound. Experimental results on 75 binary data sets demonstrate the effectiveness of the proposed method; here we mainly concentrate on the two-view case in order to compare with state-of-the-art methods.

1. Introduction

Real-world data are usually collected from diverse domains or obtained from various feature extractors (see Fig. 1). These data can


be naturally partitioned into distinct feature sets, each of which is regarded as a particular view. Thus, each datum can be represented by multiple views, and different views often provide information complementary to each other. Multi-view learning (MVL) focuses on learning with multiple views for performance improvement and is a popular research direction in the machine learning field. Compared with single-view learning, conventional MVL algorithms either concatenate multiple views into one single view with a comprehensive description, or build a learning function for each feature view separately and then jointly optimize the learning


Fig. 1. Multi-view data: (a) an object can be described from different viewpoints such as image and text; (b) multilingual documents have one view in each language; (c) a web document can be represented by the text content and the citation links on the page; (d) a news web page can be depicted by its title, corresponding image and surrounding text.

function by exploiting the redundant views. The concatenating strategy ignores the statistical property of each view and leads to the curse of dimensionality, while the separation strategy considers each view independently. In fact, views are inherently related since they describe the same set of objects through different feature spaces. A number of methods (Blum & Mitchell, 1998; Xu, Tao, & Xu, 2013) have shown that learning with multiple views jointly is better than the naive approach of using one concatenated view or learning from each view separately.

To date, numerous multi-view learning algorithms have been proposed; they can be categorized into three groups (Xu et al., 2013): (1) co-training, (2) multiple kernel learning, and (3) subspace learning. In particular, co-training style algorithms iteratively maximize the mutual agreement on two distinct views to ensure consistency on the same validation data (Balcan, Blum, & Yang, 2004; Kumar & Daumé, 2011; Li, Nigel, Tao, & Li, 2006; Ménard & Frezza-Buet, 2005; Wang & Zhou, 2010; Wang, Zhang, Wu, Lin, & Zhao, 2017). Multiple kernel learning (MKL) algorithms use kernels that naturally correspond to different views and combine kernels either linearly or non-linearly to improve learning performance (Bach, Lanckriet, & Jordan, 2004; Rakotomamonjy, Bach, Canu, & Grandvalet, 2008; Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006; Tang & Tian, 2017). Subspace learning algorithms aim to obtain a latent subspace shared by multiple views under the assumption that the input views are generated from this latent subspace (Chao & Sun, 2016; Dhillon, Foster, & Ungar, 2011; Farquhar, Hardoon, Meng, Shawe-taylor, & Szedmak, 2005; Huang, Chung, & Wang, 2016; Liu et al., 2017; Mao & Sun, 2016; Sun & Keates, 2013; Zong, Zhang, Zhao, Yu, & Zhao, 2017). For a comprehensive survey of multi-view learning, please refer to Xu et al. (2013), Sun (2013) and Zhao, Xie, Xu, and Sun (2017).

Although apparent differences exist in the approaches to integrating multiple views for better performance, they mainly embody either the consensus principle or the complementarity principle (Xu et al., 2013) to ensure their success. The consensus principle aims at maximizing the agreement on the hypotheses among multiple views. For example, SVM-2K (Farquhar et al., 2005) minimizes the distance between the predictive functions of the two views as well as the loss within each view by following the consensus principle. In contrast, the complementarity principle emphasizes that each view of the data contains some knowledge not present in the other views


and multiple views share such complementary information to describe the data comprehensively and accurately. The consensus and complementarity principles play key roles in guiding model construction for effective multi-view learning.

From the perspective that MVL targets the complementary knowledge in the learning process, we find another similar framework called learning using privileged information (LUPI) (Lapin, Hein, & Schiele, 2014; Vapnik & Izmailov, 2015; Vapnik & Vashist, 2009). Different from the conventional learning paradigm, where the training data and the test data have the same representations, LUPI models training data that contain additional information available only during the training process. Such additional information is referred to as privileged information, and LUPI aims at leveraging the privileged information to boost performance. A possible analogy of LUPI is human learning with a teacher: when a student learns a geometry concept in school, the teacher acts as the oracle and provides additional explanations (privileged information) associated with the answers (main information) at any time. However, when later in life the student encounters a new geometry problem, he or she no longer has access to the teacher's expertise. LUPI incorporates this idea of human teaching into modeling. Under the LUPI paradigm, the standard model SVM+ (Vapnik & Vashist, 2009) and its reinforced version SVM∆+ (Vapnik & Izmailov, 2015) are built on the direct observation that a non-linearly separable (soft-margin) support vector machine can be improved if one has access to a so-called slack oracle (Vapnik & Vashist, 2009). SVM∆+ improves SVM+ by fully utilizing the privileged information to build better classifiers.

LUPI and MVL are similar in the sense that they both try to exploit all the useful information to improve learning performance, even though there exist some differences between them (as shown in Fig. 2). Considering multiple views and privileged information together, the other views can jointly act as an oracle teacher that gives additional comments on one particular view. That means different views can mutually provide privileged information to complement and enrich each other. Thus, multiple views share complementary information analogous to the comments from the teacher in human learning. In this case, it is natural and beneficial to extend LUPI to the MVL field.

In our previous work, we proposed a privileged SVM-based method, PSVM-2V, for two-view learning (Tang, Tian, Zhang, & Liu, 2017). Despite its distinctive advantages, PSVM-2V can be further improved in the following aspects. First, although it utilizes the idea of LUPI to realize the complementarity principle, it only implicitly connects LUPI with the slack variables, which are lower bounded by the unknown ''correcting function'' (explained in Section 3) over the views that are deemed privileged information. To fully unleash the power of the complementary information among different feature views, exactly following the LUPI model is beneficial and builds better connections between MVL and LUPI. Second, PSVM-2V is built only for two-view learning and cannot be extended to deal with multi-view (more than two views) problems; designing a new model for the multi-view case is indispensable. Third, PSVM-2V relies on a generic quadratic programming solver for its solution. It is certainly preferable to construct a more efficient optimization solver.
In this paper, we propose an improved privileged SVM-based model for multi-view learning, termed IPSVM-MV. The main contributions of this work are listed as follows: (1) As a general model for the multi-view scenario, IPSVM-MV directly follows the LUPI model to fully utilize the multi-view complementary information. (2) We employ the alternating direction method of multipliers to obtain the solution of IPSVM-MV efficiently.


Fig. 2. The differences between the learning using privileged information (LUPI) paradigm and multi-view learning (MVL) are as follows: (a) LUPI aims to use the privileged information to assist the main (original) information to complete the learning task. The main information is available at training and test time, yet the privileged information is only available at training time. By building the LUPI model, we achieve the decision function corresponding to the main feature space. (b) MVL aims to learn from all the views (main information) jointly to complete the learning task under the consensus principle and the complementarity principle. By building the MVL model, we achieve the decision functions corresponding to each view respectively, or the decision function corresponding to multi-view collectively.

(3) We theoretically analyze the performance of IPSVM-MV from two aspects, i.e., the consensus principle and the generalization error bound. (4) Experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods on 75 binary data sets.

The remainder of this paper is organized as follows. Section 2 briefly reviews related works. Section 3 introduces the IPSVM-MV model. Section 4 provides the corresponding algorithm to obtain the solution. Section 5 theoretically analyzes the performance of the proposed method. Section 6 presents experimental results and discussions. Finally, this paper is concluded in Section 7.

2. Related works

In this section, we briefly introduce the SVM∆+ and PSVM-2V models, which are closely relevant to our proposed IPSVM-MV model.

2.1. SVM∆+

To implement the LUPI paradigm, the SVM+ model (Vapnik & Vashist, 2009) and its reinforced version SVM∆+ (Vapnik & Izmailov, 2015) are built on the direct observation that a non-linearly separable (soft-margin) SVM can be improved if one has access to a so-called slack oracle (Vapnik & Vashist, 2009). Assuming that privileged information is functionally related to the slack variables, SVM+ and SVM∆+ exploit the privileged information as a proxy to the oracle. Theoretical analyses (Pechyony & Vapnik, 2010; Vapnik & Izmailov, 2015) and efficient algorithms (Li, Dai, Tan, Xu, & Van Gool, 2016) for these two models have been presented, aiming at guaranteeing and enhancing the classification performance. As a reinforced version, SVM∆+ can degenerate to SVM+ under an appropriate selection of parameters. Thus, it is certainly preferable to choose SVM∆+ as the foundation of our improved model. Here, we give the primal formulation of SVM∆+ for simplicity. By appending a constant entry of 1 to the main and privileged feature vectors respectively, we consider an augmented training set {(x_i, x_i^*, y_i)}_{i=1}^l, where x_i ∈ R^n, x_i^* ∈ R^m, y_i ∈ {−1, 1}, i = 1, ..., l. SVM∆+ seeks a real-valued decision function g(x) = sgn(f(x)) that takes full advantage of this prior knowledge for better performance. The primal SVM∆+ model is formulated as follows:

$$
\begin{aligned}
\min_{w,\,w^*}\quad & \frac{1}{2}\left(\|w\|^2 + \gamma\|w^*\|^2\right) + C\sum_{i=1}^{l}\left[y_i\,(w^*\cdot\Phi^*(x_i^*))\right]_+ \\
\text{s.t.}\quad & y_i\,(w\cdot\Phi(x_i)) \geqslant 1 - \left[y_i\,(w^*\cdot\Phi^*(x_i^*))\right]_+,\quad i = 1,\ldots,l,
\end{aligned}
\tag{1}
$$

where C > 0 and γ > 0 are parameters and [·]_+ = max{0, ·}.
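To make the role of the correcting term in (1) concrete, the following minimal NumPy sketch evaluates the SVM∆+ objective and constraint violations for fixed weight vectors. It assumes linear (identity) feature maps Φ and Φ*, and all function and variable names are our own illustrative choices rather than the paper's implementation.

```python
import numpy as np

def svm_delta_plus_objective(w, w_star, X, X_star, y, C=1.0, gamma=1.0):
    """Evaluate the SVM Delta+ primal objective of Eq. (1) for fixed (w, w*).

    X      : (l, n) main-feature matrix (augmented with a constant 1).
    X_star : (l, m) privileged-feature matrix (also augmented).
    y      : (l,) labels in {-1, +1}.
    The correcting function [y_i (w* . x_i*)]_+ plays the role of the slack variable.
    """
    correcting = np.maximum(0.0, y * (X_star @ w_star))   # [y_i (w* . x_i*)]_+
    obj = 0.5 * (w @ w + gamma * (w_star @ w_star)) + C * correcting.sum()
    # Constraint of (1): y_i (w . x_i) >= 1 - correcting_i; report any violation.
    violation = np.maximum(0.0, 1.0 - correcting - y * (X @ w))
    return obj, violation

# Toy usage with random data (illustration only).
rng = np.random.default_rng(0)
X, X_star = rng.normal(size=(6, 3)), rng.normal(size=(6, 4))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
obj, viol = svm_delta_plus_objective(rng.normal(size=3), rng.normal(size=4), X, X_star, y)
print(obj, viol.max())
```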

2.2. PSVM-2V

In Tang et al. (2017), a privileged SVM-based model called PSVM-2V was proposed following both the consensus and complementarity principles for two-view classification. To be specific, the consensus principle was considered by adding a regularization term to bridge the gap between the two classifiers from the two views; regarding the two views as privileged information for each other, the complementarity principle was realized by an adaptation of LUPI.

Let X = X^A × X^B be the instance space, where X^A and X^B are the two view spaces, and Y = {+1, −1} denotes the label space. Suppose that D is an unknown (underlying) distribution over X × Y, and what we observe is a training data set

$$
S = \{(\bar{x}_i^A, \bar{x}_i^B, y_i)\}_{i=1}^{l} = \{((x_i^A; 1), (x_i^B; 1), y_i)\}_{i=1}^{l},
\tag{2}
$$

where each example is drawn independently and identically (i.i.d.) from the distribution D; \bar{x}_i^A and \bar{x}_i^B (resp. x_i^A and x_i^B) denote the ith augmented (resp. original) samples from view A and view B, and y_i ∈ {−1, 1}. PSVM-2V can be formally built as follows:

$$
\begin{aligned}
\min_{w_A,\,w_B}\quad & \frac{1}{2}\left(\|w_A\|^2 + \gamma\|w_B\|^2\right) + C^A\sum_{i=1}^{l}\xi_i^{A*} + C^B\sum_{i=1}^{l}\xi_i^{B*} + C\sum_{i=1}^{l}\eta_i \\
\text{s.t.}\quad & \left|(w_A\cdot\phi_A(x_i^A)) - (w_B\cdot\phi_B(x_i^B))\right| \leqslant \varepsilon + \eta_i, \\
& y_i\,(w_A\cdot\phi_A(x_i^A)) \geqslant 1 - \xi_i^{A*},\qquad y_i\,(w_B\cdot\phi_B(x_i^B)) \geqslant 1 - \xi_i^{B*}, \\
& \xi_i^{A*} \geqslant y_i\,(w_B\cdot\phi_B(x_i^B)),\qquad \xi_i^{B*} \geqslant y_i\,(w_A\cdot\phi_A(x_i^A)), \\
& \xi_i^{A*},\ \xi_i^{B*},\ \eta_i \geqslant 0,\quad i = 1,\ldots,l,
\end{aligned}
\tag{3}
$$

where w_A and w_B are the weight vectors of view A and view B respectively. Motivated by the LUPI paradigm, the PSVM-2V model restricts the nonnegative slack variables ξ_i^{A*} and ξ_i^{B*} of view A and view B by the unknown nonnegative correcting functions determined by view B and view A respectively; thus, the complementarity principle is realized. The first constraint enforces consensus between the two views and uses the slack variables η_i to measure the number of points failing to meet the ε similarity. C^A, C^B and C are nonnegative penalty parameters, and γ is a nonnegative trade-off parameter.

3. The IPSVM-MV model

In this section, we propose the IPSVM-MV model in detail, followed by some discussions on it.

IPSVM-MV fully utilizes the consensus and complementarity principles. In particular, it models the consistency between any two views with regularization terms. Since multi-view data


collected from diverse domains are informative and complement each other, a distinct feature view can receive explicit privileged information from the remaining views, which together act as the oracle teacher. Thus, to fulfill the complementarity principle, IPSVM-MV models the complementary information exactly following the SVM∆+ model. As a general model, IPSVM-MV is designed for the multi-view scenario.

To begin with, we introduce one definition (Vapnik & Vashist, 2009) used in the paper, i.e., the correcting function.

Definition 1 (Correcting Function). Suppose that there exists the best but unknown linear hyperplane (w_0 · x) + b_0 = 0 for the classification problem, and that the Oracle function ξ_0(x) of the input x is defined as ξ_0(x) = [1 − y((w_0 · x) + b_0)]_+, where [τ]_+ = max{0, τ}. Then the correcting function (slack function), assumed to be a function of the privileged information, is the approximation of the Oracle function and constitutes the correcting space.

3.1. Problem formulation

Let X = X^(1) × ··· × X^(m) be the instance space, where X^(v) is the vth view space, Y = {+1, −1} denotes the label space and m is the number of views. Suppose that D is an underlying distribution over X × Y. Consider an l-sized labeled multi-view training data set S = {(x_i^(1), ..., x_i^(m), y_i)}_{i=1}^l, where each example is drawn independently and identically (i.i.d.) from the distribution D; x_i^(v) ∈ R^{d_v×1} is the vth view feature vector of the ith datum and y_i ∈ {−1, +1}. Table 1 summarizes the major notations used in this paper.

By regarding one view as the main information and the remaining views as its privileged information (privileged views), each pair of views, i.e., main view and privileged view, shares complementary information. Since SVM∆+ can fully unleash the power of the complementary information with a better classifier, IPSVM-MV directly follows it to fulfill the complementarity principle. To achieve the consensus principle, IPSVM-MV utilizes regularization terms to bridge the gap between the predictive functions from any two distinct views. Fig. 3 illustrates the model construction of IPSVM-MV.

Fig. 3. An illustration of the model construction of IPSVM-MV. Regard one view as the main information and the other views as its privileged information. Each pair of views (main view and privileged view) provides complementary information to each other. By connecting multiple views with privileged information, IPSVM-MV exactly follows the LUPI model to realize the complementarity principle. The regularization terms of the classifiers on any two distinct views are imposed to fulfill the consensus principle.

Formally, IPSVM-MV can be built as follows:

$$
\begin{aligned}
\min_{w_1,\ldots,w_m}\quad & \sum_{v=1}^{m}\gamma_v\|w_v\|^2 + C\sum_{i=1}^{l}\sum_{p=1}^{m-1}\sum_{q=p+1}^{m}\eta_i^{(p,q)} + \sum_{v=1}^{m}\frac{C^{(v)}}{m-1}\sum_{i=1}^{l}\sum_{k=1,\,k\neq v}^{m}\left[y_i\,(w_k\cdot\phi_k(x_i^{(k)}))\right]_+ \\
\text{s.t.}\quad & \left|(w_p\cdot\phi_p(x_i^{(p)})) - (w_q\cdot\phi_q(x_i^{(q)}))\right| \leqslant \varepsilon + \eta_i^{(p,q)}, \\
& y_i\,(w_v\cdot\phi_v(x_i^{(v)})) \geqslant 1 - \left[y_i\,(w_k\cdot\phi_k(x_i^{(k)}))\right]_+, \\
& \eta_i^{(p,q)} \geqslant 0,\quad p = 1,\ldots,m-1,\ q = p+1,\ldots,m,\ v,k = 1,\ldots,m,\ k\neq v,\ i = 1,\ldots,l.
\end{aligned}
\tag{4}
$$

In order to further justify the mechanism of IPSVM-MV, we give the following analyses and explanations:

(1) The regularization term Σ_{v=1}^m γ_v ∥w_v∥² aims at avoiding over-fitting by restricting the capacities of the sets of classifiers for the m views. The nonnegative trade-off parameters {γ_v}_{v=1}^m balance the relationship among them.

(2) SVM can be considered as a data projection model that projects data from the feature space to a one-dimensional space (the class label). By exploiting the label correlation, the consistency between the predictors on any two views is imposed by the first constraint, i.e., an ε-insensitive 1-norm constraint. ε is a parameter used to allow samples that violate the constraint, and η_i^{(p,q)} is a nonnegative slack variable measuring the failure of the classifiers from the pth and qth views to meet the ε similarity on the ith sample. C is a nonnegative penalty parameter.

(3) The slack variable in SVM measures the degree of misclassification of each training sample. By directly following the SVM∆+ model, IPSVM-MV models the slack variable of the vth view with the nonnegative correcting function [y_i(w_k · φ_k(x_i^(k)))]_+ determined by the kth view, k = 1, ..., m, k ≠ v. Then, the loss caused by a training sample on the vth view can be upper bounded by the correcting function [y_i(w_k · φ_k(x_i^(k)))]_+, as described in the second constraint of (4). Thus, the complementary information between views is fully utilized to realize the complementarity principle. {C^(v)}_{v=1}^m are nonnegative penalty parameters.

(4) Once {w_v}_{v=1}^m are solved from (4), the classifiers can be established on each view separately or on the m views jointly to predict the label of a new sample, according to the specific conditions.

In brief, by imposing the consistency constraints and following the SVM∆+ model, IPSVM-MV satisfies the consensus and complementarity principles for multi-view learning.
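As a concrete illustration of the indexing over views and view pairs in (4), the following NumPy sketch evaluates the objective for fixed linear weight vectors, with the slack variables η_i^{(p,q)} set to their smallest feasible values. The function and variable names, and the use of identity feature maps, are our own assumptions and not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def ipsvm_mv_objective(ws, Xs, y, gammas, C, Cs, eps=0.001):
    """Objective of problem (4) for fixed weights and linear feature maps.

    ws     : list of m weight vectors, ws[v] of shape (d_v,).
    Xs     : list of m view matrices, Xs[v] of shape (l, d_v).
    gammas : list of m trade-off parameters gamma_v.
    C, Cs  : consensus penalty C and per-view penalties C^(v).
    """
    m = len(ws)
    scores = [Xs[v] @ ws[v] for v in range(m)]          # f_v(x_i^(v)) for all i
    reg = sum(gammas[v] * ws[v] @ ws[v] for v in range(m))
    # Consensus part: eta_i^(p,q) = [|f_p - f_q| - eps]_+ for every view pair p < q.
    consensus = sum(np.maximum(0.0, np.abs(scores[p] - scores[q]) - eps).sum()
                    for p, q in combinations(range(m), 2))
    # Complementarity part: correcting functions [y_i f_k(x_i^(k))]_+ from the other views.
    compl = sum(Cs[v] / (m - 1) * sum(np.maximum(0.0, y * scores[k]).sum()
                                      for k in range(m) if k != v)
                for v in range(m))
    return reg + C * consensus + compl

# Toy usage with three random views (m = 3).
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(8, d)) for d in (5, 7, 3)]
ws = [rng.normal(size=d) for d in (5, 7, 3)]
y = rng.choice([-1.0, 1.0], size=8)
print(ipsvm_mv_objective(ws, Xs, y, gammas=[1.0] * 3, C=1.0, Cs=[1.0] * 3))
```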

3.2. Discussions

In this section, we discuss the similarities and differences among the models of SVM-2K (Farquhar et al., 2005), PSVM-2V (Tang et al., 2017) and IPSVM-MV.

The models of SVM-2K, PSVM-2V and IPSVM-MV are built on the commonly accepted multi-view learning assumptions that each feature view alone can provide an informative classifier and that the classifiers built from different feature views tend to be consistent for prediction. Thus, modeling only two views, SVM-2K and PSVM-2V impose a consistency constraint over the two views' predictors. Modeling multiple views, IPSVM-MV similarly enforces the classifiers trained from any two different views to agree on the training data. Thus, the consensus principle in these three models is guaranteed.

As for the complementarity principle, SVM-2K ignores it. In contrast, by connecting multiple views with privileged information, PSVM-2V and IPSVM-MV draw on the idea of LUPI to fulfill it. Each individual view can receive complementary information from the other views. Such complementary information is analogous to the privileged information from the teacher in human learning, which has been fully utilized by the LUPI paradigm. Enlightened by it, PSVM-2V and IPSVM-MV extend the LUPI paradigm to multi-view learning for the realization of the complementarity principle.

The main difference between PSVM-2V and IPSVM-MV lies in how the LUPI paradigm is adapted to the multi-view learning scenario. Instead of restricting the slack variables with correcting functions as in PSVM-2V, IPSVM-MV directly corrects the values of the slack variables with the correcting functions. For intuition, let us illustrate this with the oracle analogy. Multiple views act as the oracle to provide complementary information for each other mutually. For PSVM-2V, as shown in problem (3), the oracle builds the correcting functions to restrict the values of the slack variables on each two-view sample. Taking the ith sample from view A as an example, the corresponding slack variable is restricted to ξ_i^{A*} ⩾ max{0, 1 − y_i(w_A · φ_A(x_i^A)), y_i(w_B · φ_B(x_i^B))}. In the IPSVM-MV model, the oracle directly corrects the values of the slack variables with the correcting functions, exactly following the SVM∆+ model. Without loss of generality, consider IPSVM-MV in the two-view case with views A and B. For (4), let the slack variable of the ith sample from view A be ξ_i^{A*} = [y_i(w_B · φ_B(x_i^B))]_+. According to the second constraint, we have ξ_i^{A*} = [y_i(w_B · φ_B(x_i^B))]_+ = max{0, y_i(w_B · φ_B(x_i^B))} ⩾ 1 − y_i(w_A · φ_A(x_i^A)). Thus, ξ_i^{A*} is restricted to a smaller feasible region. Accordingly, the optimal solution of IPSVM-MV can be achieved within a more appropriate feasible region. Therefore, IPSVM-MV is more effective and accurate than PSVM-2V, as we shall verify in Section 6.

Table 1
List of notations.

Notation: Description
(x_i^(1), ..., x_i^(m), y_i): ith training point
l, m: number of training data, number of views
(x_i · x_j): inner product between x_i and x_j, i.e., x_i^⊤ x_j
w_1, ..., w_m: weight vectors for views 1, ..., m
φ_1, ..., φ_m: mappings from inputs to high-dimensional feature spaces
κ_v(x_i^(v), x_j^(v)): kernel function (φ_v(x_i^(v)) · φ_v(x_j^(v))) on view v
C, C^(1), ..., C^(m): penalty parameters
γ_1, ..., γ_m: trade-off parameters
| · |, ∥ · ∥: 1-norm, 2-norm
[·]_+: max{0, ·}
⊤: transpose operation on vectors or matrices

4. Optimization

Without loss of generality, we mainly concentrate on the two-view scenario for the solution of IPSVM-MV, i.e., m = 2. Note that the solution for the multi-view case (m > 2) is similar; due to the limited space, we omit the details for multiple views here.

Fig. 4. An intuitive illustration of IPSVM-MV in the two-view setting, i.e., m = 2. The model imposes the two-view regularization term and follows the SVM∆+ model to satisfy the consensus and complementarity principles respectively.

Considering the two-view data set (2), the IPSVM-MV model is intuitively built as shown in Fig. 4, and can be formulated as follows:

$$
\begin{aligned}
\min_{w_A,\,w_B}\quad & \frac{1}{2}\left(\|w_A\|^2 + \gamma\|w_B\|^2\right) + C^A\sum_{i=1}^{l}\left[y_i\,(w_B\cdot\phi_B(x_i^B))\right]_+ + C^B\sum_{i=1}^{l}\left[y_i\,(w_A\cdot\phi_A(x_i^A))\right]_+ + C\sum_{i=1}^{l}\eta_i \\
\text{s.t.}\quad & \left|(w_A\cdot\phi_A(x_i^A)) - (w_B\cdot\phi_B(x_i^B))\right| \leqslant \varepsilon + \eta_i, \\
& y_i\,(w_A\cdot\phi_A(x_i^A)) \geqslant 1 - \left[y_i\,(w_B\cdot\phi_B(x_i^B))\right]_+, \\
& y_i\,(w_B\cdot\phi_B(x_i^B)) \geqslant 1 - \left[y_i\,(w_A\cdot\phi_A(x_i^A))\right]_+, \\
& \eta_i \geqslant 0,\quad i = 1,\ldots,l.
\end{aligned}
\tag{5}
$$

However, due to the element [u]_+ = max{0, u} appearing both in the objective function and in the constraints, we face a tricky nonlinear optimization problem. Following Vapnik and Izmailov (2015), we introduce the variables τ^A = (τ_1^A, ..., τ_l^A)^⊤ and τ^B = (τ_1^B, ..., τ_l^B)^⊤, and then approximate problem (5) with the following quadratic optimization problem:

$$
\begin{aligned}
\min_{w_A,\,w_B}\quad & \frac{1}{2}\left(\|w_A\|^2 + \gamma\|w_B\|^2\right) + C\sum_{i=1}^{l}\eta_i + \Delta^A C^A\sum_{i=1}^{l}\tau_i^B + \Delta^B C^B\sum_{i=1}^{l}\tau_i^A \\
& \quad + C^A\sum_{i=1}^{l}\left[y_i\,(w_B\cdot\phi_B(x_i^B)) + \tau_i^B\right] + C^B\sum_{i=1}^{l}\left[y_i\,(w_A\cdot\phi_A(x_i^A)) + \tau_i^A\right] \\
\text{s.t.}\quad & \left|(w_A\cdot\phi_A(x_i^A)) - (w_B\cdot\phi_B(x_i^B))\right| \leqslant \varepsilon + \eta_i, \\
& y_i\,(w_A\cdot\phi_A(x_i^A)) \geqslant 1 - \left[y_i\,(w_B\cdot\phi_B(x_i^B)) + \tau_i^B\right], \\
& y_i\,(w_B\cdot\phi_B(x_i^B)) \geqslant 1 - \left[y_i\,(w_A\cdot\phi_A(x_i^A)) + \tau_i^A\right], \\
& y_i\,(w_A\cdot\phi_A(x_i^A)) + \tau_i^A \geqslant 0,\qquad y_i\,(w_B\cdot\phi_B(x_i^B)) + \tau_i^B \geqslant 0, \\
& \eta_i,\ \tau_i^A,\ \tau_i^B \geqslant 0,\quad i = 1,\ldots,l.
\end{aligned}
\tag{6}
$$
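The step from (5) to (6) replaces each non-smooth term [u]_+ by u + τ with τ ⩾ 0 and u + τ ⩾ 0. The short check below, with made-up numbers, illustrates why minimizing over such a τ recovers max{0, u}; it is only a sanity check of the substitution, not part of the solver.

```python
import numpy as np

# For fixed u, min_{tau >= 0, u + tau >= 0} (u + tau) equals max(0, u):
# if u >= 0 the minimizer is tau = 0; otherwise tau = -u is forced.
for u in (-1.5, 0.0, 2.0):
    tau = max(0.0, -u)                      # smallest feasible tau
    assert np.isclose(u + tau, max(0.0, u))
print("u + tau reproduces [u]_+ at the smallest feasible tau")
```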

4.1. The dual problem

In order to obtain the solution of (6), we derive its dual problem in this section and provide the corresponding concise formulation.

Theorem 1. The dual problem of (6) is a convex quadratic programming problem (QPP) with respect to the nonnegative Lagrange multipliers α_i^A, α_i^B, β_i^+, β_i^−, β_i^A, β_i^B, as shown in (7):

$$
\begin{aligned}
\min\quad & \frac{1}{2}\sum_{i,j=1}^{l}\Big[\big(\alpha_i^A y_i + \alpha_i^B y_i - \beta_i^+ + \beta_i^- + \beta_i^B y_i - C^B y_i\big)\big(\alpha_j^A y_j + \alpha_j^B y_j - \beta_j^+ + \beta_j^- + \beta_j^B y_j - C^B y_j\big)\,\kappa_A(x_i^A, x_j^A) \\
& \qquad + \frac{1}{\gamma}\big(\alpha_i^A y_i + \alpha_i^B y_i + \beta_i^+ - \beta_i^- + \beta_i^A y_i - C^A y_i\big)\big(\alpha_j^A y_j + \alpha_j^B y_j + \beta_j^+ - \beta_j^- + \beta_j^A y_j - C^A y_j\big)\,\kappa_B(x_i^B, x_j^B)\Big] \\
& \quad + \varepsilon\sum_{i=1}^{l}\big(\beta_i^+ + \beta_i^-\big) - \sum_{i=1}^{l}\big(\alpha_i^A + \alpha_i^B\big) \\
\text{s.t.}\quad & \alpha_i^A + \beta_i^A \leqslant (1 + \Delta^A)C^A,\qquad \alpha_i^B + \beta_i^B \leqslant (1 + \Delta^B)C^B, \\
& \beta_i^+ + \beta_i^- \leqslant C,\qquad \alpha_i^A,\ \alpha_i^B,\ \beta_i^+,\ \beta_i^-,\ \beta_i^A,\ \beta_i^B \geqslant 0,\quad i = 1,\ldots,l.
\end{aligned}
\tag{7}
$$

Proof. See Appendix A.1.

Apparently, the optimization problem (7) has an elegant formulation similar to the standard SVM. To begin with, define the ith elements of the l-length column vectors α_A, α_B, β_+, β_−, β_A, β_B as α_i^A, α_i^B, β_i^+, β_i^−, β_i^A, β_i^B, and let π = (α_A^⊤, α_B^⊤, β_+^⊤, β_−^⊤, β_A^⊤, β_B^⊤)^⊤. Concisely, problem (7) can be further reformulated as

$$
\begin{aligned}
\min_{\pi}\quad & \frac{1}{2}\pi^{\top}H\pi + p^{\top}\pi \\
\text{s.t.}\quad & A\pi \leqslant b,\qquad \pi \geqslant 0.
\end{aligned}
\tag{8}
$$

The specific expressions of H, p, A, b are given in Appendix A.2. By applying the KKT conditions, we can reach the following conclusions for problem (8) without proofs; they are similar to the conclusions in Deng, Tian, and Zhang (2012).

Theorem 2. Suppose that π* = (α_A^⊤, α_B^⊤, β_+^⊤, β_−^⊤, β_A^⊤, β_B^⊤)^⊤ is the solution of problem (8); then for i = 1, ..., l, each pair of β_i^+ and β_i^− cannot be simultaneously nonzero, i.e., β_i^+ β_i^− = 0.

Theorem 3. Suppose that π* = (α_A^⊤, α_B^⊤, β_+^⊤, β_−^⊤, β_A^⊤, β_B^⊤)^⊤ is the solution of problem (8); then the optimal solutions w_A^* and w_B^* of (6) can be obtained by the following formulas respectively:

$$
w_A^* = \sum_{i=1}^{l}\big(\alpha_i^A y_i + \alpha_i^B y_i - \beta_i^+ + \beta_i^- + \beta_i^B y_i - C^B y_i\big)\,\phi_A(x_i^A),
\tag{9}
$$

$$
w_B^* = \frac{1}{\gamma}\sum_{i=1}^{l}\big(\alpha_i^A y_i + \alpha_i^B y_i + \beta_i^+ - \beta_i^- + \beta_i^A y_i - C^A y_i\big)\,\phi_B(x_i^B).
\tag{10}
$$

4.2. Optimization via ADMM

The dual of the approximate IPSVM-MV model boils down to a convex quadratic programming problem (QPP) and can be solved by the alternating direction method of multipliers (ADMM). More details about the ADMM algorithm can be found in Appendix B. For the dual problem (8), the inequalities should first be converted into equalities by introducing slack variables; then ADMM can be applied to solve the transformed problem directly. Introducing additional variables η = (η_1^⊤, η_2^⊤)^⊤, problem (8) can be transformed into

$$
\begin{aligned}
\min_{\pi}\quad & \frac{1}{2}\pi^{\top}H\pi + p^{\top}\pi \\
\text{s.t.}\quad & A\pi + \eta_1 = b,\qquad \pi - \eta_2 = 0,\qquad \eta_1,\ \eta_2 \geqslant 0,
\end{aligned}
\tag{11}
$$

which can be rewritten as

$$
\begin{aligned}
\min_{\pi}\quad & \frac{1}{2}\pi^{\top}H\pi + p^{\top}\pi + g(\eta) \\
\text{s.t.}\quad & F\pi + G\eta = c,
\end{aligned}
\tag{12}
$$

where F = (A^⊤, E_{6l})^⊤ ∈ R^{9l×6l}, G = (E_{3l}, 0_{3l×6l}; 0_{6l×3l}, E_{6l}) ∈ R^{9l×9l}, c = (b^⊤, 0_{1×6l})^⊤ ∈ R^{9l×1}, and

$$
g(\eta) =
\begin{cases}
0, & \text{if } \eta \succcurlyeq 0;\\
+\infty, & \text{otherwise}.
\end{cases}
\tag{13}
$$

The expressions of H, p, A, b are provided in Appendix A.2. Algorithm 1 summarizes the procedure for solving problem (12). A detailed convergence analysis of this procedure is described in Nishihara, Lessard, Recht, Packard, and Jordan (2015). In our experiments, we set the maximum number of iterations to forty and observe that the algorithm converges before reaching this maximum in most cases.

Algorithm 1 ADMM for problem (12)

1. Given the training set and the parameters γ, C^A, C^B, C, ∆^A, ∆^B, ε ⩾ 0 and the kernel parameter σ, initialize π^0, η^0, h^0; set k = 0 and the convergence threshold δ (0 < δ ≪ 1).

2. Solve the problems

$$
\pi^{k+1} = \arg\min_{\pi}\Big(\frac{1}{2}\pi^{\top}H\pi + p^{\top}\pi + \frac{\rho}{2}\big\|F\pi + G\eta^{k} - c + h^{k}\big\|^2\Big),
\tag{14}
$$

$$
\eta^{k+1} = \arg\min_{\eta}\Big(g(\eta) + \frac{\rho}{2}\big\|F\pi^{k+1} + G\eta - c + h^{k}\big\|^2\Big),
\tag{15}
$$

$$
h^{k+1} = h^{k} + F\pi^{k+1} + G\eta^{k+1} - c,
\tag{16}
$$

and get the solution π^{k+1}, η^{k+1}.

3. Compute the primal residual r^{k+1} = Fπ^{k+1} + Gη^{k+1} − c and the dual residual s^{k+1} = ρF^⊤G(η^{k+1} − η^k). If ∥r^{k+1}∥ > δ or ∥s^{k+1}∥ > δ, set k = k + 1 and go to step 2; otherwise, return the solution π* = π^{k+1}, η* = η^{k+1}.

Once the optimal w_A^* and w_B^* of problem (6) are achieved, the decision functions corresponding to a two-view sample (x^A, x^B) are built as follows:

$$
f_A = \mathrm{sign}(f_A(x^A)) = \mathrm{sign}\big(w_A^{*\top}\phi_A(x^A)\big),
\tag{17}
$$

$$
f_B = \mathrm{sign}(f_B(x^B)) = \mathrm{sign}\big(w_B^{*\top}\phi_B(x^B)\big).
\tag{18}
$$

Since multi-view features can complement one another, the final predictor can be constructed as the average of the predictors from the two views collectively:

$$
f = \mathrm{sign}(f(x^A, x^B)) = \mathrm{sign}\big(0.5\,(w_A^{*\top}\phi_A(x^A) + w_B^{*\top}\phi_B(x^B))\big).
\tag{19}
$$

Similarly, by using the ADMM algorithm to solve (4) with more than two views (m > 2), we can obtain the optimal solutions {w_v^*}_{v=1}^m. Accordingly, the following formulas can be built to predict the label of a new m-view example (x^(1), ..., x^(m)), respectively and collectively:

$$
f_v = \mathrm{sign}(f_v(x^{(v)})) = \mathrm{sign}\big(w_v^{*\top}\phi_v(x^{(v)})\big),\quad v = 1,\ldots,m,
\tag{20}
$$

$$
f = \mathrm{sign}(f(x^{(1)},\ldots,x^{(m)})) = \mathrm{sign}\Big(\frac{1}{m}\sum_{v=1}^{m}f_v(x^{(v)})\Big) = \mathrm{sign}\Big(\frac{1}{m}\sum_{v=1}^{m}w_v^{*\top}\phi_v(x^{(v)})\Big).
\tag{21}
$$
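The NumPy sketch below mirrors the structure of the scaled-form ADMM updates (14)-(16) in Algorithm 1 for a generic quadratic program of type (8), i.e. min (1/2)π^⊤Hπ + p^⊤π subject to Aπ ⩽ b, π ⩾ 0, rewritten with equality constraints as in (11)-(12). The matrices here are toy stand-ins rather than the H, p, A, b of Appendix A.2, the sign convention of the G-block is chosen so that the sketch matches (11), and ρ and the stopping rule are illustrative; it is a structural sketch, not the authors' solver.

```python
import numpy as np

def admm_box_qp(H, p, A, b, rho=1.0, max_iter=200, tol=1e-8):
    """Scaled ADMM for  min 0.5 pi'H pi + p'pi  s.t. A pi <= b, pi >= 0,
    rewritten as  F pi + G eta = c, eta >= 0  with F = [A; I],
    G = blkdiag(I, -I), c = [b; 0]  (cf. Eqs. (11)-(16))."""
    n, k = H.shape[0], A.shape[0]
    F = np.vstack([A, np.eye(n)])
    c = np.concatenate([b, np.zeros(n)])
    sign = np.concatenate([np.ones(k), -np.ones(n)])   # diagonal of G
    eta = np.zeros(k + n)
    h = np.zeros(k + n)
    lhs = H + rho * F.T @ F                            # fixed matrix of the pi-update (14)
    pi = np.zeros(n)
    for _ in range(max_iter):
        # (14): pi-update, an unconstrained quadratic minimization.
        rhs = -p - rho * F.T @ (sign * eta - c + h)
        pi = np.linalg.solve(lhs, rhs)
        # (15): eta-update; exact componentwise minimization over eta >= 0.
        r = F @ pi - c + h
        eta_new = np.concatenate([np.maximum(0.0, -r[:k]), np.maximum(0.0, r[k:])])
        # (16): scaled dual update.
        h = h + F @ pi + sign * eta_new - c
        prim = np.linalg.norm(F @ pi + sign * eta_new - c)      # primal residual
        dual = rho * np.linalg.norm(F.T @ (sign * (eta_new - eta)))  # dual residual
        eta = eta_new
        if prim < tol and dual < tol:
            break
    return pi

# Tiny usage example: minimize 0.5*||pi||^2 - pi_1  s.t.  pi_1 + pi_2 <= 1, pi >= 0.
H = np.eye(2); p = np.array([-1.0, 0.0])
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
print(admm_box_qp(H, p, A, b))   # converges to approximately [1, 0]
```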


5. Theoretical analysis

This section theoretically analyzes the IPSVM-MV model from two perspectives, i.e., the consensus principle and the generalization error bound, by using Rademacher complexity. We conclude that IPSVM-MV fulfills the consensus principle and has good generalization capability with a tight error bound.

5.1. Consensus analysis

In the IPSVM-MV model, all the classifiers are trained simultaneously by requiring that any two classifiers always retain a maximum consensus on their predictions. By enforcing the classifiers trained from distinct views to agree on the training data, the classifier learned from each view can reinforce the others to guarantee the consensus principle. Here, we analyze the consensus principle by estimating the degree of consistency between any two predictors.

On a sample set S = {(x_i^(1), ..., x_i^(m), y_i)}_{i=1}^l generated independently according to a distribution D, Ê = E_S and E = E_D represent the empirical expectation over S and the true expectation over D respectively. First, define the gap between the final predictors of any two views (the pth view and the qth view) as f_{p,q} = |w_p^{*⊤}φ_p(x^(p)) − w_q^{*⊤}φ_q(x^(q))|². By using the Rademacher complexity (Anguita, Ghio, Oneto, & Ridella, 2014; Bartlett & Mendelson, 2003; Kakade, Sridharan, & Tewari, 2009), the degree of consistency of IPSVM-MV can be measured by estimating the true expectation of f_{p,q}. According to the definition and lemmas in Bartlett and Mendelson (2003), the true expectation of the gap f_{p,q} can be approximately bounded via the empirical expectation by the following theorem.

Theorem 4. Given M ∈ R^+, δ ∈ (0, 1) and the l-size m-view training set S = {(x_i^(1), ..., x_i^(m), y_i)}_{i=1}^l drawn independently according to a probability distribution D, for any two views, i.e., the pth view and the qth view, if the optimal weight vectors w_p^* and w_q^* of IPSVM-MV satisfy ∥w_p^{*⊤}w_p^* + w_q^{*⊤}w_q^*∥_F ⩽ M and the kernel functions κ_p and κ_q are bounded in the feature spaces, then with probability at least 1 − δ over S, the true expected value of f_{p,q} with w_p^* and w_q^* on new data is bounded by

$$
E_D[f_{p,q}] \leqslant \frac{1}{l}\sum_{i=1}^{l}\big(\varepsilon + \eta_i^{(p,q)}\big)^2 + 3RM\sqrt{\frac{\ln(2/\delta)}{2l}} + \frac{4M}{l}\sqrt{\sum_{i=1}^{l}\big(\kappa_p(x_i^{(p)}, x_i^{(p)}) + \kappa_q(x_i^{(q)}, x_i^{(q)})\big)^2},
\tag{22}
$$

where R = max_{(x^(p), x^(q)) ∈ supp(D)} (κ_p(x^(p), x^(p)) + κ_q(x^(q), x^(q))).

Proof. See Appendix C.

To sum up, since the predictive functions of any two distinct views possess this tight consistency bound, the consensus principle in IPSVM-MV is guaranteed. Note that the bound on E_D[f_{p,q}] tends to become smaller as l → ∞.

5.2. Generalization error bound analysis

The predictive function of IPSVM-MV can be built as the average of the predictors from each view collectively, as shown in Eq. (21). We obtain the following theorems to yield the generalization error bound for IPSVM-MV.

Theorem 5. Fix δ ∈ (0, 1), R^(1), ..., R^(m) ∈ R^+ and consider the l-size m-view training set S = {(x_i^(1), ..., x_i^(m), y_i)}_{i=1}^l drawn independently according to a probability distribution D. Define the function classes F = {f | f : (x^(1), ..., x^(m)) → (1/m)Σ_{v=1}^m f_v(x^(v)) = (1/m)Σ_{v=1}^m w_v^⊤φ_v(x^(v)), ∥w_v∥ ⩽ R^(v), v = 1, ..., m} and F̃ = {f̃ | f̃ : (x^(1), ..., x^(m), y) → −y f(x^(1), ..., x^(m)), f(x^(1), ..., x^(m)) ∈ F}. If the optimal weight vectors {w_v^*}_{v=1}^m of IPSVM-MV satisfy ∥w_v^*∥ ⩽ R^(v), v = 1, ..., m, then with probability at least 1 − δ over S, the predictive function of Eq. (21) satisfies f(x^(1), ..., x^(m)) ∈ F and

$$
P_D\big(y f(x^{(1)},\ldots,x^{(m)}) \leqslant 0\big) \leqslant \frac{1}{ml}\sum_{i=1}^{l}\sum_{v=1}^{m}\xi_i^{(v)*} + 2\hat{R}_l(F) + 3\sqrt{\frac{\ln(2/\delta)}{2l}},
\tag{23}
$$

where ξ_i^{(v)*} = min_{k ≠ v, k = 1,...,m} [y_i(w_k^* · φ_k(x_i^(k)))]_+.

Proof. See Appendix D.1.

The following theorem gives the expression of R̂_l(F) in Theorem 5.

Theorem 6. Consider the l-size m-view training set S = {(x_i^(1), ..., x_i^(m), y_i)}_{i=1}^l drawn independently according to a probability distribution D, and define the function class F = {f | f : (x^(1), ..., x^(m)) → (1/m)Σ_{v=1}^m f_v(x^(v)) = (1/m)Σ_{v=1}^m w_v^⊤φ_v(x^(v)), ∥w_v∥ ⩽ R^(v), v = 1, ..., m}. Suppose that the optimal weight vectors {w_v^*}_{v=1}^m of IPSVM-MV satisfy ∥w_v^*∥ ⩽ R^(v), v = 1, ..., m for some prefixed {R^(v)}_{v=1}^m, R^(v) ∈ R^+, and that the kernel functions {κ_v}_{v=1}^m are bounded in the feature spaces; then the empirical Rademacher complexity of F over S satisfies

$$
\hat{R}_l(F) \leqslant \frac{2}{ml}\sum_{v=1}^{m}R^{(v)}\sqrt{\sum_{i=1}^{l}\kappa_v(x_i^{(v)}, x_i^{(v)})}.
\tag{24}
$$

Proof. See Appendix D.2.

By combining Theorems 5 and 6, the generalization error bound of IPSVM-MV can be obtained from the following theorem.

Theorem 7. Fix δ ∈ (0, 1), R^(1), ..., R^(m) ∈ R^+ and let F and F̃ be the classes of functions defined earlier. Consider the l-size m-view training set S = {(x_i^(1), ..., x_i^(m), y_i)}_{i=1}^l generated independently according to a probability distribution D. If the optimal weight vectors {w_v^*}_{v=1}^m of IPSVM-MV satisfy ∥w_v^*∥ ⩽ R^(v), R^(v) ∈ R^+, then with probability at least 1 − δ over S, the prediction function of Eq. (21) satisfies f(x^(1), ..., x^(m)) ∈ F and

$$
P_D\big(y f(x^{(1)},\ldots,x^{(m)}) \leqslant 0\big) \leqslant \frac{1}{ml}\sum_{i=1}^{l}\sum_{v=1}^{m}\xi_i^{(v)*} + 3\sqrt{\frac{\ln(2/\delta)}{2l}} + \frac{4}{ml}\sum_{v=1}^{m}R^{(v)}\sqrt{\sum_{i=1}^{l}\kappa_v(x_i^{(v)}, x_i^{(v)})},
\tag{25}
$$

where ξ_i^{(v)*} = min_{k ≠ v, k = 1,...,m} [y_i(w_k^* · φ_k(x_i^(k)))]_+.
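As a small numerical illustration of the bound (25), the sketch below evaluates its capacity and confidence terms under the assumption of Gaussian RBF kernels, for which κ_v(x, x) = 1 for every sample; the radii R^(v), the number of views and δ are arbitrary placeholders.

```python
import numpy as np

def bound_terms(l, m, R, delta=0.05):
    """Capacity and confidence terms of the generalization bound (25), assuming
    Gaussian RBF kernels so that kappa_v(x_i, x_i) = 1 for every sample and view."""
    capacity = 4.0 / (m * l) * sum(r * np.sqrt(l) for r in R)   # (4/(ml)) sum_v R^(v) sqrt(sum_i 1)
    confidence = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * l))
    return capacity, confidence

# Both terms shrink as the sample size l grows, matching the remark after Theorem 7.
for l in (100, 1000, 10000):
    print(l, bound_terms(l, m=2, R=[1.0, 1.0]))
```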

To conclude, IPSVM-MV has a tight error bound as l → ∞, ensuring its good generalization capability.

6. Experiments

In this section, we validate the performance of IPSVM-MV for binary classification on 50 data sets obtained from Corel, 10 from WebKB and 15 from Digits. To compare our method with state-of-the-art multi-view methods, we focus on IPSVM-MV in the two-view case. The experiments are conducted on a Linux workstation


Table 2
Performance on Corel with nonlinear kernel (Acc. ± Std. (%)). The 50 data sets are Class1 to Class50 (one-versus-rest). Each of the following eight lines reports one method's accuracies on Class1 through Class50 in order; the methods are, in order: SVMA∆+, SVMB∆+, MvTSVMs, SVM-2K, MED-2C, SMVMED, PSVM-2V, IPSVM-MV.

82.96 ± 0.79 71.34 ± 0.59 78.89 ± 0.78 78.03 ± 1.06 83.14 ± 0.71 73.93 ± 0.87 74.31 ± 0.70 81.27 ± 0.78 76.51 ± 0.94 78.64 ± 0.86 69.26 ± 0.97 81.54 ± 0.64 74.31 ± 0.70 83.37 ± 0.56 77.89 ± 1.02 84.03 ± 1.00 91.27 ± 0.86 72.66 ± 0.79 88.10 ± 0.62 76.63 ± 0.65 85.58 ± 0.82 80.28 ± 0.92 77.83 ± 0.79 78.70 ± 1.16 89.10 ± 0.27 71.55 ± 1.75 86.68 ± 0.50 93.96 ± 0.30 86.15 ± 0.90 91.68 ± 0.31 82.84 ± 0.83 92.94 ± 0.31 91.15 ± 1.12 84.72 ± 0.36 85.31 ± 1.41 85.36 ± 0.47 83.71 ± 0.71 69.45 ± 0.99 91.09 ± 0.81 92.63 ± 0.30 74.76 ± 1.34 68.89 ± 0.55 84.07 ± 0.65 79.26 ± 0.71 93.24 ± 0.65 81.34 ± 1.14 86.40 ± 0.63 85.78 ± 0.71 73.46 ± 1.19 83.07 ± 0.77

91.21 ± 0.79 74.09 ± 0.56 74.26 ± 0.75 75.10 ± 1.09 79.67 ± 1.36 76.96 ± 1.49 74.30 ± 1.03 81.67 ± 1.29 78.09 ± 0.70 78.10 ± 0.55 69.01 ± 0.37 73.71 ± 0.47 71.10 ± 0.58 86.57 ± 0.43 75.17 ± 0.65 79.95 ± 0.57 89.43 ± 0.33 65.84 ± 0.62 82.10 ± 0.24 74.20 ± 0.77 87.20 ± 0.30 71.10 ± 1.43 71.10 ± 0.72 70.88 ± 0.76 92.12 ± 0.90 68.96 ± 0.89 84.41 ± 0.68 91.28 ± 0.62 85.15 ± 0.59 89.32 ± 0.93 80.88 ± 0.56 90.16 ± 0.57 86.91 ± 0.51 84.87 ± 0.53 85.02 ± 1.08 84.06 ± 1.09 85.73 ± 0.95 74.65 ± 0.97 89.73 ± 0.58 93.74 ± 0.44 78.04 ± 1.38 70.63 ± 1.30 80.72 ± 0.65 81.08 ± 0.92 86.50 ± 0.27 81.03 ± 0.59 86.58 ± 0.75 88.05 ± 1.01 73.67 ± 0.84 83.01 ± 0.34

90.06 ± 0.96 73.80 ± 1.01 80.41 ± 0.86 80.93 ± 1.27 83.10 ± 0.90 76.57 ± 1.34 74.28 ± 1.68 85.21 ± 0.65 77.43 ± 0.43 81.31 ± 0.34 71.30 ± 1.06 83.41 ± 1.43 78.23 ± 0.78 87.96 ± 1.24 78.86 ± 0.62 85.72 ± 1.11 93.92 ± 0.21 72.28 ± 0.82 90.48 ± 0.94 80.62 ± 0.86 91.51 ± 0.86 79.18 ± 1.32 75.94 ± 1.63 79.47 ± 1.30 92.75 ± 0.53 76.07 ± 0.65 89.32 ± 0.78 96.17 ± 0.41 86.36 ± 0.86 95.99 ± 0.38 84.25 ± 0.78 95.75 ± 0.27 91.81 ± 0.41 87.04 ± 0.70 89.37 ± 0.31 87.13 ± 1.01 89.83 ± 0.80 75.68 ± 0.94 95.96 ± 0.59 94.98 ± 0.49 77.66 ± 1.24 73.35 ± 1.44 85.67 ± 0.00 82.11 ± 0.68 93.29 ± 0.77 85.02 ± 1.11 89.92 ± 1.34 90.05 ± 0.86 76.70 ± 1.06 87.55 ± 1.26

89.95 ± 0.68 74.74 ± 0.70 79.86 ± 0.44 79.93 ± 0.91 83.85 ± 0.73 80.93 ± 0.39 76.65 ± 0.88 84.31 ± 0.22 80.04 ± 0.87 78.84 ± 0.72 71.47 ± 0.28 81.47 ± 0.65 76.51 ± 0.48 90.05 ± 0.72 78.19 ± 0.90 86.87 ± 0.38 90.37 ± 0.54 72.75 ± 1.14 92.44 ± 0.49 77.39 ± 1.27 90.34 ± 0.54 80.20 ± 0.62 77.00 ± 0.46 78.20 ± 1.06 91.36 ± 0.79 71.46 ± 0.81 88.45 ± 0.62 95.38 ± 0.20 89.95 ± 0.54 93.89 ± 0.36 85.54 ± 0.58 94.49 ± 0.99 90.42 ± 0.55 86.74 ± 0.33 88.95 ± 0.66 88.21 ± 0.45 86.62 ± 0.42 74.12 ± 1.06 94.57 ± 0.62 94.09 ± 0.51 76.63 ± 1.12 70.95 ± 0.81 86.14 ± 0.72 82.57 ± 0.53 93.06 ± 0.23 85.49 ± 0.73 89.12 ± 0.66 89.88 ± 0.39 72.86 ± 1.19 83.64 ± 0.82

73.45 ± 0.50 61.03 ± 0.62 71.73 ± 0.77 71.58 ± 0.31 78.81 ± 1.04 63.52 ± 1.08 60.46 ± 1.82 72.37 ± 0.94 66.04 ± 1.01 69.52 ± 0.86 67.94 ± 1.42 75.06 ± 1.44 69.22 ± 1.01 80.14 ± 0.84 71.38 ± 1.04 78.83 ± 1.04 72.06 ± 0.81 66.75 ± 1.56 82.80 ± 0.72 65.56 ± 0.49 81.98 ± 0.96 66.82 ± 1.09 67.34 ± 0.88 67.33 ± 1.17 88.43 ± 0.77 61.50 ± 1.56 83.02 ± 0.73 83.93 ± 1.14 82.27 ± 1.29 90.17 ± 0.59 70.46 ± 1.01 84.79 ± 0.80 86.08 ± 0.56 76.02 ± 0.72 80.68 ± 0.57 81.98 ± 1.51 79.92 ± 0.72 62.93 ± 1.74 90.96 ± 0.31 88.63 ± 0.65 70.06 ± 1.14 58.27 ± 1.96 75.63 ± 0.67 66.18 ± 0.90 88.82 ± 0.49 72.34 ± 0.99 77.95 ± 1.04 79.94 ± 1.26 59.59 ± 1.08 65.30 ± 1.07

87.58 ± 0.86 71.12 ± 0.78 78.11 ± 0.78 76.53 ± 1.20 81.49 ± 0.86 71.01 ± 0.68 69.88 ± 1.48 82.10 ± 0.57 74.35 ± 0.81 75.33 ± 0.38 66.42 ± 1.42 79.38 ± 0.41 69.75 ± 0.73 84.68 ± 1.29 75.27 ± 1.14 83.19 ± 0.86 89.24 ± 0.34 67.69 ± 1.51 84.31 ± 0.31 76.39 ± 1.09 84.39 ± 0.75 77.70 ± 1.07 70.08 ± 1.01 72.41 ± 0.56 89.87 ± 0.67 72.00 ± 1.01 80.25 ± 0.67 94.27 ± 1.05 86.12 ± 0.77 92.92 ± 0.67 78.74 ± 0.43 92.08 ± 0.70 89.38 ± 1.01 81.95 ± 0.86 81.93 ± 1.01 80.84 ± 1.31 86.39 ± 0.17 74.28 ± 0.81 95.74 ± 0.70 92.17 ± 0.65 72.84 ± 0.90 63.16 ± 0.81 84.36 ± 0.68 77.89 ± 1.47 92.14 ± 0.49 82.02 ± 0.68 86.91 ± 1.11 89.36 ± 1.08 67.95 ± 0.72 82.28 ± 1.84

89.52 ± 0.43 79.33 ± 0.68 80.92 ± 0.67 81.89 ± 0.74 85.21 ± 0.32 77.06 ± 0.87 77.44 ± 0.88 84.56 ± 0.39 79.95 ± 0.64 81.64 ± 0.37 75.70 ± 0.90 82.46 ± 0.50 77.87 ± 0.72 91.23 ± 0.72 80.60 ± 0.87 87.54 ± 0.29 94.16 ± 0.63 73.22 ± 0.83 93.06 ± 0.53 80.49 ± 0.61 91.17 ± 0.59 81.20 ± 0.37 79.80 ± 0.74 79.64 ± 0.86 93.93 ± 0.47 75.61 ± 0.57 88.92 ± 0.64 97.59 ± 0.20 90.07 ± 0.25 94.81 ± 0.39 86.15 ± 0.59 96.70 ± 0.35 93.08 ± 0.75 88.61 ± 0.29 89.21 ± 0.74 88.40 ± 0.44 89.14 ± 0.73 79.37 ± 0.68 96.46 ± 0.40 95.07 ± 0.26 80.98 ± 0.95 72.70 ± 0.92 88.07 ± 0.46 83.38 ± 0.76 94.92 ± 0.41 86.65 ± 0.35 91.35 ± 0.39 91.60 ± 0.33 76.70 ± 0.59 86.02 ± 0.34

91.08 ± 0.40 79.24 ± 0.46 82.42 ± 0.35 82.14 ± 0.67 86.61 ± 0.40 81.37 ± 0.80 77.36 ± 0.49 90.84 ± 0.56 79.56 ± 0.96 83.45 ± 0.61 77.06 ± 0.78 82.94 ± 0.50 77.64 ± 0.58 90.47 ± 0.47 81.60 ± 0.87 87.62 ± 0.30 94.07 ± 0.54 74.81 ± 0.61 93.47 ± 0.53 82.06 ± 0.17 92.66 ± 0.26 81.59 ± 1.00 80.93 ± 0.91 80.79 ± 0.86 95.00 ± 0.53 76.85 ± 0.71 90.95 ± 0.83 97.81 ± 0.26 90.73 ± 0.38 94.81 ± 0.39 86.95 ± 0.56 97.58 ± 0.33 93.16 ± 0.75 90.80 ± 0.22 88.58 ± 0.26 88.60 ± 0.51 91.31 ± 0.79 80.20 ± 0.70 96.56 ± 0.19 96.55 ± 0.35 83.62 ± 0.37 75.51 ± 0.68 88.06 ± 0.31 84.08 ± 0.93 95.01 ± 0.38 86.46 ± 0.35 90.65 ± 0.29 91.62 ± 0.33 80.02 ± 0.39 87.05 ± 0.47

Summary over the 50 Corel data sets (SVMA∆+ / SVMB∆+ / MvTSVMs / SVM-2K / MED-2C / SMVMED / PSVM-2V / IPSVM-MV):
Avg. Acc.: 81.78 / 80.54 / 84.44 / 83.94 / 74.15 / 80.32 / 85.62 / 86.61
Avg. Rank: 5.36 / 6.00 / 3.20 / 3.88 / 7.82 / 6.22 / 2.15 / 1.37
Win/draw/loss of IPSVM-MV against each method: 50/0/0 / 49/0/1 / 45/0/5 / 48/0/2 / 50/0/0 / 50/0/0 / 40/0/10 / 0/50/0

with an Intel(R) Xeon(R) CPU and 64 GB RAM. We perform 5-fold cross validation to obtain the best parameters for each method on each data set. All the experiments are repeated 10 times.

6.1. Experimental setup

6.1.1. Data sets
(a) Corel: The Corel data set (available at https://github.com/meteozay/Corel Dataset.git) consists of 599 classes with 8 pre-extracted feature representations for each image. Each class contains 97–100 images representing a semantic topic, such as Elephant, Roses, Horses, etc. To be precise, the 238th class contains 97 samples, the 342nd and 376th classes contain 99 samples, and the rest of the classes contain 100 samples. In the experiments, the first fifty classes are selected to construct 50 binary data sets by the

one-versus-rest strategy as shown in Table 2 (the class numbers in the table pertain to the specific topics representing the desired class in the experiments). Each data set includes 200 images in total: 100 images from the desired class comprise the positive class, and 100 images randomly drawn from the remaining classes comprise the negative class. The 32-dimensional and 64-dimensional features extracted from the Color Structure descriptor and the Scalable Color descriptor are selected to form view A and view B respectively (Eidenberger, 2004).
(b) WebKB: The WebKB data set (available at http://www.cs.umd.edu/projects/linqs/projects/lbc/) consists of 1051 web pages collected from computer science department web sites at four universities: Cornell University, University of Washington, University of Wisconsin, and University of Texas. Each university's web data include five classes, i.e., Student, Project, Course, Staff and Faculty. We perform our experiments on the web pages from the Cornell


Table 3
Performance on WebKB with nonlinear kernel (Acc. ± Std. (%)). The 10 data sets are: Student vs. Project, Student vs. Course, Student vs. Staff, Student vs. Faculty, Project vs. Course, Project vs. Staff, Project vs. Faculty, Course vs. Staff, Course vs. Faculty, Staff vs. Faculty. Each of the following eight lines reports one method's accuracies on these 10 data sets in order; the methods are, in order: SVMA∆+, SVMB∆+, MvTSVMs, SVM-2K, MED-2C, SMVMED, PSVM-2V, IPSVM-MV.

92.90 ± 0.95 99.40 ± 0.40 90.50 ± 0.49 93.32 ± 0.47 100.00 ± 0.00 92.86 ± 1.20 93.73 ± 0.91 99.23 ± 0.89 99.06 ± 0.62 84.95 ± 0.73

85.65 ± 0.45 81.40 ± 0.77 84.50 ± 0.09 73.97 ± 0.08 75.10 ± 0.68 80.00 ± 0.99 71.27 ± 0.21 79.17 ± 0.07 73.49 ± 0.22 70.91 ± 0.54

89.18 ± 0.43 96.80 ± 0.33 84.45 ± 0.84 89.56 ± 0.50 95.26 ± 0.70 82.57 ± 1.23 82.91 ± 1.74 95.26 ± 0.70 97.46 ± 0.35 80.73 ± 1.29

88.76 ± 1.21 97.20 ± 0.80 86.35 ± 0.74 83.23 ± 0.79 94.65 ± 0.88 84.91 ± 1.30 86.23 ± 1.49 96.35 ± 1.50 96.21 ± 0.62 77.91 ± 1.09

86.18 ± 0.44 93.60 ± 0.54 85.36 ± 0.29 85.22 ± 0.39 88.46 ± 0.96 65.14 ± 2.02 84.91 ± 1.89 96.92 ± 0.69 94.60 ± 0.59 71.09 ± 1.66

83.27 ± 1.36 82.40 ± 0.88 81.36 ± 1.29 72.17 ± 0.95 75.51 ± 1.28 79.71 ± 1.30 70.36 ± 0.77 80.64 ± 1.27 73.02 ± 0.59 67.27 ± 1.39

90.50 ± 0.86 98.40 ± 0.46 87.56 ± 0.12 84.78 ± 0.65 99.23 ± 0.96 89.91 ± 0.90 93.14 ± 0.15 98.85 ± 0.79 99.04 ± 0.64 80.32 ± 0.57

92.94 ± 0.93 100.00 ± 0.00 94.85 ± 0.92 94.83 ± 0.05 100.00 ± 0.00 92.32 ± 0.21 96.18 ± 0.75 100.00 ± 0.00 100.00 ± 0.00 87.18 ± 0.95

Summary over the 10 WebKB data sets (SVMA∆+ / SVMB∆+ / MvTSVMs / SVM-2K / MED-2C / SMVMED / PSVM-2V / IPSVM-MV):
Avg. Acc.: 94.60 / 77.55 / 89.42 / 89.18 / 85.15 / 76.57 / 92.17 / 95.83
Avg. Rank: 1.85 / 7.10 / 4.70 / 4.70 / 5.60 / 7.60 / 3.30 / 1.15
Win/draw/loss of IPSVM-MV against each method: 8/1/1 / 10/0/0 / 10/0/0 / 10/0/0 / 10/0/0 / 10/0/0 / 10/0/0 / 0/10/0

Table 4
Performance on Digits with nonlinear kernel (Acc. ± Std. (%)). The 15 data sets are the view pairs: fac vs. fou, fac vs. kar, fac vs. mor, fac vs. pix, fac vs. zer, fou vs. kar, fou vs. mor, fou vs. pix, fou vs. zer, kar vs. mor, kar vs. pix, kar vs. zer, mor vs. pix, mor vs. zer, pix vs. zer. Each of the following eight lines reports one method's accuracies on these 15 data sets in order; the methods are, in order: SVMA∆+, SVMB∆+, MvTSVMs, SVM-2K, MED-2C, SMVMED, PSVM-2V, IPSVM-MV.

91.83 ± 0.75 91.66 ± 0.32 91.43 ± 0.62 91.83 ± 0.56 92.35 ± 0.79 93.95 ± 0.26 93.82 ± 0.64 93.92 ± 0.97 94.24 ± 0.93 94.61 ± 0.88 93.91 ± 0.31 94.29 ± 0.53 86.01 ± 0.54 86.05 ± 0.72 94.99 ± 0.69

94.17 ± 0.65 94.32 ± 0.62 86.08 ± 0.76 94.61 ± 0.69 94.03 ± 0.71 94.29 ± 0.39 85.96 ± 0.76 94.29 ± 0.66 93.61 ± 0.33 85.95 ± 0.29 94.19 ± 0.43 93.91 ± 0.64 93.99 ± 0.29 93.52 ± 0.52 93.49 ± 0.57

95.29 ± 0.61 92.91 ± 0.64 90.94 ± 0.44 93.89 ± 0.69 94.97 ± 0.78 97.23 ± 0.57 90.89 ± 0.43 94.31 ± 0.61 96.15 ± 0.58 93.97 ± 0.25 94.73 ± 0.36 95.11 ± 0.51 92.55 ± 0.52 93.06 ± 0.77 95.08 ± 0.78

94.76 ± 0.71 94.21 ± 0.65 87.02 ± 0.78 92.91 ± 0.49 93.94 ± 0.65 97.08 ± 0.25 89.64 ± 0.88 95.66 ± 0.72 97.04 ± 0.47 92.02 ± 0.59 94.79 ± 0.48 94.96 ± 0.46 93.03 ± 0.29 90.12 ± 0.29 94.99 ± 0.33

80.25 ± 0.33 73.71 ± 0.48 59.28 ± 0.65 70.48 ± 0.83 74.01 ± 0.57 82.12 ± 0.37 63.61 ± 0.28 75.62 ± 0.48 77.85 ± 0.89 62.62 ± 0.95 71.27 ± 0.35 77.01 ± 1.05 71.05 ± 0.68 59.52 ± 0.33 64.98 ± 0.73

91.03 ± 0.35 88.92 ± 0.42 87.07 ± 0.74 87.07 ± 0.54 91.71 ± 0.74 95.59 ± 0.47 89.51 ± 0.62 90.14 ± 0.74 94.91 ± 0.52 91.51 ± 0.53 94.05 ± 0.29 95.17 ± 0.48 78.55 ± 0.67 89.13 ± 0.11 88.52 ± 0.42

96.99 ± 0.15 95.66 ± 0.63 88.76 ± 0.57 96.32 ± 0.37 95.73 ± 0.35 98.21 ± 0.17 93.06 ± 0.13 96.98 ± 0.29 96.19 ± 0.52 92.78 ± 0.36 96.81 ± 0.38 96.11 ± 0.18 93.52 ± 0.32 94.07 ± 0.32 96.02 ± 0.38

97.02 ± 0.85 95.74 ± 0.33 94.07 ± 0.64 95.56 ± 0.44 95.76 ± 0.28 98.24 ± 0.44 95.26 ± 0.48 97.46 ± 0.46 96.55 ± 0.62 96.83 ± 0.75 96.83 ± 0.55 96.63 ± 0.38 95.02 ± 0.73 95.13 ± 0.31 96.18 ± 0.89

Summary over the 15 Digits data sets (SVMA∆+ / SVMB∆+ / MvTSVMs / SVM-2K / MED-2C / SMVMED / PSVM-2V / IPSVM-MV):
Avg. Acc.: 92.33 / 92.43 / 94.07 / 93.48 / 70.89 / 90.19 / 95.15 / 96.15
Avg. Rank: 5.30 / 5.13 / 3.73 / 4.23 / 8.00 / 6.07 / 2.33 / 1.20
Win/draw/loss of IPSVM-MV against each method: 15/0/0 / 15/0/0 / 15/0/0 / 14/0/1 / 15/0/0 / 15/0/0 / 14/0/1 / 0/15/0

University with 195 documents over the 5 classes. The two views are the words occurring in a web page and the citation links between pages; the corresponding dimensions are 1703 and 195 respectively. By the one-versus-one strategy, we train a total of 10 binary classifiers as shown in Table 3.
(c) Digits: The Digits data set (available at http://archive.ics.uci.edu/ml/machine-learning-databases/mfeat/) consists of handwritten numerals (0–9) extracted from a collection of Dutch utility maps. Each numeral contains 200 examples that have been digitized in binary images. These digits are represented by the following six views: (1) mfeat-fou: 76 Fourier coefficients of the character shapes; (2) mfeat-fac: 216 profile correlations; (3) mfeat-kar: 64 Karhunen–Loève coefficients; (4) mfeat-pix: 240 pixel averages in 2 by 3 windows; (5) mfeat-zer: 47 Zernike moments; and (6) mfeat-mor: six morphological features. Similar to Peng, Aved, Seetharaman, and Palaniappan (2018), numerals from 0 to 4 are in class 1 and numerals from 5 to 9 are in class −1 here. We randomly selected 100 examples from each class (class 1 or class −1) for a total of 200 examples in our experiment. Based on the chosen 200 examples, with each sample represented by six views, we select two views at a time to construct our experimental data sets by the one-versus-one strategy (Chen, Yin, Jiang, & Wang, 2018; Sun, Xie, & Yang, 2016). Thus, we train a total of 15 binary classifiers as shown in Table 4.
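A sketch of how one of the one-versus-rest Corel data sets described above could be assembled from two feature descriptors is given below; the array names, the sampling routine and the random seed are illustrative assumptions, not the authors' preprocessing code.

```python
import numpy as np

def build_one_vs_rest(view_a, view_b, labels, target_class, n_per_class=100, seed=0):
    """Build a balanced two-view binary data set: 100 images of the target class
    (label +1) and 100 images drawn at random from the other classes (label -1).

    view_a : (N, 32) Color Structure features;  view_b : (N, 64) Scalable Color features.
    """
    rng = np.random.default_rng(seed)
    pos = np.where(labels == target_class)[0][:n_per_class]
    neg = rng.choice(np.where(labels != target_class)[0], size=n_per_class, replace=False)
    idx = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return view_a[idx], view_b[idx], y
```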









6.1.2. Benchmark methods
We compare IPSVM-MV with six methods:

• SVM∆+. The SVM∆+ method replaces the slack variables of the standard SVM with the nonnegative correcting functions determined by the privileged information. We choose one view as the privileged information and the other as the main information for training, denoted as SVMA∆+ (view B as privileged information) and SVMB∆+ (view A as privileged information) respectively.
• MvTSVMs. The multi-view twin support vector machines (MvTSVMs) combine two views by introducing a constraint of similarity between the two one-dimensional projections identified by two distinct twin support vector machines (TSVMs) from the two feature spaces.
• SVM-2K. The SVM-2K method combines two standard SVMs and the distance minimization version of Kernel Canonical Correlation Analysis (KCCA) (Fukumizu, Bach, & Gretton, 2007; Hsieh, 2000; Vía, Santamaría, & Pérez, 2007) for two-view learning.
• MED-2C. The consensus and complementarity based MED (MED-2C) method integrates the two principles into the maximum entropy discrimination (MED) framework for multi-view classification.
• SMVMED. The soft margin consistency based multi-view MED (SMVMED) method minimizes the relative entropy between the posteriors of the two view margins to achieve margin consistency in a less strict way.
• PSVM-2V. The PSVM-2V model is built under the framework of privileged SVM, satisfying the consensus and complementarity principles for multi-view learning.

6.1.3. Measures
To measure the performance of the different methods, we use the average accuracy (Acc.) and the standard deviation (Std.) over 10


Fig. 5. Plots represent the differences between the classification accuracy of the winning method IPSVM-MV (red) against the average accuracy over the rest methods as follows: PSVM-2V (dark green), SVM-2K (magenta), MvTSVMs (green), MED-2C (yellow), SMVMED (cyan) and SVM∆+ (blue). The length of each bar corresponds to the relative improvement of winning method over the remaining methods. Different colors correspond to different winning methods. The top (middle, bottom) row plots the overall performance on the Corel (WebKB, Digits) data sets. Left (right) plots represent the performance using linear (nonlinear) kernel. Note that the maximum accuracy of SVMA∆+ and SVMB∆+ is chosen to represent the accuracy of SVM∆+ . (a) Corel data sets in linear case. (b) Corel data sets in nonlinear case. (c) WebKB data sets in linear case. (d) WebKB data sets in nonlinear case. (e) Digits data sets in linear case. (f) Digits data sets in nonlinear case. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 6. Accuracies under different parameter settings on the Class1 data set (Corel). The top left figure shows the performance with linear kernel and the remaining figures show the performance with nonlinear kernel. Note that the values of parameters c, σ and γ are successively fixed at the optimum in Fig. 6(b), 6(c) and 6(d). (a) Linear case. (b) Nonlinear case with c = 10. (c) Nonlinear case with σ = 10. (d) Nonlinear case with γ = 100.

repetitions. In addition, the parameter sensitivity is measured by the variation of accuracy under different parameter settings. The convergence of the ADMM algorithm for IPSVM-MV is measured by the values of the objective function, the primal residual and the dual residual as the number of iterations varies.

6.1.4. Kernels
For all the algorithms, we use both the linear kernel κ(x_i, x_j) = x_i^⊤x_j and the Gaussian RBF kernel κ(x_i, x_j) = exp(−∥x_i − x_j∥²/(2σ²)) on all the data sets.
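For reference, a minimal NumPy implementation of the two kernels used in the experiments is given below; the function and variable names are our own, and the view matrices are random placeholders.

```python
import numpy as np

def linear_kernel(X, Z):
    """k(x_i, z_j) = x_i^T z_j for all pairs of rows."""
    return X @ Z.T

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian RBF kernel k(x_i, z_j) = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

# Example: the two view-specific Gram matrices kappa_A and kappa_B share one sigma,
# as in the experimental setup.
rng = np.random.default_rng(0)
XA, XB = rng.normal(size=(5, 32)), rng.normal(size=(5, 64))
KA, KB = rbf_kernel(XA, XA, sigma=10.0), rbf_kernel(XB, XB, sigma=10.0)
print(KA.shape, KB.shape)
```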

6.1.5. Parameters
To obtain the best parameters for all the methods, 5-fold cross validation is implemented. For the linear and nonlinear cases, the parameter C in SVMA∆+, SVMB∆+, MED-2C and SMVMED is tuned over the set {10^−3, 10^−2, ..., 10^2, 10^3}. The parameter η in MED-2C and α in SMVMED are set to 1 and 0.5, following Chao and Sun (2016) and Mao and Sun (2016) respectively. For MvTSVMs, we set C_1 = C_2 = C_3 = C_4 and D = H and select them from the set {10^−3, 10^−2, ..., 10^2, 10^3}. In SVM-2K, PSVM-2V and IPSVM-MV, the parameters C^A, C^B and C are set equal and selected over {10^−3, 10^−2, ..., 10^2, 10^3}, and ε is fixed at 0.001, since it has been demonstrated empirically that the performance is usually better with a smaller ε. In addition, the trade-off parameter γ for SVMA∆+, SVMB∆+, PSVM-2V and IPSVM-MV is tuned in the range {10^−3, 10^−2, ..., 10^2, 10^3}. In order to reduce the search space of parameters, we fix the parameters ∆^A and ∆^B to 1. For the nonlinear case, the kernel parameter σ in the Gaussian

RBF kernel function is selected from {10^−3, 10^−2, ..., 10^2, 10^3}. For simplicity, the kernel parameters for the two views in the two-view learning methods are set to be the same in the experiments, and SVM∆+ uses the same kernel parameter for both the decision and correcting spaces. Note that for all methods except SVM∆+, besides the prediction functions sign(f_A(x^A)) and sign(f_B(x^B)) from the separate views, we also consider the hybrid prediction function sign(0.5(f_A(x^A) + f_B(x^B))), and the one with the highest accuracy is selected.

6.2. Experimental results

In this section, we compare the performance of IPSVM-MV and all the benchmark methods in the linear and nonlinear cases. On the data sets from Corel, WebKB and Digits, the classification results are reported in Appendix E for the linear kernel and in Tables 2–4 for the nonlinear kernel. The best results are highlighted in boldface. The last three rows of each table show the average accuracy, the average rank of each method and the number of wins, draws and losses between IPSVM-MV and each method. To display all the experimental results intuitively, Fig. 5 depicts the differences between the accuracy of the winning method and the average accuracy over the remaining methods.

From the experimental results, we can draw the following conclusions: (a) From Tables 2–4, IPSVM-MV obtains the best overall performance, with the highest average accuracy, the smallest average rank and the most wins. Although the performance of


Fig. 7. Accuracies under different parameter settings on the Student vs. Project data set (WebKB). The top left figure shows the performance with the linear kernel and the remaining figures show the performance with the nonlinear kernel. Note that the values of the parameters c, σ and γ are successively fixed at their optimal values in Fig. 7(b), 7(c) and 7(d). (a) Linear case. (b) Nonlinear case with c = 10. (c) Nonlinear case with σ = 1000. (d) Nonlinear case with γ = 0.001.

Although the performance of IPSVM-MV is not top-ranked in some cases, its results remain close to the best. (b) As can be seen from Fig. 5, IPSVM-MV outperforms the other methods in many cases with both linear and nonlinear kernels, i.e., 23 out of 50 for Corel (46%), 9 out of 10 for WebKB (90%) and 12 out of 15 for Digits (80%) with the linear kernel, and 36 out of 50 for Corel (72%), 9 out of 10 for WebKB (90%) and 13 out of 15 for Digits (86.7%) with the nonlinear kernel. (c) IPSVM-MV performs better than the direct application of SVM∆+ to multi-view learning, which may be attributed to the neglect of the consensus principle in SVM∆+. (d) For most data sets, IPSVM-MV achieves higher accuracies than SVM-2K and PSVM-2V, which reinforces the fact that IPSVM-MV can fully exploit the complementary information between the two views while retaining maximum consensus between the two views' predictors, whereas SVM-2K follows only the consensus principle and PSVM-2V cannot make full use of the complementary information in the two views. To sum up, IPSVM-MV outperforms the benchmark methods in most cases, for both the linear and nonlinear kernels. MED-2C and SMVMED perform well with the linear kernel but badly with the nonlinear kernel. SVM∆+ and MvTSVMs yield the best results on only a few data sets.

6.3. Parameter sensitivity

This section studies the influence of the parameters on IPSVM-MV in the linear and nonlinear cases. Due to limited space, only the Class1 data set (Corel) and the Student vs. Project data set (WebKB) are

selected for illustration, but a similar phenomenon can be observed on the other data sets. Specifically, the linear IPSVM-MV model involves two parameters, i.e., the penalty parameter c (CA = CB = C = c) and the two-view trade-off parameter γ, while the nonlinear IPSVM-MV model involves three parameters, i.e., the Gaussian RBF kernel parameter σ as well as c and γ. By varying c and γ in the linear case, and alternately varying two of the three parameters c, γ and σ in the nonlinear case, the variation of accuracy under different parameter settings is depicted in Fig. 6 for the Class1 data set and in Fig. 7 for the Student vs. Project data set. Note that the parameters vary over the same ranges as given in Section 6.1. From the figures with the linear kernel, we observe that the accuracy is not sensitive to the parameter γ when c ⩾ 0.1 on the Class1 data set and when c ⩽ 0.1 on the Student vs. Project data set. Moreover, the accuracies change very little for Class1 under different c with the remaining parameters fixed. In contrast, the Student vs. Project data set exhibits relatively unstable accuracies as c varies, which could be caused by the unstable classifiers learned from its small training set. Thus, we conclude that the linear IPSVM-MV model is only partly sensitive to the parameters c and γ. From the figures with the nonlinear kernel, we can draw the following conclusions: (a) when c or γ is fixed, the accuracy is quite sensitive to the kernel parameter σ; (b) in most cases, the accuracies do not vary much under different c or γ with the remaining parameters fixed, which implies that our framework is not very sensitive to c and γ.
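The sensitivity surfaces in Figs. 6(a) and 7(a) correspond to sweeping the two parameters over the same logarithmic grid and recording the test accuracy at each grid point. A minimal skeleton of that sweep is sketched below; the accuracy routine is a hypothetical placeholder for training and testing IPSVM-MV.

```python
import numpy as np

GRID = [10.0 ** p for p in range(-3, 4)]

def accuracy(c, gamma):
    """Hypothetical placeholder: return the test accuracy of IPSVM-MV
    trained with penalty c and trade-off gamma (substitute the real
    training/testing routine here)."""
    return 0.0

# Accuracy surface over (c, gamma), as visualised in Figs. 6(a) and 7(a).
surface = np.array([[accuracy(c, g) for g in GRID] for c in GRID])
```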


Fig. 8. Convergence of ADMM with linear and nonlinear kernels. The solid blue lines denote the value of the objective function, or the norm of the primal or dual residual; the dashed red lines denote the corresponding tolerance. The top (bottom) row plots the overall performance on the Class1 data set from Corel (the Student vs. Project data set from WebKB). The left (right) plots show the performance using the linear (nonlinear) kernel. (a) Class1 data set in the linear case. (b) Class1 data set in the nonlinear case. (c) Student vs. Project data set in the linear case. (d) Student vs. Project data set in the nonlinear case.

6.4. Convergence analysis

To examine the convergence of ADMM for IPSVM-MV, the values of the objective function f, the primal residual r and the dual residual s over the iterations k are displayed in Fig. 8 for the linear and nonlinear kernels. All the parameters in problem (7) are set to the best values obtained by cross validation. Due to limited space, we take the first data set from Corel and from WebKB as examples, but a similar phenomenon can be observed on the other data sets. The results in these figures demonstrate that: (a) the objective function no longer changes after a certain number of iterations; (b) the norms of the primal residual ∥r∥2 and the dual residual ∥s∥2 fluctuate only slightly after a certain number of iterations, which indicates that the iterates produced by ADMM converge to good approximate solutions. In brief, ADMM is effective in solving the IPSVM-MV model and converges within a limited number of iterations.
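As a point of reference, the residual-based stopping test monitored in Fig. 8 can be implemented with the standard ADMM criterion (an absolute plus a relative tolerance); the tolerance values in the sketch below are illustrative choices, not necessarily those used in our experiments.

```python
import numpy as np

def admm_converged(r, s, Ax, Bz, c, ATy, eps_abs=1e-4, eps_rel=1e-3):
    """Standard ADMM stopping test: stop when both the primal residual
    r = Ax + Bz - c and the dual residual s fall below their tolerances.
    eps_abs / eps_rel are illustrative values."""
    p, n = r.size, s.size
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(Ax), np.linalg.norm(Bz), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(ATy)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```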

6.5. Non-parametric statistical test

Additionally, we employ a nonparametric statistical analysis, namely the Wilcoxon signed ranks test (Demšar, 2006), to further verify the statistical significance of the differences between IPSVM-MV and the benchmark methods; the results are given in Tables 5 and 6. The Wilcoxon signed ranks test is a pairwise test that aims to detect significant differences between two algorithms. Tables 5 and 6 present the R+, R− and p-values computed for all the pairwise comparisons involving the linear and nonlinear IPSVM-MV. As the tables show, IPSVM-MV with either the linear or the nonlinear kernel yields a significant improvement over the remaining methods at a significance level of α = 0.05.
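For illustration, such pairwise p-values can be computed from the paired per-data-set accuracies of two methods, e.g., with SciPy; the accuracy vectors below are hypothetical placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-data-set accuracies of two methods (paired samples).
acc_ipsvm_mv = np.array([0.91, 0.88, 0.93, 0.90, 0.87])
acc_baseline = np.array([0.89, 0.86, 0.92, 0.88, 0.85])

# Two-sided Wilcoxon signed ranks test on the paired differences.
stat, p_value = wilcoxon(acc_ipsvm_mv, acc_baseline)
print(f"W = {stat:.1f}, p = {p_value:.4g}")  # reject H0 if p < 0.05
```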

Table 5
Wilcoxon signed ranks test results with linear kernel.

Comparison                    R+      R−      p-value
IPSVM-MV vs. SVM∆+            2849    1       2.81E−14
IPSVM-MV vs. MvTSVMs          2822    28      8.26E−14
IPSVM-MV vs. SVM-2K           2609    241     2.06E−10
IPSVM-MV vs. MED-2C           2757    93      1.02E−12
IPSVM-MV vs. SMVMED           2677    173     1.94E−11
IPSVM-MV vs. PSVM-2V          2350    425     1.09E−07

Table 6
Wilcoxon signed ranks test results with nonlinear kernel.

Comparison                    R+      R−      p-value
IPSVM-MV vs. SVM∆+            2770    5       4.84E−14
IPSVM-MV vs. MvTSVMs          2813    37      1.18E−13
IPSVM-MV vs. SVM-2K           2838    12      4.37E−14
IPSVM-MV vs. MED-2C           2850    0       2.69E−14
IPSVM-MV vs. SMVMED           2850    0       2.69E−14
IPSVM-MV vs. PSVM-2V          2372    184     1.86E−10

7. Conclusions

In this paper, we have proposed an improved privileged SVM-based method named IPSVM-MV for multi-view learning. It is able to take full advantage of the complementary information in multiple views to improve performance; it is also a general model designed for the multi-view case rather than only the two-view case. Through comparisons, IPSVM-MV may be viewed as an improved version of both SVM-2K and PSVM-2V. We employ the ADMM algorithm to solve IPSVM-MV efficiently. The performance of IPSVM-MV is guaranteed by the theoretical analyses, i.e., the consensus analysis and the generalization error bound analysis. We have conducted experiments on 75 binary multi-view data sets to validate the effectiveness of the proposed model. In our future work, we will extend IPSVM-MV to semi-supervised learning, multi-label learning, etc. Alternative ways to enforce or balance the consensus between any two views are also interesting and under consideration.

Acknowledgments

This work has been partially supported by grants from the National Natural Science Foundation of China (Nos. 71731009, 61472390, 71331005, 91546201, 71725001 and 71471149), the Beijing Natural Science Foundation (No. 1162005), the Natural Science Foundation Project of CQ CSTC (cstc2014jcyjA40011), and the Major Project of the National Social Science Foundation of China (No. 15ZDB153).

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.neunet.2018.06.017.

References

Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2014). A deep connection between the Vapnik–Chervonenkis entropy and the Rademacher complexity. IEEE Transactions on Neural Networks and Learning Systems, 25(12), 2202–2211.
Bach, F. R., Lanckriet, G. R., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the international conference on machine learning (pp. 6–13). ACM.
Balcan, M., Blum, A., & Yang, K. (2004). Co-training and expansion: Towards bridging theory and practice. In Proceedings of the annual conference on neural information processing systems (pp. 29–58).
Bartlett, P., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research (JMLR), 3, 463–482.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the annual conference on computational learning theory (pp. 92–100).
Chao, G., & Sun, S. (2016). Consensus and complementarity based maximum entropy discrimination for multi-view classification. Information Sciences, 367–368, 296–310.
Chen, X., Yin, H., Jiang, F., & Wang, L. (2018). Multi-view dimensionality reduction based on Universum learning. Neurocomputing, 275, 2279–2286.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (JMLR), 7, 1–30.
Deng, N., Tian, Y., & Zhang, C. (2012). Support vector machines: Optimization based theory, algorithms, and extensions. CRC Press.
Dhillon, P., Foster, D., & Ungar, L. (2011). Multi-view learning of word embeddings via CCA. In Proceedings of the annual conference on neural information processing systems (pp. 199–207).
Eidenberger, H. (2004). Statistical analysis of content-based MPEG-7 descriptors for image retrieval. Multimedia Systems, 10(2), 84–97.
Farquhar, J., Hardoon, D., Meng, H., Shawe-Taylor, J., & Szedmak, S. (2005). Two view learning: SVM-2K, theory and practice. In Proceedings of the annual conference on neural information processing systems (pp. 355–362).
Fukumizu, K., Bach, F., & Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research (JMLR), 8, 361–383.
Hsieh, W. W. (2000). Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13(10), 1095–1105.
Huang, C., Chung, F.-l., & Wang, S. (2016). Multi-view L2-SVM and its multi-view core vector machine. Neural Networks, 75, 110–125.
Kakade, S. M., Sridharan, K., & Tewari, A. (2009). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the annual conference on neural information processing systems (pp. 793–800).
Kumar, A., & Daumé, H. (2011). A co-training approach for multi-view spectral clustering. In Proceedings of the international conference on machine learning (pp. 393–400).
Lapin, M., Hein, M., & Schiele, B. (2014). Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53, 95–108.
Li, J., Nigel, A., Tao, D., & Li, X. (2006). Multitraining support vector machine for image retrieval. IEEE Transactions on Image Processing, 15(11), 3597–3601.
Li, W., Dai, D., Tan, M., Xu, D., & Van Gool, L. (2016). Fast algorithms for linear and kernel SVM+. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2258–2266).
Liu, A. A., Xu, N., Nie, W. Z., Su, Y. T., Wong, Y., & Kankanhalli, M. (2017). Benchmarking a multimodal and multiview and interactive dataset for human action recognition. IEEE Transactions on Cybernetics, 47(7), 1781–1794. http://dx.doi.org/10.1109/TCYB.2016.2582918.
Mao, L., & Sun, S. (2016). Soft margin consistency based scalable multi-view maximum entropy discrimination. In Proceedings of the international joint conference on artificial intelligence (pp. 1839–1845).
Ménard, O., & Frezza-Buet, H. (2005). Model of multi-modal cortical processing: Coherent learning in self-organizing modules. Neural Networks, 18(5), 646–655.
Nishihara, R., Lessard, L., Recht, B., Packard, A., & Jordan, M. (2015). A general analysis of the convergence of ADMM. In Proceedings of the international conference on machine learning (pp. 343–352).
Pechyony, D., & Vapnik, V. (2010). On the theory of learning with privileged information. In Proceedings of the annual conference on neural information processing systems (pp. 1894–1902).
Peng, J., Aved, A. J., Seetharaman, G., & Palaniappan, K. (2018). Multiview boosting with information propagation for classification. IEEE Transactions on Neural Networks and Learning Systems, 29(3), 657–669.
Rakotomamonjy, A., Bach, F. R., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research (JMLR), 9(3), 2491–2521.
Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research (JMLR), 7, 1531–1565.
Sun, J., & Keates, S. (2013). Canonical correlation analysis on data with censoring and error information. IEEE Transactions on Neural Networks and Learning Systems, 24(12), 1909–1919.
Sun, S. (2013). A survey of multi-view machine learning. Neural Computing and Applications, 23(7–8), 2031–2038.
Eidenberger, H. (2004). Statistical analysis of content-based MPEG-7 descriptors for image retrieva. Multimedia Systems, 10(2), 84–97. Farquhar, J., Hardoon, D., Meng, H., Shawe-taylor, J., & Szedmak, S. (2005). Two view learning: SVM-2K, theory and practice. In Proceedings of the annual conference on neural information processing systems (pp. 355–362). Fukumizu, K., Bach, F., & Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research (JMLR), 8(2007), 361–383. Hsieh, W. W. (2000). Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13(10), 1095–1105. Huang, C., Chung, F.-l., & Wang, S. (2016). Multi-view L2-SVM and its multi-view core vector machine. Neural Networks, 75, 110–125. Kakade, S. M., Sridharan, K., & Tewari, A. (2009). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the annual conference on neural information processing systems (pp. 793–800). Kumar, A., & Daumé, H. (2011). A co-training approach for multi-view spectral clustering. In Proceedings of the international conference on machine learning (pp. 393–400). Lapin, M., Hein, M., & Schiele, B. (2014). Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53, 95–108. Li, J., Nigel, A., Tao, D., & Li, X. (2006). Multitraining support vector machine for image retrieval. IEEE Transactions on Image Processing, 15(11), 3597–3601. Li, W., Dai, D., Tan, M., Xu, D., & Van Gool, L. (2016). Fast algorithms for linear and kernel SVM+. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2258–2266). Liu, A. A., Xu, N., Nie, W. Z., Su, Y. T., Wong, Y., & Kankanhalli, M. (2017). Benchmarking a multimodal and multiview and interactive dataset for human action recognition. IEEE Transactions on Cybernetics, 47(7), 1781–1794. http: //dx.doi.org/10.1109/TCYB.2016.2582918. Mao, L., & Sun, S. (2016). Soft margin consistency based scalable multi-view maximum entropy discrimination. In Proceedings of the international joint conference on artificial intelligence (pp. 1839–1845). Ménard, O., & Frezza-Buet, H. (2005). Model of multi-modal cortical processing: Coherent learning in self-organizing modules. Neural Networks, 18(5), 646–655. Nishihara, R., Lessard, L., Recht, B., Packard, A., & Jordan, M. (2015). A general analysis of the convergence of ADMM. In Proceedings of the international conference on machine learning (pp. 343–352). Pechyony, D., & Vapnik, V. (2010). On the theory of learnining with privileged information. In Proceedings of the annual conference on neural information processing systems (pp. 1894–1902). Peng, J., Aved, A. J., Seetharaman, G., & Palaniappan, K. (2018). Multiview boosting with information propagation for classification. IEEE Transactions on Neural Networks and Learning Systems, 29(3), 657–669. Rakotomamonjy, A., Bach, F. R., Canu, S., & Grandvalet, Y. (2008). Simplemkl. Journal of Machine Learning Research (JMLR), 9(3), 2491–2521. Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research (JMLR), 7, 1531–1565. Sun, J., & Keates, S. (2013). Canonical correlation analysis on data with censoring and error information. IEEE Transactions on Neural Networks and Learning Systems, 24(12), 1909–1919. Sun, S. (2013). A survey of multi-view machine learning. Neural Computing and Applications, 23(7–8), 2031–2038. Sun, S., Xie, X., & Yang, M. (2016). 
Multiview uncorrelated discriminant analysis. IEEE Transactions on Cybernetics, 46(12), 3272–3284. Tang, J., & Tian, Y. (2017). A multi-kernel framework with nonparallel support vector machine. Neurocomputing, 266, 226–238. Tang, J., Tian, Y., Zhang, P., & Liu, X. (2017). Multiview privileged support vector machines. IEEE Transactions on Neural Networks and Learning Systems. http: //dx.doi.org/10.1109/TNNLS.2017.2728139. Vapnik, V., & Izmailov, R. (2015). Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research (JMLR), 16, 2023–2049. Vapnik, V., & Vashist, A. (2009). A new learning paradigm: Learning using privileged information. Neural Networks, 22(5), 544–557. Vía, J., Santamaría, I., & Pérez, J. (2007). A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks, 20(1), 139–152. Wang, W., & Zhou, Z. (2010). A new analysis of co-training. In Proceedings of the international conference on machine learning (pp. 1135–1142). Wang, Y., Zhang, W., Wu, L., Lin, X., & Zhao, X. (2017). Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Transactions on Neural Networks and Learning Systems, 28(1), 57–70. Xu, C., Tao, D., & Xu, C. (2013). A survey on multi-view learning. arXiv preprint. arXiv:1304.5634. Zhao, J., Xie, X., Xu, X., & Sun, S. (2017). Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38, 43–54. Zong, L., Zhang, X., Zhao, L., Yu, H., & Zhao, Q. (2017). Multi-view clustering via multi-manifold regularized non-negative matrix factorization. Neural Networks, 88, 74–89.