Signal Processing 165 (2019) 186–196
Adaptive graph weighting for multi-view dimensionality reduction

Xinyi Xu a, Yanhua Yang c, Cheng Deng a,*, Feiping Nie b

a School of Electronic Engineering, Xidian University, Xi'an 710071, China
b OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, China
c School of Computer Science and Technology, Xidian University, Xi'an 710071, China
* Corresponding author. E-mail address: [email protected] (C. Deng).
Article history: Received 8 January 2019; Revised 8 June 2019; Accepted 16 June 2019; Available online 24 June 2019.
Keywords: Multi-view learning; Adaptive graph weighting; Dimensionality reduction; Semi-supervised learning; Unsupervised learning.
Abstract: Multi-view learning has become a flourishing topic in recent years because it can discover various informative structures with respect to disparate statistical properties. However, multi-view data fusion remains challenging when exploring a proper way to find shared yet complementary information. In this paper, we present an adaptive graph weighting scheme to conduct semi-supervised multi-view dimensionality reduction. In particular, we construct a Laplacian graph for each view, and the final graph is approximately regarded as a centroid of these single-view graphs with different weights. Based on the learned graph, a simple yet effective linear regression function is employed to project the data into a low-dimensional space. In addition, our proposed scheme can be extended to an unsupervised version within a unified framework. Extensive experiments on varying benchmark datasets illustrate that our proposed scheme is superior to several state-of-the-art semi-supervised/unsupervised multi-view dimensionality reduction methods. Last but not least, we demonstrate that our proposed scheme provides a unified view to explain and understand a family of traditional schemes.
1. Introduction

Multi-view learning is of vital significance owing to the abundant information it takes advantage of [1]. Multi-view information can typically be collected by various sensors, such as depth, infrared, and RGB data, which are employed in object detection, classification, and scene understanding to boost generalization performance [2-4]. Alternatively, it can also be described by distinct feature subsets, such as HOG, SIFT [5], and GIST [6]. These different features characterize partly independent information from various perspectives [7,8]. Owing to this comprehensive representation capability, multi-view learning has gained increasing popularity [9,10].

Complying with the consensus and complementarity philosophy, multi-view learning strives to combine heterogeneous information into an integrated form, and a collection of approaches have been proposed in the past decades. A naive one is to concatenate all views and apply single-view learning algorithms directly, which neglects the complementary nature and specific statistical properties of different views. To alleviate this issue, Hotelling [11] proposed canonical correlation analysis (CCA), which leverages two views of the same underlying semantic object to extract a common representation. Co-training [12] is one of the earliest methods for multi-view learning, based on which co-EM [13], co-testing [14], and robust co-training [15] were proposed. They use multiple redundant views to learn from the data by training a set of classifiers defined in each view, under the assumption that the multi-view features are conditionally independent. However, in most real-world applications the independence assumption is invalid, so these methods cannot work effectively [16]. From the kernel perspective, multiple kernel functions can be used to map the data of multiple views into a unified space, which easily and effectively combines the information contained in the different views. As a result, multiple kernel learning approaches [17-19] have gained a lot of attention.

Along with the powerful representation capability residing in multi-view data, the curse of dimensionality still exists; it causes huge computation and memory consumption in real applications. Therefore, reducing the dimensionality while retaining the important information contained in the original high-dimensional data becomes necessary [20,21]. Depending on the availability of labeled training examples, dimensionality reduction techniques can be divided into three categories: supervised, semi-supervised, and unsupervised methods [22,23]. When conducting dimensionality reduction, we desire to enhance the discriminative capability of the low-dimensional data. To achieve this goal, a collection of supervised algorithms have been proposed, such as linear discriminant analysis (LDA) [24],
local Fisher discriminant analysis (LFDA) [25], marginal Fisher analysis (MFA) [26,27], and the maximum margin criterion (MMC) [28]. Without any label information, unsupervised dimensionality reduction algorithms aim to maintain the original structure of the high-dimensional data to the greatest extent [29]. Some representative methods are principal component analysis (PCA) [30], neighborhood preserving embedding (NPE) [31], locality preserving projections (LPP), and sparsity preserving projections (SPP) [32].

Fig. 1. Comparison among supervised learning, unsupervised learning and semi-supervised learning.

As shown in Fig. 1, semi-supervised algorithms are a compromise between supervised and unsupervised approaches, focusing on learning the intrinsic structure revealed by unlabeled data together with a small amount of labeled data [33]. Semi-supervised dimensionality reduction (SSDR) [34] exploits both cannot-link and must-link constraints together with unlabeled data, while semi-supervised local Fisher discriminant analysis (SLFDA) [35] bridges PCA and LFDA, such that the global structure retained by the former in unsupervised scenarios can complement the latter.

In this paper, we focus on the multi-view dimensionality reduction problem and propose a novel Laplacian graph framework to cope with it under both semi-supervised and unsupervised scenarios. The essential motivations are two-fold: (1) adaptively weighted multi-graph integration can learn a consistent manifold and complement incomplete information by assigning a proper significance to each individual graph; (2) graph-based learning can be formulated as a transductive semi-supervised method, so that the labels of unlabeled data can be deduced according to the pairwise similarity with the available labeled ones. In detail, we present a novel graph weighting technique tailored for multi-view input, where the target graph is considered as a centroid of the graphs built for every single view in terms of their relative significance. A family of weights is assigned to the views and optimized jointly, which frees us from predefining hyper-parameters empirically. Furthermore, our scheme employs a linear regression function for dimensionality reduction and constrains the discrepancy between the low-dimensional representation and the prediction vector on the graph. In this way, dimensionality reduction is well incorporated into the graph-based semi-supervised multi-view learning process. Extensive experiments on varying benchmark datasets demonstrate that our proposed scheme is superior to several multi-view dimensionality reduction methods, both semi-supervised and unsupervised.

2. Related work

In this section, we introduce the notation used throughout the paper and then briefly review several representative methods.

2.1. Notations

The generally involved notations are summarized in Table 1. Specifically, we write matrices as bold uppercase letters and vectors as bold lowercase letters. Given a matrix W = [W_ij], its ith row and jth column are denoted as W_i and W^j, respectively. Y is a binary label matrix with Y_ij = 1 if y_i = j and Y_ij = 0 otherwise.

Table 1. Notations.
N                 Number of total data samples
n                 Number of labeled data samples
m                 Number of views
k                 Number of categories
f_v               Dimensionality of the vth view
f                 Dimensionality across all m views
X ∈ R^{f×N}       The training data set
W ∈ R^{f×k}       The projection matrix
b ∈ R^{k×1}       The bias vector
y ∈ R^N           The label vector of labeled data
Y ∈ R^{N×k}       The coded binary label matrix
F ∈ R^{N×k}       The prediction label matrix
A^v ∈ R^{N×N}     The manually established affinity matrix of the vth view
S ∈ R^{N×N}       The optimal affinity matrix across all m views
L_S ∈ R^{N×N}     The Laplacian matrix
D ∈ R^{N×N}       The diagonal matrix
I ∈ R^{N×N}       The identity matrix
G                 The Laplacian graph
0                 The vector or matrix whose elements are all 0
1                 The vector or matrix whose elements are all 1

When a data set X and an affinity matrix S are given, an undirected weighted graph G is formulated as G = {X, S}, whose vertices are the data points X_i ∈ X, and two vertices (X_i, X_j) are connected by a weighted edge S_ij ∈ S. The Laplacian matrix L_S is computed by L_S = D − S, where D is a diagonal matrix with diagonal elements D_ii = Σ_j S_ij.

2.2. Dimensionality reduction

Yan et al. [26] developed a unified graph embedding framework for dimensionality reduction, which unifies a collection of traditional algorithms, such as PCA, LDA, LLE, and Laplacian Eigenmaps. In this framework, the statistical or geometric properties of a dimensionality reduction scheme are encoded into a direct graph embedding together with three kinds of transformations of the high-dimensional vector: linear projection, nonlinear kernel projection, and tensorization. The optimization problem of direct graph embedding is designed as

F = \arg\min_{F,\, F^T V F = I} \mathrm{Tr}(F^T L_S F), \quad (1)

where V is another Laplacian matrix that satisfies the constraints V1 = 0 and 1^T V = 0^T, which prevents a trivial solution to the objective function. Direct graph embedding is capable of learning an F that lies on a low-dimensional manifold of the training data. The low-dimensional representation F can be computed by a linear mapping, F = X^T W; by kernel graph embedding when the kernel Gram matrix K is available, F = K^T b; or by tensorization, designed for higher-order structured data, F_i = X_i ×_1 W_1 ×_2 W_2 ⋯ ×_n W_n. We refer interested readers to [26] for details. However, the main drawback of direct graph embedding is that it fails to map out-of-sample data points.
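The graph quantities used above, and throughout the rest of the paper, reduce to a few lines of code. The following is a minimal NumPy sketch of L_S = D − S; the function name and the toy affinity matrix are our own illustrative choices, not part of the paper:

```python
import numpy as np

def build_laplacian(S):
    """Return the graph Laplacian L_S = D - S, where D is the
    diagonal degree matrix with D_ii = sum_j S_ij."""
    D = np.diag(S.sum(axis=1))
    return D - S

# Toy affinity over N = 4 points (symmetric, nonnegative).
S = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.5, 0.0],
              [0.1, 0.5, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])
L = build_laplacian(S)
assert np.allclose(L @ np.ones(4), 0)  # rows of a Laplacian sum to zero
```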
188
X. Xu, Y. Yang and C. Deng et al. / Signal Processing 165 (2019) 186–196
2.3. LGC and GFHF

LGC [36] and GFHF [37] focus on learning a prediction label matrix F ∈ R^{N×k} on a graph. These two algorithms impose constraints on manifold smoothness (i.e., F should be smooth on the entire graph over both labeled and unlabeled data) and on label fitness (i.e., F should be close to the labels of the labeled points). The objective functions E_L and E_G are

E_L(F) = \frac{1}{2}\sum_{i,j=1}^{N} \Big\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \Big\|_2^2 S_{ij} + \lambda \sum_{i=1}^{N} \|F_i - Y_i\|_2^2,

E_G(F) = \frac{1}{2}\sum_{i,j=1}^{N} \|F_i - F_j\|_2^2 S_{ij} + \lambda_\infty \sum_{i=1}^{N} \|F_i - Y_i\|_2^2, \quad (2)

where λ is the balance coefficient between the two terms, and λ_∞ is a very large number that constrains \sum_{i=1}^{N} \|F_i - Y_i\|_2^2 = 0.
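As an illustration of how label fitness and manifold smoothness interact, the sketch below solves a GFHF-style objective in closed form by setting the gradient to zero, which gives (L_S + λU)F = λUY. Using a large finite λ in place of λ_∞, and all of the names here, are our own assumptions rather than the paper's:

```python
import numpy as np

def propagate_labels(S, Y, labeled_mask, lam=1e6):
    """Minimize Tr(F'LF) + lam * Tr((F-Y)'U(F-Y)) in closed form,
    where U is diagonal with 1s on labeled samples; a large lam
    approximates the hard clamping of GFHF. Assumes every connected
    component of the graph contains at least one labeled sample."""
    L = np.diag(S.sum(axis=1)) - S           # graph Laplacian
    U = np.diag(labeled_mask.astype(float))  # label-fitness weights
    F = np.linalg.solve(L + lam * U, lam * U @ Y)
    return F.argmax(axis=1)                  # predicted class per sample
```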
2.4. LapRLS/L

LapRLS/L [38] is a manifold-regularization-based dimensionality reduction approach associated with regression. The regression function is defined as h(X_i) = W^T X_i + b, where W ∈ R^{f×k} is the mapping matrix and b ∈ R^{k×1} is the bias term. LapRLS/L minimizes the ridge regression errors and preserves the manifold smoothness at the same time; the loss function is defined as

E_{La}(W, b) = \frac{1}{m}\sum_{i=1}^{m} \|W^T X_i + b - Y_i\|_2^2 + \lambda_n \|W\|_F^2 + \lambda_s \mathrm{Tr}(W^T X L_S X^T W), \quad (3)

where λ_n and λ_s are the balance coefficients among the regularization term, the manifold smoothness, and the regression error.
3. Methodology

We construct a collection of graphs for all views, and the final integrated graph is approximately regarded as their centroid. Following the manner of forming a Gaussian similarity from the Euclidean distance [39], we initialize an affinity matrix A^v for every single-view graph:

A^v_{i,j} = \begin{cases} \exp(-\|X_i^v - X_j^v\|_2^2 / t), & \|X_i^v - X_j^v\|_2^2 \le \varepsilon; \\ 0, & \text{otherwise}. \end{cases} \quad (4)
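A minimal NumPy sketch of Eq. (4) follows; the default values of t and ε are placeholders (the paper does not prescribe them), and a k-nearest-neighbor truncation would be an equally common alternative to the ε-ball:

```python
import numpy as np

def view_affinity(Xv, t=1.0, eps=1.0):
    """Eq. (4): Gaussian affinity for one view, zeroed beyond the
    epsilon-ball. Xv holds one sample per column; t and eps are
    dataset-dependent (the values here are placeholders)."""
    sq = ((Xv[:, :, None] - Xv[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    A = np.exp(-sq / t)
    A[sq > eps] = 0.0          # keep only epsilon-neighbors
    np.fill_diagonal(A, 0.0)   # no self-loops
    return A
```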
One paramount element in the success of graph-based multi-view learning is to learn an optimally integrated affinity matrix S. We achieve this goal through the constraint \sum_v d_v \|S - A^v\|_F^2 between the weighted single-view graphs and the integrated graph. The underlying philosophy is to find a group of parameters d_v that weight each view's affinity matrix A^v so as to maximize the matching degree between the integrated affinity matrix S and the view affinity matrices A^v. We employ \mathrm{Tr}(F^T L_S F) to smooth the prediction label matrix F, and \mathrm{Tr}((F - Y)^T U (F - Y)) to constrain the label fitness under the semi-supervised setting, which implies that F should be close to the semi-supervised label information Y; here U = diag(I, 0) is a diagonal matrix whose first n diagonal elements are 1 and the remaining N − n are 0, with I ∈ R^{n×n} and 0 ∈ R^{(N−n)×(N−n)}. In addition, the linear projection function X^T W + 1b^T is used to map the data X into the prediction label space. Both the manifold ranking and the linear discriminative projection are used to learn F, by constraining the distance between them to be as small as possible. For W, the mapping matrix of X, we add a Frobenius-norm regularization term \|W\|_F^2 to avoid over-fitting. The loss function is therefore formulated as

E(S, d, W, b, F) = \sum_v d_v \|S - A^v\|_F^2 + \gamma \|d\|_2^2 + \mathrm{Tr}(F^T L_S F) + \mathrm{Tr}((F - Y)^T U (F - Y)) + \mu \big( \beta \|X^T W + 1 b^T - F\|_F^2 + \|W\|_F^2 \big),
s.t. S1 = 1, S ≥ 0, d^T 1 = 1, d ≥ 0, \quad (5)

where d is the view-weight vector and d_v is the weight of the vth view.
3.1. Optimization

To solve the problem in Eq. (5), we use an alternating optimization strategy that optimizes one variable while fixing the others. Initializing S = \sum_v d_v A^v, the detailed iterative process is as follows.

1. Fix F, and then compute W and b. Setting the derivatives of the loss function in Eq. (5) with respect to b and W to zero, we have

b = \frac{1}{m}(F^T 1 - W^T X 1),
W = \beta \big( \beta X (I - \tfrac{1}{m}11^T) X^T + I \big)^{-1} X (I - \tfrac{1}{m}11^T) F. \quad (6)

Setting H_c = I - \tfrac{1}{m}11^T and B = \beta (\beta X H_c X^T + I)^{-1} X H_c, Eq. (6) is formulated as

W = B F. \quad (7)

2. Fix W and b, and then compute F. According to Eqs. (6) and (7), we can develop X^T W + 1 b^T as

X^T W + 1 b^T = X^T B F + \tfrac{1}{m} 11^T F - \tfrac{1}{m} 11^T X^T B F
            = (I - \tfrac{1}{m} 11^T) X^T B F + \tfrac{1}{m} 11^T F
            = (H_c X^T B + \tfrac{1}{m} 11^T) F. \quad (8)

Setting C = H_c X^T B + \tfrac{1}{m} 11^T, Eq. (8) is formulated as

X^T W + 1 b^T = C F. \quad (9)
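A sketch of this step under the paper's conventions follows (one sample per column of X; m denotes the number of training samples, matching the paper's use of m in H_c); the function and variable names are our own:

```python
import numpy as np

def solve_W_b(X, F, beta):
    """Step 1: closed-form W and b from Eqs. (6)-(7), plus the matrix
    C of Eq. (9). X: (f, m) data, one sample per column; F: (m, k)."""
    f, m = X.shape
    Hc = np.eye(m) - np.ones((m, m)) / m                 # centering matrix
    B = beta * np.linalg.solve(beta * X @ Hc @ X.T + np.eye(f), X @ Hc)
    W = B @ F                                            # Eq. (7)
    b = (F.T @ np.ones(m) - W.T @ X @ np.ones(m)) / m    # Eq. (6)
    C = Hc @ X.T @ B + np.ones((m, m)) / m               # X'W + 1b' = C F, Eq. (9)
    return W, b, C
```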
Substituting the expressions of the optimal W and b into Eq. (5), we can compute F by minimizing the objective function with respect to F:

F = \arg\min_F \sum_v d_v \|S - A^v\|_F^2 + \gamma \|d\|_2^2 + \mathrm{Tr}(F^T L_S F) + \mathrm{Tr}((F - Y)^T U (F - Y)) + \mu \big( \mathrm{Tr}(F^T B^T B F) + \beta \mathrm{Tr}((C F - F)^T (C F - F)) \big). \quad (10)

By setting the derivative of Eq. (10) with respect to F equal to 0, the prediction label matrix F is updated by

F = \big( U + L_S + \mu\beta (C - I)^T (C - I) + \mu B^T B \big)^{-1} U Y. \quad (11)

For H_c, we have

H_c H_c = (I - \tfrac{1}{m} 11^T)(I - \tfrac{1}{m} 11^T)
        = I - \tfrac{1}{m} 11^T - \tfrac{1}{m} 11^T + \tfrac{1}{m^2} 1 (1^T 1) 1^T
        = I - \tfrac{1}{m} 11^T - \tfrac{1}{m} 11^T + \tfrac{1}{m^2} 1 m 1^T
        = H_c, \quad (12)

i.e., H_c is idempotent.
We can also easily obtain the following equation:

\mu\beta B^T X H_c X^T B + \mu B^T B = \mu B^T (\beta X H_c X^T + I) B = \mu\beta B^T X H_c. \quad (13)

Moreover, the equation B^T X H_c = H_c X^T B can be derived from the property of the matrix transpose. Using Eq. (12), Eq. (13), and B^T X H_c = H_c X^T B, the term \mu\beta(C - I)^T(C - I) + \mu B^T B is rewritten as

\mu\beta (C - I)^T (C - I) + \mu B^T B
= \mu\beta (H_c X^T B + \tfrac{1}{m}11^T - I)^T (H_c X^T B + \tfrac{1}{m}11^T - I) + \mu B^T B
= \mu\beta (H_c X^T B - H_c)^T (H_c X^T B - H_c) + \mu B^T B
= \mu\beta B^T X H_c X^T B - 2\mu\beta H_c X^T B + \mu\beta H_c + \mu B^T B
= \mu\beta H_c X^T B - 2\mu\beta H_c X^T B + \mu\beta H_c
= -\mu\beta H_c X^T B + \mu\beta H_c
= -\mu\beta^2 H_c X (\beta X H_c X^T + I)^{-1} X^T H_c + \mu\beta H_c. \quad (14)

Defining X_c = X H_c, the prediction label matrix F is finally updated by

F = \big( U + L_S + \mu\beta H_c - \mu\beta^2 N \big)^{-1} U Y, \quad (15)

where N = X_c^T (\beta X_c X_c^T + I)^{-1} X_c = X_c^T X_c (\beta X_c^T X_c + I)^{-1}.
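The F-update of Eq. (15) is then a single linear solve; a sketch under the same conventions as the step-1 code above (names are ours):

```python
import numpy as np

def update_F(X, Y, LS, U, mu, beta):
    """Step 2: closed-form F from Eq. (15). X: (f, N); Y: (N, k);
    LS: Laplacian of the current S; U: diagonal label indicator."""
    f, N = X.shape
    Hc = np.eye(N) - np.ones((N, N)) / N
    Xc = X @ Hc                                  # centered data
    Nmat = Xc.T @ np.linalg.solve(beta * Xc @ Xc.T + np.eye(f), Xc)
    M = U + LS + mu * beta * Hc - mu * beta**2 * Nmat
    return np.linalg.solve(M, U @ Y)
```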
3. Fix F and d, and then compute S. From Eq. (5) we can see that S does not depend on W or b. Therefore, the optimization problem with respect to S is

\min_{S1=1,\, S \ge 0} \sum_v d_v \|S - A^v\|_F^2 + \mathrm{Tr}(F^T L_S F). \quad (16)

Our target is to find an optimal integrated affinity matrix S with the prediction label matrix F fixed, so problem (16) becomes

\min_{S_{i,j} \ge 0,\, S1=1} \sum_{v=1}^{m} \sum_{i,j=1}^{N} d_v (S_{i,j} - A^v_{i,j})^2 + \sum_{i,j=1}^{N} \|F_i - F_j\|_2^2 S_{i,j}. \quad (17)

Since Eq. (17) is independent for each i, we solve the following minimization problem separately for each i:

\min_{S_i 1 = 1,\, S_{i,j} \ge 0} \sum_{j=1}^{N} \sum_{v=1}^{m} d_v (S_{i,j} - A^v_{i,j})^2 + \sum_{j=1}^{N} \|F_i - F_j\|_2^2 S_{i,j}. \quad (18)

For simplicity, we denote u_{i,j} = \|F_i - F_j\|_2^2 and let u_i be the vector whose jth element equals u_{i,j}. Replacing \|F_i - F_j\|_2^2 with u_{i,j} in problem (18) and completing the square, we have

\min_{S_i 1 = 1,\, S_{i,j} \ge 0} \Big\| S_i - \frac{\sum_v d_v A^v_i - u_i/2}{\sum_v d_v} \Big\|_2^2, \quad (19)

which can be optimized by the projection algorithm of Duchi et al. [40].
4. Fix S, and then compute d. The problem can be simplified as

\min_{d^T 1 = 1,\, d_v \ge 0} \sum_v d_v \|S - A^v\|_F^2 + \gamma \|d\|_2^2. \quad (20)

We define e_v = \|S - A^v\|_F^2, so the optimization problem (20) can be converted to

\min_{d^T 1 = 1,\, d_v \ge 0} \sum_v d_v e_v + \gamma \|d\|_2^2, \quad (21)

where the second term is utilized to smooth the weight distribution. Without the regularization term (γ → 0), a trivial solution is obtained in which the weight of the best view is assigned 1 and the other weights are 0; on the contrary, when γ → ∞, equal weights are obtained. Completing the square, (21) can be converted to

\min_{d^T 1 = 1,\, d_v \ge 0} \Big\| d + \frac{e}{2\gamma} \Big\|_2^2, \quad (22)

whose form is the same as the S-subproblem and can likewise be solved by the method of Duchi et al. [40]. The complete optimization procedure is summarized in Algorithm 1.

Algorithm 1. The learning procedure of the semi-supervised scheme.
Input: N-sample matrix {X_i}_{i=1}^N with m different views; affinity matrices {A^v}_{v=1}^m computed for each view by Eq. (4); n labeled samples, the remaining N − n unlabeled; the corresponding binary label matrix Y ∈ R^{N×k} with Y_{i,j} = 1 if the ith sample belongs to the jth category; the diagonal matrix U whose first n diagonal elements are 1 and the rest 0; balance coefficients μ, β; termination parameter ε (usually set to 10^{-4}); maximum number of iterations t.
1: Initialization: d = (1/m)·1, S = Σ_{v=1}^m d_v A^v.
2: for j = 1; j < t; j++ do
3:   Update W and b by Eq. (6) and Eq. (7), respectively;
4:   Update the predicted label matrix F by Eq. (15);
5:   Update the integrated affinity matrix S by solving problem (19);
6:   Update the view weights d by solving problem (22);
7:   if the loss residual between two adjacent iterations is smaller than ε then
8:     break;
9:   end if
10: end for
Output: Transformation matrix W, bias vector b, label matrix F, integrated affinity matrix S, and view-weight vector d.
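Both subproblems (19) and (22) are Euclidean projections onto the probability simplex, which the sorting-based routine of Duchi et al. [40] solves in O(n log n); a standard sketch (the variable names are ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1},
    following the O(n log n) algorithm of Duchi et al. [40]."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)    # optimal shift
    return np.maximum(v - theta, 0.0)

# e.g., the d-update of problem (22): d = project_simplex(-e / (2 * gamma))
```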
3.2. Extension to an unsupervised version

We can obtain an unsupervised version of the proposed objective function by setting all elements of U equal to 0. We still focus on optimizing the projection matrix W, the bias term b, and the prediction label matrix F, so Eq. (5) is rewritten as

E_u(S, d, W, b, F) = \sum_v d_v \|S - A^v\|_F^2 + \gamma \|d\|_2^2 + \mathrm{Tr}(F^T L_S F) + \mu \big( \|W\|_F^2 + \beta \|X^T W + 1 b^T - F\|_F^2 \big),
s.t. S1 = 1, S ≥ 0, d^T 1 = 1, d ≥ 0, F^T V F = I. \quad (23)

For the prediction label matrix F, we add the constraint F^T V F = I to make F lie on a sphere after the centering operation, which avoids the trivial solution F = 0 [39,41]. We set V = H_c in this paper. In fact, the objective function (23) is a general formulation, which can easily develop into a supervised problem by using different L_S and V. As in the optimization of the semi-supervised scheme, we set the derivatives of the objective in Eq. (23) with respect to W and b to zero; then W and b can be updated by Eq. (7). We can further rewrite Eq. (23) by substituting W
and b:

E_u(S, d, F) = \sum_v d_v \|S - A^v\|_F^2 + \gamma \|d\|_2^2 + \mathrm{Tr}(F^T L_S F) + \mu \big( \mathrm{Tr}(F^T B^T B F) + \beta \mathrm{Tr}((C F - F)^T (C F - F)) \big),
s.t. F^T H_c F = I. \quad (24)

Fix S and d, and then compute F. F has no relation to the first term, so we can update F by minimizing Eq. (24) with respect to F:

F = \arg\min_{F,\, F^T H_c F = I} \mathrm{Tr}(F^T L_S F) + \mu \big( \mathrm{Tr}(F^T B^T B F) + \beta \mathrm{Tr}((C F - F)^T (C F - F)) \big). \quad (25)

According to Eq. (14), Eq. (25) for computing F develops into

F = \arg\min_{F,\, F^T H_c F = I} \mathrm{Tr}\big( F^T (L_S + \mu\beta H_c - \mu\beta^2 N) F \big) = \arg\min_{F,\, F^T H_c F = I} \mathrm{Tr}\big( F^T (L_S - \mu\beta^2 N) F \big), \quad (26)

where the second equality holds because \mathrm{Tr}(F^T H_c F) is a constant under the constraint F^T H_c F = I. Problem (26) can be optimized by generalized eigenvalue decomposition [26]. S and d are updated by solving problems (19) and (22), respectively, as before. The complete optimization algorithm of the unsupervised scheme is shown in Algorithm 2.

Algorithm 2. The learning procedure of the unsupervised scheme.
Input: N-sample matrix {X_i}_{i=1}^N with m different views; affinity matrices {A^v}_{v=1}^m computed for each view by Eq. (4); balance coefficients μ, β; termination parameter ε (usually set to 10^{-4}); maximum number of iterations t.
1: Initialization: d = (1/m)·1, S = Σ_{v=1}^m d_v A^v.
2: for j = 1; j < t; j++ do
3:   Update W and b by Eq. (6) and Eq. (7), respectively;
4:   Update the predicted label matrix F by solving problem (26) via generalized eigenvalue decomposition;
5:   Update the integrated affinity matrix S by solving problem (19);
6:   Update the view weights d by solving problem (22);
7:   if the loss residual between two adjacent iterations is smaller than ε then
8:     break;
9:   end if
10: end for
Output: Transformation matrix W, bias vector b, label matrix F, integrated affinity matrix S, and view-weight vector d.
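Problem (26) can be handed to an off-the-shelf generalized symmetric eigensolver with M = L_S − μβ²N. Since H_c is singular (it annihilates the all-ones vector), the sketch below adds a small ridge to make the right-hand matrix positive definite; this regularization is our own workaround, not part of the paper:

```python
import numpy as np
from scipy.linalg import eigh

def unsupervised_F(M, Hc, k, ridge=1e-6):
    """Solve min_F Tr(F'MF) s.t. F'HcF = I (problem (26)) via the
    generalized eigenproblem M f = lambda Hc f. eigh returns the
    eigenvalues in ascending order, so the first k eigenvectors give
    the minimizer."""
    vals, vecs = eigh(M, Hc + ridge * np.eye(Hc.shape[0]))
    return vecs[:, :k]
```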
4. Theoretical analysis

In this section, we analyze the convergence and complexity of our proposed method.

4.1. Convergence analysis

We first prove the convergence of Algorithm 1, which naturally carries over to the convergence of Algorithm 2. We begin by proving that the objective function is jointly convex w.r.t. F, W, and b. We formulate E_1(W, b, F) as

E_1(W, b, F) = \mathrm{Tr}(F^T L_S F) + \mathrm{Tr}((F - Y)^T U (F - Y)) + \mu \big( \beta \|X^T W + 1 b^T - F\|_F^2 + \|W\|_F^2 \big). \quad (27)

Proof. We know that U, L_S ∈ R^{N×N}, F, Y ∈ R^{N×k}, W ∈ R^{f×k}, and b ∈ R^{k×1}. If the matrices U and L_S are positive semi-definite, μ ≥ 0, and β ≥ 0, then E_1(W, b, F) is jointly convex with respect to F, W, and b. Removing the constant term \mathrm{Tr}(Y^T U Y) from E_1(W, b, F), it can be rewritten in matrix form as

E_1(W, b, F) = \mathrm{Tr}\left( \begin{bmatrix} F \\ W \\ b^T \end{bmatrix}^T P \begin{bmatrix} F \\ W \\ b^T \end{bmatrix} \right) - \mathrm{Tr}\left( \begin{bmatrix} F \\ W \\ b^T \end{bmatrix}^T \begin{bmatrix} 2UY \\ 0 \\ 0 \end{bmatrix} \right), \quad (28)

where

P = \begin{bmatrix} \mu\beta I + L_S + U & -\mu\beta X^T & -\mu\beta 1 \\ -\mu\beta X & \mu I + \mu\beta X X^T & \mu\beta X 1 \\ -\mu\beta 1^T & \mu\beta 1^T X^T & \mu\beta N \end{bmatrix}. \quad (29)

Therefore, in order to prove that E_1(W, b, F) is jointly convex w.r.t. F, W, and b, we only need to prove that the matrix P is positive semi-definite. For any vector z = [z_1^T, z_2^T, z_3]^T ∈ R^{(N+f+1)×1}, where z_1 ∈ R^{N×1}, z_2 ∈ R^{f×1}, and z_3 is a scalar, we have

z^T P z = z_1^T(\mu\beta I + L_S + U)z_1 - 2\mu\beta z_1^T X^T z_2 - 2\mu\beta z_1^T 1 z_3 + z_2^T(\mu I + \mu\beta X X^T)z_2 + 2\mu\beta z_2^T X 1 z_3 + \mu\beta N z_3^2
       = z_1^T(L_S + U)z_1 + \mu z_2^T z_2 + \mu\beta \big( z_1^T z_1 - 2 z_1^T X^T z_2 - 2 z_1^T 1 z_3 + z_2^T X X^T z_2 + 2 z_2^T X 1 z_3 + N z_3^2 \big)
       = z_1^T(L_S + U)z_1 + \mu z_2^T z_2 + \mu\beta (z_1 - X^T z_2 - 1 z_3)^T (z_1 - X^T z_2 - 1 z_3). \quad (30)

So if U and L_S are positive semi-definite, μ ≥ 0, and β ≥ 0, then z^T P z ≥ 0 for any z, and thus P is positive semi-definite. Therefore, E_1(W, b, F) is jointly convex w.r.t. F, W, and b. □

Since E_1(W, b, F) is jointly convex w.r.t. F, W, and b, the updates of W, b, and F do not increase the objective; that is,

\sum_v (d_v)_{t+1} \|S_{t+1} - A^v\|_F^2 + \mathrm{Tr}(F_{t+1}^T L_{S_{t+1}} F_{t+1}) + g(W_{t+1}, b_{t+1}, F_{t+1}) \le \sum_v (d_v)_{t+1} \|S_{t+1} - A^v\|_F^2 + \mathrm{Tr}(F_t^T L_{S_{t+1}} F_t) + g(W_t, b_t, F_t), \quad (31)

where g(W, b, F) = E_1(W, b, F) - \mathrm{Tr}(F^T L_S F). For the affinity matrix S, assume that S_{t+1} is the value updated from S_t; then we have

S_{t+1} = \arg\min_{S_{i,j} \ge 0,\, S1=1} \sum_v d_v \|S - A^v\|_F^2 + \mathrm{Tr}(F_t^T L_S F_t). \quad (32)

And we only need to prove

\sum_v (d_v)_{t+1} \|S_{t+1} - A^v\|_F^2 + \mathrm{Tr}(F_t^T L_{S_{t+1}} F_t) + g(W_t, b_t, F_t) \le \sum_v (d_v)_t \|S_t - A^v\|_F^2 + \mathrm{Tr}(F_t^T L_{S_t} F_t) + g(W_t, b_t, F_t). \quad (33)

Removing the identical term g(W_t, b_t, F_t), the inequality in (33) can be rewritten as

\sum_v d_v \|S_{t+1} - A^v\|_F^2 + \mathrm{Tr}(F_t^T L_{S_{t+1}} F_t) \le \sum_v d_v \|S_t - A^v\|_F^2 + \mathrm{Tr}(F_t^T L_{S_t} F_t). \quad (34)
According to the iteration step of d in Algorithm 1, we have (d_v)_t = \frac{1}{2\|S_t - A^v\|_F}, and then we can know that

\sum_v \frac{\|S_{t+1} - A^v\|_F^2}{2\|S_t - A^v\|_F} + \mathrm{Tr}(F_t^T L_{S_{t+1}} F_t) \le \sum_v \frac{\|S_t - A^v\|_F^2}{2\|S_t - A^v\|_F} + \mathrm{Tr}(F_t^T L_{S_t} F_t). \quad (35)

Nie [42] has proved that for any nonzero values F and F_t, the following inequality holds:

(\sqrt{F} - \sqrt{F_t})^2 \ge 0 \;\Rightarrow\; F - 2\sqrt{F F_t} + F_t \ge 0 \;\Rightarrow\; \sqrt{F} - \frac{F}{2\sqrt{F_t}} \le \sqrt{F_t} - \frac{F_t}{2\sqrt{F_t}}. \quad (36)

Therefore, we have

\sum_v \Big( \|S_{t+1} - A^v\|_F - \frac{\|S_{t+1} - A^v\|_F^2}{2\|S_t - A^v\|_F} \Big) \le \sum_v \Big( \|S_t - A^v\|_F - \frac{\|S_t - A^v\|_F^2}{2\|S_t - A^v\|_F} \Big). \quad (37)

Adding the two inequalities (35) and (37), we can easily determine that

\sum_v \|S_{t+1} - A^v\|_F + \mathrm{Tr}(F_t^T L_{S_{t+1}} F_t) \le \sum_v \|S_t - A^v\|_F + \mathrm{Tr}(F_t^T L_{S_t} F_t). \quad (38)

If we denote (d_v)_{t+1} = \frac{1}{2\|S_{t+1} - A^v\|_F}, we have the following result according to (20) and (38):

\sum_v (d_v)_{t+1} \|S_{t+1} - A^v\|_F^2 + \gamma \|d_{t+1}\|_2^2 + \mathrm{Tr}(F_t^T L_{S_{t+1}} F_t) \le \sum_v (d_v)_t \|S_t - A^v\|_F^2 + \gamma \|d_t\|_2^2 + \mathrm{Tr}(F_t^T L_{S_t} F_t). \quad (39)

Finally, combining inequalities (33) and (39), we have

\sum_v (d_v)_{t+1} \|S_{t+1} - A^v\|_F^2 + \gamma \|d_{t+1}\|_2^2 + \mathrm{Tr}(F_{t+1}^T L_{S_{t+1}} F_{t+1}) + g(W_{t+1}, b_{t+1}, F_{t+1}) \le \sum_v (d_v)_t \|S_t - A^v\|_F^2 + \gamma \|d_t\|_2^2 + \mathrm{Tr}(F_t^T L_{S_t} F_t) + g(W_t, b_t, F_t). \quad (40)

That is to say, the objective in Eq. (5) is non-increasing over the iterations, and Algorithm 1 converges to a locally optimal solution.

4.2. Complexity analysis

Now we consider the computational complexity of our proposed method. Specifically, the computational complexities of steps 3, 4, 5, and 6 in Algorithm 1 scale as O(fN^2 + Nf^2 + fNk), O(N^3 + N^2 k), O(N), and O(m), respectively. When the feature dimension f is the largest among the number of classes k, the number of views m, and the number of samples N, the computational complexity of Algorithm 1 scales as O(f^2); otherwise, it scales as O(N^3) when N is the largest.

5. Relations to previous algorithms

In this section, we first show how our proposed framework develops into different classical models associated with multi-view information fusion: the semi-supervised algorithms LGC [36], GFHF [37], and LapRLS/L [38]. We further discuss the relation between our proposed unsupervised form and the graph embedding framework [26] and spectral regression [43].

5.1. Relation to other semi-supervised learning algorithms

• Relation 1: Ours and LGC, GFHF.

We first rewrite the two objective functions of LGC and GFHF into a unified formulation:

E(F) = \mathrm{Tr}(F^T P F) + \mathrm{Tr}((F - Y)^T U (F - Y)), \quad (41)

where P ∈ R^{N×N} is a normalized graph Laplacian and U is the diagonal matrix whose diagonal elements are all λ for LGC, whereas for GFHF the first n diagonal elements are ∞ and the remaining N − n are zero. If we set μ = 0, the objective function (5) turns into

E(F, S) = \Big\| S - \sum_v d_v A^v \Big\|_F^2 + \mathrm{Tr}(F^T L_S F) + \mathrm{Tr}((F - Y)^T U (F - Y)), \quad (42)

which is a general form of LGC and GFHF cooperating with the multi-view integration. It is easily concluded that LGC and GFHF are two special cases of our scheme when taking the same strategy for multi-view data integration.

• Relation 2: LapRLS/L and Ours.

Setting μ = λ_n/λ_s and β → ∞ (that is, μβ → ∞) in Eq. (5), we have F = X^T W + 1 b^T, and Eq. (5) is rewritten as

E(W, b, S) = \sum_v d_v \|S - A^v\|_F^2 + \mu \|W\|_F^2 + \mathrm{Tr}\big( (X^T W + 1 b^T)^T L_S (X^T W + 1 b^T) \big) + \mathrm{Tr}\big( (X^T W + 1 b^T - Y)^T U (X^T W + 1 b^T - Y) \big). \quad (43)

If we set the first n and the remaining N − n diagonal elements of the matrix U to \frac{1}{n\lambda_s} and 0, respectively, then E(W, b, S) - \sum_v d_v \|S - A^v\|_F^2 is equal to \frac{1}{\lambda_s} E_{La}(W, b) in Eq. (3). Therefore, LapRLS/L is another special case of our proposed scheme for multi-view data.

5.2. Relation to the graph embedding framework

• Relation 3: Direct Graph Embedding and Ours.

As discussed above, the objective function of the direct graph embedding framework is formulated as \mathrm{Tr}(F^T L_S F). Setting μ = 0 in (23), our scheme develops into \sum_v d_v \|S - A^v\|_F^2 + \gamma \|d\|_2^2 + \mathrm{Tr}(F^T L_S F), which is identical to the graph embedding framework combined with our multi-view data integration algorithm. Therefore, the graph embedding framework under the multi-view scenario is a special case of our framework.
6. Experiments

In this section, we evaluate the effectiveness of our proposed scheme on four public benchmark datasets for both semi-supervised and unsupervised multi-view dimensionality reduction tasks. We first introduce the experimental setup, which covers the datasets, baselines, and evaluation metrics. We then analyze the quantitative results for both tasks. The hyperparameters are discussed at the end.
Table 2. Dataset description with the details of feature types and dimensions, number of classes, and number of images.

View       COIL            UMIST           YALE-B          CMU
1          Color/1050      Color/1050      Color/210       Color/420
2          GIST/512        GIST/512        GIST/512        GIST/512
3          HOG_2×2/1050    HOG_2×2/1050    HOG_2×2/1050    HOG_2×2/1050
4          HOG_3×3/1050    HOG_3×3/1050    HOG_3×3/210     HOG_3×3/1050
5          LBP/1239        LBP/1239        LBP/1239        LBP/1239
6          SIFT/1050       SIFT/1050       SIFT/1050       SIFT/1050
#Classes   20              20              38              68
#Images    1440            575             2432            3329
Fig. 2. Ten randomly selected images from CMU, YALE-B, COIL, and UMIST, from top to bottom respectively.
6.1. Experimental setup

A. Benchmark Datasets. To evaluate the proposed scheme, four widely used image datasets are involved, one of objects and the remaining three of human faces: CMU PIE, COIL-20, UMIST, and YALE-B. The CMU PIE dataset [44] includes about 40,000 facial samples of 68 identities, collected over diverse poses, under various illuminations, and with different facial expressions. In our experiments, we select the images of the frontal pose; each identity has about 49 images with varying illumination conditions and facial expressions. We refer to this dataset as CMU for short in the following. COIL-20 [45] consists of images of 20 objects; each object has 72 images acquired from diverse angles at five-degree intervals. We refer to this dataset as COIL for short. UMIST [46] consists of 575 multi-view samples of 20 subjects, covering a variety of poses from profile to frontal views. YALE-B [47] contains 38 identities used in this experiment, with each subject having about 64 nearly frontal images under various illumination conditions. A collection of different feature descriptors are extracted to represent the samples, and every single descriptor is regarded as one view. Using the Computer Vision Feature Extraction Toolbox,1 we extract 6 kinds of features in total: 'color', 'gist', 'hog2x2', 'hog3x3', 'lbp', and 'sift', thus obtaining a 6-view representation. Detailed information on the datasets and features is summarized in Table 2, and Fig. 2 gives some examples from the four datasets. In addition, we adopt a cross-validation protocol that randomly chooses half of the samples (for the semi-supervised task) or 5 samples per category (for the unsupervised task) for training, with the remainder for testing; this process is repeated 20 times. We report the mean accuracy and standard deviation as the quantitative results.

B. Baselines.
https://github.com/adikhosla/feature-extraction/.
There are two sets of baselines, for semi-supervised and unsupervised multi-view dimensionality reduction respectively. For the semi-supervised task, we compare our approach with 10 state-of-the-art methods: (a) Marginal Fisher Analysis (MFA) [26]; (b) Gaussian Fields and Harmonic Functions (GFHF) [37]; (c) Local and Global Consistency (LGC) [36]; (d) Transductive Component Analysis (TCA) [48]; (e) Semi-supervised Discriminant Analysis (SDA) [49]; (f) Linear Laplacian Regularized Least Squares (LapRLS/L) [38]; (g) Flexible Manifold Embedding (FME) [50]; (h) Multiple View Semi-supervised Dimensionality Reduction (MVSSDR); (i) Parameter-Free Auto-Weighted Multiple Graph Learning (FMCSC) [51]; and (j) Multi-view clustering and semi-supervised classification with adaptive neighbours (MVAN) [8]. For the unsupervised multi-view dimensionality reduction task, we compare our scheme with (a) Principal Component Analysis (PCA) [30]; (b) Locality Preserving Projections (LPP) [41]; and (c) Spectral Regression LPP (LPP-SR) [43].

C. Evaluation Metrics. Classification accuracy is applied as the evaluation metric throughout the experiments, defined as
\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n} \delta(y_i, \hat{y}_i), \quad (44)

where y_i is the ground-truth label of each sample, \hat{y}_i is the corresponding predicted label, and n is the number of test samples. The function δ(·, ·) compares its two arguments and returns 1 if they are equal and 0 otherwise.
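In code, the metric and the k-nearest-neighbor evaluation pipeline used in the experiments below look roughly as follows; the use of scikit-learn and all names here are our own choices, since the paper does not specify an implementation:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def accuracy(y_true, y_pred):
    """Eq. (44): mean of delta(y_i, yhat_i) over the n test samples."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def evaluate(X_train, y_train, X_test, y_test, W, b, n_neighbors=1):
    """Project samples with the learned linear map X'W + 1b', then
    classify the projected test samples by k nearest neighbors."""
    Z_train = X_train.T @ W + b   # rows are low-dimensional samples
    Z_test = X_test.T @ W + b
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(Z_train, y_train)
    return accuracy(y_test, clf.predict(Z_test))
```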
6.2. Semi-supervised experiments

To verify the superiority of our proposed semi-supervised dimensionality reduction algorithm, only p samples per category are given label information [49] in the training set. We report the recognition accuracy on the unlabeled data and on the testing set for three different values p = 1/2/3, denoted as 'Unlabeled' and 'Test'. There are two parameters in our proposed algorithm, μ and β. For a fair comparison, we set the range of candidate values to {10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}} and pick the best performance for each method. We plot the convergence curves for the four datasets in Fig. 3, which illustrates that our proposed algorithm converges quickly and stably; the fast convergence benefits from using a linear transformation function for dimensionality reduction.

Fig. 3. The convergence curves on the four datasets. Our proposed algorithm converges stably within a few iterations.

We use the k-nearest-neighbor classifier on the samples after dimensionality reduction, and Tables 3-6 report the mean top-1 recognition accuracy ± standard deviation over 20 random splits for the four datasets.

Table 3. Recognition performance (mean accuracy ± standard deviation, %) on CMU over 20 random splits. We report the results on unlabeled and testing samples with 1/2/3 labeled sample(s) per category. The best results are underlined.

Method     1 labeled                 2 labeled                 3 labeled
           Unlabeled    Test         Unlabeled    Test         Unlabeled    Test
MFA        –            –            78.1 ± 3.6   76.8 ± 2.5   85.0 ± 2.9   86.1 ± 1.8
GFHF       40.9 ± 3.3   –            57.8 ± 2.6   –            65.8 ± 3.1   –
LGC        39.5 ± 3.1   –            50.9 ± 2.3   –            57.9 ± 0.9   –
TCA        62.3 ± 3.7   63.5 ± 3.7   79.9 ± 2.5   83.4 ± 2.3   86.9 ± 1.1   88.6 ± 1.1
SDA        63.4 ± 3.2   69.1 ± 2.8   83.5 ± 2.1   84.2 ± 2.0   89.6 ± 1.2   89.8 ± 1.1
LapRLS/L   59.4 ± 3.0   54.5 ± 2.8   84.1 ± 2.2   84.0 ± 1.8   89.8 ± 1.1   89.7 ± 1.1
FME        65.2 ± 2.5   64.7 ± 2.4   83.6 ± 2.0   84.5 ± 1.6   89.8 ± 1.1   91.9 ± 1.0
MVSSDR     73.1 ± 3.1   71.7 ± 2.9   83.9 ± 2.1   85.2 ± 2.0   89.0 ± 1.7   89.9 ± 0.9
FMCSC      72.3 ± 2.9   70.3 ± 2.9   85.6 ± 2.2   83.2 ± 2.1   89.5 ± 1.9   89.3 ± 1.4
MVAN       73.8 ± 3.9   72.0 ± 3.7   86.3 ± 2.9   85.2 ± 2.9   90.1 ± 1.3   88.6 ± 1.2
Ours/v1    71.9 ± 2.1   70.4 ± 1.9   83.8 ± 1.4   82.4 ± 1.3   89.7 ± 0.9   88.9 ± 1.2
Ours/v2    60.0 ± 1.9   58.5 ± 1.6   76.6 ± 1.5   75.0 ± 1.5   84.9 ± 1.4   83.5 ± 1.6
Ours/v3    56.0 ± 2.3   55.0 ± 1.7   71.5 ± 1.7   69.9 ± 1.5   80.0 ± 1.5   78.9 ± 1.6
Ours/v4    44.1 ± 1.9   42.8 ± 1.4   61.5 ± 2.1   59.1 ± 1.4   72.5 ± 1.4   71.1 ± 2.0
Ours/v5    59.5 ± 2.3   58.3 ± 1.7   77.0 ± 1.2   75.6 ± 1.6   86.0 ± 1.3   85.0 ± 1.4
Ours/v6    42.9 ± 3.1   40.4 ± 1.9   60.8 ± 1.4   65.2 ± 1.3   87.7 ± 0.9   88.9 ± 1.3
Ours       75.2 ± 1.8   72.0 ± 1.7   89.2 ± 1.2   87.1 ± 1.3   94.5 ± 0.8   93.2 ± 1.1

Table 4. Recognition performance (mean accuracy ± standard deviation, %) on COIL over 20 random splits. We report the results on unlabeled and testing samples with 1/2/3 labeled sample(s) per category. The best results are underlined.

Method     1 labeled                 2 labeled                 3 labeled
           Unlabeled    Test         Unlabeled    Test         Unlabeled    Test
MFA        –            –            71.1 ± 2.7   72.1 ± 3.2   75.5 ± 2.4   75.2 ± 2.2
GFHF       77.7 ± 2.0   –            84.2 ± 2.1   –            86.5 ± 2.0   –
LGC        79.4 ± 2.7   –            81.9 ± 2.1   –            84.9 ± 2.2   –
TCA        71.5 ± 2.8   71.4 ± 2.7   79.1 ± 2.8   78.9 ± 2.1   81.6 ± 2.2   81.5 ± 2.8
SDA        59.9 ± 2.3   59.8 ± 2.2   74.1 ± 2.6   74.3 ± 2.1   79.3 ± 2.2   78.7 ± 2.4
LapRLS/L   64.5 ± 3.2   63.4 ± 3.3   75.6 ± 2.9   73.7 ± 2.5   79.6 ± 2.6   78.8 ± 2.5
FME        76.5 ± 3.4   77.7 ± 3.0   84.5 ± 2.9   86.9 ± 3.4   88.1 ± 2.3   90.6 ± 2.6
MVSSDR     76.2 ± 3.8   76.7 ± 3.5   85.4 ± 2.0   86.7 ± 2.3   90.0 ± 1.7   91.9 ± 1.1
FMCSC      79.3 ± 2.5   77.3 ± 2.4   88.7 ± 2.0   88.7 ± 2.0   91.8 ± 2.4   91.0 ± 2.0
MVAN       81.1 ± 2.7   81.6 ± 2.2   89.4 ± 2.5   88.9 ± 2.3   91.5 ± 1.6   90.3 ± 1.8
Ours/v1    76.8 ± 3.4   76.7 ± 3.0   85.8 ± 2.3   85.8 ± 2.4   88.5 ± 1.4   89.1 ± 1.6
Ours/v2    29.7 ± 12.1  39.1 ± 12.8  21.9 ± 9.5   33.9 ± 9.5   58.2 ± 7.5   70.6 ± 5.7
Ours/v3    71.4 ± 3.5   71.7 ± 3.6   79.1 ± 2.7   80.1 ± 2.2   85.1 ± 2.3   86.0 ± 2.7
Ours/v4    65.2 ± 4.6   66.8 ± 4.1   75.2 ± 2.5   77.4 ± 2.4   81.7 ± 2.3   83.9 ± 2.3
Ours/v5    22.9 ± 3.6   52.3 ± 3.0   23.4 ± 4.6   67.7 ± 2.9   18.2 ± 2.0   77.6 ± 3.2
Ours/v6    75.2 ± 3.0   74.7 ± 2.8   82.8 ± 2.1   83.0 ± 2.2   87.0 ± 1.7   86.9 ± 1.9
Ours       82.7 ± 2.0   83.2 ± 2.2   90.9 ± 1.7   91.3 ± 2.3   94.2 ± 1.8   94.9 ± 2.1
We have the following observations:

a. Compared with the traditional semi-supervised methods, our proposed scheme significantly outperforms them no matter how many labeled samples are given.
b. Our scheme achieves better performance than any single view in most cases, which demonstrates that our proposed multi-view integration algorithm can choose the useful features and filter out the interfering ones.
c. As more labeled samples are given, the standard deviation shows a decreasing tendency in most cases, which demonstrates that more label information contributes to more stable performance.

Table 5. Recognition performance (mean accuracy ± standard deviation, %) on UMIST over 20 random splits. We report the results on unlabeled and testing samples with 1/2/3 labeled sample(s) per category. The best results are underlined.

Method     1 labeled                 2 labeled                 3 labeled
           Unlabeled    Test         Unlabeled    Test         Unlabeled    Test
MFA        –            –            72.5 ± 5.2   73.8 ± 3.1   86.1 ± 2.9   84.6 ± 3.2
GFHF       65.6 ± 5.2   –            78.1 ± 3.9   –            87.9 ± 2.5   –
LGC        66.5 ± 5.5   –            82.3 ± 2.6   –            83.8 ± 3.7   –
TCA        65.2 ± 5.2   64.9 ± 5.1   79.9 ± 4.1   81.9 ± 4.2   83.9 ± 4.1   83.6 ± 3.9
SDA        58.2 ± 5.3   58.9 ± 4.1   75.9 ± 3.6   76.1 ± 3.5   85.1 ± 3.3   86.7 ± 3.5
LapRLS/L   56.1 ± 5.6   58.9 ± 5.1   76.9 ± 4.3   76.3 ± 4.2   85.1 ± 2.9   84.7 ± 2.1
FME        65.0 ± 4.5   67.1 ± 5.4   82.7 ± 4.3   79.9 ± 4.2   88.9 ± 2.2   87.1 ± 3.1
MVSSDR     71.4 ± 3.5   73.5 ± 2.7   85.0 ± 2.9   83.0 ± 2.7   87.0 ± 1.7   86.9 ± 1.9
FMCSC      75.5 ± 3.3   75.9 ± 3.1   87.1 ± 2.9   85.6 ± 2.5   89.9 ± 1.0   91.2 ± 1.0
MVAN       75.7 ± 3.3   76.7 ± 3.8   85.5 ± 3.0   87.3 ± 2.9   90.1 ± 2.3   89.9 ± 3.3
Ours/v1    75.7 ± 4.0   75.8 ± 5.0   89.6 ± 2.6   89.3 ± 3.1   92.5 ± 3.1   93.1 ± 3.1
Ours/v2    36.9 ± 7.4   41.1 ± 5.0   66.6 ± 3.8   68.3 ± 2.9   73.7 ± 4.0   79.5 ± 4.2
Ours/v3    46.4 ± 3.6   45.4 ± 3.4   68.34 ± 4.2  66.3 ± 3.4   77.3 ± 4.1   77.1 ± 3.8
Ours/v4    48.7 ± 3.7   46.8 ± 3.8   69.2 ± 3.7   68.2 ± 3.2   78.7 ± 4.6   78.0 ± 3.1
Ours/v5    8.7 ± 3.3    25.6 ± 4.5   21.7 ± 5.9   55.8 ± 4.3   30.7 ± 6.0   72.3 ± 3.5
Ours/v6    46.0 ± 3.6   45.5 ± 3.2   68.2 ± 4.8   67.0 ± 3.5   78.9 ± 4.2   79.3 ± 3.3
Ours       77.1 ± 4.8   77.0 ± 5.3   88.8 ± 3.0   89.4 ± 2.5   91.6 ± 3.2   92.9 ± 2.8

Table 6. Recognition performance (mean accuracy ± standard deviation, %) on YALE-B over 20 random splits. We report the results on unlabeled and testing samples with 1/2/3 labeled sample(s) per category. The best results are underlined.

Method     1 labeled                 2 labeled                 3 labeled
           Unlabeled    Test         Unlabeled    Test         Unlabeled    Test
MFA        –            –            51.6 ± 4.8   49.1 ± 4.2   69.9 ± 2.4   70.6 ± 2.8
GFHF       28.5 ± 2.5   –            45.9 ± 3.3   –            55.2 ± 3.5   –
LGC        32.2 ± 3.2   –            45.1 ± 3.1   –            49.6 ± 3.3   –
TCA        42.5 ± 3.0   42.6 ± 3.1   73.6 ± 3.3   75.4 ± 3.1   82.5 ± 2.9   84.2 ± 2.3
SDA        39.2 ± 3.0   43.0 ± 3.1   75.2 ± 2.6   75.9 ± 3.2   84.8 ± 2.1   85.0 ± 2.1
LapRLS/L   55.1 ± 3.2   51.2 ± 3.3   74.1 ± 3.2   75.6 ± 2.9   83.2 ± 2.6   85.8 ± 2.1
FME        57.9 ± 3.3   51.2 ± 2.9   77.1 ± 2.2   75.0 ± 2.6   86.9 ± 2.1   87.6 ± 2.0
MVSSDR     55.2 ± 3.0   52.7 ± 2.8   82.8 ± 2.1   75.0 ± 2.2   87.1 ± 1.7   85.9 ± 1.9
FMCSC      61.9 ± 3.6   51.3 ± 3.5   82.1 ± 3.3   74.3 ± 3.3   87.3 ± 2.7   85.0 ± 2.8
MVAN       60.1 ± 4.2   51.4 ± 4.1   81.6 ± 3.8   72.0 ± 3.2   90.0 ± 2.6   86.4 ± 2.9
Ours/v1    55.8 ± 3.1   51.2 ± 3.5   80.0 ± 2.5   75.8 ± 1.2   88.6 ± 1.3   85.3 ± 1.0
Ours/v2    60.8 ± 3.5   55.2 ± 3.5   80.0 ± 2.2   76.8 ± 2.0   88.5 ± 1.2   86.3 ± 1.3
Ours/v3    41.3 ± 2.3   39.2 ± 2.7   63.6 ± 2.3   60.4 ± 2.7   75.5 ± 2.4   73.3 ± 2.6
Ours/v4    47.3 ± 3.3   49.1 ± 2.7   67.6 ± 1.5   65.4 ± 2.6   76.4 ± 1.4   77.6 ± 1.8
Ours/v5    21.9 ± 3.3   46.2 ± 2.9   41.6 ± 2.5   60.2 ± 3.6   55.4 ± 1.3   77.1 ± 0.8
Ours/v6    47.5 ± 2.4   44.4 ± 2.7   71.0 ± 2.0   66.7 ± 2.7   81.9 ± 1.8   79.3 ± 2.2
Ours       63.4 ± 2.6   52.8 ± 2.7   84.1 ± 1.6   76.1 ± 2.4   91.6 ± 1.5   87.6 ± 1.6

6.3. Unsupervised experiments

For the unsupervised multi-view dimensionality reduction experiment, we use the same (β, μ) as in the semi-supervised experiments and evaluate the performance over multiple feature dimensions: {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. Table 7 shows the results of the four approaches; the numbers in brackets are the optimal dimensions for the three datasets. We can easily conclude that our proposed scheme achieves a significant performance boost over the other unsupervised dimensionality reduction approaches.

Table 7. Recognition performance (mean accuracy ± standard deviation, %) of PCA, LPP, LPP-SR, and our proposed unsupervised scheme over 20 random splits on three datasets; the number in brackets is the optimal reduced dimension. The best results are underlined.

Method    CMU               UMIST             YALE-B
PCA       63.4 ± 1.6 (80)   85.5 ± 3.4 (50)   63.1 ± 1.3 (60)
LPP       87.7 ± 1.3 (80)   83.4 ± 3.5 (50)   58.3 ± 3.1 (60)
LPP-SR    84.7 ± 2.5 (80)   85.5 ± 3.2 (50)   63.5 ± 3.0 (60)
Ours      93.2 ± 2.5 (80)   90.4 ± 3.0 (50)   88.4 ± 2.5 (60)

6.4. Hyperparameter analysis

Finally, we analyze the influence of the hyperparameters μ and β in the semi-supervised task. We set the number of labeled samples to 3 and vary one of the parameters in the range {10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}} while fixing the other. Fig. 4 shows the performance trends with respect to β and μ, where the top row is for unlabeled data and the second row for test data.

Fig. 4. The accuracy for varying β. The first four plots show the accuracy on unlabeled data, and the last four the accuracy on unseen data, for the four datasets: (a) CMU, (b) COIL, (c) UMIST, (d) YALE-B.

In most situations, the bigger β, the higher the performance we can achieve, which demonstrates the importance of optimizing the label fitness. On the contrary, the recognition accuracy shows a negative tendency with respect to μ. We set {β, μ} = {10, 0.1}, {10, 1}, {10, 0.01}, {100, 0.1} for COIL, CMU, UMIST, and YALE-B, respectively. From Fig. 4 we can see that the performance is not very stable with respect to the various hyperparameters; however,
there exists a consistent rule throughout the eight experiments: a small μ combined with a large β leads to high accuracy. We can further fix a single group {β, μ} = {10, 0.1} for all the experiments at the cost of sacrificing a small amount of accuracy. This phenomenon illustrates the relative importance of the dimensionality reduction term \|X^T W + 1 b^T - F\|_F^2 versus the regularization term \|W\|_F^2.

7. Conclusion

In this paper, we proposed a unified and effective scheme to solve the multi-view dimensionality reduction problem under both semi-supervised and unsupervised scenarios. In particular, an adaptively weighted multi-view graph is employed to integrate the various information, which can discover persistent and complementary patterns. We further learn a linear regression projection for dimensionality reduction and penalize the discrepancy between the low-dimensional representation and the prediction vector to maximize their matching degree. In the optimization, we combine multi-view fusion and dimensionality reduction with graph regularization, which are jointly optimized and cooperate well with each other. Comprehensive experiments on four benchmark databases clearly demonstrate that our proposed scheme outperforms existing approaches.

Declaration of Competing Interest

None.

Acknowledgments

Our work was supported in part by the National Natural Science Foundation of China under Grants 61572388 and 61703327, in part by the Key R&D Program (The Key Industry Innovation Chain of Shaanxi) under Grants 2017ZDCXL-GY-05-04-02, 2017ZDCXL-GY-05-02, and 2018ZDXM-GY-176, and in part by the National Key R&D Program of China under Grant 2017YFE0104100.
References
[1] M. Yang, C. Deng, F. Nie, Adaptive-weighting discriminative regression for multi-view classification, Pattern Recognit. 88 (2019) 236–245.
[2] H. Zhang, V.M. Patel, R. Chellappa, Hierarchical multimodal metric learning for multimodal classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3057–3065.
[3] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: European Conference on Computer Vision (ECCV), 2012, pp. 746–760.
[4] S. Song, S.P. Lichtenberg, J. Xiao, SUN RGB-D: a RGB-D scene understanding benchmark suite, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[5] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision (IJCV) 60 (2) (2004) 91–110.
[6] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision (IJCV) 42 (3) (2001) 145–175.
[7] J. Xu, J. Han, F. Nie, X. Li, Re-weighted discriminatively embedded k-means for multi-view clustering, IEEE Trans. Image Process. (TIP) 26 (6) (2017) 3016–3027.
[8] F. Nie, G. Cai, X. Li, Multi-view clustering and semi-supervised classification with adaptive neighbours, in: AAAI Conference on Artificial Intelligence (AAAI), 2017, pp. 2408–2414.
[9] X. Liu, L. Huang, C. Deng, B. Lang, D. Tao, Query-adaptive hash code ranking for large-scale multi-view visual search, IEEE Trans. Image Process. (TIP) 25 (10) (2016) 4514–4524.
[10] X. Liu, L. Huang, C. Deng, J. Lu, B. Lang, Multi-view complementary hash tables for nearest neighbor search, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1107–1115.
[11] H. Hotelling, Relations between two sets of variates, Biometrika 28 (3/4) (1936) 321–377.
[12] A. Blum, T.M. Mitchell, Combining labeled and unlabeled data with co-training, in: Conference on Computational Learning Theory (COLT), 1998, pp. 92–100.
[13] U. Brefeld, T. Scheffer, Co-EM support vector learning, in: International Conference on Machine Learning (ICML), 2004.
[14] I.A. Muslea, Active learning with multiple views, J. Artif. Intell. Res. (JAIR) 27 (1) (2011) 203–233.
[15] S. Sun, F. Jin, Robust co-training, Int. J. Pattern Recognit. Artif. Intell. (IJPRAI) 25 (07) (2011) 1113–1126.
[16] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. (JMLR) 7 (2006) 2399–2434.
[17] S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J.A. Suykens, B. De Moor, Y. Moreau, L2-norm multiple kernel learning and its application to biomedical data fusion, BMC Bioinf. 11 (1) (2010) 309.
[18] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least squares support vector machines, Int. J. Circuit Theory Appl. (IJCTA) 27 (6) (2002) 605–615.
[19] J. Ye, S. Ji, J. Chen, Multi-class discriminant kernel learning via convex programming, J. Mach. Learn. Res. (JMLR) 9 (Apr) (2008) 719–758. [20] W. Liu, I.W. Tsang, Making decision trees feasible in ultrahigh feature and label dimensions, J. Mach. Learn. Res. 18 (2017) 81:1–81:36. [21] W. Liu, D. Xu, I.W. Tsang, W. Zhang, Metric learning for multi-output tasks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2) (2019) 408–422. [22] W. Liu, I.W. Tsang, K.-R. Müller, An easy-to-hard learning paradigm for multiple classes and multiple labels, J. Mach. Learn. Res. 18 (2017) 94:1–94:38. [23] X. Peng, J. Feng, S. Xiao, W-Y. Yau, J.T. Zhou, S. Yang, Structured AutoEncoders for Subspace Clustering, IEEE Trans. Image. Process. 27 (10) (2018) 5076–5086, doi:10.1109/TIP.2018.2848470. [24] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 19 (7) (1997) 711–720. [25] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis, J. Mach. Learn. Res. (JMLR) 8 (2007) 1027– 1061. [26] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 29 (1) (2007) 40–51. [27] Z. Huang, H. Zhu, J.T. Zhou, X. Peng, Multiple marginal fisher analysis, IEEE Trans. Industr. Electron. (2018) 1, doi:10.1109/TIE.2018.2870413. [28] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Trans. Neural Netw. (TNN) 17 (1) (2006) 157–165. [29] E. Yang, C. Deng, T. Liu, W. Liu, D. Tao, Semantic structure-based unsupervised deep hashing, in: International Joint Conferences on Artificial Intelligence (IJCAI), 2018, pp. 1064–1070. [30] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics Intell. Lab. Syst. 2 (1–3) (1987) 37–52. [31] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: IEEE International Conference on Computer Vision (ICCV), 2, 2005, pp. 1208–1213. [32] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. (PR) 43 (1) (2010) 331–341. [33] C. Deng, R. Ji, W. Liu, D. Tao, X. Gao, Visual reranking through weakly supervised multi-graph learning, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2600–2607. [34] D. Zhang, Z.-H. Zhou, S. Chen, Semi-supervised dimensionality reduction, in: Industrial Conference on Data Mining (ICDM), 2007, pp. 629–634. [35] M. Sugiyama, T. Idé, S. Nakajima, J. Sese, Semi-supervised local fisher discriminant analysis for dimensionality reduction, Mach. Learn. (ML) 78 (1–2) (2010) 35.
[36] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2004, pp. 321–328.
[37] X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions, in: International Conference on Machine Learning (ICML) Workshop, 2003.
[38] V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi, Linear manifold regularization for large scale semi-supervised learning, in: International Conference on Machine Learning (ICML) Workshop, 2005.
[39] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2002, pp. 585–591.
[40] J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra, Efficient projections onto the l1-ball for learning in high dimensions, in: International Conference on Machine Learning (ICML), 2008, pp. 272–279.
[41] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 27 (3) (2005) 328–340.
[42] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, in: Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2010.
[43] D. Cai, X. He, J. Han, Spectral regression for efficient regularized subspace learning, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[44] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression database, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 25 (12) (2003) 1615–1618.
[45] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, Columbia University, 1996.
[46] D.B. Graham, N.M. Allinson, Characterising virtual eigensignatures for general purpose face recognition, in: Face Recognition: From Theory to Applications, Springer Berlin Heidelberg, 1998.
[47] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 23 (6) (2001) 643–660.
[48] W. Liu, D. Tao, J. Liu, Transductive component analysis, in: IEEE International Conference on Data Mining (ICDM), 2008, pp. 433–442.
[49] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[50] F. Nie, D. Xu, I.W.-H. Tsang, C. Zhang, Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. (TIP) 19 (7) (2010) 1921–1932.
[51] F. Nie, J. Li, X. Li, Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification, in: International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 1881–1887.