Ideal regularization for learning kernels from labels


Neural Networks 56 (2014) 22–34


Binbin Pan a,b, Jianhuang Lai c,∗, Lixin Shen d

a College of Mathematics and Computational Science, Shenzhen University, Shenzhen, China
b School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou, China
c School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China
d Department of Mathematics, Syracuse University, Syracuse, NY 13244, USA

Article history: Received 20 November 2013; Received in revised form 30 March 2014; Accepted 23 April 2014; Available online 2 May 2014.

Keywords: Kernel methods; Regularization; Labels; Ideal kernel; Semi-supervised learning; von Neumann divergence

Abstract: In this paper, we propose a new form of regularization that is able to utilize the label information of a data set for learning kernels. The proposed regularization, referred to as ideal regularization, is a linear function of the kernel matrix to be learned. The ideal regularization allows us to develop efficient algorithms to exploit labels. Three applications of the ideal regularization are considered. Firstly, we use the ideal regularization to incorporate the labels into a standard kernel, making the resulting kernel more appropriate for learning tasks. Next, we employ the ideal regularization to learn a data-dependent kernel matrix from an initial kernel matrix (which contains prior similarity information, geometric structures, and labels of the data). Finally, we incorporate the ideal regularization into some state-of-the-art kernel learning problems. With this regularization, these learning problems can be formulated as simpler ones which permit more efficient solvers. Empirical results show that the ideal regularization exploits the labels effectively and efficiently.

1. Introduction In recent years, kernel-based machine learning methods have been successfully explored in various fields including face recognition, text classification, genomic data analysis, medical image analysis (see, e.g., Lanckriet, De Bie, Cristianini, Jordan, & Noble, 2004, Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002, Pan, Lai, & Chen, 2011; Pan et al., 2012). Kernel measures the similarity between the data in Reproducing Kernel Hilbert Space (RKHS). The performance of kernel-based methods critically relies on a kernel learned from the data associated with the given task. Good results are presented when we can effectively make use of the label information for learning the kernel (Cristianini, Shawe-Taylor, Elisseeff, & Kandola, 2002; Kwok & Tsang, 2003; Xiong, Swamy, & Ahmad, 2005). Algorithms for learning kernels from the available labels can be roughly classified into two categories. Algorithms in the first category incorporate the labels by optimizing certain criteria. Lanckriet et al. propose multiple kernel learning (MKL) where the kernel is expressed as a conic combination of several predefined kernels (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jor-




dan, 2004). The combination coefficients are learned by optimizing the regularized hinge loss or the alignment of the kernel with the ideal kernel. Order spectral kernel (OSK) learns non-parametric spectral transforms of the Laplacian kernel by maximizing an alignment between the kernel from the training data set and the ideal kernel themselves (Zhu, Kandola, Ghahramani, & Lafferty, 2005). The linear discriminant analysis (LDA) criterion is adopted aiming at learning a kernel to maximize the Fisher score in a feature space (Xiong et al., 2005; Yeung, Chang, & Dai, 2007). Other criteria include the quadratic loss between predictive function and labels (Bach & Jordan, 2005), Hilbert–Schmidt independence criterion (Song, Smola, Borgwardt, & Gretton, 2008). Algorithms in the second category exploit the labels via pairwise constraints. Examples include kernel information-theoretic metric learning (KITML) (Davis, Kulis, Jain, Sra, & Dhillon, 2007), non-parametric kernel learning (NPK) (Hoi, Jin, & Lyu, 2007), simple non-parametric kernel learning (SimpleNPK) (Zhuang, Tsang, & Hoi, 2011), and geometry-aware metric learning (GML) (Lu, Jain, & Dhillon, 2009). However, problems arise when the aforementioned algorithms are applied to large-scale data sets. MKL is formulated as quadratically constrained quadratic programming (QCQP) problems to which the interior-point algorithms can be applied (Lanckriet et al., 2004). However, interior-point algorithms are not practical for large-scale applications. Some studies focus on how to solve the


MKL problem efficiently, by reformulating the problem as a semi-infinite linear program (Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006) or a smooth convex optimization problem (Rakotomamonjy, Bach, Canu, & Grandvalet, 2008) to which more efficient solvers can be applied. Optimization problems using the kernel alignment criterion are often reformulated as a QCQP problem (Zhu et al., 2005), which is still impractical for large-scale data. A similar situation exists for the LDA criterion, which models a learning problem as a nonlinear fractional program (Yeung et al., 2007); this criterion is rewritten to make the problem easier to solve. The pairwise constraints incorporate the labels as linear constraints, and the resulting learning problems are solved via semi-definite programming (SDP) solvers (Hoi et al., 2007) or Bregman projection (Davis et al., 2007; Lu et al., 2009). However, when the number of constraints is large, it is time-consuming to solve the problems. For example, the time complexity of solving NPK per iteration is O(m²n²), where m is the number of constraints and n the number of data samples. If we have 100 training samples, m could be as high as 4950 since we add one constraint for each pair. This would restrict the NPK to small-scale applications. SimpleNPK reformulates the SDP problem as a saddle-point optimization problem to which a fast iterative algorithm can be applied (Zhuang et al., 2011). The Bregman projection algorithm randomly chooses one constraint and the resulting problem is solved exactly; this procedure is repeated until convergence. It turns out that more iterations will be required to reach a convergent solution when m becomes larger. Although some algorithms can be solved efficiently by their improved variants, there does not exist a principled way to effectively reformulate the problems. Also, some methods using the kernel alignment criterion or pairwise constraints, such as OSK, KITML and GML, are still restricted to small- and medium-scale problems.

The main idea of this paper is to make efficient use of label information by adding a new regularization term to the objective functions of the optimization problems associated with various algorithms. To this end, we first extend the definition of the ideal kernel from labeled data of binary classes to both labeled and unlabeled data of multiple classes. Based on this extended ideal kernel, we propose an ideal regularization which is a linear function of the kernel matrix to be learned. The linear form of the proposed regularization makes the use of label information efficient, and thus allows the learning problems to be solved easily. The following applications of the ideal regularization are presented.

• The label information is incorporated into standard kernels, such as Gaussian, polynomial and linear ones, by optimizing the standard kernels with ideal regularization.
• We present an algorithm to learn a kernel matrix from a prior kernel matrix by integrating the geometric structure and labels of the data. The learning problem is formulated as a convex program whose solution is achieved in closed form.
• We employ the ideal regularization to speed up four state-of-the-art kernel learning algorithms, namely, NPK, OSK, GML and KITML. The NPK with ideal regularization is reformulated as SimpleNPK with linear loss, which enjoys a closed-form solution that can be computed efficiently by the Lanczos sparse eigendecomposition. Also, the original OSK with QCQP can be reduced to a linear programming (LP) problem which can be solved more efficiently. The most time-consuming operation in GML is the Bregman projection; with the help of the ideal regularization, the corresponding problem has a closed-form solution and the core operations involved are one sparse eigendecomposition and two matrix–vector multiplications. KITML is also solved with Bregman projection. Using the ideal regularization, the reformulated KITML can be computed in closed form, where the core computations are two eigendecompositions.


1.1. Notation

Vectors are denoted by bold lower case letters and matrices by upper case ones. The ℓ2-norm of a vector x in R^n is defined as ∥x∥ := (Σ_{i=1}^n |x[i]|²)^{1/2}, where x[i] denotes the ith entry of x. We use I to denote the identity matrix, 0 to denote a vector or matrix with all zeros, and 1 to denote a vector or matrix with all ones. The dimensions of I, 0, and 1 are not specified unless necessary. A vector x ≥ 0 means all the entries in x are no less than 0. We use rank(A) to denote the rank of the matrix A. The trace of a square matrix A in R^{m×m} is defined as tr(A) = Σ_{i=1}^m A[i, i], where A[i, i] is the ith diagonal entry of A. We use A ≽ 0 to denote the matrix A being symmetric and positive semi-definite. We define the inner product between two matrices A and B in R^{m×n} as ⟨A, B⟩_F := Σ_{i=1}^m Σ_{j=1}^n A[i, j]B[i, j], where A[i, j] and B[i, j] are the entries of A and B at row i and column j. The Frobenius norm of a matrix A is defined as ∥A∥_F := ⟨A, A⟩_F^{1/2}. Clearly, ∥A∥_F = (tr(AA⊤))^{1/2}, where the superscript ⊤ denotes the transpose operator. The exponential of a square matrix A is defined as exp(A) := Σ_{i=0}^∞ (1/i!) A^i, while the logarithm of A is defined as log(A) := Σ_{i=1}^∞ ((−1)^{i−1}/i) (A − I)^i.

The rest of the paper is organized as follows: in Section 2, we present the ideal regularization along with its theoretical justification. In Section 3, we derive some kernel learning algorithms that use the ideal regularization; furthermore, we employ the ideal regularization to accelerate some kernel learning problems, including NPK, OSK, GML and KITML. In Section 4, we discuss connections of the ideal regularization to related work. Numerical experiments are presented in Section 5 to demonstrate the effectiveness and efficiency of the proposed ideal regularization. Conclusion is drawn in the final section.

2. Ideal regularization

We restrict ourselves to transductive learning, where we predict the labels of unlabeled data by learning a function defined over the given labeled and unlabeled data. More precisely, let X be an input space and C be the set {1, 2, . . . , c} with c being the number of classes. We have available labeled data {(x1, y1), (x2, y2), . . . , (xℓ, yℓ)} and unlabeled data {xℓ+1, . . . , xn}, where xi, i = 1, 2, . . . , n, are in X and yj, j = 1, 2, . . . , ℓ, are in C. The pair (xi, yi) indicates that the sample xi is assigned to class yi. Our aim is to estimate the labels of xℓ+1, . . . , xn.

In kernel methods, it is desired to learn a positive semi-definite matrix K from the available data and labels, where the entry K[i, j] reflects the similarity between xi and xj. The matrix K, called a kernel matrix, can be seen as a Gram matrix generated by a kernel function k(·, ·) restricted to the given data, that is, K[i, j] = k(xi, xj), 1 ≤ i, j ≤ n. In the setting of transduction, the kernel matrix contains all the information for learning and prediction.

2.1. Extended ideal kernel

The ideal kernel is typically defined on training data from a binary-class problem (Cristianini et al., 2002). For given labeled data {(x1, y1), (x2, y2), . . . , (xℓ, yℓ)} with (xi, yi) ∈ X × {1, 2}, i = 1, 2, . . . , ℓ, we form a vector y ∈ R^ℓ whose ith entry y[i] is 1 if yi = 1 and −1 if yi = 2. The ideal kernel for the labeled data, denoted by T0, is defined as

T0 = yy⊤.        (1)

Clearly, if the samples xi and xj are from the same class, then the entry T0 [i, j] is equal to 1; otherwise −1. Hence, one could regard the ideal kernel as ‘experts’ giving their opinion on whether two given points belong to the same class or not. To have a more flexible ideal kernel, we will extend the definition of the ideal kernel for labeled data of binary class to a general situation where the labeled data are from c different classes.


Given labeled data {(x1, y1), . . . , (xℓ, yℓ)} and unlabeled data {xℓ+1, . . . , xn}, we define a matrix T in R^{n×n} as follows:

T[i, j] = +1, if 1 ≤ i, j ≤ ℓ and yi = yj;
T[i, j] = −1, if 1 ≤ i, j ≤ ℓ and yi ≠ yj;        (2)
T[i, j] = 0, otherwise.

Clearly, for a binary class, i.e., C = {1, 2}, the matrix T in (2) can be simply viewed as the one obtained by augmenting T0 in (1) by attaching n − ℓ zero columns and rows from its right and bottom, respectively. In this sense, we call T the extended ideal kernel of the data. The trace of T is equal to the number of labeled samples. The following result is about the spectral structure of T.

Theorem 1. Let C = {1, 2, . . . , c} and (y1, y2, . . . , yℓ) ∈ C^ℓ. The matrix T is given by (2). If for each i ∈ C the set {j : yj = i} is nonempty, then we have

rank(T) = 1 if c = 2, and rank(T) = c if c ≥ 3.

Furthermore, the inertia¹ of T is (1, 0, n − 1) if c = 2 and (c − 1, 1, n − c) if c ≥ 3.

Proof. From the definition of the matrix T, we know that T is symmetric and its last n − ℓ rows and columns are zeros. If yi = yj then the ith and jth columns of T are identical. Moreover, the set {j : yj = i} is nonempty for each i between 1 and c, which implies T has c different nonzero columns. Consequently, T can be transformed into a symmetric block diagonal matrix with c nonzero columns by applying elementary transformations. That is, there exists an invertible matrix P (involving row-switching and row-addition transformations) such that PTP⊤, the congruence transformation of T, has a form of

PTP⊤ = [ Tc  0 ; 0  0 ],

where Tc is the c × c circulant matrix with [1, −1, −1, . . . , −1] as its first row. Immediately, this leads to rank(T) = rank(Tc). A direct computation shows that 2 − c and 2 (with multiplicity c − 1) are the eigenvalues of Tc. Therefore rank(T) = c − 1 if c = 2 and rank(T) = c if c ≥ 3. We further have that the inertia of Tc is (1, 0, 1) if c = 2 and (c − 1, 1, 0) if c ≥ 3. By Sylvester's law of inertia (Golub & Van Loan, 1996), which states that the inertia of a symmetric matrix is invariant under any congruence transformation, we know that the inertia of T is (1, 0, n − 1) if c = 2 and (c − 1, 1, n − c) if c ≥ 3. The proof is complete. □

Theorem 1 indicates that the rank of the extended ideal kernel for a multiclass problem is equal to the number of the classes.

2.2. Ideal regularization

In this section, we will show how the extended ideal kernel can be exploited for the enhancement of learning a suitable kernel for the problem. To this end, we first present the following result.

Theorem 2. Let T be the extended ideal kernel for a multiclass problem. Consider the following optimization problem

max_{K≽0}  tr(KT)   s.t.  tr(K) ≤ 1.        (3)

Then the solution to (3) is the outer product of the normalized eigenvector corresponding to the largest eigenvalue of T.

1 The inertia of a square matrix is a triple consisting of the number of positive, negative, and zero eigenvalues.

Proof. We study the optimization problem (3) via its Lagrangian, which is

L(K; ν, Z) := tr(KT) − ν(tr(K) − 1) + tr(KZ),        (4)

where ν ≥ 0 is the Lagrangian multiplier for the constraint tr(K) ≤ 1 and Z ≽ 0 is the Lagrangian multiplier for the constraint K ≽ 0. The Karush–Kuhn–Tucker (KKT) conditions for the Lagrangian L(K; ν, Z) are:

T − νI + Z = 0,        (5)
tr(KZ) = 0.        (6)

Due to the positive semi-definiteness of K and Z, we have tr(KZ) = tr(K^{1/2}K^{1/2}Z^{1/2}Z^{1/2}) = tr(K^{1/2}Z^{1/2}Z^{1/2}K^{1/2}), where both K^{1/2} and Z^{1/2} are positive semi-definite. Hence, tr(KZ) = ∥K^{1/2}Z^{1/2}∥_F². It follows from (6) that K^{1/2}Z^{1/2} = 0. This, in turn, leads to KZ = 0 and therefore ZK = 0 as well. This indicates that K and Z are commutable and diagonalizable simultaneously by an orthogonal matrix P, i.e., both P⊤KP and P⊤ZP are diagonal matrices. From (5), P⊤TP = νI − P⊤ZP is a diagonal matrix as well. In summary, the columns of P are common eigenvectors of the matrices K, Z and T satisfying (5) and (6). Since T is given, P is available. We can assume that P⊤TP = diag(µ), where µ is a vector whose entries are the eigenvalues of the extended ideal kernel T and are arranged in nonincreasing order. Note that the trace of a matrix is equal to the sum of all eigenvalues of the matrix. We consider the following optimization problem

max_{λ≥0}  λ⊤µ   s.t.  λ⊤1 ≤ 1.        (7)

Clearly, if K⋆ is a solution to (3) and λ⋆ is a solution to (7), then P⊤ K⋆ P = diag(λ⋆ ). By Theorem 1, the first entry of µ should be a positive number. Hence λ⋆ [1] = 1 and λ⋆ [i] = 0 for i = 2, . . . , n. Thus, the solution to (3) is Pdiag(λ⋆ )P⊤ which is the outer product of the first column of the P. That is, the solution K⋆ of (3) is the outer product of the normalized eigenvector corresponding to the largest eigenvalue of T.  Corollary 1. Let T be the ideal kernel for a given binary class problem. Let K⋆ be the solution to the optimization problem (3). Then K⋆ = T/tr(T). Proof. For a binary class problem, its ideal kernel T equals yy⊤ where y is a vector with 1, −1 and 0 as its entries. Note that Ty = yy⊤ y = (y⊤ y)y. Hence, by Theorem 1, y⊤ y is the only positive and largest eigenvalue of T with y as its corresponding eigenvector. By Theorem 2, the solution K⋆ to the optimization problem (3) is yy⊤ /∥y∥2 . Since T = yy⊤ and tr(T) = tr(y⊤ y) = ∥y∥2 , we have that K⋆ = T/tr(T).  As indicated by Theorem 2 and Corollary 1, under the constraints of K being positive semi-definite and its trace bounded by 1, the solution which maximizes the trace of the product of K and T encodes the label information of the given data samples. Therefore, we can regularize a cost function for learning a suitable kernel K with the trace of KT serving as its regularization term. In the following section, for a given data set with T as its ideal kernel, we will use

Ω(K) := −tr(KT)        (8)

as a regularization term in various kernel learning problems to demonstrate its effectiveness in the resulting learning algorithms. In what follows, Ω(K) is called the ideal regularization associated with the ideal kernel T for a given data set.
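As an illustration (our own sketch, not code from the paper), the extended ideal kernel T in (2), the ideal regularization Ω(K) = −tr(KT), and the claim of Theorem 2 can be checked numerically with a few lines of NumPy; the helper names and the toy label list below are hypothetical:

```python
import numpy as np

def extended_ideal_kernel(labels, n):
    """Extended ideal kernel T of Eq. (2); labels[i] is the class of x_i, or None if unlabeled."""
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if labels[i] is not None and labels[j] is not None:
                T[i, j] = 1.0 if labels[i] == labels[j] else -1.0
    return T

def ideal_regularization(K, T):
    """Omega(K) = -tr(KT) as in Eq. (8)."""
    return -np.trace(K @ T)

# Numerical check of Theorem 2: maximize tr(KT) over K >= 0 with tr(K) <= 1
labels = [0, 0, 1, 2, None, None]      # hypothetical toy labels (3 classes, 2 unlabeled points)
T = extended_ideal_kernel(labels, len(labels))
w, V = np.linalg.eigh(T)               # eigenvalues in ascending order
v = V[:, -1]                           # normalized eigenvector of the largest eigenvalue
K_star = np.outer(v, v)                # the optimal K predicted by Theorem 2
print(np.trace(K_star @ T), w[-1])     # the two numbers coincide (= largest eigenvalue of T)
```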


3. Applications of ideal regularization

In this section, the notion of ideal regularization is used in learning kernels from labels in various applications. Firstly, we integrate the ideal regularization into standard kernels, making the new kernels more suitable for learning tasks. Secondly, we propose an algorithm to learn a kernel from a prior kernel, geometric structure, and label information of the data. Finally, we apply the ideal regularization to four kernel learning methods, namely, NPK, OSK, GML and KITML, with the aim of reducing the time complexity of these methods. In some algorithms, an initial kernel matrix K0 is given beforehand, which is computed from an initial kernel function k0(·, ·) such that K0[i, j] = k0(xi, xj), 1 ≤ i, j ≤ n.

3.1. Incorporating label information into standard kernels

Standard kernels, such as Gaussian, polynomial and linear ones, are usually easy to get. However, these kernels do not carry any label information of the given data. With the help of ideal regularization, we can exploit the label information for standard kernels in a unified framework. Given an initial kernel matrix K0 generated by a standard kernel, we consider the following problem to learn a new kernel matrix K:

min_{K≽0}  D(K, K0) + γ Ω(K),        (9)

where D(K, K0) measures the discrepancy between the matrices K and K0, Ω(K) is the ideal regularization defined by (8), and the parameter γ is a positive number balancing D(K, K0) and Ω(K). In this paper, the von Neumann divergence Dvn(K, K0) := tr(K log K − K log K0 − K + K0) is adopted. As a consequence, problem (9) becomes

min_{K≽0}  tr(K log K − K log K0 − K + K0) − γ tr(KT).        (10)

Ignoring the positive semi-definiteness constraint for a moment and setting the derivative of the objective function in (10) with respect to K to zero, we have log K − log K0 − γT = 0. It follows that

K⋆ = exp(log K0 + γT),        (11)

which satisfies the positive semi-definiteness constraint. Hence, K⋆ given in (11) is the solution of (10). Formula (11) involves two eigendecompositions with complexity O(n³).
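A sketch of formula (11) in code, assuming SciPy's matrix functions and a symmetric positive definite K0 (the helper name and the default γ = 0.1, which mirrors the experimental setting in Section 5.1.3, are our own choices, not part of the paper):

```python
import numpy as np
from scipy.linalg import expm, logm

def ir_standard_kernel(K0, T, gamma=0.1):
    """Eq. (11): incorporate labels into a standard kernel, K* = exp(log K0 + gamma * T)."""
    K = expm(logm(K0) + gamma * T)
    return np.real((K + K.T) / 2)   # clean up tiny numerical asymmetry / imaginary parts
```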

3.2. Learning kernels from geometric structure and label information

Intuitively, a good data-dependent kernel should exploit the geometric structure and label information of the data. The label information can be utilized by ideal regularization, while the geometric structure is usually exploited by manifold regularization (Belkin, Niyogi, & Sindhwani, 2006). When there are few labels and the manifold assumption does not hold or holds only to a lesser degree, the kernel learned with this information alone is unlikely to be accurate. In this case, we integrate an initial kernel matrix in the learning algorithm. We therefore propose the following optimization problem for kernel learning:

min_{K≽0}  D(K, K0) + γ Ω(K) + δ Θ(K),        (12)

where D(K, K0) measures the discrepancy between the matrices K and K0, Ω(K) is the ideal regularization defined by (8), Θ(K) is a manifold regularization measuring the complexity of preserving the geometric structure, and the positive numbers γ and δ are regularization parameters.

The manifold regularization Θ adopted here was proposed in Belkin et al. (2006). Given a data set {x1, x2, . . . , xn}, we can build a weighted undirected graph with adjacency matrix W to describe the local neighborhood relations between the data points. The entries W[i, j] > 0 if the ith and jth data points are neighbors; otherwise W[i, j] = 0. The neighbor relations can be defined in terms of symmetric nearest neighbors or an ϵ-ball distance criterion. The nonzero weights in W can be chosen as W[i, j] = 1, or W[i, j] = exp(−∥xi − xj∥²/σ) where σ > 0. The associated normalized Laplacian matrix on the weighted undirected graph is

L = I − D^{−1/2} W D^{−1/2},        (13)

where D is the diagonal matrix with D[i, i] = Σ_{j=1}^n W[j, i] as its ith diagonal entry. Suppose that the data were sampled from a smooth manifold which is approximated by a graph G. We find a nonlinear mapping φ which embeds the graph G into an RKHS so that connected points stay as close together as possible. Therefore, the manifold regularization Θ (see Belkin et al., 2006) is defined as

Θ(K) = Σ_{i,j} ∥ φ(xi)/√D[i, i] − φ(xj)/√D[j, j] ∥² W[i, j] = tr(KL),        (14)

where φ is the nonlinear mapping associated with K, i.e., K[i, j] = φ(xi)⊤φ(xj). We use the von Neumann divergence for D(K, K0) again. With Θ defined in (14), problem (12) is rewritten as

min_{K≽0}  tr(K log K − K log K0 − K + K0) − γ tr(KT) + δ tr(KL).        (15)

With the same argument as the one used in developing (11), the solution to problem (15) is

K⋆ = exp(log K0 + γT − δL).        (16)
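For completeness, here is a small sketch (our own, under the 5-nearest-neighbour, {0, 1}-weight setting of Section 5.1.4) of the graph quantities entering (13)–(16); with T, K0, L in hand, the solution (16) is expm(logm(K0) + γT − δL), analogously to the sketch after Eq. (11):

```python
import numpy as np

def normalized_laplacian(X, k=5):
    """Normalized graph Laplacian (13) from a symmetric k-NN graph with {0,1} weights; X is (n, d)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # k nearest neighbours, skipping the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                                 # symmetrize the neighbourhood relation
    d = W.sum(axis=0)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(n) - Dinv @ W @ Dinv

def manifold_regularizer(K, L):
    """Theta(K) = tr(KL) as in (14)."""
    return np.trace(K @ L)
```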

3.3. Speeding up kernel learning algorithms

3.3.1. NPK with ideal regularization

The NPK is a technique to learn a fully nonparametric kernel matrix from geometric structure and side-information of the data (Hoi et al., 2007). Let {x1, x2, . . . , xn} be given data points. For a given collection of similar pairwise constraints S and a collection of dissimilar pairwise constraints D, we construct a matrix G ∈ R^{n×n} whose ij-th entry G[i, j] is 1 if (xi, xj) ∈ S, −1 if (xi, xj) ∈ D, and 0 if we cannot determine whether xi and xj are similar or not. With this matrix G, the resulting SDP for NPK is

min_{K≽0}  tr(KL) + γ Σ_{(i,j)∈(S∪D)} max(0, 1 − K[i, j]G[i, j]),        (17)

where γ > 0, and L is the normalized Laplacian matrix as mentioned in the previous subsection. The SDP of (17) can be solved by using standard software packages, such as SeDuMi/YALMIP (Löfberg, 2004). However, it would be computationally expensive to solve the problem when the number of samples or constraints is large. We replace the second term in (17) by our ideal regularization and impose an additional constraint to limit the trace of K, leading to the following problem:

min_{K≽0}  tr(KL) − γ tr(KT)   s.t.  tr(K^p) ≤ 1,        (18)

where the parameter p ≥ 1 controls the capacity of K. The above problem is exactly the SimpleNPK problem with linear loss (Zhuang et al., 2011). As shown in Zhuang et al. (2011), the SimpleNPK problem can be solved in closed form and the SimpleNPK achieves a comparable accuracy with the NPK while reducing the computational cost largely.
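As a concrete sketch of why (18) is cheap, consider p = 2 (the value used in Section 5.1.3): the problem then maximizes a linear function of K over the Frobenius ball intersected with the positive semi-definite cone, so the maximizer keeps only the nonnegative eigenvalues of γT − L (cf. the remark on SimpleNPK discarding negative eigenvalues in Section 5.2.1). The code below is our own illustration of that closed form, not the authors' implementation:

```python
import numpy as np

def simple_npk_linear(L, T, gamma=1.0):
    """Sketch of (18) with p = 2: maximize tr(K(gamma*T - L)) s.t. tr(K^2) <= 1, K >= 0."""
    A = gamma * T - L
    w, V = np.linalg.eigh(A)
    w_pos = np.maximum(w, 0.0)          # project the spectrum onto the PSD cone
    K = (V * w_pos) @ V.T
    norm = np.linalg.norm(K, 'fro')
    return K / norm if norm > 0 else K  # rescale to unit Frobenius norm
```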

3.3.2. OSK with ideal regularization

The OSK learns a nonparametric graph kernel by transforming the spectrum of the normalized Laplacian matrix L defined in (13) and aligning it with the ideal kernel T (Zhu et al., 2005). Let λi and vi, i = 1, 2, . . . , n, be the eigenvalues and the corresponding eigenvectors of L such that L = Σ_{i=1}^n λi vi vi⊤, where the eigenvalues λi are sorted in non-decreasing order. Since a smaller eigenvalue of L corresponds to a smoother eigenvector over the graph (Smola & Kondor, 2003), the OSK imposes order constraints on the eigenvalues so that smoother eigenvectors have larger eigenvalues in K. The OSK is expressed as the following convex optimization problem:

max_K  ⟨Ktr, Ttr⟩_F / (∥Ktr∥_F ∥Ttr∥_F)
s.t.  K = Σ_{i=1}^r µi Ki,  µi ≥ 0,  tr(K) = 1,  µi ≥ µi+1, i = 1, . . . , r − 1,        (19)

where r ≪ n, Ki = vi vi⊤, and Ktr and Ttr are the kernel matrix and the ideal matrix restricted to the labeled data points. The above problem is solved by formulating it as a QCQP problem (Zhu et al., 2005). The objective function (19) can be replaced with the ideal regularization, and thus the problem is reformulated as

max_µ  Σ_{i=1}^r tr(Ki T) µi
s.t.  µi ≥ 0,  Σ_{i=1}^r µi = 1,  µi ≥ µi+1, i = 1, . . . , r − 1,        (20)

where µ = (µ1, . . . , µr)⊤. This is an LP problem which can be solved more efficiently.

3.3.3. GML with ideal regularization

The GML introduces pairwise constraints to make use of side-information, while learning a nonparametric spectral transformation of the normalized Laplacian matrix L in (13) to exploit the geometry of the data (Lu et al., 2009). The pairwise constraints we used form a convex set K as follows:

K := { K ≽ 0 : K[i, i] + K[j, j] − 2K[i, j] ≤ s, (i, j) ∈ S;  K[i, i] + K[j, j] − 2K[i, j] ≥ d, (i, j) ∈ D },        (21)

where the definitions of S and D are the same as the ones in Section 3.3.1, and s and d are pre-specified constants. The intrinsic geometry of the data is described by a set M as follows:

M := { Σ_{i=1}^r µi vi vi⊤ | µ1 ≥ · · · ≥ µr ≥ 0 },        (22)

where vi are the orthonormal eigenvectors of L as defined in Section 3.3.2. The aim of GML is to find K through the optimization problem

min_{K∈K, M∈M}  Dld(K, M),        (23)

where Dld(K, M) := tr(KM⁻¹) − log det(KM⁻¹) − n, known as the LogDet divergence. In what follows, when the inverse of the matrix does not exist, we use its Moore–Penrose inverse instead. The problem (23) is not jointly convex with respect to K and M⁻¹, but is convex individually in K and M⁻¹. Hence, a procedure exploiting the ''block coordinate descent'' method (see, e.g., Bertsekas, 1999) for the problem is summarized in Algorithm 1.

Algorithm 1 (GML algorithm).
1. Initialize M1 ∈ M.
2. For t = 1, 2, . . .
   Kt+1 = arg min_{K∈K} Dld(K, Mt),        (24)
   Mt+1 = arg min_{M∈M} Dld(Kt+1, M).        (25)

Both sub-problems in the algorithm can be resolved with Bregman projection algorithms (see Lu et al., 2009 for the details of Algorithm 1). The most time-consuming operation of GML is the matrix–vector multiplication, with O(nr) complexity per iteration in solving (24), where r is the rank of M. When the number of constraints is large, it would be computationally intensive. As described in Section 1, 100 training samples could lead to 4950 constraints. This motivates us to replace the constraint K ∈ K by the ideal regularization term and to add one term to limit the trace of K. We arrive at the following problem:

min_{K≽0, M∈M}  Dld(K, M) − γ tr(KT) + δ tr(K),        (26)

where γ and δ are positive numbers. Problem (26) is also convex individually in K and M⁻¹, which enjoys the ''block coordinate descent'' method. Algorithm 2 summarizes the optimization procedure.

Algorithm 2 (GML algorithm with ideal regularization).
1. Initialize M1 ∈ M.
2. For t = 1, 2, . . .
   Kt+1 = arg min_{K≽0} Dld(K, Mt) − γ tr(KT) + δ tr(K),        (27)
   Mt+1 = arg min_{M∈M} Dld(Kt+1, M).        (28)

Note that the subproblem (28) in Algorithm 2 is the same as the subproblem (25) in Algorithm 1. The solution to subproblem (27) will be given in the Appendix.

3.3.4. KITML with ideal regularization

The KITML learns a kernel matrix by minimizing a LogDet divergence with the pairwise constraints (Davis et al., 2007). Given a standard kernel matrix K0, the optimization problem is

min_{K∈K}  Dld(K, K0),        (29)

where the definition of K is given in (21). Problem (29) can be solved with Bregman projection, where the computational complexity per iteration is O(n²). It would be computationally expensive when the number of constraints is large since Bregman projection will visit each constraint several times. To speed up KITML, we replace K ∈ K with the ideal regularization and change the LogDet divergence into the von Neumann divergence, leading to the following optimization problem

min_{K≽0}  Dvn(K, K0) − γ tr(KT),        (30)

with γ > 0. The reason that we use the von Neumann divergence here is to achieve a closed-form solution to problem (30). Note that problem (30) is exactly the problem in (10), which is solved in closed form as shown in (11).

4. Discussions

We will show how the ideal regularization connects to the related work: kernel alignment (Cristianini et al., 2002), the Hilbert–Schmidt Independence Criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005) and class separability. Finally, we will discuss the out-of-sample extensions for some IR-based methods.

4.1. Connection to kernel alignment

The kernel alignment criterion poses the following optimization problem

max_{K≽0}  tr(KT) / (tr(K²))^{1/2},        (31)

where T is the ideal kernel for a binary problem. The alignment criterion is invariant to scales. That is, if we scale K by a positive constant α, the resulting αK will have the same alignment as K. One way to remedy scale invariance is to fix the denominator to 1 and maximize the numerator. This is equivalent to optimizing the problem:

max_{K≽0}  tr(KT)   s.t.  tr(K²) = 1.        (32)

We have the following result.

Theorem 3. The solution to problem (32) is the outer product of the normalized eigenvector corresponding to the largest eigenvalue of T.

Proof. The Lagrangian function of problem (32) is

L(K; ν, Z) = tr(KT) − ν(tr(K²) − 1) + tr(KZ).        (33)

We get the KKT conditions as follows:

T − 2νK + Z = 0,        (34)
tr(KZ) = 0.        (35)

The second KKT condition indicates that T and Z can be simultaneously diagonalized with the same orthonormal eigenvectors. Furthermore, T has the same set of orthonormal eigenvectors as K by the first KKT condition. Let the eigendecomposition of T be Pdiag(µ1, . . . , µn)P⊤, and that of K be Pdiag(λ1, . . . , λn)P⊤, where the eigenvalues are arranged in non-increasing order. Problem (32) can be rewritten in terms of the eigenvalues

max_{λ≥0}  λ⊤µ   s.t.  λ⊤λ = 1,        (36)

where λ = (λ1, . . . , λn)⊤ and µ = (µ1, . . . , µn)⊤. Clearly, the solution λ⋆ is obtained by setting λ⋆[1] = 1 and λ⋆[i] = 0 for i = 2, . . . , n. To this end, the solution to problem (32) is the outer product of the normalized eigenvector corresponding to the largest eigenvalue of T. □

It is interesting to see that both problems (3) and (32) have the same solutions. But with other constraints included in the problem, we prefer the ideal regularization since there exists a quadratic constraint in the kernel alignment. Another limitation of kernel alignment is that it is defined on positive semi-definite matrices. However, the extended ideal kernel T for both labeled and unlabeled data of multiple classes is indefinite, which violates the definition of kernel alignment. So the ideal regularization can be used in more general cases.

4.2. Connection to Hilbert–Schmidt independence criterion

The Hilbert–Schmidt Independence Criterion (HSIC) measures the dependence between two random variables x and y by computing the Hilbert–Schmidt norm of cross-covariance operators (Gretton et al., 2005). Let F and G be two separate reproducing kernel Hilbert spaces associated with kernels k : X × X → R and u : Y × Y → R, respectively. Given n independent samples Z := {(x1, y1), . . . , (xn, yn)} ⊆ X × Y, an empirical HSIC is given by

HSIC(Z, F, G) = (n − 1)⁻² tr(KHUH),        (37)

where H, K, U ∈ R^{n×n}, K[i, j] := k(xi, xj), U[i, j] := u(yi, yj), and H := I − n⁻¹11⊤. H is called a centering matrix. A high value of HSIC means strong dependence between x and y. So the empirical HSIC can be used to measure the dependence of sample x and label y (Song et al., 2008). A positive semi-definite matrix U is established from the label information. Then we would like to learn a kernel matrix K to maximally depend on U while keeping some constraints. We thus study the following problem

max_{K≽0}  tr(KHUH)   s.t.  tr(K) ≤ 1.        (38)

It looks like problem (3), with the extended ideal kernel T replaced by the centered label matrix HUH. The following theorem relates these two problems.

Theorem 4. Given a positive semi-definite matrix U and centering matrix H, the following two problems are equivalent:

max_{K≽0}  tr(KU)   s.t.  tr(K) ≤ 1 and K1 = 0.        (39)
max_{K≽0}  tr(KHUH)   s.t.  tr(K) ≤ 1.        (40)

K≽0

That is, if the solution to (39) is K⋆ , then K⋆ is also the solution to (40). If the solution to (40) is KĎ , then HKĎ H is the solution to (39). Proof. Let K⋆ and KĎ be the solutions to problems (39) and (40), respectively. Clearly, K⋆ is also feasible for (40). The centering constraint K1 = 0 indicates tr(HK⋆ HU) = tr(K⋆ U). So we have tr(HKĎ HU) ≥ tr(K⋆ U). On the other hand, HKĎ H1 = 0 and tr(HKĎ H) = tr(KĎ H) = tr(KĎ ) − n−1 1⊤ KĎ 1 ≤ tr(KĎ ) ≤ 1. Then HKĎ H is feasible for (39). The optimality of K⋆ for (39) shows tr(K⋆ U) ≥ tr(HKĎ HU). Combining two inequalities gives tr(K⋆ L) = tr(HK⋆ HU) = tr(HKĎ HU), which completes the proof.  Maximizing the empirical HSIC amounts to additionally centering the embedded data for ideal regularization. So the ideal regularization can be converted to HSIC by adding one more constraint K1 = 0. One limitation of HSIC is that the label matrix U should be positive semi-definite. When there exists unlabeled data, it is not easy to define such positive semi-definite matrix. In contrast, the ideal regularization is more flexible. The extended ideal kernel T can be either positive semi-definite or indefinite. 4.3. Ideal regularization and class separability The kernel measures the similarity between the data in RKHS. By modifying the standard kernel with ideal regularization, the new kernel increases the similarity of the intra-class traininig data, and reduces the similarity of the inter-class training data. This is in line with the idea of LDA. We would like to show that increasing the class separability of training data could also improve the class separability of test data. i Let x be the mean of the ith class, x be the mean of all data points, xij be the jth data point in the ith class, n be the total number of data points, c be the number of class, ni be the number of data points that belong to the ith class. The between-class covariance matrix Sb and the within-class covariance matrix Sw are defined as Sb =

Sw =

c 1

n i=1

ni (x − x)(x − x)⊤ , i

ni c 1 

n i=1 j=1

i

(xij − xi )(xij − xi )⊤ .

(41)

(42)


Table 1
Value of J: Jtrain for the training data without ideal regularization; JIR-train for the training data with ideal regularization; Jtest for the test data without ideal regularization; JIR-test for the test data with ideal regularization.

# training per class   Jtrain   JIR-train   Jtest   JIR-test
10                     1.31     1.69        0.89    0.92
20                     0.80     1.47        0.89    1.08
30                     1.39     3.87        0.89    1.41
40                     0.82     3.72        0.89    1.65
50                     0.82     4.82        0.89    1.84

Table 2
Performance of SVM using four different kernels: linear kernel (Lin), linear kernel with ideal regularization (IR-Lin), Gaussian kernel (Gau), Gaussian kernel with ideal regularization (IR-Gau). Each cell reports the accuracy followed by the value of J for test data.

Data set      Lin            IR-Lin         Gau            IR-Gau
Heart         72.4 / 0.129   73.5 / 0.157   79.9 / 0.080   81.1 / 0.101
Ionosphere    80.1 / 0.056   81.8 / 0.069   82.0 / 0.072   87.8 / 0.101
Liver         55.2 / 0.012   55.7 / 0.012   52.8 / 0.010   53.5 / 0.011
Sonar         70.4 / 0.042   71.9 / 0.049   67.7 / 0.029   73.4 / 0.036
Wine          95.0 / 0.783   96.1 / 1.032   96.6 / 0.490   96.7 / 0.654

The Fisher score is used to measure the class linear separability:

J = tr(Sb) / tr(Sw).        (43)
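The Fisher score (41)–(43) is straightforward to evaluate; a minimal sketch (the helper name is our own):

```python
import numpy as np

def fisher_score(X, y):
    """Class separability J = tr(Sb) / tr(Sw) as in (41)-(43); X is (n, d), y holds class labels."""
    mean_all = X.mean(axis=0)
    n = len(X)
    tr_sb, tr_sw = 0.0, 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        tr_sb += len(Xc) * np.sum((mean_c - mean_all) ** 2) / n   # trace of the between-class part
        tr_sw += np.sum((Xc - mean_c) ** 2) / n                   # trace of the within-class part
    return tr_sb / tr_sw
```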

High value of J means different classes can be well-separated, while low value of J indicates the poor class separability. A toy example is given here to show the impact of ideal regularization on class separability. A data set containing twodimensional samples is generated and divided into two classes. One class has Gaussian distribution with mean (−1.6, 0)⊤ and covariance diag(2, 1). The other class also has Gaussian distribution with mean (1.6, 0)⊤ and covariance diag(1, 2). The number of test data for each class is fixed to 100, while the number of training data for each class is varied within the set {10, 20, 30, 40, 50}. We respectively compute the class separability (i.e., the value of J) of the training and test data, before and after the modification of the standard kernel with ideal regularization (formula (11)). The linear kernel is chosen as the standard kernel. Results are recorded in Table 1. Table 1 indicates that by incorporating the label information into the linear kernel, the new kernel significantly improves the class separability of the training data. The more labels we utilized, the more improvements in the class separability of training data we obtained. It also demonstrates that the ideal regularization is not too dependent on the training data. The class separability of the test data improves in the same manner as that of the training data to which the ideal regularization is used. As more labels are available, we get more improvements in the class separability of the test data. To further investigate the impact of ideal regularization on classification, we perform a classification experiment on five UCI data sets: Heart, Ionosphere, Liver, Sonar and Wine. The description of these data sets can be found in Table 3. Soft margin SVM classifier is used where the regularization parameter C = 1. The width of Gaussian kernel is set to the standard deviation of the data. For each class, we randomly selected 10 training samples. The results are averaged over 10 runs. Table 2 lists the accuracy and the values of J for test data. When the linear and Gaussian kernels are optimized with ideal regularization, both the accuracy and the values of J increase. We also note that a high value of J tends to yield high accuracy. These results demonstrate that the improvement of class separability for training data can be beneficial to the increase of class separability for test data, hence improving the performance of classification. 4.4. Out-of-sample extensions The kernel learning algorithms of Section 3 are restricted to learning in the transductive setting. That is, we assume that we have all the data up front, but labels are only available for some of the data. Such methods suffer when new data are added, since this would require re-learning the entire kernel matrix. In the true semi-supervised setting, the kernels are learned from the given labeled and unlabeled data but the prediction is made on the new data which are never encountered before. This requires us to generalize the kernel matrix to the corresponding kernel function

where the new data can be computed, which is known as out-of-sample extensions.

We discuss the out-of-sample extensions for the kernel matrices learned from Eqs. (11) and (16). We note that the learning problems (10) and (15) can be fit into the kernel learning framework proposed in Jain, Kulis, Davis, and Dhillon (2012). Given training data {x1, . . . , xn}, we generate an initial kernel matrix K0 ∈ R^{n×n} with the associated kernel function k0(s, t) = φ(s)⊤φ(t). The following optimization problem is studied in Jain et al. (2012):

min_{K≽0}  f(K0^{−1/2} K K0^{−1/2})   s.t.  gi(K) ≤ bi,  1 ≤ i ≤ m,        (44)

where f and gi are functions from R^{n×n} to R. It is shown there that if f can be written as f(A) = Σ_i fs(λi), where λ1, . . . , λn are the eigenvalues of matrix A and fs : R → R is a function, then problem (44) is equivalent to learning a linear transformation W on φ. Specifically speaking, let the solution to problem (44) be K⋆; then the kernel function k(·, ·) associated with K⋆ is of the form k(s, t) = φ(s)⊤Wφ(t). Moreover,

W = αI + Φ⊤SΦ,        (45)

where S = K0^{−1}(K⋆ − αK0)K0^{−1}, Φ = [φ(x1), . . . , φ(xn)], and α is the minimum of fs. We regard the von Neumann divergence and the regularization term in problems (10) and (15) as the function f and constraints in problem (44), respectively. The fs of the von Neumann divergence is fs(λ) = λ log λ − λ, whose minimum is −1. Based on the previous discussion, the kernel matrices learned from problems (10) and (15) can be extended to the kernel function of the form:

k(s, t) = φ(s)⊤Wφ(t) = −k0(s, t) + Σ_{i,j=1}^n S[i, j] k0(s, xi) k0(xj, t).        (46)
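A sketch of the out-of-sample rule (46) (our own illustration; k0 is the initial kernel function, X_train are the training points, and S is formed from K0 and the learned K⋆ as in (45) with α = −1):

```python
import numpy as np

def out_of_sample_kernel(k0, X_train, S):
    """Extension (46): k(s, t) = -k0(s, t) + sum_ij S[i, j] k0(s, x_i) k0(x_j, t)."""
    def k(s, t):
        ks = np.array([k0(s, xi) for xi in X_train])   # k0(s, x_i)
        kt = np.array([k0(xj, t) for xj in X_train])   # k0(x_j, t)
        return -k0(s, t) + ks @ S @ kt
    return k

# Forming S from K0 and the learned K_star (pseudo-inverses as a numerical safeguard):
# S = np.linalg.pinv(K0) @ (K_star + K0) @ np.linalg.pinv(K0)
```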

5. Experiments In this section, we essentially aim at illustrating two points. The first point is to show that the ideal regularization is effective in utilizing the labels of given data samples while the second point is to illustrate the efficiency of the ideal regularization. 5.1. Experimental setup 5.1.1. Data sets One toy example and 20 real data sets are used in our experiments. Among those real data sets, one is the ORL face data set, one is the US Postal Service (USPS) handwritten digit data set, and the rest are retrieved from the UCI repository. The main characteristics of the data sets are summarized in Table 3.


Table 3
Description of the data sets.

Data set        #data    #class   #feature
Balance         625      3        4
Breast          569      2        30
Glass           214      6        9
Heart           270      2        13
Ionosphere      351      2        34
Iris            150      3        4
Liver           345      2        6
Optdigits       1797     10       64
Pima            768      2        8
Protein         116      6        20
Segmentation    2310     7        18
Sonar           208      2        60
Soybean         47       4        35
Vehicle         846      4        18
Waveform        2000     3        21
Wine            178      3        13
Yeast           1484     10       8
Adult-a1a       1605     2        14
Adult-a2a       2265     2        14
Adult-a3a       3185     2        14
Adult-a4a       4781     2        14
Adult-a5a       6414     2        14
Adult-a6a       11220    2        14
ORL             400      40       4096
USPS            4000     10       256
USPS23          800      2        256

The Adult data set contains six subsets which have been used for large-scale learning (Joachims, 1999). The goal of this data set is to predict whether a household has an income greater than $50,000. The ORL data set includes 10 different images of each of 40 individuals. For some individuals, the images were taken at quite different scenarios such as at different times, varying lighting conditions, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). For the USPS data set, we select the first 400 images for each digit (totally 4000 images) to form the data set, while for the USPS23 data set, we take the 2 vs. 3 task, where the first 400 images for each digit were chosen. 5.1.2. Algorithms The following algorithms are used for comparisons:

• Gaussian kernel for classification.
• SimpleNPK (Zhuang et al., 2011) for classification.
• OSK (Zhu et al., 2005) for classification.
• GML (Lu et al., 2009) for classification.
• KITML (Davis et al., 2007) for classification.

We employ our ideal regularization to some of the above algorithms, adding prefix ideal regularized (IR) or ideal and manifold regularized (IMR) to indicate our algorithms:

• IR-Gaussian (formula (11)) for classification.
• IMR-Gaussian (formula (16)) for classification.
• IR-OSK (Section 3.3.2) for classification.
• IR-GML (Section 3.3.3) for classification.

KITML chooses the Gaussian kernel as its initial kernel while the GML and IR-GML select the diffusion kernel exp(−0.1L), where L is the normalized Laplacian matrix, as their initial kernels. 5.1.3. Parameters of algorithms The parameters of the algorithms are set as the recommended values from the corresponding papers or tuned in our experiments. Specifically, we set the following parameters:

• γ = 0.1 and δ = 1 for Eqs. (11) and (16).
• The width σ of the Gaussian kernel κG(x, y) = exp(−∥x − y∥²/(2σ²)) is set to the standard deviation of the data {x1, . . . , xn}, i.e., σ = ( (1/(n−1)) Σ_{i=1}^n ∥xi − x̄∥² )^{1/2}, where x̄ is the mean of the data.

• In SimpleNPK, γ = 1 and p = 2.
• We set γ = 1 and δ = λ + 10^{−3} for the IR-GML, where λ is the largest eigenvalue of the extended ideal kernel T.
• The slack variable is fixed to 1 for the KITML.
• We use the first 20 eigenvectors for the OSK, GML, IR-OSK and IR-GML.

• The Bregman projections for the GML and KITML at each subproblem halt when, either the convergence tolerance is lower than 0.001, or the number of iterations exceeds 10 times number of constraints. • Many paired constraints are unnecessary for the GML and KITML. For example, if xi and xj are in the same class, and xi and xk are in the different class, we can conclude that xj and xk are in the different classes. To speed up GML and KITML, we roughly choose 1/c of the total constraints, where c is the number of classes. 5.1.4. Other settings For all experiments, we use the normalized graph Laplacian with {0, 1} weights and 5 nearest neighbors except for the Adult and USPS data sets where we use 20 nearest neighbors. All the data are normalized to have zero mean and unit standard deviation for each feature. To compute the accuracy of classification, we adopt k-NN classifier with k = 3. The accuracy of classification is the average over 10 random choices of the labeled set. Each random set samples the same number of labels from each class. All experiments are conducted on a 64 bit Windows PC with 2.3 GHz CPU and 8 GB RAM, and implemented in Matlab. 5.2. Effectiveness of ideal regularization We consider experiments on transductive and semi-supervised settings. In the transductive setting, we use the given labeled and unlabeled data for training and subsequently predict the labels of unlabeled data. In the semi-supervised setting, we also use the given labeled and unlabeled data for training but predict the labels of unlabeled and new data. To compute the new data, we use the scheme proposed in Section 4.4. 5.2.1. Transductive setting We compare our algorithms with the Gaussian kernel, SimpleNPK, OSK, GML and KITML on 17 UCI, USPS23 and ORL data sets. For the data sets with a few training data of each class, we randomly select 5 training samples from each class; otherwise, we randomly choose 10 from each class. Table 4 summarizes the results, where the data sets are arranged in ascending order of size. We highlight the best accuracy in each row, and those that cannot be determined as different from the best, with paired t-test at significance level 0.05. As the size of data set increases, the proportion of training data decreases. For example, it is about 42% in Soybean data set while only 1.5% in Waveform data set. Thus, our data sets cover a wide range of proportion of training data. The results of IR-Gaussian are the same as or better than the one with Gaussian in 14 out of 19 cases, indicating that the ideal regularization is helpful in improving the performance of the algorithms for classification in most of the cases. The SimpleNPK sometimes achieves a comparable accuracy with the other algorithms when the proportion of training data is large. But its performance degrades for the data sets with small proportion of the training data. This is because that SimpleNPK discards the negative eigenvalues of γ T − L, and when T is very sparse (this is the case for small proportion of training data), most of the eigenvalues T are negative. The information loss


Table 4
Accuracy (%) on real data sets with the standard deviation following the accuracy. Each data set is followed by its (#data/#class/#training per class). Accuracies in boldface are the best as determined by a paired t-test at the 0.05 significance level.

Data set (#data/#class/#train)   Gaussian      SimpleNPK     OSK           GML           KITML         IR-Gaussian   IMR-Gaussian
Soybean (47/4/5)                 99.3 ± 1.6    91.1 ± 6.3    91.1 ± 6.3    86.3 ± 9.6    98.9 ± 1.8    98.5 ± 1.9    98.9 ± 1.8
Protein (116/6/5)                51.4 ± 6.4    50.5 ± 5.5    58.0 ± 6.9    52.6 ± 5.7    54.9 ± 6.2    52.4 ± 5.5    58.5 ± 5.1
Iris (150/3/10)                  92.8 ± 2.7    75.8 ± 8.9    93.8 ± 3.5    93.7 ± 3.9    94.3 ± 1.4    92.6 ± 2.9    95.0 ± 3.4
Wine (178/3/10)                  92.9 ± 1.9    56.6 ± 6.2    65.3 ± 6.2    63.0 ± 4.9    96.2 ± 1.4    95.1 ± 2.1    96.2 ± 1.4
Sonar (208/2/10)                 65.3 ± 5.2    64.7 ± 6.2    62.2 ± 7.1    63.5 ± 6.4    67.7 ± 4.2    67.2 ± 5.1    66.8 ± 7.1
Glass (214/6/5)                  50.3 ± 4.0    44.9 ± 6.5    46.3 ± 5.5    44.4 ± 8.1    50.3 ± 6.4    49.8 ± 3.8    48.9 ± 5.2
Heart (270/2/10)                 77.4 ± 7.2    58.6 ± 4.4    59.1 ± 4.7    57.0 ± 5.4    98.9 ± 1.8    77.6 ± 5.7    75.8 ± 8.4
Liver (345/2/10)                 51.8 ± 4.5    54.7 ± 7.9    53.5 ± 5.4    54.2 ± 4.2    56.9 ± 5.2    52.3 ± 4.1    53.1 ± 5.1
Ionosphere (351/2/10)            76.7 ± 5.4    71.5 ± 6.7    64.3 ± 10.6   65.3 ± 7.5    76.6 ± 4.7    78.3 ± 5.1    81.6 ± 4.9
ORL (400/40/5)                   77.0 ± 3.0    70.0 ± 4.9    69.4 ± 3.5    66.7 ± 4.2    77.7 ± 2.2    77.4 ± 4.2    78.8 ± 3.2
Breast (569/2/10)                91.2 ± 1.6    67.1 ± 8.0    85.6 ± 2.9    78.9 ± 10.0   91.1 ± 3.5    91.5 ± 1.4    94.5 ± 0.8
Balance (625/3/10)               57.0 ± 3.8    32.2 ± 15.7   57.6 ± 5.3    64.5 ± 5.3    58.0 ± 4.9    58.7 ± 3.4    59.0 ± 3.2
Pima (768/2/10)                  65.3 ± 5.3    49.9 ± 10.8   59.2 ± 7.4    58.9 ± 4.4    64.2 ± 5.8    65.8 ± 5.3    66.2 ± 5.0
USPS23 (800/2/10)                86.1 ± 3.7    76.2 ± 4.0    97.4 ± 1.3    96.3 ± 3.4    90.8 ± 3.9    86.8 ± 3.3    91.2 ± 4.3
Vehicle (846/4/10)               49.9 ± 3.4    34.5 ± 1.3    41.3 ± 3.6    43.8 ± 2.9    48.3 ± 4.2    49.9 ± 3.5    49.6 ± 3.7
Yeast (1484/10/5)                41.1 ± 3.6    14.8 ± 10.4   39.9 ± 3.8    36.3 ± 4.0    40.5 ± 4.1    40.7 ± 3.9    41.3 ± 4.0
Optdigits (1797/10/10)           76.4 ± 3.0    27.1 ± 2.0    94.5 ± 1.0    96.2 ± 1.1    77.2 ± 0.6    75.4 ± 2.9    80.5 ± 2.5
Waveform (2000/3/10)             73.8 ± 2.5    39.5 ± 5.1    76.4 ± 2.1    74.1 ± 4.4    76.7 ± 2.2    75.8 ± 2.4    78.3 ± 2.0
Segmentation (2310/7/10)         78.4 ± 3.6    29.9 ± 1.4    79.7 ± 3.2    80.2 ± 3.5    75.4 ± 2.6    78.7 ± 3.1    79.2 ± 3.2

caused by neglecting lots of negative eigenvalues could lead to a degradation of its performance. Note that the OSK and GML learn data-dependent kernels using the manifold structure and labels of a given data set. When the data lie on or close to a manifold, such as the USPS23 and Optdigits data sets, these methods would give good results. However, when the manifold assumption holds to a lesser degree (this happens to many data sets), the performance of these methods would be degraded. KITML is a bit better than IR-Gaussian. Overall, the IMR-Gaussian achieves the best and most stable performance. The results of IMR-Gaussian are the best in 9 out of 19 cases, and the best as determined by the paired t-test in 14 out of 19 cases. To further demonstrate the impact of a manifold assumption, the USPS23 and Wine data sets are selected in our experiments. The Wine data set would have a poor manifold structure since OSK and GML do not work well on this data set. We change the value of δ (i.e., the contribution of the manifold structure) in the IMR-Gaussian, and depict the corresponding changes in accuracies. The results are shown in Fig. 1. In USPS23 data set, the increase of δ results in an improvement of accuracy. When δ is set to 100, the accuracy of the IMR-Gaussian is 96.2% which is close to that of the GML. When δ is set to a small value 0.01, the accuracy of the IMR-Gaussian is 86.9% which approximates that of the IRGaussian. The reversal situation happens to Wine data set, where high value of δ degenerates the performance. A small value 0.01 of δ results in 95.1% accuracy of IMR-Gaussian, closing to that of IR-Gaussian. However, when δ equals 100, the accuracy of IMRGaussian reduces to 68.7% which is a little higher than those of OSK and GML. Therefore, our IMR-Gaussian is more flexible. The

contribution of manifold structure could be controlled through parameter δ . If we have confidence in manifold assumption, we could set a large δ . Otherwise, we set a small δ . However, for the sake of fair comparison, δ is fixed in all the experiments as stated in Section 5.1.3. We conduct an experiment to see the impact of the amount of labels. For the Sonar, Liver and Ionosphere data sets, we randomly select s ∈ {5, 10, 15, 20, 25} training data from each class. The results are shown in Fig. 2. As the amount of labels increases, both the Gaussian and IR-Gaussian kernels achieve higher accuracies. IR-Gaussian performs consistently better than Gaussian. Moreover, the more training data we have, the larger difference of accuracies between IR-Gaussian and Gaussian we get. The results further show that, by utilizing the labels, the IR-Gaussian benefits more than the Gaussian from the training data. 5.2.2. Semi-supervised setting We compare the IR-Gaussian and IMR-Gaussian with Gaussian kernel on 15 data sets. We roughly choose 80% data to form the training set including labeled and unlabeled data, while the remaining data are for test. For the data sets with a few training data of each class, we randomly select 5 training samples from each class; otherwise, we randomly choose 10 from each class. We predict the labels of unlabeled and test data to see how well the IR-Gaussian and IMR-Gaussian extend to new out-of-sample data. Table 5 summarizes the results. We highlight the best accuracy for unlabeled and test data. For IR-Gaussian, the performance on test data is almost indistinguishable to or even better than that on unlabeled data in 12

Fig. 1. The impact of manifold structure on accuracy. (a) USPS23. (b) Wine.

Fig. 2. Plots of accuracy vs. amount of labels. (a) Sonar. (b) Liver. (c) Ionosphere.

Table 5
Accuracy (%) on real data sets with the standard deviation following the accuracy. The best accuracies for unlabeled/test data are in boldface.

Data set      Gaussian (Unlabel)  Gaussian (Test)  IR-Gaussian (Unlabel)  IR-Gaussian (Test)  IMR-Gaussian (Unlabel)  IMR-Gaussian (Test)
Soybean       100.0 ± 0.0         100.0 ± 0.0      100.0 ± 0.0            98.0 ± 4.2          100.0 ± 0.0             100.0 ± 0.0
Protein       54.4 ± 3.9          52.9 ± 7.6       52.9 ± 6.5             53.8 ± 8.2          61.6 ± 8.4              47.5 ± 12.1
Iris          90.6 ± 3.2          94.3 ± 3.5       92.8 ± 2.8             82.7 ± 7.3          93.7 ± 2.6              75.7 ± 4.7
Wine          91.4 ± 2.6          91.1 ± 3.9       93.3 ± 2.1             95.3 ± 2.6          95.3 ± 1.8              94.4 ± 1.9
Sonar         63.9 ± 4.0          63.8 ± 8.8       70.7 ± 5.0             65.4 ± 8.3          69.7 ± 4.0              59.8 ± 9.1
Glass         48.1 ± 6.7          40.9 ± 8.6       47.9 ± 6.4             50.7 ± 8.3          48.1 ± 7.1              43.3 ± 9.4
Heart         77.6 ± 2.8          81.1 ± 3.0       80.2 ± 3.5             75.2 ± 7.6          78.7 ± 2.3              79.8 ± 6.4
Liver         53.0 ± 3.3          52.5 ± 5.9       55.2 ± 1.8             53.2 ± 5.8          52.9 ± 3.5              55.9 ± 4.7
Ionosphere    78.7 ± 6.1          72.3 ± 8.6       79.0 ± 6.0             77.6 ± 4.1          81.3 ± 5.0              83.7 ± 5.2
ORL           77.5 ± 2.5          74.8 ± 5.0       77.8 ± 4.8             82.9 ± 2.6          77.9 ± 4.0              78.3 ± 3.3
Breast        91.4 ± 1.1          95.0 ± 2.2       92.2 ± 0.9             92.6 ± 2.0          92.6 ± 1.3              92.6 ± 1.6
USPS23        88.4 ± 3.6          84.8 ± 4.4       87.8 ± 1.9             86.9 ± 2.5          91.7 ± 2.2              89.9 ± 3.4
Vehicle       54.2 ± 2.8          53.4 ± 5.2       50.8 ± 4.4             50.3 ± 5.1          49.8 ± 2.5              53.7 ± 5.9
Optdigits     85.2 ± 1.9          84.7 ± 2.7       84.5 ± 1.4             85.1 ± 1.3          87.0 ± 1.6              85.6 ± 2.7
Waveform      73.1 ± 2.5          74.6 ± 2.7       74.0 ± 4.2             73.3 ± 3.8          76.7 ± 2.1              79.5 ± 3.8

out of 15 cases. The out-of-sample performance is degraded on Iris, Sonar and Heart data sets. For IMR-Gaussian, the performance on test data is almost indistinguishable to or even better than that on unlabeled data in 11 out of 15 cases. The in-sample performance is significant better than out-of-sample performance on Protein, Iris, Sonar and Glass data sets. We see that for most of data sets, the IR-Gaussian and IMR-Gaussian extend to new out-of-sample data well. Overall, IMR-Gaussian gives the best performance. 5.3. Efficiency of ideal regularization Experiments in this subsection are divided into two parts. In the first part, we evaluate the algorithms on four small-scale data sets. The second part is to study the scalability of the algorithms on a large-scale data set. The algorithms with high computational complexities are only evaluated in the first part.

5.3.1. Comparisons on small-scale data sets

To evaluate the efficiency of ideal regularization, we compare the following paired algorithms: IR-Gaussian vs. KITML, IR-OSK vs. OSK, and IR-GML vs. GML. The time costs as well as the accuracies are recorded. The running time of IR-Gaussian and KITML is the elapsed time for solving the corresponding optimization problems; note that the solution to IR-Gaussian is obtained in closed form, while the solution to KITML is obtained by iteratively applying Bregman projections. The time cost of IR-OSK and OSK is the elapsed time for solving the LP and QCQP problems, respectively. For IR-GML and GML, we record the running time spent solving the corresponding optimization problems. The comparisons are shown in Table 6, where four UCI data sets are selected with varying numbers of labels. "Speedup" in the table is the factor that the IR-based method gains in computational cost over its counterpart (i.e., the counterpart's running time divided by the IR-based method's running time).
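For illustration, a minimal timing harness of the kind described above might look as follows. This is our own sketch, not the experimental code, and the solver handles in the commented lines are hypothetical.

```python
import time

def timed(solver, *args, **kwargs):
    """Return (elapsed wall-clock seconds, solver output) for one solver call."""
    t0 = time.perf_counter()
    out = solver(*args, **kwargs)
    return time.perf_counter() - t0, out

# hypothetical handles for one paired comparison on a data set:
# t_kitml, _ = timed(solve_kitml, K0, labels)
# t_irg,   _ = timed(solve_ir_gaussian, K0, labels)
# speedup = t_kitml / t_irg   # the "Speedup" entries in Table 6
```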

Table 6
Running time (seconds) and accuracy (%) on the data sets. Each algorithm cell reports running time / accuracy.

# Label  KITML          IR-Gaussian    Speedup  OSK           IR-OSK        Speedup  GML             IR-GML         Speedup

Heart (#data = 270)
20       0.61 / 77.6    0.14 / 78.2    4.7      0.36 / 59.1   0.28 / 57.6   1.3      5.93 / 55.4     0.03 / 54.3    197.7
30       1.41 / 78.0    0.16 / 79.9    8.8      0.41 / 61.3   0.27 / 60.3   1.5      13.45 / 57.5    0.03 / 59.4    448.3
40       2.54 / 78.7    0.15 / 79.1    16.9     0.48 / 61.8   0.29 / 59.8   1.7      24.20 / 59.2    0.04 / 61.0    605.0
50       3.96 / 79.2    0.15 / 78.6    25.7     0.54 / 62.5   0.25 / 59.4   2.1      38.05 / 60.3    0.04 / 62.3    951.3
60       5.73 / 76.6    0.16 / 75.9    36.3     0.68 / 61.4   0.26 / 61.0   2.6      54.62 / 60.9    0.04 / 61.3    1365.5

Liver (#data = 345)
20       0.98 / 56.9    0.31 / 52.3    3.1      0.39 / 53.5   0.25 / 53.0   1.5      9.77 / 54.8     0.05 / 53.6    195.4
30       2.21 / 59.4    0.29 / 54.1    7.5      0.39 / 52.6   0.27 / 51.8   1.4      21.86 / 55.8    0.05 / 53.6    437.2
40       3.93 / 59.1    0.28 / 55.9    14.1     0.44 / 53.1   0.27 / 50.8   1.6      39.06 / 55.7    0.06 / 52.7    651.0
50       6.27 / 61.3    0.28 / 59.8    22.2     0.50 / 55.1   0.26 / 51.6   1.9      60.78 / 55.5    0.05 / 55.2    1215.6
60       9.14 / 61.6    0.30 / 62.4    30.5     0.63 / 55.4   0.28 / 53.3   2.3      87.00 / 55.9    0.05 / 55.9    1740.0

Breast (#data = 569)
20       4.02 / 91.1    1.09 / 91.4    3.7      0.38 / 85.6   0.26 / 85.0   1.4      40.61 / 78.3    0.18 / 76.3    225.6
30       9.16 / 92.8    1.07 / 92.6    8.6      0.41 / 88.6   0.23 / 86.5   1.8      91.48 / 83.1    0.18 / 81.2    508.2
40       16.44 / 93.0   1.04 / 93.1    15.8     0.48 / 89.2   0.22 / 87.1   2.1      165.88 / 87.2   0.19 / 83.8    873.1
50       25.80 / 93.8   1.11 / 93.3    23.3     0.55 / 88.9   0.26 / 86.9   2.1      259.35 / 89.9   0.19 / 87.1    1365.0
60       37.32 / 94.3   1.03 / 93.5    36.2     0.65 / 90.1   0.26 / 88.0   2.5      375.69 / 89.5   0.18 / 89.5    2087.2

Pima (#data = 768)
20       7.54 / 64.2    2.51 / 65.8    3.0      0.37 / 59.2   0.25 / 59.8   1.5      74.51 / 58.9    0.35 / 60.0    212.9
30       17.20 / 64.9   2.47 / 67.7    7.0      0.39 / 59.8   0.23 / 58.8   1.7      170.24 / 60.5   0.40 / 61.2    425.6
40       31.00 / 65.6   2.51 / 68.7    12.4     0.45 / 61.9   0.25 / 61.6   1.8      305.72 / 62.7   0.38 / 62.3    804.5
50       50.36 / 66.8   2.47 / 69.4    20.4     0.49 / 64.6   0.26 / 62.2   1.9      481.45 / 64.1   0.37 / 64.9    1301.2
60       73.96 / 67.3   2.45 / 69.6    30.2     0.63 / 64.3   0.29 / 61.1   2.2      718.44 / 64.1   0.36 / 65.0    1995.7

In terms of accuracy, IR-Gaussian performs comparably to KITML on the Heart and Breast data sets. On the Liver data set, KITML performs slightly better than IR-Gaussian, while on the Pima data set, IR-Gaussian is slightly better than KITML. However, KITML is more time-consuming: as the amount of labels increases, its running time rises nearly quadratically, whereas the CPU time of IR-Gaussian is nearly invariant to the number of labels.

The accuracy of IR-OSK is a bit lower than that of OSK, while IR-OSK is about twice as fast as OSK on average. When the amount of labels increases, the problem size of IR-OSK is unchanged, so its running time stays almost the same. Similarly, from Eq. (27), the running time of IR-GML for solving the first subproblem is little influenced by the amount of labels. In contrast, GML needs more time for more training data. IR-GML gains roughly a factor of 2000 in computational cost over GML for 60 labels while achieving comparable accuracy. GML is the most time-consuming among the evaluated algorithms.

To depict the relation between CPU time and the amount of labels, we conduct an experiment on the Heart data set by varying the ratio of available labels to total data from 0.05 to 1 with even spacing. The results are shown in Fig. 3. Note that the CPU time in the first and third subfigures is plotted on a logarithmic scale.

We observe from the figure that the running times of KITML and GML are nearly quadratic functions of the amount of labels. The CPU time of OSK increases almost linearly as more labels become available. The results for the IR-based methods are encouraging: their CPU time is nearly independent of the amount of labels.

5.3.2. Comparisons on large-scale data sets

The Adult data set is chosen for large-scale learning. We select OSK, IR-OSK and IR-GML for evaluation, because these three algorithms are efficient, as shown in Section 5.3.1. OSK serves as the benchmark. We would like to see how the computational costs change when more data and labels are available. The CPU time as well as the accuracy and standard deviation are recorded in Table 7. The running time of OSK and IR-OSK is the elapsed time for solving the QCQP and LP problems, respectively; for IR-GML, the time cost is the elapsed time of Algorithm 2.

As shown in Table 7, the accuracies of IR-GML are higher than those of the other two methods. Also, IR-GML gains speedups of 1.5–4.5 times over OSK. IR-OSK achieves accuracies comparable to OSK, but it is more efficient than both OSK and IR-GML.

Table 7
Running time (seconds) and accuracy (%) on the Adult data set. Each algorithm cell reports running time / accuracy ± standard deviation. "Speedup" means the factor that the IR-based methods gained in CPU time over OSK.

Data set    # Label  OSK                IR-OSK             Speedup  IR-GML             Speedup
Adult-a1a   200      2.85 / 66.0 ± 2.6   0.27 / 64.9 ± 3.2   10.6     1.53 / 73.4 ± 2.0   1.9
Adult-a2a   400      11.95 / 66.2 ± 2.5  0.26 / 67.7 ± 1.5   45.3     3.02 / 73.1 ± 1.7   4.0
Adult-a3a   600      24.56 / 68.3 ± 0.7  0.25 / 65.7 ± 3.0   97.2     5.69 / 76.2 ± 1.1   4.3
Adult-a4a   800      41.53 / 68.7 ± 1.1  0.25 / 65.8 ± 4.3   165.3    11.70 / 75.7 ± 1.2  3.5
Adult-a5a   1000     63.79 / 68.2 ± 0.6  0.27 / 67.8 ± 2.2   239.1    20.18 / 76.6 ± 0.7  3.2
Adult-a6a   1200     90.55 / 68.7 ± 0.8  0.24 / 70.6 ± 1.6   371.6    60.41 / 76.5 ± 0.7  1.5

Fig. 3. Plot of CPU time vs. amount of labels on the Heart data set. The x-axis gives the ratio of available labels to total data. Panels: (a) IR-Gaussian vs. KITML; (b) IR-OSK vs. OSK; (c) IR-GML vs. GML.

The CPU time of IR-OSK is almost the same for varying numbers of data and labels. This is because the problem size of IR-OSK depends only on the number of eigenvalues, and is independent of the number of data and labels. In terms of computational cost, IR-OSK gains a factor of up to several hundred over OSK and IR-GML on the Adult-a6a data set.

Another experiment is conducted on the Adult-a6a data set. We vary the total number of data over {2000, 4000, 6000, 8000, 10 000, 11 220} and select 20% of the data for training, so the number of training data also increases as more data are used. We plot the running time of OSK, IR-GML and IR-OSK in Fig. 4; note that the CPU time is plotted on a logarithmic scale. We again see that the computational cost of IR-OSK is almost invariant with respect to the number of data and labels. In contrast, the running time of OSK and IR-GML increases nearly quadratically as more data and labels become available. IR-GML is slightly faster than OSK.

Fig. 4. Plot of CPU time on the Adult-a6a data set.

6. Conclusion

This paper showed that the ideal regularization can be easily adapted to various kernel learning algorithms to effectively exploit the label information of a data set. In particular, we employed this regularization to incorporate the label information into standard kernels, to learn a data-dependent kernel from the geometric structure and label information, and to accelerate some kernel learning algorithms. Experimental results demonstrate that our regularization exploits the labels effectively: the classification accuracies are significantly improved. Moreover, some existing algorithms, when modified by the proposed regularization, require less CPU time while obtaining comparable accuracies.

Several extensions of this work are possible. Although the regularization parameters γ and δ are fixed and work well in the experiments, more systematic procedures for choosing these parameters need to be developed. More applications of this regularization can also be considered.

The proposed regularization should also be made more flexible and robust so that it can cope with more complicated label settings, such as noisy labels (Yang, Jin, & Jain, 2010) and crowdsourced labels (Raykar et al., 2010).

Acknowledgments

This project was supported by NSFC grants (61173084, 61128009, 61272252), US National Science Foundation grants (DMS-0712827, DMS-1115523), the Guangdong Provincial Government of China (grant no. 2010.189) through the "Computational Science Innovative Research Team", and the Natural Science Foundation of SZU (grant no. 00035693).

References

Bach, F., & Jordan, M. (2005). Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd international conference on machine learning (pp. 33–40).
Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7, 2399–2434.
Bertsekas, D. (1999). Nonlinear programming. Belmont, MA: Athena Scientific.
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2002). On kernel-target alignment. In Advances in neural information processing systems, Vol. 14 (pp. 367–373).
Davis, J., Kulis, B., Jain, P., Sra, S., & Dhillon, I. (2007). Information-theoretic metric learning. In Proceedings of the 24th international conference on machine learning (pp. 209–216).
Golub, G., & Van Loan, C. (1996). Matrix computations. Johns Hopkins University Press.
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert–Schmidt norms. In Algorithmic learning theory (pp. 63–77). Springer.
Hoi, S., Jin, R., & Lyu, M. (2007). Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the 24th international conference on machine learning (pp. 361–368).
Jain, P., Kulis, B., Davis, J. V., & Dhillon, I. S. (2012). Metric and kernel learning using a linear transformation. The Journal of Machine Learning Research, 13, 519–547.
Joachims, T. (1999). Making large-scale SVM learning practical. In Advances in kernel methods: support vector learning (pp. 169–184).
Kwok, J., & Tsang, I. (2003). Learning with idealized kernels. In Proceedings of the 20th annual international conference on machine learning (pp. 400–407).
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5, 27–72.
Lanckriet, G., De Bie, T., Cristianini, N., Jordan, M., & Noble, W. (2004). A statistical framework for genomic data fusion. Bioinformatics, 20, 2626–2635.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. The Journal of Machine Learning Research, 2, 419–444.
Löfberg, J. (2004). YALMIP: a toolbox for modeling and optimization in Matlab. In 2004 IEEE international symposium on computer aided control systems design (pp. 284–289).
Lu, Z., Jain, P., & Dhillon, I. (2009). Geometry-aware metric learning. In Proceedings of the 26th annual international conference on machine learning (pp. 673–680).
Pan, B., Lai, J., & Chen, W. (2011). Nonlinear nonnegative matrix factorization based on Mercer kernel construction. Pattern Recognition, 44, 2800–2810.
Pan, B., Xia, J. J., Yuan, P., Gateno, J., Ip, H. H., He, Q., et al. (2012). Incremental kernel ridge regression for the prediction of soft tissue deformations. In Medical image computing and computer-assisted intervention, MICCAI 2012 (pp. 99–106). Springer.
Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. The Journal of Machine Learning Research, 9, 2491–2521.
Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L., et al. (2010). Learning from crowds. The Journal of Machine Learning Research, 11, 1297–1322.
Smola, A., & Kondor, R. (2003). Kernels and regularization on graphs. In Learning theory and kernel machines: 16th annual conference on learning theory and 7th kernel workshop, COLT/Kernel (pp. 144–158). Springer.
Song, L., Smola, A., Borgwardt, K., & Gretton, A. (2008). Colored maximum variance unfolding. In Advances in neural information processing systems, Vol. 20 (pp. 1385–1392).
Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. The Journal of Machine Learning Research, 7, 1531–1565.
Xiong, H., Swamy, M., & Ahmad, M. (2005). Optimizing the kernel in the empirical feature space. IEEE Transactions on Neural Networks, 16, 460–474.
Yang, T., Jin, R., & Jain, A. (2010). Learning from noisy side information by generalized maximum entropy model. In Proceedings of the 27th international conference on machine learning (pp. 1199–1206).
Yeung, D., Chang, H., & Dai, G. (2007). Learning the kernel matrix by maximizing a KFD-based class separability criterion. Pattern Recognition, 40, 2021–2028.
Zhu, X., Kandola, J., Ghahramani, Z., & Lafferty, J. (2005). Nonparametric transforms of graph kernels for semi-supervised learning. In Advances in neural information processing systems, Vol. 17 (pp. 1641–1648).
Zhuang, J., Tsang, I., & Hoi, S. (2011). A family of simple non-parametric kernel learning algorithms. The Journal of Machine Learning Research, 12, 1313–1347.

Appendix

To solve subproblem (27), we introduce the Lagrange multiplier matrix Z ≽ 0 and obtain the Lagrangian function

L(K, Z) = D_ld(K, M) − γ tr(KT) + δ tr(K) − tr(KZ).

By the KKT conditions, we have

−K^{−1} + M^{−1} − γ T + δ I − Z = 0,   tr(KZ) = 0.   (A.1)

From the proof of Theorem 2, Eq. (A.1) implies that K and Z can be simultaneously diagonalized by the same set of orthonormal eigenvectors. Let A = M^{−1} − γ T + δ I, and let the eigendecomposition of A be P diag(λ_1^{(A)}, …, λ_n^{(A)}) P^⊤, where the columns of P are the orthonormal eigenvectors and the λ_i^{(A)} are the corresponding eigenvalues. From the first KKT condition we also conclude that K and A share the same orthonormal eigenvectors; thus the eigendecomposition of K can be written as P diag(λ_1^{(K)}, …, λ_n^{(K)}) P^⊤, where λ_i^{(K)} > 0, for all i, are the eigenvalues of K. Ignoring the constant terms, subproblem (27) can be rewritten in terms of the eigenvalues as

min_{λ^{(K)} ≥ 0} f(λ^{(K)}),

where f(λ^{(K)}) = Σ_{i=1}^{n} λ_i^{(A)} λ_i^{(K)} − Σ_{i=1}^{n} log λ_i^{(K)} and λ^{(K)} = (λ_1^{(K)}, …, λ_n^{(K)})^⊤. Since λ_i^{(K)} > 0 for i = 1, …, n, the minimum of f(λ^{(K)}) exists if and only if λ_i^{(A)} > 0 for all i. In that case, the minimizer of f(λ^{(K)}) is obtained from

∂f(λ^{(K)})/∂λ_i^{(K)} = λ_i^{(A)} − 1/λ_i^{(K)} = 0  ⇒  λ_i^{(K)} = 1/λ_i^{(A)},

i.e., the minimizer is K⋆ = A^{−1}.

From Theorem 1, T can be decomposed as T = Q Λ^{(T)} Q^⊤, where Q ∈ R^{n×c} consists of orthonormal eigenvectors and Λ^{(T)} = diag(λ_1^{(T)}, …, λ_c^{(T)}) contains the eigenvalues λ_i^{(T)}. Since T is sparse and low-rank, this decomposition can be computed efficiently with the Lanczos sparse eigendecomposition. Then

K⋆ = (M^{−1} − γ Q Λ^{(T)} Q^⊤ + δ I)^{−1}
   = N + γ N Q ((Λ^{(T)})^{−1} − γ Q^⊤ N Q)^{−1} Q^⊤ N,   (A.2)

where N = (M^{−1} + δ I)^{−1} and the second equality follows from the Sherman–Morrison–Woodbury formula.

The inverse of the c × c matrix in Eq. (A.2) is not time-consuming. The spectral decomposition of M is known, so the computation of N is not time-consuming either. Solving (27) therefore mainly involves one sparse eigendecomposition and two O(n²c) matrix–vector multiplications. To ensure that the optimization problem is feasible, γ and δ should be chosen beforehand so that M^{−1} − γ T + δ I is positive definite; this holds whenever δ/γ is larger than the largest eigenvalue of T.
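As a concrete illustration of Eq. (A.2), the following is a minimal NumPy/SciPy sketch, not the authors' implementation; the function name is ours, and we assume T is supplied as a sparse symmetric positive semidefinite matrix of rank c and that δ/γ exceeds the largest eigenvalue of T.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def closed_form_K(M, T_sparse, gamma, delta, c):
    """Sketch of K* = (M^{-1} - gamma*T + delta*I)^{-1} via Eq. (A.2)."""
    n = M.shape[0]
    # Lanczos eigendecomposition of the sparse, rank-c matrix T: T = Q diag(lam) Q^T
    lam, Q = eigsh(T_sparse, k=c, which='LA')
    # feasibility condition from the appendix: delta/gamma must exceed lambda_max(T)
    assert delta / gamma > lam.max(), "increase delta (or decrease gamma)"
    # N = (M^{-1} + delta*I)^{-1}; a plain inverse is used here for brevity
    N = np.linalg.inv(np.linalg.inv(M) + delta * np.eye(n))
    # Sherman-Morrison-Woodbury step of Eq. (A.2)
    NQ = N @ Q                                          # n x c
    middle = np.diag(1.0 / lam) - gamma * (Q.T @ NQ)    # c x c
    return N + gamma * NQ @ np.linalg.solve(middle, NQ.T)
```

In the paper's setting the spectral decomposition of M is already available, so N can be formed by rescaling the eigenvalues of M rather than through the explicit inverses used above.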