Accepted Manuscript
Gaussian Process Approach for Metric Learning Ping Li, Songcan Chen PII: DOI: Reference:
S0031-3203(18)30357-1 https://doi.org/10.1016/j.patcog.2018.10.010 PR 6678
To appear in:
Pattern Recognition
Received date: Revised date: Accepted date:
30 September 2017 8 August 2018 9 October 2018
Please cite this article as: Ping Li, Songcan Chen, Gaussian Process Approach for Metric Learning, Pattern Recognition (2018), doi: https://doi.org/10.1016/j.patcog.2018.10.010
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights • propose a non-parametric metric learning approach (GP-Metric) based on
CR IP T
Gaussian Process (GP). • use GP to extend the bilinear similarity into a non-parametric form. • develop an efficient algorithm to learn the non-parametric metric.
AC
CE
PT
ED
M
AN US
• demonstrate the performance of GP-Metric on real-world datasets.
1
ACCEPTED MANUSCRIPT
Gaussian Process Approach for Metric Learning Ping Li, Songcan Chen∗
CR IP T
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China {ping.li.nj, s.chen}@nuaa.edu.cn
Abstract
AN US
Learning appropriate distance metric from data can significantly improve the
performance of machine learning tasks under investigation. In terms of the distance metric representation forms in the models, distance metric learning (DML) approaches can be generally divided into two categories: parametric and non-parametric. The first category needs to make parametric assumption on the distance metric and learns the parameters, easily leading to overfitting
M
and limiting model flexibility. The second category abandons the above assumption and instead, directly learns a non-parametric distance metric whose
ED
complexity can be adjusted according to the number of available training data, and makes the model representation relatively flexible. In this paper we follow the idea of the latter category and develop a non-parametric DML approach.
PT
The main challenge of our work concerns the formulation and learning of nonparametric distance metric. To meet this, we use Gaussian Process (GP) to extend the bilinear similarity into a non-parametric metric (here we abuse the
CE
concept of metric) and then learn this metric for specific task. As a result, our approach learns not only nonlinear metric that inherits the flexibility of GP but
AC
also representative features for the follow-up tasks. Compared with the existing GP-based feature learning approaches, our approach can provide accurate similarity prediction in the new feature space. To the best of our knowledge, this is the first work that directly uses GP as non-parametric metric. In the ∗ Corresponding
author Email address:
[email protected] (Songcan Chen)
Preprint submitted to Journal of Pattern Recognition
October 9, 2018
ACCEPTED MANUSCRIPT
experiments, we compare our approach with related GP-based feature learning approaches and DML approaches respectively. The results demonstrate the superior performance of our approach.
CR IP T
Keywords: metric learning, Gaussian Process, bilinear similarity, non-parametric metric
1. Introduction
Learning appropriate distance metric from data can significantly improve the
AN US
performance of machine learning tasks under investigation [1]. Since the early work in [2], distance metric learning (DML) has become an active research area 5
and has been widely used in many applications such as person reidentification [3, 4, 5], music recommendation [6], image retrieval [7, 8, 9], clustering analysis [10], etc. The goal of DML is to learn an application-specific distance metric that brings “similar” objects close together while separating “dissimilar” objects [11].
10
M
Following the formulation in [12], DML can be formulated as a weakly supervised learning problem with respect to a set of pairwise or triplet-based constraints. Based on this formulation, a variety of DML approaches have been devised by
ED
using different distance/similarity metric models to solve challenges in various domains. In terms of the distance metric representation forms in the models,
15
PT
DML approaches can be generally divided into two categories: parametric and non-parametric.
In general, the first category needs to make parametric assumption on the
CE
distance metric and then learns the corresponding parameters. For example, the state-of-the-art large margin nearest neighbor (LMNN) [13] method uses a
AC
parametric Mahalanobis distance as the distance metric and learns the param-
20
eter matrix in the training process. Although it results in explicit learning for the involved parameters, this is achieved partially at the expense of the model flexibility. In particular, when dealing with complex structured data, it may fail to learn the demanded distance metric, thus to some extent limiting the application scope. Another drawback in parametric DML is that there are a
2
ACCEPTED MANUSCRIPT
25
large number of parameters that need to be learned, especially in the high dimensional case [14]. Learning such large number of parameters from a finite number of samples often leads to the overfitting and thus poor generalization
CR IP T
performance [15, 16]. Although the regularization methods such as low-rank and sparse constraints can be used to improve the generalization performance 30
[13, 14, 15, 17], much more effort has to be made to select the values of the extra introduced hyper-parameters.
Compared with parametric DML, non-parametric DML tries to learn a nonparametric distance metric which can adaptively adjust its model complexity
35
AN US
according to the number of available training data and makes the model representation relatively flexible [18]. The number of parameters involved in it can change with the number of training samples [19], thus offering the possibility to minimize the risk of overfitting. However, to the best of our knowledge, so far there have had few researches concentrating on the non-parametric formulation of DML [20]. The main challenge of non-parametric DML is how to
design and learn such metric. To meet this, in this paper, we propose a novel
M
40
non-parametric DML approach (GP-Metric) by utilizing the non-parametric
paper are:
ED
characteristic of Gaussian Process (GP) [21]. The main contributions of this
PT
(1) using GP to extend the bilinear similarity into a non-parametric metric (here we abuse the concept of metric) and then adapting it to specific task.
45
CE
(2) proposing a practical method to reduce the complexity of optimization by utilizing the special structure of the non-parametric metric with the bilinear representation and designing a corresponding efficient two-step
AC
algorithm to learn the metric.
50
(3) empirically demonstrating the effectiveness of GP-Metric in multiple realworld data sets and showing that our GP-Metric has superior performance in dimension reduction and metric learning. The rest of this paper is organized as follows. In Section 2, we review the 3
ACCEPTED MANUSCRIPT
related metric learning and GP-based approaches. We describe our GP-Metric 55
method and illustrate how it can be derived from the bilinear similarity function in Section 3. We present our experimental results and analysis in Section 4 and
CR IP T
conclude in Section 5.
2. Related works 2.1. Distance metric learning
As mentioned in Section 1, DML can be carried out by using either paramet-
60
AN US
ric or non-parametric approach. Parametric DML has been extensively studied
in machine learning community and can be classified into two main categories: positive semi-definite (PSD) matrix parameterized Mahalanobis DML and ordinary matrix parameterized bilinear similarity learning. Mahalanobis DML con65
cerns on learning the PSD matrix involved in Mahalanobis distance according to different objective functions and yields various approaches, such as LMNN [13],
M
information theoretic metric learning (ITML) [22] and metric learning to rank approaches (MLR) [17, 23]. Recently, some variations of Mahalanobis DML are
70
ED
also proposed. For example, latent coincidence analysis (LCA) [24] based on latent variable model can also be viewed as a Mahalanobis distance metric learning model which finds a linear projection of high dimensional data by shrinking
PT
the distance between similarly labeled inputs and expanding the distance between differently labeled ones. One-pass closed-form solution for online metric learning (OPML) [25] decomposes the PSD matrix of Mahalanobis distance as the product of a low-rank matrix and its transposed matrix which can also be
CE 75
considered as the projection matrix that transforms the inputs into the new fea-
AC
ture space. Geometric mean metric learning (GMML) [26] uses a PSD matrix to measure the Mahalanobis distances of similar points and uses the inversion of this matrix to measure the Mahalanbis distances of dissimilar points. Different
80
from such a class of Mahalanobis DML, the bilinear similarity learning aims to learn a bilinear similarity function without the PSD constraint on the parameter matrix. This similarity is not necessary to be nonnegative and symmetry,
4
ACCEPTED MANUSCRIPT
thus can be extended to more general applications such as cross-modal similarity learning [27], query-dependent similarity learning [9] and online similarity 85
learning [8]. Although the performance of these above mentioned approaches is
CR IP T
typically superior to that of directly using standard metrics in practice, these approaches make strong assumptions on the parametric form of distance met-
rics, thus leading to the partial loss of flexibility. Furthermore, they usually
contain a large number of parameters when dealing with high dimensional data, 90
easily resulting in the overfitting. A common solution to this problem is imposing certain constraints (according to specific tasks) on these parameters to
AN US
limit the model complexity. For example, [13] uses a low-rank constraint of
the parametric matrix to improve the generalization performance. [28] adopts a sparsity-promoting constraint based on the `1 -norm to learn a sparse para95
metric matrix from high-dimensional data in an efficient and scalable manner. [15] models and learns the parametric matrix as the weighted combination of a set of low-rank “basis metrics” learned from the training data at different local
M
regions. Although these approaches can reduce the risk of the model overfitting, certain extra hyper-parameters have to be introduced (such as the rank of Mahalanobis matrix [13] or the regularization hyper-parameters [28]) and thus
ED
100
much more effort has to be made to select/learn these parameters. Compared with parametric DML, non-parametric DML, as described in Sec-
PT
tion 1, does not need to define the parametric form of distance metric, instead it can carry out a learning-from-data process to obtain the demanded nonparametric metric. Furthermore, the non-parametric DML can adaptively ad-
CE
105
just its complexity according to the number of available training data [19], thus making the model more flexible. For the GP-based DML method, it can also
AC
elegantly carry out nonlinear DML [15, 29, 30] and make prediction with uncertainty. The latter is particularly important in various situations such as medical
110
diagnosis [31], self-deriving cars [32] and so on. Although has these potentially attractive advantages, it remains relatively understudied in machine learning community due to the difficult of non-parametric metric designing and learning. It is based on the above-mentioned purposes, this paper tries to establish 5
ACCEPTED MANUSCRIPT
a non-parametric form metric based on GP and then adapts it to task under 115
investigation by learning from the training data.
CR IP T
2.2. GPR and GPLVM GP, as a Bayesian non-parametric approach, has been widely used in many machine learning scenarios. Compared with other machine learning methods, GP provides a flexible prior distribution over functions and enjoys analytical 120
tractability [33]. In Gaussian Process Regression (GPR) [34, 35], given the
training dataset D = {(x1 , y1 ), ..., (xN , yN )} (where xi ∈ RD and yi ∈ R are
AN US
the ith input vector and the corresponding response variable respectively), we can prove that the response variable y ∗ of a new test observation x∗ follows the Gaussian distribution
y ∗ ∼ N (Kx∗ X (KXX + σ 2 I)−1 y, Kx∗ x∗ − Kx∗ X (KXX + σ 2 I)−1 KXx∗ ) (1) 125
where X ∈ RN ×D and y ∈ RN denote the input and response variables of N
M
training samples; K(·, ·) denotes the kernel function with hyper-parameters θ;
σ denotes the standard variance of noise ∼ N (0, σ 2 ).
ED
Besides being used in regression, GP can also be applied in the nonlinear feature learning and dimensionality reduction tasks, which refers to the Gaussian 130
Process Latent Variable Model (GPLVM) [36]. In these learning scenarios, our
PT
goal is to learn a set of latent variables denoted by Z ∈ RN ×Q from a set of
observed variables X ∈ RN ×D (where N and D denote the number and the
dimension of observed variables respectively; Q denotes the dimension of latent
CE
variables). In general, Q D, thus realizing dimension reduction. In the
135
conventional GPLVM [36], we often assume that xnd is generated by a latent
AC
function with a noise process, xnd = fd (zn,: ) + nd where zn,: denotes the nth
row of matrix Z; nd is the noise following distribution nd ∼ N (0, σ 2 ). As
in GPR, we assume that these D latent functions {fd }D d=1 follow the same GP prior. Thus we can obtain the marginal likelihood of the observations p(X|Z, θ) =
D Y
d=1
1
1
1 2
(2π) |K + 6
1 σ 2 I| 2
T
e− 2 x:,d (K+σ
2
I)−1 x:,d
(2)
ACCEPTED MANUSCRIPT
140
where x:,d denotes the dth column of matrix X; K denotes the kernel matrix of latent variables; θ denotes the hyper-parameters involved in the kernel function and the noise distribution. We can maximize this marginal likelihood with
ˆ = arg max p(X|Z, θ) ˆ θ} {Z, Z,θ
CR IP T
ˆ and θˆ respect to Z and θ simultaneously to find the optimal value Z (3)
Recently, many extensions to the original GPLVM have been proposed, such 145
as Bayesian GPLVM [37], shared GPLVM [38], supervised GPLVMs [39, 40, 41, 42] and so on. Among these models, supervised GPLVMs have been demon-
AN US
strated to significantly improve the quality of feature learning. For example, [39]
develops a discriminative GPLVM (DGPLVM) by employing a prior distribution over the latent space that is derived from a Generalized Discriminant Analysis 150
(GDA) [43]. By this formulation, DGPLVM can obtain desirable generalization properties of generative approaches, while being able to better discriminate between classes in the latent space. [40] proposes a supervised GPLVM
M
(Supervised GPLVM) by assuming that labels and the input variables are independent, conditioned on the latent variables in the low-dimensional space. Hence the mappings from the latent variables to both input variables and la-
ED
155
bels can be established to promote the performance of original GPLVM. To the best of our knowledge, both DGPLVM and Supervised GPLVM are two state-
PT
of-the-art models in the context of GPLVM. Many recently proposed models just extend these two models for different applications. For example, [41] proposes a discriminative shared GPLVM approach (DS-GPLVM) for multiview
CE
160
and view-invariant classification by combining DGPLVM and shared GPLVM [38]. [44] replaces the discriminative prior in DGPLVM with a semi-supervised
AC
prior learned from the pairwise constraints (must-link and cannot-link) of samples and proposes a semi-supervised GPLVM (SSGPLVM). [42] proposes a su-
165
pervised GPLVM (D-SBGPLVM) for action sequences classification by combining Back-constrained GPLVM [45] with DGPLVM. [46] extends the Supervised GPLVM and proposes a hierarchical GP model for multi-kernel and multi-task learning. In fact, our GP-Metric can also be considered as a supervised GPLVM 7
ACCEPTED MANUSCRIPT
for non-linear feature learning. Different from the existing GP-based feature 170
learning approaches which focus on learning representative features, it learns these features and nonlinear metric simultaneously, thus providing accurate
CR IP T
similarity prediction in the new feature space. In Section 4.1, we will show that it achieves excellent results in such tasks as dimension reduction and data visualization.
175
3. GP-based metric learning
In this section, we show how bilinear similarity can be extended to a non-
3.1. Definition of GP-Metric
AN US
parametric metric and propose an efficient algorithm to learn this metric.
We begin with some basic notations. Let’s assume that there are N labeled 180
training samples {(x1 , y1 ), ..., (xN , yN )}, where xi ∈ RD and yi ∈ {1, 2, 3, ..., C}
M
denote the input variable and the label of the ith training sample respectively; C denotes the number of classes. In general, we can write these variables into matrix and vector: X = [x1 , ..., xN ]T and y = [y1 , ..., yN ]T . To simplify the
185
ED
formulation, we also introduce a set of latent variables Z = [z1 , ..., zN ]T where zi denotes the transformation of xi from original space X into space Z. In
bilinear similarity learning, we compute the similarities between the ith and the
PT
j th training samples as Si,j = ziT M zj , where Si,j can be generated by the label information and defined as Si,j = 1 if yi = yj and Si,j = −1 otherwise. Thus
CE
the similarity matrix can be written as follow: S = ZM Z T
(4)
Our goal is to learn the matrix M for estimating pairwise similarity in Eq.(4).
AC 190
In this paper, we adopt GP to extend Eq.(4) to obtain a non-parameter metric. Specifically, in order to derive the GP prior of the bilinear similarity function, we assume that M follows a matrix normal distribution M ∼ MN (0, U , V ), where U and V denote the row and the column covariance matrices respectively.
8
ACCEPTED MANUSCRIPT
195
For any matrix H and K, we define vec(H) to be the vector obtained by concatenating the columns of H. H ⊗ K denotes the Kronecker product of matrices H and K. From the definition of matrix normal distribution, we
CR IP T
can know that vec(M ) follows a multivariate normal distribution vec(M ) ∼ N (vec(0), V ⊗ U ). As a result, the bilinear similarity can be reformulated as 200
follow:
vec(S) = vec(ZM Z T ) = (Z ⊗ Z)vec(M )
(5)
Now S is a linear combination of Gaussian distributed variables and hence is itself Gaussian as well. We therefore learn this distribution by identifying its
are
205
AN US
mean and covariance. Let s = vec(S), then the mean and the covariance of s mean(s) = E[s] = E[vec(ZM Z T )] = 0
(6)
cov(s) = E[ssT ] = E[vec(ZM Z T )(vec(ZM Z T ))T ]
(7)
and
M
where E[ssT ] can be derived from
E[ssT ] = (Z ⊗ Z)E vec(M )(vec(M ))T (Z T ⊗ Z T )
ED
= (Z ⊗ Z)(V ⊗ U )(Z T ⊗ Z T ) = (ZV Z T ) ⊗ (ZU Z T )
(8)
PT
= ((ZC)(ZC)T ) ⊗ ((ZP )(ZP )T )
CE
In the second line of Eq.(8), we use the facts that M follows a matrix normal dis tribution and the covariance of vec(M ) is cov(vec(M )) = E vec(M )(vec(M ))T = V ⊗ U . In the fourth line of Eq.(8), since V and U are both positive definite,
210
we can respectively factorize them into the product of two matrices V = CC T
AC
and U = P P T where C and U are both N × N real matrices. Since both (ZC)(ZC)T and (ZP )(ZP )T are in fact two gram matrices of Z in the transformed spaces, by using the kernel trick, we can use two kernel matrices to replace the (ZC)(ZC)T and (ZP )(ZP )T . In this paper, we assume that M
215
is a symmetric matrix and V = U . Thus we can use the same kernel matrix
9
ACCEPTED MANUSCRIPT
(S)
KZZ to replace these two matrices. As a result, the bilinear similarity function can be extended into a non-parametric metric, by imposing a GP prior (S)
(S)
s ∼ GP(0, KZZ ⊗ KZZ ). Based on this non-parametric metric, we can create 220
CR IP T
our non-parametric metric learning approach. It is also worth to note that matrix M can also be asymmetrical. In this case, it can be used in cross-modal
(or view) bilinear similarity learning as in [27]. Correspondingly, our GP-Metric should also be constructed with respect to M for cross-modal learning, which may be our future work.
So far we just assume that S is noise-free. However, in many real-world applications, the observed similarity variables are often corrupted by noise. In
AN US
225
this paper, following the idea in GPR [34, 35], we also assume that such noise follows a gaussian distribution, since it leads to simple inference equations and a closed-form solution when integrating out the intermediate variables (refer to the later description). Furthermore, it has been demonstrated in [47, 48, 49] 230
that applying GPR directly to binary classification problems while ignoring the
M
discrete nature of the labels (i.e., Sij ∈ {1, −1}) can yield comparable results in performance. Based on the above assumption, we use a latent variable F to
ED
denote the clean counterpart of noisy S. Thus the corresponding distributions of S and F are
235
(S)
(S)
F ∼ MN (0, KZZ , KZZ )
(9)
PT
vec(S) ∼ N (vec(F ), σ 2 IN 2 ×N 2 ),
where IN 2 ×N 2 and σ 2 denote the N 2 × N 2 identity matrix and the variance of noise respectively. In the definition of GP-Metric, Z is modeled as a transfor-
CE
mation of X from original space X into space Z for which we use GPLVM as
AC
follows:
x:,d ∼ N (h:,d , 2 IN ×N ),
(X)
h:,d ∼ N (0, KZZ )
(10)
where x:,d and h:,d denote the dth columns of matrices X and the noise-free
240
(X)
matrix H respectively; 2 denotes the variance of gaussian noise; KZZ denotes
the kernel matrix related to X. Based on the above assumptions, we can see that the proposed GP-Metric still contains several hyper-parameters and many latent variables which should 10
ACCEPTED MANUSCRIPT
be learned from data. However, in Bayesian non-parametric statistics, the term 245
“non-parametric” does not imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed
CR IP T
in advance [19], such as Dirichlet Process [50] and Gaussian Process [21]. Furthermore, it has been demonstrated that GPLVM (with hyper-parameters and
latent variables) belongs to the aforementioned non-parametric class of models 250
[51, 52]. Thus, GP-Metric as an extension of GPLVM is still a non-parametric model and possesses its own advantages. First, it is more flexible than the para-
metric approaches and can be elegantly used in nonlinear similarity learning by (S)
(X)
AN US
using nonlinear kernel functions in both KZZ and KZZ . Second, it can also be
considered as a supervised dimension reduction by utilizing the label information 255
involved in S. In the experiment section, we will demonstrate that GP-Metric has a comparable performance with the widely used supervised GPLVMs such as DGPLVM [39] and Supervised GPLVM [40]. Third, GP-Metric can also provide uncertainty of prediction inherited from GPR as we will show in Section
M
3.3.
Next we derive out the likelihood function of GP-Metric. With the definition
260
ED
of GP-Metric, S and X are independent conditioned on Z. Thus, their joint marginal likelihood is
(11)
PT
p(S, X|θ, Z) = p(S|θ, Z)p(X|θ, Z)
where θ denotes all the hyper-parameters (σ, and hyper-parameters involved
CE
in the kernel functions) involved in the GP-Metric. Marginalizing out the noisefree latent variables F and H results in the following equations of p(S|θ, Z)
AC
and p(X|θ, Z) respectively,
p(S|θ, Z) =
(S) (S) exp − 21 sT (KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 )−1 s (S)
(S)
(2π)N 2 /2 |KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 |1/2 −1 D D exp − 1 xT (K (X) + 2 I ) x Y Y N ×N :,d ZZ 2 :,d p(X|θ, Z) = p(x:,d |θ, Z) = (X) N/2 (2π) |(KZZ + 2 IN ×N )|1/2 d=1 d=1
11
ACCEPTED MANUSCRIPT
Consequently, the joint log marginal likelihood is given by
CR IP T
N2 1 1 (S) (S) (S) (S) L = − sT (KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 )−1 s − ln(2π) − ln |KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 | 2 2 2 D 1X T D ND (X) (X) − x:,d (KZZ + 2 IN ×N )−1 x:,d − ln(2π) − ln |KZZ + 2 IN ×N | 2 2 2 d=1 (12) ˆ of θ and Z, we need To estimate the optimal values (denoted by θˆ and Z) 265
to maximize L. In general, we can compute the gradients of L and then use the gradient-based approach to learn these variables. However, both the computational and the memory complexities of the approach are so high that ordinary
AN US
computers are unaffordable. In next section, we first use the special property
of Kronecker product to formally reduce computational and memory complexi270
ties of L (and its gradients) and then design a corresponding efficient two-step algorithm to estimate the hyper-parameters. 3.2. Efficient hyper-parameter estimation
M
In this section, we provide a relatively efficient algorithm to estimate the hyper-parameters involved in GP-Metric. As mentioned in Section 3.1, although GP-Metric has a similar structure to both GPR and GPLVM, a direct implemen-
ED
275
tation of this algorithm (e.g., by using the gradient-based approach as realized in [36, 39, 40]) would lead to both higher computational and memory complex-
PT
ities. To reduce these complexities, first let us factorize L in Eq.(12) into the following two components, denoted respectively as Kronecker product term and GPLVM term,
CE
280
AC
and
1 1 Kronecker product term = − sT A−1 s − ln |A| 2 2
(13)
D
1 X T −1 D GP LV M term = − x:,d B x:,d − ln |B| 2 2
(14)
d=1
(S)
(S)
For convenience of derivations, let A = KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 and B = (X)
KZZ + 2 IN ×N . Now let us respectively analyze the above two components in
detail. Firstly, we can easily find that the GPLVM term has the same form as 12
ACCEPTED MANUSCRIPT
285
GPLVM. Thus we can directly compute it as in the conventional GPLVM [36] with computational and memory complexities of O(N 3 ) and O(N 2 ) respectively. Secondly, the Kronecker product term can be further factorized into two terms:
CR IP T
the quadratic term (sT A−1 s) and the log-det term (ln |A|). Both these two
terms have much higher computational and memory complexities of O(N 6 ) and 290
O(N 4 ) respectively if we ignore special Kronecker product structure. However, it is because of this special structure, in this subsection, that we can reduce the
computational and memory complexities of the term by three and two orders
with the Kronecker product:
295
AN US
of magnitude respectively. Specifically, we use the following identity associated (S) (S) KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 = (U ⊗ U )(D ⊗ D + σ 2 IN 2 ×N 2 )(U T ⊗ U T ) (15) (S)
where KZZ = U DU T is the eigenvalue decomposition; U is the square matrix
whose columns are its eigenvectors; D is the diagonal matrix whose diagonal elements are the corresponding eigenvalues in a deceasing order. Since both the
M
quadratic term and the log-det term contain the matrix A, we can reduce their complexities respectively by substituting Eq.(15) into these two terms as shown in the following subsections.
ED
300
3.2.1. Computation of the quadratic term
PT
By substituting Eq.(15) into the quadratic term, we can reduce the computation complexity of this term below: (16)
CE
sT A−1 s = (vec(U T SU ))T (D ⊗ D + σ 2 IN 2 ×N 2 )−1 (vec(U T SU ))
Similar to the above derivation, the gradient of the quadratic terms with respect to the hyper-parameter ζ (involved in the kernel functions and the latent variable
AC
305
Z) can also be computed efficiently in terms of ∂sT A−1 s = −2vec(U T SU )T (D ⊗ D + σ 2 IN 2 ×N 2 )−1 ∂ζ ! ! (S) T ∂KZZ 2 −1 T U unvecN,N (D ⊗ D + σ IN 2 ×N 2 ) vec(U SU ) D vec U ∂ζ (17) 13
ACCEPTED MANUSCRIPT
where unvecm,n (·) is the matricizing operator that transforms a column vector amn×1 into a matrix Am×n and unvecm,n (amn×1 ) = Am×n ⇐⇒ vec(Am×n ) = amn×1 . Moreover, the gradient with respect to σ 2 can be further simplified by
CR IP T
∂sT A−1 s ∂σ 2 I = − vec(U T SU )T (D ⊗ D + σ 2 IN 2 ×N 2 )−1 2 2 ∂σ σ
(18)
(D ⊗ D + σ 2 IN 2 ×N 2 )−1 vec(U T SU ) 310
3.2.2. Computation of the log-det term
Similar to the quadratic term, by utilizing Eq.(15), the log-det term can also
AN US
be efficiently computed as follows: ln |A| = ln |D ⊗ D + σ 2 IN 2 ×N 2 |
(19)
The gradient with respect to the hyper-parameter ζ is given by
(20)
M
−1 T ∂ ln |A| =2diag D ⊗ D + σ 2 IN 2 ×N 2 ∂ζ !! (S) T ∂KZZ U diag(D) ⊗ diag U ∂ζ
315
fied by
ED
Moreover, the corresponding gradient with respect to σ 2 can be further simpliN
N
∂ ln |A| X X 1 = 2 ∂σ 2 D D i,i j,j + σ i=1 j=1
(21)
PT
Based on these formulations in Subsections 3.2.1 and 3.2.2, the dominant computational costs of L and its gradients are the eigenvalue decomposition (S)
CE
of KZZ , which now costs O(N 3 ) instead of O(N 6 ). Furthermore, the memory
complexity of the Kronecker product term is also reduced from O(N 4 ) to O(N 2 ). 3.2.3. Efficient two-step algorithm
AC
320
Based on the above-mentioned derivations, we have already reduced the
computational and memory complexities in each iteration of the optimization algorithm. However, we still find that directly using the gradient-based method to learn the model often has a slow convergence rate and results in a local opti-
325
mum because of the non-convexity of the objective function L. To mitigate these 14
ACCEPTED MANUSCRIPT
problems, we provide an efficient two-step algorithm as shown in Algorithm 1. Specifically, the algorithm includes three main parts below: (1) Algorithm initialization. At the beginning of the Algorithm 1, we initialize
CR IP T
the hyper-parameter θ randomly within [0, 1] and train a probabilistic principal component analysis (PPCA) [53] model on X, then use the learned latent
330
variables to initialize Z. This trick has also often been used in GPLVMs to get better initial values for the latent variables [36, 40, 54].
(2) Feature learning. In this step, we use the RBF kernel in p(X|θ, Z) (i.e.,
AN US
p(X|Z, θ)RBF ) and the linear kernel in p(S|θ, Z) (i.e., p(S|Z, θ)linear ) for the dimension reduction task. By using the RBF kernel in p(X|θ, Z), we
335
can learn a set of representative nonlinear features. Meanwhile, by using the linear kernel function in p(S|θ, Z), the supervised information can also be embodied into the features with better visualization effect as we will show in the Section 4. We maximize L1 in (22) with respect to θ and Z by using
M
Eqs.(16)-(21).
340
(22)
ED
L1 = ln p(X|Z, θ)RBF + ln p(S|Z, θ)linear
(3) Similarity learning. We embed the features learned in the previous step to a RBF kernel in p(S|Z, θ) (i.e., p(S|Z, θ)RBF , note that this RBF kernel
PT
can be different from the RBF kernel in p(X|Z, θ)RBF ) and maximize L2 in (23) with respect to θ and Z by using Eqs.(16)-(21). In this way, we can obtain the relatively more optimal initial values for the laten variables (in
CE
345
the feature learning step) so that the similarity learning step can learn these
AC
variables with higher quality and less time. L2 = ln p(X|Z, θ)RBF + ln p(S|Z, θ)RBF
(23)
3.3. Prediction for new samples During the prediction process of DML, our goal is to predict the similarities
350
of the new sample pairs by using the learned model. In this paper, depending on 15
CR IP T
ACCEPTED MANUSCRIPT
Algorithm 1: Latent variables and hyper-parameters learning Input: The training samples {X, S}, the step size α and the
AN US
dimensionality of the latent variables Q.
Output: The hyper-parameter θ and the latent variable Z. 1
Initialize hyper-parameter θ randomly;
2
Initialize Z by using the PPCA approach on X;
3
Use a linear kernel and a RBF kernel in p(S|θ, Z) and p(X|θ, Z)
5 6
while L1 in (22) not converges do 1 θ+ = α ∂L θ ,
Embed the features learned in the previous step to a RBF kernel in p(S|Z, θ);
8
2 θ+ = α ∂L θ ,
2 Z+ = α ∂L Z ; // by using Eqs.(16)-(21).
return θ and Z;
AC
CE
9
while L2 in (23) not converges do
PT
7
1 Z+ = α ∂L Z ; // by using Eqs.(16)-(21).
ED
4
M
respectively;
16
ACCEPTED MANUSCRIPT
different tasks (dimension reduction and similarity metric learning), GP-Metric involves two main steps to predict the new test samples: the feature learning step and the similarity prediction step.
355
CR IP T
In the feature learning step, our goal is to learn the latent variable zi∗ for a
new sample x∗i . This can be achieved by maximizing the marginal likelihood of the test sample as follows:
ln p(x∗i |Z, zi∗ , θ) = ln N (x∗i |µ∗i , ∗2 i ID×D ) (X)
(X)
(24)
(X)
(X)
(X)
where µ∗i = X T (KZZ + 2 IN ×N )−1 kZz∗ , ∗2 = kz∗ z∗ − (kZz∗ )T (KZZ + i i
i
i
(X)
i
kZz∗ denotes the kernel vector k (X) (Z, zi∗ ).
The hyper-
AN US
(X)
2 IN ×N )−1 kZz∗ .
i
i
parameter θ and latent variable Z have already been learned in the model 360
learning step. After the above learning process, we have obtained more representative low-dimensional features which can in turn be used for the follow-up tasks such as clustering and classification.
For the similarity prediction task, the second step (i.e., the similarity predic-
365
M
tion step) can be taken to directly predict the similarity between two new test samples x∗i and x∗j . In this step, assuming that we have learned the latent vari-
ED
ables zi∗ and zj∗ corresponding to x∗i and x∗j respectively, thus we can compute the expectation and variance of their similarity by the following formulations:
PT
T (S) (S) µ(zi∗ , zj∗ ) = vec(U T unvecN,N (kZz∗ ⊗ kZz∗ )U ) i
D ⊗ D + σ 2 IN 2 ×N 2
−1
j
(25)
vec(U T SU )
AC
CE
T (S) (S) (S) (S) var(zi∗ ,zj∗ ) = kz∗ z∗ kz∗ z∗ − vec(U T unvecN,N (kZz∗ ⊗ kZz∗ )U )
370
i
i
j
2
j
D ⊗ D + σ IN 2 ×N 2
i
−1
vec(U
T
j
(S) unvecN,N (kZz∗ i
zi∗
(S)
(26)
⊗ kZz∗ )U ) j
zj∗
It is obvious that the similarity Szi∗ zj∗ between and follows a gaussian distribution Szi∗ zj∗ ∼ N µ(zi∗ , zj∗ ), var(zi∗ , zj∗ ) . One superiority of GP-Metric
is that it can provide not only the expectation of similarity but also uncertainty of this prediction (i.e., the variance) which may be utilized by the follow-up tasks. This mechanism plays an important role in many application scenarios 17
ACCEPTED MANUSCRIPT
where equipping the learning result with a measure of uncertainty may be as 375
important as the result itself [31, 32, 55]. The whole similarity prediction process is shown in Algorithm 2. For the dimension reduction task, its prediction process
CR IP T
is the same as Algorithm 2, except not including the similarity computation step in Eqs.(25) and (26). We can also see from Algorithm 2 that both its feature learning step and similarity prediction step have the computational and memory 380
complexities of O(N 3 ) and O(N 2 ) respectively which are the same as those of conventional GPLVM [36].
AN US
Algorithm 2: Similarity prediction of new samples Input: The training samples {X, S}, the new samples x∗i and x∗j , the latent variables Z and the hyper-parameter θ.
Output: The prediction mean µ(zi∗ , zj∗ ) and the variance var(zi∗ , zj∗ ).
2 3
for x∗ ∈ {x∗i , x∗j } do
Maximize (24) to learn z ∗ ∈ {zi∗ , zj∗ }.
Calculate the mean µ(zi∗ , zj∗ ) and the variance var(zi∗ , zj∗ ) by using Eqs.(25) and (26);
return µ(zi∗ , zj∗ ) and var(zi∗ , zj∗ );
ED
4
M
1
PT
4. Experiments and analysis In this section, we compare the proposed GP-Metric with those closely re-
CE
lated GP-based dimension reduction approaches and metric/similarity learn385
ing approaches. Specifically, we investigate the classification performances of these approaches by using k-nearest neighbor (k-NN) classifier which is widely
AC
used to validate the effectiveness of the learned features [39, 40] and metrics [8, 30, 13, 17]. All the experiments were run on computer with Intel(R) Core(TM) i5-3470 @ 3.20GHz CPU and 16.0 GB RAM.
18
ACCEPTED MANUSCRIPT
390
4.1. Dimension reduction To verify the performance of GP-Metric in data visualization and dimension reduction, we use the linear kernel function k(xi , xj ) = xTi xj in the GPLVM
CR IP T
term and the RBF kernel function in the Kronecker product term. With this setting, a set of representative features can be learned and visualized in the 395
low-dimensional space. We compare GP-Metric with five GPLVM approaches: conventional GPLVM 1 , Bayesian GPLVM 2 , DGPLVM 3 , Supervised GPLVM and D-SBGPLVM. We use the GP machine learning toolkit GPmat1 to imple-
ment GP-Metric, Supervised GPLVM and D-SBGPLVM. The Oil Flow data
400
AN US
set and another three UCI data sets (Banknote Authentication, Segment, Car-
diotocography) are used to validate the effectiveness. The detailed information of these data sets are shown in Table 1. In our experiments, these data sets are first normalized to meet zero mean and unit variance.
Table 1: Data sets for dimension reduction test. Banknote
Segment
Oil Flow
Cardiotocography
#training samples
M
Data set
1000
1386
1000
1000
#test samples
372
462
1000
1126
#features
4
19
12
19
#classes
2
7
3
3
PT
ED
Authentication
In data visualization task, we use the Oil Flow data set to verify the su-
CE
periority of the proposed GP-Metric. Specifically, we set the dimensionality of
405
latent variables to 2, and then train GP-Metric and the above-mentioned five models on it. The visualization effects of the approaches are shown in Figure 1.
AC
As we can see in Figure 1(a), the conventional GPLVM can not maintain the separability between classes, since it does not utilize the label information. The 1 https://github.com/SheffieldML/GPmat.git 2 https://github.com/SheffieldML/vargplvm 3 https://github.com/lawrennd/dgplvm.git
19
ACCEPTED MANUSCRIPT
Bayesian GPLVM in Figure 1(b) can learn relatively more discriminant features 410
by using a Bayesian formulation of GPLVM. And while DGPLVM, Supervised GPLVM, D-SBGPLVM and GP-Metric respectively in Figures 1(c), 1(d), 1(e)
CR IP T
and 1(f) learn more discriminant features in visuality by effectively utilizing the label information.
To further validate the effectiveness of GP-Metric, we evaluate the classifi415
cation performance of the learned features on the Oil Flow data set and another
three UCI data sets by using 1-NN and 5-NN classifiers respectively. Specif-
ically, we use 10-fold cross-validation to test the classification error of these
AN US
approaches. In each experiment, the training and the test data sets are randomly sampled from these data sets without replacement and the numbers of 420
samples involved in these data sets are shown in Table 1. The error rates (%) are respectively shown in Tables 2 and 3 where the best results are highlighted in bold. Moreover, the p-values of the non-parametric Wilcoxon rank-sum test are calculated to check the statistical significance. As we can see, by utilizing la-
425
M
bel information, DGPLVM, Supervised GPLVM, D-SBGPLVM and GP-Metric outperform both conventional GPLVM and Bayesian GPLVM. In most cases,
ED
GP-Metric makes lower classification error rate than the other five approaches, indicating that GP-Metric can better utilize the label information to improve
PT
the performance of classification. 4.2. Metric learning
In this subsection, we treat GP-Metric as a similarity learning approach and
430
CE
directly use it for the similarity prediction for any given pair of samples. The performance is verified by comparing GP-Metric with the following approaches:
AC
LMNN4 : Large margin nearest neighbor approach which makes k-nearest neighbors belong to the same class while samples from different classes are separated
435
by a large margin [13]. GB-LMNN3 : Gradient-boosting based metric learning approach that learns 4 https://bitbucket.org/mlcircus/lmnn/downloads/
20
ACCEPTED MANUSCRIPT
2
2
1.5 1
1 0.5
0
0 −0.5
CR IP T
−1
−1
−2
−1.5 −2
−3 −3
−2
−1
0
1
2
3
−2
−1.5
(a) GPLVM
−1
−0.5
0
0.5
1
1.5
2
(b) Bayesian GPLVM
1 0.4 0.3 0.5
AN US
0.2 0.1
0
0
−0.5
−0.1 −0.2
−1
−0.3 −0.4
−0.3
−0.2
−0.1
0
0.1
0.2
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
(d) Supervised GPLVM
M
(c) DGPLVM
−0.4
0.4
0.2
1.5
1
0.5
ED
0
−0.2
−0.6
PT
−0.4
−1
−0.5
0
0.5
0
−0.5
−1
−1.5
1
−3
−2
0
1
2
3
(f) GP-Metric
CE
(e) D-SBGPLVM
−1
Figure 1: Visualization of Oil Flow data set by using (a) GPLVM, (b) Bayesian GPLVM, (c) DGPLVM, (d) Supervised GPLVM, (e) D-SBGPLVM and (f) GP-Metric. ×, ◦ and +
AC
represent stratified, annular and homogeneous flows respectively.
nonlinear mappings directly in function space [30]. OASIS5 : Online algorithm for scalable image similarity, a bilinear model based 5 http://ai.stanford.edu/
~gal/Code/OASIS/
21
ACCEPTED MANUSCRIPT
Table 2: 1-NN classification error rates of GP-Metric, convetional GPLVM, Bayesian GPLVM, DGPLVM, D-SBGPLVM and Supervised GPLVM (best in bold). •/◦ indicates the performance of GP-Metric is significantly better/worse than the compared methods (at 0.05 signif-
CR IP T
icant level). #Dim denotes the dimensionality of latent variables.
Data set
#Dim
GPLVM
Bayesian GPLVM
DGPLVM
Banknote authentication
2
0.268±0.04•
0.224±0.04•
0.203±0.06•
Supervised GPLVM D-SBGPLVM 0.206±0.05•
0.193±0.07• 0.163±0.05
Banknote authentication
3
0.361±0.12•
0.348±0.03•
0.364±0.04•
0.306±0.05•
0.318±0.05• 0.210±0.07
Segment
2
9.090±0.30•
8.291±0.21•
9.320±0.25•
9.956±0.27•
8.979±0.34• 6.926±0.25
Segment
3
7.642±0.25•
6.584±0.19•
7.412±0.25•
5.410±0.24◦
Segment
4
9.523±0.31•
8.696±0.21•
8.414±0.24•
7.142±0.30
7.324±0.34
2
8.300±0.12•
6.500±0.08•
0.900±0.08•
0.800±0.09•
0.900±0.09• 0.600±0.07
3
8.900±0.14•
7.100±0.08•
0.900±0.11•
0.800±0.11•
0.900±0.10• 0.650±0.09
AN US
Oil Flow Oil Flow Cardiotocography
2
3.970±0.09•
2.754±0.07•
2.645±0.09•
2.398±0.08
2.613±0.09•
Cardiotocography
3
4.030±0.10•
3.810±0.07•
2.951±0.09•
2.867±0.10•
2.714±0.10• 2.060±0.10
Cardiotocography
4
2.688±0.09•
2.619±0.09•
3.584±0.07•
4.211±0.09•
3.167±0.09• 1.164±0.10
5
3.136±0.12•
3.127±0.10•
3.036±0.10•
3.046±0.09
3.002±0.10
M
MLR6 : Metric learning to rank, a general metric learning approach based on
ED
R-MLR6 : Robust structural metric learning approach, a robust extension to the metric learning to rank approach based on the group sparsity penalty [17]. SCML7 : Sparse combination of locally discriminative metrics approach, a non-
PT
linear metric learning approach that uses the combination of many local metrics[15]. LCA: Latent coincidence analysis (LCA), a latent variable based metric learn-
CE
ing approach [24].
OPML: One-pass metric learning, a one-pass closed-form solution for online metric learning [25]. GMML: Geometric mean metric learning, a mahalanobis distance metric learn-
AC 450
2.480±0.09
Cardiotocography
the structural SVM framework and various ranking measures [23].
445
7.475±0.27
8.139±0.37• 6.700±0.25
online metric learning approach [8]. 440
GP-Metric
ing approach which has a closed-form solution [26]. The experiments are conducted on three widely used data sets: USPS [56], 6 https://github.com/bmcfee/mlr/ 7 http://mloss.org/revision/download/2004/
22
2.957±0.10
ACCEPTED MANUSCRIPT
Table 3: 5-NN classification error rates of GP-Metric, convetional GPLVM, Bayesian GPLVM, DGPLVM, D-SBGPLVM and Supervised GPLVM (best in bold). •/◦ indicates the performance of GP-Metric is significantly better/worse than the compared methods (at 0.05 signif-
CR IP T
icant level). #Dim denotes the dimensionality of latent variables.
Data set
#Dim
GPLVM
Bayesian GPLVM
DGPLVM
Banknote authentication
2
1.531±0.09•
0.926±0.08•
0.269±0.18•
Supervised GPLVM D-SBGPLVM GP-Metric 0.272±0.06•
Banknote authentication
3
2.449±0.06•
1.948±0.05•
0.205±0.09•
0.189±0.07◦
Segment
2
10.173±0.46•
9.346±0.32•
12.253±0.16•
7.581±0.34•
10.591±0.21• 6.123±0.51 8.764±0.32• 6.331±0.34
Segment
3
14.296±0.35•
9.103±0.32•
10.163±0.31•
8.235±0.21•
4
14.297±0.38•
9.121±0.34•
9.324±0.22•
6.490±0.37
8.299±0.26• 6.431±0.39
Oil Flow
2
14.700±0.20•
6.900±0.15•
3.700±0.35•
1.400±0.16•
3.600±0.36• 0.500±0.18
7.500±0.30• 0.500±0.16
3
18.600±0.26•
8.200±0.14•
9.000±0.26•
2.400±0.19•
2
6.394±0.11•
3.491±0.09•
2.063±0.09
2.157±0.10•
2.347±0.10• 2.059±0.09
Cardiotocography
3
5.381±0.08•
3.226±0.09•
1.730±0.09•
3.495±0.18•
2.285±0.11• 1.384±0.09
AN US
Oil Flow Cardiotocography Cardiotocography
4
4.116±0.16•
2.719±0.09•
1.256±0.06•
3.852±0.13•
1.934±0.09• 1.156±0.07
Cardiotocography
5
4.267±0.12•
3.048±0.10•
1.521±0.09◦
3.239±0.10•
1.734±0.11
M
MNIST
ED
CIFAR10
PT
Figure 2: Example images of USPS, MNIST and CIFAR10 datasets.
MNIST [57] and CIFAR10 [58] as shown in Figure 2. Both USPS and MNIST are two well known digit recognition data sets, comprising of 16 × 16 and 28 × 28
CE
grayscale images respectively. For each of these two data sets, we select a subset of samples with class labels {1, 3, 5, 7, 9}. CIFAR10 is an image classification
AC
data set, comprising of 32 × 32 pixels color images. Each image belongs to
one of ten classes {airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck }. We select a subset of samples with class labels {automobile, cat,
460
0.201±0.08
Segment
USPS
455
0.237±0.20• 0.155±0.05
0.191±0.15•
dog, horse and truck }. Since the dimensionality of CIFAR10 is high, we use
23
1.798±0.10
ACCEPTED MANUSCRIPT
0.09
0.08 LCA LMNN GB−LMNN OASIS MLR
0.07
0.06 0.05 0.04 0.03 0.02
0.06 0.05 0.04 0.03 0.02 0.01
0.01 0 100
150
200
250
300
350
400
450
0 100
500
# of training samples in each class
150
0.1
0.07 0.06 0.05 0.04 0.03
400
450
500
0.08
R−MLR SCML OPML GMML GP−Metric
0.07 0.06 0.05 0.04 0.03
0.01 100
150
200
250
300
350
400
450
# of training samples in each class
LCA LMNN GB−LMNN OASIS MLR
0.66
R−MLR SCML OPML GMML GP−Metric
ED
0.64 0.62 0.6 0.58 0.56
PT
0.54 0.52 200
300
400
500
600
700
900
0.66
700
450
500
LCA LMNN
R−MLR SCML
GB−LMNN
OPML
OASIS MLR
GMML GP−Metric
0.62 0.6 0.58 0.56
R−MLR SCML OPML GMML GP−Metric
800
900
# of training samples in each class
(g) CIFAR10 SIFT (1-NN)
300
400
500
600
700
800
900
1000
(f) CIFAR10 PCA (5-NN)
1000
Classification error rate
CE 600
400
# of training samples in each class
0.5
500
350
0.64
0.52 200
1000
0.55
400
300
0.7
(e) CIFAR10 PCA (1-NN)
300
250
0.68
# of training samples in each class
0.45 200
200
0.54
800
LCA LMNN GB−LMNN OASIS MLR
150
(d) MNIST (5-NN)
M
0.7 0.68
0.02 100
500
# of training samples in each class
(c) MNIST (1-NN)
Classification error rate
350
LCA LMNN GB−LMNN OASIS MLR
0.09
Classification error rate
0.08
R−MLR SCML OPML GMML GP−Metric
Classification error rate
Classification error rate
0.09
300
AN US
LCA LMNN GB−LMNN OASIS MLR
0.02
Classification error rate
250
(b) USPS (5-NN)
0.11
AC
200
# of training samples in each class
(a) USPS (1-NN)
0.1
R−MLR SCML OPML GMML GP−Metric
CR IP T
Classification error rate
0.07
R−MLR SCML OPML GMML GP−Metric
Classification error rate
LCA LMNN GB−LMNN OASIS MLR
0.08
LCA
R−MLR
LMNN GB−LMNN OASIS
SCML OPML GMML
MLR
GP−Metric
0.55
0.5
0.45 200
300
400
500
600
700
800
900
1000
# of training samples in each class
(h) CIFAR10 SIFT (5-NN)
Figure 3: k-NN (k = 1 and k = 5) classification error rate on the USPS, MNIST, CI-
24 numbers of training samples. FAR10 PCA and CIFAR10 SIFT with different
ACCEPTED MANUSCRIPT
PCA and scale-invariant feature transform (SIFT)8 approaches to reduce the dimensionality into 300 (taking the first 300 eigenvectors that capture at least 95% of the total variance) and 210 (after obtaining the SIFT descriptors, we use
465
CR IP T
bag-of-words representation with 210 words for each image) respectively. These two pre-processed data sets are denoted by CIFAR10 PCA and CIFAR10 SIFT respectively.
In the experiments, for each class in all the data sets, we randomly draw 200, t and 200 samples as test, training and validation sets respectively. We
train the metric learning models and choose the hyper-parameters involved in the compared approaches on the training and validation sets. With these
AN US
470
learned metrics, we use 1-NN and 5-NN classifier to predict the labels of test samples. For different t, we repeat the above process 10 times and compute the average classification error rates on the test sets. Specifically, we set t = {100, 200, 300, 400, 500} for the USPS and MNIST, t = {200, 400, 600, 800, 1000} 475
for the CIFAR10 PCA and CIFAR10 SIFT. The dimension of latent variables
M
in GP-Metric is empirically set to 20 on all the data sets. The results are shown in Figure 3. As we can see, on both USPS and MNIST, the average
ED
error rates of all the approaches are lower due to that these two data sets contain less noise. However, on the CIFAR10, the error rates are much higher 480
because of the various backgrounds, camera angles and positions. On all the
PT
data sets, the nonlinear metric learning approaches have lower error rates than other linear approaches, indicating that nonlinear approaches can learn more
CE
accurate distance/similarity metric. Among all these compared approaches, GP-Metric achieves the best result, demonstrating its effectiveness in nonlinear metric learning tasks.
AC
485
We also calculate the p-values of the non-parametric Wilcoxon rank-sum
test to check the statistical significance. Specifically, in Table 4, we show the pvalues of 5-NN on the USPS, MNIST, CIFAR10 PCA and CIFAR10 SIFT with
200, 200, 1000 and 1000 training samples for each class respectively. As we can 8 https://github.com/adikhosla/feature-extraction
25
ACCEPTED MANUSCRIPT
Table 4: The non-parametric Wilcoxon rank-sum test between GP-Metric and other compared methods.
MNIST
CIFAR10 PCA
GP-Metric→LCA
0.002
0.008
0.006
GP-Metric→LMNN
0.023
0.031
0.015
GP-Metric→GB-LMNN
0.033
0.034
0.031
GP-Metric→OASIS
0.010
0.021
0.025
GP-Metric→MLR
0.003
0.014
0.035
GP-Metric→R-MLR
0.013
0.015
0.025
GP-Metric→SCML
0.032
GP-Metric→OPML
0.002
GP-Metric→GMML
0.031
CIFAR10 SIFT
AN US
CR IP T
USPS
0.017
0.024
0.039
0.019
0.023
0.039
0.041
0.0391
0.043
0.002
0.002
0.018
0.025
0.026
0.021
see, in most cases GP-Metric performs statistically better than the compared
M
490
Compared methods
methods. It is also worth to note that although the p-values with different numbers of training samples in each data set are not provided, in most cases
ED
they are also less than 0.05, indicating the statistical significance between GPMetric and other compared methods. To further demonstrate the performance of GP-Metric, we show the ten
495
PT
nearest neighbour retrieval results of two query images (of CIFAR SIFT with t = 1000) in Figure 4. Specifically, we select two query images with label
CE
automobile and horse from the test set. Then we retrieve their first ten nearest neighbours by using distance/similarity metrics learned in training step. The
500
images in the blue, green and red boxes denote the query, correctly classified
AC
and misclassified images respectively. As we can see from Figure 4, since the automobile is relatively easy to classify because of its distinctive characteristics, all the approaches have higher accuracy when we use their learned metrics to query the automobile image. However, for the horse image, it is more likely to
505
be classified into cat and dog due to the similarity among these three classes.
26
ACCEPTED MANUSCRIPT
LCA LMNN
CR IP T
GB-LMNN OASIS MLR R-MLR SCML
AN US
OPML GMML GP-Metric
(a) Automobile
LMNN
ED
GB-LMNN
M
LCA
OASIS
PT
MLR
R-MLR
CE
SCML
OPML
AC
GMML
GP-Metric (b) Horse
Figure 4: Ten nearest neighbour retrieval results of two query images. Images in blue, green and red boxes denote the query images, correctly classified images and misclassified images, respectively.
27
ACCEPTED MANUSCRIPT
Table 5: The computational and memory complexities analysis of GP-based methods.
Compuation complexity Memory complexity
GPLVM
Bayesian GPLVM
DGPLVM
Supervised GPLVM
D-SBGPLVM
GP-Metric
O(N 3 )
O(N M 2 )
O(N 3 )
O(N 3 )
O(N 3 )
O(N 3 )
2
O(N )
O(N M )
2
O(N )
2
O(N )
2
O(N )
CR IP T
Complexity
The proposed GP-Metric outperforms other compared approaches in both cases. 4.3. Computational and memory complexity
AN US
To demonstrate the efficiency of GP-Metric, we provide an algorithmic com-
parison of the computational and the memory complexities of related GP-based 510
approaches in Table 5. As we have mentioned in Section 3.2, the major com(S)
putational and memory costs are the eigenvalue decomposition of matrix KZZ , which has a computational complexity of O(N 3 ) and a memory complexity of O(N 2 ). In GPLVM, DGPLVM, Supervised GPLVM and D-SBGPLVM, the ma-
515
M
jor computational and memory costs in each iteration are inverting the N × N
matrix, which has also computational and memory complexities of O(N 3 ) and
ED
O(N 2 ) respectively. The Bayesian GPLVM has a lower computational and
memory complexities of O(N M 2 ) and O(N M ) respectively (where M denotes the number of auxiliary points and M N ), because of using sparse GP. Based 520
PT
on the above analysis, we can conclude that the complexity of GP-Metric is only higher than Bayesian GPLVM but its classification accuracy is the highest in most cases. We also provide an empirical time complexity comparison of GP-
CE
Metric and related metric learning approaches on the four data sets (t = 200 for USPS and MNIST, t = 1000 for CIFAR10 PCA and CIFAR10 SIFT ) in Ta-
AC
ble 6. Although more time-consuming than other metric learning approaches,
525
GP-Metric usually has higher accuracy and many sparse approximation methods [59, 60] can be adapted to further reduce the computational and memory complexities.
28
O(N 2 )
ACCEPTED MANUSCRIPT
Table 6: The comparison of time complexity on different data sets. Data set
LCA
LMNN
GB-LMNN
OASIS
MLR
R-MLR
SCML
OPML
GMML
USPS
20.6s
30.6s
242.1s
83.3s
125.9s
414.2s
51.9s
0.0006s
0.032s
GP-Metric 3895.0s
MNST
35.4s
127.3s
1140.2s
683.8s
4634.2s
7316.5s
76.6s
0.0013s
0.045s
7511.6s
71.8s
4145.1s
30781.7s
1381.1s
183742.0s
324919.4s
985.1s
0.0024s
2.3s
325474.4s
58.3s
2239.9s
16221.7s
508.2s
175775.1s
220351.7s
578.3s
0.0015s
1.9s
224114.5s
5. Conclusion and discussion
CR IP T
CIFAR10 PCA CIFAR10 SIFT
In this paper, we propose a gaussian process based non-parametric metric 530
learning approach and derive a practical algorithm for efficiently learning the
AN US
latent variables and the hyper-parameters. The experimental results show that
our GP-Metric can not only provide more accurate similarity prediction but also learn more representative features than the supervised GP-based feature learning approaches. Furthermore, it can also give uncertainty of prediction, which 535
provides confidence for applications such as medical diagnosis, self-deriving cars and so on. In the future work, we wish to address the high time complexity
M
problem involved in the approach by using approximation approaches [59, 60]
ED
to make GP-Metric available for big data issues.
Acknowledgment
This work was supported by the National Natural Science Foundation of
PT
540
China (NSFC) under Grant Nos. 61672281 and 61472186, the Key Program of NSFC under Grant No. 61732006 and the founding of Jiangsu Innovation
CE
Program for Graduate Education under Grant No. KYLX15 0323.
AC
References
[1] P. Xie, E. Xing, Large scale distributed distance metric learning, arXiv preprint arXiv:1412.5949.

[2] E. P. Xing, A. Y. Ng, M. I. Jordan, S. Russell, Distance metric learning, with application to clustering with side-information, in: Proceedings of the 15th International Conference on Neural Information Processing Systems, 2002, pp. 521–528.

[3] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288–2295.

[4] S. Paisitkriangkrai, C. Shen, A. van den Hengel, Learning to rank in person re-identification with metric ensembles, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1846–1855.

[5] D. Tao, L. Jin, Y. Wang, X. Li, Person reidentification by minimum classification error-based KISS metric learning, IEEE Transactions on Cybernetics 45 (2) (2015) 242–252.

[6] B. McFee, L. Barrington, G. Lanckriet, Learning content similarity for music recommendation, IEEE Transactions on Audio, Speech, and Language Processing 20 (8) (2012) 2207–2218.

[7] S. C. Hoi, W. Liu, S.-F. Chang, Semi-supervised distance metric learning for collaborative image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.

[8] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (Mar) (2010) 1109–1135.

[9] Z. Kuang, J. Sun, K.-Y. Wong, Learning regularized, query-dependent bilinear similarities for large scale image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 413–420.

[10] S. Xiang, F. Nie, C. Zhang, Learning a Mahalanobis distance metric for data clustering and classification, Pattern Recognition 41 (12) (2008) 3600–3612.

[11] L. Yang, R. Jin, R. Sukthankar, Bayesian active distance metric learning, in: Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007, pp. 442–449.

[12] A. Bellet, A. Habrard, M. Sebban, A survey on metric learning for feature vectors and structured data, arXiv preprint arXiv:1306.6709.

[13] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (Feb) (2009) 207–244.

[14] K. Liu, A. Bellet, F. Sha, Similarity learning for high-dimensional sparse data, in: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015, pp. 653–662.

[15] Y. Shi, A. Bellet, F. Sha, Sparse compositional metric learning, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014, pp. 2078–2084.

[16] C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, N. D. Lawrence, Recurrent Gaussian processes, arXiv preprint arXiv:1511.06644.

[17] D. Lim, B. McFee, G. R. Lanckriet, Robust structural metric learning, in: Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 615–623.

[18] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

[19] P. Orbanz, Y. W. Teh, Bayesian nonparametric models, in: Encyclopedia of Machine Learning, Springer, 2011.

[20] B. Babagholami-Mohamadabadi, S. M. Roostaiyan, A. Zarghami, M. S. Baghshah, Multi-modal distance metric learning: A Bayesian non-parametric approach, in: European Conference on Computer Vision, Springer, 2014, pp. 63–77.

[21] C. E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, 2006.

[22] J. V. Davis, B. Kulis, P. Jain, S. Sra, I. S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 209–216.

[23] B. McFee, G. R. G. Lanckriet, Metric learning to rank, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 775–782.

[24] M. Der, L. K. Saul, Latent coincidence analysis: A hidden variable model for distance metric learning, in: Advances in Neural Information Processing Systems, 2012, pp. 3230–3238.

[25] W. Li, Y. Gao, L. Wang, L. Zhou, J. Huo, Y. Shi, OPML: A one-pass closed-form solution for online metric learning, Pattern Recognition 75 (2018) 302–314.

[26] P. H. Zadeh, R. Hosseini, S. Sra, Geometric mean metric learning, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2464–2471.

[27] C. Kang, S. Liao, Y. He, J. Wang, W. Niu, S. Xiang, C. Pan, Cross-modal similarity learning: A low rank bilinear formulation, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 1251–1260.

[28] X. Gao, S. C. H. Hoi, Y. Zhang, J. Wan, J. Li, SOML: Sparse online metric learning with application to image retrieval, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014, pp. 1206–1212.

[29] E. Fetaya, S. Ullman, Learning local invariant Mahalanobis distances, in: Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 162–168.

[30] D. Kedem, S. Tyree, K. Q. Weinberger, F. Sha, G. Lanckriet, Non-linear metric learning, in: Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012, pp. 2573–2581.

[31] E. Straszecka, Combining uncertainty and imprecision in models of medical diagnosis, Information Sciences 176 (20) (2006) 3026–3059.

[32] S. Brechtel, T. Gindele, R. Dillmann, Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs, in: Proceedings of the 17th International Conference on Intelligent Transportation Systems, IEEE, 2014, pp. 392–399.

[33] M. Raissi, Parametric Gaussian process regression for big data, arXiv preprint arXiv:1704.03144.

[34] A. Wilson, Z. Ghahramani, D. A. Knowles, Gaussian process regression networks, in: Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 599–606.

[35] T. Bui, J. Hernández-Lobato, D. Hernández-Lobato, Y. Li, R. Turner, Deep Gaussian processes for regression using approximate expectation propagation, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2187–2208.

[36] N. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, Journal of Machine Learning Research 6 (2005) 1783–1816.

[37] M. K. Titsias, N. D. Lawrence, Bayesian Gaussian process latent variable model, in: Proceedings of the 13th International Workshop on Artificial Intelligence and Statistics, Vol. 9, 2010, pp. 844–851.

[38] S. Eleftheriadis, O. Rudovic, M. Pantic, Shared Gaussian process latent variable model for multi-view facial expression recognition, in: International Symposium on Visual Computing, Springer, 2013, pp. 527–538.

[39] R. Urtasun, T. Darrell, Discriminative Gaussian process latent variable model for classification, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 927–934.

[40] X. Gao, X. Wang, D. Tao, X. Li, Supervised Gaussian process latent variable model for dimensionality reduction, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41 (2) (2011) 425–434.

[41] S. Eleftheriadis, O. Rudovic, M. Pantic, Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition, IEEE Transactions on Image Processing 24 (1) (2015) 189–204.

[42] V. Ntouskos, P. Papadakis, F. Pirri, Probabilistic discriminative dimensionality reduction for pose-based action recognition, in: Pattern Recognition Applications and Methods, Springer International Publishing, 2015, pp. 137–152.

[43] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (10) (2000) 2385–2404.

[44] X. Wang, X. Gao, Y. Yuan, D. Tao, J. Li, Semi-supervised Gaussian process latent variable model with pairwise constraints, Neurocomputing 73 (10-12) (2010) 2186–2195.

[45] N. D. Lawrence, Local distance preservation in the GP-LVM through back constraints, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 513–520.

[46] P. Li, S. Chen, Hierarchical Gaussian processes model for multi-task learning, Pattern Recognition 74 (2018) 134–144.

[47] A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Gaussian processes for object categorization, International Journal of Computer Vision 88 (2) (2010) 169–188.

[48] M. Kemmler, E. Rodner, E.-S. Wacker, J. Denzler, One-class classification with Gaussian processes, Pattern Recognition 46 (12) (2013) 3507–3518.

[49] E. Rodner, D. Hegazy, J. Denzler, Multiple kernel Gaussian process classification for generic 3D object recognition, in: Proceedings of the 25th International Conference of Image and Vision Computing New Zealand, 2010, pp. 1–8.

[50] D. M. Blei, M. I. Jordan, et al., Variational inference for Dirichlet process mixtures, Bayesian Analysis 1 (1) (2006) 121–143.

[51] Z. Dai, J. Hensman, N. Lawrence, Spike and slab Gaussian process latent variable models, arXiv preprint arXiv:1505.02434.

[52] G. Song, S. Wang, Q. Huang, Q. Tian, Multimodal Gaussian process latent variable models with harmonization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5029–5037.

[53] M. E. Tipping, C. M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3) (1999) 611–622.

[54] X. Jiang, J. Gao, X. Hong, Z. Cai, Gaussian processes autoencoder for dimensionality reduction, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2014, pp. 62–73.

[55] T. Iwata, Z. Ghahramani, Improving output uncertainty estimation and generalization in deep learning via neural network Gaussian processes, arXiv preprint arXiv:1707.05922.

[56] J. J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5) (1994) 550–554.

[57] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[58] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images. URL http://www.cs.toronto.edu/~kriz/cifar.html

[59] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013, pp. 282–290.

[60] A. Wilson, H. Nickisch, Kernel interpolation for scalable structured Gaussian processes (KISS-GP), in: Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 1775–1784.
Ping Li received his B.S. and M.S. degrees in Management Science & Engineering from Anhui University of Technology in 2011 and 2014, respectively. He is currently pursuing the Ph.D. degree with the College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics. His research interests include pattern recognition and machine learning.
Songcan Chen received his B.S. degree in mathematics from Hangzhou University (now merged into Zhejiang University) in 1983. In 1985, he completed his M.S. degree in computer applications at Shanghai Jiaotong University and then joined NUAA in January 1986. There he received a Ph.D. degree in communication and information systems in 1997. Since 1998, as a full-time professor, he has been with the College of Computer Science & Technology at NUAA. His research interests include pattern recognition, machine learning and neural computing.