Accepted Manuscript

Gaussian Process Approach for Metric Learning

Ping Li, Songcan Chen

PII:        S0031-3203(18)30357-1
DOI:        https://doi.org/10.1016/j.patcog.2018.10.010
Reference:  PR 6678

To appear in: Pattern Recognition

Received date: 30 September 2017
Revised date:  8 August 2018
Accepted date: 9 October 2018

Please cite this article as: Ping Li, Songcan Chen, Gaussian Process Approach for Metric Learning, Pattern Recognition (2018), doi: https://doi.org/10.1016/j.patcog.2018.10.010


Highlights

• propose a non-parametric metric learning approach (GP-Metric) based on Gaussian Process (GP).
• use GP to extend the bilinear similarity into a non-parametric form.
• develop an efficient algorithm to learn the non-parametric metric.
• demonstrate the performance of GP-Metric on real-world datasets.

Gaussian Process Approach for Metric Learning

Ping Li, Songcan Chen∗

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
{ping.li.nj, s.chen}@nuaa.edu.cn

Abstract

Learning an appropriate distance metric from data can significantly improve the performance of the machine learning task under investigation. In terms of the form in which the distance metric is represented, distance metric learning (DML) approaches can be broadly divided into two categories: parametric and non-parametric. The first category makes a parametric assumption on the distance metric and learns its parameters, which easily leads to overfitting and limits model flexibility. The second category abandons this assumption and instead directly learns a non-parametric distance metric whose complexity can be adjusted according to the number of available training data, making the model representation relatively flexible. In this paper we follow the idea of the latter category and develop a non-parametric DML approach. The main challenge of our work concerns the formulation and learning of the non-parametric distance metric. To meet this, we use a Gaussian Process (GP) to extend the bilinear similarity into a non-parametric metric (here we abuse the concept of metric) and then learn this metric for the specific task. As a result, our approach learns not only a nonlinear metric that inherits the flexibility of the GP but also representative features for the follow-up tasks. Compared with existing GP-based feature learning approaches, our approach can provide accurate similarity prediction in the new feature space. To the best of our knowledge, this is the first work that directly uses a GP as a non-parametric metric. In the experiments, we compare our approach with related GP-based feature learning approaches and DML approaches respectively. The results demonstrate the superior performance of our approach.

Keywords: metric learning, Gaussian Process, bilinear similarity, non-parametric metric

∗Corresponding author. Email address: s.chen@nuaa.edu.cn (Songcan Chen)

Preprint submitted to Journal of Pattern Recognition                    October 9, 2018

1. Introduction

Learning an appropriate distance metric from data can significantly improve the performance of the machine learning task under investigation [1]. Since the early work in [2], distance metric learning (DML) has become an active research area and has been widely used in many applications such as person re-identification [3, 4, 5], music recommendation [6], image retrieval [7, 8, 9], clustering analysis [10], etc. The goal of DML is to learn an application-specific distance metric that brings "similar" objects close together while separating "dissimilar" objects [11]. Following the formulation in [12], DML can be cast as a weakly supervised learning problem with respect to a set of pairwise or triplet-based constraints. Based on this formulation, a variety of DML approaches have been devised by using different distance/similarity metric models to solve challenges in various domains. In terms of the form in which the distance metric is represented, DML approaches can be generally divided into two categories: parametric and non-parametric.

In general, the first category makes a parametric assumption on the distance metric and then learns the corresponding parameters. For example, the state-of-the-art large margin nearest neighbor (LMNN) [13] method uses a parametric Mahalanobis distance as the distance metric and learns the parameter matrix in the training process. Although this yields explicit learning of the involved parameters, it comes partially at the expense of model flexibility. In particular, when dealing with complex structured data, it may fail to learn the required distance metric, which to some extent limits the application scope. Another drawback of parametric DML is that a large number of parameters need to be learned, especially in the high-dimensional case [14]. Learning such a large number of parameters from a finite number of samples often leads to overfitting and thus poor generalization performance [15, 16]. Although regularization methods such as low-rank and sparse constraints can be used to improve the generalization performance [13, 14, 15, 17], much more effort has to be made to select the values of the extra hyper-parameters they introduce.

Compared with parametric DML, non-parametric DML tries to learn a non-parametric distance metric that can adaptively adjust its model complexity according to the number of available training data, making the model representation relatively flexible [18]. The number of parameters involved can grow with the number of training samples [19], which offers the possibility of minimizing the risk of overfitting. However, to the best of our knowledge, few works so far have concentrated on the non-parametric formulation of DML [20]. The main challenge of non-parametric DML is how to design and learn such a metric. To meet this, in this paper we propose a novel non-parametric DML approach (GP-Metric) by utilizing the non-parametric characteristic of the Gaussian Process (GP) [21]. The main contributions of this paper are:

(1) using a GP to extend the bilinear similarity into a non-parametric metric (here we abuse the concept of metric) and then adapting it to the specific task;

(2) proposing a practical method to reduce the complexity of optimization by utilizing the special structure of the non-parametric metric with the bilinear representation, and designing a corresponding efficient two-step algorithm to learn the metric;

(3) empirically demonstrating the effectiveness of GP-Metric on multiple real-world data sets and showing that GP-Metric has superior performance in dimension reduction and metric learning.

The rest of this paper is organized as follows. In Section 2, we review related metric learning and GP-based approaches. We describe our GP-Metric method and illustrate how it can be derived from the bilinear similarity function in Section 3. We present our experimental results and analysis in Section 4 and conclude in Section 5.

2. Related works

2.1. Distance metric learning

As mentioned in Section 1, DML can be carried out using either a parametric or a non-parametric approach. Parametric DML has been extensively studied in the machine learning community and can be classified into two main categories: Mahalanobis DML parameterized by a positive semi-definite (PSD) matrix, and bilinear similarity learning parameterized by an ordinary matrix. Mahalanobis DML concerns learning the PSD matrix involved in the Mahalanobis distance according to different objective functions and yields various approaches, such as LMNN [13], information theoretic metric learning (ITML) [22] and metric learning to rank (MLR) [17, 23]. Recently, several variations of Mahalanobis DML have also been proposed. For example, latent coincidence analysis (LCA) [24], based on a latent variable model, can be viewed as a Mahalanobis distance metric learning model that finds a linear projection of high-dimensional data by shrinking the distance between similarly labeled inputs and expanding the distance between differently labeled ones. The one-pass closed-form solution for online metric learning (OPML) [25] decomposes the PSD matrix of the Mahalanobis distance as the product of a low-rank matrix and its transpose; this low-rank matrix can also be regarded as the projection matrix that transforms the inputs into the new feature space. Geometric mean metric learning (GMML) [26] uses a PSD matrix to measure the Mahalanobis distances of similar points and the inverse of this matrix to measure the Mahalanobis distances of dissimilar points. Different from this class of Mahalanobis DML, bilinear similarity learning aims to learn a bilinear similarity function without the PSD constraint on the parameter matrix. This similarity need not be nonnegative or symmetric and can thus be extended to more general applications such as cross-modal similarity learning [27], query-dependent similarity learning [9] and online similarity learning [8]. Although the performance of the above-mentioned approaches is typically superior to that of directly using standard metrics in practice, these approaches make strong assumptions on the parametric form of the distance metric, thus leading to a partial loss of flexibility. Furthermore, they usually contain a large number of parameters when dealing with high-dimensional data, easily resulting in overfitting. A common solution to this problem is to impose certain constraints (according to the specific task) on these parameters to limit the model complexity. For example, [13] uses a low-rank constraint on the parameter matrix to improve the generalization performance. [28] adopts a sparsity-promoting constraint based on the ℓ1-norm to learn a sparse parameter matrix from high-dimensional data in an efficient and scalable manner. [15] models and learns the parameter matrix as a weighted combination of a set of low-rank "basis metrics" learned from the training data in different local regions. Although these approaches can reduce the risk of model overfitting, certain extra hyper-parameters have to be introduced (such as the rank of the Mahalanobis matrix [13] or the regularization hyper-parameters [28]), and thus much more effort has to be made to select/learn these parameters.

Compared with parametric DML, non-parametric DML, as described in Section 1, does not need to define the parametric form of the distance metric; instead, it carries out a learning-from-data process to obtain the required non-parametric metric. Furthermore, non-parametric DML can adaptively adjust its complexity according to the number of available training data [19], making the model more flexible. A GP-based DML method can also elegantly carry out nonlinear DML [15, 29, 30] and make predictions with uncertainty. The latter is particularly important in situations such as medical diagnosis [31], self-driving cars [32] and so on. Despite these potentially attractive advantages, non-parametric DML remains relatively understudied in the machine learning community due to the difficulty of designing and learning a non-parametric metric. Motivated by the above, this paper establishes a non-parametric metric based on the GP and then adapts it to the task under investigation by learning from the training data.

2.2. GPR and GPLVM

The GP, as a Bayesian non-parametric approach, has been widely used in many machine learning scenarios. Compared with other machine learning methods, the GP provides a flexible prior distribution over functions and enjoys analytical tractability [33]. In Gaussian Process Regression (GPR) [34, 35], given the training dataset D = {(x_1, y_1), ..., (x_N, y_N)} (where x_i ∈ R^D and y_i ∈ R are the i-th input vector and the corresponding response variable respectively), it can be shown that the response variable y^* of a new test observation x^* follows the Gaussian distribution

y^* \sim \mathcal{N}\big( K_{x^*X}(K_{XX} + \sigma^2 I)^{-1} y,\; K_{x^*x^*} - K_{x^*X}(K_{XX} + \sigma^2 I)^{-1} K_{Xx^*} \big)   (1)

where X ∈ R^{N×D} and y ∈ R^N denote the inputs and response variables of the N training samples, K(·, ·) denotes the kernel function with hyper-parameters θ, and σ denotes the standard deviation of the noise ε ∼ N(0, σ²).
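As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of GPR prediction for a single test point under an assumed RBF kernel; the helper names (rbf_kernel, gpr_predict) and the toy data are ours for illustration and are not part of the original formulation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gpr_predict(X, y, x_star, sigma=0.1, **kern):
    # Eq. (1): predictive mean and variance of y* at a single test input x*.
    K_XX = rbf_kernel(X, X, **kern)                               # N x N
    K_sX = rbf_kernel(x_star[None, :], X, **kern)                 # 1 x N
    K_ss = rbf_kernel(x_star[None, :], x_star[None, :], **kern)   # 1 x 1
    A = K_XX + sigma**2 * np.eye(len(X))
    alpha = np.linalg.solve(A, y)                                 # (K_XX + sigma^2 I)^{-1} y
    mean = K_sX @ alpha
    var = K_ss - K_sX @ np.linalg.solve(A, K_sX.T)
    return mean.item(), var.item()

# Toy usage: fit noisy samples of sin(x) and predict at a new input.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mu, var = gpr_predict(X, y, np.array([0.5]), sigma=0.1, lengthscale=1.0)
print(mu, var)
```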

Besides being used for regression, the GP can also be applied to nonlinear feature learning and dimensionality reduction, which leads to the Gaussian Process Latent Variable Model (GPLVM) [36]. In these learning scenarios, the goal is to learn a set of latent variables denoted by Z ∈ R^{N×Q} from a set of observed variables X ∈ R^{N×D} (where N and D denote the number and the dimension of the observed variables respectively, and Q denotes the dimension of the latent variables). In general Q ≪ D, thus realizing dimension reduction. In the conventional GPLVM [36], we assume that x_{nd} is generated by a latent function with a noise process, x_{nd} = f_d(z_{n,:}) + ε_{nd}, where z_{n,:} denotes the n-th row of the matrix Z and ε_{nd} is noise following the distribution ε_{nd} ∼ N(0, σ²). As in GPR, we assume that these D latent functions {f_d}_{d=1}^{D} share the same GP prior. Thus we can obtain the marginal likelihood of the observations

p(X \mid Z, \theta) = \prod_{d=1}^{D} \frac{1}{(2\pi)^{N/2} \, |K + \sigma^2 I|^{1/2}} \exp\!\left( -\tfrac{1}{2}\, x_{:,d}^T (K + \sigma^2 I)^{-1} x_{:,d} \right)   (2)

where x_{:,d} denotes the d-th column of the matrix X, K denotes the kernel matrix of the latent variables, and θ denotes the hyper-parameters involved in the kernel function and the noise distribution. We can maximize this marginal likelihood with respect to Z and θ simultaneously to find the optimal values Ẑ and θ̂:

\{\hat{Z}, \hat{\theta}\} = \arg\max_{Z, \theta} \; p(X \mid Z, \theta)   (3)
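To make Eq. (2) concrete, the sketch below evaluates the GPLVM log marginal likelihood for a given Z under an assumed RBF kernel; in Eq. (3) this quantity would be maximized over Z and θ with a gradient-based optimizer. The function name and parameterization are ours, not from the paper.

```python
import numpy as np

def gplvm_log_marginal_likelihood(X, Z, sigma=0.1, lengthscale=1.0):
    # Log of Eq. (2): D independent GP likelihoods sharing the kernel matrix K(Z, Z).
    N, D = X.shape
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    K = np.exp(-0.5 * d2 / lengthscale**2) + sigma**2 * np.eye(N)    # RBF kernel + noise
    L = np.linalg.cholesky(K)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))                        # ln |K + sigma^2 I|
    Kinv_X = np.linalg.solve(L.T, np.linalg.solve(L, X))             # (K + sigma^2 I)^{-1} X
    quad = np.sum(X * Kinv_X)                                        # sum_d x_{:,d}^T (K + sigma^2 I)^{-1} x_{:,d}
    return -0.5 * quad - 0.5 * D * logdet - 0.5 * N * D * np.log(2 * np.pi)
```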

Recently, many extensions of the original GPLVM have been proposed, such as the Bayesian GPLVM [37], the shared GPLVM [38], supervised GPLVMs [39, 40, 41, 42] and so on. Among these models, supervised GPLVMs have been demonstrated to significantly improve the quality of feature learning. For example, [39] develops a discriminative GPLVM (DGPLVM) by employing a prior distribution over the latent space that is derived from Generalized Discriminant Analysis (GDA) [43]. With this formulation, DGPLVM obtains the desirable generalization properties of generative approaches while being able to better discriminate between classes in the latent space. [40] proposes a supervised GPLVM (Supervised GPLVM) by assuming that the labels and the input variables are independent, conditioned on the latent variables in the low-dimensional space. Hence mappings from the latent variables to both the input variables and the labels can be established to improve the performance of the original GPLVM. To the best of our knowledge, DGPLVM and Supervised GPLVM are two state-of-the-art models in the context of GPLVMs, and many recently proposed models extend these two models for different applications. For example, [41] proposes a discriminative shared GPLVM approach (DS-GPLVM) for multiview and view-invariant classification by combining DGPLVM and the shared GPLVM [38]. [44] replaces the discriminative prior in DGPLVM with a semi-supervised prior learned from pairwise constraints (must-link and cannot-link) on samples and proposes a semi-supervised GPLVM (SSGPLVM). [42] proposes a supervised GPLVM (D-SBGPLVM) for action sequence classification by combining the Back-constrained GPLVM [45] with DGPLVM. [46] extends the Supervised GPLVM and proposes a hierarchical GP model for multi-kernel and multi-task learning. In fact, our GP-Metric can also be considered a supervised GPLVM for nonlinear feature learning. Different from the existing GP-based feature learning approaches, which focus on learning representative features, it learns these features and a nonlinear metric simultaneously, thus providing accurate similarity prediction in the new feature space. In Section 4.1, we will show that it achieves excellent results in tasks such as dimension reduction and data visualization.

3. GP-based metric learning

In this section, we show how the bilinear similarity can be extended to a non-parametric metric and propose an efficient algorithm to learn this metric.

3.1. Definition of GP-Metric

We begin with some basic notation. Assume that there are N labeled training samples {(x_1, y_1), ..., (x_N, y_N)}, where x_i ∈ R^D and y_i ∈ {1, 2, 3, ..., C} denote the input variable and the label of the i-th training sample respectively, and C denotes the number of classes. We collect these variables into a matrix and a vector: X = [x_1, ..., x_N]^T and y = [y_1, ..., y_N]^T. To simplify the formulation, we also introduce a set of latent variables Z = [z_1, ..., z_N]^T, where z_i denotes the transformation of x_i from the original space X into the space Z. In bilinear similarity learning, we compute the similarity between the i-th and the j-th training samples as S_{i,j} = z_i^T M z_j, where S_{i,j} can be generated from the label information and defined as S_{i,j} = 1 if y_i = y_j and S_{i,j} = −1 otherwise. Thus the similarity matrix can be written as follows:

S = Z M Z^T   (4)

Our goal is to learn the matrix M for estimating pairwise similarity in Eq. (4).
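For concreteness, a minimal sketch of the bilinear similarity in Eq. (4) and of the label-derived target matrix S follows; the helper names (target_similarity, bilinear_similarity) are ours for illustration.

```python
import numpy as np

def target_similarity(y):
    # S_{i,j} = 1 if y_i == y_j, -1 otherwise (label-derived similarity targets).
    y = np.asarray(y)
    return np.where(y[:, None] == y[None, :], 1.0, -1.0)

def bilinear_similarity(Z, M):
    # Eq. (4): S = Z M Z^T, pairwise bilinear similarities of the latent points in Z.
    return Z @ M @ Z.T

# Toy usage.
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 3))          # 5 latent points in a 3-dimensional latent space
M = np.eye(3)                            # the identity M recovers the plain inner product
print(bilinear_similarity(Z, M).shape)   # (5, 5)
print(target_similarity([0, 0, 1, 2, 1]))
```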

In this paper, we adopt a GP to extend Eq. (4) into a non-parametric metric. Specifically, in order to derive the GP prior of the bilinear similarity function, we assume that M follows a matrix normal distribution M ∼ MN(0, U, V), where U and V denote the row and the column covariance matrices respectively. For any matrices H and K, we define vec(H) to be the vector obtained by concatenating the columns of H, and H ⊗ K denotes the Kronecker product of H and K. From the definition of the matrix normal distribution, vec(M) follows a multivariate normal distribution vec(M) ∼ N(vec(0), V ⊗ U). As a result, the bilinear similarity can be reformulated as follows:

vec(S) = vec(Z M Z^T) = (Z \otimes Z)\, vec(M)   (5)

Now S is a linear combination of Gaussian distributed variables and hence is itself Gaussian as well. We therefore characterize this distribution by identifying its mean and covariance. Let s = vec(S); then the mean and the covariance of s are

\mathrm{mean}(s) = E[s] = E[vec(Z M Z^T)] = 0   (6)

and

\mathrm{cov}(s) = E[s s^T] = E[vec(Z M Z^T)\,(vec(Z M Z^T))^T]   (7)

where E[s s^T] can be derived as

E[s s^T] = (Z \otimes Z)\, E[vec(M)(vec(M))^T]\, (Z^T \otimes Z^T)
         = (Z \otimes Z)(V \otimes U)(Z^T \otimes Z^T)
         = (Z V Z^T) \otimes (Z U Z^T)
         = ((ZC)(ZC)^T) \otimes ((ZP)(ZP)^T)   (8)

In the second line of Eq. (8), we use the facts that M follows a matrix normal distribution and that the covariance of vec(M) is cov(vec(M)) = E[vec(M)(vec(M))^T] = V ⊗ U. In the fourth line of Eq. (8), since V and U are both positive definite, we can factorize them into the products V = CC^T and U = PP^T, where C and P are both real square matrices. Since (ZC)(ZC)^T and (ZP)(ZP)^T are in fact two Gram matrices of Z in transformed spaces, by using the kernel trick we can replace them with two kernel matrices. In this paper, we assume that M is a symmetric matrix and V = U, so we can use the same kernel matrix K_{ZZ}^{(S)} to replace these two matrices. As a result, the bilinear similarity function can be extended into a non-parametric metric by imposing the GP prior s ∼ GP(0, K_{ZZ}^{(S)} ⊗ K_{ZZ}^{(S)}). Based on this non-parametric metric, we can build our non-parametric metric learning approach. It is also worth noting that the matrix M can be asymmetric; in that case it can be used in cross-modal (or cross-view) bilinear similarity learning as in [27], and our GP-Metric would then have to be constructed with respect to such an M, which we leave to future work.

(or view) bilinear similarity learning as in [27]. Correspondingly, our GP-Metric should also be constructed with respect to M for cross-modal learning, which may be our future work.

So far we just assume that S is noise-free. However, in many real-world applications, the observed similarity variables are often corrupted by noise. In

AN US

225

this paper, following the idea in GPR [34, 35], we also assume that such noise follows a gaussian distribution, since it leads to simple inference equations and a closed-form solution when integrating out the intermediate variables (refer to the later description). Furthermore, it has been demonstrated in [47, 48, 49] 230

that applying GPR directly to binary classification problems while ignoring the

M

discrete nature of the labels (i.e., Sij ∈ {1, −1}) can yield comparable results in performance. Based on the above assumption, we use a latent variable F to

ED

denote the clean counterpart of noisy S. Thus the corresponding distributions of S and F are

235

(S)

(S)

F ∼ MN (0, KZZ , KZZ )

(9)

PT

vec(S) ∼ N (vec(F ), σ 2 IN 2 ×N 2 ),

where IN 2 ×N 2 and σ 2 denote the N 2 × N 2 identity matrix and the variance of noise respectively. In the definition of GP-Metric, Z is modeled as a transfor-

CE

mation of X from original space X into space Z for which we use GPLVM as

AC

follows:

x:,d ∼ N (h:,d , 2 IN ×N ),

(X)

h:,d ∼ N (0, KZZ )

(10)

where x:,d and h:,d denote the dth columns of matrices X and the noise-free

240

(X)

matrix H respectively; 2 denotes the variance of gaussian noise; KZZ denotes

the kernel matrix related to X. Based on the above assumptions, we can see that the proposed GP-Metric still contains several hyper-parameters and many latent variables which should 10

ACCEPTED MANUSCRIPT

be learned from data. However, in Bayesian non-parametric statistics, the term 245

“non-parametric” does not imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed

CR IP T

in advance [19], such as Dirichlet Process [50] and Gaussian Process [21]. Furthermore, it has been demonstrated that GPLVM (with hyper-parameters and

latent variables) belongs to the aforementioned non-parametric class of models 250

[51, 52]. Thus, GP-Metric as an extension of GPLVM is still a non-parametric model and possesses its own advantages. First, it is more flexible than the para-

metric approaches and can be elegantly used in nonlinear similarity learning by (S)

(X)

AN US

using nonlinear kernel functions in both KZZ and KZZ . Second, it can also be

considered as a supervised dimension reduction by utilizing the label information 255

involved in S. In the experiment section, we will demonstrate that GP-Metric has a comparable performance with the widely used supervised GPLVMs such as DGPLVM [39] and Supervised GPLVM [40]. Third, GP-Metric can also provide uncertainty of prediction inherited from GPR as we will show in Section

M

3.3.

Next we derive out the likelihood function of GP-Metric. With the definition

260

ED

of GP-Metric, S and X are independent conditioned on Z. Thus, their joint marginal likelihood is

(11)

PT

p(S, X|θ, Z) = p(S|θ, Z)p(X|θ, Z)

where θ denotes all the hyper-parameters (σ,  and hyper-parameters involved

CE

in the kernel functions) involved in the GP-Metric. Marginalizing out the noisefree latent variables F and H results in the following equations of p(S|θ, Z)

AC

and p(X|θ, Z) respectively,

p(S|θ, Z) =

  (S) (S) exp − 21 sT (KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 )−1 s (S)

(S)

(2π)N 2 /2 |KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 |1/2   −1 D D exp − 1 xT (K (X) + 2 I ) x Y Y N ×N :,d ZZ 2 :,d p(X|θ, Z) = p(x:,d |θ, Z) = (X) N/2 (2π) |(KZZ + 2 IN ×N )|1/2 d=1 d=1

11

ACCEPTED MANUSCRIPT

Consequently, the joint log marginal likelihood is given by

CR IP T

N2 1 1 (S) (S) (S) (S) L = − sT (KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 )−1 s − ln(2π) − ln |KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 | 2 2 2 D 1X T D ND (X) (X) − x:,d (KZZ + 2 IN ×N )−1 x:,d − ln(2π) − ln |KZZ + 2 IN ×N | 2 2 2 d=1 (12) ˆ of θ and Z, we need To estimate the optimal values (denoted by θˆ and Z) 265

to maximize L. In general, we can compute the gradients of L and then use the gradient-based approach to learn these variables. However, both the computational and the memory complexities of the approach are so high that ordinary

AN US

computers are unaffordable. In next section, we first use the special property

of Kronecker product to formally reduce computational and memory complexi270

ties of L (and its gradients) and then design a corresponding efficient two-step algorithm to estimate the hyper-parameters. 3.2. Efficient hyper-parameter estimation

M

In this section, we provide a relatively efficient algorithm to estimate the hyper-parameters involved in GP-Metric. As mentioned in Section 3.1, although GP-Metric has a similar structure to both GPR and GPLVM, a direct implemen-

ED

275

tation of this algorithm (e.g., by using the gradient-based approach as realized in [36, 39, 40]) would lead to both higher computational and memory complex-

PT

ities. To reduce these complexities, first let us factorize L in Eq.(12) into the following two components, denoted respectively as Kronecker product term and GPLVM term,

CE

280

AC

and

1 1 Kronecker product term = − sT A−1 s − ln |A| 2 2

(13)

D

1 X T −1 D GP LV M term = − x:,d B x:,d − ln |B| 2 2

(14)

d=1

(S)

(S)

For convenience of derivations, let A = KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 and B = (X)

KZZ + 2 IN ×N . Now let us respectively analyze the above two components in

detail. Firstly, we can easily find that the GPLVM term has the same form as 12

ACCEPTED MANUSCRIPT

285

GPLVM. Thus we can directly compute it as in the conventional GPLVM [36] with computational and memory complexities of O(N 3 ) and O(N 2 ) respectively. Secondly, the Kronecker product term can be further factorized into two terms:

CR IP T

the quadratic term (sT A−1 s) and the log-det term (ln |A|). Both these two

terms have much higher computational and memory complexities of O(N 6 ) and 290

O(N 4 ) respectively if we ignore special Kronecker product structure. However, it is because of this special structure, in this subsection, that we can reduce the

computational and memory complexities of the term by three and two orders

with the Kronecker product: 

295

AN US

of magnitude respectively. Specifically, we use the following identity associated  (S) (S) KZZ ⊗ KZZ + σ 2 IN 2 ×N 2 = (U ⊗ U )(D ⊗ D + σ 2 IN 2 ×N 2 )(U T ⊗ U T ) (15) (S)

where KZZ = U DU T is the eigenvalue decomposition; U is the square matrix

whose columns are its eigenvectors; D is the diagonal matrix whose diagonal elements are the corresponding eigenvalues in a deceasing order. Since both the

M

quadratic term and the log-det term contain the matrix A, we can reduce their complexities respectively by substituting Eq.(15) into these two terms as shown in the following subsections.

ED

300

3.2.1. Computation of the quadratic term

PT

By substituting Eq.(15) into the quadratic term, we can reduce the computation complexity of this term below: (16)

CE

sT A−1 s = (vec(U T SU ))T (D ⊗ D + σ 2 IN 2 ×N 2 )−1 (vec(U T SU ))

Similar to the above derivation, the gradient of the quadratic terms with respect to the hyper-parameter ζ (involved in the kernel functions and the latent variable

AC

305

Z) can also be computed efficiently in terms of ∂sT A−1 s = −2vec(U T SU )T (D ⊗ D + σ 2 IN 2 ×N 2 )−1 ∂ζ ! ! (S)  T ∂KZZ 2 −1 T U unvecN,N (D ⊗ D + σ IN 2 ×N 2 ) vec(U SU ) D vec U ∂ζ (17) 13

ACCEPTED MANUSCRIPT

where unvecm,n (·) is the matricizing operator that transforms a column vector amn×1 into a matrix Am×n and unvecm,n (amn×1 ) = Am×n ⇐⇒ vec(Am×n ) = amn×1 . Moreover, the gradient with respect to σ 2 can be further simplified by

CR IP T

∂sT A−1 s ∂σ 2 I = − vec(U T SU )T (D ⊗ D + σ 2 IN 2 ×N 2 )−1 2 2 ∂σ σ

(18)

(D ⊗ D + σ 2 IN 2 ×N 2 )−1 vec(U T SU ) 310

3.2.2. Computation of the log-det term

Similar to the quadratic term, by utilizing Eq.(15), the log-det term can also

AN US

be efficiently computed as follows: ln |A| = ln |D ⊗ D + σ 2 IN 2 ×N 2 |

(19)

The gradient with respect to the hyper-parameter ζ is given by

(20)

M

 −1 T ∂ ln |A| =2diag D ⊗ D + σ 2 IN 2 ×N 2 ∂ζ !! (S) T ∂KZZ U diag(D) ⊗ diag U ∂ζ

315

fied by

ED

Moreover, the corresponding gradient with respect to σ 2 can be further simpliN

N

∂ ln |A| X X 1 = 2 ∂σ 2 D D i,i j,j + σ i=1 j=1

(21)
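A minimal sketch of how Eqs. (16) and (19) are evaluated in practice is given below, together with a sanity check against the naive dense computation on a toy problem; the function name (kron_terms) and the toy data are ours for illustration.

```python
import numpy as np

def kron_terms(K_S, S, sigma2):
    # Evaluates the two pieces of the Kronecker product term in Eq. (13) via Eq. (15):
    # the quadratic term s^T A^{-1} s of Eq. (16) and the log-det term ln|A| of Eq. (19),
    # using one O(N^3) eigendecomposition instead of O(N^6)/O(N^4) dense operations.
    d, U = np.linalg.eigh(K_S)                 # K_ZZ^(S) = U diag(d) U^T
    T = U.T @ S @ U                            # unvec of (U ⊗ U)^T vec(S)
    scale = d[:, None] * d[None, :] + sigma2   # diagonal of D ⊗ D + sigma^2 I as an N x N grid
    quad = np.sum(T**2 / scale)                # Eq. (16)
    logdet = np.sum(np.log(scale))             # Eq. (19)
    return quad, logdet

# Sanity check against the naive computation on a tiny problem.
rng = np.random.default_rng(0)
N = 6
Z = rng.standard_normal((N, 2))
K_S = np.exp(-0.5 * np.sum((Z[:, None, :] - Z[None, :, :])**2, -1))   # RBF kernel on Z
y = rng.integers(0, 3, N)
S = np.where(y[:, None] == y[None, :], 1.0, -1.0)                     # label-derived similarities
sigma2 = 0.1
A = np.kron(K_S, K_S) + sigma2 * np.eye(N * N)
s = S.reshape(-1, order="F")                                          # column-major vec(S)
quad, logdet = kron_terms(K_S, S, sigma2)
print(np.allclose(quad, s @ np.linalg.solve(A, s)))                   # True
print(np.allclose(logdet, np.linalg.slogdet(A)[1]))                   # True
```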

Based on the formulations in Subsections 3.2.1 and 3.2.2, the dominant computational cost of L and its gradients is the eigenvalue decomposition of K_{ZZ}^{(S)}, which now costs O(N³) instead of O(N⁶). Furthermore, the memory complexity of the Kronecker product term is also reduced from O(N⁴) to O(N²).

3.2.3. Efficient two-step algorithm

Based on the above derivations, we have already reduced the computational and memory complexities of each iteration of the optimization algorithm. However, directly using the gradient-based method to learn the model often has a slow convergence rate and ends in a local optimum because of the non-convexity of the objective function L. To mitigate these problems, we provide an efficient two-step algorithm, shown in Algorithm 1. Specifically, the algorithm includes the following three main parts:

(1) Algorithm initialization. At the beginning of Algorithm 1, we initialize the hyper-parameters θ randomly within [0, 1], train a probabilistic principal component analysis (PPCA) [53] model on X, and then use the learned latent variables to initialize Z. This trick has often been used in GPLVMs to obtain better initial values for the latent variables [36, 40, 54].

(2) Feature learning. In this step, we use the RBF kernel in p(X|θ, Z) (i.e., p(X|Z, θ)_RBF) and the linear kernel in p(S|θ, Z) (i.e., p(S|Z, θ)_linear) for the dimension reduction task. By using the RBF kernel in p(X|θ, Z), we can learn a set of representative nonlinear features. Meanwhile, by using the linear kernel in p(S|θ, Z), the supervised information can also be embodied in the features, giving a better visualization effect as we will show in Section 4. We maximize L1 in (22) with respect to θ and Z by using Eqs. (16)-(21).

L_1 = \ln p(X \mid Z, \theta)_{RBF} + \ln p(S \mid Z, \theta)_{linear}   (22)

(3) Similarity learning. We embed the features learned in the previous step into an RBF kernel in p(S|Z, θ) (i.e., p(S|Z, θ)_RBF; note that this RBF kernel can differ from the RBF kernel in p(X|Z, θ)_RBF) and maximize L2 in (23) with respect to θ and Z by using Eqs. (16)-(21). In this way, the feature learning step provides better initial values for the latent variables, so that the similarity learning step can learn these variables with higher quality in less time.

L_2 = \ln p(X \mid Z, \theta)_{RBF} + \ln p(S \mid Z, \theta)_{RBF}   (23)

Algorithm 1: Latent variables and hyper-parameters learning
  Input: The training samples {X, S}, the step size α and the dimensionality of the latent variables Q.
  Output: The hyper-parameters θ and the latent variables Z.
  1  Initialize the hyper-parameters θ randomly;
  2  Initialize Z by using the PPCA approach on X;
  3  Use a linear kernel in p(S|θ, Z) and an RBF kernel in p(X|θ, Z) respectively;
  4  while L1 in (22) has not converged do
  5      θ ← θ + α ∂L1/∂θ,  Z ← Z + α ∂L1/∂Z;   // by using Eqs. (16)-(21)
  6  Embed the features learned in the previous step into an RBF kernel in p(S|Z, θ);
  7  while L2 in (23) has not converged do
  8      θ ← θ + α ∂L2/∂θ,  Z ← Z + α ∂L2/∂Z;   // by using Eqs. (16)-(21)
  9  return θ and Z;

3.3. Prediction for new samples

During the prediction process of DML, our goal is to predict the similarities of new sample pairs using the learned model. In this paper, depending on the task (dimension reduction or similarity metric learning), GP-Metric involves two main steps for predicting on new test samples: the feature learning step and the similarity prediction step.

In the feature learning step, the goal is to learn the latent variable z_i^* for a new sample x_i^*. This can be achieved by maximizing the marginal likelihood of the test sample:

\ln p(x_i^* \mid Z, z_i^*, \theta) = \ln \mathcal{N}\big(x_i^* \mid \mu_i^*, \, \epsilon_i^{*2} I_{D \times D}\big)   (24)

where μ_i^* = X^T (K_{ZZ}^{(X)} + ε² I_{N×N})^{-1} k_{Zz_i^*}^{(X)}, ε_i^{*2} = k_{z_i^* z_i^*}^{(X)} − (k_{Zz_i^*}^{(X)})^T (K_{ZZ}^{(X)} + ε² I_{N×N})^{-1} k_{Zz_i^*}^{(X)}, and k_{Zz_i^*}^{(X)} denotes the kernel vector k^{(X)}(Z, z_i^*). The hyper-parameters θ and the latent variables Z have already been learned in the model learning step. After this learning process, we obtain more representative low-dimensional features, which can in turn be used for follow-up tasks such as clustering and classification.

For the similarity prediction task, the second step (the similarity prediction step) directly predicts the similarity between two new test samples x_i^* and x_j^*. Assuming that we have learned the latent variables z_i^* and z_j^* corresponding to x_i^* and x_j^* respectively, we can compute the expectation and variance of their similarity by the following formulations:

\mu(z_i^*, z_j^*) = \Big( vec\big(U^T \mathrm{unvec}_{N,N}(k_{Zz_i^*}^{(S)} \otimes k_{Zz_j^*}^{(S)})\, U\big) \Big)^T \big( D \otimes D + \sigma^2 I_{N^2 \times N^2} \big)^{-1} vec(U^T S U)   (25)

\mathrm{var}(z_i^*, z_j^*) = k_{z_i^* z_i^*}^{(S)} k_{z_j^* z_j^*}^{(S)} - \Big( vec\big(U^T \mathrm{unvec}_{N,N}(k_{Zz_i^*}^{(S)} \otimes k_{Zz_j^*}^{(S)})\, U\big) \Big)^T \big( D \otimes D + \sigma^2 I_{N^2 \times N^2} \big)^{-1} vec\big(U^T \mathrm{unvec}_{N,N}(k_{Zz_i^*}^{(S)} \otimes k_{Zz_j^*}^{(S)})\, U\big)   (26)

The similarity S_{z_i^* z_j^*} between z_i^* and z_j^* therefore follows a Gaussian distribution S_{z_i^* z_j^*} ∼ N(μ(z_i^*, z_j^*), var(z_i^*, z_j^*)). One strength of GP-Metric is that it provides not only the expectation of the similarity but also the uncertainty of this prediction (i.e., the variance), which may be utilized by follow-up tasks. This mechanism plays an important role in many application scenarios where equipping the learning result with a measure of uncertainty may be as important as the result itself [31, 32, 55]. The whole similarity prediction process is shown in Algorithm 2. For the dimension reduction task, the prediction process is the same as Algorithm 2, except that it omits the similarity computation step in Eqs. (25) and (26). We can also see from Algorithm 2 that both the feature learning step and the similarity prediction step have computational and memory complexities of O(N³) and O(N²) respectively, the same as those of the conventional GPLVM [36].

complexities of O(N 3 ) and O(N 2 ) respectively which are the same as those of conventional GPLVM [36].

AN US

Algorithm 2: Similarity prediction of new samples Input: The training samples {X, S}, the new samples x∗i and x∗j , the latent variables Z and the hyper-parameter θ.

Output: The prediction mean µ(zi∗ , zj∗ ) and the variance var(zi∗ , zj∗ ).

2 3

for x∗ ∈ {x∗i , x∗j } do

Maximize (24) to learn z ∗ ∈ {zi∗ , zj∗ }.

Calculate the mean µ(zi∗ , zj∗ ) and the variance var(zi∗ , zj∗ ) by using Eqs.(25) and (26);

return µ(zi∗ , zj∗ ) and var(zi∗ , zj∗ );

ED

4

M

1

PT

4. Experiments and analysis In this section, we compare the proposed GP-Metric with those closely re-

CE

lated GP-based dimension reduction approaches and metric/similarity learn385

ing approaches. Specifically, we investigate the classification performances of these approaches by using k-nearest neighbor (k-NN) classifier which is widely

AC

used to validate the effectiveness of the learned features [39, 40] and metrics [8, 30, 13, 17]. All the experiments were run on computer with Intel(R) Core(TM) i5-3470 @ 3.20GHz CPU and 16.0 GB RAM.

18

ACCEPTED MANUSCRIPT

390

4.1. Dimension reduction To verify the performance of GP-Metric in data visualization and dimension reduction, we use the linear kernel function k(xi , xj ) = xTi xj in the GPLVM

CR IP T

term and the RBF kernel function in the Kronecker product term. With this setting, a set of representative features can be learned and visualized in the 395

low-dimensional space. We compare GP-Metric with five GPLVM approaches: conventional GPLVM 1 , Bayesian GPLVM 2 , DGPLVM 3 , Supervised GPLVM and D-SBGPLVM. We use the GP machine learning toolkit GPmat1 to imple-

ment GP-Metric, Supervised GPLVM and D-SBGPLVM. The Oil Flow data

400

AN US

set and another three UCI data sets (Banknote Authentication, Segment, Car-

diotocography) are used to validate the effectiveness. The detailed information of these data sets are shown in Table 1. In our experiments, these data sets are first normalized to meet zero mean and unit variance.

Table 1: Data sets for dimension reduction test. Banknote

Segment

Oil Flow

Cardiotocography

#training samples

M

Data set

1000

1386

1000

1000

#test samples

372

462

1000

1126

#features

4

19

12

19

#classes

2

7

3

3

PT

ED

Authentication

In data visualization task, we use the Oil Flow data set to verify the su-

CE

periority of the proposed GP-Metric. Specifically, we set the dimensionality of

405

latent variables to 2, and then train GP-Metric and the above-mentioned five models on it. The visualization effects of the approaches are shown in Figure 1.

AC

As we can see in Figure 1(a), the conventional GPLVM can not maintain the separability between classes, since it does not utilize the label information. The 1 https://github.com/SheffieldML/GPmat.git 2 https://github.com/SheffieldML/vargplvm 3 https://github.com/lawrennd/dgplvm.git

19

ACCEPTED MANUSCRIPT

Bayesian GPLVM in Figure 1(b) can learn relatively more discriminant features 410

by using a Bayesian formulation of GPLVM. And while DGPLVM, Supervised GPLVM, D-SBGPLVM and GP-Metric respectively in Figures 1(c), 1(d), 1(e)

CR IP T

and 1(f) learn more discriminant features in visuality by effectively utilizing the label information.

To further validate the effectiveness of GP-Metric, we evaluate the classifi415

cation performance of the learned features on the Oil Flow data set and another

three UCI data sets by using 1-NN and 5-NN classifiers respectively. Specif-

ically, we use 10-fold cross-validation to test the classification error of these

AN US

approaches. In each experiment, the training and the test data sets are randomly sampled from these data sets without replacement and the numbers of 420

samples involved in these data sets are shown in Table 1. The error rates (%) are respectively shown in Tables 2 and 3 where the best results are highlighted in bold. Moreover, the p-values of the non-parametric Wilcoxon rank-sum test are calculated to check the statistical significance. As we can see, by utilizing la-

425

M

bel information, DGPLVM, Supervised GPLVM, D-SBGPLVM and GP-Metric outperform both conventional GPLVM and Bayesian GPLVM. In most cases,

ED

GP-Metric makes lower classification error rate than the other five approaches, indicating that GP-Metric can better utilize the label information to improve

PT

the performance of classification. 4.2. Metric learning

In this subsection, we treat GP-Metric as a similarity learning approach and use it directly to predict the similarity of any given pair of samples. The performance is verified by comparing GP-Metric with the following approaches:

LMNN⁴: Large margin nearest neighbor approach, which makes the k nearest neighbors belong to the same class while samples from different classes are separated by a large margin [13].
GB-LMNN⁴: Gradient-boosting based metric learning approach that learns nonlinear mappings directly in function space [30].
OASIS⁵: Online algorithm for scalable image similarity, a bilinear-model based online metric learning approach [8].
MLR⁶: Metric learning to rank, a general metric learning approach based on the structural SVM framework and various ranking measures [23].
R-MLR⁶: Robust structural metric learning approach, a robust extension of metric learning to rank based on a group sparsity penalty [17].
SCML⁷: Sparse combination of locally discriminative metrics, a nonlinear metric learning approach that uses a combination of many local metrics [15].
LCA: Latent coincidence analysis, a latent variable based metric learning approach [24].
OPML: One-pass metric learning, a one-pass closed-form solution for online metric learning [25].
GMML: Geometric mean metric learning, a Mahalanobis distance metric learning approach that has a closed-form solution [26].

⁴ https://bitbucket.org/mlcircus/lmnn/downloads/
⁵ http://ai.stanford.edu/~gal/Code/OASIS/
⁶ https://github.com/bmcfee/mlr/
⁷ http://mloss.org/revision/download/2004/

Figure 1: Visualization of the Oil Flow data set using (a) GPLVM, (b) Bayesian GPLVM, (c) DGPLVM, (d) Supervised GPLVM, (e) D-SBGPLVM and (f) GP-Metric. ×, ◦ and + represent stratified, annular and homogeneous flows respectively.

Table 2: 1-NN classification error rates of GP-Metric, conventional GPLVM, Bayesian GPLVM, DGPLVM, D-SBGPLVM and Supervised GPLVM (best in bold). •/◦ indicates that the performance of GP-Metric is significantly better/worse than the compared method (at the 0.05 significance level). #Dim denotes the dimensionality of the latent variables.

Data set                 #Dim   GPLVM         Bayesian GPLVM   DGPLVM        Supervised GPLVM   D-SBGPLVM     GP-Metric
Banknote authentication  2      0.268±0.04•   0.224±0.04•      0.203±0.06•   0.206±0.05•        0.193±0.07•   0.163±0.05
Banknote authentication  3      0.361±0.12•   0.348±0.03•      0.364±0.04•   0.306±0.05•        0.318±0.05•   0.210±0.07
Segment                  2      9.090±0.30•   8.291±0.21•      9.320±0.25•   9.956±0.27•        8.979±0.34•   6.926±0.25
Segment                  3      7.642±0.25•   6.584±0.19•      7.412±0.25•   5.410±0.24◦        8.139±0.37•   6.700±0.25
Segment                  4      9.523±0.31•   8.696±0.21•      8.414±0.24•   7.142±0.30         7.324±0.34    7.475±0.27
Oil Flow                 2      8.300±0.12•   6.500±0.08•      0.900±0.08•   0.800±0.09•        0.900±0.09•   0.600±0.07
Oil Flow                 3      8.900±0.14•   7.100±0.08•      0.900±0.11•   0.800±0.11•        0.900±0.10•   0.650±0.09
Cardiotocography         2      3.970±0.09•   2.754±0.07•      2.645±0.09•   2.398±0.08         2.613±0.09•   2.480±0.09
Cardiotocography         3      4.030±0.10•   3.810±0.07•      2.951±0.09•   2.867±0.10•        2.714±0.10•   2.060±0.10
Cardiotocography         4      2.688±0.09•   2.619±0.09•      3.584±0.07•   4.211±0.09•        3.167±0.09•   1.164±0.10
Cardiotocography         5      3.136±0.12•   3.127±0.10•      3.036±0.10•   3.046±0.09         3.002±0.10    2.957±0.10

The experiments are conducted on three widely used data sets: USPS [56],

ACCEPTED MANUSCRIPT

Table 3: 5-NN classification error rates of GP-Metric, convetional GPLVM, Bayesian GPLVM, DGPLVM, D-SBGPLVM and Supervised GPLVM (best in bold). •/◦ indicates the performance of GP-Metric is significantly better/worse than the compared methods (at 0.05 signif-

CR IP T

icant level). #Dim denotes the dimensionality of latent variables.

Data set

#Dim

GPLVM

Bayesian GPLVM

DGPLVM

Banknote authentication

2

1.531±0.09•

0.926±0.08•

0.269±0.18•

Supervised GPLVM D-SBGPLVM GP-Metric 0.272±0.06•

Banknote authentication

3

2.449±0.06•

1.948±0.05•

0.205±0.09•

0.189±0.07◦

Segment

2

10.173±0.46•

9.346±0.32•

12.253±0.16•

7.581±0.34•

10.591±0.21• 6.123±0.51 8.764±0.32• 6.331±0.34

Segment

3

14.296±0.35•

9.103±0.32•

10.163±0.31•

8.235±0.21•

4

14.297±0.38•

9.121±0.34•

9.324±0.22•

6.490±0.37

8.299±0.26• 6.431±0.39

Oil Flow

2

14.700±0.20•

6.900±0.15•

3.700±0.35•

1.400±0.16•

3.600±0.36• 0.500±0.18

7.500±0.30• 0.500±0.16

3

18.600±0.26•

8.200±0.14•

9.000±0.26•

2.400±0.19•

2

6.394±0.11•

3.491±0.09•

2.063±0.09

2.157±0.10•

2.347±0.10• 2.059±0.09

Cardiotocography

3

5.381±0.08•

3.226±0.09•

1.730±0.09•

3.495±0.18•

2.285±0.11• 1.384±0.09

AN US

Oil Flow Cardiotocography Cardiotocography

4

4.116±0.16•

2.719±0.09•

1.256±0.06•

3.852±0.13•

1.934±0.09• 1.156±0.07

Cardiotocography

5

4.267±0.12•

3.048±0.10•

1.521±0.09◦

3.239±0.10•

1.734±0.11

M

MNIST

ED

CIFAR10

PT

Figure 2: Example images of USPS, MNIST and CIFAR10 datasets.

MNIST [57] and CIFAR10 [58] as shown in Figure 2. Both USPS and MNIST are two well known digit recognition data sets, comprising of 16 × 16 and 28 × 28

CE

grayscale images respectively. For each of these two data sets, we select a subset of samples with class labels {1, 3, 5, 7, 9}. CIFAR10 is an image classification

AC

data set, comprising of 32 × 32 pixels color images. Each image belongs to

one of ten classes {airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck }. We select a subset of samples with class labels {automobile, cat,

460

0.201±0.08

Segment

USPS

455

0.237±0.20• 0.155±0.05

0.191±0.15•

dog, horse and truck }. Since the dimensionality of CIFAR10 is high, we use

23

1.798±0.10

ACCEPTED MANUSCRIPT

0.09

0.08 LCA LMNN GB−LMNN OASIS MLR

0.07

0.06 0.05 0.04 0.03 0.02

0.06 0.05 0.04 0.03 0.02 0.01

0.01 0 100

150

200

250

300

350

400

450

0 100

500

# of training samples in each class

150

0.1

0.07 0.06 0.05 0.04 0.03

400

450

500

0.08

R−MLR SCML OPML GMML GP−Metric

0.07 0.06 0.05 0.04 0.03

0.01 100

150

200

250

300

350

400

450

# of training samples in each class

LCA LMNN GB−LMNN OASIS MLR

0.66

R−MLR SCML OPML GMML GP−Metric

ED

0.64 0.62 0.6 0.58 0.56

PT

0.54 0.52 200

300

400

500

600

700

900

0.66

700

450

500

LCA LMNN

R−MLR SCML

GB−LMNN

OPML

OASIS MLR

GMML GP−Metric

0.62 0.6 0.58 0.56

R−MLR SCML OPML GMML GP−Metric

800

900

# of training samples in each class

(g) CIFAR10 SIFT (1-NN)

300

400

500

600

700

800

900

1000

(f) CIFAR10 PCA (5-NN)

1000

Classification error rate

CE 600

400

# of training samples in each class

0.5

500

350

0.64

0.52 200

1000

0.55

400

300

0.7

(e) CIFAR10 PCA (1-NN)

300

250

0.68

# of training samples in each class

0.45 200

200

0.54

800

LCA LMNN GB−LMNN OASIS MLR

150

(d) MNIST (5-NN)

M

0.7 0.68

0.02 100

500

# of training samples in each class

(c) MNIST (1-NN)

Classification error rate

350

LCA LMNN GB−LMNN OASIS MLR

0.09

Classification error rate

0.08

R−MLR SCML OPML GMML GP−Metric

Classification error rate

Classification error rate

0.09

300

AN US

LCA LMNN GB−LMNN OASIS MLR

0.02

Classification error rate

250

(b) USPS (5-NN)

0.11

AC

200

# of training samples in each class

(a) USPS (1-NN)

0.1

R−MLR SCML OPML GMML GP−Metric

CR IP T

Classification error rate

0.07

R−MLR SCML OPML GMML GP−Metric

Classification error rate

LCA LMNN GB−LMNN OASIS MLR

0.08

LCA

R−MLR

LMNN GB−LMNN OASIS

SCML OPML GMML

MLR

GP−Metric

0.55

0.5

0.45 200

300

400

500

600

700

800

900

1000

# of training samples in each class

(h) CIFAR10 SIFT (5-NN)

Figure 3: k-NN (k = 1 and k = 5) classification error rate on the USPS, MNIST, CI-

24 numbers of training samples. FAR10 PCA and CIFAR10 SIFT with different

ACCEPTED MANUSCRIPT

PCA and scale-invariant feature transform (SIFT)8 approaches to reduce the dimensionality into 300 (taking the first 300 eigenvectors that capture at least 95% of the total variance) and 210 (after obtaining the SIFT descriptors, we use

465

CR IP T

bag-of-words representation with 210 words for each image) respectively. These two pre-processed data sets are denoted by CIFAR10 PCA and CIFAR10 SIFT respectively.

In the experiments, for each class in all the data sets, we randomly draw 200, t and 200 samples as test, training and validation sets respectively. We

train the metric learning models and choose the hyper-parameters involved in the compared approaches on the training and validation sets. With these

AN US

470

learned metrics, we use 1-NN and 5-NN classifier to predict the labels of test samples. For different t, we repeat the above process 10 times and compute the average classification error rates on the test sets. Specifically, we set t = {100, 200, 300, 400, 500} for the USPS and MNIST, t = {200, 400, 600, 800, 1000} 475

for the CIFAR10 PCA and CIFAR10 SIFT. The dimension of latent variables

M

in GP-Metric is empirically set to 20 on all the data sets. The results are shown in Figure 3. As we can see, on both USPS and MNIST, the average

ED

error rates of all the approaches are lower due to that these two data sets contain less noise. However, on the CIFAR10, the error rates are much higher 480

because of the various backgrounds, camera angles and positions. On all the

PT

data sets, the nonlinear metric learning approaches have lower error rates than other linear approaches, indicating that nonlinear approaches can learn more

CE

accurate distance/similarity metric. Among all these compared approaches, GP-Metric achieves the best result, demonstrating its effectiveness in nonlinear metric learning tasks.

AC

485

We also calculate the p-values of the non-parametric Wilcoxon rank-sum

test to check the statistical significance. Specifically, in Table 4, we show the pvalues of 5-NN on the USPS, MNIST, CIFAR10 PCA and CIFAR10 SIFT with

200, 200, 1000 and 1000 training samples for each class respectively. As we can 8 https://github.com/adikhosla/feature-extraction

25

ACCEPTED MANUSCRIPT

Table 4: The non-parametric Wilcoxon rank-sum test between GP-Metric and other compared methods.

MNIST

CIFAR10 PCA

GP-Metric→LCA

0.002

0.008

0.006

GP-Metric→LMNN

0.023

0.031

0.015

GP-Metric→GB-LMNN

0.033

0.034

0.031

GP-Metric→OASIS

0.010

0.021

0.025

GP-Metric→MLR

0.003

0.014

0.035

GP-Metric→R-MLR

0.013

0.015

0.025

GP-Metric→SCML

0.032

GP-Metric→OPML

0.002

GP-Metric→GMML

0.031

CIFAR10 SIFT

AN US

CR IP T

USPS

0.017

0.024

0.039

0.019

0.023

0.039

0.041

0.0391

0.043

0.002

0.002

0.018

0.025

0.026

0.021

see, in most cases GP-Metric performs statistically better than the compared

M

490

Compared methods

methods. It is also worth to note that although the p-values with different numbers of training samples in each data set are not provided, in most cases

ED

they are also less than 0.05, indicating the statistical significance between GPMetric and other compared methods. To further demonstrate the performance of GP-Metric, we show the ten

495

PT

nearest neighbour retrieval results of two query images (of CIFAR SIFT with t = 1000) in Figure 4. Specifically, we select two query images with label

CE

automobile and horse from the test set. Then we retrieve their first ten nearest neighbours by using distance/similarity metrics learned in training step. The

500

images in the blue, green and red boxes denote the query, correctly classified

AC

and misclassified images respectively. As we can see from Figure 4, since the automobile is relatively easy to classify because of its distinctive characteristics, all the approaches have higher accuracy when we use their learned metrics to query the automobile image. However, for the horse image, it is more likely to

505

be classified into cat and dog due to the similarity among these three classes.

26

ACCEPTED MANUSCRIPT

LCA LMNN

CR IP T

GB-LMNN OASIS MLR R-MLR SCML

AN US

OPML GMML GP-Metric

(a) Automobile

LMNN

ED

GB-LMNN

M

LCA

OASIS

PT

MLR

R-MLR

CE

SCML

OPML

AC

GMML

GP-Metric (b) Horse

Figure 4: Ten nearest neighbour retrieval results of two query images. Images in blue, green and red boxes denote the query images, correctly classified images and misclassified images, respectively.

27

ACCEPTED MANUSCRIPT

Table 5: The computational and memory complexities analysis of GP-based methods.

Compuation complexity Memory complexity

GPLVM

Bayesian GPLVM

DGPLVM

Supervised GPLVM

D-SBGPLVM

GP-Metric

O(N 3 )

O(N M 2 )

O(N 3 )

O(N 3 )

O(N 3 )

O(N 3 )

2

O(N )

O(N M )

2

O(N )

2

O(N )

2

O(N )

CR IP T

Complexity

The proposed GP-Metric outperforms other compared approaches in both cases. 4.3. Computational and memory complexity

AN US

To demonstrate the efficiency of GP-Metric, we provide an algorithmic com-

parison of the computational and the memory complexities of related GP-based 510

approaches in Table 5. As we have mentioned in Section 3.2, the major com(S)

putational and memory costs are the eigenvalue decomposition of matrix KZZ , which has a computational complexity of O(N 3 ) and a memory complexity of O(N 2 ). In GPLVM, DGPLVM, Supervised GPLVM and D-SBGPLVM, the ma-

515

M

jor computational and memory costs in each iteration are inverting the N × N

matrix, which has also computational and memory complexities of O(N 3 ) and

ED

O(N 2 ) respectively. The Bayesian GPLVM has a lower computational and

memory complexities of O(N M 2 ) and O(N M ) respectively (where M denotes the number of auxiliary points and M  N ), because of using sparse GP. Based 520

PT

on the above analysis, we can conclude that the complexity of GP-Metric is only higher than Bayesian GPLVM but its classification accuracy is the highest in most cases. We also provide an empirical time complexity comparison of GP-

CE

Metric and related metric learning approaches on the four data sets (t = 200 for USPS and MNIST, t = 1000 for CIFAR10 PCA and CIFAR10 SIFT ) in Ta-

AC

ble 6. Although more time-consuming than other metric learning approaches,

525

GP-Metric usually has higher accuracy and many sparse approximation methods [59, 60] can be adapted to further reduce the computational and memory complexities.

28

O(N 2 )

ACCEPTED MANUSCRIPT

Table 6: The comparison of time complexity on different data sets. Data set

LCA

LMNN

GB-LMNN

OASIS

MLR

R-MLR

SCML

OPML

GMML

USPS

20.6s

30.6s

242.1s

83.3s

125.9s

414.2s

51.9s

0.0006s

0.032s

GP-Metric 3895.0s

MNST

35.4s

127.3s

1140.2s

683.8s

4634.2s

7316.5s

76.6s

0.0013s

0.045s

7511.6s

71.8s

4145.1s

30781.7s

1381.1s

183742.0s

324919.4s

985.1s

0.0024s

2.3s

325474.4s

58.3s

2239.9s

16221.7s

508.2s

175775.1s

220351.7s

578.3s

0.0015s

1.9s

224114.5s

5. Conclusion and discussion

CR IP T

CIFAR10 PCA CIFAR10 SIFT

In this paper, we propose a gaussian process based non-parametric metric 530

learning approach and derive a practical algorithm for efficiently learning the

AN US

latent variables and the hyper-parameters. The experimental results show that

our GP-Metric can not only provide more accurate similarity prediction but also learn more representative features than the supervised GP-based feature learning approaches. Furthermore, it can also give uncertainty of prediction, which 535

provides confidence for applications such as medical diagnosis, self-deriving cars and so on. In the future work, we wish to address the high time complexity

M

problem involved in the approach by using approximation approaches [59, 60]

ED

to make GP-Metric available for big data issues.

Acknowledgment

This work was supported by the National Natural Science Foundation of

PT

540

China (NSFC) under Grant Nos. 61672281 and 61472186, the Key Program of NSFC under Grant No. 61732006 and the founding of Jiangsu Innovation

CE

Program for Graduate Education under Grant No. KYLX15 0323.

References

[1] P. Xie, E. Xing, Large scale distributed distance metric learning, arXiv preprint arXiv:1412.5949.

[2] E. P. Xing, A. Y. Ng, M. I. Jordan, S. Russell, Distance metric learning, with application to clustering with side-information, in: Proceedings of the 15th International Conference on Neural Information Processing Systems, 2002, pp. 521–528.

[3] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288–2295.

[4] S. Paisitkriangkrai, C. Shen, A. van den Hengel, Learning to rank in person re-identification with metric ensembles, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1846–1855.

[5] D. Tao, L. Jin, Y. Wang, X. Li, Person reidentification by minimum classification error-based kiss metric learning, IEEE Transactions on Cybernetics 45 (2) (2015) 242–252.

[6] B. McFee, L. Barrington, G. Lanckriet, Learning content similarity for music recommendation, IEEE Transactions on Audio, Speech, and Language Processing 20 (8) (2012) 2207–2218.

[7] S. C. Hoi, W. Liu, S.-F. Chang, Semi-supervised distance metric learning for collaborative image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.

[8] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (Mar) (2010) 1109–1135.

[9] Z. Kuang, J. Sun, K.-Y. Wong, Learning regularized, query-dependent bilinear similarities for large scale image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 413–420.

[10] S. Xiang, F. Nie, C. Zhang, Learning a mahalanobis distance metric for data clustering and classification, Pattern Recognition 41 (12) (2008) 3600–3612.

[11] L. Yang, R. Jin, R. Sukthankar, Bayesian active distance metric learning, in: Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007, pp. 442–449.

[12] A. Bellet, A. Habrard, M. Sebban, A survey on metric learning for feature vectors and structured data, arXiv preprint arXiv:1306.6709.

[13] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (Feb) (2009) 207–244.

[14] K. Liu, A. Bellet, F. Sha, Similarity learning for high-dimensional sparse data, in: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015, pp. 653–662.

[15] Y. Shi, A. Bellet, F. Sha, Sparse compositional metric learning, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014, pp. 2078–2084.

[16] C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, N. D. Lawrence, Recurrent gaussian processes, arXiv preprint arXiv:1511.06644.

[17] D. Lim, B. Mcfee, G. R. Lanckriet, Robust structural metric learning, in: Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 615–623.

[18] K. P. Murphy, Machine learning: a probabilistic perspective, MIT Press, 2012.

[19] P. Orbanz, Y. W. Teh, Bayesian nonparametric models, in: Encyclopedia of Machine Learning, Springer, 2011.

[20] B. Babagholami-Mohamadabadi, S. M. Roostaiyan, A. Zarghami, M. S. Baghshah, Multi-modal distance metric learning: A bayesian non-parametric approach, in: European Conference on Computer Vision, Springer, 2014, pp. 63–77.

[21] C. E. Rasmussen, Gaussian processes for machine learning, MIT Press, 2006.

[22] J. V. Davis, B. Kulis, P. Jain, S. Sra, I. S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 209–216.

[23] B. Mcfee, G. R. G. Lanckriet, Metric learning to rank, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 775–782.

[24] M. Der, L. K. Saul, Latent coincidence analysis: A hidden variable model for distance metric learning, in: Advances in Neural Information Processing Systems, 2012, pp. 3230–3238.

[25] W. Li, Y. Gao, L. Wang, L. Zhou, J. Huo, Y. Shi, Opml: A one-pass closed-form solution for online metric learning, Pattern Recognition 75 (2018) 302–314.

[26] P. H. Zadeh, R. Hosseini, S. Sra, Geometric mean metric learning, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2464–2471.

[27] C. Kang, S. Liao, Y. He, J. Wang, W. Niu, S. Xiang, C. Pan, Cross-modal similarity learning: A low rank bilinear formulation, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 1251–1260.

[28] X. Gao, S. C. H. Hoi, Y. Zhang, J. Wan, J. Li, Soml: Sparse online metric learning with application to image retrieval, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014, pp. 1206–1212.

[29] E. Fetaya, S. Ullman, Learning local invariant mahalanobis distances, in: Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 162–168.

[30] D. Kedem, S. Tyree, K. Q. Weinberger, F. Sha, G. Lanckriet, Non-linear metric learning, in: Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012, pp. 2573–2581.

[31] E. Straszecka, Combining uncertainty and imprecision in models of medical diagnosis, Information Sciences 176 (20) (2006) 3026–3059.

[32] S. Brechtel, T. Gindele, R. Dillmann, Probabilistic decision-making under uncertainty for autonomous driving using continuous pomdps, in: Proceedings of the 17th International Conference on Intelligent Transportation Systems, IEEE, 2014, pp. 392–399.

[33] M. Raissi, Parametric gaussian process regression for big data, arXiv preprint arXiv:1704.03144.

[34] A. Wilson, Z. Ghahramani, D. A. Knowles, Gaussian process regression networks, in: Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 599–606.

[35] T. Bui, J. Hernández-Lobato, D. Hernández-Lobato, Y. Li, R. Turner, Deep gaussian processes for regression using approximate expectation propagation, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2187–2208.

[36] N. Lawrence, Probabilistic non-linear principal component analysis with gaussian process latent variable models, Journal of Machine Learning Research 6 (2005) 1783–1816.

[37] M. K. Titsias, N. D. Lawrence, Bayesian gaussian process latent variable model, in: Proceedings of the 13th International Workshop on Artificial Intelligence & Statistics, Vol. 9, 2010, pp. 844–851.

[38] S. Eleftheriadis, O. Rudovic, M. Pantic, Shared gaussian process latent variable model for multi-view facial expression recognition, in: International Symposium on Visual Computing, Springer, 2013, pp. 527–538.

[39] R. Urtasun, T. Darrell, Discriminative gaussian process latent variable model for classification, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 927–934.

[40] X. Gao, X. Wang, D. Tao, X. Li, Supervised gaussian process latent variable model for dimensionality reduction, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41 (2) (2011) 425–434.

[41] S. Eleftheriadis, O. Rudovic, M. Pantic, Discriminative shared gaussian processes for multiview and view-invariant facial expression recognition, IEEE Transactions on Image Processing 24 (1) (2015) 189–204.

[42] V. Ntouskos, P. Papadakis, F. Pirri, Probabilistic discriminative dimensionality reduction for pose-based action recognition, in: Pattern Recognition Applications and Methods, Springer International Publishing, 2015, pp. 137–152.

[43] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (10) (2000) 2385–2404.

[44] X. Wang, X. Gao, Y. Yuan, D. Tao, J. Li, Semi-supervised gaussian process latent variable model with pairwise constraints, Neurocomputing 73 (10-12) (2010) 2186–2195.

[45] N. D. Lawrence, Local distance preservation in the gp-lvm through back constraints, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 513–520.

[46] P. Li, S. Chen, Hierarchical gaussian processes model for multi-task learning, Pattern Recognition 74 (2018) 134–144.

[47] A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Gaussian processes for object categorization, International Journal of Computer Vision 88 (2) (2010) 169–188.

[48] M. Kemmler, E. Rodner, E.-S. Wacker, J. Denzler, One-class classification with gaussian processes, Pattern Recognition 46 (12) (2013) 3507–3518.

[49] E. Rodner, D. Hegazy, J. Denzler, Multiple kernel gaussian process classification for generic 3d object recognition, in: Proceedings of the 25th International Conference of Image and Vision Computing New Zealand, 2010, pp. 1–8.

[50] D. M. Blei, M. I. Jordan, et al., Variational inference for dirichlet process mixtures, Bayesian Analysis 1 (1) (2006) 121–143.

[51] Z. Dai, J. Hensman, N. Lawrence, Spike and slab gaussian process latent variable models, arXiv preprint arXiv:1505.02434.

[52] G. Song, S. Wang, Q. Huang, Q. Tian, Multimodal gaussian process latent variable models with harmonization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5029–5037.

[53] M. E. Tipping, C. M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3) (1999) 611–622.

[54] X. Jiang, J. Gao, X. Hong, Z. Cai, Gaussian processes autoencoder for dimensionality reduction, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2014, pp. 62–73.

[55] T. Iwata, Z. Ghahramani, Improving output uncertainty estimation and generalization in deep learning via neural network gaussian processes, arXiv preprint arXiv:1707.05922.

[56] J. J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5) (1994) 550–554.

[57] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[58] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images. URL http://www.cs.toronto.edu/~kriz/cifar.html

[59] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013, pp. 282–290.

[60] A. Wilson, H. Nickisch, Kernel interpolation for scalable structured gaussian processes (kiss-gp), in: Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 1775–1784.


Ping Li received his B.S. and M.S. degrees in Management Science & Engineering from Anhui University of Technology in 2011 and 2014, respectively. He is currently pursuing the Ph.D. degree with the College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics. His research interests include pattern recognition and machine learning.


Songcan Chen received his B.S. degree in mathematics from Hangzhou University (now merged into Zhejiang University) in 1983. In 1985, he completed his M.S. degree in computer applications at Shanghai Jiaotong University and then worked at NUAA in January 1986. There he received a Ph.D. degree in communication and information systems in 1997. Since 1998, as a full-time professor, he has been with the College of Computer Science & Technology at NUAA. His research interests include pattern recognition, machine learning and neural computing.