Accelerated low-rank sparse metric learning for person re-identification

Pattern Recognition Letters 112 (2018) 234–240 Contents lists available at ScienceDirect Pattern Recognition Letters journal homepage: www.elsevier...



Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec

Niki Martinel

Department of Mathematics and Computer Science, University of Udine, Via Delle Scienze, 206, Udine 33100, Italy

Article info

Article history: Received 12 January 2018; available online 31 July 2018

Keywords: Person re-identification; Metric learning; Low-rank manifold; Proximal gradient optimization

Abstract

Person re-identification is an open and challenging problem in computer vision. A surge of effort has been spent on designing the best feature representation and on learning either the transformation of such features across cameras or an optimal matching metric. Metric learning solutions, which are currently in vogue in the field, generally require a dimensionality reduction pre-processing stage to handle the high dimensionality of the adopted feature representation. Such an approach is suboptimal, and a better solution can be achieved by integrating this step into the metric learning process. Towards this objective, a low-rank matrix which projects the high-dimensional vectors to a low-dimensional manifold with a discriminative Euclidean distance is introduced. The goal is achieved with a stochastic accelerated proximal gradient method. Experiments on two public benchmark datasets show that better performance than state-of-the-art methods is achieved. © 2018 Elsevier B.V. All rights reserved.

1. Introduction

The problem of tracking pedestrians within a single camera field-of-view (FoV), i.e., intra-camera tracking, has been at the forefront of computer vision for decades (e.g., [13,24,37,46]). Only recently, motivated by the increasing demand for safety, there has been rising interest in the problem of tracking a person moving across disjoint camera FoVs, i.e., inter-camera tracking with no temporal constraints. This is known as the person re-identification problem. Such a problem is very attractive since, for video surveillance applications, knowing whether a person is present in the monitored area at a precise time instant is of paramount importance.

The problem has received increasing attention (see [41] for a recent survey) due to its intrinsic open challenges. Person images generally have low spatial resolution, which makes the acquisition of discriminating biometric features unreliable. This motivated the community to mainly exploit visual appearance features, which in turn are affected by the complex variations of individuals' clothing appearance (due to viewing angles, lighting, background clutter, and occlusions).

A swell of efforts has been devoted to addressing the stated problems by following three main approaches: (i) discriminative signature based methods, which seek to design the best visual representation that is robust to the aforementioned challenges; (ii) feature transformation based methods, which aim to model the transformation of visual features that occurs between pairs of cameras; (iii) metric learning based approaches, which focus on learning discriminant metrics yielding an optimal matching score/distance between gallery and probe images.

Despite such efforts, the first family of works suffers from significant limitations, including the fact that a hand-crafted visual representation is generally not sufficiently robust to the illumination, pose, and occlusion challenges that are common in re-identification. Indeed, as shown in Fig. 1, the projection of the same visual feature into the feature space of a disjoint camera is likely to return a wrong list of matches. On the other hand, if a distance function can be learned in such a way that features of the same person acquired from two cameras and projected onto a shared feature space are "closer" than features of different persons, then the re-identification goal can be better tackled (Fig. 2).

Motivation and contribution: While widely explored, metric learning-based solutions generally require a pre-processing stage to handle the high dimensionality of the adopted feature representation. This usually translates into the application of Principal Component Analysis (PCA) before the learning process starts. It is reasonable to believe that such an approach is suboptimal and may induce severe misclassification errors due to the rejection, by PCA, of low-variance components which may carry relevant information for the re-identification task. A better solution could be achieved by combining the metric learning and the dimensionality reduction steps in such a way that relevant components are identified by means of the re-identification performance. The core contribution of this work is a metric learning approach, named Accelerated Low-Rank sparse Metric learning (ALRM), which produces a low-rank solution that self-determines the discriminative dimensions of the underlying manifold. In the experimental section, results show that ALRM has similar or better performance than state-of-the-art solutions on two benchmark datasets.

Fig. 1. Example of tackling the re-identification problem by projection of visual features between camera-dependent Euclidean feature spaces.

Fig. 2. Proposed approach pipeline. High-dimensional visual features are extracted from image pairs acquired by non-overlapping cameras. Common metric learning solutions apply a pre-processing stage to handle such a high-dimensional space. In the proposed approach, the metric learning and the dimensionality reduction steps are combined in such a way that relevant components are identified by means of the re-identification performance.

E-mail address: [email protected] https://doi.org/10.1016/j.patrec.2018.07.033 0167-8655/© 2018 Elsevier B.V. All rights reserved.

2. Related work

Person re-identification has recently become an expansive field of research [41]. A surge of effort has been devoted to attacking the problem from different perspectives, ranging from partially seen persons [49] to low resolution images [19], which can eventually be synthesized in the open-world re-identification idea [53]. In the following, a brief overview of the literature is given by discussing relevant works following the three main research directions.

Discriminative signature based methods have been the most widely used ones. Salient color names [45] and the color distribution structure [16] were proposed as robust feature descriptors. To tackle pose changes, correlations between random patches extracted from pairs of images were also exploited [28]. To handle background and illumination variations, feature representations based on the combination of Biologically Inspired Features (BIF) [26] and covariance descriptors [21] were introduced. Works going beyond appearance features and integrating a semantic aspect into image representations were proposed by leveraging the interaction between attributes and appearance [14] and by describing the relations among the low-level part features, middle-level clothing attributes, and high-level re-identification labels [17].

Feature transformation methods have addressed the re-identification problem by modeling the transformation of visual features that occurs between pairs of cameras. An early work was proposed in [11], where the Brightness Transfer Function (BTF) computed between the appearance features was used to match persons across camera pairs. More recently, local eigen-dissimilarities between multiple features extracted from image pairs [29] as well as the warp function space of feature transformations [27] were explored. Camera viewpoint information was also explored to train binary [8,18] and dictionary-based classifiers [12] according to the similarity of cross-view transforms.

Metric learning approaches focus on learning discriminant metrics which aim to yield an optimal matching score/distance between a gallery and a probe image. Since the early work of [44], many different solutions have been introduced [4]. More specifically, in the re-identification field metric learning approaches have been proposed by relaxing [10] or enforcing [22] the PSD conditions as well as by considering equivalence constraints [40]. While most of the existing methods capture the global structure of the dissimilarity space, local solutions [20,34] have been proposed too. Motivated by the success of both solutions, approaches combining them in metric ensembles [30,33] have been introduced. Different solutions yielding similarity measures have also been investigated by proposing to learn listwise [6] and pairwise [52] similarities as well as mixtures of polynomial kernel-based models [5]. Related to these similarity learning models are the deep architectures which have been exploited to tackle the task [39,54].

With respect to all such methods, two recent works [21,22] share the idea of finding discriminative low-rank projections; however, there are significant differences with the proposed method. Specifically, this work introduces: (i) an accelerated convex optimization solver which reduces the per-epoch computational complexity while improving the convergence rate. This is obtained by formulating the problem in a stochastic fashion, so that the metric parameters are updated without visiting all the training samples in each epoch; (ii) an additional sparsity regularizer on the low-rank projection that allows the relevant components of the underlying manifold to be self-discovered, thus avoiding the specification of its dimensionality beforehand.

3. Methodology

3.1. Preliminaries and definitions

Let $\mathcal{P} = \{I_p\}_{p=1}^{|\mathcal{P}|}$ and $\mathcal{G} = \{I_g\}_{g=1}^{|\mathcal{G}|}$ be the sets of probe and gallery images acquired by two disjoint cameras. Let $x_p \in \mathbb{R}^d$ and $x_g \in \mathbb{R}^d$ be the feature representations of $I_p$ and $I_g$ of two persons $p$ and $g$. Let $\mathcal{X} = \{(x_p, x_g; y_{p,g})^{(i)}\}_{i=1}^{n}$ denote the training set of $n = |\mathcal{P}| \times |\mathcal{G}|$ probe-gallery pairs, where $y_{p,g} \in \{-1, +1\}$ indicates whether $p$ and $g$ are the same person ($+1$) or not ($-1$). Finally, let an iteration be a parameter update computed by visiting a single sample and let an epoch denote a complete cycle over the training set.

3.2. Low-rank sparse metric learning

In this section, the core contribution of this work is discussed: a low-rank sparse metric learning approach along with an accelerated stochastic solver.

Objective. The image feature representations $x$ might be very high-dimensional and contain non-discriminative components. To address such a problem, metric learning approaches generally apply PCA before the learning process starts. To embed similar PCA capabilities in discovering the discriminative components within the learned metric, a matrix $L \in \mathbb{R}^{r \times d}$ that projects the high-dimensional vectors to a low-dimensional manifold with a discriminative Euclidean distance

$$\delta_L(x_p, x_g) = \|L x_p - L x_g\|_2^2 = (x_p - x_g)^T L^T L (x_p - x_g) \qquad (1)$$

should be obtained. The aim is that (1) produces a "small" dissimilarity if $p$ and $g$ are the same person, and a "large" dissimilarity otherwise. Combining this aim with the pairwise label yields

$$D_L(p, g) = y_{p,g}\left(-\frac{1}{2}\,\delta_L(x_p, x_g)\right) \qquad (2)$$

where the $1/2$ has been included only to simplify the form of the gradient in the gradient-based convex optimization solution. Considering (2) in a smooth margin loss [35] that facilitates the gradient-based optimization results in

$$\ell_L(p, g) = \begin{cases} \frac{1}{2} - D_L(p, g) & \text{if } D_L(p, g) \le 0 \\ \frac{1}{2}\left(1 - D_L(p, g)\right)^2 & \text{if } 0 < D_L(p, g) < 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

Learning the low-rank projection matrix using (2) does not guarantee that all the $r$ dimensions of the resulting manifold are discriminative for the re-identification task. To overcome such a problem, motivated by the success of feature selection methods [31], the group sparsity induced by the $\ell_{2,1}$ norm [2] was considered to drive rows of $L$ to zero. This corresponds to rejecting non-discriminative manifold dimensions. Armed with the $\ell_{2,1}$ norm and considering the empirical risk optimization problem over the set of training data patterns in $\mathcal{X}$, the final objective function can be written as

$$\arg\min_{L} \; J_L + \beta \|L\|_{2,1} \qquad (4)$$

where

$$J_L = \frac{1}{n} \sum_{i=1}^{n} \ell_L\left(p^{(i)}, g^{(i)}\right) \qquad (5)$$

and $p^{(i)}$ and $g^{(i)}$ denote the identities of persons $p$ and $g$ in the $i$-th pair of $\mathcal{X}$. $\beta$ is a trade-off parameter controlling the regularization strength.

Accelerated optimization solution. The objective function in (4) is a sum of two functions, where the first is convex and smooth, while the second is an $\ell_{2,1}$ norm, which is convex but non-smooth. Given that the loss function $J_L$ is differentiable with Lipschitz continuous gradients and that the regularization function $\|\cdot\|_{2,1}$ is convex, a classical solution to the objective in (4) can be obtained through the proximal gradient method. As shown in [3], at each epoch $s$ the standard batch proximal gradient (BPGD) method finds the unique minimizer of a quadratic approximation of the objective function in (4) at a given point $L^{(s)}$. The unique solution yields the updated parameters $L^{(s+1)}$. More specifically, the standard proximal gradient method computes

$$L^{(s+1)} = \arg\min_{L} \; \nabla J_{L^{(s)}}^{T} L + \frac{1}{2\eta} \|L - L^{(s)}\|_2^2 + \beta \|L\|_{2,1} \qquad (6)$$

with $\nabla J_{L^{(s)}} = \frac{1}{n} \sum_{i=1}^{n} \nabla \ell_{L^{(s)}}(p^{(i)}, g^{(i)})$, where $\eta$ is the step size. From (6) it can be noticed that, to compute a single step towards the optimal solution for $L$, the gradient should be evaluated for all the $n$ samples. This can be extremely computationally expensive if $n$ is very large. Such a problem can be alleviated by considering a stochastic proximal gradient descent (SPGD) method. More specifically, with the definition of the proximal mapping operator [2]

$$\mathrm{prox}(Z) = \arg\min_{L} \; \frac{1}{2} \|L - Z\|_2^2 + \beta \|L\|_{2,1} \qquad (7)$$

at each iteration $t$, i.e., by considering a single random pair $(p^{(t)}, g^{(t)})$, the low-rank projection is updated as

$$\tilde{L}^{(t+1)} = \mathrm{prox}\left(\tilde{L}^{(t)} - \eta \nabla \ell_{\tilde{L}^{(t)}}(p^{(t)}, g^{(t)})\right) \qquad (8)$$

where $\tilde{L}^{(t)}$ and $L^{(s)}$ are used to distinguish between the update occurring at each iteration and at each epoch, respectively. By formulating the optimization problem with such a stochastic approach, the effort required to perform an update is significantly reduced. However, due to the variance introduced by the required random sampling, $\eta$ has to gradually decay with each iteration. As a result, this solution has a slower convergence rate than BPGD. To overcome such a problem, the stochastic variance-reduced gradient (SVRG) method [43] with mini-batch acceleration [32] is exploited. This reduces the variance in estimating the gradient for the whole training set. Thus, a larger step size can be used and a faster convergence rate is achieved. More specifically, let $L^{(s)}$ and $\tilde{L}^{(t)}$ be the solutions obtained at epoch $s$ and at iteration $t$, the latter using the mini-batch set $B^{(t)} = \{(x_p, x_g; y_{p,g})^{(j)}\}_{j=1}^{m}$, where $m < n$ defines the number of randomly sampled pairs. Then, at each iteration, the following mini-batch gradient direction is considered

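$$\nabla J_{\tilde{L}}^{B^{(t)}} = \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{\tilde{L}^{(t)}}(p^{(j)}, g^{(j)}) - \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{L^{(s)}}(p^{(j)}, g^{(j)}) + \nabla J_{L^{(s)}} \qquad (9)$$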

where the subscripts denote the parameters which are considered in the gradient computation. Once the mini-batch gradient direction is obtained, Nesterov acceleration can be exploited to obtain the updated iteration parameter estimate as

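$$\tilde{L}^{(t+1)} = \mathrm{prox}\left(\tilde{L}^{(t)} - \eta \nabla J_{\tilde{L}}^{B^{(t)}}\right) + \lambda \left( \mathrm{prox}\left(\tilde{L}^{(t)} - \eta \nabla J_{\tilde{L}}^{B^{(t)}}\right) - \tilde{L}^{(t)} \right) \qquad (10)$$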

where $\lambda$ is the constant-step Nesterov acceleration parameter. After the $K$ iterations over the mini-batch sets are completed, the updated solution for $L^{(s)}$ is given by $L^{(s+1)} = \tilde{L}^{(K)}$. To minimize the stated objective with the proposed accelerated procedure, the gradient of the loss function as well as the proximal mapping operator should be defined. Precisely, considering (3), computing the gradient $\nabla \ell_{\tilde{L}^{(t)}}(p^{(t)}, g^{(t)})$ with respect to $\tilde{L}^{(t)}$ results in

$$\nabla \ell_{\tilde{L}^{(t)}}(p^{(t)}, g^{(t)}) = \begin{cases} y_{p,g}\, \tilde{L}^{(t)} (x_p - x_g)(x_p - x_g)^T & \text{if } D_L(p, g) \le 0 \\ y_{p,g}\, \tilde{L}^{(t)} (x_p - x_g)(x_p - x_g)^T \left(1 - D_{\tilde{L}^{(t)}}(p^{(t)}, g^{(t)})\right) & \text{if } 0 < D_L(p, g) < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$
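As a concrete reading of the smooth margin loss in (3) and its gradient in (11), the following NumPy sketch evaluates both for a single pair. It assumes that the margin in (2) reads $D_L(p, g) = -\frac{1}{2}\, y_{p,g}\, \delta_L(x_p, x_g)$, which is the form whose derivative matches (11); the function names are illustrative and are not taken from the released source code.

```python
import numpy as np

def sq_dist(L, xp, xg):
    """Squared Euclidean distance after projecting with L, as in Eq. (1)."""
    diff = L @ (xp - xg)
    return float(diff @ diff)

def margin(L, xp, xg, y):
    """D_L(p, g), ASSUMING the reading D = -0.5 * y * delta_L of Eq. (2)."""
    return -0.5 * y * sq_dist(L, xp, xg)

def pair_loss(L, xp, xg, y):
    """Smooth margin loss of Eq. (3) for one probe-gallery pair."""
    D = margin(L, xp, xg, y)
    if D <= 0:
        return 0.5 - D
    if D < 1:
        return 0.5 * (1.0 - D) ** 2
    return 0.0

def pair_grad(L, xp, xg, y):
    """Gradient of the pair loss with respect to L, following Eq. (11)."""
    D = margin(L, xp, xg, y)
    if D >= 1:
        return np.zeros_like(L)
    d = xp - xg
    # y * L (xp - xg)(xp - xg)^T, computed as an outer product so that
    # the d x d matrix (xp - xg)(xp - xg)^T is never materialised.
    base = y * np.outer(L @ d, d)
    return base if D <= 0 else (1.0 - D) * base

# Tiny usage example: identity projection, unit-length difference vector.
L = np.eye(2)
xp, xg = np.array([1.0, 0.0]), np.array([0.0, 0.0])
loss_same = pair_loss(L, xp, xg, +1)   # same person: penalised for being apart
loss_diff = pair_loss(L, xp, xg, -1)   # different persons: penalised for being close
```

Under this reading a positive pair always falls in the first case of (3), so its loss grows linearly with the projected distance, whereas a negative pair stops contributing once its projected squared distance reaches 2.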

Then, by defining $\bar{L}^{(t)} = \tilde{L}^{(t)} - \eta \nabla J_{\tilde{L}}^{B^{(t)}}$, the proximal mapping operator computes

$$\mathrm{prox}\left(\bar{L}^{(t)}\right) = \arg\min_{L} \; \frac{1}{2\eta} \|L - \bar{L}^{(t)}\|_F^2 + \beta \|L\|_{2,1} = \bar{L}^{(t)}_{i,:} \max\left\{0,\; 1 - \frac{\eta \beta}{\|\bar{L}^{(t)}_{i,:}\|_2}\right\} \qquad (12)$$

whose closed-form solution has been obtained using the group soft-thresholding technique [2], and where $i = 1, \dots, r$ denotes the $i$-th row of a parameter matrix. These two components are then adopted in the accelerated SPGD procedure as summarized in Algorithm 1.
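The row-wise closed form in (12) can be sketched in a few lines of NumPy. The helper names (`prox_l21`, `l21_norm`, `nonzero_rows`) are illustrative assumptions; the `threshold` argument plays the role of the product $\eta\beta$.

```python
import numpy as np

def prox_l21(Z, threshold):
    """Group soft-thresholding: scale each row z of Z by max(0, 1 - threshold / ||z||_2)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    # The small floor only avoids a division by zero for rows that are already zero.
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return Z * scale

def l21_norm(L):
    """Sum of the Euclidean norms of the rows of L."""
    return float(np.linalg.norm(L, axis=1).sum())

def nonzero_rows(L):
    """Number of manifold dimensions (rows of L) that survive the thresholding."""
    return int(np.count_nonzero(np.linalg.norm(L, axis=1)))

Z = np.array([[3.0, 4.0],    # row norm 5.0: shrunk by the factor 1 - 1/5
              [0.1, 0.0],    # row norm 0.1: below the threshold, set to zero
              [0.0, 0.0]])   # already zero: stays zero
P = prox_l21(Z, threshold=1.0)
```

Rows whose norm falls below the threshold are removed entirely, which is how the regularizer discards whole manifold dimensions rather than individual entries.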


Fig. 3. Five image pairs from the (a) VIPeR, (b) PRID450S, and (c) Market-1501 datasets. Columns correspond to different persons, rows to different cameras.

Table 1 Comparison with state-of-the-art methods on the VIPeR dataset. Best results for each rank are in boldface font.

Algorithm 1: ALRM – Accelerated stochastic low-rank sparse metric learning.

Input: $\eta > 0$, $m > 0$, $\lambda > 0$, $L^{(1)}$, $\mathcal{X}$

Iterate $s = 1, 2, \dots$

  1. Compute the full gradient over all training samples using the current epoch solution: $\nabla J_{L^{(s)}} = \frac{1}{n} \sum_{i=1}^{n} \nabla \ell_{L^{(s)}}(p^{(i)}, g^{(i)})$

  2. Set $\tilde{L}^{(1)} = L^{(s)}$

  Iterate $t = 1, \dots, K$

    1. Uniformly sample $m$ pairs from the training set $\mathcal{X}$: $B^{(t)} = \{(x_p, x_g; y_{p,g})^{(j)}\}_{j=1}^{m}$

    2. Compute the mini-batch gradient as in Eq. (9): $\nabla J_{\tilde{L}}^{B^{(t)}} = \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{\tilde{L}^{(t)}}(p^{(j)}, g^{(j)}) - \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{L^{(s)}}(p^{(j)}, g^{(j)}) + \nabla J_{L^{(s)}}$

    3. Evaluate the proximal mapping operator using the current iteration solution and the mini-batch gradient, then apply Nesterov acceleration: $\bar{L}^{(t)} = \tilde{L}^{(t)} - \eta \nabla J_{\tilde{L}}^{B^{(t)}}$, $\tilde{L}^{(t+1)} = \mathrm{prox}(\bar{L}^{(t)}) + \lambda \left( \mathrm{prox}(\bar{L}^{(t)}) - \tilde{L}^{(t)} \right)$

  3. Update the epoch solution: $L^{(s+1)} = \tilde{L}^{(K)}$
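The steps of Algorithm 1 can be put together in the following NumPy sketch. It is a minimal illustration under stated assumptions, not the released MATLAB implementation: the margin is read as $D = -\frac{1}{2}\, y\, \delta_L$, the initial $L^{(1)}$ is drawn at random, mini-batches are sampled without replacement, and the default values of `eta`, `beta`, `m`, `K`, and `epochs` are placeholders rather than the settings used in the experiments. Pairs are passed as precomputed difference vectors $x_p - x_g$.

```python
import numpy as np

def pair_grad(L, d, y):
    """Gradient of the smooth margin loss for one pair, with d = xp - xg.
    ASSUMES the margin D = -0.5 * y * ||L d||^2 (one reading of Eq. (2))."""
    Ld = L @ d
    D = -0.5 * y * float(Ld @ Ld)
    if D >= 1:
        return np.zeros_like(L)
    scale = 1.0 if D <= 0 else (1.0 - D)
    return scale * y * np.outer(Ld, d)          # y * L d d^T, as in Eq. (11)

def mean_grad(L, diffs, labels, idx):
    """Average of the pair gradients over the pairs indexed by idx."""
    return sum(pair_grad(L, diffs[i], labels[i]) for i in idx) / len(idx)

def prox_l21(Z, threshold):
    """Row-wise group soft-thresholding, as in Eq. (12)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z * np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))

def alrm_sketch(diffs, labels, r, eta=0.1, beta=1e-3, m=8, K=20, epochs=10, seed=0):
    """Sketch of Algorithm 1. diffs is an (n, d) array of xp - xg vectors,
    labels an (n,) array with entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = diffs.shape
    lam = (m - 2) / (m + 2)                     # Nesterov parameter, (m - 2)/(m + 2)
    L = 0.1 * rng.standard_normal((r, d))       # L^(1): random init is an assumption
    all_idx = np.arange(n)
    for _ in range(epochs):
        full = mean_grad(L, diffs, labels, all_idx)          # epoch step 1
        L_t = L.copy()                                       # epoch step 2
        for _ in range(K):
            batch = rng.choice(n, size=m, replace=False)     # inner step 1
            g = (mean_grad(L_t, diffs, labels, batch)        # inner step 2, Eq. (9)
                 - mean_grad(L, diffs, labels, batch) + full)
            P = prox_l21(L_t - eta * g, eta * beta)          # inner step 3, Eq. (12)
            L_t = P + lam * (P - L_t)                        # Nesterov step, Eq. (10)
        L = L_t                                              # epoch step 3
    return L

# Toy usage: small differences for matching pairs, large ones for non-matching pairs.
rng = np.random.default_rng(1)
diffs = np.vstack([0.05 * rng.standard_normal((20, 4)), 2.0 + rng.standard_normal((20, 4))])
labels = np.array([1] * 20 + [-1] * 20)
L_learned = alrm_sketch(diffs, labels, r=2)
```

The full gradient is recomputed only once per epoch, so each inner iteration touches just $m$ pairs (twice: once at the current iterate and once at the epoch snapshot).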

4. Experimental results

Datasets: Three publicly available benchmark datasets have been considered to evaluate the proposed approach (details follow). Such datasets include common re-identification challenges like viewpoint variations, background clutter and severe color changes (see Fig. 3 for a few sample images). Following the literature [1,5], results averaged over 10 independent trials are provided using the Cumulative Matching Characteristic (CMC) curve.

VIPeR [9] is considered one of the most challenging person re-identification datasets. It contains 1,264 images of 632 persons viewed by two cameras. Most of the image pairs have viewpoint changes larger than 90°. Following the general protocol, the dataset has been split into a training and a test set each including 316 persons.

PRID450S [36] is a more recent dataset with 450 persons viewed by two disjoint cameras suffering from viewpoint changes, background interference and partial occlusion. As performed in the literature [38,45], the dataset has been partitioned into a training and a test set each with 225 individuals.

Market-1501 [50], with 32,668 images of 1,501 persons taken from 6 disjoint cameras, is one of the largest person re-identification datasets. Its images have been obtained by means of a state-of-the-art detector, thus providing a realistic setup. To run the experiments, the common protocol [50] has been followed, hence the train/test partitions containing 750 and 751 person identities have been used.
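For readers unfamiliar with the CMC curve used in the tables, the sketch below shows one common way to compute it from a probe-gallery distance matrix. It assumes a single-shot setting in which every probe identity appears exactly once in the gallery (as in the VIPeR and PRID450S splits described above); the function name and the toy distances are illustrative.

```python
def cmc_curve(dist, probe_ids, gallery_ids, max_rank):
    """dist[i][j] is the distance between probe i and gallery item j.
    Returns a list cmc where cmc[k - 1] is the fraction of probes whose
    correct match appears among the k closest gallery items."""
    hits = [0] * max_rank
    for i, row in enumerate(dist):
        # Gallery indices sorted by increasing distance to probe i.
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranked_ids = [gallery_ids[j] for j in order]
        rank = ranked_ids.index(probe_ids[i])   # 0-based rank of the correct match
        for k in range(rank, max_rank):
            hits[k] += 1
    return [h / len(dist) for h in hits]

# Toy example with three probes and three gallery items.
dist = [[0.1, 0.5, 0.9],
        [0.4, 0.6, 0.2],
        [0.7, 0.3, 0.8]]
cmc = cmc_curve(dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
```

In the toy example only the first probe is matched at rank 1, while the other two find their match at rank 3, so the curve stays at one third for ranks 1 and 2 and reaches 1.0 at rank 3.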

Method            Rank 1   Rank 10
ALRM              45.88    81.96
LMF+LADF [48]     43.29    85.13
SS-SVM [47]       42.66    84.27
KEPLER [30]       42.41    82.37
MLAPG [22]        40.73    82.34
XQDA [21]         40.00    80.51
SCNCDFinal [45]   37.80    81.20
PKFM [5]          36.8     83.7
ROCCA [1]         30.44    75.63
QALF [49]         30.17    62.44
LMF [48]          29.10    66.30
MCE-KISS [40]     28.2     72.1
ISR [23]          27.43    61.06
IMS-LFDA [19]     26.27    69.78

Rank 20 (values in source order; unreported entries are absent, so the values cannot be aligned to the rows above): 89.55 94.12 91.93 90.70 92.37 91.08 90.40 91.7 86.61 73.81 81.00 72.92 85.66

Rank 50 (values in source order; same caveat): 98.41 97.06 97.0 97.8 95.98 95.6 86.69 -

Implementation: To model the visual appearance of a person, the common Local Maximal Occurrence (LOMO) representation [21] has been exploited for VIPeR and PRID450S, while for Market-1501 the IDE descriptor [51] has been considered. β = 1e−7 and η = 0.7 have been selected by performing 5-fold cross validation on the training set, with values ranging in {1e−5, ..., 1e−9} and in {1, ..., 0.1}, respectively. Following the same recipe as in [32], λ has been set to (m − 2)/(m + 2), where m = 24. To attain the objective, the optimization has been run for s = 30 epochs. An early stopping criterion has been introduced to stop the learning process if the difference between consecutive objective function values was lower than 5e−5. Source code is available at https://github.com/iN1k1/ALRM.

4.1. State-of-the-art comparisons

Table 1 shows the results achieved by existing methods and compares the proposed approach with the top performer [48] on the VIPeR leaderboard. Results show that ALRM achieves competitive performance with respect to existing solutions. Specifically, it improves the rank-1 result obtained by MLAPG [22] by about 5 percentage points (45.88 vs. 40.73). This is an interesting achievement since MLAPG [22] uses the same LOMO feature. This shows that, exploiting the same visual representation, ALRM is able to obtain better performance than state-of-the-art methods.

In Table 2, performance comparisons between existing methods and the proposed approach on the PRID450S dataset are shown. Results show that ALRM has better performance than very recent approaches (e.g., [7,38]). In particular, it achieves better recognition rates at the first ranks than Chen et al. [7], which considers only the person foreground to extract the visual features and to compute the match. This may suggest that the proposed approach is robust to the background clutter and occlusions which PRID450S suffers from.

Fig. 4. Convergence curves computed for different mini-batch sizes (m) and learning rates (η) are shown in (a) and (b), respectively. Training times [s] are shown in parentheses. In (c), CMC performance with different ℓ2,1-norm regularization strengths is shown. The inset shows the CMC performance for a reduced rank range.

Table 2. Comparison with state-of-the-art methods on the PRID 450S dataset. Best results for each rank are in boldface font.

Method            Rank 1   Rank 5   Rank 10
ALRM              57.6     81.6     88.6
MirrorRep [7]     55.4     79.3     87.8
CSL [38]          44.4     71.6     82.2
SCNCDFinal [45]   41.6     48.9     79.4
SCNCD [45]        41.5     66.6     75.9
KISSME [15]       33       -        71

Rank 50 (values in source order; one entry is absent, so the values cannot be aligned to the rows above): 98.6 96.0 95.4 92.4 90

Table 3. Rank 1 and mAP performance comparison with existing methods on the Market 1501 dataset. The first 4 rows show the results obtained by exploiting deep learning solutions as feature extractors. The last 3 rows show the performance achieved by end-to-end deep learning methods.

Method                         Rank 1   mAP
IDE_ResNet_50 + ALRM           77.96    56.08
IDE_ResNet_50 + Euclidean      77.67    54.56
IDE_ResNet_50 + XQDA [21]      77.35    56.01
IDE_ResNet_50 + KISSME [15]    78.80    56.13
HydraPlus-Net [25]             76.9     -
SVDNet [39]                    82.3     62.1
Ver-Id [55]                    79.5     59.9

In Table 3, a comparison with existing methods on the Market 1501 dataset is given. Results are shown in terms of Rank 1 recognition rate and mean average precision (mAP) for those methods that use the same IDE representation as well as for recent deep learning approaches. The reported performance shows that, with an mAP of 56.08, ALRM performs similarly to state-of-the-art solutions, even when end-to-end learning solutions are considered. Such a result might be due to the discriminative power of the deeply-learned feature representation that is considered for such a dataset. This shows that the proposed approach is able to tackle the re-identification problem under very challenging real-world scenarios.
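Since Market-1501 contains several gallery images per identity, mAP complements the Rank 1 rate by rewarding methods that retrieve all the correct matches early. The sketch below shows the basic computation from ranked gallery identities; it is an illustration only, the function names are hypothetical, and details of the benchmark protocol (such as per-camera filtering of gallery images) are not modeled.

```python
def average_precision(ranked_ids, query_id):
    """AP for one query, given gallery identities sorted by increasing distance."""
    hits, precisions = 0, []
    for position, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / position)   # precision at each correct match
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(rankings, query_ids):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r, q) for r, q in zip(rankings, query_ids)]
    return sum(aps) / len(aps)

# Two toy queries, both for identity 7.
rankings = [[7, 3, 7, 5],   # correct matches at positions 1 and 3: AP = (1/1 + 2/3) / 2
            [3, 7, 5]]      # correct match at position 2: AP = 1/2
m_ap = mean_average_precision(rankings, query_ids=[7, 7])
```

For the two toy queries the average precisions are 5/6 and 1/2, giving a mean of 2/3.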

4.2. Ablation analysis

The performance of the proposed metric learning solution depends on the selection of different hyperparameters. These are: (i) the mini-batch size m; (ii) the ℓ2,1-norm regularization strength β; and (iii) the learning rate η. To study how strongly the performance of the proposed approach depends on the values of such hyperparameters, the following experiments have been run. Since it is considered one of the most challenging datasets, this ablation study has been conducted on the VIPeR dataset.

Mini-batch size. Since the mini-batch size is important for the convergence of the algorithm, in Fig. 4(a) both the convergence rates and the corresponding attained rank-1 re-identification performance are shown. Results show that, with large values of m, the attained objective value increases and, as a consequence, the rank-1 performance degrades. In particular, by considering all the possible training pairs at each iteration (i.e., m = 99856), the worst result is achieved. Such a result follows the outcomes of [42], demonstrating that averaging over large mini-batches leads to drastic overshooting of relevant gradient directions. As shown in [42], the problem could be mitigated by reducing the learning rate if more weight changes are accumulated before being applied.

Learning rate. In Fig. 4(b), the convergence rate of the proposed approach is shown for different learning rate values. Results demonstrate that with η = 1, the minimum objective value is achieved within 15 epochs. Despite this, the best rank-1 performance is obtained when η = 0.7. With smaller values, worse results are reached both in terms of re-identification performance and objective value. Such a behavior is due to the fact that with lower learning rates the process stops due to the early stopping criterion.

ℓ2,1-norm regularization strength. To conclude the ablation study, in Fig. 4(c) the performance achieved by the proposed method is shown for different regularization strength (i.e., β) values. Results demonstrate that the best rank-1 recognition rate is obtained for β = 1e−7. For smaller values (e.g., β = 1e−9) performance is worse due to the fact that non-discriminative dimensions are kept, while for larger values (e.g., β = 1e−5) the relevant components are forced to be rejected.

To summarize, the results of the preceding experiments have demonstrated that the proposed approach is robust to the choice of the algorithm hyperparameters.

4.3. Computational complexity

Table 4 shows the times required to train and to evaluate the proposed solution. The image sizes considered to extract the visual features are given within parentheses. In the current implementation, the feature sizes of the LOMO descriptor for the three experiments on VIPeR are 26960-D, 35722-D, and 47854-D, respectively. Results show that the training procedure can require more than half an hour (this is reasonable since a stochastic gradient descent-based approach is used). Despite this, re-identification can run in real time, thus the solution can be applied to real deployments.

Table 4. Computational times obtained by running unoptimized MATLAB code on the three considered datasets. The image size used is given within parentheses.

Dataset (H × W)            Training [s]   Re-identification [s/probe]
VIPeR (64 × 128)           1997           0.003
VIPeR (80 × 160)           2520           0.004
VIPeR (108 × 216)          3463           0.007
PRID450S (64 × 128)        863            0.002
Market-1501 (64 × 128)     2512           0.003

5. Conclusion In this paper, the person re-identiﬁcation problem has been addressed with the idea that the application of dimensionality reduction techniques, such as PCA, before learning a metric is suboptimal and may induce severe misclassiﬁcation errors. To handle such a problem, an approach producing a low-rank solution which is able to self-determine the discriminative dimensions of the underlying manifold has been proposed. Speciﬁcally, such a goal has been obtained by ﬁrst formulating a convex optimization problem then by proposing an accelerated solver based on a stochastic proximal gradient descent method. To validate the approach, extensive experimental results on three benchmark person re-identiﬁcation datasets have been carried out. Comparisons with state-of-the-art methods have shown that the proposed approach performs better than existing solutions. References [1] L. An, S. Yang, B. Bhanu, Person Re-Identiﬁcation by Robust Canonical Correlation Analysis, IEEE Signal Process. Lett. 22 (8) (2015) 1103–1107. [2] F. Bach, R. Jenatton, J. Mairal, G. Obozinski, Convex optimization with sparsity-inducing norms, in: Optimization for Machine Learning, The MIT Press, 2011, pp. 1–35. [3] A. Beck, M. Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202, doi:10.1137/ 080716542. [4] A. Bellet, A. Habrard, M. Sebban, A Survey on Metric Learning for Feature Vectors and Structured Data, ArXiv e-prints (2013). [5] D. Chen, Z. Yuan, G. Hua, N. Zheng, J. Wang, Similarity learning on an explicit polynomial kernel feature map for person re-identiﬁcation, in: International Conference on Computer Vision and Pattern Recognition, 2015. [6] J. Chen, Z. Zhang, Y. Wang, Relevance metric learning for person reidentiﬁcation by exploiting listwise similarities, IEEE Trans. Image Process. 7149 (c) (2015), doi:10.1109/TIP.2015.2466117. 1–1 [7] Y.C. Chen, W.S. Zheng, J. 
Lai, Mirror representation for modeling view-speciﬁc transform in person re-identiﬁcation, in: IJCAI International Joint Conference on Artiﬁcial Intelligence, 2015, pp. 3402–3408. [8] J. García, N. Martinel, A. Gardel, I. Bravo, G.L. Foresti, C. Micheloni, Modeling feature distances by orientation driven classiﬁers for person re-identiﬁcation, J. Visual Commun. Image Represent. 38 (2016) 115–129. [9] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recongnition, reacquisition and tracking, IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2007. [10] M. Hirzer, P.M. Roth, K. Martin, H. Bischof, Relaxed pairwise learned metric for person re-identiﬁcation, in: European Conference Computer Vision, Lecture Notes in Computer Science, 7577, 2012, pp. 780–793, doi:10.1007/ 978- 3- 642- 33783- 3. [11] O. Javed, K. Shaﬁque, Z. Rasheed, M. Shah, Modeling inter-camera spacetime and appearance relationships for tracking across non-overlapping views, Comput. Vision Image Understanding 109 (2) (2008) 146–162, doi:10.1016/j.cviu. 20 07.01.0 03. [12] S. Karanam, Y. Li, R.J. Radke, Person re-identiﬁcation with discriminatively trained viewpoint invariant dictionaries, in: International Conference on Computer Vision, 2015, pp. 4516–4524, doi:10.1109/ICCV.2015.513. [13] V. Karavasilis, C. Nikou, A. Likas, Real time visual tracking using a spatially weighted von Mises mixture model, Pattern Recognit. Lett. 90 (2017) 50–57, doi:10.1016/j.patrec.2017.03.013. [14] S. Khamis, C.-h. Kuo, V.K. Singh, V.D. Shet, L.S. Davis, Joint learning for attribute-consistent person, in: European Conference on Computer Vision Workshops and Demonstrations, 2014. [15] M. Kostinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: International Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288–2295, doi:10.1109/CVPR. 2012.6247939.

239

[16] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person re-identification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (7) (2013) 1622–1634, doi:10.1109/TPAMI.2012.246.
[17] A. Li, L. Liu, K. Wang, S. Liu, S. Yan, Clothing attributes assisted person reidentification, IEEE Trans. Circuits Syst. Video Technol. 25 (5) (2015) 869–878, doi:10.1109/TCSVT.2014.2352552.
[18] W. Li, X. Wang, Locally aligned feature transforms across views, in: International Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 3594–3601, doi:10.1109/CVPR.2013.461.
[19] X. Li, A. Wu, M. Cao, J. You, W.-s. Zheng, Towards more reliable matching for person re-identification, in: International Conference on Computer Vision and Pattern Recognition, 2015.
[20] Z. Li, S. Chang, F. Liang, T.S. Huang, L. Cao, J.R. Smith, Learning locally-adaptive decision functions for person verification, in: International Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 3610–3617, doi:10.1109/CVPR.2013.463.
[21] S. Liao, Y. Hu, X. Zhu, S.Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: International Conference on Computer Vision and Pattern Recognition, 2015.
[22] S. Liao, S.Z. Li, Efficient PSD constrained asymmetric metric learning for person re-identification, in: International Conference on Computer Vision, 2015, pp. 3685–3693, doi:10.1109/ICCV.2015.420.
[23] G. Lisanti, I. Masi, A.D. Bagdanov, A.D. Bimbo, Person re-identification by iterative re-weighted sparse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 37 (8) (2015) 1629–1642.
[24] F. Liu, T. Zhou, K. Fu, J. Yang, Kernelized temporal locality learning for real-time visual tracking, Pattern Recognit. Lett. 90 (2017) 72–79, doi:10.1016/j.patrec.2017.03.019.
[25] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, X. Wang, HydraPlus-Net: attentive deep features for pedestrian analysis, in: International Conference on Computer Vision, 2017, pp. 350–359, doi:10.1109/ICCV.2017.46.
[26] B. Ma, Y. Su, F. Jurie, Covariance descriptor based on bio-inspired features for person re-identification and face verification, Image Vision Comput. 32 (2014) 379–390, doi:10.1016/j.imavis.2014.04.002.
[27] N. Martinel, A. Das, C. Micheloni, A.K. Roy-Chowdhury, Re-identification in the function space of feature warps, IEEE Trans. Pattern Anal. Mach. Intell. 37 (8) (2015) 1656–1669, doi:10.1109/TPAMI.2014.2377748.
[28] N. Martinel, C. Micheloni, Sparse matching of random patches for person re-identification, in: International Conference on Distributed Smart Cameras, ACM Press, Venice, Italy, 2014, pp. 1–6, doi:10.1145/2659021.2659034.
[29] N. Martinel, C. Micheloni, Classification of local eigen-dissimilarities for person re-identification, IEEE Signal Process. Lett. 22 (4) (2015) 455–459, doi:10.1109/LSP.2014.2362573.
[30] N. Martinel, C. Micheloni, G.L. Foresti, Kernelized saliency-based person re-identification through multiple metric learning, IEEE Trans. Image Process. 24 (12) (2015) 5645–5658, doi:10.1109/TIP.2015.2487048.
[31] F. Nie, H. Huang, X. Cai, C.H. Ding, Efficient and robust feature selection via joint L2,1-norms minimization, Adv. Neural Inf. Process. Syst. (2010) 1813–1821.
[32] A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems - NIPS ’14, 2014, pp. 1–9.
[33] S. Paisitkriangkrai, C. Shen, A.V.D. Hengel, Learning to rank in person re-identification with metric ensembles, in: International Conference on Computer Vision and Pattern Recognition, 2015.
[34] S. Pedagadi, J. Orwell, S. Velastin, Local Fisher discriminant analysis for pedestrian re-identification, in: International Conference on Computer Vision and Pattern Recognition, 2013, pp. 3318–3325, doi:10.1109/CVPR.2013.426.
[35] J. Rennie, N. Srebro, Loss functions for preference levels: regression with discrete ordered labels, in: IJCAI Workshop on Advances in Preference Handling, 2005, p. 6.
[36] P.M. Roth, M. Hirzer, M. Koestinger, C. Beleznai, H. Bischof, Mahalanobis distance learning for person re-identification, in: S. Gong, M. Cristani, S. Yan, C.C. Loy (Eds.), Person Re-Identification, Springer London, London, 2014, pp. 247–267, doi:10.1007/978-1-4471-6296-4.
[37] D. Schreiber, Generalizing the Lucas-Kanade algorithm for histogram-based tracking, Pattern Recognit. Lett. 29 (7) (2008) 852–861, doi:10.1016/j.patrec.2007.12.014.
[38] Y. Shen, W. Lin, J. Yan, M. Xu, J. Wu, J. Wang, Person re-identification with correspondence structure learning, in: International Conference on Computer Vision, 2015, pp. 3200–3208, doi:10.1109/ICCV.2015.366.
[39] Y. Sun, L. Zheng, W. Deng, S. Wang, SVDNet for pedestrian retrieval, in: International Conference on Computer Vision, 2017.
[40] D. Tao, L. Jin, Y. Wang, X. Li, Person reidentification by minimum classification error-based KISS metric learning, IEEE Trans. Cybern. 45 (2) (2015) 242–252, doi:10.1109/TCYB.2014.2323992.
[41] R. Vezzani, D. Baltieri, R. Cucchiara, People reidentification in surveillance and forensics, ACM Comput. Surv. 46 (2) (2013) 1–37, doi:10.1145/2543581.2543596.
[42] D.R. Wilson, T.R. Martinez, The general inefficiency of batch training for gradient descent learning, Neural Netw. 16 (10) (2003) 1429–1451.
[43] L. Xiao, T. Zhang, On proximal gradient descent with progressive variance reduction, SIAM J. Optim. 24 (4) (2014) 2057–2075.
[44] E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, Distance metric learning, with application to clustering with side-information, Adv. Neural Inf. Process. Syst. 15 (2002) 505–512. 10.1.1.58.3667

[45] Y. Yang, Y. Jimei, Y. Junjie, S. Liao, Salient color names for person re-identification, in: European Conference on Computer Vision, 2014.
[46] X. Yu, Q. Yu, Y. Shang, H. Zhang, Dense structural learning for infrared object tracking at 200+ frames per second, Pattern Recognit. Lett. 100 (2017) 152–159, doi:10.1016/j.patrec.2017.10.026.
[47] Y. Zhang, B. Li, H. Lu, A. Irie, X. Ruan, Sample-specific SVM learning for person re-identification, in: International Conference on Computer Vision and Pattern Recognition, 2016, pp. 1278–1287.
[48] R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in: International Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 144–151, doi:10.1109/CVPR.2014.26.
[49] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: a benchmark, in: International Conference on Computer Vision, 2015.
[50] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: a benchmark, in: International Conference on Computer Vision, IEEE, 2015, pp. 1116–1124, doi:10.1109/ICCV.2015.133.

[51] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian, N.E.C. Labs, Person re-identification in the wild, in: International Conference on Computer Vision and Pattern Recognition, 2017, pp. 1367–1376.
[52] W.-S. Zheng, S. Gong, T. Xiang, Re-identification by relative distance comparison, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 653–668, doi:10.1109/TPAMI.2012.138.
[53] W.-S. Zheng, S. Gong, T. Xiang, Towards open-world person re-identification by one-shot group-based verification, IEEE Trans. Pattern Anal. Mach. Intell. 8828 (2) (2015) 1, doi:10.1109/TPAMI.2015.2453984.
[54] Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, in: International Conference on Computer Vision, 2017.
[55] Z. Zheng, L. Zheng, Y. Yang, A discriminatively learned CNN embedding for person reidentification, ACM Trans. Multimedia Comput. Commun. Appl. 14 (1) (2018) 1–20, doi:10.1145/3159171.
