Author’s Accepted Manuscript Re-identification by Neighborhood Structure Metric Learning Wei Li, Yang Wu, Jianqing Li
www.elsevier.com/locate/pr
PII: DOI: Reference:
S0031-3203(16)30202-3 http://dx.doi.org/10.1016/j.patcog.2016.08.001 PR5830
To appear in: Pattern Recognition Received date: 8 July 2015 Revised date: 20 May 2016 Accepted date: 1 August 2016 Cite this article as: Wei Li, Yang Wu and Jianqing Li, Re-identification by Neighborhood Structure Metric Learning, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.08.001 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Re-identification by Neighborhood Structure Metric Learning Wei Lia,∗, Yang Wub , Jianqing Lia a School
of Instrument Science and Engineering, Southeast University, 2 Sipailou, Nanjing 210096, China b Institute for Research Initiatives, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192,Japan
Abstract Re-identifying persons of interest among distributed cameras remains a challenge in current academia and industry. Because feature designing is inevitably subject to handcrafting subjectivity and real-scenario complexity, learning a discriminative metric has gained increasing attention to date. Although metric learning has achieved inspiring results, the research progress seems to slow down far before the performance is satisfactory. The difficulty mainly comes from the variability and sparsity of human image data, which impairs traditional metric learning models based on point-wise dissimilarity. In consideration of the neighborhood structure manifold which exploits the relative relationship between the concerned samples and their neighbors in the feature space, we propose a novel method, “Neighborhood Structure Metric Learning”, to learn discriminative dissimilarities on such manifold by adapting the codomain metrics of its charts. Experiments on widely-used benchmarks have demonstrated the advantage of this method in terms of effectiveness, robustness, efficiency, stability, and generalizability. Keywords: re-identification, metric learning, neighborhood structure manifold
∗ Corresponding
author Email address:
[email protected] (Wei Li)
Preprint submitted to Journal of LATEX Templates
August 4, 2016
1. Introduction Considering judging the re-appearance of a person of interest in deployed cameras, re-identification is valuable but challenging. The difficulties primarily come from various covariates including pose, illumination, and viewpoint, as 5
well as potential resemblance of clothing and body styles in the real scenario [1]. Basically, current methodology involves three paradigms: feature representation, feature transformation, and metric learning. Feature representation mainly focuses on suitable description of human appearance characteristics [2, 3, 4, 5, 6]. Recently, some methods also suggest
10
introducing other cues that complement the appearance to further enhance the discriminability of feature representation. Ziyan Wu et al. [7] describe human appearance as a function of pose using the training data, and then apply the pose prior in online re-identification to make it more robust to viewpoint variation, before integrating the learned person-specific features. Zheng Liu et al.
15
[8] combine the appearance descriptor with the gait descriptor to deal with the challenging appearance distortions, and then match the descriptors based on the learned metric before final score-level or feature-level fusion. An ideal feature space can compact human image samples in the same class close together and separate those from different classes far apart. Unfortunately,
20
due to the real-world challenge and handcrafting subjectiveness, mere handcrafted feature space has been found incapable for re-identifying the people. To solve it, two different directions, feature transformation and metric learning have attracted research interests. To ensure the inter-class discrimination and intra-class invariance of sam-
25
ples, generally, feature transformation tries to learn transformation functions for projecting human image features between cameras. For instance, ROCCA (RObust Canonical Correlation Analysis) [9] matches people from different views in a coherent subspace. Attributing to the shrinkage estimation and smoothing technique, this method can robustly estimate the data covariance matrices even
30
with limited training data size. However, by considering that the human image
2
feature transformation function may not be unique and even may vary from frame to frame due to real-scenario variations, FSFW (Function Space of Feature Warps) [10] assumes that feature transformations between cameras lie in a nonlinear function space of all possible feature transformations. This method 35
learns a discriminating surface to separate the feasible and infeasible sets of warp functions first, and then re-identifies persons by classifying a test warp function as either feasible or infeasible in terms of the Random Forest classifier. Feature representation and transformation are not the focus of this proposal, so we don’t plan to pay attention to details except for admitting that the widely-
40
used representative features can serve as a reliable platform for the further study. From the measure perspective, metric learning intends to use the optimized metric to improve the relative comparison between intra- and inter-class distances of human image features [11]. This paradigm is different from feature transformation in technique, although it can also transform the feature space in
45
a broad sense. We hereby sketch several related approaches which are designed or have potentiality for person re-identification. Generally speaking, metric learning models include the linear and non-linear ones. Linear metric learning learns the linear distance for scaling and rotating the coordinate system of feature space. In purpose of classification, LMNN (Large Margin Nearest
50
Neighbor) [12] maximizes the margin to distinguish the classes of human image features, and this method inspires the early works for metric learning based re-identification [13]. KISSME (Keep It Simple and Straightforward MEtric) [14] designs the equivalence constraints from a statistical inference perspective for both efficiency and effectiveness, and this method has been discussed in sev-
55
eral re-identification works [15, 16, 17]. MCML (Maximally Collapsing Metric Learning) [18] works for the target of intra-class distances to be zero and interclass distances to be infinity. This method can help de-noise the feature space for re-identification [19]. Re-identification can also be recast as the ranking problem. RankSVM [20] translates person re-identification into relative rank-
60
ing instead of absolute scoring. MLR (Metric Learning to Rank) [21] utilizes the structural SVM framework to optimize the ranking. With the loss function 3
measure Mean Reciprocal Rank (MRR), this method has been verified effective for the re-identification task [22]. RDC (Relative Distance Comparison) [23] maximizes the likelihood for the expected distance relationship between 65
the true and wrong match pairs of human image features. This method is quite robust to over-training due to the soft discriminant manner in formulation. Human image features are usually sparsely-distributed in the high dimensional space. From the angle of subspace finding (low-dimensionality, low-rank, sparsity), PCCA (Pairwise Constrained Component Analysis) [24] learns a low-
70
dimensional projection where the distance comparison constraints are sparsified from a large number of negative pairs consisting of different person identities. XQDA (Cross-view Quadratic Discriminant Analysis) [15] learns a discriminant low dimensional subspace, in which the metric is simultaneously optimized, by cross-view quadratic discriminant analysis on the human image feature rep-
75
resentation named LOMO (LOcal Maximal Occurrence). RMLLC (Relevance Metric Learning with Listwise Constraints) [17] fully uses the available similarity information from training data by means of forcing the learned metric to conserve the predefined listwise similarities between human image features in the low-dimensional projection subspace.
80
Though linear metric learning methods usually have achieved improved performance, they are inevitably ineffective when data are immersed in non-linear manifolds. In such situation, one typical solution is kernel-based metric learning, which learns the non-linear relationship between data in the reproducing kernel Hilbert space; the other is local metric learning [25], which localizes the
85
data domain and learns the metrics separately. For kernel-based metric learning methods, χ2 -LMNN [26] strictly preserves the histogram properties of input data on a probability simplex for the χ2 -distance. GBLMNN (Gradient Boosted LMNN) [26] trains the non-linear distance directly in the function space with gradient-boosted regression trees. χ2 -LMNN and GBLMNN are general metric
90
learning methods, but they more or less inherit the merits from LMNN to deal with the re-identification issue. DDML (Discriminative Deep Metric Learning) [27] learns a deep neural network of hierarchical nonlinear metric to project fea4
ture pairs into the same subspace where the distance comparisons of positive and negative pairs are constrained. HERML (Hybrid Euclidean-Riemannian Metric 95
Learning) [28] implicitly maps multiple complementary statistics of image sets into high-dimension Hilbert spaces by exploiting Euclidean and Riemanninan kernels, and then fuses these kernels in the hybrid metric learning framework. DDML and HERML have been validated to be effective for classifying realscenario face images, and also they have potentiality for the re-identification
100
task because the face and human body images encounter the similar variations to some extent. kLFDA (kernel Local Fisher Discriminant Analysis) [29] handles high dimensional human image features by optimizing the Fisher discriminant objective based on the kernel trick. For local metric learning models, LBDM (Locality Based Discriminative Measure) [16, 30] constructs the local metric
105
field to exploit the distance discriminability between each pair of human image sets from query and corpus domains to handle the undesired local variations. Although non-linear metric learning oftentimes has to face the expensive nonconvex optimization, this scheme has advantage in capturing the high-order correlations among samples.
110
Both linear and non-linear metric learning measures the original dissimilarity between samples from the geometric perspective, while ignores the relative information hidden in their neighborhood structures in the sense of topology. Original information only relies on the absolute position of the measured pair of samples in the feature space. This kind of information is inevitably sensitive
115
to the variability and sparsity problems long perplexing the re-identification issue. We consider that the relative information has the potential for solving this problem, and such potential can be fully exploited by the metric learning strategy, which inspires the idea of our work.
5
2. Problem Formulation and Method Summary 120
2.1. Problem Formulation In the real situation, there are a lot of people captured in each camera. This paper defines re-identification as a between-camera matching problem, which aims at building correspondences between subjects of unknown identities based on biometric cue, like full body appearance, captured from different camera
125
domains. Although the camera network is widely used in the real scene, we focus on the two-camera case in this paper, because this case is fundamental and important for the re-identification in general multiple-camera settings. Two-camera re-identification can provide the technical support and platform for multiple-
130
camera re-identification. It even can be anticipated that when the two-camera re-identification is conquered, the multiple-camera re-identification can be readily solved as well. The people in one camera usually have their corresponding matches in the other camera, which belongs to enclosure re-identification. In evaluation, we
135
need to find the correspondences between samples from query and corpus domains. In practice, such matching reformulation can help re-identify the incoming and outgoing pedestrians between entrance and exit of the library, shopping mall, airport, and so forth. It is also feasible to establish the correspondence in the camera network if biometric data from all the entrances and exits are
140
collected, respectively [31]. More concretely, the corpus data can be obtained from the camera or sensor set on the only path according to the requirement and restriction, and these data also can be updated according to the real condition and situation (for example, we may just maintain a record for the past a few hours/days only, and keep updating it). On the other hand, such matching
145
reformulation is also able to benefit building data association between tracklets of unknown subjects in non-overlapping camera views for cross-camera tracking in public places, especially when the camera locations are not so far. Matching formulation allows us to take advantage of the domain information: 6
the remaining data from both query and corpus domains besides the data pair 150
under measure. These unlabeled samples potentially contains the neighborhood information for the to-be-measured samples. We hope such information is useful against the challenges of data variability and sparsity in re-identification. 2.2. Method Summary Human image features inescapably suffer from the real-scenario complexity
155
and the curse of dimensionality, which result in the data variability and sparsity. Variability gives rises to large intra-class variations and small inter-class differences which deprive the point-wise dissimilarity of discrimination capability. Sparsity leads up to the inadequacy of in-class distribution information for each identity so that traditional metric learning methods are easy to be over-fitting.
160
Motivated by the idea that the neighborhood structure1 can provide useful information to tackle the variations and sparseness of the data, we propose a novel non-linear metric learning method, “Neighborhood Structure Metric Learning (NSML)”, to learn discriminative dissimilarities on a neighborhood structure manifold.
165
Admittedly, many issues confront the similar problems of sample variability and sparsity, like face recognition, gait recognition, and so forth. We choose person re-identification as the research focus, because it is typically representative of the above-mentioned problems. Even so, the proposed method has the ability to solve the general re-identification/recognition problem in diverse
170
applications. In sum, the main contributions of this paper are as follows: • We have proposed NSML to tackle the data variability and sparsity problem during re-identification, which learns discriminative dissimilarities on a novel neighborhood structure manifold. 1 Throughout
this paper, neighborhood structure means the topological relationship be-
tween the concerned elements and their neighbors.
7
175
• We have formulated NSML as a non-convex optimization problem, and designed the cutting-surface approach for it. • We have demonstrated the advantage of NSML in terms of effectiveness, robustness, efficiency, stability, and generalizability. 3. Our Method
180
3.1. Neighborhood Structure Manifold Point-wise dissimilarity measure is widely used for re-identification. However, such dissimilarity is weak for handling the large within-class variations and small between-class differences of human image data in the feature space. And moreover, the sparsely-distributed data also makes point-wise dissimilarity
185
based learning easy to be over-fitting. Point-wise dissimilarity ignores the topological relationship among feature samples. Actually, each sample point has a neighborhood structure and is meanwhile within the neighborhood structures of other sample points in the feature space. Neighborhood structure contains the useful information that one single
190
sample point doesn’t have. Such useful information originates from the resource of well-distributed samples with satisfactory intra-class compactness and interclass separation. We want to take advantage of such resource to cope with the variability and sparsity of human image data, and motivated by this, we design a new neighborhood-wise dissimilarity.
195
Fig. 1 illustrates one example. Suppose that a is a query sample, and b to f are corpus samples. {a, b}, {c, d}, and {e, f } are in different classes. {a, b} are badly-distributed, which means that they suffer from the severe intra-class variation and inter-class difference, so that inter-class distance is smaller than intraclass distance using the Euclidean metric. {c, d} and {e, f } are well-distributed,
200
which means that they have the satisfactory intra-class compactness and interclass separation, so that inter-class distance is larger than intra-class distance in the Euclidean space. We pay attention to b and c because they are nearer
8
to a than other samples based on Euclidean distance. Apparently, direct Euclidean distance measure fails to build the correct query-corpus correspondence 205
for b and c. However, in fact, dissimilarity for re-identification does not have to be Euclidean, and even it doesn’t have to strictly adhere to the properties of symmetry and triangle inequality [32].
Figure 1: Samples classes can be distinguished by various colors and shapes.
By compressing the neighborhood structure into ranking lists in terms of Euclidean distances, we are able to quantify the relative position of the paired samples in each other’s global neighborhood structure as a neighborhood-wise dissimilarity: DM (a, b) = Ra (b) + Rb (a),
(1)
where Ra (b) and Rb (a) are the rank-order distances that count the rank order of the sample in each other’s list. For a, the ranking list is “acbedf ”; for b, 210
the ranking list is “bacedf ”; for c, the ranking list is “cdeaf b”. The rank order of a in b-list is 1 and in c-list is 3; the rank order of b in a-list is 2; the rank order of c in a-list is 1. Hence, DM (a, b) = 1 + 2 = 3 ≤ DM (a, c) = 3 + 1 = 4, which outperforms the Euclidean distance. The discriminability of DM (a, b) comes from absorbing the effectiveness of the well-distributed samples in the
215
neighborhoods of a and b. In addition, the simultaneous use of Ra (b) and Rb (a) seems trivial, but this step plays an importance role in making sure the symmetry property of neighborhood-wise dissimilarity, such that DM (a, b) = 9
DM (b, a). In an elaborately selected or designed feature space, there should exist a 220
portion of well-distributed samples. By delivering the effectiveness of these well-distributed samples to the badly-distributed ones, neighborhood-wise dissimilarity has the advantage over point-wise dissimilarity which is vulnerable to the sample distribution deviation. In practice, we use those unlabeled available samples from both query and corpus domains to fill the space, as the possible
225
neighbors of the to-be-measured sample pair [31]. Neighborhood-wise dissimilarity is non-Euclidean but helpful for re-identification. Rethinking this dissimilarity from the manifold perspective, we can assume samples with such dissimilarity measure providentially lay on a new neighborhood structure manifold M. We hope to further exploit the discriminability of this
230
dissimilarity. For this purpose, we design a novel learning model. This model learns the discriminative neighborhood-wise dissimilarities on M by adapting the codomain metrics of its charts, as shown in Fig. 2. Conceptually, chart, denoted by {(U, ϕ)}, is a homeomorphism ϕ from an open subset U of M to an open subset V of Euclidean space E, where ϕ is a bi-continuous function
235
between U and V, which is defined as ϕ : U → V, and thus its inverse is given by ϕ−1 : V → U. 3.2. Metric Learning of Manifold Suppose that a and b stay in the chart codomain of M. It is difficult to explicitly describe the dissimilarity on M in a form of binary operation between ϕ−1 (a)
240
and ϕ−1 (b), so, we suggest reformulating the dissimilarity between ϕ−1 (a) and ϕ−1 (b) on M implicitly as DM (La, Lb), and studying the linear transition map L, irrespective of the concrete form of charts ϕ. Adapting the chart codomain by suitable L is expected to help re-orient the samples on M for better discrimination.
245
Since measuring Mahalanobis distance DW (a, b) = (a − b) W (a − b) using the metric W is equivalent to measuring distance in the Euclidean space linearly
10
L
Euclidean Space Euclidean Space
Chart Chart
Manifold
Figure 2: Illustration of the neighborhood structure manifold. Coordinate maps are between manifold and chart codomains; transition map is between chart codomains. Adapting the chart codomain by transition map L can help re-orient the samples on M for better discrimination.
transformed by L: DE (La, Lb) = (La − Lb) (La − Lb)
(2)
for which W = L L → L = W 1/2 , neighborhood-wise dissimilarity between a M and b with a learned metric W , denoted by DW (a, b), can be straightforwardly 250
expressed as: M DW (a, b) = DM (W 1/2 a, W 1/2 b),
(3)
whereupon, we design a new max-margin model to learn W on E in consideration of the anticipated relative comparisons between intra-class and inter-class dissimilarities on M.
255
Given query sample collection Q = q | q ∈ Rd and corpus sample collection X = xqi | xqi ∈ Rd , a joint feature map is adopted to represent the whole set of ranked data X , which is inspired by MLR [21]. Let yqranking ∈ Y be a ranking of X with respect to q where Y is the ranking set, and ψ M (q, yqranking , W, X ) ∈ Rd be a vector-valued joint feature map incorporating the learned metric, which is defined as the partial order feature: ψ M (q, yqranking , W, X ) =
ranking yij (
xqi ∈Xq+ xqj ∈Xq−
11
M M DW (q, xqi ) − DW (q, xqj ) ), + − |Xq ||Xq |
(4)
260
where Xq+ denotes the positive set of corpus samples in the same class as q; Xq− denotes the negative set of corpus samples in different classes from q; | • | means the set cardinality; |Xq+ ||Xq− | plays the role of normalization in consideration of that the double accumulation of Eq. 4 causes the addition of altogether the ranking number of |Xq+ ||Xq− | terms; yij is defined by ranking yij =
⎧ ⎨
xqi ≺yq xqj ,
1 ⎩ −1
(5)
xqi yq xqj .
265
The best W is expected to be the one that simultaneously makes yq∗ ← arg
max
ranking
yq
∈Y
ψ M (q, yqranking , W, X ),
(6)
where yq∗ is the ground truth ranking of X for q, and in most cases yqranking = yq∗ . One attractive property of ψ M (q, yqranking , W, X ) is that, for a fixed W , the ranking yqranking which maximizes ψ M (q, yqranking , W, X ) can be obtained by sort270
M ing X in the descending order of score DW (q, xqi ). Thus, W can be learned by
solving the following problem: arg min F (W, ξ) = tr(W ) + Cξ
(7)
W
s.t. 1 n
n
(ψ M (q, yq∗ , W, X ) − ψ M (q, yqranking , W, X )) ≥
q=1
1 n
n q=1
Δ(yq∗ , yqranking ) − ξ, W ≥ 0,
ξ ≥ 0.
ξ is the slack variable; C is the trade-off parameter; Δ(yq∗ , yqranking ) is the 275
loss function to penalize predicting yqranking instead of yq∗ , defined by Δ(yq∗ , yqranking ) = 1 − S(q, yqranking ),
(8)
where score S ∈ [0, 1], in which, S = 0 means the worst ranking and S = 1 means the perfect ranking. S can be instantiated by MRR, which has been verified effective for the re-identification task [22]. Conceptually, the reciprocal rank of a query response is the multiplicative inverse of the rank of the first
12
280
correct match, and MRR is the average of such reciprocal ranks of results over the whole query collection. Eq. 7 is the NSML framework. The constraints of Eq. 7 is non-convex with M is bounded when the size of ranking respect to W . However, the range of DW
list is certain. So the NSML model can converge to some local minima. 285
3.3. Optimization Because there is no closed-form solution for NSML, NSML will be solved by the iterative process. On account of the large number of constraints, obtaining an approximate solution is more practicable. To this end, the cuttingsurface method, extended from the cutting-plane algorithm, is recommended.
290
The cutting-surface strategy applies the non-linear surface approximation of the constraint set by the curved surfaces that are formed by the most violated constraints for all queries. The objective function in Eq. 7 is minimized over a subset of active cutting surfaces. Such subset is updated with the most violated constraints in each
295
iteration. The process repeats until the most-violated constraint for each query satisfies some specified loss threshold on the active set. The gradient of the objective F (W, ξ) can be expressed in terms of the constraint set that achieves the current largest margin violation: ∂F ∂W
=I−
=I− −
C n
C n
n
n
(
q=1
(
∗ ∂ψ M (q,yq ,W,X ) ∂W
q=1 x ∈X + x ∈X − q q qi qj
ranking yij (
−
∗ yij (
ranking ∂ψ M (q,yq ,W,X ) ) ∂W
M (q,x )−DM (q,x )) ∂(DW qi qj W ∂W |Xq+ ||Xq− |
M (q,x )−DM (q,x )) ∂(DW qi qj W ∂W |Xq+ ||Xq− |
))
) (9)
xqi ∈Xq+ xqj ∈Xq− ranking ∗ n (yij −yij ) ∂(D M (q,xqi )−D M (q,xqj )) W W . =I−C + n ∂W |Xq ||Xq− | + − q=1 x ∈X x ∈X q q qi qj
300
We provide the details of further derivation and analysis in Part I of Supplementary Material due to page limitation. By these derivation and analysis,
13
finally, we can change Eq. 9 into ∂F ∂W
=I−
=I−
H n
C n
n
n
∗ ∂ψ M (q,yq ,W,X ) ∂W
(
q=1
q=1 x
+ qi ∈Xq
xqj ∈Xq−
−
ranking ∂ψ M (q,yq ,W,X ) ) ∂W
(yq∗ − yqranking )
φqi −φqj
|Xq+ ||Xq− |
(10)
,
where H = (n − 1)C. Because H only influences the step size in Eq. 10, for conciseness and elegance, we temporarily suggest setting = 1/(n − 1) to make 305
H = C. Compared with Eq. 9, Eq. 10 is tractable and simple. Eq. 10 holds M the meaning of the tendency to improve the discriminability of DW on manifold
M. On the other hand, learning W in the chart codomain E of manifold M can be formulated by arg min F1 (W, ξ) = tr(W ) + C1 ξ
(11)
W
310
s.t. 1 n
n
(ψ E (q, yq∗ , W, X ) − ψ E (q, yqranking , W, X )) ≥
q=1
1 n
n q=1
Δ(yq∗ , yqranking ) − ξ, W ≥ 0,
where ψ E (q, yqranking , W, X ) = E (a, b) = which DW
ranking yij (
xqi ∈Xq+ xqj ∈Xq− E 1/2 1/2 D (W a, W b), and C1 is
ξ ≥ 0,
E E DW (q,xqi )−DW (q,xqj ) ) |Xq+ ||Xq− |
in
the trade-off parameter.
The gradient of F1 (W, ξ), which holds the meaning of the tendency to imE prove the distance discriminability of DW , is given by ∂F1 ∂W
=I−
C1 n
n q=1 x
+ qi ∈Xq
− xqj ∈Xq
φ
−φ
(yq∗ − yqranking ) |Xqi+ ||Xqj − . | q
(12)
q
315
By comparing Eq. 10 and Eq. 12, we find that the two gradients
∂F ∂W
=
∂F1 ∂W
when H = C1 . This outcome shows that, for the variable and sparse data in person re-identification, improving distance discriminability in E can help improving dissimilarity discriminability on M.
14
320
3.4. Learning for Re-identification Training NSML uses backtracking line search based on Armijo-Goldstein condition [33]. All procedures are displayed in Part II of Supplementary Material because of page limitation. For train-test consistency, in the testing stage, we also map the testing data
325
to a neighborhood structure manifold discriminative for re-identification by our previous work Common-Near-Neighbor Modeling (CNNM) dissimilarity [34], which is composed of the symmetric and asymmetric terms: DCNNM (a, b) = Rsym (a, b) + 2λlRasym (a, b),
(13)
Rsym (a, b) = Ralocal (b) + Rblocal (a),
(14)
where
Ralocal (b) =
l−1
Ob (ha (i));
(15)
Rasym (a, b) = min(Oa (b), Ob (a)).
(16)
i=0
330
l is the local neighborhood size parameter and λ is the trade-off parameter. Briefly, Rsym counts the sample rank orders within the local area for a and b from each other’s list, which has been illustrated in Fig. 3. And Rasym counts the minimum of the rank orders for a and b in each other’s list. local neighborhood size
a
a
c ha (0)
b
d ha (1)
b
d
0
1
e
b
ha (l-1)
…
a
c
e
l-1
l
l+1
…
Figure 3: Illustration for “symmetric dissimilarity”. Rlocal (b) is calculated from the 0th to a (a) is calculated in a similar way. the (l − 1)th nearest neighbor in a-list, and Rlocal b
Dissimilarity DCNNM uses local averaging to enhance the measure robustness 335
against the class outliers. The local neighborhood size l is specified as around half the class size to maximize the capture of within-class sample distribution 15
information for each sample. If l is too large, intruders from different classes may be involved; if, on the other hand, l is too small, useful local neighborhood information will be lost. DCNNM also uses the asymmetric term to tackle the 340
asymmetric ranking problem that a pair of samples usually do not yield the same rank order for each other in their own rank order lists. This term has been confirmed especially helpful cooperating with the symmetric dissimilarity by λ when the class size is small [31]. In both training and testing, we take advantage of the neighborhood struc-
345
ture manifold. There is close relationship between DCNNM in Eq. 13 and DM in Eq. 1 actually. Both of them computes the neighborhood-wise dissimilarity. DCNNM can be regarded as generalizing DM with the additional local averaging operation and the auxiliary asymmetric term. On the other hand, DM can also be treated as instantiating the parametric space of DCNNM in a special way
350
that shrinks the local neighborhood of the symmetric term and discards the asymmetric term. In contrast to Eq. 1 concerning the pair of samples themselves on M, averaging rank-order dissimilarities in the local area by Eq. 13 has some effect of shifting the samples to their neighborhood centers on M before dissimilarity
355
measure. The local averaging contributes to the robustness of neighborhood structure dissimilarity against the sample deviation of each class, which is, honestly, desired in the testing stage. However, in the training stage this process inevitably weakens the learner to cope with the potential intruders or extruders of each class by itself, so it is better not to do local averaging in training.
360
Actually, setting l = 1 is equivalent to tightening the constraints of relative comparison among neighborhood structure dissimilarities. If the relative dissimilarity comparison for every possible triplet of samples (two are in the same class and one is from a different class) has been achieved, the local averaging will actually become superfluous. Last but not least, such simplification can
365
also boost the efficiency due to avoiding the extra computations in the local areas. During training, the neighborhood structure of each sample will be re16
organized aimed at enhancing the dissimilarity discriminability on M. The asymmetric ranking problem will also be tackled along with the neighborhood 370
structure re-organization. Hence, the asymmetric term becomes redundant and can be removed while training NSML. And so does it to the testing stage attributing to the learned space, which is different from the unsupervised case for CNNM. This simplification saves the effort to search for a suitable λ. 4. Experiments
375
4.1. Dataset Description We evaluate NSML for re-identification on widely-used benchmark datasets: VIPeR [4], ETHZ1 [35], i-LIDS [36], Caviar4REID [37], EPFL [38], PRID450S\2011 [39, 40], RAiD [41], and WARD [42]. All of them undergo real-scenario complexities in spite of the distinct challenges of each their own. The description and
380
illustration of these datasets are detailed in Part III of Supplementary Material due to page limitation. 4.2. Experimental Setup Because datasets have different characteristics, we suggest using different signatures for different datasets. We choose two well-known signatures to describe
385
human images. One is combined by Densely Sampled Color Histograms and Weighted HSV color histograms, denoted by “DSCH+WHSV” [2, 13, 22, 34]; the other is concatenated by DSCH, Schmid-Filter-Bank, and Gabor-Transform, denoted by “DSCH+SFB+GT” [19, 23, 24, 31]. Both signatures have reliable performance for metric learning based approaches. And we have found that
390
DSCH+WHSV is more suitable for VIPeR, ETHZ1, i-LIDS, and RAiD, while DSCH+SFB+GT is more suitable for Caviar4REID, EPFL, and PRID2011\450S, and WARD. Feature representation is not the focus of this paper, but we admit introducing a more effective representation to our model could likely result in an improved performance. We perform ten-fold cross validation with randomly
395
halving the dataset for train-test split every time. Each time we use the same
17
train-test split for method comparison. In evaluation, for each identity, we use one image as the corpus sample; in the query domain, we use all the images, but match each of them to the corpus domain one by one. Unlike other re-identification datasets, RAiD and WARD are specifically 400
built for re-identification across multiple camera views, although multiple camera views can be split into a group of camera view pairs. This paper focuses on re-identifying people between two non-overlapping cameras. So in principle, the datasets for two-camera re-identification are the best choices. However, we would like to take this opportunity to challenge our proposed method on
405
RAiD and WARD by recasting the multiple-camera re-identification into the two-domain matching problem. In both training and testing, we treat one camera view as the corpus domain (camera 1 for RAiD; camera 3 for WARD), while keep the other camera views together as the query domain. 4.3. Parameters and Robustness
410
Compared with original CNNM, during training, NSML reduces the neighborhood scope by setting the local neighborhood size l = 1 in the symmetric term, and removes the asymmetric term by setting the trade-off parameter λ = 0. These simplifications just embody the robustness of NSML and contribute to widening the application range of the model.
415
Since DM is a special instantiation of the parametric space of DCNNM , we here generalize NSML by involving DCNNM instead of DM in training, to well validate the robustness of our simplifications of parameters. We temporarily initialize C = 10 and = 1/(n − 1) by heuristic for conveniently discussing l and λ. Denote the recommended l and λ from CNNA
420
[31] by lori and λori . For VIPeR, lori = 1 and λori = 1; for ETHZ1, lori = 30 and λori = 0; for i-LIDS, lori = 2 and λori = 0.1; for Caviar4REID and EPFL, as they have the class sizes larger than i-LIDS but smaller than ETHZ1, we set lori = 5 and λori = 0.01 compromisingly; for PRID450S, we compare it to VIPeR and set lori = 1 and λori = 1; for RAiD, lori = 55 and λori = 0; for WARD, lori = 25
425
and λori = 0. 18
Table 1: Contrastive analysis on NSML for λtr and λts when setting ltr = lts = lori . (Results are given in the format of “λtr = λts = 0 \ λtr = λts = λori ”.) Dataset Rank-1 Rank-3 Rank-5 VIPeR 25.92 \ 25.28 43.10 \ 42.63 52.72 \ 52.75 ETHZ1 77.10 \ 77.10 89.94 \ 89.94 93.11 \ 93.11 i-LIDS 39.00 \ 39.00 57.33 \ 56.97 64.79 \ 65.02 Caviar4REID 36.54 \ 36.60 51.85 \ 51.58 60.82 \ 61.07 EPFL 56.62 \ 56.76 84.24 \ 84.20 92.02 \ 92.59 PRID450S 25.87 \ 25.20 43.02 \ 43.11 51.29 \ 51.56 RAiD 15.47 \ 15.47 32.39 \ 32.39 45.28 \ 45.28 WARD 64.69 \ 64.69 83.47 \ 83.47 83.47 \ 83.47
Rank-7 59.84 \ 59.37 94.57 \ 94.57 71.13 \ 70.64 67.51 \ 67.82 97.11 \ 96.64 56.44 \ 57.20 54.89 \ 54.89 93.11 \ 93.11
To study the effect of eliminating λ, we compare NSML between λtr = λts = 0 and λtr = λts = λori with setting ltr = lts = lori , where subscripts “tr” and “ts” of l or λ indicate the stages of training and testing. The results are tabulated in Table 1, from which, we can see the dispensable role of the asymmetric term 430
for NSML. To investigate the effect of decreasing l in training, we compare NSML between ltr = 1 and ltr = lori with setting lts = lori and λtr = λts = 0. We expect the local averaging can provide dissimilarity measure with additional robustness in testing after learning. But during learning, we don’t want to encapsulate lo-
435
calities into dissimilarity measure. The results are exhibited in Table 2, from which, we can see the trivial role of local averaging in the symmetric term on datasets VIPeR, i-LIDS, and PRID450S having few images per class. Whereas, on datasets ETHZ1, Caviar4REID, and EPFL having multiple images in each class, ltr = 1 has small but discernible advantage over ltr = lori , and the advan-
440
tage is much more obvious on datasets RAiD and WARD. More rigorously, we compare NSML between λtr = λts = 0 and λtr = λts = λori with setting ltr = 1 and lts = lori . The results are described in Table 3, which still show the minor function of the trade-off parameter. According to Table 1 and 3, we can directly compare NSML between ltr = 1
445
and ltr = lori with setting lts = lori and λtr = λts = λori in Table 4. Consistent with Table 2, the results still vary little on datasets VIPeR, i-LIDS, PRID450S, and ltr = 1 has some slight advantage over ltr = lori for datasets ETHZ1, Caviar4REID, 19
Table 2: Contrastive analysis on NSML for ltr when setting lts = lori and λtr = λts = 0. (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “ltr Rank-1 25.92 \ 25.92 78.13 \ 77.10 38.54 \ 39.00 37.08 \ 36.54 58.39 \ 56.62 25.87 \ 25.87 23.39 \ 15.47 68.66 \ 64.69
= 1 \ ltr = lori ”.) Rank-3 Rank-5 43.10 \ 43.10 52.72 \ 52.72 90.36 \ 89.94 93.49 \ 93.11 57.38 \ 57.33 65.07 \ 64.79 52.92 \ 51.85 61.96 \ 60.82 85.68 \ 84.24 93.62 \ 92.02 43.02 \ 43.02 51.29 \ 51.29 50.80 \ 32.39 66.37 \ 45.28 86.24 \ 83.47 92.07 \ 90.10
Rank-7 59.84 \ 59.84 94.73 \ 94.57 71.09 \ 71.13 68.60 \ 67.51 97.63 \ 97.11 56.44 \ 56.44 77.67 \ 54.89 94.38 \ 93.11
Table 3: Contrastive analysis on NSML for λtr and λts when setting ltr = 1 and lts = lori . (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “λtr Rank-1 25.92 \ 25.28 78.13 \ 78.13 38.54 \ 38.71 37.08 \ 37.10 58.39 \ 58.45 25.87 \ 25.20 23.39 \ 23.39 68.66 \ 68.66
= λts = 0 \ λtr Rank-3 43.10 \ 42.63 90.36 \ 90.36 57.38 \ 57.77 52.92 \ 52.94 85.68 \ 85.74 43.02 \ 43.11 50.80 \ 50.80 86.24 \ 86.24
= λts = λori ”.) Rank-5 52.72 \ 52.75 93.49 \ 93.49 65.07 \ 64.96 61.96 \ 61.75 93.62 \ 93.70 51.29 \ 51.56 66.37 \ 66.37 92.07 \ 92.07
Rank-7 59.84 \ 59.37 94.73 \ 94.73 71.09 \ 70.76 68.60 \ 68.46 97.63 \ 97.69 56.44 \ 57.20 77.67 \ 77.67 94.38 \ 94.38
and EPFL, and the advantage is much larger on datasets RAiD and WARD. The positive effect of local averaging in the testing stage has been confirmed 450
again by Table 5 that compares NSML between lts = lori and lts = 1 with setting ltr = 1 and λtr = λts = 1, which agrees with the conclusion of our previous work CNNM well [31]. We also spare no effort on discussing whether λts is necessary or not in testing by comparing NSML between λts = 0 and λts = λori when eliminating
455
the asymmetric term in training with setting ltr = 1 and lts = lori and λtr = 0. The results are given in Table 6, which once more supports the claim of triviality of λts . 4.4. Stability Stability indicates that the model can work reliably with normal convergence
460
and satisfactory performance with the optimization parameters in the regular 20
Table 4: Contrastive analysis on NSML for ltr when setting lts = lori and λtr = λts = λori . (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “ltr Rank-1 25.92 \ 25.28 78.13 \ 77.10 38.71 \ 39.00 37.10 \ 36.60 58.45 \ 56.76 25.20 \ 25.20 23.39 \ 15.47 68.66 \ 64.69
= 1 \ ltr = lori ”.) Rank-3 Rank-5 43.10 \ 42.63 52.72 \ 52.75 90.36 \ 89.94 93.49 \ 93.11 57.77 \ 56.97 64.96 \ 65.02 52.94 \ 51.58 61.75 \ 61.07 85.74 \ 84.20 93.70 \ 92.59 43.11 \ 43.11 51.56 \ 51.56 50.80 \ 32.39 66.37 \ 45.28 86.24 \ 83.47 92.07 \ 90.10
Rank-7 59.84 \ 59.37 94.73 \ 94.57 70.76 \ 70.64 68.46 \ 67.82 97.69 \ 96.64 57.20 \ 57.20 77.67 \ 54.89 94.38 \ 93.11
Table 5: Contrastive analysis on NSML for lts when setting ltr = 1 and λtr = λts = 0. (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “lts Rank-1 25.92 \ 25.92 78.13 \ 74.61 38.54 \ 39.10 37.08 \ 35.90 58.39 \ 48.49 25.87 \ 25.87 23.39 \ 19.85 68.66 \ 59.46
= lori \ lts = 1”.) Rank-3 Rank-5 43.10 \ 43.10 52.72 \ 52.72 90.36 \ 88.25 93.49 \ 91.68 57.38 \ 56.91 65.07 \ 64.16 52.92 \ 51.91 61.96 \ 59.87 85.68 \ 75.79 93.62 \ 87.87 43.02 \ 43.02 51.29 \ 51.29 50.80 \ 44.90 66.37 \ 60.45 86.24 \ 79.14 92.07 \ 87.03
Rank-7 59.84 \ 59.84 94.73 \ 93.44 71.09 \ 69.45 68.60 \ 66.46 97.63 \ 94.11 56.44 \ 56.44 77.67 \ 71.45 94.38 \ 90.78
range. We evaluate the stability of NSML by uncovering the relationship between the convergence situation and H in Eq. 10. The backtracking line search scheme is adopted based on the Armijo-Goldstein condition, where the convergence threshold is set as 10−5 , Armijo rule number is set as 10−5 , the learning 465
rate is initialized as 10−8 (n − 1)C, and the increase and decrease parameters for controlling the learning rate after each iteration is ((1 + ((1 +
√ 2
√ 2
5)/2)1/3 and
5)/2)−1 , respectively. We also set the maximum number of backtracking
100. If the backtracking time exceeds this number, the program will be forced to stop even before convergence, and warns the ill-initialization of the learning 470
rate. Retrospecting H = (n − 1)C, we will discuss two cases: evaluating C by fixing ; evaluating by fixing C. Firstly, we set = 1/(n − 1) for experimental succinctness. In this case, H = C. The iteration numbers and recognition rates are averaged according to the cross-validation principle on each dataset. The results are recorded in Table 21
Table 6: Contrastive analysis on NSML for λts when setting ltr = 1 and lts = lori and λtr = 0. (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
475
the format of “λts Rank-1 25.92 \ 25.38 78.13 \ 78.13 38.54 \ 38.48 37.08 \ 37.08 58.39 \ 58.19 25.87 \ 24.98 23.39 \ 23.39 68.66 \ 68.66
= 0 \ λts = λori ”.) Rank-3 Rank-5 43.10 \ 42.41 52.72 \ 52.37 90.36 \ 90.36 93.49 \ 93.49 57.38 \ 57.54 65.07 \ 64.91 52.92 \ 52.94 61.96 \ 61.96 85.68 \ 85.68 93.62 \ 93.54 43.02 \ 43.42 51.29 \ 51.47 50.80 \ 50.80 66.37 \ 66.37 86.24 \ 86.24 92.07 \ 92.07
Rank-7 59.84 \ 58.89 94.73 \ 94.73 71.09 \ 70.86 68.60 \ 68.56 97.63 \ 97.69 56.44 \ 51.47 77.67 \ 77.67 94.38 \ 94.38
7, where H ∈ {0.01, 0.1, 1, 10, 100, 1000, 10000}. The contents from the first row to the seventh row correspond with the experimental results under different Hs in an increasing order for each dataset. Table 7: The average iteration number (aver iter num) till convergence and the recognition rate on rank-1,5 for different Hs in NSML. (Results are given in the format of “aver iter num \ rank-1 \ rank-5”.) VIPeR 23.2 \ 23.77 \ 52.03 26.6 \ 23.42 \ 51.55 16.3 \ 24.78 \ 52.03 13.2 \ 25.92 \ 52.72 14.9 \ 25.92 \ 52.75 12.0 \ 25.28 \ 51.96 14.6 \ 23.42 \ 51.11 Caviar4REID 29.7 \ 28.12 \ 55.15 14.3 \ 37.03 \ 61.77 14.1 \ 37.20 \ 61.69 14.4 \ 37.08 \ 61.96 14.4 \ 37.01 \ 61.81 9.8 \ 34.78 \ 60.61 8.8 \ 29.79 \ 55.62
ETHZ1 11.8 \ 74.22 \ 91.13 19.8 \ 75.98 \ 92.08 5.0 \ 78.17 \ 93.51 5.2 \ 78.13 \ 93.49 5.4 \ 78.16 \ 93.50 4.1 \ 78.12 \ 93.49 2.0 \ 78.34 \ 93.47 EPFL 12.6 \ 47.42 \ 86.12 9.2 \ 58.34 \ 93.85 10.1 \ 58.33 \ 94.04 9.2 \ 58.39 \ 93.62 8.3 \ 58.57 \ 93.69 5.3 \ 58.07 \ 93.08 3.1 \ 54.24 \ 90.59
i-LIDS 20.3 \ 35.42 \ 64.64 10.3 \ 36.50 \ 62.85 8.0 \ 38.54 \ 65.01 8.2 \ 38.54 \ 65.07 6.9 \ 38.43 \ 64.84 4.8 \ 38.38 \ 64.85 2.9 \ 36.60 \ 63.05 PRID450S 16.5 \ 20.67 \ 46.44 9.7 \ 25.73 \ 51.51 9.3 \ 25.73 \ 51.47 9.4 \ 25.87 \ 51.29 8.1 \ 25.78 \ 51.42 4.9 \ 23.64 \ 48.67 4.9 \ 20.09 \ 42.80
RAiD 8.69 \ 34.91 24.11 \ 69.40 27.08 \ 72.29 23.39 \ 66.37 21.98 \ 62.38 21.70 \ 61.26 20.86 \ 62.43 WARD 13.6 \ 52.63 \ 85.01 6.0 \ 68.73 \ 92.10 6.5 \ 68.71 \ 92.07 6.4 \ 68.66 \ 92.07 5.6 \ 68.81 \ 92.12 4.5 \ 68.36 \ 92.02 2.1 \ 60.84 \ 88.20
9.3 37.3 97.0 36.7 31.8 31.9 32.6
\ \ \ \ \ \ \
It is nearly impossible to draw a conclusion of the constant uniform H setting to cope with different challenges in different datasets. Even so, we endeavor 480
to find an approximate common range of H for the relatively stable satisfactory performance of NSML taking into consideration of the overall results. From the results in Table 7, we can see that not only does the model system always 22
converge for different Hs, but also the iteration number decreases when H increases on the whole, except for a slight fluctuation that is ineluctably caused 485
by the data complexity. The iteration number and the recognition rate for H ∈ {0.01, 0.1} are apparently lower than H ∈ {1, 10, 100} on the whole. When H ∈ {1000, 10000}, the recognition rate trend is downward. The results show that the model has stable satisfactory performance for H ∈ [1, 100] considering the diversity of these datasets for a variety of real-world challenges.
490
Then, we testify with setting C = 10. The experimental details and results are given in Part IV of Supplementary Material because of page limitation. Considering the results comprehensively, we recommend using the value around = 1/(n − 1) to avoid under-training whilst maximizing the discriminability of W . Thus, H and C share the same recommended scope of assignment.
495
From the above, we can see that although NSML seems to contain many parameters, C is the only parameter required to be discussed during training. Actually, for C, we have found out the most suitable range in consideration of a variety of real-world challenges based on the experimentation and analysis across diverse datasets. Moreover, parameters ltr and λtr are just used to help
500
analyze the intrinsic mechanism of the proposed algorithm, and they have the fixed recommended values already independent of training data. The setting of lts and λts in testing can directly follow the conclusion of the work CNNM [31], which is also easy to handle, and actually, the small fluctuation of their values will not cause the big change of CNNM performance. Therefore, we don’t need
505
to concern too much about parameter setting during implementation. 4.5. Complexity and Efficiency Model efficiency demonstration is carried out from both theoretical and practical perspectives. From the theoretical perspective, we analyze the computational complexity for training NSML; from the practical perspective, we test the
510
real time expense under ten-fold cross-validation. The details can refer to Part V of Supplementary Material due to page limitation.
23
4.6. Effectiveness Related methods for comparison include CNNA, MLR, RDC, LMNN, XQDA, SDALF, MFA (Marginal Fisher Analysis) [43, 44], kLFDA (kernel Local Fisher 515
Discriminant Analysis) [45, 44], and rPCCA (regularized Pairwise Constrained Component Analysis) [24, 44]. The same as NSML, methods CNNA, MLR, RDC, LMNN, and SDALF use DSCH+WHSV and DSCH+SFB+GT as the original features. It has been testified that there is no big performance difference between original SDALF and DSCH+WHSV or DSCH+SFB+GT, while
520
the vector-form features provide more convenience for metric learning strategies [22]. Hence, we use SDALF to indicate measuring original features in Euclidean space, which serves as the baseline for learning approaches in comparison. MFA, kLFDA, and rPCCA adopt the linear kernel based on their own 75-region features, because we find they perform more robustly with the linear kernel and
525
their recommended features based on their published code. NSML in comparisons chooses the parameters {ltr = 1, λtr = 0} and {lts = lori , λts = 0} according to the discussion in Section 4.3. The results are illustrated by CMC (Cumulative Match Characteristic) curves in Fig. 4. It can be seen, on the whole, NSML has encouraging performance. De-
530
spite the unstable behavior of other competitors across datasets, NSML readily leads the performance. In greater details, we can see the stair-like performance enhancement of MLR, CNNA, and NSML. MLR is a suitable and capable metric learning approach for re-identification. As the closest analogue of NSML, CNNA has been verified more effective than MLR due to exploiting the neigh-
535
borhood structure information by CNNM in the testing stage. Further, NSML performs metric learning in the neighborhood structure manifold instead of the original feature space as what has been done by CNNA, so that the neighborhood structure dissimilarity based discrimination is directly optimized, which makes the best out of CNNM in testing.
540
NSML performs remarkably better than CNNA on the datasets of serious variability and sparsity, like VIPeR and PRID450S. In Caviar4REID, i-LIDS, and EPFL with severe variability but less sparsity, NSML also performs well 24
VIPeR
i-LIDS
ETHZ1
60
75
100
70
55
65
45 40 35 30
NSML CNNA
25
MLR RDC LMNN
20 15 10
90
85
80
NSML CNNA
75
MLR RDC LMNN
70
XQDA SDALF 1
2
3
4
5
6
65
7
1
2
3
4
5
6
50 45 40 35
25
5
7
55 50 45 40 35
NSML
30
CNNA MLR
25
RDC LMNN XQDA SDALF
15
5
6
85 80 75 70 65
NSML
60
CNNA MLR
55
RDC LMNN XQDA SDALF
50 45 40
7
Recognition Percentage
90
Recognition Percentage
60
1
2
3
Rank Score
4
VIPeR
5
6
7
5
6
45 40 35 30
NSML 25
CNNA MLR
20
RDC LMNN XQDA SDALF
15 10
7
1
2
3
4
5
6
7
Rank Score
ETHZ1
i-LIDS
95
75 70
55 90
45 40 35 30 25 20
NSML
15
MFA kLFDA rPCCA
10
2
3
4
5
6
85
80
75
70
65
NSML MFA kLFDA rPCCA
60
55
7
1
2
3
Rank Score
Caviar4REID
35 30 25 20
NSML
15
MFA kLFDA rPCCA
5 0
7
1
2
3
Recognition Percentage
50 45 40 35 30 25 20 15
NSML
10
MFA kLFDA rPCCA
5 5
6
75 70 65 60 55 50 45 40 35 30 25 20
NSML MFA kLFDA rPCCA
15 10 5 7
0
1
2
3
4
5
6
7
PRID450S 55
55
4
Rank Score
60
85
Rank Score
40
90 80
4
45
100 95
60
3
6
50
EPFL
65
2
5
55
Rank Score
70
1
4
60
10
Recognition Percentage
1
65
Recognition Percentage
Recognition Percentage
50
0
4
50
Rank Score
60
5
3
PRID450S 55
4
2
EPFL 60
3
1
Rank Score
95
2
XQDA SDALF
10
100
1
MLR RDC LMNN
20
65
10
NSML CNNA
30
70
20
Recognition Percentage
55
Rank Score
Caviar4REID
Recognition Percentage
60
15
XQDA SDALF
Rank Score
Recognition Percentage
Recognition Percentage
50
Recognition Percentage
Recognition Percentage
95
5
6
50 45 40 35 30 25
NSML
20
MFA kLFDA rPCCA
15
7
10
1
2
Rank Score
3
4
5
6
7
Rank Score
Figure 4: Result comparisons across benchmarks.
whereas the gap between it and its runner-up CNNA becomes narrow. As for ETHZ1, which has the slightest variability and sparsity due to the relatively 545
adequate and well-distributed class samples in the feature space, NSML falls behind CNNA while still outstrips other rivals. As for other methods, LMNN has moderate performance. LMNN learns the metric beneficial for classification instead of ranking. Ranking is more robust to the real-world re-identification complexity due to jointly concerning the behav-
550
iors of several top candidates, which can make up the shortage of classification that overemphasizes the accuracy at rank-1. However, classification doesn’t 25
totally run counter to ranking because ranking can be regarded as a kind of relaxed multi-class classification in essence. RDC has moderate performance as well when the samples are not sufficient 555
for each class. When the class size increases, its performance falls. RDC optimizes the relative distance comparisons based on logistic function constraints. This soft discriminant manner has intrinsic robustness to over-fitting. However, usually, model robustness and representability go against each other. When the class size increases, the over-fitting problem will be alleviated with it. It be-
560
comes comparatively outstanding that the soft discriminant manner will limit the model representability of the class information. XQDA also has a moderate performance when using the same baseline features as other methods. It implies that this method is sensitive to the original feature space.
565
Compared with kernel-based metric learning approaches rPCCA, MFA, and kLFDA, NSML also has advantage. rPCCA suggests relaxing the pairwise constraints by choosing a small portion, around one-tenth, from a vast number of constraints. However, rPCCA itself is more or less sensitive to the constraints selected for the highly variable data. MFA and kLFDA perform normally on
570
VIPeR, ETHZ1, and PRID450S, while lose effect on i-LIDS, Caviar4REID, and EPFL. Kernel-based metric learning relies on the kernel-based dissimilarity and reference points, which may not fit each application-specific dataset. The results reveal that MFA and kLFDA models themselves are not suitable for i-LIDS, Caviar4REID, and EPFL. We have also found that MFA and kLFDA produce
575
large training error on them even. When the kernel-based metric learning models are suitable, the performance are mainly influenced by the data complexity. On less complex ETHZ1 they are approaching NSML, while on more complex VIPeR and PRID450S they follow behind. We also provide the comparison between NSML and some reported tech-
580
niques on the popular VIPeR dataset in Table 8. These techniques are PCCA (Pairwise Constrained Component Analysis) [24], SSCDL (Semi-Supervised Coupled Dictionary Learning) [46], ColorInv (Color Invariants) [47], LFDA (Local 26
Fisher Discriminant Analysis) [45], eLDFV (ensemble Local Descriptor Fisher Vector) [48], KISSME (Keep It Simple and Straightforward MEtric) [14], CPS 585
(Custom Pictorial Structure) [49], RankSVM, ITML (Information-Theoretic Metric Learning) [50], MCML (Maximally Collapsing Metric Learning) [18], and XQDA [15]. Here, we directly quote or generate the results from their publications or codes. NSML uses the same train-test splits as SDALF. Although the results recorded for other approaches are probably not exactly the same, the comparison is relatively fair in the statistical averaging sense. Table 8: VIPeR dataset: Top ranked matching rates (%) with 316 persons. Method Rank-1 Rank-5 Rank-10 Rank-20 NSML(LOMO+XQDA) 42.59 72.88 84.49 93.51 XQDA(LOMO) 40.00 68.13 80.51 91.08 NSML(HSV+DSCH) 25.92 52.72 67.15 81.20 SSCDL 25.60 53.70 68.10 83.60 ColorInv 24.24 44.91 56.55 69.40 PCCA-χ2RBF 19.27 48.89 64.91 80.28 LFDA 24.18 51.20 67.12 82.00 eLDFV 22.34 47.00 60.00 71.00 KISSME 17.47 46.42 61.23 75.82 CPS 21.00 45.00 57.00 71.00 RankSVM 13.00 37.00 51.00 68.00 ITML 11.61 31.39 45.76 63.86 MCML 15.19 41.77 57.59 73.39
590
Obviously, NSML is superior to the techniques in Table 8. However, the pure performance race is innocuous due to several realistic factors. We believe that a deeper study on how the proposed model differentiates itself from closely related works under shared conditions may be more informative and meaningful than a 595
pure competition on the final overall performance of all kinds of methods. The real advantage of NSML can be seen from the ability to enhance the performance upon the original feature. Because NSML is based on the extracted features so its performance is unavoidably influenced and limited by them. We further test NSML in LOMO feature space transformed by XQDA which is denoted
600
by “LOMO+XQDA”. The results show that NSML actually has the ability to stand ahead of the advanced performance with the state-of-the-art feature
27
space. Furthermore, the experiments on multiple-camera datasets RAiD and WARD are detailed in Part VI of Supplementary Material because of page limitation. 605
At last, re-identifying the same or similar identities between the query and corpus domains belongs to enclosure re-identification. To further demonstrate the ability of NSML, we do re-identification with much more identities in the corpus domain than the query domain by PRID2011 with treating view A as query and view B as corpus in Part VII of Supplementary Material due to page
610
limitation. 4.7. Generalizability In the real-scenario surveillance, we may also obtain other valuable cues complementary to the full body appearance, like the face cue, gait cue, and so forth, for unobtrusive re-identification/recognition. We provide extra experiments to
615
validate the generalizability of NSML in two further application scenarios: face recognition and gait recognition. For face recognition, we compare NSML with LMNN, EigenFace, PCA, and the L2 norm in the original image space on the AR face database [51]. For gait recognition, we compare NSML with RVTM (Robust View Transformation Model) and TSVD (Truncated Singular Value Decomposition) in the GEI (Gait Energy Image) feature space on the CASIA gait dataset B [52, 53]. More details are provided in Part VIII of the Supplementary Material because of the page limitation.
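As background for the gait setting, the GEI is simply the pixel-wise average of size-normalized, centered binary silhouettes over a gait cycle. A minimal sketch of this standard construction (our illustration, not the authors' code):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Gait Energy Image: pixel-wise mean of aligned binary silhouettes.

    silhouettes: sequence of HxW binary (0/1) masks, assumed already
    size-normalized and horizontally centered over one gait cycle.
    Returns an HxW array in [0, 1]; brighter pixels mark static body parts.
    """
    stack = np.stack([np.asarray(s, dtype=np.float64) for s in silhouettes])
    return stack.mean(axis=0)
```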
5. Conclusion and Discussion

This paper has formulated person re-identification as a metric learning problem. In particular, to address the challenging variability and sparsity of human image data, we proposed the NSML method to learn discriminative dissimilarities on a neighborhood structure manifold. Experiments demonstrated the advantage of NSML in terms of effectiveness, robustness, efficiency, stability, and generalizability.
We used widely adopted baseline features to demonstrate NSML in the experiments. Admittedly, some newly proposed methods in the directions of feature representation and feature transformation report higher performance than the ones we used for comparison. Technically, however, NSML can be combined with the newest and most effective feature representation and/or transformation algorithms to further enhance its performance.
Notably, although NSML is specifically designed to address the variability and sparsity problems in re-identification, it can also be applied to other vision problems that share similar characteristics or are closely related to person re-identification, such as object recognition, cross-camera tracking, multi-camera event detection, and so forth. Our ongoing work is extending NSML to solve them.

Acknowledgements

This work has been supported by the Fundamental Research Funds for the Central Universities and JSPS KAKENHI Grant Number 15K16024.
References

[1] R. Vezzani, D. Baltieri, R. Cucchiara, People reidentification in surveillance and forensics: A survey, ACM Computing Surveys 46 (2) (2013) 1–37.

[2] M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by symmetry-driven accumulation of local features, in: Proceedings of the 23rd Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, San Francisco, USA, 2010, pp. 2360–2367.

[3] S. Bak, E. Corvee, F. Bremond, M. Thonnat, Person re-identification using haar-based and dcd-based signature, in: Proceedings of the 7th International Conference on Advanced Video and Signal Based Surveillance, AVSS, IEEE, Boston, USA, 2010, pp. 1–8.

[4] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in: Proceedings of the 10th International Workshop on Performance Evaluation for Tracking and Surveillance, PETS, IEEE, Rio de Janeiro, Brazil, 2007, pp. 41–47.

[5] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by fisher vectors for person re-identification, in: Computer Vision – ECCV 2012 Workshops and Demonstrations, Vol. 7583, Springer Berlin Heidelberg, Florence, Italy, 2012, pp. 413–422.

[6] S. Iodice, A. Petrosino, Salient feature based graph matching for person re-identification, Pattern Recognition 48 (2015) 1074–1085.

[7] Z. Wu, Y. Li, R. J. Radke, Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (5) (2015) 1095–1108.

[8] Z. Liu, Z. Zhang, Q. Wu, Y. Wang, Enhancing person re-identification by integrating gait biometric, Neurocomputing 168 (2015) 1144–1156.

[9] L. An, S. Yang, B. Bhanu, Person re-identification by robust canonical correlation analysis, IEEE Signal Processing Letters 22 (8) (2015) 1103–1107.

[10] N. Martinel, A. Das, C. Micheloni, A. K. Roy-Chowdhury, Re-identification in the function space of feature warps, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (8) (2015) 1656–1669.

[11] B. Kulis, Metric learning: A survey, Foundations and Trends in Machine Learning 5 (4) (2013) 287–364.

[12] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (2009) 207–244.

[13] M. Dikmen, E. Akbas, T. S. Huang, N. Ahuja, Pedestrian recognition with a learned metric, in: Proceedings of the 10th Asian Conference on Computer Vision, ACCV, Vol. 6495, Springer, Queenstown, New Zealand, 2010, pp. 501–512.

[14] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Proceedings of the 25th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Providence, Rhode Island, USA, 2012, pp. 2288–2295.

[15] S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Proceedings of the 28th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Boston, USA, 2015, pp. 2197–2206.

[16] W. Li, Y. Wu, M. Mukunoki, Y. Kuang, M. Minoh, Locality based discriminative measure for multiple-shot human re-identification, Neurocomputing 167 (2015) 280–289.

[17] J. Chen, Z. Zhang, Y. Wang, Relevance metric learning for person re-identification by exploiting listwise similarities, IEEE Transactions on Image Processing 24 (12) (2015) 4741–4755.

[18] A. Globerson, S. Roweis, Metric learning by collapsing classes, in: Proceedings of Advances in Neural Information Processing Systems 18, NIPS, 2006, pp. 451–458.

[19] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Coupled metric learning for single-shot vs. single-shot person re-identification, Optical Engineering 52 (2) (2013) 027203-1–027203-10.

[20] B. Prosser, W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by support vector ranking, in: Proceedings of the British Machine Vision Conference, BMVC, BMVA Press, Wales, UK, 2010, pp. 21.1–21.11.

[21] B. McFee, G. Lanckriet, Metric learning to rank, in: Proceedings of the 27th International Conference on Machine Learning, ICML, Omnipress, Haifa, Israel, 2010, pp. 775–782.

[22] Y. Wu, M. Mukunoki, T. Funatomi, M. Minoh, Optimizing mean reciprocal rank for person re-identification, in: The 2nd International Workshop on Multimedia Systems for Surveillance, MMSS, IEEE, Klagenfurt, Austria, 2011, pp. 408–413.

[23] W.-S. Zheng, S. Gong, T. Xiang, Re-identification by relative distance comparison, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) 653–668.

[24] A. Mignon, F. Jurie, PCCA: A new approach for distance learning from sparse pairwise constraints, in: Proceedings of the 25th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Providence, Rhode Island, USA, 2012, pp. 2666–2672.

[25] D. Ramanan, S. Baker, Local distance functions: A taxonomy, new algorithms, and an evaluation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (4) (2011) 794–806.

[26] D. Kedem, S. Tyree, F. Sha, G. R. Lanckriet, K. Q. Weinberger, Non-linear metric learning, in: Proceedings of Advances in Neural Information Processing Systems 25, NIPS, Curran Associates, Inc., Lake Tahoe, Nevada, US, 2012, pp. 2573–2581.

[27] J. Hu, J. Lu, Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Columbus, Ohio, US, 2014, pp. 1875–1882.

[28] Z. Huang, R. Wang, S. Shan, X. Chen, Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning, Pattern Recognition 48 (2015) 3113–3124.

[29] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: Proceedings of the 13th European Conference on Computer Vision, ECCV, Vol. 8695, Springer, Zurich, Switzerland, 2014, pp. 1–16.

[30] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Locality based discriminative measure for multiple-shot person re-identification, in: Proceedings of the 10th International Conference on Advanced Video and Signal Based Surveillance, AVSS, IEEE, Krakow, Poland, 2013, pp. 312–317.

[31] W. Li, M. Mukunoki, Y. Kuang, Y. Wu, M. Minoh, Person re-identification by common-near-neighbor analysis, IEICE Transactions on Information and Systems E97-D (2014) 1745–1361.

[32] W. J. Scheirer, M. J. Wilber, M. Eckmann, T. E. Boult, Good recognition is non-metric, Pattern Recognition 47 (2014) 2721–2731.

[33] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.

[34] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Common-near-neighbor analysis for person re-identification, in: Proceedings of the 19th International Conference on Image Processing, ICIP, IEEE, Florida, USA, 2012, pp. 1621–1624.

[35] A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in: Proceedings of the 11th International Conference on Computer Vision, ICCV, IEEE, Rio de Janeiro, Brazil, 2007, pp. 1–8.

[36] W.-S. Zheng, S. Gong, T. Xiang, Associating groups of people, in: Proceedings of the 20th British Machine Vision Conference, BMVC, BMVA Press, London, UK, 2009, pp. 23.1–23.11.

[37] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, V. Murino, Custom pictorial structures for re-identification, in: Proceedings of the British Machine Vision Conference, BMVC, BMVA Press, Scotland, UK, 2011, pp. 68.1–68.11.

[38] Y. Xu, L. Lin, W.-S. Zheng, X. Liu, Human re-identification by matching compositional template with cluster sampling, in: Proceedings of the 14th International Conference on Computer Vision, ICCV, IEEE, Sydney, Australia, 2013, pp. 3152–3159.

[39] M. Hirzer, C. Beleznai, P. M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Proceedings of the Scandinavian Conference on Image Analysis, SCIA, Vol. 6688, Springer Berlin Heidelberg, Ystad, Sweden, 2011, pp. 91–102.

[40] P. M. Roth, M. Hirzer, M. Koestinger, C. Beleznai, H. Bischof, Mahalanobis distance learning for person re-identification, in: S. Gong, M. Cristani, S. Yan, C. C. Loy (Eds.), Person Re-Identification, Advances in Computer Vision and Pattern Recognition, Springer-Verlag London, London, UK, 2014, pp. 247–267.

[41] A. Das, A. Chakraborty, A. K. Roy-Chowdhury, Consistent re-identification in a camera network, in: Proceedings of the 13th European Conference on Computer Vision, ECCV, Vol. 8690, Springer, Zurich, Switzerland, 2014, pp. 330–345.

[42] N. Martinel, C. Micheloni, C. Piciarelli, Distributed signature fusion for person re-identification, in: Proceedings of the Sixth International Conference on Distributed Smart Cameras, ICDSC, IEEE, Hong Kong, China, 2012, pp. 1–6.

[43] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: A general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 40–51.

[44] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: Proceedings of the 13th European Conference on Computer Vision, ECCV, Springer, Zurich, Switzerland, 2014, pp. 1–16.

[45] S. Pedagadi, J. Orwell, S. Velastin, B. Boghossian, Local fisher discriminant analysis for pedestrian re-identification, in: Proceedings of the 26th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Portland, OR, USA, 2013, pp. 3318–3325.

[46] X. Liu, M. Song, D. Tao, X. Zhou, C. Chen, J. Bu, Semi-supervised coupled dictionary learning for person re-identification, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Columbus, Ohio, USA, 2014, pp. 3550–3557.

[47] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person re-identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (7) (2013) 1622–1634.

[48] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by fisher vectors for person re-identification, in: Computer Vision – ECCV 2012 Workshops and Demonstrations, Vol. 7583, Springer, Florence, Italy, 2012, pp. 413–422.

[49] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, V. Murino, Custom pictorial structures for re-identification, in: Proceedings of the British Machine Vision Conference, BMVC, BMVA Press, Scotland, UK, 2011, pp. 68.1–68.11.

[50] J. V. Davis, B. Kulis, P. Jain, S. Sra, I. S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, ICML, ACM, Corvallis, Oregon, US, 2007, pp. 209–216.

[51] A. Martinez, R. Benavente, The AR face database, Tech. rep., Computer Vision Center at Universitat Autonoma de Barcelona (1998).

[52] W. Kusakunniran, Q. Wu, H. Li, J. Zhang, Multiple views gait recognition using view transformation model based on optimized gait energy image, in: Proceedings of the 12th International Conference on Computer Vision Workshops, ICCV Workshops, IEEE, Kyoto, Japan, 2009, pp. 1058–1064.

[53] S. Zheng, J. Zhang, K. Huang, R. He, T. Tan, Robust view transformation model for gait recognition, in: Proceedings of the 18th International Conference on Image Processing, ICIP, IEEE, Brussels, Belgium, 2011, pp. 2073–2076.