Author’s Accepted Manuscript Re-identification by Neighborhood Structure Metric Learning Wei Li, Yang Wu, Jianqing Li
www.elsevier.com/locate/pr
PII: DOI: Reference:
S0031-3203(16)30202-3 http://dx.doi.org/10.1016/j.patcog.2016.08.001 PR5830
To appear in: Pattern Recognition Received date: 8 July 2015 Revised date: 20 May 2016 Accepted date: 1 August 2016 Cite this article as: Wei Li, Yang Wu and Jianqing Li, Re-identification by Neighborhood Structure Metric Learning, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.08.001 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Re-identification by Neighborhood Structure Metric Learning Wei Lia,∗, Yang Wub , Jianqing Lia a School
of Instrument Science and Engineering, Southeast University, 2 Sipailou, Nanjing 210096, China b Institute for Research Initiatives, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192,Japan
Abstract Re-identifying persons of interest among distributed cameras remains a challenge in current academia and industry. Because feature designing is inevitably subject to handcrafting subjectivity and real-scenario complexity, learning a discriminative metric has gained increasing attention to date. Although metric learning has achieved inspiring results, the research progress seems to slow down far before the performance is satisfactory. The difficulty mainly comes from the variability and sparsity of human image data, which impairs traditional metric learning models based on point-wise dissimilarity. In consideration of the neighborhood structure manifold which exploits the relative relationship between the concerned samples and their neighbors in the feature space, we propose a novel method, “Neighborhood Structure Metric Learning”, to learn discriminative dissimilarities on such manifold by adapting the codomain metrics of its charts. Experiments on widely-used benchmarks have demonstrated the advantage of this method in terms of effectiveness, robustness, efficiency, stability, and generalizability. Keywords: re-identification, metric learning, neighborhood structure manifold
∗ Corresponding
author Email address:
[email protected] (Wei Li)
Preprint submitted to Journal of LATEX Templates
August 4, 2016
1. Introduction Considering judging the re-appearance of a person of interest in deployed cameras, re-identification is valuable but challenging. The difficulties primarily come from various covariates including pose, illumination, and viewpoint, as 5
well as potential resemblance of clothing and body styles in the real scenario [1]. Basically, current methodology involves three paradigms: feature representation, feature transformation, and metric learning. Feature representation mainly focuses on suitable description of human appearance characteristics [2, 3, 4, 5, 6]. Recently, some methods also suggest
10
introducing other cues that complement the appearance to further enhance the discriminability of feature representation. Ziyan Wu et al. [7] describe human appearance as a function of pose using the training data, and then apply the pose prior in online re-identification to make it more robust to viewpoint variation, before integrating the learned person-specific features. Zheng Liu et al.
15
[8] combine the appearance descriptor with the gait descriptor to deal with the challenging appearance distortions, and then match the descriptors based on the learned metric before final score-level or feature-level fusion. An ideal feature space can compact human image samples in the same class close together and separate those from different classes far apart. Unfortunately,
20
due to the real-world challenge and handcrafting subjectiveness, mere handcrafted feature space has been found incapable for re-identifying the people. To solve it, two different directions, feature transformation and metric learning have attracted research interests. To ensure the inter-class discrimination and intra-class invariance of sam-
25
ples, generally, feature transformation tries to learn transformation functions for projecting human image features between cameras. For instance, ROCCA (RObust Canonical Correlation Analysis) [9] matches people from different views in a coherent subspace. Attributing to the shrinkage estimation and smoothing technique, this method can robustly estimate the data covariance matrices even
30
with limited training data size. However, by considering that the human image
2
feature transformation function may not be unique and even may vary from frame to frame due to real-scenario variations, FSFW (Function Space of Feature Warps) [10] assumes that feature transformations between cameras lie in a nonlinear function space of all possible feature transformations. This method 35
learns a discriminating surface to separate the feasible and infeasible sets of warp functions first, and then re-identifies persons by classifying a test warp function as either feasible or infeasible in terms of the Random Forest classifier. Feature representation and transformation are not the focus of this proposal, so we don’t plan to pay attention to details except for admitting that the widely-
40
used representative features can serve as a reliable platform for the further study. From the measure perspective, metric learning intends to use the optimized metric to improve the relative comparison between intra- and inter-class distances of human image features [11]. This paradigm is different from feature transformation in technique, although it can also transform the feature space in
45
a broad sense. We hereby sketch several related approaches which are designed or have potentiality for person re-identification. Generally speaking, metric learning models include the linear and non-linear ones. Linear metric learning learns the linear distance for scaling and rotating the coordinate system of feature space. In purpose of classification, LMNN (Large Margin Nearest
50
Neighbor) [12] maximizes the margin to distinguish the classes of human image features, and this method inspires the early works for metric learning based re-identification [13]. KISSME (Keep It Simple and Straightforward MEtric) [14] designs the equivalence constraints from a statistical inference perspective for both efficiency and effectiveness, and this method has been discussed in sev-
55
eral re-identification works [15, 16, 17]. MCML (Maximally Collapsing Metric Learning) [18] works for the target of intra-class distances to be zero and interclass distances to be infinity. This method can help de-noise the feature space for re-identification [19]. Re-identification can also be recast as the ranking problem. RankSVM [20] translates person re-identification into relative rank-
60
ing instead of absolute scoring. MLR (Metric Learning to Rank) [21] utilizes the structural SVM framework to optimize the ranking. With the loss function 3
measure Mean Reciprocal Rank (MRR), this method has been verified effective for the re-identification task [22]. RDC (Relative Distance Comparison) [23] maximizes the likelihood for the expected distance relationship between 65
the true and wrong match pairs of human image features. This method is quite robust to over-training due to the soft discriminant manner in formulation. Human image features are usually sparsely-distributed in the high dimensional space. From the angle of subspace finding (low-dimensionality, low-rank, sparsity), PCCA (Pairwise Constrained Component Analysis) [24] learns a low-
70
dimensional projection where the distance comparison constraints are sparsified from a large number of negative pairs consisting of different person identities. XQDA (Cross-view Quadratic Discriminant Analysis) [15] learns a discriminant low dimensional subspace, in which the metric is simultaneously optimized, by cross-view quadratic discriminant analysis on the human image feature rep-
75
resentation named LOMO (LOcal Maximal Occurrence). RMLLC (Relevance Metric Learning with Listwise Constraints) [17] fully uses the available similarity information from training data by means of forcing the learned metric to conserve the predefined listwise similarities between human image features in the low-dimensional projection subspace.
80
Though linear metric learning methods usually have achieved improved performance, they are inevitably ineffective when data are immersed in non-linear manifolds. In such situation, one typical solution is kernel-based metric learning, which learns the non-linear relationship between data in the reproducing kernel Hilbert space; the other is local metric learning [25], which localizes the
85
data domain and learns the metrics separately. For kernel-based metric learning methods, χ2 -LMNN [26] strictly preserves the histogram properties of input data on a probability simplex for the χ2 -distance. GBLMNN (Gradient Boosted LMNN) [26] trains the non-linear distance directly in the function space with gradient-boosted regression trees. χ2 -LMNN and GBLMNN are general metric
90
learning methods, but they more or less inherit the merits from LMNN to deal with the re-identification issue. DDML (Discriminative Deep Metric Learning) [27] learns a deep neural network of hierarchical nonlinear metric to project fea4
ture pairs into the same subspace where the distance comparisons of positive and negative pairs are constrained. HERML (Hybrid Euclidean-Riemannian Metric 95
Learning) [28] implicitly maps multiple complementary statistics of image sets into high-dimension Hilbert spaces by exploiting Euclidean and Riemanninan kernels, and then fuses these kernels in the hybrid metric learning framework. DDML and HERML have been validated to be effective for classifying realscenario face images, and also they have potentiality for the re-identification
100
task because the face and human body images encounter the similar variations to some extent. kLFDA (kernel Local Fisher Discriminant Analysis) [29] handles high dimensional human image features by optimizing the Fisher discriminant objective based on the kernel trick. For local metric learning models, LBDM (Locality Based Discriminative Measure) [16, 30] constructs the local metric
105
field to exploit the distance discriminability between each pair of human image sets from query and corpus domains to handle the undesired local variations. Although non-linear metric learning oftentimes has to face the expensive nonconvex optimization, this scheme has advantage in capturing the high-order correlations among samples.
110
Both linear and non-linear metric learning measures the original dissimilarity between samples from the geometric perspective, while ignores the relative information hidden in their neighborhood structures in the sense of topology. Original information only relies on the absolute position of the measured pair of samples in the feature space. This kind of information is inevitably sensitive
115
to the variability and sparsity problems long perplexing the re-identification issue. We consider that the relative information has the potential for solving this problem, and such potential can be fully exploited by the metric learning strategy, which inspires the idea of our work.
5
2. Problem Formulation and Method Summary 120
2.1. Problem Formulation In the real situation, there are a lot of people captured in each camera. This paper defines re-identification as a between-camera matching problem, which aims at building correspondences between subjects of unknown identities based on biometric cue, like full body appearance, captured from different camera
125
domains. Although the camera network is widely used in the real scene, we focus on the two-camera case in this paper, because this case is fundamental and important for the re-identification in general multiple-camera settings. Two-camera re-identification can provide the technical support and platform for multiple-
130
camera re-identification. It even can be anticipated that when the two-camera re-identification is conquered, the multiple-camera re-identification can be readily solved as well. The people in one camera usually have their corresponding matches in the other camera, which belongs to enclosure re-identification. In evaluation, we
135
need to find the correspondences between samples from query and corpus domains. In practice, such matching reformulation can help re-identify the incoming and outgoing pedestrians between entrance and exit of the library, shopping mall, airport, and so forth. It is also feasible to establish the correspondence in the camera network if biometric data from all the entrances and exits are
140
collected, respectively [31]. More concretely, the corpus data can be obtained from the camera or sensor set on the only path according to the requirement and restriction, and these data also can be updated according to the real condition and situation (for example, we may just maintain a record for the past a few hours/days only, and keep updating it). On the other hand, such matching
145
reformulation is also able to benefit building data association between tracklets of unknown subjects in non-overlapping camera views for cross-camera tracking in public places, especially when the camera locations are not so far. Matching formulation allows us to take advantage of the domain information: 6
the remaining data from both query and corpus domains besides the data pair 150
under measure. These unlabeled samples potentially contains the neighborhood information for the to-be-measured samples. We hope such information is useful against the challenges of data variability and sparsity in re-identification. 2.2. Method Summary Human image features inescapably suffer from the real-scenario complexity
155
and the curse of dimensionality, which result in the data variability and sparsity. Variability gives rises to large intra-class variations and small inter-class differences which deprive the point-wise dissimilarity of discrimination capability. Sparsity leads up to the inadequacy of in-class distribution information for each identity so that traditional metric learning methods are easy to be over-fitting.
160
Motivated by the idea that the neighborhood structure1 can provide useful information to tackle the variations and sparseness of the data, we propose a novel non-linear metric learning method, “Neighborhood Structure Metric Learning (NSML)”, to learn discriminative dissimilarities on a neighborhood structure manifold.
165
Admittedly, many issues confront the similar problems of sample variability and sparsity, like face recognition, gait recognition, and so forth. We choose person re-identification as the research focus, because it is typically representative of the above-mentioned problems. Even so, the proposed method has the ability to solve the general re-identification/recognition problem in diverse
170
applications. In sum, the main contributions of this paper are as follows: • We have proposed NSML to tackle the data variability and sparsity problem during re-identification, which learns discriminative dissimilarities on a novel neighborhood structure manifold. 1 Throughout
this paper, neighborhood structure means the topological relationship be-
tween the concerned elements and their neighbors.
7
175
• We have formulated NSML as a non-convex optimization problem, and designed the cutting-surface approach for it. • We have demonstrated the advantage of NSML in terms of effectiveness, robustness, efficiency, stability, and generalizability. 3. Our Method
180
3.1. Neighborhood Structure Manifold Point-wise dissimilarity measure is widely used for re-identification. However, such dissimilarity is weak for handling the large within-class variations and small between-class differences of human image data in the feature space. And moreover, the sparsely-distributed data also makes point-wise dissimilarity
185
based learning easy to be over-fitting. Point-wise dissimilarity ignores the topological relationship among feature samples. Actually, each sample point has a neighborhood structure and is meanwhile within the neighborhood structures of other sample points in the feature space. Neighborhood structure contains the useful information that one single
190
sample point doesn’t have. Such useful information originates from the resource of well-distributed samples with satisfactory intra-class compactness and interclass separation. We want to take advantage of such resource to cope with the variability and sparsity of human image data, and motivated by this, we design a new neighborhood-wise dissimilarity.
195
Fig. 1 illustrates one example. Suppose that a is a query sample, and b to f are corpus samples. {a, b}, {c, d}, and {e, f } are in different classes. {a, b} are badly-distributed, which means that they suffer from the severe intra-class variation and inter-class difference, so that inter-class distance is smaller than intraclass distance using the Euclidean metric. {c, d} and {e, f } are well-distributed,
200
which means that they have the satisfactory intra-class compactness and interclass separation, so that inter-class distance is larger than intra-class distance in the Euclidean space. We pay attention to b and c because they are nearer
8
to a than other samples based on Euclidean distance. Apparently, direct Euclidean distance measure fails to build the correct query-corpus correspondence 205
for b and c. However, in fact, dissimilarity for re-identification does not have to be Euclidean, and even it doesn’t have to strictly adhere to the properties of symmetry and triangle inequality [32].
Figure 1: Samples classes can be distinguished by various colors and shapes.
By compressing the neighborhood structure into ranking lists in terms of Euclidean distances, we are able to quantify the relative position of the paired samples in each other’s global neighborhood structure as a neighborhood-wise dissimilarity: DM (a, b) = Ra (b) + Rb (a),
(1)
where Ra (b) and Rb (a) are the rank-order distances that count the rank order of the sample in each other’s list. For a, the ranking list is “acbedf ”; for b, 210
the ranking list is “bacedf ”; for c, the ranking list is “cdeaf b”. The rank order of a in b-list is 1 and in c-list is 3; the rank order of b in a-list is 2; the rank order of c in a-list is 1. Hence, DM (a, b) = 1 + 2 = 3 ≤ DM (a, c) = 3 + 1 = 4, which outperforms the Euclidean distance. The discriminability of DM (a, b) comes from absorbing the effectiveness of the well-distributed samples in the
215
neighborhoods of a and b. In addition, the simultaneous use of Ra (b) and Rb (a) seems trivial, but this step plays an importance role in making sure the symmetry property of neighborhood-wise dissimilarity, such that DM (a, b) = 9
DM (b, a). In an elaborately selected or designed feature space, there should exist a 220
portion of well-distributed samples. By delivering the effectiveness of these well-distributed samples to the badly-distributed ones, neighborhood-wise dissimilarity has the advantage over point-wise dissimilarity which is vulnerable to the sample distribution deviation. In practice, we use those unlabeled available samples from both query and corpus domains to fill the space, as the possible
225
neighbors of the to-be-measured sample pair [31]. Neighborhood-wise dissimilarity is non-Euclidean but helpful for re-identification. Rethinking this dissimilarity from the manifold perspective, we can assume samples with such dissimilarity measure providentially lay on a new neighborhood structure manifold M. We hope to further exploit the discriminability of this
230
dissimilarity. For this purpose, we design a novel learning model. This model learns the discriminative neighborhood-wise dissimilarities on M by adapting the codomain metrics of its charts, as shown in Fig. 2. Conceptually, chart, denoted by {(U, ϕ)}, is a homeomorphism ϕ from an open subset U of M to an open subset V of Euclidean space E, where ϕ is a bi-continuous function
235
between U and V, which is defined as ϕ : U → V, and thus its inverse is given by ϕ−1 : V → U. 3.2. Metric Learning of Manifold Suppose that a and b stay in the chart codomain of M. It is difficult to explicitly describe the dissimilarity on M in a form of binary operation between ϕ−1 (a)
240
and ϕ−1 (b), so, we suggest reformulating the dissimilarity between ϕ−1 (a) and ϕ−1 (b) on M implicitly as DM (La, Lb), and studying the linear transition map L, irrespective of the concrete form of charts ϕ. Adapting the chart codomain by suitable L is expected to help re-orient the samples on M for better discrimination.
245
Since measuring Mahalanobis distance DW (a, b) = (a − b) W (a − b) using the metric W is equivalent to measuring distance in the Euclidean space linearly
10
L
Euclidean Space Euclidean Space
Chart Chart
Manifold
Figure 2: Illustration of the neighborhood structure manifold. Coordinate maps are between manifold and chart codomains; transition map is between chart codomains. Adapting the chart codomain by transition map L can help re-orient the samples on M for better discrimination.
transformed by L: DE (La, Lb) = (La − Lb) (La − Lb)
(2)
for which W = L L → L = W 1/2 , neighborhood-wise dissimilarity between a M and b with a learned metric W , denoted by DW (a, b), can be straightforwardly 250
expressed as: M DW (a, b) = DM (W 1/2 a, W 1/2 b),
(3)
whereupon, we design a new max-margin model to learn W on E in consideration of the anticipated relative comparisons between intra-class and inter-class dissimilarities on M.
255
Given query sample collection Q = q | q ∈ Rd and corpus sample collection X = xqi | xqi ∈ Rd , a joint feature map is adopted to represent the whole set of ranked data X , which is inspired by MLR [21]. Let yqranking ∈ Y be a ranking of X with respect to q where Y is the ranking set, and ψ M (q, yqranking , W, X ) ∈ Rd be a vector-valued joint feature map incorporating the learned metric, which is defined as the partial order feature: ψ M (q, yqranking , W, X ) =
ranking yij (
xqi ∈Xq+ xqj ∈Xq−
11
M M DW (q, xqi ) − DW (q, xqj ) ), + − |Xq ||Xq |
(4)
260
where Xq+ denotes the positive set of corpus samples in the same class as q; Xq− denotes the negative set of corpus samples in different classes from q; | • | means the set cardinality; |Xq+ ||Xq− | plays the role of normalization in consideration of that the double accumulation of Eq. 4 causes the addition of altogether the ranking number of |Xq+ ||Xq− | terms; yij is defined by ranking yij =
⎧ ⎨
xqi ≺yq xqj ,
1 ⎩ −1
(5)
xqi yq xqj .
265
The best W is expected to be the one that simultaneously makes yq∗ ← arg
max
ranking
yq
∈Y
ψ M (q, yqranking , W, X ),
(6)
where yq∗ is the ground truth ranking of X for q, and in most cases yqranking = yq∗ . One attractive property of ψ M (q, yqranking , W, X ) is that, for a fixed W , the ranking yqranking which maximizes ψ M (q, yqranking , W, X ) can be obtained by sort270
M ing X in the descending order of score DW (q, xqi ). Thus, W can be learned by
solving the following problem: arg min F (W, ξ) = tr(W ) + Cξ
(7)
W
s.t. 1 n
n
(ψ M (q, yq∗ , W, X ) − ψ M (q, yqranking , W, X )) ≥
q=1
1 n
n q=1
Δ(yq∗ , yqranking ) − ξ, W ≥ 0,
ξ ≥ 0.
ξ is the slack variable; C is the trade-off parameter; Δ(yq∗ , yqranking ) is the 275
loss function to penalize predicting yqranking instead of yq∗ , defined by Δ(yq∗ , yqranking ) = 1 − S(q, yqranking ),
(8)
where score S ∈ [0, 1], in which, S = 0 means the worst ranking and S = 1 means the perfect ranking. S can be instantiated by MRR, which has been verified effective for the re-identification task [22]. Conceptually, the reciprocal rank of a query response is the multiplicative inverse of the rank of the first
12
280
correct match, and MRR is the average of such reciprocal ranks of results over the whole query collection. Eq. 7 is the NSML framework. The constraints of Eq. 7 is non-convex with M is bounded when the size of ranking respect to W . However, the range of DW
list is certain. So the NSML model can converge to some local minima. 285
3.3. Optimization Because there is no closed-form solution for NSML, NSML will be solved by the iterative process. On account of the large number of constraints, obtaining an approximate solution is more practicable. To this end, the cuttingsurface method, extended from the cutting-plane algorithm, is recommended.
290
The cutting-surface strategy applies the non-linear surface approximation of the constraint set by the curved surfaces that are formed by the most violated constraints for all queries. The objective function in Eq. 7 is minimized over a subset of active cutting surfaces. Such subset is updated with the most violated constraints in each
295
iteration. The process repeats until the most-violated constraint for each query satisfies some specified loss threshold on the active set. The gradient of the objective F (W, ξ) can be expressed in terms of the constraint set that achieves the current largest margin violation: ∂F ∂W
=I−
=I− −
C n
C n
n
n
(
q=1
(
∗ ∂ψ M (q,yq ,W,X ) ∂W
q=1 x ∈X + x ∈X − q q qi qj
ranking yij (
−
∗ yij (
ranking ∂ψ M (q,yq ,W,X ) ) ∂W
M (q,x )−DM (q,x )) ∂(DW qi qj W ∂W |Xq+ ||Xq− |
M (q,x )−DM (q,x )) ∂(DW qi qj W ∂W |Xq+ ||Xq− |
))
) (9)
xqi ∈Xq+ xqj ∈Xq− ranking ∗ n (yij −yij ) ∂(D M (q,xqi )−D M (q,xqj )) W W . =I−C + n ∂W |Xq ||Xq− | + − q=1 x ∈X x ∈X q q qi qj
300
We provide the details of further derivation and analysis in Part I of Supplementary Material due to page limitation. By these derivation and analysis,
13
finally, we can change Eq. 9 into ∂F ∂W
=I−
=I−
H n
C n
n
n
∗ ∂ψ M (q,yq ,W,X ) ∂W
(
q=1
q=1 x
+ qi ∈Xq
xqj ∈Xq−
−
ranking ∂ψ M (q,yq ,W,X ) ) ∂W
(yq∗ − yqranking )
φqi −φqj
|Xq+ ||Xq− |
(10)
,
where H = (n − 1)C. Because H only influences the step size in Eq. 10, for conciseness and elegance, we temporarily suggest setting = 1/(n − 1) to make 305
H = C. Compared with Eq. 9, Eq. 10 is tractable and simple. Eq. 10 holds M the meaning of the tendency to improve the discriminability of DW on manifold
M. On the other hand, learning W in the chart codomain E of manifold M can be formulated by arg min F1 (W, ξ) = tr(W ) + C1 ξ
(11)
W
310
s.t. 1 n
n
(ψ E (q, yq∗ , W, X ) − ψ E (q, yqranking , W, X )) ≥
q=1
1 n
n q=1
Δ(yq∗ , yqranking ) − ξ, W ≥ 0,
where ψ E (q, yqranking , W, X ) = E (a, b) = which DW
ranking yij (
xqi ∈Xq+ xqj ∈Xq− E 1/2 1/2 D (W a, W b), and C1 is
ξ ≥ 0,
E E DW (q,xqi )−DW (q,xqj ) ) |Xq+ ||Xq− |
in
the trade-off parameter.
The gradient of F1 (W, ξ), which holds the meaning of the tendency to imE prove the distance discriminability of DW , is given by ∂F1 ∂W
=I−
C1 n
n q=1 x
+ qi ∈Xq
− xqj ∈Xq
φ
−φ
(yq∗ − yqranking ) |Xqi+ ||Xqj − . | q
(12)
q
315
By comparing Eq. 10 and Eq. 12, we find that the two gradients
∂F ∂W
=
∂F1 ∂W
when H = C1 . This outcome shows that, for the variable and sparse data in person re-identification, improving distance discriminability in E can help improving dissimilarity discriminability on M.
14
320
3.4. Learning for Re-identification Training NSML uses backtracking line search based on Armijo-Goldstein condition [33]. All procedures are displayed in Part II of Supplementary Material because of page limitation. For train-test consistency, in the testing stage, we also map the testing data
325
to a neighborhood structure manifold discriminative for re-identification by our previous work Common-Near-Neighbor Modeling (CNNM) dissimilarity [34], which is composed of the symmetric and asymmetric terms: DCNNM (a, b) = Rsym (a, b) + 2λlRasym (a, b),
(13)
Rsym (a, b) = Ralocal (b) + Rblocal (a),
(14)
where
Ralocal (b) =
l−1
Ob (ha (i));
(15)
Rasym (a, b) = min(Oa (b), Ob (a)).
(16)
i=0
330
l is the local neighborhood size parameter and λ is the trade-off parameter. Briefly, Rsym counts the sample rank orders within the local area for a and b from each other’s list, which has been illustrated in Fig. 3. And Rasym counts the minimum of the rank orders for a and b in each other’s list. local neighborhood size
a
a
c ha (0)
b
d ha (1)
b
d
0
1
e
b
ha (l-1)
…
a
c
e
l-1
l
l+1
…
Figure 3: Illustration for “symmetric dissimilarity”. Rlocal (b) is calculated from the 0th to a (a) is calculated in a similar way. the (l − 1)th nearest neighbor in a-list, and Rlocal b
Dissimilarity DCNNM uses local averaging to enhance the measure robustness 335
against the class outliers. The local neighborhood size l is specified as around half the class size to maximize the capture of within-class sample distribution 15
information for each sample. If l is too large, intruders from different classes may be involved; if, on the other hand, l is too small, useful local neighborhood information will be lost. DCNNM also uses the asymmetric term to tackle the 340
asymmetric ranking problem that a pair of samples usually do not yield the same rank order for each other in their own rank order lists. This term has been confirmed especially helpful cooperating with the symmetric dissimilarity by λ when the class size is small [31]. In both training and testing, we take advantage of the neighborhood struc-
345
ture manifold. There is close relationship between DCNNM in Eq. 13 and DM in Eq. 1 actually. Both of them computes the neighborhood-wise dissimilarity. DCNNM can be regarded as generalizing DM with the additional local averaging operation and the auxiliary asymmetric term. On the other hand, DM can also be treated as instantiating the parametric space of DCNNM in a special way
350
that shrinks the local neighborhood of the symmetric term and discards the asymmetric term. In contrast to Eq. 1 concerning the pair of samples themselves on M, averaging rank-order dissimilarities in the local area by Eq. 13 has some effect of shifting the samples to their neighborhood centers on M before dissimilarity
355
measure. The local averaging contributes to the robustness of neighborhood structure dissimilarity against the sample deviation of each class, which is, honestly, desired in the testing stage. However, in the training stage this process inevitably weakens the learner to cope with the potential intruders or extruders of each class by itself, so it is better not to do local averaging in training.
360
Actually, setting l = 1 is equivalent to tightening the constraints of relative comparison among neighborhood structure dissimilarities. If the relative dissimilarity comparison for every possible triplet of samples (two are in the same class and one is from a different class) has been achieved, the local averaging will actually become superfluous. Last but not least, such simplification can
365
also boost the efficiency due to avoiding the extra computations in the local areas. During training, the neighborhood structure of each sample will be re16
organized aimed at enhancing the dissimilarity discriminability on M. The asymmetric ranking problem will also be tackled along with the neighborhood 370
structure re-organization. Hence, the asymmetric term becomes redundant and can be removed while training NSML. And so does it to the testing stage attributing to the learned space, which is different from the unsupervised case for CNNM. This simplification saves the effort to search for a suitable λ. 4. Experiments
375
4.1. Dataset Description We evaluate NSML for re-identification on widely-used benchmark datasets: VIPeR [4], ETHZ1 [35], i-LIDS [36], Caviar4REID [37], EPFL [38], PRID450S\2011 [39, 40], RAiD [41], and WARD [42]. All of them undergo real-scenario complexities in spite of the distinct challenges of each their own. The description and
380
illustration of these datasets are detailed in Part III of Supplementary Material due to page limitation. 4.2. Experimental Setup Because datasets have different characteristics, we suggest using different signatures for different datasets. We choose two well-known signatures to describe
385
human images. One is combined by Densely Sampled Color Histograms and Weighted HSV color histograms, denoted by “DSCH+WHSV” [2, 13, 22, 34]; the other is concatenated by DSCH, Schmid-Filter-Bank, and Gabor-Transform, denoted by “DSCH+SFB+GT” [19, 23, 24, 31]. Both signatures have reliable performance for metric learning based approaches. And we have found that
390
DSCH+WHSV is more suitable for VIPeR, ETHZ1, i-LIDS, and RAiD, while DSCH+SFB+GT is more suitable for Caviar4REID, EPFL, and PRID2011\450S, and WARD. Feature representation is not the focus of this paper, but we admit introducing a more effective representation to our model could likely result in an improved performance. We perform ten-fold cross validation with randomly
395
halving the dataset for train-test split every time. Each time we use the same
17
train-test split for method comparison. In evaluation, for each identity, we use one image as the corpus sample; in the query domain, we use all the images, but match each of them to the corpus domain one by one. Unlike other re-identification datasets, RAiD and WARD are specifically 400
built for re-identification across multiple camera views, although multiple camera views can be split into a group of camera view pairs. This paper focuses on re-identifying people between two non-overlapping cameras. So in principle, the datasets for two-camera re-identification are the best choices. However, we would like to take this opportunity to challenge our proposed method on
405
RAiD and WARD by recasting the multiple-camera re-identification into the two-domain matching problem. In both training and testing, we treat one camera view as the corpus domain (camera 1 for RAiD; camera 3 for WARD), while keep the other camera views together as the query domain. 4.3. Parameters and Robustness
410
Compared with original CNNM, during training, NSML reduces the neighborhood scope by setting the local neighborhood size l = 1 in the symmetric term, and removes the asymmetric term by setting the trade-off parameter λ = 0. These simplifications just embody the robustness of NSML and contribute to widening the application range of the model.
415
Since DM is a special instantiation of the parametric space of DCNNM , we here generalize NSML by involving DCNNM instead of DM in training, to well validate the robustness of our simplifications of parameters. We temporarily initialize C = 10 and = 1/(n − 1) by heuristic for conveniently discussing l and λ. Denote the recommended l and λ from CNNA
420
[31] by lori and λori . For VIPeR, lori = 1 and λori = 1; for ETHZ1, lori = 30 and λori = 0; for i-LIDS, lori = 2 and λori = 0.1; for Caviar4REID and EPFL, as they have the class sizes larger than i-LIDS but smaller than ETHZ1, we set lori = 5 and λori = 0.01 compromisingly; for PRID450S, we compare it to VIPeR and set lori = 1 and λori = 1; for RAiD, lori = 55 and λori = 0; for WARD, lori = 25
425
and λori = 0. 18
Table 1: Contrastive analysis on NSML for λtr and λts when setting ltr = lts = lori . (Results are given in the format of “λtr = λts = 0 \ λtr = λts = λori ”.) Dataset Rank-1 Rank-3 Rank-5 VIPeR 25.92 \ 25.28 43.10 \ 42.63 52.72 \ 52.75 ETHZ1 77.10 \ 77.10 89.94 \ 89.94 93.11 \ 93.11 i-LIDS 39.00 \ 39.00 57.33 \ 56.97 64.79 \ 65.02 Caviar4REID 36.54 \ 36.60 51.85 \ 51.58 60.82 \ 61.07 EPFL 56.62 \ 56.76 84.24 \ 84.20 92.02 \ 92.59 PRID450S 25.87 \ 25.20 43.02 \ 43.11 51.29 \ 51.56 RAiD 15.47 \ 15.47 32.39 \ 32.39 45.28 \ 45.28 WARD 64.69 \ 64.69 83.47 \ 83.47 83.47 \ 83.47
Rank-7 59.84 \ 59.37 94.57 \ 94.57 71.13 \ 70.64 67.51 \ 67.82 97.11 \ 96.64 56.44 \ 57.20 54.89 \ 54.89 93.11 \ 93.11
To study the effect of eliminating λ, we compare NSML between λtr = λts = 0 and λtr = λts = λori with setting ltr = lts = lori , where subscripts “tr” and “ts” of l or λ indicate the stages of training and testing. The results are tabulated in Table 1, from which, we can see the dispensable role of the asymmetric term 430
for NSML. To investigate the effect of decreasing l in training, we compare NSML between ltr = 1 and ltr = lori with setting lts = lori and λtr = λts = 0. We expect the local averaging can provide dissimilarity measure with additional robustness in testing after learning. But during learning, we don’t want to encapsulate lo-
435
calities into dissimilarity measure. The results are exhibited in Table 2, from which, we can see the trivial role of local averaging in the symmetric term on datasets VIPeR, i-LIDS, and PRID450S having few images per class. Whereas, on datasets ETHZ1, Caviar4REID, and EPFL having multiple images in each class, ltr = 1 has small but discernible advantage over ltr = lori , and the advan-
440
tage is much more obvious on datasets RAiD and WARD. More rigorously, we compare NSML between λtr = λts = 0 and λtr = λts = λori with setting ltr = 1 and lts = lori . The results are described in Table 3, which still show the minor function of the trade-off parameter. According to Table 1 and 3, we can directly compare NSML between ltr = 1
445
and ltr = lori with setting lts = lori and λtr = λts = λori in Table 4. Consistent with Table 2, the results still vary little on datasets VIPeR, i-LIDS, PRID450S, and ltr = 1 has some slight advantage over ltr = lori for datasets ETHZ1, Caviar4REID, 19
Table 2: Contrastive analysis on NSML for ltr when setting lts = lori and λtr = λts = 0. (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “ltr Rank-1 25.92 \ 25.92 78.13 \ 77.10 38.54 \ 39.00 37.08 \ 36.54 58.39 \ 56.62 25.87 \ 25.87 23.39 \ 15.47 68.66 \ 64.69
= 1 \ ltr = lori ”.) Rank-3 Rank-5 43.10 \ 43.10 52.72 \ 52.72 90.36 \ 89.94 93.49 \ 93.11 57.38 \ 57.33 65.07 \ 64.79 52.92 \ 51.85 61.96 \ 60.82 85.68 \ 84.24 93.62 \ 92.02 43.02 \ 43.02 51.29 \ 51.29 50.80 \ 32.39 66.37 \ 45.28 86.24 \ 83.47 92.07 \ 90.10
Rank-7 59.84 \ 59.84 94.73 \ 94.57 71.09 \ 71.13 68.60 \ 67.51 97.63 \ 97.11 56.44 \ 56.44 77.67 \ 54.89 94.38 \ 93.11
Table 3: Contrastive analysis on NSML for λtr and λts when setting ltr = 1 and lts = lori . (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “λtr Rank-1 25.92 \ 25.28 78.13 \ 78.13 38.54 \ 38.71 37.08 \ 37.10 58.39 \ 58.45 25.87 \ 25.20 23.39 \ 23.39 68.66 \ 68.66
= λts = 0 \ λtr Rank-3 43.10 \ 42.63 90.36 \ 90.36 57.38 \ 57.77 52.92 \ 52.94 85.68 \ 85.74 43.02 \ 43.11 50.80 \ 50.80 86.24 \ 86.24
= λts = λori ”.) Rank-5 52.72 \ 52.75 93.49 \ 93.49 65.07 \ 64.96 61.96 \ 61.75 93.62 \ 93.70 51.29 \ 51.56 66.37 \ 66.37 92.07 \ 92.07
Rank-7 59.84 \ 59.37 94.73 \ 94.73 71.09 \ 70.76 68.60 \ 68.46 97.63 \ 97.69 56.44 \ 57.20 77.67 \ 77.67 94.38 \ 94.38
and EPFL, and the advantage is much larger on datasets RAiD and WARD. The positive effect of local averaging in the testing stage has been confirmed 450
again by Table 5 that compares NSML between lts = lori and lts = 1 with setting ltr = 1 and λtr = λts = 1, which agrees with the conclusion of our previous work CNNM well [31]. We also spare no effort on discussing whether λts is necessary or not in testing by comparing NSML between λts = 0 and λts = λori when eliminating
455
the asymmetric term in training with setting ltr = 1 and lts = lori and λtr = 0. The results are given in Table 6, which once more supports the claim of triviality of λts . 4.4. Stability Stability indicates that the model can work reliably with normal convergence
460
and satisfactory performance with the optimization parameters in the regular 20
Table 4: Contrastive analysis on NSML for ltr when setting lts = lori and λtr = λts = λori . (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “ltr Rank-1 25.92 \ 25.28 78.13 \ 77.10 38.71 \ 39.00 37.10 \ 36.60 58.45 \ 56.76 25.20 \ 25.20 23.39 \ 15.47 68.66 \ 64.69
= 1 \ ltr = lori ”.) Rank-3 Rank-5 43.10 \ 42.63 52.72 \ 52.75 90.36 \ 89.94 93.49 \ 93.11 57.77 \ 56.97 64.96 \ 65.02 52.94 \ 51.58 61.75 \ 61.07 85.74 \ 84.20 93.70 \ 92.59 43.11 \ 43.11 51.56 \ 51.56 50.80 \ 32.39 66.37 \ 45.28 86.24 \ 83.47 92.07 \ 90.10
Rank-7 59.84 \ 59.37 94.73 \ 94.57 70.76 \ 70.64 68.46 \ 67.82 97.69 \ 96.64 57.20 \ 57.20 77.67 \ 54.89 94.38 \ 93.11
Table 5: Contrastive analysis on NSML for lts when setting ltr = 1 and λtr = λts = 0. (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
the format of “lts Rank-1 25.92 \ 25.92 78.13 \ 74.61 38.54 \ 39.10 37.08 \ 35.90 58.39 \ 48.49 25.87 \ 25.87 23.39 \ 19.85 68.66 \ 59.46
= lori \ lts = 1”.) Rank-3 Rank-5 43.10 \ 43.10 52.72 \ 52.72 90.36 \ 88.25 93.49 \ 91.68 57.38 \ 56.91 65.07 \ 64.16 52.92 \ 51.91 61.96 \ 59.87 85.68 \ 75.79 93.62 \ 87.87 43.02 \ 43.02 51.29 \ 51.29 50.80 \ 44.90 66.37 \ 60.45 86.24 \ 79.14 92.07 \ 87.03
Rank-7 59.84 \ 59.84 94.73 \ 93.44 71.09 \ 69.45 68.60 \ 66.46 97.63 \ 94.11 56.44 \ 56.44 77.67 \ 71.45 94.38 \ 90.78
range. We evaluate the stability of NSML by uncovering the relationship between the convergence situation and H in Eq. 10. The backtracking line search scheme is adopted based on the Armijo-Goldstein condition, where the convergence threshold is set as 10−5 , Armijo rule number is set as 10−5 , the learning 465
rate is initialized as 10−8 (n − 1)C, and the increase and decrease parameters for controlling the learning rate after each iteration is ((1 + ((1 +
√ 2
√ 2
5)/2)1/3 and
5)/2)−1 , respectively. We also set the maximum number of backtracking
100. If the backtracking time exceeds this number, the program will be forced to stop even before convergence, and warns the ill-initialization of the learning 470
rate. Retrospecting H = (n − 1)C, we will discuss two cases: evaluating C by fixing ; evaluating by fixing C. Firstly, we set = 1/(n − 1) for experimental succinctness. In this case, H = C. The iteration numbers and recognition rates are averaged according to the cross-validation principle on each dataset. The results are recorded in Table 21
Table 6: Contrastive analysis on NSML for λts when setting ltr = 1 and lts = lori and λtr = 0. (Results are given in Dataset VIPeR ETHZ1 i-LIDS Caviar4REID EPFL PRID450S RAiD WARD
475
the format of “λts Rank-1 25.92 \ 25.38 78.13 \ 78.13 38.54 \ 38.48 37.08 \ 37.08 58.39 \ 58.19 25.87 \ 24.98 23.39 \ 23.39 68.66 \ 68.66
= 0 \ λts = λori ”.) Rank-3 Rank-5 43.10 \ 42.41 52.72 \ 52.37 90.36 \ 90.36 93.49 \ 93.49 57.38 \ 57.54 65.07 \ 64.91 52.92 \ 52.94 61.96 \ 61.96 85.68 \ 85.68 93.62 \ 93.54 43.02 \ 43.42 51.29 \ 51.47 50.80 \ 50.80 66.37 \ 66.37 86.24 \ 86.24 92.07 \ 92.07
Rank-7 59.84 \ 58.89 94.73 \ 94.73 71.09 \ 70.86 68.60 \ 68.56 97.63 \ 97.69 56.44 \ 51.47 77.67 \ 77.67 94.38 \ 94.38
7, where H ∈ {0.01, 0.1, 1, 10, 100, 1000, 10000}. The contents from the first row to the seventh row correspond with the experimental results under different Hs in an increasing order for each dataset. Table 7: The average iteration number (aver iter num) till convergence and the recognition rate on rank-1,5 for different Hs in NSML. (Results are given in the format of “aver iter num \ rank-1 \ rank-5”.) VIPeR 23.2 \ 23.77 \ 52.03 26.6 \ 23.42 \ 51.55 16.3 \ 24.78 \ 52.03 13.2 \ 25.92 \ 52.72 14.9 \ 25.92 \ 52.75 12.0 \ 25.28 \ 51.96 14.6 \ 23.42 \ 51.11 Caviar4REID 29.7 \ 28.12 \ 55.15 14.3 \ 37.03 \ 61.77 14.1 \ 37.20 \ 61.69 14.4 \ 37.08 \ 61.96 14.4 \ 37.01 \ 61.81 9.8 \ 34.78 \ 60.61 8.8 \ 29.79 \ 55.62
ETHZ1 11.8 \ 74.22 \ 91.13 19.8 \ 75.98 \ 92.08 5.0 \ 78.17 \ 93.51 5.2 \ 78.13 \ 93.49 5.4 \ 78.16 \ 93.50 4.1 \ 78.12 \ 93.49 2.0 \ 78.34 \ 93.47 EPFL 12.6 \ 47.42 \ 86.12 9.2 \ 58.34 \ 93.85 10.1 \ 58.33 \ 94.04 9.2 \ 58.39 \ 93.62 8.3 \ 58.57 \ 93.69 5.3 \ 58.07 \ 93.08 3.1 \ 54.24 \ 90.59
i-LIDS 20.3 \ 35.42 \ 64.64 10.3 \ 36.50 \ 62.85 8.0 \ 38.54 \ 65.01 8.2 \ 38.54 \ 65.07 6.9 \ 38.43 \ 64.84 4.8 \ 38.38 \ 64.85 2.9 \ 36.60 \ 63.05 PRID450S 16.5 \ 20.67 \ 46.44 9.7 \ 25.73 \ 51.51 9.3 \ 25.73 \ 51.47 9.4 \ 25.87 \ 51.29 8.1 \ 25.78 \ 51.42 4.9 \ 23.64 \ 48.67 4.9 \ 20.09 \ 42.80
RAiD 8.69 \ 34.91 24.11 \ 69.40 27.08 \ 72.29 23.39 \ 66.37 21.98 \ 62.38 21.70 \ 61.26 20.86 \ 62.43 WARD 13.6 \ 52.63 \ 85.01 6.0 \ 68.73 \ 92.10 6.5 \ 68.71 \ 92.07 6.4 \ 68.66 \ 92.07 5.6 \ 68.81 \ 92.12 4.5 \ 68.36 \ 92.02 2.1 \ 60.84 \ 88.20
9.3 37.3 97.0 36.7 31.8 31.9 32.6
\ \ \ \ \ \ \
It is nearly impossible to draw a conclusion of the constant uniform H setting to cope with different challenges in different datasets. Even so, we endeavor 480
to find an approximate common range of H for the relatively stable satisfactory performance of NSML taking into consideration of the overall results. From the results in Table 7, we can see that not only does the model system always 22
converge for different Hs, but also the iteration number decreases when H increases on the whole, except for a slight fluctuation that is ineluctably caused 485
by the data complexity. The iteration number and the recognition rate for H ∈ {0.01, 0.1} are apparently lower than H ∈ {1, 10, 100} on the whole. When H ∈ {1000, 10000}, the recognition rate trend is downward. The results show that the model has stable satisfactory performance for H ∈ [1, 100] considering the diversity of these datasets for a variety of real-world challenges.
490
Then, we testify with setting C = 10. The experimental details and results are given in Part IV of Supplementary Material because of page limitation. Considering the results comprehensively, we recommend using the value around = 1/(n − 1) to avoid under-training whilst maximizing the discriminability of W . Thus, H and C share the same recommended scope of assignment.
495
From the above, we can see that although NSML seems to contain many parameters, C is the only parameter required to be discussed during training. Actually, for C, we have found out the most suitable range in consideration of a variety of real-world challenges based on the experimentation and analysis across diverse datasets. Moreover, parameters ltr and λtr are just used to help
500
analyze the intrinsic mechanism of the proposed algorithm, and they have the fixed recommended values already independent of training data. The setting of lts and λts in testing can directly follow the conclusion of the work CNNM [31], which is also easy to handle, and actually, the small fluctuation of their values will not cause the big change of CNNM performance. Therefore, we don’t need
505
to concern too much about parameter setting during implementation. 4.5. Complexity and Efficiency Model efficiency demonstration is carried out from both theoretical and practical perspectives. From the theoretical perspective, we analyze the computational complexity for training NSML; from the practical perspective, we test the
510
real time expense under ten-fold cross-validation. The details can refer to Part V of Supplementary Material due to page limitation.
23
4.6. Effectiveness Related methods for comparison include CNNA, MLR, RDC, LMNN, XQDA, SDALF, MFA (Marginal Fisher Analysis) [43, 44], kLFDA (kernel Local Fisher 515
Discriminant Analysis) [45, 44], and rPCCA (regularized Pairwise Constrained Component Analysis) [24, 44]. The same as NSML, methods CNNA, MLR, RDC, LMNN, and SDALF use DSCH+WHSV and DSCH+SFB+GT as the original features. It has been testified that there is no big performance difference between original SDALF and DSCH+WHSV or DSCH+SFB+GT, while
520
the vector-form features provide more convenience for metric learning strategies [22]. Hence, we use SDALF to indicate measuring original features in Euclidean space, which serves as the baseline for learning approaches in comparison. MFA, kLFDA, and rPCCA adopt the linear kernel based on their own 75-region features, because we find they perform more robustly with the linear kernel and
525
their recommended features based on their published code. NSML in comparisons chooses the parameters {ltr = 1, λtr = 0} and {lts = lori , λts = 0} according to the discussion in Section 4.3. The results are illustrated by CMC (Cumulative Match Characteristic) curves in Fig. 4. It can be seen, on the whole, NSML has encouraging performance. De-
530
spite the unstable behavior of other competitors across datasets, NSML readily leads the performance. In greater details, we can see the stair-like performance enhancement of MLR, CNNA, and NSML. MLR is a suitable and capable metric learning approach for re-identification. As the closest analogue of NSML, CNNA has been verified more effective than MLR due to exploiting the neigh-
535
borhood structure information by CNNM in the testing stage. Further, NSML performs metric learning in the neighborhood structure manifold instead of the original feature space as what has been done by CNNA, so that the neighborhood structure dissimilarity based discrimination is directly optimized, which makes the best out of CNNM in testing.
540
NSML performs remarkably better than CNNA on the datasets of serious variability and sparsity, like VIPeR and PRID450S. In Caviar4REID, i-LIDS, and EPFL with severe variability but less sparsity, NSML also performs well 24
VIPeR
i-LIDS
ETHZ1
60
75
100
70
55
65
45 40 35 30
NSML CNNA
25
MLR RDC LMNN
20 15 10
90
85
80
NSML CNNA
75
MLR RDC LMNN
70
XQDA SDALF 1
2
3
4
5
6
65
7
1
2
3
4
5
6
50 45 40 35
25
5
7
55 50 45 40 35
NSML
30
CNNA MLR
25
RDC LMNN XQDA SDALF
15
5
6
85 80 75 70 65
NSML
60
CNNA MLR
55
RDC LMNN XQDA SDALF
50 45 40
7
Recognition Percentage
90
Recognition Percentage
60
1
2
3
Rank Score
4
VIPeR
5
6
7
5
6
45 40 35 30
NSML 25
CNNA MLR
20
RDC LMNN XQDA SDALF
15 10
7
1
2
3
4
5
6
7
Rank Score
ETHZ1
i-LIDS
95
75 70
55 90
45 40 35 30 25 20
NSML
15
MFA kLFDA rPCCA
10
2
3
4
5
6
85
80
75
70
65
NSML MFA kLFDA rPCCA
60
55
7
1
2
3
Rank Score
Caviar4REID
35 30 25 20
NSML
15
MFA kLFDA rPCCA
5 0
7
1
2
3
Recognition Percentage
50 45 40 35 30 25 20 15
NSML
10
MFA kLFDA rPCCA
5 5
6
75 70 65 60 55 50 45 40 35 30 25 20
NSML MFA kLFDA rPCCA
15 10 5 7
0
1
2
3
4
5
6
7
PRID450S 55
55
4
Rank Score
60
85
Rank Score
40
90 80
4
45
100 95
60
3
6
50
EPFL
65
2
5
55
Rank Score
70
1
4
60
10
Recognition Percentage
1
65
Recognition Percentage
Recognition Percentage
50
0
4
50
Rank Score
60
5
3
PRID450S 55
4
2
EPFL 60
3
1
Rank Score
95
2
XQDA SDALF
10
100
1
MLR RDC LMNN
20
65
10
NSML CNNA
30
70
20
Recognition Percentage
55
Rank Score
Caviar4REID
Recognition Percentage
60
15
XQDA SDALF
Rank Score
Recognition Percentage
Recognition Percentage
50
Recognition Percentage
Recognition Percentage
95
5
6
50 45 40 35 30 25
NSML
20
MFA kLFDA rPCCA
15
7
10
1
2
Rank Score
3
4
5
6
7
Rank Score
Figure 4: Result comparisons across benchmarks.
whereas the gap between it and its runner-up CNNA becomes narrow. As for ETHZ1, which has the slightest variability and sparsity due to the relatively 545
adequate and well-distributed class samples in the feature space, NSML falls behind CNNA while still outstrips other rivals. As for other methods, LMNN has moderate performance. LMNN learns the metric beneficial for classification instead of ranking. Ranking is more robust to the real-world re-identification complexity due to jointly concerning the behav-
550
iors of several top candidates, which can make up the shortage of classification that overemphasizes the accuracy at rank-1. However, classification doesn’t 25
totally run counter to ranking because ranking can be regarded as a kind of relaxed multi-class classification in essence. RDC has moderate performance as well when the samples are not sufficient 555
for each class. When the class size increases, its performance falls. RDC optimizes the relative distance comparisons based on logistic function constraints. This soft discriminant manner has intrinsic robustness to over-fitting. However, usually, model robustness and representability go against each other. When the class size increases, the over-fitting problem will be alleviated with it. It be-
560
comes comparatively outstanding that the soft discriminant manner will limit the model representability of the class information. XQDA also has a moderate performance when using the same baseline features as other methods. It implies that this method is sensitive to the original feature space.
565
Compared with kernel-based metric learning approaches rPCCA, MFA, and kLFDA, NSML also has advantage. rPCCA suggests relaxing the pairwise constraints by choosing a small portion, around one-tenth, from a vast number of constraints. However, rPCCA itself is more or less sensitive to the constraints selected for the highly variable data. MFA and kLFDA perform normally on
570
VIPeR, ETHZ1, and PRID450S, while lose effect on i-LIDS, Caviar4REID, and EPFL. Kernel-based metric learning relies on the kernel-based dissimilarity and reference points, which may not fit each application-specific dataset. The results reveal that MFA and kLFDA models themselves are not suitable for i-LIDS, Caviar4REID, and EPFL. We have also found that MFA and kLFDA produce
575
large training error on them even. When the kernel-based metric learning models are suitable, the performance are mainly influenced by the data complexity. On less complex ETHZ1 they are approaching NSML, while on more complex VIPeR and PRID450S they follow behind. We also provide the comparison between NSML and some reported tech-
580
niques on the popular VIPeR dataset in Table 8. These techniques are PCCA (Pairwise Constrained Component Analysis) [24], SSCDL (Semi-Supervised Coupled Dictionary Learning) [46], ColorInv (Color Invariants) [47], LFDA (Local 26
Fisher Discriminant Analysis) [45], eLDFV (ensemble Local Descriptor Fisher Vector) [48], KISSME (Keep It Simple and Straightforward MEtric) [14], CPS 585
(Custom Pictorial Structure) [49], RankSVM, ITML (Information-Theoretic Metric Learning) [50], MCML (Maximally Collapsing Metric Learning) [18], and XQDA [15]. Here, we directly quote or generate the results from their publications or codes. NSML uses the same train-test splits as SDALF. Although the results recorded for other approaches are probably not exactly the same, the comparison is relatively fair in the statistical averaging sense. Table 8: VIPeR dataset: Top ranked matching rates (%) with 316 persons. Method Rank-1 Rank-5 Rank-10 Rank-20 NSML(LOMO+XQDA) 42.59 72.88 84.49 93.51 XQDA(LOMO) 40.00 68.13 80.51 91.08 NSML(HSV+DSCH) 25.92 52.72 67.15 81.20 SSCDL 25.60 53.70 68.10 83.60 ColorInv 24.24 44.91 56.55 69.40 PCCA-χ2RBF 19.27 48.89 64.91 80.28 LFDA 24.18 51.20 67.12 82.00 eLDFV 22.34 47.00 60.00 71.00 KISSME 17.47 46.42 61.23 75.82 CPS 21.00 45.00 57.00 71.00 RankSVM 13.00 37.00 51.00 68.00 ITML 11.61 31.39 45.76 63.86 MCML 15.19 41.77 57.59 73.39
590
Obviously, NSML is superior to the techniques in Table 8. However, the pure performance race is innocuous due to several realistic factors. We believe that a deeper study on how the proposed model differentiates itself from closely related works under shared conditions may be more informative and meaningful than a 595
pure competition on the final overall performance of all kinds of methods. The real advantage of NSML can be seen from the ability to enhance the performance upon the original feature. Because NSML is based on the extracted features so its performance is unavoidably influenced and limited by them. We further test NSML in LOMO feature space transformed by XQDA which is denoted
600
by “LOMO+XQDA”. The results show that NSML actually has the ability to stand ahead of the advanced performance with the state-of-the-art feature
27
space. Furthermore, the experiments on multiple-camera datasets RAiD and WARD are detailed in Part VI of Supplementary Material because of page limitation. 605
At last, re-identifying the same or similar identities between the query and corpus domains belongs to enclosure re-identification. To further demonstrate the ability of NSML, we do re-identification with much more identities in the corpus domain than the query domain by PRID2011 with treating view A as query and view B as corpus in Part VII of Supplementary Material due to page
610
limitation. 4.7. Generalizability In the real-scenario surveillance, we may also obtain other valuable cues complementary to the full body appearance, like the face cue, gait cue, and so forth, for unobtrusive re-identification/recognition. We provide extra experiments to
615
validate the generalizability of NSML in two further application scenarios: face recognition and gait recognition. For face recognition, we compare NSML with LMNN, EigenFace, PCA, and the L2 norm in the original image space on the AR face database [51]. For gait recognition, we compare NSML with RVTM (Robust View Transformation Model) and TSVD (Truncated Singular Value Decomposition) in the GEI (Gait Energy Image) feature space on the CASIA gait dataset B [52, 53]. More details are provided in Part VIII of the Supplementary Material because of the page limitation.
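As background for the gait setting, the GEI is simply the pixel-wise average of size-normalized, centered binary silhouettes over a gait cycle. A minimal sketch of this standard construction (our illustration, not the authors' code):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Gait Energy Image: pixel-wise mean of aligned binary silhouettes.

    silhouettes: sequence of HxW binary (0/1) masks, assumed already
    size-normalized and horizontally centered over one gait cycle.
    Returns an HxW array in [0, 1]; brighter pixels mark static body parts.
    """
    stack = np.stack([np.asarray(s, dtype=np.float64) for s in silhouettes])
    return stack.mean(axis=0)
```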
5. Conclusion and Discussion

This paper has formulated person re-identification as a metric learning problem. In particular, to address the challenging variability and sparsity of human image data, we proposed the NSML method to learn discriminative dissimilarities on a neighborhood structure manifold. Experiments demonstrated the advantage of NSML in terms of effectiveness, robustness, efficiency, stability, and generalizability.
We used widely adopted baseline features to demonstrate NSML in the experiments. Admittedly, some newly proposed methods in the directions of feature representation and feature transformation report higher performance than the ones we used for comparison. Technically, however, NSML can be combined with the newest and most effective feature representation and/or transformation algorithms to further enhance its performance.
Notably, although NSML is specifically designed to address the variability and sparsity problems in re-identification, it can also be applied to other vision problems that share similar characteristics or are closely related to person re-identification, such as object recognition, cross-camera tracking, multi-camera event detection, and so forth. Our ongoing work is extending NSML to solve them.

Acknowledgements

This work has been supported by the Fundamental Research Funds for the Central Universities and JSPS KAKENHI Grant Number 15K16024.
References

[1] R. Vezzani, D. Baltieri, R. Cucchiara, People reidentification in surveillance and forensics: A survey, ACM Computing Surveys 46 (2) (2013) 1–37.

[2] M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by symmetry-driven accumulation of local features, in: Proceedings of the 23rd Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, San Francisco, USA, 2010, pp. 2360–2367.

[3] S. Bak, E. Corvee, F. Bremond, M. Thonnat, Person re-identification using haar-based and dcd-based signature, in: Proceedings of the 7th International Conference on Advanced Video and Signal Based Surveillance, AVSS, IEEE, Boston, USA, 2010, pp. 1–8.

[4] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in: Proceedings of the 10th International Workshop on Performance Evaluation for Tracking and Surveillance, PETS, IEEE, Rio de Janeiro, Brazil, 2007, pp. 41–47.

[5] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by fisher vectors for person re-identification, in: Computer Vision – ECCV 2012 Workshops and Demonstrations, Vol. 7583, Springer Berlin Heidelberg, Florence, Italy, 2012, pp. 413–422.

[6] S. Iodice, A. Petrosino, Salient feature based graph matching for person re-identification, Pattern Recognition 48 (2015) 1074–1085.

[7] Z. Wu, Y. Li, R. J. Radke, Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (5) (2015) 1095–1108.

[8] Z. Liu, Z. Zhang, Q. Wu, Y. Wang, Enhancing person re-identification by integrating gait biometric, Neurocomputing 168 (2015) 1144–1156.

[9] L. An, S. Yang, B. Bhanu, Person re-identification by robust canonical correlation analysis, IEEE Signal Processing Letters 22 (8) (2015) 1103–1107.

[10] N. Martinel, A. Das, C. Micheloni, A. K. Roy-Chowdhury, Re-identification in the function space of feature warps, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (8) (2015) 1656–1669.

[11] B. Kulis, Metric learning: A survey, Foundations and Trends in Machine Learning 5 (4) (2013) 287–364.

[12] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (2009) 207–244.

[13] M. Dikmen, E. Akbas, T. S. Huang, N. Ahuja, Pedestrian recognition with a learned metric, in: Proceedings of the 10th Asian Conference on Computer Vision, ACCV, Vol. 6495, Springer, Queenstown, New Zealand, 2010, pp. 501–512.

[14] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Proceedings of the 25th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Providence, Rhode Island, USA, 2012, pp. 2288–2295.

[15] S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Proceedings of the 28th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Boston, USA, 2015, pp. 2197–2206.

[16] W. Li, Y. Wu, M. Mukunoki, Y. Kuang, M. Minoh, Locality based discriminative measure for multiple-shot human re-identification, Neurocomputing 167 (2015) 280–289.

[17] J. Chen, Z. Zhang, Y. Wang, Relevance metric learning for person re-identification by exploiting listwise similarities, IEEE Transactions on Image Processing 24 (12) (2015) 4741–4755.

[18] A. Globerson, S. Roweis, Metric learning by collapsing classes, in: Proceedings of Advances in Neural Information Processing Systems 18, NIPS, 2006, pp. 451–458.

[19] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Coupled metric learning for single-shot vs. single-shot person re-identification, Optical Engineering 52 (2) (2013) 027203-1–027203-10.

[20] B. Prosser, W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by support vector ranking, in: Proceedings of the British Machine Vision Conference, BMVC, BMVA Press, Wales, UK, 2010, pp. 21.1–21.11.

[21] B. McFee, G. Lanckriet, Metric learning to rank, in: Proceedings of the 27th International Conference on Machine Learning, ICML, Omnipress, Haifa, Israel, 2010, pp. 775–782.

[22] Y. Wu, M. Mukunoki, T. Funatomi, M. Minoh, Optimizing mean reciprocal rank for person re-identification, in: The 2nd International Workshop on Multimedia Systems for Surveillance, MMSS, IEEE, Klagenfurt, Austria, 2011, pp. 408–413.

[23] W.-S. Zheng, S. Gong, T. Xiang, Re-identification by relative distance comparison, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) 653–668.

[24] A. Mignon, F. Jurie, PCCA: A new approach for distance learning from sparse pairwise constraints, in: Proceedings of the 25th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Providence, Rhode Island, USA, 2012, pp. 2666–2672.

[25] D. Ramanan, S. Baker, Local distance functions: A taxonomy, new algorithms, and an evaluation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (4) (2011) 794–806.

[26] D. Kedem, S. Tyree, F. Sha, G. R. Lanckriet, K. Q. Weinberger, Non-linear metric learning, in: Proceedings of Advances in Neural Information Processing Systems 25, NIPS, Curran Associates, Inc., Lake Tahoe, Nevada, US, 2012, pp. 2573–2581.

[27] J. Hu, J. Lu, Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Columbus, Ohio, US, 2014, pp. 1875–1882.

[28] Z. Huang, R. Wang, S. Shan, X. Chen, Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning, Pattern Recognition 48 (2015) 3113–3124.

[29] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: Proceedings of the 13th European Conference on Computer Vision, ECCV, Vol. 8695, Springer, Zurich, Switzerland, 2014, pp. 1–16.

[30] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Locality based discriminative measure for multiple-shot person re-identification, in: Proceedings of the 10th International Conference on Advanced Video and Signal Based Surveillance, AVSS, IEEE, Krakow, Poland, 2013, pp. 312–317.

[31] W. Li, M. Mukunoki, Y. Kuang, Y. Wu, M. Minoh, Person re-identification by common-near-neighbor analysis, IEICE Transactions on Information and Systems E97-D (2014) 1745–1361.

[32] W. J. Scheirer, M. J. Wilber, M. Eckmann, T. E. Boult, Good recognition is non-metric, Pattern Recognition 47 (2014) 2721–2731.

[33] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.

[34] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Common-near-neighbor analysis for person re-identification, in: Proceedings of the 19th International Conference on Image Processing, ICIP, IEEE, Florida, USA, 2012, pp. 1621–1624.

[35] A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in: Proceedings of the 11th International Conference on Computer Vision, ICCV, IEEE, Rio de Janeiro, Brazil, 2007, pp. 1–8.

[36] W.-S. Zheng, S. Gong, T. Xiang, Associating groups of people, in: Proceedings of the 20th British Machine Vision Conference, BMVC, BMVA Press, London, UK, 2009, pp. 23.1–23.11.

[37] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, V. Murino, Custom pictorial structures for re-identification, in: Proceedings of the British Machine Vision Conference, BMVC, BMVA Press, Scotland, UK, 2011, pp. 68.1–68.11.

[38] Y. Xu, L. Lin, W.-S. Zheng, X. Liu, Human re-identification by matching compositional template with cluster sampling, in: Proceedings of the 14th International Conference on Computer Vision, ICCV, IEEE, Sydney, Australia, 2013, pp. 3152–3159.

[39] M. Hirzer, C. Beleznai, P. M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Proceedings of the Scandinavian Conference on Image Analysis, SCIA, Vol. 6688, Springer Berlin Heidelberg, Ystad, Sweden, 2011, pp. 91–102.

[40] P. M. Roth, M. Hirzer, M. Koestinger, C. Beleznai, H. Bischof, Mahalanobis distance learning for person re-identification, in: S. Gong, M. Cristani, S. Yan, C. C. Loy (Eds.), Person Re-Identification, Advances in Computer Vision and Pattern Recognition, Springer-Verlag London, London, UK, 2014, pp. 247–267.

[41] A. Das, A. Chakraborty, A. K. Roy-Chowdhury, Consistent re-identification in a camera network, in: Proceedings of the 13th European Conference on Computer Vision, ECCV, Vol. 8690, Springer, Zurich, Switzerland, 2014, pp. 330–345.

[42] N. Martinel, C. Micheloni, C. Piciarelli, Distributed signature fusion for person re-identification, in: Proceedings of the Sixth International Conference on Distributed Smart Cameras, ICDSC, IEEE, Hong Kong, China, 2012, pp. 1–6.

[43] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: A general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 40–51.

[44] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: Proceedings of the 13th European Conference on Computer Vision, ECCV, Springer, Zurich, Switzerland, 2014, pp. 1–16.

[45] S. Pedagadi, J. Orwell, S. Velastin, B. Boghossian, Local fisher discriminant analysis for pedestrian re-identification, in: Proceedings of the 26th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Portland, OR, USA, 2013, pp. 3318–3325.

[46] X. Liu, M. Song, D. Tao, X. Zhou, C. Chen, J. Bu, Semi-supervised coupled dictionary learning for person re-identification, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, Columbus, Ohio, USA, 2014, pp. 3550–3557.

[47] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person re-identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (7) (2013) 1622–1634.

[48] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by fisher vectors for person re-identification, in: Computer Vision – ECCV 2012 Workshops and Demonstrations, Vol. 7583, Springer, Florence, Italy, 2012, pp. 413–422.

[49] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, V. Murino, Custom pictorial structures for re-identification, in: Proceedings of the British Machine Vision Conference, BMVC, BMVA Press, Scotland, UK, 2011, pp. 68.1–68.11.

[50] J. V. Davis, B. Kulis, P. Jain, S. Sra, I. S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, ICML, ACM, Corvallis, Oregon, US, 2007, pp. 209–216.

[51] A. Martinez, R. Benavente, The AR face database, Tech. rep., Computer Vision Center at Universitat Autonoma de Barcelona (1998).

[52] W. Kusakunniran, Q. Wu, H. Li, J. Zhang, Multiple views gait recognition using view transformation model based on optimized gait energy image, in: Proceedings of the 12th International Conference on Computer Vision Workshops, ICCV Workshops, IEEE, Kyoto, Japan, 2009, pp. 1058–1064.

[53] S. Zheng, J. Zhang, K. Huang, R. He, T. Tan, Robust view transformation model for gait recognition, in: Proceedings of the 18th International Conference on Image Processing, ICIP, IEEE, Brussels, Belgium, 2011, pp. 2073–2076.