M3L: Multi-modality mining for metric learning in person re-identification

Xiaokai Liu, Xiaorui Ma, Jie Wang, Hongyu Wang (corresponding author)

School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, PR China
Article history: Received 27 December 2016; Revised 12 August 2017; Accepted 27 September 2017.

Keywords: Person re-identification; Multi-modality mining; Diagonal model; Metric learning
Abstract: Learning a scene-specific distance metric from labeled data is critical for person re-identification. Most earlier works in this area seek a linear transformation of the feature space such that relevant dimensions are emphasized while irrelevant ones are discarded in a global sense. However, when the training data exhibit multi-modality transitions, the globally learned metric deviates from the correct metrics associated with each modality. In this study, we propose a multi-modality mining approach for metric learning (M3L) that automatically discovers multiple modalities of illumination changes by exploring the shift-invariant property in log-chromaticity space, and then learns a sub-metric for each modality to reduce the bias introduced by a globally learned metric. Experiments on the challenging VIPeR dataset and the fused VIPeR&PRID 450S dataset validate the effectiveness of the proposed method, with an average improvement of 2–7% over the original baseline methods.
1. Introduction

Mahalanobis metric learning has gained considerable interest for person re-identification. The main idea is to find a linear transformation of the feature space such that relevant dimensions are emphasized while irrelevant ones are discarded in a global sense; the learned metric reflects the visual camera-to-camera transitions. Research in [1–3] has shown that an appropriate distance metric, learned from labeled data, usually allows for more powerful retrieval than the standard Euclidean distance. Most previous works in this area attempt to learn a discriminative distance metric in a global sense that keeps sample pairs with the same identity close while ensuring that those from different persons remain separated. Nevertheless, these goals may conflict and cannot be simultaneously satisfied when the data exhibit multi-modality transitions. In this paper, multi-modality is defined as two or more transition modalities between the same camera pair. Under each modality, the images from different camera views are captured under similar inter-view illumination conditions and exhibit a similar color transition mode. In real-world person re-identification scenarios, the images are often collected over the course of several days or even months, so the illumination changes exhibit multi-modality distributions: e.g., when images are captured respectively in the morning, in the evening, on a cloudy day, or on a
sunny day, the color transitions might differ significantly under the influence of the surrounding context and environment. As shown in Fig. 1(a), persons 1 and 2 clearly undergo limited color changes between camera 1 and camera 2, while persons 3 and 4 undergo significant illumination changes. In scenarios with multi-modality transformations, conventional metric learning approaches enforce the strong assumption that different modalities have a common or shared subspace, and force persons 1–4 to share a common metric. Note that persons 2 and 3 look quite similar in camera 1 but completely different in camera 2: the man in the dull-red shirt and dark gray pants stays largely unchanged, while the woman in the dull-red T-shirt and dark gray pants turns into one in a crimson T-shirt and denim-blue pants. In this case, a set of colors corresponds to two sets of transformed colors in the other camera, and the assumption of a common metric that attempts to bring the two modes together encounters difficulties and results in a compromised metric (Fig. 1(b)), leading to a significant degradation in ranking accuracy.

To address this problem, this paper proposes a novel clustering-based approach to multi-modality mining using an agglomerative hierarchical clustering algorithm. Rather than satisfying all of the pair-wise constraints, our algorithm focuses on discovering subsets of sample pairs that share common illumination changes. Fig. 1(c) illustrates the ranking orders under a global metric $\tilde{M}$ and two sub-metrics $\{M_i\}_{i=1}^{2}$. Owing to space limitations, only the top 15 ranking results are shown. The sub-metric $M_2$, trained on sample pairs from the same modality, clearly focuses more on the dramatic color changes, and
Fig. 1. (a) Four sample image pairs from the VIPeR dataset. (b) Illustration of feature transformations between two transformed spaces. (c) Illustration of the ranking orders by the global metric $\tilde{M}$ and two sub-metrics $\{M_i\}_{i=1}^{2}$. In (a), visually, persons 1 and 2 have limited color changes, while persons 3 and 4 undergo significant illumination changes. Persons 2 and 3 look quite similar in camera 1, but quite different in camera 2. In (b), conventional metric learning approaches enforce persons 2 and 3 to follow a common metric, resulting in a compromised metric $\tilde{M}$; the proposed algorithm automatically discovers multiple modalities and learns two discriminative sub-metrics $M_1$ and $M_2$. In (c), the sub-metric $M_2$ clearly focuses more on the significant color changes and achieves a much better ranking result. Matched gallery images retrieved within the top 15 are marked with red boxes. Best viewed in color.
consequently achieves a much better ranking result. Note that the top 5 persons ranked by $M_2$ are quite similar in appearance to the true match, which implies that the sub-metric $M_2$ is powerful in identifying this type of image pair. The primary challenge in multi-modality mining for metric learning is a chicken-and-egg dilemma, since what we need to cluster is the transition metrics, not the data itself: on one hand, to discover the multiple modalities of local distance metrics, we need to identify the transformation metric of each pair; on the other hand, to identify the transformation metric of each pair, we need to know the clustering of the local distance metrics. To resolve this problem, we propose a novel multi-modality mining approach that explores shift-invariant properties in log-chromaticity space, which explicitly addresses the chicken-and-egg problem from a new perspective. The contributions of this study are threefold:
(1) While most existing metric learning based person re-identification approaches focus on learning a global discriminative metric over all training sample pairs, we provide experimental evidence that benefits can be obtained from multi-modality metrics guided by the shift-invariant property in log-chromaticity space. To the best of our knowledge, this is the first study to investigate the effect of multi-modality metrics in long-running surveillance scenes for person re-identification.

(2) Our idea of using the inter-transition structure to guide multi-modality metric learning is novel. The shift-invariant property of the RGB values from related views in log-chromaticity space is discovered in this study. The shift vectors in log-chromaticity space do indeed contain discriminative structure of the illumination change modalities, owing to the different variations on separate channels.

(3) The proposed multi-modality mining approach is independent of the specific metric learning algorithm. Any advanced metric learning method can exploit our framework to obtain superior performance in the future.

Extensive experiments conducted on the benchmark re-identification dataset VIPeR and a simulated VIPeR&PRID 450S dataset demonstrate that person re-identification can significantly benefit from the multi-modality mining approach investigated in this study.

2. Related work
Mahalanobis metric learning has proven effective for the person re-identification problem. The main idea is to seek an optimal metric that reflects the visual view-to-view transitions, allowing for a boosted ranking accuracy. In [3], a large number of Mahalanobis metric learning algorithms were evaluated and shown to be effective for re-identification, for example Logistic Discriminant Metric Learning (LDML [4]), Information Theoretic Metric Learning (ITML [5]), Large Margin Nearest Neighbor (LMNN [6]), Large Margin Nearest Neighbor with Rejection (LMNN-R [7]), and Keep It Simple and Straightforward Metric Learning (KISSME
[2]). Several task-oriented approaches have been exploited to address the ill-posed problems arising in re-identification. By ignoring easy samples and focusing on hard ones, Hirzer et al. [1] introduced an impostor-based LMNN that exploits the natural constraints given by the person re-identification task. Considering the neighborhood structure manifold, which exploits the relative relationship between the concerned samples and their neighbors in the feature space, Li et al. [8] proposed a neighborhood structure metric learning algorithm to learn discriminative dissimilarities on such a manifold. These metric learning algorithms assume that all data can be evaluated under a single global metric and ignore the multiple modalities caused by varying illumination.

Considering multi-modality data distributions, Yang et al. [9] developed a local distance metric learning algorithm (LDM) to optimize local compactness and separability in a probabilistic framework. Noh et al. [10] learned a local metric for nearest neighbor classification using generative models. Saxena et al. [11] embedded local metrics in a global low-dimensional representation to coordinate them. Their purpose resembles ours, but differs in a key respect: in local metric learning applications there is only one data domain, and the goal is to classify the data itself (data classification), whereas we have two data domains separately captured from two cameras, and our purpose is to cluster the transition metrics (metric clustering). In person re-identification, Li and Wang [12] first took notice of locally aligned feature transforms across views, and proposed a local metric learning approach that jointly partitions the training data according to the similarity of cross-view transforms. Motivated by Li and Wang [12], subsequent works [13–15] assumed that the overall metric is jointly decided by a global metric and a set of local metrics, and obtained the metrics by iteratively adapting the Mahalanobis metrics to maximize an objective function, formulated in the frameworks of sparse coding [13], structured learning [14] and Gaussian mixture models [15]. Although these methods successfully exploit local metrics, they all rely on tedious iterative optimization procedures, whereas re-identification research increasingly demands faster and lighter methods for real-time applications. Liu et al. [16] treated person re-identification as a cross-modality image comparison problem and learned a unique local metric for each image pair. However, since the Mahalanobis metric is a statistics-based [17] similarity measure, such single-point learning is unstable and vulnerable to noise. In this paper, we start from the original causes of the multi-modality transitions and directly model the illumination changes to exploit multi-modality metrics. The proposed method does not depend on iterative optimization procedures and accordingly relieves the computational burden.

3. Multi-modality mining by shift-invariant property

An overview of the proposed multi-modality metric learning algorithm is shown in Fig. 2. In the training process, given a set of image pairs with the same identity $(x^p, x^g)$, where the superscripts $p$ and $g$ indicate the probe and gallery images, two kinds of descriptors are first extracted.
The first kind comprises four color descriptors in four color spaces (Section 3.1), $\{f^m\}_{m=1}^{4}$, which are used to learn the distance metrics (Section 3.4) and to measure the similarity between samples under given metrics. The second kind is the rgb values in log-chromaticity color space (Section 3.2), which are used to extract shift vectors $\{d_i\}_{i=1}^{n}$ that capture the color transitions between views. Then, in each color space, we utilize an agglomerative hierarchical clustering algorithm to discover multiple modalities of the illumination changes (Section 3.3), and apply a metric learning algorithm on each cluster to obtain a set of sub-metrics $M^{f^m} = \{M_k^{f^m}\}_{k=1}^{n_k}$ (Section 3.4).
In the process of re-identifying a probe image, for each feature descriptor $f^m$, given the sub-metrics $M^{f^m}$ learned in the training stage, we apply a metric selection algorithm $\hat{M}^{f^m} = F(M^{f^m})$ based on the similarities between the shift vectors of the top-ranked candidates and those of the cluster centers (Section 3.5), and obtain the distance measured by feature $f^m$ under the selected metric $\hat{M}^{f^m}$. Finally, we ensemble the distances measured by all feature descriptors to get the final ranking result. Henceforth, we drop the superscript $f^m$ unless it is needed.

3.1. Feature representation

For the feature representation, we follow the feature extraction scheme in [18] and apply commonly used histogram-based features. We intend to include low-level features that have invariant properties under some illumination change situations. Four features are extracted: RGB, HSV, normalized RGB, and the color names feature [19] (CN). We use an 8-bin (per dimension) histogram for the RGB, HSV, and normalized RGB models, and an 11-bin histogram for the CN feature. We partition each image into six horizontal stripes and, for each stripe, extract all histogram-based color descriptors. For each color model, the descriptors are extracted from both the whole image and the foreground image (obtained with a max-margin segmentation method [20]) and concatenated to form a feature vector, whose dimensionality is reduced to 20 via principal component analysis (PCA) [21]. We refer the reader to [18] for further details on the feature representation. Note that more sophisticated PCA algorithms [22,23] have been proposed and might be exploited to improve the recognition accuracy, but we do not consider such extensions in this work.

3.2. Diagonal model and shift vector in log-chromaticity space

To settle the chicken-and-egg dilemma, we start from the original causes of the multi-modality metrics, directly model the illumination changes, and cluster the sample pairs by exploring shift-invariant properties under illumination changes. Our intuition, validated in this work, is that since a metric reflects the visual view-to-view transitions, if the illumination changes of the sample pairs can be captured by a color transition model, that model can replace the unknown metric for the purpose of metric clustering. To cluster the illumination transitions, we need a mapping that transforms the color responses of an object imaged under some reference illumination in the probe camera to the corresponding responses under another illumination in the gallery camera. Image acquisition is a complex process affected by the spectral distribution of the illumination, the surface reflection of that distribution, and the sensor response. In real-world person re-identification settings, it is hardly possible to obtain an accurate estimate of the spectral distribution of the illumination and the surface reflection. However, for practical recognition and retrieval applications, it suffices to compare the illumination shift without precise estimates of the illumination and surface reflection. Finlayson et al. [24] have proven that the diagonal model of illumination change suffices to cope with the majority of illumination variations encountered in practice. In the rgb color model, each pixel $x$ is a 3-element tuple $(x_r, x_g, x_b)$.
Given a pair of pixels $(x^p, x^g)$ captured from different cameras but corresponding to the same 3D point, the diagonal model can be written as the multiplication of the rgb vector by a diagonal matrix:
$$\begin{pmatrix} x_r^g \\ x_g^g \\ x_b^g \end{pmatrix} = \begin{pmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{pmatrix} \begin{pmatrix} x_r^p \\ x_g^p \\ x_b^p \end{pmatrix} \qquad (1)$$
Fig. 2. Overview of the proposed multi-modality metric learning algorithm.

Despite its simplicity, the diagonal model has achieved good results when applied to model illumination changes [25]. Inspired by the invariant property of the log-ratio chromaticity color space introduced in [26], we further extend the two-dimensional space to a three-dimensional space over absolute rgb values, which we call the log-chromaticity color space, defined as:
$$l(x) = (\ln x_r, \ln x_g, \ln x_b) \qquad (2)$$
Applying the diagonal transformation model of Eq. (1) to pixels $x^p$ and $x^g$, taken respectively from the probe image and the gallery image, yields a three-dimensional shift in the log coordinates:
$$l(x^g) = (\ln \alpha x_r^p, \ln \beta x_g^p, \ln \gamma x_b^p) = l(x^p) + (\ln \alpha, \ln \beta, \ln \gamma) \qquad (3)$$
where the shift vector $d(x^g, x^p) = l(x^g) - l(x^p) = (\ln \alpha, \ln \beta, \ln \gamma)$ is determined only by the diagonal model and is independent of the pixel values. Therefore, to cluster the color transitions caused by illumination changes, we only need to cluster the shift vectors. The reason we use the log-chromaticity space is twofold: (a) a 3-D color feature is more informative than a 2-D feature, and thus more discriminative for classification and clustering; (b) the log operation turns division into subtraction, so the transformation can be represented by a 3-D vector, which is much easier to process.

Ideally, the points of the two sets would be in one-to-one correspondence, so a single pair of log rgb values from one corresponding point in the two views would yield the shift vector between them. However, due to differences in view angle and movements of the human body, the points of the two sets are not in one-to-one correspondence. To estimate the shift vector, we map the two point sets from the two views into the log-chromaticity color space and take the shift vector as the difference between the centers of the two sets:
$$d(x^p, x^g) = \bar{X}(l(x^g)) - \bar{X}(l(x^p)) \qquad (4)$$
where $l(x^p) = \{l(x_i) \mid x_i \in x^p\}$ denotes all the log-chromaticity color values in the probe image $x^p$, and similarly for $l(x^g)$; $\bar{X}(\cdot)$ denotes the mean of a point set.
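To make the estimation concrete, the following sketch (hypothetical helper names, not the authors' code) computes the shift vector of Eq. (4) from two sets of pixels; under an exact diagonal transform, Eq. (1), the recovered vector equals $(\ln \alpha, \ln \beta, \ln \gamma)$:

```python
import numpy as np

def log_chromaticity(pixels):
    # Eq. (2): l(x) = (ln x_r, ln x_g, ln x_b); pixels is an (N, 3) array
    # of strictly positive rgb values (clip zeros beforehand).
    return np.log(pixels.astype(np.float64))

def shift_vector(probe_pixels, gallery_pixels):
    # Eq. (4): difference between the centers of the two point sets
    # in log-chromaticity space.
    return (log_chromaticity(gallery_pixels).mean(axis=0)
            - log_chromaticity(probe_pixels).mean(axis=0))

# Toy check under an exact diagonal illumination change, Eq. (1):
rng = np.random.default_rng(0)
probe = rng.uniform(10.0, 250.0, size=(500, 3))   # pixels from the probe view
gallery = probe * np.array([1.3, 0.9, 1.1])       # diagonal model (alpha, beta, gamma)
d = shift_vector(probe, gallery)
print(np.allclose(d, np.log([1.3, 0.9, 1.1])))    # True: d = (ln a, ln b, ln g)
```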
3.3. Multi-modality mining

Given the 3-D descriptor $d$, which captures the illumination transition between the two camera views, we aim to discover several types of transitions via clustering, such that the sample pairs in each cluster share similar transitions. Formally, given $n$ shift vectors $\{d_i\}_{i=1}^{n}$, we aim to discover a set of clusters $c = \{c_k\}_{k=1}^{n_k}$, each consisting of shift vectors that represent similar color transitions. To this end, we employ the group-average agglomerative clustering algorithm (GAAC) [27] on the 3-D shift vectors $\{d_i\}_{i=1}^{n}$. GAAC computes the average of all the shift vectors in a cluster as its center and uses the centers to evaluate similarities between clusters. Compared with the single-link and complete-link criteria [28], both of which equate cluster similarity with the similarity of a single pair of data points, GAAC avoids the pitfalls of the 'chaining effect' and 'tight bounds' in each clustering pattern, and was shown in [27] to be more efficient for document clustering. To decide which clusters should be combined, we need a measure of distance between observations and a linkage criterion specifying the distance between sets. We define the distance between two observations as
$$\mathrm{dist}(d_i, d_j) = \min\left(\|d_i - d_j\|_2^2,\; \|d_i + d_j\|_2^2\right) \qquad (5)$$
As shown in Fig. 3, the transitions between two cameras are bidirectional and exchangeable, so we take the minimum of the distances between $d_i$ and the positive and negative versions of $d_j$. This operation implicitly regards $d_j$ and $-d_j$ as the same transition. For example, the transitions of person 3 and person 4 should be clustered into one set even though their shift vectors are diametrically opposite; with Eq. (5), the shift vector of person 4 ($d_4$) is implicitly converted to its negative and accordingly clustered into the same set as $d_3$. In a merging step where clusters $c_{t_1}$ and $c_{t_2}$ are selected to merge into a new one, the centers $\zeta_{c_{t_1}}$ and $\zeta_{c_{t_2}}$ are updated to a single new center according to their consistency in direction: if $\zeta_{c_{t_1}}$ and $\zeta_{c_{t_2}}$ point in the same direction, their sum is used to form the new cluster center; otherwise, their difference is used.
Fig. 3. Illustration of shift vectors in log-chromaticity space (a–e) and the process of agglomerative hierarchical clustering (f). Persons 1 and 2 experience mild illumination changes, while persons 3, 4 and 5 undergo dramatic illumination changes. Best viewed in color.
This is realized via the following equation:
$$\zeta = \begin{cases} (\zeta_{c_{t_1}} + \zeta_{c_{t_2}})/2 & \text{if } \mathrm{dist}(\zeta_{c_{t_1}}, \zeta_{c_{t_2}}) \le \mathrm{dist}(\zeta_{c_{t_1}}, -\zeta_{c_{t_2}}), \\ (\zeta_{c_{t_1}} - \zeta_{c_{t_2}})/2 & \text{otherwise.} \end{cases} \qquad (6)$$
The agglomerative hierarchical clustering algorithm is summarized in Algorithm 1 and the illustration of the clustering process is shown in Fig. 3(f).
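A minimal sketch of this clustering step is given below (hypothetical function names; we assume the distances inside Eq. (6) are plain squared Euclidean distances, and we stop at a target number of clusters rather than building the full dendrogram):

```python
import numpy as np

def sign_invariant_dist(di, dj):
    # Eq. (5): treat d_j and -d_j as the same transition.
    return min(np.sum((di - dj) ** 2), np.sum((di + dj) ** 2))

def cluster_shift_vectors(shift_vectors, n_clusters):
    # Agglomerative clustering of 3-D shift vectors (Algorithm 1).
    centers = [np.asarray(d, dtype=float) for d in shift_vectors]
    members = [[i] for i in range(len(shift_vectors))]
    while len(centers) > n_clusters:
        # Find the closest pair of cluster centers under Eq. (5).
        best, pair = np.inf, None
        for a in range(len(centers)):
            for b in range(a + 1, len(centers)):
                dab = sign_invariant_dist(centers[a], centers[b])
                if dab < best:
                    best, pair = dab, (a, b)
        a, b = pair
        ca, cb = centers[a], centers[b]
        # Eq. (6): average if the centers point the same way; otherwise
        # subtract, which flips the minority direction onto the majority one.
        if np.sum((ca - cb) ** 2) <= np.sum((ca + cb) ** 2):
            centers[a] = (ca + cb) / 2.0
        else:
            centers[a] = (ca - cb) / 2.0
        members[a] += members[b]
        del centers[b], members[b]
    return centers, members
```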
3.4. Metric learning

For each feature model, we denote a pair of feature points as $(f_i, f_j)$, $f_i, f_j \in \mathbb{R}^d$, and measure the Mahalanobis distance between a pair of samples by $d_c^2(f_i, f_j; M) = (f_i - f_j)^{\top} M (f_i - f_j)$, where the superscript (feature index) and subscript (sub-metric index) are dropped because the formulation applies to all features and sub-metrics. We introduce a similarity label $y_{ij}$, where $y_{ij} = 1$ indicates images with the same identity and $y_{ij} = 0$ otherwise. Given two sets of training pairs $S = \{(f_i, f_j)\}_{y_{ij}=1}$ and $D = \{(f_i, f_j)\}_{y_{ij}=0}$, from a statistical inference point of view the KISSME [2] Mahalanobis distance matrix $M$ is defined in closed form as $M = \Sigma_S^{-1} - \Sigma_D^{-1}$, where $\Sigma_S$ and $\Sigma_D$ are the covariance matrices of $S$ and $D$:

$$\Sigma_S = \frac{1}{|S|} \sum_{(f_i, f_j) \in S} (f_i - f_j)(f_i - f_j)^{\top} \qquad (7)$$

$$\Sigma_D = \frac{1}{|D|} \sum_{(f_i, f_j) \in D} (f_i - f_j)(f_i - f_j)^{\top} \qquad (8)$$
Compared with the large margin nearest neighbor (LMNN) [6] and relative distance comparison (RDC) [29] methods, the KISSME method [2] performs well while being an order of magnitude faster to train. Note that although all the experiments in this study apply the KISSME metric, the proposed multi-modality mining approach is independent of the specific metric learning algorithm; any advanced metric learning method can exploit our framework to obtain superior performance in the future.
Fig. 4. (a) Non-linear mapping from rank index to relevance score. (b) Example of the metric selection process. Given a probe image $x^p$, ranking lists $\{R_{M_t}\}_{t=1}^{n_t}$ are separately obtained under the sub-metrics $\{M_t\}_{t=1}^{n_t}$. The top K (K = 7 in this example) images are shown and the true match is labeled with a red box. $s_1(\cdot)$ is the relevance score of the corresponding rank index, and $s_2(\cdot)$ is the similarity of the shift vector between supposed sample pairs and cluster centers. We pick the metric under which the sum of products of $s_1$ and $s_2$ is maximal. See Section 3.5 for details. Best viewed in color.
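For concreteness, a minimal sketch of the closed-form computation of Eqs. (7) and (8) is shown below (hypothetical function names). In practice the resulting $M$ is often re-projected onto the cone of positive semi-definite matrices before use, a step we omit here:

```python
import numpy as np

def kissme_metric(f_probe, f_gallery, y):
    # f_probe, f_gallery: (n, d) arrays of paired features; y: (n,) array
    # with y[i] = 1 for same-identity pairs and y[i] = 0 otherwise.
    diff = f_probe - f_gallery
    d_s = diff[y == 1]                   # similar pairs S
    d_d = diff[y == 0]                   # dissimilar pairs D
    sigma_s = d_s.T @ d_s / len(d_s)     # Eq. (7)
    sigma_d = d_d.T @ d_d / len(d_d)     # Eq. (8)
    return np.linalg.inv(sigma_s) - np.linalg.inv(sigma_d)

def mahalanobis_dist2(fi, fj, M):
    # d_c^2(f_i, f_j; M) = (f_i - f_j)^T M (f_i - f_j)
    d = fi - fj
    return float(d @ M @ d)
```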
3.5. Metric selection

At test time, given a probe image $x^p$, we need to decide which metric to choose from the multi-modality metrics $M = \{M_t\}_{t=1}^{n_t}$. Note that what we classify is transitions, not the data itself, so it is not trivial to make this decision given only a probe image. In this paper, we assume that under the appropriate metric, the candidates similar to a given probe image are ranked at the top of the list. Therefore, for a given probe image $x^p$, we obtain a ranking list $R_{M_t} = (x_{k_1}^g \succ \cdots \succ x_{k_i}^g \succ \cdots \succ x_{k_{n_k}}^g)_{M_t}$ under metric $M_t$, where $x_{k_i}^g \succ x_{k_j}^g$ means that $x_{k_i}^g$ ranks ahead of $x_{k_j}^g$, and the subscript $k_m$ indicates the index of the gallery image. Considering the illumination transitions of the top $K$ candidates, we select the metric by the synergistic effect of relevance scores and cluster affiliation probabilities:

$$\hat{M} = F(M) = \max_{M_t} \sum_{r_{k_m}=1}^{K} s_1(x^p, x_{k_m}^g; M_t)\, s_2(x^p, x_{k_m}^g; \zeta_{c_t}, M_t) \qquad (9)$$

where the first term is a non-linear function, introduced by Li et al. [30], that maps rank orders to relevance scores, and the second term evaluates the similarity of the shift vectors between sample pairs and the cluster centers. This rests on two assumptions: (1) the initial ranking lists are reasonable owing to the transition-oriented sub-metrics; (2) for a given probe image, its similar candidates are ranked at the top of the list according to the relevance scores. The first term is defined as:

$$s_1(x^p, x_{k_m}^g; M_t) = 1 - \frac{1}{1 + \rho^{(-r_{k_m} + \varepsilon)}} \qquad (10)$$

where $r_{k_m}$ is the rank of image $x_{k_m}^g$ for probe image $x^p$ under metric $M_t$. The constants $\rho$ and $\varepsilon$ control, respectively, the rate at which the scores decay with growing rank and the rank position at which the relevance score equals 0.5. As shown in Fig. 4(a), this function assigns each gallery image a relevance score in the range (0, 1) according to its rank, with top-ranked gallery images receiving higher scores. The similarity scores under different metrics are not directly comparable, since nothing forces the metrics to be normalized; by converting rank orders to relevance scores via Eq. (10), we make them comparable. The second term evaluates the similarity between the shift vectors of the sample pairs $(x^p, x_{k_m}^g)$ and the cluster center $\zeta_{c_t}$ under metric $M_t$:

$$s_2(x^p, x_{k_m}^g; \zeta_{c_t}) = \exp\left(\frac{-\mathrm{dist}^2(d(x^p, x_{k_m}^g; M_t), \zeta_{c_t})}{\delta^2}\right) \qquad (11)$$

where $\delta$ is the bandwidth of the Gaussian function. From Eqs. (9) and (11) we can see that, given a probe image $x^p$, if under metric $M_t$ the color shift vectors of the top-ranked candidates are close to the cluster center corresponding to $M_t$, then $M_t$ is more likely to be the correct metric. The metric selection process is illustrated in Fig. 4(b).

Since no single color model or descriptor is robust against all types of illumination changes, the features based on the four color models described in Section 3.1 are finally fused to complement each other. After obtaining the ranking lists $\{R_{\hat{M}^{f^m}}\}_{m=1}^{4}$ under the metric $\hat{M}^{f^m}$ selected for each feature $f^m$, we re-rank simply by summing the distances measured under the selected metrics and sorting the summed distances in ascending order.
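The selection rule of Eqs. (9)–(11) can be sketched as follows (hypothetical interface: dist_fn is the Mahalanobis distance of Section 3.4, and the shift vectors between the probe and each gallery image are assumed to be precomputed via Eq. (4); parameter values follow Section 4):

```python
import numpy as np

def relevance_score(rank, rho=2.0, eps=10.0):
    # Eq. (10): map a rank order to a relevance score in (0, 1).
    return 1.0 - 1.0 / (1.0 + rho ** (-rank + eps))

def affiliation_score(shift_vec, center, delta=1.0):
    # Eq. (11) with the sign-invariant distance of Eq. (5).
    d2 = min(np.sum((shift_vec - center) ** 2),
             np.sum((shift_vec + center) ** 2))
    return np.exp(-d2 / delta ** 2)

def select_metric(probe_feat, gallery_feats, probe_shifts, metrics, centers,
                  dist_fn, K=10):
    # Eq. (9): pick the sub-metric whose top-K candidates agree best,
    # in shift-vector terms, with that metric's cluster center.
    # probe_shifts[k] is the Eq. (4) shift vector between the probe image
    # and gallery image k, precomputed from pixels.
    best_score, best_t = -np.inf, None
    for t, (M, center) in enumerate(zip(metrics, centers)):
        dists = [dist_fn(probe_feat, g, M) for g in gallery_feats]
        order = np.argsort(dists)        # ascending distance = best first
        score = sum(relevance_score(r + 1) *
                    affiliation_score(probe_shifts[k], center)
                    for r, k in enumerate(order[:K]))
        if score > best_score:
            best_score, best_t = score, t
    return best_t
```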
4. Experiments

In this section, we conduct extensive experiments on the widely used VIPeR dataset and on an artificial fusion dataset, VIPeR&PRID 450S. The proposed M3L is implemented mainly in MATLAB with compiled C-MEX subroutines for the core algorithm of the KISSME method, and all experiments run on a PC with an Intel(R) i7-4790K CPU (4.00 GHz) and 32 GB RAM. Given a probe image, it takes 0.224 s to scan all 541 candidate images in the fusion dataset and obtain the final ranking; feature extraction takes most of this time (0.196 s). Two baseline metric learning methods, ITML [5] and LMNN [6], are selected to validate our approach. Each experiment is repeated 10 times, and the results are reported as average Cumulated Matching Characteristic (CMC) curves. In the feature extraction stage, the dimensionality of the concatenated feature vector in each color model is reduced to 20 via principal component analysis (PCA). In the metric selection process, we set $K = 10$, $\rho = 2$ and $\varepsilon = 10$ in Eqs. (9) and (10).
Fig. 5. Examples of illumination transitions discovered in the VIPeR dataset (a–b) in the training process and the images classified to each class (c–d) in the test process. The training set is clustered into two subsets: images in (a) demonstrate drastic illumination changes (the color of the ground clearly changes from light pink to red or the reverse), while images in (b) present small illumination changes. Almost all image pairs in (c) and (d) are classified to the correct clusters; the misclassified image pair is marked with a red box.
4.1. VIPeR dataset

The VIPeR dataset (available at http://vision.soe.ucsc.edu/?q=node/178) is arguably the most challenging database for the person re-identification problem due to significant appearance changes, drastic illumination differences and large pose variations. The data were taken from arbitrary viewpoints under varying illumination conditions over the course of several months. There are 632 individuals captured in outdoor scenarios, with two images per person (one front/back view and one side view). All images are normalized to 128 × 48 pixels for the experiments. We randomly selected half of the dataset for training and the remainder for testing; in each set, half of the data is randomly selected as probe images and the rest as gallery images.

Fig. 5 illustrates the clustering results in the training process and the classification results in the test phase. Images in Fig. 5(a) demonstrate drastic illumination changes: the color of the ground clearly changes from light pink to red or the reverse. Owing to the tolerance of Eq. (5) to opposite shift-vector directions, transitions and their opposites can be grouped into the same cluster. Images in Fig. 5(b) present small illumination changes. The inter-view transitions are clustered into 2 groups, and almost all the image pairs in Fig. 5(c) and (d) are classified to the correct clusters; misclassified image pairs are marked with red boxes.

Quantitative comparisons on the VIPeR dataset are given in Table 1. By discovering different types of illumination changes and learning a discriminative metric for each type, our method outperforms each baseline metric learning method, ITML [5], LMNN [6], and KISSME [2], at all ranks, and especially at low ranks. Overall, the proposed method achieves 30.22% at rank 1, 59.57% at rank 5 and 71.99% at rank 10, a significant improvement over all other methods.

4.2. VIPeR&PRID 450S dataset

In an experimental environment, due to time and resource constraints, surveillance videos are typically captured within a limited time window and activity area, e.g. the PRID 450S dataset (available at https://lrs.icg.tugraz.at/download.php). In this case, illumination changes are stable to some extent and exhibit single-modality characteristics. In real-world scenarios, however, the single-modality assumption is impractical and breaks down when the illumination changes. In this section we therefore construct a virtual scenario by fusing the VIPeR and PRID 450S datasets into a new one, named the VIPeR&PRID 450S dataset, to simulate multi-modality illumination changes in real-world scenarios. PRID 450S is a recently constructed dataset containing 450 single-shot person image pairs recorded from two different static cameras. We fuse both datasets and randomly select half for training and the remainder for testing; in each set, half of the data is randomly selected as probe images and the rest as gallery images. All images are normalized to 128 × 48 pixels for evaluation. In the clustering process, the number of clusters is set to 3; the selection of the number of clusters is discussed further in Section 4.3.

Fig. 6 illustrates the clustering results in the training process and the classification results in the test phase. Image pairs in Fig. 6(a) demonstrate mild illumination changes; image pairs in Fig. 6(b) show medium illumination changes, with images in camera 2 presenting a green shift in color; and image pairs in Fig. 6(c) present drastic illumination changes, where the color of the ground clearly changes from light pink to red or the reverse. Almost all image pairs in Fig. 6(d–f) are classified to the correct clusters; the misclassified pairs are marked with red boxes. In the VIPeR&PRID 450S dataset, the inter-view transitions are clustered into 3 groups.
Table 1. Comparison of CMC ranking rate (%) on the VIPeR dataset (p = 316).

Method        r=1    r=5    r=10   r=15   r=20   r=25   r=30
LMNN [6]      19.74  33.11  42.68  48.61  54.47  57.40  62.22
LMNN+M3L      21.87  34.53  47.19  53.20  60.16  63.33  68.71
ITML [5]      23.02  47.23  63.13  72.39  78.56  82.13  85.28
ITML+M3L      25.08  51.03  66.85  72.94  78.64  82.80  85.57
KISSME [2]    26.74  55.70  69.15  78.48  83.39  87.66  89.48
KISSME+M3L    30.22  59.57  71.99  79.27  84.34  88.13  90.51
Fig. 6. Examples of illumination transitions discovered in the VIPeR&PRID 450S dataset (a–c) in the training process and the image pairs classified to each class (d–f) in the test process. The training set is clustered into three subsets: image pairs in (a) demonstrate mild illumination changes; image pairs in (b) show medium illumination changes, with images in camera 2 presenting a green shift in color; and image pairs in (c) present drastic illumination changes, where the color of the ground clearly changes from light pink to red or the reverse. Almost all image pairs in (d), (e) and (f) are classified to the correct clusters; the misclassified image pairs are marked with red boxes.
To give insight into why such a small number of clusters succeeds, a scatter plot of the 3-D shift vectors in the training set is shown in Fig. 7; obvious boundaries lie between the different clusters in both datasets. A more detailed analysis is given in Section 4.4.

Quantitative comparisons on the VIPeR&PRID 450S dataset are given in Table 2. By discovering different types of illumination changes and learning a discriminative metric for each type, our method outperforms each baseline metric learning method, ITML [5], LMNN [6], and KISSME [2], at all ranks, and especially at low ranks. Overall, the proposed method achieves 29.33% at rank
1, 55.21% at rank 5 and 67.71% at rank 10, a significant improvement over all other methods. Note that the improvement over the baselines is larger on the fusion dataset than on the VIPeR dataset; we expect that in scenarios with even more complicated illumination changes, our method would achieve still more distinct improvements.

4.3. Sensitivity

In this section, we evaluate the sensitivity to the number of clusters in the multi-modality mining process; the performance with a varying number of clusters on both the VIPeR and VIPeR&PRID 450S datasets is presented in Fig. 8.
Fig. 7. Scatter plot of the 3-D shift vectors in the training set of (a) the VIPeR dataset and (b) the VIPeR&PRID 450S dataset. Points in different clusters are marked in different colors, and cluster centers are marked with dotted black circles.

Table 2. Comparison of CMC ranking rate (%) on the VIPeR&PRID 450S dataset (p = 541).

Method        r=1    r=5    r=10   r=15   r=20   r=25   r=30
LMNN [6]      17.24  32.95  39.98  48.48  53.84  59.39  62.71
LMNN+M3L      20.35  34.21  45.49  52.14  58.24  67.67  68.71
ITML [5]      21.01  41.84  54.84  61.86  67.47  71.90  76.22
ITML+M3L      24.15  48.80  62.17  70.18  73.94  77.45  81.08
KISSME [2]    22.24  48.00  60.57  69.44  75.54  80.04  82.87
KISSME+M3L    29.33  55.21  67.71  75.66  80.22  83.30  85.71
Fig. 8. Performance as a function of the number of clusters: CMC curves for different numbers of clusters $n_t$ on (a) the VIPeR dataset and (b) the VIPeR&PRID 450S dataset. For comparison, the CMC curves of the KISSME algorithm are also provided.
On the VIPeR dataset, the recognition rate is stable and remains at a high level when the number of clusters $n_t$ is less than 7, and begins to fall off when $n_t$ exceeds 10. On the VIPeR&PRID 450S dataset, the best result is achieved when $n_t$ equals 3; the proposed M3L outperforms KISSME when $n_t$ is less than 4, performs comparably when $n_t$ is no more than 10, and also falls off to a low level when $n_t$ exceeds 15. When $n_t$ is large, the scale of each cluster is small. In the KISSME metric learning algorithm, if the size of a cluster is less than the dimension of the reduced feature, the learned metric becomes singular and far from accurate; hence, if the number of clusters is large relative to the feature dimension, the recognition rate may drop sharply. If the number of clusters is set to a relatively small value, however, the proposed M3L algorithm is robust and effective. In real-world scenarios, we recommend setting the number of clusters to 2–5, depending on the complexity of the illumination changes.
Algorithm 1. The agglomerative hierarchical clustering algorithm.
Input: Shift vectors generated by all positive sample pairs from the training set in log-chromaticity space.
1: Assign each shift vector to a separate cluster, and set the center of each cluster to the shift vector itself.
2: repeat
3:   Evaluate all pair-wise distances between cluster centers using Eq. (5), and construct a distance matrix D.
4:   Find the pair of clusters with the shortest distance in D, merge them into a new cluster, and update the distance matrix and the cluster center according to Eq. (6).
5: until the desired number of clusters is reached (or all observations are merged into one cluster).
Output: A set of clusters $\{c_t\}_{t=1}^{n_t}$ and the related centers $\{\zeta_{c_t}\}_{t=1}^{n_t}$.
Table 3. Comparison of CMC ranking rate (%) with the state-of-the-art methods on the VIPeR (p = 316) and VIPeR&PRID 450S (p = 541) datasets.

                      VIPeR                               VIPeR&PRID 450S
Method           r=1    r=5    r=10   r=15   r=20    r=1    r=5    r=10   r=15   r=20
MLF [31]         30.16  52.67  67.13  73.62  79.63   27.44  55.18  67.30  75.42  80.98
eSDC_ocsvm [32]  26.74  50.70  62.37  70.01  76.36   25.63  53.07  65.38  72.06  77.41
ECM [18]         27.64  57.81  70.99  78.31  84.17   21.77  46.99  61.49  69.91  75.70
IGLML [15]       27.72  56.52  68.73  75.95  80.85   23.07  49.94  63.45  70.85  76.84
AMMF [30]        28.79  59.32  71.86  80.18  86.91   24.78  52.69  64.78  72.50  77.75
OURS             30.22  59.57  71.99  79.27  84.34   29.33  55.21  67.71  75.66  80.22
As mentioned above, we apply PCA to reduce the dimensionality of the concatenated feature vectors to $\kappa$ in each experiment, where $\kappa$ is an important parameter affecting the recognition performance. Note that in metric learning, the feature dimension $\kappa$ should be no larger than the number of samples in the training set, in order to keep the learned matrix nonsingular. Therefore, the upper limit of the reduced feature dimension is governed by the size of the smallest subset. As shown in [30], the best performance is obtained when $\kappa$ is set to 50–70; the training set should therefore contain at least 50 data points in each subset to achieve the best performance, a requirement that is easy to meet in real-world scenarios.
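As an illustration of this constraint, the following small sketch using scikit-learn (an assumption on our part; the authors' implementation is in MATLAB) caps $\kappa$ by the training set size:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(x_train, x_test, kappa=60):
    # kappa must stay below the number of training samples, otherwise the
    # covariance matrices of Eqs. (7)-(8) become singular and the learned
    # metric is unreliable.
    kappa = min(kappa, x_train.shape[0] - 1, x_train.shape[1])
    pca = PCA(n_components=kappa).fit(x_train)
    return pca.transform(x_train), pca.transform(x_test)
```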
4.4. Comparison with state-of-the-art results

We compare our method with two feature-seeking methods, MLF [31] and eSDC_ocsvm [32], and three state-of-the-art metric learning based methods, ECM [18], AMMF [30] and IGLML [15]. For a fair comparison, the dimensionality of the concatenated feature vectors in all experiments is reduced to 20. Key rank-r identification rates are shown in Table 3. The proposed method performs well against the state-of-the-art methods, especially at low ranks; in particular, we achieve a jump of more than 4% at rank 1 on the VIPeR&PRID 450S dataset compared with the metric learning based methods. Both feature-seeking methods perform stably on both datasets, whereas all three metric learning based methods show a significant performance decrease on the VIPeR&PRID 450S dataset compared with the VIPeR dataset. The results suggest that: (1) metric learning based methods are susceptible to multiple modalities, because a global metric cannot model multiple modes of transitions; (2) the more dispersed the distribution of the multiple modalities, the greater the impact on the performance of the metric learning based methods. Although there are multiple separated modalities in both datasets, as shown in Fig. 7, the data points in one of the clusters of the VIPeR dataset (the one in green) are relatively few, so its influence is relatively small. In the fusion dataset, by contrast, there are plenty of data points in two major clusters (those in orange and blue), so the global metric deviates notably from the two sub-metrics, which accordingly reduces the identification performance.

5. Conclusion

In real-world scenarios, observed images acquired from different viewpoints tend to exhibit multi-modality transitions due to time-varying illumination changes, and globally learned metrics deviate from the correct metrics by mixing up all transition modalities. In this paper, we propose a multi-modality metric learning approach that automatically discovers multiple modalities of illumination changes and learns a sub-metric for each modality by exploring the shift-invariant property in log-chromaticity space. The idea is simple but effective, and it is independent of the specific metric learning algorithm. Experimental results on the VIPeR dataset and the VIPeR&PRID 450S dataset demonstrate that the proposed algorithm achieves boosted performance compared with conventional global metric learning algorithms. Furthermore, other metric learning methods can exploit our framework to obtain superior performance.

Acknowledgment

X. Liu, X. Ma, and H. Wang are supported in part by the National Natural Science Foundation of China under grants 61671102, 61671103, and 61401059.

References
[1] M. Hirzer, P.M. Roth, H. Bischof, Person re-identification by efficient impostor-based metric learning, in: IEEE Advanced Video and Signal-Based Surveillance, Beijing, 2012, pp. 203–208.
[2] M. Köstinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Rhode Island, 2012, pp. 2288–2295.
[3] P.M. Roth, M. Hirzer, M. Köstinger, C. Beleznai, H. Bischof, Mahalanobis distance learning for person re-identification, in: Person Re-Identification, Springer, London, 2014, pp. 247–267.
[4] M. Guillaumin, J. Verbeek, C. Schmid, Is that you? Metric learning approaches for face identification, in: Proceedings of the IEEE International Conference on Computer Vision, Kyoto, 2009, pp. 498–505.
[5] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: International Conference on Machine Learning, Corvallis, 2007, pp. 209–216.
[6] K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in: Advances in Neural Information Processing Systems, 18, 2006, pp. 1473–1480.
[7] M. Dikmen, E. Akbas, T.S. Huang, N. Ahuja, Pedestrian recognition with a learned metric, in: Tenth Asian Conference on Computer Vision, Queenstown, 2010, pp. 501–512.
[8] W. Li, Y. Wu, J. Li, Re-identification by neighborhood structure metric learning, Pattern Recognit. 61 (2016) 327–338.
[9] L. Yang, R. Jin, R. Sukthankar, Y. Liu, An efficient algorithm for local distance metric learning, in: The National Conference on Artificial Intelligence, 2006, pp. 543–548.
[10] Y.K. Noh, B.T. Zhang, D.D. Lee, Generative local metric learning for nearest neighbor classification, Adv. Neural Inf. Process. Syst. 18 (2010) 417–424.
[11] S. Saxena, J. Verbeek, Coordinated local metric learning, in: Proceedings of the IEEE International Conference on Computer Vision, Paris, 2016, pp. 369–377.
[12] W. Li, X. Wang, Locally aligned feature transforms across views, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013, pp. 3594–3601.
[13] V. Erin, J. Lu, Y. Ge, Regularized local metric learning for person re-identification, Pattern Recognit. Lett. 68 (2015) 288–296.
[14] X. Gu, Y. Ge, Weighted local metric learning for person re-identification, in: Chinese Conference on Biometric Recognition, Springer International Publishing, 2016, pp. 686–694.
[15] J. Zhang, X. Zhao, Integrated global-local metric learning for person re-identification, in: IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 596–604.
[16] K. Liu, Z. Zhao, A. Cai, Parametric local multi-modal metric learning for person re-identification, in: 22nd International Conference on Pattern Recognition, 2014, pp. 2578–2583.
[17] P.C. Mahalanobis, On the generalised distance in statistics, in: Proceedings of the National Institute of Sciences of India, 2, 1936, pp. 49–55.
[18] X. Liu, H. Wang, Y. Wu, J. Yang, M.-H. Yang, An ensemble color model for human re-identification, in: IEEE Winter Conference on Applications of Computer Vision, Hawaii, 2015, pp. 868–875.
[19] J. van de Weijer, C. Schmid, J. Verbeek, Learning color names from real-world images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, 2007, pp. 1–8.
[20] J. Yang, S. Sáfár, M.-H. Yang, Max-margin Boltzmann machines for object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014, pp. 320–327.
[21] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics Intell. Lab. Syst. 2 (1–3) (1987) 37–52.
[22] V.K. Kadappagari, A. Negi, SubXPCA and a generalized feature partitioning approach to principal component analysis, Pattern Recognit. 41 (2008) 1398–1409.
[23] V.K. Kadappagari, A. Negi, Global modular principal component analysis, Signal Process. 105 (2014) 381–388.
[24] G.D. Finlayson, M.S. Drew, B.V. Funt, Diagonal transforms suffice for color constancy, in: Proceedings of the IEEE International Conference on Computer Vision, 1993.
[25] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person re-identification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (7) (2013) 1622–1634.
[26] M.S. Drew, G.D. Finlayson, S.D. Hordley, Recovery of chromaticity image free from shadows via illumination invariance, in: Workshop on Color and Photometric Methods in Computer Vision, Nice, vol. 1, 2003, pp. 32–39.
[27] A. El-Hamdouchi, P. Willett, Comparison of hierarchical agglomerative clustering methods for document retrieval, Comput. J. 32 (3) (1989) 220–227.
[28] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[29] W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by probabilistic relative distance comparison, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2011) 653–668.
[30] P. Li, M. Liu, Y. Gu, L. Yao, J. Yang, Adaptive multi-metric fusion for person re-identification, in: Chinese Conference on Pattern Recognition, 2016, pp. 258–267.
[31] R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 144–151.
[32] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3586–3593.
Xiaokai Liu received the B.E. degree in 2010 from the School of Information Engineering, Dalian Maritime University (DMU), PR China. She is now a doctoral candidate in the School of Information and Communication Engineering of Dalian University of Technology (DUT), PR China. Her research interests include person re-identification and machine learning.
Xiaorui Ma received the B.E. degree in 2008 from the School of Mathematics and Statistics, Lanzhou University (LZU), PR China. She is now a doctoral candidate in the School of Information and Communication Engineering of Dalian University of Technology (DUT), PR China. Her research interests include remote sensing image classification and machine learning.
Jie Wang (M'12) received his B.S. degree from Dalian University of Technology, Dalian, China, in 2003, his M.S. degree from Beihang University, Beijing, China, in 2006, and his Ph.D. degree from Dalian University of Technology, Dalian, China, in 2011, all in Electronic Engineering. He is currently an Associate Professor at Dalian University of Technology. From January 2013 to January 2014, he was a visiting researcher with the University of Florida. His research interests include wireless localization and tracking, radio tomography, wireless sensing, wireless sensor networks, and cognitive radio networks. Dr. Wang has served as a Technical Program Committee member for many international conferences, e.g., IEEE GLOBECOM and ICC. He has served as an Associate Editor for IEEE Transactions on Vehicular Technology since January 2017, and served as a Guest Editor of a special issue of the International Journal of Distributed Sensor Networks on wireless localization in 2015.
Hongyu Wang received the B.S. degree from Jilin University of Technology in 1990, and M.S. degree from Graduate School of Chinese Academy of Sciences in 1993, both in Electronic Engineering. He received the Ph.D. in Precision Instrument and Optoelectronics Engineering from Tianjin University in 1997. He is currently a Professor at Dalian University of Technology. His research interests include algorithmic, optimization, and performance issues in wireless ad hoc, mesh and sensor networks.