Neurocomputing 242 (2017) 15–27
Multiple-shot person re-identification via fair set-collaboration metric learning

Wei Li∗, Jianqing Li, Lifeng Zhu

School of Instrument Science and Engineering, Southeast University, 2 Sipailou, Nanjing 210096, China
Article info

Article history: Received 9 November 2015; Revised 15 January 2017; Accepted 2 February 2017; Available online 20 February 2017. Communicated by Dr Zhang Zhaoxiang.

Keywords: Multiple-shot person re-identification; Metric learning; Set-collaboration dissimilarity; Fairness principle
Abstract

As an issue attracting increasing interest in both academia and industry, multiple-shot person re-identification has shown promising results but suffers from real-scenario complexities and feature-crafting heuristics. To tackle the problems of set-level data variation and sparseness during re-identification, this paper proposes a novel metric learning method, named "Fair Set-Collaboration Metric Learning", motivated by utilizing the opportunities whilst overcoming the challenges of the set of multiple instances. This method optimizes a new set-collaboration dissimilarity measure, which introduces the fairness principle into the collaborative representation based set to sets distance, within a set based metric learning framework. Experiments on widely-used benchmark datasets demonstrate the advantages of this method in terms of effectiveness and robustness.
∗ Corresponding author. E-mail address: [email protected] (W. Li).
http://dx.doi.org/10.1016/j.neucom.2017.02.003 0925-2312/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

To contribute to information forensics and security in visual surveillance, multiple-shot person re-identification deals with the problem of building correspondences between human image sets acquired from deployed cameras. However, diverse unavoidable real-scenario challenges, such as severe variations of illumination, pose, and camera calibration, as well as potential resemblances of clothing and body style, are blended together, which significantly increases the re-identification difficulty.

The difference between multiple-shot and single-shot re-identification is whether a bundle of instances or only one single sample per person is available to be re-identified. For the multiple-shot case, we can deal with the set of samples jointly because the within-set sample relevance information is known a priori, while for the single-shot case, we have to treat each sample separately. Multiple-shot re-identification is in fact more practical, because video cameras usually capture fully or partially continuous human image sequences rather than discrete, single snapshots. The multiple-shot case contains more useful resources than the single-shot case, but meanwhile it suffers from the extra challenge of within-set variations. Admittedly, if the single-shot re-identification problem were conquered, the multiple-shot case would be readily solved. In fact, however, the performance of single-shot re-identification is largely limited by real-world complexities. Therefore, it is meaningful and necessary to take full advantage of the valuable information in the image sets for re-identification.

Methodology for re-identification can basically be categorized into the feature and measure directions. The feature methodology includes feature designing and feature learning. The gist of feature designing for multiple-shot re-identification is to manually craft a suitable representation to describe the human image sets. Describing the set of color and/or texture information of human images in a statistical way dominates this branch [1–4]. The strategies of designing image features for multiple-shot and single-shot re-identification do not differ much, except for the procedure of condensing the whole set into one descriptor. In the feature learning branch, deep learning has the advantage of jointly handling the complex mixture of real-world variations of human image data. Current deep learning approaches for re-identification mainly focus on the single-shot case. In deep architectures, patch matching plays an important role in capturing the local variations of human appearance across views. For example, based on the convolutional neural network, Wei Li et al. calculate the local patch displacements within the horizontal stripes between human images, and then analyze the largest patch-match responses in each stripe; Ejaz Ahmed et al. compute the cross-input neighborhood differences of local patches between image samples, and further distill these local differences into the patch summary feature [5,6]. On the other hand, the complementarity of diverse data domains, the combination of different modeling ways, and the cooperation of global and local body parts can benefit the deep learning models for re-identification as well [7–9].
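The cross-input neighborhood difference operation mentioned above can be sketched numerically. The toy reimplementation below is our own illustrative reading of the operation reported for [6], not the authors' code; the feature-map sizes and the neighborhood radius are arbitrary:

```python
import numpy as np

def neighborhood_difference(f, g, k=2):
    """Cross-input neighborhood difference (sketch): compare each location
    of feature map f against a (2k+1) x (2k+1) neighborhood around the same
    location in feature map g, yielding one difference patch per location."""
    H, W = f.shape
    gp = np.pad(g, k, mode="edge")          # edge-pad so borders have full neighborhoods
    out = np.empty((H, W, 2 * k + 1, 2 * k + 1))
    for y in range(H):
        for x in range(W):
            out[y, x] = f[y, x] - gp[y:y + 2 * k + 1, x:x + 2 * k + 1]
    return out

f = np.random.rand(8, 4)                    # toy feature maps from two camera views
g = np.random.rand(8, 4)
D = neighborhood_difference(f, g)           # shape (8, 4, 5, 5)
```

In the architecture of [6], such difference patches are then summarized by further convolutional layers into a patch summary feature.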
Both feature crafting and deep learning have developed rapidly in recent years [7,10]. Although elaborately-crafted features aim to characterize the human appearance in a discriminative way, the complexities of real scenarios and the heuristics of hand designing more or less impede performance enhancement [11,12]. To solve these problems, the measure methodology tries to improve the discriminability of dissimilarity in the feature space. The first branch of the measure methodology concentrates upon collaboration-based dissimilarities. Usually, the solution mode is to, firstly, represent the query set by all the corpus sets collaboratively, and then determine the classification results according to the reconstruction errors [13–16]. Two representative works in this branch are Collaboratively Regularized Nearest Points (CRNP) [14] and Collaborative Representation based Set to Sets Distance (CRSSD) [16]. CRNP finds set-to-set dissimilarities using the l2-norm regularized affine hulls, and meanwhile makes use of discriminative Collaborative Sparse Approximation (CSA) [13] for set-based recognition. CRSSD was originally proposed for face recognition, but we discover the potential of this method for person re-identification in spite of the challenging difference between face and body images. Essentially, the idea of CRSSD is similar to CRNP: approximating the query set using collaborative representation over all the corpus sets. Their difference is that CRSSD measures the nearest-point distance between the convex hulls instead of the affine hulls of the query and corpus sets. Both affine hull based and convex hull based dissimilarities rely on the peripheral information of sets. However, for set-based dissimilarity measure, the affine hull stresses the discriminative power of outlying points, while the convex hull emphasizes the negative impact of outliers and discards them.
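The contrast between affine hull and convex hull based set distances can be sketched numerically. The snippet below is a simplified illustration (not the exact CHISD or CRNP formulation; the data and the SLSQP solver are our assumptions): both variants find the nearest points of two hulls, and the convex variant simply adds non-negativity to the combination coefficients:

```python
import numpy as np
from scipy.optimize import minimize

def hull_distance(A, B, convex=True):
    """Squared nearest-point distance between the hulls of two sets whose
    columns are samples: convex hulls if convex=True, affine hulls otherwise."""
    na, nb = A.shape[1], B.shape[1]

    def obj(z):
        a, b = z[:na], z[na:]
        return np.sum((A @ a - B @ b) ** 2)

    # Both hulls require coefficients summing to one ...
    cons = [{"type": "eq", "fun": lambda z: z[:na].sum() - 1.0},
            {"type": "eq", "fun": lambda z: z[na:].sum() - 1.0}]
    # ... and the convex hull additionally forbids negative coefficients.
    bounds = [(0.0, None)] * (na + nb) if convex else None
    z0 = np.concatenate([np.full(na, 1.0 / na), np.full(nb, 1.0 / nb)])
    res = minimize(obj, z0, bounds=bounds, constraints=cons, method="SLSQP")
    return res.fun

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
B = rng.normal(size=(5, 4)) + 3.0   # a second, shifted set
d_convex = hull_distance(A, B, convex=True)
d_affine = hull_distance(A, B, convex=False)
```

Since the convex hull of a set is contained in its affine hull, the affine-hull distance can never exceed the convex-hull one, which is one way to see how differently the two measures treat outlying samples.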
It is hard to say which is absolutely superior, because it is difficult to judge whether abnormal points away from the dominant subset are informative or noisy. However, from our practical experience, in most cases the convex hull works more effectively in the multiple-shot re-identification application.

The second branch of the measure methodology centers on metric learning approaches. This branch uses optimization schemes to adjust the metric space structure for better discrimination. Point based metric learning aims to improve the relative comparison relationship between intra-class and inter-class dissimilarities at the sample level for single-shot re-identification, which can also provide inspiration and reference for multiple-shot re-identification. RDC (Relative Distance Comparison) [17] maximizes the likelihood of a pair of true matches having a relatively smaller distance than that of a wrong match pair in a soft discriminant manner, and further develops an ensemble model to maintain model tractability in large scale learning. PCCA (Pairwise Constrained Component Analysis) [18] learns a low-dimensional space that exhibits good generalization properties in the presence of high dimensional data by using sparse pairwise dissimilarity constraints. KISSME (KISS Metric Learning) [19] introduces a simple though effective strategy to learn a distance metric from equivalence constraints based on a statistical inference perspective. kLFDA (kernel Local Fisher Discriminant Analysis) [20] handles high dimensional features by optimizing the Fisher discriminant objective based on the kernel trick. RMLLC (Relevance Metric Learning with Listwise Constraints) [21] fully uses the available similarity information from training data by forcing the learned metric to conserve predefined listwise similarities in the low-dimensional projection subspace.
To address the problem that predefined binary listwise similarities usually fail to capture relative similarity information, this method further employs a rectification term and jointly optimizes the metric and this term together. LADF (Learning of Locally-Adaptive Decision Functions) [22] learns a decision function for verification that can be viewed as a joint model of a distance metric and a locally adaptive thresholding rule. The performance of metric learning is also related to the feature space. XQDA (Cross-view Quadratic Discriminant Analysis) [23] learns a discriminant low dimensional subspace, in which the metric is simultaneously optimized, by cross-view quadratic discriminant analysis on the feature space of LOMO (LOcal Maximal Occurrence). IGB (Integrating Gait Biometric) [24] integrates the gait feature with the appearance feature to deal with challenging appearance distortions, and also uses metric learning to solve the view-angle change problem that may damage the discriminability of the similarity measure between descriptors before the final score-level or feature-level fusion. IGB shows the effectiveness and flexibility of the metric learning strategy for improving the human image feature space, though IGB focuses on feature integration and metric learning is just one step of its entire workflow. Although metric learning and deep learning are different directions, their advantages can also be integrated for better re-identification. By incorporating metric optimization into deep models, deep metric learning can jointly learn the feature and metric from image pixels directly. For instance, Dong Yi et al. utilize a siamese deep neural network to learn the color feature, texture feature, and metric simultaneously [25]; from the perspective of deep ranking, Shi-Zhe Chen et al. unify deep joint representation learning and ranking based metric learning into one framework [26]; to solve the problem of the cross-domain distribution gap, Junlin Hu et al. further introduce transfer learning into the deep metric learning scheme [27]. Set based metric learning models usually optimize the relative comparison relationship between intra-class and inter-class set-based dissimilarities. One most typical method is Set Based Discriminative Ranking (SBDR) [11].
This model iterates between set-based dissimilarity finding and metric space adjustment to achieve the simultaneous optimization of the anticipated relationship between intra- and inter-class set-to-set dissimilarities.

Both existing collaboration-based dissimilarity and metric learning have their respective merits and weaknesses. Collaboration-based dissimilarity exploits the global collaborative information among corpus sets. Global collaborative information provides more robustness to the local layout deviation of the sets, in contrast with direct dissimilarity measure between an independent pair of sets. However, collaboration-based dissimilarity is vulnerable to the original location of the distributed sets [13,14,16]. Metric learning can help adjust the location of the distributed sets in the learned metric space based on supervised knowledge. However, metric learning only considers the set-to-set dissimilarity in the local area, and misses the global collaborative information of the sets [11,12]. Thus, its performance is easily limited by data sparseness. Inspired by the collaboration-based and learning-based methodologies, we propose a new method that inherits and integrates the advantages of both strategies whilst remedying each other's weaknesses. This method is named "Fair Set-Collaboration Metric Learning (FSCML)". To benefit intra-class compactness with regard to inter-class separation, FSCML learns a novel set-based dissimilarity based on the idea of fair collaboration. Optimizing the metric using the discriminative set-based collaboration information, to our knowledge, makes FSCML different from existing models and opens a new door to metric learning schemes for multiple-shot re-identification. In sum, there are three contributions of this proposal:

• We design the new fair set-collaboration dissimilarity which introduces the fairness principle into the collaborative representation based set-to-set distance (Section 2.2.1).
• We propose the novel metric learning model constrained by the relative comparison relationship based on fair set-collaboration dissimilarities (Section 2.2.2).
• We demonstrate the advantage of this proposal in terms of effectiveness and robustness on widely-used benchmarks (Section 3.3).

2. Solution

2.1. Problem description

This paper solves the multiple-shot person re-identification problem from the image set based recognition perspective, which aims at building correspondences between human image sets captured from different camera views. The persons in one camera usually have their corresponding matches in the other camera. Hence, as the widely-adopted precondition for this issue, for each query identity there is a corresponding target with the same identity in the corpus. Suppose, in the corpus-domain camera, there is one labeled image set per person, while in the query-domain camera, there is at least one image set for each person with unknown labels. Our target is to find the correct correspondences between the query and corpus sets to determine the identity labels of the query sets. In brief, given a query set Q ∈ Q where Q denotes the query domain and Q belongs to one fixed identity, our goal is to decide the identity of Q by comparing it to a collection of sets in the corpus domain X in which each corpus set X ∈ X is identity-specific as well.

Usually, re-identification can be formulated as a matching or retrieval problem. The difference between the two is that, in the matching case, we can use all the query-domain information, but in the retrieval case, we only have one query set each time. In contrast with matching, the retrieval setting seems more fundamental, and many re-identification works in the literature adopt such a setting. Therefore, we formulate multiple-shot person re-identification as a set-based retrieval problem. Each time, we rank all the corpus sets in response to the query set according to the final decision scores in ascending order. The query set is then assigned the same label as the corpus set that is ranked first.

2.2. Method formulation

2.2.1. Conventional set-to-set dissimilarity

When multiple-shot person re-identification is recast as the set-based retrieval problem, the re-identification accuracy is much influenced by the appropriateness of the set-to-set dissimilarity measure. Conventional set-to-set dissimilarities for re-identification can be divided into "minority-based measure" and "majority-based measure". Minority-based measure emphasizes the significance of local distances between the partial closest minorities of sets, while majority-based measure inclines to aggregating the global majorities of low-level distances of all set samples. The former has shown better performance than the latter for multiple-shot re-identification, which implies the periphery of sets carries more discriminative information than the central part in measuring set-based dissimilarity [12,28]. For minority-based measure, two exemplary methods are Minimum Point-wise Distance (MPD) [1] and Convex Hull based Image Set Distance (CHISD) [29]. MPD measures the minimum point-wise distance between two sets. MPD highly depends on the location of the two nearest points of the sets, so it is vulnerable to outliers caused by irregular within-class variations. CHISD improves this situation in a more active way: it constructs convex hulls to exclude the outliers located far away from the dominant part of the sets, before measuring the dissimilarity between the convex hulls.

2.2.2. Fair set-collaboration dissimilarity

The difference between MPD and CHISD lies in their behavior toward outliers. An outlier can be understood as an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism [30]. Outliers themselves cause wide disputes, because the way to treat them may greatly influence the results. Although CHISD does better than MPD in set-based recognition tasks, the behavior of CHISD is also thought-provoking, because it is very ambiguous what constitutes a "sufficient" deviation for a sample to be considered an outlier. It is indeed unfair to directly remove the outlying samples from each set when we are blind to whether they are noisy data that disturb the dissimilarity measure or informative patterns that bear discriminative information. Without prior semantic information about outliers, we instead seek a relatively fair way to deal with them during dissimilarity measure. (It is also common sense in life that relative fairness is more practical and reliable than absolute fairness.)

To realize such relative fairness, we take advantage of the within-domain coopetition information among corpus sets when they are matched to the query set. The within-domain coopetition information means the mutual cooperation and competition of all corpus sets. Technically, we incorporate the idea of "fair coopetition" into the set-collaboration model. In the set-collaboration aspect, we suggest employing the CRSSD model [16]. The training part of this model is formulated by
\[
\begin{aligned}
\arg\min_{\alpha_q,\, B}\;\; & \big\| Q_q \alpha_q - X B \big\|^2 \\
\text{s.t.}\;\; & \alpha_q = \big[\alpha_{q1}, \ldots, \alpha_{qN_q}\big]^{\top}, \quad \sum_{k=1}^{N_q} \alpha_{qk} = 1, \\
& B = \big[\beta_1^{\top}, \ldots, \beta_{|\mathcal{X}|}^{\top}\big]^{\top}, \quad \beta_j = \big[\beta_{j1}, \ldots, \beta_{jN_j}\big]^{\top}, \quad \sum_{j=1}^{|\mathcal{X}|} \sum_{l=1}^{N_j} \beta_{jl} = 1, \\
& 0 \le \alpha_{qk} \le \tau, \quad q = 1, \ldots, |\mathcal{Q}|, \quad k = 1, \ldots, N_q, \\
& 0 \le \beta_{jl} \le \tau, \quad j = 1, \ldots, |\mathcal{X}|, \quad l = 1, \ldots, N_j,
\end{aligned}
\tag{1}
\]

where \(\|\cdot\|\) denotes the l2-norm; q and j denote the set identities; \(|\cdot|\) denotes the cardinality, so \(|\mathcal{Q}|\) and \(|\mathcal{X}|\) indicate the set numbers in the query and corpus domains, respectively, where \(Q_q \in \mathcal{Q}\) and \(X_j \in \mathcal{X}\); \(N_q\) and \(N_j\) are the set sizes for \(Q_q\) and \(X_j\), respectively; \(\alpha_{qk}\) and \(\beta_{jl}\) are the kth and lth reconstruction coefficients of the vectors \(\alpha_q\) and \(\beta_j\), respectively; and \(\tau\) satisfies \(\tau \le 1\). After obtaining the optimal \(\alpha_q^{*}\) and \(B^{*}\), Eq. (2) has been suggested to measure the dissimilarity between \(Q_q\) and \(X_j\):

\[
D(Q_q, X_j) = \big\| Q_q \alpha_q^{*} - X_j \beta_j^{*} \big\|^2,
\tag{2}
\]

where \(B^{*} = [\beta_1^{*\top}, \ldots, \beta_{|\mathcal{X}|}^{*\top}]^{\top}\) and \(\beta_j^{*} = [\beta_{j1}^{*}, \ldots, \beta_{jN_j}^{*}]^{\top}\).
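To make Eqs. (1) and (2) concrete, the following toy sketch (illustrative only; the random data, set sizes, and the SLSQP solver are our assumptions, not the paper's implementation) solves the constrained least-squares problem of Eq. (1) and then evaluates the per-set residual of Eq. (2):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, Nq = 5, 4                         # feature dimension, query-set size (toy)
Qq = rng.normal(size=(d, Nq))        # query set: columns are samples
Xs = [rng.normal(size=(d, 3)), rng.normal(size=(d, 3))]  # two corpus sets
X = np.hstack(Xs)                    # all corpus samples pooled, as in Eq. (1)
tau = 1.0                            # upper bound on each coefficient

def residual(z):
    a, b = z[:Nq], z[Nq:]            # alpha_q (query) and B (corpus) coefficients
    return np.sum((Qq @ a - X @ b) ** 2)

cons = [{"type": "eq", "fun": lambda z: z[:Nq].sum() - 1.0},   # sum_k alpha_qk = 1
        {"type": "eq", "fun": lambda z: z[Nq:].sum() - 1.0}]   # sum_{j,l} beta_jl = 1
z0 = np.concatenate([np.full(Nq, 1.0 / Nq), np.full(X.shape[1], 1.0 / X.shape[1])])
res = minimize(residual, z0, bounds=[(0.0, tau)] * z0.size,
               constraints=cons, method="SLSQP")
alpha, B = res.x[:Nq], res.x[Nq:]

# Eq. (2): residual of the query against each individual corpus set
beta_j = [B[:3], B[3:]]              # split B back into per-set coefficients
D = [np.sum((Qq @ alpha - Xs[j] @ beta_j[j]) ** 2) for j in range(2)]
```

In such a scheme, the corpus set with the smallest residual D would be taken as the match for the query set.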
Eq. (1) treats all the sets in the corpus domain as a whole, and collaboratively represents a query set by all the corpus sets at the same time. Eq. (1) usually assigns large weights to the preferred corpus samples that most probably belong to the same class as the query set, small weights to the irrelevant ones far apart, and zeros to the irregular outliers. By identifying the corpus set which has the lowest reconstruction residual, we are likely able to acquire the correct class label of the query set. Eq. (2) is slightly different from the objective function of Eq. (1): it defines the dissimilarity score as the difference between the weighted averages of the query set of samples and each corpus set of samples, where the weights are just their corresponding coefficients. In this paper, we explain CRSSD from a new point of view: the intrinsic cooperation and competition among participants within collaborative representation. From the cooperation perspective, all
corpus sets simultaneously take part in the reconstruction process. The reconstruction residual, which is used as the dissimilarity score, considers the global distribution information of the sets to a certain degree. This global view can orient the behavior of convex hull construction in the direction of benefitting set-based classification as a whole, which implicitly reduces the blindness of constructing convex hulls to suppress outliers in the local areas. The cooperation spirit is the essence of CRSSD.

The strength of cooperation is, however, ensured by a deeper competition mechanism. During collaborative representation, all the samples in the corpus domain work against each other to compete for the large representation coefficients, but only a small number of suitable ones win. Deciding outliers from global competition is more unbiased and reliable than deciding them only from the local area. Nevertheless, we should be careful with such a competition mechanism: winners may not always be the best, and the crucial point is the rule, which agrees with our life experience as well. CRSSD constructs one big global convex hull for all the sets in the corpus domain in the learning stage, while, in the testing stage, for measuring the set-to-set dissimilarities, it rebuilds an individual small hull for each corpus set one by one based on the assigned coefficients. Hence, in testing, the hull size of the query set is fixed, but those of the corpus sets vary. This places the minority-based dissimilarity, which highly relies on the hull surfaces, under a non-uniform condition, which in turn incurs unfairness in the relative dissimilarity comparisons between different query-corpus set pairs. Such unfairness is harmful to the competition mechanism, and results in the degradation of the discrimination capability of CRSSD to some extent. To guarantee a fair condition for relative dissimilarity comparisons, we design the "fairness principle" for CRSSD.
This principle is carried out in terms of normalization from two aspects in the testing stage: the first is the set shapes; the second is the scale of the learned coefficients.

As the first aspect, we can imagine the sets are enveloped by hulls. The hull shape in fact reflects the within-set distribution variation. CRSSD measures the dissimilarity between the hull surfaces, so such a measure will unavoidably suffer from the abnormal hull shapes caused by the minority of irregular samples. To address this, we normalize the set shapes as follows:

\[
D^{\mathrm{Normalized}}(Q_q, X_j) = \left\| \frac{Q_q \alpha_q}{\|Q_q\|_{*}} - \frac{X_j \beta_j}{\|X_j\|_{*}} \right\|^2,
\tag{3}
\]

where \(\|\cdot\|_{*}\) denotes the nuclear norm. Thus, Eq. (3) simultaneously considers the majority-based robustness and minority-based discriminability.

The second aspect concerns the scale of the accumulated weight coefficients within the measured pair of sets. The sum of the weight coefficients for the query set can be either constant or variant, and is set to 1 here. The weights of the corpus sets are determined from coopetition, and they vary from set to set in the corpus domain. If a pair of corpus and query sets belong to the same class, the weights are more probably large; otherwise, the weights tend to be small. However, one conflicting issue is that large weights may tend to expand the difference between two sets, while small weights may shrink it. To solve this problem, we suggest normalizing the accumulation of the coefficients for the set pair as a whole in Eq. (4). This normalization can further mine the classification power from the weights during set-to-set dissimilarity measure:

\[
D^{\mathrm{Fairness}}(Q_q, X_j) = \frac{\left\| Q_q \alpha_q / \|Q_q\|_{*} - X_j \beta_j / \|X_j\|_{*} \right\|^2}{\sum_{k} \alpha_{qk} + \sum_{l} \beta_{jl}}.
\tag{4}
\]

Thus, the two aspects of normalization can ensure the relative set-based dissimilarity comparisons under a fairer condition.

2.2.3. Fair set-collaboration metric learning

Traditional metric learning improves the relative point-to-point distance comparison for single-shot re-identification. For multiple-shot re-identification, SBDR extends traditional metric learning to the set level, and it optimizes the relative set-to-set dissimilarity comparison relationship in favor of set-based discrimination [11]. For the image set classification/re-identification task, point based dissimilarity or metric learning ignores the important set-level information. The set based dissimilarity strategy seems more effective than simply fusing point based dissimilarities, because set based dissimilarities are able to handle the interferences from point-level noises due to their elaborate design [14,16,29,31]. We expect that set based metric learning can inherit and maintain such effectiveness from the in-built set based dissimilarity in a suitable way. Because collaborative representation can capture the global sample distribution information, the collaborative representation based dissimilarity is more robust to the set deviation problem in the local area, which is caused by set-level sparsity and variation, than direct dissimilarity measure between a pair of sets. Inspired by this, we propose FSCML to learn the relative comparison between set-collaboration dissimilarities under the new fairness principle.

FSCML uses the ranking-based model. For the query set \(Q_q\) indexed by q, we divide the corpus sets into two groups indexed by \(I^{+}_{Q_q}\) and \(I^{-}_{Q_q}\), where \(I^{+}_{Q_q}\) denotes the sets relevant to \(Q_q\), and \(I^{-}_{Q_q}\) denotes the irrelevant ones. Then, the desired ranking of corpus sets \(y^{*}_{Q_q} \in \mathcal{Y}\) (where \(\mathcal{Y}\) is the space of all feasible rankings) in response to \(Q_q\) will be the one that satisfies \(X_i \prec_{y^{*}_{Q_q}} X_j\) for all \(i \in I^{+}_{Q_q}\), \(j \in I^{-}_{Q_q}\). It means that in ground-truth, a relevant corpus set should always be ranked before any irrelevant corpus set with regard to the query set. A desired ranking algorithm is able to learn W, a symmetric, positive semi-definite matrix, which can successfully distinguish the correct ranking \(y^{*}_{Q_q}\) from any other incorrect rankings.

On the basis of the maximum-margin paradigm, FSCML can be formulated by the following framework:

\[
\begin{aligned}
\arg\min_{W}\;\; & \operatorname{tr}(W) + \frac{C}{|\mathcal{Q}|} \sum_{Q_q} \xi_{Q_q} \\
\text{s.t.}\;\; & \psi(Q_q, \mathcal{X}, y^{*}_{Q_q}, W) - \psi(Q_q, \mathcal{X}, y_{Q_q}, W) \ge \Delta(y^{*}_{Q_q}, y_{Q_q}) - \xi_{Q_q}, \\
& \xi_{Q_q} \ge 0, \quad \forall\, Q_q \in \mathcal{Q},
\end{aligned}
\tag{5}
\]

where tr(W) is the regularizer instantiated by the trace of W, and as an option, this term can also be replaced by \(\|W\|_F^2/2\), in which \(\|\cdot\|_F\) denotes the Frobenius norm; \(|\mathcal{Q}|\) is the set number in the query domain; \(\psi(Q_q, \mathcal{X}, y_{Q_q}, W)\) is the set-level joint feature map for a candidate ranking \(y_{Q_q}\); \(\Delta(y^{*}_{Q_q}, y_{Q_q})\) is the loss function; \(\xi_{Q_q}\) is the slack variable for \(Q_q\); and C is the trade-off parameter.

With the help of the decomposition property of the metric matrix, \(W = L^{\top}L \rightarrow W^{1/2} = L\), we design the set-level partial order feature \(\psi = \psi^{\mathrm{Fairness}}\), which encapsulates the fair set-collaboration dissimilarity of Eq. (4):

\[
\psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, y_{Q_q}, W) = \sum_{X_i \in I^{+}_{Q_q}} \sum_{X_j \in I^{-}_{Q_q}} y^{\mathrm{ranking}}_{ij} \, \frac{D_W^{\mathrm{Fairness}}(Q_q, X_i) - D_W^{\mathrm{Fairness}}(Q_q, X_j)}{|I^{+}_{Q_q}|\,|I^{-}_{Q_q}|},
\tag{6}
\]

where

\[
y^{\mathrm{ranking}}_{ij} =
\begin{cases}
1, & X_i \prec_{y_{Q_q}} X_j, \\
-1, & X_i \succ_{y_{Q_q}} X_j,
\end{cases}
\tag{7}
\]
as well as

\[
D_W^{\mathrm{Fairness}}(Q_q, X_i) = \frac{\left\| L Q_q \alpha_q / \|L Q_q\|_{*} - L X_i \beta_i / \|L X_i\|_{*} \right\|^2}{\sum_{k} \alpha_{qk} + \sum_{l} \beta_{il}},
\tag{8}
\]

and \(D_W^{\mathrm{Fairness}}(Q_q, X_j)\) is defined in the same manner.

\(\psi^{\mathrm{Fairness}}\) plays an important role in interpreting the multiple-class ranking problem from the binary classification perspective by means of separating the correct and wrong rankings. Given an input query \(Q_q\), the best W is expected to be the one that simultaneously makes

\[
y^{*}_{Q_q} \leftarrow \arg\max_{y_{Q_q} \in \mathcal{Y}} \psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, y_{Q_q}, W),
\tag{9}
\]

where \(y^{*}_{Q_q}\) is the ground-truth ranking of \(\mathcal{X}\) for \(Q_q\); note that \(y_{Q_q} \ne y^{*}_{Q_q}\) in most cases. One attractive property of \(\psi^{\mathrm{Fairness}}\) is that, for a fixed W, the ranking \(y_{Q_q}\) that maximizes \(\psi^{\mathrm{Fairness}}\) can be obtained by sorting \(\mathcal{X}\) based on descending \(D_W^{\mathrm{Fairness}}\) [11,32].

\(D_W^{\mathrm{Fairness}}\) is different from the sample-based Mahalanobis distance \(d_W\). Given two arbitrary samples \(x_i\) and \(x_j\), \(d_W(x_i, x_j)\) can be formulated as \(d_W(x_i, x_j) = \langle W, (x_i - x_j)(x_i - x_j)^{\top} \rangle_F\), where \(\langle \cdot, \cdot \rangle_F\) is the Frobenius inner product. This actually reveals the linear nature of \(d_W(x_i, x_j)\) with respect to W. As we may know, the discrimination capability of set-based measure relies on the strategy of aggregating the low-level points. As a discriminative set-to-set dissimilarity, \(D_W^{\mathrm{Fairness}}\) inescapably involves more complicated operations and calculations than \(d_W\), and thus it is non-linear and non-differentiable with regard to W. So, it is inconvenient to factor W out of \(D_W^{\mathrm{Fairness}}\) for optimization. Therefore, we need to treat \(D_W^{\mathrm{Fairness}}\) as a whole in the optimization process.

The loss function \(\Delta(y^{*}_{Q_q}, y_{Q_q})\) helps to quantify the penalty of predicting \(y_{Q_q}\) instead of \(y^{*}_{Q_q}\):

\[
\Delta(y^{*}_{Q_q}, y_{Q_q}) = 1 - S(Q_q, y_{Q_q}),
\tag{10}
\]

where \(S \in [0, 1]\) measures the ranking quality from the worst to the best. Multiple-shot re-identification needs to reward the correct ranking predictions whilst penalizing the wrong ones, just as in the single-shot case. The reciprocal rank of a query response is defined as the multiplicative inverse of the rank of the first correct match, and Mean Reciprocal Rank (MRR) is the average of such reciprocal ranks over the entire query collection. S can be instantiated by the improved MRR score, which has already been validated to be quite effective for single-shot re-identification:

\[
S(Q_q, y_{Q_q}) = \frac{1}{|\mathcal{Q}|} \sum_{Q_q \in \mathcal{Q}}
\begin{cases}
1/r_{Q_q}, & r_{Q_q} \le v, \\
0, & r_{Q_q} > v,
\end{cases}
\tag{11}
\]

where \(r_{Q_q}\) is the position of the first relevant item in response to \(Q_q\) in \(y_{Q_q}\), and v is the number of top ranked items to be considered, which is suggested to be 3 here. If the position of the first relevant item in response to \(Q_q\) in \(y_{Q_q}\) is larger than v, the reciprocal rank score is set to 0 [11,33].

Solving FSCML uses the alternating direction strategy between set-based dissimilarity finding and metric optimization [34]. In each iteration of optimization, with \(L = W^{1/2}\), \(Q_q\) and \(\mathcal{X}\) can be updated by \(Q_q \leftarrow L Q_q\) and \(\mathcal{X} \leftarrow L \mathcal{X}\), respectively. Then, \(\alpha_q\) and \(\beta_i\) in Eq. (8) can be obtained directly by Eq. (1). Although FSCML is non-convex in L, we find this model always converges quickly to a local minimum within quite a limited number of iterations in practice. The procedures of FSCML are presented in Algorithm 1.

Algorithm 1 Fair set-collaboration metric learning (FSCML).
Require: Collection of query sets: \(\mathcal{Q}\); collection of corpus sets: \(\mathcal{X}\); space of all feasible rankings of \(\mathcal{X}\) in response to each \(Q_q \in \mathcal{Q}\): \(\mathcal{Y}\); ground-truth rankings: \(\{y^{*}_{Q_1}, \ldots, y^{*}_{Q_{|\mathcal{Q}|}}\} \subset \mathcal{Y}\); trade-off parameter: C > 0; termination threshold: \(\epsilon > 0\).
Ensure: Metric matrix: W > 0.
1: Initialize W by a diagonal matrix whose diagonal values are the standard deviations of the training data on each feature dimension.
2: Initialize the bundle of working constraints: \(\mathcal{C} \leftarrow \emptyset\).
3: repeat
4:   Decompose the metric matrix W by \(W = L^{\top}L \rightarrow L = W^{1/2}\).
5:   Compute the representation coefficients \(\alpha_q^{*}\) and \(B^{*}\) by Eq. (1) using the updated data.
6:   Measure the fair set-collaboration dissimilarities \(D_W^{\mathrm{Fairness}}(Q_q, X_i)\) and \(D_W^{\mathrm{Fairness}}(Q_q, X_j)\) according to Eq. (8).
7:   Calculate the partial order feature \(\psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, y_{Q_q}, W)\) by Eq. (6) and the loss function \(\Delta(y^{*}_{Q_q}, y_{Q_q})\) by Eq. (10).
8:   Solve for the optimal W:
     \(\arg\min_{W} \operatorname{tr}(W) + \frac{C}{|\mathcal{Q}|} \sum_{Q_q} \xi_{Q_q}\)
     s.t. \(\forall (\bar{y}_{Q_1}, \ldots, \bar{y}_{Q_{|\mathcal{Q}|}}) \in \mathcal{C}\):
     \(\psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, y^{*}_{Q_q}, W) - \psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, \bar{y}_{Q_q}, W) \ge \Delta(y^{*}_{Q_q}, \bar{y}_{Q_q}) - \xi_{Q_q}\), \(\xi_{Q_q} \ge 0\).
9:   Update the bundle of active constraints:
10:  for q = 1 to \(|\mathcal{Q}|\) do
11:    \(\bar{y}_{Q_q} \leftarrow \arg\max_{y_{Q_q} \in \mathcal{Y}} \{\psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, y_{Q_q}, W) + \Delta(y^{*}_{Q_q}, y_{Q_q})\}\)
12:  end for
13:  \(\mathcal{C} \leftarrow \mathcal{C} \cup \{(\bar{y}_{Q_1}, \ldots, \bar{y}_{Q_{|\mathcal{Q}|}})\}\)
14: until \(\frac{1}{|\mathcal{Q}|} \sum_{Q_q} \Delta(y^{*}_{Q_q}, \bar{y}_{Q_q}) - \frac{1}{|\mathcal{Q}|} \sum_{Q_q} \big(\psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, y^{*}_{Q_q}, W) - \psi^{\mathrm{Fairness}}(Q_q, \mathcal{X}, \bar{y}_{Q_q}, W)\big) \le \frac{C}{|\mathcal{Q}|} \sum_{Q_q} \xi_{Q_q} + \epsilon\)
15: return W.

In brief, Eq. (5) is the whole FSCML framework, and Eqs. (6)–(11) are the details of the model components which together form the whole
framework. Eqs. (6)–(8) formulate the set-level partial order feature, which is based on FairCRSSD in a learned metric space, in a non-parametric way. Eq. (9) explains the target of the optimization based on set-level partial order feature. Eqs. (10) and (11) describe the loss function over rankings instantiated by MRR. In summary, the trade-off parameter C is the only critical parameter required to be specified. It balances the penalty and regularizer terms during learning and is set to 1 by default in our experiments. Over-fitting is a common problem that all the metric learning methods for re-identification have to face [17,20,28]. FSCML alleviates this problem by two aspects: “metric learning framework” and “dissimilarity measure core”. Firstly, FSCML relies on a suitable framework, the structural SVM, to learn the metric. For machine learning approaches which directly work on ERM (Empirical Risk Minimization), minimizing the empirical risk cannot be guaranteed equal to minimizing the expected risk when the number of training samples is limited, so the models are easy to incur over-fitting. SVMs have advantage in avoiding over-fitting because SVMs use the spirit of the SRM (Structural Risk Minimization) principle. The SRM principle
addresses over-fitting by balancing the model's complexity against its success at fitting the training data. SVM learning actually minimizes both the Vapnik–Chervonenkis dimension (confidence interval) and the approximation error (empirical risk) at the same time. FSCML therefore inherits the robustness of the structural SVM against over-fitting.

Secondly, as the dissimilarity measure core of FSCML, FairCRSSD is non-parametric and robust to local variations. FairCRSSD measures the set-to-sets distance between the nearest points in the convex hulls of the query set and the corpus sets under the fairness principle, without imposing any assumption on the data distribution. In contrast to complex parametric settings, which easily impair the predictive ability of learning models, the simple non-parametric nature of FairCRSSD prevents FSCML from being too tailored to the particularities of the training data. Furthermore, FSCML is robust to local data variations because it deals with the entire data domain collaboratively. Although CHISD, the dissimilarity measure core of SBDR, also belongs to hull-based modeling, CHISD is equally sensitive to the location of each individual set; it therefore tends to exaggerate the importance of handling the irregular variation of each set in its local area during the training of SBDR, at the cost of weakening the model's conformity to the global distribution and its generalization ability to new data.

3. Experiments

3.1. Dataset description

For multiple-shot re-identification, according to the widely-used experimental setting in the literature, each human image set should contain more than two images, which may be manually or automatically collected from image sequences [2,3,11–14,35]. We evaluate FSCML for multiple-shot person re-identification on widely-used benchmark datasets: i-LIDS-MA, i-LIDS-AA [3], ETHZ1 [36], Caviar4REID [37], and CUHK03 [5]. All of these datasets undergo real-scenario complexities.
The i-LIDS-MA and i-LIDS-AA datasets [3] are specifically established for multiple-shot person re-identification. They are collected from the i-LIDS MCTS videos captured by a multi-camera CCTV network at an airport arrival hall during busy hours. From these videos, i-LIDS-MA is made of manually-annotated person images, while i-LIDS-AA is obtained by an automatic detector and tracker. i-LIDS-MA contains 40 individuals, and each person has 46 frames selected by hand. i-LIDS-AA includes 100 individuals with different numbers of automatically generated frames per identity. i-LIDS-AA is more difficult than i-LIDS-MA, because its challenges are dramatically increased by the noise from detection and tracking. i-LIDS-MA and i-LIDS-AA have sufficient samples for each identity, in contrast to another dataset also extracted from i-LIDS MCTS, i-LIDS, which contains 479 images of 119 pedestrians [17]. Thus, they are more suitable than i-LIDS for the task of multiple-shot re-identification.

The ETHZ dataset [36] contains three video sequences of street scenes captured by two moving cameras mounted on a carriage. ETHZ1 consists of the first sequence, in which there are a total of 4857 images belonging to 83 pedestrians, and each pedestrian has 7 to 226 images. Despite not being collected from fixed views, ETHZ1 has been widely adopted for evaluating re-identification methods in both single-shot and multiple-shot settings [1,3,12,13,17,33,35,38,39]. The prominent challenges of ETHZ1 include serious illumination variations and occlusions.

The Caviar4REID dataset [37] is manually selected from a less-controlled video sequence filmed in a shopping center, which includes people walking alone, meeting with others, window shopping, and entering and exiting shops. There are 72 pedestrians in
this dataset: 50 of them have two views with 20 images per person, and the remaining ones have one view with 10 images per person. This dataset confronts illumination variations, pose changes, and occlusions across broad resolution changes.

The CUHK03 dataset [5] contains 14,096 images of 1467 pedestrians. This dataset is recorded by six surveillance cameras in an open area over months, where pedestrians walk in different directions and each identity is observed by two disjoint camera views with an average of 4.8 images per view. Illumination changes, misalignment, occlusions, and missing body parts are quite common. Both automatically-detected and manually-cropped bounding boxes are available for this dataset. We use the automatically-detected setting in experiments by default. The inaccurate detection results in this setting largely increase the re-identification difficulty.

All the images in the above-mentioned datasets are normalized to 48 × 128 pixels before experimentation. We illustrate the challenges of these datasets with some examples in Fig. 1.

3.2. Experimental setup

We adopt the well-known discriminative signature to describe the human images. This signature is the concatenation of three kinds of descriptors: Multi-Space Color Histograms (MSCH), Schmid-Filter-Bank (SFB), and Gabor-Transform (GT), which is denoted by "MSCH+SFB+GT" [17]. We choose this signature because it has been proved quite suitable and effective for the re-identification task and has also been broadly used in state-of-the-art methods [11,14,17,40]. Therefore, this signature can not only provide a good feature space for FSCML, but also ensure a fair baseline for the comparison with other competitive methods.

We demonstrate the effectiveness and robustness of FSCML. For effectiveness, we test the performance of FSCML as well as its modeling core, FairCRSSD. For robustness, we evaluate the generalization ability, trade-off parameter, and regularizer of FSCML.
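Although the exact MSCH, SFB, and GT descriptors follow [17], the general recipe of concatenating color histograms with filter-bank statistics into one signature vector can be sketched as below. This is a simplified illustration only, not the actual descriptor implementation: the binning, the random stand-in kernels, and all function names are our own assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def color_histograms(img, bins=8):
    # Per-channel normalized histograms; the real MSCH spans several color spaces.
    feats = []
    for ch in range(img.shape[-1]):
        h, _ = np.histogram(img[..., ch], bins=bins, range=(0.0, 1.0))
        feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)

def filter_responses(gray, kernels):
    # Mean absolute filter response per kernel (a stand-in for SFB/GT statistics).
    windows = sliding_window_view(gray, kernels[0].shape)
    return np.array([np.abs(np.einsum('ijkl,kl->ij', windows, k)).mean()
                     for k in kernels])

def signature(img, kernels, bins=8):
    # Concatenate color and texture descriptors into one signature vector.
    gray = img.mean(axis=-1)
    return np.concatenate([color_histograms(img, bins),
                           filter_responses(gray, kernels)])

# Toy usage on a random 48 x 128 "image" with four illustrative 5 x 5 kernels.
rng = np.random.default_rng(1)
img = rng.random((128, 48, 3))
kernels = [rng.standard_normal((5, 5)) for _ in range(4)]
sig = signature(img, kernels)
print(sig.shape)  # (28,) = 3 channels * 8 bins + 4 filter statistics
```

The resulting vectors can then be fed to any of the metric learning methods compared here.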
Moreover, we analyze the theoretical computational complexity and empirical running time of FSCML.

Following the universally-used evaluation protocol [3,11,13,14,37,40], we perform ten-fold cross-validation to demonstrate the method's effectiveness and robustness. In each trial, we randomly select a certain number of identities for testing whilst using all or part of the remaining identities for training, and we keep the same train-test splits for method comparison. In the testing stage, for each identity, we keep one image set in the corpus domain and leave the rest in the query domain; in each domain, every set is formed by at most five randomly gathered images.

The key difference between the evaluation protocols of the multiple-shot and single-shot settings lies in the set concept. Under the multiple-shot setting, in each trial, we have a set of samples for each identity. We need to build the correspondence between the query and corpus sets to determine the label of the query set. This process rests on the set-integrity assumption: the samples within each set are relevant. Under the single-shot setting, by contrast, we only have one sample for each identity in each trial, so we need to match the query samples against the corpus samples to decide the query sample label. Hence, compared with the single-shot case, in multiple-shot re-identification we have more opportunities to exploit set-based information, but we also confront the extra challenge of within-set variations.

3.3. Result analysis

3.3.1. Effectiveness

Current research on re-identification shows a trend of diversification. Every year, many new methods claim to
Fig. 1. Examples of i-LIDS-MA, i-LIDS-AA, ETHZ1, Caviar4REID, and CUHK03.
Fig. 2. Performances of FSCML and its closely-related methods.
achieve the state-of-the-art performance for re-identification. We believe that a deeper study of how the proposed model differentiates itself from closely-related works under shared conditions is more informative and meaningful than a pure competition on the final overall performance of all kinds of methods. Therefore, we conduct three groups of experiments to demonstrate the effectiveness of our method in comparison with related, comparable competitors. The first group includes SBDR [11], CRSSD [16], and CHISD [29], which are the approaches analogous and closely related to FSCML; the second group includes CRNP [14], SANP (Sparse Approximated Nearest Points) [31], CSA [13], CRC (Collaborative Representation Classification) [41], and MAD (Mean Approach Distance) [28], which are recently proposed collaboration-based and set-based models; the third group includes XQDA [23], kLFDA [20], MFA (Marginal Fisher Analysis) [20,42], and KISSME [19], which are state-of-the-art metric learning methods.

Results across datasets are illustrated by the CMC (Cumulative Match Characteristic) curves in Figs. 2–4, where p denotes the
number of testing identities and q is the number of training identities. The exception is ETHZ1, on which the competition happens in a finer range: the performance of FSCML is saturated, while the performances of the other methods fall between 98.6% and 99.8% recognition rates. It can be observed that, on the whole, despite the unstable behavior of the other competitors across datasets, FSCML delivers steady, encouraging performance: it leads the performance on i-LIDS-MA (p = 30), i-LIDS-AA (p = 70, 90), ETHZ1 (p = 50), Caviar4REID (p = 50), and CUHK03 (p = 100, 150), and it has the second-best performance on i-LIDS-AA (p = 30, 50). These results reveal that the learned metric matrix can successfully transfer the discriminability of the fair set-collaboration dissimilarity from the training data to the testing data.

In greater detail, in Fig. 2, both FSCML and SBDR perform remarkably. SBDR is a worthy opponent for FSCML: both methods belong to the set based metric learning methodology, which relies on supervision to enhance the set-based discrimination strength. Their biggest
Fig. 3. Performances of FSCML and collaboration-based and set-based methods.
Fig. 4. Performances of FSCML and point based metric learning methods.
difference is that SBDR constrains the comparison relationship between independent set-to-set dissimilarities, whereas FSCML constrains that between collaborative ones. Concretely speaking, when the size of the training data is small, as in the cases of i-LIDS-MA (q = 10), i-LIDS-AA (q = 10, 30), ETHZ1 (q = 33), Caviar4REID (q = 15), and CUHK03 (q = 15, 20), FSCML performs better than SBDR. When the training data are relatively sufficient, as on i-LIDS-AA (q = 50, 70), SBDR has the advantage, and it even beats FSCML by a narrow margin. These phenomena show that FSCML is robust to training data insufficiency, which is strongly related to its modeling gist: the learned fair set-collaboration information is useful for the problem of set-level sparseness and variation. From the graphical perspective, each set can be abstracted as a node, and the node-to-node relationship can then be described by an edge. The performance of both FSCML and SBDR highly depends on the quantity and quality of the node resource. Sparse training data leads to a low quantity of node resource, which necessarily restricts the learners' power. Nevertheless, FSCML further exploits the useful information from the edge resource to remedy such node resource poverty, which acts as the intrinsic mechanism enabling this method to behave robustly
on i-LIDS-MA (p = 30), i-LIDS-AA (p = 70, 90), ETHZ1 (p = 50), Caviar4REID (p = 50), and CUHK03 (p = 100, 150). On the other hand, as the number of nodes increases, the resources gradually become sufficient as well, and both learners perform better than before. However, when the resource quantity problem is alleviated, the quality problem becomes more obtrusive: the reliability of the collaborative information becomes more sensitive to the quality of the node resources. Hence, node resources contaminated by real-world noise unavoidably impede FSCML in its competition with SBDR for the champion performance on i-LIDS-AA (p = 30, 50).

It is hard to distinguish between the advantages and disadvantages of SBDR and CRSSD due to their similar performance. SBDR relies on supervised information while CRSSD resorts to collaborative information, and the power of both is subject to the size of the training data. Their tight race shows the similar discriminative power of the supervised independent set-to-set dissimilarity and the unsupervised collaborative set-to-set dissimilarity in multiple-shot re-identification.

Unsurprisingly, the advantage of CRSSD over CHISD is apparent. Although both CRSSD and CHISD belong to the convex hull based
set-to-set dissimilarity, CRSSD is more advanced than CHISD because it develops CHISD further by utilizing the collaborative information. This result clearly illuminates the significance of collaborative information for an effective set-to-set dissimilarity measure.

In Fig. 3, CSA builds the collaborative representation over the corpus samples to approximate the query samples via affine combinations, and measures the distance between the two nearest points on the query and corpus affine hulls. CSA loses to CRSSD because CSA is over-dependent on the discrimination power of outlying samples, which might be noisy outliers. CRNP combines the ideas of set-to-set dissimilarity measurement upon regularized affine hulls and set-based recognition by collaborative representation. This approach performs well, but it seems quite sensitive to the preprocessing step of data standardization. To ensure a fair comparison, we keep the same experimental conditions as for the other methods by running CRNP directly. Although the competence of CRNP is weakened in comparison with the original, it is still consistently superior to CRC, a method that measures the minimum point-wise distance between the query and corpus sets. The performance of CRC is inferior to the other competitors as well, mainly because the intermediate and final goals of CRC are separate: CRC first generates sample-based, rather than set-based, collaborative reconstruction residual scores before encoding these low-level scores into the set-to-set dissimilarity. In contrast to directly exploiting the set-level information for a set-based measure, this bottom-up strategy is inevitably susceptible to extra sample-level noise.

SANP and MAD produce similar performance in spite of their different formulations. SANP measures the dissimilarity between two sets upon the nearest points that are sparsely approximated from their respective sets.
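For reference, the minimum point-wise set-to-set distance mentioned above, and its variant in a learned metric space W = LᵀL, can be sketched as follows. This is a minimal illustration with toy data, not the implementation of any of the compared methods:

```python
import numpy as np

def min_pointwise_set_distance(Q, X, L=None):
    """Minimum point-wise distance between a query set Q (m x d) and a
    corpus set X (n x d). If a metric factor L (with W = L^T L) is given,
    the points are first projected so that Euclidean distance realizes
    the corresponding Mahalanobis metric."""
    if L is not None:
        Q, X = Q @ L.T, X @ L.T
    # Pairwise squared distances via the expansion of ||q - x||^2.
    d2 = (np.sum(Q**2, 1)[:, None] + np.sum(X**2, 1)[None, :]
          - 2.0 * Q @ X.T)
    return float(np.sqrt(np.maximum(d2, 0.0).min()))

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 16))                  # five images in the query set
X = Q[:3] + 0.01 * rng.standard_normal((3, 16))   # a near-duplicate corpus set
far = rng.standard_normal((4, 16)) + 5.0          # an unrelated corpus set
assert min_pointwise_set_distance(Q, X) < min_pointwise_set_distance(Q, far)
```

Replacing `.min()` with `.mean()` or `.max()` yields the average and maximum point-wise variants correspondingly.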
MAD incorporates the mean point-to-set distance into the set-to-set distance between the pair of sets. In fact, a shared spirit hides behind their different formulations: both use a suitable way to incorporate minority-based and majority-based information into dissimilarities. On the other hand, despite their similar performance, operation-based MAD runs obviously faster than optimization-based SANP.

We compare FSCML with point based metric learning approaches based on the same feature MSCH+SFB+GT in Fig. 4. Although they are metric learning methods for single-shot re-identification, the learned metric can be applied to the samples of the sets. Fusing the distance scores between samples from the pair of sets by the min-rule, sum-rule, or max-rule is equivalent to measuring the minimum, average, or maximum point-wise distance between sets, respectively. Since the minimum point-wise distance seems the most widely used [1,13,14,28,35], we choose the min-rule strategy to fuse point-wise distances in the metric space learned by these metric learning approaches. From the results, we can see that FSCML is more effective in the multiple-shot setting than these point based metric learning approaches using the simple fusion trick. We also find that, although the complexity and limited size of the training data substantially depress the performance of point based metric learning methods, FSCML still performs best. This further confirms the ability of FSCML to cope with the problems of data variation and sparseness.

To validate the value of the designed fairness principle, we further evaluate the models CRSSD, CRSSD under the fairness principle (denoted by "FairCRSSD"), and FSCML. From the results in Tables 1–5, we can see that their performances present a stair-type rise from CRSSD through FairCRSSD to FSCML.
Overall, the proposed FSCML algorithm outperforms the original CRSSD by about 2 to 3 percentage points on average even at rank-1, and its advantage is more obvious at other rank positions. On the other hand, the advantage of our proposed FSCML over our proposed FairCRSSD seems not large. This indirectly shows that FairCRSSD, the
Table 1
Model comparison on i-LIDS-MA (p = 30).

Model       Rank-1   Rank-3   Rank-5   Rank-7
FSCML       51.33    75.67    81.00    85.33
FairCRSSD   51.00    75.33    80.67    84.00
CRSSD       50.33    74.67    80.00    83.33

Table 2
Model comparison on i-LIDS-AA.

          Model       Rank-1   Rank-3   Rank-5   Rank-7
p = 30    FSCML       36.67    57.33    68.67    77.00
          FairCRSSD   34.33    56.33    68.67    75.67
          CRSSD       33.33    53.00    68.33    75.67
p = 50    FSCML       29.40    46.00    56.80    63.60
          FairCRSSD   27.60    45.60    56.40    62.80
          CRSSD       27.00    42.80    54.60    62.40
p = 70    FSCML       26.71    44.29    54.43    59.43
          FairCRSSD   25.43    43.29    54.00    59.29
          CRSSD       23.43    40.86    51.86    56.86
p = 90    FSCML       25.00    38.89    48.78    54.67
          FairCRSSD   23.89    38.22    47.78    53.89
          CRSSD       22.78    36.44    46.89    53.56

Table 3
Model comparison on ETHZ1 (p = 50).

Model       Rank-1   Rank-3   Rank-5   Rank-7
FSCML       100      100      100      100
FairCRSSD   100      100      100      100
CRSSD       99.80    100      100      100

Table 4
Model comparison on Caviar4REID (p = 50).

Model       Rank-1   Rank-3   Rank-5   Rank-7
FSCML       27.40    31.80    35.20    38.00
FairCRSSD   26.40    30.80    34.00    38.00
CRSSD       26.40    30.20    32.40    36.40

Table 5
Model comparison on CUHK03.

          Model       Rank-1   Rank-3   Rank-5   Rank-7
p = 100   FSCML       19.60    34.40    40.80    46.60
          FairCRSSD   19.20    33.50    40.30    45.90
          CRSSD       17.90    32.20    39.30    44.70
p = 150   FSCML       17.10    28.80    36.20    40.20
          FairCRSSD   15.90    27.50    32.80    36.90
          CRSSD       14.60    26.00    32.10    36.10
core part of FSCML, is a powerful set based dissimilarity, leaving limited space for performance enhancement through optimization. Even so, FSCML raises the performance on each dataset, and the enhancement is quite obvious on i-LIDS-AA (p = 30, 50) and CUHK03 (p = 150). Basically, by comparing CRSSD and FairCRSSD, we can confirm the positive role of the fairness principle in increasing the discriminability of the set-collaboration dissimilarity, because FairCRSSD invariably outperforms CRSSD; by comparing FairCRSSD and FSCML, we can see that supervision helps boost the discrimination capability of the fair set-collaboration dissimilarity, because FSCML consistently outstrips FairCRSSD. Specifically, the performance enhancement brought by FSCML on the basis of FairCRSSD is larger on i-LIDS-AA than on i-LIDS-MA, Caviar4REID, and CUHK03. This verifies the power of FSCML to overcome severe large intra-class variation and small inter-class difference. Although the more severe challenges of i-LIDS-AA depress all the methods' performance, they also supply the capable ones with a bigger stage. Last
Table 6
Comparison of state-of-the-art results on i-LIDS.

Method     Top-1 accuracy   Test identities   Reference
FSCML      65.17            p = 60            Proposed
LDFR-DGD   64.60            p = 60            CVPR2016 [7]
MCPB-CNN   60.40            p = 60            CVPR2016 [9]
LRME       50.30            p = 60            CVPR2015 [43]
TPCR       61.83            p = 60            ICPR2012 [40]
FSCML      66.40            p = 50            Proposed
DFL-RDC    52.10            p = 50            Pattern Recognit. 2015 [44]
CFDL       46.60            p = 50            WACV2016 [45]
SLM-DML    42.99            p = 50            Neurocom. 2015 [46]
TPCR       62.40            p = 50            ICPR2012 [40]

Table 7
Comparison of metric learning methods on CUHK03 (p = 100).

Method   Rank-1   Rank-3   Rank-5   Rank-7
FSCML    83.70    94.80    97.40    98.00
KISSME   79.90    93.40    96.60    97.60
MFA      76.10    90.90    95.70    97.20
kLFDA    73.30    90.60    94.00    96.30
pHGD     68.50    86.30    91.60    93.90

Table 8
Comparison of state-of-the-art results on CUHK03 (p = 100).

Method       Top-1 accuracy   Reference
FSCML        83.70            Proposed
LDNS         54.70            CVPR2016 [47]
JL-SIR&CIR   52.17            CVPR2016 [8]
SS-SVM       51.20            CVPR2016 [48]
LOMO+XQDA    46.25            CVPR2015 [23]
IDLA         44.96            CVPR2015 [6]
but not least, on ETHZ1, both FSCML and FairCRSSD are tied for first place with saturated performance. This shows that although the discriminatory capability of FairCRSSD is already strong enough to trivialize the additional power from FSCML, FSCML still maintains its robustness to the potential over-fitting risk in such a special situation.

FSCML is based on the extracted features, so its performance is unavoidably influenced and limited by them. Although feature representation is not the focus of this paper, we admit that introducing a more effective feature representation to our model could result in enhanced performance. Different datasets have different characteristics and difficulties, which can be better handled by means of different feature representations. Considering that Third-Party Collaborative Representation (TPCR) has obtained inspiring performance on i-LIDS [40], we further demonstrate the effectiveness of FSCML based on this feature on that dataset, using p/(119 − p) as the number of test/training identities. We also compare FSCML with the state-of-the-art methods including Learning Deep Feature Representations with Domain Guided Dropout (LDFR-DGD) [7], Multi-Channel Parts-Based CNN (MCPB-CNN) [9], Learning to Rank with Metric Ensembles (LRME) [43], Deep Feature Learning with Relative Distance Comparison (DFL-RDC) [44], Coarse-to-Fine Deep Learning (CFDL) [45], and Set-Label Modeling and Deep Metric Learning (SLM-DML) [46], because these methods are reported to achieve advanced performance. The results are recorded in Table 6, which reveal that FSCML has the ability to stay ahead of the advanced performance.

Considering that the projected Hierarchical Gaussian Descriptor (pHGD) combined with XQDA has obtained remarkable performance on CUHK03 [10], we evaluate FSCML and the metric learning methods KISSME, MFA, and kLFDA in the pHGD feature space with 100 test and 50 training identities under the multiple-shot setting. The results are given in Table 7. It can be seen that this feature space provides an effective platform for the metric learning approaches; among them, FSCML still leads the performance, owing to the suitably-exploited fair set-collaboration information. Then, we compare FSCML with methods that have recently achieved advanced performance, namely Learning a Discriminative Null Space (LDNS) [47], Joint Learning of Single-Image Representation and Cross-Image Representation (JL-SIR&CIR) [8], Sample-Specific SVM (SS-SVM) [48], LOMO+XQDA [23], and Improved Deep Learning Architecture (IDLA) [6]. We list the results in Table 8, which show the advantage of FSCML in comparison with them. In addition, we also test FSCML based on pHGD using the manually-cropped setting of CUHK03; FSCML attains 85.70% top-1 accuracy, which to our knowledge significantly outperforms the state of the art as well.

It is worth mentioning that, to demonstrate the advantage of the multiple-shot strategy over the single-shot strategy, the compared approaches in Tables 6 and 8 are based on single-shot tests. In particular, LDFR-DGD, MCPB-CNN, DFL-RDC, CFDL, JL-SIR&CIR, and IDLA are state-of-the-art deep model based methods for single-shot re-identification. Compared with them, the advantage of FSCML mainly reflects the benefits of the multiple-shot scheme.

In Tables 7 and 8, FSCML has achieved quite encouraging performance with limited training identities on CUHK03. Such effectiveness comes from three aspects. Firstly, FSCML is based on the capable pHGD feature space. This feature itself is quite powerful and even outperforms many advanced methods; cooperating with it, FSCML stands on the shoulders of giants. Secondly, FSCML takes advantage of the set-based information during multiple-shot re-identification. Set-based information provides a useful resource to cope with the problem of large intra-class variation and small inter-class difference of human image features. Thirdly, FSCML is robust to training data insufficiency thanks to the suitably-exploited fair set-collaboration information, which helps tackle the problem of set-level sparseness and variation of the training data and thus much enhances the generalization ability of the model.

3.3.2. Robustness

Under the ideal distributional assumption that training and testing data are independent and identically distributed, metric learning methods can perform well.
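Since the robustness results that follow are summarized by MRR values, the truncated reciprocal-rank score of Eq. (11) can be sketched as a reference. This is a toy illustration with hypothetical rank inputs, not tied to any dataset:

```python
def truncated_mrr(first_relevant_ranks, v=3):
    """Mean reciprocal rank truncated at position v, in the spirit of
    Eq. (11): a query contributes 1/r when its first correct match sits
    at rank r <= v, and 0 otherwise."""
    scores = [1.0 / r if r <= v else 0.0 for r in first_relevant_ranks]
    return sum(scores) / len(scores)

# Three queries whose first correct matches appear at ranks 1, 2, and 5;
# the last one exceeds the cut-off v = 3 and therefore scores 0.
print(truncated_mrr([1, 2, 5]))  # (1 + 1/2 + 0) / 3 = 0.5
```

A higher value condenses a better CMC curve into a single number, which is why the parameter and regularizer studies below report it.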
Nevertheless, the potential distribution difference means that methods which fit the training data too strongly will generalize poorly to new testing data. Such over-fitting is especially obvious when the training and testing data come from different datasets of different scenarios, which in turn provides an opportunity to challenge the generalization ability of FSCML. We randomly collect 15 and 105 identities, respectively, from the CUHK03 dataset as third-party training data to learn the metric in the MSCH+SFB+GT feature space, and then use the metric to test the identities from i-LIDS-MA (p = 30), i-LIDS-AA (p = 50, 70), and Caviar4REID (p = 50). Because the training data always differ from the testing data, we do not have to update the training data in each trial of cross-validation. The results in Fig. 5 show the good generalization ability of FSCML in comparison with SBDR, which reflects its robustness to over-fitting even with a limited training data size.

The robustness of FSCML to over-fitting can be partly attributed to its simple parameter setting. Here, we evaluate how the trade-off parameter C influences the performance of FSCML, using i-LIDS-AA (p = 30, 50, 70) as representatives. The results are given by the MRR value, which is a condensed description
Fig. 5. Generalization ability comparison between FSCML and SBDR.
Table 9
Evaluation on the trade-off parameter C of FSCML.

C        0.001   0.01    0.1     1       10      100     1000
p = 30   51.46   51.38   51.60   51.60   51.59   51.34   51.76
p = 50   42.41   42.48   42.25   42.34   42.31   42.34   42.55
p = 70   39.32   39.30   39.48   39.41   39.34   39.36   39.34

Table 10
Evaluation on the regularizer of FSCML.

Regularizer   i-LIDS-MA (q = 10)   i-LIDS-AA (q = 70)   i-LIDS-AA (q = 50)   i-LIDS-AA (q = 30)
tr(W)         64.69                51.60                42.34                39.41
‖W‖²F/2       64.39                51.61                42.28                39.31

Regularizer   i-LIDS-AA (q = 10)   ETHZ1 (q = 33)   Caviar4REID (q = 15)   CUHK03 (q = 15)   CUHK03 (q = 20)
tr(W)         35.68                100              33.51                  30.60             26.81
‖W‖²F/2       35.68                100              33.54                  30.57             25.47
of CMC. From Table 9, we can see that, in spite of minor performance fluctuations, FSCML is robust to the change of C overall.

We further tentatively evaluate the role of the regularizer of FSCML by comparing the optional regularizer terms tr(W) and ‖W‖²F/2 in Eq. (5), based on the MRR value in the MSCH+SFB+GT feature space. According to the results in Table 10, these two regularizer terms produce little performance difference, which also reflects the robustness of the model itself to some extent.

3.3.3. Complexity and cost

From the theoretical perspective, we analyze the computational complexity of training FSCML and SBDR, respectively. Suppose the feature dimensionality is d, the training data size is u, the average set size is m, the set number is e, and the number of iterations required until model convergence is h for FSCML and g for SBDR. Note that, in actual experiments and most real applications, the average set size is much smaller than both the full training data size and the high dimensionality of the feature space.

In each iteration during the training of FSCML, the major computational cost comes from metric decomposition, data projection, and the FairCRSSD measure between all pairs of sets. Decomposing the learned metric costs O(d³). Using the decomposed metric to map all the training data costs O(ud²). As FairCRSSD solves a convex QP problem based on the linear kernel, to reduce the computational
Table 11
Training time of FSCML and SBDR (unit: second).

Method   i-LIDS-MA (q = 10)   i-LIDS-AA (q = 70)   i-LIDS-AA (q = 50)   i-LIDS-AA (q = 30)
FSCML    5.2                  370.3                157.5                73.0
SBDR     15.0                 1370.9               554.8                147.8

Method   i-LIDS-AA (q = 10)   ETHZ1 (q = 33)   Caviar4REID (q = 15)   CUHK03 (q = 15)   CUHK03 (q = 20)
FSCML    6.4                  52.6             4.2                    13.0              13.4
SBDR     32.2                 71.8             6.6                    98.2              138.8
cost, the linear kernel matrix can be computed and stored first, which costs O(u²de); quadratic programming then costs O(u³e) considering the asymptotic worst-case bound. Multiplying by h and keeping the dominant term, we obtain the computational complexity of training FSCML: O(max(hd³, hud², hu²de, hu³e)). More concretely, the complexity is O(hd³) if d > ue^(1/2); O(hu²de) if u < d ≤ ue^(1/2); and O(hu³e) otherwise. From these results, we can see that the complexity of learning FSCML is determined by the training data size, the number of training sets, the feature dimension, and the required iteration number, which is relevant to the model and data properties. On the other hand, in each iteration during the training of SBDR, the major computational cost is similar to that of FSCML, except that measuring CHISD between all pairs of sets costs O(u²d). Thus, the computational complexity of training SBDR is O(max(gd³, gud², gu²d)); more concretely, it is O(gd³) when d > u and O(gu²d) when d ≤ u. Unlike FSCML, the complexity of SBDR is not influenced by the set number of the training data, because the dissimilarity measure within SBDR does not take advantage of the set-collaboration information in the training data domain.

Nevertheless, in general, theoretical complexity comparisons are no warranty of exactly the same outcome in practice. To show the computational cost of FSCML, we further analyze the real running time of both methods empirically based on the feature MSCH+SFB+GT. The code is implemented in MATLAB R2014a on an Intel(R) Core(TM) i7-4790K CPU @ 4.00 GHz with 16 GB memory; it is worth mentioning, however, that the actual CPU utilization and memory usage are far below these specifications. The training time reported in Table 11 shows that FSCML is much more efficient than SBDR, even though the program running
time is influenced by the hardware and software conditions; we admit that other computing machines and programming environments might accelerate the running speed of both methods.

4. Conclusion

This paper has formulated multiple-shot person re-identification as a set-based metric learning problem. In particular, to address the challenges of set-level sparseness and variation in human image data, FSCML was proposed to learn a metric that enhances the discriminative capability of the set-collaboration dissimilarity under the fairness principle. Experimental results have demonstrated the advantages of FSCML in terms of effectiveness and robustness. Admittedly, even more encouraging results can be anticipated in the future as human image features develop further.

It is worth mentioning that although FSCML is specially designed for multiple-shot person re-identification, the approach also has the potential to tackle other vision problems that share similar characteristics with, or are closely related to, re-identification, such as face verification, gait recognition, cross-camera tracking, and so forth. Our future work will include extending FSCML to these problems. Moreover, although deep learning has gained popularity recently, the proposed method actually follows a different methodological direction from deep learning. Even so, this proposal may provide inspiration and reference for exploiting multiple-shot image information within the deep learning framework as well, which will also be considered in our future work.

On the other hand, this paper formulates multiple-shot re-identification as a set-based retrieval problem. Frankly, re-identification methods based on this formulation usually cannot directly handle the special situation in which the query does not belong to any known identity in the corpus.
This situation poses a new challenge to the traditional problem precondition as well as to the retrieval formulation in re-identification research: given a query, one more procedure is needed to judge whether the target exists in the corpus before performing re-identification. This will also be a topic of our future research.

Acknowledgments

This work has been supported by the Natural Science Foundation of Jiangsu Province of China (Grant Numbers: BK20160697 and BK20150634); the Fundamental Research Funds for the Central Universities (Grant Number: 2242016K40048); and the Open Research Fund of Jiangsu Key Laboratory of Remote Measurement and Control Technology (Grant Number: YCCK201501).
Wei Li is an assistant professor with the School of Instrument Science and Engineering, Southeast University, China. He received the bachelor's and master's degrees from Southeast University in 2007 and 2010, respectively, and the doctoral degree from Kyoto University in 2014. His research interests include computer vision, pattern recognition, and machine learning.
Jianqing Li is a professor with the School of Instrument Science and Engineering, Southeast University, China. He received the B.S. and M.S. degrees from the School of Automation at Southeast University in 1986 and 1992, respectively, and the Ph.D. degree from the School of Instrument Science and Engineering at Southeast University in 2000. His current research interests include intelligence science and technology, signal processing, wireless sensor network applications, and robotics.
Lifeng Zhu is an assistant professor with the School of Instrument Science and Engineering, Southeast University, China. He received the doctoral degree in computer science from Peking University in 2012. He worked as a postdoctoral researcher at the University of Tokyo from 2012 to 2013 and at the University of Pennsylvania from 2013 to 2015. His research topics are visual computing and human-computer interaction.