J. Vis. Commun. Image R. 59 (2019) 527–536
A multi-image Joint Re-ranking framework with updateable Image Pool for person re-identification

Mingyue Yuan a,b, Dong Yin a,b,*, Jinwen Ding a,c, Zhipeng Zhou a,b, Chengfeng Zhu a,b, Rui Zhang a,b, An Wang a

a School of Information Science Technology, USTC, Hefei, Anhui 230027, China
b Key Laboratory of Electromagnetic Space Information of CAS, Hefei, Anhui 230027, China
c Department of Material Science and Engineering, USTC, Hefei, Anhui 230026, China
Article history: Received 3 January 2018; Revised 22 November 2018; Accepted 26 January 2019; Available online 29 January 2019.
Keywords: Person re-identification; Image pool; Multi-shot; Re-ranking
Abstract

Real-world video surveillance has an increasing demand for person re-identification. Existing multi-shot works usually aggregate single-sample features by computing average features or by using time-series models. The Multi-image Joint Re-ranking framework with updateable Image Pool that we propose takes a different approach. First, we define the term 'Image Pool' to store image samples for each pedestrian, together with updating rules that optimize its representativeness. Second, we compute the initial ranking list of every sample in the Image Pool and propose the 'Multiple-image Joint Re-ranking' algorithm to aggregate these initial ranking lists: a rank score is calculated for part of the elements of the initial ranking lists, and the final ranking list is obtained by sorting the rank scores in ascending order. We validated our re-ranking results on the Market-1501, iLIDS-VID, PRID-2011 and our ITSD datasets, and the results outperform other methods.
© 2019 Elsevier Inc. All rights reserved.
1. Introduction

Recognizing the same object is an important task in computer vision [1]. In recent years, person re-identification (re-id) has developed rapidly out of object matching and recognition. The task is to identify the same person in a cross-camera setting [2]. Accurate re-id is not only useful for single-camera tracking [3] but also crucial for robust wide-area person tracking. Because of the great differences in appearance caused by environmental and geometric variations, person re-id is still a challenging task.
In general, person re-id works fall into two classes: image-based methods and multi-shot methods. In traditional image-based re-id research, matching two cropped pedestrian images by low-level features is the basic method [4]. Furthermore, the method of calculating the distance between the probe image and the gallery images also affects accuracy. A number of works on metric learning [5–8] and re-ranking [9,10] have been shown to improve accuracy.
This paper has been recommended for acceptance by Zicheng Liu.
* Corresponding author at: Key Laboratory of Electromagnetic Space Information of CAS, School of Information Science Technology, University of Science and Technology of China, Hefei 230027, China. E-mail address: [email protected] (D. Yin).
https://doi.org/10.1016/j.jvcir.2019.01.041
1047-3203/© 2019 Elsevier Inc. All rights reserved.
In another research branch, multi-shot re-id works use a sequence of a person's video frames rather than a single image. Traditional multi-shot methods extract pose information from image sequences [11]. McLaughlin et al. [3] were the first to apply deep learning to the video re-id problem, in 2016; they use a Long Short-Term Memory (LSTM) network to aggregate frame-wise person features in a recurrent manner [12]. Even though multi-shot methods improve accuracy effectively, the re-id accuracy saturates as the number of frames increases [13]. In addition, multi-shot methods have limitations in real-world multi-camera systems: their time complexity is too high to meet real-time requirements.
For public safety purposes, large networks of cameras are increasingly deployed in public places such as airports, railway stations, campuses and shopping malls. Manual analysis of these video data is error-prone, time consuming and expensive, so automatic video analysis is necessary. Tracking people across multiple cameras is essential for wide-area scene analytics, and person re-id is a technical foundation of multi-camera tracking [14]. Video provides several frames of data for the target, so multi-shot re-id is more suitable for real-world tasks. However, there is still a great gap between existing research and real-world application. For example, the mainstream multi-shot approach is to compute the average feature vector for each person [15], so the diversity of multi-shot features is not fully expressed.
In addition, neighboring frames in a video sequence are very similar, so redundancy is high when sequential frames from video footage are used directly. Moreover, when multiple images are used to query separately, an algorithm is needed to aggregate their query results. To solve these problems, we combine re-ranking and a multiple-image method for the re-id task.
In this paper, we propose the Image Pool to store images of every identity. For each Image Pool, we maximize the diversity of samples through our updating rules. Then we use the proposed Multi-image Joint Re-ranking framework to identify the target person.
The contribution of this paper is threefold. (1) The Image Pool is defined following the criterion of maximizing diversity to ensure the representativeness of the sampled data. (2) We propose a Multi-image Joint Re-ranking framework, which aggregates the ranking lists of the Image Pool to produce the re-ranking list. (3) A new person re-id dataset called the 'Indoor Train Station Dataset (ITSD)' was created to evaluate the performance of the proposed framework. Existing video datasets such as iLIDS-VID and PRID-2011 have only two camera views, while ITSD contains 8 cameras in a real-world surveillance system and 443 identities, of which 50 identities are captured by at least 3 disjoint cameras. Furthermore, the bounding boxes of ITSD are tighter, and the dataset has more complex scenes and viewpoints. Some instances are occluded by objects and other pedestrians, which matches the real-world scenario. In the end, we evaluated the performance of our framework on ITSD and in the real world, and we compared our results with state-of-the-art methods on the Market-1501, iLIDS-VID and PRID-2011 datasets.
2. Related work

2.1. Feature extraction

The performance of a person re-id algorithm mostly hinges on the detection of discriminative features. Some classical works [1] improve viewpoint invariance by combining spatial and color information, while others use supervised-learning-based methods to map raw features into a new space that improves discriminative power [16–19]. For instance, mapping low-level color and texture features together with the body configuration constructs more discriminative feature vectors.
With the rapid development of Convolutional Neural Networks (CNN), manual feature engineering has gradually been replaced. CNNs have become the standard for feature extraction [20], and together with the application of deep learning networks, recognition accuracy has been greatly improved compared with traditional approaches. Some CNN-based approaches learn global features [21–23], such as ResNet and VGG. Others learn a combination of global and local features to obtain more discriminative descriptors [10,24]. For example, some works [25–27] divide the whole body into several parts, and some use pose estimation methods [28–30] to extract features. When dealing with the cross-camera situation, a joint learning framework [24] is used to unify SIR and CIR; this framework exploits the connection between single-image and cross-image representation methods. Furthermore, some works focus on unsupervised learning. For example, RULER [31] uses an unsupervised transfer method to learn a robust visual representation across cameras, DNPR [32] exploits the camera-pair matching cost in a camera network to improve performance, and PUL [33] obtains discriminative feature representations without using labeled training data. Unsupervised learning is becoming the trend, since it can help learn more discriminative features and exploit more training data.
2.2. Distance metric

Distance metric learning is used to emphasize inter-person distance and de-emphasize intra-person distance. Various traditional methods have been proposed for the re-id task, such as Relaxed Pairwise Learned Metric (RPLM) [17], Large Margin Nearest-Neighbor (LMNN) [18], KISSME [34], XQDA [5] and Relevance Component Analysis (RCA) [35]. They are widely used and discussed in re-id works from the early stage of research [17,36]. With the development of CNNs, deep metric learning methods assist CNNs in learning features in an end-to-end fashion through various metric learning losses, such as the contrastive loss [37], triplet loss [38], improved triplet loss [39], quadruplet loss [40], and triplet hard loss [41]. The triplet loss is motivated by the margin enforced between positive and negative pairs. Furthermore, selecting suitable samples for the training model through hard mining has been shown to be effective [40–42]. Combining a softmax loss with a metric learning loss to speed up convergence is also a popular method [43].

2.3. Multiple images for re-id

Multiple-image methods can be exploited to improve accuracy in realistic scenarios. Existing methods for multi-shot re-id include collecting interest-point descriptors over time [44] and training feature classifiers over multiple frames [45]. In addition, supervised-learning-based methods have also been used in re-id research, such as learning a distance-preserving low-dimensional manifold [46] or learning to map among the appearances in sequences [47]. Furthermore, some approaches explicitly model video, for instance using a conditional random field (CRF) to ensure that similar images in a video sequence receive similar labels [48]. In some works, a number of key frames are selected from an individual's video sequence; for instance, Cong et al. [49] selected ten key frames. Researchers apply this to reduce the final signature dimensionality. Block sparse recovery [15] constructs a feature dictionary D to model high-dimensional visual data with sparse vectors.
Person re-id has a lot in common with image classification and instance retrieval tasks in terms of testing and training [1]. Therefore, we can transfer works that have achieved promising results in classification tasks [50] and retrieval tasks [51] to the re-id task. For example, AL-LUPI [52] uses an optimization method to ensure that the selected samples are representative, exploiting the diversity and uncertainty measurements of an image cluster. In this paper, inspired by AL-LUPI, a criterion of robustness is proposed; with the requirements of maximizing diversity and high robustness, we developed Algorithm 1.

2.4. Re-ranking method

Re-ranking is a step that promotes relevant images to higher ranks starting from an initial ranking list. Re-ranking methods have been studied to improve object retrieval accuracy in several works [53]. Shen et al. [54] achieve the re-ranking goal using the k-nearest neighbors method to explore similarity relationships; the new score of each image is calculated from its positions in the ranking lists that have been produced. Chum et al. [55] proposed the average query expansion method to re-query the database, where a new query vector is obtained by averaging the vectors in the top-k returned results.
To take advantage of negative samples, Arandjelovic et al. [56] developed query expansion using a linear SVM, where the distance from the decision boundary is employed to revise the initial ranking list. POP [57] optimizes the post-rank stage by one-shot negative feedback selection. Common-Near-Neighbor analysis [58] combines the relative information and direct information of each pair of samples to obtain re-ranking results. The Discriminant Context Information Analysis based post-ranking approach [59] obtains initial ranking lists from different feature representations; the Stuart ranking aggregation method is then employed to combine the complementary ranking lists. In summary, re-ranking is a very interesting research direction: it can greatly improve re-id accuracy at a small computational cost.
3. Proposed method

Considering the diversity and representativeness of the queried data, we use multiple images of one pedestrian to form a cluster, which we name the Image Pool. To improve re-id accuracy, we propose a framework named the 'Multiple-image Joint Re-ranking' framework. In this section we explain the framework in four parts. In the first part, the updating rules of the Image Pool are introduced. In the second part, the principle of the proposed 'Multiple-image Joint Distance' calculation is explained. On top of that, the 'Multiple-image Joint Re-ranking' algorithm is introduced in the third part. In the end, we give an overall review of the framework.
3.1. Image Pool and the rules of updating the Image Pool

In a multi-camera system, we are provided with views of the same pedestrian from different positions; each person is recorded from several viewpoints. Here we assume that abundant images of the target pedestrian improve the performance of person re-id. To store the image samples of each pedestrian, we propose the term 'Image Pool'. Define the Image Pool of each identity as X_ID = {q_i | i = 1, 2, ..., M}, where ID is the identity number and q_i represents the ith image in the Image Pool. An exemplar Image Pool is shown in Table 1.
The next step is to collect images for the Image Pool. With the help of an automatic detecting and tracking system, a huge amount of effort can be saved in tagging and collecting pedestrian images [60]. The multiple-image re-id method corresponds better to realistic settings. Therefore, in our research, we focus on optimizing the representativeness of the Image Pool as well as improving the performance of the person re-id algorithm.
When dealing with sequence-based datasets, we simply select images based on the criterion of maximizing diversity. When working with image-based datasets, we consider each queried image as the main image of an Image Pool at the beginning, then visit the labels and select images following the criterion of maximizing diversity. When dealing with a surveillance video, for each identity the Image Pool is initialized by the automatic tracking system [60] with the first 5 frames, and the Image Pool is then updated as the automatic tracking system keeps running. However, if the automatic tracking system loses the target, the updating process is terminated until the system is restarted once a rematch is confirmed. Once the Image Pool is built successfully, we can move on to the re-ranking process.
Besides maximizing diversity, we also have to take the uncertainty of the Image Pool distribution into consideration. Therefore, inspired by AL-LUPI [52], the criterion of robustness is proposed. With the requirements of maximizing diversity and high robustness, we developed Algorithm 1; the pseudocode is shown below.
Algorithm 1. The rules of updating the Image Pool
Input: queried image set of one identity {I(n_1, 1), I(n_1, 2), ..., I(n_1, f)}
Begin
  switch Environment do
    case Sequence-based datasets:
      Calculate the diversity factor λ_i (i = 1, 2, ..., f) of every queried image by Eqs. (1)–(4).
      Rank the images by λ_i in descending order.
      Select the M images in the top M positions of the list; the image at the top of the list is regarded as the main image.
      Image Pool: X = {q_i | i = main, 1, 2, ..., M − 1}
    case Image-based datasets:
      Regard every queried image as the main image of an Image Pool.
      Calculate the diversity factor λ_i (i = 1, 2, ..., f) of every queried image by Eqs. (1)–(4).
      Rank the images by λ_i in descending order.
      Select the M − 1 images in the top M − 1 positions of the list as assist images.
      Image Pool: X = {q_i | i = main, 1, 2, ..., M − 1}
    case Real-world surveillance system:
      Initialization: get the initial Image Pool X = {q_i | i = main, 1, 2, ..., M − 1} as in the sequence-based case.
      Updating:
      if the system still tracks the target then
        Get the current frame of the target and name it s.
        Add s to the queried image set temporarily, generating a new set {s, I(n_1, 1), I(n_1, 2), ..., I(n_1, f)}.
        Calculate the diversity factors λ_i by Eqs. (1)–(4).
        Rank the images by λ_i in descending order.
        if the position of s in the list < M then
          Add s to the queried image set and select the M − 1 images in the top M − 1 positions to form the new Image Pool.
        else
          Do not update the queried image set; Image Pool = initial Image Pool.
      else
        Stop updating.
Output: Image Pool
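To make the selection rule concrete, below is a minimal Python sketch of the sequence-based case of Algorithm 1, following Eqs. (1)–(4) as reconstructed below. The feature representation, the helper names (diversity_factors, build_image_pool) and the small constant eps standing in for ε are our own illustrative assumptions, not part of the original description.

```python
import numpy as np

def diversity_factors(feats, eps=1e-8):
    """Diversity factor of Eqs. (1)-(4) for each queried image.

    feats: (f, d) array, one feature vector per queried image.
    Returns an (f,) array; larger values mean more diverse and robust samples.
    """
    f = feats.shape[0]
    # Pairwise Euclidean distances E(q_i, q_n).
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    factors = np.empty(f)
    for i in range(f):
        others = np.delete(dists[i], i)            # distances to the other images
        top1, top2 = np.sort(others)[:2]           # smallest and second smallest
        p_i = 1.0 / (abs(top1 - top2) + eps)       # Eq. (1): uncertainty measurement
        e_i = dists[i].mean()                      # Eq. (2): mean distance
        s2_i = ((dists[i] - e_i) ** 2).mean()      # Eq. (3): variance of the distances
        factors[i] = p_i * s2_i                    # Eq. (4): diversity factor
    return factors

def build_image_pool(feats, M):
    """Sequence-based case of Algorithm 1: keep the M most diverse images."""
    order = np.argsort(-diversity_factors(feats))  # descending diversity factor
    return order[0], order[1:M]                    # main image index, assist image indices

# Example: 20 queried images with 2048-dim features, pool size M = 3.
feats = np.random.rand(20, 2048)
main, assists = build_image_pool(feats, M=3)
print("main image:", main, "assist images:", assists)
```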
The uncertainty measurement is calculated as follows:

P_i = \frac{1}{\left| E(q_i, q_{\mathrm{top1}}) - E(q_i, q_{\mathrm{top2}}) \right| + \varepsilon}    (1)

E_i = \frac{1}{M} \sum_{n=1}^{M} E(q_i, q_n)    (2)

s_i^2 = \frac{1}{M} \sum_{n=1}^{M} \left( E(q_i, q_n) - E_i \right)^2    (3)
Table 1
An Image Pool sample. Each column represents a different camera; I(n_x, f) is photo number f taken by camera number x.

Camera_1    Camera_2    Camera_3    Camera_4    Camera_5    Camera_6    Camera_7    Camera_8
I(n_1, 1)   I(n_2, 4)   I(n_3, 7)   I(n_4, 10)  I(n_5, 13)  I(n_6, 16)  I(n_7, 19)  I(n_8, 22)
I(n_1, 2)   I(n_2, 5)   I(n_3, 8)   I(n_4, 11)  I(n_5, 14)  I(n_6, 17)  I(n_7, 20)  I(n_8, 23)
I(n_1, 3)   I(n_2, 6)   I(n_3, 9)   I(n_4, 12)  I(n_5, 15)  I(n_6, 18)  I(n_7, 21)  I(n_8, 24)
\lambda_i = P_i \cdot s_i^2    (4)
In Eq. (1), the term P_i (i = 1, 2, ..., M) represents the uncertainty measurement of each image of the image set {q_i | i = 1, 2, ..., M}. We calculate the Euclidean distance between q_i and each of the remaining images of the set and sort these distances in increasing order. In this way we obtain the smallest Euclidean distance E(q_i, q_top1) and the second smallest Euclidean distance E(q_i, q_top2). To avoid a zero denominator, a positive constant ε that approaches zero is added to the denominator. The closer E(q_i, q_top1) and E(q_i, q_top2) are, the smaller the denominator becomes and the larger P_i becomes; a large P_i indicates smaller uncertainty. Eq. (2) calculates the mean of the Euclidean distances and Eq. (3) calculates their variance. In many cases, the variance can represent the distribution of the samples, so we use it to express the diversification relationship between q_i and the samples of the whole image set. In Eq. (4) we use λ_i to determine the diversity between image q_i and the entire image set.

3.2. Multiple-image Joint Distance

In this section, we introduce the term 'Multiple-image Joint Distance', which describes the Euclidean distance between the Image Pool and each image in the gallery. First, the gallery set G is represented as G = {g_j | j = 1, 2, ..., N}, where N is the total number of images in the gallery and g_j is the jth image of the gallery. We then calculate the distance between each image in the gallery G and the Image Pool of one pedestrian and define it as S(g_j, X_ID). Next, we sort S(g_j, X_ID) in descending order and obtain an initial ranking list R(G, ID) = {g_1^init, g_2^init, g_3^init, ..., g_N^init}.
Define f(g_j, q_i) as follows in Eq. (5) to evaluate the similarity between a queried image q_i and a gallery image g_j.
f(g_j, q_i) = \begin{cases} \dfrac{1}{E(g_j, q_i)}, & E(g_j, q_i) \le T \\ 0, & E(g_j, q_i) > T \end{cases}    (5)
In Eq. (5), E(g_j, q_i) is the Euclidean distance between the queried image q_i and the gallery image g_j. If this distance is larger than the threshold T, we consider the two images not similar and set the similarity parameter f(g_j, q_i) to 0; if it is no larger than the threshold T, we define the similarity parameter f(g_j, q_i) as the reciprocal of E(g_j, q_i). Define S(g_j, X_ID) as the Multiple-image Joint Distance between the gallery image g_j and the Image Pool:
S(g_j, X_{ID}) = E(g_j, q_{main}) - \left[ \frac{\sum_{i=1}^{M} \eta\, W_i\, f(g_j, q_i)}{\sum_{i=1}^{M} f(g_j, q_i)} \right] E(g_j, q_{main})    (6)
Previously, in Section 3.1, we determined a main image for each Image Pool. Denote this image q_main; the remaining images of the Image Pool are referred to as assist images. In Eq. (6), E(g_j, q_main) is the Euclidean distance between the main image and a gallery image g_j, η is a scale factor that determines the significance of the assist images in the calculation, and W_i (i = 1, 2, ..., M − 1) is the weight of each assist image. At this point, we have counted in the influence of all the assist images on the gallery image g_j. Then, by subtracting the weighted term contributed by the assist images from the Euclidean distance to the main image, we get the term S(g_j, X_ID), which expresses the similarity between the gallery image g_j and the Image Pool.
The Multiple-image Joint Distance gives an approach to calculate the distance between one image and a set of images. However, there are still improvements to be made. (1) To get S(g_j, X_ID), we calculate the Euclidean distance between every gallery image and every image of the Image Pool; supposing the size of the gallery set is N, the time complexity is O(N^2), which is time-consuming. In the next section, a ranking aggregation method is introduced to lower the time complexity. (2) Apart from the hyper-parameter W_i, it is difficult to set accurately suitable values for some parameters in Eqs. (5) and (6), such as the threshold T and the scale factor η.
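Before turning to the re-ranking algorithm, here is a minimal Python sketch of Eqs. (5) and (6) as reconstructed above, for a single gallery image. The summation is taken over the assist images only, and the zero-denominator guard, the example weights and the values of η and T are our own illustrative assumptions.

```python
import numpy as np

def similarity(dist, T):
    """Eq. (5): reciprocal-distance similarity, zero beyond the threshold T."""
    return 1.0 / dist if dist <= T else 0.0

def joint_distance(gallery_feat, main_feat, assist_feats, weights, eta, T):
    """Sketch of the Multiple-image Joint Distance of Eq. (6) for one gallery image.

    gallery_feat: (d,) feature of g_j; main_feat: (d,) feature of q_main;
    assist_feats: (M-1, d) assist-image features; weights: (M-1,) values W_i.
    """
    e_main = np.linalg.norm(gallery_feat - main_feat)
    f_vals = np.array([similarity(np.linalg.norm(gallery_feat - a), T)
                       for a in assist_feats])           # f(g_j, q_i), Eq. (5)
    if f_vals.sum() == 0.0:
        return e_main                                     # no assist image is similar to g_j
    assist_term = (np.asarray(weights) * eta * f_vals).sum() / f_vals.sum()
    return e_main - assist_term * e_main                  # Eq. (6)

# Example: one gallery image against a pool with M = 3 (1 main + 2 assist images).
g = np.random.rand(2048)
q_main = np.random.rand(2048)
q_assist = np.random.rand(2, 2048)
print(joint_distance(g, q_main, q_assist, weights=[0.6, 0.4], eta=0.5, T=30.0))
```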
3.3. Multiple-image Joint Re-ranking

In this section, the Multiple-image Joint Re-ranking algorithm is introduced as an update to the Multiple-image Joint Distance formula. The Multiple-image Joint Re-ranking framework is referred to as the MJR framework in the rest of the article.
Let the main image in the Image Pool be q_main. By sorting the Euclidean distances between the main image and each of the gallery images g_j (j = 1, 2, ..., N) in increasing order, we get the initial ranking list of the main image, R(G, q_main) = {g_1^main, g_2^main, g_3^main, ..., g_N^main}. Accordingly, the initial ranking list of each of the other images in the Image Pool is:
R(G, q_i) = \{ g_1^i, g_2^i, \ldots, g_N^i \}, \quad i = 1, 2, \ldots, M - 1    (7)
At this point, for each image we only keep the first k entries of its initial ranking list: R_k(G, q_i) = {g_1^i, g_2^i, g_3^i, ..., g_k^i} (i = main, 1, 2, ..., M − 1). We set k1 as the list length for the main image and k2 as the list length for the assist images.
For image-based datasets, each queried image takes turns being considered the main image of the Image Pool. For sequence-based datasets, the main image is chosen using the evaluation protocol of the PRID 2011 dataset [34]; in this way, the results of the Image Pool can represent the results of the whole sequence. The pseudocode of the Multiple-image Joint Re-ranking algorithm is shown below:
Algorithm 2. Multiple-image Joint Re-ranking
1: Input: initial ranking lists of the images q_i (i = main, 1, 2, ..., M − 1):
2:   the main image's list R_k1(G, q_main) = {g_1^main, g_2^main, ..., g_k1^main},
3:   the assist images' lists R_k2(G, q_i) = {g_1^i, g_2^i, ..., g_k2^i} (i = 1, 2, ..., M − 1)
4: Begin
5: Initialization:
6:   Rank(g_j^main) = j; Times(g_j^main) = 0 (j = 1, 2, ..., k1)
7: for each g_j^main in R_k1(G, q_main) do
8:   for i = 1 : M − 1 do
9:     for each m in R_k2(G, q_i) do
10:      if g_j^main == m then
11:        ++Times(g_j^main)
12:   Rank(g_j^main) += (M − Times(g_j^main)) · k1
13: Reorder the scores Rank(g_j^main) and obtain R_k1(G, X) = {g_1, g_2, ..., g_k1} in ascending order of Rank(g_j^main)
14: end
15: Output: R_k1(G, X) = {g_1, g_2, ..., g_k1}
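The aggregation step can be written compactly in Python. The following is a minimal sketch assuming the initial ranking lists are given as lists of gallery-image indices; the list representation, the helper name and the toy data are our own illustration, and the placement of the score update (once per main-list image, after counting) follows our reading of the pseudocode above.

```python
def multi_image_joint_rerank(main_list, assist_lists, M, k1):
    """Aggregate the main image's top-k1 list with the assist images' top-k2 lists.

    main_list: top-k1 gallery indices from the main image's ranking list.
    assist_lists: list of M-1 top-k2 index lists, one per assist image.
    Returns the re-ranked top-k1 gallery indices (lowest rank score first).
    """
    assist_sets = [set(lst) for lst in assist_lists]    # membership tests in O(1)
    rank = {}
    for j, g in enumerate(main_list, start=1):
        times = sum(g in s for s in assist_sets)        # lines 7-11: count occurrences
        rank[g] = j + (M - times) * k1                  # lines 6 and 12: initial position plus penalty
    return sorted(main_list, key=lambda g: rank[g])     # line 13: ascending rank score

# Toy example with M = 3, k1 = 10, k2 = 5 (gallery images labelled by integers).
main_list = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
assist_lists = [[3, 9, 12, 7, 15], [9, 3, 18, 2, 7]]
print(multi_image_joint_rerank(main_list, assist_lists, M=3, k1=10))
```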
In line 6 of the pseudocode, Rank() records the rank score of the images taken from the main image's ranking list and is initialized with their initial positions; Times() records how many times an image g_j^main is found in the other ranking lists and is initialized to 0. In line 12, Rank() is increased by k1 multiplied by (M − Times(g_j^main)). This indicates that the rank score of an image is related not only to its initial position but also to the number of times it is found in the other ranking lists: the more frequently an image is found, the lower its score.
Fig. 1 shows an example of the Multiple-image Joint Re-ranking algorithm. In this example, k1 is set to 10 and k2 is set to 5, which means that the main image has 10 images in its ranking list and each assist image has 5 images in its ranking list. To show the cases more clearly, identical images are labeled with boxes of the same color. In the Ranking List section there are three red boxes, two green boxes, two blue boxes and two yellow boxes, so the red-boxed image has the smallest rank score and comes to the top of the Re-ranking List.

3.4. Multi-image Joint Re-ranking framework with updateable Image Pool

The Multi-image Joint Re-ranking framework with updateable Image Pool is shown in Fig. 2, which gives an exemplar of the re-ranking process for a single identity. In this framework, we aggregate the Ranking Lists and obtain the Re-ranking List. In this example, a gallery set with N images and a maximized-diversity Image Pool with M images are given. First, the feature vectors are extracted using a convolutional neural network. Then the Euclidean distances between the probe images and the gallery images are calculated and sorted in ascending order to form the Ranking Lists. The last step is to obtain the Re-ranking List using the Multiple-image Joint Re-ranking algorithm (Algorithm 2).
What is more, a real-world person re-id system was designed to verify the effectiveness of the Multi-image Joint Re-ranking framework. The system is designed for use with the ITSD data; there are 16 cameras in the surveillance system. The first step is to calibrate the tracking system to lock on the target person. The system then initializes an Image Pool with the first few frames obtained by the mean-shift tracking method. After that, as the tracking system keeps running, the Image Pool is updated to maximize its diversity; if the tracking system loses the target, the Image Pool stops updating until the tracking system has been restarted. In this way, the Multi-image Joint Re-ranking framework effectively achieves person re-identification in a real-world surveillance system.

4. Experiment

4.1. Dataset

Since our framework is based on multiple images, we conduct experiments on four datasets: two image-sequence datasets, the PRID 2011 dataset [16] and the iLIDS-VID dataset [61]; one image-based dataset, Market-1501 [62]; and our newly developed Indoor Train Station Dataset (ITSD).
The available re-id datasets hardly simulate real-world surveillance data. In order to better evaluate the effectiveness of our framework in a real-world setting, we built the Indoor Train Station Dataset (ITSD). This dataset is collected from real-world surveillance video of a railway station, so it has complex backgrounds and real-world surveillance views. With manual supervision, we collected samples of every identity using the mean-shift tracking method and determined the bounding boxes using DPM. The image size is 64 × 128. ITSD contains 5607 images of 443 identities captured from 8 different viewpoints, of which 50 identities are captured by at least 3 disjoint cameras. The bounding boxes of ITSD are tighter, and the dataset has more complex scenes and viewpoints. For experimental purposes, the dataset is split into two parts: 3833 images of 343 identities for training and 1774 images of 100 identities for testing.
Fig. 1. Demonstration of the Multiple-image Joint Re-ranking algorithm on a person re-id application.
Fig. 2. Multi-image Joint Re-ranking framework.
In general, ITSD provides abundant camera views and real-world data, so we can adequately evaluate the performance of our MJR framework.

4.2. Experimental settings and evaluation protocol

Experimental settings: all training is implemented with Caffe on an NVIDIA GeForce GTX 1080 GPU. We used ResNet-50 [21] pre-trained on ImageNet as the base model and extracted a 2048-dimensional feature from ResNet-50 as the final representation. All images are resized to 224 × 224 pixels as input. We trained the model with the softmax loss as the baseline and achieved our best results with the triplet loss. The network was trained for 50,000 iterations, optimized with SGD and a batch size of 16. The initial learning rate was set to 0.001 and lowered every 2 × 10^4 iterations.
Evaluation protocol: to evaluate the performance of re-id methods, we used two evaluation metrics. The first is the Cumulated Matching Characteristics (CMC); considering re-id as a ranking problem, we report the cumulated matching accuracy at rank-1. The other is the mean average precision (mAP), which considers re-id as an object retrieval problem, as described in [62].
For image-based datasets, the results of the Image Pool are considered the results of the main image. For sequence-based datasets, the results of the Image Pool are considered the results of the whole sequence. The evaluation protocol of the experimental results follows the setting of the PRID 2011 dataset [61].

4.3. Parameters analysis

In this section we discuss how the parameters of the framework were chosen. The feature extractor is trained as a classification model, using either CaffeNet [63] or ResNet-50 [21]; CaffeNet generates a 1024-dimensional vector and ResNet-50 a 2048-dimensional vector for each image. We evaluate the influence of k1, k2 and the Image Pool size M on rank-1 accuracy and mAP, with experiments conducted on the Market-1501 dataset. The results of the Image Pool are considered the results of the main image. In the experiments, every image in the query set is used as the main image once, and we collect images for the Image Pool using the criterion of maximizing diversity from Algorithm 1. The assist images of the Image Pool also come from the query set.
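For reference, the two metrics reported in the following tables can be computed as in the minimal sketch below, assuming each query comes with a ranked list of gallery indices and a set of ground-truth matches; dataset-specific details such as removing same-camera ground truth are omitted.

```python
import numpy as np

def rank1_and_map(ranked_lists, gt_matches):
    """Compute rank-1 CMC accuracy and mAP over a set of queries.

    ranked_lists: list of ranked gallery-index lists, one per query.
    gt_matches: list of sets of correct gallery indices, one per query.
    """
    rank1_hits, aps = [], []
    for ranking, gt in zip(ranked_lists, gt_matches):
        hits = np.array([g in gt for g in ranking], dtype=float)
        rank1_hits.append(hits[0])                    # CMC at rank-1
        if hits.sum() == 0:
            aps.append(0.0)
            continue
        precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
        aps.append((precision * hits).sum() / hits.sum())   # average precision
    return float(np.mean(rank1_hits)), float(np.mean(aps))

# Toy example: two queries over a five-image gallery.
print(rank1_and_map([[2, 0, 4, 1, 3], [1, 3, 0, 2, 4]], [{2, 4}, {0}]))
```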
Parameter M of the Image Pool: M is the number of images in the Image Pool, and it determines the number of initial ranking lists to be aggregated. To study the effect of M on performance, we fix both k1 and k2 at 10. Experiments are conducted with three different metrics, Euclidean, KISSME [34] and XQDA [5], so that our method is evaluated under different conditions. The results in Table 2 show the influence of M under these conditions. Both the rank-1 accuracy and the mAP are significantly improved by our framework in all experiments. When CaffeNet + Euclidean is employed as the baseline, our method gains 18.35% in rank-1 accuracy and 4.37% in mAP when M equals 4. When ResNet-50 + Euclidean is employed as the baseline, our method gains 12.77% in rank-1 accuracy and 3.73% in mAP when M equals 4.
From the table, we can see that the rank-1 accuracy increases as M increases. The mAP increases with M while M is less than 3 and starts to decrease when M is larger than 3. The Market-1501 query set contains 3368 images of 750 identities, where 27 identities have fewer than 3 queried images, 115 identities have fewer than 4 queried images and 98 identities have a maximum of 6 queried images. One problem is that, as M increases, an insufficient number of queried images may affect performance. Despite that, after running all 3368 queries in the dataset, the maximum average time per query is 0.03517 s and the minimum is 0.01613 s. From the table, the computation time for M = 3 is less than for M = 2 and M = 4, for the following reasons. First, the distances between the query images and the gallery images have already been calculated and saved, so adding new images to the Image Pool does not affect the computing time much. Second, the Matlab code takes a different path when M = 3 or M = 4 than when M = 2; because of this, M = 2 has the slowest computing time, and when M is larger than 3 the computing time increases slightly as M increases. In general, we obtain the minimum calculation time and the optimal mAP when M is set to 3.
In the experiments above, ResNet-50 performs better than CaffeNet. XQDA has the best mAP and calculation time, while Euclidean outperforms the other two metrics in rank-1 accuracy. We can conclude that the re-ranking method makes better use of simple distance measures, such as the Euclidean distance, than of complicated metrics.
Parameters k1 and k2 of the Image Pool: k1 is the length of the main image's ranking list, and k2 is the length of an assist image's ranking list.
As before, the Market-1501 dataset was used to evaluate the influence of k1 and k2 on rank-1 accuracy and mAP. In this part of the experiment, we take ResNet-50 + Euclidean as the baseline and set M to 3. First we fix k2 at 6 to see how the algorithm performs as k1 changes; then we fix k1 at 70 to see how it performs as k2 changes.
Table 3 shows that, as k1 increases from 6 to 100, the rank-1 accuracy peaks when k1 is 70. At this point, the rank-1 accuracy improves by 18.74% and the mAP by 13.46% compared with the baseline; the performance starts to decline when k1 exceeds 70 because a large k1 causes more false matches. Table 4 shows that, as k2 varies, the rank-1 accuracy peaks when k2 is 2. At this point, the rank-1 accuracy improves by 21.52% and the mAP by 12.42% compared with the baseline; the performance declines when k2 exceeds 2 because a large k2 causes more false matches. Based on these results, setting k1 to 70 and k2 to 2 gives our algorithm the best performance.

4.4. Updating rules of the Image Pool

In this part, we focus on evaluating the updating rules of the Image Pool using the ITSD dataset. Based on the parameter analysis above, we set M to 3, k1 to 70 and k2 to 2. ResNet-50 + Euclidean is used as the base method and the model is trained with the softmax loss. In the experiment, the Image Pool images are obtained in three different ways: selected randomly, selected by the 'sparse recovery' method [48], and selected based on the criterion of maximizing diversity.
Table 2
The influence of the number of images (parameter M) of the Image Pool on Market-1501. k1 is fixed at 10 and k2 is fixed at 10.

Method                                    Rank-1    mAP      Time
CaffeNet + Euclidean                      56.98     31.41    94.035464
CaffeNet + Euclidean + joint (M = 2)      67.76     35.17    118.458838
CaffeNet + Euclidean + joint (M = 3)      74.44     35.90    109.906106
CaffeNet + Euclidean + joint (M = 4)      75.33     35.78    112.739676
ResNet-50 + XQDA                          73.81     51.19    93.118241
ResNet-50 + XQDA + joint (M = 2)          80.97     54.79    78.275997
ResNet-50 + XQDA + joint (M = 3)          83.40     54.26    54.338506
ResNet-50 + XQDA + joint (M = 4)          84.20     54.19    56.466179
ResNet-50 + KISSME                        75.68     50.95    92.781606
ResNet-50 + KISSME + joint (M = 2)        82.51     52.27    84.042617
ResNet-50 + KISSME + joint (M = 3)        84.38     54.06    64.588775
ResNet-50 + KISSME + joint (M = 4)        85.84     54.10    66.142786
ResNet-50 + Euclidean                     73.28     47.87    82.399024
ResNet-50 + Euclidean + joint (M = 2)     81.77     52.27    84.042617
ResNet-50 + Euclidean + joint (M = 3)     85.15     51.62    62.396900
ResNet-50 + Euclidean + joint (M = 4)     86.05     51.60    62.425897

Table 3
The algorithm performance as k1 changes when k2 is set to 6, using ResNet-50 + Euclidean as the baseline and M = 3.

Method               Rank-1    mAP
Baseline             73.28     47.87
k1 = 6, k2 = 6       82.24     50.69
k1 = 10, k2 = 6      85.72     52.07
k1 = 20, k2 = 6      88.90     52.66
k1 = 30, k2 = 6      89.85     53.53
k1 = 40, k2 = 6      90.56     57.79
k1 = 50, k2 = 6      90.88     58.80
k1 = 60, k2 = 6      91.09     59.45
k1 = 70, k2 = 6      91.75     59.31
k1 = 80, k2 = 6      91.48     60.59
k1 = 90, k2 = 6      91.45     61.00
k1 = 100, k2 = 6     91.54     61.33

Table 4
The algorithm performance as k2 changes when k1 is set to 70, using ResNet-50 + Euclidean as the baseline and M = 3.

Method               Rank-1    mAP
Baseline             73.28     47.87
k1 = 70, k2 = 20     87.25     58.84
k1 = 70, k2 = 15     87.68     59.62
k1 = 70, k2 = 10     89.40     60.20
k1 = 70, k2 = 9      89.90     60.26
k1 = 70, k2 = 8      90.44     60.29
k1 = 70, k2 = 7      90.97     60.29
k1 = 70, k2 = 6      91.39     60.12
k1 = 70, k2 = 5      92.04     59.93
k1 = 70, k2 = 4      93.11     59.67
k1 = 70, k2 = 3      93.91     59.03
k1 = 70, k2 = 2      94.80     57.60
k1 = 70, k2 = 1      94.51     54.73
Table 5
Rank CMC (%) and mAP when selecting assist images following different rules.

Method     Rank-1    Rank-5    Rank-10    Rank-20    mAP
Baseline   38.68     59.46     66.83      75.02      42.22
(a)        44.19     61.79     64.00      73.19      46.75
(b)        56.87     70.76     75.25      82.76      57.11
(c)        58.53     78.17     81.44      85.00      64.50
Fig. 3. CMC curves comparison with different Image Pool rules.
Table 6
Comparison of the Multiple-image Joint Re-ranking framework with state-of-the-art methods on the Market-1501 dataset.

Method                                     Rank-1    mAP
BoW + KISSME [62]                          44.42     20.76
SCSP [64]                                  51.90     26.35
Null Space [7]                             55.43     29.87
LSTM Siamese [25]                          61.60     35.30
Gated Siamese [37]                         65.88     39.55
PIE [28]                                   79.33     55.95
SVDNet [65]                                80.50     55.90
TriNet [41]                                84.92     69.14
GLAD [29]                                  87.90     71.00
CamStyle [66]                              88.12     68.72
HA-CNN [67]                                93.80     82.80
ResNet-50 + softmax loss                   73.28     47.87
ResNet-50 + softmax loss + Re-ranking      74.85     59.87
ResNet-50 + softmax loss + MJR             94.80     57.60
ResNet-50 + triplet loss                   85.78     69.11
ResNet-50 + triplet loss + Re-ranking      91.57     75.29
ResNet-50 + triplet loss + MJR             95.57     75.30
Table 5 and Fig. 3 report the rank CMC (%) and mAP for this experiment, which uses the ITSD dataset and takes the average feature of all the probe images as the baseline. Case (a) is when the assist images are chosen at random; (b) when they are chosen by the 'sparse recovery' method; and (c) when they are chosen based on the criterion of maximizing diversity.
Table 5 and Fig. 3 show that the rank CMC (%) and mAP are best when the assist images are chosen based on the criterion of maximizing diversity: the rank-1 accuracy is increased by 19.85% and the mAP by 22.28% over the baseline. The results of (c) and (a) show that selecting images by our updating rules is much more effective than selecting images randomly, and the results of (c) and (b) show that our method is better than the 'sparse recovery' method.
Table 7
Comparison of our framework with the state of the art on the PRID-2011 and iLIDS-VID datasets (rank-1, %).

Method        PRID-2011    iLIDS-VID
VR [61]       42           35
AFDA [68]     43           38
STA [69]      64           44
RFA [12]      64           49
RNN-CNN [3]   70           58
ASTPN [70]    77           62
MJR           81           66
4.5. Comparison with the state-of-the-art methods

We now compare the performance of the Multiple-image Joint Re-ranking framework with state-of-the-art methods on large-scale person re-id benchmark datasets. All of the datasets were split into training and testing parts following the evaluation protocols of [61,62].
Experiments on Market-1501: ResNet-50 + Euclidean was adopted as the base method and trained with the softmax loss and the triplet loss respectively. M is set to 3, k1 to 70 and k2 to 2. The comparison is shown in Table 6. With appropriate parameter settings, the Multiple-image Joint Re-ranking framework outperforms the state-of-the-art methods in both rank-1 accuracy and mAP. When the model is trained with ResNet-50 and the softmax loss, ResNet-50 + softmax loss + MJR reaches 94.80 rank-1 and 57.60 mAP, improvements of 21.52% in rank-1 accuracy and 12.42% in mAP. When the model is trained with ResNet-50 and the triplet loss, ResNet-50 + triplet loss + MJR reaches 95.57 rank-1 and 75.30 mAP, improvements of 9.79% in rank-1 accuracy and 6.19% in mAP over the triplet-loss baseline. Compared with the unsupervised re-ranking method [9], our method achieves better results.
Experiments on PRID-2011 and iLIDS-VID: the model was trained with ResNet-50 and the softmax loss. M is set to 10, k1 to 70 and k2 to 2. The comparison is shown in Table 7.
Table 7 indicates that the Multiple-image Joint Re-ranking framework achieves better results than all of the compared methods on the PRID-2011 and iLIDS-VID datasets. When the MJR framework is applied, the rank-1 accuracy is improved by 3.2% and 1.7% for PRID-2011 and iLIDS-VID respectively.
5. Conclusion

In this paper, we proposed the Multi-image Joint Re-ranking framework for person re-id. The Image Pool is formed by collecting images for each identity following the rules of maximizing diversity. The initial ranking lists are then obtained by calculating the distances between the feature vectors of the Image Pool and those of the gallery. By aggregating the ranking lists of the Image Pool, the re-ranking list is formed. The ITSD dataset was collected to evaluate the MJR framework's performance in a real-world scenario. Experimental results show that the MJR framework achieves state-of-the-art performance on public datasets and has great potential for real-world application. In the future, the MJR framework will be further improved as more real-world tests are conducted.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61671423 and Grant No. 61271403.

References

[1] L. Zheng, Y. Yang, A.G. Hauptmann, Person re-identification: Past, present and future, arXiv preprint arXiv:1610.02984. [2] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in: Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3, Citeseer, 2007, pp. 1–7. [3] N. McLaughlin, J. Martinez del Rincon, P. Miller, Recurrent convolutional network for video-based person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1325–1334. [4] N. Gheissari, T.B. Sebastian, R. Hartley, Person reidentification using spatiotemporal appearance, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE, 2006, pp. 1528–1535. [5] S. Liao, Y. Hu, X. Zhu, S.Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206. [6] N. Martinel, C. Micheloni, G.L. Foresti, Kernelized saliency-based person re-identification through multiple metric learning, IEEE Trans. Image Process. 24 (12) (2015) 5645–5658. [7] L. Zhang, T. Xiang, S. Gong, Learning a discriminative null space for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1239–1248. [8] J. García, N. Martinel, A. Gardel, I. Bravo, G.L. Foresti, C. Micheloni, Modeling feature distances by orientation driven classifiers for person re-identification, J. Visual Commun. Image Represent. 38 (2016) 115–129. [9] Z. Zhong, L. Zheng, D. Cao, S. Li, Re-ranking person re-identification with k-reciprocal encoding, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 3652–3661. [10] W. Li, Y. Wu, M. Mukunoki, M. Minoh, Common-near-neighbor analysis for person re-identification, in: 2012 19th IEEE International Conference on Image Processing (ICIP), IEEE, 2012, pp. 1621–1624. [11] M.S. Nixon, T. Tan, R. Chellappa, Human Identification based on Gait, vol. 4, Springer Science & Business Media, 2010. [12] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, X. Yang, Person re-identification via recurrent feature aggregation, in: European Conference on Computer Vision, Springer, 2016, pp. 701–716. [13] W. Zajdel, Z. Zivkovic, B. Krose, Keeping track of humans: Have I seen this person before?, in: Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on, IEEE, 2005, pp. 2081–2086.
[14] L. Bazzani, M. Cristani, A. Perina, M. Farenzena, V. Murino, Multiple-shot person re-identification by HPE signature, in: 2010 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp. 1413–1416. [15] S. Karanam, Y. Li, R.J. Radke, Person re-identification with block sparse recovery, Image Vis. Comput. 60 (2017) 75–90.
[16] M. Hirzer, C. Beleznai, P.M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Scandinavian Conference on Image Analysis, Springer, 2011, pp. 91–102. [17] M. Hirzer, P.M. Roth, M. Köstinger, H. Bischof, Relaxed pairwise learned metric for person re-identification, in: European Conference on Computer Vision, Springer, 2012, pp. 780–793. [18] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (Feb) (2009) 207–244. [19] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, R.J. Radke, A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets, IEEE Transactions on Pattern Analysis & Machine Intelligence (1) 1–1. [20] J. Zhang, N. Wang, L. Zhang, Multi-shot pedestrian re-identification via sequential decision making, arXiv preprint arXiv:1712.07257. [21] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Computer Vis. 115 (3) (2015) 211–252. [23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556. [24] F. Wang, W. Zuo, L. Lin, D. Zhang, L. Zhang, Joint learning of single-image and cross-image representations for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1288– 1296. [25] R.R. Varior, B. Shuai, J. Lu, D. Xu, G. Wang, A siamese long short-term memory architecture for human re-identification, in: European Conference on Computer Vision, Springer, 2016, pp. 135–153. [26] S. Yu, Y. Cheng, L. Xie, Z. Luo, M. Huang, S. Li, A novel recurrent hybrid network for feature fusion in action recognition, J. Visual Commun. Image Represent. 49 (2017) 192–203. [27] H. Yao, S. Zhang, Y. Zhang, J. Li, Q. Tian, Deep representation learning with part loss for person re-identification, arXiv preprint arXiv:1707.00798. [28] L. Zheng, Y. Huang, H. Lu, Y. Yang, Pose invariant embedding for deep person re-identification, arXiv preprint arXiv:1701.07732. [29] L. Wei, S. Zhang, H. Yao, W. Gao, Q. Tian, Glad: Global-local-alignment descriptor for pedestrian retrieval, in: Proceedings of the 2017 ACM on Multimedia Conference, ACM, 2017, pp. 420–428. [30] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, X. Tang, Spindle net: person re-identification with human body region guided feature decomposition and fusion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1077–1085. [31] N. Martinel, M. Dunnhofer, G.L. Foresti, C. Micheloni, Person re-identification via unsupervised transfer of learned visual representations, in: Proceedings of the 11th International Conference on Distributed Smart Cameras, ACM, 2017, pp. 151–156. [32] N. Martinel, G.L. Foresti, C. Micheloni, Person reidentification in a distributed camera network framework, IEEE Trans. Cybernet. 47 (11) (2017) 3530–3541. [33] H. Fan, L. Zheng, Y. Yang, Unsupervised person re-identification: clustering and fine-tuning, arXiv preprint arXiv:1705.10444. [34] M. Koestinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. 
Bischof, Large scale metric learning from equivalence constraints, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2288–2295. [35] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning a mahalanobis metric from equivalence constraints, J. Mach. Learn. Res. 6 (Jun) (2005) 937–965. [36] S. Liao, S.Z. Li, Efficient PSD constrained asymmetric metric learning for person re-identification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3685–3693. [37] R.R. Varior, M. Haloi, G. Wang, Gated siamese convolutional neural network architecture for human re-identification, in: European Conference on Computer Vision, Springer, 2016, pp. 791–808. [38] H. Liu, J. Feng, M. Qi, J. Jiang, S. Yan, End-to-end comparative attention networks for person re-identification, arXiv preprint arXiv:1606.04404. [39] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Person re-identification by multi-channel parts-based cnn with improved triplet loss function, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1335–1344. [40] W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, Proc. CVPR, vol. 2, 2017. [41] A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person reidentification, arXiv preprint arXiv:1703.07737. [42] Q. Xiao, H. Luo, C. Zhang, Margin sample mining loss: a deep learning based method for person re-identification, arXiv preprint arXiv:1710.00478. [43] M. Geng, Y. Wang, T. Xiang, Y. Tian, Deep transfer learning for person reidentification, arXiv preprint arXiv:1611.05244. [44] O. Hamdoun, F. Moutarde, B. Stanciulescu, B. Steux, Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences, in: Distributed Smart Cameras, 2008. ICDSC 2008. Second ACM/IEEE International Conference on, IEEE, 2008, pp. 1– 6. [45] C. Nakajima, M. Pontil, B. Heisele, T. Poggio, Full-body person recognition system, Pattern Recognit. 36 (9) (2003) 1997–2006. [46] D.N.T. Cong, C. Achard, L. Khoudour, L. Douadi, Video sequences association for people re-identification across multiple non-overlapping cameras, in: International Conference on Image Analysis and Processing, Springer, 2009, pp. 179–189.
[47] W. Li, X. Wang, Locally aligned feature transforms across views, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 3594–3601. [48] S. Karaman, A.D. Bagdanov, Identity inference: generalizing person reidentification scenarios, in: European Conference on Computer Vision, Springer, 2012, pp. 443–452. [49] D.-N.T. Cong, L. Khoudour, C. Achard, C. Meurie, O. Lezoray, People reidentification by spectral classification of silhouettes, Signal Process. 90 (8) (2010) 2362–2374. [50] W. Li, L. Niu, D. Xu, Exploiting privileged information from web data for image categorization, in: European Conference on Computer Vision, Springer, 2014, pp. 437–452. [51] Y. Wu, I. Kozintsev, J.-Y. Bouguet, C. Dulong, Sampling strategies for active learning in personal photo retrieval, in: 2006 IEEE International Conference on Multimedia and Expo, IEEE, 2006, pp. 529–532. [52] Y. Yan, F. Nie, W. Li, C. Gao, Y. Yang, D. Xu, Image classification by cross-media active learning with privileged information, IEEE Trans. Multimed. 18 (12) (2016) 2494–2502. [53] L. Zheng, Y. Yang, Q. Tian, Sift meets cnn: a decade survey of instance retrieval, IEEE Trans. Pattern Anal. Mach. Intell. [54] X. Shen, Z. Lin, J. Brandt, S. Avidan, Y. Wu, Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 3013–3020. [55] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall: Automatic query expansion with a generative feature model for object retrieval, in: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, IEEE, 2007, pp. 1–8. [56] R. Arandjelovic´, A. Zisserman, Three things everyone should know to improve object retrieval, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2911–2918. [57] C. Liu, C. Change Loy, S. Gong, G. Wang, Pop: Person re-identification post-rank optimisation, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 441–448.
[58] Z. Zheng, L. Zheng, Y. Yang, A discriminatively learned cnn embedding for person reidentification, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (1) (2017) 13. [59] J. Garcia, N. Martinel, A. Gardel, I. Bravo, G.L. Foresti, C. Micheloni, Discriminant context information analysis for post-ranking person re-identification, IEEE Trans. Image Process. 26 (4) (2017) 1650–1665. [60] R. Vezzani, D. Baltieri, R. Cucchiara, People reidentification in surveillance and forensics: a survey, ACM Comput. Surveys (CSUR) 46 (2) (2013) 29. [61] T. Wang, S. Gong, X. Zhu, S. Wang, Person re-identification by video ranking, in: European Conference on Computer Vision, Springer, 2014, pp. 688–703. [62] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person reidentification: a benchmark, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1116–1124. [63] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Adv. Neural Inform. Process. Syst., 2012, pp. 1097–1105. [64] D. Chen, Z. Yuan, B. Chen, N. Zheng, Similarity learning with spatial constraints for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1268–1277. [65] Y. Sun, L. Zheng, W. Deng, S. Wang, Svdnet for pedestrian retrieval, arXiv preprint. [66] Z. Zhong, L. Zheng, Z. Zheng, S. Li, Y. Yang, Camera style adaptation for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5157–5166. [67] W. Li, X. Zhu, S. Gong, Harmonious attention network for person reidentification, CVPR, vol. 1, 2018, p. 2. [68] Y. Li, Z. Wu, S. Karanam, R.J. Radke, Multi-shot human re-identification using adaptive fisher discriminant analysis, BMVC, vol. 1, 2015, p. 2. [69] K. Liu, B. Ma, W. Zhang, R. Huang, A spatio-temporal appearance representation for video-based pedestrian re-identification, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 3810–3818. [70] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, P. Zhou, Jointly attentive spatialtemporal pooling networks for video-based person re-identification, arXiv preprint arXiv:1708.02286.