
Person Re-identification Based on Multi-Instance Multi-Label Learning

Ying Lin (a), Feng Guo (a,*), Liujuan Cao (b), Jinlin Wang (a)

(a) Department of Cognitive Science, School of Information Science and Engineering, Xiamen University
(b) Department of Computer Science, School of Information Science and Engineering, Xiamen University

Abstract

Person re-identification is an active research branch of computer vision that attracts many researchers. However, because of variance in viewpoint, illumination, pedestrian pose, and other factors, building a robust person re-identification algorithm for the open world is rather challenging. In this paper, we propose two algorithms, an indiscriminative-patch trim strategy and a multi-instance multi-label learning (MIMLL) method, to address person re-identification in a closed short-term surveillance network. This paper makes two main contributions. First, we define the discriminative region of a person's image, and propose an adapted canopy-kmeans algorithm to evaluate the discriminability of the image patches that form such regions. Two trim strategies, one within the local image and one over the global gallery dataset, filter out distractions, and the remaining truly discriminative patches are employed to compute the similarity between two images and thus to re-identify pedestrians. Second, we introduce multi-instance multi-label learning into re-identification for the first time and propose a framework that solves person re-identification by applying MIMLL. We employ two MIMLL methods, MIMLBOOST and MIMLSVM, to detect attributes in each image, and the detected attributes contribute to recognizing different pedestrians. Experiments on two benchmark datasets show the competitive performance of the two proposed algorithms.

Keywords: Person Re-identification; Multi-Instance Multi-Label Learning; Canopy-kmeans

* Corresponding author. Email address: [email protected] (Feng Guo)

1. Introduction

Person re-identification is a hot topic with wide applications and has developed rapidly. However, due to variance in viewpoint, illumination, pedestrian pose, and other factors, building a robust person re-identification algorithm is rather challenging, and researchers are pursuing approaches that can overcome these challenges and match the same pedestrian across different cameras. In this paper, we propose two algorithms, based on discriminative patches and on multi-instance multi-label learning (MIMLL) respectively, to address person re-identification in a closed short-term surveillance network. The algorithm based on discriminative patches employs an adapted canopy-kmeans method to find raw discriminative patterns, and two trim strategies filter out distractions; the remaining discriminative patterns represent the gallery images and are used to compute the similarity between gallery images and probe images. The algorithm based on MIMLL introduces two MIMLL methods, MIMLBOOST and MIMLSVM, to detect image attributes that help refine the final matching stage of re-identification. To the best of our knowledge, this is the first use of multi-instance multi-label learning in person re-identification.

There are two main contributions of this paper. First, we define the discriminative region of a person's image. Correspondingly, an adapted canopy-kmeans algorithm is proposed to evaluate the discriminability of the image patches that form the discriminative regions. Besides, two strategies, one within the local image and one over the global gallery dataset, are proposed to filter out distractions. Finally, the true discriminative patches are employed to compute the similarity between two images, and are then used to re-identify pedestrians. Second, we introduce the multi-instance multi-label learning method into re-identification for the first time, and we propose a framework that solves person re-identification by applying MIMLL. We employ two MIMLL methods, MIMLBOOST and MIMLSVM, to detect attributes in each image, and the detected attributes contribute to the recognition of different pedestrians. These algorithms could also be used to improve the performance of the prior art [1, 2, 3]. Experiments on two benchmark datasets show the competitive performance of the two proposed algorithms.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the re-identification algorithm based on discriminative patches, and Section 4 describes the re-identification algorithm based on the MIMLL method. Section 5 presents the experiments, and the last section concludes the paper.

2. Related Work

Existing re-identification algorithms can generally be sorted into two classes: algorithms seeking robust features [4, 5, 6, 7, 8] and algorithms seeking learned models [9, 10, 11, 12, 13, 14]. Robust features can overcome the impact of different illuminations; for example, SDALF [6], the Mid-level filter [15], kBiCov [16], local descriptors encoded by Fisher vectors (LDFV) [17], salience matching [18], and color histograms [19] have made impressive improvements in robust feature representation and advanced person re-identification research. However, since pedestrian pose and viewpoint vary widely, texture features alone, especially HOG and SIFT, play a less important role in re-identification, and researchers therefore prefer to fuse features. Rui Zhao et al. [20] combined dense SIFT with color histograms to describe pedestrian images, while Douglas Gray et al. [21] used a mixture of Gabor filters, Schmid filters, and color histograms instead. Because imaging conditions change quickly in a surveillance network, salience has been introduced to improve the accuracy of re-identification algorithms: salient regions are reliable and remain stable even across many different cameras, as they are always the discriminative regions of images. Rui Zhao et al. [20] proposed an unsupervised salience computing method for person re-identification that obtained good performance. Meanwhile, attributes [22] have been employed for re-identification because they carry semantics that support human-machine interaction. Ryan Layne et al. [23] detected 21 binary attributes to help re-identification, and Annan Li et al. [24] defined 11 kinds of multi-value attributes for re-identification; both achieved desirable performance.

Many metric learning algorithms [10, 25, 26, 27, 28, 29] have been proposed to learn the distance or similarity function over person re-identification features. In practice, these methods run as a two-stage process: Principal Component Analysis (PCA) is first applied for dimension reduction, and metric learning is then performed in the PCA subspace (see the sketch at the end of this section).

In this paper, we propose two new person re-identification algorithms, one based on discriminative patches and the other based on a multi-instance multi-label learning method using salient regions. The former clusters discriminative patches for re-identification, and the latter detects attributes with two MIMLL algorithms. Although both algorithms detect distinctive regions in images, they differ: discriminative patches carry no semantics and are only patterns that appear frequently, while attributes are semantic regions that can be used in human-machine interaction. Besides, the algorithm based on discriminative patches is a standalone algorithm, whose re-identification results can be refined by the attribute-based algorithm.
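To make the two-stage pipeline above concrete, here is a minimal Python sketch: PCA reduces the descriptor dimension, and a Mahalanobis-style metric is then applied in the subspace. The regularized inverse-covariance matrix stands in for a properly learned metric (e.g., one learned from equivalence constraints as in [10]); it is an illustrative assumption, not the method of any cited paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def two_stage_metric(features, n_components=64):
    """PCA for dimension reduction, then a Mahalanobis metric on the subspace."""
    pca = PCA(n_components=n_components).fit(features)
    Z = pca.transform(features)
    # A learned metric would be estimated from labeled pairs; here we use a
    # regularized inverse covariance purely as a placeholder.
    M = np.linalg.inv(np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1]))

    def dist(a, b):
        # Mahalanobis distance between two descriptors in the PCA subspace.
        d = pca.transform([a])[0] - pca.transform([b])[0]
        return float(d @ M @ d)

    return dist
```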

3. Person Re-identification Based on Discriminative Patches

3.1. Discriminative Patches

Patches with good discriminability should be representative regions, recognizable as patterns that differentiate them from other patches. We first split every gallery image into 6 even horizontal stripes, and a sliding window of 8×8 pixels with a step of 4 pixels is then applied. Each patch is represented by a color histogram and a texture histogram: the color histogram is extracted over 8 channels (from RGB, HSV, and YCbCr), while the responses of 13 Schmid [30] filters and 8 Gabor [31] filters are extracted on the illumination channel. With 16 bins for both kinds of feature, each patch is represented by a 464-dimensional feature vector.

We use two families of texture filters, Schmid and Gabor. The Schmid filters were originally designed to model rotation-invariant texture, and we are motivated by the desire to be invariant to viewpoint and pose; the Gabor filters are robust for capturing horizontal and vertical structure. We set the parameters (γ, λ, θ, σ²) of the Gabor filters to (0.3, 0, 4, 2), (0.3, 0, 8, 2), (0.4, 0, 4, 1), (0.4, 0, 4, 1), (0.3, π/2, 4, 2), (0.3, π/2, 8, 2), (0.4, π/2, 4, 1), and (0.4, π/2, 4, 1), while (τ, σ) of the Schmid filters are (2,1), (4,1), (4,2), (6,1), (6,2), (6,3), (8,1), (8,2), (8,3), (10,1), (10,2), (10,3), and (10,4). Figure 1 shows these 21 texture filters.

Figure 1: The 21 texture filters; the first 13 are Schmid filters and the last 8 are Gabor filters.

After describing each patch with a 464-D feature, we adapt a canopy-kmeans clustering algorithm to cluster the patches within each stripe of each image. The original canopy-kmeans algorithm uses two thresholds T1 and T2 (T1 > T2) that determine the largest and smallest radius of each canopy, which leads to an ambiguity: a patch can belong to more than one canopy. To eliminate this ambiguity, we adapt canopy-kmeans by requiring that each patch be a member of exactly one canopy, and we use a single threshold T to define the radius (T = 5 for the VIPeR dataset and T = 6 for the PRID2011 dataset in our experiments). Algorithm 1 shows the adapted canopy-kmeans algorithm.

Algorithm 1: The adapted canopy-kmeans algorithm
1. Initialize the list of patches and the threshold T.
2. Randomly select a patch P from the list as a new canopy and delete it from the list.
3. While the list is not empty, repeat:
   i. Randomly select a patch I and compute the distances d between I and each canopy.
   ii. Find the minimum of d, denoted d_min^i, where i indexes the nearest canopy.
   iii. If d_min^i ≤ T, add I to the i-th canopy and update that canopy; otherwise (d_min^i > T), create a new canopy from I.
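For concreteness, the following Python sketch implements Algorithm 1. The weighted Euclidean distance uses the per-feature weights of Table 1 expanded across all 464 dimensions, and the canopy-update rule (a running mean over member patches) is an assumption for illustration, since the text leaves the update formula open.

```python
import numpy as np

def weighted_euclidean(a, b, w):
    # a, b: 464-D patch features; w: per-dimension weights expanded from
    # the per-feature-type weights of Table 1.
    return np.sqrt(np.sum(w * (a - b) ** 2))

def adapted_canopy_kmeans(patches, weights, T):
    """Adapted canopy-kmeans: each patch joins exactly one canopy.

    patches: list of 464-D feature vectors (one per patch).
    weights: per-dimension weight vector.
    T: single radius threshold (T=5 for VIPeR, T=6 for PRID2011).
    """
    remaining = list(patches)
    np.random.shuffle(remaining)
    # Step 2: the first randomly chosen patch seeds a new canopy.
    seed = remaining.pop()
    canopies = [{"center": seed, "members": [seed]}]
    # Step 3: assign every remaining patch to its nearest canopy,
    # or open a new canopy if the nearest one is farther than T.
    while remaining:
        p = remaining.pop()
        dists = [weighted_euclidean(p, c["center"], weights) for c in canopies]
        i = int(np.argmin(dists))
        if dists[i] <= T:
            canopies[i]["members"].append(p)
            # Update the canopy center as the mean of its members
            # (one simple update rule; the paper does not fix it).
            canopies[i]["center"] = np.mean(canopies[i]["members"], axis=0)
        else:
            canopies.append({"center": p, "members": [p]})
    return canopies
```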

In the algorithm, we use a weighted Euclidean distance to compute the distance between each pair of patches, so each kind of feature has a weight. C. Liu et al. [19] showed that the three kinds of color features and the two kinds of texture features play different roles in re-identification; Figure 2 shows their importance on the VIPeR and PRID2011 datasets, and Table 1 lists the weights we employ in our experiments. After clustering in each stripe, we obtain a raw set of discriminative patches.

Figure 2: The weights of each kind of feature on the VIPeR dataset and the PRID2011 dataset.

Table 1: The weight allocation of the features [RGB, HSV, YCbCr, Gabor, Schmid].

stripe | VIPeR                          | PRID2011
1      | [0.22, 0.28, 0.17, 0.18, 0.15] | [0.26, 0.17, 0.15, 0.24, 0.18]
2      | [0.13, 0.44, 0.19, 0.15, 0.09] | [0.26, 0.22, 0.20, 0.19, 0.12]
3      | [0.13, 0.44, 0.14, 0.18, 0.10] | [0.19, 0.25, 0.20, 0.20, 0.16]
4      | [0.22, 0.38, 0.19, 0.15, 0.06] | [0.25, 0.18, 0.19, 0.21, 0.17]
5      | [0.28, 0.27, 0.15, 0.18, 0.12] | [0.23, 0.20, 0.18, 0.22, 0.17]
6      | [0.23, 0.31, 0.19, 0.19, 0.08] | [0.32, 0.13, 0.14, 0.24, 0.17]

3.2. Trim Strategy

In the raw set of discriminative patches, clusters lying in background regions degrade re-identification, and such distractions should be trimmed. Since background patterns appear extensively both within a pedestrian image and across the gallery image dataset, we propose two trim strategies, intrim and extrim, to remove them.

Intrim: the intrim strategy is performed within every gallery image. It computes the distances between raw clusters from different stripes. If a cluster P_disc has at least N_in (N_in = 3 in our experiments) of these distances below θ_in (θ_in = 3 in our experiments), then P_disc is considered a distraction and is trimmed from the raw cluster set.

Extrim: the extrim strategy is performed over the whole gallery image dataset. It counts the clusters in the gallery whose distance to P_disc is below θ_ex (θ_ex = 3 in our experiments); if this count exceeds N_ex (N_ex = |G|/2 in our experiments), P_disc is likewise recognized as a distraction and trimmed. The refinement of the raw cluster set is thus achieved over the gallery dataset.

After the refinement, each gallery image is described in each stripe by the patches with higher discriminability. We then use a kNN-style computation to calculate the similarity between a probe image and a gallery image, defined as

Sim_f(I_v^P, I_u^G) = ( Σ_{i=1}^{|v|} Σ_{j=1}^{N} exp( −d(x_i^s, x_j^{s*}) / (2σ²) ) ) / N    (1)

where I_v^P and I_u^G are the v-th image in the probe dataset and the u-th image in the gallery dataset respectively, |v| is the number of images in the probe dataset, N is the number of refined clusters involved in the computation, x_i^s is the i-th patch in stripe s, and x_j^{s*} is the j-th patch in stripe s* of the gallery image.
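As a concrete reading of Eq. (1), the sketch below accumulates the kernelized patch distances and normalizes by N; the bandwidth σ and the stripe-wise pairing of probe patches against refined gallery clusters are illustrative assumptions consistent with the definitions above.

```python
import numpy as np

def sim_f(probe_stripes, gallery_stripes, weights, sigma=1.0):
    """Eq. (1): patch-based similarity between a probe and a gallery image.

    probe_stripes / gallery_stripes: one array of patch features per stripe;
    the gallery side holds the refined discriminative patches.
    """
    total, n_refined = 0.0, 0
    for probe_patches, gallery_patches in zip(probe_stripes, gallery_stripes):
        n_refined += len(gallery_patches)
        for x_i in probe_patches:
            for x_j in gallery_patches:
                # Weighted Euclidean distance, as in the clustering stage.
                d = np.sqrt(np.sum(weights * (x_i - x_j) ** 2))
                total += np.exp(-d / (2.0 * sigma ** 2))
    return total / max(n_refined, 1)  # normalize by N refined clusters
```

Ranking the gallery images by this score and keeping the top-N candidates produces the CMC curves reported in Section 5.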

4. Person Re-identification Based on Multi-Instance Multi-Label Learning

When annotating re-identification data manually, experienced experts usually recognize identical pedestrians by distinctive regions such as hair style, clothing style, and the bags they carry. These semantic regions are generally called attributes, and they represent properties of pedestrians. In the re-identification field, attributes can greatly improve performance. First, when performing single-shot recognition, semantic attributes learned from a dataset provide a form of transfer learning. Second, attribute features are more robust than low-level features, and algorithms combining low-level features with high-level attributes perform better. Finally, adopting attributes helps the implementation of human-computer interfaces and narrows the semantic gap. As a result, how to combine attributes with low-level features to improve re-identification has become a hot topic in this field.

In this paper, we introduce the multi-instance multi-label learning method to learn the attributes, for two reasons: (1) annotation is expensive, since training attribute detectors requires annotated data, and the larger the gallery dataset, the more human effort the annotation costs; (2) it is hard to keep all annotations consistent, since whether the annotation is done by a single expert or several, ambiguity always arises from fatigue and from different understandings of the same attribute.

We first extensively split every gallery image into patches of various sizes, and color features and texture features are computed to represent the patches. If an image is viewed as a bag, the patches in the bag are its instances. We introduce the MIMLBOOST algorithm [32] and the MIMLSVM algorithm [32] to learn the attribute detectors, then use these detectors to detect the attributes in probe images; person re-identification can then exploit the detected attributes. We apply the sliding window method again to split each image: the width and height of each sliding window come from the set {8, 16, 32}, and the step is 4 pixels. Color features and texture features are extracted from each patch; in view of the vast number of bags and instances, we extract the color histogram in the HSV space only.

4.1. MIMLBOOST

Given the gallery dataset G containing |G| images, denote the u-th image of G as X_u and the j-th patch in that image as x_j^(u). We assume image X_u has an attribute set Y_u, so the image can be represented as (X_u, Y_u), and the task of MIMLBOOST is to learn f_MIML : 2^G → 2^Y from {(X_u, Y_u)} (u = 1, 2, …, |G|). This task is converted into a multi-instance learning task: for each y ∈ Y, Ψ(X_u, y) = +1 if and only if y ∈ Y_u, where Ψ : 2^G × Y → {+1, −1}. A MIML learning task is thus transformed into a multi-instance learning task, the MIBOOSTING algorithm is employed to solve it, and the computational complexity is O(|G||Y|). Algorithm 2 shows the MIMLBOOST algorithm.

Algorithm 2: The MIMLBOOST algorithm
1. Convert each image (X_u, Y_u) (u = 1, 2, …, |G|) of G into |Y| bags {[(X_u, y_1), Ψ(X_u, y_1)], …, [(X_u, y_|Y|), Ψ(X_u, y_|Y|)]}; the original MIML dataset is thereby transformed into a multi-instance dataset of |G| × |Y| bags, written {[(X^(i), y^(i)), Ψ(X^(i), y^(i))]} (i = 1, 2, …, |G| × |Y|).
2. Initialize the weight of each bag as W^(i) = 1 / (|G| × |Y|) (i = 1, 2, …, |G| × |Y|).
3. Repeat:
   i. Set the weight of the j-th instance in the i-th bag to W_j^(i) = W^(i) / n_i (i = 1, 2, …, |G| × |Y|), label every instance in that bag with Ψ(X^(i), y^(i)), and build an instance-level detector h_t[(x_j^(i), y^(i))] ∈ {+1, −1}.
   ii. For the i-th bag, count the misclassified instances and compute its error rate e^(i) = ( Σ_{j=1}^{n_i} 1( h_t[(x_j^(i), y^(i))] ≠ Ψ(X^(i), y^(i)) ) ) / n_i.
   iii. If all error rates e^(i) over the gallery dataset are less than 0.5, jump to step 4; else:
   iv. Compute c_t = argmin_{c_t} Σ_{i=1}^{|G|×|Y|} W^(i) exp[(2e^(i) − 1) c_t].
   v. If c_t < 0, jump to step 4; else:
   vi. Update W^(i) = W^(i) exp[(2e^(i) − 1) c_t] (i = 1, 2, …, |G| × |Y|) and normalize so that Σ_{i=1}^{|G|×|Y|} W^(i) = 1.
4. Compute Y* = { y | sign( Σ_j Σ_t c_t h_t[(x_j*, y)] ) = +1 } and return its value.
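A minimal sketch of the degeneration at the heart of Algorithm 2: every (image, attribute) pair becomes one multi-instance bag whose binary label states whether the image carries that attribute, and a bag's error rate is the fraction of its instances an instance-level detector h_t misclassifies. The detector itself is left abstract here; everything else follows steps 1–3.ii above.

```python
import numpy as np

def miml_to_mil(images, label_sets, all_labels):
    """Step 1: convert |G| MIML examples into |G| x |Y| multi-instance bags."""
    bags, bag_labels = [], []
    for X_u, Y_u in zip(images, label_sets):
        for y in all_labels:
            bags.append((X_u, y))                      # bag of patch instances
            bag_labels.append(+1 if y in Y_u else -1)  # Psi(X_u, y)
    return bags, bag_labels

def initial_bag_weights(n_bags):
    """Step 2: uniform initial weights W^(i) = 1 / (|G| x |Y|)."""
    return np.full(n_bags, 1.0 / n_bags)

def bag_error_rate(h_t, bag, psi):
    """Step 3.ii: fraction of instances in the bag misclassified by h_t."""
    X_u, y = bag
    preds = np.array([h_t(x_j, y) for x_j in X_u])
    return float(np.mean(preds != psi))
```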

4.2. MIMLSVM

Similarly, denote an image of the gallery dataset as (X_u, Y_u), where X_u is the u-th image and Y_u is its attribute set. The MIMLSVM algorithm transforms a MIML learning task into a multi-label learning task. Given a transformation function ϕ : 2^G → Z with z_u = ϕ(X_u), and a function Φ : Z × Y → {+1, −1} such that for each y ∈ Y, Φ(z_u, y) = +1 if and only if y ∈ Y_u, the MIML learning task can be converted into a multi-label learning task. The computational complexity of MIMLSVM is O(|Y| R³), where R is the number of support vectors. Algorithm 3 displays the MIMLSVM algorithm, in which d_H denotes the Hausdorff distance between bags.

Algorithm 3: The MIMLSVM algorithm
1. Denote the MIML samples as (X_u, Y_u) (u = 1, 2, …, |G|) and the training set as Γ = {X_u | u = 1, 2, …, |G|}.
2. Randomly select k elements of Γ as the initial medoids M_t (t = 1, 2, …, k) of the k-medoids algorithm.
3. Repeat:
   i. Γ_t = {M_t} (t = 1, 2, …, k).
   ii. For each X_u ∈ (Γ − {M_t | t = 1, 2, …, k}): index = argmin_{t∈{1,…,k}} d_H(X_u, M_t); Γ_index = Γ_index ∪ {X_u}.
   iii. M_t = argmin_{A∈Γ_t} Σ_{B∈Γ_t} d_H(A, B) (t = 1, 2, …, k).
4. Convert each MIML sample (X_u, Y_u) into a multi-label sample (z_u, Y_u) (u = 1, 2, …, |G|), where z_u = (z_u1, z_u2, …, z_uk) = (d_H(X_u, M_1), d_H(X_u, M_2), …, d_H(X_u, M_k)).
5. For each y ∈ Y, train an SVM model h_y = SVMTrain(D_y) on the training set D_y = {(z_u, Φ(z_u, y)) | u = 1, 2, …, |G|}.
6. Return the label set Y* = {argmax_{y∈Y} h_y(z*)} ∪ {y | h_y(z*) ≥ 0, y ∈ Y}, where z* = (d_H(X*, M_1), d_H(X*, M_2), …, d_H(X*, M_k)).
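The following sketch mirrors Algorithm 3 with scikit-learn, using the symmetric Hausdorff distance between bags for d_H and the SVM settings reported in Section 5 (penalty factor 1, RBF kernel). For brevity the medoids are drawn at random rather than refined by the full k-medoids loop; that simplification, and the assumption that each attribute has both positive and negative bags, are ours.

```python
import numpy as np
from sklearn.svm import SVC

def hausdorff(A, B):
    # Symmetric Hausdorff distance between two bags of instance features.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def miml_svm_train(bags, label_sets, all_labels, k, rng=np.random):
    # Simplified medoid choice: k random training bags (Algorithm 3
    # would refine these with k-medoids iterations, step 3).
    medoids = [bags[i] for i in rng.choice(len(bags), size=k, replace=False)]
    # Step 4: embed each bag as its distance vector to the medoids.
    Z = np.array([[hausdorff(X, M) for M in medoids] for X in bags])
    # Step 5: one binary SVM per attribute y (needs both classes present).
    models = {}
    for y in all_labels:
        t = np.array([+1 if y in Y_u else -1 for Y_u in label_sets])
        models[y] = SVC(kernel="rbf", C=1.0).fit(Z, t)
    return medoids, models

def miml_svm_predict(X_star, medoids, models):
    z = np.array([[hausdorff(X_star, M) for M in medoids]])
    # Step 6: all labels with non-negative score, and at least the best one.
    scores = {y: m.decision_function(z)[0] for y, m in models.items()}
    best = max(scores, key=scores.get)
    return {y for y, s in scores.items() if s >= 0} | {best}
```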

After obtaining the attribute detectors from the gallery dataset, the detectors are applied to the probe dataset to detect the attributes of every probe image. The detected attributes are then combined with low-level features to calculate the similarity between two images. Denote by X_u^P an image from the probe dataset with attribute vector A_u^P = [a_u1^P, a_u2^P, …, a_um^P], and by X_v^G an image from the gallery dataset with attribute vector A_v^G = [a_v1^G, a_v2^G, …, a_vm^G]; the similarity between these two images is computed as follows:

similarity(X_u^P, X_v^G) = w^T Φ(X_u^P, X_v^G)    (2)

Φ(X_u^P, X_v^G) = [sim_f(X_u^P, X_v^G), sim_A(X_u^P, X_v^G)]    (3)

sim_A(X_u^P, X_v^G) = exp(−|A_u^P − A_v^G|)    (4)

where sim_f(X_u^P, X_v^G) is defined by Eq. (1) and w is a weight vector learned by the RankSVM algorithm.
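Read as code, Eqs. (2)–(4) amount to the following sketch; the L1 difference inside the exponential of Eq. (4) is one plausible reading of |A_u^P − A_v^G| for vector arguments, and the weight vector w is assumed to come from a separately trained RankSVM.

```python
import numpy as np

def sim_A(attr_probe, attr_gallery):
    # Eq. (4): attribute similarity; |.| read as the L1 difference.
    return float(np.exp(-np.abs(attr_probe - attr_gallery).sum()))

def similarity(w, sim_f_value, attr_probe, attr_gallery):
    # Eq. (3): stack the low-level patch similarity (Eq. (1)) with the
    # attribute similarity; Eq. (2): weight the pair with RankSVM's w.
    phi = np.array([sim_f_value, sim_A(attr_probe, attr_gallery)])
    return float(w @ phi)
```

Gallery images are then ranked by this fused score to produce the final re-identification result.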

5. Experiment

We evaluate our algorithms on two benchmark datasets, VIPeR and PRID2011. The VIPeR dataset contains 632 pairs of pedestrian images taken from two cameras; all images are normalized to 128×48 pixels. Following the settings of D. Gray et al., we use Cam B as the gallery dataset and Cam A as the probe dataset. The PRID2011 dataset is also taken from two cameras, Cam A and Cam B; Cam A contains 385 pedestrian images and Cam B contains 749, and the first 200 images of the two cameras correspond to the same identities. To stay consistent with M. Hirzer et al., we experiment with these first 200 images, using Cam A as the probe dataset and Cam B as the gallery dataset.

5.1. Intrim and Extrim

We first compare the two trim strategies by computing the ratio of the number of foreground patches to the number of background patches. Figure 3 shows samples under the different strategies, and Table 2 gives the ratios: both strategies effectively increase the proportion of foreground patches, and extrim does better than intrim.

Figure 3: Some samples of different trim strategies.

Table 2: The ratio of foreground patches to background patches on the two datasets.

       | VIPeR | PRID2011
raw    | 1.68  | 1.71
intrim | 2.31  | 2.69
extrim | 2.87  | 3.13

5.2. Performance of Discriminative Patches

We then compare the performance of the algorithms based on the two trim strategies on the VIPeR dataset and the PRID2011 dataset (Figure 4).

The baseline algorithms on the VIPeR dataset are LOMO+XQDA [5], the Mid-level filter [15], the ELF algorithm [21], and the l2-norm algorithm [33]; on the PRID2011 dataset the baseline is the DDC algorithm [34]. Figure 4 reports recognition performance as CMC curves: each algorithm ranks the gallery images by their similarity to the probe image, and the matching rate is the percentage of correct matches among the top N ranked candidates. Figure 4(a) compares the algorithms on the VIPeR dataset and Figure 4(b) compares them on the PRID2011 dataset.

Figure 4: The re-identification performance of the two trim strategies. "disc" denotes the original algorithm without any trim strategy; "intrim" and "extrim" denote the algorithm using the intrim and the extrim strategy respectively.

Figure 4(a) shows that LOMO+XQDA performs best on VIPeR, with a matching score of 40% at rank 1. The original algorithm with either trim strategy is worse than LOMO+XQDA but better than the other methods on VIPeR, and even without any trim strategy it beats the l2-norm algorithm between rank 1 and rank 140. On the PRID2011 dataset, LOMO+XQDA is still far better than the other methods, extrim is as good as the Mid-level filter, and intrim is only better than the original algorithm. These results demonstrate that the extrim strategy is competitive for discriminative patch selection, while intrim is weaker; the main reason the DDC algorithm beats intrim is that DDC introduces manual work into the ranking process. On both datasets, the two trim strategies improve the performance of the original algorithm.

5.3. Performance of the MIMLL Algorithms

We evaluate the MIMLBOOST and MIMLSVM algorithms by the accuracy of detecting 21 attributes, where accuracy is the matching percentage over the top 20 ranked candidate images. Table 3 shows the distribution of the 21 attributes (second and third columns) and the detection performance on the two datasets (last four columns). In the experiments, MIMLBOOST trains 10 weak classifiers; for the SVM, the penalty factor is 1 and the kernel function is RBF.

Table 3 shows that the accuracies of both MIMLBOOST and MIMLSVM on the two datasets are relatively high. MIMLBOOST exceeds 90% on the attributes Redshirt, Greenshirt, Not light dark jeans colour, Shorts, and Patterned, and its average accuracy on VIPeR and PRID2011 is 77.09% and 78.01% respectively. MIMLSVM exceeds 90% on Redshirt, Greenshirt, Not light dark jeans colour, Barelegs, Shorts, and Patterned, and its average accuracy is 77.19% and 81.34%, a little higher than that of MIMLBOOST. The high accuracy of the two algorithms demonstrates the effectiveness of the MIMLL method.

Table 3: The attribute distributions and the detection accuracies (%) of the MIMLBOOST and MIMLSVM algorithms on the two datasets.

Attribute                   | Distribution      | MIMLBOOST        | MIMLSVM
                            | VIPeR | PRID2011  | VIPeR | PRID2011 | VIPeR | PRID2011
Redshirt                    | 0.1   | 0.06      | 94.1  | 95.6     | 94.1  | 96.6
Blueshirt                   | 0.1   | 0.07      | 80.9  | 91.2     | 83.8  | 93.1
Lightshirt                  | 0.446 | 0.20      | 64.7  | 72.1     | 75    | 85.3
Darkshirt                   | 0.495 | 0.635     | 54.4  | 66.2     | 55.9  | 67.6
Greenshirt                  | 0.052 | 0.01      | 97.1  | 97.1     | 97.1  | 97.1
Nocoats                     | 0.367 | 0.055     | 83.8  | 91.2     | 85.3  | 91.2
Not light dark jeans colour | 0.041 | 0.02      | 98.5  | 98.9     | 94.4  | 98.5
Dark bottoms                | 0.381 | 0.645     | 64.7  | 58.8     | 60.3  | 64.7
Light bottoms               | 0.473 | 0.145     | 80.9  | 77.9     | 79.4  | 84.8
Hassatchel                  | 0.261 | 0.325     | 58.8  | 52.9     | 64.7  | 55.9
Barelegs                    | 0.136 | 0.055     | 89.7  | 94.1     | 94.1  | 95.6
Shorts                      | 0.089 | 0.005     | 98.5  | 98.5     | 98.5  | 98.5
Jeans                       | 0.375 | 0.200     | 70.6  | 72.1     | 76.5  | 67.6
Male                        | 0.476 | 0.420     | 52.9  | 51.5     | 61.8  | 63.2
Skirt                       | 0.051 | 0.1       | 85.3  | 92.6     | 86.8  | 92.6
Patterned                   | 0.092 | 0.03      | 97.1  | 97.1     | 95.6  | 98.5
Midhair                     | 0.261 | 0.30      | 60.3  | 60.3     | 51.5  | 58.8
Dark hair                   | 0.617 | 0.51      | 54.4  | 45.6     | 51.5  | 48.5
Bald                        | 0.008 | 0.08      | 83.8  | 88.2     | 76.5  | 89.7
Has handbag carrier bag     | 0.079 | 0.22      | 73.5  | 58.8     | 64.7  | 75
Has backpack                | 0.324 | 0.12      | 75    | 77.9     | 73.5  | 85.3
Average                     | 0.249 | 0.200     | 77.09 | 78.01    | 77.19 | 81.34

5.4. Performance of Attributes

Figure 5 shows the re-identification performance of the two MIMLL algorithms as CMC curves. We adopt LOMO+XQDA and the Mid-level filter as baselines on both datasets, add the ELF algorithm and the l2-norm algorithm on the VIPeR dataset, and the DDC algorithm on the PRID2011 dataset. We compare the MIMLBOOST and MIMLSVM frameworks directly against these methods, and also combine extrim with MIMLBOOST and MIMLSVM to measure the improvement brought by extrim.

Figure 5 demonstrates that the attributes learned by the two MIMLL algorithms help solve the re-identification problem greatly. Combined with extrim, the two MIMLL algorithms perform as well as LOMO+XQDA on the VIPeR dataset, and MIMLSVM combined with extrim is even better than LOMO+XQDA on the PRID2011 dataset. Without extrim, the two MIMLL algorithms are competitive with the Mid-level filter. These results show that attributes are important for person re-identification, which is why the four MIMLL variants outperform the ELF and l2-norm algorithms.

Figure 5: The re-identification performance of the two MIMLL algorithms. We compare the two algorithms, with and without extrim, against LOMO+XQDA, the Mid-level filter, the ELF algorithm, and other methods.

6. Conclusion

In this paper, we defined the discriminative region of a person's image, evaluating the discriminability of image patches with an adapted canopy-kmeans algorithm. We also introduced the multi-instance multi-label learning method into re-identification for the first time and proposed a framework that solves person re-identification by applying MIMLL: the MIMLBOOST and MIMLSVM algorithms are adopted to detect attributes in images, and the detected attributes contribute to the recognition of different pedestrians. The experiments show the superior performance of the multi-instance multi-label learning method. Although both the discriminative patches and the attributes are salient regions of pedestrian images, only the attributes carry semantics that support human-computer interaction. The experimental results show that attributes can largely improve the performance of re-identification algorithms.

7. Acknowledgments

This work is supported by the Natural Science Foundation of Fujian Province of China (No. 2014J01249), the Xiamen City Science and Technology Project (No. 3502Z20153003), the Special Fund for Earthquake Research in the Public Interest (No. 201508025), the National Natural Science Foundation of China (No. 61402388, No. 61422210, and No. 61373076), the Fundamental Research Funds for the Central Universities (No. 20720150080 and No. 2013121026), and the CCF-Tencent Open Research Fund.

References

[1] R. Ji, Y. Gao, W. Liu, X. Xie, Q. Tian, X. Li, When location meets social multimedia: A survey on vision-based recognition and mining for geo-social multimedia analytics, ACM Transactions on Intelligent Systems and Technology (TIST) 6 (1) (2015) 1.
[2] T. Guan, Y. Wang, L. Duan, R. Ji, On-device mobile landmark recognition using binarized descriptor with multifeature fusion, ACM Transactions on Intelligent Systems and Technology (TIST) 7 (1) (2015) 12.
[3] R. Ji, L.-Y. Duan, J. Chen, T. Huang, W. Gao, Mining compact bag-of-patterns for low bit rate mobile visual search, IEEE Transactions on Image Processing 23 (7) (2014) 3099–3113.
[4] S. Bak, E. Corvee, F. Brémond, M. Thonnat, Person re-identification using spatial covariance regions of human body parts, in: Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on, IEEE, 2010, pp. 435–440.
[5] S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
[6] L. Bazzani, M. Cristani, V. Murino, Symmetry-driven accumulation of local features for human characterization and re-identification, Computer Vision and Image Understanding 117 (2) (2013) 130–144.
[7] M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by symmetry-driven accumulation of local features, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2360–2367.
[8] L. An, M. Kafai, S. Yang, B. Bhanu, Person reidentification with reference descriptor, IEEE Transactions on Circuits and Systems for Video Technology 26 (4) (2016) 776–787.
[9] S. Gong, M. Cristani, S. Yan, C. C. Loy, Person Re-identification, Vol. 1, Springer, 2014.
[10] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2288–2295.
[11] P. M. Roth, M. Hirzer, M. Köstinger, C. Beleznai, H. Bischof, Mahalanobis distance learning for person re-identification, in: Person Re-Identification, Springer, 2014, pp. 247–267.
[12] C. C. Loy, C. Liu, S. Gong, Person re-identification by manifold ranking, in: Image Processing (ICIP), 2013 20th IEEE International Conference on, IEEE, 2013, pp. 3567–3571.
[13] W.-S. Zheng, S. Gong, T. Xiang, Reidentification by relative distance comparison, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) 653–668.
[14] L. An, S. Yang, B. Bhanu, Person re-identification by robust canonical correlation analysis, IEEE Signal Processing Letters 22 (8) (2015) 1103–1107.
[15] R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 144–151.
[16] B. Ma, Y. Su, F. Jurie, Covariance descriptor based on bio-inspired features for person re-identification and face verification, Image and Vision Computing 32 (6) (2014) 379–390.
[17] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by Fisher vectors for person re-identification, in: Computer Vision–ECCV 2012. Workshops and Demonstrations, Springer, 2012, pp. 413–422.
[18] R. Zhao, W. Ouyang, X. Wang, Person re-identification by salience matching, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2528–2535.
[19] C. Liu, S. Gong, C. C. Loy, X. Lin, Person re-identification: What features are important?, in: Computer Vision–ECCV 2012. Workshops and Demonstrations, Springer, 2012, pp. 391–401.
[20] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3586–3593.
[21] D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: Computer Vision–ECCV 2008, Springer, 2008, pp. 262–275.
[22] M. Liu, D. Zhang, S. Chen, Attribute relation learning for zero-shot classification, Neurocomputing 139 (2014) 34–46.
[23] R. Layne, T. M. Hospedales, S. Gong, Attributes-based re-identification, in: Person Re-Identification, Springer, 2014, pp. 93–117.
[24] A. Li, L. Liu, S. Yan, Person re-identification by attribute-assisted clothes appearance, in: Person Re-Identification, Springer, 2014, pp. 119–138.
[25] M. Dikmen, E. Akbas, T. S. Huang, N. Ahuja, Pedestrian recognition with a learned metric, in: Computer Vision–ACCV 2010, Springer, 2010, pp. 501–512.
[26] W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by probabilistic relative distance comparison, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 649–656.
[27] M. Hirzer, P. M. Roth, M. Köstinger, H. Bischof, Relaxed pairwise learned metric for person re-identification, in: Computer Vision–ECCV 2012, Springer, 2012, pp. 780–793.
[28] R. Ji, Y. Gao, R. Hong, Q. Liu, D. Tao, X. Li, Spectral-spatial constraint hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 52 (3) (2014) 1811–1824.
[29] Q. Zou, J. Zeng, L. Cao, R. Ji, A novel features ranking metric with application to scalable visual and bioinformatics data classification, Neurocomputing 173 (2016) 346–354.
[30] C. Schmid, Constructing models for content-based image retrieval, in: Computer Vision and Pattern Recognition (CVPR), 2001 IEEE Computer Society Conference on, Vol. 2, IEEE, 2001, pp. II-39.
[31] I. Fogel, D. Sagi, Gabor filters as texture discriminator, Biological Cybernetics 61 (2) (1989) 103–113.
[32] Z.-H. Zhou, M.-L. Zhang, Multi-instance multi-label learning with application to scene classification, in: Advances in Neural Information Processing Systems, 2006, pp. 1609–1616.
[33] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, S. Maybank, Principal axis-based correspondence between multiple cameras for people tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 663–671.
[34] M. Hirzer, C. Beleznai, P. M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Image Analysis, Springer, 2011, pp. 91–102.
