Multiview max-margin subspace learning for cross-view gait recognition


Wanjiang Xu a,b,*, Canyan Zhu a, Ziou Wang a

a Institute of Intelligent Structure and System, Soochow University, Suzhou 215006, China
b Yancheng Teachers University, Yancheng 224002, China

* Corresponding author. E-mail address: [email protected] (W. Xu).

Article history: Available online xxx

MSC: 41A05; 41A10; 65D05; 65D17

Keywords: Subspace learning; Gait recognition; Transform matrices

Abstract

Cross-view gait recognition can be regarded as a domain adaptation problem, in which the probe gait to be recognized in one view differs from the gallery gaits collected in another view. In this paper, we present a subspace learning based method, called Multiview Max-Margin Subspace Learning (MMMSL), to learn a common subspace for associating gait data across different views. A group of projection matrices that respectively map data from different views into the common subspace is optimized by simultaneously minimizing the within-class variations and maximizing the local between-class variations of the low-dimensional embeddings, from both the inter-view and intra-view perspectives. In the learnt subspace, same-class samples from all views cluster together, and each cluster is kept as far as possible from its nearest different-class neighbors. Experimental results on two benchmark gait databases, CASIA-B and OU-ISIR, demonstrate the effectiveness of the proposed method. Extensive experiments also show that MMMSL achieves significant improvements over related subspace learning based methods.

1. Introduction

Gait is one of the few biometric features that can be measured remotely, without physical contact or proximal sensing. Gait recognition identifies a person by his or her manner of walking, and is useful in many applications such as robot vision and intelligent security systems. In reality, however, several factors significantly affect human gait, including walking speed, dressing, carrying objects, and viewing angle change [30]. Among these, viewing angle change has been regarded as one of the most challenging problems for gait recognition [42], because the walking sequence of a person captured by a camera is a 2D gait, and the appearance of the observed 2D gait varies greatly from one viewing angle to another.

Recognizing gaits across different views can be considered a domain adaptation problem, in which gallery gaits (termed the source domain) and probe gaits (termed the target domain) follow different distributions. Domain adaptation is also known more generally as transfer learning [28], which has been applied in various fields, e.g., computer vision [6] and natural language processing [35].




In principle, domain adaptation attempts to transfer the rich knowledge of a source domain to a target domain with limited information, in order to induce a better model. During the learning process, domain adaptation makes use of information coming from both the source and target domains. To bridge the two domains, an efficient scheme is to explore the commonality of both domains [20]. For example, a common structure for two different modalities was learned in [13] to reduce the semantic gap via pairwise constraints, and a common dictionary in a low-dimensional space was formed for domain adaptive sparse representation [43]. To discover this common knowledge, a common subspace is typically acquired, in which the structures of both domains are preserved and their disparity is reduced [20,33].

The most typical approach for obtaining a common subspace for multiple views is Canonical Correlation Analysis (CCA). CCA learns two transformations, one for each view, that respectively project samples from the two views into a common subspace by maximizing the cross correlation between the views. In addition, several variants of CCA have been proposed for cross-view gait recognition, such as kernel CCA (KCCA) [3] and discriminant CCA (DCCA) [15]. CCA and its variants are applicable only to the two-view scenario. To deal with multi-view (more than two views) cases, the pairwise strategy is usually exploited, converting the problem of one common subspace for $v$ views into $v(v-1)/2$ two-view common-subspace problems, which costs more computational resources. A more efficient and robust solution is to learn a unified common subspace shared by all views rather than by only two views.
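As a concrete point of reference for the two-view case, the following is a minimal sketch of a CCA baseline using scikit-learn; the arrays are illustrative stand-ins for paired training features from two views, not data from this paper.

```python
# A minimal two-view CCA sketch (scikit-learn); array contents are illustrative.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_a = rng.normal(size=(74, 100))    # view-a training features (e.g., PCA-reduced GEIs)
X_b = rng.normal(size=(74, 100))    # paired view-b training features

cca = CCA(n_components=20)
cca.fit(X_a, X_b)                   # one linear transform per view,
                                    # maximizing cross-view correlation
Z_a, Z_b = cca.transform(X_a, X_b)  # embeddings in the shared subspace for matching
```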



Fig. 1. Overview of multiview max-margin subspace learning for gait recognition. Gait samples are represented by the Gait Energy Image (GEI) [12]. Gaits from different views are projected into a common subspace, in which the gaits of the same person from various viewing angles collapse into one sphere, and different spheres keep away from each other.

For this purpose, Multiview CCA (MCCA) [29] was proposed to obtain one common space by maximizing the total correlations between any two views. Compared with pairwise CCA, MCCA obtains only $v$ view-specific transforms rather than $v(v-1)/2$ pairs of view-specific transforms. Although MCCA can learn a common subspace shared by all views, the learnt subspace may be unfavorable for cross-view gait recognition, because MCCA does not consider discriminant information, e.g., class labels, which is beneficial for recognition or classification.

To address this problem, we propose a Multiview Max-Margin Subspace Learning (MMMSL) method that learns a discriminant common subspace. As seen in Fig. 1, gait features from different views are mapped into the common subspace, where they can be matched directly. Moreover, the learnt subspace is discriminant: same-class gait samples cluster together and keep away from neighboring clusters. Compared with previous works, the proposed method has the following properties. 1) A common subspace shared by multiple views is obtained for cross-view gait recognition by jointly optimizing $v$ view-specific transforms. 2) In the learning process, the max-margin constraint between neighboring different-class samples yields a more discriminant subspace. 3) The proposed method achieves strong performance on the CASIA-B [42] and OU-ISIR [24] gait datasets, outperforming related subspace learning methods.

The rest of this paper is organized as follows. Related works are reviewed in Section 2. In Section 3, the problem statement and formulation are presented. Experimental results are shown in Section 4, and conclusions are drawn in Section 5.

2. Related works

In the recent literature, approaches to gait recognition can be grouped into two categories: model-based methods [1,36,40] and appearance-based methods [18,22,38]. Model-based methods generally characterize the kinematics of human joints in order to measure physical gait parameters such as trajectories, limb lengths, and angular speeds. These methods were prevalent in early research. However, the human body is a highly flexible structure, and it is difficult to locate the joint positions accurately; in addition, a small number of parameters may not be enough to uniquely identify a person. Appearance-based methods instead extract gait representations directly from video. Compared with the former, the latter have demonstrated better performance on the common databases. However, all these methods suffer from the difficulty of view change.

A variety of approaches have been proposed to address viewing angle change; they can be classified into three main categories. The first category [5,17,31,44] constructs a 3D gait model using cooperative multiple cameras. Sufficient 3D information guarantees promising gait recognition performance, but methods based on 3D reconstruction require complex camera calibration and expensive computation, which greatly restricts their application in many scenarios. Approaches in the second category [11,18,23] aim to extract gait features that are invariant to view change; these approaches can perform well in their specific scenarios but are usually hard to generalize to other cases. The third category [21,37,38] relies on learning projection relationships of gait across views: through a training process, gait features from different viewing spaces are projected into one or more common canonical subspaces before gait similarity is measured. Compared with the former two, the third category needs only a simple non-cooperative camera system and is more efficient and stable if sufficient training samples are supplied. The method proposed in this paper belongs to this category.

Within the third category, there are two kinds of methods: those based on a view transform model (VTM) and those based on subspace learning. A VTM transforms gait features from one view into another. Makihara et al. [25] used frequency-domain gait features from different views to form a large matrix, then factorized the matrix by singular value decomposition (SVD) to establish the VTM. Kusakunniran et al. [21] created a VTM using support vector regression based on local dynamic feature extraction. Later, Muramatsu et al. [26] developed an arbitrary view transformation model (AVTM) by combining aspects of the first and third categories: 3D gait visual hulls were established and used to generate training gait sequences under any required views, and a VTM was then constructed to transform features. Different from VTM approaches, subspace learning based approaches transform features from various viewing spaces into a shared feature subspace. Bashir et al. [3] used CCA to learn maximally correlated feature subspaces and employed correlation strength to measure gait similarity. Hu [15] proposed an uncorrelated multilinear sparse local discriminant canonical correlation analysis (UMSLDCCA) approach to model the correlations of gait features from different viewing angles, with a tensor-to-vector projection (TVP) adopted to extract gait features for measuring similarity. Xing et al. [38] proposed complete CCA (C3A) to overcome the singularity problem of the covariance matrix and to alleviate the computational burden of high-dimensional matrices for typical gait image data. Compared with VTM, subspace learning based methods can cope with feature mismatch across views and are more robust against feature noise.

On the other hand, subspace learning methods for multiview data have been widely applied in many other scenarios, such as multiview face recognition [8,10,19] and text-image retrieval [32]. Among them, one of the best-known approaches is canonical correlation analysis (CCA). CCA is a two-view subspace learning method; for multi-view (more than two views) gait recognition, it can be extended using the pairwise strategy. However, such a pairwise manner is neither efficient nor optimal for classification across different views. What we need is a unified semantic common space shared by all views rather than only two, which should embody invariant features or attributes that can identify the underlying object.


For this purpose, Multiview CCA (MCCA) [27] was proposed to obtain one common subspace for all views by maximizing the total correlations between any two views. Since it applies straightforwardly to multiview data, MCCA has been further developed to conduct multiview clustering and regression [29]. However, MCCA is an unsupervised method, and for the task of object recognition a supervised one would be better. Based on this consideration, the generalized multi-view analysis (GMA) framework was proposed in [32]. GMA can be used to extend many existing feature extraction techniques into their multi-view counterparts; by employing intra-view label and local structure information, GMA has better recognition capability. Later, a Multiview Discriminant Analysis (MvDA) method was proposed in [19], which considers both inter-view and intra-view global structure information, leading to a discriminant common subspace. Recently, Ding and Fu [9] considered structure information as consisting of class structure and view structure, and disentangled the two intertwined structures from each other in the learned subspace through dual low-rank decompositions. Most recently, Wu et al. [37] proposed a deep learning based method for recognizing gait across different viewing angles, in which a deep convolutional neural network (CNN) is trained to judge whether any two gait samples belong to the same class. A CNN can obtain rich features in a discriminative manner thanks to its deep model; it is essentially a deep non-linear normalization of the gait features. The disadvantage of their method is that training a CNN model may take days or even weeks, and the paired matching measurement costs considerable time in the testing phase.

3. Our method

In this section, we first state the problem of cross-view gait recognition and briefly introduce our main objective. Then, the formulation of the proposed method and its detailed derivation are provided. Finally, we discuss the differences between the proposed method and related subspace learning methods.

Fig. 2. Schematic illustration of samples’ neighborhood before learning (left) versus after learning (right). The common subspace θ is optimized so that: (i) same-class samples are clustered together; (ii) the neighboring different-class samples are separated in different clusters with maximum margin.

3.1. Problem statement

As seen in Fig. 1, for example, the variances of the same person under different viewing angles are much greater than the variances of different persons under the same viewing angle. We attempt to seek a unified common subspace $\theta$ in which gait samples under different viewing angles can be matched correctly. Let $S = \{(x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(v)}), \ldots, (x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(v)})\}$ represent the gait samples, where $x_i^{(\alpha)}$ denotes the $i$-th sample from the $\alpha$-th view. The distance between samples $x_i^{(\alpha)}$ and $x_j^{(\beta)}$ in the common subspace $\theta$ is written $d_\theta(x_i^{(\alpha)}, x_j^{(\beta)})$. In this paper, we employ the Nearest Neighbor (NN) rule for classification: each unlabeled probe example is classified by the label of its nearest neighbor in the gallery set. Therefore, for a probe gait, we hope that in the learnt common subspace the gallery samples of the same class are its nearest neighbors; meanwhile, to avoid classification errors, neighboring samples of different classes should be kept far away with a large margin. That is, in the common subspace $\theta$, same-class samples should be clustered together, no matter whether they come from the same or different viewing spaces, and neighboring clusters should be separated from each other as far as possible, as shown in Fig. 2.

In short, there are two principal objectives in learning the common subspace. On one hand, we would like to cluster same-class samples together: the distance between each pair of same-class samples should be as small as possible. The objective function can be written as

$$\min_{\theta} \sum_{\alpha,\beta} \sum_{i,j} d_\theta\bigl(x_i^{(\alpha)}, x_j^{(\beta)}\bigr)\, s_{w,ij}^{(\alpha,\beta)}. \tag{1}$$

Each affinity element $s_{w,ij}^{(\alpha,\beta)}$ describes whether the two samples belong to the same class. We define $c_i^\alpha$ as the class of sample $x_i^{(\alpha)}$; then $c_j^\beta = c_i^\alpha$ indicates that $x_i^{(\alpha)}$ and $x_j^{(\beta)}$ are in the same class. Therefore, element $s_{w,ij}^{(\alpha,\beta)}$ is defined as

$$s_{w,ij}^{(\alpha,\beta)} = \begin{cases} 1 & : c_j^\beta = c_i^\alpha \\ 0 & : \text{otherwise.} \end{cases} \tag{2}$$

On the other hand, we expect neighboring different-class samples to be kept away from each other to avoid classification errors. As seen in Fig. 2, before learning, samples from different classes may be close neighbors; after learning, these neighbors should be separated into different clusters with maximum margin in the common subspace. The distance between each neighboring pair of different-class samples should be as large as possible:

$$\max_{\theta} \sum_{\alpha,\beta} \sum_{i,j} d_\theta\bigl(x_i^{(\alpha)}, x_j^{(\beta)}\bigr)\, s_{b,ij}^{(\alpha,\beta)}. \tag{3}$$

Affinity element $s_{b,ij}^{(\alpha,\beta)}$ represents the neighboring relationship of samples $x_i^{(\alpha)}$ and $x_j^{(\beta)}$: the affinity is assigned 1 if the two samples are neighbors. However, for two samples from different viewing spaces, the neighboring relationship cannot be measured by Euclidean distance. Here, we define $P_k(c_i^\alpha)$ as the set of class pairs that are the $k$ nearest pairs among the set $\{(c_i^\alpha, c_j^\beta) : c_i^\alpha \neq c_j^\beta\}$; the pair $(c_i^\alpha, c_j^\beta)$ is added into $P_k(c_i^\alpha)$ if the sample $x_j^{(\beta)}$ is a $k$-nearest neighbor of $x_i^{(\alpha)}$. Each element $s_{b,ij}^{(\alpha,\beta)}$ is then encoded as

$$s_{b,ij}^{(\alpha,\beta)} = \begin{cases} 1 & : (c_i^\alpha, c_j^\beta) \in P_k(c_i^\alpha) \ \text{or}\ (c_i^\alpha, c_j^\beta) \in P_k(c_j^\beta) \\ 0 & : \text{otherwise.} \end{cases} \tag{4}$$

From Eqs. (2) and (4), we can conclude that $s_{w,ji}^{(\beta,\alpha)} = s_{w,ij}^{(\alpha,\beta)}$ and $s_{b,ji}^{(\beta,\alpha)} = s_{b,ij}^{(\alpha,\beta)}$. If same-class samples $x_i^{(\alpha)}$ and $x_j^{(\beta)}$ are mapped far apart, the large distance between them in the learnt subspace incurs a heavy penalty according to objective function (1). Likewise, objective function (3) with our choice of $s_{b,ij}^{(\alpha,\beta)}$ incurs a heavy penalty if different-class samples $x_i^{(\alpha)}$ and $x_j^{(\beta)}$ are mapped close together.
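To make Eqs. (2) and (4) concrete, below is a minimal NumPy sketch of the two affinity matrices for one view pair $(\alpha, \beta)$. It approximates the class-pair rule of Eq. (4) with sample-level k-nearest neighbors computed in a shared preliminary space (e.g., after PCA), since raw cross-view Euclidean distances are not directly comparable; all names are illustrative.

```python
# A sketch of the within-class affinity (Eq. 2) and the neighboring
# between-class affinity (Eq. 4) for one view pair. Eq. (4) is approximated
# at the sample level: S_b marks the k nearest different-class neighbors in
# either direction (the "or" clause). Za, Zb are embeddings in a shared
# preliminary (e.g., PCA) space; names are illustrative.
import numpy as np

def affinity_matrices(labels_a, labels_b, Za, Zb, k=5):
    same = labels_a[:, None] == labels_b[None, :]
    S_w = same.astype(float)                              # Eq. (2)

    d = ((Za[:, None, :] - Zb[None, :, :]) ** 2).sum(-1)  # squared distances
    d = np.where(same, np.inf, d)                         # different-class pairs only
    S_b = np.zeros_like(d)
    rows = np.argsort(d, axis=1)[:, :k]                   # k-NN of each alpha-view sample
    cols = np.argsort(d, axis=0)[:k, :]                   # k-NN of each beta-view sample
    for i in range(d.shape[0]):
        S_b[i, rows[i]] = 1.0
    for j in range(d.shape[1]):
        S_b[cols[:, j], j] = 1.0                          # the "or" clause of Eq. (4)
    return S_w, S_b
```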

3.2. Formulation

As described in the last subsection, we attempt to seek a common subspace shared by different views. To achieve this, linear transforms $W_1, W_2, \ldots, W_v$ are used to project samples from the $v$ views into the common subspace, so our objective becomes finding $v$ view-specific optimal linear transform matrices. Formally, we define all samples in the $\alpha$-th viewing space as $X^{(\alpha)} = \{x_1^{(\alpha)}, x_2^{(\alpha)}, \ldots, x_n^{(\alpha)}\}$, where $x_i^{(\alpha)} \in \mathbb{R}^{p_\alpha}$ represents the $i$-th sample from the $\alpha$-th view. Each linear transform matrix $W_\alpha$, $\alpha = 1, \ldots, v$, projects $X^{(\alpha)}$ into the common space. The distance of each sample pair $x_i^{(\alpha)}$ and $x_j^{(\beta)}$ is then measured as

$$d_\theta\bigl(x_i^{(\alpha)}, x_j^{(\beta)}\bigr) = \bigl\lVert W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)} \bigr\rVert_2^2. \tag{5}$$

We define a new objective function that pursues the optimal projection matrices by considering both Eqs. (1) and (3). The objective combining targets (1) and (3) is

$$(W_1^*, W_2^*, \ldots, W_v^*) = \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \sum_{i,j} \bigl\lVert W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)} \bigr\rVert_2^2 \, s_{w,ij}^{(\alpha,\beta)}}{\sum_{\alpha,\beta} \sum_{i,j} \bigl\lVert W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)} \bigr\rVert_2^2 \, s_{b,ij}^{(\alpha,\beta)}}, \tag{6}$$

where $s_{w,ij}^{(\alpha,\beta)}$ and $s_{b,ij}^{(\alpha,\beta)}$ describe the same-class affinity and the neighboring different-class affinity of the sample pair $x_i^{(\alpha)}$ and $x_j^{(\beta)}$, as defined in Eqs. (2) and (4). By some deduction of linear algebra, Eq. (6) is transformed as follows (please refer to Appendix A for more details):

$$W^* = \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \mathrm{tr}\bigl( W_\alpha^T X^{(\alpha)} D_w^{(\alpha,\beta)V} X^{(\alpha)T} W_\alpha + W_\beta^T X^{(\beta)} D_w^{(\alpha,\beta)H} X^{(\beta)T} W_\beta - W_\alpha^T X^{(\alpha)} S_w^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\beta^T X^{(\beta)} S_w^{(\alpha,\beta)T} X^{(\alpha)T} W_\alpha \bigr)}{\sum_{\alpha,\beta} \mathrm{tr}\bigl( W_\alpha^T X^{(\alpha)} D_b^{(\alpha,\beta)V} X^{(\alpha)T} W_\alpha + W_\beta^T X^{(\beta)} D_b^{(\alpha,\beta)H} X^{(\beta)T} W_\beta - W_\alpha^T X^{(\alpha)} S_b^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\beta^T X^{(\beta)} S_b^{(\alpha,\beta)T} X^{(\alpha)T} W_\alpha \bigr)}, \tag{7}$$

where $S_w^{(\alpha,\beta)}$ and $S_b^{(\alpha,\beta)}$ are the adjacency matrices whose elements are $s_{w,ij}^{(\alpha,\beta)}$ and $s_{b,ij}^{(\alpha,\beta)}$; $D_w^{(\alpha,\beta)V}$, $D_w^{(\alpha,\beta)H}$, $D_b^{(\alpha,\beta)V}$ and $D_b^{(\alpha,\beta)H}$ are diagonal matrices defined as $D_{w,ii}^{(\alpha,\beta)V} = \sum_j s_{w,ij}^{(\alpha,\beta)}$, $D_{w,jj}^{(\alpha,\beta)H} = \sum_i s_{w,ij}^{(\alpha,\beta)}$, $D_{b,ii}^{(\alpha,\beta)V} = \sum_j s_{b,ij}^{(\alpha,\beta)}$ and $D_{b,jj}^{(\alpha,\beta)H} = \sum_i s_{b,ij}^{(\alpha,\beta)}$; and $\mathrm{tr}(\cdot)$ denotes the trace operator.

Assuming that $Z = \mathrm{diag}\bigl(X^{(1)}, X^{(2)}, \ldots, X^{(v)}\bigr)$ is the block-diagonal data matrix, the alternative expression of Eq. (7) is as follows (please refer to Appendix B for more details):

$$W^* = \arg\min_{W} \frac{\mathrm{tr}\bigl( W^T Z (D_w - S_w) Z^T W \bigr)}{\mathrm{tr}\bigl( W^T Z (D_b - S_b) Z^T W \bigr)}, \tag{8}$$

where $W = [W_1, W_2, \ldots, W_v]$, and $S_w$ and $S_b$ are the within-class and between-class adjacency matrices assembled from the blocks $S_w^{(\alpha,\beta)}$ and $S_b^{(\alpha,\beta)}$:

$$S_w = \begin{bmatrix} S_w^{(1,1)} & S_w^{(1,2)} & \cdots & S_w^{(1,v)} \\ S_w^{(2,1)} & S_w^{(2,2)} & \cdots & S_w^{(2,v)} \\ \vdots & \vdots & \ddots & \vdots \\ S_w^{(v,1)} & S_w^{(v,2)} & \cdots & S_w^{(v,v)} \end{bmatrix}, \qquad S_b = \begin{bmatrix} S_b^{(1,1)} & S_b^{(1,2)} & \cdots & S_b^{(1,v)} \\ S_b^{(2,1)} & S_b^{(2,2)} & \cdots & S_b^{(2,v)} \\ \vdots & \vdots & \ddots & \vdots \\ S_b^{(v,1)} & S_b^{(v,2)} & \cdots & S_b^{(v,v)} \end{bmatrix},$$

while $D_w$ and $D_b$ are the block-diagonal matrices

$$D_w = \mathrm{diag}\Bigl( \sum\nolimits_\beta D_w^{(1,\beta)},\ \sum\nolimits_\beta D_w^{(2,\beta)},\ \ldots,\ \sum\nolimits_\beta D_w^{(v,\beta)} \Bigr), \qquad D_b = \mathrm{diag}\Bigl( \sum\nolimits_\beta D_b^{(1,\beta)},\ \ldots,\ \sum\nolimits_\beta D_b^{(v,\beta)} \Bigr),$$

with $D_w^{(\alpha,\beta)} = D_w^{(\alpha,\beta)H} = D_w^{(\alpha,\beta)V}$ and $D_b^{(\alpha,\beta)} = D_b^{(\alpha,\beta)H} = D_b^{(\alpha,\beta)V}$ (these equalities follow from the symmetry of the affinities shown in Section 3.1). The optimal $W$ can be obtained by solving the generalized eigenvalue problem

$$Z(D_w - S_w)Z^T W = \lambda\, Z(D_b - S_b)Z^T W. \tag{9}$$

Here, $W$ is composed of the eigenvectors corresponding to the first $m$ smallest eigenvalues of Eq. (9).


3.3. Discussion

As a multiview subspace learning method, our MMMSL shares some common properties with the recent works MCCA [29], GMA [32] and MvDA [19]. To further clarify our method, this subsection discusses the major differences between our method and the other three.

Firstly, MMMSL is a supervised method while MCCA is an unsupervised one. MCCA learns only the correlation between views; in contrast, our method uses label information to cluster same-class samples together and to set different clusters apart in the learned space. A supervised method is clearly more competent for the task of classification or recognition.

Secondly, MMMSL considers both intra-view and inter-view discriminant information, while GMA considers only the intra-view information. GMA is a comprehensive framework from which multiview extensions of PCA, LDA, LPP, MFA and so on can be obtained. However, it extends these methods only within each individual view; the inter-view relationship is established using correlation information, as in MCCA. For cross-view and multi-view object recognition, the inter-view discriminant information is more important.

Thirdly, both MMMSL and MvDA use label information to learn a discriminant common space. The difference between our method and MvDA is whether local information is employed. MvDA learns multiple view-specific linear transforms by maximizing the between-class variations and minimizing the within-class variations in the common space, with the variations measured in a global manner. In the gait recognition task, however, local information may be more important: given a probe sample, the nearest neighboring gallery sample is usually singled out as the candidate. MMMSL attempts to set neighboring different-class samples far apart to avoid recognition errors. Hence, MMMSL is more capable of recognizing gait across different views.


Fig. 3. Sample images from the OU-ISIR gait database.

4. Experiments

In this section, we evaluate the performance of the proposed algorithm on two large gait databases. We begin with a description of the experimental settings.

4.1. Experimental settings

4.1.1. Datasets

Extensive experiments were conducted on the two largest benchmark gait datasets: CASIA-B [42] and OU-ISIR [24]. CASIA-B [42] is one of the most widely used datasets for evaluating gait recognition across different viewing angles. It contains 124 subjects from 11 views (0°, 18°, ..., 180°), with six normal, two bag-carrying, and two coat-wearing gait sequences for each subject under each view. As viewing angle is the main focus of this work, a subset consisting of seven views (36°, 54°, 72°, 90°, 108°, 126°, 144°) of normal gait image sequences from all subjects is selected as the evaluation data. This subset is divided into two parts: the gait sequences of the first 74 subjects are used for training, while those of the remaining 50 subjects are used for testing.

The second dataset is the OU-ISIR gait dataset [24], which contains 4007 subjects (2135 males and 1872 females) with ages ranging from 1 to 94 years. Gait data were captured using a single camera placed at a 5-meter distance from the walking course. For each subject, two sequences are available, one in the gallery and the other as a probe. As illustrated in Fig. 3, each walking sequence is divided into four subsets based on observation angle (55°, 65°, 75°, 85°), where the angle is defined by the walking direction and the line of sight of the camera.

4.1.2. Gait feature representation

Most previous feature representation approaches in gait recognition extract features from the human silhouette. The well-known Gait Energy Image (GEI) [12] has demonstrated powerful performance in representing human gait. Since gait is a periodic action, a GEI is constructed from a sequence of aligned gait images within a window of complete walking cycle(s). The GEI contains rich information about human gait, including human shape, motion frequency, and the temporal and spatial changes of the human body. Example GEIs can be seen in Fig. 1.
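Conceptually, the GEI reduces to a pixel-wise average. A minimal sketch, assuming pre-aligned, size-normalized binary silhouettes covering complete gait cycle(s):

```python
# A minimal GEI sketch: pixel-wise mean of aligned binary silhouettes over
# complete gait cycle(s). `silhouettes` is an illustrative (T, H, W) array
# of 0/1 frames, already cropped and size-normalized.
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    return silhouettes.astype(np.float64).mean(axis=0)
```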

Fig. 4. An example illustrating 2D representation of gait samples in CASIA-B before and after applying MMMSL. Different colors represent different views. Different shapes represent different persons.

4.1.3. Other settings

The proposed MMMSL is compared with related subspace learning methods, including pairwise CCA (PW-CCA), MCCA [29], GMA-LDA [32], GMA-MFA [32] (the extensions of LDA [4] and MFA [39] within the GMA framework) and MvDA [19]. The work [19] presents two algorithms, MvDA and MvDA-VC; MvDA-VC is an extension of MvDA that introduces a view-consistency constraint on the multiple linear transforms. To simplify the comparison, we report the better result of the two algorithms as the result of MvDA. CCA is a two-view method, so we exploit the pairwise strategy (PW-CCA) for multi-view classification. The nearest neighbor classifier is used in all experiments. To reduce dimensionality, Principal Component Analysis (PCA) [34] is first applied; for all methods, the reduced dimension is set to 100, preserving more than 95% of the energy. It is still an open problem to set parameters in manifold learning algorithms, such as the number of neighbors k in ISOMAP [2], LLE [7] and LPP [14]; following the guideline for selecting the neighborhood parameter k presented in [41], we empirically set k to 5.

4.2. Results on CASIA-B gait database

Each gait sequence is first used to generate a GEI as described in [16]. Scatter diagrams of training gait samples from views 54°, 126° and 144° in CASIA-B, before and after applying the learned transformations, are shown in Fig. 4. Fig. 4(a) shows the original data distribution projected into a 2D space generated by PCA, where different colors represent different viewing angles. Fig. 4(b) shows the original distribution of the samples of persons 001, 003 and 006 under different viewing angles: in each viewing space, the samples of the three persons are neighbors, while samples of the same person under different viewing angles are far away from each other. Because of this view problem, cross-view gait samples are hard to match directly. Fig. 4(c) illustrates the projected data distribution after applying the learned transformations, and Fig. 4(d) the new 2D representation of the samples of persons 001, 003 and 006. From Fig. 4, we find that after applying MMMSL, samples of the same person gather together, while neighboring different-person samples in the original view spaces are driven away from each other.

There are 50 × 6 GEIs for testing under each viewing angle. The probe GEIs come from 7 views, leading to 7 × 6 = 42 cross-view evaluations in terms of rank-1 recognition rate. To the best of our knowledge, deep convolutional neural networks (CNN) [37] and the view-independent discriminative projection (ViDP) [16] are two of the best performers on this dataset. However, a CNN needs many training samples to perform well; moreover, different from our method, the work [37] obtains the direct similarity of any two gait samples rather than a common subspace in which similarity can be measured. Hence, in this work we only compare our method with ViDP, a subspace learning based method whose performance is second only to the CNN [37]. As seen in Table 1, our proposed method performs better than ViDP; the results of ViDP were obtained with the same training and testing subsets as in this experiment. Our average recognition rates for probes 54°, 90° and 126° are higher than those of ViDP by 8.1%, 2.6% and 4.5%, respectively.
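The rank-1 evaluation used above can be sketched as follows, assuming PCA-reduced features and the learned view-specific transforms; all names are illustrative.

```python
# A sketch of rank-1 cross-view matching in the learnt subspace: project the
# probe and gallery views with their transforms, then apply the NN rule.
# Inputs are NumPy arrays; names are illustrative.
import numpy as np

def rank1_accuracy(probe, probe_ids, gallery, gallery_ids, W_probe, W_gallery):
    Zp = probe @ W_probe                       # (n_probe, m) embeddings
    Zg = gallery @ W_gallery                   # (n_gallery, m) embeddings
    d = ((Zp[:, None, :] - Zg[None, :, :]) ** 2).sum(-1)
    pred = gallery_ids[np.argmin(d, axis=1)]   # label of the nearest gallery sample
    return float((pred == probe_ids).mean())
```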

Please cite this article as: W. Xu et al., Multiview max-margin subspace learning for cross-view gait recognition, Pattern Recognition Letters (2017), https://doi.org/10.1016/j.patrec.2017.10.033

ARTICLE IN PRESS

JID: PATREC 6

[m5G;October 27, 2017;11:34]

W. Xu et al. / Pattern Recognition Letters 000 (2017) 1–8

Table 1. Comparison of our method with other subspace learning methods on CASIA-B by average accuracy (%). Models are trained with the gait sequences of the first 74 subjects. Gallery: 36°–144°.

| Method \ Probe | 36° | 54° | 72° | 90° | 108° | 126° | 144° | Mean |
|---|---|---|---|---|---|---|---|---|
| ViDP | – | 87.0 | – | 87.7 | – | 89.3 | – | – |
| PW-CCA | 73.9 | 88.1 | 85.7 | 84.2 | 87.3 | 84.1 | 74.1 | 82.5 |
| MCCA | 69.4 | 78.5 | 79.0 | 78.1 | 80.0 | 81.4 | 64.3 | 75.8 |
| GMA-LDA | 62.7 | 61.9 | 79.0 | 76.6 | 74.3 | 74.7 | 67.4 | 70.9 |
| GMA-MFA | 70.4 | 82.5 | 73.0 | 66.9 | 70.9 | 83.4 | 69.4 | 73.8 |
| MvDA | 81.5 | 86.7 | 86.6 | 85.6 | 92.1 | 86.7 | 81.0 | 85.8 |
| MMMSL | 87.7 | 95.1 | 93.7 | 90.3 | 88.9 | 93.8 | 88.1 | 91.1 |

Table 2. Multiview gait recognition results (%) obtained on OU-ISIR with our method.

| Probe angle | Gallery 55° | Gallery 65° | Gallery 75° | Gallery 85° | Mean |
|---|---|---|---|---|---|
| 55° | – | 98.6 ± 0.2 | 98.0 ± 0.2 | 96.0 ± 0.5 | 97.5 ± 0.3 |
| 65° | 99.7 ± 0.1 | – | 98.9 ± 0.2 | 98.1 ± 0.2 | 98.9 ± 0.1 |
| 75° | 99.0 ± 0.4 | 99.9 ± 0.1 | – | 99.0 ± 0.1 | 99.3 ± 0.2 |
| 85° | 97.8 ± 0.3 | 99.5 ± 0.3 | 99.8 ± 0.1 | – | 99.0 ± 0.2 |

Table 3. Comparison of average multiview gait recognition accuracy (%) of different methods on the OU-ISIR gait database.

| Method | PW-CCA | MCCA | GMA-MFA | MvDA | MMMSL |
|---|---|---|---|---|---|
| Accuracy | 95.6 ± 0.4 | 94.6 ± 0.3 | 94.9 ± 0.2 | 96.9 ± 0.2 | 98.7 ± 0.1 |

From Table 1, we also find that pairwise CCA consistently performs better than Multiview CCA (MCCA). We argue that pairwise CCA seeks the most relevant features shared by each pair of views, which receives a more immediate reward, while MCCA pursues the common features across all viewing angles. To further demonstrate the effectiveness of the proposed method, the detailed performances of five subspace learning methods (MCCA, GMA-LDA, GMA-MFA, MvDA, and MMMSL) under each probe viewing angle are shown in Fig. 5.

Fig. 5. Recognition rates (%) of subspace learning methods for cross-view gait recognition. The probe viewing angles are from 36° to 144°.

As seen in Fig. 5, the closer the probe and gallery viewing angles, the better all methods perform, because gaits from two close viewing angles share more common information. For example, for probe viewing angle 90°, the recognition rates of all methods are above 90% when the gallery viewing angle is 72° or 108°, while they fall below 75% when the gallery viewing angle is 144°. Among the methods in Fig. 5, the proposed MMMSL provides the highest accuracy, followed by MvDA. GMA-LDA and GMA-MFA perform poorly, even worse than MCCA, although MCCA is the only unsupervised one; this suggests that, in this experiment, considering only intra-view information does little to improve the recognition rate. MvDA and our method perform better than the other three in almost all cases, which can be ascribed to their use of discriminant information embedded in both inter-view and intra-view variations. In particular, the larger the deviation between probe and gallery viewing angles, the bigger the improvement in recognition rate: for probe viewing angle 36° and gallery viewing angle 144°, the recognition rates of MvDA and our method remain above 70%, while those of the other three methods fall below 50%. Furthermore, the proposed method outperforms MvDA. Both methods jointly consider intra-view and inter-view discriminant information, but compared with MvDA, the proposed method exploits not only label information but also local structure; in this experiment, driving neighboring different-class samples away from each other is more effective for gait recognition.

4.3. Results on OU-ISIR database

As shown in Fig. 3, four views are considered in this dataset, associated with four observation azimuth angles: 55°, 65°, 75°, and 85°. Because this database has the largest number of gait subjects, it allows us to determine statistically significant performance differences between gait recognition approaches. We apply five-fold cross-validation on this dataset: all subjects are randomly divided into five sets, and in each run four sets are used for training while the remaining set is used for testing. The average identification accuracies and their deviations are reported in Table 2. From Table 2, we can see that the recognition rates of all view pairs are at a high level, for two reasons. First, there is only a small variation of viewing angles in OU-ISIR; the maximum angle change is 30°. Second, the video background is a flat color, as seen in Fig. 3, so silhouette extraction is simple and accurate, and very little noise is mixed into the GEIs used for recognition. To further evaluate the proposed method, we also compare MMMSL with the other methods. As observed from Table 3, the proposed method achieves the highest accuracy, 98.7%, confirming that our method is stable and effective.
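A minimal sketch of this five-fold subject split, using scikit-learn's KFold over subject indices (names are illustrative):

```python
# A sketch of the five-fold protocol: split subjects (not sequences) into five
# sets; train on four, test on the held-out one. Names are illustrative.
import numpy as np
from sklearn.model_selection import KFold

subject_ids = np.arange(4007)   # OU-ISIR subject indices
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_subj, test_subj in kf.split(subject_ids):
    # train MMMSL on the sequences of subject_ids[train_subj], then report
    # rank-1 accuracy on the sequences of subject_ids[test_subj]
    pass
```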




5. Conclusion

To address the problem of gait recognition across multiple views, we proposed a method, called Multiview Max-Margin Subspace Learning (MMMSL), which obtains a single common space shared by all views. Based on the deduced formulation of multiview similarity measurement, MMMSL is developed by collapsing same-class samples and setting neighboring different-class samples apart. Experiments on the two largest gait databases validate the effectiveness of MMMSL, and the experimental results show that our method performs better than related subspace learning methods on gait recognition. In the future, we will replace the linear projections with nonlinear ones to extract more discriminant features with the help of deep learning.

Acknowledgments

This work was supported by the Natural Science Foundation of China (No. 61071214 and No. 11301457) and the Funding of the Jiangsu Innovation Program for Graduate Education (KYZZ_0340).

Appendix A. The derivation of Eq. (7)

The derivation of Eq. (7) is as follows:

$$\begin{aligned}
(W_1^*, W_2^*, \ldots, W_v^*) &= \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \sum_{i,j} \bigl\lVert W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)} \bigr\rVert_2^2 \, s_{w,ij}^{(\alpha,\beta)}}{\sum_{\alpha,\beta} \sum_{i,j} \bigl\lVert W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)} \bigr\rVert_2^2 \, s_{b,ij}^{(\alpha,\beta)}} \\
&= \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \sum_{i,j} \mathrm{tr}\bigl( (W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)})(W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)})^T \bigr)\, s_{w,ij}^{(\alpha,\beta)}}{\sum_{\alpha,\beta} \sum_{i,j} \mathrm{tr}\bigl( (W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)})(W_\alpha^T x_i^{(\alpha)} - W_\beta^T x_j^{(\beta)})^T \bigr)\, s_{b,ij}^{(\alpha,\beta)}} \\
&= \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \sum_{i,j} \mathrm{tr}\bigl( W_\alpha^T x_i^{(\alpha)} s_{w,ij}^{(\alpha,\beta)} x_i^{(\alpha)T} W_\alpha + W_\beta^T x_j^{(\beta)} s_{w,ij}^{(\alpha,\beta)} x_j^{(\beta)T} W_\beta - W_\beta^T x_j^{(\beta)} s_{w,ij}^{(\alpha,\beta)} x_i^{(\alpha)T} W_\alpha - W_\alpha^T x_i^{(\alpha)} s_{w,ij}^{(\alpha,\beta)} x_j^{(\beta)T} W_\beta \bigr)}{\sum_{\alpha,\beta} \sum_{i,j} \mathrm{tr}\bigl( W_\alpha^T x_i^{(\alpha)} s_{b,ij}^{(\alpha,\beta)} x_i^{(\alpha)T} W_\alpha + W_\beta^T x_j^{(\beta)} s_{b,ij}^{(\alpha,\beta)} x_j^{(\beta)T} W_\beta - W_\beta^T x_j^{(\beta)} s_{b,ij}^{(\alpha,\beta)} x_i^{(\alpha)T} W_\alpha - W_\alpha^T x_i^{(\alpha)} s_{b,ij}^{(\alpha,\beta)} x_j^{(\beta)T} W_\beta \bigr)} \\
&= \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \mathrm{tr}\bigl( W_\alpha^T X^{(\alpha)} D_w^{(\alpha,\beta)V} X^{(\alpha)T} W_\alpha + W_\beta^T X^{(\beta)} D_w^{(\alpha,\beta)H} X^{(\beta)T} W_\beta - W_\alpha^T X^{(\alpha)} S_w^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\beta^T X^{(\beta)} S_w^{(\alpha,\beta)T} X^{(\alpha)T} W_\alpha \bigr)}{\sum_{\alpha,\beta} \mathrm{tr}\bigl( W_\alpha^T X^{(\alpha)} D_b^{(\alpha,\beta)V} X^{(\alpha)T} W_\alpha + W_\beta^T X^{(\beta)} D_b^{(\alpha,\beta)H} X^{(\beta)T} W_\beta - W_\alpha^T X^{(\alpha)} S_b^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\beta^T X^{(\beta)} S_b^{(\alpha,\beta)T} X^{(\alpha)T} W_\alpha \bigr)},
\end{aligned} \tag{10}$$

where $S_w^{(\alpha,\beta)}$ and $S_b^{(\alpha,\beta)}$ are the same-class and different-class adjacency matrices whose elements are $s_{w,ij}^{(\alpha,\beta)}$ and $s_{b,ij}^{(\alpha,\beta)}$; $D_w^{(\alpha,\beta)V}$, $D_w^{(\alpha,\beta)H}$, $D_b^{(\alpha,\beta)V}$ and $D_b^{(\alpha,\beta)H}$ are the diagonal matrices defined as $D_{w,ii}^{(\alpha,\beta)V} = \sum_j s_{w,ij}^{(\alpha,\beta)}$, $D_{w,jj}^{(\alpha,\beta)H} = \sum_i s_{w,ij}^{(\alpha,\beta)}$, $D_{b,ii}^{(\alpha,\beta)V} = \sum_j s_{b,ij}^{(\alpha,\beta)}$ and $D_{b,jj}^{(\alpha,\beta)H} = \sum_i s_{b,ij}^{(\alpha,\beta)}$; and $\mathrm{tr}(\cdot)$ denotes the trace operator.

Appendix B. The derivation of Eq. (8)

The derivation of Eq. (8) is as follows. Since $s_{w,ji}^{(\beta,\alpha)} = s_{w,ij}^{(\alpha,\beta)}$ and $s_{b,ji}^{(\beta,\alpha)} = s_{b,ij}^{(\alpha,\beta)}$, we have $D_w^{(\alpha,\beta)} = D_w^{(\alpha,\beta)H} = D_w^{(\alpha,\beta)V}$ and $D_b^{(\alpha,\beta)} = D_b^{(\alpha,\beta)H} = D_b^{(\alpha,\beta)V}$, so summing over all ordered view pairs, the second and fourth trace terms in Eq. (10) equal the first and third terms with the index pair swapped. Therefore

$$\begin{aligned}
(W_1^*, W_2^*, \ldots, W_v^*) &= \arg\min_{W_1,\ldots,W_v} \frac{\sum_{\alpha,\beta} \mathrm{tr}\bigl( W_\alpha^T X^{(\alpha)} D_w^{(\alpha,\beta)} X^{(\alpha)T} W_\alpha + W_\beta^T X^{(\beta)} D_w^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\alpha^T X^{(\alpha)} S_w^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\beta^T X^{(\beta)} S_w^{(\alpha,\beta)T} X^{(\alpha)T} W_\alpha \bigr)}{\sum_{\alpha,\beta} \mathrm{tr}\bigl( W_\alpha^T X^{(\alpha)} D_b^{(\alpha,\beta)} X^{(\alpha)T} W_\alpha + W_\beta^T X^{(\beta)} D_b^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\alpha^T X^{(\alpha)} S_b^{(\alpha,\beta)} X^{(\beta)T} W_\beta - W_\beta^T X^{(\beta)} S_b^{(\alpha,\beta)T} X^{(\alpha)T} W_\alpha \bigr)} \\
&= \arg\min_{W_1,\ldots,W_v} \frac{\mathrm{tr}\Bigl( 2 \sum_{\alpha,\beta} W_\alpha^T X^{(\alpha)} D_w^{(\alpha,\beta)} X^{(\alpha)T} W_\alpha - 2 \sum_{\alpha,\beta} W_\alpha^T X^{(\alpha)} S_w^{(\alpha,\beta)} X^{(\beta)T} W_\beta \Bigr)}{\mathrm{tr}\Bigl( 2 \sum_{\alpha,\beta} W_\alpha^T X^{(\alpha)} D_b^{(\alpha,\beta)} X^{(\alpha)T} W_\alpha - 2 \sum_{\alpha,\beta} W_\alpha^T X^{(\alpha)} S_b^{(\alpha,\beta)} X^{(\beta)T} W_\beta \Bigr)} \\
&= \arg\min_{W} \frac{\mathrm{tr}\bigl( W^T Z (D_w - S_w) Z^T W \bigr)}{\mathrm{tr}\bigl( W^T Z (D_b - S_b) Z^T W \bigr)},
\end{aligned} \tag{11}$$

where $W = [W_1, W_2, \ldots, W_v]$, $Z = \mathrm{diag}\bigl(X^{(1)}, X^{(2)}, \ldots, X^{(v)}\bigr)$ is the block-diagonal data matrix, $S_w$ and $S_b$ are the block adjacency matrices $[S_w^{(\alpha,\beta)}]$ and $[S_b^{(\alpha,\beta)}]$, and $D_w$ and $D_b$ are the block-diagonal matrices with diagonal blocks $\sum_\beta D_w^{(\alpha,\beta)}$ and $\sum_\beta D_b^{(\alpha,\beta)}$, as given in Section 3.2.
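As a quick numerical sanity check of the trace identities used above, the single-view special case $\sum_{i,j} s_{ij} \lVert x_i - x_j \rVert^2 = 2\,\mathrm{tr}\bigl(X(D - S)X^T\bigr)$ (with symmetric $S$ and $D_{ii} = \sum_j s_{ij}$) can be verified directly; the factor 2 is absorbed in the ratio of Eq. (11). The snippet below is illustrative only.

```python
# Numerical check of the single-view trace identity underlying Eqs. (10)-(11):
# sum_ij s_ij * ||x_i - x_j||^2 == 2 * tr(X (D - S) X^T) for symmetric S.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))                 # 5-dim features, 10 samples as columns
S = rng.random((10, 10)); S = (S + S.T) / 2  # symmetric affinity matrix
D = np.diag(S.sum(axis=1))                   # D_ii = sum_j s_ij

lhs = sum(S[i, j] * np.sum((X[:, i] - X[:, j]) ** 2)
          for i in range(10) for j in range(10))
rhs = 2 * np.trace(X @ (D - S) @ X.T)
assert np.isclose(lhs, rhs)
```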


References

[1] G. Ariyanto, M.S. Nixon, Model-based 3D gait biometrics, in: International Joint Conference on Biometrics, 2011, pp. 1–7.
[2] M. Balasubramanian, E.L. Schwartz, The Isomap algorithm and topological stability, Science 295 (5552) (2002) 7.
[3] K. Bashir, T. Xiang, S. Gong, Cross view gait recognition using correlation strength, in: BMVC, 2010, pp. 1–11.
[4] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. (1997) 711–720.
[5] R. Bodor, A. Drenner, D. Fehr, O. Masoud, N. Papanikolopoulos, View-independent human motion classification using image-based reconstruction, Image Vis. Comput. 27 (8) (2009) 1194–1206.
[6] S.F. Chang, D.T. Lee, D. Liu, I. Jhuo, Robust visual domain adaptation with low-rank reconstruction, in: Computer Vision and Pattern Recognition, 2012, pp. 2168–2175.
[7] H. Diedrich, U. Potsdam, lle: Locally linear embedding, 2012.
[8] C. Ding, D. Tao, A comprehensive survey on pose-invariant face recognition, ACM Trans. Intell. Syst. Technol. 7 (3) (2015) 37.
[9] Z. Ding, Y. Fu, Robust multi-view subspace learning through dual low-rank decompositions, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[10] Z. Ding, Y. Fu, Robust multiview data analysis through collective low-rank subspace, IEEE Trans. Neural Netw. Learn. Syst. PP (99) (2017) 1–12.
[11] M. Goffredo, I. Bouchrika, J.N. Carter, M.S. Nixon, Self-calibrating view-invariant gait biometrics, IEEE Trans. Syst. Man Cybern. Part B 40 (4) (2010) 997–1008.
[12] J. Han, B. Bhanu, Individual recognition using gait energy image, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2) (2006) 316–322.
[13] R. He, M. Zhang, L. Wang, J. Ye, Q. Yin, Cross-modal subspace learning via pairwise constraints, IEEE Trans. Image Process. 24 (12) (2015) 5543–5556.
[14] X. He, S. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[15] H. Hu, Multiview gait recognition based on patch distribution features and uncorrelated multilinear sparse local discriminant canonical correlation analysis, IEEE Trans. Circuits Syst. Video Technol. 24 (4) (2014) 617–630.
[16] M. Hu, Y. Wang, Z. Zhang, J.J. Little, D. Huang, View-invariant discriminative projection for multi-view gait-based human identification, IEEE Trans. Inf. Forensics Secur. 8 (12) (2013) 2034–2045.
[17] Y. Iwashita, R. Baba, K. Ogawara, R. Kurazume, Person identification from spatio-temporal 3D gait, in: International Conference on Emerging Security Technologies, 2010, pp. 30–35.
[18] F. Jean, R. Bergevin, A.B. Albu, Computing and evaluating view-normalized body part trajectories, Image Vis. Comput. 27 (9) (2009) 1272–1284.
[19] M. Kan, S. Shan, H. Zhang, S. Lao, X. Chen, Multi-view discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2015) 808–821.
[20] M. Kan, J. Wu, S. Shan, X. Chen, Domain adaptation for face recognition: targetize source domain bridged by common subspace, Int. J. Comput. Vis. 109 (1–2) (2014) 94–109.
[21] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, Support vector regression for multi-view gait recognition based on local motion feature selection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 974–981.
[22] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, Gait recognition under various viewing angles based on correlated motion regression, IEEE Trans. Circuits Syst. Video Technol. 22 (6) (2012) 966–980.
[23] W. Kusakunniran, Q. Wu, J. Zhang, Y. Ma, H. Li, A new view-invariant feature for cross-view gait recognition, IEEE Trans. Inf. Forensics Secur. 8 (10) (2013) 1642–1653.
[24] Y. Makihara, H. Mannami, A. Tsuji, M.A. Hossain, K. Sugiura, A. Mori, Y. Yagi, The OU-ISIR gait database comprising the treadmill dataset, IPSJ Trans. Comput. Vis. Appl. 4 (2012) 53–62.
[25] Y. Makihara, R. Sagawa, Y. Mukaigawa, T. Echigo, Y. Yagi, Gait recognition using a view transformation model in the frequency domain, in: ECCV, Springer Berlin Heidelberg, 2006.
[26] D. Muramatsu, A. Shiraishi, Y. Makihara, M.Z. Uddin, Y. Yagi, Gait-based person recognition using arbitrary view transformation model, IEEE Trans. Image Process. 24 (1) (2014) 140–154.
[27] A.A. Nielsen, Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data, IEEE Trans. Image Process. 11 (3) (2002) 293–305.
[28] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.
[29] J. Rupnik, J. Shawe-Taylor, Multi-view canonical correlation analysis, in: Conference on Data Mining and Data Warehouses (SiKDD 2010), 2010, pp. 1–4.
[30] S. Sarkar, P.J. Phillips, Z. Liu, I.R. Vega, P. Grother, K.W. Bowyer, The HumanID gait challenge problem: data sets, performance, and analysis, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2) (2005) 162–177.
[31] R.D. Seely, S. Samangooei, M. Lee, J.N. Carter, M.S. Nixon, The University of Southampton multi-biometric tunnel and introducing a novel 3D gait dataset, in: IEEE International Conference on Biometrics: Theory, Applications and Systems, 2008, pp. 1–6.
[32] A. Sharma, Generalized multiview analysis: a discriminative latent space, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2160–2167.
[33] S. Shekhar, V.M. Patel, H.V. Nguyen, R. Chellappa, Generalized domain-adaptive dictionaries, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 361–368.
[34] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1) (1991) 71–86.
[35] D. Uribe, Domain adaptation in sentiment classification, in: International Conference on Machine Learning and Applications, 2010, pp. 857–860.
[36] L. Wang, H. Ning, T. Tan, W. Hu, Fusion of static and dynamic body biometrics for gait recognition, IEEE Trans. Circuits Syst. Video Technol. 14 (2) (2003) 149–158.
[37] Z. Wu, Y. Huang, L. Wang, X. Wang, T. Tan, A comprehensive study on cross-view gait based human identification with deep CNNs, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
[38] X. Xing, K. Wang, T. Yan, Z. Lv, Complete canonical correlation analysis with application to multi-view gait recognition, Pattern Recognit. 50 (C) (2015) 107–117.
[39] D. Xu, S. Yan, D. Tao, S. Lin, H.J. Zhang, Marginal Fisher analysis and its variants for human gait recognition and content-based image retrieval, IEEE Trans. Image Process. 16 (11) (2007) 2811–2821.
[40] C.Y. Yam, M.S. Nixon, J.N. Carter, Automated person recognition by walking and running via model-based approaches, Pattern Recognit. 37 (5) (2004) 1057–1072.
[41] S. Yan, D. Xu, B. Zhang, H.J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 40–51.
[42] S. Yu, D. Tan, T. Tan, A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition, in: Proceedings of the 18th International Conference on Pattern Recognition, 2006, pp. 441–444.
[43] H. Zhang, V.M. Patel, S. Shekhar, R. Chellappa, Domain adaptive sparse representation-based classification, in: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2015, pp. 1–8.
[44] G. Zhao, G. Liu, H. Li, M. Pietikainen, 3D gait recognition using multiple cameras, in: International Conference on Automatic Face and Gesture Recognition, 2006, pp. 529–534.
