Locality-Aware Group Sparse Coding on Grassmann Manifolds for Image Set Classification

Journal Pre-proof
Communicated by Dr Li Bing
PII: S0925-2312(19)31728-X
DOI: https://doi.org/10.1016/j.neucom.2019.12.026
Reference: NEUCOM 21654
To appear in: Neurocomputing
Received date: 10 May 2019
Revised date: 30 September 2019
Accepted date: 4 December 2019

Please cite this article as: Dong Wei, Xiaobo Shen, Quansen Sun, Xizhan Gao, Wenzhu Yan, Locality-Aware Group Sparse Coding on Grassmann Manifolds for Image Set Classification, Neurocomputing (2019), doi: https://doi.org/10.1016/j.neucom.2019.12.026

© 2019 Published by Elsevier B.V.

Dong Wei, Xiaobo Shen, Quansen Sun∗, Xizhan Gao, Wenzhu Yan
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094

Abstract

Riemannian sparse coding methods are attracting increasing interest in many computer vision applications, owing to the non-Euclidean structure of the underlying data. One recently successful task is image set classification with the aid of Grassmann manifolds, where an image set can be seen as a point. However, due to irrelevant information and outliers, the probe set may be represented by misleading sets with large sparse coefficients. Meanwhile, it is difficult for a single subspace to cover the changes within an image set, and the hidden structure among samples is neglected. In this paper, we propose a novel Grassmann Locality-Aware Group Sparse Coding model (GLGSC) that attempts to preserve locality information and take advantage of the relationships among image sets to capture inter- and intra-set variations simultaneously. Since the contributions of different gallery subspaces to the probe subspace should vary in importance, we introduce a novel representation adaption term. In addition, a kernelised version of GLGSC is proposed to handle non-linearity in data. To demonstrate the effectiveness of our algorithm over the state of the art, several classification tasks are conducted, including face recognition, object recognition and gesture recognition.

Keywords: Grassmann manifolds, group sparse coding, locality preserving, image set classification

∗ Corresponding author. Email addresses: [email protected] (Dong Wei), [email protected] (Xiaobo Shen), [email protected] (Quansen Sun), [email protected] (Xizhan Gao), [email protected] (Wenzhu Yan)

Preprint submitted to Journal of LaTeX Templates, December 12, 2019

1. Introduction

Image set classification is a paramount but challenging task in computer vision and pattern recognition. It has a wide range of potential applications, such as object categorization [1, 2, 3, 4], face recognition [5, 6, 7] and action recognition [8]. This benefits from the development of multimedia network technology and video surveillance, through which numerous videos and images of a person or an object can be easily obtained. Unlike traditional classification techniques based on single images, the many sample images in an image set convey more intra-class variations of the subject and yield superior performance [9]. Meanwhile, a wide range of appearance variations, caused by arbitrary poses, partial occlusions, illumination conditions and object deformations, also makes the task challenging.

Current image set classification methods focus on two crucial aspects, i.e., how to model an image set and how to measure the similarity between two image sets. In the past decades, a number of image set representation models have been developed, such as subspaces [10, 11, 12], affine or convex hulls [7, 13, 14], manifolds [5, 6] and covariance matrices [15, 16]. Among them, the success of manifold-based methods is mainly due to the geometry of the manifolds, which encode the Riemannian geometry of the underlying manifolds in the learning stage. The complexity of the image set structure makes it difficult for a single subspace to cover the changes within the set. Hence, Manifold-Manifold Distance (MMD) [5] expresses a manifold as a collection of local linear models and then measures the distances between pairs of subspaces. The method in [6], called Manifold Discriminant Analysis (MDA), learns an embedding space to enhance discriminative ability. Learning Euclidean-to-Riemannian Metric (LERM) [17] and Projection Metric Learning (PML) [18] are metric learning methods on Symmetric Positive Definite (SPD) manifolds and Grassmann manifolds, respectively.

Sparse coding is becoming increasingly attractive in signal processing and related areas [19, 20]. Sparse Representation-based Classification (SRC) [19]

treats the entire set of training samples as a dictionary for face recognition. To preserve the consistency of the sparse representations of similar local features, LScSPM [20] incorporates a Laplacian matrix into the objective function, where the Laplacian matrix is constructed by histogram intersection. Recently, several methods have considered sparse coding on manifolds due to their non-Euclidean geometry. Riemannian Sparse Representation (RSR) [21] performs sparse coding of SPD matrices with the aid of the Stein kernel. Later, Grassmann Sparse Coding (GSC) [8] proposes to embed Grassmann manifolds into the space of symmetric matrices and then perform sparse coding there. However, these methods discard the locality among image sets, resulting in misleading sparse coefficients. Moreover, it is difficult for a single subspace to cover the changes within an image set, and the hidden structure among samples is neglected. One fact in face recognition is that face images from different people still share common similarities. Hence, similar to Collaborative Representation Classification (CRC) [22], it is useful to model the local structure among images within the same set and the relationships between different sets.

To solve the aforementioned problems of sparse coding on manifolds, this paper proposes a novel image set classification algorithm, namely a locality-aware group sparse coding model on Grassmann manifolds (GLGSC), to simultaneously exploit representation exclusivity and locality consistency in a unified manner. Specifically, to alleviate the impact of outliers, we introduce a representation adaption term to enforce the representations under different dictionaries to be adaptive. Meanwhile, we explore the hidden structure within each image set and extract multiple subspaces. In this situation, inspired by the idea of linear discriminant analysis, a locality consistent regularization term is employed to constrain the representation coefficients of neighboring points on Grassmann manifolds to be as similar as possible, which promotes intra-set compactness. Finally, a kernelised version of GLGSC is proposed to handle non-linearity in data. Fig. 1 shows the sketches of Grassmann sparse representation and the proposed model for image set classification.

The main contributions of this paper include: 1) We deal with image set

classification via a novel locality-aware group sparse coding on Grassmann manifolds, which improves matching consistency. Thus, this model can be understood as both a multi-model representation and a sparse representation model.

Fig. 1. The sketches of Grassmann sparse coding (top) and the proposed model for image set classification (bottom). A probe set can be described by a linear subspace and represented as a point on a Grassmann manifold. This probe subspace is then represented by a linear combination of gallery subspaces and classified by computing reconstruction residuals. Since multiple subspaces are extracted, the proposed model covers more variations within an image set and preserves locality information.

To the best of our knowledge, this is the first method to exploit the locality property for image set classification. 2) We develop a kernelised version of this method, which can address non-linearity in data. 3) The proposed method is applied to several computer vision tasks, including face recognition, object recognition and gesture recognition. Experimental results show the superiority of our algorithm over state-of-the-art methods.

The remainder of this paper is organized as follows: Section 2 presents work related to our proposed method. In Section 3, the proposed method is described. We present the kernelised version of the proposed method in Section 4. Experimental results are reported in Section 5. Conclusions are provided in Section 6.

2. Related Work

From the representation point of view, existing image set classification methods can be mainly classified into two categories: parametric methods and non-parametric methods [23]. Parametric methods use a parametric distribution to model each set, such as a single Gaussian or a Gaussian mixture model (GMM), and then measure the similarity between two distributions with the Kullback-Leibler divergence. However, the performance of these methods is sensitive to the statistical correlations between gallery and query sets.

In the past decades, non-parametric methods have attracted increasing attention due to their stable and superior performance. Different from the distribution functions of parametric methods, non-parametric methods utilize more flexible models to represent image sets, such as subspaces [10, 11], affine or convex hulls [24, 25], manifolds [5, 6], graph models [26] and covariance matrices [15, 27]. The method in [10] exploits subspaces to model image sets and measures the similarity by the mutual subspace method (MSM). Inspired by the idea of Linear Discriminant Analysis (LDA), Discriminant Analysis of Canonical Correlations (DCC) models an image set with a linear subspace and employs canonical correlations to measure the similarity between two sets. Affine Hull based Image Set Distance (AHISD) and Convex Hull based Image Set Distance (CHISD) [7] use affine hulls and convex hulls, respectively, to represent image sets. This operation can implicitly reveal unseen appearance variations. The methods in [1] and [2] model each image set as a point on a Grassmann manifold, on which different discriminant analysis methods can then be conducted. Covariance Discriminative Learning (CDL) [15] represents each image set with a covariance matrix, so that kernel discriminant analysis can be conducted on SPD manifolds.

Different from the methods above, several methods introduce sparse coding

and dictionary learning into image set classification. The method in [24], called Sparse Approximated Nearest Points (SANP), takes advantage of sparse approximation to select nearest neighbors. A reduced model, Regularized Nearest Points (RNP) [28], represents an image set by a regularized hull, which reduces the complexity of SANP. The method in [29] constructs a dictionary for the gallery sets, and the reconstruction error is used to measure the similarity between the query set and the gallery set. To consider class-level and atom-level sparsity, the method in [30] models a video clip as a discriminative dictionary. Later, Harandi et al. [8] attempt sparse coding on Grassmann manifolds and achieve great success, benefiting from their non-Euclidean geometry. Ortiz et al. [31] propose an MSSRC model to represent a face track by its mean image. Zhu et al. [32] propose to model the probe set as a convex or regularized hull and represent this hull collaboratively using all the gallery sets.

3. Proposed Method

3.1. Problem setup

Let the gallery sets be $X = \{X_i\}_{1 \le i \le n} \in \mathbb{R}^{d \times N}$, where $n$ is the number of gallery sets and $N$ is the total number of gallery images. The $i$-th gallery set is $X_i = \{x_i^j\}_{1 \le j \le n_i} \in \mathbb{R}^{d \times n_i}$, therefore $N = \sum_{i=1}^{n} n_i$. Note that $x_i^j$ denotes the $j$-th image of the $i$-th gallery set. Similarly, let the probe sets be $Y = \{Y_i\}_{1 \le i \le p} \in \mathbb{R}^{d \times P}$. Here, $p$ denotes the number of probe sets and $P$ denotes the total number of probe images. Each probe set $Y_i = \{y_i^j\}_{1 \le j \le p_i}$ has $p_i$ images, and $y_i^j \in \mathbb{R}^d$ denotes the $j$-th image of the $i$-th probe set ($P = \sum_{i=1}^{p} p_i$). Given one or several probe sets, our goal is to find the corresponding class label of the whole set, which consists of multiple images.

Notations: Throughout this paper, bold uppercase characters, bold lowercase characters and italic letters denote matrices, vectors and scalars, respectively. For a matrix $A$, we use $a^i$, $a_j$ and $A_{ij}$ to denote the $i$-th row, the $j$-th column and the $ij$-th element of $A$, respectively. For a vector $\alpha$, we denote the $i$-th entry by $[\alpha]_i$. $\mathrm{Tr}(A)$ denotes the trace of $A$ if $A$ is square. $A^T$ denotes the transpose of $A$, and $I_n$ denotes the identity matrix in $\mathbb{R}^{n \times n}$.

3.2. Multiple Subspace Extraction

Due to arbitrary poses, partial occlusions, illumination variations and object deformations, an image set usually has such a complicated structure that a single subspace cannot convey its intra-set variations. Fortunately, Wang et al. [33] observe that the input data can first be decomposed into a collection of local linear models, which motivates us to segment the image sets into several clusters of linear models and then use Singular Value Decomposition (SVD) to compute subspaces. Several local model extraction methods have been proposed in the literature. The methods in [34, 35] use the classical K-means method to cluster exemplars, and each cluster is then represented as a plane computed by Principal Component Analysis (PCA). Based on a distance measure on the appearance manifold, Fan et al. [36] explore Hierarchical Agglomerative Clustering (HAC) to investigate grouping in a data set. However, these approaches have two main limitations: the number of clusters needs to be specified, and linearity is not guaranteed. Recently, Wang et al. [6, 33] construct Maximal Linear Patches (MLPs), whose nonlinearity degree is measured by the deviation between Euclidean distance and geodesic distance. As indicated in [6], the one-shot sequential clustering method in [5] used to construct MLPs suffers from the problem of unbalanced clusters. For the sake of efficiency, Hierarchical Divisive Clustering (HDC) is exploited as a substitute for HAC, because the appropriate number of clusters is much smaller than the number of samples. In this paper, we follow this method and formulate the multiple MLPs extracted from an image set

$X_i \in \mathbb{R}^{d \times G_i}$, i.e., local models $D_i^{(j)}$:

$$X_i = \{D_i^{(1)}, D_i^{(2)}, \cdots, D_i^{(m)}\}, \quad \text{s.t.} \quad X_i = \bigcup_{j=1}^{m} D_i^{(j)},$$
$$D_i^{(j)} \cap D_i^{(k)} = \emptyset \quad (j \neq k,\ j, k = 1, 2, \cdots, m),$$
$$D_i^{(j)} = \{d_{i,1}^{(j)}, d_{i,2}^{(j)}, \cdots, d_{i,g_j}^{(j)}\} \quad \Big(\sum_{j=1}^{m} g_j = G_i\Big), \qquad (1)$$

where $m$ is the number of patches, $g_j$ is the number of images in patch $D_i^{(j)}$, and $G_i$ is the number of images in the $i$-th image set. Firstly, the pairwise Euclidean distance matrix $W_E$ and geodesic distance matrix $W_G$, based on the $k$-NN graph, can be easily obtained. Then, a matrix holding distance ratios is computed as $R(d_u, d_v) = W_G(d_u, d_v)/W_E(d_u, d_v)$. Evidently, $R(d_u, d_v) \ge 1$ holds for any entry of $R$, since the geodesic distance is never smaller than the Euclidean distance. Further, a similarity matrix $H$ can be constructed, whose $u$-th column holds the indices of the $k$ nearest neighbors of image $d_u$. Finally, to measure the nonlinearity degree of one MLP $D_i^{(j)}$, a nonlinearity score function can be defined as:

$$S^{(j)} = \frac{1}{p_j} \sum_{u=1}^{p_j} \sum_{v=1}^{p_j} R\big(d_u^{(j)}, d_v^{(j)}\big). \qquad (2)$$
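As a minimal NumPy sketch of the quantities above, the distance-ratio matrix $R$ and the score of Eqn. (2) can be computed as follows. The function name `nonlinearity_score` is a hypothetical helper, graph geodesics are obtained by Floyd-Warshall on the $k$-NN graph, and the HDC clustering that consumes this score is not shown:

```python
import numpy as np

def nonlinearity_score(X, k=3):
    """Nonlinearity score of a patch (Eqn. (2)): mean ratio of graph-geodesic
    to Euclidean distance. X is (d, p), one column per image. Illustrative
    sketch only; the paper's HDC clustering around this score is not shown."""
    p = X.shape[1]
    # pairwise Euclidean distances W_E
    diff = X[:, :, None] - X[:, None, :]
    WE = np.sqrt((diff ** 2).sum(axis=0))
    # k-NN graph: keep the k nearest distances per row, others start at infinity
    G = np.full((p, p), np.inf)
    for u in range(p):
        nn = np.argsort(WE[u])[: k + 1]   # includes u itself (distance 0)
        G[u, nn] = WE[u, nn]
    G = np.minimum(G, G.T)                # symmetrise the graph
    # graph geodesics W_G by Floyd-Warshall
    WG = G.copy()
    for w in range(p):
        WG = np.minimum(WG, WG[:, [w]] + WG[[w], :])
    # distance ratios R (diagonal excluded to avoid 0/0)
    mask = ~np.eye(p, dtype=bool)
    R = np.ones((p, p))
    R[mask] = WG[mask] / WE[mask]
    return R.sum() / p                    # Eqn. (2)
```

For collinear points the graph geodesic equals the Euclidean distance and the score attains its minimum $p_j$; curved data (e.g. points on an arc) score strictly higher, which is exactly what HDC thresholds on.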

With these definitions, the MLPs can be obtained using the HDC algorithm [6]. After extracting the local models from the image sets, we can group the images of the probe set $Y$ into $m$ clusters, i.e., $Y = \{Y^{(1)}, Y^{(2)}, \cdots, Y^{(m)}\}$. Similarly, for the gallery sets $X = \{X_i\}_{1 \le i \le n}$, we can form the dictionary as follows:

$$D = \{D_1^{(1)}, D_1^{(2)}, \cdots, D_1^{(m)}, D_2^{(1)}, D_2^{(2)}, \cdots, D_2^{(m)}, \cdots, D_n^{(1)}, D_n^{(2)}, \cdots, D_n^{(m)}\}. \qquad (3)$$

Here, $n$ is the number of gallery sets and $m$ is the number of clusters per set. The HDC scheme ensures that the extracted MLPs are balanced and that images in the same local patch have similar visual characteristics. Then,

each gallery set can be represented by a linear subspace spanned by an orthonormal basis matrix; in other words, $D_i^{(j)}$ can be denoted by an orthonormal basis matrix $\bar{D}_i^{(j)} \in \mathbb{R}^{d \times q}$, s.t.

$$D_i^{(j)} D_i^{(j)T} = \bar{D}_i^{(j)} \Lambda_i^{(j)} \bar{D}_i^{(j)T}, \qquad (4)$$

where $\bar{D}_i^{(j)}$ and $\Lambda_i^{(j)}$ are the eigenvector matrix of the $q$ largest eigenvalues and the diagonal matrix of the $q$ largest eigenvalues, respectively.

3.3. Locality-aware Group Sparse Coding on Grassmann Manifolds

In several previous works, sparse coding is explored directly on Grassmann

manifolds, which ignores the locality and similarity information among image sets. Besides, as discussed in [22], data from different classes may share similarities in face recognition.

1) Grassmann manifold sparse coding scheme: Motivated by these considerations, we represent an image set by multiple subspaces, which can be seen as several points on Grassmann manifolds. Meanwhile, we account for locality and similarity information via a distance measure on Grassmann manifolds called the Projection Metric [1]. For convenience, we consider the situation in which only one probe set $Y$ is given in this subsection. To avoid clutter, we use $\bar{Y} \in \mathbb{R}^{d \times q}$ and $\{\bar{X}_i\}_{1 \le i \le n} \in \mathbb{R}^{d \times nq}$ to represent the subspaces of the probe set $Y$ and the gallery sets $\{X_i\}_{1 \le i \le n}$, respectively. Intuitively, Grassmann sparse coding with $\ell_1$-norm regularisation tries to find the representation by solving the following optimization problem:

$$\min_{\alpha} \Big\| Y \ominus \biguplus_{j=1}^{n} [\alpha]_j \odot X_j \Big\|_{\mathrm{geod}}^2 + \lambda \|\alpha\|_1, \qquad (5)$$

where $Y$ and $X_j$ are points on the Grassmann manifold $\mathcal{G}(p, d)$. The operators $\ominus$, $\biguplus$, $\odot$ and $\|\cdot\|_{\mathrm{geod}}$ are subtraction, summation, multiplication and geodesic distance on Grassmann manifolds, respectively. Since matrix subtraction, summation and scalar multiplication are not defined appropriately there, it is very hard to solve

Eqn.(5). Fortunately, as shown in [8], Grassmann manifolds can be embedded into the space of symmetric matrices via the mapping $\Gamma: \mathcal{G}(p, d) \to \mathrm{Sym}(d)$,

$$\Gamma(X) = XX^T, \qquad (6)$$

and then sparse coding on Grassmann manifolds can be performed directly. Since the mapping is a diffeomorphism onto its image in the manifold of symmetric matrices, $\ominus$, $\biguplus$ and $\odot$ can be treated as operations on a vector space. Consequently, the sparse coding problem on Grassmann manifolds is recast as:

$$\min_{\alpha} \Big\| \bar{Y}\bar{Y}^T - \sum_{j=1}^{n} [\alpha]_j \bar{X}_j \bar{X}_j^T \Big\|_F^2 + \lambda \|\alpha\|_1. \qquad (7)$$

Recall that each image set is naturally divided into several clusters to capture more intra-set variations according to the HDC algorithm. With this choice, we can rewrite Eqn.(7) as:

$$\min_{\{\alpha_i\}_{i=1}^{m}} \sum_{i=1}^{m} \Big\| \bar{Y}_i\bar{Y}_i^T - \sum_{j=1}^{m \cdot n} [\alpha_i]_j \bar{D}_j \bar{D}_j^T \Big\|_F^2 + \lambda_1 \sum_{i=1}^{m} \|\alpha_i\|_1. \qquad (8)$$

Here, $\bar{Y}_i$ and $\bar{D}_j$ are the subspaces of the $i$-th cluster of the probe set and the $j$-th cluster of all gallery sets (see Eqn.(3)), respectively. $\alpha_i \in \mathbb{R}^{m \cdot n}$, and $[\alpha_i]_j$ is the representation coefficient of $\bar{Y}_i\bar{Y}_i^T$ under $\bar{D}_j\bar{D}_j^T$.
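The embedding of Eqn. (6) and the resulting objective value for one probe cluster can be sketched as follows. This is illustrative only: `embed` and `coding_objective` are hypothetical names, and actually solving for the codes is deferred to the optimization in Section 3.4:

```python
import numpy as np

def embed(X):
    """Gamma(X) = X X^T: map a Grassmann point (given by an orthonormal
    basis X in R^{d x q}) into the space of symmetric matrices (Eqn. (6))."""
    return X @ X.T

def coding_objective(Y, Ds, alpha, lam=0.1):
    """Value of the embedded sparse-coding objective (Eqns. (7)-(8)) for a
    candidate code `alpha`. Ds is a list of orthonormal gallery bases."""
    recon = sum(a * embed(D) for a, D in zip(alpha, Ds))
    return np.linalg.norm(embed(Y) - recon, "fro") ** 2 + lam * np.abs(alpha).sum()
```

As a sanity check, when the probe subspace coincides with a dictionary atom and the code is the corresponding indicator vector, the reconstruction term vanishes and only the $\ell_1$ penalty remains.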

2) Representation adaption term: The representation coefficients $\alpha$ of each point on the Grassmann manifold are expected to reflect this grouping. Specifically, we introduce a representation adaption term into the objective function so that the contributions of different gallery subspaces to the probe subspace can vary in importance. For better understanding, we first introduce the definition of exclusivity.

Definition 1. (Exclusivity [37]) Exclusivity between two vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$ is defined as $\mathcal{H}(u, v) = \|u \otimes v\|_0 = \sum_{i=1}^{n} \mathbb{1}\big([u]_i \cdot [v]_i \neq 0\big)$, where $\otimes$ designates the Hadamard product and $\|\cdot\|_0$ is the $\ell_0$-norm.

From the definition, we observe that exclusivity encourages two vectors to be as diverse as possible. This motivates us to constrain the representation coefficients by a distance vector. Ideally, if the distance between two points on Grassmann manifolds is small (approximately 0), then the exclusivity allows the corresponding representation coefficient to be as large as possible. This operation adapts the coefficients toward a more appropriate linear representation. For reasons of non-convexity and effectiveness, we relax the $\ell_0$ norm into the $\ell_2$ norm to make the objective function computationally tractable. Consequently, the representation adaption term is as follows:

$$\Xi_1(\alpha) = \lambda_2 \sum_{i=1}^{m} \sum_{j=1}^{n} \big\| (\alpha_i)_{C_j} \otimes (\mathbf{1} \cdot B_{ji}) \big\|_2^2. \qquad (9)$$

Here, $C_j = \{(j-1)m + 1, (j-1)m + 2, \cdots, (j-1)m + m\}$ denotes the set of indices corresponding to the $j$-th gallery set, and $(\alpha_i)_{C_j} \in \mathbb{R}^m$ denotes the representation coefficients of the $i$-th cluster of the probe set under the $j$-th gallery set, which contains $m$ clusters. $\mathbf{1} \in \mathbb{R}^m$ is a vector of all ones, $\lambda_2$ is a regularisation parameter, and $B_{ji}$ represents the average distance between the $i$-th cluster of the probe set and the $j$-th gallery set on Grassmann manifolds. In particular, the $ji$-th element of $B \in \mathbb{R}^{n \times m}$ is defined as:

$$B_{ji} = \frac{\sum_{k=1}^{m} d_{\mathrm{geod}}(\bar{Y}_i, \bar{D}_j^{(k)})}{\sum_{j=1}^{n} \sum_{k=1}^{m} d_{\mathrm{geod}}(\bar{Y}_i, \bar{D}_j^{(k)})}, \qquad (10)$$

where the geodesic distance between $\bar{Y}_i$ and $\bar{D}_j^{(k)}$ on Grassmann manifolds can be easily computed as:

$$d_{\mathrm{geod}}^2(\bar{Y}_i, \bar{D}_j^{(k)}) = \big\|\bar{Y}_i^T \bar{Y}_i\big\|_F^2 - 2\big\|\bar{D}_j^{(k)T} \bar{Y}_i\big\|_F^2 + \big\|\bar{D}_j^{(k)T} \bar{D}_j^{(k)}\big\|_F^2. \qquad (11)$$
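Because Eqn. (11) only involves $q \times q$ products, the projection-metric distance can be evaluated without forming the $d \times d$ embedded matrices. A small sketch with a hypothetical helper name; the identity with $\|\bar{Y}\bar{Y}^T - \bar{D}\bar{D}^T\|_F^2$ holds by expanding the trace:

```python
import numpy as np

def proj_dist_sq(Yb, Db):
    """Squared projection-metric distance of Eqn. (11) between two bases
    Yb, Db in R^{d x q}. Algebraically equal to ||Yb Yb^T - Db Db^T||_F^2,
    but computed with q x q matrices only."""
    return (np.linalg.norm(Yb.T @ Yb, "fro") ** 2
            - 2 * np.linalg.norm(Db.T @ Yb, "fro") ** 2
            + np.linalg.norm(Db.T @ Db, "fro") ** 2)
```

For $d \gg q$ this avoids the $O(d^2)$ memory of the explicit embedding, which matters when many $B_{ji}$ entries of Eqn. (10) must be filled.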

3) Locality consistent term: The work in [20] has validated that the stability of sparse codes can be improved by incorporating locality and similarity information into the objective of sparse coding. Since we split each gallery set into several clusters, an overcomplete or sufficient dictionary is obtained, which brings about a loss of locality for the probe clusters to be encoded. In this case, similar probe clusters may be encoded as totally different sparse codes, resulting in instability of the sparse coding. To preserve locality and similarity information, similar probe clusters should be enforced to have similar sparse codes. Thus, we naturally introduce the locality consistent term:

$$\Xi_2(\alpha) = \frac{\lambda_3}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|\alpha_i - \alpha_j\|^2 W_{ij}. \qquad (12)$$

Here, $\alpha_i$ is the sparse code of the $i$-th probe cluster under the overcomplete dictionary that consists of all gallery clusters on Grassmann manifolds, and $\lambda_3$ is a parameter that balances the importance of this term. $W$ is a similarity matrix, whose entry $W_{ij}$ measures the similarity between a vertex pair $(\bar{Y}_i, \bar{Y}_j)$ on Grassmann manifolds. Specifically, we define $W_{ij}$ by a Gaussian kernel function:

$$W_{ij} = \exp\Big(-\frac{\|\bar{Y}_i\bar{Y}_i^T - \bar{Y}_j\bar{Y}_j^T\|_F^2}{2\sigma^2}\Big), \qquad (13)$$

where $\sigma$ is a parameter.

4) Overall objective function: Based on the above analysis, we prefer to simultaneously exploit representation exclusivity and locality consistency. Putting all concerns together results in our final Grassmann Locality-Aware Group Sparse Coding (GLGSC) model:

$$\min_{\{\alpha_i\}_{i=1}^{m}} \sum_{i=1}^{m} \Big\| \bar{Y}_i\bar{Y}_i^T - \sum_{j=1}^{m \cdot n} [\alpha_i]_j \bar{D}_j \bar{D}_j^T \Big\|_F^2 + \lambda_1 \sum_{i=1}^{m} \|\alpha_i\|_1 + \lambda_2 \sum_{i=1}^{m} \sum_{j=1}^{n} \big\| (\alpha_i)_{C_j} \otimes (\mathbf{1} \cdot B_{ji}) \big\|_2^2 + \frac{\lambda_3}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|\alpha_i - \alpha_j\|^2 W_{ij}, \qquad (14)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the tradeoffs corresponding to the sparsity, exclusivity and consistency terms, respectively.
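The similarity matrix $W$ of Eqn. (13) can be sketched as follows, using the embedded distance of Eqn. (6); `similarity_matrix` is a hypothetical helper name and $\sigma$ is a free parameter:

```python
import numpy as np

def similarity_matrix(probe_bases, sigma=1.0):
    """Gaussian-kernel similarity W of Eqn. (13) between probe-cluster
    subspaces. `probe_bases` is a list of orthonormal d x q bases; the
    distance is taken between the embedded symmetric matrices."""
    m = len(probe_bases)
    P = [B @ B.T for B in probe_bases]    # Gamma embedding of Eqn. (6)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            d2 = np.linalg.norm(P[i] - P[j], "fro") ** 2
            W[i, j] = np.exp(-d2 / (2 * sigma ** 2))
    return W
```

By construction $W$ is symmetric with unit diagonal, which is what the Laplacian-based rewriting of the consistency term in Section 3.4 requires.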

3.4. Optimization

For simplicity, we first define the similarity between the $i$-th probe subspace $\bar{Y}_i \in \mathbb{R}^{d \times q}$ and the gallery subspaces $\bar{D} \in \mathbb{R}^{d \times (q \cdot mn)}$ by an $mn$-dimensional vector whose $j$-th entry is:

$$[K_i(\bar{Y}_i, \bar{D})]_j = \mathrm{Tr}(\bar{D}_j^T \bar{Y}_i \bar{Y}_i^T \bar{D}_j) = \big\|\bar{Y}_i^T \bar{D}_j\big\|_F^2. \qquad (15)$$

Similarly, the similarity between gallery subspaces $\bar{D}$ can be defined by an $mn \times mn$ symmetric matrix:

$$[K(\bar{D})]_{i,j} = \mathrm{Tr}(\bar{D}_i^T \bar{D}_j \bar{D}_j^T \bar{D}_i) = \big\|\bar{D}_i^T \bar{D}_j\big\|_F^2. \qquad (16)$$

To derive an efficient solution of Eqn.(14), we first give the following theorem.

Theorem 1. The least squares problem over $\alpha$:

$$\min_{\{\alpha_i\}_{i=1}^{m}} \sum_{i=1}^{m} \Big\| \bar{Y}_i\bar{Y}_i^T - \sum_{j=1}^{m \cdot n} [\alpha_i]_j \bar{D}_j \bar{D}_j^T \Big\|_F^2 \qquad (17)$$

is equivalent to the following optimization problem:

$$\min_{\{\alpha_i\}_{i=1}^{m}} \sum_{i=1}^{m} \|z_i - A\alpha_i\|_2^2, \qquad (18)$$

where $A = \Sigma^{\frac{1}{2}} U^T$ and $z_i = \Sigma^{-\frac{1}{2}} U^T K_i(\bar{Y}_i, \bar{D})$, with $U, \Sigma \in \mathbb{R}^{mn \times mn}$ given by the SVD of the symmetric positive definite matrix $K(\bar{D})$.

Proof. The proof can be found in Appendix A.
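Theorem 1 reduces the manifold reconstruction term to an ordinary least-squares problem. A sketch of how $A$ and $z_i$ can be built from the Gram matrices of Eqns. (15)-(16); the function names are hypothetical, and a small eigenvalue guard is added for numerical safety:

```python
import numpy as np

def factor_dictionary(K_D):
    """Theorem 1: from the Gram matrix K(D) (entries ||D_i^T D_j||_F^2,
    symmetric PSD), build A = Sigma^{1/2} U^T so that the manifold term
    becomes an ordinary least squares ||z - A alpha||^2."""
    w, U = np.linalg.eigh(K_D)            # K(D) = U Sigma U^T
    w = np.clip(w, 0.0, None)             # guard tiny negative eigenvalues
    A = np.diag(np.sqrt(w)) @ U.T
    return A, U, w

def lift_probe(U, w, k_i, eps=1e-12):
    """z_i = Sigma^{-1/2} U^T K_i(Y_i, D) for one probe cluster; zero out
    directions with (numerically) zero eigenvalue."""
    inv_sqrt = np.where(w > eps, 1.0 / np.sqrt(np.maximum(w, eps)), 0.0)
    return np.diag(inv_sqrt) @ (U.T @ k_i)
```

Two identities make the reduction work: $A^T A = K(\bar{D})$, and $(A\alpha)^T z_i = \alpha^T K_i$, so expanding $\|z_i - A\alpha_i\|^2$ reproduces the quadratic form of Eqn. (17) up to a constant.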

To simplify the third term in Eqn.(14), we define a block projection matrix $Q \in \mathbb{R}^{n \times (m \cdot n)}$, whose $i$-th row is 0 except that the elements from position $(i-1)m + 1$ to position $im$ are 1. Then, we can rewrite the third term of Eqn.(14) as follows:

$$\lambda_2 \sum_{i=1}^{m} \big\| \alpha_i \otimes (Q^T B)_i \big\|_2^2 = \lambda_2 \sum_{i=1}^{m} \big\| \mathrm{diag}[(Q^T B)_i]\, \alpha_i \big\|_F^2 = \lambda_2 \sum_{i=1}^{m} \alpha_i^T S^{(i)} \alpha_i. \qquad (19)$$

Here, we use $S^{(i)} \in \mathbb{R}^{(mn) \times (mn)}$ to represent $\mathrm{diag}[(Q^T B)_i]^T \mathrm{diag}[(Q^T B)_i]$. Before simplifying the last term of the objective function, we define a diagonal matrix $E$, whose $i$-th diagonal element $E_{ii}$ is the sum of all the similarities of $\bar{Y}_i$, that is, $E_{ii} = \sum_{j=1}^{m} W_{ij}$. According to [38], the Laplacian matrix can be defined as $L = E - W$. Hence, we formulate the last term as follows:

$$\frac{\lambda_3}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|\alpha_i - \alpha_j\|^2 W_{ij} = \lambda_3 \mathrm{Tr}(\Omega L \Omega^T), \qquad (20)$$

with $\Omega = [\alpha_1, \alpha_2, \cdots, \alpha_m]$. Now, we reformulate the objective function Eqn.(14) as follows:

$$\min_{\{\alpha_i\}_{i=1}^{m}} \sum_{i=1}^{m} \|z_i - A\alpha_i\|_2^2 + \lambda_1 \sum_{i=1}^{m} \|\alpha_i\|_1 + \lambda_2 \sum_{i=1}^{m} \alpha_i^T S^{(i)} \alpha_i + \lambda_3 \mathrm{Tr}(\Omega L \Omega^T). \qquad (21)$$
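The auxiliary quantities entering Eqn. (21) can be assembled as in the following sketch (hypothetical helper names; $B$ is the $n \times m$ distance matrix of Eqn. (10) and $W$ the similarity matrix of Eqn. (13)):

```python
import numpy as np

def block_projection(n, m):
    """Block projection matrix Q in R^{n x (mn)}: row i is 1 on the m columns
    belonging to gallery set i (0-indexed positions i*m .. i*m + m - 1)."""
    Q = np.zeros((n, m * n))
    for i in range(n):
        Q[i, i * m:(i + 1) * m] = 1.0
    return Q

def adaption_and_laplacian(B, W):
    """S^{(i)} = diag[(Q^T B)_i]^T diag[(Q^T B)_i] for each probe cluster i
    (Eqn. (19)) and the graph Laplacian L = E - W (Eqn. (20)).
    B is n x m, W is m x m; helper names are hypothetical."""
    n, m = B.shape
    QtB = block_projection(n, m).T @ B    # (mn) x m, column i for cluster i
    S = [np.diag(QtB[:, i] ** 2) for i in range(m)]
    L = np.diag(W.sum(axis=1)) - W
    return S, L
```

Note that $\mathrm{diag}(v)^T \mathrm{diag}(v) = \mathrm{diag}(v^2)$, so each $S^{(i)}$ is simply a diagonal matrix of squared average distances, and every row of $L$ sums to zero.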

It is difficult to solve for all the $\alpha_i$ jointly. In this paper, we propose a solution that optimizes each $\alpha_i$ one by one until convergence. When optimizing $\alpha_i$, we fix the remaining sparse codes $\alpha_j$ ($j \neq i$). Consequently, we reformulate the objective function of Eqn.(21) as follows:

$$\min_{\alpha_i} \|z_i - A\alpha_i\|_2^2 + \lambda_1 \|\alpha_i\|_1 + \lambda_2 \alpha_i^T S^{(i)} \alpha_i + \lambda_3 \big( \alpha_i^T (\Omega L_i) + (\Omega L_i)^T \alpha_i - \alpha_i^T L_{ii} \alpha_i \big), \qquad (22)$$

where $L_i$ is the $i$-th column of $L$ and $L_{ii}$ is the $(i, i)$-th element of $L$. In order to handle the $\|\cdot\|_1$ regularizer, we employ the Inexact Augmented Lagrange Multipliers (IALM) [39] algorithm to find the optimal solution. First, we introduce a slack variable $u_i$ and add a corresponding equality constraint to Eqn.(22):

$$\min_{\alpha_i, u_i} \|z_i - Au_i\|_2^2 + \lambda_1 \|\alpha_i\|_1 + \lambda_2 u_i^T S^{(i)} u_i + \lambda_3 \big( u_i^T (\Omega L_i) + (\Omega L_i)^T u_i - u_i^T L_{ii} u_i \big) \quad \text{s.t.} \quad u_i = \alpha_i. \qquad (23)$$

With this operation, $u_i$ and $\alpha_i$ can be optimized alternately. IALM augments the Lagrangian function with a quadratic penalty term. Thus, Eqn.(23) can be further rewritten as:

$$J = \min_{\alpha_i, u_i} \|z_i - Au_i\|_2^2 + \lambda_1 \|\alpha_i\|_1 + \lambda_2 u_i^T S^{(i)} u_i + \lambda_3 \big[ u_i^T (\Omega L_i) + (\Omega L_i)^T u_i - u_i^T L_{ii} u_i \big] + \mu \Big\| u_i - \alpha_i + \frac{v}{2\mu} \Big\|_2^2. \qquad (24)$$

Update $u_i$: We compute $u_i$ by setting the partial derivative of Eqn.(24) to zero:

$$\frac{\partial J}{\partial u_i} = 2(A^T A u_i - A^T z_i) + 2\lambda_2 S^{(i)} u_i + 2\lambda_3 (U^{\star}_{-i} L_{i,-i} + L_{ii} u_i) + 2\mu \Big( u_i - \alpha_i + \frac{v}{2\mu} \Big) = 0, \qquad (25)$$

where $U^{\star} = [u_1, u_2, \cdots, u_m]$ and $U^{\star}_{-i} \in \mathbb{R}^{(mn) \times (m-1)}$ denotes the submatrix obtained by removing the $i$-th column of $U^{\star}$; $L_{i,-i} \in \mathbb{R}^{(m-1) \times 1}$ denotes the vector obtained by removing the $i$-th entry of the vector $L_i$. More specifically, we update $u_i$ by solving:

$$u_i^{(t+1)} = \big( A^T A + \lambda_2 S^{(i)} + \lambda_3 L_{ii} I + \mu^{(t)} I \big)^{-1} \Big( A^T z_i - \lambda_3 U^{\star(t)}_{-i} L_{i,-i} + \mu^{(t)} \alpha_i^{(t)} - \frac{v^{(t)}}{2} \Big). \qquad (26)$$

Update $\alpha_i$: This is done by solving:

$$\alpha_i^{(t+1)} = \arg\min_{\alpha_i} \frac{\lambda_1}{2\mu^{(t)}} \|\alpha_i\|_1 + \frac{1}{2} \Big\| \alpha_i - \Big( u_i^{(t+1)} + \frac{v^{(t)}}{2\mu^{(t)}} \Big) \Big\|_2^2, \qquad (27)$$

whose solution is given by:

$$\alpha_i^{(t+1)} = \mathcal{S}_{\frac{\lambda_1}{2\mu^{(t)}}} \Big( u_i^{(t+1)} + \frac{v^{(t)}}{2\mu^{(t)}} \Big). \qquad (28)$$

Note that $\mathcal{S}_\theta(x) = \mathrm{sign}(x) \cdot \max(0, |x| - \theta)$ is the soft-thresholding operator.

Update multipliers: They are updated by:

$$v^{(t+1)} = v^{(t)} + \mu^{(t)} \big( u_i^{(t+1)} - \alpha_i^{(t+1)} \big), \qquad \mu^{(t+1)} = \rho \cdot \mu^{(t)}, \qquad (29)$$

where $\rho$ is set to 1.1 in this paper.
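The soft-thresholding operator and the updates of Eqns. (28)-(29) can be sketched as follows; the $u$-update of Eqn. (26) is omitted here for brevity, and the function names are hypothetical:

```python
import numpy as np

def soft_threshold(x, theta):
    """S_theta(x) = sign(x) * max(0, |x| - theta), the proximal operator
    of the l1 norm used in Eqn. (28)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ialm_step(u, v, mu, lam1, rho=1.1):
    """One pass of the alpha / multiplier / penalty updates, Eqns. (28)-(29),
    given the freshly updated slack variable u (Eqn. (26) not shown)."""
    alpha_new = soft_threshold(u + v / (2 * mu), lam1 / (2 * mu))
    v_new = v + mu * (u - alpha_new)
    return alpha_new, v_new, rho * mu
```

Elementwise shrinkage is what produces exact zeros in the codes, which is why the $\ell_1$ term yields genuinely sparse coefficient vectors rather than merely small ones.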

Algorithm 1: Grassmann locality-aware group sparse coding (GLGSC).
Input: Subspaces of the probe set clusters $\{\bar{Y}_i\}_{i=1}^{m}$, with $\bar{Y}_i \in \mathcal{G}(q, d)$; subspaces of all gallery set clusters $\bar{D} = \{\bar{D}_j\}_{j=1}^{mn}$, with $\bar{D}_j \in \mathcal{G}(q, d)$; parameters $\lambda_1, \lambda_2, \lambda_3$.
Initialize: $t = 0$; $v^{(0)} = 0 \in \mathbb{R}^{mn \times 1}$; $\mu^{(0)} = 0.0001$.
for $i \leftarrow 1$ to $mn$ do
  for $j \leftarrow 1$ to $mn$ do
    $[K(\bar{D})]_{i,j} = \mathrm{Tr}(\bar{D}_i^T \bar{D}_j \bar{D}_j^T \bar{D}_i) = \|\bar{D}_i^T \bar{D}_j\|_F^2$
  end
end
$K(\bar{D}) = U\Sigma U^T$; $A = \Sigma^{\frac{1}{2}} U^T$;
for $i \leftarrow 1$ to $m$ do
  for $j \leftarrow 1$ to $mn$ do
    $[K_i(\bar{Y}_i, \bar{D})]_j = \mathrm{Tr}(\bar{D}_j^T \bar{Y}_i \bar{Y}_i^T \bar{D}_j) = \|\bar{Y}_i^T \bar{D}_j\|_F^2$
  end
  $z_i = \Sigma^{-\frac{1}{2}} U^T K_i(\bar{Y}_i, \bar{D})$;
end
while not converged do
  Update $u_i^{(t+1)}$ by Eqn.(26);
  Update $\alpha_i^{(t+1)}$ by Eqn.(28);
  Update $v^{(t+1)}$ and $\mu^{(t+1)}$ by Eqn.(29);
  $t \leftarrow t + 1$;
end
Output: The sparse codes $\alpha^* = [\alpha_1^*, \alpha_2^*, \cdots, \alpha_m^*]$.

The GLGSC procedure is summarized in Algorithm 1. The convergence of inexact ALM has been studied for the case of at most two blocks [40]. Since there are exactly two blocks ($\alpha$ and $u$) in Algorithm 1, its convergence is easy to establish in theory. Fig. 2 shows the convergence of the proposed GLGSC model on the Cambridge gesture dataset. Once the solution $\alpha^*$ of Eqn.(22) is obtained, the regularized residuals can be computed as:

$$r_i = \|z_i - A\alpha_i^*\|_2. \qquad (30)$$

Since the proposed method provides a more stable representation of a probe set, voting strategies become less sensitive to noise and outliers. Thus, we simply adopt a majority-vote strategy similar to [29]:

$$\mathrm{identity}(Y) = \mathrm{vote}\big(\arg\min_i \{r_i\}\big). \qquad (31)$$
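A sketch of one plausible reading of this classification rule: each probe cluster computes class-wise reconstruction residuals in the style of SRC (keeping only the coefficients of one class's atoms at a time) and votes for the smallest, and the majority label wins. The exact residual grouping is an assumption here, and all names are hypothetical:

```python
import numpy as np

def classify_probe(Z, A, alphas, atom_labels):
    """Majority-vote classification in the spirit of Eqns. (30)-(31).
    Z: list of lifted probe-cluster vectors z_i; A: dictionary factor;
    alphas: solved codes; atom_labels[j]: class of dictionary atom j."""
    atom_labels = np.asarray(atom_labels)
    classes = np.unique(atom_labels)
    votes = []
    for z, a in zip(Z, alphas):
        residuals = []
        for c in classes:
            mask = atom_labels == c
            # residual when only class c's coefficients are kept
            residuals.append(np.linalg.norm(z - A[:, mask] @ a[mask]))
        votes.append(classes[int(np.argmin(residuals))])
    values, counts = np.unique(np.array(votes), return_counts=True)
    return values[int(np.argmax(counts))]
```

Because every cluster votes independently, a single cluster corrupted by occlusion or outliers cannot flip the set-level decision on its own, which is the stability argument made above.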

Fig. 2. Convergence of the proposed GLGSC model (objective function value vs. iteration number).

4. Kernelised GLGSC

In this section, we extend our model to higher-dimensional Grassmann manifolds to deal with non-linearity in the data; it is the kernel extension of Section 3.

4.1. Formulation

Assume the clusters of the probe set are $Y = \{Y_i\}_{i=1}^{m}$ with $Y_i = \{y_j\}_{j=1}^{p_i}$, $y_j \in \mathbb{R}^d$, where $m$ denotes the number of probe clusters and $p_i$ denotes the number of images in the $i$-th cluster. Then, $Y_i$ can be mapped into a dot product Hilbert space $\mathcal{H}$ by a mapping $\phi: \mathbb{R}^d \to \mathcal{H}$. According to [41], we obtain $\langle y_1, y_2 \rangle_{\mathcal{H}} = \phi(y_1)^T \phi(y_2) = k(y_1, y_2)$ for any $y_1, y_2 \in \mathbb{R}^d$. Here, $k(\cdot, \cdot)$ is a given kernel function, and we use the Gaussian kernel $k(u, v) = \exp(-\gamma \|u - v\|^2)$ in this paper. With this mapping, we have $\Phi(Y_i) = [\phi(y_1), \phi(y_2), \cdots, \phi(y_{p_i})]$ and its $q$-dimensional orthonormal basis on $\mathcal{H}$: $\Psi(Y_i) = [\varphi_1, \varphi_2, \cdots, \varphi_q]$. Note that, as in the original space, $p_i$ and $q$ satisfy $q \le p_i$. KPCA [41] gives the following equation:

$$\Psi(Y_i) = \Phi(Y_i) \hat{U}_{(Y_i)q} \hat{\Sigma}_{(Y_i)q}^{-\frac{1}{2}}, \qquad (32)$$

where $\hat{U}_{(Y_i)q}$ and $\hat{\Sigma}_{(Y_i)q}$ are the $q$ largest eigenvectors and the corresponding diagonal matrix of singular values of $\hat{U}_{(Y_i)}$ and $\hat{\Sigma}_{(Y_i)}$. It is worth mentioning that $\hat{U}$ and $\hat{\Sigma}$ are obtained by the decomposition:

$$\Phi(Y_i)^T \Phi(Y_i) = \hat{U}_{(Y_i)} \hat{\Sigma}_{(Y_i)} \hat{U}_{(Y_i)}^T. \qquad (33)$$

Similarly, the $j$-th gallery cluster $\Psi(D_j)$ is obtained by the corresponding formulation. With the above analysis, we reformulate Eqn.(14) as:

$$\min_{\{\alpha_i\}_{i=1}^{m}} \sum_{i=1}^{m} \Big\| \Psi(Y_i)\Psi(Y_i)^T - \sum_{j=1}^{m \cdot n} [\alpha_i]_j \Psi(D_j)\Psi(D_j)^T \Big\|_F^2 + \lambda_1 \sum_{i=1}^{m} \|\alpha_i\|_1 + \lambda_2 \sum_{i=1}^{m} \sum_{j=1}^{n} \big\| (\alpha_i)_{C_j} \otimes (\mathbf{1} \cdot B_{ji}) \big\|_2^2 + \frac{\lambda_3}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|\alpha_i - \alpha_j\|^2 W_{ij}, \qquad (34)$$

which is similar to the formulation in the original space.
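The ingredients of the kernelised model can be sketched as follows; `gaussian_kernel` and `kpca_basis_factors` are hypothetical names, and note that the feature-space basis $\Psi$ of Eqn. (32) is never materialized, only its factors from Eqn. (33):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """k(u, v) = exp(-gamma * ||u - v||^2); columns of X1, X2 are images."""
    d2 = ((X1[:, :, None] - X2[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-gamma * d2)

def kpca_basis_factors(K, q):
    """Eqns. (32)-(33): eigendecompose the Gram matrix K = Phi^T Phi and keep
    the top-q eigenpairs, so Psi = Phi U_q Sigma_q^{-1/2} is orthonormal in
    feature space (represented implicitly by the returned factors)."""
    w, U = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:q]
    return U[:, idx], w[idx]
```

The orthonormality of $\Psi$ can be verified purely through the Gram matrix: $\Psi^T\Psi = \hat{\Sigma}_q^{-1/2}\hat{U}_q^T K \hat{U}_q \hat{\Sigma}_q^{-1/2} = I_q$, which is exactly the identity the kernelised Eqns. (35)-(36) rely on.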

4.2. Optimization

Compared with Eqn.(14), only the first term changes on the Hilbert space, so the solution of Eqn.(A.1) is the key problem. Therefore, we rewrite Eqn.(15) as follows:

$[K'_i(\Psi(Y_i), \Psi(D))]_j = \big\| \Psi(Y_i)^T\Psi(D_j) \big\|_F^2 = \Big\| \hat{\Sigma}_{(Y_i)_q}^{-1/2}\hat{U}_{(Y_i)_q}^T\, K(Y_i, D_j)\, \hat{U}_{(D_j)_q}\hat{\Sigma}_{(D_j)_q}^{-1/2} \Big\|_F^2.$    (35)

Here, K(Y_i, D_j) denotes a kernel matrix whose entry in the a-th row and b-th column is k(y_a, d_b), where y_a and d_b are the a-th column of Y_i and the b-th column of D_j, respectively. Similarly, Eqn.(16) can be rewritten as

$[K'(\Psi(D))]_{i,j} = \Big\| \hat{\Sigma}_{(D_i)_q}^{-1/2}\hat{U}_{(D_i)_q}^T\, K(D_i, D_j)\, \hat{U}_{(D_j)_q}\hat{\Sigma}_{(D_j)_q}^{-1/2} \Big\|_F^2,$    (36)

where the entry in the a-th row and b-th column of K(D_i, D_j) is k(d_a, d_b). Thus, we can reformulate Eqn.(A.1) as follows:

$\sum_{i=1}^m \Big\| \Psi(Y_i)\Psi(Y_i)^T - \sum_{j=1}^{m\cdot n} [\alpha_i]_j\,\Psi(D_j)\Psi(D_j)^T \Big\|_F^2 = \sum_{i=1}^m \Big( c + \alpha_i^T K'(\Psi(D))\,\alpha_i - 2\,\alpha_i^T K'_i(\Psi(Y_i), \Psi(D)) \Big),$    (39)

where c is a constant. Consequently, this admits a simple solution according to Eqn.(21). The implementation is detailed in Algorithm 2.
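As a hedged sketch of Eqns (35)-(36) (all names below are ours), the similarity entries can be computed purely from kernel evaluations. Here we use a linear kernel, for which the feature map is the identity, purely to cross-check the kernel-side computation against an explicit subspace computation:

```python
import numpy as np

def basis_coeffs(K, q):
    # Eqn.(33): eigendecompose the Gram matrix; the coefficients of the
    # q-dim orthonormal basis are U_q * w_q^{-1/2} (Eqn.(32)).
    w, U = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:q]
    return U[:, idx] / np.sqrt(w[idx])

def kernel_similarity(Ka, Kab, Kb, q):
    # Eqns (35)/(36): ||Psi(A)^T Psi(B)||_F^2 from kernel matrices only.
    Ca, Cb = basis_coeffs(Ka, q), basis_coeffs(Kb, q)
    return np.linalg.norm(Ca.T @ Kab @ Cb, 'fro') ** 2

rng = np.random.default_rng(1)
A, B = rng.standard_normal((6, 10)), rng.standard_normal((6, 9))
# Linear kernel k(u, v) = u^T v, so all Gram matrices are plain products.
val = kernel_similarity(A.T @ A, A.T @ B, B.T @ B, q=3)

# Cross-check against explicit 3-dim subspace bases (left singular vectors).
Ua = np.linalg.svd(A)[0][:, :3]
Ub = np.linalg.svd(B)[0][:, :3]
print(np.isclose(val, np.linalg.norm(Ua.T @ Ub, 'fro') ** 2))
```

With a non-linear kernel such as the Gaussian one, only the kernel-side path remains available, which is exactly what Algorithm 2 exploits.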

5. Experiments

To demonstrate the performance of the proposed model, we evaluate it on three typical visual classification tasks: face recognition, object categorization and gesture recognition.

5.1. Datasets and settings

For the face recognition task, we use the Honda/UCSD dataset [34], the Extended Yale Face Database B [42] and the YouTube Celebrities dataset [43].

Algorithm 2: Kernel locality-aware group sparse coding (KLGSC).
Input: the probe set clusters {Y_i}_{i=1}^m; all gallery set clusters D = {D_j}_{j=1}^{mn}; parameters λ_1, λ_2, λ_3.
Initialize: t = 0; v^(0) = 0 ∈ R^{mn×1}; μ^(0) = 0.0001.
Compute $\hat{U}_{(Y_i)}$, $\hat{U}_{(D_j)}$, $\hat{\Sigma}_{(Y_i)}$, $\hat{\Sigma}_{(D_j)}$ by Eqn.(33);
for i ← 1 to mn do
    for j ← 1 to mn do
        $[K'(\Psi(D))]_{i,j} = \Big\| \hat{\Sigma}_{(D_i)_q}^{-1/2}\hat{U}_{(D_i)_q}^T\, K(D_i, D_j)\, \hat{U}_{(D_j)_q}\hat{\Sigma}_{(D_j)_q}^{-1/2} \Big\|_F^2;$    (37)
    end
end
$K'(\Psi(D)) = U\Sigma U^T$; $A = \Sigma^{1/2} U^T$;
for i ← 1 to m do
    for j ← 1 to mn do
        $[K'_i(\Psi(Y_i), \Psi(D))]_j = \Big\| \hat{\Sigma}_{(Y_i)_q}^{-1/2}\hat{U}_{(Y_i)_q}^T\, K(Y_i, D_j)\, \hat{U}_{(D_j)_q}\hat{\Sigma}_{(D_j)_q}^{-1/2} \Big\|_F^2;$    (38)
    end
    $z_i = \Sigma^{-1/2} U^T K'_i(\Psi(Y_i), \Psi(D))$;
end
while not converged do
    Update u_i^(t+1) by Eqn.(26);
    Update α_i^(t+1) by Eqn.(28);
    Update v^(t+1) and μ^(t+1) by Eqn.(29);
    t ← t + 1;
end
Output: the sparse codes α* = [α*_1, α*_2, ..., α*_m].

The Honda/UCSD dataset contains 59 video sequences of 20 different people. Each video contains about 300-500 frames covering large pose, illumination and expression variations. The Extended Yale Face Database B (YaleB) consists of 16128 images of 28 subjects under 9 poses and 64 illumination conditions, which makes it a database with large illumination variation. We build image sets according to the different poses of each subject, where each image set contains approximately 20-64 frames. The YouTube Celebrities (YTC) database is collected from YouTube and contains 1910 video sequences of 47 subjects. The face images in each video sequence are highly compressed and of low resolution, which makes YTC much more challenging than Honda and YaleB. Following [5, 6, 15], a cascaded face detector is employed to collect faces. Each face is resized to a 20 × 20 intensity image and then pre-processed by histogram equalization to eliminate lighting effects. Some examples can be seen in Fig. 3(a)(b)(c).

Fig. 3. Some example images of five datasets: (a) Honda, (b) YaleB, (c) YTC, (d) ETH-80, (e) Cambridge-Gesture.

For the object categorization task, we use the benchmark ETH-80 database [44]. It contains 8 different categories, each containing 10 objects. Each object has 41 frames captured from different views. We also resize the images to 20 × 20 and treat each object as an image set. Some examples are shown in Fig. 3(d).

For the gesture recognition task, we consider the Cambridge Gesture dataset [45], which contains 900 sequences of 9 gestures. Each gesture contains 100 video sequences, which are further divided into 5 illumination conditions and 10 motions from each of two subjects (see Fig. 3(e) for examples). Following the protocol of [45], the first four sets are used for testing and the fifth set for training. Each frame is resized to 20 × 20 and the pixel values are used as feature vectors.

For the first four datasets, we conducted ten-fold cross-validation experiments: as shown in Table 1, the average recognition rate and variance were computed over ten randomly selected gallery/probe combinations. For the last dataset, we conducted a single experiment following the protocol of [45] since it is an action recognition task. In particular, on the Honda/UCSD dataset, one image set per subject was randomly selected for training (20 sets in total) and the remaining 39 sets were used for testing. On the YaleB dataset, 3 image sets of each subject were randomly used for the gallery and the remaining sets were used as probes. On the YTC dataset, each person had 3 randomly chosen image sets for training and 6 for testing. On the ETH-80 database, we randomly selected 5 sets of each object for the gallery and the remaining 5 sets as probes. For Honda/UCSD, we employed two types of images: raw and pre-processed with histogram equalization. It is worth mentioning that histogram equalization achieved slightly better performance since it provides some compensation for illumination variations.

5.2. Comparative methods and settings

We compare the performance of our proposed method with several existing


image set classification methods, as follows.

1. Linear subspace based: Mutual Subspace Method (MSM) [10]; Constrained Mutual Subspace Method (CMSM) [11]; Discriminant Canonical Correlations (DCC) [12].
2. Nonlinear manifold based: Manifold-Manifold Distance (MMD) [5]; Manifold Discriminant Analysis (MDA) [6]; Grassmann Discriminant Analysis (GDA) [1]; Grassmannian Graph Embedding Discriminant Analysis (GEDA) [2]; Projection Metric Learning (PML) [17].
3. Affine subspace based: Affine Hull based Image Set Distance (AHISD) [7]; Convex Hull based Image Set Distance (CHISD) [7].
4. Covariance matrix based: Covariance Discriminative Learning (CDL) [15].
5. Sparse coding based: Sparse Approximated Nearest Point (SANP) [24]; Regularized Nearest Points (RNP) [28]; Mean Sequence Sparse Coding Classification (MSSRC) [31]; Image Set Collaborative Representation and Classification (ISCRC) [32]; Grassmann Sparse Coding (GSC) [8].

Note that the codes of all the above methods were available on the authors' websites, with the exception of CMSM and MSSRC; we therefore implemented CMSM and MSSRC in MATLAB and used the optimal parameters. It is worth mentioning that the comparative methods can also be regrouped into two categories, i.e. unsupervised and supervised methods. In particular, the unsupervised methods include MSM, MMD, AHISD, CHISD, SANP, RNP, ISCRC, MSSRC, GSC and the proposed method. The supervised methods include DCC, MDA, GDA, GEDA, CDL, CMSM and PML, which employ label information to learn the model and enhance the discriminative capability.

For fair comparison, the important parameters of the other methods were empirically tuned according to the original literature and the source codes provided by the original authors. More precisely, for MSM, CMSM, DCC and MMD, we employed PCA to learn linear subspaces preserving 95% of the data energy. In particular, when DCC was employed on Honda, each image set was randomly separated into two subsets to obtain within-class scatter matrices. For MDA, the dimension of the embedding space and the number of between-class NN local models were set as in [6] according to the different datasets. For GDA, GEDA and PML, we experimentally set the dimension of the Grassmann manifold for the best performance. For AHISD and CHISD, the error penalty was set to C = 100 and the percentage of energy preserved by PCA was set to 95%. For CDL, the PLS-based version was adopted since it achieved higher accuracy under the same setting as [15]. For SANP, RNP and ISCRC, we used the parameters from the original literature [24], [28], [32], respectively, to balance the reconstruction term against the regularization term. To determine the number of compressed dictionary atoms used in ISCRC, we selected from {5, 10, 20} for the best performance of the l1-norm regularized optimization, as in [32]. Note that we also extended the original GSC so that all image sets are divided into clusters by HDC, which we call GSC-HDC, in order to demonstrate the good performance of our proposed method. Our GLGSC was compared with the non-kernel methods, and KLGSC with the kernel methods. Note that on Honda, YTC and ETH-80, the number of clusters was set according to MDA [6]; on the YaleB and Cambridge Gesture datasets, we chose the parameters experimentally to obtain the best performance.

5.3. Results and analysis

The ten-fold cross-validated averages and standard deviations of all methods on the five datasets are summarized in Table 1. Each reported rate is an average over the ten trials. It is worth mentioning that the proposed GLGSC-HDC separates each image set into several clusters by the HDC algorithm. To show the superiority of the representation adaption term and the locality consistent regularization term, we also conduct experiments on the proposed method without separation, which we call GLGSC. In this case, several probe sets are tested simultaneously and we can compare it with GSC directly. From Table 1, we find that the proposed method achieves the best performance on all datasets compared with the other methods. Here, we observe that the HDC clustering is beneficial to image set classification in the Grassmann manifold sparse coding framework except on the ETH-80 and Cambridge Gesture datasets; that is, the performance of GSC-HDC and GLGSC-HDC is better than that of GSC and GLGSC, respectively. This is reasonable because ETH-80 is

Table 1. Average recognition rates (%) of different image set classification methods on the five datasets.

Methods        ETH-80        Honda          YaleB         YTC           Cambridge
MSM [10]       86.00±1.31    93.59±0.77     67.26±0.95    60.92±1.20    59.86
DCC [12]       88.25±1.80    95.12±0.04     79.46±0.48    64.65±0.59    65.00
MMD [5]        83.75±2.75    94.10±1.90     49.05±0.72    63.88±0.78    58.10
MDA [6]        79.50±2.71    96.15±0.47     62.56±1.40    66.48±0.13    20.90
AHISD [7]      70.00±0.68    90.77±1.80     59.05±0.89    63.70±0.59    18.10
CHISD [7]      73.50±2.53    89.74±1.30     63.52±0.72    66.33±0.81    18.30
SANP [24]      69.75±0.74    89.74±1.30     58.57±1.10    70.53±0.78    22.50
RNP [28]       72.50±1.90    96.15±0.47     63.27±1.10    68.65±0.95    20.28
GDA [1]        91.00±2.13    96.92±0.85     82.80±0.96    63.45±0.99    81.53
GEDA [2]       92.50±1.16    91.79±3.48     79.11±0.92    64.56±0.54    81.86
CMSM [11]      88.25±0.90    96.41±0.76     69.70±0.65    63.42±2.10    79.44
CDL [15]       93.50±0.72    100.00±0.00    88.21±0.57    69.61±0.95    73.40
PML [17]       93.75±1.80    97.95±0.41     77.08±0.18    66.76±1.20    84.72
MSSRC [31]     75.75±1.10    95.38±0.85     87.68±0.57    72.10±0.70    13.89
ISCRC [32]     70.00±0.50    98.97±0.32     73.87±0.44    70.39±0.27    14.03
GSC [8]        95.00±1.10    97.95±0.41     88.04±0.51    69.72±1.60    88.61
GSC-HDC        92.75±1.00    98.46±0.32     87.32±0.29    76.44±1.60    83.33
GLGSC          96.75±0.91    98.46±0.32     88.51±0.53    69.80±1.40    89.58
GLGSC-HDC      93.50±0.90    99.23±0.15     90.71±0.41    76.83±2.00    85.56

a small dataset in which each image set contains only 41 frames. Further, since the Cambridge Gesture dataset is an action recognition task, image set clustering may break the temporal consistency of a gesture. In particular, the accuracy on YTC increases by about 7% when clustering before sparse coding, because the YTC dataset is mostly low-resolution and highly compressed, with complex variation in each set. As expected, our proposed GLGSC and GLGSC-HDC representations significantly outperform GSC and GSC-HDC, respectively. This result confirms that the group locality-aware Grassmann sparse coding preserves locality information and captures the inter- and intra-set variations more successfully than GSC and the other existing methods.

An interesting observation from the experiments is that methods based on sparse coding in Euclidean space (e.g. SANP, RNP, MSSRC, ISCRC) achieve strong performance for face recognition, but are not suitable for object recognition and action recognition. Fortunately, the proposed method performs sparse coding on Grassmann manifolds, which is appropriate for various tasks. Note that the


Fig. 4. Comparison of different methods (CMSM, GDA, GEDA, PML, GSC, GLGSC) varying with the dimension of the Grassmann manifold.

nonlinear manifold based methods (e.g. GDA, GEDA, PML) and the covariance matrix based method (CDL) exhibit high accuracy on all datasets, and CDL achieves the best performance on Honda/UCSD. This implies that Grassmann manifolds are beneficial to image set classification, especially for our proposed method. The CDL method exhibits lower accuracy on the ETH-80 dataset than reported in [15] due to different folds in the trials. Fig. 4 shows how the performance of different methods varies with the dimension of the Grassmann manifold; our method achieves the highest accuracy at most dimensions.

Table 2. Performance of the proposed method with varying parameters q, λ_1, λ_2, λ_3 on the Honda/UCSD dataset.

q         3      4      5      6      7      8      9      10
accuracy  91.79  93.59  96.41  96.67  99.23  97.18  97.69  97.69
λ_1       0.01   0.02   0.05   0.1    0.2    0.3    0.5    1
accuracy  89.74  93.56  96.67  99.23  99.23  98.97  98.21  93.56
λ_2       0.01   0.05   0.1    0.15   0.2    0.3    0.4    0.5
accuracy  96.67  97.95  98.21  99.23  98.97  98.46  97.69  97.17
λ_3       0.01   0.02   0.5    1      2      3      4      5
accuracy  98.97  99.23  99.23  99.23  99.23  98.97  98.97  98.97

The proposed model has four parameters: the dimension q of the Grassmann manifold and three regularization parameters λ_1, λ_2, λ_3. As suggested in [8], the dimension of the Grassmann manifold can be predicted by using a NN classifier on the Grassmann manifold in a preprocessing step. Note that the performance of the proposed method is sensitive to this dimension; preliminary experiments suggest q ∈ [3, 12]. Fig. 4 shows the results on the ETH-80 dataset using 5 gallery sets per class for training and the remaining sets for testing.

Table 2 shows the results of the GLGSC-HDC model on the Honda/UCSD dataset when varying each of the above parameters while fixing the others. From this table, we find that the recognition performance is insensitive to the dimension of the Grassmann manifold when q ∈ [5, 10] on Honda. Since λ_1 is the l1-norm regularization parameter, a larger λ_1 results in sparser coefficients; thus, the performance is sensitive to the value of λ_1. λ_2 controls the representation adaption term, which enforces that the contributions of different gallery subspaces to the probe subspaces can vary in importance; we notice that this term improves the performance of the model remarkably. λ_3 constrains the locality consistent term, which incorporates locality and similarity information into the sparse coding objective to improve the stability of the sparse codes. This implies that our model is robust to outliers. Fig. 5 shows the results for different parameters λ_1, λ_2, λ_3 on the ETH-80 dataset with the Grassmann manifold dimension set to 3.

(b)

(c)

Fig. 5. Parameter analysis with respect to λ1 ,λ2 ,λ3 on ETH-80 dataset. 360

As shown in Fig. 6, the average recognition rates of the kernelised image set classification methods (KAHISD, KCHISD, KCH-ISCRC, KGSC, KGLGSC) on the five datasets are exhibited. Our kernelised version outperforms the other methods on ETH-80, Honda and Cambridge Gesture. Additionally, KCH-ISCRC achieves the best performance on the Extended YaleB and YTC datasets. However, KAHISD, KCHISD and KCH-ISCRC perform poorly on Cambridge Gesture and are not suitable for action recognition. Fig. 7 shows the performance on Honda for varying values of the kernel parameter γ.

Fig. 6. Average recognition rates of kernelised image set classification methods on the five datasets.


Fig. 7. Performance on Honda for different values of the kernel parameter γ.

5.4. Recognition on an uncontrolled dataset

We observe that least squares coding coefficients are discriminative if the data classes are distributed far apart. However, in face processing there is a general phenomenon that the variation among faces of the same class due to illumination is often larger than the variation caused by a change of class. To evaluate the robustness of our method, in this subsection we also conduct experiments on the Honda/UCSD dataset without histogram equalisation, where the difference in lighting is larger than the difference between different people. Table 3 compares the proposed method with some recent and popular set-based visual classification approaches mentioned in the previous subsection. We observe that our method delivers the best performance, with the highest accuracy, on this uncontrolled dataset.

Table 3. Average recognition rates (%) of different methods on Honda/UCSD without histogram equalisation.

Methods    Accuracy       Methods      Accuracy
DCC        85.64±0.41     CMSM         87.68±1.30
MMD        76.67±1.80     CDL          93.33±0.18
MDA        78.97±0.85     PML          89.74±1.74
AHISD      65.38±2.20     MSSRC        89.74±1.90
CHISD      67.69±1.90     ISCRC        92.31±3.10
SANP       69.23±2.60     GSC          94.10±1.20
RNP        88.72±1.70     GSC-HDC      93.33±1.10
GDA        84.87±3.57     GLGSC        95.64±0.30
GEDA       83.08±3.24     GLGSC-HDC    94.36±0.91

5.5. Computational complexity evaluation

Finally, we analyze the computational complexity of the proposed methods. It is acknowledged that the operation of mapping Grassmann manifolds to the space of symmetric matrices is computationally expensive. However, thanks to the matrix decomposition (see Eqn.(18)), the complex formulation reduces to a common sparse coding problem. Moreover, our proposed model clusters a single set into multiple image sets and models each image set as a point on a Grassmann manifold, resulting in a high computational cost. To reduce the computational complexity, we also evaluate its simpler version GLGSC. Since GSC only considers an l1-norm regularized term, GSC is slightly faster than the proposed method. The times in seconds required for offline training and online testing over all subjects on the YTC dataset are listed in Table 4. Here, 'N/A' means that the method does not require training. Note that training time is only required by discriminant methods, and testing time is the classification time for matching one probe set against all gallery sets, following [15]. The results in Table 4 suggest that the computation time of our method is higher than that of some methods such as MDA and CMSM due to the sparse constraints. However, our simpler version GLGSC is faster than most other sparse coding based methods in both training and testing. Note that all methods are implemented in MATLAB on a 3.4 GHz x64 computer.

Table 4. Computation time (in seconds) of different methods for training and testing on YTC (classification of one image set).

Methods    Train    Test       Methods      Train    Test
DCC        7.029    0.295      CMSM         0.348    0.008
MMD        N/A      39.535     CDL          21.936   0.003
MDA        6.369    0.085      PML          72.608   0.002
AHISD      N/A      1.019      MSSRC        N/A      3.906
CHISD      N/A      4.277      ISCRC        7.978    1.158
SANP       N/A      11.132     GSC          N/A      0.152
RNP        N/A      0.128      GSC-HDC      N/A      5.331
GDA        23.174   0.016      GLGSC        N/A      1.805
GEDA       28.518   0.001      GLGSC-HDC    N/A      22.291

6. Conclusion

In this paper, we propose an image set classification algorithm, namely Grassmann locality-aware group sparse coding (GLGSC), and its kernelised version (KLGSC). The proposed methods preserve locality information by exploiting representation exclusivity. To take advantage of the relationships among image sets and capture the inter- and intra-set variations, we apply the hierarchical divisive clustering (HDC) approach, and a locality consistent regularization term is employed to constrain the representation coefficients of neighboring points on the Grassmann manifold to be as similar as possible. Finally, we integrate these regularization terms into the Grassmann manifold sparse coding framework to improve the performance of image set classification. The proposed approach has been evaluated on three visual classification applications: face recognition, object categorization and gesture recognition. However, our proposed method is not well suited to small datasets, as it may not extract proper multiple subspaces from an image set with few samples. Further, linear subspace modeling cannot well accommodate image sets with complex variations. A second-order statistic, such as the covariance matrix, may be a better representation for an image set with any number of frames and may characterize its structure more faithfully.

For future work, we will adapt this model to other types of Riemannian manifolds, such as SPD and Stiefel manifolds. We will attempt to preserve more

discriminant information by sparse coding on the tangent space or in a Reproducing Kernel Hilbert Space (RKHS). We will also apply our method to more tasks, such as pedestrian detection.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the National Science Foundation of China under Grant No. 61673220, the National Natural Science Foundation of China under Grant No. 61906091, the Natural Science Foundation of Jiangsu Province, China (Youth Fund Project) under Grant No. BK20190440, and the Fundamental Research Funds for the Central Universities under Grant No. 30919011229.


Appendix A. Proof of Theorem 1

Proof. The optimization problem in Eqn.(17) can be expanded as follows:

$\min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \Big\| \bar{Y}_i\bar{Y}_i^T - \sum_{j=1}^{m\cdot n} [\alpha_i]_j\, \bar{D}_j\bar{D}_j^T \Big\|_F^2$
$= \min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \Big[ \big\| \bar{Y}_i\bar{Y}_i^T \big\|_F^2 + \mathrm{Tr}\Big( \big( \sum_{j=1}^{m\cdot n} [\alpha_i]_j\, \bar{D}_j\bar{D}_j^T \big)^T \big( \sum_{k=1}^{m\cdot n} [\alpha_i]_k\, \bar{D}_k\bar{D}_k^T \big) \Big) - 2\, \mathrm{Tr}\Big( \big( \bar{Y}_i\bar{Y}_i^T \big)^T \big( \sum_{j=1}^{m\cdot n} [\alpha_i]_j\, \bar{D}_j\bar{D}_j^T \big) \Big) \Big]$
$= \min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \Big( \big\| \bar{Y}_i\bar{Y}_i^T \big\|_F^2 + \alpha_i^T K(\bar{D})\,\alpha_i - 2\,\alpha_i^T K_i(\bar{Y}_i, \bar{D}) \Big).$    (A.1)

Evidently, we can remove the first term of Eqn.(A.1) since it is a constant. $K(\bar{D})$ is a positive semi-definite matrix, which makes Eqn.(A.1) convex. In the sequel, assuming $A = \Sigma^{1/2} U^T$ and $z_i = \Sigma^{-1/2} U^T K_i(\bar{Y}_i, \bar{D})$, with $UU^T = I_{mn}$, and plugging them back into Eqn.(18):

$\min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \| z_i - A\alpha_i \|_2^2 = \min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \Big\| \Sigma^{-1/2} U^T K_i(\bar{Y}_i, \bar{D}) - \Sigma^{1/2} U^T \alpha_i \Big\|_2^2$
$= \min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \Big[ \mathrm{Tr}\big( K_i(\bar{Y}_i, \bar{D})^T U \Sigma^{-1/2} \Sigma^{-1/2} U^T K_i(\bar{Y}_i, \bar{D}) \big) - \alpha_i^T U \Sigma^{1/2} \Sigma^{-1/2} U^T K_i(\bar{Y}_i, \bar{D}) - K_i(\bar{Y}_i, \bar{D})^T U \Sigma^{-1/2} \Sigma^{1/2} U^T \alpha_i + \alpha_i^T U \Sigma^{1/2} \Sigma^{1/2} U^T \alpha_i \Big]$
$= \min_{\{\alpha_i\}_{i=1}^m} \sum_{i=1}^m \Big( \mathrm{const} - 2\,\alpha_i^T K_i(\bar{Y}_i, \bar{D}) + \alpha_i^T K(\bar{D})\,\alpha_i \Big),$    (A.2)

where $U$ and $\Sigma$ can be computed from $K(\bar{D}) = U\Sigma U^T$. Thus, the optimization problem of Eqn.(17) is equivalent to the optimization problem of Eqn.(18). □
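The algebraic identity behind Eqns (A.1)-(A.2) can be verified numerically; the sketch below (all names are ours, with small random stand-ins for the kernel matrix and the sparse code) checks that $\|z - A\alpha\|^2$ equals $\alpha^T K \alpha - 2\alpha^T k + \mathrm{const}$ once $A$ and $z$ are built from the eigendecomposition of $K$:

```python
import numpy as np

# With K = U S U^T symmetric positive definite, A = S^{1/2} U^T and
# z = S^{-1/2} U^T k, we should have ||z - A a||^2 = a^T K a - 2 a^T k + const.
rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
K = M @ M.T + 6 * np.eye(6)            # stand-in for K(D-bar), SPD
k = rng.standard_normal(6)             # stand-in for K_i(Y_i-bar, D-bar)
a = rng.standard_normal(6)             # a candidate sparse code alpha_i

S, U = np.linalg.eigh(K)               # eigenvalues S, eigenvectors U
A = np.diag(np.sqrt(S)) @ U.T
z = np.diag(1.0 / np.sqrt(S)) @ U.T @ k

lhs = np.sum((z - A @ a) ** 2)
const = k @ U @ np.diag(1.0 / S) @ U.T @ k   # the constant term z^T z
rhs = a @ K @ a - 2 * a @ k + const
print(np.isclose(lhs, rhs))
```

Since the constant does not depend on α, minimizing the left-hand side over α is equivalent to minimizing the quadratic form on the right, which is the claim of Theorem 1.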


References

[1] J. Hamm, D. D. Lee, Grassmann discriminant analysis: a unifying view on subspace-based learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 376-383.
[2] M. T. Harandi, C. Sanderson, S. Shirazi, B. C. Lovell, Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2705-2712.
[3] X. Shen, W. Liu, I. W. Tsang, Q.-S. Sun, Y.-S. Ong, Multilabel prediction via cross-view search, IEEE Transactions on Neural Networks and Learning Systems 29 (9) (2017) 4324-4338.
[4] X. Shen, F. Shen, L. Liu, Y.-H. Yuan, W. Liu, Q.-S. Sun, Multiview discrete hashing for scalable multimedia search, ACM Transactions on Intelligent Systems and Technology (TIST) 9 (5) (2018) 53.
[5] R. Wang, S. Shan, X. Chen, W. Gao, Manifold-manifold distance with application to face recognition based on image set, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-8.
[6] R. Wang, X. Chen, Manifold discriminant analysis, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 429-436.
[7] H. Cevikalp, B. Triggs, Face recognition based on image sets, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2567-2573.
[8] M. Harandi, C. Sanderson, C. Shen, B. Lovell, Dictionary learning and sparse coding on Grassmann manifolds: An extrinsic solution, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3120-3127.
[9] Y. Dong, B. Du, L. Zhang, L. Zhang, D. Tao, LAM3L: Locally adaptive maximum margin metric learning for visual data classification, Neurocomputing 235 (2017) 1-9.
[10] O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 318-323.
[11] K. Fukui, A. Maki, Difference subspace and its generalization for subspace-based methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (11) (2015) 2164-2177.
[12] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 1005-1018.
[13] W. Yan, Q. Sun, H. Sun, Y. Li, Z. Ren, Multiple kernel dimensionality reduction based on linear regression virtual reconstruction for image set classification, Neurocomputing 361 (2019) 256-269.
[14] X. Gao, Q. Sun, H. Xu, J. Gao, Sparse and collaborative representation based kernel pairwise linear regression for image set classification, Expert Systems with Applications 140 (2020) 112886.
[15] R. Wang, H. Guo, L. S. Davis, Q. Dai, Covariance discriminative learning: A natural and efficient approach to image set classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2496-2503.
[16] X. Gao, Q. Sun, H. Xu, D. Wei, J. Gao, Multi-model fusion metric learning for image set classification, Knowledge-Based Systems 164 (2019) 253-264.
[17] Z. Huang, R. Wang, S. Shan, X. Chen, Projection metric learning on Grassmann manifold with application to video based face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 140-149.
[18] Z. Huang, R. Wang, S. Shan, X. Chen, Learning Euclidean-to-Riemannian metric for point-to-set classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1677-1684.
[19] J. Wright, A. Ganesh, Z. Zhou, A. Wagner, Y. Ma, Robust face recognition via sparse representation, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2009, pp. 1-2.
[20] S. Gao, W. H. Tsang, L. T. Chia, P. Zhao, Local features are not lonely - Laplacian sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3555-3561.
[21] M. Harandi, C. Sanderson, R. Hartley, B. C. Lovell, Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach, in: European Conference on Computer Vision (ECCV), 2012, pp. 216-229.
[22] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: Which helps face recognition?, in: IEEE International Conference on Computer Vision (ICCV), 2011, pp. 471-478.
[23] Z. Zhao, S. Xu, D. Liu, W. Tian, Z. Jiang, A review of image set classification, Neurocomputing 335 (2019) 251-260.
[24] Y. Hu, A. S. Mian, R. Owens, Sparse approximated nearest points for image set classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 121-128.
[25] Z. Ren, B. Wu, X. Zhang, Q. Sun, Image set classification using candidate sets selection and improved reverse training, Neurocomputing 341 (2019) 60-69.
[26] Z. Chen, B. Jiang, J. Tang, B. Luo, Image set representation and classification with attributed covariate-relation graph model and graph sparse representation classification, Neurocomputing 226 (2017) 262-268.
[27] M. Hayat, M. Bennamoun, A. A. El-Sallam, An RGB-D based image set classification for robust face recognition from Kinect data, Neurocomputing 171 (2016) 889-900.
[28] M. Yang, P. Zhu, L. Van Gool, L. Zhang, Face recognition based on regularized nearest points between image sets, in: 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1-7.
[29] Y.-C. Chen, V. M. Patel, S. Shekhar, R. Chellappa, P. J. Phillips, Video-based face recognition via joint sparse representation, in: 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1-8.
[30] Z. Cui, H. Chang, S. Shan, B. Ma, X. Chen, Joint sparse representation for video-based face recognition, Neurocomputing 135 (2014) 306-312.
[31] E. G. Ortiz, A. Wright, M. Shah, Face recognition in movie trailers via mean sequence sparse representation-based classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3531-3538.
[32] P. Zhu, W. Zuo, L. Zhang, C. K. Shiu, D. Zhang, Image set-based collaborative representation for face recognition, IEEE Transactions on Information Forensics and Security 9 (7) (2014) 1120-1132.
[33] R. Wang, S. Shan, X. Chen, J. Chen, W. Gao, Maximal linear embedding for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (9) (2011) 1776-1792.
[34] K.-C. Lee, J. Ho, M.-H. Yang, D. Kriegman, Video-based face recognition using probabilistic appearance manifolds, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2003, pp. I-313.
[35] A. Hadid, M. Pietikainen, From still image to video-based face recognition: an experimental analysis, in: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004, pp. 813-818.
[36] W. Fan, D.-Y. Yeung, Locally linear models on face appearance manifolds with application to dual-subspace based classification, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, 2006, pp. 1384-1390.
[37] X. Guo, Exclusivity regularized machine, arXiv preprint arXiv:1603.08318.
[38] U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395-416.
[39] Y. Peng, A. Ganesh, J. Wright, W. Xu, Y. Ma, RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11) (2012) 2233-2246.
[40] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 171-184.
[41] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[42] A. S. Georghiades, P. N. Belhumeur, D. J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2002) 643-660.
[43] M. Kim, S. Kumar, V. Pavlovic, H. Rowley, Face tracking and recognition with visual constraints in real-world videos, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-8.
[44] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, 2003, pp. II-409.
[45] T.-K. Kim, R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (8) (2009) 1415-1428.

580

Dong Wei received the B.S. degree from the Nanjing University of Science and Technology, Nanjing, China, in 2017. He is currently a doctoral student at Nanjing University of Science and Technology. His research interests include pattern recognition, data mining, and image set classification.

Xiaobo Shen received his B.Sc. and Ph.D. degrees from the School of Computer Science and Engineering, Nanjing University of Science and Technology, in 2011 and 2017, respectively. He is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He has authored over 30 technical papers in prominent journals and conferences, such as IEEE TNNLS, IEEE TIP, IEEE TCYB, NIPS, ACM MM, AAAI, and IJCAI. His primary research interests are multi-view learning, multi-label learning, network embedding, and hashing.

Quansen Sun received the Ph.D. degree in pattern recognition and intelligent systems from the Nanjing University of Science and Technology (NJUST), Nanjing, China, in 2006. He is a Professor with the Department of Computer Science, NJUST. He was with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, in 2004 and 2005. He has published more than 100 scientific papers. His current research interests include pattern recognition, image processing, remote sensing information systems, and image set classification.

Xizhan Gao received the B.S. and M.S. degrees from Liaocheng University, Liaocheng, China, in 2011 and 2015, respectively. He is currently a doctoral student at Nanjing University of Science and Technology. His research interests include pattern recognition, data mining, and image set classification.

Wenzhu Yan received the B.S. and M.S. degrees from Southwest University of Science and Technology in 2013 and 2016, respectively. He is currently a doctoral student at Nanjing University of Science and Technology. His current research interests include pattern recognition and image processing.
