A set-level joint sparse representation for image set classification



Information Sciences 448–449 (2018) 75–90

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

A set-level joint sparse representation for image set classification

Peng Zheng a, Zhong-Qiu Zhao a,∗, Jun Gao a, Xindong Wu b

a College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
b School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70504-3694, USA

Article info

Article history: Received 31 August 2017; Revised 25 February 2018; Accepted 26 February 2018; Available online 13 March 2018

Keywords: Sparse representation; Image set classification; Multiple features; Identification rate

Abstract

Traditional image set classification methods measure the similarities between different sets based on the characteristics extracted from each set. Most of these methods build their models on a single kind of visual feature or on a simple concatenation of several kinds of features. However, due to redundant or irrelevant information, the concatenated features are usually not discriminative, or they suffer from the curse of dimensionality. Meanwhile, if the sets are small or improper features are employed, the useful information will be limited and conflicting. In this paper, we propose a set-level joint sparse representation classification (SJSRC) model that combines multiple features to accomplish the image set classification task. In the SJSRC, atom-level and concept-level regularization terms are both imposed to obtain robust representations, and the images in the same concept are regarded as a whole and optimized jointly, which strengthens the intra-set similarities via a set-level regularization. In addition, we adopt two schemes, namely 'Anchor Graph' and 'Regularized Nearest Points (RNP)', to improve computational efficiency and identification rate. Experiments on several benchmark datasets show that our model obtains competitive recognition performance for image set classification. © 2018 Published by Elsevier Inc.

1. Introduction

Traditionally, object classification is performed on single images, so each query image is identified one by one [1]. As applications of large-capacity storage media have increased in recent years, multiple images of an object or a person can be easily collected from many real-life scenarios, such as personal albums, multi-view camera networks, and security and surveillance systems. Recognizing an object or a person by combining the information from multiple images is attracting more and more interest, and accordingly various approaches to image set classification have been developed in recent years [2–6]. A wide range of appearance variations, caused by background switches and illumination changes, can be effectively handled by image set classification approaches. These methods make classification decisions by measuring the similarities between the gallery sets and the query set. However, even with numerous images under different appearance variations, these methods are still confronted with the challenge of how to effectively integrate the information from all the available images into a reliable decision. To solve this problem, many techniques have been proposed to



Corresponding author. E-mail address: [email protected] (Z.-Q. Zhao).

https://doi.org/10.1016/j.ins.2018.02.062 0020-0255/© 2018 Published by Elsevier Inc.


Fig. 1. The sketches of atom-level and concept-level sparse representations for image set classification. The concept-level regularization strengthens the intra similarities, which suppresses noises and produces more robust representations.

find the optimal representations, which can accurately and efficiently measure the intra-set similarities between the images from the same image set and the inter-set differences between different concepts. For example, image sets are modeled as different subspaces in [2–4,7], affine/convex hulls are adopted to generate set-representative exemplars in [5,6], a mean image of each set is calculated for the set representation in [6,8,9], and image sets are mapped to various manifold spaces in [10,11]. Recently, some studies have found that structural primitives (e.g., edges and line segments) can be used to encode natural images [12], which is similar to the behavior of simple-cell receptive fields. Inspired by this finding, sparse representation was introduced to encode natural images; it produces low reconstruction errors and is robust to noise, occlusion and corruption. As a result, sparse representation has been successfully applied to face recognition [1], image classification [13], video processing [14], and image set classification [6,11]. For image set classification, training samples from each category, together with their affine hull models, are utilized to represent the concept, and the most relevant samples or subspaces are selected to measure the set-to-set distance by iterative optimization [6,11]. Ortiz et al. [9] proposed a model named Mean Sequence Sparse Representation Classifier (MSSRC), which reduces the optimization to a single minimization over the mean sample of each query set. These methods have some drawbacks. On the one hand, the strategy in [6] abandons the discriminative nature of sparse representation, and a more complex optimization procedure is introduced in [11] to mix the affine hull model and sparse representation together with alternating computations of several independent sparse codes. On the other hand, the mean image calculated in [9] cannot accurately represent a set of images with large variations.
When the size of each set is relatively small, it is difficult to recognize different objects based on the little information provided by the limited images, especially when these images are of low resolution. This problem becomes more serious if an improper feature is extracted and employed to measure the similarities. The multi-task joint sparse representation classification model (MTJSRC) [15] was proposed to solve this problem and was further extended to other computer vision tasks [16]. However, the atom-level (or sample-level) sparsity regularization term adopted in the MTJSRC handles the samples individually and overlooks the intra-set similarities within the same concept. Fig. 1 shows sketches of the atom-level and concept-level sparse representations for image set classification. In the atom-level sparse representation, different query images may seek out distinct training samples with the most similar appearances


from various concepts. On the contrary, imposing the sparsity on an image concept instead of on an isolated image makes the samples within each set more compact, which suppresses noise and produces more robust representations. This indicates that not only the atom-level regularization but also the concept-level regularization is necessary. So in this paper, we propose a set-level joint sparse representation classification (SJSRC) model for image set classification, in which both regularization terms are integrated to incorporate different kinds of complementary features. At the same time, we solve the SJSRC in a kernelized space and propose an improvement that reduces its time and space consumption. The main contributions of this paper are as follows:

1) We propose the SJSRC model for image set classification by adopting a set-level regularization, which optimizes each individual concept jointly so as to explore the complementary information between different feature channels and improve identification performance.

2) We formulate the image set classification problem as multi-task joint sparse coding with mixed regularization terms, which integrates atom-level and concept-level representations to incorporate different kinds of complementary features and to accomplish small-data identification tasks.

3) We adopt two methods, namely 'Anchor Graph' [17] and 'Regularized Nearest Points (RNP)' [18], to reduce the size of the whole dataset and to filter out apparently irrelevant classes, which thereby improves efficiency and accuracy.

The rest of this paper is organized as follows. Some related works are reviewed in Section 2. In Section 3, the original MTJSRC model is described in detail. Our proposed SJSRC model is presented in Section 4, and the improvement to the SJSRC in Section 5. Experimental results and analyses on several benchmark datasets are provided in Section 6. Finally, concluding remarks are given in Section 7.

2. Related works

According to representation type, existing image set classification methods fall into two categories: parametric models and non-parametric models. The parametric models usually formulate each image set with a certain statistical distribution, such as a single Gaussian or a Gaussian mixture model (GMM), and measure the similarity between two image sets in terms of KL-divergence. Their performance mainly depends on the statistical relationship between the training and query sets. The non-parametric models represent an image set either with some representative exemplars or on a geometric surface. When a whole image set is considered as a point on a geometric surface [3,4,10], the set is usually represented by a subspace, a mixture of several subspaces or a complex non-linear manifold. To cooperate with these representations, different distance metrics have been designed, such as principal angles for linear subspaces and other evaluation terms for manifolds [19–21]. To extract discriminative features between image sets on a manifold surface, different discriminant analysis methods for different set representations have been proposed, such as Discriminative Canonical Correlations (DCC) [2], Manifold Discriminant Analysis (MDA) [4] and Covariance Discriminative Learning (CDL) [10]. These non-parametric models have compact representations and efficient computation. However, the representations may be sub-optimal due to some loss of latent information. For example, if an image set is represented by a subspace, only the information related to the selected dimensions of the subspace is retained [22]. In the meantime, a number of methods represent each image set with its representative exemplars, such as the set mean [3] and adaptively learned set samples [5,6]. Cevikalp et al. [5] used affine or convex hull models to represent each image set and to learn set samples from them.
However, in visual appearance, the learned representative set samples may be quite different from the other images in the same set. Hu et al. [6] combined the mean image and the affine hull model to define the Sparse Approximated Nearest Points (SANPs); its shortcomings lie in a relatively complex optimization and many unknown parameters. Ortiz et al. [9] proposed the MSSRC model to represent each face track with its mean image; when the appearance variations within categories are large, its performance is unsatisfactory. To sum up, the performance of these methods depends on a comprehensive description of the characteristics of an image set that contains adequate image samples with robust and discriminative features. However, due to the lack of compact descriptions, the discriminative capability may degrade with improper or conflicting features, especially when there is a relatively small number of samples in each set. With the successful application of sparse representation in many domains, such as image reranking [23], person re-identification [24], human pose recovery [25,26] and image classification [27,28], different modalities [29] or information sources [30,31] of images can be integrated to extract additional information for visual classification. Among these methods, the MTJSRC model [15] is a representative one, which combines different tasks via joint covariate selection by Lasso [32]. However, the $\ell_{1,2}$ regularization term in the MTJSRC ignores the intra-set similarities within the same class. Besides, it is time-consuming to calculate the sparse codes of all samples and predict the label of each query set via majority voting. Instead, in this paper, we treat the reconstruction coefficients for each query set as a whole and take into account the intra-set similarities within each category, which we call the set-level regularization.
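To make this contrast concrete, the following sketch (illustrative code, not the authors' implementation) compares the atom-level ℓ1 penalty, which shrinks every coefficient entry independently, with a set-level Frobenius penalty on a class's whole coefficient block, which ties all query images of a set together:

```python
import numpy as np

rng = np.random.default_rng(0)
W_j = rng.standard_normal((5, 4))  # class-j coefficients: 5 atoms x 4 query images

# Atom-level penalty: each entry is penalized on its own, so different query
# images may activate unrelated training samples.
atom_level = np.abs(W_j).sum()

# Set-level penalty: the class block is penalized as a whole (a group-lasso
# effect), so the images of the query set share the same few active classes.
set_level = np.linalg.norm(W_j, 'fro')

print(atom_level, set_level)
```

Since the ℓ1 norm of a block always dominates its Frobenius norm, mixing the two penalties lets a model trade entrywise sparsity against blockwise class selection.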
A similarity matrix can be calculated [33] to obtain the correlations between different samples, and it is usually combined with other techniques, such as manifold learning [34], tensor learning [35] and kernel embedding [36], to accomplish pattern recognition tasks and improve identification performance. In this paper, we use a Gaussian kernel trick to map our SJSRC to a non-linear space and adopt an anchor graph strategy to simplify the computation. In the anchor graph regularization


approach [17], the graph, constructed from a small set of anchors, approximates the neighborhood structure of the original space and thereby reduces computational and storage costs. Along with the anchor graph approach, the 'RNP' scheme [18] is adopted to find the k nearest neighbors of each query set, which serve as clues to remove most of the irrelevant concepts. This scheme is expected to make the sparse codes more concentrated and to reduce the negative effects of the anchor graph.

3. Multi-task joint sparse representation classification

Suppose that there are $c$ classes and each image has $n$ different modalities of features (e.g., colour, texture and shape). For the $l$th modality (task) ($l = 1, 2, \dots, n$), $X^l = [X_1^l, \dots, X_c^l]$ represents the training feature matrix, where $X_j^l \in \mathbb{R}^{d_l \times m_j}$ is associated with the $j$th class, $d_l$ denotes the dimensionality of the $l$th modality and $m_j$ represents the number of training images in class $j$, so that there are $\sum_{j=1}^{c} m_j = m$ training samples in total. A test image $y^l \in \mathbb{R}^{d_l \times 1}$, which denotes the $l$th modality of the feature, can be represented by a linear combination of all training images as follows.

$$y^l = \sum_{j=1}^{c} X_j^l w_j^l + \varepsilon^l, \quad l = 1, \dots, n \tag{1}$$

where $w_j^l \in \mathbb{R}^{m_j \times 1}$ is the vector of reconstruction coefficients for the $j$th class and $\varepsilon^l \in \mathbb{R}^{d_l \times 1}$ is the residual error.
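As a small numerical illustration of Eq. (1), one modality of a test feature can be assembled from class-wise training blocks plus a residual (synthetic shapes and names, chosen only for this example):

```python
import numpy as np

rng = np.random.default_rng(1)
d, sizes = 8, [3, 2, 4]  # feature dimension d_l, class sizes m_j (c = 3 classes)
X = [rng.standard_normal((d, m)) for m in sizes]  # training blocks X_j^l
w = [rng.standard_normal((m, 1)) for m in sizes]  # coefficients w_j^l
eps = 0.01 * rng.standard_normal((d, 1))          # residual eps^l

# y^l = sum_j X_j^l w_j^l + eps^l  (Eq. (1), a single modality l)
y = sum(Xj @ wj for Xj, wj in zip(X, w)) + eps
print(y.shape)
```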

Suppose that $w^l = [(w_1^l)^T, \dots, (w_c^l)^T]^T$ represents the reconstruction coefficients of the $l$th task and $w_j = [(w_j^1)^T, \dots, (w_j^n)^T]^T$ denotes the representation coefficients related to the $j$th class across all $n$ tasks. The multi-task joint sparse representation is then formulated as the solution to the following multi-task least-squares regression problem with $\ell_{1,2}$ mixed-norm regularization:

$$\min_{W} \sum_{l=1}^{n} \frac{1}{2} \left\| y^l - \sum_{j=1}^{c} X_j^l w_j^l \right\|_2^2 + \lambda \sum_{j=1}^{c} \| w_j \|_2 \tag{2}$$

where $W = [w_j^l]_{l,j}$, and the regularization term $\sum_{j=1}^{c} \| w_j \|_2$ forces the test features to be sparsely reconstructed by the most similar and representative classes in the training data. Assuming that the test image can be approximated by training images from a certain class, the label of the class with the lowest reconstruction error accumulated over all $n$ tasks is assigned to the test image as follows:

$$j^* = \arg\min_{j} \sum_{l=1}^{n} \theta^l \left\| y^l - X_j^l \hat{w}_j^l \right\|_2^2 \tag{3}$$

where $\{\theta^l\}_{l=1}^{n}$ ($\sum_l \theta^l = 1$) denotes the weights of the different tasks in the final decision; $\theta^l$ can be set equally as $\theta^l = 1/n$ or optimized on a validation set in an LPBoost manner [37], and $\hat{w}_j^l$ is the approximated sparse code for the $j$th concept in the $l$th feature channel. This problem is referred to as multi-task joint covariate selection in Lasso-related research [32].

4. Set-level joint sparse representation for image set classification

For a collection of training data with $c$ classes, $c$ image sets $(X_1, X_2, \dots, X_c)$ are provided (the images belonging to the same class form an image set). $X_i = \{x_{i,k} \mid t_{i,k} = i, \ k = 1, 2, \dots, m_i\}$, where $x_{i,k}$ denotes the $k$th sample in class $i$, $t_{i,k}$ represents the corresponding label and $m_i$ denotes the number of images in class $i$. The image set classification task is to assign the correct class label $T_Q$ to an incoming query image set $Q = [q_1, q_2, \dots, q_j, \dots, q_{N_q}] \in \mathbb{R}^{d \times N_q}$, where $q_j$ denotes the $j$th query image in the query set, $d$ denotes the dimensionality of the feature and $N_q$ represents the number of images in the query set.

4.1. The set-level joint sparse representation

Generally, we can calculate sparse reconstruction coefficients for each image of $Q$ by Eq. (2) and use Eq. (3) to assign a specific class to it according to the class-specific reconstruction residuals; the class label of the query set can then be decided via majority voting over the class labels of all images in $Q$. In the MTJSRC, the query images are processed individually and the mutual correlations within each set are ignored. Therefore, we convert this individual optimization into a multiple-instance optimization and thereby propose an image set classification model that performs recognition based on multiple features and multiple instances. At the set level, Eq. (2) can be modified as follows.

$$\min_{W} \sum_{l=1}^{n} \frac{1}{2} \left\| Y^l - \sum_{j=1}^{c} X_j^l W_j^l \right\|_F^2 + \lambda_1 \sum_{j=1}^{c} H(W_j) + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{4}$$

$$W_j = [(W_j^1)^T, \dots, (W_j^n)^T]^T \in \mathbb{R}^{(n \cdot m_j) \times N_q}$$


where $\|\cdot\|_F$ denotes the Frobenius norm, which summarises the reconstruction errors in each feature channel, $\lambda_1$ and $\lambda_2$ are two predefined balance parameters, $Y^l \in \mathbb{R}^{d_l \times N_q}$ denotes the feature matrix of the query set in the $l$th feature channel, $W_j^l$ represents the reconstruction coefficients for the $j$th concept in the $l$th feature channel, and $H(W_j)$ and $P(W_j)$ are two regularization terms that concentrate the non-zero coefficients on the sets from the most relevant concept. On the one hand, $H(W_j) = \|W_j\|_1 = \sum_{k=1}^{N_q} \|W_{j,k}\|_1$, which sums all the absolute values in the class-specific reconstruction matrix, is employed as the atom-level regularization to search for visually similar samples. On the other hand, $P(W_j) = \|W_j\|_F = \sqrt{\sum_{k=1}^{N_q} \|W_{j,k}\|_2^2}$, which accumulates the norms of the columns $W_{j,k}$ of $W_j$ over all $k$, is employed to measure the similarities between the query set and the training sets at the concept level. Similar to the MTJSRC, the sparse codes from the various feature channels are treated equally and solved by joint optimization on each concept to make them more cooperative. In our SJSRC, the samples in the query set are considered as a whole and the regularization is imposed on the whole query set, so the similarities between the samples in the same set are taken into consideration when solving the sparse representation. Meanwhile, the cooperation between the atom-level and concept-level representations can explore and strengthen the associations with the most relevant concept, and thereby yields a more compact and robust representation than the MTJSRC.

4.2. Optimization and classification

To solve the optimization problem in Eq. (4), the Accelerated Proximal Gradient (APG) method [38,39] is adopted.

4.2.1. APG algorithm

In each iteration $s$ of the APG method, the weight matrix $\hat{W}^s = [W_j^{l,s}]_{s \ge 1}$ and an aggregation matrix $\hat{V}^s = [V_j^{l,s}]_{s \ge 1}$ are updated alternately in two steps. One step updates $\hat{W}^{s+1}$ from the current aggregation matrix $\hat{V}^s$ by a gradient mapping, and the other updates $\hat{V}^{s+1}$ by combining $\hat{W}^{s+1}$ and $\hat{W}^s$ linearly. The details are as follows.

Gradient mapping step. Following [39], Eq. (4) can be recast in the generalized gradient form below,

$$\Phi(W, \hat{V}^s) = \sum_{l=1}^{n} \left[ f(\hat{V}^{l,s}) + \left\langle W^l - \hat{V}^{l,s}, \nabla f(\hat{V}^{l,s}) \right\rangle + \frac{1}{2\eta} \left\| W^l - \hat{V}^{l,s} \right\|_F^2 \right] + \lambda_1 \sum_{j=1}^{c} H(W_j) + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{5}$$

$$f(\hat{V}^{l,s}) = \frac{1}{2} \left\| Y^l - \sum_{j=1}^{c} X_j^l \hat{V}_j^{l,s} \right\|_F^2$$

where $\langle \cdot, \cdot \rangle$ denotes the matrix inner product and $W$ is updated iteratively through the optimization on $\hat{V}^s$. Eq. (5) can be further simplified as below,

$$\Phi(W, \hat{V}^s) = \sum_{l=1}^{n} \left[ \frac{1}{2\eta} \left\| W^l - B^l \right\|_F^2 + C^l \right] + \lambda_1 \sum_{j=1}^{c} H(W_j) + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{6}$$

$$B^l = \hat{V}^{l,s} - \eta \nabla f(\hat{V}^{l,s}), \qquad C^l = f(\hat{V}^{l,s}) - \frac{\eta}{2} \left\| \nabla f(\hat{V}^{l,s}) \right\|_F^2, \qquad \nabla f(\hat{V}^{l,s}) = -(X^l)^T Y^l + (X^l)^T X^l \hat{V}^{l,s}$$

where $\eta$ is the step size. Given a fixed $\hat{V}^s$, $B^l$ and $C^l$ can be calculated via Eq. (6). As $C^l$ is a constant, Eq. (6) can be converted into the following minimization problem,

$$\Phi(W, \hat{V}^s) = \sum_{l=1}^{n} \frac{1}{2\eta} \left\| W^l - B^l \right\|_F^2 + \lambda_1 \sum_{j=1}^{c} H(W_j) + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{7}$$

Considering that $H(W_j) = \|W_j\|_1$ is adopted as the regularization, we can follow a derivation similar to that in [32]. Thereby, Eq. (7) is equivalent to the following optimization problem,

$$\Phi(W, \hat{V}^s) = \sum_{l=1}^{n} \frac{1}{2\eta} \left\| \hat{W}^{l,s+1} - D^l \right\|_F^2 + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{8}$$

$$D^l = \operatorname{sgn}(B^l) \odot \max(|B^l| - \lambda_1 \eta, 0)$$

To solve this problem, the approach in [40] is adopted to update the weight matrix as below,



$$\hat{W}_j^{s+1} = \left[ 1 - \frac{\lambda_2 \eta}{P(D_j)} \right]_+ D_j, \quad j = 1, \dots, c \tag{9}$$

where $[\cdot]_+ = \max(\cdot, 0)$ and $D_j$ is the coefficient matrix associated with the $j$th concept.
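The two shrinkage steps in Eqs. (8) and (9) amount to an entrywise soft-threshold followed by a per-class group shrinkage. A minimal sketch (our naming; the atom-level threshold is taken here as λ1·η, matching the λ2·η scaling that appears in Eq. (9)):

```python
import numpy as np

def prox_update(B, lam1, lam2, eta):
    """Soft-threshold entrywise (Eq. (8)), then shrink the class block (Eq. (9))."""
    D = np.sign(B) * np.maximum(np.abs(B) - lam1 * eta, 0.0)  # atom-level step
    norm = np.linalg.norm(D, 'fro')                           # P(D_j) for this block
    scale = max(1.0 - lam2 * eta / norm, 0.0) if norm > 0 else 0.0
    return scale * D                                          # concept-level step

B = np.array([[0.5, -0.05],
              [-1.2, 0.8]])
W = prox_update(B, lam1=1.0, lam2=0.5, eta=0.1)
print(W)  # the small entry -0.05 is zeroed; the rest is uniformly shrunk
```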


Aggregation step. $\hat{W}^{s+1}$ and $\hat{W}^s$ are combined linearly to update $\hat{V}^{s+1}$ as below [38],

$$\alpha_{s+1} = \frac{2}{s+3}; \qquad \delta_{s+1} = \hat{W}^{s+1} - \hat{W}^s; \qquad \hat{V}^{s+1} = \hat{W}^{s+1} + \frac{1 - \alpha_s}{\alpha_s} \, \alpha_{s+1} \, \delta_{s+1} \tag{10}$$

ˆ s+1 and W ˆ s . As the where {α s }s ≥ 1 is an adjustment parameter, which controls the weights of the difference δs+1 between W 2 optimization can be processed with a convergent rate of O(1/t ) (t is the number of iterations) and it’s not necessary to obtain the most accurate sparse codes, a promising result is achieved after 3–5 iterations. 4.2.2. Classification The reconstruction residuals of different feature channels are calculated separately and then accumulated to make the final decision.

$$T_Q = \arg\min_{j} \sum_{l=1}^{n} \theta^l \left\| Y^l - X_j^l \hat{W}_j^l \right\|_F^2 \tag{11}$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $\{\theta^l\}_{l=1}^{n}$ denote the confidences of the different feature channels in the final decision, which can be decided via cross-validation.

4.3. Kernelized extensions

To handle non-linear classification, we also extend the SJSRC into a Reproducing Kernel Hilbert Space (RKHS) and work with inner products of features. Assuming that a non-linear function $\phi^l$ is taken for each feature channel $l$ to map the training and query sets to a higher-dimensional RKHS, where $\phi^l(x_i)^T \phi^l(x_j) = g^l(x_i, x_j)$ and $g^l(\cdot, \cdot)$ is a kernel function, Eq. (4) can be rewritten as:

$$\min_{W} \sum_{l=1}^{n} \frac{1}{2} \left\| \phi^l(Y^l) - \sum_{j=1}^{c} \phi^l(X_j^l) W_j^l \right\|_F^2 + \lambda_1 \sum_{j=1}^{c} H(W_j) + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{12}$$

where $\phi^l(X_j^l) = [\phi^l(X_{j,1}^l), \dots, \phi^l(X_{j,m_j}^l)]$. For the $l$th feature, letting $\phi^l(X^l) = [\phi^l(X_1^l), \dots, \phi^l(X_c^l)]$, the training kernel matrix can be represented by $G^l = \phi^l(X^l)^T \phi^l(X^l)$ and the query kernel matrix by $Q^l = \phi^l(X^l)^T \phi^l(Y^l)$. Correspondingly, the optimization functions in Eqs. (5) and (6) should be rewritten as below,

$$f(\hat{V}^{l,s}) = \frac{1}{2} \left\| \phi^l(Y^l) - \sum_{j=1}^{c} \phi^l(X_j^l) \hat{V}_j^{l,s} \right\|_F^2, \qquad \nabla f(\hat{V}^{l,s}) = -Q^l + G^l \hat{V}^{l,s} \tag{13}$$

where $Q^l$ and $G^l$ are the kernelized feature matrices for the query and training sets, respectively. Similar to Eq. (11), the classification decision is made based on the accumulated reconstruction errors as follows.

$$T_Q = \arg\min_{j} \sum_{l=1}^{n} \theta^l \left\| \phi^l(Y^l) - \phi^l(X_j^l) \hat{W}_j^l \right\|_F^2 = \arg\min_{j} \sum_{l=1}^{n} \sum_{k=1}^{N_q} \theta^l \left( -2 (Q_{j,k}^l)^T \hat{W}_{j,k}^l + (\hat{W}_{j,k}^l)^T G_j^l \hat{W}_{j,k}^l \right) \tag{14}$$

where $Q_{j,k}^l$ indicates the $k$th column of $Q_j^l$, $Q_j^l$ contains the coefficients of $Q^l$ associated with the $j$th concept, $\hat{W}_{j,k}^l$ represents the reconstruction coefficients of the $k$th sample of the query set with respect to the $j$th class, and $G_j^l = \phi^l(X_j^l)^T \phi^l(X_j^l)$ is the diagonal block of $G^l$ associated with the $j$th class.

4.4. Optimizing the weights of different feature channels on a validation set

We optimize the weights of the different feature channels on a validation set. Provided with a validation set $Y_{val} = \{(Y_i^1, \dots, Y_i^n), t_i\}_{i=1}^{c}$ with class labels $t_i \in \{1, \dots, c\}$, we can obtain the corresponding reconstruction error $e_{i,j}^l$ for each sample in the validation set across the different concepts, where $e_{i,j}^l = \| \phi^l(Y_i^l) - \phi^l(X_j^l) \hat{W}_{j,i}^l \|_F^2$, $i = 1, \dots, c$, $j = 1, \dots, c$, $l = 1, \dots, n$, and $\hat{W}_{j,i}^l$ denotes the sparse coefficients related to the $j$th subject for the $i$th query set $Y_i^l$ in the $l$th feature channel. Then, according to Eq. (14), the inequality $\sum_{l=1}^{n} \theta^l e_{i,t_i}^l \le \sum_{l=1}^{n} \theta^l e_{i,j}^l$ should be satisfied by a certain margin $\rho$ for all $j \ne t_i$, where the $\theta^l$ are the weights of the different feature channels.
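Given precomputed kernel blocks, the class-specific residual used above can be evaluated without explicit feature maps via the expansion in Eq. (14). The sketch below uses random stand-ins for the kernel blocks (names are ours; the constant ‖φ(y)‖² term is dropped, as it does not affect the arg min over classes):

```python
import numpy as np

rng = np.random.default_rng(2)
m_j, N_q = 6, 3
A = rng.standard_normal((10, m_j))
G_j = A.T @ A                          # G_j^l = phi(X_j)^T phi(X_j), a PSD block
Q_j = rng.standard_normal((m_j, N_q))  # Q_j^l = phi(X_j)^T phi(Y), query block
W_j = rng.standard_normal((m_j, N_q))  # sparse codes for class j

# Column-wise expansion from Eq. (14):
err = sum(-2.0 * Q_j[:, k] @ W_j[:, k] + W_j[:, k] @ G_j @ W_j[:, k]
          for k in range(N_q))

# Equivalent matrix form via the trace:
err_mat = np.trace(-2.0 * Q_j.T @ W_j + W_j.T @ G_j @ W_j)
print(err, err_mat)
```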


Fig. 2. The flowchart of the set-level joint sparse representation classification model for image set classification.

Following the formulation of the Support Vector Machine (SVM), a slack variable $\xi_i$ can be introduced for hard samples, and the following linear program can be solved to estimate the optimal weights:

$$\begin{aligned} \min_{\rho, \xi, \theta} \quad & -\rho + \frac{1}{c} \sum_{i=1}^{c} \xi_i \\ \text{s.t.} \quad & \sum_{l=1}^{n} \theta^l e_{i,t_i}^l - \xi_i \le \sum_{l=1}^{n} \theta^l e_{i,j}^l - \rho, \quad \forall i = 1, \dots, c, \ j = 1, \dots, c, \ j \ne t_i \\ & \sum_{l=1}^{n} \theta^l = 1, \quad \rho \ge 0, \quad \xi_i \ge 0, \quad \theta^l \ge 0, \quad \forall i, l \end{aligned} \tag{15}$$
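As a sanity check, the program in Eq. (15) can be handed to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog on synthetic reconstruction errors in which the first feature channel is deliberately made discriminative (all data and variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

n, c = 2, 3                                # feature channels, classes
rng = np.random.default_rng(3)
e = rng.uniform(1.0, 2.0, size=(n, c, c))  # e[l, i, j]: error of set i vs class j
for i in range(c):
    e[0, i, i] = 0.1                       # channel 0 separates the true class well

# Variable vector: [theta_1..theta_n, rho, xi_1..xi_c]
obj = np.concatenate([np.zeros(n), [-1.0], np.full(c, 1.0 / c)])
A_ub, b_ub = [], []
for i in range(c):
    for j in range(c):
        if j == i:
            continue
        row = np.zeros(n + 1 + c)
        row[:n] = e[:, i, i] - e[:, i, j]  # sum_l theta^l (e_{i,t_i} - e_{i,j})
        row[n] = 1.0                       # + rho
        row[n + 1 + i] = -1.0              # - xi_i
        A_ub.append(row)
        b_ub.append(0.0)
A_eq = [np.concatenate([np.ones(n), np.zeros(1 + c)])]  # sum_l theta^l = 1
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (n + 1 + c))
theta = res.x[:n]
print(theta)
```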

This optimization problem can be efficiently solved by standard linear programming solvers. It should also be noticed that when $n = 1$, i.e., only one kind of feature (classifier) is provided, the model degrades to optimizing each single feature channel individually. We will show in Section 6.3.3 that directly combining the decisions of these individual classifiers is not optimal, since it cannot make full use of the correlations between different feature channels.

4.5. The flowchart of the SJSRC

The flowchart of the proposed kernelized SJSRC model for image set classification is illustrated in Fig. 2. The model can be divided into the following stages.

(1) Feature extraction. When a query set arrives, $n$ kinds of features ($Y^l$, $l = 1, \dots, n$) are extracted first.
(2) Construction of similarity matrices. The kernelized similarity matrices $Q^l$ are then constructed with an RBF kernel function to map the query set to the high-dimensional RKHS.
(3) Calculation of sparse codes. With the $Q^l$ and the pre-computed training similarity matrices as input, the sparse coefficients $W^l$ are calculated via the joint optimization in Eq. (12).
(4) Classification. Finally, the classification decision is made by calculating the class-specific reconstruction residuals over all feature channels via Eq. (14).

5. Computational complexity analysis and two tips for the kernelization

The first stage of our SJSRC, viz. feature extraction, is usually performed offline. The second stage, viz. construction of similarity matrices, is divided into two parts: one for the training sets (offline) and the other for the query sets (online). The last two stages are completely online and take test time. Two factors, namely the set size and the number of concepts, hinder the application of the SJSRC to real datasets.
On the one hand, although the similarity matrices of the training sets are constructed offline, the storage spent on these matrices grows rapidly with the size of the training set (a 60k × 60k matrix will take about 24 GB of memory). On the other hand, when the set size and the number of concepts are large, calculating sparse codes and making decisions become very time-consuming. In addition, too many irrelevant concepts will confuse the decision and produce disruptive results, especially when few samples are provided in some sets.
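A back-of-envelope check of that storage cost, assuming dense double-precision entries (the precise figure depends on storage precision and unit conventions):

```python
n = 60_000               # number of training samples
bytes_total = n * n * 8  # one dense n x n float64 kernel matrix
gib = bytes_total / 2**30
print(round(gib, 1))     # on the order of tens of gigabytes
```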


Considering that the samples in the same set have small visual variations, one reasonable and feasible strategy to improve the SJSRC is to select the most informative samples while abandoning the rest, reducing the size of each set. Another strategy to reduce computational complexity is to filter out some irrelevant classes from a different viewpoint, which may force the non-zero coefficients to concentrate on the relevant ones.

5.1. Anchor graph to reduce set size

An anchor graph is a subset graph extracted from a whole graph with anchor points, and it has achieved great success in many applications [17,41]. To improve efficiency, we construct an 'Anchor Graph' by extracting the commonness, namely similar patterns, local structures and common textures, within each set and picking out the most representative samples. The stages to construct this subset graph are as follows.

i) Divide each set into several subsets named maximal linear patches (MLPs) using the Hierarchical Divisive Clustering (HDC) of [4]. The HDC formulates each set as a manifold and measures the nonlinearity degrees of different local patches by the deviation between Euclidean distances and geodesic distances. The HDC scheme ensures that the extracted MLPs are balanced and that the samples in the same local patch share similar visual characteristics.
ii) Divide the samples of each MLP into several smaller subsets with k-means to reduce the set size further.
iii) Calculate and normalize the mean image of each subset to represent the whole subset. These mean images serve as the anchor points.
iv) Use these anchor points to construct an anchor graph and calculate the similarity matrices accordingly.

Thereby, the set size is greatly reduced while high recognition rates are retained, since the necessary information is preserved. For the training sets, the stages to construct the anchor graph are conducted offline.
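When the full HDC machinery is unavailable, stages i)–iv) can be approximated with a plain clustering pass: the sketch below stands in for MLP extraction with k-means and uses the normalized subset means as anchor points (our simplification, not the authors' HDC):

```python
import numpy as np

def build_anchors(image_set, n_anchors, n_iter=20, seed=0):
    """Cluster a set's samples and return L2-normalized cluster means as anchors
    (a k-means stand-in for stages ii)-iii); not the authors' HDC)."""
    rng = np.random.default_rng(seed)
    centers = image_set[rng.choice(len(image_set), n_anchors, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest center
        dists = ((image_set[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # recompute centers as subset means (stage iii)
        for k in range(n_anchors):
            if np.any(labels == k):
                centers[k] = image_set[labels == k].mean(0)
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)

X = np.random.default_rng(4).standard_normal((40, 16))  # one image set, 40 frames
anchors = build_anchors(X, n_anchors=5)
print(anchors.shape)
```

The anchors then replace the raw samples when the (much smaller) similarity matrices are built.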
For the query sets, only the construction of the reduced similarity matrices is required, which can be processed efficiently online.

5.2. RNP to filter out apparently irrelevant sets

In order to filter out apparently irrelevant image sets, the Regularized Nearest Points (RNP) method of [18] is employed to obtain the k nearest neighbors (KNN) of each query set in the different feature channels. The RNP regards each image set as a regularized affine hull (RAH) and uses the regularized nearest points from both the training sets and the query set to calculate between-set distances. The optimization function is defined as follows,

$$\min_{\alpha, \beta} \ \left\| X_i \alpha - Y \beta \right\|_2^2 + \lambda_1 \|\alpha\|_2 + \lambda_2 \|\beta\|_2, \quad \text{s.t.} \ \sum_k \alpha_k = 1, \ \sum_k \beta_k = 1 \tag{16}$$

where $X_i$ represents the $i$th training set, $Y$ denotes the query set, $\alpha$ and $\beta$ are the reconstruction coefficients, and $\lambda_1$, $\lambda_2$ denote the weights of the regularization terms. Merged with the RNP, Eq. (12) is modified as below,

$$\min_{W} \sum_{l=1}^{n} \frac{1}{2} \left\| \phi^l(Y^l) - \sum_{j \in N(Y^l)} \phi^l(X_j^l) W_j^l \right\|_2^2 + \lambda_1 \sum_{j=1}^{c} H(W_j) + \lambda_2 \sum_{j=1}^{c} P(W_j) \tag{17}$$

where $N(Y^l)$ represents the $k$ nearest neighbors of the query set $Y$ in the $l$th feature channel. Since the optimization considers both the distances between the regularized nearest points and the structures of the different image sets, it can be solved by alternating optimization and ridge regression with low computational complexity. In theory, negative classes are helpful to the classification. However, due to the limitations of the extracted visual features, introducing similar concepts may confuse the classification decision; in fact, we aim to distinguish among these concepts through the cooperation of visual features and classifiers. If there are many concepts with large quantities of images, our scheme narrows the search scope to the several most relevant concepts and thereby reduces computational complexity to a certain degree. It should also be noticed that, with a proper choice of the number of relevant classes, both the correct class and some negative classes are retained.

6. Experimental results and discussions

We evaluate our SJSRC and compare it with existing state-of-the-art methods on several benchmark datasets, including the Honda/UCSD dataset, the CMU Mobo dataset, the Public Figures Face Database (PubFig), the YouTube Celebrities (YTC) dataset, the ETH-80 dataset and the Scenes 15 dataset. These datasets cover different sub-tasks of image set classification, including face identification, object categorization and scene classification. All experiments are conducted on a Core 4 Quad machine with MATLAB 2014a.


6.1. Experimental settings

6.1.1. Pre-processing of datasets
Different schemes are employed to automatically detect face frames at first. Then the face regions are cropped after successful detections and resized to different resolutions. For all datasets, the colored images are converted to gray scale. Finally, histogram equalization is conducted to minimize illumination variations. Details are depicted in Section 6.2. Three kinds of the most commonly used features are adopted, namely raw pixels, the concatenated Local Binary Pattern (LBP) histograms from non-overlapping uniformly spaced rectangular blocks [9], and Gabor wavelets.

6.1.2. Settings of the SJSRC
For all datasets, the dimensions of the features are reduced to no more than 100 by a dimensionality reduction scheme. For simplicity, kernel similarity matrices are computed as exp(−Euc²(x, x′)), where Euc² is the pairwise squared Euclidean distance between two samples from different sets. The number of iterations in APG is set to 5 in all experiments. The parameters of "HDC" and "RNP" are set the same as in [4] and [18]. The number of nearest neighbors k in Eq. (17) is set to 15 · n_tr (n_tr is the number of training sets). The weighting parameters θ_l in Eq. (14) and the balance parameters λ1, λ2 in Eq. (12) are optimized for the best performance on a validation set. Detailed properties of λ1 and λ2 are analyzed in Section 6.3.1. For concepts with too few samples, such as the first video sequence of "gwen_stefani" in the YTC dataset (fewer than 10 images in each set), the strategies in [42], including rotation, scaling, horizontal shearing, etc., are taken to produce some deformed images as validation data to tune the parameters.

6.1.3. Settings of compared methods
We compare our SJSRC with several state-of-the-art methods, including the Mutual Subspace Method (MSM) [7], Affine Hull-based Image Set Distance (AHISD) [5], Convex Hull-based Image Set Distance (CHISD) [5], Sparse Approximated Nearest Points (SANP) [6], Regularized Nearest Points (RNP) [18], Discriminant Canonical Correlation Analysis (DCC) [2], Manifold-to-Manifold Distance (MMD) [3], Manifold Discriminant Analysis (MDA) [4], Covariance Discriminant Learning (CDL) [10], Mean Sequence Sparse Representation Classification (MSSRC) [9] and Set to Set Distance Metric Learning (SSDML) [43]. For all methods, the parameters are tuned for the best performance. Specifically, for the MSM, PCA is applied to retain 90% of the total energy. The nonlinear version of the AHISD is adopted. For the CHISD, its linear version is taken and the error penalty term C is set as in [5]. For the SANP, the same weighting parameters as in [6] are taken to conduct the convex optimization. For the RNP [18], PCA is utilized to preserve 90% of the energy and the same weighting parameters as in [18] are adopted. For the DCC, a 100-dimensional embedding space is retained. Its dimension is further reduced to 10 with 90% of the energy preserved, and the set-to-set similarity is measured with the corresponding 10 maximum canonical correlations. As there is only one training set per class in the Honda/UCSD, Mobo, PubFig and Scenes 15 datasets, we randomly divide each training set into two subsets as in [2] to construct the within-class sets. The parameters of the MMD and MDA are set as in [3] and [4], respectively. To calculate the geodesic distance, the number of connected nearest neighbors is set to min(12, mms), where mms denotes the minimum set size. The ratios between the Euclidean distance and the geodesic distance are optimized on different datasets. For the MMD, the maximum canonical correlation is taken to compute the distance.
There are no parameter configurations required for the CDL, MSSRC and SSDML.

6.2. Results and analysis

6.2.1. Honda/UCSD dataset
The Honda/UCSD dataset consists of 59 video sequences of 20 different subjects. The number of frames varies from 12 to 645 in each video sequence. The face frames in the videos are automatically detected by the Viola and Jones face detection algorithm [44], and the cropped gray-scale images are resized to 20 × 20. Following the standard evaluation configuration in [45], each video is treated as an image set. From each subject, one video sequence is randomly selected for training (20 in total) while the remaining 39 are regarded as the query sets. The experiments are conducted 10 times with different random selections of the training and query image sets.

To evaluate the robustness of the various methods, the experiments are conducted with varying set sizes as in [18]. In other words, the total number of images in each set is bounded by an upper limit, and the following size pairs, namely Ntrain/Ntest = {All–All, 100–100, 50–50, 50–25, 50–15, 25–50, All–1}, are considered. The experimental results are presented in Table 1, which indicate that our SJSRC achieves the best average identification rates and outperforms the other state-of-the-art methods for all size pairs. Besides, it can be observed that the reduction of set size affects the performance of the various methods to different degrees. The methods of MSM, DCC, MMD and MDA, which utilize either a linear subspace or a combination of multiple linear subspaces to represent an image set, are more sensitive to the set size reduction. However, the affine or convex hull based methods, such as the AHISD, CHISD and SANP, are less affected by the set size reduction. Our SJSRC is the least sensitive to the set size reduction, and obtains satisfactory identification rates even when there are fewer than 50 images in each set. When there is only one image in each query set, our SJSRC can still achieve a promising result of 85.1%. It should be noticed that image set classification methods were originally proposed to take advantage of very low resolution images


Table 1
Identification rates (%) of various methods on the Honda/UCSD dataset.

Methods    | All–All     | 100–100     | 50–50       | 50–25       | 50–15       | 25–50       | All–1
MSM [7]    | 91.3 ± 1.9  | 85.6 ± 4.4  | 83.1 ± 1.7  | 82.3 ± 4.3  | 80.7 ± 5.0  | 79.2 ± 5.0  | 57.7 ± 12.3
DCC [2]    | 92.6 ± 2.3  | 89.3 ± 2.5  | 82.1 ± 3.3  | 80.8 ± 8.8  | 80.2 ± 3.4  | 78.4 ± 2.7  | 3.8 ± 3.0
MMD [3]    | 92.1 ± 2.3  | 85.6 ± 2.2  | 83.1 ± 4.5  | 82.4 ± 5.2  | 80.4 ± 3.6  | 81.2 ± 3.8  | –
MDA [4]    | 94.4 ± 3.4  | 91.8 ± 1.6  | 85.6 ± 5.8  | 85.0 ± 4.0  | 83.7 ± 5.8  | 84.7 ± 4.5  | –
AHISD [5]  | 88.6 ± 2.8  | 92.8 ± 2.2  | 94.9 ± 1.8  | 93.9 ± 2.9  | 93.9 ± 2.3  | 92.6 ± 2.3  | 76.7 ± 7.7
CHISD [5]  | 90.3 ± 1.2  | 92.3 ± 1.8  | 92.3 ± 0.0  | 93.3 ± 2.9  | 93.3 ± 2.9  | 91.5 ± 2.5  | 79.0 ± 6.4
SANP [6]   | 93.3 ± 2.9  | 94.9 ± 2.6  | 94.9 ± 2.6  | 94.3 ± 3.8  | 92.8 ± 2.2  | 91.3 ± 2.9  | 54.9 ± 11.4
CDL [10]   | 96.4 ± 2.9  | 95.4 ± 2.2  | 91.3 ± 3.4  | 85.1 ± 7.1  | 71.3 ± 8.6  | 80.0 ± 3.3  | 5.1 ± 0.0
MSSRC [9]  | 88.2 ± 3.9  | 89.7 ± 2.6  | 88.7 ± 2.3  | 90.8 ± 3.9  | 88.2 ± 3.9  | 89.2 ± 3.3  | 79.0 ± 6.9
SSDML [43] | 93.3 ± 3.9  | 95.4 ± 2.8  | 95.9 ± 4.7  | 92.8 ± 5.3  | 92.8 ± 5.3  | 92.8 ± 4.6  | 68.7 ± 3.3
RNP [18]   | 95.9 ± 2.2  | 92.3 ± 3.2  | 90.2 ± 3.3  | 89.3 ± 6.6  | 85.4 ± 2.0  | 84.8 ± 2.7  | 40.0 ± 7.4
SJSRC      | 99.5 ± 2.8  | 99.5 ± 2.0  | 99.0 ± 2.3  | 99.0 ± 2.8  | 98.0 ± 1.4  | 100.0 ± 0.0 | 85.1 ± 3.1

Average identification rates and standard deviations of various methods on the Honda/UCSD dataset. Different set sizes of the training and query image sets are taken into consideration in the experiments. "x–y" means that there are x and y frames in the training and query sets, respectively. Besides, it is infeasible for the MDA and MMD methods to accomplish the identification task with only one image in each query set.

Table 2
Identification rates (%) of various methods on the Mobo, PubFig, YTC, ETH and Scenes 15 datasets.

Methods    | Mobo        | PubFig      | YTC         | ETH         | Scenes 15
MSM [7]    | 96.8 ± 2.0  | 62.7 ± 1.4  | 47.2 ± 1.2  | 75.5 ± 4.8  | 100.0 ± 0.0
DCC [2]    | 92.8 ± 1.2  | 48.3 ± 2.1  | 50.3 ± 1.1  | 86.0 ± 6.5  | 96.7 ± 2.4
MMD [3]    | 92.5 ± 2.9  | 76.2 ± 6.9  | 52.0 ± 3.7  | 77.5 ± 5.0  | 98.7 ± 1.8
MDA [4]    | 81.0 ± 12.3 | 74.3 ± 6.4  | 54.5 ± 4.6  | 77.3 ± 5.5  | 98.0 ± 3.0
AHISD [5]  | 85.8 ± 4.2  | 85.9 ± 0.6  | 51.7 ± 1.9  | 78.8 ± 5.3  | 84.7 ± 3.0
CHISD [5]  | 96.4 ± 0.8  | 87.5 ± 0.9  | 54.0 ± 2.9  | 79.5 ± 5.3  | 86.7 ± 3.3
SANP [6]   | 97.5 ± 1.8  | 80.4 ± 2.5  | 57.5 ± 2.9  | 77.8 ± 7.3  | 99.3 ± 1.5
CDL [10]   | 90.0 ± 5.0  | 90.1 ± 1.6  | 52.4 ± 2.7  | 77.8 ± 4.2  | 86.7 ± 0.0
MSSRC [9]  | 95.6 ± 1.8  | 85.6 ± 2.8  | 53.4 ± 1.2  | 77.0 ± 4.8  | 90.7 ± 2.8
SSDML [43] | 95.3 ± 3.2  | 88.6 ± 1.6  | 55.4 ± 2.7  | 81.9 ± 6.6  | 99.3 ± 1.5
RNP [18]   | 96.1 ± 1.8  | 88.8 ± 1.4  | 57.6 ± 2.2  | 81.0 ± 3.2  | 100.0 ± 0.0
SJSRC      | 98.5 ± 2.3  | 95.2 ± 0.5  | 61.5 ± 1.8  | 94.5 ± 4.0  | 100.0 ± 0.0

(20 × 20 in this case) to accomplish the identification task. As a result, making correct decisions with only one image is very challenging.

6.2.2. CMU Mobo dataset
The Mobo (Motion of Body) dataset was originally designed for human pose estimation. It consists of 24 subjects walking on a treadmill, with 4 sequences each. The Viola and Jones face detection algorithm is employed to extract face frames and the face regions are resized to 40 × 40. Following the common settings, from each subject, one sequence is randomly selected for training while the remaining three are taken for testing. With different random selections of the training and query sets, the experiments are repeated for 10 runs, and the average identification rates are exhibited in the 2nd column of Table 2. From the results, it can be found that our SJSRC achieves the best performance among the various methods.

6.2.3. Public Figures Face Database (PubFig)
The PubFig database is a real-life dataset consisting of images of 200 people collected from the Internet. As the images were acquired in uncontrolled situations without any user cooperation, they show large variations in pose, lighting, expression and background. The provided cropped face images are utilized and resized to 20 × 30. Following [22], we equally divide the images of each subject into three folds, among which one fold is taken for training while the other two are considered as the query sets. Experiments are repeated ten times with different random splits of the training and testing folds. The experimental results are depicted in the 3rd column of Table 2, which show that our SJSRC greatly outperforms the others and obtains a promising identification rate of 95.2%. This comparison indicates that our proposed SJSRC can handle a wide range of appearance variations.

6.2.4. YouTube Celebrities dataset
The YouTube Celebrities (YTC) dataset contains 1910 videos of 47 celebrities collected from YouTube, and each celebrity has three unique sequences. The face images of this dataset exhibit a large diversity and variations in pose, illumination and expression. Moreover, due to the high compression rate, the quality and resolution of the images are quite low. To ensure that


Table 3
Identification results of the SJSRC with different parameter settings on the Mobo dataset (%).

λ1 = 0, η fixed:
λ2       | 0    | 0.001 | 0.01 | 0.1  | 0.2  | 0.4  | 1    | 2
Accuracy | 95.8 | 95.8  | 96.8 | 96.8 | 96.8 | 96.5 | 94.7 | 97.5

λ2 = 0, η fixed:
λ1       | 0    | 0.001 | 0.01 | 0.1  | 0.2  | 0.4  | 0.6  | 0.8
Accuracy | 95.8 | 96.0  | 97.1 | 96.5 | 96.9 | 94.7 | 94.7 | 81.9

λ1, λ2 fixed:
η        | 0    | 0.001 | 0.002 | 0.005 | 0.01 | 0.05 | 0.1  | 0.2
Accuracy | 95.5 | 96.5  | 96.8  | 95.5  | 24.7 | 15.5 | 0.04 | 0.04

each set has a maximal number of clearly detected face images, the Incremental Learning tracking method in [46] is taken instead to extract face frames. About 300,000 images are thus extracted and resized to 30 × 30. Following [3] and [4], we use five-fold cross validation to evaluate identification performance. The dataset is equally divided into five folds with 9 image sets per subject in each fold. One training clip and two testing clips are taken from each unique sequence of a certain subject to conduct the experiments. The average identification rates and standard deviations of the various methods on this dataset are summarized in the 4th column of Table 2. It can be observed that our proposed SJSRC outperforms the other methods. To cover the maximum number of face frames, the cropped faces are not uniform across frames, even when they are extracted from the same video, and plenty of tracking errors are introduced by the tracker of Ross et al. [46]. So our results are inferior to those in [3,4], where only the faces successfully detected by the Viola and Jones algorithm [44] are kept. In fact, a significant number of frame detections by Viola and Jones fail, especially when there are large head rotations and the videos are of low quality.

6.2.5. ETH-80 dataset
The ETH-80 dataset is designed for object categorization. There are eight object categories in total, including apples, cars, cows, cups, dogs, horses, pears and tomatoes. Each object category is composed of ten subcategories, each exhibiting images of a different breed or brand under 41 orientations. The original cropped images are resized to 32 × 32. Following [4], each subcategory is viewed as an image set and therefore there are a total of 80 sets. From each object, half of the subcategories are randomly selected for training while the other half are taken for query.
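The random half-split protocol over subcategories described above can be sketched as follows (a hypothetical helper, with the ETH-80 sizes of ten subcategories and eight classes as defaults):

```python
import random

def split_subcategories(n_sub=10, n_classes=8, seed=0):
    """Randomly assign half of each object class's subcategories to training
    and the other half to query, as in the ETH-80 protocol."""
    rng = random.Random(seed)
    splits = {}
    for c in range(n_classes):
        subs = list(range(n_sub))
        rng.shuffle(subs)
        splits[c] = {"train": sorted(subs[: n_sub // 2]),
                     "query": sorted(subs[n_sub // 2 :])}
    return splits
```

Repeating this with different seeds yields the 10 random runs whose average rates are reported.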
The experiments are repeated 10 times and the corresponding average identification rates of the various methods are presented in the 5th column of Table 2. From the table, we can observe a significant improvement of our proposed model over the other state-of-the-art methods.

6.2.6. Scenes 15
The Scenes 15 dataset contains a total of 4485 images from 15 scene categories, such as street, industrial, living room and kitchen, and the number of images per category varies from 200 to 400. The images are resized to 300 × 200 as in [47]. The images of each concept are equally divided into three subsets. Similar to the Honda/UCSD dataset, one subset of each category is randomly selected for training, and the remaining two are taken for testing. We repeat the experiments 10 times and list the average recognition rates in the last column of Table 2. From the table, we can find that our SJSRC achieves perfect performance with a 100% identification rate, like two other state-of-the-art methods, MSM and RNP.

6.3. Ablative analysis

In this section, we provide a further analysis of the SJSRC. The effects of several parameters, including λ1, λ2 and η, on identification performance are discussed first. Following that, the effect of the dimensionality reduction scheme is analyzed. Then the necessity of feature fusion with the SJSRC is validated. After that, the proposed SJSRC is compared with the original MTJSRC. Finally, two improvements of the SJSRC are evaluated and a time analysis is provided.

6.3.1. Effects of parameters
There are two parameters controlling the balance between different bias terms in Eq. (12), where λ1 affects the atom-level sparsity while λ2 decides the concept-level sparsity. Identification results of the SJSRC with different settings of λ1 and λ2 on the Mobo dataset are provided in Table 3.
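Such a parameter study amounts to a sweep over a two-dimensional grid, with one parameter often fixed while the other is varied. A generic sketch, where the `evaluate` callback is a stand-in for validation accuracy (not part of the paper's code):

```python
from itertools import product

def sweep(evaluate, lam1_grid, lam2_grid):
    """Evaluate an accuracy callback over a (lam1, lam2) grid and return the
    best-performing pair, mirroring the protocol behind Table 3."""
    return max(product(lam1_grid, lam2_grid), key=lambda p: evaluate(*p))
```

Fixing one of the two grids to a single value recovers the one-parameter-at-a-time rows of the table.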
From Table 3, when fixing one parameter and fine-tuning the other, a similar trend of identification performance can be observed: it improves at first and then degrades as the tuned parameter value increases. On one hand, larger parameter values lead to high sparsity (most of the matrix coefficients become zero) and weaken the discriminant capability of the learned model. On the other hand, when too light penalties are imposed, namely small values of λ1 and λ2, the reconstruction error is overly emphasized, which leads to representations with less selectivity and low discriminant capability. Meanwhile, identification performance is more sensitive to the choice of λ1, while it remains relatively stable over a larger range of λ2. By comparing the results with those in Table 2, a further improvement can be observed with the combination


Table 4
Identification rates of different dimensionality reduction schemes (%) on all datasets.

Methods | Honda      | Mobo       | PubFig     | YTC        | ETH        | Scenes 15
PCA     | 99.5 ± 2.8 | 98.5 ± 2.3 | 95.2 ± 0.5 | 61.5 ± 1.8 | 94.5 ± 4.0 | 100.0 ± 0.0
ISOMAP  | 99.0 ± 1.4 | 97.3 ± 3.1 | 92.0 ± 1.2 | 58.3 ± 2.5 | 95.0 ± 3.9 | 96.7 ± 3.0
LLE     | 96.5 ± 1.4 | 97.0 ± 1.3 | 91.1 ± 2.4 | 59.1 ± 1.7 | 96.0 ± 2.9 | 98.0 ± 1.8

Table 5
A comparison of identification rates (%) between the SJSRC with single features and different fusion schemes with multiple features on various datasets.

Datasets  | Pixel      | Wavelet    | LBP        | Posterior  | Concatenation | SJSRC
Honda     | 96.4 ± 2.6 | 94.9 ± 1.8 | 95.4 ± 2.8 | 97.5 ± 2.8 | 98.0 ± 2.6    | 99.5 ± 2.8
Mobo      | 95.5 ± 3.4 | 94.0 ± 1.3 | 96.5 ± 2.1 | 97.0 ± 2.0 | 95.3 ± 4.3    | 98.5 ± 2.3
PubFig    | 91.5 ± 0.5 | 88.5 ± 1.0 | 91.7 ± 0.9 | 95.6 ± 0.4 | 94.6 ± 0.8    | 95.2 ± 0.5
YTC       | 51.2 ± 2.6 | 35.0 ± 1.5 | 58.5 ± 1.6 | 58.5 ± 1.6 | 56.9 ± 2.4    | 61.5 ± 2.0
ETH       | 91.0 ± 4.9 | 88.5 ± 2.9 | 83.5 ± 3.4 | 92.5 ± 4.9 | 92.0 ± 4.5    | 94.5 ± 4.0
Scenes 15 | 97.3 ± 2.8 | 91.0 ± 4.9 | 84.7 ± 6.1 | 97.3 ± 2.8 | 96.0 ± 4.4    | 100.0 ± 0.0

'Posterior' indicates processing each feature channel independently and fusing the decisions of all channels via Eq. (14). 'Concatenation' indicates taking the concatenated feature vector of the different features as the input to SJSRC-single.

Table 6
The identification rates of the SJSRC (%) on the Honda dataset with different set sizes.

Schemes | Pixel      | Wavelet    | LBP        | Fusion
All–All | 97.4 ± 2.3 | 94.9 ± 1.8 | 95.4 ± 2.8 | 99.5 ± 2.8
100–100 | 97.4 ± 2.3 | 94.9 ± 1.8 | 95.4 ± 2.8 | 99.5 ± 2.0
50–50   | 96.4 ± 2.6 | 92.8 ± 4.6 | 96.4 ± 1.8 | 99.0 ± 2.3
50–25   | 95.5 ± 2.2 | 90.3 ± 3.3 | 95.9 ± 2.9 | 99.0 ± 2.8
25–50   | 96.6 ± 1.4 | 88.2 ± 4.7 | 95.9 ± 1.4 | 100.0 ± 0.0
All–1   | 80.3 ± 5.3 | 55.4 ± 4.3 | 77.0 ± 4.1 | 85.1 ± 3.1

of the atom-level and concept-level regularization terms. Based on the observations above, the default parameters λ1 = 0.01 and λ2 = 0.05 are chosen in our experiments. In Table 3, we also provide an analysis of the effect of the gradient descent step size η in Eq. (6). It can be observed that η should be kept within 0.001 to 0.005 so that the gradient descends with a relatively small amplitude. As a result, the default value η = 0.002 is chosen in our experiments.

6.3.2. Effects of dimensionality reduction schemes
In this subsection, we compare the results of the different dimensionality reduction schemes mentioned in Section 6.1.2. Three classical methods, including Principal Components Analysis (PCA) [48], ISOMAP [49] and Locally Linear Embedding (LLE) [50], are evaluated in the experiments and the corresponding results are shown in Table 4. From the table, we find that due to noise and low resolution, the manifold learning based methods (ISOMAP and LLE) produce inferior results to the linear method (PCA) on the benchmark datasets. The abnormal result on the ETH-80 dataset may be caused by the relatively small data size.

6.3.3. Necessity of feature fusion with the SJSRC
As described in Section 4.4, the classifier of each individual feature (SJSRC-single) is obtained if we set the number of features n in Eq. (12) to 1. In this subsection, we compare these individual classifiers with the SJSRC and provide a validation of the necessity of feature fusion with the SJSRC. Table 5 shows the improvement in the identification rates of the SJSRC with multiple features over single features. From this table, it can be observed that the identification rates are largely improved by the cooperation between different features, which indicates that complementary information is successfully explored among the various features. Table 5 also compares the identification rates of different fusion schemes. From it, the following remarks can be made.
The scheme 'posterior' obtains better results by combining the different individual classifiers; however, it may reduce to one of them if one feature is much stronger than the others. The scheme 'concatenation' obtains degraded performance when the concatenated feature vector has a lower discriminant capability. Among these schemes, the SJSRC obtains the best results except on the PubFig dataset, which exhibits the necessity of the joint optimization of sparse codes for all features. The exception may be caused by interference from the many negative concepts introduced into the cooperative feature channels when acquiring the sparse codes.
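The 'posterior' scheme can be sketched as a weighted late fusion of per-channel class scores; the weights play the role of θ_l in Eq. (14), though the exact combination rule below is illustrative rather than the paper's formula:

```python
import numpy as np

def fuse_posteriors(channel_scores, weights):
    """Weighted late fusion: combine per-channel class-score vectors and
    return the index of the winning class."""
    fused = sum(w * s for w, s in zip(weights, channel_scores))
    return int(np.argmax(fused))
```

Note that a degenerate weight vector reduces the fusion to a single channel, which is exactly the failure mode of 'posterior' noted above when one feature dominates.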

P. Zheng et al. / Information Sciences 448–449 (2018) 75–90

87

Table 7
Identification rates of MTJSRC and SJSRC (%) on all datasets.

Methods | Honda      | Mobo       | PubFig     | YTC        | ETH        | Scenes 15
MTJSRC  | 98.0 ± 2.5 | 96.5 ± 2.1 | 92.0 ± 0.9 | 58.9 ± 1.9 | 93.0 ± 4.9 | 97.3 ± 2.8
SJSRC   | 99.5 ± 2.8 | 98.5 ± 2.3 | 95.2 ± 0.5 | 61.5 ± 1.8 | 94.5 ± 4.0 | 100.0 ± 0.0

Table 8
Time consumption of MTJSRC and SJSRC (seconds) to identify a query set on all datasets.

Methods | Honda | Mobo | PubFig | YTC   | ETH  | Scenes 15
MTJSRC  | 0.67  | 1.35 | 7.55   | 69.38 | 0.34 | 0.54
SJSRC   | 0.49  | 0.87 | 5.63   | 59.69 | 0.24 | 0.43

Fig. 3. The effects of ‘Anchor Graph’ and ‘RNP’ evaluated on the YTC dataset. With the varying set sizes (number of anchor points), Sub-figure (a) plots recognition rates of different schemes while Sub-figure (b) plots time consumption of different schemes to identify each query set. ‘Anchor Graph’ means to take all ‘anchor points’ as the dictionary atoms while ‘Anchor Graph+RNP’ means that only the samples in the k-nearest neighbor sets are kept to calculate sparse codes. The dotted line of ‘Original’ is a reference line, which is obtained with the SJSRC by taking all the training images as the dictionary.
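The 'Anchor Graph' scheme in Fig. 3 builds its reduced dictionary from anchor points; such anchors are commonly obtained by running k-means over the training images and keeping the centroids. A generic Lloyd-iteration sketch (an assumed procedure, not necessarily the paper's exact anchor selection):

```python
import numpy as np

def kmeans_anchors(X, n_anchors, iters=10, seed=0):
    """Pick anchor points as k-means centroids of the training images
    (rows of X), to serve as a reduced dictionary."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_anchors, replace=False)].copy()
    for _ in range(iters):
        # assign every sample to its nearest current center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned samples
        for k in range(n_anchors):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```

Increasing `n_anchors` trades computation for fidelity, matching the trend in Fig. 3(a) where more anchor points steadily improve the recognition rate.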

Table 6 presents the identification rates of our SJSRC on the Honda/UCSD dataset with different set sizes. From it, we can find that with the reduction of set size, the performance of the single features is affected to various degrees, while that of the SJSRC remains stable (Rows 1–5). When there is only one sample in each query set (the last row), the improvement of the proposed SJSRC over the single features is very apparent. In short, the comparison in Table 6 demonstrates that by exploring complementary information among various features, the SJSRC becomes more robust to set size changes.

6.3.4. Comparison between MTJSRC and SJSRC
The comparisons of identification rates and time consumption between the MTJSRC and the SJSRC are provided in Tables 7 and 8, respectively. From Table 7, we can find that for all datasets, our SJSRC obtains higher identification rates than the MTJSRC, which indicates that taking the commonness and similarities within each subject as a regularization is helpful for image set classification. From Table 8, it can be observed that our SJSRC takes less time to identify a query set than the MTJSRC. The efficiency improvement comes from the fact that the set-level regularization employed in the SJSRC keeps each set as a whole and rejects many more negative categories than the MTJSRC. Another reason is that the optimization is conducted jointly over each set in the SJSRC, while it is conducted sample by sample in the MTJSRC.

6.3.5. The improvement of SJSRC and time analysis
We evaluate the improvement of the SJSRC with the Anchor Graph and RNP on the YTC dataset, which is large in scale and highly challenging. Fig. 3 shows the recognition rates and time consumption of the improved SJSRC. From Fig. 3(a), it can be found that increasing the number of anchor points brings a steady improvement of the SJSRC. Moreover, with the introduction of the RNP, identification performance can be further improved. This phenomenon is more apparent when there are fewer samples in each set. This indicates that the weakened discriminant ability caused by a small number of image samples can be remedied by the RNP, which binds the representation to the most similar categories. The SJSRC with the Anchor Graph and RNP achieves better results than the original SJSRC when the set size is larger than 75. From Fig. 3(b), it can be observed that the time consumption of the SJSRC to identify each query set is greatly reduced by the Anchor Graph, and further reduced by the RNP. In brief, we can find that by combining our SJSRC with the Anchor Graph


Table 9
A comparison of time consumption (seconds) for training and identifying a query set between various methods.

Methods | MSM   | DCC    | MMD  | MDA  | AHISD | CHISD | SANP  | CDL   | MSSRC | SSDML  | RNP  | SJSRC
Train   | N/A   | 391.92 | N/A  | 9.28 | N/A   | N/A   | N/A   | 53.80 | N/A   | 908.02 | N/A  | N/A
Test    | 10.11 | 1.59   | 1.34 | 0.60 | 72.96 | 4.22  | 41.87 | 0.29  | 33.62 | 32.74  | 0.69 | 0.85

and RNP, the computational efficiency can be greatly improved while a relatively high identification rate is maintained. As only around 20 samples are preserved for each concept, this also proves the effectiveness of our SJSRC on small-data identification tasks. Finally, a comparison of computational complexity between all methods on the YTC dataset is presented in Table 9. The schemes of Anchor Graph and RNP are both adopted in our SJSRC. From the table, we can find that the time taken by our SJSRC to classify a query set is very competitive among all the state-of-the-art methods.

7. Conclusion

Traditional methods for image set classification rely on a single feature or a concatenated vector of different features to obtain the characteristics of different visual concepts. The discriminant capability can be weakened by the reduction of set size or by improper correlations between different feature channels. In this paper, we propose a set-level joint sparse representation classification (SJSRC) model for image set classification, which imposes two levels of sparse regularization terms, namely atom-level and concept-level, on the representation of each concept. By taking the images in each query set as a whole to find the most relevant subject, conflicts between different query images are avoided and a more robust representation is obtained. At the same time, by combining different features via our SJSRC, more discriminant information can be explored from the cooperation between them. In addition, we adopt two schemes, namely 'Anchor Graph' and 'RNP', to solve the time complexity problem brought by the kernel trick in our SJSRC. Experimental results on several benchmark datasets prove the effectiveness of our SJSRC model.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 61672203 & 61375047), and the Anhui Natural Science Funds for Distinguished Young Scholar (No. 170808J08).
The authors would like to thank Xiao-Tong Yuan for sharing the MTJSRC code online, and Jianchao Yang for sharing the SIFT ScSPM code. The authors would also like to thank the anonymous reviewers, whose extensive constructive comments improved this work.

References

[1] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[2] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1005–1018.
[3] R. Wang, S. Shan, X. Chen, W. Gao, Manifold-manifold distance with application to face recognition based on image set, in: Proceedings of CVPR, 2008, pp. 1–8.
[4] R. Wang, X. Chen, Manifold discriminant analysis, in: Proceedings of CVPR, 2009, pp. 429–436.
[5] H. Cevikalp, B. Triggs, Face recognition based on image sets, in: Proceedings of CVPR, 2010, pp. 2567–2573.
[6] Y. Hu, A.S. Mian, R. Owens, Face recognition using sparse approximated nearest points between image sets, IEEE Trans. Pattern Anal. Mach. Intell. 34 (10) (2012) 1992–2004.
[7] O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: Proceedings of FG, 1998, pp. 318–323.
[8] J. Lu, G. Wang, P. Moulin, Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning, in: Proceedings of ICCV, 2013, pp. 329–336.
[9] E.G. Ortiz, A. Wright, M. Shah, Face recognition in movie trailers via mean sequence sparse representation-based classification, in: Proceedings of CVPR, 2013, pp. 3531–3538.
[10] R. Wang, H. Guo, L.S. Davis, Q. Dai, Covariance discriminative learning: a natural and efficient approach to image set classification, in: Proceedings of CVPR, 2012, pp. 2496–2503.
[11] S. Chen, C. Sanderson, M.T. Harandi, B.C. Lovell, Improved image set classification via joint sparse approximated nearest subspaces, in: Proceedings of CVPR, 2013, pp. 452–459.
[12] B.A. Olshausen, D.J. Field, Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis. Res. 37 (23) (1997) 3311–3325.
[13] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of CVPR, 2009, pp. 1794–1801.
[14] Y. Liu, F. Wu, Y. Zhuang, Group sparse representation for image categorization and semantic video retrieval, Sci. China Inf. Sci. 54 (10) (2011) 2051–2063.
[15] X.-T. Yuan, S. Yan, Visual classification with multi-task joint sparse representation, in: Proceedings of CVPR, 2010, pp. 3493–3500.
[16] J. Li, H. Zhang, L. Zhang, Efficient superpixel-level multitask joint sparse representation for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 53 (10) (2015) 5338–5351.
[17] C. Deng, R. Ji, D. Tao, X. Gao, X. Li, Weakly supervised multi-graph learning for robust image reranking, IEEE Trans. Multimedia 16 (3) (2014) 785–795.
[18] M. Yang, P. Zhu, L. Van Gool, L. Zhang, Face recognition based on regularized nearest points between image sets, in: Proceedings of FG, 2013, pp. 1–7.
[19] P. Turaga, A. Veeraraghavan, A. Srivastava, R. Chellappa, Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition, IEEE Trans. Pattern Anal. Mach. Intell. 33 (11) (2011) 2273–2286.

P. Zheng et al. / Information Sciences 448–449 (2018) 75–90

89

[20] J. Hamm, D.D. Lee, Grassmann discriminant analysis: a unifying view on subspace-based learning, in: Proceedings of ICML, 2008, pp. 376–383.
[21] M.T. Harandi, C. Sanderson, A. Wiliem, B.C. Lovell, Kernel analysis over Riemannian manifolds for visual recognition of actions, pedestrians and textures, in: Proceedings of WACV, 2012, pp. 433–439.
[22] M. Hayat, M. Bennamoun, S. An, Deep reconstruction models for image set classification, IEEE Trans. Pattern Anal. Mach. Intell. 37 (4) (2015) 713–727.
[23] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. Image Process. 23 (5) (2014) 2019–2032.
[24] L. An, X. Chen, S. Yang, B. Bhanu, Sparse representation matching for person re-identification, Inf. Sci. 355–356 (2016) 74–89.
[25] C. Hong, J. Yu, J. Wan, D. Tao, M. Wang, Multimodal deep autoencoder for human pose recovery, IEEE Trans. Image Process. 24 (12) (2015) 5659.
[26] J. Yu, C. Hong, Y. Rui, D. Tao, Multi-task autoencoder model for recovering human poses, IEEE Trans. Ind. Electron. PP (99) (2017) 1–1.
[27] C. Zhang, J. Cheng, Y. Zhang, J. Liu, C. Liang, J. Pang, Q. Huang, Q. Tian, Image classification using boosted local features with random orientation and location selection, Inf. Sci. 310 (C) (2015) 118–129.
[28] C. Shao, X. Song, Z.H. Feng, X.J. Wu, Y. Zheng, Dynamic dictionary optimization for sparse-representation-based face classification using local difference images, Inf. Sci. 393 (2017) 1–14.
[29] C. Hong, J. Yu, J. You, X. Chen, D. Tao, Multi-view ensemble manifold regularization for 3D object recognition, Inf. Sci. 320 (C) (2015) 395–405.
[30] Z.-Q. Zhao, H. Glotin, Z. Xie, J. Gao, X.D. Wu, Cooperative sparse representation in two opposite directions for semi-supervised image annotation, IEEE Trans. Image Process. 21 (9) (2012) 4218–4231.
[31] Z.Q. Zhao, Y.M. Cheung, H. Hu, X. Wu, Corrupted and occluded face recognition via cooperative sparse representation, Pattern Recognit. 56 (C) (2016) 77–87.
[32] L. Yuan, J. Liu, J. Ye, Efficient methods for overlapping group lasso, in: Proceedings of NIPS, 2011, pp. 352–360.
[33] A. Serra, P. Coretto, M. Fratello, R. Tagliaferri, Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data, Bioinformatics (2017).
[34] K. Levin, V. Lyzinski, Laplacian eigenmaps from sparse, noisy similarity measurements, IEEE Trans. Signal Process. 65 (8) (2017) 1988–2003.
[35] F. Wu, X.Y. Jing, W. Zuo, R. Wang, X. Zhu, Discriminant tensor dictionary learning with neighbor uncorrelation for image set based classification, in: Proceedings of IJCAI, 2017, pp. 3069–3075.
[36] G. Xia, H. Sun, L. Feng, G. Zhang, Y. Liu, Human motion segmentation via robust kernel sparse subspace clustering, IEEE Trans. Image Process. PP (99) (2017) 1–1.
[37] P. Gehler, S. Nowozin, On feature combination for multiclass object classification, in: Proceedings of ICCV, 2009, pp. 221–228.
[38] X. Chen, W. Pan, J.T. Kwok, J.G. Carbonell, Accelerated gradient method for multi-task sparse learning problem, in: Proceedings of ICDM, 2009, pp. 746–751.
[39] M. Schmidt, N.L. Roux, F.R. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, in: Proceedings of NIPS, 2011, pp. 1458–1466.
[40] M. Schmidt, E. Berg, M. Friedlander, K. Murphy, Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm, in: Proceedings of Artificial Intelligence and Statistics, 2009, pp. 456–463.
[41] B. Xu, J. Bu, C. Chen, C. Wang, D. Cai, X. He, EMR: a scalable graph-based ranking model for content-based image retrieval, IEEE Trans. Knowl. Data Eng. 27 (1) (2015) 102–114.
[42] D. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber, Deep big simple neural nets excel on handwritten digit recognition, Neural Comput. 22 (12) (2010) 3207–3220.
[43] P. Zhu, L. Zhang, W. Zuo, D. Zhang, From point to set: extend the learning of distance metrics, in: Proceedings of ICCV, 2013, pp. 2664–2671.
[44] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[45] K.-C. Lee, J. Ho, M.-H. Yang, D. Kriegman, Video-based face recognition using probabilistic appearance manifolds, in: Proceedings of CVPR, 1, 2003, pp. 313–320.
[46] D.A. Ross, J. Lim, R.S. Lin, M.H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. 77 (1–3) (2008) 125–141.
[47] P. Zheng, Z.-Q. Zhao, J. Gao, X. Wu, Image set classification based on cooperative sparse representation, Pattern Recognit. 63 (2017) 206–217.
[48] L.I. Smith, A tutorial on principal components analysis, Technical Report, 2002.
[49] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[50] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.


Peng Zheng is a postdoctoral researcher at Hefei University of Technology, China. He received his Bachelor's degree in 2010 and his Ph.D. in 2017, both from Hefei University of Technology. His interests cover pattern recognition, image processing, and computer vision.

Zhong-Qiu Zhao is a professor at Hefei University of Technology, China. He obtained his Master's degree in Pattern Recognition & Intelligent Systems at the Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, China, in 2004, and his Ph.D. degree in Pattern Recognition & Intelligent Systems at the University of Science and Technology of China in 2007. From April 2008 to November 2009, he held a postdoctoral position in image processing in the CNRS UMR6168 Lab Sciences de l'Information et des Systèmes, France. From January 2013 to December 2014, he held a research fellow position in image processing at the Department of Computer Science of Hong Kong Baptist University, Hong Kong, China. His research interests include pattern recognition, image processing, and computer vision.

Jun Gao was born in 1963 and obtained his Bachelor's and Master's degrees from Hefei University of Technology in 1985 and 1990, respectively. He received his Ph.D. degree from the Chinese Academy of Sciences in 1999. He is now a professor and Ph.D. supervisor in the School of Computer and Information at Hefei University of Technology. His research interests include image processing, computer vision, and intelligent information processing.

Xindong Wu is a professor of Computer Science at the University of Louisiana at Lafayette (USA), a Yangtze River Scholar in the School of Computer Science and Information Engineering at the Hefei University of Technology (China), and a Fellow of the IEEE and the AAAS. Dr. Wu received his Bachelor's and Master's degrees in Computer Science from Hefei University of Technology, China, and his Ph.D. degree in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration. He is the Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), the Editor-in-Chief of Knowledge and Information Systems (KAIS, by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He was the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE, by the IEEE Computer Society) between 2005 and 2008. He served as Program Committee Chair/Co-Chair for ICDM '03 (the 2003 IEEE International Conference on Data Mining), KDD-07 (the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining), and CIKM 2010 (the 19th ACM Conference on Information and Knowledge Management).