Discriminative sparsity preserving projections for image recognition


Quanxue Gao 1, Yunfang Huang 1, Hailin Zhang 1, Xin Hong 1, Kui Li 1, Yong Wang 2

1 State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China. Email: [email protected]; [email protected]. Tel: 086-029-88204753.
2 School of Electronic Engineering, Xidian University, Xi'an 710071, China.

To appear in: Pattern Recognition (www.elsevier.com/locate/pr)
PII: S0031-3203(15)00075-8; DOI: http://dx.doi.org/10.1016/j.patcog.2015.02.015; Reference: PR5354
Received 3 July 2014; Revised 9 February 2015; Accepted 18 February 2015

Abstract

Previous works have demonstrated that image classification performance can be significantly improved by manifold learning. However, the performance of manifold learning depends heavily on the manual selection of parameters, resulting in poor adaptability in real-world applications. In this paper, we propose a new dimensionality reduction method called discriminative sparsity preserving projections (DSPP). Different from existing sparse subspace algorithms, which manually construct a penalty adjacency graph, DSPP employs a sparse representation model to adaptively build both the intrinsic adjacency graph and the penalty graph together with their weight matrices, and then integrates the global within-class structure into the discriminant manifold learning objective function for dimensionality reduction. Extensive experimental results on four image databases demonstrate the effectiveness of the proposed approach.

Keywords: Dimensionality reduction, Manifold learning, Sparse representation, Image recognition


1 Introduction

With the rapid development of information technology, objects such as images are usually represented as high-dimensional vectors in applications and do not satisfy the assumption of a Gaussian distribution. As a result, popular dimensionality reduction techniques such as PCA [1-3] and LDA [4,5] do not perform well enough on general (real-valued or non-real-valued) data, because they only capture the global Euclidean structure of images and cannot well characterize the local structure [6,7]. Many previous works have demonstrated that the local structure embedded in high-dimensional vector space is very important for image recognition and helps characterize both the intrinsic and discriminative

structure of images [6-11]. The most representative manifold learning methods include Laplacian Eigenmap (LE) [6], Isomap [9], and LLE [7]. These nonlinear methods yield impressive results on some benchmark artificial data sets. However, they yield maps that are defined only on the training data points, and how to evaluate the maps on novel test data points remains unclear. To address this, He et al. [8,12] and Cai et al. [13] proposed linear manifold learning methods such as Locality Preserving Projection (LPP) [8], Neighborhood Preserving Embedding (NPE) [12], and Isometric Projection [13] as linear approximations to these nonlinear algorithms. Motivated by LPP [8] and NPE [12], many local discriminant approaches [14-23] have been developed for image classification, among which the most prevalent ones include margin Fisher analysis (MFA) [20], local discriminant embedding (LDE) [14], local Fisher discriminant analysis (LFDA) [15], locality sensitive discriminant analysis (LSDA) [18], discriminative locality alignment (DLA) [19], and LLDE [22]. MFA, LDE, LFDA and LSDA learn the intra-class compactness by LPP, while LLDE and DLA combine NPE, which explicitly considers the geometric reconstruction among nearby points, with the within-class scatter to learn the intra-class compactness. Inspired by these methods, some supervised approaches, which impose a non-negative constraint [24-26], orthogonal constraint [30-32], or diversity constraint [27-29], and some semi-supervised approaches [33,34] have been developed for dimensionality reduction and image classification. Although their motivations differ, the key ideas of the above-mentioned approaches can be unified within the graph embedding framework (GEF) proposed by Yan et al. [20]. GEF maps nearby points having the same class label in the high-dimensional space to nearby points in the low-dimensional representation by LPP [8] or NPE [12]; meanwhile, GEF maps nearby points having different class labels to points as distant as possible in the reduced space. The main shortcoming of GEF is that its performance heavily relies on the selection of parameters such as the neighbor parameters k and t. In most practical cases the distribution of the data is unknown and complex, so it is very difficult to select suitable parameters for each method. Worse yet, even if

k, t and the sample number are fixed, the performance would still fluctuate with each new set of random samples [35,36]. This reduces the flexibility of these methods in practical applications. To overcome this problem, some researchers began to investigate how to construct the adjacency graph. Yang et al. [37] constructed a sample-dependent graph based on the samples in question to determine the neighbors of each sample and the similarities between sample pairs, instead of predefining the same neighbor parameter k for all samples. Zhao et al. [38] used label information to construct the Locally Discriminating Projection (LDP). However, these methods still require a suitable parameter t, which is difficult to set to an appropriate value. Recently, sparse signal representation has proven to be an extremely powerful tool for computer vision and pattern recognition. Wright et al. [39] proposed the sparse representation classifier (SRC) for face recognition, which is robust to occlusion and corruption. Zheng et al. [40] and Gao et al. [41] combined sparse representation and manifold learning in their objective functions for image clustering and representation. Zhou and Tao [42] proposed a double shrinking model for dimensionality reduction. Gui et al. [43] proposed GSM-PAF for multiview image data analysis. Yang et al. [44] proposed the SRC-steered discriminative projection (SRC-DP) method for classification. Zhuang et al. [45] constructed an adjacency graph by combining low-rank and sparse representation and proposed the NNLRS-graph for dimensionality reduction. Yan and Wang [46], Cheng et al. [47], and Wright et al. [48] employed the L1-graph for computer vision, image analysis and pattern recognition, respectively. They demonstrated that sparse representation can well characterize the local relationships of data. Inspired by these meaningful works, researchers adaptively constructed the adjacency graph by minimizing an L1-optimization problem and combined it with subspace learning for dimensionality reduction. For example, Qiao et al. [49] proposed sparsity preserving projections (SPP), which achieves higher recognition accuracy than PCA and NPE. Zang and Zhang [50] introduced label information into the L1-regularized objective function and then combined the sparse reconstruction error and the local within-class scatter of the data to solve for the

projection matrix for dimensionality reduction. Gui et al. [51] proposed a discriminant sparse neighborhood preserving embedding (DSNPE) algorithm by combining SPP and the maximum margin criterion (MMC) for face recognition. However, these algorithms ignore or impair the local discriminant information embedded in the data, which is very important for image classification [14-18]. In this paper, we propose a discriminative sparsity preserving projections (DSPP) algorithm that combines SPP with local discriminant information for dimensionality reduction. To be specific, we adaptively construct the intrinsic adjacency graph and its weight matrix by using label information in an L1-regularized sparse representation objective function. To construct the penalty adjacency graph, for each training sample $\mathbf{x}_i$ we first find the atom with the minimum non-zero weight in the i-th row of the weight matrix and calculate the distance ε between $\mathbf{x}_i$ and this atom, and then put an edge between $\mathbf{x}_i$ and the atoms that lie within the ε-radius of $\mathbf{x}_i$ and have class labels different from that of $\mathbf{x}_i$. Based on these graphs, we aim to find the

projection directions on which the linked points in the intrinsic graph are close to each other while the linked points in the penalty graph are distant in the low-dimensional representation. Experiments on four image databases indicate good classification performance.

The remainder of this paper is organized as follows. In Section 2, we give an overview of the L1-graph construction. Section 3 introduces our proposed method. The experimental results on four image datasets are presented in Section 4. We conclude this paper in Section 5.

2 L1-graph

A graph efficiently characterizes pairwise relations, and the relations among visual images can be accurately estimated by sparse representation. Many studies demonstrate that minimizing the L1 linear reconstruction error naturally leads to a sparse representation for images [39,46,47], and the coefficients obtained from the L1 optimization essentially characterize the relations among images. Thus, it is natural to construct the adjacency graph from an L1 optimization problem. Suppose that N samples $\mathbf{x}_i \in \mathbb{R}^n$ are denoted as the matrix $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$. We can adaptively construct the adjacency graph $G_v = \{X, W\}$ with weight matrix $W$ by solving the following L1-norm optimization problem:

$$\arg\min_{\boldsymbol{\alpha}_i} \|\boldsymbol{\alpha}_i\|_1, \quad \text{s.t. } \mathbf{x}_i = D_i \boldsymbol{\alpha}_i, \; i = 1, \ldots, N \qquad (1)$$

where $D_i = [\mathbf{x}_1, \ldots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_N]$. The weight elements $W_{ij}$ are defined as follows:

$$W_{ij} = \begin{cases} \boldsymbol{\alpha}_i(j), & \text{if } i > j \\ \boldsymbol{\alpha}_i(j-1), & \text{if } i < j \\ 0, & \text{if } i = j \end{cases} \qquad (2)$$
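To make the graph construction concrete, the following is a minimal sketch (our own illustration, not the authors' code) that approximates the equality-constrained problem (1) with the common unconstrained Lasso relaxation $\min_{\boldsymbol{\alpha}} \|\mathbf{x}_i - D_i\boldsymbol{\alpha}\|^2 + \lambda\|\boldsymbol{\alpha}\|_1$ and fills $W$ as in Eq. (2); the function name and the regularization value are our own choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph_weights(X, lam=0.01):
    """Approximate Eqs. (1)-(2): sparse-code each sample over the
    remaining samples and place the coefficients into W.
    X: (n_features, N) data matrix, one sample per column."""
    n_features, N = X.shape
    W = np.zeros((N, N))
    for i in range(N):
        # Dictionary D_i: all samples except x_i (Eq. (1)).
        Di = np.delete(X, i, axis=1)
        # Lasso relaxation of the equality-constrained L1 problem.
        coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        coder.fit(Di, X[:, i])
        # Re-insert a zero at position i so that W[i, i] = 0 (Eq. (2)).
        W[i, :] = np.insert(coder.coef_, i, 0.0)
    return W

# Toy usage: 20 random 30-dimensional samples.
rng = np.random.default_rng(0)
W = l1_graph_weights(rng.standard_normal((30, 20)))
print(W.shape, np.count_nonzero(W))
```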

[Figure 1: Coefficient (y-axis) vs. Atom (x-axis, 0-200)]
Figure 1. Coefficient vs. sample on the PIE database.

It has been demonstrated [46] that, compared with graphs constructed by the k-nearest-neighbor or ε-ball methods, the L1-graph has three advantages: robustness to data noise, sparsity, and a datum-adaptive neighborhood. Thus, some researchers have combined the L1-graph with graph embedding algorithms for dimensionality reduction [49-51]. However, the L1-graph also has a notable disadvantage: it may impair the local discriminant structure embedded in the data. Taking 200 face images randomly selected from the CMU-PIE database as samples, we select one image as $\mathbf{x}_i$ and the remaining 199 images as $D_i$; Figure 1 plots the coefficients $W_{ij}$ obtained by solving objective function (1). In Figure 1, the first 20 images have the same class label as $\mathbf{x}_i$. An obvious observation from Figure 1 is that some coefficients corresponding to atoms whose labels differ from that of $\mathbf{x}_i$ are still large. This means the L1-graph has the potential to connect samples having different class labels, which impairs the local discriminant structure of the data and fails to characterize its intrinsic structure well. Thus, if we directly use the L1-optimization problem to build the adjacency graph without additional information such as labels, we cannot well preserve the discriminant structure embedded in nearby data.

3. Discriminative sparsity preserving projections

3.1 Motivation

[Figure 2: two panels; (a) Coefficient vs. Atom, (b) Distance vs. Atom]
Figure 2. Sample vs. atoms from different classes.

[Figure 3: two panels; (a) Coefficient vs. Atom, (b) Distance vs. Atom]
Figure 3. Relationship between sample and atoms from the same class.

[Figure 4: two panels; (a) sample, (b) atoms]
Figure 4. Sample vs. atoms having labels different from the sample.

Take PIE face images randomly selected from different classes as examples. We randomly select one image as $\mathbf{x}_i$ and the remaining images as atoms. Figures 2(a) and 2(b) show the non-negative coefficients and the distances between $\mathbf{x}_i$ and all atoms, respectively. We also select images having the same class label as $\mathbf{x}_i$ as atoms; Figures 3(a) and 3(b) show the non-negative coefficients and the distances between $\mathbf{x}_i$ and the corresponding atoms, respectively. In Figures 2(b) and 3(b), the lines with circle marks denote the distances between $\mathbf{x}_i$ and the atoms corresponding to non-zero coefficients. An interesting observation from Figures 2 and 3 is that many atoms corresponding to non-zero coefficients lie within the ε-radius of $\mathbf{x}_i$, which indicates that the L1-graph has the potential to connect nearby samples. Another observation is that the smaller non-zero coefficients correspond to atoms that are distant from $\mathbf{x}_i$, while the larger non-zero coefficients correspond to atoms that are close to $\mathbf{x}_i$. In real applications, the distance between images of the same class may be larger than the distance between images having different class labels due to the uneven distribution of data. This means that the neighborhood of $\mathbf{x}_i$ may include samples whose labels differ from that of $\mathbf{x}_i$. We select the distance between $\mathbf{x}_i$ and the atom corresponding to the smallest non-zero coefficient in Figure 3 as the ε-radius of $\mathbf{x}_i$; Figure 4 shows a sample $\mathbf{x}_i$ and the atoms that lie within the ε-radius of $\mathbf{x}_i$ and have labels different from that of $\mathbf{x}_i$. It is easy to see that these atoms are valuable for classification, because very important local discriminant information is embedded in these atoms and the sample $\mathbf{x}_i$. Motivated by the above analysis and observations, we adaptively construct both the intrinsic adjacency graph and the penalty graph in two steps. First, we construct the intrinsic adjacency graph by solving the L1 optimization problem over same-class atoms with a non-negative constraint. Second, for the penalty graph, we put an edge between each sample and its neighboring atoms whose labels differ from the sample's, and calculate the weights either by binary entries or by solving an L1-optimization problem with a non-negative constraint. After constructing the adjacency graphs, we seek projection directions such that linked point pairs in the intrinsic graph are as close as possible to each other while linked point pairs in the penalty graph are distant from each other in the low-dimensional space. In the following section, we first introduce the adjacency graph construction by the L1-minimization problem.

3.2. Adjacency Graph Construction

Motivated by DSNPE [51] and the analysis in Section 3.1, given N training samples $\mathbf{x}_i \in \mathbb{R}^p$, we construct the intrinsic adjacency graph $G_s = \{X, W\}$ with weight matrix $W$ by solving the following L1 optimization problem:

$$\min_{\mathbf{s}_i} \|\mathbf{x}_i - X_{l(\mathbf{x}_i)}\mathbf{s}_i\|^2 + \lambda\|\mathbf{s}_i\|_1, \quad \text{s.t. } \mathbf{s}_i \ge 0 \qquad (3)$$

where the column vectors of matrix $X_{l(\mathbf{x}_i)}$ are the training samples that have the same class label as $\mathbf{x}_i$, excluding $\mathbf{x}_i$ itself; $l(\mathbf{x}_i)$ denotes the class label of $\mathbf{x}_i$, and $\lambda \ge 0$ is a parameter. The elements $s_j^{l(\mathbf{x}_i)}$ of the vector $\mathbf{s}^{l(\mathbf{x}_i)}$ are defined as follows:

$$s_j^{l(\mathbf{x}_i)} = \begin{cases} s_{i,j}, & \text{if } i > j \\ s_{i,j-1}, & \text{if } i < j \\ 0, & \text{if } i = j \end{cases} \qquad (4)$$
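A hedged sketch of Eqs. (3)-(4) follows, again using a Lasso solver with a non-negativity constraint in place of the L1-Ls package used by the paper; the helper name and default λ are our own. It scatters each coefficient vector directly into the corresponding row of $W$, which also realizes the column-vector definition of $W_i$ given just below.

```python
import numpy as np
from sklearn.linear_model import Lasso

def intrinsic_graph(X, labels, lam=0.01):
    """Sketch of Eqs. (3)-(4): code each sample over same-class atoms
    only, with non-negative sparse coefficients.
    X: (n_features, N); labels: length-N integer array."""
    n_features, N = X.shape
    labels = np.asarray(labels)
    W = np.zeros((N, N))
    for i in range(N):
        # Same-class atoms, excluding x_i itself (matrix X_{l(x_i)}).
        idx = np.where(labels == labels[i])[0]
        idx = idx[idx != i]
        if idx.size == 0:
            continue
        coder = Lasso(alpha=lam, positive=True, fit_intercept=False,
                      max_iter=5000)
        coder.fit(X[:, idx], X[:, i])
        # Scatter coefficients back to the full index set; all
        # different-class entries and W[i, i] stay zero (Eq. (4)).
        W[i, idx] = coder.coef_
    return W
```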

Then we can define the column vector $W_i$ of the weight matrix $W$ as $W_i = [0, \ldots, 0, \mathbf{s}^{l(\mathbf{x}_i)}, 0, \ldots, 0]^T$.

According to the analysis in Section 3.1, we construct the penalty adjacency graph $G_d = \{X, B\}$ with weight matrix $B$ by the following three steps:

Step 1: Adaptively calculate the radius for each sample. Given a sample $\mathbf{x}_i$, we find the smallest non-zero coefficient $W_{ij}$ in the i-th row of the weight matrix $W$ and calculate the distance $d_{ij}$ between $\mathbf{x}_i$ and $\mathbf{x}_j$. This distance $d_{ij}$ is set as the radius $\varepsilon_i$ of $\mathbf{x}_i$.

Step 2: Construct the penalty graph. Given a sample $\mathbf{x}_i$, denote by $N(\varepsilon_i)$ the set of samples within the $\varepsilon_i$-radius of $\mathbf{x}_i$. We put an edge between $\mathbf{x}_i$ and $\mathbf{x}_j$ if $\mathbf{x}_j \in N(\varepsilon_i)$ and $l(\mathbf{x}_i) \ne l(\mathbf{x}_j)$.

Step 3: Calculate the weight matrix $B$. We can define the weight matrix in two ways: as a binary matrix or as a heat-kernel matrix. For the binary weight matrix, the elements $B_{ij}$ of $B$ are defined as follows:

$$B_{ij} = \begin{cases} 1, & \text{if } \mathbf{x}_j \in N(\varepsilon_i) \text{ and } l(\mathbf{x}_i) \ne l(\mathbf{x}_j) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

For the heat-kernel weight matrix, the elements $B_{ij}$ of $B$ are defined as follows:

$$B_{ij} = \begin{cases} z_i^j, & \text{if } \mathbf{x}_j \in N(\varepsilon_i) \text{ and } l(\mathbf{x}_i) \ne l(\mathbf{x}_j) \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

where the vector $\mathbf{z}_i$ collects these coefficients; we adaptively obtain $\mathbf{z}_i$ by solving the following L1-optimization objective function:

$$\min_{\mathbf{z}_i} \|\mathbf{x}_i - X_{\ne l(\mathbf{x}_i)}\mathbf{z}_i\|^2 + \lambda\|\mathbf{z}_i\|_1, \quad \text{s.t. } \mathbf{z}_i \ge 0 \qquad (7)$$

where the columns of matrix $X_{\ne l(\mathbf{x}_i)}$ are the samples that lie within the $\varepsilon_i$-radius of $\mathbf{x}_i$ and have class labels different from that of $\mathbf{x}_i$, and $\lambda \ge 0$ is a parameter.
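The three steps above can be sketched as follows (our own illustration under the same assumptions as the previous snippet; the binary flag switches between Eq. (5) for DSPP-B and Eqs. (6)-(7) for DSPP-H).

```python
import numpy as np
from sklearn.linear_model import Lasso

def penalty_graph(X, labels, W, lam=0.01, binary=True):
    """Sketch of Steps 1-3 and Eqs. (5)-(7): for each sample, set the
    radius eps_i from its weakest same-class connection in W, then link
    it to different-class samples inside that radius.
    X: (n_features, N); W: intrinsic weight matrix from Eq. (4)."""
    n_features, N = X.shape
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    diff = X[:, :, None] - X[:, None, :]
    dist = np.sqrt((diff ** 2).sum(axis=0))
    B = np.zeros((N, N))
    for i in range(N):
        nz = np.where(W[i] > 0)[0]
        if nz.size == 0:
            continue
        # Step 1: radius = distance to the atom carrying the smallest
        # non-zero intrinsic weight.
        j_min = nz[np.argmin(W[i, nz])]
        eps_i = dist[i, j_min]
        # Step 2: different-class samples inside the eps_i-radius.
        nbr = np.where((dist[i] <= eps_i) & (labels != labels[i]))[0]
        if nbr.size == 0:
            continue
        if binary:
            B[i, nbr] = 1.0                      # Eq. (5), DSPP-B
        else:
            coder = Lasso(alpha=lam, positive=True,
                          fit_intercept=False, max_iter=5000)
            coder.fit(X[:, nbr], X[:, i])        # Eq. (7)
            B[i, nbr] = coder.coef_              # Eq. (6), DSPP-H
    return B
```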

3.3. Objective function of the proposed method

[Figure 5: four panels (a)-(d); the label 类心 in the figure denotes the class centroid]

Figure 5. (a) The center point has five neighbors; points with the same color and shape belong to the same class. (b) The between-class graph connects nearby points with different labels. (c) The global graph connects nearby points with the same labels. (d) After the discriminative sparsity preserving projection, the margin between different classes is maximized.

As discussed above, our proposed approach aims to preserve the intrinsic structure and characterize the discriminant structure in the low-dimensional space. For convenience of description, we consider the problem of mapping the original high-dimensional images onto a line so that the linked points in the intrinsic graph are close to each other. Suppose $y_i$ is the map of image $\mathbf{x}_i$; a reasonable map optimizes the following objective function:

$$\arg\min_{y} \sum_{i,j} (y_i - y_j)^2 W_{i,j} \qquad (8)$$

To effectively discover the discriminant structure embedded in high-dimensional data and improve classification performance, we also consider mapping the original high-dimensional images onto a line so that the linked points in the penalty graph are distant from each other. Similarly, with $y_i$ the map of image $\mathbf{x}_i$, a reasonable map optimizes the following objective function:

$$\arg\max_{y} \sum_{i,j} (y_i - y_j)^2 B_{i,j} \qquad (9)$$

It is easy to see that objective functions (8) and (9) consider the local structure but ignore the global structure of the data. However, in real applications both global and local structures are important due to the complex distribution of data. To preserve the global structure and improve classification, we seek a low-dimensional representation in which the points $\mathbf{x}_j \in N(\varepsilon_i)$ with class labels different from that of $\mathbf{x}_i$, i.e. $l(\mathbf{x}_j) \ne l(\mathbf{x}_i)$, are close to their own class centroids in the low-dimensional space. Denoting by $y_i$ the map of image $\mathbf{x}_i$, a reasonable map optimizes the following objective function:

$$\arg\min_{y} \sum_{i=1}^{n} \sum_{l(\mathbf{x}_j) \ne l(\mathbf{x}_i)} \left(y_j - \mu_{l(\mathbf{x}_j)}\right)^2 w_j \qquad (10)$$

where $w_j = 1$ if $\mathbf{x}_j \in N(\varepsilon_i)$ and $w_j = 0$ otherwise, and $\mu_{l(\mathbf{x}_j)}$ denotes the mean of the $l(\mathbf{x}_j) = \tau$-th class in the low-dimensional space.

Suppose $\boldsymbol{\beta}$ denotes a projection vector. Substituting $y_i = \boldsymbol{\beta}^T\mathbf{x}_i$ into objective function (8), simple algebraic manipulation gives

$$\frac{1}{2}\sum_{i,j}(y_i - y_j)^2 W_{i,j} = \frac{1}{2}\sum_{i,j}\left(\boldsymbol{\beta}^T\mathbf{x}_i - \boldsymbol{\beta}^T\mathbf{x}_j\right)^2 W_{i,j} = \sum_{i}(\boldsymbol{\beta}^T\mathbf{x}_i)\Big(\sum_j W_{i,j}\Big)(\boldsymbol{\beta}^T\mathbf{x}_i) - \sum_{i,j}(\boldsymbol{\beta}^T\mathbf{x}_i)\, W_{i,j}\,(\boldsymbol{\beta}^T\mathbf{x}_j) = \boldsymbol{\beta}^T X(D_w - W)X^T\boldsymbol{\beta} = \boldsymbol{\beta}^T X L_w X^T \boldsymbol{\beta} \qquad (11)$$

where $L_w = D_w - W$ is the Laplacian matrix and $D_w$ is a diagonal matrix whose entries are the row sums of $W$, i.e., $D_w(i,i) = \sum_j W_{i,j}$.
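As a side note, the Laplacian of Eq. (11) is a one-liner in code; this small helper (our own, reused in the later sketches) forms it from any weight matrix.

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W, with D the diagonal matrix of row sums (Eq. (11))."""
    D = np.diag(W.sum(axis=1))
    return D - W
```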

Similarly, substituting $y_i = \boldsymbol{\beta}^T\mathbf{x}_i$ into objective functions (9) and (10), respectively, we have

$$\frac{1}{2}\sum_{i,j}(y_i - y_j)^2 B_{i,j} = \frac{1}{2}\sum_{i,j}\left(\boldsymbol{\beta}^T\mathbf{x}_i - \boldsymbol{\beta}^T\mathbf{x}_j\right)^2 B_{i,j} = \sum_{i}(\boldsymbol{\beta}^T\mathbf{x}_i)\Big(\sum_j B_{i,j}\Big)(\boldsymbol{\beta}^T\mathbf{x}_i) - \sum_{i,j}(\boldsymbol{\beta}^T\mathbf{x}_i)\, B_{i,j}\,(\boldsymbol{\beta}^T\mathbf{x}_j) = \boldsymbol{\beta}^T X(D_b - B)X^T\boldsymbol{\beta} = \boldsymbol{\beta}^T X L_b X^T \boldsymbol{\beta} \qquad (12)$$

where $L_b = D_b - B$ and $D_b$ is a diagonal matrix whose entries are the row sums of $B$, i.e., $D_b(i,i) = \sum_j B_{i,j}$, and

$$\sum_{i=1}^{n}\sum_{l(\mathbf{x}_j)\ne l(\mathbf{x}_i)}\left(y_j - \mu_{l(\mathbf{x}_j)}\right)^2 w_j = \sum_{i}\sum_{l(\mathbf{x}_j)\ne l(\mathbf{x}_i)} w_j\left(\boldsymbol{\beta}^T\mathbf{x}_j - \boldsymbol{\beta}^T\mathbf{m}_{l(\mathbf{x}_j)}\right)\left(\boldsymbol{\beta}^T\mathbf{x}_j - \boldsymbol{\beta}^T\mathbf{m}_{l(\mathbf{x}_j)}\right)^T = \boldsymbol{\beta}^T\Big[\sum_{i}\sum_{l(\mathbf{x}_j)\ne l(\mathbf{x}_i)} w_j\left(\mathbf{x}_j - \mathbf{m}_{l(\mathbf{x}_j)}\right)\left(\mathbf{x}_j - \mathbf{m}_{l(\mathbf{x}_j)}\right)^T\Big]\boldsymbol{\beta} = \boldsymbol{\beta}^T S_w \boldsymbol{\beta} \qquad (13)$$

where $S_w = \sum_{i}\sum_{l(\mathbf{x}_j)\ne l(\mathbf{x}_i)} w_j (\mathbf{x}_j - \mathbf{m}_{l(\mathbf{x}_j)})(\mathbf{x}_j - \mathbf{m}_{l(\mathbf{x}_j)})^T$ and $\mathbf{m}_{l(\mathbf{x}_j)}$ is the mean vector of the $l(\mathbf{x}_j) = \tau$-th class in the original high-dimensional space. Substituting Eqs. (11), (12) and (13) into objective functions (8), (9) and (10), respectively, these

objective functions respectively become:

$$\arg\min_{\boldsymbol{\beta}} \boldsymbol{\beta}^T X L_w X^T \boldsymbol{\beta} \qquad (14)$$

$$\arg\max_{\boldsymbol{\beta}} \boldsymbol{\beta}^T X L_b X^T \boldsymbol{\beta} \qquad (15)$$

$$\arg\min_{\boldsymbol{\beta}} \boldsymbol{\beta}^T S_w \boldsymbol{\beta} \qquad (16)$$

Our aim is to seek a projection vector $\boldsymbol{\beta}$ that simultaneously satisfies objective functions (14)-(16). There are several ways to build the objective function of DSPP by integrating (14)-(16); a representative one is the Rayleigh quotient form, formally similar to LDA. Motivated by this, we write the objective function of DSPP as follows:

$$\arg\min_{\boldsymbol{\beta}} \frac{\boldsymbol{\beta}^T\left(X L_w X^T + \rho S_w\right)\boldsymbol{\beta}}{\boldsymbol{\beta}^T X L_b X^T \boldsymbol{\beta}} \qquad (17)$$

where $\rho$ is a parameter that balances the local and global contributions for data classification. The learning procedure of our approach is illustrated in Figure 5. The optimal projection vector $\boldsymbol{\beta}$ that minimizes objective function (17) is given by the minimum non-zero eigenvalue solution of the generalized eigenvalue problem

$$\left(X L_w X^T + \rho S_w\right)\boldsymbol{\beta} = \gamma\, X L_b X^T \boldsymbol{\beta} \qquad (18)$$

Let the column vectors $\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_d$ be the solutions of equation (18), ordered according to their eigenvalues $0 < \gamma_1 \le \gamma_2 \le \cdots \le \gamma_d$. For an arbitrary image vector $\mathbf{x} \in \mathbb{R}^p$, the embedding is as follows:

$$\mathbf{x} \to \mathbf{y} = \boldsymbol{\phi}^T\mathbf{x}, \quad \boldsymbol{\phi} = [\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_d] \qquad (19)$$
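A minimal sketch of solving (18) and forming the embedding matrix of (19) with SciPy's generalized eigensolver; the small ridge term added to $XL_bX^T$ is our own numerical safeguard, not part of the paper's formulation.

```python
import numpy as np
from scipy.linalg import eig

def dspp_projection(X, Lw, Lb, Sw, rho=1e-3, d=10, reg=1e-6):
    """Sketch of Eqs. (17)-(19): solve the generalized eigenproblem
    (X Lw X^T + rho Sw) beta = gamma (X Lb X^T) beta and keep the d
    eigenvectors with the smallest positive eigenvalues.
    X: (p, N) data after PCA, so that X Lb X^T is well conditioned."""
    A = X @ Lw @ X.T + rho * Sw
    Bm = X @ Lb @ X.T + reg * np.eye(X.shape[0])  # ridge for stability
    vals, vecs = eig(A, Bm)
    vals, vecs = vals.real, vecs.real
    # Sort ascending and keep the smallest non-zero eigenvalues.
    order = [k for k in np.argsort(vals) if vals[k] > 1e-10][:d]
    return vecs[:, order]        # phi = [beta_1, ..., beta_d], Eq. (19)
```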

Note that in image recognition the dimensionality of the image vectors is very high relative to the number of training images. Therefore, the matrix $XL_bX^T$ is singular, and it is very difficult to directly calculate the optimal projection vector. In this case, we may first apply principal component analysis (PCA) to reduce the dimensionality of the image vectors and then perform DSPP. Algorithm 1 summarizes the whole procedure of performing classification by DSPP-H or DSPP-B. Note that DSPP-B means that the weight matrix of the penalty graph is defined as a binary matrix, while DSPP-H means that it is defined as a heat-kernel matrix.

Algorithm 1. DSPP-H/DSPP-B for classification
Input: Training data matrix $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ with the corresponding labels $l(\mathbf{x}_i)$; test data $X^*$.
1. Obtain $X_p$ by PCA, i.e. $X_p = W_{PCA}^T X$, where $W_{PCA}$ denotes the projection matrix learned by PCA.
2. Obtain the optimizer $\mathbf{s}_i$ of problem (3) and calculate the weight matrix $W$ by Eq. (4).
3. Obtain the weight matrix $B$ by Eq. (5) for DSPP-B, or by Eqs. (6) and (7) for DSPP-H.
4. Compute the optimal projection matrix $\boldsymbol{\phi}$ for objective function (17).
5. Calculate the low-dimensional embedding $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$ of the training data by (19).
6. Calculate the low-dimensional embedding $Y^*$ of the test data by (19).
7. Classify the test data according to the minimum Euclidean distance.
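Putting the pieces together, a rough end-to-end sketch of Algorithm 1 might look like the following, reusing the intrinsic_graph, penalty_graph, graph_laplacian and dspp_projection helpers sketched above; all names, default values and the PCA call are our own assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def dspp_classify(X_train, y_train, X_test, rho=1e-3, d=10,
                  pca_ratio=0.95, binary=True):
    """Sketch of Algorithm 1 (DSPP-B if binary=True, else DSPP-H).
    Rows of X_train / X_test are samples."""
    y = np.asarray(y_train)
    # Step 1: PCA so that X Lb X^T in Eq. (18) is non-singular.
    pca = PCA(n_components=pca_ratio).fit(X_train)
    Xp = pca.transform(X_train).T               # columns are samples
    Xq = pca.transform(X_test).T
    # Steps 2-3: adaptively built intrinsic and penalty graphs.
    W = intrinsic_graph(Xp, y)
    B = penalty_graph(Xp, y, W, binary=binary)
    Lw, Lb = graph_laplacian(W), graph_laplacian(B)
    # Eq. (13): scatter of penalty-graph neighbours about their own
    # class means (w_j = 1 iff x_j in N(eps_i) with a different label).
    means = {c: Xp[:, y == c].mean(axis=1) for c in np.unique(y)}
    Sw = np.zeros((Xp.shape[0], Xp.shape[0]))
    for i, j in zip(*np.nonzero(B)):
        dvec = Xp[:, j] - means[y[j]]
        Sw += np.outer(dvec, dvec)
    # Step 4: projection matrix phi from the eigenproblem (18).
    phi = dspp_projection(Xp, Lw, Lb, Sw, rho=rho, d=d)
    # Steps 5-7: embed both sets and classify each test image by the
    # minimum Euclidean distance to a training embedding.
    Y, Yq = phi.T @ Xp, phi.T @ Xq
    dists = ((Yq[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    return y[dists.argmin(axis=1)]
```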

4. Experimental Results

4.1 Recognition on image databases

In this section, to evaluate the performance of our proposed approaches DSPP-H and DSPP-B for classification, we compare them with six dimensionality reduction methods, including LPP [8], SPP1 [49], SPP2 [49], DSNPE1 [51], DSNPE2 [51], and SRC-DP [44], on four public image databases (YaleB, AR, PIE and USPS). At the classification stage, we employ the Euclidean distance to measure similarity and classify each test image as the class of the training image having the minimum distance in the reduced space. In the experiments, it is very difficult to select the suitable principal components for each method due to the complex and unknown structure of the images. Thus, on every database, we select the principal components with the best classification accuracy for each method and fix this choice (i.e. the PCA ratio) for the other experiments on that database. Similarly, we use the same procedure to select the parameter ρ for our approaches on each database. Moreover, we solve the L1-norm optimization problems, i.e. objective functions (3) and (7), with the L1-Ls Matlab package [52].

Table 1. The average recognition accuracy (%) of eight approaches and the corresponding standard deviation (std) on the PIE database.

Methods   | DSPP-B | DSPP-H | DSNPE1 | DSNPE2 | SPP1  | SPP2  | LPP   | SRC-DP
Average   | 94.94  | 94.61  | 95.92  | 95.81  | 93.98 | 93.06 | 94.18 | 94.71
Std       | ±4.62  | ±4.53  | ±4.70  | ±6.81  | ±5.89 | ±5.99 | ±5.68 | ±6.38
PCA ratio | 0.95   | 0.95   | 1      | 1      | 1     | 1     | 0.98  | 0.99

The CMU-PIE database [53] contains 68 subjects with 41368 face images in total. The face images were captured by 13 synchronized cameras and 21 flashes under varying pose, illumination and expression. We select the pose-29 images as the gallery, including 24 samples for each individual. Each image is manually cropped and resized to 64 × 64 pixels. In one experiment, we select the first 12 images per person for training and the remaining 12 images for testing, and then select the parameters giving the best classification accuracy for each method. After obtaining the parameters, we randomly select 12 images per person for training and the corresponding remaining images for testing, and repeat this procedure 10 times. Table 1 lists the average recognition accuracy of the eight approaches and the corresponding standard deviation on the PIE database. Figure 6 shows some images of one person. Figure 7 plots the recognition accuracy vs. number of projection vectors for one experiment. In the experiments, the parameter ρ is 0.0009 for DSPP-H and DSPP-B.

Figure 6. Some sample images of one person in the PIE database.

[Figure 7: recognition accuracy (y-axis) vs. number of projection vectors (x-axis, 5-45); curves for LPP, DSNPE1, DSNPE2, SPP1, SPP2, SRC-DP, DSPP-H, DSPP-B]

Figure 7. Recognition accuracy vs. number of projection vectors on the PIE database.

The AR face database [54] contains over 4000 color face images of 126 people (56 women and 70 men), including frontal views of faces with different facial expressions, lighting conditions and occlusions. The pictures of most persons were taken in two sessions (separated by two weeks). Each session contains 13 color images, and 120 individuals (65 men and 55 women) participated in both sessions. The facial portion of each image is manually cropped and then normalized to the size of 50 × 40 in the experiments. In one experiment, we select the images from the first session for training and the remaining images for testing, and select the parameter values with the best performance for each method. After obtaining the parameters, we perform another 9 experiments by randomly selecting 13 images per person for training and the remaining images for testing. Table 2 lists the average recognition accuracy and the corresponding standard deviation on this database. Figure 8 shows some images of one person. Figure 9 plots the recognition accuracy vs. number of projection vectors for one experiment. In the experiments, the parameter ρ is 0.0002 for DSPP-H and DSPP-B.

Figure 8. The first-session images of one subject in the AR database.

[Figure 9: recognition accuracy (y-axis) vs. number of projection vectors (x-axis, 5-90); curves for LPP, DSNPE1, DSNPE2, SPP1, SPP2, SRC-DP, DSPP-H, DSPP-B]

Figure 9. Recognition accuracy vs. number of projection vectors on the AR database.

Table 2. The average recognition accuracy (%) of eight approaches and the corresponding standard deviation (std) on the AR database.

Methods   | DSPP-B | DSPP-H | DSNPE1 | DSNPE2 | SPP1  | SPP2  | LPP   | SRC-DP
Average   | 98.26  | 98.48  | 98.48  | 95.43  | 95.53 | 96.71 | 94.14 | 97.90
Std       | ±1.47  | ±1.47  | ±1.57  | ±4.59  | ±3.37 | ±3.22 | ±5.69 | ±1.69
PCA ratio | 0.96   | 0.95   | 0.95   | 0.96   | 0.89  | 0.94  | 0.94  | 0.96

Table 3. The average recognition accuracy (%) of eight approaches and the corresponding standard deviation (std) on the YaleB database.

Methods   | DSPP-B | DSPP-H | DSNPE1 | DSNPE2 | SPP1  | SPP2  | LPP   | SRC-DP
Average   | 88.36  | 87.98  | 86.64  | 84.20  | 85.56 | 86.34 | 84.59 | 87.97
Std       | ±4.52  | ±4.76  | ±3.78  | ±5.70  | ±4.78 | ±4.96 | ±4.11 | ±3.44
PCA ratio | 0.99   | 0.99   | 1      | 1      | 0.99  | 0.99  | 0.99  | 0.99

The Extended Yale B database [55] contains 2414 frontal face images of 38 individuals. For each individual, 64 pictures were taken under various laboratory-controlled lighting conditions. In our experiments, images of 31 individuals are selected as the gallery, and each image is resized to 32 × 32. We randomly select 10 images per person for training and the remaining images for testing, and repeat this procedure 10 times. In the experiments, the parameter values are determined by the top recognition accuracy in the first experiment. Table 3 lists the average recognition accuracy of the eight methods and the corresponding standard deviation on this database. Figure 10 shows some images of one person. Figure 11 plots the recognition accuracy vs. number of projection vectors for one experiment. In the experiments, the parameter ρ is 0.0002 for DSPP-H and DSPP-B.

Figure 10. Some sample images of one subject in the Yale-B database.

[Figure 11: recognition accuracy (y-axis) vs. number of projection vectors (x-axis, 5-80); curves for LPP, DSNPE1, DSNPE2, SPP1, SPP2, SRC-DP, DSPP-H, DSPP-B]

Figure 11. Recognition accuracy vs. number of projection vectors on the Yale-B database.

Table 4. The average recognition accuracy (%) of eight approaches and the corresponding standard deviation (std) on the USPS database.

Methods   | DSPP-B | DSPP-H | DSNPE1 | DSNPE2 | SPP1  | SPP2  | LPP   | SRC-DP
Average   | 93.53  | 93.75  | 93.91  | 93.57  | 92.27 | 92.38 | 93.58 | 92.85
Std       | ±0.29  | ±0.22  | ±0.35  | ±0.42  | ±0.28 | ±0.34 | ±0.58 | ±0.43
PCA ratio | 0.81   | 0.81   | 0.8    | 0.8    | 0.8   | 0.82  | 0.81  | 0.86

The USPS handwritten digits data set [56] is composed of 7291 training images and 2007 test images of size 16 × 16. Each image is represented by a 256-dimensional vector. In our experiments, 200 images per class are randomly selected for training and the 2007 test images are used for testing. We repeat this procedure 10 times. In the experiments, the parameter values are selected in the same way as in the AR experiments above. Table 4 lists the average recognition accuracy and the corresponding standard deviation. Figure 12 shows some images of one digit. Figure 13 plots the recognition accuracy vs. number of projection vectors for one experiment. In the experiments, the parameter ρ is 0.0001 for DSPP-H and DSPP-B.

Figure 12. Some sample images of one digit in the USPS database.

[Figure 13: recognition accuracy (y-axis) vs. number of projection vectors (x-axis, 5-35); curves for LPP, DSNPE1, DSNPE2, SPP1, SPP2, SRC-DP, DSPP-H, DSPP-B]

Figure 13. Recognition accuracy vs. number of projection vectors on the USPS database.

From Tables 1 to 4 and Figures 7, 9, 11, and 13, we can see that:

1. SPP is not always better than LPP, for example on the PIE and USPS databases. This is probably because, compared with the graph constructed by k-nearest neighbors, the unsupervised L1-graph cannot well characterize the intrinsic structure of images. As analyzed in Section 2, the L1-graph has the potential to connect images having different class labels, which impairs the very important local discriminant structure embedded among nearby data with different class labels.

2. DSNPE is overall superior to SPP. This is probably because DSNPE better characterizes the intrinsic structure by combining the L1-graph with the label information of samples; another reason may be that DSNPE better discovers the discriminant structure of images. In some experiments DSNPE is inferior to LPP, probably because DSNPE mainly emphasizes the global discriminant structure, which may impair the very important local discriminant structure.

3. SRC-DP is superior to SPP, probably because SRC-DP better characterizes the intrinsic structure of data. Moreover, SRC-DP is inferior to DSNPE, probably because SRC-DP does not encode the discriminant structure well; another reason may be that SRC-DP suits SRC rather than the nearest-neighbor classifier.

4. Our proposed approaches DSPP-B and DSPP-H are superior to the other approaches. This is probably because DSPP-B and DSPP-H well characterize both the intrinsic structure and the local discriminant structure, which is very important for image classification. Another reason is that our approaches well preserve the global intrinsic structure of images. Moreover, DSPP-B is slightly superior to DSPP-H except in the experiments on the USPS database, which indicates that nearby images are equally important for classification.

5. DSPP-H and DSPP-B have better recognition accuracy than the other approaches under the same number of projection vectors when the number of projection vectors is larger than 5. This indicates that local discriminant information is very important and that our approaches are effective for image classification.

4.2 Discussion of parameter

In this section, we discuss the effect of the parameter ρ on classification performance. In the experiments, we randomly select 13 images per person for training and the remaining images for testing on the AR database. Similarly, on the PIE database we randomly select 12 images per person for training and the remaining images for testing. As analyzed in Section 4.1, DSPP-B is overall superior to DSPP-H for face recognition; thus, in the following experiments we use DSPP-B to evaluate the effect of the parameter ρ. Figures 14 and 15 plot the recognition accuracy of DSPP-B vs. parameter ρ on the AR and PIE databases, respectively. We can see that when ρ is 0, the recognition accuracy of DSPP-B is not the best. This indicates that (10), i.e. the global geometric structure, is important for image classification. Moreover, Figures 14 and 15 indicate that the best ρ differs across databases. We will study this problem in future work, following the meaningful works [57,58].

Figure 14. Recognition accuracy of DSPP-B vs. ρ (from 0 to 1 × 10⁻³) on the AR database.


Figure 15. Recognition accuracy of DSPP-B vs. ρ (from 0 to 0.01) on the PIE database.

5. Conclusion

This paper presents a new dimensionality reduction approach, called discriminative sparsity preserving projections (DSPP), obtained by combining manifold learning and sparse representation. DSPP adaptively constructs both the intrinsic structure and the penalty structure of the data via an L1-optimization problem, and then integrates the global within-class structure into the discriminant manifold learning objective function for dimensionality reduction. Experiments on the PIE, AR, extended Yale-B, and USPS databases overall demonstrate the performance advantage of the proposed approach over the others.

Acknowledgements

We would like to thank the handling Associate Editor and the anonymous reviewers for their constructive comments on this paper. This work is supported by the National Natural Science Foundation of China under Grant 61271296, the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2012JM8002, the China Postdoctoral Science Foundation under Grant 2012M521747, the 111 Project of China (B08038), the Fundamental Research Funds for the Central Universities of China under Grant BDY21, and the Open Project Program of the State Key Laboratory of CAD&CG (No. A1407), Zhejiang University, China.

References

[1] M. Turk and A. P. Pentland, Face recognition using eigenfaces, In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 586-591, June 1991.
[2] J. Li and D. Tao, Simple exponential family PCA, IEEE Trans. Neural Networks and Learning Systems, vol. 24, no. 3, pp. 485-497, 2013.
[3] J. Li and D. Tao, On preserving original variables in Bayesian PCA with application to image analysis, IEEE Trans. Image Processing, vol. 21, no. 12, pp. 4830-4843, 2012.
[4] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second ed., Academic Press, 1990.
[5] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, 1997.
[6] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[7] S.T. Roweis and L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[8] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 328-340, 2005.
[9] J. Tenenbaum, V. de Silva, and J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[10] D. Donoho and C. Grimes, Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences of the United States of America, pp. 5591-5596, 2003.
[11] S. Xiang, F. Nie, C. Zhang, and C. Zhang, Nonlinear dimensionality reduction with local spline embedding, IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, pp. 1285-1298, 2009.
[12] X. He, D. Cai, S. Yan, and H.J. Zhang, Neighborhood preserving embedding, In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 1208-1213, Oct. 2005.
[13] D. Cai, X. He, and J. Han, Isometric projection, In Proc. AAAI Conference on Artificial Intelligence (AAAI), pp. 528-533, 2007.
[14] H.T. Chen, H.W. Chang, and T.L. Liu, Local discriminant embedding and its variants, In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 846-853, 2005.
[15] M. Sugiyama, Local Fisher discriminant analysis for supervised dimensionality reduction, In Proc. Int'l Conf. Machine Learning (ICML), pp. 905-912, 2006.
[16] H. Wang, S. Chen, Z. Hu, and W. Zheng, Locality-preserved maximum information projection, IEEE Trans. Neural Networks, vol. 19, no. 4, pp. 571-585, 2008.
[17] G. Lu, Z. Lin, and Z. Jin, Face recognition using discriminant locality preserving projections based on maximum margin criterion, Pattern Recognition, vol. 43, no. 10, pp. 3572-3579, 2010.
[18] D. Cai, X. He, K. Zhou, J. Han, and H. Bao, Locality sensitive discriminant analysis, In Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 708-713, 2007.
[19] T. Zhang, D. Tao, X. Li, and J. Yang, Patch alignment for dimensionality reduction, IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, pp. 1299-1313, 2009.
[20] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, 2007.
[21] S. Wang, S. Yan, J. Yang, C. Zhou, and X. Fu, A general exponential framework for dimensionality reduction, IEEE Trans. Image Processing, vol. 23, no. 2, pp. 920-930, 2014.
[22] B. Li, C. Zheng, and D.S. Huang, Locally linear discriminant embedding: an efficient method for face recognition, Pattern Recognition, vol. 41, no. 12, pp. 3813-3821, 2008.
[23] J. Gui, W. Jia, L. Zhu, S.L. Wang, and D.S. Huang, Locality preserving discriminant projections for face and palmprint recognition, Neurocomputing, vol. 73, no. 8, pp. 2696-2707, 2010.
[24] J. Yang, S. Yan, Y. Fu, X. Li, and T. Huang, Non-negative graph embedding, In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1-8, June 2008.
[25] N. Guan, D. Tao, Z. Luo, and B. Yuan, Non-negative patch alignment framework, IEEE Trans. Neural Networks, vol. 22, no. 8, pp. 1218-1230, 2011.
[26] Y. Chen, J. Zhang, D. Cai, W. Liu, and X. He, Nonnegative local coordinate factorization for image representation, IEEE Trans. Image Processing, vol. 22, no. 3, pp. 969-979, 2013.
[27] Q. Gao, J. Liu, H. Zhang, X. Gao, and K. Li, Joint global and local structure discriminant analysis, IEEE Trans. Information Forensics and Security, vol. 8, no. 4, pp. 626-635, 2013.
[28] C. Hou, C. Zhang, Y. Wu, and Y. Jiao, Stable local dimensionality reduction approaches, Pattern Recognition, vol. 42, no. 9, pp. 2054-2066, 2009.
[29] Q. Gao, H. Xu, Y. Li, and D. Xie, Two-dimensional supervised local similarity and diversity projection, Pattern Recognition, vol. 43, no. 10, pp. 3359-3363, 2010.
[30] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, Trace ratio vs. ratio trace for dimensionality reduction, In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1-8, June 2007.
[31] Y. Jia, F. Nie, and C. Zhang, Trace ratio problem revisited, IEEE Trans. Neural Networks, vol. 20, no. 4, pp. 729-735, 2009.
[32] Q. Gao, J. Ma, H. Zhang, X. Gao, and Y. Liu, Stable orthogonal local discriminant embedding for linear dimensionality reduction, IEEE Trans. Image Processing, vol. 22, no. 7, pp. 2521-2530, 2013.
[33] J. Gui, S. Wang, and Y.K. Lei, Multi-step dimensionality reduction and semi-supervised graph-based tumor classification using gene expression data, Artificial Intelligence in Medicine, vol. 50, no. 3, pp. 181-191, 2010.
[34] Y. Song, F. Nie, C. Zhang, and S. Xiang, A unified framework for semi-supervised dimensionality reduction, Pattern Recognition, vol. 41, no. 9, pp. 2789-2799, 2008.
[35] M. Balasubramanian and E. Schwartz, The Isomap algorithm and topological stability, Science, vol. 295, no. 5552, p. 7, 2002.
[36] R. Wang, S. Shan, X. Chen, J. Chen, and W. Gao, Maximal linear embedding for dimensionality reduction, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1776-1792, 2011.
[37] B. Yang and S. Chen, Sample-dependent graph construction with application to dimensionality reduction, Neurocomputing, vol. 74, no. 12, pp. 301-314, 2010.
[38] H. Zhao, S. Sun, Z. Jing, and J. Yang, Local structure based supervised feature extraction, Pattern Recognition, vol. 39, no. 8, pp. 1546-1550, 2006.
[39] J. Wright, A. Yang, S. Sastry, and Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, 2009.
[40] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai, Graph regularized sparse coding for image representation, IEEE Trans. Image Processing, vol. 20, no. 5, pp. 1327-1336, 2011.
[41] S. Gao, I. Tsang, L. Chia, and P. Zhao, Local features are not lonely - Laplacian sparse coding for image classification, In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 3555-3561, 2010.
[42] T. Zhou and D. Tao, Double shrinking sparse dimension reduction, IEEE Trans. Image Processing, vol. 22, no. 1, pp. 244-257, 2013.
[43] J. Gui, D. Tao, Z. Sun, Y. Luo, X. You, and Y.Y. Tang, Group sparse multiview patch alignment framework with view consistency for image classification, IEEE Trans. Image Processing, vol. 23, no. 7, pp. 3126-3137, 2014.
[44] J. Yang, D. Chu, L. Zhang, Y. Xu, and J. Yang, Sparse representation classifier steered discriminative projection with applications to face recognition, IEEE Trans. Neural Networks and Learning Systems, vol. 24, no. 7, pp. 1023-1035, 2013.
[45] L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, and N. Yu, Non-negative low rank and sparse graph for semi-supervised learning, In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 2328-2335, 2012.
[46] S. Yan and H. Wang, Semi-supervised learning by sparse representation, In Proc. SIAM Int'l Conf. on Data Mining (SDM), pp. 792-801, 2009.
[47] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang, Learning with L1-graph for image analysis, IEEE Trans. Image Processing, vol. 19, no. 4, pp. 858-866, 2010.
[48] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE, vol. 98, no. 6, pp. 1031-1044, 2010.
[49] L. Qiao, S. Chen, and X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognition, vol. 43, no. 1, pp. 331-341, 2010.
[50] F. Zang and J. Zhang, Discriminative learning by sparse representation for classification, Neurocomputing, vol. 74, no. 6, pp. 2176-2183, 2011.
[51] J. Gui, Z. Sun, W. Jia, R. Hu, Y. Lei, and S. Ji, Discriminant sparse neighborhood preserving embedding for face recognition, Pattern Recognition, vol. 45, no. 8, pp. 2884-2893, 2012.
[52] K. Koh, S. Kim, and S. Boyd, L1-ls: a Matlab solver for large-scale l1-regularized least squares problems, http://www.stanford.edu/~boyd/l1_ls/.
[53] T. Sim, S. Baker, and M. Bsat, The CMU pose, illumination, and expression (PIE) database, In Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 46-51, 2002.
[54] A.M. Martinez and R. Benavente, The AR face database, CVC Technical Report, 1998.
[55] K. Lee, J. Ho, and D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684-698, 2005.
[56] J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550-554, 1994.
[57] B. Geng, D. Tao, C. Xu, L. Yang, and X.S. Hua, Ensemble manifold regularization, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1227-1233, 2012.
[58] J. Gui, Z. Sun, J. Cheng, S. Ji, and X. Wu, How to estimate the regularization parameter for spectral regression discriminant analysis and its kernel version?, IEEE Trans. Circuits and Systems for Video Technology, vol. 24, no. 2, pp. 211-223, 2014.

Highlights

(1) We analyzed the vertex points of the penalty graph using sparse representation.
(2) We adaptively constructed both the penalty adjacency graph and the intrinsic graph.
(3) Our method simultaneously considers the global and local geometric structure of data.

Author Biography

Quanxue Gao received the B.Eng. degree in 1998 from Xi'an Highway University, Xi'an, China, the M.S. degree in 2001 from Gansu University of Science and Technology, Lanzhou, China, and the Ph.D. degree from Northwestern Polytechnical University, Xi'an, China, in 2005. From 2006 to 2007, he was an associate research assistant at The Hong Kong Polytechnic University. He is currently a professor at Xidian University, Xi'an, China. His research interests include manifold learning, biometric recognition, sparse representation, dimensionality reduction, and statistical pattern recognition.

Yunfang Huang received the B.Eng. degree from Xidian University, Xi'an, China, in 2012. She is currently working toward the M.S. degree at Xidian University, Xi'an, China. Her research interests include manifold learning, dimensionality reduction and sparse representation.

Hailin Zhang received the Ph.D. degree from Xidian University, Xi'an, China, in 1991. He is now a senior professor at Xidian University, where he is currently the Dean of the school, the Director of the Key Laboratory in Wireless Communications sponsored by the China Ministry of Information Technology, and a field leader in Telecommunications and Information Systems. His research interests include key transmission technologies and standards for broadband wireless communications (B3G, 4G and next-generation broadband wireless access systems) and pattern recognition.

Xin Hong is currently working toward the B.Eng. degree at Xidian University, Xi'an, China. Her research interests include dimensionality reduction and sparse representation.

Kui Li received the M.S. degree from Xi'an Polytechnic University, China, in 2014. His research interests include pattern recognition and sparse representation.

Yong Wang received the Ph.D. degree from Xi'an Jiaotong University, Xi'an, China, in 2006. His research interests include biometric recognition, image processing, and sparse representation.