Optimized projections for nonnegative linear reconstruction classification


Jie Xu¹, Shengli Xie¹

¹ Faculty of Automation, Guangdong University of Technology, Guangzhou 510006, P. R. China, e-mail: [email protected]

Abstract: The nonnegative linear reconstruction measure produces sparse nonnegative representation coefficients that are concentrated on the training samples most similar to the query. Since samples sharing a class label are generally more similar than samples from different classes, we develop a nonnegative linear reconstruction based classifier (NLRC). To enhance classification performance, we design a feature extractor based on the same decisional rule as NLRC, so that the extracted features are the most suitable for NLRC. The proposed feature extractor, nonnegative linear reconstruction projection (NLRP), and NLRC share a common decisional rule, and thus the features extracted by NLRP fit NLRC well in theory. The feasibility and effectiveness of the proposed recognition system are verified on the Yale face database, the AR face database, the PolyU finger knuckle print database, and the PolyU palmprint database with promising results.

Keywords: nonnegative linear reconstruction measure; sparsity; recognition system; dimensionality reduction.

1 Introduction

Feature extraction, feature selection, and classifier design are three main components of pattern recognition. For a general recognition problem, one first extracts and selects statistically significant (dominant, leading) features that allow different classes or clusters to be discriminated, then projects all samples into the obtained feature space and performs classification. Feature extraction and classification are thus two independent phases whose connection is loose. If the features extracted in the first phase are the most suitable ones for the classifier used in the second phase, recognition accuracy should improve considerably. A common way to achieve this is to design the feature extractor based on the decisional rule of the selected classifier [1-3]. Yang et al. [1] proposed the local-mean based nearest neighbor classifier (LM-NNC) and then developed a local-mean based NN discriminant analysis (LM-NNDA). The central idea of LM-NNDA is to extract the features most suitable for LM-NNC, and LM-NNDA plus LM-NNC achieves encouraging recognition performance. Inspired by this, Chen et al. took advantage of linear regression classification (LRC) [4] and presented a reconstructive discriminant analysis (RDA) algorithm [2] designed to fit LRC well. The above-mentioned methods verify that a recognition system consisting of a classifier and its matching feature extractor can improve the overall recognition performance considerably.

Recently, Zhang et al. presented the linear reconstruction measure steered nearest neighbor (LRM-NN) classification framework [5]. The linear reconstruction measure (LRM) uses linear regression as a tool to realize similarity measurement in classifier design. LRM represents a query sample as a linear combination of all training samples and determines the nearest neighbors of the query sample by sorting the coefficients of the minimum $L_2$-norm-error linear reconstruction [5]. In LRM-NN, a query sample $q$ is coded as $q = Xw + e$ with $e$ minimal, where $X = [x_1, x_2, \ldots, x_n]$ is the training sample set and $e$ denotes the linear reconstruction error. The query sample $q$ is assigned to the class of the $x_i$ with the largest $w_i$ or $w_i x_i$. Similarly, [6] and [7] assign the query sample $q$ to the class of the $x_i$ with the smallest reconstruction error $\|q - w_i x_i\|$. Observing the representation coefficients produced by the algorithms of [5-7], we find that they take both signs. This means the query sample $q$ is represented as a linear combination of all training samples involving both additive and subtractive interactions: samples associated with positive coefficients make a positive contribution to the reconstruction of $q$ and vice versa. However, this leads to the case of "cancel each other out" [8]. A simple but effective remedy is to impose a nonnegativity constraint [9-12] on the representation coefficients. As a result, the query sample $q$ is represented as a nonnegative linear combination of a subset of the training samples. The training samples associated with nonzero reconstruction coefficients share some features with the query sample $q$ and can therefore be treated as its similar neighbors. Generally speaking, samples with the same class label should share more common features, which means the correct class yields the minimum reconstruction residual. Based on this, we propose the nonnegative linear reconstruction based classifier (NLRC), and then use the NLRC decisional rule to design a matching feature extractor, called nonnegative linear reconstruction projection (NLRP). NLRP uses the class reconstruction residuals to characterize the intra-class and inter-class reconstruction scatters. By maximizing the inter-class reconstruction residuals of the samples while simultaneously minimizing their intra-class reconstruction residuals, NLRP extracts the features most suitable for NLRC. In this way, the proposed NLRP and NLRC are seamlessly integrated into a recognition system.

The remainder of the paper is organized as follows. Section 2 details the NLRC algorithm and its variations. Section 3 presents the NLRP algorithm. Connections with related works are analyzed in Section 4. Section 5 shows the experimental results, and Section 6 concludes the paper.

2 Nonnegative linear reconstruction based classifier and its variations

Suppose there are $c$ known pattern classes. Define $X_i$ as the matrix formed by the training samples of Class $i$, i.e., $X_i = [x_{i1}, x_{i2}, \ldots, x_{in_i}] \in \mathbb{R}^{d \times n_i}$, where $n_i$ is the number of training samples from Class $i$. Collecting all the $X_i$ gives the data matrix $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{d \times N}$, where $N = \sum_{i=1}^{c} n_i$; $X$ is composed of the entire set of training samples.

2.1 Nonnegative linear reconstruction based classifier

A given query sample $q$ can be represented over an overcomplete dictionary whose basis vectors are the training samples themselves, i.e., $q = Xw$. Representation coefficients of mixed sign have a direct interpretation in the sense of reconstruction: samples with positive coefficients make a positive contribution to the reconstruction of $q$ and vice versa. Suppose instead that all coefficients are constrained to be nonnegative, as follows:

$$ w^* = \arg\min_{w} \|q - Xw\|_2 \quad \text{s.t.}\ w_i \ge 0,\ i = 1, 2, \ldots, N, \tag{1} $$

where $w = [w_1, w_2, \ldots, w_N]$. By solving Eq. (1), one obtains a sparse nonnegative coefficient vector $w^*$. For example, given a query sample set $Q = [q_1, q_2, q_3]$ and a training sample set $X = [x_1, x_2, \ldots, x_8]$, the optimal nonnegative linear reconstruction coefficient matrix $w = [w_1, w_2, w_3]$ can be obtained by solving Eq. (1) column by column, and computing $Xw$ approximately reconstructs $Q$, i.e., $Q \approx Xw$. In the resulting $8 \times 3$ coefficient matrix $w$, only a few entries of each column are nonzero, and all entries are nonnegative.

Observing the linear reconstruction coefficient matrix $w$, we find that the obtained reconstruction coefficient vectors $w_i$, $i = 1, 2, 3$, are indeed nonnegative and sparse. The linear combination of the samples associated with the nonzero coefficients approximately represents the query sample $q$. This means the query sample $q$ and the samples associated with nonzero coefficients share some common features. Therefore, these samples can be viewed as the similar neighbors of the query sample $q$, and the corresponding nonnegative coefficients can be treated as similarity scalars between the query sample $q$ and its similar neighbors.

Let us solve Eq. (1) to obtain the sparsest nonnegative coefficient vector $w^*$. We then design the nonnegative linear reconstruction based classifier (NLRC) as follows. For Class $i$, let $\delta_i : \mathbb{R}^N \to \mathbb{R}^N$ be the characteristic function that selects the coefficients associated with Class $i$: for $w \in \mathbb{R}^N$, $\delta_i(w)$ is a vector whose only nonzero entries are the entries of $w$ associated with Class $i$. Using only the coefficients associated with Class $i$, one can reconstruct the given query sample $q$ as $v_i = X\delta_i(w^*)$. The distance between the query sample $q$ and its Class-$i$ prototype $v_i$ is defined by

$$ r_i(q) = \|q - v_i\|_2 = \|q - X\delta_i(w^*)\|_2. \tag{2} $$

The NLRC decisional rule is:

$$ \mathrm{Identity}(q) = \mathrm{Class}\big(\min_{i = 1, 2, \ldots, c} r_i(q)\big). \tag{3} $$
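As a concrete illustration, the following is a minimal sketch of the NLRC decision rule of Eqs. (1)-(3), assuming SciPy's `nnls` as a stand-in for MATLAB's `lsqnonneg` (which the paper itself uses); the function name, the column-wise data layout, and the label array are our own illustrative choices rather than part of the original algorithm description.

```python
import numpy as np
from scipy.optimize import nnls  # solves min_w ||q - X w||_2 subject to w >= 0, i.e., Eq. (1)

def nlrc_classify(X, labels, q):
    """NLRC sketch: X is d x N with one training sample per column,
    labels is a length-N array of class ids, q is a length-d query vector."""
    w, _ = nnls(X, q)                               # sparse nonnegative coefficient vector w*
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta_w = np.where(labels == c, w, 0.0)     # delta_i(w*): keep only Class-i coefficients
        v = X @ delta_w                             # class prototype v_i = X delta_i(w*)
        residuals.append(np.linalg.norm(q - v))     # r_i(q) in Eq. (2)
    return classes[int(np.argmin(residuals))]       # Eq. (3): class with the minimum residual
```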

2.2 Variations of NLRC

For a query sample $q$, one can find its similar neighbors and the corresponding similarity scalars by solving Eq. (1). The larger the similarity scalar, the more similar the query sample $q$ and the associated training sample are. The NLRC decisional criterion can therefore be rewritten as

$$ \mathrm{Identity}(q) = \mathrm{Class}\big(\max_{i = 1, 2, \ldots, c}\ \max_{j = 1, 2, \ldots, n_i} w^*_{ij}\big), \tag{4} $$

where $w^*_{ij}$ is the reconstruction coefficient associated with the $j$-th sample from Class $i$.

To enhance the classification ability of NLRC, we fuse the class information into the classifier design. The query sample should have more intra-class similar neighbors and fewer inter-class ones, which means we can assign the query sample $q$ to the class with the maximum sum of reconstruction coefficients. The NLRC decisional criterion can also be written as

$$ \mathrm{Identity}(q) = \mathrm{Class}\Big(\max_{i = 1, 2, \ldots, c} \sum_{j=1}^{n_i} w^*_{ij}\Big). \tag{5} $$

Combining the above classification criteria, one easily finds that a query sample $q$ should be assigned to the class with the minimum reconstruction error and the maximum similarity scalar. We unify the decision criteria of Eq. (3) and Eq. (5) and present an optimal nonnegative linear reconstruction classification with the following criterion:

$$ \mathrm{Identity}(q) = \mathrm{Class}\Big(\min_{i = 1, 2, \ldots, c} \frac{\|q - X\delta_i(w^*)\|_2}{\sum_{j=1}^{n_i} w^*_{ij}}\Big). \tag{6} $$
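For completeness, here is a sketch of the three decision rules of Eqs. (4)-(6), reusing the same assumed data layout as above; the small epsilon guarding against an all-zero class coefficient sum is our own numerical safeguard, not part of the paper.

```python
import numpy as np

def nlrc_variant_decisions(X, labels, q, w):
    """Apply Eqs. (4)-(6) given the nonnegative coefficient vector w* from Eq. (1)."""
    classes = np.unique(labels)
    single, summed, optimal = [], [], []
    for c in classes:
        wc = np.where(labels == c, w, 0.0)            # coefficients of Class c only
        resid = np.linalg.norm(q - X @ wc)            # class reconstruction residual
        single.append(wc.max())                       # Eq. (4): largest single coefficient
        summed.append(wc.sum())                       # Eq. (5): class-wise coefficient sum
        optimal.append(resid / max(wc.sum(), 1e-12))  # Eq. (6): residual over coefficient sum
    return (classes[int(np.argmax(single))],          # NLRC-Single decision
            classes[int(np.argmax(summed))],          # NLRC-Sum decision
            classes[int(np.argmin(optimal))])         # NLRC-Optimal decision
```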

To summarize the algorithm of NLRC and its variations, Fig. 1 shows the flowchart: ① solve Eq. (1) to obtain the nonnegative linear reconstruction coefficient vector $w^*$; ② identify the class to which the query sample belongs.

Fig. 1 Flowchart of NLRC and its variations

3 Nonnegative linear reconstruction projection

3.1 Basic idea

This section develops a feature extraction method under the guidance of the NLRC decisional rule. The basic idea is to make the subsequent NLRC achieve optimal performance in the transformed space. Suppose we have obtained the optimal projections $P = [p_1, \ldots, p_d]$. Each data point $x$ is projected into the transformed subspace by $y = P^T x$. The training data matrix in the transformed subspace is $Y = [Y_1, Y_2, \ldots, Y_c]$, where $Y_i = [y_{i1}, y_{i2}, \ldots, y_{in_i}]$ is the training data subset formed by the samples with class label $i$.

In the transformed subspace, we consider using NLRC. Concretely, for each training sample $y_{ij}$, we leave it out of the training set and reconstruct it nonnegatively from the remaining samples using Eq. (1), obtaining the nonnegative representation coefficient vector $w_{ij}$. Define $\delta_t(w_{ij})$ as the representation coefficient vector with respect to Class $t$. The prototype of Class $t$ with respect to the sample $y_{ij}$ is $v^t_{ij}$, and the reconstruction error between $y_{ij}$ and $v^t_{ij}$ is defined as

$$ D_t(y_{ij}) = \|y_{ij} - v^t_{ij}\|_2^2. \tag{7} $$

For the purpose of classification, we expect the intra-class prototype of $y_{ij}$, i.e., $v^i_{ij}$, to be close to $y_{ij}$, and at the same time the inter-class prototypes $v^t_{ij}$, for $t \ne i$, to be far away from $y_{ij}$. To make NLRC achieve optimal performance on the whole training set, we compute the average intra-class and inter-class reconstruction errors, i.e., the intra-class and inter-class reconstruction scatters. The intra-class reconstruction scatter characterizes the compactness of the intra-class samples, and the inter-class reconstruction scatter characterizes the separability of different classes. According to the NLRC decisional rule, a larger inter-class reconstruction scatter and a smaller intra-class reconstruction scatter lead to better classification performance in an average sense. In the following subsection, we construct the intra-class and inter-class reconstruction scatters.
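A brief sketch of the leave-one-out reconstruction errors $D_t(y_{ij})$ of Eq. (7), again assuming SciPy's `nnls` as the nonnegative least-squares solver; the column indexing and the dictionary return type are illustrative choices of ours.

```python
import numpy as np
from scipy.optimize import nnls

def loo_class_errors(Y, labels, j):
    """Leave column j of Y (d x N) out, reconstruct it nonnegatively from the rest via
    Eq. (1), and return D_t(y_ij) = ||y_ij - v_ij^t||_2^2 for every class t (Eq. (7))."""
    y = Y[:, j]
    keep = np.arange(Y.shape[1]) != j
    w_rest, _ = nnls(Y[:, keep], y)                  # nonnegative coefficients over the rest
    w = np.zeros(Y.shape[1])
    w[keep] = w_rest                                 # re-embed at full length N
    return {t: float(np.linalg.norm(y - Y @ np.where(labels == t, w, 0.0)) ** 2)
            for t in np.unique(labels)}
```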

3.2 Characterization of inter-class reconstruction scatter and intra-class reconstruction scatter

The intra-class and inter-class reconstruction scatters of the samples in the transformed subspace are respectively defined as follows:

$$ \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} D_i(y_{ij}) = \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} \|y_{ij} - v^i_{ij}\|_2^2 = \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} (y_{ij} - v^i_{ij})^T (y_{ij} - v^i_{ij}), \tag{8} $$

$$ \frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} D_t(y_{ij}) = \frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} \|y_{ij} - v^t_{ij}\|_2^2 = \frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} (y_{ij} - v^t_{ij})^T (y_{ij} - v^t_{ij}). \tag{9} $$

Inserting $y_{ij} = P^T x_{ij}$, $v^i_{ij} = P^T X \delta_i(w_{ij})$ and $v^t_{ij} = P^T X \delta_t(w_{ij})$ into Eq. (8) and Eq. (9), respectively, gives

$$ \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} (y_{ij} - v^i_{ij})^T (y_{ij} - v^i_{ij}) = \mathrm{tr}\Big(\frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} (P^T x_{ij} - P^T X\delta_i(w_{ij}))(P^T x_{ij} - P^T X\delta_i(w_{ij}))^T\Big) = \mathrm{tr}\Big(P^T\Big[\frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} (x_{ij} - X\delta_i(w_{ij}))(x_{ij} - X\delta_i(w_{ij}))^T\Big]P\Big) = \mathrm{tr}(P^T S_w P), \tag{10} $$

$$ \frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} (y_{ij} - v^t_{ij})^T (y_{ij} - v^t_{ij}) = \mathrm{tr}\Big(\frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} (P^T x_{ij} - P^T X\delta_t(w_{ij}))(P^T x_{ij} - P^T X\delta_t(w_{ij}))^T\Big) = \mathrm{tr}\Big(P^T\Big[\frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} (x_{ij} - X\delta_t(w_{ij}))(x_{ij} - X\delta_t(w_{ij}))^T\Big]P\Big) = \mathrm{tr}(P^T S_b P), \tag{11} $$

where

$$ S_w = \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{n_i} (x_{ij} - X\delta_i(w_{ij}))(x_{ij} - X\delta_i(w_{ij}))^T $$

is the intra-class reconstruction scatter matrix,

$$ S_b = \frac{1}{N(c-1)}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{t \ne i} (x_{ij} - X\delta_t(w_{ij}))(x_{ij} - X\delta_t(w_{ij}))^T $$

is the inter-class reconstruction scatter matrix, and $\mathrm{tr}(\cdot)$ is the trace operator. By simultaneously maximizing the inter-class reconstruction scatter and minimizing the intra-class reconstruction scatter, we can find the optimal projections that maximize the following criterion:

$$ J(P) = \mathrm{tr}(P^T (S_b - S_w) P). \tag{12} $$

To eliminate this freedom, we add the constraint $P^T P = I$, where $I$ is the identity matrix. The goal of NLRP is therefore to solve the following optimization problem:

$$ \max_{P}\ \mathrm{tr}(P^T (S_b - S_w) P) \quad \text{s.t.}\ P^T P = I. \tag{13} $$

It is easy to prove that the optimal transformation matrix $P$ is composed of the $d$ orthonormal eigenvectors of $(S_b - S_w)p = \lambda p$ corresponding to the $d$ largest eigenvalues.

3.3 Algorithm of NLRP

In summary of the above description, the NLRP algorithm is given below:

Input: a training data set $X$
Output: the transformation matrix $P_{\mathrm{NLRP}}$
Step 1: Project all the samples into a PCA-transformed space.
Step 2: Calculate the intra-class and inter-class reconstruction scatter matrices $S_w$ and $S_b$.
Step 3: Solve $(S_b - S_w)p = \lambda p$ to obtain the $d$ eigenvectors associated with the $d$ largest eigenvalues, and construct $P = [p_1, p_2, \ldots, p_d]$.
Step 4: Output $P_{\mathrm{NLRP}} = P_{\mathrm{PCA}} P$.
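The following is a minimal sketch of Steps 2-3 (the PCA pre-projection of Step 1 is assumed to have been applied to `X` already); the variable names and the use of `numpy.linalg.eigh` for the symmetric eigenproblem are our own choices.

```python
import numpy as np
from scipy.optimize import nnls

def nlrp_fit(X, labels, d_out):
    """Build S_w and S_b from leave-one-out nonnegative reconstructions (Eq. (1)) and
    return the d_out eigenvectors of S_b - S_w with the largest eigenvalues (Eq. (13))."""
    d, N = X.shape
    classes = np.unique(labels)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for j in range(N):
        x = X[:, j]
        keep = np.arange(N) != j
        w_rest, _ = nnls(X[:, keep], x)             # w_ij with x_ij left out of the dictionary
        w = np.zeros(N)
        w[keep] = w_rest
        for t in classes:
            r = x - X @ np.where(labels == t, w, 0.0)        # x_ij - X delta_t(w_ij)
            if t == labels[j]:
                Sw += np.outer(r, r) / N                         # intra-class scatter term
            else:
                Sb += np.outer(r, r) / (N * (len(classes) - 1))  # inter-class scatter term
    evals, evecs = np.linalg.eigh(Sb - Sw)          # (S_b - S_w) p = lambda p, symmetric case
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:d_out]]                  # columns p_1, ..., p_d of P
```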

4 Further discussions

4.1 Comparisons between NLRC and SRC

SRC [13] was proposed by Wright et al. and successfully applied to face recognition. The basic idea of SRC is to represent a query sample as a sparse linear combination of all training samples; the nonzero coefficients are concentrated on the training samples with the same class label as the query sample. Similar to SRC, NLRC represents a query sample as a sparse linear combination of all training samples, but there are several differences between them, listed as follows (a small illustration of the two coding schemes is given after this list):

(1) Both SRC and NLRC are sparse representation methods: their representation coefficient vectors are sparse. Beyond sparsity, the reconstruction coefficients of NLRC are nonnegative, which is not required in SRC; thus, NLRC can be viewed as a special case of SRC. The reconstruction coefficients of SRC are obtained by solving an $L_1$-norm regularized least squares regression model, while those of NLRC are obtained by solving a least squares approximation with a nonnegativity constraint on the representation coefficients.

(2) SRC is a parametric method: to obtain the best result, one needs to tune the regularization parameter over a large range, which is obviously time-consuming. Unlike SRC, NLRC is a nonparametric classifier; the existing MATLAB function "lsqnonneg" can be used to solve the nonnegative linear reconstruction problem. In contrast to SRC, NLRC is therefore simpler and more practical.

(3) Since its nonzero reconstruction coefficients take different signs, SRC represents a query sample with both additive and subtractive interactions. The training samples associated with positive coefficients make a positive contribution to the reconstruction of the query sample and vice versa. However, this leads to the case of "cancel each other out". NLRC represents a query sample as a nonnegative linear combination of a subset of the training samples. From this point of view, NLRC is more in line with the intuitive notion of combining parts to form a whole than SRC is.

(4) NLRC represents a query sample as a nonnegative linear combination of a subset of the training samples. From the reconstruction point of view, the query sample and the samples associated with nonzero coefficients share some common features. Different from NLRC, SRC represents a query sample with both additive and subtractive interactions, which is contrary to the intuitive notion of combining parts to form a whole. From this point of view, NLRC has an adaptive similar-neighbor selection mechanism.
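To make differences (1)-(2) concrete, here is a hedged side-by-side sketch of the two coding problems, assuming scikit-learn's `Lasso` as one possible solver for the $L_1$-regularized least squares used by SRC (the paper does not specify its SRC solver) and SciPy's `nnls` for NLRC; `X` is the d x N training dictionary and `q` a length-d query.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import Lasso

def src_code(X, q, alpha=0.05):
    """SRC-style coding: L1-regularized least squares; the coefficients may take either
    sign and the regularization parameter alpha has to be tuned (the value here is arbitrary)."""
    return Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X, q).coef_

def nlrc_code(X, q):
    """NLRC-style coding: nonnegativity-constrained least squares, no parameter to tune."""
    w, _ = nnls(X, q)
    return w
```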

4.2 Comparisons between the two recognition systems NLRP+NLRC and LM-NNDA+LM-NNC

LM-NNC is a local classification method. For a given sample $q$, one finds the $k$ nearest neighbors of $q$ in each class and then computes the mean of these $k$ nearest neighbors. With $c$ pattern classes, one obtains $c$ local means; $q$ is assigned to the class whose local mean is nearest to it. Based on the LM-NNC decisional rule, Yang et al. further developed LM-NNDA, which can be viewed as a "locally" FLDA method, to match LM-NNC well.

The proposed NLRC is based on the reconstruction concept: the query sample can be represented as a combination of its similar neighbors, and these similar neighbors are positively correlated with the represented sample. Through linear reconstruction, the reconstructed intra-class prototype should be more similar to the represented sample than the inter-class prototypes. Based on this, we design NLRP to share the decisional criterion of NLRC, so NLRP should match NLRC well in theory.

In the recognition system LM-NNDA+LM-NNC, both LM-NNDA and LM-NNC are local algorithms. LM-NNDA needs different neighborhood parameters in the characterizations of the intra-class and inter-class scatters, and a neighborhood parameter still has to be specified for LM-NNC in the classification phase. Generally, the nearest-neighbor parameters are set by experiments; for a large-scale data set, tuning them is time-consuming. Moreover, the nearest neighbors are defined in the sense of Euclidean distance: a large nearest-neighbor parameter leads to a large neighborhood, in which case some selected nearest neighbors may be geodesically far away but forced to be treated as close. Obviously, the manually selected parameter cannot guarantee optimal performance. As a result, LM-NNDA and LM-NNC are hard to match well in practice. In the proposed recognition system NLRP+NLRC, the nonnegative linear reconstruction coefficient vector is sparse and can be obtained using the MATLAB function "lsqnonneg". The degree of sparsity, i.e., the number of nonzero entries, is set adaptively by the algorithm itself and does not need to be specified in advance. Therefore, NLRP+NLRC can be viewed as a nonparametric recognition system and avoids the awkward situations that occur in many parametric methods.

4.3 Comparisons between the two recognition systems NLRP+NLRC and RDA+LRC

NLRC first uses all the training samples to reconstruct the query sample and then finds the class whose samples have contributed most to the reconstruction. The design of LRC is based on the assumptions that each single class lies on a linear subspace and that each sample can be represented as a combination of its intra-class samples. LRC assigns the query sample to the class with the minimum reconstruction error and achieves encouraging results in face recognition. However, when the number of training samples per class is larger than the sample dimensionality, LRC breaks down. In contrast, NLRC is more robust and avoids this failure case of LRC. Moreover, the reconstruction error of NLRC is calculated from the similar samples of each class, whereas that of LRC is based on all the training samples of each class. The similar neighbors used in NLRC are only a subset of the training samples of each class and are the ones most similar to the query sample. It seems more reasonable to use a few similar samples to calculate the reconstruction error than to use all training samples per class, since the negative effects caused by outliers, illumination variations, and noise can be reduced to some extent.

From the reconstruction point of view, NLRP can be treated as a special RDA method. Indeed, if the linear reconstruction in our method is forced to be rewritten as a linear reconstruction over all the training samples of each class, then the linear combination coefficients of all the training samples except the similar neighbors would be zero. Moreover, in RDA the intra-class reconstruction error is computed as

$$ \varepsilon_{ij} = \|x_{ij} - X_i w_{ij}\|_2 $$

and the intra-class reconstruction scatter matrix is defined as

$$ \sum_{i=1}^{c}\sum_{j=1}^{n_i} (x_{ij} - X_i w_{ij})(x_{ij} - X_i w_{ij})^T, \tag{14} $$

where $X_i$ contains the samples from Class $i$. Observing Eq. (14), we find that this definition of the intra-class scatter matrix of RDA is unreasonable: $x_{ij}$ should be removed from $X_i$; otherwise $\varepsilon_{ij}$ will be zero and the calculated intra-class scatter is meaningless.
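For concreteness, a corrected form consistent with this remark could be written as follows (the notation $\tilde{X}_{i,j}$ for $X_i$ with the column $x_{ij}$ removed, and $\tilde{w}_{ij}$ for the corresponding regression coefficients, is ours):

```latex
\varepsilon_{ij} = \bigl\| x_{ij} - \tilde{X}_{i,j}\,\tilde{w}_{ij} \bigr\|_2 , \qquad
\tilde{S}_w = \sum_{i=1}^{c}\sum_{j=1}^{n_i}
  \bigl( x_{ij} - \tilde{X}_{i,j}\,\tilde{w}_{ij} \bigr)
  \bigl( x_{ij} - \tilde{X}_{i,j}\,\tilde{w}_{ij} \bigr)^{T} .
```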

5 Experiments

In this section, the nonnegative linear reconstruction projection (NLRP) and the nonnegative linear reconstruction based classifier (NLRC) are evaluated on the AR face image database, the Yale face image database, the PolyU finger knuckle print database, and the PolyU palmprint database, and compared with principal component analysis (PCA) [14], Fisher linear discriminant analysis (FLDA) [14], the maximum margin criterion (MMC) [15], nonnegative matrix factorization (NMF) [16-17], LM-NNDA, and RDA. The nonnegative linear reconstruction coefficient vectors of NLRP and NLRC are obtained using the MATLAB function "lsqnonneg". Entries of the optimal nonnegative coefficient vector larger than $10^{-5}$ are treated as nonzero.
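A small illustration of this convention (the $10^{-5}$ threshold is from the paper; the helper name and NumPy usage are our own):

```python
import numpy as np

def nonzero_support(w, tol=1e-5):
    """Indices of coefficients treated as nonzero under the paper's 1e-5 threshold."""
    return np.flatnonzero(np.asarray(w) > tol)
```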

5.1 Experiments using the AR and Yale face image databases

5.1.1 Introduction to the AR and Yale face image databases

The AR face image database [18-19] contains over 4,000 color face images of 126 people, including frontal views of faces with different facial expressions, lighting conditions, and occlusions. The images of 120 persons, including 65 males and 55 females, are selected and used in our experiments. The pictures of each person were taken in two sessions (separated by two weeks), and each session contains 7 color images without occlusions. The face portion of each image is manually cropped and then normalized to 50×40 pixels. Sample images of one person are shown in Fig. 2.

The Yale face image database was constructed at the Yale Center for Computational Vision and Control. It contains 165 gray-scale images of 15 individuals. The images demonstrate variations in lighting condition, facial expression, and with/without glasses. The size of each cropped image is 100×80 pixels. Fig. 3 shows some sample images of one individual, which vary as follows: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and winking.

Fig. 2 Sample images of one person in the AR face database

Fig. 3 Sample face images from the Yale face database

5.1.2 Experimental settings and results

On the AR face image database, we first compare the performance of SRC with that of NLRC. We select the 7 images from the first session for training and the rest, from the second session, for testing. Since the dimension of the face image vector space is much larger than the number of training samples, we first use PCA for dimension reduction and then perform SRC in the 220-dimensional PCA-transformed subspace. The regularization parameter λ is varied over {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}; the recognition results of SRC are shown in Fig. 4. From Fig. 4, we find that when λ = 0.05, SRC obtains its best recognition rate of 76.67%. We therefore choose λ = 0.05 for SRC in the following experiments. In addition, we apply NLRC and its variations in the 220-dimensional PCA subspace and record the recognition results and run times in Table 1. Observing Table 1, we have the following findings: 1) SRC achieves results comparable to NLRC and its variations; 2) NLRC and its variations run nearly 30 times faster than SRC. The reported SRC result is based on the optimal regularization parameter λ, which needs to be tuned over a large range, a time-consuming process. From this point of view, the nonparametric NLRC is more effective and simpler than SRC.

To further evaluate the performance of the proposed method on the AR face image database, we perform random experiments with $l$ ($l$ = 2, 3, 4, 5, and 6) images per individual for training and the remaining $14 - l$ images for testing. As in the above experiments, FLDA, MMC, NMF, LM-NNDA, RDA, and the proposed NLRP are performed in the 220-dimensional PCA-transformed subspace. We run the system 10 times and report the maximal average recognition rates of each method with different classifiers in Table 2. The maximal average recognition rates of the three recognition systems LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the size of the training set are illustrated in Fig. 5. Based on the same random splits, we perform SRC, NLRC, and its variations again in the 220-dimensional PCA-transformed subspace and report the average results in Table 3.

On the Yale face image database, we conduct experiments with random training samples: $l$ ($l$ = 3, 4, 5, and 6) images are randomly selected from the image gallery of each individual to form the training set, and the remaining $11 - l$ images are used for testing. FLDA, MMC, NMF, LM-NNDA, RDA, and NLRP are performed in the 44-dimensional PCA-transformed subspace. We independently run the system 20 times. Table 4 shows the average recognition rates across the 20 runs of each method with different classifiers and the corresponding standard deviations. The maximal average recognition rates of the three recognition systems LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the size of the training set are illustrated in Fig. 6.


Fig. 4 The recognition rates of SRC based on the 220-dimensional PCA subspace over variations of the regularization parameter λ on the AR face database

Table 1: The maximal recognition rates (%) and the corresponding run times (s) of SRC, NLRC and its variations based on the 220-dimensional PCA subspace on the AR face image database, with the first 7 images per person used for training and the rest for testing

Method         | Recognition rate (%) | Run time (s)
SRC            | 79.40                | 4515.4
NLRC           | 79.31                | 142.47
NLRC-Single    | 73.60                | 116.03
NLRC-Sum       | 79.55                | 126.47
NLRC-Optimal   | 79.79                | 146.77

Table 2: The maximal average recognition rates (%) and the corresponding standard deviations of each method with different classifiers on the AR face image database across 10 runs (l = number of training images per class)

Method            | l = 2       | l = 3        | l = 4        | l = 5        | l = 6
PCA+NNC           | 71.81±4.41  | 76.72±7.93   | 79.31±8.96   | 80.29±9.43   | 87.72±9.70
PCA+MDC           | 71.72±3.93  | 72.95±7.48   | 74.32±6.78   | 74.16±7.48   | 76.77±7.50
FLDA+NNC          | 75.58±7.63  | 67.03±12.40  | 83.47±6.85   | 85.21±8.79   | 92.49±7.56
FLDA+MDC          | 75.74±7.68  | 67.02±12.52  | 84.03±6.90   | 86.03±8.86   | 92.86±7.51
MMC+NNC           | 74.69±5.91  | 78.80±8.59   | 81.33±10.58  | 82.18±10.96  | 90.36±10.69
MMC+MDC           | 74.03±5.12  | 77.01±8.76   | 79.17±9.28   | 79.94±9.33   | 85.04±8.52
NMF+NNC           | 63.06±6.69  | 72.76±11.62  | 79.15±6.63   | 80.46±9.87   | 88.45±9.30
NMF+MDC           | 61.67±5.55  | 69.44±13.28  | 76.97±7.52   | 77.81±11.09  | 82.24±10.50
LM-NNDA+NNC       | 76.42±6.48  | 84.69±8.20   | 88.00±7.28   | 87.59±8.56   | 93.60±7.10
LM-NNDA+MDC       | 76.28±6.35  | 84.70±7.85   | 88.20±7.25   | 87.93±8.52   | 93.73±6.58
LM-NNDA+LM-NNC    | 76.42±6.48  | 84.69±8.20   | 88.19±7.51   | 87.82±9.35   | 93.77±7.31
RDA+NNC           | 8.96±1.83   | 54.92±9.13   | 66.63±5.35   | 71.14±7.47   | 80.38±10.62
RDA+MDC           | 11.88±1.31  | 66.27±7.74   | 77.45±5.33   | 81.93±6.35   | 87.91±7.08
RDA+LRC           | 13.01±1.35  | 76.11±8.20   | 84.43±5.38   | 87.34±6.31   | 92.81±5.89
NLRP+NNC          | 72.01±6.80  | 85.30±4.60   | 86.67±5.12   | 87.44±5.70   | 91.71±5.94
NLRP+MDC          | 84.12±3.98  | 87.52±4.60   | 89.65±4.36   | 90.21±5.22   | 93.35±4.70
NLRP+NLRC         | 83.81±4.61  | 88.27±4.45   | 90.44±4.69   | 91.18±4.94   | 94.52±4.54

Table 3: The maximal average recognition rates (%) and the corresponding standard deviations of SRC, NLRC and its variations based on the 220-dimensional PCA subspace on the AR face image database across 10 runs (l = number of training images per class)

l | NLRC       | NLRC-Single | NLRC-Sum    | NLRC-Optimal | SRC (λ=0.05)
2 | 81.97±4.97 | 79.03±5.63  | 80.58±5.08  | 81.03±4.95   | 78.97±5.65
3 | 85.37±6.97 | 84.72±7.38  | 84.29±7.50  | 84.72±7.38   | 82.21±7.09
4 | 86.93±8.21 | 86.18±8.44  | 85.78±8.56  | 86.18±8.44   | 84.47±8.22
5 | 86.60±9.71 | 85.93±9.98  | 85.63±10.05 | 85.93±9.98   | 84.95±9.19
6 | 91.99±9.24 | 91.56±9.58  | 91.33±9.78  | 91.56±9.58   | 91.16±8.89

Table 4: The maximal average recognition rates (%) and the corresponding standard deviations of each method with different classifiers on the Yale face image database across 20 runs (l = number of training images per class)

Method            | l = 3       | l = 4        | l = 5         | l = 6
PCA+NNC           | 89.58±2.57  | 90.62±2.53   | 91.56±4.66    | 92.60±5.32
PCA+MDC           | 90.00±2.78  | 90.57±2.80   | 91.67±5.41    | 92.60±5.55
FLDA+NNC          | 89.50±4.63  | 71.38±9.12   | 87.06±9.39    | 91.33±6.69
FLDA+MDC          | 89.42±4.64  | 71.24±8.96   | 87.67±9.70    | 92.00±6.55
MMC+NNC           | 83.08±6.68  | 89.38±2.34   | 90.72±6.00    | 92.07±6.79
MMC+MDC           | 85.57±3.63  | 88.81±3.88   | 89.94±7.29    | 91.87±6.74
NMF+NNC           | 85.21±3.09  | 86.10±4.92   | 87.56±6.86    | 89.33±5.71
NMF+MDC           | 84.54±3.79  | 85.14±3.64   | 87.78±8.76    | 89.80±6.03
LM-NNDA+NNC       | 90.58±4.19  | 91.52±4.00   | 92.50±5.97    | 94.20±4.90
LM-NNDA+MDC       | 90.58±4.24  | 91.57±4.26   | 92.78±5.95    | 94.60±5.09
LM-NNDA+LM-NNC    | 90.58±4.19  | 91.57±3.78   | 92.44±6.03    | 94.33±5.40
RDA+NNC           | 19.50±6.29  | 57.76±7.32   | 72.00±12.10   | 73.00±12.71
RDA+MDC           | 24.21±7.66  | 64.33±7.45   | 78.11±9.95    | 81.27±11.79
RDA+LRC           | 40.25±9.45  | 74.71±7.54   | 84.61±10.39   | 86.07±12.38
NLRP+NNC          | 85.58±4.00  | 82.86±5.31   | 87.28±7.52    | 91.53±6.91
NLRP+MDC          | 91.58±4.27  | 92.95±4.51   | 93.72±6.24    | 96.20±4.82
NLRP+NLRC         | 95.12±3.85  | 93.19±4.11   | 93.67±5.05    | 95.00±7.17

Fig. 5 The maximal average recognition rates of LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the size of the training set on the AR face image database

Fig. 6 The maximal average recognition rates of LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the size of the training set on the Yale face image database

5.1.3 Experimental analysis

Observing Tables 1-4, Fig. 5, and Fig. 6, we have the following findings:

(1) NLRP combined with NLRC outperforms the other recognition systems, irrespective of the number of training samples per class.

(2) Among the three classifiers combined with NLRP, NLRP+NNC performs worst. In contrast to NNC, MDC seems to suit NLRP better and achieves comparable results. The reason may be that minimizing the intra-class reconstruction errors of the samples cannot guarantee that a query sample and its nearest neighbor are in the same class, but it can sometimes make the samples with the same class label cluster better.

(3) When the number of training samples per class is relatively small, NLRP significantly outperforms the other recognition systems; when the number becomes large, NLRP is still slightly better. This matters for real-world recognition, because the available training samples are generally limited in practice.

(4) Comparing the results of LM-NNDA, RDA, and NLRP, we find that their recognition accuracies are higher when used with their matched classifiers, i.e., LM-NNC, LRC, and NLRC, than with other classifiers. These results are consistent with the conclusion drawn in the foregoing sections that a feature extractor and a classifier sharing a common decisional rule should, in theory, perform better than those with different decisional rules.

(5) When the number of training samples per class is small, the performance of RDA is much worse. A close look at the intra-class scatter matrix of RDA shows that it is singular, which further indicates that the characterization of the intra-class reconstruction scatter in RDA is unreasonable.

(6) The results listed in Table 1 and Table 3 show that the performances of NLRC and its variations are comparable. In the following experiments, we therefore use the NLRC decision rule to design NLRP.

(7) Comparing the results of NLRP+NLRC in Table 2 with those of NLRC in Table 3, we find that the features extracted by NLRP are indeed more suitable for NLRC than those extracted by PCA, irrespective of the number of training samples per class.

(8) The results listed in Table 1 are obtained by using the 7 images per person from the first session for training and the 7 images from the second session for testing; the two sessions are separated by two weeks. Even with 7 training images per person, the results of NLRC and its variations are worse than those of the random experiments with only 2 training images per person. Obviously, recognizing images with variations over time remains problematic.

5.2 Experiments using the PolyU finger knuckle print database and the PolyU palmprint database

5.2.1 Details of the databases

The finger-knuckle-print (FKP) images in the PolyU FKP database [20-21] were collected from 165 volunteers in two separate sessions. In each session, each subject was asked to provide six images of each of the left index finger, the left middle finger, the right index finger, and the right middle finger. The images were processed by the ROI extraction algorithm described in [22]. In our experiments, we select the 1200 FKP images of the right index fingers of 100 subjects. All samples are histogram equalized and resized to 55×110 pixels. Fig. 7 shows twelve sample images of one right index finger.

The PolyU palmprint database contains 600 gray-scale images of 100 different palms, with six samples per palm (http://www4.comp.polyu.edu.hk/~biometrics/). The six samples of each palm were collected in two sessions: the first three were captured in the first session and the other three in the second session, with an average interval of two months between the sessions. In our experiments, the central part of each original image is automatically cropped to 64×64 pixels using the algorithm described in [23] and preprocessed with histogram equalization. Fig. 8 shows some sample images of one palm.

Fig. 7 Sample images of one right index finger

Fig. 8 Samples of the cropped images in the PolyU palmprint database

5.2.2 Experimental settings, results and analysis

In this subsection, we perform random experiments on the PolyU finger knuckle print database. We randomly select $l$ ($l$ = 2, 3, and 4) images per individual from the 12 images for training and use the remaining $12 - l$ images for testing. FLDA, MMC, NMF, LM-NNDA, RDA, and NLRP are performed in the 80-dimensional PCA subspace. We run the system 10 times and report the maximal average recognition rates of each method with different classifiers in Table 5. The maximal average recognition rates of the three recognition systems LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the size of the training set are illustrated in Fig. 9.

On the PolyU palmprint database, we evaluate the performance of the seven methods on recognizing images captured at different times. We use the 3 palmprint images of each class collected in Session 2 for training and the remaining images for testing. FLDA, MMC, NMF, LM-NNDA, RDA, and NLRP are performed in the 150-dimensional PCA subspace. The maximal recognition rates of each method are also listed in Table 5. The maximal recognition rates of the three recognition systems LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the dimension are illustrated in Fig. 10.

From Table 5, Fig. 9, and Fig. 10, we find that on both databases the recognition system NLRP+NLRC outperforms the other combinations, irrespective of the number of training samples, while the recognition systems NLRP+NNC and NLRP+MDC are inferior to most combinations. This further shows that a recognition system sharing a common decisional rule achieves higher classification accuracy than one without a unified decisional rule.

Table 5: The maximal average recognition rates (%) and the corresponding standard deviations of each method with different classifiers on the PolyU FKP database (across 10 runs, l = number of training images per class) and the PolyU palmprint database (last 3 images per class for training)

Method            | FKP, l = 2   | FKP, l = 3   | FKP, l = 4   | Palmprint
PCA+NNC           | 63.75±9.87   | 71.79±11.78  | 80.81±10.55  | 82.33(90)
PCA+MDC           | 57.93±6.69   | 60.47±8.75   | 64.95±8.57   | 80.00(90)
FLDA+NNC          | 18.31±16.15  | 72.18±11.07  | 84.41±9.34   | 95.00(60)
FLDA+MDC          | 18.34±16.25  | 71.14±10.73  | 82.13±8.72   | 94.33(55)
MMC+NNC           | 64.22±10.21  | 72.27±12.49  | 81.41±11.15  | 82.33(85)
MMC+MDC           | 58.23±6.95   | 60.98±9.04   | 66.05±8.89   | 80.00(85)
NMF+NNC           | 63.62±10.33  | 70.39±11.58  | 77.19±15.71  | 75.67(85)
NMF+MDC           | 59.35±11.50  | 59.58±9.65   | 62.32±12.57  | 73.00(90)
LM-NNDA+NNC       | 69.94±8.02   | 77.24±9.33   | 86.51±8.15   | 96.67(90)
LM-NNDA+MDC       | 69.48±7.38   | 76.52±8.86   | 85.01±7.92   | 96.67(90)
LM-NNDA+LM-NNC    | 69.48±7.38   | 76.87±9.24   | 85.91±8.21   | 96.67(90)
RDA+NNC           | 52.18±7.16   | 65.44±10.13  | 76.74±9.88   | 87.33(90)
RDA+MDC           | 55.27±9.55   | 59.96±11.75  | 67.14±11.60  | 89.67(90)
RDA+LRC           | 66.38±9.65   | 76.33±10.20  | 86.19±9.00   | 88.33(90)
NLRP+NNC          | 49.50±8.24   | 63.01±10.41  | 75.93±9.75   | 97.33(90)
NLRP+MDC          | 53.35±10.76  | 62.43±11.35  | 70.76±10.87  | 96.00(90)
NLRP+NLRC         | 72.75±8.13   | 79.77±9.01   | 87.36±8.00   | 97.67(90)

Fig. 9 The maximal average recognition rates of LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the size of the training set on the PolyU FKP database

Fig. 10 The recognition rates of the three recognition systems LM-NNDA+LM-NNC, RDA+LRC, and NLRP+NLRC versus the dimension on the PolyU palmprint database

6 Conclusions

Based on the nonnegative linear reconstruction measure, we propose a recognition system consisting of the feature extractor NLRP and the classifier NLRC. The designs of NLRP and NLRC are bound together by the same decision rule; therefore, NLRP is in theory the most suitable feature extractor for NLRC. This has been demonstrated by our experiments on the AR face image database, the Yale face image database, the PolyU finger knuckle print database, and the PolyU palmprint database, where the recognition system NLRP+NLRC outperforms the other combinations.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under grant no. 61305036 and the China Postdoctoral Science Foundation funded project 2014M560657.

References

[1] J. Yang, L. Zhang, J.-Y. Yang, D. Zhang, From classifiers to discriminators: a nearest neighbor rule induced discriminant analysis, Pattern Recognition 44 (7) (2011) 1387-1402.
[2] Y. Chen, Z. Jin, Reconstructive discriminant analysis: a feature extraction method induced from linear regression classification, Neurocomputing 87 (2012) 41-50.
[3] J. Xu, J. Yang, K-local hyperplane distance nearest neighbor classifier oriented local discriminant analysis, Information Sciences 232 (2013) 11-26.
[4] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 2106-2112.
[5] J. Zhang, J. Yang, Linear reconstruction measure steered nearest neighbor classification framework, Pattern Recognition 47 (2014) 1709-1720.
[6] Y. Xu, D. Zhang, J. Yang, J.-Y. Yang, A two-phase test sample sparse representation method for use with face recognition, IEEE Transactions on Circuits and Systems for Video Technology 21 (9) (2011) 1255-1262.
[7] Y. Xu, Q. Zhu, Y. Chen, J.-S. Pan, An improvement to the nearest neighbor classifier and face recognition experiments, Int. J. Innov. Comput. Inf. Control 9 (2) (2013) 543-554.
[8] P.O. Hoyer, Non-negative sparse coding, in: Proc. of the 12th IEEE Workshop on Neural Networks for Signal Processing, 2002, pp. 557-565.
[9] G. Zhou, Z. Yang, S. Xie, J.-M. Yang, Mixing matrix estimation from sparse mixtures with unknown number of sources, IEEE Transactions on Neural Networks 22 (2) (2011) 211-221.
[10] G. Zhou, A. Cichocki, S. Xie, Fast nonnegative matrix/tensor factorization based on low-rank approximation, IEEE Transactions on Signal Processing 60 (6) (2012) 2928-2940.
[11] Z. He, S. Xie, R. Zdunek, G. Zhou, A. Cichocki, Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering, IEEE Transactions on Neural Networks 22 (12) (2011) 2117-2131.
[12] J. Xu, J. Yang, A nonnegative sparse representation based fuzzy similar neighbor classifier, Neurocomputing 99 (1) (2013) 76-86.
[13] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210-227.
[14] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711-720.
[15] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Transactions on Neural Networks 17 (1) (2006) 157-165.
[16] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788-791.
[17] H.S. Seung, D.D. Lee, The manifold ways of perception, Science 290 (2000) 2268-2269.
[18] A.M. Martinez, R. Benavente, The AR Face Database, 2006.
[19] A.M. Martinez, R. Benavente, The AR Face Database, CVC Technical Report #24, June 1998.
[20] L. Zhang, L. Zhang, D. Zhang, Finger-knuckle-print: a new biometric identifier, in: Proceedings of the IEEE International Conference on Image Processing, 2009.
[21] The PolyU FKP Database.
[22] L. Zhang, L. Zhang, D. Zhang, H. Zhu, Online finger-knuckle-print verification for personal authentication, Pattern Recognition 43 (7) (2010) 2560-2571.
[23] D. Zhang, Palmprint Authentication, Kluwer Academic Publishers, 2004.


Biographies

Jie Xu received the B.S. degree in Mathematics from South China Normal University, the M.S. degree from Sun Yat-sen University, and the Ph.D. degree from Nanjing University of Science and Technology (NUST), China, in 2002, 2008 and 2013, respectively. She is a postdoctoral fellow in the Faculty of Automation, Guangdong University of Technology, Guangzhou, China. Her research interests include face recognition, image processing, content-based image retrieval, and pattern recognition.

Shengli Xie (M'01–SM'02) received the M.S. degree in mathematics from Central China Normal University, Wuhan, China, in 1992, and the Ph.D. degree in control theory and applications from the South China University of Technology, Guangzhou, China, in 1997. He is currently a Full Professor and the Head of the Institute of Intelligent Information Processing with the Guangdong University of Technology, Guangzhou. He has authored or co-authored two books and over 150 scientific papers in journals and conference proceedings. His current research interests include wireless networks, automatic control, and blind signal processing. Prof. Xie received the second prize of China's State Natural Science Award in 2009 for his work on blind source separation and identification.
