Pattern Recognition Letters 31 (2010) 1959–1964
Improved Group Sparse Classifier
Angshul Majumdar *, Rabab K. Ward
Department of Electrical and Computer Engineering, University of British Columbia, Canada
Article history: Received 6 November 2009. Available online 27 July 2010. Communicated by F. Roli.

Abstract
This work proposes a new classifier based on the assumption that the training samples of a particular class approximately form a linear basis for any new test sample belonging to that class. This is not a new assumption; two previous works, namely the Sparse Classifier (Yang et al., 2009) and the Group Sparse Classifier (Majumdar and Ward, 2009a), have been built upon it. However, both of the previous works suffer from certain shortcomings: the optimization algorithms they propose do not capture all the implications of the said assumption. This work accounts for the intricacies of the assumption and consequently requires solving a new non-convex optimization problem. We develop an elegant approach to solve this optimization problem based on the Iterative Reweighted Least Squares method. Our classifier is very flexible; the previous classifiers are actually two special cases of it. The proposed classifier is rigorously compared against the previous ones stemming from the said assumption on benchmark classification databases. The results indicate that in most cases the proposed classifier is significantly better than the previous ones, and in the worst case it is as good as the previous ones.
Keywords: Classification; Quasi-convex optimization; Face recognition; Character recognition
1. Introduction

Recently there has been interest in a classification model where the training samples of a particular class are assumed to form a linear basis for approximately representing any new test sample belonging to that class. This assumption has led to the development of some new classifiers such as the Sparse Classifier (SC) (Yang et al., 2009), the Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) and its fast version, the Fast Group Sparse Classifier (FGSC) (Majumdar and Ward, 2009b). Formally, the assumption can be expressed as

v_{k,\text{test}} = \alpha_{k,1} v_{k,1} + \alpha_{k,2} v_{k,2} + \cdots + \alpha_{k,n_k} v_{k,n_k} + e_k \qquad (1)

where v_{k,test} is the test sample of the kth class, v_{k,i} is the ith training sample of the kth class, α_{k,i} is the corresponding weight and e_k is the approximation error.

Geometrically, this assumption implies that each class comprises a few subspaces, each spanned by some training vectors, and that a new test sample can be approximately represented in the union of these subspaces. To take a concrete example, consider the problem of face recognition. Faces appear in different poses – frontal, left profile, right profile, etc. Each of the poses can be assumed to lie on a subspace. If there are enough training samples, a few samples in each pose will be sufficient to capture the variability of that pose. Now if there is a new test sample with a pose somewhere between frontal and left profile, it may be expressed as a linear combination of the training samples in the frontal and left profile poses. The relative contribution of the training samples in each pose is decided by their weights (α).

All the aforementioned classifiers (SC, GSC and FGSC) determine the unknown weights (the α_{k,i}'s) by solving different optimization problems and use these weights to determine the class of the test sample. The first step for all of them is to express the assumption in terms of all the training samples instead of just one class as in (1). In compact matrix–vector notation it can be expressed as

v_{k,\text{test}} = V\alpha + e_k \qquad (2)

where V = [v_{1,1} | \cdots | v_{1,n_1} | \cdots | v_{k,1} | \cdots | v_{k,n_k} | \cdots | v_{C,1} | \cdots | v_{C,n_C}] and \alpha = [\alpha_{1,1} \cdots \alpha_{1,n_1} \cdots \alpha_{k,1} \cdots \alpha_{k,n_k} \cdots \alpha_{C,1} \cdots \alpha_{C,n_C}]'. Given v_{k,test} and V, the problem is to determine the class of the test sample. To achieve this under the said assumption, the unknown weight vector α must be solved for. The way to do this is to solve an optimization problem of the following form,
\hat{\alpha} = \min_{\alpha} u(\alpha) \quad \text{such that} \quad \|v_{k,\text{test}} - V\alpha\|_2 \leq \sigma \qquad (3)
The objective function u(α) encodes prior information about the nature of α, and the constraint arises from the assumption that the misfit is normally distributed. The different classification methods vary in their choice of the objective function.
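Before discussing the individual classifiers, a small sketch may help make the setup of (1)–(3) concrete. The code below is our own NumPy illustration (the sizes, variable names and random data are hypothetical, not from the paper); it builds the dictionary V by stacking the training samples of each class as columns and records the class grouping that all of the classifiers below rely on.

```python
import numpy as np

# Hypothetical toy setup: a few classes, a few training samples per class,
# each sample a d-dimensional feature vector (e.g. a vectorized face image).
rng = np.random.default_rng(0)
d, samples_per_class, classes = 64, 5, 3

# V stacks the training samples of every class column-wise:
# V = [v_{1,1} | ... | v_{1,n_1} | ... | v_{C,1} | ... | v_{C,n_C}]
V = np.hstack([rng.standard_normal((d, samples_per_class)) for _ in range(classes)])

# 'groups' records which columns of V belong to which class, so that the
# weight vector alpha can later be read group-by-group (alpha_1, ..., alpha_C).
groups = [np.arange(c * samples_per_class, (c + 1) * samples_per_class)
          for c in range(classes)]

# A test sample generated according to assumption (1): a linear combination of
# the training samples of class 0 plus a small approximation error e_k.
alpha_true = np.zeros(V.shape[1])
alpha_true[groups[0]] = rng.standard_normal(samples_per_class)
v_test = V @ alpha_true + 0.01 * rng.standard_normal(d)
```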
In the SC (Yang et al., 2009) it is assumed that the weight vector α should be sparse, and therefore the l1-norm of the weight vector (||α||_1) is employed as the objective function. The choice of the l1-norm as a sparsifying function can be traced back to studies in statistics (Tibshirani, 1996) and signal processing (Chen et al., 2001). However, the l1-norm is not an appropriate choice in this case. As pointed out by Zou and Hastie (2005), the l1-norm selects only a single sample from a group of correlated training samples. For the face recognition example, this implies that if the test sample is a pose between frontal and left profile, the SC (employing the l1-norm) will try to express the test sample either as a frontal or as a left profile, but not as a combination of both. Clearly this is not desired.

To address this issue, the GSC was proposed by Majumdar and Ward (2009a). It employed a mixed norm as the objective function, given by ||α||_{2,1} = Σ_k ||α_k||_2. This norm ensures that, for a particular class, all the training samples are selected, but the number of selected classes is sparse (Huang and Zhang, 2008; Eldar et al., 2009). The GSC (and the FGSC, which is a fast implementation of the GSC) showed good results, but it is not perfect either. This is because the inner l2-norm of the mixed norm selects all the training samples of a particular class. We again refer to the problem of face recognition: if a test image is somewhere between the left profile and frontal views, this classifier will try to represent it as a combination of all the views – left, frontal and right. This is erroneous as well.

The SC and the GSC are two extremes – the SC (in the extreme case) selects only one sample from the entire class, while the GSC selects all of them. Ideally, we would like the classifier to represent the test sample only as a combination of the relevant subspaces and not all of them. For the face recognition problem, this means that if a test sample is somewhere between the left and frontal views, it should be represented only by training samples corresponding to these views and not by training samples from the right profile view. The existing classifiers (SC and GSC) cannot address this problem. This paper is dedicated to finding methods to overcome this shortcoming, so that the classifier can operate between the two extremes.

The other issue we tackle in this paper is the relaxation of the l2-norm misfit in (3). There is no reason (physical or mathematical) to assume that the misfit is normally distributed. In this work we experiment with other norms for the misfit.

This paper therefore has two major contributions: (i) changing the objective function to address the aforementioned issues; and (ii) changing the norm of the misfit. Both of these lead to new optimization problems, which are solved in an efficient manner in this paper.

The rest of the paper is divided into several sections. In the next section, the background of this work is discussed. The proposed classification model is described in Section 3. In Section 4, the optimization problem behind the proposed classifier is detailed. The experimental results are presented in Section 5. The conclusions of the work are discussed in Section 6.

2. Background

The assumption that a new test sample can be approximated as a linear combination of the training samples of its class was proposed by Yang et al. (2009) for the Sparse Classifier (SC). Following the same assumption, the Group Sparse Classifier (GSC) was later developed by Majumdar and Ward (2009a).
The fast version of the GSC (FGSC) is not a different classifier; it employs fast algorithms to solve the same optimization problem. Therefore we discuss the SC and the GSC in detail in order to understand their shortcomings.

2.1. Sparse classifier

The classification assumption is mathematically expressed in (2). At the risk of repetition, it is
v_{k,\text{test}} = V\alpha + e_k

Yang et al. (2009) argue that, according to the assumption, the weight vector α should be sparse, because all the classes apart from the correct one will have zero weights associated with their training vectors. This prior knowledge about the weight vector can be imposed on the optimization problem for solving α in the following form,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_0 \quad \text{such that} \quad \|v_{k,\text{test}} - V\alpha\|_2 < \varepsilon \qquad (4)
The l0-norm is not a true norm; it only counts the number of non-zeros in the vector. Owing to this definition, the minimizer of this norm is the sparsest solution. However, it is known that (4) is an NP-hard problem (Amaldi and Kann, 1998). Therefore, following the work of Donoho (2006), the NP-hard l0-norm is replaced by its closest convex surrogate, the l1-norm, and instead of (4) the following problem is solved,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_1 \quad \text{such that} \quad \|v_{k,\text{test}} - V\alpha\|_2 < \varepsilon \qquad (5)
This is a convex optimization problem and can be solved by quadratic programming. A host of solvers are already available for (5) (l1-magic; SPGL1). Once a sparse solution α is obtained, the following classification algorithm is used to determine the class of the test sample.

SC algorithm
1. Solve the optimization problem expressed in (5).
2. For each class i, repeat the following two steps:
   a. Reconstruct a sample by a linear combination of the training samples belonging to that class, v_recon(i) = Σ_{j=1}^{n_i} α_{i,j} v_{i,j}.
   b. Find the error between the reconstructed sample and the given test sample, error(v_test, i) = ||v_{k,test} − v_recon(i)||_2.
3. Once the error for every class is obtained, choose the class having the minimum error as the class of the given test sample.

There is a limitation to l1-norm minimization. If there is a group of samples whose pair-wise correlations are very high, then the Lasso tends to select only one sample from the group (Zou and Hastie, 2005). For the problem of classification, the correlation between the samples of each class is high. In such a situation, the SC will try to represent the test sample by a single training sample only, which will not always give the desired results. We have already discussed the repercussions for the face recognition example in Section 1: if the test face image is actually expressed as a linear combination of frontal and left profile views, the SC will represent it from one view only (the one most correlated with the test sample). This can give erroneous results.

2.2. Group Sparse Classifier

The Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) relies on the same assumption as the SC. The GSC argues that the solution to (2) has non-zero coefficients corresponding to the vectors of the correct class and is zero for the other classes. Therefore, the solution to the inverse problem should ensure that all the weights for the correct class are selected. This condition is not satisfied by ordinary l1-norm minimization. In order to ensure that all the weights for the class are selected, the following optimization problem was proposed,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_{2,0} \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_2 < \varepsilon \qquad (6)
The mixed norm ||·||_{2,0} is defined for α = [α_{1,1}, …, α_{1,n_1} | α_{2,1}, …, α_{2,n_2} | … | α_{k,1}, …, α_{k,n_k}]^T (the lth block being α_l) as ||α||_{2,0} = Σ_l I(||α_l||_2 > 0), where I(||α_l||_2 > 0) = 1 if ||α_l||_2 > 0 and 0 otherwise. Here, k is the number of classes.
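To illustrate the group norms, here is a small sketch in our own NumPy notation (not code from the paper) that evaluates ||α||_{2,0} and its convex surrogate ||α||_{2,1} for a toy grouped weight vector.

```python
import numpy as np

def group_norm_20(alpha, groups):
    # ||alpha||_{2,0}: number of groups (classes) with non-zero l2-norm
    return sum(np.linalg.norm(alpha[g]) > 0 for g in groups)

def group_norm_21(alpha, groups):
    # ||alpha||_{2,1}: sum of per-group l2-norms (convex surrogate of ||.||_{2,0})
    return sum(np.linalg.norm(alpha[g]) for g in groups)

# Toy weight vector with 3 classes of 2 samples each; only class 0 is active.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
alpha = np.array([0.8, -0.5, 0.0, 0.0, 0.0, 0.0])
print(group_norm_20(alpha, groups))   # 1 (one active group)
print(group_norm_21(alpha, groups))   # ~0.943 (l2-norm of the active group)
```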
However, solving (6) is also an NP-hard problem. Majumdar and Ward (2009a) replaced the NP-hard ||·||_{2,0} by its nearest convex surrogate, i.e. ||α||_{2,1} = Σ_l ||α_l||_2, so that the optimization problem takes the following form,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_{2,1} \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_2 < \varepsilon \qquad (7)
This is also a convex optimization problem. Although it is not as widely studied as (5), customized solvers exist for it as well (SPGL1). Once this optimization problem is solved, the classification algorithm is quite straightforward (a small sketch of the shared classification rule follows the list).

GSC algorithm
1. Solve the optimization problem expressed in (7).
2. Find those i's for which ||α_i||_2 > 0.
3. For those classes i satisfying the condition in step 2, repeat the following two steps:
   a. Reconstruct a sample by a linear combination of the training samples of that class, v_recon(i) = Σ_{j=1}^{n_i} α_{i,j} v_{i,j}.
   b. Find the error between the reconstructed sample and the given test sample, error(v_test, i) = ||v_{k,test} − v_recon(i)||_2.
4. Once error(v_test, i) is obtained for every class i, choose the class having the minimum error as the class of the given test sample.
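Once the weight vector has been recovered (by the SC, the GSC, or the classifier proposed below), the class decision itself is the residual rule of the steps above. The following sketch is our own NumPy illustration of that rule, under the assumption that a solver for α is available; it is not the solver used in the paper.

```python
import numpy as np

def classify_by_residual(V, v_test, alpha, groups):
    """Assign v_test to the class whose training samples best reconstruct it.

    V      : d x N matrix of training samples (columns grouped by class)
    alpha  : recovered weight vector (from any of the SC/GSC/proposed solvers)
    groups : list of index arrays; groups[i] = columns of V belonging to class i
    """
    errors = []
    for g in groups:
        v_recon = V[:, g] @ alpha[g]              # class-wise reconstruction
        errors.append(np.linalg.norm(v_test - v_recon))
    return int(np.argmin(errors)), errors
```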
Although the GSC yields better results than the SC, it is not perfect either. The inner l2-norm of ||·||_{2,1} promotes a smooth solution and therefore selects all the training samples of the class. This may not always be desired. In the face recognition problem discussed in the Introduction, a test face image with a head pose somewhere between frontal and left profile should be expressed as a linear combination of the training samples belonging to these two poses. However, the GSC will try to express the test sample using all poses, even training samples from right profile views.

3. Proposed classifier

We have discussed the shortcomings of the SC and the GSC in terms of the objective function in the optimization problem (3), which encodes prior knowledge about the solution. The other issue common to both the SC and the GSC is the l2-norm mismatch. There is no reason to believe that the error term in (2) is Gaussian; therefore the l2-norm mismatch may not be optimal, and it is worth exploring other lq-norms for the mismatch. We propose to solve (2) via the following optimization problem,

\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p} \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_q < \varepsilon \qquad (8)

where \|\alpha\|_{m,p} = \left( \sum_l \|\alpha_l\|_m^p \right)^{1/p}.

The inner lm-norm of ||·||_{m,p} controls the selection of training samples within each group. The outer lp-norm of ||·||_{m,p} controls the sparsity in the selection of groups. The lq-norm mismatch accounts for a non-Gaussian error term. For the GSC the values of m, p and q were 2, 1 and 2, respectively, and for the SC the corresponding settings were m = 1, p = 1 and q = 2.

The problem with the GSC is that the inner l2-norm of the mixed norm promotes a smooth solution, i.e. one involving all the training samples in the class. As argued earlier, this may not be desired: some training samples may not be required to represent the test sample, in which case the corresponding weights should be zero. In other words, the weights for each class (α_l) can be sparse. To promote a sparse solution within the group, we choose the inner norm of the mixed norm to be an lm-norm. Following work in Compressed Sensing (Chartrand and Yin, 2008), the value of m is varied between 0 and 2; the lower the value, the sparser the result. However, the lower the value of m, the more susceptible it is to error in the data, therefore the value of m should be kept somewhere above 0. The same reasoning applies to the outer lp-norm of the ||·||_{m,p} mixed norm: in order to achieve sparsity in the chosen classes, the value of p should be between 0 and 1 (a short sketch of how ||·||_{m,p} is computed is given at the end of this section).

The norm for the mismatch is dictated by the nature of the approximation error in (2). In previous works the error was assumed to be normal, but there is no physical or mathematical intuition that suggests a normally distributed error. In this paper, we vary the lq-norm of the mismatch and see which norm gives the best results. We keep the value of q a positive integer, so that the constraint is at least convex.

There is no specialized algorithm to solve (8). Therefore, we propose an elegant method to solve (8) based on the Iterative Reweighted Least Squares (IRLS) approach (Rao and Kreutz-Delgado, 1999; Chartrand and Yin, 2008; Daubechies et al., preprint). The actual classification algorithm for determining the class with the proposed classifier is the same as that of the GSC, so we do not repeat it. The fundamental difference between the previous classifiers (SC/GSC) and the proposed one lies in the underlying optimization.
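As a concrete illustration of the objective in (8), the following sketch (our own NumPy code, with toy values chosen purely for illustration) evaluates the mixed norm ||α||_{m,p} for the parameter choices that recover the SC and GSC objectives and for an intermediate setting.

```python
import numpy as np

def mixed_norm(alpha, groups, m, p):
    # ||alpha||_{m,p} = ( sum_l ||alpha_l||_m^p )^(1/p), alpha_l = weights of class l
    group_norms = np.array([np.sum(np.abs(alpha[g])**m)**(1.0/m) for g in groups])
    return np.sum(group_norms**p)**(1.0/p)

groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
alpha = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.1])

print(mixed_norm(alpha, groups, m=1, p=1))      # SC-like objective (plain l1-norm): 1.6
print(mixed_norm(alpha, groups, m=2, p=1))      # GSC objective ||.||_{2,1}: ~1.218
print(mixed_norm(alpha, groups, m=1.5, p=0.8))  # an intermediate (m, p) setting
```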
4. Optimization algorithm

The optimization problem (8) is the same as the following problem, because the minimizer of ||·||_t is the same as the minimizer of ||·||_t^t:

\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p}^p \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_q^q < \sigma \qquad (9)

In standard optimization terminology, the term ||α||_{m,p}^p is the objective function and ||v_test − Vα||_q^q < σ is the constraint. This constrained form has the following equivalent unconstrained formulation for a proper choice of the Lagrangian λ,

\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p}^p + \lambda \|v_{\text{test}} - V\alpha\|_q^q \qquad (10)

The constant λ is related to σ, but the relationship between them cannot be determined analytically in most practical cases. In many situations (as in ours) the term σ is not even known beforehand; consequently trying to determine λ is meaningless. In this paper, we employ a non-parametric regularization method which does not require specifying either λ or σ.

4.1. Convergence of the optimization problem

The problem we attempt to solve is non-convex. In general, non-convex optimization problems are not guaranteed to reach a global minimum. However, it was recently shown by LaTourette (2008) that quasi-convex problems such as lp-norm minimization (Chartrand and Staneva, 2008; Chartrand and Yin, 2008) are guaranteed to reach a global minimum. We show that our problem is also quasi-convex and hence is guaranteed to reach a global minimum. A function f : R^n → R is said to be quasi-convex if it satisfies the following relationship for β ∈ (0, 1):

f(\beta x + (1-\beta) y) \leq \max[f(x), f(y)] \qquad (11)

For our problem, the constraint is convex, but the objective function is not. We need to show that the objective function \|x\|_{m,p}^p = \sum_l \|x_l\|_m^p is quasi-convex:

\sum_l \|\beta x_l + (1-\beta) y_l\|_m^p = \sum_l \left( \sum_i (\beta x_i + (1-\beta) y_i)^m \right)^{p/m}

Let z_i = \max[x_i, y_i]. Then

\|\beta x + (1-\beta) y\|_{m,p}^p \leq \sum_l \left( \sum_i (\beta z_i + (1-\beta) z_i)^m \right)^{p/m} \leq \sum_l \left( \sum_i z_i^m \right)^{p/m} = \max[\|x\|_{m,p}^p, \|y\|_{m,p}^p]

Thus the objective function is quasi-convex; therefore, following LaTourette (2008), we are guaranteed to reach a global minimum for our optimization problem.
4.2. Reweighted Least Squares formulation of objective function

In the IRLS method, the modeling term ||·||_{m,p}^p is approximated by a weighted l2-norm,

\|\alpha\|_{m,p}^p \approx \|W_m \alpha\|_2^2 \qquad (12)

The weight matrix W_m is updated at each iteration. At iteration t, it takes the following form,

W_m(t) = \text{diag}\left( \|\alpha_i\|_m^{p/2 - m/2} \, |\alpha_{i,j}|^{m/2 - 1} \right) \qquad (13)

where the subscript i denotes the ith class and the subscript j denotes the coefficient corresponding to the jth sample in the ith class. The choice of this particular weight matrix is based on the idea that, when the solution reaches convergence, the weighted l2-norm behaves as a near-perfect approximation of the original ||·||_{m,p}^p. When coefficients of α become zero (which is expected, since α is sparse), the corresponding diagonal elements of the weight matrix approach infinity. To avoid such a situation, the weight matrix is perturbed slightly, so that

W_m(t) = \text{diag}\left( \|\alpha_i\|_m^{p/2 - m/2} \, |\alpha_{i,j}|^{m/2 - 1} + \delta(t) \right) \qquad (14)

The perturbation factor δ(t) is reduced at each iteration, so that it has a negligible effect when the solution converges. The idea of perturbing the diagonal matrix in IRLS algorithms was proposed by Chartrand and Yin (2008), who showed that more robust results are obtained from the IRLS method with perturbation than without. A sketch of this weight construction follows.
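The following is a minimal sketch, in our own NumPy notation, of how the diagonal of W_m in (14) can be formed; the small numerical floor `eps` is our implementation safeguard against exactly zero coefficients and is not part of the paper's formula.

```python
import numpy as np

def weight_m_diag(alpha, groups, m, p, delta, eps=1e-12):
    """Diagonal of W_m(t) per Eq. (14) for the current estimate alpha.

    groups[i] holds the indices of the coefficients alpha_{i,j} of class i.
    """
    w = np.empty_like(alpha, dtype=float)
    for g in groups:
        # ||alpha_i||_m, floored by eps to keep negative exponents finite (our safeguard)
        group_norm = max(np.sum(np.abs(alpha[g])**m)**(1.0/m), eps)
        coeff_mag = np.maximum(np.abs(alpha[g]), eps)
        w[g] = group_norm**(p/2.0 - m/2.0) * coeff_mag**(m/2.0 - 1.0) + delta
    return w   # W_m = np.diag(w); keeping only the vector is cheaper
```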
4.3. Reweighted Least Squares formulation of constraint

The lq-norm constraint is treated in the same way as the objective function. It is approximated by a weighted l2-norm in the following way,

\|v_{k,\text{test}} - V\alpha\|_q^q \approx \|W_f (v_{k,\text{test}} - V\alpha)\|_2^2 \qquad (15)

At each iteration, the actual misfit is approximated by (15). The weight matrix is updated at each iteration as follows:

W_f(t) = \text{diag}\left( |v_{k,\text{test}} - V\alpha(t-1)|^{q/2 - 1} \right) \qquad (16)

It can easily be verified that this form of the weight matrix leads to a perfect approximation of the desired lq-norm when the solution converges. In order to bound the elements of the weight matrix, it is slightly perturbed by a factor δ at each iteration. The perturbation factor is reduced after each iteration, so that it is negligible when the solution converges. The perturbed weight matrix takes the form

W_f(t) = \text{diag}\left( |v_{k,\text{test}} - V\alpha(t-1) + \delta(t)|^{q/2 - 1} \right) \qquad (17)

Such reweighted schemes for approximating lq-norm misfit constraints have been employed earlier (Wohlberg and Rodríguez, 2007). That earlier work employed a thresholding scheme to bound the weight matrix instead of perturbing it as we do. Even though both achieve the same objective (bounding the elements of the matrix), our algorithm is robust to the choice of the perturbation factor, whereas theirs is sensitive to the choice of the threshold.
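A companion sketch for the misfit weights of (17), again in our own NumPy notation (for q = 2 the weights reduce to ones and the ordinary l2 misfit is recovered):

```python
import numpy as np

def weight_f_diag(V, v_test, alpha_prev, q, delta):
    # Diagonal of W_f(t) per Eq. (17), built from the previous iterate alpha(t-1).
    residual = v_test - V @ alpha_prev
    return np.abs(residual + delta)**(q/2.0 - 1.0)
```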
4.4. Optimization algorithm

The problem is to solve (10). With the approximations made in Sections 4.2 and 4.3, (10) can be represented (approximately) in the following form,

\hat{\alpha} = \min_{\alpha} \|W_m \alpha\|_2^2 + \lambda \|W_f (v_{k,\text{test}} - V\alpha)\|_2^2 \qquad (18)

Alternatively, (18) can be expressed in the form

\hat{\alpha} = \min_{x} \left\| \begin{bmatrix} W_f & 0 \\ 0 & W_m \end{bmatrix} \left( \begin{bmatrix} V \\ \lambda I \end{bmatrix} x - \begin{bmatrix} v_{k,\text{test}} \\ 0 \end{bmatrix} \right) \right\|_2^2 \qquad (19)

The closed-form solution to (19) is

\hat{\alpha} = \left( \lambda W_m^T W_m + V^T W_f^T W_f V \right)^{-1} V^T W_f^T W_f \, v_{k,\text{test}} \qquad (20)

It is possible to apply (20) iteratively, updating the weight matrices until the solution converges; in fact this is the solution proposed by Wohlberg and Rodríguez (2007) for the problem of Total Variation minimization. Applying (20) directly for our purpose has one major drawback – choosing the regularization parameter λ. We set out to solve (9). Even though there is a relation between the amount of misfit σ in (9) and the regularization parameter λ of (10), the relation is not analytical. Moreover, for our problem we do not even know σ, so there is no way to choose the regularization parameter. We therefore adopt a non-parametric regularization technique. With a change of variable (pre-conditioning) u = W_m α (this is just a rescaling; the sparsity of the original variable α and the transformed variable u is the same), the closed-form solution becomes

\hat{u} = \left( \lambda I + W_m^{-T} V^T W_f^T W_f V W_m^{-1} \right)^{-1} W_m^{-T} V^T W_f^T W_f \, v_{k,\text{test}} \qquad (21)
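For a fixed, user-chosen λ, (20) is just a weighted ridge-type linear system. The sketch below is our own NumPy illustration of that direct solve; `lam` is an assumed value used purely for demonstration, since the paper itself avoids choosing λ via the non-parametric scheme described next.

```python
import numpy as np

def solve_closed_form(V, v_test, wm_diag, wf_diag, lam):
    # Direct solution of Eq. (20): (lam*Wm^T Wm + V^T Wf^T Wf V)^{-1} V^T Wf^T Wf v
    WfV = wf_diag[:, None] * V                      # W_f V  (W_f is diagonal)
    A = lam * np.diag(wm_diag**2) + WfV.T @ WfV     # lam*Wm^T Wm + V^T Wf^T Wf V
    b = WfV.T @ (wf_diag * v_test)                  # V^T Wf^T Wf v_test
    return np.linalg.solve(A, b)
```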
The usual way to solve (21) is by the conjugate gradient (CG) method. Since we do not know λ, we can efficiently solve (21) by setting λ = 0 and controlling the number of CG iterations for regularization (Tradz et al., 2003). The number of CG iterations is controlled by Generalized Cross Validation (GCV). Following Tradz et al. (2003), the following approximate GCV criterion is used to control the number of CG iterations:

GCV(t) = \frac{\|V\alpha(t) - v_{k,\text{test}}\|_2^2}{(N - t)^2} \qquad (22)

where N is the length of the weight vector α. The iterations t are stopped when the GCV criterion reaches a minimum. The GCV criterion is very cheap to compute; it only requires dividing the current residual by a scalar. How the GCV acts as a regularizer can be understood intuitively: the numerator (the misfit) reduces at each CG iteration, so that the solution fits the observed data better, but the denominator reduces as well, so the reduction in misfit is penalized. Thus the solution is regularized. For a theoretical treatment, the reader is referred to Haber (1997).

Our discussion of the algorithm is now complete. It is expressed tersely in the following pseudo code (an illustrative sketch in code is given after it).

Initialization – set δ(0) = 1 and α̂(0) = min_α ||v_{k,test} − Vα||_2^2, solved by regularized CG.

At iteration t – continue the following steps till convergence (i.e. either δ is less than 10^{-6} or the number of iterations has reached the maximum limit):
(1) Find the current weight matrices from equations (14) and (17):
    a. W_m(t) = diag(||α_i||_m^{p/2−m/2} |α_{i,j}|^{m/2−1} + δ(t))
    b. W_f(t) = diag(|v_{k,test} − Vα(t−1) + δ(t)|^{q/2−1})
(2) Form a new matrix as required by (18): L = W_f V W_m^{-1}.
(3) Scale the test sample: t = W_f v_{k,test}.
(4) Solve û(t) = min_u ||t − Lu||_2^2 via regularized CG.
(5) Find α by rescaling u: α(t) = W_m^{-1} u(t).
(6) Reduce δ by a factor of 10 if ||v_{k,test} − Vα||_q has reduced.
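Below is a minimal end-to-end sketch of the above pseudo code in NumPy. It is our own illustrative implementation under stated assumptions, not the authors' released code (which is available from Matlab Central): the CG-with-GCV inner solver is hand-rolled, the weight vectors of Eqs. (14) and (17) are re-derived inline so the sketch is self-contained, and the small `eps` floor is our numerical safeguard.

```python
import numpy as np

def weights(alpha, V, v_test, groups, m, p, q, delta, eps=1e-12):
    """Diagonals of W_m and W_f per Eqs. (14) and (17); eps is our own floor."""
    wm = np.empty_like(alpha)
    for g in groups:
        gnorm = max(np.sum(np.abs(alpha[g])**m)**(1.0/m), eps)
        wm[g] = gnorm**(p/2 - m/2) * np.maximum(np.abs(alpha[g]), eps)**(m/2 - 1) + delta
    wf = np.abs(v_test - V @ alpha + delta)**(q/2 - 1)
    return wm, wf

def cg_gcv(L, t, max_iter=50):
    """min ||t - L u||^2 by CG on the normal equations, regularized by
    stopping at the minimum of the GCV criterion (22)."""
    n = L.shape[1]
    u = np.zeros(n)
    r = L.T @ (t - L @ u)
    p_dir = r.copy()
    rs_old = r @ r
    best_u, best_gcv = u.copy(), np.inf
    for it in range(1, min(max_iter, n) + 1):
        Lp = L @ p_dir
        step = rs_old / max(Lp @ Lp, 1e-30)
        u = u + step * p_dir
        r = r - step * (L.T @ Lp)
        gcv = np.sum((t - L @ u)**2) / (n - it)**2 if it < n else np.inf
        if gcv < best_gcv:
            best_gcv, best_u = gcv, u.copy()
        else:
            break                       # GCV has passed its minimum: stop (regularization)
        rs_new = r @ r
        p_dir = r + (rs_new / rs_old) * p_dir
        rs_old = rs_new
    return best_u

def improved_group_sparse_weights(V, v_test, groups, m, p, q, max_outer=50):
    """IRLS solver sketched from the pseudo code; returns the weight vector alpha."""
    delta = 1.0                                        # delta(0) = 1
    alpha = cg_gcv(V, v_test)                          # alpha(0): regularized LS fit
    misfit = np.sum(np.abs(v_test - V @ alpha)**q)
    for _ in range(max_outer):
        wm, wf = weights(alpha, V, v_test, groups, m, p, q, delta)
        L = (wf[:, None] * V) / wm[None, :]            # L = W_f V W_m^{-1}
        t_vec = wf * v_test                            # t = W_f v_test
        u = cg_gcv(L, t_vec)
        alpha = u / wm                                 # alpha = W_m^{-1} u
        new_misfit = np.sum(np.abs(v_test - V @ alpha)**q)
        if new_misfit < misfit:
            delta /= 10.0                              # reduce delta if misfit reduced
        misfit = new_misfit
        if delta < 1e-6:
            break
    return alpha
```

Combined with the `classify_by_residual` rule sketched in Section 2, `improved_group_sparse_weights(V, v_test, groups, m=1.6, p=1, q=2)` would produce the weight vector used for the class decision (these parameter values are simply the YaleB setting reported in Section 5, used here for illustration).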
This algorithm is non-parametric: it does not require specifying parameters such as λ or ε. Moreover, the algorithm is guaranteed to converge (possibly to a local minimum), since it follows the IRLS methodology, whose convergence has been proved (Daubechies et al., preprint). But we know that the global and local minima of our problem coincide, since it is quasi-convex.

5. Experimental evaluation

The experimental evaluation was carried out in two parts. In the first part, the experiments were carried out on benchmark datasets from the University of California, Irvine Machine Learning (UCI ML) repository. The second part of the experiments was carried out on two widely used face and character recognition datasets, namely YaleB and USPS, respectively. For the UCI ML datasets, we only chose those datasets which do not have any missing values or unlabeled samples.

The proposed classifier has three parameters to be determined – m, p and q: the inner and outer norms of the mixed-norm objective function and the norm of the constraint, respectively. The previous classifiers – the SC (m = 1, p = 1, q = 2) and the GSC (m = 2, p = 1, q = 2) – are special cases of our proposed classifier. In this work, we show that it is possible to achieve better classification results than either of the previous ones by properly tuning the three parameters.

We randomly select about 10% of the data from each of the UCI ML databases as a tuning dataset. The (m, p, q) parameters for the proposed classifier are determined by cross validation. The rest of the data from each database is used for testing the classification accuracy of the three classifiers. Leave-One-Out cross validation is used for testing the classification results (Table 1).
Table 1. Results on UCI ML datasets.

Dataset           SC       GSC      Proposed   (m, p, q)
Page Block        94.78    95.66    96.11      (2, 0.8, 2)
Abalone           27.17    27.17    32.03      (1.2, 1, 4)
Segmentation      96.31    94.09    96.31      (1, 1, 2)
Yeast             57.75    58.94    63.47      (1.8, 0.8, 1)
German Credit     69.30    74.50    74.50      (2, 1, 2)
Tic-Tac-Toe       78.89    84.41    86.79      (2, 1, 1)
Vehicle           65.58    73.86    74.98      (2, 1, 1)
Australian Cr.    85.94    86.66    86.66      (2, 1, 2)
Balance Scale     93.33    95.08    96.66      (1.4, 0.8, 2)
Ionosphere        86.94    90.32    93.09      (1, 0.6, 1)
Liver             66.68    70.21    75.83      (1.8, 0.8, 3)
Ecoli             81.53    82.88    82.88      (2, 1, 2)
Glass             68.43    70.19    71.02      (1.6, 1, 2)
Wine              85.62    85.62    88.95      (1.2, 0.8, 6)
Iris              96.00    96.00    98.33      (1.8, 1, 1)
Lymphography      85.81    86.42    85.81      (1, 1, 2)
Hayes Roth        40.23    41.01    43.12      (1.6, 1, 2)
Satellite         80.30    82.37    82.37      (2, 1, 2)
Haberman          40.52    43.28    47.96      (1.2, 0.8, 4)
Table 2. Results on YaleB.

Dimensionality   SC      GSC     Proposed
30               83.10   86.61   88.32
56               92.83   93.40   95.04
120              95.92   96.15   97.02
504              98.09   98.09   98.94
Full             98.09   98.09   98.94
Table 3. Results on USPS.

Dimensionality   SC      GSC     Proposed
40               93.82   92.02   94.77
80               94.77   94.66   95.36
120              94.77   94.77   95.89
Full             94.77   94.77   96.55
The values of (m, p, q) used to achieve these results with the proposed classifier are shown as well.

The results in Table 1 show how the flexibility of the proposed classifier can be harnessed to achieve better (or at least as good) results than the previous classifiers. Out of the 19 datasets, our proposed classifier shows better results on 15 of them; for the rest, the proposed classifier is as good as the previous ones. The proposed classifier shows significantly better results in those cases where the classification accuracy is low.

In the second part of the experiments we use two standard image recognition datasets – YaleB for face recognition and USPS for handwritten digit recognition. For both databases, the dimensionality of the images was reduced by Principal Component Analysis.

In the YaleB database, the images are stored as 192 × 168 pixel grayscale images. We followed the same methodology as Yang et al. (2009); only the frontal faces were chosen for recognition. Ten percent of the images were used as the tuning set for the proposed classifier. Of the remaining images, one half (for each individual) were selected for training and the other half for testing. The division into training and testing sets was done five times, and the average over these five cases is reported. For the USPS database the test set and the training set are already segregated; we chose about 10% of the data from the training set for tuning the parameters of the proposed classifier. The samples in USPS are stored as 16 × 16 grayscale images.

The results for the YaleB and USPS datasets are tabulated in Tables 2 and 3, respectively. For YaleB, the best results are achieved for m = 1.6, p = 1, q = 2; for USPS the corresponding values are m = 1.2, p = 0.8, q = 1.

The results in Tables 2 and 3 are interesting. For YaleB the best results are obtained for m = 1.6 and p = 1, which is close to the next best classifier, the GSC (m = 2, p = 1). The same is true for USPS: the parameters giving the best results (m = 1.2, p = 0.8) are near those of the SC (m = 1, p = 1), which gives the next best results. Even though the absolute gain in recognition accuracy from the proposed classifier is not large, it must be noted that the recognition accuracies of all the classifiers are in the high 90s, so a profound improvement is not to be expected.

6. Conclusion

This paper proposes a new classification algorithm based on the simple assumption that the training samples of a particular class approximately form a linear basis for any new test sample belonging to that class. The assumption appears simple, but incorporating its implications into the actual optimization problem involved in the classification is far from
trivial. Two previous works are based on the said assumption – the Sparse Classifier (SC) (Yang et al., 2009) and the Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) – yet neither could perfectly account for all the implications of this assumption. In this paper, we discuss the issues with the previous classifiers and propose a new classifier that takes care of all the inherent implications. Our proposed classifier is actually a generalization of both the SC and the GSC; the previous ones are just special cases of our proposed method with particular parameter values.

The proposed classifier requires solving a new non-convex optimization problem. Even though the problem is not convex, it is not an arbitrary non-convex problem: it is quasi-convex and therefore is guaranteed to reach a global minimum. We solve this new optimization problem in an elegant manner via an Iterative Reweighted Least Squares approach.

The proposed classifier is compared with the previous ones on benchmark databases from the UCI Machine Learning Repository and also on the YaleB and USPS face and character recognition databases, respectively. The recognition results show that the proposed classifier gives better results than the other two; in the worst case, the result is equal to the better of the two previous classifiers (SC and GSC).

We have made our code available on the web (Majumdar Matlab Central). We encourage readers to use this code for their own problems. As we have mentioned throughout the text, the optimization problem it solves is a very general one; with different parameter settings, different convex and non-convex problems can be solved. We expect our implementation to be beneficial to a wide variety of readers interested in signal processing and machine learning.

References

Amaldi, E., Kann, V., 1998. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theor. Comput. Sci. 209, 237–260.
Chartrand, R., Staneva, V., 2008. Restricted isometry properties and nonconvex compressive sensing. Inverse Prob. 24 (035020), 1–14.
Chartrand, R., Yin, W., 2008. Iteratively reweighted algorithms for compressive sensing. In: IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, pp. 3869–3872.
Chen, S., Donoho, D., Saunders, M., 2001. Atomic decomposition by basis pursuit. SIAM Rev. 43 (1), 129–159.
Daubechies, I., DeVore, R., Fornasier, M., Sinan Güntürk, C., preprint. Iteratively reweighted least squares minimization for sparse recovery.
Donoho, D.L., 2006. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59, 797–829.
Eldar, Y.C., Kuppinger, P., Bölcskei, H., 2009. Compressed sensing of block-sparse signals: Uncertainty relations and efficient recovery. CoRR abs/0906.3173.
Haber, E., 1997. Numerical Strategies for the Solution of Inverse Problems. Ph.D. Thesis, Univ. of British Columbia.
Huang, J., Zhang, T., 2008. The benefit of group sparsity. arXiv:0901.2962v1.
l1-magic.
LaTourette, K., 2008. Sparse Reconstructions via Quasiconvex Homotopic Regularization.
Majumdar Matlab Central.
Majumdar, A., Ward, R.K., 2009a. Classification via group sparsity promoting regularization. In: IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, pp. 861–864.
Majumdar, A., Ward, R.K., 2009b. Fast Group Sparse Classifier. In: IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, pp. 11–16.
Rao, B.D., Kreutz-Delgado, K., 1999. An affine scaling methodology for best basis selection. IEEE Trans. Signal Process. 47 (1), 187–200.
SPGL1.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 (1), 267–288.
Tradz, D., Ulrych, T., Sacchi, M., 2003. Latest views of the sparse Radon transform. Geophysics 68 (1), 386–399.
Wohlberg, B., Rodríguez, P., 2007. An iteratively reweighted norm algorithm for minimization of total variation functionals. IEEE Signal Process. Lett. 14 (12), 948–995.
Yang, Y., Wright, J., Ma, Y., Sastry, S.S., 2009. Feature selection in face recognition: A sparse representation perspective. IEEE Trans. PAMI 31 (2), 210–227.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 (2), 301–320.