Pattern Recognition Letters 31 (2010) 1959–1964
Improved Group Sparse Classifier
Angshul Majumdar *, Rabab K. Ward
Department of Electrical and Computer Engineering, University of British Columbia, Canada
Article history: Received 6 November 2009. Available online 27 July 2010. Communicated by F. Roli.

Abstract
This work proposes a new classifier based on the assumption that the training samples of a particular class approximately form a linear basis for any new test sample belonging to that class. This is not a new assumption; two previous works, namely the Sparse Classifier (Yang et al., 2009) and the Group Sparse Classifier (Majumdar and Ward, 2009a), have been built upon it. However, both of the previous works suffer from certain shortcomings: the optimization algorithms they propose do not capture all the implications of the said assumption. This work accounts for the intricacies of the assumption and consequently requires solving a new non-convex optimization problem. We develop an elegant approach to solve this optimization problem based on the Iterative Reweighted Least Squares method. Our classifier is very flexible; the previous classifiers are actually two special cases of it. The proposed classifier is rigorously compared against the previous ones stemming from the said assumption on benchmark classification databases. The results indicate that in most cases the proposed classifier is significantly better than the previous ones, and in the worst case it is as good as the previous ones.
Keywords: Classification; Quasi-convex optimization; Face recognition; Character recognition
1. Introduction

Recently there has been interest in a classification model where the training samples of a particular class are assumed to form a linear basis for approximately representing any new test sample belonging to that class. This assumption has led to the development of some new classifiers such as the Sparse Classifier (SC) (Yang et al., 2009), the Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) and its fast version, the Fast Group Sparse Classifier (FGSC) (Majumdar and Ward, 2009b). Formally, the assumption can be expressed as

v_{k,\text{test}} = \alpha_{k,1} v_{k,1} + \alpha_{k,2} v_{k,2} + \cdots + \alpha_{k,n_k} v_{k,n_k} + e_k \qquad (1)

where v_{k,test} is the test sample of the kth class, v_{k,i} is the ith training sample of the kth class, α_{k,i} is the corresponding weight and e_k is the approximation error.

Geometrically, this assumption implies that each class comprises a few subspaces, each spanned by some training vectors, and that a new test sample can be approximately represented in the union of these subspaces. To take a concrete example, consider the problem of face recognition. Faces appear in different poses – frontal, left profile, right profile, etc. Each of the poses can be assumed to lie on a subspace. If there are enough training samples, a few samples in each pose will be sufficient to capture the variability of that pose. Now if there is a new test sample with a pose somewhere between frontal and left profile, it may be expressed as a linear combination of the training samples in the frontal and left profile poses. The relative contribution of the training samples in each pose is decided by their weights (α).

All the aforementioned classifiers (SC, GSC and FGSC) determine the unknown weights (the α_{k,i}'s) by solving different optimization problems and use these weights to determine the class of the test sample. The first step for all of them is to express the assumption in terms of all the training samples instead of just one class as in (1). In compact matrix–vector notation it can be expressed as

v_{k,\text{test}} = V\alpha + e_k \qquad (2)

where V = [v_{1,1} | \cdots | v_{1,n_1} | \cdots | v_{k,1} | \cdots | v_{k,n_k} | \cdots | v_{C,1} | \cdots | v_{C,n_C}] and \alpha = [\alpha_{1,1} \cdots \alpha_{1,n_1} \cdots \alpha_{k,1} \cdots \alpha_{k,n_k} \cdots \alpha_{C,1} \cdots \alpha_{C,n_C}]'. Given v_{k,test} and V, the problem is to determine the class of the test sample. To achieve this under the said assumption, the unknown weight vector α must be solved for. The way to do this is to solve an optimization problem of the following form,
\hat{\alpha} = \min_{\alpha} u(\alpha) \quad \text{such that} \quad \|v_{k,\text{test}} - V\alpha\|_2 \leq \sigma \qquad (3)
The objective function u(α) encodes prior information about the nature of α, and the constraint arises from the assumption that the misfit is normally distributed. The different classification methods vary in their choice of the objective function.
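Before discussing the individual classifiers, a small sketch may help make the setup of (1)–(3) concrete. The code below is our own NumPy illustration (the sizes, variable names and random data are hypothetical, not from the paper); it builds the dictionary V by stacking the training samples of each class as columns and records the class grouping that all of the classifiers below rely on.

```python
import numpy as np

# Hypothetical toy setup: a few classes, a few training samples per class,
# each sample a d-dimensional feature vector (e.g. a vectorized face image).
rng = np.random.default_rng(0)
d, samples_per_class, classes = 64, 5, 3

# V stacks the training samples of every class column-wise:
# V = [v_{1,1} | ... | v_{1,n_1} | ... | v_{C,1} | ... | v_{C,n_C}]
V = np.hstack([rng.standard_normal((d, samples_per_class)) for _ in range(classes)])

# 'groups' records which columns of V belong to which class, so that the
# weight vector alpha can later be read group-by-group (alpha_1, ..., alpha_C).
groups = [np.arange(c * samples_per_class, (c + 1) * samples_per_class)
          for c in range(classes)]

# A test sample generated according to assumption (1): a linear combination of
# the training samples of class 0 plus a small approximation error e_k.
alpha_true = np.zeros(V.shape[1])
alpha_true[groups[0]] = rng.standard_normal(samples_per_class)
v_test = V @ alpha_true + 0.01 * rng.standard_normal(d)
```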
In the SC (Yang et al., 2009) it is assumed that the weight vector α should be sparse, and therefore the l1-norm of the weight vector (||α||_1) is employed as the objective function. The choice of the l1-norm as a sparsifying function can be traced back to studies in statistics (Tibshirani, 1996) and signal processing (Chen et al., 2001). However, the l1-norm is not an appropriate choice in this case. As pointed out by Zou and Hastie (2005), the l1-norm selects only a single sample from a group of correlated training samples. For the face recognition example, this implies that if the test sample is a pose between frontal and left profile, the SC (employing the l1-norm) will try to express the test sample either as a frontal or as a left profile, but not as a combination of both. Clearly this is not desired.

To address this issue, the GSC was proposed by Majumdar and Ward (2009a). It employed a mixed norm as the objective function, given by ||α||_{2,1} = Σ_k ||α_k||_2. This norm ensures that, for a particular class, all the training samples are selected, but the number of selected classes is sparse (Huang and Zhang, 2008; Eldar et al., 2009). The GSC (and the FGSC, which is a fast implementation of the GSC) showed good results, but it is not perfect either. This is because the inner l2-norm of the mixed norm selects all the training samples of a particular class. We again refer to the problem of face recognition: if a test image is somewhere between the left profile and frontal views, this classifier will try to represent it as a combination of all the views – left, frontal and right. This is erroneous as well.

The SC and the GSC are two extremes – the SC (in the extreme case) selects only one sample from the entire class, while the GSC selects all of them. Ideally, we would like the classifier to represent the test sample only as a combination of the relevant subspaces and not all of them. For the face recognition problem, this means that if a test sample is somewhere between the left and frontal views, it should be represented only by training samples corresponding to these views and not by training samples from the right profile view. The existing classifiers (SC and GSC) cannot address this problem. This paper is dedicated to finding methods to overcome this shortcoming, so that the classifier can operate between the two extremes.

The other issue we tackle in this paper is the relaxation of the l2-norm misfit in (3). There is no reason (physical or mathematical) to assume that the misfit is normally distributed. In this work we experiment with other norms for the misfit.

This paper therefore has two major contributions: (i) changing the objective function to address the aforementioned issues; and (ii) changing the norm of the misfit. Both of these lead to new optimization problems, which are solved in an efficient manner in this paper.

The rest of the paper is divided into several sections. In the next section, the background of this work is discussed. The proposed classification model is described in Section 3. In Section 4, the optimization problem behind the proposed classifier is detailed. The experimental results are presented in Section 5. The conclusions of the work are discussed in Section 6.

2. Background

The assumption that a new test sample can be approximated as a linear combination of the training samples of its class was proposed by Yang et al. (2009) for the Sparse Classifier (SC). Following the same assumption, the Group Sparse Classifier (GSC) was later developed by Majumdar and Ward (2009a).
The fast version of the GSC (FGSC) is not a different classifier; it employs fast algorithms to solve the same optimization problem. Therefore we discuss the SC and the GSC in detail in order to understand their shortcomings.

2.1. Sparse classifier

The classification assumption is mathematically expressed in (2). At the risk of repetition, it is
v_{k,\text{test}} = V\alpha + e_k

Yang et al. (2009) argue that, according to the assumption, the weight vector α should be sparse, because all the classes apart from the correct one will have zero weights associated with their training vectors. This prior knowledge about the weight vector can be imposed on the optimization problem for solving α in the following form,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_0 \quad \text{such that} \quad \|v_{k,\text{test}} - V\alpha\|_2 < \varepsilon \qquad (4)
The l0-norm is not a true norm; it only counts the number of non-zeros in the vector. Owing to this definition, the minimizer of this norm is the sparsest solution. However, it is known that (4) is an NP-hard problem (Amaldi and Kann, 1998). Therefore, following the work of Donoho (2006), the NP-hard l0-norm is replaced by its closest convex surrogate, the l1-norm, and instead of (4) the following problem is solved,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_1 \quad \text{such that} \quad \|v_{k,\text{test}} - V\alpha\|_2 < \varepsilon \qquad (5)
This is a convex optimization problem and can be solved by quadratic programming. A host of solvers are already available for (5) (l1-magic; SPGL1). Once a sparse solution α is obtained, the following classification algorithm is used to determine the class of the test sample.

SC algorithm
1. Solve the optimization problem expressed in (5).
2. For each class i, repeat the following two steps:
   a. Reconstruct a sample by a linear combination of the training samples belonging to that class, v_recon(i) = Σ_{j=1}^{n_i} α_{i,j} v_{i,j}.
   b. Find the error between the reconstructed sample and the given test sample, error(v_test, i) = ||v_{k,test} − v_recon(i)||_2.
3. Once the error for every class is obtained, choose the class having the minimum error as the class of the given test sample.

There is a limitation to l1-norm minimization. If there is a group of samples whose pair-wise correlations are very high, then the Lasso tends to select only one sample from the group (Zou and Hastie, 2005). For the problem of classification, the correlation between the samples of each class is high. In such a situation, the SC will try to represent the test sample by a single training sample only, which will not always give the desired results. We have already discussed the repercussions for the face recognition example in Section 1: if the test face image is actually expressed as a linear combination of frontal and left profile views, the SC will represent it from one view only (the one most correlated with the test sample). This can give erroneous results.

2.2. Group Sparse Classifier

The Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) relies on the same assumption as the SC. The GSC argues that the solution to (2) has non-zero coefficients corresponding to the vectors of the correct class and is zero for the other classes. Therefore, the solution to the inverse problem should ensure that all the weights for the correct class are selected. This condition is not satisfied by ordinary l1-norm minimization. In order to ensure that all the weights for the class are selected, the following optimization problem was proposed,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_{2,0} \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_2 < \varepsilon \qquad (6)
The mixed norm ||·||_{2,0} is defined for α = [α_{1,1}, …, α_{1,n_1} | α_{2,1}, …, α_{2,n_2} | … | α_{k,1}, …, α_{k,n_k}]^T (the lth block being α_l) as ||α||_{2,0} = Σ_l I(||α_l||_2 > 0), where I(||α_l||_2 > 0) = 1 if ||α_l||_2 > 0 and 0 otherwise. Here, k is the number of classes.
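To illustrate the group norms, here is a small sketch in our own NumPy notation (not code from the paper) that evaluates ||α||_{2,0} and its convex surrogate ||α||_{2,1} for a toy grouped weight vector.

```python
import numpy as np

def group_norm_20(alpha, groups):
    # ||alpha||_{2,0}: number of groups (classes) with non-zero l2-norm
    return sum(np.linalg.norm(alpha[g]) > 0 for g in groups)

def group_norm_21(alpha, groups):
    # ||alpha||_{2,1}: sum of per-group l2-norms (convex surrogate of ||.||_{2,0})
    return sum(np.linalg.norm(alpha[g]) for g in groups)

# Toy weight vector with 3 classes of 2 samples each; only class 0 is active.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
alpha = np.array([0.8, -0.5, 0.0, 0.0, 0.0, 0.0])
print(group_norm_20(alpha, groups))   # 1 (one active group)
print(group_norm_21(alpha, groups))   # ~0.943 (l2-norm of the active group)
```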
However, solving (6) is also an NP-hard problem. Majumdar and Ward (2009a) replaced the NP-hard ||·||_{2,0} by its nearest convex surrogate, i.e. ||α||_{2,1} = Σ_l ||α_l||_2, so that the optimization problem takes the following form,
\hat{\alpha} = \min_{\alpha} \|\alpha\|_{2,1} \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_2 < \varepsilon \qquad (7)
This is also a convex optimization problem. Although it is not as widely studied as (5), customized solvers exist for it as well (SPGL1). Once this optimization problem is solved, the classification algorithm is quite straightforward (a small sketch of the shared classification rule follows the list).

GSC algorithm
1. Solve the optimization problem expressed in (7).
2. Find those i's for which ||α_i||_2 > 0.
3. For those classes i satisfying the condition in step 2, repeat the following two steps:
   a. Reconstruct a sample by a linear combination of the training samples of that class, v_recon(i) = Σ_{j=1}^{n_i} α_{i,j} v_{i,j}.
   b. Find the error between the reconstructed sample and the given test sample, error(v_test, i) = ||v_{k,test} − v_recon(i)||_2.
4. Once error(v_test, i) is obtained for every class i, choose the class having the minimum error as the class of the given test sample.
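Once the weight vector has been recovered (by the SC, the GSC, or the classifier proposed below), the class decision itself is the residual rule of the steps above. The following sketch is our own NumPy illustration of that rule, under the assumption that a solver for α is available; it is not the solver used in the paper.

```python
import numpy as np

def classify_by_residual(V, v_test, alpha, groups):
    """Assign v_test to the class whose training samples best reconstruct it.

    V      : d x N matrix of training samples (columns grouped by class)
    alpha  : recovered weight vector (from any of the SC/GSC/proposed solvers)
    groups : list of index arrays; groups[i] = columns of V belonging to class i
    """
    errors = []
    for g in groups:
        v_recon = V[:, g] @ alpha[g]              # class-wise reconstruction
        errors.append(np.linalg.norm(v_test - v_recon))
    return int(np.argmin(errors)), errors
```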
Although the GSC yields better results than the SC, it is not perfect either. The inner l2-norm of ||·||_{2,1} promotes a smooth solution and therefore selects all the training samples of the class. This may not always be desired. In the face recognition problem discussed in the Introduction, a test face image with a head pose somewhere between frontal and left profile should be expressed as a linear combination of the training samples belonging to these two poses. However, the GSC will try to express the test sample using all poses, even training samples from right profile views.

3. Proposed classifier

We have discussed the shortcomings of the SC and the GSC in terms of the objective function in the optimization problem (3), which encodes prior knowledge about the solution. The other issue common to both the SC and the GSC is the l2-norm mismatch. There is no reason to believe that the error term in (2) is Gaussian; therefore the l2-norm mismatch may not be optimal, and it is worth exploring other lq-norms for the mismatch. We propose to solve (2) via the following optimization problem,

\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p} \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_q < \varepsilon \qquad (8)

where \|\alpha\|_{m,p} = \left( \sum_l \|\alpha_l\|_m^p \right)^{1/p}.

The inner lm-norm of ||·||_{m,p} controls the selection of training samples within each group. The outer lp-norm of ||·||_{m,p} controls the sparsity in the selection of groups. The lq-norm mismatch accounts for a non-Gaussian error term. For the GSC the values of m, p and q were 2, 1 and 2, respectively, and for the SC the corresponding settings were m = 1, p = 1 and q = 2.

The problem with the GSC is that the inner l2-norm of the mixed norm promotes a smooth solution, i.e. one involving all the training samples in the class. As argued earlier, this may not be desired: some training samples may not be required to represent the test sample, in which case the corresponding weights should be zero. In other words, the weights for each class (α_l) can be sparse. To promote a sparse solution within the group, we choose the inner norm of the mixed norm to be an lm-norm. Following work in Compressed Sensing (Chartrand and Yin, 2008), the value of m is varied between 0 and 2; the lower the value, the sparser the result. However, the lower the value of m, the more susceptible it is to error in the data, therefore the value of m should be kept somewhere above 0. The same reasoning applies to the outer lp-norm of the ||·||_{m,p} mixed norm: in order to achieve sparsity in the chosen classes, the value of p should be between 0 and 1 (a short sketch of how ||·||_{m,p} is computed is given at the end of this section).

The norm for the mismatch is dictated by the nature of the approximation error in (2). In previous works the error was assumed to be normal, but there is no physical or mathematical intuition that suggests a normally distributed error. In this paper, we vary the lq-norm of the mismatch and see which norm gives the best results. We keep the value of q a positive integer, so that the constraint is at least convex.

There is no specialized algorithm to solve (8). Therefore, we propose an elegant method to solve (8) based on the Iterative Reweighted Least Squares (IRLS) approach (Rao and Kreutz-Delgado, 1999; Chartrand and Yin, 2008; Daubechies et al., preprint). The actual classification algorithm for determining the class with the proposed classifier is the same as that of the GSC, so we do not repeat it. The fundamental difference between the previous classifiers (SC/GSC) and the proposed one lies in the underlying optimization.
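As a concrete illustration of the objective in (8), the following sketch (our own NumPy code, with toy values chosen purely for illustration) evaluates the mixed norm ||α||_{m,p} for the parameter choices that recover the SC and GSC objectives and for an intermediate setting.

```python
import numpy as np

def mixed_norm(alpha, groups, m, p):
    # ||alpha||_{m,p} = ( sum_l ||alpha_l||_m^p )^(1/p), alpha_l = weights of class l
    group_norms = np.array([np.sum(np.abs(alpha[g])**m)**(1.0/m) for g in groups])
    return np.sum(group_norms**p)**(1.0/p)

groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
alpha = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.1])

print(mixed_norm(alpha, groups, m=1, p=1))      # SC-like objective (plain l1-norm): 1.6
print(mixed_norm(alpha, groups, m=2, p=1))      # GSC objective ||.||_{2,1}: ~1.218
print(mixed_norm(alpha, groups, m=1.5, p=0.8))  # an intermediate (m, p) setting
```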
4. Optimization algorithm

The optimization problem (8) is the same as the following problem, because the minimizer of ||·||_t is the same as the minimizer of ||·||_t^t:

\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p}^p \quad \text{such that} \quad \|v_{\text{test}} - V\alpha\|_q^q < \sigma \qquad (9)

In standard optimization terminology, the term ||α||_{m,p}^p is the objective function and ||v_test − Vα||_q^q < σ is the constraint. This constrained form has the following equivalent unconstrained formulation for a proper choice of the Lagrangian λ,

\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p}^p + \lambda \|v_{\text{test}} - V\alpha\|_q^q \qquad (10)

The constant λ is related to σ, but the relationship between them cannot be determined analytically in most practical cases. In many situations (as in ours) the term σ is not even known beforehand; consequently trying to determine λ is meaningless. In this paper, we employ a non-parametric regularization method which does not require specifying either λ or σ.

4.1. Convergence of the optimization problem

The problem we attempt to solve is non-convex. In general, non-convex optimization problems are not guaranteed to reach a global minimum. However, it was recently shown by LaTourette (2008) that quasi-convex problems such as lp-norm minimization (Chartrand and Staneva, 2008; Chartrand and Yin, 2008) are guaranteed to reach a global minimum. We show that our problem is also quasi-convex and hence is guaranteed to reach a global minimum. A function f : R^n → R is said to be quasi-convex if it satisfies the following relationship for β ∈ (0, 1):

f(\beta x + (1-\beta) y) \leq \max[f(x), f(y)] \qquad (11)

For our problem, the constraint is convex, but the objective function is not. We need to show that the objective function \|x\|_{m,p}^p = \sum_l \|x_l\|_m^p is quasi-convex:

\sum_l \|\beta x_l + (1-\beta) y_l\|_m^p = \sum_l \left( \sum_i (\beta x_i + (1-\beta) y_i)^m \right)^{p/m}

Let z_i = \max[x_i, y_i]. Then

\|\beta x + (1-\beta) y\|_{m,p}^p \leq \sum_l \left( \sum_i (\beta z_i + (1-\beta) z_i)^m \right)^{p/m} \leq \sum_l \left( \sum_i z_i^m \right)^{p/m} = \max[\|x\|_{m,p}^p, \|y\|_{m,p}^p]

Thus the objective function is quasi-convex; therefore, following LaTourette (2008), we are guaranteed to reach a global minimum for our optimization problem.
4.2. Reweighted Least Squares formulation of objective function

In the IRLS method, the modeling term ||·||_{m,p}^p is approximated by a weighted l2-norm,

\|\alpha\|_{m,p}^p \approx \|W_m \alpha\|_2^2 \qquad (12)

The weight matrix W_m is updated at each iteration. At iteration t, it takes the following form,

W_m(t) = \text{diag}\left( \|\alpha_i\|_m^{p/2 - m/2} \, |\alpha_{i,j}|^{m/2 - 1} \right) \qquad (13)

where the subscript i denotes the ith class and the subscript j denotes the coefficient corresponding to the jth sample in the ith class. The choice of this particular weight matrix is based on the idea that, when the solution reaches convergence, the weighted l2-norm behaves as a near-perfect approximation of the original ||·||_{m,p}^p. When coefficients of α become zero (which is expected, since α is sparse), the corresponding diagonal elements of the weight matrix approach infinity. To avoid such a situation, the weight matrix is perturbed slightly, so that

W_m(t) = \text{diag}\left( \|\alpha_i\|_m^{p/2 - m/2} \, |\alpha_{i,j}|^{m/2 - 1} + \delta(t) \right) \qquad (14)

The perturbation factor δ(t) is reduced at each iteration, so that it has a negligible effect when the solution converges. The idea of perturbing the diagonal matrix in IRLS algorithms was proposed by Chartrand and Yin (2008), who showed that more robust results are obtained from the IRLS method with perturbation than without. A sketch of this weight construction follows.
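The following is a minimal sketch, in our own NumPy notation, of how the diagonal of W_m in (14) can be formed; the small numerical floor `eps` is our implementation safeguard against exactly zero coefficients and is not part of the paper's formula.

```python
import numpy as np

def weight_m_diag(alpha, groups, m, p, delta, eps=1e-12):
    """Diagonal of W_m(t) per Eq. (14) for the current estimate alpha.

    groups[i] holds the indices of the coefficients alpha_{i,j} of class i.
    """
    w = np.empty_like(alpha, dtype=float)
    for g in groups:
        # ||alpha_i||_m, floored by eps to keep negative exponents finite (our safeguard)
        group_norm = max(np.sum(np.abs(alpha[g])**m)**(1.0/m), eps)
        coeff_mag = np.maximum(np.abs(alpha[g]), eps)
        w[g] = group_norm**(p/2.0 - m/2.0) * coeff_mag**(m/2.0 - 1.0) + delta
    return w   # W_m = np.diag(w); keeping only the vector is cheaper
```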
4.3. Reweighted Least Squares formulation of constraint

The lq-norm constraint is treated in the same way as the objective function. It is approximated by a weighted l2-norm in the following way,

\|v_{k,\text{test}} - V\alpha\|_q^q \approx \|W_f (v_{k,\text{test}} - V\alpha)\|_2^2 \qquad (15)

At each iteration, the actual misfit is approximated by (15). The weight matrix is updated at each iteration as follows:

W_f(t) = \text{diag}\left( |v_{k,\text{test}} - V\alpha(t-1)|^{q/2 - 1} \right) \qquad (16)

It can easily be verified that this form of the weight matrix leads to a perfect approximation of the desired lq-norm when the solution converges. In order to bound the elements of the weight matrix, it is slightly perturbed by a factor δ at each iteration. The perturbation factor is reduced after each iteration, so that it is negligible when the solution converges. The perturbed weight matrix takes the form

W_f(t) = \text{diag}\left( |v_{k,\text{test}} - V\alpha(t-1) + \delta(t)|^{q/2 - 1} \right) \qquad (17)

Such reweighted schemes for approximating lq-norm misfit constraints have been employed earlier (Wohlberg and Rodríguez, 2007). That earlier work employed a thresholding scheme to bound the weight matrix instead of perturbing it as we do. Even though both achieve the same objective (bounding the elements of the matrix), our algorithm is robust to the choice of the perturbation factor, whereas theirs is sensitive to the choice of the threshold.
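A companion sketch for the misfit weights of (17), again in our own NumPy notation (for q = 2 the weights reduce to ones and the ordinary l2 misfit is recovered):

```python
import numpy as np

def weight_f_diag(V, v_test, alpha_prev, q, delta):
    # Diagonal of W_f(t) per Eq. (17), built from the previous iterate alpha(t-1).
    residual = v_test - V @ alpha_prev
    return np.abs(residual + delta)**(q/2.0 - 1.0)
```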
4.4. Optimization algorithm

The problem is to solve (10). With the approximations made in Sections 4.2 and 4.3, (10) can be represented (approximately) in the following form,

\hat{\alpha} = \min_{\alpha} \|W_m \alpha\|_2^2 + \lambda \|W_f (v_{k,\text{test}} - V\alpha)\|_2^2 \qquad (18)

Alternatively, (18) can be expressed in the form

\hat{\alpha} = \min_{x} \left\| \begin{bmatrix} W_f & 0 \\ 0 & W_m \end{bmatrix} \left( \begin{bmatrix} V \\ \lambda I \end{bmatrix} x - \begin{bmatrix} v_{k,\text{test}} \\ 0 \end{bmatrix} \right) \right\|_2^2 \qquad (19)

The closed-form solution to (19) is

\hat{\alpha} = \left( \lambda W_m^T W_m + V^T W_f^T W_f V \right)^{-1} V^T W_f^T W_f \, v_{k,\text{test}} \qquad (20)

It is possible to apply (20) iteratively, updating the weight matrices until the solution converges; in fact this is the solution proposed by Wohlberg and Rodríguez (2007) for the problem of Total Variation minimization. Applying (20) directly for our purpose has one major drawback – choosing the regularization parameter λ. We set out to solve (9). Even though there is a relation between the amount of misfit σ in (9) and the regularization parameter λ of (10), the relation is not analytical. Moreover, for our problem we do not even know σ, so there is no way to choose the regularization parameter. We therefore adopt a non-parametric regularization technique. With a change of variable (pre-conditioning) u = W_m α (this is just a rescaling; the sparsity of the original variable α and the transformed variable u is the same), the closed-form solution becomes

\hat{u} = \left( \lambda I + W_m^{-T} V^T W_f^T W_f V W_m^{-1} \right)^{-1} W_m^{-T} V^T W_f^T W_f \, v_{k,\text{test}} \qquad (21)
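For a fixed, user-chosen λ, (20) is just a weighted ridge-type linear system. The sketch below is our own NumPy illustration of that direct solve; `lam` is an assumed value used purely for demonstration, since the paper itself avoids choosing λ via the non-parametric scheme described next.

```python
import numpy as np

def solve_closed_form(V, v_test, wm_diag, wf_diag, lam):
    # Direct solution of Eq. (20): (lam*Wm^T Wm + V^T Wf^T Wf V)^{-1} V^T Wf^T Wf v
    WfV = wf_diag[:, None] * V                      # W_f V  (W_f is diagonal)
    A = lam * np.diag(wm_diag**2) + WfV.T @ WfV     # lam*Wm^T Wm + V^T Wf^T Wf V
    b = WfV.T @ (wf_diag * v_test)                  # V^T Wf^T Wf v_test
    return np.linalg.solve(A, b)
```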
The usual way to solve (21) is by the conjugate gradient (CG) method. Since we do not know λ, we can efficiently solve (21) by setting λ = 0 and controlling the number of CG iterations for regularization (Tradz et al., 2003). The number of CG iterations is controlled by Generalized Cross Validation (GCV). Following Tradz et al. (2003), the following approximate GCV criterion is used to control the number of CG iterations:

GCV(t) = \frac{\|V\alpha(t) - v_{k,\text{test}}\|_2^2}{(N - t)^2} \qquad (22)

where N is the length of the weight vector α. The iterations t are stopped when the GCV criterion reaches a minimum. The GCV criterion is very cheap to compute; it only requires dividing the current residual by a scalar. How the GCV acts as a regularizer can be understood intuitively: the numerator (the misfit) reduces at each CG iteration, so that the solution fits the observed data better, but the denominator reduces as well, so the reduction in misfit is penalized. Thus the solution is regularized. For a theoretical treatment, the reader is referred to Haber (1997).

Our discussion of the algorithm is now complete. It is expressed tersely in the following pseudo code (an illustrative sketch in code is given after it).

Initialization – set δ(0) = 1 and α̂(0) = min_α ||v_{k,test} − Vα||_2^2, solved by regularized CG.

At iteration t – continue the following steps till convergence (i.e. either δ is less than 10^{-6} or the number of iterations has reached the maximum limit):
(1) Find the current weight matrices from equations (14) and (17):
    a. W_m(t) = diag(||α_i||_m^{p/2−m/2} |α_{i,j}|^{m/2−1} + δ(t))
    b. W_f(t) = diag(|v_{k,test} − Vα(t−1) + δ(t)|^{q/2−1})
(2) Form a new matrix as required by (18): L = W_f V W_m^{-1}.
(3) Scale the test sample: t = W_f v_{k,test}.
(4) Solve û(t) = min_u ||t − Lu||_2^2 via regularized CG.
(5) Find α by rescaling u: α(t) = W_m^{-1} u(t).
(6) Reduce δ by a factor of 10 if ||v_{k,test} − Vα||_q has reduced.
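Below is a minimal end-to-end sketch of the above pseudo code in NumPy. It is our own illustrative implementation under stated assumptions, not the authors' released code (which is available from Matlab Central): the CG-with-GCV inner solver is hand-rolled, the weight vectors of Eqs. (14) and (17) are re-derived inline so the sketch is self-contained, and the small `eps` floor is our numerical safeguard.

```python
import numpy as np

def weights(alpha, V, v_test, groups, m, p, q, delta, eps=1e-12):
    """Diagonals of W_m and W_f per Eqs. (14) and (17); eps is our own floor."""
    wm = np.empty_like(alpha)
    for g in groups:
        gnorm = max(np.sum(np.abs(alpha[g])**m)**(1.0/m), eps)
        wm[g] = gnorm**(p/2 - m/2) * np.maximum(np.abs(alpha[g]), eps)**(m/2 - 1) + delta
    wf = np.abs(v_test - V @ alpha + delta)**(q/2 - 1)
    return wm, wf

def cg_gcv(L, t, max_iter=50):
    """min ||t - L u||^2 by CG on the normal equations, regularized by
    stopping at the minimum of the GCV criterion (22)."""
    n = L.shape[1]
    u = np.zeros(n)
    r = L.T @ (t - L @ u)
    p_dir = r.copy()
    rs_old = r @ r
    best_u, best_gcv = u.copy(), np.inf
    for it in range(1, min(max_iter, n) + 1):
        Lp = L @ p_dir
        step = rs_old / max(Lp @ Lp, 1e-30)
        u = u + step * p_dir
        r = r - step * (L.T @ Lp)
        gcv = np.sum((t - L @ u)**2) / (n - it)**2 if it < n else np.inf
        if gcv < best_gcv:
            best_gcv, best_u = gcv, u.copy()
        else:
            break                       # GCV has passed its minimum: stop (regularization)
        rs_new = r @ r
        p_dir = r + (rs_new / rs_old) * p_dir
        rs_old = rs_new
    return best_u

def improved_group_sparse_weights(V, v_test, groups, m, p, q, max_outer=50):
    """IRLS solver sketched from the pseudo code; returns the weight vector alpha."""
    delta = 1.0                                        # delta(0) = 1
    alpha = cg_gcv(V, v_test)                          # alpha(0): regularized LS fit
    misfit = np.sum(np.abs(v_test - V @ alpha)**q)
    for _ in range(max_outer):
        wm, wf = weights(alpha, V, v_test, groups, m, p, q, delta)
        L = (wf[:, None] * V) / wm[None, :]            # L = W_f V W_m^{-1}
        t_vec = wf * v_test                            # t = W_f v_test
        u = cg_gcv(L, t_vec)
        alpha = u / wm                                 # alpha = W_m^{-1} u
        new_misfit = np.sum(np.abs(v_test - V @ alpha)**q)
        if new_misfit < misfit:
            delta /= 10.0                              # reduce delta if misfit reduced
        misfit = new_misfit
        if delta < 1e-6:
            break
    return alpha
```

Combined with the `classify_by_residual` rule sketched in Section 2, `improved_group_sparse_weights(V, v_test, groups, m=1.6, p=1, q=2)` would produce the weight vector used for the class decision (these parameter values are simply the YaleB setting reported in Section 5, used here for illustration).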
This algorithm is non-parametric: it does not require specifying parameters such as λ or ε. Moreover, the algorithm is guaranteed to converge (possibly to a local minimum), since it follows the IRLS methodology, whose convergence has been proved (Daubechies et al., preprint). But we know that the global and local minima of our problem coincide, since it is quasi-convex.

5. Experimental evaluation

The experimental evaluation was carried out in two parts. In the first part, the experiments were carried out on benchmark datasets from the University of California, Irvine Machine Learning (UCI ML) repository. The second part of the experiments was carried out on two widely used face and character recognition datasets, namely YaleB and USPS, respectively. For the UCI ML datasets, we only chose those datasets which do not have any missing values or unlabeled samples.

The proposed classifier has three parameters to be determined – m, p and q: the inner and outer norms of the mixed-norm objective function and the norm of the constraint, respectively. The previous classifiers – the SC (m = 1, p = 1, q = 2) and the GSC (m = 2, p = 1, q = 2) – are special cases of our proposed classifier. In this work, we show that it is possible to achieve better classification results than either of the previous ones by properly tuning the three parameters.

We randomly select about 10% of the data from each of the UCI ML databases as a tuning dataset. The (m, p, q) parameters for the proposed classifier are determined by cross validation. The rest of the data from each database is used for testing the classification accuracy of the three classifiers. Leave-One-Out cross validation is used for testing the classification results (Table 1).
Table 1. Results on UCI ML datasets.

Dataset           SC       GSC      Proposed   (m, p, q)
Page Block        94.78    95.66    96.11      (2, 0.8, 2)
Abalone           27.17    27.17    32.03      (1.2, 1, 4)
Segmentation      96.31    94.09    96.31      (1, 1, 2)
Yeast             57.75    58.94    63.47      (1.8, 0.8, 1)
German Credit     69.30    74.50    74.50      (2, 1, 2)
Tic-Tac-Toe       78.89    84.41    86.79      (2, 1, 1)
Vehicle           65.58    73.86    74.98      (2, 1, 1)
Australian Cr.    85.94    86.66    86.66      (2, 1, 2)
Balance Scale     93.33    95.08    96.66      (1.4, 0.8, 2)
Ionosphere        86.94    90.32    93.09      (1, 0.6, 1)
Liver             66.68    70.21    75.83      (1.8, 0.8, 3)
Ecoli             81.53    82.88    82.88      (2, 1, 2)
Glass             68.43    70.19    71.02      (1.6, 1, 2)
Wine              85.62    85.62    88.95      (1.2, 0.8, 6)
Iris              96.00    96.00    98.33      (1.8, 1, 1)
Lymphography      85.81    86.42    85.81      (1, 1, 2)
Hayes Roth        40.23    41.01    43.12      (1.6, 1, 2)
Satellite         80.30    82.37    82.37      (2, 1, 2)
Haberman          40.52    43.28    47.96      (1.2, 0.8, 4)
Table 2. Results on YaleB.

Dimensionality   SC      GSC     Proposed
30               83.10   86.61   88.32
56               92.83   93.40   95.04
120              95.92   96.15   97.02
504              98.09   98.09   98.94
Full             98.09   98.09   98.94
Table 3. Results on USPS.

Dimensionality   SC      GSC     Proposed
40               93.82   92.02   94.77
80               94.77   94.66   95.36
120              94.77   94.77   95.89
Full             94.77   94.77   96.55
The values of (m, p, q) used to achieve these results with the proposed classifier are shown as well.

The results in Table 1 show how the flexibility of the proposed classifier can be harnessed to achieve better (or at least as good) results than the previous classifiers. Out of the 19 datasets, our proposed classifier shows better results on 15 of them; for the rest, the proposed classifier is as good as the previous ones. The proposed classifier shows significantly better results in those cases where the classification accuracy is low.

In the second part of the experiments we use two standard image recognition datasets – YaleB for face recognition and USPS for handwritten digit recognition. For both databases, the dimensionality of the images was reduced by Principal Component Analysis.

In the YaleB database, the images are stored as 192 × 168 pixel grayscale images. We followed the same methodology as Yang et al. (2009); only the frontal faces were chosen for recognition. Ten percent of the images were used as the tuning set for the proposed classifier. Of the remaining images, one half (for each individual) were selected for training and the other half for testing. The division into training and testing sets was done five times, and the average over these five cases is reported. For the USPS database the test set and the training set are already segregated; we chose about 10% of the data from the training set for tuning the parameters of the proposed classifier. The samples in USPS are stored as 16 × 16 grayscale images.

The results for the YaleB and USPS datasets are tabulated in Tables 2 and 3, respectively. For YaleB, the best results are achieved for m = 1.6, p = 1, q = 2; for USPS the corresponding values are m = 1.2, p = 0.8, q = 1.

The results in Tables 2 and 3 are interesting. For YaleB the best results are obtained for m = 1.6 and p = 1, which is close to the next best classifier, the GSC (m = 2, p = 1). The same is true for USPS: the parameters giving the best results (m = 1.2, p = 0.8) are near those of the SC (m = 1, p = 1), which gives the next best results. Even though the absolute gain in recognition accuracy from the proposed classifier is not large, it must be noted that the recognition accuracies of all the classifiers are in the high 90s, so a profound improvement is not to be expected.

6. Conclusion

This paper proposes a new classification algorithm based on the simple assumption that the training samples of a particular class approximately form a linear basis for any new test sample belonging to that class. The assumption appears simple, but incorporating its implications into the actual optimization problem involved in the classification is far from
trivial. Two previous works are based on the said assumption – the Sparse Classifier (SC) (Yang et al., 2009) and the Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) – yet neither could perfectly account for all the implications of this assumption. In this paper, we discuss the issues with the previous classifiers and propose a new classifier that takes care of all the inherent implications. Our proposed classifier is actually a generalization of both the SC and the GSC; the previous ones are just special cases of our proposed method with particular parameter values.

The proposed classifier requires solving a new non-convex optimization problem. Even though the problem is not convex, it is not an arbitrary non-convex problem: it is quasi-convex and therefore is guaranteed to reach a global minimum. We solve this new optimization problem in an elegant manner via an Iterative Reweighted Least Squares approach.

The proposed classifier is compared with the previous ones on benchmark databases from the UCI Machine Learning Repository and also on the YaleB and USPS face and character recognition databases, respectively. The recognition results show that the proposed classifier gives better results than the other two; in the worst case, the result is equal to the better of the two previous classifiers (SC and GSC).

We have made our code available on the web (Majumdar Matlab Central). We encourage readers to use this code for their own problems. As we have mentioned throughout the text, the optimization problem it solves is a very general one; with different parameter settings, different convex and non-convex problems can be solved. We expect our implementation to be beneficial to a wide variety of readers interested in signal processing and machine learning.

References

Amaldi, E., Kann, V., 1998. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theor. Comput. Sci. 209, 237–260.
Chartrand, R., Staneva, V., 2008. Restricted isometry properties and nonconvex compressive sensing. Inverse Prob. 24 (035020), 1–14.
Chartrand, R., Yin, W., 2008. Iteratively reweighted algorithms for compressive sensing. In: IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, pp. 3869–3872.
Chen, S., Donoho, D., Saunders, M., 2001. Atomic decomposition by basis pursuit. SIAM Rev. 43 (1), 129–159.
Daubechies, I., DeVore, R., Fornasier, M., Sinan Güntürk, C., preprint. Iteratively reweighted least squares minimization for sparse recovery.
Donoho, D.L., 2006. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59, 797–829.
Eldar, Y.C., Kuppinger, P., Bölcskei, H., 2009. Compressed sensing of block-sparse signals: Uncertainty relations and efficient recovery. CoRR abs/0906.3173.
Haber, E., 1997. Numerical Strategies for the Solution of Inverse Problems. Ph.D. Thesis, Univ. of British Columbia.
Huang, J., Zhang, T., 2008. The benefit of group sparsity. arXiv:0901.2962v1.
l1-magic.
LaTourette, K., 2008. Sparse Reconstructions via Quasiconvex Homotopic Regularization.
Majumdar Matlab Central.
Majumdar, A., Ward, R.K., 2009a. Classification via group sparsity promoting regularization. In: IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, pp. 861–864.
Majumdar, A., Ward, R.K., 2009b. Fast Group Sparse Classifier. In: IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, pp. 11–16.
Rao, B.D., Kreutz-Delgado, K., 1999. An affine scaling methodology for best basis selection. IEEE Trans. Signal Process. 47 (1), 187–200.
SPGL1.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 (1), 267–288.
Tradz, D., Ulrych, T., Sacchi, M., 2003. Latest views of the sparse Radon transform. Geophysics 68 (1), 386–399.
Wohlberg, B., Rodríguez, P., 2007. An iteratively reweighted norm algorithm for minimization of total variation functionals. IEEE Signal Process. Lett. 14 (12), 948–995.
Yang, Y., Wright, J., Ma, Y., Sastry, S.S., 2009. Feature selection in face recognition: A sparse representation perspective. IEEE Trans. PAMI 31 (2), 210–227.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 (2), 301–320.