Regularized discriminant entropy analysis

Regularized discriminant entropy analysis

Pattern Recognition 47 (2014) 806–819 Contents lists available at ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/pr Re...

1MB Sizes 3 Downloads 127 Views

Pattern Recognition 47 (2014) 806–819

Contents lists available at ScienceDirect

Pattern Recognition journal homepage: www.elsevier.com/locate/pr

Regularized discriminant entropy analysis Haitao Zhao a,b, W.K. Wong a,n a b

Institute of Textiles and Clothing, The Hong Kong Polytechnic University, Hong Kong Automation Department, East China University of Science and Technology, Shanghai, China

art ic l e i nf o

a b s t r a c t

Article history: Received 3 September 2012 Received in revised form 4 August 2013 Accepted 23 August 2013 Available online 5 September 2013

In this paper, we propose the regularized discriminant entropy (RDE) which considers both class information and scatter information on original data. Based on the results of maximizing the RDE, we develop a supervised feature extraction algorithm called regularized discriminant entropy analysis (RDEA). RDEA is quite simple and requires no approximation in theoretical derivation. The experiments with several publicly available data sets show the feasibility and effectiveness of the proposed algorithm with encouraging results. & 2013 Elsevier Ltd. All rights reserved.

Keywords: Regularized discriminant entropy Entropy-based learning Discriminant entropy analysis

1. Introduction Feature extraction from original data is an important step in pattern recognition, often dictated by practical feasibility. It is also an essential process in exploratory data analysis, where the goal is to map input data onto a feature space which reflects the inherent structure of original data. We are interested in methods that reveal or enhance the class structure of data, which is an essential procedure for a given classification problem. For unsupervised learning's sake, feature extraction is performed by constructing the most representative features out of original data. For pattern classification, however, the purpose of feature extraction is to map original data onto a discriminative feature space in which samples from different classes are clearly separated. Many approaches have been proposed for feature extraction, such as principal component analysis (PCA) [1], linear discriminant analysis (LDA) [1,2], isometric feature mapping (ISOMAP) [3], local linear embedding (LLE) [4], locality preserving projection (LPP) [5] and graph embedding [6]. As feature extraction methods are applied to realistic problems, where dimensionality is high or the amount of training data is very large, it is impractical to manually process the data. Therefore, feature extraction methods that can robustly obtain a low-dimensional subspace are of particular interest in practice. Many robust algorithms, such as robust PCA [7], robust ISOMAP [8,9], robust LLE [10] and robust LDA [11], have been proposed in the past few years. Recently, many feature extraction algorithms addressing the robust classification problem have involved the use of information theoretic learning (ITL) techniques [12–14]. The maximum likelihood n

Corresponding author. Tel.: þ 852 27666471; fax: 852 27731432. E-mail address: [email protected] (W.K. Wong).

0031-3203/$ - see front matter & 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2013.08.020

principle [15], entropy [16–18], Kullback Leibler (KL) divergence [19,20], Bhattacharyya distance [21], and Chernoff distance [22] are often selected to develop feature extraction algorithms. In these algorithms, however, the handling of robust feature extraction is achieved at the cost of increased computational complexity. The projection matrix or the projection procedure of these algorithms is often solved by iterative optimization which has relatively high computational complexity. Due to the often used non-convex constraints, these algorithms are also prone to the local minimum (or maximum) problem. In order to preserve computational simplicity and the characteristics of eigenvalue-based techniques, such as LDA, He et al. [11] proposed the maximum entropy robust discriminant analysis (MaxEnt-RDA) algorithm. Renyi's quadratic entropy was used in MaxEnt-RDA as a class-separability measure. The MaxEnt distribution was estimated by the nonparametric Parzen window density estimator with a Gaussian kernel. Due to the first-order Taylor expansion of each Gaussian kernel term, MaxEnt-RDA is an approximate algorithm which avoids iterative calculation of entropy. MaxEnt-RDA is proposed for discriminant feature extraction. It can effectively overcome the limitations of traditional LDA algorithms in a data distribution assumption and is robust against noisy data [11]. MaxEnt-RDA tried to obtain the projection matrix U by solving the following constraint MaxEnt problem: max HðU T XÞ U

s:t:

HðU T XjCÞ ¼ c1 ;

U T U ¼ I;

where HðU T XÞ ¼ ln

! 1 n n T T ∑ ∑ GðU x U x ; sÞ j i n2 i ¼ 1 j ¼ 1

ð1Þ

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819 T is the esimate pffiffiffiffiffiffiof Renyi's entropy by Parzen method, GðU xj  U T xi ; sÞ ¼ ð1= 2π sÞ expð J U T xj U T xi J 2 =2s2 Þ. And HðU T XjCÞ ¼ ∑cj ¼ 1 pðC j ÞHðU T XjC ¼ C j Þ, where Cj is the label of the jth class. The closed form solution of the constraint MaxEnt problem is hard to find, because the problem is nonlinear. To solve this problem, He et al. [11] used the first-order Taylor expansion of each Gaussian kernel:

GðU T xj U T xi ; sÞ  Gðxj xi ; sÞ J U T xj U T xi J 2 þconst:

ð2Þ

Substituting Eq. (2) into (1), MaxEnt-RDA is reduced to a constraint graph embedding problem: max TrðU T XLt XUÞ

s:t:

TrðU T XLw XUÞ ¼ c1 ;

U T U ¼ I;

where Lt ¼ Dt W t ; Lw ¼ Dw W w ; W tij ¼

2Gðxi xj ; sÞ ; s2 ∑ni¼ 1 ∑nj¼ 1 Gðxi xj ; sÞ

Ww ij ¼ I ðC i ¼ C j Þ pðC i Þ

2Gðxi xj ; sÞ ; s2 ∑ni¼ 1 ∑nj¼ 1 Gðxi xj ; sÞ

w t w n Dtii ¼ ∑nj¼ 1 W tij ; Dw are diagonal matrices. ii ¼ ∑j ¼ 1 W ij , D and D

For more details, refer to [11]. Das and Nenadic [23] proposed a discriminant feature extraction method based on information discriminant analysis (IDA) [24] and showed how a feature extraction matrix can be obtained through eigenvalue decomposition, reminiscent of LDA. IDA is based on maximization of an information-theoretic objective function referred to as a μmeasure, which enjoys many properties of mutual information and the Bayes error [24,23]. For a continuous random variable X A Rn and class variable fC 1 ; C 2 ; …; C c g, the simplified form of the μmeasure is given by ! c 1 μðXÞ ¼ lnðjST jÞ ∑ pðC i Þ lnðjSi jÞ ; 2 i¼1 where ST is the total (unconditional) covariance matrix, Si is the class-conditional covariance matrix. The optimal feature extraction matrix U is found by the maximization of μðZÞ, where Z ¼ U T X is the feature random variable. While both the gradient and the Hessian of μðZÞ (with respect to U) can be found analytically, the maximization of μ must be performed numerically. Das and Nenadic's method was referred to as approximate information discriminant analysis (AIDA) [23]. Das and Nenadic [23] showed that their criterion is related to the μmeasure in an approximate way, which made their algorithm simple and computationally efficient. Assume Q A Rnn be a symmetric positive definite matrix with eigenvalue and eigenvector matrices, Λ ¼ diagðλ1 ; …; λn Þ and V, respectively. Let xQ ¼ V diagðln λ1 ; …; ln λn ÞV T . AIDA evaluates the matrix c

SAIDA ¼ ln ST  ∑ pðC i Þln Si i¼1

1=2

in the transformed domain (note that ST -SW 1=2 Si SW

∑ci ¼ 1 pðC i ÞSi ).

1=2

ST SW

1=2

, Si - SW

and SW ¼ The eigenvectors corresponding to the largest m eigenvalues of SAIDA are taken to form the projection matrix U. For more details, refer to [23]. Motivated by information theoretic learning, we propose the regularized discriminant entropy (RDE), and then introduce a novel supervised feature extraction algorithm, called regularized discriminant entropy analysis (RDEA), which preserves computational

807

simplicity and characteristics of eigenvalue-based techniques. Several interesting perspectives should be addressed: 1. The RDE is simple and easy to understand. It considers both class information and scatter information on original data. Section 2 shows that maximization of the RDE has connections to the mean shift algorithm and the pre-image reconstruction. However, the RDE is designed for supervised learning and is based on the within-class entropy and scatter information on data. The regularization parameter used in the RDE is proposed for the first time ever and does not appear in the mean shift algorithm and the pre-image reconstruction. 2. RDEA utilizes the results of the RDE, which has a clear theoretical foundation. RDEA is reduced to a simple constrained optimization problem, which can be obtained by eigen decomposition. For more details, refer to Section 3. RDEA is a direct solution without approximation. This is quite different form MaxEnt-RDA and AIDA. In order to use eigenvalue-based techniques, both MaxEnt-RDA and AIDA use approximation in their theoretical derivation. 3. RDEA can be regarded as a framework for supervised feature extraction. Firstly, we use the most widely used Shannon's entropy to design the RDE. If other generalized entropies are used, one may obtain other forms of RDE. Secondly, we put orthogonal constraints on basis functions to obtain projection matrix. The RDE and the likelihood values can be preserved under orthogonal projection (refer to Section 3 for details). Other constraints can be used here to substitute orthogonal constraints. The combination of different constraints and the maximization of the RDE can derive different feature extraction algorithms. 4. In Section 5, we compare our method with several other second-order feature extraction techniques using real datasets. The results show that RDEA compares favorably with other methods. We conclude that RDEA should be considered as an alternative to the prevalent feature extraction techniques.

2. Maximum entropy principle and regularized discriminant entropy 2.1. Maximum entropy principle Let P ¼ ðp1 ; p2 ; …; pN Þ be a probability distribution for N variates x1 ; x2 ; …; xN , and then there is uncertainty about the outcomes. Shannon used the measure N

TðPÞ ¼  ∑ pi ln pi

ð3Þ

i¼1

to measure this uncertainty and called it the entropy of the probability distribution P [25]. It can be regarded as a measure of equality of p1 ; p2 ; …; pN among themselves. Renyi, Havrda and Charvat, Kapur, Sharma-Taneja and others proposed other functions [25–27] of p1 ; p2 ; …; pN to measure this uncertainty and called these functions as generalized measures of entropies or simply generalized entropies, such as Burg's entropy: N

T B ðPÞ ¼ ∑ ln pi

ð4Þ

i¼1

and Kapur's entropy: N

N

i¼1

i¼1

T K ðPÞ ¼  ∑ pi ln pi  ∑ ð1pi Þ lnð1pi Þ:

ð5Þ

808

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

If we have information about p1 ; p2 ; …; pN in the form of moment constraints, N

N

∑ pi ¼ 1;

∑ pi g r ðxi Þ ¼ ar ;

i¼1

i¼1

r ¼ 1; 2; …; M; M 5 N;

ð6Þ

there may be an infinity of probability distributions consistent with such information. According to Jaynes' maximum entropy principle [25], out of infinity of distributions, the distribution which best represents our knowledge is the one which maximizes uncertainty measure (Eqs. (3) and (4) or (5)) subject to Eq. (6). 2.2. Within-class entropy and between-class entropy Let X ¼ ½X 1 ; X 2 ; …; X c  A RnN be the vector space representation of c classes and X i ¼ ½xi1 ; xi2 ; …; xiNi  A RnNi is the representation of the ith class ði ¼ 1; …; cÞ; N and Ni denote the total number of data points and the number of data points in ith class, respectively; n is the dimensionality of the original data space. This part focuses on how to design a generalized measure for labeled data, which means class information is used to derive a generalized entropy. Since we need to consider class information, we give the following notion instead of directly using P ¼ ðp1 ; p2 ; …; pN Þ, whose subscripts have no relationship to label information. Let vij be the likelihood value of data point xij ði ¼ 1; 2; …; c; Ni c c i j ¼ 1; 2; …; N i Þ and let ∑N j ¼ 1 vij ¼ N i ; ∑i ¼ 1 ∑j ¼ 1 vij ¼ ∑i ¼ 1 N i ¼ N. Ni c i Since ∑N j ¼ 1 vij =N i ¼ 1 and ∑i ¼ 1 ∑j ¼ 1 vij =N ¼ 1, we have the follow-

ing theorem. Theorem 1. Let the total entropy Ni

c

T ¼ ∑ ∑

i¼1j¼1

vij vij ln N N

ð7Þ

i be Shannon's entropy of all the likelihood values, T i ¼ ∑N j¼1 vij =N i ln vij =N i be Shannon's entropy of the likelihood values in the ith class, the within-class entropy

c

Ni Ti i¼1 N

TW ¼  ∑

ð8Þ

be the mean of the entropies of the classes, N i =N be the priori probability of the ith class, and the between-class entropy c

TB ¼  ∑

i¼1

Ni N ln i N N

ð9Þ

be the entropy of the probabilities of the classes. Then, we have T ¼ T W þ T B:

ð10Þ

Ni, are fixed in the training process, TB will remain constant. In this case, we expect TW to be maximized. 2.3. Regularized discriminant entropy According to information theory [25], if only maximizing i within-class entropy TW, under the constraints1 ∑N j ¼ 1 vij ¼ Ni ði ¼ 1; 2; …; cÞ, we can obtain vij ¼ 1ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ. This results are data independent. It means one always use the same weights for different data points, even for the noisy data. In this paper, a regularization term is added to the objective function to make the within-class entropy maximization more suitable for real applications. Robust methods are very important for real applications. Robust statistics methods provide tools for statistics problems in which underlying assumptions are inexact [28]. A robust procedure should be insensitive to noisy data. There are several types of robust estimators. Among them are M-estimator (maximum likelihood type estimator), L-estimator (linear combinations of order statistics) and R-estimator (estimator based on rank transformation); RM estimator (repeated median) and LMS estimator (estimator using the least median of squares) [28]. We are interested in the M-estimator in this paper. Given i, consider the data points X i ¼ ½xi1 ; xi2 ; …; xiNi , where xij ¼ mi þ ϵij ðj ¼ 1; 2; …; N i Þ. The key problem of M estimation is to estimate the mean mi under the noise ϵij . The distribution of ϵij is not assumed to be known exactly. The only underlying assumption is that ϵi1 ; ϵi2 ; …; ϵiNi obey a symmetric, independent, identical distribution (symmetric i.i.d.). A robust estimator has to deal with departures from this assumption [28]. Let the residual errors be ϵij ¼ xij mi ðj ¼ 1; 2; …; N i Þ and the error penalty function be qðϵij Þ. The M-estimate m^ i is defined as the minimum of a global error function: m^ i ¼ argminEðmi Þ; where Ni

Eðmi Þ ¼ ∑ qðϵij Þ j¼1

In practice, the error penalty function qðÞ gives the contribution of each residual error to the objective function. Many functions can be used as the error penalty function [28]. In this paper, we use Mahalanobis distance, qðxij mi Þ ¼ ð1=ηi Þðxij mi ÞT ðxij mi Þ. In considering Eqs. (11) and (8), we can put the within-class entropy and robust estimation together and maximize the following objective function:

M;V

Ni v N c vij vij vij N i ij i ln ¼  ∑ ∑ ln N N Ni N i¼1j¼1 i ¼ 1 j ¼ 1 Ni N

¼ ∑

i¼1

c 1 N Ni vij vij N i Ni vij i ∑ ∑ ‖xij mi ‖2 ; ln γ ∑ Ni Ni i ¼ 1 N j ¼ 1 Ni i ¼ 1 ηi N j ¼ 1

argmaxJðV; MÞ ¼  ∑

Ni

T ¼ ∑ ∑ c

Ni vij vij ¼ ∑ qðxij mi Þ : Ni j ¼ 1 Ni

c

Proof. Since c

ð11Þ

mi

Ni

c

Ni

vij vij vij Ni N N ∑ ln  ∑ i ln i ∑ N j ¼ 1 Ni Ni i ¼ 1 N N j ¼ 1 Ni

c

c N N N ¼ ∑ i T i  ∑ i ln i N i¼1 N i¼1 N

under the constraints Ni

∑ vij ¼ N i ði ¼ 1; 2; …; cÞ;

j¼1

where V ¼ fvij gi ¼ 1;…;c;

¼ T W þ T B; we can get Theorem 1.

ð12Þ



If the classes are homogeneous (i.e., in each class the likelihood values of the data points are the same), the within-class entropy TW is maximized. If the sample numbers of different classes are the same, the between-class entropy TB is maximized. If T is fixed, from Eq. (10), TW being large implies that TB is small, and vice versa. In a given classification problem, if the total number of all the data points, N, and the number of the data points in the ith class,

j ¼ 1;…;N i

, M ¼ fm1 ; m2 ; …; mc g.

i The first term of JðV ; MÞ, ∑ci ¼ 1 ðN i =NÞ∑N j ¼ 1 ðvij =N i Þ ln vij =N i , is

the within-class entropy. The second term of JðV; MÞ, ∑ci ¼ 1 ð1=ηi Þ 2 i ðN i =NÞ∑N j ¼ 1 ‖xij mi ‖ vij =N i , is a regularization term and γ is a

positive regularization parameter. In this paper, we also set γ r 1. It means that in the objective function, the first term is always relatively important than the second term. Note that, if γ ¼ 0, the 1

From the discussion in Section 2, the constraint vij Z 0 is not needed.

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

objective function becomes maximizing within-class entropy TW and can only obtain a trivial solution (vij ¼ 1ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ). In this paper, we mainly analyzed the regularization parameter γ A ½0; 1. i For Shannon's entropy, let ηi ¼ ð1=N i ðN i 1ÞÞ∑N ∑Ni ðx xil Þ2 . k ¼ 1 l ¼ 1 ik It means that the covariance matrix Si of xij ðj ¼ 1; 2; …; Ni Þ is assumed to be ηi I. Then we have the Mahalanobis distance, qðxij mi Þ ¼ T ðxij mi ÞT S1 i ðxij mi Þ ¼ ð1=ηi Þðxij mi Þ ðxij mi Þ. The objection function JðV; MÞ in Eq. (12) is called the regularized discriminant entropy (RDE). 3. Regularized discriminant entropy analysis

809

For k ¼ 1; 2; …; MaxIter  Compute the likelihood value vijði ¼ 1; 2; …; c; j ¼ 1; 2; …; Ni Þ by Eq. (13);  Compute the mean vector of the ith class mi by Eq. (14);  Compute the objective function value by Eq. (12): J k ¼ JðV; MÞ;  If jJ k J k1 j o ε break and return vij ði ¼ 1; …; c; j ¼ 1; …; Ni Þ, mi ði ¼ 1; 2; …; cÞ; End if  Let mi ¼ mi ði ¼ 1; 2; …; cÞ. End for

This section shows how to design a linear feature extraction algorithm based on the RDE. Firstly, we maximize JðV; MÞ to obtain vij and mi ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ; then we utilize vij and mi to obtain the linear projection matrix for feature extraction.

Remark 1. According to Algorithm 1, the alternatively iterative procedure obtains a closed-form solution at each step. Therefore, Algorithm 1 increases the objective function value of Eq. (12) after each iteration. Again, since

3.1. Obtaining likelihood value vij and mean mi

JðV; MÞ ¼  ∑

c 1 N Ni vij vij N i Ni vij i ∑ ∑ J xij mi J 2 ln γ ∑ N N N Ni i j¼1 i i¼1 i ¼ 1 ηi N j ¼ 1 c

c

Obviously, the RDE (12) is non-convex. Here, we solve it by an alternative iteration scheme to produce a closed-form solution in i each iteration. Let mi ¼ ð1=N i Þ∑N j ¼ 1 xij ði ¼ 1; 2; …; cÞ. The iterative procedure consists of the following two main steps:

Step 1: Given mi ði ¼ 1; 2; …; cÞ and γ , compute the optimal solution vij ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ by maximizing the RDE. With a series of mathematical manipulations (Appendix A), we can obtain a closed-form solution:   γ 1 exp  J xij mi J 2 η vij ¼ N i  i γ 1 Ni ∑j ¼ 1 exp  J xij mi J 2

ð13Þ

ηi

Step 2: Compute the optimal solution mi ði ¼ 1; 2; …; cÞ by maximizing the RDE as   γ 1 Ni 2 Ni ∑ x exp  J x m J ij ij i j ¼ 1 ∑j ¼ 1 xij vij ηi mi ¼ ¼ ð14Þ   γ i 1 ∑N v N j ¼ 1 ij ∑j ¼i 1 exp  J xij mi J 2

ηi

(Appendix A for the derivation). The algorithm of maximizing the RDE can be found in the following Algorithm 1. Algorithm 1. The algorithm for maximizing the regularized discriminant entropy. Input:  X ¼ ½X 1 ; X 2 ; …; X c  — data matrix;  ηi ði ¼ 1; 2; …; cÞ — regularized parameter;  γ — regularization parameter;  ε — iterative stop threshold; Output:  vij ði ¼ 1; …; c; j ¼ 1; …; N i Þ — likelihood value of data point xij in the ith class;  mi ði ¼ 1; 2; …; cÞ — mean vector of the ith class. Procedure:  Let vij ¼ 1ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ; i  Let mi ¼ N1i ∑N j ¼ 1 xij ði ¼ 1; 2; …; cÞ;

 Compute J 0 ¼ JðV ; MÞ;

r ∑

i¼1

r

vij N i Ni vij ∑ ln N j ¼ 1 Ni Ni

1 c ∑ N ln Ni ; Ni¼1 i

the objective function value of Eq. (12) has a upper bound. According to Cauchy's convergence rule [29], Algorithm 1 is convergent. Remark 2. γ ð0 o γ r1Þ is similar to the so-called annealing parameter, which has been proposed for vector quantization (VQ) and clustering problems [30]. Some variants have been presented for clustering algorithms [31,32]. In this paper, γ is not used as an annealing parameter. It is fixed after the training process. Note that if γ ¼ 0, vij is always equal to 1, which means no scatter information on data is used in the design of the likelihood values. If γ ¼ 1, the scatter information of the data is fully considered. For γ ¼ 1, we present the relations to other algorithms in the following part. If 0 o γ o 1, both label information and scatter information are used in the likelihood values. 3.1.1. Relation to other entropies For comparison, we also use the within-class Burg's entropy and the within-class Kapur's entropy instead of the within-class entropy to design the objective function: Objective function using the within-class Burg's entropy: First, we given the following theorem. The theorem can be proved using a similar proof idea of Theorem 1. i Theorem 2. Let T B ¼ ∑ci ¼ 1 ∑N j ¼ 1 ln vij =N be the Burg's entropy of all i the similarities, T Bi ¼ ∑N j ¼ 1 ln vij =N i be the Burg's entropy of the

similarities in the ith class, within-class entropy T BW ¼ ∑ci ¼ 1 T Bi be the mean of the entropies of the classes and between-class entropy i T BB ¼ ∑ci ¼ 1 ∑N j ¼ 1 ln N i =N be the entropy of the probabilities of the

classes. Then, we have T B ¼ T Bw þ T BB :

ð15Þ

Using TWB instead of TW in Eq. (12), we have the following objective function: c

Ni

argmaxJ B ðV; MÞ ¼ ∑ ∑ ln M;V

i¼1j¼1

c 1 N Ni vij vij i ∑ J xij mi J 2 : γ ∑ Ni η N N i i¼1 i j¼1

ð16Þ

810

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

Here, we can also solve it by an alternative iteration scheme to produce a closed-form solution in each iteration. Let mi ¼ ð1=N i Þ

i mi ¼ ð1=N i Þ∑N j ¼ 1 xij ði ¼ 1; 2; …; cÞ. The iterative procedure consists of the following two main steps:

i ∑N j ¼ 1 xij ði ¼ 1; 2; …; cÞ. The iterative procedure consists of the

following two main steps:

Step 1: Given mi ði ¼ 1; 2; …; cÞ and γ , compute the optimal solution vijði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ by maximizing the J B ðV; MÞ. With a series of mathematical manipulations (omitted due to space reason), we can obtain a closedform solution:

Step 1: Given mi ði ¼ 1; 2; …; cÞ and γ , compute the optimal solution vijði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ by maximizing the J K ðV; MÞ. With a series of mathematical manipulations (omitted due to space reason), we can obtain a closedform solution: vij ¼ Ni

J xij mi J 2 1 Ni

∑j ¼ 1

J xij mi J

ð17Þ 2

with

ηi ¼

i where ηi and λi are the solutions of ∑N j ¼ 1 vij ¼ N i . Step 2: Compute the optimal solution mi ði ¼ 1; 2; …; cÞ by maximizing the J K ðV; MÞ as

mi ¼

N i Ni 1 ∑ γ N j ¼ 1 J xij mi J 2

i ∑N j ¼ 1 xij vij i ∑N j ¼ 1 vij

Ni

¼ ∑

j¼1

Step 2: Compute the optimal solution mi ði ¼ 1; 2; …; cÞ by maximizing the J B ðV ; MÞ as mi ¼

i ∑N j ¼ 1 xij vij

¼ i ∑N j¼1

1 J xij mi J 2 1

ð18Þ

J xij mi J 2

It can be found vij here can be viewed as the normalized inverse Euclidean distance. If J xij mi J 2 is large/small, vij is small/ large. Objective function using the within-class Kapur's entropy TWK: Let within-class Kapur's entropy be     c N Ni v c N Ni vij vij vij ij T KW ¼  ∑ i ∑ ln 1 : ln  ∑ i ∑ 1 Ni i ¼ 1 N j ¼ 1 Ni Ni i ¼ 1 N j ¼ 1 Ni For Shannon's entropy and Burg's entropy, we have T ¼ T W þ T B and T B ¼ T BW þ T BB . For other entropies, such as Kapur's entropy, it is not the case. For a given classification problem, maximizing the within-class Kapur's entropy is different from maximizing Kapur's entropy on the whole data set. It means that for maximizing Shannon's entropy (or Burg's entropy) of the whole data set, we can only consider maximizing Shannon's entropy (or Burg's entropy) of each class in the data set. However, for Kapur's entropy, it is not the case. If we try to use Kapur's entropy to design the objective function, the derivation should be too complicated to have a closed-form solution. But for comparison, We also use the within-class Kapur's entropy TWK instead of the within-class entropy TW to design the objective function. Using TWK instead of TW in Eq. (12), we have the following objective function:     c N Ni v c N Ni vij vij vij ij ln 1 argmax J K ðV; MÞ ¼  ∑ i ∑ ln  ∑ i ∑ 1 M;V Ni i ¼ 1 N j ¼ 1 Ni Ni i ¼ 1 N j ¼ 1 Ni vij 1 Ni Ni ∑ J xij mi J 2 : Ni i ¼ 1 ηi N j ¼ 1 c

γ ∑

xij  γ 1 1 þ exp  J xij mi J 2 λi

ð21Þ

ηi

i Since ∑N j ¼ 1 vij ¼ N i is a simple function, the computation of ηi and λi is also very fast. We use the function “fsolve” in MATLAB to get the solutions of ηi and λi in each iteration.

i ∑N j ¼ 1 vij i ∑N j ¼ 1 xij

ð20Þ

ηi

1 vij ¼ N i

1  γ 1 1 þ exp  J xij mi J 2 λi

ð19Þ

Here, we can also solve it by an alternative iteration scheme to produce a closed-form solution in each iteration. Let

3.1.2. Relation to mean shift The mean shift is a nonparametric, iterative mode-seeking algorithm [33]. It was first developed by Fukunaga [34] based on kernel density estimation, and later generalized by Cheng [35] and Comaniciu and Meer [33]. For the data points fxi1 ; xi2 ; …; xiNi g of the ith class, the multivariate kernel density estimator with a Gaussian kernel and bandwidth matrix H ¼ ð1=ηi ÞI, computed in point mi is given by   1 Ni 1 f ðmi Þ p ð22Þ ∑ exp  J xij mi J 2 : Ni j ¼ 1 ηi The density gradient estimator is obtained as the gradient of the density estimator by exploiting the linearity of Eq. (22):   2 3 1 i   ∑N xij exp  J xij mi J 2 Ni j ¼ 1 6 7 ηi ^ ðmi Þ p ∑ exp  1 J xij mi J 2 6   mi 7 ∇f 4 N 5: 1 η j¼1 i ∑j ¼i 1 exp  J xij mi J 2

ηi

ð23Þ Both terms of the product in Eq. (23) have special significance. The second term   1 2 i ∑N x exp  J x m J j ¼ 1 ij ηi ij i   mi δ mi ¼ ð24Þ 1 2 i ∑N j ¼ 1 exp  J xij mi J

ηi

is the mean shift (i.e., the difference between the weighted mean and, mi, the center of the kernel). The implication of the mean shift is that the iteration procedure   1 ðkÞ 2 i ∑N x exp  J x m J ij ij i j¼1 ηi þ 1Þ   ; k ¼ 1; 2; … ð25Þ ¼ mðk i 1 ðkÞ 2 Ni ∑j ¼ 1 exp  J xij mi J

ηi

is a hill-climbing process to the nearest maximum of f ðmi Þ. Eq. (25) is the same as Eq. (14) with γ ¼ 1. Our algorithm can derive the mean shift algorithm from the perspective of maximization of the

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

RDE and can also extend the mean shift algorithm with a regularization parameter. The relation between the information theory and the mean shift is not thoroughly investigated in the statistical literature, with only [36] discussing it with Renyi's quadratic entropy.

3.2. Scatter matrices with likelihood values After the computation of vij ði ¼ 1; …; c; j ¼ 1; …; N i Þ and mi ði ¼ 1; 2; …; cÞ of Algorithm 1, we can obtain the mean vector m of all the data points as Ni

c

3.1.3. Relation to pre-image Define the nonlinear transformation from the input space to the feature space as

ΦðxÞ A F ;

Φ : R -F : n

For data X i ¼ ½xi1 ; xi2 ; …; xiNi , let the transformed features ΦðX i Þ ¼ ½Φðxi1 Þ; Φðxi2 Þ; …; ΦðxiNi Þ and the inner product ðΦðxij Þ; Φðxik ÞÞ ¼ expðð1=ηi Þ J xij xik J 2 Þ. The definitions of the nonlinear transformation and the inner product in the feature space are often used in kernel learning algorithms [37,16], such as kernel PCA [38] and kernel LDA [39]. Assume that ΦðxÞ is the mean of fΦðxi1 Þ; Φðxi2 Þ; …; ΦðxiNi Þg in i the feature space F , which means ΦðxÞ ¼ ð1=N i Þ∑N j ¼ 1 Φðxij Þ.

We attempt to find a data point z A Rn satisfying ΦðzÞ  ΦðxÞ. It means that we compute the mean of the transformed nonlinear features, and then inversely transform it back to the input space. However, due to the nonlinear transformation, the exact solution may not exist. In this part, we reconstruct it in the leastsquares sense. The vector z can be obtained by minimizing

ρðzÞ ¼ J ΦðzÞΦðxÞ J 2

811

m ¼ EðXÞ ¼ ∑ ∑ xij i¼1j¼1

vij : N

Define the total scatter matrix ST , within-class scatter matrix SW and between-class scatter matrix SB as ST ¼ E½ðXmÞðXmÞT  Ni

c

¼ ∑ ∑ ðxij mÞðxij mÞT i¼1j¼1

vij ; N

 c  N SW ¼ ∑ E½ðXmi ÞðXmi ÞT i i N i¼1 vij Ni Ni ∑ ðxij mi Þðxij mi ÞT Ni i¼1 N j¼1 c

¼ ∑

c

SB ¼ ∑

i¼1

Ni ðmmi Þðmmi ÞT ; N

ð30Þ

respectively. It is easy to find that Ni

c

ST ¼ ∑ ∑ ðxij mÞðxij mÞT i¼1j¼1

vij N

Ni

c

i¼1j¼1 c

2 Ni ρðzÞ ¼ ðΦðzÞ; ΦðzÞÞ ∑ ðΦðzÞ; Φðxij ÞÞ þ Δ: Ni j ¼ 1

c

ð26Þ

þ ∑

i¼1

Ni c vij vij þ ∑ ∑ ðmi mÞðmi mÞT N i¼1j¼1 N

Ni ðm mÞðmi mÞT N i

¼ SW þ SB :

This leads to a necessary condition for the extremum:   1 2 i ∑N x exp  J x z J ij ij j¼1 ηi   : z¼ 1 Ni ∑j ¼ 1 exp  J xij z J 2

3.2.1. Relation to the scatter matrices used in the linear discriminant analysis From information theory [25], if without any constraint, TW is maximized when vij ¼ 1ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ. The result that vij ¼ 1ði ¼ 1; 2; …; c; j ¼ 1; 2; …; Ni Þ is consistent with Laplace's principle of insufficient reason [25]. According to this principle, if we i have no information except that each vij Z0, ∑N j ¼ 1 vij ¼ N i and

c i ∑ci ¼ 1 ∑N j ¼ 1 vij ¼ ∑i ¼ 1 N i ¼ N, we should choose the uniform dis-

ηi

ð27Þ

ηi

i ð1=N i Þ∑N j ¼ 1 xij

Ni

i¼1j¼1

To optimize Eq. (26), we employ standard gradient descent methods. An optimal z can be determined as follows [38]:   Ni ∂ρðzÞ 1 ¼ ∑ exp  J xij z J 2 ðzxij Þ ¼ 0: ∂z ηi j¼1

Then we can devise an iteration scheme for z by   1 2 i ∑N j ¼ 1 xij exp η J xij zk J i   ; k ¼ 1; 2; …: zk þ 1 ¼ 1 2 i ∑N exp  J x z J ij k j¼1

Ni c vij vij þ ∑ ∑ ðx mi Þðmi mÞT N i ¼ 1 j ¼ 1 ij N

þ ∑ ∑ ðmi mÞðxij mi ÞT

Replacing terms independent of z by Δ, we obtain

ð29Þ

and

¼ ∑ ∑ ðxij mi Þðxij mi ÞT

1 Ni ¼ J ΦðzÞ ∑ Φðxij Þ J 2 : Ni j ¼ 1

ð28Þ

in the input space, Different from the mean Eq. (27), after convergence, is a weighted mean which is the least-square reconstruction of the mean of the nonlinear transformed features. Eq. (27) is basically the same as Eq. (14) with γ ¼ 1 (using mi instead of zk þ 1 and mi instead of zk). It can be seen that Algorithm 1 computes the weighted mean of the input space by reconstructing the mean of the reproducing kernel Hilbert space (RKHS) [38] in the least-squares sense. The connection with the pre-image and kernel method opens many interesting issues, especially when the regularization parameter can make the original pre-image algorithm more flexible for image reconstruction problems.

tribution for each class since there is no reason to choose any other distribution. Using vij ¼ 1 ði ¼ 1; 2; …; c; j ¼ 1; 2; …; N i Þ, we can obtain the mean vector m of all the data points as Ni

c

m ¼ EðXÞ ¼ ∑ ∑ xij i¼1j¼1

vij 1 c Ni ¼ ∑ ∑ x ; N N i ¼ 1 j ¼ 1 ij

ð31Þ

the mean vector mi of the ith class as Ni

mi ¼ EðX i Þ ¼ ∑ xij j¼1

vij 1 Ni ¼ ∑ x ; N i N i j ¼ 1 ij

ð32Þ

the total scatter matrix ST, the within-class scatter matrix SW and the between-class scatter matrix SB as c

Ni

ST ¼ E½ðXmÞðXmÞT  ¼ ∑ ∑ ðxij mÞðxij mÞT ; i¼1j¼1

  N c 1 c Ni  SW ¼ ∑ E½ðXmi ÞðXmi ÞT i i ¼ ∑ ∑ ðxij mi Þðxij mi ÞT  N Ni¼1j¼1 i¼1

812

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

Algorithm 2. The regularized discriminant entropy analysis (RDEA) algorithm.

and c

N SB ¼ ∑ i ðmmi Þðmmi ÞT ; i¼1 N respectively. Hence the scatter matrices ST, SB and SW used in LDA can be considered as special cases of ST , SB and SW , respectively. 3.3. RDEA Algorithm In this part, we introduce a novel linear extraction algorithm called regularized discriminant entropy analysis (RDEA). The theoretical justifications of our algorithm are presented in Section 3.4. In Section 2.3, it can be found that the RDE has a direct relationship with the scatter matrices. The RDE is designed with the trace of the within-class scatter matrix. On the other hand, the scatter matrices can be computed with vij ði ¼ 1; …; c; j ¼ 1; …; N i Þ and mi ði ¼ 1; 2; …; cÞ obtained by maximizing the RDE. When vij ði ¼ 1; …; c; j ¼ 1; …; Ni Þ and mi ði ¼ 1; 2; …; cÞ are obtained, we can compute the scatter matrices and then use them for feature extraction. Since the Fisher criterion [34,40] is conceptually simple and gives a systematic feature extraction algorithm, which has no assumption about data distribution, we use a similar criterion to obtain our algorithm. Let the projection vector be w. After this projection, the withinclass scatter, the between-class scatter and the total scatter are wT SW w, wT SB w and wT ST w, respectively. Our feature extraction criterion is as follows: w SB w ; J RDEA ðwÞ ¼ T w ST w T

Input:  X ¼ ½X 1 ; X 2 ; …; X c  — data matrix;  ηi ði ¼ 1; 2; …; cÞ — regularized parameter;  γ — regularization parameter;  ε — iterative stop threshold Output:  W RDEA ¼ ½w1 ; w2 ; …; wd  — linear projection matrix; Procedure:  Compute vij ði ¼ 1; …; c; j ¼ 1; …; N i Þ and mi ði ¼ 1; 2; …; cÞ by Algorithm 1;  Compute the total scatter matrix ST and between-class scatter matrix SB by Eqs. (28) and (30), respectively;  Compute the projections. Let fw1 ; w2 ; …; wk g be the projections, and define (34) W ðk1Þ ¼ ½w ; w ; …; w : 1

2

k1

Then fw1 ; w2 ; …; wk g can be iteratively computed as follows: (a) Compute w1 as the eigenvector of S1 T SB associated with the largest eigenvalue; (b) Compute wk as the eigenvector associated with the largest eigenvalue of the eigenfunction P ðkÞ SB w ¼ λST w; where (35) ðk1Þ 1 P ðkÞ ¼ IðW ðk1Þ ÞððW ðk1Þ ÞT S1 ÞÞ ðW ðk1Þ ÞT S1 T ðW T :  Obtain the RDEA transformation W RDEA ¼ ½w1 ; w2 ; …; wd . The projection is as follows: φ ¼ W TRDEA ϕ; d-dimensional representation of ϕ.

ð33Þ

where φ is a

where w is a n-dimension vector. It can be found that if ST is invertible, the vector w that maximizes JRDEA(w) is the eigenvector of

3.4. Theoretical justification of RDEA

S1 T SB w ¼ w

In order to obtain the kth basis vector, we maximize the following objective function:

corresponding to the largest eigenvalues. For the multi-classification problem, we solve the eigenfunction and obtain the eigenvectors w1 ; w2 ; …; wn . Let W ¼ ½w1 ; w2 ; …; wn . Consider RDE in Eq. (12). If orthogonal projection matrix W is used, we have c 1 N Ni vij vij N i Ni vij i ∑ ∑ J W T xij W T mi J 2 ln γ ∑ Ni Ni i ¼ 1 N j ¼ 1 Ni i ¼ 1 ηi N j ¼ 1 c

JðV ; MÞ ¼  ∑ c

¼ ∑

i¼1

Ni

Ni

vij vij vij Ni 1 Ni ∑ ∑ J xij mi J 2 : ln γ ∑ N j ¼ 1 Ni Ni Ni i ¼ 1 ηi N j ¼ 1 c

JðV ; MÞ does not change under orthogonal projection and so do the likelihood values. That is   γ 1 exp  J W T xij W T mi J 2 η vij ¼ Ni  i γ 1 Ni ∑j ¼ 1 exp  J W T xij W T mi J 2 

J RDEA ðwk Þ ¼

wTk Sb wk ; wTk ST wk

with the constraints wTk w1 ¼ wTk w2 ¼ ⋯; ¼ wTk wk1 ¼ 0;

We can use the method of Lagrange multipliers to transform the maximization function to include all constraints: k

Lðwk Þ ¼ wTk SB wk λðwTk ST wk 1Þ ∑ γ i wTk wi :

ð36Þ

i¼1

The maximization is performed by setting the partial derivative of Lðwk Þ with respect to wk equal to zeros: k1

2SB wk 2λST wk  ∑ γ i wi ¼ 0:

ð37Þ

i¼1

ηi

 γ 1 exp  J xij mi J 2 η ¼ Ni  i γ 1 Ni ∑j ¼ 1 exp  J xij mi J 2

wTk ST wk ¼ 1:

Multiplying the left side of Eq. (37) by wk, we have ði ¼ 1; …; c; j ¼ 1; …; N i Þ:

ηi

Orthogonal basis functions preserve the likelihood values. The orthogonal projection is important for feature extraction algorithm [34]. There have been a number of feature extraction algorithms [41,40,42] designed with it. In this paper, we also put orthogonal constraints on the basis functions of the projection. For feature extraction, we will use W d ¼ ½w1 ; w2 ; …; wd , where d r n. The algorithmic procedure of RDEA is described in Algorithm 2.

λ¼

wTk SB wk : wTk ST wk

ð38Þ

Thus, λ represents the maximization problem to be optimized. Multiplying the left-hand side of Eq. (37) by wTj S1 T ðj ¼ 1; 2; …; k1Þ, we have ðk1Þ T 1 2ðW ðk1Þ ÞT S1 Þ ST W ðk1Þ Γ T SB wk ðW

where Γ

ðk1Þ

ðk1Þ

¼ 0;

¼ ½γ 1 ; γ 2 ; …; γ k1 T , we can obtain

ðk1Þ 1 Γ ðk1Þ ¼ 2ððW ðk1Þ ÞT S1 Þ ðW ðk1Þ ÞT S1 T W T SB wk :

ð39Þ

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

Since Eq. (37) can be written as 2SB wk 2λST wk W

ðk1Þ

Γ

ðk1Þ

¼ 0:

ð40Þ

Substituting Eq. (39) into Eq. (40), we have ðk1Þ 1 ½IW ðk1Þ ððW ðk1Þ ÞT S1 Þ ðW ðk1Þ ÞT S1 T W T SB wk ¼ λST wk :

Using Eq. (35), we have P SB w ¼ λST w; ðkÞ

which is used in our RDEA algorithm.

4. Toy example In this part, we give an example to show that the likelihood values vij obtained by Algorithm 1 are better than simply setting vij ¼ 1. Let X 1 ¼ f0; 0; 0; 0; 0; 0; 0; 0; 0; 0g and X 2 ¼ f9; 1; 1; 1; 1; 1; 1; 1; 1; 1g be two classes of data points. Assume X2 contains an outlier 9. Fig. 1 shows the data points. Data point  9 is far away from other points in X2. Let vij ¼ 1, and then we get the mean of different classes as 10

m1 ¼ ∑ 0 ¼ 0

9

m2 ¼ 9 þ ∑ 1 ¼ 0:

and

i¼1

i¼1

Hence, m1 ¼ m2 . In this case, the data points are hard to classify based on their mean. Using Algorithm 1, we get the likelihood values v1k ¼ 1 ðk ¼ 1; 2; …; 10Þ and v21 ¼ 0:0046, v2j ¼ 1:1106 ðj ¼ 2; …; 10Þ. Then we can obtain the mean of different classes as 10

m1 ¼ ∑ 0  i¼1

1 ¼0 10

and m2 ¼ 9 

813

matrices through eigenvalue/eigenvector decomposition. The performance was tested with four data sets from the UCI machine learning repository, three data sets from the Statlog repository, one data set from the Delve (Data for Evaluating Learning in Valid Experiments) repository and two face image data sets. These data sets were from a variety of applications with various numbers of classes and attributes (Table 1). In order to evaluate RDEA in the small sample size (SSS) problem, the FERET and the AR face data sets are used in our experiments. For the FERET data set, we select 72 subjects from the FERET database, with 6 images for each subject. The 6 images are extracted from four different sets, namely Fa, Fb, Fc, and duplicate [43]. The images are selected to bear with more differences in lighting, facial expressions, and facial details. The AR database [44] is another popular face database. Face image variations in the AR database include illumination, facial expression and occlusion. In our experiment, 952 images (119 individuals, 8 images per person), which involve large variations in facial expressions, are selected. We manually crop the face portion of the images. All the images of these two face data sets are aligned at the centers of the eyes and mouth. Histogram equalization is applied to the face images for photometric normalization, and the images are also converted into intensity images that contain values in the range of 0.0 (black) and 1.0 (full intensity or white). All the images are then normalized with resolution 57  47. The images from the two subjects of FERET are shown in Fig. 2. Fig. 3 shows the images of the AR data set. The recognition accuracy is evaluated based on the minimum distance classifier for its simplicity, and the Euclidean metric is used as our distance measure. 5.1. Performance evaluation

9 0:0046 1:1106 þ ∑ 1 ¼ 0:9957: 10 10 i¼1

It can be found that the mean of X2, m2 ¼ 0:9957, is very close to its real mean 1. And the two classes of data points are easy to classify based on their mean.

5. Experimental results of real data sets The performance of RDEA is tested experimentally with publicly available data sets and compared to those of PCA, LPP, LDA, MaxEnt-RDA and AIDA. In the LPP algorithm, the number of nearest neighbors is set to 5 to construct a similarity matrix. All of the above-mentioned methods are based on the first two statistical moments and thus belong to the class of second-order techniques. In addition, all the methods yield feature extraction

For data sets australian, diabetes, german.number, heart, liver, sonar, splice and wine, the evaluation consists of the following cross validation (CV) procedure used in [24]. The choice of the number of folds, k, for CV is dictated by the bias-variance trade-off [24]. As it is widely accepted that 10-fold CV offers a good bias-variance compromise, data sets australian, diabetes, german.number, heart, liver, sonar, splice and wine were tested using 10-fold CV. In order to test the performance of our algorithm on the unbalanced data sets, we also perform the following evaluation procedure: 1. Randomly select one class of the data set into 2 nonoverlapping folds of equal size and use one fold for training and the other fold for testing; 2. Randomly divide other classes of the data sets into l nonoverlapping folds of equal size and use 1 fold for training and the other l1 folds for testing;

1 0.8

Table 1 Data sets.

0.6 0.4 0.2 0 −0.2 −0.4 −0.6 −0.8 −1

−9

−8

−7

−6

−5

−4

−3

−2

−1

Fig. 1. Data points of different classes.

0

1

Data set

Repository

Class

Dimension

Sample size per class

australian diabetes german.number heart liver sonar splice wine FERET AR

Statlog UCI Statlog Statlog UCI UCI Delve UCI N/A N/A

2 2 2 2 2 2 2 3 72 119

14 8 24 13 6 60 60 13 2679 2679

307 268 300 120 145 97 483 48 6 8

814

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

Fig. 2. Samples from FERET data set.

Fig. 3. Samples from AR data set.

recognition rate on the training set for the following feature extraction. Note that γ sel is obtained only based on the training set.

3. Based on the training set, calculate the projection matrix. Transform the training data and test data into the d-dimensional feature subspace and perform the classification; 4. Repeat Steps 1–3 20 times to obtain the averaging recognition rate.

5.2. Analysis of results

In order to determine the regularization parameter γ (γ is selected from set f0; 0:1; 0:2; …; 0:9; 1g), for each data set (such as australian, diabetes, german.number, heart, liver, sonar, splice and wine) on the training set, we perform 10-fold CV and record γ sel corresponding to the maximum averaging recognition rate on the training set. And then we use γ sel to calculate the projection matrix for the feature extraction. Note that γ sel is obtained only based on the training set. Due to the SSS problem (training samples per class are less than 4 in our experiments.), the FERET and AR data sets are tested by random experiments for 20 times and the average accuracy is recorded. For the FERET data set, in each random experiment, we randomly select three images from each individual for training, while the remaining images of each individual are selected for testing. For the AR data set, in each random experiment, four images are randomly selected from each person for training, while the remaining images of each individual are selected for testing. For data sets FERET and AR, due to the SSS problem , the evaluation of the regularization parameter γ (γ is selected from set f0; 0:1; 0:2; …; 0:9; 1g) consists of the following leave-one-out CV procedure that is used in [23]. For the face image data set (such as FERET and AR), we perform leave-one-out CV on the training set and record γ sel corresponding to the maximum averaging

The performance by the recognition accuracy is shown in Tables 2–8 for data sets australian, diabetes, german.number, heart, liver, sonar, splice and wine. To keep the tables concise, only results for the best performance of each algorithm and the corresponding dimensions are reported. The best recognition accuracy is given by RDEA (with selected γ from set f0; 0:1; 0:2; …; 0:9; 1g corresponding to the best recognition accuracy after the testing procedure) in most cases (with different data sets and different numbers of training samples). RDEA has the best performance in 44 out of 48 cases. Table 3 shows the performance of RDEA estimated through CV on different data sets with balanced training samples by different γ . The selection of the regularization parameter γ is very important for classification, especially for the liver data set, whose best recognition rate 66.07% (for γ ¼ 0:2) is much higher than the recognition rate 58.57% (for γ ¼ 0:5). Figs. 4–11 show the performance of RDEA (γ sel ) and other feature extraction algorithms on different dimensions with balanced training samples. RDEA with γ sel outperforms other algorithms in 5 data sets for the best performance. Although other algorithms can obtain similar performance from certain data sets, RDEA with γ sel is the most stable one, which gives the comparable recognition accuracy in almost all the data sets. For the overall

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

815

Table 2 The best recognition accuracy (%) estimated through CV on different data sets with balanced training samples and corresponding dimensions (shown in parentheses). Data set

PCA

LDA

LPP

MaxEnt-RDA

AIDA

RDEA

RDEA with γ sel

australian diabetes german.number heart liver sonar splice wine

84.67 (5) 71.15 (7) 70.83 (14) 80.83 (6) 41.79 (2) 67.22 (3) 78.23 (55) 95.83 (7)

86.67 (1) 70.58 (1) 71.67 (1) 82.92 (1) 58.93 (1) 65.56 (1) 80.73 (1) 97.50(2)

85.17 (8) 71.73(7) 69.67 (14) 80.92 (9) 49.29 (3) 68.33(4) 78.46 (57) 95.83 (6)

86.67 (2) 71.15 (6) 71.67 (2) 82.50 (3) 53.57 (4) 66.11 (31) 80.94 (3) 97.50(11)

86.67 (5) 70.77 (3) 71.50 (2) 83.33(12) 59.64 (5) 67.78 (24) 81.56(11) 97.50(11)

87.00(7) 71.92(4) 73.00(14) 84.17(3) 66.07(2) 68.89(24) 81.77(13) 98.33(5)

87.00 (7) 70.77 (4) 72.50(3) 83.33(3) 64.64(2) 66.67 (24) 81.56(3) 97.50(5)

Table 3 The performance (%) of RDEA estimated through CV on different data sets with balanced training samples by different γ ¼ 0; 0:1; …; 0:9; 1. Data set

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

australian diabetes german.number heart liver sonar splice wine

86.67 71.15 71.67 83.75 59.29 67.78 80.21 97.50

86.83 71.92 71.17 84.17 62.50 67.78 80.52 98.33

87.00 70.00 71.83 84.17 66.07 68.83 80.21 96.67

86.33 70.19 71.33 82.92 64.64 68.89 80.21 96.67

86.83 71.54 71.83 83.75 62.50 67.22 81.77 96.67

86.50 70.19 72.50 82.50 58.57 66.67 81.56 97.50

86.50 69.62 73.00 83.75 59.29 65.00 81.46 97.50

86.83 70.58 71.50 82.92 64.64 65.56 81.25 98.33

86.67 67.31 71.83 83.33 65.00 65.56 80.83 97.50

86.50 70.19 71.67 82.92 63.57 66.67 80.52 98.33

86.67 70.77 71.50 82.50 64.64 66.67 80.63 96.67

Table 4 The best recognition accuracy (%) estimated on different data sets with unbalanced training samples and corresponding dimensions (shown in parentheses) with l ¼ 2. Data set

PCA

LDA

LPP

MaxEnt-RDA

AIDA

RDEA

RDEA with γ sel

australian diabetes german.number heart liver sonar splice wine

84.57 (4) 70.43 (7) 69.93 (15) 82.00 (8) 53.17 (5) 67.04 (7) 76.87 (58) 95.42 (6)

85.44 (1) 72.35 (1) 70.40 (1) 82.67 (1) 65.38 (1) 67.77 (1) 79.00 (1) 97.92 (2)

84.83 (6) 72.67 (7) 68.73 (15) 82.42 (12) 55.31 (3) 67.80 (51) 77.49 (55) 97.36 (12)

86.67(2) 72.57 (6) 70.73 (15) 83.25(5) 66.55(4) 67.42 (36) 80.21(54) 97.50 (9)

85.37 (4) 70.77 (3) 67.03 (8) 82.67(12) 54.90 (5) 69.48(48) 77.45 (38) 94.72 (2)

85.70(12) 72.91(4) 71.40(4) 83.67(2) 64.90(2) 70.41(24) 80.43(11) 98.06(2)

85.64 (9) 72.69(2) 71.30(4) 82.75 (2) 63.03 (2) 69.28 (26) 79.94 (6) 97.78(2)

Table 5 The best recognition accuracy (%) estimated on different data sets with unbalanced training samples and corresponding dimensions (shown in parentheses) with l ¼ 4. Data set

PCA

LDA

LPP

MaxEnt-RDA

AIDA

RDEA

RDEA with γ sel

australian diabetes german.number heart liver sonar splice wine

83.23(8) 69.60 (7) 67.37 (15) 78.13 (9) 51.99 (3) 68.60(10) 76.24 (58) 95.31 (10)

84.77 (1) 70.24 (1) 68.25 (1) 81.27 (1) 63.04(1) 62.15 (1) 78.41 (1) 96.35(2)

84.69 (7) 69.67 (7) 67.48 (15) 79.40 (8) 54.16 (3) 62.31 (57) 76.35 (51) 95.63 (6)

84.79 (5) 70.42 (7) 67.68 (2) 78.20 (12) 63.43(4) 62.48 (16) 77.58 (51) 95.31 (12)

84.71 (11) 70.69 (4) 65.81 (5) 80.80 (10) 57.24 (5) 66.28 (39) 76.67 (34) 92.81 (11)

85.42(13) 71.70(3) 69.57(2) 82.40(5) 62.60 (2) 67.85(59) 78.87(7) 96.98(4)

85.49(13) 71.58(4) 69.28(5) 81.73(3) 61.16 (2) 67.11 (59) 78.71(2) 96.25 (4)

Table 6 The best recognition accuracy (%) estimated on different data sets with unbalanced training samples and corresponding dimensions (shown in parentheses) with l ¼ 6. Data set

PCA

LDA

LPP

MaxEnt-RDA

AIDA

RDEA

RDEA with γ sel

australian diabetes german.number heart liver sonar splice wine

85.31(8) 67.27 (7) 66.55 (14) 79.62 (12) 52.89 (5) 65.50(11) 73.30 (50) 94.42 (3)

84.72(1) 70.34 (1) 68.03 (1) 80.44 (1) 59.04 (1) 55.66 (1) 77.43 (1) 93.65 (2)

82.52 (13) 68.96 (7) 67.53 (15) 81.06 (8) 53.90 (5) 58.14 (57) 74.67 (59) 94.81(7)

84.57 (2) 69.75 (7) 68.50(8) 79.81 (9) 59.22 (5) 58.22 (48) 75.56 (51) 91.63 (12)

83.94 (13) 66.27 (7) 66.53 (7) 81.62 (12) 55.65 (5) 65.50(20) 74.83 (34) 92.21 (12)

85.09(12) 70.95(4) 68.80(14) 82.37(6) 60.57(2) 64.34 (19) 78.29(23) 96.35(10)

84.99 (13) 70.81(2) 68.23 (3) 81.69(4) 59.53(2) 61.24 (16) 77.66(54) 94.54 (6)

816

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

Table 7 The best recognition accuracy (%) estimated on different data sets with unbalanced training samples and corresponding dimensions (shown in parentheses) with l ¼ 8. Data set

PCA

LDA

LPP

MaxEnt-RDA

AIDA

australian diabetes german.number heart liver sonar splice wine

82.83 (9) 68.05 (7) 65.33 (10) 81.03 (8) 52.62 (5) 65.34 (7) 72.53 (56) 93.26 (6)

84.57 (1) 70.68 (1) 66.43 (1) 77.09 (1) 61.16 (1) 50.38 (1) 75.72 (1) 92.96 (2)

82.99 (10) 69.13 (7) 65.53 (15) 80.06 (8) 52.82 (5) 57.22 (17) 73.78 (57) 93.06 (12)

84.53 (5) 70.33 (7) 64.98 (5) 77.33 (10) 60.00 (5) 53.46 (30) 74.37 (57) 90.56 (11)

84.03 68.26 63.40 79.82 55.53 66.24 73.95 90.93

(9) (7) (8) (12) (5) (46) (36) (12)

RDEA

RDEA with γ sel

84.88 (10) 71.52 (4) 67.60 (8) 81.94 (8) 60.90 (2) 62.63 (26) 76.61 (32) 94.91 (9)

84.86 (13) 71.44 (3) 67.11 (3) 80.85 (12) 58.34 (2) 60.00 (12) 76.30 (8) 94.54 (12)

Table 8 The best recognition accuracy (%) estimated on different data sets with unbalanced training samples and corresponding dimensions (shown in parentheses) with l ¼ 10. Data set

PCA

LDA

LPP

MaxEnt-RDA

AIDA

australian diabetes german.number heart liver sonar splice wine

82.37 (5) 67.08 (7) 62.50 (11) 75.64 (11) 51.72 (3) 63.31 (8) 72.59 (49) 92.00 (9)

83.23 (1) 69.57 (1) 63.02 (1) 76.55 (1) 62.17 (1) 50.37 (1) 74.25 (1) 89.82 (2)

82.58 (10) 67.87 (7) 63.81 (13) 76.27 (12) 52.36 (3) 53.01 (2) 73.29 (52) 91.91 (11)

84.23 (5) 68.80 (7) 61.74 (5) 75.54 (10) 58.28 (3) 52.35 (5) 72.16 (57) 90.09 (12)

84.09 66.75 64.86 80.83 55.91 58.60 72.54 89.27

0.9

0.725

0.85

0.715

(8) (7) (5) (10) (5) (9) (36) (11)

RDEA

RDEA with γ sel

84.51 (10) 70.19 (3) 66.74 (13) 80.30 (12) 60.49 (2) 57.35 (4) 75.67 (35) 94.00 (8)

84.40 (7) 70.00 (2) 65.05 (6) 79.70 (12) 58.08 (2) 55.00 (9) 75.16 (47) 92.64 (12)

Recognition accuracy

Recognition accuracy

0.72

0.8 0.75 0.7

0.71 0.705 0.7 0.695 0.69 0.685

0.65

0

2

4

6

8

10

0.68

12

1

2

3

classification performance on different data sets, no single existing algorithm can compare with RDEA. Except RDEA with selected γ after testing, RDEA with γ sel has the best performance with 28 best cases. In particular, for the data sets german.number and splice, RDEA with γ sel is almost uniformly superior (over different l) to other feature extraction algorithms. Note that γ sel is selected by CV only on the training set. This procedure is more suitable for practical applications as we can hardly obtain the testing set in advance. AIDA gives the best performance with 8 best cases and MaxEntRDA gives the best performance with 7 best cases. The recognition accuracy using PCA, LDA and LPP is comparable (6, 5 and 3 best cases, respectively). Note that the number of best cases does not fully indicate the performance of the algorithms. For example, although PCA has 1 more best case than LDA (6 best cases for PCA and 5 best cases for LDA), if we only compare the performances of LDA and that of PCA without considering other algorithms, LDA can actually achieve higher accuracy than PCA in more cases. The fewer number of best cases is mainly caused by the fact that other algorithms have higher recognition accuracy than LDA in the cases where LDA outperforms PCA. The performance of the recognition accuracy is shown in Tables 9 and 10 and Figs. 12 and 13 for FERET and AR data sets. PCA, LDA, LPP, MaxEnt-RDA and our proposed RDEA are used in the experiments. PCA used in face recognition is often called the eigenfaces algorithm [45]. For LDA, LPP, MaxEnt-RDA and RDEA,

5

6

Fig. 5. The experimental performance on diabetes data set with balanced training samples. 0.74 0.72 Recognition accuracy

Fig. 4. The experimental performance on australian data set with balanced training samples.

4

Dimensionality

Dimensionality

0.7 0.68 0.66 0.64 0.62 0.6 0.58 0.56 0.54

0

2

4

6

8

10

12

14

Dimensionality

Fig. 6. The experimental performance on german.number data set with balanced training samples.

we perform PCA on the face images in advance; we keep 98% information in the sense of reconstruction.2 PCA is often used for preprocessing face images. Feature extraction algorithms, such as the Fisherfaces algorithm [45] and the Laplacianfaces algorithm [5], perform PCA projection on the data set first. The Fisherfaces algorithm can be viewed as PCA þLDA [45] and the Laplacianfaces algorithm as PCA þLPP [5]. Since AIDA needs the scatter matrix of

2 Select the number of eigenvectors such that 98% of the energy in the eigenspectrum (computed as the sum of eigenvalues) was retained.

0.86

1

0.84

0.95

0.82

Recognition accuracy

Recognition accuracy

H. Zhao, W.K. Wong / Pattern Recognition 47 (2014) 806–819

0.8 0.78 0.76 0.74 0.72 0.7

0.9 0.85 0.8 0.75 0.7 0.65

0.68 0.66

1

2

3

4

5

6

7

8

9

10

11

1

Fig. 7. The experimental performance on heart data set with balanced training samples. 0.7

Recognition accuracy

0.65 0.6 0.55 0.5 0.45 0.4 1

2

3

4

Dimensionality

Fig. 8. The experimental performance on liver data set with balanced training samples.

0.7

Recognition accuracy

0.68 0.66 0.64 0.62 0.6 0.58 0

10

20

30

40

50

60

Dimensionality

Fig. 9. The experimental performance on sonar data set with balanced training samples. 0.9 0.85 Recognition accuracy

2

3

4

5

6

7

8

9

10

11

Dimensionality

Dimensionality

0.35

817

0.8 0.75 0.7 0.65

Fig. 11. The experimental performance on wine data set with balanced training samples.

entropy TWB in RDEA is recorded as Burg's entropy and the results using the within-class Kapur's entropy TWK in RDEA is recorded as Kapur's entropy in Figs. 12 and 13. The experimental results of Burg's entropy and those of Kapur's entropy are not as good as the experimental results of Shannon's entropy (RDEA) in general. For Burg's entropy, as we derived, vij p 1= J xij mi J 2 , which may not be stable if xij  mi . For Kapur's entropy, due to the usage of the within-class Kapur's entropy TWK instead of the Kapur's entropy TK and the approximation errors of ηi and λi in the iterative computation, the solutions of vij and mi may not be quite accurate. These may be the reasons why the experimental results of Burg's entropy and those of Kapur's entropy cannot compare with the results of Shannon's entropy. But from Figs. 12 and 13, one can found that the experimental results of Burg's entropy and those of Kapur's entropy are also quite good and even better than MaxEntRDA on AR data sets. Table 9 and Figs. 9, 12 and 13 show that RDEA with γ sel is superior to PCA, LDA, LPP, orthgonal LDA, orthgonal LPP and MaxEnt-RDA algorithms. We summarize the experiments below: 1. Although no single algorithm gives optimal performance for all data sets consisting of diverse numbers of classes, attributes and sample sizes, RDEA with γ sel outperforms PCA, LDA, LPP, MaxEnt-RDA and AIDA with regard to the number of best performance and emerges as the clear winner. 2. Due to the maximization of the RDE, the likelihood values and the regularization parameter, RDEA with γ sel becomes effective for classification tasks. 3 3. The time complexity of RDEA is OðcnN 2 þ dn Þ with c being the number of classes, d being the dimension of the projected feature space, n being the data dimension and N being the total number of training samples. The time complexity of RDEA is similar to that of AIDA, OðcnN 2 þ cn3 Þ, and slightly higher than that of LDA and MaxEnt-RDA, OðcnN 2 þ n3 Þ. Given the superior performance of our algorithm, the trade-off between implementation time and performance seems justified.

0.6 0.55

0

10

20

30

40

50

60

Dimensionality

Fig. 10. The experimental performance on splice data set with balanced training samples.


6. Conclusion

In this paper, we propose the RDE for the first time. Based on the within-class entropy and the scatter information of the data, the RDE provides a natural way to process noisy, labeled data. The likelihood values and the mean vectors of the different classes are obtained by maximizing the RDE, and the relation to the mean shift algorithm and to pre-image reconstruction is highlighted. We then propose a simple and computationally efficient algorithm, RDEA, for supervised feature extraction. RDEA can be regarded as a new feature extraction framework based on the RDE. This study adopts the most widely used Shannon's entropy; based on the design of the RDE, a series of feature extraction algorithms can be derived by selecting different generalized entropies. In the derivation of RDEA, we place orthogonality constraints on the basis functions to obtain the final projection matrix. Other constraints can be substituted, such as in Sanguinetti's work [15], which imposes orthonormality with respect to the covariance matrix to design the dimension reduction algorithm. How to combine different constraints directly with the maximization of the RDE is another direction for future research.

RDEA is a simple algorithm that preserves the computational simplicity and characteristics of eigenvalue-based techniques. The regularization parameter γ makes RDEA suitable for classification tasks, and the experimental results reflect the positive effect of this parameter. RDEA is a linear algorithm with a direct solution that requires no approximation, whereas both MaxEnt-RDA and AIDA rely on approximations. The idea behind RDEA is general and can potentially be extended to other feature extraction algorithms, including nonlinear or kernelized versions.

We compared our method with several other second-order feature extraction techniques on real data sets. Based on the experimental results shown in Section 5.2, it is clear that RDEA with γ_sel outperforms PCA, LDA, LPP, MaxEnt-RDA and AIDA in terms of the number of best performances and emerges as the clear winner. Given the variety of data sets studied, which consist of diverse numbers of classes, attributes and sample sizes, we conclude that RDEA with γ_sel compares favorably with other methods and should be considered as an alternative to the prevalent feature extraction techniques.


Table 9
The best recognition accuracy (%) estimated on the FERET and AR face data sets and the corresponding dimensions (shown in parentheses).

Data set   PCA           LDA           LPP           MaxEnt-RDA    RDEA          RDEA with γ_sel
FERET      77.87 (69)    88.33 (70)    84.17 (71)    90.60 (70)    92.50 (70)    92.50 (71)
AR         79.16 (116)   88.99 (115)   85.29 (118)   91.85 (118)   94.79 (117)   94.79 (117)

Table 10
The performance (%) of RDEA estimated through CV on the FERET and AR data sets for different γ = 0, 0.1, …, 0.9, 1.

Data set   0.0     0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1.0
FERET      92.04   92.13   92.50   92.50   92.50   92.50   92.50   92.22   92.22   92.31   92.22
AR         94.62   94.66   94.79   94.79   94.71   94.71   94.62   94.66   94.66   94.62   94.62
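Table 10 suggests that γ_sel can be chosen by scanning γ over {0, 0.1, …, 1.0} and keeping the value with the best cross-validated accuracy. The sketch below shows one such grid search; the 5-fold protocol, the 1-NN classifier and the extract_features callback are assumptions standing in for RDEA and the paper's exact CV setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def select_gamma(extract_features, X, y,
                 gammas=np.arange(0.0, 1.01, 0.1), n_folds=5):
    """Pick the regularization parameter by cross-validated accuracy.
    extract_features(X_train, y_train, X_test, gamma) is a placeholder for the
    RDEA projection; any feature extractor with this signature would work."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for gamma in gammas:
        accs = []
        for tr, te in skf.split(X, y):
            Z_tr, Z_te = extract_features(X[tr], y[tr], X[te], gamma)
            clf = KNeighborsClassifier(n_neighbors=1).fit(Z_tr, y[tr])
            accs.append(clf.score(Z_te, y[te]))
        scores.append(np.mean(accs))            # average accuracy for this gamma
    return gammas[int(np.argmax(scores))], scores
```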


Fig. 12. The experimental performance (%) on FERET data set (recognition accuracy vs. dimensionality).

Fig. 13. The experimental performance (%) on AR data set (recognition accuracy vs. dimensionality).

Conflict of interest


None declared.

Acknowledgment


The authors would like to thank the anonymous reviewers for their valuable comments, and to acknowledge the financial support from the National Science Foundation of China (Project Nos. 61375012, 61375007 and 61072090) and the Shanghai Pujiang Program (Project No. 12PJ1402200).


Appendix A. Theoretical derivation of Equations (13) and (14)




In order to maximize the following objective function

\begin{equation}
J(V,M) = -\sum_{i=1}^{c}\frac{N_i}{N}\sum_{j=1}^{N_i}\frac{v_{ij}}{N_i}\ln\frac{v_{ij}}{N_i}
-\gamma\sum_{i=1}^{c}\frac{N_i}{N}\frac{1}{\eta_i}\sum_{j=1}^{N_i}\frac{v_{ij}}{N_i}\,\|x_{ij}-m_i\|^2,
\tag{41}
\end{equation}

under the constraints $\sum_{j=1}^{N_i} v_{ij} = N_i$ $(i=1,2,\ldots,c)$, where $V=\{v_{ij}\}_{i=1,\ldots,c;\,j=1,\ldots,N_i}$, $M=\{m_1,m_2,\ldots,m_c\}$, $\eta_i \ge 0$ $(i=1,2,\ldots,c)$ and $\beta \ge 1$, we use the method of Lagrange multipliers to incorporate all constraints into the objective:

\begin{equation*}
L(V,M) = -\sum_{i=1}^{c}\frac{N_i}{N}\sum_{j=1}^{N_i}\frac{v_{ij}}{N_i}\ln\frac{v_{ij}}{N_i}
-\gamma\sum_{i=1}^{c}\frac{N_i}{N}\frac{1}{\eta_i}\sum_{j=1}^{N_i}\frac{v_{ij}}{N_i}\,\|x_{ij}-m_i\|^2
+\sum_{i=1}^{c}\lambda_i\left(\sum_{j=1}^{N_i}\frac{v_{ij}}{N_i}-1\right).
\end{equation*}

The maximization is performed by setting the partial derivatives of $L(V,M)$ with respect to $v_{ij}$ and $m_i$ equal to zero:

\begin{equation}
\frac{\partial L}{\partial v_{ij}}
= -\frac{N_i}{N}\frac{1}{N_i}\left(\ln\frac{v_{ij}}{N_i}+1\right)
-\gamma\,\frac{N_i}{N}\frac{1}{N_i}\frac{1}{\eta_i}\|x_{ij}-m_i\|^2
+\frac{\lambda_i}{N_i}=0,
\tag{42}
\end{equation}

\begin{equation}
\frac{\partial L}{\partial m_i}
= 2\gamma\,\frac{N_i}{N}\frac{1}{\eta_i}\sum_{j=1}^{N_i}\frac{v_{ij}}{N_i}\,(x_{ij}-m_i)=0.
\tag{43}
\end{equation}

From Eq. (42) we have $\ln(v_{ij}/N_i) = -\gamma\,\|x_{ij}-m_i\|^2/\eta_i + (N/N_i)\lambda_i - 1$. Let $\gamma = 1/\beta$. Then

\begin{equation}
\frac{v_{ij}}{N_i}
= \left(\exp\!\left(-\frac{1}{\eta_i}\|x_{ij}-m_i\|^2\right)\right)^{\gamma}
\exp\!\left(\frac{N}{N_i}\lambda_i-1\right).
\tag{44}
\end{equation}

Since $\sum_{j=1}^{N_i} v_{ij}/N_i = 1$, we have

\begin{equation}
\exp\!\left(\frac{N}{N_i}\lambda_i-1\right)
= \frac{1}{\sum_{j=1}^{N_i}\left(\exp\!\left(-\frac{1}{\eta_i}\|x_{ij}-m_i\|^2\right)\right)^{\gamma}}.
\tag{45}
\end{equation}

Substituting Eq. (45) into Eq. (44) gives

\begin{equation*}
v_{ij} = N_i\,
\frac{\left(\exp\!\left(-\frac{1}{\eta_i}\|x_{ij}-m_i\|^2\right)\right)^{\gamma}}
{\sum_{j=1}^{N_i}\left(\exp\!\left(-\frac{1}{\eta_i}\|x_{ij}-m_i\|^2\right)\right)^{\gamma}},
\end{equation*}

which is Eq. (13). From Eq. (43), we have

\begin{equation*}
\sum_{j=1}^{N_i}(x_{ij}-m_i)\,v_{ij}=0,
\end{equation*}

and therefore

\begin{equation*}
m_i = \frac{\sum_{j=1}^{N_i} x_{ij}\,v_{ij}}{\sum_{j=1}^{N_i} v_{ij}}
= \frac{\sum_{j=1}^{N_i} x_{ij}\left(\exp\!\left(-\frac{1}{\eta_i}\|x_{ij}-m_i\|^2\right)\right)^{\gamma}}
{\sum_{j=1}^{N_i}\left(\exp\!\left(-\frac{1}{\eta_i}\|x_{ij}-m_i\|^2\right)\right)^{\gamma}},
\end{equation*}

which is Eq. (14).
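For reference, a minimal sketch of the alternating updates implied by Eqs. (13) and (14) for a single class is given below. The choice of η_i, the initialization at the sample mean and the stopping rule are assumptions; the sketch is not the full RDEA algorithm.

```python
import numpy as np

def class_weights_and_mean(X_i, eta_i, gamma, n_iter=50, tol=1e-6):
    """Alternate the updates of Eqs. (13) and (14) for one class:
    likelihood values v_ij from the current mean, then the weighted mean m_i."""
    N_i = X_i.shape[0]
    m_i = X_i.mean(axis=0)                       # start from the ordinary sample mean
    for _ in range(n_iter):
        d2 = np.sum((X_i - m_i) ** 2, axis=1)
        w = np.exp(-d2 / eta_i) ** gamma         # Eq. (13), before normalization
        v = N_i * w / w.sum()                    # likelihood values summing to N_i
        m_new = (v[:, None] * X_i).sum(axis=0) / v.sum()   # Eq. (14)
        if np.linalg.norm(m_new - m_i) < tol:    # stop when the mean settles
            m_i = m_new
            break
        m_i = m_new
    return v, m_i
```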

References

[1] A.M. Martinez, A.C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2) (2001) 228–233.
[2] Z.F. Li, D.H. Lin, X.O. Tang, Nonparametric discriminant analysis for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (4) (2009) 755–761.
[3] M. Balasubramanian, The isomap algorithm and topological stability, Science 295 (5552) (2002).
[4] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[5] X.F. He, S.C. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face recognition using Laplacianfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3) (2005) 328–340.
[6] S.C. Yan, D. Xu, B.Y. Zhang, H.J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 40–51.
[7] F. De la Torre, M.J. Black, Robust principal component analysis for computer vision, in: Proceedings 8th IEEE International Conference on Computer Vision (ICCV 2001), vol. 1, 2001, pp. 362–369.
[8] H. Choi, S. Choi, Robust kernel isomap, Pattern Recognition 40 (3) (2007) 853–862.
[9] X. Geng, D.C. Zhan, Z.H. Zhou, Supervised nonlinear dimensionality reduction for visualization and classification, IEEE Transactions on Systems Man and Cybernetics, Part B Cybernetics 35 (6) (2005) 1098–1107.
[10] H. Chang, D. Yeung, Robust locally linear embedding, Pattern Recognition 39 (6) (2006) 1053–1065.
[11] R. He, B.G. Hu, X.T. Yuan, Robust discriminant analysis based on nonparametric maximum entropy, in: Advances in Machine Learning, Lecture Notes in Computer Science, LNAI 5828, 2009, pp. 120–134.
[12] K.E. Hild II, D. Erdogmus, K. Torkkola, J.C. Principe, Feature extraction using information-theoretic learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (9) (2006) 1385–1392.
[13] A. Singh, J.C. Príncipe, Information theoretic learning with adaptive kernels, Signal Processing 91 (February (2)) (2011) 203–213.
[14] W.F. Liu, P.P. Pokharel, J.C. Principe, Correntropy: properties and applications in non-Gaussian signal processing, IEEE Transactions on Signal Processing 55 (11) (2007) 5286–5298.
[15] G. Sanguinetti, Dimensionality reduction of clustered data sets, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3) (2008) 535–540.
[16] R. Jenssen, Kernel entropy component analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (5) (2010) 847–860.
[17] X.T. Yuan, B.G. Hu, Robust feature extraction via information theoretic learning, in: Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), 2009, pp. 633–640.
[18] L.M. Zhang, L.S. Qiao, S.C. Chen, Graph-optimized locality preserving projections, Pattern Recognition 43 (6) (2010) 1993–2002.
[19] K. Torkkola, Feature extraction by non-parametric mutual information maximization, Journal of Machine Learning Research 3 (7–8) (2003) 1415–1438.
[20] D.C. Tao, X.L. Li, X.D. Wu, S.J. Maybank, Geometric mean for subspace selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 260–274.
[21] S. Zhang, T. Sim, Discriminant subspace analysis: a Fukunaga–Koontz approach, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (October (10)) (2007) 1732–1745.
[22] M. Loog, R.P.W. Duin, Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (June (6)) (2004) 732–739.
[23] K. Das, Z. Nenadic, Approximate information discriminant analysis: a computationally simple heteroscedastic feature extraction technique, Pattern Recognition 41 (May (5)) (2008) 1548–1557.
[24] Z. Nenadic, Information discriminant analysis: feature extraction with an information-theoretic objective, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (8) (2007) 1394–1407.
[25] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley Series in Telecommunications, vol. 1, Wiley, NJ, 1991.
[26] J.C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, Springer, New York, 2010.
[27] J.N. Kapur, Maximum-Entropy Models in Science and Engineering, John Wiley & Sons, New Delhi, 1989.
[28] Stan Z. Li, Markov Random Field Modeling in Image Analysis, 3rd ed., Springer, New York, 2009.
[29] W. Rudin, Principles of Mathematical Analysis, 3rd ed., McGraw-Hill Book Company, New York, 1976.
[30] K. Rose, E. Gurewitz, G.C. Fox, Vector quantization by deterministic annealing, IEEE Transactions on Information Theory 38 (4) (1992) 1249–1257.
[31] J. Buhmann, H. Kühnel, Complexity optimized data clustering by competitive neural networks, Neural Computation 5 (1) (1993) 75–88.
[32] Y.F. Wong, Clustering data by melting, Neural Computation 5 (January (1)) (1993) 89–104.
[33] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (May (5)) (2002) 603–619.
[34] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1990.
[35] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (8) (1995) 790–799.
[36] S. Rao, A.D.M. Martins, J. Principe, Mean shift: an information theoretic perspective, Pattern Recognition Letters 30 (February (3)) (2009) 222–230.
[37] G. Heo, P. Gader, Robust kernel discriminant analysis using fuzzy memberships, Pattern Recognition 44 (March (3)) (2011) 716–723.
[38] S. Mika, B. Schölkopf, A. Smola, K. Müller, M. Scholz, G. Rätsch, Kernel PCA and de-noising in feature spaces, in: Advances in Neural Information Processing Systems (NIPS), vol. 11, 1999, pp. 536–542.
[39] J. Yang, A.F. Frangi, J.Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2) (2005) 230–244.
[40] J.P. Ye, Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems, Journal of Machine Learning Research 6 (1) (2005) 483–502.
[41] D. Cai, X.F. He, J.W. Han, H.J. Zhang, Orthogonal Laplacianfaces for face recognition, IEEE Transactions on Image Processing 15 (November (11)) (2006) 3608–3614.
[42] Z. Jin, A theorem on the uncorrelated optimal discriminant vectors, Pattern Recognition 34 (10) (2001) 2041–2047.
[43] P.J. Phillips, H. Moon, S.A. Rizvi, P.J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (10) (2000) 1090–1104.
[44] A.M. Martinez, R. Benavente, The AR Face Database, CVC Technical Report 24, 1998.
[45] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 711–720.

Haitao Zhao received his Ph.D. degree from Nanjing University of Science and Technology. He is currently a full professor at East China University of Science and Technology. His recent research interests include machine learning, pattern recognition and computer vision.

W.K. Wong received his Ph.D. degree from The Hong Kong Polytechnic University. He is currently an associate professor at the same university. He has published more than fifty scientific articles in refereed journals, including IEEE Transactions on Neural Networks and Learning Systems, Pattern Recognition, International Journal of Production Economics, European Journal of Operational Research, International Journal of Production Research, Computers in Industry and IEEE Transactions on Systems, Man, and Cybernetics, among others. His recent research interests include artificial intelligence, pattern recognition, feature extraction and the optimization of manufacturing scheduling, planning and control.