Pattern Recognition 45 (2012) 821–830
Heteroscedastic linear feature extraction based on sufficiency conditions

Mohammad Shahin Mahanta, Amirhossein S. Aghaei, Konstantinos N. Plataniotis, Subbarayan Pasupathy

Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, ON, Canada M5S 3G4
Article history: Received 25 October 2010; received in revised form 21 June 2011; accepted 23 July 2011; available online 4 August 2011.

Abstract
Classification of high-dimensional data typically requires extraction of discriminant features. This paper proposes a linear feature extractor, called whitened linear sufficient statistic (WLSS), which is based on the sufficiency conditions for heteroscedastic Gaussian distributions. WLSS approximates, in the least squares sense, an operator providing a sufficient statistic. The proposed method retains covariance discriminance in heteroscedastic data, while it reduces to the commonly used linear discriminant analysis (LDA) in the homoscedastic case. Compared to similar heteroscedastic methods, WLSS imposes a low computational complexity, and is highly generalizable as confirmed by its consistent competence over various data sets.
Keywords: Feature extraction; Dimension reduction; Sufficient statistic; Heteroscedastic data; Discriminant analysis; Gaussianity; Quadratic classifier
1. Introduction

Pattern recognition systems have been widely used to classify different patterns in various applications, such as personal identification from biometrics (e.g. face recognition), medical diagnosis of patients from experimental results, and data mining [1]. In most of these areas, the high dimensionality of the data poses a major challenge to the classification problem. However, classification of high-dimensional data with correlated features can be simplified, and the corresponding curse of dimensionality [2] can be avoided, by feature extraction, i.e. extracting a small set of discriminant features.

In general, there are two techniques for dimensionality reduction: feature selection and feature extraction. While the former only selects a subset of features, the latter involves a linear or non-linear transformation of the data [1]. The ultimate purpose of feature extraction is to represent the data with the smallest number of features while retaining all the discriminatory information. A feature extractor (FE) achieving this purpose provides a minimum-dimensional sufficient statistic [3] and is Bayes optimal [4]. This paper focuses on linear feature extraction algorithms since they are computationally efficient and preserve certain properties of the data such as Gaussianity. One of the most commonly used linear FEs is linear discriminant analysis (LDA).
LDA maximizes Fisher's criterion, the ratio of the between-class to the within-class scatter, through a computationally efficient training procedure based on matrix eigendecomposition [5]. LDA implicitly assumes Gaussian class-conditional distributions with a common covariance, i.e. homoscedastic as opposed to heteroscedastic data. Under this assumption, the feature set provided by LDA is a sufficient statistic. However, LDA suffers from several problems addressed in the literature [6–8]. One of the most important restrictions of LDA stems from the homoscedasticity assumption, as a result of which any discriminatory information in the class covariance differences is ignored.

Several methods have been proposed to extend LDA to the case of heteroscedastic data. These methods can be categorized into the following three approaches.

In the first heteroscedastic approach, alternative criteria, which may be indirectly related to sufficiency, are iteratively optimized. For example, LDA can also be derived from the criterion of maximizing the likelihood of homoscedastic data. This criterion can be generalized to heteroscedastic data, and hence the linear operator is iteratively optimized to maximize the data likelihood [9]. Alternatively, statistical class distances such as the pairwise Bhattacharyya, Kullback–Leibler, or Chernoff distance can be maximized in the feature space [10–12]. These criteria have been related to the sufficiency of the features [10,13,14]. However, these methods also require iterative training algorithms, and consequently lead to concerns over computational complexity [15].

In the second approach, while the computational efficiency of the non-iterative training procedure is retained, Fisher's criterion is
replaced with similar criteria which incorporate the discriminatory information of class covariance differences. These criteria include the geometrical mean of pairwise Mahalanobis distances between classes [16], the sum of approximated pairwise Chernoff distances between classes [15], and an approximate measure of mutual information between the class variable and the feature variable [4]. However, these criteria are not directly based on sufficiency, and their solutions may incur higher order computational complexity than LDA.

The third approach is a direct method to simplify the sufficiency conditions and derive a non-iterative sufficient statistic FE [17]. However, the work in [17] leads to an asymmetric formulation which results in a feature space biased toward the selected reference class, as discussed in [18]. This asymmetry furthermore causes the resulting feature space to be sensitive to the error in the estimated parameters for that reference class.

This paper takes the direct approach based on the sufficiency conditions. We derive a general formulation for the necessary and sufficient conditions satisfied by a linear sufficient statistic for Gaussian distributions. It is shown that this general formulation can be symmetrically simplified. Hence, a simple FE is formulated which provides unbiased features. Minimal sufficiency of the derived FE ensures that it is the optimal linear FE in theory. In fact, our method simplifies to LDA in the case of homoscedastic data, which is consistent with the Bayes optimality of LDA. Furthermore, the simplicity of our FE formulation enhances its generalizability [2] in limited training sample size scenarios, which has been verified by its consistent performance in different experiments on real data.

Another advantage of our simple formulation is computational efficiency. Training of the proposed linear FE is based on a scatter matrix eigendecomposition, where the complexity of calculating this scatter matrix is the same as the complexity of the corresponding step in LDA.

The next section continues with the problem definition and a review of the sufficiency concept for classification. The general linear sufficient statistic conditions are derived in Section 3. Section 4 provides further simplifications toward a simple FE. Finally, Section 5 is dedicated to experimental evaluation based on both synthetic and real data sets.
2. Problem definition and prior work

Throughout this paper, a typical classification scenario is assumed. Each vectorial data sample^1 $x \in \mathbb{R}^n$ belongs to one of $C$ classes, $\omega_i$, $1 \le i \le C$, with non-zero prior probabilities $P(\omega_i)$ and likelihoods $f(x\,|\,\omega_i)$. The classifier can learn these distributions using the available training samples with known classes. The objective is to achieve the smallest classification error rate when predicting the classes of testing samples.

In our system, the data are first transformed by the FE, $y = T(x)$, and then processed by the classifier. Therefore, the FE is desired to keep all the class-related information of the data. Thus, $y$ is desired to be a sufficient statistic for the classes, i.e.

$$f(x\,|\,\omega_i, y) = f(x\,|\,y), \quad 1 \le i \le C. \qquad (1)$$

Such a statistic $y$ is admissible, i.e. it preserves the posterior class probabilities and hence the Bayes classification error [14]. Therefore, a sufficient statistic with the minimum dimension provides the most compressed representation of the discriminatory information and is the ultimate goal in FE design.

^1 Scalars, vectors, and matrices are respectively shown in regular lowercase/uppercase (e.g. $a$ or $A$), boldface lowercase (e.g. $\mathbf{a}$), and boldface uppercase (e.g. $\mathbf{A}$). The column space, row space, and kernel space of a matrix $A$ are denoted as $\mathrm{Col}(A)$, $\mathrm{Row}(A)$, and $\mathrm{Ker}(A)$.
The minimum dimension for a generally non-linear sufficient statistic was investigated by [19]. The minimum value which holds for all possible data distributions was proved to be $\min(n, C-1)$. However, any possible dimension reduction in this scenario would require the estimation of posterior probabilities, which is itself affected by the curse of dimensionality.

We will assume Gaussian class-conditional densities, $\mathcal{N}(\mu_i, \Sigma_i)$. Gaussianity is a common assumption in most applications. Furthermore, it facilitates the estimation of data distributions, which are determined by only two parameters, $\mu_i$ and $\Sigma_i$, for each class. We will also mainly focus on linear FEs, which can be represented by a linear operator $T$ as follows: $y_{d\times 1} = T_{d\times n}\, x_{n\times 1}$.

In [20,21], based on the above assumptions and a prior data normalization such that $f(x\,|\,\omega_1)$ becomes $\mathcal{N}(0, I)$, the necessary and sufficient conditions for a minimum-dimensional linear sufficient statistic (MDLSS) were formulated. Subsequently, a class-referenced linear sufficient statistic (CLSS) FE was proposed in [17]. The CLSS operator is calculated as $T = T''\,\Sigma_1^{-1/2}$, where $\Sigma_1^{-1/2}$ accounts for the assumed prior normalization of the data and the rows of $T''$ consist of those left singular vectors of the following matrix $M_{\mathrm{CLSS}}$ whose corresponding singular values have the largest magnitudes:

$$M_{\mathrm{CLSS}} = [A_{12}, A_{13}, \ldots, A_{1C}], \qquad A_{1i} = [\Sigma_1^{-1/2}(\mu_i - \mu_1),\ \Sigma_1^{-1/2}\Sigma_i\Sigma_1^{-1/2} - I], \quad 1 \le i \le C. \qquad (2)$$
If the values of the $\mu_i$ and $\Sigma_i$ parameters in (2) are accurate, retaining all the non-zero singular values of $M_{\mathrm{CLSS}}$ guarantees that the operator $T''$ given by CLSS leads to a MDLSS. Otherwise, $T''$ provides a least squares approximation to an orthonormal MDLSS operator for the normalized data [17]. In either case, $T$ also includes the implied $\omega_1$-normalization.

Although CLSS is a simple formulation for a heteroscedastic sufficient statistic, its performance depends on the accuracy of the estimated parameters for the specific reference class $\omega_1$. Additionally, a least squares approximation to the ideal operator is best justified in a space with an isotropic average covariance $\bar{\Sigma}$. However, CLSS applies a least squares approximation in a normalized space with an isotropic $\Sigma_1$. Furthermore, CLSS does not simplify to LDA in the homoscedastic case, where $\Sigma_i = \bar{\Sigma}$, $\forall i$. In fact, its feature space is biased towards $\mu_1 - \bar{\mu}$, where $\bar{\mu}$ is the average class mean, and hence CLSS favors the reference class $\omega_1$ [18,13].

This paper continues with the derivation of the general sufficiency conditions for a linear statistic without any normalization assumption. This step helps to avoid the problems associated with CLSS, which were created by the asymmetrically normalized sufficiency conditions.
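As a concrete illustration of the construction in (2), the following NumPy sketch builds $M_{\mathrm{CLSS}}$ and the resulting operator from given class means and covariances. It is only a minimal sketch of our reading of [17]: the function names and interface are ours, and no safeguards for an ill-conditioned $\Sigma_1$ are included.

```python
import numpy as np

def inv_sqrtm(S):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def clss_operator(means, covs, d):
    """Sketch of the CLSS operator of Eq. (2): T = T'' Sigma_1^{-1/2}."""
    n = means[0].size
    W1 = inv_sqrtm(covs[0])                       # Sigma_1^{-1/2}
    blocks = []
    for mu_i, sigma_i in zip(means[1:], covs[1:]):
        A_1i = np.hstack([(W1 @ (mu_i - means[0]))[:, None],
                          W1 @ sigma_i @ W1 - np.eye(n)])
        blocks.append(A_1i)
    M_clss = np.hstack(blocks)                    # n x (C-1)(n+1)
    U, s, _ = np.linalg.svd(M_clss, full_matrices=False)
    T_pp = U[:, :d].T                             # top-d left singular vectors
    return T_pp @ W1
```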
3. General conditions for sufficiency of linear statistics

An arbitrary function $y$ of $x$ is a sufficient statistic if and only if (1) holds. We assume that $y$ is a linear statistic of the data, $y = Tx$. It is known that any invertible transformation of $y$ does not affect the sufficiency of $y$. Therefore, assuming that $T$ provides a sufficient statistic, any other linear operator, and specifically any full row-rank operator $T_{\mathrm{frr}}$, also provides a sufficient statistic if its row space includes the row space of $T$. Hence, without loss of generality, we assume the operator $T$ to be full row-rank, and the result is finally extended to the general case. Based on the above assumptions, the following lemma simplifies the necessary and sufficient condition in (1) for sufficiency of linear statistics [13].
Lemma 3.1. Let the random data $x \in \mathbb{R}^n$ belong to one of $\omega_i$, $1 \le i \le C$, with the Gaussian conditional densities $f(x\,|\,\omega_i) = \mathcal{N}(\mu_i, \Sigma_i)$ and non-zero prior probabilities $P(\omega_i)$. Then, a linear statistic $y = Tx$, with full row-rank linear transformation $T_{d\times n}$, is a sufficient statistic for $\omega_i$ if and only if $D(T, \omega_i, x)$ defined below is independent of "i" for any $x$:

$$D(T, \omega_i, x) = (x - \mu_i)^T \Sigma_i^{-1} Z^T (Z \Sigma_i^{-1} Z^T)^{-1} Z \Sigma_i^{-1} (x - \mu_i) + \log\left|(Z \Sigma_i^{-1} Z^T)^{-1} Z Z^T\right|, \qquad (3)$$

where the matrix $Z_{(n-d)\times n}$ is full row-rank with $\mathrm{Row}(Z)$ an orthogonal complement of $\mathrm{Row}(T)$ such that $T Z^T = 0$ and $[T^T\ Z^T] \in \mathbb{R}^{n\times n}$ is non-singular.

Proof. See Appendix A.

Assume that the full rank operator $T$ gives a sufficient statistic. Then $D(T, \omega_i, x)$, which denotes the difference between the log-likelihood of $\omega_i$ in the $x$ domain and in the $y$ domain, is independent of "i". Then, from the terms containing $x$ in (3), it can be shown that for $1 \le i < j \le C$:

$$\Sigma_i^{-1} Z^T (Z \Sigma_i^{-1} Z^T)^{-1} Z \Sigma_i^{-1} = \Sigma_j^{-1} Z^T (Z \Sigma_j^{-1} Z^T)^{-1} Z \Sigma_j^{-1}, \qquad (4)$$

$$\Sigma_i^{-1} Z^T (Z \Sigma_i^{-1} Z^T)^{-1} Z \Sigma_i^{-1} \mu_i = \Sigma_j^{-1} Z^T (Z \Sigma_j^{-1} Z^T)^{-1} Z \Sigma_j^{-1} \mu_j. \qquad (5)$$

Note that since $Z$ is full row-rank, invertibility of $Z \Sigma_i^{-1} Z^T$ is guaranteed. Multiplying (4) and (5) on the left by $Z$, we arrive at the following necessary conditions for sufficiency of $Tx$, in terms of $Z$:

$$Z \Sigma_i^{-1} = Z \Sigma_j^{-1}, \qquad Z \Sigma_i^{-1} \mu_i = Z \Sigma_j^{-1} \mu_j, \qquad 1 \le i < j \le C. \qquad (6)$$

Furthermore, if these conditions hold, by substitution in (3), $D(T, \omega_i, x)$ will be independent of "i" for any fixed $x$. Therefore, the conditions in (6) are necessary and sufficient for sufficiency of $Tx$. These conditions can be written in terms of any arbitrary vector $z$ in the kernel space of $T$ as follows:

$$z^T \Sigma_i^{-1} = z^T \Sigma_j^{-1}, \qquad z^T \Sigma_i^{-1} \mu_i = z^T \Sigma_j^{-1} \mu_j, \qquad 1 \le i < j \le C, \quad \forall z \in \mathrm{Ker}(T). \qquad (7)$$

The latter conditions also apply to any linear operator providing a sufficient statistic even if it is not full rank. Therefore, any linear operator $T$ provides a sufficient statistic if and only if the projections of $\Sigma_i^{-1}$ and $\Sigma_i^{-1}\mu_i$ onto $\mathrm{Ker}(T)$ do not vary with "i". Moreover, if there exists a full row-rank operator $T$ satisfying (7), any other linear operator also satisfies (7) if and only if its row space includes that of $T$. Since (7) holds by choosing $I_{n\times n}$ as $T$, there always exists a full row-rank operator $T$ that satisfies (7). However, an operator with a row-rank lower than $n$ may also satisfy (7). The next section seeks a minimum row-rank $T$ satisfying (7) through further simplification of the conditions.

4. Whitened linear sufficient statistic

The general linear sufficient statistic conditions for the classification of Gaussian distributions were derived in the previous section. In order to simplify these conditions without introducing any asymmetry, a prior whitening and centering step with respect to the average mean and covariance is used:

$$x' = \bar{\Sigma}^{-1/2}(x - \bar{\mu}).$$

The whitened means and covariances and their averages are

$$\Sigma'_i = \bar{\Sigma}^{-1/2}\Sigma_i\bar{\Sigma}^{-1/2}, \quad \mu'_i = \bar{\Sigma}^{-1/2}(\mu_i - \bar{\mu}), \quad \bar{\Sigma}' = \sum_{i=1}^{C} P(\omega_i)\Sigma'_i = I, \quad \bar{\mu}' = \sum_{i=1}^{C} P(\omega_i)\mu'_i = 0. \qquad (8)$$

Since sufficiency is invariant under translation and invertible transformation of the data [21], the MDLSS can be sought in the whitened space. According to (7), a linear operator $T'$ provides a linear sufficient statistic in the whitened space if and only if

$$z^T \Sigma_i'^{-1} = z^T \Sigma_j'^{-1}, \qquad z^T \Sigma_i'^{-1}\mu'_i = z^T \Sigma_j'^{-1}\mu'_j, \qquad 1 \le i < j \le C, \quad \forall z \in \mathrm{Ker}(T'). \qquad (9)$$

The following lemma [13] helps in simplifying (9).

Lemma 4.1. Let $A_i \in \mathbb{R}^{n\times n}$, $1 \le i \le m$, be invertible matrices. If coefficients $a_i \in \mathbb{R}$, $1 \le i \le m$, exist such that

$$\sum_{i=1}^{m} a_i = 1, \qquad (10)$$

$$\sum_{i=1}^{m} a_i A_i = I, \qquad (11)$$

then the following set of conditions are equivalent for $z \in \mathbb{R}^n$:

(a) $z^T A_i = z^T A_j$, for all $1 \le i < j \le m$;
(b) $z^T A_i = z^T$, for all $1 \le i \le m$;
(c) $z^T A_i^{-1} = z^T$, for all $1 \le i \le m$;
(d) $z^T A_i^{-1} = z^T A_j^{-1}$, for all $1 \le i < j \le m$.

Proof. See Appendix B.

Since (8) holds, we can set $m = C$, $A_i = \Sigma'_i$, and $a_i = P(\omega_i)$ in Lemma 4.1. Then the first line of (9) is equivalent to

$$z^T \Sigma_i'^{-1} = z^T, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T'). \qquad (12)$$

Substitution into the second line of (9) yields

$$z^T \mu'_i = z^T \mu'_j, \quad 1 \le i < j \le C, \quad \forall z \in \mathrm{Ker}(T'),$$

which, using (8), is equivalent to

$$z^T \mu'_i = z^T \bar{\mu}' = 0, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T').$$

Also, (12) is equivalent to

$$z^T \Sigma'_i = z^T, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T').$$

Therefore, the necessary and sufficient conditions of (9) for sufficiency of $T'$ can be stated in the following simplified form:

$$z^T \mu'_i = 0, \quad z^T(\Sigma'_i - I) = 0, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T'). \qquad (13)$$

The conditions in (13) can be written in matrix format as

$$M^T z = 0, \quad \forall z \in \mathrm{Ker}(T'),$$

where

$$M_{n\times C(n+1)} = [A_1, A_2, \ldots, A_C], \qquad A_i = [\mu'_i,\ (\Sigma'_i - I)], \quad 1 \le i \le C. \qquad (14)$$

Therefore, $T'$ provides a sufficient statistic for the whitened data if and only if $\mathrm{Ker}(T') \subseteq \mathrm{Ker}(M^T)$, or equivalently, $\mathrm{Row}(M^T) \subseteq \mathrm{Row}(T')$, or $\mathrm{Col}(M) \subseteq \mathrm{Row}(T')$. Assuming $\mathrm{Col}(M) \subseteq \mathrm{Row}(T')$, $T'$ guarantees a
linear sufficient statistic. But, to reach a MDLSS, two further conditions must be met. First, any redundant linear dependency between the rows of $T'$ should be avoided, hence $T'$ needs to be full row-rank. Second, any rows of $T'$ that are not needed to span $\mathrm{Col}(M)$ are redundant and should be removed. These are the only two additional conditions needed for a MDLSS. Therefore, $T'$ provides a MDLSS for the whitened Gaussian distributions if and only if it is full row-rank and satisfies

$$\mathrm{Row}(T') = \mathrm{Col}(M). \qquad (15)$$

Therefore, the minimum dimension of the linear sufficient statistic, $d_{\min}$, equals the rank of $M$. Furthermore, in order to find $T'$, the condition in (15) can be stated as the equivalent condition that a full rank decomposition of $M$ in the following form can be found:

$$M = T'^{\,T}_{n\times d_{\min}}\, G_{d_{\min}\times C(n+1)}, \qquad \text{with} \quad \mathrm{rank}(T') = \mathrm{rank}(M) = d_{\min}. \qquad (16)$$

For any $T'$ satisfying (16) and for any non-singular matrix $A \in \mathbb{R}^{d_{\min}\times d_{\min}}$, $AT'$ also satisfies this condition. Therefore, the MDLSS is not unique. However, if we find one solution with orthonormal rows, any other solution can be found by applying a linear transformation on this solution. As a case in point, the "Q" matrix in the QR decomposition or the left singular vectors corresponding to the non-zero singular values in the singular value decomposition (SVD) will result in a solution for $T'$ with orthonormal rows.

Finally, by combining the whitening transformation and the calculated orthonormal $T'$, a MDLSS for the original data domain $x$ is provided by $T = T'\bar{\Sigma}^{-1/2}$. Due to the prior whitening step and the orthonormality of $T'$, the statistic provided by $T$ is whitened with respect to $\bar{\Sigma}$ and is called the whitened linear sufficient statistic (WLSS). The dimension of this MDLSS is the same as the rank of $T'$, which equals the rank of $M$, or $d_{\min}$. It can be verified that $d_{\min}$ equals the rank of $M_{\mathrm{CLSS}}$ in (2). Furthermore, it can be shown that if the accurate actual $\mu_i$ and $\Sigma_i$ values are used, the operators proposed by WLSS and CLSS are related to each other through a non-singular linear transformation.

4.1. Feature extraction using MDLSS approximation

In practice, the actual $\mu_i$ and $\Sigma_i$ values are not known and can only be estimated using the training samples, e.g. using the maximum likelihood (ML) estimator. These estimated parameters, i.e. $\hat{\mu}_i$ and $\hat{\Sigma}_i$, can be plugged into (14) to estimate $M$ as $\hat{M}$. For any desired $d$, we will then estimate a $d$-rank whitened MDLSS operator for feature extraction.

Assume a given target dimension $d$. Let $T = T'\hat{\bar{\Sigma}}^{-1/2}$, where the rows of $T'$ are taken as the $d$ left singular vectors of $\hat{M}$ with the largest (non-zero) singular value magnitudes. Using SVD properties [22] and a procedure similar to [17], it can be shown that $T'$ approximates an orthonormal MDLSS operator for the whitened data in the Frobenius norm or least squares sense. Additionally, the ignored smallest singular values of $\hat{M}$ are the ones most probable to be generated by the parameter estimation errors. Moreover, this approximation can be improved by using the weighted least squares method as follows.

Those left singular vectors of $\hat{M}$ which correspond to the singular values with the largest magnitudes coincide with those eigenvectors of $\hat{M}\hat{M}^T$ which correspond to the largest eigenvalues [22]. But

$$\hat{M}\hat{M}^T = \sum_{i=1}^{C} \hat{\bar{\Sigma}}^{-1/2}(\hat{\mu}_i - \hat{\bar{\mu}})(\hat{\mu}_i - \hat{\bar{\mu}})^T\hat{\bar{\Sigma}}^{-1/2} + \sum_{i=1}^{C}\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)^T,$$

and the different terms corresponding to the different classes can be weighted in proportion to the estimated prior class probabilities:

$$S_H = \sum_{i=1}^{C} P(\omega_i)\,\hat{\bar{\Sigma}}^{-1/2}(\hat{\mu}_i - \hat{\bar{\mu}})(\hat{\mu}_i - \hat{\bar{\mu}})^T\hat{\bar{\Sigma}}^{-1/2} + \sum_{i=1}^{C} P(\omega_i)\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)^T. \qquad (17)$$

Selecting the rows of $T'$ as the eigenvectors of $S_H$ corresponding to the largest eigenvalues is equivalent to a weighted least squares approximation to the column space of $\hat{M}$, with each column of $\hat{M}$ weighted according to the corresponding prior probability. Using this construction for $T'$, the proposed whitened linear sufficient statistic (WLSS) FE for estimated Gaussian distributions is $T = T'\hat{\bar{\Sigma}}^{-1/2}$, as demonstrated in the algorithm of Fig. 1.

[Fig. 1. Pseudocode for the training stage of the proposed WLSS feature extraction method.]

Unlike the CLSS solution, WLSS is not based on normalization with respect to the parameters of a specific class and hence its solution is not biased towards the parameters of any class [18,13]. This fundamental difference accounts for the prominent superiority of WLSS over CLSS in the simulations of Section 5.

Furthermore, these FEs differ in their computational complexity. The FE computations consist of the offline training phase and the online testing phase. Although rapid calculation is more critical in the online testing phase, all linear FEs share the same efficient testing phase operator, i.e. a simple linear projection. Therefore, only the distinguishing complexity of the training phase is included in Table 3. It should be noted that, for a high-dimensional problem, the highly demanding training phase computations can still determine the cost, and hence feasibility, of a FE. The training complexity for all the considered FEs consists of a common $O(Nn^2)$ term for estimating the class means and covariances and a method-specific term for the calculation and decomposition of the scatter matrices. The second distinct term of computational complexity for WLSS training is $O(n^3)$, which is the same as that of LDA, versus $O(C^2 n^3)$ for the Chernoff method [15] and $O(C n^3)$ for approximate information discriminant analysis (AIDA) [4] (ref. Table 3). This difference in complexity will be significant for high data dimension and a large number of classes.

Unlike Chernoff, AIDA, and CLSS, the WLSS method does not require the individual class covariance matrices to be non-singular. Also, the simple formulation of WLSS, which does not involve logarithms of covariance matrices, is expected to be less affected by the estimation errors. Finally, the formulation of $S_H$ in (17) is similar to the whitened between-class scatter for LDA. Therefore, enhancements similar to weighted LDA [7], fractional-step LDA [23], and regularized LDA [24,25] can also be extended to the WLSS heteroscedastic method.
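To make the training procedure of Fig. 1 concrete, the following NumPy sketch implements the steps described above: ML parameter estimation, whitening by the average covariance, the weighted scatter $S_H$ of (17), and its eigendecomposition. It is a minimal illustration rather than the authors' code; the function name `wlss_train` and its interface are ours, and numerical safeguards (e.g. regularization of $\hat{\bar{\Sigma}}$) are omitted.

```python
import numpy as np

def wlss_train(X, y, d):
    """Sketch of WLSS training (Section 4.1): returns the d x n operator T."""
    classes = np.unique(y)
    n = X.shape[1]
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]
        priors.append(Xc.shape[0] / X.shape[0])
        means.append(Xc.mean(axis=0))
        covs.append(np.cov(Xc, rowvar=False, bias=True))   # ML covariance estimate
    priors = np.asarray(priors); means = np.asarray(means); covs = np.asarray(covs)

    # Prior-weighted average mean and covariance used for whitening.
    mu_bar = priors @ means
    sigma_bar = np.einsum('i,ijk->jk', priors, covs)

    # Inverse square root of the average covariance via eigendecomposition.
    w, V = np.linalg.eigh(sigma_bar)
    W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T                 # sigma_bar^{-1/2}

    # Heteroscedastic scatter S_H of Eq. (17) in the whitened space.
    S_H = np.zeros((n, n))
    I = np.eye(n)
    for p, mu_i, sigma_i in zip(priors, means, covs):
        m = W @ (mu_i - mu_bar)                             # whitened class mean
        D = W @ sigma_i @ W - I                             # whitened covariance deviation
        S_H += p * (np.outer(m, m) + D @ D.T)

    # Rows of T' = top-d eigenvectors of S_H; overall operator T = T' sigma_bar^{-1/2}.
    evals, evecs = np.linalg.eigh(S_H)
    T_prime = evecs[:, np.argsort(evals)[::-1][:d]].T
    return T_prime @ W

# Feature extraction for new samples (linear testing phase): Y = X_new @ T.T
```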
4.2. Homoscedastic case

The homoscedastic model is useful in many problems where the more significant and reliable information lies in the class mean differences. In this model, $\Sigma_i = \bar{\Sigma}$ and the class-conditional distributions differ only in their mean values. By substituting all the covariance estimates in (17) with the pooled estimate of the common covariance matrix $\hat{\bar{\Sigma}}$, $S_H$ simplifies to

$$S_H = \sum_{i=1}^{C} P(\omega_i)\,\hat{\bar{\Sigma}}^{-1/2}(\hat{\mu}_i - \hat{\bar{\mu}})(\hat{\mu}_i - \hat{\bar{\mu}})^T\hat{\bar{\Sigma}}^{-1/2}. \qquad (18)$$

The rank of $S_H$ in (18) is at most $C-1$. Therefore, $d_{\min} \le C-1$ and the choice of $d$ for FE is limited to $d \le C-1$. Furthermore, it can be shown that the value of $d_{\min}$ coincides with the minimum dimension derived by [19] for generally non-linear sufficient statistics of Gaussian distributions. Thus, for homoscedastic Gaussian data, the minimum-dimensional sufficient statistic can be achieved by a linear operator. Moreover, if $n > C-1$, almost surely the class means are linearly independent and the rank of $S_H$ is $C-1$.

The homoscedastic formula in (18) reveals that the homoscedastic $S_H$ is proportional to the whitened between-class scatter matrix used by LDA. Thus, those eigenvectors of the two matrices which have the largest corresponding eigenvalues should span the same subspace. Therefore, considering that the solution set of both methods is closed under any non-singular linear transformation, the solution sets of WLSS and LDA coincide in the homoscedastic case [13]. Thus, WLSS can be considered as a heteroscedastic extension of LDA. In fact, considering the theoretical foundation of WLSS, the LDA method can be derived from WLSS.

5. Experimental evaluation

The simulations focus on the comparison of the LDA [26], CLSS [17], Mahalanobis [16], Chernoff [15], AIDA [4], and WLSS FEs. All these linear FEs can be trained using a non-iterative procedure involving SVD or eigendecomposition. Moreover, all of them are directly or indirectly based on the assumption of Gaussian data and depend only on the first two moments of the data [15]. Therefore, they are closely related to the quadratic Gaussian Bayes classifier which is used for classification. Also, it should be noted that Mahalanobis, Chernoff, AIDA, and WLSS can be considered as heteroscedastic extensions of LDA. The simulations are run on both synthetic data and standard pattern recognition data sets.

5.1. Experiments on synthetic data

This section demonstrates that certain FEs provide a MDLSS, and compares the performance of the different methods when the data do not deviate from the Gaussian distribution. Therefore, random synthetic data generated according to randomly selected Gaussian distributions are utilized. The data belong to $C = 6$ Gaussian distributions $\mathcal{N}(\mu_i, \Sigma_i)$ with dimension $n = 50$. The $\mu_i$ values are randomly selected in a unit cube in $\mathbb{R}^n$. The values for $\Sigma_i$ are selected as $F(AA^T + B_i B_i^T)$, where $F$ is a constant scaling factor, $A$ is a matrix with uniform random entries between $-0.5$ and $+0.5$, and $B_i$, $1 \le i \le C$, is a matrix whose entries in the first 20 rows and columns are uniform random values between $-0.5$ and $+0.5$ and whose other entries are set to zero. Therefore, the class covariances differ randomly only in 20 dimensions. Also, 100 testing samples per class are generated and their extracted features are classified by a quadratic Gaussian classifier.

In Fig. 2, the accurately known actual values of $\mu_i$ and $\Sigma_i$ are used for all the feature extraction methods. Hence, the FE and the Bayes classifier are designed based on the actual parameters. Therefore, their performance is compared in the absence of any parameter estimation error. The scaling factor $F$ is set to 1 in Fig. 2(a) and to 100 in Fig. 2(b). In each case, the selection of means and covariances is repeated 100 times, and the average error rate (AER) of the different methods is plotted versus $d$.

Due to the optimality of the quadratic Bayes classifier for Gaussian data, its probability of error never increases with the inclusion of new data features [14]. Therefore, the curves in Fig. 2 are non-increasing. Moreover, if a method provides a sufficient statistic at $d$, the AER at this dimension should almost equal the Bayes error for the original $n$-dimensional data. The CLSS, Chernoff, AIDA, and WLSS methods can extract any necessary features. However, LDA cannot provide more than $\min(C-1, n) = 5$ features. Also, the Mahalanobis method does not provide more than $\min(C(C-1)/2, n) = 15$ features. For these methods with a limited number of available features, a horizontal line without marker is plotted after the last available performance to indicate the infeasible range of $d$.

[Fig. 2. Average error rate of the quadratic classifier with different FEs using the actual $\mu_i$ and $\Sigma_i$ parameters for heteroscedastic Gaussian distributions. (a) $F = 1$ and (b) $F = 100$.]
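The synthetic model described above can be reproduced in spirit with the following sketch. It follows the stated construction ($\mu_i$ in the unit cube, $\Sigma_i = F(AA^T + B_iB_i^T)$ with heteroscedasticity confined to the first 20 dimensions); the function names, the random seed handling, and the equal-prior sampling are our own assumptions.

```python
import numpy as np

def make_synthetic_classes(C=6, n=50, hetero_dims=20, F=1.0, rng=None):
    """Random Gaussian classes whose covariances differ only in the first
    `hetero_dims` dimensions, as in Section 5.1."""
    rng = np.random.default_rng(rng)
    means = rng.uniform(0.0, 1.0, size=(C, n))            # class means in the unit cube
    A = rng.uniform(-0.5, 0.5, size=(n, n))               # shared covariance factor
    covs = []
    for _ in range(C):
        B = np.zeros((n, n))
        B[:hetero_dims, :hetero_dims] = rng.uniform(-0.5, 0.5,
                                                    size=(hetero_dims, hetero_dims))
        covs.append(F * (A @ A.T + B @ B.T))              # Sigma_i = F(AA^T + B_i B_i^T)
    return means, np.asarray(covs)

def sample_class(mean, cov, n_samples, rng=None):
    """Draw i.i.d. samples from one class-conditional Gaussian."""
    rng = np.random.default_rng(rng)
    return rng.multivariate_normal(mean, cov, size=n_samples)
```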
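The classifier applied to the extracted features throughout this section is the quadratic Gaussian (Bayes) classifier; when estimated parameters are used it becomes the plug-in quadratic classifier discussed below. The following sketch shows one straightforward implementation under the class-conditional Gaussian model; the function name and interface are ours.

```python
import numpy as np

def quadratic_gaussian_classify(Y, means, covs, priors):
    """Assign each feature vector in Y (m x d) to the class with the largest
    Gaussian log-posterior; means/covs/priors are per-class parameters in the
    feature space (actual values or plug-in estimates)."""
    scores = []
    for mu, S, p in zip(means, covs, priors):
        diff = Y - mu
        _, logdet = np.linalg.slogdet(S)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        scores.append(np.log(p) - 0.5 * (logdet + maha))
    return np.argmax(np.stack(scores, axis=1), axis=1)
```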
When $d \le C-1 = 5$, LDA prevails in Fig. 2(a) for the smaller covariance scale, while Chernoff, AIDA, and WLSS provide the least AER in Fig. 2(b) for the larger covariance scale. This can be explained as follows. For small covariances, the differences in the class means provide the major discriminatory information that is exploited by LDA. However, for large covariances, the class-conditional distributions significantly overlap; therefore, utilization of the covariance information is crucial in distinguishing the classes.

For $d > C-1$ in Fig. 2(a) and (b), Chernoff, AIDA, and WLSS achieve the best performance, with Chernoff marginally surpassing the other two. The persistent significant gap between the WLSS and CLSS performance can be explained by the inherent bias of the CLSS solution towards $\omega_1$ [18,13]. The CLSS, Chernoff, AIDA, and WLSS methods are limited to $d_{\min} = 25$ and provide the same AER as the original $n$-dimensional data, 0.019 in Fig. 2(a) and 0.038 in (b), at this dimension. This demonstrates the minimum-dimensional sufficiency of these statistics. The minimum dimension of 25 for a linear sufficient statistic in this problem can be verified theoretically from the rank of $S_H$.

The experiments of Fig. 3 include the effect of parameter estimation error on the FE and classifier performance while the data still belong to Gaussian distributions. The experiments are performed on the previously described synthetic data, but the actual mean and covariance values are not known to the classifier and FE. These parameters are estimated, using the ML estimator, based on 70 training samples per class during the training phase of the classification system. Selection of the actual $\mu_i$ and $\Sigma_i$ parameters is repeated 10 times, and for each selection of parameters, the training and testing samples are generated 10 times. In each of the resulting 100 repetitions, the FEs and the plug-in quadratic classifier are trained using the provided training samples, and their AER is calculated over 10 testing samples per class. Finally, the average error rate over the 100 repetitions is plotted versus $d$.

Fig. 3(a) and (b) depict the AER for all the methods when $F = 1$ and $100$, respectively. It should be noted that since the estimated parameters $\hat{\mu}_i$ and $\hat{\Sigma}_i$ are used, CLSS, Chernoff, AIDA, and WLSS cannot determine the value of $d_{\min}$ and their feasible range of $d$ continues up to $n$. Furthermore, for the plug-in classifier, the probability of error might be increasing versus $d$ in some ranges due to the effect of parameter estimation error on both the classifier and the FE.

From Fig. 3(a) and (b), when $d \le C-1 = 5$, Chernoff, AIDA, and WLSS provide the least AER for the larger covariance scale ($F$), while LDA prevails for the smaller covariance scale as before. When $d > C-1$, as seen in Fig. 3(a) and (b), Chernoff, AIDA, and WLSS outperform the other methods, with marginally improving performance in this order. The relative performance of WLSS and the other two methods has generally switched compared to the case with known $\mu_i$ and $\Sigma_i$ values. This can be explained by the higher complexity of Chernoff and AIDA and their reliance on the logarithms and inverses of the class covariances, which are more severely affected by the estimation errors than the covariance estimates themselves. Also, CLSS is based on normalization with respect to the estimated $\omega_1$ parameters, which have higher variance than the estimated average class parameters ($\hat{\bar{\mu}}$ and $\hat{\bar{\Sigma}}$). Therefore, the reliance of CLSS on the estimated $\omega_1$ parameters leads to the relatively low performance of CLSS.

As mentioned before, $d_{\min} = 25$ is the minimum linear sufficient statistic dimension for this problem. However, the AER of Chernoff, AIDA, and WLSS based on the estimated parameters reaches a minimum value at a dimension $d$ lower than $d_{\min}$, as shown in Fig. 3. This phenomenon is caused by the decreased accuracy of the estimated parameters for larger values of $d$. Thus, the additional information of the last features cannot compensate for the additional estimation error incurred by the growth in dimensionality, and the minimum AER occurs at a dimension lower than $d_{\min}$. Furthermore, this minimum AER is generally lower than the AER for the original data, 0.268 in Fig. 3(a) and 0.322 in (b).

[Fig. 3. Average error rate of the quadratic classifier with different FEs using the estimated parameters $\hat{\mu}_i$ and $\hat{\Sigma}_i$ for heteroscedastic Gaussian distributions. (a) $F = 1$ and (b) $F = 100$.]

Table 1. Specifications of the selected classification data sets from the UCI repository: original data dimensionality ($n$), data dimensionality after PCA ($d_{\mathrm{PCA}}$), number of classes ($C$), total number of samples ($N$), average number of samples per class ($\bar{N}_i$), and multivariate heteroscedasticity measure ($J_H$).

Database name                   n    d_PCA   C    N      N̄_i      J_H
(a) Wisconsin breast cancer     9    9       2    683    341.5    318.10
(b) BUPA liver disorder         6    6       2    345    172.5    7.63
(c) Pima Indians diabetes       8    8       2    768    384      43.44
(d) Cleveland heart disease     13   13      5    297    59.4     0.75
(e) Iris plants                 4    4       3    150    50       14.06
(f) Thyroid gland               5    5       3    215    71.7     93.69
(g) Vowel context               10   10      11   990    90       185.99
(h) Image segmentation          19   14      7    2310   330      1266.04
(i) Waveform                    21   21      3    5000   1666.7   1095.92

5.2. Experiments on UCI data sets

The University of California, Irvine (UCI) Machine Learning Repository [27] was initiated in 1987 and consists of online data sets, domain theories, and data generators. The data set repository includes multivariate, univariate, and text data types for various purposes such as classification and regression. We have adopted
several multivariate UCI data sets for classification purposes. Specifications of the selected data sets are given in Table 1. A heteroscedasticity test is applied to every data set to investigate the need for a heteroscedastic approach. For this purpose, a multivariate test [28] designed for data sets with fewer samples than the data dimension is deployed. The resulting scalar measure $J_H$ provides an unbiased estimate of the normalized Frobenius scatter of the covariance matrices. This measure is distributed as $\mathcal{N}(0,1)$ under the null hypothesis of homoscedasticity. Therefore, assuming a significance level of $\alpha = 0.05$, values of $J_H$ in Table 1 exceeding the $(1-\alpha)$ quantile, i.e. 1.64, indicate significant heteroscedasticity.

The experiments of this section compare the existing FEs with our proposed FE when the $\mu_i$ and $\Sigma_i$ parameters need to be estimated and the actual distributions are not Gaussian. The probability of classification error on each data set is estimated as the AER over 100 repetitions based on the rotation method [1]. In each repetition, the available samples of the data set are split into training and testing samples. The training samples are used to calculate ML estimates of $\mu_i$ and $\Sigma_i$. In Figs. 5 and 6, the training set consists of 10% and 90% of the total samples, respectively.

The experimental setup is illustrated in Fig. 4. In the preprocessing stage, the data are projected onto the $d_{\mathrm{PCA}}$-dimensional space $\mathrm{Col}(\hat{\bar{\Sigma}})$ using principal component analysis (PCA) to ensure a non-singular average covariance. Then, the class covariances are slightly regularized toward their non-singular average using a regularization coefficient of 0.001. This step guarantees the non-singularity of $\hat{\Sigma}_i$ as required by the CLSS, AIDA, and Chernoff FEs as well as the quadratic classifier.

[Fig. 4. Experimental setup for UCI experiments.]

Figs. 5 and 6 reveal that feature extraction helps to decrease the AER of the original data, or at least the required dimensionality. However, the effectiveness of heteroscedastic feature extraction becomes evident in data set (g). From Table 1, the competitive advantage of the heteroscedastic methods in data set (g) can be attributed to significant heteroscedasticity ($J_H \gg 1.6$) and the high number of classes ($C = 11$), which makes it challenging to perform classification based on only the class means.

[Fig. 5. Average error rate versus number of features for UCI data sets when 10% of the samples are used for training.]

[Fig. 6. Average error rate versus number of features for UCI data sets when 90% of the samples are used for training.]

The larger training set size in Fig. 6 leads to more accurate parameter estimates. As observed in the synthetic experiments,
performances of Chernoff, AIDA, and WLSS are close to each other in this case. However, in Fig. 5, with less accurate estimation, the WLSS error rate is lower than that of Chernoff or AIDA for data sets (e)–(g). As shown in Table 1 for these three data sets, the average number of samples per class, i.e. $\bar{N}_i = N/C$, is relatively small and simultaneously $J_H \gg 1.6$. In addition, among these three data sets, WLSS achieves a higher performance gain for larger $J_H$. In the other data sets, WLSS consistently outperforms the other heteroscedastic methods, while requiring the lowest computational complexity (ref. Section 4.1).

The approximate time required for offline training of each FE using 90% of the samples of data set (h) is shown in Table 2. The reported times do not include the time for the common parameter estimation step with $O(Nn^2)$ complexity (1.14 ms) and the time for the PCA step (2.31 ms). Each duration is calculated on a system with eight X5355 processors and 8 GB memory, and is averaged over 100 iterations. Note that here the FE input data is of dimension 19, and these durations will be more differentiating for higher dimensions.

Finally, the bias of the CLSS solution toward $\omega_1$ leads to its relatively poor performance on most data sets. In fact, even the higher estimation accuracy in Fig. 6 does not compensate for this inherent bias.
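The preprocessing stage of Fig. 4 (PCA onto the column space of the average covariance, followed by slight regularization of each class covariance toward the average) can be sketched as follows. The regularization formula below is one plausible reading of "regularized toward their non-singular average with coefficient 0.001"; the function name, interface, and eigenvalue threshold are our own assumptions.

```python
import numpy as np

def preprocess(X_train, y_train, reg=0.001):
    """Project onto Col(average covariance) via PCA and regularize class covariances."""
    classes, counts = np.unique(y_train, return_counts=True)
    priors = counts / counts.sum()
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    covs = np.array([np.cov(X_train[y_train == c], rowvar=False, bias=True)
                     for c in classes])
    sigma_bar = np.einsum('i,ijk->jk', priors, covs)       # average covariance

    # Keep only principal directions with non-negligible variance (Col(sigma_bar)).
    w, V = np.linalg.eigh(sigma_bar)
    keep = w > 1e-10 * w.max()
    P = V[:, keep]                                          # n x d_PCA projection

    covs_p = np.array([P.T @ S @ P for S in covs])
    sigma_bar_p = P.T @ sigma_bar @ P
    # Slight regularization of each class covariance toward the average.
    covs_reg = (1 - reg) * covs_p + reg * sigma_bar_p
    means_p = means @ P
    return P, means_p, covs_reg, priors
```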
Table 2. Average time (in milliseconds) required for training different FEs on data set (h). The figures include neither the time for the parameter estimation step, which is the same for all methods, nor the time for the PCA step.

Method              WLSS    Chernoff   AIDA    CLSS    Mahalanobis   LDA
Training time (ms)  1.24    216.86     25.04   1.05    0.87          0.97
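A comparable measurement of training time can be sketched as below. Unlike Table 2, this sketch times the full training call (including parameter estimation); the `wlss_train` function from the earlier sketch and the warm-up/averaging choices are our own assumptions.

```python
import time

def time_training(train_fn, X, y, d, repeats=100):
    """Average wall-clock time (ms) of one FE training call over `repeats` runs."""
    train_fn(X, y, d)                                  # warm-up call
    start = time.perf_counter()
    for _ in range(repeats):
        train_fn(X, y, d)
    return 1000.0 * (time.perf_counter() - start) / repeats

# Example: time_training(wlss_train, X_train, y_train, d=5)
```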
6. Conclusions and remarks

Non-iterative heteroscedastic linear FEs provide practical advantages in three aspects. First, they make use of the discriminatory information in the second order moments of the data in addition to the first order moments utilized by LDA. Second, they alleviate the restriction of $C-1$ on the feature space dimensionality of LDA. Third, their simple formulation based on a few parameters provides acceptable generalizability [15].

This paper proposed a non-iterative heteroscedastic linear FE, called WLSS, based on sufficiency conditions. Simulations on synthetic data demonstrated that WLSS, along with Chernoff and AIDA, significantly outperform the Mahalanobis and CLSS methods. Furthermore, it was illustrated that WLSS achieved even higher standing with respect to Chernoff and AIDA when accurate $\mu_i$ and $\Sigma_i$ parameters are not available. Simulations using real data also confirmed that WLSS provides a consistent improvement in classification accuracy over other FEs.

These FEs can be further compared according to the criteria of Table 3. LDA has also been included for comparison.

Table 3. Comparison of different heteroscedastic linear FEs.

Method        MDLSS   Extension of LDA   Limit on "d"        Multi-class formulation   Tolerance to singular class covariance estimates   Complexity O(.)
WLSS          Yes     Yes                n                   Inherent                  Yes                                                Nn^2 + n^3
Chernoff      Yes     Yes                n                   Pairwise                  No                                                 Nn^2 + C^2 n^3
AIDA          Yes     Yes                n                   Inherent                  No                                                 Nn^2 + C n^3
CLSS          Yes     No                 n                   Inherent                  No                                                 Nn^2 + n^3
Mahalanobis   No      Yes                min{C(C-1)/2, n}    Inherent                  Yes                                                Nn^2 + n^3
LDA           No      -                  min{C-1, n}         Inherent                  Yes                                                Nn^2 + n^3

In the first column of Table 3, Mahalanobis does not result in a MDLSS. WLSS and CLSS are designed for the ideal goal of providing a MDLSS; whereas in the case of Chernoff and AIDA, this property has been determined based on our synthetic simulation results. The second column shows that CLSS does not reduce to the popular LDA in the homoscedastic case. According to the next column, the target dimensionality $d$ for Mahalanobis is further restricted. Chernoff and AIDA typically compete with WLSS in classification performance. However, as shown in Table 3, Chernoff is not inherently multi-class, and provides a sub-optimal pairwise solution for more than two classes. Moreover, both Chernoff and AIDA suffer from a time complexity higher than that of LDA, and this complexity difference is critical for high-dimensional data. Additionally, these two methods cannot be deployed if any of the individual class covariance estimates $\hat{\Sigma}_i$ is singular, and therefore are not applicable in small sample size scenarios.

In conclusion, WLSS features the advantageous properties listed in Table 3. In particular, WLSS approximates a MDLSS, and does not impose any additional computational complexity compared to LDA. Finally, it is noteworthy that WLSS can be further improved through weighting, regularization, and fractional-step dimension reduction.
Appendix A. Proof of Lemma 3.1

By the equivalence of sufficiency and admissibility [14], (1) is equivalent to

$$P(\omega_i\,|\,y) = P(\omega_i\,|\,x), \quad 1 \le i \le C,$$

which is equivalent to $P(\omega_i\,|\,y)/P(\omega_j\,|\,y) = P(\omega_i\,|\,x)/P(\omega_j\,|\,x)$, or, using Bayes' theorem, $f(y\,|\,\omega_i)/f(y\,|\,\omega_j) = f(x\,|\,\omega_i)/f(x\,|\,\omega_j)$, or in terms of the log-likelihoods,

$$\log[f(y\,|\,\omega_i)] - \log[f(x\,|\,\omega_i)] = \log[f(y\,|\,\omega_j)] - \log[f(x\,|\,\omega_j)], \quad 1 \le i < j \le C.$$

Defining $D(T, \omega_i, x) = \log[f(y\,|\,\omega_i)] - \log[f(x\,|\,\omega_i)]$, and using $y = Tx$,

$$D(T, \omega_i, x) = (x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - (x-\mu_i)^T T^T(T\Sigma_i T^T)^{-1}T(x-\mu_i) + \log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right). \qquad (\mathrm{A.1})$$

We need to simplify (A.1). Using the definition of the matrix $Z$,

$$(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) = (x-\mu_i)^T[T^T\ Z^T]\left(\begin{bmatrix} T \\ Z \end{bmatrix}\Sigma_i[T^T\ Z^T]\right)^{-1}\begin{bmatrix} T \\ Z \end{bmatrix}(x-\mu_i),$$

and, using the partitioned matrix inversion formula from [30],

$$\left(\begin{bmatrix} T \\ Z \end{bmatrix}\Sigma_i[T^T\ Z^T]\right)^{-1} = \begin{bmatrix} D_{1(T,\omega_i)} & -D_{2(T,\omega_i)} \\ -D_{2(T,\omega_i)}^T & Q_{(T,\omega_i)}^{-1} \end{bmatrix}, \qquad (\mathrm{A.2})$$

where

$$D_{1(T,\omega_i)} = (T\Sigma_i T^T)^{-1} + (T\Sigma_i T^T)^{-1}T\Sigma_i Z^T Q_{(T,\omega_i)}^{-1} Z\Sigma_i T^T(T\Sigma_i T^T)^{-1},$$
$$D_{2(T,\omega_i)} = (T\Sigma_i T^T)^{-1}T\Sigma_i Z^T Q_{(T,\omega_i)}^{-1},$$
$$Q_{(T,\omega_i)} = Z\Sigma_i Z^T - Z\Sigma_i T^T(T\Sigma_i T^T)^{-1}T\Sigma_i Z^T = Z\Sigma_i^{1/2}\left(I - \Sigma_i^{1/2}T^T(T\Sigma_i T^T)^{-1}T\Sigma_i^{1/2}\right)\Sigma_i^{1/2}Z^T. \qquad (\mathrm{A.3})$$

Define $Q'_{(T,\omega_i)} = I - \Sigma_i^{1/2}T^T(T\Sigma_i T^T)^{-1}T\Sigma_i^{1/2}$. Since $T$ is full row-rank, using the Moore–Penrose pseudoinverse $(\cdot)^{+}$ for a full column-rank matrix [31], $Q'_{(T,\omega_i)} = I - (\Sigma_i^{1/2}T^T)(\Sigma_i^{1/2}T^T)^{+}$. Hence $Q'_{(T,\omega_i)}$ is the operator for orthogonal projection onto $\mathrm{Col}(\Sigma_i^{1/2}T^T)^{\perp}$, the orthogonal complement of $\mathrm{Col}(\Sigma_i^{1/2}T^T)$, or equivalently onto $\mathrm{Ker}(T\Sigma_i^{1/2})$, or, because $TZ^T = 0$, onto $\mathrm{Col}(\Sigma_i^{-1/2}Z^T)$ [31]. Thus, $Q'_{(T,\omega_i)}$ is also an orthogonal projection onto $\mathrm{Col}(\Sigma_i^{-1/2}Z^T)$:

$$Q'_{(T,\omega_i)} = \Sigma_i^{-1/2}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1/2}, \qquad (\mathrm{A.4})$$

which reduces (A.3) to

$$Q_{(T,\omega_i)} = ZZ^T(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T. \qquad (\mathrm{A.5})$$

Using (A.2), (A.4), and (A.5), (A.1) simplifies to

$$D(T, \omega_i, x) = (x-\mu_i)^T\Sigma_i^{-1}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}(Z\Sigma_i^{-1}Z^T)(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1}(x-\mu_i) + \log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right)$$
$$= (x-\mu_i)^T\Sigma_i^{-1}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1}(x-\mu_i) + \log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right).$$

From the partitioned matrix determinant formula [30] (Ref. [13]),

$$\log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right) = \log\left|ZZ^T(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T\right| - \log\left|ZZ^T\right| = \log\left|(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T\right|,$$

which leads to $D(T, \omega_i, x) = (x-\mu_i)^T\Sigma_i^{-1}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1}(x-\mu_i) + \log|(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T|$, or (3). Thus, the sufficiency of $T$ is equivalent to the condition that $D(T, \omega_i, x)$ in (3) is independent of "i".
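As a quick numerical sanity check on the algebra above (not part of the original paper), the following snippet verifies that the log-likelihood difference (A.1) coincides with the simplified expression (3) for a random full row-rank $T$ and its orthogonal complement $Z$.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, d = 6, 2
T = rng.standard_normal((d, n))                 # full row-rank operator
Z = null_space(T).T                             # rows span Ker(T), so T Z^T = 0
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                 # SPD class covariance
mu = rng.standard_normal(n)
x = rng.standard_normal(n)

inv = np.linalg.inv
logdet = lambda M: np.linalg.slogdet(M)[1]
dx = x - mu

# D as defined from the log-likelihood difference, Eq. (A.1).
D_A1 = (dx @ inv(Sigma) @ dx - dx @ T.T @ inv(T @ Sigma @ T.T) @ T @ dx
        + logdet(Sigma) + logdet(T @ T.T) - logdet(T @ Sigma @ T.T))

# D in the simplified form of Eq. (3).
Si = inv(Sigma)
core = Si @ Z.T @ inv(Z @ Si @ Z.T) @ Z @ Si
D_3 = dx @ core @ dx + logdet(inv(Z @ Si @ Z.T) @ Z @ Z.T)

print(np.isclose(D_A1, D_3))                    # expected: True
```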
Appendix B. Proof of Lemma 4.1

Equivalence of (b) and (c) is trivial. Therefore, it suffices to prove the equivalence of the following two pairs of conditions.

Equivalence of (a) and (b): Assuming (b), (a) is trivial. Assuming (a), (b) is deduced as follows. For any $1 \le j \le m$,

$$z^T A_j = z^T A_j \sum_{i=1}^{m} a_i \ \ \text{(using (10))} \ = \sum_{i=1}^{m} a_i z^T A_j = \sum_{i=1}^{m} a_i z^T A_i \ \ \text{(based on (a))} \ = z^T \sum_{i=1}^{m} a_i A_i = z^T \ \ \text{(using (11))}.$$

Equivalence of (c) and (d): If (c) is satisfied, (d) follows trivially. Condition (d) can be written as

$$z^T = z^T A_j^{-1} A_i, \quad \forall\, 1 \le i < j \le m. \qquad (\mathrm{B.1})$$

Therefore, if (d) holds, (c) is established as follows. For any $1 \le j \le m$,

$$z^T A_j^{-1} = z^T A_j^{-1} \sum_{i=1}^{m} a_i A_i = \sum_{i=1}^{m} a_i z^T A_j^{-1} A_i \ \ \text{(using (11))} \ = \sum_{i=1}^{m} a_i z^T = z^T \ \ \text{(using (B.1) and (10))}.$$
References

[1] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 4–37.
[2] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, 2000.
[3] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley-Interscience, 2004.
[4] K. Das, Z. Nenadic, Approximate information discriminant analysis: a computationally simple heteroscedastic feature extraction technique, Pattern Recognition 41 (2008) 1565–1574.
[5] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[6] O. Hamsici, A. Martinez, Bayes optimality in linear discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008).
[7] M. Loog, R.P.W. Duin, R. Haeb-Umbach, Multiclass linear dimension reduction by weighted pairwise Fisher criteria, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 762–766.
[8] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data, with application to face recognition, Pattern Recognition 34 (2001) 2067–2070.
[9] N. Kumar, A.G. Andreou, Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Commun. 26 (1998) 283–297.
[10] G. Saon, M. Padmanabhan, Minimum Bayes error feature selection for continuous speech recognition, in: Advances in Neural Information Processing Systems, vol. 13, MIT Press, 2000, pp. 800–806.
[11] L. Rueda, M. Herrera, Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space, Pattern Recognition 41 (2008) 3138–3152.
[12] L. Rueda, B.J. Oommen, C. Henríquez, Multi-class pairwise linear dimensionality reduction using heteroscedastic schemes, Pattern Recognition 43 (2010) 2456–2465.
[13] M.S. Mahanta, Linear Feature Extraction with Emphasis on Face Recognition, Master's Thesis, University of Toronto, 2009.
[14] L. Devroye, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[15] R.P.W. Duin, M. Loog, Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 732–739.
[16] H. Brunzell, J. Eriksson, Feature reduction for classification of multidimensional data, Pattern Recognition 33 (2000) 1741–1748.
[17] J.D. Tubbs, W.A. Coberly, D.M. Young, Linear dimension reduction and Bayes classification with unknown population parameters, Pattern Recognition 15 (1982) 167–172.
[18] M.S. Mahanta, K.N. Plataniotis, Linear feature extraction using sufficient statistic, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 2218–2221.
[19] L. Buturovic, On the minimal dimension of sufficient statistics, IEEE Trans. Inf. Theory 38 (1992) 182–186.
[20] B.C. Peters Jr., R. Redner, H.P. Decell Jr., Characterizations of linear sufficient statistics, Sankhyā Ser. A 40 (1978) 303–309.
[21] H.P. Decell Jr., P.L. Odell, W.A. Coberly, Linear dimension reduction and Bayes classification, Pattern Recognition 13 (1981) 241–243.
[22] S. Leon, Linear Algebra with Applications, second ed., Macmillan, 1986.
[23] R. Lotlikar, R. Kothari, Fractional-step dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 623–627.
[24] J. Friedman, Regularized discriminant analysis, J. Am. Statist. Assoc. 84 (1989) 165–175.
[25] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Regularization studies on LDA for face recognition, in: Proceedings of the International Conference on Image Processing, 2004, pp. 63–66.
[26] R. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[27] A. Asuncion, D. Newman, UCI Machine Learning Repository, 2007.
[28] J.R. Schott, A test for the equality of covariance matrices when the dimension is large relative to the sample sizes, Comput. Statist. Data Anal. 51 (2007) 6535–6542.
[30] D. Harville, Matrix Algebra from a Statistician's Perspective, Springer, 1997.
[31] T. Boullion, P. Odell, Generalized Inverse Matrices, Wiley-Interscience, 1971.
Mohammad Shahin Mahanta received the dual B.Sc. degree in electrical engineering and petroleum engineering from Sharif University of Technology, Tehran, Iran, in 2007, and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2009. He is currently working toward the Ph.D. degree in electrical and computer engineering. His research interests include statistical signal processing and machine learning, with particular emphasis on feature extraction for biomedical signals.
Amirhossein S. Aghaei received the B.Sc. degree in electrical engineering from Iran University of Science and Technology, Tehran, Iran, in 2006, and the M.A.Sc. degree from the Department of Electrical and Computer Engineering at the University of Toronto, in 2008, where he is currently working toward the Ph.D. degree. His research interests include transceiver design for multiuser MIMO channels, as well as statistical signal processing with emphasis on analysis of improper complex biomedical signal.
Konstantinos N. Plataniotis is a Professor with the ECE Department at the University of Toronto. His research interests are multimedia systems, biometrics, image and signal processing, communications systems and pattern recognition. He is a registered professional engineer in Ontario, and the Editor-in-Chief (2009–2011) for the IEEE Signal Processing Letters.
Subbarayan Pasupathy received the B.E. degree in telecommunications from the University of Madras, the M.Tech. degree in electrical engineering from the Indian Institute of Technology, Madras, and the M.Phil. and Ph.D. degrees in engineering and applied science from Yale University. Currently, he is a Professor Emeritus in the Department of Electrical and Computer Engineering at the University of Toronto, where he has been a faculty member since 1972. His research over the last three decades has mainly been in statistical communication theory and signal processing and their applications to digital communications. He has served as the Chairman of the Communications Group and as the Associate Chairman of the Department of Electrical Engineering at the University of Toronto. He is a registered Professional Engineer in the province of Ontario. During 1982–1989 he was an editor for data communications and modulation for the IEEE Transactions on Communications. He has also served as a technical associate editor for IEEE Communications Magazine (1979–1982) and as an associate editor for the Canadian Electrical Engineering Journal (1980–1983). He wrote a regular humor column entitled "Light Traffic" for IEEE Communications Magazine during 1984–98. Dr. S. Pasupathy was elected as a Fellow of the IEEE in 1991 "for contributions to bandwidth efficient coding and modulation schemes in digital communication," was awarded the Canadian Award in Telecommunications in 2003 by the Canadian Society of Information Theory, was elected as a Fellow of the Engineering Society of Canada in 2004 and as a Fellow of the Canadian Academy of Engineering in 2007. He has been identified as a "highly cited researcher" by ISI Web of Knowledge and his name is listed in ISIHighlyCited.com.