Pattern Recognition 45 (2012) 821–830
Heteroscedastic linear feature extraction based on sufficiency conditions

Mohammad Shahin Mahanta, Amirhossein S. Aghaei, Konstantinos N. Plataniotis, Subbarayan Pasupathy

Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, ON, Canada M5S 3G4
Article history: Received 25 October 2010; received in revised form 21 June 2011; accepted 23 July 2011; available online 4 August 2011.

Abstract
Classification of high-dimensional data typically requires extraction of discriminant features. This paper proposes a linear feature extractor, called whitened linear sufficient statistic (WLSS), which is based on the sufficiency conditions for heteroscedastic Gaussian distributions. WLSS approximates, in the least squares sense, an operator providing a sufficient statistic. The proposed method retains covariance discriminance in heteroscedastic data, while it reduces to the commonly used linear discriminant analysis (LDA) in the homoscedastic case. Compared to similar heteroscedastic methods, WLSS imposes a low computational complexity, and is highly generalizable as confirmed by its consistent competence over various data sets.
Keywords: Feature extraction; Dimension reduction; Sufficient statistic; Heteroscedastic data; Discriminant analysis; Gaussianity; Quadratic classifier
1. Introduction

Pattern recognition systems have been widely used to classify different patterns in various applications, such as personal identification from biometrics (e.g. face recognition), medical diagnosis of patients from experimental results, and data mining [1]. In most of these areas, the high dimensionality of the data poses a major challenge to the classification problem. However, classification of high-dimensional data with correlated features can be simplified, and the corresponding curse of dimensionality [2] can be avoided, by feature extraction, i.e. extracting a small set of discriminant features.

In general, there are two techniques for dimensionality reduction: feature selection and feature extraction. While the former only selects a subset of features, the latter involves a linear or non-linear transformation of the data [1]. The ultimate purpose of feature extraction is to represent the data with the smallest number of features while retaining all the discriminatory information. A feature extractor (FE) achieving this purpose provides a minimum-dimensional sufficient statistic [3] and is Bayes optimal [4]. This paper focuses on linear feature extraction algorithms since they are computationally efficient and preserve certain properties of the data such as Gaussianity. One of the most commonly used linear FEs is linear discriminant analysis (LDA).
LDA maximizes Fisher's criterion, the ratio of the between-class to the within-class scatter, through a computationally efficient training procedure based on matrix eigendecomposition [5]. LDA implicitly assumes Gaussian class-conditional distributions with a common covariance, i.e. homoscedastic as opposed to heteroscedastic data. Under this assumption, the feature set provided by LDA is a sufficient statistic. However, LDA suffers from several problems addressed in the literature [6–8]. One of the most important restrictions of LDA stems from the homoscedasticity assumption, as a result of which any discriminatory information in the class covariance differences is ignored.

Several methods have been proposed to extend LDA to the case of heteroscedastic data. These methods can be categorized into the following three approaches.

In the first heteroscedastic approach, alternative criteria, which may be indirectly related to sufficiency, are iteratively optimized. For example, LDA can also be derived from the criterion of maximizing the likelihood of homoscedastic data. This criterion can be generalized to heteroscedastic data, and hence the linear operator is iteratively optimized to maximize the data likelihood [9]. Alternatively, statistical class distances such as the pairwise Bhattacharyya, Kullback–Leibler, or Chernoff distance can be maximized in the feature space [10–12]. These criteria have been related to the sufficiency of the features [10,13,14]. However, these methods also require iterative training algorithms, and consequently lead to concerns over computational complexity [15].

In the second approach, while the computational efficiency of the non-iterative training procedure is retained, Fisher's criterion is
replaced with similar criteria which incorporate the discriminatory information of class covariance differences. These criteria include the geometrical mean of pairwise Mahalanobis distances between classes [16], the sum of approximated pairwise Chernoff distances between classes [15], and an approximate measure of mutual information between the class variable and the feature variable [4]. However, these criteria are not directly based on sufficiency, and their solutions may incur higher order computational complexity than LDA.

The third approach is a direct method to simplify the sufficiency conditions and derive a non-iterative sufficient statistic FE [17]. However, the work in [17] leads to an asymmetric formulation which results in a feature space biased toward the selected reference class, as discussed in [18]. This asymmetry furthermore causes the resulting feature space to be sensitive to the error in the estimated parameters for that reference class.

This paper takes the direct approach based on the sufficiency conditions. We derive a general formulation for the necessary and sufficient conditions satisfied by a linear sufficient statistic for Gaussian distributions. It is shown that this general formulation can be symmetrically simplified. Hence, a simple FE is formulated which provides unbiased features. Minimal sufficiency of the derived FE ensures that it is the optimal linear FE in theory. In fact, our method simplifies to LDA in the case of homoscedastic data, which is consistent with the Bayes optimality of LDA. Furthermore, the simplicity of our FE formulation enhances its generalizability [2] in limited training sample size scenarios, which has been verified by its consistent performance in different experiments on real data.

Another advantage of our simple formulation is computational efficiency. Training of the proposed linear FE is based on a scatter matrix eigendecomposition, where the complexity of calculating this scatter matrix is the same as the complexity of the corresponding step in LDA.

The next section continues with the problem definition and a review of the sufficiency concept for classification. The general linear sufficient statistic conditions are derived in Section 3. Section 4 provides further simplifications toward a simple FE. Finally, Section 5 is dedicated to experimental evaluation based on both synthetic and real data sets.
2. Problem definition and prior work

Throughout this paper, a typical classification scenario is assumed. Each vectorial data sample^1 $x \in \mathbb{R}^n$ belongs to one of $C$ classes, $\omega_i$, $1 \le i \le C$, with non-zero prior probabilities $P(\omega_i)$ and likelihoods $f(x\,|\,\omega_i)$. The classifier can learn these distributions using the available training samples with known classes. The objective is to achieve the smallest classification error rate when predicting the classes of testing samples.

In our system, the data are first transformed by the FE, $y = T(x)$, and then processed by the classifier. Therefore, the FE is desired to keep all the class-related information of the data. Thus, $y$ is desired to be a sufficient statistic for the classes, i.e.

$$f(x\,|\,\omega_i, y) = f(x\,|\,y), \quad 1 \le i \le C. \qquad (1)$$

Such a statistic $y$ is admissible, i.e. it preserves the posterior class probabilities and hence the Bayes classification error [14]. Therefore, a sufficient statistic with the minimum dimension provides the most compressed representation of the discriminatory information and is the ultimate goal in FE design.

^1 Scalars, vectors, and matrices are respectively shown in regular lowercase/uppercase (e.g. $a$ or $A$), boldface lowercase (e.g. $\mathbf{a}$), and boldface uppercase (e.g. $\mathbf{A}$). The column space, row space, and kernel space of a matrix $A$ are denoted as $\mathrm{Col}(A)$, $\mathrm{Row}(A)$, and $\mathrm{Ker}(A)$.
The minimum dimension for a generally non-linear sufficient statistic was investigated by [19]. The minimum value which holds for all possible data distributions was proved to be $\min(n, C-1)$. However, any possible dimension reduction in this scenario would require the estimation of posterior probabilities, which is itself affected by the curse of dimensionality.

We will assume Gaussian class-conditional densities, $\mathcal{N}(\mu_i, \Sigma_i)$. Gaussianity is a common assumption in most applications. Furthermore, it facilitates the estimation of data distributions, which are determined by only two parameters, $\mu_i$ and $\Sigma_i$, for each class. We will also mainly focus on linear FEs, which can be represented by a linear operator $T$ as follows: $y_{d\times 1} = T_{d\times n}\, x_{n\times 1}$.

In [20,21], based on the above assumptions and a prior data normalization such that $f(x\,|\,\omega_1)$ becomes $\mathcal{N}(0, I)$, the necessary and sufficient conditions for a minimum-dimensional linear sufficient statistic (MDLSS) were formulated. Subsequently, a class-referenced linear sufficient statistic (CLSS) FE was proposed in [17]. The CLSS operator is calculated as $T = T''\,\Sigma_1^{-1/2}$, where $\Sigma_1^{-1/2}$ accounts for the assumed prior normalization of the data and the rows of $T''$ consist of those left singular vectors of the following matrix $M_{\mathrm{CLSS}}$ whose corresponding singular values have the largest magnitudes:

$$M_{\mathrm{CLSS}} = [A_{12}, A_{13}, \ldots, A_{1C}], \qquad A_{1i} = [\Sigma_1^{-1/2}(\mu_i - \mu_1),\ \Sigma_1^{-1/2}\Sigma_i\Sigma_1^{-1/2} - I], \quad 1 \le i \le C. \qquad (2)$$
If the values of the $\mu_i$ and $\Sigma_i$ parameters in (2) are accurate, retaining all the non-zero singular values of $M_{\mathrm{CLSS}}$ guarantees that the operator $T''$ given by CLSS leads to a MDLSS. Otherwise, $T''$ provides a least squares approximation to an orthonormal MDLSS operator for the normalized data [17]. In either case, $T$ also includes the implied $\omega_1$-normalization.

Although CLSS is a simple formulation for a heteroscedastic sufficient statistic, its performance depends on the accuracy of the estimated parameters for the specific reference class $\omega_1$. Additionally, a least squares approximation to the ideal operator is best justified in a space with an isotropic average covariance $\bar{\Sigma}$. However, CLSS applies a least squares approximation in a normalized space with an isotropic $\Sigma_1$. Furthermore, CLSS does not simplify to LDA in the homoscedastic case, where $\Sigma_i = \bar{\Sigma}$, $\forall i$. In fact, its feature space is biased towards $\mu_1 - \bar{\mu}$, where $\bar{\mu}$ is the average class mean, and hence CLSS favors the reference class $\omega_1$ [18,13].

This paper continues with the derivation of the general sufficiency conditions for a linear statistic without any normalization assumption. This step helps to avoid the problems associated with CLSS, which were created by the asymmetrically normalized sufficiency conditions.
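As a concrete illustration of the construction in (2), the following NumPy sketch builds $M_{\mathrm{CLSS}}$ and the resulting operator from given class means and covariances. It is only a minimal sketch of our reading of [17]: the function names and interface are ours, and no safeguards for an ill-conditioned $\Sigma_1$ are included.

```python
import numpy as np

def inv_sqrtm(S):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def clss_operator(means, covs, d):
    """Sketch of the CLSS operator of Eq. (2): T = T'' Sigma_1^{-1/2}."""
    n = means[0].size
    W1 = inv_sqrtm(covs[0])                       # Sigma_1^{-1/2}
    blocks = []
    for mu_i, sigma_i in zip(means[1:], covs[1:]):
        A_1i = np.hstack([(W1 @ (mu_i - means[0]))[:, None],
                          W1 @ sigma_i @ W1 - np.eye(n)])
        blocks.append(A_1i)
    M_clss = np.hstack(blocks)                    # n x (C-1)(n+1)
    U, s, _ = np.linalg.svd(M_clss, full_matrices=False)
    T_pp = U[:, :d].T                             # top-d left singular vectors
    return T_pp @ W1
```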
3. General conditions for sufficiency of linear statistics

An arbitrary function $y$ of $x$ is a sufficient statistic if and only if (1) holds. We assume that $y$ is a linear statistic of the data, $y = Tx$. It is known that any invertible transformation of $y$ does not affect the sufficiency of $y$. Therefore, assuming that $T$ provides a sufficient statistic, any other linear operator, and specifically any full row-rank operator $T_{\mathrm{frr}}$, also provides a sufficient statistic if its row space includes the row space of $T$. Hence, without loss of generality, we assume the operator $T$ to be full row-rank, and the result is finally extended to the general case. Based on the above assumptions, the following lemma simplifies the necessary and sufficient condition in (1) for sufficiency of linear statistics [13].
Lemma 3.1. Let the random data $x \in \mathbb{R}^n$ belong to one of $\omega_i$, $1 \le i \le C$, with the Gaussian conditional densities $f(x\,|\,\omega_i) = \mathcal{N}(\mu_i, \Sigma_i)$ and non-zero prior probabilities $P(\omega_i)$. Then, a linear statistic $y = Tx$, with full row-rank linear transformation $T_{d\times n}$, is a sufficient statistic for $\omega_i$ if and only if $D(T, \omega_i, x)$ defined below is independent of "i" for any $x$:

$$D(T, \omega_i, x) = (x - \mu_i)^T \Sigma_i^{-1} Z^T (Z \Sigma_i^{-1} Z^T)^{-1} Z \Sigma_i^{-1} (x - \mu_i) + \log\left|(Z \Sigma_i^{-1} Z^T)^{-1} Z Z^T\right|, \qquad (3)$$

where the matrix $Z_{(n-d)\times n}$ is full row-rank with $\mathrm{Row}(Z)$ an orthogonal complement of $\mathrm{Row}(T)$ such that $T Z^T = 0$ and $[T^T\ Z^T] \in \mathbb{R}^{n\times n}$ is non-singular.

Proof. See Appendix A.

Assume that the full rank operator $T$ gives a sufficient statistic. Then $D(T, \omega_i, x)$, which denotes the difference between the log-likelihood of $\omega_i$ in the $x$ domain and in the $y$ domain, is independent of "i". Then, from the terms containing $x$ in (3), it can be shown that for $1 \le i < j \le C$:

$$\Sigma_i^{-1} Z^T (Z \Sigma_i^{-1} Z^T)^{-1} Z \Sigma_i^{-1} = \Sigma_j^{-1} Z^T (Z \Sigma_j^{-1} Z^T)^{-1} Z \Sigma_j^{-1}, \qquad (4)$$

$$\Sigma_i^{-1} Z^T (Z \Sigma_i^{-1} Z^T)^{-1} Z \Sigma_i^{-1} \mu_i = \Sigma_j^{-1} Z^T (Z \Sigma_j^{-1} Z^T)^{-1} Z \Sigma_j^{-1} \mu_j. \qquad (5)$$

Note that since $Z$ is full row-rank, invertibility of $Z \Sigma_i^{-1} Z^T$ is guaranteed. Multiplying (4) and (5) on the left by $Z$, we arrive at the following necessary conditions for sufficiency of $Tx$, in terms of $Z$:

$$Z \Sigma_i^{-1} = Z \Sigma_j^{-1}, \qquad Z \Sigma_i^{-1} \mu_i = Z \Sigma_j^{-1} \mu_j, \qquad 1 \le i < j \le C. \qquad (6)$$

Furthermore, if these conditions hold, by substitution in (3), $D(T, \omega_i, x)$ will be independent of "i" for any fixed $x$. Therefore, the conditions in (6) are necessary and sufficient for sufficiency of $Tx$. These conditions can be written in terms of any arbitrary vector $z$ in the kernel space of $T$ as follows:

$$z^T \Sigma_i^{-1} = z^T \Sigma_j^{-1}, \qquad z^T \Sigma_i^{-1} \mu_i = z^T \Sigma_j^{-1} \mu_j, \qquad 1 \le i < j \le C, \quad \forall z \in \mathrm{Ker}(T). \qquad (7)$$

The latter conditions also apply to any linear operator providing a sufficient statistic even if it is not full rank. Therefore, any linear operator $T$ provides a sufficient statistic if and only if the projections of $\Sigma_i^{-1}$ and $\Sigma_i^{-1}\mu_i$ onto $\mathrm{Ker}(T)$ do not vary with "i". Moreover, if there exists a full row-rank operator $T$ satisfying (7), any other linear operator also satisfies (7) if and only if its row space includes that of $T$. Since (7) holds by choosing $I_{n\times n}$ as $T$, there always exists a full row-rank operator $T$ that satisfies (7). However, an operator with a row-rank lower than $n$ may also satisfy (7). The next section seeks a minimum row-rank $T$ satisfying (7) through further simplification of the conditions.

4. Whitened linear sufficient statistic

The general linear sufficient statistic conditions for the classification of Gaussian distributions were derived in the previous section. In order to simplify these conditions without introducing any asymmetry, a prior whitening and centering step with respect to the average mean and covariance is used:

$$x' = \bar{\Sigma}^{-1/2}(x - \bar{\mu}).$$

The whitened means and covariances and their averages are

$$\Sigma'_i = \bar{\Sigma}^{-1/2}\Sigma_i\bar{\Sigma}^{-1/2}, \quad \mu'_i = \bar{\Sigma}^{-1/2}(\mu_i - \bar{\mu}), \quad \bar{\Sigma}' = \sum_{i=1}^{C} P(\omega_i)\Sigma'_i = I, \quad \bar{\mu}' = \sum_{i=1}^{C} P(\omega_i)\mu'_i = 0. \qquad (8)$$

Since sufficiency is invariant under translation and invertible transformation of the data [21], the MDLSS can be sought in the whitened space. According to (7), a linear operator $T'$ provides a linear sufficient statistic in the whitened space if and only if

$$z^T \Sigma_i'^{-1} = z^T \Sigma_j'^{-1}, \qquad z^T \Sigma_i'^{-1}\mu'_i = z^T \Sigma_j'^{-1}\mu'_j, \qquad 1 \le i < j \le C, \quad \forall z \in \mathrm{Ker}(T'). \qquad (9)$$

The following lemma [13] helps in simplifying (9).

Lemma 4.1. Let $A_i \in \mathbb{R}^{n\times n}$, $1 \le i \le m$, be invertible matrices. If coefficients $a_i \in \mathbb{R}$, $1 \le i \le m$, exist such that

$$\sum_{i=1}^{m} a_i = 1, \qquad (10)$$

$$\sum_{i=1}^{m} a_i A_i = I, \qquad (11)$$

then the following set of conditions are equivalent for $z \in \mathbb{R}^n$:

(a) $z^T A_i = z^T A_j$, for all $1 \le i < j \le m$;
(b) $z^T A_i = z^T$, for all $1 \le i \le m$;
(c) $z^T A_i^{-1} = z^T$, for all $1 \le i \le m$;
(d) $z^T A_i^{-1} = z^T A_j^{-1}$, for all $1 \le i < j \le m$.

Proof. See Appendix B.

Since (8) holds, we can set $m = C$, $A_i = \Sigma'_i$, and $a_i = P(\omega_i)$ in Lemma 4.1. Then the first line of (9) is equivalent to

$$z^T \Sigma_i'^{-1} = z^T, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T'). \qquad (12)$$

Substitution into the second line of (9) yields

$$z^T \mu'_i = z^T \mu'_j, \quad 1 \le i < j \le C, \quad \forall z \in \mathrm{Ker}(T'),$$

which, using (8), is equivalent to

$$z^T \mu'_i = z^T \bar{\mu}' = 0, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T').$$

Also, (12) is equivalent to

$$z^T \Sigma'_i = z^T, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T').$$

Therefore, the necessary and sufficient conditions of (9) for sufficiency of $T'$ can be stated in the following simplified form:

$$z^T \mu'_i = 0, \quad z^T(\Sigma'_i - I) = 0, \quad 1 \le i \le C, \quad \forall z \in \mathrm{Ker}(T'). \qquad (13)$$

The conditions in (13) can be written in matrix format as

$$M^T z = 0, \quad \forall z \in \mathrm{Ker}(T'),$$

where

$$M_{n\times C(n+1)} = [A_1, A_2, \ldots, A_C], \qquad A_i = [\mu'_i,\ (\Sigma'_i - I)], \quad 1 \le i \le C. \qquad (14)$$

Therefore, $T'$ provides a sufficient statistic for the whitened data if and only if $\mathrm{Ker}(T') \subseteq \mathrm{Ker}(M^T)$, or equivalently, $\mathrm{Row}(M^T) \subseteq \mathrm{Row}(T')$, or $\mathrm{Col}(M) \subseteq \mathrm{Row}(T')$. Assuming $\mathrm{Col}(M) \subseteq \mathrm{Row}(T')$, $T'$ guarantees a
linear sufficient statistic. But, to reach a MDLSS, two further conditions must be met. First, any redundant linear dependency between the rows of $T'$ should be avoided, hence $T'$ needs to be full row-rank. Second, any rows of $T'$ that are not needed to span $\mathrm{Col}(M)$ are redundant and should be removed. These are the only two additional conditions needed for a MDLSS. Therefore, $T'$ provides a MDLSS for the whitened Gaussian distributions if and only if it is full row-rank and satisfies

$$\mathrm{Row}(T') = \mathrm{Col}(M). \qquad (15)$$

Therefore, the minimum dimension of the linear sufficient statistic, $d_{\min}$, equals the rank of $M$. Furthermore, in order to find $T'$, the condition in (15) can be stated as the equivalent condition that a full rank decomposition of $M$ in the following form can be found:

$$M = T'^{\,T}_{n\times d_{\min}}\, G_{d_{\min}\times C(n+1)}, \qquad \text{with} \quad \mathrm{rank}(T') = \mathrm{rank}(M) = d_{\min}. \qquad (16)$$

For any $T'$ satisfying (16) and for any non-singular matrix $A \in \mathbb{R}^{d_{\min}\times d_{\min}}$, $AT'$ also satisfies this condition. Therefore, the MDLSS is not unique. However, if we find one solution with orthonormal rows, any other solution can be found by applying a linear transformation on this solution. As a case in point, the "Q" matrix in the QR decomposition or the left singular vectors corresponding to the non-zero singular values in the singular value decomposition (SVD) will result in a solution for $T'$ with orthonormal rows.

Finally, by combining the whitening transformation and the calculated orthonormal $T'$, a MDLSS for the original data domain $x$ is provided by $T = T'\bar{\Sigma}^{-1/2}$. Due to the prior whitening step and the orthonormality of $T'$, the statistic provided by $T$ is whitened with respect to $\bar{\Sigma}$ and is called the whitened linear sufficient statistic (WLSS). The dimension of this MDLSS is the same as the rank of $T'$, which equals the rank of $M$, or $d_{\min}$. It can be verified that $d_{\min}$ equals the rank of $M_{\mathrm{CLSS}}$ in (2). Furthermore, it can be shown that if the accurate actual $\mu_i$ and $\Sigma_i$ values are used, the operators proposed by WLSS and CLSS are related to each other through a non-singular linear transformation.

4.1. Feature extraction using MDLSS approximation

In practice, the actual $\mu_i$ and $\Sigma_i$ values are not known and can only be estimated using the training samples, e.g. using the maximum likelihood (ML) estimator. These estimated parameters, i.e. $\hat{\mu}_i$ and $\hat{\Sigma}_i$, can be plugged into (14) to estimate $M$ as $\hat{M}$. For any desired $d$, we will then estimate a $d$-rank whitened MDLSS operator for feature extraction.

Assume a given target dimension $d$. Let $T = T'\hat{\bar{\Sigma}}^{-1/2}$, where the rows of $T'$ are taken as the $d$ left singular vectors of $\hat{M}$ with the largest (non-zero) singular value magnitudes. Using SVD properties [22] and a procedure similar to [17], it can be shown that $T'$ approximates an orthonormal MDLSS operator for the whitened data in the Frobenius norm or least squares sense. Additionally, the ignored smallest singular values of $\hat{M}$ are the ones most probable to be generated by the parameter estimation errors. Moreover, this approximation can be improved by using the weighted least squares method as follows.

Those left singular vectors of $\hat{M}$ which correspond to the singular values with the largest magnitudes coincide with those eigenvectors of $\hat{M}\hat{M}^T$ which correspond to the largest eigenvalues [22]. But

$$\hat{M}\hat{M}^T = \sum_{i=1}^{C} \hat{\bar{\Sigma}}^{-1/2}(\hat{\mu}_i - \hat{\bar{\mu}})(\hat{\mu}_i - \hat{\bar{\mu}})^T\hat{\bar{\Sigma}}^{-1/2} + \sum_{i=1}^{C}\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)^T,$$

and the different terms corresponding to the different classes can be weighted in proportion to the estimated prior class probabilities:

$$S_H = \sum_{i=1}^{C} P(\omega_i)\,\hat{\bar{\Sigma}}^{-1/2}(\hat{\mu}_i - \hat{\bar{\mu}})(\hat{\mu}_i - \hat{\bar{\mu}})^T\hat{\bar{\Sigma}}^{-1/2} + \sum_{i=1}^{C} P(\omega_i)\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)\left(\hat{\bar{\Sigma}}^{-1/2}\hat{\Sigma}_i\hat{\bar{\Sigma}}^{-1/2} - I\right)^T. \qquad (17)$$

Selecting the rows of $T'$ as the eigenvectors of $S_H$ corresponding to the largest eigenvalues is equivalent to a weighted least squares approximation to the column space of $\hat{M}$, with each column of $\hat{M}$ weighted according to the corresponding prior probability. Using this construction for $T'$, the proposed whitened linear sufficient statistic (WLSS) FE for estimated Gaussian distributions is $T = T'\hat{\bar{\Sigma}}^{-1/2}$, as demonstrated in the algorithm of Fig. 1.

[Fig. 1. Pseudocode for the training stage of the proposed WLSS feature extraction method.]

Unlike the CLSS solution, WLSS is not based on normalization with respect to the parameters of a specific class and hence its solution is not biased towards the parameters of any class [18,13]. This fundamental difference accounts for the prominent superiority of WLSS over CLSS in the simulations of Section 5.

Furthermore, these FEs differ in their computational complexity. The FE computations consist of the offline training phase and the online testing phase. Although rapid calculation is more critical in the online testing phase, all linear FEs share the same efficient testing phase operator, i.e. a simple linear projection. Therefore, only the distinguishing complexity of the training phase is included in Table 3. It should be noted that, for a high-dimensional problem, the highly demanding training phase computations can still determine the cost, and hence feasibility, of a FE. The training complexity for all the considered FEs consists of a common $O(Nn^2)$ term for estimating the class means and covariances and a method-specific term for the calculation and decomposition of the scatter matrices. The second distinct term of computational complexity for WLSS training is $O(n^3)$, which is the same as that of LDA, versus $O(C^2 n^3)$ for the Chernoff method [15] and $O(C n^3)$ for approximate information discriminant analysis (AIDA) [4] (ref. Table 3). This difference in complexity will be significant for high data dimension and a large number of classes.

Unlike Chernoff, AIDA, and CLSS, the WLSS method does not require the individual class covariance matrices to be non-singular. Also, the simple formulation of WLSS, which does not involve logarithms of covariance matrices, is expected to be less affected by the estimation errors. Finally, the formulation of $S_H$ in (17) is similar to the whitened between-class scatter for LDA. Therefore, enhancements similar to weighted LDA [7], fractional-step LDA [23], and regularized LDA [24,25] can also be extended to the WLSS heteroscedastic method.
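To make the training procedure of Fig. 1 concrete, the following NumPy sketch implements the steps described above: ML parameter estimation, whitening by the average covariance, the weighted scatter $S_H$ of (17), and its eigendecomposition. It is a minimal illustration rather than the authors' code; the function name `wlss_train` and its interface are ours, and numerical safeguards (e.g. regularization of $\hat{\bar{\Sigma}}$) are omitted.

```python
import numpy as np

def wlss_train(X, y, d):
    """Sketch of WLSS training (Section 4.1): returns the d x n operator T."""
    classes = np.unique(y)
    n = X.shape[1]
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]
        priors.append(Xc.shape[0] / X.shape[0])
        means.append(Xc.mean(axis=0))
        covs.append(np.cov(Xc, rowvar=False, bias=True))   # ML covariance estimate
    priors = np.asarray(priors); means = np.asarray(means); covs = np.asarray(covs)

    # Prior-weighted average mean and covariance used for whitening.
    mu_bar = priors @ means
    sigma_bar = np.einsum('i,ijk->jk', priors, covs)

    # Inverse square root of the average covariance via eigendecomposition.
    w, V = np.linalg.eigh(sigma_bar)
    W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T                 # sigma_bar^{-1/2}

    # Heteroscedastic scatter S_H of Eq. (17) in the whitened space.
    S_H = np.zeros((n, n))
    I = np.eye(n)
    for p, mu_i, sigma_i in zip(priors, means, covs):
        m = W @ (mu_i - mu_bar)                             # whitened class mean
        D = W @ sigma_i @ W - I                             # whitened covariance deviation
        S_H += p * (np.outer(m, m) + D @ D.T)

    # Rows of T' = top-d eigenvectors of S_H; overall operator T = T' sigma_bar^{-1/2}.
    evals, evecs = np.linalg.eigh(S_H)
    T_prime = evecs[:, np.argsort(evals)[::-1][:d]].T
    return T_prime @ W

# Feature extraction for new samples (linear testing phase): Y = X_new @ T.T
```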
4.2. Homoscedastic case

The homoscedastic model is useful in many problems where the more significant and reliable information lies in the class mean differences. In this model, $\Sigma_i = \bar{\Sigma}$ and the class-conditional distributions differ only in their mean values. By substituting all the covariance estimates in (17) with the pooled estimate of the common covariance matrix $\hat{\bar{\Sigma}}$, $S_H$ simplifies to

$$S_H = \sum_{i=1}^{C} P(\omega_i)\,\hat{\bar{\Sigma}}^{-1/2}(\hat{\mu}_i - \hat{\bar{\mu}})(\hat{\mu}_i - \hat{\bar{\mu}})^T\hat{\bar{\Sigma}}^{-1/2}. \qquad (18)$$

The rank of $S_H$ in (18) is at most $C-1$. Therefore, $d_{\min} \le C-1$ and the choice of $d$ for FE is limited to $d \le C-1$. Furthermore, it can be shown that the value of $d_{\min}$ coincides with the minimum dimension derived by [19] for generally non-linear sufficient statistics of Gaussian distributions. Thus, for homoscedastic Gaussian data, the minimum-dimensional sufficient statistic can be achieved by a linear operator. Moreover, if $n > C-1$, almost surely the class means are linearly independent and the rank of $S_H$ is $C-1$.

The homoscedastic formula in (18) reveals that the homoscedastic $S_H$ is proportional to the whitened between-class scatter matrix used by LDA. Thus, those eigenvectors of the two matrices which have the largest corresponding eigenvalues should span the same subspace. Therefore, considering that the solution set of both methods is closed under any non-singular linear transformation, the solution sets of WLSS and LDA coincide in the homoscedastic case [13]. Thus, WLSS can be considered as a heteroscedastic extension of LDA. In fact, considering the theoretical foundation of WLSS, the LDA method can be derived from WLSS.

5. Experimental evaluation

The simulations focus on the comparison of the LDA [26], CLSS [17], Mahalanobis [16], Chernoff [15], AIDA [4], and WLSS FEs. All these linear FEs can be trained using a non-iterative procedure involving SVD or eigendecomposition. Moreover, all of them are directly or indirectly based on the assumption of Gaussian data and depend only on the first two moments of the data [15]. Therefore, they are closely related to the quadratic Gaussian Bayes classifier which is used for classification. Also, it should be noted that Mahalanobis, Chernoff, AIDA, and WLSS can be considered as heteroscedastic extensions of LDA. The simulations are run on both synthetic data and standard pattern recognition data sets.

5.1. Experiments on synthetic data

This section demonstrates that certain FEs provide a MDLSS, and compares the performance of the different methods when the data do not deviate from the Gaussian distribution. Therefore, random synthetic data generated according to randomly selected Gaussian distributions are utilized. The data belong to $C = 6$ Gaussian distributions $\mathcal{N}(\mu_i, \Sigma_i)$ with dimension $n = 50$. The $\mu_i$ values are randomly selected in a unit cube in $\mathbb{R}^n$. The values for $\Sigma_i$ are selected as $F(AA^T + B_i B_i^T)$, where $F$ is a constant scaling factor, $A$ is a matrix with uniform random entries between $-0.5$ and $+0.5$, and $B_i$, $1 \le i \le C$, is a matrix whose entries in the first 20 rows and columns are uniform random values between $-0.5$ and $+0.5$ and whose other entries are set to zero. Therefore, the class covariances differ randomly only in 20 dimensions. Also, 100 testing samples per class are generated and their extracted features are classified by a quadratic Gaussian classifier.

In Fig. 2, the accurately known actual values of $\mu_i$ and $\Sigma_i$ are used for all the feature extraction methods. Hence, the FE and the Bayes classifier are designed based on the actual parameters. Therefore, their performance is compared in the absence of any parameter estimation error. The scaling factor $F$ is set to 1 in Fig. 2(a) and to 100 in Fig. 2(b). In each case, the selection of means and covariances is repeated 100 times, and the average error rate (AER) of the different methods is plotted versus $d$.

Due to the optimality of the quadratic Bayes classifier for Gaussian data, its probability of error never increases with the inclusion of new data features [14]. Therefore, the curves in Fig. 2 are non-increasing. Moreover, if a method provides a sufficient statistic at $d$, the AER at this dimension should almost equal the Bayes error for the original $n$-dimensional data. The CLSS, Chernoff, AIDA, and WLSS methods can extract any necessary features. However, LDA cannot provide more than $\min(C-1, n) = 5$ features. Also, the Mahalanobis method does not provide more than $\min(C(C-1)/2, n) = 15$ features. For these methods with a limited number of available features, a horizontal line without marker is plotted after the last available performance to indicate the infeasible range of $d$.

[Fig. 2. Average error rate of the quadratic classifier with different FEs using the actual $\mu_i$ and $\Sigma_i$ parameters for heteroscedastic Gaussian distributions. (a) $F = 1$ and (b) $F = 100$.]
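The synthetic model described above can be reproduced in spirit with the following sketch. It follows the stated construction ($\mu_i$ in the unit cube, $\Sigma_i = F(AA^T + B_iB_i^T)$ with heteroscedasticity confined to the first 20 dimensions); the function names, the random seed handling, and the equal-prior sampling are our own assumptions.

```python
import numpy as np

def make_synthetic_classes(C=6, n=50, hetero_dims=20, F=1.0, rng=None):
    """Random Gaussian classes whose covariances differ only in the first
    `hetero_dims` dimensions, as in Section 5.1."""
    rng = np.random.default_rng(rng)
    means = rng.uniform(0.0, 1.0, size=(C, n))            # class means in the unit cube
    A = rng.uniform(-0.5, 0.5, size=(n, n))               # shared covariance factor
    covs = []
    for _ in range(C):
        B = np.zeros((n, n))
        B[:hetero_dims, :hetero_dims] = rng.uniform(-0.5, 0.5,
                                                    size=(hetero_dims, hetero_dims))
        covs.append(F * (A @ A.T + B @ B.T))              # Sigma_i = F(AA^T + B_i B_i^T)
    return means, np.asarray(covs)

def sample_class(mean, cov, n_samples, rng=None):
    """Draw i.i.d. samples from one class-conditional Gaussian."""
    rng = np.random.default_rng(rng)
    return rng.multivariate_normal(mean, cov, size=n_samples)
```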
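The classifier applied to the extracted features throughout this section is the quadratic Gaussian (Bayes) classifier; when estimated parameters are used it becomes the plug-in quadratic classifier discussed below. The following sketch shows one straightforward implementation under the class-conditional Gaussian model; the function name and interface are ours.

```python
import numpy as np

def quadratic_gaussian_classify(Y, means, covs, priors):
    """Assign each feature vector in Y (m x d) to the class with the largest
    Gaussian log-posterior; means/covs/priors are per-class parameters in the
    feature space (actual values or plug-in estimates)."""
    scores = []
    for mu, S, p in zip(means, covs, priors):
        diff = Y - mu
        _, logdet = np.linalg.slogdet(S)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        scores.append(np.log(p) - 0.5 * (logdet + maha))
    return np.argmax(np.stack(scores, axis=1), axis=1)
```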
When $d \le C-1 = 5$, LDA prevails in Fig. 2(a) for the smaller covariance scale, while Chernoff, AIDA, and WLSS provide the least AER in Fig. 2(b) for the larger covariance scale. This can be explained as follows. For small covariances, the differences in the class means provide the major discriminatory information that is exploited by LDA. However, for large covariances, the class-conditional distributions significantly overlap; therefore, utilization of the covariance information is crucial in distinguishing the classes.

For $d > C-1$ in Fig. 2(a) and (b), Chernoff, AIDA, and WLSS achieve the best performance, with Chernoff marginally surpassing the other two. The persistent significant gap between the WLSS and CLSS performance can be explained by the inherent bias of the CLSS solution towards $\omega_1$ [18,13]. The CLSS, Chernoff, AIDA, and WLSS methods are limited to $d_{\min} = 25$ and provide the same AER as the original $n$-dimensional data, 0.019 in Fig. 2(a) and 0.038 in (b), at this dimension. This demonstrates the minimum-dimensional sufficiency of these statistics. The minimum dimension of 25 for a linear sufficient statistic in this problem can be verified theoretically from the rank of $S_H$.

The experiments of Fig. 3 include the effect of parameter estimation error on the FE and classifier performance while the data still belong to Gaussian distributions. The experiments are performed on the previously described synthetic data, but the actual mean and covariance values are not known to the classifier and FE. These parameters are estimated, using the ML estimator, based on 70 training samples per class during the training phase of the classification system. Selection of the actual $\mu_i$ and $\Sigma_i$ parameters is repeated 10 times, and for each selection of parameters, the training and testing samples are generated 10 times. In each of the resulting 100 repetitions, the FEs and the plug-in quadratic classifier are trained using the provided training samples, and their AER is calculated over 10 testing samples per class. Finally, the average error rate over the 100 repetitions is plotted versus $d$.

Fig. 3(a) and (b) depict the AER for all the methods when $F = 1$ and $100$, respectively. It should be noted that since the estimated parameters $\hat{\mu}_i$ and $\hat{\Sigma}_i$ are used, CLSS, Chernoff, AIDA, and WLSS cannot determine the value of $d_{\min}$ and their feasible range of $d$ continues up to $n$. Furthermore, for the plug-in classifier, the probability of error might be increasing versus $d$ in some ranges due to the effect of parameter estimation error on both the classifier and the FE.

From Fig. 3(a) and (b), when $d \le C-1 = 5$, Chernoff, AIDA, and WLSS provide the least AER for the larger covariance scale ($F$), while LDA prevails for the smaller covariance scale as before. When $d > C-1$, as seen in Fig. 3(a) and (b), Chernoff, AIDA, and WLSS outperform the other methods, with marginally improving performance in this order. The relative performance of WLSS and the other two methods has generally switched compared to the case with known $\mu_i$ and $\Sigma_i$ values. This can be explained by the higher complexity of Chernoff and AIDA and their reliance on the logarithms and inverses of the class covariances, which are more severely affected by the estimation errors than the covariance estimates themselves. Also, CLSS is based on normalization with respect to the estimated $\omega_1$ parameters, which have higher variance than the estimated average class parameters ($\hat{\bar{\mu}}$ and $\hat{\bar{\Sigma}}$). Therefore, the reliance of CLSS on the estimated $\omega_1$ parameters leads to the relatively low performance of CLSS.

As mentioned before, $d_{\min} = 25$ is the minimum linear sufficient statistic dimension for this problem. However, the AER of Chernoff, AIDA, and WLSS based on the estimated parameters reaches a minimum value at a dimension $d$ lower than $d_{\min}$, as shown in Fig. 3. This phenomenon is caused by the decreased accuracy of the estimated parameters for larger values of $d$. Thus, the additional information of the last features cannot compensate for the additional estimation error incurred by the growth in dimensionality, and the minimum AER occurs at a dimension lower than $d_{\min}$. Furthermore, this minimum AER is generally lower than the AER for the original data, 0.268 in Fig. 3(a) and 0.322 in (b).

[Fig. 3. Average error rate of the quadratic classifier with different FEs using the estimated parameters $\hat{\mu}_i$ and $\hat{\Sigma}_i$ for heteroscedastic Gaussian distributions. (a) $F = 1$ and (b) $F = 100$.]

Table 1. Specifications of the selected classification data sets from the UCI repository: original data dimensionality ($n$), data dimensionality after PCA ($d_{\mathrm{PCA}}$), number of classes ($C$), total number of samples ($N$), average number of samples per class ($\bar{N}_i$), and multivariate heteroscedasticity measure ($J_H$).

Database name                   n    d_PCA   C    N      N̄_i      J_H
(a) Wisconsin breast cancer     9    9       2    683    341.5    318.10
(b) BUPA liver disorder         6    6       2    345    172.5    7.63
(c) Pima Indians diabetes       8    8       2    768    384      43.44
(d) Cleveland heart disease     13   13      5    297    59.4     0.75
(e) Iris plants                 4    4       3    150    50       14.06
(f) Thyroid gland               5    5       3    215    71.7     93.69
(g) Vowel context               10   10      11   990    90       185.99
(h) Image segmentation          19   14      7    2310   330      1266.04
(i) Waveform                    21   21      3    5000   1666.7   1095.92

5.2. Experiments on UCI data sets

The University of California, Irvine (UCI) Machine Learning Repository [27] was initiated in 1987 and consists of online data sets, domain theories, and data generators. The data set repository includes multivariate, univariate, and text data types for various purposes such as classification and regression. We have adopted
several multivariate UCI data sets for classification purposes. Specifications of the selected data sets are given in Table 1. A heteroscedasticity test is applied to every data set to investigate the need for a heteroscedastic approach. For this purpose, a multivariate test [28] designed for data sets with fewer samples than the data dimension is deployed. The resulting scalar measure $J_H$ provides an unbiased estimate of the normalized Frobenius scatter of the covariance matrices. This measure is distributed as $\mathcal{N}(0,1)$ under the null hypothesis of homoscedasticity. Therefore, assuming a significance level of $\alpha = 0.05$, values of $J_H$ in Table 1 exceeding the $(1-\alpha)$ quantile, i.e. 1.64, indicate significant heteroscedasticity.

The experiments of this section compare the existing FEs with our proposed FE when the $\mu_i$ and $\Sigma_i$ parameters need to be estimated and the actual distributions are not Gaussian. The probability of classification error on each data set is estimated as the AER over 100 repetitions based on the rotation method [1]. In each repetition, the available samples of the data set are split into training and testing samples. The training samples are used to calculate ML estimates of $\mu_i$ and $\Sigma_i$. In Figs. 5 and 6, the training set consists of 10% and 90% of the total samples, respectively.

The experimental setup is illustrated in Fig. 4. In the preprocessing stage, the data are projected onto the $d_{\mathrm{PCA}}$-dimensional space $\mathrm{Col}(\hat{\bar{\Sigma}})$ using principal component analysis (PCA) to ensure a non-singular average covariance. Then, the class covariances are slightly regularized toward their non-singular average using a regularization coefficient of 0.001. This step guarantees the non-singularity of $\hat{\Sigma}_i$ as required by the CLSS, AIDA, and Chernoff FEs as well as the quadratic classifier.

[Fig. 4. Experimental setup for UCI experiments.]

Figs. 5 and 6 reveal that feature extraction helps to decrease the AER of the original data, or at least the required dimensionality. However, the effectiveness of heteroscedastic feature extraction becomes evident in data set (g). From Table 1, the competitive advantage of the heteroscedastic methods in data set (g) can be attributed to significant heteroscedasticity ($J_H \gg 1.6$) and the high number of classes ($C = 11$), which makes it challenging to perform classification based on only the class means.

[Fig. 5. Average error rate versus number of features for UCI data sets when 10% of the samples are used for training.]

[Fig. 6. Average error rate versus number of features for UCI data sets when 90% of the samples are used for training.]

The larger training set size in Fig. 6 leads to more accurate parameter estimates. As observed in the synthetic experiments,
performances of Chernoff, AIDA, and WLSS are close to each other in this case. However, in Fig. 5, with less accurate estimation, the WLSS error rate is lower than that of Chernoff or AIDA for data sets (e)–(g). As shown in Table 1 for these three data sets, the average number of samples per class, i.e. $\bar{N}_i = N/C$, is relatively small and simultaneously $J_H \gg 1.6$. In addition, among these three data sets, WLSS achieves a higher performance gain for larger $J_H$. In the other data sets, WLSS consistently outperforms the other heteroscedastic methods, while requiring the lowest computational complexity (ref. Section 4.1).

The approximate time required for offline training of each FE using 90% of the samples of data set (h) is shown in Table 2. The reported times do not include the time for the common parameter estimation step with $O(Nn^2)$ complexity (1.14 ms) and the time for the PCA step (2.31 ms). Each duration is calculated on a system with eight X5355 processors and 8 GB memory, and is averaged over 100 iterations. Note that here the FE input data is of dimension 19, and these durations will be more differentiating for higher dimensions.

Finally, the bias of the CLSS solution toward $\omega_1$ leads to its relatively poor performance on most data sets. In fact, even the higher estimation accuracy in Fig. 6 does not compensate for this inherent bias.
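The preprocessing stage of Fig. 4 (PCA onto the column space of the average covariance, followed by slight regularization of each class covariance toward the average) can be sketched as follows. The regularization formula below is one plausible reading of "regularized toward their non-singular average with coefficient 0.001"; the function name, interface, and eigenvalue threshold are our own assumptions.

```python
import numpy as np

def preprocess(X_train, y_train, reg=0.001):
    """Project onto Col(average covariance) via PCA and regularize class covariances."""
    classes, counts = np.unique(y_train, return_counts=True)
    priors = counts / counts.sum()
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    covs = np.array([np.cov(X_train[y_train == c], rowvar=False, bias=True)
                     for c in classes])
    sigma_bar = np.einsum('i,ijk->jk', priors, covs)       # average covariance

    # Keep only principal directions with non-negligible variance (Col(sigma_bar)).
    w, V = np.linalg.eigh(sigma_bar)
    keep = w > 1e-10 * w.max()
    P = V[:, keep]                                          # n x d_PCA projection

    covs_p = np.array([P.T @ S @ P for S in covs])
    sigma_bar_p = P.T @ sigma_bar @ P
    # Slight regularization of each class covariance toward the average.
    covs_reg = (1 - reg) * covs_p + reg * sigma_bar_p
    means_p = means @ P
    return P, means_p, covs_reg, priors
```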
Table 2. Average time (in milliseconds) required for training different FEs on data set (h). The figures include neither the time for the parameter estimation step, which is the same for all methods, nor the time for the PCA step.

Method              WLSS    Chernoff   AIDA    CLSS    Mahalanobis   LDA
Training time (ms)  1.24    216.86     25.04   1.05    0.87          0.97
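A comparable measurement of training time can be sketched as below. Unlike Table 2, this sketch times the full training call (including parameter estimation); the `wlss_train` function from the earlier sketch and the warm-up/averaging choices are our own assumptions.

```python
import time

def time_training(train_fn, X, y, d, repeats=100):
    """Average wall-clock time (ms) of one FE training call over `repeats` runs."""
    train_fn(X, y, d)                                  # warm-up call
    start = time.perf_counter()
    for _ in range(repeats):
        train_fn(X, y, d)
    return 1000.0 * (time.perf_counter() - start) / repeats

# Example: time_training(wlss_train, X_train, y_train, d=5)
```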
6. Conclusions and remarks

Non-iterative heteroscedastic linear FEs provide practical advantages in three aspects. First, they make use of the discriminatory information in the second order moments of the data in addition to the first order moments utilized by LDA. Second, they alleviate the restriction of $C-1$ on the feature space dimensionality of LDA. Third, their simple formulation based on a few parameters provides acceptable generalizability [15].

This paper proposed a non-iterative heteroscedastic linear FE, called WLSS, based on sufficiency conditions. Simulations on synthetic data demonstrated that WLSS, along with Chernoff and AIDA, significantly outperform the Mahalanobis and CLSS methods. Furthermore, it was illustrated that WLSS achieved even higher standing with respect to Chernoff and AIDA when accurate $\mu_i$ and $\Sigma_i$ parameters are not available. Simulations using real data also confirmed that WLSS provides a consistent improvement in classification accuracy over other FEs.

These FEs can be further compared according to the criteria of Table 3. LDA has also been included for comparison.

Table 3. Comparison of different heteroscedastic linear FEs.

Method        MDLSS   Extension of LDA   Limit on "d"        Multi-class formulation   Tolerance to singular class covariance estimates   Complexity O(.)
WLSS          Yes     Yes                n                   Inherent                  Yes                                                Nn^2 + n^3
Chernoff      Yes     Yes                n                   Pairwise                  No                                                 Nn^2 + C^2 n^3
AIDA          Yes     Yes                n                   Inherent                  No                                                 Nn^2 + C n^3
CLSS          Yes     No                 n                   Inherent                  No                                                 Nn^2 + n^3
Mahalanobis   No      Yes                min{C(C-1)/2, n}    Inherent                  Yes                                                Nn^2 + n^3
LDA           No      -                  min{C-1, n}         Inherent                  Yes                                                Nn^2 + n^3

In the first column of Table 3, Mahalanobis does not result in a MDLSS. WLSS and CLSS are designed for the ideal goal of providing a MDLSS; whereas in the case of Chernoff and AIDA, this property has been determined based on our synthetic simulation results. The second column shows that CLSS does not reduce to the popular LDA in the homoscedastic case. According to the next column, the target dimensionality $d$ for Mahalanobis is further restricted. Chernoff and AIDA typically compete with WLSS in classification performance. However, as shown in Table 3, Chernoff is not inherently multi-class, and provides a sub-optimal pairwise solution for more than two classes. Moreover, both Chernoff and AIDA suffer from a time complexity higher than that of LDA, and this complexity difference is critical for high-dimensional data. Additionally, these two methods cannot be deployed if any of the individual class covariance estimates $\hat{\Sigma}_i$ is singular, and therefore are not applicable in small sample size scenarios.

In conclusion, WLSS features the advantageous properties listed in Table 3. In particular, WLSS approximates a MDLSS, and does not impose any additional computational complexity compared to LDA. Finally, it is noteworthy that WLSS can be further improved through weighting, regularization, and fractional-step dimension reduction.
Appendix A. Proof of Lemma 3.1

By the equivalence of sufficiency and admissibility [14], (1) is equivalent to

$$P(\omega_i\,|\,y) = P(\omega_i\,|\,x), \quad 1 \le i \le C,$$

which is equivalent to $P(\omega_i\,|\,y)/P(\omega_j\,|\,y) = P(\omega_i\,|\,x)/P(\omega_j\,|\,x)$, or, using Bayes' theorem, $f(y\,|\,\omega_i)/f(y\,|\,\omega_j) = f(x\,|\,\omega_i)/f(x\,|\,\omega_j)$, or in terms of the log-likelihoods,

$$\log[f(y\,|\,\omega_i)] - \log[f(x\,|\,\omega_i)] = \log[f(y\,|\,\omega_j)] - \log[f(x\,|\,\omega_j)], \quad 1 \le i < j \le C.$$

Defining $D(T, \omega_i, x) = \log[f(y\,|\,\omega_i)] - \log[f(x\,|\,\omega_i)]$, and using $y = Tx$,

$$D(T, \omega_i, x) = (x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - (x-\mu_i)^T T^T(T\Sigma_i T^T)^{-1}T(x-\mu_i) + \log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right). \qquad (\mathrm{A.1})$$

We need to simplify (A.1). Using the definition of the matrix $Z$,

$$(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) = (x-\mu_i)^T[T^T\ Z^T]\left(\begin{bmatrix} T \\ Z \end{bmatrix}\Sigma_i[T^T\ Z^T]\right)^{-1}\begin{bmatrix} T \\ Z \end{bmatrix}(x-\mu_i),$$

and, using the partitioned matrix inversion formula from [30],

$$\left(\begin{bmatrix} T \\ Z \end{bmatrix}\Sigma_i[T^T\ Z^T]\right)^{-1} = \begin{bmatrix} D_{1(T,\omega_i)} & -D_{2(T,\omega_i)} \\ -D_{2(T,\omega_i)}^T & Q_{(T,\omega_i)}^{-1} \end{bmatrix}, \qquad (\mathrm{A.2})$$

where

$$D_{1(T,\omega_i)} = (T\Sigma_i T^T)^{-1} + (T\Sigma_i T^T)^{-1}T\Sigma_i Z^T Q_{(T,\omega_i)}^{-1} Z\Sigma_i T^T(T\Sigma_i T^T)^{-1},$$
$$D_{2(T,\omega_i)} = (T\Sigma_i T^T)^{-1}T\Sigma_i Z^T Q_{(T,\omega_i)}^{-1},$$
$$Q_{(T,\omega_i)} = Z\Sigma_i Z^T - Z\Sigma_i T^T(T\Sigma_i T^T)^{-1}T\Sigma_i Z^T = Z\Sigma_i^{1/2}\left(I - \Sigma_i^{1/2}T^T(T\Sigma_i T^T)^{-1}T\Sigma_i^{1/2}\right)\Sigma_i^{1/2}Z^T. \qquad (\mathrm{A.3})$$

Define $Q'_{(T,\omega_i)} = I - \Sigma_i^{1/2}T^T(T\Sigma_i T^T)^{-1}T\Sigma_i^{1/2}$. Since $T$ is full row-rank, using the Moore–Penrose pseudoinverse $(\cdot)^{+}$ for a full column-rank matrix [31], $Q'_{(T,\omega_i)} = I - (\Sigma_i^{1/2}T^T)(\Sigma_i^{1/2}T^T)^{+}$. Hence $Q'_{(T,\omega_i)}$ is the operator for orthogonal projection onto $\mathrm{Col}(\Sigma_i^{1/2}T^T)^{\perp}$, the orthogonal complement of $\mathrm{Col}(\Sigma_i^{1/2}T^T)$, or equivalently onto $\mathrm{Ker}(T\Sigma_i^{1/2})$, or, because $TZ^T = 0$, onto $\mathrm{Col}(\Sigma_i^{-1/2}Z^T)$ [31]. Thus, $Q'_{(T,\omega_i)}$ is also an orthogonal projection onto $\mathrm{Col}(\Sigma_i^{-1/2}Z^T)$:

$$Q'_{(T,\omega_i)} = \Sigma_i^{-1/2}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1/2}, \qquad (\mathrm{A.4})$$

which reduces (A.3) to

$$Q_{(T,\omega_i)} = ZZ^T(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T. \qquad (\mathrm{A.5})$$

Using (A.2), (A.4), and (A.5), (A.1) simplifies to

$$D(T, \omega_i, x) = (x-\mu_i)^T\Sigma_i^{-1}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}(Z\Sigma_i^{-1}Z^T)(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1}(x-\mu_i) + \log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right)$$
$$= (x-\mu_i)^T\Sigma_i^{-1}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1}(x-\mu_i) + \log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right).$$

From the partitioned matrix determinant formula [30] (Ref. [13]),

$$\log\!\left(\frac{|\Sigma_i|\,|TT^T|}{|T\Sigma_i T^T|}\right) = \log\left|ZZ^T(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T\right| - \log\left|ZZ^T\right| = \log\left|(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T\right|,$$

which leads to $D(T, \omega_i, x) = (x-\mu_i)^T\Sigma_i^{-1}Z^T(Z\Sigma_i^{-1}Z^T)^{-1}Z\Sigma_i^{-1}(x-\mu_i) + \log|(Z\Sigma_i^{-1}Z^T)^{-1}ZZ^T|$, or (3). Thus, the sufficiency of $T$ is equivalent to the condition that $D(T, \omega_i, x)$ in (3) is independent of "i".
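As a quick numerical sanity check on the algebra above (not part of the original paper), the following snippet verifies that the log-likelihood difference (A.1) coincides with the simplified expression (3) for a random full row-rank $T$ and its orthogonal complement $Z$.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, d = 6, 2
T = rng.standard_normal((d, n))                 # full row-rank operator
Z = null_space(T).T                             # rows span Ker(T), so T Z^T = 0
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                 # SPD class covariance
mu = rng.standard_normal(n)
x = rng.standard_normal(n)

inv = np.linalg.inv
logdet = lambda M: np.linalg.slogdet(M)[1]
dx = x - mu

# D as defined from the log-likelihood difference, Eq. (A.1).
D_A1 = (dx @ inv(Sigma) @ dx - dx @ T.T @ inv(T @ Sigma @ T.T) @ T @ dx
        + logdet(Sigma) + logdet(T @ T.T) - logdet(T @ Sigma @ T.T))

# D in the simplified form of Eq. (3).
Si = inv(Sigma)
core = Si @ Z.T @ inv(Z @ Si @ Z.T) @ Z @ Si
D_3 = dx @ core @ dx + logdet(inv(Z @ Si @ Z.T) @ Z @ Z.T)

print(np.isclose(D_A1, D_3))                    # expected: True
```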
Appendix B. Proof of Lemma 4.1

Equivalence of (b) and (c) is trivial. Therefore, it suffices to prove the equivalence of the following two pairs of conditions.

Equivalence of (a) and (b): Assuming (b), (a) is trivial. Assuming (a), (b) is deduced as follows. For any $1 \le j \le m$,

$$z^T A_j = z^T A_j \sum_{i=1}^{m} a_i \ \ \text{(using (10))} \ = \sum_{i=1}^{m} a_i z^T A_j = \sum_{i=1}^{m} a_i z^T A_i \ \ \text{(based on (a))} \ = z^T \sum_{i=1}^{m} a_i A_i = z^T \ \ \text{(using (11))}.$$

Equivalence of (c) and (d): If (c) is satisfied, (d) follows trivially. Condition (d) can be written as

$$z^T = z^T A_j^{-1} A_i, \quad \forall\, 1 \le i < j \le m. \qquad (\mathrm{B.1})$$

Therefore, if (d) holds, (c) is established as follows. For any $1 \le j \le m$,

$$z^T A_j^{-1} = z^T A_j^{-1} \sum_{i=1}^{m} a_i A_i = \sum_{i=1}^{m} a_i z^T A_j^{-1} A_i \ \ \text{(using (11))} \ = \sum_{i=1}^{m} a_i z^T = z^T \ \ \text{(using (B.1) and (10))}.$$
References

[1] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 4–37.
[2] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, 2000.
[3] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley-Interscience, 2004.
[4] K. Das, Z. Nenadic, Approximate information discriminant analysis: a computationally simple heteroscedastic feature extraction technique, Pattern Recognition 41 (2008) 1565–1574.
[5] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[6] O. Hamsici, A. Martinez, Bayes optimality in linear discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008).
[7] M. Loog, R.P.W. Duin, R. Haeb-Umbach, Multiclass linear dimension reduction by weighted pairwise Fisher criteria, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 762–766.
[8] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data, with application to face recognition, Pattern Recognition 34 (2001) 2067–2070.
[9] N. Kumar, A.G. Andreou, Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Commun. 26 (1998) 283–297.
[10] G. Saon, M. Padmanabhan, Minimum Bayes error feature selection for continuous speech recognition, in: Advances in Neural Information Processing Systems, vol. 13, MIT Press, 2000, pp. 800–806.
[11] L. Rueda, M. Herrera, Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space, Pattern Recognition 41 (2008) 3138–3152.
[12] L. Rueda, B.J. Oommen, C. Henríquez, Multi-class pairwise linear dimensionality reduction using heteroscedastic schemes, Pattern Recognition 43 (2010) 2456–2465.
[13] M.S. Mahanta, Linear Feature Extraction with Emphasis on Face Recognition, Master's Thesis, University of Toronto, 2009.
[14] L. Devroye, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[15] R.P.W. Duin, M. Loog, Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 732–739.
[16] H. Brunzell, J. Eriksson, Feature reduction for classification of multidimensional data, Pattern Recognition 33 (2000) 1741–1748.
[17] J.D. Tubbs, W.A. Coberly, D.M. Young, Linear dimension reduction and Bayes classification with unknown population parameters, Pattern Recognition 15 (1982) 167–172.
[18] M.S. Mahanta, K.N. Plataniotis, Linear feature extraction using sufficient statistic, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 2218–2221.
[19] L. Buturovic, On the minimal dimension of sufficient statistics, IEEE Trans. Inf. Theory 38 (1992) 182–186.
[20] B.C. Peters Jr., R. Redner, H.P. Decell Jr., Characterizations of linear sufficient statistics, Sankhyā Ser. A 40 (1978) 303–309.
[21] H.P. Decell Jr., P.L. Odell, W.A. Coberly, Linear dimension reduction and Bayes classification, Pattern Recognition 13 (1981) 241–243.
[22] S. Leon, Linear Algebra with Applications, second ed., Macmillan, 1986.
[23] R. Lotlikar, R. Kothari, Fractional-step dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 623–627.
[24] J. Friedman, Regularized discriminant analysis, J. Am. Statist. Assoc. 84 (1989) 165–175.
[25] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Regularization studies on LDA for face recognition, in: Proceedings of the International Conference on Image Processing, 2004, pp. 63–66.
[26] R. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[27] A. Asuncion, D. Newman, UCI Machine Learning Repository, 2007.
[28] J.R. Schott, A test for the equality of covariance matrices when the dimension is large relative to the sample sizes, Comput. Statist. Data Anal. 51 (2007) 6535–6542.
[30] D. Harville, Matrix Algebra from a Statistician's Perspective, Springer, 1997.
[31] T. Boullion, P. Odell, Generalized Inverse Matrices, Wiley-Interscience, 1971.
Mohammad Shahin Mahanta received the dual B.Sc. degree in electrical engineering and petroleum engineering from Sharif University of Technology, Tehran, Iran, in 2007, and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2009. He is currently working toward the Ph.D. degree in electrical and computer engineering. His research interests include statistical signal processing and machine learning, with particular emphasis on feature extraction for biomedical signals.
Amirhossein S. Aghaei received the B.Sc. degree in electrical engineering from Iran University of Science and Technology, Tehran, Iran, in 2006, and the M.A.Sc. degree from the Department of Electrical and Computer Engineering at the University of Toronto, in 2008, where he is currently working toward the Ph.D. degree. His research interests include transceiver design for multiuser MIMO channels, as well as statistical signal processing with emphasis on analysis of improper complex biomedical signal.
Konstantinos N. Plataniotis is a Professor with the ECE Department at the University of Toronto. His research interests are multimedia systems, biometrics, image and signal processing, communications systems and pattern recognition. He is a registered professional engineer in Ontario, and the Editor-in-Chief (2009–2011) for the IEEE Signal Processing Letters.
Subbarayan Pasupathy received the B.E. degree in telecommunications from the University of Madras, the M.Tech. degree in electrical engineering from the Indian Institute of Technology, Madras, and the M.Phil. and Ph.D. degrees in engineering and applied science from Yale University. Currently, he is a Professor Emeritus in the Department of Electrical and Computer Engineering at the University of Toronto, where he has been a faculty member since 1972. His research over the last three decades has mainly been in statistical communication theory and signal processing and their applications to digital communications. He has served as the Chairman of the Communications Group and as the Associate Chairman of the Department of Electrical Engineering at the University of Toronto. He is a registered Professional Engineer in the province of Ontario. During 1982–1989 he was an editor for data communications and modulation for the IEEE Transactions on Communications. He has also served as a technical associate editor for IEEE Communications Magazine (1979–1982) and as an associate editor for the Canadian Electrical Engineering Journal (1980–1983). He wrote a regular humor column entitled "Light Traffic" for IEEE Communications Magazine during 1984–98. Dr. S. Pasupathy was elected as a Fellow of the IEEE in 1991 "for contributions to bandwidth efficient coding and modulation schemes in digital communication," was awarded the Canadian Award in Telecommunications in 2003 by the Canadian Society of Information Theory, was elected as a Fellow of the Engineering Society of Canada in 2004 and as a Fellow of the Canadian Academy of Engineering in 2007. He has been identified as a "highly cited researcher" by ISI Web of Knowledge and his name is listed in ISIHighlyCited.com.