Neural Networks 97 (2018) 127–136
Matrix exponential based discriminant locality preserving projections for feature extraction

Gui-Fu Lu*, Yong Wang, Jian Zou, Zhongqun Wang
School of Computer Science and Information, AnHui Polytechnic University, WuHu, AnHui 241000, China
* Corresponding author. E-mail address: [email protected] (G.-F. Lu).
Article info
Article history: Received 18 October 2016; Received in revised form 21 July 2017; Accepted 28 September 2017; Available online 16 October 2017
Keywords: Dimensionality reduction; Discriminant locality preserving projections; Matrix exponential; Linear discriminant analysis
Abstract

Discriminant locality preserving projections (DLPP), a feature extraction algorithm based on manifold learning, has shown good performance in pattern recognition. However, DLPP suffers from the well-known small sample size (SSS) problem, in which the number of samples is smaller than the dimension of the samples. In this paper, we propose a novel matrix exponential based discriminant locality preserving projections (MEDLPP) method. The proposed MEDLPP method can address the SSS problem elegantly since the matrix exponential of a symmetric matrix is always positive definite. Nevertheless, the computational complexity of MEDLPP is high since it needs to solve a large matrix exponential eigenproblem. Therefore, in this paper, we also present an efficient algorithm for solving MEDLPP. Besides, the main idea for solving MEDLPP efficiently is also generalized to other matrix exponential based methods. The experimental results on several data sets demonstrate that the proposed algorithm outperforms many state-of-the-art discriminant analysis methods.

© 2017 Elsevier Ltd. All rights reserved.
1. Introduction

In recent years, high-dimensional data have been widely used in the pattern recognition and computer vision fields (Gu & Sheng, 2016; Gu, Sheng, Tay, Romano, & Li, 2015a; Gu, Sheng, Wang, Ho, Osman, & Li, 2015b; Gu, Sun, & Sheng, 2016; Wen, Shao, Xue, & Fang, 2015; Zhou, Wang, Wu, Yang, & Sun, 2016). Dimensionality reduction, which aims at transforming high-dimensional data into a meaningful low-dimensional representation, has been a key issue in these domains. In the past decades, many dimensionality reduction methods have been developed, and the most widely used are principal component analysis (PCA) (Fukunaga, 1990) and linear discriminant analysis (LDA) (Duda, Hart, & Stork, 2000). PCA is an unsupervised method which aims to find a set of orthonormal basis vectors on which the variance over all the data is maximized. PCA is very popular since it is optimal in terms of minimum reconstruction error. However, PCA may not be suitable for classification problems since it does not use any class information in computing the basis vectors. Different from PCA, LDA is a supervised technique for dimensionality reduction. LDA tries to find the most discriminant projection vectors by maximizing the between-class scatter while simultaneously minimizing the within-class scatter. In many pattern recognition tasks, however, LDA suffers from the
well-known small sample size (SSS) problem, in which the number of samples is smaller than the dimension of the samples. In order to address the SSS problem, many LDA-based methods have been proposed. PCA+LDA, also known as Fisherface (Belhumeour, Hespanha, & Kriegman, 1997), is a two-stage method which first applies PCA to reduce the dimension of the high-dimensional data and then applies LDA for feature extraction. However, one potential problem of Fisherface is that some useful discriminant information may be lost in the PCA step. Chen, Liao, Ko, and Yu (2000) proposed the null space LDA (NLDA) method, which projects the high-dimensional data onto the null space of the within-class scatter matrix, where the between-class scatter is maximized. Direct LDA (Yu & Yang, 2001) first removes the null space of the between-class scatter matrix and then computes the eigenvectors corresponding to the smallest eigenvalues of the reduced within-class scatter matrix. The maximum margin criterion (MMC), proposed by Li, Jiang, and Zhang (2004, 2006) and Song, Zhang, Mei, and Guo (2007), is another method to address the SSS problem encountered by LDA. In MMC, the inverse of the within-class scatter matrix does not need to be computed since the difference between the between-class and within-class scatter matrices is used as the discriminant criterion; the SSS problem is thereby alleviated.

Though PCA and LDA are very useful for feature extraction, they can only preserve the global Euclidean structure and cannot discover the underlying manifold structure hidden in high-dimensional data. To address this problem, some manifold learning methods have been developed, e.g., Isomap (Tenenbaum, Silva, &
Langford, 2000), locally linear embedding (LLE) (Roweis & Saul, 2000), and Laplacian Eigenmap (LE) (Belkin & Niyogi, 2003). Experiments demonstrate that these manifold learning methods are effective in finding the geometrical structure of the underlying manifold. However, it is difficult for these manifold learning methods to evaluate the map for unseen data. He, Yan, Hu, Niyogi, and Zhang (2005b) and He, Yan, Hu, and Zhang (2003) proposed the locality preserving projections (LPP) algorithm, which is a linearization of LE and provides a way to project new data samples. LPP can preserve the locality of data samples by building a graph that incorporates neighborhood information among the data samples. Similarly, neighborhood preserving embedding (NPE) (He, Cai, Yan, & Zhang, 2005a) is a linearization of locally linear embedding (LLE). Although LPP is effective in discovering the manifold structure hidden in high-dimensional data, it is an unsupervised learning technique. Discriminant locality preserving projections (DLPP) (Yu, Teng, & Liu, 2006), which can utilize the class information, was therefore proposed to address this limitation of LPP. To overcome the limitation of LDA that it can only preserve the global geometric structure, Chen, Chang, and Liu (2005) proposed the local discriminant embedding (LDE) method. Yan, Xu, Zhang, Zhang, Yang, and Lin (2007) proposed a general framework, called the graph embedding framework, to unify all the above-mentioned methods. Based on the proposed framework, Yan et al. also proposed a new dimensionality reduction algorithm, called marginal Fisher analysis (MFA).

Similar to LDA, the manifold learning methods also suffer from the SSS problem. Thus, PCA is first used to reduce the dimension of the samples, and then these manifold learning algorithms are applied for feature extraction. Similarly, some useful discriminant information may also be lost in the PCA step. Zhang, Fang, Tang, Shang, and Xu (2010) proposed an exponential discriminant analysis (EDA) method to address the SSS problem of LDA. In EDA, the matrix exponential is used, which makes the within-class and between-class scatter matrices positive definite. The performance of EDA is good since no discriminant information is lost in performing the EDA algorithm. By borrowing the idea of EDA, Wang, Chen, Peng, and Zhou (2011) proposed the exponential LPP (ELPP) method, which aims at addressing the SSS problem of LPP. Similarly, Dornaika and Bosaghzadeh (2013) proposed the exponential local discriminant embedding (ELDE) algorithm to overcome the SSS problem of LDE without discarding any discriminant information. Furthermore, Dornaika and Traboulsi (2017) proposed the exponential semi-supervised discriminant analysis (ESDE) method, which is the semi-supervised extension of ELDE. Recently, Wang, Yan, Yang, Zhou, and Fu (2014) proposed an exponential graph embedding framework for dimensionality reduction. Unfortunately, all of these matrix exponential based methods are computationally intensive, since their computational complexities are O(d^3), where d denotes the dimension of the samples and is usually very large in many applications. More specifically, one has to compute the exponentials of large matrices and solve a large matrix exponential eigenproblem, which gives these matrix exponential based methods a high computational cost. It is therefore urgent to find new procedures for solving matrix exponential based methods efficiently.
In this paper, we propose a novel matrix exponential based discriminant locality preserving projections (MEDLPP) method, which can address the SSS problem of DLPP elegantly. Similar to other matrix exponential based methods, MEDLPP suffers from the limitation that its computational complexity is usually high. We therefore present an efficient algorithm for solving MEDLPP. The main idea for solving MEDLPP efficiently is also generalized to other matrix exponential based methods. The experimental results on several datasets demonstrate that the proposed algorithm outperforms many state-of-the-art discriminant analysis methods.
The main contributions of this paper are as follows. (1) We propose a matrix exponential based discriminant locality preserving projections method to address the SSS problem of DLPP. By using the matrix exponential, the SSS problem is overcome elegantly and no discriminant information is lost in MEDLPP. (2) We present an efficient procedure for solving MEDLPP, which, to the best of our knowledge, is the first efficient procedure for solving a matrix exponential based method. Besides, the main idea for solving MEDLPP efficiently is also generalized to other matrix exponential based methods.

The remainder of the paper is organized as follows. In Section 2, we introduce some related works, i.e., LDA and DLPP. In Section 3, we propose the matrix exponential based discriminant locality preserving projections (MEDLPP) method and the efficient procedure for solving MEDLPP. We report the experimental results in Section 4. Finally, we conclude the paper in Section 5.

2. Related works

2.1. Linear discriminant analysis (LDA)

Given a data matrix X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, where x_i ∈ R^d, for i = 1, ..., n, is the ith training sample, assume that the data matrix X is grouped as X = [X_1, ..., X_k], where X_i ∈ R^{d×n_i} consists of the n_i data samples from the ith class and \sum_{i=1}^{k} n_i = n. Let N_i be the set of column indices that belong to the ith class, i.e., x_j, for j ∈ N_i, belongs to the ith class. In the LDA method, there are three scatter matrices, called the within-class, between-class and total scatter matrices, which are defined as follows (Duda et al., 2000):
S_w = \sum_{i=1}^{k} \sum_{j ∈ N_i} (x_j − m_i)(x_j − m_i)^T    (1)

S_b = \sum_{i=1}^{k} n_i (m_i − m)(m_i − m)^T    (2)

S_t = \sum_{j=1}^{n} (x_j − m)(x_j − m)^T = S_b + S_w    (3)

where m_i = (1/n_i) X_i e_i is the mean of the ith class, with e_i = (1, 1, ..., 1)^T ∈ R^{n_i}, and m = (1/n) X e is the global mean of the dataset, with e = (1, 1, ..., 1)^T ∈ R^n. The objective of LDA is to find an optimal projection matrix G that maximizes the Fisher criterion (Duda et al., 2000):

G = arg max_G tr((G^T S_w G)^{−1} (G^T S_b G))    (4)

where tr(·) is the trace operator. It is well known that the optimal projection G can be obtained by solving the following eigenvalue problem:

S_b G = λ S_w G    (5)

or

S_w^{−1} S_b G = λ G    (6)
As shown in Eq. (6), LDA cannot work when S_w is singular. PCA is usually used to overcome the singularity of S_w.
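For readers who want to experiment with Eqs. (1)–(6), the following Python/NumPy sketch (the function name lda and the use of SciPy's generalized symmetric eigensolver are our own choices) computes the LDA projection under the assumption that S_w is nonsingular; when it is singular, the SSS problem discussed above appears.

    import numpy as np
    from scipy.linalg import eigh

    def lda(X, labels, c):
        """Sketch of LDA, Eqs. (1)-(6). X: d x n data matrix, labels: length-n class labels,
        c: number of projection vectors. Assumes S_w is nonsingular (no SSS problem)."""
        d, n = X.shape
        m = X.mean(axis=1, keepdims=True)
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for cls in np.unique(labels):
            Xi = X[:, labels == cls]
            mi = Xi.mean(axis=1, keepdims=True)
            Sw += (Xi - mi) @ (Xi - mi).T                  # Eq. (1)
            Sb += Xi.shape[1] * (mi - m) @ (mi - m).T      # Eq. (2)
        # Generalized eigenproblem of Eq. (5): S_b g = lambda S_w g
        w, V = eigh(Sb, Sw)
        return V[:, np.argsort(w)[::-1][:c]]               # keep the c largest eigenvalues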
2.2. Discriminant locality preserving projections (DLPP)

Although LDA is an effective method for feature extraction, it can only preserve the global Euclidean structure. DLPP (Yu et al., 2006), which is a manifold learning method, can discover the geometrical structure of the underlying manifold. DLPP tries to find the optimal projection G that maximizes the following objective function:

G = arg max_G tr(G^T M H M^T G) / tr(G^T X L X^T G)    (7)

where M = [m_1, m_2, ..., m_k], L and H are Laplacian matrices, L = D − W, D = diag{D_1, ..., D_k}, D_i is a diagonal matrix whose elements are the row (or column) sums of W^{(i)}, W = diag{W^{(1)}, ..., W^{(k)}}, where W^{(i)} = [W^{(i)}_{jl}] (j, l = 1, 2, ..., n_i; i = 1, ..., k) and W^{(i)}_{jl} = exp(−‖x_j − x_l‖^2 / σ^2) with x_j and x_l both belonging to the ith class; H = E − B, E is a diagonal matrix whose elements are the row (or column) sums of B, B = [B_{ij}] (i, j = 1, 2, ..., k), and B_{ij} = exp(−‖m_i − m_j‖^2 / σ^2). Eq. (7), which is a trace ratio optimization problem and does not have a closed-form solution, can be replaced by the following simpler yet inexact trace form:

G = arg max_G tr((G^T X L X^T G)^{−1} (G^T M H M^T G))    (8)

Eq. (8) has a closed-form solution and the optimal projection G can be obtained by solving the following eigenvalue problem:

M H M^T G = λ X L X^T G    (9)

or

(X L X^T)^{−1} M H M^T G = λ G    (10)

Similar to LDA, DLPP also suffers from the SSS problem, since DLPP cannot work when X L X^T is singular.

3. Matrix exponential based discriminant locality preserving projections

In this section, we first briefly introduce the definition and properties of the matrix exponential (Golub & Loan, 1996; Moler & Loan, 1978, 2003). Then we propose the matrix exponential based discriminant locality preserving projections (MEDLPP) method, which can address the SSS problem of DLPP elegantly. Finally, we present an efficient procedure for solving MEDLPP and give the computational complexity analysis of MEDLPP.

3.1. Matrix exponential

The exponential of an n × n matrix S is defined as follows:

exp(S) = I + S + S^2/2! + ... + S^m/m! + ...    (11)

where I is an identity matrix. The following are some properties of the matrix exponential.

(1) exp(S) is a finite matrix.
(2) exp(S) is a full-rank matrix.
(3) If ST = TS, then exp(S + T) = exp(S) exp(T).
(4) For an arbitrary square matrix S, the inverse of exp(S) exists, i.e.,

(exp(S))^{−1} = exp(−S)    (12)

(5) If T is a nonsingular matrix, then

exp(T^{−1} S T) = T^{−1} exp(S) T    (13)

(6) If v_1, v_2, ..., v_n are eigenvectors of S corresponding to the eigenvalues λ_1, λ_2, ..., λ_n, then v_1, v_2, ..., v_n are also eigenvectors of exp(S), corresponding to the eigenvalues e^{λ_1}, e^{λ_2}, ..., e^{λ_n}.
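As a quick numerical illustration of properties (2) and (6) — taking SciPy's expm as the reference implementation — the following sketch builds a singular symmetric matrix, exponentiates it through its eigen-decomposition, and checks that the result agrees with expm and has full rank:

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))
    S = A @ A.T                              # symmetric, rank <= 3, hence singular

    lam, V = np.linalg.eigh(S)               # eigen-decomposition of the symmetric matrix S
    E1 = expm(S)                             # dense matrix exponential
    E2 = V @ np.diag(np.exp(lam)) @ V.T      # property (6): same eigenvectors, eigenvalues e^lambda

    print(np.allclose(E1, E2))                                    # True
    print(np.linalg.matrix_rank(S), np.linalg.matrix_rank(E1))    # 3, 5: exp(S) is full rank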
3.2. Matrix exponential based discriminant locality preserving projections (MEDLPP)

By computing the exponentials of XLX^T and MHM^T, we can obtain the exponential version of DLPP. That is, the objective function of MEDLPP is

G = arg max_G tr(G^T exp(MHM^T) G) / tr(G^T exp(XLX^T) G)    (14)

Eq. (14) does not have a closed-form solution either. Then, we replace Eq. (14) with the simpler yet inexact trace form:

G = arg max_G tr((G^T exp(XLX^T) G)^{−1} (G^T exp(MHM^T) G))    (15)

It is well known that orthogonality is important to discriminant analysis. Then we enforce the projection G to be orthogonal and, finally, the objective function of MEDLPP is rewritten as

G = arg max_{G^T G = I} tr((G^T exp(XLX^T) G)^{−1} (G^T exp(MHM^T) G))    (16)

Motivated by the orthogonal LDA (OLDA) method (Ye, 2005; Ye & Xiong, 2006), we can obtain the solution to Eq. (16) by first solving the eigenvalue problem

exp(MHM^T) G = λ exp(XLX^T) G    (17)

or

(exp(XLX^T))^{−1} exp(MHM^T) G = λ G    (18)

and then orthogonalizing the transformation matrix G. From the properties of the matrix exponential we know that exp(MHM^T) and exp(XLX^T) are both full-rank matrices. That is, exp(MHM^T) and exp(XLX^T) are still nonsingular even when the SSS problem occurs and MHM^T and XLX^T are both singular. Hence the discriminant information contained in the null space of XLX^T can be extracted by MEDLPP.
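To make Eqs. (14)–(18) concrete, a direct (and deliberately naive) Python/NumPy sketch of the MEDLPP eigenproblem in Eq. (17) is given below; the graph matrices L and H and the matrices X and M are assumed to be constructed as in Section 2.2, and the helper name medlpp_naive is ours. This is the straightforward O(d^3) route, not the efficient procedure developed in Section 3.3.

    import numpy as np
    from scipy.linalg import expm, eigh, orth

    def medlpp_naive(X, M, L, H, c):
        """Direct MEDLPP solution of Eq. (17): exp(M H M^T) g = lambda exp(X L X^T) g.
        X: d x n data matrix, M: d x k class-mean matrix, L: n x n and H: k x k Laplacians,
        c: number of projection vectors. Returns a d x c orthogonal projection matrix."""
        A = expm(M @ H @ M.T)      # exp(M H M^T), d x d, always positive definite
        B = expm(X @ L @ X.T)      # exp(X L X^T), d x d, always positive definite
        # Generalized symmetric-definite eigenproblem A g = lambda B g (Eq. (17))
        w, V = eigh(A, B)
        G = V[:, np.argsort(w)[::-1][:c]]    # eigenvectors of the c largest eigenvalues
        return orth(G)                       # orthogonalize the projection matrix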
3.3. Efficient procedure for solving MEDLPP

Although MEDLPP can address the SSS problem of DLPP without loss of discriminant information, its computational complexity is usually high. Note that the computational complexities of computing exp(MHM^T) and exp(XLX^T) are each O(d^3), and the computational complexity of solving the eigenvalue problem in Eq. (17) is also O(d^3). Then the total computational complexity of MEDLPP is O(d^3), and d is usually very large in many pattern recognition applications, so the computational complexity of MEDLPP is high. In the following, we develop an efficient procedure for solving MEDLPP.

Suppose the economic SVD of X is

X = P Σ V^T    (19)

where P ∈ R^{d×t} is a column orthogonal matrix, Σ ∈ R^{t×t} is a diagonal matrix, V ∈ R^{n×t} is also a column orthogonal matrix and t = rank(X).

Theorem 1. Let the column orthogonal matrix P^⊥ ∈ R^{d×(d−t)} be such that [P P^⊥] is orthogonal. We have

P^⊥ (P^⊥)^T exp(MHM^T) = P^⊥ (P^⊥)^T    (20)

and

P^⊥ (P^⊥)^T exp(XLX^T) = P^⊥ (P^⊥)^T    (21)
Proof. From the definition of the matrix exponential, i.e., Eq. (11), we have

exp(MHM^T) = I + MHM^T + (MHM^T)^2/2! + ... + (MHM^T)^m/m! + ...    (22)

and

exp(XLX^T) = I + XLX^T + (XLX^T)^2/2! + ... + (XLX^T)^m/m! + ...    (23)

Since [P P^⊥] is orthogonal, we have

(P^⊥)^T P = 0    (24)

Then we can obtain

(P^⊥)^T X = (P^⊥)^T P Σ V^T = 0    (25)

and

(P^⊥)^T M = (P^⊥)^T X F = 0    (26)

where F = [f_ij] is an n × k matrix with f_ij = 1/n_j if x_i belongs to class j and f_ij = 0 otherwise. By combining Eqs. (23) and (25), we have

P^⊥ (P^⊥)^T exp(XLX^T) = P^⊥ (P^⊥)^T    (27)

Similarly, by combining Eqs. (22) and (26), we have

P^⊥ (P^⊥)^T exp(MHM^T) = P^⊥ (P^⊥)^T    (28)   □

Theorem 2. Let S_B = P^T exp(MHM^T) P and S_W = P^T exp(XLX^T) P. Suppose u ∈ R^{t×1} is an eigenvector of the matrix S_W^{−1} S_B corresponding to the eigenvalue λ. Then Pu is an eigenvector of the matrix (exp(XLX^T))^{−1} exp(MHM^T) corresponding to the eigenvalue λ.

Proof. Since u is an eigenvector of the matrix S_W^{−1} S_B corresponding to the eigenvalue λ, we have

S_W^{−1} S_B u = λ u    (29)

and

S_B u = λ S_W u    (30)

From Eq. (30), we can obtain

P^T exp(MHM^T) P u = λ P^T exp(XLX^T) P u    (31)

Since (P^⊥)^T P = 0, then by using Theorem 1 we have

P^⊥ (P^⊥)^T exp(MHM^T) P = P^⊥ (P^⊥)^T P = 0    (32)

and

P^⊥ (P^⊥)^T exp(XLX^T) P = P^⊥ (P^⊥)^T P = 0    (33)

Then

P P^T exp(MHM^T) P u = λ P P^T exp(XLX^T) P u    (34)

By combining Eqs. (32)–(34), we have

(P P^T + P^⊥ (P^⊥)^T) exp(MHM^T) P u = λ (P P^T + P^⊥ (P^⊥)^T) exp(XLX^T) P u    (35)

Since [P P^⊥] is a unitary matrix, we have

P P^T + P^⊥ (P^⊥)^T = I    (36)

That is,

exp(MHM^T) P u = λ exp(XLX^T) P u    (37)   □
Note that S_B and S_W are both t × t matrices. Then the sizes of S_B and S_W are much smaller than those of exp(MHM^T) and exp(XLX^T), since we usually have t = rank(X) ≪ d. The computational complexity of MEDLPP can therefore be reduced substantially using Theorem 2. However, the computational complexities of computing exp(MHM^T) and exp(XLX^T) are still O(d^3). In the following, we introduce how to compute exp(MHM^T) and exp(XLX^T) efficiently.

Suppose that the eigen-decomposition of H is

H = Q_H Σ_H Q_H^T    (38)

where Q_H ∈ R^{k×k} is an orthogonal matrix and Σ_H ∈ R^{k×k} is a diagonal matrix. Furthermore, let M_H = M Q_H Σ_H^{0.5} and its economic SVD be

M_H = Q_M Σ_M V_M^T    (39)

where Q_M ∈ R^{d×r_M} is a column orthogonal matrix, Σ_M ∈ R^{r_M×r_M} is a diagonal matrix, V_M ∈ R^{k×r_M} is also a column orthogonal matrix and r_M = rank(M_H). Then we have

M H M^T = Q_M Σ_M^2 Q_M^T    (40)

Let the column orthogonal matrix Q_M^⊥ ∈ R^{d×(d−r_M)} be such that [Q_M Q_M^⊥] is orthogonal. Then we have

exp(MHM^T) = I + Q_M Σ_M^2 Q_M^T + ... + Q_M Σ_M^{2m} Q_M^T / m! + ...
  = [Q_M Q_M^⊥] diag(I + Σ_M^2 + ... + Σ_M^{2m}/m! + ..., I_{(d−r_M)×(d−r_M)}) [Q_M Q_M^⊥]^T
  = [Q_M Q_M^⊥] diag(exp(Σ_M^2), I_{(d−r_M)×(d−r_M)}) [Q_M Q_M^⊥]^T
  = Q_M exp(Σ_M^2) Q_M^T + Q_M^⊥ (Q_M^⊥)^T
  = Q_M exp(Σ_M^2) Q_M^T + I − Q_M Q_M^T    (41)

where diag(·, ·) denotes a block-diagonal matrix and I_{(d−r_M)×(d−r_M)} is the (d−r_M) × (d−r_M) identity matrix. From Eq. (41), we have

P^T exp(MHM^T) P = (P^T Q_M) exp(Σ_M^2)(Q_M^T P) + I − (P^T Q_M)(Q_M^T P)    (42)
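As a sanity check on Eqs. (38)–(42), the sketch below (with a small random M and a hypothetical positive semi-definite H) computes exp(MHM^T) through the r_M × r_M factor of Eq. (41) only, and compares it with the dense expm result; the 0.1 scaling is chosen merely to keep the exponentials in a moderate numerical range.

    import numpy as np
    from scipy.linalg import expm, svd

    rng = np.random.default_rng(1)
    d, k = 50, 4
    M = 0.1 * rng.standard_normal((d, k))                   # hypothetical d x k "mean" matrix
    B = rng.standard_normal((k, k)); H = B @ B.T            # any symmetric PSD k x k matrix

    lamH, QH = np.linalg.eigh(H)
    MH = M @ QH @ np.diag(np.sqrt(np.clip(lamH, 0, None)))  # M_H = M Q_H Sigma_H^{0.5}
    QM, sM, _ = svd(MH, full_matrices=False)                # economic SVD, Eq. (39)
    rM = np.sum(sM > 1e-10); QM, sM = QM[:, :rM], sM[:rM]

    # Eq. (41): exp(M H M^T) = Q_M exp(Sigma_M^2) Q_M^T + I - Q_M Q_M^T
    E_lowrank = QM @ np.diag(np.exp(sM**2)) @ QM.T + np.eye(d) - QM @ QM.T
    print(np.allclose(E_lowrank, expm(M @ H @ M.T)))        # True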
Similarly, suppose that the eigen-decomposition of L is

L = Q_L Σ_L Q_L^T    (43)

where Q_L ∈ R^{n×n} is an orthogonal matrix and Σ_L ∈ R^{n×n} is a diagonal matrix. Furthermore, let X_L = Σ V^T Q_L Σ_L^{0.5} and its economic SVD be

X_L = Q_X Σ_X V_X^T    (44)

where Q_X ∈ R^{t×r_X} is a column orthogonal matrix, Σ_X ∈ R^{r_X×r_X} is a diagonal matrix, V_X ∈ R^{n×r_X} is also a column orthogonal matrix and r_X = rank(X_L). Then we have

X L X^T = P Q_X Σ_X^2 Q_X^T P^T    (45)

Let the column orthogonal matrix P_X^⊥ ∈ R^{d×(d−r_X)} be such that [P Q_X  P_X^⊥] is orthogonal. Then we have

exp(XLX^T) = I + P Q_X Σ_X^2 Q_X^T P^T + ... + P Q_X Σ_X^{2m} Q_X^T P^T / m! + ...
  = [P Q_X  P_X^⊥] diag(I + Σ_X^2 + ... + Σ_X^{2m}/m! + ..., I_{(d−r_X)×(d−r_X)}) [P Q_X  P_X^⊥]^T
  = [P Q_X  P_X^⊥] diag(exp(Σ_X^2), I_{(d−r_X)×(d−r_X)}) [P Q_X  P_X^⊥]^T
  = P Q_X exp(Σ_X^2) Q_X^T P^T + P_X^⊥ (P_X^⊥)^T
  = P Q_X exp(Σ_X^2) Q_X^T P^T + I − P Q_X Q_X^T P^T    (46)

where diag(·, ·) denotes a block-diagonal matrix and I_{(d−r_X)×(d−r_X)} is the (d−r_X) × (d−r_X) identity matrix. From Eq. (46) we have

P^T exp(XLX^T) P = Q_X exp(Σ_X^2) Q_X^T + I − Q_X Q_X^T    (47)

Now the procedure for solving MEDLPP is summarized in Algorithm 1.

Algorithm 1: MEDLPP
Input: Samples X.
Output: Projection matrix G.
1: Construct the matrices H, L and M.
2: Calculate the economic SVD of X as X = P Σ V^T and the eigen-decomposition of H as H = Q_H Σ_H Q_H^T.
3: Compute the matrix M_H = M Q_H Σ_H^{0.5} and its economic SVD as M_H = Q_M Σ_M V_M^T.
4: Compute the eigen-decomposition of L as L = Q_L Σ_L Q_L^T.
5: Calculate the matrix X_L = Σ V^T Q_L Σ_L^{0.5} and its economic SVD as X_L = Q_X Σ_X V_X^T.
6: Compute the matrices S_W = Q_X exp(Σ_X^2) Q_X^T + I − Q_X Q_X^T and S_B = (P^T Q_M) exp(Σ_M^2)(Q_M^T P) + I − (P^T Q_M)(Q_M^T P).
7: Solve the eigenvalue problem S_B u = λ S_W u. Let λ_1 ≥ λ_2 ≥ ... ≥ λ_c be the c largest eigenvalues and u_1, ..., u_c be the associated eigenvectors.
8: Let G = P U, where U = [u_1, ..., u_c], and then orthogonalize the projection matrix: G = orthogonalize(G).
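A compact NumPy sketch of Algorithm 1 is given below. It assumes that H, L and M have already been constructed as in Section 2.2; the helper name medlpp_fast and the rank tolerance are our own choices. Following Theorem 2, only the small t × t eigenproblem S_B u = λ S_W u is solved and the eigenvectors are mapped back through P.

    import numpy as np
    from scipy.linalg import svd, eigh, orth

    def medlpp_fast(X, M, L, H, c, tol=1e-10):
        """Efficient MEDLPP (Algorithm 1). X: d x n, M: d x k, L: n x n, H: k x k, c: output dim."""
        # Step 2: economic SVD of X and eigen-decomposition of H
        P, s, Vt = svd(X, full_matrices=False)
        t = np.sum(s > tol); P, s, Vt = P[:, :t], s[:t], Vt[:t, :]
        lamH, QH = np.linalg.eigh(H)
        # Step 3: M_H = M Q_H Sigma_H^{0.5} and its economic SVD
        MH = M @ QH @ np.diag(np.sqrt(np.clip(lamH, 0, None)))
        QM, sM, _ = svd(MH, full_matrices=False)
        rM = np.sum(sM > tol); QM, sM = QM[:, :rM], sM[:rM]
        # Steps 4-5: eigen-decomposition of L and economic SVD of X_L = Sigma V^T Q_L Sigma_L^{0.5}
        lamL, QL = np.linalg.eigh(L)
        XL = np.diag(s) @ Vt @ QL @ np.diag(np.sqrt(np.clip(lamL, 0, None)))
        QX, sX, _ = svd(XL, full_matrices=False)
        rX = np.sum(sX > tol); QX, sX = QX[:, :rX], sX[:rX]
        # Step 6: S_W = P^T exp(XLX^T) P (Eq. 47) and S_B = P^T exp(MHM^T) P (Eq. 42), both t x t
        SW = QX @ np.diag(np.exp(sX**2)) @ QX.T + np.eye(t) - QX @ QX.T
        PQ = P.T @ QM
        SB = PQ @ np.diag(np.exp(sM**2)) @ PQ.T + np.eye(t) - PQ @ PQ.T
        # Step 7: solve S_B u = lambda S_W u and keep the c largest eigenvalues
        w, U = eigh(SB, SW)
        U = U[:, np.argsort(w)[::-1][:c]]
        # Step 8: map back and orthogonalize G = P U
        return orth(P @ U)

Note that only thin and t × t matrices are formed, which is where the O(dn^2) complexity discussed in the next subsection comes from.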
3.4. Computational complexity analysis for MEDLPP

In this subsection, we analyze the computational complexity of MEDLPP. The computational complexity of Step 1 in Algorithm 1 is O(n^3). The computational complexities of the economic SVD and the eigen-decomposition in Step 2 are O(dn^2) and O(k^3), respectively. The matrix multiplication and economic SVD in Step 3 take O(dn^2). The computational complexity of Step 4 is O(n^3). The matrix multiplication and economic SVD in Step 5 both take O(n^3). The computational complexities of the matrix multiplications in Step 6 are O(n^3) and O(dn^2), respectively. Solving the eigenvalue problem in Step 7 takes O(n^3). Finally, the matrix multiplication and the orthogonalization in Step 8 take O(dnc) and O(dc^2), respectively. Then the total computational complexity of Algorithm 1 is O(dn^2).

3.5. Extensions to other matrix exponential based methods

In this subsection, we generalize the main idea for solving MEDLPP to other matrix exponential based methods. Yan et al. (2007) proposed a general framework, called the graph embedding framework, to unify the existing feature extraction methods. The objective function of graph embedding can be expressed as

G = arg min_G tr((G^T X D X^T G)^{−1} (G^T X L X^T G))    (48)

With different choices of L and D (Yan et al., 2007), Eq. (48) can be the objective function of LDA, LPP and NPE. To address the SSS problem, Wang et al. (2014) proposed an exponential graph embedding framework for dimensionality reduction, the objective function of which is

G = arg min_G tr((G^T exp(XDX^T) G)^{−1} (G^T exp(XLX^T) G))    (49)

which can be solved by the following eigenvalue problem:

(exp(XDX^T))^{−1} exp(XLX^T) G = λ G    (50)

However, the computational complexity of solving Eq. (50) is O(d^3), and then the computational cost is high. Here, we generalize the main idea for solving MEDLPP to solve Eq. (49). From Eq. (25), we can obtain

P^⊥ (P^⊥)^T exp(XDX^T) = P^⊥ (P^⊥)^T    (51)

Then we have the following theorem.

Theorem 3. Let S_D = P^T exp(XDX^T) P and S_W = P^T exp(XLX^T) P. Suppose u ∈ R^{t×1} is an eigenvector of the matrix S_D^{−1} S_W corresponding to the eigenvalue λ. Then Pu is an eigenvector of the matrix (exp(XDX^T))^{−1} exp(XLX^T) corresponding to the eigenvalue λ.

The proof of Theorem 3 is similar to that of Theorem 2, and we omit it. Eq. (47) provides the efficient computation of P^T exp(XLX^T) P. In the following, we introduce how to compute P^T exp(XDX^T) P efficiently. Let X_D = Σ V^T D^{0.5} and its economic SVD be

X_D = Q_D Σ_D V_D^T    (52)

where Q_D ∈ R^{t×r_D} is a column orthogonal matrix, Σ_D ∈ R^{r_D×r_D} is a diagonal matrix, V_D ∈ R^{n×r_D} is also a column orthogonal matrix and r_D = rank(X_D). Then we have

X D X^T = P Q_D Σ_D^2 Q_D^T P^T    (53)
Let the column orthogonal matrix P_D^⊥ ∈ R^{d×(d−r_D)} be such that [P Q_D  P_D^⊥] is orthogonal. Then we have

exp(XDX^T) = I + P Q_D Σ_D^2 Q_D^T P^T + ... + P Q_D Σ_D^{2m} Q_D^T P^T / m! + ...
  = [P Q_D  P_D^⊥] diag(I + Σ_D^2 + ... + Σ_D^{2m}/m! + ..., I_{(d−r_D)×(d−r_D)}) [P Q_D  P_D^⊥]^T
  = [P Q_D  P_D^⊥] diag(exp(Σ_D^2), I_{(d−r_D)×(d−r_D)}) [P Q_D  P_D^⊥]^T
  = P Q_D exp(Σ_D^2) Q_D^T P^T + P_D^⊥ (P_D^⊥)^T
  = P Q_D exp(Σ_D^2) Q_D^T P^T + I − P Q_D Q_D^T P^T    (54)

where diag(·, ·) denotes a block-diagonal matrix and I_{(d−r_D)×(d−r_D)} is the (d−r_D) × (d−r_D) identity matrix. From Eq. (54) we have

P^T exp(XDX^T) P = Q_D exp(Σ_D^2) Q_D^T + I − Q_D Q_D^T    (55)

By using Eqs. (46) and (55), we can solve the eigenvectors of S_D^{−1} S_W efficiently. Then we can use Theorem 3 to compute the final projection vectors efficiently. Similarly, the computational complexity of this procedure is also O(dn^2).
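To illustrate this extension, the sketch below (the function name exp_graph_embedding_small is ours) computes the two t × t matrices S_D = P^T exp(XDX^T)P and S_W = P^T exp(XLX^T)P of Theorem 3 from Eqs. (47) and (55) without ever forming a d × d matrix; both D and L are assumed to be symmetric positive semi-definite, as is the case for the graph embedding instances mentioned above.

    import numpy as np
    from scipy.linalg import svd

    def exp_graph_embedding_small(X, D, L, tol=1e-10):
        """Return P, S_D = P^T exp(XDX^T)P and S_W = P^T exp(XLX^T)P (Eqs. (47) and (55))."""
        P, s, Vt = svd(X, full_matrices=False)
        t = np.sum(s > tol); P, s, Vt = P[:, :t], s[:t], Vt[:t, :]

        def small_exp(A):
            # For a PSD n x n matrix A, compute P^T exp(X A X^T) P via X_A = Sigma V^T Q_A Lambda_A^{0.5}
            lam, Q = np.linalg.eigh(A)
            XA = np.diag(s) @ Vt @ Q @ np.diag(np.sqrt(np.clip(lam, 0, None)))
            QA, sA, _ = svd(XA, full_matrices=False)
            r = np.sum(sA > tol); QA, sA = QA[:, :r], sA[:r]
            return QA @ np.diag(np.exp(sA**2)) @ QA.T + np.eye(t) - QA @ QA.T

        return P, small_exp(D), small_exp(L)

By Theorem 3, the eigenvectors u of S_D^{−1} S_W, mapped back as Pu, are eigenvectors of (exp(XDX^T))^{−1} exp(XLX^T); which eigenvalues are retained (largest or smallest) depends on the criterion of the particular method instantiated by L and D.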
4. Experimental results

In this section, we compare the performance of our proposed MEDLPP with LDA, DLPP and EDA on both synthetic and real data. For the synthetic data, the projection axes and the embedded data are visualized. For the real data, the recognition accuracies of the different methods on several public image databases are evaluated. In the experiments, we use the nearest neighbor classifier (1-NN) for classification. The programming environment is MATLAB 2015.

4.1. Experiments on synthetic datasets

In the first experiment, we evaluate the proposed MEDLPP method on a 3-D synthetic dataset, which is composed of two classes. Each class contains 300 points. The first class is generated from a single Gaussian with zero mean and 0.5I covariance. The second class, which is a Gaussian mixture, consists of three components. Each component has 100 points with 0.5I covariance and mean [1, 4, 0], [2√3, −2, 0] and [−2√3, 2, 0], respectively. Fig. 1(a) shows the dataset and Fig. 1(b)–(e) show the embedded results of the 3-D data using the different algorithms. Note that LDA can only project the 3-D data into a 1-D space since rank(S_b) is equal to 1 for two-class data, where rank(·) denotes the rank of a matrix. As can be seen from Fig. 1, the proposed MEDLPP provides the projected data with the best discrimination.
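For reference, the synthetic dataset described above can be generated with a few lines of NumPy (the random seed and the exact sampling calls are our own choices):

    import numpy as np

    rng = np.random.default_rng(42)
    cov = 0.5 * np.eye(3)

    # Class 1: a single Gaussian with zero mean and 0.5*I covariance (300 points)
    class1 = rng.multivariate_normal(np.zeros(3), cov, size=300)

    # Class 2: a mixture of three Gaussians, 100 points each, with 0.5*I covariance
    means = np.array([[1.0, 4.0, 0.0],
                      [2.0 * np.sqrt(3.0), -2.0, 0.0],
                      [-2.0 * np.sqrt(3.0), 2.0, 0.0]])
    class2 = np.vstack([rng.multivariate_normal(m, cov, size=100) for m in means])

    X = np.vstack([class1, class2]).T            # 3 x 600 data matrix, as in Section 2.1
    labels = np.array([0] * 300 + [1] * 300)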
4.2. Experiments on some image datasets

In the second experiment, we evaluate the performance of MEDLPP on four public image databases.

ORL face database: The ORL face database consists of a total of 400 face images of 40 people (10 samples per person). For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). In our experiments, each image in the ORL database was manually cropped and resized to 32 × 32. Some example images of one person are shown in Fig. 2.

Yale face database: The Yale face database contains 165 gray-scale images of 15 individuals; each individual has 11 images. The images demonstrate variations in lighting condition and facial expression (normal, happy, sad, sleepy, surprised, and wink). In our experiments, each image in the Yale database was resized to 32 × 32. Fig. 3 shows sample images of one person.

CMU PIE face database: The CMU PIE face database contains 68 individuals with 41,368 face images as a whole. The face images were captured under varying pose, illumination, and expression. In our experiments, a subset (C29) of the CMU PIE face database, which contains 1632 images of 68 individuals (each individual has twenty-four images), is used. The C29 subset involves variations in illumination, facial expression and pose. All of these face images are aligned based on eye coordinates and cropped to 32 × 32. Fig. 4 shows the sample images of one person.

COIL image database: The COIL image database contains 1440 images of 20 objects. For each object, 72 images were captured with a black background from varying angles; the moving interval of the camera is five degrees. Each image is down-sampled to 32 × 32 in our experiment. Fig. 5 shows some sample images in the COIL database.

Table 1 reports the important statistics of ORL, Yale, PIE and COIL, including the number of samples, the dimensionality of the features and the number of classes.

Table 1
Statistics of the four databases.
Datasets    Number of samples    Dimensionality    Number of classes
ORL         400                  1024              40
Yale        165                  1024              11
PIE         1632                 4096              68
COIL        1440                 1024              20

We randomly select i (i = 3, 4 for the ORL database, i = 4, 5 for the Yale and PIE databases, and i = 5 for COIL) samples of each individual for training, and the rest for testing. For each given i, we perform 10 runs of randomly choosing the training set and calculate the average recognition rates as well as the standard deviations. Tables 2–5 present the maximal average recognition rates and standard deviations for each method. Figs. 6–9 illustrate the plots of recognition rate vs. the dimension of the reduced space for the different methods. In Tables 6–9, we report the cost time of the different methods on the ORL, Yale, PIE and COIL databases. Note that EDA(fast) is the efficient implementation of EDA, implemented using the idea in Section 3.5.
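A sketch of this evaluation protocol is given below, using scikit-learn's 1-NN classifier as a stand-in for the MATLAB implementation used in the paper; the optional project argument is a placeholder for any of the compared feature extraction routines (e.g., the hypothetical medlpp_fast sketched after Algorithm 1).

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate(features, labels, n_train, n_runs=10, project=None, seed=0):
        """Randomly pick n_train samples per class for training, classify the rest with 1-NN,
        and report mean and std of the accuracy over n_runs random splits.
        features: n x d array (rows are samples); project: optional callable that returns a
        d x c projection matrix from the training data; raw features are used when it is None."""
        rng = np.random.default_rng(seed)
        accs = []
        for _ in range(n_runs):
            train_idx, test_idx = [], []
            for cls in np.unique(labels):
                idx = rng.permutation(np.where(labels == cls)[0])
                train_idx.extend(idx[:n_train]); test_idx.extend(idx[n_train:])
            Xtr, ytr = features[train_idx], labels[train_idx]
            Xte, yte = features[test_idx], labels[test_idx]
            if project is not None:
                G = project(Xtr, ytr)            # d x c projection learned on the training set
                Xtr, Xte = Xtr @ G, Xte @ G
            knn = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
            accs.append(knn.score(Xte, yte))
        return float(np.mean(accs)), float(np.std(accs))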
Fig. 1. Projection results using different methods: (a) original 3-D data; (b) one-dimensional projection using LDA; (c) two-dimensional projection using EDA; (d) one-dimensional projection using DLPP; (e) two-dimensional projection using MEDLPP.
Fig. 2. Images of one person in ORL.
Fig. 3. Images of one person in Yale.
Fig. 4. Images of one person in PIE.
Fig. 5. Some sample images in COIL.

Fig. 6. Recognition rate vs. dimension of reduced space on the ORL database (4 Train).

Table 2
Recognition accuracy (%) on ORL (mean ± std).
Size    LDA           EDA           DLPP          MEDLPP
3       84.6 ± 2.3    88.5 ± 1.9    85.8 ± 1.9    91.9 ± 2.0
4       90.0 ± 1.3    93.1 ± 1.4    91.1 ± 1.9    94.9 ± 1.8

Table 3
Recognition accuracy (%) on Yale (mean ± std).
Size    LDA           EDA           DLPP          MEDLPP
4       71.7 ± 4.2    74.9 ± 4.3    72.2 ± 4.9    77.1 ± 4.0
5       76.7 ± 3.5    81.8 ± 3.2    79.4 ± 2.6    84.3 ± 2.7

Table 4
Recognition accuracy (%) on PIE (mean ± std).
Size    LDA           EDA           DLPP          MEDLPP
4       86.4 ± 1.0    89.3 ± 1.1    87.4 ± 1.0    91.2 ± 0.8
5       88.5 ± 1.3    91.6 ± 1.0    90.0 ± 1.0    93.3 ± 0.9

Table 5
Recognition accuracy (%) on COIL (mean ± std).
Size    LDA           EDA           DLPP          MEDLPP
5       78.0 ± 1.3    84.5 ± 1.6    81.8 ± 0.9    86.1 ± 1.2

Table 6
The cost time of LDA, EDA, DLPP and MEDLPP on ORL (seconds).
Size    LDA       EDA        EDA(fast)    DLPP      MEDLPP
3       0.0052    1.0945     0.0197       0.0141    0.0243
4       0.0093    1.1724     0.0386       0.0198    0.0432

Table 7
The cost time of LDA, EDA, DLPP and MEDLPP on Yale (seconds).
Size    LDA       EDA        EDA(fast)    DLPP      MEDLPP
4       0.0029    1.0554     0.0068       0.0063    0.0089
5       0.0035    1.1091     0.0090       0.0083    0.0115

Table 8
The cost time of LDA, EDA, DLPP and MEDLPP on PIE (seconds).
Size    LDA       EDA        EDA(fast)    DLPP      MEDLPP
4       0.1117    72.1716    0.2391       0.1407    0.2858
5       0.1477    73.6623    0.3727       0.1912    0.4529
Table 9
The cost time of LDA, EDA, DLPP and MEDLPP on COIL (seconds).
Size    LDA       EDA       EDA(fast)    DLPP      MEDLPP
5       0.0094    1.6819    0.0171       0.0144    0.0215

From the above experimental results, we can make the following observations:

(1) Generally, the performances of the matrix exponential based methods, i.e., EDA and MEDLPP, are superior to those of the conventional methods, i.e., LDA and DLPP. The reason
may be that, to overcome the SSS problem, LDA and DLPP may lose some useful discriminant information, whereas EDA and MEDLPP do not lose any discriminant information.

(2) The highest recognition rates of DLPP and MEDLPP on the four image databases are higher than those of LDA and EDA, respectively. The reason may be that the manifold structure information hidden in the data is used in the DLPP and MEDLPP methods. In Fig. 8, however, when the dimension varies from 0 to 100, the recognition rates of MEDLPP are lower than those of the other methods.
(3) The proposed MEDLPP method obtains the best performance in all the experiments, which means that MEDLPP can overcome the SSS problem of DLPP well.

(4) Note that the computational complexities of MEDLPP and EDA(fast) are both O(dn^2), while the computational complexity of EDA is O(d^3). From Tables 6–9 we can find that the cost times of MEDLPP and EDA(fast) are indeed much smaller than that of EDA, which experimentally verifies that our proposed efficient algorithm for solving MEDLPP, and its extensions to other matrix exponential based methods, can reduce the computational complexity of matrix exponential based methods dramatically.

Fig. 7. Recognition rate vs. dimension of reduced space on the Yale database (4 Train).
Fig. 8. Recognition rate vs. dimension of reduced space on the PIE database (4 Train).
Fig. 9. Recognition rate vs. dimension of reduced space on the COIL database (5 Train).

5. Conclusions

In this paper, we propose a novel matrix exponential based discriminant locality preserving projections (MEDLPP) method to overcome the SSS problem encountered by discriminant locality preserving projections (DLPP). By using the matrix exponential, MEDLPP can address the SSS problem elegantly since the matrix exponential of a symmetric matrix is always positive definite. Besides, an efficient procedure for solving MEDLPP is proposed, which, to the best of our knowledge, is the first efficient procedure for solving a matrix exponential based method. The main idea for solving MEDLPP efficiently is also generalized to other matrix exponential based methods. The experimental results on several data sets demonstrate that the proposed algorithm outperforms some state-of-the-art discriminant analysis methods.

Acknowledgments

This research is supported by the NSFC of China (Nos. 61572033 and 71371012), the Natural Science Foundation of the Education Department of Anhui Province of China (No. KJ2015ZD08), and the Social Science and Humanity Foundation of the Ministry of Education of China (No. 13YJA630098). The authors would like to thank the anonymous reviewers and the editor for their helpful comments and suggestions.

References
Belhumeour, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 711–720.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.
Chen, H. T., Chang, H. W., & Liu, T. L. (2005). Local discriminant embedding and its variants. In Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 2 (pp. 846–853).
Chen, L. F., Liao, H. Y. M., Ko, M. T., & Yu, G. J. (2000). A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33, 1713–1726.
Dornaika, F., & Bosaghzadeh, A. (2013). Exponential local discriminant embedding and its application to face recognition. IEEE Transactions on Cybernetics, 43, 921–934.
Dornaika, F., & Traboulsi, Y. E. (2017). Matrix exponential based semi-supervised discriminant embedding for image classification. Pattern Recognition, 61, 92–103.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: John Wiley & Sons.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). Boston, USA: Academic Press.
Golub, G. H., & Loan, C. F. V. (1996). Matrix computations (3rd ed.). Baltimore, Maryland: The Johns Hopkins University Press.
Gu, B., & Sheng, V. S. (2016). A robust regularization path algorithm for ν-support vector classification. IEEE Transactions on Neural Networks and Learning Systems.
Gu, B., Sheng, V. S., Tay, K. Y., Romano, W., & Li, S. (2015). Incremental support vector learning for ordinal regression. IEEE Transactions on Neural Networks and Learning Systems, 26, 1403–1416.
Gu, B., Sheng, V. S., Wang, Z., Ho, D., Osman, S., & Li, S. (2015). Incremental learning for ν-support vector regression. Neural Networks, 67, 140–150.
Gu, B., Sun, X., & Sheng, V. S. (2016). Structural minimax probability machine. IEEE Transactions on Neural Networks and Learning Systems.
He, X., Cai, D., Yan, S., & Zhang, H. (2005). Neighborhood preserving embedding. In Proceedings of the international conference on computer vision (pp. 1208–1213).
He, X., Yan, S., Hu, Y., Niyogi, P., & Zhang, H. (2005). Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 328–340.
He, X., Yan, S., Hu, Y., & Zhang, H. (2003). Learning a locality preserving subspace for visual recognition. In Proceedings of the ninth international conference on computer vision (pp. 385–392).
Li, H., Jiang, T., & Zhang, K. (2004). Efficient and robust feature extraction by maximum margin criterion. In Advances in neural information processing systems: Vol. 16 (pp. 97–104). Cambridge, MA: MIT Press.
Li, H., Jiang, T., & Zhang, K. (2006). Efficient and robust feature extraction by maximum margin criterion. IEEE Transactions on Neural Networks, 17, 1157–1165.
Moler, C., & Loan, C. V. (1978). Nineteen dubious ways to compute the exponential of a matrix. SIAM Review, 20, 801–836.
Moler, C., & Loan, C. V. (2003). Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45, 3–49.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
Song, F., Zhang, D., Mei, D., & Guo, Z. (2007). A multiple maximum scatter difference discriminant criterion for facial feature extraction. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37, 1599–1606.
Tenenbaum, J. B., Silva, V. D., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.
Wang, S.-J., Chen, H.-L., Peng, X.-J., & Zhou, C.-G. (2011). Exponential locality preserving projections for small sample size problem. Neurocomputing, 74, 3654–3662.
Wang, S.-J., Yan, S., Yang, J., Zhou, C.-G., & Fu, X. (2014). A general exponential framework for dimensionality reduction. IEEE Transactions on Image Processing, 23, 920–930.
Wen, X., Shao, L., Xue, Y., & Fang, W. (2015). A rapid learning algorithm for vehicle classification. Information Sciences, 295, 395–406.
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 40–51.
Ye, J. (2005). Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research (JMLR), 6, 483–502.
Ye, J., & Xiong, T. (2006). Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. Journal of Machine Learning Research (JMLR), 7, 1183–1204.
Yu, H., & Yang, J. (2001). A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34, 2067–2070.
Yu, W., Teng, X., & Liu, C. (2006). Face recognition using discriminant locality preserving projections. Image and Vision Computing, 24, 239–248.
Zhang, T., Fang, B., Tang, Y. Y., Shang, Z., & Xu, B. (2010). Generalized discriminant analysis: A matrix exponential approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40, 186–197.
Zhou, Z., Wang, Y., Wu, Q. M. J., Yang, C.-N., & Sun, X. (2016). Effective and efficient global context verification for image copy detection. IEEE Transactions on Information Forensics and Security.